Re: pristine store design

From: Neels J Hofmeyr <neels_at_elego.de>
Date: Wed, 17 Feb 2010 13:28:46 +0100

Thanks for the numerous replies! I'd like the design document to soak them
all up, and to that end have a few questions (misunderstandings?) to clarify.

Verification upon using a pristine.
Which of these are/should be true?
[ ] We want to make sure that there is no way that a corrupted pristine
    can cause faulty content to be successfully committed to a repository.
[ ] We want to make sure that there is no way that a corrupted pristine
    can affect the 'actual' working copy files.
[ ] It is cheap to tee to a checksumming stream when reading a
    stored pristine. The I/O is much slower anyway.
[ ] It is expensive to *always* want to tee to a checksumming stream,
    because that defies speed gain when wanting to read just a subsection
    of a pristine, and defies using OS copy without de-/translation to put
    a pristine's copy into the WC.
[ ] We don't care about optimizing for reading subsections of a pristine.
    We "always" read the whole pristine anyway.
[ ] If the checksum is already known and the data exists in the local
    file system, there is no need to verify the checksum.
[ ] If the checksum is already known and the data exists in the local
    file system, we still have to verify the checksum again via a tempfile
    unless that local file already *is* a tempfile (and is thus reasonably
    protected from accidental modification before we finish our copy).
[ ] Checking size and mtime is a basic check, but it allows for crafted,
    coincidentally-matching-fstat or hw failure corruptions. We don't want
    that to be possible, so we will make sure that any corruption will
    be caught at *some* point in the code path.
[ ] We are totally fine with just checking size (and maybe mtime) of
    pristines locally, no need to *always* do checksumming when reading.

Pristine check.
Do we need a "fast" check? Note that it doesn't need to say how it's
implemented; just say whether you think we need a presence check that tries
to be as fast as possible by assuming the pristine store is consistent.
Greg said there is a use case for that lurking, so I suggest keeping it in
the design doc but marking it for post-1.7. Yes?

Write.
Julian said: there possibly should be only _write(), a simple API that takes
care of the rest internally.
Greg said: _write() has to go away, it has no way of telling whether the
write stream was interrupted by error or ended successfully.
Greg also said: It is fine to write directly to a pristine file location.
Which are/should be true?
[ ] We will validate pristines when reading anyway, so we can neglect
    being paranoid about writing to pristine files without losing safety.
[ ] It is ok to open a write stream to the final location of the
    pristine file for a given checksum, no need to store in a
    pristine-tempfile first.
[ ] However, we do not provide an API function for that. The only way
    to create a pristine (for now) will be via _tempdir() and _install().
[ ] But we will likely see a need at some point, upon which such API
    will be added. We will change _write() so that it can tell the
    difference between successful and unsuccessful stream closure.
[ ] Callers can also pass any other file locations to _install(), to
    facilitate OS-copying/moving into the pristine store without de-/
    translation (when the checksum is known beforehand).
[ ] ^ ... copying ...
[ ] ^ ... moving ...

Separate pristine store?
The current design envisions multiple working copies per wc.db. This
suggests that a pristine store be closely tied to a wc.db.
Which is/should be true?
[ ] It is likely that we need to separate a pristine store from a wc.db
    once we want to move to a system wide pristine store.
[ ] A system wide pristine store is a whole other ballgame. There will
    always be a pristine store tied to a wc.db.
[ ] They are currently tied. We are not prepared/able/willing to say
    anything else right now. We'll just go with the flow.

SHA1/MD5 compat.
[ ] When talking to an old server, we use only MD5 checksums to reference
    the pristine in the pristine store, nevermind hash collisions.
[ ] When talking to an old server, we use SHA1 to index the pristine
    store and also store the MD5 checksum for server comm. We create the
    SHA1 checksum locally after having verified the MD5 checksum.
[ ] When talking to a 1.7 server, we will receive and use only SHA1
    checksums and don't store any MD5 checksums.

Thanks for your insights/opinions/check marks.

(I have noted that these things should soak into the doc:
- clarify which parts are 1.7 and which are beyond.
- clarify use cases WRT which API layer is doing what (client vs. wc).
- different use case distinction than "store"/"new"/"fetch".
- when to use which method of pristine verification.
- need db for backwards compat (md5).
- how the PRISTINE table's size column works.
- how closely the pristine store is/will be tied to wc.db.
- incorporate 'repair' in 'check(_usable)'.
- something like a "high-water mark" is not designed yet.
- maybe think through the transition of pristine store use.)

BTW, I've seen my use case descriptions more like a list of general
situations, not as something specific to the internal pristine API.
...needs to be clarified. :)

Thanks,
~Neels

Neels J Hofmeyr wrote:
> Hi all,
>
> taking stock of the current state of the pristine store API and finding
> design docs missing, I have created a "design paper" to clarify.
>
> If you could be so kind to glance over it and straighten out my picture, if
> necessary. Upon approval, I'll check it into notes/ so we can edit.
>
> Note, if my view is correct, this design text implies small changes to the
> current state of the API:
> - no need for _pristine_write()
> - need to add _pristine_forget()
> - change the _pristine_checkmode_t enum.
>
> Still missing completely: how to handle a "high-water mark", i.e. how to
> determine which pristines get forgotten first.
>
> Thanks!
> ~Neels
>

application/pgp-signature attachment: OpenPGP digital signature

Received on 2010-02-17 13:29:31 CET

This message: [ Message body ]
Next message: Bert Huijben: "RE: pristine store design"
Previous message: Julian Foad: "Re: Setting a 1.7 branch goal"
In reply to: Neels J Hofmeyr: "pristine store design"
Next in thread: Bert Huijben: "RE: pristine store design"
Reply: Bert Huijben: "RE: pristine store design"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]