Re: pristine store database -- was: pristine store design

From: Greg Stein <gstein_at_gmail.com>
Date: Tue, 16 Feb 2010 14:15:40 -0500

Meta-comment: all of these issues that you're bringing up are
*exactly* why I wanted to punt the issue of external-to-WC stores out
of 1.7. Keeping a 1:1 correspondence of pristine stores to working
copies keeps the problem tractable, especially given all the other
work that is needed.

On Tue, Feb 16, 2010 at 09:58, Bert Huijben <bert_at_qqmail.nl> wrote:
>...
> The idea was not to move just the pristine store, but also wc.db to a
> central location.

Yup. There were also some comments of separable locations of the
wc.db: one where you keep your metadata (e.g. home dir), and one for
the pristines (e.g. /var/svn). I never liked that idea, but mention
it for completeness' sake.

As Bert noted in his response, the schema is designed to manage
multiple working copies.

>...
>> Instead, we could not store size and mtime at all! :)
>
> Or we could store both to perform simple consistency checks...

Dunno about that, but the storage of SIZE is part of the (intended)
algorithm for pristine storage. It is allowed to have a row in
PRISTINE with SIZE==0 in order to say "I know about this pristine, and
this row is present to satisfy integrity constraints with other
tables, but the pristine has NOT been written into the store." Once
the file *is* written, then the resulting size is stored into the
table.
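
A minimal sketch of that two-step protocol, using Python's sqlite3
module. The table layout here is a simplified assumption (the real
PRISTINE table has more columns), and the helper names
reserve_pristine, mark_written and is_present are hypothetical:

```python
import sqlite3

# Hypothetical, simplified schema: only the two columns discussed above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pristine ("
    " checksum TEXT PRIMARY KEY,"
    " size INTEGER NOT NULL DEFAULT 0)"
)

def reserve_pristine(conn, checksum):
    """Record that a pristine is known but not yet written (size 0)."""
    conn.execute(
        "INSERT OR IGNORE INTO pristine (checksum, size) VALUES (?, 0)",
        (checksum,),
    )

def mark_written(conn, checksum, size):
    """Store the resulting size once the file is in the store."""
    conn.execute(
        "UPDATE pristine SET size = ? WHERE checksum = ?", (size, checksum)
    )

def is_present(conn, checksum):
    """True only if the row exists and carries a non-zero size."""
    row = conn.execute(
        "SELECT size FROM pristine WHERE checksum = ?", (checksum,)
    ).fetchone()
    return row is not None and row[0] != 0
```

Note that this sketch cannot tell a placeholder row apart from a
genuinely empty pristine; how the real design treats that case is not
covered here.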

> A query over checksum from BASE_NODE and WORKING_NODE + older_checksum,
> left_checksum and right_checksum from ACTUAL_NODE would give a list from
> all PRISTINE records that are in-use.
> The rest can safely be deleted if diskspace is required.

Yes. That was the design goal.
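
The query Bert describes could look roughly like the following. This
is only a sketch: the checksum column names come from his message,
while the table definitions (including the "path" column) are
simplified assumptions and unreferenced_pristines is a hypothetical
helper:

```python
import sqlite3

# Simplified, assumed table layouts; only the checksum columns matter here.
SCHEMA = """
CREATE TABLE pristine (checksum TEXT PRIMARY KEY, size INTEGER);
CREATE TABLE base_node (path TEXT, checksum TEXT);
CREATE TABLE working_node (path TEXT, checksum TEXT);
CREATE TABLE actual_node (path TEXT, older_checksum TEXT,
                          left_checksum TEXT, right_checksum TEXT);
"""

# Every checksum still referenced by some node row.
IN_USE = """
SELECT checksum FROM base_node WHERE checksum IS NOT NULL
UNION SELECT checksum FROM working_node WHERE checksum IS NOT NULL
UNION SELECT older_checksum FROM actual_node
      WHERE older_checksum IS NOT NULL
UNION SELECT left_checksum FROM actual_node
      WHERE left_checksum IS NOT NULL
UNION SELECT right_checksum FROM actual_node
      WHERE right_checksum IS NOT NULL
"""

def unreferenced_pristines(conn):
    """Return checksums of PRISTINE rows that no node row refers to."""
    rows = conn.execute(
        "SELECT checksum FROM pristine"
        " WHERE checksum NOT IN (" + IN_USE + ")"
        " ORDER BY checksum"
    )
    return [r[0] for r in rows]
```

The NULL filters matter: a NULL inside the NOT IN list would make the
comparison unknown for every row, and nothing would be reported as
unreferenced.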

It gets more complicated when you have a centralized wc.db and one or
more of the working copies are offline (e.g. removable storage,
network unavailable, etc). Again, these questions are why a
centralized concept has been punted for this generation.

>...
>> So, while checking size and mtime gives a sense of basic sanity, it is
>> really just a puny excuse for not checking full checksum validity. If we
>> really care about correctness of pristines, *every* read of a pristine
>> should verify the checksum along the way. (That would include to always

Since the pristine design returns a *stream* on the pristine, we can
always insert a checksumming stream to verify the contents. I believe
the bottleneck will be I/O, so performing a checksum should be Just
Fine. If the stream is read to completion, then we can validate the
checksum (and don't worry about partial reads; those aren't all that
common, I believe).

Note that we'd only have to insert a single checksum stream, not SHA1 *and* MD5.
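
To illustrate the idea (not Subversion's actual stream API), here is a
sketch of a wrapping reader in Python that hashes bytes as they pass
through and validates only once the underlying stream is exhausted.
The names VerifyingReader and ChecksumMismatch are hypothetical, and
SHA-1 is assumed as the single checksum:

```python
import hashlib
import io

class ChecksumMismatch(Exception):
    """Raised when the fully-read contents do not match the checksum."""

class VerifyingReader:
    """Wrap a binary stream and check its SHA-1 when EOF is reached."""

    def __init__(self, stream, expected_hex):
        self._stream = stream
        self._expected = expected_hex
        self._digest = hashlib.sha1()
        self._verified = False

    def read(self, size=-1):
        if size is None:
            size = -1
        data = self._stream.read(size)
        self._digest.update(data)
        # EOF: either a read-all call, or an empty result for size > 0.
        at_eof = size < 0 or (size > 0 and not data)
        if at_eof and not self._verified:
            self._verified = True
            if self._digest.hexdigest() != self._expected:
                raise ChecksumMismatch("pristine contents are corrupt")
        return data

if __name__ == "__main__":
    body = b"example pristine contents"
    reader = VerifyingReader(io.BytesIO(body), hashlib.sha1(body).hexdigest())
    while reader.read(8):
        pass  # a mismatch would raise on the final, empty read
```

A caller that stops reading early never triggers the check, which
matches the "don't worry about partial reads" stance above.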

Also, since I/O is the (hypothetical) bottleneck, this is also why
compression is handy.

>...
> This is cheap to detect. If we have to read the file anyway, we can just
> recalculate the hash via the stream hash apis. The disk IO for reading the
> file is the real pain with current CPUs. (At least for SHA1/MD5. Sha256 is
> still painful in my tests)
>...
> You would probably see our binary diff fail before you get here and if the
> RA layer provides a hash for the result or input file you can use that for
> verification on the streams. (See the update editor for some examples where
> we do that in one pass since 1.6)

Right, and right.

>...

Cheers,
-g
Received on 2010-02-16 20:16:18 CET
