Re: pristine store design

From: Greg Stein <gstein_at_gmail.com>
Date: Wed, 17 Feb 2010 19:52:53 -0500

On Wed, Feb 17, 2010 at 18:32, Neels J Hofmeyr <neels_at_elego.de> wrote:
>...
> +1 -- I was hoping it'd make sense to go that way.
> - "always" internally checksum streams returned by pristine_read()
> (until proven that performance is affected significantly)
> - Allow a boolean flag to switch off this checksumming

yup...

> - _read() knows the size of the pristine from the pristine file. It
> will skip checksum comparison on premature close.

Unless it does a stat(), it doesn't know. But that isn't need. The
checksumming stream just reads to EOF. If it never hits EOF, then it
never attempts to validate. Very simple.

>...
>>>> [ ] If the checksum is already known and the data exists in the local
>>>> file system, there is no need to verify the checksum.
>>
>> I'm assuming you mean copying from a source in the local file system,
>> and writing that into the pristine store.
>>
>> Yes, if we have a checksum for that file, then we can just use it
>> since we (invariably) *just* computed it.
>
> Are you talking specifically about a tempfile? Because below, we all seem to
> agree that it is dangerous to compute a checksum from an Actual working copy
> file and to only copy it later. ?

Yes.

>>> That depends. If we are going to read it anyway...
>>> In most cases I agree. The filesystem also has verifications on bit errors.
>>> But you also have sed -i <.....>
>>
>> Not all filesystems do. That is why we record checksums for repository
>> contents. That "one bit error every five years" needs to be
>> detectable.
>
> ^ Further support for rather doing checksums with every read, where certain
> situations in the code get "special permission" to switch off checksumming.

I doubt working copies will suffer degradation like a server
repository, but ... yeah. Sure.

>...
>> I think a flag to pristine_read() may be the way to go.
>
> And I'd prefer to term it from the other end, as I stated above, i.e. to say
> "do checksums unless we know in advance that it makes no sense, and unless
> we know that it hits performance really hard".

Sure.

I would also have no problem with a separate function like
svn_wc__db_pristine_read_unsafely() :-)

>...
> To prevent two concurrently running clients from installing the same
> pristine twice, a file lock seems more appropriate (with one wc.db per WC,
> benefit of even that is debatable, but with one wc.db per user that might be
> good).

Eep. I hadn't considered this.

Okay. Forgot all that I've written about allowing faulty files, and
writing directly to the target location. Tempfile-then-move is best.

Thanks.

>...
>>>> [ ] It is ok to open a write stream to the final location of the
>>>> pristine file for a given checksum, no need to store in a
>>>> pristine-tempfile first.
>>> I think we should say no here, to guarantee a stable behavior. If the file
>>> is in the pristine store it should be OK. We can't trust the initial write
>>> to always provide the full file and a valid hash of this file.
>>> We should calculate ourselves if it is possible that we receive an
>>> incomplete file.
>>
>> The schema allows for this, so it can/should be just fine.
>
> This kinda contradicts the impression I get from you saying "_write() has to
> go" and "we always copy to temp before/while calculating the checksum".
> (s.b. ...)

_write() just can't function properly because of the
stream-close-means-WHAT? issue. That's independent of allowing faulty
files into the store.

>>> And calculating ourselves defines that we don't know the final location.
>>
>> Usually, we assemble a file in the temp area that is detranslated. At
>> that time, we can compute the checksum. And then we install the file
>> as a pristine.
>>
>> We might have a file incoming from the repository with a known
>> checksum. We can spool that directly into the pristine store.
>
> (...) But when we receive a pristine over the network, we should surely
> verify its checksum before writing it to the final pristine file location?
>
> I tend to favor doing all partial work in the pristine tempdir and OS-moving
> into place once the file is complete.

Agreed (now), for the multi-process issue.

>...
>>> ...
>>>> SHA1/MD5 compat.
>>>> [ ] When talking to an old server, we use only MD5 checksums to reference
>>>> the pristine in the pristine store, nevermind hash collisions.
>>
>> If we don't have a SHA1 handy, then I don't mind this. But I think we
>> should store stuff using *only* MD5 or *only* SHA1. Mixing them could
>> end up with duplicates. Not sure if that is a problem, but it feels
>> like it could be.
>
> I was hoping to get a -1 on this. To me it feels preferable to make the
> pristine store *always* use SHA1 for the hash index. I'd be totally against
> having a mixed hash index. Given that streaming to create a checksum when
> receiving from the repos should be cheap, let's only use SHA1 indexes?

That's fine. Just saying I don't mind either way, as long as we stick
to *one* form of checksums.

>>> ...
>>>> [ ] When talking to an old server, we use SHA1 to index the pristine
>>>> store and also store the MD5 checksum for server comm. We create the
>>>> SHA1 checksum locally after having verified the MD5 checksum.
>>> +1. This is the original WC-NG plan
>>
>> I wouldn't go *that* far. There was definitely some hand-waving going
>> on. The MD5 in the editor API is a severely complicating item. I had
>> hoped to fix that particular problem with Ev2.
>
> ...so you disagree.

nono... was just disagreeing with Bert's assertion that it was "the
original WC-NG plan".

> What about md5 collisions? Are they unlikely enough?
> Also, with only SHA1 in use, we don't need to maintain code that figures out
> which kind of checksum is used in a pristine store.
>
> What say you?

My assumption is that we'll figure out how to get a SHA-1 for any file
we wish to install, despite the Editor API problems, and (thus) we'll
use SHA-1 for the index. And that further means that we'll store MD5's
into the metadata store.

But if you *can't* always get a SHA1 easily, and determine it would be
easier *today* to use MD5 for the indexes? Fine.

If we can get Ev2 into 1.7 and solve some of these problems? Awesomeness.

>>>> [ ] When talking to a 1.7 server, we will receive and use only SHA1
>>>> checksums and don't store any MD5 checksums.
>>> There is currently no delta editor upgrade for 1.7, so we still have to send
>>> MD5 to our 1.7 server as it is now. And to wrap old editors with new editors
>>> (and reversed) we probably have to send both hashes whenever possible.
>>> (The repository filesystem already stores SHA1 hashes since 1.6 or ?)
>>
>> I think we need to keep MD5 checksums around, in addition to the SHA1
>> for compat with the Editor v1 APIs. Sigh.
>
> Group sigh.

Bah. You're just thinking too small. Work on Editor v2!!

... hehehe

Actually, I suspect that we *may* want to use Ev2 in one or two places
to make our lives manageable. Not a full conversion of all editors.
Just a targeted, surgical replacement of (say) the update editor.

Cheers,
-g
Received on 2010-02-18 01:53:27 CET

This message: [ Message body ]
Next message: Greg Stein: "memory leak? ref: merge test 4"
Previous message: Neels J Hofmeyr: "Re: pristine store design"
In reply to: Neels J Hofmeyr: "Re: pristine store design"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]