
Re: fs-rep-sharing branch

From: Martin Furter <mf_at_rola.ch>
Date: Wed, 22 Oct 2008 19:31:01 +0200 (CEST)

On Tue, 21 Oct 2008, David Glasser wrote:

> Er, to be clear. I am *not* talking about changing any use of md5s as
> *checksums*, like in the editor interface, etc.
>
> I'm talking about the use of md5s as *keys*.
>
> md5 checksum collision just means that corruption might not be
> noticed. md5 key collision means that there are realistic use cases
> for repositories that cannot exist.

Right. But changing the hash algorithm won't change the problem at all.

So why not the correct solution?

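1. Calculate the hash and start with sequence number zero.
2. Look up the hash.
3. If it does not exist, store the content under hash/0 and stop.
4. Otherwise compare the new content with each entry stored for that hash.
5. If one matches, store a link to that hash/seqNr pair and stop.
6. If none matches, set seqNr to max+1 and store the content under it.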
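
The steps above can be sketched roughly as follows. This is only an
illustration, not Subversion code: the RepStore class, its in-memory dict
and the pluggable hash_fn parameter are all assumptions made for the
example, and a real backend would keep the index on disk.

```python
import hashlib


class RepStore:
    """Toy content store keyed by (digest, seqNr). Hypothetical, in-memory only."""

    def __init__(self, hash_fn=None):
        # hash_fn is pluggable so that collisions can be forced when testing.
        self._hash_fn = hash_fn or (lambda data: hashlib.md5(data).hexdigest())
        # digest -> list of contents; the list index is the sequence number.
        self._reps = {}

    def add(self, data):
        """Store data; return ((digest, seq_nr), shared)."""
        digest = self._hash_fn(data)                  # step 1
        bucket = self._reps.setdefault(digest, [])    # step 2
        for seq_nr, existing in enumerate(bucket):    # step 4
            if existing == data:
                return (digest, seq_nr), True         # step 5: link to existing
        bucket.append(data)                           # steps 3 and 6: new seqNr
        return (digest, len(bucket) - 1), False

    def get(self, key):
        digest, seq_nr = key
        return self._reps[digest][seq_nr]


if __name__ == "__main__":
    # A deliberately useless hash makes every input collide.
    store = RepStore(hash_fn=lambda data: "same")
    print(store.add(b"first"))
    print(store.add(b"second"))
    print(store.add(b"first"))
```

With the constant hash in the demo, both contents stay retrievable under
their own seqNr, and the repeated content is shared instead of stored again.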

But what's the cost for this?

For different contents the compare will probably stop early. For files
with the same content the whole thing has to be compared. A hash match can
happen in two cases: a copy and an accidental collision. When a file is
copied, Subversion already knows that it is a duplicate. So only the
accidental hash collision remains, and this should be a rare case.
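
To illustrate why the compare is cheap for differing contents, here is a
minimal sketch of a chunked comparison that stops at the first mismatch.
The streams_equal name and the chunk size are assumptions for the example,
and it assumes read() only returns a short chunk at end of stream (true
for io.BytesIO and for ordinary buffered files).

```python
import io


def streams_equal(a, b, chunk_size=64 * 1024):
    """Return True if two binary streams hold identical bytes."""
    while True:
        chunk_a = a.read(chunk_size)
        chunk_b = b.read(chunk_size)
        if chunk_a != chunk_b:
            # First differing chunk (or differing length): stop early.
            return False
        if not chunk_a:
            # Both streams ended at the same point without a mismatch.
            return True


if __name__ == "__main__":
    same = streams_equal(io.BytesIO(b"abc" * 1000), io.BytesIO(b"abc" * 1000))
    diff = streams_equal(io.BytesIO(b"Xbc" * 1000), io.BytesIO(b"abc" * 1000))
    print(same, diff)
```

Only identical contents force a read of both streams to the end; anything
else returns after the first chunk that differs.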

Just my 2 cents.

Martin

>
> --dave
>
> On Tue, Oct 21, 2008 at 8:34 PM, David Glasser <glasser_at_davidglasser.net> wrote:
>> Did you miss the "I have real experience doing support for Subversion
>> repositories for cryptographic researchers who would in fact be trying
>> to make these collisions"? md5 has known collisions. sha1 is still
>> solid, for today. Most other open source version control systems
>> using content-addressable stores use sha1. *fs_base* uses sha1. Why
>> not FSFS?
>>
>> --dave
>>
>> On Tue, Oct 21, 2008 at 6:50 PM, Greg Stein <gstein_at_gmail.com> wrote:
>>> There is a HUGE difference between constructing two files with the
>>> same md5 in order to falsify a signature, and that of two files in a
>>> repository having the same md5 hash by accident.
>>>
>>> Sit down and look at the odds. 1 in 2^128. If I understand my powers
>>> of two properly, I believe that means the earth is more likely to
>>> spontaneously explode, than for two files to have the same hash key.
>>>
>>> Cheers,
>>> -g
>>>
>>> On Tue, Oct 21, 2008 at 3:57 PM, David Glasser <glasser_at_davidglasser.net> wrote:
>>>> As far as I can tell from reading the source, this (at least in FSFS)
>>>> assumes that reps sharing the same md5 are the same file. (BDB seems
>>>> to use sha1.)
>>>>
>>>> This means that you cannot store two files with the same md5 in the
>>>> same repository. While obviously all hashes have collisions in
>>>> theory, md5 has collisions in practice: there are known instances.
>>>> And you know, cryptography researchers use Subversion! (At one point
>>>> I tried to help fix Ron Rivest's corrupted svn repo...) I do not
>>>> think that this limitation is appropriate for Subversion; I would
>>>> highly advise against releasing this without changing FSFS to use SHA
>>>> as well. (I can't find a mailing-list discussion of this choice; my
>>>> apologies if I missed one; I have admittedly not been paying as much
>>>> attention as I'd like to Subversion development recently.)
>>>>
>>>> --dave
>>>>
>>>> On Mon, Oct 6, 2008 at 8:59 PM, Hyrum K. Wright
>>>> <hyrum_wright_at_mail.utexas.edu> wrote:
>>>>> The fs-rep-sharing branch is functionally complete, and I'd like to get the
>>>>> branch merged to trunk soon. These are the stats for various copies of our
>>>>> repository for the different branch/backend combinations.
>>>>>
>>>>> BDB:  1.5:          1.4GB
>>>>>       trunk:        627MB
>>>>>       reps-shared:  490MB
>>>>>
>>>>> FSFS: 1.5:          586MB
>>>>>       trunk:        578MB
>>>>>       reps-shared:  523MB
>>>>>
>>>>> The effect is quite pronounced on BDB, with around a 20% space savings compared
>>>>> with our current trunk (and over 67% compared with 1.5!) FSFS doesn't show as
>>>>> much improvement, partly due to the size of the index required to enable
>>>>> rep-sharing, partly due to decreased sharing opportunities in same-revision and
>>>>> parallel revision objects, and mostly due to the absolute floor on repo size due
>>>>> to inode usage.
>>>>>
>>>>> We may be able to tune the FSFS implementation just a bit. For instance,
>>>>> directory content representations may be unlikely to be shared, in
>>>>> which case we shouldn't bother.
>>>>>
>>>>> The remaining issue is the failing blame tests. Blame tests 10 and 11, which
>>>>> test 'blame -g', both fail for both backends. Before the recent commits to add
>>>>> rep-sharing to fsfs, the tests only failed for bdb. I'm slightly puzzled here
>>>>> because 'blame -g' should be FS-agnostic. If anybody has some insight, I
>>>>> welcome it.
>>>>>
>>>>> [Note: Because SQLite is still not an official dependency, to compile the
>>>>> rep-sharing stuff with FSFS, you'll need to add -DENABLE_SQLITE_TESTING to the
>>>>> CPPFLAGS when configuring.]
>>>>>
>>>>> -Hyrum
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> David Glasser | glasser@davidglasser.net | http://www.davidglasser.net/
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe_at_subversion.tigris.org
>>>> For additional commands, e-mail: dev-help_at_subversion.tigris.org
>>>>

Received on 2008-10-22 19:31:21 CEST

This is an archived mail posted to the Subversion Dev mailing list.
