
Re: fs-rep-sharing branch

From: Daniel Berlin <dberlin_at_dberlin.org>
Date: Sat, 25 Oct 2008 04:10:26 -0400

To be more serious for a moment, the other problem you face with MD5
is that our reps are going to be in exactly the form that most attacks
on MD5 take, without the user deliberately doing anything.

All the collision attacks on MD5 so far are common-prefix attacks (i.e.
where files share a common set of start data), and because of how MD5
works, the data at the end doesn't happen to make the MD5 come out
different.

Because we use skip-delta reps, unless you modify your files at the
beginning all the time, you are going to end up with a lot of reps
whose data shares a common prefix.
This significantly increases your odds of an accidental collision.

It all comes down to "how long do you want this repo format to last".

If you use MD5 for rep-sharing, I am willing to absolutely guarantee
you that in the next 10 years at least one of our users will have
non-deliberate (i.e. they are not trying to create colliding files)
corruption as a result.
If you don't expect users to be using this format 10 years from now, fine.
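
As a rough illustration of the failure mode described above, here is a minimal Python sketch. It is not Subversion's code: the class names, `weak_digest`, and `find_collision` are all hypothetical, and MD5 is truncated to one byte purely so that a collision is easy to produce. It contrasts a store that shares a rep whenever the digest matches with one that also compares the bytes before sharing.

```python
import hashlib


def weak_digest(data: bytes) -> str:
    # Deliberately weak stand-in for a real checksum (assumption for the demo):
    # only the first byte of the MD5 digest, so collisions are trivial to find.
    return hashlib.md5(data).hexdigest()[:2]


class DigestOnlyStore:
    """Shares a representation whenever the digest matches (no content check)."""

    def __init__(self):
        self.reps = {}

    def put(self, data: bytes) -> str:
        key = weak_digest(data)
        # If the key already exists, the earlier rep is silently reused.
        self.reps.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self.reps[key]


class VerifyingStore(DigestOnlyStore):
    """Shares a representation only when the stored bytes are identical."""

    def put(self, data: bytes) -> str:
        base = weak_digest(data)
        key, n = base, 0
        # On a digest match with different content, fall back to a fresh key.
        while key in self.reps and self.reps[key] != data:
            n += 1
            key = f"{base}-{n}"
        self.reps.setdefault(key, data)
        return key


def find_collision():
    """Return two different byte strings with the same weak digest."""
    seen = {}
    i = 0
    while True:
        data = f"file-{i}".encode()
        d = weak_digest(data)
        if d in seen:
            return seen[d], data
        seen[d] = data
        i += 1


if __name__ == "__main__":
    a, b = find_collision()
    naive, safe = DigestOnlyStore(), VerifyingStore()
    naive.put(a)
    safe.put(a)
    print("digest-only store returns b intact:", naive.get(naive.put(b)) == b)
    print("verifying store returns b intact:  ", safe.get(safe.put(b)) == b)
```

With a digest-only store, the second file comes back as the first file's content and nothing reports an error; the verifying variant pays for a byte comparison on every digest match but never substitutes content.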

There are plenty of projects that used CVS, and before that, the RCS
file format, for > 10 years.
I think it would be a huge mistake to do something so unnecessary that
so greatly increases our risk of future problems.
But in the end, I am not writing the code, and if you want to make
this mistake, that's fine. I do plan on saying I told you so though
:)
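
The birthday-bound arithmetic quoted further down the thread can be checked with a short Python sketch. It assumes an ideal, uniformly distributed hash (which, as noted below, MD5 is not), and the function names are made up for illustration. It uses the standard approximation P ≈ 1 - exp(-n(n-1)/(2N)) for n items in a space of N = 2^bits values, plus the expected number of items before the first collision, sqrt(pi*N/2), which is about 1.25 * sqrt(N).

```python
import math


def collision_probability(n_items: float, hash_bits: int) -> float:
    """Approximate P(at least one collision) among n_items uniform random hashes."""
    space = 2.0 ** hash_bits
    # Birthday approximation: 1 - exp(-n(n-1) / (2N)).
    # expm1 keeps precision when the probability is very small.
    return -math.expm1(-n_items * (n_items - 1) / (2.0 * space))


def expected_items_until_collision(hash_bits: int) -> float:
    """Expected number of items before the first collision: sqrt(pi * N / 2)."""
    return math.sqrt(math.pi / 2.0 * 2.0 ** hash_bits)


if __name__ == "__main__":
    for exp in (59, 60, 61, 64):
        p = collision_probability(2.0 ** exp, 128)
        print(f"128-bit hash, 2^{exp} items: P(collision) ~ {p:.3e}")
    for bits in (128, 160):  # MD5-sized and SHA1-sized outputs
        n = expected_items_until_collision(bits)
        print(f"{bits}-bit hash: expect a collision after ~2^{math.log2(n):.2f} items")
```

Under these idealized assumptions the 128-bit probabilities at 2^59 through 2^61 items stay below one percent, which is lower than the percentages quoted below, so those are best read as rough figures. The 1.25 * sqrt(N) estimate itself matches the formula, and none of this covers the common-prefix structure discussed above, which the uniform model ignores.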

On Fri, Oct 24, 2008 at 8:31 PM, Daniel Berlin <dberlin_at_dberlin.org> wrote:
> So just to be clear, you've gone from "this will never happen" to "I
> don't care about these people"?
>
>
> On Fri, Oct 24, 2008 at 4:31 PM, Greg Stein <gstein_at_gmail.com> wrote:
>> Hunh?! I don't understand how you got 4 billion from 2^59.
>>
>> And, personally, I'm not too worried for repositories with 4 billion files,
>> let alone 2^59 files.
>>
>> Cheers,
>> -g
>>
>>
>>
>> On Oct 23, 2008, at 18:29, "Daniel Berlin" <dberlin_at_dberlin.org> wrote:
>>
>>> It's definitely not 2^128
>>>
>>> Assuming a perfectly even distribution (which md5 doesn't have), the
>>> birthday paradox means to expect a collision after 1.25 * sqrt(x)
>>> outputs.
>>>
>>> sqrt(2^128) = 2^64
>>>
>>> Probabilistically, you have a 1% chance of collision after ~2^59
>>> nodes, and it grows fairly quickly from there:
>>> 25% chance after 2^60, 50% chance after 2^61, etc
>>>
>>> Again, this is all with a perfectly even distribution. If MD5's
>>> distribution is say, "half as good as perfect", you will get
>>> collisions with a little more than 4 billion files.
>>>
>>> In any case, having a 64 bit number of files is getting within the
>>> reach of large systems.
>>> We should move to SHA1, which is in the "universe size number of files"
>>> range.
>>>
>>>
>>> On Tue, Oct 21, 2008 at 9:50 PM, Greg Stein <gstein_at_gmail.com> wrote:
>>>>
>>>> There is a HUGE difference between constructing two files with the
>>>> same md5 in order to falsify a signature, and that of two files in a
>>>> repository having the same md5 hash by accident.
>>>>
>>>> Sit down and look at the odds. 1 in 2^128. If I understand my powers
>>>> of two properly, I believe that means the earth is more likely to
>>>> spontaneously explode, than for two files to have the same hash key.
>>>>
>>>> Cheers,
>>>> -g
>>>>
>>>> On Tue, Oct 21, 2008 at 3:57 PM, David Glasser <glasser_at_davidglasser.net>
>>>> wrote:
>>>>>
>>>>> As far as I can tell from reading the source, this (at least in FSFS)
>>>>> assumes that reps sharing the same md5 are the same file. (BDB seems
>>>>> to use sha1.)
>>>>>
>>>>> This means that you cannot store two files with the same md5 in the
>>>>> same repository. While obviously all hashes have collisions in
>>>>> theory, md5 has collisions in practice: there are known instances.
>>>>> And you know, cryptography researchers use Subversion! (At one point
>>>>> I tried to help fix Ron Rivest's corrupted svn repo...) I do not
>>>>> think that this limitation is appropriate for Subversion; I would
>>>>> highly advise against releasing this without changing FSFS to use SHA
>>>>> as well. (I can't find a mailing-list discussion of this choice; my
>>>>> apologies if I missed one, I have admittedly not been paying as much
>>>>> attention as I'd like to Subversion development recently.)
>>>>>
>>>>> --dave
>>>>>
>>>>> On Mon, Oct 6, 2008 at 8:59 PM, Hyrum K. Wright
>>>>> <hyrum_wright_at_mail.utexas.edu> wrote:
>>>>>>
>>>>>> The fs-rep-sharing branch is functionally complete, and I'd like to get
>>>>>> the branch merged to trunk soon. These are the stats for various copies
>>>>>> of our repository for the different branch/backend combinations.
>>>>>>
>>>>>> BDB:  1.5:         1.4GB
>>>>>>       trunk:       627MB
>>>>>>       reps-shared: 490MB
>>>>>>
>>>>>> FSFS: 1.5:         586MB
>>>>>>       trunk:       578MB
>>>>>>       reps-shared: 523MB
>>>>>>
>>>>>> The effect is quite pronounced on BDB, with around a 20% space savings
>>>>>> compared with our current trunk (and over 67% compared with 1.5!). FSFS
>>>>>> doesn't show as much improvement, partly due to the size of the index
>>>>>> required to enable rep-sharing, partly due to decreased sharing
>>>>>> opportunities in same-revision and parallel revision objects, and mostly
>>>>>> due to the absolute floor on repo size due to inode usage.
>>>>>>
>>>>>> We may be able to tune the FSFS implementation just a bit. For instance,
>>>>>> directory content representations may be unlikely to be shared, in
>>>>>> which case we shouldn't bother
>>>>>>
>>>>>> The remaining issue is the failing blame tests. Blame tests 10 and 11,
>>>>>> which test 'blame -g', both fail for both backends. Before the recent
>>>>>> commits to add rep-sharing to fsfs, the tests only failed for bdb. I'm
>>>>>> slightly puzzled here because 'blame -g' should be FS-agnostic. If
>>>>>> anybody has some insight, I welcome it.
>>>>>>
>>>>>> [Note: Because SQLite is still not an official dependency, to compile
>>>>>> the rep-sharing stuff with FSFS, you'll need to add
>>>>>> -DENABLE_SQLITE_TESTING to the CPPFLAGS when configuring.]
>>>>>>
>>>>>> -Hyrum
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> David Glasser | glasser@davidglasser.net | http://www.davidglasser.net/
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: dev-unsubscribe_at_subversion.tigris.org
>>>>> For additional commands, e-mail: dev-help_at_subversion.tigris.org
>>>>>
>>>>>
>>>>
>>
>

Received on 2008-10-25 10:10:56 CEST

This is an archived mail posted to the Subversion Dev mailing list.
