
Re: fs-rep-sharing branch

From: Daniel Berlin <dberlin_at_dberlin.org>
Date: Thu, 23 Oct 2008 18:40:11 -0400

Again, you really aren't going to be running around with md5 number
of files for a long time.
2^64 files is not going to last you anywhere near as long as you think.
You can easily have 1/1000th the number of files you'd need to collide
in a perfectly even distribution on a large system today.
MD5 is not a perfectly even distribution, and there are large systems
that get md5 collisions all the time and have thus moved to SHA.
Remember that once you get past the magic number for an even
distribution, you are likely to have a collision on *every single node
you create*, not just on some of them. The number I gave means that
once you have 2^64 nodes, on average the probability of having content
that collides with some existing one is ~100%.
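For anyone who wants to plug in their own numbers, here is a minimal sketch of the standard birthday approximation. It assumes an idealized, perfectly even hash (which, as noted above, MD5 is not), and the function name is made up for illustration.

```python
import math


def birthday_collision_probability(n_items: int, hash_bits: int) -> float:
    """Approximate P(at least one collision) among n_items uniform hash values.

    Assumption: the hash behaves like a perfectly even distribution over
    2**hash_bits values. Real hashes only approximate this.
    """
    space = 2 ** hash_bits
    # Standard approximation: 1 - exp(-n(n-1) / (2N)).
    # expm1 keeps precision when the probability is extremely small.
    exponent = -(n_items * (n_items - 1)) / (2 * space)
    return -math.expm1(exponent)


if __name__ == "__main__":
    for exp in (32, 54, 64):
        n = 2 ** exp
        p128 = birthday_collision_probability(n, 128)  # MD5-sized digest
        p160 = birthday_collision_probability(n, 160)  # SHA-1-sized digest
        print(f"2^{exp} items: 128-bit {p128:.3e}, 160-bit {p160:.3e}")
```

The 128-bit and 160-bit columns correspond to the digest sizes of MD5 and SHA-1; the item counts in the loop are arbitrary examples, not measurements from any real system.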

This is really not a state you want to be in. SHA-1 will have 0 of
these problems for the life of our known universe.
With MD5, you probably have 10 years or less before the average large
system is storing enough nodes to get a collision on every new

On Tue, Oct 21, 2008 at 11:05 PM, Greg Stein <gstein_at_gmail.com> wrote:
> After you've changed the editor API, the wc_entry_t structure,
> migrated all old clients over to svn_checksum_t, and then switched the
> storage defaults over to sha1, *then* we can talk about "an easy
> switch".
> The simple fact is that we're going to be running around with md5
> checksums in hand for a long while. OR we double-compute, and I'm not
> willing to burn that much CPU to satisfy somebody's misguided
> preconception about md5 collisions. And double-compute generally means
> that we *carry around* both checksums. You wanna update all the APIs
> for that, too?
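To make "double-compute" concrete, here is a rough sketch of computing both checksums over a single read of the data. It uses Python's hashlib as a stand-in and is not Subversion's svn_checksum_t API; the function name and chunk size are assumptions for illustration.

```python
import hashlib
import io


def double_checksum(stream, chunk_size=64 * 1024):
    """Return (md5_hex, sha1_hex) for a binary stream, reading it only once."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Each chunk is hashed twice: this is the extra CPU cost in question.
        md5.update(chunk)
        sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()


if __name__ == "__main__":
    md5_hex, sha1_hex = double_checksum(io.BytesIO(b"hello world\n"))
    print("md5: ", md5_hex)
    print("sha1:", sha1_hex)
```

The data is read once but hashed twice, and the caller then has two values to carry around, which is the API concern raised above.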
> And before you start all of the above, please describe the failure mode?
> As I see it, Alice chooses FOO.C to be her prefix. So she appends some
> code to the end, then she adds a second file with the same FOO.C
> prefix, and then adds some *different* code to the end, which Alice
> knows to generate an MD5 collision. The working copy then proceeds to
> NOT install the second file since it thinks it already has it. Thus,
> Alice ends up with two copies of FOO.C + one suffix.
> Is there a different failure mode that you're "protecting" (ahem) against?
> -g
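The failure mode described above can be sketched with a toy content store. This is only an illustration under stated assumptions: the RepCache class and helper names are hypothetical, not Subversion code, and a deliberately weak key (the first hex digit of the MD5 digest) stands in for a real MD5 collision so that the demo can find a colliding pair quickly.

```python
import hashlib


def weak_key(data: bytes) -> str:
    # Hypothetical stand-in for a collision-prone checksum: only the first
    # hex digit of the MD5 digest, so collisions are trivial to find.
    return hashlib.md5(data).hexdigest()[:1]


class RepCache:
    """Toy store that shares a representation whenever the key matches."""

    def __init__(self, key_func):
        self.key_func = key_func
        self.reps = {}

    def store(self, data: bytes) -> str:
        key = self.key_func(data)
        # An equal key is assumed to mean equal content, so when the key
        # already exists the new data is silently dropped.
        self.reps.setdefault(key, data)
        return key

    def fetch(self, key: str) -> bytes:
        return self.reps[key]


def find_colliding_file(prefix: bytes, existing: bytes, key_func) -> bytes:
    """Search for different content with the same prefix and the same key."""
    target = key_func(existing)
    i = 0
    while True:
        candidate = prefix + b"suffix-%d\n" % i
        if candidate != existing and key_func(candidate) == target:
            return candidate
        i += 1


prefix = b"/* FOO.C */\n"
file_a = prefix + b"int a(void) { return 1; }\n"
file_b = find_colliding_file(prefix, file_a, weak_key)

cache = RepCache(weak_key)
key_a = cache.store(file_a)
key_b = cache.store(file_b)
print(file_a != file_b, key_a == key_b, cache.fetch(key_b) == file_a)
```

The two files differ, yet they share a key, so fetching the second file returns the content of the first: two copies of the prefix plus one suffix.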
> On Tue, Oct 21, 2008 at 7:01 PM, Hyrum K. Wright
> <hyrum_wright_at_mail.utexas.edu> wrote:
>> Greg Stein wrote:
>>> There is a HUGE difference between constructing two files with the
>>> same md5 in order to falsify a signature, and that of two files in a
>>> repository having the same md5 hash by accident.
>>> Sit down and look at the odds. 1 in 2^128. If I understand my powers
>>> of two properly, I believe that means the earth is more likely to
>>> spontaneously explode, than for two files to have the same hash key.
>> I'm not concerned about *random* collisions. I'm concerned about malicious
>> committers (or attackers who compromise a committer's account). In that case, it
>> becomes the same as constructing two files with the same md5 to falsify a signature.
>> The other way to look at this is cost vs. benefit. Changing to sha1 has minimal
>> cost, especially with the new checksum infrastructure. While some may claim
>> that the benefits are equally minimal, others would feel more comfortable if we
>> used sha1 in the rep cache table. I think that's a reasonable compromise.
>> -Hyrum
>>> On Tue, Oct 21, 2008 at 3:57 PM, David Glasser <glasser_at_davidglasser.net> wrote:
>>>> As far as I can tell from reading the source, this (at least in FSFS)
>>>> assumes that reps sharing the same md5 are the same file. (BDB seems
>>>> to use sha1.)
>>>> This means that you cannot store two files with the same md5 in the
>>>> same repository. While obviously all hashes have collisions in
>>>> theory, md5 has collisions in practice: there are known instances.
>>>> And you know, cryptography researchers use Subversion! (At one point
>>>> I tried to help fix Ron Rivest's corrupted svn repo...) I do not
>>>> think that this limitation is appropriate for Subversion; I would
>>>> highly advise against releasing this without changing FSFS to use SHA
>>>> as well. (I can't find a mailing-list discussion of this choice; my
>>>> apologies if I missed one, I have admittedly been not paying as much
>>>> attention as I'd like to Subversion development recently.)
>>>> --dave
>>>> On Mon, Oct 6, 2008 at 8:59 PM, Hyrum K. Wright
>>>> <hyrum_wright_at_mail.utexas.edu> wrote:
>>>>> The fs-rep-sharing branch is functionally complete, and I'd like to get the
>>>>> branch merged to trunk soon. These are the stats for various copies of our
>>>>> repository for the different branch/backend combinations.
>>>>> BDB:   1.5:          1.4GB
>>>>>        trunk:        627MB
>>>>>        reps-shared:  490MB
>>>>> FSFS:  1.5:          586MB
>>>>>        trunk:        578MB
>>>>>        reps-shared:  523MB
>>>>> The effect is quite pronounced on BDB, with around a 20% space savings compared
>>>>> with our current trunk (and over 67% compared with 1.5!) FSFS doesn't show as
>>>>> much improvement, partly due to the size of the index required to enable
>>>>> rep-sharing, partly due to decreased sharing opportunities in same-revision and
>>>>> parallel revision objects, and mostly due to the absolute floor on repo size due
>>>>> to inode usage.
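The relative savings can be recomputed from the sizes listed above with a few lines. This is just a sketch: it assumes 1.4GB means roughly 1.4 * 1024 MB, and since that figure is rounded, the BDB-versus-1.5 percentage is only approximate.

```python
# Repository sizes in MB, taken from the figures quoted above.
# Assumption: "1.4GB" is treated as 1.4 * 1024 MB (a rounded value).
sizes_mb = {
    "BDB": {"1.5": 1.4 * 1024, "trunk": 627, "reps-shared": 490},
    "FSFS": {"1.5": 586, "trunk": 578, "reps-shared": 523},
}


def savings_percent(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    for backend, sizes in sizes_mb.items():
        for baseline in ("1.5", "trunk"):
            pct = savings_percent(sizes[baseline], sizes["reps-shared"])
            print(f"{backend}: reps-shared vs {baseline}: {pct:.1f}% smaller")
```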
>>>>> We may be able to tune the FSFS implementation just a bit. For instance,
>>>>> directory content representations may be unlikely to be shared, in
>>>>> which case we shouldn't bother
>>>>> The remaining issue is the failing blame tests. Blame tests 10 and 11, which
>>>>> test 'blame -g', both fail for both backends. Before the recent commits to add
>>>>> rep-sharing to fsfs, the tests only failed for bdb. I'm slightly puzzled here
>>>>> because 'blame -g' should be FS-agnostic. If anybody has some insight, I
>>>>> welcome it.
>>>>> [Note: Because SQLite is still not an official dependency, to compile the
>>>>> rep-sharing stuff with FSFS, you'll need to add -DENABLE_SQLITE_TESTING to the
>>>>> CPPFLAGS when configuring.]
>>>>> -Hyrum
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe_at_subversion.tigris.org
> For additional commands, e-mail: dev-help_at_subversion.tigris.org

Received on 2008-10-24 00:40:34 CEST

This is an archived mail posted to the Subversion Dev mailing list.