
Re: Antwort: Re: Re: dangerous implementation of rep-sharing cache for fsfs

From: Mark Mielke <mark_at_mark.mielke.cc>
Date: Fri, 25 Jun 2010 18:13:28 -0400

On 06/25/2010 03:34 PM, Daniel Shahaf wrote:
> [1] apparently, no SHA-1 collisions have been found to date. (see
> #svn-dev log today)
>

We know SHA-1 collisions must exist, however - and they are likely to
take an unlikely form. The algorithm was specifically designed so that
small changes in the input bits result in major changes to the resulting
digest. A collision is unlikely to come from a single character
difference. It's far more likely to come from a completely different bit
pattern, likely one that isn't even used in practical real-world
applications.
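
As a rough illustration of the "small change in, major change out"
property, here is a minimal Python sketch using the standard hashlib
module. The two one-line C snippets and the helper name
sha1_bits_differing are made up for the example; they are not taken
from Subversion or from any real repository.

```python
import hashlib


def sha1_bits_differing(a: bytes, b: bytes) -> int:
    """Count how many of the 160 SHA-1 digest bits differ between a and b."""
    da = int.from_bytes(hashlib.sha1(a).digest(), "big")
    db = int.from_bytes(hashlib.sha1(b).digest(), "big")
    # XOR leaves a 1 in every bit position where the two digests disagree.
    return bin(da ^ db).count("1")


# Two hypothetical inputs that differ by a single character.
original = b"int main(void) { return 0; }\n"
changed = b"int main(void) { return 1; }\n"

diff = sha1_bits_differing(original, changed)
print(f"{diff} of 160 digest bits differ")
```

For a well-behaved hash one would expect roughly half of the 160 bits
to flip, which is why two near-identical files are no more likely to
collide than two unrelated ones.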

File data tends to take a highly structured form - whether it be C code
or a Microsoft Office document. Huge portions of the space of possible
inputs will NEVER be used, because they will not be structured documents
of value to anybody. Take C code - it is likely to be a restricted set of
7-bit data with characters weighted towards the alphanumerics and
certain symbols. If you take all the C code in the world - it will
represent only a minuscule fraction of that space. If you take all the C
code in a particular repository - it will be a tiny sample. Images have a
similar pattern. One could say that image data is random - but it's not.
Only certain images, those which contain meaningful data, are worth
saving. That means only a subset of the bit patterns is even considered
valuable and worth storing.

Pick a repository with 1,000,000 commits with 1000 new file versions in
each commit.

This is 1 billion samples. 1 billion samples / (2^160) is still an
incredibly small number - about 6.8 x 10^-40. That is the chance that one
new file collides with any of those already stored.
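
The arithmetic above can be checked with a few lines of Python. This is
only a sketch under the assumptions already stated (1,000,000 commits,
1000 new file versions each, and SHA-1 behaving like a uniform 160-bit
function). The second figure is not in the original estimate: it is the
usual birthday approximation for ANY pair among the n samples colliding,
n(n-1)/2 / 2^160, added here for comparison.

```python
from fractions import Fraction

DIGEST_BITS = 160                  # SHA-1 digest size
samples = 1_000_000 * 1_000        # commits * new file versions per commit
space = 2 ** DIGEST_BITS           # number of possible digests

# Chance that one given new file matches any of the stored samples.
single_hit = Fraction(samples, space)

# Birthday approximation (assumption: uniform digests) for the chance
# that any two of the samples share a digest.
any_pair = Fraction(samples * (samples - 1), 2 * space)

print(f"one new file vs. stored samples: {float(single_hit):.1e}")
print(f"any pair among all samples:      {float(any_pair):.1e}")
```

Either way the result is many orders of magnitude below the failure
rates one already accepts from hardware.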

What real life repositories come close to this size? We work with some
very large repositories in ClearCase, and they don't come close to this...

It only takes one, you say? How are hard disks, memory, and other
factors considered acceptable then? All of these have documented chances
of failure. Nothing guarantees that when you write a certain block to
disk and read it back, it will either fail or return the original data.
Some small percentage of the time, it will return different data. With
2 TB disks, the chances are becoming significantly higher, to the point
where one can almost statistically guarantee a single bit error
somewhere on the disk that will go undetected and uncorrected. Again,
though - most people don't use the entire disk, and for much of the data
stored a single bit error won't even be noticed.

Personally, I don't want a performance hit introduced due to paranoia.
If a patch is introduced, I'd like it to be optional, so people can
choose whether to take the verification hit or not. I remain unconvinced
that rep-sharing is the most likely source of detectable or undetectable
fsfs corruption. I think the risk is firmly in the realm of theory, and
that other products such as Git have all but proven this.

Cheers,
mark

-- 
Mark Mielke<mark_at_mielke.cc>
Received on 2010-06-26 00:14:19 CEST

This is an archived mail posted to the Subversion Dev mailing list.
