Antwort: Re: ... Re: dangerous implementation of rep-sharing cache for fsfs

From: <michael.felke_at_evonik.com>
Date: Wed, 30 Jun 2010 11:57:59 +0200

Hello,

O.K., it seems there is really a need to discuss the problem of
SHA-1 collisions more deeply.

Mark already stated out most of the necessary points.

1. A collision is unlikely to come from small differences.
As I agree with all of you, a SHA-1 is practically impossible on
data derived form one another. It?s designed to do so.

2. Usable Data is not randomly chosen form the sample set of all possible
   data. Every kind of data, e.g. C code, is always a very tiny sample
   set and SHA-1 is not likely to collide in such a tiny sample set from
   one kind of data.

3. Data is always highly structured according to the kind it belongs to.
   That means that a whole set of data of one kind, even it is an
   infinite space it self, is always a narrow set, each element mostly
   resembling the others..

But one is missing!

4. The set of one kind of data and that of another kind are overlapping
   very infrequent, if at all. They could be seen as highly discriminable
   and separated parts of the sample set of all possible data.
   So SHA-1 hashes will wildly spread on the first set, doing the best
   of its job, and also, but independently, spread on the other set as
   wide as it?s expected to do.

What is the result, when two or more sets of hash values, each widely
spread of the same value range, are used together in one fetch index?

Perhaps, some can see a danger now, too.

I? am working on a practical demonstration, which everybody could
reproduce with his or her spreadsheet program.
But please be patient, I have other things to do, as well.

Greetings,

P.S. Thanks for the warning; we are not going to use 1.7.
At the Moment we are not using 1.6 either,
because of the SHA-1 rep-share cache.

Michael Felke
Telefon +49 2151 38-1453
Telefax +49 2151 38-1094
michael.felke_at_evonik.com
Evonik Stockhausen GmbH
Bäkerpfad 25
47805 Krefeld
http://www.evonik.com

Geschäftsführung: Gunther Wittmer (Sprecher), Willibrord Lampen

Sitz der Gesellschaft: Krefeld
Registergericht: Amtsgericht Krefeld; Handelsregister HRB 5791

This e-mail transmission, and any documents, files or previous e-mail
messages attached to it may contain information that is confidential or
legally privileged. If you are not the intended recipient, or a person
responsible for delivering it to the intended recipient, you are hereby
notified that you must not read this transmission and that any disclosure,
copying, printing, distribution or use of any of the information contained
in or attached to this transmission is STRICTLY PROHIBITED. If you have
received this transmission in error, please immediately notify the sender
by telephone or return e-mail and delete the original transmission and its
attachments without reading or saving in any manner. Thank you.

Mark Mielke <mark_at_mark.mielke.cc>
26.06.2010 00:13

        An: Daniel Shahaf <d.s_at_daniel.shahaf.name>
        Kopie: michael.felke_at_evonik.com, mf_at_rola.ch,
"dev_at_subversion.apache.org" <dev_at_subversion.apache.org>, ghudson_at_mit.edu,
Mark Phippard <markphip_at_gmail.com>
        Thema: Re: Antwort: Re: Re: dangerous implementation of
rep-sharing cache for fsfs

On 06/25/2010 03:34 PM, Daniel Shahaf wrote:
> [1] apparently, no SHA-1 collisions have been found to date. (see
> #svn-dev log today)
>

We know SHA-1 collisions must exist, however - they are also likely to
take unlikely form. The algorithms were specifically chosen so that
small changes in bits would result in major changes to the resulting
digest. A collision is unlikely to come from a single character
difference. It's far more likely to come from a completely different bit
set, likely a bit set that isn't even used in practical real world
applications.

File data tends to take a higher structured form - whether it be C code
or a Microsoft Office document. Huge portions of the sample set will
NEVER be used, because they will not be higher structured documents of
value to anybody. Take C code - it is likely to be a restricted set of
7-bit data with characters weighted towards the alphanumerics and
certain symbols. If you take all the C code in the world - it will not
represent a huge fraction of the sample set. If you take all the C code
in a particular repository - it will be a tiny sample set. Images have a
similar pattern. One could say that image data is random - but it's not.
Only certain images, which contain data, are worth saving. That data
means that a subset of the bit patterns are even being considered
valuable and worth storing.

Pick a repository with 1,000,000 commits with 1000 new file versions in
each commit.

This is 1 billion samples. 1 billion samples / (2^160) is still an
incredibly small number - 6.8 x 10^-40.

What real life repositories come close to this size? We work with some
very large repositories in ClearCase, and they don't come close to this...

It only takes one, you say? How are hard disks, memory, and other
factors considered acceptable then? All of these have documented chances
of failure. There is nothing that guarantees that if you write a certain
block to disk, that when you read it back, it will either fail or return
the original data. Some small percentage of the time, it will return new
data. With 2 Tbyte disks, the chances are becoming significantly higher
to the point where one can almost statistically guarantee a single bit
error over the entire surface of the disk that will go undetected and
un-error corrected. Again, though - most people don't use the entire
disk, and much of the data stored won't even be noticed if a single bit
error is introduced.

Personally, I don't want a performance hit introduced due to paranoia.
If a patch is introduced, I'd like it to be optional, so people can
choose whether to take the verification hit or not. I remain unconvinced
that rep-sharing is the greatest chance of detectable or undetectable
fsfs corruption problems. I think it is firmly in the realm of theory,
and that other products such as GIT have all but proven this.

Cheers,
mark

-- 
Mark Mielke<mark_at_mielke.cc>

Received on 2010-06-30 11:59:45 CEST

This message: [ Message body ]
Next message: Julian Foad: "Re: about fixing add_file()..."
Previous message: Philip Martin: "Re: svn commit: r959121 - /subversion/trunk/subversion/libsvn_wc/wc_db.c"
In reply to: Mark Mielke: "Re: Antwort: Re: Re: dangerous implementation of rep-sharing cache for fsfs"
Next in thread: Mark Mielke: "Re: Antwort: Re: ... Re: dangerous implementation of rep-sharing cache for fsfs"
Reply: Mark Mielke: "Re: Antwort: Re: ... Re: dangerous implementation of rep-sharing cache for fsfs"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]