
Antwort: Re: dangerous implementation of rep-sharing cache for fsfs

From: <michael.felke_at_evonik.com>
Date: Fri, 25 Jun 2010 14:45:03 +0200


I am actually more interested in finding a reliable solution
than in discussing mathematics and probabilities.
But to make it short: you are wrong!

1. You are comparing apples and oranges.

   Corruption due to faulty hardware is a random error,
   because it is not inherent to the method you use.
   Data corruption caused by a SHA-1 collision
   in the rep-sharing implementation, by contrast,
   is a systematic error inherent to the implementation.

2. You can't offset the probability of one error
   against that of another.

   The total error rate of a system is the combination of
   all possible errors, random and systematic ones.
   It often results in something like:
     sqrt( a_1 * (error_1 ^2) + a_2 * (error_2 ^2) + ... )
   As the rep-sharing SHA-1 check also depends on the hardware,
   it would not be trivial to calculate the system's error rate from
   these non-linearly combined probabilities.
   For details look at
   but in general the probability of an error enters
   the calculation with its second power.

3. You overestimate the risk of undetected hardware faults.

   Hardware failure is a long-known and
   well-controllable problem.
   Operating systems and network systems have a long tradition of
   implementing methods to detect data corruption caused by
   hardware faults. It is an essential part of their design.
   In addition, chemical firms and chemistry software developers
   do a lot to detect and prevent data corruption,
   whether due to hardware faults or any other source,
   such as humans acting deliberately. Examples are checksums,
   data replication, virtual machines, redundant systems, etc.

4. You underestimate the error introduced by misusing mathematical methods.

   As I already said in my first e-mail, SHA-1 was developed
   to detect random and deliberate data manipulation.
   It is a cryptographic hash, so there is a low chance of
   guessing or calculating a derived data sequence
   that generates the same hash value as the original data.
   But this is the only thing it ensures.
   There is no evidence that the hash values are
   equally distributed over the data sets, which is important for
   the use of a hashing method in data fetching.
   In fact, as it is a cryptographic hash,
   you should not be able to calculate it,
   because that would mean that you are able
   to calculate sets of data resulting in the same hash value.
   So you can't conclude from the low chance of
   guessing or calculating a derived data sequence
   that there is a low chance of hash collisions in general.
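
The combination rule from point 2 can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the
function name combined_error, the weights a_i and the sample values
are hypothetical and are not taken from Subversion or from any
measurement.

```python
import math


def combined_error(errors, weights=None):
    """Combine individual error contributions as
    sqrt(a_1 * error_1^2 + a_2 * error_2^2 + ...).

    errors  -- individual error magnitudes (hypothetical inputs)
    weights -- the factors a_i; all 1.0 if not given (an assumption)
    """
    if weights is None:
        weights = [1.0] * len(errors)
    if len(weights) != len(errors):
        raise ValueError("errors and weights must have the same length")
    return math.sqrt(sum(a * e ** 2 for a, e in zip(weights, errors)))


# Made-up example values, chosen only so the result is easy to check by hand.
print(combined_error([3.0, 4.0]))
```

Because each contribution enters squared, the larger term dominates
the combined value more strongly than it would in a plain sum.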

Finally, I want to give a short example calculation:
if we have a hash value with a size of 128 bits and we
assume the algorithm generates equally distributed hash values,
then there are 2^128 = 3.40*10^38 different hash values
to represent data sequences. That sounds like a lot.

But how many different data sequences are there to represent?
Let us take short binary files of 1K, i.e. 1024 octets.
The 1st octet has 256 values, which combine
with 256 values of the 2nd
and 256 values of the 3rd ... etc.
So there are 256^1024 = 1.09*10^2466 different data sequences
of 1K size.
This means for every hash value there are
(2^(8*1024))/(2^128)
= (2^(8192))/(2^128)
= 2^(8192-128)
= 2^8064
= 3.21*10^2427 sequences of data of 1K size
represented by the same hash value.
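
The figures above can be checked with exact integer arithmetic.
The following is a minimal Python sketch; the helper sci and the
constant names are made up for this illustration, and the 128-bit
hash size is simply the assumption made in the example.

```python
import math

HASH_BITS = 128      # hash size assumed in the example
FILE_OCTETS = 1024   # a 1K file


def sci(n):
    """Format a large positive integer as 'm.mm*10^e' (illustrative helper)."""
    lg = math.log10(n)           # math.log10 accepts arbitrarily large ints
    exponent = math.floor(lg)
    mantissa = 10 ** (lg - exponent)
    return f"{mantissa:.2f}*10^{exponent}"


hash_values = 2 ** HASH_BITS             # number of distinct hash values
sequences = 256 ** FILE_OCTETS           # number of distinct 1K files
per_hash = sequences // hash_values      # files per hash value if evenly spread

print("hash values:        ", sci(hash_values))
print("1K data sequences:  ", sci(sequences))
print("sequences per hash: ", sci(per_hash))
```

Python integers have arbitrary precision, so the division is exact;
only the final formatting goes through floating point.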

I hope this gives an idea of the problem
we have with this implementation and
why I am so interested in finding a reliable solution.
Now I hope to find one or two experienced Subversion developers
who are willing to assist me in solving the problem.


Michael Felke
Phone +49 2151 38-1453
Fax +49 2151 38-1094
Evonik Stockhausen GmbH
Bäkerpfad 25
47805 Krefeld

Management: Gunther Wittmer (Spokesman), Willibrord Lampen

Registered office: Krefeld
Court of registration: Amtsgericht Krefeld; Commercial Register HRB 5791


Greg Hudson <ghudson_at_MIT.EDU>
24.06.2010 18:41
        To: "michael.felke_at_evonik.com" <michael.felke_at_evonik.com>
        Cc: "dev_at_subversion.apache.org" <dev_at_subversion.apache.org>
        Subject: Re: Antwort: Re: dangerous implementation of rep-sharing cache for fsfs

On Thu, 2010-06-24 at 11:29 -0400, michael.felke_at_evonik.com wrote:
> We must ensure that the data in the repository is, without any concerns,
> the data we have once measured or written.

You do realize that the probability of data corruption due to faulty
hardware is much, much more likely than the probability of corruption
due to a rep-sharing SHA-1 collision, right?
Received on 2010-06-25 14:45:52 CEST

This is an archived mail posted to the Subversion Dev mailing list.