# Re: Antwort: Re: dangerous implementation of rep-sharing cache for fsfs

From: Greg Hudson <ghudson_at_MIT.EDU>
Date: Fri, 25 Jun 2010 12:27:09 -0400

On Fri, 2010-06-25 at 08:45 -0400, michael.felke_at_evonik.com wrote:
> I am actually more interested in finding reliable solution
> instead of discussing mathematics and probabilities.

The discussion of probabilities affects whether it would be justifiable
to change Subversion to address hash collisions.

> 1. You are comparing apples and oranges.
> 2. you can't balance the possibility of one error
> with the that of an other.

All systems have a probability of failure, resulting from both human and
mechanical elements within the system. It may be difficult to estimate
precisely, but one can often establish a lower bound. The question is
whether hash-indexed storage increases the probability of failure by a
non-negligible amount.

> It often results in something like:
> square_root( a_1* (error_1 ^2) + a_2 * (error_2 ^2) + ...)

We're discussing failure rates, not margin of error propagation.
Failure rates propagate as 1 - ((1 - failure_1) * (1 - failure_2) * ...)
if the failure probabilities are independent.

If your system has a probability of failure of one in a million from
other factors, and we add in an independent failure probability of one
in 2^32 from hash-indexed storage, then the overall system failure
probability is one in 999767--that is, it doesn't change much.

> 3. you over estimate the risk of undetected hardware faulty.

I think you over-estimate the risk of hash-index storage collisions.

> There is no evidence that the hash vales are
> equally distributed on the data sets, which is import for
> the us of hashing method in data fetching.

A hash which had a substantially unequal distribution of hash values
among inputs would not be cryptographically sound.

> = 3,21*10^2427 sequences of Data of 1K size
> represented by the same hash value.

First, SHA-1 is a 160-bit hash, not a 128-bit hash. Second, the number
you calculated does not inform the probability of a collision.

If you have N samples, which are not specifically constructed as to
break SHA-1, then probability of a SHA-1 collision is roughly N^2 /
2^160 (see "birthday paradox" for more precision). So, for example,
with 2^64 representations (1.8 * 10^19), there would be a roughly 2^-32
probability of a SHA-1 collision in the rep cache. If you can construct
a system with close to a one in four billion probability of error from
other sources, kudos to you; if not, hash-indexed storage is not