On Fri, 2010-06-25 at 08:45 -0400, michael.felke_at_evonik.com wrote:

*> I am actually more interested in finding reliable solution
*

*> instead of discussing mathematics and probabilities.
*

The discussion of probabilities affects whether it would be justifiable

to change Subversion to address hash collisions.

*> 1. You are comparing apples and oranges.
*

*> 2. you can't balance the possibility of one error
*

*> with the that of an other.
*

All systems have a probability of failure, resulting from both human and

mechanical elements within the system. It may be difficult to estimate

precisely, but one can often establish a lower bound. The question is

whether hash-indexed storage increases the probability of failure by a

non-negligible amount.

*> It often results in something like:
*

*> square_root( a_1* (error_1 ^2) + a_2 * (error_2 ^2) + ...)
*

We're discussing failure rates, not margin of error propagation.

Failure rates propagate as 1 - ((1 - failure_1) * (1 - failure_2) * ...)

if the failure probabilities are independent.

If your system has a probability of failure of one in a million from

other factors, and we add in an independent failure probability of one

in 2^32 from hash-indexed storage, then the overall system failure

probability is one in 999767--that is, it doesn't change much.

*> 3. you over estimate the risk of undetected hardware faulty.
*

I think you over-estimate the risk of hash-index storage collisions.

*> There is no evidence that the hash vales are
*

*> equally distributed on the data sets, which is import for
*

*> the us of hashing method in data fetching.
*

A hash which had a substantially unequal distribution of hash values

among inputs would not be cryptographically sound.

*> = 3,21*10^2427 sequences of Data of 1K size
*

*> represented by the same hash value.
*

First, SHA-1 is a 160-bit hash, not a 128-bit hash. Second, the number

you calculated does not inform the probability of a collision.

If you have N samples, which are not specifically constructed as to

break SHA-1, then probability of a SHA-1 collision is roughly N^2 /

2^160 (see "birthday paradox" for more precision). So, for example,

with 2^64 representations (1.8 * 10^19), there would be a roughly 2^-32

probability of a SHA-1 collision in the rep cache. If you can construct

a system with close to a one in four billion probability of error from

other sources, kudos to you; if not, hash-indexed storage is not

perceptibly increasing your error rate.

Received on 2010-06-25 18:27:53 CEST