
Re: Optional/compressed text bases (was: Re: [Reminder] Subversion a mentor for Google Summer of Code)

From: Jonathan Gilbert <o2w9gs702_at_sneakemail.com>
Date: 2006-05-08 19:18:28 CEST

At 03:42 PM 08/05/2006 +0200, Peter N. Lundblad wrote:
[snip]
>Qi Fred writes:
[snip]
> > cycle is feasible. The third one concerns the collision of message
> > digest algorithms. There is a report that different contents give the
> > same MD5 digest (http://eprint.iacr.org/2004/199.pdf). But
> > collisions have not been found in the SHA-1 algorithm. Some
> > investigation should be done to avoid collisions. I prefer to
> > implement the third working model.
> >
>I'm no expert in this area, but I'm pretty sure the collisions concern
>the cryptographic uses of MD5, so I don't think we need to worry about
>that. Others may want to comment here.

At 09:00 AM 08/05/2006 -0700, Ron wrote:
>I wish I could remember the link, but I read about using the MD5 of the
>file forward and then the MD5 of the file backwards, producing 2 MD5
>values; that, along with the size of the data, produced a chance of
>collision so small as to be (almost) impossible. Subversion could also
>add the modify time to increase this even more.
>
>Maybe (almost) impossible isn't good enough, but it was 1 in the
>billions of trillions if I remember correctly. I am far from an expert
>in this area, so this may be common knowledge/debunked.
>
>I would love to see Subversion store some kind of hash rather than the
>full file. I work on projects with many, many gigabytes of binary data
>and hate to have my entire project stored twice.

The goal of a good hashing algorithm is similar to that of a good random
number generator: to cover the digest space evenly. Two similar messages,
even ones that differ by only a single bit, should have hash values that
are as wildly divergent as possible.
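
As a toy illustration (hypothetical Python, nothing from the Subversion
sources), flipping a single bit in a 32 KB block produces an entirely
unrelated-looking MD5 digest:

    import hashlib

    block = bytes(32 * 1024)                # 32 KB of zero bytes
    changed = bytearray(block)
    changed[0] ^= 0x01                      # flip a single bit

    print(hashlib.md5(block).hexdigest())
    print(hashlib.md5(bytes(changed)).hexdigest())
    # The two digests bear no visible relationship to each other, even
    # though the inputs differ by only one bit.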

The problem here is that the digest space is *much* smaller than the
message space. If you have a 32 KB message -- that's 262144 bits -- and use
128-bit digests, then on average, you'll have 2 ^ (262144 - 128) (roughly
4.73526e+78874 -- thousands of orders of magnitude greater than the number
of quark particles in the entire universe) unique messages with a given
digest!
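
For the curious, here's a quick back-of-the-envelope check of that figure
(plain Python, purely illustrative -- none of this is Subversion code):

    # On average, how many distinct 32 KB messages share one 128-bit digest?
    # That is 2 ** (262144 - 128); here we just compute its order of magnitude.
    import math

    message_bits = 32 * 1024 * 8        # a 32 KB block is 262144 bits
    digest_bits = 128                   # MD5 produces a 128-bit digest
    print((message_bits - digest_bits) * math.log10(2))   # ~78874.7, so ~4.7e+78874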

The key thing, though -- the reason tools like rsync work -- is that while
that is certainly an immense number of messages with the same digest, very
few pairs of them (ideally *none*) are *similar*. To find a collision, you
basically have to start over from scratch with a completely gobbledygook
message. This works in our favour, because a changed block is typically
only slightly different from the original, which virtually guarantees that
it will have a unique hash value.
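
Here's a much-simplified sketch of the idea (hypothetical Python with fixed
block boundaries; real rsync uses a rolling weak checksum and handles
insertions, but the principle is the same): only the blocks whose digests
changed need to be transferred.

    import hashlib

    BLOCK = 32 * 1024

    def block_digests(data):
        # Hash each fixed-size block of the data separately.
        return [hashlib.md5(data[i:i + BLOCK]).digest()
                for i in range(0, len(data), BLOCK)]

    def changed_blocks(old, new):
        # Blocks whose digests differ are the ones we would re-send.
        return [i for i, (a, b) in enumerate(zip(block_digests(old),
                                                 block_digests(new)))
                if a != b]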

This does not, however, eliminate the *possibility* of a collision. The
more structured the data (e.g. source code, uncompressed image data,
machine code, etc.), the harder it is to collide, but when people add
highly-compressed data (which, by definition, is extremely entropic) to a
repository, the risk increases dramatically. As the similarity & structure
of the blocks are removed, the chance of finding another block with the
same hash increases. The only thing in your favour is the number of possible
hashes, which, while far less than the number of possible colliding blocks,
is still itself a very large number. With MD5, which produces 128 bits of
hash data, there are 2 ^ 128 possible different hashes. So, even with no
structure on a given block of data at all (data that is characteristically
similar to the output of a random number generator), if you compare it with
another block that is also highly entropic, you have only a 1 in 2 ^ 128
chance (1 in 3.40282367e+38) of it having a matching hash code.
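
Purely for illustration, the exact figure (computed in Python):

    odds = 2 ** 128
    print(odds)                     # 340282366920938463463374607431768211456
    print(format(odds, ".8e"))      # 3.40282367e+38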

I vaguely recall reading that rsync has in fact had one or two such
collisions in its history (resulting in a corrupt copy of the file being
synchronized), but they are extremely rare and don't stop most people from
using it. Still, back when I suggested an rsync-like algorithm for
Subversion (for a completely different reason), one of the things I was
told is that Subversion tries to take nothing for granted when it comes to
data integrity, and that for that reason, my algorithm would be an unlikely
addition even if I did finish it.

If we replace the text-base with a bunch of block hashes, we will be
opening the door (albeit only by the tiniest crack) for working copies to
get undetectably (in the automated sense) corrupted. The only way to be
*absolutely* certain, assuming you trust TCP to move the data reliably
(which we usually do), is to move one of the two versions to be compared to
the other system so that they're both in the same place and can be directly
compared, byte for byte.
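
To make the distinction concrete, here's a hypothetical sketch (plain
Python, not Subversion code) contrasting the airtight byte-for-byte check
with the probabilistic hash check:

    import hashlib

    def definitely_identical(path_a, path_b):
        # The only airtight check: compare the actual bytes.
        with open(path_a, "rb") as a, open(path_b, "rb") as b:
            while True:
                chunk_a = a.read(32 * 1024)
                chunk_b = b.read(32 * 1024)
                if chunk_a != chunk_b:
                    return False
                if not chunk_a:             # both files ended at the same point
                    return True

    def probably_identical(path_a, path_b):
        # The cheap-over-the-wire check: compare 128-bit digests only.
        def digest(path):
            h = hashlib.md5()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(32 * 1024), b""):
                    h.update(chunk)
            return h.digest()
        return digest(path_a) == digest(path_b)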

I should also point out that if you use a large block size like 32 KB for
the text base, source code files will almost never find matching blocks,
which will basically destroy the commit efficiency in that area. This is
probably only an issue for dialup users, where transferring 32 KB instead
of 300 bytes translates to real pain. :-)

Jonathan Gilbert

