On Wed, 22 Oct 2008, Greg Stein wrote:
> We *still* have all the problems that md5 is fully-intertwined in our
> code. I'm still not willing to do double-checksums and kill millions
> of coders for a few researchers who could simply tar their candidate
> pairs together, or gzip them. Yes, that's the brutal truth :-P ... the
> researchers need to use workarounds, and the millions get a fast
> product.
Would it be possible to detect collisions and use a different index key
instead? By "index" I mean "whatever you use to map from short keys
(e.g. MD5 hashes) to actual stored content". Perhaps something like
this:

    calculate hash of content;
    if (hash does not exist as a key in the index) {
        store content indexed by the hash;
    } else if (index key refers to content that really is identical) {
        re-use that index key;
    } else {
        do something clever to deal with the hash collision;
    }

"Do something clever" could involve choosing a different index key based
on both the content hash and a collision serial number, incrementing the
serial number until previously-stored identical content is found, or
until the key is not found in the index.
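
A minimal Python sketch of the serial-number scheme described above, for illustration only. The class name `CollisionSafeStore`, the `(digest, serial)` key shape, the in-memory dict standing in for the index, and the injectable `hash_func` parameter are all assumptions made for this example; none of them come from the Subversion code base.

```python
import hashlib


class CollisionSafeStore:
    """Toy content store keyed by (digest, serial). Hypothetical, not Subversion's API."""

    def __init__(self, hash_func=None):
        # hash_func is injectable (an assumption of this sketch) so that a
        # deliberately weak hash can be used to provoke collisions.
        self._hash = hash_func or (lambda data: hashlib.md5(data).hexdigest())
        self._index = {}  # maps (digest, serial) -> stored content

    def put(self, content: bytes):
        """Store content and return the index key under which it lives."""
        digest = self._hash(content)
        serial = 0
        while True:
            key = (digest, serial)
            stored = self._index.get(key)
            if stored is None:
                # Key not in the index: store the content under it.
                self._index[key] = content
                return key
            if stored == content:
                # Content really is identical: re-use the existing key.
                return key
            # Same hash, different content: try the next serial number.
            serial += 1

    def get(self, key):
        return self._index[key]


# Usage with a deliberately weak hash (content length) to force a collision.
store = CollisionSafeStore(hash_func=lambda data: str(len(data)))
first = store.put(b"abc")
second = store.put(b"xyz")   # collides with b"abc", gets the next serial
again = store.put(b"abc")    # identical content, re-uses the first key
```

One consequence of this approach is that the index key can no longer be derived from the hash alone, so callers have to keep the key that `put` returns. Every hit on an existing hash also costs a full content comparison.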
--apb (Alan Barrett)
Received on 2008-10-22 20:05:34 CEST