It should be noted that as of 1.6 we've been using SHA1 hashes to index content on the server (which obviously includes all nodes on the client), and we've yet to hear any reports of checksum collisions. I do not doubt that a collision is theoretically possible, but in practice, the threat of SHA1 collisions is very low.
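For a rough sense of scale, here is a back-of-the-envelope sketch using the standard birthday-bound approximation, p ≈ 1 - exp(-n(n-1) / 2^(b+1)). It assumes the checksum behaves like a uniformly random b-bit value, which only holds for texts that were not deliberately constructed to collide. The function name and the sample sizes are made up for illustration; nothing here is taken from Subversion itself.

```python
import math


def collision_probability(num_texts: int, hash_bits: int = 160) -> float:
    """Approximate chance that any two of num_texts distinct texts share a hash.

    Birthday-bound approximation for an idealized hash_bits-bit checksum
    (160 for SHA-1, 128 for MD5). This is an assumption about accidental
    collisions only; it says nothing about crafted inputs.
    """
    exponent = -num_texts * (num_texts - 1) / 2.0 ** (hash_bits + 1)
    # expm1 keeps precision when the exponent is extremely close to zero.
    return -math.expm1(exponent)


if __name__ == "__main__":
    for bits, name in ((160, "SHA-1"), (128, "MD5")):
        p = collision_probability(10**9, bits)
        print(f"{name}: {p:.3e} for one billion distinct texts")
```

Under those assumptions, a billion distinct texts give a SHA-1 collision chance on the order of 1e-31, which is the sense in which the threat is very low for anything other than specially constructed files.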
On Feb 23, 2010, at 11:01 AM, Ed Price wrote:
> For those of us who use Subversion to manage our home directories,
> or other random collections of files, it's entirely possible and reasonable
> that we might download some files with identical checksums, just out
> of curiosity or whatever. I do think it would be a shame if Subversion
> caused or suffered problems in this situation. For example here is a
> page with a number of interesting files with duplicate MD5 sums:
> (Hey, I wonder if that website is managed with Subversion?? That
> use case is not unheard of either...)
> You might also be interested to read about ZFS deduplication here:
> On Mon, Feb 22, 2010 at 6:03 AM, Julian Foad <julian.foad_at_wandisco.com> wrote:
>> Greg Stein wrote:
>>> On Fri, Feb 19, 2010 at 08:13, <julianfoad_at_apache.org> wrote:
>>>> +++ subversion/trunk/subversion/libsvn_wc/wc-metadata.sql Fri Feb 19 13:13:09 2010
>>>> @@ -172,7 +172,9 @@
>>>> and ACTUAL_NODE tables.
>>>> CREATE TABLE PRISTINE (
>>>> - /* ### the hash algorithm (MD5 or SHA-1) is encoded in this value */
>>>> + /* The SHA-1 checksum of the pristine text. This is a unique key. The
>>>> + SHA-1 checksum of a pristine text is assumed to be unique among all
>>>> + pristine texts referenced from this database. */
>>>> checksum TEXT NOT NULL PRIMARY KEY,
>>> That comment is now redundant with the PRIMARY KEY attached to that column.
>> Not quite. Perhaps someone can write this in better words for me. What I
>> wanted to say was:
>> "Look, this is an assumption on which the model depends. Don't
>> 'discover' it for yourself and flame us about it. We know that there is
>> a theoretical possibility of a clash, but it is so much less likely than
>> many other kinds of problem that we can treat it as a unique key for
>> practical purposes. If texts have been specially constructed so as to
>> have the same SHA-1 checksum, as might be done in cryptography research,
>> that would defeat this assumption, but everyone else stands far more
>> chance of being hit by a meteorite."
>> Such an explanatory note would probably be better in some higher-level
>> place, such as in the PRISTINE table's main doc string or in a different
>> document, rather than on that particular column where I put it. How
>> about I move it to the table's main doc string and change the wording to:
>> (Note: The PRISTINE table is indexed by the SHA-1 checksum of the
>> pristine text. A cryptography researcher might have different texts that
>> are specially constructed so as to have the same SHA-1 checksum, but for
>> anyone else the chance of ever having a clash is vanishingly small.)
>>>> /* ### enumerated values specifying type of compression. NULL implies
>>>> @@ -189,7 +191,8 @@
>>>> refcount INTEGER NOT NULL,
>>>> /* Alternative MD5 checksum used for communicating with older
>>>> - repositories. */
>>>> + repositories. Not guaranteed to be unique among table rows.
>>> pfft. riiiiiight.
>> Likewise. What I wanted to say was something like:
>> "The MD5 checksum, like the SHA-1 checksum, is considered distinctive
>> enough for all practical purposes (except cryptography research).
>> However, as some clashes have been reported in the world, it would be
>> best if the code did not assume this is a unique key."
>> Hmmm... parentheses and "strictly" will help. How about I tone it down
>> to the following:
>> /* Alternative MD5 checksum used for communicating with older
>> repositories. (This is not strictly guaranteed to be a unique
>> key, although in practice it nearly always will be.)
>> NULL if not (yet) calculated. */
>> md5_checksum TEXT
>> - Julian
>>>> + NULL if not (yet) calculated. */
>>>> md5_checksum TEXT
Received on 2010-02-23 18:07:04 CET