Re: dangerous implementation of rep-sharing cache for fsfs

From: Julian Foad <julian.foad_at_wandisco.com>
Date: Thu, 24 Jun 2010 15:21:27 +0100

michael.felke_at_evonik.com wrote:
> [The new representation caching in 1.6] could save us a lot of disk space
> on the server [...].
>
> But unfortunately it seems we could not use it :-(
> Because, as far as I can tell from the source code of rep-cache.c and
> fs_fs.c in libsvn_fs_fs, the mechanism to find an already existing
> representation relies only on the SHA-1 checksum.
> Because of the possibility of hash collisions, that is not enough to
> ensure that the old representation found is really a duplicate of the
> new one.
> An undetected hash collision would result in a file with totally wrong
> contents.
>
> SHA-1 was developed to detect modifications to a file and to ensure
> that it is practically impossible to produce the same SHA-1 checksum
> merely by modifying a file.
> So it is good for checking whether a file has been modified.
> But it is not designed to check whether two different files could
> possibly be the same.
> There are always infinitely many independent files that generate the
> same checksum.

You are right that there is the theoretical possibility of a different
file having the same SHA-1 and therefore being incorrectly stored.

When using real-life data, the statistical chance is so incredibly tiny
that most people who have tried to estimate the chance do not expect
that it will ever happen.
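
To put a rough number on "incredibly tiny", here is a minimal sketch of
the usual birthday-style upper bound. It assumes SHA-1 behaves like a
uniform random 160-bit function and that nobody is deliberately
constructing collisions; the function name and the figure of one billion
representations are illustrative assumptions, not measurements from any
repository.

```python
from fractions import Fraction


def collision_probability_bound(num_items: int, hash_bits: int = 160) -> Fraction:
    """Upper bound on the chance that any two of num_items distinct
    inputs share a hash value, assuming a uniformly random hash.

    Uses the union bound: (number of pairs) / 2**hash_bits.
    """
    pairs = num_items * (num_items - 1) // 2
    return min(Fraction(1), Fraction(pairs, 2 ** hash_bits))


if __name__ == "__main__":
    # Hypothetical repository holding one billion distinct representations.
    bound = collision_probability_bound(10 ** 9)
    print(f"collision probability is at most about {float(bound):.2e}")
```

The bound grows with the square of the number of stored representations,
so it says nothing about inputs chosen specifically to collide.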

I am not sure whether the "representation" whose SHA-1 sum is stored is
ever an exact copy of the user's file. If it is - if it does not
include an extra header and is not stored in a delta format - then the
chance of collision would depend directly on the content of the user's
files. If that is the case, it *might* be advisable to disable the
rep-cache feature if you are storing files that have a higher than usual
chance of SHA-1 collisions - data files for SHA-1 research, for example.

We should find out the answer to that question before going further.

> Indeed, the number of hash collisions is only finite for a given file
> size, but is still increasing dramatically with the file size.
> So additional checking of the file size helps but is not a completely
> satisfying solution.
>
> The number of undetected hash collisions could be reduced easily by also
> checking the md5-checksum, the size and the expanded-size.

True. This approach could be beneficial if there are cases where the
perfect solution (below) is not feasible.
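
As a rough illustration of that suggestion, the sketch below treats two
byte strings as probable duplicates only if SHA-1, MD5 and size all
agree. The names RepKey, rep_key and is_probable_duplicate are
hypothetical and are not part of libsvn_fs_fs; the expanded-size field
mentioned above is left out because this toy example has no delta
storage.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class RepKey:
    """Hypothetical composite key: all three fields must match."""
    sha1: str
    md5: str
    size: int


def rep_key(data: bytes) -> RepKey:
    """Compute the composite key for one in-memory representation."""
    return RepKey(
        sha1=hashlib.sha1(data).hexdigest(),
        md5=hashlib.md5(data).hexdigest(),
        size=len(data),
    )


def is_probable_duplicate(a: bytes, b: bytes) -> bool:
    """True if SHA-1, MD5 and size all agree.

    This only reduces the chance of a false match; it does not prove
    that the contents are identical.
    """
    return rep_key(a) == rep_key(b)
```

A false match would now require a simultaneous collision in both hash
functions at the same length, which is less likely but still not
impossible.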

> To make this feature totally reliable, a complete comparison of the file's
> content with the content of the old representation found is necessary.

Yes, it would be good if Subversion could do this extra check. Would
you be interested in helping to improve Subversion by writing code to do
this? If so, you will be very welcome and we will try to help you.
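
For anyone thinking about what such a check might look like, here is a
minimal sketch of the idea: look up by SHA-1, then compare the full
content before sharing. It is a toy in-memory model; the class
VerifyingRepCache and its store method are hypothetical and do not
correspond to any function in Subversion, where the comparison would
presumably have to stream the old representation from the repository.

```python
import hashlib
from typing import Dict, Tuple


class VerifyingRepCache:
    """Toy rep cache: keyed by SHA-1, verified by full byte comparison."""

    def __init__(self) -> None:
        self._by_sha1: Dict[str, bytes] = {}

    def store(self, data: bytes) -> Tuple[str, bool]:
        """Store data and return (sha1_hex, shared).

        shared is True when an identical representation already existed.
        Raises RuntimeError if the digest matches but the bytes differ.
        """
        digest = hashlib.sha1(data).hexdigest()
        existing = self._by_sha1.get(digest)
        if existing is None:
            # First time this digest is seen: store a new representation.
            self._by_sha1[digest] = data
            return digest, False
        if existing == data:
            # Digest and full content both match: safe to share.
            return digest, True
        # Same digest, different bytes: never silently share.
        raise RuntimeError("SHA-1 collision detected for digest " + digest)


if __name__ == "__main__":
    cache = VerifyingRepCache()
    print(cache.store(b"some file content"))  # new representation
    print(cache.store(b"some file content"))  # shared with the first
```

The cost of this design is one extra read of the old representation on
every cache hit, which is the trade-off against the cheaper checksum-only
lookup.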

(I recall reading about an option in Git (?) to switch on full-text
comparisons to check for SHA-1 collisions. I can't find a reference to
it now.)

Regards,
- Julian
Received on 2010-06-24 16:22:09 CEST
