
Re: Re: dangerous implementation of rep-sharing cache for fsfs

From: Martin Furter <mf_at_rola.ch>
Date: Fri, 25 Jun 2010 16:40:55 +0200 (CEST)

On Fri, 25 Jun 2010, Mark Phippard wrote:

> On Fri, Jun 25, 2010 at 8:45 AM, <michael.felke_at_evonik.com> wrote:
>> 4. You underestimate the error made by misusing mathematical methods.
>>
>>   As I already said in my first e-mail, SHA-1 was developed
>>   to detect random and willful data manipulation.
>>   It's a cryptographic hash, so there is a low chance of
>>   guessing or calculating a derived data sequence
>>   which generates the same hash value as the original data.
>>   But this is the only thing it ensures.
>>   There is no evidence that the hash values are
>>   equally distributed over the data sets, which is important for
>>   the use of a hashing method in data fetching.
>>   In fact, as it's a cryptographic hash,
>>   you should not be able to calculate it,
>>   because this would mean that you are able
>>   to calculate sets of data resulting in the same hash value.
>>   So you can't conclude from the low chance of
>>   guessing or calculating a derived data sequence that there is
>>   a low chance of hash collisions in general.
>
> I am in favor of making our software more reliable, I just do not want
> to see us handicap ourselves by programming against a problem that is
> unlikely to ever happen. If this is so risky, then why are so many
> people using git? Isn't it built entirely on this concept of using
> SHA-1 hashes to identify content? I notice that if you Google for
> this you can find plenty of flame wars over this topic with Git, but I
> also notice blog posts like this one:
>
> http://theblogthatnoonereads.davegrijalva.com/2009/09/25/sha-1-collision-probability/

It's not the probability that concerns me, it's what happens when a file
collides. If I understood the current algorithm correctly, the new file
will be silently replaced by an unrelated one, and there will be no error
and no warning at all. If it's some kind of machine-verifiable file like
source code, the next build in a different working copy will notice. But
if it's something else, like documents or images, it can go unnoticed for
a very long time. The work may be lost by then.
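
To illustrate the failure mode, here is a toy sketch. It is not the
actual FSFS rep-sharing code: toy_hash, store and fetch are made-up
names, and the hash is deliberately weak so that a collision is easy to
produce.

```python
def toy_hash(data: bytes) -> int:
    # Deliberately weak stand-in for SHA-1 (an assumption for this demo),
    # so that two different inputs with the same hash are easy to find.
    return sum(data) % 256

cache = {}  # hash -> first representation stored under that hash

def store(data: bytes) -> int:
    key = toy_hash(data)
    # If the key already exists, the old representation is shared and
    # the new data is never written. No error, no warning.
    cache.setdefault(key, data)
    return key

def fetch(key: int) -> bytes:
    return cache[key]

first = store(b"ab")
second = store(b"ba")   # different content, same toy hash
print(fetch(second))    # the content of the first file comes back
```

The second store() succeeds as far as the caller can tell, and only a
later read reveals that the content belongs to a different file.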

That would be a reason to use CRC32 instead of SHA1 since then users get
used to losing files and making sure themselves that the contents of the
repos are what they expect ;o>

> We are already performance-challenged. Doing extra hash calculations
> for a problem that is not going to happen does not seem like a sound
> decision.

No extra hash calculations are needed. What's needed is extra file
comparisons against the already existing files with the same hash. I
guess that's more expensive than calculating a hash, since you have to
read the existing file from disk, which may require applying lots of
deltas etc.
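
A minimal sketch of what I mean, assuming an in-memory store (the class
VerifyingStore, its verify flag and the bucket layout are hypothetical
and only for illustration, not how FSFS is implemented):

```python
import hashlib

def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

class VerifyingStore:
    """Toy rep-sharing store (hypothetical; not Subversion's FSFS code)."""

    def __init__(self, hash_fn=sha1_hex, verify=True):
        self.hash_fn = hash_fn
        self.verify = verify
        self.buckets = {}  # hash -> list of distinct contents with that hash

    def store(self, data: bytes):
        key = self.hash_fn(data)  # the hash is computed once, as before
        bucket = self.buckets.setdefault(key, [])
        if bucket and not self.verify:
            return (key, 0)  # blind trust: share whatever is already there
        for index, existing in enumerate(bucket):
            if existing == data:  # the extra comparison, byte for byte
                return (key, index)
        bucket.append(data)  # first rep for this hash, or a real collision
        return (key, len(bucket) - 1)

    def fetch(self, ref):
        key, index = ref
        return self.buckets[key][index]
```

With verify switched on, a colliding file gets its own slot instead of
being mapped to the existing one. The cost is one read and compare of
the existing representation per hash hit, which in a real repository
means reconstructing it from its deltas first.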

ZFS does a similar thing which they call deduplication:
http://blogs.sun.com/bonwick/entry/zfs_dedup

The 'verify' feature there is optional. With verification switched on, a
faster but weaker hash could be used to regain performance:
http://valhenson.livejournal.com/48227.html

An optional 'verify' feature would be a nice way to silence paranoid
people like me and keep the performance the same for those who blindly
trust hash functions.

Martin
Received on 2010-06-25 16:41:50 CEST

This is an archived mail posted to the Subversion Dev mailing list.