
Re: dangerous implementation of rep-sharing cache for fsfs

From: Daniel Shahaf <d.s_at_daniel.shahaf.name>
Date: Thu, 24 Jun 2010 17:38:29 +0300 (Jerusalem Daylight Time)

Julian Foad wrote on Thu, 24 Jun 2010 at 17:21 -0000:
> I am not sure whether the "representation" whose SHA-1 sum is stored is
> ever an exact copy of the user's file. If it is - if it does not
> include an extra header and is not stored in a delta format - then the

That is not the case:

    A representation begins with a line containing either "PLAIN\n" or
    "DELTA\n" or "DELTA <rev> <offset> <length>\n", where <rev>, <offset>,
    and <length> give the location of the delta base of the representation
    and the amount of data it contains (not counting the header or
    trailer). If no base location is given for a delta, the base is the
    empty stream. After the initial line comes raw svndiff data, followed
    by a cosmetic trailer "ENDREP\n".

So there is a header and a trailer, and the representation may be deltified
or self-deltified.
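
As a rough illustration of the header line described in that excerpt, here
is a minimal Python sketch that parses the three forms. It is not the actual
FSFS code; the names RepHeader and parse_rep_header are made up for this
example.

```python
import re
from typing import NamedTuple, Optional


class RepHeader(NamedTuple):
    """Parsed first line of a representation (hypothetical helper type)."""
    kind: str                      # "PLAIN" or "DELTA"
    base_rev: Optional[int]        # None when no delta base is given
    base_offset: Optional[int]
    base_length: Optional[int]


# "DELTA <rev> <offset> <length>\n" with decimal ASCII fields.
_DELTA_WITH_BASE = re.compile(rb"DELTA (\d+) (\d+) (\d+)\n")


def parse_rep_header(line: bytes) -> RepHeader:
    """Parse the header line of a representation, including its newline."""
    if line == b"PLAIN\n":
        return RepHeader("PLAIN", None, None, None)
    if line == b"DELTA\n":
        # No base location: the delta base is the empty stream.
        return RepHeader("DELTA", None, None, None)
    match = _DELTA_WITH_BASE.fullmatch(line)
    if match is None:
        raise ValueError("unrecognized representation header: %r" % (line,))
    rev, offset, length = (int(group) for group in match.groups())
    return RepHeader("DELTA", rev, offset, length)
```

For example, parse_rep_header(b"DELTA 5 100 42\n") yields a DELTA header
whose base is in revision 5 at offset 100 with length 42.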

> chance of collision would depend directly on the content of the user's
> files. If that is the case, it *might* be advisable to disable the
> rep-cache feature if you are storing files that have a higher than usual
> chance of SHA-1 collisions - data files for SHA-1 research, for example.
> We should find out the answer to that question before going further.
> > Indeed, the number of hash collisions is only finite for a given file
> > size, but is still increasing dramatically with the file size.
> > So additional checking of the file size helps but is not a completely
> > satisfying solution.
> >
> > The number of undetected hash collisions could be reduced easily by also
> > checking the md5-checksum, the size and the expanded-size.

Check svn_fs_fs__set_rep_reference in rep-cache.c; we already assert
that the size and expanded size match.

It's indeed possible to also use md5 there. Another option is to use
practically any statistic about the fulltext: the first N bytes, the
number of '#' characters, ...
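
To make the idea concrete, here is a small Python sketch of comparing
several cheap statistics in addition to the SHA-1 before treating two
fulltexts as the same. It is only an illustration under my own assumptions,
not what rep-cache.c does; Fingerprint, fingerprint and may_share are
hypothetical names.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Fingerprint:
    """Cheap statistics about a fulltext (hypothetical structure)."""
    sha1: str
    md5: str
    expanded_size: int
    prefix: bytes        # the first N bytes of the fulltext
    hash_count: int      # the number of '#' characters


def fingerprint(fulltext: bytes, prefix_len: int = 16) -> Fingerprint:
    """Compute all statistics for one fulltext."""
    return Fingerprint(
        sha1=hashlib.sha1(fulltext).hexdigest(),
        md5=hashlib.md5(fulltext).hexdigest(),
        expanded_size=len(fulltext),
        prefix=fulltext[:prefix_len],
        hash_count=fulltext.count(b"#"),
    )


def may_share(cached: Fingerprint, new: Fingerprint) -> bool:
    """Allow sharing only if every statistic agrees, not just the SHA-1."""
    return cached == new
```

Each extra statistic makes an undetected collision less likely, but none of
them rules it out; only a full comparison does that.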

> True. This approach could be beneficial if there are cases where the
> perfect solution (below) is not feasible.
> > To make this feature totally reliable, a complete comparison of the file's
> > content with the content of the old representation found is necessary.
> Yes, it would be good if Subversion could do this extra check. Would
> you be interested in helping to improve Subversion by writing code to do
> this? If so, you will be very welcome and we will try to help you.

+1 from me too.
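
For anyone picking this up, the shape of such a check might look like the
following Python sketch: on a cache hit, compare the full contents before
sharing, and store the new content separately if they differ. This is a toy
in-memory model, not a proposal for the actual FSFS code; RepCache and its
store method are invented for the example, and the digest function is a
parameter only so that a collision can be simulated.

```python
import hashlib
from typing import Callable, Dict, List, Tuple


def _sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


class RepCache:
    """Toy rep-sharing cache that verifies fulltexts on every hit."""

    def __init__(self, digest: Callable[[bytes], str] = _sha1_hex) -> None:
        self._digest = digest
        # digest -> fulltexts stored under it (more than one only on collision)
        self._reps: Dict[str, List[bytes]] = {}

    def store(self, fulltext: bytes) -> Tuple[str, bool]:
        """Store a fulltext; return (digest, True if an existing rep was shared)."""
        key = self._digest(fulltext)
        bucket = self._reps.setdefault(key, [])
        for existing in bucket:
            if existing == fulltext:     # full byte-for-byte comparison
                return key, True
        # Same digest but different content (or first occurrence): do not share.
        bucket.append(fulltext)
        return key, False
```

With a deliberately weak digest such as RepCache(digest=lambda data: "x"),
two different fulltexts land under the same key, yet neither is shared with
the other. The price is reading the old representation on every hit.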

> (I recall reading about an option in Git (?) to switch on full-text
> comparisons to check for SHA-1 collisions. I can't find a reference to
> it now.)
> Regards,
> - Julian
Received on 2010-06-24 16:38:20 CEST

This is an archived mail posted to the Subversion Dev mailing list.