Julian Foad wrote on Thu, 24 Jun 2010 at 17:21 -0000:
> I am not sure whether the "representation" whose SHA-1 sum is stored is
> ever an exact copy of the user's file. If it is - if it does not
> include an extra header and is not stored in a delta format - then the
That is not the case:
[[[
A representation begins with a line containing either "PLAIN\n" or
"DELTA\n" or "DELTA <rev> <offset> <length>\n", where <rev>, <offset>,
and <length> give the location of the delta base of the representation
and the amount of data it contains (not counting the header or
trailer). If no base location is given for a delta, the base is the
empty stream. After the initial line comes raw svndiff data, followed
by a cosmetic trailer "ENDREP\n".
]]]
So there is a header and a trailer, and the data may be deltified or
self-deltified.
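To make the three header forms quoted above concrete, here is a rough
sketch of a parser for that first line. It is illustrative Python, not
the FSFS C code; RepHeader and parse_rep_header are hypothetical names,
and the sketch assumes the fields are separated by single spaces exactly
as in the quoted text.

```python
from typing import NamedTuple, Optional


class RepHeader(NamedTuple):
    """Hypothetical container for the first line of a representation."""
    kind: str                    # "PLAIN" or "DELTA"
    base_rev: Optional[int]      # None when no delta base is given
    base_offset: Optional[int]
    base_length: Optional[int]


def parse_rep_header(line: bytes) -> RepHeader:
    """Parse "PLAIN\\n", "DELTA\\n" or "DELTA <rev> <offset> <length>\\n"."""
    if not line.endswith(b"\n"):
        raise ValueError("header line must end with a newline")
    fields = line[:-1].decode("ascii").split(" ")
    if fields == ["PLAIN"]:
        return RepHeader("PLAIN", None, None, None)
    if fields == ["DELTA"]:
        # No base location: the delta is against the empty stream.
        return RepHeader("DELTA", None, None, None)
    if len(fields) == 4 and fields[0] == "DELTA":
        rev, offset, length = (int(f) for f in fields[1:])
        return RepHeader("DELTA", rev, offset, length)
    raise ValueError("unrecognised representation header: %r" % (line,))
```

Whatever follows this line, up to the "ENDREP\n" trailer, is the stored
data, not necessarily the user's bytes.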
> chance of collision would depend directly on the content of the user's
> files. If that is the case, it *might* be advisable to disable the
> rep-cache feature if you are storing files that have a higher than usual
> chance of SHA-1 collisions - data files for SHA-1 research, for example.
>
> We should find out the answer to that question before going further.
>
>
> > Indeed, the number of hash collisions is only finite for a given file
> > size, but is still increasing dramatically with the file size.
> > So additional checking of the file size helps but is not a completely
> > satisfying solution.
> >
> > The number of undetected hash collisions could be reduced easily by also
> > checking the md5-checksum, the size and the expanded-size.
>
Check svn_fs_fs__set_rep_reference in rep-cache.c; we already assert
that the size and expanded size match.
It's indeed possible to also use md5 there. Another option is to use
practically any statistic about the fulltext: the first N bytes, the
number of '#' characters, ...
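As a rough illustration of the idea (not what rep-cache.c does), the
extra statistics could be gathered into one record and compared as a
whole before a cached representation is reused. Fingerprint,
fingerprint() and may_share_rep() are hypothetical names, and the prefix
length of 16 bytes is an arbitrary choice for the sketch.

```python
import hashlib
from typing import NamedTuple


class Fingerprint(NamedTuple):
    """Hypothetical bundle of cheap statistics about a fulltext."""
    sha1: str
    md5: str
    expanded_size: int
    prefix: bytes        # the first N bytes of the fulltext
    hash_count: int      # number of '#' characters


def fingerprint(fulltext: bytes, prefix_len: int = 16) -> Fingerprint:
    return Fingerprint(
        sha1=hashlib.sha1(fulltext).hexdigest(),
        md5=hashlib.md5(fulltext).hexdigest(),
        expanded_size=len(fulltext),
        prefix=fulltext[:prefix_len],
        hash_count=fulltext.count(b"#"),
    )


def may_share_rep(new: Fingerprint, cached: Fingerprint) -> bool:
    """Reuse the cached rep only if every statistic agrees, not just SHA-1."""
    return new == cached
```

Each extra field only lowers the chance of an undetected collision; none
of them replaces a byte-for-byte comparison.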
> True. This approach could be beneficial if there are cases where the
> perfect solution (below) is not feasible.
>
> > To make this feature totally reliable, a complete comparison of the files
> > content with the content of the old representation found, is necessary
>
> Yes, it would be good if Subversion could do this extra check. Would
> you be interested in helping to improve Subversion by writing code to do
> this? If so, you will be very welcome and we will try to help you.
>
+1 from me too.
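For anyone picking this up: the check itself is just a streaming
comparison of the new fulltext against the expanded old representation.
A minimal sketch in Python, assuming both sides are available as binary
streams (read_full and same_content are hypothetical names, and the
chunk size is arbitrary):

```python
import io
from typing import BinaryIO


def read_full(stream: BinaryIO, n: int) -> bytes:
    """Read exactly n bytes unless EOF comes first (handles short reads)."""
    parts = []
    remaining = n
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            break
        parts.append(chunk)
        remaining -= len(chunk)
    return b"".join(parts)


def same_content(a: BinaryIO, b: BinaryIO, chunk_size: int = 65536) -> bool:
    """Compare two streams chunk by chunk without loading them whole."""
    while True:
        chunk_a = read_full(a, chunk_size)
        chunk_b = read_full(b, chunk_size)
        if chunk_a != chunk_b:
            return False
        if not chunk_a:
            # Both streams hit EOF at the same point with no mismatch.
            return True


# Example: identical content matches, a one-byte difference does not.
assert same_content(io.BytesIO(b"hello"), io.BytesIO(b"hello"))
assert not same_content(io.BytesIO(b"hello"), io.BytesIO(b"hellp"))
```

The cost is reading the old fulltext back on every cache hit, which is
the trade-off to weigh against the cheaper statistical checks.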
> (I recall reading about an option in Git (?) to switch on full-text
> comparisons to check for SHA-1 collisions. I can't find a reference to
> it now.)
>
>
> Regards,
> - Julian
>
>
>
Received on 2010-06-24 16:38:20 CEST