[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: Files with identical SHA1 breaks the repo

From: Stefan Sperling <stsp_at_elego.de>
Date: Fri, 24 Feb 2017 21:46:54 +0100

On Fri, Feb 24, 2017 at 01:03:09PM -0500, Mark Phippard wrote:
> Note that while this does fix the error, but because of the sha1 storage
> sharing in the working copy you actually do not get the correct files.
> Both PDF's wind up being the same file, I imagine whichever one you receive
> first is the one you get.
>
> So not only does rep sharing need to be fixed, the WC pristine storage is
> also broken by this.

Yes, indeed.

I believe we should prepare a new working format for 1.10.0 which
addresses this problem. I don't see a good way of fixing it without
a format bump. The bright side of this is that it gives us a good
reason to get 1.10.0 ready ASAP.

We can switch to a better hash algorithm with a WC format bump.
If we are willing to dispose of de-duplication in the pristine store we could
make the pristine store future proof by adding a "salt" to each row in the
pristine table. Say 64 bytes of data prepended to file content, which are
random but stay fixed throughout the lifetime of a pristine.
This way, there are 64 bytes of data not controlled by repository content
which affect the hash algorithm's result before data from repository content
gets mixed in. Now hash collisions in repository content become much less
of a problem for the working copy. However, the pristine store would stop
de-duplicating content. So perhaps this is not the best approach.

The rep-cache uses hashes only for de-duplication so it very much relies on
hash collisions being negligible. We should upgrade the hashing algorithm
in a way that 'svnadmin upgrade' can take care of (for new revisions).
Perhaps we should disable the feature by default in a 1.9.x patch release
and advise users to turn it off until they can upgrade to 1.10.

We might have to give up on ra_serf's approach of avoiding retransmissions
of content which is already stored in the pristine store. This is now just
as broken as the rep-cache is. We might be able to salvage it for future
clients, but we should probably send multiple hashes and make it as easy as
possible to add newer hash algorithms in future versions without disturbing
older clients. Perhaps as a first step we should just disable this feature?
Received on 2017-02-24 21:47:06 CET

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.