[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

RE: Files with identical SHA1 breaks the repo

From: <bert_at_qqmail.nl>
Date: Sat, 25 Feb 2017 08:56:33 +0100

The whole idea of the pristine cache is handling duplicate files… I don’t see what you are trying to solve by adding a salt.

The pristine store is not a password store, where expensive hashing is a good feature… not a user facing slowdown and we never designed Subversion as a storage area for collision files.

That there is a collision now doesn’t change that we always assumed there would be collisions, and designed the current behavior with that in mind.

Bert

Sent from Mail for Windows 10

From: Stefan Sperling
Sent: vrijdag 24 februari 2017 21:47
To: Mark Phippard
Cc: Øyvind A. Holm; Subversion Development
Subject: Re: Files with identical SHA1 breaks the repo

On Fri, Feb 24, 2017 at 01:03:09PM -0500, Mark Phippard wrote:
> Note that while this does fix the error, but because of the sha1 storage
> sharing in the working copy you actually do not get the correct files.
> Both PDF's wind up being the same file, I imagine whichever one you receive
> first is the one you get.
>
> So not only does rep sharing need to be fixed, the WC pristine storage is
> also broken by this.

Yes, indeed.

I believe we should prepare a new working format for 1.10.0 which
addresses this problem. I don't see a good way of fixing it without
a format bump. The bright side of this is that it gives us a good
reason to get 1.10.0 ready ASAP.

We can switch to a better hash algorithm with a WC format bump.
If we are willing to dispose of de-duplication in the pristine store we could
make the pristine store future proof by adding a "salt" to each row in the
pristine table. Say 64 bytes of data prepended to file content, which are
random but stay fixed throughout the lifetime of a pristine.
This way, there are 64 bytes of data not controlled by repository content
which affect the hash algorithm's result before data from repository content
gets mixed in. Now hash collisions in repository content become much less
of a problem for the working copy. However, the pristine store would stop
de-duplicating content. So perhaps this is not the best approach.

The rep-cache uses hashes only for de-duplication so it very much relies on
hash collisions being negligible. We should upgrade the hashing algorithm
in a way that 'svnadmin upgrade' can take care of (for new revisions).
Perhaps we should disable the feature by default in a 1.9.x patch release
and advise users to turn it off until they can upgrade to 1.10.

We might have to give up on ra_serf's approach of avoiding retransmissions
of content which is already stored in the pristine store. This is now just
as broken as the rep-cache is. We might be able to salvage it for future
clients, but we should probably send multiple hashes and make it as easy as
possible to add newer hash algorithms in future versions without disturbing
older clients. Perhaps as a first step we should just disable this feature?
Received on 2017-02-25 08:56:44 CET

This is an archived mail posted to the Subversion Dev mailing list.