
Re: Files with identical SHA1 breaks the repo

From: Paul Hammant <paul_at_hammant.org>
Date: Sun, 26 Feb 2017 12:26:47 -0500

Why don't y'all take the same tactic as Git does - SHA1 the contents of the
file *and a prepended type/length field*?
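
For illustration, here is a minimal Python sketch of that Git-style tactic.
It assumes Git's blob header layout, the ASCII string "blob", a space, the
decimal byte length and a NUL byte, prepended to the content before hashing.
The function names are made up for this example and are not Subversion or
Git APIs.

```python
import hashlib


def plain_sha1(data: bytes) -> str:
    """SHA-1 over the raw file contents only."""
    return hashlib.sha1(data).hexdigest()


def git_style_blob_sha1(data: bytes) -> str:
    """SHA-1 over a 'blob <length>\\0' header followed by the contents."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


if __name__ == "__main__":
    sample = b"hello world\n"
    print("plain     :", plain_sha1(sample))
    print("git-style :", git_style_blob_sha1(sample))
```

Because of the header, the Git-style digest of a file differs from the plain
SHA1 of the same bytes.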

That and a tool to back convert SHA1s for existing repos.

Linus weighed in again:
https://plus.google.com/+LinusTorvalds/posts/7tp2gYWQugL
Svn is more likely than Git to be used as a store for larger binaries that
need to be non-repudiable, even if that is still an edge case.

-ph

- Paul

On Sun, Feb 26, 2017 at 11:08 AM, Garance A Drosehn <drosih_at_rpi.edu> wrote:

> On 24 Feb 2017, at 15:46, Stefan Sperling wrote:
> >
> > I believe we should prepare a new working format for 1.10.0
> > which addresses this problem. I don't see a good way of fixing
> > it without a format bump. The bright side of this is that it
> > gives us a good reason to get 1.10.0 ready ASAP.
> >
> > We can switch to a better hash algorithm with a WC format
> > bump.
>
> One of the previous messages mentioned that better hash
> algorithms are more expensive. So let me mention a tactic
> that I used many years ago, when MD5 was the best digest
> algorithm that I knew of, and I didn't trust it for the
> larger files I was working with at the time:
>
> Instead of going with a completely different hash algorithm,
> just double-down on the one you're using. What I did was to
> calculate one digest the standard way, and then a second one
> which summed up every-other-byte (or every 3rd byte, or ...).
> So to get a collision, not only do two files have to get the
> same digest-result for all their data, but they have to also
> get the same digest-result when exactly half the data is
> skipped over.
>
> (I did this a long time ago, and forget the details. What
> I may have done for performance reasons was every-other-word,
> not every-other-byte)
>
> My thinking was that *any* single algorithm which processes
> all the data is going to get collisions, eventually. But it
> will be much harder for someone to generate a duplicate file
> where there will also be a collision when summing up only
> half of the data.
>
> I'm not claiming this is a great cure-all solution, but just
> that it's an alternate tactic which might be interesting.
> People could create repositories with just the one digest,
> or upgrade it to use multiple digests if they have the need.
>
> I found a few benchmarks which suggest that sha-256 is maybe
> twice as expensive as sha-1, so calculating two sha-1 digests
> might be a reasonable alternative.
>
> --
> Garance Alistair Drosehn = drosih_at_rpi.edu
> Senior Systems Programmer or gad_at_FreeBSD.org
> Rensselaer Polytechnic Institute; Troy, NY; USA
>
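
A rough Python sketch of the double-digest tactic quoted above, for anyone
who wants to experiment with it: one SHA-1 over all the data, and a second
SHA-1 over every other byte. The function name and the `stride` parameter
are hypothetical, and this is not a reconstruction of the original
implementation, whose details the quoted message says are forgotten. It
makes no claim about how much collision resistance the second digest adds.

```python
import hashlib


def double_digest(data: bytes, stride: int = 2):
    """Return (full_digest, sampled_digest) for the given bytes.

    full_digest    -- SHA-1 over all of the data
    sampled_digest -- SHA-1 over every `stride`-th byte, starting at byte 0
    """
    if stride < 2:
        raise ValueError("stride must be at least 2")
    full = hashlib.sha1(data).hexdigest()
    sampled = hashlib.sha1(data[::stride]).hexdigest()
    return full, sampled


def same_content(a: bytes, b: bytes) -> bool:
    """Treat two inputs as identical only if both digests match."""
    return double_digest(a) == double_digest(b)


if __name__ == "__main__":
    full, sampled = double_digest(b"abcdef")
    print("full    :", full)
    print("sampled :", sampled)  # digest of b"ace"
```

The second pass reads only half of the data, but it is still a second pass,
so the cost comparison with a single sha-256 would need to be measured.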
Received on 2017-02-26 18:26:58 CET

This is an archived mail posted to the Subversion Dev mailing list.