
Re: Files with identical SHA1 breaks the repo

From: Daniel Shahaf <d.s_at_daniel.shahaf.name>
Date: Thu, 2 Mar 2017 11:37:00 +0000

Garance A Drosehn wrote on Wed, Mar 01, 2017 at 14:48:07 -0500:
> On 1 Mar 2017, at 7:18, Daniel Shahaf wrote:
>
> > Stefan Sperling wrote on Wed, Mar 01, 2017 at 11:01:40 +0100:
> >> On Tue, Feb 28, 2017 at 10:17:34PM -0600, Greg Stein wrote:
> >>> I really like this idea.
> >>>
> >>> And we could take a copy of APR's sha1 code, and rejigger
> >>> it to perform *both* hashes during the same scan of the
> >>> raw bytes. I would expect the time taken to extend by
> >>> (say) 1.1X rather than a full 2X. The inner loop might
> >>> cost a bit more, but we'd only scan the bytes once. Very
> >>> handy, when you're talking about megabytes in a stream-y
> >>> environment.
> >>>
> >>> (and medium-term, push this dual-sha1 computation back
> >>> into APR)
> >>
> >> The idea is nice and I would support its application.
> >
> > I would not.
> >
> > Using two variants of sha1 is fine for supporting webkit's
> > use-case, or to protect admins from disgruntled employees
> > who commit the known collision to DoS their employer.
> >
> > However, when the attacker model is a competent/resourceful
> > attacker, using two variants of sha1 only increases the
> > attack's complexity by a factor of about 2⁷ ([1, §4.1]),
> > and might in fact cause collisions to be _easier_ to find
> > (since additional state is output).
> >
> > So let's not add a variant of sha1.
>
> Note that this is not a variant of sha1; calling it a variant
> implies that it is a *different* algorithm. It's the same
> algorithm, run with two different inputs.
>
> The paper you reference:
> https://www.iacr.org/archive/crypto2004/31520306/multicollisions.pdf
>
> is interesting, and some of it might be going over my head.
> However, I'll note that it seems to be about:
>
> "... unwanted dependencies between two slightly
> different instances of the same hash function"
>
> That is not what I am suggesting. I am suggesting to use
> the exact same hash function, but giving the function two
> vastly different collections of data.

I understood what you were suggesting, and it's precisely what the paper
talks about. Consider the function that takes as input a file (more
accurately: a byte stream) and returns the sha1 of every other word of
that file; that function is a hash function and is a variant of sha1.

In terms of that paper, you can think of F as sha1 and G as "sha1 of
every other word" (or the other way around).
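
Concretely, and ignoring the word-vs-byte question and the streaming
details, a rough Python sketch of the pair (purely illustrative, not
the actual svn code):

    import hashlib

    def F(data):
        # Plain sha1 over the whole byte stream.
        return hashlib.sha1(data).hexdigest()

    def G(data):
        # sha1 over every other byte of the same stream; this is
        # itself a hash function, derived from sha1.
        return hashlib.sha1(data[::2]).hexdigest()

    def extended_digest(data):
        # The proposed 320-bit value is the concatenation F || G.
        return F(data) + G(data)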

> And not only that,
> but there is an ordering dependency between the two sets of
> data. It's mighty hard to add data to the first set (the
> entire file) without causing significant disruption in the
> *actual data* that will be fed to the second set.
>
> I do not see how this extended-sha1 would be easier to break
> than the current sha1, because it includes the current sha1,
> unchanged.

In cryptography, "it appears to be hard" and "I see no way to break it"
don't count: the fact that one does not see an attack does not mean
that no attack exists. (Absence of evidence isn't evidence of absence.)

What does count are security proofs ("No attacker with resources X can
achieve Y in time T with probability exceeding 1/2**N"), and failing
that, constructions that have stood the test of time. Nothing either of
us comes up with off the top of his head is going to meet either of
these criteria.

Regarding "easier to break", you were probably thinking of collision
attacks, which are never made easier by extending the output. I was
thinking of "first preimage" attacks (finding a message that has a given
checksum). Extending the output can make _these_ attacks easier. For
example, if I told you my PIN is four decimal digits, you wouldn't know
what it was, but if I told you not only the number of digits but also
that their sum was 36, you'd know the PIN was 9999.
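
If the arithmetic isn't obvious: 36 is the largest possible digit sum
for four decimal digits, so every digit has to be 9. A throwaway check
in Python:

    # Exhaustive check: exactly one 4-digit PIN has digit sum 36.
    pins = [f"{n:04d}" for n in range(10000)
            if sum(int(d) for d in f"{n:04d}") == 36]
    print(pins)  # ['9999']

The extra output narrows the preimage space instead of enlarging it.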

> And then it adds a second sha1 digest to that,
> by effectively hashing a *different file* (the file created
> by using only every-other-byte of the original file).
>
> Obviously this is also not perfect, but then there will
> never be a hashing algorithm which can perfectly hash all
> 1-megabyte files of arbitrary data into unique 256-byte
> values. But it sure seems like it will make it much harder
> to *calculate* a second file which will sum up to the same
> digest values (two values) as any given data file.

That's called a "second preimage attack". The extended construction is
not secure against such an attack: a hash function with a 320-bit output
is supposed to require O(2³²⁰) evaluations to find a second preimage,
but it's trivial to find a second preimage for F and G simultaneously
with only O(2¹⁶⁰) evaluations. (Do you see that?)

A similar argument holds for collision attacks: a collision attack
against a perfect hash function with a 320-bit output should have
complexity 2¹⁶⁰; however, the paper provides an attack with
complexity 2⁸⁸. There is also a trivial attack with complexity 2⁸⁰,
though that one is specialized to these particular F and G.
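
To give a feel for where such a specialized shortcut comes from (a
sketch, using the F/G definitions above, and only one obvious route
of that kind): G never reads the odd-offset bytes, so any two messages
that differ only in those bytes agree on G automatically, and a
birthday search for an F-collision within that family costs no more
than a plain sha1 collision. A toy demonstration, with F truncated to
24 bits so the search finishes in a few thousand tries rather than 2⁸⁰:

    import hashlib, os

    def trunc_F(data, nbytes=3):
        # Toy stand-in for F: sha1 truncated to 24 bits, so the
        # birthday search needs roughly 2**12 tries, not 2**80.
        return hashlib.sha1(data).digest()[:nbytes]

    def G(data):
        return hashlib.sha1(data[::2]).digest()

    even = bytes(16)              # fixed even-offset bytes: all G sees
    seen = {}
    while True:
        odd = os.urandom(16)      # vary only the bytes G never reads
        msg = bytes(b for pair in zip(even, odd) for b in pair)
        h = trunc_F(msg)
        if h in seen and seen[h] != msg:
            # Equal (truncated) F found by the birthday search;
            # equal G by construction.
            assert G(seen[h]) == G(msg)
            print("collision on both hashes")
            break
        seen[h] = msg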

In both cases, the attacks cost as much as they do against sha1. The
only real difference is that there haven't been 10 years of research
about the security of this particular extended construction; that's
a disadvantage.

> Now as someone else noted, this idea would require a format
> change to svn data structures. So this would still have
> to wait for the next version of svn. But it might be
> faster to use this tactic instead of making the much more
> substantial change of totally dropping sha1 for sha256.
>
> (I don't mean it will necessarily be faster to implement,
> but just that it might run faster in day-to-day operations)

I concede the performance issue.

Cheers,

Daniel
Received on 2017-03-02 12:41:50 CET
