
Reply: Re: dangerous implementation of rep-sharing cache for fsfs

From: <michael.felke_at_evonik.com>
Date: Thu, 24 Jun 2010 17:29:45 +0200


Sorry, but our e-mail system doesn't support the usual way of
quoting the message being replied to.

First, we are using svn in a chemical laboratory to save, archive and
version the data and methods of our measurements.
We must ensure, beyond any doubt, that the data in the repository is
exactly the data we once measured or wrote.
So only a totally reliable solution for the rep-sharing cache
would be acceptable to us.

Yes, I am interested in helping you to improve Subversion
by writing the needed code. But I am not sure that I will be able to
compile Subversion completely here at work; I could try.
Perhaps someone is willing to help me test my code?

Thanks for the hint about svn_fs_fs__set_rep_reference;
I didn't expect the additional checks to be there.
I looked there, but couldn't tell at first glance
when this check is performed. I will dig deeper later.

I think it's better to add a check on md5 than on any part of the fulltext,
because md5 is calculated over the whole data, too.
But this isn't important to me:
it only reduces the risk, it does not eliminate it.


P.S. I am also sorry for the signature we are recommended to use.

Michael Felke
Phone +49 2151 38-1453
Fax +49 2151 38-1094
Evonik Stockhausen GmbH
Bäkerpfad 25
47805 Krefeld

Management: Gunther Wittmer (Spokesman), Willibrord Lampen

Registered office: Krefeld
Court of registration: Amtsgericht Krefeld; commercial register HRB 5791


Daniel Shahaf <d.s_at_daniel.shahaf.name>
24.06.2010 16:38
        To: Julian Foad <julian.foad_at_wandisco.com>
        Cc: michael.felke_at_evonik.com, dev_at_subversion.apache.org
        Subject: Re: dangerous implementation of rep-sharing cache for fsfs

Julian Foad wrote on Thu, 24 Jun 2010 at 17:21 -0000:
> I am not sure whether the "representation" whose SHA-1 sum is stored is
> ever an exact copy of the user's file. If it is - if it does not
> include an extra header and is not stored in a delta format - then the

That is not the case:

    A representation begins with a line containing either "PLAIN\n" or
    "DELTA\n" or "DELTA <rev> <offset> <length>\n", where <rev>, <offset>,
    and <length> give the location of the delta base of the representation
    and the amount of data it contains (not counting the header or
    trailer). If no base location is given for a delta, the base is the
    empty stream. After the initial line comes raw svndiff data, followed
    by a cosmetic trailer "ENDREP\n".

So, there are header, trailer, and it's possibly deltified or
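
To make the quoted format concrete, here is a minimal sketch of classifying
the first line of a representation. It follows only the description quoted
above; the function name parse_rep_header and its return shape are
hypothetical and are not taken from the Subversion sources.

```python
import re

# Hypothetical helper, not part of Subversion: it only mirrors the three
# header forms named in the quoted format description.
_DELTA_WITH_BASE = re.compile(rb"DELTA (\d+) (\d+) (\d+)\n")


def parse_rep_header(line: bytes):
    """Return (kind, base), where base is None or (rev, offset, length)."""
    if line == b"PLAIN\n":
        return ("PLAIN", None)
    if line == b"DELTA\n":
        # No base location given: the delta base is the empty stream.
        return ("DELTA", None)
    match = _DELTA_WITH_BASE.fullmatch(line)
    if match:
        rev, offset, length = (int(group) for group in match.groups())
        return ("DELTA", (rev, offset, length))
    raise ValueError(f"not a representation header: {line!r}")
```

For example, parse_rep_header(b"DELTA 12 345 678\n") yields the kind "DELTA"
together with the base location (12, 345, 678).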

> chance of collision would depend directly on the content of the user's
> files. If that is the case, it *might* be advisable to disable the
> rep-cache feature if you are storing files that have a higher than usual
> chance of SHA-1 collisions - data files for SHA-1 research, for example.
> We should find out the answer to that question before going further.
> > Indeed, the number of hash collisions is only finite for a given file
> > size, but is still increasing dramatically with the file size.
> > So additional checking of the file size helps but is not a completely
> > satisfying solution.
> >
> > The number of undetected hash collisions could be reduced easily by
> > checking the md5-checksum, the size and the expanded-size.

Check svn_fs_fs__set_rep_reference in rep-cache.c; we already assert
that the size and expanded size match.

It's indeed possible to also use md5 there. Another option is to use
practically any statistic about the fulltext: the first N bytes, the
number of '#' characters, ...
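
As an illustration of such a cross-check, the following is a rough sketch,
not the code in rep-cache.c. The RepInfo record, its field names, and the
helper functions are assumptions made up for this example; the idea is only
that a SHA-1 match is accepted for sharing when md5, size and expanded size
agree as well.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class RepInfo:
    # Hypothetical record; the field names are illustrative only.
    sha1: bytes
    md5: bytes
    size: int
    expanded_size: int


def rep_info(fulltext: bytes, size: int) -> RepInfo:
    """Build the record for a fulltext; 'size' is supplied by the caller."""
    return RepInfo(
        sha1=hashlib.sha1(fulltext).digest(),
        md5=hashlib.md5(fulltext).digest(),
        size=size,
        expanded_size=len(fulltext),
    )


def may_share(cached: RepInfo, new: RepInfo) -> bool:
    """Accept a SHA-1 hit only if the other statistics match too."""
    if cached.sha1 != new.sha1:
        return False
    return (
        cached.md5 == new.md5
        and cached.size == new.size
        and cached.expanded_size == new.expanded_size
    )
```

As noted elsewhere in this thread, this only lowers the chance of an
undetected collision; it does not rule one out.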

> True. This approach could be beneficial if there are cases where the
> perfect solution (below) is not feasible.
> > To make this feature totally reliable, a complete comparison of the
> > content with the content of the old representation found, is necessary
> Yes, it would be good if Subversion could do this extra check. Would
> you be interested in helping to improve Subversion by writing code to do
> this? If so, you will be very welcome and we will try to help you.

+1 from me too.
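
For reference, the complete comparison being discussed could look roughly
like the sketch below. It is a generic byte-for-byte comparison of two
streams, not Subversion code; the name streams_equal and the chunk size are
assumptions, and it presumes buffered binary streams whose read(n) returns
fewer than n bytes only at end of file.

```python
import io


def streams_equal(a, b, chunk_size: int = 65536) -> bool:
    """Compare two binary streams chunk by chunk until one differs or both end.

    Assumes buffered streams (e.g. io.BytesIO or files opened in 'rb' mode),
    where a short read happens only at end of file.
    """
    while True:
        chunk_a = a.read(chunk_size)
        chunk_b = b.read(chunk_size)
        if chunk_a != chunk_b:
            return False
        if not chunk_a:
            # Both streams ended at the same point with no difference seen.
            return True
```

Such a check would run only when the hash already matches, so its cost is
one extra read of the old fulltext per shared representation.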

> (I recall reading about an option in Git (?) to switch on full-text
> comparisons to check for SHA-1 collisions. I can't find a reference to
> it now.)
> Regards,
> - Julian
Received on 2010-06-24 18:22:06 CEST

This is an archived mail posted to the Subversion Dev mailing list.