
dangerous implementation of rep-sharing cache for fsfs

From: <michael.felke_at_evonik.com>
Date: Thu, 24 Jun 2010 11:15:46 +0200

Excuse me, but I originally wrote the following e-mail to Hyrum K. Wright
directly, because I was not yet familiar with the guidelines of the
Subversion project.

----- Forwarded by Michael Felke/AN/Stockhausen/DE on 24.06.2010
11:09 -----

Michael Felke
23.06.2010 14:07
 
        To: hwright_at_tigris.org
        Cc:
        Subject: subversion Issue 2286: rep-sharing cache for fsfs

Hello Hyrum K. Wright,

sorry to bother you with this directly, but I have no idea how to work
with the issue tracker.

I have just started checking the changes in 1.6 for possible problems
before updating our raw-data repository to this version.
I found that the new representation caching would have a great impact
at our site.

It could save us a lot of disk space on the server, because the software
we are using often generates file copies, which are added as separate
files.

But unfortunately it seems we cannot use it :-(
From what the source code of rep-cache.c and fs_fs.c in libsvn_fs_fs
looks like to me, the mechanism for finding an already existing
representation relies only on the SHA-1 checksum.
Because of the possibility of hash collisions, that alone is not enough
to ensure that the old representation found is really a duplicate of the
new one.
An undetected hash collision would result in a file with totally wrong
contents.
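
To illustrate the concern, here is a minimal C sketch of my own; it is
not the actual Subversion code (as far as I can tell the real cache is a
SQLite table keyed by the SHA-1 digest), and all names in it are
invented:

  #include <stdio.h>
  #include <string.h>

  /* Invented model of one rep-cache entry: the SHA-1 digest is the
   * ONLY key; revision and offset locate the stored representation. */
  struct cache_entry {
      char sha1_hex[41];   /* 40 hex digits + NUL */
      long revision;
      long offset;
  };

  /* A lookup keyed on the digest alone treats any new content whose
   * SHA-1 matches an entry as identical to the cached bytes, so an
   * undetected collision links the new file to the wrong contents. */
  const struct cache_entry *
  find_rep(const struct cache_entry *cache, int n, const char *sha1_hex)
  {
      int i;
      for (i = 0; i < n; i++)
          if (strcmp(cache[i].sha1_hex, sha1_hex) == 0)
              return &cache[i];
      return NULL;   /* no hit: a new representation would be stored */
  }

  int main(void)
  {
      struct cache_entry cache[1] = {
          { "da39a3ee5e6b4b0d3255bfef95601890afd80709", 7, 4096 }
      };
      /* Anything hashing to this digest is mapped to revision 7's
       * representation, whether or not its bytes really match. */
      const struct cache_entry *hit =
          find_rep(cache, 1, "da39a3ee5e6b4b0d3255bfef95601890afd80709");
      printf("reused rep at r%ld, offset %ld\n", hit->revision, hit->offset);
      return 0;
  }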

SHA-1 was developed to detect modifications to a file and to make it
practically impossible to produce the same SHA-1 checksum by merely
modifying that file.
So it is well suited to checking whether a file has been modified.
But it was not designed to prove that two different files are actually
the same.
There are always infinitely many distinct files that produce the same
checksum.
For a given file size the number of possible collisions is finite, but
it still grows dramatically with the file size.
So additionally checking the file size helps, but it is not a completely
satisfying solution.
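
To put rough numbers on that (a back-of-the-envelope count of my own,
not taken from the issue): there are 2^8000 distinct 1000-byte files but
only 2^160 possible SHA-1 values, so by the pigeonhole principle an
average of

    2^8000 / 2^160 = 2^7840

files of that size map to each digest. Checking the size shrinks the
candidate set per digest, but it stays astronomically large.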

The number of undetected hash collisions could easily be reduced by also
checking the MD5 checksum, the size, and the expanded size.
To make this feature totally reliable, a complete comparison of the new
file's content with the content of the old representation found is
necessary.
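
A sketch of such a defensive match (again my own C, with invented names;
the real fix would live in the FSFS code and read the old representation
back through the filesystem layer):

  #include <stdbool.h>
  #include <stddef.h>
  #include <string.h>

  /* Invented descriptor for one representation. */
  struct rep_info {
      unsigned char sha1[20];
      unsigned char md5[16];
      long long size;            /* stored (delta) size */
      long long expanded_size;   /* fulltext size */
  };

  /* Cheap metadata checks first: a mismatch in any field already
   * proves the representations differ.  Only when every checksum and
   * size agrees do we pay for the byte-by-byte comparison that makes
   * the answer certain. */
  bool
  reps_identical(const struct rep_info *a, const unsigned char *a_text,
                 const struct rep_info *b, const unsigned char *b_text)
  {
      if (memcmp(a->sha1, b->sha1, sizeof a->sha1) != 0) return false;
      if (memcmp(a->md5,  b->md5,  sizeof a->md5)  != 0) return false;
      if (a->size != b->size) return false;
      if (a->expanded_size != b->expanded_size) return false;
      /* All metadata agrees; now compare the actual contents. */
      return memcmp(a_text, b_text, (size_t)a->expanded_size) == 0;
  }

The point of the ordering is that the full comparison is expensive (it
has to read both fulltexts), so it only runs for candidates that already
pass every cheap check.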

Yours sincerely

Michael Felke
Phone +49 2151 38-1453
Fax +49 2151 38-1094
michael.felke_at_evonik.com
Evonik Stockhausen GmbH
Bäkerpfad 25
47805 Krefeld
http://www.evonik.com

Management: Gunther Wittmer (Spokesman), Willibrord Lampen

Registered office: Krefeld
Register court: Amtsgericht Krefeld; Commercial Register HRB 5791
