
Re: Optional/compressed text bases (was: Re: [Reminder] Subversion a mentor for Google Summer of Code)

From: Ph. Marek <philipp.marek_at_bmlv.gv.at>
Date: 2006-05-12 07:30:16 CEST

On Monday 08 May 2006 19:18, Jonathan Gilbert wrote:
> I vaguely recall reading that rsync has in fact had one or two such
> collisions in its history (resulting in a corrupt copy of the file being
> synchronized)
AFAIK that happened because, for network bandwidth reasons, only 16- or
32-bit checksums were transmitted per 800-byte block, and for BIG files
(over 100MB) there were so many blocks that collisions occurred (in the
32-bit checksums!).
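
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the checksums behave like uniformly random values and uses the standard birthday approximation. It does not model rsync's actual protocol; it only estimates how likely it is that any two blocks of one file share a checksum. The function name is made up for this illustration.

```python
import math


def collision_probability(n_blocks, checksum_bits):
    """Birthday approximation: chance that at least two of n_blocks
    uniformly random checksums of the given width are equal."""
    space = 2 ** checksum_bits
    return 1.0 - math.exp(-n_blocks * (n_blocks - 1) / (2.0 * space))


# A 100MB file (taking 1MB = 2**20 bytes) cut into 800-byte blocks.
n_blocks = (100 * 1024 * 1024) // 800

p32 = collision_probability(n_blocks, 32)
p16 = collision_probability(n_blocks, 16)
print(n_blocks, round(p32, 3), round(p16, 3))
```

Under these assumptions a 100MB file has about 131,000 blocks, and two of them sharing a 32-bit checksum is more likely than not.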

> , but they are extremely rare and don't stop most people from
> using it. Still, back when I suggested an rsync-like algorithm for
> Subversion (for a completely different reason), one of the things I was
> told is that Subversion tries to take nothing for granted when it comes to
> data integrity, and that for that reason, my algorithm would be an unlikely
> addition even if I did finish it.
In FSVS (fsvs.tigris.org) I use such an algorithm. I use a rolling checksum,
and whenever I hit a "special" value (with a predefined number of zero bits)
I declare the block to be finished and take an MD5 of it.
So I can stop checking the local text for modifications *without* checksumming
the whole (possibly big) file.
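
A minimal sketch of this kind of content-defined blocking follows. It is not the FSVS code. The polynomial rolling hash, the 48-byte window, the constants and the names split_blocks and _finish are all assumptions chosen for illustration. A block ends wherever the low zero_bits bits of the rolling checksum are zero, and each finished block gets its own MD5.

```python
import hashlib

BASE = 257        # multiplier of the polynomial rolling hash (assumed)
MOD = 1 << 32     # keep the rolling checksum within 32 bits (assumed)


def _finish(data, start, end):
    """Describe one finished block as (offset, length, md5_hex)."""
    return (start, end - start, hashlib.md5(data[start:end]).hexdigest())


def split_blocks(data, window=48, zero_bits=13):
    """Split a bytes object into content-defined blocks.

    A boundary is declared when the low `zero_bits` bits of the rolling
    checksum over the last `window` bytes are all zero.
    """
    mask = (1 << zero_bits) - 1
    top = pow(BASE, window - 1, MOD)  # weight of the oldest byte in the window
    blocks = []
    start = 0   # offset where the current block began
    h = 0       # rolling checksum, restarted at every block start
    for i, byte in enumerate(data):
        if i - start >= window:
            # Window is full: remove the byte that slides out of it.
            h = (h - data[i - window] * top) % MOD
        h = (h * BASE + byte) % MOD
        # Only accept a boundary once a full window has been seen.
        if i - start + 1 >= window and (h & mask) == 0:
            blocks.append(_finish(data, start, i + 1))
            start = i + 1
            h = 0
    if start < len(data):
        # Whatever is left over forms the final block.
        blocks.append(_finish(data, start, len(data)))
    return blocks
```

Because a boundary depends only on the bytes just before it, inserting or deleting data near the start of a file changes the first block or two, and the later blocks keep their MD5s. With random-looking data the average block size is about 2**zero_bits bytes, so blocks of roughly 100kB would correspond to about 17 zero bits in this sketch.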

I believe that if MD5 is successfully used to check the integrity of all files
(small, big, ...), then taking the MD5 of blocks of approximately 100kB is no
problem either.

> If we replace the text-base with a bunch of block hashes, we will be
> opening the door (albeit only by the tiniest crack) for working copies to
> get undetectably (in the automated sense) corrupted. The only way to be
> *absolutely* certain, assuming you trust TCP to move the data reliably
> (which we usually do), is to move one of the two versions to be compared to
> the other system so that they're both in the same place and can be directly
> compared, byte for byte.
>
> I should also point out that if you use a large block size like 32 KB for
> the text base, source code files will almost never find matching blocks,
> which will basically destroy the commit efficiency in that area. This is
> probably only an issue for dialup users, where transferring 32 KB instead
> of 300 bytes translates to real pain. :-)
But it's not a problem for a LAN.
And dialup users would trade hard disk space for network bandwidth, I
think :-)

Regards,

Phil

Received on Fri May 12 07:30:21 2006

This is an archived mail posted to the Subversion Dev mailing list.
