
Re: pristine store design

From: Philipp Marek <philipp.marek_at_emerion.com>
Date: Thu, 4 Mar 2010 08:16:08 +0100

Hello Stefan,

On Wednesday, 3 March 2010, Stefan Sperling wrote:
> A block is "just another pristine".
> So a block can happen to also serve as a pristine for a different 10MB
> file which happens to have the same content as the block.
> We set the block size to something fixed, like 10MB.
> The last block is allowed to be smaller and runs till EOF.
You could go one small step further and determine the block borders with a
synchronizing checksum, i.e. a Manber hash.

FSVS does this; the advantages are
a) you can stop comparing the files (or hashes) as soon as a differing block
   is found
b) sending nearly-optimal deltas is trivial, because you know that the file
   originally had the blocks A,B,C,D and E, and now the blocks hash to
   A,X,C,Y and E - so you can just tell the server "the first length(A) bytes
   are identical, then you need X, then length(C) bytes are unchanged, ..."
   Of course, that's not byte-optimal, but AFAIK the server will re-compute
   the delta anyway (for BDB and FSFS).
c) identical blocks from different files get shared; so keeping 3 revisions of
   a 100MB file might need only 100MB + epsilon of space, and that's without
   having the blocks compressed.
d) minor point: the Manber checksum provides a few extra bits of assurance
   against hash collisions.
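
To illustrate point b), here is a minimal sketch of turning two block lists
into delta instructions. It is not FSVS code: the function name plan_delta,
the (digest, length) tuples and the "copy"/"data" instruction names are all
made up for this example, and the single-letter digests stand in for real
hashes.

```python
def plan_delta(old_blocks, new_blocks):
    """Build delta instructions from two lists of (digest, length) pairs.

    Hypothetical helper for illustration only. Returns a list of
    ("copy", old_offset, length) for blocks the receiver already has and
    ("data", new_offset, length) for blocks that must be sent.
    """
    # Map each digest of the old version to the offset where it started.
    old_offsets = {}
    offset = 0
    for digest, length in old_blocks:
        old_offsets.setdefault(digest, offset)
        offset += length

    plan = []
    offset = 0
    for digest, length in new_blocks:
        if digest in old_offsets:
            plan.append(("copy", old_offsets[digest], length))
        else:
            plan.append(("data", offset, length))
        offset += length
    return plan


# The A,B,C,D,E -> A,X,C,Y,E case from above, with invented lengths.
old = [("A", 100), ("B", 120), ("C", 90), ("D", 110), ("E", 80)]
new = [("A", 100), ("X", 130), ("C", 90), ("Y", 95), ("E", 80)]
plan = plan_delta(old, new)
```

Only the two "data" entries (X and Y) require file content to be sent; the
rest is described by offsets and lengths alone.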

So on commit the content verification makes a single pass through the file
and records the offset and length of each block (and whether it already
existed in the older version); transferring the delta then needs only a few
seeks and writes.
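
The commit-time pass described above could look roughly like the following.
This is a sketch under assumptions, not the FSVS implementation: scan,
read_new_blocks and the dict-based store are hypothetical names, SHA-1 is
just one possible block digest, and the block lengths are assumed to come
from whatever chunker is in use.

```python
import hashlib
import io


def scan(stream, lengths, store):
    """Read the stream once, block by block.

    `lengths` are block lengths supplied by some chunker (assumed input).
    `store` maps digest -> block bytes and plays the role of a shared
    block store. Returns (offset, length, digest, existed) tuples.
    """
    records = []
    offset = 0
    for length in lengths:
        block = stream.read(length)
        digest = hashlib.sha1(block).hexdigest()
        existed = digest in store
        if not existed:
            store[digest] = block  # identical blocks are stored only once
        records.append((offset, length, digest, existed))
        offset += length
    return records


def read_new_blocks(stream, records):
    """Seek to and read only the blocks that were not known before."""
    new_blocks = []
    for offset, length, _digest, existed in records:
        if not existed:
            stream.seek(offset)
            new_blocks.append(stream.read(length))
    return new_blocks


store = {}
v1 = b"aaaa" + b"bbbb" + b"cccc"
v2 = b"aaaa" + b"XXXXX" + b"cccc"
r1 = scan(io.BytesIO(v1), [4, 4, 4], store)
r2 = scan(io.BytesIO(v2), [4, 5, 4], store)
changed = read_new_blocks(io.BytesIO(v2), r2)
```

After both scans the store holds four blocks rather than six, and only the
one changed block of the second version has to be read again for transfer.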

The Manber hash block size can be chosen freely; in FSVS I'm using ~128kB,
which means that small files have just a single hash to store, while larger
ones get the I/O savings.
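
For readers unfamiliar with the idea, here is a minimal sketch of choosing
block borders with a rolling checksum. It uses a plain Rabin-Karp style
polynomial hash over a sliding window as a stand-in; the constants WINDOW,
BASE and MOD, the function name chunk and the avg_bits parameter are
assumptions for this example and are not taken from FSVS. With avg_bits=17
a border is expected about every 2^17 bytes, i.e. roughly 128kB.

```python
import random

WINDOW = 48            # bytes covered by the rolling hash (assumed value)
BASE = 257             # polynomial base (assumed value)
MOD = (1 << 61) - 1    # prime modulus (assumed value)


def chunk(data, avg_bits=17):
    """Split data at content-defined borders; returns (offset, length) pairs.

    A border is placed after a byte when the hash of the preceding WINDOW
    bytes has its low avg_bits bits all set. The last block runs to EOF.
    """
    mask = (1 << avg_bits) - 1
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    blocks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Drop the oldest byte so h covers exactly WINDOW bytes.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        if i + 1 >= WINDOW and (h & mask) == mask:
            blocks.append((start, i + 1 - start))
            start = i + 1
    if start < len(data):
        blocks.append((start, len(data) - start))
    return blocks


# Small demo with a small average block size so it runs quickly.
rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(20000))
blocks = chunk(data, avg_bits=8)
shifted = chunk(b"some inserted prefix" + data, avg_bits=8)
```

Because the hash depends only on the last WINDOW bytes, inserting bytes near
the start of a file shifts the later borders but does not change which
content they fall on, so the following blocks hash the same as before.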

Regards,

Phil
Received on 2010-03-04 08:16:48 CET

This is an archived mail posted to the Subversion Dev mailing list.
