On Mittwoch, 3. März 2010, Stefan Sperling wrote:
> A block is "just another pristine".
> So a block can happen to also serve as a pristine for a different 10MB
> file which happens to have the same content as the block.
> We set the block size to something fixed, like 10MB.
> The last block is allowed to be smaller and runs till EOF.
You could go a little step further and determine the block borders by a
synchronizing checksum, ie. a manber-hash.
FSVS does this; the advantages are
a) you can quit comparing the files (or hashs) as soon as a different block
b) sending nearly-optimal deltas is trivial, because you know that the file
originally had the blocks A,B,C,D and E, and now the blocks hash to
A,X,C,Y and E - so you can just tell the server "the first length(A) bytes
are identical, then you need X, then length(C) are unchanged, ..."
Of course, that's not byte-optimized, but AFAIK the server will re-compute
the delta anyway (for BDB and FSFS).
c) identical blocks from different files get shared; so keeping 3 revisions of
a 100MB file might need only 100 + epsilon space, and that's without having
the blocks compressed.
d) small point: the manber-CRC provide a few bits more security that there's
So on commit the content verification goes once through the file, and records
offsets and length of the blocks (and whether they did exist in the older
version); for transferring the delta only some seeks and writes are needed.
The manber-hash block size can be chosen arbitrarily; in FSVS I'm using
~128kB, which means that small files just have a single hash to store, but for
larger ones the IO savings get realized.
Received on 2010-03-04 08:16:48 CET