Hello Stefan,
On Mittwoch, 3. März 2010, Stefan Sperling wrote:
> A block is "just another pristine".
> So a block can happen to also serve as a pristine for a different 10MB
> file which happens to have the same content as the block.
> We set the block size to something fixed, like 10MB.
> The last block is allowed to be smaller and runs till EOF.
You could go a little step further and determine the block borders by a
synchronizing checksum, ie. a manber-hash.
FSVS does this; the advantages are
a) you can quit comparing the files (or hashs) as soon as a different block
is found
b) sending nearly-optimal deltas is trivial, because you know that the file
originally had the blocks A,B,C,D and E, and now the blocks hash to
A,X,C,Y and E - so you can just tell the server "the first length(A) bytes
are identical, then you need X, then length(C) are unchanged, ..."
Of course, that's not byte-optimized, but AFAIK the server will re-compute
the delta anyway (for BDB and FSFS).
c) identical blocks from different files get shared; so keeping 3 revisions of
a 100MB file might need only 100 + epsilon space, and that's without having
the blocks compressed.
d) small point: the manber-CRC provide a few bits more security that there's
no hash-collision.
So on commit the content verification goes once through the file, and records
offsets and length of the blocks (and whether they did exist in the older
version); for transferring the delta only some seeks and writes are needed.
The manber-hash block size can be chosen arbitrarily; in FSVS I'm using
~128kB, which means that small files just have a single hash to store, but for
larger ones the IO savings get realized.
Regards,
Phil
Received on 2010-03-04 08:16:48 CET