On Thu, Jan 17, 2008 at 11:49:32PM +0100, Niels Werensteijn wrote:
> + This algorithm works the same as xdelta, with the exception
> + that it gets blocks from its own data.
> +
> + 1. Checksum the first MATCH_BLOCKSIZE block of bytes using adler32, and
> + insert the checksum into a match table with the position of the match.
> + 2. Go through the target byte by byte, starting at position MATCH_BLOCKSIZE.
> + See if that byte starts a match that we have in the match table.
> + 2a. If so, try to extend the match as far as possible both
> + forwards and backwards, and then insert a source copy
> + operation into the delta ops builder for the match.
> + Care must be taken here to find match data only from "previous" data,
> + that we know is already in the target window.
> + 2b. If not, insert the byte as new data using an insert delta op.
> +
> + Our implementation doesn't immediately insert "insert" operations,
> + it waits until we have another copy, or we are done. The reasoning
> + is twofold:
> +
> + 1. Otherwise, we would just be building a ton of 1 byte insert
> + operations
> + 2. So that we can extend a source match backwards into a pending
> + insert operation, and possibly remove the need for the insert
> + entirely. This can happen due to stream alignment.
Interesting. This sounds similar to a patch that Dan Berlin (cc'd)
produced around the time of the Subversion conference, which used
Rabin fingerprinting (basically a CRC) and a match table.
Is this an implementation of any particular algorithm?
There are two papers I was looking at around that time in this area:
M. Ajtai, R. Burns, R. Fagin, D. D. E. Long, and L. Stockmeyer.
Compactly encoding unstructured input with differential compression.
www.almaden.ibm.com/cs/people/stock/diff7.ps, IBM Research Report RJ
10187, April 2000.
http://citeseer.ist.psu.edu/article/ajtai00compactly.html
(which your algorithm sounds very similar to, in particular the 1.5-pass
correcting version), and also
An approximation to the greedy algorithm for differential compression
by R. C. Agarwal, K. Gupta, S. Jain, S. Amalapurapu
IBM Journal of Research and Development
Volume 50, Number 1, Page 149 (2006)
http://www.research.ibm.com/journal/rd/501/agarwal.pdf
(which presents a possibly-better-performing, but much more complex
algorithm).
I never got the time to evaluate Dan's patch, or do any real work in
this area, but it's something we should consider doing.
I think Dan produced some performance (time, space) results for the gcc
repository. It'd be worth comparing them to what you've got.
Regards,
Malcolm
- application/pgp-signature attachment: stored
Received on 2008-01-18 18:34:51 CET