Malcolm Rowe wrote:
> On Thu, Jan 17, 2008 at 11:49:32PM +0100, Niels Werensteijn wrote:
>> + This algorithm works the same as xdelta, with the exception
>> + that it gets blocks from its own data.
>> + 1. Checksum the first MATCH_BLOCKSIZE block of bytes using adler32, and
>> + insert the checksum into a match table with the position of the match.
>> + 2. Go through the target byte by byte, starting at position MATCH_BLOCKSIZE.
>> + See if that byte starts a match that we have in the match table.
>> + 2a. If so, try to extend the match as far as possible both
>> + forwards and backwards, and then insert a source copy
>> + operation into the delta ops builder for the match.
>> + Care must be taken here to find match data only from "previous" data,
>> + that we know is already in the target window.
>> + 2b. If not, insert the byte as new data using an insert delta op.
>> + Our implementation doesn't immediately insert "insert" operations,
>> + it waits until we have another copy, or we are done. The reasoning
>> + is twofold:
>> + 1. Otherwise, we would just be building a ton of 1 byte insert
>> + operations
>> + 2. So that we can extend a source match backwards into a pending
>> + insert operation, and possibly remove the need for the insert
>> + entirely. This can happen due to stream alignment.
> Interesting. This sounds similar to a patch that Dan Berlin (cc'd)
> produced around the time of the Subversion conference, which used
> Rabin fingerprinting (basically a CRC) and a match table.
Is this in the mailinglist archive? If so, please give me a hint where
to find it? :)
> Is this an implementation of any particular algorithm?
No. I just looked at XDelta as it was implemented. I saw how it would be
faster than vdelta, and adjusted the XDelta algorithm so it could make a
diff against itself.
> There are two papers I was looking at around that time in this area:
> M. Ajtai, R. Burns, R. Fagin, D. D. E. Long, and L. Stockmeyer.
> Compactly encoding unstructured input with differential compression.
> www.almaden.ibm.com/cs/people/stock/diff7.ps, IBM Research Report RJ
> 10187, April 2000.
> (which your algorithm sounds very similar to, in particular the 1.5-pass
> correcting version), and also
> An approximation to the greedy algorithm for differential compression
> by R. C. Agarwal, K. Gupta, S. Jain, S. Amalapurapu
> IBM Journal of Research and Development
> Volume 50, Number 1, Page 149 (2006)
> (which presents a possibly-better-performing, but much more complex
> I never got the time to evaluate Dan's patch, or do any real work in
> this area, but it's something we should consider doing.
Ill take a look at them. Always interresting!
> I think Dan produced some performance (time, space) results for the gcc
> repository. It'd be worth comparing them to what you've got.
Well, One thing I am also thinking about. If we are diffing against an
empty source stream (what we are doing here), it is basicly compression.
Perhaps there are better algorithms for this problem.
To unsubscribe, e-mail: dev-unsubscribe_at_subversion.tigris.org
For additional commands, e-mail: dev-help_at_subversion.tigris.org
Received on 2008-01-18 19:55:07 CET