I decided to ask Phong Vo about the second point I mentioned in my
last piece of mail. He responded with a nicely worked example
indicating that the algorithm for finding the longest match at a given
position is significantly more complicated than I or Branko had gotten
out of the paper. A more complete description of the block-move
generation algorithm is:
1. Look up the four bytes starting at the current position
pointer. If there are no matches for those four bytes,
output an insert, move the position pointer forward by one,
and go back to step 1.
2. Determine which of the candidates yields the longest
extension. This will be called the "current match".
3. Look up the last three bytes of the current match plus one
unmatched byte. If there is no match for those four bytes,
the current match is the best match; go to step 6.
4. For each candidate, check backwards to see if it matches
the entire match so far. If no candidates satisfy that
constraint, the current match is the best match; go to step
6.
5. Among the candidates which do satisfy the constraint,
determine which one yields the longest extension. This
will be the new "current match." Go back to step 3.
6. Output a block copy instruction, add indexes for the last
three positions of the matched data, advance the position
pointer by the length of the match, and go back to step 1.
(I almost feel like if you really want to find the longest match you
should do this in a tree fashion, rather than narrowing the field to
the match with the longest extension at each iteration. But Vo did
not make it sound like the vdelta code is quite so heroic, and I would
worry about combinatoric runtime explosions in psychotic cases.)
Here is the example he used to illustrate the full matching algorithm,
with asterixes marking the indexes which get inserted into the hash
table:
abcdxabcdabcdyabcdabcdabcdabcdabcdabcdz
***** *** **** *** ****
^
Note that when the position pointer reaches the caret, the longest
match found by simply looking at the sequence "abcd" and extending is
a four-byte match at position 0. But the full algorithm finds the
match "abcdabcd" at position 5 by looking up "bcda" and verifying
backwards from the candidate at position 6.
Received on Sat Oct 21 14:36:06 2006