I decided to ask Phong Vo about the second point I mentioned in my

last piece of mail. He responded with a nicely worked example

indicating that the algorithm for finding the longest match at a given

position is significantly more complicated than I or Branko had gotten

out of the paper. A more complete description of the block-move

generation algorithm is:

1. Look up the four bytes starting at the current position

pointer. If there are no matches for those four bytes,

output an insert, move the position pointer forward by one,

and go back to step 1.

2. Determine which of the candidates yields the longest

extension. This will be called the "current match".

3. Look up the last three bytes of the current match plus one

unmatched byte. If there is no match for those four bytes,

the current match is the best match; go to step 6.

4. For each candidate, check backwards to see if it matches

the entire match so far. If no candidates satisfy that

constraint, the current match is the best match; go to step

6.

5. Among the candidates which do satisfy the constraint,

determine which one yields the longest extension. This

will be the new "current match." Go back to step 3.

6. Output a block copy instruction, add indexes for the last

three positions of the matched data, advance the position

pointer by the length of the match, and go back to step 1.

(I almost feel like if you really want to find the longest match you

should do this in a tree fashion, rather than narrowing the field to

the match with the longest extension at each iteration. But Vo did

not make it sound like the vdelta code is quite so heroic, and I would

worry about combinatoric runtime explosions in psychotic cases.)

Here is the example he used to illustrate the full matching algorithm,

with asterixes marking the indexes which get inserted into the hash

table:

abcdxabcdabcdyabcdabcdabcdabcdabcdabcdz

***** *** **** *** ****

^

Note that when the position pointer reaches the caret, the longest

match found by simply looking at the sequence "abcd" and extending is

a four-byte match at position 0. But the full algorithm finds the

match "abcdabcd" at position 5 by looking up "bcda" and verifying

backwards from the candidate at position 6.

Received on Sat Oct 21 14:36:06 2006