
Re: vdelta questions

From: Karl Fogel <kfogel_at_galois.collab.net>
Date: 2000-08-11 17:25:57 CEST

Greg Hudson <ghudson@mit.edu> writes:
> vdelta maintains a hash table of indexes keyed by the four bytes
> starting at that index. Do we think this hash table is supposed to be
> chained, so that it preserves all indexes which have been entered, or
> do you just replace an entry in the hash table if there was already
> something there? (Either because of a hash collision or because
> you're entering an index with the same four-byte sequence as a
> previous entry.) The description of how comparisons are done does not
> mention trying multiple possible matches and seeing which one yields
> the longest match, but the last diagram in the example shows a "*" at
> position 0 and 13 (both of which have "bcde" as the next four bytes),
> and the text following the example says, "Note that the above matching
> algorithm will actually find the longest match if indexes are kept for
> every location in the string," which would only be true if multiple
> match candidates were tried at each comparison.

I don't know what the paper recommends, but it seems to me that you
have your choice of how much collided data to preserve. You could

   1. Overwrite on every collision, or
   2. Keep most recent N matches (i.e., limit hash chain length to N), or
   3. Keep all matches.
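The three policies above differ only in how the candidate list under each 4-byte key is maintained. A minimal sketch in Python (my own illustration, not the vdelta paper's data structure; the function name and `policy` parameter are invented):

```python
from collections import defaultdict

def index_source(data, policy="keep_all", n=4):
    """Map each 4-byte sequence in `data` to a list of candidate offsets.

    policy: "overwrite" - option 1: keep only the most recent offset
            "keep_n"    - option 2: cap the chain length at n entries
            "keep_all"  - option 3: keep every offset seen
    """
    table = defaultdict(list)
    for i in range(len(data) - 3):
        key = data[i:i + 4]          # the four bytes starting at offset i
        if policy == "overwrite":
            table[key] = [i]         # any prior entry is replaced
        else:
            table[key].append(i)
            if policy == "keep_n" and len(table[key]) > n:
                table[key].pop(0)    # drop the oldest candidate
    return table
```

For example, indexing `b"abcdXabcd"` with "keep_all" records offsets 0 and 5 under the key `b"abcd"`, while "overwrite" retains only offset 5.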

Clearly, options 2 and 3 imply trying at least some of the available
matches to find the longest -- no point in keeping that data if we're
not going to use it. One could always try all available matches, or
else have some "good enough" heuristic to decide when enough have been
tried.

Oh. As I read the rest of your mail, I see that you're already well
aware of the tradeoffs summarized above. :-)

Why not implement the simplest, fastest case first (option 1), since
it's pretty much a subset of options 2 and 3, and see what kind of delta
size you get? With a good test suite, comparing performance as you
try 2 and/or 3 should be easy.
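Trying every retained candidate and keeping the longest match (the behavior that options 2 and 3 enable) only takes a few lines. A hedged sketch, assuming a table shaped like the one above (`{4-byte key: [offsets]}`, built from the bytes before `pos`); the function name is invented:

```python
def longest_match(data, pos, table):
    """Return (offset, length) of the longest match for data[pos:]
    among the candidate offsets stored under its 4-byte key,
    or (-1, 0) if no candidate exists.
    """
    key = data[pos:pos + 4]
    best_off, best_len = -1, 0
    for cand in table.get(key, []):   # each retained earlier offset
        length = 0
        while (pos + length < len(data)
               and data[cand + length] == data[pos + length]):
            length += 1               # extend the match byte by byte
        if length > best_len:
            best_off, best_len = cand, length
    return best_off, best_len
```

With option 1 the candidate list has at most one entry, so the loop degenerates to a single comparison; the same routine covers all three policies, which should make the size/speed comparison straightforward to script.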

> This is purely an issue of delta performance. vdelta is supposed to
> generate diffs about as small as bdiff's and run much faster, and I'd
> like to have a good guess as to whether that level of performance (both
> in terms of runtime and diff size) can be obtained with or without
> trying multiple matches. Obviously, it's much easier to implement the
> algorithm if you simply store the most recent index at each hash table
> location; the hash table becomes a trivial data structure and you only
> have to check one match candidate at each byte position.
> (One option is to go out and buy the _Practical Unix Software_ book,
> which allegedly discusses vdelta according to the vcdiff draft, and
> might contain an implementation I could refer to... although that
> might have bad intellectual property consequences. Does anyone have
> this book, and can tell me how much vdelta stuff is in there? Another
> option is to ask Dave Korn, but I'd rather not take up his time before
> exhausting the other options.)

As long as we don't use his actual code I think we're okay. As far as
I know, there are no patent issues here (*shudder*).

Received on Sat Oct 21 14:36:06 2006

This is an archived mail posted to the Subversion Dev mailing list.
