Greg Hudson wrote:
> I've done a very basic
> implementation of the core block-move generation algorithm and have
> run into an ambiguity.
(That means we have two implementations right now -- yours and Karl's.
And mine is on the way, too ...)
> vdelta maintains a hash table of indexes keyed by the four bytes
> starting at that index. Do we think this hash table is supposed to be
> chained, so that it preserves all indexes which have been entered, or
> do you just replace an entry in the hash table if there was already
> something there? (Either because of a hash collision or because
> you're entering an index with the same four-byte sequence as a
> previous entry.) The description of how comparisons are done does not
> mention trying multiple possible matches and seeing which one yields
> the longest match, but the last diagram in the example shows a "*" at
> position 0 and 13 (both of which have "bcde" as the next four bytes),
> and the text following the example says, "Note that the above matching
> algorithm will actually find the longest match if indexes are kept for
> every location in the string," which would only be true if multiple
> match candidates were tried at each comparison.
I think this quote from the paper is relevant:
"For efficiency, vdelta relaxes the greedy parsing rule so that
matching prefixes are not always maximally long."
That seems to mean that a fully-associative hash isn't required.
Also, the only place where insertions in the hash might be with
an existing key is (2)(a) in the processing pseudo-code. My guess
is that you're supposed to allow for hash collisions, but in the
case of a key collision, insert the newer index into the table.
> This is purely an issue of delta performance.
> vdelta is supposed to
> generate diffs about as small as bdiff's and run much faster, and I'd
> like to have a good guess as to whether that level performance (both
> in terms of runtime and diff size) can be obtained with or without
> trying multiple matches. Obviously, it's much easier to implement the
> algorithm if you simply store the most recent index at each hash table
> location; the hash table becomes a trivial data structure and you only
> have to check one match candidate at each byte position.
Oh, I think it wouldn't be too hard to try all three possibilities,
then compare the results (for performance, memory consumption and
I've been worrying about something else, though. vdelta expects to
hash the whole source before starting to generate the delta. So where
does windowing come into that? Do you take parallel sections of
both source and target? Oh ... that would be O.K. for small differences,
and for large unrelated files you'd just get "noisy" compression, right?
Branko Čibej <firstname.lastname@example.org>
HERMES SoftLab, Litijska 51, 1000 Ljubljana, Slovenia
voice: (+386 1) 586 53 49 fax: (+386 1) 586 52 70
Received on Sat Oct 21 14:36:06 2006