[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Vdelta answers

From: Greg Hudson <ghudson_at_mit.edu>
Date: 2000-08-14 19:37:09 CEST

I wrote to kpv@research.att.com (an author of the delta algorithms
paper as well as the vcdiff draft) about my Vdelta question and got
some interesting answers back:

First, he said that multiple candidates may need to be tried at each
candidate position to find the longest match. So it sounds like if we
want to match the delta size performance of vdelta, we cannot simply
store the last index for a given four-byte sequence.

Second, he implied that after a match is found, the last three bytes
of that match plus one unmatched byte are used to look for the next
match. (This is as opposed to just skipping to the next four bytes
after the match.) Unfortunately, this assertion seems to contradict
the paper, so maybe I just misunderstood what he wrote.

Third, he clued me in on an interesting hashing trick used in their
vdelta implementation: vdelta allocates an array of hash table entries
equal to the length of the string being compressed. The entry used
for a given index into the string is just &array[index]; thus, the
index can be implicit, saving four bytes in the hash entries. Thus,
the hash table looks a bit like:

        struct hashslot { struct hashslot *next; };
        struct hashslot *slots[2 * WINDOWSIZE];
        struct hashslot *hashtab[HASHSIZE];

to insert a sequence -> position mapping into the hash table:

        hval = hash(sequence);
        slots[ind].next = hashtab[hval];
        hashtab[hval] = &slots[ind];

and you'd traverse the list of match candidates (some of which may be
hash collisions) for a given sequence like:

        for (slot = hashtab[hash(sequence)]; slot != NULL; slot = slot->next) {
          ind = slot - slots;
          ...
        }

Of course, there are other ways to manage the data structure and the
memory (you could use array positions into the slots table instead of
pointers, for instance; if the window size is 32K or less, you could
even use shorts to save memory); that was just the most compact way of
expressing it.

Fourth, he said that he was finishing up an open-source library for
data transformations called Vcode, which would contain a Vdelta
generator with vcdiff output. This is good news; even if the license
is bogus somehow, it gives us something to refer to (e.g. for settling
the second point).
Received on Sat Oct 21 14:36:06 2006

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.