> (That means we have two implementations right now -- yours and
> Karl's. And mine is on the way, too ...)
I don't have an "implementation." I have a mock-up of the core part
of the algorithm, just so that I understand how it works; it's not
even written as a library function. That's like 3% of the actual
work; dealing with windowing and instruction encoding will be the hard
part.
If you've made progress on the implementation, I can sit back and just
be a knowledgeable person, and we won't have wasted any effort.
> Also, the only place where insertions in the hash might be with an
> existing key is (2)(a) in the processing pseudo-code.
Right. But I think this case happens with some frequency. It
happened in the first real-world test case of my mock-up (two versions
of a 10K test document with a one-paragraph change), such that the
delta began by copying a few small segments from random places in the
source before copying a 3K chunk from the beginning.
> My guess is that you're supposed to allow for hash collisions, but
> in the case of a key collision, insert the newer index into the
That's possible. Though it's certainly a whole lot simpler if you
just overwrite entries when there's a hash collision, and you probably
don't lose significantly on delta size if you do that.
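To make the trade-off concrete, here is a toy sketch of the two
insertion policies for the source index: overwrite-on-collision versus
keeping every occurrence. The block size, function names, and
dict-based table are all invented for illustration; vdelta's real
matcher hashes strings at every offset, not just at block boundaries.

```python
# Toy comparison of index policies for a delta matcher's source index.
# All names and the 4-byte block size are made up for illustration.

BLOCK = 4  # toy block size

def index_overwrite(source):
    """One slot per key: a later occurrence silently replaces an earlier one."""
    table = {}
    for i in range(len(source) - BLOCK + 1):
        table[source[i:i + BLOCK]] = i      # overwrite on key collision
    return table

def index_chain(source):
    """Keep every occurrence, newest first, at the cost of more memory."""
    table = {}
    for i in range(len(source) - BLOCK + 1):
        table.setdefault(source[i:i + BLOCK], []).insert(0, i)
    return table

src = b"abcdefabcdef"
print(index_overwrite(src)[b"abcd"])   # 6: only the later occurrence survives
print(index_chain(src)[b"abcd"])       # [6, 0]: both occurrences retained
```

The overwrite version can only ever point at the most recent match for
a given string, which is what makes it cheap; the chained version lets
the matcher pick the occurrence that extends into the longest copy.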
> I've been worrying about something else, though. vdelta expects to
> hash the whole source before starting to generate the delta. So
> where does windowing come into that? Do you take parallel sections
> of both source and target? Oh ... that would be O.K. for small
> differences, and for large unrelated files you'd just get "noisy"
> compression, right?
The vcdiff draft says:
    The standard way to deal with this is to partition input files
    into "windows" to be processed separately. Here, except for
    some unpublished work by Vo, little has been done on designing
    effective windowing schemes. Current techniques, including
    Vdelta, simply use windows of the same size with corresponding
    addresses across source and target files.
Using fixed-size windows with corresponding addresses is not
especially satisfactory. For instance, suppose you use 10K fixed
windows and there has been a 1K insertion at the beginning of the
target; then each window will have 10% non-commonality. A cleverer
windowing algorithm might figure out that using a 10K initial window
in the source and an 11K initial window in the target yields a much
better result. However, apparently vdelta does okay even with this
suboptimality. The whole point of vcdiff is that if you develop a
clever new windowing algorithm, that doesn't require you to convert to
a new encoding format.
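The fixed-window cost can be shown numerically. In the sketch below
(all sizes and the block-sampling overlap measure are made up for
illustration), the target has a 1K insertion at the front; with 10K
windows at corresponding addresses every window pays the misalignment,
while letting the first target window grow to 11K re-aligns the rest.

```python
# Toy demonstration of the windowing argument: fixed corresponding
# windows vs. an 11K first target window after a 1K front insertion.
# Overlap is measured as the fraction of stride-aligned 4-byte blocks
# of the target window that also occur in the source window -- a crude
# invented proxy for "commonality", not a real matcher.
import os

WIN, INS, BLK = 10_000, 1_000, 4
source = os.urandom(5 * WIN)
target = os.urandom(INS) + source          # 1K inserted at the beginning

def overlap(sw, tw):
    """Fraction of the target window's blocks found in the source window."""
    have = {sw[i:i + BLK] for i in range(0, len(sw) - BLK + 1, BLK)}
    blocks = [tw[i:i + BLK] for i in range(0, len(tw) - BLK + 1, BLK)]
    return sum(b in have for b in blocks) / len(blocks)

# Fixed-size windows at corresponding addresses: every window ~90% common.
fixed = [overlap(source[k:k + WIN], target[k:k + WIN])
         for k in range(0, 4 * WIN, WIN)]

# Let the first target window absorb the insertion; later windows re-align.
shifted = [overlap(source[:WIN], target[:WIN + INS])]
shifted += [overlap(source[k:k + WIN], target[k + INS:k + INS + WIN])
            for k in range(WIN, 4 * WIN, WIN)]

print([round(f, 2) for f in fixed])    # ~[0.9, 0.9, 0.9, 0.9]
print([round(f, 2) for f in shifted])  # ~[0.91, 1.0, 1.0, 1.0]
```

With fixed windows the 10% loss recurs in every window, exactly the
"10% non-commonality" figure above; with the adjusted first window
only the window containing the genuinely new bytes pays anything.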
Received on Sat Oct 21 14:36:06 2006