On Thu, 2005-02-10 at 10:12, Daniel Berlin wrote:
> The worst part, is that having stared at the delta combiner for quite a
> while myself, i'm actually not sure you can do anything about this
> either. I believe that handling source ops from the second window (or
> target ops from the first) is just going to be O(n^2) because of what is
> *has* to do to make it work right.
Perhaps the long-term answer is to stop using random-access deltas, and
instead use a generic compression algorithm on the output of our delta
I'll explain that in greater detail for those who aren't Dan Berlin and
Branko. Right now our deltas are allowed to use instructions like:
Copy N bytes from location <x> in source
Copy N bytes from location <x> in already-reconstructed target
Copy N bytes from new data, and advance new data pointer
(There's no offset for a "new data" copy, because "copy from
already-reconstructed target" handles the case of wanting to copy
earlier bytes from the new data area.) Compare this to gnu diff, which
is allowed to use instructions like:
Copy N lines from source, and advance source pointer
Skip N lines from source (advance source pointer w/o copying)
Copy N lines from new data, and advance new data pointer
The difference is not just that gnu diff uses lines and we use bytes;
it's that we're allowed to use random access to the source and target,
while gnu diff is not. Random access allows us to produce more
efficient deltas--for instance, it allows us to get some
self-compression if the delta source is the empty stream. So if we were
to sacrifice random access, we'd want to compress the output of the
delta routine in order to get savings in the "new data" segment. The
end result might be more or less space efficient than what we have now.
By sacrificing random access, we could get the following benefits:
* We could get rid of delta windowing, which is a source of
complexity. (Obviously not before 2.0, but I'm not sure if we can
realize any of these benefits before 2.0.) Eliminating windowing also
means we could get much better results for some changes, like "insert
128K of data at the beginning of the file".
* We could start using deltas for the blame algorithm, like CVS does,
although it would be a little trickier. Skip-deltas might totally
negate this benefit, unfortunately.
* Our delta combiner would have an easier time working efficiently.
Of course, if we eliminate random access, then we need a completely
different binary delta algorithm. But it sounds like xdelta may be such
an algorithm and it can be implemented in a small amount of code, so
that may not be a problem.
To unsubscribe, e-mail: firstname.lastname@example.org
For additional commands, e-mail: email@example.com
Received on Thu Feb 10 19:11:25 2005