On Tue, 2010-08-17 at 09:26 -0400, Johan Corveleyn wrote:
> Greg, could you explain a bit more what you mean with
> "edit-stream-style binary diffs", vs. the binary deltas we have now?
> Could you perhaps give an example similar to Julian's? Wouldn't you
> have the same problem with pieces of the source text being copied
> out-of-order (100 bytes from the end/middle of the source being copied
> to the beginning of the target, followed by the rest of the source)?
Let's take a look at the differences between a line-based edit stream
diff (such as what you'd see in the output of diff -e) and a binary
delta as we have in Subversion.
The most obvious difference is that the former traffics in lines, rather
than arbitrary byte ranges, but the actual differences are much deeper.
A line-based diff can be modeled with the following "instructions":
* Copy the next N lines of source to target.
* Skip the next N lines of source.
* Copy the next N lines of new data to target.
After applying a diff like this, you can easily divide the target lines
into two categories: those which originated from the source, and those
which originated from the diff. The division may not accurately
represent the intent of the change (there's the classic problem of the
mis-attributed close brace, for instance; see
http://bramcohen.livejournal.com/73318.html), but it's typically pretty
Subversion binary deltas have a more flexible instruction set, more akin
to what you'd find in a compression algorithm. The source and target
are chopped up into windows, and for each window you have:
* Copy N bytes from offset O in the source window to target.
* Copy N bytes from offset O in the target window to target.
* Copy the next N bytes of new data to target.
There is no easy way to divide the target into source bytes and diff
bytes. Certainly, you can tag which bytes were copied from the source
window, but that's meaningless. Bytes which came from the source window
may have been rearranged by the diff; bytes which came from new data may
only have done so because of windowing.
The optimization idea is to create a new kind of diff (or more likely,
research an existing algorithm) which obeys the rules of the line-based
edit stream--no windowing, sequential access only into the source
stream--but traffics in bytes instead of lines. With such a diff in
hand, we can divide the target bytes into source-origin and diff-origin,
and then, after splitting the target into lines, determine which lines
are "tainted" by diff-origin bytes and therefore should be viewed as
originating in the diff.
Received on 2010-08-17 17:32:19 CEST