Here is my proposal for an alternative diff format to vcdiff. I call
if svndiff. It's based on many of the ideas from vcdiff, but is much
simpler. I'm going to start implementing this (though I won't
actually hook it up until milestone 1 is out) unless people tell me
it's a bad idea.
Like vcdiff, an svndiff consists of a four-byte identifying string
followed by a series of windows, until the data runs out. The
four-byte identifier will be { 'S', 'V', 'N', 0 }, where 0 is a
version number. I could be convinced to nuke the identifying string
(gets rid of one state in the parsing machine), but I think it's a
little antisocial to define a binary format without a predictable
beginning.
Also like vcdiff, integers are generally encoded in a variable-length
fashion. The high bit of each byte is a continuation bit and the
other seven bits are data. Higher-order bytes come earlier. So 129
would be encoded as two bytes, { 0b10000001, 0b00000001 }.
Implementations must be able to handle numbers up to 32 bytes, which
means you can safely use source and target views up to 4GB in size.
A window has the following format:
Offset of the source view
Length of the source view
Length of the target view
Length of the instruction set in bytes
Length of the new data set in bytes
The instructions
The new data set
The instruction encoding: the high-order two bits of the first byte
are an instruction selector, and must be one of:
00 Copy from source view
01 Copy from target view
10 Copy from new data
The length of the copy is encoded specially: the six low-order bits of
the first byte are used as a continuation bit and five bits of length
data, and then remaining bytes are interpreted normally according to
the integer encoding. Following the length comes the offset of the
copy in the specified view, encoded normally.
So, for a simple example, the instruction "copy four bytes from offset
0 in the new data" would be encoded as two bytes (brackets denote byte
boundaries; spaces are for clarity only):
[10 0 00100] [0 0000000]
^^ ^ ^^^^^ ^ ^^^^^^^
| | | | |
| | | | offset
| | | continuation bit for offset
| | length
| continuation bit for length
selector
As a slightly more complicated example, "copy 2000 bytes from offset
20480 in the target view" would be encoded as five bytes:
<------length-----> <-------------offset------------>
[01 1 01111] [0 1010000] [1 0000001] [1 0100000] [0 0000000]
That's it. We won't be able to get deltas quite as compact as in
vcdiff because there are no games to play with address caches,
addressing modes, or tailored instruction code sets. But it's still
pretty compact.
Received on Sat Oct 21 14:36:10 2006