Warning: I haven't actually reviewed the patch (it's too big, or my
attention span is too short :-), but judging by the description, I think
it's *very* nice.
Dan -- do you have profiling data to compare between the current
implementation and your patch? If you do, I'd really like to see them.
I'd also like to compare the times for gzip vs. svndiff1, although
that's a bit harder to do ...
Daniel Berlin wrote:
>This patch differs from the previous one in a few ways.
>1. The commenting isn't running commentary anymore, i actually went back
>and wrote the comments (for most places, anyway).
>2. Address data is now seperated out into a seperate section, like
>instructions and new data is. Address data is, on average, 2x the size of
>the instructions, and always dominates the size of the deltas. Seperating
>it out allows us to compress it better (i'll get to this in a moment).
>3. The version number is now stored properly in the DB. I put it between
>the svndiff marker and the offset. It didn't seem right to put it near
>delta, since the version is a property of svndiff.
>This is done in a backwards compatible way, such that if it's missing, we
>assume it's version 0.
>Because the svndiff header appears per-window, i had to move where we send
>the header through down into the loop, and we just take care to avoid
>sending it more than once.
>The advantage of this is that we can (but don't yet) check to make sure we
>don't have mixed versions of svndiff data in one delta, and do something smart if
>4. We now range encode (almost-arithmetic encoding, but not patented) the
>instructions/new data/address data. This buys us quite a bit for almost
>no cost. We now beat gzip consistently. We actually usually beat bzip2 at
>the same blocksize as our SVN_STREAM_CHUNK_SIZE. For those curious why i
>didn't take a Public Domain implementation of adaptive huffman or
>something, it's because range encoding is general .1% worse than arithmetic encoding, but
>twice as fast. By comparison, adaptive huffman is notoriously slow in
>practice, and won't beat range encoding anyway (since huffman uses an
>integral number of bits). The implementation is a modified version of a
>public domain implementation. I specifically chose range encoding
>because it's not patented (and it was introduced in 197x),
>like arithmetic encoding is, and because it's fast.
>Profiling shows the range encoding is less than 2% of the time spent.
>Fer instance, on a 15 meg changelog svndiff'd against an empty file,
>vdelta takes 19 seconds, and the range encoder .68 seconds.
>For those who think this is overkill, vcodex compresses the
>instructions/data/new data as well.
>There are still a few things I need to do before i'd like to see this
>officially reviewed, i'm just posting it here so it's here, more than
Brane Čibej <brane_at_xbc.nu> http://www.xbc.nu/brane/
To unsubscribe, e-mail: firstname.lastname@example.org
For additional commands, e-mail: email@example.com
Received on Sat Oct 21 14:37:10 2006