On Tue, 2002-02-26 at 12:48, Branko ╚ibej wrote:
> Warning: I haven't actually reviewed the patch (it's too big, or my
> attention span is too short :-), but judging by the description, I think
> it's *very* nice.
> Dan -- do you have profiling data to compare between the current
> implementation and your patch?
Okay, i get the feeling people think i've replaced vdelta.
I have not.
I've simply compressed the instruction/data/address stream it outputs.
vdelta is, as we all know, dictionary based.
The output of it is reasonably compressable, once you seperate it out
into the components (instruction stream/address stream/plain text).
So i've compressed it, which buys us quite a bit.
My point with the .68 vs 19 seconds is that the extra compression done
doesn't slow us down at all.
If you do, I'd really like to see them.
> I'd also like to compare the times for gzip vs. svndiff1, although
> that's a bit harder to do
It's hard to say, since gzip is more optimized than our vdelta
> Anyway, thanks!
> Daniel Berlin wrote:
> >This patch differs from the previous one in a few ways.
> >1. The commenting isn't running commentary anymore, i actually went back
> >and wrote the comments (for most places, anyway).
> >2. Address data is now seperated out into a seperate section, like
> >instructions and new data is. Address data is, on average, 2x the size of
> >the instructions, and always dominates the size of the deltas. Seperating
> >it out allows us to compress it better (i'll get to this in a moment).
> >3. The version number is now stored properly in the DB. I put it between
> >the svndiff marker and the offset. It didn't seem right to put it near
> >delta, since the version is a property of svndiff.
> >This is done in a backwards compatible way, such that if it's missing, we
> >assume it's version 0.
> >Because the svndiff header appears per-window, i had to move where we send
> >the header through down into the loop, and we just take care to avoid
> >sending it more than once.
> >The advantage of this is that we can (but don't yet) check to make sure we
> >don't have mixed versions of svndiff data in one delta, and do something smart if
> >we do.
> >4. We now range encode (almost-arithmetic encoding, but not patented) the
> >instructions/new data/address data. This buys us quite a bit for almost
> >no cost. We now beat gzip consistently. We actually usually beat bzip2 at
> >the same blocksize as our SVN_STREAM_CHUNK_SIZE. For those curious why i
> >didn't take a Public Domain implementation of adaptive huffman or
> >something, it's because range encoding is general .1% worse than arithmetic encoding, but
> >twice as fast. By comparison, adaptive huffman is notoriously slow in
> >practice, and won't beat range encoding anyway (since huffman uses an
> >integral number of bits). The implementation is a modified version of a
> >public domain implementation. I specifically chose range encoding
> >because it's not patented (and it was introduced in 197x),
> >like arithmetic encoding is, and because it's fast.
> >Profiling shows the range encoding is less than 2% of the time spent.
> >Fer instance, on a 15 meg changelog svndiff'd against an empty file,
> >vdelta takes 19 seconds, and the range encoder .68 seconds.
> >For those who think this is overkill, vcodex compresses the
> >instructions/data/new data as well.
> >There are still a few things I need to do before i'd like to see this
> >officially reviewed, i'm just posting it here so it's here, more than
> >anything else.
> > --Dan
> Brane ╚ibej <brane_at_xbc.nu> http://www.xbc.nu/brane/
To unsubscribe, e-mail: email@example.com
For additional commands, e-mail: firstname.lastname@example.org
Received on Sat Oct 21 14:37:10 2006