> Also, our SVNDIFF format is quite good. HTTP can also (transparently)
> add GZIP encoding on top of that automatically. The GZIP will
> squeeze our diffs down, but also original checkouts, too!

A couple of points here:

* svndiff will compress original checkouts without additional
  gzipping. You just make a diff against an empty source.
  (It's not as good a compressor as gzip, of course.)
* Although gzip can compress svndiff output somewhat, it can't
  do so as well as it could compress the original data. So,
  my previous point notwithstanding, you'd be better off *not*
  using svndiff for an original checkout if you know your
  transport is going to be gzipped.

Here are some data points using the first of my elc data sets (all
numbers are for performing the operations on each file individually
and then totalling the results; diff operations were performed against
a source of /dev/null):

    Raw size:             6912321
    svndiff alone:        3471614 (50%)
    svndiff+gzip:         2976929 (43%)
    gzip alone:           2541305 (37%)
    svndiff+base64:       4690465 (68%)
    svndiff+base64+gzip:  3332093 (48%)
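
For anyone who wants to check the arithmetic or repeat the gzip and
base64 measurements on their own files, here is a minimal Python
sketch. It is not the script used for the numbers above. The names
`REPORTED`, `measure` and `percent_of_raw` are made up for
illustration, the svndiff rows are left out of `measure` because they
need a real svndiff encoder, and the use of line-wrapped base64 and
Python's default gzip level are assumptions, since neither is stated
above.

```python
import base64
import gzip

# Totals copied from the table above (bytes).
REPORTED = {
    "raw": 6912321,
    "svndiff": 3471614,
    "svndiff+gzip": 2976929,
    "gzip": 2541305,
    "svndiff+base64": 4690465,
    "svndiff+base64+gzip": 3332093,
}


def percent_of_raw(totals):
    """Express every total as a whole-number percentage of the raw size."""
    return {name: round(100 * size / totals["raw"])
            for name, size in totals.items()}


def measure(blobs):
    """Process each byte string individually, then total the sizes.

    Assumption: base64 is MIME-style with a newline every 76 characters,
    and gzip runs at Python's default compression level.
    """
    totals = {"raw": 0, "gzip": 0, "base64": 0, "base64+gzip": 0}
    for data in blobs:
        encoded = base64.encodebytes(data)
        totals["raw"] += len(data)
        totals["gzip"] += len(gzip.compress(data))
        totals["base64"] += len(encoded)
        totals["base64+gzip"] += len(gzip.compress(encoded))
    return totals


if __name__ == "__main__":
    for name, pct in percent_of_raw(REPORTED).items():
        print(f"{name}: {pct}%")
```

Running `percent_of_raw(REPORTED)` gives 50, 43, 37, 68 and 48 for the
five non-raw rows, matching the table. To measure real files, pass
`measure` a list of their contents read in binary mode.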

I don't think network bandwidth usage really drives performance
as seen by the end user, but I thought I'd pipe up anyway. We're
unlikely to do worse than the CVS pserver unless HTTP overhead becomes
really cumbersome.

As long as I'm talking about performance, I'll note that I took a look
at one of the .elc files I was using, and .elc files are a really poor
example of "binary data." You mostly get a lot of doc strings (which
contain newlines) interspersed with short amounts of bytecode, so
apart from the presence of funny characters, they look a lot like text
files as far as diff is concerned. I plan to create some better
binary test data by compiling two versions of a program and comparing
the resulting object files (yet another idea stolen from the Hunt
paper).

I've also discovered that you can save about 5% of total output size
by using the fourth possible instruction code for "copy with offset
relative to last copy instruction", but I didn't think a 5% savings
was worth the complexity.
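
To illustrate why a relative copy offset can be smaller on the wire,
here is a rough Python sketch. It is not the experiment described
above and it does not reproduce the 5% figure. It assumes offsets are
written as variable-length integers carrying 7 payload bits per byte,
that "relative" means relative to the end of the previous copy, and
that a signed delta is folded to a non-negative number with a zigzag
mapping. The zigzag step and the names `varint_len`, `zigzag` and
`offset_bytes` are hypothetical, chosen only for this example.

```python
def varint_len(n):
    """Bytes needed for a non-negative integer at 7 payload bits per byte."""
    if n < 0:
        raise ValueError("varint_len expects a non-negative integer")
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length


def zigzag(n):
    """Fold a signed integer into a non-negative one (hypothetical choice)."""
    return 2 * n if n >= 0 else -2 * n - 1


def offset_bytes(copies):
    """Total offset bytes for a list of (offset, length) copy instructions.

    Returns (absolute_total, relative_total), where the relative form
    encodes each offset as a delta from the end of the previous copy.
    """
    absolute = relative = 0
    prev_end = 0
    for offset, length in copies:
        absolute += varint_len(offset)
        relative += varint_len(zigzag(offset - prev_end))
        prev_end = offset + length
    return absolute, relative


if __name__ == "__main__":
    # Three copies that land close together, far into the source.
    print(offset_bytes([(100000, 50), (100060, 40), (100110, 30)]))
```

With these three copies the absolute offsets take 9 bytes in total and
the relative ones take 5. The saving only appears when consecutive
copies land near each other, and whether that is common depends on the
data.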
Received on Sat Oct 21 14:36:12 2006