May I just say: worrying about the interaction of svndiff with gzip,
and the relative merits of using one or the other, is premature. In
the immortal words of Knuth (sometimes attributed to Hoare):
"Premature optimization is the root of all evil."
Don't worry about it. IMHO, use svndiff for everything right now, if
that's most convenient, even initial checkouts/checkins. Let the
people doing the transport layer try gzipping it and see whether they
get an improvement. If Subversion sometimes tries, ineffectively, to
compress non-redundant data, it's not the end of the world. The time
to squeeze every last drop out of network performance is in the
future.
-K
Greg Hudson <ghudson@MIT.EDU> writes:
> > Also, our SVNDIFF format is quite good. HTTP can also (transparently)
> > add GZIP encoding on top of that automatically. The GZIP will
> > squeeze our diffs down, but also original checkouts, too!
>
> A couple of points here:
>
> * svndiff will compress original checkouts without additional
> gzipping. You just make a diff against an empty source.
> (It's not as good a compressor as gzip, of course.)
>
> * Although gzip can compress svndiff output somewhat, it can't
> do so as well as it could compress the original data. So,
> my previous point notwithstanding, you'd be better off *not*
> using svndiff for an original checkout if you know your
> transport is going to be gzipped.
>
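To make that second point concrete without an svndiff encoder handy,
here's a rough stand-in experiment in Python. gzip plays the role of
the first-stage compressor (svndiff output obviously isn't gzip
output, and the file name below is just a placeholder), but the
principle is the one Greg describes: once the redundancy has been
squeezed out, a second compression pass finds little left to remove.

    import gzip

    # Any test file will do; the name here is made up.
    with open("some-file.elc", "rb") as f:
        raw = f.read()

    once = gzip.compress(raw)     # first pass removes most redundancy
    twice = gzip.compress(once)   # second pass has little left to find

    print(len(raw), len(once), len(twice))
    # len(once) is far smaller than len(raw), while len(twice) is about
    # the same as len(once) -- the same effect as gzipping svndiff output.
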
> Here are some data points using the first of my elc data sets (all
> numbers are for performing the operations on each file individually
> and then totalling the results; diff operations were performed against
> a source of /dev/null):
>
> Raw size: 6912321
> svndiff alone: 3471614 (50%)
> svndiff+gzip: 2976929 (43%)
> gzip alone: 2541305 (37%)
>
> svndiff+base64: 4690465 (68%)
> svndiff+base64+gzip: 3332093 (48%)
>
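For anyone who wants to reproduce numbers like these, a sketch of the
per-file bookkeeping is below; it only covers the gzip and base64
stages, since the svndiff rows would need the actual encoder dropped
into the same loop. The "*.elc" glob is just whatever test set is at
hand.

    import base64, glob, gzip

    files = glob.glob("*.elc")        # the test set to measure
    raw = gz = b64 = b64_gz = 0

    for name in files:
        with open(name, "rb") as f:
            data = f.read()
        raw += len(data)                             # raw size
        gz += len(gzip.compress(data))               # gzip alone
        encoded = base64.b64encode(data)
        b64 += len(encoded)                          # base64 alone
        b64_gz += len(gzip.compress(encoded))        # base64+gzip

    for label, total in [("raw", raw), ("gzip alone", gz),
                         ("base64", b64), ("base64+gzip", b64_gz)]:
        print("%-12s %10d (%d%%)" % (label, total, 100 * total // max(raw, 1)))
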
> I don't think network bandwidth usage really drives performance
> as seen by the end user, but I thought I'd pipe up anyway. We're
> unlikely to do worse than the CVS pserver unless HTTP overhead becomes
> really cumbersome.
>
> As long as I'm talking about performance, I'll note that I took a look
> at one of the .elc files I was using, and .elc files are a really poor
> example of "binary data." You mostly get a lot of doc strings (which
> contain newlines) interspersed with short stretches of bytecode, so
> apart from the presence of funny characters, they look a lot like text
> files as far as diff is concerned. I plan to create some better
> binary test data by compiling two versions of a program and comparing
> the resulting object files (yet another idea stolen from the Hunt
> paper). I've also discovered that you can save about 5% of total
> output size by using the fourth possible instruction code for "copy
> with offset relative to last copy instruction", but I didn't think a
> 5% savings was worth the complexity.
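
For readers who haven't looked at the instruction encoding being
discussed: the opcode sits in the top two bits of the instruction
byte, so only four codes are possible and svndiff defines three of
them, which is what leaves a fourth code free for the relative-copy
idea Greg mentions. The sketch below reflects my reading of the draft
format; the real parser lives in C and the details (especially while
the format is still being designed) may differ.

    # Decode a single svndiff-style instruction byte (sketch only).
    OPS = {
        0: "copy from source view",
        1: "copy from target view",
        2: "copy from new data",
        3: "unused -- candidate: copy at an offset relative to the last copy",
    }

    def describe_instruction(byte):
        op = byte >> 6         # top two bits select the operation
        length = byte & 0x3F   # low six bits; 0 means a varint length follows
        return OPS[op], length

    print(describe_instruction(0x45))   # ('copy from target view', 5)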