Greg Stein <gstein@lyra.org> writes:
> Um. From a diff'ing neophyte, this seems to beg the question: why don't we
> just use diff+gzip instead of spending a bunch of time on a new format?
Two reasons:
- We know, for sure, that we can do better than that anyway. Diff
  format includes the data deleted from the src string, so it
  carries more data than we need. Gzipping makes it smaller, but
  we could just as well gzip whatever we produce (except that our
  output will probably be quite compact already, so gzipping it
  may not be needed).
- Diff performs unacceptably on binary data.
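
To make the first point concrete, here is a minimal sketch (not the vdelta code discussed in this thread, and not Greg Hudson's output format). It uses Python's difflib to build a unified diff and count the deleted source bytes that diff carries along, then builds a toy copy/insert delta that refers to the source by offset and so never includes deleted data. The function names and the tuple-based delta layout are hypothetical, chosen only for illustration.

```python
import difflib
import gzip


def unified_diff_bytes(src: str, dst: str) -> bytes:
    """Return a unified diff of src -> dst as bytes (similar to `diff -u`)."""
    lines = difflib.unified_diff(
        src.splitlines(keepends=True),
        dst.splitlines(keepends=True),
        fromfile="src",
        tofile="dst",
    )
    return "".join(lines).encode()


def deleted_bytes_in_diff(src: str, dst: str) -> int:
    """Count payload bytes on '-' lines: source data the diff carries along.

    Assumes every input line ends with a newline.
    """
    lines = unified_diff_bytes(src, dst).decode().splitlines(keepends=True)
    # The first two lines are the '---' / '+++' file headers, not payload.
    return sum(len(line) - 1 for line in lines[2:] if line.startswith("-"))


def copy_insert_delta(src: str, dst: str) -> list:
    """Toy delta (hypothetical layout): copy from src by offset, insert new data."""
    ops = []
    matcher = difflib.SequenceMatcher(None, src, dst, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))
        elif tag in ("replace", "insert"):
            ops.append(("insert", dst[j1:j2]))
        # 'delete' emits nothing: rebuilding dst never needs the removed data.
    return ops


def apply_delta(src: str, ops: list) -> str:
    """Rebuild the target string from src plus the toy delta."""
    out = []
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out.append(src[offset:offset + length])
        else:
            out.append(op[1])
    return "".join(out)


if __name__ == "__main__":
    src = "alpha\nbeta\ngamma\n"
    dst = "alpha\ngamma\n"
    diff = unified_diff_bytes(src, dst)
    print("diff bytes:", len(diff))
    print("diff+gzip bytes:", len(gzip.compress(diff)))
    print("deleted bytes carried by diff:", deleted_bytes_in_diff(src, dst))
    print("toy delta ops:", copy_insert_delta(src, dst))
```

On inputs this small the gzip header dominates, so the byte counts say nothing about real-world performance; the sketch only shows where the extra data in a textual diff comes from.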
What's going on here is that there's some bug in our diffing code,
such that the diffs being produced are way larger than necessary.
Greg H and Branko will need more time to figure it out, which is fine.
But there's no doubt that, once properly implemented, vdelta
(represented in any reasonable format) produces much smaller diffs
than GNU diff or similar textual diff programs.
-K
> On Fri, Oct 06, 2000 at 06:56:57PM -0500, Karl Fogel wrote:
> > Wow. Pretty interesting.
> >
> > Thanks for doing these tests -- we should make sure that any format we
> > come up with is at least better than gdiff+gzip! :-) (Not to mention
> > within shouting distance of vcdiff.)
> >
> > I think Branko is on Eastern European time, so he probably hasn't seen
> > this thread yet.
> >
> > Greg Hudson <ghudson@mit.edu> writes:
> > > Well, I have a little bit of data on differencing performance using my
> > > proposed output format. I compared the gcc 2.95 and gcc 2.95.2
> > > sources against each other (just the .c files in the gcc subdir), and
> > > found:
> > >
> > > >                              Bytes
> > > > File-by-file, ours:          15828
> > > > File-by-file, diff+gzip:      9989
> > > > Concatenated, ours:          15206
> > > > Concatenated, diff+gzip:      6015
> > >
> > > That wasn't too encouraging, so I decided to try some binary data. I
> > > tried the .elc files which were present in both emacs 19.34b and emacs
> > > 20.7.
> > >
> > > >                              Bytes
> > > > File-by-file, ours:        2630361
> > > > File-by-file, diff+gzip:   2246212
> > > > Concatenated, ours:        4457134
> > > > Concatenated, diff+gzip:   2029833
> > >
> > > That wasn't too encouraging either. I'd like to know whether the
> > > problem is with our vdelta code, our window size, or my output format.
> > > Unfortunately, Branko does not seem to have had time to debug his
> > > generator (it dumps core when you use it, and after I fixed the first
> > > bug the next bug looked difficult to fix), so I can't eliminate the
> > > output format as a variable. I'm going to write to Phong Vo and ask
> > > whether they ever released that library he mentioned the last time I
> > > talked to him; maybe that will turn up a good source of comparative
> > > data and ideas.
> > >
> > > (And yes, I reenabled the call to vdelta before running these
> > > tests. :) Otherwise it would take a lot more than 15828 bytes to
> > > describe how to reconstruct the gcc sources.)
>
> --
> Greg Stein, http://www.lyra.org/
Received on Sat Oct 21 14:36:10 2006