Well, I have a little bit of data on differencing performance using my
proposed output format. I compared the gcc 2.95 and gcc 2.95.2
sources against each other (just the .c files in the gcc subdir), and
found:
                                  Bytes
    File-by-file, ours:           15828
    File-by-file, diff+gzip:       9989
    Concatenated, ours:           15206
    Concatenated, diff+gzip:       6015
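
For anyone who wants to reproduce the diff+gzip baseline on their own
trees, here is a rough Python sketch of how such numbers can be
measured. It is not the script used for the figures above. The function
names are made up for illustration, it uses difflib's unified diff
rather than the diff program, and it assumes "concatenated" means the
sources were joined before diffing (not that the per-file diffs were
joined before compressing).

```python
import difflib
import gzip


def diff_gzip_size(old_text: str, new_text: str) -> int:
    """Size in bytes of a gzip-compressed unified diff of two texts."""
    diff_lines = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile="old",
        tofile="new",
    )
    diff_bytes = "".join(diff_lines).encode("utf-8")
    return len(gzip.compress(diff_bytes))


def file_by_file_size(pairs) -> int:
    """Sum of per-file sizes; each file pays its own gzip overhead."""
    return sum(diff_gzip_size(old, new) for old, new in pairs)


def concatenated_size(pairs) -> int:
    """Size when all old files and all new files are joined first.

    Assumption: files are joined in the same order on both sides.
    """
    old_all = "".join(old for old, _ in pairs)
    new_all = "".join(new for _, new in pairs)
    return diff_gzip_size(old_all, new_all)
```

Here pairs is a list of (old contents, new contents) tuples, one per
file present in both trees. Since every gzip stream carries a fixed
header and trailer and starts with an empty dictionary, the file-by-file
total should normally come out larger than the concatenated one.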
That wasn't too encouraging, so I decided to try some binary data. I
tried the .elc files which were present in both emacs 19.34b and emacs
20.7.
                                  Bytes
    File-by-file, ours:         2630361
    File-by-file, diff+gzip:    2246212
    Concatenated, ours:         4457134
    Concatenated, diff+gzip:    2029833
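
To put the two tables on a common scale, the sizes can be turned into
ratios of our output to the diff+gzip output. The snippet below is just
that arithmetic on the numbers already given; the labels and the
function name are mine.

```python
# (ours, diff+gzip) byte counts copied from the two tables above.
results = {
    "gcc .c, file-by-file": (15828, 9989),
    "gcc .c, concatenated": (15206, 6015),
    "emacs .elc, file-by-file": (2630361, 2246212),
    "emacs .elc, concatenated": (4457134, 2029833),
}


def overhead_ratio(ours: int, baseline: int) -> float:
    """How many times larger our output is than the baseline."""
    return ours / baseline


for label, (ours, baseline) in results.items():
    print(f"{label}: {overhead_ratio(ours, baseline):.2f}x")
```

That works out to roughly 1.6x and 2.5x for the gcc sources, and
roughly 1.2x and 2.2x for the .elc files.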
That wasn't too encouraging either. I'd like to know whether the
problem is with our vdelta code, our window size, or my output format.
Unfortunately, Branko does not seem to have had time to debug his
generator (it dumps core when you use it, and after I fixed the first
bug the next bug looked difficult to fix), so I can't eliminate the
output format as a variable. I'm going to write to Phong Vo and ask
whether he ever released the library he mentioned the last time I
talked to him; maybe that will turn up a good source of comparative
data and ideas.
(And yes, I reenabled the call to vdelta before running these
tests. :) Otherwise it would take a lot more than 15828 bytes to
describe how to reconstruct the gcc sources.)
Received on Sat Oct 21 14:36:10 2006