Hi. A couple of times I've said we are performing 35 times worse than
Phong Vo's vcdiff implementation and I didn't know why. Well, I now
think that we're only doing that badly on some fairly brutal inputs,
and that we can do better on such inputs by increasing our window size
a little bit.
Here is a chart of svndiff performance using the difference between
gcc-2.95.1.tar and gcc-2.95.2.tar as inputs. (These are the inputs
Phong Vo used in the data he sent me, and they are also interesting
because there is presumably significant data movement between two
versions of a tar file.) These tarfiles are about 53MB apiece. Our
current window size (by which I mean the target view size) is 16K. We
actually pass twice as much source data as target data to the vdelta
routine to help deal with data movement.
Window size (K) Diff size (K) Time (seconds)
16 9904 67
32 2500 70
42 1102 66
64 123 72
85 102 80
100 98 86
128 96 91
256 92 125
512 89 197
1024 87 314
Times were recorded by summing the user and system times reported by
the "time" command, on a Sony Vaio PCG-F290 laptop (a Pentium II with
128MB memory, don't know the clock speed), with stuff compiled using
egcs-2.91.66 with the -O2 flag. I don't believe I was swapping at all
even with 1MB windows.
Phong Vo claimed to get a 273K diff in about 44 seconds; he didn't say
what hardware he was using or what kind of windowing his program uses.
You can see that we actually do substantially better on diff size than
his program did if we use a window size of 64K or greater.
Obviously, the optimal window size depends on the input data. On the
*.elc data set I've been using, expanding the window size from 16K all
the way up to 1MB only reduced the total diff size by 1.5%; on the
source data set, it reduced the total diff size by 28%. Quite a far
cry from the 99.1% reduction achieved with the gcc tar files.
Still, I'm going to increase the default window size to 100K. A
comparison of two similar tarfiles represents a kind of worst
reasonable case, and while I'm sure I could get a better picture of
that case by comparing other combinations of tarfiles, I don't want to
do that much of a research project.
Received on Sat Oct 21 14:36:10 2006