What if svndiff format recorded window sizes somewhere (either in a
single header, or at the front of each window of data), and Subversion
chose window sizes dynamically, based on the source & target file
sizes? After all, Subversion usually or always knows those sizes in
advance.
Then diff performance would be tuned to the data.
Greg Hudson <ghudson@mit.edu> writes:
> Hi. A couple of times I've said we are performing 35 times worse than
> Phong Vo's vcdiff implementation and I didn't know why. Well, I now
> think that we're only doing that badly on some fairly brutal inputs,
> and that we can do better on such inputs by increasing our window size
> a little bit.
>
> Here is a chart of svndiff performance using the difference between
> gcc-2.95.1.tar and gcc-2.95.2.tar as inputs. (These are the inputs
> Phong Vo used in the data he sent me, and they are also interesting
> because there is presumably significant data movement between two
> versions of a tar file.) These tarfiles are about 53MB apiece. Our
> current window size (by which I mean the target view size) is 16K. We
> actually pass twice as much source data as target data to the vdelta
> routine to help deal with data movement.
>
> Window size (K) Diff size (K) Time (seconds)
> 16 9904 67
> 32 2500 70
> 42 1102 66
> 64 123 72
> 85 102 80
> 100 98 86
> 128 96 91
> 256 92 125
> 512 89 197
> 1024 87 314
>
> Times were recorded by summing the user and system times reported by
> the "time" command, on a Sony Vaio PCG-F290 laptop (a Pentium II with
> 128MB memory, don't know the clock speed), with stuff compiled using
> egcs-2.91.66 with the -O2 flag. I don't believe I was swapping at all
> even with 1MB windows.
>
> Phong Vo claimed to get a 273K diff in about 44 seconds; he didn't say
> what hardware he was using or what kind of windowing his program uses.
> You can see that we actually do substantially better on diff size than
> his program did if we use a window size of 64K or greater.
>
> Obviously, the optimal window size depends on the input data. On the
> *.elc data set I've been using, expanding the window size from 16K all
> the way up to 1MB only reduced the total diff size by 1.5%; on the
> source data set, it reduced the total diff size by 28%. Quite a far
> cry from the 99.1% reduction achieved with the gcc tar files.
>
> Still, I'm going to increase the default window size to 100K. A
> comparison of two similar tarfiles represents a kind of worst
> reasonable case, and while I'm sure I could get a better picture of
> that case by comparing other combinations of tarfiles, I don't want to
> do that much of a research project.
Received on Sat Oct 21 14:36:11 2006