More diff performance notes

From: Greg Hudson <ghudson_at_mit.edu>
Date: 2000-10-09 17:42:01 CEST

I have some more performance data on diffs, mostly so I know what I
should try to optimize first. The "elc" data set is the .elc files
present in any subdirectory of "lisp" in both emacs 19.34b and emacs
20.7. The "src" data set is the files directly present in the "src"
subdirectory of both emacs 19.34b and emacs 20.7. All data was
collected by doing comparisons file-by-file and totalling the results.

                                           elc src
                                         ------- -------
        Number of files 397 169
        Number of windows 730 556
        Number of instructions 663747 282854
        Total length of new data 602542 155800
        Total encoded bytes 2630361 1045437
        diff -ae | gzip length 2246212 665046
        Total source bytes 6912321 6259304
        Total target bytes 8325651 7423620

Average instruction length 12.54 26.25
Average encoded bytes per instr 3.04 3.12

The average instruction length is the average "length" field of an
instruction opcode, easily computed by dividing the total target bytes
by the number of instructions. The average encoded instruction length
comes from the formula:

(encodedbytes - newdata - 4*files - 10*windows) / instructions

where the parenthesized expression counts the number of encoded bytes
which are encoding instructions. 4*files accounts for the four-byte
svndiff headers and 10*windows roughly accounts for the size of the
window headers.

My conclusions:

        * Instructions and other overhead account for 77% of the diff
          for binary files and 85% of the diff for source code. Any
          savings in the instruction encoding would have a noticeable
          impact on the size of the diffs.

        * It's going to be hard to save a lot on the instruction
          encoding. The optimizations described in the Hunt paper for
          the vdelta encoding might, if we're really lucky, get us
          down to an average of ~2 bytes per instruction. That would
          be enough to beat diff | gzip in the binary case, but not in
          the source code case.

        * An average block move length of 12-26 bytes doesn't seem
          very impressive, and I wonder how it compares to the vdelta
          algorithm as implemented by AT&T. (Everyone who writes
          papers on this stuff seems to have access to a vdelta
          program and library Phong Vo and company wrote back in 1994,
          which I really wish I could get my hands on.)
Received on Sat Oct 21 14:36:10 2006

This message: [ Message body ]
Next message: Jim Blandy: "Re: More diff performance notes"
Previous message: Greg Stein: "Re: CVS update: subversion/subversion/libsvn_fs/tests skel-test.c"
Next in thread: Jim Blandy: "Re: More diff performance notes"
Maybe reply: Jim Blandy: "Re: More diff performance notes"
Maybe reply: Jim Blandy: "Re: More diff performance notes"

Contemporary messages sorted: [ By Date ] [ By Thread ] [ By Subject ] [ By Author ] [ By messages with attachments ]