Further diff optimization ideas

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Tue, 15 Feb 2011 01:42:42 +0100

Hi,

Here are some more ideas for optimizing "svn diff". Maybe I should
start to write up a notes file, but for now I'm settling for a list in
a mail.

[ note: these ideas are not original, other people have observed some
of these as well (and discussed on IRC and all), and some inspiration
also comes from studying GNU diff's code, and reading some research
papers. I'm just putting them out here. ]

I. Speeding up "minimal" diff (no heuristics)
---------------------------------------------

1) Further low-level optimization of prefix/suffix scanning. [ Quick win ]

- Can speed up identical prefix/suffix processing by another 50% (rough guess).

2) Faster hash function for hashing lines. [ Quick win ]

- Currently, we use adler32 (after the line has been normalized).

- GNU diff uses some simple bit shifting scheme, which seems to be a
lot faster (maybe it has more hash-collisions, but that doesn't seem
to matter that much).

- A quick test with a similar bit-shifting hash, in an extra for-loop
running over all characters of a line, provided a big speedup (> 50%)
of the get_tokens step (which is a big factor after prefix/suffix has
been stripped).

- Maybe the fast hash algorithm FNV-1a, recently mentioned by Blair
in [1], could also be a good option (though it may be even too complex
for our purposes, simple bit-shifting might be even faster (as I said,
hash collisions aren't too bad)).

3) Choose size of token.c#svn_diff__tree_t roots array (roots for the
multiple-rooted token tree) dynamically, relative to the number of
lines in the file. Or just choose a fixed size bigger than 127. [
Quick win if just picking a larger fixed size ]

- GNU diff chooses a size dynamically, a prime number between 1/3rd
and 2/3rds the number of lines.

- OTOH, svn diff suffers not that much from the small size, because it
uses a binary search tree rooted at every element in the array (GNU
diff uses a simple linked list, so needs to compare "linearly" with
lines part of the same "equivalence class"). Still, simply increasing
the size of the array can have significant impact for big files,
without sacrificing memory too much IMHO.

4) Choosing a bigger CHUNK_SIZE in diff_file.c. [ Quick win ]

- Currently, we read data from the files in chunks of 128k (i.e. "1 <<
17"). I think we can safely increase that a couple of "shifts", to
512k or even 1 or 2 MB (what's 1 MB of RAM these days, considering
this is purely a client-side operation). For comparison: GNU diff
simply reads both files into memory in full (but I don't think we
should go that route, the chunked operation of svn diff is a really
nice feature IMO).

5) Discarding non-matching lines before running the LCS (Longest
Common Subsequence) algorithm. [ Lot of work ]

- With "non-matching lines", I mean lines which don't have a single
match in the other file. These lines can be immediately identified as
added or deleted lines. They can be "discarded" (noting the
adds/deletes), and the LCS algorithm can be run on the remainder. This
can make a *huge* impact on files which have a lot of different/unique
lines.

(for instance, an example script from the users list [2], which
generates two files with random lines, 3/4th of which are
non-matching: diff goes from several hours to a couple of seconds;
another example is a big file of which indentation style was changed
from tabs to spaces, and diffing without an ignore-whitespace option).

- GNU diff does this as well, but strangely, it does it only as part
of a heuristic (which also discards other "confusing lines"). If run
with --minimal, these lines are not discarded (though, AFAICS, there
should be no problem discarding the non-matching lines even when going
for a minimal diff).

II. Going for a non-minimal diff (i.e. heuristics)
--------------------------------------------------

I think prefix/suffix-scanning, together with the suggestions from I.,
will already make "svn diff" quite fast in most real-world cases. It
will normally be just as fast, if not faster than GNU "diff --minimal"
(barring low-level differences, and the advantage of GNU diff of doing
everything in memory).

If I.5. is implemented, "svn diff" will usually be faster than GNU
"diff --minimal", because GNU diff doesn't do that unless it's running
with heuristics (so I think in a lot of cases, "svn diff" will match
performance of GNU diff running with heuristics (which is its
default)).

Still, in some cases, heuristics can make a big difference (while not
guaranteeing that you'll get a minimal diff).

- Existing issue:
http://subversion.tigris.org/issues/show_bug.cgi?id=1966 (libsvn_diff
needs 'non-minimal-diff' mode.).

- Apart from implementing heuristics as part of the current algorithm
(which can quickly become quite complex), we could also take a look at
an alternative approach, such as "Patience Diff" [3], which is already
implemented in several other (D)VCS's. It has the potential to be much
faster (reducing the problem to calculating several, much smaller,
LCS's), and has the added advantage of often producing "nicer" diff
output.

Cheers,

-- 
Johan
[1] http://svn.haxx.se/dev/archive-2011-01/0263.shtml
[2] http://svn.haxx.se/users/archive-2011-01/0319.shtml
[3] http://bramcohen.livejournal.com/73318.html

Received on 2011-02-15 01:43:41 CET

This message: [ Message body ]
Next message: Subversion Translation Status: "[l10n] Translation status report for trunk r1070774"
Previous message: Gavin Beau Baumanis: "Re: Patches pending review"
Next in thread: Branko ÄŒibej: "Re: Further diff optimization ideas"
Reply: Branko ÄŒibej: "Re: Further diff optimization ideas"
Reply: Stefan Fuhrmann: "Re: Further diff optimization ideas"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]