
Re: Further diff optimization ideas

From: Stefan Fuhrmann <eqfox_at_web.de>
Date: Mon, 21 Feb 2011 08:55:27 +0100

On 15.02.2011 01:42, Johan Corveleyn wrote:
> Hi,
>
> Here are some more ideas for optimizing "svn diff". Maybe I should
> start to write up a notes file, but for now I'm settling for a list in
> a mail.
>
> [ note: these ideas are not original, other people have observed some
> of these as well (and discussed on IRC and all), and some inspiration
> also comes from studying GNU diff's code, and reading some research
> papers. I'm just putting them out here. ]
>
> I. Speeding up "minimal" diff (no heuristics)
> ---------------------------------------------
>
> 1) Further low-level optimization of prefix/suffix scanning. [ Quick win ]
>
> - Can speed up identical prefix/suffix processing by another 50% (rough guess).
Go for it. It's a quick win. Maybe defer the merge until 1.8.
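Just to make sure we mean the same thing: the kind of
low-level tweak I would expect here is comparing a machine
word at a time instead of single bytes for as long as the data
keeps matching. Untested sketch, function name made up (a
decent compiler turns the fixed-size memcmp into a single
word comparison):

  #include <string.h>   /* memcmp */
  #include <apr.h>      /* apr_size_t */

  /* Sketch: length of the identical prefix of buffers A and B,
   * comparing sizeof(apr_size_t) bytes at a time where possible. */
  static apr_size_t
  common_prefix_length(const char *a, const char *b, apr_size_t len)
  {
    apr_size_t i = 0;

    /* Whole words first; stop at the first mismatching word. */
    while (i + sizeof(apr_size_t) <= len
           && memcmp(a + i, b + i, sizeof(apr_size_t)) == 0)
      i += sizeof(apr_size_t);

    /* Finish byte-wise inside the mismatching word / the tail. */
    while (i < len && a[i] == b[i])
      i++;

    return i;
  }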
> 2) Faster hash function for hashing lines. [ Quick win ]
>
> - Currently, we use adler32 (after the line has been normalized).
>
> - GNU diff uses some simple bit shifting scheme, which seems to be a
> lot faster (maybe it has more hash-collisions, but that doesn't seem
> to matter that much).
>
> - A quick test with a similar bit-shifting hash, in an extra for-loop
> running over all characters of a line, provided a big speedup (> 50%)
> of the get_tokens step (which is a big factor after prefix/suffix has
> been stripped).
>
> - Maybe the fast hash algorithm FNV-1a, recently mentioned by Blair
> in [1], could also be a good option (though it may be even too complex
> for our purposes, simple bit-shifting might be even faster (as I said,
> hash collisions aren't too bad)).
Without having done any measurement on this myself,
my suspicion is that the actual hash function is not the
key factor. I guess the following are the issues:

* finding the end of the line is relatively expensive
* function calling overhead, including loop setup & finalization
   (typ. line length is 30 chars - measurement taken on kde.org)
* inserting the result in a hash container

The actual hash functions are not an issue as long as
they contain no DIV/MOD statement - not even at the
end of the process. To me, the keys to higher speed
would be:

* merge hash calculation with EOL scanning
* use a low collision rate / low overhead hash container
* call static functions only

I came up with some ideas that should increase the
throughput to 50+ million lines / sec / core (assuming
an average of 30 chars / line). But it's not for the faint
of heart and needs an extra post ;)
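To illustrate the first point, here is an untested sketch
(function name made up, the shift/xor hash is just an
example) of finding the end of the line and hashing it in
one pass:

  #include <apr.h>   /* apr_uint32_t */

  /* Sketch: scan the current line and hash it in the same pass.
   * Returns a pointer just past the terminating '\n' (or BUF_END)
   * and stores the line's hash in *HASH. */
  static const char *
  scan_and_hash_line(const char *buf, const char *buf_end,
                     apr_uint32_t *hash)
  {
    apr_uint32_t h = 0;

    while (buf < buf_end && *buf != '\n')
      {
        /* One shift/xor update per character, no DIV/MOD anywhere. */
        h = (h << 5) ^ (h >> 27) ^ (apr_uint32_t)(unsigned char)*buf;
        ++buf;
      }

    *hash = h;
    return buf < buf_end ? buf + 1 : buf;
  }

A static caller can then feed the result straight into the
token container without touching the line a second time.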
> 3) Choose size of token.c#svn_diff__tree_t roots array (roots for the
> multiple-rooted token tree) dynamically, relative to the number of
> lines in the file. Or just choose a fixed size bigger than 127. [
> Quick win if just picking a larger fixed size ]
Maybe 1013 would be a more suitable size (assuming
each entry consists of exactly one pointer). Larger
values push the hash out of the L1 cache.

If the hash function is reasonably close to an even,
independent distribution, an array size of 2^N can
be faster: The number of collisions should be the
same (good hash function) but the index is much
cheaper to calculate.

Plain Adler32 is not a good hash in that sense but
can easily be improved to

index = (a32 + (a32 >> 16)) & (2^N-1)
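In C, bucket selection would then look something like this
(ROOTS_SHIFT / ROOT_INDEX are made-up names):

  /* Fold the high 16 bits of the Adler-32 value into the low bits
   * before masking, so the bucket index depends on the whole sum. */
  #define ROOTS_SHIFT 10                        /* 2^10 = 1024 roots */
  #define ROOT_INDEX(a32) \
    (((a32) + ((a32) >> 16)) & ((1u << ROOTS_SHIFT) - 1))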
> - GNU diff chooses a size dynamically, a prime number between 1/3rd
> and 2/3rds the number of lines.
I.e., there will be ~2 entries per bucket on average.
> - OTOH, svn diff suffers not that much from the small size, because it
> uses a binary search tree rooted at every element in the array (GNU
> diff uses a simple linked list, so needs to compare "linearly" with
> lines part of the same "equivalence class"). Still, simply increasing
> the size of the array can have significant impact for big files,
> without sacrificing memory too much IMHO.
The tree only pays off once we get >4 entries per bucket on
average (i.e. >512 lines in total); below that, our lookup is
likely more expensive than a simple list scan. Also, a
singly-linked list uses less memory.

I wonder what the actual distribution looks like because
certain lines are very frequent (e.g. empty lines) and
will end up in the same bucket.
> 4) Choosing a bigger CHUNK_SIZE in diff_file.c. [ Quick win ]
>
> - Currently, we read data from the files in chunks of 128k (i.e. "1<<
> 17"). I think we can safely increase that a couple of "shifts", to
> 512k or even 1 or 2 MB (what's 1 MB of RAM these days, considering
> this is purely a client-side operation). For comparison: GNU diff
> simply reads both files into memory in full (but I don't think we
> should go that route, the chunked operation of svn diff is a really
> nice feature IMO).
I think ~1MB will be safe. We can keep 2 of these chunks
in L3 cache on virtually any modern CPU. That may be
relevant when comparing the actual line content.
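The change itself should be a one-liner, something along
these lines in diff_file.c (macro names from memory, so take
this as a sketch):

  /* was 17, i.e. 128kB chunks; 20 gives 1MB chunks */
  #define CHUNK_SHIFT 20
  #define CHUNK_SIZE (1 << CHUNK_SHIFT)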
> 5) Discarding non-matching lines before running the LCS (Longest
> Common Subsequence) algorithm. [ Lot of work ]
>
> - With "non-matching lines", I mean lines which don't have a single
> match in the other file. These lines can be immediately identified as
> added or deleted lines. They can be "discarded" (noting the
> adds/deletes), and the LCS algorithm can be run on the remainder. This
> can make a *huge* impact on files which have a lot of different/unique
> lines.
I can see how "definitely non-matching" lines are relevant,
but could you provide a sketch of how they are used to
subdivide (or whatever) the file with respect to the LCS code?

With mainly source code in mind, one would expect the
actual difference to contain a number of unique lines. Diff
could become an almost O(N) operation in that case.
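The detection half seems clear enough to me, roughly along
these lines (untested sketch, all names made up):

  #include <apr.h>            /* apr_uint32_t */
  #include "svn_types.h"      /* svn_boolean_t */

  /* Sketch: per-token bookkeeping to find lines without any match
   * in the other file.  COUNT[0] / COUNT[1] hold the number of
   * occurrences of this token in file 0 / file 1. */
  typedef struct token_info_t
  {
    apr_uint32_t hash;
    int count[2];
  } token_info_t;

  static svn_boolean_t
  token_is_unmatched(const token_info_t *token, int this_file)
  {
    /* A line that never occurs in the other file cannot be part of
     * any common subsequence, so it can be recorded as a plain
     * insert/delete up front and dropped from the LCS input. */
    return token->count[1 - this_file] == 0;
  }

It is the step after that, feeding the surviving lines back
into the LCS run, that I would like to see spelled out.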
> (for instance, an example script from the users list [2], which
> generates two files with random lines, 3/4th of which are
> non-matching: diff goes from several hours to a couple of seconds;
> another example is a big file of which indentation style was changed
> from tabs to spaces, and diffing without an ignore-whitespace option).
>
> - GNU diff does this as well, but strangely, it does it only as part
> of a heuristic (which also discards other "confusing lines"). If run
> with --minimal, these lines are not discarded (though, AFAICS, there
> should be no problem discarding the non-matching lines even when going
> for a minimal diff).
It is obvious how unique lines split the contents on both sides
individually. However, it may be hard to prove that a particular
alignment of the chunks in between on one side with those on
the other side yields a minimal result.
> II. Going for a non-minimal diff (i.e. heuristics)
> --------------------------------------------------
>
> I think prefix/suffix-scanning, together with the suggestions from I.,
> will already make "svn diff" quite fast in most real-world cases. It
> will normally be just as fast, if not faster than GNU "diff --minimal"
> (barring low-level differences, and the advantage of GNU diff of doing
> everything in memory).
>
> If I.5. is implemented, "svn diff" will usually be faster than GNU
> "diff --minimal", because GNU diff doesn't do that unless it's running
> with heuristics (so I think in a lot of cases, "svn diff" will match
> performance of GNU diff running with heuristics (which is its
> default)).
>
> Still, in some cases, heuristics can make a big difference (while not
> guaranteeing that you'll get a minimal diff).
>
> - Existing issue:
> http://subversion.tigris.org/issues/show_bug.cgi?id=1966 (libsvn_diff
> needs 'non-minimal-diff' mode.).
>
> - Apart from implementing heuristics as part of the current algorithm
> (which can quickly become quite complex), we could also take a look at
> an alternative approach, such as "Patience Diff" [3], which is already
> implemented in several other (D)VCS's. It has the potential to be much
> faster (reducing the problem to calculating several, much smaller,
> LCS's), and has the added advantage of often producing "nicer" diff
> output.
Patience Diff seems to be ideally suited for source code
diffs because function headers etc. provide many of the
"only one matching" lines evenly spread across the whole
file.
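For the record, the core of the idea is small enough to
sketch. Untested, all names made up, and O(n^2) purely for
illustration - real implementations use patience sorting for
the increasing-subsequence step:

  #include <stdlib.h>
  #include <apr.h>   /* apr_uint32_t */

  /* One anchor: line I of file A matches line J of file B, and that
   * line occurs exactly once in each file. */
  typedef struct anchor_t { int i, j; } anchor_t;

  /* Sketch of the core of Patience Diff: collect the lines that are
   * unique in both files, then keep the longest subset of those
   * matches that is in increasing order on both sides.  The regions
   * between two consecutive anchors can then be diffed independently
   * with much smaller LCS runs.  Returns the number of anchors
   * written to *ANCHORS (malloc'ed, caller frees). */
  static int
  patience_anchors(const apr_uint32_t *a, int na,
                   const apr_uint32_t *b, int nb,
                   anchor_t **anchors)
  {
    int cap = (na < nb ? na : nb) + 1;
    anchor_t *cand = malloc(sizeof(*cand) * cap);
    int *len = malloc(sizeof(*len) * cap);
    int *prev = malloc(sizeof(*prev) * cap);
    int n_cand = 0, n_anchors = 0, best = -1, i, j, k;

    /* 1. Candidates: lines unique in A that are also unique in B. */
    for (i = 0; i < na; i++)
      {
        int in_a = 0, in_b = 0, match = -1;
        for (k = 0; k < na; k++)
          in_a += (a[k] == a[i]);
        for (j = 0; j < nb; j++)
          if (b[j] == a[i])
            { in_b++; match = j; }
        if (in_a == 1 && in_b == 1)
          { cand[n_cand].i = i; cand[n_cand].j = match; n_cand++; }
      }

    /* 2. Longest increasing subsequence of the B positions
     * (candidates are already ordered by their A position). */
    for (i = 0; i < n_cand; i++)
      {
        len[i] = 1;
        prev[i] = -1;
        for (k = 0; k < i; k++)
          if (cand[k].j < cand[i].j && len[k] + 1 > len[i])
            { len[i] = len[k] + 1; prev[i] = k; }
        if (best < 0 || len[i] > len[best])
          best = i;
      }

    /* 3. Back-track to return the anchors in file order. */
    *anchors = malloc(sizeof(**anchors) * cap);
    if (best >= 0)
      {
        n_anchors = len[best];
        for (k = best, i = n_anchors - 1; k >= 0; k = prev[k], i--)
          (*anchors)[i] = cand[k];
      }

    free(cand);
    free(len);
    free(prev);
    return n_anchors;
  }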

-- Stefan^2.
Received on 2011-02-21 08:55:56 CET
