
Re: Further diff optimization ideas

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Wed, 23 Feb 2011 02:05:46 +0100

On Mon, Feb 21, 2011 at 8:55 AM, Stefan Fuhrmann <eqfox_at_web.de> wrote:
> On 15.02.2011 01:42, Johan Corveleyn wrote:
>>
>> Hi,
>>
>> Here are some more ideas for optimizing "svn diff". Maybe I should
>> start to write up a notes file, but for now I'm settling for a list in
>> a mail.
>>
>> [ note: these ideas are not original, other people have observed some
>> of these as well (and discussed on IRC and all), and some inspiration
>> also comes from studying GNU diff's code, and reading some research
>> papers. I'm just putting them out here. ]
>>
>> I. Speeding up "minimal" diff (no heuristics)
>> ---------------------------------------------
>>
>> 1) Further low-level optimization of prefix/suffix scanning. [ Quick win ]
>>
>> - Can speed up identical prefix/suffix processing by another 50% (rough
>> guess).
>
> Go for it. It's a quick win. Maybe defer the merge until 1.8.
>>
>> 2) Faster hash function for hashing lines. [ Quick win ]
>>
>> - Currently, we use adler32 (after the line has been normalized).
>>
>> - GNU diff uses some simple bit shifting scheme, which seems to be a
>> lot faster (maybe it has more hash-collisions, but that doesn't seem
>> to matter that much).
>>
>> - A quick test with a similar bit-shifting hash, in an extra for-loop
>> running over all characters of a line, provided a big speedup (>  50%)
>> of the get_tokens step (which is a big factor after prefix/suffix has
>> been stripped).
>>
>> - Maybe the fast hash algorithm FNV-1a, recently mentioned by Blair
>> in [1], could also be a good option (though it may even be too complex
>> for our purposes; simple bit-shifting might be faster still (as I said,
>> hash collisions aren't too bad)).
>
> Without having done any measurement on this myself,
> my suspicion is that the actual hash function is not the
> key factor. I guess the following are the issues:
>
> * finding the end of the line is relatively expensive
> * function calling overhead, including loop setup & finalization
>  (typ. line length is 30 chars - measurement taken on kde.org)
> * inserting the result in a hash container
>
> The actual hash functions are not an issue as long as
> they contain no DIV/MOD statement - not even at the
> end of the process. To me, the keys to higher speed
> would be:
>
> * merge hash calculation with EOL scanning
> * use a low collision rate / low overhead hash container
> * call static functions only
>
> I came up with some ideas that should increase the
> throughput to 50+ million lines / sec / core (assuming
> an average of 30 chars / line). But it's not for the faint
> of heart and needs an extra post ;)

After more discussion about this with stefan2 on IRC, I have done some
more measurements of the different hash variants (adler32 vs.
apr_hashfunc_default vs. the GNU diff HASH). It seems I was mistaken,
and you are correct, Stefan: the hash function itself is not a big
factor; in fact, the currently used adler32 was slightly faster than
the other two.

My initial assumption was based on old measurements I had done once,
with an artificial file full of lines with large random numbers. Not
exactly a representative example of source code or anything.
Apparently, it made a big difference on that example.

Now I retested with a file of random lines made up of random
characters (line lengths chosen randomly between 0 and 80 characters),
and there was hardly any difference between the hash functions.

So I'll just stop looking at this. Though I would certainly like to
see, one day, a solution that integrates the EOL scanning, the
ignore-XXX normalization and the hash calculation in a single pass. I
agree that would be a much more optimal way of processing the lines.
Processing 50 million lines / sec definitely sounds nice :-).
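
Just to make that one-pass idea concrete, I imagine something roughly
like the sketch below (untested, and not based on the actual
libsvn_diff code; the shift/xor mixing is arbitrary, only there to
show that the EOL scan and the hash can share a single loop):

[[[
#include <apr.h>

/* Sketch: find the end of the current line and hash it in one pass.
 * Returns a pointer to the EOL character (or to 'end' if the buffer
 * runs out first) and stores the line's hash in *hash. */
static const char *
scan_and_hash_line(const char *p, const char *end, apr_uint32_t *hash)
{
  apr_uint32_t h = 0;

  while (p < end && *p != '\n' && *p != '\r')
    {
      h = (h << 5) ^ (h >> 27) ^ (apr_uint32_t)(unsigned char)*p;
      p++;
    }

  *hash = h;
  return p;
}
]]]

The ignore-eol-style / ignore-whitespace normalization could then be
folded into that same loop instead of being a separate pass.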

>>
>> 3) Choose size of token.c#svn_diff__tree_t roots array (roots for the
>> multiple-rooted token tree) dynamically, relative to the number of
>> lines in the file. Or just choose a fixed size bigger than 127. [
>> Quick win if just picking a larger fixed size ]
>
> Maybe 1013 would be a more suitable size (assuming
> each entry consists of exactly one pointer). Larger
> values push the hash out of the L1 cache.

Ok, thanks. Will try that.

> If the hash function is reasonably close to an even,
> independent distribution, an array size of 2^N can
> be faster: The number of collisions should be the
> same (good hash function) but the index is much
> cheaper to calculate.
>
> Plain Adler32 is not a good hash in that sense but
> can easily be improved to
>
> index = (a32 + (a32 >> 16)) & (2^N-1)
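
If I understand that correctly, the lookup would then boil down to
something like this (just a sketch, with BUCKETS a hypothetical
power-of-two size for the roots array):

[[[
#include <apr.h>

/* Sketch: fold adler32's high bits into the low bits, then mask with a
 * power-of-two table size so a cheap '&' replaces the modulo. */
#define BUCKETS 1024  /* hypothetical 2^N roots-array size */

static int
bucket_index(apr_uint32_t a32)
{
  return (int)((a32 + (a32 >> 16)) & (BUCKETS - 1));
}
]]]
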
>>
>> - GNU diff chooses a size dynamically, a prime number between 1/3rd
>> and 2/3rds the number of lines.
>
> I.e., there will be ~2 entries per bucket on average.
>>
>> - OTOH, svn diff doesn't suffer that much from the small size, because
>> it uses a binary search tree rooted at every element in the array (GNU
>> diff uses a simple linked list, so it has to compare "linearly" against
>> the lines in the same "equivalence class"). Still, simply increasing
>> the size of the array can have a significant impact for big files,
>> without sacrificing too much memory IMHO.
>
> If we get >4 entries on average per bucket (i.e. >512
> lines total). Therefore, our lookup is likely more expensive.
> Also, a single-linked list uses less memory.

I don't understand. Are you saying that, from 512 lines onward, the
tree will be less expensive, or more expensive for the lookup?

In any case, I'm mainly interested in large files here, because for
small files the differences will almost always be negligible. So it's
pretty likely that we'll have more than 512 lines in the interesting
cases.

> I wonder what the actual distribution looks like because
> certain lines are very frequent (e.g. empty lines) and
> will end up in the same bucket.

Yes, but identical lines will just reuse the same node (and the
position_list associates line numbers with pointers to nodes in the
tree), so they will not increase the size of the tree/list of that
bucket.
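
For reference, the layout I have in mind is roughly this (simplified,
not the literal structs from token.c):

[[[
#include <apr.h>

/* Simplified sketch of the structures described above: each distinct
 * line ("token") gets exactly one node in the per-bucket binary tree,
 * and the position list maps every line number to the node holding its
 * content, so duplicate lines (e.g. empty lines) share a single node. */
typedef struct node_t
{
  struct node_t *left, *right;  /* binary search tree within one bucket */
  apr_uint32_t hash;            /* hash of the (normalized) line */
  const char *token;            /* the line's content */
} node_t;

typedef struct position_t
{
  struct position_t *next;      /* next line of the file */
  node_t *node;                 /* shared node for this line's content */
  apr_off_t offset;             /* line number */
} position_t;
]]]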

>>
>> 4) Choosing a bigger CHUNK_SIZE in diff_file.c. [ Quick win ]
>>
>> - Currently, we read data from the files in chunks of 128k (i.e. "1<<
>> 17"). I think we can safely increase that a couple of "shifts", to
>> 512k or even 1 or 2 MB (what's 1 MB of RAM these days, considering
>> this is purely a client-side operation). For comparison: GNU diff
>> simply reads both files into memory in full (but I don't think we
>> should go that route, the chunked operation of svn diff is a really
>> nice feature IMO).
>
> I think ~1MB will be safe. We can keep 2 of these chunks
> in L3 cache on virtually any modern CPU. That may be
> relevant when comparing the actual line content.

Ok, will try that.
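
If I remember correctly the chunk size is derived from a shift constant
in diff_file.c; the names below are from memory, so take them as
illustrative only. The change itself would be a one-liner:

[[[
/* Illustrative: going from 128KB chunks (1 << 17) to 1MB (1 << 20). */
#define CHUNK_SHIFT 20                 /* was 17 */
#define CHUNK_SIZE (1 << CHUNK_SHIFT)  /* bytes read per chunk, per file */
]]]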

>>
>> 5) Discarding non-matching lines before running the LCS (Longest
>> Common Subsequence) algorithm. [ Lot of work ]
>>
>> - With "non-matching lines", I mean lines which don't have a single
>> match in the other file. These lines can be immediately identified as
>> added or deleted lines. They can be "discarded" (noting the
>> adds/deletes), and the LCS algorithm can be run on the remainder. This
>> can make a *huge* impact on files which have a lot of different/unique
>> lines.
>
> I can see how "definitely non-matching" lines are relevant
> but could you provide a sketch of how they are used to sub-
> divide (or whatever) the file with respect to the LCS code?

I haven't really worked this out in detail, but I think it will have
no impact on the result. The reasoning is as follows:

- The core algorithm for calculating a (minimal) diff is to first find
the Longest Common Subsequence (the longest sequence of lines which
appear both in A and in B (not necessarily as contiguous lines, there
can be gaps)). The actual diff can be derived from this very quickly.

- But lines which only appear in A, or only in B, can never be part of
the LCS, because they are not common lines to begin with.

- So we can just as well calculate the LCS of A' and B' (A and B
without their "definitely non-matching" lines). This will also be an
LCS of A and B: any common subsequence of A' and B' is a common
subsequence of A and B, and since an LCS of A and B consists only of
common lines, which all survive the filtering, no longer common
subsequence can exist.

The only thing I'm not 100% sure of is whether it would yield the same
LCS (there can be many LCS's, all equally long, hence the different
possible diff representations, which are sometimes not nice for human
readers). However, my gut feeling tells me it will be the same. I can't
prove this though, so feel free to come up with a counter example :-).
(Although having a different LCS would not be a disaster: it would
still be a minimal diff, just a different one.)

The practical difficulty is to map line numbers from lines in A and B
to their corresponding lines in A' and B', and back again.
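
To make that bookkeeping concrete, the filtering for A could look
roughly like the sketch below (a hypothetical helper, not real
libsvn_diff code), with B handled symmetrically:

[[[
#include <apr.h>

/* Sketch: given the token (hash) sequences of both files, keep only
 * the lines of A that also occur somewhere in B, and remember each
 * kept line's original line number so the diff computed on A'/B' can
 * be mapped back to real line numbers afterwards. */
static int
filter_unmatched(const apr_uint32_t *a, int a_len,
                 const apr_uint32_t *b, int b_len,
                 apr_uint32_t *a_filtered, int *orig_lineno)
{
  int i, j, kept = 0;

  for (i = 0; i < a_len; i++)
    {
      int found = 0;

      /* Naive O(n*m) membership test, for clarity only; the real thing
       * would reuse the token hash table that already exists here. */
      for (j = 0; j < b_len; j++)
        if (a[i] == b[j])
          {
            found = 1;
            break;
          }

      if (found)
        {
          a_filtered[kept] = a[i];
          orig_lineno[kept] = i;  /* index in A' -> line number in A */
          kept++;
        }
    }

  return kept;  /* length of A' */
}
]]]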

> With mainly source code in mind, one would expect the
> actual difference to contain a number of unique lines. Diff
> could become an almost O(N) operation in that case.
>>
>> (for instance, an example script from the users list [2], which
>> generates two files with random lines, 3/4th of which are
>> non-matching: diff goes from several hours to a couple of seconds;
>> another example is a big file of which indentation style was changed
>> from tabs to spaces, and diffing without an ignore-whitespace option).
>>
>> - GNU diff does this as well, but strangely, it does it only as part
>> of a heuristic (which also discards other "confusing lines"). If run
>> with --minimal, these lines are not discarded (though, AFAICS, there
>> should be no problem discarding the non-matching lines even when going
>> for a minimal diff).
>
> It is obvious how unique lines split the contents on both sides
> individually. However, it may be hard to prove that a particular
> alignment of the chunks in between on one side with those on
> the respective other yields a minimal result.
>>
>> II. Going for a non-minimal diff (i.e. heuristics)
>> --------------------------------------------------
>>
>> I think prefix/suffix-scanning, together with the suggestions from I.,
>> will already make "svn diff" quite fast in most real-world cases. It
>> will normally be just as fast, if not faster than GNU "diff --minimal"
>> (barring low-level differences, and the advantage of GNU diff of doing
>> everything in memory).
>>
>> If I.5. is implemented, "svn diff" will usually be faster than GNU
>> "diff --minimal", because GNU diff doesn't do that unless it's running
>> with heuristics (so I think in a lot of cases, "svn diff" will match
>> performance of GNU diff running with heuristics (which is its
>> default)).
>>
>> Still, in some cases, heuristics can make a big difference (while not
>> guaranteeing that you'll get a minimal diff).
>>
>> - Existing issue:
>> http://subversion.tigris.org/issues/show_bug.cgi?id=1966 (libsvn_diff
>> needs 'non-minimal-diff' mode.).
>>
>> - Apart from implementing heuristics as part of the current algorithm
>> (which can quickly become quite complex), we could also take a look at
>> an alternative approach, such as "Patience Diff" [3], which is already
>> implemented in several other (D)VCS's. It has the potential to be much
>> faster (reducing the problem to calculating several, much smaller,
>> LCS's), and has the added advantage of often producing "nicer" diff
>> output.
>
> Patience Diff seems to be ideally suited for source code
> diffs because function headers etc. provide many of the
> "only one matching" lines evenly spread across the whole
> file.

Yes, that could be quite interesting.

Though the choice between normal diff and patience diff should be up
to the user, I think, since patience diff is really a heuristic: in
some (quite rare, I think) cases it will give a non-minimal diff.
AFAIK that's also how the DVCS's that implement it expose it: as an
option. So optimizing the normal diff as far as we can is still very
much worthwhile...
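
For the record, the heart of patience diff is quite small: take the
lines that occur exactly once in both files, keep the longest
subsequence of those matches that appears in the same order on both
sides, and use those as anchors; only the gaps between anchors need a
full LCS run. A rough sketch of the anchor-selection step (hypothetical
code, no relation to libsvn_diff):

[[[
#include <stdlib.h>

/* A match pairs the position of a line that is unique in A with the
 * position of the identical, unique line in B. */
typedef struct { int a_pos, b_pos; } match_t;

/* Longest increasing subsequence on b_pos (input sorted by a_pos).
 * O(n^2) for clarity; a real implementation would use the O(n log n)
 * patience-sorting form.  Writes the anchors to 'out' (capacity n)
 * and returns how many there are. */
static int
pick_anchors(const match_t *m, int n, match_t *out)
{
  int *len = malloc(n * sizeof(*len));    /* LIS length ending at k */
  int *prev = malloc(n * sizeof(*prev));  /* back-pointers */
  int best = -1, best_len = 0, k, j, count;

  for (k = 0; k < n; k++)
    {
      len[k] = 1;
      prev[k] = -1;
      for (j = 0; j < k; j++)
        if (m[j].b_pos < m[k].b_pos && len[j] + 1 > len[k])
          {
            len[k] = len[j] + 1;
            prev[k] = j;
          }
      if (len[k] > best_len)
        {
          best_len = len[k];
          best = k;
        }
    }

  count = best_len;
  for (k = best; k >= 0; k = prev[k])     /* walk back, emit in order */
    out[--best_len] = m[k];

  free(len);
  free(prev);
  return count;
}
]]]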

Cheers,

-- 
Johan