Diff optimization: implement prefix/suffix-skipping in token-handling code (was: Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster)

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Wed, 20 Oct 2010 01:31:02 +0200

On Tue, Oct 12, 2010 at 12:35 PM, Julian Foad <julian.foad_at_wandisco.com> wrote:
> On Sun, 2010-10-10 at 23:43 +0200, Johan Corveleyn wrote:
>> On Sat, Oct 9, 2010 at 2:21 PM, Johan Corveleyn <jcorvel_at_gmail.com> wrote:
>> > On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad <julian.foad_at_wandisco.com> wrote:
>> >> But this makes me think, it looks to me like this whole
>> >> prefix-suffix-skipping functionality would fit better inside the
>> >> lower-level diff algorithm rather than inside the "datasources_open"
>> >> function. Do you agree?
>> >>
>> >> It works as it is, but you were talking about wanting it to obey the
>> >> standard token-parsing rules such as "ignore white space", and so on.
>> >> It seems that things like this would be much better one level down.
>> >
>> > Yes, I've been struggling with this. But I can't easily see it fit in
>> > the lower levels right now. Problem is that everything in those lower
>> > levels always acts on 1 datasource at a time (getting all the tokens,
>> > inserting them into the token tree, ... then the same for the next
>> > datasource). The datasource_open seemed to me to be the easiest place
>> > to combine datasources to do things for both of them "concurrently"
>> > (with least impact on the rest of the code).
>> >
>> > Maybe those lower-level functions could also be made
>> > "multi-datasource", but I have to think a little bit more about that.
>>
>> I've thought a little bit more about this, going through the code, and
>> yeah, it might be a good idea to push the prefix/suffix scanning into
>> the lower-level parts of the diff-algorithm (the token parsing /
>> building the token tree).
>>
>> Something like:
>> - Make token.c#svn_diff__get_tokens take multiple datasources.
>> - In diff.c, diff3.c and diff4.c replace the multiple calls to this
>> function to one call passing multiple datasources.
>> - token.c#svn_diff__get_tokens (with multiple datasources) will take
>> care of identical prefix and suffix scanning based on "tokens" (so can
>> take advantage of the standard token-parsing rules with ignore-*
>> options, by simply calling svn_diff_fns_t.datasource_get_next_token).
>
> One of the improvements we're looking for is to make use of the already
> existing diff options - ignore-white-space etc.
>
> We can get that improvement with a much smaller change: simply by
> calling the 'get_next_token' routine, or rather a part of it, from
> within your current 'find_identical_prefix' function, without touching
> any of the generic diffN.c/token.c code and APIs.
>
>> Some things needed to support this:
>> - svn_diff_fns_t.datasource_get_next_token: calculation of the hash
>> should be made conditional (I don't want to waste time for the adler32
>> hash for lines that are not going to be put in the token tree).
>
> Yes. If you take this "smaller change" approach, then the way to do
> this would be to factor out the non-adler part of
> 'datasource_get_next_token' into a separate private function, and call
> that.
>
>> - For suffix scanning, I'll need another variation of
>> datasource_get_next_token to get tokens from the end going backwards
>> (say "datasource_get_previous_token"). No hash needed.
>
> Yes.
>
>> Does that make sense? Or am I missing it completely?
>
> I can't really comment on the question of moving this right down into
> the diffN.c/token.c code, as I don't have time to study that right now.
> The possible benefits I can see are:
>
> * Sharing this feature across all kinds of diff implementations
> (currently: file and in-memory-string). Good from a practical POV if
> and only if really long strings are being compared in memory. Good from
> a code-structural POV.

Yes, that seems like a nice improvement, code-structurally. I don't
know if it will have noticeable performance impact. If I understand
the code correctly, the in-memory-diff code is used for diffing
properties. Some properties can be quite large (e.g. svn:mergeinfo),
but I don't know if they are large enough to have an impact (unless
some high-level operations perform a million property diffs or
something).

> * I'm curious about whether it would be possible to integrate prefix
> skipping into the main algorithm in such a way that the algorithm would
> see the skipped prefix as a possible match, just like any other chunk
> (including its adler32), but in a much quicker way than the algorithm
> currently finds such a prefix. If so, it would be able to handle better
> the cases where one file has added a prefix that is a duplicate of
> existing text. Same for the suffix.

Maybe that's possible, but I don't think it will help much. For one
thing, it introduces some overhead (adler32 calculation of the
prefix). And it will only help if the added piece of text is *exactly*
the same as the prefix (every line of it). If the added piece of text
misses the last line of the prefix, it will not match (different
adler32 hash). If you need to be able to match at different spots
inside the prefix, you'll have to hash (and compare) every line
separately, which voids the benefit of this optimization.

However, I've had another idea for optimization, which I discussed a
couple of weeks ago with danielsh on irc: in the current algorithm,
all lines are compared to all lines (to prepare for the LCS
algorithm). Therefore they are put in a binary tree (first the lines
of original file, then modified). That happens in
subversion/libsvn_diff/token.c#svn_diff__tree_insert_token. Maybe I
could do the following: when I get a new line, and the previous line
had a match in the tree, first try to match the new line with the
successor of that previously matched line. Only if that doesn't match,
go and search the tree as is done now.

That would take advantage of the fact that often entire blocks match,
instead of only looking at individual lines. It also introduces
overhead if there are no such matching blocks. I'd still keep the
prefix/suffix scanning, because anything that helps keeping the binary
tree smaller is good

> I'm really not sure whether it's worth pursuing that line of
> investigation. If you do, and it turns out well, that would be great.
> If you want to just stick to the implementation in diff_file.c, that
> seems like a more practical place to stop, now that you have basically
> solved the problem.

Hmmm, I can't help feeling challenged by the idea of implementing
everything token-based. It would make further improvement possible (or
at least easier), like the ignore-options, and the above idea. The
current patch works, but it feels slightly out of place. So maybe I'll
still have a go at it...

Cheers,

-- 
Johan

Received on 2010-10-20 01:32:02 CEST

This message: [ Message body ]
Next message: Hyrum K. Wright: "Re: svn commit: r1024442 - in /subversion/site/publish: index.html site-nav.html"
Previous message: Greg Stein: "Re: svn commit: r1024442 - in /subversion/site/publish: index.html site-nav.html"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]