[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

From: Julian Foad <julian.foad_at_wandisco.com>
Date: Tue, 12 Oct 2010 11:35:49 +0100

On Sun, 2010-10-10 at 23:43 +0200, Johan Corveleyn wrote:
> On Sat, Oct 9, 2010 at 2:21 PM, Johan Corveleyn <jcorvel_at_gmail.com> wrote:
> > On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad <julian.foad_at_wandisco.com> wrote:
> >> But this makes me think, it looks to me like this whole
> >> prefix-suffix-skipping functionality would fit better inside the
> >> lower-level diff algorithm rather than inside the "datasources_open"
> >> function. Do you agree?
> >>
> >> It works as it is, but you were talking about wanting it to obey the
> >> standard token-parsing rules such as "ignore white space", and so on.
> >> It seems that things like this would be much better one level down.
> >
> > Yes, I've been struggling with this. But I can't easily see it fit in
> > the lower levels right now. Problem is that everything in those lower
> > levels always acts on 1 datasource at a time (getting all the tokens,
> > inserting them into the token tree, ... then the same for the next
> > datasource). The datasource_open seemed to me to be the easiest place
> > to combine datasources to do things for both of them "concurrently"
> > (with least impact on the rest of the code).
> >
> > Maybe those lower-level functions could also be made
> > "multi-datasource", but I have to think a little bit more about that.
>
> I've thought a little bit more about this, going through the code, and
> yeah, it might be a good idea to push the prefix/suffix scanning into
> the lower-level parts of the diff-algorithm (the token parsing /
> building the token tree).
>
> Something like:
> - Make token.c#svn_diff__get_tokens take multiple datasources.
> - In diff.c, diff3.c and diff4.c replace the multiple calls to this
> function to one call passing multiple datasources.
> - token.c#svn_diff__get_tokens (with multiple datasources) will take
> care of identical prefix and suffix scanning based on "tokens" (so can
> take advantage of the standard token-parsing rules with ignore-*
> options, by simply calling svn_diff_fns_t.datasource_get_next_token).

One of the improvements we're looking for is to make use of the already
existing diff options - ignore-white-space etc.

We can get that improvement with a much smaller change: simply by
calling the 'get_next_token' routine, or rather a part of it, from
within your current 'find_identical_prefix' function, without touching
any of the generic diffN.c/token.c code and APIs.

> Some things needed to support this:
> - svn_diff_fns_t.datasource_get_next_token: calculation of the hash
> should be made conditional (I don't want to waste time for the adler32
> hash for lines that are not going to be put in the token tree).

Yes. If you take this "smaller change" approach, then the way to do
this would be to factor out the non-adler part of
'datasource_get_next_token' into a separate private function, and call
that.

> - For suffix scanning, I'll need another variation of
> datasource_get_next_token to get tokens from the end going backwards
> (say "datasource_get_previous_token"). No hash needed.

Yes.

> Does that make sense? Or am I missing it completely?

I can't really comment on the question of moving this right down into
the diffN.c/token.c code, as I don't have time to study that right now.
The possible benefits I can see are:

* Sharing this feature across all kinds of diff implementations
(currently: file and in-memory-string). Good from a practical POV if
and only if really long strings are being compared in memory. Good from
a code-structural POV.

* I'm curious about whether it would be possible to integrate prefix
skipping into the main algorithm in such a way that the algorithm would
see the skipped prefix as a possible match, just like any other chunk
(including its adler32), but in a much quicker way than the algorithm
currently finds such a prefix. If so, it would be able to handle better
the cases where one file has added a prefix that is a duplicate of
existing text. Same for the suffix.

I'm really not sure whether it's worth pursuing that line of
investigation. If you do, and it turns out well, that would be great.
If you want to just stick to the implementation in diff_file.c, that
seems like a more practical place to stop, now that you have basically
solved the problem.

- Julian
Received on 2010-10-12 12:36:34 CEST

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.