[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Sun, 10 Oct 2010 23:43:43 +0200

On Sat, Oct 9, 2010 at 2:21 PM, Johan Corveleyn <jcorvel_at_gmail.com> wrote:
> On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad <julian.foad_at_wandisco.com> wrote:
>> But this makes me think, it looks to me like this whole
>> prefix-suffix-skipping functionality would fit better inside the
>> lower-level diff algorithm rather than inside the "datasources_open"
>> function.  Do you agree?
>>
>> It works as it is, but you were talking about wanting it to obey the
>> standard token-parsing rules such as "ignore white space", and so on.
>> It seems that things like this would be much better one level down.
>
> Yes, I've been struggling with this. But I can't easily see it fit in
> the lower levels right now. Problem is that everything in those lower
> levels always acts on 1 datasource at a time (getting all the tokens,
> inserting them into the token tree, ... then the same for the next
> datasource). The datasource_open seemed to me to be the easiest place
> to combine datasources to do things for both of them "concurrently"
> (with least impact on the rest of the code).
>
> Maybe those lower-level functions could also be made
> "multi-datasource", but I have to think a little bit more about that.

I've thought a little bit more about this, going through the code, and
yeah, it might be a good idea to push the prefix/suffix scanning into
the lower-level parts of the diff-algorithm (the token parsing /
building the token tree).

Something like:
- Make token.c#svn_diff__get_tokens take multiple datasources.
- In diff.c, diff3.c and diff4.c replace the multiple calls to this
function to one call passing multiple datasources.
- token.c#svn_diff__get_tokens (with multiple datasources) will take
care of identical prefix and suffix scanning based on "tokens" (so can
take advantage of the standard token-parsing rules with ignore-*
options, by simply calling svn_diff_fns_t.datasource_get_next_token).

Some things needed to support this:
- svn_diff_fns_t.datasource_get_next_token: calculation of the hash
should be made conditional (I don't want to waste time for the adler32
hash for lines that are not going to be put in the token tree).
- For suffix scanning, I'll need another variation of
datasource_get_next_token to get tokens from the end going backwards
(say "datasource_get_previous_token"). No hash needed.

Does that make sense? Or am I missing it completely?

It would mean I'd have to throw away the current patch and rewrite it
in the token layer (but I don't really mind, as long as a good
solution takes its place :-)). And your file_info refactoring would
also not be used (although it's certainly a worthwhile refactoring by
itself, making the current code clearer).

Cheers,

-- 
Johan
Received on 2010-10-10 23:44:21 CEST

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.