
Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Tue, 28 Sep 2010 23:37:23 +0200

Hi Daniel,

Thanks for the feedback.

On Tue, Sep 28, 2010 at 4:11 PM, Daniel Shahaf <d.s_at_daniel.shahaf.name> wrote:
>> Index: subversion/include/svn_diff.h
>> ===================================================================
>> --- subversion/include/svn_diff.h     (revision 1001548)
>> +++ subversion/include/svn_diff.h     (working copy)
>> @@ -112,6 +112,11 @@
> (personally I prefer 'svn diff -x-p' to show the function name here)

Ok, will do next time.

>>    svn_error_t *(*datasource_open)(void *diff_baton,
>>                                    svn_diff_datasource_e datasource);
>> +  /** Open the datasources of type @a datasources. */
>> +  svn_error_t *(*datasources_open)(void *diff_baton, apr_off_t *prefix_lines,
>> +                                   svn_diff_datasource_e datasource0,
>> +                                   svn_diff_datasource_e datasource1);
>> +
> So, you're extending the svn_diff_fns_t struct, which is defined in
> a public header.
> It's a public struct with no constructor function, so I believe you have
> to revv it (into svn_diff_fns2_t) in order to extend it (for binary
> compatibility: people allocating this struct and then using a newer
> Subversion library at runtime).
> If it did have a constructor function, you'd have to extend it only at
> the end, and even then make sure that if the added elements are NULL (eg
> because an old caller didn't know to fill them) then everything still
> worked.
> Probably more important to get the logic right than to revv it right
> away, though; the latter is a SMOP.

Doh! I'm sure that observation was in the back of my head somewhere,
but I forgot about it while working on the solution. Anyway, you're
right: there is first some more work to get the algorithm right.
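
For reference, the revving Daniel describes could look roughly like the
sketch below. It uses simplified stand-in types rather than the real
Subversion declarations: fns_t, fns2_t, wrap_baton_t, status_t and all
function names here are hypothetical. The idea is that the old struct
keeps its layout, the new struct gains the extra callback, and a wrapper
lets callers that only fill in the old struct keep working (with no
prefix skipped).

```c
#include <stddef.h>

/* Simplified stand-ins, NOT the real Subversion types. */
typedef int status_t;            /* 0 means success */

/* The "old" vtable: its layout must never change once released. */
typedef struct fns_t {
  status_t (*datasource_open)(void *baton, int datasource);
} fns_t;

/* The revved vtable: a new type, so old binaries are unaffected. */
typedef struct fns2_t {
  status_t (*datasource_open)(void *baton, int datasource);
  status_t (*datasources_open)(void *baton, long *prefix_lines,
                               int datasource0, int datasource1);
} fns2_t;

/* Baton that lets the new code path drive an old-style vtable. */
typedef struct wrap_baton_t {
  const fns_t *old_fns;
  void *old_baton;
} wrap_baton_t;

static status_t
wrap_datasource_open(void *baton, int datasource)
{
  wrap_baton_t *wb = baton;
  return wb->old_fns->datasource_open(wb->old_baton, datasource);
}

/* Fallback for old callers: open both datasources one by one and
   report that no identical prefix was skipped. */
static status_t
wrap_datasources_open(void *baton, long *prefix_lines,
                      int datasource0, int datasource1)
{
  wrap_baton_t *wb = baton;
  status_t err;

  *prefix_lines = 0;
  err = wb->old_fns->datasource_open(wb->old_baton, datasource0);
  if (err)
    return err;
  return wb->old_fns->datasource_open(wb->old_baton, datasource1);
}

static const fns2_t wrap_fns2 = {
  wrap_datasource_open,
  wrap_datasources_open
};

/* Example old-style callback: counts how often it is called. */
static status_t
demo_open(void *baton, int datasource)
{
  (void)datasource;
  ++*(int *)baton;
  return 0;
}
```

With this shape, the old entry point would fill a wrap_baton_t and call
the new implementation with wrap_fns2, so only one code path has to be
maintained.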

I've made some progress:
- All tests in blame_tests.py now pass (tests 2 and 7 used to provoke
a core dump), so the whole test suite passes. However, although I
fixed the core dump, I don't fully understand its root cause yet (how
those tests ended up in that state). So I'm going to study that a bit
more first.
- I have now included support for files with \r eol-style.

I'll send a new version of the patch shortly.

I'm also thinking that I could try to take advantage of the -x-b, -x-w
or -x--ignore-eol-style options to make it even faster (right now the
prefix/suffix scanning stops at the first difference, regardless of
whether it's a whitespace or eol difference that could/should be
ignored).
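
To illustrate the current behaviour, here is a minimal sketch of a
byte-wise prefix scan. It is not the code from the patch: it works on
in-memory buffers instead of file chunks, only recognizes "\n" as eol,
and the name scan_identical_prefix is made up for this example.

```c
#include <stddef.h>

/* Hypothetical helper (not the patch's find_identical_prefix).
   Compare two buffers byte by byte and return the number of complete
   lines they share at the start.  *prefix_bytes receives the offset
   just past the last shared eol, i.e. where the real diff would
   have to start. */
static size_t
scan_identical_prefix(const char *a, size_t a_len,
                      const char *b, size_t b_len,
                      size_t *prefix_bytes)
{
  size_t min_len = a_len < b_len ? a_len : b_len;
  size_t lines = 0;
  size_t last_eol = 0;
  size_t i = 0;

  /* Stop at the very first differing byte, whatever it is. */
  while (i < min_len && a[i] == b[i])
    {
      if (a[i] == '\n')
        {
          lines++;
          last_eol = i + 1;
        }
      i++;
    }

  *prefix_bytes = last_eol;
  return lines;
}
```

With "a \n" and "a\n" as input, the scan stops at the trailing space and
reports zero identical lines, even though -x-b would treat those two
lines as equal.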

However, I'm not sure whether I should implement this myself, as part
of the find_identical_prefix and find_identical_suffix functions, or
switch to using datasource_get_next_token (which is the function that
is used by the "real" diff algorithm to extract the lines, and which
uses util.c#svn_diff__normalize_buffer to normalize whitespace and
eols if needed).

Right now, I don't read entire lines (tokens) but compare each byte as
I go. But I could do it line-based as well (extract line from file1,
extract line from file2, memcmp lines). I would have to make the
calculation of the adler32 hash in datasource_get_next_token
conditional on some boolean, or factor out the part of the function
that's useful to me into a new static function.
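
As a rough sketch of the line-based variant (extract a line from each
file, then memcmp), assuming in-memory buffers and "\n" as the only eol.
The helper names next_line_length and count_prefix_lines_by_line are
invented for this example; normalization and the adler32 hash are left
out on purpose.

```c
#include <stddef.h>
#include <string.h>

/* Length of the line starting at buf, including its '\n', or the rest
   of the buffer if no '\n' follows. */
static size_t
next_line_length(const char *buf, size_t len)
{
  const char *eol = memchr(buf, '\n', len);
  return eol ? (size_t)(eol - buf) + 1 : len;
}

/* Hypothetical helper: count identical leading lines by extracting one
   line from each buffer and comparing the two with memcmp.  A final
   line without eol is counted too if it is identical. */
static size_t
count_prefix_lines_by_line(const char *a, size_t a_len,
                           const char *b, size_t b_len)
{
  size_t lines = 0;

  while (a_len > 0 && b_len > 0)
    {
      size_t la = next_line_length(a, a_len);
      size_t lb = next_line_length(b, b_len);

      if (la != lb || memcmp(a, b, la) != 0)
        break;

      a += la;
      a_len -= la;
      b += lb;
      b_len -= lb;
      lines++;
    }

  return lines;
}
```

A whitespace-ignoring version would normalize both extracted lines
before the comparison, which is where reusing the existing
normalization code would come in.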

There is one caveat to this approach: I'm not sure if it would work
backwards (for suffix scanning). Well, the normalization function
wouldn't have to be changed, but the extraction of lines would have to
go backward. Surely it's possible, but I have no idea how much I'd
have to change the code in get_next_token to get lines backwards...
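
To give an idea of what "getting lines backwards" involves, here is a
small sketch on in-memory buffers, again with "\n" as the only eol. It
does not touch get_next_token; the names last_line_start and
count_suffix_lines are made up, and a real implementation would also
have to deal with chunk boundaries and make sure the suffix does not
overlap the prefix.

```c
#include <stddef.h>
#include <string.h>

/* Offset at which the last line of buf[0..len) starts.  A trailing
   '\n' is treated as the terminator of that last line. */
static size_t
last_line_start(const char *buf, size_t len)
{
  size_t i = len;

  if (i > 0 && buf[i - 1] == '\n')
    i--;                        /* step over the last line's eol */
  while (i > 0 && buf[i - 1] != '\n')
    i--;                        /* walk back to the previous eol */
  return i;
}

/* Hypothetical helper: count identical trailing lines by extracting
   the last line of each buffer and comparing them with memcmp. */
static size_t
count_suffix_lines(const char *a, size_t a_len,
                   const char *b, size_t b_len)
{
  size_t lines = 0;

  while (a_len > 0 && b_len > 0)
    {
      size_t sa = last_line_start(a, a_len);
      size_t sb = last_line_start(b, b_len);
      size_t la = a_len - sa;
      size_t lb = b_len - sb;

      if (la != lb || memcmp(a + sa, b + sb, la) != 0)
        break;

      a_len = sa;               /* drop the matched line from both ends */
      b_len = sb;
      lines++;
    }

  return lines;
}
```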

I'm also not sure if one would be (significantly) faster than the
other: comparing byte-by-byte while going through both files, or
extracting entire lines and then comparing the lines.



Received on 2010-09-28 23:37:59 CEST

This is an archived mail posted to the Subversion Dev mailing list.