
Re: [PATCH] Speed-up of libsvn_diff using token counts

From: Morten Kloster <morklo_at_gmail.com>
Date: Thu, 26 May 2011 00:00:07 +0200

On Wed, May 25, 2011 at 1:44 AM, Johan Corveleyn <jcorvel_at_gmail.com> wrote:
> On Tue, May 24, 2011 at 10:46 PM, Stefan Sperling <stsp_at_elego.de> wrote:
>> On Tue, May 24, 2011 at 10:22:58PM +0200, Morten Kloster wrote:
>>> Johan / Stefan - any progress on the reviewing part?
>>
>> I haven't had time to review this, sorry.
>> It got a bit lost within the recent flurry of activity during and
>> after the Berlin hackathon.
>
> Yes, I too have been a bit wrapped up in other issues, not to mention
> my $dayjob, so sorry for the wait. I have gone through your patch and
> spotted some minor things, but couldn't yet completely verify the core
> of your change (the stuff in lcs.c).
>
>> I would also need some time to look at this in detail since my
>> working knowledge of the diff code is... suboptimal.
>
> Having worked on it a while ago, I know the diff code pretty well. But
> I'm not the most experienced svn developer :-). Hmmmm, seems
> complementary :-).
>
>> Note also that your change might be deemed a bit too experimental at
>> this point as the main focus right now is on stabilisation and bug
>> fixing for the upcoming 1.7 release.
>
> Yes. I'm also unsure on this point right now. Maybe when I've had the
> opportunity to analyze and test your patch I might have enough
> confidence ... but right now I'm not sure.
>
> It would help if you did some additional tests/checks yourself (or
> tell us what tests you have already done -- you mentioned having
> tested svn_diff_diff_2, but you didn't say how or how much).
>
> - Have you run svn's testsuite? (you mentioned TortoiseSVN build
> scripts, not sure if you can easily run them (look for win-tests.py))
>
> - A test I did regularly when I worked on the diff code: run "svn
> blame" on a large file with a long history (I used a 60,000 line file
> which had 2200 changes in my repository (unfortunately I can't share
> it)). This executes diff2 for every change. Compare the result of the
> original svn with your patched svn. It should be exactly the same.
> Also try different diff-options for these blame runs
> (ignore-whitespace, ignore-eol-style, ...).
>

No, it's all I can do to compile the libraries on my home computer;
the svn.exe won't run for some reason. So all my testing is through
my own C++ program using the library.

I have, however, done some significant testing today: I downloaded
the first 100 revisions of merge.c in libsvn_client (merge.c is the
largest code file in the current HEAD revision, with almost 800
revisions), and ran diff of every one of those revisions against every
other (as well as against the empty file, just in case). While
the results weren't 100% identical for the original library and the
patched library, the only differences were slightly different lcs's of
equal quality (number of matches). Such differences are to be
expected, since the patched version orders the files by their number
of non-unique lines, whereas the original orders them by total line
count - unique lines included. Keep in mind that the official version
can also give different results for file1 vs. file2 than for file2
vs. file1, if the files have the same number of lines.
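
For illustration, the core of such a pairwise test loop could look
roughly like this (a simplified sketch against the public svn_diff
file API, not my actual C++ test program; the "merge.c.rN" file
naming and the result-comparison step are placeholders):

[[[
/* Sketch of a pairwise diff test loop over N downloaded revisions.
 * The "merge.c.rN" file names are placeholders. */
#include <stdio.h>
#include <apr_general.h>
#include "svn_pools.h"
#include "svn_error.h"
#include "svn_diff.h"

#define NUM_REVS 100

int main(void)
{
  apr_pool_t *pool, *iterpool;
  int i, j;

  apr_initialize();
  pool = svn_pool_create(NULL);
  iterpool = svn_pool_create(pool);

  for (i = 1; i <= NUM_REVS; i++)
    for (j = 1; j <= NUM_REVS; j++)
      {
        char name_i[32], name_j[32];
        svn_diff_t *diff;
        svn_error_t *err;

        svn_pool_clear(iterpool);
        sprintf(name_i, "merge.c.r%d", i);
        sprintf(name_j, "merge.c.r%d", j);

        err = svn_diff_file_diff_2(&diff, name_i, name_j,
                                   svn_diff_file_options_create(iterpool),
                                   iterpool);
        if (err)
          {
            printf("diff failed: %s vs. %s\n", name_i, name_j);
            svn_error_clear(err);
            continue;
          }
        /* Here one would dump or count the common/modified chunks
         * and compare the output of the original and patched
         * libraries; the lcs quality (number of matched lines)
         * should be identical, though the lcs itself need not be. */
        (void)svn_diff_contains_diffs(diff);
      }

  svn_pool_destroy(pool);
  apr_terminate();
  return 0;
}
]]]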

The tests also highlighted the performance improvement
(total running time for all ~10000 diffs):

Original debug: Almost 30 minutes
Original release: 6 minutes 13 secs

Patched debug: 89 secs
Patched release: 54 secs

The minimal difference between the patched debug and patched
release runs suggests that both spend most of their time outside
the lcs algorithm anyway.

> To test diff3 you could run an "svn update" on a locally modified
> file. Just "update -r<older revision>", make some changes in the file,
> and update again to let it merge the changes. Try this for some
> different cases relevant to your patch (e.g. file with lots of unique
> lines between the updates, with all unique lines, with no unique
> lines, large/small file, ...).

Difficult when svn.exe doesn't work... :-)
I might make tests for diff3 later.

>
> The diff4 code is not currently used in svn core, but we still like to
> keep it around and keep it working. There is one regression test for
> it in the testsuite. If that test passes it should be ok I think.
>
>>> Only possible error I'm aware of is that I made the indices/counts
>>> 32 bit signed/unsigned integers, which means it'll run into trouble
>>> if there are more than ~2 billion different tokens (lines) or more than
>>> ~4 billion occurrences of the same token. If the library should be
>>> able to handle such files, they should be made 64 bit (for the
>>> indices, it should be safe to keep them 32 bit on 32 bit machines,
>>> as those could never have the memory to run such diffs anyway).
>>
>> Extending this for 64bit platforms sounds ok to me.
>
> +1.
>
> Also, when looking at your patch I was first puzzled why you used a
> signed integer for node->index, and unsigned integers for the
> token/node counts. Especially since they are assigned to each other in
> token.c:
>
> [[[
> @@ -122,6 +134,7 @@ tree_insert_token(svn_diff__node_t **node, svn_dif
>  new_node->right = NULL;
>  new_node->hash = hash;
>  new_node->token = token;
> +  new_node->index = tree->node_count++;
> ]]]
>
> Hm, tricky.
>
> Then I understood why, when I read your change to lcs.c#svn_diff__lcs,
> where you make use of the signed-ness to come up with (fake) node
> indices for the sentinel positions:
>
> [[[
> -  /* These are never dereferenced, only compared by value, so we
> -   * can safely fake these up and the void* cast is OK.
> +  /* Negative indices will not be used elsewhere
>   */
> -  sentinel_position[0].node = (void*)&sentinel_position[0];
> -  sentinel_position[1].node = (void*)&sentinel_position[1];
> +  sentinel_position[0].node = -1;
> +  sentinel_position[1].node = -2;
> ]]]
>
> I'm not sure if that's really safe (because of your unsigned -> signed
> assignment in token.c). Maybe there could be a better way to fake
> these nodes for the sentinels? Can't think of any right now.
>

I'm sure that is safe, as long as the count doesn't exceed the max
positive value of the signed type (which can never happen if we
change the types to 64 bit for 64-bit machines, since the algorithm
requires far more than 2 bytes of memory per count).
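
For illustration, the per-platform widening could look something
like this; the typedef name is hypothetical, not what my patch uses,
and the comment spells out the bound argument:

[[[
/* Hypothetical sketch of the index type widening; the typedef name
 * is illustrative, not from the patch. */
#include <apr.h>

#if APR_SIZEOF_VOIDP >= 8
typedef apr_int64_t svn_diff__token_index_t;  /* signed on purpose */
#else
typedef apr_int32_t svn_diff__token_index_t;
#endif

/* Why the unsigned node_count can never overflow the signed index:
 * each counted token occurrence is backed by at least one position
 * structure of well over 2 bytes, so reaching 2^63 counts would
 * require more than 2^64 bytes of memory.  node_count therefore
 * always fits in the positive range, leaving negative values (-1,
 * -2) free for the sentinel positions. */
]]]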

>> We have similar blue sky limits elsewhere, e.g. svn patch uses a
>> 64bit line counter for files it is patching (on all platforms -- but
>> it never reads more than one line into memory).
>>
>>> I have a couple of other changes ready:
>>>
>>> One is a minor simplification, by getting rid of the idx parameter in
>>> lcs.c - there is a simpler way to accomplish what it's for.
>
> Simplicity is always good :-).
>

One effect of this would be that the diff of file1 vs. file2 could
differ (a different lcs of the same quality) from that of file2 vs.
file1 even if the files have different numbers of lines. I hope this
is not a problem?

>>> The other change is a partial implementation of the other possible
>>> enhancement I mentioned earlier, which takes advantage of unique
>>> matches to "restart" the lcs algorithm whenever possible, thus
>>> reducing the effect of the square computational complexity. The
>>> "partial" part is that this version is only really effective if the two
>>> files have roughly the same number of (non-unique) lines; it will not
>>> reduce the (P * length difference) part of the complexity. For good
>>> cases, though, it can reduce the time required by the lcs algorithm
>>> by large factors (roughly the number of well-separated sections of
>>> difference, if they are all about the same size).
>
> Hm, I'm not sure how effective this would be in real-world cases. I
> imagine that often such unique matches are interspersed with
> non-unique matches (e.g. braces, empty lines, typical headers/footers
> in blocks of otherwise unique structured text, ...), breaking the
> "power" of those unique matches.
>
> What do you think?
>

There is no need for the unique matches to be consecutive: As long
as there are no differences between the files in between the unique
matches, they can be spread out over an arbitrary number of lines.
Thus this approach should work fine as long as there are overall
more unique matches than non-unique differences between the files.
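
To illustrate the general idea (this is a drastic simplification,
not my actual patch, which still computes an exact lcs): walk both
token streams in lockstep, skip forced matches for free, and
whenever the next unique token is the same on both sides, treat it
as an anchor, so the expensive lcs algorithm only runs on the short
section in between. A toy sketch:

[[[
#include <stdio.h>

/* Toy sketch of the "restart at unique matches" idea -- not the
 * actual patch.  tokens_a/tokens_b hold one token index per line;
 * occ[t] counts occurrences of token t in both files combined, and
 * occ[t] == 2 is taken (simplifying) to mean a unique match, i.e.
 * once in each file. */
static void
diff_with_restarts(const int *tokens_a, int len_a,
                   const int *tokens_b, int len_b,
                   const int *occ)
{
  int i = 0, j = 0;

  while (i < len_a && j < len_b)
    {
      /* Identical tokens in lockstep are forced matches: free. */
      while (i < len_a && j < len_b && tokens_a[i] == tokens_b[j])
        i++, j++;
      if (i >= len_a || j >= len_b)
        break;

      /* Mismatch: scan each side to its next unique token.  If both
       * sides reach the same one, it is an anchor, and the square-
       * complexity lcs only has to cover the section before it. */
      {
        int ai = i, bj = j;

        while (ai < len_a && occ[tokens_a[ai]] != 2)
          ai++;
        while (bj < len_b && occ[tokens_b[bj]] != 2)
          bj++;

        if (ai < len_a && bj < len_b && tokens_a[ai] == tokens_b[bj])
          {
            printf("full lcs on a[%d..%d) vs b[%d..%d)\n", i, ai, j, bj);
            i = ai;  /* restart at the anchor */
            j = bj;
          }
        else
          {
            printf("no anchor left: full lcs on the rest\n");
            return;
          }
      }
    }
}

int main(void)
{
  /* Token 5 is a unique match; tokens 1 and 4 differ between the
   * files but are bounded by unique matches on both sides. */
  const int a[] = { 0, 1, 5, 2, 3 };
  const int b[] = { 0, 4, 5, 2, 3 };
  const int occ[] = { 2, 1, 2, 2, 1, 2 };

  diff_with_restarts(a, 5, b, 5, occ);
  return 0;
}
]]]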

When I applied my new test program to that version, however, I
found a number of problems I had missed before. I've fixed most of
them (it now gets through about half the diffs with very few errors),
but there's at least one bug left. Stay tuned. :-)

>>> The question is how I should post these changes. The idx change
>>> is relatively independent of the others; should I post a patch from
>>> current HEAD revision for that change alone, should I post it in this
>>> thread or a new one, and should I specify how that patch interacts
>>> with the other patches (and if so, where)? The other patch is based
>>> on my original patch; should I post a patch of the additional
>>> changes or a cumulative patch from current HEAD revision?
>>
>> I'd say try to break down your changes into a series of self-contained
>> patches that each solve a single problem, with one thread for each patch.
>>
>> It's ok for patches you submit to depend on one another. Just note the
>> dependency in the submission email.
>>
>> It's also fine for patches to overlap or even conflict if that helps to
>> keep them in smaller reviewable chunks. As some of them get committed
>> you can rebase your remaining work against HEAD and repost.
>
> +1.
>
> Cheers,
> --
> Johan
>

Thanks, both, for your comments.

Morten