Re: [PATCH] Speed-up of libsvn_diff using token counts

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Wed, 25 May 2011 01:44:18 +0200

On Tue, May 24, 2011 at 10:46 PM, Stefan Sperling <stsp_at_elego.de> wrote:
> On Tue, May 24, 2011 at 10:22:58PM +0200, Morten Kloster wrote:
>> Johan / Stefan - any progress on the reviewing part?
>
> I haven't had time to review this, sorry.
> It got a bit lost within the recent flurry of activity during and
> after the Berlin hackathon.

Yes, I too have been a bit wrapped up in other issues, not to mention
my $dayjob, so sorry for the wait. I have gone through your patch and
spotted some minor things, but couldn't yet completely verify the core
of your change (the stuff in lcs.c).

> I would also need some time to look at this in detail since my
> working knowledge of the diff code is... suboptimal.

Having worked on it a while ago, I know the diff code pretty well. But
I'm not the most experienced svn developer :-). Hmmmm, seems
complementary :-).

> Note also that your change might be deemed a bit too experimental at
> this point as the main focus right now is on stabilisation and bug
> fixing for the upcoming 1.7 release.

Yes. I'm also unsure on this point right now. Maybe when I've had the
opportunity to analyze and test your patch I might have enough
confidence ... but right now I'm not sure.

It would help if you did some additional tests/checks yourself (or
tell us what tests you have already done -- you mentioned having
tested svn_diff_diff_2, but you didn't say how or how much).

- Have you run svn's testsuite? (you mentioned TortoiseSVN build
scripts, not sure if you can easily run them (look for win-tests.py))

- A test I did regularly when I worked on the diff code: run "svn
blame" on a large file with a long history (I used a 60,000 line file
which had 2200 changes in my repository (unfortunately I can't share
it)). This executes diff2 for every change. Compare the result of the
original svn with your patched svn. It should be exactly the same.
Also try different diff-options for these blame runs
(ignore-whitespace, ignore-eol-style, ...).

To test diff3 you could run an "svn update" on a locally modified
file. Just "update -r<older revision", make some changes in the file,
and update again to let it merge the changes. Try this for some
different cases relevant to your patch (e.g. file with lots of unique
lines between the updates, with all unique lines, with no unique
lines, large/small file, ...).

The diff4 code is not currently used in svn core, but we still like to
keep it around and keep it working. There is one regression test for
it in the testsuite. If that tests passes it should be ok I think.

>> Only possible error I'm aware of is that I made the indices/counts
>> 32 bit signed/unsigned integers, which means it'll run into trouble
>> if there are more than ~2 billion different tokens (lines) or more than
>> ~4 billion occurrences of the same token. If the library should be
>> able to handle such files, they should be made 64 bit (for the
>> indices, it should be safe to keep them 32 bit on 32 bit machines,
>> as those could never have the memory to run such diffs anyway).
>
> Extending this for 64bit platforms sounds ok to me.

+1.

Also, when looking at your patch I was first puzzled why you used a
signed integer for node->index, and unsigned integers for the
token/node counts. Especially since they are assigned to each other in
token.c:

[[[
@@ -122,6 +134,7 @@ tree_insert_token(svn_diff__node_t **node, svn_dif
  new_node->right = NULL;
  new_node->hash = hash;
  new_node->token = token;
+ new_node->index = tree->node_count++;
]]]

Hm, tricky.

Then I understood why, when I read your change to lcs.c#svn_diff__lcs,
where you make use of the signed-ness to come up with (fake) node
indices for the sentinel positions:

[[[
- /* These are never dereferenced, only compared by value, so we
- * can safely fake these up and the void* cast is OK.
+ /* Negative indices will not be used elsewhere
*/
- sentinel_position[0].node = (void*)&sentinel_position[0];
- sentinel_position[1].node = (void*)&sentinel_position[1];
+ sentinel_position[0].node = -1;
+ sentinel_position[1].node = -2;
]]]

I'm not sure if that's really safe (because of your unsigned -> signed
assignment in token.c). Maybe there could be a better way to fake
these nodes for the sentinels? Can't think of any right now.

> We have similar blue sky limits elsewhere, e.g. svn patch uses a
> 64bit line counter for files it is patching (on all platforms -- but
> it never reads more than one line into memory).
>
>> I have a couple of other changes ready:
>>
>> One is a minor simplification, by getting rid of the idx parameter in
>> lcs.c - there is a simpler way to accomplish what it's for.

Simplicity is always good :-).

>> The other change is a partial implementation of the other possible
>> enhancement I mentioned earlier, which takes advantage of unique
>> matches to "restart" the lcs algorithm whenever possible, thus
>> reducing the effect of the square computational complexity. The
>> "partial" part is that this version is only really effective if the two
>> files have roughly the same number of (non-unique) lines; it will not
>> reduce the (P * length difference) part of the complexity. For good
>> cases, though, it can reduce the time required by the lcs algorithm
>> by large factors (roughly the number of well-separated sections of
>> difference, if they are all about the same size).

Hm, I'm not sure how effective this would be in real-world cases. I
imagine that often such unique matches are interspersed with
non-unique matches (e.g. braces, empty lines, typical headers/footers
in blocks of otherwise unique structured text, ...), breaking the
"power" of those unique matches.

What do you think?

>> The question is how I should post these changes. The idx change
>> is relatively independent of the others; should I post a patch from
>> current HEAD revision for that change alone, should I post it in this
>> thread or a new one, and should I specify how that patch interacts
>> with the other patches (and if so, where)? The other patch is based
>> on my original patch; should I post a patch of the additional
>> changes or a cumulative patch from current HEAD revision?
>
> I'd say try to break down your changes into a series of self-contained
> patches that each solve a single problem, with one thread of each patch.
>
> It's ok for patches you submit to depend on one another. Just note the
> dependency in the submission email.
>
> It's also fine for patches to overlap or even conflict if that helps to
> keep them in smaller reviewable chunks. As some of them get committed
> you can rebase your remaining work against HEAD and repost.

+1.

Cheers,

-- 
Johan

Received on 2011-05-25 01:45:03 CEST

This message: [ Message body ]
Next message: Justin Erenkrantz: "Re: svn commit: r1124306 - /subversion/trunk/notes/meetings/berlin-11-agenda"
Previous message: Winston Smith: "RE: large number of large binary files in subversion"
In reply to: Stefan Sperling: "Re: [PATCH] Speed-up of libsvn_diff using token counts"
Next in thread: Morten Kloster: "Re: [PATCH] Speed-up of libsvn_diff using token counts"
Reply: Morten Kloster: "Re: [PATCH] Speed-up of libsvn_diff using token counts"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]