On Thu, May 26, 2011 at 12:00 AM, Morten Kloster <morklo_at_gmail.com> wrote:
[ snip ]
> I have, however, done some significant testing today: I downloaded
> the first 100 revisions of merge.c in libsvn_client (merge.c is the
> largest code file in the current HEAD revision, with almost 800
> revisions), and ran diff of every one of those revisions against every
> other (as well as against the empty file, just in case). While
> the results weren't 100% identical for the original library and the
> patched library, the only differences were slightly different lcs's of
> equal quality (number of matches). Such differences are to be
> expected, since the patched version orders the files based on
> non-unique line length, whereas the original orders based on all
> lines - unique or not. Keep in mind that the official version also can
> give different results for file1 vs. file2 and file2 vs. file1, if the
> files have the same number of lines.
Oh yes, indeed. That could account for some differences (as opposed to
the original version, the roles of the files in the LCS algorithm
might now be reversed, because a different file becomes the shortest
(modulo unique lines)). I think that's okay: it's still an LCS, so a
minimal diff, just a different one. It does make testing more
difficult, though, so it's a bit of a complication ...
> The tests also highlighted the performance improvement
> (total running time for all ~10000 diffs):
> Original debug: Almost 30 minutes
> Original release: 6 minutes 13 secs
> Patched debug: 89 secs
> Patched release: 54 secs
This test might be a bit extreme though, since you compare every
version against every other version (so even versions which are vastly
different, like the 1st against the 800th). In practice I think most
diffs will be between more similar revisions. But ok, it gives a good
general idea of what to expect.
I would be interested in similar timings based on diffs between every
consecutive revision, to see if it still has a big impact (that would
be similar to what an "svn blame" needs to do).
[ snip ]
>> Also, when looking at your patch I was first puzzled why you used a
>> signed integer for node->index, and unsigned integers for the
>> token/node counts. Especially since they are assigned to each other in
>> @@ -122,6 +134,7 @@ tree_insert_token(svn_diff__node_t **node, svn_dif
>> new_node->right = NULL;
>> new_node->hash = hash;
>> new_node->token = token;
>> + new_node->index = tree->node_count++;
>> Hm, tricky.
>> Then I understood why, when I read your change to lcs.c#svn_diff__lcs,
>> where you make use of the signed-ness to come up with (fake) node
>> indices for the sentinel positions:
>> -  /* These are never dereferenced, only compared by value, so we
>> -   * can safely fake these up and the void* cast is OK. */
>> +  /* Negative indices will not be used elsewhere */
>> -  sentinel_position[0].node = (void*)&sentinel_position[0];
>> -  sentinel_position[1].node = (void*)&sentinel_position[1];
>> +  sentinel_position[0].node = -1;
>> +  sentinel_position[1].node = -2;
>> I'm not sure if that's really safe (because of your unsigned -> signed
>> assignment in token.c). Maybe there could be a better way to fake
>> these nodes for the sentinels? Can't think of any right now.
> I'm sure that is safe, as long as the count doesn't exceed the max
> positive value for signed (which can never happen if we change the
> types to 64 bit for 64 bit machines, since the algorithm requires
> far more than 2 bytes of memory per count).
Okay. I'm thinking maybe it would be best to use signed ints all the
way then, since you can then only ever use values up to the maximum of
the signed type anyway. Not sure though; other people with more C
experience might have a different opinion on this ...
>>>> I have a couple of other changes ready:
>>>> One is a minor simplification, by getting rid of the idx parameter in
>>>> lcs.c - there is a simpler way to accomplish what it's for.
>> Simplicity is always good :-).
> One effect of this would be that the diff of file1 vs. file2 could be
> different (a different lcs of the same quality) than for file2 vs.
> file1, even if the files have different numbers of lines. I hope this
> is not a problem.
Hm, I'm not sure about that. I guess it wouldn't really hurt, because
it would always still be an LCS, just a different one.
But isn't the LCS algorithm supposed to use the shortest file as file
A, and the longest as file B in the algorithm (according to the
article, at least)? I suppose that's not strictly required, but it
might change the performance characteristics if you don't, IIUC,
because the worst-case time complexity is O(NP), with N the number of
lines of the longest file and P *the number of deletes going from A to B*.
If A is the shortest one, that gives us the best chance that P will be
small. If A is even a subsequence of B, the problem is solved in
linear time. But if you take the largest file for A, that might mean
that P becomes a lot larger as well.
So I'd try to keep the choice of the indexes the same as it is now.
Based upon the length of both files you always get the shortest one in
the role of A, and the longest in the role of B.
Or am I misunderstanding what you're trying to do here?
>>>> The other change is a partial implementation of the other possible
>>>> enhancement I mentioned earlier, which takes advantage of unique
>>>> matches to "restart" the lcs algorithm whenever possible, thus
>>>> reducing the effect of the square computational complexity. The
>>>> "partial" part is that this version is only really effective if the two
>>>> files have roughly the same number of (non-unique) lines; it will not
>>>> reduce the (P * length difference) part of the complexity. For good
>>>> cases, though, it can reduce the time required by the lcs algorithm
>>>> by large factors (roughly the number of well-separated sections of
>>>> difference, if they are all about the same size).
>> Hm, I'm not sure how effective this would be in real-world cases. I
>> imagine that often such unique matches are interspersed with
>> non-unique matches (e.g. braces, empty lines, typical headers/footers
>> in blocks of otherwise unique structured text, ...), breaking the
>> "power" of those unique matches.
>> What do you think?
> There is no need for the unique matches to be consecutive: As long
> as there are no differences between the files in between the unique
> matches, they can be spread out over an arbitrary number of lines.
> Thus this approach should work fine as long as there are overall
> more unique matches than non-unique differences between the files.
Oh, okay then :-).
> When I applied my new test program to that version, however, I
> found a number of problems I had missed before. I've fixed most of
> them (it now gets through about half the diffs with very few errors),
> but there's at least one bug left. Stay tuned. :-)
One thing I'm wary of at this time is introducing heuristics into the
standard algorithm. I think heuristics to speed things up would be a
good thing at some point (at the cost of guaranteeing a minimal diff),
but I think it needs to be optional (or it can be the default, but the
user should still have the option to go for a "guaranteed minimal
diff").
I'm not sure which way you're going with this latest idea, but I just
mention this because you talked about introducing weights into the LCS
algorithm (for the unique matches). Maybe you can do this without
losing the guarantee of finding a *longest* common subsequence (but
then please explain to me why it's still guaranteed). If there is even
the slightest risk of mis-synchronizing, and finding a sub-longest
common subsequence, I wouldn't go for it at this time (maybe
optionally as part of a --heuristics or --non-minimal option or
something).
Received on 2011-05-26 01:26:01 CEST