[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: [PATCH] Simplification/speed-up of libsvn_diff by eliminating idx

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Fri, 27 May 2011 13:32:51 +0200

On Fri, May 27, 2011 at 12:47 PM, Fuhrmann Stefan (ETAS/ESA1)
<Stefan.Fuhrmann_at_etas.com> wrote:
> Morten Kloster wrote:
>
>> [[[
>> Simpler/faster LCS algorithm in libsvn_diff by elimination of idx.
>>
>>* subversion/libsvn_diff/lcs.c
>>  (svn_diff__snake_t): Removed idx parameter (always 0)
>>  (svn_diff__lcs_t): Removed idx variable (always 0) , let d have either
>>   sign, and moved the origo of fp by d. New version no longer chooses
>>   the shorter file as the "original", and can thus give different LCS's
>>   depending on the order of the input files even when the files have
>>   different lengths. Speed is ~15% faster for core lcs algorithm.
>>]]]
>
> Hi Morten,
>
> Thank you very much for the patch. From what I have seen,
> it should be correct (haven't applied & tested it, yet).
>
>> The old version of the LCS algorithm started at k=0 and ended at
>> k=d, where d is the length difference of the two files. d was required
>> to be positive. It is more efficient to start at k=(-)d and end at 0 (since
>> ending the loops at 0 allows for faster comparisons), and this also
>> makes it easy to let d have either sign, which makes the idx variable
>> unnecessary, further increasing speed by eliminating a parameter
>> to the frequently called function svn_diff__snake_t. The sign of d
>> is handled in the initial values of the loops (which have minimal speed
>> impact).
>
> Looks good.
>
>> The new version will more often give different (but equally good) results
>> depending on the order of the input files.
>
> How does that affect our test suite? Is any of its expected
> outputs invalid now? Is the result of blame affected as well
> (e.g. if a file shrunk and / or grew in later revisions)?
>
>> I estimated the speed
>> difference by diffing all of the first 100 revisions of merge.c in
>> libsvn_client against each other (when diffing only against the previous
>> revision, the core LCS algorithm usually takes a minor part of the total
>> running time, so this is not suitable for gauging the LCS algorithm
>> speed).
>
> The speedup is plausible. Did you test it with 32 bits?
>
>> /* k = -1 will be the first to be used to get previous
>>     * position information from, make sure it holds sane
>>     * data
>>     */
>> -  fp[-1].position[0] = sentinel_position[0].next;
>> -  fp[-1].position[1] = &sentinel_position[1];
>> +  fp[d - 1].position[0] = sentinel_position[0].next;
>> +  fp[d - 1].position[1] = &sentinel_position[1];
>
> Comment and code do not match anymore. Could you
> update your patch to fix that? Also, it would be very
> nice, if you added more commentary than there is today,
> especially the less obvious index handling (why fp[d] will
> always be valid etc.)
>
> As I said, I like your patch and will certainly apply it as
> soon as above points have been addressed.

I'm not sure I agree. I'm a bit suspicious about the ~15% speedup.
Where does this come from? I can't imagine that the "faster
comparisons by ending at 0" accounts for this. And even so, maybe
there is another way to achieve that.

I'm suspicious mainly on theoretical grounds: the article on which
this algorithm is based specifically always starts with the shortest
of both files. This is important, because the time complexity of the
algorithm is proportional to P, the number of deleted lines. If you
start with the largest of the two, that number P will usually be
larger than the other way around.

Specifically, the article states: if A (the shortest file) is a
subsequence of B (the longest), the algorithm's complexity will be
linear (O(N)). If you would start with B, I'm pretty sure that won't
be the case.

Concrete example:
A: 12345
B: 41521334254

Going from A -> B: P==0
Going from B -> A: P==6

The existing algorithm ensures that, no matter what order you give the
arguments to "diff", the LCS algorithm will always use the direction A
-> B for its work. which should be the most efficient.

Maybe I'm misunderstanding this, or maybe there has been further
research in Computer Science around this (I'm not an academic), but if
so, please explain.

Maybe some of the performance improvement (and simplification) can be
achieved simply by calculating idx0 and idx1 once, and then reusing
those variables and pass them on:

    idx0 = length[0] > length[1] ? 1 : 0;
    idx1 = abs(1 - idx0);

instead of re-calculating abs(1 - idx) everywhere.

Cheers,

-- 
Johan
Received on 2011-05-27 13:33:42 CEST

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.