On Fri, May 27, 2011 at 12:47 PM, Fuhrmann Stefan (ETAS/ESA1)

<Stefan.Fuhrmann_at_etas.com> wrote:

*> Morten Kloster wrote:
*

*>
*

*>> [[[
*

*>> Simpler/faster LCS algorithm in libsvn_diff by elimination of idx.
*

*>>
*

*>>* subversion/libsvn_diff/lcs.c
*

*>> (svn_diff__snake_t): Removed idx parameter (always 0)
*

*>> (svn_diff__lcs_t): Removed idx variable (always 0) , let d have either
*

*>> sign, and moved the origo of fp by d. New version no longer chooses
*

*>> the shorter file as the "original", and can thus give different LCS's
*

*>> depending on the order of the input files even when the files have
*

*>> different lengths. Speed is ~15% faster for core lcs algorithm.
*

*>>]]]
*

*>
*

*> Hi Morten,
*

*>
*

*> Thank you very much for the patch. From what I have seen,
*

*> it should be correct (haven't applied & tested it, yet).
*

*>
*

*>> The old version of the LCS algorithm started at k=0 and ended at
*

*>> k=d, where d is the length difference of the two files. d was required
*

*>> to be positive. It is more efficient to start at k=(-)d and end at 0 (since
*

*>> ending the loops at 0 allows for faster comparisons), and this also
*

*>> makes it easy to let d have either sign, which makes the idx variable
*

*>> unnecessary, further increasing speed by eliminating a parameter
*

*>> to the frequently called function svn_diff__snake_t. The sign of d
*

*>> is handled in the initial values of the loops (which have minimal speed
*

*>> impact).
*

*>
*

*> Looks good.
*

*>
*

*>> The new version will more often give different (but equally good) results
*

*>> depending on the order of the input files.
*

*>
*

*> How does that affect our test suite? Is any of its expected
*

*> outputs invalid now? Is the result of blame affected as well
*

*> (e.g. if a file shrunk and / or grew in later revisions)?
*

*>
*

*>> I estimated the speed
*

*>> difference by diffing all of the first 100 revisions of merge.c in
*

*>> libsvn_client against each other (when diffing only against the previous
*

*>> revision, the core LCS algorithm usually takes a minor part of the total
*

*>> running time, so this is not suitable for gauging the LCS algorithm
*

*>> speed).
*

*>
*

*> The speedup is plausible. Did you test it with 32 bits?
*

*>
*

*>> /* k = -1 will be the first to be used to get previous
*

*>> * position information from, make sure it holds sane
*

*>> * data
*

*>> */
*

*>> - fp[-1].position[0] = sentinel_position[0].next;
*

*>> - fp[-1].position[1] = &sentinel_position[1];
*

*>> + fp[d - 1].position[0] = sentinel_position[0].next;
*

*>> + fp[d - 1].position[1] = &sentinel_position[1];
*

*>
*

*> Comment and code do not match anymore. Could you
*

*> update your patch to fix that? Also, it would be very
*

*> nice, if you added more commentary than there is today,
*

*> especially the less obvious index handling (why fp[d] will
*

*> always be valid etc.)
*

*>
*

*> As I said, I like your patch and will certainly apply it as
*

*> soon as above points have been addressed.
*

I'm not sure I agree. I'm a bit suspicious about the ~15% speedup.

Where does this come from? I can't imagine that the "faster

comparisons by ending at 0" accounts for this. And even so, maybe

there is another way to achieve that.

I'm suspicious mainly on theoretical grounds: the article on which

this algorithm is based specifically always starts with the shortest

of both files. This is important, because the time complexity of the

algorithm is proportional to P, the number of deleted lines. If you

start with the largest of the two, that number P will usually be

larger than the other way around.

Specifically, the article states: if A (the shortest file) is a

subsequence of B (the longest), the algorithm's complexity will be

linear (O(N)). If you would start with B, I'm pretty sure that won't

be the case.

Concrete example:

A: 12345

B: 41521334254

Going from A -> B: P==0

Going from B -> A: P==6

The existing algorithm ensures that, no matter what order you give the

arguments to "diff", the LCS algorithm will always use the direction A

-> B for its work. which should be the most efficient.

Maybe I'm misunderstanding this, or maybe there has been further

research in Computer Science around this (I'm not an academic), but if

so, please explain.

Maybe some of the performance improvement (and simplification) can be

achieved simply by calculating idx0 and idx1 once, and then reusing

those variables and pass them on:

idx0 = length[0] > length[1] ? 1 : 0;

idx1 = abs(1 - idx0);

instead of re-calculating abs(1 - idx) everywhere.

Cheers,

--
Johan

Received on 2011-05-27 13:33:42 CEST