On Fri, 2011-05-27 at 13:32 +0200, Johan Corveleyn wrote:

*> On Fri, May 27, 2011 at 12:47 PM, Fuhrmann Stefan (ETAS/ESA1)
*

*> <Stefan.Fuhrmann_at_etas.com> wrote:
*

*> > Morten Kloster wrote:
*

*> >
*

*> >> [[[
*

*> >> Simpler/faster LCS algorithm in libsvn_diff by elimination of idx.
*

*> >>
*

*> >>* subversion/libsvn_diff/lcs.c
*

*> >> (svn_diff__snake_t): Removed idx parameter (always 0)
*

*> >> (svn_diff__lcs_t): Removed idx variable (always 0) , let d have either
*

*> >> sign, and moved the origo of fp by d. New version no longer chooses
*

*> >> the shorter file as the "original", and can thus give different LCS's
*

*> >> depending on the order of the input files even when the files have
*

*> >> different lengths. Speed is ~15% faster for core lcs algorithm.
*

*> >>]]]
*

*> >
*

*> > Hi Morten,
*

*> >
*

*> > Thank you very much for the patch. From what I have seen,
*

*> > it should be correct (haven't applied & tested it, yet).
*

*> >
*

*> >> The old version of the LCS algorithm started at k=0 and ended at
*

*> >> k=d, where d is the length difference of the two files. d was required
*

*> >> to be positive. It is more efficient to start at k=(-)d and end at 0 (since
*

*> >> ending the loops at 0 allows for faster comparisons), and this also
*

*> >> makes it easy to let d have either sign, which makes the idx variable
*

*> >> unnecessary, further increasing speed by eliminating a parameter
*

*> >> to the frequently called function svn_diff__snake_t. The sign of d
*

*> >> is handled in the initial values of the loops (which have minimal speed
*

*> >> impact).
*

*> >
*

*> > Looks good.
*

*> >
*

*> >> The new version will more often give different (but equally good) results
*

*> >> depending on the order of the input files.
*

*> >
*

*> > How does that affect our test suite? Is any of its expected
*

*> > outputs invalid now? Is the result of blame affected as well
*

*> > (e.g. if a file shrunk and / or grew in later revisions)?
*

*> >
*

*> >> I estimated the speed
*

*> >> difference by diffing all of the first 100 revisions of merge.c in
*

*> >> libsvn_client against each other (when diffing only against the previous
*

*> >> revision, the core LCS algorithm usually takes a minor part of the total
*

*> >> running time, so this is not suitable for gauging the LCS algorithm
*

*> >> speed).
*

*> >
*

*> > The speedup is plausible. Did you test it with 32 bits?
*

*> >
*

*> >> /* k = -1 will be the first to be used to get previous
*

*> >> * position information from, make sure it holds sane
*

*> >> * data
*

*> >> */
*

*> >> - fp[-1].position[0] = sentinel_position[0].next;
*

*> >> - fp[-1].position[1] = &sentinel_position[1];
*

*> >> + fp[d - 1].position[0] = sentinel_position[0].next;
*

*> >> + fp[d - 1].position[1] = &sentinel_position[1];
*

*> >
*

*> > Comment and code do not match anymore. Could you
*

*> > update your patch to fix that? Also, it would be very
*

*> > nice, if you added more commentary than there is today,
*

*> > especially the less obvious index handling (why fp[d] will
*

*> > always be valid etc.)
*

*> >
*

*> > As I said, I like your patch and will certainly apply it as
*

*> > soon as above points have been addressed.
*

*>
*

*> I'm not sure I agree. I'm a bit suspicious about the ~15% speedup.
*

*> Where does this come from? I can't imagine that the "faster
*

*> comparisons by ending at 0" accounts for this. And even so, maybe
*

*> there is another way to achieve that.
*

*>
*

*> I'm suspicious mainly on theoretical grounds: the article on which
*

*> this algorithm is based specifically always starts with the shortest
*

*> of both files. This is important, because the time complexity of the
*

*> algorithm is proportional to P, the number of deleted lines. If you
*

*> start with the largest of the two, that number P will usually be
*

*> larger than the other way around.
*

*>
*

*> Specifically, the article states: if A (the shortest file) is a
*

*> subsequence of B (the longest), the algorithm's complexity will be
*

*> linear (O(N)). If you would start with B, I'm pretty sure that won't
*

*> be the case.
*

*>
*

*> Concrete example:
*

*> A: 12345
*

*> B: 41521334254
*

*>
*

*> Going from A -> B: P==0
*

*> Going from B -> A: P==6
*

*>
*

*> The existing algorithm ensures that, no matter what order you give the
*

*> arguments to "diff", the LCS algorithm will always use the direction A
*

*> -> B for its work. which should be the most efficient.
*

*>
*

*> Maybe I'm misunderstanding this, or maybe there has been further
*

*> research in Computer Science around this (I'm not an academic), but if
*

*> so, please explain.
*

*>
*

*> Maybe some of the performance improvement (and simplification) can be
*

*> achieved simply by calculating idx0 and idx1 once, and then reusing
*

*> those variables and pass them on:
*

*>
*

*> idx0 = length[0] > length[1] ? 1 : 0;
*

*> idx1 = abs(1 - idx0);
*

*>
*

*> instead of re-calculating abs(1 - idx) everywhere.
*

Why does it use "abs()", I wonder? That looks redundant if idx can only

be 0 or 1.

I'm not paying much attention or deeply understanding all of this, but

I'm curious so I took a quick look.

Morten Kloster wrote:

*> * subversion/libsvn_diff/lcs.c
*

*> (svn_diff__snake_t): Removed idx parameter (always 0)
*

*> (svn_diff__lcs_t): Removed idx variable (always 0) , let d have either
*

*> sign, and moved the origo of fp by d.
*

The functions are called svn_diff__snake and svn_diff__lcs, without the

_t suffix.

In most of your patch you assume that 'idx' would have been 0, but in

two places you assume it would have been 1. Is this intentional? Here:

*> - fp += length[idx] + 1;
*

*> + fp += length[1] + 1;
*

*> - d = length[abs(1 - idx)] - length[idx];
*

*> + d = length[0] - length[1];
*

- Julian

Received on 2011-05-27 15:13:44 CEST