On Fri, 2011-05-27 at 13:32 +0200, Johan Corveleyn wrote:
> On Fri, May 27, 2011 at 12:47 PM, Fuhrmann Stefan (ETAS/ESA1)
> <Stefan.Fuhrmann_at_etas.com> wrote:
> > Morten Kloster wrote:
> >
> >> [[[
> >> Simpler/faster LCS algorithm in libsvn_diff by elimination of idx.
> >>
> >>* subversion/libsvn_diff/lcs.c
> >> (svn_diff__snake_t): Removed idx parameter (always 0)
> >> (svn_diff__lcs_t): Removed idx variable (always 0) , let d have either
> >> sign, and moved the origo of fp by d. New version no longer chooses
> >> the shorter file as the "original", and can thus give different LCS's
> >> depending on the order of the input files even when the files have
> >> different lengths. Speed is ~15% faster for core lcs algorithm.
> >>]]]
> >
> > Hi Morten,
> >
> > Thank you very much for the patch. From what I have seen,
> > it should be correct (haven't applied & tested it, yet).
> >
> >> The old version of the LCS algorithm started at k=0 and ended at
> >> k=d, where d is the length difference of the two files. d was required
> >> to be positive. It is more efficient to start at k=(-)d and end at 0 (since
> >> ending the loops at 0 allows for faster comparisons), and this also
> >> makes it easy to let d have either sign, which makes the idx variable
> >> unnecessary, further increasing speed by eliminating a parameter
> >> to the frequently called function svn_diff__snake_t. The sign of d
> >> is handled in the initial values of the loops (which have minimal speed
> >> impact).
> >
> > Looks good.
> >
> >> The new version will more often give different (but equally good) results
> >> depending on the order of the input files.
> >
> > How does that affect our test suite? Is any of its expected
> > outputs invalid now? Is the result of blame affected as well
> > (e.g. if a file shrunk and / or grew in later revisions)?
> >
> >> I estimated the speed
> >> difference by diffing all of the first 100 revisions of merge.c in
> >> libsvn_client against each other (when diffing only against the previous
> >> revision, the core LCS algorithm usually takes a minor part of the total
> >> running time, so this is not suitable for gauging the LCS algorithm
> >> speed).
> >
> > The speedup is plausible. Did you test it with 32 bits?
> >
> >> /* k = -1 will be the first to be used to get previous
> >> * position information from, make sure it holds sane
> >> * data
> >> */
> >> - fp[-1].position[0] = sentinel_position[0].next;
> >> - fp[-1].position[1] = &sentinel_position[1];
> >> + fp[d - 1].position[0] = sentinel_position[0].next;
> >> + fp[d - 1].position[1] = &sentinel_position[1];
> >
> > Comment and code do not match anymore. Could you
> > update your patch to fix that? Also, it would be very
> > nice, if you added more commentary than there is today,
> > especially the less obvious index handling (why fp[d] will
> > always be valid etc.)
> >
> > As I said, I like your patch and will certainly apply it as
> > soon as above points have been addressed.
>
> I'm not sure I agree. I'm a bit suspicious about the ~15% speedup.
> Where does this come from? I can't imagine that the "faster
> comparisons by ending at 0" accounts for this. And even so, maybe
> there is another way to achieve that.
>
> I'm suspicious mainly on theoretical grounds: the article on which
> this algorithm is based specifically always starts with the shortest
> of both files. This is important, because the time complexity of the
> algorithm is proportional to P, the number of deleted lines. If you
> start with the largest of the two, that number P will usually be
> larger than the other way around.
>
> Specifically, the article states: if A (the shortest file) is a
> subsequence of B (the longest), the algorithm's complexity will be
> linear (O(N)). If you would start with B, I'm pretty sure that won't
> be the case.
>
> Concrete example:
> A: 12345
> B: 41521334254
>
> Going from A -> B: P==0
> Going from B -> A: P==6
>
> The existing algorithm ensures that, no matter what order you give the
> arguments to "diff", the LCS algorithm will always use the direction A
> -> B for its work. which should be the most efficient.
>
> Maybe I'm misunderstanding this, or maybe there has been further
> research in Computer Science around this (I'm not an academic), but if
> so, please explain.
>
> Maybe some of the performance improvement (and simplification) can be
> achieved simply by calculating idx0 and idx1 once, and then reusing
> those variables and pass them on:
>
> idx0 = length[0] > length[1] ? 1 : 0;
> idx1 = abs(1 - idx0);
>
> instead of re-calculating abs(1 - idx) everywhere.
Why does it use "abs()", I wonder? That looks redundant if idx can only
be 0 or 1.
I'm not paying much attention or deeply understanding all of this, but
I'm curious so I took a quick look.
Morten Kloster wrote:
> * subversion/libsvn_diff/lcs.c
> (svn_diff__snake_t): Removed idx parameter (always 0)
> (svn_diff__lcs_t): Removed idx variable (always 0) , let d have either
> sign, and moved the origo of fp by d.
The functions are called svn_diff__snake and svn_diff__lcs, without the
_t suffix.
In most of your patch you assume that 'idx' would have been 0, but in
two places you assume it would have been 1. Is this intentional? Here:
> - fp += length[idx] + 1;
> + fp += length[1] + 1;
> - d = length[abs(1 - idx)] - length[idx];
> + d = length[0] - length[1];
- Julian
Received on 2011-05-27 15:13:44 CEST