On 27.05.2011 19:29, Morten Kloster wrote:
> On Fri, May 27, 2011 at 3:13 PM, Julian Foad <julian.foad_at_wandisco.com> wrote:
>> On Fri, 2011-05-27 at 13:32 +0200, Johan Corveleyn wrote:
>>> On Fri, May 27, 2011 at 12:47 PM, Fuhrmann Stefan (ETAS/ESA1)
>>> <Stefan.Fuhrmann_at_etas.com> wrote:
>>>> Morten Kloster wrote:
>>>>
>>>>> [[[
>>>>> Simpler/faster LCS algorithm in libsvn_diff by elimination of idx.
>>>>>
>>>>> * subversion/libsvn_diff/lcs.c
>>>>> (svn_diff__snake_t): Removed idx parameter (always 0)
>>>>> (svn_diff__lcs_t): Removed idx variable (always 0), let d have either
>>>>> sign, and moved the origin of fp by d. New version no longer chooses
>>>>> the shorter file as the "original", and can thus give different LCS's
>>>>> depending on the order of the input files even when the files have
>>>>> different lengths. Speed is ~15% faster for core lcs algorithm.
>>>>> ]]]
>>>> Hi Morten,
>>>>
>>>> Thank you very much for the patch. From what I have seen,
>>>> it should be correct (haven't applied & tested it yet).
>>>>
>>>>> The old version of the LCS algorithm started at k=0 and ended at
>>>>> k=d, where d is the length difference of the two files; d was required
>>>>> to be positive. It is more efficient to start at k=d (of either sign) and
>>>>> end at 0, since ending the loops at 0 allows for faster comparisons.
>>>>> This also makes it easy to let d have either sign, which makes the idx
>>>>> variable unnecessary, further increasing speed by eliminating a
>>>>> parameter to the frequently called function svn_diff__snake. The sign
>>>>> of d is handled in the initial values of the loops (which have minimal
>>>>> speed impact).
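>>>>> Schematically, the k loops change roughly like this (a sketch with
>>>>> abbreviated signatures, not the literal patch):
>>>>>
>>>>>   /* Old: d >= 0 enforced, idx selects the shorter file. */
>>>>>   for (k = -p; k < d; k++)
>>>>>     svn_diff__snake(k, fp, idx, &lcs_freelist, pool);
>>>>>   for (k = d + p; k >= d; k--)
>>>>>     svn_diff__snake(k, fp, idx, &lcs_freelist, pool);
>>>>>
>>>>>   /* New: d may have either sign; both loops end at 0. */
>>>>>   for (k = (d < 0 ? d : 0) - p; k < 0; k++)    /* free inserts */
>>>>>     svn_diff__snake(fp + k, &lcs_freelist, pool);
>>>>>   for (k = (d > 0 ? d : 0) + p; k >= 0; k--)   /* free deletes */
>>>>>     svn_diff__snake(fp + k, &lcs_freelist, pool);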
>>>> Looks good.
>>>>
>>>>> The new version will more often give different (but equally good) results
>>>>> depending on the order of the input files.
>>>> How does that affect our test suite? Is any of its expected
>>>> outputs invalid now? Is the result of blame affected as well
>>>> (e.g. if a file shrunk and / or grew in later revisions)?
>>>>
> The patch passes all libsvn_diff tests, at least. After some
> thought, I found that the new version will actually improve the
> behavior of merge: given the files
>
> abcdef (Base file)
> abdce (Their file)
> abdcefg (My file)
>
> the old version will diff (Their, Base) and (Base, My), since
> Their is shorter than Base while My is longer. That means that
> for Base vs. Their, the 'd's will be matched, while for Base vs.
> My, the 'c's will be matched, yielding a conflict during merge,
> even though that part of the files had identical changes in Their
> and My! With the new version, the diff order is given strictly by
> the calling function, thus it will match the same way in both
> cases and avoid the conflict (both versions produce a conflict at the
> end, of course).
>
> In general, the local diff between mostly-matching parts of two
> files can no longer be affected by changes in those files far
> from the area of interest (except if those changes cause a
> global change in LCS), and I'm pretty sure that's all for the good.
>
> The result of blame can be affected by the change, but only
> in cases where it essentially was arbitrary to begin with. I
> don't see that it can have any significant negative impact.
Ok.
I think the bigger point of your patch is the
"simplification" part rather than the "faster"
part.
>>>>> I estimated the speed
>>>>> difference by diffing all of the first 100 revisions of merge.c in
>>>>> libsvn_client against each other (when diffing only against the previous
>>>>> revision, the core LCS algorithm usually takes a minor part of the total
>>>>> running time, so this is not suitable for gauging the LCS algorithm
>>>>> speed).
>>>> The speedup is plausible. Did you test it with 32 bits?
> I'm running on a 64-bit machine/OS but using the win32 builds.
>
I can see that the register pressure is reduced,
particularly under 32 bits. And that may also cause
some compilers to actually inline the *__snake
code.
GCC does the latter anyway, and under 64 bits
the gains are under 1% for the *_lcs function.
However, simpler and more predictable
code often opens the door to further improvements,
such as eliminating pointer dereferences or
re-organizing data to make it more "streamy".
>>>>> /* k = -1 will be the first to be used to get previous
>>>>> * position information from, make sure it holds sane
>>>>> * data
>>>>> */
>>>>> - fp[-1].position[0] = sentinel_position[0].next;
>>>>> - fp[-1].position[1] = &sentinel_position[1];
>>>>> + fp[d - 1].position[0] = sentinel_position[0].next;
>>>>> + fp[d - 1].position[1] = &sentinel_position[1];
>>>> Comment and code do not match anymore. Could you
>>>> update your patch to fix that? Also, it would be very
>>>> nice if you added more commentary than there is today,
>>>> especially for the less obvious index handling (why fp[d]
>>>> will always be valid etc.)
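>>>> For instance, something along the lines of (just a sketch):
>>>>
>>>>     /* k = d - 1 will be the first to be used to get previous
>>>>      * position information from, make sure it holds sane
>>>>      * data
>>>>      */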
> I've attached an updated version.
>
>>>> As I said, I like your patch and will certainly apply it as
>>>> soon as above points have been addressed.
>>> I'm not sure I agree. I'm a bit suspicious about the ~15% speedup.
>>> Where does this come from? I can't imagine that the "faster
>>> comparisons by ending at 0" accounts for this. And even so, maybe
>>> there is another way to achieve that.
>>>
> 15% is more than I expected, but that's what I get. Surprisingly,
> none of the speed-up seems to be from the removal of idx; it's all
> from the change to the k loops.
>
Depending on what tool you used to measure
that, you may or may not see whether the code
got inlined there. Inlining would be a
perfect explanation for a 15% gain.
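One way to take the guesswork out of it is to force inlining and
re-measure; a hypothetical GCC/Clang sketch (the macro name and the
stand-in function are mine, and MSVC would use __forceinline):

    /* Hypothetical illustration, not the actual lcs.c code: force
     * the hot helper inline, then re-run the benchmark to see how
     * much of the speed-up is due to inlining alone. */
    #ifdef _MSC_VER
    #  define FORCE_INLINE __forceinline
    #else
    #  define FORCE_INLINE __attribute__((always_inline)) inline
    #endif

    static FORCE_INLINE int
    snake_stand_in(int a, int b)
    {
      return a > b ? a : b;
    }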
>>> I'm suspicious mainly on theoretical grounds: the article on which
>>> this algorithm is based always starts with the shorter of
>>> the two files. This is important, because the time complexity of the
>>> algorithm is proportional to P, the number of deleted lines. If you
>>> start with the larger of the two, that number P will usually be
>>> larger than the other way around.
>>>
>>> Specifically, the article states: if A (the shorter file) is a
>>> subsequence of B (the longer), the algorithm's complexity will be
>>> linear (O(N)). If you start with B, I'm pretty sure that won't
>>> be the case.
>>>
>>> Concrete example:
>>> A: 12345
>>> B: 41521334254
>>>
>>> Going from A -> B: P==0
>>> Going from B -> A: P==6
>>>
>>> The existing algorithm ensures that, no matter what order you give the
>>> arguments to "diff", the LCS algorithm will always use the direction A
>>> -> B for its work, which should be the most efficient.
>>>
>>> Maybe I'm misunderstanding this, or maybe there has been further
>>> research in Computer Science around this (I'm not an academic), but if
>>> so, please explain.
>>>
> The problem where you start with file A (short) and end with file B is
> completely symmetrical to the problem where you start with file B and
> end with file A - all you have to do is swap inserts and deletes. The old
> algorithm assumed there would be fewer deletes than inserts, while my
> new approach just uses whichever there is fewer of. Note the two loops
> for k: The first loop does free inserts until the remaining lengths of the
> files are equal, while the second does free deletes until the remaining
> lengths are equal. (Each loop can accommodate at most one operation of
> the opposite kind for each value of p, so those are not free.)
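> To make that concrete with Johan's files (a sketch of the arithmetic,
> using the new loop bounds):
>
>   A: 12345        B: 41521334254
>   diff(A, B): d = 5 - 11 = -6, 6 inserts, 0 deletes:
>               the k < 0 loop absorbs all 6 inserts for free, p stays 0
>   diff(B, A): d = 11 - 5 = 6, 0 inserts, 6 deletes:
>               the k >= 0 loop absorbs all 6 deletes for free, p stays 0
>
> Either way the run is linear when one file is a subsequence of the other,
> regardless of argument order.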
>
>>> Maybe some of the performance improvement (and simplification) can be
>>> achieved simply by calculating idx0 and idx1 once, and then reusing
>>> those variables and pass them on:
>>>
>>> idx0 = length[0] > length[1] ? 1 : 0;
>>> idx1 = abs(1 - idx0);
>>>
>>> instead of re-calculating abs(1 - idx) everywhere.
> Adding a parameter to the calls to svn_diff__snake would be just as likely
> to increase the running time. The usage of idx is almost certainly irrelevant
> anyway, since new lcs structs are only rarely made during normal diffs.
>
>> Why does it use "abs()", I wonder? That looks redundant if idx can only
>> be 0 or 1.
>>
> I wondered the same thing myself! :-)
>
>> I'm not paying much attention or deeply understanding all of this, but
>> I'm curious, so I took a quick look.
>>
>> Morten Kloster wrote:
>>> * subversion/libsvn_diff/lcs.c
>>> (svn_diff__snake_t): Removed idx parameter (always 0)
>>> (svn_diff__lcs_t): Removed idx variable (always 0), let d have either
>>> sign, and moved the origin of fp by d.
>> The functions are called svn_diff__snake and svn_diff__lcs, without the
>> _t suffix.
>>
> Thanks; I was thinking of the names of the structs.
>
>> In most of your patch you assume that 'idx' would have been 0, but in
>> two places you assume it would have been 1. Is this intentional? Here:
>>
>>> - fp += length[idx] + 1;
>>> + fp += length[1] + 1;
>>> - d = length[abs(1 - idx)] - length[idx];
>>> + d = length[0] - length[1];
>> - Julian
>>
> Actually, I still assume that idx is 0, but I need the opposite value from
> before: Instead of starting at k=0 and ending at k=d, we now start at k=d
> and end at k=0. That means the sign of d must be opposite of what it
> was before (the 'd =' line), and the amounts of space needed to the left
> and right of the origin are switched (the 'fp +=' line).
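> Schematically, the relevant lines now read (a simplified sketch of my
> reading of the code, not a verbatim quote):
>
>   /* fp[k] must be valid for roughly -length[1] - 1 <= k <= length[0] + 1,
>    * whichever sign d has, so the origin is shifted past the whole
>    * negative range: */
>   fp = apr_pcalloc(pool, sizeof(*fp) * (length[0] + length[1] + 3));
>   fp += length[1] + 1;
>   d = length[0] - length[1];   /* may now be negative */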
>
>
> Updated log message:
> [[[
> Simpler/faster LCS algorithm in libsvn_diff by elimination of idx.
>
> * subversion/libsvn_diff/lcs.c
> (svn_diff__snake): Removed idx parameter (always 0)
> (svn_diff__lcs): Removed idx variable (always 0), let d have either
> sign, and moved the origin of fp by d. New version no longer chooses
> the shorter file as the "original", and can thus give different LCS's
> depending on the order of the input files even when the files have
> different lengths. Speed is ~15% faster for core lcs algorithm.
> ]]]
Committed as r1128857. I took the liberty of
weakening the speed-up claim.
-- Stefan^2.
Received on 2011-05-29 13:28:08 CEST