[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: [PATCH] Simplification/speed-up of libsvn_diff by eliminating idx

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Sat, 28 May 2011 10:18:01 +0200

On Fri, May 27, 2011 at 7:29 PM, Morten Kloster <morklo_at_gmail.com> wrote:
> On Fri, May 27, 2011 at 3:13 PM, Julian Foad <julian.foad_at_wandisco.com> wrote:
>> On Fri, 2011-05-27 at 13:32 +0200, Johan Corveleyn wrote:
>>> On Fri, May 27, 2011 at 12:47 PM, Fuhrmann Stefan (ETAS/ESA1)
>>> <Stefan.Fuhrmann_at_etas.com> wrote:
>>> > Morten Kloster wrote:
>>> >
>>> >> [[[
>>> >> Simpler/faster LCS algorithm in libsvn_diff by elimination of idx.
>>> >>
>>> >>* subversion/libsvn_diff/lcs.c
>>> >>  (svn_diff__snake_t): Removed idx parameter (always 0)
>>> >>  (svn_diff__lcs_t): Removed idx variable (always 0) , let d have either
>>> >>   sign, and moved the origo of fp by d. New version no longer chooses
>>> >>   the shorter file as the "original", and can thus give different LCS's
>>> >>   depending on the order of the input files even when the files have
>>> >>   different lengths. Speed is ~15% faster for core lcs algorithm.
>>> >>]]]
>>> >
>>> > Hi Morten,
>>> >
>>> > Thank you very much for the patch. From what I have seen,
>>> > it should be correct (haven't applied & tested it, yet).
>>> >
>>> >> The old version of the LCS algorithm started at k=0 and ended at
>>> >> k=d, where d is the length difference of the two files. d was required
>>> >> to be positive. It is more efficient to start at k=(-)d and end at 0 (since
>>> >> ending the loops at 0 allows for faster comparisons), and this also
>>> >> makes it easy to let d have either sign, which makes the idx variable
>>> >> unnecessary, further increasing speed by eliminating a parameter
>>> >> to the frequently called function svn_diff__snake_t. The sign of d
>>> >> is handled in the initial values of the loops (which have minimal speed
>>> >> impact).
>>> >
>>> > Looks good.
>>> >
>>> >> The new version will more often give different (but equally good) results
>>> >> depending on the order of the input files.
>>> >
>>> > How does that affect our test suite? Is any of its expected
>>> > outputs invalid now? Is the result of blame affected as well
>>> > (e.g. if a file shrunk and / or grew in later revisions)?
>>> >
>
> The patch passes all libsvn_diff tests, at least. After some
> thought, I found that the new version will actually improve the
> performance of merge: Given files
>
> abcdef  (Base file)
> abdce  (Their file)
> abdcefg  (My file)
>
> the old version will diff (Their, Base) and (Base, My), since
> Their is shorter than Base while My is longer. That means that
> for Base vs. Their, the 'd's will be matched, while for Base vs.
> My, the 'c's will be matched, yielding a conflict during merge,
> even though that part of the files had identical changes in Their
> and My! With the new version, the diff order is given strictly by
> the calling function, thus it will match the same way in both
> cases and avoid conflict (both version produce a conflict at the
> end, of course).

Interesting!

So your patch makes sure that the order of the files during LCS is
respected, as given by the caller. Which means in this case, the calls
from svn_diff_diff3_2:302-306

  /* Get the lcs for original-modified and original-latest */
  lcs_om = svn_diff__lcs(position_list[0], position_list[1], prefix_lines,
                         suffix_lines, subpool);
  lcs_ol = svn_diff__lcs(position_list[0], position_list[2], prefix_lines,
                         suffix_lines, subpool);

it will always calculate the LCS between (Base, My) and (Base, Their)
in the given direction. While the original algorithm might switch one
of them based on whichever is shortest, potentially yielding a
different LCS.

I agree that doing them in this consistent "direction" (always from
base to [mine, theirs]) is more logical, and potentially gives better
results. Though I think it could be pretty exceptional in real-world
cases, I agree that it is cleaner.

Thinking more about this, I'm starting to get convinced, as a user,
that it's quite normal, even expected, that diff(A,B) and diff(B,A)
can give different results sometimes.

> In general, the local diff between mostly-matching parts of two
> files can no longer be affected by changes in those files far
> from the area of interest (except if those changes cause a
> global change in LCS), and I'm pretty sure that's all for the good.
>
> The result of blame can be affected by the change, but only
> in cases where it essentially was arbitrary to begin with. I
> don't see that it can have any significant negative impact.

Agreed. It would probably give a "more correct" result, because always
using the "correct direction", so the most "logical" LCS for that
order. In real-world cases, it would probably have little significant
impact (like "who inserted this empty line").

[snip]

>>> I'm suspicious mainly on theoretical grounds: the article on which
>>> this algorithm is based specifically always starts with the shortest
>>> of both files. This is important, because the time complexity of the
>>> algorithm is proportional to P, the number of deleted lines. If you
>>> start with the largest of the two, that number P will usually be
>>> larger than the other way around.
>>>
>>> Specifically, the article states: if A (the shortest file) is a
>>> subsequence of B (the longest), the algorithm's complexity will be
>>> linear (O(N)). If you would start with B, I'm pretty sure that won't
>>> be the case.
>>>
>>> Concrete example:
>>> A: 12345
>>> B: 41521334254
>>>
>>> Going from A -> B: P==0
>>> Going from B -> A: P==6
>>>
>>> The existing algorithm ensures that, no matter what order you give the
>>> arguments to "diff", the LCS algorithm will always use the direction A
>>> -> B for its work. which should be the most efficient.
>>>
>>> Maybe I'm misunderstanding this, or maybe there has been further
>>> research in Computer Science around this (I'm not an academic), but if
>>> so, please explain.
>>>
>
> The problem where you start with file A (short) and end with file B) is
> completely symmetrical to the problem where you start with file B and
> end with file A - all you have to do is swap inserts and deletes. The old
> algorithm assumed there would be fewer deletes than inserts, while my
> new approach just uses whichever there is fewer of. Note the two loops
> for k: The first loop does free inserts until the remaining lengths of the
> files are equal, while the second does free deletes until the remaining
> lengths are equal. (Each loop can accommodate at most one of the
> opposite operation for each value of p, so those are not free.)

Ah yes. It's beginning to dawn on me. Re-reading the article, and
letting it work a little bit on me, I'm starting to understand :-).

I haven't looked to deeply into the implementation of your patch yet,
but theoretically I have no more objections. Maybe sometime later this
weekend I'll have some more time to review the implementation, and run
some tests (but if Stefan2 beats me to it, and commits this, that
would be fine for me as well :-)).

Actually, about the theory behind the algorithm: I think it would be
quite beneficial if lcs.c would contain more high level documentation
about how the algorithm works, and why it works. Right now it only
contains this reference to "the article", which is quite academic (not
to mention that there is still quite some distance between the
academic explanation, and the concrete way this is implemented;
especially after your series of patches :-)). It makes it very hard
for most developers to grok this piece of code (and I'm speaking for
myself here :-)), a lot of effort is required just to go and look up
the documentation/background etc...

Would you be interested in adding some high-level documentation to
lcs.c, explaining the algorithm at a high level, maybe with an
example, ...? You seem to have quite a good grip on this matter.

A high-level explanation, maybe combined with some technical comments
here and there in the code to document specifics of the concrete
implementation, would be highly beneficial IMHO to get more developers
interested in libsvn_diff, and hence increasing the chances to get
things reviewed and improved ...

Cheers,

-- 
Johan
Received on 2011-05-28 10:19:01 CEST

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.