On 28.05.2011 12:25, Morten Kloster wrote:

*> On Sat, May 28, 2011 at 10:18 AM, Johan Corveleyn<jcorvel_at_gmail.com> wrote:
*

*> []
*

*>> Actually, about the theory behind the algorithm: I think it would be
*

*>> quite beneficial if lcs.c would contain more high level documentation
*

*>> about how the algorithm works, and why it works. Right now it only
*

*>> contains this reference to "the article", which is quite academic (not
*

*>> to mention that there is still quite some distance between the
*

*>> academic explanation, and the concrete way this is implemented;
*

*>> especially after your series of patches :-)). It makes it very hard
*

*>> for most developers to grok this piece of code (and I'm speaking for
*

*>> myself here :-)), a lot of effort is required just to go and look up
*

*>> the documentation/background etc...
*

*>>
*

*>> Would you be interested in adding some high-level documentation to
*

*>> lcs.c, explaining the algorithm at a high level, maybe with an
*

*>> example, ...? You seem to have quite a good grip on this matter.
*

*>>
*

*>> A high-level explanation, maybe combined with some technical comments
*

*>> here and there in the code to document specifics of the concrete
*

*>> implementation, would be highly beneficial IMHO to get more developers
*

*>> interested in libsvn_diff, and hence increasing the chances to get
*

*>> things reviewed and improved ...
*

*>>
*

*>> Cheers,
*

*>> --
*

*>> Johan
*

*>>
*

I was going to ask for the same ;)

*> How's this?
*

*>
*

*> [[[
*

*> /*
*

*> * Calculate the Longest Common Subsequence (LCS) between two datasources.
*

*> * This function is what makes the diff code tick.
*

*> *
*

*> * The LCS algorithm implemented here is based on the approach described
*

*> * by Sun Wu, Udi Manber and Gene Meyers in "An O(NP) Sequence Comparison
*

*> * Algorithm", but has been modified for better performance.
*

*> *
*

*> * Let M and N be the lengths (number of tokens) of the two sources
*

*> * ('files'). The goal is to reach the end of both sources (files) with the
*

*> * minimum number of insertions + deletions. Since there is a known length
*

*> * difference N-M between the files, that is equivalent to just the minimum
*

*> * number of deletions, or equivalently the minimum number of insertions.
*

*> * For symmetry, we use the lesser number - deletions if M<N, insertions if
*

*> * M>N.
*

*> *
*

*> * Let 'k' be the difference in remaining length between the files, i.e.
*

*> * if we're at the beginning of both files, k=N-M, whereas k=0 for the
*

*> * 'end state', at the end of both files. An insertion will increase k by
*

*> * one, while a deletion decreases k by one. If k<0, then insertions are
*

*> * 'free' - we need those to reach the end state k=0 anyway - but deletions
*

*> * are costly: Adding a deletion means that we will have to add an additional
*

*> * insertion later to reach the end state, so it doesn't matter if we count
*

*> * deletions or insertions. Similarly, deletions are free for k>0.
*

*> *
*

*> * Let a 'state' be a given position in each file {pos1, pos2}. An array
*

*> * 'fp' keeps track of the best possible state (largest values of
*

*> * {pos1, pos2}) that can be achieved for a given cost 'p' (# moves away
*

*> * from k=0), as well as a linked list of what matches were used to reach
*

*> * that state. For each new value of p, we find for each value of k the
*

*> * best achievable state for that k - either by doing a costly operation
*

*> * (deletion if k<0) from a state achieved at a lower p, or doing a free
*

*> * operation (insertion if k<0) from a state achieved at the same p -
*

*> * and in both cases advancing past any matching regions found. This is
*

*> * handled by running loops over k in order of descending absolute value.
*

*> *
*

*> * A recent improvement of the algorithm is to ignore tokens that are unique
*

*> * to one file or the other, as those are known from the start to be
*

*> * impossible to match.
*

*> */
*

*> ]]]
*

*>
*

*>
*

Perfect. Committed as r1128862.

Thanks!

-- Stefan^2.

Received on 2011-05-29 13:29:39 CEST