Here's an updated patch based on the current HEAD, and with

one or two minor improvements. While, as stated, the patch

can give many-fold speed increases for ideal cases, it also slows

down the LCS algorithm significantly for poorly matching files; I

got about 18% increase in running time on my system for files

that were very poor matches but had no lines unique to either file.

On Fri, May 27, 2011 at 11:00 PM, Morten Kloster <morklo_at_gmail.com> wrote:

*> [[[
*

*> Speeds up LCS algorithm for well-separated sections of change
*

*> By exploiting unique matches between the two files, it is possible to
*

*> "restart" the LCS algorithm at times and reset the buildup of square
*

*> computational complexity.
*

*>
*

*> * subversion/libsvn_diff/lcs.c
*

*> (svn_diff__snake_t): Added uniquematchcount member.
*

*> (svn_diff__snake): Increments uniquematchcount each time
*

*> a matching token occurs only once in each file.
*

*> (clear_fp): New method that removes nonviable states.
*

*> (svn_diff__lcs): Restarts LCS algorithm from p=0 if the number
*

*> of unique matches exceeds the number of mismatches, using
*

*> the first state where that occurs as the new start state.
*

*> ]]]
*

*>
*

*> This patch is a partial implementation of a possible enhancement to
*

*> the LCS algorithm. The patch builds on the patch in the thread
*

*> "[PATCH] Speed-up of libsvn_diff using token counts". I have only
*

*> included the lcs.c file in this patch; the other files in that other patch
*

*> must already be patched.
*

*>
*

*>
*

*> Theory behind the enhancement:
*

*> If the number of unique matches - tokens that only occur once in each
*

*> file - so far found exceeds the number of mismatches so far, and this
*

*> is the FIRST time that happens, then we are guaranteed that that
*

*> state is part of the true LCS, since eliminating some number of the
*

*> mismatches will necessarily add more new mismatches among the
*

*> current unique matches. We can thus "start over" from the current
*

*> state and thus stop the build-up of computational complexity
*

*> represented by the 'p' variable. By resetting also the number
*

*> of unique matches found to 0, we can repeat this process many
*

*> times.
*

*>
*

*> This implementation is partial in the sense that it does not reach the
*

*> full potential of the enhancement. It will still scan at least 'd' states
*

*> for each value of 'p', whereas an optimal implementation would
*

*> interrupt the scan if there's a mismatch for what would have been
*

*> a unique match (since this always make the potential LCS worse,
*

*> regardless of whether the change is an insertion or a deletion).
*

*> However, a full implementation will be significantly more complicated,
*

*> as far as I can see.
*

*>
*

*> The current implementation can reduce the average computational
*

*> complexity from O((p_max + d)*p_max) to O(sum((p_n + d)*p_n)),
*

*> while an optimal implementation would be O(sum((p_n + d_n)*p_n)).
*

*> p_max is here the total number of mismatches in the shorter file,
*

*> while p_n and d_n are the lesser number of mismatches and the
*

*> length difference between the files within each area of difference that
*

*> is separated from other areas of difference by a large number of
*

*> (not necessarily contiguous) unique matches. The current version
*

*> is only really useful if there is a similar number of mismatches in
*

*> each file, i.e. if there is a similar number of insertions and deletions
*

*> (not counting tokens that are unique to one of the files).
*

*>
*

*> Johan: I initially thought it would be easiest to give unique matches
*

*> higher weight in the diff algorithm. This would have meant a change
*

*> in the definition of the optimal diff - rather than asking for simply the
*

*> largest number of matching tokes, we would prefer diffs with slightly
*

*> fewer total matches but more unique matches. It would not be a
*

*> heuristic, since we would still be a guaranteed optimal diff given the
*

*> new optimality criterion, but it would no longer be an LCS.
*

*> However, it turned out to be easier to keep the same weights.
*

*>
*

*>
*

*> The patch will need significant changes to integrate with the idx
*

*> patch (also libsvn_diff speed-up). The current version doesn't give
*

*> much speed improvement for most of my tests so far, but can
*

*> give very large improvements for "ideal" cases (many small areas
*

*> of non-unique changes that are separated by many unique matches,
*

*> and where the files are of equal length).
*

*>
*

*> I'm not sure whether or not the current version (integrated with idx)
*

*> is worthwhile or not. I'll look into making a full implementation of the
*

*> enhancement.
*

*>
*

*>
*

*> Morten
*

*>
*

Received on 2011-06-07 22:17:10 CEST