[[[

Speeds up LCS algorithm for well-separated sections of change

By exploiting unique matches between the two files, it is possible to

"restart" the LCS algorithm at times and reset the buildup of square

computational complexity.

* subversion/libsvn_diff/lcs.c

(svn_diff__snake_t): Added uniquematchcount member.

(svn_diff__snake): Increments uniquematchcount each time

a matching token occurs only once in each file.

(clear_fp): New method that removes nonviable states.

(svn_diff__lcs): Restarts LCS algorithm from p=0 if the number

of unique matches exceeds the number of mismatches, using

the first state where that occurs as the new start state.

]]]

This patch is a partial implementation of a possible enhancement to

the LCS algorithm. The patch builds on the patch in the thread

"[PATCH] Speed-up of libsvn_diff using token counts". I have only

included the lcs.c file in this patch; the other files in that other patch

must already be patched.

Theory behind the enhancement:

If the number of unique matches - tokens that only occur once in each

file - so far found exceeds the number of mismatches so far, and this

is the FIRST time that happens, then we are guaranteed that that

state is part of the true LCS, since eliminating some number of the

mismatches will necessarily add more new mismatches among the

current unique matches. We can thus "start over" from the current

state and thus stop the build-up of computational complexity

represented by the 'p' variable. By resetting also the number

of unique matches found to 0, we can repeat this process many

times.

This implementation is partial in the sense that it does not reach the

full potential of the enhancement. It will still scan at least 'd' states

for each value of 'p', whereas an optimal implementation would

interrupt the scan if there's a mismatch for what would have been

a unique match (since this always make the potential LCS worse,

regardless of whether the change is an insertion or a deletion).

However, a full implementation will be significantly more complicated,

as far as I can see.

The current implementation can reduce the average computational

complexity from O((p_max + d)*p_max) to O(sum((p_n + d)*p_n)),

while an optimal implementation would be O(sum((p_n + d_n)*p_n)).

p_max is here the total number of mismatches in the shorter file,

while p_n and d_n are the lesser number of mismatches and the

length difference between the files within each area of difference that

is separated from other areas of difference by a large number of

(not necessarily contiguous) unique matches. The current version

is only really useful if there is a similar number of mismatches in

each file, i.e. if there is a similar number of insertions and deletions

(not counting tokens that are unique to one of the files).

Johan: I initially thought it would be easiest to give unique matches

higher weight in the diff algorithm. This would have meant a change

in the definition of the optimal diff - rather than asking for simply the

largest number of matching tokes, we would prefer diffs with slightly

fewer total matches but more unique matches. It would not be a

heuristic, since we would still be a guaranteed optimal diff given the

new optimality criterion, but it would no longer be an LCS.

However, it turned out to be easier to keep the same weights.

The patch will need significant changes to integrate with the idx

patch (also libsvn_diff speed-up). The current version doesn't give

much speed improvement for most of my tests so far, but can

give very large improvements for "ideal" cases (many small areas

of non-unique changes that are separated by many unique matches,

and where the files are of equal length).

I'm not sure whether or not the current version (integrated with idx)

is worthwhile or not. I'll look into making a full implementation of the

enhancement.

Morten

Received on 2011-05-27 23:01:31 CEST