[[[
Speeds up LCS algorithm for well-separated sections of change
By exploiting unique matches between the two files, it is possible to
"restart" the LCS algorithm at times and reset the buildup of square
computational complexity.
* subversion/libsvn_diff/lcs.c
(svn_diff__snake_t): Added uniquematchcount member.
(svn_diff__snake): Increments uniquematchcount each time a match
is found on a token that occurs exactly once in each file.
(clear_fp): New function that removes nonviable states.
(svn_diff__lcs): Restarts LCS algorithm from p=0 if the number
of unique matches exceeds the number of mismatches, using
the first state where that occurs as the new start state.
]]]
This patch is a partial implementation of a possible enhancement to
the LCS algorithm. The patch builds on the patch in the thread
"[PATCH] Speed-up of libsvn_diff using token counts". I have only
included the lcs.c file in this patch; the other files in that other patch
must already be patched.
Theory behind the enhancement:
If the number of unique matches (tokens that occur exactly once in
each file) found so far exceeds the number of mismatches found so far,
and this is the FIRST time that happens, then that state is guaranteed
to be part of the true LCS, since eliminating some number of the
mismatches would necessarily add more new mismatches among the
current unique matches. We can therefore "start over" from the
current state and stop the build-up of computational complexity
represented by the 'p' variable. By also resetting the number of
unique matches found to 0, we can repeat this process many times.
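To make the idea concrete, here is a minimal standalone sketch of the
O(NP) loop (Wu/Manber/Myers/Miller) that svn_diff__lcs() is based on,
with a comment marking where the proposed uniquematchcount check and
restart would go. This is not the actual lcs.c code; the names,
structure and toy input are invented for illustration only.

[[[
/* Illustrative sketch of the O(NP) LCS loop; not the real lcs.c code. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char *A, *B;   /* A is the shorter sequence */
static int M, N;

/* Follow diagonal k from row y as long as the tokens match;
   return the furthest y reached. */
static int snake(int k, int y)
{
  int x = y - k;
  while (x < M && y < N && A[x] == B[y])
    {
      x++;
      y++;
    }
  return y;
}

static int imax(int a, int b) { return a > b ? a : b; }

int main(void)
{
  A = "CBABAC";             /* shorter sequence */
  B = "ABCABBA";            /* longer sequence (N >= M assumed) */
  M = (int)strlen(A);
  N = (int)strlen(B);

  int d = N - M;            /* length difference between the files */
  int size = M + N + 3;
  int *fp = malloc(size * sizeof(*fp));
  int off = M + 1;          /* offset so diagonals -(M+1)..(N+1) are valid */
  int k, p = -1;

  for (k = 0; k < size; k++)
    fp[k] = -1;

  do
    {
      p++;                  /* p counts mismatches in the shorter file */
      for (k = -p; k < d; k++)
        fp[k + off] = snake(k, imax(fp[k - 1 + off] + 1, fp[k + 1 + off]));
      for (k = d + p; k > d; k--)
        fp[k + off] = snake(k, imax(fp[k - 1 + off] + 1, fp[k + 1 + off]));
      fp[d + off] = snake(d, imax(fp[d - 1 + off] + 1, fp[d + 1 + off]));
      /* Proposed enhancement: snake() would also count matches on tokens
       * that occur exactly once in each file.  As soon as that count
       * exceeds the number of mismatches seen so far, the nonviable fp
       * entries would be cleared and this loop restarted from p = 0,
       * using the current state as the new start state. */
    }
  while (fp[d + off] != N);

  /* edit distance = d + 2*p;  LCS length = (M + N - (d + 2*p)) / 2 */
  printf("p = %d, LCS length = %d\n", p, (M + N - (d + 2 * p)) / 2);
  free(fp);
  return 0;
}
]]]

For the two toy strings above this prints p = 2 and an LCS length of 4.
The quadratic behaviour comes from each increment of p rescanning
roughly d + 2*p diagonals; the restart is what keeps p small within
each well-separated area of change.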
This implementation is partial in the sense that it does not reach the
full potential of the enhancement. It will still scan at least 'd' states
for each value of 'p', whereas an optimal implementation would
interrupt the scan if there's a mismatch for what would have been
a unique match (since such a mismatch always makes the potential LCS
worse, regardless of whether the change is an insertion or a deletion).
However, as far as I can see, a full implementation would be
significantly more complicated.
The current implementation can reduce the average computational
complexity from O((p_max + d)*p_max) to O(sum((p_n + d)*p_n)),
while an optimal implementation would be O(sum((p_n + d_n)*p_n)).
Here, p_max is the total number of mismatches in the shorter file,
while p_n and d_n are, respectively, the smaller number of mismatches
and the length difference between the files within each area of
difference that is separated from the other areas of difference by a
large number of (not necessarily contiguous) unique matches. The
current version
is only really useful if there is a similar number of mismatches in
each file, i.e. if there is a similar number of insertions and deletions
(not counting tokens that are unique to one of the files).
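As a made-up numerical illustration of what that difference can mean
(the numbers below are invented for the example, not measurements):

[[[
/* Hypothetical case: 10 well-separated areas of change, each with
 * p_n = 100 mismatches in the shorter file, equal file lengths (d = 0),
 * so p_max = 1000 in total. */
#include <stdio.h>

int main(void)
{
  long p_max = 1000, p_n = 100, d = 0, areas = 10;
  long old_cost = (p_max + d) * p_max;          /* ~1,000,000 states */
  long new_cost = areas * (p_n + d) * p_n;      /* ~100,000 states   */
  printf("without restarts ~ %ld, with restarts ~ %ld\n",
         old_cost, new_cost);
  return 0;
}
]]]

In that hypothetical case the restarts would reduce the number of
scanned states by roughly an order of magnitude.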
Johan: I initially thought it would be easiest to give unique matches
higher weight in the diff algorithm. This would have meant a change
in the definition of the optimal diff - rather than simply asking for
the largest number of matching tokens, we would prefer diffs with
slightly fewer total matches but more unique matches. It would not be
a heuristic, since the result would still be a guaranteed optimal diff
under the new optimality criterion, but it would no longer be an LCS.
However, it turned out to be easier to keep the same weights.
The patch will need significant changes to integrate with the idx
patch (also libsvn_diff speed-up). The current version doesn't give
much speed improvement for most of my tests so far, but can
give very large improvements for "ideal" cases (many small areas
of non-unique changes that are separated by many unique matches,
and where the files are of equal length).
I'm not sure whether the current version (integrated with idx) is
worthwhile. I'll look into making a full implementation of the
enhancement.
Morten