On Tue, Jun 7, 2011 at 10:16 PM, Morten Kloster <morklo_at_gmail.com> wrote:

> On Fri, May 27, 2011 at 11:00 PM, Morten Kloster <morklo_at_gmail.com> wrote:

>> [[[
>> Speeds up LCS algorithm for well-separated sections of change
>>
>> By exploiting unique matches between the two files, it is possible to
>> "restart" the LCS algorithm at times and reset the buildup of square
>> computational complexity.
>>
>> * subversion/libsvn_diff/lcs.c
>>   (svn_diff__snake_t): Added uniquematchcount member.
>>   (svn_diff__snake): Increments uniquematchcount each time
>>    a matching token occurs only once in each file.
>>   (clear_fp): New function that removes nonviable states.
>>   (svn_diff__lcs): Restarts the LCS algorithm from p=0 if the number
>>    of unique matches exceeds the number of mismatches, using
>>    the first state where that occurs as the new start state.
>> ]]]
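
>> For orientation, the per-diagonal state struct then looks roughly
>> like this (a sketch from memory and abbreviated, not the patch
>> itself; only the uniquematchcount member is new):

>>   typedef struct svn_diff__snake_t
>>   {
>>     apr_off_t y;                        /* how far this state reached */
>>     svn_diff__lcs_t *lcs;               /* matches collected so far */
>>     svn_diff__position_t *position[2];  /* current position in each file */
>>     int uniquematchcount;               /* new: unique matches seen
>>                                            since the last restart */
>>   } svn_diff__snake_t;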

>> This patch is a partial implementation of a possible enhancement to
>> the LCS algorithm. The patch builds on the patch in the thread
>> "[PATCH] Speed-up of libsvn_diff using token counts"; I have only
>> included the lcs.c file here, so the other files from that patch
>> must already be applied.

>> Theory behind the enhancement:
>> If the number of unique matches found so far - tokens that occur only
>> once in each file - exceeds the number of mismatches so far, and this
>> is the FIRST time that happens, then that state is guaranteed to be
>> part of the true LCS, since eliminating some number of the mismatches
>> would necessarily add more new mismatches among the current unique
>> matches. We can thus "start over" from the current state and stop the
>> build-up of computational complexity represented by the 'p' variable.
>> By also resetting the number of unique matches found to 0, we can
>> repeat this process many times.
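
>> In code terms, the bookkeeping amounts to something like the toy
>> program below (an illustration only, not the patch code: the patch
>> keeps these counts per state inside svn_diff__lcs rather than on a
>> single linear walk):

>>   #include <stdio.h>
>>
>>   enum step { MISMATCH, MATCH, UNIQUE_MATCH };
>>
>>   int main(void)
>>   {
>>     /* A made-up alignment; plain matches affect neither counter. */
>>     enum step steps[] = { MISMATCH, UNIQUE_MATCH, MATCH, MISMATCH,
>>                           UNIQUE_MATCH, UNIQUE_MATCH, MISMATCH };
>>     int p = 0;                 /* mismatches since the last restart */
>>     int uniquematchcount = 0;  /* unique matches since last restart */
>>     size_t i;
>>
>>     for (i = 0; i < sizeof(steps) / sizeof(steps[0]); i++)
>>       {
>>         if (steps[i] == MISMATCH)
>>           p++;
>>         else if (steps[i] == UNIQUE_MATCH)
>>           uniquematchcount++;
>>
>>         if (uniquematchcount > p)
>>           {
>>             /* First state where the unique matches outnumber the
>>                mismatches: it is guaranteed to lie on the true LCS,
>>                so restart here and reset both counters. */
>>             printf("restart after step %d (p was %d)\n", (int)i, p);
>>             p = 0;
>>             uniquematchcount = 0;
>>           }
>>       }
>>     return 0;
>>   }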

>> This implementation is partial in the sense that it does not reach the
>> full potential of the enhancement. It will still scan at least 'd'
>> states for each value of 'p', whereas an optimal implementation would
>> interrupt the scan on a mismatch for what would have been a unique
>> match (since such a mismatch always makes the potential LCS worse,
>> regardless of whether the change is an insertion or a deletion).
>> However, as far as I can see, a full implementation would be
>> significantly more complicated.

>> The current implementation can reduce the average computational
>> complexity from O((p_max + d)*p_max) to O(sum((p_n + d)*p_n)), while
>> an optimal implementation would be O(sum((p_n + d_n)*p_n)). Here,
>> p_max is the total number of mismatches in the shorter file, while
>> p_n and d_n are the lesser number of mismatches and the length
>> difference between the files within each area of difference that is
>> separated from the other areas of difference by a large number of
>> (not necessarily contiguous) unique matches. The current version is
>> only really useful if there is a similar number of mismatches in each
>> file, i.e. a similar number of insertions and deletions (not counting
>> tokens that are unique to one of the files).
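
>> To put hypothetical numbers on that: if two equal-length files differ
>> in 10 well-separated areas, each with p_n = 10 mismatches and d = 0,
>> then p_max = 100, so the old cost grows like (100 + 0)*100 = 10000,
>> while the restarted version grows like 10*((10 + 0)*10) = 1000 - and
>> the gap widens as the number of well-separated areas increases.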

>> Johan: I initially thought it would be easiest to give unique matches
>> higher weight in the diff algorithm. This would have meant a change in
>> the definition of the optimal diff - rather than simply asking for the
>> largest number of matching tokens, we would prefer diffs with slightly
>> fewer total matches but more unique matches. It would not be a
>> heuristic, since the result would still be a guaranteed optimal diff
>> under the new optimality criterion, but it would no longer be an LCS.
>> However, it turned out to be easier to keep the same weights.

>> The patch will need significant changes to integrate with the idx
>> patch (also a libsvn_diff speed-up). The current version doesn't give
>> much speed improvement in most of my tests so far, but it can give
>> very large improvements in "ideal" cases: many small areas of
>> non-unique changes separated by many unique matches, in files of
>> equal length.

>> I'm not sure whether the current version (integrated with idx) is
>> worthwhile. I'll look into making a full implementation of the
>> enhancement.

>> Morten

> Here's an updated patch based on the current HEAD, with one or two
> minor improvements. While, as stated, the patch can give many-fold
> speed increases in ideal cases, it also slows down the LCS algorithm
> significantly for poorly matching files; I saw about an 18% increase
> in running time on my system for files that were very poor matches
> but had no lines unique to either file.

Hi Morten,

Hmmm, that sounds like a lot of overhead in some cases (I actually
wonder why -- I haven't fully understood the implementation, but at
first sight it doesn't seem that computationally intensive). But the
most important question that's bugging me: do those "ideal cases"
occur often in real-world examples?

I'm sure you could craft good examples, but how likely am I to benefit
from this in an average repository, given all the other optimizations
that already take away a lot of the work (prefix/suffix elimination;
elimination of non-matching lines due to token counting; ...)?

Can you give examples from Subversion's own repository, or other
public repositories, where it can have a big impact?

Or are you looking at this for some specific purpose - some repository
where this situation occurs regularly and has a big impact (huge
LCSes ...)?

--
Johan
