
Re: [PATCH] Speed-up of libsvn_diff by restarts of LCS algorithm

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Tue, 14 Jun 2011 02:49:56 +0200

On Tue, Jun 7, 2011 at 10:16 PM, Morten Kloster <morklo_at_gmail.com> wrote:
> On Fri, May 27, 2011 at 11:00 PM, Morten Kloster <morklo_at_gmail.com> wrote:
>> [[[
>> Speeds up LCS algorithm for well-separated sections of change
>>  By exploiting unique matches between the two files, it is possible to
>>  "restart" the LCS algorithm at times and reset the buildup of
>>  quadratic computational complexity.
>>
>> * subversion/libsvn_diff/lcs.c
>>  (svn_diff__snake_t): Added uniquematchcount member.
>>  (svn_diff__snake): Increments uniquematchcount each time
>>   a matching token occurs only once in each file.
>>  (clear_fp): New function that removes nonviable states.
>>  (svn_diff__lcs): Restarts LCS algorithm from p=0 if the number
>>   of unique matches exceeds the number of mismatches, using
>>   the first state where that occurs as the new start state.
>> ]]]
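
For concreteness, a toy self-contained sketch of the counting rule named in
the log message above (the parameter names and the per-token count arrays,
which would come from the "token counts" patch, are made up for illustration
and are not the actual code):

    /* Illustrative sketch only: a match contributes to uniquematchcount
     * solely when its token occurs exactly once in each file.  The two
     * count arrays stand in for the per-token occurrence counts computed
     * by the earlier "token counts" patch; all names are hypothetical. */
    static void
    count_match(long token,
                const long *count_in_a,      /* occurrences per token, file A */
                const long *count_in_b,      /* occurrences per token, file B */
                long *uniquematchcount)
    {
      if (count_in_a[token] == 1 && count_in_b[token] == 1)
        (*uniquematchcount)++;
    }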
>>
>> This patch is a partial implementation of a possible enhancement to
>> the LCS algorithm. The patch builds on the patch in the thread
>> "[PATCH] Speed-up of libsvn_diff using token counts". I have only
>> included the lcs.c file in this patch; the other files in that other patch
>> must already be patched.
>>
>>
>> Theory behind the enhancement:
>> If the number of unique matches - tokens that occur only once in each
>> file - found so far exceeds the number of mismatches so far, and this
>> is the FIRST time that happens, then that state is guaranteed to be
>> part of the true LCS, since eliminating some number of the
>> mismatches would necessarily add more new mismatches among the
>> current unique matches. We can thus "start over" from the current
>> state and stop the build-up of computational complexity
>> represented by the 'p' variable. By also resetting the number
>> of unique matches found to 0, we can repeat this process many
>> times.
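
A minimal self-contained sketch of that restart test, with names only
loosely based on the patch (this is not the actual lcs.c code):

    /* Stand-in for the patch's svn_diff__snake_t, keeping only the field
     * that matters for the restart test; the names are illustrative. */
    struct toy_state
    {
      long uniquematchcount;   /* unique matches seen since the last restart */
    };

    /* Per the reasoning above: the first state whose unique-match count
     * exceeds p (the number of mismatches so far) must be part of the
     * true LCS, because eliminating mismatches would add more new
     * mismatches among the current unique matches.  Such a state can
     * become the new start state, with p and uniquematchcount reset
     * to 0. */
    static int
    may_restart(const struct toy_state *s, long p)
    {
      return s->uniquematchcount > p;
    }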
>>
>> This implementation is partial in the sense that it does not reach the
>> full potential of the enhancement. It will still scan at least 'd' states
>> for each value of 'p', whereas an optimal implementation would
>> interrupt the scan if there's a mismatch for what would have been
>> a unique match (since this always makes the potential LCS worse,
>> regardless of whether the change is an insertion or a deletion).
>> However, a full implementation will be significantly more complicated,
>> as far as I can see.
>>
>> The current implementation can reduce the average computational
>> complexity from O((p_max + d)*p_max) to O(sum((p_n + d)*p_n)),
>> while an optimal implementation would be O(sum((p_n + d_n)*p_n)).
>> p_max is here the total number of mismatches in the shorter file,
>> while p_n and d_n are the lesser number of mismatches and the
>> length difference between the files within each area of difference that
>> is separated from other areas of difference by a large number of
>> (not necessarily contiguous) unique matches. The current version
>> is only really useful if there is a similar number of mismatches in
>> each file, i.e. if there is a similar number of insertions and deletions
>> (not counting tokens that are unique to one of the files).
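
To attach some made-up numbers to those formulas: with d small and ten
well-separated areas of difference of p_n = 100 mismatches each, p_max is
1000, so the cost drops from roughly 1000 * 1000 = 1,000,000 to roughly
10 * (100 * 100) = 100,000 scanned states - about a tenfold reduction in
this idealized case.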
>>
>> Johan: I initially thought it would be easiest to give unique matches
>> higher weight in the diff algorithm. This would have meant a change
>> in the definition of the optimal diff - rather than asking for simply the
>> largest number of matching tokens, we would prefer diffs with slightly
>> fewer total matches but more unique matches. It would not be a
>> heuristic, since we would still get a guaranteed optimal diff given the
>> new optimality criterion, but it would no longer be an LCS.
>> However, it turned out to be easier to keep the same weights.
>>
>>
>> The patch will need significant changes to integrate with the idx
>> patch (also a libsvn_diff speed-up). The current version doesn't give
>> much speed improvement for most of my tests so far, but can
>> give very large improvements for "ideal" cases (many small areas
>> of non-unique changes that are separated by many unique matches,
>> and where the files are of equal length).
>>
>> I'm not sure whether the current version (integrated with idx)
>> is worthwhile. I'll look into making a full implementation of the
>> enhancement.
>>
>>
>> Morten
>>
> Here's an updated patch based on the current HEAD, and with
> one or two minor improvements. While, as stated, the patch
> can give many-fold speed increases for ideal cases, it also slows
> down the LCS algorithm significantly for poorly matching files; I
> got about an 18% increase in running time on my system for files
> that were very poor matches but had no lines unique to either file.

Hi Morten,

Hmmm, that sounds like a lot of overhead in some cases (I actually
wonder why -- I haven't fully understood the implementation, but at
first sight it doesn't seem that computationally intensive). But the
most important question that's bugging me: do those "ideal cases"
occur often in real-world examples?

I'm sure you could craft good examples, but how likely am I to
benefit from this in my average repository, given that the other
optimizations already take away a lot of the work (prefix/suffix
elimination; elimination of non-matching lines due to token counting;
...)?

Can you give examples from subversion's own repository, or other
public repositories, where it can have a big impact?

Or are you looking at this for some specific purpose, some repository
where this situation occurs regularly and has a big impact (huge
LCS'es ...)?

-- 
Johan
Received on 2011-06-14 02:50:50 CEST
