
Re: [PATCH] Speed-up of libsvn_diff by restarts of LCS algorithm

From: Morten Kloster <morklo_at_gmail.com>
Date: Tue, 14 Jun 2011 19:32:26 +0200

On Tue, Jun 14, 2011 at 2:49 AM, Johan Corveleyn <jcorvel_at_gmail.com> wrote:
> On Tue, Jun 7, 2011 at 10:16 PM, Morten Kloster <morklo_at_gmail.com> wrote:
>>
[]
>> Here's an updated patch based on the current HEAD, and with
>> one or two minor improvements. While, as stated, the patch
>> can give many-fold speed increases for ideal cases, it also slows
>> down the LCS algorithm significantly for poorly matching files; I
>> got about 18% increase in running time on my system for files
>> that were very poor matches but had no lines unique to either file.
>
> Hi Morten,
>
> Hmmm, that sounds like a lot of overhead in some cases (I actually
> wonder why -- I haven't fully understood the implementation, but on
> first sight it doesn't seem that computationally intensive). But the
> most important question that's bugging me: do those "ideal cases"
> occur often in real-world examples?
>

It's not surprising to get that kind of slowdown, since the patch adds
an extra test inside the loops that take the bulk of the time when the
files don't match.
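
To show what I mean by "inside the loops", here is a much-simplified
Myers-style distance loop (an illustration of the loop structure only, not
the actual lcs.c code; the restart check is just a stand-in for the test
the patch adds):

  #include <stdlib.h>

  /* Edit distance between token arrays a[0..n) and b[0..m),
   * in the style of Myers' O(ND) algorithm. */
  int
  myers_distance(const int *a, int n, const int *b, int m)
  {
    int max = n + m;
    int off = max + 1;
    int *v = calloc(2 * max + 3, sizeof(int));  /* furthest x per diagonal */
    int d, k, result = -1;

    if (!v)
      return -1;

    for (d = 0; d <= max && result < 0; d++)
      for (k = -d; k <= d; k += 2)
        {
          int x = (k == -d || (k != d && v[off + k - 1] < v[off + k + 1]))
                    ? v[off + k + 1]
                    : v[off + k - 1] + 1;
          int y = x - k;

          /* <-- an extra restart test here runs once per (d,k) step,
           *     i.e. O(D^2) times when the files match poorly */

          while (x < n && y < m && a[x] == b[y])  /* follow the snake */
            {
              x++;
              y++;
            }
          v[off + k] = x;
          if (x >= n && y >= m)
            {
              result = d;
              break;
            }
        }

    free(v);
    return result;
  }

When the files barely match, D is on the order of the file length, so even
a one-instruction test per (d,k) step stops being negligible - hence the
roughly 18% I measured.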

> I'm sure you could craft good examples, but how likely am I to have a
> benefit of this in my average repository (given all of the other
> optimizations taking away already a lot of the work (prefix/suffix
> elimination; elimination of non-matching lines due to token-counting;
> ...))?
>
> Can you give examples from subversion's own repository, or other
> public repositories, where it can have a big impact?
>

I ran it on merge.c (in libsvn_client), between revisions 1127198 (HEAD,
or close enough) and 1036419 (semi-randomly chosen to have about
the right fraction of changes relative to 1127198 - it was the first one I
looked at, and it seemed reasonable for the test). When running all of
svn_diff_file_diff, the patch was about 15% faster; the LCS algorithm
itself, however, was about 3.5 times faster - most of the total time was
spent reading the tokens from file.
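
For anyone who wants to repeat that kind of measurement, a harness along
these lines should do (a rough sketch only, not my actual test setup; the
file names are placeholders, and it assumes the public svn_diff_file_diff_2
API from svn_diff.h):

  #include <stdio.h>
  #include "apr_general.h"
  #include "apr_time.h"
  #include "svn_pools.h"
  #include "svn_diff.h"

  int
  main(void)
  {
    apr_pool_t *pool, *iterpool;
    apr_time_t start;
    int i;

    apr_initialize();
    pool = svn_pool_create(NULL);
    iterpool = svn_pool_create(pool);

    start = apr_time_now();
    for (i = 0; i < 100; i++)      /* repeat to get a stable number */
      {
        svn_diff_t *diff;
        svn_error_t *err;

        svn_pool_clear(iterpool);
        err = svn_diff_file_diff_2(&diff,
                                   "merge.c.r1036419",   /* placeholder */
                                   "merge.c.r1127198",   /* placeholder */
                                   svn_diff_file_options_create(iterpool),
                                   iterpool);
        if (err)
          {
            svn_handle_error2(err, stderr, FALSE, "diff-bench: ");
            return 1;
          }
      }
    printf("total: %" APR_TIME_T_FMT " usec\n", apr_time_now() - start);

    svn_pool_destroy(pool);
    apr_terminate();
    return 0;
  }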

The "problem" is that the files have to be quite big before the LCS
algorithm itself dominates the running time even when the files still
match reasonably well, whereas it is much easier for the LCS
algorithm to dominate the running time when the files don't match
at all, in which case this patch degrades performance.
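
To put the shape of it in a formula (using the O(ND) behaviour of a
Myers-style LCS; the constants are invented and only meant to show the
proportions):

  T_total  ≈  c_tok * (N + M)  +  c_lcs * (N + M) * D

where N and M are the line counts and D the number of differences. When
the files match well, D is small and the token-reading term wins unless
the files are huge; when they hardly match, D is on the order of N + M,
the LCS term grows quadratically and takes over - and that is exactly
where the extra test costs something.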

It is quite easy to do a quick heuristic test for whether this patch
would help or not, so we could have alternate versions of the LCS
algorithm depending on whether that test checks out. That would
make the code somewhat harder to maintain, of course.
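
Just to make that concrete, the test could be as simple as the sketch
below (my own guess at what such a dispatch might look like, not anything
from the patch; the names, the struct and the 0.5 threshold are all made
up, assuming we reuse the per-token occurrence counts that the
token-counting pass already produces):

  typedef enum { LCS_PLAIN, LCS_WITH_RESTARTS } lcs_variant_t;

  typedef struct token_info_t
  {
    int count_in_original;   /* occurrences of this token in the first file */
    int count_in_modified;   /* occurrences in the second file */
  } token_info_t;

  /* Fraction of distinct tokens that occur in both files. */
  double
  estimated_match_ratio(const token_info_t *tokens, int n_tokens)
  {
    int shared = 0, i;

    for (i = 0; i < n_tokens; i++)
      if (tokens[i].count_in_original > 0 && tokens[i].count_in_modified > 0)
        shared++;

    return n_tokens > 0 ? (double)shared / n_tokens : 1.0;
  }

  /* Only use the restart-enabled LCS when the files look like a decent
   * match; poorly matching files then never pay for the extra test.
   * The 0.5 cut-off is arbitrary and would need tuning. */
  lcs_variant_t
  choose_lcs_variant(const token_info_t *tokens, int n_tokens)
  {
    return estimated_match_ratio(tokens, n_tokens) >= 0.5
             ? LCS_WITH_RESTARTS : LCS_PLAIN;
  }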

> Or are you looking at this for some specific purpose, some repository
> where this situation occurs regularly and has a big impact (huge
> LCS'es ...)?
>

Nope, I just like to optimize stuff, and since we were already
looking at diff optimization I thought I might as well point out the
possibility.

> --
> Johan
>
Received on 2011-06-14 19:33:03 CEST
