
Re: [PATCH] Speed-up of libsvn_diff by restarts of LCS algorithm

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Mon, 27 Jun 2011 00:14:47 +0200

On Sun, Jun 26, 2011 at 11:23 AM, Morten Kloster <morklo_at_gmail.com> wrote:
> On Tue, Jun 14, 2011 at 11:12 PM, Johan Corveleyn <jcorvel_at_gmail.com> wrote:
>> On Tue, Jun 14, 2011 at 7:32 PM, Morten Kloster <morklo_at_gmail.com> wrote:
>>> On Tue, Jun 14, 2011 at 2:49 AM, Johan Corveleyn <jcorvel_at_gmail.com> wrote:
>>>> On Tue, Jun 7, 2011 at 10:16 PM, Morten Kloster <morklo_at_gmail.com> wrote:
>>>>>
>>> []
>>>>> Here's an updated patch based on the current HEAD, and with
>>>>> one or two minor improvements. While, as stated, the patch
>>>>> can give many-fold speed increases for ideal cases, it also slows
>>>>> down the LCS algorithm significantly for poorly matching files; I
>>>>> got about an 18% increase in running time on my system for files
>>>>> that were very poor matches but had no lines unique to either file.
>>>>
>>>> Hi Morten,
>>>>
>>>> Hmmm, that sounds like a lot of overhead in some cases (I actually
>>>> wonder why -- I haven't fully understood the implementation, but on
>>>> first sight it doesn't seem that computationally intensive). But the
>>>> most important question that's bugging me: do those "ideal cases"
>>>> occur often in real-world examples?
>>>>
>>>
>>> It's not surprising to get that kind of slowdown, since the patch
>>> adds an extra test inside the loops that take the bulk of the time
>>> when the files don't match.
>>
>> Indeed, that might be it :-). Just wondering though: isn't it by any
>> chance the case that, worst case, clear_fp is called too often (more
>> often than is useful, each time resetting the LCS run for only
>> marginal gain)?
>>
>> Can you verify that those 18% are spent because of those extra
>> comparisons, and not because of calls to clear_fp? IIUC, with those
>> poorly-matching files, there should never be a call to clear_fp,
>> because the unique-matching optimization never kicks in, right?
>>
>
> As you say, clear_fp is never called in this case. But there is
> actually a risk of slowdown for well-matching files due to
> calls to clear_fp (during the loop over k for p=0; clear_fp will
> clear more than it really needs to). It shouldn't be too hard
> to avoid that, though.
>
>> Also, those 18% you measured: is that the increase in
>> svn_diff_file_diff time, or in lcs time? Or is that approximately the
>> same, because the lcs time dominates because of the poor matching?
>>
>
> They are essentially the same in this case.
>
>>
>> I have two more, possibly stupid, questions on the implementation in the patch:
>>
>> - You mentioned in your theoretical explanation that you still have to
>> scan the entire d-range for every p, even after you've found a "reset
>> point". But can't you just "break" from the for-loops, whenever you
>> encounter a reset point?
>>
>
> No, I don't have to scan after I find a reset point (the loops
> just continue in post-reset mode; there's no waste there). The
> issue is that svn_diff__lcs still looks for a p=0 solution even
> after a unique possible match has failed, at which point a p=0
> solution is clearly no longer possible (and it's not restricted to
> p=0, that's just the simplest case).
>
>> - You calculated 'limit' as:
>>        limit = (d >= 0 ? 2 * p + d : 2 * p - d);
>> and made the "reset condition" be:
>>        if (fp[k].uniquematchcount > limit + k)  /* with k < 0 */
>> or:
>>        if (fp[k].uniquematchcount > limit - k)  /* with k >= 0 */
>>
>> I don't fully understand. Can you explain a bit more why this is the
>> right value for 'limit', and how this is the correct "reset
>> condition"?
>>
>
> Those limits are actually overkill. The correct condition is that the
> number of unique matches must be larger than (or equal to, I guess)
> the number of mismatches that a different alignment of the files
> could conceivably repair. I have here used the total number of
> mismatches in BOTH files, but the larger of the two files' mismatch
> counts would be sufficient.

Hi Morten,

Thanks for explaining.
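
Just to check that I understand the tighter bound: with d the length
difference, p mismatches in the shorter file imply p + |d| mismatches
in the longer one, so an alternative alignment can never repair more
than p + |d| mismatches from a single file. If that's right, the
limit could shrink to something like this (keeping the names from
your patch; an untested sketch, assuming the two reset conditions
stay exactly as they are):

        /* larger single-file mismatch count, p + |d|, instead of
           the two-file total 2 * p + |d| the patch currently uses */
        limit = (d >= 0 ? p + d : p - d);
        if (fp[k].uniquematchcount > limit + k)  /* with k < 0 */
or:
        if (fp[k].uniquematchcount > limit - k)  /* with k >= 0 */

When d is small compared to p that roughly halves the threshold, so
the reset would kick in about twice as early for well-matching files.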

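And about svn_diff__lcs still probing for a p=0 solution after a
unique possible match has failed: would something like starting the
p loop higher be the kind of fix you have in mind? Each failed unique
match forces at least one mismatch in each file, so (with a
hypothetical counter name, and assuming such a count can be
maintained during the scan):

        /* every unique line pair that went unmatched costs at least
           one mismatch per file, so p can never end up below this */
        p = failed_unique_matches;  /* instead of starting at p = 0 */
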
Maybe you can start a feature branch for this optimization (and maybe
further LCS/diff optimizations). That would give it some more
visibility, and would allow for some more experimentation and
discussion. Now that you have your account with (partial) commit
access...

Speaking of which: have you added your name to the COMMITTERS file
yet? You should probably do that first.
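
PS: About those 18%: if it really is just the extra compare, then for
poorly matching files that's one extra test per (p, k) iteration
(about d + 2*p + 1 of them for each p), while clear_fp never fires.
Roughly this shape, if I read the patch correctly (simplified, not
your actual loop structure):

        for (k = -p; k <= d + p; k++)  /* taking d >= 0 for simplicity */
          {
            /* ... usual furthest-point update and snake ... */

            /* the test the patch adds; never true for poorly
               matching files, but still a compare and a branch on
               every pass through the hottest loop */
            if (fp[k].uniquematchcount > (k < 0 ? limit + k : limit - k))
              {
                /* ... reset the LCS run from the unique match ... */
              }
          }

If so, I wonder whether checking it only once every few iterations
would win most of that time back.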

-- 
Johan