On Tue, Jun 14, 2011 at 11:12 PM, Johan Corveleyn <jcorvel@gmail.com> wrote:
> On Tue, Jun 14, 2011 at 7:32 PM, Morten Kloster <morklo@gmail.com> wrote:
>> On Tue, Jun 14, 2011 at 2:49 AM, Johan Corveleyn <jcorvel@gmail.com> wrote:
>>> On Tue, Jun 7, 2011 at 10:16 PM, Morten Kloster <morklo@gmail.com> wrote:
>>>>
>> []
>>>> Here's an updated patch based on the current HEAD, and with
>>>> one or two minor improvements. While, as stated, the patch
>>>> can give many-fold speed increases for ideal cases, it also slows
>>>> down the LCS algorithm significantly for poorly matching files; I
>>>> got about 18% increase in running time on my system for files
>>>> that were very poor matches but had no lines unique to either file.
>>>
>>> Hi Morten,
>>>
>>> Hmmm, that sounds like a lot of overhead in some cases (I actually
>>> wonder why -- I haven't fully understood the implementation, but on
>>> first sight it doesn't seem that computationally intensive). But the
>>> most important question that's bugging me: do those "ideal cases"
>>> occur often in real-world examples?
>>>
>>
>> It's not surprising to get that kind of speed reduction, since it uses
>> an extra test within the loops that take the bulk of the time when
>> the files don't match.
>
> Indeed, that might be it :-). Just wondering though: it isn't by any
> chance the case that, worst case, clear_fp is called too often (more
> often than is useful, each time resetting the LCS run for only
> marginal gain)?
>
> Can you verify that those 18% are spent because of those extra
> comparisons, and not because of calls to clear_fp? IIUC, with those
> poorly-matching files, there should never be a call to clear_fp,
> because the unique-matching optimization never kicks in, right?
>
As you say, clear_fp is never called in this case. But there is
actually a risk of slowdown for well-matching files due to
calls to clear_fp (during the loop over k for p=0; clear_fp will
clear more than it really needs to). It shouldn't be too hard
to avoid that, though.
> Also, those 18% you measured: is that the increase in
> svn_diff_file_diff time, or in lcs time? Or is that approximately the
> same, because the lcs time dominates because of the poor matching?
>
They are essentially the same in this case.
>
> I have two more, possibly stupid, questions on the implementation in the patch:
>
> - You mentioned in your theoretical explanation that you still have to
> scan the entire d-range for every p, even after you've found a "reset
> point". But can't you just "break" from the for-loops, whenever you
> encounter a reset point?
>
No, I don't have to scan after I find a reset point (the loops
just continue in post-reset mode; there's no waste there). The
issue is that svn_diff__lcs still looks for a p=0 solution even after
a possible unique match has failed, at which point such a solution is
clearly no longer possible (this isn't restricted to p=0; that's just
the simplest case).
> - You calculated 'limit' as:
> limit = (d >= 0 ? 2 * p + d : 2 * p - d);
> and made the "reset condition" be:
> if (fp[k].uniquematchcount > limit + k) /* with k < 0 */
> or:
> if (fp[k].uniquematchcount > limit - k) /* with k >= 0 */
>
> I don't fully understand. Can you explain a bit more why this is the
> right value for 'limit', and how this is the correct "reset
> condition"?
>
Those limits are actually overkill. The correct condition is that the
number of unique matches must be larger than (or equal to, I guess)
the number of mismatches that could conceivably be corrected by
matching the files otherwise. Here I have used the total number of
mismatches in BOTH files, but using just the larger of the two files'
mismatch counts would be sufficient.
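
To spell that out, the same test written as a stand-alone helper would
look roughly like this (just an illustration using the names from the
snippet you quoted, not the patch's actual code):

/* Illustrative only, not the patch's actual code: the reset test in
 * isolation. 'p' is the current phase of the LCS algorithm, 'd' the
 * length difference between the two files, and 'k' the diagonal being
 * updated; unique_match_count plays the role of fp[k].uniquematchcount. */
static int
unique_matches_justify_reset(int unique_match_count, int p, int d, int k)
{
  /* 2*p + |d| is the total number of mismatched lines, in both files
   * combined, of a solution found in phase p (overkill, as noted above;
   * the larger of the two per-file counts would already do). */
  int limit = (d >= 0 ? 2 * p + d : 2 * p - d);

  /* The +k / -k term subtracts |k| from that bound for the diagonal
   * currently being updated. Once a path has accumulated more unique
   * matches than this, no alternative matching of the files can make up
   * the difference, so the earlier LCS state can safely be cleared. */
  return unique_match_count > (k < 0 ? limit + k : limit - k);
}
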
>>> I'm sure you could craft good examples, but how likely am I to have a
>>> benefit of this in my average repository (given all of the other
>>> optimizations taking away already a lot of the work (prefix/suffix
>>> elimination; elimination of non-matching lines due to token-counting;
>>> ...))?
>>>
>>> Can you give examples from subversion's own repository, or other
>>> public repositories, where it can have a big impact?
>>>
>>
>> I ran it on merge.c (in libsvn_client), between revisions 1127198 (HEAD,
>> or close enough) and 1036419 (semi-randomly chosen to have about
>> the right fraction of changes from 1127198 - it was the first one I looked
>> at, and seemed reasonable for the test). When running all of
>> svn_diff_file_diff, the patch was about 15% faster; the LCS algorithm
>> itself, however, was about 3.5 times faster. Most of the time was
>> spent reading the tokens from the file.
>
> That's quite neat! It's a lot more than I expected in an average case.
> Maybe some more testing should be done (not saying you have to do this
> per se, just in general) to see how average this is :-). But for now
> it's already great to see a single real-world example like the one you
> just gave.
>
>> The "problem" is that the files have to be quite big before the LCS
>> algorithm itself dominates the running time even when the files still
>> match reasonably well, whereas it is much easier for the LCS
>> algorithm to dominate the running time when the files don't match
>> at all, in which case this patch degrades performance.
>
> Yeah, I know :-). The token-scanning/hashing phase is still quite a
> heavy factor in the diff performance (especially when the matching of
> the files is reasonably good (yielding a fast LCS-run), which is
> typically the case with changes in source code).
>
> Stefan Fuhrmann (stefan2) has some great ideas for optimizing the
> token-scanning phase, which could help a lot for that. We had some
> nice discussions about it during the Berlin Hackathon last month (but
> I didn't have the chance yet to dig into it further). But that's all
> definitely 1.8 or later stuff.
>
> Also, in the same vein, I think this current patch will have to wait
> for post-1.7. I like the idea a lot, and would definitely like to see
> it in svn one day, but it's not such a trivial change. With 1.7 just
> around the corner, we're trying to avoid changes that are not directly
> related to getting 1.7 (finally) out the door :-).
>
>> It is quite easy to do a quick heuristic test for whether this patch
>> would help or not, so we could have alternate versions of the LCS
>> algorithm depending on whether that test checks out. That would
>> make the code somewhat harder to maintain, of course.
>
> That might be a good idea, to get really the best of both worlds.
> Though I think it would also be acceptable to optimize for the common
> case (reasonably matching files (modulo all the "non-matching
> lines")), at a slight cost for the rare case (files that are matching
> terribly bad, even with "skipping all the non-matching lines"). Not
> sure right now, I guess it depends on the cost of the added code
> complexity.
>
> Definitely something to look at post-1.7 :-).
>
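For what it's worth, such a heuristic test could look something like the
sketch below. This is only meant to illustrate the idea, not code from
the patch; the function name, its parameters and the threshold are all
made up.

#include <apr.h>

/* Hypothetical sketch, not from the patch: a cheap up-front test for
 * whether the unique-match variant of the LCS algorithm is likely to
 * pay off. 'unique_matches' would be the number of lines occurring
 * exactly once in each file, 'total_lines' the number of lines left
 * after prefix/suffix elimination; the 25% threshold is arbitrary. */
static int
use_unique_match_lcs(apr_off_t unique_matches, apr_off_t total_lines)
{
  /* Only pay for the extra per-iteration comparison when a reasonable
   * share of the remaining lines are unique matches. */
  return total_lines > 0 && unique_matches * 4 >= total_lines;
}

The diff code would then simply call the existing LCS routine when the
test fails, and the new variant when it succeeds.
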
>>> Or are you looking at this for some specific purpose, some repository
>>> where this situation occurs regularly and has a big impact (huge
>>> LCS'es ...)?
>>>
>>
>> Nope, I just like to optimize stuff, and since we were already
>> looking at diff optimization I thought I might as well point out the
>> possibility.
>
> Cool :-). Me too.
>
> --
> Johan
>