On Tue, Jun 14, 2011 at 11:12 PM, Johan Corveleyn <jcorvel_at_gmail.com> wrote:
> On Tue, Jun 14, 2011 at 7:32 PM, Morten Kloster <morklo_at_gmail.com> wrote:
>> On Tue, Jun 14, 2011 at 2:49 AM, Johan Corveleyn <jcorvel_at_gmail.com> wrote:
>>> On Tue, Jun 7, 2011 at 10:16 PM, Morten Kloster <morklo_at_gmail.com> wrote:
>>>> Here's an updated patch based on the current HEAD, and with
>>>> one or two minor improvements. While, as stated, the patch
>>>> can give many-fold speed increases for ideal cases, it also slows
>>>> down the LCS algorithm significantly for poorly matching files; I
>>>> got about 18% increase in running time on my system for files
>>>> that were very poor matches but had no lines unique to either file.
>>> Hi Morten,
>>> Hmmm, that sounds like a lot of overhead in some cases (I actually
>>> wonder why -- I haven't fully understood the implementation, but at
>>> first sight it doesn't seem that computationally intensive). But the
>>> most important question that's bugging me: do those "ideal cases"
>>> occur often in real-world examples?
>> It's not surprising to get that kind of speed reduction, since the patch adds
>> an extra test within the loops that take the bulk of the time when
>> the files don't match.
> Indeed, that might be it :-). Just wondering though: it isn't by any
> chance the case that, worst case, clear_fp is called too often (more
> often than is useful, each time resetting the LCS run for only
> marginal gain)?
> Can you verify that those 18% are spent because of those extra
> comparisons, and not because of calls to clear_fp? IIUC, with those
> poorly-matching files, there should never be a call to clear_fp,
> because the unique-matching optimization never kicks in, right?
As you say, clear_fp is never called in this case. But there is
actually a risk of slowdown for well-matching files due to
calls to clear_fp (during the loop over k for p=0; clear_fp will
clear more than it really needs to). It shouldn't be too hard
to avoid that, though.
> Also, those 18% you measured: is that the increase in
> svn_diff_file_diff time, or in lcs time? Or is that approximately the
> same, because the lcs time dominates because of the poor matching?
They are essentially the same in this case.
> I have two more, possibly stupid, questions on the implementation in the patch:
> - You mentioned in your theoretical explanation that you still have to
> scan the entire d-range for every p, even after you've found a "reset
> point". But can't you just "break" from the for-loops, whenever you
> encounter a reset point?
No, I don't have to scan after I find a reset point (the loops
just continue in post-reset mode; there's no waste there). The
issue is that svn_diff__lcs still looks for a p=0 solution even after
a unique possible match has failed, even though such a solution is
clearly no longer possible: skipping a line that can only match in
one place forces at least one extra deletion. (It's not restricted
to p=0; that's just the simplest case.)
> - You calculated 'limit' as:
> limit = (d >= 0 ? 2 * p + d : 2 * p - d);
> and made the "reset condition" be:
> if (fp[k].uniquematchcount > limit + k) /* with k < 0 */
> if (fp[k].uniquematchcount > limit - k) /* with k >= 0 */
> I don't fully understand. Can you explain a bit more why this is the
> right value for 'limit', and how this is the correct "reset condition"?
Those limits are actually overkill. The correct condition is that the
number of unique matches must be larger than (or equal to, I guess)
the number of mismatches that could conceivably be corrected by
matching the files in some other way. Here I have used the total
number of mismatches in BOTH files, but the larger of the two files'
mismatch counts would already be sufficient.
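To make that concrete, the excerpt you quoted boils down to something
like the following (an illustrative restatement, not the patch itself;
clear_fp stands for the reset described earlier, and the bound is
deliberately on the safe side):

  /* At phase p on diagonal k there are at most 2*p + |d| mismatched
   * lines in the two files combined.  Once the unique matches recorded
   * for this diagonal outnumber the mismatches that any alternative
   * alignment could still repair, an alignment that skips those unique
   * matches can no longer win, so the state can safely be reset from
   * the unique match.  (The |k| term follows the quoted excerpt.) */
  limit = 2 * p + (d >= 0 ? d : -d);
  if (fp[k].uniquematchcount > limit - (k >= 0 ? k : -k))
    clear_fp(fp, k);   /* illustrative call; the real reset needs more context */

A check along these lines, done for every k at every p, is also the
kind of extra per-iteration test I mentioned above in connection with
the ~18% slowdown on poorly matching files.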
>>> I'm sure you could craft good examples, but how likely am I to
>>> benefit from this in my average repository, given all of the other
>>> optimizations that already take away a lot of the work (prefix/suffix
>>> elimination; elimination of non-matching lines due to token-counting; ...)?
>>> Can you give examples from subversion's own repository, or other
>>> public repositories, where it can have a big impact?
>> I ran it on merge.c (in libsvn_client), between revisions 1127198 (HEAD,
>> or close enough) and 1036419 (semi-randomly chosen to have about
>> the right fraction of changes from 1127198 - it was the first one I looked
>> at, and seemed reasonable for the test). When running all of
>> svn_diff_file_diff, the patch was about 15% faster, while the LCS
>> algorithm itself was about 3.5 times faster; most of the time was
>> spent reading the tokens from file.
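(In case anyone wants to repeat this kind of measurement: a minimal
harness that simply times svn_diff_file_diff_2 on two locally saved
revisions of the file could look roughly like the sketch below. The
file names are placeholders, and this times the whole diff, token
reading included.)

  #include <stdio.h>
  #include <apr_general.h>
  #include <apr_time.h>
  #include "svn_pools.h"
  #include "svn_error.h"
  #include "svn_diff.h"

  int main(void)
  {
    apr_pool_t *pool;
    svn_diff_t *diff;
    svn_diff_file_options_t *opts;
    svn_error_t *err;
    apr_time_t start;

    apr_initialize();
    pool = svn_pool_create(NULL);
    opts = svn_diff_file_options_create(pool);

    start = apr_time_now();
    /* Placeholder file names: two revisions of merge.c saved locally. */
    err = svn_diff_file_diff_2(&diff, "merge-r1036419.c",
                               "merge-r1127198.c", opts, pool);
    if (err)
      {
        svn_handle_error2(err, stderr, FALSE, "diff-timing: ");
        svn_error_clear(err);
        return 1;
      }
    printf("svn_diff_file_diff_2 took %" APR_TIME_T_FMT " usec\n",
           apr_time_now() - start);

    svn_pool_destroy(pool);
    apr_terminate();
    return 0;
  }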
> That's quite neat! It's a lot more than I expected in an average case.
> Maybe some more testing should be done (not saying you have to do this
> per se, just in general) to see how average this is :-). But for now
> it's already great to see a single real-world example like the one you
> just gave.
>> The "problem" is that the files have to be quite big before the LCS
>> algorithm itself dominates the running time even when the files still
>> match reasonably well, whereas it is much easier for the LCS
>> algorithm to dominate the running time when the files don't match
>> at all, in which case this patch degrades performance.
> Yeah, I know :-). The token-scanning/hashing phase is still quite a
> heavy factor in the diff performance (especially when the matching of
> the files is reasonably good (yielding a fast LCS-run), which is
> typically the case with changes in source code).
> Stefan Fuhrmann (stefan2) has some great ideas for optimizing the
> token-scanning phase, which could help a lot for that. We had some
> nice discussions about it during the Berlin Hackathon last month (but
> I didn't have the chance yet to dig into it further). But that's all
> definitely 1.8 or later stuff.
> Also, in the same vein, I think this current patch will have to wait
> for post-1.7. I like the idea a lot, and would definitely like to see
> it in svn one day, but it's not such a trivial change. With 1.7 just
> around the corner, we're trying to avoid changes that are not directly
> related to getting 1.7 (finally) out the door :-).
>> It is quite easy to do a quick heuristic test for whether this patch
>> would help or not, so we could have alternate versions of the LCS
>> algorithm depending on whether that test checks out. That would
>> make the code somewhat harder to maintain, of course.
> That might be a good idea, to really get the best of both worlds.
> Though I think it would also be acceptable to optimize for the common
> case (reasonably matching files (modulo all the "non-matching
> lines")), at a slight cost for the rare case (files that match
> terribly badly, even with "skipping all the non-matching lines"). Not
> sure right now; I guess it depends on the cost of the added code.
> Definitely something to look at post-1.7 :-).
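To sketch what I mean by a quick heuristic (the name and the threshold
below are made up, purely for illustration): the token counts from the
earlier elimination step make it cheap to see how many of the remaining
lines can only match in one place, so an up-front decision could look
something like this:

  /* Hypothetical helper, not part of the patch: decide up front whether
   * the unique-match reset is likely to pay off, based on how many of
   * the lines left after token-counting can only match in one place.
   * Both the name and the threshold are invented for illustration. */
  static svn_boolean_t
  use_unique_match_lcs(apr_off_t unique_lines, apr_off_t remaining_lines)
  {
    return remaining_lines > 0
           && unique_lines * 4 > remaining_lines;
  }

svn_diff__lcs could then dispatch to the variant with the reset logic
only when this returns TRUE, so the rare badly matching case would keep
its current performance.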
>>> Or are you looking at this for some specific purpose, some repository
>>> where this situation occurs regularly and has a big impact (huge
>>> LCS'es ...)?
>> Nope, I just like to optimize stuff, and since we were already
>> looking at diff optimization I thought I might as well point out the possibility.
> Cool :-). Me too.