On Tue, Jun 14, 2011 at 7:32 PM, Morten Kloster <morklo_at_gmail.com> wrote:
> On Tue, Jun 14, 2011 at 2:49 AM, Johan Corveleyn <jcorvel_at_gmail.com> wrote:
>> On Tue, Jun 7, 2011 at 10:16 PM, Morten Kloster <morklo_at_gmail.com> wrote:
>>> Here's an updated patch based on the current HEAD, and with
>>> one or two minor improvements. While, as stated, the patch
>>> can give many-fold speed increases for ideal cases, it also slows
>>> down the LCS algorithm significantly for poorly matching files; I
>>> got about 18% increase in running time on my system for files
>>> that were very poor matches but had no lines unique to either file.
>> Hi Morten,
>> Hmmm, that sounds like a lot of overhead in some cases (I actually
>> wonder why -- I haven't fully understood the implementation, but on
>> first sight it doesn't seem that computationally intensive). But the
>> most important question that's bugging me: do those "ideal cases"
>> occur often in real-world examples?
> It's not surprising to get that kind of speed reduction, since it uses
> an extra test within the loops that take the bulk of the time when
> the files don't match.
Indeed, that might be it :-). Just wondering though: isn't it by any
chance the case that, in the worst case, clear_fp is called too often
(more often than is useful, each time resetting the LCS run for only a
small gain)?
Can you verify that those 18% are caused by those extra comparisons,
and not by calls to clear_fp? IIUC, with those poorly matching files,
there should never be a call to clear_fp, because the unique-matching
optimization never kicks in, right?
Also, those 18% you measured: is that the increase in
svn_diff_file_diff time, or in lcs time? Or is it approximately the
same because, with such poor matching, the lcs time dominates?
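Just to check that I understand where the overhead comes from, this is
roughly how I picture the hot loop with your extra test in it. It's only
a sketch, not the actual lcs.c code: the furthest-point update and the
snake-following are elided, the clear_fp call is my guess at a signature,
and only 'uniquematchcount' and the limit formula are taken from your
snippets (I'm also assuming d >= 0 for the loop bounds):

struct fp_entry {
  int y;                  /* furthest row reached on this diagonal */
  int uniquematchcount;   /* from the patch: unique matches seen so far */
};

/* One pass over the diagonals for a given p.  'fp' is assumed to be
   offset so that negative k is a valid index, as in lcs.c. */
static void
lcs_pass_sketch(struct fp_entry *fp, int d, int p)
{
  int k;
  int limit = (d >= 0 ? 2 * p + d : 2 * p - d);

  for (k = -p; k <= d + p; k++)
    {
      /* ... normal furthest-point update and snake-following ... */

      /* The extra per-iteration test from the patch.  This comparison
         runs for every k at every p, even when no reset ever fires,
         which would account for the ~18% slowdown on poor matches. */
      if (fp[k].uniquematchcount > (k < 0 ? limit + k : limit - k))
        {
          /* clear_fp(fp, k);   hypothetical: reset the earlier state */
        }
    }
}

Is that about right, or does the extra work sit somewhere else?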
I have two more, possibly stupid, questions on the implementation in the patch:
- You mentioned in your theoretical explanation that you still have to
scan the entire d-range for every p, even after you've found a "reset
point". But can't you just "break" from the for-loops, whenever you
encounter a reset point?
- You calculated 'limit' as:
limit = (d >= 0 ? 2 * p + d : 2 * p - d);
and made the "reset condition" be:
if (fp[k].uniquematchcount > limit + k) /* with k < 0 */
if (fp[k].uniquematchcount > limit - k) /* with k >= 0 */
I don't fully understand. Can you explain a bit more why this is the
right value for 'limit', and how this is the correct "reset condition"?
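(Just to make the numbers concrete for myself: with p = 2 and d = 3 that
gives limit = 2 * 2 + 3 = 7, so on diagonal k = -1 the reset would only
fire once uniquematchcount exceeds 7 + (-1) = 6, and on k = 2 once it
exceeds 7 - 2 = 5 -- assuming I'm reading the condition right.)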
>> I'm sure you could craft good examples, but how likely am I to
>> benefit from this in my average repository (given all of the other
>> optimizations already taking away a lot of the work: prefix/suffix
>> elimination, elimination of non-matching lines due to token-counting,
>> ...)?
>> Can you give examples from subversion's own repository, or other
>> public repositories, where it can have a big impact?
> I ran it on merge.c (in libsvn_client), between revisions 1127198 (HEAD,
> or close enough) and 1036419 (semi-randomly chosen to have about
> the right fraction of changes from 1127198 - it was the first one I looked
> at, and seemed reasonable for the test). When running all of
> svn_diff_file_diff, the patch was about 15% faster, however, the LCS
> algorithm itself was about 3.5 times faster; most of the time was
> spent reading the tokens from file.
That's quite neat! It's a lot more than I expected in an average case.
Maybe some more testing should be done (not saying you have to do this
per se, just in general) to see how average this is :-). But for now
it's already great to see a single real-world example like the one you
gave.
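(Doing the arithmetic on your numbers, by the way: a 3.5x faster LCS
turning into ~15% faster svn_diff_file_diff would mean the LCS step was
only about (1 - 1/1.15) / (1 - 1/3.5), i.e. roughly 18%, of the original
total time, which indeed fits your observation that reading the tokens
dominates.)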
> The "problem" is that the files have to be quite big before the LCS
> algorithm itself dominates the running time even when the files still
> match reasonably well, whereas it is much easier for the LCS
> algorithm to dominate the running time when the files don't match
> at all, in which case this patch degrades performance.
Yeah, I know :-). The token-scanning/hashing phase is still quite a
heavy factor in the diff performance (especially when the files match
reasonably well, yielding a fast LCS run, which is typically the case
for changes in source code).
Stefan Fuhrmann (stefan2) has some great ideas for optimizing the
token-scanning phase, which could help a lot for that. We had some
nice discussions about it during the Berlin Hackathon last month (but
I didn't have the chance yet to dig into it further). But that's all
definitely 1.8 or later stuff.
Also, in the same vein, I think this current patch will have to wait
for post-1.7. I like the idea a lot, and would definitely like to see
it in svn one day, but it's not such a trivial change. With 1.7 just
around the corner, we're trying to avoid changes that are not directly
related to getting 1.7 (finally) out the door :-).
> It is quite easy to do a quick heuristic test for whether this patch
> would help or not, so we could have alternate versions of the LCS
> algorithm depending on whether that test checks out. That would
> make the code somewhat harder to maintain, of course.
That might be a good idea, to really get the best of both worlds.
Though I think it would also be acceptable to optimize for the common
case (reasonably matching files, modulo all the "non-matching lines"),
at a slight cost for the rare case (files that match terribly badly,
even after skipping all the non-matching lines). Not sure right now; I
guess it depends on the cost of the added code complexity.
Definitely something to look at post-1.7 :-).
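If we did end up going for alternate versions, I'd picture something
along these lines -- purely a hypothetical sketch; the names, types and
the 10% threshold below are all made up for illustration and not taken
from your patch:

/* Hypothetical dispatch between the two LCS variants, based on a cheap
   up-front heuristic. */
typedef struct lcs_input_t lcs_input_t;   /* stand-in for the real inputs */
typedef struct lcs_t lcs_t;               /* stand-in for the result */

extern lcs_t *lcs_plain(const lcs_input_t *in);       /* current algorithm */
extern lcs_t *lcs_with_reset(const lcs_input_t *in);  /* patched algorithm */

static lcs_t *
lcs_dispatch(const lcs_input_t *in, long unique_tokens, long total_tokens)
{
  /* If hardly any tokens have unique matches, the reset can never kick
     in and the extra per-iteration test is pure overhead: keep the old
     code.  Otherwise the patched variant should win. */
  if (total_tokens > 0 && unique_tokens * 10 < total_tokens)
    return lcs_plain(in);

  return lcs_with_reset(in);
}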
>> Or are you looking at this for some specific purpose, some repository
>> where this situation occurs regularly and has a big impact (huge
>> LCS'es ...)?
> Nope, I just like to optimize stuff, and since we were already
> looking at diff optimization, I thought I might as well point out the
> possibility.
Cool :-). Me too.