[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: [RFC] Diff (blame) optimization: how to go forward

From: Hyrum K. Wright <hyrum_wright_at_mail.utexas.edu>
Date: Fri, 8 Oct 2010 13:46:34 -0500

On Thu, Oct 7, 2010 at 6:44 PM, Johan Corveleyn <jcorvel_at_gmail.com> wrote:
> Hi all,
>
> This is a follow-up to the WIP-patches I posted last week [1], which
> are about improving performance of svn_diff (and therefor also blame
> on the client-side), especially for large files.
>
> To summarize: the idea was (is) to strip off the identical prefix and
> suffix, and then letting the existing diff algorithm work on the
> remainder. As stated in [2], I ran into some difficulties to get the
> exact same result of the existing diff algorithm (because of too eager
> suffix stripping). I've spent some time searching for a "perfect"
> solution, but I can't find one. So now I'm stuck and would like to
> have some feedback on what to do with it.
>
> First: the thing works, and it can reduce the time needed for svn_diff
> by 20% up to 80% for large files (depending on the distribution of the
> changes). An extreme example is a reduction from ~150ms to ~30ms for a
> 60000-lines file with a small number of lines changed (say up to 100)
> located close together (so lots of identical prefix/suffix).
>
> Blame gets similar benefits, *if the server is fast enough*. I used
> svnserve built from Stefan^2's performance branch to test this. A
> normal svnserve with FSFS isn't really fast enough (unless maybe with
> an SSD, but I don't have one here) to serve up the 2500 revisions of
> the large file quickly enough. But the performance-enhanced-svnserve,
> with a hot cache filled with the necessary full-texts, is definitely
> fast enough to make the time of blame completely dependent on the
> client-side performance. Conclusion: the diff optimization we're
> talking about is much more useful for blame together with a server
> with the full-text membuffer caching of the performance branch.
>
> Now, the reason why I'm stuck: suffix scanning sometimes strips a
> little bit too much (see [2] for full explanation). In this case the
> resulting diff is still correct, but it's not as nice for the user. As
> I noted in [2], GNU diff also has this problem, but the user can help
> a little bit by specifying a large enough number of
> prefix/suffix-lines to keep (--horizon-lines).
>
> So, what to do?
> - I have not found a better solution than to use some form of
> "horizon-lines" to get the "nice" result, but still have a similar
> performance improvement (unless if I leave out suffix stripping
> entirely, only strip prefix, but that halves the power of the
> optimization).
>
> - I've tested with keeping up to 50 suffix-lines. It didn't have the
> slightest impact on the performance improvement (we're talking about
> stripping away 1000's of lines). This fixed all real-life examples I
> could find/test (diff/blame output is precisely identical to original
> svn). If I kept only 20 lines, I still ran into some differences (30
> lines was enough for my example file, but I took it up to 50 to be
> sure).
>
> - I would like to avoid leaving the decision to the user to come up
> with an adequate number of suffix-lines-to-keep. So I'd like to just
> hardcode some value, say 50, or even 100. I think this would give good
> (=identical to original svn) diff/blame results in all but the most
> extreme cases. With suffix-lines-to-keep=50, you'd need to insert a
> block of text that has its last 50 lines identical to the 50 lines
> preceding the insertion point, to mess up the diff result.
>
> - If really necessary, we could say default=50, but it can be
> overridden by the user with another -x option.
>
> So the main question is: is such a change in behavior (unlikely to
> have an impact in most realistic cases) acceptable for this kind of
> performance improvement?
>
> Other options:
> - Come up with a clever solution to fix the optimization properly (but
> I don't know if that's possible -> suggestions welcome :-)).
> - Throw this optimization away, and take a look at something like
> "Patience diff" (supposed to be fast as well, and give nice results).
> However, I don't know Patience diff well enough to code up something,
> or even to judge whether it would give good results in all (or most)
> cases.
> - Just throw it away and don't look back :-)
> - Something else?

My not-having-looked-deeply-at-the-problem reaction is this: if we can
introduce significant speed-ups in the common case, while still
maintaining functionality, let's do it. There may be corner cases
where somebody gets an ugly diff, but in those corner cases there are
likely to be *other* issues as play as well. Let's not penalize
everybody for the sake of a couple of corner cases.

-Hyrum
Received on 2010-10-08 20:47:15 CEST

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.