Re: svn diff optimization to make blame faster?

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Sun, 5 Sep 2010 01:53:30 +0200

On Tue, Aug 24, 2010 at 11:11 AM, Philip Martin
<philip.martin_at_wandisco.com> wrote:
> Johan Corveleyn <jcorvel_at_gmail.com> writes:
>> On Sun, Aug 22, 2010 at 4:02 PM, Branko Čibej <brane_at_xbc.nu> wrote:
>>> svn_diff uses basically the same algorithm as GNU diff but implemented
>>> slightly differently and IIRC it doesn't have some of GNU diff's
>>> optimizations. I'm sure it can be speeded up, but haven't a clue about
>>> how much.
>>
>> Ok, thanks. In the meantime I saw that there is not that much
>> difference anymore between GNU diff and svn_diff, after running the
>> latter from a release build, and disabling my anti-virus (which makes
>> me wonder why my anti-virus slows down svn_diff (impact when opening
>> the modified datasource), but not on GNU diff).
>
> The big difference between Subversion's diff and GNU diff is that GNU
> uses heuristics to cut short the diff algorithm whereas Subversion
> grinds on to the end. Certain files trigger the heuristics and then
> GNU diff is much faster than Subversion. Typically the file has a
> large number of matches and non-matches repeated through the file,
> machine generated files sometimes fit this pattern.
>
> GNU diff's heuristics work well so when they trigger the resulting
> diff is usually good enough. They can be disabled using the --minimal
> option and using that makes GNU diff performance much more like
> Subversion.

Hmm, I thought harder about this. If I try --minimal with GNU diff,
it's just as fast, still significantly faster than svn diff.

Then I read a little bit more about diff algorithms, like the blog
entry that Greg Hudson suggested in the other thread [1], about the
Patience Diff algorithm [2]. In there the following two lines caught
my eye, first steps in the diff algo:

[[[
1) Match the first lines of both if they're identical, then match the
second, third, etc. until a pair doesn't match.
2) Match the last lines of both if they're identical, then match the
next to last, second to last, etc. until a pair doesn't match.
]]]
(then the longest common subsequence algorithm kicks in)

Now, these steps are not specific to Patience Diff, they are common to
most diff algorithms. And I bet it's a much more efficient to exclude
the common prefix/suffix this way, than to include them into the
entire lcs algorithm.

Right now, svn diff doesn't do this AFAICS. However, GNU diff does,
and I think that's where most of the performance difference comes
from. For big files, where a couple of lines are modified somewhere in
the middle (and the first 30000 and last 30000 lines are identical), I
think this can make a big difference.

Thoughts?

If no-one else fancies this, I think I'll take a stab at implementing
this optimization (first with a POC to see if it's significant).
Unless feedback tells me it's not worth it, not a good idea, ...

Cheers,

-- 
Johan
[1] http://svn.haxx.se/dev/archive-2010-08/0411.shtml
[2] http://bramcohen.livejournal.com/73318.html

Received on 2010-09-05 01:54:08 CEST

This message: [ Message body ]
Next message: Hyrum K. Wright: "Re: [PROPOSAL] WC-NG: merge NODE_DATA, WORKING_NODE and BASE_NODE into a single table (NODES)"
Previous message: Branko Čibej: "Re: Repeated SQL queries when doing 'svn st'"
Next in thread: Johan Corveleyn: "Re: svn diff optimization to make blame faster?"
Reply: Johan Corveleyn: "Re: svn diff optimization to make blame faster?"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]