On Tue, Aug 24, 2010 at 11:11 AM, Philip Martin
> Johan Corveleyn <jcorvel_at_gmail.com> writes:
>> On Sun, Aug 22, 2010 at 4:02 PM, Branko Čibej <brane_at_xbc.nu> wrote:
>>> svn_diff uses basically the same algorithm as GNU diff but implemented
>>> slightly differently and IIRC it doesn't have some of GNU diff's
>>> optimizations. I'm sure it can be speeded up, but haven't a clue about
>>> how much.
>> Ok, thanks. In the meantime I saw that there is not that much
>> difference anymore between GNU diff and svn_diff, after running the
>> latter from a release build, and disabling my anti-virus (which makes
>> me wonder why my anti-virus slows down svn_diff (impact when opening
>> the modified datasource), but not on GNU diff).
> The big difference between Subversion's diff and GNU diff is that GNU
> uses heuristics to cut short the diff algorithm whereas Subversion
> grinds on to the end. Certain files trigger the heuristics and then
> GNU diff is much faster than Subversion. Typically the file has a
> large number of matches and non-matches repeated through the file,
> machine generated files sometimes fit this pattern.
> GNU diff's heuristics work well so when they trigger the resulting
> diff is usually good enough. They can be disabled using the --minimal
> option and using that makes GNU diff performance much more like
Hmm, I thought harder about this. If I try --minimal with GNU diff,
it's just as fast, still significantly faster than svn diff.
Then I read a little bit more about diff algorithms, like the blog
entry that Greg Hudson suggested in the other thread , about the
Patience Diff algorithm . In there the following two lines caught
my eye, first steps in the diff algo:
1) Match the first lines of both if they're identical, then match the
second, third, etc. until a pair doesn't match.
2) Match the last lines of both if they're identical, then match the
next to last, second to last, etc. until a pair doesn't match.
(then the longest common subsequence algorithm kicks in)
Now, these steps are not specific to Patience Diff, they are common to
most diff algorithms. And I bet it's a much more efficient to exclude
the common prefix/suffix this way, than to include them into the
entire lcs algorithm.
Right now, svn diff doesn't do this AFAICS. However, GNU diff does,
and I think that's where most of the performance difference comes
from. For big files, where a couple of lines are modified somewhere in
the middle (and the first 30000 and last 30000 lines are identical), I
think this can make a big difference.
If no-one else fancies this, I think I'll take a stab at implementing
this optimization (first with a POC to see if it's significant).
Unless feedback tells me it's not worth it, not a good idea, ...
Received on 2010-09-05 01:54:08 CEST