On Sun, Sep 5, 2010 at 1:53 AM, Johan Corveleyn <jcorvel_at_gmail.com> wrote:
> On Tue, Aug 24, 2010 at 11:11 AM, Philip Martin
> <philip.martin_at_wandisco.com> wrote:
>> Johan Corveleyn <jcorvel_at_gmail.com> writes:
>>> On Sun, Aug 22, 2010 at 4:02 PM, Branko Čibej <brane_at_xbc.nu> wrote:
>>>> svn_diff uses basically the same algorithm as GNU diff but implemented
>>>> slightly differently and IIRC it doesn't have some of GNU diff's
>>>> optimizations. I'm sure it can be speeded up, but haven't a clue about
>>>> how much.
>>>
>>> Ok, thanks. In the meantime I saw that there is not that much
>>> difference anymore between GNU diff and svn_diff, after running the
>>> latter from a release build, and disabling my anti-virus (which makes
>>> me wonder why my anti-virus slows down svn_diff (impact when opening
>>> the modified datasource), but not on GNU diff).
>>
>> The big difference between Subversion's diff and GNU diff is that GNU
>> uses heuristics to cut short the diff algorithm whereas Subversion
>> grinds on to the end. Certain files trigger the heuristics and then
>> GNU diff is much faster than Subversion. Typically the file has a
>> large number of matches and non-matches repeated through the file,
>> machine generated files sometimes fit this pattern.
>>
>> GNU diff's heuristics work well so when they trigger the resulting
>> diff is usually good enough. They can be disabled using the --minimal
>> option and using that makes GNU diff performance much more like
>> Subversion.
>
> Hmm, I thought harder about this. If I try --minimal with GNU diff,
> it's just as fast, still significantly faster than svn diff.
>
> Then I read a little bit more about diff algorithms, like the blog
> entry that Greg Hudson suggested in the other thread [1], about the
> Patience Diff algorithm [2]. In there the following two lines caught
> my eye, first steps in the diff algo:
>
> [[[
> 1) Match the first lines of both if they're identical, then match the
> second, third, etc. until a pair doesn't match.
> 2) Match the last lines of both if they're identical, then match the
> next to last, second to last, etc. until a pair doesn't match.
> ]]]
> (then the longest common subsequence algorithm kicks in)
>
> Now, these steps are not specific to Patience Diff, they are common to
> most diff algorithms. And I bet it's a much more efficient to exclude
> the common prefix/suffix this way, than to include them into the
> entire lcs algorithm.
>
> Right now, svn diff doesn't do this AFAICS. However, GNU diff does,
> and I think that's where most of the performance difference comes
> from. For big files, where a couple of lines are modified somewhere in
> the middle (and the first 30000 and last 30000 lines are identical), I
> think this can make a big difference.
>
> Thoughts?
>
> If no-one else fancies this, I think I'll take a stab at implementing
> this optimization (first with a POC to see if it's significant).
> Unless feedback tells me it's not worth it, not a good idea, ...
Some update on this: I have implemented this for svn_diff (excluding
the identical prefix and suffix of both files, and only then starting
to fill up the token tree and let the lcs-agorithm to its thing). It
makes a *huge* difference. On my bigfile.xml (1.5 Mb) with only one
line changed, the call to svn_diff_diff is ~10 times faster (15-20 ms
vs. 150-170 ms).
This means blame is much faster for such big files as well, because
diff accounts for 90% of the client-side work of blame. This means
that in most cases the client cpu won't be the bottleneck anymore, so
it's more constrained by server-side and network. In my tests, blame
was about twice as fast for bigfile.xml (since all my tests were done
with client and server on the same machine, I'm guessing the server is
now the bottleneck (as well as them running on the same machine)).
Same factor of performance increase for a medium-sized file (a java
source file of 300 Kb).
The patch is still very crude, with lots of code duplication, and
everything stuffed in one big function (and I may have overlooked some
edge conditions), so I don't feel it's ready to post yet (don't want
to waste people's time :)). I'll try to clean it up as soon as I can
and post a reasonable version to the list.
Some notes already:
- I basically wrote a new function datasources_open in diff_file.c
(similar to datasource_open), that opens both files (original and
modified), and compares bytes from the beginning until the first
nonmatching byte (counting newlines when it sees them), and backwards
from the end to find the first (or last, depending how you look at it)
nonmatching byte. The number of prefix_lines is then used to determine
the starting offset of the "tokens" (lines) that will later be
extracted.
- I think I should generalize this function to take an array of
datasources, and not just two, so it can be used by diff3 and diff4 as
well. But for now I didn't want to take on that extra complication.
- The code is complicated by the fact that the svn_diff code
(diff_file.c) works with chunks of data (doesn't read the entire
file(s) in memory). So every time a pointer is advanced or receded I
have to test whether a new chunk must be read (and read it). GNU diff
doesn't have this problem (reads both files in memory).
Just to know how hard/fast I should pursue this: is there any chance
that a change like this could still make it into 1.7 (maybe even
backported to 1.6?), provided that the patch is of good quality and
all... ?
Cheers,
--
Johan
Received on 2010-09-15 14:20:46 CEST