Re: svn diff optimization to make blame faster?

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Wed, 18 Aug 2010 15:44:07 +0200

On Wed, Aug 18, 2010 at 12:49 PM, Stefan Sperling <stsp_at_elego.de> wrote:
> On Wed, Aug 18, 2010 at 12:59:21AM +0200, Johan Corveleyn wrote:
>> Hi devs,
>>
>> While "Looking to improve performance of svn annotate" [1], I found
>> that the current blame algorithm is mainly client-side bound, and that
>> most of its time is spent on "svn diff" (calls to svn_diff_file_diff_2
>> from add_file_blame in blame.c). Apart from avoiding to build
>> full-texts and diffing them altogether (which is subject of further
>> discussion in [1]), I'm wondering if optimization of "svn diff"
>> wouldn't also be an interesting way to improve the speed of blame.
>>
>> So the main question is: is it worth it to spend time to analyze this
>> further and try to improve performance? Or has this already been
>> optimized in the past, or is it simply already as optimal as it can
>> get? I have no idea really, so if anyone can shed some light ...
>>
>> Gut feeling tells me that there must be room for optimization, since
>> GNU diff seems a lot faster than svn diff for the same large file
>> (with one line changed) on my machine [1]. But maybe svn's diff
>> algorithm is purposefully different (better? more accurate? ...) than
>> GNU's, or there are specific things in the svn context so svn diff has
>> to do more work.
>>
>> Any thoughts?
>
> Can you show a profiler run that illustrates where the client is
> spending most of its time during diff? That would probably help with
> getting opinions from people, because it saves them from spending time
> doing this research themselves.
> You've already hinted at svn_diff__get_tokens() in another mail, but
> a real profiler run would show more candidates.

Sorry, I'm still very much a beginner in C (I've been programming for
10 years, but mainly in Java). Especially the tooling is a steep part
of the learning curve :-), so I don't know (how to use) a profiler
yet. Any suggestions? I'm on Windows (XP), using VCE 2008, and used
cygwin to compare with GNU diff.

For the time being, I've helped myself using "apr_time_now()" from
apr_time.h and printf statements. Not terribly accurate and probably
somewhat overheadish, but it gives me some hints about the
bottlenecks.

FWIW, below is the output of a recent run with my instrumented trunk
build. I've instrumented svn_diff_diff in diff.c and
svn_diff__get_tokens in token.c. I checked out bigfile.xml from a
repository, and edited a single line of it in my working copy. The two
statements "Starting diff" and "Diff finished" are the first and last
statements inside the svn_diff_diff function. These are numbers from
the second run (any following runs show approximately the same
numbers).

$ time svn diff bigfile.xml
Starting diff ... (entered svn_diff_diff in diff.c)
  -> calling svn_diff__get_tokens for svn_diff_datasource_original
      == svn_diff__get_tokens datasource_open: 0 usec
      == svn_diff__get_tokens while loop: 265625 usec
         calls to datasource_get_next_token: 62500 usec
                svn_diff__tree_insert_token: 171875 usec
                                       rest: 15625 usec
  <- done: 281250 usec
  -> calling svn_diff__get_tokens for svn_diff_datasource_modified
      == svn_diff__get_tokens datasource_open: 234375 usec
      == svn_diff__get_tokens while loop: 312500 usec
         calls to datasource_get_next_token: 62500 usec
                svn_diff__tree_insert_token: 171875 usec
                                       rest: 46875 usec
  <- done: 562500 usec
  -> calling svn_diff__lcs
  <- done: 0 usec
  -> calling svn_diff__diff
  <- done: 0 usec
Diff finished in 875000 usec (875 millis)
Index: bigfile.xml
===================================================================
[snip]

real 0m1.266s
user 0m0.015s
sys 0m0.031s

Some observations:
- svn_diff__get_tokens takes most of the time
- for some reason, in the case of datasource_modified, the call to
datasource_open takes a long time (234 ms). In case of
datasource_original, it's instantaneous. Maybe because of translation,
... of the pristine file? But I'd think that would be the original??
- apart from that, most of the time goes to the while loop in
svn_diff__get_tokens.
- Inside the while loop, most time is spent on calls to
svn_diff__tree_insert_token (which compares tokens (=lines) to insert
them into some tree structure). Calls to datasource_get_next_token
also take some time.

I'm not too sure of the accuracy of those last numbers with my simple
profiling method, because it's the accumulated time of calls inside a
while loop (with 61000 iterations).

For completeness, the same diff with /usr/bin/diff (under cygwin), of
the edited bigfile.xml vs. the original, as of the second diff run:

$ ls -l settings.xml
-rwxr-xr-x+ 1 User None 1447693 2010-08-17 14:20 bigfile.xml
$ time diff bigfile.xml bigfile_orig.xml
[snip]

real 0m0.078s
user 0m0.046s
sys 0m0.046s

Cheers,

-- 
Johan

Received on 2010-08-18 15:44:47 CEST

This message: [ Message body ]
Next message: Daniel Shahaf: "Re: svn commit: r106 - trunk/www: . cn-project-pages/snippets"
Previous message: Hyrum K. Wright: "Re: Webpage showing 1.7 status?"
In reply to: Stefan Sperling: "Re: svn diff optimization to make blame faster?"
Next in thread: Johan Corveleyn: "Re: svn diff optimization to make blame faster?"
Reply: Johan Corveleyn: "Re: svn diff optimization to make blame faster?"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]