
Re: Looking to improve performance of svn annotate

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Tue, 17 Aug 2010 15:26:33 +0200

On Thu, Aug 12, 2010 at 5:30 PM, Greg Hudson <ghudson_at_mit.edu> wrote:
> On Thu, 2010-08-12 at 10:57 -0400, Julian Foad wrote:
>> I'm wary of embedding any client functionality in the server, but I
>> guess it's worth considering if it would be that useful.  If so, let's
>> take great care to ensure it's only lightly coupled to the core server
>> logic.
>
> Again, it's possible that binary diffs between sequential revisions
> could be used for blame purposes (not the binary deltas we have now, but
> edit-stream-style binary diffs), which would decouple the
> line-processing logic from the server.
>
> (But again, I haven't thought through the problem in enough detail to be
> certain.)

If such edit-stream-style binary diffs could do the job, and they are
"fast enough" (I'm guessing that line-based vs. binary wouldn't make
much of a difference for the eventual blame processing), it seems
like a good compromise: we get the performance benefits of
blame-oriented deltas (supposedly fast and easy to calculate blame
info from), possibly cached on the server, without introducing
unnecessary coupling of the server to line-processing logic.

Greg, could you explain a bit more what you mean by
"edit-stream-style binary diffs", vs. the binary deltas we have now?
Could you perhaps give an example similar to Julian's? Wouldn't you
have the same problem with pieces of the source text being copied
out of order (100 bytes from the end/middle of the source being copied
to the beginning of the target, followed by the rest of the source)?
Wouldn't you also have to do the work of discovering the largest
contiguous block of source text as "the main stream", and so determine
that those first 100 bytes are to be interpreted as new bytes, etc.?
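
To make the out-of-order concern concrete, here is a minimal Python
sketch. It assumes a hypothetical copy/insert delta (the op tuples and
the names apply_delta and main_stream are illustrative only, not
Subversion's svndiff format or API). It applies such a delta and then
picks the heaviest chain of copies that stay in source order; any copy
outside that chain would have to be treated as new bytes for blame.

```python
# Hypothetical copy/insert delta, NOT Subversion's actual svndiff format.
# ("copy", offset, length) copies bytes from the source;
# ("new", data) inserts literal bytes.

def apply_delta(source, ops):
    """Rebuild the target by replaying the ops against the source."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += source[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


def main_stream(ops):
    """Return indices of the copy ops forming the heaviest in-order chain.

    A chain is in order when each copy starts at or after the end of the
    previous one in the source. Copies outside the chain would have to be
    treated as new bytes for blame purposes.
    """
    copies = [(i, op[1], op[2]) for i, op in enumerate(ops) if op[0] == "copy"]
    best = []  # best[k] = (total bytes, op indices) for the chain ending at copies[k]
    for k, (i, offset, length) in enumerate(copies):
        weight, chain = length, [i]
        for j in range(k):
            _, prev_offset, prev_length = copies[j]
            if prev_offset + prev_length <= offset and best[j][0] + length > weight:
                weight, chain = best[j][0] + length, best[j][1] + [i]
        best.append((weight, chain))
    return max(best, default=(0, []))[1]


source = b"A" * 900 + b"B" * 100
# Target: the last 100 bytes of the source first, then the first 900.
ops = [("copy", 900, 100), ("copy", 0, 900)]
target = apply_delta(source, ops)
kept = main_stream(ops)  # only the 900-byte copy stays in the main stream
```

In this example only the second op survives as "the main stream", so
the leading 100 bytes would be blamed on the new revision even though
they existed before. The chain search here is quadratic in the number
of copy ops, which is fine for a sketch but not for real use.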

Caching this stuff on the server would of course be ideal. Whether
that happens post-commit or on demand (the first person requesting the
blame takes the hit), either approach seems good to me. Working on
that would be severely out of my league though :-). At least for now.

Another thing that occurred to me: since the current blame
implementation spends most of its time in "diff"
(svn_diff_file_diff_2), maybe a quick win could be to simply (?)
optimize the diff code? Or write a specialized, faster version for
blame.

In my tests with a 1.5 MB file (61,000 lines), running svn diff on it
takes about 500 ms on my machine. GNU diff is much faster (300 ms for
the first run, 72 ms on following runs). This seems to indicate that
there is much room for optimization of svn diff. Or is there something
extra that svn diff does, necessary in the svn context?
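
For anyone wanting to repeat this kind of comparison, here is a rough
Python sketch of timing a command over several runs, so that the first
(cold) run can be told apart from later (warm) ones. The helper name
time_command is hypothetical, and the Python interpreter is used only
as a stand-in command; you would substitute something like
["diff", "old.txt", "new.txt"] with your own files.

```python
import subprocess
import sys
import time


def time_command(argv, runs=3):
    """Run argv `runs` times and return the wall-clock time of each run in ms."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=False: diff tools exit with status 1 when the files differ,
        # which is not an error for timing purposes.
        subprocess.run(argv, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings


# Stand-in command; replace with the diff invocation you want to measure.
timings = time_command([sys.executable, "-c", "pass"])
for run, ms in enumerate(timings, start=1):
    print(f"run {run}: {ms:.0f} ms")
```

Note that this measures process startup and output generation as well
as the diff itself, so it is only a coarse comparison between tools.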

I have looked a little bit at the svn diff code, and saw that most of
the time is spent in the while loop inside svn_diff__get_tokens in
token.c, presumably extracting the tokens (lines) from the file(s).
Haven't looked any further/deeper. Anybody have any brilliant
ideas/suggestions? Or is this a bad idea, not worthy of further
exploration :-) ?
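
As an illustration of what such a token-extraction loop generally has
to do, here is a simplified Python sketch: read the file in chunks,
cut it into lines, checksum each line, and map equal lines to the same
small integer so the diff algorithm can compare integers instead of
text. This is an assumption about the general technique, not a
transcription of svn_diff__get_tokens; the names get_tokens and
token_ids and the use of Adler-32 are choices made for the example.

```python
import io
import zlib


def get_tokens(stream, chunk_size=4096):
    """Yield (checksum, line) per line, reading the stream in fixed-size chunks."""
    pending = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        lines = pending.split(b"\n")
        pending = lines.pop()  # trailing partial line, completed by the next chunk
        for line in lines:
            line += b"\n"
            yield zlib.adler32(line), line
    if pending:  # last line without a terminating newline
        yield zlib.adler32(pending), pending


def token_ids(stream, table):
    """Map each line to a small integer, so equal lines share the same id."""
    ids = []
    for checksum, line in get_tokens(stream):
        ids.append(table.setdefault((checksum, line), len(table)))
    return ids


table = {}  # shared between both files so ids are comparable
old_ids = token_ids(io.BytesIO(b"a\nb\nc\n"), table)
new_ids = token_ids(io.BytesIO(b"a\nx\nc"), table)
```

Every byte of both files passes through this loop before any actual
diffing starts, which is consistent with it dominating the runtime on
a large file where only a few lines changed.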

BTW, I also tested with Stefan Fuhrmann's performance branch@r985697,
just for kicks (had some trouble building it on Windows, but
eventually managed to get an svn.exe out of it). The timing of svn
diff of such a large file was about the same, so that didn't help. But
maybe the branch isn't ready for prime time just yet ...

Cheers,

-- 
Johan
Received on 2010-08-17 15:27:15 CEST

This is an archived mail posted to the Subversion Dev mailing list.
