[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: Looking to improve performance of svn annotate

From: Julian Foad <julian.foad_at_wandisco.com>
Date: Mon, 22 Mar 2010 10:51:07 +0000

On Sun, 2010-03-21, Johan Corveleyn wrote:
> Hi all,
> I'm very much looking forward to the performance (and other)
> improvements wc-ng will bring us. But there's more to svn performance
> than wc stuff :-). In our company, "svn annotate/blame/praise" is a
> known offender (people who attended subconf 2009 may remember me as
> the guy who complained about the annotate of a 2Mb file with 6000 revs
> taking over 4 hours).
> So, in the name of "make it faster, then make it faster again", and
> feeling that maybe it's time to scratch one of my own itches, I'd like
> to take a stab at this myself. I have no idea whether I'm up to this
> task (I am a total newbie at svn development), but I can try. Until
> now I have just analyzed some parts of the source code. I have setup a
> build environment on my machine (on windows), and have been able to
> run it from inside a debugger (Visual C 2008 Express Edition), setting
> breakpoints, seeing what happens, ...
> I am specifically looking at the suggestion that Julian made in the
> following comment in the issue tracker
> (http://subversion.tigris.org/issues/show_bug.cgi?id=3074#desc5):
> [[[
> My idea for client-side improvement:
> The client, while receiving binary diffs for successive revisions, could
> calculate a byte-wise annotation of the file. (It probably would not have to
> reconstruct full texts to do that.) Then, at the end, it could split the
> byte-wise annotation at the file's line boundaries and report the highest
> revision number within each line.
> ]]]

Hi Johan.

Please accept my sincere apologies for not having replied to your
private email about this. I remember discussing this with you but have
not thought more about it recently.

Let me recall what I can about that binary-blame idea.

My basic idea is an algorithm on the client that takes, as its input, a
sequence of binary diffs, one for each revision in which the file
changed, representing that incremental change. A binary diff tells
which byte ranges are added, which are unchanged and which are deleted.
The idea is to use Subversion's binary delta format which is transmitted
by the server, so that there is no need to produce full-texts and then
diff them on the client. (That format expresses adds and copies, with
deletes being represented implicitly by the lack of any add or copy for
a particular byte range, if I recall correctly.)

The algorithm applies one of these diffs at a time to an in-memory data
structure that represents which revision is associated with each byte in
the file. The representation can be a list of byte ranges covering the
whole length of the file; it doesn't need to store the actual data.

  # byte-range revision
    0:100 => r50
    100:101 => r49
    101:105 => r50
    105:1105 => r7
    1105:2000000 => r49

(Byte ranges are shown with an exclusive end point: 0:100 means byte
numbers 0 to 99 inclusive.)

To explain the algorithm a little further, let's first assume it takes
diffs starting with the oldest revision and working forward. (A
backward algorithm will usually terminate sooner, but the forward
algorithm is simpler to explain.)

The starting point is an empty range list, as there is no file:

  { }

The first incoming change is for revision 7 in which the file was
created, let't say with 1000 bytes of new text, so the range list looks

    0:1000 => r7

The next change is in r49, in which one byte is added to the beginning
and 3000000 bytes to the end, so we get:

    0:1 => r49
    1:1001 => r7
    1001:3001001 => r49

And finally r50 adds 100 bytes at the beginning, adds 4 bytes after the
one from r49, and deletes 1001105 bytes altogether in various ranges
scattered through the last big chunk, giving the final revision range
mapping show at the start of this example:

    0:100 => r50
    100:101 => r49
    101:105 => r50
    105:1105 => r7
    1105:2000000 => r49

Then, at the end, we read in the blamed file's text (just that one
revision of it), match up the byte ranges to the lines in the file, and
for each line we display the highest recorded revision number that falls
within that line.

The efficiency of this idea relies on being able quickly to apply each
incoming binary diff onto some such data structure.

A backward algorithm would be a little more complex as it also would
have to maintain a mapping between the byte ranges of the "current"
revision of the text and the byte ranges - if any - of the "blamed"
revision of the text. The backward algorithm would start with a range
list in which the whole range is marked "unknown revision" and would
successively fill in the known ranges, until no more unknown ranges are
left. I have not thought further about exactly how to implement this
but I can't see any theoretical difficulty.

I am still not completely sure whether the binary diff idea can produce
a line-wise result that looks natural and minimal, while still being
efficient. For example, I suspect there may be situations where the
binary diff will show changes within two different lines where a text
diff would have showed a change only to one line, but I am not sure.
Even if effects like this do occur, they might only happen in cases
where a human reader would agree that that is a good way to describe the
change anyway.

(It is not feasible to guarantee the exact same result as the current
blame algorithm, but that is OK because there are multiple ways of
representing some changes anyway. For example, if r49 is the three
lines { 'A', 'B', 'C' } and r50 is the four lines { 'A', 'B', 'B',
'C' }, then the blame info could be shown as { 49, 49, 50, 49 } or { 49,
50, 49, 49 }. Either way is correct.)

> I might also take a look at Bert's suggestion in
> http://subversion.tigris.org/issues/show_bug.cgi?id=3074#desc7
> (reversing the algorithm).
> If anyone could help me get started with this (more specifics about
> the above idea(s), or other ideas that could give a significant boost;
> any pointers, parts of the code where I could get started, ...), that
> would be very much appreciated.

The important first step is to do some profiling to find out how much
time is being spent -

  - on the server, calculating full texts from the stored diffs?

  - transmitting data over the network?

  - on the client, producing diffs?

  - on the client, producing blame info from diffs?

I've just re-read your message from last year
<http://subversion.tigris.org/ds/viewMessage.do?dsForumId=1065&dsMessageId=2079688> in which you report it's mainly client-side CPU bound. That's good to know. (That thread is linked from the issue.) So next, you would be well advised to profile the execution of the client code and see whether the textual diffs or some other part of the processing are taking the most time.

The clear advantage of the binary diff algorithm is that on the client
it neither calculated diffs nor even looks at the older revisions of the
file's text, so that might be a big win if it works at all.

As regards reversing the current algorithm, first you might want to find
out what is the oldest revision number that shows up in the blame of
your typical file, and compare that with all the changes in the file's
history. Is it only reporting the most recent 1% of the file's changes,
in which case the reversed algorithm could achieve a 100-fold speed-up,
or 20%, or 80%?

Anyway, I'm very glad you want to work on this. Welcome, and do not
hesitate to ask questions here or in IRC channel #svn-dev on Freenode
where many of us hang out most of the time, and please post your
investigation results, patches in progress, or anything else to get our

- Julian
Received on 2010-03-22 11:51:47 CET

This is an archived mail posted to the Subversion Dev mailing list.