Re: Looking to improve performance of svn annotate

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Wed, 11 Aug 2010 02:07:49 +0200

Hi all,

Other priorities have unfortunately kept me from focusing on the
project of speeding up blame. But recently I've spent some time
thinking about it, reading the other mail threads, studying the code
and profiling a little bit. I hope I can still do something useful for
blame, whether it be sooner or later.

I will first give some updates, info and discussion, and end up with a
couple of questions. Any input is greatly appreciated. A lot of this
stuff is still very fuzzy to me :-).

First, some profiling info (After sprinkling some printf's and some
time-counters into blame.c):

- Most of the client-side blame-time (~90%) is spent on producing the
diffs (more specifically, the call to svn_diff_file_diff_2 in the
add_file_blame function in blame.c).

- Applying the deltas to produce full-texts in the first place accounts for ~5%.

- Time spent on producing the blame info (inserting/deleting blame
chunks) is negligible ( < 0.5%).

Coming back with this information to the threads that Mark Phippard suggested:

On Mon, Mar 22, 2010 at 11:08 PM, Mark Phippard <markphip_at_gmail.com> wrote:
> If you haven't, you should review some of the old threads on speeding
> up blame. Dan Berlin made a lot of improvements in the SVN 1.2
> timeframe and learned a lot about what does NOT speed it up.
>
> http://svn.haxx.se/dev/archive-2005-02/0275.shtml

This thread mainly ended up focusing on the "delta combiner", with the
conclusion of replacing vdelta with xdelta. I think this was a very
good optimization of which we now reap the benefits. If I understand
my profiling numbers correctly, the delta combining stuff is no longer
the biggest bottleneck (at least not on the client-side).

There is however another interesting point/question from this thread, see below.

> You should really search around for all of the threads he commented on
> to get the complete picture. I think he also provided ideas on what
> more could potentially be done in the future. Such as this one that I
> do not recall was ever done.
>
> http://svn.haxx.se/dev/archive-2007-09/0459.shtml
>
> Maybe the thread will explain why.

The thread didn't really explain why it wasn't implemented, but the
poster said he had little time to work on it further. The suggestions
were improvements in the processing of the blame info (optimizations
in iterating through the linked list of blame chunks, and discussion
about the most optimal data structure to represent these blame
chunks). The thread sort of ended with the question whether this would
be a significant optimization:
On Tue, Jan 11, 2005, Mark Benedetto King wrote:
> I'd also be interested in profiling output from your 1750-rev file to
> see how much time is spent in the blame_* functions themselves vs
> the svn_diff_* functions, so that we can at least give a nod towards
> Amdahl's Law before we run off optimizing the wrong thing.

From my profiling data, it seems optimizing those blame_* functions
would indeed not be significant. Most of the time is spent in the
svn_diff_* functions.

Which brings us back to the idea of the binary blame, and Julian's
description of the algorithm. Now, some questions:

1) Dan Berlin's thread prompted an interesting response from Branko
ÄŒibej (http://svn.haxx.se/dev/archive-2005-02/0270.shtml):
> If by "diff format" you mean the binary delta, then... There was a time
> when I thought this would be possible. Now I'm not so sure. The trouble
> is that the vdelta doesn't generate an edit stream, it generates a
> compressed block-copy stream. Which means that can never be 100% sure,
> just from looking at the delta, which bytes in the target are really new
> and which are just (offset) copies from the source. The only blocks you
> can really be sure about are those that are represented by new data in
> the delta (either NEW blocks or target copies that take data from NEW
> blocks). This problem is made worse by our use of skip deltas in (at
> least) the BDB back-end.
>
> I agree that it would be nice if the server could generate some sort of
> byte-range oriented info, but I don't think it can be done just by
> looking at the deltas. It's sad, I know...

What? So this idea was considered before, but rejected? Is this
argument still valid (e.g. with xdelta i.s.o. vdelta)? Is there no way
around that, does it mean this simply cannot work? Maybe Branko can
comment on that? An example or some more detailed explanation would be
very welcome.

2) A concrete problem with the algorithm as described by Julian would
be the delete of bytes within a line. E.g. suppose one deletes a
single character in r60 ("oops, a boolean test was inverted, let's
delete that stray '!'"). That deleted character wouldn't be
represented in the binary blame structure (it's simply cut out), and
it's obviously not present in the final text. But the line on which it
occurred was now last edited in r60.

Current blame (diff-based) has no problem with this situation. It
correctly shows r60 as the last revision for that line. But the binary
algorithm loses information about the deleted bytes. That's a problem.
I haven't really worked out how to handle that. Anybody have any
ideas?

Cheers,

-- 
Johan

Received on 2010-08-11 02:08:31 CEST

This message: [ Message body ]
Next message: Alexey Neyman: "Re: Bug in svn_fs_paths_changed2() Python bindings?"
Previous message: Daniel Shahaf: "Re: Bikeshed: configuration override order"
Next in thread: Julian Foad: "Re: Looking to improve performance of svn annotate"
Reply: Julian Foad: "Re: Looking to improve performance of svn annotate"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]