
Re: Looking to improve performance of svn annotate

From: Julian Foad <julian.foad_at_wandisco.com>
Date: Wed, 11 Aug 2010 11:04:41 +0100

On Wed, 2010-08-11, Johan Corveleyn wrote:
> Hi all,
>
> Other priorities have unfortunately kept me from focusing on the
> project of speeding up blame. But recently I've spent some time
> thinking about it, reading the other mail threads, studying the code
> and profiling a little bit. I hope I can still do something useful for
> blame, whether it be sooner or later.
>
> I will first give some updates, info and discussion, and end up with a
> couple of questions. Any input is greatly appreciated. A lot of this
> stuff is still very fuzzy to me :-).
>
> First, some profiling info (After sprinkling some printf's and some
> time-counters into blame.c):
>
> - Most of the client-side blame-time (~90%) is spent on producing the
> diffs (more specifically, the call to svn_diff_file_diff_2 in the
> add_file_blame function in blame.c).
>
> - Applying the deltas to produce full-texts in the first place accounts for ~5%.
>
> - Time spent on producing the blame info (inserting/deleting blame
> chunks) is negligible ( < 0.5%).
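
A minimal, self-contained sketch of that kind of timing instrumentation
(it uses APR's apr_time_now() around a made-up stand-in for the diff
call; it is not the actual change to blame.c):

  #include <stdio.h>
  #include <apr_general.h>
  #include <apr_time.h>

  /* Stand-in for the call being measured (svn_diff_file_diff_2 in the
     real code); here it just burns a little CPU. */
  static void work_under_test(void)
  {
    volatile long i, s = 0;
    for (i = 0; i < 10000000; i++)
      s += i;
  }

  int main(void)
  {
    apr_time_t total = 0;      /* accumulated wall-clock microseconds */
    int i;

    apr_initialize();
    for (i = 0; i < 10; i++)   /* e.g. one iteration per revision */
      {
        apr_time_t start = apr_time_now();
        work_under_test();
        total += apr_time_now() - start;
      }
    printf("accumulated time: %" APR_TIME_T_FMT " us\n", total);
    apr_terminate();
    return 0;
  }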
>
>
> Coming back with this information to the threads that Mark Phippard suggested:
>
> On Mon, Mar 22, 2010 at 11:08 PM, Mark Phippard <markphip_at_gmail.com> wrote:
> > If you haven't, you should review some of the old threads on speeding
> > up blame. Dan Berlin made a lot of improvements in the SVN 1.2
> > timeframe and learned a lot about what does NOT speed it up.
> >
> > http://svn.haxx.se/dev/archive-2005-02/0275.shtml
>
> This thread mainly ended up focusing on the "delta combiner", with the
> conclusion of replacing vdelta with xdelta. I think this was a very
> good optimization of which we now reap the benefits. If I understand
> my profiling numbers correctly, the delta combining stuff is no longer
> the biggest bottleneck (at least not on the client-side).
>
> There is, however, another interesting point/question from this
> thread; see below.
>
> > You should really search around for all of the threads he commented on
> > to get the complete picture. I think he also provided ideas on what
> > more could potentially be done in the future. Such as this one that I
> > do not recall was ever done.
> >
> > http://svn.haxx.se/dev/archive-2007-09/0459.shtml
> >
> > Maybe the thread will explain why.
>
> The thread didn't really explain why it wasn't implemented, but the
> poster said he had little time to work on it further. The suggestions
> were improvements in the processing of the blame info (optimizations
> in iterating through the linked list of blame chunks, and discussion
> about the best data structure to represent these blame
> chunks). The thread sort of ended with the question whether this would
> be a significant optimization:
> On Tue, Jan 11, 2005, Mark Benedetto King wrote:
> > I'd also be interested in profiling output from your 1750-rev file to
> > see how much time is spent in the blame_* functions themselves vs
> > the svn_diff_* functions, so that we can at least give a nod towards
> > Amdahl's Law before we run off optimizing the wrong thing.
>
> From my profiling data, it seems optimizing those blame_* functions
> would indeed not be significant. Most of the time is spent in the
> svn_diff_* functions.
>
>
> Which brings us back to the idea of the binary blame, and Julian's
> description of the algorithm. Now, some questions:
>
> 1) Dan Berlin's thread prompted an interesting response from Branko
> Čibej (http://svn.haxx.se/dev/archive-2005-02/0270.shtml):
> > If by "diff format" you mean the binary delta, then... There was a time
> > when I thought this would be possible. Now I'm not so sure. The trouble
> > is that the vdelta doesn't generate an edit stream, it generates a
> > compressed block-copy stream. Which means that you can never be 100% sure,
> > just from looking at the delta, which bytes in the target are really new
> > and which are just (offset) copies from the source. The only blocks you
> > can really be sure about are those that are represented by new data in
> > the delta (either NEW blocks or target copies that take data from NEW
> > blocks). This problem is made worse by our use of skip deltas in (at
> > least) the BDB back-end.
> >
> > I agree that it would be nice if the server could generate some sort of
> > byte-range oriented info, but I don't think it can be done just by
> > looking at the deltas. It's sad, I know...
>
> What? So this idea was considered before, but rejected? Is this
> argument still valid (e.g. with xdelta instead of vdelta)? Is there no way
> around that, does it mean this simply cannot work? Maybe Branko can
> comment on that? An example or some more detailed explanation would be
> very welcome.

Ah. An interesting point which I hadn't fully digested. I think the
gist of Branko's point is that the delta format doesn't distinguish
between

  "copy some old text because it's still here in the corresponding
location in the new version"

and

  "copy some old text, out of order, to generate text that we would
consider to be 'new' at this position in the new version".

We want to consider each piece of old text as a candidate for at most
one piece of the target text, but the delta format doesn't give us many
clues about that.

For example, let's say the repo content of the file in two revisions is:

  r1: "ONE\nTWO\nTHE REST\n"

  r2: "TWO\nTHREE\nTWO\n"

And let's say we're using this binary method with forward deltas, for
ease of thinking about it. (The principle would be the same with
reverse deltas.) The delta r1->r2 might look like (and this is just one
of many possible encodings):

  cp from r1: "TWO\nTH"
  new in r2: "RE"
  cp from r1: "E\nTWO\n"

which we can simply split into lines, attributing each line to the
newest (maximum) revision among its bytes, and get the answer we want:

  "TWO\n" @ max(1) => r1
  "THREE\n" @ max(1,2,1) => r2
  "TWO\n" @ max(1) => r1

But the delta might instead look like:

  cp from r1: "TWO\nTH"
  cp from r1: "RE"
  cp from r1: "E\nTWO\n"

With a naïve interpretation that ignores the source positions of these
copies, it would look as though none of the lines were new or modified
in r2, which is clearly not an acceptable answer.

I hadn't really appreciated this difficulty. By noticing the copy
source positions when interpreting the delta, we might be able to do
better:

  cp from r1[4:10] "TWO\nTH" => OK, this is r1 source text
  cp from r1[12:14] "RE" => "we skipped/deleted 2 source bytes"
  cp from r1[2:8] "E\nTWO\n" => "this is out of order, so is an add"

which we could store in our intermediate data structure as

  r1 "TWO\nTH"
  r2 "" # recording a deletion at this position
  r1 "RE"
  r2 "E\nTWO\n" # recorded as an addition

That's a little better, but it relies on having some robust way to
decide which copies are "out of order" and which are "getting back into
the main sequential flow of the source", bearing in mind that copy
source ranges can come from behind or ahead of the "current" position
and can overlap with other copies. It also has to account for the fact
that copies of long chunks of the source text should take priority over
short bits when deciding which revision the text belongs to.
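
For the straightforward cases, though, one heuristic would be to walk
the copies in target order, keep a cursor into the source, and treat a
copy as "in sequence" only if its source range starts at or after the
cursor. A toy, self-contained sketch of that (not Subversion code; it
deliberately ignores the overlap and "longest chunk wins" questions):

  #include <stdio.h>

  /* A simplified delta op: either new data, or a copy of 'len' bytes
     starting at 'src_offset' in the source text. */
  typedef struct { long src_offset; long len; int is_new_data; } delta_op;

  int main(void)
  {
    /* The second r1->r2 encoding above (all source copies). */
    delta_op ops[] = {
      { 4, 6, 0 },    /* cp from r1[4:10]  "TWO\nTH"  */
      { 12, 2, 0 },   /* cp from r1[12:14] "RE"       */
      { 2, 6, 0 },    /* cp from r1[2:8]   "E\nTWO\n" */
    };
    int nops = (int)(sizeof(ops) / sizeof(ops[0]));
    long cursor = 0;  /* how far the source has been consumed in order */
    int i;

    for (i = 0; i < nops; i++)
      {
        if (ops[i].is_new_data)
          printf("op %d: new data         => newer revision\n", i);
        else if (ops[i].src_offset >= cursor)
          {
            printf("op %d: in-sequence copy => keep source revision\n", i);
            cursor = ops[i].src_offset + ops[i].len;
          }
        else
          printf("op %d: backward copy     => treat as an addition\n", i);
      }
    return 0;
  }

On that input it keeps the first two copies as r1 text and flags the
third as an addition, matching the interpretation above.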

I'm not sure whether there's a feasible solution to the general problem.

> 2) A concrete problem with the algorithm as described by Julian would
> be the delete of bytes within a line. E.g. suppose one deletes a
> single character in r60 ("oops, a boolean test was inverted, let's
> delete that stray '!'"). That deleted character wouldn't be
> represented in the binary blame structure (it's simply cut out), and
> it's obviously not present in the final text. But the line on which it
> occurred was now last edited in r60.
>
> Current blame (diff-based) has no problem with this situation. It
> correctly shows r60 as the last revision for that line. But the binary
> algorithm loses information about the deleted bytes. That's a problem.
> I haven't really worked out how to handle that. Anybody have any
> ideas?

The deleted-bytes issue is easy to handle. The algorithm uses a data
structure to remember which revision is last known to have introduced
each contiguous range of bytes. For each deletion, the structure must
contain a zero-length range carrying the revision in which the bytes
that previously existed at that position were deleted, as shown in my
last example above.
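
For the stray '!' example above, the ranges around the affected line
might end up looking something like this (the surrounding text and the
revision rN are made up for illustration):

  rN  "    if ("  # unchanged since some earlier revision rN
  r60 ""          # zero-length range: the '!' was deleted here in r60
  rN  "done)"     # unchanged since rN

When the ranges are later split into lines, the line containing the
zero-length r60 entry is attributed to r60, which matches what the
current diff-based blame reports.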

- Julian
Received on 2010-08-11 12:05:27 CEST
