Re: Getting raw blame information

From: Mark Benedetto King <mbk_at_lowlatency.com>
Date: 2005-01-11 20:10:14 CET

On Tue, Jan 11, 2005 at 01:04:00PM +0000, Hugh Gibson wrote:
> Hi,
>
> I'm looking at ways of speeding up the "blame" output. For example, for
> one of our files which is about 300k with around 1750 total revisions in
> the repository, it takes around 2 minutes.
>

Most people find blame to be network-bound, which is why optimizing
the algorithm itself hasn't really been addressed.

Just wondering: is this a file that gets automatically appended to, or
all the deltas distributed throughout the file?

> Looking at the source code (blame.c) it appears that it gets all the
> revisions, collecting information about changed lines, then gets the final
> text and matches that up with all the blame chunks.
>
> I've noticed that performance drops as later revisions are retrieved, and
> the linked list for blame must be the problem. blame_find() does a linear
> search through the chunks, similar to blame_adjust() which has to iterate
> through all the chunks to put in an offset. So there are two O(n^2)
> processes here.
>
> A quick win would be to make the search for *last* in blame_delete_range()
> use *start* as the starting point, not db->blame.
>

I think you mean "first" (not start), but yes, that should improve
blame_delete_range().

> I wondered if a tree structure would be a lot faster (though just as bad
> in the degenerate case). Was this considered at all?
>

Yes. I figured it would be easier to get the list implementation correct
in time for the 1.0 deadline. :-)

Even the list implementation turned out to have a bug in it!

Just saying "tree" doesn't convince me, though; this list has some
fairly special invariants. Do you have a specific datastructure in
mind? If you're going to go to the trouble of thinking of one, you
might try thinking about computing blame backwards in time, rather
than forwards, since the backwards calculation could potentially
terminate early (once all lines in the blamed revision are accounted
for).

I'd also be interested in profiling output from your 1750-rev file to
see how much time is spent in the blame_* functions themselves vs
the svn_diff_* functions, so that we can at least give a nod towards
Amdahl's Law before we run off optimizing the wrong thing.

> I normally work in Python these days and found blame.py in tools/examples.
> It appears to use a different strategy - doing a full diff between each
> version, which will be a lot slower. But if I can cache information
> locally then maybe I could get away without having to obtain all the log
> information. Is there anything like this around?
>

Not that I know of, but there seems to be increasing demand for it.
I suppose it was only a matter of time (more revisions committed every
day!).

You might be able to extend SVN::Mirror or svk or somesuch in order
to incrementally pull the changes into a local blame cache. I'll bet
you the svk guys would eagerly accept your patch.

Another option would be for us to calculate the changed-line
information at commit time (or lazily), store that in the repository,
and make it available to clients, but that would be a Big Change.

--ben

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Tue Jan 11 20:21:39 2005

This message: [ Message body ]
Next message: Greg Hudson: "Re: RFC: RA API changes"
Previous message: Peter N. Lundblad: "Re: RFC: RA API changes"
In reply to: Hugh Gibson: "Getting raw blame information"
Next in thread: Hugh Gibson: "Re: Getting raw blame information"
Reply: Hugh Gibson: "Re: Getting raw blame information"

Contemporary messages sorted: [ By Date ] [ By Thread ] [ By Subject ] [ By Author ] [ By messages with attachments ]