Re: Getting raw blame information

From: Hugh Gibson <hgibson_at_cix.co.uk>
Date: 2005-01-12 11:26:15 CET

> > I'm looking at ways of speeding up the "blame" output. For example,
> > for one of our files which is about 300k with around 1750 total
> > revisions in the repository, it takes around 2 minutes.
> >
>
> Most people find blame to be network-bound, which is why optimizing
> the algorithm itself hasn't really been addressed.

Fair comment: I'll compare the time required to simply get all the log
information. However, the progress graph shows a classic O(n^2) behaviour,
with the first few 100 revisions skipping through quickly, and later
revisions taking more and more time (a few seconds each).

> Just wondering: is this a file that gets automatically appended to, or
> all the deltas distributed throughout the file?

Deltas are distributed. The file should probably be split up anyway.

> > Looking at the source code (blame.c) it appears that it gets all the
> > revisions, collecting information about changed lines, then gets the
> > final text and matches that up with all the blame chunks.
> >
> > I've noticed that performance drops as later revisions are retrieved,
> > and the linked list for blame must be the problem. blame_find() does
> > a linear search through the chunks, similar to blame_adjust() which
> > has to iterate through all the chunks to put in an offset. So there
> > are two O(n^2) processes here.
> >
> > A quick win would be to make the search for *last* in
> > blame_delete_range() use *start* as the starting point, not db->blame.
> >
>
> I think you mean "first" (not start), but yes, that should improve
> blame_delete_range().

Yes, thanks. A couple of other thoughts for quick wins on this algorithm:
if you are processing a block of changes for one revision, if they come in
the order of the lines, you could check for deletions and additions that
match the line numbers, and avoid having to do two lots of blame_adjust.
That would be the case with adding comments, correcting a bug on one line
etc. Also, you could see where the next change is and only do blame_adjust
up to that point. Then the blame_adjust for the rest of the lines would be
tweaked to include the +/- for the next change.

> > I wondered if a tree structure would be a lot faster (though just as
> > bad in the degenerate case). Was this considered at all?
> >
>
> Yes. I figured it would be easier to get the list implementation
> correct in time for the 1.0 deadline. :-)
>
> Even the list implementation turned out to have a bug in it!

Do you have unit tests for it? That would be important for testing any new
implementation.

> Just saying "tree" doesn't convince me, though; this list has some
> fairly special invariants. Do you have a specific datastructure in
> mind? If you're going to go to the trouble of thinking of one, you
> might try thinking about computing blame backwards in time, rather
> than forwards, since the backwards calculation could potentially
> terminate early (once all lines in the blamed revision are accounted
> for).

I've been playing with a few ideas. It's a challenge, isn't it? Maybe I
need to talk to the "cat" or do the washing up - I usually get my best
inspiration then.

> I'd also be interested in profiling output from your 1750-rev file to
> see how much time is spent in the blame_* functions themselves vs
> the svn_diff_* functions, so that we can at least give a nod towards
> Amdahl's Law before we run off optimizing the wrong thing.

Absolutely. It will probably turn out to be the memory allocator or string
concatenation which is the problem, or the server. If I can get you the
raw information (with file content omitted) would that do? Note that there
are 1750 total revisions in the repository, and only around 228 that apply
to this file.

> > I normally work in Python these days and found blame.py in
> > tools/examples. It appears to use a different strategy - doing a full
> > diff between each version, which will be a lot slower. But if I can
> > cache information locally then maybe I could get away without having
> > to obtain all the log information. Is there anything like this around?
> >
>
> Not that I know of, but there seems to be increasing demand for it.
> I suppose it was only a matter of time (more revisions committed every
> day!).
>
> You might be able to extend SVN::Mirror or svk or somesuch in order
> to incrementally pull the changes into a local blame cache. I'll bet
> you the svk guys would eagerly accept your patch.

Hmmm. I'll stick with standard SVN for the moment (which we are just
switching to from CVS). Maybe in the future...

> Another option would be for us to calculate the changed-line
> information at commit time (or lazily), store that in the repository,
> and make it available to clients, but that would be a Big Change.

Yes. I guess that's something like what CVS does.

One other comment: when doing blame for multiple files the output doesn't
include any information about the file that was being accessed.

Hugh

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Wed Jan 12 13:18:41 2005

This message: [ Message body ]
Next message: Nicklas Norling: "Re: mailer.py now should work on Windows - soliciting testers (furtherinformation)"
Previous message: Brian W. Fitzpatrick: "Re: svn commit: r12676 - branches/locking/subversion/libsvn_fs_fs"
In reply to: Mark Benedetto King: "Re: Getting raw blame information"
Next in thread: Mark Benedetto King: "Re: Getting raw blame information"
Reply: Mark Benedetto King: "Re: Getting raw blame information"

Contemporary messages sorted: [ By Date ] [ By Thread ] [ By Subject ] [ By Author ] [ By messages with attachments ]