Re: Getting raw blame information

From: Mark Benedetto King <mbk_at_lowlatency.com>
Date: 2005-01-12 16:30:11 CET

On Wed, Jan 12, 2005 at 10:26:00AM +0000, Hugh Gibson wrote:
> > > I'm looking at ways of speeding up the "blame" output. For example,
> > > for one of our files, which is about 300k with around 1750 total
> > > revisions in the repository, it takes around 2 minutes.
> > >
> >
> > Most people find blame to be network-bound, which is why optimizing
> > the algorithm itself hasn't really been addressed.
>
> Fair comment: I'll compare the time required to simply get all the log
> information. However, the progress graph shows classic O(n^2) behaviour,
> with the first few hundred revisions skipping through quickly and later
> revisions taking more and more time (a few seconds each).

I believe you; I was just prefacing my response with an excuse.

[...]

> > > Looking at the source code (blame.c) it appears that it gets all the
> > > revisions, collecting information about changed lines, then gets the
> > > final text and matches that up with all the blame chunks.
> > >
> > > I've noticed that performance drops as later revisions are retrieved,
> > > and the linked list used for blame must be the problem. blame_find()
> > > does a linear search through the chunks, much like blame_adjust(),
> > > which has to iterate through all the chunks to apply an offset. So
> > > there are two O(n^2) processes here.
> > >
> > > A quick win would be to make the search for *last* in
> > > blame_delete_range() use *start* as the starting point, not db->blame.
> > >
> >
> > I think you mean "first" (not start), but yes, that should improve
> > blame_delete_range().
>
> Yes, thanks. A couple of other thoughts for quick wins on this algorithm:
> if you are processing a block of changes for one revision and they come
> in line order, you could check for deletions and additions that match the
> same line numbers, and avoid having to do two lots of blame_adjust. That
> would be the case with adding comments, correcting a bug on one line,
> etc. Also, you could see where the next change is and only run
> blame_adjust up to that point; the blame_adjust for the remaining lines
> would then be tweaked to include the +/- for the next change.

They do come in line order. Those are good ideas, presuming we can't
come up with a data structure that requires less fixup work in the
first place.
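
To make the shape of the problem concrete, here is a boiled-down sketch
of the chunk-list pattern we're talking about (the names and fields are
illustrative, not lifted from blame.c):

    /* Simplified sketch of the chunk-list pattern described above;
       illustrative names, not the actual blame.c source. */
    struct chunk
    {
      long start;            /* first line attributed to this revision */
      long rev;              /* revision being blamed */
      struct chunk *next;
    };

    /* Find the chunk covering LINE: a linear walk, so O(chunks)
       per lookup and O(chunks^2) over a whole blame run. */
    static struct chunk *
    chunk_find(struct chunk *head, long line)
    {
      struct chunk *found = NULL;
      for (; head && head->start <= line; head = head->next)
        found = head;
      return found;
    }

    /* Shift the start of C and every later chunk by ADJUST: another
       full walk of the tail for each insertion or deletion. */
    static void
    chunk_adjust(struct chunk *c, long adjust)
    {
      for (; c; c = c->next)
        c->start += adjust;
    }

Your delete-range quick win fits this picture: a routine that locates
the end of the deleted range by scanning from the head of the list can
instead scan from the chunk where the range begins, since the end can
never precede it. Your two suggestions above amount to skipping or
shortening the chunk_adjust() walks.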

> > > I wondered if a tree structure would be a lot faster (though just as
> > > bad in the degenerate case). Was this considered at all?
> > >
> >
> > Yes. I figured it would be easier to get the list implementation
> > correct in time for the 1.0 deadline. :-)
> >
> > Even the list implementation turned out to have a bug in it!
>
> Do you have unit tests for it? That would be important for testing any new
> implementation.

I wish we had unit tests for it. That would be important for testing
any new implementation.

>
> > Just saying "tree" doesn't convince me, though; this list has some
> > fairly special invariants. Do you have a specific data structure in
> > mind? If you're going to go to the trouble of thinking of one, you
> > might try thinking about computing blame backwards in time, rather
> > than forwards, since the backwards calculation could potentially
> > terminate early (once all lines in the blamed revision are accounted
> > for).
>
> I've been playing with a few ideas. It's a challenge, isn't it? Maybe I
> need to talk to the "cat" or do the washing up - I usually get my best
> inspiration then.

Yes. I think it would be reasonable to simply use an array
implementation with good slice operations, though it would be
nice to use a representation that handled adjacent lines from the
same revision in a space-efficient manner. Even a naive dynarray
might outperform the current approach.
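
For the record, here is a minimal sketch of what I mean, with invented
names (nothing like this exists in the tree):

    /* Hypothetical run-length blame array; a sketch of the idea,
       not existing Subversion code. */
    typedef struct blame_run
    {
      long rev;         /* revision blamed for this run of lines */
      long length;      /* number of consecutive lines in the run */
    } blame_run_t;

    typedef struct blame_array
    {
      blame_run_t *runs;     /* contiguous, in file order */
      long nruns;
      long capacity;
    } blame_array_t;

    /* Locate the run containing LINE and the line's offset within it.
       Still O(nruns) as written, but splicing a contiguous array with
       memmove() tends to beat per-node pointer chasing, and a
       prefix-sum index over the run lengths could make this O(log n). */
    static long
    run_find(const blame_array_t *ba, long line, long *offset)
    {
      long i, pos = 0;

      for (i = 0; i < ba->nruns; i++)
        {
          if (line < pos + ba->runs[i].length)
            {
              *offset = line - pos;
              return i;
            }
          pos += ba->runs[i].length;
        }
      return -1;   /* past end of file */
    }

Because the runs store lengths rather than absolute start lines, an
insertion or deletion only touches the runs it splits; there is no
blame_adjust()-style fixup of everything downstream. It would also suit
the backwards calculation I mentioned: keep a count of still-unattributed
lines and stop as soon as it reaches zero.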

>
> > I'd also be interested in profiling output from your 1750-rev file to
> > see how much time is spent in the blame_* functions themselves vs
> > the svn_diff_* functions, so that we can at least give a nod towards
> > Amdahl's Law before we run off optimizing the wrong thing.
>
> Absolutely. It will probably turn out that the memory allocator, string
> concatenation, or the server is the problem. If I can get you the raw
> information (with file content omitted), would that do? Note that there
> are 1750 total revisions in the repository, and only around 228 that
> apply to this file.

Yes, the raw information would be fine.

[...]

>
> > Another option would be for us to calculate the changed-line
> > information at commit time (or lazily), store that in the repository,
> > and make it available to clients, but that would be a Big Change.
>
> Yes. I guess that's something like what CVS does.

CVS gets line-oriented deltas per revision "for free" from the fact that
it stores the changes that way internally. This is a huge win, since
it doesn't have to do all the diffing again. Also, IIUC, it does the
blame calculation server-side, which saves lots of latency and
bandwidth.
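
(For anyone unfamiliar with the format: CVS's RCS files store each
revision as an edit script of append/delete commands against a
neighboring revision, so the per-revision line changes are literally
sitting in the ,v file. Schematically, and trimmed down for
illustration, a delta looks like:

    1.4
    log
    @Fix off-by-one in loop bound.@
    text
    @d12 2
    a13 1
    for (i = 0; i < n; i++)
    @

where "d12 2" deletes two lines starting at line 12 and "a13 1" appends
one line after line 13.)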

>
> One other comment: when doing blame for multiple files, the output
> doesn't include any information about the file being accessed.
>

We need to decide whether that's a bug or an API change before we
fix it. My vote is for "bug".

--ben

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Wed Jan 12 16:31:34 2005

This is an archived mail posted to the Subversion Dev mailing list.
