[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: Looking to improve performance of svn annotate

From: Philipp Marek <philipp.marek_at_emerion.com>
Date: Mon, 22 Mar 2010 12:29:54 +0100

Hello Julian,
Hello Johan,

On Montag, 22. März 2010, Julian Foad wrote:
...
> My basic idea is an algorithm on the client that takes, as its input, a
> sequence of binary diffs, one for each revision in which the file
> changed, representing that incremental change.
...
> The algorithm applies one of these diffs at a time to an in-memory data
> structure that represents which revision is associated with each byte in
> the file.
...
> Then, at the end, we read in the blamed file's text (just that one
> revision of it), match up the byte ranges to the lines in the file, and
> for each line we display the highest recorded revision number that falls
> within that line.
please forgive me for intruding with barely some knowledge to share, but that
sounds awful because of the round-trips (and latencies) involved.

I'd suggest trying to keep the (current) version of the file line-wise in
memory (stored as line-number, revision, start byte, length, and actual
bytes); an incoming binary delta would just change the data and revision,
until the deltas are exhausted.

Then the blame output can be given directly, without any more round-trips.

Of course, that means splitting xdelta operations into line-wise bits; but, as
an improvement, you could also print the source line number of a change (by
just "counting" the lines from top to bottom when applying a delta).

> A backward algorithm would be a little more complex ...
Please see my comment at the end of my mail.

> I am still not completely sure whether the binary diff idea can produce
> a line-wise result that looks natural and minimal, while still being
> efficient. For example, I suspect there may be situations where the
> binary diff will show changes within two different lines where a text
> diff would have showed a change only to one line, but I am not sure.
> Even if effects like this do occur, they might only happen in cases
> where a human reader would agree that that is a good way to describe the
> change anyway.
That might not be true, in case of an ignore-whitespace-changes option or
something like that.
 
> The important first step is to do some profiling to find out how much
> time is being spent -
> - on the server, calculating full texts from the stored diffs?
> - transmitting data over the network?
> - on the client, producing diffs?
> - on the client, producing blame info from diffs?
That's always good advise.
 
> I've just re-read your message from last year
> <http://subversion.tigris.org/ds/viewMessage.do?dsForumId=1065&dsMessageId=
> 2079688> in which you report it's mainly client-side CPU bound. That's
> good to know. (That thread is linked from the issue.) So next, you would
> be well advised to profile the execution of the client code and see
> whether the textual diffs or some other part of the processing are taking
> the most time.
>
> The clear advantage of the binary diff algorithm is that on the client
> it neither calculated diffs nor even looks at the older revisions of the
> file's text, so that might be a big win if it works at all.
>
> As regards reversing the current algorithm, first you might want to find
> out what is the oldest revision number that shows up in the blame of
> your typical file, and compare that with all the changes in the file's
> history. Is it only reporting the most recent 1% of the file's changes,
> in which case the reversed algorithm could achieve a 100-fold speed-up,
> or 20%, or 80%?
Well, how about using an iterative approach? Just do the algorithm you
described for the revisions [N-100:N]; if there's any line that's still at
N-100 fetch the blame information for [N-200:N-100], and substitute the
'unknown' lines; repeat as necessary.

Sounds simpler than a full reverse run, and might be efficient enough - only a
few revisions more than absolutely necessary are fetched and analyzed.

Regards,

Phil
Received on 2010-03-22 12:30:32 CET

This is an archived mail posted to the Subversion Dev mailing list.