On Montag, 22. März 2010, Julian Foad wrote:
> My basic idea is an algorithm on the client that takes, as its input, a
> sequence of binary diffs, one for each revision in which the file
> changed, representing that incremental change.
> The algorithm applies one of these diffs at a time to an in-memory data
> structure that represents which revision is associated with each byte in
> the file.
> Then, at the end, we read in the blamed file's text (just that one
> revision of it), match up the byte ranges to the lines in the file, and
> for each line we display the highest recorded revision number that falls
> within that line.
please forgive me for intruding with barely some knowledge to share, but that
sounds awful because of the round-trips (and latencies) involved.
I'd suggest trying to keep the (current) version of the file line-wise in
memory (stored as line-number, revision, start byte, length, and actual
bytes); an incoming binary delta would just change the data and revision,
until the deltas are exhausted.
Then the blame output can be given directly, without any more round-trips.
Of course, that means splitting xdelta operations into line-wise bits; but, as
an improvement, you could also print the source line number of a change (by
just "counting" the lines from top to bottom when applying a delta).
> A backward algorithm would be a little more complex ...
Please see my comment at the end of my mail.
> I am still not completely sure whether the binary diff idea can produce
> a line-wise result that looks natural and minimal, while still being
> efficient. For example, I suspect there may be situations where the
> binary diff will show changes within two different lines where a text
> diff would have showed a change only to one line, but I am not sure.
> Even if effects like this do occur, they might only happen in cases
> where a human reader would agree that that is a good way to describe the
> change anyway.
That might not be true, in case of an ignore-whitespace-changes option or
something like that.
> The important first step is to do some profiling to find out how much
> time is being spent -
> - on the server, calculating full texts from the stored diffs?
> - transmitting data over the network?
> - on the client, producing diffs?
> - on the client, producing blame info from diffs?
That's always good advise.
> I've just re-read your message from last year
> 2079688> in which you report it's mainly client-side CPU bound. That's
> good to know. (That thread is linked from the issue.) So next, you would
> be well advised to profile the execution of the client code and see
> whether the textual diffs or some other part of the processing are taking
> the most time.
> The clear advantage of the binary diff algorithm is that on the client
> it neither calculated diffs nor even looks at the older revisions of the
> file's text, so that might be a big win if it works at all.
> As regards reversing the current algorithm, first you might want to find
> out what is the oldest revision number that shows up in the blame of
> your typical file, and compare that with all the changes in the file's
> history. Is it only reporting the most recent 1% of the file's changes,
> in which case the reversed algorithm could achieve a 100-fold speed-up,
> or 20%, or 80%?
Well, how about using an iterative approach? Just do the algorithm you
described for the revisions [N-100:N]; if there's any line that's still at
N-100 fetch the blame information for [N-200:N-100], and substitute the
'unknown' lines; repeat as necessary.
Sounds simpler than a full reverse run, and might be efficient enough - only a
few revisions more than absolutely necessary are fetched and analyzed.
Received on 2010-03-22 12:30:32 CET