Re: Looking to improve performance of svn annotate

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Thu, 12 Aug 2010 01:14:39 +0200

Thanks for the input/explanations, Julian and Greg. Some reactions below ...

On Wed, Aug 11, 2010 at 12:04 PM, Julian Foad <julian.foad_at_wandisco.com> wrote:
[...]
> I hadn't really appreciated this difficulty. By noticing the copy
> source positions when interpreting the delta, we might be able to do
> better:
>
> cp from r1[4:10] "TWO\nTH" => OK, this is r1 source text
> cp from r1[12:14] "RE" => "we skipped/deleted 2 source bytes"
> cp from r1[2:8] "E\nTWO\n" => "this is out of order, so is an add"
>
> which we could store in our intermediate data structure as
>
> r1 "TWO\nTH"
> r2 "" # recording a deletion at this position
> r1 "RE"
> r2 "E\nTWO\n" # recorded as an addition
>
> That's a little better, but it relies on having some robust way to
> decide which copies are "out of order" and which copies are "getting
> back into the main sequential flow of the source", taking into account
> that copy source ranges can come from behind and ahead of the "current"
> position and can overlap with other copies. And of course taking into
> account that copies of long chunks of the source text should take
> priority over short bits when it comes to deciding which revision the
> text belongs to.
>
> I'm not sure if there's a feasible solution.
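
To make the idea above concrete, here is a minimal sketch in Python of
the most naive possible rule: a copy counts as "in order" if it starts at
or after the current source position, and as an addition otherwise. The
function name and the tuple layout are made up for illustration; this is
not how Subversion interprets svndiff windows.

```python
def classify_copies(copies):
    """Label each (start, end) copy range taken from the previous revision.

    Naive rule (an assumption for illustration only): a copy that starts
    at or after the current source position is "in order"; a copy that
    starts before it is treated as newly added text.
    """
    pos = 0  # first source byte not yet consumed by an in-order copy
    result = []
    for start, end in copies:
        if start >= pos:
            # Bytes between pos and start were skipped, i.e. deleted.
            result.append(("source", start, end, start - pos))
            pos = end
        else:
            # Copy reaches back behind the current position: out of order.
            result.append(("added", start, end, 0))
    return result


# The three copy ranges from Julian's example.
labels = classify_copies([(4, 10), (12, 14), (2, 8)])
```

On Julian's three ranges this labels the first two as source text (the
second with 2 skipped bytes) and the third as an addition. It also shows
the weakness he points out: the rule ignores overlapping copies and gives
long chunks no priority over short ones.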

How does diff handle that? For instance, when the target starts with a
copy of the end of the source, followed by the main part of the source
from the beginning. Or when a small piece of the source is "out of
order" and is followed by a bigger chunk that is more in the "main
sequential flow". Maybe the problems are similar?

Maybe I should study the diff code a bit ...

On Wed, Aug 11, 2010 at 4:39 PM, Greg Hudson <ghudson_at_mit.edu> wrote:
> In the process of a blame operation, the server is already performing
> binary deltas (which aren't directly usable for blame) between each pair
> of revs. It wouldn't necessarily be any more work for the server to
> perform line-based diffs instead, although that would be the sort of
> text processing we don't ordinarily put into the server code.
>
> Alternatively, the server could perform a more diff-like binary delta,
> where the only instructions are "copy the next N bytes from the input"
> or "insert these new bytes." Such a binary delta could possibly be
> transformed into line-based diffs on the client, although I'm not
> completely sure of that.
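
As a rough sketch of what such a transformation on the client could look
like (in Python, purely illustrative): the op names and the tuple format
are hypothetical, and the "skip" instruction is an assumption added here,
since "copy" and "insert" alone cannot express a deletion. A target line
counts as unchanged only if it is a whole source line copied in one piece.

```python
def changed_target_lines(source: bytes, ops):
    """Apply a sequential delta and report which target lines are new.

    ops is a list of ("copy", n), ("skip", n) or ("insert", data) tuples.
    This format is hypothetical, not Subversion's svndiff encoding.
    """
    target = bytearray()
    origin = []  # per target byte: its source offset, or None if inserted
    src_pos = 0
    for op, arg in ops:
        if op == "copy":
            chunk = source[src_pos:src_pos + arg]
            target += chunk
            origin.extend(range(src_pos, src_pos + len(chunk)))
            src_pos += arg
        elif op == "skip":
            src_pos += arg  # source bytes dropped from the target
        elif op == "insert":
            target += arg
            origin.extend([None] * len(arg))
        else:
            raise ValueError(f"unknown op: {op!r}")

    changed = []
    start = 0
    line_no = 0
    while start < len(target):
        nl = target.find(b"\n", start)
        end = len(target) if nl == -1 else nl + 1
        offs = origin[start:end]
        first, last = offs[0], offs[-1]
        intact = (
            first is not None
            # the line must begin where a source line begins ...
            and (first == 0 or source[first - 1:first] == b"\n")
            # ... be copied as one contiguous run ...
            and all(o == first + i for i, o in enumerate(offs))
            # ... and end where that source line ends
            and (source[last:last + 1] == b"\n" or last + 1 == len(source))
        )
        if not intact:
            changed.append(line_no)
        start = end
        line_no += 1
    return bytes(target), changed


source = b"ONE\nTWO\nTHREE\n"
ops = [("copy", 4), ("insert", b"1.5\n"), ("copy", 4),
       ("skip", 6), ("insert", b"three\n")]
target, changed = changed_target_lines(source, ops)
```

Here the target lines at 0-based indices 1 and 3 are reported as changed.
A pure deletion leaves no mark, which is fine for blame, since only lines
present in the target need a revision assigned.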

I haven't studied the server part of the operation that much yet. But
this seems like it could be an interesting (additional) approach. Any
pointers to where in the source code this happens, so I know where to
start looking?

I naively thought that the server, when get_file_revs2 is called,
would just supply the deltas which it has already stored in the
repository. I.e. that the deltas are just the native format in which
the data is kept in the back-end FS, and that the server does little
more than iterate through the relevant files and extract the relevant
bits.

But it seems that's not the case, and the server is calculating (some
of) these binary deltas on the fly? In that case, yes, that seems
like another good place to optimize, or to do some pre-processing work
that makes it easier for the client to calculate the blame.

Cheers,

-- 
Johan

Received on 2010-08-12 01:15:26 CEST
