
Re: 'svn blame -g' design: client vs. server

From: Julian Foad <julianfoad_at_btopenworld.com>
Date: 2007-08-17 01:39:12 CEST

Hi Hyrum.

I see it's three weeks since you sent this, so you might have gone ahead
anyway, but I just thought I'd throw in my quick thoughts.

The strong separation between client and server is a valuable feature of
Subversion. (I've been working with Perforce recently and encountering many
problems that are pretty much a result of it doing everything on the server.)

So, if the functionality you're thinking of adding to the server seems to be
generally useful to all types of clients for a variety of purposes, and not
just for the current "svn blame" command, then it might be fine to put it in
the server. If, however, it's in some way tied to client-specific features such
as knowing about lines and line endings, or assuming that "blame" is going to
give a simple stream of text as its output, then don't.

To judge the generality of the supporting methods you're envisaging, ask
yourself whether they would be useful to a GUI client with an advanced
interactive "blame-like" feature: one that allows selective hiding of merge
sources from the displayed result and stepping back in time down any chosen
line of ancestry, recalculating the display as it goes.

Now, looking at what you actually wrote, my comments are inline below.

Hyrum K. Wright wrote:
> I've gotten most of the compatibility code for 'svn blame -g' written,
> so now it is on to (what I perceive to be) the crux of the issue:
> sending the file data used for blame from the server to the client and
> computing the blame results. After thinking about it, I've come up with
> two different solutions. This mail presents them, along with their
> respective pros and cons, and solicits comments regarding each method.
>
> The Problem
> -----------
> 'svn blame' currently works by requesting the file data from the
> server using svn_ra_get_file_revs(), and then iterating over the file
> revisions in a given range, matching revisions to lines which they
> modified. This works great, because up to this point, a file's
> revisions have been linear, without the possibility of multiple ancestors.
>
> Enter merge tracking. A file may have multiple lines of history as
> ancestors as the result of a merge. For the purposes of blame, these
> lines of history have to be linearized, or folded into one line, to
> determine which revision modified which lines of the file. So far, I've
> come up with a couple of ways to perform this linearization.
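[To make the current (pre-merge-tracking) behaviour concrete, here is a
minimal Python sketch of linear blame. The real implementation is C in
libsvn_client and works on deltas rather than full texts; this is purely
illustrative.]

```python
# Walk the file's revisions in order, diff each version against the
# previous one, and stamp changed lines with the revision that
# introduced them. Uses difflib on full texts for simplicity.
import difflib

def linear_blame(revisions):
    """revisions: list of (rev_number, list_of_lines), oldest first.
    Returns (rev, line) pairs for the newest version."""
    blame = []
    prev_lines = []
    for rev, lines in revisions:
        matcher = difflib.SequenceMatcher(None, prev_lines, lines)
        new_blame = []
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == "equal":
                new_blame.extend(blame[i1:i2])  # keep old attribution
            else:
                new_blame.extend((rev, l) for l in lines[j1:j2])
        blame = new_blame
        prev_lines = lines
    return blame

history = [
    (2, ["alpha", "beta"]),
    (6, ["alpha", "beta", "gamma"]),
]
print(linear_blame(history))
# → [(2, 'alpha'), (2, 'beta'), (6, 'gamma')]
```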
>
> Solution #1 - On the Server
> ---------------------------
> My proposed solution for the linearization of history information for a
> file, F, on the server is fairly straightforward:
> 1) Get all the revisions affecting F on the direct line as we
> currently do.
> 2) Add to this set the revisions affecting F as the result of a merge,
> recursing to account for merges merging merges.
> 3) Sort the list of (revision, path) pairs, and use the file_revs
> callback to return deltas in revision order between each (revision,
> path) pair. Use a flag on the callback to indicate which revisions
> are the result of a merge.
> 4) On the client, construct blame information as currently done, using
> the extra callback invocations to compute the result of blame with
> respect to merges.
>
> For example, if file /trunk/a was created in r2, branched to
> branches/1/a at r3, modified in the branch at r5, modified in trunk in
> r6, merged back to trunk in r8, and then 'svn blame -g'd, the sequence
> of revisions (with corresponding deltas) returned to the client would be:
> (2, /trunk/a)
> (3, /branches/1/a)
> (5, /branches/1/a)
> (6, /trunk/a)
> (8, /trunk/a)
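[The linearization in steps 1-3 can be sketched in Python. Here
get_direct_history and get_merged_sources are hypothetical stand-ins for
the repos-layer lookups, and the boolean plays the role of the proposed
callback flag marking merged revisions.]

```python
# Gather the direct history, add revisions reachable through merges
# (recursing for merges of merges), then sort into a single
# revision-ordered list of (rev, path, is_merged) tuples.
def linearize(path, get_direct_history, get_merged_sources):
    seen = set()
    result = []

    def collect(p, merged):
        for rev, rev_path in get_direct_history(p):
            if (rev, rev_path) in seen:
                continue
            seen.add((rev, rev_path))
            result.append((rev, rev_path, merged))
            for src in get_merged_sources(rev_path, rev):
                collect(src, True)  # recurse: merges merging merges

    collect(path, False)
    result.sort()  # one revision-ordered stream
    return result

# The /trunk/a example from the mail, as canned lookup tables:
direct = {
    "/trunk/a": [(2, "/trunk/a"), (6, "/trunk/a"), (8, "/trunk/a")],
    "/branches/1/a": [(3, "/branches/1/a"), (5, "/branches/1/a")],
}
merges = {("/trunk/a", 8): ["/branches/1/a"]}

order = linearize("/trunk/a",
                  lambda p: direct[p],
                  lambda p, r: merges.get((p, r), []))
# reproduces the (2,/trunk/a) ... (8,/trunk/a) sequence above
```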

It looks to me like that's too monolithic. There are two useful things the
server is doing there:

   * finding the merge graph of the file (steps 1 and 2);

   * sending a stream of deltas (in step 3);

but in general the client would want to choose what parts of the merge graph
it's interested in, and not have the server simply send it the whole set of
deltas serialised in one particular way.

>
> Pros:
> * The heavy lifting and major modifications would occur on the server.

I don't see these as a "pro". It would be a "pro" if the server were doing
something efficiently that a client could not.

> * The client would remain largely unchanged, save for code to filter on
> the appropriate callback flag.
> * Only one invocation of file_revs(), meaning less network traffic.
>
> Cons:
> * The heavy lifting is done by the server, which could create a
> bottleneck.
> * The same file on different branches could become arbitrarily
> different, resulting in complex deltas, leading to more network
> bandwidth and more processing on the client.

That one's basically saying it's not very scalable.

> * This method involves hacking the server in a way which could change
> the semantics of file_revs().

>
> Solution #2 - On the Client
> ---------------------------
> Instead of modifying the server, it is possible to compute the blame
> information on the client using existing ra methods. The (rough)
> algorithm looks something like this:
> 1) Use file_revs() to determine the set of revisions and contents for a
> given file, F.
> 2) Using get_mergeinfo(), determine if any of the revisions are merging
> revisions. If so, use file_revs() to get their lines of history.
> 2a) Potentially recurse due to merges merging merges.
> 3) Using all the gathered file rev data, run through the versions of F
> in revision order generating the appropriate blame information.
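[A rough Python sketch of that client-side gathering, with file_revs and
get_mergeinfo as hypothetical wrappers over the corresponding svn_ra_*
calls. Note that it keeps every fetched history in memory, which is the
O(number of branches) cost listed in the cons below.]

```python
# Fetch the direct file_revs stream, consult mergeinfo for merging
# revisions, and recurse into each merge source.
def gather_histories(path, start, end, file_revs, get_mergeinfo):
    histories = {path: file_revs(path, start, end)}
    for rev, _text in histories[path]:
        for src_path, (src_start, src_end) in get_mergeinfo(path, rev).items():
            if src_path not in histories:
                # recurse: a merge source may itself contain merges
                histories.update(gather_histories(
                    src_path, src_start, src_end, file_revs, get_mergeinfo))
    return histories

# Canned data standing in for the RA layer:
revs = {
    "/trunk/a": [(2, "A"), (6, "B"), (8, "C")],
    "/branches/1/a": [(3, "D"), (5, "E")],
}
merges = {("/trunk/a", 8): {"/branches/1/a": (3, 5)}}

histories = gather_histories(
    "/trunk/a", 2, 8,
    lambda p, s, e: [(r, t) for r, t in revs[p] if s <= r <= e],
    lambda p, r: merges.get((p, r), {}))
# histories now holds one full revision list per branch touched
```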

As you point out below, that approach is very inefficient: it fetches all the
file contents over the network.

>
> Pros:
> * Uses server functionality currently available in trunk.
> - This also means that should this work not get done before branching
> 1.5, it would work with a 1.6 client and 1.5 server.
> * Client, not the server, does the heavy lifting. This may be offset
> somewhat by the multiple calls to svn_ra_get_file_revs().
>
> Cons:
> * Duplicating a lot of the merging revisions detection logic already
> found in libsvn_repos in libsvn_client.

Algorithms can be factored out if they really are common. (I assume you mean
source code; it's not a problem to have small amounts of duplicated object code
in different programs.)

> * Lots of ra method invocations, resulting in lots of network
> bandwidth, much of which won't be used in the final output of blame.
> * Because of the multiple invocations of file_revs(), multiple
> full-text versions of the same file will be sent over the network.
> * I'm not that familiar with the libsvn_client blame functionality,
> which will take some time to learn.
> * We'll need to keep lots of data around locally on the client,
> roughly equivalent to O(number of branches).

Speaking in abstract functions for a moment, I think the client should:

   1. request the merge graph (ancestry graph) of the file from the server,
limited only in very simple ways (perhaps overall revision range, perhaps a
"depth" of merge hierarchy, etc.);

   2. trim the graph according to the user's requirements;

   3. convert to a list of deltas that it needs to see (and it doesn't have to
ask for a single succession of deltas like a1->a2->b3->a4: it can avoid that
criss-crossing and ask for a1->a2, a2->b3, a2->a4);

   4. request the deltas from the server (and here perhaps a stream of deltas
could be significantly more efficient than a set of separate requests, but
evaluate that question objectively);

   5. process the received deltas.

I don't know how much of the server-side work in steps (1) and (4) is already
available, but maybe it's available in a crude way now and could be streamlined
later if found to be a bottleneck.
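[As a sketch of steps 2 and 3, assuming step 1 returned the merge graph as
child-to-parent links: the client trims the graph, then diffs each kept node
against its nearest kept ancestor, which naturally produces non-criss-crossing
requests like a1->a2, a2->b3, a2->a4.]

```python
# Derive the minimal delta requests from a trimmed merge graph.
def plan_deltas(graph, wanted):
    """graph: {node: parent or None}; wanted: nodes the user kept.
    Returns (base, target) pairs; base None means 'send full text'."""
    plan = []
    for node in wanted:
        base = graph[node]
        while base is not None and base not in wanted:
            base = graph[base]  # skip trimmed-out ancestors
        plan.append((base, node))
    return plan

# trunk a1 <- a2 <- a4, with branch node b3 also based on a2:
graph = {"a1": None, "a2": "a1", "b3": "a2", "a4": "a2"}
requests = plan_deltas(graph, ["a1", "a2", "b3", "a4"])
# → [(None, 'a1'), ('a1', 'a2'), ('a2', 'b3'), ('a2', 'a4')]
```

If the user hides part of the graph (say, everything before a2's children),
the same planner degrades gracefully: trimmed nodes are skipped and each
survivor is requested against whatever kept ancestor remains.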

Well, those are my thoughts.

- Julian

>
> Proposal
> --------
> Personally, I favor option #1, because the repos code looks more
> hackable, and I'm more familiar with it. Adding this ability to
> file_revs() can be done relatively unobtrusively. Also, I don't think
> the added load on the server will be significant, compared to multiple
> invocations of ra methods in option #2.
>
> Though, as I write this mail, #2 is starting to grow on me. Although it
> requires more network time and more resources on the client, it feels
> cleaner, if somewhat more complex. Using functionality currently
> available on the server and packing everything into the client is very
> appealing.
>
> Summary
> -------
> Merge tracking creates ancestor trees instead of a linear history
> path. Computing blame information for ancestor trees is nontrivial, and
> can be done:
> 1) On the server with one ra call.
> 2) On the client with multiple invocations of ra methods, and little
> or no server modification, but increased network overhead.
> I'm leaning toward #1, but #2 has its appeals also.
> Thoughts?
> -Hyrum

Received on Fri Aug 17 01:37:02 2007

This is an archived mail posted to the Subversion Dev mailing list.