'svn blame -g' design: client vs. server

From: Hyrum K. Wright <hyrum_wright_at_mail.utexas.edu>
Date: 2007-07-23 22:14:06 CEST

[Note: I'll be AFK for most of the week, so replies may be slow in coming.]

I've gotten most of the compatibility code for 'svn blame -g' written,
so now it is on to (what I perceive to be) the crux of the issue:
sending the file data used for blame from the server to the client and
computing the blame results. After thinking about it, I've come up with
two different solutions. This mail presents them, along with their
respective pros and cons, and solicits comments regarding each method.

The Problem
-----------
'svn blame' currently works by requesting the file data from the
server using svn_ra_get_file_revs(), and then iterating over the file
revisions in a given range, matching revisions to lines which they
modified. This works great, because up to this point, a file's
revisions have been linear, without the possibility of multiple ancestors.

Enter merge tracking. A file may have multiple lines of history as
ancestors as the result of a merge. For the purposes of blame, these
lines of history have to be linearized, or folded into on line, to
determine which revision modified which lines of the file. So far, I've
come up with a couple of ways to perform this linearization.

Solution #1 - On the Server
---------------------------
My proposed solution for the linearization of history information for a
file, F, on the server is fairly straight forward:
   1) Get all the revisions affecting F on the direct line as we
      currently do.
   2) Add to this set the revisions affecting F as the result of a merge,
      recursing to account for merges merging merges.
   3) Sort the list of (revision, path) pairs, and use the file_revs
      callback to return deltas in revision order between each (revision,
      path) pair. Use a flag on the callback to indicate which revisions
      are the result of a merge.
   4) On the client, construct blame information as currently done, using
      the extra callback invocations to compute the result of blame with
      respect to merges.

For example, if file /trunk/a was created in r2, branched to
branches/1/a at r3, modified in the branch at r5, modified in trunk in
r6, merged back to trunk in r8, and then 'svn blame -g'd, the sequence
of revisions (with corresponding deltas) returned to the client would be:
(2, /trunk/a)
(3, /branches/1/a)
(5, /branches/1/a)
(6, /trunk/a)
(8, /trunk/a)

Pros:
  * The heavy lifting and major modifications would occur on the server.
  * The client would remain largely unchanged, save for code to filter on
    the appropriate callback flag.
  * Only one invocation of file_revs(), meaning less network traffic.

Cons:
  * The heavy lifting is done by the server, which could create a
    bottleneck.
  * The same file on different branches could become arbitrarily
    different, resulting in complex deltas, leading to more network
    bandwidth and more processing on the client.
  * This method involves hacking the server in a way which could change
    the semantics of file_revs().

Solution #2 - On the Client
---------------------------
Instead of modifying the server, it is possible to compute the blame
information on the client using existing ra methods. The (rough)
algorithm looks something like this:
  1) Use file_revs() to determine the set of revisions and contents for a
     given file, F.
  2) Using get_mergeinfo(), determine if any of the revisions are merging
     revisions. If so, use file_revs() to get their lines of history.
  2a) Potentially recurse due to merges merging merges.
  3) Using all the gathered file rev data, run through the versions of F
     in revision order generating the appropriate blame information.

Pros:
  * Uses server functionality currently availably in trunk.
    - This also means that should this work not get done before branching
      1.5, it would work with a 1.6 client and 1.5 server.
  * Client, not the server, does the heavy lifting. This may be offset
    somewhat by the multiple calls to svn_ra_get_file_revs().

Cons:
  * Duplicating a lot of the merging revisions detection logic already
    found in libsvn_repos in libsvn_client.
  * Lots of ra method invocations, resulting in lots of network
    bandwidth, much of which won't be used in the final output of blame.
  * Because of the multiple invocations of file_revs(), multiple
    full-text versions of the same file will be sent over the network.
  * I'm not that familiar with the libsvn_client blame functionality,
    which will take some time to learn.
  * We'll need to keep lots of data around locally on the client,
    roughly equivalent to O(number of branches).

Proposal
--------
Personally, I favor option #1, because the repos code looks more
hackable, and I'm more familiar with it. Adding this ability to
file_revs() can be done relatively unobtrusively. Also, I don't think
the added load on the server will be significant, compared to multiple
invocations of ra methods in option #2.

Though, as I write this mail, #2 is starting to grow on me. Although it
requires more network time and more resources on the client, it feels
cleaner, if somewhat more complex. Using functionality currently
available on the server and packing everything into the client is very
appealing.

Summary
-------
Merge tracking creates ancestor trees instead of a a linear history
path. Computing blame information for ancestor trees is nontrivial, and
can be done
  1) On the server with one ra call.
  2) On the client with ra calls multiple invocations, and little or no
     server modification, but increased network overhead.

I'm leaning toward #1, but #2 has its appeals also.

Thoughts?

-Hyrum

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Mon Jul 23 22:09:21 2007

This message: [ Message body ]
Next message: Ben Collins-Sussman: "Re: [PATCH] fix for interactive merge-callback not supporting spaces in file paths"
Previous message: Erik Huelsmann: "Re: Umlaut problem on Mac (composed vs. decomposed UTF-8)"

Contemporary messages sorted: [ By Date ] [ By Thread ] [ By Subject ] [ By Author ] [ By messages with attachments ]