Re: subversion in the news

From: Zack Weinberg <zack_at_codesourcery.com>
Date: 2002-03-14 05:43:01 CET

On Thu, Mar 07, 2002 at 06:29:46PM -0500, Greg Hudson wrote:
> I'm definitely curious what your vision is. The global repository version
> number seems like a big obstacle to some forms of distributed operation.

It does cause problems. I have an idea for how to deal with that,
though.

> On the other hand, I sent mail here a while ago with some ideas on how we
> could get cheap forms of distributed operation (basically, things which
> fall out of the unusual way we do branching and tagging), but it didn't
> generate any discussion. It's at
> http://subversion.tigris.org/servlets/ReadMsg?msgId=59658&listName=dev

The distributed-repository stuff I'm most interested in is variants on
your use-case #2. But let me start with a situation you didn't
mention:

0. An organization wishes to maintain read-only mirrors of its
   Subversion repository. These are either to take the load of
   anonymous users off the master, or to reduce the bandwidth demand of
   geographically-distant users at some expense in latency.

The primitive operation we need in order to support this is just your
cross-repository "svn cp". If I do

svn cp http://a/proj http://b/proj

and B/proj doesn't yet exist, that establishes a mirror of A/proj at
B/proj. Now what happens when you repeat the operation? Notionally,
the current contents of A/proj replace B/proj. Practically, we want
to avoid having to re-transmit the entire repository. Instead, we use
almost the same procedure that we would for "svn update" -- the slave
notifies the master of the most recent revision it's got, and the
master sends only the newer revs. It has to send svndiffs for *all*
the newer revs, not just a combined delta, so that the slave can
continue to be an accurate mirror.
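
A minimal sketch of that exchange, in Python rather than anything
resembling real Subversion code. The Repo class and the pull_from
function are made up for illustration, and each "delta" is just an
opaque string standing in for an svndiff.

```python
from dataclasses import dataclass, field


@dataclass
class Repo:
    """Toy repository: revision N is stored at index N-1 as an opaque delta."""
    deltas: list = field(default_factory=list)

    def youngest(self) -> int:
        return len(self.deltas)

    def deltas_since(self, rev: int) -> list:
        """Master side: one (revnum, delta) pair per revision newer than rev."""
        return [(n, self.deltas[n - 1])
                for n in range(rev + 1, self.youngest() + 1)]


def pull_from(master: Repo, slave: Repo) -> int:
    """Slave side: report our youngest rev, then replay each newer rev in order."""
    sent = master.deltas_since(slave.youngest())
    for revnum, delta in sent:
        # Revisions must arrive contiguously, or the mirror would diverge.
        assert revnum == slave.youngest() + 1
        slave.deltas.append(delta)
    return len(sent)
```

Repeating the pull when nothing has changed transfers nothing, which
is the property we want from a repeated "svn cp".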

The mirrored repository is automatically read-only. It has to be,
because otherwise we could get numbering conflicts between revisions
committed on the mirror and the master. There are two ways to
implement that: either an attempt to do any repo-modifying operation
fails, or it gets translated into a request against the master, which
will fail if the mirror is out-of-date. In the latter case, we want a
way to force a WC-update to query the master, so that a developer
working off a mirror is not stuck when it's out of date.
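
To make the two alternatives concrete, here is a toy model in Python.
Every name in it (Master, Mirror, forward_writes, sync) is
hypothetical, and the out-of-date test is a whole-repository
comparison purely for brevity; a real check would likely be per path.

```python
class ReadOnlyMirror(Exception):
    """Raised when a mirror refuses a repo-modifying operation outright."""


class OutOfDate(Exception):
    """Raised by the master when a commit is based on a stale revision."""


class Master:
    def __init__(self):
        self.youngest = 0

    def commit(self, base_rev: int) -> int:
        if base_rev != self.youngest:
            raise OutOfDate(f"based on r{base_rev}, master is at r{self.youngest}")
        self.youngest += 1
        return self.youngest


class Mirror:
    def __init__(self, master: Master, forward_writes: bool):
        self.master = master
        self.forward_writes = forward_writes
        self.synced_to = master.youngest  # last revision copied from the master

    def commit(self) -> int:
        if not self.forward_writes:
            # Alternative 1: any write against the mirror simply fails.
            raise ReadOnlyMirror("commits must go to the master")
        # Alternative 2: translate the write into a request to the master,
        # which rejects it if the mirror has fallen behind.
        return self.master.commit(self.synced_to)

    def sync(self) -> None:
        """Stand-in for forcing an update that queries the master."""
        self.synced_to = self.master.youngest
```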

Now suppose we want real disconnected operation. Let me start by
talking about one way to work in the existing system. Anyone can
create a branch at any time; so suppose that, as a matter of project
policy, all developers are expected to have personal branches on which
they do development. In fact, one person might have several personal
branches, one per independent task they were hacking on. While
they're hacking, they periodically commit changes to their branch;
those changes are visible to all, but don't affect anyone who doesn't
explicitly go get them. When any given chunk of work is complete, the
developer merges from the trunk to their personal branch, resolves
conflicts, re-tests, and merges the branch back to the trunk; at that
point everyone gets the patch(es).

Ignoring the global-revision-number issue for the moment, notice that
nothing happening on a personal branch can affect revisions anywhere
else in the tree (assuming that a merge pointer is recorded in the
child, not the parent). Therefore, the integrity of the system is not
compromised if commits to a private branch happen in a mirrored
repository. Nor can it be compromised if the branch gets pushed back
upstream to the master repository. The only time integrity can be
damaged is if two direct children of the same parent node get created
on different machines at the same time. (By "direct" I mean a child on
the same line of development, not a branch.) So when the target is not
a subset of the source, the semantics of "svn cp" can simply be to
augment the target with all the revisions that the source has and it
doesn't, as long as the above invariant holds.
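
A rough sketch of the augment-unless-the-invariant-breaks rule, in
Python. The data model is an assumption made for the example: each
repository is a dict from revision id to (parent id, is_branch), and
augment and ParentConflict are invented names.

```python
class ParentConflict(Exception):
    """Two direct (non-branch) children of one parent were found."""


def augment(target: dict, source: dict) -> dict:
    """Return target plus every revision that only source has.

    Both arguments map a revision id to (parent id, is_branch); is_branch
    is True when the revision starts a branch off its parent.
    """
    merged = dict(target)
    # Remember the single direct child each parent is allowed to have.
    direct_child = {parent: rev
                    for rev, (parent, is_branch) in target.items()
                    if not is_branch}
    for rev, (parent, is_branch) in source.items():
        if rev in merged:
            continue  # the target already has this revision
        if not is_branch:
            other = direct_child.get(parent)
            if other is not None:
                raise ParentConflict(
                    f"{other} and {rev} are both direct children of {parent}")
            direct_child[parent] = rev
        merged[rev] = (parent, is_branch)
    return merged
```

Work done on a branch in the mirror merges cleanly with new trunk
revisions from the master; work done directly on the mirror's trunk
raises ParentConflict as soon as the master has its own child of the
same node.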

For the moment, let's just assume that it always holds; I want to deal
with the global revision number first. Obviously, all sorts of
invariants break if a mirror's revision 1234 is not the same as its
parent's 1234. (There's nothing stopping a mirror being copied from
another mirror.) I see two different ways to deal with this problem.
First, we could detect the situation and re-number one set of
revisions -- presumably the mirror's. I suspect this is impossible
without inventing a new revision-identifying-thing which would be
globally unique, at which point why keep the revision numbers? Just
use the identifying things everywhere.

That is, in fact, option 2: change the format of a revision number so
that it is globally unique even in the face of multiple editable
mirrors of the same tree. This is merely difficult. The first thing
most people will think of is to use a string based on the time, the
computer's hostname, etc. This is a bad idea. BitKeeper tries to do
it, and they keep having to make their "changeset keys" longer as
collisions occur -- which forces a repository format change,
propagated everywhere!

Instead, I suggest we keep the simple sequence number but give it
structure. Revisions created on the master still get a bare number,
1234. The first repository to mirror the master gets assigned mirror
number 1. Revisions created there get a number looking like 1:23. If
a mirror of a mirror is created, its number is appended to its
parent's number: 1.4.3:12 is rev 12 created on the third mirror of the
fourth mirror of the first mirror of the master repository. (The
colon is to prevent people from confusing the rev number with the last
component of the mirror number, or confusing the whole thing with one
of RCS's branched version numbers.)
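
For illustration, the structured number could be parsed and printed
like this. This is only a sketch in Python; RevId and its field names
are invented here, and nothing about the on-disk representation is
implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RevId:
    mirror_path: tuple  # () for the master, (1, 4, 3) for mirror 1.4.3
    number: int         # sequence number within that repository

    @classmethod
    def parse(cls, text: str) -> "RevId":
        # Everything before the colon is the dotted mirror path;
        # a bare number has no colon and belongs to the master.
        path, sep, num = text.rpartition(":")
        mirrors = tuple(int(p) for p in path.split(".")) if sep else ()
        return cls(mirrors, int(num))

    def __str__(self) -> str:
        if not self.mirror_path:
            return str(self.number)
        return ".".join(str(m) for m in self.mirror_path) + ":" + str(self.number)

    def depth(self) -> int:
        """How far below the master the creating repository sits."""
        return len(self.mirror_path)
```

RevId.parse("1.4.3:12") gives mirror path (1, 4, 3) and number 12,
and str() of the result gives back "1.4.3:12".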

This takes up extra space, but only proportional to the depth of the
mirror tree; in normal circumstances that tree will be quite shallow.

Now that we have the version number sorted, let's go back to the
problem of creating two children of the same node in different
repositories. My proposed solution is the same as BitKeeper's: when
we discover this has happened, one of the children gets converted into
a branch node. Generally it is better for the revision created higher
up in the mirror tree to displace the one lower down. The developer
is notified that this has happened, and can then merge the new branch
back onto its parent.
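
The "higher up displaces lower down" preference can be read straight
off the structured revision number. A small Python sketch, with
invented function names; what to do when both revisions come from the
same depth is left open above, so the sketch just refuses.

```python
def mirror_depth(rev: str) -> int:
    """Depth of the creating repository: 0 for the master, 1 for '1:23', etc."""
    path, sep, _ = rev.rpartition(":")
    return len(path.split(".")) if sep else 0


def pick_displaced(a: str, b: str) -> tuple:
    """Return (kept, displaced) for two children of the same parent node."""
    depth_a, depth_b = mirror_depth(a), mirror_depth(b)
    if depth_a == depth_b:
        # Assumption: no tie-break rule is defined for equal depths.
        raise ValueError(f"no rule for choosing between {a} and {b}")
    # The revision created nearer the master stays on the main line;
    # the other one becomes a branch node.
    return (a, b) if depth_a < depth_b else (b, a)
```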

This operation is easy to implement with Subversion. The only trick
is what to call the branch. If the displaced node is itself the
result of a merge, it should simply have its merge and parent pointers
swapped. Let me clarify that with a sketch: suppose I have this
branch structure

    1 -- 2 -- 3 -- 4
         |
         1:1 -- 1:2

Mirror 1 is my private repository. I've done some work in it;
meanwhile other things have been applied to the master, and I've done
an svn cp to pick them up.

I merge 1:2 back onto the trunk within my private tree:

    1 -- 2 -- 3 -- 4 -- 1:3
         |             /
         1:1 -- 1:2 --

But in the meantime someone else has created revision 5 as a child of 4.
When I attempt to cp my stuff back upstream, it gets rejected.
Instead my repository is rearranged like this:

    1 -- 2 -- 3 -- 4 -- 5
         |           \
         1:1 -- 1:2 -- 1:3
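
The rearrangement of 1:3 is just the pointer swap described earlier.
As a sketch in Python (Node, parent, merged_from and displace are
names invented for the example):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    rev: str
    parent: Optional[str]               # the revision this one follows
    merged_from: Optional[str] = None   # merge pointer, if this is a merge


def displace(node: Node) -> Node:
    """Move a merge node off the trunk by swapping its two pointers."""
    if node.merged_from is None:
        # A plain node has no branch to fall back onto; the user has to
        # supply a branch name instead.
        raise ValueError(f"{node.rev} is not the result of a merge")
    return Node(node.rev, parent=node.merged_from, merged_from=node.parent)
```

Before the swap 1:3 has parent 4 and merge pointer 1:2; afterwards it
has parent 1:2 and merge pointer 4, which is the second picture.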

I have to repeat the merge operation to take into account the new rev
5 before my push will succeed. Fortunately, Subversion remembers that
2-4 have already been merged and doesn't make me do it again.

    1 -- 2 -- 3 -- 4 -- 5 -- 1:4
         |           \     /
         1:1 -- 1:2 -- 1:3

But what happens if I've been sloppy and worked exclusively on the
trunk?

    1 -- 2 -- 1:1 -- 1:2

In that case I'll get a reject when I attempt the initial svn cp to
acquire work from the parent. Subversion then needs to push everything
I've done onto a brand new branch; it has no way of knowing what I want
the branch to be called, so it should make me tell it.

$ svn cp http://<parent> file:///<my-repo>
svn: Parent conflict between revisions 3 and 1:1. Shift revision
1:1 onto a branch.

I then create the branch, effectively turning the above tree into

    1 -- 2
         |
         1:1 -- 1:2

and repeat the pull, which succeeds, giving me

    1 -- 2 -- 3 -- 4
         |
         1:1 -- 1:2

just as I had in the first example.

I'm not sure what the user interface to after-the-fact branch creation
should be like.

One final note: the cross-repo svn cp operation is a lot of tedious
typing. Under normal conditions you'll be either pulling stuff down
from a parent repo, or pushing stuff back up to the same parent. I
suggest the commands "svn pull" and "svn push" as shorthand for these
two operations.

Thoughts?

Received on Thu Mar 14 05:43:43 2002