
RE: Re: subversion in the news

From: Bill Tutt <rassilon_at_lyra.org>
Date: 2002-03-14 10:45:21 CET

> From: Zack Weinberg [mailto:zack@codesourcery.com]
>
> On Thu, Mar 07, 2002 at 06:29:46PM -0500, Greg Hudson wrote:
> > On the other hand, I sent mail here a while ago with some ideas on how
> > we could get cheap forms of distributed operation (basically, things
> > which fall out of the unusual way we do branching and tagging), but it
> > didn't generate any discussion. It's at
> >
> > http://subversion.tigris.org/servlets/ReadMsg?msgId=59658&listName=dev
>
> The distributed-repository stuff I'm most interested in is variants on
> your use-case #2. But let me start with a situation you didn't
> mention:
>
> 0. An organization wishes to maintain read-only mirrors of its
> Subversion repository. These are either to take the load of
> anonymous users off the master, or to reduce the bandwidth demand of
> geographically-distant users at some expense in latency.
>
> The primitive operation we need in order to support this is just your
> cross-repository "svn cp". If I do
>
> svn cp http://a/proj http://b/proj
>
> and B/proj doesn't yet exist, that establishes a mirror of A/proj at
> B/proj. Now what happens when you repeat the operation? Notionally,
> the current contents of A/proj replace B/proj. Practically, we want
> to avoid having to re-transmit the entire repository. Instead, we use
> almost the same procedure that we would for "svn update" -- the slave
> notifies the master of the most recent revision it's got, and the
> master sends only the newer revs. It has to send svndiffs for *all*
> the newer revs, not just a combined delta, so that the slave can
> continue to be an accurate mirror.
>
> The mirrored repository is automatically read-only. It has to be,
> because otherwise we could get numbering conflicts between revisions
> committed on the mirror and the master. There are two ways to
> implement that: either an attempt to do any repo-modifying operation
> fails, or it gets translated into a request against the master, which
> will fail if the mirror is out-of-date. In the latter case, we want a
> way to force a WC-update to query the master, so that a developer
> working off a mirror is not stuck when it's out of date.
>

Err, a mirrored repository is read-only more or less by definition.
It's a mirror, so it isn't a location you can actually edit. :)
I.e., there is a single authoritative source of the data, plus multiple
read-only mirrors that exist just to support the typical anonymous
source code access that lots of open source projects offer. Let's not
worry about anything else for a mirror.
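
To make the incremental update Zack describes concrete, here is a
minimal Python sketch. Repo, deltas_since and sync_mirror are made-up
names for illustration, not Subversion APIs, and the deltas are plain
strings standing in for svndiffs.

```python
from dataclasses import dataclass, field


@dataclass
class Repo:
    """Toy repository (hypothetical model): one stored delta per revision."""
    deltas: list = field(default_factory=list)  # deltas[i] created revision i + 1

    @property
    def youngest(self) -> int:
        return len(self.deltas)

    def commit(self, delta: str) -> int:
        self.deltas.append(delta)
        return self.youngest

    def deltas_since(self, rev: int) -> list:
        """One delta per revision newer than `rev`, oldest first."""
        return self.deltas[rev:]


def sync_mirror(master: Repo, mirror: Repo) -> int:
    """Bring `mirror` up to date; return how many revisions were sent."""
    # The mirror reports its youngest revision; the master answers with
    # every newer revision individually, not one combined delta, so the
    # mirror keeps the full history.
    new = master.deltas_since(mirror.youngest)
    mirror.deltas.extend(new)
    return len(new)


master, mirror = Repo(), Repo()
for change in ("add README", "fix typo", "add Makefile"):
    master.commit(change)

first = sync_mirror(master, mirror)    # initial copy transfers revisions 1-3
master.commit("bump version")
second = sync_mirror(master, mirror)   # repeat transfers only revision 4
```

The mirror never commits on its own here, which is the read-only
property; the sketch does not model forwarding commits to the master.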
 
> Now suppose we want real disconnected operation. Let me start by
> talking about one way to work in the existing system. Anyone can
> create a branch at any time; so suppose that, as a matter of project
> policy, all developers are expected to have personal branches on which
> they do development. In fact, one person might have several personal
> branches, one per independent task they were hacking on. While
> they're hacking, they periodically commit changes to their branch;
> those changes are visible to all, but don't affect anyone who doesn't
> explicitly go get them. When any given chunk of work is complete, the
> developer merges from the trunk to their personal branch, resolves
> conflicts, re-tests, and merges the branch back to the trunk; at that
> point everyone gets the patch(es).
>
> Ignoring the global-revision-number issue for the moment, notice that
> nothing happening on a personal branch can affect revisions anywhere
> else in the tree (assuming that a merge pointer is recorded in the
> child, not the parent). Therefore, the integrity of the system is not
> compromised if commits to a private branch happen in a mirrored
> repository. Nor can it be compromised if the branch gets pushed back
> upstream to the master repository. The only time integrity can be
> damaged is if two direct children of the same parent node get created
> on different machines at the same time. (direct == not branch). So
> the semantics of "svn cp" when the target is not a subset of the
> source, can just be to augment the target with all the revisions that
> the source has and it doesn't, as long as the above invariant holds.
>
> For the moment, let's just assume that it always holds; I want to deal
> with the global revision number first. Obviously, all sorts of
> invariants break if a mirror's revision 1234 is not the same as its
> parent's 1234. (There's nothing stopping a mirror being copied from
> another mirror.) I see two different ways to deal with this problem.
> First, we could detect the situation and re-number one set of
> revisions -- presumably the mirror's. I suspect this is impossible
> without inventing a new revision-identifying-thing which would be
> globally unique, at which point why keep the revision numbers? Just
> use the identifying things everywhere.
>
> That is, in fact, option 2: change the format of a revision number so
> that it is globally unique even in the face of multiple editable
> mirrors of the same tree. This is merely difficult. The first thing
> most people will think of is to use a string based on the time, the
> computer's hostname, etc. This is a bad idea. Bitkeeper tries to do
> it, and they keep having to make their "changeset keys" longer as
> collisions occur -- which forces a repository format change,
> propagated everywhere!
>

You're right, using a computer's hostname is an insanely bad idea. There
is a much simpler solution, and it has a constant data size as well:
give each repository its own GUID and expand the appropriate keys as
necessary.
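
A rough Python sketch of what such an expanded key might look like.
RevisionKey and the 16 + 8 byte layout are assumptions made up for
illustration; nothing here reflects the actual Subversion schema.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class RevisionKey:
    """Hypothetical revision key: repository GUID plus local revision number."""
    repo_id: uuid.UUID  # repository where the revision was created
    number: int         # plain sequence number, local to that repository

    def pack(self) -> bytes:
        """Fixed-size encoding: 16 bytes of GUID, then 8 bytes of number."""
        return self.repo_id.bytes + self.number.to_bytes(8, "big")

    @classmethod
    def unpack(cls, raw: bytes) -> "RevisionKey":
        return cls(uuid.UUID(bytes=raw[:16]), int.from_bytes(raw[16:], "big"))


master_id = uuid.uuid4()
mirror_id = uuid.uuid4()

a = RevisionKey(master_id, 1234)
b = RevisionKey(mirror_id, 1234)  # same number, different repository
```

The two keys compare unequal even though both are "revision 1234", and
the packed form stays the same size no matter how many repositories
exist.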

> Instead, I suggest we keep the simple sequence number but make it be
> structured. Revisions created on the master still get a bare number,
> 1234. The first repository to mirror the master gets assigned mirror
> number 1. Revisions created there get a number looking like 1:23. If
> a mirror of a mirror is created, its number is appended to its
> parent's number: 1.4.3:12 is rev 12 created on the third mirror of the
> fourth mirror of the first mirror of the master repository. (The
> colon is to prevent people from confusing the rev number with the last
> component of the mirror number, or confusing the whole thing with one
> of RCS's branched version numbers.)
>
> This takes up extra space, but only proportional to the depth of the
> mirror tree; in normal circumstances that tree will be quite shallow.
>

Yeah, but it's still theoretically unbounded, and is the structure of
the replication tree really important for some other reason? I'd rather
just use a GUID as mentioned above and be done with it.
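
For comparison, a small Python sketch of Zack's structured numbers,
showing how the identifier grows with the depth of the mirror tree. The
function names are made up for illustration, and the syntax is only
what the examples "1234", "1:23" and "1.4.3:12" above suggest.

```python
def parse_structured_rev(text: str):
    """Split '1.4.3:12' into ((1, 4, 3), 12); a bare '1234' gives ((), 1234)."""
    path, sep, rev = text.rpartition(":")
    mirror_path = tuple(int(p) for p in path.split(".")) if sep else ()
    return mirror_path, int(rev)


def format_structured_rev(mirror_path, rev: int) -> str:
    """Inverse of parse_structured_rev."""
    if not mirror_path:
        return str(rev)  # revision created on the master: bare number
    return ".".join(str(n) for n in mirror_path) + ":" + str(rev)


shallow = format_structured_rev((1,), 12)                 # first mirror of master
deep = format_structured_rev((1, 4, 3), 12)               # mirror of mirror of mirror
deeper = format_structured_rev((1, 4, 3, 2, 7, 5), 12)    # six levels down
```

Each extra level of mirroring adds another component to the mirror
path, whereas a GUID-based key stays a fixed width.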

> Now that we have the version number sorted, let's go back to the
> problem of creating two children of the same node in different
> repositories. My proposed solution is the same as Bitkeeper's: when
> we discover this has happened, one of the children gets converted into
> a branch node. Generally it is better for the revision created higher
> up in the mirror tree to displace the one lower down. The developer
> is notified that this has happened, and can then merge the new branch
> back onto its parent.
>

Err, don't you mean that children from the repository you're merging
into are given more priority than children from other repositories?
(That's for upstream; reverse the sentence if you mean downstream.)
Don't you only have two candidate children for a given inter-repository
merge? I.e., the tree position is irrelevant; it just matters which
repository you're trying to merge into. The GUID mechanism exposes the
possibly graph-like nature of distributed development.
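
Here is a minimal Python sketch of that rule: the repository being
merged into keeps its own direct child, and a conflicting incoming
child is turned into a branch. Rev, starts_branch and merge_into are
hypothetical names for illustration, and the sketch ignores the
merge-pointer swapping Zack describes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rev:
    key: str                     # revision identifier, e.g. "5" or "1:3"
    parent: Optional[str]        # key of the parent revision, None for the root
    starts_branch: bool = False  # True if this is a branch off its parent


def merge_into(target: dict, incoming: list) -> list:
    """Add `incoming` revisions to `target`; return keys demoted to branches."""
    # parent key -> direct (non-branch) child already in the target repository
    direct_child = {r.parent: r.key for r in target.values()
                    if not r.starts_branch and r.parent is not None}
    demoted = []
    for rev in incoming:
        if rev.key in target:
            continue  # the target already has this revision
        if not rev.starts_branch and rev.parent in direct_child:
            # Two direct children of one parent: the target's child wins,
            # the incoming one becomes a branch node.
            rev = Rev(rev.key, rev.parent, starts_branch=True)
            demoted.append(rev.key)
        elif not rev.starts_branch:
            direct_child[rev.parent] = rev.key
        target[rev.key] = rev
    return demoted


# Master holds the chain 1 -- 2 -- 3 -- 4 -- 5.
master = {}
prev = None
for n in ("1", "2", "3", "4", "5"):
    master[n] = Rev(n, prev)
    prev = n

# The private repository branched at 2, then committed 1:3 as a child of 4.
mine = [Rev("1:1", "2", starts_branch=True), Rev("1:2", "1:1"), Rev("1:3", "4")]
demoted = merge_into(master, mine)
```

With Zack's example, only 1:3 is demoted, because the master already
has revision 5 as the direct child of 4.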

> This operation is easy to implement with Subversion. The only trick
> is what to call the branch. If the displaced node is itself the
> result of a merge, it should simply have its merge and parent pointers
> swapped. Let me clarify that with a sketch: suppose I have this
> branch structure
>
> 1 -- 2 -- 3 -- 4
>      |
>      1:1 -- 1:2
>
> Mirror 1 is my private repository. I've done some work in it;
> meanwhile other things have been applied to the master, and I've done
> a svn cp to pick them up.
>
> I merge 1:2 back onto the trunk within my private tree:
>
> 1 -- 2 -- 3 -- 4 -- 1:3
>      |            /
>      1:1 -- 1:2 --
>
> But meantime someone else has created revision 5 as a child of 4.
> When I attempt to cp my stuff back upstream, it gets rejected.
> Instead my repository is rearranged like this:
>
> 1 -- 2 -- 3 -- 4 -- 5
>      |           \
>      1:1 -- 1:2 -- 1:3
>
> I have to repeat the merge operation to take into account the new rev
> 5 before my push will succeed. Fortunately, Subversion remembers that
> 2-4 have already been merged and doesn't make me do it again.
>
> 1 -- 2 -- 3 -- 4 -- 5 -- 1:4
>      |           \     /
>      1:1 -- 1:2 -- 1:3
>
>
> But what happens if I've been sloppy and worked exclusively on the
> trunk?
>
> 1 -- 2 -- 1:1 -- 1:2
>
> In that case I'll get a reject when I attempt to do the initial svn cp
> to acquire work in the parent. In this case, Subversion needs to push
> everything I've done onto a brand new branch; it has no way of knowing
> what I want the branch to be called, so it should make me tell it.
>
> $ svn cp http://<parent> file:///<my-repo>
> svn: Parent conflict between revisions 3 and 1:1. Shift revision
> 1:1 onto a branch.
>
> I then create the branch, effectively turning the above tree into
>
> 1 -- 2
>      |
>      1:1 -- 1:2
>
> and repeat the pull, which succeeds, giving me
>
> 1 -- 2 -- 3 -- 4
>      |
>      1:1 -- 1:2
>
> just as I had in the first example.
>
> I'm not sure what the user interface to after-the-fact branch creation
> should be like.
>

Yeah, this is kind of funky. Do you know how bk does this? (in terms of
UI)

There's a similar (but reverse) problem with the "svn cp <local>
<remote>" operation when you're merging changes into another repository
if your local changes aren't on a branch. I think assuming a tree
structure for a repository replication graph is limiting. Just handling
the graph case really doesn't seem any harder from an implementation
angle.

> One final note: the cross-repo svn cp operation is a lot of tedious
> typing. Under normal conditions you'll be either pulling stuff down
> from a parent repo, or pushing stuff back up to the same parent. I
> suggest the commands "svn pull" and "svn push" as shorthand for these
> two operations.
>
> Thoughts?
>

I think distributed repository stuff overcomplicates the svn command
line interface. I'd rather see a repository browsing UI that let you
intuitively perform inter-repository actions. To make the command line
versions of "push" and "pull" simpler than the "svn cp" syntax you'd
probably want to add local friendly names of remote repositories, etc...
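
As a sketch of what local friendly names could look like, in Python:
an alias table is used to expand the proposed "push" and "pull" into
the longer cross-repository "svn cp" form. Everything here is
hypothetical: the alias table, the local path and the expand function
are invented for illustration, and the commands are only the ones
proposed in this thread.

```python
# Hypothetical alias table: friendly name -> repository URL.
REMOTES = {"parent": "http://a/proj"}

# Hypothetical location of the local (private) repository.
LOCAL_REPO = "file:///home/me/repo"


def expand(command: str, alias: str = "parent") -> list:
    """Expand the proposed 'push'/'pull' shorthand into an 'svn cp' argv."""
    url = REMOTES[alias]  # raises KeyError for an unknown friendly name
    if command == "pull":
        return ["svn", "cp", url, LOCAL_REPO]  # remote -> local
    if command == "push":
        return ["svn", "cp", LOCAL_REPO, url]  # local -> remote
    raise ValueError("unknown command: " + command)


pull_argv = expand("pull")
push_argv = expand("push", "parent")
```

Even in this toy form the user has to maintain the alias table
somewhere, which is part of the extra interface surface I'm worried
about.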

I honestly think the real work in the distributed space has to do with
the import/export format used to push the changesets back and forth, and
figuring out what set of data you want to pull/push. What do you do if
you just want to "svn push" a subset of the local changes you've made?
Do you let the user get away with only "svn push"ing local ChangeSetIDs
over? Or is "svn cp" the appropriate metaphor?

Which schema keys need the RepositoryID (aka GUID) added depends on
what kind of semantics you want to expose to the user.

Do you want to allow repositories to just back each other up?
Etc...

Defining the semantics you want to support and from there defining your
import/export format to achieve those semantics will tell you where in
the schema you need to add RepositoryIDs.

Bill

Received on Thu Mar 14 10:46:03 2002

This is an archived mail posted to the Subversion Dev mailing list.