Re: svn diff, svn merge, and vendor branches (long)

From: Tom Lord <lord_at_regexps.com>
Date: 2002-12-12 01:12:55 CET

The enclosed is not intended to be "light reading" -- it is densely
worded, in order to convey a lot of information in a small document.

Here is a high-level sketch of the design space of modern revision
control in just a few paragraphs. You can see where your system or
one you are evaluating fits into this design space. Implications are
spelled out, at least abstractly, for the problem of "repeated merging".

Programmers collectively maintain a forest of tree-structured graphs
of what I'll call "project trees" -- snapshots of source trees in
time. Each node in one of these graphs is a "revision". Child
nodes are "successor revisions" whose "immediate ancestor" is the
parent node.

Between any two nodes, we can describe a textual and structural diff:
a collection of patch sets and tree-rearrangements that describe how
the nodes differ. We can call such a diff a "changeset".

Each tree of revisions therefore has a natural dual graph: a graph
formed by replacing every node other than the root by the changeset
that describes how that node differs from its immediate ancestor.
This is the changeset graph. `mkpatch' and `dopatch' tools can
reliably convert between these two graphs.
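
As a rough illustration of that duality, here is a minimal sketch in
Python. It is not arch's actual `mkpatch' or `dopatch': the tree model
(a dict from path to file contents), the changeset layout, and the two
helper functions are all assumptions made for the example, and it
replaces whole files rather than recording textual diffs or
tree-rearrangements such as renames.

```python
# Hypothetical model: a project tree is a dict mapping path -> file contents.

def mkpatch(ancestor, successor):
    """Compute a changeset that turns `ancestor` into `successor`."""
    return {
        "removed": [p for p in ancestor if p not in successor],
        "added": {p: successor[p] for p in successor if p not in ancestor},
        "modified": {
            p: successor[p]
            for p in ancestor
            if p in successor and ancestor[p] != successor[p]
        },
    }


def dopatch(tree, changeset):
    """Apply `changeset` to `tree` and return the resulting new tree."""
    result = dict(tree)
    for path in changeset["removed"]:
        result.pop(path, None)
    result.update(changeset["added"])
    result.update(changeset["modified"])
    return result


def to_changesets(revisions):
    """Revision line -> changeset line (one changeset per non-root node)."""
    return [mkpatch(a, b) for a, b in zip(revisions, revisions[1:])]


def to_revisions(root, changesets):
    """Changeset line -> revision line, starting from the root tree."""
    revisions = [root]
    for changeset in changesets:
        revisions.append(dopatch(revisions[-1], changeset))
    return revisions
```

Given a root tree and the changesets, `to_revisions` rebuilds every
revision; given the revisions, `to_changesets` rebuilds every
changeset. Only a single line of development is shown, but the same
idea extends to each parent/child edge of a tree of revisions.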

The changeset graph has significance to human programmers. When
programmers create new revisions, often they are authoring not merely
a new revision of the project tree -- but, explicitly, they are
authoring the corresponding changeset. It is the changeset, not the
tree, that they'll post to the gcc-patches mailing list, for example.
It is the changeset, not the tree, that other programmers will apply
to a different base revision. It is the changeset, not the tree, that
will be reviewed -- though a complete review must also include an
observation of how the changeset interacts with the immediate ancestor
revision or whatever revision it is applied to. Good programmers
often have cause to think explicitly about the changesets they are

The fundamental roles of revision control are much like those of a
good reference library. Revision control has the task of:

        A) Archiving the changeset graph.

        B) Cataloging the changeset graph.

        C) Providing access to the changeset graph.

        D) Helping programmers to use the changeset graph.

Any system which performs those tasks has, as a consequence, provided
access to the revision graph. Any system which archives the revision
graph, rather than its dual, has accomplished (A) if and only if it
has formalized and preserved a corresponding `mkpatch' and `dopatch',
at each point in time.

It is important, especially from the point of view of (A), the
archivist, that the changeset format be well defined and documented,
and bound in some permanent way to the repository. It isn't enough to
be able to simply recreate each revision's project tree: to do a
complete job, one must also be permanently prepared to produce the
precise changeset that the author of that revision intentionally
created, in the context of its corresponding, historic,
changeset-application semantics. The technology of changesets may,
conceivably, slowly improve, over time -- nevertheless, each historic
revision represents not only an archived project tree, but often, a
deliberate changeset in a particular form. That changeset information
should be preserved.

{*}: The challenge of (D), making use of the changeset graph, is quite
open ended and dependent on the changeset capabilities. For example,
there is no single operation on changesets that corresponds to the
socially maintained concept of "merging" two lines of development.
Rather, there are quite a large number of ways in which changesets may
be manipulated to accomplish an effect that can reasonably be called
merging. In practice, many different kinds of merging are used and

Now, what does all this imply for revision control designers and

        a) They must have a coherent idea of what a changeset is,
           and what it might be in the future. In choosing their
           definition, designers should take a _long_ term point of
           view ((A) and (C)).

        b) They have a very wide selection of storage management,
           indexing, and caching options available for the task
           of archiving the graphs ((A)).

        c) They should be designing systems in which the changeset
           intentionally created by programmers for a specific
           revision is reliably producible, indefinitely. ((C)).

        d) They should be designing systems in which changeset
           manipulation is handled by an open-ended, easily extensible
           framework ((D) and {*}).

I'll tell you, speaking very informally, arch's simple solution:

        1) A global namespace is created for nodes in the revision graphs,
           or equivalently, their duals in the changeset graphs. This
           namespace is essential to maximum flexibility in tools
           which manipulate changesets: it allows them to refer to any
           and all changesets, regardless of how or where they are stored.

        2) The repository-of-reference, the authoritative record of
           history, is quite simply a write-once collection of
           changesets. Each changeset is assigned a global name
           that maps to a filename on a particular filesystem; the
           files are written once and are stable thereafter.

        3) Access to the revision and changeset graphs may be
           optimized by the creation of indexes and caches. The
           mechanisms for this are open-ended. The initial selection
           is very low-tech, yet quite effective when used properly.

        4) The framework for changeset manipulation is made as
           orthogonal as practical to the framework for storage
           management. An abstraction barrier is maintained between
           merging tools and storage management.
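
Points (1), (2) and (4) can be sketched in a few lines of Python. This
is an illustration under stated assumptions, not arch's on-disk
format: the class `WriteOnceArchive`, the JSON encoding, and the way a
global name is turned into a filename are all invented for the
example.

```python
import json
import os
from urllib.parse import quote


class WriteOnceArchive:
    """Hypothetical write-once store of changesets keyed by global name."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, global_name):
        # Assumed mapping from a global name to a filename on this
        # filesystem: percent-encode every character that is unsafe.
        return os.path.join(self.root, quote(global_name, safe="") + ".json")

    def store(self, global_name, changeset):
        # Mode "x" creates the file and raises FileExistsError if it
        # already exists, so a stored changeset is never overwritten.
        with open(self._path(global_name), "x", encoding="utf-8") as f:
            json.dump(changeset, f, sort_keys=True)

    def fetch(self, global_name):
        with open(self._path(global_name), encoding="utf-8") as f:
            return json.load(f)
```

Tools that manipulate changesets would see only `store` and `fetch`
with global names; where and how the files live stays behind that
barrier, so indexes and caches can be added without changing them.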

How does all this relate to the problem of "repeated merging"?

        i) A history of merges is best kept in a form that
           refers to a global namespace of changesets.

        ii) The form in which that history is archived should be
           independent of storage management, in order to preserve
           property (4) above, which is essential for tasks (b-d)
           above. For example, there is no obvious reason why one
           would want to keep this information (only or even
           primarily) in repository-specific "file properties" (nor is
           it obvious whether or not the information in history is
           reasonably indexed by per-file properties).

        iii) Before approaching the problem of repeated merging, it is
           a good idea to come to grips with task (a) above, and with
           the implications of the paragraph marked {*} (because, at
           least, of (A) and (D)). It violates (C), (D), (b), and (d)
           to try to solve this problem in a repository or
           repository-implementation specific way. It violates {*}
           and imposes arbitrary project management policy to solve
           only a narrow range of the "repeated merge" problem and
           assert that that is the whole solution.
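
To make (i) and (ii) concrete, here is a small Python sketch of a
merge history kept purely as a set of global changeset names, with no
reference to how or where the changesets are stored. The function
names and the example changeset names are hypothetical, and per {*}
this covers only one narrow kind of merging (apply whatever the
target has not yet seen); it is not offered as the whole solution.

```python
def missing_changesets(source_history, target_merged):
    """Names in the source line not yet merged into the target, in order."""
    return [name for name in source_history if name not in target_merged]


def record_merge(target_merged, applied):
    """Return the target's merge history extended with the applied names."""
    return set(target_merged) | set(applied)
```

A first merge from a vendor line would apply everything returned by
`missing_changesets` and record it; a later, repeated merge would then
be offered only the changesets added to the vendor line since.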

"How can we tack some kind of repeated merging support onto svn?" is,
I think, a question that ignores important design context.

"or something like that",

Received on Thu Dec 12 01:02:08 2002

This is an archived mail posted to the Subversion Dev mailing list.