Research on rename-aware merging -- or rather of ordered (XML) trees

From: Julian Foad <julianfoad_at_btopenworld.com>
Date: Wed, 12 Sep 2012 15:43:18 +0100 (BST)

Hi, merge fans!

I've started doing some research for how we can update 'merge' to fully support renames.

Surprisingly to me, I have so far found very little written about merging trees of files. If anyone knows how to find some, I'd be glad. On the other hand, I found several papers on 3-way merging XML data, which is somewhat analogous in that it is a tree structure of nodes that have both content and children.

This email is to summarize some useful points I have taken from a paper called "A 3-way Merging Algorithm for Synchronizing Ordered Trees -- the 3DM merging and differencing tool for XML" [1], which I'll refer to as "3DM" for short. This is all at the algorithmic level, far removed from practical implementation concerns.

* First *match* the three versions of each node (in the base, branch-1, and branch-2 trees); then *merge* the matched sets. That is, treat matching and merging as two distinct problems.

* Matching means deciding which node in the target tree corresponds to which node in the base tree, and which node in the source tree corresponds to which node in the base tree, and thus which source node corresponds to which target node. Matching is all about following moves and copies. Matching can be deterministic (using node-id information) or heuristic (using similarity detection). Matching is explicit: it builds a list of "edges" connecting these nodes, and that list is then used as an input to the merging stage. (That doesn't necessarily mean it has to be completed before the merging stage can begin.)

-- In Subversion's current merge code the matching is implicit (and
simple): it assumes that a relative path 'foo/bar' inside the root of
the branch-1 tree matches relative path 'foo/bar' inside the root of the
base tree and of the branch-2 tree and of the merged result tree.

* One interesting thing about matching is whether to match copies: if branch-1 copies 'foo' to 'bar', then do we match 'foo' in the base tree to both 'foo' and 'bar' in branch-1? That depends on whether we want to propagate the modification to both copies when a node is copied in branch 1 and modified in branch 2. The 3DM paper takes the view that one node should be declared the "original" and receive all changes, and any other nodes that look like copies of it should be flagged as 'partial matches' and receive *some* changes. I don't fully understand that scheme yet, but the idea seems to come from the assumption that the matching is heuristic, based on similarity, and a desire to propagate content changes and/or tree changes to all copies, but only under certain conditions.

-- In Subversion, we will likely prefer deterministic matching, perhaps with a fallback to similarity matching if we can't get node-id information from an old server. If we decide (and let's not be too hasty to make this decision) that we
should never merge changes into a copy but only into the "original"
node, then the matchings need only be 1:1. Otherwise, we will need to make our own plan for whether we match copies and, if so, to what degree we propagate changes to them. [5]

* In 3DM, an XML tree node is either an *element* or *text*:

-- n1: elem children={n2,n3,n5}
 Hello, -- n2: text
 <emph> -- n3: elem children={n4}
 my -- n4: text
 </emph> -- end tags are implicit, not nodes
 world! -- n5: text

An element node has a *content* (its XML attributes) and a *child list* (a list of element nodes and text nodes); a text node is a leaf node and has *content* only (the text).

-- In Subversion, by analogy, a directory has a child list and its content is its set of properties; files (and symlinks) are leaf nodes.

* In what order do we process the nodes, given that there are multiple trees whose nodes may be in different orders? Answer: Starting at the root node, merge one node's *content*, and (separately) its *child list*. Then recurse on each child of the resulting merged node. That is, a top-down, breadth-first traversal of the merged result tree.

* A child list in 3DM is an *ordered* list of non-unique nodes (they are not assumed to have unique ids), e.g.

<foo>
 hello
 goodbye
 hello
</foo>

and much of the design of 3DM, and its sets of merge primitives and kinds of conflict, are related to traacking and preserving node-ordering relationships.

-- In Subversion, a child list is an *unordered* list, and unique (by name) within a directory. Thus we will have a very different set of merge primitives and of conflicts. This difference is almost enough to make the XML merging problem seem irrelevant, but I am finding there is still enough common ground to be useful.

* The primitives of *change* in 3DM are:

Update (we say 'modify') -- a change of content
Insert (we say 'add')
Delete
Move
Copy

-- These make sense for Subversion too.

* 3DM describes a set of "functional requirements", while I'll paraphrase and comment on:

1. Easy to understand. -- No quibble there.

2a. A change on one side is always propagated to the result. -- Agreed.

2b. There is no restriction on the positions of the source and destination of a move or copy. (For example, a restriction could be that a copy is only recognized if the source and destination are in the same directory.) -- Agreed.

3. Changes are interpreted relative to their position in the tree, not applied at their absolute position in the tree. -- Agreed.

4. A node exists in the *context* of its siblings and parent; this context should be preserved in the result. -- The 'parent' part of the context pretty much follows from (3), while the 'siblings' part of the context is closely related to ordering and so may hardly be relevant in Subversion. Nevertheless, this idea bears further consideration.

5. In a copy-vs-modify situation, 3DM chooses that the modification should be applied to all the copies, yet states that this isn't always wanted. -- In Subversion, I don't think we want to do this; we might want it to be optional, controlled by a flag in a 'merge policy' or by some sort of callback.

6. Inserting a node at the same place on each branch: the result could be "insert both" or a conflict. -- The Subversion equivalent is insertions of the same name, which we would normally expect to be a conflict, but we might want the option to automatically resolve identical duplicate adds.

7. The merge is symmetric: branch 1 and branch 2 can be swapped and the result will be the same. -- In Subversion, yes, that's desirable too. [4]

8a. Update-vs-update is a conflict. -- Yes. (May want an option to automatically resolve identical duplicate updates.)

8b. Del-vs-move is a conflict. Del-vs-update is a conflict. Del-vs-copy is a conflict. -- The first two are obvious. The third one is more interesting, and something we haven't (to my knowledge) considered before. I agree it should be a conflict.

8c. Move-vs-move is a conflict. -- Agreed.

The same author wrote a later (and much shorter) paper, "A Three-way Merge for XML Documents" [2], in which a "more elegant variation of the algorithm discards support for copy operations, at the same time gaining enormously in clarity and
simplicity" [3]. I'll summarize it briefly.

That paper concentrates on merging (not on matching) trees of XML nodes, using as its central principle one of the main ideas from the first paper: the idea of preserving 'parent-child-successor' (PCS) contexts. If a node is changed, then the context of being a child of its parent node, and next to its left sibling (or start of list) and next to its right sibling (or end of list) must be preserved in the result tree. If a node has moved, then both the PCS context of the 'gap' it left in its source position (where it was in the base tree) and the new PCS contexts it creates (to its left and right) in the its target position must be preserved in the result tree. The merged result tree must contain all of the (p-c-s) contexts associated with changes in branch-1 and all those associated with branch-2 as well. If the context of a change in branch-1 is not compatible with the context a change in branch-2, there is a conflict.

I can't see any useful analogy of P-C-S contexts in Subversion, except for the 'Parent' part of it, because it's entirely geared towards preserving ordering in child lists. In an abstract way I can see how it could be useful to loosely associate a node with all its siblings, so that for example if branch-1 moves all the children of directory 'notes/' into directory 'doc/', and branch-2 adds one new child 'notes/foo', then a suggested result could be that 'foo' should be either added in 'doc/' or flagged as a conflict, because all of its sibling context has gone away. But that's rather vague and I'm confident that we don't need any such thing.

Feel free to query or comment on any of these ideas: I'd be glad to have some discussion.

- Julian

[1] Lindholm, T., "A 3-way Merging Algorithm for Synchronizing Ordered Trees -- the 3DM merging and differencing tool for XML." (2001). <http://www.cs.hut.fi/~ctl/3dm/thesis.pdf>

[2] Lindholm, T., "AThree-way Merge for XML Documents" (2004). <http://www.hiit.fi/files/fi/fc/papers/doceng04-pc.pdf>.

[3] Web page: 'The "3DM" XML 3-way Merging and Differencing Tool', <http://tdm.berlios.de/3dm/doc/index.html>.

[4] There may be some edge cases where the simplest implementation would
allow the order of the two branches to leak through. For example, if
we chose to resolve non-identical duplicate adds of a node 'foo' by
adding both nodes, giving one of them a new name, we might name the node
from branch-1 'foo-1' and the node from branch-2 'foo-2'. However, it
would be possible to avoid this leakage by comparing some other property
of the nodes (time stamp, content, ...) rather than which side of the
merge they came from, to decide which one is named '-1' and which '-2'.

[5] We may also be able to distinguish a *branch* from a *copy*, and if a file is branched rather than just copied (a sub-tree branched within
the main-tree branch -- ugh) then we need to decide what to do there as
well.

--
Subversion Live 2012 - 2 Days: training, networking, demos & more! http://bit.ly/NgUaRi
Certified & Supported Apache Subversion Downloads: http://www.wandisco.com/subversion/download

Received on 2012-09-12 16:43:57 CEST

This message: [ Message body ]
Next message: Stefan Sperling: "1.6.19 up for testing/signing"
Previous message: Vincent Lefevre: "Working copy corruption with svn 1.6.17 (r1128011) under Ubuntu"
Next in thread: C. Michael Pilato: "Re: Research on rename-aware merging -- or rather of ordered (XML) trees"
Reply: C. Michael Pilato: "Re: Research on rename-aware merging -- or rather of ordered (XML) trees"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]