Re: Two-phase merge

From: Julian Foad <julianfoad_at_btopenworld.com>
Date: Thu, 10 May 2012 12:50:48 +0100 (BST)

On 29 April, Stefan Fuhrmann wrote:
> Julian Foad wrote:
>> Stefan Fuhrmann wrote:
>>> [...] The proposal I make here is not
>>> an alternative to e.g. "Symmetric Merge" but rather a
>>> refined merge strategy that would also benefit the current
>>> merge.
[...]
>>> In end, we will always merge individual file or property
>>> changes but we plan the merge on tree-level. For many
>>> file nodes, this will unnecessary intermediate states
>>> where the merge got fragmented due to a change to
>>> some *other* file.
>>
>> Yes. That is likely to be a significant cause of extra conflict in cases
>> where the user has sub-tree mergeinfo or has done some cherry-pick merges
>> and is now doing a "merge everything else" merge.
>
> It also happens in the "back and forth" case:

If this problem does arise in basic back-and-forth merging then I would agree it's much more important than I thought,
but I don't get it. If "it"
refers to fragmenting the merge, then no, each merge in a back-and-forth case doesn't get fragmented.

> if the changes in the source are interspersed
> with merges from the target, i.e. the merges
> from A to B will be excluded from the changes
> to be merged from B to A.

Do you mean like this?

A2 A3 A4
A o-------o-------o-------o

/ \ /

o \ /

\ \ /

B o---o-------o-------o

B2 B3 B4

The merge we're considering is the one shown from B4 to A4. Changes B2 and B4 are changes in the source, interspersed with B3 which is a merge from the target, that is the result of a merge from A to B.

A2 will be chosen as the base for a 3-way merge, and the two arms of this merge will be (A2:B4) as the source side and (A2:A4) as the target side. The source side, (A2:B4), is equivalent to the combination of B2 and B4. It can be a bit tricky to understand why it is equivalent; if so, see <>. The point is that changes B2 and B4 don't get split up and merged one after the other in this scenario.

>>> This increases the probability of
>>> merge conflicts and their likelihood to show up again
>>> after each conflict resolution.
>> (That is: after each phase of the N revision-range phases that the merge
>> got broken into.)
>>
>>> Moreover, it masks cases
>>> where Symmetric Merge might select an optimized
>>> merge plan (see wiki for various examples).
>> I understand, but I don't know of any examples on the Wiki yet.
>
> E.g. in a merge from A to B, we can simply take the merge
> result of the last B to A merge if there were no later changes
> on either side and the B -> A merge did not cherry-pick
> (with respect to the node in question).

>>> I propose to use the following two-phase merge workflow:
>>>
>>> (1) Planning phase
>>> - read / evaluate the merge info and change history
>>> to find the list of revisions to merge
>>> - break that down for each node changed on the source side
>>> - optionally handle renaming etc.
>>> - mark tree conflicts (not sure when to signal them to the
>>> user and when to resolve them; high-level fragmentation
>>> might become necessary)
>>>
>>> (2) Execution phase
>>> - merge changes on a per-node basis, i.e. all changes of
>>> that node get merged before the next node
>>> - fragmentation may still be necessary for *that file*
>>> - given the "merge plan" for a node, alternative merge
>>> strategies may be chosen
>>
>> OK, the significant change here is not so much the two-phase (although that
>> has merit in itself) but rather turning the processing order inside-out from
>> (conceptually)
>>
>> REV_RANGES = [contiguous-revision-ranges needed for TREE]
>> for REV_RANGE in REV_RANGES:
>> 3-way merge REV_RANGE into the whole TREE
>>
>> to (conceptually)
>>
>> for each NODE in the TREE:
>> MERGES = [find the best sequence of 3-way merges to merge all needed
>> changes into NODE]
>> for MERGE in MERGES:
>> 3-way-merge MERGE into NODE
>
> Correctly. With the tiny addition that all rename
> and tree conflict handling is done in the planning
> phase. That makes it repeatable and potentially
> reduces the number of conflicts (at most one per
> node instead of one per node and merge step).

I get how it reduces conflicts. What do you mean by 'repeatable'?

> The second phase is also reduced to purely text-
> based processing because all the "structure"
> issues have already been resolved. But that might
> be more of a mental aid instead of a true reduction
> in complexity (we still might need to handle missing
> targets on disk etc.).

>> This is a good idea in itself. It certainly brings advantages as
>> mentioned above. I'm not sure what, if any, disadvantages it might have.
>
> Memory consumption could be proportional to
> the number of changes and renames or such.
> Should not be a real-world problem as long as
> you don't merge more than a few 100k changes.
>
> A more severe issue might be latency because
> the planning phase needs to gather / evaluate
> data from large sections of history. But that one
> can be improved ;)

Right.

>> I don't know that the scenarios this helps with are our main concern at
>> this point. It would be good to keep this approach in mind when writing
>> or designing any new merge code. In terms of development effort I think
>> it's more valuable to concentrate first on the advantages to to-and-fro
>> merging (most commonly without subtree merges and without cherry-picks)
>> that the 'symmetric merge' brings.
>
> I think it can become helpful as soon as we track renames because
> we need to remap on a per-node basis.

To track a renamed node, we need to be able to recognize and act on the information that each node in a merge will be found at three paths that may all be different: its path in the merge-base (aka source-left) tree, its path in the source (aka source-right) tree, and its path in the target tree. Is that what you mean by "remap on a per-node basis"? I think that's something we will need to do anyway, but I don't see that it makes much difference whether we do this once at the planning stage (for a two-phase merge), or do it separately for each sub-merge (for the split-into-revision-ranges method).

- Julian
Received on 2012-05-10 13:51:25 CEST

This message: [ Message body ]
Next message: C. Michael Pilato: "Re: svn commit: r1335639 - /subversion/trunk/subversion/libsvn_wc/adm_ops.c"
Previous message: Stefan Sperling: "Re: svn commit: r1336128 - in /subversion/trunk/subversion/bindings/javahl: native/ src/org/apache/subversion/javahl/"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]