[merge tracking] [RFC] New merging algorithm

From: Peter Lundblad <plundblad_at_google.com>
Date: 2007-03-09 18:33:18 CET

Hi,

I've spent some time looking into the algorithm in
/branches/merge-tracking/subversion/libsvn_client/diff.c
that tries to do the mergeinfo-sensitive merging and I think I have a better
algorithm. Let me first recap my understanding of the current algorithm,
then discuss problems with it and, after that, I'll sketch my version
of the solution.

The current versions does something like this:
merge_tree(R, N, M):
- Get all revisions in N-M that are missing from any merge info in or
  beneath R. Call this set ALLR.
- For every consecutive range of ALLR:
  - Get the paths ands revs that are to be merged and that are relevant
    for this merge.
  - Merge the common ancestors of these paths.

So, we first merge parts of the tree that have differing merge info
from the root and then, for each range of the root that needs to be
merged, it asks the repository for a diff and merges it. This algorithm
is recursive, so if a subtree of a subtree with differing merge info
again has differing merge info, it will be merged separately.

this has some drawbacks:
- If a tree gets split into several merge ranges, and a path in
  that tree gets deleted in a subrange other than the first, we are doing
  a lot of extra work. Worse, if we get conflicts in an earlier range,
  the user will have to resolve that conflict (or pretend it is resolved if
  she knows that this file is going away later) just to make svn happy to
  delete the file/directory. That can be a bit annoying, I believe;)
- We will be asking the repository for a lot of diffs in different parts of
  the subtree. In some cases, these requests could be combined into a larger
  one. Currently, our merge performance sucks, it will suck after merge
  tracking, but let's keep performance in mind anyways.
- If a merge process is halted due to a conflict the tree can often be left
  in a very inconsistent state. As I've argued before, this will always
  have to happen, unless we merge revision for revision, but I think
  we could do better than this algorithm.

One simple approach I've come up with is something like this:
Let's assume we have a list of pending merge ranges for every path in
a tree. This information is derivable from the current mergeinfo and
simplifies the discussion.
- Get all deletions/replacements/additions in the whole tree with the moral
  equivalent of svn diff --summarize. For deletes, the reporter gives
  us the (first) revision where the path was deleted.
  If the deleted-rev revision is part of the set of revisions that need
  to be merged, then, don't merge anything at or below the deleted path.
  Just make sure the path gets deleted.
  Remember replacements/additions for later.
- Merge all paths/subranges in order of end revisions.
  This means that if we have a set of paths and range to be merged,
  pick one with the lowest end revision of the range.
  (This needs to be reversed for rollback/revert operations.)
  We can merge different
  parts of the subtree together even if they have different start revisions.
  And we make sure to exclude parts of the tree that are not to be
  merged this time. This is just a matter of making the right set_path
  calls to the reporter before asking for the diff.
  This step is repeated until we have nothing left to merge.
- For paths that were replaced/added (see the first step), just check them out
  in the corresponding location in the WC at the target revision.
  Note that since this path was deleted, the old merge info must become
  irrelevant. The whole range of revisions from where the path was added
  until the target revision must be merged, since it is the add we're
  merging, and no changes under it.

What this algorithm tries to do is to keep the revisions of different
parts of the subtree together as far as possible without cutting subranges
in pieces. The former is good because it makes the tree "less inconsistent".
The latter is good because it lessens the risk of spurious conflicts.
At the same time, it groups merges together in such a way that merges
towards the same end revision can be done in one request. This avoids
some unnecessary server roundtrips which are not entirely cheap.

Another problem with the current implementation is how parts of the
subtree is excluded. The client is pretending to have a specific path at the
target revision by a call to reporter->set_path(). We don't know if
the path actually exists in the target revision. This might work in
the current implementation, but it is still an undocumented hack.
A way around this is to link_path to "/" (the repository root) at the
target revision. This path is always guaranteed to exist.

OTOH, we have a cleaner way of excluding subtrees on the
sparse-directories branch. We can use svn_depth_exclude, which
wouldn't give us any changes related to that path or its subpath. Also,
svn_depth_empty would give us a way to combine separate subtrees into
a single merge, which would be a userful performance optimization in
my proposed merge algorithm.

This proposal certainly have details to be filled in. (I have an ugly
prototype implementation in Python that's too awful to publish
anywhere;).
I'd like feedback, though, before I look deeper into this.
Thoughts?

Best,
//Peter

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Fri Mar 9 18:33:45 2007

This message: [ Message body ]
Next message: Mark Phippard: "Re: [merge tracking] [RFC] New merging algorithm"
Previous message: Lieven Govaerts: "Re: Subversion 1.5"
Next in thread: Mark Phippard: "Re: [merge tracking] [RFC] New merging algorithm"
Reply: Mark Phippard: "Re: [merge tracking] [RFC] New merging algorithm"

Contemporary messages sorted: [ By Date ] [ By Thread ] [ By Subject ] [ By Author ] [ By messages with attachments ]