
Re: short question about merge [PROPOSAL] vs. tree-deltas

From: Tom Lord <lord_at_emf.net>
Date: 2003-04-18 01:15:44 CEST

    me:

>> The nested contents of MOD may be arbitrarily rearranged from ORIG.

>> The merge algorithm has to look at the two trees and figure out what
>> in ORIG corresponds to which in MOD. It has to know what has been
>> renamed, and what to compare to what.

> From: Greg Stein <gstein@lyra.org>

> That huge email is based on a premise that you don't
> explain. You throw out "this won't work", but without anything
> to back it up, and then go into a long list of "which means this
> and that, and can't do this, and so the code should do that."

Mostly I left out explaining why "this won't work" to keep the message
from being huger than huge.

But here you go.

We can already assume that for every path@rev we can quickly and
easily compute a node_id.copy_id.rev.

We can assume that for two such noderev ids, the predicate:

        X is_ancestor_of Y

is easy to compute. It's cheap to know where a given node_id.copy_id
branched from. The immediate ancestor of a noderev is easy to find.
I'll assume that immediate successor is easy to compute, though I
don't know that for a fact.
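
A minimal sketch of that predicate, assuming a toy model in which each
noderev carries a link to its immediate ancestor. The names NodeRev and
predecessor are hypothetical, chosen for illustration; this is not
Subversion's actual data structure or API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NodeRev:
    """Toy noderev id: node_id.copy_id.rev plus a link to its immediate ancestor."""
    node_id: str
    copy_id: str
    rev: int
    predecessor: Optional["NodeRev"] = None  # hypothetical field, assumed cheap to find


def is_ancestor_of(x: NodeRev, y: NodeRev) -> bool:
    """Return True if x is a strict ancestor of y, walking predecessor links back from y."""
    cur = y.predecessor
    while cur is not None:
        if cur == x:
            return True
        cur = cur.predecessor
    return False


# A node committed at rev 1, modified at rev 2, then copied (new copy_id) at rev 3.
a = NodeRev("n1", "0", 1)
b = NodeRev("n1", "0", 2, a)
c = NodeRev("n1", "c1", 3, b)
```

The walk is linear in the length of the ancestry chain; the text above
only assumes the predicate is cheap, not that it is computed this way.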

Let's also assume that a true rename is added.

Finally, let's assume that we can tweak various commands enough so
that for any given path@rev we can efficiently compute a complete
"import/copy/rename" history. So that the history for path-a@r_k
might read:

                renamed to path-a@r_q
                renamed to path-b@r_p
                copied to path-c@r_n
                copied to path-d@r_m
                imported to path-e@r_k

Alternatively or additionally, we could keep import/copy/rename
histories on directories. So that given a directory noderev, we
might be able to compute a history like:

                copied [path@u | noderev_id] to <name> @ rev Q
                copied <name> from [path@v | noderev2_id] @ rev P
                deleted <name> from [path@w | noderev3_id] @ rev N
                renamed [path@x | noderev4_id] to <name> @ rev M
                ...

Now I give you three directories: ORIG, MOD, TARGET -- with their
common ancestor, ANCESTOR. Under the constraints of v.a.p., we can
assume that

                ORIG is_ancestor_of MOD
                ANCESTOR is_ancestor_of ORIG
                ANCESTOR is_ancestor_of TARGET

So what follows will be the construction of an example that
demonstrates how none of the meta-data listed above helps you merge
reasonably.

Between ORIG and MOD, a change has been made to tree structure.
Let's say: "orig/src/doc" has been renamed "orig/doc".

First, just to ensure we're on the same page here, let's look at a
case that works. Let's assume that in TARGET, maybe some files in
src/doc have been modified, and src/doc has been renamed
src/documentation -- but that's about it. Using the meta-data above,
we'll have no trouble variance adjusting the rename, regardless of how
we characterize it. Maybe the changeset says, for example:

                --- ORIG
                *** MOD
                % rename src/doc@x to doc

and after we adjust it back to ANCESTOR it will say:

                --- ORIG
                *** MOD
                vvv ORIG,ANCESTOR
                % rename src/doc@v to doc

and after we adjust it forward to TARGET it will say:

                --- ORIG
                *** MOD
                vvv ORIG,ANCESTOR
                vvv ANCESTOR,TARGET
                % rename src/documentation@i to doc

and then we'll apply that, commit, and be happy.

Pretty much the idea you have in mind?
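
The easy case above can be sketched as path rewriting, assuming the
tree changes along each leg (ORIG back to ANCESTOR, ANCESTOR forward to
TARGET) are known as plain lists of directory renames. The helper names
adjust_path and invert are hypothetical, and revision suffixes such as
@x are left out to keep the sketch small.

```python
def adjust_path(path, renames):
    """Rewrite path through a sequence of (old, new) directory renames."""
    for old, new in renames:
        # Match the directory itself or anything beneath it, but not
        # siblings that merely share a prefix (src/doc vs. src/doc2).
        if path == old or path.startswith(old + "/"):
            path = new + path[len(old):]
    return path


def invert(renames):
    """Reverse a rename sequence so it maps paths in the other direction."""
    return [(new, old) for old, new in reversed(renames)]


# No structural changes between ANCESTOR and ORIG in this case.
ancestor_to_orig = []
# In TARGET, src/doc was renamed to src/documentation.
ancestor_to_target = [("src/doc", "src/documentation")]

# The ORIG..MOD changeset: rename src/doc to doc.
changeset = ("src/doc", "doc")

# Adjust the rename's source back to ANCESTOR, then forward to TARGET.
in_ancestor = adjust_path(changeset[0], invert(ancestor_to_orig))
in_target = adjust_path(in_ancestor, ancestor_to_target)
adjusted = (in_target, changeset[1])
```

Here adjusted comes out as a rename of src/documentation to doc, which
matches the last changeset shown above. The sketch only works because
each path has exactly one counterpart along each leg.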

Now, what else might reasonably have occurred between, say, TARGET and
ANCESTOR? and between ORIG and MOD? Since we're talking about
structureless, unrestricted use of copy and rename, how about this
scenario:

[I'll warn you in advance that your first reaction might be "What a
 contrived story, I don't care what merge does in that case!" -- but
 after the story I'll point out why that first reaction isn't such a
 good idea.]

The programmer working on MOD didn't just rename the doc directory.
He also made some changes to files in that directory.

The programmer working on the TARGET branch had some thoughts about
improving the documentation. He wanted to work on these without
messing up the current src/doc directory. So he copied src/doc to
src/doc2, planning to shrink back to just one doc directory before the
next scheduled merge from his branch to the mainline. He made a
bunch of changes in doc2.

The technical writer group who has primary responsibility for the
manuals got wind of his doc ideas and looked them over. They, in
fact, have a branch of src/doc that branched somewhere prior to
ANCESTOR. They like his ideas about doc improvements so they make a
new branch of their own doc directory and merge in his changes. They
then proceed to "go to town" on the docs, fleshing out his idea for
improvements.

A few days later, someone in the company is going to start working on
a Japanese translation of the docs. In the translation, some of the
technical sections containing formulas and code snippets require no
translation, but some of the text does. And anyway, this work is
going to happen concurrently with changes to the text. Basically,
the translator wants to branch from the English docs and just start
adding new files that contain the translations -- so that he can keep
the English files current, and compare them to the ongoing
translation. The doc group tells the translator to branch from the
src/doc2 directory in the TARGET branch -- since it's more stable than
their own branch, yet is expected to eventually merge their changes
back in.

Eventually, our TARGET programmer gets the word, merges in the doc
group's changes to his src/doc2, renames that to src/documentation,
deletes the now defunct src/doc[*], and adds a copy of the translation as
src/documentation-jp. (Re "[*]": you might be tempted to say "That's
a usage error!" but keep reading.) Actually -- screw that. That's
not quite what he does. Actually he deletes both "src/doc" and
"src/doc2" and simply _copies_ from the doc group's tree to make the
new "src/documentation". Why does he do that? Well, because he
anticipates further merges from the doc group's branch and this way, he
gets a good choice of ANCESTOR for those merges. And because it's
easier and is the first thing that occurs to him.

Of course, there's not a chance in heck here that v.a.p. is going to
figure out to rename src/documentation to doc or where to apply
the ORIG..MOD changes within src/doc --- and if you think there is
such a chance, then why doesn't it rename and modify
src/documentation-jp instead? Or should it modify both and delete
one? Or modify both and signal a rename conflict?
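
A rough sketch of why copy history alone leaves this ambiguous,
assuming a toy record of "copied from" edges that mirrors the story
above. The location labels (target:, docgroup:, translation:,
mainline:) and the helpers lineage and candidates are hypothetical,
and revisions are ignored entirely.

```python
# Each entry maps a location to the location it was copied or branched from.
copied_from = {
    "target:src/doc": "mainline:src/doc",
    "target:src/doc2": "target:src/doc",
    "docgroup:doc": "mainline:src/doc",
    "docgroup:doc-new": "docgroup:doc",
    "translation:doc": "target:src/doc2",
    "target:src/documentation": "docgroup:doc-new",
    "target:src/documentation-jp": "translation:doc",
}


def lineage(location):
    """Return the chain of copy sources for a location, nearest first."""
    chain = []
    while location in copied_from:
        location = copied_from[location]
        chain.append(location)
    return chain


def candidates(live_paths, root):
    """Return the live paths whose copy lineage leads back to root."""
    return [p for p in live_paths if root in lineage(p)]


# What is left in TARGET after src/doc and src/doc2 were deleted.
live = [
    "target:src/documentation",
    "target:src/documentation-jp",
    "target:src/main.c",
]
matches = candidates(live, "mainline:src/doc")
```

Both src/documentation and src/documentation-jp come back as
candidates, and nothing in the recorded edges says which one the
ORIG..MOD rename of src/doc was meant for.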

Now maybe you'll think, as I did initially -- "No, actually v.a.p. can
probably be made to do something really useful here! It can say `you
have both src/documentation and src/documentation-jp -- based on
meta-data, it looks like either of these might be the directory I'm
supposed to rename to ./doc and modify the contents of, and I can't
tell which. You have to figure out how to resolve this conflict.'"

But that's part of the point of making up the little narrative tale.
I can make up a completely different narrative tale, that boils down
to the same svn operations, but that has very different implications.

Instead of `src/doc', we can talk about `src/prog-a'. Instead of
`src/doc2', we can talk about `src/prog-a-revised'. Instead of a
Japanese translation, we can consider a new program being added to the
suite, `src/prog-b' -- but initially derived from and sharing history
with `prog-a'. Instead of ORIG..MOD renaming `src/doc' to `doc',
perhaps it renames `src/prog-a' to `src/a-prog'. Now in this
scenario, sure -- the initial merge might throw up its hands and say
"I can't tell what counts as `a-prog' here!" -- and programmers
grumble and fix the conflicts by hand. But what happens on the next
merge? And the next? The ambiguity is still there. The programmers
know perfectly well because they have to resolve it the same way
_every_ time. But there's _nearly_ no way to tell svn how to resolve
the ambiguity automatically, from now on.

Ok, there's _one_ way: but it creates new problems. If you haven't
already typed it out, I'd guess you're just itching to flame: "Hey,
the mistake here was that the guy just deleted src/doc. Instead of
deleting it, he should have just merged doc2 into it!".

Nope. The TARGET guy's "src/doc" and "src/doc2" have very different
ANCESTOR revisions with respect to the doc group's branch. They'll
get very different merges one way or the other -- with the better
merges coming from deleting "src/doc" and "src/doc2" and copying from
the doc group's branch.

There's no reasonable resolution here. There's no good expression of
"logical file identity" just in terms of rename and copy operations.
The divergence between "how to get a tree that looks like what I want"
and "how to get a tree that merge handles as I would expect" is deep.
We can make up lots of stories that demonstrate this.

"It's such a contrived example, I can't take it seriously!," I hear
you cry. And to be sure, whether or not these scenarios or ones
resembling them are realistic is sufficiently hard to prove that if
your _goal_ is just to make arguments, I'm sure any disagreement
along these lines can ultimately be reduced to "It's a matter of
speculation and opinion." But let me assume that your goal isn't
just to argue for argument's sake and, in that light, make a
non-conclusive pitch for my view (but one that I find convincing):

Users are perverse.

Users aren't going to form accurate models of the abstract data type
that describes svn's notion of history. They aren't going to learn
subtle patterns of using svn commands that avoid problems such as
mentioned above. They're going to use svn commands however it first
occurs to them until they get their trees to "look right", and then
they're going to commit. For whatever reason, not just my little
story, they'll delete, and copy, and rename on a random walk until
their wc looks like what they want. And then they'll commit. Given
just noderev ids and even copy/rename history --- merge has no chance
at all of sorting that out.

ID cookies, on the other hand -- those aren't so damn hard for even
users with imperfect grokking of svn to grasp. And if the ID cookies
get messed up somehow, that's ok: you can change them directly rather
than going through a lot of copy/rename/merge indirections to achieve
the same logical effect.
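
A minimal sketch of the contrast, assuming an ID cookie is a
user-visible, user-editable tag of logical identity stored with each
directory. The cookie values and the helper locate_by_cookie are
hypothetical; the message does not specify a cookie format.

```python
def locate_by_cookie(tree, cookie):
    """Return every path in tree (a dict of path -> ID cookie) tagged with cookie."""
    return sorted(path for path, tag in tree.items() if tag == cookie)


# ORIG: the directory that ORIG..MOD renames carries the cookie "docs-en".
orig = {"src/doc": "docs-en"}

# TARGET after all the deleting and copying: the cookies travelled with
# the content, and the translation was given its own cookie.
target = {
    "src/documentation": "docs-en",
    "src/documentation-jp": "docs-jp",
}

# Find what the rename of src/doc applies to in TARGET.
found = locate_by_cookie(target, orig["src/doc"])

# If the cookies got messed up, the fix is a direct edit, not a
# sequence of copy/rename/merge operations.
target["src/documentation-jp"] = "docs-japanese"
still_found = locate_by_cookie(target, orig["src/doc"])
```

The lookup never consults copy or rename history, so how the TARGET
tree came to look the way it does has no bearing on the result.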

ID cookies are a heck of a lot easier to implement, AFAICT, than
efficiently computable copy/rename histories. (By all means, keep
making it _possible_ to compute copy (and later, rename) histories --
but don't worry so much about making it as fast as merge would need.
Save the effort -- ID cookies solve the same problem in a simpler,
more general, and more easily user-controlled way.)

And, of course, ID cookies simplify alternative merging operations,
distributed branches, and yadda yadda yadda. You guys work too hard.

-t

Received on Fri Apr 18 01:05:07 2003

This is an archived mail posted to the Subversion Dev mailing list.
