Re: merge performance (was: Re: Distributed Subversion)

From: Stefan Sperling <stsp_at_elego.de>
Date: Fri, 12 Jun 2009 13:29:10 +0100

On Fri, Jun 12, 2009 at 01:03:46AM -0600, Nathan Nobbe wrote:
> On Wed, Jun 10, 2009 at 11:27 AM, Stefan Sperling <stsp_at_elego.de> wrote:
>
> > On Wed, Jun 10, 2009 at 10:35:29AM -0600, Nathan Nobbe wrote:
> > > merge semantics in dvcs systems are way better,
> >
> > How and why? Honest question.
>
>
> to be honest i dont fully understand how merging or tracking (history) works
> in dvcs.

The monotone book has a nice background on the versioning model,
as does the hgbook:

http://monotone.ca/docs/Concepts.html
http://hgbook.red-bean.com/read/behind-the-scenes.html

I don't know how far git deviates from the two, but I don't
expect it deviates too much from the basic ideas. It seems
to just optimise things in different directions than hg
and monotone do.

> > Please explain in detail, because I've
> > been trying to understand what they are doing right, and what
> > Subversion is doing wrong, and I don't think I've quite figured
> > it out yet.
>
>
> what i do know is git establishes a 'content tracking' approach. everything
> (used loosely) is marked w/ a sha1 hash. sha1 hashes are the guid(s), and
> this is where the power comes from.

Yes, see the links above to see what this really means.

> diff'ing is really, really fast. entire trees can be compared by doing a
> strcmp() on 2 40 character strings. if the strings are the same, the trees
> are the same (optimal case), if not traverse, where likely youll be able to
> discard large sections of sub-trees quickly as well.
> so, im not sure on this part, but im guessing this is part of the reason
> merging goes so fast.., assuming the first step in a merge process would be
> determining which objects need to be merged

It certainly allows for good speed in some situations.
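
To make the idea concrete, here is a minimal sketch of comparing two
trees by hash. It is not git's actual object format. The nested-dict
layout and the names blob_id, tree_id, node_id and changed_paths are
made up for illustration, and a real tool stores the ids instead of
recomputing them on every comparison.

```python
import hashlib

def blob_id(data: bytes) -> str:
    """Id of a file: a hash over its content."""
    return hashlib.sha1(b"blob\0" + data).hexdigest()

def tree_id(tree: dict) -> str:
    """Id of a directory: a hash over the names and ids of its entries."""
    h = hashlib.sha1(b"tree\0")
    for name in sorted(tree):
        h.update(name.encode() + b"\0" + node_id(tree[name]).encode() + b"\n")
    return h.hexdigest()

def node_id(node) -> str:
    # In this sketch a dict is a directory and bytes is a file.
    return tree_id(node) if isinstance(node, dict) else blob_id(node)

def changed_paths(a, b, prefix=""):
    """List the paths that differ, skipping subtrees whose ids match."""
    if node_id(a) == node_id(b):
        return []  # same id: nothing below this point can differ
    if not (isinstance(a, dict) and isinstance(b, dict)):
        return [prefix or "/"]  # a changed file, or a file/directory swap
    paths = []
    for name in sorted(set(a) | set(b)):
        path = f"{prefix}/{name}" if prefix else name
        if name not in a or name not in b:
            paths.append(path)  # entry added or removed
        else:
            paths.extend(changed_paths(a[name], b[name], path))
    return paths

old = {"src": {"a.c": b"int a;", "b.c": b"int b;"}, "doc": {"README": b"hello"}}
new = {"src": {"a.c": b"int a = 1;", "b.c": b"int b;"}, "doc": {"README": b"hello"}}
print(changed_paths(old, new))  # ['src/a.c']
```

The "doc" subtree is never descended into, because its id is the same
on both sides.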

> see, thats the thing.., if the tool can manage things for you, and you dont
> have to think about them, b/c you trust the tool; youre more likely to use
> it for what its good for in the first place! prime example, reintegration
> in svn1.5; i hate to say it, but ouch.. honestly, i dont even use the
> --reintegrate flag (regretting after reading more from you below <damn>)..
> svn seems to figure out what to do w/o it so far.., but the issue we've all
> been over on this list a dozen times or better now, is that you have to
> create a new branch, re-create the branch or or .. etc.

You need to understand the limitations of a tool you are using to be
able to use it well. Just not using --reintegrate because you don't
understand why it's necessary won't solve the problem.

E.g. take CVS. Once you know the limitations of CVS, and why they
exist, and how to avoid them, then you know how to use CVS effectively.
Before that, it bites you all the time. Well it did that to me, anyway.
Maybe your experience was different.

Same goes for Subversion.
And I would assume that the same goes for git and hg, too.

Let's look at some limitations:

E.g. it seems that in Mercurial, by design, it's not possible to
cherry-pick changesets from a branch.

When you merge in hg, you specify exactly one revision.
It's the revision *up to which* you would like to merge:

$ hg help merge
hg merge [-f] [[-r] REV]

This works because in hg every revision has one or two parents
(two for a merge), and can have any number of children.
This way, hg always knows which revision is the parent of the branch
you are working on. Every revision in the branch is a descendant
(a great-great-...-grandchild) of the revision of trunk you initially
branched from.

To sync up the branch with trunk, hg goes back to the common
ancestor revision, and applies all outstanding revisions
of trunk to the branch.

Merge tracking is dead-easy, because all you need to remember
is the revision (named by a hash) you last merged from trunk
into the branch. Nice and clean design. Merge-tracking problem solved!
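
Here is a rough sketch of that bookkeeping, not of hg's actual merge
algorithm. It assumes history is a map from each revision to its
parents, and that a merge revision records both of the revisions it
joined as parents. The revision names and the helpers ancestors and
outstanding are made up for illustration.

```python
def ancestors(parents, rev):
    """All revisions reachable from rev via parent links, including rev."""
    seen, stack = set(), [rev]
    while stack:
        r = stack.pop()
        if r not in seen:
            seen.add(r)
            stack.extend(parents.get(r, ()))
    return seen

def outstanding(parents, source_head, target_head):
    """Revisions in the source's history that the target does not have yet."""
    return ancestors(parents, source_head) - ancestors(parents, target_head)

# Hypothetical history: branch created from t2, trunk moved on to t4.
parents = {
    "t1": [], "t2": ["t1"], "t3": ["t2"], "t4": ["t3"],
    "b1": ["t2"], "b2": ["b1"],
}
first_sync = outstanding(parents, "t4", "b2")   # {'t3', 't4'}

# Record the merge as a revision with two parents, then trunk moves on.
parents["m1"] = ["b2", "t4"]
parents["t5"] = ["t4"]
second_sync = outstanding(parents, "t5", "m1")  # {'t5'}
```

Nothing besides the parent links is needed to work out what is still
outstanding, which is why merge tracking comes for free in this model.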

So here's where the limitation is: What if you want just one
particular revision from trunk, and not its ancestors?
Can't do it in this design.

So what people do is, they fix bugs in release branches (for every
one of their active release branches), and then merge the fix back
from one of those branches into trunk. Because only fixes are merged
from branches into the trunk, i.e. only changes you really want in
trunk, this works nicely.

Or, they use an hg extension called "transplant".
It's essentially hg diff -rN:M <source> | patch.
It also stores an extra file in the repository. This file maps the hash
each transplanted revision has in its source repository to the hash of
the revision committed to the branch in the local repository. This
creates an extra mapping in addition to the simple and straightforward
parent->child relationship, i.e. it tries to work around a limitation
of the basic design.

So the basic design is not flexible enough to accommodate all
types of merges people would like to do. So additional hacks around
the basic design are required to make them work. Hacks around design
are usually considered bad. Good design is considered better.
But people might not have problems with hacks as long as they work
well enough for them.

I don't know yet how git does cherry-picking, is it similar?

In Subversion, the basic design is an "infinite" series of stacked trees,
each of which stores the nodes of the tree which have changed from the
previous tree in the stack. Each such tree has a label, say rX (revision X).
All nodes untouched in rX are resolved by referring to earlier revisions
(rX-1, rX-2, ..., r1), which are further down the stack.
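
As a toy model of that stack (an illustration only, not the real
repository format, which as far as I understand it shares unchanged
nodes by reference instead of searching), each revision can be viewed
as a map holding only the paths it touched. The lookup function, the
paths and the contents below are made up.

```python
def lookup(revisions, path, rev):
    """Resolve path as of revision rev by walking down the stack."""
    for r in range(rev, 0, -1):
        changes = revisions.get(r, {})
        if path in changes:
            return changes[path]  # None marks a deletion in this sketch
    return None  # the path never existed up to rev

# Hypothetical repository: each revision stores only what it changed.
revisions = {
    1: {"trunk/a.c": "a v1", "trunk/b.c": "b v1"},
    2: {"trunk/a.c": "a v2"},
    3: {"trunk/b.c": None},
}

print(lookup(revisions, "trunk/a.c", 3))  # a v2 (resolved from r2)
print(lookup(revisions, "trunk/b.c", 2))  # b v1 (resolved from r1)
print(lookup(revisions, "trunk/b.c", 3))  # None (deleted in r3)
```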

This design allows you to easily merge a fix rX from trunk into the
branch. Each rX is just a set of changes to a tree, and you can try
to apply any revision to any subpart of the versioned tree (any branch),
or even inside of any subpart (svn doesn't care where branches start).

So it's trivial to try to cherry-pick a fix rX from the branch into trunk.
Just merge the change into the target path and see what happens.
No svn diff -rN:M | patch is necessary, nor is an extra file that
keeps track of merges you did outside of the usual parent->child flow
like in hg.

And here is the limitation:

You need one additional step to do this properly: a "record-only"
merge (which only modifies mergeinfo, not the actual tree) into the
branch of rY, the revision which applied the fix to trunk.
You need this to prevent the merge-tracking logic from trying to pick
up rY the next time you sync the branch with trunk.
The change now exists in two revisions: rX as applied to the
branch, and rY as applied to trunk as a merge of rX. Merge-tracking doesn't
know that they are essentially the same. It just stores "this path
already has rA-D, rM, rN, rJ, ...". If rY isn't in that list, Subversion
will try to merge rY. It does not know that, semantically, rX equals rY,
unless you tell it so by running the record-only merge.
So this is actually a limitation of the merge-tracking logic, rather
than the basic design.
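
A minimal sketch of that bookkeeping, assuming mergeinfo is nothing
more than a set of revision numbers per branch. The real property also
records source paths and revision ranges. The revision numbers and the
helpers eligible_revisions and record_only_merge are made up.

```python
def eligible_revisions(source_revs, mergeinfo):
    """Revisions of the source not yet recorded as merged into the target."""
    return sorted(set(source_revs) - set(mergeinfo))

def record_only_merge(mergeinfo, rev):
    """Mark rev as merged without applying any change to the tree."""
    return set(mergeinfo) | {rev}

# Hypothetical history: the fix was committed on the branch as r10 (rX)
# and cherry-picked into trunk, where it was committed as r12 (rY).
trunk_revs = [8, 11, 12]   # changes made on trunk since the branch point
branch_mergeinfo = {8}     # the branch already synced r8 from trunk

before = eligible_revisions(trunk_revs, branch_mergeinfo)
print(before)  # [11, 12]: r12 would apply the same fix a second time

branch_mergeinfo = record_only_merge(branch_mergeinfo, 12)
after = eligible_revisions(trunk_revs, branch_mergeinfo)
print(after)   # [11]
```

Nothing in the set says that r10 and r12 carry the same change. Only
the explicit record-only step removes r12 from the eligible list.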

And this same limitation is related to why --reintegrate is necessary.
See http://blogs.open.collab.net/svn/2008/07/subversion-merg.html
and http://svnbook.red-bean.com/en/1.5/svn.branchmerge.basicmerging.html

Now, it's clear that users don't want to care about such internals.
But with every complex tool, it's necessary to get a bit of an idea
about the internals so that you're not totally helpless when the tool
doesn't work as you expect. Be it svn, git, hg, cvs, a bicycle,
or whatever.

> correct me if im wrong, but a prereq for running a --reintegrate merge is
> that the branch is fully caught up to HEAD on the upstream branch? thats my
> impression based on an error message i got when using it early on.

Yes, see links above for more information.

> if thats
> the case, then this is yet another point where a dvcs tool would just do the
> merge and i wont have to think, and also, discouraging lazy ppl like me from
> using --reintegrate (tho i may find reason to change my ways).

And what if the merge result isn't right?

E.g. what if the change you merge wants to replace a directory
that you have also replaced since you branched? How do you detect
this situation and warn the user that it's happening? (This is a
tree conflict by the way.)

Try solving this problem in hg's design, keeping in mind that
directories are just side effects of the content (i.e. files)
being tracked.

And yes, this use case happens for some of our users, who rely on heavy
refactoring to keep their heads above water in giant code bases
that are a decade old and where everything has been moved around at
least once already.

I know it's a corner case most projects don't need, but if you
really need this, there aren't many tools which can do the job.
And Subversion is almost there. During updates we already detect
this; merge still requires some work.

> > Subversion's merge tracking allows you to merge from only a subset of
> > paths modified during a given revision. It then tracks which paths of
> > the commit you haven't merged into the target yet. Next time you merge
> > the entire revision into the target again, Subversion knows what has
> > already been merged and what still needs to be merged from that
> > revision.
>
>
> well honestly, i think git solves this w/ the decision to allow history
> rewriting. w/ git rebase -i
> you can rearrange commits, such that 1 commit is now two, or two commits
> one, or w/e you decide, and then you can merge just what you want by merging
> at the commit boundaries as normal, 6-a-dozen-half-the-other i suppose.

That's interesting.
Can you do that in public repositories, too? Would you need to?

> rebasing is another feature btw, that w/o, requires extra branching and
> merging in systems that dont allow history rewriting to emulate. which,
> when talking about effectiveness is a very relevant point.

Yeah, it also sounds like a nice solution to the vendor branching problem.

> > of merging at units of entire revisions only, then maybe that is
> > what Subversion should be doing, too. We could do away with subtree
> > mergeinfo, just have mergeinfo at branch roots and declare rX merged
> > into branch B as soon as someone merges _anything_ from rX into
> > branch B. If you want to merge more from rX, well, then undo the
> > previous merge of rX first, and then merge rX again. From conversations
> > I've had with Paul (our merge-tracking guru), it would seem that this
> > strategy would improve performance a lot. Would people like this?
>
> no, because there is no history rewriting so this would be the only way
> to effectively merge inside of the commit boundary.

I see.

> im lost on the tree conflict stuff i see when i run merges now since 1.5. i
> always end up resolving them.. i do understand that the dvcs systems infer
> changes to the folder structure, but im not sure on the implementation.

I'm under the impression that they just do a lazy mkdir -p for all
directories needed by the files they track. They don't care whether
directory X on the branch is, semantically, the same directory as
directory X on trunk.
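
If my impression is right, the directory structure is fully determined
by the file paths. Here is a small sketch of that assumption, not taken
from any of those tools. The helper implied_dirs and the paths are made
up.

```python
from pathlib import PurePosixPath

def implied_dirs(files):
    """Directories that must exist for the given file paths (like mkdir -p)."""
    dirs = set()
    for f in files:
        for parent in PurePosixPath(f).parents:
            if str(parent) != ".":  # skip the root of the working copy
                dirs.add(str(parent))
    return sorted(dirs)

trunk_files = ["src/lib/a.c", "src/main.c", "README"]
branch_files = ["src/lib/a.c", "src/main.c", "README"]

print(implied_dirs(trunk_files))  # ['src', 'src/lib']
# Same paths, same directories. Nothing records whether "src" on the
# branch was deleted and re-created since the branch point.
print(implied_dirs(trunk_files) == implied_dirs(branch_files))  # True
```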

> that said, i think im going to catch up on the tree conflict stuff; a quick
> google pointed me here
> http://blogs.open.collab.net/svn/2009/03/subversion-160-and-tree-conflicts.html
>
> *sigh* i doubt ill get to it tonight tho.., heh

This might be easier to start with:
http://svnbook.red-bean.com/nightly/en/svn.tour.treeconflicts.html

> its funny, its like some things are brain-dead
> simple in svn, some things in git.

Yeah the problem is trying to find an optimal trade-off.
Version control is a complex problem.

> now finally to my illustration of merge performance in centralized vs.
> distributed vcs, and this is a real world example. im at a private sector
> biz, and we have a healthy repo, running svn 1.5.5 (or something close [not
> 100% on the minor]), fsfs, on a heartbeat system - the works. so heres an
> experiment which will demonstrate how tightly coupled merge performance in
> svn is tied to the performance of your underlying network (whichever one you
> happen to be on at the moment). depending on which machine i run the merge
> on greatly impacts the time to run the merge. here are the results:
> . merge time - machine
> 1m42.729s - server on same colo as svn box
> 2m55.933s - on laptop at office over ethernet (solid bandwidth to colo..)
> 16m58.753s - same laptop at home over the vpn
>
> so you can see performance drops quickly and dramatically depending on your
> network. anyone wondering how long it takes git to run this merge (cloned
> repo via git-svn) in git?
>
> 0m3.736s - same laptop

Thanks for those numbers. We need to speed this up.

Which protocol is this?

Could you repeat this to show how e.g. svn:// compares to http://?
And how they compare to file://?
Both with and without encryption (SSH and SSL), if possible?
Can you even try to run the merge on the svn box itself,
using svn://localhost and http://localhost to see how much overhead
is left when there is no network I/O?
I'd like to know these numbers. If you can't or don't want to do it,
that's fine. I can make similar measurements when I find the time.

Batching requests the client is making to the server could help.
Paul already has a small note on this in
http://svn.collab.net/repos/svn/branches/subtree-mergeinfo/notes/subtree-mergeinfo/solutions-whiteboard.txt

III) New svn_ra_get_location_segments2 API that accepts multiple paths.

   This should alleviate the 'Implicit Mergeinfo Query Problem' somewhat
   by allowing us to query the server one time for all the subtree's
   implicit mergeinfo.

That's asking for an optimisation for a particular situation in which
performance is known to be exceptionally bad. But it should not be
impossible to speed up things even in the general case.
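
A back-of-the-envelope sketch of why batching matters, assuming the
cost is dominated by one network round trip per request. The function
query_time_ms and all numbers are made up for illustration. They are
not measurements.

```python
def query_time_ms(n_subtrees, rtt_ms, batched):
    """Time spent waiting on the network for the mergeinfo queries."""
    round_trips = 1 if batched else n_subtrees
    return round_trips * rtt_ms

# Hypothetical merge target with 200 subtrees that need a query each.
for name, rtt_ms in [("same colo", 1), ("office", 10), ("vpn", 50)]:
    one_by_one = query_time_ms(200, rtt_ms, batched=False)
    in_one_go = query_time_ms(200, rtt_ms, batched=True)
    print(f"{name}: {one_by_one} ms unbatched, {in_one_go} ms batched")
```

The unbatched cost grows with both the number of subtrees and the
latency of the link, while the batched cost depends on the latency only.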

> and btw, i did it just now, same location that took almost 17 minutes for
> svn. so, now you start to think about the premise of the thread that
> started this one.. dude is talking about repos across the world.. wow, like
> lets hope they have someone who lives nearby the server to do the merges,
> seriously.. and it goes further, to all those other commands, log, diff,
> branch, switch ..

Yup. And many people (e.g. me :) pay for their traffic by usage,
and not just a flat fee.

> there are dark sides to dvcs tools, and im learning them as i go,

If you learn more, please tell me. I'm interested to learn about
them, too!

Stefan
Received on 2009-06-12 14:30:17 CEST
