Re: merge performance (was: Re: Distributed Subversion)

From: Nathan Nobbe <quickshiftin_at_gmail.com>
Date: Mon, 15 Jun 2009 12:54:47 -0600

On Fri, Jun 12, 2009 at 6:29 AM, Stefan Sperling <stsp_at_elego.de> wrote:

> On Fri, Jun 12, 2009 at 01:03:46AM -0600, Nathan Nobbe wrote:
> > see, thats the thing.., if the tool can manage things for you, and you dont
> > have to think about them, b/c you trust the tool; youre more likely to use
> > it for what its good for in the first place! prime example, reintegration
> > in svn1.5; i hate to say it, but ouch.. honestly, i dont even use the
> > --reintegrate flag (regretting after reading more from you below <damn>)..
> > svn seems to figure out what to do w/o it so far.., but the issue we've all
> > been over on this list a dozen times or better now, is that you have to
> > create a new branch, re-create the branch or .. etc.
>
> You need to understand the limitations of a tool you are using to be
> able to use it well. Just not using --reintegrate because you don't
> understand why it's necessary won't solve the problem.
>
> E.g. take CVS. Once you know the limitations of CVS, and why they
> exist, and how to avoid them, then you know how to use CVS effectively.
> Before that, it bites you all the time. Well it did that to me, anyway.
> Maybe your experience was different.
>
> Same goes for Subversion.
> And I would assume that the same goes for git and hg, too.

no doubt. and i felt pretty dialed in w/ svn-1.4; now w/ 1.5+ it appears
ill have to learn some more in order to avoid potential pitfalls. and def
this goes for git/hg - more on this at the end of the post..

> Let's look at some limitations:
>
> E.g. it seems that in Mercurial, by design, it's not possible to
> cherry-pick changesets from a branch.
>
> When you merge in hg, you specify exactly one revision.
> It's the revision *up to which* you would like to merge:
>
> $ hg help merge
> hg merge [-f] [[-r] REV]
>
> Because in hg, every revision can either have one or two children,
> and one parent.
> This way, hg always knows which revision is the parent of the branch
> you are working on. Every revision in the branch is a grand-grand-
> grand-...-grand child of the revision of trunk you initially branched.
>
> To sync up the branch with trunk, hg goes back to the common
> grand-parent revision, and applies all outstanding revisions
> of trunk to the branch.
>
> Merge tracking is dead-easy, because all you need to remember
> is the revision (named by a hash) you last merged from trunk
> into the branch. Nice and clean design. Merge-tracking problem solved!
>
> So here's where the limitation is: What if you want just one
> particular revision from trunk, and not its grand-parents?
> Can't do it in this design.
>
> So what people do is, they fix bugs in release branches (for every one
> of their active release branches), and then merge the fix back
> from one of those branches into trunk. Because only fixes are merged
> from branches into the trunk, i.e. only changes you really want in
> trunk, this works nicely.
>
> Or, they use a hg extension called "transplant",
> It's essentially hg diff -rN:M <source> | patch.
> It will also store an extra file in the repository, mapping hashes
> of transplanted revisions as they exist in the source repository
> of the transplanted change to hashes of revisions committed to the
> branch in the local repository. This creates an extra mapping in addition
> to the simple and straight-forward parent->{child1 [, child2]} relationship,
> i.e. it tries to work around a limitation of the basic design.
>
> So the basic design is not flexible enough to accommodate for all
> types of merges people would like to do. So additional hacks around
> the basic design are required to make them work. Hacks around design
> are usually considered bad. Good design is considered better.
> But people might not have problems with hacks as long as they work
> well enough for them.
>
> I don't know yet how git does cherry-picking, is it similar?

i honestly have done cherry-picking like once when i first started using
git; under the guidance of my buddy who was teaching me. ill have to bone
up on it in order to see how cherry-picking works in git, but ill probly get
round to it in the next week or two.
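
a rough sketch of the hg merge model quoted above, as a toy revision graph in python. the revision names and the graph are made up, and it assumes a merge is recorded as a revision with two parents; it is not how hg is implemented, just the idea that "what is outstanding" falls out of the ancestry, which is also why a single revision cant be taken without its ancestors.

```python
# toy revision graph: each revision maps to its parent revisions.
# names like "t1" (trunk) and "b1" (branch) are invented for illustration.
PARENTS = {
    "t1": [],
    "t2": ["t1"],
    "t3": ["t2"],   # trunk moves on after the branch point
    "t4": ["t3"],
    "b1": ["t2"],   # branch forked from t2
    "b2": ["b1"],
}

def ancestors(rev, parents):
    """Return rev plus everything reachable by following parent links."""
    seen, stack = set(), [rev]
    while stack:
        r = stack.pop()
        if r not in seen:
            seen.add(r)
            stack.extend(parents[r])
    return seen

def outstanding(target, source, parents):
    """Revisions that merging `source` into `target` has to bring in."""
    return ancestors(source, parents) - ancestors(target, parents)

# syncing the branch with trunk pulls in everything since the common ancestor
print(sorted(outstanding("b2", "t4", PARENTS)))

# recording the merge as a revision with two parents is all the tracking needed
PARENTS["m1"] = ["b2", "t4"]
print(sorted(outstanding("m1", "t4", PARENTS)))
```

the set returned by outstanding() is always closed under ancestry, so asking for t4 alone is not expressible in this model; that is the cherry-pick limitation described in the quote.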

> In Subversion, the basic design is an "infinite" series of stacked trees,
> each of which store nodes of the tree which have changed from the previous
> tree in the stack. Each such tree has a label, say rX (revision X).
> All nodes untouched in rX are resolved by referring to earlier revisions
> (rX-1, rX-2, ..., r1), which are further down the stack.
>
> This design allows you to easily merge a fix rX from trunk into the
> branch. Each rX is just a set of changes to a tree, and you can try
> to apply any revision to any subpart of the versioned tree (any branch),
> or even inside of any subpart (svn doesn't care where branches start).
>
> So it's trivial to try to cherry-pick a fix rY from branch into trunk.
> Just merge the change into the target path and see what happens.
> No svn diff -rN:M | patch is necessary, nor is an extra file that
> keeps track of merges you did outside of the usual parent->child flow
> like in hg.
>
> And here is the limitation:
>
> You need one additional step to do this properly, which is a "record-only"
> merge (which only modifies mergeinfo, not the actual tree) of rY to the
> branch of the revision which applied the fix to trunk.
> You need this to prevent the merge-tracking logic from trying to pick
> up rY next time you sync the branch with trunk.
> Because the change now exists in two revisions, rX as applied to the
> branch, and rY as applied to trunk as a merge of rX. Merge-tracking doesn't
> know that they are essentially the same. It just stores "this path
> already has rA-D, rM, rN, rJ, ...". If rY isn't in that list, Subversion
> will try to merge rY. It does not know that, semantically, rX equals rY,
> unless you tell it so by running the record-only merge.
> So this is actually a limitation of the merge-tracking logic, rather
> than the basic design.
>
> And this same limitation is related to why --reintegrate is necessary.
> See http://blogs.open.collab.net/svn/2008/07/subversion-merg.html
> and http://svnbook.red-bean.com/en/1.5/svn.branchmerge.basicmerging.html
>
> Now, it's clear that users don't want to care about such internals.
> But with every complex tool, it's necessary to get a bit of an idea
> about the internals so that you're not totally helpless when the tool
> doesn't work as you expect. Be it svn, git, hg, cvs, a bicycle,
> or whatever.

totally agree - you may not have to know your vcs at the code level,
but its nice to have a clue whats going on under the hood, heh.
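
a minimal sketch of the record-only step described in the quote above, with mergeinfo modelled as a plain set of revision numbers. the revision numbers are hypothetical and real mergeinfo is tracked per path with ranges, so this only shows the bookkeeping idea: merge tracking compares revision numbers and has no notion that rX and rY carry the same change.

```python
def eligible(source_revs, mergeinfo):
    """Revisions on the source that merge tracking still considers unmerged."""
    return [r for r in source_revs if r not in mergeinfo]

# hypothetical history: the fix was committed on the branch as r10 (rX),
# then cherry-picked into trunk, which created r12 (rY) on trunk.
trunk_revs = [11, 12, 13]   # revisions committed on trunk since branching
branch_mergeinfo = set()    # trunk revisions the branch has recorded as merged

# without the record-only merge, the next sync would try to apply r12 again
print(eligible(trunk_revs, branch_mergeinfo))

# the record-only merge changes mergeinfo only, not the tree
branch_mergeinfo.add(12)
print(eligible(trunk_revs, branch_mergeinfo))
```

in svn itself the second step corresponds to running the merge with the --record-only option on the branch working copy and committing the resulting mergeinfo change.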

> > correct me if im wrong, but a prereq for running a --reintegrate merge is
> > that the branch is fully caught up to HEAD on the upstream branch? thats my
> > impression based on an error message i got when using it early on.
>
> Yes, see links above for more information.

OK, ill just have to stop being lazy and do it right hence forth.

> > if thats the case, then this is yet another point where a dvcs tool would
> > just do the merge and i wont have to think, and also, discouraging lazy ppl
> > like me from using --reintegrate (tho i may find reason to change my ways).
>
> And what if the merge result isn't right?
>
> E.g. what if the change you merge wants to replace a directory
> that you have also replaced since you branched?

see below.

> How do you detect
> this situation and warn the user that it's happening? (This is a
> tree conflict by the way.)

well, it looks like git is cognizant of tree conflicts (at least in some
cases). i tested deleting a file in branchA and making a text change in
branchB, here are the results in git when trying to merge

CONFLICT (delete/modify): testFile.txt deleted in HEAD and modified in
commit a text change. Version commit a text change of testFile.txt left in
tree.

in the case above (completely replace subdir) im perhaps not running the
test correctly... so i start out w/ a directory structure in trunk as such

/testDir/innerTestDir/testFile.txt

then i create a branch, branchA
in branchA, i refactor the subdirectory testDir, and commit such that at
that point the structure is as follows,

/testDir/newInnerTestDir/newFile.txt

and then similarly in trunk(master in git..), such that its structure is

/testDir/randomInnerDir/randomNew.txt

so w/o having to first merge trunk to branchA, i checkout trunk and run a
merge from branchA. the result is the removal of testDir/innerTestDir, and
now both newInnerTestDir and randomInnerDir are there in testDir. this
seems reasonable to me, and additionally, ive been spared a merge (trunk ->
branchA) that id have had to run in svn. i am however sort of seeing how
you might want to be notified of this situation and act accordingly .. but
im not sure if ive structured this test correctly; is this something that
would tree conflict in svn?
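
one way to read the result of the directory test above is as a three-way merge over file paths only, with directories being a side effect of the files in them. the sketch below assumes exactly that and nothing more: no content merging, no rename detection, no delete/modify handling. the function name is made up for illustration.

```python
def merge_paths(base, ours, theirs):
    """Three-way merge of path sets; file contents are not modelled."""
    deleted = (base - ours) | (base - theirs)   # removed on either side
    added = (ours - base) | (theirs - base)     # created on either side
    return (base - deleted) | added

base = {"testDir/innerTestDir/testFile.txt"}      # state when branchA was created
branch_a = {"testDir/newInnerTestDir/newFile.txt"}
trunk = {"testDir/randomInnerDir/randomNew.txt"}

for path in sorted(merge_paths(base, trunk, branch_a)):
    print(path)
```

both sides deleted the same file and added different ones, so at the path level nothing collides. a tool that tracks directories as objects in their own right could instead see "both sides replaced innerTestDir" and flag it.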

> Try solving this problem in hg's design, keeping in mind that
> directories are just side effects of the content (i.e. files)
> being tracked.
>
> And yes, this use case happens for some of our users using heavy
> refactoring to keep their heads over the water in giant code bases
> that are a decade old and everything has been moved around at least
> once already.
>
> I know it's a corner case most projects don't need, but if you
> really need this, there aren't many tools which can do the job.
> And Subversion is almost there. During updates we already detect
> this, merge still requires some work.
>
> > > Subversion's merge tracking allows you to merge from only a subset of
> > > paths modified during a given revision. It then tracks which paths of
> > > the commit you haven't merged into the target yet. Next time you merge
> > > the entire revision into the target again, Subversion knows what has
> > > already been merged and what still needs to be merged from that
> > > revision.
> >
> >
> > well honestly, i think git solves this w/ the decision to allow history
> > rewriting. w/ git rebase -i
> > you can rearrange commits, such that 1 commit is now two, or two commits
> > one, or w/e you decide, and then you can merge just what you want by merging
> > at the commit boundaries as normal, 6-a-dozen-half-the-other i suppose.
>
> That's interesting.
> Can you do that in public repositories, too? Would you need to?

my understanding is that this is typically done when people are 'cleaning
up' a tree's revision history. since git lets you alter history, it follows
that before people share, and publicize, they might want to clean things up
a bit, like remove commits that are silly, by incorporating them into a common
commit where it seems like the changes are more-or-less atomic. as far as
w/ public repos, i suspect the flow would be to clone or pull from the
public repo, rearrange things, then push back - but ive def not tested
anything like that - still getting my feet wet w/ the basics, heh.

> > now finally to my illustration of merge performance in centralized vs.
> > distributed vcs, and this is a real world example. im at a private sector
> > biz, and we have a healthy repo, running svn 1.5.5 (or something close [not
> > 100% on the minor]), fsfs, on a heartbeat system - the works. so heres an
> > experiment which will demonstrate how tightly merge performance in svn is
> > tied to the performance of your underlying network (whichever one you
> > happen to be on at the moment). which machine i run the merge on greatly
> > impacts the time to run the merge. here are the results:
> >
> > merge time - machine
> > 1m42.729s - server on same colo as svn box
> > 2m55.933s - on laptop at office over ethernet (solid bandwidth to colo..)
> > 16m58.753s - same laptop at home over the vpn
> >
> > so you can see performance drops quickly and dramatically depending on your
> > network. anyone wondering how long it takes git to run this merge (cloned
> > repo via git-svn) in git?
> >
> > 0m3.736s - same laptop
>
> Thanks for those numbers. We need to speed this up.
>
> Which protocol is this?

https
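
to put the timings quoted above in proportion, here is a small sketch that converts the `time`-style stamps to seconds and expresses each svn run as a multiple of the git run. the labels are shortened from the list above and the helper name is made up; the numbers are the ones from the post.

```python
import re

def seconds(stamp):
    """Convert a `time`-style stamp such as '1m42.729s' to seconds."""
    minutes, secs = re.fullmatch(r"(\d+)m([\d.]+)s", stamp).groups()
    return int(minutes) * 60 + float(secs)

svn_runs = {
    "server in same colo": "1m42.729s",
    "laptop, office ethernet": "2m55.933s",
    "laptop, home vpn": "16m58.753s",
}
git_run = seconds("0m3.736s")   # local merge in the git-svn clone

for where, stamp in svn_runs.items():
    ratio = seconds(stamp) / git_run
    print(f"{where}: {seconds(stamp):.1f}s, {ratio:.0f}x the git merge time")
```

this is a single measurement per setup, so the ratios are indicative at best.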

> Could you repeat this to show how e.g. svn:// compares to http://?
> And how they compare to file://?
> Both with and without encryption (SSH and SSL), if possible?
> Can you even try to run the merge on the svn box itself,
> using svn://localhost and http://localhost to see how much overhead
> is left when there is no network I/O?
> I'd like to know these numbers. If you can't or don't want to do it,
> that's fine. I can make similar measurements when I find the time.

sadly the repo is at the office, and i pushed for months to get svn-1.5 :)
seems like the only protocol we have supported is https .. but ill see if i
can talk to the server dude and work some magic to expose additional
protocols for more testing.

> > there are dark sides to dvcs tools, and im learning them as i go,
>
> If you learn more, please tell me. I'm interested to learn about
> them, too!

well, ive just started getting into the private/public branch flow,
finally. and w/ git since its so easy to branch, you just end up creating a
lot of them, and then its like, hmm, which branch did i create this one off
of?? well gitk will show you, but trying to get the info off the cli is
essentially impossible.. this i garnered from the git mailing list (taken
out of context)

"I really think this is impossible to define unambiguously in git, due to
the nature of git branches, being movable tags, much different from say
hg's hardwired branches (embedded in the commit object)."
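
a toy illustration of the point in that quote, assuming branches are nothing more than names pointing at commits. commit ids and branch names are invented, and only single-parent history is modelled. the graph can tell you where two branches diverge, but not which branch name was used to create another.

```python
# commits point at their parent; branches are just movable names.
PARENT = {"c1": None, "c2": "c1", "c3": "c2", "c4": "c2"}
BRANCHES = {"master": "c3", "topic": "c4", "experiment": "c2"}

def history(commit):
    """Walk parent links from a commit back to the root, newest first."""
    out = []
    while commit is not None:
        out.append(commit)
        commit = PARENT[commit]
    return out

def fork_points(branch):
    """For every other branch, the newest commit it shares with `branch`."""
    mine = history(BRANCHES[branch])
    result = {}
    for other, tip in BRANCHES.items():
        if other == branch:
            continue
        theirs = set(history(tip))
        result[other] = next(c for c in mine if c in theirs)
    return result

print(fork_points("topic"))
```

both master and experiment share the same commit with topic, and nothing in the data records which of the two topic was created from.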

understanding git is going to require a paradigm shift in my brain :D im
actually planning to get a grasp on hg as well - lol what am i thinking - 3
different vcs tools - arghhh!

-nathan

btw.
thanks for the links, ill keep reading ..

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=1065&dsMessageId=2362264

To unsubscribe from this discussion, e-mail: [users-unsubscribe_at_subversion.tigris.org].
Received on 2009-06-15 20:56:27 CEST
