
Re: merge performance (was: Re: Distributed Subversion)

From: Nathan Nobbe <quickshiftin_at_gmail.com>
Date: Fri, 12 Jun 2009 01:03:46 -0600

On Wed, Jun 10, 2009 at 11:27 AM, Stefan Sperling <stsp_at_elego.de> wrote:

> On Wed, Jun 10, 2009 at 10:35:29AM -0600, Nathan Nobbe wrote:
> > merge semantics in dvcs systems are way better,
>
> How and why? Honest question.

To be honest, I don't fully understand how merging or history tracking
works in DVCS tools.

> Please explain in detail, because I've
> been trying to understand what they are doing right, and what
> Subversion is doing wrong, and I don't think I've quite figured
> it out yet.

What I do know is that git takes a 'content tracking' approach. Everything
(used loosely) is identified by a SHA-1 hash. The SHA-1 hashes are the
GUIDs, and this is where the power comes from.
Diffing is really, really fast: entire trees can be compared by doing a
strcmp() on two 40-character strings. If the strings are the same, the
trees are the same (the optimal case); if not, you traverse, and you will
likely be able to discard large sections of sub-trees quickly as well.
I'm not sure on this part, but I'm guessing this is part of the reason
merging goes so fast, assuming the first step in a merge is determining
which objects need to be merged.
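
For what it's worth, the comparison I have in mind can be seen with plain
git plumbing commands; roughly something like this (the branch names are
made up):

  # resolve each branch to the SHA-1 of its root tree
  $ git rev-parse branchA^{tree} branchB^{tree}
  # if the two 40-character ids match, the trees are identical;
  # otherwise, recurse into the differing subtrees
  $ git ls-tree branchA
  $ git ls-tree branchB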

I'm still a relative newbie with git and DVCS; I've only been using it for
maybe a year now. I'm not sure how the merges work in detail yet, whether
there's a mathematical algorithm that's run, using the original version
and treating the hashes as offsets or something; I'm just not sure. One
thing I do know for sure, though, is that git does not retain sets of
'deltas' or patch files. It stores lots and lots of SHA-1 hashes, lol.

> Maybe you just mean "simpler" instead of "better"? See below.
>
> > so merging is fast and effective.
>
> I know it's fast. How effective is it, really?

What I've been trying to do (warning: there's a little bias here) is get
people to realize what merge tracking in svn 1.5 actually brings to the
table. IMO it might not be as smart or as fast as the DVCS tools, BUT it
is enough to make my life easier, and since I don't have to maintain the
revisions by hand anymore, I'm likely to merge more frequently, which as
we all know is ideal. Not only that, but I can encourage others on the
team to start merging under my guidance, now that it's easier, and then I
can start distributing (no pun intended :)) the workload.

See, that's the thing: if the tool can manage things for you, and you
don't have to think about them because you trust the tool, you're more
likely to use it for what it's good for in the first place! Prime
example: reintegration in svn 1.5; I hate to say it, but ouch. Honestly,
I don't even use the --reintegrate flag (regretting that after reading
more from you below <damn>); svn seems to figure out what to do without
it so far. But the issue we've all been over on this list a dozen times
or more now is that you have to create a new branch, re-create the
branch, etc. In git and other VCS tools, those sorts of flows, and others
that seem convoluted, just seem to magically work when you run a merge
(at least in git; I've still not done much more than download hg).

Like when I first ran this flow in git (imagine, if you will,
reintegrating upstream to trunk in svn, then not recreating the branch,
and now committing to trunk and trying to merge that back to the branch):
in svn you'd get conflicts with the changesets you just sent upstream to
trunk, whereas in git only the new work (on trunk since reintegration) is
applied; there are no issues with the stuff that just came to trunk from
this branch during reintegration. And that's just one flow example. This
is where I'm trying to go: in svn I have to be cognizant of this sort of
issue, in git I do not. Just as merge tracking in svn 1.5 relieves me
from maintaining merge information myself, the inherent merge
capabilities of DVCS systems (which are astounding) allow me to forget
about issues that could arise with arbitrary, sometimes complex branching
/ merging flows. So yes, not only are they really fast, they are
effective; pragmatically so, in my personal experience.
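
To make that flow concrete, here is roughly what I mean in git (the
branch names are made up):

  # merge the feature branch "upstream" into master (the reintegration)
  $ git checkout master
  $ git merge feature

  # keep working on master
  $ git commit -a -m "more work on master"

  # merge master back into the same feature branch; git applies only the
  # new master commits, with no conflicts from the changes that came
  # from the feature branch in the first merge
  $ git checkout feature
  $ git merge master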

Correct me if I'm wrong, but a prerequisite for running a --reintegrate
merge is that the branch is fully caught up to HEAD on the upstream
branch? That's my impression based on an error message I got when using
it early on. If that's the case, then this is yet another point where a
DVCS tool would just do the merge and I wouldn't have to think, and it
also discourages lazy people like me from using --reintegrate (though I
may find reason to change my ways).
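
For reference, the sequence I understand svn 1.5 expects is roughly this
(the URLs are made up):

  # in a working copy of the branch: catch up with trunk first
  $ svn merge http://svn.example.com/repos/trunk
  $ svn commit -m "sync branch with trunk"

  # then, in a working copy of trunk: reintegrate the branch
  $ svn merge --reintegrate http://svn.example.com/repos/branches/feature
  $ svn commit -m "reintegrate the feature branch"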

> > I read on this list a month or so back that svn is handling something
> > like 50-60 files per second on a merge and it's supposedly fast, yet
> > git can do *thousands* of files per second.
>
> Part of it is that git has been designed for performance
> more than Subversion. But there are trade-offs, because great
> performance requires taking shortcuts. I'm under the impression
> that git is solving a simpler problem when merging than Subversion
> is solving.
>
> Subversion's merge tracking allows you to merge from only a subset of
> paths modified during a given revision. It then tracks which paths of
> the commit you haven't merged into the target yet. Next time you merge
> the entire revision into the target again, Subversion knows what has
> already been merged and what still needs to be merged from that
> revision.
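
(As an aside, the kind of subset merge Stefan describes looks roughly
like this; the revision number and paths are made up:)

  # in the branch working copy, merge only one path touched by r42
  $ svn merge -c 42 http://svn.example.com/repos/trunk/lib lib

  # the result is subtree mergeinfo recorded on lib/ rather than only on
  # the branch root
  $ svn propget svn:mergeinfo -R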

Well, honestly, I think git solves this with the decision to allow
history rewriting. With git rebase -i you can rearrange commits, so that
one commit becomes two, or two commits become one, or whatever you
decide, and then you can merge just what you want by merging at the
commit boundaries as normal; six of one, half a dozen of the other, I
suppose. Rebasing is another feature, btw, that systems which don't allow
history rewriting need extra branching and merging to emulate, which,
when talking about effectiveness, is a very relevant point.
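
Roughly, the history rewriting I mean looks like this (the commit count
and messages are arbitrary):

  # interactively rewrite the last three commits
  $ git rebase -i HEAD~3

  # the todo list that opens contains something like
  #   pick a1b2c3d first change
  #   pick e4f5a6b second change
  #   pick c7d8e9f third change
  # and you can reorder the lines, mark a commit to be split up (edit),
  # or fold two commits into one (squash) before merging at the new
  # boundaries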

> When this happens, mergeinfo is essentially distributed across
> several paths in the repository. So Subversion has to query multiple
> paths for mergeinfo. That can cause performance problems for large trees.
> There are users hitting the mark where this slows down things so much
> that Subversion becomes virtually unusable. Those users do crazy things
> though, creating subtree mergeinfo on *every* path because of their
> special merging process I'm not going to explain because it takes too long.
> Having mergeinfo on every single path is something merge-tracking
> hasn't been designed to scale up to.
>
> But I'm under the impression that with git and Mercurial, you merge
> either everything, or nothing. And the whole commit is assumed to be
> merged no matter what the merge result was that got committed.
> If this is so, and if this works for everybody, then maybe Subversion
> is trying to be too precise.

I dunno, it does sound like a limitation when you put it like that, but
since revisions (or, as I've been referring to them, commits) are not
written in stone (at least in git), this problem does not exist there
(see above). Either way, I would consider this need esoteric / rarely
needed in either git or svn.

> If it is enough for most other systems to solve the simple problem
> of merging at units of entire revisions only, then maybe that is
> what Subversion should be doing, too. We could do away with subtree
> mergeinfo, just have mergeinfo at branch roots and declare rX merged
> into branch B as soon as someone merges _anything_ from rX into
> branch B. If you want to merge more from rX, well, then undo the
> previous merge of rX first, and then merge rX again. From conversations
> I've had with Paul (our merge-tracking guru), it would seem that this
> strategy would improve performance a lot. Would people like this?

No, because there is no history rewriting in svn, so this would be the
only way left to merge effectively inside the commit boundary.

> Note that there are more limitations of merging in git and Mercurial.
> For example, I think that they will have a hard time squeezing
> tree-conflict handling for directories into their design, if they
> will ever try to. We are trying, and it's damn hard, even though
> Subversion already has a design which is more suitable for this task.
> Because in Subversion, directories are versioned objects and not just
> a side-effect of a versioned file -- that's more or less what directories
> are in git and Mercurial, as far as I understand.

I'm lost on the tree conflict stuff I see when I run merges now since
1.5; I always end up resolving them. I do understand that the DVCS
systems infer changes to the folder structure, but I'm not sure about the
implementation. Again, not sure about hg, but for git it seems reasonable
to think git knows nothing about what a directory is, given that it knows
nothing about files either. Like with svn, you decide to store files,
then a list of diffs for each file; then you can also have svn track
folders as well. I'm not sure how it works under the hood, but as I
understand it, in *nix directories are files, meaning they can have SHA-1
hashes generated from them, meaning they fit into the content model
approach. I would venture to guess that when git snapshots the binary
content of a project to create a hash, it includes the directory files as
part of the content.
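
For what it's worth, you can poke at how git records directories as
'tree' objects; something like this (the hashes and names shown are just
placeholders):

  # list the entries of the current commit's root tree
  $ git ls-tree HEAD
  100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad    README
  040000 tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904    src

  # a subdirectory is itself a tree object with its own SHA-1
  $ git cat-file -p HEAD:src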
That said, I think I'm going to catch up on the tree conflict stuff; a
quick google pointed me here:
http://blogs.open.collab.net/svn/2009/03/subversion-160-and-tree-conflicts.html

*sigh* I doubt I'll get to it tonight though.., heh

> It took Subversion almost 10 years to get where it is now.
> DVCSes are much younger, and they could develop much quicker.
> I think it's because they solve a simpler subset of problems of
> version control. Maybe they are even solving an optimal subset
> of the problems, because their users are so happy.
> I don't know. I'd like to know.

There are a couple of key advantages I see:
 . performance - illustration forthcoming
 . convenience - repository creation / management
 . flexibility - a brief illustration: with svn, creating custom rsync
scripts to push changes to the server from a working copy (ew); with the
DVCS approach, just add another repo on the server and use the integrated
sync operations (see the sketch after this list).
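
That last point looks roughly like this in git (the remote name and URL
are made up):

  # add the server-side repository as a remote and push to it, instead
  # of rsync'ing files out of a working copy
  $ git remote add deploy ssh://user@server.example.com/srv/git/project.git
  $ git push deploy master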

From my perspective the advantages are really for the individual
developer. I feel like svn is simpler, though, in many regards, at least
from a conceptual standpoint. It's funny: it's like some things are
brain-dead simple in svn, and some things in git. I just feel like
merging is less painful in git.

> > Plus, I have to stop and think about whether it's time to branch to
> > avoid a goofy limitation in the merge tracking, or find out it wasn't
> > smart enough to omit previously merged changesets..
>
> Most of the time, you're using it wrong if that happens :)
> Those cyclic merges can happen when you don't use --reintegrate
> and dispose of your reintegrated branches properly.

Wow, it sounds like I could have been caught red-handed here :(

> But I agree that merge-tracking can be goofy, because it tries to
> be very smart. Much smarter than what git or hg are doing, it seems.
> And that is hard to get right. See the long list of merge-tracking
> fixes which have been made since 1.5 was released. Paul is working
> hard to make it happen!
>
> > , and start writing another bash script w/ several merge
> > commands, carefully skipping over the already merged revisions...
>
> Whenever you have to do this, please just stop doing it and talk
> to us instead. You should not have to do this. If you have to do
> this and it's not your fault but Subversion's fault, we need to fix it.

Honestly, the last time this happened was after we got merge tracking
(which is the only time so far, but we've just really gotten started with
it). I haven't tried reproducing the flow yet, but basically it was
across two branches which had the same root branch. I'll try to repro
soon, if I can.

Now, finally, to my illustration of merge performance in centralized vs.
distributed VCS, and this is a real-world example. I'm at a
private-sector business, and we have a healthy repo, running svn 1.5.5
(or something close; not 100% sure on the minor version), FSFS, on a
heartbeat system, the works. So here's an experiment which demonstrates
how tightly merge performance in svn is coupled to the performance of
your underlying network (whichever one you happen to be on at the
moment). Which machine I run the merge on greatly impacts the time it
takes to run the merge. Here are the results:
  . merge time - machine
1m42.729s - server in the same colo as the svn box
2m55.933s - on my laptop at the office over ethernet (solid bandwidth to the colo)
16m58.753s - same laptop at home over the VPN
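
(For reference, the timings come from something like this; the URL is
made up, and the git branch name depends on the git-svn layout:)

  # in the branch working copy, time the same sync merge on each machine
  $ time svn merge http://svn.example.com/repos/trunk

  # the git equivalent, run in the git-svn clone of the same repository
  $ time git merge trunk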

So you can see that performance drops quickly and dramatically depending
on your network. Anyone wondering how long it takes git to run this merge
(in the repo cloned via git-svn)?

0m3.736s - same laptop

And btw, I ran it just now, from the same location that took almost 17
minutes for svn. So now you start to think about the premise of the
thread that started this one: the guy is talking about repos across the
world. Wow, let's hope they have someone who lives near the server to do
the merges, seriously. And it goes further, to all those other commands:
log, diff, branch, switch...
This is something I've been trying to get across, about the distinction
between being on-net or off. Technically, I'm on-net right now in the
sense that I have connectivity to the repository. However, connectivity
to the repository is so slow, even with my 6-meg DSL connection, that
it's painful to use. The only reason I can do practically anything useful
at all here at the apartment is because I use git-svn. I only run merges
here when I have to (I use svn for all merges here). All you have to do
is look at the numbers above to imagine how bad the performance would be
when we're literally talking 'across the world'; hell, even if it were
across the state, it'd be horrible. As it stands, I can sometimes hop in
the shower while I run an svn merge from the crib :)

There are dark sides to DVCS tools, and I'm learning them as I go,
naturally. It's about tradeoffs, like anything. I suspect svn will move
in some fashion toward distribution as time goes on, likely in the
server-to-server syncing style. Really, I'm probably making these tools
sound better than they may be, I guess, because I'm new to DVCS. But as a
practical observation, I've moved all my own stuff over to DVCS, and I've
been using git-svn for work against svn-driven projects. Realistically,
I'd be lying to myself if I said I wasn't sold on the bulk of the
benefits.

-nathan

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=1065&dsMessageId=2361518

To unsubscribe from this discussion, e-mail: [users-unsubscribe_at_subversion.tigris.org].
Received on 2009-06-12 09:04:49 CEST
