merge performance (was: Re: Distributed Subversion)

From: Stefan Sperling <stsp_at_elego.de>
Date: Wed, 10 Jun 2009 18:27:31 +0100

On Wed, Jun 10, 2009 at 10:35:29AM -0600, Nathan Nobbe wrote:
> merge semantics in dvcs systems are way better,

How and why? Honest question. Please explain in detail, because I've
been trying to understand what they are doing right, and what
Subversion is doing wrong, and I don't think I've quite figured
it out yet.

Maybe you just mean "simpler" instead of "better"? See below.

> so merging is fast and effective.

I know it's fast. How effective is it, really?

> i read on this list a month or so back svn is handling like 50-60 files
> per second on a merge and its supposedly fast.., yet git can do
> *thousands* of files per second.

Part of it is that git was designed for performance more than
Subversion was. But there are trade-offs, because great
performance requires taking shortcuts. I'm under the impression
that git is solving a simpler problem when merging than Subversion
is solving.

Subversion's merge tracking allows you to merge from only a subset of
paths modified during a given revision. It then tracks which paths of
the commit you haven't merged into the target yet. Next time you merge
the entire revision into the target again, Subversion knows what has
already been merged and what still needs to be merged from that
revision.
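
As a rough illustration of the bookkeeping described above, here is a toy
Python model. It is not how Subversion stores or computes mergeinfo; the
function name, the revision, and the paths are all made up for the example.

```python
# Toy model only: not Subversion's actual data structures or algorithm.
def remaining_paths(changed_paths, merged_paths):
    """Return the paths changed in a revision that have not been merged yet."""
    def covered(path):
        # A path counts as merged if it, or one of its ancestors, was merged.
        return any(path == m or path.startswith(m.rstrip("/") + "/")
                   for m in merged_paths)
    return sorted(p for p in changed_paths if not covered(p))

# Hypothetical revision that touched three paths on the source branch.
changes = {"src/a.c", "src/b.c", "doc/readme.txt"}
# An earlier subtree merge brought over only the changes under src/.
already_merged = {"src"}

# Merging the whole revision later only has to pick up what is left.
print(remaining_paths(changes, already_merged))  # ['doc/readme.txt']
```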

When this happens, mergeinfo is essentially distributed across
several paths in the repository. So Subversion has to query multiple
paths for mergeinfo. That can cause performance problems for large trees.
There are users hitting the point where this slows things down so much
that Subversion becomes virtually unusable. Those users do crazy things
though, creating subtree mergeinfo on *every* path because of a
special merging process that would take too long to explain here.
Having mergeinfo on every single path is something merge-tracking
hasn't been designed to scale up to.
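
To make the scaling concern concrete, here is a minimal sketch in which a
plain dict stands in for the svn:mergeinfo properties found in a tree. The
helper name, the paths, and the revision ranges are assumptions for the
example, and the real lookup is more involved than a prefix match.

```python
# Toy model only: a dict stands in for svn:mergeinfo properties in a tree.
def paths_to_query(mergeinfo_props, target):
    """Return every path at or below target that carries explicit mergeinfo."""
    prefix = target.rstrip("/") + "/"
    return sorted(p for p in mergeinfo_props
                  if p == target or p.startswith(prefix))

# Mergeinfo only at the branch root: one path to look at.
root_only = {"branches/B": "/trunk:10-20"}

# Subtree mergeinfo on every path: one more path to look at per file.
everywhere = {"branches/B": "/trunk:10-20"}
for i in range(1000):
    everywhere[f"branches/B/file{i}.c"] = "/trunk:10-15"

print(len(paths_to_query(root_only, "branches/B")))   # 1
print(len(paths_to_query(everywhere, "branches/B")))  # 1001
```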

But I'm under the impression that with git and Mercurial, you merge
either everything, or nothing. And the whole commit is assumed to be
merged no matter what the merge result was that got committed.
If this is so, and if this works for everybody, then maybe Subversion
is trying to be too precise.

If it is enough for most other systems to solve the simple problem
of merging in units of entire revisions only, then maybe that is
what Subversion should be doing, too. We could do away with subtree
mergeinfo, just have mergeinfo at branch roots and declare rX merged
into branch B as soon as someone merges _anything_ from rX into
branch B. If you want to merge more from rX, well, then undo the
previous merge of rX first, and then merge rX again. From conversations
I've had with Paul (our merge-tracking guru), it would seem that this
strategy would improve performance a lot. Would people like this?
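
The simplified scheme proposed above could be sketched like this. It is
only a model of the idea, not an existing Subversion feature or API; the
class and method names are hypothetical.

```python
# Sketch of the proposal: mergeinfo lives only at the branch root and
# records whole revisions, regardless of which paths were merged.
class BranchRootMergeinfo:
    def __init__(self):
        self.merged_revs = set()

    def needs_merge(self, rev):
        """A revision is skipped once anything from it has been merged."""
        return rev not in self.merged_revs

    def record_merge(self, rev):
        """Merging any part of rev marks the whole revision as merged."""
        self.merged_revs.add(rev)

    def undo_merge(self, rev):
        """Undoing the merge makes the revision eligible again."""
        self.merged_revs.discard(rev)

branch_b = BranchRootMergeinfo()
branch_b.record_merge(100)         # merged only part of r100
print(branch_b.needs_merge(100))   # False: r100 counts as fully merged
branch_b.undo_merge(100)           # to get more of r100, undo first
print(branch_b.needs_merge(100))   # True: r100 can be merged again
```

Checking whether a revision still needs merging becomes a single lookup
at the branch root, with no subtree mergeinfo to consult.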

Note that there are more limitations of merging in git and Mercurial.
For example, I think that they will have a hard time squeezing
tree-conflict handling for directories into their design, if they
will ever try to. We are trying, and it's damn hard, even though
Subversion already has a design which is more suitable for this task.
Because in Subversion, directories are versioned objects and not just
a side-effect of a versioned file -- that's more or less what directories
are in git and Mercurial, as far as I understand.

It took Subversion almost 10 years to get where it is now.
DVCSes are much younger, and they have been able to develop much quicker.
I think it's because they solve a simpler subset of problems of
version control. Maybe they are even solving an optimal subset
of the problems, because their users are so happy.
I don't know. I'd like to know.

> A plus i have to stop and think about
> whether its time to branch to avoid a goofy limitation in the merge
> tracking, or find out it wasnt smart enough to omit previously merged
> changesets..

Most of the time, you're using it wrong if that happens :)
Those cyclic merges can happen when you don't use --reintegrate
and dispose of your reintegrated branches properly.

But I agree that merge-tracking can be goofy, because it tries to
be very smart. Much smarter than what git or hg are doing, it seems.
And that is hard to get right. See the long list of merge-tracking
fixes which have been made since 1.5 was released. Paul is working
hard to make it happen!

> , and start writing another bash script w/ several merge
> commands, carefully skipping over the already merged revisions...

Whenever you have to do this, please just stop doing it and talk
to us instead. You should not have to do this. If you have to do
this and it's not your fault but Subversion's fault, we need to fix it.

Thanks,
Stefan
Received on 2009-06-10 19:28:48 CEST
