Re: [RFC] Greatly improve merge performance, but break(?) edge cases

From: Julian Foad <julianfoad_at_btopenworld.com>
Date: Fri, 10 Jul 2009 10:22:57 +0100

Paul Burba wrote:
> A thinly veiled plea for some eyes on issue #3443...

[...]
> 1) Are the cases where subtrees in a merge target have different
> implicit mergeinfo than the rest of the target common use cases or
> highly contrived edge cases?. I think the latter is true, certainly
> the vanilla release and feature branch models aren't affected, but am
> I missing something obvious? I've spent so much time close to this
> that I might be missing the forest for the trees.
>
> 2) Assuming subtrees with differing implicit mergeinfo are edge cases,
> is there any reason *not* to make the changes suggested by the patch
> in issue #3443?

Hi Paul.

I can't answer (1) with empirical evidence. All I can do is offer my
perspective. Please forgive me if I'm labouring my point in this long
email.

Before you read on, be aware that I am not 100% sure that the behaviour
change you propose is actually a change from correctness to
incorrectness, but it sounds like it is so that's what I am assuming.

We are in a position where we fully support a certain limited kind of
merging scenario, and we also claim that Subversion can to some extent
cope with a much wider range of scenarios, and we hope to improve this.

When a tool does a complex job (merging) and is used in a complex
setting (software development, typically), users need to understand as
best they can what it does, and then apply it over and over. Because
they had to choose from the set of available version control systems,
none of which completely fulfills all of their needs, they are
inevitably using Subversion in some capacity that is a stretch for its
suitability in one way or another. Some of them are stretching the merge
capabilities and not able to stick to the most basic and fully supported
merging scenarios.

We tell them it has this complex but describable behaviour. If we then
tell them that in another scenario, with similar recent history but
different ancient history, it may omit parts of the merge (without even
telling the user that anything is amiss), then the user will rightly
lose confidence, or will go ahead and use it and then be bitterly
disappointed one day when they discover the problem. We can say, "You
liked it because it did the merges quickly." And they may reply, "It was
working fine, and then we used it in a an obviously similar scenario and
apparently it didn't work, so now we have to recall and fix our product.
We know that it has limitations, but we've discovered you did this
deliberately. Thanks a bunch."

When a software tool doesn't behave predictably and consistently, it
frustrates efforts to integrate it into a larger system. I would say
building GUIs around Subversion is such a case.

I think there is an equally important point on the Subversion
development side, too. From my point of view as a Subversion developer,
I think designing WHAT it does - the behaviour - is the more complex
task, and implementing that behaviour, though time consuming and still
complex, is the more accessible task.

If we design and implement the right behaviour, but it is slow, a new
developer can look at it and find ways to implement that same behaviour
more efficiently. (For example, pipelining the network requests to
eliminate the multiple-turnaround delay.) In our line of work, almost
any algorithm that executes slowly can be speeded up immensely by a
better implementation. That may be a complex task to many of us, but
there are a reasonable number of people who step up to do such tasks. It
is a self-contained task, with regression tests already in place, and
the roll-out brings only goodness to the users.

If we knowingly implement something that is not the correct behaviour,
then it is a much harder task for anyone later to redesign the correct
behaviour while also having to contemplate the effects on the user base
of making such a change. It is not sufficient to leave a comment in the
source saying, "The correct behavior would be X. Here, we knowingly do
Y." Changing the behaviour requires not just changing the source but
also testing, writing new tests, understanding old tests, and managing
users' experiences with the upgrade. And the re-implementation of X
would surely be as difficult as it was the first time around.

The definition of an "edge case" is that which we don't see happening
very often or which can easily be avoided. The trouble with dismissing
"edge cases" is that (a) although rare as an average across the
experience we have encountered so far, when a user hits one it often
becomes common for that user; and (b) when an edge case occurs in one
sub-set of a set of complex behaviours, as the combinations multiply up,
a whole large class of scenarios becomes unusable or unreliable.

The fact that the current regression tests don't catch the shortcomings
is just a result of their simplicity and we can be sure that real users
will run into problems. Haven't you spent, um, a *bit* of time fixing
"edge cases" that turned out to be actually rather important even if
they are still uncommon on average?

You have not found a way to speed it up immensely while keeping the
behaviour, but I think you should put this proposal out as a kick-off
point for other people to tackle that problem, and you should not feel
you have to implement something fast even at the expense of correctness.

All the foregoing assumes that the current behaviour w.r.t. sub-trees is
"correct" and skipping them is "wrong". That is a definition we could
consider changing. One alternative to keeping the "correct" behaviour is
to explicitly not support arbitrary merges with mixed sub-tree
mergeinfo. I can't see us choosing that option but maybe the line of
thought will lead somewhere.

Or maybe there are opportunities for higher level redesign of the
mergeinfo storage scheme that would make the implementation easier. For
example, if we were to rearrange the mergeinfo so that it explicitly
holds some info about "implicit" mergeinfo as well, maybe that could
enable major optimisations to be made in the look-up algorithms. (That's
not a thought-out proposal, just a top-of-of-my-head example.)

So basically, IF I understood you correctly and IF the behaviour today
is definitely correct and the proposed behaviour definitely wrong, then
I am not in favour of such a change. I value the ability to hit "merge"
and trust the tool to do its job more than I value the ability to hit
"merge" and have the result quickly but then have to worry about whether
it did what I expected.

Regards,
- Julian

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2369619
Received on 2009-07-10 11:23:20 CEST

This message: [ Message body ]
Next message: Senthil Kumaran S: "Re: svn commit: r38381 - in trunk/subversion: include libsvn_subr"
Previous message: Hyrum K. Wright: "Re: Mismatched backwards compat callbacks in libsvn_wc"
In reply to: Paul Burba: "[RFC] *Greatly* improve merge performance, but break(?) edge cases"
Next in thread: Greg Troxel: "Re: [RFC] *Greatly* improve merge performance, but break(?) edge cases"
Reply: Greg Troxel: "Re: [RFC] *Greatly* improve merge performance, but break(?) edge cases"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]

Re: [RFC] *Greatly* improve merge performance, but break(?) edge cases

Re: [RFC] Greatly improve merge performance, but break(?) edge cases