[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: Improvements to diff3 (merge) performance

From: Morten Kloster <morklo_at_gmail.com>
Date: Mon, 13 Jun 2011 13:00:53 +0200

On Sun, Jun 12, 2011 at 9:37 PM, Greg Hudson <ghudson_at_mit.edu> wrote:
> My executive summary of your post is that you want diff3 to try to merge
> related, but not identical, changes occuring between a pair of sync
> points.  I'm wary about this for two reasons.
>
> First, the benefit appears to arise chiefly for what Bram Cohen calls
> "implicit cherrypicking" use cases--that is, cases where a change is
> merged and then merged again together with other changes.  After
> extensive research, Bram eventually concluded that trying to support
> this is a bad idea (http://bramcohen.livejournal.com/52148.html).  I
> tend to think that a merge algorithm should not increase its false
> negative rate for the benefit of implicit cherrypicking.
>

I assume he has discussed this elsewhere in more detail? The link
you provided says very little about it (and the ONLY hit for "implicit
cherrypicking" on Google was your post :-).

> Second, I can see a lot of potential for incorrect merges based on
> superficial similarities.  For instance, if you and I both add code
> between the same two sync points, and the last line of my change is the
> same as the first line of yours (perhaps a blank line), that could be
> enough to unambiguously determine an ordering.  Perhaps both of our code
> additions do the same thing in different ways, and now it gets done
> twice (which is almost never desirable).  Certainly, the existing
> algorithm can produce incorrect merges too, but my intuition says that
> the practical likelihood would become much higher.
>

The blank line case is precisely the kind of problem you're likely to run
into also with the current algorithm, unless the project has a strict policy
on when to have blank lines. I'm not too fond of letting blank lines count
as valid matches for that purpose in the first place - see below. I don't
think my proposal would cause too much of an increase in that kind of
problem, but of course there's a risk. I agree that my original proposal
might be too aggressive, though - that was partially a case of "cool, this
approach will do all I wanted and then some"; see below.

> Of course, I could be off base.  All merge algorithms are heuristic, and
> it would take a lot of research to really compare the efficacy of one
> against another.  You need a cost function to determine the tradeoff
> between the false negative rate and the false positive rate, and you
> also need to measure how any given algorithmic change affects the false
> negative and false positive rates in practice.  Both of these seem
> really hard.
>

I don't suppose there would be a not-horribly-complicated/arduous way of
extracting (LOTS OF) merges (base, both changed files, and committed
merge) from the subversion project history, to serve as a somewhat
representative set of test cases? There must be plenty of them, all told...?

> It would definitely affect my opinion if I learned that the three-way
> merge algorithms in other popular version control systems worked this
> way, or if I learned that Subversion was more restrictive than, say, gnu
> diff3.
>
>

Actually, GNU diff3 (at least version 2.9) is a fair bit MORE restrictive:

1. It requires an actual overlap for sync points - it does not allow
adjacent changes.

2. It does not allow common changes; these are considered conflicts.
Only one of Mine and Theirs can be changed between neighboring
sync points.

Ironically, property 1 would precisely have guaranteed that A or B
could merge cleanly with their merge product C (as long as the LCS
matches stayed the same, of course), since the overlap at the sync
point means it could remain a sync point also in C, IF it were not
for property 2, which pretty explicitly disallows merging with the
merge product...

===

As mentioned above, my original proposal was somewhat more
aggressive than strictly necessary for my purposes. Also, I think
the user should be allowed to specify how aggressive the merge
algorithm should be as an option. Here are some possible levels:

1. Cautious: Only merge changes that are separated by at least
X lines, of which at least Y must be unique matches (since diff3
already has counts for the tokens present in each file, checking
for unique matches is now very cheap and simple).

2. Classic: As now.

3. Aggressive: Sync also on commonly deleted sections, and
merge strictly larger changes (when the changes are easily
identified as such). This is similar to my original proposal,
but doesn't attempt to interleave different insertions or to
merge only partially overlapping changes.

4. Reckless: My original proposal.

I see very little to no risk in allowing the merging of strictly
larger changes; this is a natural extension of allowing both
common and adjacent changes. The only significant risk I
see for syncing on commonly deleted sections is if changes
in one or both files are identified by the LCS algorithm as
fragmented conserved sections, due to a lack of unique
tokens; then the different code from the two files could
end up interleaved. One way to avoid this would be to
require conserved sections to be of a certain quality (at least
X lines, not all identical, or at least one unique line) in order
to qualify as sync points - this could be a good idea also
for the "2. Classic" option, in order to avoid erroneous
syncing on blank lines as discussed above. As mentioned
before, it is easy to put such checks in place now, since the
token counts are already available.

===

At the very least, I think diff3 should return resolved_diff
results that are based on such additional sync points,
since that would give GUI clients far better information
for presenting a good line matching, even if the region as
a whole is still marked as a conflict.

---
Morten
Received on 2011-06-13 13:01:25 CEST

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.