Re: Improvements to diff3 (merge) performance

From: Morten Kloster <morklo_at_gmail.com>
Date: Sun, 12 Jun 2011 19:10:50 +0200

Ugh - using "---" to separate sections of the post was clearly a
really bad idea; sorry about that. :-(

Here it is again, in a hopefully more readable format:

Hi,

I'm not very happy with the current performance of the diff3
algorithm; I think it returns too many and too big conflicts. I have
some ideas on how to improve it, but I'd like to hear other people's
opinions first. Note that the primary purpose is to discuss how we
want merge to behave, not whether it is easy to implement.

I'll use the terms Base (original), Mine (modified) and Theirs (latest)
to refer to the three source files. In the merged file, C indicates a
conflict.

===

First, here's a quick run-though of how diff3 works now: After
running LCS on Base vs. Mine and Base vs. Theirs, it looks for
"sync points" - points where both Mine and Theirs match the Base.
Note that there is no need for an actual overlap: If Mine matches
Base up to a point and Theirs matches Base from that point
onwards, that is a valid sync point. If no two of the three are
identical between two neighboring sync points, that creates a
conflict.

Adjacent changes are accepted:

abcd (Base)
a1cd (Mine)
ab2d (Theirs)
a12d (Merged)

and

abcd (Base)
acd (Mine)
abc (Theirs)
ad (Merged)

while strictly larger changes are not:

abcd (Base)
a1cd (Mine)
a12d (Theirs)
aCd (Merged)

and

abcd (Base)
acd (Mine)
ad (Theirs)
aCd (Merged)

Note that this means that if changed files A and B are merged to
produce C, then attempting to merge A (or B) with C could well
produce a conflict (using the same Base the whole time), whereas
in my opinion, that ought to work fine, yielding C again.

===

My proposed main change is to let diff3 also sync on points where
Mine and Theirs have identical changes relative to Base, if either file
(Mine or Theirs) can also be synced to Base at the same point. This
can happen if either Mine or Theirs is unchanged from Base either
before or after that point, or if either of the changes from Base was a
pure insertion (as there is then no uncertainty as to the position in
the Base file). This would let diff3 incorporate all changes relative to
Base from both Mine and Theirs as long as there is no true conflict.

Some consequences:

Given two changed files A and B that only have deletions relative to
Base, the merged file would only contain the lines that are present
in both A and B.

Different insertions would be merged, not marked as conflict,
unless they truly conflict. There are only true conflicts if the diff2
between the two insertions have actual changes, not just separated
deletions and insertions (or equivalently, if the LCS has gaps in
both files between neighboring segments):

ag (Base)
abcefg (Mine)
acdeg (Theirs)
abcdefg (Merged)

Here, the order of the inserted lines is unambiguously determined;
partially from Mine and partially from Theirs.

Incremental and overlapping changes would merge cleanly:

abcde (Base)
a12de (Mine)
ab23e (Theirs)
a123e (Merged)

Of course there are cases where the merged result might not be
what would be desired, but that is also the case with the current
algorithm - keep in mind that any conflict can be "resolved" by
inserting blank lines in Base, Mine and Theirs to serve as sync
points. Here's an unfortunate case with the current algorithm:

abcbcd (Base)
acbcd (Mine)
abcd (Theirs)
acd (Merged)

- even though Theirs < Mine < Base, in strict subsequence sense,
Merged < Theirs.

As a secondary change, I would like to reduce the sizes of
detected conflicts when the change in one file is a pure deletion.
Example:

abcdef (Base)
abc1ef (Mine
af (Theirs)
aCf (Merged)

Here, only the change d -> 1 is a true conflict, and in a graphical
interface such as TortoiseSVN, I would like to see only that single
line conflicted. However, as was pointed out when I suggested
this on our dev IRC, if can be useful to have the context when
editing the conflict in e.g. vim. One suggestion on IRC was to
split the conflict into 3; two not-really-conflicts around the true
conflict. Ideally, the not-really-conflicts would be marked as such
in the diff, which would require an API change.

An alternate way of handling it, however, would be to merely
improve the resolved_diff result for that conflict (which currently
only contains the whole thing as a conflict). This means that a
resolved_diff could also contain diffs of type
svn_diff__type_diff_modified or svn_diff__type_diff_latest, which
is currently not the case. It might be best to add more display
options for the output functions, so an API change might still be
required. Some or all of the output functions would need to be
adjusted to handle these possibilities.

===

If you know of reasons why these changes would be a bad
idea, please provide examples when possible.

Morten Kloster
Received on 2011-06-12 19:11:24 CEST

This message: [ Message body ]
Next message: Greg Hudson: "Re: Improvements to diff3 (merge) performance"
Previous message: Morten Kloster: "Improvements to diff3 (merge) performance"
In reply to: Morten Kloster: "Improvements to diff3 (merge) performance"
Next in thread: Greg Hudson: "Re: Improvements to diff3 (merge) performance"
Reply: Greg Hudson: "Re: Improvements to diff3 (merge) performance"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]