Improvements to diff3 (merge) performance
From: Morten Kloster <morklo_at_gmail.com>
Date: Sun, 12 Jun 2011 19:05:23 +0200
Hi,
I'm not very happy with the current performance of the diff3
I'll use the terms Base (original), Mine (modified) and Theirs (latest)
--- First, here's a quick run-though of how diff3 works now: After running LCS on Base vs. Mine and Base vs. Theirs, it looks for "sync points" - points where both Mine and Theirs match the Base. Note that there is no need for an actual overlap: If Mine matches Base up to a point and Theirs matches Base from that point onwards, that is a valid sync point. If no two of the three are identical between two neighboring sync points, that creates a conflict. Adjacent changes are accepted: abcd (Base) a1cd (Mine) ab2d (Theirs) a12d (Merged) and abcd (Base) acd (Mine) abc (Theirs) ad (Merged) while strictly larger changes are not: abcd (Base) a1cd (Mine) a12d (Theirs) aCd (Merged) and abcd (Base) acd (Mine) ad (Theirs) aCd (Merged) Note that this means that if changed files A and B are merged to produce C, then attempting to merge A (or B) with C could well produce a conflict (using the same Base the whole time), whereas in my opinion, that ought to work fine, yielding C again. --- My proposed main change is to let diff3 also sync on points where Mine and Theirs have identical changes relative to Base, if either file (Mine or Theirs) can also be synced to Base at the same point. This can happen if either Mine or Theirs is unchanged from Base either before or after that point, or if either of the changes from Base was a pure insertion (as there is then no uncertainty as to the position in the Base file). This would let diff3 incorporate all changes relative to Base from both Mine and Theirs as long as there is no true conflict. Some consequences: Given two changed files A and B that only have deletions relative to Base, the merged file would only contain the lines that are present in both A and B. Different insertions would be merged, not marked as conflict, unless they truly conflict. There are only true conflicts if the diff2 between the two insertions have actual changes, not just separated deletions and insertions (or equivalently, if the LCS has gaps in both files between neighboring segments): ag (Base) abcefg (Mine) acdeg (Theirs) abcdefg (Merged) Here, the order of the inserted lines is unambiguously determined; partially from Mine and partially from Theirs. Incremental and overlapping changes would merge cleanly: abcde (Base) a12de (Mine) ab23e (Theirs) a123e (Merged) Of course there are cases where the merged result might not be what would be desired, but that is also the case with the current algorithm - keep in mind that any conflict can be "resolved" by inserting blank lines in Base, Mine and Theirs to serve as sync points. Here's an unfortunate case with the current algorithm: abcbcd (Base) acbcd (Mine) abcd (Theirs) acd (Merged) - even though Theirs < Mine < Base, in strict subsequence sense, Merged < Theirs. As a secondary change, I would like to reduce the sizes of detected conflicts when the change in one file is a pure deletion. Example: abcdef (Base) abc1ef (Mine af (Theirs) aCf (Merged) Here, only the change d -> 1 is a true conflict, and in a graphical interface such as TortoiseSVN, I would like to see only that single line conflicted. However, as was pointed out when I suggested this on our dev IRC, if can be useful to have the context when editing the conflict in e.g. vim. One suggestion on IRC was to split the conflict into 3; two not-really-conflicts around the true conflict. Ideally, the not-really-conflicts would be marked as such in the diff, which would require an API change. An alternate way of handling it, however, would be to merely improve the resolved_diff result for that conflict (which currently only contains the whole thing as a conflict). This means that a resolved_diff could also contain diffs of type svn_diff__type_diff_modified or svn_diff__type_diff_latest, which is currently not the case. It might be best to add more display options for the output functions, so an API change might still be required. Some or all of the output functions would need to be adjusted to handle these possibilities. --- If you know of reasons why these changes would be a bad idea, please provide examples when possible. Morten KlosterReceived on 2011-06-12 19:05:59 CEST |
This is an archived mail posted to the Subversion Dev mailing list.
This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.