Hi all,
I started writing up a mail on some thoughts I had about merging. I
wanted to handle other cases apart from the standard line-by-line merging.
In the end I seem to have come up with something very like
variance-adjusted-patching, but using the delta combiner to do a lot of the
work. It also handles adjustable resolution in conflict detection - it
isn't tied to line-by-line diffing.
Anyway, here's what I wrote, Given it is so similar to
variance-adjusted-patching you might want to skip it. Just remember you can
use the delta combiner...
\x/ill :-}
Alo,
I've been thinking about this merging business and thought I'd stick my oar
in. This mail is kinda long.
As some background, I've read www/variance-adjusted-patching.html (which
I'll refer to as VAP) and http://prcs.sourceforge.net/merge.html amongst
other things.
Here are my issues with what I read:
i) The VAP system looks nice, although this raised a few flags for me:
> Note that the [VAP] algorithm is actually too powerful -- if taken to its
> logical limits, it can generate patches that apply cleanly even when the user
> would almost certainly prefer a conflict. Thus it's necessary to offer an
> adjustment selectivity level; a good default selectivity would probably allow
> compensation for context variance, but not for variance in expected target
> lines.
ii) PRCS doesn't use any form of VAP.
iii) I'd like to be able handle the following form of 'binary' patching:
Consider a text document that is meant to be displayed in an editor with
text wrapping (i.e. EOLs at the end of the paragraph rather than every 80
chars). If you change a sentance, that change shouldn't conflict with
changes to other sentances in the same paragraph. With the current 'line
based' patching, the whole paragraph gets treated as a line. (see the above
paragraph from VAP - but note that we don't always want to use 'target
lines')
Other more complex binary patching would be nice, but this is the basic
example I'm considering.
The PRCS doc above talks about finding sets of three files to merge. It
doesn't seem to discuss the mechanics of merging those files. I assume it
is simply 3 way diff. It is kinda orthoganal to VAP and this suggestion.
In this mail I describe a very VAP-like algorithm, but using binary diffs
and thinking about those knobs for conflicts.
Here is what seems to be a standard diagram (example taken from VAP and
modified slightly):
T:1 --> T:2 -...-> T:8 --> T:9 -...-> T:18 --> T:19 --> T:20
=
= B:1 --> B:2 -...-> B:15
Here --> indicates a direct child revision, -...-> indicates a series of
child revisions, and == indicates a branch. T:8 is identical to B:1. We
talk about T:X -...-> T:Y as the delta that converts T:X into T:Y. It is
assumed to be reversable, so we can get T:Y -...-> T:X from it.
The user wants to merge the changes from T:18 -...-> T:20 into B:15. (Note
that here I'm using VAP terminology. PRCS talks about merging revisions,
not applying patches.)
In VAP there is an algorithm described for taking the T:18 -...-> T:20 patch
and adjusting it so it can be applied to B:15. It works on a line by line
basis appropriate for text files. Lets drop that and assume binary diffs.
I'm going to assume that the binary diff format (a delta) is a list of
hunks. Each hunk is either of the form "Insert data D at location X", or
"Delete data D from location X". Note that the delete form carries enough
information to be invertable.
I'm going to suggest the following algorithm: (Note, this went through a
few iterations, and I think this is a coherent final version, but I may have
made a mistake.)
i) Use a delta combiner to get two binary deltas. The first is the 'patch
delta', T:18 -...-> T:20. The second is the 'context delta', T:18 -...->
T:8 -...-> B:15, or T:18 -...-> B:15.
These deltas should be in a 'simplified' form. Deletes in the context delta
should only delete original text, not text added by prior insert hunks.
This means that if one hunk adds a block of text and then another hunk
deletes a section from the middle of that newly added block, then that pair
of hunks should be replaced with a pair of inserts, one for the data before
the delete, and one for the data after the delete.
Additionally, inserts that reinsert text that was previously deleted
should be removed, along with the original delete. This might require a
diff between hunks that operate on the same region of the file.
Imagine in our example above that T:9 --> T:10 is the same as B:2 --> B:3.
In the combined context diff, one of these is reversed, and so the two
should cancel each other and neither will appear. This reduces the number
of conflicts that will be found below. If T:9 --> T:10 is the same as B:2
--> B:3, except for a space at the front of one of them, then the combined
result should be the simple insertion/deletion of a space.
ii) Once the two deltas are found, the context delta is used to change the
patch delta so it applies cleanly (or conflicts are marked).
Both of these deltas convert T:18 into something else. There is a conflict
if they both try and convert the same block of text into something else.
The algorithm goes through the motions of applying the context delta, but
instead of applying it, looks for conflicts with the patch delta.
Regardless of whether there are any conflicts, the patch delta is adjusted
so that it applies to the correct areas of the file.
Consider each hunk in the context delta in turn, Hc. For each Hc, consider
each hunk in the patch delta in turn Hp. (I'm sure there are smart ways to
reduce the time complexity here.)
For each hunk we'll consider an 'area of effect'. Inserts will have an area
of effect equal to the offset where they insert. Deletes will have an area
of effect equal to the range they delete.
For each pair of hunks, we check whether they conflict. If they do then the
context hunk is marked as conflicting.
An insert conflicts with another insert at the same location (we don't know
in which order they should occur). If Hc inserts into a range that Hp
deletes then there is no conflict. If Hp inserts into a range that Hc
deletes then there is a conflict. If Hc and Hp delete overlapping ranges
then there is no conflict.
Note that even when there is a conflict, it is possible to update the
location of the Hp hunks so that they have moved to the correct location to
apply to B:15.
iii) If there were no conflicts then we can just apply the modified patch
delta (modified because the locations of changes may have changed). If
there were conflicts then we can ignore the non-conflicting hunks of the
context delta, we can apply the non-conflicting hunks of the patch delta.
When two hunks conflicted, we can add a conflict marker, show the original
section of file, add a marker for context, show what happens to the file
when context patched, add a marker for the patch and show what happens when
the file is patched, then add a closing marker.
So, does this give us what we want? Well, for the soft wrapped text
document example above, it seems to be ok (although not perfect). For C
code where we may want to mark a mark a conflict for changes on the same
line this may not mark as many conflicts as we'd like yet. e.g. if B:3 -->
B:4 changes a line that is also changed in T:19 --> T:20, BUT in different
parts of the line, then this algorithm will not mark a conflict when it
maybe should.
One way to handle this is to expand the 'area of effect' of a hunk as
defined above. For C code, you would expand the area of effect to the next
EOL in each direction. For plain text you might expand the area of effect
to the next whitespace, or perhaps the next '.' or EOLN if you want to use
sentances. For XML you could expand to the next tag, or, if you were
feeling fancy, to the innermost pair of matching tags that encloses the
change. For arbitary binary files you really need to expand the area of
effect to include the whole file - everything might conflict.
If you're expanding the area of effect of a hunk to some textual feature and
it is the same for each hunk, then I believe you only need to do it for the
hunks in one of the deltas. eg you could expand the area of effect just for
the hunks in the patch delta and leave the context delta's hunks as they
are.
In addition, we may not always be interested in storing all of a conflicting
hunk. If a conflicting hunk is an insert of 4 pages of text that only
conflicts on the last line, we really only want to store enough of the hunk
to get the context of the conflict. One way to think of this is to divide a
conflicting insert into a pair of inserts, one smaller one near the
conflict, and a larger one that no longer conflicts. I imagine you'd save
about five lines of context, just like a context diff.
Does this make any sense at all?
Here are the examples from the VAP document (I did this by hand - I hope I
didn't make any mistakes... :).
Let's take a brief look at what adjustment does to some patches. Here's a
simple case, arising from an attempt to port an individual change from trunk
to branch. On trunk T, revision 1 looks like this:
int main (int argc, char **argv)
{
/* line minus-five of context */
/* line minus-four of context */
/* line minus-three of context */
/* line minus-two of context */
/* line minus-one of context */
printf ("Hello, world!\n");
/* line plus-one of context */
/* line plus-two of context */
/* line plus-three of context */
/* line plus-four of context */
/* line plus-five of context */
}
A branch B also exists, rooted in the trunk at revision T:1. On branch B, we
commit a change to the context (we replace the words with numbers), so that
B:2 looks like this:
int main (int argc, char **argv)
{
/* line -5 of context */
/* line -4 of context */
/* line -3 of context */
/* line -2 of context */
/* line -1 of context */
printf ("Hello, world!\n");
/* line +1 of context */
/* line +2 of context */
/* line +3 of context */
/* line +4 of context */
/* line +5 of context */
}
Meanwhile on the trunk, we change only the middle line, so that T:2 is
committed as:
int main (int argc, char **argv)
{
/* line minus-five of context */
/* line minus-four of context */
/* line minus-three of context */
/* line minus-two of context */
/* line minus-one of context */
printf ("Good-bye, cruel world!\n");
/* line plus-one of context */
/* line plus-two of context */
/* line plus-three of context */
/* line plus-four of context */
/* line plus-five of context */
}
we then attempt to patch B:2 with the difference T:1 --> T:2.
We generate a patch delta T:1 --> T:2:
Delete "Hello," from position 220
Insert "Good-bye, cruel" at position 220
And a context delta T:1 --> B:2:
Delete "minus-five" from position 45
Insert "-5" at position 45
Delete "minus-four" from position 72 (Note, this 72 is taking into account
the previous change)
Insert "-4" at position 72
Delete "minus-three" from position 99
Insert "-3" at position 99
Delete "minus-two" from position 126
Insert "-2" at position 126
Delete "minus-one" from position 153
Insert "-1" at position 153
Delete "plus-one" from position 210
Insert "+1" at position 210
Delete "plus-two" from position 237
Insert "+2" at position 237
Delete "plus-three" from position 264
Insert "+3" at position 264
Delete "plus-four" from position 291
Insert "+4" at position 291
Delete "plus-five" from position 318
Insert "+5" at position 318
Now we check to see if the hunks conflict. None do (even if their area of
effect is extended to the nearest EOL). The patch delta has the 220s
changed to 181s:
Delete "Hello," from position 181
Insert "Good-bye, cruel" at position 181
Applying this patch does the right thing.
\x/ill :-}
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Sat Oct 21 14:37:08 2006