# Re: diff wish

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Sun, 19 Jun 2011 17:19:52 +0200

On Thu, Jun 16, 2011 at 11:41 PM, Stefan Sperling <stsp_at_elego.de> wrote:
> On Thu, Jun 16, 2011 at 10:38:12PM +0200, Johan Corveleyn wrote:
>> Are you sure? Maybe there are different definitions of how patience
>> diff works (it seems to be rather undocumented :-)). I'm not really
>> into the details, but from what I've read on this thread on the git
>> mailinglist [1], I thought it worked recursively, more or less like
>> this:
>>
>> (1) Find the unique lines, and perform an LCS on them.
>>
>> (2) For each in-between section, again perform step (1) (finding
>> unique lines for this section, which may turn up new unique lines,
>> unique to this section).
>>
>> Repeat this recursively, until there are no more unique lines in a
>> section. Then fall back to the regular diff algorithm (a "standard
>> minimal diff" on the remaining section).
>
> As far as I can tell the recursion was dropped later:
>
> http://bramcohen.livejournal.com/73318.html
> "I've previously described it with the ordering a bit different and a
> recursive step at the end..."

Hm, I've read that blog entry, but it's lacking a lot of detail. For
starters, the explanation of "how it works" is incomplete:

[[[
1. Match the first lines of both if they're identical, then match the
second, third, etc. until a pair doesn't match.
2. Match the last lines of both if they're identical, then match the
next to last, second to last, etc. until a pair doesn't match.
3. Find all lines which occur exactly once on both sides, then do
longest common subsequence on those lines, matching them up.
4. Do steps 1-2 on each section between matched lines
]]]

So ... in step 4, after executing 1-2 on the remaining section, then
what :-)? I'm guessing he means that you should then run a standard
LCS on the remaining non-matching section. Or does he want to consider
the entire part, between the points where 1 and 2 stopped, a
mismatch???

Also, the explanation why he dropped the "recursive step at the end"
is very hand-wavy: "... and the performance hit of doing the recursion
isn't worth it, because it rarely if ever finds any more matches, and
even when it does it isn't clear whether the extra matches produce a
functionally superior diff."

Hmmmm, what does that mean? Does it mean that if you zoom in on a
section-between-unique-matches, it's usually of such a form that a
patience diff doesn't produce a better result than a standard diff?
Strange. Also, I'm not sure if this would be such a big performance
hit (with the current token counting in place in libsvn_diff, it's not
such a big effort to recount "token_counts" for a subsection of
lines).

Note that 1-2 is more-or-less the prefix/suffix-scanning we already
do. Though the way it's implemented now in libsvn_diff, it's not
suited for recursive use like this, because it's low-level,
byte-based, ... before the tokens/lines are extracted. But it would be
pretty trivial to implement something similar with sets of
tokens/lines.

Also, I think if you follow the above algorithm strictly, you'll end
up with ugly diffs, similar to the example Bram Cohen gives in this
same article. Because those prefix/suffix-scanning steps will blindly
consider identical lines as part of the common lines, something that's
not always wanted (hence the entire reason of existence for patience
diff in the first place).

Blind prefix/suffix-scanning should keep some identical lines
available to leave some wiggle room for the main algorithm (that's why
we have the "#define SUFFIX_LINES_TO_KEEP 50" in diff_file.c, which
makes sure the suffix isn't gobbled up completely (with the current
LCS algorithm, there is no need to do this in the prefix scanning,
because the LCS algorithm will also always consider lines common as
long as it can, while scanning forward)).

Anywayzzz ... we'll see when we get there I suppose :-).

```--
Johan
```