Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Mon, 4 Oct 2010 02:51:23 +0200

On Sun, Oct 3, 2010 at 1:46 AM, Johan Corveleyn <jcorvel_at_gmail.com> wrote:
> Hi,
>
> Here is a second iteration of the patch. It now passes make check.
>
> Differences from the previous version are:
> - Support for \r eol-style (\n and \r\n was already ok).
> - The number of prefix_lines is now passed to svn_diff__lcs, so it can
> use that value to set the position offset of the "EOF" marker
> correctly, in case one of both files has become empty after skipping
> the prefix. This fixes the crashes in blame_tests.py 2 and 7.
>
> The patch is pretty big, so please let me know if I should split it up
> to make it more reviewable (I could easily split it up between the
> prefix-finding and the suffix-finding, at the cost of having overview
> over the entire algorithm).
>
> Still to do:
> - Think about why results are sometimes different (because of
> eliminated suffix, the LCS can sometimes be slightly different), and
> what can be done about it.

Hm, this problem has made me reconsider whether my patch is a good
solution (more details below). I'm thinking of other ways of
implementing this kind of optimization, so for now I'm pulling this
patch. Please don't review it yet, as I might go for a radically
different approach (sorry for any time spent that may be lost).

The details:
The problem happens if there are two (or more) non-matching parts with
an identical part in between (so the prefix scanning stops early,
doesn't meet the suffix in one of the files), and the suffix scanning
goes too far because the end of the last change is identical to the
original at that point.

For example (best viewed with fixed-width font):
[[[
original | modified
-----------------------------------------------------------
This is line 1 | This is line 1
This is line 2 | This line is *changed*
This is line 3 | This is line 3
... (more identical lines) | ...
existing_function() | existing_function()
{ | {
  // do stuff | // do stuff
  return SVN_NO_ERROR; | return SVN_NO_ERROR;
} | }
                             |
This is the end of the file | new_function()
-----------------------------| {
                             | // do other stuff
                             | return SVN_NO_ERROR;
                             | }
                             |
                             | This is the end of the file
                             |-----------------------------
]]]

The following identical suffix will be eliminated:
[[[
return SVN_NO_ERROR;
}

This is the end of the file
]]]

which means the LCS algorithm cannot do better than to say that the
following lines are new:
[[[
+ return SVN_NO_ERROR;
+}
+
+new_function()
+{
+ // do other stuff
]]]

Not quite what we want/expect. If the suffix is not eliminated, the
LCS algorithm gives the correct/expected result. Also if the change in
the beginning of the file isn't there, the result is correct (because
the prefix scanning goes first, and eliminates the identical stuff
from the start (I'll leave it as an exercise to the reader to see that
the result is better)).

Interestingly, GNU diff also has this problem. It mitigates it a bit
by keeping a number of identical prefix/suffix lines (at least the
number of context lines, or a higher number if supplied by
--horizon-lines=NUM). This cannot completely solve the problem though:
for any given number of "horizon-lines" one can always come up with an
example that doesn't give the correct result.

So I need to find a way to not "eliminate" the identical suffix, but
to mark those lines as identical and then include them in the
position_list that's going to be used by the LCS algorithm. I'd like
to do the same for the prefix. This means I need to approach this
problem on a different level.

To be continued ...

Cheers,

-- 
Johan

Received on 2010-10-04 02:52:04 CEST

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]