On Wed, May 18, 2011 at 9:25 PM, Johan Corveleyn <jcorvel_at_gmail.com> wrote:

*> On Wed, May 18, 2011 at 9:23 PM, Johan Corveleyn <jcorvel_at_gmail.com> wrote:
*

*>> On Wed, May 18, 2011 at 7:31 PM, Morten Kloster <morklo_at_gmail.com> wrote:
*

*>>> Thanks, Daniel.
*

*>>>
*

*>>> Johan:
*

*>>> I've found your notes trunk/notes/diff-optimizations.txt. As you may
*

*>>> have realized, my patch is a slightly different approach to the
*

*>>> problem you discuss in section I.5, "Discarding non-matching
*

*>>> lines before running the LCS (Longest Common Subsequence)
*

*>>> algorithm." - rather than discarding the lines before calling LCS, I
*

*>>> create count information that lets LCS skip the lines without
*

*>>> counting them as (avoidable) changes. I believe my approach
*

*>>> is easier to implement (not least now, when it's already done ;)
*

*>>> and has less overhead, although the potential speed-up is
*

*>>> somewhat less:
*

*>>
*

*>> Yes, I noticed the similarity :-). Your approach is a nice one, and
*

*>> indeed easier to implement. With the discarding-approach, we'd have to
*

*>> give each line an alternative line number (a "virtual line number", of
*

*>> the file without its non-matching lines), then run the LCS on those
*

*>> lines, and then when producing the diff output go back to using the
*

*>> original line numbers. Or something like that.
*

*>>
*

*>> But I like your idea too, it sounds more light-weight, and might
*

*>> provide most of the benefits.
*

*>>
*

*>>> LCS has running time of O( N + P D ), where N is #tokens in
*

*>>> longest source, P is number of mismatches in smallest source
*

*>>> and D is total number of mismatches in both sources.
*

*>>> Discarding the non-matching tokens would replace both P and D
*

*>>> by their discarded versions P' and D', giving a potential speed-up
*

*>>> of the SQUARE of the inverse of the fraction of changed tokens
*

*>>> that are not unique to one source, whereas my approach
*

*>>> effectively only replaces P by P', not D, so you only get one
*

*>>> power, not two (at least I think so; I haven't checked carefully).
*

*>>
*

*>> You seem to know your stuff :-).
*

*>>
*

*>> For me it's easier to understand by explaining D as "the length of the
*

*>> shortest edit script (additions+deletes) from largest to smallest
*

*>> source", and P as "the number of deleted tokens in that shortest edit
*

*>> script". That's the terminology used in the article referred to by
*

*>> lcs.c [1]. Or instead of "shortest edit script", I usually use the
*

*>> term "minimal diff".
*

*>>
*

*>> I'm not sure about your calculation though. I should probably study
*

*>> your patch first, because I haven't quite grokked it yet. But could
*

*>> you explain why your approach doesn't reduce D (the deleted lines)?
*

*>>
*

At least in some cases, you basically get the same behavior as for the

worst-case scenario below (where it's still N, not N'). That might not be

the case for a "typical" scenario, though - I'm not sure.

*>> Also: the discarding-approach reduces N as well, just by making both
*

*>> files smaller. In the "average case", where expected running time is
*

*>> O(N + PD), that's probably not a big deal. But according to the
*

*>> article the worst-case running time is O(NP). So reducing N also has a
*

*>> big impact on the worst-case running time. Anyway, I speaking purely
*

*>> theoretical here, I have no idea when this worst-case running time can
*

*>> occur in practice.
*

Worst case-type scenario:

Source1: ababababababababababababababacccccccccccccccb

Source2: bbbababababababababababababababac

Best match is clearly to skip the 3 first b's in Source2. However, you will

be able to match almost all of Source2 against various parts of Source1

also if skip no or only 1 or 2 b's, which means LCS will take PN time

(actually (P+1)N, but complexity is the same).

Now insert a series of MANY d's after the first 10 tokens of Source1, and as

many e's after the first 10 tokens of Source2. With my code, you will have

to skip past each of those tokens for each of the 4 scans, so it's still P N,

not P N'.

*>>
*

*>>> There is another possible enhancement that is somewhat
*

*>>> complementary to the one I've implemented: If the number of
*

*>>> changed tokens is smaller than the number of uniquely
*

*>>> matching tokens, and the changes are grouped in well-
*

*>>> separated sections, then whenever you have found more
*

*>>> unique matches than the number of changes you have "used",
*

*>>> you can "restart" the problem from that point: you are
*

*>>> guaranteed that there is no better match for the earlier
*

*>>> part of either source.
*

*>>
*

*>> I'm not sure about that. I have to chew on it a little bit :-).
*

*>>
*

*>> Can you describe this idea in some more detail? Maybe give with an
*

*>> example or something?
*

*>>
*

Source1: aa1234abab56789ababababababa

Source2: bb1234aabb56789aababbbaaabbb

Here, the numbers are unique matches (only occur once in each file).

After scanning the first 5 tokens of each file, you know that even though

you might be able to match the a's / b's with later parts of the other file,

that would cost you at least three unique matches, so that's just not an

option - you know that the LCS will match 123 with 123. Similarly when

you get to 5678, that is more unique matches than what you could

possibly gain by matching the files otherwise.

The easiest way to implement this might be to NOT use strict LCS, but

rather let unique matches get double weight - you count unmatched

unique matches in BOTH files. This should also give nicer diffs (similar to

Patience Diff, but less extreme). It should be possible to stick to strict

LCS, but that would make the code more complicated - you would have

to check if you have already accounted for that token in the other file.

(To get this enhancement, you have to count some cost for a mismatched

unique match the first time you skip the token, regardless if it's in the

longer or shorter source file.)

*>>> This reduces the running time from
*

*>>> O( N + P' D ) to something like O( N + sum(P'n Dn)), I think,
*

*>>> where the sum is over the different changed sections. I
*

*>>> haven't worked out the details on how to best implement
*

*>>> this approach, but it should work well with my initial patch.
*

*>
*

*> Sorry, forgot the footnote I intended to add:
*

*> [1] "An O(NP) Sequence Comparison Algorithm", Sun Wu, Udi Manber, Gene Myers
*

*>
*

*> --
*

*> Johan
*

*>
*

Received on 2011-05-18 23:25:41 CEST