On Wed, May 18, 2011 at 9:23 PM, Johan Corveleyn <jcorvel_at_gmail.com> wrote:

*> On Wed, May 18, 2011 at 7:31 PM, Morten Kloster <morklo_at_gmail.com> wrote:
*

*>> Thanks, Daniel.
*

*>>
*

*>> Johan:
*

*>> I've found your notes trunk/notes/diff-optimizations.txt. As you may
*

*>> have realized, my patch is a slightly different approach to the
*

*>> problem you discuss in section I.5, "Discarding non-matching
*

*>> lines before running the LCS (Longest Common Subsequence)
*

*>> algorithm." - rather than discarding the lines before calling LCS, I
*

*>> create count information that lets LCS skip the lines without
*

*>> counting them as (avoidable) changes. I believe my approach
*

*>> is easier to implement (not least now, when it's already done ;)
*

*>> and has less overhead, although the potential speed-up is
*

*>> somewhat less:
*

*>
*

*> Yes, I noticed the similarity :-). Your approach is a nice one, and
*

*> indeed easier to implement. With the discarding-approach, we'd have to
*

*> give each line an alternative line number (a "virtual line number", of
*

*> the file without its non-matching lines), then run the LCS on those
*

*> lines, and then when producing the diff output go back to using the
*

*> original line numbers. Or something like that.
*

*>
*

*> But I like your idea too, it sounds more light-weight, and might
*

*> provide most of the benefits.
*

*>
*

*>> LCS has running time of O( N + P D ), where N is #tokens in
*

*>> longest source, P is number of mismatches in smallest source
*

*>> and D is total number of mismatches in both sources.
*

*>> Discarding the non-matching tokens would replace both P and D
*

*>> by their discarded versions P' and D', giving a potential speed-up
*

*>> of the SQUARE of the inverse of the fraction of changed tokens
*

*>> that are not unique to one source, whereas my approach
*

*>> effectively only replaces P by P', not D, so you only get one
*

*>> power, not two (at least I think so; I haven't checked carefully).
*

*>
*

*> You seem to know your stuff :-).
*

*>
*

*> For me it's easier to understand by explaining D as "the length of the
*

*> shortest edit script (additions+deletes) from largest to smallest
*

*> source", and P as "the number of deleted tokens in that shortest edit
*

*> script". That's the terminology used in the article referred to by
*

*> lcs.c [1]. Or instead of "shortest edit script", I usually use the
*

*> term "minimal diff".
*

*>
*

*> I'm not sure about your calculation though. I should probably study
*

*> your patch first, because I haven't quite grokked it yet. But could
*

*> you explain why your approach doesn't reduce D (the deleted lines)?
*

*>
*

*> Also: the discarding-approach reduces N as well, just by making both
*

*> files smaller. In the "average case", where expected running time is
*

*> O(N + PD), that's probably not a big deal. But according to the
*

*> article the worst-case running time is O(NP). So reducing N also has a
*

*> big impact on the worst-case running time. Anyway, I speaking purely
*

*> theoretical here, I have no idea when this worst-case running time can
*

*> occur in practice.
*

*>
*

*>> There is another possible enhancement that is somewhat
*

*>> complementary to the one I've implemented: If the number of
*

*>> changed tokens is smaller than the number of uniquely
*

*>> matching tokens, and the changes are grouped in well-
*

*>> separated sections, then whenever you have found more
*

*>> unique matches than the number of changes you have "used",
*

*>> you can "restart" the problem from that point: you are
*

*>> guaranteed that there is no better match for the earlier
*

*>> part of either source.
*

*>
*

*> I'm not sure about that. I have to chew on it a little bit :-).
*

*>
*

*> Can you describe this idea in some more detail? Maybe give with an
*

*> example or something?
*

*>
*

*>> This reduces the running time from
*

*>> O( N + P' D ) to something like O( N + sum(P'n Dn)), I think,
*

*>> where the sum is over the different changed sections. I
*

*>> haven't worked out the details on how to best implement
*

*>> this approach, but it should work well with my initial patch.
*

Sorry, forgot the footnote I intended to add:

[1] "An O(NP) Sequence Comparison Algorithm", Sun Wu, Udi Manber, Gene Myers

--
Johan

Received on 2011-05-18 21:26:12 CEST