On Wed, May 18, 2011 at 9:23 PM, Johan Corveleyn <jcorvel_at_gmail.com> wrote:
> On Wed, May 18, 2011 at 7:31 PM, Morten Kloster <morklo_at_gmail.com> wrote:
>> Thanks, Daniel.
>>
>> Johan:
>> I've found your notes in trunk/notes/diff-optimizations.txt. As you
>> may have realized, my patch takes a slightly different approach to
>> the problem you discuss in section I.5, "Discarding non-matching
>> lines before running the LCS (Longest Common Subsequence)
>> algorithm": rather than discarding the lines before calling LCS, I
>> create count information that lets LCS skip the lines without
>> counting them as (avoidable) changes. I believe my approach is
>> easier to implement (not least now, when it's already done ;) and
>> has less overhead, although the potential speed-up is somewhat
>> less:
>
> Yes, I noticed the similarity :-). Your approach is a nice one, and
> indeed easier to implement. With the discarding-approach, we'd have to
> give each line an alternative line number (a "virtual line number", of
> the file without its non-matching lines), then run the LCS on those
> lines, and then when producing the diff output go back to using the
> original line numbers. Or something like that.
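>
> Something like the following sketch, maybe (completely made-up names,
> not the actual libsvn_diff data structures):
>
>   /* Map each line to a "virtual" line number: its index in the file
>    * with all non-matching lines removed.  Lines whose token does not
>    * occur in the other file get -1 and are not fed to the LCS. */
>   typedef struct virtual_line_t
>   {
>     int original_lineno;   /* line number in the real file */
>     int virtual_lineno;    /* line number after discarding, or -1 */
>   } virtual_line_t;
>
>   /* Returns the number of lines that survive the discarding. */
>   static int
>   assign_virtual_linenos(virtual_line_t *lines,
>                          int n_lines,
>                          const int *has_match)
>   {
>     int v = 0;
>     int i;
>
>     for (i = 0; i < n_lines; i++)
>       {
>         lines[i].original_lineno = i;
>         lines[i].virtual_lineno = has_match[i] ? v++ : -1;
>       }
>
>     return v;
>   }
>
> The LCS would then only see the lines with virtual_lineno != -1, and
> the pass that produces the diff output would translate the virtual
> numbers back to the original ones.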
>
> But I like your idea too, it sounds more light-weight, and might
> provide most of the benefits.
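>
> If I understand the general idea correctly, it boils down to
> something like this sketch (made-up names, certainly not your actual
> code):
>
>   /* For each line, precompute the length of the run of lines
>    * starting there whose tokens have no match at all in the other
>    * source.  The LCS can then slide over such a run without counting
>    * those lines towards the number of changes. */
>   static void
>   count_unmatched_runs(const int *has_match,
>                        int n_lines,
>                        int *skip)
>   {
>     int run = 0;
>     int i;
>
>     for (i = n_lines - 1; i >= 0; i--)
>       {
>         run = has_match[i] ? 0 : run + 1;
>         skip[i] = run;
>       }
>   }
>
> Is that roughly the right picture?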
>
>> LCS has a running time of O(N + PD), where N is the number of tokens
>> in the longest source, P is the number of mismatches in the smallest
>> source, and D is the total number of mismatches in both sources.
>> Discarding the non-matching tokens would replace both P and D by
>> their discarded versions P' and D', giving a potential speed-up of
>> the SQUARE of the inverse of the fraction of changed tokens that are
>> not unique to one source. My approach effectively only replaces P by
>> P', not D, so you only get one power, not two (at least I think so;
>> I haven't checked carefully).
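>>
>> To put some (made-up) numbers on it: if 1 in 5 of the changed tokens
>> also occurs somewhere in the other source (so 4 out of 5 could be
>> discarded or skipped), then P' ~ P/5 and D' ~ D/5. Discarding would
>> shrink the P*D term by roughly a factor of 25, while my approach
>> only shrinks it by roughly a factor of 5.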
>
> You seem to know your stuff :-).
>
> For me it's easier to understand by explaining D as "the length of the
> shortest edit script (additions + deletions)", and P as "the number of
> deleted tokens in that shortest edit script when going from the
> smallest to the largest source" (i.e. the mismatching tokens of the
> smallest source). That's the terminology used in the article referred
> to by lcs.c [1]. Or instead of "shortest edit script", I usually use
> the term "minimal diff".
>
> I'm not sure about your calculation though. I should probably study
> your patch first, because I haven't quite grokked it yet. But could
> you explain why your approach doesn't also reduce D (the total number
> of added and deleted lines)?
>
> Also: the discarding-approach reduces N as well, just by making both
> files smaller. In the "average case", where expected running time is
> O(N + PD), that's probably not a big deal. But according to the
> article the worst-case running time is O(NP). So reducing N also has a
> big impact on the worst-case running time. Anyway, I'm speaking purely
> theoretically here; I have no idea when this worst-case running time
> can occur in practice.
>
>> There is another possible enhancement that is somewhat
>> complementary to the one I've implemented: If the number of
>> changed tokens is smaller than the number of uniquely
>> matching tokens, and the changes are grouped in well-
>> separated sections, then whenever you have found more
>> unique matches than the number of changes you have "used",
>> you can "restart" the problem from that point: you are
>> guaranteed that there is no better match for the earlier
>> part of either source.
>
> I'm not sure about that. I have to chew on it a little bit :-).
>
> Can you describe this idea in some more detail? Maybe give an example
> or something?
>
>> This reduces the running time from O(N + P'D) to something like
>> O(N + sum_n(P'_n D_n)), I think, where the sum is over the different
>> changed sections. I haven't worked out the details of how best to
>> implement this approach, but it should work well with my initial
>> patch.
Sorry, forgot the footnote I intended to add:
[1] "An O(NP) Sequence Comparison Algorithm", Sun Wu, Udi Manber, Gene Myers
--
Johan