On Wed, May 18, 2011 at 7:31 PM, Morten Kloster <morklo_at_gmail.com> wrote:

*> Thanks, Daniel.
*

*>
*

*> Johan:
*

*> I've found your notes trunk/notes/diff-optimizations.txt. As you may
*

*> have realized, my patch is a slightly different approach to the
*

*> problem you discuss in section I.5, "Discarding non-matching
*

*> lines before running the LCS (Longest Common Subsequence)
*

*> algorithm." - rather than discarding the lines before calling LCS, I
*

*> create count information that lets LCS skip the lines without
*

*> counting them as (avoidable) changes. I believe my approach
*

*> is easier to implement (not least now, when it's already done ;)
*

*> and has less overhead, although the potential speed-up is
*

*> somewhat less:
*

Yes, I noticed the similarity :-). Your approach is a nice one, and

indeed easier to implement. With the discarding-approach, we'd have to

give each line an alternative line number (a "virtual line number", of

the file without its non-matching lines), then run the LCS on those

lines, and then when producing the diff output go back to using the

original line numbers. Or something like that.

But I like your idea too, it sounds more light-weight, and might

provide most of the benefits.

*> LCS has running time of O( N + P D ), where N is #tokens in
*

*> longest source, P is number of mismatches in smallest source
*

*> and D is total number of mismatches in both sources.
*

*> Discarding the non-matching tokens would replace both P and D
*

*> by their discarded versions P' and D', giving a potential speed-up
*

*> of the SQUARE of the inverse of the fraction of changed tokens
*

*> that are not unique to one source, whereas my approach
*

*> effectively only replaces P by P', not D, so you only get one
*

*> power, not two (at least I think so; I haven't checked carefully).
*

You seem to know your stuff :-).

For me it's easier to understand by explaining D as "the length of the

shortest edit script (additions+deletes) from largest to smallest

source", and P as "the number of deleted tokens in that shortest edit

script". That's the terminology used in the article referred to by

lcs.c [1]. Or instead of "shortest edit script", I usually use the

term "minimal diff".

I'm not sure about your calculation though. I should probably study

your patch first, because I haven't quite grokked it yet. But could

you explain why your approach doesn't reduce D (the deleted lines)?

Also: the discarding-approach reduces N as well, just by making both

files smaller. In the "average case", where expected running time is

O(N + PD), that's probably not a big deal. But according to the

article the worst-case running time is O(NP). So reducing N also has a

big impact on the worst-case running time. Anyway, I speaking purely

theoretical here, I have no idea when this worst-case running time can

occur in practice.

*> There is another possible enhancement that is somewhat
*

*> complementary to the one I've implemented: If the number of
*

*> changed tokens is smaller than the number of uniquely
*

*> matching tokens, and the changes are grouped in well-
*

*> separated sections, then whenever you have found more
*

*> unique matches than the number of changes you have "used",
*

*> you can "restart" the problem from that point: you are
*

*> guaranteed that there is no better match for the earlier
*

*> part of either source.
*

I'm not sure about that. I have to chew on it a little bit :-).

Can you describe this idea in some more detail? Maybe give with an

example or something?

*> This reduces the running time from
*

*> O( N + P' D ) to something like O( N + sum(P'n Dn)), I think,
*

*> where the sum is over the different changed sections. I
*

*> haven't worked out the details on how to best implement
*

*> this approach, but it should work well with my initial patch.
*

--
Johan

Received on 2011-05-18 21:23:55 CEST