On Wed, May 18, 2011 at 7:31 PM, Morten Kloster <morklo_at_gmail.com> wrote:
> Thanks, Daniel.
>
> Johan:
> I've found your notes trunk/notes/diff-optimizations.txt. As you may
> have realized, my patch is a slightly different approach to the
> problem you discuss in section I.5, "Discarding non-matching
> lines before running the LCS (Longest Common Subsequence)
> algorithm." - rather than discarding the lines before calling LCS, I
> create count information that lets LCS skip the lines without
> counting them as (avoidable) changes. I believe my approach
> is easier to implement (not least now, when it's already done ;)
> and has less overhead, although the potential speed-up is
> somewhat less:
Yes, I noticed the similarity :-). Your approach is a nice one, and
indeed easier to implement. With the discarding approach, we'd have to
give each line an alternative line number (a "virtual line number", of
the file without its non-matching lines), then run the LCS on those
lines, and then when producing the diff output go back to using the
original line numbers. Or something like that.
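Something like this, I imagine (just a rough sketch with made-up names,
nothing like the actual libsvn_diff code; it assumes we've already
flagged each line that occurs in only one of the files with skip[i]):

    #include <stdlib.h>

    struct compacted
    {
      int *tokens;     /* tokens of the kept (potentially matching) lines */
      int *orig_line;  /* orig_line[v] = original line of virtual line v */
      int len;         /* number of kept lines */
    };

    /* Keep only the lines that can possibly match (skip[i] == 0) and
       remember, for each kept line, its original line number. */
    static void
    compact_lines(const int *tokens, int len, const char *skip,
                  struct compacted *out)
    {
      int i, v = 0;

      out->tokens = malloc(len * sizeof(int));
      out->orig_line = malloc(len * sizeof(int));
      for (i = 0; i < len; i++)
        if (!skip[i])
          {
            out->tokens[v] = tokens[i];
            out->orig_line[v] = i;
            v++;
          }
      out->len = v;
    }

The LCS would then run on the compacted token arrays of both files, and
a match at virtual position v maps back to original line orig_line[v]
when producing the diff output.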
But I like your idea too: it sounds more lightweight, and it might
provide most of the benefits.
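If I understand the counting idea correctly, the preprocessing would be
roughly this (again just an illustration with made-up names, not
necessarily how your patch actually does it):

    #include <string.h>

    #define NUM_TOKENS 4096  /* assume token ids are interned to 0..4095 */

    /* Mark every token of one file that has no counterpart in the
       other file; such a token can never be part of a match. */
    static void
    mark_unmatchable(const int *tokens, int len,
                     const int *other_tokens, int other_len,
                     char *skip)
    {
      int count[NUM_TOKENS];
      int i;

      memset(count, 0, sizeof(count));
      for (i = 0; i < other_len; i++)
        count[other_tokens[i]]++;

      for (i = 0; i < len; i++)
        skip[i] = (count[tokens[i]] == 0);
    }

The difference would then be that, instead of compacting the files as
above, the LCS itself consults skip[] (or the counts) and advances past
the flagged tokens without charging them to the edit distance.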
> LCS has a running time of O(N + PD), where N is the number of tokens
> in the longest source, P is the number of mismatches in the smallest
> source, and D is the total number of mismatches in both sources.
> Discarding the non-matching tokens would replace both P and D
> by their discarded versions P' and D', giving a potential speed-up
> of the SQUARE of the inverse of the fraction of changed tokens
> that are not unique to one source, whereas my approach
> effectively only replaces P by P', not D, so you only get one
> power, not two (at least I think so; I haven't checked carefully).
You seem to know your stuff :-).
For me it's easier to understand by explaining D as "the length of the
shortest edit script (additions+deletes) from smallest to largest
source", and P as "the number of deleted tokens in that shortest edit
script". That's the terminology used in the article referred to by
lcs.c [1]. Or instead of "shortest edit script", I usually use the
term "minimal diff".
I'm not sure about your calculation though. I should probably study
your patch first, because I haven't quite grokked it yet. But could
you explain why your approach doesn't reduce D (the additions plus
deletions)?
Also: the discarding approach reduces N as well, just by making both
files smaller. In the "average case", where the expected running time
is O(N + PD), that's probably not a big deal. But according to the
article, the worst-case running time is O(NP). So reducing N also has a
big impact on the worst-case running time. Anyway, I'm speaking purely
theoretically here; I have no idea when this worst-case running time
can occur in practice.
> There is another possible enhancement that is somewhat
> complementary to the one I've implemented: If the number of
> changed tokens is smaller than the number of uniquely
> matching tokens, and the changes are grouped in well-
> separated sections, then whenever you have found more
> unique matches than the number of changes you have "used",
> you can "restart" the problem from that point: you are
> guaranteed that there is no better match for the earlier
> part of either source.
I'm not sure about that. I have to chew on it a little bit :-).
Can you describe this idea in some more detail? Maybe with an
example or something?
> This reduces the running time from
> O(N + P'D) to something like O(N + sum_n(P'_n D_n)), I think,
> where the sum is over the different changed sections. I
> haven't worked out the details on how to best implement
> this approach, but it should work well with my initial patch.
--
Johan