On Thu, Dec 2, 2010 at 6:18 PM, Bill Tutt <bill_at_tutts.org> wrote:
> Note: This email relates only tangentially to svn diff and is more about
> reverse token scanning in general:
>
> As someone who has implemented suffix reverse token scanning before:
Thanks for the input. It's nice to see other people have also
struggled with this :-).
> * It simply isn't possible in DBCS code pages. Stick to byte only here.
> SBCS and UTF-16 make reverse token stuff relatively
> straightforward. UTF-8 is a little trickier but still tractable.
> At least UTF-8 is tractable in a way that DBCS isn't. You always
> know which part of a Unicode code point you are in. (i.e. byte 4 vs.
> byte 3 vs. etc...)
Ok, this further supports the decision to focus on the byte-based
approach. We'll only consider stuff identical if all bytes are
identical. That's the simplest route, and since it's only an
optimization anyway ...
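
To make the byte-only rule concrete, here is a minimal sketch, assuming both
inputs are already in memory as byte strings. Python is used purely for
illustration; common_suffix_len is a hypothetical name, not anything in svn.

```python
def common_suffix_len(a: bytes, b: bytes) -> int:
    """Count how many trailing bytes of a and b are identical.

    Pure byte comparison: no decoding, no whitespace or case folding.
    """
    n = 0
    limit = min(len(a), len(b))
    # Walk backwards from the last byte until the first mismatch.
    while n < limit and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


old = b"foo\nbar\nbaz\n"
new = b"qux\nbar\nbaz\n"
shared = common_suffix_len(old, new)
# Only the bytes before the shared suffix still need a real diff.
print(old[:len(old) - shared], new[:len(new) - shared])
```

Because only exact byte matches count, two encodings of the same character
would simply end the identical suffix early, and the regular diff handles
the rest.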
> * I would recommend only supporting a subset of the diff options for
> reverse token scanning. i.e. ignore whitespace/ignore eol but skip
> ignore case (if svn has that, I forget...)
svn diff doesn't have an ignore-case option, so that's ok :-).
> If tokens include keyword expansion operations then stop once you
> hit one. The possible source of bugs outweighs the perf gain in my mind
> here.
Haven't thought about keyword expansion yet. But as you suggest: I'm
not going to bother doing special stuff for (expanded) keywords. If we
find a mismatch, we'll stop with the optimized scanning, and fall back
to the default algorithm.
> * Suffix scanning does really require a seekable stream, if it isn't
> seekable then don't perform the reverse scanning. It is only an
> optimization after all.
Hm, yes, we'll need to be careful about that. I'll start another mail
thread asking for known implementors of the svn_diff_fns_t functions,
to find out whether seeking around like that for suffix would be
supported.
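
The "skip it if we can't seek" guard could look roughly like the sketch
below. It uses Python file objects as a stand-in and is not the
svn_diff_fns_t interface; read_tail and max_bytes are hypothetical names.

```python
import io


def read_tail(stream, max_bytes):
    """Return up to the last max_bytes bytes of stream.

    Returns None when the stream cannot seek, so the caller can skip
    suffix scanning and fall back to the regular algorithm.
    """
    if not stream.seekable():
        return None
    end = stream.seek(0, io.SEEK_END)   # position of end-of-stream
    start = max(0, end - max_bytes)     # do not seek before the start
    stream.seek(start)
    return stream.read(end - start)


tail = read_tail(io.BytesIO(b"hello world"), 5)
print(tail)
```

Returning None rather than raising keeps the optimization strictly
optional: a data source that cannot seek just doesn't get it.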
> Additional ignore whitespace related comment:
> * IIRC, Perforce had an interesting twist on ignoring whitespace. You
> could ignore just line leading/ending whitespace instead of all
> whitespace differences but pay attention to any whitespace change
> after the "trim" operation had completed.
>
> e.g.:
> * " aaa bbb " vs "aaa bbb" would compare as equal
> * " aaa bbb " vs "aaa bbb" would compare as equal
> * " aaa bbb " vs "aaa bbb" would compare as non-equal due to the
> white space change in the middle of the line
Cool (svn doesn't have that option). But I'm not sure what that would
be useful for (as a user, I can't immediately imagine an important use
case). Anyway, could still be a nice option...
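
For reference, the "trim, then compare strictly" behaviour as I understand
it from the description above could be sketched like this. It is not
Perforce's actual implementation; the function name is hypothetical, and
treating only spaces and tabs as whitespace is an assumption.

```python
def equal_ignoring_edge_whitespace(a: str, b: str) -> bool:
    """Compare two lines, ignoring only leading and trailing whitespace.

    Whitespace inside the line still has to match exactly.
    """
    edge = " \t"  # assumption: only spaces and tabs are trimmed
    return a.strip(edge) == b.strip(edge)


# Leading/trailing whitespace is ignored ...
print(equal_ignoring_edge_whitespace("  aaa bbb  ", "aaa bbb"))
# ... but a change in the middle of the line is still a difference.
print(equal_ignoring_edge_whitespace("aaa  bbb", "aaa bbb"))
```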
Cheers,
--
Johan
Received on 2010-12-02 21:22:17 CET