Re: diff-optimizations-tokens branch: I think I'm going to abandon it

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Thu, 2 Dec 2010 21:21:20 +0100

On Thu, Dec 2, 2010 at 6:18 PM, Bill Tutt <bill_at_tutts.org> wrote:
> Note: This email relates only tangentially to svn diff and is more
> about reverse token scanning in general:
>
> As someone who has implemented suffix reverse token scanning before:

Thanks for the input. It's nice to see other people have also
struggled with this :-).

> * It simply isn't possible in DBCS code pages. Stick to byte only here.
> SBCS and UTF-16 make reverse token stuff relatively
> straightforward. UTF-8 is a little trickier but still tractable.
> At least UTF-8 is tractable in a way that DBCS isn't. You always
> know which part of a Unicode code point you are in. (i.e. byte 4 vs.
> byte 3 vs. etc...)

Ok, this further supports the decision to focus on the byte-based
approach. We'll only consider stuff identical if all bytes are
identical. That's the simplest route, and since it's only an
optimization anyway ...
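
To make the byte-based idea concrete, here is a minimal sketch in Python
(not the actual svn code, which is C). The function names are made up for
illustration. It strips the byte-identical prefix and suffix of two
buffers and leaves only the middle part for the regular diff algorithm.
It ignores line boundaries; a real implementation would presumably have
to round back to whole lines so that the remaining tokens stay intact.

```python
def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that are identical in a and b."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def common_suffix_len(a: bytes, b: bytes, prefix_len: int = 0) -> int:
    """Number of trailing bytes that are identical in a and b.

    The scan never reaches into the first prefix_len bytes, so the
    prefix and suffix cannot overlap.
    """
    n = min(len(a), len(b)) - prefix_len
    i = 0
    while i < n and a[len(a) - 1 - i] == b[len(b) - 1 - i]:
        i += 1
    return i


def trim_identical_ends(a: bytes, b: bytes):
    """Return (prefix_len, suffix_len, middle_a, middle_b)."""
    p = common_prefix_len(a, b)
    s = common_suffix_len(a, b, p)
    return p, s, a[p:len(a) - s], b[p:len(b) - s]


if __name__ == "__main__":
    old = b"one\ntwo\nthree\n"
    new = b"one\n2\nthree\n"
    print(trim_identical_ends(old, new))
```

Only the two middle parts would then be handed to the default algorithm.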

> * I would recommend only supporting a subset of the diff options for
> reverse token scanning. i.e. ignore whitespace/ignore eol but skip
> ignore case (if svn has that, I forget...)

svn diff doesn't have an ignore-case option, so that's ok :-).

> If tokens include keyword expansion operations then stop once you
> hit one. The possible source of bugs outweighs the perf gain in my mind
> here.

Haven't thought about keyword expansion yet. But as you suggest: I'm
not going to bother doing special stuff for (expanded) keywords. If we
find a mismatch, we'll stop with the optimized scanning, and fall back
to the default algorithm.

> * Suffix scanning does really require a seekable stream, if it isn't
> seekable then don't perform the reverse scanning. It is only an
> optimization after all.

Hm, yes, we'll need to be careful about that. I'll start another mail
thread asking for known implementors of the svn_diff_fns_t functions,
to find out whether seeking around like that for suffix would be
supported.
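
As a rough illustration of the "only if seekable" rule, here is a sketch
using Python's io streams. This is not the svn_diff_fns_t interface;
read_tail and PipeLike are hypothetical names. The helper returns the
last n bytes of a stream, or None when the stream cannot seek, in which
case the caller would simply skip the suffix optimization.

```python
import io


def read_tail(stream, n):
    """Return the last n bytes of stream, or None if it is not seekable.

    The original stream position is restored before returning.
    """
    if not stream.seekable():
        return None  # caller falls back to the unoptimized path
    pos = stream.tell()
    size = stream.seek(0, io.SEEK_END)
    stream.seek(max(0, size - n))
    data = stream.read(n)
    stream.seek(pos)
    return data


class PipeLike(io.BytesIO):
    """Stand-in for a pipe-like source that refuses to seek."""

    def seekable(self):
        return False


if __name__ == "__main__":
    print(read_tail(io.BytesIO(b"abcdef"), 3))
    print(read_tail(PipeLike(b"abcdef"), 3))
```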

> Additional ignore whitespace related comment:
> * IIRC, Perforce had an interesting twist on ignoring whitespace. You
> could ignore just line leading/ending whitespace instead of all
> whitespace differences but pay attention to any whitespace change
> after the "trim" operation had completed.
>
> e.g.:
> * "    aaa bbb   " vs "aaa bbb" would compare as equal
> * "    aaa bbb " vs "aaa bbb" would compare as equal
> * "    aaa  bbb " vs "aaa bbb" would compare as non-equal due to the
> white space change in the middle of the line

Cool (svn doesn't have that option). But I'm not sure what that would
be useful for (as a user, I can't immediately imagine an important use
case). Anyway, could still be a nice option...
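
For what it's worth, the comparison described above could be sketched in
Python like this. The function name is made up, and I'm assuming that
"whitespace" means spaces and tabs and that the non-equal example has an
extra space between "aaa" and "bbb":

```python
def equal_ignoring_edge_whitespace(a: str, b: str) -> bool:
    """Compare two lines, ignoring only leading/trailing spaces and tabs.

    Whitespace differences inside the line still count as changes.
    """
    return a.strip(" \t") == b.strip(" \t")


if __name__ == "__main__":
    print(equal_ignoring_edge_whitespace("    aaa bbb   ", "aaa bbb"))
    print(equal_ignoring_edge_whitespace("    aaa  bbb ", "aaa bbb"))
```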

Cheers,

-- 
Johan

Received on 2010-12-02 21:22:17 CET