Note: This email relates only tangentially to svn diff and is more
about reverse token scanning in general:
As someone who has implemented suffix reverse token scanning before:
* It simply isn't possible in DBCS code pages. Stick to byte-only
scanning there.
SBCS and UTF-16 make reverse token scanning relatively
straightforward. UTF-8 is a little trickier but still tractable.
At least UTF-8 is tractable in a way that DBCS isn't: you can always
tell whether a byte starts a code point or continues one, so you can
work out which part of a code point you are in (byte 4 vs. byte 3,
etc.).
* I would recommend only supporting a subset of the diff options for
reverse token scanning, e.g. ignore whitespace/ignore eol, but skip
ignore case (if svn has that, I forget...).
If tokens include keyword expansion operations, then stop once you
hit one. The potential for bugs outweighs the perf gain in my mind
here.
* Suffix scanning really does require a seekable stream. If the stream
isn't seekable, then don't perform the reverse scanning. It is only an
optimization after all.
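To make the UTF-8 and seekability points concrete, here is a rough
Python sketch. It is not Subversion code, and the function names and
the chunk size are made up for illustration. It finds the identical
byte suffix of two inputs, shrinks it so it never starts in the middle
of a UTF-8 code point (assuming valid UTF-8), and skips the whole
optimization when either stream is not seekable. It only looks at the
last chunk of each stream.

```python
import io

def is_utf8_continuation(byte):
    # UTF-8 continuation bytes always have the form 0b10xxxxxx.
    return (byte & 0xC0) == 0x80

def common_suffix_len(a, b):
    """Byte length of the identical suffix of a and b, adjusted so the
    suffix starts on a code point boundary (assumes valid UTF-8)."""
    n = 0
    limit = min(len(a), len(b))
    while n < limit and a[-1 - n] == b[-1 - n]:
        n += 1
    # If the suffix begins with a continuation byte, it begins inside a
    # code point, so shrink it until it starts on a lead or ASCII byte.
    while n > 0 and is_utf8_continuation(a[len(a) - n]):
        n -= 1
    return n

def stream_suffix_len(fa, fb, chunk=4096):
    """Suffix length over the last `chunk` bytes of two binary streams.
    Returns 0 (no optimization) if either stream cannot seek."""
    if not (fa.seekable() and fb.seekable()):
        return 0
    tails = []
    for f in (fa, fb):
        size = f.seek(0, io.SEEK_END)   # seek() returns the new position
        f.seek(max(0, size - chunk))
        tails.append(f.read())
    return common_suffix_len(tails[0], tails[1])
```

A real implementation would keep reading earlier chunks while the
tails match completely; this sketch stops after one chunk to stay
short.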
Additional ignore whitespace related comment:
* IIRC, Perforce had an interesting twist on ignoring whitespace. You
could ignore just leading/trailing whitespace on a line instead of all
whitespace differences, while still paying attention to any whitespace
change that remains after the "trim" operation has completed.
e.g.:
* " aaa bbb " vs "aaa bbb" would compare as equal
* " aaa bbb " vs "aaa bbb" would compare as equal
* " aaa  bbb " vs "aaa bbb" would compare as non-equal due to the
white space change in the middle of the line
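A minimal Python sketch of that trim-then-compare behaviour, written
from the description above rather than from anything Perforce ships
(the function name and the set of characters treated as whitespace
are assumptions):

```python
def equal_ignoring_edge_whitespace(a, b):
    """Compare two lines, ignoring only leading/trailing whitespace.
    Whitespace differences inside the line still count as changes."""
    edge_ws = " \t\r\n"  # assumed whitespace set for this sketch
    return a.strip(edge_ws) == b.strip(edge_ws)
```

Because only the edges are stripped, a run of two spaces between
"aaa" and "bbb" still differs from a single space.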
Fyi,
Bill
Received on 2010-12-02 18:19:00 CET