[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: diff-optimizations-tokens branch: I think I'm going to abandon it

From: Bill Tutt <bill_at_tutts.org>
Date: Thu, 2 Dec 2010 12:18:20 -0500

Note: This email only tangentially relates to svn diff and more about
reverse token scanning in general:

As someone who has implemented suffix reverse token scanning before:

* It simply isn't possible in DBCS code pages. Stick to byte only here.
   SBCS and UTF-16 make reverse token stuff relatively
straightforward. UTF-8 is a little trickier but still tractable.
   At least UTF-8 is tractable in a way that DBCS isn't. You always
know which part of a Unicode code point you are in. (i.e. byte 4 vs.
byte 3 vs. etc...)

* I would recommend only supporting a subset of the diff options for
reverse token scanning. i.e. ignore whitespace/ignore eol but skip
ignore case (if svn has that, I forget...)
   If tokens include keyword expansion operations then stop once you
hit one. The possible source of bugs outways the perf gain in my mind
here.
* Suffix scanning does really require a seekable stream, if it isn't
seekable then don't perform the reverse scanning.  It is only an
optimization after all.

Additional ignore whitespace related comment:
* IIRC, Perforce had an interesting twist on ignoring whitespace. You
could ignore just line leading/ending whitespace instead of all
whitespace differences but pay attention to any whitespace change
after the "trim" operation had completed.

e.g.:
* "    aaa bbb   " vs "aaa bbb" would compare as equal
* "    aaa  bbb  " vs "aaa  bbb" would compare as equal
* "    aaa  bbb  " vs "aaa bbb" would compare as non-equal due to the
white space change in the middle of the line

Fyi,
Bill
Received on 2010-12-02 18:19:00 CET

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.