Re: Proposed resolution: Standardizing on UTF-8 isn't enough

From: Matthias Wächter <matthias.waechter_at_tttech.com>
Date: 2007-07-19 17:24:28 CEST

Erik,

I am neither a Unicode nor a Subversion (developer) expert, but let
me make my (verbose) point anyway.

On 18.07.2007 16:15, Erik Huelsmann wrote:
> Unicode has 2 different representations

There is no such thing as 'two different representations' of a
Unicode string. A Unicode string can be in NFC, in NFD, or a mixture
of both. And there is more. As Unicode is not just a character
encoding format but rather a text description standard, it contains
many more features that can be (mis)used--by the user or by the
operating system--to have one displayed text be represented by
multiple Unicode code point sequences. If you start playing around
with normalization, you have to deal with those as well. Let me give
some examples.

1. Think of a soft hyphen (U+00AD) [1]. Is it displayed
unambiguously in each operating system? When should it appear on
screen? Should Svn filter it, or should Svn replace it with an
unambiguous representation like a normal hyphen?

2. What about unprinted Unicode control sequences (pick one from
[7], but there are more)? Should we remove them, escape them, or use
another approach to make them unambiguously visible?

3. What about homoglyphs [4], Unicode characters that have (nearly)
the same display form as other characters? Should we normalize those?

4. What about duplicate characters such as the CJK compatibility
ideographs (like U+F900 vs. U+8C48) [5,6]?

5. What about Unicode characters that stand for a sequence of
several other characters and, once decomposed, _cannot_ be
re-translated to the single character? For example, U+3374 SQUARE
BAR [2] is a single code point representing the character sequence
'bar' in square format. The given decomposition is U+0062 U+0061
U+0072, which is the ASCII sequence 'bar'. Note that this is a
compatibility decomposition, so it is applied by NFKD/NFKC only;
plain NFD leaves U+3374 unchanged. Certainly, re-coding 'bar' to
NFC will result in no change. Do we want to disallow those? BTW: Is
this correct, does OS X translate U+3374 to this three-letter
sequence?

6. But there is more: look at text direction. Not only do Arabic and
Hebrew characters implicitly define text direction, but one can also
explicitly force a different text direction (in the display) by
embedding non-printing Unicode control characters [3]. A forced text
direction can thus produce the same output as the 'mirrored'
representation. For example, 'Subversion' can be encoded by

U+0053 (S) U+0075 (u) U+0062 (b) U+0076 (v) U+0065 (e) U+0072 (r)
U+0073 (s) U+0069 (i) U+006F (o) U+006E (n)

or by

U+202E (RLO) U+006E (n) U+006F (o) U+0069 (i) U+0073 (s) U+0072 (r)
U+0065 (e) U+0076 (v) U+0062 (b) U+0075 (u) U+0053 (S) U+202C (PDF)

Certainly, one can 'stack' multiple direction specifiers; you can
imagine how 'cryptic' and incomparable things can become. I don't
know, though, whether there is any operating system yet that
'correctly' displays both file name encodings the same. Once there
is, and we have already volunteered for NFC/NFD translation, we are
duty-bound to fix this issue as well.
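
To make some of the cases above concrete, here is a minimal sketch
using Python's standard unicodedata module. It is only an
illustration, not Subversion code; the helper name format_controls
is hypothetical. It shows that NFC and NFD yield different code
point sequences for the same text, that U+3374 is untouched by NFD
but expands to 'bar' under NFKD, and that the soft hyphen and the
direction overrides can be detected through their general category
Cf.

```python
import unicodedata

# The same visible text as two different code point sequences.
nfc = unicodedata.normalize("NFC", "W\u00e4chter")  # U+00E4 as one code point
nfd = unicodedata.normalize("NFD", "W\u00e4chter")  # 'a' + U+0308 COMBINING DIAERESIS

# U+3374 SQUARE BAR only has a compatibility decomposition.
square_bar = "\u3374"
nfd_bar = unicodedata.normalize("NFD", square_bar)    # unchanged
nfkd_bar = unicodedata.normalize("NFKD", square_bar)  # the ASCII string 'bar'


def format_controls(name):
    """List the format control characters (category Cf) found in a name."""
    return [f"U+{ord(c):04X}" for c in name if unicodedata.category(c) == "Cf"]


# 'Subversion' stored directly, and stored mirrored between RLO and PDF.
plain = "Subversion"
overridden = "\u202e" + plain[::-1] + "\u202c"

print(nfc == nfd, len(nfc), len(nfd))
print(nfd_bar == square_bar, nfkd_bar)
print(format_controls("soft\u00adhyphen"))
print(plain == overridden, format_controls(overridden))
```

The last line illustrates the point of example 6: the two names
compare as different strings, and no NFC/NFD normalization changes
that, because the override characters are not affected by
normalization.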

Just a final aspect: What if the user _wants_ to have NFC/NFD/mixed
file names for whatever purpose? What if he wants to have Unicode
'control characters' in them? Do we want to rule that out? For what
good reason? Subversion is not harmed by such file names, and some
operating systems are very compatible in how they treat Unicode
file names.

It cannot be Subversion's task to enforce any normalization to the
file names as it's not its task to enforce any other
conversion/translation of special Unicode characters. This is true
at least for the Subversion _core_. This said, there is nothing
wrong in supporting optional, configurable special modes that
convert or escape a set of or all ambiguous character sequences _at
the client_.

This is true not only for OS X where it seems that conversion
between NFD and NFC would solve the issue. It is required for any
operating system that cannot ensure the two main properties: 1.
unique file names in Unicode representation are unique in the file
system, and 2. all characters and character sequences are correctly
supported, irrespective of NFD/NFC/mixed normalization and
representation of nonprintable characters.

Leaving aside things like (back)slashes in file names, Linux is good
in all that. Unicode-enabled Windows breaks it by treating lower-
and upper-case letters as equal in the file system and by
disallowing more characters (my Win2k reports '"?*<>|:'), but it
seems to keep Unicode sequences as they are. I am not familiar with
OS X, but it is the first to actually break the Unicode sequence by
normalizing everything to NFD, which, to me, seems to be a design
flaw (or, maybe, a design flaw in Unicode to let them do that).
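
The first property (names that are unique in the repository stay
unique in the file system) can be checked mechanically. The
following Python sketch is an assumption-laden illustration: it
approximates an NFD-normalizing file system with
unicodedata.normalize and a case-insensitive one with
str.casefold, which is not exactly what OS X or Windows do
internally. The function names are hypothetical.

```python
import unicodedata
from collections import defaultdict


def nfd_key(name):
    """Approximate a file system that stores every name in NFD."""
    return unicodedata.normalize("NFD", name)


def casefold_key(name):
    """Approximate a file system that ignores letter case."""
    return name.casefold()


def collisions(names, key):
    """Group repository names that the given file system would merge."""
    groups = defaultdict(list)
    for name in names:
        groups[key(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


repo_names = [
    "Makefile",
    "makefile",
    "W\u00e4chter.txt",   # NFC form
    "Wa\u0308chter.txt",  # NFD form
]

print(collisions(repo_names, key=lambda n: n))  # byte-preserving: no clash
print(collisions(repo_names, key=nfd_key))      # the two W*chter.txt names clash
print(collisions(repo_names, key=casefold_key)) # Makefile and makefile clash
```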

My proposed solution: If the Subversion client is operated on a file
system that is known to have deficiencies in the area of the noted
Unicode compatibility issues, it _could_ offer file/directory name
translation.

1. For the Mac OS X NFC/NFD issue this could be a simple,
transparent normalization (which is then remembered somewhere in
.svn for proper re-translation when communicating with the
repository). The user can configure whether he wants no/NFC/NFD
translation on the repository.

2. More difficult cases, like names that differ only in lower/upper
case (Windows), OS-dependent disallowed characters, file name
length or unprintable characters, should be dealt with by renaming.
Depending on the issue, the offending characters could be replaced
by '%' sequences like in HTTP requests, or the file name gets
punycoded or shortened.

3. File name collisions due to local file name mapping must be dealt
with either automatically (append some random number until it is
unique) or manually (UI question: overwrite, rename, ...) - remember
"PROGRA~1" introduced by VFAT.

4. The UI must report if the local file name differs from the 'real'
name and should be configurable to err upon any renaming.
Unambiguous translations to/from a local charset are certainly
okay, although they remain an issue for references that are not
converted along with the names (e.g. in Makefiles).
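
The four points above could be sketched roughly as follows in
Python. Everything here is hypothetical and only illustrates the
idea: the names to_local and map_names, the '~N' suffix, and the
returned dictionary (standing in for whatever would be remembered
in .svn) are my assumptions, not an existing Subversion interface.
For collisions the sketch uses a counter instead of a random
number, so that the result is repeatable.

```python
import unicodedata
from urllib.parse import quote

# Characters my Win2k refuses in file names, as quoted earlier in this mail.
WINDOWS_FORBIDDEN = '"?*<>|:'


def to_local(repo_name, form=None, forbidden=""):
    """Translate one repository name into a name the local file system accepts.

    form is None, "NFC" or "NFD"; forbidden lists characters to '%'-escape.
    """
    name = unicodedata.normalize(form, repo_name) if form else repo_name
    # '%' itself is escaped too, so escaped and literal names cannot clash.
    return "".join(
        quote(ch, safe="") if ch == "%" or ch in forbidden else ch
        for ch in name
    )


def map_names(repo_names, form=None, forbidden="", case_insensitive=False):
    """Return {local name: repository name}, resolving local collisions."""
    def key(name):
        return name.casefold() if case_insensitive else name

    local_to_repo = {}
    taken = set()
    for repo_name in repo_names:
        base = to_local(repo_name, form, forbidden)
        candidate, n = base, 1
        while key(candidate) in taken:
            candidate = f"{base}~{n}"  # point 3: automatic renaming
            n += 1
        taken.add(key(candidate))
        local_to_repo[candidate] = repo_name
    return local_to_repo


mapping = map_names(
    ["a?b", "Readme", "README"],
    form="NFC",
    forbidden=WINDOWS_FORBIDDEN,
    case_insensitive=True,
)
# Point 4: report every name whose local form differs from the real one.
renamed = {local: repo for local, repo in mapping.items() if local != repo}
print(mapping)
print(renamed)
```

Because normalization is not reversible in general, the repository
name is looked up in the remembered mapping rather than recomputed
from the local name.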

- Matthias

[1] http://www.fileformat.info/info/unicode/char/00ad/index.htm
[2] http://www.fileformat.info/info/unicode/char/3374/index.htm
[3] http://www.unicode.org/reports/tr9/
[4] http://en.wikipedia.org/wiki/Homoglyph
[5] http://www.fileformat.info/info/unicode/char/f900/index.htm
[6] http://www.fileformat.info/info/unicode/char/8c48/index.htm
[7] http://www.fileformat.info/info/unicode/category/Cf/list.htm

Received on Thu Jul 19 17:23:42 2007
