
Re: Proposed resolution: Standardizing on UTF-8 isn't enough

From: Matthias Wächter <matthias.waechter_at_tttech.com>
Date: 2007-07-19 18:15:31 CEST

On 19.07.2007 17:39, B. Smith-Mannschott wrote:
> On 7/19/07, Matthias Wächter <matthias.waechter@tttech.com> wrote:
>> 5. What about Unicode code groups that represent one NFC symbol but
>> multiple NFD symbols that _cannot_ be re-translated to NFC? For
>> example, U+3374 SQUARE BAR [2] is a single code to represent the
>> character sequence 'bar' in square format. The given decomposition
>> is U+0062 U+0061 U+0072 which is the ASCII sequence 'bar'.
>> Certainly, re-coding to NFC will result in no change. Do we want to
>> disallow those? BTW: Is this correct, does OS X translate U+3374 to
>> this three-letter sequence?
>
> This is misleading. It's true for the NFKC and NFKD, the
> "compatibility" normalizations, which are lossy by design. NFD does
> not decompose SQUARE BAR.

Thanks for pointing this out. Just verified with Python.

>>> from unicodedata import normalize
>>> normalize('NFD',u'\u3374')
u'\u3374'
>>> normalize('NFC',u'\u3374')
u'\u3374'
>>> normalize('NFKC',u'\u3374')
u'bar'

> Do you know of an example where NFD->NFC->NFD is lossy?

Some of my 'knowledge' is from [1], which states that normalization [4]
can be one-way, at least for old and replacement symbols. E.g. this
applies to U+212B ANGSTROM SIGN (formerly ANGSTROM UNIT) [2], which is
canonically equivalent to U+00C5 LATIN CAPITAL LETTER A WITH RING
ABOVE [3], which in turn decomposes to U+0041 U+030A. So the first
normalization ->NFD results in U+0041 U+030A, and finishing the cycle
->NFC->NFD rebuilds U+00C5 and then U+0041 U+030A again, but never
U+212B. Apparently, normalizing to NFC already contains normalizing
to NFD as a first step.
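
A minimal sketch of that cycle, assuming Python 3 and its standard
unicodedata module (the helper name codepoints is only for
illustration). It shows that the loss happens in the very first
normalization step, not later in the cycle:

```python
import unicodedata

angstrom = "\u212b"  # ANGSTROM SIGN

# Walk the cycle ->NFD ->NFC ->NFD starting from the original character.
nfd = unicodedata.normalize("NFD", angstrom)
nfc = unicodedata.normalize("NFC", nfd)
nfd_again = unicodedata.normalize("NFD", nfc)

def codepoints(s):
    """Render a string as a list of U+XXXX code points (illustrative helper)."""
    return " ".join("U+%04X" % ord(c) for c in s)

print("NFD:      ", codepoints(nfd))
print("NFC:      ", codepoints(nfc))
print("NFD again:", codepoints(nfd_again))
print("U+212B recovered:", angstrom in (nfd, nfc, nfd_again))
```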

Interestingly, ANGSTROM SIGN, as a unit, should not have a
lower-case representation. But as a Latin capital letter A with ring
above it no longer carries the unit meaning, so a lower-case variant,
U+00E5, is available. Actually, even for ANGSTROM SIGN itself this
lower-case mapping is given. OTOH, there is no lower-case
representation for U+2103 DEGREE CELSIUS [5]. Weird.
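
The case mappings can be checked the same way. This is only a sketch,
assuming Python 3, whose str.lower() follows the Unicode case
mappings; the variable names are made up for the example:

```python
import unicodedata

angstrom_sign = "\u212b"   # ANGSTROM SIGN
a_ring = "\u00c5"          # LATIN CAPITAL LETTER A WITH RING ABOVE
degree_celsius = "\u2103"  # DEGREE CELSIUS

for ch in (angstrom_sign, a_ring, degree_celsius):
    lowered = ch.lower()
    print("U+%04X (%s) lower-cases to U+%04X" % (
        ord(ch), unicodedata.category(ch), ord(lowered)))
```

The general category printed in parentheses hints at the reason:
ANGSTROM SIGN is classified as an upper-case letter, while DEGREE
CELSIUS is classified as a symbol.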

Similarly, U+F900 CJK COMPATIBILITY IDEOGRAPH-F900 is NFD-normalized
to U+8C48 'how? what?', which remains U+8C48 under NFC. No way
back to U+F900.
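
To find out which characters are affected, one could test whether a
character comes back unchanged from NFD followed by NFC. A rough
sketch, again assuming Python 3; survives_canonical_roundtrip is a
hypothetical helper, not anything Subversion provides:

```python
import unicodedata

def survives_canonical_roundtrip(ch):
    """Return True if NFC(NFD(ch)) gives back the original character."""
    decomposed = unicodedata.normalize("NFD", ch)
    return unicodedata.normalize("NFC", decomposed) == ch

# Characters mentioned in this thread.
samples = ["\uf900", "\u212b", "\u00c5", "\u3374"]
results = {"U+%04X" % ord(ch): survives_canonical_roundtrip(ch)
           for ch in samples}

for name, ok in results.items():
    print(name, "round-trips" if ok else "is replaced for good")
```

U+F900 and U+212B fail the check, while U+00C5 and U+3374 pass it.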

- Matthias

[1] (German)
http://www.c-plusplus.de/forum/viewtopic-var-t-is-161855.html
[2] http://www.fileformat.info/info/unicode/char/212B/index.htm
[3] http://www.fileformat.info/info/unicode/char/00c5/index.htm
[4] http://www.unicode.org/reports/tr15/
[5] http://www.fileformat.info/info/unicode/char/2103/index.htm

-- 
Matthias Wächter - Senior Chip Designer
TTTech Computertechnik AG - Time-Triggered Technology
Commercial Reg. No.: 165 664z, Commercial Court Vienna
Schoenbrunner Strasse 7, A-1040 Vienna, Austria
tel:+43-1-5853434;ext=36 fax:+43-1-5853434;ext=90
mailto:matthias.waechter_at_tttech.com http://www.tttech.com
Received on Thu Jul 19 18:14:42 2007

This is an archived mail posted to the Subversion Dev mailing list.