Standardizing on UTF8 internally isn't enough

From: Erik Huelsmann <ehuels_at_gmail.com>
Date: 2007-07-17 22:18:07 CEST

This is the mail about the second issue, which is actually a Unicode
issue but happens to show up on Macs more than elsewhere.

Why isn't it enough to standardize on UTF-8? Doesn't UTF-8 have 1
byte sequence per Unicode codepoint? Surely it does, but Unicode itself
has 2 ways to write some characters. Take for example u-umlaut.
There's a Unicode codepoint which says 'u-with-umlaut' (U+00FC). But
Unicode also knows a combining codepoint which means
'previous-character-has-umlaut' (U+0308). Now, the u-with-umlaut can be
written in 2 ways:

1) 'u-with-umlaut' codepoint (1 codepoint)
2) 'u', 'previous-char-has-umlaut' codepoints (2 codepoints)

(1) and (2) mean the same, but lead to different byte sequences in
*all* Unicode encodings and UTF-8 is no exception.
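
To make the difference concrete, here is a minimal Python sketch (Python
is my choice for illustration, it is not what Subversion uses). It
spells u-umlaut both ways and prints the resulting bytes in two
encodings. Note that in actual Unicode the combining mark (U+0308)
follows the base letter.

```python
# Two spellings of the same visible character, u-umlaut.
composed = "\u00fc"      # (1) LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "u\u0308"   # (2) 'u' followed by COMBINING DIAERESIS

for label, text in (("composed", composed), ("decomposed", decomposed)):
    # Show codepoint count and the byte sequences in UTF-8 and UTF-16.
    print(label,
          len(text),
          text.encode("utf-8").hex(),
          text.encode("utf-16-be").hex())

# The strings render identically but do not compare equal.
print(composed == decomposed)
```

The two spellings differ in every encoding shown, so a plain byte or
string comparison treats them as different names.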

Conclusion: even though we standardized on Unicode to eliminate
character encoding problems, we now have new character encoding
problems.

(1) is known as composed (or pre-composed) and (2) is known as decomposed.

Why is this problem mainly showing up on the Mac? Well, Linux and
Windows have standardized (non-enforcing) on (1), which is known under
the acronym NFC (Normalization Form C, canonical composition), and the
Mac has standardized on (2), NFD (Normalization Form D, canonical
decomposition). The Mac however enforces this convention: when you
create a file with a name in NFC form and then read back the directory
entries, you get a file name which looks exactly the same, but has a
different byte sequence (NFD).
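
The round trip can be imitated without a Mac. The sketch below is only a
simulation: name_as_stored() is a hypothetical stand-in for an enforcing
file system, and it uses plain NFD from Python's unicodedata module,
which may differ in detail from what the Mac file system really does.

```python
import unicodedata

def name_as_stored(name: str) -> str:
    """Hypothetical stand-in for a file system that decomposes names."""
    return unicodedata.normalize("NFD", name)

created = "M\u00fcnchen.txt"      # name as the application wrote it (NFC)
listed = name_as_stored(created)  # name as a directory listing returns it

print(created, listed)            # look the same when displayed
print(len(created), len(listed))  # but differ in codepoint count
print(created.encode("utf-8") == listed.encode("utf-8"))
```

Normalizing the listed name back to NFC recovers the original string,
which is why agreeing on one normal form would remove the mismatch.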

That last bit utterly confuses Subversion, because it has a list of
files to expect in its admin data, and that list no longer matches
the names it gets back from the OS.
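
A rough sketch of that confusion, assuming the expected names are kept
in a simple set (the real admin data is more involved, and the names
here are made up for the example):

```python
import unicodedata

# Names recorded in (hypothetical) admin data, in composed form.
expected = {"r\u00e9sum\u00e9.txt", "README"}

# Names as a decomposing file system might report them.
on_disk = {unicodedata.normalize("NFD", name) for name in expected}

# Exact string matching splits one file into two apparent problems.
missing = expected - on_disk       # recorded, but "not found" on disk
unversioned = on_disk - expected   # on disk, but "not known" to us

print(sorted(missing))
print(sorted(unversioned))
```

The pure-ASCII name is unaffected, which is why the problem only shows
up for names containing characters that have a decomposed form.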

Because we defined we'd be using UTF-8, but didn't define whether we'd
be using NFC or NFD, we now have a problem: we defined where we wanted
the door in our house, but not what its size would be :-)

We are not the only project with this problem however (and note that
it's not the Mac which causes this problem, using Unicode is): IBM
created the ICU project (http://icu-project.org/) to address all kinds
of i18n problems including Unicode normalization, collation etc.

This problem could be solved by adding the ICU lib as a dependency
and changing all path comparisons to use the ICU normal-form-agnostic
comparison routines.
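
The idea of a normal-form-agnostic comparison can be sketched as
follows. This is not ICU: it uses Python's standard unicodedata module
as a stand-in, and normalized_key() and paths_equal() are names I made
up for the illustration.

```python
import unicodedata

def normalized_key(path: str) -> str:
    """Map a path to one canonical form (NFC chosen arbitrarily here)."""
    return unicodedata.normalize("NFC", path)

def paths_equal(a: str, b: str) -> bool:
    """Compare two paths regardless of composed/decomposed spelling."""
    return normalized_key(a) == normalized_key(b)

nfc_path = "trunk/\u00fcber.txt"    # composed spelling
nfd_path = "trunk/u\u0308ber.txt"   # decomposed spelling

print(nfc_path == nfd_path)             # plain comparison
print(paths_equal(nfc_path, nfd_path))  # normalization-aware comparison
```

The same key function could be applied before names are used for
lookups, so that both spellings land on the same entry.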

I hope this explains the problem (and problem domain) to anybody who
has never delved into Unicode before.

bye,

Erik.

Received on Tue Jul 17 22:17:18 2007
