This is the mail about the second issue, which is actually a Unicode
issue, but happens to show up on Macs more so than elsewhere.
Why isn't it enough to standardize on UTF-8? Doesn't UTF-8 have exactly
1 byte sequence per Unicode codepoint? Surely it does, but Unicode
itself has 2 ways to write some characters. Take for example u-umlaut.
There's a Unicode codepoint which says 'u-with-umlaut'. But Unicode
also knows a codepoint which means 'previous-character-has-umlaut' (a
combining character, which follows the character it modifies). Now,
the u-with-umlaut can be written in 2 ways:
1) 'u-with-umlaut' codepoint (1 codepoint)
2) 'u', 'previous-char-has-umlaut' codepoints (2 codepoints)
(1) and (2) mean the same, but lead to different byte sequences in
*all* Unicode encodings and UTF-8 is no exception.
Conclusion: even though we standardized on Unicode to eliminate
character encoding problems, we now have new character encoding
problems.
(1) is known as composed (or pre-composed) and (2) is known as decomposed.
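
To make the two spellings concrete, here is a small Python sketch using
the standard unicodedata module. The variable names are mine, chosen
for illustration only; the codepoints are U+00FC (u-with-umlaut) and
U+0308 (the combining umlaut).

```python
import unicodedata

# (1) composed: a single codepoint, U+00FC
composed = "\u00fc"
# (2) decomposed: 'u' followed by U+0308 COMBINING DIAERESIS
decomposed = "u\u0308"

# Both render as the same character, but they are different strings...
print(composed == decomposed)        # False

# ...and therefore different byte sequences in UTF-8.
print(composed.encode("utf-8"))      # b'\xc3\xbc'
print(decomposed.encode("utf-8"))    # b'u\xcc\x88'

# Normalizing converts one spelling into the other.
print(unicodedata.normalize("NFC", decomposed) == composed)    # True
print(unicodedata.normalize("NFD", composed) == decomposed)    # True
```
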
Why is this problem mainly showing up on the Mac? Well, Linux and
Windows have standardized (non-enforcing) on (1), which is known under
the acronym NFC (Normalization Form C, canonical composition), and the
Mac has standardized on (2), known as NFD. The Mac however enforces
this convention, meaning that when you create a file with a name in
NFC form and then read back the directory entries, you get a file name
which looks exactly the same, but has a different byte sequence (NFD).
That last bit utterly confuses Subversion, because it has a list of
files to expect in its admin data, which now doesn't correspond
anymore with the names it gets back from the OS.
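
The mismatch can be imitated without a Mac. The sketch below is only a
simulation, not Subversion's actual code: the list named "listed"
stands in for what an NFD-enforcing filesystem would hand back, and
missing_entries() is a hypothetical helper which does the kind of
exact string comparison that gets confused.

```python
import unicodedata

def missing_entries(expected, listed):
    """Return the expected names not found by exact string comparison."""
    listed_set = set(listed)
    return [name for name in expected if name not in listed_set]

# The name as recorded in the admin data, in NFC form.
expected = ["f\u00fcr.txt"]

# What an NFD-enforcing filesystem would return for the same file
# (simulated here by normalizing; no real directory is read).
listed = [unicodedata.normalize("NFD", name) for name in expected]

# The file exists, yet the exact comparison reports it as missing.
print(missing_entries(expected, listed) == expected)    # True
```
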
Because we defined we'd be using UTF-8, but didn't define we'd be
using NFC or NFD, we now have a problem: we defined where we wanted
the door in our house, but not what its size would be :-)
We are not the only project with this problem however (and note that
it's not the Mac which causes this problem, using Unicode is): IBM
created the ICU project (http://icu-project.org/) to address all kinds
of i18n problems including Unicode normalization, collation etc.
This problem could be solved by adding the ICU lib as a dependency
and changing all path comparisons to use the ICU normal-form-agnostic
comparison functions.
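
As a rough idea of what a normal-form-agnostic comparison does, here is
a sketch in Python. Note that it uses the standard unicodedata module
as a stand-in, not ICU, and same_path() is a name I made up for
illustration: both sides are brought to one normal form before being
compared.

```python
import unicodedata

def same_path(a, b):
    """Compare two path strings regardless of NFC/NFD spelling."""
    # Which form is chosen does not matter, as long as both sides
    # use the same one; NFC is picked here arbitrarily.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

nfc_name = "f\u00fcr.txt"     # composed spelling
nfd_name = "fu\u0308r.txt"    # decomposed spelling

print(nfc_name == nfd_name)              # False: exact comparison fails
print(same_path(nfc_name, nfd_name))     # True: same name after normalizing
print(same_path(nfc_name, "fur.txt"))    # False: genuinely different name
```
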
I hope this explains the problem (and problem domain) to anybody who
never delved into Unicode before.
Received on Tue Jul 17 22:17:18 2007