On Tuesday 17 July 2007 16:06, you wrote:
> Can somebody give a brief overview of this issue from square one,
> aimed at folks who aren't incredibly familiar with the details of
> Unicode? If we're proposing adding a new dependency it would be nice
> if more people understood the issue. (I gather it relates to the
> question of whether u-with-umlaut is stored as a thingy that says "u
> with umlaut" vs a thingy that says "the next letter has an umlaut"
> followed by a "u".)
Yes, that's basically it, with one detail reversed: the combining mark comes
after the letter, not before it. I'll start over anyway (but beware: I'm by no
means an expert; I hope I get it right).
Suppose you want to encode the character "ü" (umlaut-u). You can do it in two
ways in Unicode. The "composed" form is just one code point: umlaut-u
(U+00FC). The "decomposed" form is two code points: a plain Latin "u"
(U+0075) followed by a combining diaeresis (U+0308), which means "add an
umlaut to the preceding character". A normal strcmp will say that the two
representations are different because they are two different byte sequences.
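
To make the difference concrete, here is a small sketch in Python (chosen only for illustration; the variable names are made up). It builds both forms of "ü" and shows that a plain comparison treats them as different strings. Note that in the decomposed form the combining mark (U+0308) follows the base letter.

```python
# Composed form: a single code point, U+00FC (LATIN SMALL LETTER U WITH DIAERESIS).
composed = "\u00fc"

# Decomposed form: plain "u" (U+0075) followed by U+0308 (COMBINING DIAERESIS).
decomposed = "u\u0308"

# Both render as "ü", but a plain comparison says they differ.
print(composed == decomposed)        # False

# The UTF-8 byte sequences differ too, which is what a byte-wise strcmp sees.
print(composed.encode("utf-8"))      # b'\xc3\xbc'
print(decomposed.encode("utf-8"))    # b'u\xcc\x88'
```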
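Both forms have their advantages and disadvantages. Windows wants to store
Unicode filenames in "composed" form, while Mac OS X wants to store the
filenames in "decomposed" form. This leads to problems: the same filename can
end up as two different byte sequences depending on which platform produced it.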
There are various solutions to this problem, but they all more or less require
some "real" Unicode handling: either you use a strcmp that doesn't complain
when one string is composed and the other decomposed, or you normalize each
and every filename to an agreed-on representation. As far as I can see the
latter is probably less error-prone (filenames enter the system at only a few
well-known points) and requires fewer code changes, especially if the
composed representation is used: that already works on Windows and Linux, so
"only" the Mac OS X client would need to normalize as well.
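
Both approaches can be sketched in a few lines of Python using the standard unicodedata module. This is only an illustration of the idea, not a proposal for the actual code: normalize_filename and same_filename are hypothetical names, and "NFC" is the Unicode name for the composed normalization form.

```python
import unicodedata

def normalize_filename(name: str) -> str:
    """Return the filename in composed form (NFC); hypothetical helper."""
    return unicodedata.normalize("NFC", name)

def same_filename(a: str, b: str) -> bool:
    """Compare two filenames regardless of composed/decomposed form."""
    return normalize_filename(a) == normalize_filename(b)

composed = "\u00fcber.txt"       # "über.txt" with a single U+00FC
decomposed = "u\u0308ber.txt"    # "über.txt" as "u" + U+0308

print(composed == decomposed)                      # False
print(same_filename(composed, decomposed))         # True
print(normalize_filename(decomposed) == composed)  # True
```

Normalizing once at each point where a filename enters the system means the rest of the code can keep using a plain byte-wise comparison.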
Received on Tue Jul 17 16:41:05 2007