From: Jim Blandy [mailto:email@example.com]
> "Bill Tutt" <firstname.lastname@example.org> writes:
> > Several comments/questions:
> > * By Unicode canonical decomposition, do you mean Normalization Form
> > D as noted in TR15? (http://www.unicode.org/unicode/reports/tr15/)
> > I ask because canonical decomposition results in all combined
> > characters being expanded into their component forms. i.e. An
> > umlauted lower case u turns into two characters: a lowercase u
> > followed by a combining umlaut. I ask, because you really wouldn't
> > want to implement the wrong normalization algorithm. :) TR15 also
> > states the following:
> > "The W3C Character Model for the World Wide Web [CharMod] requires
> > the use of Normalization Form C for XML and related standards (this
> > document is not yet final, but this requirement is not expected to
> > change). See the W3C Requirements for String Identity, Matching, and
> > String Indexing [CharReq] for more background."
> I was punting. I knew that there were several ways to represent
> composite characters, and assumed that there was some form recommended
> for use in names that needed to be matched. From what you say, it
> sounds like there are several. (Joy.)
Well, there are 4 different ways to handle composite characters.
Form D: Canonical Decomposition.
Form C: Canonical Decomposition followed by Canonical Composition.
Form KD: Compatibility Decomposition.
Form KC: Compatibility Decomposition followed by Canonical Composition.
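To make the difference between the four forms concrete, here is a
minimal sketch using the unicodedata module from Python's standard
library. The sample string is made up for illustration: a precomposed
u-umlaut followed by the "fi" ligature, which has only a compatibility
decomposition.

```python
import unicodedata

# Precomposed u-umlaut (U+00FC) followed by the "fi" ligature (U+FB01).
# The ligature is untouched by the canonical forms and is only split
# apart by the compatibility (K) forms.
sample = "\u00fc\ufb01"

forms = {form: unicodedata.normalize(form, sample)
         for form in ("NFD", "NFC", "NFKD", "NFKC")}

for form, text in forms.items():
    # Print code points rather than glyphs so the differences are visible.
    print(form, " ".join(f"U+{ord(ch):04X}" for ch in text))
```

Forms D and KD leave the u as a base letter plus a combining mark,
while Forms C and KC put it back together; only the K forms expand the
ligature into a plain "f" and "i".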
> > * What do you mean by ordering? It didn't sound like you were
> > talking about a sorting order...
> No --- I was trying to refer to the ordering of the modifiers. It
> sounds like that is subsumed by the normalization form requirements
> you mention above.
> What I'm trying to do is put directory entries in some canonical form,
> so that directory entries don't become mysteriously invisible because
> different users chose different compositions/decompositions. What
> would you recommend that I say?
Well, since XML wants Form C, it seems to make sense for us to use
Form C as well.
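As a rough sketch of what "use Form C" could mean for directory
entries: the Directory class below is purely hypothetical (it is not
the filesystem's actual interface) and simply keys an in-memory table
on the Form C version of each name, so that composed and decomposed
spellings of the same name land on the same entry.

```python
import unicodedata


class Directory:
    """Hypothetical in-memory directory keyed by Form C (NFC) names."""

    def __init__(self):
        self._entries = {}

    @staticmethod
    def _canonical(name):
        # Every name passes through here before it touches the table.
        return unicodedata.normalize("NFC", name)

    def add(self, name, node):
        self._entries[self._canonical(name)] = node

    def lookup(self, name):
        return self._entries.get(self._canonical(name))


d = Directory()
d.add("u\u0308ber.txt", "node-1")   # stored using a decomposed u-umlaut
found = d.lookup("\u00fcber.txt")   # looked up using the precomposed form
```

Both spellings reach the same entry, so a file created by one user
does not become invisible to another user whose tools compose
characters differently.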
> Another problem with that text (defining how Unicode should be used in
> filenames) is that it places the onus on the caller to put the name in
> the right form. The filesystem doesn't actually check the form. As a
> consequence, if somebody does it the wrong way, you get a mess. The
> comment places the blame outside the filesystem, but that doesn't help
> the poor user.
> So the filesystem should either check that the filenames are properly
> decomposed, or normalize them itself. The latter would be the easiest
> to use, but more work.
> Surely there are libraries for this. But I don't know where they are.
WRT what to do about the problem:
There are two immediate sources of code that I know of to help deal
with the problem:
One is the technical report itself. It provides some
sample/non-optimal code for how to do the appropriate logic, as well
as some possible optimization hints. (esp. if you only want to verify
that it complies with the normalization form, as opposed to actually
normalizing the string.)
The second is IBM's ICU project, but IIRC
this has a fairly funky license.
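To illustrate the check-versus-normalize choice, here is a small
sketch in Python using its standard unicodedata module (this is
neither the TR15 sample code nor ICU). The function names check_name
and normalize_name are made up for this example, and
unicodedata.is_normalized needs Python 3.8 or later.

```python
import unicodedata


def check_name(name):
    """Reject a name that is not already in Form C.

    This puts the onus on the caller, but only verifies the string.
    """
    if not unicodedata.is_normalized("NFC", name):
        raise ValueError("name is not in Normalization Form C: %r" % name)
    return name


def normalize_name(name):
    """Accept any name and convert it to Form C on the caller's behalf."""
    return unicodedata.normalize("NFC", name)


composed = "\u00fc"      # precomposed u-umlaut, already Form C
decomposed = "u\u0308"   # u followed by a combining diaeresis

checked = check_name(composed)
fixed = normalize_name(decomposed)
```

Calling check_name(decomposed) raises ValueError, which corresponds to
the "filesystem checks the form" option; normalize_name corresponds to
the "filesystem normalizes the name itself" option.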
An ancillary source of Unicode code (at least in terms of having a
nice small copy of the Unicode character database) is how Python goes
about it. Greg or I can dig up the appropriate part of that code if it
Received on Sat Oct 21 14:36:22 2006