
Re: Umlaut problem on Mac (composed vs. decomposed UTF-8)

From: Branko Čibej <brane_at_xbc.nu>
Date: 2007-07-16 01:23:50 CEST

B. Smith-Mannschott wrote:
> On Jul 15, 2007, at 14:34, Erik Huelsmann wrote:
>> On 7/15/07, B. Smith-Mannschott <benpsm@gmail.com> wrote:
>>> void
>>> normalize_utf8_composed(const char **path_utf8)
>>> {
>>> /* ... and then a miracle occurs ... */
>>> }
>>> Have I misunderstood the problem?
>> No, except that the problem is the part where the miracle occurs...
>> Someone needs to write it or to find a library which does it for us.
> I downloaded a copy of the unicode database to see what would be
> involved, but then I got to googling on the reasonable assumption that
> *someone* has surely invented this wheel already...
> How about ICU: http://www.icu-project.org/

ICU is exactly the thing to use. Of course it's huge, but for things
like this, I know nothing better. Note that Unicode normalization isn't
the only problem -- checking the validity of Unicode sequences is also
less than trivial. The same goes for string comparison: even if we assume
everything is normalized, it's nice to use a library that compares
correctly either way.
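
To illustrate why validity checking is less than trivial, here is a minimal sketch of a hand-rolled UTF-8 well-formedness check in plain C. It is not ICU code and not anything from Subversion; the function name is_valid_utf8 is made up for this example. It only shows the cases a validator has to reject: stray or missing continuation bytes, truncated sequences, overlong encodings, surrogates, and code points above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an ICU or Subversion function).
 * Returns 1 if the len bytes at s are well-formed UTF-8, 0 otherwise. */
int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);

    while (i < len) {
        unsigned char lead = s[i];
        unsigned long cp;   /* decoded code point */
        unsigned long min;  /* smallest code point allowed for this length */
        size_t need;        /* number of continuation bytes expected */
        size_t k;

        if (lead < 0x80) {              /* plain ASCII */
            i++;
            continue;
        } else if ((lead & 0xE0) == 0xC0) {
            need = 1; cp = lead & 0x1F; min = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            need = 2; cp = lead & 0x0F; min = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            need = 3; cp = lead & 0x07; min = 0x10000;
        } else {
            return 0;                   /* stray continuation or bad lead byte */
        }

        if (len - i - 1 < need)
            return 0;                   /* sequence is truncated */

        for (k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;               /* not a continuation byte */
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
        }

        if (cp < min)
            return 0;                   /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                   /* UTF-16 surrogate */
        if (cp > 0x10FFFF)
            return 0;                   /* beyond the Unicode range */

        i += need + 1;
    }
    return 1;
}
```

Note that this accepts both the composed and the decomposed spelling of an umlaut, since both are valid UTF-8; validity and normalization are separate checks, which is part of the argument for leaving both to a library.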

What Subversion should do is not only know about the encoding of file
names in the filesystem, but also the platform-specific normalization
form. AIUI, Windows always uses normalization form C; Mac uses form D.
Linux I've no idea about, but I suspect it uses form C.

Internally, Subversion should normalize all Unicode strings, and I'd
propose to use form C, since it's the most compact canonical representation.
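
As a small illustration of the composed vs. decomposed issue, the sketch below spells "ü" both ways as UTF-8 byte strings and compares them the naive way. The names UMLAUT_NFC, UMLAUT_NFD and paths_equal_bytewise are invented for this example; no normalization is performed here, it only shows what goes wrong without it.

```c
#include <assert.h>
#include <string.h>

/* "ü" in form C (composed): the single code point U+00FC. */
static const char UMLAUT_NFC[] = "\xC3\xBC";

/* "ü" in form D (decomposed): U+0075 'u' followed by U+0308
 * COMBINING DIAERESIS. */
static const char UMLAUT_NFD[] = "u\xCC\x88";

/* Hypothetical helper: compares two path strings byte by byte,
 * with no knowledge of Unicode normalization. Returns 1 if equal. */
int paths_equal_bytewise(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}
```

Here paths_equal_bytewise(UMLAUT_NFC, UMLAUT_NFD) returns 0 even though both strings display as the same character, so a name written in one form will not match the same name read back in the other. The composed spelling is also shorter in UTF-8 (2 bytes against 3), which is the compactness argument for form C.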

-- Brane

Received on Mon Jul 16 01:23:32 2007

This is an archived mail posted to the Subversion Dev mailing list.
