Re: Umlaut problem on Mac (composed vs. decomposed UTF-8)

From: Branko Čibej <brane_at_xbc.nu>
Date: 2007-07-16 01:23:50 CEST

B. Smith-Mannschott wrote:
>
> On Jul 15, 2007, at 14:34, Erik Huelsmann wrote:
>
>> On 7/15/07, B. Smith-Mannschott <benpsm@gmail.com> wrote:
>>>
>>>
>>>
>>> void
>>> normalize_utf8_composed(const char **path_utf8)
>>> {
>>> /* ... and then a miracle occurs ... */
>>> }
>>>
>>> Have I misunderstood the problem?
>>
>> No, except that the problem is the part where the miracle occurs...
>> Someone needs to write it or to find a library which does it for us.
>>
>
> I downloaded a copy of the unicode database to see what would be
> involved, but then I got to googling on the reasonable assumption that
> *someone* has surely invented this wheel already...
>
> How about ICU: http://www.icu-project.org/

ICU is exactly the thing to use. Of course it's huge, but for things
like this, I know nothing better. Note that Unicode normalization isn't
the only problem: checking the validity of Unicode sequences is also
less than trivial. String comparison is another. Even if we assume
everything is normalized, it's nice to use a library that doesn't care.
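
To give a feel for why validity checking is less than trivial, here is a
minimal sketch of a UTF-8 well-formedness check in plain C. The function
name utf8_is_well_formed is made up for illustration; it is not a
Subversion or ICU API. It only checks the byte-level encoding (lead and
continuation bytes, overlong forms, surrogates, the U+10FFFF limit) and
says nothing about normalization, which would still need a library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not a Subversion or ICU function): return true
   if the LEN bytes starting at S are well-formed UTF-8. */
bool
utf8_is_well_formed(const char *s, size_t len)
{
  const unsigned char *p = (const unsigned char *) s;
  size_t i = 0;

  assert(s != NULL || len == 0);

  while (i < len)
    {
      unsigned char c = p[i];
      unsigned long cp;   /* decoded code point */
      unsigned long min;  /* smallest code point allowed at this length */
      size_t extra, k;    /* number of continuation bytes expected */

      if (c < 0x80)
        {
          i++;            /* plain ASCII */
          continue;
        }
      else if ((c & 0xE0) == 0xC0)
        { extra = 1; cp = c & 0x1F; min = 0x80; }
      else if ((c & 0xF0) == 0xE0)
        { extra = 2; cp = c & 0x0F; min = 0x800; }
      else if ((c & 0xF8) == 0xF0)
        { extra = 3; cp = c & 0x07; min = 0x10000; }
      else
        return false;     /* stray continuation or invalid lead byte */

      if (len - i <= extra)
        return false;     /* sequence truncated at end of input */

      for (k = 1; k <= extra; k++)
        {
          if ((p[i + k] & 0xC0) != 0x80)
            return false; /* not a continuation byte */
          cp = (cp << 6) | (p[i + k] & 0x3F);
        }

      if (cp < min)
        return false;     /* overlong encoding */
      if (cp >= 0xD800 && cp <= 0xDFFF)
        return false;     /* UTF-16 surrogate, not a valid scalar value */
      if (cp > 0x10FFFF)
        return false;     /* beyond the Unicode range */

      i += extra + 1;
    }

  return true;
}
```

Both the composed and the decomposed spelling of an umlaut pass this
check, so well-formedness alone does not tell us anything about which
normalization form a path is in.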

What Subversion should do is know not only the encoding of file names
in the filesystem, but also the platform-specific normalization form.
AIUI, Windows always uses normalization form C; Mac uses form D.
I've no idea about Linux, but I suspect it uses form C.

Internally, Subversion should normalize all Unicode strings, and I'd
propose to use form C, since it's the most compact canonical representation.
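
As a small illustration of both points (the two forms are different
byte strings, and form C is the shorter one), here is a sketch using a
single u-umlaut. The names u_umlaut_nfc, u_umlaut_nfd and
paths_equal_bytewise are invented for this example. The byte values are
the UTF-8 encodings of U+00FC and of 'u' followed by U+0308.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* U+00FC LATIN SMALL LETTER U WITH DIAERESIS, precomposed (form C):
   two bytes in UTF-8. */
const char u_umlaut_nfc[] = "\xC3\xBC";

/* 'u' followed by U+0308 COMBINING DIAERESIS (form D):
   three bytes in UTF-8. */
const char u_umlaut_nfd[] = "u\xCC\x88";

/* Hypothetical helper: compare two paths byte by byte, which is all a
   plain strcmp() can do when nothing has been normalized first. */
bool
paths_equal_bytewise(const char *a, const char *b)
{
  assert(a != NULL && b != NULL);
  return strcmp(a, b) == 0;
}
```

With these definitions, strlen(u_umlaut_nfc) is 2 and
strlen(u_umlaut_nfd) is 3, and paths_equal_bytewise() reports the two
as different even though they display as the same character. That is
the mismatch a single internal normalization form would remove.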

-- Brane

Received on Mon Jul 16 01:23:32 2007
