
Re: MacOSX filename encoding issue

From: Martin Hauner <martin.hauner_at_gmx.net>
Date: 2006-04-23 17:52:34 CEST

Jesper Steen Møller wrote:
> Martin Hauner wrote:
>> Hi,
>>
>> while fixing "svn: Can't convert string from native encoding to 'UTF-8':"
>> errors in subcommander when using filenames with extended characters on
>> MacOSX I noticed some strange behaviour that is reproducible with the
>> svn command line tool (1.3.0).
>>[..]
>> First I'm using DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT ONE
>> (utf16: 278A, utf8: E2 9E 8A)
>>
>>
>> $ svn mkdir ➊
>> A ➊
>>
>> $ svn st
>> A ➊
>>
>> This is as expected. Now another character, the German umlaut ö.
>>
>> ö (utf16: 00F6, utf8: C3 B6)
> In Unicode lingo, this is "precomposed".
>
>> $ svn mkdir ö
>> A ö
>> $ svn st
>> ? ö
>> ! ö
>>
>> This is unexpected. It looks like status gets a different filename
>> when it reads the dir, so it thinks that the new dir is missing and that
>> there is an unversioned item of the same name.
> A normal user would head straight for the cupboard with the heavy-duty
> aspirin.

:)

>> The entries file in .svn looks good.
> By good, you mean precomposed?

Yes, a single precomposed ö entry.

>> Looking at the output of ll -B (works only with LANG unset) shows that
>> svn is really getting something different:
>>
>> drwxr-xr-x 3 hauner hauner 102 Apr 22 15:30 o\314\210
>> drwxr-xr-x 3 hauner hauner 102 Apr 22 18:11 \342\236\212
>>
>> The second line is digit one, and converting the numbers to hex delivers
>> its utf8 code. What should be the ö is something different (o + cc 88,
>> where cc 88 is a character with two dots: COMBINING DIAERESIS).
> This is your umlaut ö "decomposed". File systems on OSX are expected to
> do this (I know very little OSX stuff, but stumbled upon this:
> <http://developer.apple.com/qa/qa2001/qa1173.html>). This is NFD
> (normalization form "decomposed", as opposed to NFC, C for "composed").
> There are also NFKD and NFKC, which add "kompatibility" into the mix, for
> things like ligatures (whether fi and ff are single glyphs or not).
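
To make the two forms concrete, here is a minimal sketch. It assumes Python 3
and its standard unicodedata module, neither of which is mentioned in this
thread; the helper name utf8_hex is made up for illustration. It prints the
UTF-8 bytes of the precomposed ö and of the o + COMBINING DIAERESIS sequence
seen in the ll -B output (octal \314\210 is hex cc 88), and shows that
normalizing converts one into the other.

```python
import unicodedata

precomposed = "\u00f6"   # LATIN SMALL LETTER O WITH DIAERESIS (NFC form)
decomposed = "o\u0308"   # 'o' followed by COMBINING DIAERESIS (NFD form)

def utf8_hex(s):
    """Return the UTF-8 encoding of s as space-separated hex bytes."""
    return " ".join(f"{b:02x}" for b in s.encode("utf-8"))

print(utf8_hex(precomposed))       # c3 b6
print(utf8_hex(decomposed))        # 6f cc 88
print(precomposed == decomposed)   # False: a plain comparison sees two names

# Normalizing to a common form makes the two spellings identical.
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
```

The "False" line is the situation status appears to run into: the name
recorded by svn and the name returned by the file system differ byte for byte.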

Oh my... this sounds complicated.

And the page also says "Converting between precomposed and decomposed
Unicode text is a complicated process...". ;)

>[..]
>> I'm no Unicode expert, but I guess a 100% Unicode compatible program
>> (for example a text editor) would combine the o with COMBINING DIAERESIS
>> to display it as a single ö character?
>
> True. There is a three level system of compliance, dealing with how
> combining characters are used. In a way, Subversion supports it all (by
> storing full UTF-8), but it doesn't deal with normalization as you've
> discovered.
>> Now the question is (assuming my analysis is correct) whether it is
>> possible to work around this strange behaviour of the Mac filesystem?
>
> As you correctly suggest, yes: By normalizing before comparing.
>
>> It would be nice if there were a combining aware utf8strcmp that could
>> be used by svn. I don't know how hard it would be to write such a
>> function.
> It is probably easiest to convert to the same normalization form, and
> then compare codepoints (binary). I would go composing rather than
> decomposing, since you can optimize the operation by scanning the
> codepoints for combining characters and only do the composition if any
> are found. I'd probably avoid using the compatibility normalization
> forms since they lose information (e.g. superscript 2 -> 2)...
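
The suggestion above can be sketched roughly as follows. This is only an
illustration in Python 3 using the standard unicodedata module; the function
names has_combining, to_nfc and names_equal are hypothetical and are not
Subversion or APR APIs. The scan for combining characters is the shortcut
described in the quoted text. It is a heuristic, not a full NFC check: a few
code points without a combining class (for example U+212B ANGSTROM SIGN)
still change under NFC and would slip past it.

```python
import unicodedata

def has_combining(s):
    """True if any code point in s has a nonzero canonical combining class."""
    return any(unicodedata.combining(ch) for ch in s)

def to_nfc(s):
    """Compose s to NFC, skipping the work when no combining marks are found.

    Heuristic fast path only: it misses the rare strings that are not NFC
    yet contain no combining marks.
    """
    return unicodedata.normalize("NFC", s) if has_combining(s) else s

def names_equal(a, b):
    """Compare two names code point by code point after composing both."""
    return to_nfc(a) == to_nfc(b)

print(names_equal("\u00f6", "o\u0308"))   # True: precomposed vs. decomposed ö
print(names_equal("o", "\u00f6"))         # False: genuinely different names

# Compatibility forms lose information: superscript two becomes a plain 2.
print(unicodedata.normalize("NFKC", "x\u00b2") == "x2")        # True
print(unicodedata.normalize("NFC", "x\u00b2") == "x\u00b2")    # True
```

Composing rather than decomposing keeps the common case cheap, since names
without any combining marks are returned untouched.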

The link above points to another link that mentions a system function
CFStringNormalize that can convert decomposed to composed.

> Libraries to do normalization already exist:
>
> There's IBM's ICU for C: <http://icu.sourceforge.net/> (X license)
> There's UCData <http://crl.nmsu.edu/~mleisher/ucdata.html> ("freeware")
>
> See also the Unicode Howto,
> <http://www.tldp.org/HOWTO/Unicode-HOWTO-6.html> and Markus Kuhn's
> excellent UTF-8 and Unicode FAQ
> <http://www.cl.cam.ac.uk/~mgk25/unicode.html>.

Thanks for your info. It removed some confusion and added new confusion
at the same time ;-)

Anyway, I think it would be nice if this could be handled by Subversion
(or is it an APR issue?) because the 'ö' is a normal German character on
the keyboard, not some magic special character that's never used in
filenames. And there are probably a lot of other characters around
the non-English world that cause the same problem.

-- 
Martin
Subcommander, http://subcommander.tigris.org
a cross platform Win32/Unix/MacOSX subversion GUI client & diff/merge tool.
Received on Sun Apr 23 17:53:00 2006

This is an archived mail posted to the Subversion Dev mailing list.
