
Re: MacOSX filename encoding issue

From: Jesper Steen Møller <jesper_at_selskabet.org>
Date: 2006-04-23 02:10:26 CEST

Martin Hauner wrote:
> Hi,
>
> while fixing "svn: Can't convert string from native encoding to 'UTF-8':"
> errors in subcommander when using filenames with extended characters on
> MacOSX I noticed some strange behaviour that is reproducible with the
> svn command line tool (1.3.0).
>
> First thing that I have to do is set LANG so svn works at all. Without
> it svn complains with the above error.
>
> setlocale(LC_ALL, "") doesn't seem to work on MacOSX if LANG isn't set.
>
> First I'm using DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT ONE
> (utf16: 278A, utf8: E2 9E 8A)
>
>
> $ svn mkdir ➊
> A ➊
>
> $ svn st
> A ➊
>
> This is as expected, now another character, the German umlaut ö.
>
> ö (utf16: 00F6, utf8: C3 B6)
In Unicode lingo, this is "precomposed".
> $ svn mkdir ö
> A ö
> $ svn st
> ? ö
> ! ö
>
> This is unexpected. It looks like status gets a different filename
> when it reads the dir and thinks that the new dir is missing and that
> there is an unversioned item of the same name.
A normal user would head straight for the cupboard with the heavy-duty
aspirin.
> The entries file in .svn looks good.
By good, you mean precomposed?
> Looking at the output of ll -B (works only with LANG unset) shows that
> svn is really getting something different:
>
> drwxr-xr-x 3 hauner hauner 102 Apr 22 15:30 o\314\210
> drwxr-xr-x 3 hauner hauner 102 Apr 22 18:11 \342\236\212
>
> the second line is digit one and converting the numbers to hex delivers
> its utf8 code. What should be the ö is something different (o + cc 88,
> where cc 88 is a character with two dots: COMBINING DIAERESIS).
This is your umlaut ö "decomposed". File systems on OSX are expected to
do this (I know very little OSX stuff, but stumbled upon this:
<http://developer.apple.com/qa/qa2001/qa1173.html>). This is NFD
(normalization form "decomposed", as opposed to NFC, C for "composed").
There are also NFKD and NFKC, which add "kompatibility" into the mix, for
things like ligatures (whether fi and ff are single glyphs or not).
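
To make the byte values above concrete, here is a small sketch using
Python's standard unicodedata module (my choice for illustration; it is
not what svn uses). The helper name hexbytes is made up for this example.
It shows the two forms of ö as UTF-8 bytes, checks that the octal escapes
in the directory listing are the decomposed bytes, and shows that only
the compatibility forms touch a ligature.

```python
import unicodedata

def hexbytes(text):
    """Return the UTF-8 encoding of text as space-separated hex bytes."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))

precomposed = "\u00f6"                                  # ö as one code point (NFC)
decomposed = unicodedata.normalize("NFD", precomposed)  # o + U+0308 COMBINING DIAERESIS

print(hexbytes(precomposed))   # c3 b6
print(hexbytes(decomposed))    # 6f cc 88

# The listing showed o\314\210; those octal escapes are the bytes cc 88.
print(bytes([0o314, 0o210]) == decomposed.encode("utf-8")[1:])   # True

# Composing the decomposed form gives back the precomposed code point.
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True

# The "K" forms also fold compatibility characters such as the fi ligature.
ligature = "\ufb01"
print(unicodedata.normalize("NFC", ligature) == ligature)        # True
print(unicodedata.normalize("NFKC", ligature))                   # fi
```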

The Linux way is to go for NFC, from the unicode man page:

>Under Linux, in general only the BMP at implementation level 1 should
be used at the moment. Up to two combining characters per base character
for certain scripts (in particular Thai) are also supported by some
UTF-8 terminal emulators and ISO 10646 fonts (level 2), but in general
precomposed characters should be preferred where available (Unicode
calls this "Normalization Form C" ).
> I'm no Unicode expert but I guess a 100% Unicode compatible program
> (for example a text editor) would combine the o with COMBINING DIAERESIS
> to display it as a single ö character?
True. There is a three-level system of compliance, dealing with how
combining characters are used. In a way, Subversion supports it all (by
storing full UTF-8), but it doesn't deal with normalization as you've
discovered.
> Now the question is (assuming my analysis is correct) if it is possible
> to workaround this strange behaviour of the Mac filesystem?
As you correctly suggest, yes: By normalizing before comparing.
> It would be nice if there were a combining aware utf8strcmp that could
> be used by svn. I don't know how hard it would be to write such a
> function.
It is probably easiest to convert to the same normalization form, and
then compare codepoints (binary). I would go composing rather than
decomposing, since you can optimize the operation by scanning the
codepoints for combining characters and only do the composition if any
are found. I'd probably avoid using the compatibility normalization
forms since they lose information (e.g. superscript 2 -> 2)...
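
A rough sketch of that idea in Python, using the standard unicodedata
module. This is an illustration only, not Subversion code: the names
has_combining, compose_if_needed, same_name and status are invented here,
and status is just a toy model of matching versioned names against a
directory listing. Note that the combining-character scan is a
simplification: a few characters (for example U+212B ANGSTROM SIGN)
change under NFC without any combining mark being present, so a real
implementation would need a more careful quick check.

```python
import unicodedata

def has_combining(name):
    """True if any code point has a non-zero canonical combining class."""
    return any(unicodedata.combining(ch) for ch in name)

def compose_if_needed(name):
    """Compose to NFC, but only when a combining character was found."""
    return unicodedata.normalize("NFC", name) if has_combining(name) else name

def same_name(a, b):
    """Compare two names by code point after bringing both to one form."""
    return compose_if_needed(a) == compose_if_needed(b)

def status(versioned, on_disk, key=lambda name: name):
    """Toy model: return (missing, unversioned) names compared via key()."""
    v = {key(n) for n in versioned}
    d = {key(n) for n in on_disk}
    return sorted(v - d), sorted(d - v)

entries = ["\u00f6"]    # precomposed ö, as recorded at 'svn mkdir' time
listing = ["o\u0308"]   # decomposed ö, as handed back by the filesystem

# Raw comparison: one name looks missing and the other unversioned.
print(status(entries, listing) == (["\u00f6"], ["o\u0308"]))    # True

# Normalizing first makes both sides agree, so nothing is reported.
print(status(entries, listing, key=compose_if_needed))          # ([], [])

# The compatibility forms lose information: superscript two becomes "2".
print(unicodedata.normalize("NFKC", "x\u00b2"))                 # x2
```

Composing rather than decomposing means plain ASCII names and already
precomposed names skip the normalization call entirely in this sketch.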

Libraries to do normalization already exist:

There's IBM's ICU for C: <http://icu.sourceforge.net/> (X license)
There's UCData <http://crl.nmsu.edu/~mleisher/ucdata.html> ("freeware")

See also the Unicode Howto,
<http://www.tldp.org/HOWTO/Unicode-HOWTO-6.html> and Markus Kuhn's
excellent UTF-8 and Unicode FAQ
<http://www.cl.cam.ac.uk/~mgk25/unicode.html>.

I'd imagine a spectrum of

-Jesper

Received on Sun Apr 23 02:07:38 2006

This is an archived mail posted to the Subversion Dev mailing list.
