
Re: UTF-8 support for Unix with APR?

From: B. Blodau <b_blodau_at_hamburg.de>
Date: Wed, 13 Feb 2008 17:18:57 +0100

Hi,
just for your information:

Calling 'setlocale(LC_ALL, "en_us.UTF-8")' solved my problem on the
Mac. I can now commit and update files with umlauts or even Chinese
characters.
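For anyone who wants to check what such a call does on their own machine, here is a minimal sketch in Python rather than C. The helper name `try_utf8_locale` and the canonical spelling `en_US.UTF-8` are assumptions of this example, not something taken from the thread. It tries to select a UTF-8 locale and then asks the C runtime which codeset it reports. Whether the locale exists depends on the host, and `nl_langinfo` is only available on POSIX systems.

```python
import locale


def try_utf8_locale(name="en_US.UTF-8"):
    """Try to switch to the named locale.

    Returns the codeset reported by the C runtime afterwards,
    or None if the locale is not installed on this host.
    """
    try:
        locale.setlocale(locale.LC_ALL, name)
    except locale.Error:
        return None  # the requested locale is not available here
    # nl_langinfo(CODESET) is POSIX-only; it names the active character encoding.
    return locale.nl_langinfo(locale.CODESET)


codeset = try_utf8_locale()
print("codeset after setlocale:", codeset)
```

If the call succeeds, the reported codeset should be a UTF-8 one; if it returns None, the locale name is not known to the system.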

Thanks to everybody who helped!
Bert

On 13.02.2008 at 16:14, Erik Huelsmann wrote:

>>>>> The APR libraries handle file paths in the system locale. This means
>>>>> they *may* be encoded in UTF-8, but are not necessarily. When they are
>>>>> interpreted as UTF-8 depends on the LANG or LC_CTYPE settings in the
>>>>> host environment.
>>>>
>>>> This is broken. APR should switch to UTF-8 locales internally when it
>>>> deals with filenames (like what GNOME apps do). Otherwise this leads
>>>> to consistency problems when the user has both ISO-8859-1 and UTF-8
>>>> terminal sessions (the reason is that some applications and/or some
>>>> machines do not support multibyte character sets, and one wouldn't
>>>> want to mess everything up when running svn in degraded mode, i.e. with
>>>> ISO-8859-1 locales).
>>>
>>> No. The way (non-Mac) unices deal with this is seriously broken. There
>>> is *no* guarantee the actual input paths are in the encoding claimed by
>>> the locale settings.
>>>
>>> There is no way for APR to solve that issue. The only thing it can do
>>> is tell the application which input it should expect. Subversion
>>> offers conversion routines to do the actual "locale"->UTF8 path
>>> conversion since Subversion actually *is* UTF8 "inside", meaning that
>>> it's ok for Subversion to err when it encounters invalid (i.e. non-UTF8)
>>> input. Not all APR applications may find that desirable (for example:
>>> Apache httpd doesn't initialise locale settings, so it can't do
>>> locale->utf8 conversions [as the C runtime doesn't know what the
>>> current locale is]; nor will it change that behaviour.)
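The "locale"->UTF8 conversion described above can be sketched as follows. This is only an illustration in Python, not Subversion's or APR's actual routines; the helper names `path_to_utf8` and `is_valid_utf8` and the sample file name are made up for the example. It shows that bytes written under an ISO-8859-1 locale are not valid UTF-8, that they convert cleanly once the right source encoding is known, and that reading UTF-8 bytes under the wrong locale raises no error but yields garbage.

```python
def path_to_utf8(raw: bytes, locale_encoding: str) -> bytes:
    """Decode path bytes using the encoding the locale claims, then re-encode as UTF-8."""
    return raw.decode(locale_encoding).encode("utf-8")


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# "für" as it would be stored on disk under an ISO-8859-1 locale.
latin1_name = "f\u00fcr".encode("iso-8859-1")

# The raw bytes are not UTF-8, so a UTF-8-only consumer has to reject them.
print(is_valid_utf8(latin1_name))

# With the correct source encoding, the conversion succeeds.
utf8_name = path_to_utf8(latin1_name, "iso-8859-1")
print(utf8_name, is_valid_utf8(utf8_name))

# The opposite mistake is silent: UTF-8 bytes read as ISO-8859-1 decode
# without error, but the result is mojibake rather than the original name.
print(utf8_name.decode("iso-8859-1"))
```

The last case is the reason the locale setting alone cannot be trusted: nothing forces the bytes on disk to match it.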
>>
>> It's worse. SVN doesn't get it right either since it's ignorant of
>> Unicode normalization forms [1].
>
> Well, yes and no :-) Subversion depends (more so than, say, /bin/ls)
> on a sanely configured environment (locale on disk == locale in
> terminal, locale configured in the first place, etc). This is fine,
> since Subversion needs to operate across different configurations and
> even OSes (whereas /bin/ls does not).
>
>> OS X always encodes file names in NFD while other
>> unix systems don't standardize this at all, though in practice
>> they tend to use NFC.
>
> Right. This issue is actually not 'worse', but different from the
> other one. (Alas, no less unfortunate.) When the Subversion devs (yes,
> I'm one of them) decided to use UTF-8, they didn't realise there are 4
> Unicode normal forms. Fortunately, 2 are irrelevant here, leaving
> 'only' 2 forms. Some (many) filenames will differ byte for byte when
> encoded in one form vs the other (NFC vs NFD) as you describe below.
>
>> The same name in NFD and NFC will be represented by a different
>> sequence and number of Unicode code points if it contains e.g.
>> accented characters.
>
> The effect is that Subversion doesn't recognize 2 filenames being the
> same when in fact they are differently encoded. This issue has long
> gone undetected, because many OSes seem to prefer either one or the
> other encoding (Windows and Linux prefer NFC, Solaris I don't know,
> but Mac prefers NFD). When working between Windows and Linux, nobody
> will notice. Neither will Mac users exchanging files.
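To make the NFC/NFD difference concrete, here is a small Python sketch using the standard `unicodedata` module. The file name and the helper `same_filename` are invented for the example, and comparing names after normalizing both sides is only one possible approach, not what Subversion does.

```python
import unicodedata

name = "\u00fcber.txt"  # "über.txt"

nfc = unicodedata.normalize("NFC", name)  # precomposed: U+00FC
nfd = unicodedata.normalize("NFD", name)  # decomposed: "u" followed by U+0308

# Same visible name, different number of code points and different UTF-8 bytes.
print(len(nfc), nfc.encode("utf-8"))
print(len(nfd), nfd.encode("utf-8"))
print(nfc == nfd)  # a plain comparison treats them as two different names


def same_filename(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two names after bringing both to the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(same_filename(nfc, nfd))
```

A tool that compares the raw strings or bytes sees two unrelated files, which is the failure described above for mixed Mac and Windows/Linux setups.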
>
> Many open source projects won't notice either even though they
> exchange between Windows, Linux and Mac, since they restrict
> themselves to ASCII filenames. This leaves mixed Windows/Linux and Mac
> setups with accented characters at a loss.
>
>> See also subversion issue 2464 [2].
>>
>> [1] http://unicode.org/reports/tr15
>> [2] http://subversion.tigris.org/issues/show_bug.cgi?id=2464
>
> Right. I've written a number of e-mails on the issue, but the other
> developers were too busy working on 1.5 at the time to be open for
> discussion on the issue. I haven't forgotten about it, but this issue
> isn't as easy to solve as the "APR doesn't work with UTF-8" issue
> was, because a very large base of legacy repositories has built
> up in the meantime. We don't want to break those.
>
> We'll be working on it. It's not worse, but unfortunately, the
> resolution to the problem contained a few problems itself and we'll be
> solving those. Hopefully by 1.6.
>
> Bye,
>
>
> Erik.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe_at_subversion.tigris.org
> For additional commands, e-mail: users-help_at_subversion.tigris.org
>

Received on 2008-02-13 17:19:31 CET

This is an archived mail posted to the Subversion Users mailing list.
