
Re: UTF-8 support for Unix with APR?

From: B Smith-Mannschott <bsmith.occs_at_gmail.com>
Date: Wed, 13 Feb 2008 15:24:39 +0100

On Feb 13, 2008 1:41 PM, Erik Huelsmann <ehuels_at_gmail.com> wrote:

> On 2/13/08, Vincent Lefevre <vincent+svn_at_vinc17.org> wrote:
> > On 2008-02-12 13:52:41 +0100, Erik Huelsmann wrote:
> > > The APR libraries handle file paths in the system locale. This means
> > > they *may* be encoded in UTF-8, but are not necessarily. When they are
> > > interpreted as UTF-8 depends on the LANG or LC_CTYPE settings in the
> > > host environment.
> >
> > This is broken. APR should switch to UTF-8 locales internally when it
> > deals with filenames (like what GNOME apps do). Otherwise this leads
> > to consistency problems when the user has both ISO-8859-1 and UTF-8
> > terminal sessions (the reason is that some applications and/or some
> > machines do not support multibyte character sets, and one wouldn't
> > want to mess everything when running svn in degraded mode, i.e. with
> > ISO-8859-1 locales).
>
> No. The way (non-Mac) unices deal with this is seriously broken. There
> is *no* guarantee the actual input paths are the encoding claimed by
> the locale settings.
>
> There is no way for APR to solve that issue. The only thing it can do
> is tell the application which input it should expect. Subversion
> offers conversion routines to do the actual "locale"->UTF8 path
> conversion since Subversion actually *is* UTF8 "inside", meaning that
> it's ok for Subversion to err when it encounters invalid (ie non-UTF8)
> input. Not all APR applications may find that desirable (for example:
> Apache httpd doesn't initialise locale settings, so, it can't do
> locale->utf8 conversions [as the C runtime doesn't know what the
> current locale is]; nor will it change that behaviour.)
>

It's worse. SVN doesn't get it right either, since it ignores Unicode
normalization forms [1]. OS X always stores file names in decomposed form
(NFD), while other Unix systems don't standardize this at all, though in
practice they tend to use NFC. If a name contains e.g. accented characters,
its NFD and NFC forms are different sequences of Unicode code points, and
of different lengths. See also Subversion issue 2464 [2].
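
To make the NFC/NFD difference concrete, here is a minimal sketch in Python
using the standard unicodedata module. The file name and the helper
same_name() are made up for illustration; this is not how Subversion
compares paths, only a demonstration of why a plain comparison fails.

```python
import unicodedata

# Hypothetical file name containing one accented character (U+00E9).
name = "caf\u00e9.txt"

# Composed form: the accented letter is a single code point.
nfc = unicodedata.normalize("NFC", name)
# Decomposed form: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
nfd = unicodedata.normalize("NFD", name)

print(len(nfc), len(nfd))                                  # 8 9
print(len(nfc.encode("utf-8")), len(nfd.encode("utf-8")))  # 9 10
print(nfc == nfd)                                          # False


def same_name(a: str, b: str) -> bool:
    """Illustrative helper: compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(same_name(nfc, nfd))                                 # True
```

Both strings are valid UTF-8 and look identical on screen, yet they differ
code point by code point, so a tool that compares them without normalizing
first sees two different file names.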

[1] http://unicode.org/reports/tr15
[2] http://subversion.tigris.org/issues/show_bug.cgi?id=2464

-- 
// Ben Smith-Mannschott
Received on 2008-02-13 15:25:03 CET

This is an archived mail posted to the Subversion Users mailing list.
