
Re: UTF-8 support for Unix with APR?

From: Erik Huelsmann <ehuels_at_gmail.com>
Date: Wed, 13 Feb 2008 16:14:28 +0100

> > > > The APR libraries handle file paths in the system locale. This means
> > > > they *may* be encoded in UTF-8, but are not necessarily. When they are
> > > > interpreted as UTF-8 depends on the LANG or LC_CTYPE settings in the
> > > > host environment.
> > >
> > > This is broken. APR should switch to UTF-8 locales internally when it
> > > deals with filenames (like what GNOME apps do). Otherwise this leads
> > > to consistency problems when the user has both ISO-8859-1 and UTF-8
> > > terminal sessions (the reason is that some applications and/or some
> > > machines do not support multibyte character sets, and one wouldn't
> > > want to mess everything when running svn in degraded mode, i.e. with
> > > ISO-8859-1 locales).
> >
> > No. The way (non-Mac) unices deal with this is seriously broken. There
> > is *no* guarantee the actual input paths are the encoding claimed by
> > the locale settings.
> >
> > There is no way for APR to solve that issue. The only thing it can do
> > is tell the application which input it should expect. Subversion
> > offers conversion routines to do the actual "locale"->UTF8 path
> > conversion since Subversion actually *is* UTF8 "inside", meaning that
> > it's ok for Subversion to err when it encounters invalid (ie non-UTF8)
> > input. Not all APR applications may find that desirable (for example:
> > Apache httpd doesn't initialise locale settings, so, it can't do
> > locale->utf8 conversions [as the C runtime doesn't know what the
> > current locale is]; nor will it change that behaviour.)
>
> It's worse. SVN doesn't get it right either since it's ignorant of unicode
> normalization forms [1].

Well, yes and no :-) Subversion depends (more so than, say, /bin/ls)
on a sanely configured environment (locale on disk == locale in
terminal, locale configured in the first place, etc). This is fine,
since Subversion needs to operate across different configurations and
even OSes (whereas /bin/ls does not).
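
To illustrate why "locale on disk == locale in terminal" matters, here is
a minimal Python sketch of a locale->UTF-8 path conversion. It is not
Subversion's or APR's code; path_to_utf8 and its locale_encoding argument
are hypothetical names made up for this example.

```python
def path_to_utf8(raw: bytes, locale_encoding: str) -> bytes:
    """Hypothetical helper: reinterpret raw path bytes as UTF-8.

    Decodes with the encoding the locale *claims* and re-encodes as UTF-8.
    Raises UnicodeDecodeError if the bytes are not valid in that encoding.
    """
    return raw.decode(locale_encoding).encode("utf-8")


# A file name written to disk from an ISO-8859-1 session.
on_disk = "\u00e9.txt".encode("iso-8859-1")

# Locale matches what is on disk: the conversion succeeds.
converted = path_to_utf8(on_disk, "iso-8859-1")
print(converted)

# Locale claims UTF-8 but the bytes are ISO-8859-1: the conversion errs.
try:
    path_to_utf8(on_disk, "utf-8")
    mismatch_detected = False
except UnicodeDecodeError:
    mismatch_detected = True
print(mismatch_detected)
```

Note that a mismatch is only detected when the bytes happen to be invalid
in the claimed encoding; the reverse case (UTF-8 bytes read as ISO-8859-1)
decodes without error and silently yields the wrong name.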

> OS X always encodes file names in NFD while other
> unix systems don't standardize this at all, though in practice they tend to
> use NFC.

Right. This issue is actually not 'worse', but different from the
other one. (Alas, no less unfortunate.) When the Subversion devs (yes,
I'm one of them) decided to use UTF-8, they didn't realise there are 4
Unicode normal forms. Fortunately, 2 are irrelevant here, leaving
'only' 2 forms. Some (many) filenames will differ byte for byte when
encoded in one form vs the other (NFC vs NFD) as you describe below.

> The same name in NFD and NFC will be represented by a different
> sequence and number of unicode code points if it contains e.g. accented
> characters.
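
A small Python sketch of that difference, using the standard unicodedata
module. The file name is just an example I picked for illustration:

```python
import unicodedata

name = "caf\u00e9.txt"  # example name; "e with acute" as one code point

nfc = unicodedata.normalize("NFC", name)  # composed form
nfd = unicodedata.normalize("NFD", name)  # decomposed: "e" + combining acute

print(len(nfc), len(nfd))      # number of code points in each form
print(nfc.encode("utf-8"))     # UTF-8 bytes of the composed form
print(nfd.encode("utf-8"))     # UTF-8 bytes of the decomposed form
print(nfc == nfd)              # plain comparison sees two different names
```

Both strings render identically on screen, yet they have a different
number of code points and a different UTF-8 byte sequence.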

The effect is that Subversion doesn't recognize 2 filenames as being
the same when they differ only in normal form. This issue has long
gone undetected, because most OSes seem to prefer one form or the
other (Windows and Linux prefer NFC, Solaris I don't know,
but Mac prefers NFD). When working between Windows and Linux, nobody
will notice. Neither will Mac users exchanging files among themselves.
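
As a sketch of what a normalization-aware comparison could look like (this
is NOT what Subversion does today; same_name is a hypothetical helper
written only for this example, and NFC is an arbitrary choice of common
form):

```python
import unicodedata


def same_name(a: str, b: str, form: str = "NFC") -> bool:
    """Hypothetical helper: compare two names after normalizing both."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


linux_name = "r\u00e9sum\u00e9.txt"    # composed, as Windows/Linux tend to store it
mac_name = "re\u0301sume\u0301.txt"    # decomposed, as Mac stores it

print(linux_name == mac_name)           # raw comparison: names look different
print(same_name(linux_name, mac_name))  # normalized comparison: same name
```

Normalizing both sides at comparison time leaves the stored names
untouched, which matters when existing repositories already contain names
in either form.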

Many open source projects won't notice either even though they
exchange between Windows, Linux and Mac, since they restrict
themselves to ASCII filenames. This leaves mixed Windows/Linux and Mac
setups with accented characters at a loss.

> See also subversion issue 2464 [2].
>
> [1] http://unicode.org/reports/tr15
> [2] http://subversion.tigris.org/issues/show_bug.cgi?id=2464

Right. I've written a number of e-mails on the issue, but the other
developers were too busy working on 1.5 at the time to be open for
discussion on the issue. I haven't forgotten about it, but this issue
isn't as easy to solve as the "APR doesn't work with UTF-8" issue
was, because a very large body of legacy repositories has built
up in the meantime. We don't want to break those.

We'll be working on it. It's not worse, but unfortunately, the
resolution to the problem contained a few problems itself and we'll be
solving those. Hopefully by 1.6.

Bye,

Erik.

Received on 2008-02-13 16:14:53 CET

This is an archived mail posted to the Subversion Users mailing list.