Re: UTF-8 support for Unix with APR?

From: Vincent Lefevre <vincent+svn_at_vinc17.org>
Date: Fri, 15 Feb 2008 13:12:40 +0100

On 2008-02-13 16:14:28 +0100, Erik Huelsmann wrote:
> Well, yes and no :-) Subversion depends (more so than, say, /bin/ls)
> on a sanely configured environment (locale on disk == locale in
> terminal, locale configured in the first place, etc).

You're assuming too much. Unix is designed so that the user can use
different locales, e.g. in different terminals. The locales are not
global to the system (unlike the host name) or even the network (for
the NFS users). And some applications are designed to do charset
conversion because of that ("screen" is a good example). Moreover,
different users will typically have different locales, and possibly
need to access the data of other users. Also think about a USB
key that will be used in various environments...

> The effect is that Subversion doesn't recognize 2 filenames being the
> same when in fact they are differently encoded. This issue has long
> gone undetected, because many OSes seem to prefer either one or the
> other encoding (Windows and Linux prefer NFC,

I don't know about Windows, but Linux does *not* prefer NFC. It will
accept whatever the user will use. This can be both NFC and NFD (so
that the user may end up with two files with the same apparent name,
in particular after scp between Linux and Mac OS X), broken UTF-8
sequences or other encodings. Fortunately some applications (e.g.
GNOME ones) enforce some conventions by default.
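
To make the NFC/NFD point concrete, here is a minimal Python sketch. The
file name and the helper same_name() are made up for illustration; they
are not part of Subversion or APR. It shows that the two forms of one
visible name are different byte sequences, which is why a file system
that does not normalize treats them as two files.

```python
import unicodedata

# Hypothetical file name containing a precomposed e-acute (U+00E9).
name = "caf\u00e9.txt"

nfc = unicodedata.normalize("NFC", name)  # e-acute as one code point
nfd = unicodedata.normalize("NFD", name)  # "e" followed by U+0301 (combining acute)

def same_name(a: str, b: str) -> bool:
    """Compare two names after bringing both to the same normal form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(nfc.encode("utf-8"))  # b'caf\xc3\xa9.txt'
print(nfd.encode("utf-8"))  # b'cafe\xcc\x81.txt'
print(nfc == nfd)           # False: a byte-wise comparison sees two names
print(same_name(nfc, nfd))  # True: equal once normalized
```

A tool that compares names byte by byte sees two distinct files here,
although both look identical when displayed.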

> Solaris I don't know, but Mac prefers NFD).

More precisely, HFS+ (the file system) does.

On 2008-02-13 21:56:10 +0100, Erik Huelsmann wrote:
> Ah! but the Mac (although that was snipped out of the quote) was
> exempt from 'Normal unix behaviour', since they use UTF-8 on disk *all
> the time*. The rest of the unix world uses LC_CTYPE, LC_ALL or LANG
> environment variables to determine what the current locale is. It then
> applies that setting both to paths on the disk as well as any output
> sent to the terminal.

No, it doesn't apply to pathnames. The encoding is left unspecified and
may depend on the file system, and the system just sees filenames as a
sequence of bytes (BTW, many system scripts set the locale back to C,
but they must work with filenames containing non-ASCII characters).
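
As an illustration of "a sequence of bytes", here is a small Python
sketch. The name bytes and the helper interpret() are assumptions for
the example, and nothing here touches a real file system. The same
bytes give a different string, or no valid string at all, depending on
which encoding the reader assumes.

```python
# Name bytes as a Latin-1 user would have stored them (hypothetical example).
raw = "r\u00e9sum\u00e9.txt".encode("latin-1")

def interpret(raw_name: bytes, encoding: str) -> str:
    """Decode name bytes; undecodable bytes are kept as lone surrogates."""
    return raw_name.decode(encoding, errors="surrogateescape")

as_latin1 = interpret(raw, "latin-1")  # the name its creator intended
as_utf8 = interpret(raw, "utf-8")      # what a UTF-8 reader makes of it

# A strict UTF-8 decode rejects these bytes outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape loses nothing: the original bytes can be recovered.
round_trip = as_utf8.encode("utf-8", errors="surrogateescape")
```

The bytes themselves never change; only the interpretation does, which
is all the kernel guarantees.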

It would be more correct to say that most software doesn't support
filenames with non-ASCII characters. Real support would mean charset
conversion between the encoding on disk and the current locale.

> > This is the locale I know about. "LANG=en_US.UTF-8" and so forth.
>
> But, as stated above, in the rest of the unix world, LANG= also
> applies to paths read from disk.

No, see GNOME applications, for instance. This is mainly a question
of convention.

Also, at my previous lab, the NFS system was replaced by a NAS that
supports both Unix and Windows, and for this reason, the filenames had
to be interpreted as sequences of characters. Now, how could the system
guess the locale used by each user? You see, having an encoding based
on the current locale is broken by design. FYI, all the users who chose
a UTF-8-incompatible encoding had their filenames munged.

> > Is that when I first checked out a working copy? when I first made
> > a repository? when I first installed Subversion? when I first
> > installed the OS?
>
> When you installed your windows (presumably), or when you last created
> your Unix user.
  ^^^^^^^^^^^^^^
I suppose you meant OS installation. Unix is a multi-user system!

Now, do you want every user of a given USB key to have set up their
machine in the same way? That's unrealistic!

> And that's correct. With the right choice of pathnames the sequence of
> commands below could be broken (the second command will return a
> "Non-conforming UTF-8 sequence encountered." error):
>
> $ LANG=en_US.iso88591 svn checkout URL your-path
> $ LANG=en_US.UTF-8 svn update your-path
>
> Now, Subversion could remember that the path was checked out using the
> latin1 setting, but essentially you're telling it you changed your
> paths (and output) to UTF-8. Should it ignore that? Absolutely not!
> You might be (*should* be) right, in which case you'd end up with the
> wrong UTF-8, when it's being read as if it were the latin1 which you
> checked out...

Well, there should be a (possibly optional) way to say: use this
encoding for pathnames *on disk*, and use this other encoding for
input/output.
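
A rough sketch of what I mean, in Python. The function names and the
two encoding parameters are hypothetical; this is not an existing
Subversion or APR option. The idea is only that the on-disk encoding
and the input/output encoding are configured separately, with a
conversion at the boundary.

```python
def disk_to_io(raw_name: bytes, disk_encoding: str, io_encoding: str) -> bytes:
    """Convert name bytes read from disk into bytes suitable for output.

    Characters the output encoding cannot represent are replaced by "?"
    instead of aborting, since this direction is for display only.
    """
    text = raw_name.decode(disk_encoding)
    return text.encode(io_encoding, errors="replace")

def io_to_disk(typed_name: bytes, io_encoding: str, disk_encoding: str) -> bytes:
    """Convert a name typed by the user into the bytes to use on disk.

    This direction is strict: writing a wrong name to disk would be
    worse than reporting an error.
    """
    text = typed_name.decode(io_encoding)
    return text.encode(disk_encoding)

# Names stored as Latin-1 on disk, shown on a UTF-8 terminal.
shown = disk_to_io(b"caf\xe9", "latin-1", "utf-8")
# The same name typed on the UTF-8 terminal, mapped back to the disk form.
stored = io_to_disk(b"caf\xc3\xa9", "utf-8", "latin-1")
# A terminal limited to ASCII only gets an approximation.
approx = disk_to_io(b"caf\xe9", "latin-1", "ascii")
```

With such a pair of settings, changing LANG in a terminal would only
change the second encoding and leave the names on disk alone.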

In a similar way, when I read or write a file with my text editor, the
editor shouldn't expect the file to always be in the charset specified
by the current locale.

BTW, the notion of locale is old and was created when users usually
worked in a single environment and didn't exchange data very much.
Things have evolved. Nowadays, most software is able to work with
various charsets (sometimes recording the charset together with the
contents, e.g. in XML, mail messages...), instead of sticking to the
current locale.

-- 
Vincent Lefèvre <vincent@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)
Received on 2008-02-15 13:13:14 CET

This is an archived mail posted to the Subversion Users mailing list.
