Re: UTF-8 support for Unix with APR?

From: Erik Huelsmann <ehuels_at_gmail.com>
Date: Fri, 15 Feb 2008 13:27:31 +0100

On Fri, Feb 15, 2008 at 1:12 PM, Vincent Lefevre <vincent+svn_at_vinc17.org> wrote:
> On 2008-02-13 16:14:28 +0100, Erik Huelsmann wrote:
> > Well, yes and no :-) Subversion depends (more so than, say, /bin/ls)
> > on a sanely configured environment (locale on disk == locale in
> > terminal, locale configured in the first place, etc).
>
> You're assuming too much. Unix is designed so that the user can use
> different locales, e.g. in different terminals. The locales are not
> global to the system (unlike the host name) or even the network (for
> the NFS users). And some applications are designed to do charset
> conversion because of that ("screen" is a good example). Moreover,
> different users will typically have different locales, and may
> need to access the data of other users. Also think about a USB
> key that will be used in various environments...
>
>
> > The effect is that Subversion doesn't recognize 2 filenames being the
> > same when in fact they are differently encoded. This issue has long
> > gone undetected, because many OSes seem to prefer either one or the
> > other encoding (Windows and Linux prefer NFC,
>
> I don't know about Windows, but Linux does *not* prefer NFC. It will
> accept whatever the user will use. This can be both NFC and NFD (so
> that the user may end up with two files with the same apparent name,
> in particular after scp between Linux and Mac OS X), broken UTF-8
> sequences or other encodings. Fortunately some applications (e.g.
> GNOME ones) enforce some conventions by default.
>
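
To make the NFC/NFD point concrete, here is a minimal sketch in Python (not Subversion or APR code; the file name and the same_name helper are made up for illustration). It shows two names that look identical but differ byte for byte, and a comparison that normalizes both sides first:

```python
import unicodedata

# Hypothetical file name: "é" is either one code point (NFC)
# or "e" followed by a combining acute accent (NFD).
nfc_name = unicodedata.normalize("NFC", "caf\u00e9.txt")
nfd_name = unicodedata.normalize("NFD", "caf\u00e9.txt")

# Both render the same, yet their UTF-8 byte sequences differ.
nfc_bytes = nfc_name.encode("utf-8")   # b'caf\xc3\xa9.txt'
nfd_bytes = nfd_name.encode("utf-8")   # b'cafe\xcc\x81.txt'


def same_name(a: str, b: str) -> bool:
    """Compare two names after normalizing both to NFC (illustrative helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

A byte-wise comparison treats these as two different files, which is the failure described above; same_name() treats them as equal.
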
>
> > Solaris I don't know, but Mac prefers NFD).
>
> In fact, it is HFS+ (the file system) that prefers NFD, not the Mac as such.
>
>
> On 2008-02-13 21:56:10 +0100, Erik Huelsmann wrote:
> > Ah! but the Mac (although that was snipped out of the quote) was
> > exempt from 'Normal unix behaviour', since they use UTF-8 on disk *all
> > the time*. The rest of the unix world uses LC_CTYPE, LC_ALL or LANG
> > environment variables to determine what the current locale is. It then
> > applies that setting both to paths on disk and to any output
> > sent to the terminal.
>
> No, it doesn't apply to pathnames. The encoding is left unspecified, and
> may depend on the file system, and the system just sees filenames as a
> sequence of bytes (BTW, many system scripts set the locale back to C,
> but they must work with filenames containing non-ASCII characters).
>
> It would be more correct to say that most software doesn't support
> filenames with non-ASCII characters. Real support would mean charset
> conversion between the encoding on disk and the current locale.
>
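
The "charset conversion between the encoding on disk and the current locale" mentioned above could look roughly like the following Python sketch. The function name transcode_name and the example file name are assumptions for illustration; this is not how Subversion or APR implement it:

```python
def transcode_name(raw: bytes, disk_encoding: str, io_encoding: str) -> bytes:
    """Re-encode a filename from its on-disk encoding to the I/O encoding."""
    text = raw.decode(disk_encoding)   # bytes on disk -> characters
    return text.encode(io_encoding)    # characters -> bytes for the terminal


# Hypothetical name "café.txt" as a Latin-1 user would have stored it.
on_disk = b"caf\xe9.txt"

# What a UTF-8 terminal would need to receive for the same name.
shown = transcode_name(on_disk, "iso-8859-1", "utf-8")
```

The catch is the disk_encoding argument: nothing on a typical Unix file system records it, so something (the user, a per-volume setting, a convention) has to supply it.
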
>
> > > This is the locale I know about. "LANG=en_US.UTF-8" and so forth.
> >
> > But, as stated above, in the rest of the unix world, LANG= also
> > applies to paths read from disk.
>
> No, see GNOME applications, for instance. This is mainly a question
> of convention.
>
> Also, at my previous lab, the NFS system was replaced by a NAS that
> supports both Unix and Windows, and for this reason, the filenames had
> to be interpreted as sequences of characters. Now, how could the system
> guess the locale used by each user? You see, having an encoding based
> on the current locale is broken by design. FYI, all the users who chose
> a UTF-8-incompatible encoding had their filenames munged.
>
>
> > > Is that when I first checked out a working copy? when I first made
> > > a repository? when I first installed Subversion? when I first
> > > installed the OS?
> >
> > When you installed your Windows (presumably), or when you last created
> > your Unix user.
> ^^^^^^^^^^^^^^
> I suppose you meant OS installation. Unix is a multi-user system!
>
> Now, do you want every user of some USB key to have set up their
> machine in the same way? That's unrealistic!
>
>
> > And that's correct. With the right choice of pathnames the sequence of
> > commands below could be broken (the second command will return a
> > "Non-conforming UTF-8 sequence encountered." error):
> >
> > $ LANG=en_US.iso88591 svn checkout URL your-path
> > $ LANG=en_US.UTF-8 svn update your-path
> >
> > Now, Subversion could remember that the path was checked out using the
> > latin1 setting, but essentially you're telling it you changed your
> > paths (and output) to UTF-8. Should it ignore that? Absolutely not!
> > You might be (*should* be) right, in which case you'd end up with the
> > wrong UTF-8, when it's being read as if it were the latin1 which you
> > checked out...
>
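
The failure in that two-command sequence can be imitated outside Subversion with a short Python sketch (an analogy only; the readable_as helper and the file name are invented for illustration). A name written under the latin1 locale is not valid UTF-8 when read back under the UTF-8 one:

```python
def readable_as(raw: bytes, encoding: str) -> bool:
    """Return True if the byte sequence decodes cleanly in the given encoding."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical name "café.txt" written to disk under an ISO-8859-1 locale.
latin1_name = "caf\u00e9.txt".encode("iso-8859-1")

ok_in_latin1 = readable_as(latin1_name, "iso-8859-1")  # same locale as checkout
ok_in_utf8 = readable_as(latin1_name, "utf-8")         # locale used for update
```

Here ok_in_latin1 is True and ok_in_utf8 is False: the lone 0xE9 byte starts a multi-byte UTF-8 sequence that is never completed, which is the same kind of condition the "Non-conforming UTF-8 sequence" error reports.
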
> Well, there should be a (possibly optional) way to say: use this
> encoding for pathnames *on disk*, and use this other encoding for
> input/output.
>
> In a similar way, when I read/write a file with my text editor, it
> shouldn't expect it to be always in the charset specified by the
> current locale.
>
> BTW, the notion of locale is old and was created when users usually
> worked in a single environment and didn't exchange data very much.
> Things have evolved. Nowadays, most software is able to work with
> various charsets (sometimes recording the charset together with the
> contents, e.g. in XML, mail messages...), instead of sticking to the
> current locale.

Right. But all of that means the situation isn't deterministic enough
for either APR or Subversion to solve *all* locale problems.

bye,

Erik.

Received on 2008-02-15 13:27:54 CET

This is an archived mail posted to the Subversion Users mailing list.