
Re: Mac OS X: why LC_ALL needs to be specified (Was: problems adding files with umlauts)

From: Ulrich Eckhardt <eckhardt_at_satorlaser.com>
Date: 2006-07-10 10:16:35 CEST

On Friday 07 July 2006 11:00, Thomas Singer wrote:
> You are making it too simple: you assume that the file name already _is_
> plain UTF-8.

Indeed, because filenames are supposed to be UTF-8.

> My Java example works as expected:
>
> final File dir = new File("file-test");
> dir.mkdirs();
> final File file = new File(dir, "invalid\u00FF\u00FE");
> file.createNewFile();
> for (String fileName : dir.list()) {
>     System.out.println(fileName);
> }
> file.delete();

AFAIK, Java uses UCS-2 or UTF-16 internally. It then has to convert that to the
system's format which, in the case of OS X, is UTF-8. Now, FF and FE are both
valid code points in Unicode (y with diaeresis and thorn, respectively), so
Java just encodes them in UTF-8 and everything's fine. C++ is much more direct:
it just passes to the system as the filename whatever it got from the programmer.
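
To make the byte-level difference concrete, here is a minimal sketch (the class
name EncodeDemo and the helper hex are made up for illustration). It shows the
bytes Java produces when it encodes the two characters U+00FF and U+00FE as
UTF-8, as opposed to the raw bytes FF FE that a C++ program would hand to the
system unchanged:

```java
import java.nio.charset.StandardCharsets;

final class EncodeDemo {
    // Hypothetical helper: format bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00FF and U+00FE are ordinary code points, so each one
        // becomes a well-formed two-byte UTF-8 sequence.
        byte[] utf8 = "\u00FF\u00FE".getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(utf8)); // prints "C3 BF C3 BE"
    }
}
```

The single bytes FF and FE never appear in the result, which is why the Java
example ends up with a valid UTF-8 filename.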

> > The thing is that, as Wilfredo said and whose attribution you snipped,
> > filenames are UTF-8 _by_ _convention_ and nothing enforces this.
>
> As I understand it, file names are stored *in the repository* as UTF-8 (by
> convention)

Yes, although this is not a convention but a definition/requirement of
Subversion. Also, this is validated, i.e. Subversion rejects invalid UTF-8
sequences.

> and the Subversion client needs to enforce the correct encoding
> from the OS' native file name encoding.

Right. In the case of OSX, Subversion probably assumes the encoding is UTF-8
(because that is what it should be). If this is already wrong, because some
program broke with the convention, it can't do much. In said case it only
sees that the UTF-8 sequence is invalid and bails out with an error message.
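
As an illustration of what such a check amounts to, here is a rough sketch of
strict UTF-8 validation. This is not Subversion's actual code (Subversion is
written in C); the class Utf8Check and the method isValidUtf8 are hypothetical
names, and the sketch simply relies on Java's decoder reporting malformed input:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {
    // Returns true if the bytes decode cleanly as UTF-8, false otherwise.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Raw bytes FF FE, as a C++ program might pass them to the system.
        byte[] raw = {(byte) 0xFF, (byte) 0xFE};
        // The same two characters after proper UTF-8 encoding.
        byte[] encoded = {(byte) 0xC3, (byte) 0xBF, (byte) 0xC3, (byte) 0xBE};
        System.out.println(isValidUtf8(raw));     // prints "false"
        System.out.println(isValidUtf8(encoded)); // prints "true"
    }
}
```

A filename containing the raw bytes fails this kind of check, and all the
client can do at that point is report the error.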

> With Java this is no problem, since
> it does not simply treat characters as bytes and lists the directory
> content correctly (on Mac with decomposed umlauts, but that's another
> problem) and hence can (without setting the LC_ALL variable) convert the
> file name to UTF-8 or whatever encoding you want. If Java can do that
> without setting LC_ALL, it also should be technically possible from C(++).

It is technically and practically possible, but it doesn't happen behind the
scenes like in Java; it requires an active effort. Since C++ mostly doesn't
interpret characters and just passes them on, you need a function that
converts the program's local encoding (which one that is depends on the
programmer and/or the locale) to the externally specified format before
opening the file.
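
The conversion step described above could look roughly like the following
sketch. It is written in Java only to stay with the language already used in
this thread; in C or C++ the same step would go through a conversion library
such as iconv. The names NameConverter and toUtf8 are hypothetical, and
ISO-8859-1 is merely assumed as the local encoding for the example:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class NameConverter {
    // Hypothetical helper: re-encode a filename from the program's
    // local encoding into the UTF-8 bytes the system expects.
    static byte[] toUtf8(byte[] localBytes, Charset localCharset) {
        String decoded = new String(localBytes, localCharset);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "f<u-umlaut>r" in ISO-8859-1: the umlaut is the single byte FC.
        byte[] local = {'f', (byte) 0xFC, 'r'};
        byte[] utf8 = toUtf8(local, StandardCharsets.ISO_8859_1);
        for (byte b : utf8) {
            System.out.printf("%02X ", b & 0xFF); // prints "66 C3 BC 72 "
        }
        System.out.println();
    }
}
```

The point is only that the caller has to know the source encoding and convert
explicitly; nothing does it on the program's behalf.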

In other words, the difference between C++ and Java in this aspect is that in
C++ you provide the bytewise representation of the filename and that name is
used without conversion, while in Java you provide a string that is converted
to the filename's bytewise representation according to system requirements.
That said, I wonder how Java would deal with "invalid\uFFFF\uFFFE", as those
two are noncharacters that are not meant for interchange (i.e. in filenames or
content) according to Unicode.
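
At the encoding level at least, this can be tried out with a short sketch like
the one below (NoncharDemo and hex are made-up names). It only shows what
Java's UTF-8 encoder does with the two noncharacters; what the filesystem then
does with such a name is a separate question that this does not answer:

```java
import java.nio.charset.StandardCharsets;

final class NoncharDemo {
    // Hypothetical helper: format bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+FFFF and U+FFFE are noncharacters, but they are not surrogates,
        // so the encoder emits ordinary three-byte sequences for them.
        byte[] utf8 = "\uFFFF\uFFFE".getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(utf8)); // prints "EF BF BF EF BF BE"
    }
}
```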

Uli

Received on Mon Jul 10 10:18:04 2006

This is an archived mail posted to the Subversion Users mailing list.
