
Re: "Strange" characters in file names

From: Ryan Schmidt <subversion-2007a_at_ryandesign.com>
Date: 2007-03-26 04:12:34 CEST

On Mar 25, 2007, at 18:58, <svn.users@salvisberg.com> wrote:

>>>>> svn: Can't convert string from native encoding to 'UTF-8':
>>>>> svn: Kurs Ern?\195?\164hrung.doc
>>>> Perhaps you have not set the LANG variable so ls and svn don't
>>>> know how to properly display it. Try export LANG=de_DE.utf8 or
>>>> whatever the correct value for your OS is. (The contents of the
>>>> directory /usr/share/locale may tell you what the valid locales
>>>> are on your system.)
>>> Yes, indeed, this helped. ls still doesn't display the file
>>> properly (I don't really care), but svn doesn't complain anymore.
>>> Can you explain (or point me to an explanation), what this does
>>> and why it's needed? I'd prefer not to set a locale (leave it at
>>> the default POSIX), because I don't want to introduce a bias
>>> towards German. This particular filename happens to be in German,
>>> but I'm sure someone will upload a file with a French name sooner
>>> or later.
>> This does not introduce a bias towards German. It does cause error
>> messages to be printed in German. Based on the name of the file, I
>> assumed you would want that. If you prefer English error messages
>> from Subversion, use en_US.utf8, or whatever it is on your OS.
>> The important part is the .utf8 part, which explains to Subversion
>> and other tools that you are using the UTF-8 character encoding.
>> UTF-8 can handle all languages, so as long as your locale is a
>> UTF-8 locale, you will be able to handle all filenames.
> How do I know I'm using the UTF-8 encoding? How do you know?

I know it because I speak German and recognize that the filename
should be "Kurs Ernährung.doc", and because I know that "ä" is
represented by the bytes 195,164 in the UTF-8 encoding.

> Could svn know it, too?

Subversion could guess it, but it couldn't know it for sure. For
example, in the ISO-8859-1 encoding, the word is "ErnÃ¤hrung";
Subversion doesn't know German, so it doesn't know that that's
nonsense. Same for ISO-8859-5, where the word becomes "ErnУЄhrung".
That's why it's best that we tell Subversion (and all other programs)
what character encoding we're using, so it can be sure what we mean.
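
As a rough illustration (a Python sketch of my own, not anything
Subversion itself runs), here are the two bytes that svn printed as
\195\164 in the error message, decoded under the three encodings
mentioned above. The variable names are made up for this example.

```python
# The two bytes svn printed as \195\164 in the error message.
raw = bytes([195, 164])

# The same bytes are valid in all three encodings, but they spell
# different text in each one. Nothing in the bytes says which is meant.
as_utf8 = raw.decode("utf-8")          # 'ä'
as_latin1 = raw.decode("iso-8859-1")   # 'Ã¤'
as_cyrillic = raw.decode("iso-8859-5") # 'УЄ'

# Rebuild the word under each reading.
readings = {
    "UTF-8": "Ern" + as_utf8 + "hrung",
    "ISO-8859-1": "Ern" + as_latin1 + "hrung",
    "ISO-8859-5": "Ern" + as_cyrillic + "hrung",
}
```

None of the three decodes raises an error, which is exactly why a
program cannot tell from the bytes alone which reading is the right
one.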

> What exactly does it mean that I'm "using the UTF-8 encoding"? That
> the filename is UTF-8-encoded?

Yes, that's what I meant. And that could be dictated by your OS
and/or filesystem.

> Instead of two question marks, ls now displays two different box
> drawing characters.

Perhaps your terminal or shell doesn't understand UTF-8? Perhaps you
can educate it? I know on Mac OS X, for example, the terminal uses
UTF-8 by default, but can be configured to use any character encoding
you prefer.

> So, even if svn doesn't complain anymore, how do I know that UTF-8
> really is the correct encoding and I wouldn't risk putting
> something into the repository that might cause trouble?

As above, I knew because of my knowledge of German words and the
bytes used for UTF-8 encoding. In the general case, programmatically
figuring out what character set something is in is difficult or
impossible, which is why all text everywhere should be accompanied by
an indication of what character encoding it's in so that everyone can
read it properly -- or it should be decreed that there is only one
character encoding, and everything is normalized to that. Web pages
and emails take the approach that any character encoding is ok, but
they include in the Content-Type header the name of the character
encoding that was used. Subversion takes the approach that everything
in the repository* shall be UTF-8, and everything is converted to and
from UTF-8 automatically, and if it doesn't know what to convert to
or from, it complains, as well it should.

*Here I only mean file and directory names. The contents of files can
be in any character encoding you like; Subversion doesn't care.
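
The "convert or complain" behavior can be sketched in a few lines of
Python. This is only a model under my own assumptions, not
Subversion's actual code: to_repository_name is a hypothetical helper,
and the default POSIX locale is modeled here as plain ASCII.

```python
def to_repository_name(raw_name: bytes, native_encoding: str) -> bytes:
    """Hypothetical helper: convert a filename from the native encoding
    to UTF-8, raising UnicodeDecodeError if the bytes are not valid in
    that encoding."""
    text = raw_name.decode(native_encoding)  # strict decoding by default
    return text.encode("utf-8")

raw = b"Kurs Ern\xc3\xa4hrung.doc"

# POSIX locale (modeled as ASCII): the bytes 195,164 are not valid
# ASCII, so the conversion fails loudly.
try:
    to_repository_name(raw, "ascii")
    posix_ok = True
except UnicodeDecodeError:
    posix_ok = False

# UTF-8 locale: the name passes through unchanged.
utf8_result = to_repository_name(raw, "utf-8")

# Wrong guess (ISO-8859-1): no error at all, but the stored name is
# now the UTF-8 encoding of the wrong characters.
latin1_result = to_repository_name(raw, "iso-8859-1")
```

The last case is the one to worry about: a wrong but plausible
encoding does not produce a complaint, it produces a differently
named file in the repository.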

> In this case I do want to ignore them, but I'm surprised that svn
> looks at them anyway and trips over the "strange" characters.

I think it's a good idea that Subversion tells you at the first
opportunity that it finds something wrong. Not knowing what character
encoding something is in is certainly a problem that should be
rectified.

A plausible explanation: your config file is in UTF-8, and you have
specified your ignore rules in UTF-8, because that's what Subversion
requires. To compare your ignore rules accurately against the file
and directory names on the filesystem, Subversion needs to know what
character encoding those names are in. Otherwise
it's possible it would ignore the wrong things.
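
Here is a small Python sketch of why the comparison depends on the
encoding. It uses the standard fnmatch module as a stand-in for
Subversion's own pattern matching, and the ignore pattern is a made-up
example, so treat it as an illustration only.

```python
from fnmatch import fnmatchcase

# Hypothetical ignore rule, already available as text because the
# config file is known to be UTF-8.
ignore_pattern = "*Ernährung*"

# The filename as raw bytes from the filesystem.
raw_name = b"Kurs Ern\xc3\xa4hrung.doc"

# The rule matches only if the filename bytes are interpreted with
# the encoding they were actually written in.
matches_when_utf8 = fnmatchcase(raw_name.decode("utf-8"), ignore_pattern)
matches_when_latin1 = fnmatchcase(raw_name.decode("iso-8859-1"), ignore_pattern)
```

With the right encoding the rule applies to the file; with the wrong
one the same rule quietly fails to match it.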

Received on Mon Mar 26 04:13:03 2007

This is an archived mail posted to the Subversion Users mailing list.