
Re: International Characters & Subversion 1.1.0 Problems

From: Erich Enke <epte_at_ruffdogs.com>
Date: 2004-10-05 00:34:25 CEST

>>However, even though `locale charmap` says 'UTF-8', if I do:
>>echo ab | tr 'a' '\303' | tr 'b' '\244'
>>I get Ã¤ (Cap. A + superscript tilde, and then something that looks
>>like a misfigured pound sign). That's not right. I should get a
>>lower-case a with diaeresis, I would think.
>>
>>
>
>"locale charmap" shows what the environment variables in your shell are
>telling your programs to use - i.e. how the programs that you run will
>interpret and produce byte sequences. That needn't (sadly!) correspond to
>the way your terminal window interprets those sequences when the programs
>output them!
>
>The symbols that you're seeing correspond (possibly among other encodings)
>to the characters mapped to 'c3' and 'a4' in the iso-8859-1 encoding. This
>would suggest that your terminal is interpreting the characters as
>iso-8859-1 (the default encoding in many situations).
>
>You may be able to start a UTF8 xterm with 'xterm -u8'.
>
>
>
That was a good suggestion. It gave me different results anyhow.

So, I start up an 'xterm -u8' and set LANG, LC_CLANG, LC_CTYPE, and
LC_ALL to en_US.UTF-8. After cleaning up the files left over from the
other terminal, I can now print out 0xc3a4 and see an
a-with-diaeresis! That's correct. Yay!
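
For what it's worth, the difference between the two terminals can be sketched in a few lines of Python. This is only an illustration of the byte-level picture described above, not anything Subversion itself does; the helper name `describe` is made up for the example.

```python
import unicodedata

# The two bytes produced by the tr pipeline: octal 303 244 == hex c3 a4.
raw = bytes([0o303, 0o244])

# Same bytes, two interpretations.
as_utf8 = raw.decode("utf-8")          # a UTF-8 terminal sees one character
as_latin1 = raw.decode("iso-8859-1")   # an ISO-8859-1 terminal sees two


def describe(text):
    """Return the Unicode name of each character in text (hypothetical helper)."""
    return [unicodedata.name(ch) for ch in text]


print(describe(as_utf8))
print(describe(as_latin1))
```

Read as UTF-8 the bytes are a single small a with diaeresis; read as ISO-8859-1 they are a capital A with tilde followed by the currency sign, which fits the "misfigured pound sign" seen in the first terminal.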

Thinking that all might now be running smoothly, I try operations on
percent'ed filenames. I can `svn add G%E4steBuch` just fine. I can
`svn ci` it. If I `svn ls` it directly (in the repository, even
specifying the filename), though, I get 'non-existent in that revision'.
Trying the merge-commit, I get 'File not found' for the byte sequence 47
e4 73 74 65 ... which, of course, is the 0xe4 version of G%E4steBuch. (I
had to run the output through hexdump to see what that character was; the
terminal showed 'e4 73' as a box of some sort.)

That is to say, filenames with percents still seem to be buggy. This
should be reproducible. Just touch a file as 'G%E4steBuch', add it,
commit it, then try ls'ing it, mv'ing it, rm'ing it, merging it into
another branch. See if you have any problems.
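
To make the suspected failure concrete, here is a small Python sketch of what happens if something percent-decodes the name and then treats the result as UTF-8. Whether Subversion actually does this is exactly the open question, so take it as an assumption; `is_valid_utf8` is a helper invented for the example.

```python
from urllib.parse import unquote_to_bytes

name = "G%E4steBuch"

# Percent-decoding turns %E4 into the single byte 0xe4.
decoded = unquote_to_bytes(name)


def is_valid_utf8(data):
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 0xe4 announces a three-byte UTF-8 sequence, but 's' is not a
# continuation byte, so the decoded name is not valid UTF-8.
print(decoded.hex(" "), is_valid_utf8(decoded))

# Treating the byte as ISO-8859-1 and re-encoding gives the c3 a4 form.
recoded = decoded.decode("iso-8859-1").encode("utf-8")
print(recoded.hex(" "), is_valid_utf8(recoded))
```

The first five bytes of `decoded` are the 47 e4 73 74 65 from the 'File not found' message, and `recoded` is the G\303\244steBuch spelling of the same name.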

>As I'm sure you've realised, that's 'e4' (our troublesome friend :-),
>
>
Oh yes. :-)

>followed by "ste". So clearly the 'e4' is being taken as UTF-8 for some
>reason.
>
>
>
I'm glad you think so too. It's nice to have some confirmation of my
line of thought. ;-)

>Another possibility is that since the terminal seems to be in iso-8859-1
>mode, but the environment variables suggest you're using UTF8, that the
>character isn't being affected at all, when it should in fact get
>converted from iso-8859-1 to utf-8. (There may still be a bug here, by
>this stage my head is spinning!)
>
>
Yeah, but I just did an strace on the commit (within the old terminal,
using en_US.UTF-8 variables), and I wouldn't expect to see the following
if that were the case:

open("/usr/share/locale/en_US.UTF-8/LC_MEASUREMENT", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=23, ...}) = 0
mmap2(NULL, 23, PROT_READ, MAP_PRIVATE, 3, 0) = 0x4010e000
close(3) = 0
open("/usr/share/locale/en_US.UTF-8/LC_TELEPHONE", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=59, ...}) = 0
mmap2(NULL, 59, PROT_READ, MAP_PRIVATE, 3, 0) = 0x4010f000
close(3) = 0
open("/usr/share/locale/en_US.UTF-8/LC_ADDRESS", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=155, ...}) = 0
mmap2(NULL, 155, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40110000
close(3) = 0
open("/usr/share/locale/en_US.UTF-8/LC_NAME", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=77, ...}) = 0
mmap2(NULL, 77, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40111000
close(3) = 0
open("/usr/share/locale/en_US.UTF-8/LC_PAPER", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=34, ...}) = 0
mmap2(NULL, 34, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40112000
close(3) = 0
open("/usr/share/locale/en_US.UTF-8/LC_MESSAGES", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFDIR|0755, st_size=80, ...}) = 0
close(3) = 0
open("/usr/share/locale/en_US.UTF-8/LC_MESSAGES/SYS_LC_MESSAGES", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=52, ...}) = 0
mmap2(NULL, 52, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40113000
close(3) = 0
open("/usr/share/locale/en_US.UTF-8/LC_MONETARY", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=286, ...}) = 0
mmap2(NULL, 286, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40114000
close(3) = 0
open("/usr/share/locale/en_US.UTF-8/LC_COLLATE", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=882134, ...}) = 0
mmap2(NULL, 882134, PROT_READ, MAP_PRIVATE, 3, 0) = 0x405fe000
close(3) = 0
open("/usr/share/locale/en_US.UTF-8/LC_TIME", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=2451, ...}) = 0
mmap2(NULL, 2451, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40115000
close(3) = 0
open("/usr/share/locale/en_US.UTF-8/LC_NUMERIC", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=54, ...}) = 0
mmap2(NULL, 54, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40116000
close(3) = 0
open("/usr/share/locale/en_US.UTF-8/LC_CTYPE", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=208464, ...}) = 0
mmap2(NULL, 208464, PROT_READ, MAP_PRIVATE, 3, 0) = 0x406d6000

If the file's contents are:

Date: Sat, 3 Feb 2002 17:28:00 -0500
Mime-Version: 1.0 (Produced by PhpWiki 1.3.3-jeffs-hacks)
X-Rcs-Id: $Id: G%E4steBuch,v 1.6 2002/03/02 22:32:06 carstenklapp Exp $
Content-Type: application/x-phpwiki;
  pagename=G%E4steBuch;
  flags="";
  charset=iso-8859-1
Content-Transfer-Encoding: binary

Would that make a difference? Would the 'charset=iso-8859-1' be messing
things up?

Here are some possibly-relevant portions of the strace output:

\nle\0\0\1p\4\0\0011hhap\24\0\1(svn:executable 1
*)c\0\0\1b\4\0\0011hh9bf\3\1((PhpWiki eky.22a.g0) (G\303\244steBuch
hoj.22a.th) (VolltextSuche el9.22a.g0) (PhpWikiSystemverwalten
el0.22a.g0) (G%E4steBuch hom.22a.tq) (BackLinks eki.22a.g0)
(EditiereText eko.22a.g0)

e/de/G%E4steBuch hof.22a.sp modify 1 1 0 ).s[\0\1(change 54
/SocialMPN/branches/wiki/phpwiki/locale/de/G%E4steBuch hof.22a.sp modify
0 1 1) W\0\1(change 54
/SocialMPN/branches/wiki/phpwiki/locale/de/G%E4steBuch hof.22a.sp add 0
0 )\2\0\2\0\1sp\0\0\0]\0\1(change 60
/SocialMPN/branches/wiki/phpwiki/locale/de/pgsrc/G%E4steBuch eks.241.sn
add 0 0 )_\0\1(change 59
/SocialMPN/branches/wiki/phpwiki/locale/de/pgsrc/G\303\244steBuch
eks.23r.sj delete 0 0 )\2\0\2\0\1sn\0\0\0`\0\1(change 60
/SocialMPN/branches/wiki/phpwiki/locale/de/pgsrc/G%E4steBuch eks.22a.g0
delete 0 0 )y\\\0\1(change 59
/SocialMPN/branches/wiki/phpwiki/locale/de/pgsrc/G\303\244steBuch
eks.23r.sj add 0 0 )\0\2\0\1sj\0\0\0R\0\1

(copy 60 /SocialMPN/branches/wiki/phpwiki/locale/de/pgsrc/G%E4steBuch tb
hoi.251.tc).\3\0\001251).E\0\1(soft-copy 35
/trunk/SocialMPN/admin/settings.php 2 3z 10
2q5.24c.sw)\3\0\00124ca.J\0\1(soft-copy 40
/trunk/SocialMPN/admin/original/main.php 2 86 10
2r0.24b.sw)cg.\3\0\00124b).D\0\1(copy 48
/branches/RuffDogs/Oscar/mods/client/benefit.php r
w0.2e.s)e\2\0\0012e\nDRF\0\1(copy 50
/branches/RuffDogs/Oscar/mods/client/incidents.php r
g9.2d.s)t(2\2\0\0012dT NE\0\1(copy 49
/branches/RuffDogs/Oscar/mods/client/commserv.php r fy.2c.s)\2\0\0012c
NOE\0\1(copy 49 /branches/RuffDogs/Oscar/mods/client/contacts.php r
es.2b.s)\2\0\0012b\f<\fE\0\1(copy 49
/branches/RuffDogs/Oscar/mods/client/learnhrs.php r
d9.2a.s)\2\0\0012ase.B\0\1(copy 46
/branches/RuffDogs/Oscar/lib/libaccess.inc.php r td.29.s)m
I\2\0\00129resA\0\1(copy 45
/branches/RuffDogs/Oscar/lib/libmysql.inc.php r
n6.28.s)\2\0\00128k\223N@\0\1(copy 44
/branches/RuffDogs/Oscar/lib/libmisc.inc.php r
m6.27.s)\10\2\0\00127=\5\201C\0\1(copy 47
/branches/RuffDogs/Oscar/config.inc.php.default r
w8.26.s)\201^\2\0\00126\201\0WF\0\1(copy 50
/branches/RuffDogs/Oscar/lib/menus/clients.inc.php r w6.25.s))
_\2\0\00125phpS\0\1(copy 59
/SocialMPN/branches/wiki/phpwiki/locale/de/pgsrc/G\303\244steBuch sj
eks.241.sn)g.\3\0\001241).C\0\1(copy 47
/branches/RuffDogs/Oscar/lib/libdisplay.inc.php r
vy.24.s)\4\207\2\0\00124\202\34\202T\0\1(copy 60
/SocialMPN/branches/wiki/phpwiki/locale/de/pgsrc/G%E4steBuch sd
eks.23r.sj).\3\0\00123r).B\0\1

.j \1\0\1u_\0\1(change 59
/SocialMPN/branches/wiki/phpwiki/locale/de/pgsrc/G\303\244steBuch
hoj.22a.th delete 0 0 )an\2\0\1tr

p\3\0\001242b S\0\1(copy 59
/SocialMPN/branches/wiki/phpwiki/locale/de/pgsrc/G\303\244steBuch sj
eks.241.sn)g.\3\0\001241).T\0\1(copy 60
/SocialMPN/branches/wiki/phpwiki/locale/de/pgsrc/G%E4steBuch sd
eks.23r.sj).\3\0\00123r).B\0\1

And others like them...
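
In case it helps anyone reading these dumps: strace prints non-ASCII bytes as backslash-octal escapes, so both spellings of the name can be turned back into raw bytes and compared. A rough Python sketch, where `strace_escapes_to_bytes` is a made-up helper and the two strings are copied from the output above:

```python
def strace_escapes_to_bytes(escaped):
    """Turn strace-style \\ooo octal escapes back into raw bytes (hypothetical helper).

    unicode_escape maps each \\ooo to the code point of that value, and
    latin-1 maps code points 0-255 straight back to single bytes.
    """
    return escaped.encode("ascii").decode("unicode_escape").encode("latin-1")


utf8_form = strace_escapes_to_bytes(r"G\303\244steBuch")
percent_form = strace_escapes_to_bytes("G%E4steBuch")

print(utf8_form.hex(" "))
print(percent_form.hex(" "))

# The escaped form is valid UTF-8 for the name with a-diaeresis; the
# percent form is plain ASCII and still contains the literal "%E4".
print(utf8_form.decode("utf-8") == "G\u00e4steBuch")
print(b"%E4" in percent_form)
```

So the dump holds two distinct paths, one with the bytes c3 a4 and one with the literal three characters %E4.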

>It may be worth setting your environment variable to an iso-8859-1 locale
>- in that case the character you're typing *should* get converted to utf8;
>if not there's either a bug somewhere or a problem with the character
>conversion libraries.
>
>
Well, that's essentially what I started with. `locale charmap` on en_US
gives iso-8859-1.

Erich
Ruffdogs.com

Received on Tue Oct 5 00:42:20 2004
