
Re: International Characters & Subversion 1.1.0 Problems

From: Patrick Smears <patrick.smears_at_ensoft.co.uk>
Date: 2004-10-04 23:17:09 CEST

On Mon, 4 Oct 2004, Erich Enke wrote:

> I have a little more information now than previously.
>
> The hex character that svn is complaining is a bad UTF sequence is the
> e4. That would make some sense, since the Unicode value for the a with
> hysteresis is indeed 0x00e4.
>
> 0x00e4 in octal is \344, the one that the en_US commit says is missing.
>
> When manually encoding 0x00e4 into UTF-8, I come up with:
> 0x00e4 = 000 1110 0100 ==> 110 000 11 10 10 0100 = 0xc3a4 =
> \303 \244
> with the standard 110 and 10 prefixes.

Your conversion is correct:

printf "\xe4" | iconv -f iso-8859-1 -t UTF-8 | od -tx1
0000000 c3 a4
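
The same two bytes fall out of the bit arithmetic directly. A minimal
sketch, assuming a POSIX shell; the function name utf8_two_byte is made up
for illustration, and it only covers code points in the two-byte range
(U+0080 to U+07FF):

```shell
# Hypothetical helper: hand-encode a code point in U+0080..U+07FF as UTF-8.
utf8_two_byte() {
    cp=$(($1))
    lead=$(( 0xC0 | (cp >> 6) ))    # 110xxxxx: prefix plus the top five bits
    cont=$(( 0x80 | (cp & 0x3F) ))  # 10xxxxxx: prefix plus the low six bits
    printf '%02x %02x\n' "$lead" "$cont"
}

utf8_two_byte 0xe4   # prints: c3 a4
```

0xc3 and 0xa4 are \303 and \244 in octal, which matches your figures.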

> However, even though `locale charmap` says 'UTF-8', if I do:
> echo ab | tr 'a' '\303' | tr 'b' '\244'
> I get Ã¤ (Cap. A + superscript tilde, and then something that looks
> like a misfigured pound sign). That's not right. I should get a
> lower-case a with hysteresis, I would think.

"locale charmap" shows what the environment variables in your shell are
telling your programs to use - i.e. how the programs that you run will
interpret and produce byte sequences. That needn't (sadly!) correspond to
the way your terminal window interprets those sequences when the programs
output them!

The symbols that you're seeing correspond (possibly among other encodings)
to the characters mapped to 'c3' and 'a4' in the iso-8859-1 encoding. This
would suggest that your terminal is interpreting the bytes as
iso-8859-1 (the default encoding in many situations).
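
One way to see what an iso-8859-1 terminal makes of those two bytes is to
decode them as iso-8859-1 and look at the result. A rough sketch, assuming
iconv and od are available; the helper name as_latin1_hex is hypothetical:

```shell
# Hypothetical helper: treat the bytes on stdin as ISO-8859-1 and print the
# hex of their UTF-8 re-encoding on one line.
as_latin1_hex() {
    iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n'
    echo
}

printf '\303\244' | as_latin1_hex   # prints: c383c2a4
```

c3 83 is U+00C3 (capital A with tilde) and c2 a4 is U+00A4 (the currency
sign), i.e. the two symbols you describe.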

You may be able to start a UTF8 xterm with 'xterm -u8'.

> I tried checking in a file with that name, but when committing the merge,
> it doesn't recognize it as an a-with-hysteresis, even though I'm pretty
> sure I got the octal right. However, now I can't even remove that extra
> file! It says:
>
> followed by invalid UTF-8 sequence
> (hex: e4 73 74 65)

As I'm sure you've realised, that's 'e4' (our troublesome friend :-),
followed by "ste". So clearly the 'e4' is being taken as UTF-8 for some
reason.
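
The reason 'e4' is rejected: as a UTF-8 byte, 0xe4 (1110 0100) announces a
three-byte sequence, so the next two bytes must look like 10xxxxxx, and
0x73 ('s') doesn't. A quick way to check a byte string, sketched on the
assumption that your iconv exits non-zero on malformed input (glibc's
does); the helper name check_utf8 is made up:

```shell
# Hypothetical helper: report whether the bytes on stdin are well-formed UTF-8.
check_utf8() {
    if iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
        echo valid
    else
        echo invalid
    fi
}

printf '\344ste' | check_utf8       # prints: invalid
printf '\303\244ste' | check_utf8   # prints: valid
```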

> It seems like I should have enough information to piece together what's
> going on if I could just put it all together...
>
> Trying svn remove on (with cosmetic spaces) 'G 0xe4 steBuch' and 'G
> 0xc3a4 steBuch' both (I can hexdump the contents of the variables I am
> using to hold these characters and confirm that I am indeed holding 0xe4
> and 0xc3a4) yield the above 'invalid UTF-8 sequence', including the 'e4'
> character. So both UTF-16 (I think??) and UTF-8 are being converted to
> UTF-16 (?) somewhere along the way, but that UTF-16 (?) char is being
> interpreted as UTF-8 (0xe4 is indeed invalid UTF-8), which shouldn't be
> happening. This is sounding more and more like a bug to me.

Another possibility: since the terminal seems to be in iso-8859-1 mode
while the environment variables say you're using UTF-8, the character is
being passed through unchanged, when it should in fact get converted from
iso-8859-1 to UTF-8. (There may still be a bug here; by this stage my head
is spinning!)

It may be worth setting your locale environment variables (e.g. LANG or
LC_ALL) to an iso-8859-1 locale - in that case the character you're typing
*should* get converted to UTF-8; if not, there's either a bug somewhere or
a problem with the character conversion libraries.

Patrick

Received on Mon Oct 4 23:17:37 2004

This is an archived mail posted to the Subversion Users mailing list.
