
Re: resolve the unicode problem of mod_authz_svn apache mod

From: Jesper Steen Møller <jesper_at_selskabet.org>
Date: 2006-04-09 21:40:22 CEST

L mice wrote:

> svn reads the auth config file in "rt" mode and can't deal
> with unicode files correctly. Unicode files have 2 bytes, "FF FE",
> before the content, and svn just reads the bytes one by one, so I think svn
> can only deal with ANSI charset files.

Well, yes - SVN reads the file as a byte-oriented stream, without paying
attention to what the non-ASCII characters represent. Some of the
strings then go into internal memory structures that are later treated as
UTF-8.

> ...
> [testsvn:/大家好才是真的好]
>
> in hex mode the bytes are:
> 5B 74 65 73 74 73 76 6E 3A 2F B4F3 BCD2 BAC3 B2C5 CAC7 D5E6 B5C4 BAC3 5D

This is indeed GB18030. This is the ANSI codepage with MBCS in action.
Converting these to UTF-8 before storing them in the internal structures
will solve the problem.
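
To make that conversion concrete, here is a minimal sketch in Python (not Subversion's C, and not how mod_authz_svn is actually written). The helper name `to_utf8` and its default charset are assumptions made for illustration; the input bytes are the ones from the hex dump quoted above.

```python
def to_utf8(raw: bytes, local_charset: str = "gb18030") -> bytes:
    """Decode bytes written in the local charset and re-encode them as UTF-8.

    `to_utf8` and its default charset are hypothetical, for illustration only.
    """
    return raw.decode(local_charset).encode("utf-8")


# Section header bytes, copied from the hex dump in the quoted mail.
raw = bytes.fromhex(
    "5B 74 65 73 74 73 76 6E 3A 2F"
    "B4F3 BCD2 BAC3 B2C5 CAC7 D5E6 B5C4 BAC3 5D"
)

header = to_utf8(raw)
print(header.decode("utf-8"))
```

The point of the sketch is only that the conversion happens once, at read time, so that everything stored afterwards really is UTF-8.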

> and if I use unicode charset,in hex mode,we'll see:
>
> 5B00 7400 6500 7300 7400 7300 7600 6E00 3A00 2F00 2759 B65B 7D59 4D62
> 2F66 1F77 8476 7D59 5D00

Well - this is UCS-2 (or UTF-16 if you must), not UTF-8. UTF-8 would
require 24 bytes for your eight Chinese characters (three bytes each).
I recommend against using UTF-16 in a config file when most platforms
have decent 8-bit oriented local character sets.
The local charset should be fine, as long as it is converted to UTF-8
internally.
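
For the byte counts, and for what honouring the "FF FE" mark could look like, here is a small Python sketch. The function `decode_config` is hypothetical, not anything in Subversion; it only shows the idea of checking for a byte order mark before falling back to the local charset.

```python
import codecs


def decode_config(raw: bytes, local_charset: str) -> str:
    """Decode config bytes, honouring a UTF-16 or UTF-8 BOM if one is present.

    Hypothetical helper; otherwise the bytes are assumed to be in the
    local charset.
    """
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order and drops it.
        return raw.decode("utf-16")
    if raw.startswith(codecs.BOM_UTF8):
        return raw.decode("utf-8-sig")
    return raw.decode(local_charset)


name = "大家好才是真的好"
utf8_len = len(name.encode("utf-8"))       # three bytes per character here
utf16_len = len(name.encode("utf-16-le"))  # two bytes per character here
print(utf8_len, utf16_len)
```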

> IMHO, there are too many charsets in the world. Anyway, they can all be read as
> bytes, and svn supports reading and comparing bytes. So we just need to do
> the reverse of whatever is done to the request.uri; users write their config
> file in the ANSI charset or any other compatible charset, and
> must keep the same charset between client and server.

Well, I more or less agree, I just think that Subversion should stick
with UTF-8 internally as much as possible - to provide for the widest
possible support (in other words, you should be able to leave a file
with Chinese letters in the name on my machine, which may run Win1252 as
the ANSI codepage). UTF-8 enables that.

> // Jesper Steen Møller
> > In summary - yes, there is a bug, but IMHO the bug is that config
> > files are assumed to be UTF-8.
>
> I think svn assumes everything is bytes.
>
Yes, they are bytes, but they are stored (without any checking or
conversion) in internal structures where they are later assumed to be UTF-8.
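
To show why the missing check matters, a short Python sketch (again only an illustration, with a made-up helper name `is_valid_utf8`): bytes written in a GB charset are generally not well-formed UTF-8, so a validity test at read time would catch the mismatch before the bytes reach anything that assumes UTF-8.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if `raw` is well-formed UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# First two characters of the path from the quoted hex dump, in each encoding.
gb_bytes = bytes.fromhex("B4F3 BCD2")
utf8_bytes = "大家".encode("utf-8")

print(is_valid_utf8(gb_bytes), is_valid_utf8(utf8_bytes))
```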

How is this handled for AS/400, I wonder. Mark?

-Jesper

P.S.: For charset nerds like myself:
http://www-950.ibm.com/software/globalization/icu/demo/converters?s=ALL

Received on Sun Apr 9 21:38:08 2006

This is an archived mail posted to the Subversion Dev mailing list.
