
Re: resolve the unicode problem of mod_authz_svn apache mod

From: L mice <hehaoslj_at_gmail.com>
Date: 2006-04-08 14:55:49 CEST

svn reads the auth config file in "rt" mode and can't handle Unicode files
correctly. Unicode (UTF-16 little-endian) files have the 2 bytes "FF FE" before
the content, and svn just reads the bytes one by one, so I think svn can only
deal with files in an ANSI charset.
For example, at config_file.c line 417:
        case '[': /* Start of section header */
In an ANSI charset, the char [ is 0x5B (1 byte).
In a wide charset, the char [ may be 0x5B 00 (2 bytes), so on the next read
ch = 0x00, and I guess the section name will be read as 0x00...
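
To make this concrete, here is a minimal sketch in Python (not Subversion's actual C parser) of a scanner that looks for the section header one byte at a time. The function name scan_section_name is hypothetical and only serves the illustration.

```python
import codecs


def scan_section_name(data: bytes) -> bytes:
    """Hypothetical byte-wise scan: bytes between the first '[' and the next ']'."""
    start = data.index(b"[") + 1
    end = data.index(b"]", start)
    return data[start:end]


header = "[testsvn:/]"
ansi = header.encode("ascii")
# A UTF-16 little-endian file as an editor would save it: BOM "FF FE" first.
utf16 = codecs.BOM_UTF16_LE + header.encode("utf-16-le")

print(scan_section_name(ansi))    # the plain section name
print(utf16[:4].hex())            # BOM, then '[' stored as 5B 00
print(scan_section_name(utf16))   # name begins with a 0x00 byte
```

In C, a name whose first byte is 0x00 is an empty string, so a byte-wise reader cannot match such a section against anything useful.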

My auth_conf file uses the ANSI/OEM charset, and it works correctly.
...
[testsvn:/大家好才是真的好]

In hex mode the bytes are:
5B 74 65 73 74 73 76 6E 3A 2F B4F3 BCD2 BAC3 B2C5 CAC7 D5E6 B5C4 BAC3 5D

There are 8 Chinese characters, which svn sees as 16 bytes.

If I use the Unicode charset instead, in hex mode we see:

5B00 7400 6500 7300 7400 7300 7600 6E00 3A00 2F00 2759 B65B 7D59 4D62 2F66
1F77 8476 7D59 5D00

The 8 characters are again 16 bytes, but not the same bytes as above.

If I use Unicode big endian, it will be:
005B 0074 0065....5927 5BB6 597D 624D 662F 771F 7684 597D 005D
Again 16 bytes, but different ones.
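
The three dumps can be reproduced with a short Python sketch. It assumes the "ansi/oem" charset on my machine corresponds to Python's "gbk" codec, which matches the bytes shown above; the variable names are only for illustration.

```python
path = "大家好才是真的好"  # the 8 Chinese characters from the section header

# Encode the same text in each charset and keep the raw bytes.
encodings = {name: path.encode(name) for name in ("gbk", "utf-16-le", "utf-16-be")}

for name, raw in encodings.items():
    # Same characters, same byte count, different byte values.
    print(f"{name:10} {len(raw):2} bytes  {raw.hex().upper()}")
```

A byte-wise comparison therefore only succeeds when both sides used the same charset.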

IMHO, there are too many charsets in the world. Anyway, they can all be read
as bytes, and svn supports reading and comparing bytes. So we just need to
reverse whatever was done to request.uri; users write their config file in
the ANSI charset or any other compatible charset, and must keep the same
charset between client and server.
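
A rough sketch of that idea in Python, under the assumption that "the reverse" means undoing the percent-escaping of the URI and comparing raw bytes. The function section_matches is hypothetical and is not part of mod_authz_svn.

```python
from urllib.parse import quote, unquote_to_bytes


def section_matches(request_uri: str, section_path: bytes) -> bool:
    """Hypothetical check: undo the URI escaping, then compare raw bytes."""
    return unquote_to_bytes(request_uri) == section_path


config_path = "/大家好".encode("gbk")        # bytes as stored in a GBK config file
uri_gbk = quote("/大家好", encoding="gbk")    # client escaped the GBK bytes
uri_utf8 = quote("/大家好", encoding="utf-8")  # client escaped UTF-8 bytes instead

print(section_matches(uri_gbk, config_path))   # same charset on both sides
print(section_matches(uri_utf8, config_path))  # charsets differ
```

The second call fails to match, which is why the charset has to be the same on the client and the server for this approach to work.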

// Jesper Steen Møller
In summary - yes, there is a bug, but IMHO the bug is that config files are
assumed to be UTF-8.

I think svn assumes everything is just bytes.

I suggest that the bug be fixed during the reading of the config files
(they are effectively always UTF-8 now, I guess they should be in the
native charset - or have a charset declaration in their format, or look
for BOM or whatever). I also tested this using auto-props in my config
file:

[auto-props]
*.txt = svn:eol-style=native;Høkerløsning=true

Add a .txt file, try proplist on it - it is now impossible:

E:\someWC>svn add hej.txt
A hej.txt
E:\someWC>svn proplist hej.txt
Properties on 'hej.txt':
svn: Valid UTF-8 data
(hex: 48)
followed by invalid UTF-8 sequence
(hex: f8 6b 65 72)
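
The failure above, and the "look for BOM" idea, can be sketched in Python. This is only an illustration, not Subversion code: the function config_bytes_to_utf8, the BOMS table, and the choice of "cp1252" as the native charset are assumptions, and UTF-32 is ignored.

```python
import codecs

# Byte order marks checked by this sketch (UTF-32 is deliberately left out).
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def config_bytes_to_utf8(raw: bytes, native: str = "cp1252") -> bytes:
    """Hypothetical reader: honour a BOM, else accept UTF-8, else assume native."""
    for bom, encoding in BOMS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(encoding).encode("utf-8")
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8
    except UnicodeDecodeError:
        return raw.decode(native).encode("utf-8")


text = "*.txt = svn:eol-style=native;Høkerløsning=true"
native_line = text.encode("cp1252")  # what a native-charset editor would save

print(b"H\xf8ker" in native_line)    # the bytes 48, then f8 6b 65 72
print(config_bytes_to_utf8(native_line).decode("utf-8"))
```

The byte f8 can never appear in valid UTF-8, which is why the property name is rejected when the file is assumed to be UTF-8.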

For the Win32 registry, it would be best to use the 'W' (wide) function
calls and then convert from UTF-16 to UTF-8. I can supply a patch if
required.
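
The conversion step itself is small. A Python sketch, assuming the wide registry call hands back UTF-16 little-endian data that may carry a terminating NUL; the function name registry_utf16_to_utf8 is hypothetical and no real registry call is made here.

```python
def registry_utf16_to_utf8(raw: bytes) -> bytes:
    """Convert UTF-16LE data, as a wide call would return it, to UTF-8."""
    text = raw.decode("utf-16-le").rstrip("\x00")  # drop any terminating NUL
    return text.encode("utf-8")


value = "Høkerløsning\x00".encode("utf-16-le")  # stand-in for registry data
print(registry_utf16_to_utf8(value).decode("utf-8"))
```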

-Jesper
Received on Sat Apr 8 14:56:21 2006

This is an archived mail posted to the Subversion Dev mailing list.