
Re: resolve the unicode problem of mod_authz_svn apache mod

From: Jesper Steen Møller <jesper_at_selskabet.org>
Date: 2006-04-07 03:11:20 CEST

L mice wrote:

> thanks for your advice.
> 1. my server is Windows2003 system,when directories are
> [1-9a-zA-Z],it's correct.

What system charset are you using in Windows? GB18030?

> 2.and when a repository or a directory contain chinese char. I reach a
> error
> [Fri Mar 31 21:16:30 2006] [error] [client 192.168.1.103
> <http://192.168.1.103/>] Access denied: 'hehao' CHECKOUT
> testsvn:/\xe5\xa4\xa7\xe5\xae\xb6\xe5\xa5\xbd\xe6\x89\x8d\xe6\x98\xaf\xe7\x9c\x9f\xe7\x9a\x84\xe5\xa5\xbd
>
>
> in python ,I do a test:
> >>> import codecs
> >>>
> a='testsvn:/\xe5\xa4\xa7\xe5\xae\xb6\xe5\xa5\xbd\xe6\x89\x8d\xe6\x98\xaf\xe7\x9c\x9f\xe7\x9a\x84\xe5\xa5\xbd'
> >>> codecs.utf_8_decode (a)[0]
> u'testsvn:/\u5927\u5bb6\u597d\u624d\u662f\u771f\u7684\u597d'

[...]

What is the encoding of your authz config file - your native (Windows)
encoding or UTF-8?

You show that no conversion is done while reading the config file, nor
while checking authorization. That's how I see it too, good research!

Your patch - if I understand correctly - converts the Apache URI from
UTF-8 to "cstring" (i.e. local 8-bit charset) before looking up in the
authorization structure. This is then matched against the URIs in your
authz config file, right?

Wouldn't it be better to normalize the encoding in the authz config
structures instead? The input file could either always be assumed to be
UTF-8, or be converted from the native charset into UTF-8 (or it could
carry an encoding declaration).
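
To make the idea concrete, here is a minimal sketch in Python (the language of the test quoted above). It is not Subversion's code: the function name normalize_config_bytes and the native_charset parameter are hypothetical, and "valid UTF-8 means UTF-8" is only a heuristic, since some native byte sequences can also happen to decode as UTF-8.

```python
def normalize_config_bytes(raw: bytes, native_charset: str = "gb18030") -> bytes:
    """Return config file contents as UTF-8 bytes (hypothetical helper)."""
    # A leading UTF-8 BOM is an explicit hint; drop it so that it does not
    # end up inside the first section name.
    if raw.startswith(b"\xef\xbb\xbf"):
        raw = raw[3:]
    try:
        # Already valid UTF-8: keep the bytes unchanged.
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Otherwise assume the native charset and convert to UTF-8.
        return raw.decode(native_charset).encode("utf-8")


section = "[testsvn:/\u5927\u5bb6\u597d]"
as_native = section.encode("gb18030")
print(normalize_config_bytes(as_native) == section.encode("utf-8"))
```

With the section names normalized like this at read time, they can be compared byte for byte against the UTF-8 path that arrives with the request.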

> auth_checker-->ap_get_module_config-->req_check_acces
> -->svn_repos_authz_read
> -->svn_config_read-->svn_config__parse_file-->parse_section_name-->svn_stringbuf_appendbytes
>
> no any convert

IMHO, there should be.

> 5.So,I think the Apache2 in Windows2003 convert the request's uri to
> utf-8.

> Off the top of my head, I'm not sure this is correct, since I'd be kind
> of surprised if it's always correct that the URI you get from Apache
> is in utf8 form...
>
> -garrett
>
>
The character encoding of the request URI depends on the client,
obviously. There is a convergence on UTF-8 in the HTTP request URI
(percent-encoded, like %e8%d4, but still UTF-8 sequences character-wise),
but many products use heuristics to guess the right encoding. On my
machine Firefox will assume ISO-8859-1 if I type æøå into the address
bar, but switch to UTF-8 if I type Japanese.

Subversion, luckily, does this consistently as UTF-8, as can be seen by
using a proxy:

D:\>svn ls http://subclipse.tigris.org/svn/subclipse/l%e6rke

PROPFIND http://subclipse.tigris.org/svn/subclipse/l%C3%A6rke HTTP/1.1
Host: subclipse.tigris.org
User-Agent: SVN/1.2.3 (r15833) neon/0.24.7
Connection: TE
TE: trailers
Content-Length: 300
Content-Type: text/xml
Depth: 0
Accept-Encoding: gzip
Accept-Encoding: gzip

<?xml version="1.0" encoding="utf-8"?><propfind
xmlns="DAV:"><prop><version-controlled-configuration
xmlns="DAV:"/><resourcetype xmlns="DAV:"/><baseline-relative-path
xmlns="http://subversion.tigris.org/xmlns/dav/"/><repository-uuid
xmlns="http://subversion.tigris.org/xmlns/dav/"/></prop></propfind>

So yes, Apache will provide a UTF-8 encoded URI to the authz
functionality. P.S. Why the double "Accept-Encoding: gzip", I wonder.
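
The re-encoding visible in the proxy trace can be reproduced with Python's standard urllib.parse. This is only an illustration of the byte-level effect, not of what the svn client does internally; reading the typed %e6 as ISO-8859-1 is an assumption on my part.

```python
from urllib.parse import quote, unquote

typed = "l%e6rke"                          # path segment as typed on the command line
name = unquote(typed, encoding="latin-1")  # assumption: %e6 is a single ISO-8859-1 byte
on_wire = quote(name, encoding="utf-8")    # percent-encode the UTF-8 bytes instead

print(on_wire)  # l%C3%A6rke
```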

In summary - yes, there is a bug, but IMHO the bug is that config files
are assumed to be UTF-8 without any conversion.
I suggest fixing it where the config files are read (they are
effectively always treated as UTF-8 now; I guess they should be read in
the native charset, or carry a charset declaration in their format, or
be checked for a BOM, or whatever). I also tested this using auto-props
in my config file:

[auto-props]
*.txt = svn:eol-style=native;Høkerløsning=true

Add a .txt file and run proplist on it - listing the properties now fails:

E:\someWC>svn add hej.txt
A hej.txt
E:\someWC>svn proplist hej.txt
Properties on 'hej.txt':
svn: Valid UTF-8 data
(hex: 48)
followed by invalid UTF-8 sequence
(hex: f8 6b 65 72)
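
The bytes in that error message match what you get when the property name is stored in a single-byte Western charset and then validated as UTF-8. A small Python sketch, assuming the config file was saved as ISO-8859-1 (Windows-1252 gives the same byte for ø):

```python
name = "H\u00f8kerl\u00f8sning"   # the property name from the auto-props line
raw = name.encode("latin-1")      # assumption: config file saved in a native 8-bit charset

try:
    raw.decode("utf-8")
    failure = None
except UnicodeDecodeError as exc:
    failure = exc                 # records the offset where validation stops

# Split the bytes the same way the svn error message does.
valid_prefix = raw[:failure.start]
invalid_tail = raw[failure.start:failure.start + 4]

print("valid:", " ".join(f"{b:02x}" for b in valid_prefix))
print("invalid:", " ".join(f"{b:02x}" for b in invalid_tail))
```

The valid part is the single byte 48 ('H'), and the invalid sequence starts at f8, which can never begin a UTF-8 character.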

For the Win32 registry, it would be best to use the 'W' (wide) function
calls and then convert from UTF-16 to UTF-8. I can supply a patch if
required.
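
The conversion step itself is small. As a rough sketch in Python (the actual patch would be C against the Win32 API; the helper name below is hypothetical, and the buffer simply stands in for what a 'W' registry call would return, i.e. UTF-16LE text with a trailing NUL):

```python
def utf16_registry_value_to_utf8(buf: bytes) -> bytes:
    """Convert a UTF-16LE string buffer to UTF-8 bytes (hypothetical helper)."""
    text = buf.decode("utf-16-le")
    # String values usually include a terminating NUL; drop it.
    text = text.rstrip("\x00")
    return text.encode("utf-8")


# Simulated buffer; no registry access is performed here.
buf = "H\u00f8kerl\u00f8sning\x00".encode("utf-16-le")
print(utf16_registry_value_to_utf8(buf))
```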

-Jesper

Received on Fri Apr 7 03:10:12 2006

This is an archived mail posted to the Subversion Dev mailing list.
