
Re: svnlook proplist & unicode characters

From: Nico Kadel-Garcia <nkadel_at_gmail.com>
Date: Wed, 17 Dec 2014 00:19:42 -0500

On Mon, Dec 15, 2014 at 8:59 AM, Philip Martin
<philip.martin_at_wandisco.com> wrote:
> "Matthias Ludwig" <matthias.ludwig_at_stl-software.de> writes:
>
>> I am trying to call svnlook proplist within an svn hook on Windows.
>>
>> svnlook proplist <repo-path> <pathToFile>
>>
>> The <pathToFile> contains Unicode-only characters (Unicode combining characters).
>>
>> The unicode characters are not passed correctly to svnlook.
>>
>> I googled around and found that one should set the code page with chcp. This changes the stdout encoding of svnlook's output, but I did not succeed in changing the interpretation of the calling parameter.
>>
>> The caller is a Java routine. I tried Runtime.getRuntime().exec() and native calls via JNA.
>>
>> I do not know exactly where the problem is. Does the call mess up the
>> Unicode characters? Or is svnlook not capable of processing Unicode
>> characters in input parameters?
>
> svnlook should handle unicode characters in parameters. However
> Subversion has no special support for combining characters and just uses
> whatever literal UTF-8 sequence is supplied. That means the composed
> and decomposed forms are different paths in the repository: e.g. š
> encoded as 's' + U+030C is not the same path as š encoded as U+0161.
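
To make the composed/decomposed distinction concrete, here is a minimal Python sketch. It is not anything Subversion itself does; it only uses the standard unicodedata module to show that the two spellings are different UTF-8 byte sequences. The helper name candidate_forms is hypothetical, added for illustration: it returns the spellings a caller might try when a lookup on one form fails.

```python
import unicodedata

composed = "\u0161"      # 'š' as a single code point (U+0161)
decomposed = "s\u030c"   # 's' followed by COMBINING CARON (U+030C)

# Both render as 'š', but the UTF-8 bytes differ, so a tool that
# compares literal byte sequences sees two different paths.
print(composed.encode("utf-8"))    # b'\xc5\xa1'
print(decomposed.encode("utf-8"))  # b's\xcc\x8c'
print(composed == decomposed)      # False

# Normalizing both to one form makes them comparable.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True


def candidate_forms(path):
    """Return the distinct NFC and NFD spellings of path (hypothetical helper)."""
    forms = []
    for form in ("NFC", "NFD"):
        spelled = unicodedata.normalize(form, path)
        if spelled not in forms:
            forms.append(spelled)
    return forms
```

A caller that does not know which form was committed could pass each result of candidate_forms() to svnlook in turn; a pure-ASCII path yields a single candidate.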

And this is *exactly* why non-ASCII characters should generally be
rejected for filenames, and potentially for commit messages, by a
pre-commit tool. Personally, I prefer to also reject most punctuation:
single and double quotes (left or right), parentheses, curly or
straight brackets, etc. This generally falls into the "sanitize
your inputs" world of programming, as described in XKCD comic
http://xkcd.com/327/. In general, programming for Unicode is
destabilizing to software and to workflows.
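
A rough sketch of the kind of pre-commit check I mean, in Python. The whitelist ALLOWED and the function name rejected_paths are my own illustrative choices, not part of Subversion, and the parsing assumes the usual "svnlook changed" layout of a status column followed by the path starting at the fifth character.

```python
import re

# Assumed policy: ASCII letters, digits, and a few safe separators only.
# This is a hypothetical whitelist; it also rejects spaces and quotes.
ALLOWED = re.compile(r"[A-Za-z0-9._/-]+")


def rejected_paths(changed_lines):
    """Return the paths in 'svnlook changed' output that violate ALLOWED."""
    bad = []
    for line in changed_lines:
        # Assumed line layout: "A   trunk/file.c" (status, padding, path).
        path = line[4:].rstrip("\n")
        if path and not ALLOWED.fullmatch(path):
            bad.append(path)
    return bad
```

In a real hook the lines would come from running svnlook changed -t "$TXN" "$REPOS"; if the returned list is non-empty, the hook would print the offending paths to stderr and exit non-zero so the commit is refused.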

That said, some people do feel the need for various good reasons, and
you've a point that it should be consistent.

> $ svnadmin create repo
> $ svnmucc -mm -U file://`pwd`/repo mkdir `printf "s\u030c"` propset p v `printf "s\u030c"`
> $ svnlook tree repo
> /
> š/
> $ svnlook proplist repo `printf "s\u030c"`
> Properties on '/š':
> p
> $ svnlook proplist repo `printf "\u0161"`
> svnlook: E160013: Path '/š' does not exist
>
> All Subversion utilities do conversion between UTF-8 and whatever local
> encoding is in use. If your local encoding is not UTF-8 then the
> conversion to UTF-8 will probably generate either the composed or
> decomposed form, and it can be difficult to generate the other form; you
> may have to switch your local encoding to UTF-8 and generate it
> yourself. I have no idea what that involves on Windows.
>
> See also http://subversion.tigris.org/issues/show_bug.cgi?id=2464 which
> is about choosing a canonical representation.

"export LANG=C", baby, LANG=C. Sometimes also known as "LANG=POSIX".
I'm still miffed that modern RHEL and its descendants elected to
switch to en_US.UTF-8, which badly breaks case-sensitive ordering for
basic utilities such as "sort".
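
For anyone who has not run into the ordering difference, a small Python illustration. Plain sorted() compares by code point, which matches what sort(1) gives under LANG=C for ASCII names. The case-folding key is only a crude stand-in of my own for locale-style collation; it is not the real en_US.UTF-8 collation algorithm.

```python
names = ["b", "A", "a", "B"]

# Code-point order (the LANG=C behaviour): every uppercase ASCII
# letter sorts before every lowercase one.
c_order = sorted(names)
print(c_order)        # ['A', 'B', 'a', 'b']

# Case-insensitive order, an approximation of dictionary-style
# collation: upper and lower case of the same letter are interleaved.
folded_order = sorted(names, key=str.lower)
print(folded_order)   # ['A', 'a', 'b', 'B']
```

Scripts that depend on the first ordering are the ones that misbehave once the default locale changes.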

> --
> Philip Martin | Subversion Committer
> WANdisco // *Non-Stop Data*
Received on 2014-12-17 06:20:16 CET

This is an archived mail posted to the Subversion Users mailing list.
