
Re: Are log messages Unicode?

From: Barry Scott <barry_at_barrys-emacs.org>
Date: Sat, 12 Jul 2008 15:54:56 +0100

On Jul 7, 2008, at 17:15, Karl Fogel wrote:

> "Ben Collins-Sussman" <sussman_at_red-bean.com> writes:
>> On Sun, Jul 6, 2008 at 5:23 AM, Barry Scott <barry_at_barrys-emacs.org> wrote:
>>> Using the svn_client API, is it possible for a client to write
>>> non-UTF-8 log messages?
>>> Clearly if this happened it would be a bug in the client given the
>>> above statement.
>>
>> I don't recall the details, but it's actually the *programmers'*
>> burden to convert paths and log messages from native locale to UTF8
>> (and back again). If you read the svn APIs, you'll notice that every
>> path and log message passed into APIs (or passed around between APIs)
>> is presumed to *already* be UTF8. So if you're writing your own
>> client, it's your job to convert user input to UTF8 before passing to
>> svn_client_*(). Look at the commandline client to see how it's doing
>> that; I believe there are a number of convenience routines in
>> libsvn_subr to help with conversion.
>
> I think Barry's asking if the client and/or server do any validation.
> That is, if the programmer supplies a non-UTF8 log message, our client
> libraries should reject it; and if such a log message were to reach the
> repository (perhaps because someone wrote their own client software from
> scratch), the repository should reject it too.
>
> I don't know whether we do such validation or not, but agree we should.
>
> Barry, got time to test/trace it?
>

I have the dump of the repository that causes pysvn to fail. The
attachment contains the fragment of the dump file for r219 that causes
the problem. If you need the full 3MB dump, I'll have to ask permission
to pass it on to you.

Python cannot decode the svn:log as utf-8.

$ python2.5 extract_log_text.py
'Bitbucket r\xe9serv\xe9 \xe0 dev/null\nClassement dans Mail/spam seulement apr\xe8s le localstart qui lance spamc\n'
'\xe9s'
Traceback (most recent call last):
   File "extract_log_text.py", line 12, in <module>
     print log.decode( 'utf-8' )
   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/encodings/utf_8.py", line 16, in decode
     return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 11-13: invalid data

Is this proof that the repository contains non-UTF-8 log text?
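
For reference, here is a minimal sketch of the same check in Python 3 (the script above ran under Python 2.5). It assumes the log bytes are exactly those in the repr printed above. is_valid_utf8 is a hypothetical helper, not part of pysvn or Subversion, and ISO-8859-1 as the original encoding is only a guess based on the byte values.

```python
# Log message bytes copied from the repr printed above.
LOG = (b'Bitbucket r\xe9serv\xe9 \xe0 dev/null\n'
       b'Classement dans Mail/spam seulement apr\xe8s le localstart '
       b'qui lance spamc\n')


def is_valid_utf8(data):
    """Return True if data decodes cleanly as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


# Strict decoding fails: 0xe9 starts a multi-byte sequence but 's' follows it.
print(is_valid_utf8(LOG))

# Assumption: the committer's locale was ISO-8859-1 (Latin-1).
text = LOG.decode('latin-1')

# The bytes the log would have contained had it been stored as UTF-8.
print(text.encode('utf-8'))
```

If the Latin-1 guess is right, the re-encoded bytes are what a client should have sent to svn_client_*() in the first place.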

svn 1.4.6 is happy to show the log:

$ svn log -r219 file:///Users/barry/tmp/repos/trunk/dotfiles
------------------------------------------------------------------------
r219 | bortzmeyer | 2003-01-17 14:04:31 +0000 (Fri, 17 Jan 2003) | 3 lines

Bitbucket r?\233serv?\233 ?\224 dev/null
Classement dans Mail/spam seulement apr?\232s le localstart qui lance spamc

------------------------------------------------------------------------

But I understand that each ?\233 is supposed to be é.
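
A small sketch of how those escapes line up with the bytes Python printed. It assumes the three-digit numbers after ?\ are decimal byte values (which matches \xe9, \xe0 and \xe8 above) and that the bytes are ISO-8859-1. describe is a hypothetical helper.

```python
import unicodedata

# Decimal values taken from the ?\NNN escapes in the svn log output above.
ESCAPED = (233, 224, 232)


def describe(code):
    """Return (hex byte, Unicode name) for one byte value read as Latin-1."""
    char = bytes([code]).decode('latin-1')
    return '0x%02x' % code, unicodedata.name(char)


for code in ESCAPED:
    print(code, *describe(code))
```

Under those assumptions, 233 is byte 0xe9, which is é in Latin-1.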

Barry

Received on 2008-07-12 16:55:29 CEST

This is an archived mail posted to the Subversion Dev mailing list.