
Re: Control characters in log message cause failure

From: <kfogel_at_collab.net>
Date: 2004-12-01 17:52:39 CET

Philip Martin <philip@codematters.co.uk> writes:
> > Oh, no, I understand how UTF8 works. If we're already checking that
> > log messages are valid UTF8, then that just means my condition is
> > already met. I'm not crazy, I'm just behind the times :-).
> >
> > The problem is that we're not doing that check before we send a log
> > message from server to client.
>
> Which check?

The check that the log message is valid UTF8 (but see below).

> > We should, and if the string is not
> > UTF8... then what?
>
> We already fail if the message is not valid UTF-8:
>
> $ LANG=en_GB.UTF-8 svn commit -m `printf "\xe5"`
> ../svn/subversion/libsvn_client/commit.c:775: (apr_err=22)
> svn: Commit failed (details follow):
> ../svn/subversion/libsvn_subr/utf.c:457: (apr_err=22)
> svn: Valid UTF-8 data
> (hex:)
> followed by invalid UTF-8 sequence
> (hex: e5)

That's a commit from client to server. I'm talking about the server
to client direction. (But again, see below.)

> One of us is confused, or perhaps it is just terminology.
>
> There is no "mixed UTF8/non-UTF8 string" and there are no "non-UTF8"
> characters that need to be "converted". There may be ASCII control
> codes in the log message, and if these are not valid XML then they
> need to be rejected or escaped, but the only place that UTF-8 comes in
> is that ASCII control codes are encoded unchanged in UTF-8.

Sorry, I was using incorrect terminology. When I wrote "valid UTF8",
I really meant "the subset of UTF8 that contains no ASCII control
codes except for LF, CR, TAB", which, as you point out, is not quite
the same thing as "valid UTF8". For the rest of this message, I'll
call such strings "conservative UTF8".

The log message may still be UTF8, but if it's non-conservative UTF8,
it needs to be escaped to travel over XML.

Now, in this case escaping is possible, because the log message
travels as CDATA, which can represent absolutely any byte string,
right? So we can solve the ctrl-char-in-log-messages problem in a way
that would not be possible for the ctrl-char-in-file-path problem.

However, note that it is possible for *truly* non-UTF8 data to end up
in a log message (via cvs2svn, say). So our code flow needs to look
something like this:

   /* sending a log msg from server to client */
   
   if (is_valid_utf8 (msg)) {
     if (has_non_xml_safe_ctrl_chars (msg)) {
       send_message (xml_escape (msg));
     }
     else {
       send_message (msg);
     }
   }
   else {
     send_message (fuzzy_escape (msg));
   }

The "fuzzy_escape()" function would be something similar to
svn_utf_cstring_from_utf8_fuzzy(). It would return a conservative
UTF8 string, with visible escape sequences to represent the
non-conservative bytes.
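
For concreteness, here is a rough, standalone C sketch of that flow. It is
only an illustration, not Subversion code: every name in it
(is_valid_utf8, has_unsafe_ctrl_chars, escape_bytes, prepare_log_message)
is hypothetical, the "?\NNN" decimal escape form is just one possible
visible notation, and a single escape routine stands in for both
xml_escape() and fuzzy_escape(). Instead of sending anything, it returns
the string that would be sent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical: control codes other than TAB, LF, CR are "unsafe". */
static int is_unsafe_ctrl(unsigned char c)
{
  return c < 0x20 && c != '\t' && c != '\n' && c != '\r';
}

/* Hypothetical: return 1 if the LEN bytes at S are well-formed UTF-8. */
static int is_valid_utf8(const unsigned char *s, size_t len)
{
  size_t i = 0, k, n;
  unsigned long cp;
  while (i < len) {
    unsigned char c = s[i];
    if (c < 0x80) { i++; continue; }
    else if ((c & 0xE0) == 0xC0) { n = 1; cp = c & 0x1F; }
    else if ((c & 0xF0) == 0xE0) { n = 2; cp = c & 0x0F; }
    else if ((c & 0xF8) == 0xF0) { n = 3; cp = c & 0x07; }
    else return 0;                       /* stray continuation or bad lead */
    if (n > len - i - 1) return 0;       /* sequence is truncated */
    for (k = 1; k <= n; k++) {
      if ((s[i + k] & 0xC0) != 0x80) return 0;
      cp = (cp << 6) | (s[i + k] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values past U+10FFFF. */
    if ((n == 1 && cp < 0x80) || (n == 2 && cp < 0x800)
        || (n == 3 && cp < 0x10000)) return 0;
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) return 0;
    i += n + 1;
  }
  return 1;
}

static int has_unsafe_ctrl_chars(const unsigned char *s, size_t len)
{
  size_t i;
  for (i = 0; i < len; i++)
    if (is_unsafe_ctrl(s[i])) return 1;
  return 0;
}

/* Replace unsafe control bytes (and, if ESCAPE_HIGH, all bytes >= 0x80)
   with a visible "?\NNN" sequence.  Caller frees the result. */
static char *escape_bytes(const unsigned char *s, size_t len, int escape_high)
{
  char *out = malloc(len * 5 + 1);       /* worst case: 5 chars per byte */
  size_t i, o = 0;
  if (out == NULL) return NULL;
  for (i = 0; i < len; i++) {
    if (is_unsafe_ctrl(s[i]) || (escape_high && s[i] >= 0x80))
      o += (size_t) sprintf(out + o, "?\\%03u", (unsigned) s[i]);
    else
      out[o++] = (char) s[i];
  }
  out[o] = '\0';
  return out;
}

/* Server-to-client direction: return the string that would be sent. */
char *prepare_log_message(const char *msg)
{
  const unsigned char *bytes = (const unsigned char *) msg;
  size_t len;
  char *copy;
  assert(msg != NULL);
  len = strlen(msg);
  if (!is_valid_utf8(bytes, len))
    return escape_bytes(bytes, len, 1);  /* truly non-UTF8: fuzzy escape */
  if (has_unsafe_ctrl_chars(bytes, len))
    return escape_bytes(bytes, len, 0);  /* valid UTF8, unsafe ctrl chars */
  copy = malloc(len + 1);                /* already conservative UTF8 */
  if (copy != NULL) memcpy(copy, msg, len + 1);
  return copy;
}
```

With this sketch, a message containing the byte 0x18 comes back with that
byte shown as "?\024", and a lone 0xe5 byte (invalid UTF8) comes back as
"?\229", while an ordinary UTF8 message is passed through untouched.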

Am I making more sense now?

> It looks like we have the same problem with paths in the entries file:
>
> $ svn mkdir wc/`printf "\x18"`
> $ svn st wc
> ../svn/subversion/libsvn_wc/entries.c:671: (apr_err=130003)
> svn: XML parser failed in 'wc'
> ../svn/subversion/libsvn_subr/xml.c:365: (apr_err=130003)
> svn: Malformed XML: not well-formed (invalid token) at line 13

Yup. That's another thread, of course:

   "issue #1954 (was: Re: Supporting non-XML-safe pathnames)"

-K

Received on Wed Dec 1 17:56:55 2004

This is an archived mail posted to the Subversion Dev mailing list.
