[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: [PATCH] issue #2147 - v1

From: Julian Foad <julianfoad_at_btopenworld.com>
Date: 2005-01-28 15:13:03 CET

VK Sameer wrote:
> Julian Foad wrote on 01/26/2005 09:07 AM:
>> Julian Foad wrote:
>>> VK Sameer wrote:
>>>> SVN_ERR( dav_svn__send_xml(lrb->bb, lrb->output,
>>>> "<D:comment>%s</D:comment>" DEBUG_CR,
>>>> - apr_xml_quote_string(pool, msg, 0)) );
>>>> + svn_xml_fuzzy_escape (
>>>> + apr_xml_quote_string (pool, msg, 0),
>>>> + pool)) );
>>>
>>> No. You should strip control characters before you XML-quote it,
>>> otherwise you are relying on (1) your function producing validly
>>> quoted XML (which it may do at present but is not documented to do),
>>> and (2) the XML-quoting function accepting and passing through
>>> control characters (which such a function need not be expected to do).
>>
>> (1)
>> Thinking some more about this, I see now that you originally intended
>> svn_xml_fuzzy_escape to accept nearly-well-formed XML text (but with
>> some control characters) and return well-formed XML text, and that's
>> why it is named "svn_xml_...". Ideally you wanted it to perform some
>> sort of XML quoting which would encode the control characters in XML
>> such that they would be decoded by the XML receiver, but you have
>> found that this is not possible. (I agree: the XML spec says control
>> characters cannot be represented, encoded or otherwise.) Therefore
>> you have had to settle for performing an ad-hoc, mostly
>> human-readable, non-reversible encoding. Because of this, I stand by
>> what I wrote above: your fuzzy function is not, and should not be
>> documented for use on well-formed XML text; it should be applied to
>> the actual (non-escaped) log message.
>
> Wouldn't that cause unnecessary encoding for those transport layers that
> _can_ handle control characters? Control characters in a log message are
> not a problem when using file:// or svn://.

No, I just meant swapping the order of fuzzy_escape and xml_quote where it is
used in dav_svn__send_xml(...). I'm not suggesting it be used in other RA
methods or any higher up the call chain than this; sorry if that wasn't clear.

>> Whether the fuzzy function's name and/or doc-string should change to
>> reflect that it doesn't produce XML but does produce XML-safe text, I
>> don't know.
>
> I'm a little lost now. Why shouldn't the output be considered XML?

Your function doesn't do XML escaping (quoting), so the output can only be
well-formed XML if the input was. And yet the input can't usefully be
well-formed XML because well-formed XML doesn't contain control characters
(except TAB, NL, CR).

The way your patch uses it currently is:

   Arbitrary UTF-8 text: the log message
       -> apr_xml_quote_string() ->
   Fairly-well-formed XML, but not quite
       -> svn_xml_fuzzy_escape() ->
   Well-formed XML.

I believe that apr_xml_quote_string() is not guaranteed to accept arbitrary
UTF-8 text and not intended to produce nearly-XML, but to take XML-safe text
and produce well-formed XML. I know its doc-string doesn't say this explicitly.

I recommend:

   Arbitrary UTF-8 text: the log message
       -> svn_xml_fuzzy_escape() ->
   XML-safe text
       -> apr_xml_quote_string() ->
   Well-formed XML.

In other words, let apr_xml_quote_string work with XML-safe text and generate
well-formed XML, and define your svn_xml_fuzzy_escape as converting arbitrary
UTF-8 text into XML-safe UTF-8 text. In this way, the input and output data
formats of both functions are formats that are commonly used and understood.

Note the notion of "XML-safe text": text that can be represented in XML by
XML-escaping it. The function svn_xml_is_xml_safe guarantees that a string is
XML-safe (though it's a rather restrictive implementation, only supporting the
ASCII subset of XML-safe strings, so it's not good enough for your use).

>> (2)
>> Now I'm confused about what you are escaping. You are escaping all
>> ASCII control characters (as defined by svn_ctype_iscntrl). That
>> includes valid XML characters CR, LF and TAB. Shouldn't you be
>> escaping only non-XML control characters?
>
> Fixed, thanks for pointing that out.

Sorry, but it's not fixed in patch v2. The set of valid XML control characters
is TAB, NL, CR; this is not the same as the set of ASCII white space characters
(which includes VT, for example).

See 'svn_xml_is_xml_safe', but note that it is not what you want in terms of
characters above 127 (it only accepts ASCII, not UTF-8).

- Julian

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Fri Jan 28 15:14:37 2005

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.