-----BEGIN PGP SIGNED MESSAGE-----
On Wednesday 04 December 2002 15:41, Philip Martin wrote:
> Look at libsvn_subr/xml.c:xml_escape, Subversion currently escapes the
> five characters &<>"'. In particular it doesn't escape the ^H that
> Andreas used. I find it odd that Subversion "escapes" a different set
> of characters from that "quoted" by apr_xml_quote_elem, but then I
> don't know much about XML or UTF8.
Basically, xml_escape is wrong. Those five are the only characters that need to be
escaped in normal human text, but control characters need to be escaped as well. In
fact, I've seen some parsers just escape everything outside of 0x20 to 0x7F
(not including newlines (except in XML attributes) and including the 5
characters above). That's probably a bit overboard, but it's safe for
US-ASCII and all of the ISO-8859-* encodings AFAIK, as well as UTF-8.
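
As a minimal sketch of that idea (this is not the Subversion or APR code;
escape_sketch and its buffer-based signature are made up for illustration),
the following escapes the five markup characters and turns control bytes
other than tab, LF and CR into numeric character references. It assumes the
input is already valid UTF-8 and passes bytes >= 0x80 through untouched. One
caveat: XML 1.0 itself only permits tab, LF and CR among the C0 controls, so
a reference such as "&#8;" is accepted by XML 1.1 parsers but may still be
rejected by a strict XML 1.0 parser.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of Subversion or APR.
 * Escapes SRC (NUL-terminated, assumed valid UTF-8) into DST.
 * Returns the number of bytes written, excluding the trailing NUL,
 * or (size_t)-1 if DST is too small. */
static size_t
escape_sketch(const char *src, char *dst, size_t dstlen)
{
  size_t used = 0;

  assert(src != NULL && dst != NULL);

  for (; *src != '\0'; src++)
    {
      unsigned char c = (unsigned char) *src;
      char numeric[8];            /* large enough for "&#127;" plus NUL */
      const char *rep = NULL;
      size_t replen;

      switch (c)
        {
        case '&':  rep = "&amp;";  break;
        case '<':  rep = "&lt;";   break;
        case '>':  rep = "&gt;";   break;
        case '"':  rep = "&quot;"; break;
        case '\'': rep = "&apos;"; break;
        default:
          /* Control bytes other than tab, LF and CR become numeric refs. */
          if ((c < 0x20 && c != '\t' && c != '\n' && c != '\r') || c == 0x7F)
            {
              snprintf(numeric, sizeof(numeric), "&#%u;", (unsigned) c);
              rep = numeric;
            }
          break;
        }

      /* Always leave room for the terminating NUL. */
      replen = rep ? strlen(rep) : 1;
      if (used + replen + 1 > dstlen)
        return (size_t) -1;
      if (rep)
        memcpy(dst + used, rep, replen);
      else
        dst[used] = (char) c;
      used += replen;
    }

  if (used + 1 > dstlen)
    return (size_t) -1;
  dst[used] = '\0';
  return used;
}
```

With this sketch the ^H from the original report (byte 0x08) comes out as
"&#8;" instead of being written raw.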
Does apr_xml_quote_elem do a better job? Is there a reason why svn needs its
own xml_escape function instead of using the apr (or expat) versions?
Looking at the code, xml_escape() is wrong in another way. If inside a CDATA
section, you cannot escape a "]]>" by writing "]]&gt;". You have to exit the CDATA
section, write the "&gt;", and start a new one. Thus, after the literal "]]", you get
"]]>&gt;<![CDATA[". Also, there is no requirement to escape ">" following
"]]" if not in a CDATA section. That comment (line 47) should just be
removed, and possibly there needs to be a separate xml_escape_cdata()
function, if apr doesn't already provide one.
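
A rough sketch of what such a separate function could look like, under the
assumption that the caller supplies the output buffer. The name cdata_sketch
is hypothetical and this is not an existing apr or svn function. Each "]]>"
in the data is written as "]]", then the section is closed, "&gt;" is
written as ordinary character data, and a new section is opened.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CDATA_OPEN  "<![CDATA["
#define CDATA_CLOSE "]]>"
/* Literal "]]" stays inside the section, then close, "&gt;", reopen. */
#define CDATA_SPLIT "]]" CDATA_CLOSE "&gt;" CDATA_OPEN

/* Appends N bytes of S to DST, keeping DST NUL-terminated.
 * Returns 1 on success, 0 if DST is too small. */
static int
append(char *dst, size_t dstlen, size_t *used, const char *s, size_t n)
{
  if (*used + n + 1 > dstlen)
    return 0;
  memcpy(dst + *used, s, n);
  *used += n;
  dst[*used] = '\0';
  return 1;
}

/* Hypothetical helper: wraps SRC in CDATA sections, splitting at every
 * "]]>".  Returns the number of bytes written, or (size_t)-1 if DST is
 * too small. */
static size_t
cdata_sketch(const char *src, char *dst, size_t dstlen)
{
  size_t used = 0;

  assert(src != NULL && dst != NULL);

  if (!append(dst, dstlen, &used, CDATA_OPEN, strlen(CDATA_OPEN)))
    return (size_t) -1;

  while (*src != '\0')
    {
      if (strncmp(src, CDATA_CLOSE, 3) == 0)
        {
          if (!append(dst, dstlen, &used, CDATA_SPLIT, strlen(CDATA_SPLIT)))
            return (size_t) -1;
          src += 3;
        }
      else
        {
          if (!append(dst, dstlen, &used, src, 1))
            return (size_t) -1;
          src++;
        }
    }

  if (!append(dst, dstlen, &used, CDATA_CLOSE, strlen(CDATA_CLOSE)))
    return (size_t) -1;
  return used;
}
```

For the input "a]]>b" this produces "<![CDATA[a]]]]>&gt;<![CDATA[b]]>",
which a parser reads back as "a]]", then ">", then "b".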
Technically, ">" never needs to be escaped. The "&gt;" character-entity is
only provided for symmetry with "&lt;".
Ugh, I just looked at xml_unescape: it doesn't handle any numeric character
escapes (it just ignores anything other than the basic entities). Again, why
re-invent the wheel (incorrectly)? Will apr or some other library do this?
If not, then I'd be happy to correct both functions.
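
To illustrate the missing piece, here is a sketch of decoding a single
numeric character reference ("&#8;" or "&#xE9;") into UTF-8. It is not the
current xml_unescape code; decode_char_ref and utf8_encode are hypothetical
names, and the sketch assumes the output should be UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encodes code point CP as UTF-8 into OUT (which must
 * hold at least 4 bytes).  Returns the byte count, or 0 if CP is not a
 * valid Unicode scalar value. */
static size_t
utf8_encode(unsigned long cp, char *out)
{
  if (cp < 0x80)
    {
      out[0] = (char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (char) (0xC0 | (cp >> 6));
      out[1] = (char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp < 0x10000)
    {
      if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                 /* surrogates are not characters */
      out[0] = (char) (0xE0 | (cp >> 12));
      out[1] = (char) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (char) (0x80 | (cp & 0x3F));
      return 3;
    }
  if (cp <= 0x10FFFF)
    {
      out[0] = (char) (0xF0 | (cp >> 18));
      out[1] = (char) (0x80 | ((cp >> 12) & 0x3F));
      out[2] = (char) (0x80 | ((cp >> 6) & 0x3F));
      out[3] = (char) (0x80 | (cp & 0x3F));
      return 4;
    }
  return 0;
}

/* Hypothetical helper: parses one numeric character reference at the start
 * of S.  On success stores its UTF-8 form in OUT, the byte count in
 * *OUTLEN, and returns the number of input bytes consumed.  Returns 0 if S
 * does not start with a well-formed numeric reference. */
static size_t
decode_char_ref(const char *s, char *out, size_t *outlen)
{
  const char *digits, *p;
  unsigned long cp = 0;
  unsigned long base = 10;

  assert(s != NULL && out != NULL && outlen != NULL);

  if (s[0] != '&' || s[1] != '#')
    return 0;
  digits = s + 2;
  if (*digits == 'x')
    {
      base = 16;
      digits++;
    }

  for (p = digits; ; p++)
    {
      unsigned long d;
      if (*p >= '0' && *p <= '9')
        d = (unsigned long) (*p - '0');
      else if (base == 16 && *p >= 'a' && *p <= 'f')
        d = (unsigned long) (*p - 'a') + 10;
      else if (base == 16 && *p >= 'A' && *p <= 'F')
        d = (unsigned long) (*p - 'A') + 10;
      else
        break;
      cp = cp * base + d;
      if (cp > 0x10FFFF)
        return 0;                 /* beyond the Unicode range */
    }

  /* Need at least one digit, a closing ';', and a non-NUL code point. */
  if (p == digits || *p != ';' || cp == 0)
    return 0;

  *outlen = utf8_encode(cp, out);
  if (*outlen == 0)
    return 0;
  return (size_t) (p - s) + 1;
}
```

A full unescape routine would call something like this whenever it sees
"&#", and fall back to the named entities otherwise.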
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.7 (GNU/Linux)
-----END PGP SIGNATURE-----
Received on Thu Dec 5 01:54:49 2002