Peter Davis <peter@pdavis.cx> writes:
[...]
> fact, I've seen some parsers just escape everything outside of 0x20 to 0x127
> (not including newlines (except in XML attributes) and including the 5
> characters above). That's probably a bit overboard, but it's safe for
> US-ASCII and all of the ISO-9660-* encodings AFAIK, as well as UTF-8.
(I'm assuming you meant 127 == 0x7f, not 0x127.)
Escaping everything is of course safe, but if you want to escape
characters over 0x7f you have to take care: The UTF-8 octet sequence
0xc3 0xa4 (representing the character "ä") has to be escaped as
&#228; (or &#xe4; or &auml;), not &#195;&#164;. The escapes encode
characters, not octets. Therefore, in the case of UTF-8, it's better
_not_ to try to escape characters beyond ASCII.
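
To make the character-vs-octet distinction concrete, here is a minimal
sketch in Python 3. The function names are made up for illustration;
the point is only that the correct version works on decoded characters,
while the mistaken one emits one reference per octet.

```python
def escape_non_ascii(text):
    """Replace each character above U+007F with a numeric character reference."""
    return "".join(c if ord(c) < 0x80 else "&#%d;" % ord(c) for c in text)


def escape_octets_wrongly(data):
    """The mistake described above: one reference per octet, not per character."""
    return "".join(chr(b) if b < 0x80 else "&#%d;" % b for b in data)


utf8_octets = b"\xc3\xa4"  # UTF-8 encoding of U+00E4

# Decode first, then escape: one reference for the one character.
right = escape_non_ascii(utf8_octets.decode("utf-8"))

# Escaping the raw octets yields two references for two unrelated characters.
wrong = escape_octets_wrongly(utf8_octets)

print(right)  # &#228;
print(wrong)  # &#195;&#164;
```

Decoding before escaping is what makes the first version safe; skipping
the escape for non-ASCII characters altogether avoids the issue entirely.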
The octets 0-127 can safely be encoded as &#nn; though, since in this
range the octet value and the Unicode code point of the character are
the same (this goes for UTF-8 as well as ISO-8859-* (ISO-9660 is the
CD-ROM filesystem standard :)).
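
A quick way to convince yourself of this, sketched in Python 3 (using
ISO-8859-1 as one representative of the ISO-8859-* family, which is my
choice for the example):

```python
# For every octet 0-127, decoding as UTF-8 or ISO-8859-1 gives the
# character whose code point equals the octet value.
for octet in range(128):
    raw = bytes([octet])
    assert ord(raw.decode("utf-8")) == octet
    assert ord(raw.decode("iso-8859-1")) == octet

# Above 127 the encodings part ways: 0xE4 is a complete character in
# ISO-8859-1, but a lone 0xE4 is not valid UTF-8 at all.
high = bytes([0xE4])
latin1_ok = ord(high.decode("iso-8859-1")) == 0xE4
try:
    high.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```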
The best option would probably be to encode the characters 0-31
(except 10 and 13) and 127 as numeric character entities, and '"&<> as
named character entities (&apos; &quot; &amp; &lt; &gt;), leaving all
other characters/octets unescaped. If only one of the quote
characters is used to enclose all attributes, then the other one
doesn't need to be escaped.
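
A rough sketch of that scheme in Python 3, in case it helps. The name
xml_escape and the quote parameter are hypothetical, not anything in
the Subversion code. One caveat: XML 1.0 only permits #x9, #xA and #xD
below #x20, even as character references, so strict XML 1.0 output
would have to drop or reject the other control characters rather than
escape them; the sketch just follows the scheme as described.

```python
NAMED = {"'": "&apos;", '"': "&quot;", "&": "&amp;", "<": "&lt;", ">": "&gt;"}


def xml_escape(text, quote=None):
    """Escape a string of characters (not octets).

    quote: if "'" or '"', assume every attribute is enclosed in that
    quote character, so the other quote is left alone. None escapes both.
    """
    out = []
    for c in text:
        n = ord(c)
        if c in NAMED:
            if quote is not None and c in "'\"" and c != quote:
                out.append(c)          # the unused quote character
            else:
                out.append(NAMED[c])   # named entity
        elif (n < 32 and n not in (10, 13)) or n == 127:
            out.append("&#%d;" % n)    # control character, numeric form
        else:
            out.append(c)              # everything else passes through
    return "".join(out)


print(xml_escape('a<b & "c"\t\xe4'))
print(xml_escape("it's \"x\"", quote='"'))
```

Non-ASCII characters pass through untouched, so the output still has to
be written in the encoding the document declares.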
// Marcus
Received on Thu Dec 5 17:33:04 2002