Re: Bug: Control char in commit message

From: Marcus Comstedt <marcus_at_mc.pp.se>
Date: 2002-12-05 17:31:25 CET

Peter Davis <peter@pdavis.cx> writes:

> fact, I've seen some parsers just escape everything outside of 0x20 to 0x127
> (not including newlines (except in XML attributes) and including the 5
> characters above). That's probably a bit overboard, but it's safe for
> US-ASCII and all of the ISO-9660-* encodings AFAIK, as well as UTF-8.

(I'm assuming you meant 127 == 0x7f, not 0x127)

Escaping everything is of course safe, but if you want to escape
characters over 0x7f you have to take care: The UTF-8 octet sequence
0xc3 0xa4 (representing the character "ä") has to be escaped as
&#xe4; (or &#228; or &auml;), not &#xc3;&#xa4;. The escapes encode
characters, not octets. Therefore, in the case of UTF-8, it's better
_not_ to try to escape characters beyond ASCII.
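
To make the character-versus-octet distinction concrete, here is a
minimal Python sketch, assuming the input arrives as UTF-8 octets. The
two function names are hypothetical and exist only for illustration.

```python
def escape_by_character(octets: bytes) -> str:
    """Decode UTF-8 first, then emit one reference per non-ASCII character."""
    text = octets.decode("utf-8")
    return "".join(ch if ord(ch) < 0x80 else f"&#x{ord(ch):x};" for ch in text)


def escape_by_octet(octets: bytes) -> str:
    """Wrong for UTF-8: emits one reference per octet instead of per character."""
    return "".join(chr(b) if b < 0x80 else f"&#x{b:x};" for b in octets)


data = b"\xc3\xa4"                 # UTF-8 encoding of U+00E4
print(escape_by_character(data))   # &#xe4;
print(escape_by_octet(data))       # &#xc3;&#xa4;
```

The second result names two unrelated characters (U+00C3 and U+00A4),
which is exactly the mistake described above.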

The octets 0-127 can safely be encoded as &#nn; though, since in this
range the octet value and the UNICODE codepoint of the character are
the same (this goes for UTF-8 as well as ISO-8859-* (ISO-9660 is the
CD-ROM filesystem standard :)).

The best option would probably be to encode the characters 0-31
(except 10 and 13) and 127 as numeric character entities, and '"&<> as
named character entities (&apos; &quot; &amp; &lt; &gt;), leaving all
other characters/octets unescaped. If only one of the quote
characters is used to enclose all attributes, then the other one
doesn't need to be escaped.
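
The rule above could be sketched in Python roughly as follows. The
names NAMED and escape_text are hypothetical, and the sketch assumes
the input has already been decoded to a string, so characters beyond
ASCII pass through untouched.

```python
# Named entities for the five markup-significant characters.
NAMED = {"'": "&apos;", '"': "&quot;", "&": "&amp;", "<": "&lt;", ">": "&gt;"}


def escape_text(text: str) -> str:
    """Escape control characters numerically and '"&<> by name."""
    out = []
    for ch in text:
        code = ord(ch)
        if ch in NAMED:
            out.append(NAMED[ch])
        elif (code < 32 and code not in (10, 13)) or code == 127:
            # Control characters 0-31 (except LF and CR) and DEL.
            out.append(f"&#{code};")
        else:
            # Everything else, including non-ASCII, is left as-is.
            out.append(ch)
    return "".join(out)


print(escape_text('a<b & "c"\x07\n'))
```

One caveat: XML 1.0 only allows 9, 10 and 13 as characters below 0x20,
even when written as numeric references, so a strict XML 1.0 parser
would still reject output such as &#7;.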

  // Marcus

Received on Thu Dec 5 17:33:04 2002

This is an archived mail posted to the Subversion Dev mailing list.