Re: some valid Linux filenames break subversion

From: Amelia A Lewis <alewis_at_tibco.com>
Date: 2004-07-14 19:16:43 CEST

On Wed, 14 Jul 2004 10:24:39 -0500
kfogel@collab.net wrote:
> I'm operating under the assumption that XML has no standard way, in
> attribute values, to represent every possible character. Whereas in
> CDATA, there *is* a way to represent every possible character.

Umm, no, there isn't.

CDATA has precisely the same set of restrictions as PCDATA (and attributes
default to CDATA, in DTD terms), except that a CDATA section doesn't
interpret pointy brackets and ampersands, so it isn't necessary to escape
them.

If you want any characters in the range 0-31 or 128-159 (C0 and C1), apart
from tab, CR, and LF, then you *must* encode them in some fashion, such as
Base64. Their appearance as literal characters, including in a CDATA
attribute or a CDATA section (<![CDATA[ ]]>), is a well-formedness error,
and it is illegal for the parser to attempt to continue parsing.

These characters may *not* be represented by numeric character references
(&#x07; is as illegal as the appearance of a bare BEL character). I don't
recall, off the top of my head, whether the appearance of such a character
reference is treated as a well-formedness error (I think that it is), or is
treated as an unrecognized entity (which, in fact, may be a
well-formedness error as well; the rules on entities are fairly
unpleasant, all things considered).
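
For anyone who wants to check this against a real parser, here is a minimal
sketch using Python's standard xml.etree.ElementTree (which sits on top of
expat). It only exercises one C0 character, BEL, and the element and
attribute names ("name", "entry") are made up for the illustration; the
helper is_well_formed() is likewise hypothetical, not part of any library.

```python
import xml.etree.ElementTree as ET


def is_well_formed(doc: str) -> bool:
    """Return True if the parser accepts doc, False if it raises ParseError."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# BEL (0x07) as a literal character in element content.
bare_bel = "<name>a\x07b</name>"
# BEL written as a numeric character reference.
ref_bel = "<name>a&#x07;b</name>"
# BEL inside a CDATA section.
cdata_bel = "<name><![CDATA[a\x07b]]></name>"
# BEL inside an attribute value.
attr_bel = '<entry name="a\x07b"/>'
# Tab is one of the permitted control characters, so this one is fine.
ref_tab = "<name>a&#x09;b</name>"

results = {
    "bare_bel": is_well_formed(bare_bel),
    "ref_bel": is_well_formed(ref_bel),
    "cdata_bel": is_well_formed(cdata_bel),
    "attr_bel": is_well_formed(attr_bel),
    "ref_tab": is_well_formed(ref_tab),
}
```

With expat, all four BEL variants are rejected with a parse error and only
the tab reference is accepted, which suggests the character reference case
is handled as an ordinary well-formedness failure.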

> If there were a way to do it in attributes as well, then we'd be fine
> (even if that way were different from the CDATA way).

Umm. No. If you are placing information that contains characters that
fall into the Unicode C0 and C1 (control) sets, then you *must* encode
that information, because those characters are *illegal*.
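
As a rough sketch of what "encode that information" could look like, the
following Python round-trips a filename containing BEL through Base64 in an
attribute. The element name "entry" and the attributes "name" and
"encoding" are assumptions invented for this example; they are not
Subversion's actual on-disk format.

```python
import base64
import xml.etree.ElementTree as ET


def encode_name(raw: bytes) -> ET.Element:
    """Store raw filename bytes as Base64 in a hypothetical <entry> element."""
    elem = ET.Element("entry")
    elem.set("encoding", "base64")  # marker so a reader knows to decode
    elem.set("name", base64.b64encode(raw).decode("ascii"))
    return elem


def decode_name(elem: ET.Element) -> bytes:
    """Recover the original bytes, decoding Base64 when the marker is set."""
    value = elem.get("name", "")
    if elem.get("encoding") == "base64":
        return base64.b64decode(value)
    return value.encode("utf-8")


raw = b"bell\x07file"  # a valid Linux filename containing BEL
doc = ET.tostring(encode_name(raw), encoding="unicode")
round_tripped = decode_name(ET.fromstring(doc))
```

The serialized document contains only Base64 alphabet characters in the
attribute value, so it parses cleanly, and the reader gets the original
bytes back. The cost is that the name is no longer human-readable in the
file, which is why a marker attribute is used here rather than encoding
every name unconditionally.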

Note that if you are using a character set encoding of some sort that
happens to re-use the bit patterns of C0 for some printable Unicode
characters, that's legal (I can't think of any character set that does
this, though; perhaps EBCDIC? Almost every other character set uses the
ASCII bit patterns for the bottom seven bits, and Unicode uses the same
mapping (ASCII is a proper subset of Unicode (at least of UTF-8; things
are more complex with UTF-16, UCS-2, and the like))).

Did I close all my parens? My mail editor doesn't do paren matching ....

(provoked into being an XML geek)

Amelia A. Lewis
Senior Architect
TIBCO/Extensibility, Inc.
Received on Wed Jul 14 19:16:49 2004

This is an archived mail posted to the Subversion Dev mailing list.