
Re: some valid Linux filenames break subversion

From: <kfogel_at_collab.net>
Date: 2004-07-14 18:03:10 CEST

Amelia A Lewis <alewis@tibco.com> writes:
> If you want any characters in the range 0-31 or 128-159 (C0 and C1), apart
> from tab, cr, and lf, then you *must* encode them in some fashion, such as
> Base64. Their appearance, as characters, including such appearance in a
> CDATA attribute or a CDATA section (<![CDATA[ ]]>) is a well-formedness
> error, and it is illegal for the parser to attempt to continue parsing.
>
> These characters may *not* be represented by numeric character entities
> (&#x07; is as illegal as the appearance of a bare BEL character). I don't
> recall, off the top of my head, whether the appearance of such a character
> entity is treated as a well-formedness error (I think that it is), or is
> treated as an unrecognized entity (which, in fact, may be a
> well-formedness error as well; the rules on entities are fairly
> unpleasant, all things considered).

Ah.

Okay, so the real issue here is not

   "There's no way to represent these funny characters in attribute
    values"

but rather

   "We (the Subversion project) are unwilling to Base64 or otherwise
    encode the attribute values in our .svn/entries files, probably
    because that would make them so much harder to debug."

But if we *did* encode the attribute values somehow, we could support
all possible filenames.
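
For the record, here is a rough sketch of that trade-off in Python. It is
only an illustration, not a proposal for the real entries format: the
"entry" element, the "name" and "encoding" attributes, and the helper
function names are all made up for this example. It assumes the parser
behind Python's xml.etree.ElementTree (expat), which rejects both the bare
BEL and the &#x07; reference, just as described above.

```python
import base64
import xml.etree.ElementTree as ET


def encode_name(name: bytes) -> str:
    """Base64-encode a raw filename so it is safe in an XML attribute."""
    return base64.b64encode(name).decode("ascii")


def decode_name(encoded: str) -> bytes:
    """Recover the raw filename bytes from the Base64 attribute value."""
    return base64.b64decode(encoded)


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# A valid Linux filename containing a BEL (0x07) control character.
raw_name = b"bell\x07file"

# Hypothetical entry elements; these do not mirror the real entries schema.
bare = '<entry name="bell\x07file"/>'        # bare control character
char_ref = '<entry name="bell&#x07;file"/>'  # numeric character reference
encoded = '<entry name="%s" encoding="base64"/>' % encode_name(raw_name)

for label, text in (("bare", bare), ("char_ref", char_ref), ("encoded", encoded)):
    print(label, is_well_formed(text))

# The encoded form round-trips back to the original bytes.
assert decode_name(ET.fromstring(encoded).get("name")) == raw_name
```

The cost is visible right in the encoded element: the attribute value is no
longer something you can read at a glance, which is exactly the debugging
worry.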

> Umm. No. If you are placing information that contains characters that
> fall into the Unicode C0 and C1 (control) sets, then you *must* encode
> that information, because those characters are *illegal*.
>
> Note that if you are using a character set encoding of some sort that
> happens to re-use the bit patterns of C0 for some printable unicode
> characters, that's legal (I can't think of any character set that does
> this, though; perhaps EBCDIC? Almost every other character set uses the
> ASCII bit patterns for the bottom seven bits, and Unicode uses the same
> mapping (ASCII is a proper subset of Unicode (at least of UTF-8; things
> are more complex with UTF-16, UCS-2, and the like))).

Thanks for the education! I'll be saving this mail.

> Did I close all my parens? My mail editor doesn't do paren matching ....

Yup, my editor thinks you did anyway :-).

-Karl

Received on Wed Jul 14 19:32:21 2004

This is an archived mail posted to the Subversion Dev mailing list.