
Re: charset neutral? pls solve this

From: Greg Stein <gstein_at_lyra.org>
Date: 2002-05-31 23:25:41 CEST

Hey Stephen,

On Fri, May 31, 2002 at 09:57:19PM +0100, Stephen C. Tweedie wrote:
> On Fri, May 31, 2002 at 01:51:37PM -0700, Greg Stein wrote:
>...
> > Without knowing the charset, it cannot know that a '\n' or '\r' byte is
> > actually a newline. Maybe that is one half of a UCS-2 encoding of a
> > character? Maybe it is part of a shifted character in Shift-JIS or the like.
>
> Charset != encoding.

Yes. I've been conflating them simply to avoid complexity.

[ but I will admit that while I know enough about the issues to have moved
  past the "knows enough to be dangerous" stage :-), I'll definitely defer to
  you and Marcus and some of the others on this list who have shown a much
  higher level of knowledge in this area. ]

> UTF-8 is an encoding, not a charset, and as long
> as we agree that we'll only be using encodings such as KOI-8 or UTF-8
> which protect control characters, this isn't a problem. (Of course,
> some encodings restrict the charsets you can use more than others.)

Sure. And part of my push for UTF-8 is the recognition that a lot of our
code which assumes US-ASCII can simply continue to work.
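
To illustrate why US-ASCII-minded code keeps working on UTF-8, here is a
minimal Python sketch. The helper name split_lines_bytewise is hypothetical
and is not part of Subversion; it stands in for any code that scans for the
raw byte 0x0A without decoding first.

```python
def split_lines_bytewise(data: bytes) -> list:
    """Split on the raw byte 0x0A, as code written for US-ASCII would."""
    return data.split(b"\n")


text = "na\u00efve caf\u00e9\n\u0434\u0440\u0443\u0433\u0430\u044f\n\u65e5\u672c\u8a9e"
encoded = text.encode("utf-8")

# In UTF-8, every byte of a multi-byte sequence is >= 0x80, so a 0x0A byte
# in the stream can only be a real newline character.
lines = [chunk.decode("utf-8") for chunk in split_lines_bytewise(encoded)]

# Collect the bytes that belong to non-ASCII characters, to check the claim.
non_ascii_bytes = [
    b for ch in text if ord(ch) > 0x7F for b in ch.encode("utf-8")
]
```

Splitting the bytes first and decoding afterwards gives the same lines as
decoding first and splitting afterwards, which is the property the
ASCII-assuming code relies on.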

> > At a minimum, I'd like to at least use this datapoint as a way to
> > demonstrate that charset neutral just really isn't a good option.
>
> But requiring that control chars are respected really doesn't restrict
> the encodings much, because pretty much all the current encodings do
> that.

Fair enough, but I thought it important to show a specific example in our
current code which is /not/ encoding/charset neutral.

> If you say "we're charset-neutral", you're just saying that the
> client is doing the encoding for you before it ever gets to svn.

Well, to some extent the SVN libraries will be pretty neutral. They'll just
assume that everything passed to them is UTF-8 and leave it at that :-)

[ although the assumption *will* come into play when it goes to shove the
  text into an XML document; knowing it is utf-8 is handy ]

> It will still work --- there are *tons* of weird and wonderful
> encodings used in email these days (as moderator of the ext3-users
> list I get the job of rejecting all the BIG-5 or Korean-encoded spams
> that attempt to get onto that list), and email still works. And you
> can bet that SMTP relies on \n being a protected character.

Heh. Well, it can rely on \n because it also has restrictions about how to
pass data around. Yet, even poor implementations will work because, as you
point out, the \n happens to appear "as is" in most encodings.

> In other words, it's just wide-char encodings such as UCS-2 that need
> to be avoided from that point of view.

Yup. And that UCS-2 was part of my example. And on the Windows platform,
UCS-2 is the standard encoding for characters, so it isn't really all that
theoretical (well, once you get past the apparent NUL values in there and
being okay with casting wchar_t* to char* :-)
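
A small Python sketch of the UCS-2 problem, under one stated assumption:
Python has no codec named UCS-2, so "utf-16-le" is used as a stand-in,
which produces the same bytes for characters in the Basic Multilingual
Plane.

```python
# U+010A (LATIN CAPITAL LETTER C WITH DOT ABOVE) is a single character,
# but one of its two bytes in a 16-bit encoding is 0x0A.
c_dot = "\u010a"
wide = c_dot.encode("utf-16-le")      # stand-in for UCS-2 little-endian
narrow = c_dot.encode("utf-8")

# A byte-oriented scanner would see a "newline" that is not there.
false_newline_in_wide = b"\n" in wide
false_newline_in_utf8 = b"\n" in narrow

# Plain ASCII text picks up NUL bytes, which breaks C-style string handling.
ascii_wide = "A".encode("utf-16-le")
has_nul = b"\x00" in ascii_wide
```

Here the 16-bit form contains a spurious 0x0A byte and the ASCII letter
carries a NUL byte, while the UTF-8 form of the same character contains
neither.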

> Other than that, knowing the
> encoding is enough to extract the appropriate chars. Given the data
> and the encoding, you can convert between the stored charset and
> UTF-8, and from that to pretty much anything you want.

*nod*

And that is fine. s/charset/encoding/ in my statement, and we're still back
to "if you don't pass any additional data with that text, then you're SOL."
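
As a rough Python sketch of that last point: given the bytes and the name
of their encoding, conversion to UTF-8 is mechanical; given only the bytes,
a wrong guess decodes without any error and yields different text. The
function name to_utf8 is hypothetical, and KOI8-R and Latin-1 are chosen
only as example encodings.

```python
def to_utf8(data: bytes, encoding: str) -> bytes:
    """Re-encode bytes of a known encoding as UTF-8."""
    return data.decode(encoding).encode("utf-8")


original = "\u041f\u0440\u0438\u0432\u0435\u0442"   # Russian "Privet"
raw = original.encode("koi8-r")

# With the encoding label, the text round-trips correctly.
converted = to_utf8(raw, "koi8-r")

# Without the label, a guess such as Latin-1 still "succeeds", because
# every byte value is valid in Latin-1, but the result is a different string.
wrong_guess = raw.decode("latin-1")
```

Nothing in the raw bytes distinguishes the two readings, so the encoding
has to travel with the data.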

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/
Received on Sat Jun 1 14:10:45 2002

This is an archived mail posted to the Subversion Dev mailing list.
