Re: charset neutral? pls solve this

From: Stephen C. Tweedie <sct_at_redhat.com>
Date: 2002-05-31 22:57:19 CEST

Hi,

On Fri, May 31, 2002 at 01:51:37PM -0700, Greg Stein wrote:
> Something that Karl mentioned got me thinking, and made me realize that we
> *already* have a bug due to charset issues. If we are charset neutral, then
> I see no possible way to solve this:
>
> In the log message output, we count the number of newlines, and display
> that count (see clients/cmdline/log-cmd.c::num_lines).
>
>
> Without knowing the charset, it cannot know that a '\n' or '\r' byte is
> actually a newline. Maybe that is one half of a UCS-2 encoding of a
> character? Maybe it is part of a shifted character in Shift-JIS or the like.

Charset != encoding. UTF-8 is an encoding, not a charset, and as long
as we agree that we'll only be using encodings such as KOI-8 or UTF-8
which protect control characters (a '\n' or '\r' byte never appears as
part of another character), this isn't a problem. (Of course, some
encodings restrict the charsets you can use more than others.)
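
A minimal sketch of that point, in Python rather than the C of the
actual client: it counts newline bytes without decoding anything, which
is roughly what a byte-oriented line counter does. The function name
count_newline_bytes and the sample strings are made up for illustration
and are not taken from Subversion.

```python
def count_newline_bytes(data: bytes) -> int:
    """Count 0x0A bytes without decoding, as a byte-oriented client would."""
    return data.count(b"\n")


# Two lines of Cyrillic text ("Log" twice), each ending in a real newline.
text = "\u041b\u043e\u0433\n\u041b\u043e\u0433\n"

# In encodings that protect control characters, the byte count matches
# the number of newline characters in the decoded text.
for encoding in ("utf-8", "koi8_r"):
    raw = text.encode(encoding)
    assert count_newline_bytes(raw) == text.count("\n")

# Counterexample with a wide encoding: U+010A is a letter, not a newline,
# but UTF-16BE (like UCS-2) stores it as the two bytes 0x01 0x0A.
wide = "\u010a".encode("utf-16-be")
assert wide == b"\x01\x0a"
assert count_newline_bytes(wide) == 1  # a false newline
```

The same letter encoded as UTF-8 contains no 0x0A byte at all, so the
byte count stays correct there.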

> At a minimum, I'd like to at least use this datapoint as a way to
> demonstrate that charset neutral just really isn't a good option.

But requiring that control chars are respected really doesn't restrict
the encodings much, because pretty much all the current encodings do
that. If you say "we're charset-neutral", you're just saying that the
client is doing the encoding for you before it ever gets to svn.

It will still work --- there are *tons* of weird and wonderful
encodings used in email these days (as moderator of the ext3-users
list I get the job of rejecting all the BIG-5 or Korean-encoded spams
that attempt to get onto that list), and email still works. And you
can bet that SMTP relies on \n being a protected character.

In other words, it's just wide-char encodings such as UCS-2 that need
to be avoided from that point of view. Other than that, knowing the
encoding is enough to extract the appropriate chars. Given the data
and the encoding, you can convert between the stored encoding and
UTF-8, and from that to pretty much anything you want.
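
As a rough illustration of that conversion path, assuming the encoding
of the stored bytes is known: the helper below decodes the data, pivots
through UTF-8, and re-encodes to a target. The name transcode and the
choice of KOI8-R and CP1251 are assumptions for the example only, not
anything svn does.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes from a known source encoding to a target encoding.

    UTF-8 is used as the explicit intermediate form to mirror the
    "stored encoding -> UTF-8 -> anything else" path described above.
    """
    utf8 = data.decode(source).encode("utf-8")
    return utf8.decode("utf-8").encode(target)


# Cyrillic "Log" plus a newline, stored as KOI8-R bytes.
stored = "\u041b\u043e\u0433\n".encode("koi8_r")

as_utf8 = transcode(stored, "koi8_r", "utf-8")
as_cp1251 = transcode(stored, "koi8_r", "cp1251")

# Both results decode to the same text, and the conversion is reversible
# because every character here exists in all three encodings.
assert as_utf8.decode("utf-8") == as_cp1251.decode("cp1251")
assert transcode(as_cp1251, "cp1251", "koi8_r") == stored
```

The round trip only holds when the target can represent every character
in the data; converting to a narrower encoding would raise an error in
this sketch rather than silently lose text.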

--Stephen

Received on Sat Jun 1 14:10:50 2002

This is an archived mail posted to the Subversion Dev mailing list.