
Re: charset neutral? pls solve this

From: Stephen C. Tweedie <sct_at_redhat.com>
Date: 2002-05-31 23:59:24 CEST


On Fri, May 31, 2002 at 02:25:41PM -0700, Greg Stein wrote:

> [ but I will admit that while I know enough about the issues to move past
> the "knows enough to be dangerous" :-), I'll definitely defer to you and
> Marcus and some of the others on this list who have shown a much higher
> level of knowledge in this area. ]

> On Fri, May 31, 2002 at 09:57:19PM +0100, Stephen C. Tweedie wrote:
> > In other words, it's just wide-char encodings such as UCS-2 that need
> > to be avoided from that point of view.
> Yup. And that UCS-2 was part of my example. And on the Windows platform,
> UCS-2 is the standard encoding for characters, so it isn't really all that
> theoretical (well, once you get past the apparent NUL values in there and
> being okay with casting wchar_t* to char* :-)

This is now getting into "knows enough to be dangerous" territory.

We're at the very heart of the matter here. Windows does NOT use
UCS-2 universally as its encoding. It uses it as the *internal*
encoding. It uses it where it's manipulating character strings
internally, and where it knows that it actually has to do specific
things to those strings. It matters in such cases that we know
exactly where all characters begin and end, and it matters that we can
distinguish between large numbers of distinct chars.
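The NUL issue Greg mentions is easy to demonstrate. As a sketch (Python here purely for illustration; the mail itself contains no code), the "utf-16-le" codec produces the same byte layout as Windows' internal UCS-2 for BMP characters:

```python
# UCS-2/UTF-16-LE wide-char bytes for a plain ASCII string.
internal = "abc".encode("utf-16-le")   # b'a\x00b\x00c\x00'

# Embedded NULs: casting such a buffer to char* would truncate it.
assert b"\x00" in internal

# Every BMP character occupies exactly 2 bytes.
assert len(internal) == 2 * len("abc")
```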

However, on disk, in simple notepad text documents or in emails or
whatever, Windows is not necessarily using UCS-2. It's often using a
local charset, or even UTF-8. The external encodings do not have to
have anything to do with the internal ones. By "internal" and
"external" I mean internal to the software or not; byte-strings
written to disk or sent over the wire both count as external. Char or
wchar arrays in the source are internal.
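To illustrate the internal/external split (a hedged sketch in Python, not anything from Subversion itself): the same internal string of characters can be externalised in whichever byte encoding is wanted:

```python
# Internal representation: abstract characters, encoding-agnostic.
internal = "Grüße"

# External representations: byte strings in two different encodings.
as_local = internal.encode("latin-1")  # 8-bit local charset: 5 bytes
as_utf8 = internal.encode("utf-8")     # UTF-8: 7 bytes (2 each for ü, ß)

# Different bytes on the wire; same characters inside the software.
assert as_local != as_utf8
assert as_local.decode("latin-1") == as_utf8.decode("utf-8") == internal
```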

UCS-2 may be used for filenames on disk, but the documents themselves
use localised encodings, not UTF-8. Your average KOI-8 user *hates*
UTF-8 because every Cyrillic character in that charset ends up with a
2-byte UTF-8 representation, doubling the size of their text. That's
fine: we can carry the characters internally in the software as UCS-2
(i.e. 16-bit wchar_t), and write to disk using whatever local encoding
is wanted (8-bit KOI-8, for example).
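The size difference is easy to check (an illustrative Python sketch, not part of the original discussion):

```python
word = "мир"                 # three Cyrillic letters
koi8 = word.encode("koi8_r") # local 8-bit charset: 1 byte per letter
utf8 = word.encode("utf-8")  # UTF-8: 2 bytes per Cyrillic letter

assert len(koi8) == 3
assert len(utf8) == 6        # double the size of the KOI-8 text
```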

UCS-2 is really useful --- almost essential --- if you are
manipulating Unicode characters. So, if svn is reformatting strings
to word-wrap, or is translating between encodings, it really does want
to be using UCS-2 for that.

But if all you are doing is storing strings verbatim, it *really*
doesn't matter.

Email is a good model here. It passes text through 8-bit clean, using
mechanisms like base64 encoding to guarantee that where the transport
is not. It also, these days, passes along a charset label.
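The base64 trick can be sketched like this (Python for illustration only; any base64 implementation behaves the same way):

```python
import base64

# An 8-bit KOI-8 message body.
body = "Привет".encode("koi8_r")

# Base64 turns arbitrary 8-bit data into 7-bit-safe ASCII for transport...
wire = base64.b64encode(body)
assert wire.isascii()

# ...and the receiver gets the exact original bytes back, untouched.
assert base64.b64decode(wire) == body
```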

Now, passing email between two Cyrillic users using KOI-8, there are
just no charset translation issues. Email passes it 8-bit clean. The
consoles of the users at both ends of the email are using KOI-8, and
they both see their Cyrillic characters come out nicely.

But when a Western, Latin-1-based user reads that email on their
8-bit console, they see garbage. If they want to read the email, then
either they need to change their locale (font etc.) to KOI-8, or they
need to use a Unicode-based (i.e. UCS-2 internally) mail reader. UTF-8
does not come into it, *anywhere*. We don't force multi-byte
character sequences on anybody if the users concerned are each just
using their own 8-bit charsets. And if one particular user _does_
want to use characters from multiple charsets, then their email client
can pass in a UTF-8 encoding, and because UTF-8 travels as plain
8-bit data, the rest of the email universe doesn't have to know a
thing about it.
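The garbage-on-the-wrong-console effect is just a decode with the wrong charset; a minimal sketch (Python, illustrative only):

```python
koi8_bytes = "Привет".encode("koi8_r")

# A Latin-1 console maps every byte to *some* glyph, so the bytes
# "display" -- but as nonsense accented characters, not Cyrillic.
wrong = koi8_bytes.decode("latin-1")

# A Unicode-aware reader that is told the charset recovers the text.
right = koi8_bytes.decode("koi8_r")

assert wrong != right
assert right == "Привет"
```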

In other words, if you just store 8-bit data plus a charset encoding,
then right now, everything just continues to work as it always has;
and in the future, internationalised clients will be able to parse and
encode things quite happily.

Pretty much the only advantage you get if you force all strings
internally to UTF-8 is that when a client comes to translate one
charset to another, it doesn't have to know anything about the
encoding used by the original user when submitting the string in the
first place. But then, it still has to know about that charset to
display it, so that's really not much of a win.
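The store-verbatim-plus-label scheme described above can be sketched as follows (hypothetical names, Python for illustration; this is not Subversion's actual API):

```python
# The repository stores the submitted bytes untouched, plus the charset.
stored = {"bytes": "Привет".encode("koi8_r"), "charset": "koi8_r"}

def reencode(entry, target):
    """A converting client decodes with the recorded charset, then
    encodes to its own -- it has to know both charsets either way."""
    text = entry["bytes"].decode(entry["charset"])
    return text.encode(target)

assert reencode(stored, "utf-8") == "Привет".encode("utf-8")
assert reencode(stored, "koi8_r") == stored["bytes"]
```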


To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Sat Jun 1 14:10:27 2002

This is an archived mail posted to the Subversion Dev mailing list.