> From: Stephen C. Tweedie [mailto:firstname.lastname@example.org]
> On Fri, May 31, 2002 at 02:25:41PM -0700, Greg Stein wrote:
> > [ but I will admit that while I know enough about the issues to be
> > in the "knows enough to be dangerous" camp :-), I'll definitely defer
> > to Marcus and some of the others on this list who have shown a much
> > deeper level of knowledge in this area. ]
> > On Fri, May 31, 2002 at 09:57:19PM +0100, Stephen C. Tweedie wrote:
> > > In other words, it's just wide-char encodings such as UCS-2 that
> > > need to be avoided from that point of view.
> > Yup. And that UCS-2 was part of my example. And on the Windows
> > platform, UCS-2 is the standard encoding for characters, so it isn't
> > really theoretical (well, once you get past being okay with the
> > apparent NUL values when casting wchar_t* to char* :-)
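(Those NUL values can be made concrete; here is a minimal sketch in Python
rather than C, just to show the byte layout a C strlen() would walk over:)

```python
# "Hi" encoded as UCS-2/UTF-16-LE: every ASCII character gains a 0x00 high byte.
wide = "Hi".encode("utf-16-le")
assert wide == b"H\x00i\x00"

# A C strlen() scanning these bytes stops at the first NUL -- after one byte,
# so casting wchar_t* to char* and using the usual string functions breaks.
c_strlen = wide.index(b"\x00")
assert c_strlen == 1
```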
> This is now getting into "knows enough to be dangerous" territory.
> We're at the very heart of the matter here. Windows does NOT use
> UCS-2 universally as its encoding. It uses it as the *internal*
> encoding: it uses it where it's manipulating character strings
> internally, and where it knows that it actually has to do specific
> things to those strings. It matters in such cases that we know
> exactly where all characters begin and end, and it matters that we can
> distinguish between large numbers of distinct chars.
This might (or might not) be true of the NT-based kernels; it doesn't
hold for the Win9x line, which is ANSI-based internally.
> However, on disk, in simple notepad text documents or in emails or
> whatever, Windows is not necessarily using UCS-2.
Ok, this is just going to confuse people.
The Win32 API set supports two kinds of strings: ANSI strings
(i.e. OEM, Latin-1, KOI-8, Shift-JIS, Big5, etc.) and Unicode
strings (represented as UTF-16 as of Win2k, or UCS-2 prior to Win2k).
Older applications implemented on top of Win32 tend to use only the ANSI
APIs, or they use the Unicode APIs but in a pattern that isn't really
Unicode character set safe, because they then convert back to some ANSI
character set encoding before storing their data wherever it ends up.
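(A sketch of why that convert-back-to-ANSI pattern is lossy. This uses
Python's cp1252 codec with errors="replace" as a stand-in for the Win32
conversion's default-character behavior; cp1252 is just an example code
page, not anything a particular application necessarily uses:)

```python
# Greek text: perfectly representable in UCS-2/UTF-16, but outside
# the Windows-1252 code page.
name = "αβγ"

# Converting back to an ANSI code page before storage, as those older
# applications do, substitutes a default character for anything unmappable.
ansi = name.encode("cp1252", errors="replace")
assert ansi == b"???"                  # the characters did not survive
assert ansi.decode("cp1252") != name   # the round trip is lossy
```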
> It's often using a
> local charset, or even UTF-8. The external encodings do not have to
> have anything to do with the internal ones. By "internal" and
> "external" I mean internal to the software or not; byte-strings
> written to disk or sent over the wire both count as external. Char or
> wchar arrays in the source are internal.
> UCS-2 may be used for filenames on disk, but the documents themselves
> use localised encodings, not UTF-8.
That's not entirely true. Some applications do use byte-ordered
UTF-16/UCS-2 output as their external storage mechanism. Relational
databases that want to have an internationalized character type
certainly make their lives much easier if they support storing in
UCS-2/UTF-16. (e.g. SQL Server 7, SQL Server 2k, etc..) Some
applications even have UTF-8 output. (e.g. Visual Studio.Net).
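(For external storage the byte order has to be pinned down, which is
why such files usually lead with a byte-order mark; a small sketch of
what those external forms look like on the wire:)

```python
import codecs

text = "für"

# UTF-16-LE external storage conventionally starts with the BOM FF FE,
# which is how a reader knows the byte order of what follows.
utf16 = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
assert utf16.startswith(b"\xff\xfe")

# The same text as UTF-8 output: no NUL bytes and no byte-order issue.
utf8 = text.encode("utf-8")
assert utf8 == b"f\xc3\xbcr"
```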
> UCS-2 is really useful --- almost essential --- if you are
> manipulating Unicode characters. So, if svn is reformatting strings
> to word-wrap, or is translating between encodings, it really does want
> to be using UCS-2 for that.
A big +20 to that. UCS-2/UTF-16 is so much easier to process than UTF-8.
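(The "easier to process" point comes down to indexing. For BMP text,
UCS-2/UTF-16 code unit N sits at a fixed byte offset, while in UTF-8
byte index and character index diverge as soon as any multi-byte
character appears; a quick sketch:)

```python
text = "héllo"
u8 = text.encode("utf-8")
u16 = text.encode("utf-16-le")

# UCS-2/UTF-16 (for BMP text): code unit N occupies bytes 2N..2N+1,
# so the third character can be addressed directly.
assert u16[2*2:2*2+2] == "l".encode("utf-16-le")

# UTF-8: 5 characters take 6 bytes here, so byte 2 is not character 2 --
# it's still part of the two-byte sequence for the accented character.
assert len(u8) == 6 and len(text) == 5
assert u8[2:3] != b"l"
```

(Once you're past the BMP, UTF-16 needs surrogate pairs too, so the
fixed-offset property only holds for UCS-2-representable text.)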
> But if all you are doing is storing strings verbatim, it *really*
> doesn't matter.
Textual properties in data stores, especially if you migrate SVN's
datastore into a real relational database, are much easier to search if
they're not stored in a variable-width encoding.
I've always been in favor of storing data in UCS-2/UTF-16 because it's
just so much easier to manage/deal with. But then again, I'm mostly
using a platform (Win32) that doesn't usually use UTF-8 storage.
Received on Sat Jun 1 14:10:19 2002