> From: Stephen C. Tweedie [mailto:firstname.lastname@example.org]
> On Fri, May 31, 2002 at 02:25:41PM -0700, Greg Stein wrote:
> > [ but I will admit that while I know enough about the issues to be
> > in the "knows enough to be dangerous" camp :-), I'll definitely defer
> > to Marcus and some of the others on this list who have shown a much
> > deeper level of knowledge in this area. ]
> > On Fri, May 31, 2002 at 09:57:19PM +0100, Stephen C. Tweedie wrote:
> > > In other words, it's just wide-char encodings such as UCS-2 that
> > > need to be avoided from that point of view.
> > Yup. And that UCS-2 was part of my example. And on the Windows
> > platform, UCS-2 is the standard encoding for characters, so it isn't
> > really theoretical (well, once you get past being okay with the
> > apparent NUL values when casting wchar_t* to char* :-)
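(Those NUL values can be made concrete; here is a minimal sketch in Python
rather than C, just to show the byte layout a C strlen() would walk over:)

```python
# "Hi" encoded as UCS-2/UTF-16-LE: every ASCII character gains a 0x00 high byte.
wide = "Hi".encode("utf-16-le")
assert wide == b"H\x00i\x00"

# A C strlen() scanning these bytes stops at the first NUL -- after one byte,
# so casting wchar_t* to char* and using the usual string functions breaks.
c_strlen = wide.index(b"\x00")
assert c_strlen == 1
```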
> This is now getting into "knows enough to be dangerous" territory.
> We're at the very heart of the matter here. Windows does NOT use
> UCS-2 universally as its encoding. It uses it as the *internal*
> encoding: it uses it where it's manipulating character strings
> internally, and where it knows that it actually has to do specific
> things to those strings. It matters in such cases that we know
> exactly where all characters begin and end, and it matters that we can
> distinguish between large numbers of distinct chars.
This might (or might not) be true of the NT-based kernels; it doesn't
hold for the Win9x line, which is ANSI-based internally.
> However, on disk, in simple notepad text documents or in emails or
> whatever, Windows is not necessarily using UCS-2.
Ok, this is just going to confuse people.
The Win32 API set supports two kinds of strings: ANSI strings
(i.e. OEM, Latin-1, KOI-8, Shift-JIS, Big5, etc.) and Unicode
strings (represented as UTF-16 as of Win2k, or UCS-2 prior to Win2k).
Older applications implemented on top of Win32 tend to use only the ANSI
APIs, or they use the Unicode APIs but in a pattern that isn't really
Unicode character set safe, because they then convert back to some ANSI
character set encoding before storing their data wherever it ends up.
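(A sketch of why that convert-back-to-ANSI pattern is lossy. This uses
Python's cp1252 codec with errors="replace" as a stand-in for the Win32
conversion's default-character behavior; cp1252 is just an example code
page, not anything a particular application necessarily uses:)

```python
# Greek text: perfectly representable in UCS-2/UTF-16, but outside
# the Windows-1252 code page.
name = "αβγ"

# Converting back to an ANSI code page before storage, as those older
# applications do, substitutes a default character for anything unmappable.
ansi = name.encode("cp1252", errors="replace")
assert ansi == b"???"                  # the characters did not survive
assert ansi.decode("cp1252") != name   # the round trip is lossy
```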
> It's often using a
> local charset, or even UTF-8. The external encodings do not have to
> have anything to do with the internal ones. By "internal" and
> "external" I mean internal to the software or not; byte-strings
> written to disk or sent over the wire both count as external. Char or
> wchar arrays in the source are internal.
> UCS-2 may be used for filenames on disk, but the documents themselves
> use localised encodings, not UTF-8.
That's not entirely true. Some applications do use byte-ordered
UTF-16/UCS-2 output as their external storage mechanism. Relational
databases that want to have an internationalized character type
certainly make their lives much easier if they support storing in
UCS-2/UTF-16. (e.g. SQL Server 7, SQL Server 2k, etc..) Some
applications even have UTF-8 output. (e.g. Visual Studio.Net).
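(For external storage the byte order has to be pinned down, which is
why such files usually lead with a byte-order mark; a small sketch of
what those external forms look like on the wire:)

```python
import codecs

text = "für"

# UTF-16-LE external storage conventionally starts with the BOM FF FE,
# which is how a reader knows the byte order of what follows.
utf16 = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
assert utf16.startswith(b"\xff\xfe")

# The same text as UTF-8 output: no NUL bytes and no byte-order issue.
utf8 = text.encode("utf-8")
assert utf8 == b"f\xc3\xbcr"
```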
> UCS-2 is really useful --- almost essential --- if you are
> manipulating Unicode characters. So, if svn is reformatting strings
> to word-wrap, or is translating between encodings, it really does want
> to be using UCS-2 for that.
A big +20 to that. UCS-2/UTF-16 is so much easier to process than UTF-8.
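(The "easier to process" point comes down to indexing. For BMP text,
UCS-2/UTF-16 code unit N sits at a fixed byte offset, while in UTF-8
byte index and character index diverge as soon as any multi-byte
character appears; a quick sketch:)

```python
text = "héllo"
u8 = text.encode("utf-8")
u16 = text.encode("utf-16-le")

# UCS-2/UTF-16 (for BMP text): code unit N occupies bytes 2N..2N+1,
# so the third character can be addressed directly.
assert u16[2*2:2*2+2] == "l".encode("utf-16-le")

# UTF-8: 5 characters take 6 bytes here, so byte 2 is not character 2 --
# it's still part of the two-byte sequence for the accented character.
assert len(u8) == 6 and len(text) == 5
assert u8[2:3] != b"l"
```

(Once you're past the BMP, UTF-16 needs surrogate pairs too, so the
fixed-offset property only holds for UCS-2-representable text.)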
> But if all you are doing is storing strings verbatim, it *really*
> doesn't matter.
Textual properties in data stores, especially if you migrate SVN's
datastore into a real relational database, are much easier to search if
they're not stored in a variable-width encoding.
I've always been in favor of storing data in UCS-2/UTF-16 because it's
just so much easier to manage/deal with. But then again, I'm mostly
using a platform (Win32) that doesn't usually use UTF-8 storage.
Received on Sat Jun 1 14:10:19 2002