
RE: Re: charset neutral? pls solve this

From: Bill Tutt <rassilon_at_lyra.org>
Date: 2002-06-03 07:30:14 CEST

> From: Marcus Comstedt [mailto:marcus@mc.pp.se]
>
> Branko Čibej <brane@xbc.nu> writes:
>
> > You're mixing apples and oranges. UCS-2 indeed can't encode the
> > whole range. UTF-16 can. They're not the same.
>
> Nope. You are wrong. Sorry. And I'm not mixing them to any further
> extent than to put them both into the "partial ISO-10646 salad".
>
> UTF-16 can encode 65536-2048+1048576 = 1112064 characters.
>
> The whole ISO-10646 range is 2147483648 characters, so UTF-16 only
> covers about 0.05%.
>
> Thus, unfortunately neither the apples nor the oranges are
> sufficient. The bananas and the pears (UCS-4 and UTF-8) are though.
>
>

The "whole range" of ISO 10646 that will ever be assigned is exactly the
range UTF-16 can handle.

From the Unicode FAQ at
http://www.unicode.org/unicode/faq/utf_bom.html#9:
----------------------------------
Will UTF-16 ever be extended to more than a million characters?

A: As stated, the goal of Unicode is not to encode glyphs, but
characters. Over a million possible codes is far more than enough for
this goal. Unicode is *not* designed to encode arbitrary data. If you
wanted, for example, to give each "instance of a character on paper
throughout history" its own code, you might need trillions or
quadrillions of such codes; noble as this effort might be, you would not
use Unicode for such an encoding. No proposed extensions of UTF-16 to
more than 2 surrogates has a chance of being accepted into the Unicode
Standard or ISO/IEC 10646. Furthermore, both Unicode and ISO 10646 have
policies in place that formally limit even the UTF-32 encoding form to
the integer range that can be expressed with UTF-16 (or 21 significant
bits).
----------------------------------

So Branko is indeed correct that UTF-16 can encode everything.
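
For anyone who wants to check the numbers, here is a minimal Python sketch
of the arithmetic in this thread. The constant names are mine and purely
illustrative; the figures are the ones quoted above. It only shows that
Marcus's count is right for the old 31-bit code space, and that the
highest code point UTF-16 reaches (U+10FFFF) fits in the 21 bits the FAQ
mentions.

```python
# Figures from the thread; the names are illustrative, not from any standard.
BMP_CODE_UNITS = 0x10000        # 65536 possible 16-bit values
SURROGATES = 0x800              # 2048 values (D800..DFFF) reserved for pairs
SUPPLEMENTARY = 0x400 * 0x400   # 1024 high x 1024 low surrogates = 1048576

# Code points UTF-16 can represent: 65536 - 2048 + 1048576
utf16_count = BMP_CODE_UNITS - SURROGATES + SUPPLEMENTARY

# The original 31-bit ISO 10646 code space
ucs4_31bit_count = 2 ** 31

# Share of the 31-bit space that UTF-16 covers, in percent
share_percent = 100 * utf16_count / ucs4_31bit_count

# Highest code point reachable with a surrogate pair, and its width in bits
highest_utf16_code_point = 0x10FFFF
significant_bits = highest_utf16_code_point.bit_length()

print(utf16_count, ucs4_31bit_count, round(share_percent, 3), significant_bits)
```

So the 0.05% figure is correct against the 31-bit space; the point of the
FAQ is that the space itself is being cut down to what UTF-16 covers.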

To add to the above, I wrote the following to a different mailing list
back in April 2000:
-----
On 2000-03-07, the Unicode Technical Committee submitted document "N2175"
to ISO/IEC JTC1/SC2/WG2, titled:
"Proposal to restrict the range of code positions to the values up to
U-0010FFFF"

Summary:
The Unicode consortium requests a remedy to this situation: the
publication of a technical corrigendum to ISO/IEC 10646-1:2000 which
excludes values above U-0010FFFF. In this corrigendum,

* The private use characters from U-60000000 to U-7F000000 and from
U-00E00000 to U-00FFFFFF would be removed from the standard.
 
* A note would be added stating that for interoperability between UTF-8,
UTF-16 and UCS-4, it is not expected that any code positions will ever be
allocated above U-0010FFFF.

URL: http://anubis.dkuug.dk/jtc1/sc2/wg2/docs/n2175.htm

On 2000-03-24 WG2 accepted the proposal:
(from the minutes of the meeting:
http://anubis.dkuug.dk/jtc1/sc2/wg2/docs/n2204.doc)

WG2 accepts the proposal in document N2175 towards removing the provision
for Private Use Groups and Planes beyond Plane 16 in ISO/IEC 10646, to
ensure internal consistency in the standard between UCS-4, UTF-8 and
UTF-16 encoding formats, and instructs its project editor prepare
suitable text for processing as a future Technical Corrigendum or an
Amendment to 10646-1:2000.
-------
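
To make the interoperability point concrete, here is a small Python
sketch, offered only as an illustration. It encodes the last code
position, U+10FFFF, in UTF-8, UTF-16 and UTF-32 (UCS-4) using Python's
built-in codecs, and computes the surrogate pair by hand. The helper
to_surrogate_pair is a hypothetical name of mine, not part of any library.
Python's chr() happens to enforce the same upper limit, which the last few
lines check.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16 pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go in the low surrogate
    return high, low


LAST = 0x10FFFF                          # highest code position after N2175
ch = chr(LAST)

utf8 = ch.encode("utf-8")                # 4 bytes
utf16 = ch.encode("utf-16-be")           # one surrogate pair, 4 bytes
utf32 = ch.encode("utf-32-be")           # one 32-bit unit, 4 bytes
pair = to_surrogate_pair(LAST)

# Python refuses to create a character above U+10FFFF.
try:
    chr(LAST + 1)
    beyond_is_rejected = False
except ValueError:
    beyond_is_rejected = True

print(utf8.hex(), utf16.hex(), utf32.hex(), [hex(u) for u in pair])
```

All three encodings represent U+10FFFF, and the hand-computed pair matches
the bytes the UTF-16 codec produces.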

FYI,
Bill

Received on Mon Jun 3 07:30:37 2002

This is an archived mail posted to the Subversion Dev mailing list.