
RE: UTF-8 (was: Re: property names)

From: Bill Tutt <billtut_at_microsoft.com>
Date: 2000-12-22 01:12:33 CET

Not that this is a "how best to localize software" mailing list, but...

UTF-8 is only compact if you're a western European type.
CJK (Chinese, Japanese, and Korean) text fares badly in UTF-8: most of
those characters take three bytes each, versus two in UTF-16.

CJK will actually need UTF-16 surrogate pairs in order to cover all of
their character space.
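
To make the size argument concrete, here is a rough Python sketch that
counts encoded bytes for a few strings. The helper name encoded_sizes and
the sample strings are made up for illustration; only the standard
str.encode codecs are assumed.

```python
def encoded_sizes(text: str) -> dict:
    """Return the code point count and encoded byte counts for text."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        # "utf-16-le" is used so that no byte order mark is counted.
        "utf16_bytes": len(text.encode("utf-16-le")),
    }


# Hypothetical samples: an ASCII word, a katakana word, and one Han
# character from outside the Basic Multilingual Plane (U+20000).
samples = {
    "english": "property",
    "japanese": "\u30d7\u30ed\u30d1\u30c6\u30a3",
    "rare_han": "\U00020000",
}

for name, text in samples.items():
    print(name, encoded_sizes(text))
```

The ASCII word is half the size in UTF-8, the katakana word is a third
smaller in UTF-16, and the U+20000 character needs four bytes either way
(a surrogate pair in UTF-16).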

If you think that still takes up too much room, go take a look at:
A Standard Compression Scheme for Unicode
 http://www.unicode.org/unicode/reports/tr6/

This describes an encoding stream that gives UTF-8-like storage
performance for western Europeans, while still allowing non-western-European
text to be stored in single bytes by using a simple window-shifting
concept.
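
The window idea can be sketched roughly as follows. This is a toy scheme
for illustration only and is NOT the byte format defined in TR6: the tag
byte SHIFT, the fixed 128-character window, and both function names are
assumptions made up here.

```python
SHIFT = 0x01  # hypothetical tag byte: the next two bytes select a new window


def toy_window_encode(text: str) -> bytes:
    """Encode text with one active 128-character window (toy scheme)."""
    out = bytearray()
    window = None  # base code point of the active window, if any
    for ch in text:
        cp = ord(ch)
        if cp == SHIFT:
            raise ValueError("U+0001 is reserved as the tag in this toy scheme")
        if cp < 0x80:
            out.append(cp)  # ASCII passes through as a single byte
            continue
        base = cp & ~0x7F  # start of the 128-character block holding cp
        if base != window:
            window = base
            out.append(SHIFT)
            out += (base >> 7).to_bytes(2, "big")  # window index, 2 bytes
        out.append(0x80 | (cp - base))  # one byte: offset inside the window
    return bytes(out)


def toy_window_decode(data: bytes) -> str:
    """Invert toy_window_encode."""
    chars = []
    window = None
    i = 0
    while i < len(data):
        b = data[i]
        if b == SHIFT:
            window = int.from_bytes(data[i + 1:i + 3], "big") << 7
            i += 3
        elif b < 0x80:
            chars.append(chr(b))
            i += 1
        else:
            chars.append(chr(window + (b & 0x7F)))
            i += 1
    return "".join(chars)


russian = "Привет, мир"
packed = toy_window_encode(russian)
print(len(packed), len(russian.encode("utf-8")), len(russian.encode("utf-16-le")))
```

Once the window is set, each Cyrillic letter costs one byte, so a run of
text in one small script approaches one byte per character. A script
spread over many 128-character blocks would pay the three-byte window
switch often, so this toy gains little there.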

UTF-16 strikes a nice balance: it is easy to deal with (95% fixed width),
doesn't take up 32 whole bits per character (as UCS-4 does), and doesn't
unduly penalize CJK and other Asian languages (as UTF-8 does).

Bill
Received on Sat Oct 21 14:36:18 2006

This is an archived mail posted to the Subversion Dev mailing list.
