Re: UTF-8 (was: Re: property names)

From: Karl Fogel <kfogel_at_galois.collab.net>
Date: 2000-12-21 22:31:43 CET

Mo DeJong <mdejong@cygnus.com> writes:
> Forget about the C code, what about the memory? A 1000 byte file
> requires 2000 bytes of memory in a unicode representation. If
> each character required 32 bits of memory, a 1 meg file would
> require 4 megs of system memory. That is just crazy!
>
> Don't forget about the network transfer time either. Why would
> anyone want to transfer 'a' as a 32 bit number over a network?
>
> Using UTF-8 is a great solution since existing 8-bit
> character sets require only 8 bits of system memory
> to store them.

Just a terminology clarification, here (this is how it was explained
to me, but I Am Not An Expert):

"Unicode" is a system that maps characters<-->numbers, independently
of how those numbers are represented.

"UTF-*" are systems for mapping numbers<-->bit representations, of
which UTF-8 is probably the most efficient for our purposes.

In other words, there is no limit on the size of the Unicode character
set, but every time they add characters past a certain boundary, the
UTF encodings need to be updated so people know how to encode the new
ranges.

So Mo, you're complaining not about "unicode" per se, but about its
UTF-16 and UTF-32 encodings, and I quite agree with you. :-)
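
To make the distinction concrete, here is a minimal Python sketch (not
part of the original thread; `encoded_sizes` is a hypothetical helper
name chosen for illustration). It shows that the Unicode number for a
character is one thing, and the number of bytes it occupies depends
only on which UTF encoding is chosen:

```python
def encoded_sizes(text):
    """Return how many bytes `text` occupies in each UTF encoding.

    The "-le" codec variants are used so that Python does not prepend
    a byte-order mark, which would inflate the counts.
    """
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {name: len(text.encode(name)) for name in encodings}


# The Unicode number (code point) for 'a' is the same in every encoding.
print(ord("a"))

# One ASCII character: 1 byte in UTF-8, 2 in UTF-16, 4 in UTF-32.
print(encoded_sizes("a"))

# A 1000-character ASCII "file", as in Mo's example.
print(encoded_sizes("a" * 1000))
```

For pure ASCII text the UTF-8 size equals the character count, which is
the property Mo is pointing at. Non-ASCII characters take more than one
byte in UTF-8, so the ratio depends on the text.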

-K
Received on Sat Oct 21 14:36:18 2006

This is an archived mail posted to the Subversion Dev mailing list.