
Re: UTF-8 (was: Re: property names)

From: Karl Fogel <kfogel_at_galois.collab.net>
Date: 2000-12-21 22:31:43 CET

Mo DeJong <mdejong@cygnus.com> writes:
> Forget about the C code, what about the memory? A 1000 byte file
> requires 2000 bytes of memory in a unicode representation. If
> each character required 32 bits of memory, a 1 meg file would
> require 4 megs of system memory. That is just crazy!
> Don't forget about the network transfer time either. Why would
> anyone want to transfer 'a' as a 32 bit number over a network?
> Using UTF-8 is a great solution since existing 8 bits
> character sets require only 8 bits of system memory
> to store them.

Just a terminology clarification, here (this is how it was explained
to me, but I Am Not An Expert):

"Unicode" is a system that maps characters<-->numbers, independently
of how those numbers are represented.

"UTF-*" are systems for mapping numbers<-->bit representations, of
which UTF-8 is probably the most efficient for our purposes.

In other words, there is no limit on the size of the Unicode character
set, but every time they add characters past a certain boundary, the
UTF encodings need to be updated so people know how to encode the new
characters.

So Mo, you're complaining not about "unicode" per se, but about its
UTF-16 and UTF-32 encodings, and I quite agree with you. :-)
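
To make the distinction concrete, here is a rough Python sketch (not
anything from the Subversion code; the helper name encoded_sizes is
made up for illustration). It takes the same sequence of Unicode
numbers and counts the bytes each encoding needs. The "-le" codec
variants are used only so that no byte-order mark is added to the
counts.

```python
def encoded_sizes(text):
    """Return the byte length of `text` under three UTF encodings.

    Hypothetical helper for illustration only. The "-le" codecs are
    chosen so that Python does not prepend a byte-order mark.
    """
    return {
        "code_points": len(text),                # Unicode numbers, no encoding
        "utf-8": len(text.encode("utf-8")),      # 1 to 4 bytes per character
        "utf-16": len(text.encode("utf-16-le")), # 2 or 4 bytes per character
        "utf-32": len(text.encode("utf-32-le")), # always 4 bytes per character
    }


# Mo's example: a 1000-character file of plain ASCII.
ascii_file = "a" * 1000
print(ord("a"))                   # the Unicode number for 'a'
print(encoded_sizes(ascii_file))  # same characters, three different sizes
```

For ASCII-only text the UTF-8 size equals the character count, while
UTF-16 doubles it and UTF-32 quadruples it, which matches the figures
Mo gives above.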

Received on Sat Oct 21 14:36:18 2006

This is an archived mail posted to the Subversion Dev mailing list.