Greg Hudson <ghudson@MIT.EDU> writes:
> It is if you choose to think of it that way. A character set is a
> mapping of octet sequences to glyph sequences.
Actually, if we want to be stringent, a character set is just that, a
set of characters (which may or may not have associated glyphs). The
mapping between octet sequences and character sequences is called a
character encoding. A character encoding implies a character set
though, namely the set of characters that is the range of the mapping.
Normally, there is no need to distinguish between the character
encoding and its associated character set, so the terms are often used
interchangeably. Even the MIME standard uses the keyword "charset" for
something which is actually a character encoding specification. For
character sets with 256 or fewer characters, a natural character
encoding with the same name as the character set is implicitly defined.
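To make the distinction concrete, here is a minimal Python sketch. The
codec names ("iso-8859-1", "utf-8") are Python's own and the variable
names are purely illustrative; nothing here comes from the thread. It
shows one character mapped to different octet sequences by two
encodings, and the "natural" one-octet-per-character encoding of a
256-character set.

```python
# One character, a member of the ISO-8859-1 character set (and of Unicode).
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# Two different character encodings map it to different octet sequences.
latin1_octets = ch.encode("iso-8859-1")
utf8_octets = ch.encode("utf-8")
print(latin1_octets, utf8_octets)

# The "natural" encoding of a set with at most 256 characters uses one octet
# per character. In Python's iso-8859-1 codec, octet n decodes to code point n.
natural = bytes(range(256)).decode("iso-8859-1")
is_identity = all(ord(c) == n for n, c in enumerate(natural))
print(is_identity)
```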
> UTF-8, UTF-16, and UCS-4
> are all perfectly fine character sets by that definition; they just all
> have the same large range of glyphs.
They are character encodings that share the same character set (not
entirely true, since UTF-16 can only reach U+0000..U+10FFFF, a subset
of the 31-bit code space that UCS-4 can address).
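As a rough illustration, the sketch below encodes the same two
characters three ways. It assumes Python's "utf-32-be" codec as a
stand-in for UCS-4 (Python limits it to U+10FFFF), and the sample
string is an arbitrary choice.

```python
# U+0041 plus U+1D11E; the second character lies outside the BMP.
text = "A\U0001d11e"

# Same character sequence, three different character encodings.
encoded = {name: text.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}
for name, octets in encoded.items():
    print(name, len(octets), octets.hex())

# Every octet sequence decodes back to the same character sequence.
round_trips = all(octets.decode(name) == text for name, octets in encoded.items())
print(round_trips)
```

Note that UTF-16 needs a surrogate pair (four octets) for the second
character, which is the mechanism that bounds its range.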
> (UTF-8 and UTF-16 are certainly
> not the only character sets which use a variable number of octets per
> glyph; consider Shift-JIS.)
Shift-JIS is a character encoding of the union of the character sets
ANSI-X3.4-1968 (a.k.a. ASCII), JIS X 0201-1976 (halfwidth katakana),
and JIS C6226-1983 (kana, kanji, symbols). It is actually a good
example of a case where it makes sense to distinguish between
character encoding and character set. :-)
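A small sketch of that, assuming Python's "shift_jis" codec is an
adequate stand-in for the encoding described here (the sample
characters and labels are hypothetical picks, one from each of the
three character sets):

```python
# One sample character from each of the three underlying character sets.
samples = {
    "ASCII letter": "A",             # ANSI-X3.4-1968
    "halfwidth katakana": "\uff71",  # JIS X 0201
    "hiragana": "\u3042",            # JIS C6226 / JIS X 0208
}

# A single encoding covers all three sets, with one or two octets per character.
sjis = {label: ch.encode("shift_jis") for label, ch in samples.items()}
for label, octets in sjis.items():
    print(label, len(octets), octets.hex())
```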
I don't think we need to dwell on this point any more. Everybody
probably knows what people mean when they refer to UTF-8 as a
"charset".
// Marcus
Received on Sat Jun 1 14:12:28 2002