Re: [RFC/PATCH] commit messages not 8-bit compatible

From: Greg Hudson <ghudson_at_MIT.EDU>
Date: 2002-05-31 16:03:09 CEST

On Fri, 2002-05-31 at 03:23, Henrik Svensson wrote:
> UTF-8 is actually not a character set.

It is if you choose to think of it that way. A character set is a
mapping of octet sequences to glyph sequences. UTF-8, UTF-16, and UCS-4
are all perfectly fine character sets by that definition; they simply
share the same large range of glyphs. (UTF-8 and UTF-16 are certainly
not the only character sets which use a variable number of octets per
glyph; consider Shift-JIS.)
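
As a rough illustration of the "variable number of octets" point, here is a
minimal Python sketch. The helper name octets_per_char is made up for this
example, and the codec names are the ones Python's standard library happens
to use; nothing here comes from Subversion itself.

```python
def octets_per_char(text, charset):
    """Return how many octets each character of `text` takes in `charset`."""
    return [len(ch.encode(charset)) for ch in text]


sample = "A\u3042"  # LATIN CAPITAL LETTER A, HIRAGANA LETTER A

for charset in ("utf-8", "utf-16-be", "utf-32-be", "shift_jis"):
    print(charset, octets_per_char(sample, charset))

# A character outside the Basic Multilingual Plane needs a surrogate
# pair in UTF-16, so UTF-16 is variable-width as well.
print("utf-16-be", octets_per_char("\U0001F600", "utf-16-be"))
```

Seen this way, each name is just a different mapping between octet
sequences and the same text, which is the sense of "character set" used
above.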

If an application or protocol chooses to worry about character sets, it
is easier to think of UTF-8 as a character set than to model the
additional level of abstraction (where UTF-8 is a character encoding and
Unicode or ISO/IEC 10646 is the character set). Thus XML and MIME both
put UTF-8 and UTF-16 on a par with ISO-8859-1 and Shift-JIS, and refer
to them all more or less interchangeably as "charsets" or "character
sets" or "character encodings."

> It is just a way to store unicode characters. Since it is unicode you
> don't have to store any information about the charset used when
> entering the text.

Uh, I'm not sure if that's a blue-sky statement or just one I don't
understand. If everybody's tools (text editors, terminals, web
browsers, whatever) used UTF-8 to encode input characters, then there
would be no need to find out what charset was used, but in the real
world, you still have to convert from the charset used to enter the text
to UTF-8.
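
A minimal sketch of that conversion step, in Python rather than anything
Subversion actually uses. The function name to_utf8 and the choice of
ISO-8859-1 as the input charset are assumptions made for illustration only.

```python
def to_utf8(raw, source_charset):
    """Re-encode `raw` octets from `source_charset` into UTF-8 octets."""
    return raw.decode(source_charset).encode("utf-8")


# "café" as typed in an ISO-8859-1 terminal: the é is the single octet 0xE9.
latin1_message = b"caf\xe9"
utf8_message = to_utf8(latin1_message, "iso-8859-1")
print(utf8_message)

# Treating the same octets as if they were already UTF-8 fails, which is
# why the input charset has to be known before converting.
try:
    latin1_message.decode("utf-8")
    already_utf8 = True
except UnicodeDecodeError:
    already_utf8 = False
print(already_utf8)
```

The conversion only works because the caller states the source charset;
the octets alone do not say which mapping produced them.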

Received on Sat Jun 1 14:12:36 2002

This is an archived mail posted to the Subversion Dev mailing list.