Hi there.
I have followed this discussion with interest. It is very similar to a
discussion we had in a project I took part in some time ago. I think I
can add some comments to clear things up.
UTF-8 is actually not a character set. It is just a way to store
Unicode characters as a sequence of bytes (an encoding). Since the text
is Unicode, you don't have to store any information about the charset
used when entering the text.
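
To make the "way to store" point concrete, here is a minimal sketch in C
of how one Unicode code point becomes UTF-8 bytes. The function name
utf8_encode is made up for this illustration and is not taken from any
library discussed in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
 * Writes 1 to 4 bytes into out and returns the byte count, or 0 if
 * cp is not a valid Unicode scalar value (a surrogate, or a value
 * above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7F) {                       /* ASCII: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      /* two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)       /* surrogates are not characters */
        return 0;
    if (cp <= 0xFFFF) {                     /* three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* outside the Unicode range */
}
```

For example, U+00E5 (a with ring above) is stored as the two bytes C3 A5,
and nothing in those bytes says which legacy charset the user typed it in.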
When it is time to display the text to a human (sometimes called
rendering the text), it is the client software that is responsible for
doing it right. If for some reason (e.g. missing fonts) the client
cannot render the characters correctly, it should not try to do any
interpretation, just replace the unknown characters with some known
glyph (in MS Windows it is a small square).
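
The fallback rule above could be sketched roughly like this. Everything
here is an assumption made for illustration: has_glyph is a toy stand-in
for a real font-coverage query (it pretends the font covers only
U+0000..U+00FF), and U+25A1 WHITE SQUARE is just one possible choice of
"known glyph".

```c
#include <stddef.h>
#include <wchar.h>

/* Assumed fallback glyph for this sketch: U+25A1 WHITE SQUARE. */
#define FALLBACK_GLYPH ((wchar_t)0x25A1)

/* Hypothetical coverage test. A real client would ask its font or
 * rendering system; this toy version accepts only U+0000..U+00FF. */
static int has_glyph(wchar_t c)
{
    return (unsigned long)c <= 0xFFUL;
}

/* Replace every character the client cannot draw with the fallback
 * glyph, without guessing what the character "should" have been.
 * Returns the number of characters replaced. */
size_t replace_unrenderable(wchar_t *text)
{
    size_t replaced = 0;
    for (; *text != L'\0'; ++text) {
        if (!has_glyph(*text)) {
            *text = FALLBACK_GLYPH;
            ++replaced;
        }
    }
    return replaced;
}
```

Note that a real client would do this substitution only in the copy it
draws, so the stored text keeps the original characters.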
In my project we made the decision to represent all text as wchar_t*
in the "clients" and let the library take care of the storage format
and the little/big-endian issues.
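
As an illustration of that split, here is a minimal sketch of what the
library side might look like. The storage format (UTF-16, big-endian) and
the name store_utf16be are assumptions made for this example; they are
not necessarily what our project used. The bytes are produced with shifts,
so the result is the same on little-endian and big-endian hosts.

```c
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Write one 16-bit unit in big-endian order, independent of the
 * byte order of the host. */
static void put_u16be(unsigned char *out, uint16_t unit)
{
    out[0] = (unsigned char)(unit >> 8);
    out[1] = (unsigned char)(unit & 0xFF);
}

/* Hypothetical library routine: store a wchar_t string as UTF-16BE.
 * Returns the number of bytes written, or (size_t)-1 if the buffer
 * is too small or a value lies outside the Unicode range. */
size_t store_utf16be(const wchar_t *text, unsigned char *out, size_t cap)
{
    size_t n = 0;
    for (; *text != L'\0'; ++text) {
        uint32_t c = (uint32_t)*text;
        if (c > 0x10FFFF)
            return (size_t)-1;
        if (c > 0xFFFF) {
            /* Only reachable where wchar_t is wider than 16 bits:
             * split the code point into a surrogate pair. */
            if (cap - n < 4)
                return (size_t)-1;
            c -= 0x10000;
            put_u16be(out + n, (uint16_t)(0xD800 | (c >> 10)));
            put_u16be(out + n + 2, (uint16_t)(0xDC00 | (c & 0x3FF)));
            n += 4;
        } else {
            /* Fits in one unit (where wchar_t is 16 bits wide, the
             * text is already a sequence of UTF-16 units). */
            if (cap - n < 2)
                return (size_t)-1;
            put_u16be(out + n, (uint16_t)c);
            n += 2;
        }
    }
    return n;
}
```

The client code only ever sees wchar_t*, so the storage format can be
changed later without touching the clients.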
Handling different languages and scripts in a computer can be very
complicated, but the Unicode standard (www.unicode.org) contains a lot
of useful information and implementation guidelines.
I hope this comment can help.
Henrik
Received on Sat Jun 1 14:14:04 2002