Re: Character sets for log messages

From: Henrik Svensson <innotron_at_telia.com>
Date: 2002-06-02 00:09:37 CEST

Quoting Colin Putney <cputney@whistler.net>:

>
> On Saturday, June 1, 2002, at 08:08 AM, Henrik Svensson wrote:
>
> > It is not very difficult to convert from Unicode to any other charset.
> > Code and recommendations on how to do it (for most standard charsets)
> > are available. Some systems even have functions that will do it for
> > you. In the case of a simple client that, for example, can only display
> > 7-bit ASCII it is even trivial: just remove the most significant byte
> > from every character in the array. The only thing the client has to
> > consider is that the text can contain characters that it can't convert
> > and print. How to handle this case has to be decided by the client
> > developer, but an easy solution is to replace the unprintable
> > characters with a simple placeholder.
>
> Well, just stripping off the high bit would leave garbage characters
> wherever there are multibyte sequences, so you'd have to be able to
> recognize those sequences and deal with them appropriately.
>

Please read the last sentence of my posting.
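
As a rough illustration of the placeholder approach, here is a minimal sketch in Python. It is not taken from any Subversion client; the function name to_ascii and the "?" placeholder are hypothetical choices. It assumes the stored text is UTF-8, decodes it first so that each multibyte sequence becomes one character, and only then substitutes the characters a 7-bit client cannot print.

```python
def to_ascii(utf8_bytes: bytes, placeholder: str = "?") -> str:
    """Decode UTF-8 text and replace every non-ASCII character with a placeholder.

    Hypothetical helper for illustration only.
    """
    # Decoding first turns each multibyte sequence into a single code point,
    # so no stray continuation bytes are left behind.
    text = utf8_bytes.decode("utf-8")
    return "".join(ch if ord(ch) < 128 else placeholder for ch in text)


# "\u00c4" is LATIN CAPITAL LETTER A WITH DIAERESIS, two bytes in UTF-8.
message = "\u00c4ndrade loggen".encode("utf-8")
print(to_ascii(message))  # ?ndrade loggen
```

Each unprintable character collapses to exactly one placeholder, however many bytes it occupied in the stored message.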

> I realize that it's not impossible, or even difficult to do the
> conversion. But it will take time and effort to do the research,
> coding, testing and maintenance. It's another hurdle that a client
> developer will have to clear, for no particular benefit.
>
Please read sentences one and two. For more information, go to
www.unicode.org. It's all there. It will benefit everyone who wants to
use a language that needs more than US-ASCII.

> What will likely happen is that simple clients will just ignore the
> problem, like CVS does. Then everything will work fine as long as the
> user never encounters anything but 7-bit ASCII, which maps to UTF-8
> without modification.
>
Then people will find out that the client is not doing its work
properly. They will either not use it or fix it. The beauty of open
source.

> By explicitly specifying the charset we give clients the option to
> gracefully decline to display charsets they don't know about.
>
Think of two clients, one used for writing and one for reading. The
writing client uses the "latin-1" charset to store a log message. The
reading client only understands "US-ASCII". This means that the
reading client can't present any message at all to the user, even
though the two charsets are almost the same. Even if the stored text
only contains characters that have the same numerical value in both
charsets, the reading client cannot show a single character of the
message to the user, because it has no knowledge at all about the
similarities between the charsets. If the message were stored as
Unicode/UTF-8, and the clients were able to convert between their local
charsets and Unicode, the reading user would see exactly the same
message as was originally written with the writing client.
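
The scenario can be sketched in a few lines of Python. This is only an illustration under stated assumptions: store_log and show_log are hypothetical names, not part of any real client, and the reading side falls back to Python's built-in "replace" error handler for characters its local charset lacks.

```python
def store_log(local_bytes: bytes, local_charset: str) -> bytes:
    """Writing client (hypothetical): convert the local charset to UTF-8 for storage."""
    return local_bytes.decode(local_charset).encode("utf-8")


def show_log(stored: bytes, local_charset: str) -> bytes:
    """Reading client (hypothetical): convert stored UTF-8 to the local charset.

    Characters the local charset cannot represent are replaced with "?".
    """
    return stored.decode("utf-8").encode(local_charset, errors="replace")


# The writer works in latin-1; "\u00e4" is a-diaeresis, one byte in latin-1.
written = "Fixed bug in r\u00e4knare".encode("latin-1")
stored = store_log(written, "latin-1")

print(show_log(stored, "latin-1") == written)  # True
print(show_log(stored, "ascii"))               # b'Fixed bug in r?knare'
```

A latin-1 reader gets the original bytes back, while a US-ASCII reader still sees every character the two charsets share and loses only the one it cannot represent.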

/Henrik

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Sun Jun 2 04:09:49 2002
