Character sets for log messages

From: Colin Putney <colin_at_whistler.com>
Date: 2002-06-01 03:04:21 CEST

On Friday, May 31, 2002, at 02:59 PM, Stephen C. Tweedie wrote:
>
> In other words, if you just store 8-bit data plus a charset encoding,
> then right now, everything just continues to work as it always has;
> and in the future, internationalised clients will be able to parse and
> encode things quite happily.
>
> Pretty much the only advantage you get if you force all strings
> internally to UTF-8 is that when a client comes to translate one
> charset to another, it doesn't have to know anything about the
> encoding used by the original user when submitting the string in the
> first place. But then, it still has to know about that charset to
> display it, so that's really not much of a win.

For me this is a very compelling argument for not requiring UTF-8.

Consider the alternatives with regard to ease of client implementation.
Suppose there are two types of clients: simple clients which only use
the local encoding, and advanced clients which can display text in many
different encodings.

If UTF-8 is required (option 1), even a simple client must convert
between the local encoding and UTF-8. This is non-trivial, and can get
more complex if the local encoding can vary according to the user's
preference. An advanced client won't find this a problem since it's
going to be jumping through all sorts of hoops to display arbitrary
Unicode strings anyway. On the other hand, an advanced client won't
benefit much from the implicit knowledge that log messages are in UTF-8.
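
To make that cost concrete, here is a minimal sketch of the round trip
option 1 imposes on a simple client. The function names are hypothetical,
and it assumes the local encoding can be discovered through Python's
locale module; it is not how any Subversion client actually does it.

```python
import locale
from typing import Optional


def to_utf8(raw: bytes, local_encoding: Optional[str] = None) -> bytes:
    """Re-encode a log message from the local encoding to UTF-8 before commit.

    Hypothetical helper: falls back to the locale's preferred encoding
    when the caller does not name one.
    """
    enc = local_encoding or locale.getpreferredencoding(False)
    # Raises UnicodeDecodeError if the bytes are not valid in `enc`.
    return raw.decode(enc).encode("utf-8")


def from_utf8(stored: bytes, local_encoding: Optional[str] = None) -> bytes:
    """Convert a stored UTF-8 log message back to the local encoding for display."""
    enc = local_encoding or locale.getpreferredencoding(False)
    # Characters the local encoding cannot represent are replaced with "?".
    return stored.decode("utf-8").encode(enc, errors="replace")
```

Even this small version has to pick a policy for characters the local
encoding can't represent, which is part of what makes the conversion
non-trivial.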

The other approach (option 2) is to accept log messages in any encoding
as long as it's specified. Let's call this charset-aware.
"Charset-neutral" would be accurate, but we've been using that to mean
something else. This makes simple clients much easier, since they don't
have to do any conversion, only state what encoding they're using.
Advanced clients will be able to optimize their text handling by using
an encoding that handles the characters in the log message efficiently.
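
A rough sketch of the charset-aware approach follows, assuming a log
message is stored as raw bytes plus a charset label. The LogMessage type
and both function names are invented for illustration and are not part of
any Subversion API.

```python
from typing import NamedTuple


class LogMessage(NamedTuple):
    """Hypothetical record: the bytes as submitted, plus the charset they are in."""
    data: bytes
    charset: str


def submit(raw: bytes, charset: str) -> LogMessage:
    """Simple client: no conversion, only state which encoding is in use."""
    return LogMessage(data=raw, charset=charset)


def display(msg: LogMessage) -> str:
    """Advanced client: decode using the recorded charset.

    Bytes that are invalid in the named charset become U+FFFD
    instead of raising an error.
    """
    return msg.data.decode(msg.charset, errors="replace")
```

The simple client's submit path never touches the bytes; only a client
that wants to render a foreign encoding has to decode anything.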

Ok, now let's consider the ease of implementing the Subversion
libraries. Requiring UTF-8 is attractive because it requires a minimum
of change to the existing code, and doesn't require the libraries to
store or pass around encodings. It also allows for consistent string
handling; filenames are in the same character set as log messages.

In general, I'd say that ease of client development should be preferred
over ease of library development. The Subversion developers win either
way, because they're implementing both the libraries and a client. :-)

Cheers,

Colin

Colin Putney
Whistler.com

Received on Sat Jun 1 14:09:19 2002

Next message: Marcus Comstedt: "Re: Call For Votes: converting log messages to UTF-8"
Previous message: David Mankin: "Re: Human representation of dates, opinions"
In reply to: Stephen C. Tweedie: "Re: charset neutral? pls solve this"
Next in thread: Henrik Svensson: "Re: Character sets for log messages"
Maybe reply: Henrik Svensson: "Re: Character sets for log messages"