Re: use of UTF-8

From: Greg Stein <gstein_at_lyra.org>
Date: 2002-05-31 00:45:19 CEST

On Thu, May 30, 2002 at 05:06:03PM -0500, Karl Fogel wrote:
> Greg Stein <gstein@lyra.org> writes:
>...
> Right, right. But the `log_msg' parameter to functions was not
> `char *' until very recently, and for reasons having nothing to do
> with some prior decision about them being UTF-8.

Yup.

> I'm sorry to keep repeating myself. It seems (maybe I'm
> misunderstanding?) that you brought up type of those params as
> indicating that some decision had already been made about their
> charset.

Misunderstanding :-). The logic went like this: the param is a char*, thus
it will (typically) represent characters, which means it needs an
associated charset, which we have previously stated would be UTF-8.

>...
> > To be concrete: either those char* params are UTF-8, or you add a second
> > parameter to state their charset. (or you just go charset neutral which
> > isn't really a good option)
>
> Those aren't the only options here (and you're dismissing charset
> neutral as an obviously bad third option, mentioned only to be
> rejected, when in fact it's what this whole thread is really about).

I dismissed it because there has already been quite a bit of material
(from Garrett, Jon, etc etc) stating how clients need to know the charset to
be able to do anything with those log messages.

--> They contain characters. You need to know their charset.

I'm not sure how it is possible to really consider otherwise. To display
those characters to the user, you need the charset. To edit them, to set
them, to email them, to do whatever.
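
To make that concrete, here is a minimal sketch (not Subversion code;
`is_valid_utf8' is a hypothetical helper invented for this illustration).
The same word, "café", is a different byte sequence in ISO-8859-1 than in
UTF-8, and a consumer expecting UTF-8 cannot even parse the ISO-8859-1
bytes, let alone display them:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of any Subversion API: return 1 if the
   NUL-terminated string S is well-formed UTF-8, 0 otherwise. */
int is_valid_utf8(const char *s)
{
  const unsigned char *p = (const unsigned char *) s;

  assert(s != NULL);
  while (*p)
    {
      unsigned long cp;   /* code point decoded so far */
      int extra;          /* continuation bytes still expected */
      unsigned long min;  /* smallest code point legal for this length */

      if (*p < 0x80)                { cp = *p;        extra = 0; min = 0; }
      else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; min = 0x80; }
      else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; min = 0x800; }
      else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; min = 0x10000; }
      else
        return 0;  /* stray continuation byte or invalid lead byte */
      p++;

      while (extra-- > 0)
        {
          if ((*p & 0xC0) != 0x80)
            return 0;  /* truncated sequence (this also catches the NUL) */
          cp = (cp << 6) | (*p & 0x3F);
          p++;
        }

      /* Reject overlong forms, UTF-16 surrogates, and out-of-range values. */
      if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    }
  return 1;
}
```

With this, is_valid_utf8("caf\xC3\xA9") accepts the UTF-8 spelling, while
is_valid_utf8("caf\xE9") rejects the ISO-8859-1 spelling: the byte 0xE9
announces a three-byte sequence that never arrives.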

Basically, I can't see how "leaving it up to arbitrary interpretation"
is in any way a valid approach.

> I see three options on the table:

Four.

     - add a second parameter to the relevant data structures and routines
       to hold the character set of the string in question (while we're
       talking about log messages here, I think there are others; the rule
       for log msgs will apply everywhere)

> - Keep them as char *, declare them UTF-8, and convert user input
> as best we can.
>
> - Keep them as char *, declare no particular charset, but don't
> allow zero bytes.
>
> - Convert them back to counted-length strings and treat them as
> binary data again (I guess this is the most militantly charset
> neutral option).

Of the above four approaches:

1) a second param is very heavyweight from a conceptual and coding
   standpoint. and, in the end, we'll probably have to do conversions
   anyways, so allowing an arbitrary charset rather than fixed doesn't
   seem to buy a lot.

2) my favorite. note that the *client* does the conversions. the libraries
   simply assume all text strings are in UTF-8.

3) untenable for the clients.

4) this is similar to (3), but we just allow more flexibility.
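
As a rough illustration of (2), here is a minimal sketch of the client-side
conversion step. It assumes the user's native charset is ISO-8859-1, which
is only an assumption for the example; a real client would go through a
general conversion facility for whatever the locale says. The name
`latin1_to_utf8' is hypothetical and not an existing Subversion or APR
function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical client-side helper: convert the NUL-terminated ISO-8859-1
   string IN to UTF-8 in OUT, which has room for OUT_SIZE bytes.  Return
   the number of bytes written (excluding the NUL), or -1 if OUT is too
   small. */
long latin1_to_utf8(const char *in, char *out, size_t out_size)
{
  const unsigned char *p = (const unsigned char *) in;
  unsigned char *o = (unsigned char *) out;
  size_t n = 0;

  assert(in != NULL && out != NULL);
  for (; *p; p++)
    {
      if (*p < 0x80)
        {
          /* ASCII is identical in both charsets: one byte, plus room
             for the terminating NUL. */
          if (n + 1 >= out_size)
            return -1;
          o[n++] = *p;
        }
      else
        {
          /* 0x80..0xFF map to U+0080..U+00FF, always two bytes in UTF-8. */
          if (n + 2 >= out_size)
            return -1;
          o[n++] = (unsigned char) (0xC0 | (*p >> 6));
          o[n++] = (unsigned char) (0x80 | (*p & 0x3F));
        }
    }
  if (n >= out_size)
    return -1;
  o[n] = '\0';
  return (long) n;
}
```

The client would run the user's log message through something like this
before handing the resulting char * to the libraries, which then never have
to ask what charset they were given.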

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

Received on Sat Jun 1 14:15:27 2002
