On Thu, May 30, 2002 at 04:53:47PM -0500, Karl Fogel wrote:
> Greg Stein <email@example.com> writes:
> > Nope. We said that text strings passed around within the libraries (log
> > message is a good one, paths, property names, etc) would be considered to be
> > in UTF-8. We chose that following the same reasoning as using UTF-8 for the
> > pathnames: consistency and that it can represent any other character set.
> Oh, okay -- what we have here is different memory about what was
> agreed on in the past. So, let's never mind what we *thought* was
> agreed on, since it's clear what various people think right now :-).
Yah, seems that way, so definitely fair enough to just revisit.
> Right now I mildly prefer this solution:
> - Don't munge (or convert, to use a less pejorative term) the log
> message at all, but simply reject log messages that contain any
> zero bytes. Log message charsets would be determined by each
> individual repository's policy, with a recommendation (but not an
> enforcement) from us to use UTF-8.
> If a lot of people feel strongly that enforcing conversion to UTF-8 is
> the Right Thing, I certainly won't veto. I mean, I could be wrong :-).
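For concreteness, the "reject zero bytes, otherwise leave the message alone" proposal quoted above could look roughly like the sketch below. The function name validate_log_message is hypothetical and is not part of any Subversion API; the only rule taken from the proposal is that a message containing a zero byte is refused and everything else passes through unconverted.

```python
def validate_log_message(message: bytes) -> bytes:
    """Return the message unchanged, or raise if it contains a zero byte.

    Hypothetical helper: no charset conversion is attempted, matching the
    'leave it alone' proposal. The charset is left to repository policy.
    """
    if b"\x00" in message:
        raise ValueError("log message contains a zero byte")
    return message


# Bytes in any charset pass through untouched.
accepted = validate_log_message(b"Fix typo in README\n")

# A message with an embedded NUL is refused.
try:
    validate_log_message(b"bad\x00message")
    rejected = False
except ValueError:
    rejected = True
```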
I think a better thread for responding is the other one. I'll defer to that.
> How reliable is it to use locale to determine the source format of the
> conversion (or whatever method we're going to use), though? For
You *must* use the locale. Looking at the characters is insufficient.
> example, my locale indicates nothing about Chinese editing, but
> sometimes I write text in one of the various char encodings that
> supports Chinese characters. If I were to do that in a log message on
> some project, my log message would get all messed up. In such a case,
> leaving it alone would be better, because some tools can
> heuristically determine the charset -- *if* they have the original
> data to work with.
The best they could do would be to determine whether you've got a
double-byte charset or some variety of single-byte charset. Within those
groups, you might be able to refine a bit. But not much more.
For example, if you have a string of bytes that validates as UTF-8, is that
*really* what it was? Or was it from the latin-1 charset? You just can't
tell from inspection. Hence the requirement to use the locale.
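A small illustration of that ambiguity, as a sketch rather than anything from the Subversion code: the two bytes below decode without error under both UTF-8 and Latin-1, yet yield different text. The helper name decodes_as is made up for this example.

```python
def decodes_as(data: bytes, charset: str) -> bool:
    """Report whether the bytes are a valid sequence in the given charset."""
    try:
        data.decode(charset)
        return True
    except UnicodeDecodeError:
        return False


# Two bytes with no charset label attached.
raw = b"\xc3\xa9"

# Both decodings succeed, so validation alone cannot pick one.
valid_utf8 = decodes_as(raw, "utf-8")
valid_latin1 = decodes_as(raw, "latin-1")

as_utf8 = raw.decode("utf-8")      # one character: U+00E9
as_latin1 = raw.decode("latin-1")  # two characters: U+00C3, U+00A9
```

Since Latin-1 assigns a character to every byte value, any byte string at all "validates" as Latin-1, which is why inspection cannot rule it out.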
> If the data is there, one can guess at the charset
> if necessary. If the data is destroyed by a misconversion, then it's
> gone. That's why I feel it's better to leave it alone.
Sorry... guessing isn't possible. Something has to state the charset
(whether that "something" is another attribute, a requirement of a specific
charset, or just never guess). More in the other thread.
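As a rough sketch of what "something has to state the charset" means in practice: conversion to UTF-8 takes the source charset as an explicit input, and the locale is one place that input can come from. The names to_utf8 and locale_charset are hypothetical; locale.getpreferredencoding is the standard Python call for asking the locale.

```python
import locale


def locale_charset() -> str:
    """Charset name the user's locale declares (one possible 'something')."""
    return locale.getpreferredencoding(False)


def to_utf8(message: bytes, declared_charset: str) -> bytes:
    """Convert from the declared charset to UTF-8.

    Raises UnicodeDecodeError if the bytes are not valid in that charset.
    """
    return message.decode(declared_charset).encode("utf-8")


# Text written in Big5 while the locale says nothing about Chinese.
text = "\u4e2d\u6587"
original = text.encode("big5")

# With the right declaration the text survives the conversion.
good = to_utf8(original, "big5").decode("utf-8")

# With the wrong declaration the conversion still "succeeds",
# but the result is different text.
bad = to_utf8(original, "latin-1").decode("utf-8")
```

Here the wrong declaration does not raise an error. It silently yields different characters, which is the failure mode discussed above.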
Greg Stein, http://www.lyra.org/
Received on Sat Jun 1 14:15:33 2002