Re: [RFC/PATCH] commit messages not 8-bit compatible

From: Greg Stein <gstein_at_lyra.org>
Date: 2002-05-31 00:34:48 CEST

On Thu, May 30, 2002 at 04:53:47PM -0500, Karl Fogel wrote:
> Greg Stein <gstein@lyra.org> writes:
> > Nope. We said that text strings passed around within the libraries (log
> > message is a good one, paths, property names, etc) would be considered to be
> > in UTF-8. We chose that following the same reasoning as using UTF-8 for the
> > pathnames: consistency and that it can represent any other character set.
>
> Oh, okay -- what we have here is different memory about what was
> agreed on in the past. So, let's never mind what we *thought* was
> agreed on, since it's clear what various people think right now :-).

Yah, seems that way, so definitely fair enough to just revisit.

>...
> Right now I mildly prefer this solution:
>
> - Don't munge (or convert, to use a less pejorative term) the log
> message at all, but simply reject log messages that contain any
> zero bytes. Log message charsets would be determined by each
> individual repository's policy, with a recommendation (but not an
> enforcement) from us to use UTF-8.
>
> If a lot of people feel strongly that enforcing conversion to UTF-8 is
> the Right Thing, I certainly won't veto. I mean, I could be wrong :-).

I think a better thread for responding is the other one. I'll defer to that
thread.

> How reliable is it to use the locale to determine the source format of
> the conversion (or whatever method we're going to use), though? For

You *must* use the locale. Looking at the characters is insufficient.

> example, my locale indicates nothing about Chinese editing, but
> sometimes I write text in one of the various char encodings that
> supports Chinese characters. If I were to do that in a log message on
> some project, my log message would get all messed up. In such a case,
> leaving it alone would be better, because some tools can
> heuristically determine the charset -- *if* they have the original
> data to work with.

The best they could do would be to determine whether you've got a
double-byte charset or some variety of single-byte charset. Within those
groups, you might be able to refine a bit. But not much more.

For example, if you have a string of bytes that validates as UTF-8, is that
*really* what it was? Or was it from the latin-1 charset? You just can't
tell from inspection. Thus, the requirement for the locale.
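
To make the ambiguity concrete, here is a minimal Python sketch (not part of
the original message; the sample text is chosen purely for illustration). The
same two bytes decode without error as both UTF-8 and latin-1, so validation
alone cannot say which one the author meant:

```python
# The bytes below are the UTF-8 encoding of "é", but they are also a
# perfectly valid latin-1 string of two characters.
data = "é".encode("utf-8")

as_utf8 = data.decode("utf-8")      # one character
as_latin1 = data.decode("latin-1")  # two characters, also no error

# Both decodes succeed, yet they disagree about the text.
same_text = as_utf8 == as_latin1
print(repr(as_utf8), repr(as_latin1), same_text)
```

Since latin-1 assigns a character to every byte value, any byte string at all
"validates" as latin-1, which is why inspection cannot rule it out.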

> If the data is there, one can guess at the charset
> if necessary. If the data is destroyed by a misconversion, then it's
> gone. That's why I feel it's better to leave it alone.

Sorry... reliable guessing isn't possible. Something has to state the charset,
whether that "something" is another attribute or a requirement of a specific
charset; otherwise, just never guess. More in the other thread.
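
As a rough sketch of what "something has to state the charset" could look like
in practice (Python, not Subversion's actual code; the function name
`log_message_to_utf8` and its `charset` parameter are hypothetical), the
conversion below takes the charset from an explicit argument, falls back to the
locale's preferred encoding, and never inspects the bytes to guess:

```python
import locale
from typing import Optional


def log_message_to_utf8(raw: bytes, charset: Optional[str] = None) -> bytes:
    """Convert a log message to UTF-8 using a stated charset.

    Hypothetical helper: `charset` stands in for whatever states the
    encoding (an attribute, a repository policy, ...). If it is not
    given, the locale's preferred encoding is used instead.
    """
    source = charset or locale.getpreferredencoding(False)
    # Strict decoding: bytes that are invalid in the stated charset raise
    # UnicodeDecodeError instead of being silently mangled.
    text = raw.decode(source, errors="strict")
    return text.encode("utf-8")


# A latin-1 message is converted correctly when its charset is stated.
converted = log_message_to_utf8(b"caf\xe9", "latin-1")
print(converted)
```

If the stated charset is wrong, the result is still wrong, which is the
misconversion risk raised in the quoted text; the sketch only shows that the
charset comes from a declaration rather than from inspecting the data.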

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

Received on Sat Jun 1 14:15:33 2002
