Re: [PATCH] Script for warning about error leaks

From: Julian Foad <julianfoad_at_btopenworld.com>
Date: 2006-02-08 04:30:41 CET

Vincent Lefevre wrote:
> On 2006-02-06 15:01:08 +0000, Julian Foad wrote:
>>Vincent Lefevre wrote:
>>>
>>>Shouldn't you make it locale-insensitive by setting LC_ALL to "C"?
>>
>>Um, I've no idea. Should I? What might go wrong in other locales?
>
> In the script, you have:
>
> IDENT="[a-zA-Z_][a-zA-Z0-9_]*"
>
> I think this regexp should be OK with any locale, though. Range
> expressions, such as [a-z], depend on the locale. For instance,
> [a-z] may include uppercase and/or accented characters. If one

OK, thanks. Then I definitely ought to set the "C" locale, at least for this bit.

I always find this confusing, the idea that "[a-z]" might not mean what I want
it to mean. For the possible enlightenment of other readers, having thought
about it a bit now, here's what I came up with...

We're not talking about the file being stored or translated into a weird
character set encoding in which 'a' to 'z' might not be represented by (ASCII)
codes 97 to 122 consecutively; all common character sets are at least
approximately a superset of ASCII, I think.

Rather, we're talking about a range expression like "[a-z]" being interpreted
by a particular program not as the range of characters having code('a') to
code('z'), but as the set of characters found between 'a' and 'z' in the
locale's collation table, which is primarily intended for sorting things into
"alphabetical" order and typically differs from the character set encoding
order, e.g. (fictional example)

LANG=fr_FR
Encoding order: ...ABCDE...XYZ...abcde...xyz...ÁÉ...áé...
Collation table: ...aáâàAÁÂÀbBcçCÇdDeéêèEÉÊÈ...xXyYzZ...

Thus, the important setting here is LC_COLLATE. By experiment, I note that
some programs, such as my current copy of "grep", notice LC_COLLATE and use the
collation table to interpret a range expression, whereas others, such as my
current copy of "sed", just use the character set encoding sequence.
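
That kind of experiment can be repeated with something like the sketch below.
It assumes a POSIX shell and a "grep" on the PATH; the helper name
"range_matches" is made up for illustration. Only the "C" result is
predictable. What any other locale does depends on the tool and the system.

```shell
# Hypothetical helper: succeed if the one-character string $2 matches the
# bracket expression $3 when grep runs under the locale named by $1.
range_matches() {
    printf '%s\n' "$2" | LC_ALL="$1" grep -q "^$3\$"
}

# In the "C" locale a range follows the encoding order, so 'B' (code 66)
# lies outside [a-z] (codes 97 to 122).
if range_matches C B '[a-z]'; then
    echo "C: [a-z] matches B"
else
    echo "C: [a-z] does not match B"
fi
```

Passing another installed locale name as the first argument shows whether a
given grep consults the collation table for that locale.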

So, I should set LC_COLLATE if I use range expressions and need them to mean
something in particular, and also if I use sorting (such as the "sort"
utility) and require a particular order. I should set LC_CTYPE if I use things
like "end-of-word" regular expressions ("\>"), and LC_TIME if it matters how
dates and times are written, and so on.
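
For the sorting case, a minimal sketch, assuming a POSIX shell with "sort" and
"paste" available (the function name "sort_bytewise" is hypothetical):

```shell
# Hypothetical wrapper: sort stdin in plain encoding (byte) order, whatever
# locale the caller happens to be running in.
sort_bytewise() {
    LC_ALL=C sort
}

# Join the sorted lines with spaces so the order is easy to see on one line.
sorted=$(printf '%s\n' b A a B | sort_bytewise | paste -s -d ' ' -)
echo "$sorted"
```

In the "C" locale all the upper-case ASCII letters sort before the lower-case
ones; under a locale with a collation table the same input may well come out
interleaved.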

Basically I should set some or all of the LC_ variables, probably to "C", when
I want deterministic machine behaviour, and use the user's locale when I want
behaviour that is friendly for humans.

In a script like this that operates on computer (programming language) data, it
is reasonable to set LC_ALL=C (LC_ALL overrides LC_COLLATE and all the other
individual settings).
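
As a sketch of what that could look like near the top of such a script (this
is not the actual patch; the helper "is_ident" and the sample names are made
up for illustration, and only the IDENT pattern comes from the script under
discussion):

```shell
# Force the "C" locale for this script and every utility it runs.
LC_ALL=C
export LC_ALL

# Pattern quoted from the script under discussion.
IDENT="[a-zA-Z_][a-zA-Z0-9_]*"

# Hypothetical helper: succeed if $1 is, in its entirety, an identifier
# according to IDENT.
is_ident() {
    printf '%s\n' "$1" | grep -q "^$IDENT\$"
}

is_ident svn_error_t && echo "svn_error_t: identifier"
is_ident 9lives || echo "9lives: not an identifier"
```

Exporting the variable once at the top means there is no need to remember a
per-command prefix on each grep, sed or sort call later on.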

OK, I think I'm happy with that now if I've got it roughly right.

(There's one other significant complication: the not-so-standard LC_MESSAGES.
I don't know whether a utility operating under LC_ALL=C will produce localised
messages. In this case it really doesn't matter much. Unless the answer is
simple, that's perhaps for another discussion another time. In fact, I think
that discussion has already been had, for Subversion itself.)

- Julian

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Wed Feb 8 04:31:08 2006
