
Re: Removing the --enable-utf8 flag

From: Ulrich Drepper <drepper_at_redhat.com>
Date: 2002-07-20 20:13:09 CEST

On Thu, 2002-07-18 at 13:29, Karl Fogel wrote:
> Here's how this would become a run-time decision:
>
> * Always attempt conversion. If the conversion fails (for example
> because the underlying xlation mechanism isn't working, as is
> currently the case), *then* check for non_ascii, and bomb only if
> there are illegal characters in the data. Otherwise, we proceed,
> effectively treating the data as if it were already UTF-8,
> because we know it's all safe ascii characters.

I like the idea of removing the option, but the outlined algorithm is
unsafe. Admittedly it will work in most cases, but not all, and for
something like a version control tool that isn't enough IMO.

Look at this "message":

  M@]@`n@J@ZK

It consists only of ASCII characters and therefore would pass the
non_ascii test. But it's not readable and not comparable to other
strings, since it's encoded using IBM870 [*]:

$ echo -n 'M@]@`n@J@ZK' | iconv -f IBM870; echo
( ) -> [ ].

I suggest one additional test before running the non_ascii test for the
entire string. Check whether the encoding used is known to be
ASCII-safe. Only if this test succeeds should the non_ascii tests be
performed.

The checks for ASCII-safeness can be performed by string comparisons
with the name of the encoding of the incoming data. The names of all
the safe encodings could be collected. Variations in names can and
probably should be eliminated by normalization before the comparison.
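
A minimal sketch of such a name check, under stated assumptions: the
function names (normalize_encoding_name, encoding_is_ascii_safe) and the
short list of safe encodings are hypothetical and only illustrative, not
taken from Subversion or any library. Normalization here means dropping
everything except letters and digits and uppercasing the rest, done by
hand so the result does not depend on the current locale.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative and deliberately incomplete list of encodings assumed
   to be ASCII-safe, stored in normalized form (assumption).  */
static const char *const safe_encodings[] = {
  "ASCII", "USASCII", "ANSIX341968", "ISO88591", "ISO885915", "UTF8"
};

/* Copy NAME to OUT keeping only ASCII letters and digits, uppercased.
   Returns 0 on success, -1 if OUT (of size OUTLEN) is too small.  */
int
normalize_encoding_name (const char *name, char *out, size_t outlen)
{
  size_t n = 0;

  assert (name != NULL && out != NULL);
  if (outlen == 0)
    return -1;

  for (; *name != '\0'; ++name)
    {
      unsigned char c = (unsigned char) *name;
      int is_lower = c >= 'a' && c <= 'z';
      int is_upper = c >= 'A' && c <= 'Z';
      int is_digit = c >= '0' && c <= '9';

      if (!is_lower && !is_upper && !is_digit)
        continue;               /* skip '-', '_', '.', ':' and the like */
      if (n + 1 >= outlen)
        return -1;              /* no room for this byte plus the NUL */
      out[n++] = is_lower ? (char) (c - 'a' + 'A') : (char) c;
    }
  out[n] = '\0';
  return 0;
}

/* Returns 1 if NAME normalizes to an entry in safe_encodings, else 0.
   Unknown or overlong names are treated as unsafe.  */
int
encoding_is_ascii_safe (const char *name)
{
  char norm[64];
  size_t i;

  if (normalize_encoding_name (name, norm, sizeof norm) != 0)
    return 0;
  for (i = 0; i < sizeof safe_encodings / sizeof safe_encodings[0]; ++i)
    if (strcmp (norm, safe_encodings[i]) == 0)
      return 1;
  return 0;
}
```

With this sketch, "utf-8", "UTF8" and "ISO_8859-1" all count as safe,
while "IBM870" does not. Treating every unknown name as unsafe is the
conservative choice.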

An encoding is ASCII-safe if

  From its initial state it is not possible to create a character
  which does not have the ASCII encoding when only using ASCII input
  bytes.

To catch stateful encodings etc., only the printable ASCII characters are
allowed, i.e., 0x20 <= ch && ch < 0x7f && isprint (ch). Note that the
isascii() test sometimes available in <ctype.h> is *not* sufficient,
since it also accepts control characters such as ESC.
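
The printable-ASCII test can be sketched as follows. The function name
is_printable_ascii is hypothetical; the condition is the one given
above. Rejecting control bytes matters because stateful encodings such
as ISO-2022-JP switch character sets with escape sequences that begin
with ESC (0x1b).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if every one of the LEN bytes at DATA satisfies
   0x20 <= ch < 0x7f and isprint (ch), otherwise 0.
   An empty buffer trivially passes.  */
int
is_printable_ascii (const char *data, size_t len)
{
  size_t i;

  assert (data != NULL || len == 0);
  for (i = 0; i < len; ++i)
    {
      /* Convert to unsigned char first: passing a negative char
         value to isprint () is undefined behavior.  */
      unsigned char ch = (unsigned char) data[i];

      if (ch < 0x20 || ch >= 0x7f || !isprint (ch))
        return 0;
    }
  return 1;
}
```

Note that this strict reading also rejects tab and newline. The IBM870
example above passes this test, which is why the encoding-name check
has to come first.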

[*] This is one example I came up with right away. Yes, it is a
constructed example. There are certainly more compelling examples.

-- 
---------------.                          ,-.   1325 Chesapeake Terrace
Ulrich Drepper  \    ,-------------------'   \  Sunnyvale, CA 94089 USA
Red Hat          `--' drepper at redhat.com   `------------------------

Received on Sat Jul 20 20:13:51 2002

This is an archived mail posted to the Subversion Dev mailing list.
