[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: UTF-8

From: Ulrich Drepper <drepper_at_redhat.com>
Date: 2002-05-22 22:38:41 CEST

I cannot comment on the svn code, only on the i18n-specific code

> +/* Figure out what the name of the native character set is, so
> + that we can pass it to iconv_open. */
> +static const char *
> +find_native_charset (void)
> +{
> + /* First try $LC_CTYPE */
> + const char *dot, *ctype = getenv ("LC_CTYPE");
> +
> + /* Skip anything before the dot, if any */
> + if (ctype != NULL && (dot = strchr (ctype, '.')))
> + ctype = dot + 1;
> +
> + if (ctype != NULL && !strcmp (ctype, "iso_8859_1"))
> + ctype = "ISO8859-1";
> +
> + if (ctype == NULL || ! *ctype) {
> + /* Fall back to US-ASCII */
> + ctype = "646";
> + }
> +
> + return ctype;
> +}

That's horrible. It looks like code from the early 90s when nobody knew
how to use locales.

First, Unix requires a function named nl_langinfo() which returns just
the wanted information. So you should have something like

static const char *
find_native_charset (void)
{
#ifdef HAVE_NL_LANGINFO
  return nl_langinfo (CODESET);
#else
  ...
#endif
}

To support systems without nl_langinfo() you cannot simply look at the
LC_CTYPE environment variable. Its value need not have anything to do
with the selected locale. The order in which the setlocale() function
looks at environment variables for the LC_CTYPE category is

  LC_ALL -> LC_CTYPE -> LANG

I.e., if LC_ALL is set use it. Otherwise if LC_CTYPE is set, use this.
Else use LANG if set.

But your problems won't stop there. Charsets can have many different
names. Other systems provide different mechanisms to determine the
current charset etc etc.

Look at the file localcharset.c which is used in some GNU packages (and
other GPL'ed ones). I attach a copy. This is what you'll have to do.

> +static svn_error_t *
> +svn_utf_convert_to_stringbuf (iconv_t cd,
> + const char *src_data, apr_size_t src_length,
> + svn_stringbuf_t **dest,
> + apr_pool_t *pool)
> +{
> + /* 2 bytes per character will be enough in most cases.
> + If not, we'll make a larger buffer and try again. */
> + apr_size_t buflen = src_length * 2 + 1;
> +
> + if (cd == (iconv_t)-1)
> + return svn_error_create (0, errno, NULL, pool, "recoding string");
> +
> + do {
> +
> + char *destbuf = apr_palloc (pool, buflen);
> +
> + /* Set up state variables for iconv */
> + const char *srcptr = src_data;
> + char *destptr = destbuf;
> + size_t srclen = src_length;
> + size_t destlen = buflen;
> +
> + /* Attempt the conversion */
> + if (iconv(cd, &srcptr, &srclen, &destptr, &destlen) != (size_t)-1) {
> +
> + /* Conversion succeeded. Return buffer */
> + *dest = svn_stringbuf_ncreate (destbuf, buflen - destlen, pool);
> +
> + return SVN_NO_ERROR;

That doesn't seem right. Note that iconv() can return a value !=
(size_t)-1 and still have problems. That is a stupidity in the POSIX
spec which requires that characters which are valid in the input charset
but have no corresponding position in the output charset must be
ignored. In glibc this does not happen (without the user explicitly
saying so) since this is a security problem.

For you here it means that the loop should terminate of != (size_t)-1
but an error saying that the result couldn't be represented must be
returned if the value is != 0. Otherwise you are silently loosing
information.

-- 
---------------.                          ,-.   1325 Chesapeake Terrace
Ulrich Drepper  \    ,-------------------'   \  Sunnyvale, CA 94089 USA
Red Hat          `--' drepper at redhat.com   `------------------------

Received on Wed May 22 22:40:46 2002

This is an archived mail posted to the Subversion Dev mailing list.