[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: UTF-8

From: Marcus Comstedt <marcus_at_mc.pp.se>
Date: 2002-05-22 23:59:51 CEST

Ulrich Drepper <drepper@redhat.com> writes:

> I cannot comment on the svn code, only on the i18n-specific code
>
> > +/* Figure out what the name of the native character set is, so
> > + that we can pass it to iconv_open. */
> > +static const char *
> > +find_native_charset (void)
> > +{
> > + /* First try $LC_CTYPE */
> > + const char *dot, *ctype = getenv ("LC_CTYPE");
> > +
> > + /* Skip anything before the dot, if any */
> > + if (ctype != NULL && (dot = strchr (ctype, '.')))
> > + ctype = dot + 1;
> > +
> > + if (ctype != NULL && !strcmp (ctype, "iso_8859_1"))
> > + ctype = "ISO8859-1";
> > +
> > + if (ctype == NULL || ! *ctype) {
> > + /* Fall back to US-ASCII */
> > + ctype = "646";
> > + }
> > +
> > + return ctype;
> > +}
>
> That's horrible. It looks like code from the early 90s when nobody knew
> how to use locales.

I _said_ that code was horrible. :-) If you want to make a better
one, please be my guest. I have been concentrating on inserting the
appropriate calls all over svn, not on the actual implementation of
the conversion functions.

> First, Unix requires a function named nl_langinfo() which returns just
> the wanted information. So you should have something like
>
> static const char *
> find_native_charset (void)
> {
> #ifdef HAVE_NL_LANGINFO
> return nl_langinfo (CODESET);
> #else
> ...
> #endif
> }

Cool. I sat for an whole hour trying to find a function that did
that, without succeeding. :-)

> To support systems without nl_langinfo() you cannot simply look at the
> LC_CTYPE environment variable. Its value need not have anything to do
> with the selected locale. The order in which the setlocale() function
> looks at environment variables for the LC_CTYPE category is
>
> LC_ALL -> LC_CTYPE -> LANG
>
> I.e., if LC_ALL is set use it. Otherwise if LC_CTYPE is set, use this.
> Else use LANG if set.

I know. It was a quick hack, as I suspected there had to be a better
way to do it...

> But your problems won't stop there. Charsets can have many different
> names. Other systems provide different mechanisms to determine the
> current charset etc etc.
>
> Look at the file localcharset.c which is used in some GNU packages (and
> other GPL'ed ones). I attach a copy. This is what you'll have to do.

Nice. I'll take a look at it.

> > +static svn_error_t *
> > +svn_utf_convert_to_stringbuf (iconv_t cd,
> > + const char *src_data, apr_size_t src_length,
> > + svn_stringbuf_t **dest,
> > + apr_pool_t *pool)
> > +{
> > + /* 2 bytes per character will be enough in most cases.
> > + If not, we'll make a larger buffer and try again. */
> > + apr_size_t buflen = src_length * 2 + 1;
> > +
> > + if (cd == (iconv_t)-1)
> > + return svn_error_create (0, errno, NULL, pool, "recoding string");
> > +
> > + do {
> > +
> > + char *destbuf = apr_palloc (pool, buflen);
> > +
> > + /* Set up state variables for iconv */
> > + const char *srcptr = src_data;
> > + char *destptr = destbuf;
> > + size_t srclen = src_length;
> > + size_t destlen = buflen;
> > +
> > + /* Attempt the conversion */
> > + if (iconv(cd, &srcptr, &srclen, &destptr, &destlen) != (size_t)-1) {
> > +
> > + /* Conversion succeeded. Return buffer */
> > + *dest = svn_stringbuf_ncreate (destbuf, buflen - destlen, pool);
> > +
> > + return SVN_NO_ERROR;
>
> That doesn't seem right. Note that iconv() can return a value !=
> (size_t)-1 and still have problems. That is a stupidity in the POSIX
> spec which requires that characters which are valid in the input charset
> but have no corresponding position in the output charset must be
> ignored. In glibc this does not happen (without the user explicitly
> saying so) since this is a security problem.
>
> For you here it means that the loop should terminate of != (size_t)-1
> but an error saying that the result couldn't be represented must be
> returned if the value is != 0. Otherwise you are silently loosing
> information.

Ah. Right. Should have read the RETURN VALUES section of the
manpage more closely.

Thanks for your input!

  // Marcus

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Thu May 23 00:04:48 2002

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.