using isalpha/isalnum in locale-independent code

From: <kfogel_at_collab.net>
Date: 2004-11-25 03:50:50 CET

I just got done with issue #1832, and realized that the problem
described there might be present elsewhere in the code too.

The Problem:

In is_valid_prop_name(), we were using isalpha()/isalnum() to test a
UTF8 string. (That's not as bad as it seems, because we were only
interested in the subset of UTF8 that is the same as lower ASCII.)
However, as Joe Orton pointed out, using isalpha(), etc, was bogus,
because those can be sensitive to locale, whereas a UTF8 string is a
UTF8 string no matter what locale it's in. Note that the APR
equivalents apr_isalpha(), apr_isalnum(), etc, just fall back to the
underlying C library implementation in most cases, and thus have the
same problem.

r12029 fixed is_valid_prop_name(). But where else might similar
things be going on in Subversion? I'll take the sites one by one:

   * subversion/libsvn_ra_svn/marshal.c (read_item):
     [...]
     else if (apr_isalpha(c))
       {
         /* It's a word. */
         str = svn_stringbuf_ncreate(&c, 1, pool);
         while (1)
           {
             SVN_ERR(readbuf_getchar(conn, pool, &c));
             if (!apr_isalnum(c) && c != '-')
               break;
             svn_stringbuf_appendbytes(str, &c, 1);
           }
         item->kind = SVN_RA_SVN_WORD;
         item->u.word = str->data;
       }
     [...]

This looks like it's potentially problematic. The ra_svn protocol
should never be locale-sensitive. A solution similar to r12029 could
be applied. Perhaps the r12029 changes could be abstracted out.
(Note there are some apr_isdigit() calls in marshal.c, they would need
to be fixed too.)

   * subversion/libsvn_subr/validate.c (svn_mime_type_validate):
     [...]
     if (! apr_isalnum (mime_type[len - 1]))
       return svn_error_createf
         (SVN_ERR_BAD_MIME_TYPE, NULL,
          _("MIME type '%s' ends with non-alphanumeric character"), mime_type);
     [...]

Same problem here. The svn:mime-type property's value is UTF8,
specifically the lower ASCII portion (or perhaps even a subset of
that). Also, technically, that function is using strcspn() and
strchr() with hardcoded C strings/chars ("; ", '/') as arguments. If
we were really paranoid, we could rewrite it to search for the
specific numeric values of the appropriate UTF8 characters, instead of
depending on C not using (say) EBCDIC.

   * subversion/libsvn_subr/opt.c (parse_one_rev):
     [...]
     else if (apr_isalpha (*str))
       {
         end = str + 1;
         while (apr_isalpha (*end))
           end++;
         save = *end;
         *end = '\0';
         if (revision_from_word (revision, str) != 0)
           return NULL;
         *end = save;
         return end;
       }
     [...]

This one is fine, because 'str' is in native encoding. Well,
actually, it's kind of hard to tell. I would think it *should* be
UTF8, and that the svn_opt* functions would take UTF8 strings just
like the rest of subversion. However, tracing the code back, it does
not look like this string is in UTF8. On the other hand, the doc
strings of this function and all the other functions on the way back
to the source don't seem to say anything about encoding :-(.

One possible course is to do nothing about any of these. We haven't
seen any actual bugs about them, unless I'm forgetting a report. If
the problems are theoretical only, then who cares?

Basically, I'm really not sure how urgent any of this is, and would
like others' opinions. Am I being paranoid, or is there a problem
here?

-Karl

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Thu Nov 25 03:53:29 2004

This message: [ Message body ]
Next message: Branko Čibej: "Re: using isalpha/isalnum in locale-independent code"
Previous message: VK Sameer: "Re: Supporting non-XML-safe pathnames"
Next in thread: Branko Čibej: "Re: using isalpha/isalnum in locale-independent code"
Reply: Branko Čibej: "Re: using isalpha/isalnum in locale-independent code"

Contemporary messages sorted: [ By Date ] [ By Thread ] [ By Subject ] [ By Author ] [ By messages with attachments ]