
UTF-8 problem: non-UTF-8 in a UTF-8 locale

From: Philip Martin <philip_at_codematters.co.uk>
Date: 2004-02-06 01:51:47 CET

Consider

$ export LANG=en_GB.UTF-8
$ svnadmin create /tmp/repo
$ svn co http://localhost/tmp/repo wc
$ svn mkdir `printf "wc/abc\xC0def"`
$ svn ci -m "" wc
$ rm -rf wc
$ svn co http://localhost/tmp/repo wc
../svn/subversion/libsvn_ra_dav/util.c:661: (apr_err=175002)
svn: REPORT request failed on '/obj/repo/!svn/vcc/default'
../svn/subversion/libsvn_ra_dav/util.c:647: (apr_err=175002)
svn: The REPORT request returned invalid XML in the response: XML parse error at line 10: Bytes: 0xC0 0x78 0x78 0x78
. (/obj/repo/!svn/vcc/default)

The mkdir command creates a directory with a non-UTF-8 name. If I
repeat but with LANG=C the mkdir command fails with a "can't recode"
error. Using a non-UTF-8 locale catches the invalid input, whereas a
UTF-8 locale lets it through. The same problem occurs with import and
recursive add, and in those cases the dodgy name may not even appear
on the command line. It appears that when the native to UTF-8
conversion is a trivial UTF-8 to UTF-8 conversion there is no
validation of the native encoding.
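
As a rough illustration of why the name above is bad input (this is not
part of the patch, and the helper name is made up): the octets 0xC0, 0xC1
and 0xF5-0xFF can never occur anywhere in well-formed UTF-8, so a scan for
them is enough to flag the example name. It is only a partial check; a
result of -1 does not mean the data is valid.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the patch: return the offset of the
   first octet that can never occur in well-formed UTF-8 (0xC0, 0xC1 or
   0xF5-0xFF), or -1 if no such octet is present.  A result of -1 does
   NOT prove that the data is valid UTF-8. */
static long
first_impossible_octet (const char *data, size_t len)
{
  size_t i;

  assert (data != NULL || len == 0);
  for (i = 0; i < len; ++i)
    {
      unsigned char octet = (unsigned char) data[i];
      if (octet == 0xC0 || octet == 0xC1 || octet >= 0xF5)
        return (long) i;
    }
  return -1;
}
```

For the directory name in the example, written in C as "abc\xC0" "def"
(split so that the hex escape does not swallow the following letters),
the offending octet is at offset 3.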

UTF-8 is defined by RFC2279, but it appears that GNU iconv uses the
more restrictive rules defined by Unicode, such as found in section
3.9 of http://www.unicode.org/versions/Unicode4.0.0/bookmarks.html
The Unicode rules are simple, so I wrote a finite state machine to
implement them (in fact I wrote two implementations, one to test the
other).
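
To give a feel for how simple the Unicode rules are, here is a minimal
sketch of a range-based check. It is neither of the two implementations
in the patch; the function name is hypothetical, and the octet ranges are
the ones the patch's state names encode (for example E0 must be followed
by A0-BF, ED by 80-9F, F0 by 90-BF and F4 by 80-8F).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the patch: return 1 if the LEN octets
   at DATA are well-formed UTF-8 under the Unicode rules, else 0. */
static int
utf8_well_formed (const char *data, size_t len)
{
  const unsigned char *s = (const unsigned char *) data;
  size_t i = 0;

  assert (data != NULL || len == 0);
  while (i < len)
    {
      unsigned char c = s[i];
      unsigned char lo = 0x80, hi = 0xBF; /* allowed range, second octet */
      size_t trail, k;                    /* trail = octets after the lead */

      if (c <= 0x7F)
        {
          ++i;                            /* plain ASCII */
          continue;
        }
      else if (c >= 0xC2 && c <= 0xDF)
        trail = 1;
      else if (c >= 0xE0 && c <= 0xEF)
        {
          trail = 2;
          if (c == 0xE0)
            lo = 0xA0;                    /* rejects non-shortest forms */
          else if (c == 0xED)
            hi = 0x9F;                    /* rejects surrogate code points */
        }
      else if (c >= 0xF0 && c <= 0xF4)
        {
          trail = 3;
          if (c == 0xF0)
            lo = 0x90;                    /* rejects non-shortest forms */
          else if (c == 0xF4)
            hi = 0x8F;                    /* rejects values above U+10FFFF */
        }
      else
        return 0;                         /* 0x80-0xC1, 0xF5-0xFF can't lead */

      if (len - i <= trail)
        return 0;                         /* sequence is truncated */
      if (s[i + 1] < lo || s[i + 1] > hi)
        return 0;
      for (k = 2; k <= trail; ++k)
        if (s[i + k] < 0x80 || s[i + k] > 0xBF)
          return 0;
      i += trail + 1;
    }
  return 1;
}
```

Unlike a table-driven state machine this needs no lookup tables, at the
cost of more branching per octet.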

Things to consider

- Is there existing code (outside Subversion) to do UTF-8 validation?
  I looked briefly but didn't find any.

- Where should this hook into Subversion? At present I have the
  svn_utf_xxx_to_utf functions validating all conversions. This
  catches problems like the ones at the start of this mail, but
  perhaps it should be done in libsvn_fs?

- The validation runs irrespective of the current locale because there
  doesn't appear to be any way to determine if we are using a UTF-8
  locale. It's probably redundant to validate if the locale is not
  UTF-8 as the UTF-8 conversion is effectively a validation. The
  validation code should be fast but it does pull in about 400 bytes
  of lookup table.
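
The "about 400 bytes" figure can be checked from the table dimensions in
the patch: a 256-entry octet category table plus a 9-state by 14-category
transition table, one char per entry, which comes to 256 + 9*14 = 382
bytes. A small sketch of that arithmetic (the helper name is made up for
illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes used by a 256-entry octet category table
   plus a STATES x CATEGORIES transition table, one char per entry. */
static size_t
fsm_table_bytes (size_t states, size_t categories)
{
  assert (states > 0 && categories > 0);
  return 256 * sizeof (char) + states * categories * sizeof (char);
}
```

With the patch's dimensions, fsm_table_bytes (9, 14) gives 382.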

I've included the patch for completeness; I'll probably commit it in a
day or two if nobody objects.

Detect invalid UTF-8 inputs when the native encoding is UTF-8.

* build.conf: Add utf-test.

* win-tests.py: Add utf-test.exe.

* subversion/libsvn_subr/utf.c
  (invalid_utf8, check_utf8, check_cstring_utf8): New functions.
  (svn_utf_stringbuf_to_utf8, svn_utf_string_to_utf8,
   svn_utf_cstring_to_utf8, svn_utf_cstring_to_utf8_ex): Add validity
   check.

* subversion/libsvn_subr/utf_impl.h
  (svn_utf__last_valid, svn_utf__is_valid, svn_utf__cstring_is_valid,
   svn_utf__last_valid2): New functions.

* subversion/libsvn_subr/utf_validate.c: New file.

* subversion/tests/libsvn_subr/utf-test.c: New file.

Index: ../svn/build.conf
===================================================================
--- ../svn/build.conf (revision 8567)
+++ ../svn/build.conf (working copy)
@@ -524,6 +524,14 @@
 install = test
 libs = libsvn_test libsvn_delta libsvn_wc libsvn_subr aprutil apriconv apr
 
+# test utf functions
+[utf-test]
+type = exe
+path = subversion/tests/libsvn_subr
+sources = utf-test.c
+install = test
+libs = libsvn_test libsvn_subr aprutil apriconv apr
+
 # test svn_config utilities
 [config-test]
 type = exe
Index: ../svn/win-tests.py
===================================================================
--- ../svn/win-tests.py (revision 8567)
+++ ../svn/win-tests.py (working copy)
@@ -18,6 +18,7 @@
          'subversion/tests/libsvn_subr/path-test.exe',
          'subversion/tests/libsvn_subr/stream-test.exe',
          'subversion/tests/libsvn_subr/time-test.exe',
+         'subversion/tests/libsvn_subr/utf-test.exe',
          'subversion/tests/libsvn_wc/translate-test.exe',
          'subversion/tests/libsvn_diff/diff-diff3-test.exe',
          'subversion/tests/libsvn_delta/random-test.exe',
Index: ../svn/subversion/libsvn_subr/utf.c
===================================================================
--- ../svn/subversion/libsvn_subr/utf.c (revision 8567)
+++ ../svn/subversion/libsvn_subr/utf.c (working copy)
@@ -232,7 +232,62 @@
   return SVN_NO_ERROR;
 }
 
+/* Construct an error with a suitable message to describe the invalid UTF-8
+ * sequence DATA of length LEN (which may have embedded NULs).  We can't
+ * simply print the data since, almost by definition, we don't really know
+ * how it is encoded.
+ */
+static svn_error_t *
+invalid_utf8 (const char *data, apr_size_t len, apr_pool_t *pool)
+{
+  const char *last = svn_utf__last_valid (data, len);
+  const char *msg = "Valid UTF-8 data\n(hex:";
+  int i, valid, invalid;
 
+  /* We will display at most 24 valid octets (this may split a leading
+     multi-byte character) as that should fit on one 80 character line. */
+  valid = last - data;
+  if (valid > 24)
+    valid = 24;
+  for (i = 0; i < valid; ++i)
+    msg = apr_pstrcat (pool, msg, apr_psprintf (pool, " %02x",
+                                                (unsigned char)last[i-valid]),
+                       NULL);
+  msg = apr_pstrcat (pool, msg,
+                     ")\nfollowed by invalid UTF-8 sequence\n(hex:", NULL);
+
+  /* 4 invalid octets will guarantee that the faulty octet is displayed */
+  invalid = data + len - last;
+  if (invalid > 4)
+    invalid = 4;
+  for (i = 0; i < invalid; ++i)
+    msg = apr_pstrcat (pool, msg, apr_psprintf (pool, " %02x",
+                                                (unsigned char)last[i]), NULL);
+  msg = apr_pstrcat (pool, msg, ")", NULL);
+
+  return svn_error_create (APR_EINVAL, NULL, msg);
+}
+
+/* Verify that the sequence DATA of length LEN is valid UTF-8 */
+static svn_error_t *
+check_utf8 (const char *data, apr_size_t len, apr_pool_t *pool)
+{
+  if (! svn_utf__is_valid (data, len))
+    return invalid_utf8 (data, len, pool);
+  return SVN_NO_ERROR;
+}
+
+/* Verify that the NUL terminated sequence DATA is valid UTF-8 */
+static svn_error_t *
+check_cstring_utf8 (const char *data, apr_pool_t *pool)
+{
+
+  if (! svn_utf__cstring_is_valid (data))
+    return invalid_utf8 (data, strlen (data), pool);
+  return SVN_NO_ERROR;
+}
+
+
 svn_error_t *
 svn_utf_stringbuf_to_utf8 (svn_stringbuf_t **dest,
                            const svn_stringbuf_t *src,
@@ -243,7 +298,10 @@
   SVN_ERR (get_ntou_xlate_handle (&convset, pool));
 
   if (convset)
-    return convert_to_stringbuf (convset, src->data, src->len, dest, pool);
+    {
+      SVN_ERR (convert_to_stringbuf (convset, src->data, src->len, dest, pool));
+      return check_utf8 ((*dest)->data, (*dest)->len, pool);
+    }
   else
     {
       SVN_ERR (check_non_ascii (src->data, src->len, pool));
@@ -267,6 +325,7 @@
     {
       SVN_ERR (convert_to_stringbuf (convset, src->data, src->len,
                                      &destbuf, pool));
+      SVN_ERR (check_utf8 (destbuf->data, destbuf->len, pool));
       *dest = svn_string_create_from_buf (destbuf, pool);
     }
   else
@@ -315,6 +374,7 @@
 
   SVN_ERR (get_ntou_xlate_handle (&convset, pool));
   SVN_ERR (convert_cstring (dest, src, convset, pool));
+  SVN_ERR (check_cstring_utf8 (*dest, pool));
 
   return SVN_NO_ERROR;
 }
@@ -331,6 +391,7 @@
 
   SVN_ERR (get_xlate_handle (&convset, "UTF-8", frompage, convset_key, pool));
   SVN_ERR (convert_cstring (dest, src, convset, pool));
+  SVN_ERR (check_cstring_utf8 (*dest, pool));
 
   return SVN_NO_ERROR;
 }
Index: ../svn/subversion/libsvn_subr/utf_validate.c
===================================================================
--- ../svn/subversion/libsvn_subr/utf_validate.c (revision 0)
+++ ../svn/subversion/libsvn_subr/utf_validate.c (revision 0)
@@ -0,0 +1,359 @@
+/*
+ * utf_validate.c: Validate a UTF-8 string
+ *
+ * ====================================================================
+ * Copyright (c) 2004 CollabNet. All rights reserved.
+ *
+ * This software is licensed as described in the file COPYING, which
+ * you should have received as part of this distribution. The terms
+ * are also available at http://subversion.tigris.org/license-1.html.
+ * If newer versions of this license are posted there, you may use a
+ * newer version instead, at your option.
+ *
+ * This software consists of voluntary contributions made by many
+ * individuals. For exact contribution history, see the revision
+ * history and logs, available at http://subversion.tigris.org/.
+ * ====================================================================
+ */
+
+/* Validate a UTF-8 string according to the rules in
+ *
+ * Table 3-6. Well-Formed UTF-8 Byte Sequences
+ *
+ * in
+ *
+ * The Unicode Standard, Version 4.0
+ *
+ * which is available at
+ *
+ * http://www.unicode.org/
+ *
+ * UTF-8 was originally defined in RFC-2279; Unicode's "well-formed UTF-8"
+ * is a subset of that encoding.  The Unicode encoding prohibits things
+ * like non-shortest encodings (some characters can be represented by more
+ * than one multi-byte encoding) and the encodings for the surrogate code
+ * points.
+ */
+
+#include "utf_impl.h"
+
+/* Lookup table to categorise each octet in the string. */
+static char octet_category[256] = {
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* 0x00-0x7f */
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 0x80-0x8f */
+ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, /* 0x90-0x9f */
+ 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, /* 0xa0-0xbf */
+ 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
+ 4, 4, /* 0xc0-0xc1 */
+ 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, /* 0xc2-0xdf */
+ 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
+ 6, /* 0xe0 */
+ 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, /* 0xe1-0xec */
+ 8, /* 0xed */
+ 9, 9, /* 0xee-0xef */
+ 10, /* 0xf0 */
+ 11, 11, 11, /* 0xf1-0xf3 */
+ 12, /* 0xf4 */
+ 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13 /* 0xf5-0xff */
+};
+
+/* Machine states */
+#define FSM_START 0
+#define FSM_80BF 1
+#define FSM_A0BF 2
+#define FSM_80BF80BF 3
+#define FSM_809F 4
+#define FSM_90BF 5
+#define FSM_80BF80BF80BF 6
+#define FSM_808F 7
+#define FSM_ERROR 8
+
+/* In the FSM it appears that categories 0xc0-0xc1 and 0xf5-0xff make the
+   same transitions, as do categories 0xe1-0xec and 0xee-0xef.  I wonder if
+   there is any great benefit in combining categories?  It would reduce the
+   memory footprint of the transition table by 18 bytes (two columns in
+   each of nine rows), but might it be harder to understand? */
+
+/* Machine transition table */
+static char machine [9][14] = {
+ /* FSM_START */
+ {FSM_START, /* 0x00-0x7f */
+ FSM_ERROR, /* 0x80-0x8f */
+ FSM_ERROR, /* 0x90-0x9f */
+ FSM_ERROR, /* 0xa0-0xbf */
+ FSM_ERROR, /* 0xc0-0xc1 */
+ FSM_80BF, /* 0xc2-0xdf */
+ FSM_A0BF, /* 0xe0 */
+ FSM_80BF80BF, /* 0xe1-0xec */
+ FSM_809F, /* 0xed */
+ FSM_80BF80BF, /* 0xee-0xef */
+ FSM_90BF, /* 0xf0 */
+ FSM_80BF80BF80BF, /* 0xf1-0xf3 */
+ FSM_808F, /* 0xf4 */
+ FSM_ERROR}, /* 0xf5-0xff */
+
+ /* FSM_80BF */
+ {FSM_ERROR, /* 0x00-0x7f */
+ FSM_START, /* 0x80-0x8f */
+ FSM_START, /* 0x90-0x9f */
+ FSM_START, /* 0xa0-0xbf */
+ FSM_ERROR, /* 0xc0-0xc1 */
+ FSM_ERROR, /* 0xc2-0xdf */
+ FSM_ERROR, /* 0xe0 */
+ FSM_ERROR, /* 0xe1-0xec */
+ FSM_ERROR, /* 0xed */
+ FSM_ERROR, /* 0xee-0xef */
+ FSM_ERROR, /* 0xf0 */
+ FSM_ERROR, /* 0xf1-0xf3 */
+ FSM_ERROR, /* 0xf4 */
+ FSM_ERROR}, /* 0xf5-0xff */
+
+ /* FSM_A0BF */
+ {FSM_ERROR, /* 0x00-0x7f */
+ FSM_ERROR, /* 0x80-0x8f */
+ FSM_ERROR, /* 0x90-0x9f */
+ FSM_80BF, /* 0xa0-0xbf */
+ FSM_ERROR, /* 0xc0-0xc1 */
+ FSM_ERROR, /* 0xc2-0xdf */
+ FSM_ERROR, /* 0xe0 */
+ FSM_ERROR, /* 0xe1-0xec */
+ FSM_ERROR, /* 0xed */
+ FSM_ERROR, /* 0xee-0xef */
+ FSM_ERROR, /* 0xf0 */
+ FSM_ERROR, /* 0xf1-0xf3 */
+ FSM_ERROR, /* 0xf4 */
+ FSM_ERROR}, /* 0xf5-0xff */
+
+ /* FSM_80BF80BF */
+ {FSM_ERROR, /* 0x00-0x7f */
+ FSM_80BF, /* 0x80-0x8f */
+ FSM_80BF, /* 0x90-0x9f */
+ FSM_80BF, /* 0xa0-0xbf */
+ FSM_ERROR, /* 0xc0-0xc1 */
+ FSM_ERROR, /* 0xc2-0xdf */
+ FSM_ERROR, /* 0xe0 */
+ FSM_ERROR, /* 0xe1-0xec */
+ FSM_ERROR, /* 0xed */
+ FSM_ERROR, /* 0xee-0xef */
+ FSM_ERROR, /* 0xf0 */
+ FSM_ERROR, /* 0xf1-0xf3 */
+ FSM_ERROR, /* 0xf4 */
+ FSM_ERROR}, /* 0xf5-0xff */
+
+ /* FSM_809F */
+ {FSM_ERROR, /* 0x00-0x7f */
+ FSM_80BF, /* 0x80-0x8f */
+ FSM_80BF, /* 0x90-0x9f */
+ FSM_ERROR, /* 0xa0-0xbf */
+ FSM_ERROR, /* 0xc0-0xc1 */
+ FSM_ERROR, /* 0xc2-0xdf */
+ FSM_ERROR, /* 0xe0 */
+ FSM_ERROR, /* 0xe1-0xec */
+ FSM_ERROR, /* 0xed */
+ FSM_ERROR, /* 0xee-0xef */
+ FSM_ERROR, /* 0xf0 */
+ FSM_ERROR, /* 0xf1-0xf3 */
+ FSM_ERROR, /* 0xf4 */
+ FSM_ERROR}, /* 0xf5-0xff */
+
+ /* FSM_90BF */
+ {FSM_ERROR, /* 0x00-0x7f */
+ FSM_ERROR, /* 0x80-0x8f */
+ FSM_80BF80BF, /* 0x90-0x9f */
+ FSM_80BF80BF, /* 0xa0-0xbf */
+ FSM_ERROR, /* 0xc0-0xc1 */
+ FSM_ERROR, /* 0xc2-0xdf */
+ FSM_ERROR, /* 0xe0 */
+ FSM_ERROR, /* 0xe1-0xec */
+ FSM_ERROR, /* 0xed */
+ FSM_ERROR, /* 0xee-0xef */
+ FSM_ERROR, /* 0xf0 */
+ FSM_ERROR, /* 0xf1-0xf3 */
+ FSM_ERROR, /* 0xf4 */
+ FSM_ERROR}, /* 0xf5-0xff */
+
+ /* FSM_80BF80BF80BF */
+ {FSM_ERROR, /* 0x00-0x7f */
+ FSM_80BF80BF, /* 0x80-0x8f */
+ FSM_80BF80BF, /* 0x90-0x9f */
+ FSM_80BF80BF, /* 0xa0-0xbf */
+ FSM_ERROR, /* 0xc0-0xc1 */
+ FSM_ERROR, /* 0xc2-0xdf */
+ FSM_ERROR, /* 0xe0 */
+ FSM_ERROR, /* 0xe1-0xec */
+ FSM_ERROR, /* 0xed */
+ FSM_ERROR, /* 0xee-0xef */
+ FSM_ERROR, /* 0xf0 */
+ FSM_ERROR, /* 0xf1-0xf3 */
+ FSM_ERROR, /* 0xf4 */
+ FSM_ERROR}, /* 0xf5-0xff */
+
+ /* FSM_808F */
+ {FSM_ERROR, /* 0x00-0x7f */
+ FSM_80BF80BF, /* 0x80-0x8f */
+ FSM_ERROR, /* 0x90-0x9f */
+ FSM_ERROR, /* 0xa0-0xbf */
+ FSM_ERROR, /* 0xc0-0xc1 */
+ FSM_ERROR, /* 0xc2-0xdf */
+ FSM_ERROR, /* 0xe0 */
+ FSM_ERROR, /* 0xe1-0xec */
+ FSM_ERROR, /* 0xed */
+ FSM_ERROR, /* 0xee-0xef */
+ FSM_ERROR, /* 0xf0 */
+ FSM_ERROR, /* 0xf1-0xf3 */
+ FSM_ERROR, /* 0xf4 */
+ FSM_ERROR}, /* 0xf5-0xff */
+
+ /* FSM_ERROR */
+ {FSM_ERROR, /* 0x00-0x7f */
+ FSM_ERROR, /* 0x80-0x8f */
+ FSM_ERROR, /* 0x90-0x9f */
+ FSM_ERROR, /* 0xa0-0xbf */
+ FSM_ERROR, /* 0xc0-0xc1 */
+ FSM_ERROR, /* 0xc2-0xdf */
+ FSM_ERROR, /* 0xe0 */
+ FSM_ERROR, /* 0xe1-0xec */
+ FSM_ERROR, /* 0xed */
+ FSM_ERROR, /* 0xee-0xef */
+ FSM_ERROR, /* 0xf0 */
+ FSM_ERROR, /* 0xf1-0xf3 */
+ FSM_ERROR, /* 0xf4 */
+ FSM_ERROR}, /* 0xf5-0xff */
+};
+
+
+const char *
+svn_utf__last_valid(const char *data, apr_size_t len)
+{
+  const char *start = data, *end = data + len;
+  int state = FSM_START;
+  while (data < end)
+    {
+      unsigned char octet = *data++;
+      int category = octet_category[octet];
+      state = machine[state][category];
+      if (state == FSM_START)
+        start = data;
+    }
+  return start;
+}
+
+svn_boolean_t
+svn_utf__cstring_is_valid(const char *data)
+{
+  int state = FSM_START;
+  while (*data)
+    {
+      unsigned char octet = *data++;
+      int category = octet_category[octet];
+      state = machine[state][category];
+    }
+  return state == FSM_START ? TRUE : FALSE;
+}
+
+svn_boolean_t
+svn_utf__is_valid(const char *data, apr_size_t len)
+{
+  const char *end = data + len;
+  int state = FSM_START;
+  while (data < end)
+    {
+      unsigned char octet = *data++;
+      int category = octet_category[octet];
+      state = machine[state][category];
+    }
+  return state == FSM_START ? TRUE : FALSE;
+}
+
+const char *
+svn_utf__last_valid2(const char *data, apr_size_t len)
+{
+  const char *start = data, *end = data + len;
+  int state = FSM_START;
+  while (data < end)
+    {
+      unsigned char octet = *data++;
+      switch (state)
+        {
+        case FSM_START:
+          if (octet <= 0x7F)
+            break;
+          else if (octet <= 0xC1)
+            state = FSM_ERROR;
+          else if (octet <= 0xDF)
+            state = FSM_80BF;
+          else if (octet == 0xE0)
+            state = FSM_A0BF;
+          else if (octet <= 0xEC)
+            state = FSM_80BF80BF;
+          else if (octet == 0xED)
+            state = FSM_809F;
+          else if (octet <= 0xEF)
+            state = FSM_80BF80BF;
+          else if (octet == 0xF0)
+            state = FSM_90BF;
+          else if (octet <= 0xF3)
+            state = FSM_80BF80BF80BF;
+          else if (octet <= 0xF4)
+            state = FSM_808F;
+          else
+            state = FSM_ERROR;
+          break;
+        case FSM_80BF:
+          if (octet >= 0x80 && octet <= 0xBF)
+            state = FSM_START;
+          else
+            state = FSM_ERROR;
+          break;
+        case FSM_A0BF:
+          if (octet >= 0xA0 && octet <= 0xBF)
+            state = FSM_80BF;
+          else
+            state = FSM_ERROR;
+          break;
+        case FSM_80BF80BF:
+          if (octet >= 0x80 && octet <= 0xBF)
+            state = FSM_80BF;
+          else
+            state = FSM_ERROR;
+          break;
+        case FSM_809F:
+          if (octet >= 0x80 && octet <= 0x9F)
+            state = FSM_80BF;
+          else
+            state = FSM_ERROR;
+          break;
+        case FSM_90BF:
+          if (octet >= 0x90 && octet <= 0xBF)
+            state = FSM_80BF80BF;
+          else
+            state = FSM_ERROR;
+          break;
+        case FSM_80BF80BF80BF:
+          if (octet >= 0x80 && octet <= 0xBF)
+            state = FSM_80BF80BF;
+          else
+            state = FSM_ERROR;
+          break;
+        case FSM_808F:
+          if (octet >= 0x80 && octet <= 0x8F)
+            state = FSM_80BF80BF;
+          else
+            state = FSM_ERROR;
+          break;
+        default:
+        case FSM_ERROR:
+          return start;
+        }
+      if (state == FSM_START)
+        start = data;
+    }
+  return start;
+}
Index: ../svn/subversion/libsvn_subr/utf_impl.h
===================================================================
--- ../svn/subversion/libsvn_subr/utf_impl.h (revision 8567)
+++ ../svn/subversion/libsvn_subr/utf_impl.h (working copy)
@@ -38,6 +38,33 @@
                                                apr_pool_t *));
 
 
+/* Return a pointer to the first character after the last valid UTF-8
+ * multi-byte character in the string SRC of length LEN.  If SRC is valid
+ * UTF-8 the return value will be SRC+LEN (one past the last byte),
+ * otherwise it will point to the start of the first invalid multi-byte
+ * character.  In either case all the characters between SRC and the
+ * return pointer are valid UTF-8.
+ */
+const char *svn_utf__last_valid(const char *src, apr_size_t len);
+
+/* Return TRUE if the string SRC of length LEN is a valid UTF-8 encoding
+ * according to the rules laid down by the Unicode 4.0 standard, FALSE
+ * otherwise. This function is faster than svn_utf__last_valid.
+ */
+svn_boolean_t svn_utf__is_valid (const char *src, apr_size_t len);
+
+/* As for svn_utf__is_valid but SRC is NUL terminated. */
+svn_boolean_t svn_utf__cstring_is_valid (const char *src);
+
+/* As for svn_utf__last_valid but uses a different implementation without
+ lookup tables. It avoids the table memory use (about 400 bytes) but the
+ function is longer (about 200 bytes extra) and likely to be slower when
+ the string is valid. If the string is invalid this function may be
+ faster since it returns immediately rather than continuing to the end of
+ the string. The main reason this function exists is to test the table
+ driven implementation. */
+const char *svn_utf__last_valid2 (const char *src, apr_size_t len);
+
 #ifdef __cplusplus
 }
 #endif /* __cplusplus */
Index: ../svn/subversion/tests/libsvn_subr/utf-test.c
===================================================================
--- ../svn/subversion/tests/libsvn_subr/utf-test.c (revision 0)
+++ ../svn/subversion/tests/libsvn_subr/utf-test.c (revision 0)
@@ -0,0 +1,232 @@
+/*
+ * utf-test.c -- test the utf functions
+ *
+ * ====================================================================
+ * Copyright (c) 2004 CollabNet. All rights reserved.
+ *
+ * This software is licensed as described in the file COPYING, which
+ * you should have received as part of this distribution. The terms
+ * are also available at http://subversion.tigris.org/license-1.html.
+ * If newer versions of this license are posted there, you may use a
+ * newer version instead, at your option.
+ *
+ * This software consists of voluntary contributions made by many
+ * individuals. For exact contribution history, see the revision
+ * history and logs, available at http://subversion.tigris.org/.
+ * ====================================================================
+ */
+
+#include "../../libsvn_subr/utf_impl.h"
+#include "svn_test.h"
+
+/* Random number seed. Yes, it's global, just pretend you can't see it. */
+static apr_uint32_t diff_diff3_seed;
+
+/* Return the value of the current random number seed, initializing it if
+ necessary */
+static apr_uint32_t
+seed_val (void)
+{
+ static svn_boolean_t first = TRUE;
+
+ if (first)
+ {
+ diff_diff3_seed = (apr_uint32_t) apr_time_now ();
+ first = FALSE;
+ }
+
+ return diff_diff3_seed;
+}
+
+/* Return a random number N such that MIN_VAL <= N <= MAX_VAL */
+static apr_uint32_t
+range_rand (apr_uint32_t min_val,
+ apr_uint32_t max_val)
+{
+ apr_uint64_t diff = max_val - min_val;
+ apr_uint64_t val = diff * svn_test_rand (&diff_diff3_seed);
+ val /= 0xffffffff;
+ return min_val + (apr_uint32_t) val;
+}
+
+/* Explicit tests of various valid/invalid sequences */
+static svn_error_t *
+utf_validate (const char **msg,
+ svn_boolean_t msg_only,
+ apr_pool_t *pool)
+{
+ struct data {
+ svn_boolean_t valid;
+ char string[20];
+ } tests[] = {
+ {TRUE, {'a', 'b', '\0'}},
+ {FALSE, {'a', 'b', '\x80', '\0'}},
+
+ {FALSE, {'a', 'b', '\xC0', '\0'}},
+ {FALSE, {'a', 'b', '\xC0', '\x81', 'x', 'y', '\0'}},
+
+ {TRUE, {'a', 'b', '\xC5', '\x81', 'x', 'y', '\0'}},
+ {FALSE, {'a', 'b', '\xC5', '\xC0', 'x', 'y', '\0'}},
+
+ {FALSE, {'a', 'b', '\xE0', '\0'}},
+ {FALSE, {'a', 'b', '\xE0', 'x', 'y', '\0'}},
+ {FALSE, {'a', 'b', '\xE0', '\xA0', '\0'}},
+ {FALSE, {'a', 'b', '\xE0', '\xA0', 'x', 'y', '\0'}},
+ {TRUE, {'a', 'b', '\xE0', '\xA0', '\x81', 'x', 'y', '\0'}},
+ {FALSE, {'a', 'b', '\xE0', '\x9F', '\x81', 'x', 'y', '\0'}},
+ {FALSE, {'a', 'b', '\xE0', '\xCF', '\x81', 'x', 'y', '\0'}},
+
+ {FALSE, {'a', 'b', '\xE5', '\0'}},
+ {FALSE, {'a', 'b', '\xE5', 'x', 'y', '\0'}},
+ {FALSE, {'a', 'b', '\xE5', '\x81', '\0'}},
+ {FALSE, {'a', 'b', '\xE5', '\x81', 'x', 'y', '\0'}},
+ {TRUE, {'a', 'b', '\xE5', '\x81', '\x81', 'x', 'y', '\0'}},
+ {FALSE, {'a', 'b', '\xE5', '\xE1', '\x81', 'x', 'y', '\0'}},
+ {FALSE, {'a', 'b', '\xE5', '\x81', '\xE1', 'x', 'y', '\0'}},
+
+ {FALSE, {'a', 'b', '\xED', '\0'}},
+ {FALSE, {'a', 'b', '\xED', 'x', 'y', '\0'}},
+ {FALSE, {'a', 'b', '\xED', '\x81', '\0'}},
+ {FALSE, {'a', 'b', '\xED', '\x81', 'x', 'y', '\0'}},
+ {TRUE, {'a', 'b', '\xED', '\x81', '\x81', 'x', 'y', '\0'}},
+ {FALSE, {'a', 'b', '\xED', '\xA0', '\x81', 'x', 'y', '\0'}},
+ {FALSE, {'a', 'b', '\xED', '\x81', '\xC1', 'x', 'y', '\0'}},
+
+ {FALSE, {'a', 'b', '\xEE', '\0'}},
+ {FALSE, {'a', 'b', '\xEE', 'x', 'y', '\0'}},
+ {FALSE, {'a', 'b', '\xEE', '\x81', '\0'}},
+ {FALSE, {'a', 'b', '\xEE', '\x81', 'x', 'y', '\0'}},
+ {TRUE, {'a', 'b', '\xEE', '\x81', '\x81', 'x', 'y', '\0'}},
+ {TRUE, {'a', 'b', '\xEE', '\xA0', '\x81', 'x', 'y', '\0'}},
+ {FALSE, {'a', 'b', '\xEE', '\xC0', '\x81', 'x', 'y', '\0'}},
+ {FALSE, {'a', 'b', '\xEE', '\x81', '\xC1', 'x', 'y', '\0'}},
+
+ {FALSE, {'a', 'b', '\xF0', '\0'}},
+ {FALSE, {'a', 'b', '\xF0', 'x', 'y', '\0'}},
+ {FALSE, {'a', 'b', '\xF0', '\x91', '\0'}},
+ {FALSE, {'a', 'b', '\xF0', '\x91', 'x', 'y', '\0'}},
+ {FALSE, {'a', 'b', '\xF0', '\x91', '\x81', '\0'}},
+ {FALSE, {'a', 'b', '\xF0', '\x91', '\x81', 'x', 'y', '\0'}},
+ {TRUE, {'a', 'b', '\xF0', '\x91', '\x81', '\x81', 'x', 'y', '\0'}},
+ {FALSE, {'a', 'b', '\xF0', '\x81', '\x81', '\x81', 'x', 'y', '\0'}},
+ {FALSE, {'a', 'b', '\xF0', '\xC1', '\x81', '\x81', 'x', 'y', '\0'}},
+ {FALSE, {'a', 'b', '\xF0', '\x91', '\xC1', '\x81', 'x', 'y', '\0'}},
+ {FALSE, {'a', 'b', '\xF0', '\x91', '\x81', '\xC1', 'x', 'y', '\0'}},
+
+ {FALSE, {'a', 'b', '\xF2', 'x', 'y', '\0'}},
+ {FALSE, {'a', 'b', '\xF2', '\x91', 'x', 'y', '\0'}},
+ {FALSE, {'a', 'b', '\xF2', '\x91', '\x81', 'x', 'y', '\0'}},
+ {TRUE, {'a', 'b', '\xF2', '\x91', '\x81', '\x81', 'x', 'y', '\0'}},
+ {TRUE, {'a', 'b', '\xF2', '\x81', '\x81', '\x81', 'x', 'y', '\0'}},
+ {FALSE, {'a', 'b', '\xF2', '\xC1', '\x81', '\x81', 'x', 'y', '\0'}},
+ {FALSE, {'a', 'b', '\xF2', '\x91', '\xC1', '\x81', 'x', 'y', '\0'}},
+ {FALSE, {'a', 'b', '\xF2', '\x91', '\x81', '\xC1', 'x', 'y', '\0'}},
+
+ {FALSE, {'a', 'b', '\xF4', 'x', 'y', '\0'}},
+ {FALSE, {'a', 'b', '\xF4', '\x91', 'x', 'y', '\0'}},
+ {FALSE, {'a', 'b', '\xF4', '\x91', '\x81', 'x', 'y', '\0'}},
+ {FALSE, {'a', 'b', '\xF4', '\x91', '\x81', '\x81', 'x', 'y', '\0'}},
+ {TRUE, {'a', 'b', '\xF4', '\x81', '\x81', '\x81', 'x', 'y', '\0'}},
+ {FALSE, {'a', 'b', '\xF4', '\xC1', '\x81', '\x81', 'x', 'y', '\0'}},
+ {FALSE, {'a', 'b', '\xF4', '\x91', '\xC1', '\x81', 'x', 'y', '\0'}},
+ {FALSE, {'a', 'b', '\xF4', '\x91', '\x81', '\xC1', 'x', 'y', '\0'}},
+
+ {FALSE, {'a', 'b', '\xF5', 'x', 'y', '\0'}},
+ {FALSE, {'a', 'b', '\xF5', '\x81', 'x', 'y', '\0'}},
+
+ {TRUE, {'a', 'b', '\xF4', '\x81', '\x81', '\x81', 'x', 'y',
+ 'a', 'b', '\xF2', '\x91', '\x81', '\x81', 'x', 'y', '\0'}},
+ {FALSE, {'a', 'b', '\xF4', '\x81', '\x81', '\x81', 'x', 'y',
+ 'a', 'b', '\xF2', '\x91', '\x81', '\xC1', 'x', 'y', '\0'}},
+ {FALSE, {'a', 'b', '\xF4', '\x81', '\x81', '\x81', 'x', 'y',
+ 'a', 'b', '\xF2', '\x91', '\x81', 'x', 'y', '\0'}},
+
+ {-1},
+ };
+ int i = 0;
+
+ *msg = "test is_valid/last_valid";
+
+ if (msg_only)
+ return SVN_NO_ERROR;
+
+
+ while (tests[i].valid != -1)
+ {
+ const char *last = svn_utf__last_valid(tests[i].string,
+ strlen (tests[i].string));
+ apr_size_t len = strlen(tests[i].string);
+
+ if ((svn_utf__cstring_is_valid (tests[i].string) != tests[i].valid)
+ ||
+ (svn_utf__is_valid (tests[i].string, len) != tests[i].valid))
+ return svn_error_createf
+ (SVN_ERR_TEST_FAILED, NULL, "is_valid test %d failed", i);
+
+ if (!svn_utf__is_valid(tests[i].string, last - tests[i].string)
+ ||
+ (tests[i].valid && *last))
+ return svn_error_createf
+ (SVN_ERR_TEST_FAILED, NULL, "last_valid test %d failed", i);
+
+ ++i;
+ }
+
+ return SVN_NO_ERROR;
+}
+
+/* Compare the two different implementations using random data. */
+static svn_error_t *
+utf_validate2 (const char **msg,
+ svn_boolean_t msg_only,
+ apr_pool_t *pool)
+{
+ int i;
+
+ *msg = apr_psprintf (pool,
+ "test last_valid/last_valid2 (seed:%u)", seed_val());
+
+ if (msg_only)
+ return SVN_NO_ERROR;
+
+  /* We want enough iterations so that most runs get both valid and invalid
+     strings.  We also want enough iterations such that a deliberate error
+     in one of the implementations will trigger a failure.  By experiment
+     the second requirement requires a much larger number of iterations
+     than the first. */
+ for (i = 0; i < 100000; ++i)
+ {
+ unsigned int j;
+ char str[64];
+ apr_size_t len;
+
+ /* A random string; experiment shows that it's occasionally (less
+ than 1%) valid but usually invalid. */
+ for (j = 0; j < sizeof (str) - 1; ++j)
+ str[j] = range_rand (0, 255);
+ str[sizeof (str) - 1] = 0;
+ len = strlen (str);
+
+ if (svn_utf__last_valid (str, len) != svn_utf__last_valid2 (str, len))
+ {
+ /* Duplicate calls for easy debugging */
+ svn_utf__last_valid (str, len);
+ svn_utf__last_valid2 (str, len);
+          return svn_error_createf
+            (SVN_ERR_TEST_FAILED, NULL, "last_valid2 test %d failed", i);
+ }
+ }
+
+ return SVN_NO_ERROR;
+}
+
+
+/* The test table. */
+
+struct svn_test_descriptor_t test_funcs[] =
+ {
+ SVN_TEST_NULL,
+ SVN_TEST_PASS (utf_validate),
+ SVN_TEST_PASS (utf_validate2),
+ SVN_TEST_NULL
+ };
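
For reference, here is a standalone sketch of the error text that
invalid_utf8 in the patch builds, written without APR so it can be tried
in isolation. The function names, the caller-supplied buffer and the
VALID_LEN parameter (the offset that svn_utf__last_valid would return)
are assumptions made for this illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Append TEXT to BUF (capacity BUFSIZE, USED bytes in use) if it fits. */
static size_t
append_text (char *buf, size_t bufsize, size_t used, const char *text)
{
  size_t n = strlen (text);
  if (used + n + 1 <= bufsize)
    {
      memcpy (buf + used, text, n + 1);
      used += n;
    }
  return used;
}

/* Append COUNT octets as " xx" hex pairs, stopping early if BUF is full. */
static size_t
append_hex (char *buf, size_t bufsize, size_t used,
            const char *octets, size_t count)
{
  size_t i;
  for (i = 0; i < count && used + 4 <= bufsize; ++i)
    used += (size_t) snprintf (buf + used, bufsize - used, " %02x",
                               (unsigned int) (unsigned char) octets[i]);
  return used;
}

/* Hypothetical stand-in for the patch's invalid_utf8: describe DATA of
   length LEN whose first VALID_LEN octets are valid UTF-8.  Shows at most
   the last 24 valid octets and the first 4 octets after them. */
static void
describe_invalid_utf8 (char *buf, size_t bufsize,
                       const char *data, size_t len, size_t valid_len)
{
  size_t used = 0;
  size_t valid, invalid;

  assert (bufsize > 0 && valid_len <= len);
  buf[0] = '\0';
  valid = valid_len > 24 ? 24 : valid_len;
  invalid = len - valid_len > 4 ? 4 : len - valid_len;

  used = append_text (buf, bufsize, used, "Valid UTF-8 data\n(hex:");
  used = append_hex (buf, bufsize, used, data + valid_len - valid, valid);
  used = append_text (buf, bufsize, used,
                      ")\nfollowed by invalid UTF-8 sequence\n(hex:");
  used = append_hex (buf, bufsize, used, data + valid_len, invalid);
  append_text (buf, bufsize, used, ")");
}
```

For the name from the start of this mail (seven octets, the first three
valid) the message lists 61 62 63 as the valid data and c0 64 65 66 as
the invalid sequence.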

-- 
Philip Martin
Received on Fri Feb 6 01:52:42 2004
