
Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

From: Michael W Thelen <mike_at_pietdepsi.com>
Date: 2005-06-07 05:02:35 CEST

Charles Bailey tried to respond to my message, but for some reason it
hasn't shown up on the list. He asked me to forward it, so here it is:

Forwarded message:
--On Jun 6, 2005 12:34 PM, Michael W Thelen <mike@pietdepsi.com> wrote:

>
> It looks like something bad happened to your message, as if something
> stripped out all the newlines. Would you mind resending the patch?

Sorry; I think I've been Gmailed again; let me try from Mulberry.
Here's the leading text:

On Wed, 23 Feb 2005, Charles Bailey wrote:

>
> Attached is a patchlet that, when expat fails to parse a hunk of XML,
> appends at least part of the offending hunk to the error message. It

which led to an exchange regarding the need to make strings
UTF-8-safe. After too long a hiatus, I posted for comment:

On 4/21/05, Charles Bailey <bailey.charles@gmail.com> wrote:

>
> Well, after umpteen interrupts from the rest of life, I finally got a
> few hours to look at this again. In checking what was already
> available, I found a handful of "string escaping" functions in various
> places which perform similar tasks (at least one with the comment
> "this should share code with other_string_escaping_routine()"). Since
> I'd have to add yet another such function, I thought I'd try to abstract
> it a bit, with the hope that similar routines could use a common base.
> I've appended a short proposal at the bottom of this message,
> containing a common "engine" and an example implementation for
> creating a UTF-8-safe version of an arbitrary string.

Julian Foad was kind enough to point out a dumb thinko, but no other
comments were forthcoming, possibly because the core developers were
busy with pre-1.2 cleanup.

So, after another too-long hiatus, here's a patch which implements a
"common" string escaping function, uses it for UTF-8 escaping, and
uses that to sanitize the offending XML, which is then output in the
error message that Jack built^W^Wstarted this thread.

I've interspersed my comments in the code, since there's imho zero
chance that this version of the patch will be
substantially/stylistically suitable for committing. They're far from
exhaustive, but this message is long enough already.

Conceptual "Log message":
[[[
Add function that escapes illegal UTF-8 characters, along the way
refactoring core of string-escaping routines, and ensure that illegal
XML error message outputs legal UTF-8.
### Probably best applied as several patches, but collected here for review.

* subversion/libsvn_subr/escape.c:
  New file
  (svn_subr__escape_string): Final-common-path function for escaping
   strings.

* subversion/libsvn_subr/escape_impl.h:
  New file, declaring svn_subr__escape_string and convenience macros.
  ### Logical candidate for consolidation with utf_impl.h, perhaps as
  ### subr_impl.h

* subversion/libsvn_subr/utf.c:
  (fuzzy_escape): Renamed to ascii_fuzzy_escape, and rewritten to use
   svn_subr__escape_string.
  (svn_utf__stringbuf_escape_utf8_fuzzy): New function which escapes illegal
   UTF-8 in a string, returning the escaped string in a stringbuf.
  (utf8_escape_mapper): Helper function for
   svn_utf__stringbuf_escape_utf8_fuzzy.

* subversion/libsvn_subr/utf_impl.h:
  Add prototype for svn_utf__stringbuf_escape_utf8_fuzzy.
  (svn_utf__cstring_escape_utf8_fuzzy): Macro implementing variant of
   above that returns NUL-terminated string.

* subversion/libsvn_subr/xml.c:
  (svn_xml_parse): If parse fails, print (sanitized) (part of) offending XML
   with error message.

* subversion/tests/libsvn_subr/utf-test.c:
  (utf_escape): New function testing UTF-8 string-escaping functions.

* subversion/po/de.po, subversion/po/es.po, subversion/po/ja.po,
  subversion/po/ko.po, subversion/po/nb.po, subversion/po/pl.po,
  subversion/po/pt_BR.po, subversion/po/sv.po,
  subversion/po/zh_CN.po, subversion/po/zh_TW.po:
  Courtesy to translators, since I've changed a localized string.
]]]

The patch, with interspersed comments, is appended as an attachment.

### This driver was written because there are several "escaping" functions
### in different places which do similar things with slightly different
### criteria. It seemed best to collect the common work into one place, if
### not to save space, then to minimize divergence.
### The goal here is to be fast on the simple cases via the screening array,
### while allowing flexibility for more complex substitutions via the
### mapping function. In very oversimplified, off-the-cuff testing,
### eliminating the screening array caused a slowdown of slightly less than
### twofold.
### I've attempted to incorporate reasonable default behavior in the case of
### NULL params.
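
For readers who want to try the two-step idea outside the Subversion tree, here is a minimal standalone sketch. It is not the patch's code: it assumes a fixed-size output buffer instead of svn_stringbuf_t, the names screen_escape and screen_mapper_t are invented for illustration, and a mapper return of 0 is simply treated as "escape this byte" rather than "the mapper handled it".

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the mapper callback: returns how many bytes
   starting at P may be copied verbatim, or 255 to request an escape. */
typedef unsigned char (*screen_mapper_t)(const unsigned char *p,
                                         const unsigned char *end);

/* Escape LEN bytes of IN into OUT (capacity OUTCAP, NUL-terminated on
   success).  Returns the number of bytes written, excluding the NUL, or
   (size_t)-1 if OUT is too small. */
static size_t
screen_escape(char *out, size_t outcap,
              const unsigned char *in, size_t len,
              const unsigned char table[256], screen_mapper_t mapper)
{
  const unsigned char *p = in, *end = in + len;
  size_t used = 0;

  if (outcap == 0)
    return (size_t)-1;

  while (p < end)
    {
      /* Step 1: cheap table lookup on the byte value. */
      unsigned char count = table ? table[*p] : 0;

      /* Step 2: the table says "undecided", so ask the mapper. */
      if (count == 0)
        count = mapper ? mapper(p, end) : 255;

      if (count == 255 || count == 0 || (size_t)(end - p) < count)
        {
          /* Escape one byte as "?\uuu": 5 characters plus the NUL. */
          if (outcap - used < 6)
            return (size_t)-1;
          used += (size_t) sprintf(out + used, "?\\%03u", (unsigned) *p);
          p++;
        }
      else
        {
          /* Copy COUNT bytes verbatim, leaving room for the NUL. */
          if (outcap - used < (size_t) count + 1)
            return (size_t)-1;
          memcpy(out + used, p, count);
          used += count;
          p += count;
        }
    }
  out[used] = '\0';
  return used;
}
```

With a table that marks NUL and the high half as 255 and everything else as 1, and no mapper, this behaves like an ASCII-only fuzzy escape: the bytes 'a', 0xe9, 'b' come out as a?\233b.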
--- /dev/null Mon Jun 6 11:06:27 2005
+++ subversion/libsvn_subr/escape.c Fri Jun 3 19:16:09 2005
@@ -0,0 +1,58 @@
+/*
+ * escape.c: common code for cleaning up unwanted bytes in strings
+ */
+
+#include "escape_impl.h"
+
+#define COPY_PREFIX \
+ if (c > base) { \
+ svn_stringbuf_appendbytes (out, base, c - base); \
+ base = c; \
+ }
+
+svn_stringbuf_t *
+svn_subr__escape_string (svn_stringbuf_t **outsbuf,
+ const unsigned char *instr,
+ apr_size_t len,
+ const unsigned char *isok,
+ unsigned char (*mapper) (unsigned char **,
+ const unsigned char *,
+ apr_size_t,
+ const svn_stringbuf_t *,
+ void *,
+ apr_pool_t *),
+ void *mapper_baton,
+ apr_pool_t *pool)
+{
+ unsigned char *base, *c;
+ svn_stringbuf_t *out;
+
+ if (outsbuf == NULL || *outsbuf == NULL) {
+ out = svn_stringbuf_create ("", pool);
+ if (outsbuf)
+ *outsbuf = out;
+ }
+ else
+ out = *outsbuf;
+
+ for (c = base = (unsigned char *) instr; c < instr + len; ) {
+ apr_size_t count = isok ? isok[*c] : 0;
+ if (count == 0) {
+ COPY_PREFIX;
+ count = mapper ? mapper (&c, instr, len, out, mapper_baton, pool) : 255;
+ }
+ if (count == 255) {
+ char esc[6];
+
+ COPY_PREFIX;
+ sprintf (esc,"?\\%03u",*c);
+ svn_stringbuf_appendcstr (out, esc);
+ c++;
+ base = c;
+ }
+ else c += count;
+ }
+ COPY_PREFIX;
+ return out;
+}
+

### Comments are pretty self-explanatory.
### Docs are as doxygen; will need to be downgraded to plaintext since it's
### an internal header.
### As noted above, it makes sense to combine this with utf_impl.h.
--- /dev/null Mon Jun 6 11:35:47 2005
+++ subversion/libsvn_subr/escape_impl.h Thu Jun 2 18:44:05 2005
@@ -0,0 +1,147 @@
+/*
+ * escape_impl.h : private header for string escaping function.
+ */
+
+
+
+#ifndef SVN_LIBSVN_SUBR_ESCAPE_IMPL_H
+#define SVN_LIBSVN_SUBR_ESCAPE_IMPL_H
+
+
+#include "svn_pools.h"
+#include "svn_string.h"
+
+#ifdef __cplusplus
+extern "C" {
+#endif /* __cplusplus */
+
+
+/** Scan @a instr of length @a len bytes, copying to stringbuf @a *outsbuf,
+ * escaping bytes as indicated by the lookup array @a isok and the mapping
+ * function @a mapper. Memory is allocated from @a pool. You may provide
+ * any extra information needed by @a mapper in @a mapper_baton.
+ * Returns a pointer to the stringbuf containing the escaped string.
+ *
+ * If @a outsbuf or *outsbuf is NULL, a new stringbuf is created; its
+ * address is placed in @a outsbuf unless that argument is NULL.
+ * If @a isok is NULL, then @a mapper is used exclusively.
+ * If @a mapper is NULL, then a single character is escaped every time
+ * @a mapper would have been called.
+ *
+ * This is designed to be the common pathway for various string "escaping"
+ * functions across subversion. The basic approach is to scan
+ * the input and decide whether each byte is OK as it stands, needs to be
+ * "escaped" using subversion's "?\uuu" default format, or needs to be
+ * transformed in some other way. The decision is made using a two step
+ * process, which is designed to handle the simple cases quickly but allow
+ * for more complex mappings. Since the typical string will (we hope)
+ * comprise mostly simple cases, this shouldn't require much code
+ * complexity or loss of efficiency. The two steps used are:
+ *
+ * 1. The value of a byte from the input string ("test byte") is used as an
+ * index into a (usually 256 byte) array passed in by the caller.
+ * - If the value of the appropriate array element is 0xff,
+ * then the test byte is escaped as a "?\uuu" string in the output.
+ * - If the value of the appropriate element is otherwise non-zero,
+ * that many bytes are copied verbatim from the input to the output.
+ * 2. If the array yields a 0 value, then a mapping function provided by
+ * the caller is used to allow for more complex evaluation. This function
+ * receives six arguments:
+ * - a pointer to the pointer used by svn_subr__escape_string() to
+ * mark the test byte in the input string
+ * - a pointer to the start of the input string
+ * - the length of the input string
+ * - a pointer to the output stringbuf
+ * - the caller's @a mapper_baton and the ever-helpful pool.
+ * The mapping function may return a (positive) nonzero value,
+ * which is interpreted as described in step 1 above, or zero,
+ * indicating that the test byte should be ignored. In the latter
+ * case, this is generally because the mapping function has done the
+ * necessary work itself; it's free to modify the output stringbuf and
+ * adjust the pointer to the test byte as it sees fit (within the
+ * bounds of the input string). At a minimum, it should
+ * increment the pointer to the test byte before returning 0, in order
+ * to avoid an infinite loop.
+ */
+
+svn_stringbuf_t *
+svn_subr__escape_string (svn_stringbuf_t **outsbuf,
+ const unsigned char *instr,
+ apr_size_t len,
+ const unsigned char *isok,
+ unsigned char (*mapper) (unsigned char **,
+ const unsigned char *,
+ apr_size_t,
+ const svn_stringbuf_t *,
+ void *,
+ apr_pool_t *),
+ void *mapper_baton,
+ apr_pool_t *pool);
+
+
+
+/** Initializer for a basic screening matrix suitable for use with
+ * #svn_subr__escape_string to escape non-UTF-8 bytes.
+ * We provide this since "UTF-8-safety" is a common denominator for
+ * most string escaping in Subversion, so this matrix makes a good
+ * starting point for more involved schemes.
+ */
+#define SVN_ESCAPE_UTF8_LEGAL_ARRAY { \
+ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\
+ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\
+ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\
+ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\
+ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\
+ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\
+ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\
+ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\
+255, 255, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\
+ 0, 0, 0, 0, 0, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255}
+
+/** Given pointer @a c into a string whose last byte is immediately before
+ * @a e, figure out whether (*c) starts a valid UTF-8 sequence, and if so,
+ * how many bytes it includes. Return 255 if it's not valid UTF-8.
+ * For a more detailed description of the encoding rules, see the UTF-8
+ * specification in section 3-9 of the Unicode standard 4.0 (e.g. at
+ * http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf),
+ * with special attention to Table 3-6.
+ * This macro is also provided as a building block for mappers used by
+ * #svn_subr__escape_string that want to check for UTF-8-safety in
+ * addition to other tasks.
+ */
+#define SVN_ESCAPE_UTF8_MAPPING(c,e) \
+ ( (c)[0] < 0x80 ? /* ASCII */ \
+ 1 : /* OK, 1byte */ \
+ ( ( ((c)[0] >= 0xc2 && (c)[0] <= 0xdf) && /* 2-byte char */ \
+ ((c) + 1 < (e)) && /* Got 2 bytes */ \
+ ((c)[1] >= 0x80 && (c)[1] <= 0xbf)) ? /* Byte 2 legal */ \
+ 2 : /* OK, 2 bytes */ \
+ ( ( ((c)[0] >= 0xe0 && (c)[0] <= 0xef) && /* 3 byte char */ \
+ ((c) + 2 < (e)) && /* Got 3 bytes */ \
+ ((c)[1] >= 0x80 && (c)[1] <= 0xbf) && /* Basic byte 2 legal */ \
+ ((c)[2] >= 0x80 && (c)[2] <= 0xbf) && /* Basic byte 3 legal */ \
+ (!((c)[0] == 0xe0 && (c)[1] < 0xa0)) && /* 0xe0-0x[89]? illegal */\
+ (!((c)[0] == 0xed && (c)[1] > 0x9f)) ) ? /* 0xed-0x[ab]? illegal */\
+ 3 : /* OK, 3 bytes */ \
+ ( ( ((c)[0] >= 0xf0 && (c)[0] <= 0xf4) && /* 4 byte char */ \
+ ((c) + 3 < (e)) && /* Got 4 bytes */ \
+ ((c)[1] >= 0x80 && (c)[1] <= 0xbf) && /* Basic byte 2 legal */ \
+ ((c)[2] >= 0x80 && (c)[2] <= 0xbf) && /* Basic byte 3 legal */ \
+ ((c)[3] >= 0x80 && (c)[3] <= 0xbf) && /* Basic byte 4 legal */ \
+ (!((c)[0] == 0xf0 && (c)[1] < 0x90)) && /* 0xf0-0x8? illegal */ \
+ (!((c)[0] == 0xf4 && (c)[1] > 0x8f)) ) ? /* 0xf4-0x[9ab]? illegal*/\
+ 4 : /* OK, 4 bytes */ \
+ 255)))) /* Illegal; escape it */
+
+
+#ifdef __cplusplus
+}
+#endif /* __cplusplus */
+
+#endif /* SVN_LIBSVN_SUBR_ESCAPE_IMPL_H */
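
The well-formedness rules from Table 3-6 that the mapping macro encodes can be easier to check when written as a plain function. The following is a sketch only, not part of the patch: the name utf8_seq_len is hypothetical, END is assumed to point one past the last readable byte, and it returns 0 rather than 255 for an ill-formed sequence.

```c
#include <stddef.h>

/* Return the length (1 to 4) of the well-formed UTF-8 sequence that starts
   at P, or 0 if the bytes in [P, END) do not begin with one. */
static size_t
utf8_seq_len(const unsigned char *p, const unsigned char *end)
{
  size_t avail, need, i;
  unsigned char lo = 0x80, hi = 0xbf;   /* allowed range for byte 2 */

  if (p >= end)
    return 0;
  avail = (size_t)(end - p);

  if (p[0] < 0x80)
    return 1;                           /* ASCII */
  else if (p[0] >= 0xc2 && p[0] <= 0xdf)
    need = 2;                           /* 0xc0 and 0xc1 would be overlong */
  else if (p[0] >= 0xe0 && p[0] <= 0xef)
    {
      need = 3;
      if (p[0] == 0xe0) lo = 0xa0;      /* exclude overlong forms */
      if (p[0] == 0xed) hi = 0x9f;      /* exclude surrogates */
    }
  else if (p[0] >= 0xf0 && p[0] <= 0xf4)
    {
      need = 4;
      if (p[0] == 0xf0) lo = 0x90;      /* exclude overlong forms */
      if (p[0] == 0xf4) hi = 0x8f;      /* nothing above U+10FFFF */
    }
  else
    return 0;                           /* stray continuation or 0xf5+ */

  if (avail < need)
    return 0;                           /* sequence runs off the end */
  if (p[1] < lo || p[1] > hi)
    return 0;
  for (i = 2; i < need; i++)
    if (p[i] < 0x80 || p[i] > 0xbf)
      return 0;
  return need;
}
```

The boundary lead bytes 0xc2 and 0xdf are the cases most worth testing, along with input that ends in the middle of a sequence.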

### Function names can be revised to fit convention, of course.
### svn_utf__cstring_escape_utf8_fuzzy serves as an example of a benefit of
### returning the resultant stringbuf from svn_subr__escape_string both in a
### parameter and as the function's return value. If the sense is that it'll
### be a cause of debugging headaches, or that it's contrary to subversion
### culture to code public functions as macros, it's easy enough to code
### this as a function, and to make svn_subr__escape_string return void
### (or less likely svn_error_t, if it got pickier about params.)
--- subversion/libsvn_subr/utf_impl.h (revision 14986)
+++ subversion/libsvn_subr/utf_impl.h (working copy)
@@ -24,12 +24,33 @@

 #include <apr_pools.h>
 #include "svn_types.h"
+#include "svn_string.h"

 #ifdef __cplusplus
 extern "C" {
 #endif /* __cplusplus */

+/** Replace any non-UTF-8 characters in @a len byte long string @a src with
+ * escaped representations, placing the result in a stringbuf pointed to by
+ * @a *dest, which will be created if necessary. Memory is allocated from
+ * @a pool as needed. Returns a pointer to the stringbuf containing the result
+ * (identical to @a *dest, but facilitates chaining calls).
+ */
+svn_stringbuf_t *
+svn_utf__stringbuf_escape_utf8_fuzzy (svn_stringbuf_t **dest,
+ const unsigned char *src,
+ apr_size_t len,
+ apr_pool_t *pool);
+
+/** Replace any non-UTF-8 characters in @a len byte long string @a src with
+ * escaped representations. Memory is allocated from @a pool as needed.
+ * Returns a pointer to the resulting string.
+ */
+#define svn_utf__cstring_escape_utf8_fuzzy(src,len,pool) \
+ (svn_utf__stringbuf_escape_utf8_fuzzy(NULL,(src),(len),(pool)))->data
+
+
const char *svn_utf__cstring_from_utf8_fuzzy (const char *src,
                                               apr_pool_t *pool,
                                               svn_error_t *(*convert_from_utf8)

### There're other places that could be rewritten in terms of the new escaping
### functions, but I hope the two given here serve as an example of how it might
### be done.
### The rename to ascii_fuzzy_escape is to distinguish it from the new functions
### that escape only illegal UTF-8 sequences.
--- subversion/libsvn_subr/utf.c (revision 14986)
+++ subversion/libsvn_subr/utf.c (working copy)
@@ -30,6 +30,7 @@
 #include "svn_pools.h"
 #include "svn_ctype.h"
 #include "svn_utf.h"
+#include "escape_impl.h"
 #include "utf_impl.h"
 #include "svn_private_config.h"

@@ -323,53 +324,19 @@
 /* Copy LEN bytes of SRC, converting non-ASCII and zero bytes to ?\nnn
    sequences, allocating the result in POOL. */
 static const char *
-fuzzy_escape (const char *src, apr_size_t len, apr_pool_t *pool)
+ascii_fuzzy_escape (const char *src, apr_size_t len, apr_pool_t *pool)
 {
- const char *src_orig = src, *src_end = src + len;
- apr_size_t new_len = 0;
- char *new;
- const char *new_orig;
+ static unsigned char asciinonul[256];
+ svn_stringbuf_t *result = NULL;

- /* First count how big a dest string we'll need. */
- while (src < src_end)
- {
- if (! svn_ctype_isascii (*src) || *src == '\0')
- new_len += 5; /* 5 slots, for "?\XXX" */
- else
- new_len += 1; /* one slot for the 7-bit char */
+ if (!asciinonul[0]) {
+ asciinonul[0] = 255; /* NUL's not allowed */
+ memset(asciinonul + 1, 1, 127); /* Other regular ASCII OK */
+ memset(asciinonul + 128, 255, 128); /* High half not allowed */
+ }

- src++;
- }
-
- /* Allocate that amount. */
- new = apr_palloc (pool, new_len + 1);
-
- new_orig = new;
-
- /* And fill it up. */
- while (src_orig < src_end)
- {
- if (! svn_ctype_isascii (*src_orig) || src_orig == '\0')
- {
- /* This is the same format as svn_xml_fuzzy_escape uses, but that
- function escapes different characters. Please keep in sync!
- ### If we add another fuzzy escape somewhere, we should abstract
- ### this out to a common function. */
- sprintf (new, "?\\%03u", (unsigned char) *src_orig);
- new += 5;
- }
- else
- {
- *new = *src_orig;
- new += 1;
- }
-
- src_orig++;
- }
-
- *new = '\0';
-
- return new_orig;
+ svn_subr__escape_string(&result, (const unsigned char *) src, len, asciinonul, NULL, NULL, pool);
+ return result->data;
 }

 /* Convert SRC_LENGTH bytes of SRC_DATA in NODE->handle, store the result
@@ -448,7 +415,7 @@
         errstr = apr_psprintf
           (pool, _("Can't convert string from '%s' to '%s':"),
            node->frompage, node->topage);
- err = svn_error_create (apr_err, NULL, fuzzy_escape (src_data,
+ err = svn_error_create (apr_err, NULL, ascii_fuzzy_escape (src_data,
                                                            src_length, pool));
       return svn_error_create (apr_err, err, errstr);
     }
@@ -564,7 +531,28 @@
   return SVN_NO_ERROR;
 }

+static unsigned char
+utf8_escape_mapper (unsigned char **targ, const unsigned char *start,
+ apr_size_t len, const svn_stringbuf_t *dest,
+ void *baton, apr_pool_t *pool)
+{
+ const unsigned char *end = start + len;
+ return SVN_ESCAPE_UTF8_MAPPING(*targ, end);
+}

+svn_stringbuf_t *
+svn_utf__stringbuf_escape_utf8_fuzzy (svn_stringbuf_t **dest,
+ const unsigned char *src,
+ apr_size_t len,
+ apr_pool_t *pool)
+{
+ static unsigned char utf8screen[256] = SVN_ESCAPE_UTF8_LEGAL_ARRAY;
+
+ return svn_subr__escape_string(dest, src, len,
+ utf8screen, utf8_escape_mapper, NULL,
+ pool);
+}
+
 svn_error_t *
 svn_utf_stringbuf_to_utf8 (svn_stringbuf_t **dest,
                            const svn_stringbuf_t *src,
@@ -787,7 +775,7 @@
   const char *escaped, *converted;
   svn_error_t *err;

- escaped = fuzzy_escape (src, strlen (src), pool);
+ escaped = ascii_fuzzy_escape (src, strlen (src), pool);

   /* Okay, now we have a *new* UTF-8 string, one that's guaranteed to
      contain only 7-bit bytes :-). Recode to native... */

### With code comes testing.
### Note: Contains 8-bit chars, and also uses convention that cc will treat
### "foo" "bar" as "foobar". Both can be avoided if useful for
### finicky compilers.
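
A small illustration of the literal-concatenation convention mentioned in the note, and of why it matters when 8-bit bytes are spelled as hex escapes. This is a sketch with invented names (split_literal, expected_bytes, literals_match), not code from the test file. A hex escape consumes every following hex digit, so "\xa0" followed directly by "bad" in one literal would be read as a single out-of-range escape; closing the literal and opening a new one ends the escape.

```c
#include <string.h>

/* Adjacent literals are concatenated after escapes are processed, so the
   escape \xa0 ends at the closing quote and "bad" stays three letters. */
static const char split_literal[] = "\xe4\x87\xa0" "bad";

/* The same bytes spelled out one by one, including the trailing NUL. */
static const unsigned char expected_bytes[] =
  { 0xe4, 0x87, 0xa0, 'b', 'a', 'd', 0x00 };

/* Returns nonzero when both spellings produce identical bytes. */
static int
literals_match(void)
{
  return sizeof split_literal == sizeof expected_bytes
         && memcmp(split_literal, expected_bytes,
                   sizeof expected_bytes) == 0;
}
```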

--- subversion/tests/libsvn_subr/utf-test.c (revision 14986)
+++ subversion/tests/libsvn_subr/utf-test.c (working copy)
@@ -17,6 +17,7 @@
  */

 #include "../svn_test.h"
+#include "../../include/svn_utf.h"
 #include "../../libsvn_subr/utf_impl.h"

 /* Random number seed. Yes, it's global, just pretend you can't see it. */
@@ -222,6 +223,84 @@
   return SVN_NO_ERROR;
 }

+static svn_error_t *
+utf_escape (const char **msg,
+ svn_boolean_t msg_only,
+ svn_test_opts_t *opts,
+ apr_pool_t *pool)
+{
+ char in[] = { 'A', 'S', 'C', 'I', 'I', /* All printable */
+ 'R', 'E', 'T', '\n', 'N', /* Newline */
+ 'B', 'E', 'L', 0x07, '!', /* Control char */
+ 0xd2, 0xa6, 'O', 'K', '2', /* 2-byte char, valid */
+ 0xc0, 0xc3, 'N', 'O', '2', /* 2-byte char, invalid 1st */
+ 0x82, 0xc3, 'N', 'O', '2', /* 2-byte char, invalid 2nd */
+ 0xe4, 0x87, 0xa0, 'O', 'K', /* 3-byte char, valid */
+ 0xe2, 0xff, 0xba, 'N', 'O', /* 3-byte char, invalid 2nd */
+ 0xe0, 0x87, 0xa0, 'N', 'O', /* 3-byte char, invalid 2nd */
+ 0xed, 0xa5, 0xa0, 'N', 'O', /* 3-byte char, invalid 2nd */
+ 0xe4, 0x87, 0xc0, 'N', 'O', /* 3-byte char, invalid 3rd */
+ 0xf2, 0x87, 0xa0, 0xb5, 'Y', /* 4-byte char, valid */
+ 0xf2, 0xd2, 0xa0, 0xb5, 'Y', /* 4-byte char, invalid 2nd */
+ 0xf0, 0x87, 0xa0, 0xb5, 'N', /* 4-byte char, invalid 2nd */
+ 0xf4, 0x97, 0xa0, 0xb5, 'N', /* 4-byte char, invalid 2nd */
+ 0xf2, 0x87, 0xc3, 0xb5, 'N', /* 4-byte char, invalid 3rd */
+ 0xf2, 0x87, 0xa0, 0xd5, 'N', /* 4-byte char, invalid 4th */
+ 0x00 };
+ const unsigned char *legalresult =
+ "ASCIIRET\nNBEL\x07!\xd2\xa6" "OK2?\\192?\\195NO2?\\130?\\195NO2"
+ "\xe4\x87\xa0" "OK?\\226?\\255?\\186NO?\\224?\\135?\\160NO?\\237?\\165?\\160NO"
+ "?\\228?\\135?\\192NO\xf2\x87\xa0\xb5" "Y?\\242\xd2\xa0" "?\\181Y?\\240?\\135?\\160"
+ "?\\181N?\\244?\\151?\\160?\\181N?\\242?\\135\xc3\xb5" "N?\\242?\\135?\\160"
+ "?\\213N";
+ const unsigned char *asciiresult =
+ "ASCIIRET\nNBEL\x07!?\\210?\\166OK2?\\192?\\195NO2?\\130?\\195NO2"
+ "?\\228?\\135?\\160OK?\\226?\\255?\\186NO?\\224?\\135?\\160NO"
+ "?\\237?\\165?\\160NO?\\228?\\135?\\192NO?\\242?\\135?\\160?\\181Y"
+ "?\\242?\\210?\\160?\\181Y?\\240?\\135?\\160?\\181N"
+ "?\\244?\\151?\\160?\\181N?\\242?\\135?\\195?\\181N"
+ "?\\242?\\135?\\160?\\213N";
+ const unsigned char *asciified;
+ apr_size_t legalresult_len = 213; /* == strlen(legalresult) iff no NULs */
+ int i = 0;
+ svn_stringbuf_t *escaped = NULL;
+
+ *msg = "test utf string escaping";
+
+ if (msg_only)
+ return SVN_NO_ERROR;
+
+ if (svn_utf__stringbuf_escape_utf8_fuzzy
+ (&escaped, in, sizeof in - 1, pool) != escaped)
+ return svn_error_createf
+ (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);
+ i++;
+ if (escaped->len != legalresult_len)
+ return svn_error_createf
+ (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);
+ i++;
+ if (memcmp(escaped->data, legalresult, legalresult_len))
+ return svn_error_createf
+ (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);
+ i++;
+ if (memcmp(escaped->data, legalresult, legalresult_len))
+ return svn_error_createf
+ (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);
+ i++;
+
+ asciified = svn_utf_cstring_from_utf8_fuzzy(in, pool);
+ if (strlen(asciified) != strlen(asciiresult))
+ return svn_error_createf
+ (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);
+ i++;
+ if (strcmp(asciified, asciiresult))
+ return svn_error_createf
+ (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);
+ i++;
+
+ return SVN_NO_ERROR;
+}
+

 /* The test table. */

@@ -230,5 +309,6 @@
     SVN_TEST_NULL,
     SVN_TEST_PASS (utf_validate),
     SVN_TEST_PASS (utf_validate2),
+ SVN_TEST_PASS (utf_escape),
     SVN_TEST_NULL
   };

### The original point of this thread.
### This patch will apply with an offset, since I've cut out sections which
### reimplement XML escaping in terms of the svn_subr__escape_string.
--- subversion/libsvn_subr/xml.c (revision 14986)
+++ subversion/libsvn_subr/xml.c (working copy)
@@ -395,11 +413,22 @@
   /* If expat choked internally, return its error. */
   if (! success)
     {
+ svn_stringbuf_t *sanitized = NULL;
+ unsigned char *end;
+
+ svn_utf__stringbuf_escape_utf8_fuzzy(&sanitized, (const unsigned char *) buf,
+ (len > 240 ? 240 : len),
+ svn_parser->pool);
+ end = (unsigned char *) sanitized->data +
+ (sanitized->len > 240 ? 240 : sanitized->len);
+ while (*end >= 0x80 && *end < 0xc0 &&
+ (char *) end > sanitized->data) end--;
       err = svn_error_createf
         (SVN_ERR_XML_MALFORMED, NULL,
- _("Malformed XML: %s at line %d"),
+ _("Malformed XML: %s at line %d; XML starts:\n%.*s"),
          XML_ErrorString (XML_GetErrorCode (svn_parser->parser)),
- XML_GetCurrentLineNumber (svn_parser->parser));
+ XML_GetCurrentLineNumber (svn_parser->parser),
+ (int) ((char *) end - sanitized->data), sanitized->data);

       /* Kill all parsers and return the expat error */
       svn_xml_free_parser (svn_parser);
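
The truncation step in this hunk (cut the excerpt at a byte limit without splitting a multi-byte character) can be shown in isolation. This is a sketch under stated assumptions, not the patch's code: the helper name utf8_truncate_len is hypothetical, and the buffer is assumed to hold valid UTF-8 already, as it would after escaping.

```c
#include <stddef.h>

/* Return the largest N <= LIMIT such that the first N bytes of BUF (LEN
   bytes of valid UTF-8) do not end in the middle of a multi-byte
   sequence. */
static size_t
utf8_truncate_len(const char *buf, size_t len, size_t limit)
{
  size_t n = len < limit ? len : limit;

  if (n == len)
    return n;                 /* nothing is being cut off */

  /* buf[n] is the first byte that would be dropped.  If it is a
     continuation byte (10xxxxxx), the cut falls inside a character, so
     back up until the cut sits just before that character's lead byte. */
  while (n > 0 && ((unsigned char) buf[n] & 0xc0) == 0x80)
    n--;
  return n;
}
```

The result can be passed as the precision of a "%.*s" conversion after casting it to int, since that conversion expects an int argument.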

### Finally, be kind to the translators.
--- subversion/po/pt_BR.po (revision 14986)
+++ subversion/po/pt_BR.po (working copy)
@@ -6006,8 +6006,8 @@

 #: libsvn_subr/xml.c:400
 #, c-format
-msgid "Malformed XML: %s at line %d"
-msgstr "XML mal formado: %s na linha %d"
+msgid "Malformed XML: %s at line %d; XML starts:\n%.*s"
+msgstr "XML mal formado: %s na linha %d; XML começa:\n%.*s"

 #: libsvn_wc/adm_crawler.c:380
 #, c-format
--- subversion/po/es.po (revision 14986)
+++ subversion/po/es.po (working copy)
@@ -6102,8 +6102,8 @@

 #: libsvn_subr/xml.c:400
 #, c-format
-msgid "Malformed XML: %s at line %d"
-msgstr "XML malformado: %s en la línea %d"
+msgid "Malformed XML: %s at line %d; XML starts:\n%.*s"
+msgstr "XML malformado: %s en la línea %d; XML comienza:\n%.*s"

 #: libsvn_wc/adm_crawler.c:380
 #, c-format
--- subversion/po/de.po (revision 14986)
+++ subversion/po/de.po (working copy)
@@ -6090,8 +6090,8 @@

 #: libsvn_subr/xml.c:400
 #, c-format
-msgid "Malformed XML: %s at line %d"
-msgstr "Fehlerhaftes XML: %s in Zeile %d"
+msgid "Malformed XML: %s at line %d; XML starts:\n%.*s"
+msgstr "Fehlerhaftes XML: %s in Zeile %d; XML beginnt:\n%.*s"

 #: libsvn_wc/adm_crawler.c:380
 #, c-format
--- subversion/po/sv.po (revision 14986)
+++ subversion/po/sv.po (working copy)
@@ -6005,8 +6005,8 @@

 #: libsvn_subr/xml.c:400
 #, c-format
-msgid "Malformed XML: %s at line %d"
-msgstr "Felaktig XML: %s på rad %d"
+msgid "Malformed XML: %s at line %d; XML starts:\n%.*s"
+msgstr "Felaktig XML: %s på rad %d; XML starta:\n%.*s"

 #: libsvn_wc/adm_crawler.c:380
 #, c-format
--- subversion/po/ko.po (revision 14986)
+++ subversion/po/ko.po (working copy)
@@ -5906,8 +5906,8 @@

 #: libsvn_subr/xml.c:400
 #, c-format
-msgid "Malformed XML: %s at line %d"
-msgstr "잘1ﰪ1된1 XML: %s (ﲄ„1
ﰲˆ1호1 %d)"
+msgid "Malformed XML: %s at line %d; XML starts:\n%.*s"
+msgstr "잘1ﰪ1된1 XML: %s (ﲄ„1
ﰲˆ1호1 %d); XML:\n%.*s"

 #: libsvn_wc/adm_crawler.c:380
 #, c-format
--- subversion/po/ja.po (revision 14986)
+++ subversion/po/ja.po (working copy)
@@ -6463,8 +6463,8 @@

 #: libsvn_subr/xml.c:400
 #, c-format
-msgid "Malformed XML: %s at line %d"
-msgstr "異1笠1 XML です: %s (fiŒ1 %d)"
+msgid "Malformed XML: %s at line %d; XML starts:\n%.*s"
+msgstr "異1笠1 XML です: %s
(fiŒ1 %d); XML:\n%.*s"

 #: libsvn_wc/adm_crawler.c:380
 #, c-format
--- subversion/po/pl.po (revision 14986)
+++ subversion/po/pl.po (working copy)
@@ -6103,8 +6103,8 @@

 #: libsvn_subr/xml.c:400
 #, c-format
-msgid "Malformed XML: %s at line %d"
-msgstr "Uszkodzony XML: %s w linii %d"
+msgid "Malformed XML: %s at line %d; XML starts:\n%.*s"
+msgstr "Uszkodzony XML: %s w linii %d; XML wersja:\n%.*s"

 #: libsvn_wc/adm_crawler.c:380
 #, c-format
--- subversion/po/zh_TW.po (revision 14986)
+++ subversion/po/zh_TW.po (working copy)
@@ -5896,8 +5896,8 @@

 #: libsvn_subr/xml.c:400
 #, c-format
-msgid "Malformed XML: %s at line %d"
-msgstr "有1謁1陷1 XML: %s
於1窱1 %d 列1"
+msgid "Malformed XML: %s at line %d; XML starts:\n%.*s"
+msgstr "有1謁1陷1 XML: %s
於1窱1 %d 列1; XML:\n%.*s"

 #: libsvn_wc/adm_crawler.c:380
 #, c-format
--- subversion/po/nb.po (revision 14986)
+++ subversion/po/nb.po (working copy)
@@ -5995,8 +5995,8 @@

 #: libsvn_subr/xml.c:400
 #, c-format
-msgid "Malformed XML: %s at line %d"
-msgstr "Misdannet XML: %s i linje %d"
+msgid "Malformed XML: %s at line %d; XML starts:\n%.*s"
+msgstr "Misdannet XML: %s i linje %d; XML starter:\n%.*s"

 #: libsvn_wc/adm_crawler.c:380
 #, c-format
--- subversion/po/zh_CN.po (revision 14986)
+++ subversion/po/zh_CN.po (working copy)
@@ -5955,8 +5955,8 @@

 #: libsvn_subr/xml.c:400
 #, c-format
-msgid "Malformed XML: %s at line %d"
-msgstr "畸1什1的1XML:%s
(r)1窱1 %d fiŒ1"
+msgid "Malformed XML: %s at line %d; XML starts:\n%.*s"
+msgstr "畸1什1的1XML:%s
(r)1窱1 %d fiŒ1; XML:\n%.*s"

 #: libsvn_wc/adm_crawler.c:380
 #, c-format

### End of patch ###

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Tue Jun 7 05:04:47 2005
