[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

From: Philip Martin <philip_at_codematters.co.uk>
Date: 2005-06-21 03:17:54 CEST

Charles Bailey <bailey.charles@gmail.com> writes:

> I've interspersed my comments in the code, since there's imho zero
> chance that this version of the patch will be
> substantially/stylistically suitable for committing.

That doesn't encourage review! Anyway, if the comments are necessary
to understand the code then they should be part of the code.

> They're far from
> exhaustive, but this message is long enough already.
>
> Conceptual "Log message":
> [[[
> Add function that escapes illegal UTF-8 characters, along the way

"valid" rather than "legal", "invalid" rather than "illegal". That
applies to function names, variables, comments, etc.

> refactoring core of
> string-escaping routines, and insure that illegal XML error message
> outputs legal UTF-8.
> ### Probably best applied as several patches, but collected here for review.

If you think it should be several patches then submit several
patches. On the whole it looks like something that should be one
patch.

> * subversion/po/de.po, subversion/po/es.po, subversion/po/ja.po,
> subversion/po/ko.po, subversion/po/nb.po, subversion/po/pl.po,
> subversion/po/pt_BR.po, subversion/po/sv.po,
> subversion/po/zh_CN.po, subversion/po/zh_TW.po:
> Courtesy to translators, since I've changed a localized string.

I wouldn't apply patches to PO files unless they come from someone who
claims to be fluent.

> --- /dev/null Mon Jun 6 11:06:27 2005
> +++ subversion/libsvn_subr/escape.c Fri Jun 3 19:16:09 2005

New files need a copyright notice.

> @@ -0,0 +1,58 @@
> +/*
> + * escape.c: common code for cleaning up unwanted bytes in strings
> + */
> +
> +#include "escape_impl.h"
> +
> +#define COPY_PREFIX \
> + if (c > base) { \
> + svn_stringbuf_appendbytes (out, base, c - base); \
> + base = c; \
> + }

I don't think I'd have used a macro, it makes the calling code harder
to understand, but I'd prefer a macro with parameters and it needs
documentation.

> +
> +svn_stringbuf_t *
> +svn_subr__escape_string (svn_stringbuf_t **outsbuf,
> + const unsigned char *instr,
> + apr_size_t len,
> + const unsigned char *isok,
> + unsigned char (*mapper) (unsigned char **,
> + const unsigned char *,
> + apr_size_t,
> + const svn_stringbuf_t *,
> + void *,
> + apr_pool_t *),

mapper should be a typedef.

> + void *mapper_baton,
> + apr_pool_t *pool)
> +{
> + unsigned char *base, *c;
> + svn_stringbuf_t *out;
> +
> + if (outsbuf == NULL || *outsbuf == NULL) {
> + out = svn_stringbuf_create ("", pool);

It should probably be created with at least len bytes capacity.

> + if (outsbuf)
> + *outsbuf = out;
> + }
> + else
> + out = *outsbuf;
> +
> + for (c = base = (unsigned char *) instr; c < instr + len; ) {

Try to avoid casting away const.

> + apr_size_t count = isok ? isok[*c] : 0;
> + if (count == 0) {
> + COPY_PREFIX;
> + count = mapper ? mapper (&c, instr, len, out, mapper_baton, pool) : 255;
> + }
> + if (count == 255) {
> + char esc[6];
> +
> + COPY_PREFIX;
> + sprintf (esc,"?\\%03u",*c);
> + svn_stringbuf_appendcstr (out, esc);
> + c++;
> + base = c;
> + }
> + else c += count;
> + }
> + COPY_PREFIX;
> + return out;

This function doesn't follow the project's indentation/formatting
guidelines.

> +}
> +
>
>
> ### Comments are pretty self-explanatory.
> ### Docs are as doxygen; will need to be downgraded to plaintext since it's
> ### an internal header.
> ### As noted above, it makes sense to combine this with utf_impl.h.

If it makes sense to combine it then why is it separate?

> --- /dev/null Mon Jun 6 11:35:47 2005
> +++ subversion/libsvn_subr/escape_impl.h Thu Jun 2 18:44:05 2005

Missing copyright notice.

> @@ -0,0 +1,147 @@
> +/*
> + * escape_impl.h : private header for string escaping function.
> + */
> +
> +
> +
> +#ifndef SVN_LIBSVN_SUBR_ESCAPE_IMPL_H
> +#define SVN_LIBSVN_SUBR_ESCAPE_IMPL_H
> +
> +
> +#include "svn_pools.h"
> +#include "svn_string.h"
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif /* __cplusplus */
> +
> +
> +/** Scan @a instr of length @a len bytes, copying to stringbuf @a *outsbuf,
> + * escaping bytes as indicated by the lookup array @a isok and the mapping
> + * function @a mapper. Memory is allocated from @a pool. You may provide
> + * any extra information needed by @a mapper in @a mapper_baton.
> + * Returns a pointer to the stringbuf containing the escaped string.
> + *
> + * If @a outsbuf or *outsbuf is NULL, a new stringbuf is created; its
> address is

The patch is mangled here, and in several other places.

> + * placed in @a outsbuf unless that argument is NULL.
> + * If @a isok is NULL, then @a mapper is used exclusively.
> + * If @ mapper is NULL, then a single character is escaped every time @a mapper
> + * would have been called.
> + *
> + * This is designed to be the common pathway for various string "escaping"
> + * functions across subversion. The basic approach is to scan
> + * the input and decide whether each byte is OK as it stands, needs to be
> + * "escaped" using subversion's "?\uuu" default format, or needs to be
> + * transformed in some other way. The decision is made using a two step
> + * process, which is designed to handle the simple cases quickly but allow
> + * for more complex mappings. Since the typical string will (we hope)
> + * comprise mostly simple cases, this shouldn't require much code
> + * complexity or loss of efficiency. The two steps used are:

The question is do we need such a general function? One the one hand
it looks like it is more complicated then necessary just to handle
escaping, on the other hand it doesn't look general enough to be used,
say, as a routine to separate multibyte codepoints.

I think I'd prefer a simpler solution, although I haven't tried to
implement one. Perhaps based on the existing validation functions?
(Although since I wrote those I might be biased.)

> + *
> + * 1. The value of a byte from the input string ("test byte") is used as an
> + * index into a (usually 256 byte) array passed in by the caller.
> + * - If the value of the appropriate array element is 0xff,
> + * then the test byte is escaped as a "?\uuu" string in the output.
> + * - If the value of the appropriate element is otherwise non-zero,
> + * that many bytes are copied verbatim from the input to the output.
> + * 2. If the array yields a 0 value, then a mapping function provided by
> + * the caller is used to allow for more complex evaluation. This function
> + * receives five arguments:

Five? I see six in the code.

> + * - a pointer to the pointer used by svn__do_char_escape() to
> + * mark the test byte in the input string
> + * - a pointer to the start of the input string
> + * - the length of the input string
> + * - a pointer to the output stringbuf
> + * - the ever-helpful pool.
> + * The mapping function may return a (positive) nonzero value,
> + * which is interpreted * as described in step 1 above, or zero,
> + * indicating that the test byte * should be ignored. In the latter
> + * case, this is generally because the * mapping function has done the
> + * necessary work itself; it's free to * modify the output stringbuf and
> + * adjust the pointer to the test byte * as it sees fit (within the
> + * bounds of the input string). At a minimum, * it should at least
> + * increment the pointer to the test byte before * returning 0, in order
> + * to avoid an infinite loop.

Are the '*' characters within the comment lines just fallout from
paragraph filling?

> + */
> +
> +svn_stringbuf_t *
> +svn_subr__escape_string (svn_stringbuf_t **outsbuf,
> + const unsigned char *instr,
> + apr_size_t len,
> + const unsigned char *isok,
> + unsigned char (*mapper) (unsigned char **,
> + const unsigned char *,
> + apr_size_t,
> + const svn_stringbuf_t *,
> + void *,
> + apr_pool_t *),

Should be a typedef, the typedef should document the parameters.

> + void *mapper_baton,
> + apr_pool_t *pool);
> +
> +
> +
> +/** Initializer for a basic screening matrix suitable for use with
> + * #svn_subr__escape_string to escape non-UTF-8 bytes.
> + * We provide this since "UTF-8-safety" is a common denominator for
> + * most string escaping in Subversion, so this matrix makes a good
> + * starting point for more involved schemes.
> + */
> +#define SVN_ESCAPE_UTF8_LEGAL_ARRAY { \
> + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> 1, 1,\
> + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> 1, 1,\
> + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> 1, 1,\
> + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> 1, 1,\
> + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> 1, 1,\
> + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> 1, 1,\
> + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> 1, 1,\
> + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> 1, 1,\
> + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0,\
> + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0,\
> + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0,\
> + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0,\
> +255, 255, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0,\
> + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0,\
> + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0,\
> + 0, 0, 0, 0, 0, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255}

That looks a bit like the svn_ctype stuff, but it's a separate
implementation. That makes me uneasy.

> +
> +/** Given pointer @a c into a string which ends at @a e, figure out
> + * whether (*c) starts a valid UTF-8 sequence, and if so, how many bytes
> + * it includes. Return 255 if it's not valid UTF-8.
> + * For a more detailed description of the encoding rules, see the UTF-8
> + * specification in section 3-9 of the Unicode standard 4.0 (e.g. at
> + * http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf),
> + * with special attention to Table 3-6.
> + * This macro is also provided as a building block for mappers used by
> + * #svn_subr__escape_string that want to check for UTF-8-safety in
> + * addition to other tasks.
> + */
> +#define SVN_ESCAPE_UTF8_MAPPING(c,e) \
> + ( (c)[0] < 0x80 ? /* ASCII */ \
> + 1 : /* OK, 1 byte */ \
> + ( ( ((c)[0] > 0xc2 && (c)[0] < 0xdf) && /* 2-byte char */ \
> + ((c) + 1 <= (e)) && /* Got 2 bytes */ \
> + ((c)[1] >= 0x80 && (c)[1] <= 0xbf)) ? /* Byte 2 legal */ \
> + 2 : /* OK, 2 bytes */ \
> + ( ( ((c)[0] >= 0xe0 && (c)[0] <= 0xef) && /* 3 byte char */ \
> + ((c) + 2 <= (e)) && /* Got 3 bytes */ \
> + ((c)[1] >= 0x80 && (c)[1] <= 0xbf) && /* Basic byte 2 legal */ \
> + ((c)[2] >= 0x80 && (c)[2] <= 0xbf) && /* Basic byte 3 legal */ \
> + (!((c)[0] == 0xe0 && (c)[1] < 0xa0)) && /* 0xe0-0x[89]? illegal */\
> + (!((c)[0] == 0xed && (c)[1] > 0x9f)) ) ? /* 0xed-0x[ab]? illegal */\
> + 3 : /* OK, 3 bytes */ \
> + ( ( ((c)[0] >= 0xf0 && (c)[0] <= 0xf4) && /* 4 byte char */ \
> + ((c) + 3 <= (e)) && /* Got 4 bytes */ \
> + ((c)[1] >= 0x80 && (c)[1] <= 0xbf) && /* Basic byte 2 legal */ \
> + ((c)[2] >= 0x80 && (c)[2] <= 0xbf) && /* Basic byte 3 legal */ \
> + ((c)[3] >= 0x80 && (c)[3] <= 0xbf) && /* Basic byte 4 legal */ \
> + (!((c)[0] == 0xf0 && (c)[1] < 0x90)) && /* 0xf0-0x8? illegal */ \
> + (!((c)[0] == 0xf4 && (c)[1] > 0x8f)) ) ? /* 0xf4-0x[9ab]? illegal*/\
> + 4 : /* OK, 4 bytes */ \
> + 255)))) /* Illegal; escape it */

utf_validate.c already implements the UTF-8 encoding rules. There is
obviously some duplication of the algorithm, that makes me uneasy.

Those big macros also make me uneasy.

> +
> +
> +#ifdef __cplusplus
> +}
> +#endif /* __cplusplus */
> +
> +#endif /* SVN_LIBSVN_SUBR_ESCAPE_IMPL_H */
>
>
> ### Function names can be revised to fit convention, of course.
> ### svn_utf__cstring_escape_utf8_fuzzy serves as an example of a benefit of
> ### returning the resultant stringbuf from svn_subr__escape_string both in a
> ### parameter and as the function's return value. If the sense is that
> it'll be a cause
> ### of debugging headaches, or that it's cortrary to subversion
> culture to code public
> ### functions as macros, it's easy enough to code this as a function,
> and to make
> ### svn_subr__escape_string return void (or less likely svn_error_t,
> if it got pickier
> ### about params.)
> --- subversion/libsvn_subr/utf_impl.h (revision 14986)
> +++ subversion/libsvn_subr/utf_impl.h (working copy)
> @@ -24,12 +24,33 @@
>
> #include <apr_pools.h>
> #include "svn_types.h"
> +#include "svn_string.h"
>
> #ifdef __cplusplus
> extern "C" {
> #endif /* __cplusplus */
>
>
> +/** Replace any non-UTF-8 characters in @a len byte long string @a src with
> + * escaped representations, placing the result in a stringbuf pointed to by
> + * @a *dest, which will be created if necessary. Memory is allocated from

How does the user know what "if necessary" means?

> + * @a pool as needed. Returns a pointer to the stringbuf containing the result
> + * (identical to @a *dest, but facilitates chaining calls).
> + */
> +svn_stringbuf_t *
> +svn_utf__stringbuf_escape_utf8_fuzzy (svn_stringbuf_t **dest,
> + const unsigned char *src,
> + apr_size_t len,
> + apr_pool_t *pool);
> +
> +/** Replace any non-UTF-8 characters in @a len byte long string @a src with
> + * escaped representations. Memory is allocated from @a pool as needed.
> + * Returns a pointer to the resulting string.
> + */
> +#define svn_utf__cstring_escape_utf8_fuzzy(src,len,pool) \
> + (svn_utf__stringbuf_escape_utf8_fuzzy(NULL,(src),(len),(pool)))->data

Is there any need for this to be a macro? A real function would be
less "surprising" and that makes it better unless you can justify the
macro.

> +
> +
> const char *svn_utf__cstring_from_utf8_fuzzy (const char *src,
> apr_pool_t *pool,
> svn_error_t *(*convert_from_utf8)
>
>
>
> ### There're other places that could be rewritten in terms of the new escaping
> ### functions, but I hope the two given here serve as an example of how it might
> ### be done.

Complete patches are better than incomplete ones.

> ### The rename to ascii_fuzzy_escape is to distinguish it from the new functions
> ### that escape only illegal UTF-8 sequences.
> --- subversion/libsvn_subr/utf.c (revision 14986)
> +++ subversion/libsvn_subr/utf.c (working copy)
> @@ -30,6 +30,7 @@
> #include "svn_pools.h"
> #include "svn_ctype.h"
> #include "svn_utf.h"
> +#include "escape_impl.h"
> #include "utf_impl.h"
> #include "svn_private_config.h"
>
> @@ -323,53 +324,19 @@
> /* Copy LEN bytes of SRC, converting non-ASCII and zero bytes to ?\nnn
> sequences, allocating the result in POOL. */
> static const char *
> -fuzzy_escape (const char *src, apr_size_t len, apr_pool_t *pool)
> +ascii_fuzzy_escape (const char *src, apr_size_t len, apr_pool_t *pool)
> {
> - const char *src_orig = src, *src_end = src + len;
> - apr_size_t new_len = 0;
> - char *new;
> - const char *new_orig;
> + static unsigned char asciinonul[256];
> + svn_stringbuf_t *result = NULL;
>
> - /* First count how big a dest string we'll need. */
> - while (src < src_end)
> - {
> - if (! svn_ctype_isascii (*src) || *src == '\0')
> - new_len += 5; /* 5 slots, for "?\XXX" */
> - else
> - new_len += 1; /* one slot for the 7-bit char */
> + if (!asciinonul[0]) {
> + asciinonul[0] = 255; /* NUL's not allowed */

That doesn't look threadsafe.

> + memset(asciinonul + 1, 1, 127); /* Other regular ASCII OK */
> + memset(asciinonul + 128, 255, 128); /* High half not allowed */
> + }
>
> - src++;
> - }
> -
> - /* Allocate that amount. */
> - new = apr_palloc (pool, new_len + 1);
> -
> - new_orig = new;
> -
> - /* And fill it up. */
> - while (src_orig < src_end)
> - {
> - if (! svn_ctype_isascii (*src_orig) || src_orig == '\0')
> - {
> - /* This is the same format as svn_xml_fuzzy_escape uses, but that
> - function escapes different characters. Please keep in sync!
> - ### If we add another fuzzy escape somewhere, we should abstract
> - ### this out to a common function. */
> - sprintf (new, "?\\%03u", (unsigned char) *src_orig);
> - new += 5;
> - }
> - else
> - {
> - *new = *src_orig;
> - new += 1;
> - }
> -
> - src_orig++;
> - }
> -
> - *new = '\0';
> -
> - return new_orig;
> + svn_subr__escape_string(&result, src, len, asciinonul, NULL, NULL, pool);
> + return result->data;
> }
>
> /* Convert SRC_LENGTH bytes of SRC_DATA in NODE->handle, store the result
> @@ -448,7 +415,7 @@
> errstr = apr_psprintf
> (pool, _("Can't convert string from '%s' to '%s':"),
> node->frompage, node->topage);
> - err = svn_error_create (apr_err, NULL, fuzzy_escape (src_data,
> + err = svn_error_create (apr_err, NULL, ascii_fuzzy_escape (src_data,
> src_length, pool));
> return svn_error_create (apr_err, err, errstr);
> }
> @@ -564,7 +531,28 @@
> return SVN_NO_ERROR;
> }
>
> +static unsigned char
> +utf8_escape_mapper (unsigned char **targ, const unsigned char *start,
> + apr_size_t len, const svn_stringbuf_t *dest,
> + void *baton, apr_pool_t *pool)

New functions need documentation.

> +{
> + const unsigned char *end = start + len;
> + return SVN_ESCAPE_UTF8_MAPPING(*targ, end);
> +}
>
> +svn_stringbuf_t *
> +svn_utf__stringbuf_escape_utf8_fuzzy (svn_stringbuf_t **dest,
> + const unsigned char *src,
> + apr_size_t len,
> + apr_pool_t *pool)
> +{
> + static unsigned char utf8screen[256] = SVN_ESCAPE_UTF8_LEGAL_ARRAY;
> +
> + return svn_subr__escape_string(dest, src, len,
> + utf8screen, utf8_escape_mapper, NULL,
> + pool);
> +}
> +
> svn_error_t *
> svn_utf_stringbuf_to_utf8 (svn_stringbuf_t **dest,
> const svn_stringbuf_t *src,
> @@ -787,7 +775,7 @@
> const char *escaped, *converted;
> svn_error_t *err;
>
> - escaped = fuzzy_escape (src, strlen (src), pool);
> + escaped = ascii_fuzzy_escape (src, strlen (src), pool);
>
> /* Okay, now we have a *new* UTF-8 string, one that's guaranteed to
> contain only 7-bit bytes :-). Recode to native... */
>
> ### With code comes testing.
> ### Note: Contains 8-bit chars, and also uses convention that cc will treat
> ### "foo" "bar" as "foobar". Both can be avoided if useful for
> finicky compilers.
>
> --- subversion/tests/libsvn_subr/utf-test.c (revision 14986)
> +++ subversion/tests/libsvn_subr/utf-test.c (working copy)
> @@ -17,6 +17,7 @@
> */
>
> #include "../svn_test.h"
> +#include "../../include/svn_utf.h"

Does a plain "svn_utf.h" work?

> #include "../../libsvn_subr/utf_impl.h"
>
> /* Random number seed. Yes, it's global, just pretend you can't see it. */
> @@ -222,6 +223,84 @@
> return SVN_NO_ERROR;
> }
>
> +static svn_error_t *
> +utf_escape (const char **msg,
> + svn_boolean_t msg_only,
> + svn_test_opts_t *opts,
> + apr_pool_t *pool)
> +{
> + char in[] = { 'A', 'S', 'C', 'I', 'I', /* All printable */
> + 'R', 'E', 'T', '\n', 'N', /* Newline */
> + 'B', 'E', 'L', 0x07, '!', /* Control char */
> + 0xd2, 0xa6, 'O', 'K', '2', /* 2-byte char, valid */
> + 0xc0, 0xc3, 'N', 'O', '2', /* 2-byte char, invalid 1st */
> + 0x82, 0xc3, 'N', 'O', '2', /* 2-byte char, invalid 2nd */
> + 0xe4, 0x87, 0xa0, 'O', 'K', /* 3-byte char, valid */
> + 0xe2, 0xff, 0xba, 'N', 'O', /*3-byte char, invalid 2nd */
> + 0xe0, 0x87, 0xa0, 'N', 'O', /*3-byte char, invalid 2nd */
> + 0xed, 0xa5, 0xa0, 'N', 'O', /*3-byte char, invalid 2nd */
> + 0xe4, 0x87, 0xc0, 'N', 'O', /* 3-byte char, invalid 3rd */
> + 0xf2, 0x87, 0xa0, 0xb5, 'Y', /* 4-byte char, valid */
> + 0xf2, 0xd2, 0xa0, 0xb5, 'Y', /* 4-byte char, invalid 2nd */
> + 0xf0, 0x87, 0xa0, 0xb5, 'N', /* 4-byte char, invalid 2nd */
> + 0xf4, 0x97, 0xa0, 0xb5, 'N', /* 4-byte char, invalid 2nd */
> + 0xf2, 0x87, 0xc3, 0xb5, 'N', /* 4-byte char, invalid 3rd */
> + 0xf2, 0x87, 0xa0, 0xd5, 'N', /* 4-byte char, invalid 4th */
> + 0x00 };
> + const unsigned char *legalresult =
> + "ASCIIRET\nNBEL!$-1(c)æOK2?\\192?\\195NO2?\\130?\\195NO2"-A
> + "3$-3õ½0䇠1OK?\\226?\\255?\\186NO?\\224?\\135?\\160NO?\\237?\\165?\\160NO"-A
> + "?\\228?\\135?\\192NO3$-3õ½0ò‡ µ1Y?\\242$-1(c)à?\\181Y?\\240?\\135?\\160"-A
> + "?\\181N?\\244?\\151?\\160?\\181N?\\242?\\135õN?\\242?\\135?\\160"
> + "?\\213N";

I don't like the embedded control characters in the source code, could
you generate them at runtime?

> + const unsigned char *asciiresult =
> + "ASCIIRET\nNBEL\x07!?\\210?\\166OK2?\\192?\\195NO2?\\130?\\195NO2"
> + "?\\228?\\135?\\160OK?\\226?\\255?\\186NO?\\224?\\135?\\160NO"
> + "?\\237?\\165?\\160NO?\\228?\\135?\\192NO?\\242?\\135?\\160?\\181Y"
> + "?\\242?\\210?\\160?\\181Y?\\240?\\135?\\160?\\181N"
> + "?\\244?\\151?\\160?\\181N?\\242?\\135?\\195?\\181N"
> + "?\\242?\\135?\\160?\\213N";
> + const unsigned char *asciified;
> + apr_size_t legalresult_len = 213; /* == strlen(legalresult) iff no NULs */
> + int i = 0;
> + svn_stringbuf_t *escaped = NULL;
> +
> + *msg = "test utf string escaping";
> +
> + if (msg_only)
> + return SVN_NO_ERROR;
> +
> + if (svn_utf__stringbuf_escape_utf8_fuzzy
> + (&escaped, in, sizeof in - 1, pool) != escaped)

I prefer () with sizeof.

> + return svn_error_createf
> + (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);
> + i++;
> + if (escaped->len != legalresult_len)
> + return svn_error_createf
> + (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);
> + i++;
> + if (memcmp(escaped->data, legalresult, legalresult_len))
> + return svn_error_createf
> + (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);
> + i++;
> + if (memcmp(escaped->data, legalresult, legalresult_len))

A duplicate of the one above?

> + return svn_error_createf
> + (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);
> + i++;
> +
> + asciified = svn_utf_cstring_from_utf8_fuzzy(in, pool);
> + if (strlen(asciified) != strlen(asciiresult))
> + return svn_error_createf
> + (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);
> + i++;
> + if (strcmp(asciified, asciiresult))
> + return svn_error_createf
> + (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);
> + i++;
> +
> + return SVN_NO_ERROR;
> +}
> +
>
> /* The test table. */
>
> @@ -230,5 +309,6 @@
> SVN_TEST_NULL,
> SVN_TEST_PASS (utf_validate),
> SVN_TEST_PASS (utf_validate2),
> + SVN_TEST_PASS (utf_escape),
> SVN_TEST_NULL
> };
>
>
>
> ### The original point of this thread.
> ### This patch will apply with an offset, since I've cut out sections which
> ### reimplement XML escaping in terms of the svn_subr__escape_string.
> --- subversion/libsvn_subr/xml.c (revision 14986)
> +++ subversion/libsvn_subr/xml.c (working copy)
> @@ -395,11 +413,22 @@
> /* If expat choked internally, return its error. */
> if (! success)
> {
> + svn_stringbuf_t *sanitized;
> + unsigned char *end;
> +
> + svn_utf__stringbuf_escape_utf8_fuzzy(&sanitized, buf,
> + (len > 240 ? 240 : len),
> + svn_parser->pool);
> + end = sanitized->data +
> + (sanitized->len > 240 ? 240 : sanitized->len);

240? A magic number with no explanation.

> + while (*end > 0x80 && *end < 0xc0 &&
> + (char *) end > sanitized->data) end--;

I think that could generate a huge error message in pathological
cases.

> err = svn_error_createf
> (SVN_ERR_XML_MALFORMED, NULL,
> - _("Malformed XML: %s at line %d"),
> + _("Malformed XML: %s at line %d; XML starts:\n%.*s"),
> XML_ErrorString (XML_GetErrorCode (svn_parser->parser)),
> - XML_GetCurrentLineNumber (svn_parser->parser));
> + XML_GetCurrentLineNumber (svn_parser->parser),
> + (char *) end - sanitized->data + 1, sanitized->data);
>
> /* Kill all parsers and return the expat error */
> svn_xml_free_parser (svn_parser);
>

Overall this patch makes me uneasy, I'd prefer a patch that builds on
our existing ctype and/or utf-8 code. However as I haven't tried to
implement such a patch I don't know whether our existing code is
a suitable starting point.

-- 
Philip Martin
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Tue Jun 21 03:18:52 2005

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.