[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: [PATCH] Re: [PATCH] Include offending XML in "Malformed XML" error message

From: Charles Bailey <bailey.charles_at_gmail.com>
Date: 2005-06-21 20:40:02 CEST

On 6/20/05, Philip Martin <philip@codematters.co.uk> wrote:> Charles Bailey <bailey.charles@gmail.com> writes:> > > I've interspersed my comments in the code, since there's imho zero> > chance that this version of the patch will be> > substantially/stylistically suitable for committing.> > That doesn't encourage review! Anyway, if the comments are necessary> to understand the code then they should be part of the code.
Hmm. Perhaps I'm phrasing poorly. My intent was to say that the patch wasfunctionally correct, but there are enough issues not only of style (ala HACKING)but of "project philosophy", for lack of a better term -- whethermacros are preferredto repeated code, framing of comments, calling sequence conventions --that it wasunlikely to be in final form. I'd hoped to encourage review, ratherthan put it off.
Those comments which I intended to document the code are present as C-stylecomments in the patch. It's the "philosophical" questions I've interspersed.
I hope this helps make my intent clearer.
> > They're far from> > exhaustive, but this message is long enough already.> >> > Conceptual "Log message":> > [[[> > Add function that escapes illegal UTF-8 characters, along the way> > "valid" rather than "legal", "invalid" rather than "illegal". That> applies to function names, variables, comments, etc.
OK, I'm happy to follow that convention.
> > refactoring core of> > string-escaping routines, and insure that illegal XML error message> > outputs legal UTF-8.> > ### Probably best applied as several patches, but collected here for review.> > If you think it should be several patches then submit several> patches. On the whole it looks like something that should be one> patch.
Thanks. This is another place where I was unsure of the philosophy. If the policy is one-change-per-commit, then I can frame it as addingthe driver in one patch, using the driver to escape strings to UTF8 ina second, and using that to fix the XML error message in a third. Ifthe policy is one-problem-per-patch, then I agree it makes sense as asingle patch. (I'd still keep refactoring other string escapingroutines to use the driver as a separate patch, no?)
> > * subversion/po/de.po, subversion/po/es.po, subversion/po/ja.po,> > subversion/po/ko.po, subversion/po/nb.po, subversion/po/pl.po,> > subversion/po/pt_BR.po, subversion/po/sv.po,> > subversion/po/zh_CN.po, subversion/po/zh_TW.po:> > Courtesy to translators, since I've changed a localized string.> > I wouldn't apply patches to PO files unless they come from someone who> claims to be fluent.
OK; I can drop them.
> > --- /dev/null Mon Jun 6 11:06:27 2005> > +++ subversion/libsvn_subr/escape.c Fri Jun 3 19:16:09 2005> > New files need a copyright notice.
Right. I'd noticed that, but since I don't know Collabnet's policyhere I didn't wantto presume. Is it acceptable to clone the copyright from an existingfile? Doesthat suffice to Collabnet as assignment of copyright? Do I have a pledge fromCollabnet that any code whose copyright is assigned in this way willremain freelyavailable in perpetuity? (By "freely available", I mean termssubstantially identicalto Subversion's current licensing; I'm not trying to start a skirmishover softwarelicensing.)
> > > @@ -0,0 +1,58 @@> > +/*> > + * escape.c: common code for cleaning up unwanted bytes in strings> > + */> > +> > +#include "escape_impl.h"> > +> > +#define COPY_PREFIX \> > + if (c > base) { \> > + svn_stringbuf_appendbytes (out, base, c - base); \> > + base = c; \> > + }> > I don't think I'd have used a macro, it makes the calling code harder> to understand, but I'd prefer a macro with parameters and it needs> documentation.
Hmm. The only purpose of the macro here is to make the calling code easierto understand; if it's not helping, I can just inline the prefix checkin the threeplaces where it's required.
> > +> > +svn_stringbuf_t *> > +svn_subr__escape_string (svn_stringbuf_t **outsbuf,> > + const unsigned char *instr,> > + apr_size_t len,> > + const unsigned char *isok,> > + unsigned char (*mapper) (unsigned char **,> > + const unsigned char *,> > + apr_size_t,> > + const svn_stringbuf_t *,> > + void *,> > + apr_pool_t *),> > mapper should be a typedef.
OK.
> > + void *mapper_baton,> > + apr_pool_t *pool)> > +{> > + unsigned char *base, *c;> > + svn_stringbuf_t *out;> > +> > + if (outsbuf == NULL || *outsbuf == NULL) {> > + out = svn_stringbuf_create ("", pool);> > It should probably be created with at least len bytes capacity.
Sounds fair. Would it be better to create it with as many bytes asthe input string length?
> > + if (outsbuf)> > + *outsbuf = out;> > + }> > + else> > + out = *outsbuf;> > +> > + for (c = base = (unsigned char *) instr; c < instr + len; ) {> > Try to avoid casting away const.
Agreed. I think I tripped a compiler warning trying to leave everything (instr, base, and c) const, though it should be legal -- I'll haveanother look.
> > + apr_size_t count = isok ? isok[*c] : 0;> > + if (count == 0) {> > + COPY_PREFIX;> > + count = mapper ? mapper (&c, instr, len, out, mapper_baton, pool) : 255;> > + }> > + if (count == 255) {> > + char esc[6];> > +> > + COPY_PREFIX;> > + sprintf (esc,"?\\%03u",*c);> > + svn_stringbuf_appendcstr (out, esc);> > + c++;> > + base = c;> > + }> > + else c += count;> > + }> > + COPY_PREFIX;> > + return out;> > This function doesn't follow the project's indentation/formatting> guidelines.
Got it. I reflexively cuddled the braces, but can push them down.
> > +}> > +> >> >> > ### Comments are pretty self-explanatory.> > ### Docs are as doxygen; will need to be downgraded to plaintext since it's> > ### an internal header.> > ### As noted above, it makes sense to combine this with utf_impl.h.> > If it makes sense to combine it then why is it separate?
I didn't want on a first outing to be rearranging existing files. Atsome level, this isalso a question of philosophy. It seems to me that one "private"header file foreverything local to libsvn_subr is the best way to go. If, however,project policyis one header per topic, then I'd use a separate escape.h.

> > --- /dev/null Mon Jun 6 11:35:47 2005> > +++ subversion/libsvn_subr/escape_impl.h Thu Jun 2 18:44:05 2005> > Missing copyright notice.> > > @@ -0,0 +1,147 @@> > +/*> > + * escape_impl.h : private header for string escaping function.> > + */> > +> > +> > + > > +#ifndef SVN_LIBSVN_SUBR_ESCAPE_IMPL_H> > +#define SVN_LIBSVN_SUBR_ESCAPE_IMPL_H> > +> > +> > +#include "svn_pools.h"> > +#include "svn_string.h"> > +> > +#ifdef __cplusplus> > +extern "C" {> > +#endif /* __cplusplus */> > +> > + > > +/** Scan @a instr of length @a len bytes, copying to stringbuf @a *outsbuf,> > + * escaping bytes as indicated by the lookup array @a isok and the mapping> > + * function @a mapper. Memory is allocated from @a pool. You may provide> > + * any extra information needed by @a mapper in @a mapper_baton.> > + * Returns a pointer to the stringbuf containing the escaped string.> > + *> > + * If @a outsbuf or *outsbuf is NULL, a new stringbuf is created; its> > address is> > The patch is mangled here, and in several o
ther places.
Right. When Michael Thelen pointed that out, I resent it as an attachment;I hope that came through in better shape.
> > + * placed in @a outsbuf unless that argument is NULL.> > + * If @a isok is NULL, then @a mapper is used exclusively.> > + * If @ mapper is NULL, then a single character is escaped every time @a mapper> > + * would have been called.> > + *> > + * This is designed to be the common pathway for various string "escaping"> > + * functions across subversion. The basic approach is to scan> > + * the input and decide whether each byte is OK as it stands, needs to be> > + * "escaped" using subversion's "?\uuu" default format, or needs to be> > + * transformed in some other way. The decision is made using a two step> > + * process, which is designed to handle the simple cases quickly but allow> > + * for more complex mappings. Since the typical string will (we hope)> > + * comprise mostly simple cases, this shouldn't require much code> > + * complexity or loss of efficiency. The two steps used are:> > The question is do we need such a general function? One the one hand> it looks like it is more complicated th
en necessary just to handle> escaping, on the other hand it doesn't look general enough to be used,> say, as a routine to separate multibyte codepoints.> > I think I'd prefer a simpler solution, although I haven't tried to> implement one. Perhaps based on the existing validation functions?> (Although since I wrote those I might be biased.)
I think this is the crucial philosophic question. For the specificproblem with whichI started (given a string from I-know-not-where, and hence inI-know-not-what-encoding,emit a string that is valid UTF8), it's certainly easy enough to writeanother escaping function, similar to the ones already present. Thatapproach seemed less than ideal to me, since it would introduce yetanother function to be maintained in parallel, etc. The conceptualsimplicity may make it most useful in the end, though.
I tried to write the driver to be general enough to handle differenttasks, but reasonably fast for common tasks. Since most of thestrings appearing within Subversion are UTF8, I chose a byte-orientedscreen. (I'm also trying to take advantage of the nice property ofUTF8 that one can identify multibyte codepoints from the first byte,iff the string is guaranteed to be valid.) I'm using the mappingfunction as a "trap door" to permit more involved (includingmultibyte) testing; in this case, there's no benefit in speed, butperhaps a common point for the actual escaping of bytes. More generalsolutions (of which I could conceive) seemed to me to entail yet morecomplexity; I chose this level as the appropriate tradeoff. It maywell be that the core developers choose a different level. I can workwith that.
> > + *> > + * 1. The value of a byte from the input string ("test byte") is used as an> > + * index into a (usually 256 byte) array passed in by the caller.> > + * - If the value of the appropriate array element is 0xff,> > + * then the test byte is escaped as a "?\uuu" string in the output.> > + * - If the value of the appropriate element is otherwise non-zero,> > + * that many bytes are copied verbatim from the input to the output.> > + * 2. If the array yields a 0 value, then a mapping function provided by> > + * the caller is used to allow for more complex evaluation. This function> > + * receives five arguments:> > Five? I see six in the code.
Drat. I thought I'd added the baton to the docment. My omission.
> > + * - a pointer to the pointer used by svn__do_char_escape() to> > + * mark the test byte in the input string> > + * - a pointer to the start of the input string> > + * - the length of the input string> > + * - a pointer to the output stringbuf> > + * - the ever-helpful pool.> > + * The mapping function may return a (positive) nonzero value,> > + * which is interpreted * as described in step 1 above, or zero,> > + * indicating that the test byte * should be ignored. In the latter> > + * case, this is generally because the * mapping function has done the> > + * necessary work itself; it's free to * modify the output stringbuf and> > + * adjust the pointer to the test byte * as it sees fit (within the> > + * bounds of the input string). At a minimum, * it should at least> > + * increment the pointer to the test byte before * returning 0, in order> > + * to avoid an infinite loop.> > Are the '*' characters within the comment lines just fallout from> paragraph filling?
Yes.
> > + */> > +> > +svn_stringbuf_t *> > +svn_subr__escape_string (svn_stringbuf_t **outsbuf,> > + const unsigned char *instr,> > + apr_size_t len,> > + const unsigned char *isok,> > + unsigned char (*mapper) (unsigned char **,> > + const unsigned char *,> > + apr_size_t,> > + const svn_stringbuf_t *,> > + void *,> > + apr_pool_t *),> > Should be a typedef, the typedef should document the parameters.
OK.
> > + void *mapper_baton,> > + apr_pool_t *pool);> > +> > +> > +> > +/** Initializer for a basic screening matrix suitable for use with> > + * #svn_subr__escape_string to escape non-UTF-8 bytes.> > + * We provide this since "UTF-8-safety" is a common denominator for> > + * most string escaping in Subversion, so this matrix makes a good> > + * starting point for more involved schemes.> > + */> > +#define SVN_ESCAPE_UTF8_LEGAL_ARRAY { \> > + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,> > 1, 1,\> > + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,> > 1, 1,\> > + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,> > 1, 1,\> > + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,> > 1, 1,\> > + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,> > 1, 1,\> > + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,> > 1, 1,\> > + 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,> > 1, 1,\> > + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,> > 1, 1,\> > + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,> > 0, 0,\> > + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,> > 0, 0,\> > + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,> > 0, 0,\> > + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,> > 0, 0,\> > +255, 255, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,> > 0, 0,\> > + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,> > 0, 0,\> > + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,> > 0, 0,\> > + 0, 0, 0, 0, 0, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255}> > That looks a bit like the svn_ctype stuff, but it's a separate> implementation. That makes me uneasy.
Fair enough. The underlying problem is that different situationsrequire slightly different behaviors, of course. I'm not sure whetherthe best solution is to start with a single framework and tweak it atruntime as needed, or to set up the framework for each case separatelyin the interest of speed over space.
> > +> > +/** Given pointer @a c into a string which ends at @a e, figure out> > + * whether (*c) starts a valid UTF-8 sequence, and if so, how many bytes> > + * it includes. Return 255 if it's not valid UTF-8.> > + * For a more detailed description of the encoding rules, see the UTF-8> > + * specification in section 3-9 of the Unicode standard 4.0 (e.g. at> > + * http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf),> > + * with special attention to Table 3-6.> > + * This macro is also provided as a building block for mappers used by> > + * #svn_subr__escape_string that want to check for UTF-8-safety in> > + * addition to other tasks.> > + */> > +#define SVN_ESCAPE_UTF8_MAPPING(c,e) \> > + ( (c)[0] < 0x80 ? /* ASCII */ \> > + 1 : /* OK, 1 byte */ \> > + ( ( ((c)[0] > 0xc2 && (c)[0] < 0xdf) && /* 2-byte char */ \> > + ((c) +
1 <= (e)) && /* Got 2 bytes */ \> > + ((c)[1] >= 0x80 && (c)[1] <= 0xbf)) ? /* Byte 2 legal */ \> > + 2 : /* OK, 2 bytes */ \> > + ( ( ((c)[0] >= 0xe0 && (c)[0] <= 0xef) && /* 3 byte char */ \> > + ((c) + 2 <= (e)) && /* Got 3 bytes */ \> > + ((c)[1] >= 0x80 && (c)[1] <= 0xbf) && /* Basic byte 2 legal */ \> > + ((c)[2] >= 0x80 && (c)[2] <= 0xbf) && /* Basic byte 3 legal */ \> > + (!((c)[0] == 0xe0 && (c)[1] < 0xa0)) && /* 0xe0-0x[89]? illegal */\> > + (!((c)[0] == 0xed && (c)[1] > 0x9f)) ) ? /* 0xed-0x[ab]? illegal */\> > + 3 : /* OK, 3 bytes */ \> > + ( ( ((c)[0] >= 0xf0 && (c)[0] <= 0xf4) && /* 4 byte char */ \> > + ((c) + 3 <= (e)) && /* Got 4 bytes */ \> > + ((c)[1] >= 0x80 && (c)[1] <= 0xbf)
    && /* Basic byte 2 legal */ \> > + ((c)[2] >= 0x80 && (c)[2] <= 0xbf) && /* Basic byte 3 legal */ \> > + ((c)[3] >= 0x80 && (c)[3] <= 0xbf) && /* Basic byte 4 legal */ \> > + (!((c)[0] == 0xf0 && (c)[1] < 0x90)) && /* 0xf0-0x8? illegal */ \> > + (!((c)[0] == 0xf4 && (c)[1] > 0x8f)) ) ? /* 0xf4-0x[9ab]? illegal*/\> > + 4 : /* OK, 4 bytes */ \> > + 255)))) /* Illegal; escape it */> > utf_validate.c already implements the UTF-8 encoding rules. There is> obviously some duplication of the algorithm, that makes me uneasy.
True enough. I'd done this for speed and compactness relative to thestate machine in utf_validate.c, expecting comments from the coredevelopers about whether it was a net win. For the specific case ofescaping invalid UTF8, I could write a mapper to call out tosvn_utf__last_valid per character.
> Those big macros also make me uneasy.
Is the concern that it'll choke come compilers, or that they're hardto maintain?This may just be a stylistic issue: I tend to favor macros forrepeated tasks, on the theory that as long as one debugs it carefully,the benefit of inlining is worth the effort of coding.
> > +> > + > > +#ifdef __cplusplus> > +}> > +#endif /* __cplusplus */> > +> > +#endif /* SVN_LIBSVN_SUBR_ESCAPE_IMPL_H */> >> >> > ### Function names can be revised to fit convention, of course.> > ### svn_utf__cstring_escape_utf8_fuzzy serves as an example of a benefit of> > ### returning the resultant stringbuf from svn_subr__escape_string both in a> > ### parameter and as the function's return value. If the sense is that> > it'll be a cause> > ### of debugging headaches, or that it's cortrary to subversion> > culture to code public> > ### functions as macros, it's easy enough to code this as a function,> > and to make> > ### svn_subr__escape_string return void (or less likely svn_error_t,> > if it got pickier> > ### about params.)> > --- subversion/libsvn_subr/utf_impl.h (revision 14986)> > +++ subversion/libsvn_subr/utf_impl.h (working copy)> > @@ -24,12 +24,33 @@> >> > #include <apr_pools.h>> > #include "svn_types.h"> > +#include "svn_string.h"> >> > #ifdef __cplusplus> > extern "C" {> > #endif /*
__cplusplus */> >> >> > +/** Replace any non-UTF-8 characters in @a len byte long string @a src with> > + * escaped representations, placing the result in a stringbuf pointed to by> > + * @a *dest, which will be created if necessary. Memory is allocated from> > How does the user know what "if necessary" means?
Fair point. s/necessary/NULL/.
> > + * @a pool as needed. Returns a pointer to the stringbuf containing the result> > + * (identical to @a *dest, but facilitates chaining calls).> > + */> > +svn_stringbuf_t *> > +svn_utf__stringbuf_escape_utf8_fuzzy (svn_stringbuf_t **dest,> > + const unsigned char *src,> > + apr_size_t len,> > + apr_pool_t *pool);> > +> > +/** Replace any non-UTF-8 characters in @a len byte long string @a src with> > + * escaped representations. Memory is allocated from @a pool as needed.> > + * Returns a pointer to the resulting string.> > + */> > +#define svn_utf__cstring_escape_utf8_fuzzy(src,len,pool) \> > + (svn_utf__stringbuf_escape_utf8_fuzzy(NULL,(src),(len),(pool)))->data> > Is there any need for this to be a macro? A real function would be> less "surprising" and that makes it better unless you can justify the> macro.
OK. I think this is another instance of the style point noted above.
> > +> > +> > const char *svn_utf__cstring_from_utf8_fuzzy (const char *src,> > apr_pool_t *pool,> > svn_error_t *(*convert_from_utf8)> >> >> >> > ### There're other places that could be rewritten in terms of the new escaping> > ### functions, but I hope the two given here serve as an example of how it might> > ### be done.> > Complete patches are better than incomplete ones.
Sure. My standard for 'complete' might have been a bit low: I wasaiming for a unit that compiled, passed tests, and illustrated theissues I believed were still open. I thought I might wait to seewhether the notion of a "common escaping routine" would carry beforeextending it to XML, other UTF8 tasks, etc.

> > ### The rename to ascii_fuzzy_escape is to distinguish it from the new functions> > ### that escape only illegal UTF-8 sequences.> > --- subversion/libsvn_subr/utf.c (revision 14986)> > +++ subversion/libsvn_subr/utf.c (working copy)> > @@ -30,6 +30,7 @@> > #include "svn_pools.h"> > #include "svn_ctype.h"> > #include "svn_utf.h"> > +#include "escape_impl.h"> > #include "utf_impl.h"> > #include "svn_private_config.h"> >> > @@ -323,53 +324,19 @@> > /* Copy LEN bytes of SRC, converting non-ASCII and zero bytes to ?\nnn> > sequences, allocating the result in POOL. */> > static const char *> > -fuzzy_escape (const char *src, apr_size_t len, apr_pool_t *pool)> > +ascii_fuzzy_escape (const char *src, apr_size_t len, apr_pool_t *pool)> > {> > - const char *src_orig = src, *src_end = src + len;> > - apr_size_t new_len = 0;> > - char *new;> > - const char *new_orig;> > + static unsigned char asciinonul[256];> > + svn_stringbuf_t *result = NULL;> >> > - /* First count how big a dest str
ing we'll need. */> > - while (src < src_end)> > - {> > - if (! svn_ctype_isascii (*src) || *src == '\0')> > - new_len += 5; /* 5 slots, for "?\XXX" */> > - else> > - new_len += 1; /* one slot for the 7-bit char */> > + if (!asciinonul[0]) {> > + asciinonul[0] = 255; /* NUL's not allowed */> > That doesn't look threadsafe.
It is safe, though it doesn't prevent duplication.
> > + memset(asciinonul + 1, 1, 127); /* Other regular ASCII OK */> > + memset(asciinonul + 128, 255, 128); /* High half not allowed */> > + }> >> > - src++;> > - }> > -> > - /* Allocate that amount. */> > - new = apr_palloc (pool, new_len + 1);> > -> > - new_orig = new;> > -> > - /* And fill it up. */> > - while (src_orig < src_end)> > - {> > - if (! svn_ctype_isascii (*src_orig) || src_orig == '\0')> > - {> > - /* This is the same format as svn_xml_fuzzy_escape uses, but that> > - function escapes different characters. Please keep in sync!> > - ### If we add another fuzzy escape somewhere, we should abstract> > - ### this out to a common function. */> > - sprintf (new, "?\\%03u", (unsigned char) *src_orig);> > - new += 5;> > - }> > - else> > - {> > - *new = *src_orig;> > - new += 1;> > - }> > -> > - src_orig++;> > - }> > -> > - *new = '\0';> > ->
> - return new_orig;> > + svn_subr__escape_string(&result, src, len, asciinonul, NULL, NULL, pool);> > + return result->data;> > }> >> > /* Convert SRC_LENGTH bytes of SRC_DATA in NODE->handle, store the result> > @@ -448,7 +415,7 @@> > errstr = apr_psprintf> > (pool, _("Can't convert string from '%s' to '%s':"),> > node->frompage, node->topage);> > - err = svn_error_create (apr_err, NULL, fuzzy_escape (src_data,> > + err = svn_error_create (apr_err, NULL, ascii_fuzzy_escape (src_data,> > src_length, pool));> > return svn_error_create (apr_err, err, errstr);> > }> > @@ -564,7 +531,28 @@> > return SVN_NO_ERROR;> > }> >> > +static unsigned char> > +utf8_escape_mapper (unsigned char **targ, const unsigned char *start,> > + apr_size_t len, const svn_stringbuf_t *dest,> > + void *baton, apr_pool_t *pool)> > New functions need documentation.
OK. I can add a docment above it.
> > +{> > + const unsigned char *end = start + len;> > + return SVN_ESCAPE_UTF8_MAPPING(*targ, end);> > +}> >> > +svn_stringbuf_t *> > +svn_utf__stringbuf_escape_utf8_fuzzy (svn_stringbuf_t **dest,> > + const unsigned char *src,> > + apr_size_t len,> > + apr_pool_t *pool)> > +{> > + static unsigned char utf8screen[256] = SVN_ESCAPE_UTF8_LEGAL_ARRAY;> > +> > + return svn_subr__escape_string(dest, src, len,> > + utf8screen, utf8_escape_mapper, NULL,> > + pool);> > +}> > +> > svn_error_t *> > svn_utf_stringbuf_to_utf8 (svn_stringbuf_t **dest,> > const svn_stringbuf_t *src,> > @@ -787,7 +775,7 @@> > const char *escaped, *converted;> > svn_error_t *err;> >> > - escaped = fuzzy_escape (src, strlen (src), pool);> > + escaped = ascii_fuzzy_escape (src, strlen (src), pool);> >> > /* Okay, now we have a *new* UTF-8 s
tring, one that's guaranteed to> > contain only 7-bit bytes :-). Recode to native... */> >> > ### With code comes testing.> > ### Note: Contains 8-bit chars, and also uses convention that cc will treat> > ### "foo" "bar" as "foobar". Both can be avoided if useful for> > finicky compilers.> >> > --- subversion/tests/libsvn_subr/utf-test.c (revision 14986)> > +++ subversion/tests/libsvn_subr/utf-test.c (working copy)> > @@ -17,6 +17,7 @@> > */> >> > #include "../svn_test.h"> > +#include "../../include/svn_utf.h"> > Does a plain "svn_utf.h" work?
Hmm. I don't know; I'd followed the convention of the prior line. I'll give it a try.
> > #include "../../libsvn_subr/utf_impl.h"> >> > /* Random number seed. Yes, it's global, just pretend you can't see it. */> > @@ -222,6 +223,84 @@> > return SVN_NO_ERROR;> > }> >> > +static svn_error_t *> > +utf_escape (const char **msg,> > + svn_boolean_t msg_only,> > + svn_test_opts_t *opts,> > + apr_pool_t *pool)> > +{> > + char in[] = { 'A', 'S', 'C', 'I', 'I', /* All printable */> > + 'R', 'E', 'T', '\n', 'N', /* Newline */> > + 'B', 'E', 'L', 0x07, '!', /* Control char */> > + 0xd2, 0xa6, 'O', 'K', '2', /* 2-byte char, valid */> > + 0xc0, 0xc3, 'N', 'O', '2', /* 2-byte char, invalid 1st */> > + 0x82, 0xc3, 'N', 'O', '2', /* 2-byte char, invalid 2nd */> > + 0xe4, 0x87, 0xa0, 'O', 'K', /* 3-byte char, valid */> > + 0xe2, 0xff, 0xba, 'N', 'O', /*3-byte char, invalid 2nd */> > + 0xe0, 0x87, 0xa0, 'N', 'O', /*3-byte char, invalid 2nd */> > + 0xed, 0
xa5, 0xa0, 'N', 'O', /*3-byte char, invalid 2nd */> > + 0xe4, 0x87, 0xc0, 'N', 'O', /* 3-byte char, invalid 3rd */> > + 0xf2, 0x87, 0xa0, 0xb5, 'Y', /* 4-byte char, valid */> > + 0xf2, 0xd2, 0xa0, 0xb5, 'Y', /* 4-byte char, invalid 2nd */> > + 0xf0, 0x87, 0xa0, 0xb5, 'N', /* 4-byte char, invalid 2nd */> > + 0xf4, 0x97, 0xa0, 0xb5, 'N', /* 4-byte char, invalid 2nd */> > + 0xf2, 0x87, 0xc3, 0xb5, 'N', /* 4-byte char, invalid 3rd */> > + 0xf2, 0x87, 0xa0, 0xd5, 'N', /* 4-byte char, invalid 4th */> > + 0x00 };> > + const unsigned char *legalresult => > + "ASCIIRET\nNBEL!$-1(c)�OK2?\\192?\\195NO2?\\130?\\195NO2"-A> > + "3$-3�0䇠1OK?\\226?\\255?\\186NO?\\224?\\135?\\160NO?\\237?\\165?\\160NO"-A> > + "?\\228?\\135?\\192NO3$-3�0򇠵1Y?\\242$-1(c)�?\\181Y?\\240?\\135?\\160"-A> > + "?\\181N?\\244?\\151?\\160?\\181N?\\242?\\135�N?\\242?\\135?\\160"> > + "?\\213N";> > I don't like
 the embedded control characters in the source code, could> you generate them at runtime?
Sure. I could also just C-escape them like the high-half bytes above.
> > + const unsigned char *asciiresult => > + "ASCIIRET\nNBEL\x07!?\\210?\\166OK2?\\192?\\195NO2?\\130?\\195NO2"> > + "?\\228?\\135?\\160OK?\\226?\\255?\\186NO?\\224?\\135?\\160NO"> > + "?\\237?\\165?\\160NO?\\228?\\135?\\192NO?\\242?\\135?\\160?\\181Y"> > + "?\\242?\\210?\\160?\\181Y?\\240?\\135?\\160?\\181N"> > + "?\\244?\\151?\\160?\\181N?\\242?\\135?\\195?\\181N"> > + "?\\242?\\135?\\160?\\213N";> > + const unsigned char *asciified;> > + apr_size_t legalresult_len = 213; /* == strlen(legalresult) iff no NULs */> > + int i = 0;> > + svn_stringbuf_t *escaped = NULL;> > +> > + *msg = "test utf string escaping";> > +> > + if (msg_only)> > + return SVN_NO_ERROR;> > +> > + if (svn_utf__stringbuf_escape_utf8_fuzzy> > + (&escaped, in, sizeof in - 1, pool) != escaped)> > I prefer () with sizeof.
OK.
> > + return svn_error_createf> > + (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);> > + i++;> > + if (escaped->len != legalresult_len)> > + return svn_error_createf> > + (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);> > + i++;> > + if (memcmp(escaped->data, legalresult, legalresult_len))> > + return svn_error_createf> > + (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);> > + i++;> > + if (memcmp(escaped->data, legalresult, legalresult_len))> > A duplicate of the one above?
Er, yes. I must've mispasted. Sorry.
> > + return svn_error_createf> > + (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);> > + i++;> > +> > + asciified = svn_utf_cstring_from_utf8_fuzzy(in, pool);> > + if (strlen(asciified) != strlen(asciiresult))> > + return svn_error_createf> > + (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);> > + i++;> > + if (strcmp(asciified, asciiresult))> > + return svn_error_createf> > + (SVN_ERR_TEST_FAILED, NULL, "UTF-8 escape test %d failed", i);> > + i++;> > +> > + return SVN_NO_ERROR;> > +}> > +> >> > /* The test table. */> >> > @@ -230,5 +309,6 @@> > SVN_TEST_NULL,> > SVN_TEST_PASS (utf_validate),> > SVN_TEST_PASS (utf_validate2),> > + SVN_TEST_PASS (utf_escape),> > SVN_TEST_NULL> > };> >> >> >> > ### The original point of this thread.> > ### This patch will apply with an offset, since I've cut out sections which> > ### reimplement XML escaping in terms of the svn_subr__escape_string.> > --- subversion/libsvn_subr/xml.c (
revision 14986)> > +++ subversion/libsvn_subr/xml.c (working copy)> > @@ -395,11 +413,22 @@> > /* If expat choked internally, return its error. */> > if (! success)> > {> > + svn_stringbuf_t *sanitized;> > + unsigned char *end;> > +> > + svn_utf__stringbuf_escape_utf8_fuzzy(&sanitized, buf,> > + (len > 240 ? 240 : len),> > + svn_parser->pool);> > + end = sanitized->data +> > + (sanitized->len > 240 ? 240 : sanitized->len);> > 240? A magic number with no explanation.
I can add a comment. It was discussed earlier in this thread, but Iagree it'd helpin the future to add something to the code.

> > + while (*end > 0x80 && *end < 0xc0 &&> > + (char *) end > sanitized->data) end--;> > I think that could generate a huge error message in pathological> cases.
Up to 240 characters, yes. My intent was to include enough to givethe reader a fair idea where the error occurred, without subjectingthem to screensful of junk. Setting the cutoff at 240 seemed to give reasonable chunk of XML element for the cases I checked (errors in theentries file.)
> > err = svn_error_createf> > (SVN_ERR_XML_MALFORMED, NULL,> > - _("Malformed XML: %s at line %d"),> > + _("Malformed XML: %s at line %d; XML starts:\n%.*s"),> > XML_ErrorString (XML_GetErrorCode (svn_parser->parser)),> > - XML_GetCurrentLineNumber (svn_parser->parser));> > + XML_GetCurrentLineNumber (svn_parser->parser),> > + (char *) end - sanitized->data + 1, sanitized->data);> >> > /* Kill all parsers and return the expat error */> > svn_xml_free_parser (svn_parser);> >> > Overall this patch makes me uneasy, I'd prefer a patch that builds on> our existing ctype and/or utf-8 code. However as I haven't tried to> implement such a patch I don't know whether our existing code is> a suitable starting point.
That's certainly possible. I'd actually started out that way, andswitched to the above when it seemed I was about to add ya parallelroutine to escape a string in a slightly different way.
Thanks for the review. If you or one of the other core developerscould answer some of the "philosophic" questions above, I can revisethe patch appropriately.
-- Regards,Charles BaileyLists: bailey _dot_ charles _at_ gmail _dot_ comOther: bailey _at_ newman _dot_ upenn _dot_ edu
Received on Tue Jun 21 20:41:38 2005

This is an archived mail posted to the Subversion Dev mailing list.