[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: [PATCH] allow svnsync to translate non-UTF-8 log messages to UTF-8 (v. 2)

From: Daniel Shahaf <d.s_at_daniel.shahaf.name>
Date: Mon, 20 Sep 2010 05:39:24 +0100

(to stsp and other interested parties: I reviewed only the libsvn_subr
part of the patch; see last paragraph for details)

Daniel Trebbien wrote on Mon, Sep 13, 2010 at 19:22:38 -0700:
> Attached are the new patch and log message.
>
> I have already shared one test repository with this mailing list that
> is useful in testing that revprops are re-encoded properly. Since
> then, I have created two more: test2 is useful in testing that
> directory properties (in this case `svn:ignore`) are re-encoded, and
> test3 is useful in testing that line endings are normalized to
> Unix-style in revprops (and that the "# normalized" message is still
> there). Archives of these new repositories have also been attached.
> Also, this shell script makes testing with these repositories *much*
> easier: http://pastebin.com/gz1pmngc
>
>
> #! /usr/bin/env sh
> TEST=test3 # change this line to test with a different repository
> rm -Rf ${TEST}_mirror
> svnadmin create ${TEST}_mirror
> cat <<'EOF' > ${TEST}_mirror/hooks/pre-revprop-change
> #! /usr/bin/env sh
> USER="$3"
>
> if [ "$USER" = "svnsync" ]; then exit 0; fi
>
> echo "Only the svnsync user can change revprops" >&2
> exit 1
> EOF
> chmod +x ${TEST}_mirror/hooks/pre-revprop-change
> svnsync init \
> --username svnsync --source-encoding ISO-8859-1 \
> file://`pwd`/${TEST}_mirror file://`pwd`/${TEST}
> svnsync sync \
> --username svnsync --source-encoding ISO-8859-1 \
> file://`pwd`/${TEST}_mirror

> Subject: [PATCH] allow svnsync to translate non-UTF-8 log messages to UTF-8 (v. 2)
>
> [[[
> Add a command line option (--source-encoding) to the svnsync init, sync, and
> copy-revprops subcommands that allows the user to specify the character
> encoding of translatable properties from the source repository. This is needed
> to allow svnsync to sync some older Subversion repositories that have
> properties that were not encoded in UTF-8.
>

Could you mention in here (at the paragraphs beginning the log message)
that you're adding a new API, too? I think it should be mentioned since
that forms a significant part of the change.

> As discussed at:
> http://thread.gmane.org/gmane.comp.version-control.subversion.user/100020
> http://thread.gmane.org/gmane.comp.version-control.subversion.devel/122518
>
> * subversion/include/svn_subst.h
> Added documentation of the new public API function
> svn_subst_translate_string2().
>

Should use this syntax:

* subversion/include/svn_subst.h
  (svn_subst_translate_string2): New function.

(and mention svn_subst_translate_string() too, per below review comments)

> * subversion/libsvn_subr/subst.c
> (translation_baton): Added TRANSLATED_EOL field. It's where translate_chunk()
> records that it translated a line ending.
> (create_translation_baton): Added a new parameter TRANSLATED_EOL. Initialize
> the TRANSLATED_EOL field of the resulting translation_baton to this value.
> (translate_chunk): After translating a line ending, set the value that is
> pointed to by TRANSLATED_EOL (from the translation baton) to TRUE.

Could just say "Record that an EOL translation has been done" or similar.

In general, the log message looks good, but to my taste it is borders
the 'too verbose a log message' line. (I might have written the same
without the last sentence of each bullet.) Remember that the log
message must never /repeat/ the code (or in-code comments); rather, it
should describe them from a high level.

> (stream_translated): New static function. Its implementation is the old
> implementation of svn_subst_stream_translated(), but accepting another
> parameter, TRANSLATED_EOL, that is passed to the in/out translation batons
> that it creates.
> (svn_subst_stream_translated): Now a wrapper for stream_translated().
> (translate_cstring2): New static function. Its implementation is the old
> implementation of svn_subst_translate_cstring2(), but modified to accept
> another parameter, TRANSLATED_EOL, that is passed to stream_translated().
> (svn_subst_translate_cstring2): Now a wrapper for translate_cstring2().

So, you had to "internally" revv both svn_subst_stream_translated() and
svn_subst_translate_cstring2(). I wonder why...

(self-answered hereinbelow)

> (svn_subst_translate_string2): New function. It does what
> svn_subst_translate_string() does, but also records whether it
> translated a line ending or performed re-encoding.
>
> * subversion/svnsync/main.c
> (svnsync__opt) Added svnsync_opt_source_encoding.
> (svnsync_cmd_table): Added svnsync_opt_source_encoding to the list of
> acceptable options for the init, sync, and copy-revprops subcommands.
> (opt_baton_t, subcommand_baton_t): Added SOURCE_ENCODING field.
> (copy_revprops): Added a parameter to receive the subcommand_baton_t*.
> Removed the QUIET parameter as this can be found from the subcommand baton.
> (make_subcommand_baton): Set the SOURCE_ENCODING field of the resulting
> subcommand_baton_t object to the value of SOURCE_ENCODING from opt_baton_t.
> (do_initialize, do_synchronize, do_copy_revprops, replay_rev_started,
> replay_rev_finished): Pass SOURCE_ENCODING to svnsync_* functions and
> copy_revprops().
> (main): Handled the case when the command line option is --source-encoding.
> Set the SOURCE_ENCODING field of the opt_baton_t* to either the ARG of
> --source-encoding or NULL.
>
> * subversion/svnsync/sync.c
> (normalize_string): Renamed WAS_NORMALIZED to LE_NORMALIZED to make clear
> that the counter only counts line ending normalizations. Added the ENCODING
> parameter. Handle the case when ENCODING is not NULL by calling
> svn_subst_translate_string2().
> (svnsync_normalize_revprops): Renamed NORMALIZED_COUNT to
> NORMALIZED_LE_COUNT. Added the ENCODING parameter. Moved the call to
> apr_hash_set() outside of the if block (if (le_normalized)), as the
> property value may have been changed even when no line ending was
> normalized.
> (edit_baton_t): Added the PROP_ENCODING field. Renamed the
> NORMALIZED_NODE_PROPS_COUNTER field to NORMALIZED_LE_NODE_PROPS_COUNTER.
> (change_file_prop, change_dir_prop): Pass in PROP_ENCODING to
> normalize_string().
> (svnsync_get_sync_editor): Added a PROP_ENCODING parameter, the value of
> which is set in the PROP_ENCODING field of the edit_baton_t. Renamed
> the NORMALIZED_NODE_PROPS_COUNTER parameter to
> NORMALIZED_LE_NODE_PROPS_COUNTER.
>
> * subversion/svnsync/sync.h
> Updated the declarations and documentation of the svnsync_* functions.
> ]]]

OK, it's a 30KB patch. A bit larger than average/recommended. These
are harder to review, too (more time required) --- both pre-commit and
post-commit (by future archeologists).

(but I have 4 hours to kill at the airport)

> Index: subversion/include/svn_subst.h
> ===================================================================
> --- subversion/include/svn_subst.h (revision 996730)
> +++ subversion/include/svn_subst.h (working copy)
> @@ -598,6 +598,26 @@ svn_error_t *svn_subst_translate_string(svn_string
> const char *encoding,
> apr_pool_t *pool);
>
> +/** Translate the data in @a value (assumed to be in encoded in charset
> + * @a encoding) to UTF8 and LF line-endings. If @a encoding is @c NULL,
> + * then assume that @a value is in the system-default language encoding.
> + * If @p translated_to_utf8 is not @c NULL, then
> + * @code *translated_to_utf8 @endcode is set to @c TRUE if at least one

You use @a, @p, and @code. Existing code uses only @a. Why the
difference?

> + * character of @p value in the source charset was translated to UTF-8;

Need the word "to" after the semicolon for the English to be correct.

> + * @c FALSE otherwise. If @p translated_line_endings is not @c NULL, then
> + * @code *translated_line_endings @endcode is set to @c TRUE if at least one
> + * line ending was changed to LF; @c FALSE otherwise. Return the translated data in @a *new_value,
> + * allocated in @a pool.
> + *

Need to rewrap.

> + * @since New in 1.6

s/1.6/1.7/

> + */
> +svn_error_t *svn_subst_translate_string2(svn_string_t **new_value,

I realize that you followed the precedent of svn_subst_translate_string(),
but please:

Return type on own line, function name on the next line.

> + svn_boolean_t *translated_to_utf8,
> + svn_boolean_t *translated_line_endings,
> + const svn_string_t *value,
> + const char *encoding,
> + apr_pool_t *pool);
> +
> /** Translate the data in @a value from UTF8 and LF line-endings into

Need to update svn_subst_translate_string()'s docstring and
implementation and mark it deprecated:
http://subversion.apache.org/docs/community-guide/community-guide.html#deprecation
http://mid.gmane.org/20100814132023.089A223889B3@eris.apache.org

> * native locale and native line-endings, or to the output locale if
> * @a for_output is TRUE. Return the translated data in @a

> Index: subversion/libsvn_subr/subst.c
> ===================================================================
> --- subversion/libsvn_subr/subst.c (revision 996730)
> +++ subversion/libsvn_subr/subst.c (working copy)
> @@ -765,6 +765,7 @@ svn_subst_keywords_differ2(apr_hash_t *a,
> struct translation_baton
> {
> const char *eol_str;
> + svn_boolean_t *translated_eol;
> svn_boolean_t repair;
> apr_hash_t *keywords;
> svn_boolean_t expand;
> @@ -805,6 +806,7 @@ struct translation_baton
> */
> static struct translation_baton *
> create_translation_baton(const char *eol_str,
> + svn_boolean_t *translated_eol,
> svn_boolean_t repair,
> apr_hash_t *keywords,
> svn_boolean_t expand,
> @@ -818,6 +820,7 @@ create_translation_baton(const char *eol_str,
>
> b->eol_str = eol_str;
> b->eol_str_len = eol_str ? strlen(eol_str) : 0;
> + b->translated_eol = translated_eol;
> b->repair = repair;
> b->keywords = keywords;
> b->expand = expand;
> @@ -884,6 +887,8 @@ translate_chunk(svn_stream_t *dst,
> b->src_format,
> &b->src_format_len, b->newline_buf,
> b->newline_off, dst, b->repair));
> + if (b->translated_eol)
> + *b->translated_eol = TRUE;
>

I don't have the patience to dive into this function right now, but the
patch is large enough that I'll review the rest and come back to this
some other time.

<appendum>
There are several calls to translate_newline() in this function, but you
added the '*translated_eol = TRUE' line only after one of them. That
triggers a red light for me...
</appendum>

(PS: "appendum"? anyone got a better suggestion?)
(PPS: "addendum".)

> b->newline_off = 0;
> }
> @@ -1256,13 +1261,14 @@ svn_subst_read_specialfile(svn_stream_t **stream,
> }
>
>
> -svn_stream_t *
> -svn_subst_stream_translated(svn_stream_t *stream,
> - const char *eol_str,
> - svn_boolean_t repair,
> - apr_hash_t *keywords,
> - svn_boolean_t expand,
> - apr_pool_t *result_pool)
> +static svn_stream_t *
> +stream_translated(svn_stream_t *stream,
> + const char *eol_str,
> + svn_boolean_t *translated_eol,
> + svn_boolean_t repair,
> + apr_hash_t *keywords,
> + svn_boolean_t expand,
> + apr_pool_t *result_pool)
> {
> struct translated_stream_baton *baton
> = apr_palloc(result_pool, sizeof(*baton));
> @@ -1304,9 +1310,11 @@ svn_subst_read_specialfile(svn_stream_t **stream,
> /* Setup the baton fields */
> baton->stream = stream;
> baton->in_baton
> - = create_translation_baton(eol_str, repair, keywords, expand, result_pool);
> + = create_translation_baton(eol_str, translated_eol, repair, keywords,
> + expand, result_pool);
> baton->out_baton
> - = create_translation_baton(eol_str, repair, keywords, expand, result_pool);
> + = create_translation_baton(eol_str, translated_eol, repair, keywords,
> + expand, result_pool);
> baton->written = FALSE;
> baton->readbuf = svn_stringbuf_create("", result_pool);
> baton->readbuf_off = 0;
> @@ -1323,15 +1331,28 @@ svn_subst_read_specialfile(svn_stream_t **stream,
> return s;
> }
>
> +svn_stream_t *
> +svn_subst_stream_translated(svn_stream_t *stream,
> + const char *eol_str,
> + svn_boolean_t repair,
> + apr_hash_t *keywords,
> + svn_boolean_t expand,
> + apr_pool_t *result_pool)
> +{
> + return stream_translated(stream, eol_str, NULL, repair, keywords, expand,
> + result_pool);
> +}
>
> -svn_error_t *
> -svn_subst_translate_cstring2(const char *src,
> - const char **dst,
> - const char *eol_str,
> - svn_boolean_t repair,
> - apr_hash_t *keywords,
> - svn_boolean_t expand,
> - apr_pool_t *pool)
> +
> +static svn_error_t *
> +translate_cstring2(const char **dst,
> + svn_boolean_t *translated_eol,
> + const char *src,
> + const char *eol_str,
> + svn_boolean_t repair,
> + apr_hash_t *keywords,
> + svn_boolean_t expand,
> + apr_pool_t *pool)
> {
> svn_stringbuf_t *dst_stringbuf;
> svn_stream_t *dst_stream;
> @@ -1348,9 +1369,12 @@ svn_subst_read_specialfile(svn_stream_t **stream,
> dst_stringbuf = svn_stringbuf_create("", pool);
> dst_stream = svn_stream_from_stringbuf(dst_stringbuf, pool);
>
> + if (translated_eol)
> + *translated_eol = FALSE;
> +
> /* Another wrapper to translate the content. */
> - dst_stream = svn_subst_stream_translated(dst_stream, eol_str, repair,
> - keywords, expand, pool);
> + dst_stream = stream_translated(dst_stream, eol_str, translated_eol, repair,
> + keywords, expand, pool);

*nod* that's the "revv an API and update callers" pattern which explains why
stream_translated() was revved. And I see that svn_subst_stream_translated()
is made a wrapper. Good (it means we'll remember this if/when we revv this
function).

We could make stream_translated() a public svn_subst_stream_translated2()
if we think it's useful (e.g., if there's demand for it). But for now
I think it's okay the way your patch suggests.

>
> /* Jam the text into the destination stream (to translate it). */
> SVN_ERR(svn_stream_write(dst_stream, src, &len));
> @@ -1362,6 +1386,19 @@ svn_subst_read_specialfile(svn_stream_t **stream,
> return SVN_NO_ERROR;
> }
>
> +svn_error_t *
> +svn_subst_translate_cstring2(const char *src,
> + const char **dst,
> + const char *eol_str,
> + svn_boolean_t repair,
> + apr_hash_t *keywords,
> + svn_boolean_t expand,
> + apr_pool_t *pool)
> +{
> + return translate_cstring2(dst, NULL, src, eol_str, repair, keywords, expand,
> + pool);
> +}

So, this is revved because svn_subst_translate_string2() needs it (and
svn_subst_translate_string() doesn't).

> +
> /* Given a special file at SRC, generate a textual representation of
> it in a normal file at DST. Perform all allocations in POOL. */
> /* ### this should be folded into svn_subst_copy_and_translate3 */
> @@ -1714,6 +1751,54 @@ svn_subst_translate_string(svn_string_t **new_valu
>
>
> svn_error_t *
> +svn_subst_translate_string2(svn_string_t **new_value,
> + svn_boolean_t *translated_to_utf8,
> + svn_boolean_t *translated_line_endings,
> + const svn_string_t *value,
> + const char *encoding,
> + apr_pool_t *pool)
> +{

Why is this adding entirely new lines of code (the diff shows only /^+/
lines)? Normally we make the new function by editing the old one
in-place, and then re-create the old one as a wrapper (like
translate_cstring2()).

The fact you didn't do so makes me suspect you just copied the code of
svn_subst_translate_string(), which is bad for several reasons:

* It's code duplication. If we ever make a bugfix, we'll have to
  remember to make it to both versions of the function.

* It makes it impossible for me to see /from the unidiff/ what's the
  code difference between the current and being-added versions of the
  function.

> + const char *val_utf8;
> + const char *val_utf8_lf;
> + apr_pool_t *scratch_pool = svn_pool_create(pool);
> +
> + if (value == NULL)
> + {
> + *new_value = NULL;
> + return SVN_NO_ERROR;
> + }
> +
> + if (encoding)
> + {
> + SVN_ERR(svn_utf_cstring_to_utf8_ex2(&val_utf8, value->data,
> + encoding, scratch_pool));
> + }
> + else
> + {
> + SVN_ERR(svn_utf_cstring_to_utf8(&val_utf8, value->data, scratch_pool));
> + }
> +
> + if (translated_to_utf8)
> + *translated_to_utf8 = strcmp(value->data, val_utf8)
> + ? /* not equal */ TRUE
> + : FALSE;
> +
> + SVN_ERR(translate_cstring2(&val_utf8_lf,
> + translated_line_endings,
> + val_utf8,
> + "\n", /* translate to LF */
> + FALSE, /* no repair */
> + NULL, /* no keywords */
> + FALSE, /* no expansion */
> + scratch_pool));
> +
> + *new_value = svn_string_create(val_utf8_lf, pool);
> + svn_pool_destroy(scratch_pool);
> + return SVN_NO_ERROR;
> +}
> +
> +
> +svn_error_t *
> svn_subst_detranslate_string(svn_string_t **new_value,
> const svn_string_t *value,
> svn_boolean_t for_output,

> Index: subversion/svnsync/sync.h

> Index: subversion/svnsync/main.c

> Index: subversion/svnsync/sync.c

Out of patience for one sitting, so not reviewing these now. (I feel they're
orthogonal/independent of the parts I'd already reviewed.)

I suggest you go ahead (respond to the review, submit an updated patch,
etc) without waiting; the svnsync part will eventually get its share of
reviews. Also, you may want to consider splitting this patch into two
parts (subst work separate from svnsync work).

Daniel
Received on 2010-09-20 06:40:34 CEST

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.