Re: [PATCH] allow svnsync to translate non-UTF-8 log messages to UTF-8 (v. 3)

From: Daniel Shahaf <d.s_at_daniel.shahaf.name>
Date: Sun, 2 Jan 2011 02:02:47 +0200

[ trimming CC ]

Danny Trebbien wrote on Sun, Dec 26, 2010 at 15:39:05 -0800:
> Attached are version 3 of the patch and corresponding log message.
> Also attached is a TGZ archive of two Subversion repository dumps that
> are used by a new test case.

> [[[
> Add a command line option (--source-encoding) to the svnsync init, sync, and
> copy-revprops subcommands that allows the user to specify the character
> encoding of translatable properties from the source repository. This is needed
> to allow svnsync to sync some older Subversion repositories that have
> properties that were not encoded in UTF-8.
>
> As discussed at:
> http://thread.gmane.org/gmane.comp.version-control.subversion.user/100020
> http://thread.gmane.org/gmane.comp.version-control.subversion.devel/122518
> http://thread.gmane.org/gmane.comp.version-control.subversion.devel/122550
>
> Around half of the changes are to rename the "normalized count" variables so
> that it is more clear that the counters only count line ending normalizations
> and not re-encoding normalizations. The other half of the changes are cosmetic
> (adding comments or trimming trailing whitespace in lines) or exist to pass the
> argument to --source-encoding through to the functions that need it (mainly
> normalize_string() in subversion/svnsync/sync.c).
>

+1.

Refresh my memory please: did we decide in some thread that the
recodings should be counted too?

> * subversion/svnsync/main.c
> (svnsync__opt) Add svnsync_opt_source_encoding.
> (svnsync_cmd_table): Add svnsync_opt_source_encoding to the list of
> acceptable options for the init, sync, and copy-revprops subcommands.
> (svnsync_options): Add a description of the --source-encoding option.
> (opt_baton_t, subcommand_baton_t): Add the SOURCE_ENCODING field.
> (log_properties_normalized): Rename the NORMALIZED_REV_PROPS_COUNT parameter
> to LE_NORMALIZED_REV_PROPS_COUNT and the NORMALIZED_NODE_PROPS_COUNT
> parameter to LE_NORMALIZED_NODE_PROPS_COUNT.
> (copy_revprops): Add the PROP_ENCODING parameter. Rename the NORMALIZED_COUNT
> parameter to LE_NORMALIZED_COUNT.
> (make_subcommand_baton): Set the SOURCE_ENCODING field of the resulting
> subcommand_baton_t object to the value of SOURCE_ENCODING from the
> opt_baton_t object.
> (do_initialize, do_synchronize, do_copy_revprops, replay_rev_started,
> replay_rev_finished): Pass SOURCE_ENCODING to svnsync_* functions and
> copy_revprops().
> (replay_baton_t): Rename the NORMALIZED_REV_PROPS_COUNT field to
> LE_NORMALIZED_REV_PROPS_COUNT and the NORMALIZED_NODE_PROPS_COUNT field to
> LE_NORMALIZED_NODE_PROPS_COUNT.
> (main): Handle the case when the command line option is --source-encoding.
> Set the SOURCE_ENCODING field of the opt_baton_t object to either OPT_ARG
> or NULL.
>
> * subversion/svnsync/sync.c
> (normalize_string): Rename WAS_NORMALIZED to WAS_LE_NORMALIZED to make clear
> that the counter only counts line ending normalizations. Add the ENCODING
> parameter. Handle the case when ENCODING is not NULL by calling
> svn_subst_translate_string2(). Switch to the "two pools" (result & scratch
> pools) pattern. Use memchr() instead of strchr() because the length of
> (*STR)->data is known.
> (svnsync_normalize_revprops): Rename NORMALIZED_COUNT to
> NORMALIZED_LE_COUNT. Add the ENCODING parameter. Move the call to
> apr_hash_set() outside of the if block (if (le_normalized)), as the
> property value may have been changed because of re-encoding even when no
> line ending was normalized.

The sentence "Move.*" is too much detail. It's probably fine to just
omit it --- there is no functional change, and the details of the hunk
need no introduction.

> * subversion/tests/cmdline/svnsync_tests_data/copy-bad-line-endings2.dump
> A dump of a repository with the following features:
> 1. The log message (`svn:log` revision property) of revision 1 has CRLF line
> endings.
> 2. The log message of revision 2 has CR line endings.
> 3. Revision 3 introduces an `svn:ignore` node property with CRLF line endings.
> 4. Revision 4 introduces a custom node property, `x:related-to`, with CRLF
> line endings.
>

I don't understand

* What is the difference between copy_bad_line_endings() and
copy_bad_line_endings2()?

* How does copy_bad_line_endings2() relate to the subject --- which is
non-UTF-8 log messages, as opposed to non-LF-only log messages?

  i.e., why aren't you adding a dump where the 'before' has an
  ISO-8859-1 property, and the 'after' has that property recoded to
  UTF-8?

> * subversion/tests/cmdline/svnsync_tests_data/copy-bad-line-endings2.expected.dump
> A dump of the expected result of using svnsync to sync
> copy-bad-line-endings2.dump.
> ]]]

> Index: subversion/svnsync/sync.h
> ===================================================================
> --- subversion/svnsync/sync.h (revision 1052903)
> +++ subversion/svnsync/sync.h (working copy)
> @@ -51,19 +56,25 @@ svnsync_normalize_revprops(apr_hash_t *rev_props,
> * the commit. TO_URL is the URL of the root of the repository into
> * which the commit is being made.
> *
> + * If PROP_ENCODING is NULL, then property values are presumed to be encoded
> + * in UTF-8 and are not re-encoded. Otherwise, the property values are
> + * presumed to be encoded in PROP_ENCODING, and are normalized to UTF-8.
> + *

> * As the sync editor encounters property values, it might see the need to
> - * normalize them (to LF line endings). Each carried out normalization adds 1
> - * to the *NORMALIZED_NODE_PROPS_COUNTER (for notification).
> + * normalize them (re-encode and/or change to LF line endings). Each carried-out
> + * line ending normalization adds 1 to the *LE_NORMALIZED_NODE_PROPS_COUNTER
> + * (for notification).
> */
> svn_error_t *
> svnsync_get_sync_editor(const svn_delta_editor_t *wrapped_editor,
> void *wrapped_edit_baton,
> svn_revnum_t base_revision,
> const char *to_url,
> + const char *prop_encoding,
> svn_boolean_t quiet,
> const svn_delta_editor_t **editor,
> void **edit_baton,
> - int *normalized_node_props_counter,
> + int *le_normalized_node_props_counter,

I wonder: it might be a good idea to split those renames to their own
patch, so as to not hide the needle (meat of the change) in a haystack
of hunks.

(but unlike the anonymous struct naming thing, this one isn't just
a simple global search/replace, so I can't do it non-manually)

> apr_pool_t *pool);
>
>
> Index: subversion/svnsync/main.c
> ===================================================================
> --- subversion/svnsync/main.c (revision 1052903)
> +++ subversion/svnsync/main.c (working copy)
> @@ -105,8 +106,9 @@ static const svn_opt_subcommand_desc2_t svnsync_cm
> "the destination repository by any method other than 'svnsync'.\n"
> "In other words, the destination repository should be a read-only\n"
> "mirror of the source repository.\n"),
> - { SVNSYNC_OPTS_DEFAULT, 'q', svnsync_opt_allow_non_empty,
> - svnsync_opt_disable_locking, svnsync_opt_steal_lock } },
> + { SVNSYNC_OPTS_DEFAULT, svnsync_opt_source_encoding, 'q',
> + svnsync_opt_allow_non_empty, svnsync_opt_disable_locking,
> + svnsync_opt_steal_lock } },
> { "synchronize", synchronize_cmd, { "sync" },

Why does 'svnsync init' need to take the new option?

> N_("usage: svnsync synchronize DEST_URL [SOURCE_URL]\n"
> "\n"
> @@ -226,7 +234,7 @@ static const apr_getopt_option_t svnsync_options[]
> { 0, 0, 0, 0 }
> };
>
> -typedef struct {
> +typedef struct { /* opt_baton_t */

I fixed that globally in r1053996, thanks for the reminder.

> svn_boolean_t non_interactive;
> svn_boolean_t trust_server_cert;
> svn_boolean_t no_auth_cache;
> @@ -441,7 +453,7 @@ check_if_session_is_at_repos_root(svn_ra_session_t
> * revision REV of the repository associated with RA session SESSION.
> *
> * For REV zero, don't remove properties with the "svn:sync-" prefix.
> - *
> + *

No whitespace-only hunks in a patch that makes a functional change,
please. Thank you.

> * All allocations will be done in a subpool of POOL.
> */
> static svn_error_t *
> @@ -1707,6 +1729,7 @@ main(int argc, const char *argv[])
> const char *username = NULL, *source_username = NULL, *sync_username = NULL;
> const char *password = NULL, *source_password = NULL, *sync_password = NULL;
> apr_array_header_t *config_options = NULL;
> + opt_baton.source_encoding = NULL;
> apr_allocator_t *allocator;

Oops, you can't have a declaration (of ALLOCATOR) after statement
(non-declaration assignment) in C89.

Also, we memset(&opt_baton, 0), so this NULL may not be necessary.

(all 0-v-NULL discussions aside, I'd leave it in only if we init to NULL
the other pointer members of the struct)

>
> if (svn_cmdline_init("svnsync", stderr) != EXIT_SUCCESS)
> @@ -1831,6 +1854,10 @@ main(int argc, const char *argv[])
> return svn_cmdline_handle_exit_error(err, pool, "svnsync: ");
> break;
>
> + case svnsync_opt_source_encoding:
> + opt_baton.source_encoding = apr_pstrdup(pool, opt_arg);
> + break;

Why do you need to pstrdup() it? Doesn't it have the proper lifetime
already?

Why don't you call svn_utf_cstring_to_utf8(), like is done in almost
every other use of opt_arg?

(Yes, ironic, isn't it.)

> +
> case svnsync_opt_disable_locking:
> opt_baton.disable_locking = TRUE;
> break;
> Index: subversion/svnsync/sync.c
> ===================================================================
> --- subversion/svnsync/sync.c (revision 1052903)
> +++ subversion/svnsync/sync.c (working copy)
> @@ -45,56 +45,72 @@
> #include <apr_uuid.h>
>
>
> -/* Normalize the line ending style of *STR, so that it contains only
> - * LF (\n) line endings. After return, *STR may point at a new
> - * svn_string_t* allocated from POOL.
> +/* Normalize the encoding and line ending style of *STR, so that it contains
> + * only LF (\n) line endings and is encoded in UTF-8. After return, *STR may
> + * point at a new svn_string_t* allocated in RESULT_POOL.
> *
> + * If ENCODING is NULL, then *STR is presumed to be encoded in UTF-8.
> + *
> - * *WAS_NORMALIZED is set to TRUE when *STR needed to be normalized,
> - * and to FALSE if *STR remains unchanged.
> + * *WAS_LE_NORMALIZED is set to TRUE when *STR needed line ending normalization.
> + *

You lost the "... and set to FALSE" otherwise bit. It's important --- the
semantics of the contract are different without that statement --- so please
restore it.

> + * SCRATCH_POOL is used for temporary allocations.
> */
> static svn_error_t *
> normalize_string(const svn_string_t **str,
> - svn_boolean_t *was_normalized,
> - apr_pool_t *pool)
> + svn_boolean_t *was_le_normalized,
> + const char *encoding,
> + apr_pool_t *result_pool,
> + apr_pool_t *scratch_pool)
> {
> - *was_normalized = FALSE;
> + *was_le_normalized = FALSE;
>
> if (*str == NULL)
> return SVN_NO_ERROR;
>
> SVN_ERR_ASSERT((*str)->data != NULL);
>
> - /* Detect inconsistent line ending style simply by looking
> - for carriage return (\r) characters. */
> - if (strchr((*str)->data, '\r') != NULL)
> + if (encoding)
> {
> + svn_string_t *new_str = NULL;
> + SVN_ERR(svn_subst_translate_string2(&new_str, NULL, was_le_normalized,
> + *str, encoding, result_pool,
> + scratch_pool));
> + *str = new_str;
> + }
> +
> + /* Only detect inconsistent line ending style. This is accomplished by simply
> + looking for carriage return (\r) characters. */
> + else if (memchr((*str)->data, (unsigned char) '\r', (*str)->len) != NULL)

I see the reason for the s/strchr/memchr/ in the log message: because
the length is known. I suppose the underlying motivation is to make the
condition slightly faster?

We repeat that strchr() in a couple of other places ---
svn_repos__validate_prop() is one of them --- so if you make the change
here, you probably need to change there too.

> + {
> /* Found some. Normalize. */
> const char* cstring = NULL;
> SVN_ERR(svn_subst_translate_cstring2((*str)->data, &cstring,
> "\n", TRUE,
> NULL, FALSE,
> - pool));

Should this call be to svn_subst_translate_string2()?
^^

> - *str = svn_string_create(cstring, pool);
> - *was_normalized = TRUE;
> + scratch_pool));
> + *str = svn_string_create(cstring, result_pool);
> + *was_le_normalized = TRUE;
> }
>
> return SVN_NO_ERROR;
> }
>
>
> -/* Normalize the line ending style of the values of properties in REV_PROPS
> - * that "need translation" (according to svn_prop_needs_translation(),
> - * currently all svn:* props) so that they contain only LF (\n) line endings.
> - * The number of properties that needed normalization is returned in
> - * *NORMALIZED_COUNT.
> +/* Normalize the encoding and line ending style of the values of properties
> + * in REV_PROPS that "need translation" (according to
> + * svn_prop_needs_translation(), which is currently all svn:* props) so that
> + * they are encoded in UTF-8 and contain only LF (\n) line endings.
> + * The number of properties that needed line ending normalization is returned in
> + * *NORMALIZED_LE_COUNT. No re-encoding is performed if ENCODING is NULL.
> */
> svn_error_t *
> svnsync_normalize_revprops(apr_hash_t *rev_props,
> - int *normalized_count,
> + int *normalized_le_count,
> + const char *encoding,
> apr_pool_t *pool)
> {
> apr_hash_index_t *hi;
> - *normalized_count = 0;
> + *normalized_le_count = 0;
>
> for (hi = apr_hash_first(pool, rev_props);
> hi;
> @@ -105,15 +121,15 @@ svnsync_normalize_revprops(apr_hash_t *rev_props,
>
> if (svn_prop_needs_translation(propname))
> {
> - svn_boolean_t was_normalized;
> - SVN_ERR(normalize_string(&propval, &was_normalized, pool));
> - if (was_normalized)
> - {
> - /* Replace the existing prop value. */
> - apr_hash_set(rev_props, propname, APR_HASH_KEY_STRING, propval);
> - /* And count this. */
> - (*normalized_count)++;
> - }
> + svn_boolean_t was_le_normalized;
> + SVN_ERR(normalize_string(&propval, &was_le_normalized, encoding, pool,
> + pool));
> +
> + /* Replace the existing prop value. */
> + apr_hash_set(rev_props, propname, APR_HASH_KEY_STRING, propval);
> +
> + if (was_le_normalized)
> + (*normalized_le_count)++; /* Count it. */
> }
> }
> return SVN_NO_ERROR;
> Index: subversion/tests/cmdline/svnsync_tests.py
> ===================================================================
> --- subversion/tests/cmdline/svnsync_tests.py (revision 1052903)
> +++ subversion/tests/cmdline/svnsync_tests.py (working copy)
> @@ -780,11 +780,17 @@ def info_not_synchronized(sbox):
> #----------------------------------------------------------------------
>
> def copy_bad_line_endings(sbox):
> - "copy with inconsistent lineendings in svn:props"
> + "copy with inconsistent line endings in svn:* props"

Thanks, r1054320.

> run_test(sbox, "copy-bad-line-endings.dump",
> exp_dump_file_name="copy-bad-line-endings.expected.dump",
> bypass_prop_validation=True)
>
> +def copy_bad_line_endings2(sbox):
> + "copy with non-LF line endings in svn:* props"
> + run_test(sbox, "copy-bad-line-endings2.dump",
> + exp_dump_file_name="copy-bad-line-endings2.expected.dump",
> + bypass_prop_validation=True)
> +

Most svnsync tests are used by svnrdump too. Could you check if
a similar test should be added to svnrdump_tests.py?

If yes, could you please add an XFail test in svnrdump_tests.py?
(should be three lines of code)

> #----------------------------------------------------------------------
>
> def delete_svn_props(sbox):

*phew*. This took a good while...
Received on 2011-01-02 01:06:20 CET

This message: [ Message body ]
Next message: Johan Corveleyn: "Re: svn commit: r1054286 - /subversion/trunk/subversion/libsvn_delta/xdelta.c"
Previous message: Stefan Fuhrmann: "Re: svn commit: r1054250 - /subversion/trunk/subversion/libsvn_subr/eol.c"
Next in thread: Julian Foad: "Re: [PATCH] allow svnsync to translate non-UTF-8 log messages to UTF-8 (v. 3)"
Reply: Julian Foad: "Re: [PATCH] allow svnsync to translate non-UTF-8 log messages to UTF-8 (v. 3)"
Reply: Danny Trebbien: "Re: [PATCH] allow svnsync to translate non-UTF-8 log messages to UTF-8 (v. 3)"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]