[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: [PATCH] allow svnsync to translate non-UTF-8 log messages to UTF-8 (v. 3)

From: Danny Trebbien <dtrebbien_at_gmail.com>
Date: Fri, 7 Jan 2011 13:03:18 -0800

Hi Daniel and Julian,

My apologies for the delay in responding.

> Refresh my memory please: did we decide in some thread that the
> recodings should be counted too?

At one point Stefan asked me if I was interested:
http://article.gmane.org/gmane.comp.version-control.subversion.devel/122540

>>   (svnsync_normalize_revprops): Rename NORMALIZED_COUNT to
>>     NORMALIZED_LE_COUNT. Add the ENCODING parameter. Move the call to
>>     apr_hash_set() outside of the if block (if (le_normalized)), as the
>>     property value may have been changed because of re-encoding even when no
>>     line ending was normalized.
>
> The sentence "Move.*" is too much detail.  It's probably fine to just
> omit it --- there is no functional change, and the details of the hunk
> need no introduction.

Okay.

>> * subversion/tests/cmdline/svnsync_tests_data/copy-bad-line-endings2.dump
>>   A dump of a repository with the following features:
>>   1. The log message (`svn:log` revision property) of revision 1 has CRLF line
>>      endings.
>>   2. The log message of revision 2 has CR line endings.
>>   3. Revision 3 introduces an `svn:ignore` node property with CRLF line endings.
>>   4. Revision 4 introduces a custom node property, `x:related-to`, with CRLF
>>      line endings.
>>
>
> I don't understand
>
> * What is the difference between copy_bad_line_endings() and
>  copy_bad_line_endings2()?

The difference is that copy_bad_line_endings2 tests that non-LF line
endings in a non-translatable property (one that does not begin with
`svn:`) are not affected. I simply ported one of the three test
repositories that I posted to the Developers' mailing list a while ago
to the cmdline tests architecture.

> * How does copy_bad_line_endings2() relate to the subject --- which is
>  non-UTF-8 log messages, as opposed to non-LF-only log messages?
>
>  i.e., why aren't you adding a dump where the 'before' has an
>  ISO-8859-1 property, and the 'after' has that property recoded to
>  UTF-8?

I haven't done this yet. I still need to do this.

> I wonder: it might be a good idea to split those renames to their own
> patch, so as to not hide the needle (meat of the change) in a haystack
> of hunks.

Yes, probably. I will revert these changes and save for later.

> No whitespace-only hunks in a patch that makes a functional change,
> please.  Thank you.

Okay.

>>    apr_array_header_t *config_options = NULL;
>> +  opt_baton.source_encoding = NULL;
>>    apr_allocator_t *allocator;
>
> Oops, you can't have a declaration (of ALLOCATOR) after statement
> (non-declaration assignment) in C89.
>
> Also, we memset(&opt_baton, 0), so this NULL may not be necessary.
>
> (all 0-v-NULL discussions aside, I'd leave it in only if we init to NULL
> the other pointer members of the struct)

In a way, the pointer fields of the opt baton are all initialized to
NULL if the corresponding command line options are not specified
because there are local variables for each that are initialized to
NULL, and these are copied into the opt baton later on in main(). I
think that I will switch to this pattern (i.e. define `const char
*source_encoding = NULL;` and copy it into the opt baton later).

I vaguely recall using that pattern, but switching to the current
pattern for some reason.

>>
>>    if (svn_cmdline_init("svnsync", stderr) != EXIT_SUCCESS)
>> @@ -1831,6 +1854,10 @@ main(int argc, const char *argv[])
>>                return svn_cmdline_handle_exit_error(err, pool, "svnsync: ");
>>              break;
>>
>> +          case svnsync_opt_source_encoding:
>> +            opt_baton.source_encoding = apr_pstrdup(pool, opt_arg);
>> +            break;
>
> Why do you need to pstrdup() it?  Doesn't it have the proper lifetime
> already?

Oh, I guess it does. I used it because I was thinking that maybe the
string value could be modified by something, but I see now that the
program arguments are passed as an array of const C-strings.

> Why don't you call svn_utf_cstring_to_utf8(), like is done in almost
> every other use of opt_arg?

Well, here's my reasoning for not reencoding into UTF-8: as opposed to
strings such as the username and password, which must be represented
independent of the local machine (so, reencoded into a standard
character encoding), the representation of the value of the "source
encoding" option is tied to the local machine's charset conversion
faculties. Let's say that it's iconv for simplification. The charset
of the values of the "to encoding" and "from encoding" parameters are
system-dependent, so we don't know what charset they are encoded in.
If somehow someone managed to install a charset with an unusual name
(not representable in ASCII) and the system iconv implementation
expects ISO-8859-1 strings for the to/from encoding values, then that
someone might not be able to pick the special charset if we reencode
the to/from encoding values into UTF-8. Presumably however, the user
would know how to pass the ISO-8859-1 representation of the name of
that charset to the program as one of the elements of ARGV, so pass it
to iconv unmodified.

> You lost the "... and set to FALSE" otherwise bit.  It's important --- the
> semantics of the contract are different without that statement --- so please
> restore it.

I will.

>> +      *str = new_str;
>> +    }
>> +
>> +  /* Only detect inconsistent line ending style. This is accomplished by simply
>> +     looking for carriage return (\r) characters. */
>> +  else if (memchr((*str)->data, (unsigned char) '\r', (*str)->len) != NULL)
>
> I see the reason for the s/strchr/memchr/ in the log message: because
> the length is known.  I suppose the underlying motivation is to make the
> condition slightly faster?

And to support NUL characters ('\0').

> We repeat that strchr() in a couple of other places ---
> svn_repos__validate_prop() is one of them ---  so if you make the change
> here, you probably need to change there too.

Okay.

>> +    {
>>        /* Found some. Normalize. */
>>        const char* cstring = NULL;
>>        SVN_ERR(svn_subst_translate_cstring2((*str)->data, &cstring,
>>                                             "\n", TRUE,
>>                                             NULL, FALSE,
>> -                                           pool));
>
> Should this call be to svn_subst_translate_string2()?
>                                          ^^

I didn't call svn_subst_translate_string2() because this is the else
condition for "source encoding is not NULL". In the documentation for
the function, I added that if source encoding is not NULL, then no
reencoding is performed because the string is presumed to be encoded
in UTF-8.

Perhaps, though, I should use svn_subst_translate_string2() with the
encoding set to "UTF-8". It would greatly simplify normalize_string(),
but might do more work if converting UTF-8 to UTF-8 is not a no-op.
I'm not sure if svn_utf_cstring_to_utf8_ex2() recognizes this special
case.

> Most svnsync tests are used by svnrdump too.  Could you check if
> a similar test should be added to svnrdump_tests.py?

Sure. I will look.
Received on 2011-01-07 22:03:57 CET

This is an archived mail posted to the Subversion Dev mailing list.