Re: [PATCH] allow svnsync to translate non-UTF-8 log messages to UTF-8 (v. 3)

From: Daniel Shahaf <d.s_at_daniel.shahaf.name>
Date: Mon, 10 Jan 2011 07:36:53 +0200

Danny Trebbien wrote on Fri, Jan 07, 2011 at 13:03:18 -0800:
> > Refresh my memory please: did we decide in some thread that the
> > recodings should be counted too?
>
> At one point Stefan asked me if I was interested:
> http://article.gmane.org/gmane.comp.version-control.subversion.devel/122540
>

Okay. I no longer remember what was decided (too many threads about
this), but I'll respond to your points below.

> >> * subversion/tests/cmdline/svnsync_tests_data/copy-bad-line-endings2.dump
> >> A dump of a repository with the following features:
> >> 1. The log message (`svn:log` revision property) of revision 1 has CRLF line
> >> endings.
> >> 2. The log message of revision 2 has CR line endings.
> >> 3. Revision 3 introduces an `svn:ignore` node property with CRLF line endings.
> >> 4. Revision 4 introduces a custom node property, `x:related-to`, with CRLF
> >> line endings.
> >>
> >
> > I don't understand
> >
> > * What is the difference between copy_bad_line_endings() and
> > copy_bad_line_endings2()?
>
> The difference is that copy_bad_line_endings2 tests that non-LF line
> endings in a non-translatable property (one that does not begin with
> `svn:`) are not affected.

Okay. (I overlooked that detail.)

> I simply ported one of the three test
> repositories that I posted to the Developers' mailing list a while ago
> to the cmdline tests architecture.
>
> > Why don't you call svn_utf_cstring_to_utf8(), like is done in almost
> > every other use of opt_arg?
>
> Well, here's my reasoning for not reencoding into UTF-8: as opposed to
> strings such as the username and password, which must be represented
> independent of the local machine (so, reencoded into a standard
> character encoding), the representation of the value of the "source
> encoding" option is tied to the local machine's charset conversion
> faculties. Let's say that it's iconv for simplification. The charset
> of the values of the "to encoding" and "from encoding" parameters are
> system-dependent, so we don't know what charset they are encoded in.
> If somehow someone managed to install a charset with an unusual name
> (not representable in ASCII) and the system iconv implementation
> expects ISO-8859-1 strings for the to/from encoding values, then that
> someone might not be able to pick the special charset if we reencode
> the to/from encoding values into UTF-8. Presumably however, the user
> would know how to pass the ISO-8859-1 representation of the name of
> that charset to the program as one of the elements of ARGV, so pass it
> to iconv unmodified.
>

Can you name a charset for which this is not a theoretical concern?

To put it succintly, it seems to me you're just worrying too much here.

Also: consistency is good. If you're going to special-case one --option's
parsing from all others, you need a VERY good reason.

> >> + *str = new_str;
> >> + }
> >> +
> >> + /* Only detect inconsistent line ending style. This is accomplished by simply
> >> + looking for carriage return (\r) characters. */
> >> + else if (memchr((*str)->data, (unsigned char) '\r', (*str)->len) != NULL)
> >
> > I see the reason for the s/strchr/memchr/ in the log message: because
> > the length is known. I suppose the underlying motivation is to make the
> > condition slightly faster?
>
> And to support NUL characters ('\0').
>

I'm not sure that NUL is actually allowed in these propvalues, but
I don't feel strongly about this.

> >> + {
> >> /* Found some. Normalize. */
> >> const char* cstring = NULL;
> >> SVN_ERR(svn_subst_translate_cstring2((*str)->data, &cstring,
> >> "\n", TRUE,
> >> NULL, FALSE,
> >> - pool));
> >
> > Should this call be to svn_subst_translate_string2()?
> > ^^
>

(I just noticed that translate_string2() and translate_cstring2() aren't
actually related; one is for keywords and one not. Unfortunate naming.)

> I didn't call svn_subst_translate_string2() because this is the else
> condition for "source encoding is not NULL". In the documentation for
> the function, I added that if source encoding is not NULL, then no
> reencoding is performed because the string is presumed to be encoded
> in UTF-8.
>
> Perhaps, though, I should use svn_subst_translate_string2() with the
> encoding set to "UTF-8". It would greatly simplify normalize_string(),

+1. Just call it with encoding=NULL; using the same API in both
branches makes life easier for the next person to revv that API.

> but might do more work if converting UTF-8 to UTF-8 is not a no-op.

Do you know about the UTF-8 multiple-normalization-forms issues?
Because if not, you're just worrying too much again :-)

> I'm not sure if svn_utf_cstring_to_utf8_ex2() recognizes this special
> case.
>
> > Most svnsync tests are used by svnrdump too. Could you check if
> > a similar test should be added to svnrdump_tests.py?
>
> Sure. I will look.

Thanks.

Daniel
Received on 2011-01-10 06:40:29 CET

This message: [ Message body ]
Next message: Daniel Shahaf: "Re: svn commit: r1004286 - in /subversion/trunk: ./ subversion/libsvn_subr/io.c"
Previous message: Daniel Shahaf: "Re: fsfs's fs_serialized_init() not thread safe?"
In reply to: Danny Trebbien: "Re: [PATCH] allow svnsync to translate non-UTF-8 log messages to UTF-8 (v. 3)"
Next in thread: Danny Trebbien: "Re: [PATCH] allow svnsync to translate non-UTF-8 log messages to UTF-8 (v. 3)"
Reply: Danny Trebbien: "Re: [PATCH] allow svnsync to translate non-UTF-8 log messages to UTF-8 (v. 3)"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]