[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: [PATCH] extend svn_subst_translate_string() to record whether re-encoding and/or line ending translation were performed (v. 3)

From: Daniel Shahaf <d.s_at_daniel.shahaf.name>
Date: Wed, 10 Nov 2010 00:12:16 +0200

[ @dtrebbien, please read the whole email before starting to reply or act on
  the feedback; some of the earlier comments are mooted by the later ones ]

[ @all, I have a question about the cost of creating subpools;
  search for #subpool_question to find it ]

Daniel Trebbien wrote on Sun, Nov 07, 2010 at 17:36:13 -0800:
> Thank you for your feedback. I have attached a new patch and log
> message that addresses the points that you have mentioned.
>

Surely you meant to say s/addresses/address/ :-).

> > If we can provide the information cheaply, I see no reason not to let
> > the APIs provide it.  As some of these APIs are already revved for 1.7,
> > we might as well add the new information to them now.
>
> Maybe this should be done with other patches? I am concerned that
> this patch is getting long again :)
>

Sure.

> > And all of this is necessary just so svnsync can /report/ which of the
> > two fixes it made, right?  If so, we might move on with teaching svnsync
> > to fix newlines too, and add the "Report back how many properties had
> > their newlines changed" functionality separately.  (easier to review,
> > and I don't think "512 properties were re-newlined" is /that/ useful to
> > know for most people)
>
> Yes.
>
> Having svnsync do the newline normalization would be okay if there
> were a function that only performed the re-encoding, which
> unfortunately I do not see.
>

I recall that while writing the above my idea was to parallelize some
work. I think you're saying that's not possible? If so, let's just
move on, working on the patches one after the other instead.

> Someone (I forget who) suggested that the final summary message "512
> properties were re-newlined" be replaced with a revision-by-revision
> listing of the exact properties that were changed in importing each
> revision. I like this idea, and was thinking about making patches to
> implement it.
>

There is value for a summary line --- just like the "Conflicts summary"
in 'svn up' and friends (svn/notify.c): if someone runs 'svnsync sync'
manually and doesn't pipe the output into a pager, they still know at
the end something was wrong.

IIRC svnadmin does that too in some places?

> >> [[[
> >> Adds a public API function, svn_subst_translate_string2(), an extension of
> >> svn_subst_translate_string(), that has two, additional output parameters for
> >> determining whether re-encoding and/or line ending translation were performed.
> >>
> >
> > What's happening in this file is essentially "Add the parameters to
> > <this> public API, punch them through everything, and finally in
> > translate_newline() make use of them".  Could you add something
> > (a paragraph?) to the log message that makes that clear?
> >
> > Or, another option is two have two subst.c entries in the log message
> > --- one for the 'punching through' and one for the interesting changes
> > --- though we have less precedent for that.
> >
> > I'll leave it to you to find a way to make the overall picture clear :)
>
> I added a couple of paragraphs. One explains that most of the changes
> are to send the TRANSLATED_EOL parameter all the way through to
> translate_newline().

Cool, that helps.

> Another paragraph explains the idea behind my
> solution for the efficiency concern that you had about the new
> translate_newline() implementation.
>

I'll reply inline (at the quoted log message).

> >> +/** Translate the data in @a value (assumed to be encoded in charset
> >> + * @a encoding) to UTF8 and LF line-endings.  If @a encoding is @c NULL,
> >> + * then assume that @a value is in the system-default character encoding.
> >
> > You changed 'language encoding' to 'character encoding'.
>
> Changed back. Note that the docstring has been reworded since version
> 2 of the patch. I am now following the new wording and modified it as
> appropriate.
>

I'm sorry: I wasn't clear. I wasn't actually asking you to revert that,
but just tried to ask why you made that change (because it is departs
from the existing wording, but I didn't recall seeing the change
mentioned anywhere).

> >> @@ -633,7 +637,8 @@ translate_newline(const char *eol_str,
> >>                    const char *newline_buf,
> >>                    apr_size_t newline_len,
> >>                    svn_stream_t *dst,
> >> -                  svn_boolean_t repair)
> >> +                  svn_boolean_t repair,
> >> +                  svn_boolean_t *translated_eol)
> >>  {
> >>    /* If we've seen a newline before, compare it with our cache to
> >>       check for consistency, else cache it for future comparisons. */
> >> @@ -653,8 +658,17 @@ translate_newline(const char *eol_str,
> >>        strncpy(src_format, newline_buf, newline_len);
> >>        *src_format_len = newline_len;
> >>      }
> >> -  /* Write the desired newline */
> >> -  return translate_write(dst, eol_str, eol_str_len);
> >> +
> >> +  /* Translate the newline */
> >> +  SVN_ERR(translate_write(dst, eol_str, eol_str_len));
> >> +
> >> +  if (translated_eol)
> >> +    {
> >
> > Right now, this if() will be done every time translate_newline() is
> > called.  Could you rework the logic to check it fewer times?
> >
> > e.g., to only check it once per translation_baton if possible, or to
> > skip checking it if it had been set already, or if REPAIR is false and
> > a newline had been seen already, etc.  (These are just ideas, you don't
> > have to implement all of them!)
> >
> > Stefan2 indicates that the subst() code is a rather expensive part of
> > 'checkout' and 'export', so doing the check once-per-file (instead of
> > once-per-line) will be nice.  (However, I realize that right now you
> > only expose translated_eol for the svn_string_t API.)
>
> Okay. I re-worked the translate_newline() function a bit so that
> TRANSLATED_EOL will never be set to TRUE twice per stream and that the
> check whether TRANSLATED_EOL is NULL is performed exactly once.
>
> My solution was to make a slight variant of translate_newline(),
> called translate_newline_alt(), that has the same signature as
> translate_newline() and the same effect except that
> translate_newline_alt() does not care about TRANSLATED_EOL. I added a
> function pointer field to the translation baton that is either the
> address of translate_newline() or translate_newline_alt(). This
> allows the code to vary the implementation that is used.
>

I'd like to avoid the code duplication. Further down I've voted -0.95
on having it, and at the definition of translate_newline() I'm
discussing alternatives.

(Can you tell that I'm writing my response out-of-order?)

> >> +svn_error_t *
> >> +svn_subst_translate_string2(svn_string_t **new_value,
> >> +                            svn_boolean_t *translated_to_utf8,
> >> +                            svn_boolean_t *translated_line_endings,
> >> +                            const svn_string_t *value,
> >> +                            const char *encoding,
> >> +                            apr_pool_t *pool)
> >> +{
> >>    const char *val_utf8;
> >>    const char *val_utf8_lf;
> >>    apr_pool_t *scratch_pool = svn_pool_create(pool);
> >> @@ -1703,14 +1767,18 @@ svn_subst_translate_string(svn_string_t **new_valu
> >>        SVN_ERR(svn_utf_cstring_to_utf8(&val_utf8, value->data, scratch_pool));
> >>      }
> >>
> >
> > Not related to your patch, but the above if/else block calls the UTF-8
> > translation routines on value->data.  Since this is specifically the
> > translate_string() API (and there is a separate translate_cstring()
> > API), I think either we should fix this (the whole reason svn_string_t
> > exists is to allow embedded NULs) or at least document this limitation
> > in the API docs.
>
> value->data can be NULL?
>

NUL != NULL. In general, the 'data' member of an svn_string_t is never
a NULL pointer. However, the memory it points to --- value->data[0]
through value->data[value->len-1] --- may contain NUL (aka ASCII 0,
aka '\0') bytes.

> Note: I made a change that was not requested. In
> translate_newline(), I replaced the condition that checked for
> different newline strings with one that I believe to be more
> efficient. It used to be:
>
> > if (newline_len != eol_str_len || strncmp(newline_buf, eol_str, newline_len))
>
> Now it is:
>
> > if ((newline_len != eol_str_len) || (*newline_buf != *eol_str))
>
> This exploits the knowledge of the entire set of newline strings that
> are recognized (LF, CR, and CRLF).

Okay, and thanks for pointing out this "unsolicited" change.

> [[[
> Adds a public API function, svn_subst_translate_string2(), an extension of
> svn_subst_translate_string(), that has two additional output parameters for
> determining whether re-encoding and/or line ending translation were performed.
>
>
> As discussed at:
> http://thread.gmane.org/gmane.comp.version-control.subversion.devel/122550
> http://thread.gmane.org/gmane.comp.version-control.subversion.devel/123020
>
>
> The essential changes are to the translate_newline() function, which now takes
> an svn_boolean_t pointer (through the translation_baton) which is set to TRUE if
> a different newline is written out. Most other changes are to pass the
> svn_boolean_t pointer through to translate_newline().
>

+1 to this paragraph.

> There were also concerns about the efficiency of the new translate_newline()
> given that it is called once per newline. The concern was that if TRANSLATED_EOL
> is NULL or already set to TRUE, then there would be a lot of wasted computation.

A log message is not a history lesson; there is no need to summarize
the entire discussion here; that's what references to the list archives
are for. Just say what the patch /is/ doing, and what are the
high-level design choices/tradeoffs it makes.

> The solution in this revision is to add a variant of translate_newline(),
> translate_newline_alt(), that has the same signature as translate_newline()
> except that it does not care about b->translated_eol. During newline
> normalization, a function pointer variable that was added to the translation
> baton is the address of either translate_newline() or translate_newline_alt(),
> allowing the variant that is used to be varied during normalization. This allows
> the code that sets TRANSLATED_EOL to run at most once per translation of a
> stream. And, if it's NULL, then translate_newline_alt() is always used.
>

Consider moving some of the description to a comment in the code: is it
useful only when reading the diff, or also to people reading the code in
the future? If it's the latter, then it should be a comment in the code
(which you can summarize in the log message if needed).

>
> * subversion/include/svn_subst.h
> (svn_subst_translate_string2): New function.
> (svn_subst_translate_string): Deprecated in favor of
> svn_subst_translate_string2().
>
> * subversion/libsvn_subr/subst.c
> (translate_newline): Now takes the pointer to the translation baton. If it
> writes out a different newline, then it sets b->translate_eol to TRUE.
> (translate_newline_alt): Copy of translate_newline() that does not care about
> the value of b->translated_eol.

"%s_alt" isn't a very descriptive name. For maintainability, a more
descriptive suffix would be better.

Also: I forgot to say this earlier, but parts of this subst.c code have
been refactored on the 'performance' branch. Some of those refactorings
have been merged, but I'm not sure if all of them were. Could you have
a look there and see if there are any unmerged changes there? And if
so, how they relate to this patch?

> (translation_baton): Added the TRANSLATED_EOL and TRANSLATE_NEWLINE_FN fields.
> (create_translation_baton): Added a new parameter TRANSLATED_EOL that is
> passed to the resulting translation_baton. Sets TRANSLATE_NEWLINE_FN to
> either &translate_newline or &translate_newline_alt as appropriate.
> (translate_chunk): Replaced the three calls to translate_newline() with
> calls to TRANSLATE_NEWLINE_FN.

Present tense, so s/Added/Add/, etc.

> (stream_translated): New static function. Its implementation is the old
> implementation of svn_subst_stream_translated(), but accepting another
> parameter, TRANSLATED_EOL, that is passed to the in/out translation batons
> that it creates.
> (svn_subst_stream_translated): Now a wrapper for stream_translated().
> (translate_cstring): New static function. Its implementation is the old
> implementation of svn_subst_translate_cstring2(), but modified to accept
> another parameter, TRANSLATED_EOL, that is passed to stream_translated().
> (svn_subst_translate_cstring2): Now a wrapper for translate_cstring().
> (svn_subst_translate_string): Moved to deprecated.c.
> (svn_subst_translate_string2): New function. It takes a new parameter,
> TRANSLATED_EOL, and another pool parameter following the scratch pool &
> result pool pattern. The task of recording whether it translated a line
> ending is delegated to translate_cstring().
>

Okay. I might tweak it slightly, but nothing to worry about.

> * subversion/libsvn_subr/deprecated.c
> Received the implementation of the deprecated wrapper function
> svn_subst_translate_string().

Write in the standard file-symbol syntax:

  * subversion/libsvn_subr/deprecated.c
    (svn_subst_translate_string): ...

> ]]]

> Index: subversion/include/svn_subst.h
> ===================================================================
> --- subversion/include/svn_subst.h (revision 1032431)
> +++ subversion/include/svn_subst.h (working copy)
> @@ -592,19 +592,46 @@ svn_subst_stream_detranslated(svn_stream_t **strea
>
> /* EOL conversion and character encodings */
>
> +/** @deprecated Provided for backward compatibility with the 1.6 API. Callers
> + * should use svn_subst_translate_string2().
> + *
> + * Similar to svn_subst_translate_string2(), except that the information about
> + * whether re-encoding or line ending translation were performed is discarded.
> + */

Okay, except that the two paragraphs are in the wrong order, and the
"should use" comment isn't needed in this case.

> +SVN_DEPRECATED
> +svn_error_t *svn_subst_translate_string(svn_string_t **new_value,
> + const svn_string_t *value,
> + const char *encoding,
> + apr_pool_t *pool);
> +
> /** Translate the string @a value from character encoding @a encoding to
> * UTF8, and also from its current line-ending style to LF line-endings. If
> * @a encoding is @c NULL, translate from the system-default encoding.
> *
> + * If @a translated_to_utf8 is not @c NULL, then
> + * <code>*translated_to_utf8</code> is set to @c TRUE if at least one
> + * character of @a value in the source character encoding was translated to
> + * UTF-8; to @c FALSE otherwise. If @a translated_line_endings is not @c NULL,
> + * then <code>*translated_line_endings</code> is set to @c TRUE if at least one
> + * line ending was changed to LF; to @c FALSE otherwise.
> + *

Thanks for paragraphing this. I'd add another paragraph break after the
first "otherwise", so that each parameter is described in its own paragraph.

> * Recognized line endings are LF, CR, CRLF. If @a value has inconsistent
> * line endings, return @c SVN_ERR_IO_INCONSISTENT_EOL.
> *
> - * Set @a *new_value to the translated string, allocated in @a pool.
> + * Set @a *new_value to the translated string, allocated in @a result_pool.
> + *
> + * @a scratch_pool is used for temporary allocations.
> + *
> + * @since New in 1.7.
> */
> -svn_error_t *svn_subst_translate_string(svn_string_t **new_value,
> - const svn_string_t *value,
> - const char *encoding,
> - apr_pool_t *pool);
> +svn_error_t *
> +svn_subst_translate_string2(svn_string_t **new_value,
> + svn_boolean_t *translated_to_utf8,
> + svn_boolean_t *translated_line_endings,
> + const svn_string_t *value,
> + const char *encoding,
> + apr_pool_t *result_pool,
> + apr_pool_t *scratch_pool);
>
> /** Translate the string @a value from UTF8 and LF line-endings into native
> * character encoding and native line-endings. If @a for_output is TRUE,
> Index: subversion/libsvn_subr/subst.c
> ===================================================================
> --- subversion/libsvn_subr/subst.c (revision 1032431)
> +++ subversion/libsvn_subr/subst.c (working copy)
> @@ -607,23 +607,129 @@ translate_keyword(char *buf,
> }
>
>
> +struct translation_baton;
> +
> +/* Baton for translate_chunk() to store its state in. */
> +struct translation_baton
> +{

So, you added a forward declaration, AND moved the struct definition
from further below. One or the other is unnecessary :-)

> + const char *eol_str;
> + svn_boolean_t *translated_eol;
> + svn_error_t* (*translate_newline_fn)(const char *eol_str,
> + apr_size_t eol_str_len,
> + char *src_format,
> + apr_size_t *src_format_len,
> + const char *newline_buf,
> + apr_size_t newline_len,
> + svn_stream_t *dst,
> + struct translation_baton *b);

For reference, the only change to the struct is the addition of the
above two members.

At the definitions of functions that implement the 'translate_newline_fn'
signatures, you should add a comment saying they implement <this>
callback signature.

> +};
> +
> +
> +/* Similar to translate_newline() except that b->translated_eol is NULL or it
> + was already set to TRUE, so the code that deals with b->translated_eol is
> + omitted.
> +
> + If any part of this is changed, translate_newline() must also be updated
> + as appropriate!!!!! */

-0.95 on code duplication. (and on the use of five exclamation marks)

I'll suggest an alternative further below.

> +static svn_error_t *
> +translate_newline_alt(const char *eol_str,
> + apr_size_t eol_str_len,

Formal parameter isn't aligned with the opening paren.

> + char *src_format,
> + apr_size_t *src_format_len,
> + const char *newline_buf,
> + apr_size_t newline_len,
> + svn_stream_t *dst,
> + struct translation_baton *b)
> +{
...
> +}
> +
> +
> /* Translate the newline string NEWLINE_BUF (of length NEWLINE_LEN) to
> the newline string EOL_STR (of length EOL_STR_LEN), writing the
> result (which is always EOL_STR) to the stream DST.
> +
> + This function assumes that NEWLINE_BUF and EOL_STR are either "\n", "\r", or
> + "\r\n".
>

I suggest
  s/assumes that .* are/assumes that each of .* is/
for clarity.

> Also check for consistency of the source newline strings across
> multiple calls, using SRC_FORMAT (length *SRC_FORMAT_LEN) as a cache
> of the first newline found. If the current newline is not the same
> - as SRC_FORMAT, look to the REPAIR parameter. If REPAIR is TRUE,
> + as SRC_FORMAT, look to b->repair. If b->repair is TRUE,

Maintain upper-case for parameters in internal docstrings; that is,
either B->REPAIR or B->repair. (follow precedent)

Same for 'b->translated_eol'.

> ignore the inconsistency, else return an SVN_ERR_IO_INCONSISTENT_EOL
> error. If *SRC_FORMAT_LEN is 0, assume we are examining the first
> newline in the file, and copy it to {SRC_FORMAT, *SRC_FORMAT_LEN} to
> use for later consistency checks.
>
> - Note: all parameters are required even if REPAIR is TRUE.
> - ### We could require that REPAIR must not change across a sequence of
> + To use this variant, b->translated_eol must not be NULL.
> +
> + Sets *b->translated_eol to TRUE if the newline string that was written
> + (EOL_STR) is not the same as the newline string that was translated
> + (NEWLINE_BUF).
> +
> + Note: all parameters are required even if b->repair is TRUE.
> + ### We could require that b->repair must not change across a sequence of
> calls, and could then optimize by not using SRC_FORMAT at all if
> - REPAIR is TRUE.
> + b->repair is TRUE.
> +
> +
> + If any part of this is changed, translate_newline_alt() must also be updated
> + as appropriate!!!!!
> */
> static svn_error_t *
> translate_newline(const char *eol_str,
> @@ -633,15 +739,22 @@ translate_newline(const char *eol_str,
> const char *newline_buf,
> apr_size_t newline_len,
> svn_stream_t *dst,
> - svn_boolean_t repair)
> + struct translation_baton *b)
> {
> + assert((eol_str_len == 2 && eol_str[0] == '\r' && eol_str[1] == '\n') ||
> + (eol_str_len == 1 && (eol_str[0] == '\n' || eol_str[0] == '\r')));
> + assert((newline_len == 2 && newline_buf[0] == '\r' &&
> + newline_buf[1] == '\n') ||
> + (newline_len == 1 && (newline_buf[0] == '\n' ||
> + newline_buf[0] == '\r')));
> +

s/assert/SVN_ERR_ASSERT/, because that's more friendly to library users.

I'd like to see a comment on this assert, the condition is too long for me.

Both here and in the "unsolicited" change above, perhaps introduce named
macros would make the code more readable?

#define SAME_EOL(eol_str1, eol_str2) /* ... */
#defien STRING_IS_EOL(maybe_eol_str) /* ... */

> /* If we've seen a newline before, compare it with our cache to
> check for consistency, else cache it for future comparisons. */
> if (*src_format_len)
> {
> /* Comparing with cache. If we are inconsistent and
> we are NOT repairing the file, generate an error! */
> - if ((! repair) &&
> + if ((! b->repair) &&
> ((*src_format_len != newline_len) ||
> (strncmp(src_format, newline_buf, newline_len))))
> return svn_error_create(SVN_ERR_IO_INCONSISTENT_EOL, NULL, NULL);
> @@ -653,8 +766,36 @@ translate_newline(const char *eol_str,
> strncpy(src_format, newline_buf, newline_len);
> *src_format_len = newline_len;
> }
> - /* Write the desired newline */
> - return translate_write(dst, eol_str, eol_str_len);
> +
> + /* Translate the newline */
> + SVN_ERR(translate_write(dst, eol_str, eol_str_len));
> +

The only difference between translate_newline() and translate_newline_alt()
is the lines of code from here to the end of the function. I think you
could keep translate_newline() as is, and just move these checks elsewhere
(to a new function or to the caller; I haven't thought about this in
detail).

In any case, please avoid the code duplication if possible. (Function
pointers are nice, but sometimes just a way to shoot yourself in the
foot.) If you're unsure what approach to take, do brainstorm on-list
before sending a new iteration of the patch. (That saves time to you
and to us)

> + /* Set *b->translated_eol to TRUE if a different line ending string was
> + written out. */
> + assert(b->translated_eol != NULL);
> +
> + /* We know that NEWLINE_BUF and EOL_STR are either "\n", "\r", or
> + "\r\n". If the length of NEWLINE_BUF (NEWLINE_LEN) is not the same
> + as the length of EOL_STR (EOL_STR_LEN), then NEWLINE_BUF and
> + EOL_STR_BUF are of course different. Otherwise, NEWLINE_LEN and
> + EOL_STR_LEN are both 1. We need only check the one character for
> + equality to determine whether NEWLINE_BUF and EOL_STR_BUF are
> + the same in that case. */
> + if ((newline_len != eol_str_len) || (*newline_buf != *eol_str))
> + {
> + *b->translated_eol = TRUE;
> + b->translate_newline_fn = &translate_newline_alt; // Now that
> + // TRANSLATED_EOL has
> + // been set to TRUE,
> + // switch the
> + // translate_newline
> + // function that is used

No C++ comments; we follow the C89 standard, which doesn't allow them.

> + // to the alternate
> + // which does not care
> + // about TRANSLATED_EOL
> + }
> +
> + return SVN_NO_ERROR;
> }
>
>
> Index: subversion/libsvn_subr/deprecated.c
> ===================================================================
> --- subversion/libsvn_subr/deprecated.c (revision 1032431)
> +++ subversion/libsvn_subr/deprecated.c (working copy)
> @@ -250,6 +250,19 @@ svn_subst_stream_translated_to_normal_form(svn_str
> }
>
> svn_error_t *
> +svn_subst_translate_string(svn_string_t **new_value,
> + const svn_string_t *value,
> + const char *encoding,
> + apr_pool_t *result_pool)
> +{
> + apr_pool_t *scratch_pool = svn_pool_create(result_pool);
> + svn_error_t *res = svn_subst_translate_string2(new_value, NULL, NULL, value,
> + encoding, result_pool, scratch_pool);
> + svn_pool_destroy(scratch_pool);

Is it worth the overhead to create a subpool here?

On the one hand, that's what the current code does.

On the other hand, creating a subpool allocates 8KB right away (right?
#subpool_question), and an svn_string_t is probably smaller than that.
(Something larger would be transferred in a stream, I hope!) So it
seems a subpool is not worth the overhead, and it'll be better to just
pass result_pool as both pool arguments.

> + return res;
> +}
> +
> +svn_error_t *
> svn_subst_stream_detranslated(svn_stream_t **stream_p,
> const char *src,
> svn_subst_eol_style_t eol_style,

So, thanks for all the fixes, but unfortunately I disagree with the
translate_newline_alt() design choice :-(

Looking forward to the next iteration,

Daniel
Received on 2010-11-09 23:15:22 CET

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.