Re: [PATCH] svnsync 1.6 works even when source has eol-style impurities

From: Daniel Shahaf <d.s_at_daniel.shahaf.name>
Date: Thu, 16 Apr 2009 14:54:58 +0300 (Jerusalem Daylight Time)

[ CCing the list again, per discussion ]

B Smith-Mannschott wrote on Mon, 13 Apr 2009 at 14:54 +0200:
> Many thanks for taking the time to review my patch. I anticipate
> having some time to rework it this week, though not today.
>

And I'm sneaking in a quick reply too...

> On Sat, Apr 11, 2009 at 14:21, Daniel Shahaf <d.s_at_daniel.shahaf.name> wrote:
> > B Smith-Mannschott wrote on Wed, 1 Apr 2009 at 19:39 +0200:
> >> Here's my attempt at a patch for the svnsync issues raised by this
> >> thread. Please let me know what I can do to improve it.
> >>
> >> [[[
> >> Make is possible to use svnsync 1.6 even when source has eol-style impurities.
> >>
> >
> > Are there any relevant issue numbers to be mentioned here? (I don't think
> > there is an issue specifically for these new problems, but asking.)
>
> No issue number, that I know of. I'll check again and file an issue if
> I find nothing suitable. I think it would be good to have a tracked
> issue to bind the mailing list threads and any resulting commits
> together.
>

+1, good call.

Please CC me ('danielsh') on the issue. Thanks.

> >> diff --git a/subversion/include/svn_string.h b/subversion/include/svn_string.h
> >> index 2154ef7..c71fc58 100644
> >> --- a/subversion/include/svn_string.h
> >> +++ b/subversion/include/svn_string.h
> >> @@ -290,6 +290,23 @@ svn_stringbuf_first_non_whitespace(const svn_stringbuf_t *str);
> >> void
> >> svn_stringbuf_strip_whitespace(svn_stringbuf_t *str);
> >>
> >> +/** Return a svn_string_t with contents of @a str with linebreaks normalized.
> >> + *
> >> + * @a str contains zero or more linebreaks of the form \n or \r\n.
> >> + * The returned svn_string_t will have the same number of linebreaks
> >> + * as @a str but will contain no \r characters.
> >> + *
> >> + * The returned svn_string_t may be identical (the same memory) to @a
> >> + * str, when @a str is already fully eol-normalized.
> >> + *
> >> + * We make no attempt to handle lone \r characters (as used in Classic
> >> + * Mac OS) correctly as this is not a supported platform. (This
> >> + * simplifies the implementation.)
> >
> > Supporting \r's is easy (s/0x0D/0x0A/g), why not do it?
>
> Possible, but only with a second pass after s/0x0D0x0A/0x0A/g or with
> more complex logic in the loop. (This is the simplification I referred
> to above. It cut the loop from hairy down to trivial, but more below.)

If the loop gets too complicated, there is the option of the split/join
trick from one of the previous iterations. Or use a string
search/replace API (if you find one; can't think of any right now).

> Remember, I don't want to change the number of logical lines, so just
> replacing all CR with LF would clearly be wrong.
>

I'm trying to find a solution that works even if all three EOL styles
are used in the same value. s/\r/\n/ is simple and works for every mix
of EOLs you throw at it. I'm not saying it's the best, and I agree that
in the CRLF-and-LF's case it adds empty lines.

> >> + */
> >> +svn_string_t *
> >> +svn_string_with_normalized_eol_style(const svn_string_t *str, apr_pool_t *pool);
> >> +
> >> +
> >> index c64ebdc..feb78e7 100644
> >> --- a/subversion/libsvn_subr/svn_string.c
> >> +++ b/subversion/libsvn_subr/svn_string.c
> >> @@ -460,6 +460,46 @@ svn_stringbuf_strip_whitespace(svn_stringbuf_t *str)
> >> }
> >>
> >>
> >> +static void
> >> +svn_stringbuf_normalize_eol_style(svn_stringbuf_t *str)
> >> +{
> >> + char *r, *w; /* read and write pointer, w <= r */
> >> + char *z; /* end of buf pointer, r < z */
> >> +
> >> + /* We walk r and w along the buffer, copying from r to w. Initially
> >> + they are equal, though w will lag behind r as we encounter '\r'
> >> + chars. */
> >> +
> >> + z = &(str->data[str->len]);
> >> + w = r = &(str->data[0]);
> >> +
> >> + while (r < z)
> >> + {
> >> + if (*r != '\r') /* Copy char and advance the writer when not \r */
> >> + { /* Copying from r to w is a no-op when w and r are */
> >> + *w++ = *r; /* equal. Advancing w, however always takes place */
> >> + }
> >> + ++r; /* always advance the reader */
> >> + }
> >> +
> >
> > *headache*
>
> Yea. The pointer magic is a little ugly. I was under the impression
> that this is fairly idiomatic C, though as I gaze around the
> Subversion code base, not so much. Might be clearer using indexes.
> I'll try that out when I rewrite it to handle bare \r properly.
>

Don't worry about this point too much. The code was good as it was :)

> >
> > I'm not sure what you need the stringbuf for. Personally, I'd write it
> > along the lines of
> >
> > if '\r' in src->data:
> > char *new_data = ...;
> > while (i < src->len)
> > new_data[i] = (src[i] == '\r' ? '\n' : src[i]);
> >
> > new_string->data = new_data;
>
> This can't work as written. EOL normalization can change the length of
> the string. because we need to replace the two-byte sequence \r\n with
> the single-byte sequence \n. If we just blindly overwrite every \r
> with a \n we end up inserting empty lines into the property value.
> That can't be what we want.
>

See above for the which-EOLs-are-handled issue. Another point of this
excerpt, though, was to avoid using a stringbuf.

> >> + {
> >> + svn_string_t *real_value;
> >> + real_value = svn_string_with_normalized_eol_style(value, pool);
> >> + return eb->wrapped_editor->change_file_prop(fb->wrapped_node_baton,
> >> + name, real_value, pool);
> >> + }
> >> +
> >> return eb->wrapped_editor->change_file_prop(fb->wrapped_node_baton,
> >
> > So now there are two nearly-identical 'return change_file_prop()' calls.
> > The code would be better (easier to extend) if there was just one, IMO.
> >
>
> I did this with the intention of keeping changes more localized by
> limiting the size of the scope of mutable variables (real_value). I
> find it harder to read code where a variable is given some value at
> the top, which might or might not get changed, mangled, mutilated
> through any number of conditionals before finally being used at the
> end.
>
> But, in the end, I can go either way on this. I'll try your suggestion
> and see how it reads.
>

Fair enough.

> >> + /* Now that we have assured that there will be an svn:log message,
> >> + normalize line breaks in log message.*/
> >> + if (getenv("SVNSYNC_UNSUPPORTED_NORMALIZE_PROP_EOL_STYLE"))
> >> + {
> >> + svn_string_t *svn_log;
> >> + svn_log = apr_hash_get(filtered, SVN_PROP_REVISION_LOG,
> >> + APR_HASH_KEY_STRING);
> >> + svn_log = svn_string_with_normalized_eol_style(svn_log, pool);
> >> + apr_hash_set(filtered, SVN_PROP_REVISION_LOG, APR_HASH_KEY_STRING,
> >> + svn_log);
> >> + }
> >> +
> >
> > As I said: you want this logic not just for svn:log, but maybe also for
> > other props. You can iterate over all props you're about to commit
> > (compare filter_props()) and see if any of them needs_translation().
> >
> > Also (as I wrote above), this renormalization logic is (likely; haven't
> > tested it) needed elsewhere in svnsync/main.c as well.
>
> I'll need to study the surrounding code more closely in order to
> tackle these last two points.

Cool. If you have any questions, don't hesitate to ask.

> I limited myself to svn:log because that
> was the only multi-line rev-prop I was aware of and the only one I
> needed to normalized for my particular use-case. I can see the
> argument for a more general solution, however.
>

Okay.

> // Ben
>

Thanks,

Daniel

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=1745954
Received on 2009-04-16 13:55:17 CEST

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]