[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: [PATCH] Limit size of emails sent using commit-email.pl v2

From: <kfogel_at_collab.net>
Date: 2005-02-20 21:56:30 CET

> A new log message and modified commit-email.pl.in are attached.
> Changing the truncate function to zero out the difflines became too
> difflines array specific, so it has been called process_diff instead.

Here's a review; sorry for the long delay in giving this feedback.

(Note that Branden Robinson submitted a similar patch, at
http://subversion.tigris.org/servlets/ReadMsg?list=dev&msgNo=89960,
which has also not yet been reviewed.)

> Controls to modify email message based on diff size
>
> * commit-email.pl.in
> Added new commandline arguments (-ds, -dl)
> -ds: diff_size (max) - overrides default of 1_000_000
> -dl: diff_location - added to email body if set
> $max_diff_size: New variable
> check_diff(): New function to modify body based on diff size
> Added call to check (and possibly truncate) size of @difflines
>
>
> Index: tools/hook-scripts/commit-email.pl.in
> ===================================================================
> --- tools/hook-scripts/commit-email.pl.in (revision 12807)
> +++ tools/hook-scripts/commit-email.pl.in (working copy)
> @@ -57,6 +57,9 @@
> # $no_diff_added to 1.
> my $no_diff_added = 0;
>
> +# This controls the size of the commit email.
> +my $max_diff_size = 1_000_000; # characters
> +

The comment says it controls the size of the commit email, but the
variable name says it controls the size of the diff. Which is it? :-)
(Diff, I assume.)

> # Since the path to svnlook depends upon the local installation
> # preferences, check that the required programs exist to insure that
> # the administrator has set up the script properly.
> @@ -108,7 +111,9 @@
> '-l' => 'log_file',
> '-m' => '',
> '-r' => 'reply_to',
> - '-s' => 'subject_prefix');
> + '-s' => 'subject_prefix',
> + '-ds' => 'diff_size',
> + '-dl' => 'diff_location');

I think long names would be better here, especially since this script
is almost never invoked manually (only when it's being tested,
basically). So '--diff-size-limit' and '--diff-alternative', maybe?

I call the latter '--diff-alternative' instead of '--diff-location'
because it would not always be used to give the location of the diff.
It might just be a string of text explaining that the diff could not
be shown, or perhaps giving an svn command line client example command
for how to get the diff, or whatever. Whether or not it is a location
is up to the caller, not up to us.

> while (@ARGV)
> {
> @@ -367,6 +372,7 @@
> push(@body, "Log:\n");
> push(@body, @log);
> push(@body, "\n");
> +&process_diff();
> push(@body, map { /[\r\n]+$/ ? $_ : "$_\n" } @difflines);
>
> # Go through each project and see if there are any matches for this
> @@ -500,6 +506,8 @@
> " -m regex Regular expression to match committed path\n",
> " -r email_address Email address for 'Reply-To:'\n",
> " -s subject_prefix Subject line prefix\n",
> + " -ds diff_size Max size of diff content\n",
> + " -dl diff_location String to be put before diff content\n",
> "\n",
> "This script supports a single repository with multiple projects,\n",
> "where each project receives email only for commits that modify that\n",

Usage for diff_size should say what units (bytes? lines?).

> @@ -518,7 +526,12 @@
> "a list of email addresses on the command line. If you do not want\n",
> "a project that matches the entire repository, then use a -m with a\n",
> "regular expression before any other command line options or email\n",
> - "addresses.\n";
> + "addresses.\n",
> + "\n",
> + "The maximum amount of diff content and hence the size of the email\n",
> + "can be set using the -ds option and a diff 'location' can be set\n",
> + "using the -dl option like so:\n",
> + "-ds 'http://svn.collab.net/viewcvs/svn/?view=rev&rev=12798'\n";
> }

I think that example meant '-dl' where it says '-ds'?

Again, the diff content does not define the size of the email, since
there's also the log message, which we aren't limiting. So we should
find a more accurate way to phrase what's going on.

But more importantly, it seems like this is saying that if the diff is
too large, we'll show some of it, just not all of it. If so, that's
not a useful behavior -- after all, there's no way to programatically
determine which parts of a diff are important and which aren't. If we
can't show the whole thing, we should show none, and the reader should
use the alternative (ViewCVS or whatever) instead. It's not helpful
to show just the first N lines and leave the rest out, IMHO.

> # Return a new hash data structure for a new empty project that
> @@ -598,3 +611,30 @@
> return @output;
> }
> }
> +
> +# Check size of @difflines and truncate if necessary
> +# Add diff_location (e.g, a ViewCVS URL) if specified
> +sub process_diff
> +{
> + # all projects have the same diff_location
> + if ($project_settings_list[0]->{diff_location})
> + {
> + push(@body, "$project_settings_list[0]->{diff_location}\n");
> + }
> +
> + if ($project_settings_list[0]->{diff_size})
> + {
> + $max_diff_size = $project_settings_list[0]->{diff_size};
> + }
> +
> + for (my ($line, $size) = (0, 0); $line <= $#difflines; $line++)
> + {
> + if (($size + length $difflines[$line]) > $max_diff_size)
> + {
> + splice(@difflines, 0, $#difflines,
> + "\n(Diff too large to include in email. See above)");
> + last;
> + }
> + $size += length($difflines[$line]);
> + }
> +}

It looks like there's no way to have a max_diff_size of infinity.
People can override the default with an even higher number, but
shouldn't there be some way to say "No limit at all"? A value of -1
or something like that?

Okay, I see here you're truncating @difflines from its beginning, and
replacing it with a single-element list containing just a message
about the diff being too large. So the problem is only in the usage
message, not in the implementation.

Here's an interesting implementation thought: one of the things I had
hoped for was that this new limit option would also mean that the
server would not have to hold the entire diff in memory at once if the
diff were over the limit. That is, we'd accumulate the diff to a tmp
file, check the size of the file, and then splice it into the mail iff
it's below the limit.

I realize this could be considered an optimization; on the other hand,
it's kind of an important one, since a diff could be arbitrarily
large, and it's been a principle (in Subversion itself, anyway) to
avoid holding all the data in memory at once when the data is of
arbitrary size. What do you think of doing that here?

-Karl

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Sun Feb 20 22:12:52 2005

This is an archived mail posted to the Subversion Dev mailing list.