[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: [PATCH v2] Saving a few cycles, part 2/3

From: Julian Foad <julian.foad_at_wandisco.com>
Date: Fri, 14 May 2010 01:00:45 +0100

Greg - do you want to respond re. 'fast_memcpy()'?

On Sat, 2010-05-01, Stefan Fuhrmann wrote:
> [[[
> svn_txdelta_apply_instructions is relatively slow for long
> instruction sequences copying small pieces of data. This
> seems to be particularly visible in non-packed FSFS
> repositories.
>
> This patch extracts invariants out from the 'for' loop,
> optimizes overlapping copies as well as small data copy
> runtime. As a result, the performance is about doubled
> and the total svn export runtime gets reduced by 2 %.
>
> * subversion/libsvn_delta/text_delta.c
> (fast_memcpy): new function
> (svn_txdelta_apply_instructions): reduce loop overhead,
> use fast_memcpy, optimize overlapped copy code
>
> patch by stefanfuhrmann < at > alice-dsl.de
> ]]]

Hi Stefan.

I think this patch is mostly fine. I have a few minor comments (in
line, below) which I want to get out to you.

The fast_memcpy() optimization raised an eyebrow: one developer (Greg
Stein) mentioned to me that optimizing standard memcpy felt wrong in
principle. I don't know what was behind his feeling. I guess one kind
of concern that could be considered against it is that newer compilers
and hardware in a few years could possibly negate the optimization, yet
that code is unlikely to be re-assessed in the same time frame, so it
could end up being 'stale' code. Is that a big concern? No, a small
one. I totally understand that you get a significant improvement in
practice, right now, at least in your tests, and perhaps in other
environments.

My thought is that if we have evidence of a benefit and no current or
predicted harm, we should take it. That code is at least short and
clear.

If anyone else has thoughts on that, let's hear them.

(It occurs to me now to wonder whether the 'delta' format is really
working efficiently if it is requesting a high frequency of copies as
small as 1 or 2 bytes. That just seems odd. I haven't looked into it.)

plain text document attachment (PerformanceApplyDeltas.patch):
> Index: subversion/libsvn_delta/text_delta.c
> ===================================================================
> --- subversion/libsvn_delta/text_delta.c (revision 939976)
> +++ subversion/libsvn_delta/text_delta.c (working copy)
> @@ -564,61 +591,109 @@
> return SVN_NO_ERROR;
> }
>
> +/* memcpy() is hard to tune for a wide range of buffer lengths.
> + * Therefore, it is often tuned for high throughput on large
> + * buffers and relatively low latency for mid-sized buffers
> + * (tens of bytes). However, the overhead for very small buffers
> + * (<10 bytes) is still high. Even passing the parameters, for
> + * instance, may take as long as copying 3 bytes.
> + *
> + * Because short copy sequences seem to be a common case in
> + * non-packed FSFS repositories, we copy them directly. Larger
> + * buffer sizes aren't hurt measurably by the exta 'if' clause.
> + */

Please document that it returns (target + len), since that is different
from the behaviour of standard memcpy().

> +static APR_INLINE char*
> +fast_memcpy(char* target, const char* source, apr_size_t len)

Style nit: We write pointers as "char *target" rather than "char*
target".

> +{
> + if (len > 7)
> + {
> + memcpy(target, source, len);
> + target += len;
> + }
> + else
> + {
> + /* memcpy is not exactly fast for small block sizes.
> + Since they are common, let's run optimized code for them. */
> + const char* end = source + len;
> + for (; source != end; source++)
> + *(target++) = *source;
> + }
>
> + return target;
> +}
> +
> void
> svn_txdelta_apply_instructions(svn_txdelta_window_t *window,
> const char *sbuf, char *tbuf,
> apr_size_t *tlen)
> {
> - const svn_txdelta_op_t *op;
> - apr_size_t i, j, tpos = 0;
> + /* The main loop is run quite often.
> + * Pre-calculate as much data as possible.
> + */
> + const svn_txdelta_op_t *op, *last_op = window->ops + window->num_ops;
> + apr_size_t to_fill = *tlen < window->tview_len ? *tlen : window->tview_len;
> + apr_size_t left = to_fill;
> + const char* end, *source;
> + char *target = tbuf;
>
> - for (op = window->ops; op < window->ops + window->num_ops; op++)
> + for (op = window->ops; left > 0; op++)

I haven't studied this enough to see why "left > 0" is equivalent to "op
< window->ops + window->num_ops". Can't we use "op < last_op" to make
it obvious? (Generally a loop condition is more obvious when it refers
to the loop counter.)

> {
> - const apr_size_t buf_len = (op->length < *tlen - tpos
> - ? op->length : *tlen - tpos);
> + const apr_size_t buf_len = op->length < left ? op->length : left;
> + left -= buf_len;
>
> /* Check some invariants common to all instructions. */
> - assert(tpos + op->length <= window->tview_len);
> + assert(target - tbuf + op->length <= window->tview_len);
>
> switch (op->action_code)
> {
> case svn_txdelta_source:
> /* Copy from source area. */
> assert(op->offset + op->length <= window->sview_len);
> - memcpy(tbuf + tpos, sbuf + op->offset, buf_len);
> + target = fast_memcpy(target, sbuf + op->offset, buf_len);
> break;
>
> case svn_txdelta_target:
> - /* Copy from target area. Don't use memcpy() since its
> - semantics aren't guaranteed for overlapping memory areas,
> - and target copies are allowed to overlap to generate
> - repeated data. */
> - assert(op->offset < tpos);
> - for (i = op->offset, j = tpos; i < op->offset + buf_len; i++)
> - tbuf[j++] = tbuf[i];
> + /* Copy from target area. We can't use memcpy() or the like
> + since we need a specific semantics for overlapping copies:
> + they must result in repeating patterns.
> + */
> + assert(op->offset < target - *tbuf);
> + source = tbuf + op->offset;
> + end = tbuf + op->offset + buf_len;
> +
> + /* Copy in larger chunks if source and target don't overlap
> + * closer than the size of the chunks (or don't overlap at all).
> + * Use the natural machine word as chunk size
> + * (for some architectures size_t is even a bit larger).
> + */
> + if (end + sizeof(apr_size_t) <= target)
> + for (; source + sizeof (apr_size_t) <= end;
> + source += sizeof (apr_size_t),
> + target += sizeof (apr_size_t))
> + *(apr_size_t*)(target) = *(apr_size_t*)(source);
> +
> + /* Copy trailing bytes */
> + for (; source != end; source++)
> + *(target++) = *source;

I would prefer to see this "overlapping copy" code factored out as a
subroutine, analogous to fast_memcpy.

> break;
>
> case svn_txdelta_new:
> /* Copy from window new area. */
> assert(op->offset + op->length <= window->new_data->len);
> - memcpy(tbuf + tpos,
> - window->new_data->data + op->offset,
> - buf_len);
> + target = fast_memcpy(target,
> + window->new_data->data + op->offset,
> + buf_len);
> break;
>
> default:
> assert(!"Invalid delta instruction code");
> }
> -
> - tpos += op->length;
> - if (tpos >= *tlen)but
> - return; /* The buffer is full. */
> }
>
> - /* Check that we produced the right amount of data. */
> - assert(tpos == window->tview_len);
> - *tlen = tpos;
> + /* We always fill the whole target buffer unless we stumble across

That comment doesn't look right. To say "the whole target buffer"
implies TBUF with TLEN as passed in to this function; but we are writing
a new value to *TLEN because the actual amount of data produced may be
less, so that's not always the "whole" buffer.

Instinctively it seems to me that the assertion (with its original
comment) should be kept, and should be adjusted to suit the new
variables, though I haven't totally thought it through.

> + * an invalid instruction (caught by debug assertions only).
> + */
> + *tlen = to_fill;
> }

- Julian
Received on 2010-05-14 02:01:16 CEST

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.