
Re: [PATCH v2] Saving a few cycles, part 2/3

From: Stefan Fuhrmann <stefanfuhrmann_at_alice-dsl.de>
Date: Tue, 18 May 2010 13:25:24 +0200

Julian Foad wrote:
> Greg - do you want to respond re. 'fast_memcpy()'?
>
> On Sat, 2010-05-01, Stefan Fuhrmann wrote:
>> [[[
>> svn_txdelta_apply_instructions is relatively slow for long
>> instruction sequences copying small pieces of data. This
>> seems to be particularly visible in non-packed FSFS
>> repositories.
>>
>> This patch extracts invariants out from the 'for' loop,
>> optimizes overlapping copies as well as small data copy
>> runtime. As a result, the performance is about doubled
>> and the total svn export runtime gets reduced by 2 %.
>>
>> * subversion/libsvn_delta/text_delta.c
>> (fast_memcpy): new function
>> (svn_txdelta_apply_instructions): reduce loop overhead,
>> use fast_memcpy, optimize overlapped copy code
>>
>> patch by stefanfuhrmann < at > alice-dsl.de
>> ]]]
>>
>
> Hi Stefan.
>
> I think this patch is mostly fine. I have a few minor comments (in
> line, below) which I want to get out to you.
>
> The fast_memcpy() optimization raised an eyebrow: one developer (Greg
> Stein) mentioned to me that optimizing standard memcpy felt wrong in
> principle. I don't know what was behind his feeling. I guess one kind
> of concern that could be considered against it is that newer compilers
> and hardware in a few years could possibly negate the optimization, yet
> that code is unlikely to be re-assessed in the same time frame, so it
> could end up being 'stale' code. Is that a big concern? No, a small
> one. I totally understand that you get a significant improvement in
> practice, right now, at least in your tests, and perhaps in other
> environments.
>
As I tried to explain in http://svn.haxx.se/dev/archive-2010-04/0596.shtml,
memcpy is very hard to optimize for *all* kinds of possible uses.
MS's implementation is a 600-line assembly fun read plus some
non-public special-case code. So much for the prospect of
supporting variable-sized copies in intrinsic code.

The main use-cases for memcpy are: string or string-like operations
(-> minimize latency for copies of 10 .. 100 bytes) and copying large
blobs (-> optimize throughput in the "else" case).

Very short copies suffer from call latency. A compiler might generate
code similar to fast_memcpy, but that would hurt the latency
and, to a lesser degree, the throughput of string operations.

If future compilers should go to great lengths to solve that
problem, or shift their priorities, fast_memcpy would still be roughly
on par because
* it handles *our* critical case very well
* special-case checks in memcpy for very short buffers
 would be ~100% predictable -> close to zero overhead for bypassing
 the duplicate implementation
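To make the trade-off concrete, here is the fast_memcpy from the patch restated as a standalone, compilable function, with the two points Julian raises below already folded in ("char *target" pointer style, and documenting that it returns target + len, unlike standard memcpy):

```c
#include <stddef.h>
#include <string.h>

/* Copy LEN bytes from SOURCE to TARGET (non-overlapping).
 * Unlike standard memcpy(), return TARGET + LEN so the caller can
 * keep a running write position without re-adding the length.
 * Buffers of up to 7 bytes are copied by an inline byte loop to
 * avoid the call overhead of memcpy(); for longer buffers the
 * extra 'if' is a highly predictable branch. */
static char *
fast_memcpy(char *target, const char *source, size_t len)
{
  if (len > 7)
    {
      memcpy(target, source, len);
      target += len;
    }
  else
    {
      /* memcpy is not exactly fast for small block sizes.
         Since they are common, run optimized code for them. */
      const char *end = source + len;
      for (; source != end; source++)
        *(target++) = *source;
    }

  return target;
}
```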
> My thought is that if we have evidence of a benefit and no current or
> predicted harm, we should take it. That code is at least short and
> clear.
>
>
Like much of the svndelta code, this code is mainly exercised by
non-packed repositories. Packed repositories seem to have
much larger delta chunks. Profiling data I used during optimization:

a) flat (1.4 format) repository, total runtime ~4850 samples
- non-optimized: 316 samples total in function, thereof 107 in memcpy
- optimized: 245 samples total in function, thereof 64 in memcpy
b) packed 1.6 repository, total runtime ~3700 samples
- non-optimized: 100 samples total in function, thereof 95 in memcpy
- optimized: 90 samples total in function, thereof 85 in memcpy

Please note that individual sample counts will fluctuate by up to 5
percent or 20 samples (whichever is larger). So, 1.5% savings
in case a), no significant change in case b).

The thing with the whole optimization business is that SVN is already
quite good algorithmically (details to be laid out when I post the
performance analysis). However, there are still significant gains
to be made, just not with a single shot.

It takes a hundred arrows to shoot an elephant. Although some may
miss, they all need to be fired. Currently, we inspect every single
arrow, discuss its potential penetration depth, the smoothness of its
shaft, and the likelihood of it hitting another arrow instead of the elephant.

Let's hope that in the meantime the elephant does not die of old age ;)

> If anyone else has thoughts on that, let's hear them.
>
> (It occurs to me now to wonder whether the 'delta' format is really
> working efficiently if it is requesting a high frequency of copies as
> small as 1 or 2 bytes. That just seems odd. I haven't looked into it.)
>
I haven't created a histogram for these calls but from my work
on zlib, I would expect 3 bytes to be the lower limit.
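Such a histogram would be easy to gather. A minimal instrumentation sketch (the function names are hypothetical; one would call record_copy_len() with buf_len from inside svn_txdelta_apply_instructions and dump the table after an export run):

```c
#include <stddef.h>
#include <stdio.h>

/* Bucket the lengths of delta copy instructions to see how often
 * very short copies actually occur. */
#define MAX_TRACKED_LEN 64

static unsigned long copy_len_histogram[MAX_TRACKED_LEN + 1];

static void
record_copy_len(size_t len)
{
  /* Lengths beyond the last bucket are lumped together. */
  copy_len_histogram[len < MAX_TRACKED_LEN ? len : MAX_TRACKED_LEN]++;
}

static void
dump_copy_len_histogram(FILE *out)
{
  size_t i;
  for (i = 0; i <= MAX_TRACKED_LEN; i++)
    if (copy_len_histogram[i] > 0)
      fprintf(out, "%3lu%s: %lu\n",
              (unsigned long)i,
              i == MAX_TRACKED_LEN ? "+" : " ",
              copy_len_histogram[i]);
}
```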
> plain text document attachment (PerformanceApplyDeltas.patch):
>> Index: subversion/libsvn_delta/text_delta.c
>> ===================================================================
>> --- subversion/libsvn_delta/text_delta.c (revision 939976)
>> +++ subversion/libsvn_delta/text_delta.c (working copy)
>> @@ -564,61 +591,109 @@
>> return SVN_NO_ERROR;
>> }
>>
>> +/* memcpy() is hard to tune for a wide range of buffer lengths.
>> + * Therefore, it is often tuned for high throughput on large
>> + * buffers and relatively low latency for mid-sized buffers
>> + * (tens of bytes). However, the overhead for very small buffers
>> + * (<10 bytes) is still high. Even passing the parameters, for
>> + * instance, may take as long as copying 3 bytes.
>> + *
>> + * Because short copy sequences seem to be a common case in
>> + * non-packed FSFS repositories, we copy them directly. Larger
>> + * buffer sizes aren't hurt measurably by the exta 'if' clause.
>> + */
>>
>
> Please document that it returns (target + len), since that is different
> from the behaviour of standard memcpy().
>
>
Will do - in particular as it is part of the optimization.
>> +static APR_INLINE char*
>> +fast_memcpy(char* target, const char* source, apr_size_t len)
>>
>
> Style nit: We write pointers as "char *target" rather than "char*
> target".
>
o.k.
>
>> +{
>> + if (len > 7)
>> + {
>> + memcpy(target, source, len);
>> + target += len;
>> + }
>> + else
>> + {
>> + /* memcpy is not exactly fast for small block sizes.
>> + Since they are common, let's run optimized code for them. */
>> + const char* end = source + len;
>> + for (; source != end; source++)
>> + *(target++) = *source;
>> + }
>>
>> + return target;
>> +}
>> +
>> void
>> svn_txdelta_apply_instructions(svn_txdelta_window_t *window,
>> const char *sbuf, char *tbuf,
>> apr_size_t *tlen)
>> {
>> - const svn_txdelta_op_t *op;
>> - apr_size_t i, j, tpos = 0;
>> + /* The main loop is run quite often.
>> + * Pre-calculate as much data as possible.
>> + */
>> + const svn_txdelta_op_t *op, *last_op = window->ops + window->num_ops;
>> + apr_size_t to_fill = *tlen < window->tview_len ? *tlen :
>> window->tview_len;
>> + apr_size_t left = to_fill;
>> + const char* end, *source;
>> + char *target = tbuf;
>>
>> - for (op = window->ops; op < window->ops + window->num_ops; op++)
>> + for (op = window->ops; left > 0; op++)
>>
>
> I haven't studied this enough to see why "left > 0" is equivalent to "op
> < window->ops + window->num_ops". Can't we use "op < last_op" to make
> it obvious? (Generally a loop condition is more obvious when it refers
> to the loop counter.)
>
>
I will add a longer comment for that in the code.
>> {
>> - const apr_size_t buf_len = (op->length < *tlen - tpos
>> - ? op->length : *tlen - tpos);
>> + const apr_size_t buf_len = op->length < left ? op->length : left;
>> + left -= buf_len;
>>
>> /* Check some invariants common to all instructions. */
>> - assert(tpos + op->length <= window->tview_len);
>> + assert(target - tbuf + op->length <= window->tview_len);
>>
>> switch (op->action_code)
>> {
>> case svn_txdelta_source:
>> /* Copy from source area. */
>> assert(op->offset + op->length <= window->sview_len);
>> - memcpy(tbuf + tpos, sbuf + op->offset, buf_len);
>> + target = fast_memcpy(target, sbuf + op->offset, buf_len);
>> break;
>>
>> case svn_txdelta_target:
>> - /* Copy from target area. Don't use memcpy() since its
>> - semantics aren't guaranteed for overlapping memory areas,
>> - and target copies are allowed to overlap to generate
>> - repeated data. */
>> - assert(op->offset < tpos);
>> - for (i = op->offset, j = tpos; i < op->offset + buf_len; i++)
>> - tbuf[j++] = tbuf[i];
>> + /* Copy from target area. We can't use memcpy() or the like
>> + since we need specific semantics for overlapping copies:
>> + they must result in repeating patterns.
>> + */
>> + assert(op->offset < target - tbuf);
>> + source = tbuf + op->offset;
>> + end = tbuf + op->offset + buf_len;
>> +
>> + /* Copy in larger chunks if source and target don't overlap
>> + * closer than the size of the chunks (or don't overlap at all).
>> + * Use the natural machine word as chunk size
>> + * (for some architectures size_t is even a bit larger).
>> + */
>> + if (end + sizeof(apr_size_t) <= target)
>> + for (; source + sizeof (apr_size_t) <= end;
>> + source += sizeof (apr_size_t),
>> + target += sizeof (apr_size_t))
>> + *(apr_size_t*)(target) = *(apr_size_t*)(source);
>> +
>> + /* Copy trailing bytes */
>> + for (; source != end; source++)
>> + *(target++) = *source;
>>
>
> I would prefer to see this "overlapping copy" code factored out as a
> subroutine, analogous to fast_memcpy.
>
Can do.
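For illustration, the factored-out helper might look like this (the name patterning_copy is hypothetical, and the word-at-a-time loop mirrors the patch, including its assumption that unaligned word accesses are tolerated on the target platform):

```c
#include <stddef.h>

/* Copy LEN bytes from SOURCE to TARGET within the same buffer,
 * where SOURCE may lie before and overlap with TARGET.  Overlap must
 * produce a repeating pattern, so we copy strictly left-to-right.
 * Returns TARGET + LEN, mirroring fast_memcpy. */
static char *
patterning_copy(char *target, const char *source, size_t len)
{
  const char *end = source + len;

  /* Copy in machine words as long as a whole word read cannot
   * reach into the not-yet-written part of the target. */
  if (end + sizeof(size_t) <= target)
    for (; source + sizeof(size_t) <= end;
         source += sizeof(size_t), target += sizeof(size_t))
      *(size_t *)target = *(const size_t *)source;

  /* Copy the remaining (or closely overlapping) bytes one at a time,
   * which naturally generates the repeating pattern. */
  for (; source != end; source++)
    *(target++) = *source;

  return target;
}
```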
>
>> break;
>>
>> case svn_txdelta_new:
>> /* Copy from window new area. */
>> assert(op->offset + op->length <= window->new_data->len);
>> - memcpy(tbuf + tpos,
>> - window->new_data->data + op->offset,
>> - buf_len);
>> + target = fast_memcpy(target,
>> + window->new_data->data + op->offset,
>> + buf_len);
>> break;
>>
>> default:
>> assert(!"Invalid delta instruction code");
>> }
>> -
>> - tpos += op->length;
>> - if (tpos >= *tlen)
>> - return; /* The buffer is full. */
>> }
>>
>> - /* Check that we produced the right amount of data. */
>> - assert(tpos == window->tview_len);
>> - *tlen = tpos;
>> + /* We always fill the whole target buffer unless we stumble across
>>
>
> That comment doesn't look right. To say "the whole target buffer"
> implies TBUF with TLEN as passed in to this function; but we are writing
> a new value to *TLEN because the actual amount of data produced may be
> less, so that's not always the "whole" buffer.
>
> Instinctively it seems to me that the assertion (with its original
> comment) should be kept, and should be adjusted to suit the new
> variables, though I haven't totally thought it through.
>
>
Again, needs more commentary.
>> + * an invalid instruction (caught by debug assertions only).
>> + */
>> + *tlen = to_fill;
>> }
>>
>
>
> - Julian
>
>
Thanks for the review.

-- Stefan^2.
Received on 2010-05-18 13:26:15 CEST

This is an archived mail posted to the Subversion Dev mailing list.
