On 07.10.2010 16:07, Julian Foad wrote:
>> New Revision: 997203
>>
>> URL: http://svn.apache.org/viewvc?rev=997203&view=rev
>> Log:
>> Merge r985037, r985046, r995507 and r995603 from the performance branch.
>>
>> These changes introduce the svn_stringbuf_appendbyte() function, which has
>> significantly less overhead than svn_stringbuf_appendbytes(), and can be
>> used in a number of places within our codebase.
> Hi Stefan2.
>
> I have been wondering about the actual benefit of such tightly
> hand-optimized code. Today I got around to giving it a quick spin.
>
> In my first test, it made no difference whatsoever, because the
> optimized code is never executed. The recent merge r1002898 of r1001413
> from the performance branch introduced a bug:
>
> - if (str->blocksize > old_len + 1)
> + if (str->blocksize < old_len + 1)
WTF how did that happen?
Well, that's what reviews are for ;)
> which totally disables the optimized code path.
>
> Fixed in r1005437.
Thanks.
> That solved, a loop doing a million simple appendbyte calls runs more
> than twice as fast as calling appendbytes(..., 1). That's fantastic.
>
> But what about that hand-optimization? I wrote a simple version of the
> function, and called it 'appendbyte0':
>
> static void
> svn_stringbuf_appendbyte0(svn_stringbuf_t *str, char byte)
> {
>   if (str->blocksize > str->len + 1)
>     {
>       str->data[str->len++] = byte;
>       str->data[str->len] = '\0';
>     }
>   else
>     svn_stringbuf_appendbytes(str, &byte, 1);
> }
>
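> A minimal sketch of the timing loop (the attached patch is the
> authoritative version; the buffer size and iteration count here are
> illustrative only):
>
>   svn_stringbuf_t *s = svn_stringbuf_create_ensure(1000001, pool);
>   int i;
>
>   /* one million single-byte appends; swap in
>    * svn_stringbuf_appendbytes(s, "x", 1) for the comparison run */
>   for (i = 0; i < 1000000; i++)
>     svn_stringbuf_appendbyte(s, 'x');
>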
> Here are the results (see full patch attached):
>
> Times (in ms):  appendbytes  appendbyte0  appendbyte
> run:                     89           31          34
> run:                     88           30          35
> run:                     88           31          34
> run:                     88           30          34
> run:                     88           31          34
> min:                     88           30          34
>
> This tells me that the hand-optimization is actually harmful and the
> compiler does a 10% better job by itself.
>
> Have I made a mistake?
My guess is that you actually made two mistakes.

First, the number of operations was relatively low - everything
in the low-ms range could be due to secondary effects (and
still be repeatable).

The more important problem is compiler optimization, or the lack
thereof. Since your numbers are very large (50..100 ticks per byte,
depending on your CPU clock), I assume you used a normal debug
build for testing. In that case, the actual number of C statements
has a large impact on execution time. See my results below for details.
> What are the results for your system?
>
> (I'm using GCC 4.4.1 on an Intel Centrino laptop CPU.)
>
Test code used (10^10 calls in total; the buffer never re-allocates):

  int i;
  unsigned char c;
  svn_stringbuf_t *s = svn_stringbuf_create_ensure(255, pool);

  for (i = 0; i < 50000000; ++i)
    {
      s->len = 0;                      /* reset instead of re-allocating */
      for (c = 0; c < 200; ++c)
        svn_stringbuf_appendbyte(s, c);
    }
XEON 55xx / Core i7, hyper-threading on, 3GHz peak,
64-bit Linux, GCC 4.3.3; ./configure --disable-debug

(1) 10.7s; IPC = 2.1
(2) 8.11s; IPC = 1.4
(3) 2.64s; IPC = 2.4
(4) 2.43s; IPC = 2.3
(1) calling appendbytes directly: 9 instructions in main, 59 in appendbytes
(2) handling count == 1 directly in appendbytes: 9 instructions in main,
    26 in appendbytes
(3) appendbyte0 (same compiler output as the non-hand-tuned code):
    6 instructions in main, 13 in appendbyte
(4) trunk@HEAD appendbyte: 6 instructions in main, 11 in appendbyte
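For illustration, a sketch of what variant (2) means (not the actual
committed code; the single-byte early-out simply sits in front of the
generic path):

  void
  svn_stringbuf_appendbytes(svn_stringbuf_t *str, const char *bytes,
                            apr_size_t count)
  {
    /* early-out: a single byte that still fits the allocated buffer */
    if (count == 1 && str->blocksize > str->len + 1)
      {
        str->data[str->len] = *bytes;
        str->data[str->len + 1] = '\0';
        str->len++;
        return;
      }

    /* generic path: grow if needed, copy, re-terminate */
    svn_stringbuf_ensure(str, str->len + count + 1);
    memcpy(str->data + str->len, bytes, count);
    str->len += count;
    str->data[str->len] = '\0';
  }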
Core2 2.4GHz, Win32, VS2010
(not using valgrind to count instructions here; also not testing (2))

(1) 17.0s release, 20.2s debug
(3) 10.6s release, 12.2s debug
(4) 9.7s release, 13.0s debug
That shows the following:
* for debug code, appendbyte0 is faster (because debug code stores all
  variables on the stack and the hand-tuned code introduces some
  additional ones; see the sketch after this list)
* the ABI has a significant effect (Win32: all parameters are put on
  the stack; Lin64: they all fit into registers)
* a shorter API (function signature) speeds up the *caller* significantly
* as long as small functions don't need to read/write the stack, they
  can be very fast (64 bit gives more registers, parameters passed in
  registers eliminate stack frame ops, and simple functions don't need
  more than the available scratch registers)
* your numbers seem very large in comparison
  -> probably something else going on
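
For reference, a sketch of the hand-tuned shape measured as (4) above,
reconstructed from the diff earlier in this thread rather than copied
verbatim from trunk. The extra locals (old_len, dest, b) help an
optimizing compiler but cost stack slots in a debug build:

  void
  svn_stringbuf_appendbyte(svn_stringbuf_t *str, char byte)
  {
    apr_size_t old_len = str->len;

    /* fast path: room for the byte plus terminator already allocated */
    if (str->blocksize > old_len + 1)
      {
        char *dest = str->data;

        dest[old_len] = byte;
        dest[old_len + 1] = '\0';
        str->len = old_len + 1;
      }
    else
      {
        /* slow path: let the generic code re-allocate the buffer */
        char b = byte;
        svn_stringbuf_appendbytes(str, &b, 1);
      }
  }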
-- Stefan^2