On 11.10.2010 17:07, Julian Foad wrote:
> On Sun, 2010-10-10, Stefan Fuhrmann wrote:
>> On 07.10.2010 16:07, Julian Foad wrote:
>>>> New Revision: 997203
>>>>
>>>> URL: http://svn.apache.org/viewvc?rev=997203&view=rev
>>>> Log:
>>>> Merge r985037, r985046, r995507 and r995603 from the performance branch.
>>>>
>>>> These changes introduce the svn_stringbuf_appendbyte() function, which has
>>>> significantly less overhead than svn_stringbuf_appendbytes(), and can be
>>>> used in a number of places within our codebase.
>>> Here are the results (see full patch attached):
>>>
>>> Times: appendbytes appendbyte0 appendbyte (in ms)
>>> run: 89 31 34
>>> run: 88 30 35
>>> run: 88 31 34
>>> run: 88 30 34
>>> run: 88 31 34
>>> min: 88 30 34
>>>
>>> This tells me that the hand-optimization is actually harmful and the
>>> compiler does a 10% better job by itself.
>>>
>>> Have I made a mistake?
>> My guess is that you might have made two mistakes actually.
> Heh, OK.
>
>> First, the number of operations was relatively low - everything
>> in the low ms range could be due to secondary effects (and
>> still be repeatable).
> I can't think of any reason why. I ran the whole benchmark lots of
> times. Occasionally some of the times were a big chunk longer due to
> other system activity. Normally they were stable. I also ran it
> several times with 10 million ops instead of 1 million, and the results
> were exactly 10x longer with the same degree of variability.
>
OK. It is just that sometimes, process startup or caching
artifacts (especially when tests run in a particular order)
produce reproducible effects of that magnitude. But obviously
that was not the case here.
>> The most important problem would be compiler optimizations
>> or lack thereof. Since the numbers are very large (50..100
>> ticks per byte, depending on your CPU clock), I would assume
>> that you used a normal debug build for testing. In that case,
> You're right. I'll try again, with "--disable-debug -O2".
Gotcha! The optimizer's impact is massive on that kind
of code.
>> the actual number of C statements has a large impact on
>> the execution time. See my results below for details.
>>> What are the results for your system?
>>>
>>> (I'm using GCC 4.4.1 on an Intel Centrino laptop CPU.)
>>>
>> Test code used (10^10 calls, never re-allocate):
>>
>> int i;
>> unsigned char c;
>>
>> svn_stringbuf_t* s = svn_stringbuf_create_ensure (255, pool);
> OK so you're avoiding any calls to the "re-alloc". My tests included
> re-allocs, but were long enough strings that this wasn't much overhead.
> Nevertheless I'll avoid re-allocs so we can compare results fairly.
>
>> for (i = 0; i < 50000000; ++i)
>>   {
>>     s->len = 0;
>>     for (c = 0; c < 200; ++c)
>>       svn_stringbuf_appendbyte (s, c);
>>   }
>>
>>
>> XEON 55xx / Core i7, hyper-threading on, 3GHz peak
>> 64 bits Linux, GCC 4.3.3; ./configure --disable-debug
>>
>> (1) 10,7s; IPC = 2.1
>> (2) 8,11s; IPC = 1.4
>> (3) 2,64s; IPC = 2.4
>> (4) 2,43s; IPC = 2.3
Grml. After normalizing the numbers to 10^9 iterations,
I forgot to adjust the example code. Sorry!
>> (1) use appendbytes gives 9 instructions in main, 59 in appbytes
>> (2) handle count==1 directly in appendbytes; 9 inst. in main, 26 in
>> appbytes
>> (3) appendbyte0 (same compiler output as the non-handtuned code);
>> 13 inst. in appbyte, 6 in main
>> (4) trunk_at_HEAD appendbyte; 11 inst. in appbyte, 6 in main
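For reference, since the variant numbers above are a bit terse, this is
roughly what (3) and (4) look like - a sketch from memory, not the exact
committed code:

  /* (3) straightforward single-byte append ("appendbyte0" style):
     always go through the generic capacity check. */
  static void
  appendbyte0(svn_stringbuf_t *str, char byte)
  {
    svn_stringbuf_ensure(str, str->len + 2);  /* byte + terminating '\0' */
    str->data[str->len++] = byte;
    str->data[str->len] = '\0';
  }

  /* (4) hand-tuned variant: keep the common "no re-alloc needed" path
     free of any function call so it compiles to just a handful of
     instructions; fall back to the generic code otherwise. */
  static void
  appendbyte(svn_stringbuf_t *str, char byte)
  {
    if (str->blocksize > str->len + 1)
      {
        str->data[str->len++] = byte;
        str->data[str->len] = '\0';
      }
    else
      svn_stringbuf_appendbytes(str, &byte, 1);
  }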
>>
>> Core2 2.4GHz, Win32, VS2010
>> (not using valgrind to count instructions here; also not testing (2))
>>
>> (1) 17,0s release, 20,2s debug
>> (3) 10,6s release, 12,2s debug
>> (4) 9,7s release, 13,0s debug
> With a test harness more similar to yours, and built with
> "--disable-debug -O2", here are my relative numbers (but only 1 million
> outer loops, compared to your 50 million):
>
> $ svn-c-test subr string 24
> Times for 1000000 x 200 bytes, in seconds
> (1)appendbytes (3)appendbyte0 (4)trunk_at_HEAD (5)macro
> run: 7.03 2.06 1.37 0.53
> run: 6.62 1.76 1.55 0.53
> run: 6.53 1.67 1.44 0.53
> run: 6.54 1.60 1.44 0.53
> run: 6.52 1.84 1.37 0.53
> min: 6.52 1.60 1.37 0.53
>
> This agrees with what you found: the hand-tuned code gives a 10%
> improvement.
>
> This also shows another variant, writing the function as a macro,
> reported in the column "(5)macro". This gives a further two-and-a-half
> times speed-up in this test.
>
The macro variant (basically inlining) would take 2.65s for 10^9
total iterations. That is more or less what I got on Core i7 for the
non-inlined variant. My theory:
* function call & return are cheap jumps
* the code is small enough for the loop stream decoder to kick in
* your platform ABI uses stack frames and probably passes parameters
  on the stack (instead of in registers).
The latter is also the case for 32-bit Windows and gives similar
results there (9.7s vs. your 6.8s).
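Just guessing at what your macro variant looks like - something along
these lines would hand the compiler the whole fast path to inline at the
call site (a sketch only; your patch may well differ):

  #define SVN_STRINGBUF_APPENDBYTE(str, b)                  \
    do {                                                    \
      svn_stringbuf_t *s_ = (str);                          \
      if (s_->blocksize > s_->len + 1)                      \
        {                                                   \
          /* fast path: room left, no function call */      \
          s_->data[s_->len++] = (b);                        \
          s_->data[s_->len] = '\0';                         \
        }                                                   \
      else                                                  \
        /* slow path: let the function grow the buffer */   \
        svn_stringbuf_appendbyte(s_, (b));                  \
    } while (0)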
> [...]
>> * your numbers seem very large in comparison
>> -> probably something else going on
> Can you use my test code or show me yours? Mine has the harness in
> "string-test.c" and the code under test in libsvn_subr - see attachment
> - and the default build on my platform loads libsvn_subr as a shared
> library. Maybe yours is completely different.
I think our test results are consistent now. For simple
tests like that, I usually just misuse svn.exe or the like and
put the test code directly into main().
> Other observations:
>
> - Some callers that use very tight loops are in fact simply copying a
> string, up as far as some particular end marker. They may benefit from
> being re-written to first determine how much they want to copy and then
> call svn_stringbuf_appendbytes().
I will browse through the code and see where it can
be reworked easily.
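To illustrate the kind of rework I mean (the names here are made up; the
real call sites will differ): instead of a tight loop like

  while (*p && *p != separator)
    svn_stringbuf_appendbyte(buf, *p++);

first find the end marker and then append everything in a single call:

  const char *start = p;
  while (*p && *p != separator)
    ++p;
  svn_stringbuf_appendbytes(buf, start, p - start);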
> - The stringbuf type initially allocates a 1-byte space, and doubles
> this every time it re-allocates. Could we not allocate at least a
> few bytes (4? 8?) as a minimum, without actually consuming more memory
> in practice? That would avoid two or three re-allocs in some usage
> patterns.
True. Two points from my side:
* APR pools allocate memory in 8-byte chunks.
  -> allocating just one byte does not save any memory
     over allocating 8
  -> will post a patch for that
* I added an svn_stringbuf_create_empty() function that
  creates a truly empty string (zero capacity) in r985697
  and used it in r985673. That is useful when the actual required
  capacity becomes known only later - but before data is being
  added to the string. Alas, these empty strings seem to
  break some assumptions and cannot be used freely
  in all suitable places; I tried that and got errors.
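The usage pattern I have in mind is roughly this (known_size and data are
just placeholders):

  /* size not known yet -> start with a zero-capacity buffer */
  svn_stringbuf_t *s = svn_stringbuf_create_empty(pool);

  /* ... later, once the required size is known, reserve it in one go
     so the subsequent appends never re-allocate */
  svn_stringbuf_ensure(s, known_size + 1);
  svn_stringbuf_appendbytes(s, data, known_size);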
-- Stefan^2.