Bert Huijben wrote:
>
>> -----Original Message-----
>> From: Stefan Fuhrmann [mailto:stefanfuhrmann_at_alice-dsl.de]
>> In this patch, I eliminated calls to memcpy for small copies as they are
>> particularly expensive in the MS CRT.
>>
>
> Which CRT did you use for these measurements? (2005, 2008, 2010, Debug vs
> Release and DLL vs Static?). Which compiler version? (Standard/Express or
> Professional+). (I assume you use the normal Subversion build using .sln
> files and not the TortoiseSVN scripts? Did you use the shared library builds
> or a static build)?
>
VSTS2008 Developer Edition. Release build (am I an Amateur?!)
TSVN build scripts which set /Ox (global opt, intrinsics, omit frame
pointers, ...)
> Did you try enabling the intrinsics for this method instead of using a
> handcoded copy?
>
<mode="educational prick">
Yes, but it does not help in this case: memset will use intrinsics
only for short (<48 bytes on x86) _fixed-size_ buffers. memcpy
will likewise use intrinsics only for _fixed-size_ buffers, but
seemingly with no size limit.
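To illustrate the distinction (a minimal sketch; whether the compiler
actually expands these inline depends on the compiler version and
options such as /Oi and /O2, and the function names are mine, purely
for demonstration):

```c
#include <string.h>

/* Fixed size known at compile time: the compiler can typically
 * expand this into a handful of inline move instructions. */
void copy_fixed(char *dst, const char *src)
{
    memcpy(dst, src, 16);
}

/* Variable size: usually compiled as an actual call into the CRT,
 * since the compiler cannot pick a copy strategy at compile time. */
void copy_variable(char *dst, const char *src, size_t len)
{
    memcpy(dst, src, len);
}
```

Inspecting the generated assembly (e.g. with /FAs) shows which form
the compiler chose in a given build.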
> I'm pretty sure that the modern variants enable inlined optimized assembly
> for memcpy in release mode (and certainly if you ask for that using the
> right #pragma), but enabling profiling on the linker or advanced debugging
> options will probably disable all that. I would be extremely surprised if
> this optimized assembler is measurably slower than the CRT on other OSs as
> copying memory is a pretty elementary operation.
> (But I'm pretty certain that the debug CRT with variable usage and memory
> usage tracking is slower than the other CRT's.. But that is why we deliver
> applications in release mode)
>
The problem is that non-trivial intrinsics are hard to implement
in a compiler. Non-trivial means including dynamically optimizing
for short buffers. It's the variable length that will make it hard:
* a primitive copy loop takes 2 ticks / iteration (here: byte)
+ 1 tick setup time
* a "rep movsb" is only viable for <10 bytes and even then
slower than a primitive loop (beyond that: ~50 ticks setup)
* "rep movsw/d/q" has > 10 bytes / tick but > 15 ticks
amortized setup time.
* conditional jumps may be hard to predict and increase
the load on the branch predictor
Without feedback optimization, a compiler cannot know
whether it should optimize for short, mid-sized or long
sequences. Generic code trying to accommodate all three
cases would be very long and still perform poorly on
short or mid-sized sequences, or both.
In our case, the system will have two typical load scenarios:
* hard-to-compress data or data with high compressibility
-> long copy sequences -> use memcpy
* mildly compressible data -> many very short sequences
-> manual copy
Which scenario applies to the current run is an intrinsic
property of the file being processed, i.e. the condition
switching between them is highly predictable for most
loads.
Since the overhead of calling a __cdecl function alone
equals that of copying 3 values, we save considerable time
by using a manual byte-wise copy for short sequences.
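Sketched in C (the threshold of 8 bytes below is illustrative, not
the measured cut-over, and the helper name is mine):

```c
#include <stddef.h>
#include <string.h>

/* Copy very short runs by hand, defer to memcpy() for long ones.
 * The real threshold would have to be measured on the target CRT. */
static void copy_run(char *dst, const char *src, size_t len)
{
    if (len < 8)
    {
        /* short run: a plain loop avoids the __cdecl call overhead */
        while (len--)
            *dst++ = *src++;
    }
    else
    {
        /* long run: memcpy's setup cost amortizes over the length */
        memcpy(dst, src, len);
    }
}
```

Because the run lengths are an intrinsic property of the input file,
the `len < 8` branch stays highly predictable for most loads.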
> In my performance measurements the only case where I saw a memcpy turn up
> was in the apr pool functions. (This was while using a profiler that didn't
> touch the code (just timed measure points) and tracing on optimized release
> mode binaries)
>
I use sampling profiling only (just as you did). The problem
with the MS CRT is that the individual functions are rarely
visible in the result. Instead, they get reported cumulatively
under MSVCRT.DLL. The reason is that the named entry
point is a simple JMP instruction to the actual code.
</mode>
-- Stefan^2.
Received on 2010-04-27 01:10:41 CEST