Dmitry <wipedout_at_yandex.ru> wrote:
> >>> 2. using ptr[i] is faster anyway - according to the Intel
> >>> optimization manual (at least as I remember)
> >> I thought pointer arithmetic was faster - but I guess with today's
> >> optimizing compilers it won't make a difference.
> >I will do some tests to check it, but as I remember it comes from the
> >internal cache mechanism, so I'm not sure the compiler is able to do
> >such an optimization.
> >Also, according to that paper (I should probably reread it),
> >decrementing a pointer takes more time - because of the cache.
Maybe they assumed that the pointer itself was stored
somewhere in memory (store-forwarding & friends?).
Aside from the effect mentioned below, the cache should
be oblivious to how a certain address got generated.
> I've done a test - a locally allocated (automatic variable) buffer of
> size 2048 is processed 100 million times in a loop. The first time it
> is referenced as buffer[i]; the second time there are two pointers -
> head set to the first element and tail set to the last; tail is
> decremented along the pass.
>
> Code is compiled with VC++7, Full Optimization (/Ox); GetTickCount()
> is used to measure time.
>
> The first variant takes 157375 ms, the second one 157406 ms. They are
> almost equal - about 0.02% difference.
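For reference, I'd guess the test boils down to something like this
(a sketch only - the names and exact loop bodies are my assumptions,
not the code actually used):

    #include <windows.h>
    #include <stdio.h>

    int main()
    {
        char buffer[2048] = { 0 };
        volatile char sink = 0;   // defeats dead-code elimination

        // Variant 1: indexed access, buffer[i]
        DWORD start = GetTickCount();
        for (int n = 0; n < 100000000; ++n)
            for (int i = 0; i < 2048; ++i)
                sink += buffer[i];
        printf("index:   %lu ms\n", GetTickCount() - start);

        // Variant 2: head / tail pointers, tail decremented
        start = GetTickCount();
        for (int n = 0; n < 100000000; ++n)
        {
            const char* head = buffer;
            const char* tail = buffer + 2047;
            while (head <= tail)
                sink += *tail--;
        }
        printf("pointer: %lu ms\n", GetTickCount() - start);
        return 0;
    }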
As a rule of thumb, you can only get one memory access
and one condition / branch per cycle. Therefore,
most of these trivial loops take 2 cycles per iteration.
ALU resources usually don't limit the throughput.
That's been the case for over 15 years now ...
In other words, the only way to be even faster is to
have only one memory access and only one condition.
A 4-issue architecture like Core2 / i7 could execute
the following code at 1 clock cycle per iteration
(didn't test it, though):
    for (; p != end; ++p)
        sum += p->sub;
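For completeness, a compilable version of that loop (the struct and
function names are just placeholders):

    struct Item { int sub; };

    int sum_subs(const Item* p, const Item* end)
    {
        int sum = 0;
        for (; p != end; ++p)   // one load, one compare-and-branch
            sum += p->sub;
        return sum;
    }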
> Decrementing or incrementing the buffer index or pointer used for
> buffer access (traversing direction) should also not result in any
> significant difference. The cache has no idea of what code runs in the
> processor core - it only works with actual memory accesses. If several
> subsequent accesses occur at adjacent addresses and all of them
> correspond to one cache line, they all result in cache hits.
> Regardless of traversing direction, once a cache miss occurs the line
> is fetched and several subsequent accesses result in hits.
That is true only as long as you hit L1. For L2,
L3 and external memory accesses, you want the
hardware prefetcher to kick in. Processors of 2003
and before (e.g. early Opterons) would not detect
negative strides, i.e. there would be considerable
latency every 64 bytes.
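To illustrate (a sketch, not something I measured): both loops below
touch the same data, but the second one uses a negative stride, which
those older prefetchers did not follow.

    #include <stddef.h>

    long long sum_forward(const int* a, size_t n)
    {
        long long sum = 0;
        for (size_t i = 0; i < n; ++i)      // ascending addresses:
            sum += a[i];                    // prefetcher follows along
        return sum;
    }

    long long sum_backward(const int* a, size_t n)
    {
        long long sum = 0;
        for (size_t i = n; i-- > 0; )       // negative stride: on pre-2004
            sum += a[i];                    // cores, a full-latency stall
        return sum;                         // every 64-byte cache line
    }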
> I guess TSVN has some code that needs performance tuning. But the
> string processing routine we discuss here is not the number one
> candidate.
True. My impression is that we issue too many,
i.e. redundant, server requests "under the hood".
Local operations and most TSVN-internal logic
should be either fast enough or a Subversion
core issue.
It would be nice if we added optional tracing
to the SVN and SVNInfo classes (for a start).
For every svn_* call, the following info would
be written to OutputDebugString: function name,
thread ID, duration and maybe parameters.
A #define would enable the output.
In a tool like DbgView, we could then see the
main SVN interface calls for every user action.
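A minimal sketch of what I have in mind (class name, macro name and
output format are just a suggestion, not tested):

    #ifdef SVN_TRACE_CALLS

    #include <windows.h>
    #include <stdio.h>

    class CTraceCall
    {
    private:
        const char* m_name;
        DWORD m_start;
    public:
        CTraceCall(const char* name)
            : m_name(name), m_start(GetTickCount()) {}
        ~CTraceCall()
        {
            char buf[256];
            _snprintf(buf, sizeof(buf) - 1, "%s: thread %lu, %lu ms\n",
                      m_name, GetCurrentThreadId(),
                      GetTickCount() - m_start);
            buf[sizeof(buf) - 1] = '\0';
            OutputDebugStringA(buf);
        }
    };

    // Drop this at the top of every wrapper around an svn_* call.
    #define SVN_TRACE() CTraceCall traceCall_(__FUNCTION__)

    #else
    #define SVN_TRACE()
    #endif

GetTickCount resolution is coarse (~15 ms), so for short calls
QueryPerformanceCounter would be the better choice, but the idea
should be clear.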
-- Stefan^2.