On 20.12.2010 02:43, Johan Corveleyn wrote:
> On Wed, Dec 15, 2010 at 10:58 AM, Stefan Fuhrmann <eqfox_at_web.de> wrote:
>> On 15.12.2010 02:30, Stefan Fuhrmann wrote:
>>> On 14.12.2010 23:35, Johan Corveleyn wrote:
>>>
>>>> Thoughts?
>>> Two things you might try:
>>> * introduce a local variable for afile[i]
>>> * replace the for() loop with two nested ones, keeping
>>> calls to other functions out of the hot spot:
>>>
>>> for (i = 0; i < file_len;)
>> That should read:
>> for (i = 0; i < file_len; i++)
>>>   {
>>>     /* hot spot: */
>>>     for (; i < file_len; i++)
>>>       {
>>>         curFile = afile[i];
>>>         if (curFile->chunk == -1)
>>>           curFile->chunk = 0;
>>>         else if (curFile->curp != curFile->endp - 1)
>>>           curFile->curp++;
>>>         else
>>>           break;
>>>       }
>>>
>>>     if (i < file_len)
>>>       {
>>>         /* the complex, rarely used stuff goes here */
>>>       }
>>>   }
> Ok, I tried this, but it didn't really help. It's still only about as
> fast as before. While the macro version is about 10% faster (I made a
> cleaner macro version, only macro'ing the hotspot stuff, with a
> function call to the complex, rarely used stuff).
>
> For the inline version, I tried several variations (always with
> APR_INLINE in the function signature, and with local variable for
> file[i]): with or without the nested for loop, with or without the
> complex stuff factored out to a separate function, ... it made no
> difference.
>
> Maybe I'm still doing something wrong, keeping the compiler from
> inlining it (but I'm not really a compiler expert, maybe I have to add
> some compiler options or something, ...). OTOH, I now have a macro
> implementation which is quite readable (and which uses the do ...
> while(0) construct - thanks Peter), and which does the trick. So maybe
> I should simply stick with that option ...
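(Side note on the do ... while(0) construct Johan mentions above: it wraps a multi-statement macro so that the expansion behaves as a single statement and composes safely with unbraced if/else. A generic sketch of the idiom only, not Johan's actual macro:

/* SWAP_INTS is made up purely to illustrate the do-while(0) idiom */
#define SWAP_INTS(a, b) \
  do                    \
    {                   \
      int tmp_ = (a);   \
      (a) = (b);        \
      (b) = tmp_;       \
    }                   \
  while (0)

Without the wrapper, "if (cond) SWAP_INTS(x, y); else ..." would not compile or would silently execute only part of the macro.)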
The key factor here is that file_len is only 2. Basically, that
will make my optimization a pessimization.

Assuming that for most calls, curp of *both* files will be
somewhere *between* BOF and EOF, unrolling the loop may help:
/* special code for the common case. 10 .. 12 ticks per execution */
static APR_INLINE svn_boolean_t
quickly_increment_pointers(struct file_info *afile[])
{
  struct file_info *file0 = afile[0];
  struct file_info *file1 = afile[1];
  if ((file0->chunk != -1) && (file1->chunk != -1))
    {
      /* assumption: curp / endp are char * in file_info */
      char *nextp0 = file0->curp + 1;
      char *nextp1 = file1->curp + 1;
      if ((nextp0 != file0->endp) && (nextp1 != file1->endp))
        {
          file0->curp = nextp0;
          file1->curp = nextp1;
          return TRUE;
        }
    }
  return FALSE;
}
/* slow code covering all special cases */
static svn_error_t *
slowly_increment_pointers(struct file_info *file[], apr_size_t file_len,
                          apr_pool_t *pool)
{
  apr_size_t i;
  for (i = 0; i < file_len; i++)
    ...
}
/* maybe as a macro, or as a small inline wrapper like this: */
static APR_INLINE svn_error_t *
increment_pointers(struct file_info *file[], apr_size_t file_len, apr_pool_t *pool)
{
  return ((file_len == 2) && quickly_increment_pointers(file))
       ? SVN_NO_ERROR
       : slowly_increment_pointers(file, file_len, pool);
}
Minor remark on C style:
> static APR_INLINE svn_error_t *
> increment_pointers(struct file_info *file[], int file_len, apr_pool_t *pool)
file_len should be apr_size_t unless it can assume negative
values, in which case it should be apr_ssize_t. That way, you
know for sure it will never overflow. int is potentially too
narrow (although extreeeemly unlikely in this case), and it is
signed (often not appropriate for an index value).
One more thing that might help speed up your code. The fact
that the calling overhead of a single & simple function turns
out to be a significant contribution to the overall runtime tells
us that the execution is tightly CPU-bound, and squeezing
some extra cycles out of it could help. Try

struct file_info file[]   instead of   struct file_info *file[]

i.e. an array of structs instead of an array of pointers. The
boring background is as follows.
file[index]->member

requires 2 cache accesses: reading the pointer from the
array (assuming that the array pointer itself and the index
value are already in some register), followed by an access
relative to that pointer:

/* assume file -> reg_0, index -> reg_1 */
(1.1) mem[reg_0 + 4 * reg_1] -> reg_2 // content of file[index], i.e. the file_info pointer
(1.2) mem[reg_2 + member_offset] -> reg_3 // content of the member element
The really bad part of this is that (1.2) depends on (1.1) and
must wait for the data to come in from the cache. On Core i7
this takes 4 ticks (on others, like PPC, it takes 2). So, while
you can issue 1 cache request per tick, the dependency
makes the whole thing take at least 5 ticks.
Using an array of structures, this becomes

file[index].member

/* assume file -> reg_0, index -> reg_1 */
(2.1) reg_1 * 32 -> reg_2 // offset of struct [index], assuming 32 == sizeof(file_info)
(2.2) mem[reg_0 + reg_2 + member_offset] -> reg_3 // content of the member element

Now this is only 2 ticks, with (2.1) often being hidden,
pre-calculated elsewhere, or entirely unnecessary (if the
index is fixed).
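To make the two layouts concrete, here is a small sketch in C; the
member names are assumptions for illustration, not the actual
file_info definition:

/* illustrative only: members are assumed, not svn's real file_info */
struct file_info { char *curp; char *endp; int chunk; };

/* array of pointers: two dependent loads, as in (1.1) and (1.2) */
static char *
curp_via_pointer_array(struct file_info *file[], int index)
{
  return file[index]->curp;
}

/* array of structs: a single load at
   file + index * sizeof(struct file_info), as in (2.1) and (2.2) */
static char *
curp_via_struct_array(struct file_info file[], int index)
{
  return file[index].curp;
}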
A final problem with the array-of-pointers variant is that
you often cannot reuse the result of (1.1), because in C
a pointer can point anywhere, e.g.:
file[0] = (file_struct*)file;
file[0]->member1 = 123; /* might (partially) modify file[0] */
file[0]->member0 = 123; /* same here */
file[0]->member2 = 123; /* and here */
The C compiler cannot be *sure* that something to this
effect has not been done somewhere. Therefore, any
write access to a "random" address invalidates all pre-
fetched information from "random" positions, where "random"
basically means "not from the stack" or similar well-traceable
locations.
file_struct *file0 = file[0];
file0->member1 = 0;
file0->member2 = 123;

is thus often faster than

file[0]->member1 = 0;
file[0]->member2 = 123;

but still much slower than

file[0].member1 = 0;
file[0].member2 = 123;
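(An aside beyond the discussion above: since C99, the restrict
qualifier is the standard way to hand that no-aliasing promise to the
compiler instead of caching pointers by hand. A minimal sketch, with
the struct and member names made up for illustration:

/* 'restrict' promises that *f is accessed only through f inside this
   function, so the compiler may keep values in registers across the
   stores instead of re-loading them. */
struct file_struct { int member0; int member1; int member2; };

static void
set_members(struct file_struct *restrict f)
{
  f->member1 = 0;
  f->member2 = 123;
}

Whether this pays off here would of course need measuring.)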
Food for thought ;)
-- Stefan^2.