
Re: [RFC] diff-optimizations-bytes branch: avoiding function call overhead (?)

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Tue, 28 Dec 2010 19:31:50 +0100

On Fri, Dec 24, 2010 at 3:40 PM, Stefan Fuhrmann <eqfox_at_web.de> wrote:
> On 20.12.2010 02:43, Johan Corveleyn wrote:
>>
>> On Wed, Dec 15, 2010 at 10:58 AM, Stefan Fuhrmann<eqfox_at_web.de>  wrote:
>>>
>>> On 15.12.2010 02:30, Stefan Fuhrmann wrote:
>>>>
>>>> On 14.12.2010 23:35, Johan Corveleyn wrote:
>>>>
>>>>> Thoughts?
>>>>
>>>> Two things you might try:
>>>> * introduce a local variable for afile[i]
>>>> * replace the for() loop with two nested ones, keeping
>>>>  calls to other functions out of the hot spot:
>>>>
>>>> for (i = 0; i < file_len;)
>>>
>>> That should read:
>>> for (i = 0; i < file_len; i++)
>>>>
>>>> {
>>>>  /* hot spot: */
>>>>  for (; i < file_len; i++)
>>>>  {
>>>>    curFile = afile[i];
>>>>    if (curFile->chunk == -1)
>>>>      curFile->chunk = 0;
>>>>    else if (curFile->curp != curFile->endp - 1)
>>>>      curFile->curp++;
>>>>    else
>>>>      break;
>>>>  }
>>>>
>>>>  if (i < file_len)
>>>>  {
>>>>    /* the complex, rarely used stuff goes here */
>>>>  }
>>>> }
>>
>> Ok, I tried this, but it didn't really help. It's still only about as
>> fast as before. While the macro version is about 10% faster (I made a
>> cleaner macro version, only macro'ing the hotspot stuff, with a
>> function call to the complex, rarely used stuff).
>>
>> For the inline version, I tried several variations (always with
>> APR_INLINE in the function signature, and with local variable for
>> file[i]): with or without the nested for loop, with or without the
>> complex stuff factored out to a separate function, ... it made no
>> difference.
>>
>> Maybe I'm still doing something wrong, keeping the compiler from
>> inlining it (but I'm not really a compiler expert, maybe I have to add
>> some compiler options or something, ...). OTOH, I now have a macro
>> implementation which is quite readable (and which uses the do ...
>> while(0) construct - thanks Peter), and which does the trick. So maybe
>> I should simply stick with that option ...
>
> The key factor here is that file_len is only 2.
> Basically, that will make my optimization a
> pessimization.
>
> Assuming that for most calls, curp of *both*
> files will be somewhere *between* BOF and
> EOF, unrolling the loop may help:
>
> #define INCREMENT_POINTERS
>
> /* special code for the common case. 10 .. 12 ticks per execution */
>
> static APR_INLINE svn_boolean_t
> quickly_increment_pointers(struct file_info *afile[])
> {
> struct file_info *file0 = afile[0];
> struct file_info *file1 = afile[1];
> if ((file0->chunk != -1) && (file1->chunk != -1))
>  {
>    /* suitable_type */ nextp0 = file0->curp + 1;
>    /* suitable_type */ nextp1 = file1->curp + 1;
>    if ((nextp0 != file0->endp) && (nextp1 != file1->endp))
>      {
>        file0->curp = nextp0;
>        file1->curp = nextp1;
>        return TRUE;
>      }
>  }
> return FALSE;
> }
>
> /* slow code covering all special cases */
>
> static svn_error_t*
> slowly_increment_pointers(struct file_info *file[], apr_size_t file_len,
> apr_pool_t *pool)
> {
>  for (i = 0; i < file_len; i++)
> ...
> }
>
> /* maybe as macro: */
> return ((file_len == 2) && quickly_increment_pointers (file))
>  ? SVN_NO_ERROR
>  : slowly_increment_pointers(file, file_len, pool);

Nice! I will try this out sometime soon, but I'm not so sure it will
help more than the current macro version (only macro for the critical
section, calling a function for increment/decrement chunk - see
diff_file.c in HEAD of diff-optimizations-bytes branch). I guess I'll
just have to test it.

With the macro implementation in place my gut feeling is that the
prefix/suffix scanning is reasonably optimal, and that the remaining
time spent in the diff code (70 seconds for my example blame of big
xml file with 2200 changes) is almost all due to the "original" diff
algorithm (working on the remainder between prefix and suffix). But to
be sure I should probably measure that first.

(I should note here that, with my example file, the prefix/suffix
scanning doesn't always work optimally, because in about half of the
revisions there is a change both in one of the first lines (where an
author name is inserted automatically by XML Spy upon saving it), and
then somewhere in the middle where the actual change is done. It's
really a pity, because that means about 20000-30000 lines cannot be
skipped as prefix.)

> Minor remark on C style:
>
>> static APR_INLINE svn_error_t *
>> increment_pointers(struct file_info *file[], int file_len, apr_pool_t
>> *pool)
>
> file_len should be apr_size_t unless it can assume negative
> values, in which case it should be apr_ssize_t. That way, you
> know for sure it will never overflow. int is potentially too short
> (although extremely unlikely in this case), and it is signed
> (often not appropriate for an index value).

Thanks. I didn't know that. I'll change it.

> One more thing that might help speed up your code. The fact
> that the calling overhead of a single & simple function turns
> out to be a significant contribution to the overall runtime tells
> us that the execution is tightly CPU-bound and squeezing
> some extra cycles out of it could help. Try
>
> struct file_info file[] instead of struct file_info *file[]
>
> i.e. array of structs instead of array of pointers. The boring
> background is as follows.
>
> file[index]->member
>
> requires 2 cache accesses: reading the pointer from the
> array (assuming that the array pointer itself and the index
> value are already in some register) followed by an access
> relative to that pointer:
>
>    /* assume file -> reg_0, index -> reg_1 */
> (1.1) mem[reg_0 + 4 * reg_1] -> reg_2  // content of file[index],
>                                        // i.e. the file_info pointer
> (1.2) mem[reg_2 + member_offset] -> reg_3  // content of member element
>
> The really bad part of this is that (1.2) depends on (1.1) and
> must wait for the data to come in from the cache. On Corei7
> this takes 4 ticks (on others like PPC it takes 2). So, while
> you can issue 1 cache request per tick, the dependency
> makes the whole thing take at least 5 ticks.
>
> Using an array of structures this becomes
>
> file[index].member
>    /* assume file -> reg_0, index -> reg_1 */
> (2.1) reg_1 * 32 -> reg_2  // offset of struct [index],
>                            // assuming sizeof(file_info) == 32
> (2.2) mem[reg_0 + reg_2 + member_offset] -> reg_3  // content of
>                                                    // member element
>
> Now this is only 2 ticks with (2.1) often being hidden, pre-
> calculated elsewhere or entirely unnecessary (if index is
> fixed).
>
> A final problem with the array of pointers is that
> you often cannot reuse the result of (1.1) because in C
> a pointer can point anywhere. e.g.:
>
> file[0] = (file_struct*)file;
> file[0]->member1 = 123; /* might (partially) modify file[0] */
> file[0]->member0 = 123; /* same here */
> file[0]->member2 = 123; /* and here */
>
> The C compiler cannot be *sure* that something to this
> effect has not been done somewhere. Therefore, any
> write access to a "random" address invalidates all pre-
> fetched information from "random" positions. "random"
> basically means "not from stack" or similar well-traceable
> locations.
>
> file_struct *file0 = file[0];
> file0->member1 = 0;
> file0->member2 = 123;
>
> Is thus often faster than
>
> file[0]->member1 = 0;
> file[0]->member2 = 123;
>
> but still much slower than
>
> file[0].member1 = 0;
> file[0].member2 = 123;
>
> Food for thought ;)

Interesting! I'll have to chew a little bit on this food :-), but
perhaps I'll experiment a little bit with it.

As I said above, I think I'll first try to find out if most of the
time is still spent in prefix/suffix scanning, or in the original diff
algorithm on the remaining non-matching section. Just so I don't spend
too much time optimizing the wrong thing :-). Maybe this kind of
optimization could be tried on the original algorithm ...

Thanks for the input.

Cheers,

-- 
Johan
Received on 2010-12-28 20:32:48 CET

This is an archived mail posted to the Subversion Dev mailing list.
