On Wed, May 5, 2010 at 6:28 PM, Bolstridge, Andrew <
andy.bolstridge_at_intergraph.com> wrote:
> > -----Original Message-----
> > From: hyrum_at_hyrumwright.org [mailto:hyrum_at_hyrumwright.org]
> > On Behalf Of Hyrum K. Wright
> > Sent: Wednesday, May 05, 2010 3:55 PM
> > To: Peter Samuelson
> > Cc: dev_at_subversion.apache.org
> > Subject: Re: [PATCH] Saving a few cycles, part 2/2
> >
> ...
> > I would agree. I saw the commit of Stefan's patch, and independently
> > thought about using a lookup table (though I prefer Peter's
> > implementation).
> > I don't have any hard performance numbers, but I gotta believe the
> > lookup table is simpler, as well as more performant.
> >
>
> It may be simpler, but generally not as fast as arithmetic. The problem
> (with modern processors) is that the table (or the entry you want) has
> to be loaded into the CPU cache and that may require a main RAM access
> that is very slow (relatively speaking) compared to a few maths
> operations.
>
> For example, you can do many instructions per clock cycle, but fetching
> 1 byte from main RAM (at 10ns delay) will take 30 clock cycles on a 3GHz
> chip (and that's a lot of maths instructions... over a thousand).
>
Completely off-topic: I'm interested in how over a thousand math
instructions can be done in only 30 clock cycles. Even on highly parallel
data (which this example is not), and using a very long pipeline (which
modern processors top out at *maybe* 25 stages), you're still talking about
over 33 ops/cycle. My microarchitecture knowledge is a bit rusty, so I'm
curious to know how this works out.
> I think it takes a lot longer than that, with the higher-latency RAM we
> use today and 3 levels of cache. But the point is that a lookup table is
> no longer necessarily faster than calculating the data. How times have
> changed :)
>
These are valid points, but I'm not sure they hold up here. For instance,
the current code has function calls in it. The call overhead might be low,
since the compiler can pass arguments in registers instead of on the stack,
but the functions most likely wouldn't already be resident in the i-cache
(at least not initially), and so would need to be fetched from memory just
the same as the lookup table would. The lookup table, at only 256 bytes,
fits in most d-caches and would stay resident for the entire loop.
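To make that concrete, here's roughly the shape I have in mind. This is
only a sketch with names I've made up, not the actual code from either
patch, and it uses C99 designated initializers for brevity:

  #include <stddef.h>

  /* A 256-byte table mapping an ASCII character to its hex value.
     Entries not listed default to 0; real code would validate the
     input separately.  The whole table spans only a few cache lines,
     so after the first few probes it stays resident in the d-cache. */
  static const unsigned char hex_value[256] = {
    ['0'] = 0,  ['1'] = 1,  ['2'] = 2,  ['3'] = 3,  ['4'] = 4,
    ['5'] = 5,  ['6'] = 6,  ['7'] = 7,  ['8'] = 8,  ['9'] = 9,
    ['a'] = 10, ['b'] = 11, ['c'] = 12, ['d'] = 13, ['e'] = 14,
    ['f'] = 15, ['A'] = 10, ['B'] = 11, ['C'] = 12, ['D'] = 13,
    ['E'] = 14, ['F'] = 15
  };

  /* Decode 2*LEN hex characters into LEN digest bytes.  Note that
     the loop body contains no branches at all. */
  static void
  decode_hex(unsigned char *digest, const char *hex, size_t len)
  {
    size_t i;
    for (i = 0; i < len; i++)
      digest[i] = (unsigned char)
        ((hex_value[(unsigned char)hex[2 * i]] << 4)
         | hex_value[(unsigned char)hex[2 * i + 1]]);
  }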
The lookup table method also eliminates branches (the ones hidden inside
the ternary operators), which could otherwise lead to pipeline stalls.
Since the probability of taking those branches is uniform (or ought to be,
with input from a cryptographic hash), the branch predictor doesn't really
have much chance of doing any good.
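The arithmetic style with the hidden branches would look something like
this (again a paraphrase, not the literal code from the tree):

  /* Up to three range checks per character; with hex digits coming
     from a cryptographic hash, the digit-vs-letter branch is taken
     essentially at random, so the predictor can't learn a pattern. */
  static int
  hex_to_value(char c)
  {
    return (c >= '0' && c <= '9') ? c - '0'
         : (c >= 'a' && c <= 'f') ? c - 'a' + 10
         : (c >= 'A' && c <= 'F') ? c - 'A' + 10
         : -1;
  }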
> Anyway, it's hardly going to be a big deal for overall performance, so
> whatever is easier to maintain is best.
+1 :)
-Hyrum