My original idea was to give an elegant code example that would demonstrate the impact of instruction cache limitations. I wrote the following piece of code, which creates a large number of identical functions using template metaprogramming.
volatile int checksum;
void (*funcs[MAX_FUNCS])(void);

template <unsigned t>
__attribute__ ((noinline)) static void work(void) { ++checksum; }

template <unsigned t>
static void create(void) { funcs[t - 1] = &work<t - 1>; create<t - 1>(); }

template <> void create<0>(void) { }

int main()
{
    create<MAX_FUNCS>();  // populate funcs[]; assumed, the call is not shown in the post

    for (unsigned range = 1; range <= MAX_FUNCS; range *= 2)
    {
        checksum = 0;
        for (unsigned i = 0; i < WORKLOAD; ++i)
        {
            funcs[i % range]();
        }
    }
    return 0;
}
The outer loop varies the number of different functions to be called using a jump table. For each loop pass, the time taken to invoke WORKLOAD functions is then measured. Now what are the results? The following chart shows the average run time per function call in relation to the used range. The blue line shows the data measured on a Core i7 machine. The comparative measurement, depicted by the red line, was carried out on a Pentium 4 machine. Yet when it comes to interpreting these lines, I seem to be somehow struggling...
The only jumps of the piecewise constant red curve occur exactly where the total memory consumption for all functions within range exceeds the capacity of one cache level on the tested machine, which has no dedicated instruction cache. For very small ranges (below 4 in this case) however, run time still increases with the number of functions. This may be related to branch prediction efficiency, but since every function call reduces to an unconditional jump in this case, I'm not sure if there should be any branching penalty at all.
The blue curve behaves quite differently. Run time is constant for small ranges and increases logarithmically thereafter. Yet for larger ranges, the curve seems to be approaching a constant asymptote again. How exactly can the qualitative differences of both curves be explained?
I am currently using GCC MinGW Win32 x86 v.4.8.1 with g++ -std=c++11 -ftemplate-depth=65536 and no compiler optimization.
Any help would be appreciated. I am also interested in any idea on how to improve the experiment itself. Thanks in advance!

First, let me say that I really like how you've approached this problem; this is a really neat solution for intentional code bloating. However, there might still be several possible issues with your test:
You also measure the warmup time. You didn't show where you've placed your time checks, but if it's just around the internal loop, then the first time until you reach range/2 you'd still enjoy the warmup of the previous outer iteration. Instead, measure only warm performance: run each internal iteration several times (add another loop in the middle), and take the timestamp only after 1-2 rounds.
You claim to have measured several cache levels, but your L1 cache is only 32k, which is where your graph ends. Even assuming this counts in terms of "range", each function is ~21 bytes (at least on my gcc 4.8.1), so you'll reach at most 256KB, which is only then scratching the size of your L2.
You didn't specify your CPU model (i7 has at least 4 generations in the market now: Haswell, Ivy Bridge, Sandy Bridge and Nehalem). The differences are quite large, for example an additional uop cache since Sandy Bridge with complicated storage rules and conditions. Your baseline is also complicating things: if I recall correctly, the P4 had a trace cache, which might also cause all sorts of performance impacts. You should check for an option to disable them if possible.
Don't forget the TLB - even though it probably doesn't play a role here in such a tightly organized code, the number of unique 4k pages should not exceed the ITLB (128 entries), and even before that you may start having collisions if your OS did not spread the physical code pages well enough to avoid ITLB collisions.
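The warm-up point above could be sketched roughly like this. The helper name time_warm and its parameters are made up for illustration and are not part of the question's code; the idea is only to run a few untimed passes before taking the first timestamp.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (not from the question): runs `body` for
// `warmup_rounds` untimed passes, then returns the average wall-clock
// time in nanoseconds of `timed_rounds` further passes.
template <typename Body>
double time_warm(Body&& body, unsigned warmup_rounds, unsigned timed_rounds)
{
    assert(timed_rounds > 0);

    // Untimed passes: give caches, TLBs and branch predictors time to settle.
    for (unsigned r = 0; r < warmup_rounds; ++r)
        body();

    const auto start = std::chrono::steady_clock::now();
    for (unsigned r = 0; r < timed_rounds; ++r)
        body();
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / timed_rounds;
}
```

In the question's program, body would presumably be the inner loop over WORKLOAD calls for one value of range, so every range is timed only after it has already been executed once or twice.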


How to test the problem size scaling performance of code

I'm running a simple kernel which adds two streams of double-precision complex-values. I've parallelized it using OpenMP with custom scheduling: the slice_indices container contains different indices for different threads.
for (const auto& index : slice_indices)
{
    auto* tens1_data_stream = tens1.get_slice_data(index);
    const auto* tens2_data_stream = tens2.get_slice_data(index);
    #pragma omp simd safelen(8)
    for (auto d_index = std::size_t{}; d_index < tens1.get_slice_size(); ++d_index)
    {
        tens1_data_stream[d_index].real += tens2_data_stream[d_index].real;
        tens1_data_stream[d_index].imag += tens2_data_stream[d_index].imag;
    }
}
The target computer has an Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz with 24 cores, L1 cache 32kB, L2 cache 1MB and L3 cache 33MB. The total memory bandwidth is 115GB/s.
The following is how my code scales with problem size S = N x N x N.
Can anybody tell me with the information I've provided if:
it's scaling well, and/or
how I could go about finding out if it's utilizing all the resources which are available to it?
Thanks in advance.
Now I've plotted the performance in GFLOP/s with 24 cores and 48 cores (two NUMA nodes, the same processor). It appears so:
And now the strong and weak scaling plots:
Note: I've measured the bandwidth and it turns out to be 105 GB/s.
Question: The meaning of the weird peak at 6 threads/problem size 90x90x90x16 B in the weak scaling plot is not obvious to me. Can anybody clear this up?
Your graph has roughly the right shape: tiny arrays should fit in the L1 cache, and therefore get very high performance. Arrays of a megabyte or so fit in L2 and get lower performance, beyond that you should stream from memory and get low performance. So the relation between problem size and runtime should indeed get steeper with increasing size. However, the resulting graph (btw, ops/sec is more common than mere runtime) should have a stepwise structure as you hit successive cache boundaries. I'd say you don't have enough data points to demonstrate this.
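To decide where extra data points are worth adding, one can estimate at which N the working set crosses each cache boundary. The sketch below is only an estimate under stated assumptions: it takes the cache sizes quoted in the question as binary units, counts two streams of 16-byte complex values, and looks at a single thread that owns the whole problem. All names are invented for the example.

```cpp
#include <cassert>
#include <cstddef>

// Cache capacities quoted in the question, read as binary units (assumption).
constexpr std::size_t kL1Bytes = 32u * 1024u;
constexpr std::size_t kL2Bytes = 1024u * 1024u;
constexpr std::size_t kL3Bytes = 33u * 1024u * 1024u;

enum class Level { L1, L2, L3, Memory };

// Bytes touched by the kernel for S = N*N*N elements:
// two streams of 16-byte double-precision complex values.
constexpr std::size_t working_set_bytes(std::size_t n)
{
    return 2u * 16u * n * n * n;
}

// Smallest level that could hold the whole working set of one thread.
inline Level smallest_fitting_level(std::size_t n)
{
    assert(n > 0);
    const std::size_t bytes = working_set_bytes(n);
    if (bytes <= kL1Bytes) return Level::L1;
    if (bytes <= kL2Bytes) return Level::L2;
    if (bytes <= kL3Bytes) return Level::L3;
    return Level::Memory;
}
```

Under these assumptions the steps would be expected just above N = 10, N = 32 and N = 102, so sampling densely around those sizes should make a stepwise structure visible if it exists. With several threads each core has its own L1 and L2, so the per-thread share of the data is what matters there.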
Also, typically you would repeat each "experiment" several times to 1. even out statistical hiccups and 2. make sure that data is indeed in cache.
Since you tagged this "openmp" you should also explore taking a given array size, and varying the core count. You should then get a more or less linear increase in performance, until the processor does not have enough bandwidth to sustain all the cores.
A commenter brought up the concepts of strong/weak scaling. Strong scaling means: given a certain problem size, use more and more cores. That should give you increasing performance, but with diminishing returns as overhead starts to dominate. Weak scaling means: keep the problem size per process/thread/whatever constant, and increase the number of processing elements. That should give you almost linear increasing performance, until -- as I indicated -- you run out of bandwidth. What you seem to do is actually neither of these: you're doing "optimistic scaling": increase the problem size, with everything else constant. That should give you better and better performance, except for cache effects as I indicated.
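The two notions can be turned into numbers with the usual efficiency definitions. This is a small sketch using the common textbook formulas; the function names are invented here, t1 is the run time on one thread and tp the run time on p threads.

```cpp
#include <cassert>

// Strong scaling: the total problem size is fixed. The ideal run time on
// p threads is t1 / p, so efficiency = t1 / (p * tp).
inline double strong_scaling_efficiency(double t1, double tp, unsigned p)
{
    assert(t1 > 0.0 && tp > 0.0 && p > 0);
    return t1 / (p * tp);
}

// Weak scaling: the problem size per thread is fixed. The ideal run time
// stays at t1 regardless of p, so efficiency = t1 / tp.
inline double weak_scaling_efficiency(double t1, double tp)
{
    assert(t1 > 0.0 && tp > 0.0);
    return t1 / tp;
}
```

An efficiency close to 1 means the extra cores are used well; for a bandwidth-bound kernel like this one, both values would be expected to drop once memory bandwidth is saturated.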
So if you want to say "this code scales" you have to decide under what scenario. For what it's worth, your figure of 200Gb/sec is plausible. It depends on details of your architecture, but for a fairly recent Intel node that sounds reasonable.

Time measurement repeatedly makes mistake in specific places

I need to write a program which measures the performance of certain data structures, but I can't get reliable results. For example, when I measured performance 8 times for the same size of structure, every other result was different (for example: 15ms, 9ms, 15ms, 9ms, 15ms, ...), although the measurements weren't dependent on each other (for every measurement I generated new data). I tried to extract the problem and here is what I have:
while (true) {
    auto start = high_resolution_clock::now();
    for (int j = 0; j < 500; j++) {
        /* loop body not shown */
    }
    auto end = high_resolution_clock::now();
    cout << duration<double, milli>(end - start).count() << " ";
}
When I run this code, the time measured in the first pass of the loop is significantly higher than in the following passes. It's always higher in the first pass, but from time to time also in other measurements.
Example output: 0.006842 0.002566 0.002566 0.002138 0.002993 0.002138 0.002139 ...
And that's the behaviour every time I start the program.
Here are some things I tried:
It does matter if I compile Release or Debug version. Measurements are still faulty but in different places.
I turned off code optimization.
I tried using different clocks.
And what I think is quite important - While my Add function wasn't empty, the problem depended on data size. For example program worked well for most data sizes but let's say for element count of 7500 measurements were drastically different.
I just deleted part of the code after the segment I posted here. And guess what, the first measurement is no longer faulty. I have no idea what's happening here.
I would be glad if someone explained to me what the possible cause of all this could be.
In that code, it's likely that you're just seeing the effect of the instruction cache or the micro-op cache. The first time the test is run, more instructions have to be fetched and decoded; on subsequent runs the results of that are available in the caches. As for the alternating times you were seeing on some other code, that could be fluctuations in the branch prediction buffer, or something else entirely.
There are too many complex processes involved in execution on modern CPUs to expect a normal sequence of instructions to execute in a fixed amount of time. While it's possible to measure or at least account for these externalities when looking at individual instructions, for nontrivial code you basically have to accept empirical measurements including their variance.
Depending on what kind of operating system you're on, for durations this short, the scheduler can cause huge differences. If your thread is preempted, then you have the idle duration in your time. There are also many things that happen that you don't see: caches, pages, allocation. Modern systems are complex.
You're better off making the whole benchmark bigger, and then doing multiple runs on each thing you're testing, and then using something like ministat from FreeBSD to compare the runs of the same test, and then compare the ministat for the different things you're comparing.
To do this effectively, your benchmark should try to use the same amount of memory as the real workload, so that memory access is a part of the benchmark.
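If ministat isn't at hand, a rough stand-in for the per-test summary is easy to write. The sketch below is not ministat and does not do its significance test; it only computes the kind of figures (min, max, median, mean, standard deviation) that make runs comparable. The Summary type and summarize function are names invented for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Summary { double min, max, median, mean, stddev; };

// Summarises the timings of repeated runs of one test.
inline Summary summarize(std::vector<double> samples)
{
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();

    double sum = 0.0;
    for (double s : samples) sum += s;
    const double mean = sum / n;

    double squared = 0.0;
    for (double s : samples) squared += (s - mean) * (s - mean);
    // Sample standard deviation (n - 1 in the denominator); 0 for one sample.
    const double stddev = n > 1 ? std::sqrt(squared / (n - 1)) : 0.0;

    const double median = (n % 2 == 1)
        ? samples[n / 2]
        : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);

    return Summary{samples.front(), samples.back(), median, mean, stddev};
}
```

For alternating timings like the 15ms/9ms pattern from the question, a large standard deviation relative to the mean is the hint that a single measurement should not be trusted.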

First method call takes 10 times longer than consecutive calls with the same data

I am performing some execution time benchmarks for my implementation of quicksort. Out of 100 successive measurements on exactly the same input data it seems like the first call to quicksort takes roughly 10 times longer than all consecutive calls. Is this a consequence of the operating system getting ready to execute the program, or is there some other explanation? Moreover, is it reasonable to discard the first measurement when computing an average runtime?
The bar chart below illustrates execution time (milliseconds) versus method call number. Each time the method is called it processes the exact same data.
To produce this particular graph the main method makes a call to quicksort_timer::time_fpi_quicksort(5, 100) whose implementation can be seen below.
static void time_fpi_quicksort(int size, int runs)
{
    std::vector<int> vector(size);
    for (int i = 0; i < runs; i++)
    {
        vector = utilities::getRandomIntVectorWithConstantSeed(size);
        Timer timer;
        quicksort(vector, ver::FixedPivotInsertion);
    }
}
The getRandomIntVectorWithConstantSeed is implemented as follows:
std::vector<int> getRandomIntVectorWithConstantSeed(int size)
{
    std::vector<int> vector(size);
    for (int i = 0; i < size; i++)
    {
        vector[i] = rand();
    }
    return vector;
}
CPU and Compilation
CPU: Broadwell 2.7 GHz Intel Core i5 (5257U)
Compiler Version: Apple LLVM version 10.0.0 (clang-1000.11.45.5)
Compiler Options: -std=c++17 -O2 -march=native
Yes, it could be a page fault on the page holding the code for the sort function (and the timing code itself). The 10x could also include ramp-up to max turbo clock speed.
Caching is not plausible, though: you're writing the (tiny) array outside the timed region, unless the compiler somehow reordered the init with the constructor of your Timer. Memory allocation being much slower the first time would easily explain it, maybe having to make a system call to get a new page the first time, but later calls to new (to construct std::vector) just grabbing already-hot-in-cache memory from the free list.
Training the branch predictors could also be a big factor, but you'd expect it to take more than 1 run before the TAGE branch predictors in a modern Intel CPU, or the perceptron predictors in a modern AMD, "learned" the full pattern of all the branching. But maybe they get close after the first run.
Note that you produce the same random array every time, by using srand() on every call. To test if branch prediction is the explanation, remove the srand so you get different arrays every time, and see if the time stays much higher.
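One way to run that experiment is to make the seed a parameter, so the same generator can produce either identical or different arrays per run. This sketch swaps rand()/srand() for the standard <random> facilities; make_random_vector is a hypothetical replacement, not the function from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical replacement for the question's generator: the caller
// chooses the seed instead of the function fixing it internally.
inline std::vector<int> make_random_vector(int size, unsigned seed)
{
    assert(size >= 0);
    std::mt19937 engine(seed);
    std::uniform_int_distribution<int> dist(0, 1 << 20);

    std::vector<int> values(static_cast<std::size_t>(size));
    for (int& v : values)
        v = dist(engine);
    return values;
}
```

Passing the same seed on every run reproduces the current setup; passing the run index as the seed gives a fresh array each time. If the later runs stay slow with fresh arrays, branch predictor training on the repeated input becomes a more plausible explanation.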
What CPU, compiler version / options, etc. are you using?
It is probably because of caching, as the memory needs to be fetched from DRAM and allocated in the CPU's data cache the first time. That takes (much) more latency than loads that hit in the CPU's cache.
Then, because the instructions come from the same memory source on every call, they follow the same branches in the pipeline, and nothing needs to be invalidated because it is the same pointer.
It would be interesting to implement 4 methods with more or less the same functionality and then swap between them to see what happens.

Linux C++: how to profile time wasted due to cache misses?

I know that I can use gprof to benchmark my code.
However, I have this problem -- I have a smart pointer that has an extra level of indirection (think of it as a proxy object).
As a result, I have this extra layer that affects pretty much all functions, and screws with caching.
Is there a way to measure the time my CPU wastes due to cache misses?
You could try cachegrind and its front-end kcachegrind.
Linux supports this with perf from 2.6.31 on. This allows you to do the following:
compile your code with -g to have debug information included
run your code e.g. using the last level cache misses counters: perf record -e LLC-loads,LLC-load-misses yourExecutable
run perf report
after acknowledging the initial message, select the LLC-load-misses line,
then e.g. the first function and
then annotate. You should see the lines (in assembly code, surrounded by the original source code) and, for the lines where cache misses occurred, a number indicating their fraction of the last level cache misses.
You could find a tool that accesses the CPU performance counters. There is probably a register in each core that counts L1, L2, etc misses. Alternately Cachegrind performs a cycle-by-cycle simulation.
However, I don't think that would be insightful. Your proxy objects are presumably modified by their own methods. A conventional profiler will tell you how much time those methods are taking. No profile tool would tell you how performance would improve without that source of cache pollution. That's a matter of reducing the size and structure of the program's working set, which isn't easy to extrapolate.
A quick Google search turned up boost::intrusive_ptr which might interest you. It doesn't appear to support something like weak_ptr, but converting your program might be trivial, and then you would know for sure the cost of the non-intrusive ref counts.
Continuing along the lines of @Mike_Dunlavey's answer:
First, obtain a time based profile, using your favorite tool: VTune or PTU or OProf.
Then, obtain a cache miss profile. L1 cache misses, or L2 cache misses, or ...
I.e. the first profile associates a "time spent" with each program counter.
The second associates a "number of cache misses" value with each program counter.
Note: I often "reduce" the data, summing it up by function, or (if I have the technology) by loop. Or by bins of, say, 64 bytes. Comparing individual program counters is often not useful, because the performance counters are fuzzy - the place where you see a cache miss get reported is often several instructions different from where it actually happened.
OK, so now graph these two profiles to compare them. Here are some graphs that I find useful:
"Iceberg" charts: X axis is PC, positive Y axis is time, negative Y axis is cache misses. Look for places that go both up and down.
("Interleaved" charts are also useful: same idea, X axis is PC, plot both time and cache misses on Y axis, but with narrow vertical lines of different colors, typically red and blue. Places where a lot of both time and cache misses are spent will have finely interleaved red and blue lines, almost looking purple. This extends to L2 and L3 cache misses, all on the same graph. By the way, you probably want to "normalize" the numbers, either to %age of total time or cache misses, or, even better, %age of the maximum data point of time or cache misses. If you get the scale wrong, you won't see anything.)
XY charts: for each sampling bin (PC, or function, or loop, or...) plot a point whose X coordinate is the normalized time, and whose Y coordinate is the normalized cache misses. If you get a lot of data points in the upper right hand corner - large %age time AND large %age cache misses - that is interesting evidence. Or, forget number of points - if the sum of all percentages in the upper corner is big...
Note, unfortunately, that you often have to roll these analyses yourself. Last I checked VTune does not do it for you. I have used gnuplot and Excel. (Warning: Excel dies above 64 thousand data points.)
More advice:
If your smart pointer is inlined, you may get the counts all over the place. In an ideal world you would be able to trace back PCs to the original line of source code. In this case, you may want to defer the reduction a bit: look at all individual PCs; map them back to lines of source code; and then map those into the original function. Many compilers, e.g. GCC, have symbol table options that allow you to do this.
By the way, I suspect that your problem is NOT with the smart pointer causing cache thrashing. Unless you are doing smart_ptr<int> all over the place. If you are doing smart_ptr<Obj>, and sizeof(Obj) + is greater than say, 4*sizeof(Obj*) (and if the smart_ptr itself is not huge), then it is not that much.
More likely it is the extra level of indirection that the smart pointer does that is causing your problem.
Coincidentally, I was talking to a guy at lunch who had a reference counted smart pointer that was using a handle, i.e. a level of indirection, something like
template<typename T> class refcnt_handle;

template<typename T> class refcntptr {
    refcnt_handle<T>* handle;
public:
    refcntptr(T* obj) {
        this->handle = new refcnt_handle<T>();
        this->handle->ptr = obj;
        this->handle->count = 1;
    }
};

template<typename T> class refcnt_handle {
    T* ptr;
    int count;
    friend class refcntptr<T>;
};
(I wouldn't code it this way, but it serves for exposition.)
The double indirection this->handle->ptr can be a big performance problem. Or even a triple indirection, this->handle->ptr->field. At the least, on a machine with 5 cycle L1 cache hits, each this->handle->ptr->field would take 10 cycles. And be much harder to overlap than a single pointer chase. But, worse, if each is an L1 cache miss, even if it were only 20 cycles to the L2... well, it is much harder to hide 2*20=40 cycles of cache miss latency, than a single L1 miss.
In general, it is good advice to avoid levels of indirection in smart pointers. Instead of pointing to a handle, that all smart pointers point to, which itself points to the object, you might make the smart pointer bigger by having it point to the object as well as the handle. (Which then is no longer what is commonly called a handle, but is more like an info object.)
template<typename T> class refcnt_info;

template<typename T> class refcntptr {
    refcnt_info<T>* info;
    T* ptr;
public:
    refcntptr(T* obj) {
        this->ptr = obj;
        this->info = new refcnt_info<T>();
        this->info->count = 1;
    }
};

template<typename T> class refcnt_info {
    T* ptr; // perhaps not necessary, but useful.
    int count;
    friend class refcntptr<T>;
};
Anyway - a time profile is your best friend.
Oh, yeah - Intel EMON hardware can also tell you how many cycles you waited at a PC. That can distinguish a large number of L1 misses from a small number of L2 misses.
It depends on what OS and CPU you are using. E.g. for Mac OS X and x86 or ppc, Shark will do cache miss profiling. Ditto for Zoom on Linux.
If you're running an AMD processor, you can get CodeAnalyst, apparently free as in beer.
My advice would be to use PTU (Performance Tuning Utility) from Intel.
This utility is the direct descendant of VTune and provides the best sampling profiler available. You'll be able to track where the CPU is spending or wasting time (with the help of the available hardware events), and this with no slowdown of your application or perturbation of the profile.
And of course you'll be able to gather all cache line misses events you are looking for.
Another tool for CPU performance counter-based profiling is oprofile. You can view its results using kcachegrind.
Here's kind of a general answer.
For example, if your program is spending, say, 50% of its time in cache misses, then 50% of the time when you pause it the program counter will be at the exact locations where it is waiting for the memory fetches that are causing the cache misses.

C++ 'small' optimization behaving strangely

I'm trying to optimize 'in the small' on a project of mine.
There's a series of array accesses that are individually tiny, but profiling has revealed that these array accesses are where the vast majority of my program is spending its time. So, time to make things faster, since the program takes about an hour to run.
I've moved the following type of access:
const float theValOld1 = image(x, y, z);
const float theValOld2 = image(x, y+1, z);
const float theValOld3 = image(x, y-1, z);
const float theValOld4 = image(x-1, y, z);
etc, for 28 accesses around the current pixel.
where image thunks down to
float image(const int x, const int y, const int z) const {
    return data[z*xsize*ysize + y*xsize + x];
}
and I've replaced it with
const int yindex = y*xsize;
const int zindex = z*xsize*ysize;
const float* thePtr = &(data[z*xsize*ysize + y*xsize + x]);
const float theVal1 = *(thePtr);
const float theVal2 = *(thePtr + yindex);
const float theVal3 = *(thePtr - yindex);
const float theVal4 = *(thePtr - 1);
etc, for the same number of operations.
I would expect that, if the compiler were totally awesome, this change would do nothing to the speed. If the compiler is not awesome, then I'd say that the second version should be faster, if only because I'm avoiding the implicit pointer addition that comes with the [] thunk, as well as removing the multiplications for the y and z indices.
To make it even more lopsided, I've moved the z operations into their own section that only gets hit if zindex != 0, so effectively, the second version only has 9 accesses. So by that metric, the second version should definitely be faster.
To measure performance, I'm using QueryPerformanceCounter.
What's odd to me is that the order of operations matters!
If I leave the operations as described and compare the timings (as well as the results, to make sure that the same value is calculated after optimization), then the older code takes about 45 ticks per pixel and the new code takes 10 ticks per pixel. If I reverse the operations, then the old code takes about 14 ticks per pixel and the new code takes about 30 ticks per pixel (with lots of noise in there, these are averages over about 100 pixels).
Why should the order matter? Is there caching or something happening? The variables are all named different things, so I wouldn't think that would matter. If there is some caching happening, is there any way I can take advantage of it from pixel to pixel?
Corollary: To compare speed, I'm supposing that the right way is to run the two versions independently of one another, and then compare the results from different runs. I'd like to have the two comparisons next to each other make sense, but there's obviously something happening here that prevents that. Is there a way to salvage this side-by-side run to get a reasonable speed comparison from a single run, so I can make sure that the results are identical as well (easily)?
EDIT: To clarify.
I have both new and old code in the same function, so I can make sure that the results are identical.
If I run old code and then new code, new code runs faster than old.
If I run new code and then old code, old code runs faster than new.
The z hit is required by the math, and the if statement cannot be removed, and is present in both. For the new code, I've just moved more z-specific code into the z section, and the test code I'm using is 100% 2D. When I move to 3D testing, then I'm sure I'll see more of the effect of branching.
You may (possibly) be running into some sort of readahead or cacheline boundary issue. Generally speaking, when you load a single value and it isn't "hot" (in cache), the CPU will pull in a cache line (32, 64, or 128 bytes are pretty typical, depending on processor). Subsequent reads to the same line will be much faster.
If you change the order of operations, you may just be seeing stalls due to how the lines are being loaded and evicted.
The best way to figure something like this out is to open "Disassembly" view and spend some quality time with your processor's reference manual.
If you're lucky, the changes that the code reordering causes will be obvious (the compiler may be generating extra instructions or branches). Less lucky, it will be a stall somewhere in the processor -- during the decode pipeline or due to a memory fetch...
A good profiler that can count stalls and cache misses may help here too (AMD has CodeAnalyst, for example).
If you're not under a time crunch, it's really worthwhile to dive into the disasm -- at the very least, you'll probably end up learning something you didn't know before about how your CPU, machine architecture, compiler, libraries, etc work. (I almost always end up going "huh" when studying disasm.)
If both the new and old versions run on the same data array, then yes, the last run will almost certainly get a speed bump due to caching. Even if the code is different, it'll be accessing data that was already touched by the previous version, so depending on data size, it might be in L1 cache, will probably be in L2 cache, and if a L3 cache exists, almost certainly in that. There'll probably also be some overlap in the code, meaning that the instruction cache will further boost performance of the second version.
A common way to benchmark is to run the algorithm once first, without timing it, simply to ensure that whatever is going to be cached is cached, and then run it again a large number of times with timing enabled. (Don't trust a single execution, unless it takes at least a second or two. Otherwise small variations in system load, cache, OS interrupts or page faults can cause the measured time to vary.) To eliminate the noise, measure the combined time taken for several runs of the algorithm, and obviously with no output in between. The fact that you're seeing spikes of 3x the usual time means that you're measuring at a way too fine-grained level, which basically makes your timings useless.
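For a side-by-side comparison of the old and new code in one program, that advice might look like the sketch below. The names time_runs and compare_batches are invented for this example; each version is run once untimed, then timed in batches, and the order within a batch alternates so that neither version always benefits from data the other one just touched.

```cpp
#include <cassert>
#include <chrono>

struct Comparison { double seconds_a; double seconds_b; };

// Wall-clock time in seconds for `runs` back-to-back calls of `f`.
template <typename F>
double time_runs(F& f, unsigned runs)
{
    const auto start = std::chrono::steady_clock::now();
    for (unsigned r = 0; r < runs; ++r)
        f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Accumulates the total time of two callables over several batches.
template <typename A, typename B>
Comparison compare_batches(A a, B b, unsigned batches, unsigned runs_per_batch)
{
    assert(batches > 0 && runs_per_batch > 0);

    // One untimed run each, so neither version pays the cold-start cost.
    a();
    b();

    Comparison total{0.0, 0.0};
    for (unsigned batch = 0; batch < batches; ++batch) {
        if (batch % 2 == 0) {           // a first in even batches
            total.seconds_a += time_runs(a, runs_per_batch);
            total.seconds_b += time_runs(b, runs_per_batch);
        } else {                        // b first in odd batches
            total.seconds_b += time_runs(b, runs_per_batch);
            total.seconds_a += time_runs(a, runs_per_batch);
        }
    }
    return total;
}
```

Each callable would process the whole image (or a large block of pixels) rather than a single pixel, so that one timed batch lasts long enough for the timer resolution and system noise not to dominate. The results of both versions can still be compared for equality outside the timed sections.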
"Why should the order matter? Is there caching or something happening? The variables are all named different things, so I wouldn't think that would matter. If there is some caching happening, is there any way I can take advantage of it from pixel to pixel?"
The naming doesn't matter. When the code is compiled, variables are translated into memory addresses or register IDs. But when you run through your image array, you're loading it all into CPU cache, so it can be read faster the next time you run through it.
And yes, you can and should take advantage of it.
The computer tries very hard to exploit spatial and temporal locality -- that is, if you access a memory address X at time T, it assumes that you're going to need address X+1 very soon (spatial locality), and that you'll probably also need X again, at time T+1 (temporal locality). It tries to speed up those cases in every way possible (primarily by caching), so you should try to exploit it.
"To make it even more lopsided, I've moved the z operations into their own section that only gets hit if zindex != 0, so effectively, the second version only has 9 accesses. So by that metric, the second version should definitely be faster."
I don't know where you placed that if statement, but if it's in a frequently evaluated block of code, the cost of the branch might hurt you more than you're saving. Branches can be expensive, and they inhibit the compiler's and CPU's ability to reorder and schedule instructions. So you may be better off without it. You should probably do this as a separate optimization that can be benchmarked in isolation.
I don't know which algorithm you're implementing, but I'm guessing you need to do this for every pixel?
If so, you should try to cache your lookups. Once you've got image(x, y, z), that'll be the next pixel's image(x-1, y, z), so cache it in the loop so the next pixel won't have to look it up from scratch. That would potentially allow you to reduce your 9 accesses in the X/Y plane down to three (use 3 cached values from the last iteration, 3 from the one before it, and 3 we just loaded in this iteration).
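A sketch of that sliding window, restricted to one row of a 2D image and using a plain sum as a placeholder for whatever the real per-pixel computation is. The function name row_sums_3x3 and the row-major layout with explicit xsize and ysize are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// For every interior pixel of row y, sums its 3x3 neighbourhood while
// loading only the three pixels of the newly entered column per step.
inline std::vector<float> row_sums_3x3(const std::vector<float>& data,
                                       std::size_t xsize, std::size_t ysize,
                                       std::size_t y)
{
    assert(xsize >= 3 && y >= 1 && y + 1 < ysize);
    assert(data.size() == xsize * ysize);

    const float* rows[3] = { &data[(y - 1) * xsize],
                             &data[y * xsize],
                             &data[(y + 1) * xsize] };

    // window[r][c] holds the 3x3 neighbourhood; pre-load columns 0 and 1
    // into slots 1 and 2 so that the first shift moves them into place.
    float window[3][3];
    for (int r = 0; r < 3; ++r) {
        window[r][0] = 0.0f;
        window[r][1] = rows[r][0];
        window[r][2] = rows[r][1];
    }

    std::vector<float> sums(xsize, 0.0f);
    for (std::size_t x = 1; x + 1 < xsize; ++x) {
        float sum = 0.0f;
        for (int r = 0; r < 3; ++r) {
            window[r][0] = window[r][1];      // cached two steps ago
            window[r][1] = window[r][2];      // cached in the last step
            window[r][2] = rows[r][x + 1];    // the only new load
            sum += window[r][0] + window[r][1] + window[r][2];
        }
        sums[x] = sum;
    }
    return sums;
}
```

Whether this beats the straightforward nine loads depends on the compiler and on how much of the neighbourhood already sits in L1, so it is worth benchmarking rather than assuming.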
If you're updating the value of each pixel as a result of its neighbors' values, a better approach may be to run the algorithm in a checkerboard pattern. Update every other pixel in the first iteration, using only values from their neighbors (which you're not updating), and then run a second pass where you update the pixels you read from before, based on the values of the pixels you updated before. This allows you to eliminate dependencies between neighboring pixels, so their evaluation can be pipelined and parallelized efficiently.
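As an illustration of the checkerboard idea, here is a minimal 2D sketch. The update rule (average of the four direct neighbours) is only a placeholder, since the actual algorithm isn't shown, and checkerboard_sweep is a made-up name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One checkerboard sweep over the interior of a row-major 2D image.
// Pass 0 updates pixels with (x + y) even, pass 1 those with (x + y) odd.
// All four neighbours of a pixel have the other colour, so within one
// pass no pixel reads a value written in that same pass.
inline void checkerboard_sweep(std::vector<float>& data,
                               std::size_t xsize, std::size_t ysize)
{
    assert(data.size() == xsize * ysize);
    for (std::size_t colour = 0; colour < 2; ++colour) {
        for (std::size_t y = 1; y + 1 < ysize; ++y) {
            // First interior x whose (x + y) parity equals `colour`.
            const std::size_t first = 1 + ((1 + y + colour) % 2);
            for (std::size_t x = first; x + 1 < xsize; x += 2) {
                const std::size_t i = y * xsize + x;
                data[i] = 0.25f * (data[i - xsize] + data[i + xsize] +
                                   data[i - 1] + data[i + 1]);
            }
        }
    }
}
```

Because the pixels of one colour are independent of each other, the inner loops of a single pass could be vectorised or split across threads without changing the result.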
In the loop that performs all the lookups, unroll it a few times, and try to place all the memory reads at the top, and all the computations further down, to give the CPU a chance to overlap the two (since data reads are a lot slower, get them started, and while they're running, the CPU will try to find other instructions it can evaluate).
For any constant values, try to precompute them as much as possible (rather than z*xsize*ysize, precompute xsize*ysize, and multiply z with the result of that).
Another thing that may help is to prefer local variables over globals or class members. You may gain something simply by, at the start of the function, making local copies of the class members you're going to need. The compiler can always optimize the extra variables out again if it wants to, but you make it clear that it shouldn't worry about underlying changes to the object state (which might otherwise force it to reload the members every time you access them).
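Both suggestions combined might look like this. The Image struct is a hypothetical stand-in for the class in the question, and slice_sum is just a simple loop to hang the idea on.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the question's image class (row-major storage).
struct Image {
    std::vector<float> data;
    std::size_t xsize, ysize, zsize;

    // Sums all pixels of one z-slice.
    float slice_sum(std::size_t z) const
    {
        assert(z < zsize);

        // Local copies of the members used in the loop.
        const std::size_t nx = xsize;
        const std::size_t ny = ysize;

        // Precomputed once instead of z*xsize*ysize per access.
        const std::size_t slice_stride = nx * ny;
        const float* slice = data.data() + z * slice_stride;

        float sum = 0.0f;
        for (std::size_t y = 0; y < ny; ++y) {
            const float* row = slice + y * nx;   // y*xsize once per row
            for (std::size_t x = 0; x < nx; ++x)
                sum += row[x];
        }
        return sum;
    }
};
```

A good optimizer may well do this hoisting on its own here; the explicit form mainly helps when the loop also writes through pointers that the compiler cannot prove are unrelated to the members.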
And finally, study the generated assembly in detail. See where it's performing unnecessary store/loads, where operations are being repeated even though they could be cached, and where the ordering of instructions is inefficient, or where the compiler fails to inline as much as you'd hoped.
I honestly wouldn't expect your changes to the lookup function to have much effect though. An array access with the operator[] is easily convertible to the equivalent pointer arithmetic, and the compiler can optimize that pretty efficiently, as long as the offsets you're adding don't change.
Usually, the key to low-level optimizations is, somewhat ironically, not to look at individual lines of code, but at whole functions, and at loops. You need a certain amount of instructions in a block so you have something to work with, since a lot of optimizations deal with breaking dependencies between chains of instructions, reordering to hide instruction latency, and with caching individual values to avoid memory load/stores. That's almost impossible to do on individual array lookups, but there's almost certainly a lot gained if you consider a couple of pixels at a time.
Of course, as with almost all microoptimizations, there are no always true answers. Some of the above might be useful to you, or they might not.
If you tell us more about the access pattern (which pixels are you accessing, is there any required order, and are you just reading, or writing as well? If writing, when and where are the updated values used?), we'll be able to offer much more specific (and likely to be effective) suggestions.
When optimising, examining the data access pattern is essential.
for example:
assuming a width of 240
for a pixel at <x,y,z> 10,10,0
with original access pattern would give you:
a. data[0+ 10*240 + 10] -> data[2410]
b. data[0+ 11*240 + 10] -> data[2650]
c. data[0+ 9*240 + 10] -> data[2170]
d. data[0+ 10*240 + 9] -> data[2409]
Notice the indices which are in arbitrary order.
The memory controller makes aligned accesses to the main memory to fill the cache lines.
If you order your operations so that accesses are to ascending memory addresses
(e.g. c,d,a,b) then the memory controller would be able to stream the data into
the cache lines.
A cache miss on a read is expensive, as the request has to travel down the cache
hierarchy to the main memory. Main memory access could be 100x slower than
cache. Minimising main memory access will improve the speed of your operation.
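The index arithmetic from the example can be checked, and the ascending order derived, with a few lines of code. The helper names are invented for this sketch, and the height of 240 used in the usage note is an assumption, since only the width is given above (with z = 0 it does not affect the result).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Flat index into a row-major x/y/z volume, as in the question's image().
inline std::size_t flat_index(std::size_t x, std::size_t y, std::size_t z,
                              std::size_t xsize, std::size_t ysize)
{
    return z * xsize * ysize + y * xsize + x;
}

// The four accesses from the example (centre, y+1, y-1, x-1),
// returned in ascending address order.
inline std::vector<std::size_t> neighbour_indices_ascending(
    std::size_t x, std::size_t y, std::size_t z,
    std::size_t xsize, std::size_t ysize)
{
    assert(x >= 1 && x < xsize && y >= 1 && y + 1 < ysize);
    std::vector<std::size_t> indices = {
        flat_index(x, y, z, xsize, ysize),       // a
        flat_index(x, y + 1, z, xsize, ysize),   // b
        flat_index(x, y - 1, z, xsize, ysize),   // c
        flat_index(x - 1, y, z, xsize, ysize)    // d
    };
    std::sort(indices.begin(), indices.end());
    return indices;
}
```

For the pixel at 10,10,0 with a width of 240 this yields 2170, 2409, 2410, 2650, i.e. the order c, d, a, b.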
"To make it even more lopsided, I've moved the z operations into their own section that only gets hit if zindex != 0, so effectively, the second version only has 9 accesses. So by that metric, the second version should definitely be faster."
Did you actually measure that? Because I'd be pretty surprised if that were true. An if statement in the inner loop of your program can add a surprising amount of overhead -- see Is "IF" expensive?. I'd be willing to bet that the overhead of the extra multiply is a lot less than the overhead of the branching, unless z happens to be zero 99% of the time.
"What's odd to me is that the order of operations matters!"
The order of what operations? It's not clear to me what you're reordering here. Please give some more snippets of what you're trying to do.