Code speed difference when using a random array index

Code speed difference when using a random array index - c++

Given a real number X within [0,1], after a specific binning I have to identify in what bin X falls. Given the bin size dx, I am using i = std::size_t(X/dx) , which works very well. I then look for the respective value of a given array v and set a second variable Y using double Y=v[i]. The whole code looks as follows:
double X = func();
dx=0.01;
int i = std::size_t(X/dx);
double Y = v[i];
print(Y)
This method correctly gives the expected value for the index i within the range [0, length(v)].
My main issue is not with finding the index, but using it: X is determined from an auxiliary function, and whenever I need to set Y=v[i] using the index determined above the code becomes extremely slow.
Without commenting or removing any of the lines, the code becomes much faster when setting X to some random value between 0 and 1 right after its definition or by setting i to some random value between 0 and length of v after the third line.
Could anyone be able to tell why this occurs? The speed changes of a factor 1000 if not more, and since there are only additional steps in the faster method and func() is called anyway I can't understand why it should become faster.

Since you have put no code in the question, there has to be a wild-guess like this:
You didn't sort all the X results before accessing lookup table. Processing a sorted array is faster.
Some of X had denormalized values which took a toll on computation time for certain CPU types including yours.
The dataset is too big for the L3 cache and it accessed RAM always, instead of quick cache hits that were seen in the other test.
Compiler was optimizing all of the expensive function calls out, but in real-world test scenario, it is not.
Time measurement has bugs
Computer is not stable in performance (like being a shared server or an antivirus intervention feeding on RAM bandwidth)

Related

C++ - Performance of static arrays, with variable size at launch

I wrote a cellular automaton program that stores data in a matrix (an array of arrays). For a 300*200 matrix I can achieve 60 or more iterations per second using static memory allocation (e.g. std::array).
I would like to produce matrices of different sizes without recompiling the program every time, i.e. the user enters a size and then the simulation for that matrix size begins. However, if I use dynamic memory allocation, (e.g. std::vector), the simulation drops to ~2 iterations per second. How can I solve this problem? One option I've resorted to is to pre-allocate a static array larger than what I anticipate the user will select (e.g. 2000*2000), but this seems wasteful and still limits user choice to some degree.
I'm wondering if I can either
a) allocate memory once and then somehow "freeze" it for ordinary static array performance?
b) or perform more efficient operations on the std::vector? For reference, I am only performing matrix[x][y] == 1 and matrix[x][y] = 1 operations on the matrix.
According to this question/answer, there is no difference in performance between std::vector or using pointers.
EDIT:
I've rewritten the matrix, as per UmNyobe' suggestion, to be a single array, accessed via matrix[y*size_x + x]. Using dynamic memory allocation (sized once at launch), I double the performance to 5 iterations per second.
As per PaulMcKenzie's comment, I compiled a release build and got the performance I was looking for (60 or more iterations per second). However, this is the foundation for more, so I still want to quantify the benefit of one method over the other more thoroughly, so I used a std::chrono::high_resolution_clock to time each iteration, and found that the performance difference between dynamic and static arrays (after using a single array matrix representation) to be within the margin of error (450~600 microseconds per iteration).
The performance during debugging is a slight concern however, so I think I'll keep both, and switch to a static array when debugging.

For reference, I am only performing
matrix[x][y]
Red flag! Are you using vector<vector<int>>for your matrix
representation? This is a mistake, as rows of your matrix will be far
apart in memory. You should use a single vector of size rows x cols
and use matrix[y * row + x]
Furthermore, you should follow the approach where you index first by rows and then by columns, ie matrix[y][x] rather than matrix[x][y]. Your algorithm should also process the same way. This is due to the fact that with matrix[y][x] (x, y) and (x + 1, y) are one memory block from each other while with any other mechanism elements (x,y) and (x + 1, y), (x, y + 1) are much farther away.
Even if there is performance decrease from std::array to std::vector (as the array can have its elements on the stack, which is faster), a decent algorithm will perform on the same magnitude using both collections.

Efficiency in C and C++

So my teacher tells me that I should compute intermediate results as needed on the fly rather than storing them, because the speed of processors nowadays is much more faster than the speed of memory.
So when we compute an intermediate result, we also need to use some memory right ? Can anyone please explain it to me ?

your teacher is right speed of processors nowadays is much more faster than the speed of memory. Access to RAM is slower what access to the internal memory: cache, registers, etc.
Suppose you want to compute a trigonometric function: sin(x). To do this you can either call a function (math library offers one, or implement your own) which is computing the value; or you can use a lookup table stored in memory to get the result which means storing the intermediate values (sort of).
Calling a function will result in executing a number of instructions, while using a lookup table will result in fewer instructions (getting the address of the LUT, getting the offset to the desired element, reading from address+offset). In this case, storing the intermediate values is faster
But if you were to do c = a+b, computing the value will be much faster than reading it from somewhere in RAM. Notice that in this case the number of instructions to be executed would be similar.
So while it is true that access to RAM is slower, whether it's worth accessing RAM instead of doing the computation is a sensible question and several things need to be considered: number of instructions to be executed, if the computation happens in a loop and you can take advantage the architectures pipeline, cache memory, etc.
There is no one answer, you need to analyze each situation individually.

Your teacher's advice is oversimplifying advice on a complex topic.
If you think of "intermediate" as a single term (in the arithmetical sense of the word), then ask yourself, is your code re-using that term anywhere else ? I.e. if you have code like:
void calculate_sphere_parameters(double radius, double & area, double & volume)
{
area = 4 * (4 * acos(1)) * radius * radius;
volume = 4 * (4 * acos(1)) * radius * radius * radius / 3;
}
should you instead write:
void calculate_sphere_parameters(double radius, double & area, double *volume)
{
double quarter_pi = acos(1);
double pi = 4 * quarter_pi;
double four_pi = 4 * pi;
double four_thirds_pi = four_pi / 3;
double radius_squared = radius * radius;
double radius_cubed = radius_squared * radius;
area = four_pi * radius_squared;
volume = four_thirds_pi * radius_cubed; // maybe use "(area * radius) / 3" ?
}
It's not unlikely that a modern optimizing compiler will emit the same binary code for these two. I leave it to the reader to determine what they prefer to see in the sourcecode ...
The same is true for a lot of simple arithmetics (at the very least, if no function calls are involved in the calculation). In addition to that, modern compilers and/or CPU instruction sets might have the ability to do "offset" calculations for free, i.e. something like:
for (int i = 0; i < N; i++) {
do_something_with(i, i + 25, i + 314159);
}
will turn out the same as:
for (int i = 0; i < N; i++) {
int j = i + 25;
int k = i + 314159;
do_something_with(i, j, k);
}
So the main rule should be, if your code's readability doesn't benefit from creating a new variable to hold the result of a "temporary" calculation, it's probably overkill to use one.
If, on the other hand, you're using i + 12345 a dozen times in ten lines of code ... name it, and comment why this strange hardcoded offset is so important.
Remember just because your source code contains a variable doesn't mean the binary code as emitted by the compiler will allocate memory for this variable. The compiler might come to the conclusion that the value isn't even used (and completely discard the calculation assigning it), or it might come to the conclusion that it's "only an intermediate" (never used later where it would've to be retrieved from memory) and so store it in a register, to overwrite after "last use". It's far more efficiently to do something like calculate the value i + 1 each time you need it than to retrieve it from a memory location.
My advice would be:
keep your code readable first and foremost - too many variables rather obscure than help.
don't bother saving "simple" intermediates - addition/subtraction or scaling by powers of two is pretty much a "free" operation
if you reuse the same value ("arithmetic term") in multiple places, save it if it is expensive to calculate (for example involves function calls, a long sequence of arithmetics, or a lot of memory accesses like an array checksum).

So when we compute an intermediate result, we also need to use some memory right ? Can anyone please explain it to me?
There are several levels of memory in a computer. The layers look like this
registers – the CPU does all the calculations on this and access is instant
Caches - memory that's tightly coupled to the CPU core; all memory accesses to main system memory go through the cache actually and to the program it looks like if the data goes and comes from system memory. If the data is present in the cache and the access is well aligned the access is almost instant as well and hence very fast.
main system memory - connected to the CPU through a memory controller and shared by the CPU cores in a system. Accessing main memory introduces latencies through addressing and the limited bandwidth between memory and CPUs
When you work with in-situ calculated intermediary results those often never leave the registers or may go only as far as the cache and thus are not limited by the available system memory bandwidth or blocked by memory bus arbitration or address generation interlock.

This hurts me.
Ask your teacher (or better, don't, because with his level of competence in programming I wouldn't trust him), whether he has measured it, and what the difference was. The rule when you are programming for speed is: If you haven't measured it, and measured it before and after a change, then what you are doing is purely based on presumption and worthless.
In reality, an optimising compiler will take the code that you write and translate it to the fastest possible machine code. As a result, it is unlikely that there is any difference in code or speed.
On the other hand, using intermediate variables will make complex expressions easier to understand and easier to get right, and it makes debugging a lot easier. If your huge complex expression gives what looks like the wrong result, intermediate variables make it possible to check the calculation bit by bit and find where the error is.
Now even if he was right and removing intermediate variables made your code faster, and even if anyone cared about the speed difference, he would be wrong: Making your code readable and easier to debug gets you to a correctly working version of the code quicker (and if it doesn't work, nobody cares how fast it is). Now if it turns out that the code needs to be faster, the time you saved will allow you to make changes that make it really faster.

Memory management with low memory: finding and tracking duplicates of random function return values

Lets assume i have a function that takes 32bit integer in, and returns random 32bit integer out.
Now, i want to see how many and which duplicate values this function will return on all possible input values from 0 to 2^32-1. I could make this easy if i had more than 4gigs free ram, but i dont have more than 1gig ram.
I tried to map the calculated values on disk, using 4gig file where one byte represented how many duplicates it had got, but i noticed the approximated finishing time will be 25 days in the future with my HDD speeds! (i had to use SSD in fear of breaking my HDD...)
So, now the next step is to calculate this all in RAM and not use disk at all, but i ran at wall when thinking how to solve this elegantly. The only method i could think of was to loop (2^32)*(2^32) times the function, but this is obviously even slower than my HDD method.
What i need now is some nasty ideas to speed this up!
Edit: The function isnt really a random function, but similar to a random function, but the fact is you dont need to know anything about the function, its not the problem here. I want to see all the duplicates by my bare eyes, not just some mathematical guessing how many there could be. Why im doing this? Out of curiosity :)

To check for 2^32 possible duplicates you only need 4 gigabits which is 512MB, since you need only a single bit per value. The first hit of a zero bit sets it to 1 and on every hit of a 1 bit you know you have a duplicate and can print it out or do whatever you want to do with it.
I.e. you can do something like this:
int value = nextValue(...);
static int bits[] = new int[ 0x08000000 ]();
unsigned int idx = value >> 5, bit = 1 << ( value & 31 );
if( bits[ idx ] & bit )
// duplicate
else
bits[ idx ] |= bit;
in response to your comments
Yes, putting the duplicates into a map is a good idea if there are not too many and not to many different duplicates. The worst case here is 2^31 entries if every 2nd value appears exactly twice. If the map becomes too large to be held in in memory at once you can partition it, i.e. by only allowing values in the certain range, i.e. a quarter of the entire number space. This would make the map have only 1/4th of the size of the entire map if the duplicates are distributed rather equally. You would of course need to run the program 4 times for each quarter to find all duplicates.
To find also the 1st duplicate you can run it in two passes: In the first pass you use the bitmap to find the duplicates and put them into the map. In the 2nd pass you skip the bitmap and add the values into the map if there is already a entry in the map and the value is not yet there.
No, there is no good reason for a int over a unsigned int array. you can as well use unsigned int which would actually be more appropriate here.

The unaskable question: Why?. what are you trying to achieve?
Is this some kind of Monte-Carlo experiment?
If not, just look up the implementation algorithn of your (P)RNG and it will tell you exactly what the distribution of values is going to be.
Have a look at Boost.Random for more choices than you can fathom, and it will have e.g. uniform_int<> and variate generators that can limit your output range while still having well-defined guarantees on distribution of values across the output domain

Std::vector fill time goes from 0ms to 16ms after a certain threshold?

Here is what I'm doing. My application takes points from the user while dragging and in real time displays a filled polygon.
It basically adds the mouse position on MouseMove. This point is a USERPOINT and has bezier handles because eventually I will do bezier and this is why I must transfer them into a vector.
So basically MousePos -> USERPOINT. USERPOINT gets added to a std::vector<USERPOINT> . Then in my UpdateShape() function, I do this:
DrawingPoints is defined like this:
std::vector<std::vector<GLdouble>> DrawingPoints;
Contour[i].DrawingPoints.clear();
for(unsigned int x = 0; x < Contour[i].UserPoints.size() - 1; ++x)
SetCubicBezier(
Contour[i].UserPoints[x],
Contour[i].UserPoints[x + 1],
i);
SetCubicBezier() currently looks like this:
void OGLSHAPE::SetCubicBezier(USERFPOINT &a,USERFPOINT &b, int &currentcontour )
{
std::vector<GLdouble> temp(2);
if(a.RightHandle.x == a.UserPoint.x && a.RightHandle.y == a.UserPoint.y
&& b.LeftHandle.x == b.UserPoint.x && b.LeftHandle.y == b.UserPoint.y )
{
temp[0] = (GLdouble)a.UserPoint.x;
temp[1] = (GLdouble)a.UserPoint.y;
Contour[currentcontour].DrawingPoints.push_back(temp);
temp[0] = (GLdouble)b.UserPoint.x;
temp[1] = (GLdouble)b.UserPoint.y;
Contour[currentcontour].DrawingPoints.push_back(temp);
}
else
{
//do cubic bezier calculation
}
So for the reason of cubic bezier, I need to make USERPOINTS into GlDouble[2] (since GLUTesselator takes in a static array of double.
So I did some profiling. At ~ 100 points, the code:
for(unsigned int x = 0; x < Contour[i].UserPoints.size() - 1; ++x)
SetCubicBezier(
Contour[i].UserPoints[x],
Contour[i].UserPoints[x + 1],
i);
Took 0 ms to execute. then around 120, it jumps to 16ms and never looks back. I'm positive this is due to std::vector. What can I do to make it stay at 0ms. I don't mind using lots of memory while generating the shape then removing the excess when the shape is finalized, or something like this.

0ms is no time...nothing executes in no time. This should be your first indicator that you might want to check your timing methods over timing results.
Namely, timers typically don't have good resolution. Your pre-16ms results are probably just actually 1ms - 15ms being incorrectly reported at 0ms. In any case, if we could tell you how to keep it at 0ms, we'd be rich and famous.
Instead, find out which parts of the loop take the longest, and optimize those. Don't work towards an arbitrary time measure. I'd recommend getting a good profiler to get accurate results. Then you don't need to guess what's slow (something in the loop), but can actually see what part is slow.

You could use vector::reserve() to avoid unnecessary reallocations in DrawingPoints:
Contour[i].DrawingPoints.reserve(Contour[i].size());
for(unsigned int x = 0; x < Contour[i].UserPoints.size() - 1; ++x) {
...
}

If you actually timed the second code snippet only (as you stated in your post), then you're probably just reading from the vector. This means, the cause can not be the re-allocation cost of the vector. In that case, it may due to cache issues of the CPU (i.e. the small datasets can be read in lightning speed from cpu cache, but whenever the dataset is larger than the cache [or when alternately reading from different memory locations], the cpu has to access ram, which is distinctly slower than cache access).
If the part of the code, which you profiled, appends data to the vector, then use std::vector::reserve() with an appropriate capacity (number of expected entries in vector) before filling it.
However, regard two general rules for profiling/benchmarking:
1) Use time measurement methods with high resolution precision (as others stated, the resolution of your timer IS too low)
2) In any case, run the code snippet more than once (e.g. 100 times), get the total time of all runs and divide it by number of runs. This will give you some REAL numbers.

There's a lot of guessing going on here. Good guesses, I imagine, but guesses nevertheless. And when you try to measure the time functions take, that doesn't tell you how they take it. You can see if you try different things that the time will change, and from that you can have some suggestion of what was taking the time, but you can't really be certain.
If you really want to know what's taking the time, you need to catch it when it's taking that time, and find out for certain what it's doing. One way is to single-step it at the instruction level through that code, but I suspect that's out of the question. The next best way is to get stack samples. You can find profilers that are based on stack samples. Personally, I rely on the manual technique, for the reasons given here.
Notice that it's not really about measuring time. It's about finding out why that extra time is being spent, which is a very different question.

C++ 'small' optimization behaving strangely

I'm trying to optimize 'in the small' on a project of mine.
There's a series of array accesses that are individually tiny, but profiling has revealed that these array accesses are where the vast majority of my program is spending its time. So, time to make things faster, since the program takes about an hour to run.
I've moved the following type of access:
const float theValOld1 = image(x, y, z);
const float theValOld2 = image(x, y+1, z);
const float theValOld3 = image(x, y-1, z);
const float theValOld4 = image(x-1, y, z);
etc, for 28 accesses around the current pixel.
where image thunks down to
float image(const int x, const int y, const int z) const {
return data[z*xsize*ysize + y*xsize + x];
}
and I've replaced it with
const int yindex = y*xsize;
const int zindex = z*xsize*ysize;
const float* thePtr = &(data[z*xsize*ysize + y*xsize + x]);
const float theVal1 = *(thePtr);
const float theVal2 = *(thePtr + yindex);
const float theVal3 = *(thePtr - yindex);
const float theVal4 = *(thePtr - 1);
etc, for the same number of operations.
I would expect that, if the compiler were totally awesome, that this change would do nothing to the speed. If the compiler is not awesome, then I'd say that the second version should be faster, if only because I'm avoiding the implict pointer addition that comes with the [] thunk, as well as removing the multiplications for the y and z indeces.
To make it even more lopsided, I've moved the z operations into their own section that only gets hit if zindex != 0, so effectively, the second version only has 9 accesses. So by that metric, the second version should definitely be faster.
To measure performance, I'm using QueryPerformanceCounter.
What's odd to me is that the order of operations matters!
If I leave the operations as described and compare the timings (as well as the results, to make sure that the same value is calculated after optimization), then the older code takes about 45 ticks per pixel and the new code takes 10 ticks per pixel. If I reverse the operations, then the old code takes about 14 ticks per pixel and the new code takes about 30 ticks per pixel (with lots of noise in there, these are averages over about 100 pixels).
Why should the order matter? Is there caching or something happening? The variables are all named different things, so I wouldn't think that would matter. If there is some caching happening, is there any way I can take advantage of it from pixel to pixel?
Corollary: To compare speed, I'm supposing that the right way is to run the two versions independently of one another, and then compare the results from different runs. I'd like to have the two comparisons next to each other make sense, but there's obviously something happening here that prevents that. Is there a way to salvage this side-by-side run to get a reasonable speed comparison from a single run, so I can make sure that the results are identical as well (easily)?
EDIT: To clarify.
I have both new and old code in the same function, so I can make sure that the results are identical.
If I run old code and then new code, new code runs faster than old.
If I run new code and then old code, old code runs faster than new.
The z hit is required by the math, and the if statement cannot be removed, and is present in both. For the new code, I've just moved more z-specific code into the z section, and the test code I'm using is 100% 2D. When I move to 3D testing, then I'm sure I'll see more of the effect of branching.

You may (possibly) be running into some sort of readahead or cacheline boundary issue. Generally speaking, when you load a single value and it isn't "hot" (in cache), the CPU will pull in a cache line (32, 64, or 128 bytes are pretty typical, depending on processor). Subsequent reads to the same line will be much faster.
If you change the order of operations, you may just be seeing stalls due to how the lines are being loaded and evicted.
The best way to figure something like this out is to open "Disassembly" view and spend some quality time with your processor's reference manual.
If you're lucky, the changes that the code reordering causes will be obvious (the compiler may be generating extra instructions or branches). Less lucky, it will be a stall somewhere in the processor -- during the decode pipeline or due to a memory fetch...
A good profiler that can count stalls and cache misses may help here too (AMD has CodeAnalyst, for example).
If you're not under a time crunch, it's really worthwhile to dive into the disasm -- at the very least, you'll probably end up learning something you didn't know before about how your CPU, machine architecture, compiler, libraries, etc work. (I almost always end up going "huh" when studying disasm.)

If both the new and old versions run on the same data array, then yes, the last run will almost certainly get a speed bump due to caching. Even if the code is different, it'll be accessing data that was already touched by the previous version, so depending on data size, it might be in L1 cache, will probably be in L2 cache, and if a L3 cache exists, almost certainly in that. There'll probably also be some overlap in the code, meaning that the instruction cache will further boost performance of the second version.
A common way to benchmark is to run the algorithm once first, without timing it, simply to ensure that that's going to be cached, is cached, and then run it again a large number of times with timing enabled. (Don't trust a single execution, unless it takes at least a second or two. Otherwise small variations in system load, cache, OS interrupts or page faults can cause the measured time to vary). To eliminate the noise, measure the combined time taken for several runs of the algorithm, and obviously with no output in between. The fact that you're seeing spikes of 3x the usual time means that you're measuring at a way too fine-grained level. Which basically makes your timings useless.
Why should the order matter? Is there caching or something happening? The variables are all named different things, so I wouldn't think that would matter. If there is some caching happening, is there any way I can take advantage of it from pixel to pixel?
The naming doesn't matter. When the code is compiled, variables are translated into memory addresses or register id's. But when you run through your image array, you're loading it all into CPU cache, so it can be read faster the next time you run through it.
And yes, you can and should take advantage of it.
The computer tries very hard to exploit spatial and temporal locality -- that is, if you access a memory address X at time T, it assumes that you're going to need address X+1 very soon (spatial locality), and that you'll probably also need X again, at time T+1 (temporal locality). It tries to speed up those cases in every way possible (primarily by caching), so you should try to exploit it.
To make it even more lopsided, I've moved the z operations into their own section that only gets hit if zindex != 0, so effectively, the second version only has 9 accesses. So by that metric, the second version should definitely be faster.
I don't know where you placed that if statement, but if it's in a frequently evaluated block of code, the cost of the branch might hurt you more than you're saving. Branches can be expensive, and they inhibit the compiler's and CPU's ability to reorder and schedule instructions. So you may be better off without it. You should probably do this as a separate optimization that can be benchmarked in isolation.
I don't know which algorithm you're implementing, but I'm guessing you need to do this for every pixel?
If so, you should try to cache your lookups. Once you've got image(x, y, z), that'll be the next pixel's image(x+1, y, z), so cache it in the loop so the next pixel won't have to look it up from scratch. That would potentially allow you to reduce your 9 accesses in the X/Y plane down to three (use 3 cached values from the last iteration, 3 from the one before it, and 3 we just loaded in this iteration)
If you're updating the value of each pixel as a result of its neighbors values, a better approach may be to run the algorithm in a checkerboard pattern. Update every other pixel in the first iteration, using only values from their neighbors (which you're not updating), and then run a second pass where you update the pixels you read from before, based on the values of the pixels you updated before. This allows you to eliminate dependencies between neighboring pixels, so their evaluation can be pipelined and parallelized efficiently.
In the loop that performs all the lookups, unroll it a few times, and try to place all the memory reads at the top, and all the computations further down, to give the CPU a chance to overlap the two (since data reads are a lot slower, get them started, and while they're running, the CPU will try to find other instructions it can evaluate).
For any constant values, try to precompute them as much as possible. (rather than z*xsize*ysize, precompute xsize*ysize, and multiply z with the result of that.
Another thing that may help is to prefer local variables over globals or class members. You may gain something simply by, at the start of the function, making local copies of the class members you're going to need. The compiler can always optimize the extra variables out again if it wants to, but you make it clear that it shouldn't worry about underlying changes to the object state (which might otherwise force it to reload the members every time you access them)
And finally, study the generated assembly in detail. See where it's performing unnecessary store/loads, where operations are being repeated even though they could be cached, and where the ordering of instructions is inefficient, or where the compiler fails to inline as much as you'd hoped.
I honestly wouldn't expect your changes to the lookup function to have much effect though. An array access with the operator[] is easily convertible to the equivalent pointer arithmetic, and the compiler can optimize that pretty efficiently, as long as the offsets you're adding don't change.
Usually, the key to low-level optimizations is, somewhat ironically, not to look at individual lines of code, but at whole functions, and at loops. You need a certain amount of instructions in a block so you have something to work with, since a lot of optimizations deal with breaking dependencies between chains of instructions, reordering to hide instruction latency, and with caching individual values to avoid memory load/stores. That's almost impossible to do on individual array lookups, but there's almost certainly a lot gained if you consider a couple of pixels at a time.
Of course, as with almost all microoptimizations, there are no always true answers. Some of the above might be useful to you, or they might not.
If you tell us more about the access pattern (which pixels are you accessing, is there any required order, and are you just reading, or writing as well? If writing, when and where are the updated values used?)
If you give us a bit more information, we'll be able to offer much more specific (and likely to be effective) suggestions

When optimising, examining the data access pattern is essential.
for example:
assuming a width of 240
for a pixel at <x,y,z> 10,10,0
with original access pattern would give you:
a. data[0+ 10*240 + 10] -> data[2410]
b. data[0+ 11*240 + 10] -> data[2650]
c. data[0+ 9*240 + 10] -> data[2170]
d. data[0+ 10*240 + 9] -> data[2409]
Notice the indices which are in arbitrary order.
Memory controller makes aligned accesses to the main memory to fill the cache lines.
If you order your operations so that accesses are to ascending memory addresses
(e.g. c,d,a,b ) then the memory controller would be able to stream the data in to
the cache lines.
Missing cache on read would be expensive as it has to search down the cache
hierarchy down to the main memory. Main memory access could be 100x slower than
cache. Minimising main memory access will improve the speed of your operation.

To make it even more lopsided, I've moved the z operations into their own section that only gets hit if zindex != 0, so effectively, the second version only has 9 accesses. So by that metric, the second version should definitely be faster.
Did you actually measure that? Because I'd be pretty surprised if that were true. An if statement in the inner loop of your program can add a surprising amount of overhead -- see Is "IF" expensive?. I'd be willing to bet that the overhead of the extra multiply is a lot less than the overhead of the branching, unless z happens to be zero 99% of the time.
What's odd to me is that the order of operations matters!
The order of what operations? It's not clear to me what you're reordering here. Please give some more snippets of what you're trying to do.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js