Computational cost of math operations in GLSL - glsl

I'm writing a GPGPU program using GLSL shaders and am trying to come up with a few optimizations for an N-body collision detection algorithm. One is performing a 'quick' check to determine whether two objects are even in the same ballpark. The idea is to quickly disqualify lots of possibilities so that I only have to perform a more accurate collision test on a handful of objects. If the quick check decides there's a chance they might collide, the accurate check is performed.
The objects are circles (or spheres). I know the position of their center and their radius. The quick check will see if their square (or cube) bounding boxes overlap:
//make sure A is to the right of and above B
//code for that
if(A_minX > B_maxX) return false; //they definitely don't collide
if(A_minY > B_maxY) return false; //they definitely don't collide
if(length(A_position - B_position) <= A_radius + B_radius){
//they definitely do collide
return true;
}
My question is whether the overhead of performing this quick check (making sure that A and B are in the right order, then checking whether their bounding boxes overlap) is going to be faster than calling length() and comparing that against their combined radii.
It'd be useful to know the relative computational cost of various math operations in GLSL, but I'm not quite sure how to discover them empirically or whether this information is already posted somewhere.

You can avoid using square roots (which are implicitly needed for the length() function) by comparing the squares of the values.
The test could then look like this:
vec3 vDiff = A_position - B_position;
float radSum = A_radius + B_radius;
if (dot(vDiff, vDiff) < radSum * radSum) {
return true;
}
This reduces it back to a single test, but still uses only simple and efficient operations.
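For anyone who wants to try the idea outside a shader, here is a minimal CPU-side C++ sketch of the same squared-distance comparison. The names Circle and circlesOverlap are made up for illustration, and it uses <= so that touching circles count as colliding, as in the question's original test.

```cpp
// Illustrative 2D circle; the field names are assumptions, not from the shader.
struct Circle {
    float x, y;    // center
    float radius;
};

// True when the circles touch or overlap. Comparing squared quantities
// avoids the square root hidden inside a length() call.
inline bool circlesOverlap(const Circle& a, const Circle& b) {
    const float dx = a.x - b.x;
    const float dy = a.y - b.y;
    const float radSum = a.radius + b.radius;
    return dx * dx + dy * dy <= radSum * radSum;
}
```

The comparison is valid because both sides are non-negative, so squaring preserves the ordering.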

While we're on the topic of costs, you don't need two branches here. You can test the results of a component-wise test instead. So, this could be collapsed into a single test using any (greaterThan (A_min, B_max)). A good compiler will probably figure this out, but it helps if you consider data parallelism yourself.
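A rough C++ sketch of the same idea, for illustration only: all comparisons are evaluated and combined into one result instead of a chain of early returns. Box2 and boxesDisjoint are hypothetical names. Unlike the question's code, this version tests both directions, so it does not need A and B to be ordered first. In GLSL the equivalent would presumably be any(greaterThan(A_min, B_max)) || any(greaterThan(B_min, A_max)), assuming the min/max corners are stored as vectors.

```cpp
// Illustrative axis-aligned box; field names are assumptions.
struct Box2 {
    float minX, minY;
    float maxX, maxY;
};

// Returns true when the boxes cannot overlap. The four comparisons are
// combined with bitwise OR, so there is a single decision at the end
// rather than several separate branches.
inline bool boxesDisjoint(const Box2& a, const Box2& b) {
    return (a.minX > b.maxX) | (a.minY > b.maxY) |
           (b.minX > a.maxX) | (b.minY > a.maxY);
}
```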
Costs are all relative. 15 years ago the arithmetic work necessary to do what length (...) does was such that you could do a cubemap texture lookup in less time - on newer hardware you'd be insane to do that because compute is quicker than memory.
To put this all into perspective, thread divergence can be more of a bottleneck than instruction or memory throughput. That is, if two of your shader invocations running in parallel take separate paths through the shader, you may introduce unnecessary performance penalties. The underlying hardware architecture means that things that were once a safe bet for optimization may not be in the future and might even cause your optimization attempt to hurt performance.

Related

Which one is more memory-consuming? Matrix or trigonometry transformations

I have written two different ways to transform Euler angles to a normalized unit direction vector, but I'm not sure which one is faster: the one which uses trigonometry operations, or the one that transforms the forward vector through a matrix?
D3DXVECTOR3 EulerToDir(D3DXVECTOR3 EulerRotation) { return D3DXVECTOR3(sin(EulerRotation.x)*cos(EulerRotation.y), -sin(EulerRotation.y), cos(EulerRotation.x)*cos(EulerRotation.y)); }//Convert euler angles to the unit direction vector.
D3DXVECTOR3 EulerToDirM(D3DXVECTOR3 EulerRotation)//Same thing but using matrix transformation. More accurate.
{
D3DXMATRIX rotMat;
D3DXMatrixRotationYawPitchRoll(&rotMat, EulerRotation.x, EulerRotation.y, EulerRotation.z);
D3DXVECTOR3 resultVec(0, 0, 1);//Facing towards the z.
D3DXVec3TransformNormal(&resultVec, &resultVec, &rotMat);
return resultVec;
}
Thanks.
Well, what exactly do you care about? Memory usage like stated in the top level question? Or speed, as specified in the description?
If it's speed, the only real way to tell is measure it on your target architecture/environment. Trying to guess is usually a waste of time.
The easiest way to test the performance of self-contained code snippets is to set up a unit test where you do something like this:
#include <chrono>
// set everything up first
auto startTime = std::chrono::steady_clock::now();
for (int i = 0; i < NUM_ITERATIONS; ++i)
{
// code to be performance tested
}
auto endTime = std::chrono::steady_clock::now();
Then you can do endTime - startTime and see which code took longer to run.
If you need to test memory usage, you could print out sizeof() the classes/structs if they are simple, else you could allocate them while instrumenting your code with valgrind/massif.
You can do a complexity analysis of the functions using Big O notation. Your example uses the predefined sine/cosine functions, which are system dependent and can be implemented in many different ways; the C++ library decides which implementation to use for a particular input x. See: Different implementations of sine.
You should try searching MSDN for the complexity of the matrix operations you performed, though I believe the EulerToDirM function uses matrix operations which are at least O(N), while EulerToDir gives the result in O(1), which is better.

Faster to create map of pointers or if statement

I'm creating a game engine using C++ and SFML. I have a class called character that will be the base for entities within the game. The physics class is also going to handle character movement.
My question is: is it faster to create a vector of pointers to the characters that move in a frame? Whenever a function moves a character, it would place it inside that vector, and after the physics class is done handling the vector, it gets cleared.
Or is it faster to have a bool variable that gets set to true whenever a function moves a character and then have an if statement inside my physics class that tests every character for movement?
EDIT:
OK, I've gone with a different approach where a function inside the Physics class is responsible for dealing with character movement. Immediately upon movement, it tests for collisions. If a collision happens, it stops the movement in that direction.
Thanks for your help guys
Compared to all the other stuff that is going on in your program (physics, graphics), this will not make a difference. Use the method that makes programming easier because you will not notice a runtime difference at all.
If the total number of characters is relatively small, then you won't notice the difference between the approaches.
Else (the number of characters is large), if most of the characters move during a frame, then the approach with a flag inside each character seems more appropriate, because even with a vector of moved characters you'll traverse nearly all of them, and on top of that you get the additional overhead of maintaining the vector.
Else (the number of characters is large, but only a few of them move during a frame), it may be better to use the vector, because it can save you time by not traversing characters which didn't move.
What counts as a small or large number depends on your application. You should test under which conditions you get better performance with either approach.
This would be the right time to quote Hoare, but I'll abstain. Generally, however, you should profile before you optimize (if, and only if, the time budget is not enough on the minimum spec hardware -- if your game runs at 60fps on the target hardware you will do nothing whatsoever).
It is much more likely that the actual physics calculations will be the limiting factor, not doing the "is this unit moving?" check. Also, it is much more likely that submitting draw calls will bite you rather than checking a few hundred or so units.
As an "obvious" thing, it appears to be faster to hold a vector of object pointers and only process the units that are actually moving. However, the "obvious" is not always correct. Iterating linearly over a greater number of elements can very well be faster than jumping around (due to cache). Again, if this part of your game is identified as the bottleneck (very unlikely) then you will have to measure which is better.
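To make the two options concrete, here is a minimal sketch of both, so they can be dropped into a benchmark. Character, processFlagged and processMovedList are hypothetical names, not part of the asker's engine, and the "physics" is reduced to a single position update.

```cpp
#include <cstddef>
#include <vector>

// Illustrative character; the fields are assumptions.
struct Character {
    float x = 0.0f;
    float dx = 0.0f;
    bool moved = false;   // used only by the flag approach
};

// Flag approach: scan every character linearly and test the flag.
inline std::size_t processFlagged(std::vector<Character>& all) {
    std::size_t processed = 0;
    for (Character& c : all) {
        if (c.moved) {
            c.x += c.dx;
            c.moved = false;
            ++processed;
        }
    }
    return processed;
}

// List approach: visit only the characters recorded as moved this frame,
// then clear the list for the next frame.
inline std::size_t processMovedList(std::vector<Character*>& movedThisFrame) {
    const std::size_t processed = movedThisFrame.size();
    for (Character* c : movedThisFrame) {
        c->x += c->dx;
    }
    movedThisFrame.clear();
    return processed;
}
```

The flag version walks memory contiguously; the list version touches fewer elements but follows pointers. Which one wins depends on the ratio of moving to total characters, so it has to be measured.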

Ring buffer: Disadvantages by moving through memory backwards?

This is probably language agnostic, but I'm asking from a C++ background.
I am hacking together a ring buffer for an embedded system (AVR, 8-bit). Let's assume:
const uint8_t size = /* something > 0 */;
uint8_t buffer[size];
uint8_t write_pointer;
There's this neat trick of &ing the write and read pointers with size-1 to do an efficient, branchless rollover if the buffer's size is a power of two, like so:
// value = buffer[write_pointer];
write_pointer = (write_pointer+1) & (size-1);
If, however, the size is not a power of two, the fallback would probably be a compare of the pointer (i.e. index) to the size and do a conditional reset:
// value = buffer[write_pointer];
if (++write_pointer == size) write_pointer ^= write_pointer;
Since the reset occurs rather rarely, this should be easy for any branch prediction.
This assumes, though, that the pointers need to advance forward in memory. While this is intuitive, it requires a load of size in every iteration. I assume that reversing the order (advancing backwards) would yield better CPU instructions (i.e. jump if not zero) in the regular case, since size is only required during the reset.
// value = buffer[--write_pointer];
if (write_pointer == 0) write_pointer = size;
TL;DR: My question is: Does marching backwards through memory have a negative effect on the execution time due to cache misses (since memory cannot simply be read forward) or is this a valid optimization?
You have an 8 bit avr with a cache? And branch prediction?
How does forward or backward matter as far as caches are concerned? A hit or miss can be anywhere within the cache line: beginning, middle, end, random, sequential, it doesn't matter. You can work from the back to the front or the front to the back of a cache line and it is the same cost (assuming all other things are held constant): the first miss causes a fill, then that line is in cache and you can access any of the items in any pattern at a lower latency until it is evicted.
On a microcontroller like that you want to make the effort, even at the cost of throwing away some memory, to size and align a circular buffer such that you can mask. There is no cache, and instruction fetches are painful because they likely come from flash that may be slower than the processor clock rate, so you want to reduce the number of instructions executed, or make the execution a little more deterministic (the same number of instructions every loop until that task is done). There might be a pipeline that would appreciate the masking rather than an if-then-else.
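A minimal sketch of the masked variant, assuming the buffer size can be rounded up to a power of two. RingBuffer, kSize and push are illustrative names, and 16 is an arbitrary capacity chosen for the example.

```cpp
#include <cstdint>

// Capacity must be a power of two for the mask trick to work.
constexpr std::uint8_t kSize = 16;
static_assert(kSize != 0 && (kSize & (kSize - 1)) == 0,
              "kSize must be a power of two");

struct RingBuffer {
    std::uint8_t data[kSize] = {};
    std::uint8_t writeIndex = 0;

    // Stores one byte and advances the index without a branch:
    // the same instructions run on every call, wrap or no wrap.
    void push(std::uint8_t value) {
        data[writeIndex] = value;
        writeIndex = static_cast<std::uint8_t>((writeIndex + 1) & (kSize - 1));
    }
};
```

The static_assert catches a non-power-of-two size at compile time, which is the failure mode that otherwise shows up as silent index corruption.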
TL;DR: My question is: Does marching backwards through memory have a negative effect on the execution time due to cache misses (since memory cannot simply be read forward) or is this a valid optimization?
The cache doesn't care: a miss on any item in the line causes a fill, and once the line is in the cache, any pattern of access (random, sequential forward or backward, or just pounding on the same address) takes less time because it is served from faster memory, until the line is evicted. Evictions won't come from neighboring cache lines; they will come from cache lines larger powers of two away, so whether the next cache line you pull is at a higher or lower address, the cost is the same.
Does marching backwards through memory have a negative effect on the execution time due to cache misses (since memory cannot simply be read forward)
Why do you think that you will have a cache miss? You will have a cache miss if you try to access outside the cache (forward or backward).
There are a number of points which need clarification:
That size needs to be loaded each and every time (it's const, therefore ought to be immutable)
That your code is correct. For example, with 0-based indexing (as used in C/C++ for array access) the value 0 is a valid index into the buffer, and the value size is not. Similarly, there is no need to XOR when you could simply assign 0; equally, a modulo operator will work (write_pointer = (write_pointer + 1) % size).
What happens in the general case with virtual memory (i.e. the logically adjacent addresses might be all over the place in the real memory map), paging (stuff may well be cached on a page-by-page basis) and other factors (pressure from external processes, interrupts)
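On the correctness point, here is a small sketch of index stepping in both directions that keeps the index inside 0 .. size-1 at all times. kCapacity, nextForward and nextBackward are hypothetical names, and 10 is just an example of a size that is not a power of two.

```cpp
#include <cstdint>

constexpr std::uint8_t kCapacity = 10;   // hypothetical, not a power of two

// Forward: valid indices are 0 .. kCapacity-1; reset with a plain assignment.
inline std::uint8_t nextForward(std::uint8_t index) {
    ++index;
    if (index == kCapacity) index = 0;
    return index;
}

// Backward: wrap before decrementing, so slot 0 is still used and
// kCapacity itself is never used as an index.
inline std::uint8_t nextBackward(std::uint8_t index) {
    if (index == 0) index = kCapacity;
    --index;
    return index;
}
```

Whether the compare-with-zero form actually produces fewer instructions than the compare-with-constant form is something only the disassembly for the target can answer.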
In short: this is the kind of optimisation that leads to more feet related injuries than genuine performance improvements. Additionally it is almost certainly the case that you get much, much better gains using vectorised code (SIMD).
EDIT: And in interpreted languages or JIT'ed languages it might be a tad optimistic to assume you can rely on the use of JNZ and others at all. At which point the question is, how much of a difference is there really between loading size and comparing versus comparing with 0.
As usual, when performing any form of manual code optimization, you must have extensive in-depth knowledge of the specific hardware. If you don't have that, then you should not attempt manual optimizations, end of story.
Thus, your question is filled with various strange assumptions:
First, you assume that write_pointer = (write_pointer+1) & (size-1) is more efficient than something else, such as the XOR example you posted. You are just guessing here; you will have to disassemble the code and see which yields fewer CPU instructions.
Because, when writing code for a tiny, primitive 8-bit MCU, there is not much going on in the core to speed up your code. I don't know AVR8, but it seems likely that you have a small instruction pipe and that's it. It seems quite unlikely that you have much in the way of branch prediction. It seems very unlikely that you have a data and/or instruction cache. Read the friendly CPU core manual.
As for marching backwards through memory, it is unlikely to have any impact at all on your program's performance. On old, crappy compilers you would get slightly more efficient code if the loop condition was a comparison against zero instead of against some other value. On modern compilers, this shouldn't be an issue. As for cache memory concerns, I doubt you have any cache memory to worry about.
The best way to write efficient code on 8-bit MCUs is to stick to 8-bit arithmetic whenever possible and to avoid 32-bit arithmetic like the plague. And forget you ever heard about something called floating point. This is what will make your program efficient, you are unlikely to find any better way to manually optimize your code.

Image interpolation implementation with C++

I have a question related to the implementation of image interpolation (bicubic and bilinear methods) with C++. My main concern is speed. Based on my understanding of the problem, in order to make the interpolation program fast and efficient, the following strategies can be adopted:
Fast image interpolation using Streaming SIMD Extensions (SSE)
Image interpolation with multi-thread or GPU
Fast image interpolation algorithms
C++ implementation tricks
Here, I am more interested in the last strategy. I set up a class for interpolation:
/**
* This class is used to perform interpolation for a certain point in
* the image grid.
*/
class Sampling
{
public:
// samples[0] *-------------* samples[1]
// --------------
// --------------
// samples[2] *-------------*samples[3]
inline void sampling_linear(unsigned char *samples, unsigned char &res)
{
unsigned char res_temp[2];
sampling_linear_1D(samples,res_temp[0]);
sampling_linear_1D(samples+2,res_temp[1]);
sampling_linear_1D(res_temp,res);
}
private:
inline void sampling_linear_1D(unsigned char *samples, unsigned char &res)
{
}
};
Here I only give an example for bilinear interpolation. In order to make the program run faster, the inline function is employed. My question is whether this implementation scheme is efficient. Additionally, I want to give the user the option of choosing between different interpolation methods during the interpolation procedure. Then I have two choices:
Depending on the interpolation method, invoke the function that performs interpolation for the whole image.
For each output image pixel, first determine its position in the input image, and then according to the interpolation method setting, determine the interpolation function.
The first method means more codes in the program while the second one may lead to inefficiency. Then, how could I choose between these two schemes? Thanks!
Fast image interpolation using Streaming SIMD Extensions (SSE)
This may not provide the desired result, because I expect that your algorithm will be memory-bound rather than FLOP/s-bound.
I mean, it will definitely be an improvement, but not a worthwhile one compared to the implementation cost.
And by the way, modern compilers can perform auto-vectorization (i.e. use SSE and further extensions): GCC starting from 4.0, MSVC starting from 2012, MSVC Auto-Vectorization video lectures.
Image interpolation with multi-thread or GPU
A multi-threaded version should work well, because it would allow you to exploit all of the available memory throughput.
If you do not plan to process the data several times, or to use it in some way on the GPU, then GPGPU may not give the desired result. Yes, it will produce the result faster (mostly due to higher memory speed), but this effect will be cancelled out by the slow transfer between main RAM and the GPU's RAM.
Just for example, approximate modern throughputs:
CPU RAM ~ 20GiB/s
GPU RAM ~ 150GiB/s
Transferring between CPU RAM <-> GPU RAM ~ 3-5 GiB/s
For single-pass memory-bound algorithms, in most cases the third item makes the use of GPUs impractical.
In order to make the program run faster, the inline function is employed
Member functions defined inside the class body are implicitly "inline". Be aware that the main purpose of "inline" is not actually inlining, but preventing One Definition Rule violations when your functions are defined in headers.
There are compiler-dependent "forceinline" features; for instance, MSVC has __forceinline. Or use the compiler-abstracting, ifdef'ed BOOST_FORCEINLINE macro.
Anyway, trust your compiler unless you can prove otherwise (with the help of the assembly output, for example). The most important fact is that the compiler must see the function definitions; then it can decide by itself to inline, even if the function is not declared inline.
My question is whether this implementation scheme is efficient.
As I understand it, as a pre-step you gather the samples into a 2x2 matrix. I think it may be better to pass two pointers to arrays of two elements within the image directly, or one pointer plus the width (to calculate the second pointer automatically). However, it is not a big issue; most likely your temporary 2x2 matrix will be optimized away.
What is really important is how you traverse your image.
Let's say for given x and y, index is calculated as:
i=width*y+x;
Then your traversal loop should be:
for(int y=/*...*/)
for(int x=/*...*/)
{
// loop body
}
Because if you choose the other order (x in the outer loop, y in the inner loop), the traversal will not be cache-friendly, and as a result the performance drop can be up to 64x (depending on your pixel size). You may check it just for your own interest.
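A minimal sketch of the cache-friendly order, assuming a row-major 8-bit image stored in a std::vector. sumRowMajor is an illustrative name, and the summation only stands in for a real per-pixel loop body.

```cpp
#include <cstddef>
#include <vector>

// Sums a row-major image (index = width * y + x) with y in the outer loop,
// so the inner loop touches consecutive addresses.
inline unsigned long sumRowMajor(const std::vector<unsigned char>& pixels,
                                 std::size_t width, std::size_t height) {
    unsigned long total = 0;
    for (std::size_t y = 0; y < height; ++y) {
        const unsigned char* row = pixels.data() + width * y;  // start of row y
        for (std::size_t x = 0; x < width; ++x) {
            total += row[x];
        }
    }
    return total;
}
```

Swapping the two loops gives the same result but strides through memory by width bytes per step, which is the pattern to compare against when measuring.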
The first method means more codes in the program while the second one may lead to inefficiency. Then, how could I choose between these two schemes? Thanks!
In this case, you can use compile-time polymorphism to reduce the amount of code in the first version, for instance based on templates.
Just look at std::accumulate: it is written once, and then it works with different types of iterators and different binary operations (functions or functors), without implying any runtime penalty due to its polymorphism.
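As a rough illustration of the template approach, here is a 1D sketch where the resampling loop is written once and the interpolation method is a functor chosen at compile time. NearestSampler, LinearSampler and resample are hypothetical names, and the input is assumed to be non-empty; the 2D image case would follow the same shape.

```cpp
#include <cstddef>
#include <vector>

// Two hypothetical 1D sampling policies; each is a small functor.
struct NearestSampler {
    float operator()(const std::vector<float>& src, float pos) const {
        return src[static_cast<std::size_t>(pos + 0.5f)];
    }
};

struct LinearSampler {
    float operator()(const std::vector<float>& src, float pos) const {
        const std::size_t i = static_cast<std::size_t>(pos);
        const std::size_t j = (i + 1 < src.size()) ? i + 1 : i;  // clamp at end
        const float t = pos - static_cast<float>(i);
        return src[i] * (1.0f - t) + src[j] * t;
    }
};

// The loop is written once; the sampler type is fixed at compile time,
// so the call inside the loop can be inlined by the compiler.
template <typename Sampler>
std::vector<float> resample(const std::vector<float>& src, std::size_t outSize,
                            Sampler sampler) {
    std::vector<float> out(outSize);
    const float scale = (outSize > 1)
        ? static_cast<float>(src.size() - 1) / static_cast<float>(outSize - 1)
        : 0.0f;
    for (std::size_t k = 0; k < outSize; ++k) {
        out[k] = sampler(src, static_cast<float>(k) * scale);
    }
    return out;
}
```

If the method is a runtime option, a single switch outside the loop can pick between resample(src, n, NearestSampler{}) and resample(src, n, LinearSampler{}), which keeps the per-pixel path free of that decision.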
Alexander Stepanov says:
For many years, I tried to achieve relative efficiency in more advanced languages (e.g., Ada and Scheme) but failed. My generic versions of even simple algorithms were not able to compete with built-in primitives. But in C++ I was finally able to not only accomplish relative efficiency but come very close to the more ambitious goal of absolute efficiency. To verify this, I spent countless hours looking at the assembly code generated by different compilers on different architectures.
Check Boost's Generic Image Library - it has good tutorial, and there is video presentation from author.

C++ 'small' optimization behaving strangely

I'm trying to optimize 'in the small' on a project of mine.
There's a series of array accesses that are individually tiny, but profiling has revealed that these array accesses are where the vast majority of my program is spending its time. So, time to make things faster, since the program takes about an hour to run.
I've moved the following type of access:
const float theValOld1 = image(x, y, z);
const float theValOld2 = image(x, y+1, z);
const float theValOld3 = image(x, y-1, z);
const float theValOld4 = image(x-1, y, z);
etc, for 28 accesses around the current pixel.
where image thunks down to
float image(const int x, const int y, const int z) const {
return data[z*xsize*ysize + y*xsize + x];
}
and I've replaced it with
const int yindex = y*xsize;
const int zindex = z*xsize*ysize;
const float* thePtr = &(data[zindex + yindex + x]);
const float theVal1 = *(thePtr);
const float theVal2 = *(thePtr + xsize); // (x, y+1, z): one row is xsize elements
const float theVal3 = *(thePtr - xsize); // (x, y-1, z)
const float theVal4 = *(thePtr - 1); // (x-1, y, z)
etc, for the same number of operations.
I would expect that, if the compiler were totally awesome, this change would do nothing to the speed. If the compiler is not awesome, then I'd say that the second version should be faster, if only because I'm avoiding the implicit pointer addition that comes with the [] thunk, as well as removing the multiplications for the y and z indices.
To make it even more lopsided, I've moved the z operations into their own section that only gets hit if zindex != 0, so effectively, the second version only has 9 accesses. So by that metric, the second version should definitely be faster.
To measure performance, I'm using QueryPerformanceCounter.
What's odd to me is that the order of operations matters!
If I leave the operations as described and compare the timings (as well as the results, to make sure that the same value is calculated after optimization), then the older code takes about 45 ticks per pixel and the new code takes 10 ticks per pixel. If I reverse the operations, then the old code takes about 14 ticks per pixel and the new code takes about 30 ticks per pixel (with lots of noise in there, these are averages over about 100 pixels).
Why should the order matter? Is there caching or something happening? The variables are all named different things, so I wouldn't think that would matter. If there is some caching happening, is there any way I can take advantage of it from pixel to pixel?
Corollary: To compare speed, I'm supposing that the right way is to run the two versions independently of one another, and then compare the results from different runs. I'd like to have the two comparisons next to each other make sense, but there's obviously something happening here that prevents that. Is there a way to salvage this side-by-side run to get a reasonable speed comparison from a single run, so I can make sure that the results are identical as well (easily)?
EDIT: To clarify.
I have both new and old code in the same function, so I can make sure that the results are identical.
If I run old code and then new code, new code runs faster than old.
If I run new code and then old code, old code runs faster than new.
The z hit is required by the math, and the if statement cannot be removed, and is present in both. For the new code, I've just moved more z-specific code into the z section, and the test code I'm using is 100% 2D. When I move to 3D testing, then I'm sure I'll see more of the effect of branching.
You may (possibly) be running into some sort of readahead or cacheline boundary issue. Generally speaking, when you load a single value and it isn't "hot" (in cache), the CPU will pull in a cache line (32, 64, or 128 bytes are pretty typical, depending on processor). Subsequent reads to the same line will be much faster.
If you change the order of operations, you may just be seeing stalls due to how the lines are being loaded and evicted.
The best way to figure something like this out is to open "Disassembly" view and spend some quality time with your processor's reference manual.
If you're lucky, the changes that the code reordering causes will be obvious (the compiler may be generating extra instructions or branches). Less lucky, it will be a stall somewhere in the processor -- during the decode pipeline or due to a memory fetch...
A good profiler that can count stalls and cache misses may help here too (AMD has CodeAnalyst, for example).
If you're not under a time crunch, it's really worthwhile to dive into the disasm -- at the very least, you'll probably end up learning something you didn't know before about how your CPU, machine architecture, compiler, libraries, etc work. (I almost always end up going "huh" when studying disasm.)
If both the new and old versions run on the same data array, then yes, the last run will almost certainly get a speed bump due to caching. Even if the code is different, it'll be accessing data that was already touched by the previous version, so depending on data size, it might be in L1 cache, will probably be in L2 cache, and if a L3 cache exists, almost certainly in that. There'll probably also be some overlap in the code, meaning that the instruction cache will further boost performance of the second version.
A common way to benchmark is to run the algorithm once first, without timing it, simply to ensure that whatever is going to be cached is cached, and then run it again a large number of times with timing enabled. (Don't trust a single execution unless it takes at least a second or two; otherwise small variations in system load, cache state, OS interrupts or page faults can cause the measured time to vary.) To eliminate the noise, measure the combined time taken for several runs of the algorithm, with no output in between. The fact that you're seeing spikes of 3x the usual time means that you're measuring at far too fine-grained a level, which basically makes your timings useless.
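A minimal sketch of that procedure using std::chrono instead of QueryPerformanceCounter, purely so the example is portable. averageMicros is a made-up helper name, and the callable passed in stands for whichever version of the code is being measured.

```cpp
#include <chrono>

// Runs `work` once untimed (warm-up), then `runs` more times under a single
// timer. Returns the average time per run in microseconds.
template <typename Work>
double averageMicros(Work work, int runs) {
    work();  // warm-up: fills caches and triggers page faults; not timed
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; ++i) {
        work();
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / runs;
}
```

Each version would be timed with its own call, and the results compared afterwards. The work should write its result somewhere observable, otherwise an optimizing compiler may remove it entirely.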
Why should the order matter? Is there caching or something happening? The variables are all named different things, so I wouldn't think that would matter. If there is some caching happening, is there any way I can take advantage of it from pixel to pixel?
The naming doesn't matter. When the code is compiled, variables are translated into memory addresses or register id's. But when you run through your image array, you're loading it all into CPU cache, so it can be read faster the next time you run through it.
And yes, you can and should take advantage of it.
The computer tries very hard to exploit spatial and temporal locality -- that is, if you access a memory address X at time T, it assumes that you're going to need address X+1 very soon (spatial locality), and that you'll probably also need X again, at time T+1 (temporal locality). It tries to speed up those cases in every way possible (primarily by caching), so you should try to exploit it.
To make it even more lopsided, I've moved the z operations into their own section that only gets hit if zindex != 0, so effectively, the second version only has 9 accesses. So by that metric, the second version should definitely be faster.
I don't know where you placed that if statement, but if it's in a frequently evaluated block of code, the cost of the branch might hurt you more than you're saving. Branches can be expensive, and they inhibit the compiler's and CPU's ability to reorder and schedule instructions. So you may be better off without it. You should probably do this as a separate optimization that can be benchmarked in isolation.
I don't know which algorithm you're implementing, but I'm guessing you need to do this for every pixel?
If so, you should try to cache your lookups. Once you've got image(x, y, z), that'll be the next pixel's image(x+1, y, z), so cache it in the loop so the next pixel won't have to look it up from scratch. That would potentially allow you to reduce your 9 accesses in the X/Y plane down to three (use 3 cached values from the last iteration, 3 from the one before it, and 3 we just loaded in this iteration)
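Here is a sketch of that sliding-window idea, under the assumption that the per-pixel computation can be expressed through per-column partial results (a 3x3 sum is used as a stand-in). rowBoxSums is a hypothetical name; it expects a row-major image with 1 <= y <= height-2. For computations that need the individual neighbours, the same pattern applies with nine cached values instead of three column sums.

```cpp
#include <cstddef>
#include <vector>

// Sums the 3x3 neighbourhood of every interior pixel in row y of a row-major
// image. Column sums are carried from one pixel to the next, so each step
// loads only the three values of the new right-hand column.
inline std::vector<float> rowBoxSums(const std::vector<float>& img,
                                     std::size_t width, std::size_t y) {
    std::vector<float> out(width, 0.0f);
    if (width < 3) return out;
    const float* up  = img.data() + (y - 1) * width;
    const float* mid = img.data() + y * width;
    const float* dn  = img.data() + (y + 1) * width;

    float left   = up[0] + mid[0] + dn[0];   // column x-1
    float centre = up[1] + mid[1] + dn[1];   // column x
    for (std::size_t x = 1; x + 1 < width; ++x) {
        const float right = up[x + 1] + mid[x + 1] + dn[x + 1];  // new loads
        out[x] = left + centre + right;
        left = centre;     // slide the window one pixel to the right
        centre = right;
    }
    return out;
}
```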
If you're updating the value of each pixel as a result of its neighbors values, a better approach may be to run the algorithm in a checkerboard pattern. Update every other pixel in the first iteration, using only values from their neighbors (which you're not updating), and then run a second pass where you update the pixels you read from before, based on the values of the pixels you updated before. This allows you to eliminate dependencies between neighboring pixels, so their evaluation can be pipelined and parallelized efficiently.
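A small 2D sketch of the checkerboard scheme, assuming an update that only reads the four direct neighbours (a simple average is used as a placeholder for the real computation). halfPass and checkerboardSweep are illustrative names.

```cpp
#include <cstddef>
#include <vector>

// One checkerboard half-pass: updates interior cells whose (x + y) parity
// equals `parity` (0 or 1). The four direct neighbours have the opposite
// parity, so none of them is written during this pass.
inline void halfPass(std::vector<float>& grid, std::size_t width,
                     std::size_t height, std::size_t parity) {
    for (std::size_t y = 1; y + 1 < height; ++y) {
        // First interior x in this row whose (x + y) parity matches.
        const std::size_t xStart = 1 + (((1 + y) & 1) ^ parity);
        for (std::size_t x = xStart; x + 1 < width; x += 2) {
            const std::size_t i = y * width + x;
            grid[i] = 0.25f * (grid[i - 1] + grid[i + 1] +
                               grid[i - width] + grid[i + width]);
        }
    }
}

// A full sweep is two half-passes, one per colour of the checkerboard.
inline void checkerboardSweep(std::vector<float>& grid, std::size_t width,
                              std::size_t height) {
    halfPass(grid, width, height, 0);
    halfPass(grid, width, height, 1);
}
```

Stepping x by 2 keeps the inner loop free of a per-pixel parity test. Note that this only removes dependencies for stencils that read direct neighbours; a stencil that also reads diagonals would need more than two colours.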
In the loop that performs all the lookups, unroll it a few times, and try to place all the memory reads at the top, and all the computations further down, to give the CPU a chance to overlap the two (since data reads are a lot slower, get them started, and while they're running, the CPU will try to find other instructions it can evaluate).
For any constant values, try to precompute them as much as possible (rather than z*xsize*ysize, precompute xsize*ysize, and multiply z by the result of that).
Another thing that may help is to prefer local variables over globals or class members. You may gain something simply by, at the start of the function, making local copies of the class members you're going to need. The compiler can always optimize the extra variables out again if it wants to, but you make it clear that it shouldn't worry about underlying changes to the object state (which might otherwise force it to reload the members every time you access them)
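The last two suggestions can be sketched together. Volume and sliceSum are hypothetical; the member names data, xsize and ysize mirror the question, and the summation is only a placeholder for the real per-pixel work.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical volume class with row-major storage, filled with 1.0f.
class Volume {
public:
    Volume(std::size_t xs, std::size_t ys, std::size_t zs)
        : xsize(xs), ysize(ys), data(xs * ys * zs, 1.0f) {}

    // Sums one z slice. Members are copied into locals and the slice stride
    // is computed once, outside the loops.
    float sliceSum(std::size_t z) const {
        const std::size_t nx = xsize;               // local copy of member
        const std::size_t ny = ysize;               // local copy of member
        const std::size_t sliceStride = nx * ny;    // xsize * ysize, once
        const float* slice = data.data() + z * sliceStride;
        float total = 0.0f;
        for (std::size_t y = 0; y < ny; ++y) {
            const float* row = slice + y * nx;
            for (std::size_t x = 0; x < nx; ++x) {
                total += row[x];
            }
        }
        return total;
    }

private:
    std::size_t xsize, ysize;
    std::vector<float> data;
};
```

A good optimizer may well hoist these itself; the generated assembly is the way to find out whether the manual version changes anything.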
And finally, study the generated assembly in detail. See where it's performing unnecessary store/loads, where operations are being repeated even though they could be cached, and where the ordering of instructions is inefficient, or where the compiler fails to inline as much as you'd hoped.
I honestly wouldn't expect your changes to the lookup function to have much effect though. An array access with the operator[] is easily convertible to the equivalent pointer arithmetic, and the compiler can optimize that pretty efficiently, as long as the offsets you're adding don't change.
Usually, the key to low-level optimizations is, somewhat ironically, not to look at individual lines of code, but at whole functions, and at loops. You need a certain amount of instructions in a block so you have something to work with, since a lot of optimizations deal with breaking dependencies between chains of instructions, reordering to hide instruction latency, and with caching individual values to avoid memory load/stores. That's almost impossible to do on individual array lookups, but there's almost certainly a lot gained if you consider a couple of pixels at a time.
Of course, as with almost all microoptimizations, there are no always true answers. Some of the above might be useful to you, or they might not.
If you tell us more about the access pattern (which pixels are you accessing, is there any required order, and are you just reading, or writing as well? If writing, when and where are the updated values used?), we'll be able to offer much more specific (and likely more effective) suggestions.
When optimising, examining the data access pattern is essential.
For example, assuming a width of 240, for a pixel at <x,y,z> = 10,10,0 the original access pattern would give you:
a. data[0+ 10*240 + 10] -> data[2410]
b. data[0+ 11*240 + 10] -> data[2650]
c. data[0+ 9*240 + 10] -> data[2170]
d. data[0+ 10*240 + 9] -> data[2409]
Notice that the indices are in arbitrary order.
The memory controller makes aligned accesses to main memory to fill the cache lines. If you order your operations so that the accesses go to ascending memory addresses (e.g. c, d, a, b), then the memory controller will be able to stream the data into the cache lines.
A cache miss on a read is expensive, as the lookup has to go down the cache hierarchy all the way to main memory. Main memory access can be 100x slower than cache, so minimising main memory access will improve the speed of your operation.
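The index arithmetic from the example can be checked with a few lines of C++. sortedNeighbourIndices is a made-up helper that computes the four flat indices (a to d) and returns them in ascending address order; it assumes x >= 1 and y >= 1.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

// Flat indices of the four in-plane accesses from the example (z = 0),
// returned in ascending address order.
inline std::array<std::size_t, 4> sortedNeighbourIndices(std::size_t x,
                                                         std::size_t y,
                                                         std::size_t width) {
    std::array<std::size_t, 4> idx = {
        y * width + x,          // a: (x,   y)
        (y + 1) * width + x,    // b: (x,   y+1)
        (y - 1) * width + x,    // c: (x,   y-1)
        y * width + (x - 1)     // d: (x-1, y)
    };
    std::sort(idx.begin(), idx.end());
    return idx;
}
```

For x = 10, y = 10 and width = 240 this yields 2170, 2409, 2410, 2650, i.e. the order c, d, a, b. In real code the relative offsets are fixed, so the ascending order would simply be written out by hand rather than sorted at run time.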
To make it even more lopsided, I've moved the z operations into their own section that only gets hit if zindex != 0, so effectively, the second version only has 9 accesses. So by that metric, the second version should definitely be faster.
Did you actually measure that? Because I'd be pretty surprised if that were true. An if statement in the inner loop of your program can add a surprising amount of overhead -- see Is "IF" expensive?. I'd be willing to bet that the overhead of the extra multiply is a lot less than the overhead of the branching, unless z happens to be zero 99% of the time.
What's odd to me is that the order of operations matters!
The order of what operations? It's not clear to me what you're reordering here. Please give some more snippets of what you're trying to do.