Is there any workaround to "reserve" a cache fraction?

Is there any workaround to "reserve" a cache fraction? - c++

Assume I have to write a C or C++ computational intensive function that has 2 arrays as input and one array as output. If the computation uses the 2 input arrays more often than it updates the output array, I'll end up in a situation where the output array seldom gets cached because it's evicted in order to fetch the 2 input arrays.
I want to reserve one fraction of the cache for the output array and enforce somehow that those lines don't get evicted once they are fetched, in order to always write partial results in the cache.
Update1(output[]) // Output gets cached
DoCompute1(input1[]); // Input 1 gets cached
DoCompute2(input2[]); // Input 2 gets cached
Update2(output[]); // Output is not in the cache anymore and has to get cached again
...
I know there are mechanisms to help eviction: clflush, clevict, _mm_clevict, etc. Are there any mechanisms for the opposite?
I am thinking of 3 possible solutions:
Using _mm_prefetch from time to time to fetch the data back if it has been evicted. However this might generate unnecessary traffic plus that I need to be very careful to when to introduce them;
Trying to do processing on smaller chunks of data. However this would work only if the problem allows it;
Disabling hardware prefetchers where that's possible to reduce the rate of unwanted evictions.
Other than that, is there any elegant solution?

Intel CPUs have something called No Eviction Mode (NEM) but I doubt this is what you need.
While you are attempting to optimise the second (unnecessary) fetch of output[], have you given thought to using SSE2/3/4 registers to store your intermediate output values, update them when necessary, and writing them back only when all updates related to that part of output[] are done?
I have done something similar while computing FFTs (Fast Fourier Transforms) where part of the output is in registers and they are moved out (to memory) only when it is known they will not be accessed anymore. Until then, all updates happen to the registers. You'll need to introduce inline assembly to effectively use SSE* registers. Of course, such optimisations are highly dependent on the nature of the algorithm and data placement.

I am trying to get a better understanding of the question:
If it is true that the 'output' array is strictly for output, and you never do something like
output[i] = Foo(newVal, output[i]);
then, all elements in output[] are strictly write. If so, all you would ever need to 'reserve' is one cache-line. Isn't that correct?
In this scenario, all writes to 'output' generate cache-fills and could compete with the cachelines needed for 'input' arrays.
Wouldn't you want a cap on the cachelines 'output' can consume as opposed to reserving a certain number of lines.

I see two options, which may or may not work depending on the CPU you are targeting, and on your precise program flow:
If output is only written to and not read, you can use streaming-stores, i.e., a write instruction with a no-read hint, so it will not be fetched into cache.
You can use prefetching with a non-temporally-aligned (NTA) hint for input. I don't know how this is implemented in general, but I know for sure that on some Intel CPUs (e.g., the Xeon Phi) each hardware thread uses a specific way of cache for NTA data, i.e., with an 8-way cache 1/8th per thread.

I guess solution to this is hidden inside, the algorithm employed and the L1 cache size and cache line size.
Though I am not sure how much performance improvement we will see with this.
We can probably introduce artificial reads which cleverly dodge compiler and while execution, do not hurt computations as well. Single artificial read should fill cache lines as many needed to accommodate one page. Therefore, algorithm should be modified to compute blocks of output array. Something like the ones used in matrix multiplication of huge matrices, done using GPUs. They use blocks of matrices for computation and writing result.
As pointed out earlier, the write to output array should happen in a stream.
To bring in artificial read, we should initialize at compile time the output array at right places, once in each block, probably with 0 or 1.

Related

For-loop variables and cache misses? [duplicate]

What is the difference between "cache unfriendly code" and the "cache friendly" code?
How can I make sure I write cache-efficient code?

Preliminaries
On modern computers, only the lowest level memory structures (the registers) can move data around in single clock cycles. However, registers are very expensive and most computer cores have less than a few dozen registers. At the other end of the memory spectrum (DRAM), the memory is very cheap (i.e. literally millions of times cheaper) but takes hundreds of cycles after a request to receive the data. To bridge this gap between super fast and expensive and super slow and cheap are the cache memories, named L1, L2, L3 in decreasing speed and cost. The idea is that most of the executing code will be hitting a small set of variables often, and the rest (a much larger set of variables) infrequently. If the processor can't find the data in L1 cache, then it looks in L2 cache. If not there, then L3 cache, and if not there, main memory. Each of these "misses" is expensive in time.
(The analogy is cache memory is to system memory, as system memory is to hard disk storage. Hard disk storage is super cheap but very slow).
Caching is one of the main methods to reduce the impact of latency. To paraphrase Herb Sutter (cfr. links below): increasing bandwidth is easy, but we can't buy our way out of latency.
Data is always retrieved through the memory hierarchy (smallest == fastest to slowest). A cache hit/miss usually refers to a hit/miss in the highest level of cache in the CPU -- by highest level I mean the largest == slowest. The cache hit rate is crucial for performance since every cache miss results in fetching data from RAM (or worse ...) which takes a lot of time (hundreds of cycles for RAM, tens of millions of cycles for HDD). In comparison, reading data from the (highest level) cache typically takes only a handful of cycles.
In modern computer architectures, the performance bottleneck is leaving the CPU die (e.g. accessing RAM or higher). This will only get worse over time. The increase in processor frequency is currently no longer relevant to increase performance. The problem is memory access. Hardware design efforts in CPUs therefore currently focus heavily on optimizing caches, prefetching, pipelines and concurrency. For instance, modern CPUs spend around 85% of die on caches and up to 99% for storing/moving data!
There is quite a lot to be said on the subject. Here are a few great references about caches, memory hierarchies and proper programming:
Agner Fog's page. In his excellent documents, you can find detailed examples covering languages ranging from assembly to C++.
If you are into videos, I strongly recommend to have a look at Herb Sutter's talk on machine architecture (youtube) (specifically check 12:00 and onwards!).
Slides about memory optimization by Christer Ericson (director of technology # Sony)
LWN.net's article "What every programmer should know about memory"
Main concepts for cache-friendly code
A very important aspect of cache-friendly code is all about the principle of locality, the goal of which is to place related data close in memory to allow efficient caching. In terms of the CPU cache, it's important to be aware of cache lines to understand how this works: How do cache lines work?
The following particular aspects are of high importance to optimize caching:
Temporal locality: when a given memory location was accessed, it is likely that the same location is accessed again in the near future. Ideally, this information will still be cached at that point.
Spatial locality: this refers to placing related data close to each other. Caching happens on many levels, not just in the CPU. For example, when you read from RAM, typically a larger chunk of memory is fetched than what was specifically asked for because very often the program will require that data soon. HDD caches follow the same line of thought. Specifically for CPU caches, the notion of cache lines is important.
Use appropriate c++ containers
A simple example of cache-friendly versus cache-unfriendly is c++'s std::vector versus std::list. Elements of a std::vector are stored in contiguous memory, and as such accessing them is much more cache-friendly than accessing elements in a std::list, which stores its content all over the place. This is due to spatial locality.
A very nice illustration of this is given by Bjarne Stroustrup in this youtube clip (thanks to #Mohammad Ali Baydoun for the link!).
Don't neglect the cache in data structure and algorithm design
Whenever possible, try to adapt your data structures and order of computations in a way that allows maximum use of the cache. A common technique in this regard is cache blocking (Archive.org version), which is of extreme importance in high-performance computing (cfr. for example ATLAS).
Know and exploit the implicit structure of data
Another simple example, which many people in the field sometimes forget is column-major (ex. fortran,matlab) vs. row-major ordering (ex. c,c++) for storing two dimensional arrays. For example, consider the following matrix:
1 2
3 4
In row-major ordering, this is stored in memory as 1 2 3 4; in column-major ordering, this would be stored as 1 3 2 4. It is easy to see that implementations which do not exploit this ordering will quickly run into (easily avoidable!) cache issues. Unfortunately, I see stuff like this very often in my domain (machine learning). #MatteoItalia showed this example in more detail in his answer.
When fetching a certain element of a matrix from memory, elements near it will be fetched as well and stored in a cache line. If the ordering is exploited, this will result in fewer memory accesses (because the next few values which are needed for subsequent computations are already in a cache line).
For simplicity, assume the cache comprises a single cache line which can contain 2 matrix elements and that when a given element is fetched from memory, the next one is too. Say we want to take the sum over all elements in the example 2x2 matrix above (lets call it M):
Exploiting the ordering (e.g. changing column index first in c++):
M[0][0] (memory) + M[0][1] (cached) + M[1][0] (memory) + M[1][1] (cached)
= 1 + 2 + 3 + 4
--> 2 cache hits, 2 memory accesses
Not exploiting the ordering (e.g. changing row index first in c++):
M[0][0] (memory) + M[1][0] (memory) + M[0][1] (memory) + M[1][1] (memory)
= 1 + 3 + 2 + 4
--> 0 cache hits, 4 memory accesses
In this simple example, exploiting the ordering approximately doubles execution speed (since memory access requires much more cycles than computing the sums). In practice, the performance difference can be much larger.
Avoid unpredictable branches
Modern architectures feature pipelines and compilers are becoming very good at reordering code to minimize delays due to memory access. When your critical code contains (unpredictable) branches, it is hard or impossible to prefetch data. This will indirectly lead to more cache misses.
This is explained very well here (thanks to #0x90 for the link): Why is processing a sorted array faster than processing an unsorted array?
Avoid virtual functions
In the context of c++, virtual methods represent a controversial issue with regard to cache misses (a general consensus exists that they should be avoided when possible in terms of performance). Virtual functions can induce cache misses during look up, but this only happens if the specific function is not called often (otherwise it would likely be cached), so this is regarded as a non-issue by some. For reference about this issue, check out: What is the performance cost of having a virtual method in a C++ class?
Common problems
A common problem in modern architectures with multiprocessor caches is called false sharing. This occurs when each individual processor is attempting to use data in another memory region and attempts to store it in the same cache line. This causes the cache line -- which contains data another processor can use -- to be overwritten again and again. Effectively, different threads make each other wait by inducing cache misses in this situation.
See also (thanks to #Matt for the link): How and when to align to cache line size?
An extreme symptom of poor caching in RAM memory (which is probably not what you mean in this context) is so-called thrashing. This occurs when the process continuously generates page faults (e.g. accesses memory which is not in the current page) which require disk access.

In addition to #Marc Claesen's answer, I think that an instructive classic example of cache-unfriendly code is code that scans a C bidimensional array (e.g. a bitmap image) column-wise instead of row-wise.
Elements that are adjacent in a row are also adjacent in memory, thus accessing them in sequence means accessing them in ascending memory order; this is cache-friendly, since the cache tends to prefetch contiguous blocks of memory.
Instead, accessing such elements column-wise is cache-unfriendly, since elements on the same column are distant in memory from each other (in particular, their distance is equal to the size of the row), so when you use this access pattern you are jumping around in memory, potentially wasting the effort of the cache of retrieving the elements nearby in memory.
And all that it takes to ruin the performance is to go from
// Cache-friendly version - processes pixels which are adjacent in memory
for(unsigned int y=0; y<height; ++y)
{
for(unsigned int x=0; x<width; ++x)
{
... image[y][x] ...
}
}
to
// Cache-unfriendly version - jumps around in memory for no good reason
for(unsigned int x=0; x<width; ++x)
{
for(unsigned int y=0; y<height; ++y)
{
... image[y][x] ...
}
}
This effect can be quite dramatic (several order of magnitudes in speed) in systems with small caches and/or working with big arrays (e.g. 10+ megapixels 24 bpp images on current machines); for this reason, if you have to do many vertical scans, often it's better to rotate the image of 90 degrees first and perform the various analysis later, limiting the cache-unfriendly code just to the rotation.

Optimizing cache usage largely comes down to two factors.
Locality of Reference
The first factor (to which others have already alluded) is locality of reference. Locality of reference really has two dimensions though: space and time.
Spatial
The spatial dimension also comes down to two things: first, we want to pack our information densely, so more information will fit in that limited memory. This means (for example) that you need a major improvement in computational complexity to justify data structures based on small nodes joined by pointers.
Second, we want information that will be processed together also located together. A typical cache works in "lines", which means when you access some information, other information at nearby addresses will be loaded into the cache with the part we touched. For example, when I touch one byte, the cache might load 128 or 256 bytes near that one. To take advantage of that, you generally want the data arranged to maximize the likelihood that you'll also use that other data that was loaded at the same time.
For just a really trivial example, this can mean that a linear search can be much more competitive with a binary search than you'd expect. Once you've loaded one item from a cache line, using the rest of the data in that cache line is almost free. A binary search becomes noticeably faster only when the data is large enough that the binary search reduces the number of cache lines you access.
Time
The time dimension means that when you do some operations on some data, you want (as much as possible) to do all the operations on that data at once.
Since you've tagged this as C++, I'll point to a classic example of a relatively cache-unfriendly design: std::valarray. valarray overloads most arithmetic operators, so I can (for example) say a = b + c + d; (where a, b, c and d are all valarrays) to do element-wise addition of those arrays.
The problem with this is that it walks through one pair of inputs, puts results in a temporary, walks through another pair of inputs, and so on. With a lot of data, the result from one computation may disappear from the cache before it's used in the next computation, so we end up reading (and writing) the data repeatedly before we get our final result. If each element of the final result will be something like (a[n] + b[n]) * (c[n] + d[n]);, we'd generally prefer to read each a[n], b[n], c[n] and d[n] once, do the computation, write the result, increment n and repeat 'til we're done.2
Line Sharing
The second major factor is avoiding line sharing. To understand this, we probably need to back up and look a little at how caches are organized. The simplest form of cache is direct mapped. This means one address in main memory can only be stored in one specific spot in the cache. If we're using two data items that map to the same spot in the cache, it works badly -- each time we use one data item, the other has to be flushed from the cache to make room for the other. The rest of the cache might be empty, but those items won't use other parts of the cache.
To prevent this, most caches are what are called "set associative". For example, in a 4-way set-associative cache, any item from main memory can be stored at any of 4 different places in the cache. So, when the cache is going to load an item, it looks for the least recently used3 item among those four, flushes it to main memory, and loads the new item in its place.
The problem is probably fairly obvious: for a direct-mapped cache, two operands that happen to map to the same cache location can lead to bad behavior. An N-way set-associative cache increases the number from 2 to N+1. Organizing a cache into more "ways" takes extra circuitry and generally runs slower, so (for example) an 8192-way set associative cache is rarely a good solution either.
Ultimately, this factor is more difficult to control in portable code though. Your control over where your data is placed is usually fairly limited. Worse, the exact mapping from address to cache varies between otherwise similar processors. In some cases, however, it can be worth doing things like allocating a large buffer, and then using only parts of what you allocated to ensure against data sharing the same cache lines (even though you'll probably need to detect the exact processor and act accordingly to do this).
False Sharing
There's another, related item called "false sharing". This arises in a multiprocessor or multicore system, where two (or more) processors/cores have data that's separate, but falls in the same cache line. This forces the two processors/cores to coordinate their access to the data, even though each has its own, separate data item. Especially if the two modify the data in alternation, this can lead to a massive slowdown as the data has to be constantly shuttled between the processors. This can't easily be cured by organizing the cache into more "ways" or anything like that either. The primary way to prevent it is to ensure that two threads rarely (preferably never) modify data that could possibly be in the same cache line (with the same caveats about difficulty of controlling the addresses at which data is allocated).
Those who know C++ well might wonder if this is open to optimization via something like expression templates. I'm pretty sure the answer is that yes, it could be done and if it was, it would probably be a pretty substantial win. I'm not aware of anybody having done so, however, and given how little valarray gets used, I'd be at least a little surprised to see anybody do so either.
In case anybody wonders how valarray (designed specifically for performance) could be this badly wrong, it comes down to one thing: it was really designed for machines like the older Crays, that used fast main memory and no cache. For them, this really was a nearly ideal design.
Yes, I'm simplifying: most caches don't really measure the least recently used item precisely, but they use some heuristic that's intended to be close to that without having to keep a full time-stamp for each access.

Welcome to the world of Data Oriented Design. The basic mantra is to Sort, Eliminate Branches, Batch, Eliminate virtual calls - all steps towards better locality.
Since you tagged the question with C++, here's the obligatory typical C++ Bullshit. Tony Albrecht's Pitfalls of Object Oriented Programming is also a great introduction into the subject.

Just piling on: the classic example of cache-unfriendly versus cache-friendly code is the "cache blocking" of matrix multiply.
Naive matrix multiply looks like:
for(i=0;i<N;i++) {
for(j=0;j<N;j++) {
dest[i][j] = 0;
for( k=0;k<N;k++) {
dest[i][j] += src1[i][k] * src2[k][j];
}
}
}
If N is large, e.g. if N * sizeof(elemType) is greater than the cache size, then every single access to src2[k][j] will be a cache miss.
There are many different ways of optimizing this for a cache. Here's a very simple example: instead of reading one item per cache line in the inner loop, use all of the items:
int itemsPerCacheLine = CacheLineSize / sizeof(elemType);
for(i=0;i<N;i++) {
for(j=0;j<N;j += itemsPerCacheLine ) {
for(jj=0;jj<itemsPerCacheLine; jj+) {
dest[i][j+jj] = 0;
}
for( k=0;k<N;k++) {
for(jj=0;jj<itemsPerCacheLine; jj+) {
dest[i][j+jj] += src1[i][k] * src2[k][j+jj];
}
}
}
}
If the cache line size is 64 bytes, and we are operating on 32 bit (4 byte) floats, then there are 16 items per cache line. And the number of cache misses via just this simple transformation is reduced approximately 16-fold.
Fancier transformations operate on 2D tiles, optimize for multiple caches (L1, L2, TLB), and so on.
Some results of googling "cache blocking":
http://stumptown.cc.gt.atl.ga.us/cse6230-hpcta-fa11/slides/11a-matmul-goto.pdf
http://software.intel.com/en-us/articles/cache-blocking-techniques
A nice video animation of an optimized cache blocking algorithm.
http://www.youtube.com/watch?v=IFWgwGMMrh0
Loop tiling is very closely related:
http://en.wikipedia.org/wiki/Loop_tiling

Processors today work with many levels of cascading memory areas. So the CPU will have a bunch of memory that is on the CPU chip itself. It has very fast access to this memory. There are different levels of cache each one slower access ( and larger ) than the next, until you get to system memory which is not on the CPU and is relatively much slower to access.
Logically, to the CPU's instruction set you just refer to memory addresses in a giant virtual address space. When you access a single memory address the CPU will go fetch it. in the old days it would fetch just that single address. But today the CPU will fetch a bunch of memory around the bit you asked for, and copy it into the cache. It assumes that if you asked for a particular address that is is highly likely that you are going to ask for an address nearby very soon. For example if you were copying a buffer you would read and write from consecutive addresses - one right after the other.
So today when you fetch an address it checks the first level of cache to see if it already read that address into cache, if it doesn't find it, then this is a cache miss and it has to go out to the next level of cache to find it, until it eventually has to go out into main memory.
Cache friendly code tries to keep accesses close together in memory so that you minimize cache misses.
So an example would be imagine you wanted to copy a giant 2 dimensional table. It is organized with reach row in consecutive in memory, and one row follow the next right after.
If you copied the elements one row at a time from left to right - that would be cache friendly. If you decided to copy the table one column at a time, you would copy the exact same amount of memory - but it would be cache unfriendly.

It needs to be clarified that not only data should be cache-friendly, it is just as important for the code. This is in addition to branch predicition, instruction reordering, avoiding actual divisions and other techniques.
Typically the denser the code, the fewer cache lines will be required to store it. This results in more cache lines being available for data.
The code should not call functions all over the place as they typically will require one or more cache lines of their own, resulting in fewer cache lines for data.
A function should begin at a cache line-alignment-friendly address. Though there are (gcc) compiler switches for this be aware that if the the functions are very short it might be wasteful for each one to occupy an entire cache line. For example, if three of the most often used functions fit inside one 64 byte cache line, this is less wasteful than if each one has its own line and results in two cache lines less available for other usage. A typical alignment value could be 32 or 16.
So spend some extra time to make the code dense. Test different constructs, compile and review the generated code size and profile.

As #Marc Claesen mentioned that one of the ways to write cache friendly code is to exploit the structure in which our data is stored. In addition to that another way to write cache friendly code is: change the way our data is stored; then write new code to access the data stored in this new structure.
This makes sense in the case of how database systems linearize the tuples of a table and store them. There are two basic ways to store the tuples of a table i.e. row store and column store. In row store as the name suggests the tuples are stored row wise. Lets suppose a table named Product being stored has 3 attributes i.e. int32_t key, char name[56] and int32_t price, so the total size of a tuple is 64 bytes.
We can simulate a very basic row store query execution in main memory by creating an array of Product structs with size N, where N is the number of rows in table. Such memory layout is also called array of structs. So the struct for Product can be like:
struct Product
{
int32_t key;
char name[56];
int32_t price'
}
/* create an array of structs */
Product* table = new Product[N];
/* now load this array of structs, from a file etc. */
Similarly we can simulate a very basic column store query execution in main memory by creating an 3 arrays of size N, one array for each attribute of the Product table. Such memory layout is also called struct of arrays. So the 3 arrays for each attribute of Product can be like:
/* create separate arrays for each attribute */
int32_t* key = new int32_t[N];
char* name = new char[56*N];
int32_t* price = new int32_t[N];
/* now load these arrays, from a file etc. */
Now after loading both the array of structs (Row Layout) and the 3 separate arrays (Column Layout), we have row store and column store on our table Product present in our memory.
Now we move on to the cache friendly code part. Suppose that the workload on our table is such that we have an aggregation query on the price attribute. Such as
SELECT SUM(price)
FROM PRODUCT
For the row store we can convert the above SQL query into
int sum = 0;
for (int i=0; i<N; i++)
sum = sum + table[i].price;
For the column store we can convert the above SQL query into
int sum = 0;
for (int i=0; i<N; i++)
sum = sum + price[i];
The code for the column store would be faster than the code for the row layout in this query as it requires only a subset of attributes and in column layout we are doing just that i.e. only accessing the price column.
Suppose that the cache line size is 64 bytes.
In the case of row layout when a cache line is read, the price value of only 1(cacheline_size/product_struct_size = 64/64 = 1) tuple is read, because our struct size of 64 bytes and it fills our whole cache line, so for every tuple a cache miss occurs in case of a row layout.
In the case of column layout when a cache line is read, the price value of 16(cacheline_size/price_int_size = 64/4 = 16) tuples is read, because 16 contiguous price values stored in memory are brought into the cache, so for every sixteenth tuple a cache miss ocurs in case of column layout.
So the column layout will be faster in the case of given query, and is faster in such aggregation queries on a subset of columns of the table. You can try out such experiment for yourself using the data from TPC-H benchmark, and compare the run times for both the layouts. The wikipedia article on column oriented database systems is also good.
So in database systems, if the query workload is known beforehand, we can store our data in layouts which will suit the queries in workload and access data from these layouts. In the case of above example we created a column layout and changed our code to compute sum so that it became cache friendly.

Be aware that caches do not just cache continuous memory. They have multiple lines (at least 4) so discontinous and overlapping memory can often be stored just as efficiently.
What is missing from all the above examples is measured benchmarks. There are many myths about performance. Unless you measure it you do not know. Do not complicate your code unless you have a measured improvement.

Cache-friendly code is code that has been optimized to make efficient use of the CPU cache. This typically involves organizing data in a way that takes advantage of spatial and temporal locality, which refers to the idea that data that is accessed together is likely to be stored together in memory, and that data that is accessed frequently is likely to be accessed again in the near future.
There are several ways to make code cache-friendly, including:
Using contiguous memory layouts: By storing data in contiguous
blocks in memory, you can take advantage of spatial locality and
reduce the number of cache misses.
Using arrays: Arrays are a good choice for data structures when you
need to access data sequentially, as they allow you to take
advantage of temporal locality and keep hot data in the cache.
Using pointers carefully: Pointers can be used to access data that
is not stored contiguously in memory, but they can also lead to
cache misses if they are used excessively. If you need to use
pointers, try to use them in a way that takes advantage of spatial
and temporal locality to minimize cache misses.
Using compiler optimization flags: Most compilers have optimization
flags that can be used to optimize the use of the CPU cache. These
flags can help to minimize the number of cache misses and improve
the overall performance of your code.
It is important to note that the specific techniques that work best for optimizing the use of the CPU cache will depend on the specific requirements and constraints of your system. It may be necessary to experiment with different approaches to find the best solution for your needs.

Ring buffer: Disadvantages by moving through memory backwards?

This is probably language agnostic, but I'm asking from a C++ background.
I am hacking together a ring buffer for an embedded system (AVR, 8-bit). Let's assume:
const uint8_t size = /* something > 0 */;
uint8_t buffer[size];
uint8_t write_pointer;
There's this neat trick of &ing the write and read pointers with size-1 to do an efficient, branchless rollover if the buffer's size is a power of two, like so:
// value = buffer[write_pointer];
write_pointer = (write_pointer+1) & (size-1);
If, however, the size is not a power of two, the fallback would probably be a compare of the pointer (i.e. index) to the size and do a conditional reset:
// value = buffer[write_pointer];
if (++write_pointer == size) write_pointer ^= write_pointer;
Since the reset occurs rather rarely, this should be easy for any branch prediction.
This assumes though that the pointers need to be advancing foreward in memory. While this is intuitive, it requires a load of size in every iteration. I assume that reversing the order (advancing backwards) would yield better CPU instructions (i.e. jump if not zero) in the regular case, since size is only required during the reset.
// value = buffer[--write_pointer];
if (write_pointer == 0) write_pointer = size;
so
TL;DR: My question is: Does marching backwards through memory have a negative effect on the execution time due to cache misses (since memory cannot simply be read forward) or is this a valid optimization?

You have an 8 bit avr with a cache? And branch prediction?
How does forward or backwards matter as far as caches are concerned? The hit or miss on a cache is anywhere within the cache line, beginning, middle, end, random, sequential, doesnt matter. You can work from the back to the front or the front to back of a cache line, it is the same cost (assuming all other things held constant) the first mist causes a fill, then that line is in cache and you can access any of the items in any pattern at a lower latency until evicted.
On a microcontroller like that you want to make the effort, even at the cost of throwing away some memory, to align a circular buffer such that you can mask. There is no cache the instruction fetches are painful because they are likely from a flash that may be slower than the processor clock rate, so you do want to reduce instructions executed, or make the execution a little more deterministic (same number of instructions every loop until that task is done). There might be a pipeline that would appreciate the masking rather than an if-then-else.
TL;DR: My question is: Does marching backwards through memory have a
negative effect on the execution time due to cache misses (since
memory cannot simply be read forward) or is this a valid optimization?
The cache doesnt care, a miss from any item in the line causes a fill, once in the cache any pattern of access, random, sequential forward or back, or just pounding on the same address, takes less time being in faster memory. Until evicted. Evictions wont come from neighboring cache lines they will come from cache lines larger powers of two away, so whether the next cache line you pull is at a higher address or lower, the cost is the same.

Does marching backwards through memory have a negative effect on the
execution time due to cache misses (since memory cannot simply be read
forward)
Why do you think that you will have a cache miss? You will have a cache miss if you try to access outside the cache (forward or backward).

There are a number of points which need clarification:
That size needs to be loaded each and every time (it's const, therefore ought to be immutable)
That your code is correct. For example in a 0-based index (as used in C/C++ for array access) the value 0' is a valid pointer into the buffer, and the valuesize' is not. Similarly there is no need to xor when you could simply assign 0, equally a modulo operator will work (writer_pointer = (write_pointer +1) % size).
What happens in the general case with virtual memory (i.e. the logically adjacent addresses might be all over the place in the real memory map), paging (stuff may well be cached on a page-by-page basis) and other factors (pressure from external processes, interrupts)
In short: this is the kind of optimisation that leads to more feet related injuries than genuine performance improvements. Additionally it is almost certainly the case that you get much, much better gains using vectorised code (SIMD).
EDIT: And in interpreted languages or JIT'ed languages it might be a tad optimistic to assume you can rely on the use of JNZ and others at all. At which point the question is, how much of a difference is there really between loading size and comparing versus comparing with 0.

As usual, when performing any form of manual code optimization, you must have extensive in-depth knowledge of the specific hardware. If you don't have that, then you should not attempt manual optimizations, end of story.
Thus, your question is filled with various strange assumptions:
First, you assume that write_pointer = (write_pointer+1) & (size-1) is more efficient than something else, such as the XOR example you posted. You are just guessing here, you will have to disassemble the code and see which yields the less CPU instructions.
Because, when writing code for a tiny, primitive 8-bit MCU, there is not much going on in the core to speed up your code. I don't know AVR8, but it seems likely that you have a small instruction pipe and that's it. It seems quite unlikely that you have much in the way of branch prediction. It seems very unlikely that you have a data and/or instruction cache. Read the friendly CPU core manual.
As for marching backwards through memory, it will unlikely have any impact at all on your program's performance. On old, crappy compilers you would get slightly more efficient code if the loop condition was a comparison vs zero instead of a value. On modern compilers, this shouldn't be an issue. As for cache memory concerns, I doubt you have any cache memory to worry about.
The best way to write efficient code on 8-bit MCUs is to stick to 8-bit arithmetic whenever possible and to avoid 32-bit arithmetic like the plague. And forget you ever heard about something called floating point. This is what will make your program efficient, you are unlikely to find any better way to manually optimize your code.

What is a "cache-friendly" code?

What is the difference between "cache unfriendly code" and the "cache friendly" code?
How can I make sure I write cache-efficient code?

In addition to #Marc Claesen's answer, I think that an instructive classic example of cache-unfriendly code is code that scans a C bidimensional array (e.g. a bitmap image) column-wise instead of row-wise.
Elements that are adjacent in a row are also adjacent in memory, thus accessing them in sequence means accessing them in ascending memory order; this is cache-friendly, since the cache tends to prefetch contiguous blocks of memory.
Instead, accessing such elements column-wise is cache-unfriendly, since elements on the same column are distant in memory from each other (in particular, their distance is equal to the size of the row), so when you use this access pattern you are jumping around in memory, potentially wasting the effort of the cache of retrieving the elements nearby in memory.
And all that it takes to ruin the performance is to go from
// Cache-friendly version - processes pixels which are adjacent in memory
for(unsigned int y=0; y<height; ++y)
{
for(unsigned int x=0; x<width; ++x)
{
... image[y][x] ...
}
}
to
// Cache-unfriendly version - jumps around in memory for no good reason
for(unsigned int x=0; x<width; ++x)
{
for(unsigned int y=0; y<height; ++y)
{
... image[y][x] ...
}
}
This effect can be quite dramatic (several order of magnitudes in speed) in systems with small caches and/or working with big arrays (e.g. 10+ megapixels 24 bpp images on current machines); for this reason, if you have to do many vertical scans, often it's better to rotate the image of 90 degrees first and perform the various analysis later, limiting the cache-unfriendly code just to the rotation.

Welcome to the world of Data Oriented Design. The basic mantra is to Sort, Eliminate Branches, Batch, Eliminate virtual calls - all steps towards better locality.
Since you tagged the question with C++, here's the obligatory typical C++ Bullshit. Tony Albrecht's Pitfalls of Object Oriented Programming is also a great introduction into the subject.

Just piling on: the classic example of cache-unfriendly versus cache-friendly code is the "cache blocking" of matrix multiply.
Naive matrix multiply looks like:
for(i=0;i<N;i++) {
for(j=0;j<N;j++) {
dest[i][j] = 0;
for( k=0;k<N;k++) {
dest[i][j] += src1[i][k] * src2[k][j];
}
}
}
If N is large, e.g. if N * sizeof(elemType) is greater than the cache size, then every single access to src2[k][j] will be a cache miss.
There are many different ways of optimizing this for a cache. Here's a very simple example: instead of reading one item per cache line in the inner loop, use all of the items:
int itemsPerCacheLine = CacheLineSize / sizeof(elemType);
for(i=0;i<N;i++) {
for(j=0;j<N;j += itemsPerCacheLine ) {
for(jj=0;jj<itemsPerCacheLine; jj+) {
dest[i][j+jj] = 0;
}
for( k=0;k<N;k++) {
for(jj=0;jj<itemsPerCacheLine; jj+) {
dest[i][j+jj] += src1[i][k] * src2[k][j+jj];
}
}
}
}
If the cache line size is 64 bytes, and we are operating on 32 bit (4 byte) floats, then there are 16 items per cache line. And the number of cache misses via just this simple transformation is reduced approximately 16-fold.
Fancier transformations operate on 2D tiles, optimize for multiple caches (L1, L2, TLB), and so on.
Some results of googling "cache blocking":
http://stumptown.cc.gt.atl.ga.us/cse6230-hpcta-fa11/slides/11a-matmul-goto.pdf
http://software.intel.com/en-us/articles/cache-blocking-techniques
A nice video animation of an optimized cache blocking algorithm.
http://www.youtube.com/watch?v=IFWgwGMMrh0
Loop tiling is very closely related:
http://en.wikipedia.org/wiki/Loop_tiling

It needs to be clarified that not only data should be cache-friendly, it is just as important for the code. This is in addition to branch predicition, instruction reordering, avoiding actual divisions and other techniques.
Typically the denser the code, the fewer cache lines will be required to store it. This results in more cache lines being available for data.
The code should not call functions all over the place as they typically will require one or more cache lines of their own, resulting in fewer cache lines for data.
A function should begin at a cache line-alignment-friendly address. Though there are (gcc) compiler switches for this be aware that if the the functions are very short it might be wasteful for each one to occupy an entire cache line. For example, if three of the most often used functions fit inside one 64 byte cache line, this is less wasteful than if each one has its own line and results in two cache lines less available for other usage. A typical alignment value could be 32 or 16.
So spend some extra time to make the code dense. Test different constructs, compile and review the generated code size and profile.

As #Marc Claesen mentioned that one of the ways to write cache friendly code is to exploit the structure in which our data is stored. In addition to that another way to write cache friendly code is: change the way our data is stored; then write new code to access the data stored in this new structure.
This makes sense in the case of how database systems linearize the tuples of a table and store them. There are two basic ways to store the tuples of a table i.e. row store and column store. In row store as the name suggests the tuples are stored row wise. Lets suppose a table named Product being stored has 3 attributes i.e. int32_t key, char name[56] and int32_t price, so the total size of a tuple is 64 bytes.
We can simulate a very basic row store query execution in main memory by creating an array of Product structs with size N, where N is the number of rows in table. Such memory layout is also called array of structs. So the struct for Product can be like:
struct Product
{
int32_t key;
char name[56];
int32_t price'
}
/* create an array of structs */
Product* table = new Product[N];
/* now load this array of structs, from a file etc. */
Similarly we can simulate a very basic column store query execution in main memory by creating an 3 arrays of size N, one array for each attribute of the Product table. Such memory layout is also called struct of arrays. So the 3 arrays for each attribute of Product can be like:
/* create separate arrays for each attribute */
int32_t* key = new int32_t[N];
char* name = new char[56*N];
int32_t* price = new int32_t[N];
/* now load these arrays, from a file etc. */
Now after loading both the array of structs (Row Layout) and the 3 separate arrays (Column Layout), we have row store and column store on our table Product present in our memory.
Now we move on to the cache friendly code part. Suppose that the workload on our table is such that we have an aggregation query on the price attribute. Such as
SELECT SUM(price)
FROM PRODUCT
For the row store we can convert the above SQL query into
int sum = 0;
for (int i=0; i<N; i++)
sum = sum + table[i].price;
For the column store we can convert the above SQL query into
int sum = 0;
for (int i=0; i<N; i++)
sum = sum + price[i];
The code for the column store would be faster than the code for the row layout in this query as it requires only a subset of attributes and in column layout we are doing just that i.e. only accessing the price column.
Suppose that the cache line size is 64 bytes.
In the case of row layout when a cache line is read, the price value of only 1(cacheline_size/product_struct_size = 64/64 = 1) tuple is read, because our struct size of 64 bytes and it fills our whole cache line, so for every tuple a cache miss occurs in case of a row layout.
In the case of column layout when a cache line is read, the price value of 16(cacheline_size/price_int_size = 64/4 = 16) tuples is read, because 16 contiguous price values stored in memory are brought into the cache, so for every sixteenth tuple a cache miss ocurs in case of column layout.
So the column layout will be faster in the case of given query, and is faster in such aggregation queries on a subset of columns of the table. You can try out such experiment for yourself using the data from TPC-H benchmark, and compare the run times for both the layouts. The wikipedia article on column oriented database systems is also good.
So in database systems, if the query workload is known beforehand, we can store our data in layouts which will suit the queries in workload and access data from these layouts. In the case of above example we created a column layout and changed our code to compute sum so that it became cache friendly.

Be aware that caches do not just cache continuous memory. They have multiple lines (at least 4) so discontinous and overlapping memory can often be stored just as efficiently.
What is missing from all the above examples is measured benchmarks. There are many myths about performance. Unless you measure it you do not know. Do not complicate your code unless you have a measured improvement.

Will a modern processor (like the i7) follow pointers and prefetch their data while iterating over a list of them?

I want to learn how to write better code that takes advantage of the CPU's cache. Working with contiguous memory seems to be the ideal situation. That being said, I'm curious if there are similar improvements that can be made with non-contiguous memory, but with an array of pointers to follow, like:
struct Position {
int32_t x,y,z;
}
...
std::vector<Position*> posPointers;
...
updatePosition () {
for (uint32_t i = 0; i < posPointers.size(); i++) {
Position& nextPos = *posPointers[i];
nextPos.x++;
nextPos.y++;
nextPos.z++;
}
}
This is just some rough mock-up code, and for the sake of learning this properly let's just say that all Position structs were created randomly all over the heap.
Can modern, smart, processors such as Intel's i7 look ahead and see that it's going to need X_ptr's data very shortly? Would the following line of code help?
... // for loop
Position& nextPos1 = *posPointers[i];
Position& nextPos2 = *posPointers[i+1];
Position& nextPos3 = *posPointers[i+2];
Position& nextPos4 = *posPointers[i+3];
... // Work on data here
I had read some presentation slides that seemed to indicate code like this would cause the processor to prefetch some data. Is that true? I am aware there are non-standard, platform specific, ways to call prefetching like __builtin_prefetch, but throwing that all over the place just seems like an ugly premature optimization. I am looking for a way I can subconsciously write cache-efficient code.

I know you didn't ask (and probably don't need a sermon on proper treatment of caches, but I thought I'd contribute my two cents anyways. Note that all this only applies in hot code. Remember that premature optimization is the root of all evil.
As has been pointed out in the comments, the best way is to have containers of actual data. Generally speaking, flat data structures are much preferable to "pointer spaghetti", even if you have to duplicate some data and/or pay a price for resizing/moving/defragmenting your data structures.
And as you know, flat data structures (e.g. an array of data) only pay off if you access them linearly and sequentially most of the time.
But this strategy may not always be usable. In lieu of actual linear data, you can use other strategies like employing pool allocators, and iterating over the pools themselves, instead of over the vector holding the pointers. This of course has its own disadvantages and can be a bit more complicated.
I'm sure you know this already, but it bears mentioning again that one of the most effective techniques for getting most out of your cache is having smaller data! In the above code, if you can get away with int16_t instead of int32_t, you should definitely do so. You should pack your many bools and flags and enums into bit-fields, use indexes instead of pointers (specially on 64-bit systems,) use fixed-size hash values in your data structures instead of strings, etc.
Now, about your main question that whether the processor can follow random pointers around and bring the data into cache before they are needed. To a very limited extent, this does happen. As probably you know, modern CPUs employ a lot of tricks to increase their speed (i.e. increase their instruction retire rate.) Tricks like having a store buffer, out-of-order execution, superscalar pipelines, multiple functional units of every kind, branch prediction, etc. Most of the time, these tricks all just help the CPU to keep executing instructions, even if the current instructions have stalled or take too long to finish. For memory loads (which is the slowest thing to do, iff the data is not in the cache,) this means that the CPU should get to the instruction as soon as possible, calculate the address, and request the data from the memory controller. However, the memory controller can have only a very limited number of outstanding requests (usually two these days, but I'm not sure.) This means that even if the CPU did very sophisticated stuff to look ahead into other memory locations (e.g. the elements of your posPointers vector) and deduce that these are the addresses of new data that your code is going to need, it couldn't get very far ahead because the memory controller can have only so many requests pending.
In any case, AFAIK, I don't think that CPUs actually do that yet. Note that this is a hard case, because the addresses of your randomly distributed memory locations are themselves in memory (as opposed to being in a register or calculable from the contents of a register.) And if the CPUs did it, it wouldn't have that much of an effect anyways because of memory interface limitations.
The prefetching technique you mentioned seems valid to me and I've seen it used, but it only yields noticeable effect if your CPU has something to do while waiting for the future data to arrive. Incrementing three integers takes a lot less time than loading 12 bytes from memory (loading one cache line, actually) and therefor it won't mean much for the execution time. But if you had something worthwhile and more heavyweight to overlay on top of the memory prefetches (e.g. calculating a complex function that doesn't require data from memory!) then you could get very nice speedups. You see, the time to go through the above loop is essentially the sum of the time of all the cache misses; and you are getting the coordinate increments and the loop bookkeeping for free. So, you'd have won more if the free stuff were more valuable!

Modern processors have hardware prefetching mechanisms: Intel Hardware prefetcher. They infer stride access patterns to memory and prefetch memory locations that are likely to be accessed in the near future.
However in the case of totally random pointer chasing such techniques can not help. The processor does not know that the program in execution is performing pointer chasing, therefore it can not prefetch accordingly. In such cases hardware mechanisms are detrimental for performance as they would prefetch values that are not likely to be used.
The best that you can do is try to organize you data structures in memory in such a way that accesses to contiguous portions of memory are more likely.

C++ 'small' optimization behaving strangely

I'm trying to optimize 'in the small' on a project of mine.
There's a series of array accesses that are individually tiny, but profiling has revealed that these array accesses are where the vast majority of my program is spending its time. So, time to make things faster, since the program takes about an hour to run.
I've moved the following type of access:
const float theValOld1 = image(x, y, z);
const float theValOld2 = image(x, y+1, z);
const float theValOld3 = image(x, y-1, z);
const float theValOld4 = image(x-1, y, z);
etc, for 28 accesses around the current pixel.
where image thunks down to
float image(const int x, const int y, const int z) const {
return data[z*xsize*ysize + y*xsize + x];
}
and I've replaced it with
const int yindex = y*xsize;
const int zindex = z*xsize*ysize;
const float* thePtr = &(data[z*xsize*ysize + y*xsize + x]);
const float theVal1 = *(thePtr);
const float theVal2 = *(thePtr + yindex);
const float theVal3 = *(thePtr - yindex);
const float theVal4 = *(thePtr - 1);
etc, for the same number of operations.
I would expect that, if the compiler were totally awesome, that this change would do nothing to the speed. If the compiler is not awesome, then I'd say that the second version should be faster, if only because I'm avoiding the implict pointer addition that comes with the [] thunk, as well as removing the multiplications for the y and z indeces.
To make it even more lopsided, I've moved the z operations into their own section that only gets hit if zindex != 0, so effectively, the second version only has 9 accesses. So by that metric, the second version should definitely be faster.
To measure performance, I'm using QueryPerformanceCounter.
What's odd to me is that the order of operations matters!
If I leave the operations as described and compare the timings (as well as the results, to make sure that the same value is calculated after optimization), then the older code takes about 45 ticks per pixel and the new code takes 10 ticks per pixel. If I reverse the operations, then the old code takes about 14 ticks per pixel and the new code takes about 30 ticks per pixel (with lots of noise in there, these are averages over about 100 pixels).
Why should the order matter? Is there caching or something happening? The variables are all named different things, so I wouldn't think that would matter. If there is some caching happening, is there any way I can take advantage of it from pixel to pixel?
Corollary: To compare speed, I'm supposing that the right way is to run the two versions independently of one another, and then compare the results from different runs. I'd like to have the two comparisons next to each other make sense, but there's obviously something happening here that prevents that. Is there a way to salvage this side-by-side run to get a reasonable speed comparison from a single run, so I can make sure that the results are identical as well (easily)?
EDIT: To clarify.
I have both new and old code in the same function, so I can make sure that the results are identical.
If I run old code and then new code, new code runs faster than old.
If I run new code and then old code, old code runs faster than new.
The z hit is required by the math, and the if statement cannot be removed, and is present in both. For the new code, I've just moved more z-specific code into the z section, and the test code I'm using is 100% 2D. When I move to 3D testing, then I'm sure I'll see more of the effect of branching.

You may (possibly) be running into some sort of readahead or cacheline boundary issue. Generally speaking, when you load a single value and it isn't "hot" (in cache), the CPU will pull in a cache line (32, 64, or 128 bytes are pretty typical, depending on processor). Subsequent reads to the same line will be much faster.
If you change the order of operations, you may just be seeing stalls due to how the lines are being loaded and evicted.
The best way to figure something like this out is to open "Disassembly" view and spend some quality time with your processor's reference manual.
If you're lucky, the changes that the code reordering causes will be obvious (the compiler may be generating extra instructions or branches). Less lucky, it will be a stall somewhere in the processor -- during the decode pipeline or due to a memory fetch...
A good profiler that can count stalls and cache misses may help here too (AMD has CodeAnalyst, for example).
If you're not under a time crunch, it's really worthwhile to dive into the disasm -- at the very least, you'll probably end up learning something you didn't know before about how your CPU, machine architecture, compiler, libraries, etc work. (I almost always end up going "huh" when studying disasm.)

If both the new and old versions run on the same data array, then yes, the last run will almost certainly get a speed bump due to caching. Even if the code is different, it'll be accessing data that was already touched by the previous version, so depending on data size, it might be in L1 cache, will probably be in L2 cache, and if a L3 cache exists, almost certainly in that. There'll probably also be some overlap in the code, meaning that the instruction cache will further boost performance of the second version.
A common way to benchmark is to run the algorithm once first, without timing it, simply to ensure that that's going to be cached, is cached, and then run it again a large number of times with timing enabled. (Don't trust a single execution, unless it takes at least a second or two. Otherwise small variations in system load, cache, OS interrupts or page faults can cause the measured time to vary). To eliminate the noise, measure the combined time taken for several runs of the algorithm, and obviously with no output in between. The fact that you're seeing spikes of 3x the usual time means that you're measuring at a way too fine-grained level. Which basically makes your timings useless.
Why should the order matter? Is there caching or something happening? The variables are all named different things, so I wouldn't think that would matter. If there is some caching happening, is there any way I can take advantage of it from pixel to pixel?
The naming doesn't matter. When the code is compiled, variables are translated into memory addresses or register id's. But when you run through your image array, you're loading it all into CPU cache, so it can be read faster the next time you run through it.
And yes, you can and should take advantage of it.
The computer tries very hard to exploit spatial and temporal locality -- that is, if you access a memory address X at time T, it assumes that you're going to need address X+1 very soon (spatial locality), and that you'll probably also need X again, at time T+1 (temporal locality). It tries to speed up those cases in every way possible (primarily by caching), so you should try to exploit it.
To make it even more lopsided, I've moved the z operations into their own section that only gets hit if zindex != 0, so effectively, the second version only has 9 accesses. So by that metric, the second version should definitely be faster.
I don't know where you placed that if statement, but if it's in a frequently evaluated block of code, the cost of the branch might hurt you more than you're saving. Branches can be expensive, and they inhibit the compiler's and CPU's ability to reorder and schedule instructions. So you may be better off without it. You should probably do this as a separate optimization that can be benchmarked in isolation.
I don't know which algorithm you're implementing, but I'm guessing you need to do this for every pixel?
If so, you should try to cache your lookups. Once you've got image(x, y, z), that'll be the next pixel's image(x+1, y, z), so cache it in the loop so the next pixel won't have to look it up from scratch. That would potentially allow you to reduce your 9 accesses in the X/Y plane down to three (use 3 cached values from the last iteration, 3 from the one before it, and 3 we just loaded in this iteration)
If you're updating the value of each pixel as a result of its neighbors values, a better approach may be to run the algorithm in a checkerboard pattern. Update every other pixel in the first iteration, using only values from their neighbors (which you're not updating), and then run a second pass where you update the pixels you read from before, based on the values of the pixels you updated before. This allows you to eliminate dependencies between neighboring pixels, so their evaluation can be pipelined and parallelized efficiently.
In the loop that performs all the lookups, unroll it a few times, and try to place all the memory reads at the top, and all the computations further down, to give the CPU a chance to overlap the two (since data reads are a lot slower, get them started, and while they're running, the CPU will try to find other instructions it can evaluate).
For any constant values, try to precompute them as much as possible. (rather than z*xsize*ysize, precompute xsize*ysize, and multiply z with the result of that.
Another thing that may help is to prefer local variables over globals or class members. You may gain something simply by, at the start of the function, making local copies of the class members you're going to need. The compiler can always optimize the extra variables out again if it wants to, but you make it clear that it shouldn't worry about underlying changes to the object state (which might otherwise force it to reload the members every time you access them)
And finally, study the generated assembly in detail. See where it's performing unnecessary store/loads, where operations are being repeated even though they could be cached, and where the ordering of instructions is inefficient, or where the compiler fails to inline as much as you'd hoped.
I honestly wouldn't expect your changes to the lookup function to have much effect though. An array access with the operator[] is easily convertible to the equivalent pointer arithmetic, and the compiler can optimize that pretty efficiently, as long as the offsets you're adding don't change.
Usually, the key to low-level optimizations is, somewhat ironically, not to look at individual lines of code, but at whole functions, and at loops. You need a certain amount of instructions in a block so you have something to work with, since a lot of optimizations deal with breaking dependencies between chains of instructions, reordering to hide instruction latency, and with caching individual values to avoid memory load/stores. That's almost impossible to do on individual array lookups, but there's almost certainly a lot gained if you consider a couple of pixels at a time.
Of course, as with almost all microoptimizations, there are no always true answers. Some of the above might be useful to you, or they might not.
If you tell us more about the access pattern (which pixels are you accessing, is there any required order, and are you just reading, or writing as well? If writing, when and where are the updated values used?)
If you give us a bit more information, we'll be able to offer much more specific (and likely to be effective) suggestions

When optimising, examining the data access pattern is essential.
for example:
assuming a width of 240
for a pixel at <x,y,z> 10,10,0
with original access pattern would give you:
a. data[0+ 10*240 + 10] -> data[2410]
b. data[0+ 11*240 + 10] -> data[2650]
c. data[0+ 9*240 + 10] -> data[2170]
d. data[0+ 10*240 + 9] -> data[2409]
Notice the indices which are in arbitrary order.
Memory controller makes aligned accesses to the main memory to fill the cache lines.
If you order your operations so that accesses are to ascending memory addresses
(e.g. c,d,a,b ) then the memory controller would be able to stream the data in to
the cache lines.
Missing cache on read would be expensive as it has to search down the cache
hierarchy down to the main memory. Main memory access could be 100x slower than
cache. Minimising main memory access will improve the speed of your operation.

To make it even more lopsided, I've moved the z operations into their own section that only gets hit if zindex != 0, so effectively, the second version only has 9 accesses. So by that metric, the second version should definitely be faster.
Did you actually measure that? Because I'd be pretty surprised if that were true. An if statement in the inner loop of your program can add a surprising amount of overhead -- see Is "IF" expensive?. I'd be willing to bet that the overhead of the extra multiply is a lot less than the overhead of the branching, unless z happens to be zero 99% of the time.
What's odd to me is that the order of operations matters!
The order of what operations? It's not clear to me what you're reordering here. Please give some more snippets of what you're trying to do.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js