Roofline model: calculating operational intensity


Say I have a toy loop like this:
float x[N];
float y[N];
for (int i = 1; i < N-1; i++)
    y[i] = a*(x[i-1] - x[i] + x[i+1]);
And I assume my cache line is 64 bytes (i.e. big enough). Then I will have (per iteration) basically 2 accesses to RAM and 3 FLOP:
1 (cached) read access: loading all 3 x[i-1], x[i], x[i+1]
1 write access: storing y[i]
3 FLOP (1 mul, 1 add, 1 sub)
The operational intensity is ergo
OI = 3 FLOP/(2 * 4 BYTE)
Now what happens if I do something like this:
float x[N];
for (int i = 1; i < N-1; i++)
    x[i] = a*(x[i-1] - x[i] + x[i+1]);
Note that there is no y anymore. Does it mean now that I have a single RAM access
1 (cached) read/write: loading x[i-1], x[i], x[i+1], storing x[i]
or still 2 RAM accesses
1 (cached) read: loading x[i-1], x[i], x[i+1]
1 (cached) write: storing x[i]
Because the operational intensity OI would be different in either case. Can anyone tell something about this? Or maybe clarify some things. Thanks

Disclaimer: I've never heard of the roofline performance model until today. As far as I can tell, it attempts to calculate a theoretical bound on the "arithmetic intensity" of an algorithm, which is the number of FLOPs per byte of data accessed. Such a measure may be useful for comparing similar algorithms as N grows large, but is not very helpful for predicting real-world performance.
As a general rule of thumb, modern processors can execute instructions much more quickly than they can fetch/store data (this becomes drastically more pronounced as the data starts to grow larger than the size of the caches). So contrary to what one might expect, a loop with higher arithmetic intensity may run much faster than a loop with lower arithmetic intensity; what matters most as N scales is the total amount of data touched (this will hold true as long as memory remains significantly slower than the processor, as is true in common desktop and server systems today).
In short, x86 CPUs are unfortunately too complex to be accurately described with such a simple model. An access to memory goes through several layers of caching (typically L1, L2, and L3) before hitting RAM. Maybe all your data fits in L1 -- the second time you run your loop(s) there could be no RAM accesses at all.
And there's not just the data cache. Don't forget that code is in memory too and has to be loaded into the instruction cache. Each read/write is also done from/to a virtual address, which is supported by the hardware TLB (that can in extreme cases trigger a page fault and, say, cause the OS to write a page to disk in the middle of your loop). All of this is assuming your program is hogging the hardware all to itself (in non-realtime OSes this is simply not the case, as other processes and threads are competing for the same limited resources).
Finally, the execution itself is not (directly) done with memory reads and writes, but rather the data is loaded into registers first (then the result is stored).
How the compiler allocates registers, if it attempts loop unrolling, auto-vectorization, the instruction scheduling model (interleaving instructions to avoid data dependencies between instructions) etc. will also all affect the actual throughput of the algorithm.
So, finally, depending on the code produced, the CPU model, the amount of data processed, and the state of various caches, the latency of the algorithm will vary by orders of magnitude. Thus, the operational intensity of a loop cannot be determined by inspecting the code (or even the assembly produced) alone, since there are many other (non-linear) factors in play.
To address your actual question, though: as far as I can see from the definition outlined here, the second loop would count as a single additional 4-byte access per iteration on average, so its OI would be θ(3N FLOPs / 4N bytes). Intuitively, this makes sense because the cache already has the data loaded, and the write can change the cache directly instead of going back to main memory (the data does eventually have to be written back, but that requirement is unchanged from the first loop).
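The arithmetic behind the two numbers can be written down as a small sketch. The function names are made up for illustration, and the byte counts assume sizeof(float) == 4 and the per-iteration accounting used in the question and in this answer, not a measurement.

```cpp
#include <cassert>

// Operational intensity: FLOPs performed per byte moved to or from main memory.
// Both arguments are per-iteration counts. (Illustrative helper, not a library API.)
double operational_intensity(double flops, double bytes) {
    assert(bytes > 0.0);
    return flops / bytes;
}

// First loop: one new float of x is read and one float of y is written per iteration.
double oi_two_arrays() {
    return operational_intensity(3.0, 2.0 * sizeof(float));
}

// Second loop (in place), counted as one 4-byte access per iteration as argued above.
double oi_in_place() {
    return operational_intensity(3.0, 1.0 * sizeof(float));
}
```

Under those assumptions the first loop comes out at 3/8 FLOP per byte and the in-place loop at 3/4 FLOP per byte.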

Related

Missing small primes in C++ atomic prime sieve

I try to develop a concurrent prime sieve implementation using C++ atomics. However, when core_count is increased, more and more small primes are missing from the output.
My guess is that the producer threads overwrite each others' results, before being read by the consumer. Even though the construction should protect against it by using the magic number 0 to indicate it's ready to accept the next prime. It seems the compare_exchange_weak is not really atomic in this case.
Things I've tried:
Replacing compare_exchange_weak with compare_exchange_strong
Changing the memory_order to anything else.
Swapping around the 'crossing-out' and the write.
I have tested it with Microsoft Visual Studio 2019, Clang 12.0.1 and GCC 11.1.0, but to no avail.
Any ideas on this are welcome, including some best practices I might have missed.
#include <algorithm>
#include <atomic>
#include <future>
#include <iostream>
#include <iterator>
#include <thread>
#include <vector>
int main() {
  using namespace std;
  constexpr memory_order order = memory_order_relaxed;
  atomic<int> output{0};
  vector<atomic_bool> sieve(10000);
  for (auto& each : sieve) atomic_init(&each, false);
  atomic<unsigned> finished_worker_count{0};
  auto const worker = [&output, &sieve, &finished_worker_count]() {
    for (auto current = next(sieve.begin(), 2); current != sieve.end();) {
      current = find_if(current, sieve.end(), [](atomic_bool& value) {
        bool untrue = false;
        return value.compare_exchange_strong(untrue, true, order);
      });
      if (current == sieve.end()) break;
      int const claimed = static_cast<int>(distance(sieve.begin(), current));
      int zero = 0;
      while (!output.compare_exchange_weak(zero, claimed, order))
        ;
      for (auto product = 2 * claimed; product < static_cast<int>(sieve.size());
           product += claimed)
        sieve[product].store(true, order);
    }
    finished_worker_count.fetch_add(1, order);
  };
  const auto core_count = thread::hardware_concurrency();
  vector<future<void>> futures;
  futures.reserve(core_count);
  generate_n(back_inserter(futures), core_count,
             [&worker]() { return async(worker); });
  vector<int> result;
  while (finished_worker_count < core_count) {
    auto current = output.exchange(0, order);
    if (current > 0) result.push_back(current);
  }
  sort(result.begin(), result.end());
  for (auto each : result) cout << each << " ";
  cout << '\n';
  return 0;
}

compare_exchange_weak will update (change) the "expected" value (the local variable zero) if the update cannot be made. This will allow overwriting one prime number with another if the main thread doesn't quickly handle the first prime.
You'll want to reset zero back to zero before rechecking:
while (!output.compare_exchange_weak(zero, claimed, order))
    zero = 0;
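A small single-threaded sketch of why the reset matters. The helper names are hypothetical; the first function uses compare_exchange_strong only so that the demonstration cannot fail spuriously.

```cpp
#include <atomic>
#include <cassert>

// Shows what `expected` holds after a CAS that fails because the slot is occupied.
int expected_after_failed_cas() {
    std::atomic<int> slot{5};  // the slot still holds an unread value
    int expected = 0;
    bool ok = slot.compare_exchange_strong(expected, 7);
    assert(!ok);               // 5 != 0, so nothing was stored
    (void)ok;
    return expected;           // now 5: a retry would compare against 5 and overwrite it
}

// Publish `value` only once the slot holds 0, resetting `expected` before every retry.
void publish(std::atomic<int>& slot, int value) {
    int expected = 0;
    while (!slot.compare_exchange_weak(expected, value)) {
        expected = 0;  // undo the update made by the failed CAS
    }
}
```

Without the reset, the second attempt compares against the value that is currently in the slot, succeeds, and silently replaces a prime that has not been read yet.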

Even with correctness bugs fixed, I think this approach is going to be low performance with multiple threads writing to the same cache lines.
As 1201ProgramAlarm points out in their answer, a failed CAS overwrites zero with the value it found, so zero must be reset before every retry. That fixes the correctness bug, but I wouldn't expect good performance! Having multiple threads storing to the same cache lines will create big stalls. I'd normally write the retry loop as follows, so you only need to write zero = 0 once, although it still executes before every CAS.
do {
    zero = 0;
} while (!output.compare_exchange_weak(zero, claimed, order));
Caleth pointed out in comments that it's also Undefined Behaviour for a predicate to modify the element (like in your find_if). That's almost certainly not a problem in practice in this case; find_if is just written in C++ in a header (in mainstream implementations) and likely in a way that there isn't actually any UB in the resulting program.
And it would be straightforward to replace the find_if with a loop. In fact probably making the code more readable, since you can just use array indexing the whole time instead of iterators; let the compiler optimize that to a pointer and then pointer-subtraction if it wants.
Scan read-only until you find a candidate to try to claim; don't attempt an atomic RMW on every true element until you get to a false one. Especially on x86-64, lock cmpxchg is way slower than read-only access to a few contiguous bytes. It's a full memory barrier; there's no way to do an atomic RMW on x86 that isn't seq_cst.
You might still lose the race, so you do still need to try to claim it with an RMW and keep looping on failure. And CAS is a good choice for that.
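A minimal sketch of that scan-then-claim pattern, written with plain indexing instead of find_if. The function name claim_next is made up for illustration, and relaxed ordering is kept only because the question uses it.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the index of the first slot at or after `start` that this caller claimed,
// or sieve.size() if no unclaimed slot was left.
std::size_t claim_next(std::vector<std::atomic<bool>>& sieve, std::size_t start) {
    assert(start <= sieve.size());
    for (std::size_t i = start; i < sieve.size(); ++i) {
        // Cheap read-only check first: skip slots that are already crossed off or claimed.
        if (sieve[i].load(std::memory_order_relaxed)) continue;
        // The slot looked free, but another thread may win the race, so claim it with a CAS.
        bool expected = false;
        if (sieve[i].compare_exchange_strong(expected, true, std::memory_order_relaxed))
            return i;
        // CAS failed: somebody else took it, keep scanning.
    }
    return sieve.size();
}
```

A worker would call this repeatedly, passing the last claimed index plus one as `start`.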
Correctness seems plausible with this strategy, but I'd avoid it for performance reasons.
Multiple threads storing to the array will cause contention
Expect cache lines to be bouncing around between cores, with most RFOs (MESI Read For Ownership) having to wait to get the data for a cache line from another core that had it in Modified state. A core can't modify a cache line until after it gets exclusive ownership of that cache line. (Usually 64 bytes on modern systems.)
Your sieve size is only 10k bools, so 10 kB, comfortably fitting into L1d cache on modern CPUs. So a single-threaded implementation would get all L1d cache hits when looping over it (in the same thread that just initialized it all to zero).
But with other threads writing the array, at best you'll get hits in L3 cache. But since the sieve size is small, other threads won't be evicting their copies from their own L1d caches, so the RFO (read for ownership) from a core that wants to write will typically find that some other core has it Modified, so the L3 cache (or other tag directory) will have to send on a request to that core to write back or transfer directly. (Intel CPUs from Nehalem onwards use inclusive L3 cache where the tags also keep track of which cores have the data. They changed that for server chips with Skylake and later, but I think client CPUs still have inclusive L3 cache where the tags also work as a snoop filter / coherence directory.)
With 1 whole byte per bool, and not even factoring out multiples of 2 from your sieve, crossing off multiples of a prime is very high bandwidth. For primes between 32 and 64, you touch every cache line 1 to 2 times. (And you only start at prime*2, not prime*prime, so even for large strides, you still start very near the bottom of the array and touch most of it.)
A single-threaded sieve can use most of L3 cache bandwidth, or saturate DRAM, on a large sieve, even using a bitmap instead of 1 bool per byte. (I made some benchmarks of a hand-written x86 asm version that used a bitmap version in comments on a Codereview Q&A; https://godbolt.org/z/nh39TWxxb has perf stat results in comments on a Skylake i7-6700k with DDR4-2666. My implementation also has some algorithmic improvements, like not storing the even numbers, and starting the crossing off at i*i).
Although to be fair, L3 bandwidth scales with number of cores, especially if different pairs are bouncing data between each other, like A reading lines recently written by B, and B reading lines recently written by C. That is unlike DRAM, where the shared bus is the bottleneck unless per-core bandwidth limits are lower. (Modern server chips need multiple cores to saturate their DRAM controllers, but modern client chips can nearly max out DRAM with one thread active.)
You'd have to benchmark to see whether all threads pile up in a bad way or not, like if they tend to end up close together, or if one with a larger stride can pull ahead and get some distance for write-prefetches not to create extra contention.
The cache-miss delays in committing the store to cache can be hidden some by the store buffer and out-of-order exec (especially since it's relaxed, not seq_cst), but it's still not good.
(Using a bitmap with 8 bools per byte would require atomic RMWs for this threading strategy, which would be a performance disaster. If you're going to thread this way, 1 bool per byte is by far the least bad.)
At least if you aren't reading part of the array that's still being written, you might not be getting memory-order mis-speculation on x86. (x86's memory model disallows LoadStore and LoadLoad reordering, but real implementations speculatively load early, and have to roll back if the value they loaded has been invalidated by the time the load is architecturally allowed to happen.)
Better strategy: each thread owns a chunk of the sieve
Probably much better would be segmenting regions and handing out lists of primes to cross off, with each thread marking off multiples of primes in its own region of the output array. So each cache line of the sieve is only touched by one thread, and each thread only touches a subset of the sieve array. (A good chunk size would be half to 3/4 of the L1d or L2 cache size of a core.)
You might start with a small single-threaded sieve, or a fixed list of the first 10 or 20 primes to get the threads started, and have the thread that owns the starting chunk generate more primes. Perhaps appending them to an array of primes and updating a valid-index with a release store, so readers can keep reading in that array up to that point, then spin-wait or use C++20 .wait() for a value other than what they last saw. (But .wait would need a .notify in the writer, like maybe every 10 primes?)
If you want to move on in a larger sieve, divide up the next chunk of the full array between threads and have them each cross off the known primes from the first part of the array. No thread has to wait for any other, the first set of work already contains all the primes that need to be crossed off from an equal-sized chunk of the sieve.
Probably you don't want an actual array of atomic_int; probably all threads should be scanning the sieve to find not-crossed-off positions. Especially if you can do that efficiently with SIMD, or with bit-scan like tzcnt if you use packed bitmaps for this.
(I assume there are some clever algorithmic ideas for segmented sieves; this is just what I came up with off the top of my head.)
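A rough sketch of the chunk-ownership idea, deliberately simplified: the base primes are found up front by a small single-threaded sieve rather than being produced on the fly by the thread that owns the first chunk, and each worker gets exactly one chunk. The names chunked_sieve and cross_off_chunk are invented for this example, and nothing here is tuned or benchmarked.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Cross off multiples of every base prime, but only inside this worker's range [lo, hi).
void cross_off_chunk(std::vector<char>& composite, std::size_t lo, std::size_t hi,
                     const std::vector<std::size_t>& base_primes) {
    for (std::size_t p : base_primes) {
        // First multiple of p inside the chunk, never below p*p.
        std::size_t first = std::max(p * p, (lo + p - 1) / p * p);
        for (std::size_t m = first; m < hi; m += p) composite[m] = 1;
    }
}

// Returns all primes below `limit`; each of the `threads` workers owns one chunk.
std::vector<std::size_t> chunked_sieve(std::size_t limit, unsigned threads) {
    assert(threads > 0);
    std::vector<char> composite(limit, 0);  // one byte per number: plain stores suffice

    // Small single-threaded sieve for the base primes (up to about sqrt(limit)).
    std::size_t root = 2;
    while (root * root < limit) ++root;
    std::vector<char> small(root + 1, 0);
    std::vector<std::size_t> base_primes;
    for (std::size_t p = 2; p <= root; ++p) {
        if (small[p]) continue;
        base_primes.push_back(p);
        for (std::size_t m = p * p; m <= root; m += p) small[m] = 1;
    }

    // Hand every worker a disjoint range; no element is written by two threads.
    // (Only the cache line at a chunk boundary can be shared between two workers.)
    std::size_t chunk = (limit + threads - 1) / threads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threads; ++t) {
        std::size_t lo = std::min(limit, t * chunk);
        std::size_t hi = std::min(limit, lo + chunk);
        workers.emplace_back([&composite, &base_primes, lo, hi] {
            cross_off_chunk(composite, lo, hi, base_primes);
        });
    }
    for (auto& w : workers) w.join();

    std::vector<std::size_t> primes;
    for (std::size_t n = 2; n < limit; ++n)
        if (!composite[n]) primes.push_back(n);
    return primes;
}
```

Because the ranges are disjoint, no atomics are needed while crossing off; the only synchronization is the join before the results are collected.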

For-loop variables and cache misses? [duplicate]

What is the difference between "cache unfriendly code" and the "cache friendly" code?
How can I make sure I write cache-efficient code?

Preliminaries
On modern computers, only the lowest level memory structures (the registers) can move data around in single clock cycles. However, registers are very expensive and most computer cores have fewer than a few dozen registers. At the other end of the memory spectrum (DRAM), the memory is very cheap (i.e. literally millions of times cheaper) but takes hundreds of cycles after a request to receive the data. The cache memories, named L1, L2, L3 in decreasing speed and cost, bridge this gap between super fast and expensive and super slow and cheap. The idea is that most of the executing code will be hitting a small set of variables often, and the rest (a much larger set of variables) infrequently. If the processor can't find the data in L1 cache, then it looks in L2 cache. If not there, then L3 cache, and if not there, main memory. Each of these "misses" is expensive in time.
(The analogy is cache memory is to system memory, as system memory is to hard disk storage. Hard disk storage is super cheap but very slow).
Caching is one of the main methods to reduce the impact of latency. To paraphrase Herb Sutter (cfr. links below): increasing bandwidth is easy, but we can't buy our way out of latency.
Data is always retrieved through the memory hierarchy (smallest == fastest to slowest). A cache hit/miss usually refers to a hit/miss in the highest level of cache in the CPU -- by highest level I mean the largest == slowest. The cache hit rate is crucial for performance since every cache miss results in fetching data from RAM (or worse ...) which takes a lot of time (hundreds of cycles for RAM, tens of millions of cycles for HDD). In comparison, reading data from the (highest level) cache typically takes only a handful of cycles.
In modern computer architectures, the performance bottleneck is leaving the CPU die (e.g. accessing RAM or higher). This will only get worse over time. Raising the processor frequency is currently no longer what increases performance; the problem is memory access. Hardware design efforts in CPUs therefore currently focus heavily on optimizing caches, prefetching, pipelines and concurrency. For instance, modern CPUs spend around 85% of die on caches and up to 99% for storing/moving data!
There is quite a lot to be said on the subject. Here are a few great references about caches, memory hierarchies and proper programming:
Agner Fog's page. In his excellent documents, you can find detailed examples covering languages ranging from assembly to C++.
If you are into videos, I strongly recommend having a look at Herb Sutter's talk on machine architecture (youtube) (specifically check 12:00 and onwards!).
Slides about memory optimization by Christer Ericson (director of technology @ Sony)
LWN.net's article "What every programmer should know about memory"
Main concepts for cache-friendly code
A very important aspect of cache-friendly code is all about the principle of locality, the goal of which is to place related data close in memory to allow efficient caching. In terms of the CPU cache, it's important to be aware of cache lines to understand how this works: How do cache lines work?
The following particular aspects are of high importance to optimize caching:
Temporal locality: when a given memory location was accessed, it is likely that the same location is accessed again in the near future. Ideally, this information will still be cached at that point.
Spatial locality: this refers to placing related data close to each other. Caching happens on many levels, not just in the CPU. For example, when you read from RAM, typically a larger chunk of memory is fetched than what was specifically asked for because very often the program will require that data soon. HDD caches follow the same line of thought. Specifically for CPU caches, the notion of cache lines is important.
Use appropriate C++ containers
A simple example of cache-friendly versus cache-unfriendly is C++'s std::vector versus std::list. Elements of a std::vector are stored in contiguous memory, and as such accessing them is much more cache-friendly than accessing elements in a std::list, which stores its content all over the place. This is due to spatial locality.
A very nice illustration of this is given by Bjarne Stroustrup in this youtube clip (thanks to @Mohammad Ali Baydoun for the link!).
Don't neglect the cache in data structure and algorithm design
Whenever possible, try to adapt your data structures and order of computations in a way that allows maximum use of the cache. A common technique in this regard is cache blocking (Archive.org version), which is of extreme importance in high-performance computing (cfr. for example ATLAS).
Know and exploit the implicit structure of data
Another simple example, which many people in the field sometimes forget, is column-major (e.g. Fortran, MATLAB) vs. row-major ordering (e.g. C, C++) for storing two-dimensional arrays. For example, consider the following matrix:
1 2
3 4
In row-major ordering, this is stored in memory as 1 2 3 4; in column-major ordering, this would be stored as 1 3 2 4. It is easy to see that implementations which do not exploit this ordering will quickly run into (easily avoidable!) cache issues. Unfortunately, I see stuff like this very often in my domain (machine learning). @MatteoItalia showed this example in more detail in his answer.
When fetching a certain element of a matrix from memory, elements near it will be fetched as well and stored in a cache line. If the ordering is exploited, this will result in fewer memory accesses (because the next few values which are needed for subsequent computations are already in a cache line).
For simplicity, assume the cache comprises a single cache line which can contain 2 matrix elements and that when a given element is fetched from memory, the next one is too. Say we want to take the sum over all elements in the example 2x2 matrix above (let's call it M):
Exploiting the ordering (e.g. changing column index first in C++):
M[0][0] (memory) + M[0][1] (cached) + M[1][0] (memory) + M[1][1] (cached)
= 1 + 2 + 3 + 4
--> 2 cache hits, 2 memory accesses
Not exploiting the ordering (e.g. changing row index first in C++):
M[0][0] (memory) + M[1][0] (memory) + M[0][1] (memory) + M[1][1] (memory)
= 1 + 3 + 2 + 4
--> 0 cache hits, 4 memory accesses
In this simple example, exploiting the ordering approximately doubles execution speed (since memory access requires many more cycles than computing the sums). In practice, the performance difference can be much larger.
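The two traversal orders from the example can be sketched for a matrix stored row-major in a flat std::vector. The function names are illustrative only; both functions return the same sum and differ only in the order in which memory is touched.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Column index changes fastest: consecutive iterations touch adjacent addresses.
double sum_row_order(const std::vector<double>& m, std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    double total = 0.0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];
    return total;
}

// Row index changes fastest: consecutive iterations are `cols` elements apart.
double sum_column_order(const std::vector<double>& m, std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    double total = 0.0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];
    return total;
}
```

For the 2x2 matrix above both calls yield 10; any speed difference only becomes measurable once the matrix is much larger than the cache.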
Avoid unpredictable branches
Modern architectures feature pipelines and compilers are becoming very good at reordering code to minimize delays due to memory access. When your critical code contains (unpredictable) branches, it is hard or impossible to prefetch data. This will indirectly lead to more cache misses.
This is explained very well here (thanks to @0x90 for the link): Why is processing a sorted array faster than processing an unsorted array?
Avoid virtual functions
In the context of C++, virtual methods represent a controversial issue with regard to cache misses (a general consensus exists that they should be avoided when possible in terms of performance). Virtual functions can induce cache misses during lookup, but this only happens if the specific function is not called often (otherwise it would likely be cached), so this is regarded as a non-issue by some. For reference about this issue, check out: What is the performance cost of having a virtual method in a C++ class?
Common problems
A common problem in modern architectures with multiprocessor caches is called false sharing. This occurs when each processor works on its own data, but the data of different processors happens to sit in the same cache line. Every write by one processor then invalidates the copy of that line held by the others, even though they never touch each other's data. Effectively, different threads make each other wait by inducing cache misses in this situation.
See also (thanks to @Matt for the link): How and when to align to cache line size?
An extreme symptom of poor caching in RAM memory (which is probably not what you mean in this context) is so-called thrashing. This occurs when the process continuously generates page faults (e.g. accesses memory which is not in the current page) which require disk access.

In addition to @Marc Claesen's answer, I think that an instructive classic example of cache-unfriendly code is code that scans a C bidimensional array (e.g. a bitmap image) column-wise instead of row-wise.
Elements that are adjacent in a row are also adjacent in memory, thus accessing them in sequence means accessing them in ascending memory order; this is cache-friendly, since the cache tends to prefetch contiguous blocks of memory.
Instead, accessing such elements column-wise is cache-unfriendly, since elements on the same column are distant in memory from each other (in particular, their distance is equal to the size of the row), so when you use this access pattern you are jumping around in memory, potentially wasting the effort of the cache of retrieving the elements nearby in memory.
And all that it takes to ruin the performance is to go from
// Cache-friendly version - processes pixels which are adjacent in memory
for(unsigned int y=0; y<height; ++y)
{
    for(unsigned int x=0; x<width; ++x)
    {
        ... image[y][x] ...
    }
}
to
// Cache-unfriendly version - jumps around in memory for no good reason
for(unsigned int x=0; x<width; ++x)
{
    for(unsigned int y=0; y<height; ++y)
    {
        ... image[y][x] ...
    }
}
This effect can be quite dramatic (several orders of magnitude in speed) in systems with small caches and/or working with big arrays (e.g. 10+ megapixel 24 bpp images on current machines); for this reason, if you have to do many vertical scans, it's often better to rotate the image by 90 degrees first and perform the various analyses later, limiting the cache-unfriendly code just to the rotation.

Optimizing cache usage largely comes down to two factors.
Locality of Reference
The first factor (to which others have already alluded) is locality of reference. Locality of reference really has two dimensions though: space and time.
Spatial
The spatial dimension also comes down to two things: first, we want to pack our information densely, so more information will fit in that limited memory. This means (for example) that you need a major improvement in computational complexity to justify data structures based on small nodes joined by pointers.
Second, we want information that will be processed together also located together. A typical cache works in "lines", which means when you access some information, other information at nearby addresses will be loaded into the cache with the part we touched. For example, when I touch one byte, the cache might load 128 or 256 bytes near that one. To take advantage of that, you generally want the data arranged to maximize the likelihood that you'll also use that other data that was loaded at the same time.
For just a really trivial example, this can mean that a linear search can be much more competitive with a binary search than you'd expect. Once you've loaded one item from a cache line, using the rest of the data in that cache line is almost free. A binary search becomes noticeably faster only when the data is large enough that the binary search reduces the number of cache lines you access.
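To make the comparison concrete, here is a sketch of the two searches over a sorted std::vector<int>. The function names are invented; both return the index of the first element that is not less than the key, and which one is faster for a given size is something to measure, not something this code decides.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Linear scan: walks adjacent elements, so it uses every element of each cache line it loads.
std::size_t lower_bound_linear(const std::vector<int>& sorted, int key) {
    std::size_t i = 0;
    while (i < sorted.size() && sorted[i] < key) ++i;
    return i;
}

// Binary search: fewer comparisons, but each probe may land on a different cache line.
std::size_t lower_bound_binary(const std::vector<int>& sorted, int key) {
    assert(std::is_sorted(sorted.begin(), sorted.end()));
    return static_cast<std::size_t>(
        std::lower_bound(sorted.begin(), sorted.end(), key) - sorted.begin());
}
```

With 4-byte ints and an assumed 64-byte line, a 16-element array fits in a single line when suitably aligned, so the linear scan touches no more lines than the binary search does.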
Time
The time dimension means that when you do some operations on some data, you want (as much as possible) to do all the operations on that data at once.
Since you've tagged this as C++, I'll point to a classic example of a relatively cache-unfriendly design: std::valarray. valarray overloads most arithmetic operators, so I can (for example) say a = b + c + d; (where a, b, c and d are all valarrays) to do element-wise addition of those arrays.
The problem with this is that it walks through one pair of inputs, puts results in a temporary, walks through another pair of inputs, and so on. With a lot of data, the result from one computation may disappear from the cache before it's used in the next computation, so we end up reading (and writing) the data repeatedly before we get our final result. If each element of the final result will be something like (a[n] + b[n]) * (c[n] + d[n]);, we'd generally prefer to read each a[n], b[n], c[n] and d[n] once, do the computation, write the result, increment n and repeat 'til we're done.[2]
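The preferred single-pass form can be sketched with plain std::vector inputs. This is only an illustration of the access pattern, with a made-up function name, not a replacement for valarray.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One pass over the data: each a[n], b[n], c[n], d[n] is read once and result[n] written once,
// with no intermediate temporaries the size of the inputs.
std::vector<double> fused_sum_product(const std::vector<double>& a, const std::vector<double>& b,
                                      const std::vector<double>& c, const std::vector<double>& d) {
    assert(a.size() == b.size() && b.size() == c.size() && c.size() == d.size());
    std::vector<double> result(a.size());
    for (std::size_t n = 0; n < a.size(); ++n)
        result[n] = (a[n] + b[n]) * (c[n] + d[n]);
    return result;
}
```

The operator-by-operator version would instead build a + b and c + d as two full temporaries before multiplying them.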
Line Sharing
The second major factor is avoiding line sharing. To understand this, we probably need to back up and look a little at how caches are organized. The simplest form of cache is direct mapped. This means one address in main memory can only be stored in one specific spot in the cache. If we're using two data items that map to the same spot in the cache, it works badly -- each time we use one data item, the other has to be flushed from the cache to make room for the other. The rest of the cache might be empty, but those items won't use other parts of the cache.
To prevent this, most caches are what are called "set associative". For example, in a 4-way set-associative cache, any item from main memory can be stored at any of 4 different places in the cache. So, when the cache is going to load an item, it looks for the least recently used[3] item among those four, flushes it to main memory, and loads the new item in its place.
The problem is probably fairly obvious: for a direct-mapped cache, two operands that happen to map to the same cache location can lead to bad behavior. An N-way set-associative cache increases the number from 2 to N+1. Organizing a cache into more "ways" takes extra circuitry and generally runs slower, so (for example) an 8192-way set associative cache is rarely a good solution either.
Ultimately, this factor is more difficult to control in portable code though. Your control over where your data is placed is usually fairly limited. Worse, the exact mapping from address to cache varies between otherwise similar processors. In some cases, however, it can be worth doing things like allocating a large buffer, and then using only parts of what you allocated to ensure against data sharing the same cache lines (even though you'll probably need to detect the exact processor and act accordingly to do this).
False Sharing
There's another, related item called "false sharing". This arises in a multiprocessor or multicore system, where two (or more) processors/cores have data that's separate, but falls in the same cache line. This forces the two processors/cores to coordinate their access to the data, even though each has its own, separate data item. Especially if the two modify the data in alternation, this can lead to a massive slowdown as the data has to be constantly shuttled between the processors. This can't easily be cured by organizing the cache into more "ways" or anything like that either. The primary way to prevent it is to ensure that two threads rarely (preferably never) modify data that could possibly be in the same cache line (with the same caveats about difficulty of controlling the addresses at which data is allocated).
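One common way to keep per-thread data out of a shared line is to align each item to the line size. The sketch below assumes 64-byte cache lines, which is typical but not guaranteed; the constant and type names are made up for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// ASSUMPTION: 64-byte cache lines. Check the target hardware before relying on this.
constexpr std::size_t kAssumedLineSize = 64;

// Aligning the struct to the line size also pads its size up to a full line.
struct alignas(kAssumedLineSize) PaddedCounter {
    long value = 0;
};

// One counter per thread; each thread is meant to touch only its own slot.
template <std::size_t Threads>
struct PerThreadCounters {
    std::array<PaddedCounter, Threads> slots{};

    long total() const {
        long sum = 0;
        for (const PaddedCounter& c : slots) sum += c.value;
        return sum;
    }
};

// Byte distance between neighbouring slots; also checks that slot 0 starts on a line boundary.
template <std::size_t Threads>
std::size_t slot_stride(const PerThreadCounters<Threads>& c) {
    static_assert(Threads >= 2, "need two slots to measure a stride");
    assert(reinterpret_cast<std::uintptr_t>(&c.slots[0]) % kAssumedLineSize == 0);
    return static_cast<std::size_t>(reinterpret_cast<const char*>(&c.slots[1]) -
                                    reinterpret_cast<const char*>(&c.slots[0]));
}
```

The cost is memory: every counter now occupies a whole assumed line instead of a few bytes.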
[1] Those who know C++ well might wonder if this is open to optimization via something like expression templates. I'm pretty sure the answer is that yes, it could be done and if it was, it would probably be a pretty substantial win. I'm not aware of anybody having done so, however, and given how little valarray gets used, I'd be at least a little surprised to see anybody do so either.
[2] In case anybody wonders how valarray (designed specifically for performance) could be this badly wrong, it comes down to one thing: it was really designed for machines like the older Crays, that used fast main memory and no cache. For them, this really was a nearly ideal design.
[3] Yes, I'm simplifying: most caches don't really measure the least recently used item precisely, but they use some heuristic that's intended to be close to that without having to keep a full time-stamp for each access.

Welcome to the world of Data Oriented Design. The basic mantra is to Sort, Eliminate Branches, Batch, Eliminate virtual calls - all steps towards better locality.
Since you tagged the question with C++, here's the obligatory typical C++ Bullshit. Tony Albrecht's Pitfalls of Object Oriented Programming is also a great introduction into the subject.

Just piling on: the classic example of cache-unfriendly versus cache-friendly code is the "cache blocking" of matrix multiply.
Naive matrix multiply looks like:
for(i=0;i<N;i++) {
    for(j=0;j<N;j++) {
        dest[i][j] = 0;
        for( k=0;k<N;k++) {
            dest[i][j] += src1[i][k] * src2[k][j];
        }
    }
}
If N is large, e.g. if N * sizeof(elemType) is greater than the cache size, then every single access to src2[k][j] will be a cache miss.
There are many different ways of optimizing this for a cache. Here's a very simple example: instead of reading one item per cache line in the inner loop, use all of the items:
// Assumes N is a multiple of itemsPerCacheLine.
int itemsPerCacheLine = CacheLineSize / sizeof(elemType);
for(i=0;i<N;i++) {
    for(j=0;j<N;j += itemsPerCacheLine ) {
        for(jj=0;jj<itemsPerCacheLine; jj++) {
            dest[i][j+jj] = 0;
        }
        for( k=0;k<N;k++) {
            for(jj=0;jj<itemsPerCacheLine; jj++) {
                dest[i][j+jj] += src1[i][k] * src2[k][j+jj];
            }
        }
    }
}
If the cache line size is 64 bytes, and we are operating on 32 bit (4 byte) floats, then there are 16 items per cache line. And the number of cache misses via just this simple transformation is reduced approximately 16-fold.
Fancier transformations operate on 2D tiles, optimize for multiple caches (L1, L2, TLB), and so on.
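As a hedged illustration of the 2D-tile idea, here is a minimal tiled multiply for square row-major matrices held in flat vectors. The Matrix alias, the function name and the tile parameter are assumptions made for this sketch; a good tile size depends on the cache and has to be found by measurement.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<double>;  // n x n elements, row-major

// dest = src1 * src2, processed one tile x tile block at a time so that the
// blocks being combined have a chance to stay in cache while they are reused.
Matrix multiply_tiled(const Matrix& src1, const Matrix& src2, std::size_t n, std::size_t tile) {
    assert(tile > 0 && src1.size() == n * n && src2.size() == n * n);
    Matrix dest(n * n, 0.0);
    for (std::size_t ii = 0; ii < n; ii += tile)
        for (std::size_t kk = 0; kk < n; kk += tile)
            for (std::size_t jj = 0; jj < n; jj += tile)
                // std::min handles the partial tiles at the matrix edge.
                for (std::size_t i = ii; i < std::min(ii + tile, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + tile, n); ++k) {
                        const double a = src1[i * n + k];
                        for (std::size_t j = jj; j < std::min(jj + tile, n); ++j)
                            dest[i * n + j] += a * src2[k * n + j];
                    }
    return dest;
}
```

The innermost loop still walks j, so accesses to dest and src2 stay contiguous within a tile row.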
Some results of googling "cache blocking":
http://stumptown.cc.gt.atl.ga.us/cse6230-hpcta-fa11/slides/11a-matmul-goto.pdf
http://software.intel.com/en-us/articles/cache-blocking-techniques
A nice video animation of an optimized cache blocking algorithm.
http://www.youtube.com/watch?v=IFWgwGMMrh0
Loop tiling is very closely related:
http://en.wikipedia.org/wiki/Loop_tiling

Processors today work with many levels of cascading memory areas. The CPU has a bunch of memory on the CPU chip itself, which it can access very quickly. There are different levels of cache, each one slower to access (and larger) than the one before it, until you get to system memory, which is not on the CPU and is much slower to access.
Logically, to the CPU's instruction set you just refer to memory addresses in a giant virtual address space. When you access a single memory address the CPU will go fetch it. In the old days it would fetch just that single address. But today the CPU will fetch a bunch of memory around the bit you asked for, and copy it into the cache. It assumes that if you asked for a particular address, it is highly likely that you are going to ask for an address nearby very soon. For example, if you were copying a buffer you would read and write from consecutive addresses, one right after the other.
So today when you fetch an address, the CPU checks the first level of cache to see if it has already read that address into cache. If it doesn't find it, this is a cache miss and it has to go out to the next level of cache to find it, until it eventually has to go out to main memory.
Cache friendly code tries to keep accesses close together in memory so that you minimize cache misses.
As an example, imagine you wanted to copy a giant two-dimensional table. It is organized with each row stored consecutively in memory, and one row following the next right after.
If you copied the elements one row at a time from left to right - that would be cache friendly. If you decided to copy the table one column at a time, you would copy the exact same amount of memory - but it would be cache unfriendly.
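The two copy orders can be sketched as follows. This is only an illustration under the assumption that the table is a flat row-major std::vector<int>; the function names are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Table stored row-major: element (r, c) lives at index r * cols + c.

// Cache-friendly: the inner loop touches consecutive addresses.
void copy_row_by_row(const std::vector<int>& src, std::vector<int>& dst,
                     std::size_t rows, std::size_t cols) {
    assert(src.size() == rows * cols && dst.size() == rows * cols);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            dst[r * cols + c] = src[r * cols + c];
}

// Cache-unfriendly: each inner step jumps cols * sizeof(int) bytes ahead.
void copy_column_by_column(const std::vector<int>& src, std::vector<int>& dst,
                           std::size_t rows, std::size_t cols) {
    assert(src.size() == rows * cols && dst.size() == rows * cols);
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            dst[r * cols + c] = src[r * cols + c];
}
```

Both functions produce the same result; only the order in which memory is touched differs, so any speed difference has to be measured on a table much larger than the caches.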

It needs to be clarified that not only the data should be cache-friendly; it is just as important for the code. This is in addition to branch prediction, instruction reordering, avoiding actual divisions and other techniques.
Typically the denser the code, the fewer cache lines will be required to store it. This results in more cache lines being available for data.
The code should not call functions all over the place as they typically will require one or more cache lines of their own, resulting in fewer cache lines for data.
A function should begin at a cache line-alignment-friendly address. Though there are (gcc) compiler switches for this, be aware that if the functions are very short it might be wasteful for each one to occupy an entire cache line. For example, if three of the most often used functions fit inside one 64 byte cache line, this is less wasteful than if each one has its own line, which would leave two fewer cache lines available for other usage. A typical alignment value could be 32 or 16.
So spend some extra time to make the code dense. Test different constructs, compile and review the generated code size and profile.

As Marc Claesen mentioned, one of the ways to write cache-friendly code is to exploit the structure in which our data is stored. In addition to that, another way to write cache-friendly code is to change the way our data is stored, and then write new code to access the data stored in this new structure.
This makes sense in the case of how database systems linearize the tuples of a table and store them. There are two basic ways to store the tuples of a table, i.e. row store and column store. In a row store, as the name suggests, the tuples are stored row-wise. Let's suppose a table named Product has 3 attributes, i.e. int32_t key, char name[56] and int32_t price, so the total size of a tuple is 64 bytes.
We can simulate a very basic row store query execution in main memory by creating an array of Product structs with size N, where N is the number of rows in table. Such memory layout is also called array of structs. So the struct for Product can be like:
struct Product
{
int32_t key;
char name[56];
int32_t price;
};
/* create an array of structs */
Product* table = new Product[N];
/* now load this array of structs, from a file etc. */
Similarly, we can simulate a very basic column store query execution in main memory by creating 3 arrays of size N, one array for each attribute of the Product table. Such a memory layout is also called struct of arrays. So the 3 arrays for the attributes of Product can be like:
/* create separate arrays for each attribute */
int32_t* key = new int32_t[N];
char* name = new char[56*N];
int32_t* price = new int32_t[N];
/* now load these arrays, from a file etc. */
Now after loading both the array of structs (Row Layout) and the 3 separate arrays (Column Layout), we have row store and column store on our table Product present in our memory.
Now we move on to the cache friendly code part. Suppose that the workload on our table is such that we have an aggregation query on the price attribute. Such as
SELECT SUM(price)
FROM PRODUCT
For the row store we can convert the above SQL query into
int sum = 0;
for (int i=0; i<N; i++)
sum = sum + table[i].price;
For the column store we can convert the above SQL query into
int sum = 0;
for (int i=0; i<N; i++)
sum = sum + price[i];
The code for the column store would be faster than the code for the row layout in this query as it requires only a subset of attributes and in column layout we are doing just that i.e. only accessing the price column.
Suppose that the cache line size is 64 bytes.
In the case of the row layout, when a cache line is read, the price value of only 1 (cacheline_size/product_struct_size = 64/64 = 1) tuple is read, because our struct is 64 bytes in size and fills the whole cache line. So for every tuple a cache miss occurs in the row layout.
In the case of the column layout, when a cache line is read, the price values of 16 (cacheline_size/price_int_size = 64/4 = 16) tuples are read, because 16 contiguous price values stored in memory are brought into the cache. So a cache miss occurs only for every sixteenth tuple in the column layout.
So the column layout will be faster for the given query, and is faster for such aggregation queries on a subset of the columns of a table. You can try out such an experiment for yourself using the data from the TPC-H benchmark, and compare the run times for both layouts. The Wikipedia article on column-oriented database systems is also good.
So in database systems, if the query workload is known beforehand, we can store our data in layouts which will suit the queries in workload and access data from these layouts. In the case of above example we created a column layout and changed our code to compute sum so that it became cache friendly.
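For anyone who wants to try the comparison, here is a small self-contained sketch of both layouts. It is an illustration rather than a benchmark: the names ProductColumns, to_columns, sum_price_rows and sum_price_columns are invented for this example, and std::vector is used instead of raw new[].

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Product {                 // row layout: one 64-byte struct per tuple
    std::int32_t key;
    char name[56];
    std::int32_t price;
};
static_assert(sizeof(Product) == 64, "this sketch assumes a 64-byte tuple");

struct ProductColumns {          // column layout: one array per attribute
    std::vector<std::int32_t> key;
    std::vector<char> name;      // 56 bytes per tuple, back to back
    std::vector<std::int32_t> price;
};

// Converts the row layout into the column layout.
ProductColumns to_columns(const std::vector<Product>& table) {
    ProductColumns cols;
    for (const Product& p : table) {
        cols.key.push_back(p.key);
        cols.name.insert(cols.name.end(), p.name, p.name + sizeof(p.name));
        cols.price.push_back(p.price);
    }
    assert(cols.price.size() == table.size());
    return cols;
}

// SELECT SUM(price): touches one cache line per tuple in the row layout.
std::int64_t sum_price_rows(const std::vector<Product>& table) {
    std::int64_t sum = 0;
    for (const Product& p : table) sum += p.price;
    return sum;
}

// SELECT SUM(price): reads only the densely packed price column.
std::int64_t sum_price_columns(const ProductColumns& cols) {
    std::int64_t sum = 0;
    for (std::int32_t price : cols.price) sum += price;
    return sum;
}
```

Timing the two sum functions on a table that is much larger than the last-level cache should show the difference described above, but the size of the gap depends on the machine.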

Be aware that caches do not just cache contiguous memory. They have multiple lines (at least 4), so discontiguous and overlapping memory can often be stored just as efficiently.
What is missing from all the above examples is measured benchmarks. There are many myths about performance. Unless you measure it you do not know. Do not complicate your code unless you have a measured improvement.

Cache-friendly code is code that has been optimized to make efficient use of the CPU cache. This typically involves organizing data in a way that takes advantage of spatial and temporal locality, which refers to the idea that data that is accessed together is likely to be stored together in memory, and that data that is accessed frequently is likely to be accessed again in the near future.
There are several ways to make code cache-friendly, including:
Using contiguous memory layouts: By storing data in contiguous blocks in memory, you can take advantage of spatial locality and reduce the number of cache misses.
Using arrays: Arrays are a good choice for data structures when you need to access data sequentially, as they allow you to take advantage of temporal locality and keep hot data in the cache.
Using pointers carefully: Pointers can be used to access data that is not stored contiguously in memory, but they can also lead to cache misses if they are used excessively. If you need to use pointers, try to use them in a way that takes advantage of spatial and temporal locality to minimize cache misses.
Using compiler optimization flags: Most compilers have optimization flags that can be used to optimize the use of the CPU cache. These flags can help to minimize the number of cache misses and improve the overall performance of your code.
It is important to note that the specific techniques that work best for optimizing the use of the CPU cache will depend on the specific requirements and constraints of your system. It may be necessary to experiment with different approaches to find the best solution for your needs.

Double-checking understanding of memory coalescing in CUDA

Suppose I define some arrays which are visible to the GPU:
double* doubleArr = createCUDADouble(fieldLen);
float* floatArr = createCUDAFloat(fieldLen);
char* charArr = createCUDAChar(fieldLen);
Now, I have the following CUDA thread:
void thread(){
int o = getOffset(); // the same for all threads in launch
double d = doubleArr[threadIdx.x + o];
float f = floatArr[threadIdx.x + o];
char c = charArr[threadIdx.x + o];
}
I'm not quite sure whether I interpret the documentation correctly, and it's very critical for my design: Will the memory accesses for double, float and char be nicely coalesced? (Guess: Yes, it will fit into sizeof(type) * blockSize.x / (transaction size) transactions, plus maybe one extra transaction at the upper and lower boundary.)

Yes, for all the cases you have shown, and assuming createCUDAxxxxx translates into some kind of ordinary cudaMalloc type operation, everything should nicely coalesce.
If we have ordinary 1D device arrays allocated via cudaMalloc, in general we should have good coalescing behavior across threads if our load pattern includes an array index of the form:
data_array[some_constant + threadIdx.x];
It really does not matter what data type the array is - it will coalesce nicely.
However, from a performance perspective, global loads (assuming an L1 miss) will occur in a minimum 128-byte granularity. Therefore loading larger sizes per thread (say, int, float, double, float4, etc.) may give slightly better performance. The caches tend to mitigate any difference, if the loads are across a large enough number of warps.
It's pretty easy also to verify this on a particular piece of code with a profiler. There are many ways to do this depending on which profiler you choose, but for example with nvprof you can do:
nvprof --metrics gld_efficiency ./my_exe
and it will return an average percentage number that more or less exactly reflects the percentage of optimal coalescing that is occurring on global loads.
This is the presentation I usually cite for additional background info on memory optimization.
I suppose someone will come along and notice that this pattern:
data_array[some_constant + threadIdx.x];
roughly corresponds to the access type shown on slides 40-41 of the above presentation. And aha!! efficiency drops to 50%-80%. That is true, if only a single warp-load is being considered. However, referring to slide 40, we see that the "first" load will require two cachelines to be loaded. After that however, additional loads (moving to the right, for simplicity) will only require one additional/new cacheline per warp-load (assuming the existence of an L1 or L2 cache, and reasonable locality, i.e. lack of thrashing). Therefore, over a reasonably large array (more than just 128 bytes), the average requirement will be one new cacheline per warp, which corresponds to 100% efficiency.
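The "one or two cachelines per warp-load" reasoning can be checked with a little arithmetic. The following is a simplified host-side model in plain C++, not CUDA code and not a description of any particular GPU: it assumes 32 threads per warp, 128-byte aligned segments (the granularity mentioned above), and a unit-stride access where thread t reads element some_constant + t. The function name segments_touched is made up for this sketch.

```cpp
#include <cassert>
#include <cstddef>

// Counts the aligned segments of `segment_bytes` that one warp touches when
// thread t reads `elem_bytes` bytes at byte address
// base_byte_offset + t * elem_bytes (contiguous, unit-stride access).
std::size_t segments_touched(std::size_t base_byte_offset, std::size_t elem_bytes,
                             std::size_t threads = 32,
                             std::size_t segment_bytes = 128) {
    assert(elem_bytes > 0 && threads > 0 && segment_bytes > 0);
    const std::size_t first = base_byte_offset / segment_bytes;
    const std::size_t last =
        (base_byte_offset + threads * elem_bytes - 1) / segment_bytes;
    return last - first + 1;
}
```

Under these assumptions, an aligned float load needs 1 segment and an aligned double load needs 2, while a float load whose base is shifted by one element (4 bytes) touches 2 segments. The second of those is shared with the neighbouring warp, which is why the average over a large array comes back to one new cacheline per warp.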

What is a "cache-friendly" code?

What is the difference between "cache unfriendly code" and the "cache friendly" code?
How can I make sure I write cache-efficient code?

In addition to Marc Claesen's answer, I think that an instructive classic example of cache-unfriendly code is code that scans a C two-dimensional array (e.g. a bitmap image) column-wise instead of row-wise.
Elements that are adjacent in a row are also adjacent in memory, thus accessing them in sequence means accessing them in ascending memory order; this is cache-friendly, since the cache tends to prefetch contiguous blocks of memory.
Instead, accessing such elements column-wise is cache-unfriendly, since elements in the same column are distant in memory from each other (in particular, their distance is equal to the size of the row), so when you use this access pattern you are jumping around in memory, potentially wasting the effort the cache spent retrieving the nearby elements.
And all that it takes to ruin the performance is to go from
// Cache-friendly version - processes pixels which are adjacent in memory
for(unsigned int y=0; y<height; ++y)
{
for(unsigned int x=0; x<width; ++x)
{
... image[y][x] ...
}
}
to
// Cache-unfriendly version - jumps around in memory for no good reason
for(unsigned int x=0; x<width; ++x)
{
for(unsigned int y=0; y<height; ++y)
{
... image[y][x] ...
}
}
This effect can be quite dramatic (several orders of magnitude in speed) on systems with small caches and/or when working with big arrays (e.g. 10+ megapixel 24 bpp images on current machines); for this reason, if you have to do many vertical scans, it's often better to rotate the image by 90 degrees first and perform the various analyses afterwards, limiting the cache-unfriendly code to just the rotation.
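A minimal sketch of the "reorient once, then scan rows" idea follows. Note that it uses a transpose rather than a true 90 degree rotation, which serves the same purpose here. It assumes an 8-bit grayscale image in a flat row-major buffer, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Grayscale image, row-major: pixel (x, y) is at img[y * width + x].
// The only cache-unfriendly step: writes to `out` are strided by `height`.
std::vector<std::uint8_t> transpose(const std::vector<std::uint8_t>& img,
                                    std::size_t width, std::size_t height) {
    assert(img.size() == width * height);
    std::vector<std::uint8_t> out(img.size());
    for (std::size_t y = 0; y < height; ++y)
        for (std::size_t x = 0; x < width; ++x)
            out[x * height + y] = img[y * width + x];
    return out;
}

// Sum of each original column, computed as row sums of the transposed image,
// so the scan itself reads memory contiguously.
std::vector<std::uint32_t> column_sums(const std::vector<std::uint8_t>& img,
                                       std::size_t width, std::size_t height) {
    const std::vector<std::uint8_t> t = transpose(img, width, height);
    std::vector<std::uint32_t> sums(width, 0);
    for (std::size_t x = 0; x < width; ++x)
        for (std::size_t y = 0; y < height; ++y)
            sums[x] += t[x * height + y];
    return sums;
}
```

The transpose only pays off when several vertical passes reuse the transposed buffer; for a single pass it is worth measuring both variants.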

How to find the size of the L1 cache line size with IO timing measurements?

As a school assignment, I need to find a way to get the L1 data cache line size, without reading config files or using API calls. I'm supposed to use the timings of memory read/write accesses to analyze and get this info. So how might I do that?
In an incomplete try for another part of the assignment, to find the levels & size of cache, I have:
for (i = 0; i < steps; i++) {
arr[(i * 4) & lengthMod]++;
}
I was thinking maybe I just need to vary line 2, the (i * 4) part? So once I exceed the cache line size, the line might need to be replaced, which takes some time? But is it so straightforward? The required block might already be in memory somewhere? Or perhaps I can still count on the fact that if I have a large enough steps, it will still work out quite accurately?
UPDATE
Here's an attempt on GitHub ... main part below
// repeatedly access/modify data, varying the STRIDE
for (int s = 4; s <= MAX_STRIDE/sizeof(int); s*=2) {
start = wall_clock_time();
for (unsigned int k = 0; k < REPS; k++) {
data[(k * s) & lengthMod]++;
}
end = wall_clock_time();
timeTaken = ((float)(end - start))/1000000000;
printf("%d, %1.2f \n", s * sizeof(int), timeTaken);
}
The problem is that there doesn't seem to be much difference between the timings. FYI, since it's for the L1 cache, I have SIZE = 32 K (size of array).

Allocate a BIG char array (make sure it is too big to fit in L1 or L2 cache). Fill it with random data.
Start walking over the array in steps of n bytes. Do something with the retrieved bytes, like summing them.
Benchmark and calculate how many bytes/second you can process with different values of n, starting from 1 and counting up to 1000 or so. Make sure that your benchmark prints out the calculated sum, so the compiler can't possibly optimize the benchmarked code away.
When n == your cache line size, each access will require reading a new line into the L1 cache. So the benchmark results should get slower quite sharply at that point.
If the array is big enough, by the time you reach the end, the data at the beginning of the array will already be out of cache again, which is what you want. So after you increment n and start again, the results will not be affected by having needed data already in the cache.
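The procedure above could be sketched roughly like this. It is an illustration only: the function names are invented, the result is reported as nanoseconds per access rather than bytes per second, and on real hardware the prefetcher may blur the step at the cache line size.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sums every `stride`-th byte of the buffer.
std::uint64_t sum_with_stride(const std::vector<unsigned char>& buf,
                              std::size_t stride) {
    assert(stride > 0);
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < buf.size(); i += stride) sum += buf[i];
    return sum;
}

// Times one pass and returns nanoseconds per access. The checksum is handed
// back through `sum_out` so the caller can print it, which keeps the compiler
// from optimizing the loop away.
double ns_per_access(const std::vector<unsigned char>& buf, std::size_t stride,
                     std::uint64_t& sum_out) {
    const auto t0 = std::chrono::steady_clock::now();
    sum_out = sum_with_stride(buf, stride);
    const auto t1 = std::chrono::steady_clock::now();
    const double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    const std::size_t accesses = (buf.size() + stride - 1) / stride;
    return accesses ? ns / static_cast<double>(accesses) : 0.0;
}
```

A driver would fill a buffer far larger than the last-level cache with random bytes, call ns_per_access for each stride from 1 up to 1000 or so, and print the stride, the time and the checksum.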

Have a look at Calibrator; all of the work is copyrighted but the source code is freely available. The idea for calculating cache line sizes described in its documentation sounds much more educated than what's already been said here:
The idea underlying our calibrator tool is to have a micro benchmark whose performance only depends
on the frequency of cache misses that occur. Our calibrator is a simple C program, mainly a small loop
that executes a million memory reads. By changing the stride (i.e., the offset between two subsequent
memory accesses) and the size of the memory area, we force varying cache miss rates.
In principle, the occurance of cache misses is determined by the array size. Array sizes that fit into
the L1 cache do not generate any cache misses once the data is loaded into the cache. Analogously,
arrays that exceed the L1 cache size but still fit into L2, will cause L1 misses but no L2 misses. Finally,
arrays larger than L2 cause both L1 and L2 misses.
The frequency of cache misses depends on the access stride and the cache line size. With strides
equal to or larger than the cache line size, a cache miss occurs with every iteration. With strides
smaller than the cache line size, a cache miss occurs only every n iterations (on average), where n is
the ratio cache line size / stride.
Thus, we can calculate the latency for a cache miss by comparing the execution time without
misses to the execution time with exactly one miss per iteration. This approach only works, if
memory accesses are executed purely sequential, i.e., we have to ensure that neither two or more load
instructions nor memory access and pure CPU work can overlap. We use a simple pointer chasing
mechanism to achieve this: the memory area we access is initialized such that each load returns the
address for the subsequent load in the next iteration. Thus, super-scalar CPUs cannot benefit from
their ability to hide memory access latency by speculative execution.
To measure the cache characteristics, we run our experiment several times, varying the stride and
the array size. We make sure that the stride varies at least between 4 bytes and twice the maximal
expected cache line size, and that the array size varies from half the minimal expected cache size to
at least ten times the maximal expected cache size.
I had to comment out #include "math.h" to get it compiled, after that it found my laptop's cache values correctly. I also couldn't view postscript files generated.
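The pointer chasing mechanism described in the quoted text can be sketched as below. This is my own minimal illustration, not Calibrator's code: indices are used instead of raw pointers, the stride is counted in elements of sizeof(std::size_t) bytes, and the function names are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Builds a chain over `count` slots in which slot i stores the index of the
// slot `stride` positions further on, wrapping around at the end.
std::vector<std::size_t> make_strided_chain(std::size_t count, std::size_t stride) {
    assert(count > 0);
    std::vector<std::size_t> next(count);
    for (std::size_t i = 0; i < count; ++i) next[i] = (i + stride) % count;
    return next;
}

// Follows the chain for `steps` loads. Each load depends on the result of the
// previous one, so the CPU cannot overlap them. Returns the final index, which
// the caller should print so the loop is not optimized away.
std::size_t chase(const std::vector<std::size_t>& next, std::size_t steps) {
    std::size_t i = 0;
    while (steps--) i = next[i];
    return i;
}
```

Timing chase for a fixed number of steps while varying count and stride gives the kind of latency curve the quoted text describes.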

You can use the CPUID function in assembler, although non portable, it will give you what you want.
For Intel Microprocessors, the Cache Line Size can be calculated by multiplying bh by 8 after calling cpuid function 0x1.
For AMD Microprocessors, the data Cache Line Size is in cl and the instruction Cache Line Size is in dl after calling cpuid function 0x80000005.
I took this from this article here.
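As a rough sketch of the register decoding described above: the two decode helpers are plain bit arithmetic, while the actual query relies on the __get_cpuid helper from <cpuid.h>, which I am assuming is available (GCC or Clang on x86). On other compilers or architectures the query function below simply returns 0. All function names are invented for this example. The leaf 0x1 field is the CLFLUSH line size, which is taken here to match the cache line size.

```cpp
#include <cstdint>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define HAVE_GNU_CPUID 1
#endif

// Leaf 0x1: bits 15..8 of EBX (the BH register) hold the CLFLUSH line size
// in units of 8 bytes.
constexpr std::uint32_t line_bytes_from_leaf1_ebx(std::uint32_t ebx) {
    return ((ebx >> 8) & 0xFFu) * 8u;
}

// Leaf 0x80000005 (AMD): the low byte of ECX (the CL register) is the L1 data
// cache line size in bytes.
constexpr std::uint32_t l1d_line_bytes_from_amd_ecx(std::uint32_t ecx) {
    return ecx & 0xFFu;
}

static_assert(line_bytes_from_leaf1_ebx(0x00000800u) == 64,
              "BH = 8 decodes to a 64-byte line");

// Returns the leaf 0x1 line size, or 0 where the helper is unavailable.
std::uint32_t query_line_bytes() {
#ifdef HAVE_GNU_CPUID
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return line_bytes_from_leaf1_ebx(ebx);
    }
#endif
    return 0;
}
```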

I think you should write a program that walks through the array in random order instead of sequentially, because modern processors do hardware prefetching.
For example, make an array of int in which each value is the index of the next cell to visit.
I wrote a similar program a year ago: http://pastebin.com/9mFScs9Z
Sorry for my English, I am not a native speaker.
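One way to build such an array is sketched below. It is not the code from the pastebin link: I am assuming Sattolo's algorithm here, which shuffles the indices into a single cycle so that the walk visits every cell before repeating. The function names and the seed parameter are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Sattolo's algorithm: turns 0..count-1 into a permutation that forms one
// single cycle, so following next[] from any cell visits all cells.
std::vector<std::size_t> make_random_cycle(std::size_t count, unsigned seed) {
    assert(count > 0);
    std::vector<std::size_t> next(count);
    std::iota(next.begin(), next.end(), std::size_t{0});
    std::mt19937 rng(seed);
    for (std::size_t i = count - 1; i > 0; --i) {
        // Pick strictly below i; never swapping a cell with itself is what
        // guarantees a single cycle.
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }
    return next;
}

// Follows the chain from cell 0 for `steps` dependent loads.
std::size_t walk(const std::vector<std::size_t>& next, std::size_t steps) {
    std::size_t i = 0;
    while (steps--) i = next[i];
    return i;
}
```

Because the order is unpredictable, the hardware prefetcher cannot guess the next cell, so the time per step of walk reflects the real access latency for the chosen array size.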

See how memtest86 is implemented. It measures and analyzes the data transfer rate in some way. The points where the rate changes correspond to the sizes of the L1, L2 and possibly L3 caches.

If you get stuck in the mud and can't get out, look here.
There are manuals and code that explain how to do what you're asking. The code is pretty high quality as well. Look at "Subroutine library".
The code and manuals are based on X86 processors.

Just a note.
Cache line size is variable on a few ARM Cortex families and can change during execution without any notification to the running program.

I think it should be enough to time an operation that uses some amount of memory. Then progressively increase the memory (operands, for instance) used by the operation.
When the operation's performance decreases severely, you have found the limit.
I would go with just reading a bunch of bytes without printing them (printing would hit the performance so badly that it would become the bottleneck). While reading, the time taken should be directly proportional to the amount of bytes read until the data no longer fits in L1; then you will get the performance hit.
You should also allocate the memory once at the start of the program and before starting to count time.
