Linux perf tool showing weird cache miss results

Linux perf tool showing weird cache miss results - c++

I'm using linux perf tools to profile one of CRONO benchmarks, I'm specifically interested in L1 DCache Misses, so I run the program like this:
perf record -e L1-dcache-read-misses -o perf/apsp.cycles apps/apsp/apsp 4 16384 16
It runs fine but generates those warnings:
WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted,
check /proc/sys/kernel/kptr_restrict.
Samples in kernel functions may not be resolved if a suitable vmlinux
file is not found in the buildid cache or in the vmlinux path.
Samples in kernel modules won't be resolved at all.
If some relocation was applied (e.g. kexec) symbols may be misresolved
even with a suitable vmlinux or kallsyms file.
Cannot read kernel map
Couldn't record kernel reference relocation symbol
Symbol resolution may be skewed if relocation was used (e.g. kexec).
Check /proc/kallsyms permission or run as root.
Threads Returned!
Threads Joined!
Time: 2.932636 seconds
[ perf record: Woken up 5 times to write data ]
[ perf record: Captured and wrote 1.709 MB perf/apsp.cycles (44765 samples) ]
I then annotate the output file like this:
perf annotate --stdio -i perf/apsp.cycles --dsos=apsp
But in one of the code sections, I see some weird results:
Percent | Source code & Disassembly of apsp for L1-dcache-read-misses
---------------------------------------------------------------------------
: {
: if((D[W_index[v][i]] > (D[v] + W[v][i])))
19.36 : 401140: movslq (%r10,%rcx,4),%rsi
14.50 : 401144: lea (%rax,%rsi,4),%rdi
1.22 : 401148: mov (%r9,%rcx,4),%esi
5.82 : 40114c: add (%rax,%r8,4),%esi
20.02 : 401150: cmp %esi,(%rdi)
0.00 : 401152: jle 401156 <do_work(void*)+0x226>
: D[W_index[v][i]] = D[v] + W[v][i];
9.72 : 401154: mov %esi,(%rdi)
19.93 : 401156: add $0x1,%rcx
:
Now in those results, How come that some arithmetic instructions have L1 read misses? Also, how come that instructions of the second statement cause so many cache misses even though they should've brought into cache by the previous if statement?
Am I doing something wrong here? I tried the same on a different machine with root access, it gave me similar results, so I think the warnings I mentioned above are not causing this. But what exactly is going on?

So we have this code:
for(v=0;v<N;v++)
{
for(int i = 0; i < DEG; i++)
{
if((/* (V2) 1000000000 * */ D[W_index[v][i]] > (D[v] + W[v][i])))
D[W_index[v][i]] = D[v] + W[v][i];
Q[v]=0; //Current vertex checked
}
}
Note that I added (V2) as a comment in the code. We below come back to this code.
First approximation
Remember that W_index is initialized as W_index[i][j] = i + j (A).
Let's focus on one inner iteration, and first let's assume that DEG is large. Further we assume that the cache is large enough to hold all data for at least two iterations.
D[W_index[v][i]]
The lookup W_index[v] is loaded into a register. For W_index[v][i] we assume one cache miss (64 byte cache line, 4 byte per int, we call the programm with DIM=16). The lookup in D starts always at v, so most of the required part of the array is already in cache. With the assumption that DEG is large this lookup is for free.
D[v] + W[v][i]
The lookup D[v] is for free as it depends on v. The second lookup is the same as above, one cache miss for the second dimension.
The whole inner statement has no influence.
Q[v]=0;
As this is v, this can be ignored.
When we sum up, we get two cache misses.
Second approximation
Now, we come back to the assumption that DEG is large. In fact this is wrong because DEG = 16. So there are fractions of cache misses we also need to consider.
D[W_index[v][i]]
The lookup W_index[v] costs 1/8 of a cache miss (it has a size of 8 bytes, a cache line is 64 byte, so we get a cache miss each eigth iteration).
The same is true for D[W_index[v][i]], except that D holds integers. In average all but one integer are in cache, so this costs 1/16 of a cache miss.
D[v] + W[v][i]
D[v] is already in cache (this is W_index[v][0]). But we get another 1/8 of a cache miss for W[v] for the same reasoning as above.
Q[v]=0;
This is another 1/16 of a cache miss.
And surprize, if we now use the code (V2) where the if-clause never evaluates to true, I get 2.395 cache misses per iteration (note that you really need to configure your CPU well, i.e., no hyperthreading, no turboboost, performance governor if possible). The calculation above would lead to 2.375. So we are pretty good.
Third approximation
Now there is this unfortunate if clause. How often does this comparison evaluate to true. We can't say, in the beginning it will be quite often, and in the end it will never evaluate to true.
So let's focus on the really first execution of the complete loop. In this case, D[v] is infinity and W[v][i] is a number between 1 and 101. So the loop evaluates to true in each iteration.
And then it gets hard - we get 2.9 cache misses in this iteration. Where are they coming from - all data should already be in cache.
But: This is the "mystery of compilers". You never know what they produce in the end. I compiled with GCC and Clang and get the same measures. I activate -funroll-loops, and suddenly I get 2.5 cache misses. Of course this may be different on your system. When I inspected the assembly, I observed that it is really exactly the same, just the loop has been unrolled four times.
So what does this tell us? You never know what your compiler does except you check it. And even then, you can't be sure.
I guess hardware prefetching or execution order could have an influence here. But this is a mystery.
Regarding perf and your problems with it
I think the measurements you did have two problems:
They are relative, the exact line is not that accurate.
You are multithreaded, this may be harder to track.
My experience is that when you want to get good measures for a specific part of your code, you really need to check it manually. Sometimes - not always - it can explain things pretty good.

Related

Low performance with CLWB (cacheline write-backs) to same location vs. cycling through a few lines

Why does the running time of the code below decrease when I increase kNumCacheLines?
In every iteration, the code modifies one of kNumCacheLines cachelines, writes the line to the DIMM with the clwb instruction, and blocks until the store hits the memory controller with sfence. This example requires Intel Skylake-server or newer Xeon, or IceLake client processors.
#include <stdlib.h>
#include <stdint.h>
#define clwb(addr) \
asm volatile(".byte 0x66; xsaveopt %0" : "+m"(*(volatile char *)(addr)));
static constexpr size_t kNumCacheLines = 1;
int main() {
uint8_t *buf = new uint8_t[kNumCacheLines * 64];
size_t data = 0;
for (size_t i = 0; i < 10000000; i++) {
size_t buf_offset = (i % kNumCacheLines) * 64;
buf[buf_offset] = data++;
clwb(&buf[buf_offset]);
asm volatile("sfence" ::: "memory");
}
delete [] buf;
}
(editor's note: _mm_sfence() and _mm_clwb(void*) would avoid needing inline asm, but this inline asm looks correct, including the "memory" clobber).
Here are some performance numbers on my Skylake Xeon machine, reported by running time ./bench with different values of kNumCacheLines:
kNumCacheLines Time (seconds)
1 2.00
2 2.14
3 1.74
4 1.82
5 1.00
6 1.17
7 1.04
8 1.06
Intuitively, I would expect kNumCacheLines = 1 to give the best performance because of hits in the memory controller's write pending queue. But, it is one of the slowest.
As an explanation for the unintuitive slowdown, it is possible that while the memory controller is completing a write to a cache line, it blocks other writes to the same cache line. I suspect that increasing kNumCacheLines increases performance because of higher parallelism available to the memory controller. The running time jumps from 1.82 seconds to 1.00 seconds when kNumCacheLines goes from four to five. This seems to correlate with the fact that the memory controller's write pending queue has space for 256 bytes from a thread [https://arxiv.org/pdf/1908.03583.pdf, Section 5.3].
Note that because buf is smaller than 4 KB, all accesses use the same DIMM. (Assuming it's aligned so it doesn't cross a page boundary)

This is probably fully explained by Intel's CLWB instruction invalidating cache lines - turns out SKX runs clwb the same as clflushopt, i.e. it's a stub implementation for forward compatibility so persistent-memory software can start using it without checking CPU feature levels.
More cache lines means more memory-level parallelism in reloading the invalidated lines for the next store. Or that the flush part is finished before we try to reload. One or the other; there are a lot of details I don't have a specific explanation for.
In each iteration, you store a counter value into a cache line and clwb it. (and sfence). The previous activity on that cache line was kNumCacheLines iterations ago.
We were expecting that these stores could just commit into lines that were already in Exclusive state, but in fact they're going to be Invalid with eviction probably still in flight down the cache hierarchy, depending on exactly when sfence stalls, and for how long.
So each store needs to wait for an RFO (Read For Ownership) to get the line back into cache in Exclusive state before it can commit from the store buffer to L1d.
It seems that you're only getting a factor of 2 speedup from using more cache lines, even though Skylake(-X) has 12 LFBs (i.e. can track 12 in-flight cache lines incoming or outgoing). Perhaps sfence has something to do with that.
The big jump from 4 to 5 is surprising. (Basically two levels of performance, not a continuous transition). That lends some weight to the hypothesis that it's something to do with the store having made it all the way to DRAM before we try to reload, rather than having multiple RFOs in flight. Or at least casts doubt on the idea that it's just MLP for RFOs. CLWB forcing eviction is pretty clearly key, but the specific details of exactly what happens and why there's any speedup is just pure guesswork on my part.
A more detailed analysis might tell us something about microarchitectural details if anyone wants to do one. This hopefully isn't a very normal access pattern so probably we can just avoid doing stuff like this most of the time!
(Possibly related: apparently repeated writes to the same line of Optane DC PM memory are slower than sequential writes, so you don't want write-through caching or an access pattern like this on that kind of non-volatile memory either.)

Why does using the same cache-line from multiple threads not cause serious slowdown?

Look at this snippet:
#include <atomic>
#include <thread>
typedef volatile unsigned char Type;
// typedef std::atomic_uchar Type;
void fn(Type *p) {
for (int i=0; i<500000000; i++) {
(*p)++;
}
}
int main() {
const int N = 4;
std::thread thr[N];
alignas(64) Type buffer[N*64];
for (int i=0; i<N; i++) {
thr[i] = std::thread(&fn, &buffer[i*1]);
}
for (int i=0; i<N; i++) {
thr[i].join();
}
}
This little program increments four adjacent bytes a lot of times from four different threads. Before, I used the rule: don't use the same cache line from different threads, as cache line sharing is bad. So I expected that a four thread version (N=4) is much slower than a one thread version (N=1).
However, these are my measurements (on a Haswell CPU):
N=1: 1 sec
N=4: 1.2 sec
So N=4 is not much slower. If I use different cache lines (replace *1 with *64), then N=4 becomes a little faster: 1.1 sec.
The same measurements for atomic access (swap the comments at typedef), same cache line:
N=1: 3.1 sec
N=4: 48 sec
So the N=4 case is much slower (as I expected). If different cache lines used, then N=4 has similar performance as N=1: 3.3 sec.
I don't understand the reason behind these results. Why don't I get a serious slowdown the non-atomic, N=4 case? Four cores have the same memory in their caches, so they must synchronize them somehow, don't they? How can they run almost perfectly parallel? Why just the atomic case gets a serious slowdown?
I think I need to understand how memory gets updated in this case. In the beginning, no cores have buffer in their caches. After one for iteration (in fn), all 4 cores have buffer in their cache-lines, but each core writes a different byte. How do these cache-lines get synchronized (in the non-atomic case)? How does the cache know, which byte is dirty? Or is there some other mechanism to handle this case? Why is this mechanism a lot cheaper (actually, it is almost free) than the atomic-one?

What you are seeing is basically the effect of the store buffer combined with store-to-load forwarding allowing each core to work mostly independently, despite sharing a cache line. As we will see below, it is truly a weird case where more contention is bad, up to a point, then even more contention suddenly makes things really fast!
Now with the conventional view of contention your code seems like something that will be high contention and therefore much slower than ideal. What happens, however, is that as soon as each core gets a single pending write in its write buffer, all later reads can be satisfied from the write buffer (store forwarding), and later writes just go into the buffer as well even after the core has lost ownership of the cache line. This turns most of the work into a totally local operation. The cache line is still bouncing around between the cores, but it's decoupled from the core execution path and is only needed to actually commit the stores now and then1.
The std::atomic version can't use this magic at all since it has to use locked operations to maintain atomicity and defeat the store buffer, so you see both the full cost of contention and the cost of the long-latency atomic operations2.
Let's try to actually collect some evidence that this is what's occurring. All of the discussion below deals with the non-atomic version of the benchmark that uses volatile to force reads and writes from buffer.
Let's first check the assembly, to make sure it's what we expect:
0000000000400c00 <fn(unsigned char volatile*)>:
400c00: ba 00 65 cd 1d mov edx,0x1dcd6500
400c05: 0f 1f 00 nop DWORD PTR [rax]
400c08: 0f b6 07 movzx eax,BYTE PTR [rdi]
400c0b: 83 c0 01 add eax,0x1
400c0e: 83 ea 01 sub edx,0x1
400c11: 88 07 mov BYTE PTR [rdi],al
400c13: 75 f3 jne 400c08 <fn(unsigned char volatile*)+0x8>
400c15: f3 c3 repz ret
It's straightforward: a five instruction loop with a byte load, an increment of the loaded byte, a byte store, and finally the loop increment and conditional jump back to the top. Here, gcc has missed an optimization by breaking up the sub and jne, inhibiting macro-fusion, but overall it's OK and the store-forwarding latency is going to limit the loop in any case.
Next, let's take a look at the number of L1D misses. Every time a core needs to write into the line that has been stolen away, it will suffer an L1D miss, which we can measure with perf. First, the single threaded (N=1) case:
$ perf stat -e task-clock,cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./cache-line-increment
Performance counter stats for './cache-line-increment':
1070.188749 task-clock (msec) # 0.998 CPUs utilized
2,775,874,257 cycles # 2.594 GHz
2,504,256,018 instructions # 0.90 insn per cycle
501,139,187 L1-dcache-loads # 468.272 M/sec
69,351 L1-dcache-load-misses # 0.01% of all L1-dcache hits
1.072119673 seconds time elapsed
It is about what we expect: essentially zero L1D misses (0.01% of the total, probably mostly from interrupts and other code outside the loop), and just over 500,000,000 hits (matching almost exactly the number of loop iterations). Note also that we can easily calculate the cycles per iteration: about 5.55. This primarily reflects the cost of store-to-load forwarding, plus one cycle for the increment, which is a carried dependency chain as the same location is repeatedly updated (and volatile means it can't be hoisted into a register).
Let's take a look at the N=4 case:
$ perf stat -e task-clock,cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./cache-line-increment
Performance counter stats for './cache-line-increment':
5920.758885 task-clock (msec) # 3.773 CPUs utilized
15,356,014,570 cycles # 2.594 GHz
10,012,249,418 instructions # 0.65 insn per cycle
2,003,487,964 L1-dcache-loads # 338.384 M/sec
61,450,818 L1-dcache-load-misses # 3.07% of all L1-dcache hits
1.569040529 seconds time elapsed
As expected the L1 loads jumps from 500 million to 2 billion, since there are 4 threads each doing the 500 million loads. The number of L1D misses also jumped by about a factor of 1,000, to about 60 million. Still, that number is not a lot compared to the 2 billion loads (and 2 billion stores - not shown, but we know they are there). That's ~33 loads and ~33 stores for every miss. It also means 250 cycles between each miss.
That doesn't really fit the model of the cache line bouncing around erratically between the cores, where as soon as a core gets the line, another core demands it. We know that lines bounce around between cores sharing an L2 in perhaps 20-50 cycles, so the ratio of one miss every 250 cycles seems way to low.
Two Hypotheses
A couple ideas spring to mind for the above described behavior:
Perhaps the MESI protocol variant used in this chip is "smart" and recognizes that one line is hot among several cores, but only a small amount of work is being done each time a core gets the lock and the line spends more time moving between L1 and L2 than actually satisfying loads and stores for some core. In light of this, some smart component in the coherence protocol decides to enforce some kind of minimum "ownership time" for each line: after a core gets the line, it will keep it for N cycles, even if demanded by another core (the other cores just have to wait).
This would help balance out the overhead of cache line ping-pong with real work, at the cost of "fairness" and responsiveness of the other cores, kind of like the trade-off between unfair and fair locks, and counteracting the effect described here, where the faster & fairer the coherency protocol is, the worse some (usually synthetic) loops may perform.
Now I've never heard of anything like that (and the immediately previous link shows that at least in the Sandy-Bridge era things were moving in the opposite direction), but it's certainly possible!
The store-buffer effect described is actually occurring, so most operations can complete almost locally.
Some Tests
Let's try to distinguish two cases with some modifications.
Reading and Writing Distinct Bytes
The obvious approach is to change the fn() work function so that the threads still contend on the same cache line, but where store-forwarding can't kick in.
How about we just read from location x and then write to location x + 1? We'll give each thread two consecutive locations (i.e., thr[i] = std::thread(&fn, &buffer[i*2])) so each thread is operating on two private bytes. The modified fn() looks like:
for (int i=0; i<500000000; i++)
unsigned char temp = p[0];
p[1] = temp + 1;
}
The core loop is pretty much identical to earlier:
400d78: 0f b6 07 movzx eax,BYTE PTR [rdi]
400d7b: 83 c0 01 add eax,0x1
400d7e: 83 ea 01 sub edx,0x1
400d81: 88 47 01 mov BYTE PTR [rdi+0x1],al
400d84: 75 f2 jne 400d78
The only thing that's changed is that we write to [rdi+0x1] rather than [rdi].
Now as I mentioned above, the original (same location) loop is actually running fairly slowly at about 5.5 cycles per iteration even in the best-case single-threaded case, because of the loop-carried load->add->store->load... dependency. This new code breaks that chain! The load no longer depends on the store so we can execute everything pretty much in parallel and I expect this loop to run at about 1.25 cycles per iteration (5 instructions / CPU width of 4).
Here's the single threaded case:
$ perf stat -e task-clock,cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./cache-line-increment
Performance counter stats for './cache-line-increment':
318.722631 task-clock (msec) # 0.989 CPUs utilized
826,349,333 cycles # 2.593 GHz
2,503,706,989 instructions # 3.03 insn per cycle
500,973,018 L1-dcache-loads # 1571.815 M/sec
63,507 L1-dcache-load-misses # 0.01% of all L1-dcache hits
0.322146774 seconds time elapsed
So about 1.65 cycles per iteration3, about about three times faster versus incrementing the same location.
How about 4 threads?
$ perf stat -e task-clock,cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./cache-line-increment
Performance counter stats for './cache-line-increment':
22299.699256 task-clock (msec) # 3.469 CPUs utilized
57,834,005,721 cycles # 2.593 GHz
10,038,366,836 instructions # 0.17 insn per cycle
2,011,160,602 L1-dcache-loads # 90.188 M/sec
237,664,926 L1-dcache-load-misses # 11.82% of all L1-dcache hits
6.428730614 seconds time elapsed
So it's about 4 times slower than the same location case. Now rather than being just a bit slower than the single-threaded case it is about 20 times slower. This is the contention you've been looking for! Now also that the number of L1D misses has increased by a factor of 4 as well, nicely explaining the performance degradation and consistent with the idea that when store-to-load forwarding can't hide the contention, misses will increase by a lot.
Increasing the Distance Between Stores
Another approach would be to increase the distance in time/instructions between the store and the subsequent load. We can do this by incrementing SPAN consecutive locations in the fn() method, rather than always the same location. E.g, if SPAN is 4, increment consecutively 4 locations like:
for (long i=0; i<500000000 / 4; i++) {
p[0]++;
p[1]++;
p[2]++;
p[3]++;
}
Note that we are still incrementing 500 million locations in total, just spreading out the increments among 4 bytes. Intuitively you would expect overall performance to increase since you now have SPAN parallel dependency with length 1/SPAN, so in the case above you might expect performance to improve by a factor of 4, since the 4 parallel chains can proceed at about 4 times the total throughput.
Here's what we actually get for time (measured in cycles) for the 1 thread and 3 thread4, for SPAN values from 1 to 20:
Initially you see performance increase substantially in both single and multi-threaded cases; the increase from a SPAN of one to two and three is close to the theoretical expected in the case of perfect parallelism for both cases.
The single-threaded case reaches an asymptote of about 4.25x faster than the single-location write: at this point the store-forwarding latency isn't the bottleneck and other bottlenecks have taken over (max IPC and store port contention, mostly).
The multi-threaded case is very different, however! Once you hit a SPAN of about 7, the performance rapidly gets worse, leveling out at about 2.5 times worse than the SPAN=1 case and almost 10x worse compared to the best performance at SPAN=5. What happens is that store-to-load forwarding stops occurring because the store and subsequent load are far enough apart in time/cycles that the store has retired to L1, so the load actually has to get the line and participate in MESI.
Also plotted is the L1D misses, which as mentioned above is indicative of "cache line transfers" between cores. The single-threaded case has essentially zero, and they are uncorrelated with the performance. The performance of the multi-threaded case, however, pretty much tracks exactly the cache misses. With SPAN values in the 2 to 6 range, where store-forwarding is still working, there are proportionally fewer misses. Evidently the core is able to "buffer up" more stores between each cache line transfer since the core loop is faster.
Another way to think of it is that in the contended case L1D misses are basically constant per unit-time (which makes sense, since they are basically tied to the L1->L2->L1 latency, plus some coherency protocol overhead), so the more work you can do in between the cache line transfers, the better.
Here's the code for the multi-span case:
void fn(Type *p) {
for (long i=0; i<500000000 / SPAN; i++) {
for (int j = 0; j < SPAN; j++) {
p[j]++;
}
}
}
The bash script to run perf for all SPAN value from 1 to 20:
PERF_ARGS=${1:--x, -r10}
for span in {1..20}; do
g++ -std=c++11 -g -O2 -march=native -DSPAN=$span cache-line-increment.cpp -lpthread -o cache-line-increment
perf stat ${PERF_ARGS} -e cycles,L1-dcache-loads,L1-dcache-load-misses,machine_clears.count,machine_clears.memory_ordering ./cache-line-increment
done
Finally, "transpose" the results into proper CSV:
FILE=result1.csv; for metric in cycles L1-dcache-loads L1-dcache-load-misses; do { echo $metric; grep $metric $FILE | cut -f1 -d,; } > ${metric}.tmp; done && paste -d, *.tmp
A Final Test
There's a final test that you can do to show that each core is effectively doing most of its work in private: use the version of the benchmark where the threads work on the same location (which doesn't change the performance characteristics) examine the sum of the final counter values (you'd need int counters rather than char). If everything was atomic, you'd have a sum of 2 billion, and in the non-atomic case how close the total is to that value is a rough measure of how frequently the cores were passing around the lines. If the cores are working almost totally privately, the value would be closer to 500 million than 2 billion, and I guess that's what you'll find (a value fairly close to 500 million).
With some more clever incrementing, you can even have each thread track how often the value they incremented came from their last increment rather than another threads increment (e.g., by using a few bits of the value to stash a thread identifier). With an even more clever test you could practically reconstruct thee way the cache line moved around between the cores (is there a pattern, e.g., does core A prefer to hand off to core B?) and which cores contributed most to the final value, etc.
That's all left as an exercise :).
1 On top of that, if Intel has a coalescing store buffer where later stores that fully overlap earlier ones kill the earlier stores, it would only have to commit one value to L1 (the latest store) every time it gets the line.
2 You can't really separate the two effects here, but we will do it later by defeating store-to-load forwarding.
3 A bit more than I expected, perhaps bad scheduling leading to port pressure. If gcc would just all the sub and jne to fuse, it runs at 1.1 cycles per iteration (still worse than the 1.0 I'd expect). It will do that I use -march=haswell instead of -march=native but I'm not going to go back and change all the numbers.
4 The results hold with 4 threads as well: but I only have 4 cores and I'm running stuff like Firefox in the background, so using 1 less core makes the measurements a lot less noisy. Measuring time in cycles helps a lot too.
5 On this CPU architecture, store forwarding where the load arrives before the store data is ready seem to alternate between 4 and 5 cycles, for an average of 4.5 cycles.

The atomic version has to ensure that some other thread will be able to read the result in a sequentially consistent fashion. So there are fences for each write.
The volatile version does not make any relationships visible to the other cores, so does not try and synchronize the memory so it is visible on other cores. For a multi-threaded system using C++11 or newer, volatile is not a mechanism for communicating between threads.

What might cause the same SSE code to run a few times slower in the same function?

Edit 3: The images are links to the full-size versions. Sorry for the pictures-of-text, but the graphs would be hard to copy/paste into a text table.
I have the following VTune profile for a program compiled with icc --std=c++14 -qopenmp -axS -O3 -fPIC:
In that profile, two clusters of instructions are highlighted in the assembly view. The upper cluster takes significantly less time than the lower one, in spite of instructions being identical and in the same order. Both clusters are located inside the same function and are obviously both called n times. This happens every time I run the profiler, on both a Westmere Xeon and a Haswell laptop that I'm using right now (compiled with SSE because that's what I'm targeting and learning right now).
What am I missing?
Ignore the poor concurrency, this is most probably due to the laptop throttling, since it doesn't occur on the desktop Xeon machine.
I believe this is not an example of micro-optimisation, since those three added together amount to a decent % of the total time, and I'm really interested about the possible cause of this behavior.
Edit: OMP_NUM_THREADS=1 taskset -c 1 /opt/intel/vtune...
Same profile, albeit with a slightly lower CPI this time.

HW perf counters typically charge stalls to the instruction that had to wait for its inputs, not the instruction that was slow producing outputs.
The inputs for your first group comes from your gather. This probably cache-misses a lot, and doesn't the costs aren't going to get charged to those SUBPS/MULPS/ADDPS instructions. Their inputs come directly from vector loads of voxel[], so store-forwarding failure will cause some latency. But that's only ~10 cycles IIRC, small compared to cache misses during the gather. (Those cache misses show up as large bars for the instructions right before the first group that you've highlighted)
The inputs for your second group come directly from loads that can miss in cache. In the first group, the direct consumers of the cache-miss loads were instructions for lines like the one that sets voxel[0], which has a really large bar.
But in the second group, the time for the cache misses in a_transfer[] is getting attributed to the group you've highlighted. Or if it's not cache misses, then maybe it's slow address calculation as the loads have to wait for RAX to be ready.
It looks like there's a lot you could optimize here.
instead of store/reload for a_pointf, just keep it hot across loop iterations in a __m128 variable. Storing/reloading in the C source only makes sense if you found the compiler was making a poor choice about which vector register to spill (if it ran out of registers).
calculate vi with _mm_cvttps_epi32(vf), so the ROUNDPS isn't part of the dependency chain for the gather indices.
Do the voxel gather yourself by shuffling narrow loads into vectors, instead of writing code that copies to an array and then loads from it. (guaranteed store-forwarding failure, see Agner Fog's optimization guides and other links from the x86 tag wiki).
It might be worth it to partially vectorize the address math (calculation of base_0, using PMULDQ with a constant vector), so instead of a store/reload (~5 cycle latency) you just have a MOVQ or two (~1 or 2 cycle latency on Haswell, I forget.)
Use MOVD to load two adjacent short values, and merge another pair into the second element with PINSRD. You'll probably get good code from _mm_setr_epi32(*(const int*)base_0, *(const int*)(base_0 + dim_x), 0, 0), except that pointer aliasing is undefined behaviour. You might get worse code from _mm_setr_epi16(*base_0, *(base_0 + 1), *(base_0 + dim_x), *(base_0 + dim_x + 1), 0,0,0,0).
Then expand the low four 16-bit elements into 32-bit elements integers with PMOVSX, and convert them all to float in parallel with _mm_cvtepi32_ps (CVTDQ2PS).
Your scalar LERPs aren't being auto-vectorized, but you're doing two in parallel (and could maybe save an instruction since you want the result in a vector anyway).
Calling floorf() is silly, and a function call forces the compiler to spill all xmm registers to memory. Compile with -ffast-math or whatever to let it inline to a ROUNDSS, or do that manually. Especially since you go ahead and load the float that you calculate from that into a vector!
Use a vector compare instead of scalar prev_x / prev_y / prev_z. Use MOVMASKPS to get the result into an integer you can test. (You only care about the lower 3 elements, so test it with compare_mask & 0b0111 (true if any of the low 3 bits of the 4-bit mask are set, after a compare for not-equal with _mm_cmpneq_ps. See the double version of the instruction for more tables on how it all works: http://www.felixcloutier.com/x86/CMPPD.html).

Well, analyzing assembly code please note that running time is attributed to the next instruction - so, the data you're looking by instructions need to be interpreted carefully. There is a corresponding note in VTune Release Notes:
Running time is attributed to the next instruction (200108041)
To collect the data about time-consuming running regions of the
target, the Intel® VTune™ Amplifier interrupts executing target
threads and attributes the time to the context IP address.
Due to the collection mechanism, the captured IP address points to an
instruction AFTER the one that is actually consuming most of the time.
This leads to the running time being attributed to the next
instruction (or, rarely to one of the subsequent instructions) in the
Assembly view. In rare cases, this can also lead to wrong attribution
of running time in the source - the time may be erroneously attributed
to the source line AFTER the actual hot line.
In case the inline mode is ON and the program has small functions
inlined at the hotspots, this can cause the running time to be
attributed to a wrong function since the next instruction can belong
to a different function in tightly inlined code.

Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs

I've been racking my brain for a week trying to complete this assignment and I'm hoping someone here can lead me toward the right path. Let me start with the instructor's instructions:
Your assignment is the opposite of our first lab assignment, which was to optimize a prime number program. Your purpose in this assignment is to pessimize the program, i.e. make it run slower. Both of these are CPU-intensive programs. They take a few seconds to run on our lab PCs. You may not change the algorithm.
To deoptimize the program, use your knowledge of how the Intel i7 pipeline operates. Imagine ways to re-order instruction paths to introduce WAR, RAW, and other hazards. Think of ways to minimize the effectiveness of the cache. Be diabolically incompetent.
The assignment gave a choice of Whetstone or Monte-Carlo programs. The cache-effectiveness comments are mostly only applicable to Whetstone, but I chose the Monte-Carlo simulation program:
// Un-modified baseline for pessimization, as given in the assignment
#include <algorithm> // Needed for the "max" function
#include <cmath>
#include <iostream>
// A simple implementation of the Box-Muller algorithm, used to generate
// gaussian random numbers - necessary for the Monte Carlo method below
// Note that C++11 actually provides std::normal_distribution<> in
// the <random> library, which can be used instead of this function
double gaussian_box_muller() {
double x = 0.0;
double y = 0.0;
double euclid_sq = 0.0;
// Continue generating two uniform random variables
// until the square of their "euclidean distance"
// is less than unity
do {
x = 2.0 * rand() / static_cast<double>(RAND_MAX)-1;
y = 2.0 * rand() / static_cast<double>(RAND_MAX)-1;
euclid_sq = x*x + y*y;
} while (euclid_sq >= 1.0);
return x*sqrt(-2*log(euclid_sq)/euclid_sq);
}
// Pricing a European vanilla call option with a Monte Carlo method
double monte_carlo_call_price(const int& num_sims, const double& S, const double& K, const double& r, const double& v, const double& T) {
double S_adjust = S * exp(T*(r-0.5*v*v));
double S_cur = 0.0;
double payoff_sum = 0.0;
for (int i=0; i<num_sims; i++) {
double gauss_bm = gaussian_box_muller();
S_cur = S_adjust * exp(sqrt(v*v*T)*gauss_bm);
payoff_sum += std::max(S_cur - K, 0.0);
}
return (payoff_sum / static_cast<double>(num_sims)) * exp(-r*T);
}
// Pricing a European vanilla put option with a Monte Carlo method
double monte_carlo_put_price(const int& num_sims, const double& S, const double& K, const double& r, const double& v, const double& T) {
double S_adjust = S * exp(T*(r-0.5*v*v));
double S_cur = 0.0;
double payoff_sum = 0.0;
for (int i=0; i<num_sims; i++) {
double gauss_bm = gaussian_box_muller();
S_cur = S_adjust * exp(sqrt(v*v*T)*gauss_bm);
payoff_sum += std::max(K - S_cur, 0.0);
}
return (payoff_sum / static_cast<double>(num_sims)) * exp(-r*T);
}
int main(int argc, char **argv) {
// First we create the parameter list
int num_sims = 10000000; // Number of simulated asset paths
double S = 100.0; // Option price
double K = 100.0; // Strike price
double r = 0.05; // Risk-free rate (5%)
double v = 0.2; // Volatility of the underlying (20%)
double T = 1.0; // One year until expiry
// Then we calculate the call/put values via Monte Carlo
double call = monte_carlo_call_price(num_sims, S, K, r, v, T);
double put = monte_carlo_put_price(num_sims, S, K, r, v, T);
// Finally we output the parameters and prices
std::cout << "Number of Paths: " << num_sims << std::endl;
std::cout << "Underlying: " << S << std::endl;
std::cout << "Strike: " << K << std::endl;
std::cout << "Risk-Free Rate: " << r << std::endl;
std::cout << "Volatility: " << v << std::endl;
std::cout << "Maturity: " << T << std::endl;
std::cout << "Call Price: " << call << std::endl;
std::cout << "Put Price: " << put << std::endl;
return 0;
}
The changes I have made seemed to increase the code running time by a second but I'm not entirely sure what I can change to stall the pipeline without adding code. A point to the right direction would be awesome, I appreciate any responses.
Update: the professor who gave this assignment posted some details
The highlights are:
It's a second semester architecture class at a community college (using the Hennessy and Patterson textbook).
the lab computers have Haswell CPUs
The students have been exposed to the CPUID instruction and how to determine cache size, as well as intrinsics and the CLFLUSH instruction.
any compiler options are allowed, and so is inline asm.
Writing your own square root algorithm was announced as being outside the pale
Cowmoogun's comments on the meta thread indicate that it wasn't clear compiler optimizations could be part of this, and assumed -O0, and that a 17% increase in run-time was reasonable.
So it sounds like the goal of the assignment was to get students to re-order the existing work to reduce instruction-level parallelism or things like that, but it's not a bad thing that people have delved deeper and learned more.
Keep in mind that this is a computer-architecture question, not a question about how to make C++ slow in general.

Important background reading: Agner Fog's microarch pdf, and probably also Ulrich Drepper's What Every Programmer Should Know About Memory. See also the other links in the x86 tag wiki, especially Intel's optimization manuals, and David Kanter's analysis of the Haswell microarchitecture, with diagrams.
Very cool assignment; much better than the ones I've seen where students were asked to optimize some code for gcc -O0, learning a bunch of tricks that don't matter in real code. In this case, you're being asked to learn about the CPU pipeline and use that to guide your de-optimization efforts, not just blind guessing. The most fun part of this one is justifying each pessimization with "diabolical incompetence", not intentional malice.
Problems with the assignment wording and code:
The uarch-specific options for this code are limited. It doesn't use any arrays, and much of the cost is calls to exp/log library functions. There isn't an obvious way to have more or less instruction-level parallelism, and the loop-carried dependency chain is very short.
It would be hard to get a slowdown just from re-arranging the expressions to change the dependencies, to reduce ILP from hazards.
Intel Sandybridge-family CPUs are aggressive out-of-order designs that spend lots of transistors and power to find parallelism and avoid hazards (dependencies) that would trouble a classic RISC in-order pipeline. Usually the only traditional hazards that slow it down are RAW "true" dependencies that cause throughput to be limited by latency.
WAR and WAW hazards for registers are pretty much not an issue, thanks to register renaming. (except for popcnt/lzcnt/tzcnt, which have a false dependency their destination on Intel CPUs, even though it should be write-only).
For memory ordering, modern CPUs use a store buffer to delay commit into cache until retirement, also avoiding WAR and WAW hazards. See also this answer about what a store buffer is, and being essential essential for OoO exec to decouple execution from things other cores can see.
Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) has more about register renaming and hiding FMA latency in an FP dot product loop.
The "i7" brand-name was introduced with Nehalem (successor to Core2), and some Intel manuals even say Core i7 when they seem to mean Nehalem, but they kept the "i7" branding for Sandybridge and later microarchitectures. SnB is when the P6-family evolved into a new species, the SnB-family. In many ways, Nehalem has more in common with Pentium III than with Sandybridge (e.g. register read stalls aka ROB-read stalls don't happen on SnB, because it changed to using a physical register file. Also a uop cache and a different internal uop format). The term "i7 architecture" is not useful, because it makes little sense to group the SnB-family with Nehalem but not Core2. (Nehalem did introduce the shared inclusive L3 cache architecture for connecting multiple cores together, though. And also integrated GPUs. So chip-level, the naming makes more sense.)
Summary of the good ideas that diabolical incompetence can justify
Even the diabolically incompetent are unlikely to add obviously useless work or an infinite loop, and making a mess with C++/Boost classes is beyond the scope of the assignment.
Multi-thread with a single shared std::atomic<uint64_t> loop counter, so the right total number of iterations happen. Atomic uint64_t is especially bad with -m32 -march=i586. For bonus points, arrange for it to be misaligned, and crossing a page boundary with an uneven split (not 4:4).
False sharing for some other non-atomic variable -> memory-order mis-speculation pipeline clears, as well as extra cache misses.
Instead of using - on FP variables, XOR the high byte with 0x80 to flip the sign bit, causing store-forwarding stalls.
Time each iteration independently, with something even heavier than RDTSC. e.g. CPUID / RDTSC or a time function that makes a system call. Serializing instructions are inherently pipeline-unfriendly.
Change multiplies by constants to divides by their reciprocal ("for ease of reading"). div is slow and not fully pipelined.
Vectorize the multiply/sqrt with AVX (SIMD), but fail to use vzeroupper before calls to scalar math-library exp() and log() functions, causing AVX<->SSE transition stalls.
Store the RNG output in a linked list, or in arrays which you traverse out of order. Same for the result of each iteration, and sum at the end.
Also covered in this answer but excluded from the summary: suggestions that would be just as slow on a non-pipelined CPU, or that don't seem to be justifiable even with diabolical incompetence. e.g. many gimp-the-compiler ideas that produce obviously different / worse asm.
Multi-thread badly
Maybe use OpenMP to multi-thread loops with very few iterations, with way more overhead than speed gain. Your monte-carlo code has enough parallelism to actually get a speedup, though, esp. if we succeed at making each iteration slow. (Each thread computes a partial payoff_sum, added at the end). #omp parallel on that loop would probably be an optimization, not a pessimization.
Multi-thread but force both threads to share the same loop counter (with atomic increments so the total number of iterations is correct). This seems diabolically logical. This means using a static variable as a loop counter. This justifies use of atomic for loop counters, and creates actual cache-line ping-ponging (as long as the threads don't run on the same physical core with hyperthreading; that might not be as slow). Anyway, this is much slower than the un-contended case for lock xadd or lock dec. And lock cmpxchg8b to atomically increment a contended uint64_t on a 32bit system will have to retry in a loop instead of having the hardware arbitrate an atomic inc.
Also create false sharing, where multiple threads keep their private data (e.g. RNG state) in different bytes of the same cache line. (Intel tutorial about it, including perf counters to look at). There's a microarchitecture-specific aspect to this: Intel CPUs speculate on memory mis-ordering not happening, and there's a memory-order machine-clear perf event to detect this, at least on P4. The penalty might not be as large on Haswell. As that link points out, a locked instruction assumes this will happen, avoiding mis-speculation. A normal load speculates that other cores won't invalidate a cache line between when the load executes and when it retires in program-order (unless you use pause). True sharing without locked instructions is usually a bug. It would be interesting to compare a non-atomic shared loop counter with the atomic case. To really pessimize, keep the shared atomic loop counter, and cause false sharing in the same or a different cache line for some other variable.
Random uarch-specific ideas:
If you can introduce any unpredictable branches, that will pessimize the code substantially. Modern x86 CPUs have quite long pipelines, so a mispredict costs ~15 cycles (when running from the uop cache).
Dependency chains:
I think this was one of the intended parts of the assignment.
Defeat the CPU's ability to exploit instruction-level parallelism by choosing an order of operations that has one long dependency chain instead of multiple short dependency chains. Compilers aren't allowed to change the order of operations for FP calculations unless you use -ffast-math, because that can change the results (as discussed below).
To really make this effective, increase the length of a loop-carried dependency chain. Nothing leaps out as obvious, though: The loops as written have very short loop-carried dependency chains: just an FP add. (3 cycles). Multiple iterations can have their calculations in-flight at once, because they can start well before the payoff_sum += at the end of the previous iteration. (log() and exp take many instructions, but not a lot more than Haswell's out-of-order window for finding parallelism: ROB size=192 fused-domain uops, and scheduler size=60 unfused-domain uops. As soon as execution of the current iteration progresses far enough to make room for instructions from the next iteration to issue, any parts of it that have their inputs ready (i.e. independent/separate dep chain) can start executing when older instructions leave the execution units free (e.g. because they're bottlenecked on latency, not throughput.).
The RNG state will almost certainly be a longer loop-carried dependency chain than the addps.
Use slower/more FP operations (esp. more division):
Divide by 2.0 instead of multiplying by 0.5, and so on. FP multiply is heavily pipelined in Intel designs, and has one per 0.5c throughput on Haswell and later. FP divsd/divpd is only partially pipelined. (Although Skylake has an impressive one per 4c throughput for divpd xmm, with 13-14c latency, vs not pipelined at all on Nehalem (7-22c)).
The do { ...; euclid_sq = x*x + y*y; } while (euclid_sq >= 1.0); is clearly testing for a distance, so clearly it would be proper to sqrt() it. :P (sqrt is even slower than div).
As #Paul Clayton suggests, rewriting expressions with associative/distributive equivalents can introduce more work (as long as you don't use -ffast-math to allow the compiler to re-optimize). (exp(T*(r-0.5*v*v)) could become exp(T*r - T*v*v/2.0). Note that while math on real numbers is associative, floating point math is not, even without considering overflow/NaN (which is why -ffast-math isn't on by default). See Paul's comment for a very hairy nested pow() suggestion.
If you can scale the calculations down to very small numbers, then FP math ops take ~120 extra cycles to trap to microcode when an operation on two normal numbers produces a denormal. See Agner Fog's microarch pdf for the exact numbers and details. This is unlikely since you have a lot of multiplies, so the scale factor would be squared and underflow all the way to 0.0. I don't see any way to justify the necessary scaling with incompetence (even diabolical), only intentional malice.
###If you can use intrinsics (<immintrin.h>)
Use movnti to evict your data from cache. Diabolical: it's new and weakly-ordered, so that should let the CPU run it faster, right? Or see that linked question for a case where someone was in danger of doing exactly this (for scattered writes where only some of the locations were hot). clflush is probably impossible without malice.
Use integer shuffles between FP math operations to cause bypass delays.
Mixing SSE and AVX instructions without proper use of vzeroupper causes large stalls in pre-Skylake (and a different penalty in Skylake). Even without that, vectorizing badly can be worse than scalar (more cycles spent shuffling data into/out of vectors than saved by doing the add/sub/mul/div/sqrt operations for 4 Monte-Carlo iterations at once, with 256b vectors). add/sub/mul execution units are fully pipelined and full-width, but div and sqrt on 256b vectors aren't as fast as on 128b vectors (or scalars), so the speedup isn't dramatic for double.
exp() and log() don't have hardware support, so that part would require extracting vector elements back to scalar and calling the library function separately, then shuffling the results back into a vector. libm is typically compiled to only use SSE2, so will use the legacy-SSE encodings of scalar math instructions. If your code uses 256b vectors and calls exp without doing a vzeroupper first, then you stall. After returning, an AVX-128 instruction like vmovsd to set up the next vector element as an arg for exp will also stall. And then exp() will stall again when it runs an SSE instruction. This is exactly what happened in this question, causing a 10x slowdown. (Thanks #ZBoson).
See also Nathan Kurz's experiments with Intel's math lib vs. glibc for this code. Future glibc will come with vectorized implementations of exp() and so on.
If targeting pre-IvB, or esp. Nehalem, try to get gcc to cause partial-register stalls with 16bit or 8bit operations followed by 32bit or 64bit operations. In most cases, gcc will use movzx after an 8 or 16bit operation, but here's a case where gcc modifies ah and then reads ax
With (inline) asm:
With (inline) asm, you could break the uop cache: A 32B chunk of code that doesn't fit in three 6uop cache lines forces a switch from the uop cache to the decoders. An incompetent ALIGN (like NASM's default) using many single-byte nops instead of a couple long nops on a branch target inside the inner loop might do the trick. Or put the alignment padding after the label, instead of before. :P This only matters if the frontend is a bottleneck, which it won't be if we succeeded at pessimizing the rest of the code.
Use self-modifying code to trigger pipeline clears (aka machine-nukes).
LCP stalls from 16bit instructions with immediates too large to fit in 8 bits are unlikely to be useful. The uop cache on SnB and later means you only pay the decode penalty once. On Nehalem (the first i7), it might work for a loop that doesn't fit in the 28 uop loop buffer. gcc will sometimes generate such instructions, even with -mtune=intel and when it could have used a 32bit instruction.
A common idiom for timing is CPUID(to serialize) then RDTSC. Time every iteration separately with a CPUID/RDTSC to make sure the RDTSC isn't reordered with earlier instructions, which will slow things down a lot. (In real life, the smart way to time is to time all the iterations together, instead of timing each separately and adding them up).
Cause lots of cache misses and other memory slowdowns
Use a union { double d; char a[8]; } for some of your variables. Cause a store-forwarding stall by doing a narrow store (or Read-Modify-Write) to just one of the bytes. (That wiki article also covers a lot of other microarchitectural stuff for load/store queues). e.g. flip the sign of a double using XOR 0x80 on just the high byte, instead of a - operator. The diabolically incompetent developer may have heard that FP is slower than integer, and thus try to do as much as possible using integer ops. (A compiler could theoretically still compile this to an xorps with a constant like -, but for x87 the compiler would have to realize that it's negating the value and fchs or replace the next add with a subtract.)
Use volatile if you're compiling with -O3 and not using std::atomic, to force the compiler to actually store/reload all over the place. Global variables (instead of locals) will also force some stores/reloads, but the C++ memory model's weak ordering doesn't require the compiler to spill/reload to memory all the time.
Replace local vars with members of a big struct, so you can control the memory layout.
Use arrays in the struct for padding (and storing random numbers, to justify their existence).
Choose your memory layout so everything goes into a different line in the same "set" in the L1 cache. It's only 8-way associative, i.e. each set has 8 "ways". Cache lines are 64B.
Even better, put things exactly 4096B apart, since loads have a false dependency on stores to different pages but with the same offset within a page. Aggressive out-of-order CPUs use Memory Disambiguation to figure out when loads and stores can be reordered without changing the results, and Intel's implementation has false-positives that prevent loads from starting early. Probably they only check bits below the page offset so it can start before the TLB has translated the high bits from a virtual page to a physical page. As well as Agner's guide, see this answer, and a section near the end of #Krazy Glew's answer on the same question. (Andy Glew was an architect of Intel's PPro - P6 microarchitecture.) (Also related: https://stackoverflow.com/a/53330296 and https://github.com/travisdowns/uarch-bench/wiki/Memory-Disambiguation-on-Skylake)
Use __attribute__((packed)) to let you mis-align variables so they span cache-line or even page boundaries. (So a load of one double needs data from two cache-lines). Misaligned loads have no penalty in any Intel i7 uarch, except when crossing cache lines and page lines. Cache-line splits still take extra cycles. Skylake dramatically reduces the penalty for page split loads, from 100 to 5 cycles. (Section 2.1.3). (And can do two page walks in parallel).
A page-split on an atomic<uint64_t> should be just about the worst case, esp. if it's 5 bytes in one page and 3 bytes in the other page, or anything other than 4:4. Even splits down the middle are more efficient for cache-line splits with 16B vectors on some uarches, IIRC. Put everything in a alignas(4096) struct __attribute((packed)) (to save space, of course), including an array for storage for the RNG results. Achieve the misalignment by using uint8_t or uint16_t for something before the counter.
If you can get the compiler to use indexed addressing modes, that will defeat uop micro-fusion. Maybe by using #defines to replace simple scalar variables with my_data[constant].
If you can introduce an extra level of indirection, so load/store addresses aren't known early, that can pessimize further.
Traverse arrays in non-contiguous order
I think we can come up with incompetent justification for introducing an array in the first place: It lets us separate the random number generation from the random number use. Results of each iteration could also be stored in an array, to be summed later (with more diabolical incompetence).
For "maximum randomness", we could have a thread looping over the random array writing new random numbers into it. The thread consuming the random numbers could generate a random index to load a random number from. (There's some make-work here, but microarchitecturally it helps for load-addresses to be known early so any possible load latency can be resolved before the loaded data is needed.) Having a reader and writer on different cores will cause memory-ordering mis-speculation pipeline clears (as discussed earlier for the false-sharing case).
For maximum pessimization, loop over your array with a stride of 4096 bytes (i.e. 512 doubles). e.g.
for (int i=0 ; i<512; i++)
for (int j=i ; j<UPPER_BOUND ; j+=512)
monte_carlo_step(rng_array[j]);
So the access pattern is 0, 4096, 8192, ...,
8, 4104, 8200, ...
16, 4112, 8208, ...
This is what you'd get for accessing a 2D array like double rng_array[MAX_ROWS][512] in the wrong order (looping over rows, instead of columns within a row in the inner loop, as suggested by #JesperJuhl). If diabolical incompetence can justify a 2D array with dimensions like that, garden variety real-world incompetence easily justifies looping with the wrong access pattern. This happens in real code in real life.
Adjust the loop bounds if necessary to use many different pages instead of reusing the same few pages, if the array isn't that big. Hardware prefetching doesn't work (as well/at all) across pages. The prefetcher can track one forward and one backward stream within each page (which is what happens here), but will only act on it if the memory bandwidth isn't already saturated with non-prefetch.
This will also generate lots of TLB misses, unless the pages get merged into a hugepage (Linux does this opportunistically for anonymous (not file-backed) allocations like malloc/new that use mmap(MAP_ANONYMOUS)).
Instead of an array to store the list of results, you could use a linked list. Every iteration would require a pointer-chasing load (a RAW true dependency hazard for the load-address of the next load). With a bad allocator, you might manage to scatter the list nodes around in memory, defeating cache. With a bad toy allocator, it could put every node at the beginning of its own page. (e.g. allocate with mmap(MAP_ANONYMOUS) directly, without breaking up pages or tracking object sizes to properly support free).
These aren't really microarchitecture-specific, and have little to do with the pipeline (most of these would also be a slowdown on a non-pipelined CPU).
Somewhat off-topic: make the compiler generate worse code / do more work:
Use C++11 std::atomic<int> and std::atomic<double> for the most pessimal code. The MFENCEs and locked instructions are quite slow even without contention from another thread.
-m32 will make slower code, because x87 code will be worse than SSE2 code. The stack-based 32bit calling convention takes more instructions, and passes even FP args on the stack to functions like exp(). atomic<uint64_t>::operator++ on -m32 requires a lock cmpxchg8B loop (i586). (So use that for loop counters! [Evil laugh]).
-march=i386 will also pessimize (thanks #Jesper). FP compares with fcom are slower than 686 fcomi. Pre-586 doesn't provide an atomic 64bit store, (let alone a cmpxchg), so all 64bit atomic ops compile to libgcc function calls (which is probably compiled for i686, rather than actually using a lock). Try it on the Godbolt Compiler Explorer link in the last paragraph.
Use long double / sqrtl / expl for extra precision and extra slowness in ABIs where sizeof(long double) is 10 or 16 (with padding for alignment). (IIRC, 64bit Windows uses 8byte long double equivalent to double. (Anyway, load/store of 10byte (80bit) FP operands is 4 / 7 uops, vs. float or double only taking 1 uop each for fld m64/m32/fst). Forcing x87 with long double defeats auto-vectorization even for gcc -m64 -march=haswell -O3.
If not using atomic<uint64_t> loop counters, use long double for everything, including loop counters.
atomic<double> compiles, but read-modify-write operations like += aren't supported for it (even on 64bit). atomic<long double> has to call a library function just for atomic loads/stores. It's probably really inefficient, because the x86 ISA doesn't naturally support atomic 10byte loads/stores, and the only way I can think of without locking (cmpxchg16b) requires 64bit mode.
At -O0, breaking up a big expression by assigning parts to temporary vars will cause more store/reloads. Without volatile or something, this won't matter with optimization settings that a real build of real code would use.
C aliasing rules allow a char to alias anything, so storing through a char* forces the compiler to store/reload everything before/after the byte-store, even at -O3. (This is a problem for auto-vectorizing code that operates on an array of uint8_t, for example.)
Try uint16_t loop counters, to force truncation to 16bit, probably by using 16bit operand-size (potential stalls) and/or extra movzx instructions (safe). Signed overflow is undefined behaviour, so unless you use -fwrapv or at least -fno-strict-overflow, signed loop counters don't have to be re-sign-extended every iteration, even if used as offsets to 64bit pointers.
Force conversion from integer to float and back again. And/or double<=>float conversions. The instructions have latency > 1, and scalar int->float (cvtsi2ss) is badly designed to not zero the rest of the xmm register. (gcc inserts an extra pxor to break dependencies, for this reason.)
Frequently set your CPU affinity to a different CPU (suggested by #Egwor). diabolical reasoning: You don't want one core to get overheated from running your thread for a long time, do you? Maybe swapping to another core will let that core turbo to a higher clock speed. (In reality: they're so thermally close to each other that this is highly unlikely except in a multi-socket system). Now just get the tuning wrong and do it way too often. Besides the time spent in the OS saving/restoring thread state, the new core has cold L2/L1 caches, uop cache, and branch predictors.
Introducing frequent unnecessary system calls can slow you down no matter what they are. Although some important but simple ones like gettimeofday may be implemented in user-space with, with no transition to kernel mode. (glibc on Linux does this with the kernel's help: the kernel exports code+data in the VDSO).
For more on system call overhead (including cache/TLB misses after returning to user-space, not just the context switch itself), the FlexSC paper has some great perf-counter analysis of the current situation, as well as a proposal for batching system calls from massively multi-threaded server processes.

A few things that you can do to make things perform as bad as possible:
compile the code for the i386 architecture. This will prevent the use of SSE and newer instructions and force the use of the x87 FPU.
use std::atomic variables everywhere. This will make them very expensive due to the compiler being forced to insert memory barriers all over the place. And this is something an incompetent person might plausibly do to "ensure thread safety".
make sure to access memory in the worst possible way for the prefetcher to predict (column major vs row major).
to make your variables extra expensive you could make sure they all have 'dynamic storage duration' (heap allocated) by allocating them with new rather than letting them have 'automatic storage duration' (stack allocated).
make sure that all memory you allocate is very oddly aligned and by all means avoid allocating huge pages, since doing so would be much too TLB efficient.
whatever you do, don't build your code with the compilers optimizer enabled. And make sure to enable the most expressive debug symbols you can (won't make the code run slower, but it'll waste some extra disk space).
Note: This answer basically just summarizes my comments that #Peter Cordes already incorporated into his very good answer. Suggest he get's your upvote if you only have one to spare :)

You can use long double for computation. On x86 it should be the 80-bit format. Only the legacy, x87 FPU has support for this.
Few shortcomings of x87 FPU:
Lack of SIMD, may need more instructions.
Stack based, problematic for super scalar and pipelined architectures.
Separate and quite small set of registers, may need more conversion from other registers and more memory operations.
On the Core i7 there are 3 ports for SSE and only 2 for x87, the processor can execute less parallel instructions.

Late answer but I don't feel we have abused linked lists and the TLB enough.
Use mmap to allocate your nodes, such that your mostly use the MSB of the address. This should result in long TLB lookup chains, a page is 12 bits, leaving 52 bits for the translation, or around 5 levels it must travers each time. With a bit of luck they must go to memory each time for 5 levels lookup plus 1 memory access to get to your node, the top level will most likely be in cache somewhere, so we can hope for 5*memory access. Place the node so that is strides the worst border so that reading the next pointer would cause another 3-4 translation lookups. This might also totally wreck the cache due to the massive amount of translation lookups. Also the size of the virtual tables might cause most of the user data to be paged to disk for extra time.
When reading from the single linked list, make sure to read from the start of the list each time to cause maximum delay in reading a single number.

Do atoi() and atof() cache?? They seem to execute quicker the more times called

I used _rdtsc() to time atoi() and atof() and I noticed they were taking pretty long. I therefore wrote my own versions of these functions which were much quicker from the first call.
I am using Windows 7, VS2012 IDE but with the Intel C/C++ compiler v13. I have -/O3 enabled and also -/Ot ("favour fast code"). My CPU is an Ivy Bridge (mobile).
Upon further investigation, it seemed that the more times atoi() and atof() were called, the quicker they executed?? I am talking magnitudes faster:
When I call atoi() from outside my loop, just the once, it takes 5,892 CPU cycles but after thousands of iterations this reduced to 300 - 600 CPU cycles (quite a large execution time range).
atof() initially takes 20,000 to 30,000 CPU cycles and then later on after a few thousand iterations it was taking 18 - 28 CPU cycles (which is the speed at which my custom function takes the first time it is called).
Could someone please explain this effect?
EDIT: forgot to say- the basic setup of my program was a loop parsing bytes from a file. Inside the loop I obviously use my atof and atoi to notice the above. However, what I also noticed is that when I did my investigation before the loop, just calling atoi and atof twice, along with my user-written equivalent functions twice, it seemed to make the loop execute faster. The loop processed 150,000 lines of data, each line requiring 3x atof() or atoi()s. Once again, I cannot understand why calling these functions before my main loop affected the speed of a program calling these functions 500,000 times?!
#include <ia32intrin.h>
int main(){
//call myatoi() and time it
//call atoi() and time it
//call myatoi() and time it
//call atoi() and time it
char* bytes2 = "45632";
_int64 start2 = _rdtsc();
unsigned int a2 = atoi(bytes2);
_int64 finish2 = _rdtsc();
cout << (finish2 - start2) << " CPU cycles for atoi()" << endl;
//call myatof() and time it
//call atof() and time it
//call myatof() and time it
//call atof() and time it
//Iterate through 150,000 lines, each line about 25 characters.
//The below executes slower if the above debugging is NOT done.
while(i < file_size){
//Loop through my data, call atoi() or atof() 1 or 2 times per line
switch(bytes[i]){
case ' ':
//I have an array of shorts which records the distance from the beginning
//of the line to each of the tokens in the line. In the below switch
//statement offset_to_price and offset_to_qty refer to this array.
case '\n':
switch(message_type){
case 'A':
char* temp = bytes + offset_to_price;
_int64 start = _rdtsc();
price = atof(temp);
_int64 finish = _rdtsc();
cout << (finish - start) << " CPU cycles" << endl;
//Other processing with the tokens
break;
case 'R':
//Get the 4th line token using atoi() as above
char* temp = bytes + offset_to_qty;
_int64 start = _rdtsc();
price = atoi(temp);
_int64 finish = _rdtsc();
cout << (finish - start) << " CPU cycles" << endl;
//Other processing with the tokens
break;
}
break;
}
}
}
The lines in the file are like this (with no blank lines in between):
34605792 R dacb 100
34605794 A racb S 44.17 100
34605797 R kacb 100
34605799 A sacb S 44.18 100
34605800 R nacb 100
34605800 A tacb B 44.16 100
34605801 R gacb 100
I am using atoi() on the 4th element in the 'R' messages and 5th element in 'A' messages and using atof() on the 4th element in the 'A' messages.

I'm guessing the reason why you see such a drastic improvement for atoi and atof, but not for your own, simpler function, is that the former have a large number of branches in order to handle all the edge cases. The first few times, this leads to a large number of incorrect branch predictions, which are costly. But after a few times, the predictions get more accurate. A correctly predicted branch is almost free, which would then make them competitive with your simpler version which doesn't include the branches to begin with.
Caching surely also important, but I don't think that explains why your own function was fast from the beginning, and did not see any relevant improvements after repeated execution (if I understand you correctly).

Using RDTSC for profiling is dangerous. From the Intel processor manual:
The RDTSC instruction is not a serializing instruction. It does not necessarily wait until all previous instructions
have been executed before reading the counter. Similarly, subsequent instructions may begin execution before the
read operation is performed. If software requires RDTSC to be executed only after all previous instructions have
completed locally, it can either use RDTSCP (if the processor supports that instruction) or execute the sequence
LFENCE;RDTSC.
With the inevitable Heisenberg effect that causes, you'll now measure the cost of RDTSCP or LFENCE. Consider measuring a loop instead.

Measuring performance for a single call like this isn't advisable. You'd get too many variance due to power throttles, interrupts and other OS/system interferences, measurement overhead, and as said above - cold/warm variance. On top of that, rdtsc is no longer considered a reliable measurement since your CPU may throttle its own frequency, but for the sake of this simple check we can say it's good enough.
You should run your code at least several thousands of times, discard some portion at the beginning, and then divide to get the average - that would give you the "warm" performance, which would include (as mentioned in the comments above) close caches hit latency for both code and data (and also TLBs), good branch prediction, and might also negate some of the external impacts (such as having only recently woken up your CPU from a powerdown state).
Of course, you may argue that this performance is too optimistic because in real scenarios you won't always hit the L1 cache etc.. - it may still be fine for comparing two different methods (such as competing with the library ato* functions), just don't count on the results for real life. You can also make the test slightly harder, and call the function with a more elaborate pattern of inputs that would stress the caches a bit better.
As for your question regarding the 20k-30k cycles - that's exactly the reason why you should discard the first few iterations. This isn't just cache-miss latency, you're actually waiting for the first instructions to do a code fetch, which may also wait for the code page translation to do a page walk (a long process that may involve multiple memory accesses), and if you're really unlucky - also swapping in a page from disk, which requires OS assistance and lots of IO latency. And this is still before you started executing the first instruction.

The most likely explanation is that because you are calling atoi/atof so often, it is being identified as a hot spot and thus being kept in the Level 1 or Level 2 processor code cache. The CPU's replacement policy -- that microcode that determines what cache lines can be purged when a cache miss occurs) would tag such a hot spot to be kept in cache. There's a decent write up of cpu caching technlogies on wikipedia, if you are interested.
Your initial timings were low because your code wasn't in the CPU's most performant cache yet, but once invoked some number of times, were.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js