Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs - c++

I've been racking my brain for a week trying to complete this assignment and I'm hoping someone here can lead me toward the right path. Let me start with the instructor's instructions:
Your assignment is the opposite of our first lab assignment, which was to optimize a prime number program. Your purpose in this assignment is to pessimize the program, i.e. make it run slower. Both of these are CPU-intensive programs. They take a few seconds to run on our lab PCs. You may not change the algorithm.
To deoptimize the program, use your knowledge of how the Intel i7 pipeline operates. Imagine ways to re-order instruction paths to introduce WAR, RAW, and other hazards. Think of ways to minimize the effectiveness of the cache. Be diabolically incompetent.
The assignment gave a choice of Whetstone or Monte-Carlo programs. The cache-effectiveness comments are mostly only applicable to Whetstone, but I chose the Monte-Carlo simulation program:
// Un-modified baseline for pessimization, as given in the assignment
#include <algorithm> // Needed for the "max" function
#include <cmath>
#include <iostream>
// A simple implementation of the Box-Muller algorithm, used to generate
// gaussian random numbers - necessary for the Monte Carlo method below
// Note that C++11 actually provides std::normal_distribution<> in
// the <random> library, which can be used instead of this function
double gaussian_box_muller() {
double x = 0.0;
double y = 0.0;
double euclid_sq = 0.0;
// Continue generating two uniform random variables
// until the square of their "euclidean distance"
// is less than unity
do {
x = 2.0 * rand() / static_cast<double>(RAND_MAX)-1;
y = 2.0 * rand() / static_cast<double>(RAND_MAX)-1;
euclid_sq = x*x + y*y;
} while (euclid_sq >= 1.0);
return x*sqrt(-2*log(euclid_sq)/euclid_sq);
}
// Pricing a European vanilla call option with a Monte Carlo method
double monte_carlo_call_price(const int& num_sims, const double& S, const double& K, const double& r, const double& v, const double& T) {
double S_adjust = S * exp(T*(r-0.5*v*v));
double S_cur = 0.0;
double payoff_sum = 0.0;
for (int i=0; i<num_sims; i++) {
double gauss_bm = gaussian_box_muller();
S_cur = S_adjust * exp(sqrt(v*v*T)*gauss_bm);
payoff_sum += std::max(S_cur - K, 0.0);
}
return (payoff_sum / static_cast<double>(num_sims)) * exp(-r*T);
}
// Pricing a European vanilla put option with a Monte Carlo method
double monte_carlo_put_price(const int& num_sims, const double& S, const double& K, const double& r, const double& v, const double& T) {
double S_adjust = S * exp(T*(r-0.5*v*v));
double S_cur = 0.0;
double payoff_sum = 0.0;
for (int i=0; i<num_sims; i++) {
double gauss_bm = gaussian_box_muller();
S_cur = S_adjust * exp(sqrt(v*v*T)*gauss_bm);
payoff_sum += std::max(K - S_cur, 0.0);
}
return (payoff_sum / static_cast<double>(num_sims)) * exp(-r*T);
}
int main(int argc, char **argv) {
// First we create the parameter list
int num_sims = 10000000; // Number of simulated asset paths
double S = 100.0; // Option price
double K = 100.0; // Strike price
double r = 0.05; // Risk-free rate (5%)
double v = 0.2; // Volatility of the underlying (20%)
double T = 1.0; // One year until expiry
// Then we calculate the call/put values via Monte Carlo
double call = monte_carlo_call_price(num_sims, S, K, r, v, T);
double put = monte_carlo_put_price(num_sims, S, K, r, v, T);
// Finally we output the parameters and prices
std::cout << "Number of Paths: " << num_sims << std::endl;
std::cout << "Underlying: " << S << std::endl;
std::cout << "Strike: " << K << std::endl;
std::cout << "Risk-Free Rate: " << r << std::endl;
std::cout << "Volatility: " << v << std::endl;
std::cout << "Maturity: " << T << std::endl;
std::cout << "Call Price: " << call << std::endl;
std::cout << "Put Price: " << put << std::endl;
return 0;
}
The changes I have made seemed to increase the code running time by a second but I'm not entirely sure what I can change to stall the pipeline without adding code. A point to the right direction would be awesome, I appreciate any responses.
Update: the professor who gave this assignment posted some details
The highlights are:
It's a second semester architecture class at a community college (using the Hennessy and Patterson textbook).
the lab computers have Haswell CPUs
The students have been exposed to the CPUID instruction and how to determine cache size, as well as intrinsics and the CLFLUSH instruction.
any compiler options are allowed, and so is inline asm.
Writing your own square root algorithm was announced as being outside the pale
Cowmoogun's comments on the meta thread indicate that it wasn't clear compiler optimizations could be part of this, and assumed -O0, and that a 17% increase in run-time was reasonable.
So it sounds like the goal of the assignment was to get students to re-order the existing work to reduce instruction-level parallelism or things like that, but it's not a bad thing that people have delved deeper and learned more.
Keep in mind that this is a computer-architecture question, not a question about how to make C++ slow in general.

Important background reading: Agner Fog's microarch pdf, and probably also Ulrich Drepper's What Every Programmer Should Know About Memory. See also the other links in the x86 tag wiki, especially Intel's optimization manuals, and David Kanter's analysis of the Haswell microarchitecture, with diagrams.
Very cool assignment; much better than the ones I've seen where students were asked to optimize some code for gcc -O0, learning a bunch of tricks that don't matter in real code. In this case, you're being asked to learn about the CPU pipeline and use that to guide your de-optimization efforts, not just blind guessing. The most fun part of this one is justifying each pessimization with "diabolical incompetence", not intentional malice.
Problems with the assignment wording and code:
The uarch-specific options for this code are limited. It doesn't use any arrays, and much of the cost is calls to exp/log library functions. There isn't an obvious way to have more or less instruction-level parallelism, and the loop-carried dependency chain is very short.
It would be hard to get a slowdown just from re-arranging the expressions to change the dependencies, to reduce ILP from hazards.
Intel Sandybridge-family CPUs are aggressive out-of-order designs that spend lots of transistors and power to find parallelism and avoid hazards (dependencies) that would trouble a classic RISC in-order pipeline. Usually the only traditional hazards that slow it down are RAW "true" dependencies that cause throughput to be limited by latency.
WAR and WAW hazards for registers are pretty much not an issue, thanks to register renaming. (except for popcnt/lzcnt/tzcnt, which have a false dependency their destination on Intel CPUs, even though it should be write-only).
For memory ordering, modern CPUs use a store buffer to delay commit into cache until retirement, also avoiding WAR and WAW hazards. See also this answer about what a store buffer is, and being essential essential for OoO exec to decouple execution from things other cores can see.
Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) has more about register renaming and hiding FMA latency in an FP dot product loop.
The "i7" brand-name was introduced with Nehalem (successor to Core2), and some Intel manuals even say Core i7 when they seem to mean Nehalem, but they kept the "i7" branding for Sandybridge and later microarchitectures. SnB is when the P6-family evolved into a new species, the SnB-family. In many ways, Nehalem has more in common with Pentium III than with Sandybridge (e.g. register read stalls aka ROB-read stalls don't happen on SnB, because it changed to using a physical register file. Also a uop cache and a different internal uop format). The term "i7 architecture" is not useful, because it makes little sense to group the SnB-family with Nehalem but not Core2. (Nehalem did introduce the shared inclusive L3 cache architecture for connecting multiple cores together, though. And also integrated GPUs. So chip-level, the naming makes more sense.)
Summary of the good ideas that diabolical incompetence can justify
Even the diabolically incompetent are unlikely to add obviously useless work or an infinite loop, and making a mess with C++/Boost classes is beyond the scope of the assignment.
Multi-thread with a single shared std::atomic<uint64_t> loop counter, so the right total number of iterations happen. Atomic uint64_t is especially bad with -m32 -march=i586. For bonus points, arrange for it to be misaligned, and crossing a page boundary with an uneven split (not 4:4).
False sharing for some other non-atomic variable -> memory-order mis-speculation pipeline clears, as well as extra cache misses.
Instead of using - on FP variables, XOR the high byte with 0x80 to flip the sign bit, causing store-forwarding stalls.
Time each iteration independently, with something even heavier than RDTSC. e.g. CPUID / RDTSC or a time function that makes a system call. Serializing instructions are inherently pipeline-unfriendly.
Change multiplies by constants to divides by their reciprocal ("for ease of reading"). div is slow and not fully pipelined.
Vectorize the multiply/sqrt with AVX (SIMD), but fail to use vzeroupper before calls to scalar math-library exp() and log() functions, causing AVX<->SSE transition stalls.
Store the RNG output in a linked list, or in arrays which you traverse out of order. Same for the result of each iteration, and sum at the end.
Also covered in this answer but excluded from the summary: suggestions that would be just as slow on a non-pipelined CPU, or that don't seem to be justifiable even with diabolical incompetence. e.g. many gimp-the-compiler ideas that produce obviously different / worse asm.
Multi-thread badly
Maybe use OpenMP to multi-thread loops with very few iterations, with way more overhead than speed gain. Your monte-carlo code has enough parallelism to actually get a speedup, though, esp. if we succeed at making each iteration slow. (Each thread computes a partial payoff_sum, added at the end). #omp parallel on that loop would probably be an optimization, not a pessimization.
Multi-thread but force both threads to share the same loop counter (with atomic increments so the total number of iterations is correct). This seems diabolically logical. This means using a static variable as a loop counter. This justifies use of atomic for loop counters, and creates actual cache-line ping-ponging (as long as the threads don't run on the same physical core with hyperthreading; that might not be as slow). Anyway, this is much slower than the un-contended case for lock xadd or lock dec. And lock cmpxchg8b to atomically increment a contended uint64_t on a 32bit system will have to retry in a loop instead of having the hardware arbitrate an atomic inc.
Also create false sharing, where multiple threads keep their private data (e.g. RNG state) in different bytes of the same cache line. (Intel tutorial about it, including perf counters to look at). There's a microarchitecture-specific aspect to this: Intel CPUs speculate on memory mis-ordering not happening, and there's a memory-order machine-clear perf event to detect this, at least on P4. The penalty might not be as large on Haswell. As that link points out, a locked instruction assumes this will happen, avoiding mis-speculation. A normal load speculates that other cores won't invalidate a cache line between when the load executes and when it retires in program-order (unless you use pause). True sharing without locked instructions is usually a bug. It would be interesting to compare a non-atomic shared loop counter with the atomic case. To really pessimize, keep the shared atomic loop counter, and cause false sharing in the same or a different cache line for some other variable.
Random uarch-specific ideas:
If you can introduce any unpredictable branches, that will pessimize the code substantially. Modern x86 CPUs have quite long pipelines, so a mispredict costs ~15 cycles (when running from the uop cache).
Dependency chains:
I think this was one of the intended parts of the assignment.
Defeat the CPU's ability to exploit instruction-level parallelism by choosing an order of operations that has one long dependency chain instead of multiple short dependency chains. Compilers aren't allowed to change the order of operations for FP calculations unless you use -ffast-math, because that can change the results (as discussed below).
To really make this effective, increase the length of a loop-carried dependency chain. Nothing leaps out as obvious, though: The loops as written have very short loop-carried dependency chains: just an FP add. (3 cycles). Multiple iterations can have their calculations in-flight at once, because they can start well before the payoff_sum += at the end of the previous iteration. (log() and exp take many instructions, but not a lot more than Haswell's out-of-order window for finding parallelism: ROB size=192 fused-domain uops, and scheduler size=60 unfused-domain uops. As soon as execution of the current iteration progresses far enough to make room for instructions from the next iteration to issue, any parts of it that have their inputs ready (i.e. independent/separate dep chain) can start executing when older instructions leave the execution units free (e.g. because they're bottlenecked on latency, not throughput.).
The RNG state will almost certainly be a longer loop-carried dependency chain than the addps.
Use slower/more FP operations (esp. more division):
Divide by 2.0 instead of multiplying by 0.5, and so on. FP multiply is heavily pipelined in Intel designs, and has one per 0.5c throughput on Haswell and later. FP divsd/divpd is only partially pipelined. (Although Skylake has an impressive one per 4c throughput for divpd xmm, with 13-14c latency, vs not pipelined at all on Nehalem (7-22c)).
The do { ...; euclid_sq = x*x + y*y; } while (euclid_sq >= 1.0); is clearly testing for a distance, so clearly it would be proper to sqrt() it. :P (sqrt is even slower than div).
As #Paul Clayton suggests, rewriting expressions with associative/distributive equivalents can introduce more work (as long as you don't use -ffast-math to allow the compiler to re-optimize). (exp(T*(r-0.5*v*v)) could become exp(T*r - T*v*v/2.0). Note that while math on real numbers is associative, floating point math is not, even without considering overflow/NaN (which is why -ffast-math isn't on by default). See Paul's comment for a very hairy nested pow() suggestion.
If you can scale the calculations down to very small numbers, then FP math ops take ~120 extra cycles to trap to microcode when an operation on two normal numbers produces a denormal. See Agner Fog's microarch pdf for the exact numbers and details. This is unlikely since you have a lot of multiplies, so the scale factor would be squared and underflow all the way to 0.0. I don't see any way to justify the necessary scaling with incompetence (even diabolical), only intentional malice.
###If you can use intrinsics (<immintrin.h>)
Use movnti to evict your data from cache. Diabolical: it's new and weakly-ordered, so that should let the CPU run it faster, right? Or see that linked question for a case where someone was in danger of doing exactly this (for scattered writes where only some of the locations were hot). clflush is probably impossible without malice.
Use integer shuffles between FP math operations to cause bypass delays.
Mixing SSE and AVX instructions without proper use of vzeroupper causes large stalls in pre-Skylake (and a different penalty in Skylake). Even without that, vectorizing badly can be worse than scalar (more cycles spent shuffling data into/out of vectors than saved by doing the add/sub/mul/div/sqrt operations for 4 Monte-Carlo iterations at once, with 256b vectors). add/sub/mul execution units are fully pipelined and full-width, but div and sqrt on 256b vectors aren't as fast as on 128b vectors (or scalars), so the speedup isn't dramatic for double.
exp() and log() don't have hardware support, so that part would require extracting vector elements back to scalar and calling the library function separately, then shuffling the results back into a vector. libm is typically compiled to only use SSE2, so will use the legacy-SSE encodings of scalar math instructions. If your code uses 256b vectors and calls exp without doing a vzeroupper first, then you stall. After returning, an AVX-128 instruction like vmovsd to set up the next vector element as an arg for exp will also stall. And then exp() will stall again when it runs an SSE instruction. This is exactly what happened in this question, causing a 10x slowdown. (Thanks #ZBoson).
See also Nathan Kurz's experiments with Intel's math lib vs. glibc for this code. Future glibc will come with vectorized implementations of exp() and so on.
If targeting pre-IvB, or esp. Nehalem, try to get gcc to cause partial-register stalls with 16bit or 8bit operations followed by 32bit or 64bit operations. In most cases, gcc will use movzx after an 8 or 16bit operation, but here's a case where gcc modifies ah and then reads ax
With (inline) asm:
With (inline) asm, you could break the uop cache: A 32B chunk of code that doesn't fit in three 6uop cache lines forces a switch from the uop cache to the decoders. An incompetent ALIGN (like NASM's default) using many single-byte nops instead of a couple long nops on a branch target inside the inner loop might do the trick. Or put the alignment padding after the label, instead of before. :P This only matters if the frontend is a bottleneck, which it won't be if we succeeded at pessimizing the rest of the code.
Use self-modifying code to trigger pipeline clears (aka machine-nukes).
LCP stalls from 16bit instructions with immediates too large to fit in 8 bits are unlikely to be useful. The uop cache on SnB and later means you only pay the decode penalty once. On Nehalem (the first i7), it might work for a loop that doesn't fit in the 28 uop loop buffer. gcc will sometimes generate such instructions, even with -mtune=intel and when it could have used a 32bit instruction.
A common idiom for timing is CPUID(to serialize) then RDTSC. Time every iteration separately with a CPUID/RDTSC to make sure the RDTSC isn't reordered with earlier instructions, which will slow things down a lot. (In real life, the smart way to time is to time all the iterations together, instead of timing each separately and adding them up).
Cause lots of cache misses and other memory slowdowns
Use a union { double d; char a[8]; } for some of your variables. Cause a store-forwarding stall by doing a narrow store (or Read-Modify-Write) to just one of the bytes. (That wiki article also covers a lot of other microarchitectural stuff for load/store queues). e.g. flip the sign of a double using XOR 0x80 on just the high byte, instead of a - operator. The diabolically incompetent developer may have heard that FP is slower than integer, and thus try to do as much as possible using integer ops. (A compiler could theoretically still compile this to an xorps with a constant like -, but for x87 the compiler would have to realize that it's negating the value and fchs or replace the next add with a subtract.)
Use volatile if you're compiling with -O3 and not using std::atomic, to force the compiler to actually store/reload all over the place. Global variables (instead of locals) will also force some stores/reloads, but the C++ memory model's weak ordering doesn't require the compiler to spill/reload to memory all the time.
Replace local vars with members of a big struct, so you can control the memory layout.
Use arrays in the struct for padding (and storing random numbers, to justify their existence).
Choose your memory layout so everything goes into a different line in the same "set" in the L1 cache. It's only 8-way associative, i.e. each set has 8 "ways". Cache lines are 64B.
Even better, put things exactly 4096B apart, since loads have a false dependency on stores to different pages but with the same offset within a page. Aggressive out-of-order CPUs use Memory Disambiguation to figure out when loads and stores can be reordered without changing the results, and Intel's implementation has false-positives that prevent loads from starting early. Probably they only check bits below the page offset so it can start before the TLB has translated the high bits from a virtual page to a physical page. As well as Agner's guide, see this answer, and a section near the end of #Krazy Glew's answer on the same question. (Andy Glew was an architect of Intel's PPro - P6 microarchitecture.) (Also related: https://stackoverflow.com/a/53330296 and https://github.com/travisdowns/uarch-bench/wiki/Memory-Disambiguation-on-Skylake)
Use __attribute__((packed)) to let you mis-align variables so they span cache-line or even page boundaries. (So a load of one double needs data from two cache-lines). Misaligned loads have no penalty in any Intel i7 uarch, except when crossing cache lines and page lines. Cache-line splits still take extra cycles. Skylake dramatically reduces the penalty for page split loads, from 100 to 5 cycles. (Section 2.1.3). (And can do two page walks in parallel).
A page-split on an atomic<uint64_t> should be just about the worst case, esp. if it's 5 bytes in one page and 3 bytes in the other page, or anything other than 4:4. Even splits down the middle are more efficient for cache-line splits with 16B vectors on some uarches, IIRC. Put everything in a alignas(4096) struct __attribute((packed)) (to save space, of course), including an array for storage for the RNG results. Achieve the misalignment by using uint8_t or uint16_t for something before the counter.
If you can get the compiler to use indexed addressing modes, that will defeat uop micro-fusion. Maybe by using #defines to replace simple scalar variables with my_data[constant].
If you can introduce an extra level of indirection, so load/store addresses aren't known early, that can pessimize further.
Traverse arrays in non-contiguous order
I think we can come up with incompetent justification for introducing an array in the first place: It lets us separate the random number generation from the random number use. Results of each iteration could also be stored in an array, to be summed later (with more diabolical incompetence).
For "maximum randomness", we could have a thread looping over the random array writing new random numbers into it. The thread consuming the random numbers could generate a random index to load a random number from. (There's some make-work here, but microarchitecturally it helps for load-addresses to be known early so any possible load latency can be resolved before the loaded data is needed.) Having a reader and writer on different cores will cause memory-ordering mis-speculation pipeline clears (as discussed earlier for the false-sharing case).
For maximum pessimization, loop over your array with a stride of 4096 bytes (i.e. 512 doubles). e.g.
for (int i=0 ; i<512; i++)
for (int j=i ; j<UPPER_BOUND ; j+=512)
monte_carlo_step(rng_array[j]);
So the access pattern is 0, 4096, 8192, ...,
8, 4104, 8200, ...
16, 4112, 8208, ...
This is what you'd get for accessing a 2D array like double rng_array[MAX_ROWS][512] in the wrong order (looping over rows, instead of columns within a row in the inner loop, as suggested by #JesperJuhl). If diabolical incompetence can justify a 2D array with dimensions like that, garden variety real-world incompetence easily justifies looping with the wrong access pattern. This happens in real code in real life.
Adjust the loop bounds if necessary to use many different pages instead of reusing the same few pages, if the array isn't that big. Hardware prefetching doesn't work (as well/at all) across pages. The prefetcher can track one forward and one backward stream within each page (which is what happens here), but will only act on it if the memory bandwidth isn't already saturated with non-prefetch.
This will also generate lots of TLB misses, unless the pages get merged into a hugepage (Linux does this opportunistically for anonymous (not file-backed) allocations like malloc/new that use mmap(MAP_ANONYMOUS)).
Instead of an array to store the list of results, you could use a linked list. Every iteration would require a pointer-chasing load (a RAW true dependency hazard for the load-address of the next load). With a bad allocator, you might manage to scatter the list nodes around in memory, defeating cache. With a bad toy allocator, it could put every node at the beginning of its own page. (e.g. allocate with mmap(MAP_ANONYMOUS) directly, without breaking up pages or tracking object sizes to properly support free).
These aren't really microarchitecture-specific, and have little to do with the pipeline (most of these would also be a slowdown on a non-pipelined CPU).
Somewhat off-topic: make the compiler generate worse code / do more work:
Use C++11 std::atomic<int> and std::atomic<double> for the most pessimal code. The MFENCEs and locked instructions are quite slow even without contention from another thread.
-m32 will make slower code, because x87 code will be worse than SSE2 code. The stack-based 32bit calling convention takes more instructions, and passes even FP args on the stack to functions like exp(). atomic<uint64_t>::operator++ on -m32 requires a lock cmpxchg8B loop (i586). (So use that for loop counters! [Evil laugh]).
-march=i386 will also pessimize (thanks #Jesper). FP compares with fcom are slower than 686 fcomi. Pre-586 doesn't provide an atomic 64bit store, (let alone a cmpxchg), so all 64bit atomic ops compile to libgcc function calls (which is probably compiled for i686, rather than actually using a lock). Try it on the Godbolt Compiler Explorer link in the last paragraph.
Use long double / sqrtl / expl for extra precision and extra slowness in ABIs where sizeof(long double) is 10 or 16 (with padding for alignment). (IIRC, 64bit Windows uses 8byte long double equivalent to double. (Anyway, load/store of 10byte (80bit) FP operands is 4 / 7 uops, vs. float or double only taking 1 uop each for fld m64/m32/fst). Forcing x87 with long double defeats auto-vectorization even for gcc -m64 -march=haswell -O3.
If not using atomic<uint64_t> loop counters, use long double for everything, including loop counters.
atomic<double> compiles, but read-modify-write operations like += aren't supported for it (even on 64bit). atomic<long double> has to call a library function just for atomic loads/stores. It's probably really inefficient, because the x86 ISA doesn't naturally support atomic 10byte loads/stores, and the only way I can think of without locking (cmpxchg16b) requires 64bit mode.
At -O0, breaking up a big expression by assigning parts to temporary vars will cause more store/reloads. Without volatile or something, this won't matter with optimization settings that a real build of real code would use.
C aliasing rules allow a char to alias anything, so storing through a char* forces the compiler to store/reload everything before/after the byte-store, even at -O3. (This is a problem for auto-vectorizing code that operates on an array of uint8_t, for example.)
Try uint16_t loop counters, to force truncation to 16bit, probably by using 16bit operand-size (potential stalls) and/or extra movzx instructions (safe). Signed overflow is undefined behaviour, so unless you use -fwrapv or at least -fno-strict-overflow, signed loop counters don't have to be re-sign-extended every iteration, even if used as offsets to 64bit pointers.
Force conversion from integer to float and back again. And/or double<=>float conversions. The instructions have latency > 1, and scalar int->float (cvtsi2ss) is badly designed to not zero the rest of the xmm register. (gcc inserts an extra pxor to break dependencies, for this reason.)
Frequently set your CPU affinity to a different CPU (suggested by #Egwor). diabolical reasoning: You don't want one core to get overheated from running your thread for a long time, do you? Maybe swapping to another core will let that core turbo to a higher clock speed. (In reality: they're so thermally close to each other that this is highly unlikely except in a multi-socket system). Now just get the tuning wrong and do it way too often. Besides the time spent in the OS saving/restoring thread state, the new core has cold L2/L1 caches, uop cache, and branch predictors.
Introducing frequent unnecessary system calls can slow you down no matter what they are. Although some important but simple ones like gettimeofday may be implemented in user-space with, with no transition to kernel mode. (glibc on Linux does this with the kernel's help: the kernel exports code+data in the VDSO).
For more on system call overhead (including cache/TLB misses after returning to user-space, not just the context switch itself), the FlexSC paper has some great perf-counter analysis of the current situation, as well as a proposal for batching system calls from massively multi-threaded server processes.

A few things that you can do to make things perform as bad as possible:
compile the code for the i386 architecture. This will prevent the use of SSE and newer instructions and force the use of the x87 FPU.
use std::atomic variables everywhere. This will make them very expensive due to the compiler being forced to insert memory barriers all over the place. And this is something an incompetent person might plausibly do to "ensure thread safety".
make sure to access memory in the worst possible way for the prefetcher to predict (column major vs row major).
to make your variables extra expensive you could make sure they all have 'dynamic storage duration' (heap allocated) by allocating them with new rather than letting them have 'automatic storage duration' (stack allocated).
make sure that all memory you allocate is very oddly aligned and by all means avoid allocating huge pages, since doing so would be much too TLB efficient.
whatever you do, don't build your code with the compilers optimizer enabled. And make sure to enable the most expressive debug symbols you can (won't make the code run slower, but it'll waste some extra disk space).
Note: This answer basically just summarizes my comments that #Peter Cordes already incorporated into his very good answer. Suggest he get's your upvote if you only have one to spare :)

You can use long double for computation. On x86 it should be the 80-bit format. Only the legacy, x87 FPU has support for this.
Few shortcomings of x87 FPU:
Lack of SIMD, may need more instructions.
Stack based, problematic for super scalar and pipelined architectures.
Separate and quite small set of registers, may need more conversion from other registers and more memory operations.
On the Core i7 there are 3 ports for SSE and only 2 for x87, the processor can execute less parallel instructions.

Late answer but I don't feel we have abused linked lists and the TLB enough.
Use mmap to allocate your nodes, such that your mostly use the MSB of the address. This should result in long TLB lookup chains, a page is 12 bits, leaving 52 bits for the translation, or around 5 levels it must travers each time. With a bit of luck they must go to memory each time for 5 levels lookup plus 1 memory access to get to your node, the top level will most likely be in cache somewhere, so we can hope for 5*memory access. Place the node so that is strides the worst border so that reading the next pointer would cause another 3-4 translation lookups. This might also totally wreck the cache due to the massive amount of translation lookups. Also the size of the virtual tables might cause most of the user data to be paged to disk for extra time.
When reading from the single linked list, make sure to read from the start of the list each time to cause maximum delay in reading a single number.

Related

What is faster in C++: mod (%) or another counter?

At the risk of this being a duplicate, maybe I just can't find a similar post right now:
I am writing in C++ (C++20 to be specific). I have a loop with a counter that counts up every turn. Let's call it counter. And if this counter reaches a page-limit (let's call it page_limit), the program should continue on the next page. So it looks something like this:
const size_t page_limit = 4942;
size_t counter = 0;
while (counter < foo) {
if (counter % page_limit == 0) {
// start new page
}
// some other code
counter += 1;
}
Now I am wondering since the counter goes pretty high: would the program run faster, if I wouldn't have the program calculate the modulo counter % page_limit every time, but instead make another counter? It could look something like this:
const size_t page_limit = 4942;
size_t counter = 0;
size_t page_counter = 4942;
while (counter < foo) {
if (page_counter == page_limit) {
// start new page
page_counter = 0;
}
// some other code
counter += 1;
page_counter += 1;
}
Most optimizing compilers will convert divide or modulo operations into multiply by pre-generated inverse constant and shift instructions if the divisor is a constant. Possibly also if the same divisor value is used repeatedly in a loop.
Modulo multiplies by inverse to get a quotient, then multiplies quotient by divisor to get a product, and then original number minus product will be the modulo.
Multiply and shift are fast instructions on reasonably recent X86 processors, but branch prediction can also reduce the time it takes for a conditional branch, so as suggested a benchmark may be needed to determine which is best.
(I assume you meant to write if(x%y==0) not if(x%y), to be equivalent to the counter.)
I don't think compilers will do this optimization for you, so it could be worth it. It's going to be smaller code-size, even if you can't measure a speed difference. The x % y == 0 way still branches (so is still subject to a branch misprediction those rare times when it's true). Its only advantage is that it doesn't need a separate counter variable, just some temporary registers at one point in the loop. But it does need the divisor every iteration.
Overall this should be better for code size, and isn't less readable if you're used to the idiom. (Especially if you use if(--page_count == 0) { page_count=page_limit; ... so all pieces of the logic are in two adjacent lines.)
If your page_limit were not a compile-time constant, this is even more likely to help. dec/jz that's only taken once per many decrements is a lot cheaper than div/test edx,edx/jz, including for front-end throughput. (div is micro-coded on Intel CPUs as about 10 uops, so even though it's one instruction it still takes up the front-end for multiple cycles, taking away throughput resources from getting surrounding code into the out-of-order back-end).
(With a constant divisor, it's still multiply, right shift, sub to get the quotient, then multiply and subtract to get the remainder from that. So still several single-uop instructions. Although there are some tricks for divisibility testing by small constants see #Cassio Neri's answer on Fast divisibility tests (by 2,3,4,5,.., 16)? which cites his journal articles; recent GCC may have started using these.)
But if your loop body doesn't bottleneck on front-end instruction/uop throughput (on x86), or the divider execution unit, then out-of-order exec can probably hide most of the cost of even a div instruction. It's not on the critical path so it could be mostly free if its latency happens in parallel with other computation, and there are spare throughput resources. (Branch prediction + speculative execution allow execution to continue without waiting for the branch condition to be known, and since this work is independent of other work it can "run ahead" as the compiler can see into future iterations.)
Still, making that work even cheaper can help the compiler see and handle a branch mispredict sooner. But modern CPUs with fast recovery can keep working on old instructions from before the branch while recovering. ( What exactly happens when a skylake CPU mispredicts a branch? / Avoid stalling pipeline by calculating conditional early )
And of course a few loops do fully keep the CPU's throughput resources busy, not bottlenecking on cache misses or a latency chain. And fewer uops executed per iteration is more friendly to the other hyperthread (or SMT in general).
Or if you care about your code running on in-order CPUs (common for ARM and other non-x86 ISAs that target low-power implementations), the real work has to wait for the branch-condition logic. (Only hardware prefetch or cache-miss loads and things like that can be doing useful work while running extra code to test the branch condition.)
Use a down-counter
Instead of counting up, you'd actually want to hand-hold the compiler into using a down-counter that can compile to dec reg / jz .new_page or similar; all normal ISAs can do that quite cheaply because it's the same kind of thing you'd find at the bottom of normal loops. (dec/jnz to keep looping while non-zero)
if(--page_counter == 0) {
/*new page*/;
page_counter = page_limit;
}
A down-counter is more efficient in asm and equally readable in C (compared to an up-counter), so if you're micro-optimizing you should write it that way. Related: using that technique in hand-written asm FizzBuzz. Maybe also a code review of asm sum of multiples of 3 and 5, but it does nothing for no-match so optimizing it is different.
Notice that page_limit is only accessed inside the if body, so if the compiler is low on registers it can easily spill that and only read it as needed, not tying up a register with it or with multiplier constants.
Or if it's a known constant, just a move-immediate instruction. (Most ISAs also have compare-immediate, but not all. e.g. MIPS and RISC-V only have compare-and-branch instructions that use the space in the instruction word for a target address, not for an immediate.) Many RISC ISAs have special support for efficiently setting a register to a wider constant than most instructions that take an immediate (like ARM movw with a 16-bit immediate, so 4092 can be done in one instruction more mov but not cmp: it doesn't fit in 12 bits).
Compared to dividing (or multiplicative inverse), most RISC ISAs don't have multiply-immediate, and a multiplicative inverse is usually wider than one immediate can hold. (x86 does have multiply-immediate, but not for the form that gives you a high-half.) Divide-immediate is even rarer, not even x86 has that at all, but no compiler would use that unless optimizing for space instead of speed if it did exist.
CISC ISAs like x86 can typically multiply or divide with a memory source operand, so if low on registers the compiler could keep the divisor in memory (especially if it's a runtime variable). Loading once per iteration (hitting in cache) is not expensive. But spilling and reloading an actual variable that changes inside the loop (like page_count) could introduce a store/reload latency bottleneck if the loop is short enough and there aren't enough registers. (Although that might not be plausible: if your loop body is big enough to need all the registers, it probably has enough latency to hide a store/reload.)
If somebody put it in front of me, I would rather it was:
const size_t page_limit = 4942;
size_t npages = 0, nitems = 0;
size_t pagelim = foo / page_limit;
size_t resid = foo % page_limit;
while (npages < pagelim || nitems < resid) {
if (++nitems == page_limit) {
/* start new page */
nitems = 0;
npages++;
}
}
Because the program is now expressing the intent of the processing -- a bunch of things in page_limit sized chunks; rather than an attempt to optimize away an operation.
That the compiler might generate nicer code is just a blessing.

Do current x86 architectures support non-temporal loads (from "normal" memory)?

I am aware of multiple questions on this topic, however, I haven't seen any clear answers nor any benchmark measurements. I thus created a simple program that works with two arrays of integers. The first array a is very large (64 MB) and the second array b is small to fit into L1 cache. The program iterates over a and adds its elements to corresponding elements of b in a modular sense (when the end of b is reached, the program starts from its beginning again). The measured numbers of L1 cache misses for different sizes of b is as follows:
The measurements were made on a Xeon E5 2680v3 Haswell type CPU with 32 kiB L1 data cache. Therefore, in all the cases, b fitted into L1 cache. However, the number of misses grew considerably by around 16 kiB of b memory footprint. This might be expected since the loads of both a and b causes invalidation of cache lines from the beginning of b at this point.
There is absolutely no reason to keep elements of a in cache, they are used only once. I therefore run a program variant with non-temporal loads of a data, but the number of misses did not change. I also run a variant with non-temporal prefetching of a data, but still with the very same results.
My benchmark code is as follows (variant w/o non-temporal prefetching shown):
int main(int argc, char* argv[])
{
uint64_t* a;
const uint64_t a_bytes = 64 * 1024 * 1024;
const uint64_t a_count = a_bytes / sizeof(uint64_t);
posix_memalign((void**)(&a), 64, a_bytes);
uint64_t* b;
const uint64_t b_bytes = atol(argv[1]) * 1024;
const uint64_t b_count = b_bytes / sizeof(uint64_t);
posix_memalign((void**)(&b), 64, b_bytes);
__m256i ones = _mm256_set1_epi64x(1UL);
for (long i = 0; i < a_count; i += 4)
_mm256_stream_si256((__m256i*)(a + i), ones);
// load b into L1 cache
for (long i = 0; i < b_count; i++)
b[i] = 0;
int papi_events[1] = { PAPI_L1_DCM };
long long papi_values[1];
PAPI_start_counters(papi_events, 1);
uint64_t* a_ptr = a;
const uint64_t* a_ptr_end = a + a_count;
uint64_t* b_ptr = b;
const uint64_t* b_ptr_end = b + b_count;
while (a_ptr < a_ptr_end) {
#ifndef NTLOAD
__m256i aa = _mm256_load_si256((__m256i*)a_ptr);
#else
__m256i aa = _mm256_stream_load_si256((__m256i*)a_ptr);
#endif
__m256i bb = _mm256_load_si256((__m256i*)b_ptr);
bb = _mm256_add_epi64(aa, bb);
_mm256_store_si256((__m256i*)b_ptr, bb);
a_ptr += 4;
b_ptr += 4;
if (b_ptr >= b_ptr_end)
b_ptr = b;
}
PAPI_stop_counters(papi_values, 1);
std::cout << "L1 cache misses: " << papi_values[0] << std::endl;
free(a);
free(b);
}
What I wonder is whether CPU vendors support or are going to support non-temporal loads / prefetching or any other way how to label some data as not-being-hold in cache (e.g., to tag them as LRU). There are situations, e.g., in HPC, where similar scenarios are common in practice. For example, in sparse iterative linear solvers / eigensolvers, matrix data are usually very large (larger than cache capacities), but vectors are sometimes small enough to fit into L3 or even L2 cache. Then, we would like to keep them there at all costs. Unfortunately, loading of matrix data can cause invalidation of especially x-vector cache lines, even though in each solver iteration, matrix elements are used only once and there is no reason to keep them in cache after they have been processed.
UPDATE
I just did a similar experiment on an Intel Xeon Phi KNC, while measuring runtime instead of L1 misses (I haven't find a way how to measure them reliably; PAPI and VTune gave weird metrics.) The results are here:
The orange curve represents ordinary loads and it has the expected shape. The blue curve represents loads with so-call eviction hint (EH) set in the instruction prefix and the gray curve represents a case where each cache line of a was manually evicted; both these tricks enabled by KNC obviously worked as we wanted to for b over 16 kiB. The code of the measured loop is as follows:
while (a_ptr < a_ptr_end) {
#ifdef NTLOAD
__m512i aa = _mm512_extload_epi64((__m512i*)a_ptr,
_MM_UPCONV_EPI64_NONE, _MM_BROADCAST64_NONE, _MM_HINT_NT);
#else
__m512i aa = _mm512_load_epi64((__m512i*)a_ptr);
#endif
__m512i bb = _mm512_load_epi64((__m512i*)b_ptr);
bb = _mm512_or_epi64(aa, bb);
_mm512_store_epi64((__m512i*)b_ptr, bb);
#ifdef EVICT
_mm_clevict(a_ptr, _MM_HINT_T0);
#endif
a_ptr += 8;
b_ptr += 8;
if (b_ptr >= b_ptr_end)
b_ptr = b;
}
UPDATE 2
On Xeon Phi, icpc generated for normal-load variant (orange curve) prefetching for a_ptr:
400e93: 62 d1 78 08 18 4c 24 vprefetch0 [r12+0x80]
When I manually (by hex-editing the executable) modified this to:
400e93: 62 d1 78 08 18 44 24 vprefetchnta [r12+0x80]
I got the desired resutls, even better than the blue/gray curves. However, I was not able to force the compiler to generate non-temporal prefetchnig for me, even by using #pragma prefetch a_ptr:_MM_HINT_NTA before the loop :(
To answer specifically the headline question:
Yes, recent1 mainstream Intel CPUs support non-temporal loads on normal 2 memory - but only "indirectly" via non-temporal prefetch instructions, rather than directly using non-temporal load instructions like movntdqa. This is in contrast to non-temporal stores where you can just use the corresponding non-temporal store instructions3 directly.
The basic idea is that you issue a prefetchnta to the cache line before any normal loads, and then issue loads as normal. If the line wasn't already in the cache, it will be loaded in a non-temporal fashion. The exact meaning of non-temporal fashion depends on the architecture but the general pattern is that the line is loaded into, at least the L1 and perhaps some higher cache levels. Indeed for a prefetch to be of any use it needs to cause the line to load, at least into some cache level for consumption by a later load. The line may also be treated specially in the cache, for example by flagging it as high priority for eviction or restricting the ways in which it can be placed.
The upshot of all this is that while non-temporal loads are supported in a sense, they are really only partly non-temporal, unlike stores where you really leave no trace of the line in any of the cache levels. Non-temporal loads will cause some cache pollution, but generally less than regular loads. The exact details are architecture specific, and I've included some details below for modern Intel. You can find a slightly longer writeup in this answer to the question "Non-temporal loads and the hardware prefetcher, do they work together?"
).
Skylake Client
Based on the tests in this answer it seems that the behavior for prefetchnta Skylake is to fetch normally into the L1 cache, to skip the L2 entirely, and fetches in a limited way into the L3 cache (probably into 1 or 2 ways only so the total amount of the L3 available to nta prefetches is limited).
This was tested on Skylake client, but I believe this basic behavior probably extends backwards probably to Sandy Bridge and earlier (based on wording in the Intel optimization guide), and also forwards to Kaby Lake and later architectures based on Skylake client. So unless you are using a Skylake-SP or Skylake-X part, or an extremely old CPU, this is probably the behavior you can expect from prefetchnta.
Skylake Server
The only recent Intel chip known to have different behavior is Skylake server (used in Skylake-X, Skylake-SP and a few other lines). This has a considerably changed L2 and L3 architecture, and the L3 is no longer inclusive of the much larger L2. For this chip, it seems that prefetchnta skips both the L2 and L3 caches, so on this architecture cache pollution is limited to the L1.
This behavior was reported by user Mysticial in a comment. The downside, as pointed out in those comments is that this makes prefetchnta much more brittle: if you get the prefetch distance or timing wrong (especially easy when hyperthreading is involved and the sibling core is active), and the data gets evicted from L1 before you use, you are going all the way back to main memory rather than the L3 on earlier architectures.
1 Recent here probably means anything in the last decade or so, but I don't mean to imply that earlier hardware didn't support non-temporal prefetch: it's possible that support goes right back to the introduction of prefetchnta but I don't have the hardware to check that and can't find an existing reliable source of information on it.
2 Normal here just means WB (writeback) memory, which is the memory dealing with at the application level the overwhelming majority of the time.
3 Specifically, the NT store instructions are movnti for general purpose registers and the movntd* and movntp* families for SIMD registers.
I answer my own question since I found the following post from Intel Developer Forum, which makes sense for me. It was written by John McCalpin:
The results for the mainstream processors are not surprising -- in the absence of true "scratchpad" memory, it is not clear that it is possible to design an implementation of "non-temporal" behavior that is not subject to nasty surprises. Two approaches that have been used in the past are (1) loading the cache line, but marking it LRU instead of MRU, and (2) loading the cache line into one specific "set" of the set-associative cache. In either case it is relatively easy to generate situations in which the cache drops the data before the processor completes reading it.
Both of these approaches risk performance degradation in cases operating on more than a small number of arrays, and are made much more difficult to implement without "gotchas" when HyperThreading is considered.
In other contexts I have argued for the implementation of "load multiple" instructions that would guarantee that the entire contents of a cache line would be copied to registers atomically. My reasoning is that the hardware absolutely guarantees that the cache line is moved atomically and that the time required to copy the remainder of the cache line to registers was so small (an extra 1-3 cycles, depending on the processor generation) that it could be safely implemented as an atomic operation.
Starting with Haswell, the core can read 64 Bytes in a single cycle (2 256-bit aligned AVX reads), so the exposure to unintended side effects becomes even lower.
Starting with KNL, full-cache-line (aligned) loads should be "naturally" atomic, since the transfers from the L1 Data Cache to the core are full cache lines and all of the data is placed into the target AVX-512 register. (This does not mean that Intel guarantees atomicity in the implementation! We don't have visibility into the horrible corner cases that the designers have to account for, but it is reasonable to conclude that most of the time aligned 512-bit loads will occur atomically.) With this "natural" 64-Byte atomicity, some of the tricks used in the past for reducing cache pollution due to "non-temporal" loads may deserve another look....
The MOVNTDQA instruction is intended primarily for reading from address ranges that are mapped as "Write-Combining" (WC), and not for reading from normal system memory that is mapped "Write-Back" (WB). The description in Volume 2 of the SWDM says that an implementation "may" do something special with MOVNTDQA for WB regions, but the emphasis is on the behavior for the WC memory type.
The "Write-Combining" memory type is almost never used for "real" memory --- it is used almost exclusively for Memory-Mapped IO regions.
See here for the whole post: https://software.intel.com/en-us/forums/intel-isa-extensions/topic/597075

What might cause the same SSE code to run a few times slower in the same function?

Edit 3: The images are links to the full-size versions. Sorry for the pictures-of-text, but the graphs would be hard to copy/paste into a text table.
I have the following VTune profile for a program compiled with icc --std=c++14 -qopenmp -axS -O3 -fPIC:
In that profile, two clusters of instructions are highlighted in the assembly view. The upper cluster takes significantly less time than the lower one, in spite of instructions being identical and in the same order. Both clusters are located inside the same function and are obviously both called n times. This happens every time I run the profiler, on both a Westmere Xeon and a Haswell laptop that I'm using right now (compiled with SSE because that's what I'm targeting and learning right now).
What am I missing?
Ignore the poor concurrency, this is most probably due to the laptop throttling, since it doesn't occur on the desktop Xeon machine.
I believe this is not an example of micro-optimisation, since those three added together amount to a decent % of the total time, and I'm really interested about the possible cause of this behavior.
Edit: OMP_NUM_THREADS=1 taskset -c 1 /opt/intel/vtune...
Same profile, albeit with a slightly lower CPI this time.
HW perf counters typically charge stalls to the instruction that had to wait for its inputs, not the instruction that was slow producing outputs.
The inputs for your first group comes from your gather. This probably cache-misses a lot, and doesn't the costs aren't going to get charged to those SUBPS/MULPS/ADDPS instructions. Their inputs come directly from vector loads of voxel[], so store-forwarding failure will cause some latency. But that's only ~10 cycles IIRC, small compared to cache misses during the gather. (Those cache misses show up as large bars for the instructions right before the first group that you've highlighted)
The inputs for your second group come directly from loads that can miss in cache. In the first group, the direct consumers of the cache-miss loads were instructions for lines like the one that sets voxel[0], which has a really large bar.
But in the second group, the time for the cache misses in a_transfer[] is getting attributed to the group you've highlighted. Or if it's not cache misses, then maybe it's slow address calculation as the loads have to wait for RAX to be ready.
It looks like there's a lot you could optimize here.
instead of store/reload for a_pointf, just keep it hot across loop iterations in a __m128 variable. Storing/reloading in the C source only makes sense if you found the compiler was making a poor choice about which vector register to spill (if it ran out of registers).
calculate vi with _mm_cvttps_epi32(vf), so the ROUNDPS isn't part of the dependency chain for the gather indices.
Do the voxel gather yourself by shuffling narrow loads into vectors, instead of writing code that copies to an array and then loads from it. (guaranteed store-forwarding failure, see Agner Fog's optimization guides and other links from the x86 tag wiki).
It might be worth it to partially vectorize the address math (calculation of base_0, using PMULDQ with a constant vector), so instead of a store/reload (~5 cycle latency) you just have a MOVQ or two (~1 or 2 cycle latency on Haswell, I forget.)
Use MOVD to load two adjacent short values, and merge another pair into the second element with PINSRD. You'll probably get good code from _mm_setr_epi32(*(const int*)base_0, *(const int*)(base_0 + dim_x), 0, 0), except that pointer aliasing is undefined behaviour. You might get worse code from _mm_setr_epi16(*base_0, *(base_0 + 1), *(base_0 + dim_x), *(base_0 + dim_x + 1), 0,0,0,0).
Then expand the low four 16-bit elements into 32-bit elements integers with PMOVSX, and convert them all to float in parallel with _mm_cvtepi32_ps (CVTDQ2PS).
Your scalar LERPs aren't being auto-vectorized, but you're doing two in parallel (and could maybe save an instruction since you want the result in a vector anyway).
Calling floorf() is silly, and a function call forces the compiler to spill all xmm registers to memory. Compile with -ffast-math or whatever to let it inline to a ROUNDSS, or do that manually. Especially since you go ahead and load the float that you calculate from that into a vector!
Use a vector compare instead of scalar prev_x / prev_y / prev_z. Use MOVMASKPS to get the result into an integer you can test. (You only care about the lower 3 elements, so test it with compare_mask & 0b0111 (true if any of the low 3 bits of the 4-bit mask are set, after a compare for not-equal with _mm_cmpneq_ps. See the double version of the instruction for more tables on how it all works: http://www.felixcloutier.com/x86/CMPPD.html).
Well, analyzing assembly code please note that running time is attributed to the next instruction - so, the data you're looking by instructions need to be interpreted carefully. There is a corresponding note in VTune Release Notes:
Running time is attributed to the next instruction (200108041)
To collect the data about time-consuming running regions of the
target, the Intel® VTune™ Amplifier interrupts executing target
threads and attributes the time to the context IP address.
Due to the collection mechanism, the captured IP address points to an
instruction AFTER the one that is actually consuming most of the time.
This leads to the running time being attributed to the next
instruction (or, rarely to one of the subsequent instructions) in the
Assembly view. In rare cases, this can also lead to wrong attribution
of running time in the source - the time may be erroneously attributed
to the source line AFTER the actual hot line.
In case the inline mode is ON and the program has small functions
inlined at the hotspots, this can cause the running time to be
attributed to a wrong function since the next instruction can belong
to a different function in tightly inlined code.

What C++ code compiles down to the x86 REP instruction?

I'm copying elements from one array to another in C++. I found the rep movs instruction in x86 that seems to copy an array at ESI to an array at EDI of size ECX. However, neither the for nor while loops I tried compiled to a rep movs instruction in VS 2008 (on an Intel Xeon x64 processor). How can I write code that will get compiled to this instruction?
Honestly, you shouldn't. REP is sort of an obsolete holdover in the instruction set, and actually pretty slow since it has to call a microcoded subroutine inside the CPU, which has a ROM lookup latency and is nonpipelined as well.
In almost every implementation, you will find that the memcpy() compiler intrinsic both is easier to use and runs faster.
Under MSVC there are the __movsxxx & __stosxxx intrinsics that will generate a REP prefixed instruction.
there is also a 'hack' to force intrinsic memset aka REP STOS under vc9+, as the intrinsic no longer exits, due to the sse2 branching in the crt. this is better that __stosxxx due to the fact the compiler can optimize it for constants and order it correctly.
#define memset(mem,fill,size) memset((DWORD*)mem,((fill) << 24|(fill) << 16|(fill) << 8|(fill)),size)
__forceinline void memset(DWORD* pStart, unsigned long dwFill, size_t nSize)
{
//credits to Nepharius for finding this
DWORD* pLast = pStart + (nSize >> 2);
while(pStart < pLast)
*pStart++ = dwFill;
if((nSize &= 3) == 0)
return;
if(nSize == 3)
{
(((WORD*)pStart))[0] = WORD(dwFill);
(((BYTE*)pStart))[2] = BYTE(dwFill);
}
else if(nSize == 2)
(((WORD*)pStart))[0] = WORD(dwFill);
else
(((BYTE*)pStart))[0] = BYTE(dwFill);
}
of course REP isn't always the best thing to use, imo your way better off using memcpy, it'll branch to either sse2 or REPS MOV based on your system (under msvc), unless you feeling like writing custom assembly for 'hot' areas...
If you need exactly that instruction - use built-in assembler and write that instruction manually. You can't rely on the compiler to produce any specific machine code - even if it emits it in one compilation it can decide to emit some other equivalent during next compilation.
REP and friends was nice once upon a time, when the x86 CPU was a single-pipeline industrial CISC-processor.
But that has changed. Nowadays when the processor encounters any instruction, the first it does is translating it into an easier format (VLIW-like micro-ops) and schedules it for future execution (this is part of out-of-order-execution, part of scheduling between different logical CPU cores, it can be used to simplifying write-after-write-sequences into single-writes, et.c.). This machinery works well for instructions that translates into a few VLIW-like opcodes, but not machine-code that translates into loops. Loop-translated machine code will probably cause the execution pipeline to stall.
Rather than spending hundreds of thousands of transistors into building CPU-circuitry for handling looping portions of the micro-ops in the execution pipeline, they just handle it in some sort of crappy legacy-mode that stutterly stalls the pipeline, and ask modern programmers to write your own damn loops!
Therefore it is seldom used when machines write code. If you encounter REP in a binary executable, its probably a human assembly-muppet who didn't know better, or a cracker that really needed the few bytes it saved to use it instead of an actual loop, that wrote it.
(However. Take everything I just wrote with a grain of salt. Maybe this is not true anymore. I am not 100% up to date with the internals of x86 CPUs anymore, I got into other hobbies..)
I use the rep* prefix variants with cmps*, movs*, scas* and stos* instruction variants to generate inline code which minimizes the code size, avoids unnecessary calls/jumps and thereby keeps down the work done by the caches. The alternative is to set up parameters and call a memset or memcpy somewhere else which may overall be faster if I want to copy a hundred bytes or more but if it's just a matter of 10-20 bytes using rep is faster (or at least was the last time I measured).
Since my compiler allows specification and use of inline assembly functions and includes their register usage/modification in the optimization activities it is possible for me to use them when the circumstances are right.
On a historic note - not having any insight into the manufacturer's strategies - there was a time when the "rep movs*" (etc) instructions were very slow. I think it was around the time of the Pentium/Pentium MMX. A colleague of mine (who had more insight than I) said that the manufacturers had decreased the chip area (<=> fewer transistors/more microcode) allocated to the rep handling and used it to make other, more used instructions faster.
In the fifteen years or so since rep has become relatively speaking faster again which would suggest more transistors/less microcode.

Using SSE instructions

I have a loop written in C++ which is executed for each element of a big integer array. Inside the loop, I mask some bits of the integer and then find the min and max values. I heard that if I use SSE instructions for these operations it will run much faster compared to a normal loop written using bitwise AND , and if-else conditions. My question is should I go for these SSE instructions? Also, what happens if my code runs on a different processor? Will it still work or these instructions are processor specific?
SSE instructions are processor specific. You can look up which processor supports which SSE version on wikipedia.
If SSE code will be faster or not depends on many factors: The first is of course whether the problem is memory-bound or CPU-bound. If the memory bus is the bottleneck SSE will not help much. Try simplifying your integer calculations, if that makes the code faster, it's probably CPU-bound, and you have a good chance of speeding it up.
Be aware that writing SIMD-code is a lot harder than writing C++-code, and that the resulting code is much harder to change. Always keep the C++ code up to date, you'll want it as a comment and to check the correctness of your assembler code.
Think about using a library like the IPP, that implements common low-level SIMD operations optimized for various processors.
SIMD, of which SSE is an example, allows you to do the same operation on multiple chunks of data. So, you won't get any advantage to using SSE as a straight replacement for the integer operations, you will only get advantages if you can do the operations on multiple data items at once. This involves loading some data values that are contiguous in memory, doing the required processing and then stepping to the next set of values in the array.
Problems:
1 If the code path is dependant on the data being processed, SIMD becomes much harder to implement. For example:
a = array [index];
a &= mask;
a >>= shift;
if (a < somevalue)
{
a += 2;
array [index] = a;
}
++index;
is not easy to do as SIMD:
a1 = array [index] a2 = array [index+1] a3 = array [index+2] a4 = array [index+3]
a1 &= mask a2 &= mask a3 &= mask a4 &= mask
a1 >>= shift a2 >>= shift a3 >>= shift a4 >>= shift
if (a1<somevalue) if (a2<somevalue) if (a3<somevalue) if (a4<somevalue)
// help! can't conditionally perform this on each column, all columns must do the same thing
index += 4
2 If the data is not contigous then loading the data into the SIMD instructions is cumbersome
3 The code is processor specific. SSE is only on IA32 (Intel/AMD) and not all IA32 cpus support SSE.
You need to analyse the algorithm and the data to see if it can be SSE'd and that requires knowing how SSE works. There's plenty of documentation on Intel's website.
This kind of problem is a perfect example of where a good low level profiler is essential. (Something like VTune) It can give you a much more informed idea of where your hotspots lie.
My guess, from what you describe is that your hotspot will probably be branch prediction failures resulting from min/max calculations using if/else. Therefore, using SIMD intrinsics should allow you to use the min/max instructions, however, it might be worth just trying to use a branchless min/max caluculation instead. This might achieve most of the gains with less pain.
Something like this:
inline int
minimum(int a, int b)
{
int mask = (a - b) >> 31;
return ((a & mask) | (b & ~mask));
}
If you use SSE instructions, you're obviously limited to processors that support these.
That means x86, dating back to the Pentium 2 or so (can't remember exactly when they were introduced, but it's a long time ago)
SSE2, which, as far as I can recall, is the one that offers integer operations, is somewhat more recent (Pentium 3? Although the first AMD Athlon processors didn't support them)
In any case, you have two options for using these instructions. Either write the entire block of code in assembly (probably a bad idea. That makes it virtually impossible for the compiler to optimize your code, and it's very hard for a human to write efficient assembler).
Alternatively, use the intrinsics available with your compiler (if memory serves, they're usually defined in xmmintrin.h)
But again, the performance may not improve. SSE code poses additional requirements of the data it processes. Mainly, the one to keep in mind is that data must be aligned on 128-bit boundaries. There should also be few or no dependencies between the values loaded into the same register (a 128 bit SSE register can hold 4 ints. Adding the first and the second one together is not optimal. But adding all four ints to the corresponding 4 ints in another register will be fast)
It may be tempting to use a library that wraps all the low-level SSE fiddling, but that might also ruin any potential performance benefit.
I don't know how good SSE's integer operation support is, so that may also be a factor that can limit performance. SSE is mainly targeted at speeding up floating point operations.
If you intend to use Microsoft Visual C++, you should read this:
http://www.codeproject.com/KB/recipes/sseintro.aspx
We have implemented some image processing code, similar to what you describe but on a byte array, In SSE. The speedup compared to C code is considerable, depending on the exact algorithm more than a factor of 4, even in respect to the Intel compiler. However, as you already mentioned you have the following drawbacks:
Portability. The code will run on every Intel-like CPU, so also AMD, but not on other CPUs. That is not a problem for us because we control the target hardware. Switching compilers and even to a 64 bit OS can also be a problem.
You have a steep learning curve, but I found that after you grasp the principles writing new algorithms is not that hard.
Maintainability. Most C or C++ programmers have no knowledge of assembly/SSE.
My advice to you will be to go for it only if you really need the performance improvement, and you can't find a function for your problem in a library like the intel IPP, and if you can live with the portability issues.
I can tell from my experince that SSE brings a huge (4x and up) speedup over a plain c version of the code (no inline asm, no intrinsics used) but hand-optimized assembler can beat Compiler-generated assembly if the compiler can't figure out what the programmer intended (belive me, compilers don't cover all possible code combinations and they never will).
Oh and, the compiler can't everytime layout the data that it runs at the fastest-possible speed.
But you need much experince for a speedup over an Intel-compiler (if possible).
SSE instructions were originally just on Intel chips, but recently (since Athlon?) AMD supports them as well, so if you do code against the SSE instruction set, you should be portable to most x86 procs.
That being said, it may not be worth your time to learn SSE coding unless you're already familiar with assembler on x86's - an easier option might be to check your compiler docs and see if there are options to allow the compiler to autogenerate SSE code for you. Some compilers do very well vectorizing loops in this way. (You're probably not surprised to hear that the Intel compilers do a good job of this :)
Write code that helps the compiler understand what you are doing. GCC will understand and optimize SSE code such as this:
typedef union Vector4f
{
// Easy constructor, defaulted to black/0 vector
Vector4f(float a = 0, float b = 0, float c = 0, float d = 1.0f):
X(a), Y(b), Z(c), W(d) { }
// Cast operator, for []
inline operator float* ()
{
return (float*)this;
}
// Const ast operator, for const []
inline operator const float* () const
{
return (const float*)this;
}
// ---------------------------------------- //
inline Vector4f operator += (const Vector4f &v)
{
for(int i=0; i<4; ++i)
(*this)[i] += v[i];
return *this;
}
inline Vector4f operator += (float t)
{
for(int i=0; i<4; ++i)
(*this)[i] += t;
return *this;
}
// Vertex / Vector
// Lower case xyzw components
struct {
float x, y, z;
float w;
};
// Upper case XYZW components
struct {
float X, Y, Z;
float W;
};
};
Just don't forget to have -msse -msse2 on your build parameters!
Although it is true that SSE is specific to some processors (SSE may be relatively safe, SSE2 much less in my experience), you can detect the CPU at runtime, and load the code dynamically depending on the target CPU.
SIMD intrinsics (such as SSE2) can speed this sort of thing up but take expertise to use correctly. They are very sensitive to alignment and pipeline latency; careless use can make performance even worse than it would have been without them. You'll get a much easier and more immediate speedup from simply using cache prefetching to make sure all your ints are in L1 in time for you to operate on them.
Unless your function needs a throughput of better than 100,000,000 integers per second, SIMD probably isn't worth the trouble for you.
Just to add briefly to what has been said before about different SSE versions being available on different CPUs: This can be checked by looking at the respective feature flags returned by the CPUID instruction (see e.g. Intel's documentation for details).
Have a look at inline assembler for C/C++, here is a DDJ article. Unless you are 100% certain your program will run on a compatible platform you should follow the recommendations many have given here.
I agree with the previous posters. Benefits can be quite large but to get it may require a lot of work. Intel documentation on these instructions is over 4K pages. You may want to check out EasySSE (c++ wrappers library over intrinsics + examples) free from Ocali Inc.
I assume my affiliation with this EasySSE is clear.
I don't recommend doing this yourself unless you're fairly proficient with assembly. Using SSE will, more than likely, require careful reorganization of your data, as Skizz points out, and the benefit is often questionable at best.
It would probably be much better for you to write very small loops and keep your data very tightly organized and just rely on the compiler doing this for you. Both the Intel C Compiler and GCC (since 4.1) can auto-vectorize your code, and will probably do a better job than you. (Just add -ftree-vectorize to your CXXFLAGS.)
Edit: Another thing I should mention is that several compilers support assembly intrinsics, which would probably, IMO, be easier to use than the asm() or __asm{} syntax.