Why is vectorization not beneficial in this for loop?

Why is vectorization not beneficial in this for loop? - c++

I am trying to vectorize this for loop. After using the Rpass flag, I am getting the following remark for it:
int someOuterVariable = 0;
for (unsigned int i = 7; i != -1; i--)
{
array[someOuterVariable + i] -= 0.3 * anotherArray[i];
}
Remark:
The cost-model indicates that vectorization is not beneficial
the cost-model indicates that interleaving is not beneficial
I want to understand what this means. Does "interleaving is not benificial" mean the array indexing is not proper?

It's hard to answer without more details about your types. But in general, starting a loop incurs some costs and vectorising also implies some costs (such as moving data to/from SIMD registers, ensuring proper alignment of data)
I'm guessing here that the compiler tells you that the vectorisation cost here is bigger than simply running the 8 iterations without it, so it's not doing it.
Try to increase the number of iterations, or help the compiler for computing alignement for example.
Typically, unless the type of array's item are exactly of the proper alignment for SIMD vector, accessing an array from a "unknown" offset (what you've called someOuterVariable) prevents the compiler to write an efficient vectorisation code.
EDIT: About the "interleaving" question, it's hard to guess without knowning your tool. But in general, interleaving usually means mixing 2 streams of computations so that the compute units of the CPU are all busy. For example, if you have 2 ALU in your CPU, and the program is doing:
c = a + b;
d = e * f;
The compiler can interleave the computation so that both the addition and multiplication happens at the same time (provided you have 2 ALU available). Typically, this means that the multiplication which is a bit longer to compute (for example 6 cycles) will be started before the addition (for example 3 cycles). You'll then get the result of both operation after only 6 cycles instead of 9 if the compiler serialized the computations. This is only possible if there is no dependencies between the computation (if d required c, it can not work). A compiler is very cautious about this, and, in your example, will not apply this optimization if it can't prove that array and anotherArray don't alias.

Related

What is faster in C++: mod (%) or another counter?

At the risk of this being a duplicate, maybe I just can't find a similar post right now:
I am writing in C++ (C++20 to be specific). I have a loop with a counter that counts up every turn. Let's call it counter. And if this counter reaches a page-limit (let's call it page_limit), the program should continue on the next page. So it looks something like this:
const size_t page_limit = 4942;
size_t counter = 0;
while (counter < foo) {
if (counter % page_limit == 0) {
// start new page
}
// some other code
counter += 1;
}
Now I am wondering since the counter goes pretty high: would the program run faster, if I wouldn't have the program calculate the modulo counter % page_limit every time, but instead make another counter? It could look something like this:
const size_t page_limit = 4942;
size_t counter = 0;
size_t page_counter = 4942;
while (counter < foo) {
if (page_counter == page_limit) {
// start new page
page_counter = 0;
}
// some other code
counter += 1;
page_counter += 1;
}

Most optimizing compilers will convert divide or modulo operations into multiply by pre-generated inverse constant and shift instructions if the divisor is a constant. Possibly also if the same divisor value is used repeatedly in a loop.
Modulo multiplies by inverse to get a quotient, then multiplies quotient by divisor to get a product, and then original number minus product will be the modulo.
Multiply and shift are fast instructions on reasonably recent X86 processors, but branch prediction can also reduce the time it takes for a conditional branch, so as suggested a benchmark may be needed to determine which is best.

(I assume you meant to write if(x%y==0) not if(x%y), to be equivalent to the counter.)
I don't think compilers will do this optimization for you, so it could be worth it. It's going to be smaller code-size, even if you can't measure a speed difference. The x % y == 0 way still branches (so is still subject to a branch misprediction those rare times when it's true). Its only advantage is that it doesn't need a separate counter variable, just some temporary registers at one point in the loop. But it does need the divisor every iteration.
Overall this should be better for code size, and isn't less readable if you're used to the idiom. (Especially if you use if(--page_count == 0) { page_count=page_limit; ... so all pieces of the logic are in two adjacent lines.)
If your page_limit were not a compile-time constant, this is even more likely to help. dec/jz that's only taken once per many decrements is a lot cheaper than div/test edx,edx/jz, including for front-end throughput. (div is micro-coded on Intel CPUs as about 10 uops, so even though it's one instruction it still takes up the front-end for multiple cycles, taking away throughput resources from getting surrounding code into the out-of-order back-end).
(With a constant divisor, it's still multiply, right shift, sub to get the quotient, then multiply and subtract to get the remainder from that. So still several single-uop instructions. Although there are some tricks for divisibility testing by small constants see #Cassio Neri's answer on Fast divisibility tests (by 2,3,4,5,.., 16)? which cites his journal articles; recent GCC may have started using these.)
But if your loop body doesn't bottleneck on front-end instruction/uop throughput (on x86), or the divider execution unit, then out-of-order exec can probably hide most of the cost of even a div instruction. It's not on the critical path so it could be mostly free if its latency happens in parallel with other computation, and there are spare throughput resources. (Branch prediction + speculative execution allow execution to continue without waiting for the branch condition to be known, and since this work is independent of other work it can "run ahead" as the compiler can see into future iterations.)
Still, making that work even cheaper can help the compiler see and handle a branch mispredict sooner. But modern CPUs with fast recovery can keep working on old instructions from before the branch while recovering. ( What exactly happens when a skylake CPU mispredicts a branch? / Avoid stalling pipeline by calculating conditional early )
And of course a few loops do fully keep the CPU's throughput resources busy, not bottlenecking on cache misses or a latency chain. And fewer uops executed per iteration is more friendly to the other hyperthread (or SMT in general).
Or if you care about your code running on in-order CPUs (common for ARM and other non-x86 ISAs that target low-power implementations), the real work has to wait for the branch-condition logic. (Only hardware prefetch or cache-miss loads and things like that can be doing useful work while running extra code to test the branch condition.)
Use a down-counter
Instead of counting up, you'd actually want to hand-hold the compiler into using a down-counter that can compile to dec reg / jz .new_page or similar; all normal ISAs can do that quite cheaply because it's the same kind of thing you'd find at the bottom of normal loops. (dec/jnz to keep looping while non-zero)
if(--page_counter == 0) {
/*new page*/;
page_counter = page_limit;
}
A down-counter is more efficient in asm and equally readable in C (compared to an up-counter), so if you're micro-optimizing you should write it that way. Related: using that technique in hand-written asm FizzBuzz. Maybe also a code review of asm sum of multiples of 3 and 5, but it does nothing for no-match so optimizing it is different.
Notice that page_limit is only accessed inside the if body, so if the compiler is low on registers it can easily spill that and only read it as needed, not tying up a register with it or with multiplier constants.
Or if it's a known constant, just a move-immediate instruction. (Most ISAs also have compare-immediate, but not all. e.g. MIPS and RISC-V only have compare-and-branch instructions that use the space in the instruction word for a target address, not for an immediate.) Many RISC ISAs have special support for efficiently setting a register to a wider constant than most instructions that take an immediate (like ARM movw with a 16-bit immediate, so 4092 can be done in one instruction more mov but not cmp: it doesn't fit in 12 bits).
Compared to dividing (or multiplicative inverse), most RISC ISAs don't have multiply-immediate, and a multiplicative inverse is usually wider than one immediate can hold. (x86 does have multiply-immediate, but not for the form that gives you a high-half.) Divide-immediate is even rarer, not even x86 has that at all, but no compiler would use that unless optimizing for space instead of speed if it did exist.
CISC ISAs like x86 can typically multiply or divide with a memory source operand, so if low on registers the compiler could keep the divisor in memory (especially if it's a runtime variable). Loading once per iteration (hitting in cache) is not expensive. But spilling and reloading an actual variable that changes inside the loop (like page_count) could introduce a store/reload latency bottleneck if the loop is short enough and there aren't enough registers. (Although that might not be plausible: if your loop body is big enough to need all the registers, it probably has enough latency to hide a store/reload.)

If somebody put it in front of me, I would rather it was:
const size_t page_limit = 4942;
size_t npages = 0, nitems = 0;
size_t pagelim = foo / page_limit;
size_t resid = foo % page_limit;
while (npages < pagelim || nitems < resid) {
if (++nitems == page_limit) {
/* start new page */
nitems = 0;
npages++;
}
}
Because the program is now expressing the intent of the processing -- a bunch of things in page_limit sized chunks; rather than an attempt to optimize away an operation.
That the compiler might generate nicer code is just a blessing.

How profitable is it to use powers of two in the calculations?

The question is whether it is possible to achieve a noticeable increase in productivity by using powers of two in multiplications and divisions, since the compiler could convert them to a shift (or it could be explicitly using a shift for this). I have a lot of multiplications by one number in my task (a coefficient that I myself entered), but I can use for example 512 instead of 500.
for(i=0;i<X;i++)
{
cout<<i*512 // or i*500
}
or i need do it same:
for(i=0;i<X;i++)
{
cout<<i>>9;
}
and an additional question - does it make sense to introduce a variable for the condition so that the compiler does not repeatedly read the condition again or does it do it automatically?
For example:
for(int i=0;i<10*K*H;i++)
{
// K and H cant change in this loop
}
I was trying to check it in Compulier Explorer, but it create less lines of code when i divide and no create same code when i multiply

About the limit in the for loop, you may want to give the compiler some assistance.
Compute the limit before the loop:
const int limit = 10 * K * H;
for (i = 0; i < limit; ++i)
{
}
This can help when compiling with no optimizations (e.g. debug mode). Your compiler may perform better optimizations when you increase the optimization level.
I recommend printing the assembly language for your for loop and comparing with the assembly language for the above code. The truth is in the assembly language.
Edit 1: shifting vs. multiplication
In most processors, bit shifting is often faster than multiplication. In modern processors, the savings is in the order of nanoseconds, or possibly microseconds.
Many compilers will convert a multiplication into a bit shift, depending on the optimization level and the context.
In your example, you will probably not notice the optimization gain, because the gain will be wasted in the call to cout. I/O consumes more time than the time gained by micro-optimizations.
Profiling your code will give you the best data for making these kinds of decisions. Also read about benchmarking to collect better data. For example, you may have to run your loop for 1E6 or more iterations to rule out outliers such as interrupts and task swaps.

Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs

I've been racking my brain for a week trying to complete this assignment and I'm hoping someone here can lead me toward the right path. Let me start with the instructor's instructions:
Your assignment is the opposite of our first lab assignment, which was to optimize a prime number program. Your purpose in this assignment is to pessimize the program, i.e. make it run slower. Both of these are CPU-intensive programs. They take a few seconds to run on our lab PCs. You may not change the algorithm.
To deoptimize the program, use your knowledge of how the Intel i7 pipeline operates. Imagine ways to re-order instruction paths to introduce WAR, RAW, and other hazards. Think of ways to minimize the effectiveness of the cache. Be diabolically incompetent.
The assignment gave a choice of Whetstone or Monte-Carlo programs. The cache-effectiveness comments are mostly only applicable to Whetstone, but I chose the Monte-Carlo simulation program:
// Un-modified baseline for pessimization, as given in the assignment
#include <algorithm> // Needed for the "max" function
#include <cmath>
#include <iostream>
// A simple implementation of the Box-Muller algorithm, used to generate
// gaussian random numbers - necessary for the Monte Carlo method below
// Note that C++11 actually provides std::normal_distribution<> in
// the <random> library, which can be used instead of this function
double gaussian_box_muller() {
double x = 0.0;
double y = 0.0;
double euclid_sq = 0.0;
// Continue generating two uniform random variables
// until the square of their "euclidean distance"
// is less than unity
do {
x = 2.0 * rand() / static_cast<double>(RAND_MAX)-1;
y = 2.0 * rand() / static_cast<double>(RAND_MAX)-1;
euclid_sq = x*x + y*y;
} while (euclid_sq >= 1.0);
return x*sqrt(-2*log(euclid_sq)/euclid_sq);
}
// Pricing a European vanilla call option with a Monte Carlo method
double monte_carlo_call_price(const int& num_sims, const double& S, const double& K, const double& r, const double& v, const double& T) {
double S_adjust = S * exp(T*(r-0.5*v*v));
double S_cur = 0.0;
double payoff_sum = 0.0;
for (int i=0; i<num_sims; i++) {
double gauss_bm = gaussian_box_muller();
S_cur = S_adjust * exp(sqrt(v*v*T)*gauss_bm);
payoff_sum += std::max(S_cur - K, 0.0);
}
return (payoff_sum / static_cast<double>(num_sims)) * exp(-r*T);
}
// Pricing a European vanilla put option with a Monte Carlo method
double monte_carlo_put_price(const int& num_sims, const double& S, const double& K, const double& r, const double& v, const double& T) {
double S_adjust = S * exp(T*(r-0.5*v*v));
double S_cur = 0.0;
double payoff_sum = 0.0;
for (int i=0; i<num_sims; i++) {
double gauss_bm = gaussian_box_muller();
S_cur = S_adjust * exp(sqrt(v*v*T)*gauss_bm);
payoff_sum += std::max(K - S_cur, 0.0);
}
return (payoff_sum / static_cast<double>(num_sims)) * exp(-r*T);
}
int main(int argc, char **argv) {
// First we create the parameter list
int num_sims = 10000000; // Number of simulated asset paths
double S = 100.0; // Option price
double K = 100.0; // Strike price
double r = 0.05; // Risk-free rate (5%)
double v = 0.2; // Volatility of the underlying (20%)
double T = 1.0; // One year until expiry
// Then we calculate the call/put values via Monte Carlo
double call = monte_carlo_call_price(num_sims, S, K, r, v, T);
double put = monte_carlo_put_price(num_sims, S, K, r, v, T);
// Finally we output the parameters and prices
std::cout << "Number of Paths: " << num_sims << std::endl;
std::cout << "Underlying: " << S << std::endl;
std::cout << "Strike: " << K << std::endl;
std::cout << "Risk-Free Rate: " << r << std::endl;
std::cout << "Volatility: " << v << std::endl;
std::cout << "Maturity: " << T << std::endl;
std::cout << "Call Price: " << call << std::endl;
std::cout << "Put Price: " << put << std::endl;
return 0;
}
The changes I have made seemed to increase the code running time by a second but I'm not entirely sure what I can change to stall the pipeline without adding code. A point to the right direction would be awesome, I appreciate any responses.
Update: the professor who gave this assignment posted some details
The highlights are:
It's a second semester architecture class at a community college (using the Hennessy and Patterson textbook).
the lab computers have Haswell CPUs
The students have been exposed to the CPUID instruction and how to determine cache size, as well as intrinsics and the CLFLUSH instruction.
any compiler options are allowed, and so is inline asm.
Writing your own square root algorithm was announced as being outside the pale
Cowmoogun's comments on the meta thread indicate that it wasn't clear compiler optimizations could be part of this, and assumed -O0, and that a 17% increase in run-time was reasonable.
So it sounds like the goal of the assignment was to get students to re-order the existing work to reduce instruction-level parallelism or things like that, but it's not a bad thing that people have delved deeper and learned more.
Keep in mind that this is a computer-architecture question, not a question about how to make C++ slow in general.

Important background reading: Agner Fog's microarch pdf, and probably also Ulrich Drepper's What Every Programmer Should Know About Memory. See also the other links in the x86 tag wiki, especially Intel's optimization manuals, and David Kanter's analysis of the Haswell microarchitecture, with diagrams.
Very cool assignment; much better than the ones I've seen where students were asked to optimize some code for gcc -O0, learning a bunch of tricks that don't matter in real code. In this case, you're being asked to learn about the CPU pipeline and use that to guide your de-optimization efforts, not just blind guessing. The most fun part of this one is justifying each pessimization with "diabolical incompetence", not intentional malice.
Problems with the assignment wording and code:
The uarch-specific options for this code are limited. It doesn't use any arrays, and much of the cost is calls to exp/log library functions. There isn't an obvious way to have more or less instruction-level parallelism, and the loop-carried dependency chain is very short.
It would be hard to get a slowdown just from re-arranging the expressions to change the dependencies, to reduce ILP from hazards.
Intel Sandybridge-family CPUs are aggressive out-of-order designs that spend lots of transistors and power to find parallelism and avoid hazards (dependencies) that would trouble a classic RISC in-order pipeline. Usually the only traditional hazards that slow it down are RAW "true" dependencies that cause throughput to be limited by latency.
WAR and WAW hazards for registers are pretty much not an issue, thanks to register renaming. (except for popcnt/lzcnt/tzcnt, which have a false dependency their destination on Intel CPUs, even though it should be write-only).
For memory ordering, modern CPUs use a store buffer to delay commit into cache until retirement, also avoiding WAR and WAW hazards. See also this answer about what a store buffer is, and being essential essential for OoO exec to decouple execution from things other cores can see.
Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) has more about register renaming and hiding FMA latency in an FP dot product loop.
The "i7" brand-name was introduced with Nehalem (successor to Core2), and some Intel manuals even say Core i7 when they seem to mean Nehalem, but they kept the "i7" branding for Sandybridge and later microarchitectures. SnB is when the P6-family evolved into a new species, the SnB-family. In many ways, Nehalem has more in common with Pentium III than with Sandybridge (e.g. register read stalls aka ROB-read stalls don't happen on SnB, because it changed to using a physical register file. Also a uop cache and a different internal uop format). The term "i7 architecture" is not useful, because it makes little sense to group the SnB-family with Nehalem but not Core2. (Nehalem did introduce the shared inclusive L3 cache architecture for connecting multiple cores together, though. And also integrated GPUs. So chip-level, the naming makes more sense.)
Summary of the good ideas that diabolical incompetence can justify
Even the diabolically incompetent are unlikely to add obviously useless work or an infinite loop, and making a mess with C++/Boost classes is beyond the scope of the assignment.
Multi-thread with a single shared std::atomic<uint64_t> loop counter, so the right total number of iterations happen. Atomic uint64_t is especially bad with -m32 -march=i586. For bonus points, arrange for it to be misaligned, and crossing a page boundary with an uneven split (not 4:4).
False sharing for some other non-atomic variable -> memory-order mis-speculation pipeline clears, as well as extra cache misses.
Instead of using - on FP variables, XOR the high byte with 0x80 to flip the sign bit, causing store-forwarding stalls.
Time each iteration independently, with something even heavier than RDTSC. e.g. CPUID / RDTSC or a time function that makes a system call. Serializing instructions are inherently pipeline-unfriendly.
Change multiplies by constants to divides by their reciprocal ("for ease of reading"). div is slow and not fully pipelined.
Vectorize the multiply/sqrt with AVX (SIMD), but fail to use vzeroupper before calls to scalar math-library exp() and log() functions, causing AVX<->SSE transition stalls.
Store the RNG output in a linked list, or in arrays which you traverse out of order. Same for the result of each iteration, and sum at the end.
Also covered in this answer but excluded from the summary: suggestions that would be just as slow on a non-pipelined CPU, or that don't seem to be justifiable even with diabolical incompetence. e.g. many gimp-the-compiler ideas that produce obviously different / worse asm.
Multi-thread badly
Maybe use OpenMP to multi-thread loops with very few iterations, with way more overhead than speed gain. Your monte-carlo code has enough parallelism to actually get a speedup, though, esp. if we succeed at making each iteration slow. (Each thread computes a partial payoff_sum, added at the end). #omp parallel on that loop would probably be an optimization, not a pessimization.
Multi-thread but force both threads to share the same loop counter (with atomic increments so the total number of iterations is correct). This seems diabolically logical. This means using a static variable as a loop counter. This justifies use of atomic for loop counters, and creates actual cache-line ping-ponging (as long as the threads don't run on the same physical core with hyperthreading; that might not be as slow). Anyway, this is much slower than the un-contended case for lock xadd or lock dec. And lock cmpxchg8b to atomically increment a contended uint64_t on a 32bit system will have to retry in a loop instead of having the hardware arbitrate an atomic inc.
Also create false sharing, where multiple threads keep their private data (e.g. RNG state) in different bytes of the same cache line. (Intel tutorial about it, including perf counters to look at). There's a microarchitecture-specific aspect to this: Intel CPUs speculate on memory mis-ordering not happening, and there's a memory-order machine-clear perf event to detect this, at least on P4. The penalty might not be as large on Haswell. As that link points out, a locked instruction assumes this will happen, avoiding mis-speculation. A normal load speculates that other cores won't invalidate a cache line between when the load executes and when it retires in program-order (unless you use pause). True sharing without locked instructions is usually a bug. It would be interesting to compare a non-atomic shared loop counter with the atomic case. To really pessimize, keep the shared atomic loop counter, and cause false sharing in the same or a different cache line for some other variable.
Random uarch-specific ideas:
If you can introduce any unpredictable branches, that will pessimize the code substantially. Modern x86 CPUs have quite long pipelines, so a mispredict costs ~15 cycles (when running from the uop cache).
Dependency chains:
I think this was one of the intended parts of the assignment.
Defeat the CPU's ability to exploit instruction-level parallelism by choosing an order of operations that has one long dependency chain instead of multiple short dependency chains. Compilers aren't allowed to change the order of operations for FP calculations unless you use -ffast-math, because that can change the results (as discussed below).
To really make this effective, increase the length of a loop-carried dependency chain. Nothing leaps out as obvious, though: The loops as written have very short loop-carried dependency chains: just an FP add. (3 cycles). Multiple iterations can have their calculations in-flight at once, because they can start well before the payoff_sum += at the end of the previous iteration. (log() and exp take many instructions, but not a lot more than Haswell's out-of-order window for finding parallelism: ROB size=192 fused-domain uops, and scheduler size=60 unfused-domain uops. As soon as execution of the current iteration progresses far enough to make room for instructions from the next iteration to issue, any parts of it that have their inputs ready (i.e. independent/separate dep chain) can start executing when older instructions leave the execution units free (e.g. because they're bottlenecked on latency, not throughput.).
The RNG state will almost certainly be a longer loop-carried dependency chain than the addps.
Use slower/more FP operations (esp. more division):
Divide by 2.0 instead of multiplying by 0.5, and so on. FP multiply is heavily pipelined in Intel designs, and has one per 0.5c throughput on Haswell and later. FP divsd/divpd is only partially pipelined. (Although Skylake has an impressive one per 4c throughput for divpd xmm, with 13-14c latency, vs not pipelined at all on Nehalem (7-22c)).
The do { ...; euclid_sq = x*x + y*y; } while (euclid_sq >= 1.0); is clearly testing for a distance, so clearly it would be proper to sqrt() it. :P (sqrt is even slower than div).
As #Paul Clayton suggests, rewriting expressions with associative/distributive equivalents can introduce more work (as long as you don't use -ffast-math to allow the compiler to re-optimize). (exp(T*(r-0.5*v*v)) could become exp(T*r - T*v*v/2.0). Note that while math on real numbers is associative, floating point math is not, even without considering overflow/NaN (which is why -ffast-math isn't on by default). See Paul's comment for a very hairy nested pow() suggestion.
If you can scale the calculations down to very small numbers, then FP math ops take ~120 extra cycles to trap to microcode when an operation on two normal numbers produces a denormal. See Agner Fog's microarch pdf for the exact numbers and details. This is unlikely since you have a lot of multiplies, so the scale factor would be squared and underflow all the way to 0.0. I don't see any way to justify the necessary scaling with incompetence (even diabolical), only intentional malice.
###If you can use intrinsics (<immintrin.h>)
Use movnti to evict your data from cache. Diabolical: it's new and weakly-ordered, so that should let the CPU run it faster, right? Or see that linked question for a case where someone was in danger of doing exactly this (for scattered writes where only some of the locations were hot). clflush is probably impossible without malice.
Use integer shuffles between FP math operations to cause bypass delays.
Mixing SSE and AVX instructions without proper use of vzeroupper causes large stalls in pre-Skylake (and a different penalty in Skylake). Even without that, vectorizing badly can be worse than scalar (more cycles spent shuffling data into/out of vectors than saved by doing the add/sub/mul/div/sqrt operations for 4 Monte-Carlo iterations at once, with 256b vectors). add/sub/mul execution units are fully pipelined and full-width, but div and sqrt on 256b vectors aren't as fast as on 128b vectors (or scalars), so the speedup isn't dramatic for double.
exp() and log() don't have hardware support, so that part would require extracting vector elements back to scalar and calling the library function separately, then shuffling the results back into a vector. libm is typically compiled to only use SSE2, so will use the legacy-SSE encodings of scalar math instructions. If your code uses 256b vectors and calls exp without doing a vzeroupper first, then you stall. After returning, an AVX-128 instruction like vmovsd to set up the next vector element as an arg for exp will also stall. And then exp() will stall again when it runs an SSE instruction. This is exactly what happened in this question, causing a 10x slowdown. (Thanks #ZBoson).
See also Nathan Kurz's experiments with Intel's math lib vs. glibc for this code. Future glibc will come with vectorized implementations of exp() and so on.
If targeting pre-IvB, or esp. Nehalem, try to get gcc to cause partial-register stalls with 16bit or 8bit operations followed by 32bit or 64bit operations. In most cases, gcc will use movzx after an 8 or 16bit operation, but here's a case where gcc modifies ah and then reads ax
With (inline) asm:
With (inline) asm, you could break the uop cache: A 32B chunk of code that doesn't fit in three 6uop cache lines forces a switch from the uop cache to the decoders. An incompetent ALIGN (like NASM's default) using many single-byte nops instead of a couple long nops on a branch target inside the inner loop might do the trick. Or put the alignment padding after the label, instead of before. :P This only matters if the frontend is a bottleneck, which it won't be if we succeeded at pessimizing the rest of the code.
Use self-modifying code to trigger pipeline clears (aka machine-nukes).
LCP stalls from 16bit instructions with immediates too large to fit in 8 bits are unlikely to be useful. The uop cache on SnB and later means you only pay the decode penalty once. On Nehalem (the first i7), it might work for a loop that doesn't fit in the 28 uop loop buffer. gcc will sometimes generate such instructions, even with -mtune=intel and when it could have used a 32bit instruction.
A common idiom for timing is CPUID(to serialize) then RDTSC. Time every iteration separately with a CPUID/RDTSC to make sure the RDTSC isn't reordered with earlier instructions, which will slow things down a lot. (In real life, the smart way to time is to time all the iterations together, instead of timing each separately and adding them up).
Cause lots of cache misses and other memory slowdowns
Use a union { double d; char a[8]; } for some of your variables. Cause a store-forwarding stall by doing a narrow store (or Read-Modify-Write) to just one of the bytes. (That wiki article also covers a lot of other microarchitectural stuff for load/store queues). e.g. flip the sign of a double using XOR 0x80 on just the high byte, instead of a - operator. The diabolically incompetent developer may have heard that FP is slower than integer, and thus try to do as much as possible using integer ops. (A compiler could theoretically still compile this to an xorps with a constant like -, but for x87 the compiler would have to realize that it's negating the value and fchs or replace the next add with a subtract.)
Use volatile if you're compiling with -O3 and not using std::atomic, to force the compiler to actually store/reload all over the place. Global variables (instead of locals) will also force some stores/reloads, but the C++ memory model's weak ordering doesn't require the compiler to spill/reload to memory all the time.
Replace local vars with members of a big struct, so you can control the memory layout.
Use arrays in the struct for padding (and storing random numbers, to justify their existence).
Choose your memory layout so everything goes into a different line in the same "set" in the L1 cache. It's only 8-way associative, i.e. each set has 8 "ways". Cache lines are 64B.
Even better, put things exactly 4096B apart, since loads have a false dependency on stores to different pages but with the same offset within a page. Aggressive out-of-order CPUs use Memory Disambiguation to figure out when loads and stores can be reordered without changing the results, and Intel's implementation has false-positives that prevent loads from starting early. Probably they only check bits below the page offset so it can start before the TLB has translated the high bits from a virtual page to a physical page. As well as Agner's guide, see this answer, and a section near the end of #Krazy Glew's answer on the same question. (Andy Glew was an architect of Intel's PPro - P6 microarchitecture.) (Also related: https://stackoverflow.com/a/53330296 and https://github.com/travisdowns/uarch-bench/wiki/Memory-Disambiguation-on-Skylake)
Use __attribute__((packed)) to let you mis-align variables so they span cache-line or even page boundaries. (So a load of one double needs data from two cache-lines). Misaligned loads have no penalty in any Intel i7 uarch, except when crossing cache lines and page lines. Cache-line splits still take extra cycles. Skylake dramatically reduces the penalty for page split loads, from 100 to 5 cycles. (Section 2.1.3). (And can do two page walks in parallel).
A page-split on an atomic<uint64_t> should be just about the worst case, esp. if it's 5 bytes in one page and 3 bytes in the other page, or anything other than 4:4. Even splits down the middle are more efficient for cache-line splits with 16B vectors on some uarches, IIRC. Put everything in a alignas(4096) struct __attribute((packed)) (to save space, of course), including an array for storage for the RNG results. Achieve the misalignment by using uint8_t or uint16_t for something before the counter.
If you can get the compiler to use indexed addressing modes, that will defeat uop micro-fusion. Maybe by using #defines to replace simple scalar variables with my_data[constant].
If you can introduce an extra level of indirection, so load/store addresses aren't known early, that can pessimize further.
Traverse arrays in non-contiguous order
I think we can come up with incompetent justification for introducing an array in the first place: It lets us separate the random number generation from the random number use. Results of each iteration could also be stored in an array, to be summed later (with more diabolical incompetence).
For "maximum randomness", we could have a thread looping over the random array writing new random numbers into it. The thread consuming the random numbers could generate a random index to load a random number from. (There's some make-work here, but microarchitecturally it helps for load-addresses to be known early so any possible load latency can be resolved before the loaded data is needed.) Having a reader and writer on different cores will cause memory-ordering mis-speculation pipeline clears (as discussed earlier for the false-sharing case).
For maximum pessimization, loop over your array with a stride of 4096 bytes (i.e. 512 doubles). e.g.
for (int i=0 ; i<512; i++)
for (int j=i ; j<UPPER_BOUND ; j+=512)
monte_carlo_step(rng_array[j]);
So the access pattern is 0, 4096, 8192, ...,
8, 4104, 8200, ...
16, 4112, 8208, ...
This is what you'd get for accessing a 2D array like double rng_array[MAX_ROWS][512] in the wrong order (looping over rows, instead of columns within a row in the inner loop, as suggested by #JesperJuhl). If diabolical incompetence can justify a 2D array with dimensions like that, garden variety real-world incompetence easily justifies looping with the wrong access pattern. This happens in real code in real life.
Adjust the loop bounds if necessary to use many different pages instead of reusing the same few pages, if the array isn't that big. Hardware prefetching doesn't work (as well/at all) across pages. The prefetcher can track one forward and one backward stream within each page (which is what happens here), but will only act on it if the memory bandwidth isn't already saturated with non-prefetch.
This will also generate lots of TLB misses, unless the pages get merged into a hugepage (Linux does this opportunistically for anonymous (not file-backed) allocations like malloc/new that use mmap(MAP_ANONYMOUS)).
Instead of an array to store the list of results, you could use a linked list. Every iteration would require a pointer-chasing load (a RAW true dependency hazard for the load-address of the next load). With a bad allocator, you might manage to scatter the list nodes around in memory, defeating cache. With a bad toy allocator, it could put every node at the beginning of its own page. (e.g. allocate with mmap(MAP_ANONYMOUS) directly, without breaking up pages or tracking object sizes to properly support free).
These aren't really microarchitecture-specific, and have little to do with the pipeline (most of these would also be a slowdown on a non-pipelined CPU).
Somewhat off-topic: make the compiler generate worse code / do more work:
Use C++11 std::atomic<int> and std::atomic<double> for the most pessimal code. The MFENCEs and locked instructions are quite slow even without contention from another thread.
-m32 will make slower code, because x87 code will be worse than SSE2 code. The stack-based 32bit calling convention takes more instructions, and passes even FP args on the stack to functions like exp(). atomic<uint64_t>::operator++ on -m32 requires a lock cmpxchg8B loop (i586). (So use that for loop counters! [Evil laugh]).
-march=i386 will also pessimize (thanks #Jesper). FP compares with fcom are slower than 686 fcomi. Pre-586 doesn't provide an atomic 64bit store, (let alone a cmpxchg), so all 64bit atomic ops compile to libgcc function calls (which is probably compiled for i686, rather than actually using a lock). Try it on the Godbolt Compiler Explorer link in the last paragraph.
Use long double / sqrtl / expl for extra precision and extra slowness in ABIs where sizeof(long double) is 10 or 16 (with padding for alignment). (IIRC, 64bit Windows uses 8byte long double equivalent to double. (Anyway, load/store of 10byte (80bit) FP operands is 4 / 7 uops, vs. float or double only taking 1 uop each for fld m64/m32/fst). Forcing x87 with long double defeats auto-vectorization even for gcc -m64 -march=haswell -O3.
If not using atomic<uint64_t> loop counters, use long double for everything, including loop counters.
atomic<double> compiles, but read-modify-write operations like += aren't supported for it (even on 64bit). atomic<long double> has to call a library function just for atomic loads/stores. It's probably really inefficient, because the x86 ISA doesn't naturally support atomic 10byte loads/stores, and the only way I can think of without locking (cmpxchg16b) requires 64bit mode.
At -O0, breaking up a big expression by assigning parts to temporary vars will cause more store/reloads. Without volatile or something, this won't matter with optimization settings that a real build of real code would use.
C aliasing rules allow a char to alias anything, so storing through a char* forces the compiler to store/reload everything before/after the byte-store, even at -O3. (This is a problem for auto-vectorizing code that operates on an array of uint8_t, for example.)
Try uint16_t loop counters, to force truncation to 16bit, probably by using 16bit operand-size (potential stalls) and/or extra movzx instructions (safe). Signed overflow is undefined behaviour, so unless you use -fwrapv or at least -fno-strict-overflow, signed loop counters don't have to be re-sign-extended every iteration, even if used as offsets to 64bit pointers.
Force conversion from integer to float and back again. And/or double<=>float conversions. The instructions have latency > 1, and scalar int->float (cvtsi2ss) is badly designed to not zero the rest of the xmm register. (gcc inserts an extra pxor to break dependencies, for this reason.)
Frequently set your CPU affinity to a different CPU (suggested by #Egwor). diabolical reasoning: You don't want one core to get overheated from running your thread for a long time, do you? Maybe swapping to another core will let that core turbo to a higher clock speed. (In reality: they're so thermally close to each other that this is highly unlikely except in a multi-socket system). Now just get the tuning wrong and do it way too often. Besides the time spent in the OS saving/restoring thread state, the new core has cold L2/L1 caches, uop cache, and branch predictors.
Introducing frequent unnecessary system calls can slow you down no matter what they are. Although some important but simple ones like gettimeofday may be implemented in user-space with, with no transition to kernel mode. (glibc on Linux does this with the kernel's help: the kernel exports code+data in the VDSO).
For more on system call overhead (including cache/TLB misses after returning to user-space, not just the context switch itself), the FlexSC paper has some great perf-counter analysis of the current situation, as well as a proposal for batching system calls from massively multi-threaded server processes.

A few things that you can do to make things perform as bad as possible:
compile the code for the i386 architecture. This will prevent the use of SSE and newer instructions and force the use of the x87 FPU.
use std::atomic variables everywhere. This will make them very expensive due to the compiler being forced to insert memory barriers all over the place. And this is something an incompetent person might plausibly do to "ensure thread safety".
make sure to access memory in the worst possible way for the prefetcher to predict (column major vs row major).
to make your variables extra expensive you could make sure they all have 'dynamic storage duration' (heap allocated) by allocating them with new rather than letting them have 'automatic storage duration' (stack allocated).
make sure that all memory you allocate is very oddly aligned and by all means avoid allocating huge pages, since doing so would be much too TLB efficient.
whatever you do, don't build your code with the compilers optimizer enabled. And make sure to enable the most expressive debug symbols you can (won't make the code run slower, but it'll waste some extra disk space).
Note: This answer basically just summarizes my comments that #Peter Cordes already incorporated into his very good answer. Suggest he get's your upvote if you only have one to spare :)

You can use long double for computation. On x86 it should be the 80-bit format. Only the legacy, x87 FPU has support for this.
Few shortcomings of x87 FPU:
Lack of SIMD, may need more instructions.
Stack based, problematic for super scalar and pipelined architectures.
Separate and quite small set of registers, may need more conversion from other registers and more memory operations.
On the Core i7 there are 3 ports for SSE and only 2 for x87, the processor can execute less parallel instructions.

Late answer but I don't feel we have abused linked lists and the TLB enough.
Use mmap to allocate your nodes, such that your mostly use the MSB of the address. This should result in long TLB lookup chains, a page is 12 bits, leaving 52 bits for the translation, or around 5 levels it must travers each time. With a bit of luck they must go to memory each time for 5 levels lookup plus 1 memory access to get to your node, the top level will most likely be in cache somewhere, so we can hope for 5*memory access. Place the node so that is strides the worst border so that reading the next pointer would cause another 3-4 translation lookups. This might also totally wreck the cache due to the massive amount of translation lookups. Also the size of the virtual tables might cause most of the user data to be paged to disk for extra time.
When reading from the single linked list, make sure to read from the start of the list each time to cause maximum delay in reading a single number.

Is integer multiplication really done at the same speed as addition on a modern CPU?

I hear this statement quite often, that multiplication on modern hardware is so optimized that it actually is at the same speed as addition. Is that true?
I never can get any authoritative confirmation. My own research only adds questions. The speed tests usually show data that confuses me. Here is an example:
#include <stdio.h>
#include <sys/time.h>
unsigned int time1000() {
timeval val;
gettimeofday(&val, 0);
val.tv_sec &= 0xffff;
return val.tv_sec * 1000 + val.tv_usec / 1000;
}
int main() {
unsigned int sum = 1, T = time1000();
for (int i = 1; i < 100000000; i++) {
sum += i + (i+1); sum++;
}
printf("%u %u\n", time1000() - T, sum);
sum = 1;
T = time1000();
for (int i = 1; i < 100000000; i++) {
sum += i * (i+1); sum++;
}
printf("%u %u\n", time1000() - T, sum);
}
The code above can show that multiplication is faster:
clang++ benchmark.cpp -o benchmark
./benchmark
746 1974919423
708 3830355456
But with other compilers, other compiler arguments, differently written inner loops, the results can vary and I cannot even get an approximation.

Multiplication of two n-bit numbers can in fact be done in O(log n) circuit depth, just like addition.
Addition in O(log n) is done by splitting the number in half and (recursively) adding the two parts in parallel, where the upper half is solved for both the "0-carry" and "1-carry" case. Once the lower half is added, the carry is examined, and its value is used to choose between the 0-carry and 1-carry case.
Multiplication in O(log n) depth is also done through parallelization, where every sum of 3 numbers is reduced to a sum of just 2 numbers in parallel, and the sums are done in some manner like the above.
I won't explain it here, but you can find reading material on fast addition and multiplication by looking up "carry-lookahead" and "carry-save" addition.
So from a theoretical standpoint, since circuits are obviously inherently parallel (unlike software), the only reason multiplication would be asymptotically slower is the constant factor in the front, not the asymptotic complexity.

Integer multiplication will be slower.
Agner Fog's instruction tables show that when using 32-bit integer registers, Haswell's ADD/SUB take 0.25–1 cycles (depending on how well pipelined your instructions are) while MUL takes 2–4 cycles. Floating-point is the other way around: ADDSS/SUBSS take 1–3 cycles while MULSS takes 0.5–5 cycles.

This is an even more complex answer than simply multiplication versus addition. In reality the answer will most likely NEVER be yes. Multiplication, electronically, is a much more complicated circuit. Most of the reasons why, is that multiplication is the act of a multiplication step followed by an addition step, remember what it was like to multiply decimal numbers prior to using a calculator.
The other thing to remember is that multiplication will take longer or shorter depending on the architecture of the processor you are running it on. This may or may not be simply company specific. While an AMD will most likely be different than an Intel, even an Intel i7 may be different from a core 2 (within the same generation), and certainly different between generations (especially the farther back you go).
In all TECHNICALITY, if multiplies were the only thing you were doing (without looping, counting etc...), multiplies would be 2 to (as ive seen on PPC architectures) 35 times slower. This is more an exercise in understanding your architecture, and electronics.
In Addition:
It should be noted that a processor COULD be built for which ALL operations including a multiply take a single clock. What this processor would have to do is, get rid of all pipelining, and slow the clock so that the HW latency of any OPs circuit is less than or equal to the latency PROVIDED by the clock timing.
To do this would get rid of the inherent performance gains we are able to get when adding pipelining into a processor. Pipelining is the idea of taking a task and breaking it down into smaller sub-tasks that can be performed much quicker. By storing and forwarding the results of each sub-task between sub-tasks, we can now run a faster clock rate that only needs to allow for the longest latency of the sub-tasks, and not from the overarching task as a whole.
Picture of time through a multiply:
|--------------------------------------------------| Non-Pipelined
|--Step 1--|--Step 2--|--Step 3--|--Step 4--|--Step 5--| Pipelined
In the above diagram, the non-pipelined circuit takes 50 units of time. In the pipelined version, we have split the 50 units into 5 steps each taking 10 units of time, with a store step in between. It is EXTREMELY important to note that in the pipelined example, each of the steps can be working completely on their own and in parallel. For an operation to be completed, it must move through all 5 steps in order but another of the same operation with operands can be in step 2 as one is in step 1, 3, 4, and 5.
With all of this being said, this pipelined approach allows us to continuously fill the operator each clock cycle, and get a result out on each clock cycle IF we are able to order our operations such that we can perform all of one operation before we switch to another operation, and all we take as a timing hit is the original amount of clocks necessary to get the FIRST operation out of the pipeline.
Mystical brings up another good point. It is also important to look at the architecture from a more systems perspective. It is true that the newer Haswell architectures was built to better the Floating Point multiply performance within the processor. For this reason as the System level, it was architected to allow multiple multiplies to occur in simultaneity versus an add which can only happen once per system clock.
All of this can be summed up as follows:
Each architecture is different from a lower level HW perspective as well as from a system perspective
FUNCTIONALLY, a multiply will always take more time than an add because it combines a true multiply along with a true addition step.
Understand the architecture you are trying to run your code on, and find the right balance between readability and getting truly the best performance from that architecture.

Intel since Haswell has
add performance of 4/clock throughput, 1 cycle latency. (Any operand-size)
imul performance of 1/clock throughput, 3 cycle latency. (Any operand-size)
Ryzen is similar. Bulldozer-family has much lower integer throughput and not-fully-pipelined multiply, including extra slow for 64-bit operand-size multiply. See https://agner.org/optimize/ and other links in https://stackoverflow.com/tags/x86/info
But a good compiler could auto-vectorize your loops. (SIMD-integer multiply throughput and latency are both worse than SIMD-integer add). Or simply constant-propagate through them to just print out the answer! Clang really does know the closed-form Gauss's formula for sum(i=0..n) and can recognize some loops that do that.
You forgot to enable optimization so both loops bottleneck on the ALU + store/reload latency of keeping sum in memory between each of sum += independent stuff and sum++. See Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? for more about just how bad the resulting asm is, and why that's the case. clang++ defaults to -O0 (debug mode: keep variables in memory where a debugger can modify them between any C++ statements).
Store-forwarding latency on a modern x86 like Sandybridge-family (including Haswell and Skylake) is about 3 to 5 cycles, depending on timing of the reload. So with a 1-cycle latency ALU add in there, too, you're looking at about two 6-cycle latency steps in the critical path for this loop. (Plenty to hide all the store / reload and calculation based on i, and the loop-counter update).
See also Adding a redundant assignment speeds up code when compiled without optimization for another no-optimization benchmark. In that one, store-forwarding latency is actually reduced by having more independent work in the loop, delaying the reload attempt.
Modern x86 CPUs have 1/clock multiply throughput so even with optimization you wouldn't see a throughput bottleneck from it. Or on Bulldozer-family, not fully pipelined with 1 per 2-clock throughput.
More likely you'd bottleneck on the front-end work of getting all the work issued every cycle.
Although lea does allow very efficient copy-and-add, and doing i + i + 1 with a single instruction. Although really a good compiler would see that the loop only uses 2*i and optimize to increment by 2. i.e. a strength-reduction to do repeated addition by 2 instead of having to shift inside the loop.
And of course with optimization the extra sum++ can just fold into the sum += stuff where stuff already includes a constant. Not so with the multiply.

I came to this thread to get an idea of what the modern processors are doing in regard to integer math and the number of cycles required to do them. I worked on this problem of speeding up 32-bit integer multiplies and divides on the 65c816 processor in the 1990's. Using the method below, I was able to triple the speed of the standard math libraries available in the ORCA/M compilers at the time.
So the idea that multiplies are faster than adds is simply not the case (except rarely) but like people said it depends upon how the architecture is implemented. If there are enough steps being performed available between clock cycles, yes a multiply could effectively be the same speed as an add based on the clock, but there would be a lot of wasted time. In that case it would be nice to have an instruction that performs multiple (dependent) adds / subtracts given one instruction and multiple values. One can dream.
On the 65c816 processor, there were no multiply or divide instructions. Mult and Div were done with shifts and adds.
To perform a 16 bit add, you would do the following:
LDA $0000 - loaded a value into the Accumulator (5 cycles)
ADC $0002 - add with carry (5 cycles)
STA $0004 - store the value in the Accumulator back to memory (5 cycles)
15 cycles total for an add
If dealing with a call like from C, you would have additional overhead of dealing with pushing and pulling values off the stack. Creating routines that would do two multiples at once would save overhead for example.
The traditional way of doing the multiply is shifts and adds through the entire value of the one number. Each time the carry became a one as it is shifted left would mean you needed to add the value again. This required a test of each bit and a shift of the result.
I replaced that with a lookup table of 256 items so as the carry bits would not need to be checked. It was also possible to determine overflow before doing the multiply to not waste time. (On a modern processor this could be done in parallel but I don't know if they do this in the hardware). Given two 32 bit numbers and prescreened overflow, one of the multipliers is always 16 bits or less, thus one would only need to run through 8 bit multiplies once or twice to perform the entire 32 bit multiply. The result of this was multiplies that were 3 times as fast.
the speed of the 16 bit multiplies ranged from 12 cycles to about 37 cycles
multiply by 2 (0000 0010)
LDA $0000 - loaded a value into the Accumulator (5 cycles).
ASL - shift left (2 cycles).
STA $0004 - store the value in the Accumulator back to memory (5 cycles).
12 cycles plus call overhead.
multiply by (0101 1010)
LDA $0000 - loaded a value into the Accumulator (5 cycles)
ASL - shift left (2 cycles)
ASL - shift left (2 cycles)
ADC $0000 - add with carry for next bit (5 cycles)
ASL - shift left (2 cycles)
ADC $0000 - add with carry for next bit (5 cycles)
ASL - shift left (2 cycles)
ASL - shift left (2 cycles)
ADC $0000 - add with carry for next bit (5 cycles)
ASL - shift left (2 cycles)
STA $0004 - store the value in the Accumulator back to memory (5 cycles)
37 cycles plus call overhead
Since the databus of the AppleIIgs for which this was written was only 8 bits wide, to load 16 bit values required 5 cycles to load from memory, one extra for the pointer, and one extra cycle for the second byte.
LDA instruction (1 cycle because it is an 8 bit value)
$0000 (16 bit value requires two cycles to load)
memory location (requires two cycles to load because of an 8 bit data bus)
Modern processors would be able to do this faster because they have a 32 bit data bus at worst. In the processor logic itself the system of gates would have no additional delay at all compared to the data bus delay since the whole value would get loaded at once.
To do the complete 32 bit multiply, you would need to do the above twice and add the results together to get the final answer. The modern processors should be able to do the two in parallel and add the results for the answer. Combined with the overflow precheck done in parallel, it would minimize the time required to do the multiply.
Anyway it is readily apparent that multiplies require significantly more effort than an add. How many steps to process the operation between cpu clock cycles would determine how many cycles of the clock would be required. If the clock is slow enough, then the adds would appear to be the same speed as a multiply.
Regards,
Ken

A multiplication requires a final step of an addition of, at minimum, the same size of the number; so it will take longer than an addition. In decimal:
123
112
----
+246 ----
123 | matrix generation
123 ----
-----
13776 <---------------- Addition
Same applies in binary, with a more elaborate reduction of the matrix.
That said, reasons why they may take the same amount of time:
To simplify the pipelined architecture, all regular instructions can be designed to take the same amount of cycles (exceptions are memory moves for instance, that depend on how long it takes to talk to external memory).
Since the adder for the final step of the multiplier is just like the adder for an add instruction... why not use the same adder by skipping the matrix generation and reduction? If they use the same adder, then obviously they will take the same amount of time.
Of course, there are more complex architectures where this is not the case, and you might obtain completely different values. You also have architectures that take several instructions in parallel when they don't depend on each other, and then you are a bit at the mercy of your compiler... and of the operating system.
The only way to run this test rigorously you would have to run in assembly and without an operating system - otherwise there are too many variables.

Even if it were, that mostly tells us what restriction the clock puts on our hardware. We can't clock higher because heat(?), but the number of ADD instruction gates a signal could pass during a clock could be very many but a single ADD instruction would only utilize one of them. So while it may at some point take equally many clock cycles, not all of the propagation time for the signals is utilized.
If we could clock higher we could def. make ADD faster probably by several orders of magnitude.

This really depends on your machine. Of course, integer multiplication is quite complex compared to addition, but quite a few AMD CPU can execute a multiplication in a single cycle. That is just as fast as addition.
Other CPUs take three or four cycles to do a multiplication, which is a bit slower than addition. But it's nowhere near the performance penalty you had to suffer ten years ago (back then a 32-Bit multiplication could take thirty-something cycles on some CPUs).
So, yes, multiplication is in the same speed class nowadays, but no, it's still not exactly as fast as addition on all CPUs.

Even on ARM (known for its high efficiency and small, clean design), integer multiplications take 3-7 cycles and than integer additions take 1 cycle.
However, an add/shift trick is often used to multiply integers by constants faster than the multiply instruction can calculate the answer.
The reason this works well on ARM is that ARM has a "barrel shifter", which allows many instructions to shift or rotate one of their arguments by 1-31 bits at zero cost, i.e. x = a + b and x = a + (b << s) take exactly the same amount of time.
Utilizing this processor feature, let's say you want to calculate a * 15. Then since 15 = 1111 (base 2), the following pseudocode (translated into ARM assembly) would implement the multiplication:
a_times_3 = a + (a << 1) // a * (0011 (base 2))
a_times_15 = a_times_3 + (a_times_3 << 2) // a * (0011 (base 2) + 1100 (base 2))
Similarly you could multiply by 13 = 1101 (base 2) using either of the following:
a_times_5 = a + (a << 2)
a_times_13 = a_times_5 + (a << 3)
a_times_3 = a + (a << 1)
a_times_15 = a_times_3 + (a_times_3 << 2)
a_times_13 = a_times_15 - (a << 1)
The first snippet is obviously faster in this case, but sometimes subtraction helps when translating a constant multiplication into add/shift combinations.
This multiplication trick was used heavily in the ARM assembly coding community in the late 80s, on the Acorn Archimedes and Acorn RISC PC (the origin of the ARM processor). Back then, a lot of ARM assembly was written by hand, since squeezing every last cycle out of the processor was important. Coders in the ARM demoscene developed many techniques like this for speeding up code, most of which are probably lost to history now that almost no assembly code is written by hand anymore. Compilers probably incorporate many tricks like this, but I'm sure there are many more that never made the transition from "black art optimization" to compiler implementation.
You can of course write explicit add/shift multiplication code like this in any compiled language, and the code may or may not run faster than a straight multiplication once compiled.
x86_64 may also benefit from this multiplication trick for small constants, although I don't believe shifting is zero-cost on the x86_64 ISA, in either the Intel or AMD implementations (x86_64 probably takes one extra cycle for each integer shift or rotate).

There are lots of good answers here about your main question, but I just wanted to point out that your code is not a good way to measure operation performance.
For starters, modern cpus adjust freqyuencies all the time, so you should use rdtsc to count the actual number of cycles instead of elapsed microseconds.
But more importantly, your code has artificial dependency chains, unnecessary control logic and iterators that will make your measure into an odd mix of latency and throughtput plus some constant terms added for no reason.
To really measure throughtput you should significantly unroll the loop and also add several partial sums in parallel (more sums than steps in the add/mul cpu pipelines).

No it's not, and in fact it's noticeably slower (which translated into a 15% performance hit for the particular real-world program I was running).
I realized this myself when asking this question from just a few days ago here.

Since the other answers deal with real, present-day devices -- which are bound to change and improve as time passes -- I thought we could look at the question from the theoretical side.
Proposition: When implemented in logic gates, using the usual algorithms, an integer multiplication circuit is O(log N) times slower than an addition circuit, where N is the number of bits in a word.
Proof: The time for a combinatorial circuit to stabilise is proportional to the depth of the longest sequence of logic gates from any input to any output. So we must show that a gradeschool multiply circuit is O(log N) times deeper than an addition circuit.
Addition is normally implemented as a half adder followed by N-1 full adders, with the carry bits chained from one adder to the next. This circuit clearly has depth O(N). (This circuit can be optimized in many ways, but the worst case performance will always be O(N) unless absurdly large lookup tables are used.)
To multiply A by B, we first need to multiply each bit of A with each bit of B. Each bitwise multiply is simply an AND gate. There are N^2 bitwise multiplications to perform, hence N^2 AND gates -- but all of them can execute in parallel, for a circuit depth of 1. This solves the multiplication phase of the gradeschool algorithm, leaving just the addition phase.
In the addition phase, we can combine the partial products using an inverted binary tree-shaped circuit to do many of the additions in parallel. The tree will be (log N) nodes deep, and at each node, we will be adding together two numbers with O(N) bits. This means each node can be implemented with an adder of depth O(N), giving a total circuit depth of O(N log N). QED.

Performance wise, how fast are Bitwise Operators vs. Normal Modulus?

Does using bitwise operations in normal flow or conditional statements like for, if, and so on increase overall performance and would it be better to use them where possible? For example:
if(i++ & 1) {
}
vs.
if(i % 2) {
}

Unless you're using an ancient compiler, it can already handle this level of conversion on its own. That is to say, a modern compiler can and will implement i % 2 using a bitwise AND instruction, provided it makes sense to do so on the target CPU (which, in fairness, it usually will).
In other words, don't expect to see any difference in performance between these, at least with a reasonably modern compiler with a reasonably competent optimizer. In this case, "reasonably" has a pretty broad definition too--even quite a few compilers that are decades old can handle this sort of micro-optimization with no difficulty at all.

TL;DR Write for semantics first, optimize measured hot-spots second.
At the CPU level, integer modulus and divisions are among the slowest operations. But you are not writing at the CPU level, instead you write in C++, which your compiler translates to an Intermediate Representation, which finally is translated into assembly according to the model of CPU for which you are compiling.
In this process, the compiler will apply Peephole Optimizations, among which figure Strength Reduction Optimizations such as (courtesy of Wikipedia):
Original Calculation Replacement Calculation
y = x / 8 y = x >> 3
y = x * 64 y = x << 6
y = x * 2 y = x << 1
y = x * 15 y = (x << 4) - x
The last example is perhaps the most interesting one. Whilst multiplying or dividing by powers of 2 is easily converted (manually) into bit-shifts operations, the compiler is generally taught to perform even smarter transformations that you would probably think about on your own and who are not as easily recognized (at the very least, I do not personally immediately recognize that (x << 4) - x means x * 15).

This is obviously CPU dependent, but you can expect that bitwise operations will never take more, and typically take less, CPU cycles to complete. In general, integer / and % are famously slow, as CPU instructions go. That said, with modern CPU pipelines having a specific instruction complete earlier doesn't mean your program necessarily runs faster.
Best practice is to write code that's understandable, maintainable, and expressive of the logic it implements. It's extremely rare that this kind of micro-optimisation makes a tangible difference, so it should only be used if profiling has indicated a critical bottleneck and this is proven to make a significant difference. Moreover, if on some specific platform it did make a significant difference, your compiler optimiser may already be substituting a bitwise operation when it can see that's equivalent (this usually requires that you're /-ing or %-ing by a constant).
For whatever it's worth, on x86 instructions specifically - and when the divisor is a runtime-variable value so can't be trivially optimised into e.g. bit-shifts or bitwise-ANDs, the time taken by / and % operations in CPU cycles can be looked up here. There are too many x86-compatible chips to list here, but as an arbitrary example of recent CPUs - if we take Agner's "Sunny Cove (Ice Lake)" (i.e. 10th gen Intel Core) data, DIV and IDIV instructions have a latency between 12 and 19 cycles, whereas bitwise-AND has 1 cycle. On many older CPUs DIV can be 40-60x worse.

By default you should use the operation that best expresses your intended meaning, because you should optimize for readable code. (Today most of the time the scarcest resource is the human programmer.)
So use & if you extract bits, and use % if you test for divisibility, i.e. whether the value is even or odd.
For unsigned values both operations have exactly the same effect, and your compiler should be smart enough to replace the division by the corresponding bit operation. If you are worried you can check the assembly code it generates.
Unfortunately integer division is slightly irregular on signed values, as it rounds towards zero and the result of % changes sign depending on the first operand. Bit operations, on the other hand, always round down. So the compiler cannot just replace the division by a simple bit operation. Instead it may either call a routine for integer division, or replace it with bit operations with additional logic to handle the irregularity. This may depends on the optimization level and on which of the operands are constants.
This irregularity at zero may even be a bad thing, because it is a nonlinearity. For example, I recently had a case where we used division on signed values from an ADC, which had to be very fast on an ARM Cortex M0. In this case it was better to replace it with a right shift, both for performance and to get rid of the nonlinearity.

C operators cannot be meaningfully compared in therms of "performance". There's no such thing as "faster" or "slower" operators at language level. Only the resultant compiled machine code can be analyzed for performance. In your specific example the resultant machine code will normally be exactly the same (if we ignore the fact that the first condition includes a postfix increment for some reason), meaning that there won't be any difference in performance whatsoever.

Here is the compiler (GCC 4.6) generated optimized -O3 code for both options:
int i = 34567;
int opt1 = i++ & 1;
int opt2 = i % 2;
Generated code for opt1:
l %r1,520(%r11)
nilf %r1,1
st %r1,516(%r11)
asi 520(%r11),1
Generated code for opt2:
l %r1,520(%r11)
nilf %r1,2147483649
ltr %r1,%r1
jhe .L14
ahi %r1,-1
oilf %r1,4294967294
ahi %r1,1
.L14: st %r1,512(%r11)
So 4 extra instructions...which are nothing for a prod environment. This would be a premature optimization and just introduce complexity

Always these answers about how clever compilers are, that people should not even think about the performance of their code, that they should not dare to question Her Cleverness The Compiler, that bla bla bla… and the result is that people get convinced that every time they use % [SOME POWER OF TWO] the compiler magically converts their code into & ([SOME POWER OF TWO] - 1). This is simply not true. If a shared library has this function:
int modulus (int a, int b) {
return a % b;
}
and a program launches modulus(135, 16), nowhere in the compiled code there will be any trace of bitwise magic. The reason? The compiler is clever, but it did not have a crystal ball when it compiled the library. It sees a generic modulus calculation with no information whatsoever about the fact that only powers of two will be involved and it leaves it as such.
But you can know if only powers of two will be passed to a function. And if that is the case, the only way to optimize your code is to rewrite your function as
unsigned int modulus_2 (unsigned int a, unsigned int b) {
return a & (b - 1);
}
The compiler cannot do that for you.

Bitwise operations are much faster.
This is why the compiler will use bitwise operations for you.
Actually, I think it will be faster to implement it as:
~i & 1
Similarly, if you look at the assembly code your compiler generates, you may see things like x ^= x instead of x=0. But (I hope) you are not going to use this in your C++ code.
In summary, do yourself, and whoever will need to maintain your code, a favor. Make your code readable, and let the compiler do these micro optimizations. It will do it better.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js