SSE slower than FPU?

SSE slower than FPU? - c++

I have a large piece of code, part of whose body contains this piece of code:
result = (nx * m_Lx + ny * m_Ly + m_Lz) / sqrt(nx * nx + ny * ny + 1);
which I have vectorized as follows (everything is already a float):
__m128 r = _mm_mul_ps(_mm_set_ps(ny, nx, ny, nx),
_mm_set_ps(ny, nx, m_Ly, m_Lx));
__declspec(align(16)) int asInt[4] = {
_mm_extract_ps(r,0), _mm_extract_ps(r,1),
_mm_extract_ps(r,2), _mm_extract_ps(r,3)
};
float (&res)[4] = reinterpret_cast<float (&)[4]>(asInt);
result = (res[0] + res[1] + m_Lz) / sqrt(res[2] + res[3] + 1);
The result is correct; however, my benchmarking shows that the vectorized version is slower:
The non-vectorized version takes 3750 ms
The vectorized version takes 4050 ms
Setting result to 0 directly (and removing this part of the code entirely) reduces the entire process to 2500 ms
Given that the vectorized version only contains one set of SSE multiplications (instead of four individual FPU multiplications), why is it slower? Is the FPU indeed faster than SSE, or is there a confounding variable here?
(I'm on a mobile Core i5.)

You are spending a lot of time moving scalar values to/from SSE registers with _mm_set_ps and _mm_extract_ps - this is generating a lot of instructions, the execution time of which will far outweigh any benefit from using _mm_mul_ps. Take a look at the generated assembly output to see how much code is being generated in addition to the single MULPS instruction.
To vectorize this properly you need to use 128 bit SSE loads and stores (_mm_load_ps/_mm_store_ps) and then use SSE shuffle instructions to move elements around within registers where needed.
One further point to note - modern CPUs such as Core i5, Core i7, have two scalar FPUs and can issue 2 floating point multiplies per clock. The potential benefit from SSE for single precision floating point is therefore only 2x at best. It's easy to lose most/all of this 2x benefit if you have excessive "housekeeping" instructions, as is the case here.

There are several problems :
You will not see much benefits from using SSE instructions in such operations, because the SSE instructions are supposed to be better on parallel operations (that is, multiplying several values at the same time). What you did is a misuse of the SSE
do not set the values, use the pointer to the 1st value in the array, but then your values are not in the array
do not extract and copy values into the array. That is also a misuse of SSE. The result is supposed to be in an array.

My take would be that the processor has the time to compute the first multiplication when using the FPU while loading the next values. The SSE has to load all the values first.

Related

How to use SSE instruction set on C6678 DSP?

SSE can only be used on x86 x64 CPUs. I have a problem using the SPEEXDSP library on a TI C6678. I've never used the SSE instruction, I've tried many ways and can't get it to work on the DSP.
Is it possible to modify SSE instructions to normal C++ instructions? How to modify it?
Looking forward to your reply.
Example:
static inline double interpolate_product_double(const float* a, const float* b, unsigned int len, const spx_uint32_t oversample, float* frac) {
int i;
double ret;
__m128d sum;
__m128d sum1 = _mm_setzero_pd();
__m128d sum2 = _mm_setzero_pd();
__m128 f = _mm_loadu_ps(frac);
__m128d f1 = _mm_cvtps_pd(f);
__m128d f2 = _mm_cvtps_pd(_mm_movehl_ps(f, f));
__m128 t;
for (i = 0; i < len; i += 2)
{
t = _mm_mul_ps(_mm_load1_ps(a + i), _mm_loadu_ps(b + i * oversample));
sum1 = _mm_add_pd(sum1, _mm_cvtps_pd(t));
sum2 = _mm_add_pd(sum2, _mm_cvtps_pd(_mm_movehl_ps(t, t)));
t = _mm_mul_ps(_mm_load1_ps(a + i + 1), _mm_loadu_ps(b + (i + 1) * oversample));
sum1 = _mm_add_pd(sum1, _mm_cvtps_pd(t));
sum2 = _mm_add_pd(sum2, _mm_cvtps_pd(_mm_movehl_ps(t, t)));
}
sum1 = _mm_mul_pd(f1, sum1);
sum2 = _mm_mul_pd(f2, sum2);
sum = _mm_add_pd(sum1, sum2);
sum = _mm_add_sd(sum, _mm_unpackhi_pd(sum, sum));
_mm_store_sd(&ret, sum);
return ret;
}

Yes, you can use SIMD Everywhere (SIMDe). It provides portable implementations of many intrinsics, including all of the ones in your code. Full disclosure: I am the lead developer.
Edit: replying to phuclv here since it's a bit long for a comment.
SIMDe doesn't currently use the c6x instrinsics to implement functions like we often do for NEON, AltiVec/VSX, WASM SIMD, etc. There is nothing preventing it, and patches are very much welcome, but they're not there yet.
However, every function in SiMDe has fallback implementations all the way back to standard C. Usually things don't get that far, though; even discounting the architecture-specific implementations mentioned above, if the compiler supports it the operations are also implemented using GNU C vector extensions, and even the portable fallbacks are actually annotated with OpenMP SIMD directives. Conversion functions use compiler built-ins like __builtin_convertvector, and functions which require shuffling data around will use __builtin_shuffle / __builtin_shufflevector.
Basically, SIMDe goes to great lengths to get the compiler to vectorize the whenever possible, even if SIMDe doesn't actually know how to do it. The functions above are all pretty straightforward; I don't know enough about c6x SIMD to know what kind of operations are supported in hardware, but GCC and clang (which the TI compilers are based on) generally do a very good job with all the information SIMDe gives them. Honestly, the thing I'm most worried about here is whether the c6x supports double-precision floating point in SIMD (which the code above uses)... there is a pretty good chance it only supports single-precision floats.

Is it possible to modify SSE instructions to normal C++ instructions?
There's no such thing as "C++ instructions" because C++ is a high level language with only statements and no instructions. But yes it's possible to convert SSE intrinsics to C++ expressions because they're simply multiple operations in parallel
SSE is one of the SIMD instruction sets so just convert it to the corresponding SIMD in the target architecture. In your case TI C6678 does have SIMD support:
The C64x+ and C674x DSPs support 2-way SIMD operations for 16-bit data and 4-way SIMD operations for 8-bit data. On the C66x DSP, the vector processing capability is improved by extending the width of the SIMD instructions. C66x DSPs can execute instructions that operate on 128-bit vectors.

The C66x architecture indeed supports a number of SIMD instructions, somewhat comparable to those of Intel's SSE.
You need to be aware of the processor's register set in both architectures and compare the available instructions.
For example, _mm_add_ps performs four simultaneous additions of single-precision floats, contained four by four in the SSE registers. The DSP has a similar DADDSP instruction that only performs two such additions. Hence you will need to translate one _mm_add_ps by two DADDSP.
Read the manuals (these instruction sets are online), understand what the instructions are doing, and find the equivalences. In case of a dead-end, you still have the recourse to good old scalar operations, like C[0]= A[0]+B[0]; C[1]= A[1]+B[1];

Why is vectorization not beneficial in this for loop?

I am trying to vectorize this for loop. After using the Rpass flag, I am getting the following remark for it:
int someOuterVariable = 0;
for (unsigned int i = 7; i != -1; i--)
{
array[someOuterVariable + i] -= 0.3 * anotherArray[i];
}
Remark:
The cost-model indicates that vectorization is not beneficial
the cost-model indicates that interleaving is not beneficial
I want to understand what this means. Does "interleaving is not benificial" mean the array indexing is not proper?

It's hard to answer without more details about your types. But in general, starting a loop incurs some costs and vectorising also implies some costs (such as moving data to/from SIMD registers, ensuring proper alignment of data)
I'm guessing here that the compiler tells you that the vectorisation cost here is bigger than simply running the 8 iterations without it, so it's not doing it.
Try to increase the number of iterations, or help the compiler for computing alignement for example.
Typically, unless the type of array's item are exactly of the proper alignment for SIMD vector, accessing an array from a "unknown" offset (what you've called someOuterVariable) prevents the compiler to write an efficient vectorisation code.
EDIT: About the "interleaving" question, it's hard to guess without knowning your tool. But in general, interleaving usually means mixing 2 streams of computations so that the compute units of the CPU are all busy. For example, if you have 2 ALU in your CPU, and the program is doing:
c = a + b;
d = e * f;
The compiler can interleave the computation so that both the addition and multiplication happens at the same time (provided you have 2 ALU available). Typically, this means that the multiplication which is a bit longer to compute (for example 6 cycles) will be started before the addition (for example 3 cycles). You'll then get the result of both operation after only 6 cycles instead of 9 if the compiler serialized the computations. This is only possible if there is no dependencies between the computation (if d required c, it can not work). A compiler is very cautious about this, and, in your example, will not apply this optimization if it can't prove that array and anotherArray don't alias.

Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs

I've been racking my brain for a week trying to complete this assignment and I'm hoping someone here can lead me toward the right path. Let me start with the instructor's instructions:
Your assignment is the opposite of our first lab assignment, which was to optimize a prime number program. Your purpose in this assignment is to pessimize the program, i.e. make it run slower. Both of these are CPU-intensive programs. They take a few seconds to run on our lab PCs. You may not change the algorithm.
To deoptimize the program, use your knowledge of how the Intel i7 pipeline operates. Imagine ways to re-order instruction paths to introduce WAR, RAW, and other hazards. Think of ways to minimize the effectiveness of the cache. Be diabolically incompetent.
The assignment gave a choice of Whetstone or Monte-Carlo programs. The cache-effectiveness comments are mostly only applicable to Whetstone, but I chose the Monte-Carlo simulation program:
// Un-modified baseline for pessimization, as given in the assignment
#include <algorithm> // Needed for the "max" function
#include <cmath>
#include <iostream>
// A simple implementation of the Box-Muller algorithm, used to generate
// gaussian random numbers - necessary for the Monte Carlo method below
// Note that C++11 actually provides std::normal_distribution<> in
// the <random> library, which can be used instead of this function
double gaussian_box_muller() {
double x = 0.0;
double y = 0.0;
double euclid_sq = 0.0;
// Continue generating two uniform random variables
// until the square of their "euclidean distance"
// is less than unity
do {
x = 2.0 * rand() / static_cast<double>(RAND_MAX)-1;
y = 2.0 * rand() / static_cast<double>(RAND_MAX)-1;
euclid_sq = x*x + y*y;
} while (euclid_sq >= 1.0);
return x*sqrt(-2*log(euclid_sq)/euclid_sq);
}
// Pricing a European vanilla call option with a Monte Carlo method
double monte_carlo_call_price(const int& num_sims, const double& S, const double& K, const double& r, const double& v, const double& T) {
double S_adjust = S * exp(T*(r-0.5*v*v));
double S_cur = 0.0;
double payoff_sum = 0.0;
for (int i=0; i<num_sims; i++) {
double gauss_bm = gaussian_box_muller();
S_cur = S_adjust * exp(sqrt(v*v*T)*gauss_bm);
payoff_sum += std::max(S_cur - K, 0.0);
}
return (payoff_sum / static_cast<double>(num_sims)) * exp(-r*T);
}
// Pricing a European vanilla put option with a Monte Carlo method
double monte_carlo_put_price(const int& num_sims, const double& S, const double& K, const double& r, const double& v, const double& T) {
double S_adjust = S * exp(T*(r-0.5*v*v));
double S_cur = 0.0;
double payoff_sum = 0.0;
for (int i=0; i<num_sims; i++) {
double gauss_bm = gaussian_box_muller();
S_cur = S_adjust * exp(sqrt(v*v*T)*gauss_bm);
payoff_sum += std::max(K - S_cur, 0.0);
}
return (payoff_sum / static_cast<double>(num_sims)) * exp(-r*T);
}
int main(int argc, char **argv) {
// First we create the parameter list
int num_sims = 10000000; // Number of simulated asset paths
double S = 100.0; // Option price
double K = 100.0; // Strike price
double r = 0.05; // Risk-free rate (5%)
double v = 0.2; // Volatility of the underlying (20%)
double T = 1.0; // One year until expiry
// Then we calculate the call/put values via Monte Carlo
double call = monte_carlo_call_price(num_sims, S, K, r, v, T);
double put = monte_carlo_put_price(num_sims, S, K, r, v, T);
// Finally we output the parameters and prices
std::cout << "Number of Paths: " << num_sims << std::endl;
std::cout << "Underlying: " << S << std::endl;
std::cout << "Strike: " << K << std::endl;
std::cout << "Risk-Free Rate: " << r << std::endl;
std::cout << "Volatility: " << v << std::endl;
std::cout << "Maturity: " << T << std::endl;
std::cout << "Call Price: " << call << std::endl;
std::cout << "Put Price: " << put << std::endl;
return 0;
}
The changes I have made seemed to increase the code running time by a second but I'm not entirely sure what I can change to stall the pipeline without adding code. A point to the right direction would be awesome, I appreciate any responses.
Update: the professor who gave this assignment posted some details
The highlights are:
It's a second semester architecture class at a community college (using the Hennessy and Patterson textbook).
the lab computers have Haswell CPUs
The students have been exposed to the CPUID instruction and how to determine cache size, as well as intrinsics and the CLFLUSH instruction.
any compiler options are allowed, and so is inline asm.
Writing your own square root algorithm was announced as being outside the pale
Cowmoogun's comments on the meta thread indicate that it wasn't clear compiler optimizations could be part of this, and assumed -O0, and that a 17% increase in run-time was reasonable.
So it sounds like the goal of the assignment was to get students to re-order the existing work to reduce instruction-level parallelism or things like that, but it's not a bad thing that people have delved deeper and learned more.
Keep in mind that this is a computer-architecture question, not a question about how to make C++ slow in general.

Important background reading: Agner Fog's microarch pdf, and probably also Ulrich Drepper's What Every Programmer Should Know About Memory. See also the other links in the x86 tag wiki, especially Intel's optimization manuals, and David Kanter's analysis of the Haswell microarchitecture, with diagrams.
Very cool assignment; much better than the ones I've seen where students were asked to optimize some code for gcc -O0, learning a bunch of tricks that don't matter in real code. In this case, you're being asked to learn about the CPU pipeline and use that to guide your de-optimization efforts, not just blind guessing. The most fun part of this one is justifying each pessimization with "diabolical incompetence", not intentional malice.
Problems with the assignment wording and code:
The uarch-specific options for this code are limited. It doesn't use any arrays, and much of the cost is calls to exp/log library functions. There isn't an obvious way to have more or less instruction-level parallelism, and the loop-carried dependency chain is very short.
It would be hard to get a slowdown just from re-arranging the expressions to change the dependencies, to reduce ILP from hazards.
Intel Sandybridge-family CPUs are aggressive out-of-order designs that spend lots of transistors and power to find parallelism and avoid hazards (dependencies) that would trouble a classic RISC in-order pipeline. Usually the only traditional hazards that slow it down are RAW "true" dependencies that cause throughput to be limited by latency.
WAR and WAW hazards for registers are pretty much not an issue, thanks to register renaming. (except for popcnt/lzcnt/tzcnt, which have a false dependency their destination on Intel CPUs, even though it should be write-only).
For memory ordering, modern CPUs use a store buffer to delay commit into cache until retirement, also avoiding WAR and WAW hazards. See also this answer about what a store buffer is, and being essential essential for OoO exec to decouple execution from things other cores can see.
Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) has more about register renaming and hiding FMA latency in an FP dot product loop.
The "i7" brand-name was introduced with Nehalem (successor to Core2), and some Intel manuals even say Core i7 when they seem to mean Nehalem, but they kept the "i7" branding for Sandybridge and later microarchitectures. SnB is when the P6-family evolved into a new species, the SnB-family. In many ways, Nehalem has more in common with Pentium III than with Sandybridge (e.g. register read stalls aka ROB-read stalls don't happen on SnB, because it changed to using a physical register file. Also a uop cache and a different internal uop format). The term "i7 architecture" is not useful, because it makes little sense to group the SnB-family with Nehalem but not Core2. (Nehalem did introduce the shared inclusive L3 cache architecture for connecting multiple cores together, though. And also integrated GPUs. So chip-level, the naming makes more sense.)
Summary of the good ideas that diabolical incompetence can justify
Even the diabolically incompetent are unlikely to add obviously useless work or an infinite loop, and making a mess with C++/Boost classes is beyond the scope of the assignment.
Multi-thread with a single shared std::atomic<uint64_t> loop counter, so the right total number of iterations happen. Atomic uint64_t is especially bad with -m32 -march=i586. For bonus points, arrange for it to be misaligned, and crossing a page boundary with an uneven split (not 4:4).
False sharing for some other non-atomic variable -> memory-order mis-speculation pipeline clears, as well as extra cache misses.
Instead of using - on FP variables, XOR the high byte with 0x80 to flip the sign bit, causing store-forwarding stalls.
Time each iteration independently, with something even heavier than RDTSC. e.g. CPUID / RDTSC or a time function that makes a system call. Serializing instructions are inherently pipeline-unfriendly.
Change multiplies by constants to divides by their reciprocal ("for ease of reading"). div is slow and not fully pipelined.
Vectorize the multiply/sqrt with AVX (SIMD), but fail to use vzeroupper before calls to scalar math-library exp() and log() functions, causing AVX<->SSE transition stalls.
Store the RNG output in a linked list, or in arrays which you traverse out of order. Same for the result of each iteration, and sum at the end.
Also covered in this answer but excluded from the summary: suggestions that would be just as slow on a non-pipelined CPU, or that don't seem to be justifiable even with diabolical incompetence. e.g. many gimp-the-compiler ideas that produce obviously different / worse asm.
Multi-thread badly
Maybe use OpenMP to multi-thread loops with very few iterations, with way more overhead than speed gain. Your monte-carlo code has enough parallelism to actually get a speedup, though, esp. if we succeed at making each iteration slow. (Each thread computes a partial payoff_sum, added at the end). #omp parallel on that loop would probably be an optimization, not a pessimization.
Multi-thread but force both threads to share the same loop counter (with atomic increments so the total number of iterations is correct). This seems diabolically logical. This means using a static variable as a loop counter. This justifies use of atomic for loop counters, and creates actual cache-line ping-ponging (as long as the threads don't run on the same physical core with hyperthreading; that might not be as slow). Anyway, this is much slower than the un-contended case for lock xadd or lock dec. And lock cmpxchg8b to atomically increment a contended uint64_t on a 32bit system will have to retry in a loop instead of having the hardware arbitrate an atomic inc.
Also create false sharing, where multiple threads keep their private data (e.g. RNG state) in different bytes of the same cache line. (Intel tutorial about it, including perf counters to look at). There's a microarchitecture-specific aspect to this: Intel CPUs speculate on memory mis-ordering not happening, and there's a memory-order machine-clear perf event to detect this, at least on P4. The penalty might not be as large on Haswell. As that link points out, a locked instruction assumes this will happen, avoiding mis-speculation. A normal load speculates that other cores won't invalidate a cache line between when the load executes and when it retires in program-order (unless you use pause). True sharing without locked instructions is usually a bug. It would be interesting to compare a non-atomic shared loop counter with the atomic case. To really pessimize, keep the shared atomic loop counter, and cause false sharing in the same or a different cache line for some other variable.
Random uarch-specific ideas:
If you can introduce any unpredictable branches, that will pessimize the code substantially. Modern x86 CPUs have quite long pipelines, so a mispredict costs ~15 cycles (when running from the uop cache).
Dependency chains:
I think this was one of the intended parts of the assignment.
Defeat the CPU's ability to exploit instruction-level parallelism by choosing an order of operations that has one long dependency chain instead of multiple short dependency chains. Compilers aren't allowed to change the order of operations for FP calculations unless you use -ffast-math, because that can change the results (as discussed below).
To really make this effective, increase the length of a loop-carried dependency chain. Nothing leaps out as obvious, though: The loops as written have very short loop-carried dependency chains: just an FP add. (3 cycles). Multiple iterations can have their calculations in-flight at once, because they can start well before the payoff_sum += at the end of the previous iteration. (log() and exp take many instructions, but not a lot more than Haswell's out-of-order window for finding parallelism: ROB size=192 fused-domain uops, and scheduler size=60 unfused-domain uops. As soon as execution of the current iteration progresses far enough to make room for instructions from the next iteration to issue, any parts of it that have their inputs ready (i.e. independent/separate dep chain) can start executing when older instructions leave the execution units free (e.g. because they're bottlenecked on latency, not throughput.).
The RNG state will almost certainly be a longer loop-carried dependency chain than the addps.
Use slower/more FP operations (esp. more division):
Divide by 2.0 instead of multiplying by 0.5, and so on. FP multiply is heavily pipelined in Intel designs, and has one per 0.5c throughput on Haswell and later. FP divsd/divpd is only partially pipelined. (Although Skylake has an impressive one per 4c throughput for divpd xmm, with 13-14c latency, vs not pipelined at all on Nehalem (7-22c)).
The do { ...; euclid_sq = x*x + y*y; } while (euclid_sq >= 1.0); is clearly testing for a distance, so clearly it would be proper to sqrt() it. :P (sqrt is even slower than div).
As #Paul Clayton suggests, rewriting expressions with associative/distributive equivalents can introduce more work (as long as you don't use -ffast-math to allow the compiler to re-optimize). (exp(T*(r-0.5*v*v)) could become exp(T*r - T*v*v/2.0). Note that while math on real numbers is associative, floating point math is not, even without considering overflow/NaN (which is why -ffast-math isn't on by default). See Paul's comment for a very hairy nested pow() suggestion.
If you can scale the calculations down to very small numbers, then FP math ops take ~120 extra cycles to trap to microcode when an operation on two normal numbers produces a denormal. See Agner Fog's microarch pdf for the exact numbers and details. This is unlikely since you have a lot of multiplies, so the scale factor would be squared and underflow all the way to 0.0. I don't see any way to justify the necessary scaling with incompetence (even diabolical), only intentional malice.
###If you can use intrinsics (<immintrin.h>)
Use movnti to evict your data from cache. Diabolical: it's new and weakly-ordered, so that should let the CPU run it faster, right? Or see that linked question for a case where someone was in danger of doing exactly this (for scattered writes where only some of the locations were hot). clflush is probably impossible without malice.
Use integer shuffles between FP math operations to cause bypass delays.
Mixing SSE and AVX instructions without proper use of vzeroupper causes large stalls in pre-Skylake (and a different penalty in Skylake). Even without that, vectorizing badly can be worse than scalar (more cycles spent shuffling data into/out of vectors than saved by doing the add/sub/mul/div/sqrt operations for 4 Monte-Carlo iterations at once, with 256b vectors). add/sub/mul execution units are fully pipelined and full-width, but div and sqrt on 256b vectors aren't as fast as on 128b vectors (or scalars), so the speedup isn't dramatic for double.
exp() and log() don't have hardware support, so that part would require extracting vector elements back to scalar and calling the library function separately, then shuffling the results back into a vector. libm is typically compiled to only use SSE2, so will use the legacy-SSE encodings of scalar math instructions. If your code uses 256b vectors and calls exp without doing a vzeroupper first, then you stall. After returning, an AVX-128 instruction like vmovsd to set up the next vector element as an arg for exp will also stall. And then exp() will stall again when it runs an SSE instruction. This is exactly what happened in this question, causing a 10x slowdown. (Thanks #ZBoson).
See also Nathan Kurz's experiments with Intel's math lib vs. glibc for this code. Future glibc will come with vectorized implementations of exp() and so on.
If targeting pre-IvB, or esp. Nehalem, try to get gcc to cause partial-register stalls with 16bit or 8bit operations followed by 32bit or 64bit operations. In most cases, gcc will use movzx after an 8 or 16bit operation, but here's a case where gcc modifies ah and then reads ax
With (inline) asm:
With (inline) asm, you could break the uop cache: A 32B chunk of code that doesn't fit in three 6uop cache lines forces a switch from the uop cache to the decoders. An incompetent ALIGN (like NASM's default) using many single-byte nops instead of a couple long nops on a branch target inside the inner loop might do the trick. Or put the alignment padding after the label, instead of before. :P This only matters if the frontend is a bottleneck, which it won't be if we succeeded at pessimizing the rest of the code.
Use self-modifying code to trigger pipeline clears (aka machine-nukes).
LCP stalls from 16bit instructions with immediates too large to fit in 8 bits are unlikely to be useful. The uop cache on SnB and later means you only pay the decode penalty once. On Nehalem (the first i7), it might work for a loop that doesn't fit in the 28 uop loop buffer. gcc will sometimes generate such instructions, even with -mtune=intel and when it could have used a 32bit instruction.
A common idiom for timing is CPUID(to serialize) then RDTSC. Time every iteration separately with a CPUID/RDTSC to make sure the RDTSC isn't reordered with earlier instructions, which will slow things down a lot. (In real life, the smart way to time is to time all the iterations together, instead of timing each separately and adding them up).
Cause lots of cache misses and other memory slowdowns
Use a union { double d; char a[8]; } for some of your variables. Cause a store-forwarding stall by doing a narrow store (or Read-Modify-Write) to just one of the bytes. (That wiki article also covers a lot of other microarchitectural stuff for load/store queues). e.g. flip the sign of a double using XOR 0x80 on just the high byte, instead of a - operator. The diabolically incompetent developer may have heard that FP is slower than integer, and thus try to do as much as possible using integer ops. (A compiler could theoretically still compile this to an xorps with a constant like -, but for x87 the compiler would have to realize that it's negating the value and fchs or replace the next add with a subtract.)
Use volatile if you're compiling with -O3 and not using std::atomic, to force the compiler to actually store/reload all over the place. Global variables (instead of locals) will also force some stores/reloads, but the C++ memory model's weak ordering doesn't require the compiler to spill/reload to memory all the time.
Replace local vars with members of a big struct, so you can control the memory layout.
Use arrays in the struct for padding (and storing random numbers, to justify their existence).
Choose your memory layout so everything goes into a different line in the same "set" in the L1 cache. It's only 8-way associative, i.e. each set has 8 "ways". Cache lines are 64B.
Even better, put things exactly 4096B apart, since loads have a false dependency on stores to different pages but with the same offset within a page. Aggressive out-of-order CPUs use Memory Disambiguation to figure out when loads and stores can be reordered without changing the results, and Intel's implementation has false-positives that prevent loads from starting early. Probably they only check bits below the page offset so it can start before the TLB has translated the high bits from a virtual page to a physical page. As well as Agner's guide, see this answer, and a section near the end of #Krazy Glew's answer on the same question. (Andy Glew was an architect of Intel's PPro - P6 microarchitecture.) (Also related: https://stackoverflow.com/a/53330296 and https://github.com/travisdowns/uarch-bench/wiki/Memory-Disambiguation-on-Skylake)
Use __attribute__((packed)) to let you mis-align variables so they span cache-line or even page boundaries. (So a load of one double needs data from two cache-lines). Misaligned loads have no penalty in any Intel i7 uarch, except when crossing cache lines and page lines. Cache-line splits still take extra cycles. Skylake dramatically reduces the penalty for page split loads, from 100 to 5 cycles. (Section 2.1.3). (And can do two page walks in parallel).
A page-split on an atomic<uint64_t> should be just about the worst case, esp. if it's 5 bytes in one page and 3 bytes in the other page, or anything other than 4:4. Even splits down the middle are more efficient for cache-line splits with 16B vectors on some uarches, IIRC. Put everything in a alignas(4096) struct __attribute((packed)) (to save space, of course), including an array for storage for the RNG results. Achieve the misalignment by using uint8_t or uint16_t for something before the counter.
If you can get the compiler to use indexed addressing modes, that will defeat uop micro-fusion. Maybe by using #defines to replace simple scalar variables with my_data[constant].
If you can introduce an extra level of indirection, so load/store addresses aren't known early, that can pessimize further.
Traverse arrays in non-contiguous order
I think we can come up with incompetent justification for introducing an array in the first place: It lets us separate the random number generation from the random number use. Results of each iteration could also be stored in an array, to be summed later (with more diabolical incompetence).
For "maximum randomness", we could have a thread looping over the random array writing new random numbers into it. The thread consuming the random numbers could generate a random index to load a random number from. (There's some make-work here, but microarchitecturally it helps for load-addresses to be known early so any possible load latency can be resolved before the loaded data is needed.) Having a reader and writer on different cores will cause memory-ordering mis-speculation pipeline clears (as discussed earlier for the false-sharing case).
For maximum pessimization, loop over your array with a stride of 4096 bytes (i.e. 512 doubles). e.g.
for (int i=0 ; i<512; i++)
for (int j=i ; j<UPPER_BOUND ; j+=512)
monte_carlo_step(rng_array[j]);
So the access pattern is 0, 4096, 8192, ...,
8, 4104, 8200, ...
16, 4112, 8208, ...
This is what you'd get for accessing a 2D array like double rng_array[MAX_ROWS][512] in the wrong order (looping over rows, instead of columns within a row in the inner loop, as suggested by #JesperJuhl). If diabolical incompetence can justify a 2D array with dimensions like that, garden variety real-world incompetence easily justifies looping with the wrong access pattern. This happens in real code in real life.
Adjust the loop bounds if necessary to use many different pages instead of reusing the same few pages, if the array isn't that big. Hardware prefetching doesn't work (as well/at all) across pages. The prefetcher can track one forward and one backward stream within each page (which is what happens here), but will only act on it if the memory bandwidth isn't already saturated with non-prefetch.
This will also generate lots of TLB misses, unless the pages get merged into a hugepage (Linux does this opportunistically for anonymous (not file-backed) allocations like malloc/new that use mmap(MAP_ANONYMOUS)).
Instead of an array to store the list of results, you could use a linked list. Every iteration would require a pointer-chasing load (a RAW true dependency hazard for the load-address of the next load). With a bad allocator, you might manage to scatter the list nodes around in memory, defeating cache. With a bad toy allocator, it could put every node at the beginning of its own page. (e.g. allocate with mmap(MAP_ANONYMOUS) directly, without breaking up pages or tracking object sizes to properly support free).
These aren't really microarchitecture-specific, and have little to do with the pipeline (most of these would also be a slowdown on a non-pipelined CPU).
Somewhat off-topic: make the compiler generate worse code / do more work:
Use C++11 std::atomic<int> and std::atomic<double> for the most pessimal code. The MFENCEs and locked instructions are quite slow even without contention from another thread.
-m32 will make slower code, because x87 code will be worse than SSE2 code. The stack-based 32bit calling convention takes more instructions, and passes even FP args on the stack to functions like exp(). atomic<uint64_t>::operator++ on -m32 requires a lock cmpxchg8B loop (i586). (So use that for loop counters! [Evil laugh]).
-march=i386 will also pessimize (thanks #Jesper). FP compares with fcom are slower than 686 fcomi. Pre-586 doesn't provide an atomic 64bit store, (let alone a cmpxchg), so all 64bit atomic ops compile to libgcc function calls (which is probably compiled for i686, rather than actually using a lock). Try it on the Godbolt Compiler Explorer link in the last paragraph.
Use long double / sqrtl / expl for extra precision and extra slowness in ABIs where sizeof(long double) is 10 or 16 (with padding for alignment). (IIRC, 64bit Windows uses 8byte long double equivalent to double. (Anyway, load/store of 10byte (80bit) FP operands is 4 / 7 uops, vs. float or double only taking 1 uop each for fld m64/m32/fst). Forcing x87 with long double defeats auto-vectorization even for gcc -m64 -march=haswell -O3.
If not using atomic<uint64_t> loop counters, use long double for everything, including loop counters.
atomic<double> compiles, but read-modify-write operations like += aren't supported for it (even on 64bit). atomic<long double> has to call a library function just for atomic loads/stores. It's probably really inefficient, because the x86 ISA doesn't naturally support atomic 10byte loads/stores, and the only way I can think of without locking (cmpxchg16b) requires 64bit mode.
At -O0, breaking up a big expression by assigning parts to temporary vars will cause more store/reloads. Without volatile or something, this won't matter with optimization settings that a real build of real code would use.
C aliasing rules allow a char to alias anything, so storing through a char* forces the compiler to store/reload everything before/after the byte-store, even at -O3. (This is a problem for auto-vectorizing code that operates on an array of uint8_t, for example.)
Try uint16_t loop counters, to force truncation to 16bit, probably by using 16bit operand-size (potential stalls) and/or extra movzx instructions (safe). Signed overflow is undefined behaviour, so unless you use -fwrapv or at least -fno-strict-overflow, signed loop counters don't have to be re-sign-extended every iteration, even if used as offsets to 64bit pointers.
Force conversion from integer to float and back again. And/or double<=>float conversions. The instructions have latency > 1, and scalar int->float (cvtsi2ss) is badly designed to not zero the rest of the xmm register. (gcc inserts an extra pxor to break dependencies, for this reason.)
Frequently set your CPU affinity to a different CPU (suggested by #Egwor). diabolical reasoning: You don't want one core to get overheated from running your thread for a long time, do you? Maybe swapping to another core will let that core turbo to a higher clock speed. (In reality: they're so thermally close to each other that this is highly unlikely except in a multi-socket system). Now just get the tuning wrong and do it way too often. Besides the time spent in the OS saving/restoring thread state, the new core has cold L2/L1 caches, uop cache, and branch predictors.
Introducing frequent unnecessary system calls can slow you down no matter what they are. Although some important but simple ones like gettimeofday may be implemented in user-space with, with no transition to kernel mode. (glibc on Linux does this with the kernel's help: the kernel exports code+data in the VDSO).
For more on system call overhead (including cache/TLB misses after returning to user-space, not just the context switch itself), the FlexSC paper has some great perf-counter analysis of the current situation, as well as a proposal for batching system calls from massively multi-threaded server processes.

A few things that you can do to make things perform as bad as possible:
compile the code for the i386 architecture. This will prevent the use of SSE and newer instructions and force the use of the x87 FPU.
use std::atomic variables everywhere. This will make them very expensive due to the compiler being forced to insert memory barriers all over the place. And this is something an incompetent person might plausibly do to "ensure thread safety".
make sure to access memory in the worst possible way for the prefetcher to predict (column major vs row major).
to make your variables extra expensive you could make sure they all have 'dynamic storage duration' (heap allocated) by allocating them with new rather than letting them have 'automatic storage duration' (stack allocated).
make sure that all memory you allocate is very oddly aligned and by all means avoid allocating huge pages, since doing so would be much too TLB efficient.
whatever you do, don't build your code with the compilers optimizer enabled. And make sure to enable the most expressive debug symbols you can (won't make the code run slower, but it'll waste some extra disk space).
Note: This answer basically just summarizes my comments that #Peter Cordes already incorporated into his very good answer. Suggest he get's your upvote if you only have one to spare :)

You can use long double for computation. On x86 it should be the 80-bit format. Only the legacy, x87 FPU has support for this.
Few shortcomings of x87 FPU:
Lack of SIMD, may need more instructions.
Stack based, problematic for super scalar and pipelined architectures.
Separate and quite small set of registers, may need more conversion from other registers and more memory operations.
On the Core i7 there are 3 ports for SSE and only 2 for x87, the processor can execute less parallel instructions.

Late answer but I don't feel we have abused linked lists and the TLB enough.
Use mmap to allocate your nodes, such that your mostly use the MSB of the address. This should result in long TLB lookup chains, a page is 12 bits, leaving 52 bits for the translation, or around 5 levels it must travers each time. With a bit of luck they must go to memory each time for 5 levels lookup plus 1 memory access to get to your node, the top level will most likely be in cache somewhere, so we can hope for 5*memory access. Place the node so that is strides the worst border so that reading the next pointer would cause another 3-4 translation lookups. This might also totally wreck the cache due to the massive amount of translation lookups. Also the size of the virtual tables might cause most of the user data to be paged to disk for extra time.
When reading from the single linked list, make sure to read from the start of the list each time to cause maximum delay in reading a single number.

How to reach AVX computation throughput for a simple loop

Recently I am working on a numerical solver on computational Electrodynamics by Finite difference method.
The solver was very simple to implement, but it is very difficult to reach the theoretical throughput of modern processors, because there is only 1 math operation on the loaded data, for example:
#pragma ivdep
for(int ii=0;ii<Large_Number;ii++)
{ Z[ii] = C1*Z[ii] + C2*D[ii];}
Large_Number is about 1,000,000, but not bigger than 10,000,000
I have tried to manually unroll the loop and write AVX code but failed to make it faster:
int Vec_Size = 8;
int Unroll_Num = 6;
int remainder = Large_Number%(Vec_Size*Unroll_Num);
int iter = Large_Number/(Vec_Size*Unroll_Num);
int addr_incr = Vec_Size*Unroll_Num;
__m256 AVX_Div1, AVX_Div2, AVX_Div3, AVX_Div4, AVX_Div5, AVX_Div6;
__m256 AVX_Z1, AVX_Z2, AVX_Z3, AVX_Z4, AVX_Z5, AVX_Z6;
__m256 AVX_Zb = _mm256_set1_ps(Zb);
__m256 AVX_Za = _mm256_set1_ps(Za);
for(int it=0;it<iter;it++)
{
int addr = addr + addr_incr;
AVX_Div1 = _mm256_loadu_ps(&Div1[addr]);
AVX_Z1 = _mm256_loadu_ps(&Z[addr]);
AVX_Z1 = _mm256_add_ps(_mm256_mul_ps(AVX_Zb,AVX_Div1),_mm256_mul_ps(AVX_Za,AVX_Z1));
_mm256_storeu_ps(&Z[addr],AVX_Z1);
AVX_Div2 = _mm256_loadu_ps(&Div1[addr+8]);
AVX_Z2 = _mm256_loadu_ps(&Z[addr+8]);
AVX_Z2 = _mm256_add_ps(_mm256_mul_ps(AVX_Zb,AVX_Div2),_mm256_mul_ps(AVX_Za,AVX_Z2));
_mm256_storeu_ps(&Z[addr+8],AVX_Z2);
AVX_Div3 = _mm256_loadu_ps(&Div1[addr+16]);
AVX_Z3 = _mm256_loadu_ps(&Z[addr+16]);
AVX_Z3 = _mm256_add_ps(_mm256_mul_ps(AVX_Zb,AVX_Div3),_mm256_mul_ps(AVX_Za,AVX_Z3));
_mm256_storeu_ps(&Z[addr+16],AVX_Z3);
AVX_Div4 = _mm256_loadu_ps(&Div1[addr+24]);
AVX_Z4 = _mm256_loadu_ps(&Z[addr+24]);
AVX_Z4 = _mm256_add_ps(_mm256_mul_ps(AVX_Zb,AVX_Div4),_mm256_mul_ps(AVX_Za,AVX_Z4));
_mm256_storeu_ps(&Z[addr+24],AVX_Z4);
AVX_Div5 = _mm256_loadu_ps(&Div1[addr+32]);
AVX_Z5 = _mm256_loadu_ps(&Z[addr+32]);
AVX_Z5 = _mm256_add_ps(_mm256_mul_ps(AVX_Zb,AVX_Div5),_mm256_mul_ps(AVX_Za,AVX_Z5));
_mm256_storeu_ps(&Z[addr+32],AVX_Z5);
AVX_Div6 = _mm256_loadu_ps(&Div1[addr+40]);
AVX_Z6 = _mm256_loadu_ps(&Z[addr+40]);
AVX_Z6 = _mm256_add_ps(_mm256_mul_ps(AVX_Zb,AVX_Div6),_mm256_mul_ps(AVX_Za,AVX_Z6));
_mm256_storeu_ps(&Z[addr+40],AVX_Z6);
}
The above AVX loop is actually a bit slower than the Inter compiler generated code.
The compiler generated code can reach about 8G flops/s, about 25% of the single thread theoretical throughput of a 3GHz Ivybridge processor. I wonder if it is even possible to reach the throughput for the simple loop like this.
Thank you!

Improving performance for the codes like yours is "well explored" and still popular area. Take a look at dot-product (perfect link provided by Z Boson already) or at some (D)AXPY optimization discussions (https://scicomp.stackexchange.com/questions/1932/are-daxpy-dcopy-dscal-overkills)
In general , key topics to explore and consider applying are:
AVX2 advantage over AVX due to FMA and better load/store ports u-architecture on Haswell
Pre-Fetching. "Streaming stores", "non-temporal stores" for some platforms.
Threading parallelism to reach max sustained bandwidth
Unrolling (already done by you; Intel Compiler is also capable to do that with #pragma unroll (X) ). Not a big difference for "streaming" codes.
Finally deciding what is a set of hardware platforms you want to optimize your code for
Last bullet is particularly important, because for "streaming" and overall memory-bound codes - it's important to know more about target memory-sybsystems; for example, with existing and especially future high-end HPC servers (2nd gen Xeon Phi codenamed Knights Landing as an example) you may have very different "roofline model" balance between bandwidth and compute, and even different techniques than in case of optimizing for average desktop machine.

Are you sure that 8 GFLOPS/s is about 25% of the maximum throughput of a 3 GHz Ivybridge processor? Let's do the calculations.
Every 8 elements require two single-precision AVX multiplications and one AVX addition. An Ivybridge processor can only execute one 8-wide AVX addition and one 8-wide AVX multiplication per cycle. Also since the addition is dependent on the two multiplications, then 3 cycles are required to process 8 elements. Since the addition can be overlapped with the next multiplication, we can reduce this to 2 cycles per 8 elements. For one billion elements, 2*10^9/8 = 10^9/4 cycles are required. Considering 3 GHz clock, we get 10^9/4 * 10^-9/3 = 1/12 = 0.08 seconds. So the maximum theoretical throughput is 12 GLOPS/s and the compiler-generated code is reaching 66%, which is fine.
One more thing, by unrolling the loop 8 times, it can be vectorized efficiently. I doubt that you'll gain any significant speed up if you unroll this particular loop more than that, especially more than 16 times.

I think the real bottleneck is that there are 2 load and 1 store instructions for every 2 multiplication and 1 addition. Maybe the calculation is memory bandwidth limited. Every element requires transfer 12 bytes of data, and if 2G elements are processed every second (which is 6G flops) that is 24GB/s data transfer, reaching the theoretical bandwidth of ivy bridge. I wonder if this argument holds and there is indeed no solution to this problem.
The reason why I am answering to my own question is to hope someone can correct me before I easily give up the optimization. This simple loop is EXTREMELY important for many scientific solvers, it is the backbone of finite element and finite difference method. If one cannot even feed one processor because the computation is memory bandwith limited, why bother with multicore? A high bandwith GPU or Xeon Phi should be better solutions.

Is integer multiplication really done at the same speed as addition on a modern CPU?

I hear this statement quite often, that multiplication on modern hardware is so optimized that it actually is at the same speed as addition. Is that true?
I never can get any authoritative confirmation. My own research only adds questions. The speed tests usually show data that confuses me. Here is an example:
#include <stdio.h>
#include <sys/time.h>
unsigned int time1000() {
timeval val;
gettimeofday(&val, 0);
val.tv_sec &= 0xffff;
return val.tv_sec * 1000 + val.tv_usec / 1000;
}
int main() {
unsigned int sum = 1, T = time1000();
for (int i = 1; i < 100000000; i++) {
sum += i + (i+1); sum++;
}
printf("%u %u\n", time1000() - T, sum);
sum = 1;
T = time1000();
for (int i = 1; i < 100000000; i++) {
sum += i * (i+1); sum++;
}
printf("%u %u\n", time1000() - T, sum);
}
The code above can show that multiplication is faster:
clang++ benchmark.cpp -o benchmark
./benchmark
746 1974919423
708 3830355456
But with other compilers, other compiler arguments, differently written inner loops, the results can vary and I cannot even get an approximation.

Multiplication of two n-bit numbers can in fact be done in O(log n) circuit depth, just like addition.
Addition in O(log n) is done by splitting the number in half and (recursively) adding the two parts in parallel, where the upper half is solved for both the "0-carry" and "1-carry" case. Once the lower half is added, the carry is examined, and its value is used to choose between the 0-carry and 1-carry case.
Multiplication in O(log n) depth is also done through parallelization, where every sum of 3 numbers is reduced to a sum of just 2 numbers in parallel, and the sums are done in some manner like the above.
I won't explain it here, but you can find reading material on fast addition and multiplication by looking up "carry-lookahead" and "carry-save" addition.
So from a theoretical standpoint, since circuits are obviously inherently parallel (unlike software), the only reason multiplication would be asymptotically slower is the constant factor in the front, not the asymptotic complexity.

Integer multiplication will be slower.
Agner Fog's instruction tables show that when using 32-bit integer registers, Haswell's ADD/SUB take 0.25–1 cycles (depending on how well pipelined your instructions are) while MUL takes 2–4 cycles. Floating-point is the other way around: ADDSS/SUBSS take 1–3 cycles while MULSS takes 0.5–5 cycles.

This is an even more complex answer than simply multiplication versus addition. In reality the answer will most likely NEVER be yes. Multiplication, electronically, is a much more complicated circuit. Most of the reasons why, is that multiplication is the act of a multiplication step followed by an addition step, remember what it was like to multiply decimal numbers prior to using a calculator.
The other thing to remember is that multiplication will take longer or shorter depending on the architecture of the processor you are running it on. This may or may not be simply company specific. While an AMD will most likely be different than an Intel, even an Intel i7 may be different from a core 2 (within the same generation), and certainly different between generations (especially the farther back you go).
In all TECHNICALITY, if multiplies were the only thing you were doing (without looping, counting etc...), multiplies would be 2 to (as ive seen on PPC architectures) 35 times slower. This is more an exercise in understanding your architecture, and electronics.
In Addition:
It should be noted that a processor COULD be built for which ALL operations including a multiply take a single clock. What this processor would have to do is, get rid of all pipelining, and slow the clock so that the HW latency of any OPs circuit is less than or equal to the latency PROVIDED by the clock timing.
To do this would get rid of the inherent performance gains we are able to get when adding pipelining into a processor. Pipelining is the idea of taking a task and breaking it down into smaller sub-tasks that can be performed much quicker. By storing and forwarding the results of each sub-task between sub-tasks, we can now run a faster clock rate that only needs to allow for the longest latency of the sub-tasks, and not from the overarching task as a whole.
Picture of time through a multiply:
|--------------------------------------------------| Non-Pipelined
|--Step 1--|--Step 2--|--Step 3--|--Step 4--|--Step 5--| Pipelined
In the above diagram, the non-pipelined circuit takes 50 units of time. In the pipelined version, we have split the 50 units into 5 steps each taking 10 units of time, with a store step in between. It is EXTREMELY important to note that in the pipelined example, each of the steps can be working completely on their own and in parallel. For an operation to be completed, it must move through all 5 steps in order but another of the same operation with operands can be in step 2 as one is in step 1, 3, 4, and 5.
With all of this being said, this pipelined approach allows us to continuously fill the operator each clock cycle, and get a result out on each clock cycle IF we are able to order our operations such that we can perform all of one operation before we switch to another operation, and all we take as a timing hit is the original amount of clocks necessary to get the FIRST operation out of the pipeline.
Mystical brings up another good point. It is also important to look at the architecture from a more systems perspective. It is true that the newer Haswell architectures was built to better the Floating Point multiply performance within the processor. For this reason as the System level, it was architected to allow multiple multiplies to occur in simultaneity versus an add which can only happen once per system clock.
All of this can be summed up as follows:
Each architecture is different from a lower level HW perspective as well as from a system perspective
FUNCTIONALLY, a multiply will always take more time than an add because it combines a true multiply along with a true addition step.
Understand the architecture you are trying to run your code on, and find the right balance between readability and getting truly the best performance from that architecture.

Intel since Haswell has
add performance of 4/clock throughput, 1 cycle latency. (Any operand-size)
imul performance of 1/clock throughput, 3 cycle latency. (Any operand-size)
Ryzen is similar. Bulldozer-family has much lower integer throughput and not-fully-pipelined multiply, including extra slow for 64-bit operand-size multiply. See https://agner.org/optimize/ and other links in https://stackoverflow.com/tags/x86/info
But a good compiler could auto-vectorize your loops. (SIMD-integer multiply throughput and latency are both worse than SIMD-integer add). Or simply constant-propagate through them to just print out the answer! Clang really does know the closed-form Gauss's formula for sum(i=0..n) and can recognize some loops that do that.
You forgot to enable optimization so both loops bottleneck on the ALU + store/reload latency of keeping sum in memory between each of sum += independent stuff and sum++. See Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? for more about just how bad the resulting asm is, and why that's the case. clang++ defaults to -O0 (debug mode: keep variables in memory where a debugger can modify them between any C++ statements).
Store-forwarding latency on a modern x86 like Sandybridge-family (including Haswell and Skylake) is about 3 to 5 cycles, depending on timing of the reload. So with a 1-cycle latency ALU add in there, too, you're looking at about two 6-cycle latency steps in the critical path for this loop. (Plenty to hide all the store / reload and calculation based on i, and the loop-counter update).
See also Adding a redundant assignment speeds up code when compiled without optimization for another no-optimization benchmark. In that one, store-forwarding latency is actually reduced by having more independent work in the loop, delaying the reload attempt.
Modern x86 CPUs have 1/clock multiply throughput so even with optimization you wouldn't see a throughput bottleneck from it. Or on Bulldozer-family, not fully pipelined with 1 per 2-clock throughput.
More likely you'd bottleneck on the front-end work of getting all the work issued every cycle.
Although lea does allow very efficient copy-and-add, and doing i + i + 1 with a single instruction. Although really a good compiler would see that the loop only uses 2*i and optimize to increment by 2. i.e. a strength-reduction to do repeated addition by 2 instead of having to shift inside the loop.
And of course with optimization the extra sum++ can just fold into the sum += stuff where stuff already includes a constant. Not so with the multiply.

I came to this thread to get an idea of what the modern processors are doing in regard to integer math and the number of cycles required to do them. I worked on this problem of speeding up 32-bit integer multiplies and divides on the 65c816 processor in the 1990's. Using the method below, I was able to triple the speed of the standard math libraries available in the ORCA/M compilers at the time.
So the idea that multiplies are faster than adds is simply not the case (except rarely) but like people said it depends upon how the architecture is implemented. If there are enough steps being performed available between clock cycles, yes a multiply could effectively be the same speed as an add based on the clock, but there would be a lot of wasted time. In that case it would be nice to have an instruction that performs multiple (dependent) adds / subtracts given one instruction and multiple values. One can dream.
On the 65c816 processor, there were no multiply or divide instructions. Mult and Div were done with shifts and adds.
To perform a 16 bit add, you would do the following:
LDA $0000 - loaded a value into the Accumulator (5 cycles)
ADC $0002 - add with carry (5 cycles)
STA $0004 - store the value in the Accumulator back to memory (5 cycles)
15 cycles total for an add
If dealing with a call like from C, you would have additional overhead of dealing with pushing and pulling values off the stack. Creating routines that would do two multiples at once would save overhead for example.
The traditional way of doing the multiply is shifts and adds through the entire value of the one number. Each time the carry became a one as it is shifted left would mean you needed to add the value again. This required a test of each bit and a shift of the result.
I replaced that with a lookup table of 256 items so as the carry bits would not need to be checked. It was also possible to determine overflow before doing the multiply to not waste time. (On a modern processor this could be done in parallel but I don't know if they do this in the hardware). Given two 32 bit numbers and prescreened overflow, one of the multipliers is always 16 bits or less, thus one would only need to run through 8 bit multiplies once or twice to perform the entire 32 bit multiply. The result of this was multiplies that were 3 times as fast.
the speed of the 16 bit multiplies ranged from 12 cycles to about 37 cycles
multiply by 2 (0000 0010)
LDA $0000 - loaded a value into the Accumulator (5 cycles).
ASL - shift left (2 cycles).
STA $0004 - store the value in the Accumulator back to memory (5 cycles).
12 cycles plus call overhead.
multiply by (0101 1010)
LDA $0000 - loaded a value into the Accumulator (5 cycles)
ASL - shift left (2 cycles)
ASL - shift left (2 cycles)
ADC $0000 - add with carry for next bit (5 cycles)
ASL - shift left (2 cycles)
ADC $0000 - add with carry for next bit (5 cycles)
ASL - shift left (2 cycles)
ASL - shift left (2 cycles)
ADC $0000 - add with carry for next bit (5 cycles)
ASL - shift left (2 cycles)
STA $0004 - store the value in the Accumulator back to memory (5 cycles)
37 cycles plus call overhead
Since the databus of the AppleIIgs for which this was written was only 8 bits wide, to load 16 bit values required 5 cycles to load from memory, one extra for the pointer, and one extra cycle for the second byte.
LDA instruction (1 cycle because it is an 8 bit value)
$0000 (16 bit value requires two cycles to load)
memory location (requires two cycles to load because of an 8 bit data bus)
Modern processors would be able to do this faster because they have a 32 bit data bus at worst. In the processor logic itself the system of gates would have no additional delay at all compared to the data bus delay since the whole value would get loaded at once.
To do the complete 32 bit multiply, you would need to do the above twice and add the results together to get the final answer. The modern processors should be able to do the two in parallel and add the results for the answer. Combined with the overflow precheck done in parallel, it would minimize the time required to do the multiply.
Anyway it is readily apparent that multiplies require significantly more effort than an add. How many steps to process the operation between cpu clock cycles would determine how many cycles of the clock would be required. If the clock is slow enough, then the adds would appear to be the same speed as a multiply.
Regards,
Ken

A multiplication requires a final step of an addition of, at minimum, the same size of the number; so it will take longer than an addition. In decimal:
123
112
----
+246 ----
123 | matrix generation
123 ----
-----
13776 <---------------- Addition
Same applies in binary, with a more elaborate reduction of the matrix.
That said, reasons why they may take the same amount of time:
To simplify the pipelined architecture, all regular instructions can be designed to take the same amount of cycles (exceptions are memory moves for instance, that depend on how long it takes to talk to external memory).
Since the adder for the final step of the multiplier is just like the adder for an add instruction... why not use the same adder by skipping the matrix generation and reduction? If they use the same adder, then obviously they will take the same amount of time.
Of course, there are more complex architectures where this is not the case, and you might obtain completely different values. You also have architectures that take several instructions in parallel when they don't depend on each other, and then you are a bit at the mercy of your compiler... and of the operating system.
The only way to run this test rigorously you would have to run in assembly and without an operating system - otherwise there are too many variables.

Even if it were, that mostly tells us what restriction the clock puts on our hardware. We can't clock higher because heat(?), but the number of ADD instruction gates a signal could pass during a clock could be very many but a single ADD instruction would only utilize one of them. So while it may at some point take equally many clock cycles, not all of the propagation time for the signals is utilized.
If we could clock higher we could def. make ADD faster probably by several orders of magnitude.

This really depends on your machine. Of course, integer multiplication is quite complex compared to addition, but quite a few AMD CPU can execute a multiplication in a single cycle. That is just as fast as addition.
Other CPUs take three or four cycles to do a multiplication, which is a bit slower than addition. But it's nowhere near the performance penalty you had to suffer ten years ago (back then a 32-Bit multiplication could take thirty-something cycles on some CPUs).
So, yes, multiplication is in the same speed class nowadays, but no, it's still not exactly as fast as addition on all CPUs.

Even on ARM (known for its high efficiency and small, clean design), integer multiplications take 3-7 cycles and than integer additions take 1 cycle.
However, an add/shift trick is often used to multiply integers by constants faster than the multiply instruction can calculate the answer.
The reason this works well on ARM is that ARM has a "barrel shifter", which allows many instructions to shift or rotate one of their arguments by 1-31 bits at zero cost, i.e. x = a + b and x = a + (b << s) take exactly the same amount of time.
Utilizing this processor feature, let's say you want to calculate a * 15. Then since 15 = 1111 (base 2), the following pseudocode (translated into ARM assembly) would implement the multiplication:
a_times_3 = a + (a << 1) // a * (0011 (base 2))
a_times_15 = a_times_3 + (a_times_3 << 2) // a * (0011 (base 2) + 1100 (base 2))
Similarly you could multiply by 13 = 1101 (base 2) using either of the following:
a_times_5 = a + (a << 2)
a_times_13 = a_times_5 + (a << 3)
a_times_3 = a + (a << 1)
a_times_15 = a_times_3 + (a_times_3 << 2)
a_times_13 = a_times_15 - (a << 1)
The first snippet is obviously faster in this case, but sometimes subtraction helps when translating a constant multiplication into add/shift combinations.
This multiplication trick was used heavily in the ARM assembly coding community in the late 80s, on the Acorn Archimedes and Acorn RISC PC (the origin of the ARM processor). Back then, a lot of ARM assembly was written by hand, since squeezing every last cycle out of the processor was important. Coders in the ARM demoscene developed many techniques like this for speeding up code, most of which are probably lost to history now that almost no assembly code is written by hand anymore. Compilers probably incorporate many tricks like this, but I'm sure there are many more that never made the transition from "black art optimization" to compiler implementation.
You can of course write explicit add/shift multiplication code like this in any compiled language, and the code may or may not run faster than a straight multiplication once compiled.
x86_64 may also benefit from this multiplication trick for small constants, although I don't believe shifting is zero-cost on the x86_64 ISA, in either the Intel or AMD implementations (x86_64 probably takes one extra cycle for each integer shift or rotate).

There are lots of good answers here about your main question, but I just wanted to point out that your code is not a good way to measure operation performance.
For starters, modern cpus adjust freqyuencies all the time, so you should use rdtsc to count the actual number of cycles instead of elapsed microseconds.
But more importantly, your code has artificial dependency chains, unnecessary control logic and iterators that will make your measure into an odd mix of latency and throughtput plus some constant terms added for no reason.
To really measure throughtput you should significantly unroll the loop and also add several partial sums in parallel (more sums than steps in the add/mul cpu pipelines).

No it's not, and in fact it's noticeably slower (which translated into a 15% performance hit for the particular real-world program I was running).
I realized this myself when asking this question from just a few days ago here.

Since the other answers deal with real, present-day devices -- which are bound to change and improve as time passes -- I thought we could look at the question from the theoretical side.
Proposition: When implemented in logic gates, using the usual algorithms, an integer multiplication circuit is O(log N) times slower than an addition circuit, where N is the number of bits in a word.
Proof: The time for a combinatorial circuit to stabilise is proportional to the depth of the longest sequence of logic gates from any input to any output. So we must show that a gradeschool multiply circuit is O(log N) times deeper than an addition circuit.
Addition is normally implemented as a half adder followed by N-1 full adders, with the carry bits chained from one adder to the next. This circuit clearly has depth O(N). (This circuit can be optimized in many ways, but the worst case performance will always be O(N) unless absurdly large lookup tables are used.)
To multiply A by B, we first need to multiply each bit of A with each bit of B. Each bitwise multiply is simply an AND gate. There are N^2 bitwise multiplications to perform, hence N^2 AND gates -- but all of them can execute in parallel, for a circuit depth of 1. This solves the multiplication phase of the gradeschool algorithm, leaving just the addition phase.
In the addition phase, we can combine the partial products using an inverted binary tree-shaped circuit to do many of the additions in parallel. The tree will be (log N) nodes deep, and at each node, we will be adding together two numbers with O(N) bits. This means each node can be implemented with an adder of depth O(N), giving a total circuit depth of O(N log N). QED.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js