A given benchmark consists of 35% loads, 10% stores, 16% branches, 27% integer ALU operations, 8% FP +/-, 3% FP * and 1% FP /. We want to compare the benchmark as run on two processors. CPI of P1 = 5.05 and CPI of P2 = 3.58.
You are considering two possible enhancements for the Processor 1. One enhancement is a better memory organization, which would improve the average CPI for FP/ instructions from 30 to 2. The other enhancement is a new multiply-and-add instruction that would reduce the number of ALU instructions by 20% while still maintaining the average CPI of 4 for the remaining ALU instructions. Unfortunately, there is room on the processor chip for only one of these two enhancements, so you must choose the enhancement that provides better overall performance. Which one would you choose, and why?
So for this part CPI (FP/) = 5.05 - 0.01(30 - 2) = 4.77
But, I am not able to find the new CPI for ALU.
Is it -> CPI (ALU) = 5.05 - 0.20 (4 - 4) = 5.05? I am probably wrong about this.
Caveat: This may only be a partial answer because I'm not sure what you mean by "CPI". This could be "cost per instruction", but, I'm guessing it could be "cycles per instruction". And, we may need more information for a more full/complete answer.
Per 100 instructions, the original cost for FP/ is 1 * 30 --> 30 cycles. With the enhancement it is 1 * 2 --> 2 cycles. So, the improvement is 30 - 2 --> 28 cycles.
Per 100 instructions, the original cost for ALU is 27 * 4 --> 108 cycles. With a 20% reduction in the number of ALU instructions executed, this becomes 0.8 * 27 * 4 --> 86.4 cycles. The improvement is 108 - 86.4 --> 21.6 cycles.
So [I think] that answers your basic question.
And, I would choose the FP/ improvement, since it saves more cycles (28 vs. 21.6 per 100 original instructions).
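A minimal sketch of the comparison, assuming CPI means cycles per instruction and that both options run at the same clock rate. It counts total cycles per 100 original instructions rather than comparing CPI values directly, because the multiply-and-add option also changes the instruction count. The struct and function names are illustrative only.

```cpp
// Cycles per 100 original instructions for the baseline and both options.
// Assumption: CPI means "cycles per instruction"; all names are illustrative.
struct Option {
    double cycles;        // total cycles per 100 original instructions
    double instructions;  // instructions executed per 100 original ones
};

constexpr double kBaseCycles = 5.05 * 100.0;  // baseline CPI of P1 times 100

// FP divide CPI drops from 30 to 2; FP divides are 1% of the mix.
inline Option fp_divide_option() {
    return {kBaseCycles - 1.0 * (30.0 - 2.0), 100.0};
}

// 20% of the ALU instructions (27% of the mix, CPI 4) disappear.
inline Option multiply_add_option() {
    const double removed = 0.20 * 27.0;  // 5.4 instructions per 100
    return {kBaseCycles - removed * 4.0, 100.0 - removed};
}

// Speedup over the baseline at the same clock rate: ratio of total cycles.
inline double speedup(const Option& o) { return kBaseCycles / o.cycles; }
```

Under these assumptions the FP/ option needs about 477 cycles against about 483.4 for the multiply-and-add option. The multiply-and-add option's own CPI (483.4 / 94.6, roughly 5.11) is not lower than the baseline because fewer instructions are executed, which is why total cycles are the safer quantity to compare.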
But, I'd be careful with this. And, the following could be wrong, overthinking the problem, but I offer it anyway.
The FP improvement just speeds up the instruction. But, the number of cycles for the FP/ is reduced and these cycles can be used for other things.
The ALU improvement frees up some cycles that can be used for other things.
In both cases, we don't know what the additional instructions might be. That is, we're changing the percentages of everything after the proposed improvement. We have to assume that the new "windfall" instructions will follow the stated original percentages. But, we may have to calculate the post-improvement, adjusted percentages
We could recalculate things [by solving for unknowns] from:
505 == 35*loads + 10*stores + 16*branches + 27*ALU + 8*FPadd + 3*FPmul + 1*FPdiv
... if we knew the CPI for the other instructions (e.g. the CPI for a load, etc.). But, this is missing information.
AFAIK the sqrt operation is expensive in most situations.
Is the epsilon test below, which checks whether the vector already has a length of 1, worth it? Does it save much if normalize is called often on vectors that are already normalized? If not, is it too expensive?
double Vec3d::normalize() {
    double mod = x * x + y * y + z * z;
    if (mod == 0) {
        return(0);
    }
    if (consideredEqual(mod, 1.0, .0000001)) { // is this test worth it ???
        return(1.0);
    }
    mod = std::sqrt(mod);
    x /= mod;
    y /= mod;
    z /= mod;
    return mod;
}
For recent Pentium microarchitectures, a sqrt has a latency of 10-22 cycles (compared to 3 cycles for an FP add, 5 cycles for an FP multiply and 2-4 cycles for an FP-int type conversion). The cost is significantly higher, especially as sqrt is not pipelined and it is only possible to start a new operation every 5 cycles.
But adding a test may not be a good idea, as the test also has a cost that must be considered. In modern processors with deep pipelines, instructions are fetched in advance to fill the pipeline, and a branch may require discarding all these fetched instructions. To limit this nasty effect, processors try to "predict" the behavior of tests: Are branches taken or not, and what is the target address? Prediction is based on the regularity of the program behavior. Present predictors are very good, and for many problems a branch does not have a significant cost if properly predicted.
But predictions can fail, and a mispredict costs 15-20 cycles, which is very high.
Now try to evaluate roughly what would be the gain of the modification that you propose. We can consider several scenarios.
90% of the time value is != 1.0 and 10% of the time it is equal to 1.0. Based on this behavior, branch predictors will bet that you do not take the branch (value!=1.0).
So 90% of the time you have a normal sqrt to compute (and the test cost is negligible) and 10% of the time, you have a mispredict. You avoid the 10-20 cycles sqrt, but you pay 15 cycles branch penalty. The gain is null.
90% of the time value is = 1.0 and 10% of the time it is different. Branch predictors will assume that you take the branch.
When value is 1.0, you have a clear win and the cost is almost null. 10% of the time you will pay a branch mispredict and a sqrt. Compared to 100% sqrt, on the average, there is a win.
50% of values are 1.0 and 50% are different. This is something of a disaster scenario. Branch predictors will have great difficulty finding a clear pattern in the branch's behavior and may fail a significant fraction of the time, say 40% to 100% if you are very unlucky. You will add many branch mispredicts to your computational cost and you may have a negative gain!
These estimations are very rough and would require a finer computation with a model of your data, but probably except when a large part of your data is 1.0, you will have at best no gain, and you may even have a slowdown.
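The three scenarios can be put into a rough cost model. This is only a sketch under stated assumptions: the cycle counts (15 for sqrt, 15 for a mispredict) are picked from the ranges quoted above, a correctly predicted branch is treated as free, and the struct and function names are made up for illustration.

```cpp
// Rough expected cycles per normalize() call, with and without the early-out test.
// Assumptions (illustrative only): sqrt costs 15 cycles, a mispredict costs
// 15 cycles, and a correctly predicted branch is free.
struct CostModel {
    double sqrt_cycles = 15.0;
    double mispredict_cycles = 15.0;
};

// p_unit: fraction of calls whose input already has length 1 (sqrt skipped).
// p_miss: fraction of calls on which the branch is mispredicted.
inline double cost_with_test(const CostModel& m, double p_unit, double p_miss) {
    return (1.0 - p_unit) * m.sqrt_cycles + p_miss * m.mispredict_cycles;
}

// Without the test every call pays for the sqrt.
inline double cost_without_test(const CostModel& m) { return m.sqrt_cycles; }
```

With p_unit = 0.1 and p_miss = 0.1 the two costs are about equal, with p_unit = 0.9 and p_miss = 0.1 the test wins clearly, and with p_unit = 0.5 the outcome depends entirely on how often the predictor fails.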
You can find measures of the cost of operations in the site of Agner Fog https://www.agner.org/optimize
Well answers to the question below indicate that it is NOT worth it for a general use with standard C libraries and compilers and current processors with fpus. But that it might be marginally worth it in known limited situations or on processors without float support.
c++ practical computational complexity of <cmath> SQRT()
I'm using linux perf tools to profile one of CRONO benchmarks, I'm specifically interested in L1 DCache Misses, so I run the program like this:
perf record -e L1-dcache-read-misses -o perf/apsp.cycles apps/apsp/apsp 4 16384 16
It runs fine but generates those warnings:
WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted,
check /proc/sys/kernel/kptr_restrict.
Samples in kernel functions may not be resolved if a suitable vmlinux
file is not found in the buildid cache or in the vmlinux path.
Samples in kernel modules won't be resolved at all.
If some relocation was applied (e.g. kexec) symbols may be misresolved
even with a suitable vmlinux or kallsyms file.
Cannot read kernel map
Couldn't record kernel reference relocation symbol
Symbol resolution may be skewed if relocation was used (e.g. kexec).
Check /proc/kallsyms permission or run as root.
Threads Returned!
Threads Joined!
Time: 2.932636 seconds
[ perf record: Woken up 5 times to write data ]
[ perf record: Captured and wrote 1.709 MB perf/apsp.cycles (44765 samples) ]
I then annotate the output file like this:
perf annotate --stdio -i perf/apsp.cycles --dsos=apsp
But in one of the code sections, I see some weird results:
Percent | Source code & Disassembly of apsp for L1-dcache-read-misses
---------------------------------------------------------------------------
: {
: if((D[W_index[v][i]] > (D[v] + W[v][i])))
19.36 : 401140: movslq (%r10,%rcx,4),%rsi
14.50 : 401144: lea (%rax,%rsi,4),%rdi
1.22 : 401148: mov (%r9,%rcx,4),%esi
5.82 : 40114c: add (%rax,%r8,4),%esi
20.02 : 401150: cmp %esi,(%rdi)
0.00 : 401152: jle 401156 <do_work(void*)+0x226>
: D[W_index[v][i]] = D[v] + W[v][i];
9.72 : 401154: mov %esi,(%rdi)
19.93 : 401156: add $0x1,%rcx
:
Now in those results, how come some arithmetic instructions have L1 read misses? Also, how come the instructions of the second statement cause so many cache misses, even though the data should have been brought into cache by the previous if statement?
Am I doing something wrong here? I tried the same on a different machine with root access, it gave me similar results, so I think the warnings I mentioned above are not causing this. But what exactly is going on?
So we have this code:
for(v=0;v<N;v++)
{
    for(int i = 0; i < DEG; i++)
    {
        if((/* (V2) 1000000000 * */ D[W_index[v][i]] > (D[v] + W[v][i])))
            D[W_index[v][i]] = D[v] + W[v][i];
        Q[v]=0; //Current vertex checked
    }
}
Note that I added (V2) as a comment in the code. We come back to it below.
First approximation
Remember that W_index is initialized as W_index[i][j] = i + j (A).
Let's focus on one inner iteration, and first let's assume that DEG is large. Further we assume that the cache is large enough to hold all data for at least two iterations.
D[W_index[v][i]]
The lookup W_index[v] is loaded into a register. For W_index[v][i] we assume one cache miss (64-byte cache line, 4 bytes per int, and we call the program with DIM=16). The lookup in D always starts at v, so most of the required part of the array is already in cache. With the assumption that DEG is large, this lookup is free.
D[v] + W[v][i]
The lookup D[v] is for free as it depends on v. The second lookup is the same as above, one cache miss for the second dimension.
The whole inner statement has no influence.
Q[v]=0;
As this is v, this can be ignored.
When we sum up, we get two cache misses.
Second approximation
Now, we revisit the assumption that DEG is large. In fact it is wrong, because DEG = 16. So there are fractions of cache misses we also need to consider.
D[W_index[v][i]]
The lookup W_index[v] costs 1/8 of a cache miss (it has a size of 8 bytes and a cache line is 64 bytes, so we get a cache miss every eighth iteration).
The same is true for D[W_index[v][i]], except that D holds integers. On average all but one integer are in cache, so this costs 1/16 of a cache miss.
D[v] + W[v][i]
D[v] is already in cache (this is W_index[v][0]). But we get another 1/8 of a cache miss for W[v] for the same reasoning as above.
Q[v]=0;
This is another 1/16 of a cache miss.
And surprise: if we now use the code (V2), where the if-clause never evaluates to true, I get 2.395 cache misses per iteration (note that you really need to configure your CPU well, i.e., no hyperthreading, no turboboost, performance governor if possible). The calculation above would lead to 2.375. So we are pretty close.
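For reference, the 2.375 figure is just the sum of the terms from the two approximations. A small sketch that adds them up (the function and variable names are illustrative):

```cpp
// Expected cache misses per inner-loop iteration, term by term,
// using the fractions from the first and second approximation.
inline double expected_misses_per_iteration() {
    const double w_index_element = 1.0;         // W_index[v][i], first approximation
    const double w_element       = 1.0;         // W[v][i], first approximation
    const double w_index_row     = 1.0 / 8.0;   // pointer W_index[v]
    const double d_indexed       = 1.0 / 16.0;  // D[W_index[v][i]]
    const double w_row           = 1.0 / 8.0;   // pointer W[v]
    const double q_element       = 1.0 / 16.0;  // Q[v]
    return w_index_element + w_element + w_index_row + d_indexed + w_row + q_element;
}
```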
Third approximation
Now there is this unfortunate if clause. How often does this comparison evaluate to true? We can't say; in the beginning it will be quite often, and in the end it will never evaluate to true.
So let's focus on the very first execution of the complete loop. In this case, D[v] is infinity and W[v][i] is a number between 1 and 101. So the condition evaluates to true in each iteration.
And then it gets hard: we get 2.9 cache misses in this iteration. Where are they coming from? All data should already be in cache.
But: This is the "mystery of compilers". You never know what they produce in the end. I compiled with GCC and Clang and got the same measurements. I activated -funroll-loops, and suddenly I got 2.5 cache misses. Of course this may be different on your system. When I inspected the assembly, I observed that it is really exactly the same, just the loop has been unrolled four times.
So what does this tell us? You never know what your compiler does unless you check it. And even then, you can't be sure.
I guess hardware prefetching or execution order could have an influence here. But this is a mystery.
Regarding perf and your problems with it
I think the measurements you did have two problems:
They are relative, the exact line is not that accurate.
You are multithreaded, this may be harder to track.
My experience is that when you want to get good measurements for a specific part of your code, you really need to check it manually. Sometimes, though not always, this can explain things pretty well.
Recently I am working on a numerical solver on computational Electrodynamics by Finite difference method.
The solver was very simple to implement, but it is very difficult to reach the theoretical throughput of modern processors, because there is only 1 math operation on the loaded data, for example:
#pragma ivdep
for(int ii=0;ii<Large_Number;ii++)
{ Z[ii] = C1*Z[ii] + C2*D[ii];}
Large_Number is about 1,000,000, but not bigger than 10,000,000
I have tried to manually unroll the loop and write AVX code but failed to make it faster:
int Vec_Size = 8;
int Unroll_Num = 6;
int remainder = Large_Number%(Vec_Size*Unroll_Num);
int iter = Large_Number/(Vec_Size*Unroll_Num);
int addr_incr = Vec_Size*Unroll_Num;
__m256 AVX_Div1, AVX_Div2, AVX_Div3, AVX_Div4, AVX_Div5, AVX_Div6;
__m256 AVX_Z1, AVX_Z2, AVX_Z3, AVX_Z4, AVX_Z5, AVX_Z6;
__m256 AVX_Zb = _mm256_set1_ps(Zb);
__m256 AVX_Za = _mm256_set1_ps(Za);
for(int it=0;it<iter;it++)
{
int addr = it * addr_incr; // start index of this unrolled block
AVX_Div1 = _mm256_loadu_ps(&Div1[addr]);
AVX_Z1 = _mm256_loadu_ps(&Z[addr]);
AVX_Z1 = _mm256_add_ps(_mm256_mul_ps(AVX_Zb,AVX_Div1),_mm256_mul_ps(AVX_Za,AVX_Z1));
_mm256_storeu_ps(&Z[addr],AVX_Z1);
AVX_Div2 = _mm256_loadu_ps(&Div1[addr+8]);
AVX_Z2 = _mm256_loadu_ps(&Z[addr+8]);
AVX_Z2 = _mm256_add_ps(_mm256_mul_ps(AVX_Zb,AVX_Div2),_mm256_mul_ps(AVX_Za,AVX_Z2));
_mm256_storeu_ps(&Z[addr+8],AVX_Z2);
AVX_Div3 = _mm256_loadu_ps(&Div1[addr+16]);
AVX_Z3 = _mm256_loadu_ps(&Z[addr+16]);
AVX_Z3 = _mm256_add_ps(_mm256_mul_ps(AVX_Zb,AVX_Div3),_mm256_mul_ps(AVX_Za,AVX_Z3));
_mm256_storeu_ps(&Z[addr+16],AVX_Z3);
AVX_Div4 = _mm256_loadu_ps(&Div1[addr+24]);
AVX_Z4 = _mm256_loadu_ps(&Z[addr+24]);
AVX_Z4 = _mm256_add_ps(_mm256_mul_ps(AVX_Zb,AVX_Div4),_mm256_mul_ps(AVX_Za,AVX_Z4));
_mm256_storeu_ps(&Z[addr+24],AVX_Z4);
AVX_Div5 = _mm256_loadu_ps(&Div1[addr+32]);
AVX_Z5 = _mm256_loadu_ps(&Z[addr+32]);
AVX_Z5 = _mm256_add_ps(_mm256_mul_ps(AVX_Zb,AVX_Div5),_mm256_mul_ps(AVX_Za,AVX_Z5));
_mm256_storeu_ps(&Z[addr+32],AVX_Z5);
AVX_Div6 = _mm256_loadu_ps(&Div1[addr+40]);
AVX_Z6 = _mm256_loadu_ps(&Z[addr+40]);
AVX_Z6 = _mm256_add_ps(_mm256_mul_ps(AVX_Zb,AVX_Div6),_mm256_mul_ps(AVX_Za,AVX_Z6));
_mm256_storeu_ps(&Z[addr+40],AVX_Z6);
}
The above AVX loop is actually a bit slower than the Intel compiler generated code.
The compiler generated code can reach about 8 GFLOP/s, about 25% of the single thread theoretical throughput of a 3GHz Ivybridge processor. I wonder if it is even possible to reach the theoretical throughput for a simple loop like this.
Thank you!
Improving performance for the codes like yours is "well explored" and still popular area. Take a look at dot-product (perfect link provided by Z Boson already) or at some (D)AXPY optimization discussions (https://scicomp.stackexchange.com/questions/1932/are-daxpy-dcopy-dscal-overkills)
In general , key topics to explore and consider applying are:
AVX2 advantage over AVX due to FMA and better load/store ports u-architecture on Haswell
Pre-Fetching. "Streaming stores", "non-temporal stores" for some platforms.
Threading parallelism to reach max sustained bandwidth
Unrolling (already done by you; Intel Compiler is also capable to do that with #pragma unroll (X) ). Not a big difference for "streaming" codes.
Finally deciding what is a set of hardware platforms you want to optimize your code for
The last bullet is particularly important, because for "streaming" and memory-bound codes in general it's important to know more about the target memory subsystems; for example, with existing and especially future high-end HPC servers (2nd gen Xeon Phi codenamed Knights Landing as an example) you may have a very different "roofline model" balance between bandwidth and compute, and may even need different techniques than when optimizing for an average desktop machine.
Are you sure that 8 GFLOPS/s is about 25% of the maximum throughput of a 3 GHz Ivybridge processor? Let's do the calculations.
Every 8 elements require two single-precision AVX multiplications and one AVX addition. An Ivybridge processor can only execute one 8-wide AVX addition and one 8-wide AVX multiplication per cycle. Since the addition depends on the two multiplications, 3 cycles are required to process 8 elements. Since the addition can be overlapped with the next multiplication, we can reduce this to 2 cycles per 8 elements. For one billion elements, 2*10^9/8 = 10^9/4 cycles are required. With a 3 GHz clock, that is 10^9/4 * 10^-9/3 = 1/12 ≈ 0.083 seconds. So the maximum theoretical throughput is 12 G elements/s. If the 8 G figure counts elements, the compiler-generated code reaches 66% of that, which is fine. Note, however, that each element costs 3 floating-point operations, so counted in flops the same bound is 36 GFLOP/s, and 8 GFLOP/s is about 22% of it.
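The compute-bound estimate can be written out as a short sketch. It assumes exactly what the paragraph above assumes (one 8-wide multiply and one 8-wide add can issue per cycle, so the two multiplies per vector are the limit) and keeps elements and floating-point operations as separate units; the function names are illustrative.

```cpp
// Compute-bound estimate for Z[i] = C1*Z[i] + C2*D[i] with 8-wide AVX.
// Assumption from the text: the two multiplies per vector limit throughput,
// with the add overlapped, giving 2 cycles per 8 elements.
inline double elements_per_second(double clock_hz) {
    const double cycles_per_vector = 2.0;  // two multiplies, add overlapped
    const double lanes = 8.0;              // floats per AVX register
    return clock_hz * lanes / cycles_per_vector;
}

// Same bound counted in floating-point operations.
inline double flops_per_second(double clock_hz) {
    const double flops_per_element = 3.0;  // two multiplies and one add
    return elements_per_second(clock_hz) * flops_per_element;
}
```

At 3 GHz this gives 12e9 elements per second, which corresponds to 36e9 floating-point operations per second when each element is counted as 3 flops.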
One more thing, by unrolling the loop 8 times, it can be vectorized efficiently. I doubt that you'll gain any significant speed up if you unroll this particular loop more than that, especially more than 16 times.
I think the real bottleneck is that there are 2 loads and 1 store for every 2 multiplications and 1 addition. Maybe the calculation is memory-bandwidth limited. Every element requires transferring 12 bytes of data, and if 2G elements are processed every second (which is 6 GFLOP/s), that is 24 GB/s of data transfer, reaching the theoretical bandwidth of Ivy Bridge. I wonder if this argument holds and there is indeed no solution to this problem.
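The bandwidth argument as a small sketch, assuming 4-byte floats and that every element moves exactly three values through memory (load Z, load D, store Z). The function names are illustrative.

```cpp
// Memory traffic implied by a given element rate for Z[i] = C1*Z[i] + C2*D[i].
// Assumption: per element we load Z[i], load D[i] and store Z[i], 4 bytes each.
constexpr double kBytesPerElement = 3.0 * 4.0;

inline double bytes_per_second(double elements_per_second) {
    return elements_per_second * kBytesPerElement;
}

// Largest element rate that a given memory bandwidth can sustain.
inline double max_elements_per_second(double bandwidth_bytes_per_second) {
    return bandwidth_bytes_per_second / kBytesPerElement;
}
```

For 2e9 elements per second this yields 24e9 bytes per second, the figure used above.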
The reason why I am answering my own question is the hope that someone can correct me before I give up on the optimization too easily. This simple loop is EXTREMELY important for many scientific solvers; it is the backbone of finite element and finite difference methods. If one cannot even feed one processor because the computation is memory-bandwidth limited, why bother with multicore? A high-bandwidth GPU or Xeon Phi should be better solutions.
I wonder whether it is faster to replace branching with 2 multiplications or not (due to cache miss penalty)?
Here is my case:
float dot = rib1.x*-dir.y + rib1.y*dir.x;
if(dot<0){
dir.x = -dir.x;
dir.y = -dir.y;
}
And I'm trying to replace it with:
float dot = rib1.x*-dir.y + rib1.y*dir.x;
int sgn = 1 - 2 * (dot < 0.0f); // -1 if dot < 0, otherwise +1 (no branching)
dir.x *= sgn;
dir.y *= sgn;
Branching does not imply cache miss: only instruction prefetching/pipelining is disturbed, so it's possible you block some SSE optimization at compile-time with it.
On the other hand, if only plain x86 instructions are being used, speculative execution will let the processor start executing the most frequently taken branch early.
However, if you enter the if 50% of the time, you are in the worst case: in this case I'd try to look for SSE pipelining and to have the execution optimized with SSE, probably getting some hints from this post, in line with your second block of code.
However, benchmark your code, check the produced assembler in order to find the best solution for this optimization, and get the proper insight. And eventually keep us updated :)
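For such a benchmark, here is a minimal sketch of a branch version and a multiply version that produce the same result: negate dir only when dot < 0, and leave it unchanged when dot is zero or positive. Vec2 and the function names are hypothetical stand-ins for the question's types, and whether the compiler really emits branch-free code for the second function still has to be checked in the generated assembly.

```cpp
struct Vec2 { float x, y; };  // hypothetical stand-in for the question's types

// Reference version with a branch.
inline Vec2 flip_branch(Vec2 rib1, Vec2 dir) {
    float dot = rib1.x * -dir.y + rib1.y * dir.x;
    if (dot < 0) {
        dir.x = -dir.x;
        dir.y = -dir.y;
    }
    return dir;
}

// Multiply version: scale by -1 when dot < 0 and by +1 otherwise.
inline Vec2 flip_branchless(Vec2 rib1, Vec2 dir) {
    float dot = rib1.x * -dir.y + rib1.y * dir.x;
    float sgn = 1.0f - 2.0f * static_cast<float>(dot < 0.0f);  // comparison yields 0 or 1
    dir.x *= sgn;
    dir.y *= sgn;
    return dir;
}
```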
The cost of the multiplication depends on several factors, whether you use 32-bit or 64-bit floats, and whether you enable SSE or not. The cost of two float multiplications is 10 cycles according to this source: http://www.agner.org/optimize/instruction_tables.pdf
The cost of the branch also depends on several factors. As a rule of thumb, do not worry about branches in your code. The exact behaviour of the branch predictor on the CPU will define the performance, but in this case you should probably expect that the branch will be unpredictable at best, so this is likely to lead to a lot of branch mispredictions. The cost of a branch misprediction is 10-30 cycles according to this source: http://valgrind.org/docs/manual/cg-manual.html
The best advice anyone can give here is to profile and test. I would guess that on a modern Core i7 the two multiplications should be faster than the branch, if the range of input varies sufficiently as to cause sufficient branch mispredictions as to outweigh the cost of the additional multiplication.
Assuming 50% miss rate, the cost of the branch averages 15 cycles (30 * 0.5), the cost of the float mul is 10 cycles.
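The same estimate can be turned around to ask at which misprediction rate the two options cost the same. A sketch, using the cycle counts quoted above as inputs (the function names are illustrative):

```cpp
// Average branch cost in cycles for a given misprediction rate.
inline double average_branch_cost(double miss_rate, double penalty_cycles) {
    return miss_rate * penalty_cycles;
}

// Misprediction rate above which the multiplications are the cheaper option.
inline double break_even_miss_rate(double mul_cycles, double penalty_cycles) {
    return mul_cycles / penalty_cycles;
}
```

With a 30-cycle penalty and 10 cycles for the multiplications, the break-even point is a miss rate of about one third; with a 10-cycle penalty the branch would never cost more than the multiplications on average.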
EDIT: Added links, updated estimated instruction cost.
I hear this statement quite often, that multiplication on modern hardware is so optimized that it actually is at the same speed as addition. Is that true?
I never can get any authoritative confirmation. My own research only adds questions. The speed tests usually show data that confuses me. Here is an example:
#include <stdio.h>
#include <sys/time.h>
unsigned int time1000() {
timeval val;
gettimeofday(&val, 0);
val.tv_sec &= 0xffff;
return val.tv_sec * 1000 + val.tv_usec / 1000;
}
int main() {
unsigned int sum = 1, T = time1000();
for (int i = 1; i < 100000000; i++) {
sum += i + (i+1); sum++;
}
printf("%u %u\n", time1000() - T, sum);
sum = 1;
T = time1000();
for (int i = 1; i < 100000000; i++) {
sum += i * (i+1); sum++;
}
printf("%u %u\n", time1000() - T, sum);
}
The code above can show that multiplication is faster:
clang++ benchmark.cpp -o benchmark
./benchmark
746 1974919423
708 3830355456
But with other compilers, other compiler arguments, differently written inner loops, the results can vary and I cannot even get an approximation.
Multiplication of two n-bit numbers can in fact be done in O(log n) circuit depth, just like addition.
Addition in O(log n) is done by splitting the number in half and (recursively) adding the two parts in parallel, where the upper half is solved for both the "0-carry" and "1-carry" case. Once the lower half is added, the carry is examined, and its value is used to choose between the 0-carry and 1-carry case.
Multiplication in O(log n) depth is also done through parallelization, where every sum of 3 numbers is reduced to a sum of just 2 numbers in parallel, and the sums are done in some manner like the above.
I won't explain it here, but you can find reading material on fast addition and multiplication by looking up "carry-lookahead" and "carry-save" addition.
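As a rough illustration of the carry-save idea, here is a software sketch of a 3:2 compressor and a multiply built on it. The names are made up for illustration, and the loop applies the compressions one after another, whereas hardware arranges them as a tree to get the logarithmic depth; only the final addition propagates carries.

```cpp
#include <cstdint>

struct CarrySave {
    std::uint64_t sum;
    std::uint64_t carry;
};

// 3:2 compressor: each bit position is an independent full adder, so the
// circuit depth is constant no matter how wide the words are.
inline CarrySave compress3to2(std::uint64_t a, std::uint64_t b, std::uint64_t c) {
    std::uint64_t sum = a ^ b ^ c;                             // per-bit sum, no carries
    std::uint64_t carry = ((a & b) | (a & c) | (b & c)) << 1;  // per-bit majority, shifted up
    return {sum, carry};
}

// Multiply by folding each partial product into a (sum, carry) pair.
// The invariant is sum + carry == partial products added so far.
inline std::uint64_t multiply_carry_save(std::uint32_t x, std::uint32_t y) {
    std::uint64_t sum = 0;
    std::uint64_t carry = 0;
    for (int bit = 0; bit < 32; ++bit) {
        std::uint64_t partial =
            ((y >> bit) & 1u) ? (static_cast<std::uint64_t>(x) << bit) : 0;
        CarrySave cs = compress3to2(sum, carry, partial);
        sum = cs.sum;
        carry = cs.carry;
    }
    return sum + carry;  // the only carry-propagating addition
}
```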
So from a theoretical standpoint, since circuits are inherently parallel (unlike software), the only reason multiplication would be slower is the constant factor in front, not the asymptotic complexity.
Integer multiplication will be slower.
Agner Fog's instruction tables show that when using 32-bit integer registers, Haswell's ADD/SUB take 0.25–1 cycles (depending on how well pipelined your instructions are) while MUL takes 2–4 cycles. Floating-point is the other way around: ADDSS/SUBSS take 1–3 cycles while MULSS takes 0.5–5 cycles.
This is an even more complex answer than simply multiplication versus addition. In reality the answer will most likely NEVER be yes. Multiplication, electronically, requires a much more complicated circuit. The main reason is that multiplication consists of a multiplication step followed by an addition step; remember what it was like to multiply decimal numbers by hand before using a calculator.
The other thing to remember is that multiplication will take longer or shorter depending on the architecture of the processor you are running it on. This may or may not be simply company specific. While an AMD will most likely be different than an Intel, even an Intel i7 may be different from a core 2 (within the same generation), and certainly different between generations (especially the farther back you go).
In all TECHNICALITY, if multiplies were the only thing you were doing (without looping, counting etc...), multiplies would be 2 to (as I've seen on PPC architectures) 35 times slower. This is more an exercise in understanding your architecture, and electronics.
In Addition:
It should be noted that a processor COULD be built for which ALL operations including a multiply take a single clock. What this processor would have to do is, get rid of all pipelining, and slow the clock so that the HW latency of any OPs circuit is less than or equal to the latency PROVIDED by the clock timing.
To do this would get rid of the inherent performance gains we are able to get when adding pipelining into a processor. Pipelining is the idea of taking a task and breaking it down into smaller sub-tasks that can be performed much quicker. By storing and forwarding the results of each sub-task between sub-tasks, we can now run a faster clock rate that only needs to allow for the longest latency of the sub-tasks, and not from the overarching task as a whole.
Picture of time through a multiply:
|--------------------------------------------------| Non-Pipelined
|--Step 1--|--Step 2--|--Step 3--|--Step 4--|--Step 5--| Pipelined
In the above diagram, the non-pipelined circuit takes 50 units of time. In the pipelined version, we have split the 50 units into 5 steps each taking 10 units of time, with a store step in between. It is EXTREMELY important to note that in the pipelined example, each of the steps can be working completely on their own and in parallel. For an operation to be completed, it must move through all 5 steps in order but another of the same operation with operands can be in step 2 as one is in step 1, 3, 4, and 5.
With all of this being said, this pipelined approach allows us to continuously fill the operator each clock cycle, and get a result out on each clock cycle IF we are able to order our operations such that we can perform all of one operation before we switch to another operation, and all we take as a timing hit is the original amount of clocks necessary to get the FIRST operation out of the pipeline.
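The timing argument can be sketched with two small functions. They assume independent operations issued back to back and ignore the cost of the store step between stages; the names are illustrative.

```cpp
// Total time for n independent operations without pipelining:
// each operation occupies the whole circuit for op_time units.
inline long non_pipelined_time(long n, long op_time) {
    return n * op_time;
}

// Total time with pipelining: the first result appears after all stages,
// then one further result completes every stage_time units.
inline long pipelined_time(long n, long stages, long stage_time) {
    if (n == 0) return 0;
    return stages * stage_time + (n - 1) * stage_time;
}
```

With the numbers from the diagram (50 units, or 5 stages of 10 units), 100 operations take 5000 units without pipelining and 1040 units with it, while a single operation takes 50 units either way.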
Mystical brings up another good point. It is also important to look at the architecture from a systems perspective. It is true that the newer Haswell architecture was built to improve floating-point multiply performance within the processor. For this reason, at the system level, it was architected to allow multiple multiplies to occur simultaneously, versus an add, which can only happen once per system clock.
All of this can be summed up as follows:
Each architecture is different from a lower level HW perspective as well as from a system perspective
FUNCTIONALLY, a multiply will always take more time than an add because it combines a true multiply along with a true addition step.
Understand the architecture you are trying to run your code on, and find the right balance between readability and getting truly the best performance from that architecture.
Intel since Haswell has
add performance of 4/clock throughput, 1 cycle latency. (Any operand-size)
imul performance of 1/clock throughput, 3 cycle latency. (Any operand-size)
Ryzen is similar. Bulldozer-family has much lower integer throughput and not-fully-pipelined multiply, including extra slow for 64-bit operand-size multiply. See https://agner.org/optimize/ and other links in https://stackoverflow.com/tags/x86/info
But a good compiler could auto-vectorize your loops. (SIMD-integer multiply throughput and latency are both worse than SIMD-integer add). Or simply constant-propagate through them to just print out the answer! Clang really does know the closed-form Gauss's formula for sum(i=0..n) and can recognize some loops that do that.
You forgot to enable optimization so both loops bottleneck on the ALU + store/reload latency of keeping sum in memory between each of sum += independent stuff and sum++. See Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? for more about just how bad the resulting asm is, and why that's the case. clang++ defaults to -O0 (debug mode: keep variables in memory where a debugger can modify them between any C++ statements).
Store-forwarding latency on a modern x86 like Sandybridge-family (including Haswell and Skylake) is about 3 to 5 cycles, depending on timing of the reload. So with a 1-cycle latency ALU add in there, too, you're looking at about two 6-cycle latency steps in the critical path for this loop. (Plenty to hide all the store / reload and calculation based on i, and the loop-counter update).
See also Adding a redundant assignment speeds up code when compiled without optimization for another no-optimization benchmark. In that one, store-forwarding latency is actually reduced by having more independent work in the loop, delaying the reload attempt.
Modern x86 CPUs have 1/clock multiply throughput so even with optimization you wouldn't see a throughput bottleneck from it. Or on Bulldozer-family, not fully pipelined with 1 per 2-clock throughput.
More likely you'd bottleneck on the front-end work of getting all the work issued every cycle.
Although lea does allow very efficient copy-and-add, and doing i + i + 1 with a single instruction. Although really a good compiler would see that the loop only uses 2*i and optimize to increment by 2. i.e. a strength-reduction to do repeated addition by 2 instead of having to shift inside the loop.
And of course with optimization the extra sum++ can just fold into the sum += stuff where stuff already includes a constant. Not so with the multiply.
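To make the strength-reduction and constant-folding remarks concrete, here is a sketch of the first benchmark loop next to a hand-reduced equivalent. It uses unsigned arithmetic throughout so that wraparound is well defined, which is an assumption that differs slightly from the int loop counter in the question; the function names are illustrative.

```cpp
// The first benchmark loop as written: sum += i + (i+1); sum++.
inline unsigned sum_as_written(unsigned n) {
    unsigned sum = 1;
    for (unsigned i = 1; i < n; i++) {
        sum += i + (i + 1);
        sum++;
    }
    return sum;
}

// Hand strength-reduced: i + (i+1) + 1 == 2*i + 2, so the added term
// simply grows by 2 per iteration and the separate sum++ disappears.
inline unsigned sum_strength_reduced(unsigned n) {
    unsigned sum = 1;
    unsigned term = 4;  // 2*1 + 2 for i == 1
    for (unsigned i = 1; i < n; i++) {
        sum += term;
        term += 2;
    }
    return sum;
}
```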
I came to this thread to get an idea of what the modern processors are doing in regard to integer math and the number of cycles required to do them. I worked on this problem of speeding up 32-bit integer multiplies and divides on the 65c816 processor in the 1990's. Using the method below, I was able to triple the speed of the standard math libraries available in the ORCA/M compilers at the time.
So the idea that multiplies are faster than adds is simply not the case (except rarely) but like people said it depends upon how the architecture is implemented. If there are enough steps being performed available between clock cycles, yes a multiply could effectively be the same speed as an add based on the clock, but there would be a lot of wasted time. In that case it would be nice to have an instruction that performs multiple (dependent) adds / subtracts given one instruction and multiple values. One can dream.
On the 65c816 processor, there were no multiply or divide instructions. Mult and Div were done with shifts and adds.
To perform a 16 bit add, you would do the following:
LDA $0000 - loaded a value into the Accumulator (5 cycles)
ADC $0002 - add with carry (5 cycles)
STA $0004 - store the value in the Accumulator back to memory (5 cycles)
15 cycles total for an add
If dealing with a call like from C, you would have the additional overhead of pushing and pulling values off the stack. Creating routines that do two multiplies at once would save overhead, for example.
The traditional way of doing the multiply is shifts and adds through the entire value of the one number. Each time the carry became a one as it is shifted left would mean you needed to add the value again. This required a test of each bit and a shift of the result.
I replaced that with a lookup table of 256 items so as the carry bits would not need to be checked. It was also possible to determine overflow before doing the multiply to not waste time. (On a modern processor this could be done in parallel but I don't know if they do this in the hardware). Given two 32 bit numbers and prescreened overflow, one of the multipliers is always 16 bits or less, thus one would only need to run through 8 bit multiplies once or twice to perform the entire 32 bit multiply. The result of this was multiplies that were 3 times as fast.
the speed of the 16 bit multiplies ranged from 12 cycles to about 37 cycles
multiply by 2 (0000 0010)
LDA $0000 - loaded a value into the Accumulator (5 cycles).
ASL - shift left (2 cycles).
STA $0004 - store the value in the Accumulator back to memory (5 cycles).
12 cycles plus call overhead.
multiply by (0101 1010)
LDA $0000 - loaded a value into the Accumulator (5 cycles)
ASL - shift left (2 cycles)
ASL - shift left (2 cycles)
ADC $0000 - add with carry for next bit (5 cycles)
ASL - shift left (2 cycles)
ADC $0000 - add with carry for next bit (5 cycles)
ASL - shift left (2 cycles)
ASL - shift left (2 cycles)
ADC $0000 - add with carry for next bit (5 cycles)
ASL - shift left (2 cycles)
STA $0004 - store the value in the Accumulator back to memory (5 cycles)
37 cycles plus call overhead
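The two listings above follow a simple pattern: one load, then for every remaining multiplier bit a shift plus an add if the bit is set, then one store. A sketch in C++ that reproduces both the product and the cycle counts under that reading. It is a model for illustration, not the original 65c816 routine; the names are made up and the multiplier is assumed to be nonzero.

```cpp
#include <cstdint>

struct ShiftAddResult {
    std::uint16_t product;
    int cycles;
};

// Multiply a 16-bit value by a nonzero 8-bit constant, most significant bit
// first, counting cycles with the figures from the listings
// (LDA and STA 5 cycles, ASL 2 cycles, ADC 5 cycles).
inline ShiftAddResult shift_add_multiply(std::uint16_t value, std::uint8_t multiplier) {
    const int kLoadStore = 5, kShift = 2, kAdd = 5;
    int top = 7;
    while (top > 0 && !((multiplier >> top) & 1)) --top;  // locate the leading 1 bit
    std::uint16_t acc = value;  // LDA: accumulator holds value * 1
    int cycles = kLoadStore;
    for (int bit = top - 1; bit >= 0; --bit) {
        acc = static_cast<std::uint16_t>(acc << 1);  // ASL
        cycles += kShift;
        if ((multiplier >> bit) & 1) {
            acc = static_cast<std::uint16_t>(acc + value);  // ADC
            cycles += kAdd;
        }
    }
    cycles += kLoadStore;  // STA
    return {acc, cycles};
}
```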
Since the databus of the AppleIIgs for which this was written was only 8 bits wide, to load 16 bit values required 5 cycles to load from memory, one extra for the pointer, and one extra cycle for the second byte.
LDA instruction (1 cycle because it is an 8 bit value)
$0000 (16 bit value requires two cycles to load)
memory location (requires two cycles to load because of an 8 bit data bus)
Modern processors would be able to do this faster because they have a 32 bit data bus at worst. In the processor logic itself the system of gates would have no additional delay at all compared to the data bus delay since the whole value would get loaded at once.
To do the complete 32 bit multiply, you would need to do the above twice and add the results together to get the final answer. The modern processors should be able to do the two in parallel and add the results for the answer. Combined with the overflow precheck done in parallel, it would minimize the time required to do the multiply.
Anyway, it is readily apparent that a multiply requires significantly more effort than an add. How much of the operation can be completed within one CPU clock cycle determines how many clock cycles are required. If the clock is slow enough, an add will appear to take the same time as a multiply.
Regards,
Ken
A multiplication requires, as its final step, an addition at least as wide as the operands, so it will take longer than an addition. In decimal:
      123
    x 112
    -----
      246   ----
     123       | matrix generation
    123     ----
    -----
    13776   <---- Addition
Same applies in binary, with a more elaborate reduction of the matrix.
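A minimal Python sketch of the two phases, assuming the simple grade-school scheme (real hardware reduces the matrix more cleverly). The function name is hypothetical; it generates the rows of the matrix first and leaves the final addition as a separate step.

```python
def partial_products(a, b, base=10):
    """Matrix generation: one shifted row per digit of b, least significant first."""
    rows = []
    shift = 0
    while b > 0:
        b, digit = divmod(b, base)
        rows.append(a * digit * base ** shift)
        shift += 1
    return rows


rows = partial_products(123, 112)
print(rows)        # the rows of the matrix
print(sum(rows))   # the final addition

# The same routine in binary: each row is either 0 or a shifted copy of a.
print(partial_products(5, 6, base=2))
```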
That said, reasons why they may take the same amount of time:
To simplify the pipelined architecture, all regular instructions can be designed to take the same number of cycles (exceptions include memory moves, which depend on how long it takes to talk to external memory).
Since the adder for the final step of the multiplier is just like the adder for an add instruction... why not use the same adder by skipping the matrix generation and reduction? If they use the same adder, then obviously they will take the same amount of time.
Of course, there are more complex architectures where this is not the case, and you might obtain completely different values. You also have architectures that take several instructions in parallel when they don't depend on each other, and then you are a bit at the mercy of your compiler... and of the operating system.
To run this test rigorously, you would have to write it in assembly and run it without an operating system; otherwise there are too many variables.
Even if it were, that mostly tells us what restriction the clock puts on our hardware. We can't clock higher because of heat(?), but the signal could pass through the gates of many ADD circuits during one clock period, while a single ADD instruction uses only one of them. So although ADD and MUL may at some point take equally many clock cycles, the ADD does not use all of the signal propagation time available in the cycle.
If we could clock higher, we could definitely make ADD faster, probably by several orders of magnitude.
This really depends on your machine. Of course, integer multiplication is quite complex compared to addition, but quite a few AMD CPUs can execute a multiplication in a single cycle. That is just as fast as addition.
Other CPUs take three or four cycles to do a multiplication, which is a bit slower than addition. But it's nowhere near the performance penalty you had to suffer ten years ago (back then a 32-Bit multiplication could take thirty-something cycles on some CPUs).
So, yes, multiplication is in the same speed class nowadays, but no, it's still not exactly as fast as addition on all CPUs.
Even on ARM (known for its high efficiency and small, clean design), integer multiplications take 3-7 cycles, whereas integer additions take 1 cycle.
However, an add/shift trick is often used to multiply integers by constants faster than the multiply instruction can calculate the answer.
The reason this works well on ARM is that ARM has a "barrel shifter", which allows many instructions to shift or rotate one of their arguments by 1-31 bits at zero cost, i.e. x = a + b and x = a + (b << s) take exactly the same amount of time.
Utilizing this processor feature, let's say you want to calculate a * 15. Then since 15 = 1111 (base 2), the following pseudocode (translated into ARM assembly) would implement the multiplication:
a_times_3 = a + (a << 1) // a * (0011 (base 2))
a_times_15 = a_times_3 + (a_times_3 << 2) // a * (0011 (base 2) + 1100 (base 2))
Similarly you could multiply by 13 = 1101 (base 2) using either of the following:
a_times_5 = a + (a << 2)
a_times_13 = a_times_5 + (a << 3)
or:
a_times_3 = a + (a << 1)
a_times_15 = a_times_3 + (a_times_3 << 2)
a_times_13 = a_times_15 - (a << 1)
The first snippet is obviously faster in this case, but sometimes subtraction helps when translating a constant multiplication into add/shift combinations.
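If you want to convince yourself that these decompositions are right, the pseudocode translates directly into Python. This is only a correctness check with hypothetical function names; it says nothing about how many cycles the ARM versions take.

```python
def times_15(a):
    a_times_3 = a + (a << 1)              # a * 0011 (base 2)
    return a_times_3 + (a_times_3 << 2)   # a * (0011 + 1100) (base 2)


def times_13_add(a):
    a_times_5 = a + (a << 2)              # a * 0101 (base 2)
    return a_times_5 + (a << 3)           # add a * 1000 (base 2)


def times_13_sub(a):
    a_times_15 = times_15(a)
    return a_times_15 - (a << 1)          # 15a - 2a = 13a


# Compare each shift/add version against a plain multiplication.
for a in range(1000):
    assert times_15(a) == a * 15
    assert times_13_add(a) == a * 13
    assert times_13_sub(a) == a * 13
print("all decompositions agree with plain multiplication")
```

The add-only version of a * 13 needs two shifted additions, while the subtracting version needs three operations, which matches the remark that the first snippet is the faster one here.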
This multiplication trick was used heavily in the ARM assembly coding community in the late 80s, on the Acorn Archimedes and Acorn RISC PC (the origin of the ARM processor). Back then, a lot of ARM assembly was written by hand, since squeezing every last cycle out of the processor was important. Coders in the ARM demoscene developed many techniques like this for speeding up code, most of which are probably lost to history now that almost no assembly code is written by hand anymore. Compilers probably incorporate many tricks like this, but I'm sure there are many more that never made the transition from "black art optimization" to compiler implementation.
You can of course write explicit add/shift multiplication code like this in any compiled language, and the code may or may not run faster than a straight multiplication once compiled.
x86_64 may also benefit from this multiplication trick for small constants, although I don't believe shifting is zero-cost on the x86_64 ISA, in either the Intel or AMD implementations (x86_64 probably takes one extra cycle for each integer shift or rotate).
There are lots of good answers here about your main question, but I just wanted to point out that your code is not a good way to measure operation performance.
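For starters, modern CPUs adjust frequencies all the time, so you should use rdtsc to count cycles instead of measuring elapsed microseconds (bear in mind that on recent x86 CPUs the TSC ticks at a constant reference rate rather than at the current core clock).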
But more importantly, your code has artificial dependency chains, unnecessary control logic and iterators that will turn your measurement into an odd mix of latency and throughput, plus some constant terms added for no reason.
To really measure throughput, you should significantly unroll the loop and also accumulate several partial sums in parallel (more sums than there are steps in the CPU's add/mul pipelines).
No, it's not, and in fact it's noticeably slower (which translated into a 15% performance hit for the particular real-world program I was running).
I realized this myself when asking a related question just a few days ago.
Since the other answers deal with real, present-day devices -- which are bound to change and improve as time passes -- I thought we could look at the question from the theoretical side.
Proposition: When implemented in logic gates, using the usual algorithms, an integer multiplication circuit is O(log N) times slower than an addition circuit, where N is the number of bits in a word.
Proof: The time for a combinatorial circuit to stabilise is proportional to the depth of the longest sequence of logic gates from any input to any output. So we must show that a gradeschool multiply circuit is O(log N) times deeper than an addition circuit.
In its simplest form, addition is implemented as a half adder followed by N-1 full adders, with the carry bits chained from one adder to the next. This ripple-carry circuit clearly has depth O(N). (Faster designs exist: carry-lookahead and parallel-prefix adders bring the depth down to O(log N) at the cost of more gates. The argument below only needs the depth of one adder as a unit, so the ripple-carry adder is assumed for simplicity.)
To multiply A by B, we first need to multiply each bit of A with each bit of B. Each bitwise multiply is simply an AND gate. There are N^2 bitwise multiplications to perform, hence N^2 AND gates -- but all of them can execute in parallel, for a circuit depth of 1. This solves the multiplication phase of the gradeschool algorithm, leaving just the addition phase.
In the addition phase, we can combine the partial products using an inverted binary tree-shaped circuit to do many of the additions in parallel. The tree will be (log N) nodes deep, and at each node, we will be adding together two numbers with O(N) bits. This means each node can be implemented with an adder of depth O(N), giving a total circuit depth of O(N log N). QED.
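To put rough numbers on the proof, here is a small Python sketch of the depth model. It assumes unit depth per adder cell, a ripple-carry adder at every tree node, and 2N-bit adders throughout (an upper bound); the function names are hypothetical.

```python
def ripple_adder_depth(n_bits):
    """Depth of a ripple-carry adder: one adder cell per bit, chained by carry."""
    return n_bits


def gradeschool_multiplier_depth(n_bits):
    """AND layer plus a binary tree of adders over the N partial products."""
    and_layer = 1
    tree_levels = (n_bits - 1).bit_length()   # ceil(log2(N)) for N >= 1
    # Assumption: every node uses an adder wide enough for the 2N-bit product.
    return and_layer + tree_levels * ripple_adder_depth(2 * n_bits)


for n in (8, 16, 32, 64):
    add = ripple_adder_depth(n)
    mul = gradeschool_multiplier_depth(n)
    print(n, add, mul, round(mul / add, 1))
```

Under these assumptions the ratio of the two depths grows like 2 log N, in line with the O(log N) factor claimed in the proposition.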