Is multiplication of two numbers a constant time algorithm?

Is multiplication of two numbers a constant time algorithm? - c++

Suppose I write,
int a = 111;
int b = 509;
int c = a * b;
So what is the time complexity to compute 'a * b' ? How is the multiplication operation executed?

Compiling this function:
int f(int a, int b) {
return a * b;
}
With gcc -O3 -march=native -m64 -fomit-frame-pointer -S gives me the following assembly:
f:
movl %ecx, %eax
imull %edx, %eax
ret
The first instruction (movl) loads the first argument, the second instruction (imull) loads the second argument and multiplies it with the first - then the result gets returned.
The actual multiplication is done with imull, which - depending on your CPU type - will take a certain amount of CPU cycles.
If you look at Agner Fog's instruction timing tables you can see how much time each instruction will take. On most x86 processors it seems to be a small constant, however the imul instruction on the AMD K8 with a 64 bit argument and result shows as 4-5 CPU cycles. I don't know if that's a measurement issue or really variable time however.
Also note that there's other factors involved than just the execution time. The integer has to be moved through the processor and get into the right place to get multiplied. All of this and other factors make latency, which is also noted in Agner Fog's tables. There are other issues such as cache issues which also make life more difficult - it's not that easy to simply say how fast something will run without running it.
x86 isn't the only architecture, and it's actually not inconceivable there are CPU's and architectures out there that have non-constant time multiplication. This is especially important for cryptography where algorithms using multiplication might be susceptible to timing attacks on those platforms.

Multiplication itself on most common architectures will be constant. Time to load registers may vary depending on the location of the variables (L1, L2, RAM, etc) but the number of cycles operation takes will be constant. This is in contrast to operations like sqrt that may require additional cycles to achieve certain precision.
you can get instruction costs here for AMD, Intel, VIA: http://www.agner.org/optimize/instruction_tables.pdf

By time complexity, I presume you mean whether it depends on the number of digits in a and b? So whether the number of CPU clock cycles would vary depending on whether you multiplied say 2*3 or 111*509. I think yes they would vary and it would depend on how that architecture implements the multiplication operation and how the intermediate results are stored.
Although there can be many ways to do this one simple/primitive way is to implement multiplication using the binary adder/subtractor circuit.
Multiplication of a*b is adding a to itself b times using n-digit binary adders. Similarly division a/b is subtraction b from a until it reaches 0, although this will take more space to store the quotient and remainder.

void myfun()
{
int a = 111;
int b = 509;
int c = a * b;
}
De assemble part :
movl $111, -4(%ebp)
movl $509, -8(%ebp)
movl -4(%ebp), %eax
imull -8(%ebp), %eax
So as you can see it all depends on imull instruction, specifically the fetch, decode, and execute cycle of a CPU.

In your example, the compiler would do the multiplication and your code would look like
int c = 56499;
If you changed your example to look like
int c = a * 509;
then the compiler MIGHT decide to rewrite your code like
int c = a * ( 512 - 2 - 1 );
int c = (a << 9) - (a << 1) - a;
I said might because the compiler will compare the cost of using shift to the cost of a multiply instruction and pick the best option.
Given a fast multiply instruction, that usually means only 1 or maybe 2 shift will be faster.
If your numbers are too large to fit in an integer (32-bits) then the arbitrary precision
math routines use between O(n^2) and O(n log n) time where n is the number of 32-bit parts needed to hold the numbers.

Related

How to unset N right-most set bits

There is a relatively well-known trick for unsetting a single right-most bit:
y = x & (x - 1) // 0b001011100 & 0b001011011 = 0b001011000 :)
I'm finding myself with a tight loop to clear n right-most bits, but is there a simpler algebraic trick?
Assume relatively large n (n has to be <64 for 64bit integers, but it's often on the order of 20-30).
// x = 0b001011100 n=2
for (auto i=0; i<n; i++) x &= x - 1;
// x = 0b001010000
I've thumbed my TAOCP Vol4a few times, but can't find any inspiration.
Maybe there is some hardware support for it?

For Intel x86 CPUs with BMI2, pext and pdep are fast. AMD before Zen3 has very slow microcoded PEXT/PDEP (https://uops.info/) so be careful with this; other options might be faster on AMD, maybe even blsi in a loop, or better a binary-search on popcount (see below).
Only Intel has dedicated hardware execution units for the mask-controlled pack/unpack that pext/pdep do, making it constant-time: 1 uop, 3 cycle latency, can only run on port 1.
I'm not aware of other ISAs having a similar bit-packing hardware operation.
pdep basics: pdep(-1ULL, a) == a. Taking the low popcnt(a) bits from the first operand, and depositing them at the places where a has set bits, will give you a back again.
But if, instead of all-ones, your source of bits has the low N bits cleared, the first N set bits in a will grab a 0 instead of 1. This is exactly what you want.
uint64_t unset_first_n_bits_bmi2(uint64_t a, int n){
return _pdep_u64(-1ULL << n, a);
}
-1ULL << n works for n=0..63 in C. x86 asm scalar shift instructions mask their count (effectively &63), so that's probably what will happen for the C undefined-behaviour of a larger n. If you care, use n&63 in the source so the behaviour is well-defined in C, and it can still compile to a shift instruction that uses the count directly.
On Godbolt with a simple looping reference implementation, showing that they produce the same result for a sample input a and n.
GCC and clang both compile it the obvious way, as written:
# GCC10.2 -O3 -march=skylake
unset_first_n_bits_bmi2(unsigned long, int):
mov rax, -1
shlx rax, rax, rsi
pdep rax, rax, rdi
ret
(SHLX is single-uop, 1 cycle latency, unlike legacy variable-count shifts that update FLAGS... except if CL=0)
So this has 3 cycle latency from a->output (just pdep)
and 4 cycle latency from n->output (shlx, pdep).
And is only 3 uops for the front-end.
A semi-related BMI2 trick:
pext(a,a) will pack the bits at the bottom, like (1ULL<<popcnt(a)) - 1 but without overflow if all bits are set.
Clearing the low N bits of that with an AND mask, and expanding with pdep would work. But that's an overcomplicated expensive way to create a source of bits with enough ones above N zeros, which is all that actually matters for pdep. Thanks to #harold for spotting this in the first version of this answer.
Without fast PDEP: perhaps binary search for the right popcount
#Nate's suggestion of a binary search for how many low bits to clear is probably a good alternative to pdep.
Stop when popcount(x>>c) == popcount(x) - N to find out how many low bits to clear, preferably with branchless updating of c. (e.g. c = foo ? a : b often compiles to cmov).
Once you're done searching, x & (-1ULL<<c) uses that count, or just tmp << c to shift back the x>>c result you already have. Using right-shift directly is cheaper than generating a new mask and using it every iteration.
High-performance popcount is relatively widely available on modern CPUs. (Although not baseline for x86-64; you still need to compile with -mpopcnt or -march=native).
Tuning this could involve choosing a likely starting-point, and perhaps using a max initial step size instead of pure binary search. Getting some instruction-level parallelism out of trying some initial guesses could perhaps help shorten the latency bottleneck.

Why is __int128_t faster than long long on x86-64 GCC?

This is my test code:
#include <chrono>
#include <iostream>
#include <cstdlib>
using namespace std;
using ll = long long;
int main()
{
__int128_t a, b;
ll x, y;
a = rand() + 10000000;
b = rand() % 50000;
auto t0 = chrono::steady_clock::now();
for (int i = 0; i < 100000000; i++)
{
a += b;
a /= b;
b *= a;
b -= a;
a %= b;
}
cout << chrono::duration_cast<chrono::milliseconds>(chrono::steady_clock::now() - t0).count() << ' '
<< (ll)a % 100000 << '\n';
x = rand() + 10000000;
y = rand() % 50000;
t0 = chrono::steady_clock::now();
for (int i = 0; i < 100000000; i++)
{
x += y;
x /= y;
y *= x;
y -= x;
x %= y;
}
cout << chrono::duration_cast<chrono::milliseconds>(chrono::steady_clock::now() - t0).count() << ' '
<< (ll)x % 100000 << '\n';
return 0;
}
This is the test result:
$ g++ main.cpp -o main -O2
$ ./main
2432 1
2627 1
Using GCC 10.1.0 on x64 GNU/Linux, no matter if it's using optimization of -O2 or un-optimized, __int128_t is always a little faster than long long.
int and double are both significantly faster than long long; long long has become the slowest type.
How does this happen?

The performance difference comes from the efficiency of 128-bit divisions/modulus with GCC/Clang in this specific case.
Indeed, on my system as well as on godbolt, sizeof(long long) = 8 and sizeof(__int128_t) = 16. Thus operation on the former are performed by native instruction while not the latter (since we focus on 64 bit platforms). Additions, multiplications and subtractions are slower with __int128_t. But, built-in functions for divisions/modulus on 16-byte types (__divti3 and __modti3 on x86 GCC/Clang) are surprisingly faster than the native idiv instruction (which is pretty slow, at least on Intel processors).
If we look deeper in the implementation of GCC/Clang built-in functions (used only for __int128_t here), we can see that __modti3 uses conditionals (when calling __udivmodti4). Intel processors can execute the code faster because:
taken branches can be well predicted in this case since they are always the same (and also because the loop is executed millions of times);
the division/modulus is split in faster native instructions that can mostly be executed in parallel by multiple CPU ports (and that benefit from out-of-order execution). A div instruction is still used in most possible paths (especially in this case);
The execution time of the div/idiv instructions covers most of the overall execution time because of their very high latencies. The div/idiv instructions cannot be executed in parallel because of the loop dependencies. However, the latency of a div lower than a idiv making the former faster.
Please note that the performance of the two implementations can highly differ from one architecture to another (because of the number of CPU ports, the branch prediction capability and the latency/througput of the idiv instruction).
Indeed, the latency of a 64-bit idiv instruction takes 41-95 cycles on Skylake while it takes 8-41 cycles on AMD Ryzen processors for example. Respectively the latency of a div is about 6-89 cycles on Skylake and still the same on Ryzen. This means that the benchmark performance results should be significantly different on Ryzen processors (the opposite effect may be seen due to the additional instructions/branch costs in the 128-bits case).

TL:DR: __int128 division helper functions internally end up doing an unsigned div reg64 (after some branching on the values being positive and the upper halves being 0). 64-bit div is faster on Intel CPUs than the signed idiv reg64 that GCC inlines for signed long long. Faster by enough to make up for all the extra overhead of the helper function, and extended-precision for the other operations.
You probably wouldn't see this effect on AMD CPUs: long long would be faster as expected because idiv r64 is similar enough in perf to div r64 there.
And unsigned long long is faster than unsigned __int128 even on Intel CPUs, for example on my i7-6700k (Skylake) at 3.9GHz (run under perf stat to be sure of CPU frequency during the test):
2097 (i128) vs. 2332 (i64) - your original test (run back-to-back for CPU freq warm-up)
2075 (u128) vs. 1900 (u64) - unsigned versions. Slightly less branching in u128 division vs. i128, but major difference for i64 vs. u64 where the only difference is div vs. idiv.
Also, drawing any general conclusions from a very specific micro-benchmark like this would be a bad idea. It is interesting to dig into why exactly the extended-precision __int128 type manages to be faster in this division benchmark with positive numbers small enough to fit in a 32-bit integer, though.
Your benchmark is heavily weighted towards division, which you do twice per iteration (/ and %), even though it's much more expensive than other operations and in most code used much less often. (e.g. sum a whole array then divide once to get the average.)
Your benchmark is also has no instruction-level parallelism: each step has a data dependency on the previous step. This prevents auto-vectorization or anything that would show some of the advantages of narrower types.
(It's also not careful to avoid warm-up effects like the first timed region being slow until the CPU gets up to max turbo. Idiomatic way of performance evaluation?. But that happens much faster than the couple seconds of your timed regions, so that's not a problem here.)
128-bit integer division (especially signed) is too complicated for GCC to want to inline, so gcc emits a call to a helper function, __divti3 or __modti3. (TI = tetra-integer, GCC's internal name for an integer that's 4x the size of int.) These functions are documented in the GCC-internals manual.
You can see the compiler-generated asm on the Godbolt compiler-explorer. i.e. 128-bit addition with add/adc, multiplication with one mul full-multiply of the low halves, and 2x non-widening imul of the cross products. Yes these are slower than the single-instruction equivalents for int64_t.
But Godbolt doesn't show you the asm for libgcc helper functions. It doesn't disassemble them even in "compile-to-binary" and disassemble mode (instead of the usual compiler asm text output) because it dynamically links libgcc_s instead of libgcc.a.
Extended-precision signed division is done by negating if necessary and doing unsigned division of 64-bit chunks, then fixing up the sign of the result if necessary.
With both inputs small and positive, no actual negation is needed (just testing and branching). There are also fast-paths for small numbers (high-half divisor = 0, and quotient will fit in 64 bits), which is the case here. The end result is that the path of execution through __divti3 looks like this:
This is from manually single-stepping into the call to __divti3 with gdb, after compiling with g++ -g -O3 int128-bench.cpp -o int128-bench.O3 on my Arch GNU/Linux system, with gcc-libs 10.1.0-2.
# Inputs: dividend = RSI:RDI, divisor = RCX:RDX
# returns signed quotient RDX:RAX
| >0x7ffff7c4fd40 <__divti3> endbr64 # in case caller was using CFE (control-flow enforcement), apparently this instruction has to pollute all library functions now. I assume it's cheap at least in the no-CFE case.
│ 0x7ffff7c4fd44 <__divti3+4> push r12
│ 0x7ffff7c4fd46 <__divti3+6> mov r11,rdi
│ 0x7ffff7c4fd49 <__divti3+9> mov rax,rdx │ 0x7ffff7c4fd4c <__divti3+12> xor edi,edi
│ 0x7ffff7c4fd4e <__divti3+14> push rbx
│ 0x7ffff7c4fd4f <__divti3+15> mov rdx,rcx
│ 0x7ffff7c4fd52 <__divti3+18> test rsi,rsi # check sign bit of dividend (and jump over a negation)
│ 0x7ffff7c4fd55 <__divti3+21> jns 0x7ffff7c4fd6e <__divti3+46>
... taken branch to
| >0x7ffff7c4fd6e <__divti3+46> mov r10,rdx
│ 0x7ffff7c4fd71 <__divti3+49> test rdx,rdx # check sign bit of divisor (and jump over a negation), note there was a mov rdx,rcx earlier
│ 0x7ffff7c4fd74 <__divti3+52> jns 0x7ffff7c4fd86 <__divti3+70>
... taken branch to
│ >0x7ffff7c4fd86 <__divti3+70> mov r9,rax
│ 0x7ffff7c4fd89 <__divti3+73> mov r8,r11
│ 0x7ffff7c4fd8c <__divti3+76> test r10,r10 # check high half of abs(divisor) for being non-zero
│ 0x7ffff7c4fd8f <__divti3+79> jne 0x7ffff7c4fdb0 <__divti3+112> # falls through: small-number fast path
│ 0x7ffff7c4fd91 <__divti3+81> cmp rax,rsi # check that quotient will fit in 64 bits so 128b/64b single div won't fault: jump if (divisor <= high half of dividend)
│ 0x7ffff7c4fd94 <__divti3+84> jbe 0x7ffff7c4fe00 <__divti3+192> # falls through: small-number fast path
│ 0x7ffff7c4fd96 <__divti3+86> mov rdx,rsi
│ 0x7ffff7c4fd99 <__divti3+89> mov rax,r11
│ 0x7ffff7c4fd9c <__divti3+92> xor esi,esi
│ >0x7ffff7c4fd9e <__divti3+94> div r9 #### Do the actual division ###
│ 0x7ffff7c4fda1 <__divti3+97> mov rcx,rax
│ 0x7ffff7c4fda4 <__divti3+100> jmp 0x7ffff7c4fdb9 <__divti3+121>
...taken branch to
│ >0x7ffff7c4fdb9 <__divti3+121> mov rax,rcx
│ 0x7ffff7c4fdbc <__divti3+124> mov rdx,rsi
│ 0x7ffff7c4fdbf <__divti3+127> test rdi,rdi # check if the result should be negative
│ 0x7ffff7c4fdc2 <__divti3+130> je 0x7ffff7c4fdce <__divti3+142>
... taken branch over a neg rax / adc rax,0 / neg rdx
│ >0x7ffff7c4fdce <__divti3+142> pop rbx
│ 0x7ffff7c4fdcf <__divti3+143> pop r12
│ 0x7ffff7c4fdd1 <__divti3+145> ret
... return back to the loop body that called it
Intel CPUs (since IvyBridge) have zero-latency mov, so all that overhead doesn't worsen the critical path latency (which is your bottleneck) significantly. Or at least not enough to make up the difference between idiv and div.
The branching is handled by branch prediction and speculative execution, only checking the predictions after the fact when the actual input register values are the same. The branching goes the same way every time so it's trivial for branch prediction to learn. Since division is so slow, there's plenty of time for out-of-order exec to catch up.
64-bit operand-size integer division is very slow on Intel CPUs, even when the numbers are actually small and would fit in a 32-bit integer, and the extra microcode for signed integer division is even more expensive.
e.g. on my Skylake (i7-6700k), https://uops.info/ shows that (table search result )
idiv r64 is 56 uops for the front-end, with latency from 41 to 95 cycles (from divisor to quotient, which is the relevant case here I think).
div r64 is 33 uops for the front-end, with latency from 35 to 87 cycles. (for that same latency path).
The latency best-case happens for small quotients or small dividends or something, I can never remember which.
Similar to the branching that GCC does in software for 128-bit division in terms of 64-bit, I think the CPU microcode is internally doing 64-bit division in terms of narrower operations, probably the 32-bit that's only 10 uops for signed or unsigned, with much lower latency. (Ice Lake improves the divider so 64-bit division isn't much slower than 32-bit.)
This is why you found long long so much slower than int for this benchmark. In a lot of cases it's about the same, or half speed if memory bandwidth or SIMD are involved. (Only 2 elements per 128-bit of vector width, not 4).
AMD CPUs handle 64-bit operand size more efficiently, with performance only depending on the actual values, so about the same for div r32 vs. div r64 with the same numbers.
BTW, the actual values tend to be something like a=1814246614 / b=1814246613 = 1, then a=1 % b=1814246612 (with b decreasing by 1 every iteration). Only ever testing division with quotient=1 seems very silly. (The first iteration might be different, but we get into this state for the 2nd and later.)
Performance of integer operations other than division is not data-dependent on modern CPUs. (Unless of course there are compile-time constants that allow different asm to be emitted. Like division by a constant is much cheaper when done with a multiplicative inverse calculated at compile time.)
re: double: see Floating point division vs floating point multiplication for division vs. multiplication. FP division is often harder to avoid, and its performance is relevant in more cases, so it's handled better.
Related:
Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux has a specific example of changing div r64 to div r32 in a program that uses small-enough numbers, and seeing throughput improve ~3x.
Can 128bit/64bit hardware unsigned division be faster in some cases than 64bit/32bit division on x86-64 Intel/AMD CPUs? has some details about div and idiv being microcoded.
How sqrt() of GCC works after compiled? Which method of root is used? Newton-Raphson? has some hardware details of how div/sqrt execution units are designed in general, and in older Intel CPUs. But that doesn't explain why 64-bit is so many more uops than 32-bit; I'm only inferring that the hardware must be narrower before Ice Lake from the fact that it needs so many more uops of microcode.

Mysteries of C++ optimization

Take the two following snippets:
int main()
{
unsigned long int start = utime();
__int128_t n = 128;
for(__int128_t i=1; i<1000000000; i++)
n = (n * i);
unsigned long int end = utime();
cout<<(unsigned long int) n<<endl;
cout<<end - start<<endl;
}
and
int main()
{
unsigned long int start = utime();
__int128_t n = 128;
for(__int128_t i=1; i<1000000000; i++)
n = (n * i) >> 2;
unsigned long int end = utime();
cout<<(unsigned long int) n<<endl;
cout<<end - start<<endl;
}
I am benchmarking 128 bit integers in C++. When executing the first one (just multiplication) everything runs in approx. 0.95 seconds. When I also add the bit shift operation (second snippet) the execution time raises to an astounding 2.49 seconds.
How is this possible? I thought that bit shifting was one of the lightest operations for a processor. How comes that there is so much overhead due to such a simple operation? I am compiling with O3 flag activated.
Any idea?

This question has been bugging me for the past few days, so I decided to do some more investigation. My initial answer focused on the difference in data values between the two tests. My assertion was that the integer multiplication unit in the processor finishes an operation in fewer clock cycles if one of the operands is zero.
While there are instructions that are clearly documented to work that way (integer division, for example), there are very strong indications that integer multiplication is done in a constant number of cycles in modern processors, regardless of input. The note in Intel's documentation that initially made me think that the number of cycles for integer multiplication can depend on input data doesn't seem to apply to these instructions. Also, I did some more rigorous performance tests with the same sequence of instructions on both zero and non-zero operands and the results didn't yield significant differences. As far as I can tell, harold's comment on this subject is correct. My mistake; sorry.
While contemplating the possibility of deleting this answer altogether, so that it doesn't lead people astray in the future, I realized there were still quite a few more things worth saying on this subject. I also think there's at least one other way in which data values can influence performance in such calculations (included in the last section). So, I decided to restructure and enhance the rest of the information in my initial answer, started writing, and... didn't quite stop for a while. It's up to you to decide whether it was worth it.
The information is structured into the following sections:
What the code does
What the compiler does
What the processor does
What you can do about it
Unanswered questions
What the code does
It overflows, mostly.
In the first version, n starts overflowing on the 33rd iteration. In the second version, with the shift, n starts overflowing on the 52nd iteration.
In the version without the shift, starting with the 128th iteration, n is zero (it overflows "cleanly", leaving only zeros in the least significant 128 bits of the result).
In the second version, the right shift (dividing by 4) takes out more factors of two from the value of n on each iteration than the new operands bring in, so the shift results in rounding on some iterations. Quick calculation: the total number of factors of two in all numbers from 1 to 128 is equal to
128 / 2 + 128 / 4 + ... + 2 + 1 = 26 + 25 + ... + 2 + 1 = 27 - 1
while the number of factors of two taken out by the right shift (if it had enough to take from) is 128 * 2, more than double.
Armed with this knowledge, we can give a first answer: from the point of view of the C++ standard, this code spends most of its time in undefined behaviour land, so all bets are off. Problem solved; stop reading now.
What the compiler does
If you're still reading, from this point forward we'll ignore the overflows and look at some implementation details. "The compiler", in this case, means GCC 4.9.2 or Clang 3.5.1. I've only done performance measurements on code generated by GCC. For Clang, I've looked at the generated code for a few test cases and noted some differences that I'll mention below, but I haven't actually run the code; I might have missed some things.
Both multiplication and shift operations are available for 64-bit operands, so 128-bit operations need to be implemented in terms of those. First, multiplication: n can be written as 264 nh + nl, where nh and nl are the high and low 64-bit halves, respectively. The same goes for i. So, the multiplication can be written:
(264 nh + nl)(264 ih + il) = 2128 nh ih + 264 (nh il + nl ih) + nl il
The first term doesn't have any non-zero bits in the lower 128-bit part; it's either all overflow or all zero. Since ignoring integer overflows is valid and common for C++ implementations, that's what the compiler does: the first term is ignored completely.
The parenthesis only contributes bits to the upper 64-bit half of the 128-bit result; any overflow resulting from the two multiplications or the addition is also ignored (the result is truncated to 64 bits).
The last term determines the bits in the low 64-bit half of the result and, if the result of that multiplication has more than 64 bits, the extra bits need to be added to the high 64-bit half obtained from the parenthesis discussed before. There's a very useful multiplication instruction in x86-64 assembly that does just what's needed: takes two 64-bit operands and places the result in two 64-bit registers, so the high half is ready to be added to the result of the operations in the parenthesis.
That is how 128-bit integer multiplication is implemented: three 64-bit multiplications and two 64-bit additions.
Now, the shift: using the same notations as above, the two least significant bits of nh need to become the two most significant bits of nl, after the contents of the latter is shifted right by two bits. Using C++ syntax, it would look like this:
nl = nh << 62 | nl >> 2 //Doesn't change nh, only uses its bits.
Besides that, nh also needs to be shifted, using something like
nh >>= 2;
That is how the compiler implements a 128-bit shift. For the first part, there's an x86-64 instruction that has the exact semantics of that expression; it's called SHRD. Using it can be good or bad, as we'll see below, and the two compilers make slightly different choices in this respect.
What the processor does
... is highly processor-dependent. (No... really?!)
Detailed information about what happens for Haswell processors can be found in harold's excellent answer. Here, I'll try to cover more ground at a higher level. For more detailed data, here are some sources:
Intel® 64 and IA-32 Architectures Optimization Reference Manual
Software Optimization Guide for AMD Family 15h Processors
Agner Fog's microarchitecture manual and instruction tables.
I'll refer to the following architectures:
Intel Sandy Bridge / Ivy Bridge - abbreviated "IntelSB" going forward;
Intel Haswell / Broadwell - "IntelH" going forward;
I'll just use "Intel" for things that are the same between SB and H.
AMD Bulldozer / Piledriver / Steamroller - "AMD" going forward.
I have measurement data taken on an IntelSB system; I think it's precise enough, as long as the compiler doesn't act up. Unfortunately, when working with such tight loops, this can happen very easily. At various points during testing, I had to use all kinds of stupid tricks to avoid GCC's idiosyncrasies, usually related to register use. For example, it seems to have a tendency to shuffle registers around unnecessarily, when compiling simpler code than for other cases when it generates optimal assembly. Ironically, on my test setup, it tended to generate optimal code for the second sample, using the shift, and worse code for the first one, making the impact of the shift less visible. Clang/LLVM seems to have fewer of those bad habits, but then again, I looked at fewer examples using it and I didn't measure any of them, so this doesn't mean much. In the interest of comparing apples with apples, all measurement data below refers to the best code generated for each case.
First, let's rearrange the expression for 128-bit multiplication from the previous section into a (horrible) diagram:
nh * il
\
+ -> tmp
/ \
nl * ih + -> next nh
/
high 64 bits
/
nl * il --------
\
low 64 bits
\
-> next nl
(sorry, I hope it gets the point across)
Some important points:
The two additions can't execute until their respective inputs are ready; the final addition can't execute until everything else is ready.
The three multiplications can, theoretically, execute in parallel (no input depends on another multiplication's output).
In the ideal scenario above, the total number of cycles to complete the entire calculation for one iteration is the sum of the number of cycles for one multiplication and two additions.
The next nl can be ready early. This, together with the fact that the next il and ih are very cheap to calculate, means the nl * il and nl * ih calculations for the next iteration can start early, possibly before the next nh has been computed.
Multiplications can't really execute entirely in parallel on these processors, as there's only one integer multiplication unit for each core, but they can execute concurrently through pipelining. One multiplication can begin executing on each cycle on Intel, and every 4 cycles on AMD, even if previous multiplications haven't finished executing yet.
All of the above mean that, if the loop's body doesn't contain anything else that gets in the way, the processor can reorder those multiplications to achieve something as close as possible to the ideal scenario above. This applies to the first code snippet. On IntelH, as measured by harold, it's exactly the ideal scenario: 5 cycles per iteration are made up of 3 cycles for one multiplication and one cycle each for the two additions (impressive, to be honest). On IntelSB, I measured 6 cycles per iteration (closer to 5.5, actually).
The problem is that in the second code snippet something does get in the way:
nh * il
\ normal shift -> next nh
+ -> tmp /
/ \ /
nl * ih + ----> temp nh
/ \
high 64 bits \
/ "composite" shift -> next nl
nl * il -------- /
\ /
low 64 bits /
\ /
-> temp nl ---------
The next nl is no longer ready early. temp nl has to wait for temp nh to be ready, so that both can be fed into the composite shift, and only then will we have the next nl. Even if both shifts are very fast and execute in parallel, they don't just add the execution cost of one shift to an iteration; they also change the dynamics of the loop's "pipeline", acting like a sort of synchronizing barrier.
If the two shifts finish at the same time, then all three multiplications for the next iteration will be ready to execute at the same time, and they can't all start in parallel, as explained above; they'll have to wait for one another, wasting cycles. This is the case on IntelSB, where the two shifts are equally fast (see below); I measured 8 cycles per iteration for this case.
If the two shifts don't finish at the same time, it will typically be the normal shift that finishes first (the composite shift is slower on most architectures). This means that the next nh will be ready early, so the top multiplication can start early for the next iteration. However, the other two multiplications still have to wait more (wasted) cycles for the composite shift to finish, and then they'll be ready at the same time and one will have to wait for the other to start, wasting some more time. This is the case on IntelH, measured by harold at 9 cycles per iteration.
I expect AMD to fall under this last category as well. While there's an even bigger difference in performance between the composite shift and normal shift on this platform, integer multiplications are also slower on AMD than on Intel (more than twice as slow), making the first sample slower to begin with. As a very rough estimate, I think the first version could take about 12 cycles on AMD, with the second one at around 16. It would be nice to have some concrete measurements, though.
Some more data on the troublesome composite shift, SHRD:
On IntelSB, it's exactly as cheap as a simple shift (great!); simple shifts are about as cheap as they come: they execute in one cycle, and two shifts can start executing each cycle.
On IntelH, SHRD takes 3 cycles to execute (yes, it got worse in the newer generation), and two shifts of any kind (simple or composite) can start executing each cycle;
On AMD, it's even worse. If I'm reading the data correctly, executing an SHRD keeps both shift execution units busy until execution finishes - no parallelism and no pipelining possible; it takes 3 cycles, during which no other shift can start executing.
What you can do about it
I can think of three possible improvements:
replace SHRD with something faster on platforms where it makes sense;
optimize the multiplication to take advantage of the data types involved here;
restructure the loop.
1. SHRD can be replaced with two shifts and a bitwise OR, as described in the compiler section. A C++ implementation of a 128-bit shift right by two bits could look like this:
__int128_t shr2(__int128_t n)
{
using std::int64_t;
using std::uint64_t;
//Unpack the two halves.
int64_t nh = n >> 64;
uint64_t nl = static_cast<uint64_t>(n);
//Do the actual work.
uint64_t rl = nl >> 2 | nh << 62;
int64_t rh = nh >> 2;
//Pack the result.
return static_cast<__int128_t>(rh) << 64 | rl;
}
Although it looks like a lot of code, only the middle section doing the actual work generates shifts and ORs. The other parts merely indicate to the compiler which 64-bit parts we want to work with; since the 64-bit parts are already in separate registers, those are effectively no-ops in the generated assembly code.
However, keep in mind that this amounts to "trying to write assembly using C++ syntax", and it's generally not a very bright idea. I'm only using it because I verified that it works for GCC and I'm trying to keep the amount of assembly code in this answer to a minimum. Even so, there's one surprise: the LLVM optimizer detects what we're trying to do with those two shifts and one OR and... replaces them with an SHRD in some cases (more about this below).
Functions of the same form can be used for shifts by other numbers of bits, less than 64. From 64 to 127, it gets easier, but the form changes. One thing to keep in mind is that it would be a mistake to pass the number of bits for shifting as a runtime parameter to a shr function. Shift instructions by a variable number of bits are slower than the ones using a constant number on most architectures. You could use a non-type template parameter to generate different functions at compile time - this is C++, after all...
I think using such a function makes sense on all architectures except IntelSB, where SHRD is already as fast as it can get. On AMD, it will definitely be an improvement. Less so on IntelH: for our case, I don't think it will make a difference, but generally it could shave once cycle off some calculations; there could theoretically be cases where it could make things slightly worse, but I think those are very uncommon (as usual, there's no substitute for measuring). I don't think it will make a difference for our loop because it will change things from [nh being ready after once cycle and nl after three] to [both being ready after two]; this means all three multiplications for the next iteration will be ready at the same time and they'll have to wait for one another, essentially wasting the cycle that was gained by the shift.
GCC seems to use SHRD on all architectures, and the "assembly in C++" code above can be used as an optimization where it makes sense. The LLVM optimizer uses a more nuanced approach: it does the optimization (replaces SHRD) automatically for AMD, but not for Intel, where it even reverses it, as mentioned above. This may change in future releases, as indicated by the discussion on the patch for LLVM that implemented this optimization. For now, if you want to use the alternative with LLVM on Intel, you'll have to resort to assembly code.
2. Optimizing the multiplication: The test code uses a 128-bit integer for i, but that's not needed in this case, as its value fits easily in 64 bits (32, actually, but that doesn't help us here). What this means is that ih will always be zero; this reduces the diagram for 128-bit multiplication to the following:
nh * il
\
\
\
+ -> next nh
/
high 64 bits
/
nl * il
\
low 64 bits
\
-> next nl
Normally, I'd just say "declare i as long long and let the compiler optimize things" but unfortunately this doesn't work here; both compilers go for the standard behaviour of converting the two operands to their common type before doing the calculation, so i ends up on 128 bits even if it starts on 64. We'll have to do things the hard way:
__int128_t mul(__int128_t n, long long i)
{
using std::int64_t;
using std::uint64_t;
//Unpack the two halves.
int64_t nh = n >> 64;
uint64_t nl = static_cast<uint64_t>(n);
//Do the actual work.
__asm__(R"(
movq %0, %%r10
imulq %2, %%r10
mulq %2
addq %%r10, %0
)" : "+d"(nh), "+a"(nl) : "r"(i) : "%r10");
//Pack the result.
return static_cast<__int128_t>(nh) << 64 | nl;
}
I said I tried to avoid assembly code in this answer, but it's not always possible. I managed to coax GCC into generating the right code with "assembly in C++" for the function above, but once the function is inlined everything falls apart - the optimizer sees what's going on in the complete loop body and converts everything to 128 bits. LLVM seems to behave in this case, but, since I was testing on GCC, I had to use a reliable way to get the right code in there.
Declaring i as long long and using this function instead of the normal multiplication operator, I measured 5 cycles per iteration for the first sample and 7 cycles for the second one on IntelSB, a gain of one cycle in each case. I expect it to shave one cycle off the iterations for both examples on IntelH as well.
3. The loop can sometimes be restructured to encourage pipelined execution, when (at least some) iterations don't really depend on previous results, even though it may look like they do. For example, we could replace the for loop for the second sample with something like this:
__int128_t n2 = 1;
long long j = 1000000000 / 2;
for(long long i = 1; i < 1000000000 / 2; ++i, ++j)
{
n *= i;
n2 *= j;
n >>= 2;
n2 >>= 2;
}
n *= (n2 * j) >> 2;
This takes advantage of the fact that some partial results can be calculated independently and only aggregated at the end. We're also hinting to the compiler that we want to pipeline the multiplications and shifts (not always necessary, but it does make a small difference for GCC for this code).
The code above is nothing more than a naive proof of concept. Real code would need to handle the total number of iterations in a more reliable way. The bigger problem is that this code won't generate the same results as the original, because of different behaviour in the presence of overflow and rounding. Even if we stop the loop on the 51st iteration, to avoid overflow, the result will still be different by about 10%, because of rounding happening in different ways when shifting right. In real code, this would most likely be a problem; but then again, you wouldn't have real code like this, would you?
Assuming this technique is applied to a case where the problems above don't occur, I measured the performance of such code in a few cases, again on IntelSB. The results are given in "cycles per iteration", as before, where "iteration" means the one from the original code (I divided the total number of cycles for executing the whole loop by the total number of iterations executed by the original code, not for the restructured one, to have a meaningful comparison):
The code above executes in 7 cycles per iteration, a gain of one cycle over the original.
The code above with the multiplication operator replaced with our mul() function needs 6 cycles per iteration.
The restructured code does suffer from more register shuffling, which can't be avoided unfortunately (more variables). More recent processors like IntelH have architecture improvements that make register moves essentially free in many cases; this could make the code yield even larger gains. Using new instructions like MULX for IntelH could avoid some register moves altogether; GCC does use such instructions when compiling with -march=haswell.
Unanswered questions
None of the measurements that we have so far explain the large differences in performance reported by the OP, and observed by me on a different system.
My initial timings were taken on a remote system (Westmere family processor) where, of course, a lot of things could happen; yet, the results were strangely stable.
On that system, I also experimented with executing the second sample with a right shift and a left shift; the code using a right shift was consistently 50% slower than the other variant. I couldn't replicate that on my controlled test system on IntelSB, and I don't have an explanation for those results either.
We can discard all of the above as unpredictable side effects of compiler / processor / overall system behaviour, but I can't shake the feeling that not everything has been explained here.
It's true that it doesn't really make much sense to benchmark such tight loops without a controlled system, precise tools (counting cycles) and looking at the generated assembly code for each case. Compiler idiosyncrasies can easily result in code that artificially introduces differences of 50% or more in performance.
Another factor that could explain large differences is the presence of Intel Hyper-Threading. Different parts of the core behave differently when this is enabled, and the behaviour has also changed between processor families. This could have strange effects on tight loops.
To top everything off, here's a crazy hypothesis: Flipping bits consumes more power than keeping them constant. In our case, the first sample, working with zero values most of the time, will be flipping far fewer bits than the second one, so the latter will consume more power. Many modern processors have features that dynamically adjust the core frequency depending on electrical and thermal limits (Intel Turbo Boost / AMD Turbo Core). This means that, theoretically, under the right (or wrong?) conditions, the second sample could trigger a reduction of the core frequency, thus making the same number of cycles take longer time, and making the performance data-dependent.

After benchmarking both (using the assembly generated by GCC 4.7.3 on -O2) on my 4770K, I found that the first one takes 5 cycles per iteration and the second one takes 9 cycles per iteration. Why so much difference?
It turns out to be an interplay between throughput and latency. The main killer is shrd, which takes 3 cycles and is on the critical path. Here's a picture of it (I ignore the chain for i because it is faster and there is plenty of spare throughput for it to just run ahead, it will not interfere):
The edges here are dependencies, not dataflow.
Based solely on latencies in this chain, the expected time would be 8 cycles per iteration. But it is not. The problem here is that for 8 cycles to happen, mul2 and imul3 have to be executed in parallel, and integer multiplication only has a throughput of 1/cycle. So it (either one) has to wait a cycle, and holds up the chain by a cycle. I verified this by changing that imul to an add, which reduced the time to 8 cycles per iteration. Changing the other imul to an add had no effect, as predicted based on this explanation (it doesn't depend on shrd and can thus be scheduled earlier, without interfering with the other multiplications).
These exact details are only for Haswell.
The code I used was this:
section .text
global cmp1
proc_frame cmp1
[endprolog]
mov r8, rsi
mov r9, rdi
mov esi, 1
xor edi, edi
mov eax, 128
xor edx, edx
.L2:
mov rcx, rdx
mov rdx, rdi
imul rdx, rax
imul rcx, rsi
add rcx, rdx
mul rsi
add rdx, rcx
add rsi, 1
mov rcx, rsi
adc rdi, 0
xor rcx, 10000000
or rcx, rdi
jne .L2
mov rdi, r9
mov rsi, r8
ret
endproc_frame
global cmp2
proc_frame cmp2
[endprolog]
mov r8, rsi
mov r9, rdi
mov esi, 1
xor edi, edi
mov eax, 128
xor edx, edx
.L3:
mov rcx, rdi
imul rcx, rax
imul rdx, rsi
add rcx, rdx
mul rsi
add rdx, rcx
shrd rax, rdx, 2
sar rdx, 2
add rsi, 1
mov rcx, rsi
adc rdi, 0
xor rcx, 10000000
or rcx, rdi
jne .L3
mov rdi, r9
mov rsi, r8
ret
endproc_frame

Unless your processor can support native 128-bit operations, the operations will have to be software coded to use the next best option.
Your 128-bit operations are using the same scheme as the 8-bit processors did when using 16-bit operations, and this takes time.
For example, a 128-bit right shift, by one bit, using 64-bit registers requires:
Shift the Most Significant register right into carry. The Carry flag will contain the bit that was shifted out.
Shift the Least Significant register right, with carry. The bits will be shifted right, with the carry flag being shifted into the Most Significant Bit position.
Without support for native 128-bit operations, you code will take twice as many operations as the same 64-bit operations; sometimes more (such as multiplication). This is why you are seeing such poor performance.
I highly recommend only using 128-bits in places where it is extremely necessary.

Variance in RDTSC overhead

I'm constructing a micro-benchmark to measure performance changes as I experiment with the use of SIMD instruction intrinsics in some primitive image processing operations. However, writing useful micro-benchmarks is difficult, so I'd like to first understand (and if possible eliminate) as many sources of variation and error as possible.
One factor that I have to account for is the overhead of the measurement code itself. I'm measuring with RDTSC, and I'm using the following code to find the measurement overhead:
extern inline unsigned long long __attribute__((always_inline)) rdtsc64() {
unsigned int hi, lo;
__asm__ __volatile__(
"xorl %%eax, %%eax\n\t"
"cpuid\n\t"
"rdtsc"
: "=a"(lo), "=d"(hi)
: /* no inputs */
: "rbx", "rcx");
return ((unsigned long long)hi << 32ull) | (unsigned long long)lo;
}
unsigned int find_rdtsc_overhead() {
const int trials = 1000000;
std::vector<unsigned long long> times;
times.resize(trials, 0.0);
for (int i = 0; i < trials; ++i) {
unsigned long long t_begin = rdtsc64();
unsigned long long t_end = rdtsc64();
times[i] = (t_end - t_begin);
}
// print frequencies of cycle counts
}
When running this code, I get output like this:
Frequency of occurrence (for 1000000 trials):
234 cycles (counted 28 times)
243 cycles (counted 875703 times)
252 cycles (counted 124194 times)
261 cycles (counted 37 times)
270 cycles (counted 2 times)
693 cycles (counted 1 times)
1611 cycles (counted 1 times)
1665 cycles (counted 1 times)
... (a bunch of larger times each only seen once)
My questions are these:
What are the possible causes of the bi-modal distribution of cycle counts generated by the code above?
Why does the fastest time (234 cycles) only occur a handful of times—what highly unusual circumstance could reduce the count?
Further Information
Platform:
Linux 2.6.32 (Ubuntu 10.04)
g++ 4.4.3
Core 2 Duo (E6600); this has constant rate TSC.
SpeedStep has been turned off (processor is set to performance mode and is running at 2.4GHz); if running in 'ondemand' mode, I get two peaks at 243 and 252 cycles, and two (presumably corresponding) peaks at 360 and 369 cycles.
I'm using sched_setaffinity to lock the process to one core. If I run the test on each core in turn (i.e., lock to core 0 and run, then lock to core 1 and run), I get similar results for the two cores, except that the fastest time of 234 cycles tends to occur slightly fewer times on core 1 than on core 0.
Compile command is:
g++ -Wall -mssse3 -mtune=core2 -O3 -o test.bin test.cpp
The code that GCC generates for the core loop is:
.L105:
#APP
# 27 "test.cpp" 1
xorl %eax, %eax
cpuid
rdtsc
# 0 "" 2
#NO_APP
movl %edx, %ebp
movl %eax, %edi
#APP
# 27 "test.cpp" 1
xorl %eax, %eax
cpuid
rdtsc
# 0 "" 2
#NO_APP
salq $32, %rdx
salq $32, %rbp
mov %eax, %eax
mov %edi, %edi
orq %rax, %rdx
orq %rdi, %rbp
subq %rbp, %rdx
movq %rdx, (%r8,%rsi)
addq $8, %rsi
cmpq $8000000, %rsi
jne .L105

RDTSC can return inconsistent results for a number of reasons:
On some CPUs (especially certain older Opterons), the TSC isn't synchronized between cores. It sounds like you're already handling this by using sched_setaffinity -- good!
If the OS timer interrupt fires while your code is running, there'll be a delay introduced while it runs. There's no practical way to avoid this; just throw out unusually high values.
Pipelining artifacts in the CPU can sometimes throw you off by a few cycles in either direction in tight loops. It's perfectly possible to have some loops that run in a non-integer number of clock cycles.
Cache! Depending on the vagaries of the CPU cache, memory operations (like the write to times[]) can vary in speed. In this case, you're fortunate that the std::vector implementation being used is just a flat array; even so, that write can throw things off. This is probably the most significant factor for this code.
I'm not enough of a guru on the Core2 microarchitecture to say exactly why you're getting this bimodal distribution, or how your code ran faster those 28 times, but it probably has something to do with one of the reasons above.

The Intel Programmer's manual recommends you use lfence;rdtsc or rdtscp if you want to ensure that instructions prior to the rdtsc have actually executed. This is because rdtsc isn't a serializing instruction by itself.

You should make sure that frequency throttling/green functionality is disabled at the OS level. Restart the machine. You may otherwise have a situation where the cores have unsynchronized time stamp counter values.
The 243 reading is by far the most common which is one reason for using it. On the other hand suppose you get an elapsed time <243: you subtract the overhead and get an underflow. Since the arithmetic is unsigned you end up with an enormous result. This fact speaks for using the lowest reading (234) instead. It is extremely difficult to accurately measure sequences that are only a few cycles long. On a typical x86 # a few GHz I'd recommend against timing sequences shorter than 10ns and even at that length they would typically be far from rock-solid.
The rest of my answer here is what I do, how I handle the results and my reasoning on the subject matter.
As to overhead the easiest way is to use code such as this
unsigned __int64 rdtsc_inline (void);
unsigned __int64 rdtsc_function (void);
The first form emits the rdtsc instruction into the generated code (as in your code). The second will cause a the function to be called, the rdtsc executed and a return instruction. Perhaps it will generate stack frames. Obviously the second form is much slower than the first.
The (C) code for overhead calculation can then be written
unsigned __int64 start_cycle,end_cycle; /* place these # the module level*/
unsigned __int64 overhead;
/* place this code inside a function */
start_cycle=rdtsc_inline();
end_cycle=rdtsc_inline();
overhead=end_cycle-start_cycle;
If you are using the inline variant you will get a low(er) overhead. You will also run the risk of calculating an overhead which is greater than it "should" be (especially for the function form) which in turn means that if you measure very short/fast sequences you may run into having a previously calculated overhead which is greater than the measurement itself. When you attempt to adjust for the overhead you will get an underflow which will lead to messy conditions. The best way to handle this is to
time the overhead several times and always use the smallest value achieved,
not measure really short code sequences as you may run into pipelining effects which will require messy synchronising instructions before the rdtsc instruction and
if you must measure very short sequences regard the results as indications rather than as facts
I've previously discussed what I do with the results in this thread.
Another thing that I do is to integrate the measurement code into the application. The overhead is insignificant. After a result has been computed I send it to a special structure where I count the number of measurements, sum x and x^2 values and determine min and max measurements. Later on I can use the data to calculate average and standard deviation. The structure itself is indexed and I can measure different performance aspects such individual application functions ("functional performance"), time spent in cpu, disk read/writing, network read/writing ("non-functional performance") etc.
If an application is instrumented in this manner and monitored from the very beginning I expect that the risk of it having performance problems during its lifetime will be greatly reduced.

Do modern compilers optimize the x * 2 operation to x << 1?

Does the C++ compiler optimize the multiply by two operation x*2 to a bitshift operation x<<1?
I would love to believe that yes.

Actually VS2008 optimizes this to x+x:
01391000 push ecx
int x = 0;
scanf("%d", &x);
01391001 lea eax,[esp]
01391004 push eax
01391005 push offset string "%d" (13920F4h)
0139100A mov dword ptr [esp+8],0
01391012 call dword ptr [__imp__scanf (13920A4h)]
int y = x * 2;
01391018 mov ecx,dword ptr [esp+8]
0139101C lea edx,[ecx+ecx]
In an x64 build it is even more explicit and uses:
int y = x * 2;
000000013FB9101E mov edx,dword ptr [x]
printf("%d", y);
000000013FB91022 lea rcx,[string "%d" (13FB921B0h)]
000000013FB91029 add edx,edx
This is will the optimization settings on 'Maximize speed' (/O2)

This article from Raymond Chen could be interesting:
When is x/2 different from x>>1? :
http://blogs.msdn.com/oldnewthing/archive/2005/05/27/422551.aspx
Quoting Raymond:
Of course, the compiler is free to recognize this and rewrite your multiplication or shift operation. In fact, it is very likely to do this, because x+x is more easily pairable than a multiplication or shift. Your shift or multiply-by-two is probably going to be rewritten as something closer to an add eax, eax instruction.
[...]
Even if you assume that the shift fills with the sign bit, The result of the shift and the divide are different if x is negative.
(-1) / 2 ≡ 0
(-1) >> 1 ≡ -1
[...]
The moral of the story is to write what you mean. If you want to divide by two, then write "/2", not ">>1".
We can only assume it is wise to tell the compiler what you want, not what you want him to do: The compiler is better than an human is at optimizing small scale code (thanks for Daemin to point this subtle point): If you really want optimization, use a profiler, and study your algorithms' efficiency.

VS 2008 optimized mine to x << 1.
x = x * 2;
004013E7 mov eax,dword ptr [x]
004013EA shl eax,1
004013EC mov dword ptr [x],eax
EDIT: This was using VS default "Debug" configuration with optimization disabled (/Od). Using any of the optimization switches (/O1, /O2 (VS "Retail"), or /Ox) results in the the add self code Rob posted. Also, just for good measure, I verified x = x << 1 is indeed treated the same way as x = x * 2 by the cl compiler in both /Od and /Ox. So, in summary, cl.exe version 15.00.30729.01 for x86 treats * 2 and << 1 identically and I expect nearly all other recent compilers do the same.

Not if x is a float it won't.

Yes. They also optimize other similar operations, such as multiplying by non-powers of two that can be rewritten as the sums of some shifts. They will also optimize divisions by powers of 2 into right-shifts, but beware that when working with signed integers, the two operations are different! The compiler has to emit some extra bit twiddling instructions to make sure the results are the same for positive and negative numbers, but it's still faster than doing a division. It also similarly optimizes moduli by powers of 2.

The answer is "if it is faster" (or smaller). This depends on the target architecture heavily as well as the register usage model for a given compiler. In general, the answer is "yes, always" as this is a very simple peephole optimization to implement and is usually a decent win.

That's only the start of what optimizers can do. To see what your compiler does, look for the switch that causes it to emit assembler source. For the Digital Mars compilers, the output assembler can be examined with the OBJ2ASM tool. If you want to learn how your compiler works, looking at the assembler output can be very illuminating.

I'm sure they all do these kind of optimizations, but I wonder if they are still relevant. Older processors did multiplication by shifting and adding, which could take a number of cycles to complete. Modern processors, on the other hand, have a set of barrel-shifters which can do all the necessary shifts and additions simultaneously in one clock cycle or less. Has anyone actually benchmarked whether these optimizations really help?

Yes, they will.

Unless something is specified in a languages standard you'll never get a guaranteed answer to such a question. When in doubt have your compiler spit out assemble code and check. That's going to be the only way to really know.

#Ferruccio Barletta
That's a good question. I went Googling to try to find the answer.
I couldn't find answers for Intel processors directly, but this page has someone who tried to time things. It shows shifts to be more than twice as fast as ads and multiplies. Bit shifts are so simple (where a multiply could be a shift and an addition) that this makes sense.
So then I Googled AMD, and found an old optimization guide for the Athlon from 2002 that lists that lists the fastest ways to multiply numbers by contants between 2 and 32. Interestingly, it depends on the number. Some are ads, some shifts. It's on page 122.
A guide for the Athlon 64 shows the same thing (page 164 or so). It says multiplies are 3 (in 32-bit) or 4 (in 64-bit) cycle operations, where shifts are 1 and adds are 2.
It seems it is still useful as an optimization.
Ignoring cycle counts though, this kind of method would prevent you from tying up the multiplication execution units (possibly), so if you were doing lots of multiplications in a tight loop where some use constants and some don't the extra scheduling room might be useful.
But that's speculation.

Why do you think that's an optimization?
Why not 2*x → x+x? Or maybe the multiplication operation is as fast as the left-shift operation (maybe only if only one bit is set in the multiplicand)? If you never use the result, why not leave it out from the compiled output? If the compiler already loaded 2 to some register, maybe the multiplication instruction will be faster e.g. if we'd have to load the shift count first. Maybe the shift operation is larger, and your inner loop would no longer fit into the prefetch buffer of the CPU thus penalizing performance? Maybe the compiler can prove that the only time you call your function x will have the value 37 and x*2 can be replaced by 74? Maybe you're doing 2*x where x is a loop count (very common, though implicit, when looping over two-byte objects)? Then the compiler can change the loop from
for(int x = 0; x < count; ++x) ...2*x...
to the equivalent (leaving aside pathologies)
int count2 = count * 2
for(int x = 0; x < count2; x += 2) ...x...
which replaces count multiplications with a single multiplication, or it might be able to leverage the lea instruction which combines the multiplication with a memory read.
My point is: there are millions of factors deciding whether replacing x*2 by x<<1 yields a faster binary. An optimizing compiler will try to generate the fastest code for the program it is given, not for an isolated operation. Therefore optimization results for the same code can vary largely depending on the surrounding code, and they may not be trivial at all.
Generally, there are very few benchmarks that show large differences between compilers. It is therefore a fair assumption that compilers are doing a good job because if there were cheap optimizations left, at least one of the compilers would implement them -- and all the others would follow in their next release.

It depends on what compiler you have. Visual C++ for example is notoriously poor in optimizing. If you edit your post to say what compiler you are using, it would be easier to answer.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js