Try to see which cast is faster (not necessarily better): the new C++ cast or the old-fashioned C-style cast. Any ideas?
There should be no difference at all if you compare int() to the functionally equivalent static_cast<int>().
Using VC2008:
double d = 10.5;
013A13EE fld qword ptr [__real@4025000000000000 (13A5840h)]
013A13F4 fstp qword ptr [d]
int x = int(d);
013A13F7 fld qword ptr [d]
013A13FA call @ILT+215(__ftol2_sse) (13A10DCh)
013A13FF mov dword ptr [x],eax
int y = static_cast<int>(d);
013A1402 fld qword ptr [d]
013A1405 call @ILT+215(__ftol2_sse) (13A10DCh)
013A140A mov dword ptr [y],eax
Obviously, it is 100% the same!
No difference whatsoever.
When it comes to such basic constructs as a single cast, once two constructs have the same semantic meaning, their performance will be identical, and the machine code generated for them will be the same.
I believe the actual result is implementation-defined; you should check it in your version of the compiler, though I'd expect the same result from most modern compilers. And in C++ you shouldn't use C casts anyway; use the C++ casts instead, since they let you find errors at compile time.
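To illustrate that last point, here is a small sketch (hypothetical variables) of the kind of mistake a C cast compiles silently but a C++ cast rejects:
const int ci = 42;
double* pd = (double*)&ci;                  // C cast: compiles, silently wrong
// double* pe = static_cast<double*>(&ci); // error: rejected at compile time
int* pi = const_cast<int*>(&ci);            // the const removal is explicit and greppable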
Take a look at the assembly using each method. If it differs use a profiler.
They are the same, since the cast is resolved at compile time and there is no runtime overhead. Even if there were some difference, I really wouldn't worry about such tiny (not even micro) optimizations.
As most people say, one hopes these should be the same speed, although you're at the mercy of your compiler... and that's not always a very happy situation. Read on for war stories.
Depending on your compiler and the particular processor core the program executes on, the speed of float f; int i(f);, float f; int i = (int)f; and float f; int i = static_cast<int>(f); and their ilk (including variations involving double, long and unsigned types) can be atrociously slow - an order of magnitude worse than you expect. The compiler may emit instructions altering internal processor modes, causing instruction pipelines to be thrown away. This is, in effect, a bug in the optimization element of the compiler. I've seen cases where one suffers the sort of 40-clock-cycle costs mentioned in this analysis, at which point you have a major, unexpected and irritating performance bottleneck with AFAIK no entirely pleasing, robust, generic solution. There are alternatives involving assembler, but AFAIK they do not round floating point to integer the same way as the casts do. If anyone knows better, I'm interested. I'm hoping this issue is/will shortly be confined to legacy compilers/hardware, but you need your wits about you.
P.S. I can't reach that link because my firewall blocks it as games-related but a Google cache of it suffices to demonstrate that its author knows more about it than I do.
When your choice makes little difference to the code, I'd pick the one which looks more familiar to later programmers. Making code easier for others to understand is always worth considering. In this case, I'd stick with int(…) for that reason.
Good evening. Sorry, I used Google Translate.
I use NASM in VC++ on x86, and I'm learning how to use MASM on x64.
Is there any way to specify where each argument goes, and where the return value of an assembly function ends up, in such a way that the compiler can get the data there in the fastest way? Can we also specify which registers will be used, so the compiler knows which data is still live and can make the best use of it?
For example, since there is no intrinsic function that maps exactly to the IDIV r/m64 instruction (64-bit signed integer division in assembly language), we may need to implement it ourselves. IDIV requires the low part of the dividend/numerator in RAX and the high part in RDX, with the divisor/denominator in any register or in a region of memory. At the end, the quotient is in RAX and the remainder in RDX. We may therefore want to write functions like this (I've included some useless operands purely as examples):
void DivLongLongInt( long long NumLow , long long NumHigh , long long Den , long long *Quo , long long *Rem ){
__asm(
// Specify used register: [rax], specify pre location: NumLow --> [rax]
reg(rax)=NumLow ,
// Specify used register: [rdx], specify pre location: NumHigh --> [rdx]
reg(rdx)=NumHigh ,
// Specify required memory: memory64bits [den], specify pre location: Den --> [den]
mem[64](den)=Den ,
// Specify used register: [st0], specify pre location: Const(12.5) --> [st0]
reg(st0)=25*0.5 ,
// Specify used register: [bh]
reg(bh) ,
// Specify required memory: memory64bits [nothing]
mem[64](nothing) ,
// Specify used register: [st1]
reg(st1)
){
// Specify code
IDIV [den]
}(
// Specify pos location: [rax] --> *Quo
*Quo=reg(rax) ,
// Specify pos location: [rdx] --> *Rem
*Rem=reg(rdx)
) ;
}
Is it possible to do something at least close to that?
Thanks for all help.
If there is no way to do this, it's a shame, because it would certainly be a great way to implement high-level functions with assembly-level features. I think it's a simple interface between C++ and ASM that should already exist, enabling assembly code to be embedded inline and at a high level, almost as simply as plain C++ code.
As others have mentioned, MSVC does not support any form of inline assembly when targeting x86-64.
Inline assembly is supported only in x86-32 builds, and even there, it is rather limited in what you can do. In particular, you can't specify inputs and outputs, so the use of inline assembly necessarily entails a lot of shuffling of values back and forth between registers and memory, which is precisely the opposite of what you want when writing high-performance code. Unless there is something that you cannot possibly do any other way except by causing the manual emission of machine code, you should avoid the inline assembler. Its original purpose was to do things like generate OUT instructions and call ROM BIOS interrupts in obsolete 8-bit and 16-bit programming environments. It made it into the 32-bit compiler for compatibility purposes, but the team drew the line with 64-bit.
Intrinsics are now the recommended solution, because these play much better with the optimizer. Virtually any SIMD code that you need the compiler to generate can be accomplished using intrinsics, just as you would on most any other compiler targeting x86, so not only are you getting better code, but you're also getting slightly more portable code.
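For example, a trivial sketch (SSE2 intrinsics, which MSVC, GCC, and Clang all provide via <emmintrin.h>):
#include <emmintrin.h>  // SSE2
// Add four pairs of 32-bit ints at once; the compiler handles register
// allocation and scheduling, which inline asm would prevent.
void add4(const int* a, const int* b, int* out)
{
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi32(va, vb));
}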
Even on Gnu-style compilers that support extended asm blocks, which give you the type of input/output operand power that you are looking for, there are still lots of good reasons to avoid the use of inline asm. Intrinsics are still a better solution there, as is finding a way to represent what you want in C and persuading the compiler to generate the assembly code that you wish it to emit.
The only exception is cases where there are no intrinsics available. The IDIV instruction is, unfortunately, one of those cases. (There are intrinsics available for 128-bit multiplication. They go by various names: either Windows-specific or compiler-specific.)
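For illustration, the Windows-specific spelling looks like this (a sketch; _mul128 is declared in <intrin.h> for x64 targets, and the Gnu equivalent via __int128 is shown below):
#include <cstdint>
#include <intrin.h>
// Full 64x64 -> 128-bit signed multiply: low half returned,
// high half written through the out-parameter.
int64_t mul_full(int64_t a, int64_t b, int64_t* hi)
{
    return _mul128(a, b, hi);
}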
On Gnu compilers that support 128-bit integer types as an extension on 64-bit targets, you can get the compiler to generate the code for you:
__int128_t dividend = 1234;
int64_t divisor = 64;
int64_t quotient = (dividend / divisor);
Now, this is generally compiled as a call to their library function that does 128-bit division, rather than an inline IDIV instruction that returns a 64-bit quotient. Presumably, this is because of the need to handle overflows, as David mentioned. Actually, it's worse than that. No C or C++ implementation can use the DIV/IDIV instructions because they are non-conforming. These instructions will result in overflow exceptions, whereas the standard says that the result should be truncated. (With multiplication, you do get inline IMUL/MUL instruction(s) because these don't have the overflow problem, since they return 128-bit results.)
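To see the problem concretely, here is a sketch (using the Gnu __int128 extension discussed below) of a perfectly well-defined division whose quotient doesn't fit in 64 bits, so a lone IDIV could not produce the required result:
__int128_t big = (__int128_t)1 << 100;
int64_t two = 2;
__int128_t q = big / two;   // quotient is 2^99: too wide for IDIV's 64-bit
                            // result, so the hardware would raise #DE here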
This isn't actually as big of a loss as you might think. You seem to be assuming that the 64-bit IDIV instruction is really fast. It isn't. Although the actual numbers vary depending on the number of significant bits in the absolute value of the dividend, your values probably are quite large if you actually need the range of a 128-bit integer. Looking at Agner Fog's instruction tables will give you some idea of the performance you can expect on various architectures. It's getting faster on newer architectures (especially on the newer AMD processors; it's still sluggish on Intel), but it still has pretty substantial latencies. Just because it's one instruction doesn't mean that it runs in one cycle or anything like that. A single instruction might be good for code density when you're optimizing for size and worried about a call to a library function evicting other instructions from your cache, but division is a slow enough operation that this usually doesn't matter. In fact, division is so slow that compilers try very hard not to use it; whenever possible, they will do multiplication by the reciprocal, which is significantly faster. And if you really need to do multiplications quickly, you should look into parallelizing them with SIMD instructions, which all have intrinsics available.
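To illustrate the reciprocal trick, here is a sketch of the transformation (the constant 0xCCCCCCCD is the well-known fixed-point reciprocal compilers use for an unsigned 32-bit divide by 10):
#include <cstdint>
// x / 10 without a DIV instruction: multiply by ceil(2^35 / 10) and
// take the top bits. Exact for all 32-bit unsigned x.
uint32_t div10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}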
Back to MSVC (although everything I said in the last paragraph still applies, of course), there are no 128-bit integer types, so if you need to implement this type of division, you will need to write the code in an external assembly module and link it in. The code is pretty simple, and Visual Studio has excellent, built-in support for assembling code with MASM and linking it directly into your project:
; Windows 64-bit calling convention passes parameters as follows:
; RCX == first 64-bit integer parameter (low bits of dividend)
; RDX == second 64-bit integer parameter (high bits of dividend)
; R8 == third 64-bit integer parameter (divisor)
; R9 == fourth 64-bit integer parameter (pointer to remainder)
Div128x64 PROC
mov rax, rcx
idiv r8 ; 128-bit divide (RDX:RAX / R8)
mov [r9], rdx ; store remainder
ret ; return, with quotient in RAX
Div128x64 ENDP
Then you just prototype that in your C++ code as:
extern "C" int64_t Div128x64(int64_t loDividend,
int64_t hiDividend,
int64_t divisor,
int64_t* pRemainder);
and you're done. Call it as desired.
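For example (arbitrary values, just to show the argument order):
int64_t remainder;
int64_t quotient = Div128x64(10, 0, 3, &remainder);  // (0:10) / 3 -> quotient 3, remainder 1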
The equivalent can be written for unsigned division, using the DIV instruction.
No, you don't get intelligent register allocation, but this isn't really a big deal with register renaming in the front end that can often elide register-register moves entirely (in other words, MOVs become zero-latency operations). Plus, the IDIV instruction is so restrictive anyway in terms of its operands, since they are hardcoded to RAX and RDX, that it's pretty unlikely a scheduler would be able to keep the values in those registers anyway, at least for any non-trivial piece of code.
Beware that once you write the necessary code to check for the possibility of overflows, or worse—the code to handle exceptions—this will very likely end up performing the same or worse as a library function that does a proper 128-bit division, so you should arguably just write and use that (until such time as Microsoft sees fit to provide one). That can be written in C (also see implementation of __divti3 library function for Gnu compilers), which makes it a candidate for inlining and otherwise plays better with the optimizer.
No, it is not possible to do this. MSVC doesn't support inline assembly for x64 builds. Instead, you should use intrinsics; almost everything is available. The sad thing is, as far as I know, 128-bit idiv is missing from the intrinsics.
A note: you can solve your issue with two movs (to put the inputs in the correct registers). And you should not worry about that; current CPUs handle mov very well. Putting movs into the code may not slow it down at all, and div is very expensive compared to a mov, so it doesn't matter much.
Hard as it may be to believe, the construct p[u+1] occurs in several places in the innermost loops of code I maintain, such that getting the micro-optimization of it right makes hours of difference in an operation that runs for days.
Typically *((p+u)+1) is most efficient. Sometimes *(p+(u+1)) is most efficient. Rarely *((p+1)+u) is best. (But usually an optimizer can convert *((p+1)+u) to *((p+u)+1) when the latter is better, and cannot interchange *(p+(u+1)) with either of the others.)
p is a pointer and u is an unsigned. In the actual code at least one of them (more likely both) will already be in register(s) at the point the expression is evaluated. Those facts are critical to the point of my question.
In 32-bit (before my project dropped support for that) all three have exactly the same semantics and any half decent compiler simply picks the best of the three and the programmer never needs to care.
In these 64-bit uses, the programmer knows all three have the same semantics, but the compiler doesn't know. So far as the compiler knows, the decision of when to extend u from 32-bit to 64-bit can affect the result.
What is the cleanest way to tell the compiler that the semantics of all three are the same and the compiler should select the fastest of them?
In one Linux 64-bit compiler, I got nearly there with p[u+1L], which causes the compiler to select intelligently between the usually-best *((p+u)+1) and the sometimes-better *(p+((long)(u)+1)). In the rare case where *(p+(u+1)) was still better than the second of those, a little is lost.
Obviously, that does no good in 64-bit Windows. Now that we dropped 32-bit support, maybe p[u+1LL] is portable enough and good enough. But can I do better?
Note that using std::size_t instead of unsigned for u would eliminate this entire problem, but create a larger performance problem nearby. Casting u to std::size_t right there is almost good enough, and maybe the best I can do. But that is pretty verbose for an imperfect solution.
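For reference, that verbose-but-almost-good-enough form looks something like this at each use site:
int v = p[static_cast<std::size_t>(u) + 1];  // widen u explicitly before the add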
Simply coding (p+1)[u] makes a selection more likely to be optimal than p[u+1]. If the code were less templated and more stable, I could set them all to (p+1)[u] then profile then switch a few back to p[u+1]. But the templating tends to destroy that approach (A single source line appears in many places in the profile adding up to serious time, but not individually serious time).
Compilers that should be efficient for this are GCC, ICC and MSVC.
The answer is inevitably compiler and target specific, but even if 1ULL is wider than a pointer on whatever target architecture, a good compiler should optimize it away. Which 2's complement integer operations can be used without zeroing high bits in the inputs, if only the low part of the result is wanted? explains why a wider computation truncated to pointer width will give identical results as doing computation with pointer width in the first place. This is why compilers can optimize it away even on 32bit machines (or x86-64 with the x32 ABI) when 1ULL leads to promotion of the + operands to a 64bit type. (Or on some 64bit ABI for some architecture where long long is 128b).
1ULL looks optimal for 64bit, and for 32bit with clang. You don't care about 32bit anyway, but gcc wastes an instruction in the return p[u + 1ULL];. All the other cases compile to a single load with a scaled-index+4+p addressing mode. So other than one compiler's optimization failure, 1ULL looks fine for 32bit as well. (I think it's unlikely that this is a clang bug, i.e. that the optimization is illegal.)
int v1ULL(std::uint32_t u) { return p[u + 1ULL]; }
// ... load u from the stack
// add eax, 1
// mov eax, DWORD PTR p[0+eax*4]
instead of
mov eax, DWORD PTR p[4+eax*4]
Interestingly, gcc 5.3 doesn't make this mistake when targeting the x32 ABI (long mode with 32bit pointers and a register-call ABI similar to SysV AMD64). It uses a 32bit address-size prefix to avoid using the upper 32b of rdi.
Annoyingly, it still uses an address-size prefix when it could save a byte of machine code by using a 64bit effective address (when there's no chance of overflow/carry into the upper32 generating an address outside the low 4GiB). Passing the pointer by reference is a good example:
int x2 (char *&c) { return *c; }
// mov eax, DWORD PTR [edi] ; upper32 of rax is zero
// movsx eax, BYTE PTR [eax] ; could be byte [rax], saving one byte of machine code
Err, actually I forget. 32bit addresses might sign-extend to 64b, not zero-extend. If that's the case, it could have used movsx for the first instruction, too, but that would have cost a byte because movsx has a longer opcode than mov.
Anyway, x32 is still an interesting choice for pointer-heavy code that wants more registers and a nicer ABI, without the cache-miss hit of 8B pointers.
The 64bit asm has to zero the upper32 of the register holding the parameter (with mov edi,edi), but that goes away when inlining. Looking at godbolt output for tiny functions is a valid way to test this.
If we want to make doubly sure that the compiler isn't shooting itself in the foot and zeroing the upper32 when it should know it's already zero, we could make test functions with an arg passed by reference.
int v1ULL(const std::uint32_t &u) { return p[u + 1ULL]; }
// mov eax, DWORD PTR [rdi]
// mov eax, DWORD PTR p[4+rax*4]
Which of the two is faster?
1.
char* _pos ..;
short value = ..;
*((short*)_pos) = value;
2.
char* _pos ..;
short value = ..;
memcpy(_pos, &value, sizeof(short));
As with all "which is faster?" questions, you should benchmark it to see for yourself. And if it matters, then ask why and pick which you want.
In any case, your first example is technically undefined behavior since you are violating strict-aliasing. So if you had to choose without benchmarking, go with the second one.
To answer the actual question: which is faster will probably depend on the alignment of _pos. If it's aligned properly, then 1 will probably be faster. If not, then 2 might be faster, depending on how the compiler optimizes it. (1 might even crash if the hardware doesn't support misaligned access.)
But this is all guess-work. You really need to benchmark it to know for sure.
At the very least, you should look at the compiled assembly:
: *(short *)_pos = value;
mov WORD PTR [rcx], dx
vs.
: memcpy(_pos, &value, sizeof(short));
mov WORD PTR [rcx], dx
Which in this case (in MSVC) shows the exact same assembly with default optimizations. So you can expect the performance to be the same.
With gcc at an optimization level of -O1 or higher, the following two functions compile to exactly the same machine code on x86:
void foo(char *_pos, short value)
{
memcpy(_pos, &value, sizeof(short));
}
void bar(char *_pos, short value)
{
*(short *)_pos = value;
}
The compiler might implement them both the same way.
If it does it naively, assignment will be faster.
For any practical purpose, they'll both be done in no time, and you don't need to worry.
Also note that you may have alignment problems (_pos may not be aligned on 2 bytes, which may crash on some processors) and type-punning problems (the compiler may assume that what _pos points to isn't changed, because you wrote through a short *).
Does it matter? It might be that the first case will save you some cycles (depending on the compiler's sophistication and optimizations). But is it worth the readability and maintainability hit?
Many bugs are introduced because of premature optimization. You should first identify the bottleneck, and if this assignment is that bottleneck - benchmark each of the options (taking care of alignment and other issues mentioned here by others already).
The question is implementation-dependent. In practice, for doing nothing but copying sizeof(short) bytes, if one is going to be slower, it's going to be memcpy. For considerably larger data sets, if one is going to be faster, it's generally going to be memcpy.
As pointed out, #1 invokes undefined behavior.
We can see that simple assignment is certainly easier to read and write and less error prone than both. Clarity and correctness should come first, even in performance-critical areas for the simple reason that it's easier to optimize correct code than it is to fix optimized, incorrect code. If this is really a C++ question, the need for such code (casts or memcpy that bulldoze over the type system to x-ray and copy around bits) should be very, very rare.
If you are certain that there won't be an alignment issue, and you really find this is a bottleneck situation then go ahead and do the first.
If you are unhappy calling memcpy then do something like:
*pos = static_cast<char>(value & 0xff );
*(pos+1) = static_cast<char>(value >> 8 );
although if you are going to do that then use unsigned values.
The above code ensures you get little-endian too. (Obviously reverse the order of the assignments if you want big-endian). You might want a consistent endian-ness if the data is passed around as some kind of binary blob, which is, I assume, what you are trying to create.
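Wrapped up as a helper, that idea might look like this (a sketch, using an unsigned 16-bit type as suggested above):
#include <cstdint>
// Write a 16-bit value at pos in little-endian order, regardless of
// host endianness or the alignment of pos.
inline void write_le16(char* pos, std::uint16_t value)
{
    pos[0] = static_cast<char>(value & 0xff);
    pos[1] = static_cast<char>(value >> 8);
}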
You might wish to use something like google protocol buffers if you want to create binary blobs. There is also boost::serialize which includes binary serialization.
You can avoid breaking aliasing rules and calling a function by using a union:
union {
char* c;
short* s;
} _pos;
_pos.c = ...; // point the union at the destination bytes
short value = ...;
*_pos.s = value;
I have a huge function that sorts a very large amount of int data. The code works fine except that it's slower than it should be. My first step toward solving this is to place some asm code inside the C++. How can I interchange 2 variables using asm? I've tried this:
_asm{ push a[x]; push a[y]; pop a[x]; pop a[y];}
and this:
_asm(mov eax, a[x];mov ebx,a[y]; mov a[x],ebx; mov a[y],eax;}
but both crash. How can I save some time on these interchanges? I use VS 2010.
In general, it is very difficult to do better than your compiler with simple code like this.
A compiler, when faced with a swap operation on integers, will typically issue code like this:
mov eax, [x]
mov ebx, [y]
mov [x], ebx
mov [y], eax
Before you try to override, first check what the compiler is actually generating. If it's something like this, don't bother going any further; you won't be able to do better than this. Moreover, if you leave it to the compiler, it may, if these variables are used immediately thereafter, choose to reuse one of these registers to save on variable loads/stores as well. This is impossible with hand-coded assembly; the compiler must reload the variables after the black box that is hand-coded asm.
Note that the push/push/pop/pop sequence is likely to be much slower; not only does it add an additional four memory operations to the stack, it also introduces dependencies on the stack pointer, eliminating any possibility of pipelining. With the simple mov sequence, it is at least possible to run the pair of reads and pair of writes in parallel if they are on different memory banks, or one is in cache, etc. It also does not introduce stalls on the stack pointer in later code.
As such, you should not try to micro-optimize the cost of an interchange; instead, reduce the number of interchanges performed. There are many sorting algorithms available, each with slightly different characteristics. You may find some are better (cause fewer swaps) on your dataset than others.
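For instance, if the sort is hand-rolled, it's worth checking whether the standard library already beats it before reaching for assembly (a generic sketch, not tuned to the asker's data):
#include <algorithm>
#include <vector>
void sort_ints(std::vector<int>& data)
{
    // Introsort: O(n log n) comparisons and far fewer moves than a
    // naive exchange sort, with no hand-written asm required.
    std::sort(data.begin(), data.end());
}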
What makes you think you can produce faster assembly than an optimizing compiler?
Even if you'll get it to work properly, all you're likely to achieve is to confuse the optimizer to produce even slower code.
When you do inline assembly, you can change things so that assumptions the compiler has made about register contents are no longer true. Oftentimes EAX is used to pass a parameter or return a value, so trashing EAX might not have much effect, but you clobbered EBX and didn't put it back, and that could cause problems. Try pushing EBX before you use it, then pop it when you are done.
You can use variable names, function names and labels in assembly code as symbols. Note that something like a[x] is not a valid symbol.
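For example, a minimal sketch (VS2010 x86 inline asm) that copies the elements into plain locals the assembler can name, swaps through registers, and stores back:
// a is assumed to be an int array, x and y int indices.
// a[x] is not a valid __asm operand, but the locals below are.
int tmpA = a[x], tmpB = a[y];
__asm {
    mov eax, tmpA
    mov edx, tmpB
    mov tmpA, edx   // store back swapped
    mov tmpB, eax
}
a[x] = tmpA;
a[y] = tmpB;
Of course, a plain std::swap(a[x], a[y]) compiles to the same four-mov sequence shown in the other answer without the extra round trips, which is the real point here.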
Writing more efficient code takes skill and knowledge; using asm does not necessarily help you there.
You can compare the assembly code your compiler produces for the function with and without the inline assembler, to see where you broke it.
Does the C++ compiler optimize the multiply by two operation x*2 to a bitshift operation x<<1?
I would love to believe that yes.
Actually VS2008 optimizes this to x+x:
01391000 push ecx
int x = 0;
scanf("%d", &x);
01391001 lea eax,[esp]
01391004 push eax
01391005 push offset string "%d" (13920F4h)
0139100A mov dword ptr [esp+8],0
01391012 call dword ptr [__imp__scanf (13920A4h)]
int y = x * 2;
01391018 mov ecx,dword ptr [esp+8]
0139101C lea edx,[ecx+ecx]
In an x64 build it is even more explicit and uses:
int y = x * 2;
000000013FB9101E mov edx,dword ptr [x]
printf("%d", y);
000000013FB91022 lea rcx,[string "%d" (13FB921B0h)]
000000013FB91029 add edx,edx
This is with the optimization settings set to 'Maximize speed' (/O2).
This article from Raymond Chen could be interesting:
When is x/2 different from x>>1? :
http://blogs.msdn.com/oldnewthing/archive/2005/05/27/422551.aspx
Quoting Raymond:
Of course, the compiler is free to recognize this and rewrite your multiplication or shift operation. In fact, it is very likely to do this, because x+x is more easily pairable than a multiplication or shift. Your shift or multiply-by-two is probably going to be rewritten as something closer to an add eax, eax instruction.
[...]
Even if you assume that the shift fills with the sign bit, the results of the shift and the divide are different if x is negative.
(-1) / 2 ≡ 0
(-1) >> 1 ≡ -1
[...]
The moral of the story is to write what you mean. If you want to divide by two, then write "/2", not ">>1".
We can only assume it is wise to tell the compiler what you want, not how to do it: the compiler is better than a human at optimizing small-scale code (thanks to Daemin for pointing out this subtle point). If you really want optimization, use a profiler and study your algorithms' efficiency.
VS 2008 optimized mine to x << 1.
x = x * 2;
004013E7 mov eax,dword ptr [x]
004013EA shl eax,1
004013EC mov dword ptr [x],eax
EDIT: This was using VS default "Debug" configuration with optimization disabled (/Od). Using any of the optimization switches (/O1, /O2 (VS "Retail"), or /Ox) results in the add-self code Rob posted. Also, just for good measure, I verified that x = x << 1 is indeed treated the same way as x = x * 2 by the cl compiler in both /Od and /Ox. So, in summary, cl.exe version 15.00.30729.01 for x86 treats * 2 and << 1 identically, and I expect nearly all other recent compilers do the same.
Not if x is a float it won't.
Yes. They also optimize other similar operations, such as multiplying by non-powers of two that can be rewritten as the sums of some shifts. They will also optimize divisions by powers of 2 into right-shifts, but beware that when working with signed integers, the two operations are different! The compiler has to emit some extra bit twiddling instructions to make sure the results are the same for positive and negative numbers, but it's still faster than doing a division. It also similarly optimizes moduli by powers of 2.
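To make the signed case concrete, here's a sketch of the bit twiddling a compiler typically emits for x / 4 (assuming 32-bit int, two's complement, and an arithmetic right shift, which mainstream compilers provide):
int div4(int x)
{
    int bias = (x >> 31) & 3;   // 3 if x is negative, 0 otherwise
    return (x + bias) >> 2;     // rounds toward zero, matching x / 4
}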
The answer is "if it is faster" (or smaller). This depends on the target architecture heavily as well as the register usage model for a given compiler. In general, the answer is "yes, always" as this is a very simple peephole optimization to implement and is usually a decent win.
That's only the start of what optimizers can do. To see what your compiler does, look for the switch that causes it to emit assembler source. For the Digital Mars compilers, the output assembler can be examined with the OBJ2ASM tool. If you want to learn how your compiler works, looking at the assembler output can be very illuminating.
I'm sure they all do these kind of optimizations, but I wonder if they are still relevant. Older processors did multiplication by shifting and adding, which could take a number of cycles to complete. Modern processors, on the other hand, have a set of barrel-shifters which can do all the necessary shifts and additions simultaneously in one clock cycle or less. Has anyone actually benchmarked whether these optimizations really help?
Yes, they will.
Unless something is specified in a language's standard, you'll never get a guaranteed answer to such a question. When in doubt, have your compiler spit out assembly code and check. That's going to be the only way to really know.
@Ferruccio Barletta
That's a good question. I went Googling to try to find the answer.
I couldn't find answers for Intel processors directly, but this page has someone who tried to time things. It shows shifts to be more than twice as fast as adds and multiplies. Bit shifts are so simple (where a multiply could be a shift and an addition) that this makes sense.
So then I Googled AMD, and found an old optimization guide for the Athlon from 2002 that lists the fastest ways to multiply numbers by constants between 2 and 32. Interestingly, it depends on the number. Some are adds, some shifts. It's on page 122.
A guide for the Athlon 64 shows the same thing (page 164 or so). It says multiplies are 3 (in 32-bit) or 4 (in 64-bit) cycle operations, where shifts are 1 and adds are 2.
It seems it is still useful as an optimization.
Ignoring cycle counts though, this kind of method would prevent you from tying up the multiplication execution units (possibly), so if you were doing lots of multiplications in a tight loop where some use constants and some don't, the extra scheduling room might be useful.
But that's speculation.
Why do you think that's an optimization?
Why not 2*x → x+x? Or maybe the multiplication operation is as fast as the left-shift operation (maybe only if only one bit is set in the multiplicand)? If you never use the result, why not leave it out from the compiled output? If the compiler already loaded 2 to some register, maybe the multiplication instruction will be faster e.g. if we'd have to load the shift count first. Maybe the shift operation is larger, and your inner loop would no longer fit into the prefetch buffer of the CPU thus penalizing performance? Maybe the compiler can prove that the only time you call your function x will have the value 37 and x*2 can be replaced by 74? Maybe you're doing 2*x where x is a loop count (very common, though implicit, when looping over two-byte objects)? Then the compiler can change the loop from
for(int x = 0; x < count; ++x) ...2*x...
to the equivalent (leaving aside pathologies)
int count2 = count * 2;
for(int x = 0; x < count2; x += 2) ...x...
which replaces count multiplications with a single multiplication, or it might be able to leverage the lea instruction which combines the multiplication with a memory read.
My point is: there are millions of factors deciding whether replacing x*2 by x<<1 yields a faster binary. An optimizing compiler will try to generate the fastest code for the program it is given, not for an isolated operation. Therefore optimization results for the same code can vary largely depending on the surrounding code, and they may not be trivial at all.
Generally, there are very few benchmarks that show large differences between compilers. It is therefore a fair assumption that compilers are doing a good job because if there were cheap optimizations left, at least one of the compilers would implement them -- and all the others would follow in their next release.
It depends on what compiler you have. Visual C++ for example is notoriously poor in optimizing. If you edit your post to say what compiler you are using, it would be easier to answer.