Are compilers able to avoid branching instructions? - c++

I've been reading about bit twiddling hacks and thought, are compilers able to avoid branching in the following code:
constexpr int min(const int lhs, const int rhs) noexcept {
    if (lhs < rhs) {
        return lhs;
    }
    return rhs;
}
by replacing it with (explanation):
constexpr int min(const int lhs, const int rhs) noexcept {
    return rhs ^ ((lhs ^ rhs) & -(lhs < rhs));
}

Are compilers able to...? Yes, definitely!
Can I rely on those optimizations? No, you can't rely on any particular optimization. There may always be some strange condition under which a compiler chooses not to apply it for some non-obvious reason, or simply fails to see the possibility. In general, I've observed that compilers are sometimes a lot dumber than people think (or the people, including me, are dumber than they think).(1)
Not asked, but a very important aspect: Can I rely on this actually being an optimization? NO! For one, performance (especially on x86) always depends on the surrounding code, and there are a lot of different optimizations that interact. Also, some architectures offer instructions that implement the operation even more efficiently.
Should I use bit-twiddling optimizations? In general: no - especially not without verifying that they actually give you any benefit! Even when they do improve performance, they make your code harder to read and review, and they bake in architecture- and compiler-specific assumptions (representation of integers, execution time of instructions, penalty for branch misprediction ...) that might lead to worse performance when you port your code to another architecture, or - in the worst case - even to wrong results.
My advice:
If you need to squeeze out the last bit of performance on a specific system, then just try both variants and measure (and re-verify the result every time you update your CPU and/or compiler). For any other case, assume that the compiler is at least as good at low-level optimizations as you are. In any case, I'd suggest that you first learn about the optimization-related compiler flags and set up a proper benchmark BEFORE starting to use low-level optimizations.
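For illustration, here is a minimal sketch of what such a measurement could look like (entirely my own; the input data, sizes, and timing method are arbitrary choices, and a toy micro-benchmark like this has its own pitfalls - warm-up, inlining, data patterns - so treat its numbers with suspicion):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

static int min_branch(int a, int b) { return a < b ? a : b; }
static int min_bits(int a, int b)   { return b ^ ((a ^ b) & -(a < b)); }

int main() {
    // Pseudo-random-ish input so the branch predictor can't trivially win.
    std::vector<int> a(1 << 22), b(1 << 22);
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] = static_cast<int>(i * 2654435761u);
        b[i] = static_cast<int>(i * 40503u);
    }

    volatile int sink = 0;  // volatile keeps the loops from being optimized away entirely
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < a.size(); ++i) sink = min_branch(a[i], b[i]);
    auto t1 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < a.size(); ++i) sink = min_bits(a[i], b[i]);
    auto t2 = std::chrono::steady_clock::now();
    (void)sink;

    using ns = std::chrono::nanoseconds;
    std::printf("branchy:      %lld ns\n", (long long)std::chrono::duration_cast<ns>(t1 - t0).count());
    std::printf("bit-twiddled: %lld ns\n", (long long)std::chrono::duration_cast<ns>(t2 - t1).count());
}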
I think the only area where hand optimizations are still sometimes beneficial is making optimal use of vector units. Modern compilers can auto-vectorize many things, but this is still a relatively young field, and there are certain transformations the compiler is simply not allowed to perform because they would violate guarantees from the standard (especially where floating-point operations are concerned).
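For example (my own illustration), a plain float summation like the sketch below is typically not auto-vectorized at -O2/-O3 alone: vectorizing it would reassociate the additions, which changes rounding for IEEE floats, so the compiler needs explicit permission such as -ffast-math or -fassociative-math:

#include <cstddef>

float sum(const float* p, std::size_t n) {
    float s = 0.0f;
    // Each iteration depends on the previous one; splitting the additions
    // into vector lanes would change the rounding, so without -ffast-math
    // the compiler keeps this as a scalar loop.
    for (std::size_t i = 0; i < n; ++i)
        s += p[i];
    return s;
}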
(1) Some people seem to think that, regardless of what their code looks like, the compiler will always produce optimal code with the same semantics. For one, there are limits to what a compiler can do in a limited amount of time (lots of heuristics that work MOST of the time, but not always). Second, in many cases the C++ standard requires the compiler to provide certain guarantees that you are not actually interested in at the moment, but which still prevent optimizations.

clang++ (3.5.2-1) seems to be smart enough at -O3 (I'm not using C++11 or C++14; constexpr and noexcept were removed from the source):
08048760 <_Z3minii>:
8048760: 8b 44 24 08 mov 0x8(%esp),%eax
8048764: 8b 4c 24 04 mov 0x4(%esp),%ecx
8048768: 39 c1 cmp %eax,%ecx
804876a: 0f 4e c1 cmovle %ecx,%eax
804876d: c3 ret
gcc (4.9.3) (-O3) instead branches with jle:
08048740 <_Z3minii>:
8048740: 8b 54 24 08 mov 0x8(%esp),%edx
8048744: 8b 44 24 04 mov 0x4(%esp),%eax
8048748: 39 d0 cmp %edx,%eax
804874a: 7e 02 jle 804874e <_Z3minii+0xe>
804874c: 89 d0 mov %edx,%eax
804874e: f3 c3 repz ret
(x86 32bit)
This min2 (mangled) is the bit-twiddling alternative (from gcc):
08048750 <_Z4min2ii>:
8048750: 8b 44 24 08 mov 0x8(%esp),%eax
8048754: 8b 54 24 04 mov 0x4(%esp),%edx
8048758: 31 c9 xor %ecx,%ecx
804875a: 39 c2 cmp %eax,%edx
804875c: 0f 9c c1 setl %cl
804875f: 31 c2 xor %eax,%edx
8048761: f7 d9 neg %ecx
8048763: 21 ca and %ecx,%edx
8048765: 31 d0 xor %edx,%eax
8048767: c3 ret

It's possible for a compiler to detect this pattern and replace it with your proposal.
However, neither clang++ nor g++ performs this particular rewrite; see for instance g++ 5.2.0's assembly output.

Related

Why is there a locked xadd instruction in this disassembled std::string dtor?

I have a very simple code:
#include <string>
#include <iostream>
int main() {
    std::string s("abc");
    std::cout << s;
}
Then, I compiled it:
g++ -Wall test_string.cpp -o test_string -std=c++17 -O3 -g3 -ggdb3
And then disassembled it, and the most interesting piece is:
00000000004009a0 <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10>:
4009a0: 48 81 ff a0 11 60 00 cmp rdi,0x6011a0
4009a7: 75 01 jne 4009aa <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0xa>
4009a9: c3 ret
4009aa: b8 00 00 00 00 mov eax,0x0
4009af: 48 85 c0 test rax,rax
4009b2: 74 11 je 4009c5 <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0x25>
4009b4: 83 c8 ff or eax,0xffffffff
4009b7: f0 0f c1 47 10 lock xadd DWORD PTR [rdi+0x10],eax
4009bc: 85 c0 test eax,eax
4009be: 7f e9 jg 4009a9 <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0x9>
4009c0: e9 cb fd ff ff jmp 400790 <_ZdlPv@plt>
4009c5: 8b 47 10 mov eax,DWORD PTR [rdi+0x10]
4009c8: 8d 50 ff lea edx,[rax-0x1]
4009cb: 89 57 10 mov DWORD PTR [rdi+0x10],edx
4009ce: eb ec jmp 4009bc <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0x1c>
Why does _ZNSs4_Rep10_M_disposeERKSaIcE.isra.10 (which is std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Rep::_M_dispose(std::allocator<char> const&) [clone .isra.10]) use a lock-prefixed xadd?
A follow-up question is: how can I avoid it?
It looks like code associated with copy-on-write (COW) strings. The locked instruction decrements a reference count, and operator delete is called only if the reference count for the possibly shared buffer containing the actual string data drops to zero (i.e., the buffer is not shared: no other string object refers to it).
Since libstdc++ is open source, we can confirm this by looking at the source!
The function you've disassembled, _ZNSs4_Rep10_M_disposeERKSaIcE, demangles(1) to std::basic_string<char>::_Rep::_M_dispose(std::allocator<char> const&). Here's the corresponding source for libstdc++ in the gcc-4.x era(2):
void
_M_dispose(const _Alloc& __a)
{
#if _GLIBCXX_FULLY_DYNAMIC_STRING == 0
  if (__builtin_expect(this != &_S_empty_rep(), false))
#endif
    {
      // Be race-detector-friendly. For more info see bits/c++config.
      _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&this->_M_refcount);
      if (__gnu_cxx::__exchange_and_add_dispatch(&this->_M_refcount,
                                                 -1) <= 0)
        {
          _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&this->_M_refcount);
          _M_destroy(__a);
        }
    }
}   // XXX MT
Given that, we can annotate the assembly you provided, mapping each instruction back to the C++ source:
00000000004009a0 <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10>:
# the next two lines implement the check:
# if (__builtin_expect(this != &_S_empty_rep(), false))
# which is an empty string optimization. The S_empty_rep singleton
# is at address 0x6011a0 and if the current buffer points to that
# we are done (execute the ret)
4009a0: cmp rdi,0x6011a0
4009a7: jne 4009aa <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0xa>
4009a9: ret
# now we are in the implementation of
# __gnu_cxx::__exchange_and_add_dispatch(&this->_M_refcount, -1)
# which dispatches either to an atomic version of the add function
# or the non-atomic version, depending on the value of `eax` which
# is always directly set to zero, so the non-atomic version is
# *always called* (see details below)
4009aa: mov eax,0x0
4009af: test rax,rax
4009b2: je 4009c5 <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0x25>
# this is the atomic version of the decrement you were concerned about
# but we never execute this code because the test above always jumps
# to 4009c5 (the non-atomic version)
4009b4: or eax,0xffffffff
4009b7: lock xadd DWORD PTR [rdi+0x10],eax
4009bc: test eax,eax
# check if the result of the xadd was zero, if not skip the delete
4009be: jg 4009a9 <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0x9>
# the delete call
4009c0: jmp 400790 <_ZdlPv@plt> # tailcall
# the non-atomic version starts here, this is the code that is
# always executed
4009c5: mov eax,DWORD PTR [rdi+0x10]
4009c8: lea edx,[rax-0x1]
4009cb: mov DWORD PTR [rdi+0x10],edx
# this jumps up to the test eax,eax check which calls operator delete
# if the refcount was zero
4009ce: jmp 4009bc <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0x1c>
A key note is that the lock xadd code you were concerned about is never executed. There is a mov eax, 0 followed by a test rax, rax; je - this test always succeeds and the jump always occurs because rax is always zero.
What's happening here is that __gnu_cxx::__exchange_and_add_dispatch is implemented in a way that first checks whether the process is definitely single-threaded. If it is, it doesn't bother to use expensive atomic instructions - it simply uses a regular non-atomic addition. It does this by checking the address of a pthreads function, __pthread_key_create: if this address is zero, the pthread library hasn't been linked in, and hence the process is definitely single-threaded. In your case, the address of this pthread function gets resolved at link time to 0 (you didn't have -lpthread on your command line), which is where the mov eax,0x0 comes from. At link time it's too late to optimize based on this knowledge, so the vestigial atomic-increment code remains but never executes. This mechanism is described in more detail in this answer.
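The general shape of that check looks roughly like the sketch below (my own simplification; the real libstdc++/gthreads code differs in the details, and on newer glibc, where pthreads lives inside libc itself, the symbol is always resolved):

// Weak declaration: if nothing that defines this symbol is linked in, its
// address is simply null instead of producing a link error. The signature
// here is only illustrative; what matters is taking the address.
extern "C" int __pthread_key_create(unsigned int*, void (*)(void*)) __attribute__((weak));

static bool process_may_be_multithreaded() {
    return &__pthread_key_create != nullptr;  // null => libpthread was never linked in
}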
The code that does execute is the last part of the function, starting at 4009c5. This code also decrements the reference count, but in a non-atomic way. The check that decides between these two options is based on whether the process is multithreaded, i.e., whether -lpthread has been linked. For whatever reason, this check inside __exchange_and_add_dispatch is implemented in a way that prevents the compiler from actually removing the atomic half of the branch, even though the fact that it will never be taken is known at some point during the build (after all, the hard-coded mov eax,0 got there somehow).
A follow-up question is: how can I avoid it?
Well, you've already avoided the lock xadd part, so if that's all you care about, you're good to go. However, you still have cause for concern:
Copy-on-write std::string implementations are not standard-conforming due to changes made in C++11, so the question remains why exactly you are getting this COW string behavior even when specifying -std=c++17.
The problem is most likely distribution-related: CentOS 7 by default ships an ancient gcc (< 5) which still uses the non-conforming COW strings. However, you mention that you are using gcc 8.2.1, which in a normal install defaults to non-COW strings. It seems that if you installed 8.2.1 via the RHEL "devtoolset" method, you get a new gcc that still uses the old ABI and links against the old system libstdc++.
To confirm this, you might want to check the value of the _GLIBCXX_USE_CXX11_ABI macro in your test program, and also your libstdc++ version (the version information here might prove useful).
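A quick way to check the macro is to print it from a trivial program built with exactly the same compiler invocation and flags as your real code (a minimal sketch):

#include <cstdio>
#include <string>   // any libstdc++ header pulls in the ABI configuration macros

int main() {
#ifdef _GLIBCXX_USE_CXX11_ABI
    // 1 means the new (non-COW) std::string ABI, 0 means the old COW ABI
    std::printf("_GLIBCXX_USE_CXX11_ABI = %d\n", _GLIBCXX_USE_CXX11_ABI);
#else
    std::printf("_GLIBCXX_USE_CXX11_ABI is not defined (pre-gcc-5 libstdc++, COW strings)\n");
#endif
}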
You can avoid it by using an OS other than CentOS that doesn't ship ancient gcc and glibc versions. If you need to stick with CentOS for some reason, you'll have to look into whether there is a supported way to use a newer libstdc++ on that distribution. You could also consider using a containerization technology to build an executable independent of the library versions on your local host.
(1) You can demangle it like so: echo '_ZNSs4_Rep10_M_disposeERKSaIcE' | c++filt.
(2) I'm using gcc-4-era source since I'm guessing that's what you end up using on CentOS 7.

Do macros in C++ improve performance?

I'm a beginner in C++ and I've just read that macros work by replacing text whenever needed. In this case, does this mean that it makes the .exe run faster? And how is this different than an inline function?
For example, if I have the following macro:
#define SQUARE(x) ((x) * (x))
and a normal function:
int Square(const int& x)
{
    return x * x;
}
and an inline function:
inline int Square(const int& x)
{
    return x * x;
}
What are the main differences between these three and especially between the inline function and the macro? Thank you.
You should avoid using macros if possible. Inline functions are always the better choice, as they are type safe. An inline function should be as fast as a macro (if it is indeed inlined by the compiler; note that the inline keyword is not binding but just a hint to the compiler, which may ignore it if inlining is not possible).
PS: as a matter of style, avoid using const Type& for parameter types that are fundamental, like int or double. Simply use the type itself, in other words, use
int Square(int x)
since passing by value won't hurt performance here (passing a reference may even make it worse); see e.g. this question for more details.
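To make the type-safety point concrete, here is a small sketch (my own example) of how the macro and the int function behave differently for a non-int argument:

#include <iostream>

#define SQUARE(x) ((x) * (x))

int Square(int x) { return x * x; }

int main() {
    std::cout << SQUARE(1.5) << '\n';  // 2.25 - the macro just pastes text, so this is a double multiply
    std::cout << Square(1.5) << '\n';  // 1    - 1.5 is silently converted to int 1 before the call
}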
Macros amount to dumb replacement of pattern A with pattern B, and everything happens before the compiler proper kicks in. They sometimes come in handy, but in general they should be avoided: you can do a lot with them, and later, in the debugger, you have no idea what is going on.
Besides: your approach to performance is, to put it kindly, naive. First you learn the language (which is hard for modern C++, because there are a ton of important concepts one absolutely needs to know and understand). Then you practice, practice, practice. And only when you reach the point where your existing application has real performance problems do you profile it to understand the actual issue.
In other words: if you are interested in performance, you are asking the wrong question. You should worry much more about architecture (potential bottlenecks), configuration (latency between different nodes in your system), and so on. Of course, apply common sense and don't write code that obviously wastes memory or CPU cycles. But sometimes a piece of code that runs 50% slower might be 500% easier to read and maintain. And if execution time is then 500 ms instead of 250 ms, that might be totally OK (unless that specific part is called a thousand times per minute).
The difference between a macro and an inlined function is that a macro is dealt with before the compiler sees it.
On my compiler (clang++) without optimisation flags the square function won't be inlined. The code it generates looks like this
4009f0: 55 push %rbp
4009f1: 48 89 e5 mov %rsp,%rbp
4009f4: 89 7d fc mov %edi,-0x4(%rbp)
4009f7: 8b 7d fc mov -0x4(%rbp),%edi
4009fa: 0f af 7d fc imul -0x4(%rbp),%edi
4009fe: 89 f8 mov %edi,%eax
400a00: 5d pop %rbp
400a01: c3 retq
The imul is the assembly instruction doing the work; the rest is moving data around.
Code that calls it looks like this:
400969: e8 82 00 00 00 callq 4009f0 <_Z6squarei>
If I add the -O3 flag to inline it, the imul shows up in the main function, where the function is called from in the C++ code:
0000000000400a10 <main>:
400a10: 41 56 push %r14
400a12: 53 push %rbx
400a13: 50 push %rax
400a14: 48 8b 7e 08 mov 0x8(%rsi),%rdi
400a18: 31 f6 xor %esi,%esi
400a1a: ba 0a 00 00 00 mov $0xa,%edx
400a1f: e8 9c fe ff ff callq 4008c0 <strtol#plt>
400a24: 48 89 c3 mov %rax,%rbx
400a27: 0f af db imul %ebx,%ebx
It's a reasonable thing to get a basic handle on assembly language for your machine, and to use gcc -S on your source or objdump -D on your binary (as I did here) to see exactly what is going on.
Using the macro instead of the inlined function produces something very similar:
0000000000400a10 <main>:
400a10: 41 56 push %r14
400a12: 53 push %rbx
400a13: 50 push %rax
400a14: 48 8b 7e 08 mov 0x8(%rsi),%rdi
400a18: 31 f6 xor %esi,%esi
400a1a: ba 0a 00 00 00 mov $0xa,%edx
400a1f: e8 9c fe ff ff callq 4008c0 <strtol#plt>
400a24: 48 89 c3 mov %rax,%rbx
400a27: 0f af db imul %ebx,%ebx
Note one of the many dangers of macros: what does this do?
x = 5; std::cout << SQUARE(++x) << std::endl;
36? Nope, 42. It becomes
std::cout << ++x * ++x << std::endl;
which works out to 6 * 7 here. (Strictly speaking, modifying x twice without sequencing is undefined behaviour, so even 42 isn't guaranteed.)
Don't be put off by people telling you not to care about optimisation. Using C or C++ as your language is an optimisation in itself. Just try to work out if you're wasting time with it and be sensible.
Macros just perform text substitution to modify source code.
As such, macros don't inherently affect performance of code. The techniques you use to design and code obviously affect performance. So the only implication of macros on performance is based on what the macro does (i.e. what code you write the macro to emit).
The big danger of macros is that they do not respect scope. The changes they make are unconditional, cross function boundaries, and things like that. There are a lot of subtleties in writing macros to make them behave as intended (avoid unintended side effects in code, avoid undefined behaviour, etc). This means code which uses macros is harder to understand, and harder to get right.
At best, with modern compilers, the performance gain you can get from macros is the same as can be achieved with inline functions - at the expense of a greater chance of the code behaving incorrectly. You are therefore better off using inline functions - unlike macros they are typesafe and work consistently with other code.
Modern compilers might choose to not inline a function, even if you have specified it as inline. If that happens, you generally don't need to worry - modern compilers are able to do a better job than most modern programmers in deciding whether a function should be inlined.
Using such a macro only makes sense if its argument is itself a #define'd constant, as the computation will then be folded at compile time. Even then, double-check that the result is the expected one.
When working on classic variables, the (inlined) function form should be preferred as:
It is type-safe;
It will handle expressions used as an argument in a consistent way. This not only covers the case of pre/post increments quoted by Peter: when the argument is itself some computation-intensive expression, the macro form forces that argument to be evaluated twice (and the two evaluations may not even yield the same value), versus only once for the function; see the sketch below.
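A small sketch of that double-evaluation issue; expensive() is a hypothetical stand-in for any costly (or side-effecting) computation:

#include <iostream>

#define SQUARE(x) ((x) * (x))
inline int square(int x) { return x * x; }

static int calls = 0;
int expensive() { ++calls; return 7; }  // hypothetical stand-in for a costly computation

int main() {
    calls = 0;
    int a = SQUARE(expensive());        // expands to (expensive()) * (expensive()): two calls
    std::cout << "macro:    result " << a << ", calls " << calls << '\n';   // calls == 2
    calls = 0;
    int b = square(expensive());        // the argument is evaluated exactly once
    std::cout << "function: result " << b << ", calls " << calls << '\n';   // calls == 1
}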
I have to admit that I used to write such macros for quick prototyping of apparently simple functions, but the time they have cost me over the years finally changed my mind!

Data races, UB, and counters in C++11

The following pattern is commonplace in lots of software that wants to tell its user how many times it has done various things:
int num_times_done_it; // global

void doit() {
    ++num_times_done_it;
    // do something
}

void report_stats() {
    printf("called doit %i times\n", num_times_done_it);
    // and probably some other stuff too
}
Unfortunately, if multiple threads can call doit without some sort of synchronisation, the concurrent read-modify-writes to num_times_done_it may be a data race and hence the entire program's behaviour would be undefined. Further, if report_stats can be called concurrently with doit absent any synchronisation, there's another data race between the thread modifying num_times_done_it and the thread reporting its value.
Often, the programmer just wants a mostly-right count of the number of times doit has been called with as little overhead as possible.
(If you consider this example trivial, Hogwild! gains a significant speed advantage over a data-race-free stochastic gradient descent using essentially this trick. Also, I believe the Hotspot JVM does exactly this sort of unguarded, multithreaded access to a shared counter for method invocation counts---though it's in the clear since it generates assembly code instead of C++11.)
Apparent non-solutions:
Atomics, with any memory order I know of, fail "as little overhead as possible" here (an atomic increment can be considerably more expensive than an ordinary increment) while overdelivering on "mostly-right" (by being exactly right).
I don't believe tossing volatile into the mix makes data races OK, so replacing the declaration of num_times_done_it by volatile int num_times_done_it doesn't fix anything.
There's the awkward solution of having a separate counter per thread and adding them all up in report_stats, but that doesn't solve the data race between doit and report_stats. Also, it's messy, it assumes the updates are associative, and doesn't really fit Hogwild!'s usage.
Is it possible to implement invocation counters with well-defined semantics in a nontrivial, multithreaded C++11 program without some form of synchronisation?
EDIT: It seems that we can do this in a slightly indirect way using memory_order_relaxed:
atomic<int> num_times_done_it;

void doit() {
    num_times_done_it.store(1 + num_times_done_it.load(memory_order_relaxed),
                            memory_order_relaxed);
    // as before
}
However, gcc 4.8.2 generates this code on x86_64 (with -O3):
0: 8b 05 00 00 00 00 mov 0x0(%rip),%eax
6: 83 c0 01 add $0x1,%eax
9: 89 05 00 00 00 00 mov %eax,0x0(%rip)
and clang 3.4 generates this code on x86_64 (again with -O3):
0: 8b 05 00 00 00 00 mov 0x0(%rip),%eax
6: ff c0 inc %eax
8: 89 05 00 00 00 00 mov %eax,0x0(%rip)
My understanding of x86-TSO is that both of these code sequences are, barring interrupts and funny page protection flags, entirely equivalent to the one-instruction memory inc and the one-instruction memory add generated by the straightforward code. Does this use of memory_order_relaxed constitute a data race?
Count for each thread separately and sum up after the threads have joined. For intermediate results, you may also sum up in between, though your result might be slightly off. This pattern is also faster. You might embed it into a basic helper class for your threads so you have it everywhere if you use it often.
And - depending on compiler and platform - atomics aren't that expensive (see Herb Sutter's "atomic weapons" talk http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2), but in your case they would create problems with the caches, so they're not advisable.
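A sketch of the per-thread-counter idea (my own arrangement, one of several possible): each thread bumps its own counter, which only it ever writes, so a relaxed load/store pair (plain movs on x86) cannot lose updates, and report_stats sums a snapshot that may be slightly stale, matching the "mostly right" requirement:

#include <atomic>
#include <cstdio>
#include <mutex>
#include <vector>

struct Counter { std::atomic<long> n{0}; };

std::mutex reg_mutex;
std::vector<Counter*> registry;   // protected by reg_mutex

thread_local Counter* my_counter = [] {
    auto* c = new Counter;        // intentionally leaked in this sketch
    std::lock_guard<std::mutex> lk(reg_mutex);
    registry.push_back(c);
    return c;
}();

void doit() {
    auto& n = my_counter->n;
    // Only the owning thread writes n, so a relaxed load + store cannot lose updates.
    n.store(n.load(std::memory_order_relaxed) + 1, std::memory_order_relaxed);
    // do something
}

void report_stats() {
    long total = 0;
    std::lock_guard<std::mutex> lk(reg_mutex);
    for (auto* c : registry)
        total += c->n.load(std::memory_order_relaxed);  // may be slightly stale
    std::printf("called doit %ld times\n", total);
}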
It seems that the memory_order_relaxed trick is the right way to do this.
This blog post by Dmitry Vyukov at Intel begins by answering exactly my question, and proceeds to list the memory_order_relaxed store and load as the proper alternative.
I am still unsure of whether this is really OK; in particular, N3710 makes me doubt that I ever understood memory_order_relaxed in the first place.

GCC function padding value

Whenever I compile C or C++ code with optimizations enabled, GCC aligns functions to a 16-byte boundary (on IA-32). If the function is shorter than 16 bytes, GCC pads it with some bytes, which don't seem to be random at all:
19: c3 ret
1a: 8d b6 00 00 00 00 lea 0x0(%esi),%esi
It always seems to be either 8d b6 00 00 00 00 ... or 8d 74 26 00.
Do function padding bytes have any significance?
The padding is created by the assembler, not by gcc. It merely sees a .align directive (or equivalent) and doesn't know whether the space to be padded is inside a function (e.g. loop alignment) or between functions, so it must insert NOPs of some sort. Modern x86 assemblers use the largest possible NOP opcodes with the intention of spending as few cycles as possible if the padding is for loop alignment.
Personally, I'm extremely skeptical of alignment as an optimization technique. I've never seen it help much, and it can definitely hurt by increasing the total code size (and cache footprint) tremendously. If you use the -Os optimization level, it's off by default, so there's nothing to worry about. Otherwise you can disable all the alignment with the proper -f options, as shown below.
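For reference, the relevant GCC options look roughly like this (foo.cpp is a placeholder; an alignment of 1 effectively means "no extra padding", and exact defaults vary by GCC version and target, so check your version's documentation):

g++ -O2 -falign-functions=1 -falign-loops=1 -falign-jumps=1 -falign-labels=1 -c foo.cpp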
The assembler first sees an .align directive. Since it doesn't know whether this address is within a function body or not, it cannot simply emit 0x00 bytes and must instead generate some form of NOP (such as 0x90).
However:
lea esi,[esi+0x0] ; does nothing, pseudocode: ESI = ESI + 0
executes in fewer clock cycles than
nop
nop
nop
nop
nop
nop
If this code happened to fall within a function body (for instance, loop alignment), the lea version would be much faster, while still "doing nothing."
The instruction lea 0x0(%esi),%esi just computes the address %esi + 0 and puts it back into %esi - it's a no-operation (NOP), which means that if it is executed it has no effect.
It just happens to be a single-instruction, 6-byte NOP; 8d 74 26 00 is a 4-byte encoding of the same operation.

In C++, which is faster? (2 * i + 1) or (i << 1 | 1)?

I realize that the answer is probably hardware specific, but I'm curious if there was a more general intuition that I'm missing?
I asked this question, and given the answer, I'm now wondering whether I should alter my approach in general to use (i << 1 | 1) instead of (2 * i + 1)?
Since the ISO standard doesn't actually mandate performance requirements, this will depend on the implementation, the compiler flags chosen, the target CPU and quite possibly the phase of the moon.
These sorts of optimisations (saving a couple of cycles) almost always pale into insignificance in terms of return on investment, compared with macro-level optimisations like algorithm selection.
Aim for readability of code first and foremost. If your intent is to shift bits and OR, use the bit-shift version. If your intent is to multiply, use the * version. Only worry about performance once you've established there's an issue.
Any decent compiler will optimise it far better than you can anyway :-)
Just an experiment regarding answers given about "... it'll use LEA":
The following code:
int main(int argc, char **argv)
{
#ifdef USE_SHIFTOR
    return (argc << 1 | 1);
#else
    return (2 * argc + 1);
#endif
}
will, with gcc -fomit-frame-pointer -O8 -m{32|64} (for 32bit or 64bit) compile into the following assembly code:
x86, 32bit:080483a0 <main>:
80483a0: 8b 44 24 04 mov 0x4(%esp),%eax
80483a4: 8d 44 00 01 lea 0x1(%eax,%eax,1),%eax
80483a8: c3 ret
x86, 64bit:00000000004004c0 <main>:
4004c0: 8d 44 3f 01 lea 0x1(%rdi,%rdi,1),%eax
4004c4: c3 retq
x86, 32bit, -DUSE_SHIFTOR:080483a0 <main>:
80483a0: 8b 44 24 04 mov 0x4(%esp),%eax
80483a4: 01 c0 add %eax,%eax
80483a6: 83 c8 01 or $0x1,%eax
80483a9: c3 ret
x86, 64bit, -DUSE_SHIFTOR:00000000004004c0 <main>:
4004c0: 8d 04 3f lea (%rdi,%rdi,1),%eax
4004c3: 83 c8 01 or $0x1,%eax
4004c6: c3 retq
In fact, it's true that most cases will use LEA. Yet the code is not the same for the two expressions. There are two reasons for that:
addition can overflow and wrap around, while bit operations like << or | cannot
(x + 1) == (x | 1) is only true if !(x & 1); otherwise the addition carries over to the next bit. For example, 2 + 1 == 3 == (2 | 1), but 3 + 1 == 4 while (3 | 1) == 3. In general, adding one sets only the lowest bit in just half of the cases.
While we (and probably the compiler) know that the second concern doesn't apply here (2 * argc always has its low bit clear), the first is still a possibility. The compiler therefore creates different code, since the "or-version" requires forcing bit zero to 1.
Any but the most brain-dead compiler will see those expressions as equivalent and compile them to the same executable code.
Typically it's not really worth worrying too much about optimizing simple arithmetic expressions like these, since it's the sort of thing compilers are best at optimizing. (Unlike many other cases in which a "smart compiler" could do the right thing, but an actual compiler falls flat.)
This will work out to the same pair of instructions on PPC, Sparc, and MIPS, by the way: a shift followed by an add. On the ARM it'll cook down to a single fused shift-add instruction, and on x86 it'll probably be a single LEA op.
Output of gcc with the -S option (no compiler flags given):
.LCFI3:
movl 8(%ebp), %eax
addl %eax, %eax
orl $1, %eax
popl %ebp
ret
.LCFI1:
movl 8(%ebp), %eax
addl %eax, %eax
addl $1, %eax
popl %ebp
ret
I'm not sure which one is which, but I don't believe it matters.
If the compiler does no optimizations at all, then the second would probably translate to faster assembly instructions. How long each instruction takes is completely architecture-dependent. Most compilers will optimize them to be the same assembly-level instructions.
I just tested this with gcc-4.7.1 using FrankH's source; the code generated is
lea 0x1(%rdi,%rdi,1),%eax
retq
no matter whether the shift or the multiplication version is used.
Nobody cares. Nor should they.
Quit worrying about that and get your code correct, simple, and done.
i + i + 1 may be faster than the other two, because addition is faster than multiplication and can be faster than a shift.
The faster is the first form (the one with the shift): the shift instruction takes 4 clock cycles to complete in the worst case, while mul takes 10 in the best case. However, the best form should be chosen by the compiler, because it has a complete view of the other (assembly) instructions.