Explanation for GCC compiler optimisation's adverse performance effect?

Explanation for GCC compiler optimisation's adverse performance effect? - c++

Please note: this question is neither about code quality, and ways to improve the code, nor about the (in)significance of the runtime differences. It is about GCC and why which compiler optimisation costs performance.
The program
The following code counts the number of Fibonacci primes up to m:
int main() {
unsigned int m = 500000000u;
unsigned int i = 0u;
unsigned int a = 1u;
unsigned int b = 1u;
unsigned int c = 1u;
unsigned int count = 0u;
while (a + b <= m) {
for (i = 2u; i < a + b; ++i) {
c = (a + b) % i;
if (c == 0u) {
i = a + b;
// break;
}
}
if (c != 0u) {
count = count + 1u;
}
a = a + b;
b = a - b;
}
return count; // Just to "output" (and thus use) count
}
When compiled with g++.exe (Rev2, Built by MSYS2 project) 9.2.0 and no optimisations (-O0), the resulting binary executes (on my machine) in 1.9s. With -O1 and -O3 it takes 3.3s and 1.7s, respectively.
I've tried to make sense of the resulting binaries by looking at the assembly code (godbolt.org) and the corresponding control-flow graph (hex-rays.com/products/ida), but my assembler skills don't suffice.
Additional observations
An explicit break in the innermost if makes the -O1 code fast again:
if (c == 0u) {
i = a + b; // Not actually needed any more
break;
}
As does "inlining" the loop's progress expression:
for (i = 2u; i < a + b; ) { // No ++i any more
c = (a + b) % i;
if (c == 0u) {
i = a + b;
++i;
} else {
++i;
}
}
Questions
Which optimisation does/could explain the performance drop?
Is it possible to explain what triggers the optimisation in terms of the C++ code (i.e. without a deep understanding of GCC's internals)?
Similarly, is there a high-level explanation for why the alternatives (additional observations) apparently prevent the rogue optimisation?

The important thing at play here are loop-carried data dependencies.
Look at machine code of the slow variant of the innermost loop. I'm showing -O2 assembly here, -O1 is less optimized, but has similar data dependencies overall:
.L4:
xorl %edx, %edx
movl %esi, %eax
divl %ecx
testl %edx, %edx
cmove %esi, %ecx
addl $1, %ecx
cmpl %ecx, %esi
ja .L4
See how the increment of the loop counter in %ecx depends on the previous instruction (the cmov), which in turn depends on the result of the division, which in turn depends on the previous value of loop counter.
Effectively there is a chain of data dependencies on computing the value in %ecx that spans the entire loop, and since the time to execute the loop dominates, the time to compute that chain decides the execution time of the program.
Adjusting the program to compute the number of divisions reveals that it executes 434044698 div instructions. Dividing the number of machine cycles taken by the program by this number gives 26 cycles in my case, which corresponds closely to latency of the div instruction plus about 3 or 4 cycles from the other instructions in the chain (the chain is div-test-cmov-add).
In contrast, the -O3 code does not have this chain of dependencies, making it throughput-bound rather than latency-bound: the time to execute the -O3 variant is determined by the time to compute 434044698 independent div instructions.
Finally, to give specific answers to your questions:
1. Which optimisation does/could explain the performance drop?
As another answer mentioned, this is if-conversion creating a loop-carried data dependency where originally there was a control dependency. Control dependencies may be costly too, when they correspond to unpredictable branches, but in this case the branch is easy to predict.
2. Is it possible to explain what triggers the optimisation in terms of the C++ code (i.e. without a deep understanding of GCC's internals)?
Perhaps you can imagine the optimization transforming the code to
for (i = 2u; i < a + b; ++i) {
c = (a + b) % i;
i = (c != 0) ? i : a + b;
}
Where the ternary operator is evaluated on the CPU such that new value of i is not known until c has been computed.
3. Similarly, is there a high-level explanation for why the alternatives (additional observations) apparently prevent the rogue optimisation?
In those variants the code is not eligible for if-conversion, so the problematic data dependency is not introduced.

I think the problem is in the -fif-conversion that instructs the compiler to do CMOV instead of TEST/JZ for some comparisons. And CMOV is known for being not so great in the general case.
There are two points in the disassembly, that I know of, affected by this flag:
First, the if (c == 0u) { i = a + b; } in line 13 is compiled to:
test edx,edx //edx is c
cmove ecx,esi //esi is (a + b), ecx is i
Second, the if (c != 0u) { count = count + 1u; } is compiled to
cmp eax,0x1 //eax is c
sbb r8d,0xffffffff //r8d is count, but what???
Nice trick! It is substracting -1 to count but with carry, and the carry is only set if c is less than 1, which being unsigned means 0. Thus, if eax is 0 it substracts -1 to count but then substracts 1 again: it does not change. If eax is not 0, then it substracts -1, that increments the variable.
Now, this avoids branches, but at the cost of missing the obvious optimization that if c == 0u you could jump directly to the next while iteration. This one is so easy that it is even done in -O0.

I believe this is caused by the "conditional move" instruction (CMOVEcc) that the compiler generates to replace branching when using -O1 and -O2.
When using -O0, the statement if (c == 0u) is compiled to a jump:
cmp DWORD PTR [rbp-16], 0
jne .L4
With -O1 and -O2:
test edx, edx
cmove ecx, esi
while -O3 produces a jump (similar to -O0):
test edx, edx
je .L5
There is a known bug in gcc where "using conditional moves instead of compare and branch result in almost 2x slower code"
As rodrigo suggested in his comment, using the flag -fno-if-conversion tells gcc not to replace branching with conditional moves, hence preventing this performance issue.

Related

Why is pow(int, int) so slow?

I've been working on a few project Euler exercises to improve my knowledge of C++.
I've written the following function:
int a = 0,b = 0,c = 0;
for (a = 1; a <= SUMTOTAL; a++)
{
for (b = a+1; b <= SUMTOTAL-a; b++)
{
c = SUMTOTAL-(a+b);
if (c == sqrt(pow(a,2)+pow(b,2)) && b < c)
{
std::cout << "a: " << a << " b: " << b << " c: "<< c << std::endl;
std::cout << a * b * c << std::endl;
}
}
}
This computes in 17 milliseconds.
However, if I change the line
if (c == sqrt(pow(a,2)+pow(b,2)) && b < c)
to
if (c == sqrt((a*a)+(b*b)) && b < c)
the computation takes place in 2 milliseconds. Is there some obvious implementation detail of pow(int, int) that I'm missing which makes the first expression compute so much slower?

pow() works with real floating-point numbers and uses under the hood the formula
pow(x,y) = e^(y log(x))
to calculate x^y. The int are converted to double before calling pow. (log is the natural logarithm, e-based)
x^2 using pow() is therefore slower than x*x.
Edit based on relevant comments
Using pow even with integer exponents may yield incorrect results (PaulMcKenzie)
In addition to using a math function with double type, pow is a function call (while x*x isn't) (jtbandes)
Many modern compilers will in fact optimize out pow with constant integer arguments, but this should not be relied upon.

You've picked one of the slowest possible ways to check
c*c == a*a + b*b // assuming c is non-negative
That compiles to three integer multiplications (one of which can be hoisted out of the loop). Even without pow(), you're still converting to double and taking a square root, which is terrible for throughput. (And also latency, but branch prediction + speculative execution on modern CPUs means that latency isn't a factor here).
Intel Haswell's SQRTSD instruction has a throughput of one per 8-14 cycles (source: Agner Fog's instruction tables), so even if your sqrt() version keeps the FP sqrt execution unit saturated, it's still about 4 times slower than what I got gcc to emit (below).
You can also optimize the loop condition to break out of the loop when the b < c part of the condition becomes false, so the compiler only has to do one version of that check.
void foo_optimized()
{
for (int a = 1; a <= SUMTOTAL; a++) {
for (int b = a+1; b < SUMTOTAL-a-b; b++) {
// int c = SUMTOTAL-(a+b); // gcc won't always transform signed-integer math, so this prevents hoisting (SUMTOTAL-a) :(
int c = (SUMTOTAL-a) - b;
// if (b >= c) break; // just changed the loop condition instead
// the compiler can hoist a*a out of the loop for us
if (/* b < c && */ c*c == a*a + b*b) {
// Just print a newline. std::endl also flushes, which bloats the asm
std::cout << "a: " << a << " b: " << b << " c: "<< c << '\n';
std::cout << a * b * c << '\n';
}
}
}
}
This compiles (with gcc6.2 -O3 -mtune=haswell) to code with this inner loop. See the full code on the Godbolt compiler explorer.
# a*a is hoisted out of the loop. It's in r15d
.L6:
add ebp, 1 # b++
sub ebx, 1 # c--
add r12d, r14d # ivtmp.36, ivtmp.43 # not sure what this is or why it's in the loop, would have to look again at the asm outside
cmp ebp, ebx # b, _39
jg .L13 ## This is the loop-exit branch, not-taken until the end
## .L13 is the rest of the outer loop.
## It sets up for the next entry to this inner loop.
.L8:
mov eax, ebp # multiply a copy of the counters
mov edx, ebx
imul eax, ebp # b*b
imul edx, ebx # c*c
add eax, r15d # a*a + b*b
cmp edx, eax # tmp137, tmp139
jne .L6
## Fall-through into the cout print code when we find a match
## extremely rare, so should predict near-perfectly
On Intel Haswell, all these instructions are 1 uop each. (And the cmp/jcc pairs macro-fuse into compare-and-branch uops.) So that's 10 fused-domain uops, which can issue at one iteration per 2.5 cycles.
Haswell runs imul r32, r32 with a throughput of one iteration per clock, so the two multiplies inside the inner loop aren't saturating port 1 at two multiplies per 2.5c. This leaves room to soak up the inevitable resource conflicts from ADD and SUB stealing port 1.
We're not even close to any other execution-port bottlenecks, so the front-end bottleneck is the only issue, and this should run at one iteration per 2.5 cycles on Intel Haswell and later.
Loop-unrolling could help here to reduce the number of uops per check. e.g. use lea ecx, [rbx+1] to compute b+1 for the next iteration, so we can imul ebx, ebx without using a MOV to make it non-destructive.
A strength-reduction is also possible: Given b*b we could try to compute (b-1) * (b-1) without an IMUL. (b-1) * (b-1) = b*b - 2*b + 1, so maybe we can do an lea ecx, [rbx*2 - 1] and then subtract that from b*b. (There are no addressing-modes that subtract instead of add. Hmm, maybe we could keep -b in a register, and count up towards zero, so we could use lea ecx, [rcx + rbx*2 - 1] to update b*b in ECX, given -b in EBX).
Unless you actually bottleneck on IMUL throughput, this might end up taking more uops and not be a win. It might be fun to see how well a compiler would do with this strength-reduction in the C++ source.
You could probably also vectorize this with SSE or AVX, checking 4 or 8 consecutive b values in parallel. Since hits are really rare, you just check if any of the 8 had a hit and then sort out which one it was in the rare case that there was a match.
See also the x86 tag wiki for more optimization stuff.

In special cases: Is & faster than %?

I saw the chosen answer to this post.
I was suprised that (x & 255) == (x % 256) if x is an unsigned integer, I wondered if it makes sense to always replace % with & in x % n for n = 2^a (a = [1, ...]) and x being a positive integer.
Since this is a special case in which I as a human can decide because I know with which values the program will deal with and the compiler does not. Can I gain a significant performance boost if my program uses a lot of modulo operations?
Sure, I could just compile and look at the dissassembly. But this would only answer my question for one compiler/architecture. I would like to know if this is in principle faster.

If your integral type is unsigned, the compiler will optimize it, and the result will be the same. If it's signed, something is different...
This program:
int mod_signed(int i) {
return i % 256;
}
int and_signed(int i) {
return i & 255;
}
unsigned mod_unsigned(unsigned int i) {
return i % 256;
}
unsigned and_unsigned(unsigned int i) {
return i & 255;
}
will be compiled (by GCC 6.2 with -O3; Clang 3.9 produces very similar code) into:
mod_signed(int):
mov edx, edi
sar edx, 31
shr edx, 24
lea eax, [rdi+rdx]
movzx eax, al
sub eax, edx
ret
and_signed(int):
movzx eax, dil
ret
mod_unsigned(unsigned int):
movzx eax, dil
ret
and_unsigned(unsigned int):
movzx eax, dil
ret
The result assembly of mod_signed is different because
If both operands to a multiplication, division, or modulus expression have the same sign, the result is positive. Otherwise, the result is negative. The result of a modulus operation's sign is implementation-defined.
and AFAICT, most of implementation decided that the result of a modulus expression is always the same as the sign of the first operand. See this documentation.
Hence, mod_signed is optimized to (from nwellnhof's comment):
int d = i < 0 ? 255 : 0;
return ((i + d) & 255) - d;
Logically, we can prove that i % 256 == i & 255 for all unsigned integers, hence, we can trust the compiler to do its job.

I did some measurements with gcc, and
if the argument of a / or % is a compiled time constant that's a power of 2, gcc can turn it into the corresponding bit operation.
Here are some of my benchmarks for divisions
What has a better performance: multiplication or division? and as you can see, the running times with divisors that are statically known powers of two are noticably lower than with other statically known divisors.
So if / and % with statically known power-of-two arguments describe your algorithm better than bit ops, feel free to prefer / and %.
You shouldn't lose any performance with a decent compiler.

Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly?

I wrote these two solutions for Project Euler Q14, in assembly and in C++. They implement identical brute force approach for testing the Collatz conjecture. The assembly solution was assembled with:
nasm -felf64 p14.asm && gcc p14.o -o p14
The C++ was compiled with:
g++ p14.cpp -o p14
Assembly, p14.asm:
section .data
fmt db "%d", 10, 0
global main
extern printf
section .text
main:
mov rcx, 1000000
xor rdi, rdi ; max i
xor rsi, rsi ; i
l1:
dec rcx
xor r10, r10 ; count
mov rax, rcx
l2:
test rax, 1
jpe even
mov rbx, 3
mul rbx
inc rax
jmp c1
even:
mov rbx, 2
xor rdx, rdx
div rbx
c1:
inc r10
cmp rax, 1
jne l2
cmp rdi, r10
cmovl rdi, r10
cmovl rsi, rcx
cmp rcx, 2
jne l1
mov rdi, fmt
xor rax, rax
call printf
ret
C++, p14.cpp:
#include <iostream>
int sequence(long n) {
int count = 1;
while (n != 1) {
if (n % 2 == 0)
n /= 2;
else
n = 3*n + 1;
++count;
}
return count;
}
int main() {
int max = 0, maxi;
for (int i = 999999; i > 0; --i) {
int s = sequence(i);
if (s > max) {
max = s;
maxi = i;
}
}
std::cout << maxi << std::endl;
}
I know about the compiler optimizations to improve speed and everything, but I don’t see many ways to further optimize my assembly solution (speaking programmatically, not mathematically).
The C++ code uses modulus every term and division every other term, while the assembly code only uses a single division every other term.
But the assembly is taking on average 1 second longer than the C++ solution. Why is this? I am asking mainly out of curiosity.
Execution times
My system: 64-bit Linux on 1.4 GHz Intel Celeron 2955U (Haswell microarchitecture).
g++ (unoptimized): avg 1272 ms.
g++ -O3: avg 578 ms.
asm (div) (original): avg 2650 ms.
asm (shr): avg 679 ms.
#johnfound asm (assembled with NASM): avg 501 ms.
#hidefromkgb asm: avg 200 ms.
#hidefromkgb asm, optimized by #Peter Cordes: avg 145 ms.
#Veedrac C++: avg 81 ms with -O3, 305 ms with -O0.

If you think a 64-bit DIV instruction is a good way to divide by two, then no wonder the compiler's asm output beat your hand-written code, even with -O0 (compile fast, no extra optimization, and store/reload to memory after/before every C statement so a debugger can modify variables).
See Agner Fog's Optimizing Assembly guide to learn how to write efficient asm. He also has instruction tables and a microarch guide for specific details for specific CPUs. See also the x86 tag wiki for more perf links.
See also this more general question about beating the compiler with hand-written asm: Is inline assembly language slower than native C++ code?. TL:DR: yes if you do it wrong (like this question).
Usually you're fine letting the compiler do its thing, especially if you try to write C++ that can compile efficiently. Also see is assembly faster than compiled languages?. One of the answers links to these neat slides showing how various C compilers optimize some really simple functions with cool tricks. Matt Godbolt's CppCon2017 talk “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” is in a similar vein.
even:
mov rbx, 2
xor rdx, rdx
div rbx
On Intel Haswell, div r64 is 36 uops, with a latency of 32-96 cycles, and a throughput of one per 21-74 cycles. (Plus the 2 uops to set up RBX and zero RDX, but out-of-order execution can run those early). High-uop-count instructions like DIV are microcoded, which can also cause front-end bottlenecks. In this case, latency is the most relevant factor because it's part of a loop-carried dependency chain.
shr rax, 1 does the same unsigned division: It's 1 uop, with 1c latency, and can run 2 per clock cycle.
For comparison, 32-bit division is faster, but still horrible vs. shifts. idiv r32 is 9 uops, 22-29c latency, and one per 8-11c throughput on Haswell.
As you can see from looking at gcc's -O0 asm output (Godbolt compiler explorer), it only uses shifts instructions. clang -O0 does compile naively like you thought, even using 64-bit IDIV twice. (When optimizing, compilers do use both outputs of IDIV when the source does a division and modulus with the same operands, if they use IDIV at all)
GCC doesn't have a totally-naive mode; it always transforms through GIMPLE, which means some "optimizations" can't be disabled. This includes recognizing division-by-constant and using shifts (power of 2) or a fixed-point multiplicative inverse (non power of 2) to avoid IDIV (see div_by_13 in the above godbolt link).
gcc -Os (optimize for size) does use IDIV for non-power-of-2 division,
unfortunately even in cases where the multiplicative inverse code is only slightly larger but much faster.
Helping the compiler
(summary for this case: use uint64_t n)
First of all, it's only interesting to look at optimized compiler output. (-O3).
-O0 speed is basically meaningless.
Look at your asm output (on Godbolt, or see How to remove "noise" from GCC/clang assembly output?). When the compiler doesn't make optimal code in the first place: Writing your C/C++ source in a way that guides the compiler into making better code is usually the best approach. You have to know asm, and know what's efficient, but you apply this knowledge indirectly. Compilers are also a good source of ideas: sometimes clang will do something cool, and you can hand-hold gcc into doing the same thing: see this answer and what I did with the non-unrolled loop in #Veedrac's code below.)
This approach is portable, and in 20 years some future compiler can compile it to whatever is efficient on future hardware (x86 or not), maybe using new ISA extension or auto-vectorizing. Hand-written x86-64 asm from 15 years ago would usually not be optimally tuned for Skylake. e.g. compare&branch macro-fusion didn't exist back then. What's optimal now for hand-crafted asm for one microarchitecture might not be optimal for other current and future CPUs. Comments on #johnfound's answer discuss major differences between AMD Bulldozer and Intel Haswell, which have a big effect on this code. But in theory, g++ -O3 -march=bdver3 and g++ -O3 -march=skylake will do the right thing. (Or -march=native.) Or -mtune=... to just tune, without using instructions that other CPUs might not support.
My feeling is that guiding the compiler to asm that's good for a current CPU you care about shouldn't be a problem for future compilers. They're hopefully better than current compilers at finding ways to transform code, and can find a way that works for future CPUs. Regardless, future x86 probably won't be terrible at anything that's good on current x86, and the future compiler will avoid any asm-specific pitfalls while implementing something like the data movement from your C source, if it doesn't see something better.
Hand-written asm is a black-box for the optimizer, so constant-propagation doesn't work when inlining makes an input a compile-time constant. Other optimizations are also affected. Read https://gcc.gnu.org/wiki/DontUseInlineAsm before using asm. (And avoid MSVC-style inline asm: inputs/outputs have to go through memory which adds overhead.)
In this case: your n has a signed type, and gcc uses the SAR/SHR/ADD sequence that gives the correct rounding. (IDIV and arithmetic-shift "round" differently for negative inputs, see the SAR insn set ref manual entry). (IDK if gcc tried and failed to prove that n can't be negative, or what. Signed-overflow is undefined behaviour, so it should have been able to.)
You should have used uint64_t n, so it can just SHR. And so it's portable to systems where long is only 32-bit (e.g. x86-64 Windows).
BTW, gcc's optimized asm output looks pretty good (using unsigned long n): the inner loop it inlines into main() does this:
# from gcc5.4 -O3 plus my comments
# edx= count=1
# rax= uint64_t n
.L9: # do{
lea rcx, [rax+1+rax*2] # rcx = 3*n + 1
mov rdi, rax
shr rdi # rdi = n>>1;
test al, 1 # set flags based on n%2 (aka n&1)
mov rax, rcx
cmove rax, rdi # n= (n%2) ? 3*n+1 : n/2;
add edx, 1 # ++count;
cmp rax, 1
jne .L9 #}while(n!=1)
cmp/branch to update max and maxi, and then do the next n
The inner loop is branchless, and the critical path of the loop-carried dependency chain is:
3-component LEA (3 cycles)
cmov (2 cycles on Haswell, 1c on Broadwell or later).
Total: 5 cycle per iteration, latency bottleneck. Out-of-order execution takes care of everything else in parallel with this (in theory: I haven't tested with perf counters to see if it really runs at 5c/iter).
The FLAGS input of cmov (produced by TEST) is faster to produce than the RAX input (from LEA->MOV), so it's not on the critical path.
Similarly, the MOV->SHR that produces CMOV's RDI input is off the critical path, because it's also faster than the LEA. MOV on IvyBridge and later has zero latency (handled at register-rename time). (It still takes a uop, and a slot in the pipeline, so it's not free, just zero latency). The extra MOV in the LEA dep chain is part of the bottleneck on other CPUs.
The cmp/jne is also not part of the critical path: it's not loop-carried, because control dependencies are handled with branch prediction + speculative execution, unlike data dependencies on the critical path.
Beating the compiler
GCC did a pretty good job here. It could save one code byte by using inc edx instead of add edx, 1, because nobody cares about P4 and its false-dependencies for partial-flag-modifying instructions.
It could also save all the MOV instructions, and the TEST: SHR sets CF= the bit shifted out, so we can use cmovc instead of test / cmovz.
### Hand-optimized version of what gcc does
.L9: #do{
lea rcx, [rax+1+rax*2] # rcx = 3*n + 1
shr rax, 1 # n>>=1; CF = n&1 = n%2
cmovc rax, rcx # n= (n&1) ? 3*n+1 : n/2;
inc edx # ++count;
cmp rax, 1
jne .L9 #}while(n!=1)
See #johnfound's answer for another clever trick: remove the CMP by branching on SHR's flag result as well as using it for CMOV: zero only if n was 1 (or 0) to start with. (Fun fact: SHR with count != 1 on Nehalem or earlier causes a stall if you read the flag results. That's how they made it single-uop. The shift-by-1 special encoding is fine, though.)
Avoiding MOV doesn't help with the latency at all on Haswell (Can x86's MOV really be "free"? Why can't I reproduce this at all?). It does help significantly on CPUs like Intel pre-IvB, and AMD Bulldozer-family, where MOV is not zero-latency (and Ice Lake with updated microcode). The compiler's wasted MOV instructions do affect the critical path. BD's complex-LEA and CMOV are both lower latency (2c and 1c respectively), so it's a bigger fraction of the latency. Also, throughput bottlenecks become an issue, because it only has two integer ALU pipes. See #johnfound's answer, where he has timing results from an AMD CPU.
Even on Haswell, this version may help a bit by avoiding some occasional delays where a non-critical uop steals an execution port from one on the critical path, delaying execution by 1 cycle. (This is called a resource conflict). It also saves a register, which may help when doing multiple n values in parallel in an interleaved loop (see below).
LEA's latency depends on the addressing mode, on Intel SnB-family CPUs. 3c for 3 components ([base+idx+const], which takes two separate adds), but only 1c with 2 or fewer components (one add). Some CPUs (like Core2) do even a 3-component LEA in a single cycle, but SnB-family doesn't. Worse, Intel SnB-family standardizes latencies so there are no 2c uops, otherwise 3-component LEA would be only 2c like Bulldozer. (3-component LEA is slower on AMD as well, just not by as much).
So lea rcx, [rax + rax*2] / inc rcx is only 2c latency, faster than lea rcx, [rax + rax*2 + 1], on Intel SnB-family CPUs like Haswell. Break-even on BD, and worse on Core2. It does cost an extra uop, which normally isn't worth it to save 1c latency, but latency is the major bottleneck here and Haswell has a wide enough pipeline to handle the extra uop throughput.
Neither gcc, icc, nor clang (on godbolt) used SHR's CF output, always using an AND or TEST. Silly compilers. :P They're great pieces of complex machinery, but a clever human can often beat them on small-scale problems. (Given thousands to millions of times longer to think about it, of course! Compilers don't use exhaustive algorithms to search for every possible way to do things, because that would take too long when optimizing a lot of inlined code, which is what they do best. They also don't model the pipeline in the target microarchitecture, at least not in the same detail as IACA or other static-analysis tools; they just use some heuristics.)
Simple loop unrolling won't help; this loop bottlenecks on the latency of a loop-carried dependency chain, not on loop overhead / throughput. This means it would do well with hyperthreading (or any other kind of SMT), since the CPU has lots of time to interleave instructions from two threads. This would mean parallelizing the loop in main, but that's fine because each thread can just check a range of n values and produce a pair of integers as a result.
Interleaving by hand within a single thread might be viable, too. Maybe compute the sequence for a pair of numbers in parallel, since each one only takes a couple registers, and they can all update the same max / maxi. This creates more instruction-level parallelism.
The trick is deciding whether to wait until all the n values have reached 1 before getting another pair of starting n values, or whether to break out and get a new start point for just one that reached the end condition, without touching the registers for the other sequence. Probably it's best to keep each chain working on useful data, otherwise you'd have to conditionally increment its counter.
You could maybe even do this with SSE packed-compare stuff to conditionally increment the counter for vector elements where n hadn't reached 1 yet. And then to hide the even longer latency of a SIMD conditional-increment implementation, you'd need to keep more vectors of n values up in the air. Maybe only worth with 256b vector (4x uint64_t).
I think the best strategy to make detection of a 1 "sticky" is to mask the vector of all-ones that you add to increment the counter. So after you've seen a 1 in an element, the increment-vector will have a zero, and +=0 is a no-op.
Untested idea for manual vectorization
# starting with YMM0 = [ n_d, n_c, n_b, n_a ] (64-bit elements)
# ymm4 = _mm256_set1_epi64x(1): increment vector
# ymm5 = all-zeros: count vector
.inner_loop:
vpaddq ymm1, ymm0, xmm0
vpaddq ymm1, ymm1, xmm0
vpaddq ymm1, ymm1, set1_epi64(1) # ymm1= 3*n + 1. Maybe could do this more efficiently?
vpsllq ymm3, ymm0, 63 # shift bit 1 to the sign bit
vpsrlq ymm0, ymm0, 1 # n /= 2
# FP blend between integer insns may cost extra bypass latency, but integer blends don't have 1 bit controlling a whole qword.
vpblendvpd ymm0, ymm0, ymm1, ymm3 # variable blend controlled by the sign bit of each 64-bit element. I might have the source operands backwards, I always have to look this up.
# ymm0 = updated n in each element.
vpcmpeqq ymm1, ymm0, set1_epi64(1)
vpandn ymm4, ymm1, ymm4 # zero out elements of ymm4 where the compare was true
vpaddq ymm5, ymm5, ymm4 # count++ in elements where n has never been == 1
vptest ymm4, ymm4
jnz .inner_loop
# Fall through when all the n values have reached 1 at some point, and our increment vector is all-zero
vextracti128 ymm0, ymm5, 1
vpmaxq .... crap this doesn't exist
# Actually just delay doing a horizontal max until the very very end. But you need some way to record max and maxi.
You can and should implement this with intrinsics instead of hand-written asm.
Algorithmic / implementation improvement:
Besides just implementing the same logic with more efficient asm, look for ways to simplify the logic, or avoid redundant work. e.g. memoize to detect common endings to sequences. Or even better, look at 8 trailing bits at once (gnasher's answer)
#EOF points out that tzcnt (or bsf) could be used to do multiple n/=2 iterations in one step. That's probably better than SIMD vectorizing; no SSE or AVX instruction can do that. It's still compatible with doing multiple scalar ns in parallel in different integer registers, though.
So the loop might look like this:
goto loop_entry; // C++ structured like the asm, for illustration only
do {
n = n*3 + 1;
loop_entry:
shift = _tzcnt_u64(n);
n >>= shift;
count += shift;
} while(n != 1);
This may do significantly fewer iterations, but variable-count shifts are slow on Intel SnB-family CPUs without BMI2. 3 uops, 2c latency. (They have an input dependency on the FLAGS because count=0 means the flags are unmodified. They handle this as a data dependency, and take multiple uops because a uop can only have 2 inputs (pre-HSW/BDW anyway)). This is the kind that people complaining about x86's crazy-CISC design are referring to. It makes x86 CPUs slower than they would be if the ISA was designed from scratch today, even in a mostly-similar way. (i.e. this is part of the "x86 tax" that costs speed / power.) SHRX/SHLX/SARX (BMI2) are a big win (1 uop / 1c latency).
It also puts tzcnt (3c on Haswell and later) on the critical path, so it significantly lengthens the total latency of the loop-carried dependency chain. It does remove any need for a CMOV, or for preparing a register holding n>>1, though. #Veedrac's answer overcomes all this by deferring the tzcnt/shift for multiple iterations, which is highly effective (see below).
We can safely use BSF or TZCNT interchangeably, because n can never be zero at that point. TZCNT's machine-code decodes as BSF on CPUs that don't support BMI1. (Meaningless prefixes are ignored, so REP BSF runs as BSF).
TZCNT performs much better than BSF on AMD CPUs that support it, so it can be a good idea to use REP BSF, even if you don't care about setting ZF if the input is zero rather than the output. Some compilers do this when you use __builtin_ctzll even with -mno-bmi.
They perform the same on Intel CPUs, so just save the byte if that's all that matters. TZCNT on Intel (pre-Skylake) still has a false-dependency on the supposedly write-only output operand, just like BSF, to support the undocumented behaviour that BSF with input = 0 leaves its destination unmodified. So you need to work around that unless optimizing only for Skylake, so there's nothing to gain from the extra REP byte. (Intel often goes above and beyond what the x86 ISA manual requires, to avoid breaking widely-used code that depends on something it shouldn't, or that is retroactively disallowed. e.g. Windows 9x's assumes no speculative prefetching of TLB entries, which was safe when the code was written, before Intel updated the TLB management rules.)
Anyway, LZCNT/TZCNT on Haswell have the same false dep as POPCNT: see this Q&A. This is why in gcc's asm output for #Veedrac's code, you see it breaking the dep chain with xor-zeroing on the register it's about to use as TZCNT's destination when it doesn't use dst=src. Since TZCNT/LZCNT/POPCNT never leave their destination undefined or unmodified, this false dependency on the output on Intel CPUs is a performance bug / limitation. Presumably it's worth some transistors / power to have them behave like other uops that go to the same execution unit. The only perf upside is interaction with another uarch limitation: they can micro-fuse a memory operand with an indexed addressing mode on Haswell, but on Skylake where Intel removed the false dep for LZCNT/TZCNT they "un-laminate" indexed addressing modes while POPCNT can still micro-fuse any addr mode.
Improvements to ideas / code from other answers:
#hidefromkgb's answer has a nice observation that you're guaranteed to be able to do one right shift after a 3n+1. You can compute this more even more efficiently than just leaving out the checks between steps. The asm implementation in that answer is broken, though (it depends on OF, which is undefined after SHRD with a count > 1), and slow: ROR rdi,2 is faster than SHRD rdi,rdi,2, and using two CMOV instructions on the critical path is slower than an extra TEST that can run in parallel.
I put tidied / improved C (which guides the compiler to produce better asm), and tested+working faster asm (in comments below the C) up on Godbolt: see the link in #hidefromkgb's answer. (This answer hit the 30k char limit from the large Godbolt URLs, but shortlinks can rot and were too long for goo.gl anyway.)
Also improved the output-printing to convert to a string and make one write() instead of writing one char at a time. This minimizes impact on timing the whole program with perf stat ./collatz (to record performance counters), and I de-obfuscated some of the non-critical asm.
#Veedrac's code
I got a minor speedup from right-shifting as much as we know needs doing, and checking to continue the loop. From 7.5s for limit=1e8 down to 7.275s, on Core2Duo (Merom), with an unroll factor of 16.
code + comments on Godbolt. Don't use this version with clang; it does something silly with the defer-loop. Using a tmp counter k and then adding it to count later changes what clang does, but that slightly hurts gcc.
See discussion in comments: Veedrac's code is excellent on CPUs with BMI1 (i.e. not Celeron/Pentium)

Claiming that the C++ compiler can produce more optimal code than a competent assembly language programmer is a very bad mistake. And especially in this case. The human always can make the code better than the compiler can, and this particular situation is a good illustration of this claim.
The timing difference you're seeing is because the assembly code in the question is very far from optimal in the inner loops.
(The below code is 32-bit, but can be easily converted to 64-bit)
For example, the sequence function can be optimized to only 5 instructions:
.seq:
inc esi ; counter
lea edx, [3*eax+1] ; edx = 3*n+1
shr eax, 1 ; eax = n/2
cmovc eax, edx ; if CF eax = edx
jnz .seq ; jmp if n<>1
The whole code looks like:
include "%lib%/freshlib.inc"
#BinaryType console, compact
options.DebugMode = 1
include "%lib%/freshlib.asm"
start:
InitializeAll
mov ecx, 999999
xor edi, edi ; max
xor ebx, ebx ; max i
.main_loop:
xor esi, esi
mov eax, ecx
.seq:
inc esi ; counter
lea edx, [3*eax+1] ; edx = 3*n+1
shr eax, 1 ; eax = n/2
cmovc eax, edx ; if CF eax = edx
jnz .seq ; jmp if n<>1
cmp edi, esi
cmovb edi, esi
cmovb ebx, ecx
dec ecx
jnz .main_loop
OutputValue "Max sequence: ", edi, 10, -1
OutputValue "Max index: ", ebx, 10, -1
FinalizeAll
stdcall TerminateAll, 0
In order to compile this code, FreshLib is needed.
In my tests, (1 GHz AMD A4-1200 processor), the above code is approximately four times faster than the C++ code from the question (when compiled with -O0: 430 ms vs. 1900 ms), and more than two times faster (430 ms vs. 830 ms) when the C++ code is compiled with -O3.
The output of both programs is the same: max sequence = 525 on i = 837799.

For more performance: A simple change is observing that after n = 3n+1, n will be even, so you can divide by 2 immediately. And n won't be 1, so you don't need to test for it. So you could save a few if statements and write:
while (n % 2 == 0) n /= 2;
if (n > 1) for (;;) {
n = (3*n + 1) / 2;
if (n % 2 == 0) {
do n /= 2; while (n % 2 == 0);
if (n == 1) break;
}
}
Here's a big win: If you look at the lowest 8 bits of n, all the steps until you divided by 2 eight times are completely determined by those eight bits. For example, if the last eight bits are 0x01, that is in binary your number is ???? 0000 0001 then the next steps are:
3n+1 -> ???? 0000 0100
/ 2 -> ???? ?000 0010
/ 2 -> ???? ??00 0001
3n+1 -> ???? ??00 0100
/ 2 -> ???? ???0 0010
/ 2 -> ???? ???? 0001
3n+1 -> ???? ???? 0100
/ 2 -> ???? ???? ?010
/ 2 -> ???? ???? ??01
3n+1 -> ???? ???? ??00
/ 2 -> ???? ???? ???0
/ 2 -> ???? ???? ????
So all these steps can be predicted, and 256k + 1 is replaced with 81k + 1. Something similar will happen for all combinations. So you can make a loop with a big switch statement:
k = n / 256;
m = n % 256;
switch (m) {
case 0: n = 1 * k + 0; break;
case 1: n = 81 * k + 1; break;
case 2: n = 81 * k + 1; break;
...
case 155: n = 729 * k + 425; break;
...
}
Run the loop until n ≤ 128, because at that point n could become 1 with fewer than eight divisions by 2, and doing eight or more steps at a time would make you miss the point where you reach 1 for the first time. Then continue the "normal" loop - or have a table prepared that tells you how many more steps are need to reach 1.
PS. I strongly suspect Peter Cordes' suggestion would make it even faster. There will be no conditional branches at all except one, and that one will be predicted correctly except when the loop actually ends. So the code would be something like
static const unsigned int multipliers [256] = { ... }
static const unsigned int adders [256] = { ... }
while (n > 128) {
size_t lastBits = n % 256;
n = (n >> 8) * multipliers [lastBits] + adders [lastBits];
}
In practice, you would measure whether processing the last 9, 10, 11, 12 bits of n at a time would be faster. For each bit, the number of entries in the table would double, and I excect a slowdown when the tables don't fit into L1 cache anymore.
PPS. If you need the number of operations: In each iteration we do exactly eight divisions by two, and a variable number of (3n + 1) operations, so an obvious method to count the operations would be another array. But we can actually calculate the number of steps (based on number of iterations of the loop).
We could redefine the problem slightly: Replace n with (3n + 1) / 2 if odd, and replace n with n / 2 if even. Then every iteration will do exactly 8 steps, but you could consider that cheating :-) So assume there were r operations n <- 3n+1 and s operations n <- n/2. The result will be quite exactly n' = n * 3^r / 2^s, because n <- 3n+1 means n <- 3n * (1 + 1/3n). Taking the logarithm we find r = (s + log2 (n' / n)) / log2 (3).
If we do the loop until n ≤ 1,000,000 and have a precomputed table how many iterations are needed from any start point n ≤ 1,000,000 then calculating r as above, rounded to the nearest integer, will give the right result unless s is truly large.

On a rather unrelated note: more performance hacks!
[the first «conjecture» has been finally debunked by #ShreevatsaR; removed]
When traversing the sequence, we can only get 3 possible cases in the 2-neighborhood of the current element N (shown first):
[even] [odd]
[odd] [even]
[even] [even]
To leap past these 2 elements means to compute (N >> 1) + N + 1, ((N << 1) + N + 1) >> 1 and N >> 2, respectively.
Let`s prove that for both cases (1) and (2) it is possible to use the first formula, (N >> 1) + N + 1.
Case (1) is obvious. Case (2) implies (N & 1) == 1, so if we assume (without loss of generality) that N is 2-bit long and its bits are ba from most- to least-significant, then a = 1, and the following holds:
(N << 1) + N + 1: (N >> 1) + N + 1:
b10 b1
b1 b
+ 1 + 1
---- ---
bBb0 bBb
where B = !b. Right-shifting the first result gives us exactly what we want.
Q.E.D.: (N & 1) == 1 ⇒ (N >> 1) + N + 1 == ((N << 1) + N + 1) >> 1.
As proven, we can traverse the sequence 2 elements at a time, using a single ternary operation. Another 2× time reduction.
The resulting algorithm looks like this:
uint64_t sequence(uint64_t size, uint64_t *path) {
uint64_t n, i, c, maxi = 0, maxc = 0;
for (n = i = (size - 1) | 1; i > 2; n = i -= 2) {
c = 2;
while ((n = ((n & 3)? (n >> 1) + n + 1 : (n >> 2))) > 2)
c += 2;
if (n == 2)
c++;
if (c > maxc) {
maxi = i;
maxc = c;
}
}
*path = maxc;
return maxi;
}
int main() {
uint64_t maxi, maxc;
maxi = sequence(1000000, &maxc);
printf("%llu, %llu\n", maxi, maxc);
return 0;
}
Here we compare n > 2 because the process may stop at 2 instead of 1 if the total length of the sequence is odd.
[EDIT:]
Let`s translate this into assembly!
MOV RCX, 1000000;
DEC RCX;
AND RCX, -2;
XOR RAX, RAX;
MOV RBX, RAX;
#main:
XOR RSI, RSI;
LEA RDI, [RCX + 1];
#loop:
ADD RSI, 2;
LEA RDX, [RDI + RDI*2 + 2];
SHR RDX, 1;
SHRD RDI, RDI, 2; ror rdi,2 would do the same thing
CMOVL RDI, RDX; Note that SHRD leaves OF = undefined with count>1, and this doesn't work on all CPUs.
CMOVS RDI, RDX;
CMP RDI, 2;
JA #loop;
LEA RDX, [RSI + 1];
CMOVE RSI, RDX;
CMP RAX, RSI;
CMOVB RAX, RSI;
CMOVB RBX, RCX;
SUB RCX, 2;
JA #main;
MOV RDI, RCX;
ADD RCX, 10;
PUSH RDI;
PUSH RCX;
#itoa:
XOR RDX, RDX;
DIV RCX;
ADD RDX, '0';
PUSH RDX;
TEST RAX, RAX;
JNE #itoa;
PUSH RCX;
LEA RAX, [RBX + 1];
TEST RBX, RBX;
MOV RBX, RDI;
JNE #itoa;
POP RCX;
INC RDI;
MOV RDX, RDI;
#outp:
MOV RSI, RSP;
MOV RAX, RDI;
SYSCALL;
POP RAX;
TEST RAX, RAX;
JNE #outp;
LEA RAX, [RDI + 59];
DEC RDI;
SYSCALL;
Use these commands to compile:
nasm -f elf64 file.asm
ld -o file file.o
See the C and an improved/bugfixed version of the asm by Peter Cordes on Godbolt. (editor's note: Sorry for putting my stuff in your answer, but my answer hit the 30k char limit from Godbolt links + text!)

C++ programs are translated to assembly programs during the generation of machine code from the source code. It would be virtually wrong to say assembly is slower than C++. Moreover, the binary code generated differs from compiler to compiler. So a smart C++ compiler may produce binary code more optimal and efficient than a dumb assembler's code.
However I believe your profiling methodology has certain flaws. The following are general guidelines for profiling:
Make sure your system is in its normal/idle state. Stop all running processes (applications) that you started or that use CPU intensively (or poll over the network).
Your datasize must be greater in size.
Your test must run for something more than 5-10 seconds.
Do not rely on just one sample. Perform your test N times. Collect results and calculate the mean or median of the result.

From comments:
But, this code never stops (because of integer overflow) !?! Yves Daoust
For many numbers it will not overflow.
If it will overflow - for one of those unlucky initial seeds, the overflown number will very likely converge toward 1 without another overflow.
Still this poses interesting question, is there some overflow-cyclic seed number?
Any simple final converging series starts with power of two value (obvious enough?).
2^64 will overflow to zero, which is undefined infinite loop according to algorithm (ends only with 1), but the most optimal solution in answer will finish due to shr rax producing ZF=1.
Can we produce 2^64? If the starting number is 0x5555555555555555, it's odd number, next number is then 3n+1, which is 0xFFFFFFFFFFFFFFFF + 1 = 0. Theoretically in undefined state of algorithm, but the optimized answer of johnfound will recover by exiting on ZF=1. The cmp rax,1 of Peter Cordes will end in infinite loop (QED variant 1, "cheapo" through undefined 0 number).
How about some more complex number, which will create cycle without 0?
Frankly, I'm not sure, my Math theory is too hazy to get any serious idea, how to deal with it in serious way. But intuitively I would say the series will converge to 1 for every number : 0 < number, as the 3n+1 formula will slowly turn every non-2 prime factor of original number (or intermediate) into some power of 2, sooner or later. So we don't need to worry about infinite loop for original series, only overflow can hamper us.
So I just put few numbers into sheet and took a look on 8 bit truncated numbers.
There are three values overflowing to 0: 227, 170 and 85 (85 going directly to 0, other two progressing toward 85).
But there's no value creating cyclic overflow seed.
Funnily enough I did a check, which is the first number to suffer from 8 bit truncation, and already 27 is affected! It does reach value 9232 in proper non-truncated series (first truncated value is 322 in 12th step), and the maximum value reached for any of the 2-255 input numbers in non-truncated way is 13120 (for the 255 itself), maximum number of steps to converge to 1 is about 128 (+-2, not sure if "1" is to count, etc...).
Interestingly enough (for me) the number 9232 is maximum for many other source numbers, what's so special about it? :-O 9232 = 0x2410 ... hmmm.. no idea.
Unfortunately I can't get any deep grasp of this series, why does it converge and what are the implications of truncating them to k bits, but with cmp number,1 terminating condition it's certainly possible to put the algorithm into infinite loop with particular input value ending as 0 after truncation.
But the value 27 overflowing for 8 bit case is sort of alerting, this looks like if you count the number of steps to reach value 1, you will get wrong result for majority of numbers from the total k-bit set of integers. For the 8 bit integers the 146 numbers out of 256 have affected series by truncation (some of them may still hit the correct number of steps by accident maybe, I'm too lazy to check).

You did not post the code generated by the compiler, so there' some guesswork here, but even without having seen it, one can say that this:
test rax, 1
jpe even
... has a 50% chance of mispredicting the branch, and that will come expensive.
The compiler almost certainly does both computations (which costs neglegibly more since the div/mod is quite long latency, so the multiply-add is "free") and follows up with a CMOV. Which, of course, has a zero percent chance of being mispredicted.

For the Collatz problem, you can get a significant boost in performance by caching the "tails". This is a time/memory trade-off. See: memoization
(https://en.wikipedia.org/wiki/Memoization). You could also look into dynamic programming solutions for other time/memory trade-offs.
Example python implementation:
import sys
inner_loop = 0
def collatz_sequence(N, cache):
global inner_loop
l = [ ]
stop = False
n = N
tails = [ ]
while not stop:
inner_loop += 1
tmp = n
l.append(n)
if n <= 1:
stop = True
elif n in cache:
stop = True
elif n % 2:
n = 3*n + 1
else:
n = n // 2
tails.append((tmp, len(l)))
for key, offset in tails:
if not key in cache:
cache[key] = l[offset:]
return l
def gen_sequence(l, cache):
for elem in l:
yield elem
if elem in cache:
yield from gen_sequence(cache[elem], cache)
raise StopIteration
if __name__ == "__main__":
le_cache = {}
for n in range(1, 4711, 5):
l = collatz_sequence(n, le_cache)
print("{}: {}".format(n, len(list(gen_sequence(l, le_cache)))))
print("inner_loop = {}".format(inner_loop))

As a generic answer, not specifically directed at this task: In many cases, you can significantly speed up any program by making improvements at a high level. Like calculating data once instead of multiple times, avoiding unnecessary work completely, using caches in the best way, and so on. These things are much easier to do in a high level language.
Writing assembler code, it is possible to improve on what an optimising compiler does, but it is hard work. And once it's done, your code is much harder to modify, so it is much more difficult to add algorithmic improvements. Sometimes the processor has functionality that you cannot use from a high level language, inline assembly is often useful in these cases and still lets you use a high level language.
In the Euler problems, most of the time you succeed by building something, finding why it is slow, building something better, finding why it is slow, and so on and so on. That is very, very hard using assembler. A better algorithm at half the possible speed will usually beat a worse algorithm at full speed, and getting the full speed in assembler isn't trivial.

Even without looking at assembly, the most obvious reason is that /= 2 is probably optimized as >>=1 and many processors have a very quick shift operation. But even if a processor doesn't have a shift operation, the integer division is faster than floating point division.
Edit: your milage may vary on the "integer division is faster than floating point division" statement above. The comments below reveal that the modern processors have prioritized optimizing fp division over integer division. So if someone were looking for the most likely reason for the speedup which this thread's question asks about, then compiler optimizing /=2 as >>=1 would be the best 1st place to look.
On an unrelated note, if n is odd, the expression n*3+1 will always be even. So there is no need to check. You can change that branch to
{
n = (n*3+1) >> 1;
count += 2;
}
So the whole statement would then be
if (n & 1)
{
n = (n*3 + 1) >> 1;
count += 2;
}
else
{
n >>= 1;
++count;
}

The simple answer:
doing a MOV RBX, 3 and MUL RBX is expensive; just ADD RBX, RBX twice
ADD 1 is probably faster than INC here
MOV 2 and DIV is very expensive; just shift right
64-bit code is usually noticeably slower than 32-bit code and the alignment issues are more complicated; with small programs like this you have to pack them so you are doing parallel computation to have any chance of being faster than 32-bit code
If you generate the assembly listing for your C++ program, you can see how it differs from your assembly.

In C++, should I bother to cache variables, or let the compiler do the optimization? (Aliasing)

Consider the following code (p is of type unsigned char* and bitmap->width is of some integer type, exactly which is unknown and depends on which version of some external library we're using):
for (unsigned x = 0; x < static_cast<unsigned>(bitmap->width); ++x)
{
*p++ = 0xAA;
*p++ = 0xBB;
*p++ = 0xCC;
}
Is it worth optimizing it [..]
Could there be a case where this could yield more efficient results by writing:
unsigned width(static_cast<unsigned>(bitmap->width));
for (unsigned x = 0; x < width; ++x)
{
*p++ = 0xAA;
*p++ = 0xBB;
*p++ = 0xCC;
}
... or is this trivial for the compiler to optimize?
What would you consider to be "better" code?
Note from editor (Ike): for those wondering about the strikeout text, the original question, as phrased, was dangerously close to off-topic territory and was very close to being closed in spite of positive feedback. These have been stricken out. Yet please do not punish the answerers who addressed these stricken sections of the question.

At first glance, I thought the compiler could generate equivalent assembly for both versions with optimization flags activated. When I checked it, I was surprised to see the result:
Source unoptimized.cpp
note: this code is not meant to be executed.
struct bitmap_t
{
long long width;
} bitmap;
int main(int argc, char** argv)
{
for (unsigned x = 0 ; x < static_cast<unsigned>(bitmap.width) ; ++x)
{
argv[x][0] = '\0';
}
return 0;
}
Source optimized.cpp
note: this code is not meant to be executed.
struct bitmap_t
{
long long width;
} bitmap;
int main(int argc, char** argv)
{
const unsigned width = static_cast<unsigned>(bitmap.width);
for (unsigned x = 0 ; x < width ; ++x)
{
argv[x][0] = '\0';
}
return 0;
}
Compilation
$ g++ -s -O3 unoptimized.cpp
$ g++ -s -O3 optimized.cpp
Assembly (unoptimized.s)
.file "unoptimized.cpp"
.text
.p2align 4,,15
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
movl bitmap(%rip), %eax
testl %eax, %eax
je .L2
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L3:
mov %eax, %edx
addl $1, %eax
movq (%rsi,%rdx,8), %rdx
movb $0, (%rdx)
cmpl bitmap(%rip), %eax
jb .L3
.L2:
xorl %eax, %eax
ret
.cfi_endproc
.LFE0:
.size main, .-main
.globl bitmap
.bss
.align 8
.type bitmap, #object
.size bitmap, 8
bitmap:
.zero 8
.ident "GCC: (GNU) 4.4.7 20120313 (Red Hat 4.4.7-16)"
.section .note.GNU-stack,"",#progbits
Assembly (optimized.s)
.file "optimized.cpp"
.text
.p2align 4,,15
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
movl bitmap(%rip), %eax
testl %eax, %eax
je .L2
subl $1, %eax
leaq 8(,%rax,8), %rcx
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L3:
movq (%rsi,%rax), %rdx
addq $8, %rax
cmpq %rcx, %rax
movb $0, (%rdx)
jne .L3
.L2:
xorl %eax, %eax
ret
.cfi_endproc
.LFE0:
.size main, .-main
.globl bitmap
.bss
.align 8
.type bitmap, #object
.size bitmap, 8
bitmap:
.zero 8
.ident "GCC: (GNU) 4.4.7 20120313 (Red Hat 4.4.7-16)"
.section .note.GNU-stack,"",#progbits
diff
$ diff -uN unoptimized.s optimized.s
--- unoptimized.s 2015-11-24 16:11:55.837922223 +0000
+++ optimized.s 2015-11-24 16:12:02.628922941 +0000
## -1,4 +1,4 ##
- .file "unoptimized.cpp"
+ .file "optimized.cpp"
.text
.p2align 4,,15
.globl main
## -10,16 +10,17 ##
movl bitmap(%rip), %eax
testl %eax, %eax
je .L2
+ subl $1, %eax
+ leaq 8(,%rax,8), %rcx
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L3:
- mov %eax, %edx
- addl $1, %eax
- movq (%rsi,%rdx,8), %rdx
+ movq (%rsi,%rax), %rdx
+ addq $8, %rax
+ cmpq %rcx, %rax
movb $0, (%rdx)
- cmpl bitmap(%rip), %eax
- jb .L3
+ jne .L3
.L2:
xorl %eax, %eax
ret
The generated assembly for the optimized version does actually load (lea) the width constant unlike the unoptimized version which computes the width offset at each iteration (movq).
When I'll get time, I eventually post some benchmark on that. Good question.

There is actually insufficient information from your code snippet to be able to tell, and the one thing that I can think of is aliasing. From our point of view, it's pretty clear that you don't want p and bitmap to point to the same location in memory, but the compiler doesn't know that and (because p is of type char*) the compiler has to make this code work even if p and bitmap overlap.
This means in this case that if the loop changes bitmap->width through the pointer p then that has to be seen when re-reading bitmap->width later on, which in turn means that storing it in a local variable would be illegal.
That being said, I believe some compilers will actually sometimes generate two versions of the same code (I have seen circumstantial evidence of this, but never directly sought out information on what the compiler is doing in this case), and quickly check if the pointers alias and run the faster code if it determines it's okay to.
That being said, I stand by my comment about simply measuring the performance of the two versions, my money is on not seeing any consistent performance difference between the two versions of the code.
In my opinion, questions like these are okay if your purpose is to learn about compiler optimization theories and techniques, but is a waste of time (a useless micro-optimization) if your end goal here is to make the program run faster.

Ok, guys, so I've measured, with GCC -O3 (using GCC 4.9 on Linux x64).
Turns out, the second version runs 54% faster!
So, I guess aliasing is the thing, I hadn't thought about it.
[Edit]
I've tried again the first version with all pointers defined with __restrict__, and the results are the same. Weird.. Either aliasing is not the problem, or, for some reason, the compiler doesn't optimize it well even with __restrict__.
[Edit 2]
Ok, I think I was pretty much able to prove that aliasing is the problem. I repeated my original test, this time using an array rather than a pointer:
const std::size_t n = 0x80000000ull;
bitmap->width = n;
static unsigned char d[n*3];
std::size_t i=0;
for (unsigned x = 0; x < static_cast<unsigned>(bitmap->width); ++x)
{
d[i++] = 0xAA;
d[i++] = 0xBB;
d[i++] = 0xCC;
}
And measured (had to use "-mcmodel=large" to link it). Then I tried:
const std::size_t n = 0x80000000ull;
bitmap->width = n;
static unsigned char d[n*3];
std::size_t i=0;
unsigned width(static_cast<unsigned>(bitmap->width));
for (unsigned x = 0; x < width; ++x)
{
d[i++] = 0xAA;
d[i++] = 0xBB;
d[i++] = 0xCC;
}
The measure results were the same - Seems like the compiler was able to optimize it by itself.
Then I tried the original codes (with a pointer p), this time when p is of type std::uint16_t*. Again, the results were the same - due to strict aliasing. Then I tried building with "-fno-strict-aliasing", and again saw a difference in time.

Other answers have pointed out that hoisting the pointer operation out of the loop may change defined behaviour due to aliasing rules that allow char to alias anything and hence is not an allowable optimisation for a compiler even though in most cases it is obviously correct to a human programmer.
They have also pointed out that hoisting the operation out of the loop is usually but not always an improvement from a performance point of view and is often a negative from a readability point of view.
I would like to point out that there is often a "third way". Rather than counting up to the number of iterations you want you can count down to zero. This means that the number of iterations is only needed once at the start of the loop, it doesn't have to be stored after that. Better still at the assembler level it often eliminates the need for an explicit comparison as the decrement operation will usually set flags that indicate whether the counter was zero both before (carry flag) and after (zero flag) the decrement.
for (unsigned x = static_cast<unsigned>(bitmap->width);x > 0; x--)
{
*p++ = 0xAA;
*p++ = 0xBB;
*p++ = 0xCC;
}
Note that this version of the loop gives x values in the range 1..width rather than the range 0..(width-1). That doesn't matter in your case because you aren't actually using x for anything but it's something to be aware of. If you want a count down loop with x values in the range 0..(width-1) you can do.
for (unsigned x = static_cast<unsigned>(bitmap->width); x-- > 0;)
{
*p++ = 0xAA;
*p++ = 0xBB;
*p++ = 0xCC;
}
You can also get rid of the casts in the above examples if you want without worrying about it's impact on comparison rules since all you are doing with bitmap->width is assigning it directly to a variable.

The only thing here that can prevent the optimization is the strict aliasing rule. In short:
"Strict aliasing is an assumption, made by the C (or C++) compiler, that dereferencing pointers to objects of different types will never refer to the same memory location (i.e. alias each other.)"
[…]
The exception to the rule is a char*, which is allowed to point to any type.
The exception also applies to unsigned and signed char pointers.
This is the case in your code: You're modifying *p through p which is an unsigned char*, so the compiler must assume that it could point to bitmap->width. Hence the caching of bitmap->width is an invalid optimization. This optimization-preventing behavior is shown in YSC's answer.
If and only if p pointed to a non-char and non-decltype(bitmap->width) type, would the caching be a possible optimization.

The question originally asked:
Is it worth optimizing it?
And my answer to that (garnering a good mix of both up and down votes..)
Let the compiler worry about it.
The compiler will almost certainly do a better job than you. And
there's no guarantee that your 'optimization' is any better than the
'obvious' code - have you measured it??
More importantly, have you any proof that the code you're optimizing
has any impact on the performance of your program?
Despite the downvotes (and now seeing the aliasing issue), I'm still happy with that as a valid answer. If you don't know if it's worth optimizing something, it probably isn't.
A rather different question, of course, would be this:
How can I tell if it's worth optimizing a fragment of code?
First, does your application or library need to run faster than it currently does? Is the user kept waiting too long? Does your software forecast yesterday's weather instead of tomorrow's?
Only you can really tell this, based on what your software is for and what your users expect.
Assuming your software does need some optimzation, the next thing to do is start measuring. Profilers will tell you where your code spends it's time. If your fragment isn't showing as a bottleneck, it's best left alone. Profilers and other measuring tools will also tell you if your changes have made a difference. It's possible to spend hours attemtping to optimize code, only to find you've made no discernible difference.
What do you mean by 'optimizing', anyway?
If you're not writing 'optimized' code, than your code should be as clear, clean and concise as you can make it. The "Premature optimization is evil" argument isn't an excuse for sloppy or inefficient code.
Optimized code normally sacrifices some of the attributes above for performance. It could involve introducing additional local variables, having objects with wider than expected scope or even reversing normal loop ordering. All of these may be less clear or concise, so document the code (briefly!) about why you're doing this.
But often, with 'slow' code, these micro-optimizations are the last resort. The first place to look is at algorithms and data structures. Is there a way of avoiding doing the work at all? Can linear searches be replaced with binary ones? Would a linked list be faster here than a vector? Or a hash table? Can I cache results? Making good 'efficient' decisions here can often affect performance by an order of magnitude or more!

I use the following pattern in the situation like this. It is almost as short as the first case of yours, and is better than the second case, because it keeps the temporary variable local to the loop.
for (unsigned int x = 0, n = static_cast<unsigned>(bitmap->width); x < n; ++x)
{
*p++ = 0xAA;
*p++ = 0xBB;
*p++ = 0xCC;
}
This will be faster with a less than smart compiler, debug build, or certain compilation flags.
Edit1: Placing a constant operation outside of a loop is a good programming pattern. It shows understanding of basics of machine operation, especially in C/C++. I'd argue that the effort to prove themselves should be on people that do not follow this practice. If compiler punishes for a good pattern, it is a bug in the compiler.
Edit2:: I've measured my suggestion against original code on vs2013, got %1 improvement. Can we do better? A simple manual optimization gives 3 times improvement over the original loop on x64 machine without resorting to exotic instructions. The code below assumes little endian system and properly aligned bitmap. TEST 0 is original (9 sec), TEST 1 is faster (3 sec). I bet somebody could make this even faster, and the result of the test would depend on the size of the bitmap. Definitely soon in the future, compiler will be able to produce consistently fastest code. I afraid this will be the future when the compiler will be also a programmer AI, so we would be out of work. But for now, just write code that shows that you know that extra operation in the loop is not needed.
#include <memory>
#include <time.h>
struct Bitmap_line
{
int blah;
unsigned int width;
Bitmap_line(unsigned int w)
{
blah = 0;
width = w;
}
};
#define TEST 0 //define 1 for faster test
int main(int argc, char* argv[])
{
unsigned int size = (4 * 1024 * 1024) / 3 * 3; //makes it divisible by 3
unsigned char* pointer = (unsigned char*)malloc(size);
memset(pointer, 0, size);
std::unique_ptr<Bitmap_line> bitmap(new Bitmap_line(size / 3));
clock_t told = clock();
#if TEST == 0
for (int iter = 0; iter < 10000; iter++)
{
unsigned char* p = pointer;
for (unsigned x = 0; x < static_cast<unsigned>(bitmap->width); ++x)
//for (unsigned x = 0, n = static_cast<unsigned>(bitmap->width); x < n; ++x)
{
*p++ = 0xAA;
*p++ = 0xBB;
*p++ = 0xCC;
}
}
#else
for (int iter = 0; iter < 10000; iter++)
{
unsigned char* p = pointer;
unsigned x = 0;
for (const unsigned n = static_cast<unsigned>(bitmap->width) - 4; x < n; x += 4)
{
*(int64_t*)p = 0xBBAACCBBAACCBBAALL;
p += 8;
*(int32_t*)p = 0xCCBBAACC;
p += 4;
}
for (const unsigned n = static_cast<unsigned>(bitmap->width); x < n; ++x)
{
*p++ = 0xAA;
*p++ = 0xBB;
*p++ = 0xCC;
}
}
#endif
double ms = 1000.0 * double(clock() - told) / CLOCKS_PER_SEC;
printf("time %0.3f\n", ms);
{
//verify
unsigned char* p = pointer;
for (unsigned x = 0, n = static_cast<unsigned>(bitmap->width); x < n; ++x)
{
if ((*p++ != 0xAA) || (*p++ != 0xBB) || (*p++ != 0xCC))
{
printf("EEEEEEEEEEEEERRRRORRRR!!!\n");
abort();
}
}
}
return 0;
}

There are two things to consider.
A) How often will the optimization run?
If the answer is not very often, like only when a user clicks a button, then don't bother if it makes your code unreadable. If the answer is 1000 times a second then you will probably want to go with the optimization. If it is even a bit complex be sure to put a comment in to explain what is going on to help the next guy that comes along.
B) Will this make the code harder to upkeep/troubleshoot?
If you're not seeing a huge gain in performance then making your code cryptic simply to save a few clock ticks is not a good idea. Lots of people will tell you that any good programmer should be able to look at the code and figure out what is going on. This is true. The problem is that in the business world the extra time figuring that out costs money. So, if you can make it prettier to read then do it. Your friends will thank you for it.
That said I'd personally use the B example.

The compiler is able to optimize a lot of things. For your example, you should go for the readability, mantainability and what follows your code standard. For more information about what can be optimized (with GCC), see this blog post.

As a general rule, let the compiler do the optimization for you, until you determine that you should take over. The logic for this has nothing to do with performance, but rather with human readability. In the vast majority of cases, the readability of your program is more important than its performance. You should aim to write code which is easier for a human to read, and then only worry about optimization when you are convinced that performance is more important than the maintainability of your code.
Once you do see that performance matters, you should run a profiler on the code to determine which loops are being inefficient, and optimize those individually. There may indeed be cases where you want to do that optimization (especially if you migrate towards C++, where STL containers get involved), but the cost in terms of readability is great.
In addition, I can think of pathological situations where it could actually slow the code down. For example, consider the case where the compiler could not prove that bitmap->width was constant through the process. By adding the width variable you force the compiler to maintain a local variable in that scope. If, for some platform specific reason, that extra variable prevented some stack-space optimization, it may have to reorganize how it is emitting bytecodes, and produce something less efficient.
As an example, on Windows x64, one is obliged to call a special API call, __chkstk in the preamble of the function if the function will use more than 1 page of local variables. This function gives windows a chance to manage the guard pages they use to expand the stack when needed. If your extra variable pushes the stack usage up from below 1 page to at-or-above 1 page, your function is now obliged to call __chkstk every time it is entered. If you were to optimize this loop on a slow path, you may actually slow the fast path down more than you saved on the slow path!
Sure, it's a bit pathological, but the point of that example is that you can actually slow the compiler down. It just shows that you do have to profile your work to determine where the optimizations go. In the mean time, please don't sacrifice readability in any way for an optimization that may or may not matter.

The comparison is wrong since the two code snippets
for (unsigned x = 0; x < static_cast<unsigned>(bitmap->width); ++x)
and
unsigned width(static_cast<unsigned>(bitmap->width));
for (unsigned x = 0; x<width ; ++x)
are not equivalent
In the first case width is dependent and not const, and one cannot assume that it may not change between subsequent iterations. Thus it cannot be optimized, but has to be checked at every loop.
In your optimized case a local variable is assigned the value of bitmap->width at some point during program execution. The compiler can verify that this does not actually change.
Did you think about multi threading , or maybe the value could be externally dependent such that its value is volatile. How would one expect the compiler to figure all these things out if you do not tell?
The compiler can only do as good as your code lets it.

Unless you know how exactly the compiler optimizes the code, it is better to do your own optimizations by keeping the code readability, and design. Practically it is hard to check assembly code for every function we write for new compiler versions.

Compiler cannot optimize bitmap->width because value of width can be changed between iterations. There are a few most common reasons:
Multi-threading. Compiler cannot predict if other thread is about to change value.
Modification inside loop, sometimes it is not simple to tell if variable will be changed inside loop.
It is function call, e.g. iterator::end() or container::size() so it is hard to predict if it will always returns the same result.
To sum up (my personal opinion) for places that requires high level of optimization you need to do that by yourself, in other places just leave it, compiler may optimize it or not, if there is no big difference code readability is main target.

Is < faster than <=?

Is if (a < 901) faster than if (a <= 900)?
Not exactly as in this simple example, but there are slight performance changes on loop complex code. I suppose this has to do something with generated machine code in case it's even true.

No, it will not be faster on most architectures. You didn't specify, but on x86, all of the integral comparisons will be typically implemented in two machine instructions:
A test or cmp instruction, which sets EFLAGS
And a Jcc (jump) instruction, depending on the comparison type (and code layout):
jne - Jump if not equal --> ZF = 0
jz - Jump if zero (equal) --> ZF = 1
jg - Jump if greater --> ZF = 0 and SF = OF
(etc...)
Example (Edited for brevity) Compiled with $ gcc -m32 -S -masm=intel test.c
if (a < b) {
// Do something 1
}
Compiles to:
mov eax, DWORD PTR [esp+24] ; a
cmp eax, DWORD PTR [esp+28] ; b
jge .L2 ; jump if a is >= b
; Do something 1
.L2:
And
if (a <= b) {
// Do something 2
}
Compiles to:
mov eax, DWORD PTR [esp+24] ; a
cmp eax, DWORD PTR [esp+28] ; b
jg .L5 ; jump if a is > b
; Do something 2
.L5:
So the only difference between the two is a jg versus a jge instruction. The two will take the same amount of time.
I'd like to address the comment that nothing indicates that the different jump instructions take the same amount of time. This one is a little tricky to answer, but here's what I can give: In the Intel Instruction Set Reference, they are all grouped together under one common instruction, Jcc (Jump if condition is met). The same grouping is made together under the Optimization Reference Manual, in Appendix C. Latency and Throughput.
Latency — The number of clock cycles that are required for the
execution core to complete the execution of all of the μops that form
an instruction.
Throughput — The number of clock cycles required to
wait before the issue ports are free to accept the same instruction
again. For many instructions, the throughput of an instruction can be
significantly less than its latency
The values for Jcc are:
Latency Throughput
Jcc N/A 0.5
with the following footnote on Jcc:
Selection of conditional jump instructions should be based on the recommendation of section Section 3.4.1, “Branch Prediction Optimization,” to improve the predictability of branches. When branches are predicted successfully, the latency of jcc is effectively zero.
So, nothing in the Intel docs ever treats one Jcc instruction any differently from the others.
If one thinks about the actual circuitry used to implement the instructions, one can assume that there would be simple AND/OR gates on the different bits in EFLAGS, to determine whether the conditions are met. There is then, no reason that an instruction testing two bits should take any more or less time than one testing only one (Ignoring gate propagation delay, which is much less than the clock period.)
Edit: Floating Point
This holds true for x87 floating point as well: (Pretty much same code as above, but with double instead of int.)
fld QWORD PTR [esp+32]
fld QWORD PTR [esp+40]
fucomip st, st(1) ; Compare ST(0) and ST(1), and set CF, PF, ZF in EFLAGS
fstp st(0)
seta al ; Set al if above (CF=0 and ZF=0).
test al, al
je .L2
; Do something 1
.L2:
fld QWORD PTR [esp+32]
fld QWORD PTR [esp+40]
fucomip st, st(1) ; (same thing as above)
fstp st(0)
setae al ; Set al if above or equal (CF=0).
test al, al
je .L5
; Do something 2
.L5:
leave
ret

Historically (we're talking the 1980s and early 1990s), there were some architectures in which this was true. The root issue is that integer comparison is inherently implemented via integer subtractions. This gives rise to the following cases.
Comparison Subtraction
---------- -----------
A < B --> A - B < 0
A = B --> A - B = 0
A > B --> A - B > 0
Now, when A < B the subtraction has to borrow a high-bit for the subtraction to be correct, just like you carry and borrow when adding and subtracting by hand. This "borrowed" bit was usually referred to as the carry bit and would be testable by a branch instruction. A second bit called the zero bit would be set if the subtraction were identically zero which implied equality.
There were usually at least two conditional branch instructions, one to branch on the carry bit and one on the zero bit.
Now, to get at the heart of the matter, let's expand the previous table to include the carry and zero bit results.
Comparison Subtraction Carry Bit Zero Bit
---------- ----------- --------- --------
A < B --> A - B < 0 0 0
A = B --> A - B = 0 1 1
A > B --> A - B > 0 1 0
So, implementing a branch for A < B can be done in one instruction, because the carry bit is clear only in this case, , that is,
;; Implementation of "if (A < B) goto address;"
cmp A, B ;; compare A to B
bcz address ;; Branch if Carry is Zero to the new address
But, if we want to do a less-than-or-equal comparison, we need to do an additional check of the zero flag to catch the case of equality.
;; Implementation of "if (A <= B) goto address;"
cmp A, B ;; compare A to B
bcz address ;; branch if A < B
bzs address ;; also, Branch if the Zero bit is Set
So, on some machines, using a "less than" comparison might save one machine instruction. This was relevant in the era of sub-megahertz processor speed and 1:1 CPU-to-memory speed ratios, but it is almost totally irrelevant today.

Assuming we're talking about internal integer types, there's no possible way one could be faster than the other. They're obviously semantically identical. They both ask the compiler to do precisely the same thing. Only a horribly broken compiler would generate inferior code for one of these.
If there was some platform where < was faster than <= for simple integer types, the compiler should always convert <= to < for constants. Any compiler that didn't would just be a bad compiler (for that platform).

I see that neither is faster. The compiler generates the same machine code in each condition with a different value.
if(a < 901)
cmpl $900, -4(%rbp)
jg .L2
if(a <=901)
cmpl $901, -4(%rbp)
jg .L3
My example if is from GCC on x86_64 platform on Linux.
Compiler writers are pretty smart people, and they think of these things and many others most of us take for granted.
I noticed that if it is not a constant, then the same machine code is generated in either case.
int b;
if(a < b)
cmpl -4(%rbp), %eax
jge .L2
if(a <=b)
cmpl -4(%rbp), %eax
jg .L3

For floating point code, the <= comparison may indeed be slower (by one instruction) even on modern architectures. Here's the first function:
int compare_strict(double a, double b) { return a < b; }
On PowerPC, first this performs a floating point comparison (which updates cr, the condition register), then moves the condition register to a GPR, shifts the "compared less than" bit into place, and then returns. It takes four instructions.
Now consider this function instead:
int compare_loose(double a, double b) { return a <= b; }
This requires the same work as compare_strict above, but now there's two bits of interest: "was less than" and "was equal to." This requires an extra instruction (cror - condition register bitwise OR) to combine these two bits into one. So compare_loose requires five instructions, while compare_strict requires four.
You might think that the compiler could optimize the second function like so:
int compare_loose(double a, double b) { return ! (a > b); }
However this will incorrectly handle NaNs. NaN1 <= NaN2 and NaN1 > NaN2 need to both evaluate to false.

Maybe the author of that unnamed book has read that a > 0 runs faster than a >= 1 and thinks that is true universally.
But it is because a 0 is involved (because CMP can, depending on the architecture, replaced e.g. with OR) and not because of the <.

At the very least, if this were true a compiler could trivially optimise a <= b to !(a > b), and so even if the comparison itself were actually slower, with all but the most naive compiler you would not notice a difference.

TL;DR answer
For most combinations of architecture, compiler and language, < will not be faster than <=.
Full answer
Other answers have concentrated on x86 architecture, and I don't know the ARM architecture (which your example assembler seems to be) well enough to comment specifically on the code generated, but this is an example of a micro-optimisation which is very architecture specific, and is as likely to be an anti-optimisation as it is to be an optimisation.
As such, I would suggest that this sort of micro-optimisation is an example of cargo cult programming rather than best software engineering practice.
Counterexample
There are probably some architectures where this is an optimisation, but I know of at least one architecture where the opposite may be true. The venerable Transputer architecture only had machine code instructions for equal to and greater than or equal to, so all comparisons had to be built from these primitives.
Even then, in almost all cases, the compiler could order the evaluation instructions in such a way that in practice, no comparison had any advantage over any other. Worst case though, it might need to add a reverse instruction (REV) to swap the top two items on the operand stack. This was a single byte instruction which took a single cycle to run, so had the smallest overhead possible.
Summary
Whether or not a micro-optimisation like this is an optimisation or an anti-optimisation depends on the specific architecture you are using, so it is usually a bad idea to get into the habit of using architecture specific micro-optimisations, otherwise you might instinctively use one when it is inappropriate to do so, and it looks like this is exactly what the book you are reading is advocating.

They have the same speed. Maybe in some special architecture what he/she said is right, but in the x86 family at least I know they are the same. Because for doing this the CPU will do a substraction (a - b) and then check the flags of the flag register. Two bits of that register are called ZF (zero Flag) and SF (sign flag), and it is done in one cycle, because it will do it with one mask operation.

This would be highly dependent on the underlying architecture that the C is compiled to. Some processors and architectures might have explicit instructions for equal to, or less than and equal to, which execute in different numbers of cycles.
That would be pretty unusual though, as the compiler could work around it, making it irrelevant.

You should not be able to notice the difference even if there is any. Besides, in practice, you'll have to do an additional a + 1 or a - 1 to make the condition stand unless you're going to use some magic constants, which is a very bad practice by all means.

When I wrote the first version of this answer, I was only looking at the title question about < vs. <= in general, not the specific example of a constant a < 901 vs. a <= 900. Many compilers always shrink the magnitude of constants by converting between < and <=, e.g. because x86 immediate operand have a shorter 1-byte encoding for -128..127.
For ARM, being able to encode as an immediate depends on being able to rotate a narrow field into any position in a word. So cmp r0, #0x00f000 would be encodeable, while cmp r0, #0x00efff would not be. So the make-it-smaller rule for comparison vs. a compile-time constant doesn't always apply for ARM. AArch64 is either shift-by-12 or not, instead of an arbitrary rotation, for instructions like cmp and cmn, unlike 32-bit ARM and Thumb modes.
< vs. <= in general, including for runtime-variable conditions
In assembly language on most machines, a comparison for <= has the same cost as a comparison for <. This applies whether you're branching on it, booleanizing it to create a 0/1 integer, or using it as a predicate for a branchless select operation (like x86 CMOV). The other answers have only addressed this part of the question.
But this question is about the C++ operators, the input to the optimizer. Normally they're both equally efficient; the advice from the book sounds totally bogus because compilers can always transform the comparison that they implement in asm. But there is at least one exception where using <= can accidentally create something the compiler can't optimize.
As a loop condition, there are cases where <= is qualitatively different from <, when it stops the compiler from proving that a loop is not infinite. This can make a big difference, disabling auto-vectorization.
Unsigned overflow is well-defined as base-2 wrap around, unlike signed overflow (UB). Signed loop counters are generally safe from this with compilers that optimize based on signed-overflow UB not happening: ++i <= size will always eventually become false. (What Every C Programmer Should Know About Undefined Behavior)
void foo(unsigned size) {
unsigned upper_bound = size - 1; // or any calculation that could produce UINT_MAX
for(unsigned i=0 ; i <= upper_bound ; i++)
...
Compilers can only optimize in ways that preserve the (defined and legally observable) behaviour of the C++ source for all possible input values, except ones that lead to undefined behaviour.
(A simple i <= size would create the problem too, but I thought calculating an upper bound was a more realistic example of accidentally introducing the possibility of an infinite loop for an input you don't care about but which the compiler must consider.)
In this case, size=0 leads to upper_bound=UINT_MAX, and i <= UINT_MAX is always true. So this loop is infinite for size=0, and the compiler has to respect that even though you as the programmer probably never intend to pass size=0. If the compiler can inline this function into a caller where it can prove that size=0 is impossible, then great, it can optimize like it could for i < size.
Asm like if(!size) skip the loop; do{...}while(--size); is one normally-efficient way to optimize a for( i<size ) loop, if the actual value of i isn't needed inside the loop (Why are loops always compiled into "do...while" style (tail jump)?).
But that do{}while can't be infinite: if entered with size==0, we get 2^n iterations. (Iterating over all unsigned integers in a for loop C makes it possible to express a loop over all unsigned integers including zero, but it's not easy without a carry flag the way it is in asm.)
With wraparound of the loop counter being a possibility, modern compilers often just "give up", and don't optimize nearly as aggressively.
Example: sum of integers from 1 to n
Using unsigned i <= n defeats clang's idiom-recognition that optimizes sum(1 .. n) loops with a closed form based on Gauss's n * (n+1) / 2 formula.
unsigned sum_1_to_n_finite(unsigned n) {
unsigned total = 0;
for (unsigned i = 0 ; i < n+1 ; ++i)
total += i;
return total;
}
x86-64 asm from clang7.0 and gcc8.2 on the Godbolt compiler explorer
# clang7.0 -O3 closed-form
cmp edi, -1 # n passed in EDI: x86-64 System V calling convention
je .LBB1_1 # if (n == UINT_MAX) return 0; // C++ loop runs 0 times
# else fall through into the closed-form calc
mov ecx, edi # zero-extend n into RCX
lea eax, [rdi - 1] # n-1
imul rax, rcx # n * (n-1) # 64-bit
shr rax # n * (n-1) / 2
add eax, edi # n + (stuff / 2) = n * (n+1) / 2 # truncated to 32-bit
ret # computed without possible overflow of the product before right shifting
.LBB1_1:
xor eax, eax
ret
But for the naive version, we just get a dumb loop from clang.
unsigned sum_1_to_n_naive(unsigned n) {
unsigned total = 0;
for (unsigned i = 0 ; i<=n ; ++i)
total += i;
return total;
}
# clang7.0 -O3
sum_1_to_n(unsigned int):
xor ecx, ecx # i = 0
xor eax, eax # retval = 0
.LBB0_1: # do {
add eax, ecx # retval += i
add ecx, 1 # ++1
cmp ecx, edi
jbe .LBB0_1 # } while( i<n );
ret
GCC doesn't use a closed-form either way, so the choice of loop condition doesn't really hurt it; it auto-vectorizes with SIMD integer addition, running 4 i values in parallel in the elements of an XMM register.
# "naive" inner loop
.L3:
add eax, 1 # do {
paddd xmm0, xmm1 # vect_total_4.6, vect_vec_iv_.5
paddd xmm1, xmm2 # vect_vec_iv_.5, tmp114
cmp edx, eax # bnd.1, ivtmp.14 # bound and induction-variable tmp, I think.
ja .L3 #, # }while( n > i )
"finite" inner loop
# before the loop:
# xmm0 = 0 = totals
# xmm1 = {0,1,2,3} = i
# xmm2 = set1_epi32(4)
.L13: # do {
add eax, 1 # i++
paddd xmm0, xmm1 # total[0..3] += i[0..3]
paddd xmm1, xmm2 # i[0..3] += 4
cmp eax, edx
jne .L13 # }while( i != upper_limit );
then horizontal sum xmm0
and peeled cleanup for the last n%3 iterations, or something.
It also has a plain scalar loop which I think it uses for very small n, and/or for the infinite loop case.
BTW, both of these loops waste an instruction (and a uop on Sandybridge-family CPUs) on loop overhead. sub eax,1/jnz instead of add eax,1/cmp/jcc would be more efficient. 1 uop instead of 2 (after macro-fusion of sub/jcc or cmp/jcc). The code after both loops writes EAX unconditionally, so it's not using the final value of the loop counter.

You could say that line is correct in most scripting languages, since the extra character results in slightly slower code processing.
However, as the top answer pointed out, it should have no effect in C++, and anything being done with a scripting language probably isn't that concerned about optimization.

Only if the people who created the computers are bad with boolean logic. Which they shouldn't be.
Every comparison (>= <= > <) can be done in the same speed.
What every comparison is, is just a subtraction (the difference) and seeing if it's positive/negative.
(If the msb is set, the number is negative)
How to check a >= b? Sub a-b >= 0 Check if a-b is positive.
How to check a <= b? Sub 0 <= b-a Check if b-a is positive.
How to check a < b? Sub a-b < 0 Check if a-b is negative.
How to check a > b? Sub 0 > b-a Check if b-a is negative.
Simply put, the computer can just do this underneath the hood for the given op:
a >= b == msb(a-b)==0
a <= b == msb(b-a)==0
a > b == msb(b-a)==1
a < b == msb(a-b)==1
and of course the computer wouldn't actually need to do the ==0 or ==1 either.
for the ==0 it could just invert the msb from the circuit.
Anyway, they most certainly wouldn't have made a >= b be calculated as a>b || a==b lol

In C and C++, an important rule for the compiler is the “as-if” rule: If doing X has the exact same behavior as if you did Y, then the compiler is free to choose which one it uses.
In your case, “a < 901” and “a <= 900” always have the same result, so the compiler is free to compile either version. If one version was faster, for whatever reason, then any quality compiler would produce code for the version that is faster. So unless your compiler produced exceptionally bad code, both versions would run at equal speed.
Now if you had a situation where two bits of code will always produce the same result, but it is hard to prove for the compiler, and/or it is hard for the compiler to prove which if any version is faster, then you might get different code running at different speeds.
PS The original example might run at different speeds if the processor supports single byte constants (faster) and multi byte constants (slower), so comparing against 255 (1 byte) might be faster than comparing against 256 (two bytes). I’d expect the compiler to do whatever is faster.

Only if computation path depends on data:
a={1,1,1,1,1000,1,1,1,1}
while (i<=4)
{
for(j from 0 to a[i]){ do_work(); }
i++;
}
will compute 250 times more than while(i<4)
Real-world sample would be computing mandelbrot-set. If you include a pixel that iterates 1000000 times, it will cause a lag but the coincidence with <= usage probability is too low.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js