Saturate short (int16) in C++

Saturate short (int16) in C++ - c++

I am optimizing bottleneck code:
int sum = ........
sum = (sum >> _bitShift);
if (sum > 32000)
sum = 32000; //if we get an overflow, saturate output
else if (sum < -32000)
sum = -32000; //if we get an underflow, saturate output
short result = static_cast<short>(sum);
I would like to write the saturation condition as one "if condition" or even better with no "if condition" to make this code faster. I don't need saturation exactly at value 32000, any similar value like 32768 is acceptable.
According this page, there is a saturation instruction in ARM. Is there anything similar in x86/x64?

I'm not at all convinced that attempting to eliminate the if statement(s) is likely to do any real good. A quick check indicates that given this code:
int clamp(int x) {
if (x < -32768)
x = -32768;
else if (x > 32767)
x = 32767;
return x;
}
...both gcc and Clang produce branch-free results like this:
clamp(int):
cmp edi, 32767
mov eax, 32767
cmovg edi, eax
mov eax, -32768
cmp edi, -32768
cmovge eax, edi
ret
You can do something like x = std::min(std::max(x, -32768), 32767);, but this produces the same sequence, and the source seems less readable, at least to me.
You can do considerably better than this if you use Intel's vector instructions, but probably only if you're willing to put a fair amount of work into it--in particular, you'll probably need to operate on an entire (small) vector of values simultaneously to accomplish much this way. If you do go that way, you usually want to take a somewhat different approach to the task than you seem to be taking right now. Right now, you're apparently depending on int being a 32-bit type, so you're doing the arithmetic on a 32-bit type, then afterwards truncating it back down to a (saturated) 16-bit value.
With something like AVX, you'd typically want to use an instruction like _mm256_adds_epi16 to take a vector of 16 values (of 16-bits apiece), and do a saturating addition on all of them at once (or, likewise, _mm256_subs_epi16 to do saturating subtraction).
Since you're writing C++, what I've given above are the names of the compiler intrinsics used in most current compilers (gcc, icc, clang, msvc) for x86 processors. If you're writing assembly language directly, the instructions would be vpaddsw and vpsubsw respectively.
If you can count on a really current processor (one that supports AVX 512 instructions) you can use them instead to operate on a vector of 32 16-bit values simultaneously.

Are you sure you can beat the compiler at this?
Here's x64 retail with max size optimizations enabled. Visual Studio v15.7.5.
ecx contains the intial value at the start of this block. eax is filled with the saturated value when it is done.
return (x > 32767) ? 32767 : ((x < -32768) ? -32768 : x);
mov edx,0FFFF8000h
movzx eax,cx
cmp ecx,edx
cmovl eax,edx
mov edx,7FFFh
cmp ecx,edx
movzx eax,ax
cmovg eax,edx

Related

C++ std::countr_zero() in SIMD 128/256/512 (find position of least significant 1 bit in 128/256/512-bit number) [duplicate]

I have a huge memory block (bit-vector) with size N bits within one memory page, consider N on average is 5000, i.e. 5k bits to store some flags information.
At a certain points in time (super-frequent - critical) I need to find the first bit set in this whole big bit-vector. Now I do it per-64-word, i.e. with help of __builtin_ctzll). But when N grows and search algorithm cannot be improved, there can be some possibility to scale this search through the expansion of memory access width. This is the main problem in a few words
There is a single assembly instruction called BSF that gives the position of the highest set bit (GCC's __builtin_ctzll()).
So in x86-64 arch I can find the highest bit set cheaply in 64-bit words.
But what about scaling through memory width?
E.g. is there a way to do it efficiently with 128 / 256 / 512 -bit registers?
Basically I'm interested in some C API function to achieve this, but also want to know what this method is based on.
UPD: As for CPU, I'm interested for this optimization to support the following CPU lineups:
Intel Xeon E3-12XX, Intel Xeon E5-22XX/26XX/E56XX, Intel Core i3-5XX/4XXX/8XXX, Intel Core i5-7XX, Intel Celeron G18XX/G49XX (optional for Intel Atom N2600, Intel Celeron N2807, Cortex-A53/72)
P.S. In mentioned algorithm before the final bit scan I need to sum k (in average 20-40) N-bit vectors with CPU AND (the AND result is just a preparatory stage for the bit-scan). This is also desirable to do with memory width scaling (i.e. more efficiently than per 64bit-word AND)
Read also: Find first set

This answer is in a different vein, but if you know in advance that you're going to be maintaining a collection of B bits and need to be able to efficiently set and clear bits while also figuring out which bit is the first bit set, you may want to use a data structure like a van Emde Boas tree or a y-fast trie. These data structures are designed to store integers in a small range, so instead of setting or clearing individual bits, you could add or remove the index of the bit you want to set/clear. They're quite fast - you can add or remove items in time O(log log B), and they let you find the smallest item in time O(1). Figure that if B ≈ 50000, then log log B is about 4.
I'm aware this doesn't directly address how to find the highest bit set in a huge bitvector. If your setup is such that you have to work with bitvectors, the other answers might be more helpful. But if you have the option to reframe the problem in a way that doesn't involve bitvector searching, these other data structures might be a better fit.

The best way to find the first set bit within a whole vector (AFAIK) involves finding the first non-zero SIMD element (e.g. a byte or dword), then using a bit-scan on that. (__builtin_ctz / bsf / tzcnt / ffs-1) . As such, ctz(vector) is not itself a useful building block for searching an array, only for after the loop.
Instead you want to loop over the array searching for a non-zero vector, using a whole-vector check involving SSE4.1 ptest xmm0,xmm0 / jz .loop (3 uops), or with SSE2 pcmpeqd v, zero / pmovmskb / cmp eax, 0xffff / je .loop (3 uops after cmp/jcc macro-fusion). https://uops.info/
Once you do find a non-zero vector, pcmpeqb / movmskps / bsf on that to find a dword index, then load that dword and bsf it. Add the start-bit position (CHAR_BIT*4*dword_idx) to the bsf bit-position within that element. This is a fairly long dependency chain for latency, including an integer L1d load latency. But since you just loaded the vector, at least you can be fairly confident you'll hit in cache when you load it again with integer. (If the vector was generated on the fly, then probably still best to store / reload it and let store-forwarding work, instead of trying to generate a shuffle control for vpermilps/movd or SSSE3 pshufb/movd/movzx ecx, al.)
The loop problem is very much like strlen or memchr, except we're rejecting a single value (0) and looking for anything else. Still, we can take inspiration from hand-optimized asm strlen / memchr implementations like glibc's, for example loading multiple vectors and doing one check to see if any of them have what they're looking for. (For strlen, combine with pminub to get a 0 if any element is 0. For pcmpeqb compare results, OR for memchr). For our purposes, the reduction operation we want is OR - any non-zero input will make the output non-zero, and bitwise boolean ops can run on any vector ALU port.
(If the expected first-bit-position isn't very high, it's not worth being too aggressive with this: if the first set bit is in the first vector, sorting things out between 2 vectors you've loaded will be slower. 5000 bits is only 625 bytes, or 19.5 AVX2 __m256i vectors. And the first set bit is probably not always right at the end)
AVX2 version:
This checks pairs of 32-byte vectors (i.e. whole cache lines) for non-zero, and if found then sorts that out into one 64-bit bitmap for a single CTZ operation. That extra shift/OR costs latency in the critical path, but the hope is that we get to the first 1 bit sooner.
Combining 2 vectors down to one with OR means it's not super useful to know which element of the OR result was non-zero. We basically redo the work inside the if. That's the price we pay for keeping the amount of uops low for the actual search part.
(The if body ends with a return, so in the asm it's actually like an if()break, or actually an if()goto out of the loop since it goes to a difference place than the not-found return -1 from falling through out of the loop.)
// untested, especially the pointer end condition, but compiles to asm that looks good
// Assumes len is a multiple of 64 bytes
#include <immintrin.h>
#include <stdint.h>
#include <string.h>
// aliasing-safe: p can point to any C data type
int bitscan_avx2(const char *p, size_t len /* in bytes */)
{
//assert(len % 64 == 0);
//optimal if p is 64-byte aligned, so we're checking single cache-lines
const char *p_init = p;
const char *endp = p + len - 64;
do {
__m256i v1 = _mm256_loadu_si256((const __m256i*)p);
__m256i v2 = _mm256_loadu_si256((const __m256i*)(p+32));
__m256i or = _mm256_or_si256(v1,v2);
if (!_mm256_testz_si256(or, or)){ // find the first non-zero cache line
__m256i v1z = _mm256_cmpeq_epi32(v1, _mm256_setzero_si256());
__m256i v2z = _mm256_cmpeq_epi32(v2, _mm256_setzero_si256());
uint32_t zero_map = _mm256_movemask_ps(_mm256_castsi256_ps(v1z));
zero_map |= _mm256_movemask_ps(_mm256_castsi256_ps(v2z)) << 8;
unsigned idx = __builtin_ctz(~zero_map); // Use ctzll for GCC, because GCC is dumb and won't optimize away a movsx
uint32_t nonzero_chunk;
memcpy(&nonzero_chunk, p+4*idx, sizeof(nonzero_chunk)); // aliasing / alignment-safe load
return (p-p_init + 4*idx)*8 + __builtin_ctz(nonzero_chunk);
}
p += 64;
}while(p < endp);
return -1;
}
On Godbolt with clang 12 -O3 -march=haswell:
bitscan_avx2:
lea rax, [rdi + rsi]
add rax, -64 # endp
xor ecx, ecx
.LBB0_1: # =>This Inner Loop Header: Depth=1
vmovdqu ymm1, ymmword ptr [rdi] # do {
vmovdqu ymm0, ymmword ptr [rdi + 32]
vpor ymm2, ymm0, ymm1
vptest ymm2, ymm2
jne .LBB0_2 # if() goto out of the inner loop
add ecx, 512 # bit-counter incremented in the loop, for (p-p_init) * 8
add rdi, 64
cmp rdi, rax
jb .LBB0_1 # }while(p<endp)
mov eax, -1 # not-found return path
vzeroupper
ret
.LBB0_2:
vpxor xmm2, xmm2, xmm2
vpcmpeqd ymm1, ymm1, ymm2
vmovmskps eax, ymm1
vpcmpeqd ymm0, ymm0, ymm2
vmovmskps edx, ymm0
shl edx, 8
or edx, eax # mov ah,dl would be interesting, but compilers won't do it.
not edx # one_positions = ~zero_positions
xor eax, eax # break false dependency
tzcnt eax, edx # dword_idx
xor edx, edx
tzcnt edx, dword ptr [rdi + 4*rax] # p[dword_idx]
shl eax, 5 # dword_idx * 4 * CHAR_BIT
add eax, edx
add eax, ecx
vzeroupper
ret
This is probably not optimal for all CPUs, e.g. maybe we could use a memory-source vpcmpeqd for at least one of the inputs, and not cost any extra front-end uops, only back-end. As long as compilers keep using pointer-increments, not indexed addressing modes that would un-laminate. That would reduce the amount of work needed after the branch (which probably mispredicts).
To still use vptest, you might have to take advantage of the CF result from the CF = (~dst & src == 0) operation against a vector of all-ones, so we could check that all elements matched (i.e. the input was all zeros). Unfortunately, Can PTEST be used to test if two registers are both zero or some other condition? - no, I don't think we can usefully use vptest without a vpor.
Clang decided not to actually subtract pointers after the loop, instead to do more work in the search loop. :/ The loop is 9 uops (after macro-fusion of cmp/jb), so unfortunately it can only run a bit less than 1 iteration per 2 cycles. So it's only managing less than half of L1d cache bandwidth.
But apparently a single array isn't your real problem.
Without AVX
16-byte vectors mean we don't have to deal with the "in-lane" behaviour of AVX2 shuffles. So instead of OR, we can combine with packssdw or packsswb. Any set bits in the high half of a pack input will signed-saturate the result to 0x80 or 0x7f. (So signed saturation is key, not unsigned packuswb which will saturate signed-negative inputs to 0.)
However, shuffles only run on port 5 on Intel CPUs, so beware of throughput limits. ptest on Skylake for example is 2 uops, p5 and p0, so using packsswb + ptest + jz would limit to one iteration per 2 clocks. But pcmpeqd + pmovmskb don't.
Unfortunately, using pcmpeq on each input separately before packing / combining would cost more uops. But would reduce the amount of work left for the cleanup, and if the loop-exit usually involves a branch mispredict, that might reduce overall latency.
2x pcmpeqd => packssdw => pmovmskb => not => bsf would give you a number you have to multiply by 2 to use as a byte offset to get to the non-zero dword. e.g. memcpy(&tmp_u32, p + (2*idx), sizeof(tmp_u32));. i.e. bsf eax, [rdi + rdx*2].
With AVX-512:
You mentioned 512-bit vectors, but none of the CPUs you listed support AVX-512. Even if so, you might want to avoid 512-bit vectors because SIMD instructions lowering CPU frequency, unless your program spends a lot of time doing this, and your data is hot in L1d cache so you can truly benefit instead of still bottlenecking on L2 cache bandwidth. But even with 256-bit vectors, AVX-512 has new instructions that are useful for this:
integer compares (vpcmpb/w/d/q) have a choice of predicate, so you can do not-equal instead of having to invert later with NOT. Or even test-into-register vptestmd so you don't need a zeroed vector to compare against.
compare-into-mask is sort of like pcmpeq + movmsk, except the result is in a k register, still need a kmovq rax, k0 before you can tzcnt.
kortest - set FLAGS according to the OR of two mask registers being non-zero. So the search loop could do vpcmpd k0, ymm0, [rdi] / vpcmpd k1, ymm0, [rdi+32] / kortestw k0, k1
ANDing multiple input arrays
You mention your real problem is that you have up-to-20 arrays of bits, and you want to intersect them with AND and find the first set bit in the intersection.
You may want to do this in blocks of a few vectors, optimistically hoping that there will be a set bit somewhere early.
AND groups of 4 or 8 inputs, accumulating across results with OR so you can tell if there were any 1s in this block of maybe 4 vectors from each input. (If there weren't any 1 bits, do another block of 4 vectors, 64 or 128 bytes while you still have the pointers loaded, because the intersection would definitely be empty if you moved on to the other inputs now). Tuning these chunk sizes depends on how sparse your 1s are, e.g. maybe always work in chunks of 6 or 8 vectors. Power-of-2 numbers are nice, though, because you can pad your allocations out to a multiple of 64 or 128 bytes so you don't have to worry about stopping early.)
(For odd numbers of inputs, maybe pass the same pointer twice to a function expecting 4 inputs, instead of dispatching to special versions of the loop for every possible number.)
L1d cache is 8-way associative (before Ice Lake with 12-way), and a limited number of integer/pointer registers can make it a bad idea to try to read too many streams at once. You probably don't want a level of indirection that makes the compiler loop over an actual array in memory of pointers either.

You may try this function, your compiler should optimize this code for your CPU. It's not super perfect, but it should be relatively quick and mostly portable.
PS length should be divisible by 8 for max speed
#include <stdio.h>
#include <stdint.h>
/* Returns the index position of the most significant bit; starting with index 0. */
/* Return value is between 0 and 64 times length. */
/* When return value is exact 64 times length, no significant bit was found, aka bf is 0. */
uint32_t offset_fsb(const uint64_t *bf, const register uint16_t length){
register uint16_t i = 0;
uint16_t remainder = length % 8;
switch(remainder){
case 0 : /* 512bit compare */
while(i < length){
if(bf[i] | bf[i+1] | bf[i+2] | bf[i+3] | bf[i+4] | bf[i+5] | bf[i+6] | bf[i+7]) break;
i += 8;
}
/* fall through */
case 4 : /* 256bit compare */
while(i < length){
if(bf[i] | bf[i+1] | bf[i+2] | bf[i+3]) break;
i += 4;
}
/* fall through */
case 6 : /* 128bit compare */
/* fall through */
case 2 : /* 128bit compare */
while(i < length){
if(bf[i] | bf[i+1]) break;
i += 2;
}
/* fall through */
default : /* 64bit compare */
while(i < length){
if(bf[i]) break;
i++;
}
}
register uint32_t offset_fsb = i * 64;
/* Check the last uint64_t if the last uint64_t is not 0. */
if(bf[i]){
register uint64_t s = bf[i];
offset_fsb += 63;
while(s >>= 1) offset_fsb--;
}
return offset_fsb;
}
int main(int argc, char *argv[]){
uint64_t test[16];
test[0] = 0;
test[1] = 0;
test[2] = 0;
test[3] = 0;
test[4] = 0;
test[5] = 0;
test[6] = 0;
test[7] = 0;
test[8] = 0;
test[9] = 0;
test[10] = 0;
test[11] = 0;
test[12] = 0;
test[13] = 0;
test[14] = 0;
test[15] = 1;
printf("offset_fsb = %d\n", offset_fsb(test, 16));
return 0;
}

Why does division by 3 require a rightshift (and other oddities) on x86?

I have the following C/C++ function:
unsigned div3(unsigned x) {
return x / 3;
}
When compiled using clang 10 at -O3, this results in:
div3(unsigned int):
mov ecx, edi # tmp = x
mov eax, 2863311531 # result = 3^-1
imul rax, rcx # result *= tmp
shr rax, 33 # result >>= 33
ret
What I do understand is: division by 3 is equivalent to multiplying with the multiplicative inverse 3-1 mod 232 which is 2863311531.
There are some things that I don't understand though:
Why do we need to use ecx/rcx at all? Can't we multiply rax with edi directly?
Why do we multiply in 64-bit mode? Wouldn't it be faster to multiply eax and ecx?
Why are we using imul instead of mul? I thought modular arithmetic would be all unsigned.
What's up with the 33-bit rightshift at the end? I thought we can just drop the highest 32-bits.
Edit 1
For those who don't understand what I mean by 3-1 mod 232, I am talking about the multiplicative inverse here.
For example:
// multiplying with inverse of 3:
15 * 2863311531 = 42949672965
42949672965 mod 2^32 = 5
// using fixed-point multiplication
15 * 2863311531 = 42949672965
42949672965 >> 33 = 5
// simply dividing by 3
15 / 3 = 5
So multiplying with 42949672965 is actually equivalent to dividing by 3. I assumed clang's optimization is based on modular arithmetic, when it's really based on fixed point arithmetic.
Edit 2
I have now realized that the multiplicative inverse can only be used for divisions without a remainder. For example, multiplying 1 times 3-1 is equal to 3-1, not zero. Only fixed point arithmetic has correct rounding.
Unfortunately, clang does not make any use of modular arithmetic which would just be a single imul instruction in this case, even when it could. The following function has the same compile output as above.
unsigned div3(unsigned x) {
__builtin_assume(x % 3 == 0);
return x / 3;
}
(Canonical Q&A about fixed-point multiplicative inverses for exact division that work for every possible input: Why does GCC use multiplication by a strange number in implementing integer division? - not quite a duplicate because it only covers the math, not some of the implementation details like register width and imul vs. mul.)

Can't we multiply rax with edi directly?
We can't imul rax, rdi because the calling convention allows the caller to leave garbage in the high bits of RDI; only the EDI part contains the value. This is a non-issue when inlining; writing a 32-bit register does implicitly zero-extend to the full 64-bit register, so the compiler will usually not need an extra instruction to zero-extend a 32-bit value.
(zero-extending into a different register is better because of limitations on mov-elimination, if you can't avoid it).
Taking your question even more literally, no, x86 doesn't have any multiply instructions that zero-extend one of their inputs to let you multiply a 32-bit and a 64-bit register. Both inputs must be the same width.
Why do we multiply in 64-bit mode?
(terminology: all of this code runs in 64-bit mode. You're asking why 64-bit operand-size.)
You could mul edi to multiply EAX with EDI to get a 64-bit result split across EDX:EAX, but mul edi is 3 uops on Intel CPUs, vs. most modern x86-64 CPUs having fast 64-bit imul. (Although imul r64, r64 is slower on AMD Bulldozer-family, and on some low-power CPUs.) https://uops.info/ and https://agner.org/optimize/ (instruction tables and microarch PDF)
(Fun fact: mul rdi is actually cheaper on Intel CPUs, only 2 uops. Perhaps something to do with not having to do extra splitting on the output of the integer multiply unit, like mul edi would have to split the 64-bit low half multiplier output into EDX and EAX halves, but that happens naturally for 64x64 => 128-bit mul.)
Also the part you want is in EDX so you'd need another mov eax, edx to deal with it. (Again, because we're looking at code for a stand-alone definition of the function, not after inlining into a caller.)
GCC 8.3 and earlier did use 32-bit mul instead of 64-bit imul (https://godbolt.org/z/5qj7d5). That was not crazy for -mtune=generic when Bulldozer-family and old Silvermont CPUs were more relevant, but those CPUs are farther in the past for more recent GCC, and its generic tuning choices reflect that. Unfortunately GCC also wasted a mov instruction copying EDI to EAX, making this way look even worse :/
# gcc8.3 -O3 (default -mtune=generic)
div3(unsigned int):
mov eax, edi # 1 uop, stupid wasted instruction
mov edx, -1431655765 # 1 uop (same 32-bit constant, just printed differently)
mul edx # 3 uops on Sandybridge-family
mov eax, edx # 1 uop
shr eax # 1 uop
ret
# total of 7 uops on SnB-family
Would only be 6 uops with mov eax, 0xAAAAAAAB / mul edi, but still worse than:
# gcc9.3 -O3 (default -mtune=generic)
div3(unsigned int):
mov eax, edi # 1 uop
mov edi, 2863311531 # 1 uop
imul rax, rdi # 1 uop
shr rax, 33 # 1 uop
ret
# total 4 uops, not counting ret
Unfortunately, 64-bit 0x00000000AAAAAAAB can't be represented as a 32-bit sign-extended immediate, so imul rax, rcx, 0xAAAAAAAB isn't encodeable. It would mean 0xFFFFFFFFAAAAAAAB.
Why are we using imul instead of mul? I thought modular arithmetic would be all unsigned.
It is unsigned. Signedness of the inputs only affects the high half of the result, but imul reg, reg doesn't produce the high half. Only the one-operand forms of mul and imul are full multiplies that do NxN => 2N, so only they need separate signed and unsigned versions.
Only imul has the faster and more flexible low-half-only forms. The only thing that's signed about imul reg, reg is that it sets OF based on signed overflow of the low half. It wasn't worth spending more opcodes and more transistors just to have a mul r,r whose only difference from imul r,r is the FLAGS output.
Intel's manual (https://www.felixcloutier.com/x86/imul) even points out the fact that it can be used for unsigned.
What's up with the 33-bit rightshift at the end? I thought we can just drop the highest 32-bits.
No, there's no multiplier constant that would give the exact right answer for every possible input x if you implemented it that way. The "as-if" optimization rule doesn't allow approximations, only implementations that produce the exact same observable behaviour for every input the program uses. Without knowing a value-range for x other than full range of unsigned, compilers don't have that option. (-ffast-math only applies to floating point; if you want faster approximations for integer math, code them manually like below):
See Why does GCC use multiplication by a strange number in implementing integer division? for more about the fixed-point multiplicative inverse method compilers use for exact division by compile time constants.
For an example of this not working in the general case, see my edit to an answer on Divide by 10 using bit shifts? which proposed
// Warning: INEXACT FOR LARGE INPUTS
// this fast approximation can just use the high half,
// so on 32-bit machines it avoids one shift instruction vs. exact division
int32_t div10(int32_t dividend)
{
int64_t invDivisor = 0x1999999A;
return (int32_t) ((invDivisor * dividend) >> 32);
}
Its first wrong answer (if you loop from 0 upward) is div10(1073741829) = 107374183 when 1073741829/10 is actually 107374182. (It rounded up instead of toward 0 like C integer division is supposed to.)
From your edit, I see you were actually talking about using the low half of a multiply result, which apparently works perfectly for exact multiples all the way up to UINT_MAX.
As you say, it completely fails when the division would have a remainder, e.g. 16 * 0xaaaaaaab = 0xaaaaaab0 when truncated to 32-bit, not 5.
unsigned div3_exact_only(unsigned x) {
__builtin_assume(x % 3 == 0); // or an equivalent with if() __builtin_unreachable()
return x / 3;
}
Yes, if that math works out, it would be legal and optimal for compilers to implement that with 32-bit imul. They don't look for this optimization because it's rarely a known fact. IDK if it would be worth adding compiler code to even look for the optimization, in terms of compile time, not to mention compiler maintenance cost in developer time. It's not a huge difference in runtime cost, and it's rarely going to be possible. It is nice, though.
div3_exact_only:
imul eax, edi, 0xAAAAAAAB # 1 uop, 3c latency
ret
However, it is something you can do yourself in source code, at least for known type widths like uint32_t:
uint32_t div3_exact_only(uint32_t x) {
return x * 0xaaaaaaabU;
}

What's up with the 33-bit right shift at the end? I thought we can just drop the highest 32-bits.
Instead of 3^(-1) mod 3 you have to think more about 0.3333333 where the 0 before the . is located in the upper 32 bit and the the 3333 is located in the lower 32 bit.
This fixed point operation works fine, but the result is obviously shifted to the upper part of rax, therefor the CPU must shift the result down again after the operation.
Why are we using imul instead of mul? I thought modular arithmetic would be all unsigned.
There is no MUL instruction equivalent to the IMUL instruction. The IMUL variant that is used takes two registers:
a <= a * b
There is no MUL instruction that does that. MUL instructions are more expensive because they store the result as 128 Bit in two registers.
Of course you could use the legacy instructions, but this does not change the fact that the result is stored in two registers.

If you look at my answer to the prior question:
Why does GCC use multiplication by a strange number in implementing integer division?
It contains a link to a pdf article that explains this (my answer clarifies the stuff that isn't explained well in this pdf article):
https://gmplib.org/~tege/divcnst-pldi94.pdf
Note that one extra bit of precision is needed for some divisors, such as 7, the multiplier would normally require 33 bits, and the product would normally require 65 bits, but this can be avoided by handling the 2^32 bit separately with 3 additional instructions as shown in my prior answer and below.
Take a look at the generated code if you change to
unsigned div7(unsigned x) {
return x / 7;
}
So to explain the process, let L = ceil(log2(divisor)). For the question above, L = ceil(log2(3)) == 2. The right shift count would initially be 32+L = 34.
To generate a multiplier with a sufficient number of bits, two potential multipliers are generated: mhi will be the multiplier to be used, and the shift count will be 32+L.
mhi = (2^(32+L) + 2^(L))/3 = 5726623062
mlo = (2^(32+L) )/3 = 5726623061
Then a check is made to see if the number of required bits can be reduced:
while((L > 0) && ((mhi>>1) > (mlo>>1))){
mhi = mhi>>1;
mlo = mlo>>1;
L = L-1;
}
if(mhi >= 2^32){
mhi = mhi-2^32
L = L-1;
; use 3 additional instructions for missing 2^32 bit
}
... mhi>>1 = 5726623062>>1 = 2863311531
... mlo>>1 = 5726623061>>1 = 2863311530 (mhi>>1) > (mlo>>1)
... mhi = mhi>>1 = 2863311531
... mlo = mhi>>1 = 2863311530
... L = L-1 = 1
... the next loop exits since now (mhi>>1) == (mlo>>1)
So the multiplier is mhi = 2863311531 and the shift count = 32+L = 33.
On an modern X86, multiply and shift instructions are constant time, so there's no point in reducing the multiplier (mhi) to less than 32 bits, so that while(...) above is changed to an if(...).
In the case of 7, the loop exits on the first iteration, and requires 3 extra instructions to handle the 2^32 bit, so that mhi is <= 32 bits:
L = ceil(log2(7)) = 3
mhi = (2^(32+L) + 2^(L))/7 = 4908534053
mhi = mhi-2^32 = 613566757
Let ecx = dividend, the simple approach could overflow on the add:
mov eax, 613566757 ; eax = mhi
mul ecx ; edx:eax = ecx*mhi
add edx, ecx ; edx:eax = ecx*(mhi + 2^32), potential overflow
shr edx, 3
To avoid the potential overflow, note that eax = eax*2 - eax:
(ecx*eax) = (ecx*eax)<<1) -(ecx*eax)
(ecx*(eax+2^32)) = (ecx*eax)<<1)+ (ecx*2^32)-(ecx*eax)
(ecx*(eax+2^32))>>3 = ((ecx*eax)<<1)+ (ecx*2^32)-(ecx*eax) )>>3
= (((ecx*eax) )+(((ecx*2^32)-(ecx*eax))>>1))>>2
so the actual code, using u32() to mean upper 32 bits:
... visual studio generated code for div7, dividend is ecx
mov eax, 613566757
mul ecx ; edx = u32( (ecx*eax) )
sub ecx, edx ; ecx = u32( ((ecx*2^32)-(ecx*eax)) )
shr ecx, 1 ; ecx = u32( (((ecx*2^32)-(ecx*eax))>>1) )
lea eax, DWORD PTR [edx+ecx] ; eax = u32( (ecx*eax)+(((ecx*2^32)-(ecx*eax))>>1) )
shr eax, 2 ; eax = u32(((ecx*eax)+(((ecx*2^32)-(ecx*eax))>>1))>>2)
If a remainder is wanted, then the following steps can be used:
mhi and L are generated based on divisor during compile time
...
quotient = (x*mhi)>>(32+L)
product = quotient*divisor
remainder = x - product

x/3 is approximately (x * (2^32/3)) / 2^32. So we can perform a single 32x32->64 bit multiplication, take the higher 32 bits, and get approximately x/3.
There is some error because we cannot multiply exactly by 2^32/3, only by this number rounded to an integer. We get more precision using x/3 ≈ (x * (2^33/3)) / 2^33. (We can't use 2^34/3 because that is > 2^32). And that turns out to be good enough to get x/3 in all cases exactly. You would prove this by checking that the formula gives a result of k if the input is 3k or 3k+2.

Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly?

I wrote these two solutions for Project Euler Q14, in assembly and in C++. They implement identical brute force approach for testing the Collatz conjecture. The assembly solution was assembled with:
nasm -felf64 p14.asm && gcc p14.o -o p14
The C++ was compiled with:
g++ p14.cpp -o p14
Assembly, p14.asm:
section .data
fmt db "%d", 10, 0
global main
extern printf
section .text
main:
mov rcx, 1000000
xor rdi, rdi ; max i
xor rsi, rsi ; i
l1:
dec rcx
xor r10, r10 ; count
mov rax, rcx
l2:
test rax, 1
jpe even
mov rbx, 3
mul rbx
inc rax
jmp c1
even:
mov rbx, 2
xor rdx, rdx
div rbx
c1:
inc r10
cmp rax, 1
jne l2
cmp rdi, r10
cmovl rdi, r10
cmovl rsi, rcx
cmp rcx, 2
jne l1
mov rdi, fmt
xor rax, rax
call printf
ret
C++, p14.cpp:
#include <iostream>
int sequence(long n) {
int count = 1;
while (n != 1) {
if (n % 2 == 0)
n /= 2;
else
n = 3*n + 1;
++count;
}
return count;
}
int main() {
int max = 0, maxi;
for (int i = 999999; i > 0; --i) {
int s = sequence(i);
if (s > max) {
max = s;
maxi = i;
}
}
std::cout << maxi << std::endl;
}
I know about the compiler optimizations to improve speed and everything, but I don’t see many ways to further optimize my assembly solution (speaking programmatically, not mathematically).
The C++ code uses modulus every term and division every other term, while the assembly code only uses a single division every other term.
But the assembly is taking on average 1 second longer than the C++ solution. Why is this? I am asking mainly out of curiosity.
Execution times
My system: 64-bit Linux on 1.4 GHz Intel Celeron 2955U (Haswell microarchitecture).
g++ (unoptimized): avg 1272 ms.
g++ -O3: avg 578 ms.
asm (div) (original): avg 2650 ms.
asm (shr): avg 679 ms.
#johnfound asm (assembled with NASM): avg 501 ms.
#hidefromkgb asm: avg 200 ms.
#hidefromkgb asm, optimized by #Peter Cordes: avg 145 ms.
#Veedrac C++: avg 81 ms with -O3, 305 ms with -O0.

If you think a 64-bit DIV instruction is a good way to divide by two, then no wonder the compiler's asm output beat your hand-written code, even with -O0 (compile fast, no extra optimization, and store/reload to memory after/before every C statement so a debugger can modify variables).
See Agner Fog's Optimizing Assembly guide to learn how to write efficient asm. He also has instruction tables and a microarch guide for specific details for specific CPUs. See also the x86 tag wiki for more perf links.
See also this more general question about beating the compiler with hand-written asm: Is inline assembly language slower than native C++ code?. TL:DR: yes if you do it wrong (like this question).
Usually you're fine letting the compiler do its thing, especially if you try to write C++ that can compile efficiently. Also see is assembly faster than compiled languages?. One of the answers links to these neat slides showing how various C compilers optimize some really simple functions with cool tricks. Matt Godbolt's CppCon2017 talk “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” is in a similar vein.
even:
mov rbx, 2
xor rdx, rdx
div rbx
On Intel Haswell, div r64 is 36 uops, with a latency of 32-96 cycles, and a throughput of one per 21-74 cycles. (Plus the 2 uops to set up RBX and zero RDX, but out-of-order execution can run those early). High-uop-count instructions like DIV are microcoded, which can also cause front-end bottlenecks. In this case, latency is the most relevant factor because it's part of a loop-carried dependency chain.
shr rax, 1 does the same unsigned division: It's 1 uop, with 1c latency, and can run 2 per clock cycle.
For comparison, 32-bit division is faster, but still horrible vs. shifts. idiv r32 is 9 uops, 22-29c latency, and one per 8-11c throughput on Haswell.
As you can see from looking at gcc's -O0 asm output (Godbolt compiler explorer), it only uses shifts instructions. clang -O0 does compile naively like you thought, even using 64-bit IDIV twice. (When optimizing, compilers do use both outputs of IDIV when the source does a division and modulus with the same operands, if they use IDIV at all)
GCC doesn't have a totally-naive mode; it always transforms through GIMPLE, which means some "optimizations" can't be disabled. This includes recognizing division-by-constant and using shifts (power of 2) or a fixed-point multiplicative inverse (non power of 2) to avoid IDIV (see div_by_13 in the above godbolt link).
gcc -Os (optimize for size) does use IDIV for non-power-of-2 division,
unfortunately even in cases where the multiplicative inverse code is only slightly larger but much faster.
Helping the compiler
(summary for this case: use uint64_t n)
First of all, it's only interesting to look at optimized compiler output. (-O3).
-O0 speed is basically meaningless.
Look at your asm output (on Godbolt, or see How to remove "noise" from GCC/clang assembly output?). When the compiler doesn't make optimal code in the first place: Writing your C/C++ source in a way that guides the compiler into making better code is usually the best approach. You have to know asm, and know what's efficient, but you apply this knowledge indirectly. Compilers are also a good source of ideas: sometimes clang will do something cool, and you can hand-hold gcc into doing the same thing: see this answer and what I did with the non-unrolled loop in #Veedrac's code below.)
This approach is portable, and in 20 years some future compiler can compile it to whatever is efficient on future hardware (x86 or not), maybe using new ISA extension or auto-vectorizing. Hand-written x86-64 asm from 15 years ago would usually not be optimally tuned for Skylake. e.g. compare&branch macro-fusion didn't exist back then. What's optimal now for hand-crafted asm for one microarchitecture might not be optimal for other current and future CPUs. Comments on #johnfound's answer discuss major differences between AMD Bulldozer and Intel Haswell, which have a big effect on this code. But in theory, g++ -O3 -march=bdver3 and g++ -O3 -march=skylake will do the right thing. (Or -march=native.) Or -mtune=... to just tune, without using instructions that other CPUs might not support.
My feeling is that guiding the compiler to asm that's good for a current CPU you care about shouldn't be a problem for future compilers. They're hopefully better than current compilers at finding ways to transform code, and can find a way that works for future CPUs. Regardless, future x86 probably won't be terrible at anything that's good on current x86, and the future compiler will avoid any asm-specific pitfalls while implementing something like the data movement from your C source, if it doesn't see something better.
Hand-written asm is a black-box for the optimizer, so constant-propagation doesn't work when inlining makes an input a compile-time constant. Other optimizations are also affected. Read https://gcc.gnu.org/wiki/DontUseInlineAsm before using asm. (And avoid MSVC-style inline asm: inputs/outputs have to go through memory which adds overhead.)
In this case: your n has a signed type, and gcc uses the SAR/SHR/ADD sequence that gives the correct rounding. (IDIV and arithmetic-shift "round" differently for negative inputs, see the SAR insn set ref manual entry). (IDK if gcc tried and failed to prove that n can't be negative, or what. Signed-overflow is undefined behaviour, so it should have been able to.)
You should have used uint64_t n, so it can just SHR. And so it's portable to systems where long is only 32-bit (e.g. x86-64 Windows).
BTW, gcc's optimized asm output looks pretty good (using unsigned long n): the inner loop it inlines into main() does this:
# from gcc5.4 -O3 plus my comments
# edx= count=1
# rax= uint64_t n
.L9: # do{
lea rcx, [rax+1+rax*2] # rcx = 3*n + 1
mov rdi, rax
shr rdi # rdi = n>>1;
test al, 1 # set flags based on n%2 (aka n&1)
mov rax, rcx
cmove rax, rdi # n= (n%2) ? 3*n+1 : n/2;
add edx, 1 # ++count;
cmp rax, 1
jne .L9 #}while(n!=1)
cmp/branch to update max and maxi, and then do the next n
The inner loop is branchless, and the critical path of the loop-carried dependency chain is:
3-component LEA (3 cycles)
cmov (2 cycles on Haswell, 1c on Broadwell or later).
Total: 5 cycle per iteration, latency bottleneck. Out-of-order execution takes care of everything else in parallel with this (in theory: I haven't tested with perf counters to see if it really runs at 5c/iter).
The FLAGS input of cmov (produced by TEST) is faster to produce than the RAX input (from LEA->MOV), so it's not on the critical path.
Similarly, the MOV->SHR that produces CMOV's RDI input is off the critical path, because it's also faster than the LEA. MOV on IvyBridge and later has zero latency (handled at register-rename time). (It still takes a uop, and a slot in the pipeline, so it's not free, just zero latency). The extra MOV in the LEA dep chain is part of the bottleneck on other CPUs.
The cmp/jne is also not part of the critical path: it's not loop-carried, because control dependencies are handled with branch prediction + speculative execution, unlike data dependencies on the critical path.
Beating the compiler
GCC did a pretty good job here. It could save one code byte by using inc edx instead of add edx, 1, because nobody cares about P4 and its false-dependencies for partial-flag-modifying instructions.
It could also save all the MOV instructions, and the TEST: SHR sets CF= the bit shifted out, so we can use cmovc instead of test / cmovz.
### Hand-optimized version of what gcc does
.L9: #do{
lea rcx, [rax+1+rax*2] # rcx = 3*n + 1
shr rax, 1 # n>>=1; CF = n&1 = n%2
cmovc rax, rcx # n= (n&1) ? 3*n+1 : n/2;
inc edx # ++count;
cmp rax, 1
jne .L9 #}while(n!=1)
See #johnfound's answer for another clever trick: remove the CMP by branching on SHR's flag result as well as using it for CMOV: zero only if n was 1 (or 0) to start with. (Fun fact: SHR with count != 1 on Nehalem or earlier causes a stall if you read the flag results. That's how they made it single-uop. The shift-by-1 special encoding is fine, though.)
Avoiding MOV doesn't help with the latency at all on Haswell (Can x86's MOV really be "free"? Why can't I reproduce this at all?). It does help significantly on CPUs like Intel pre-IvB, and AMD Bulldozer-family, where MOV is not zero-latency (and Ice Lake with updated microcode). The compiler's wasted MOV instructions do affect the critical path. BD's complex-LEA and CMOV are both lower latency (2c and 1c respectively), so it's a bigger fraction of the latency. Also, throughput bottlenecks become an issue, because it only has two integer ALU pipes. See #johnfound's answer, where he has timing results from an AMD CPU.
Even on Haswell, this version may help a bit by avoiding some occasional delays where a non-critical uop steals an execution port from one on the critical path, delaying execution by 1 cycle. (This is called a resource conflict). It also saves a register, which may help when doing multiple n values in parallel in an interleaved loop (see below).
LEA's latency depends on the addressing mode, on Intel SnB-family CPUs. 3c for 3 components ([base+idx+const], which takes two separate adds), but only 1c with 2 or fewer components (one add). Some CPUs (like Core2) do even a 3-component LEA in a single cycle, but SnB-family doesn't. Worse, Intel SnB-family standardizes latencies so there are no 2c uops, otherwise 3-component LEA would be only 2c like Bulldozer. (3-component LEA is slower on AMD as well, just not by as much).
So lea rcx, [rax + rax*2] / inc rcx is only 2c latency, faster than lea rcx, [rax + rax*2 + 1], on Intel SnB-family CPUs like Haswell. Break-even on BD, and worse on Core2. It does cost an extra uop, which normally isn't worth it to save 1c latency, but latency is the major bottleneck here and Haswell has a wide enough pipeline to handle the extra uop throughput.
Neither gcc, icc, nor clang (on godbolt) used SHR's CF output, always using an AND or TEST. Silly compilers. :P They're great pieces of complex machinery, but a clever human can often beat them on small-scale problems. (Given thousands to millions of times longer to think about it, of course! Compilers don't use exhaustive algorithms to search for every possible way to do things, because that would take too long when optimizing a lot of inlined code, which is what they do best. They also don't model the pipeline in the target microarchitecture, at least not in the same detail as IACA or other static-analysis tools; they just use some heuristics.)
Simple loop unrolling won't help; this loop bottlenecks on the latency of a loop-carried dependency chain, not on loop overhead / throughput. This means it would do well with hyperthreading (or any other kind of SMT), since the CPU has lots of time to interleave instructions from two threads. This would mean parallelizing the loop in main, but that's fine because each thread can just check a range of n values and produce a pair of integers as a result.
Interleaving by hand within a single thread might be viable, too. Maybe compute the sequence for a pair of numbers in parallel, since each one only takes a couple registers, and they can all update the same max / maxi. This creates more instruction-level parallelism.
The trick is deciding whether to wait until all the n values have reached 1 before getting another pair of starting n values, or whether to break out and get a new start point for just one that reached the end condition, without touching the registers for the other sequence. Probably it's best to keep each chain working on useful data, otherwise you'd have to conditionally increment its counter.
You could maybe even do this with SSE packed-compare stuff to conditionally increment the counter for vector elements where n hadn't reached 1 yet. And then to hide the even longer latency of a SIMD conditional-increment implementation, you'd need to keep more vectors of n values up in the air. Maybe only worth with 256b vector (4x uint64_t).
I think the best strategy to make detection of a 1 "sticky" is to mask the vector of all-ones that you add to increment the counter. So after you've seen a 1 in an element, the increment-vector will have a zero, and +=0 is a no-op.
Untested idea for manual vectorization
# starting with YMM0 = [ n_d, n_c, n_b, n_a ] (64-bit elements)
# ymm4 = _mm256_set1_epi64x(1): increment vector
# ymm5 = all-zeros: count vector
.inner_loop:
vpaddq ymm1, ymm0, xmm0
vpaddq ymm1, ymm1, xmm0
vpaddq ymm1, ymm1, set1_epi64(1) # ymm1= 3*n + 1. Maybe could do this more efficiently?
vpsllq ymm3, ymm0, 63 # shift bit 1 to the sign bit
vpsrlq ymm0, ymm0, 1 # n /= 2
# FP blend between integer insns may cost extra bypass latency, but integer blends don't have 1 bit controlling a whole qword.
vpblendvpd ymm0, ymm0, ymm1, ymm3 # variable blend controlled by the sign bit of each 64-bit element. I might have the source operands backwards, I always have to look this up.
# ymm0 = updated n in each element.
vpcmpeqq ymm1, ymm0, set1_epi64(1)
vpandn ymm4, ymm1, ymm4 # zero out elements of ymm4 where the compare was true
vpaddq ymm5, ymm5, ymm4 # count++ in elements where n has never been == 1
vptest ymm4, ymm4
jnz .inner_loop
# Fall through when all the n values have reached 1 at some point, and our increment vector is all-zero
vextracti128 ymm0, ymm5, 1
vpmaxq .... crap this doesn't exist
# Actually just delay doing a horizontal max until the very very end. But you need some way to record max and maxi.
You can and should implement this with intrinsics instead of hand-written asm.
Algorithmic / implementation improvement:
Besides just implementing the same logic with more efficient asm, look for ways to simplify the logic, or avoid redundant work. e.g. memoize to detect common endings to sequences. Or even better, look at 8 trailing bits at once (gnasher's answer)
#EOF points out that tzcnt (or bsf) could be used to do multiple n/=2 iterations in one step. That's probably better than SIMD vectorizing; no SSE or AVX instruction can do that. It's still compatible with doing multiple scalar ns in parallel in different integer registers, though.
So the loop might look like this:
goto loop_entry; // C++ structured like the asm, for illustration only
do {
n = n*3 + 1;
loop_entry:
shift = _tzcnt_u64(n);
n >>= shift;
count += shift;
} while(n != 1);
This may do significantly fewer iterations, but variable-count shifts are slow on Intel SnB-family CPUs without BMI2. 3 uops, 2c latency. (They have an input dependency on the FLAGS because count=0 means the flags are unmodified. They handle this as a data dependency, and take multiple uops because a uop can only have 2 inputs (pre-HSW/BDW anyway)). This is the kind that people complaining about x86's crazy-CISC design are referring to. It makes x86 CPUs slower than they would be if the ISA was designed from scratch today, even in a mostly-similar way. (i.e. this is part of the "x86 tax" that costs speed / power.) SHRX/SHLX/SARX (BMI2) are a big win (1 uop / 1c latency).
It also puts tzcnt (3c on Haswell and later) on the critical path, so it significantly lengthens the total latency of the loop-carried dependency chain. It does remove any need for a CMOV, or for preparing a register holding n>>1, though. #Veedrac's answer overcomes all this by deferring the tzcnt/shift for multiple iterations, which is highly effective (see below).
We can safely use BSF or TZCNT interchangeably, because n can never be zero at that point. TZCNT's machine-code decodes as BSF on CPUs that don't support BMI1. (Meaningless prefixes are ignored, so REP BSF runs as BSF).
TZCNT performs much better than BSF on AMD CPUs that support it, so it can be a good idea to use REP BSF, even if you don't care about setting ZF if the input is zero rather than the output. Some compilers do this when you use __builtin_ctzll even with -mno-bmi.
They perform the same on Intel CPUs, so just save the byte if that's all that matters. TZCNT on Intel (pre-Skylake) still has a false-dependency on the supposedly write-only output operand, just like BSF, to support the undocumented behaviour that BSF with input = 0 leaves its destination unmodified. So you need to work around that unless optimizing only for Skylake, so there's nothing to gain from the extra REP byte. (Intel often goes above and beyond what the x86 ISA manual requires, to avoid breaking widely-used code that depends on something it shouldn't, or that is retroactively disallowed. e.g. Windows 9x's assumes no speculative prefetching of TLB entries, which was safe when the code was written, before Intel updated the TLB management rules.)
Anyway, LZCNT/TZCNT on Haswell have the same false dep as POPCNT: see this Q&A. This is why in gcc's asm output for #Veedrac's code, you see it breaking the dep chain with xor-zeroing on the register it's about to use as TZCNT's destination when it doesn't use dst=src. Since TZCNT/LZCNT/POPCNT never leave their destination undefined or unmodified, this false dependency on the output on Intel CPUs is a performance bug / limitation. Presumably it's worth some transistors / power to have them behave like other uops that go to the same execution unit. The only perf upside is interaction with another uarch limitation: they can micro-fuse a memory operand with an indexed addressing mode on Haswell, but on Skylake where Intel removed the false dep for LZCNT/TZCNT they "un-laminate" indexed addressing modes while POPCNT can still micro-fuse any addr mode.
Improvements to ideas / code from other answers:
#hidefromkgb's answer has a nice observation that you're guaranteed to be able to do one right shift after a 3n+1. You can compute this more even more efficiently than just leaving out the checks between steps. The asm implementation in that answer is broken, though (it depends on OF, which is undefined after SHRD with a count > 1), and slow: ROR rdi,2 is faster than SHRD rdi,rdi,2, and using two CMOV instructions on the critical path is slower than an extra TEST that can run in parallel.
I put tidied / improved C (which guides the compiler to produce better asm), and tested+working faster asm (in comments below the C) up on Godbolt: see the link in #hidefromkgb's answer. (This answer hit the 30k char limit from the large Godbolt URLs, but shortlinks can rot and were too long for goo.gl anyway.)
Also improved the output-printing to convert to a string and make one write() instead of writing one char at a time. This minimizes impact on timing the whole program with perf stat ./collatz (to record performance counters), and I de-obfuscated some of the non-critical asm.
#Veedrac's code
I got a minor speedup from right-shifting as much as we know needs doing, and checking to continue the loop. From 7.5s for limit=1e8 down to 7.275s, on Core2Duo (Merom), with an unroll factor of 16.
code + comments on Godbolt. Don't use this version with clang; it does something silly with the defer-loop. Using a tmp counter k and then adding it to count later changes what clang does, but that slightly hurts gcc.
See discussion in comments: Veedrac's code is excellent on CPUs with BMI1 (i.e. not Celeron/Pentium)

Claiming that the C++ compiler can produce more optimal code than a competent assembly language programmer is a very bad mistake. And especially in this case. The human always can make the code better than the compiler can, and this particular situation is a good illustration of this claim.
The timing difference you're seeing is because the assembly code in the question is very far from optimal in the inner loops.
(The below code is 32-bit, but can be easily converted to 64-bit)
For example, the sequence function can be optimized to only 5 instructions:
.seq:
inc esi ; counter
lea edx, [3*eax+1] ; edx = 3*n+1
shr eax, 1 ; eax = n/2
cmovc eax, edx ; if CF eax = edx
jnz .seq ; jmp if n<>1
The whole code looks like:
include "%lib%/freshlib.inc"
#BinaryType console, compact
options.DebugMode = 1
include "%lib%/freshlib.asm"
start:
InitializeAll
mov ecx, 999999
xor edi, edi ; max
xor ebx, ebx ; max i
.main_loop:
xor esi, esi
mov eax, ecx
.seq:
inc esi ; counter
lea edx, [3*eax+1] ; edx = 3*n+1
shr eax, 1 ; eax = n/2
cmovc eax, edx ; if CF eax = edx
jnz .seq ; jmp if n<>1
cmp edi, esi
cmovb edi, esi
cmovb ebx, ecx
dec ecx
jnz .main_loop
OutputValue "Max sequence: ", edi, 10, -1
OutputValue "Max index: ", ebx, 10, -1
FinalizeAll
stdcall TerminateAll, 0
In order to compile this code, FreshLib is needed.
In my tests, (1 GHz AMD A4-1200 processor), the above code is approximately four times faster than the C++ code from the question (when compiled with -O0: 430 ms vs. 1900 ms), and more than two times faster (430 ms vs. 830 ms) when the C++ code is compiled with -O3.
The output of both programs is the same: max sequence = 525 on i = 837799.

For more performance: A simple change is observing that after n = 3n+1, n will be even, so you can divide by 2 immediately. And n won't be 1, so you don't need to test for it. So you could save a few if statements and write:
while (n % 2 == 0) n /= 2;
if (n > 1) for (;;) {
n = (3*n + 1) / 2;
if (n % 2 == 0) {
do n /= 2; while (n % 2 == 0);
if (n == 1) break;
}
}
Here's a big win: If you look at the lowest 8 bits of n, all the steps until you divided by 2 eight times are completely determined by those eight bits. For example, if the last eight bits are 0x01, that is in binary your number is ???? 0000 0001 then the next steps are:
3n+1 -> ???? 0000 0100
/ 2 -> ???? ?000 0010
/ 2 -> ???? ??00 0001
3n+1 -> ???? ??00 0100
/ 2 -> ???? ???0 0010
/ 2 -> ???? ???? 0001
3n+1 -> ???? ???? 0100
/ 2 -> ???? ???? ?010
/ 2 -> ???? ???? ??01
3n+1 -> ???? ???? ??00
/ 2 -> ???? ???? ???0
/ 2 -> ???? ???? ????
So all these steps can be predicted, and 256k + 1 is replaced with 81k + 1. Something similar will happen for all combinations. So you can make a loop with a big switch statement:
k = n / 256;
m = n % 256;
switch (m) {
case 0: n = 1 * k + 0; break;
case 1: n = 81 * k + 1; break;
case 2: n = 81 * k + 1; break;
...
case 155: n = 729 * k + 425; break;
...
}
Run the loop until n ≤ 128, because at that point n could become 1 with fewer than eight divisions by 2, and doing eight or more steps at a time would make you miss the point where you reach 1 for the first time. Then continue the "normal" loop - or have a table prepared that tells you how many more steps are need to reach 1.
PS. I strongly suspect Peter Cordes' suggestion would make it even faster. There will be no conditional branches at all except one, and that one will be predicted correctly except when the loop actually ends. So the code would be something like
static const unsigned int multipliers [256] = { ... }
static const unsigned int adders [256] = { ... }
while (n > 128) {
size_t lastBits = n % 256;
n = (n >> 8) * multipliers [lastBits] + adders [lastBits];
}
In practice, you would measure whether processing the last 9, 10, 11, 12 bits of n at a time would be faster. For each bit, the number of entries in the table would double, and I excect a slowdown when the tables don't fit into L1 cache anymore.
PPS. If you need the number of operations: In each iteration we do exactly eight divisions by two, and a variable number of (3n + 1) operations, so an obvious method to count the operations would be another array. But we can actually calculate the number of steps (based on number of iterations of the loop).
We could redefine the problem slightly: Replace n with (3n + 1) / 2 if odd, and replace n with n / 2 if even. Then every iteration will do exactly 8 steps, but you could consider that cheating :-) So assume there were r operations n <- 3n+1 and s operations n <- n/2. The result will be quite exactly n' = n * 3^r / 2^s, because n <- 3n+1 means n <- 3n * (1 + 1/3n). Taking the logarithm we find r = (s + log2 (n' / n)) / log2 (3).
If we do the loop until n ≤ 1,000,000 and have a precomputed table how many iterations are needed from any start point n ≤ 1,000,000 then calculating r as above, rounded to the nearest integer, will give the right result unless s is truly large.

On a rather unrelated note: more performance hacks!
[the first «conjecture» has been finally debunked by #ShreevatsaR; removed]
When traversing the sequence, we can only get 3 possible cases in the 2-neighborhood of the current element N (shown first):
[even] [odd]
[odd] [even]
[even] [even]
To leap past these 2 elements means to compute (N >> 1) + N + 1, ((N << 1) + N + 1) >> 1 and N >> 2, respectively.
Let`s prove that for both cases (1) and (2) it is possible to use the first formula, (N >> 1) + N + 1.
Case (1) is obvious. Case (2) implies (N & 1) == 1, so if we assume (without loss of generality) that N is 2-bit long and its bits are ba from most- to least-significant, then a = 1, and the following holds:
(N << 1) + N + 1: (N >> 1) + N + 1:
b10 b1
b1 b
+ 1 + 1
---- ---
bBb0 bBb
where B = !b. Right-shifting the first result gives us exactly what we want.
Q.E.D.: (N & 1) == 1 ⇒ (N >> 1) + N + 1 == ((N << 1) + N + 1) >> 1.
As proven, we can traverse the sequence 2 elements at a time, using a single ternary operation. Another 2× time reduction.
The resulting algorithm looks like this:
uint64_t sequence(uint64_t size, uint64_t *path) {
uint64_t n, i, c, maxi = 0, maxc = 0;
for (n = i = (size - 1) | 1; i > 2; n = i -= 2) {
c = 2;
while ((n = ((n & 3)? (n >> 1) + n + 1 : (n >> 2))) > 2)
c += 2;
if (n == 2)
c++;
if (c > maxc) {
maxi = i;
maxc = c;
}
}
*path = maxc;
return maxi;
}
int main() {
uint64_t maxi, maxc;
maxi = sequence(1000000, &maxc);
printf("%llu, %llu\n", maxi, maxc);
return 0;
}
Here we compare n > 2 because the process may stop at 2 instead of 1 if the total length of the sequence is odd.
[EDIT:]
Let`s translate this into assembly!
MOV RCX, 1000000;
DEC RCX;
AND RCX, -2;
XOR RAX, RAX;
MOV RBX, RAX;
#main:
XOR RSI, RSI;
LEA RDI, [RCX + 1];
#loop:
ADD RSI, 2;
LEA RDX, [RDI + RDI*2 + 2];
SHR RDX, 1;
SHRD RDI, RDI, 2; ror rdi,2 would do the same thing
CMOVL RDI, RDX; Note that SHRD leaves OF = undefined with count>1, and this doesn't work on all CPUs.
CMOVS RDI, RDX;
CMP RDI, 2;
JA #loop;
LEA RDX, [RSI + 1];
CMOVE RSI, RDX;
CMP RAX, RSI;
CMOVB RAX, RSI;
CMOVB RBX, RCX;
SUB RCX, 2;
JA #main;
MOV RDI, RCX;
ADD RCX, 10;
PUSH RDI;
PUSH RCX;
#itoa:
XOR RDX, RDX;
DIV RCX;
ADD RDX, '0';
PUSH RDX;
TEST RAX, RAX;
JNE #itoa;
PUSH RCX;
LEA RAX, [RBX + 1];
TEST RBX, RBX;
MOV RBX, RDI;
JNE #itoa;
POP RCX;
INC RDI;
MOV RDX, RDI;
#outp:
MOV RSI, RSP;
MOV RAX, RDI;
SYSCALL;
POP RAX;
TEST RAX, RAX;
JNE #outp;
LEA RAX, [RDI + 59];
DEC RDI;
SYSCALL;
Use these commands to compile:
nasm -f elf64 file.asm
ld -o file file.o
See the C and an improved/bugfixed version of the asm by Peter Cordes on Godbolt. (editor's note: Sorry for putting my stuff in your answer, but my answer hit the 30k char limit from Godbolt links + text!)

C++ programs are translated to assembly programs during the generation of machine code from the source code. It would be virtually wrong to say assembly is slower than C++. Moreover, the binary code generated differs from compiler to compiler. So a smart C++ compiler may produce binary code more optimal and efficient than a dumb assembler's code.
However I believe your profiling methodology has certain flaws. The following are general guidelines for profiling:
Make sure your system is in its normal/idle state. Stop all running processes (applications) that you started or that use CPU intensively (or poll over the network).
Your datasize must be greater in size.
Your test must run for something more than 5-10 seconds.
Do not rely on just one sample. Perform your test N times. Collect results and calculate the mean or median of the result.

From comments:
But, this code never stops (because of integer overflow) !?! Yves Daoust
For many numbers it will not overflow.
If it will overflow - for one of those unlucky initial seeds, the overflown number will very likely converge toward 1 without another overflow.
Still this poses interesting question, is there some overflow-cyclic seed number?
Any simple final converging series starts with power of two value (obvious enough?).
2^64 will overflow to zero, which is undefined infinite loop according to algorithm (ends only with 1), but the most optimal solution in answer will finish due to shr rax producing ZF=1.
Can we produce 2^64? If the starting number is 0x5555555555555555, it's odd number, next number is then 3n+1, which is 0xFFFFFFFFFFFFFFFF + 1 = 0. Theoretically in undefined state of algorithm, but the optimized answer of johnfound will recover by exiting on ZF=1. The cmp rax,1 of Peter Cordes will end in infinite loop (QED variant 1, "cheapo" through undefined 0 number).
How about some more complex number, which will create cycle without 0?
Frankly, I'm not sure, my Math theory is too hazy to get any serious idea, how to deal with it in serious way. But intuitively I would say the series will converge to 1 for every number : 0 < number, as the 3n+1 formula will slowly turn every non-2 prime factor of original number (or intermediate) into some power of 2, sooner or later. So we don't need to worry about infinite loop for original series, only overflow can hamper us.
So I just put few numbers into sheet and took a look on 8 bit truncated numbers.
There are three values overflowing to 0: 227, 170 and 85 (85 going directly to 0, other two progressing toward 85).
But there's no value creating cyclic overflow seed.
Funnily enough I did a check, which is the first number to suffer from 8 bit truncation, and already 27 is affected! It does reach value 9232 in proper non-truncated series (first truncated value is 322 in 12th step), and the maximum value reached for any of the 2-255 input numbers in non-truncated way is 13120 (for the 255 itself), maximum number of steps to converge to 1 is about 128 (+-2, not sure if "1" is to count, etc...).
Interestingly enough (for me) the number 9232 is maximum for many other source numbers, what's so special about it? :-O 9232 = 0x2410 ... hmmm.. no idea.
Unfortunately I can't get any deep grasp of this series, why does it converge and what are the implications of truncating them to k bits, but with cmp number,1 terminating condition it's certainly possible to put the algorithm into infinite loop with particular input value ending as 0 after truncation.
But the value 27 overflowing for 8 bit case is sort of alerting, this looks like if you count the number of steps to reach value 1, you will get wrong result for majority of numbers from the total k-bit set of integers. For the 8 bit integers the 146 numbers out of 256 have affected series by truncation (some of them may still hit the correct number of steps by accident maybe, I'm too lazy to check).

You did not post the code generated by the compiler, so there' some guesswork here, but even without having seen it, one can say that this:
test rax, 1
jpe even
... has a 50% chance of mispredicting the branch, and that will come expensive.
The compiler almost certainly does both computations (which costs neglegibly more since the div/mod is quite long latency, so the multiply-add is "free") and follows up with a CMOV. Which, of course, has a zero percent chance of being mispredicted.

For the Collatz problem, you can get a significant boost in performance by caching the "tails". This is a time/memory trade-off. See: memoization
(https://en.wikipedia.org/wiki/Memoization). You could also look into dynamic programming solutions for other time/memory trade-offs.
Example python implementation:
import sys
inner_loop = 0
def collatz_sequence(N, cache):
global inner_loop
l = [ ]
stop = False
n = N
tails = [ ]
while not stop:
inner_loop += 1
tmp = n
l.append(n)
if n <= 1:
stop = True
elif n in cache:
stop = True
elif n % 2:
n = 3*n + 1
else:
n = n // 2
tails.append((tmp, len(l)))
for key, offset in tails:
if not key in cache:
cache[key] = l[offset:]
return l
def gen_sequence(l, cache):
for elem in l:
yield elem
if elem in cache:
yield from gen_sequence(cache[elem], cache)
raise StopIteration
if __name__ == "__main__":
le_cache = {}
for n in range(1, 4711, 5):
l = collatz_sequence(n, le_cache)
print("{}: {}".format(n, len(list(gen_sequence(l, le_cache)))))
print("inner_loop = {}".format(inner_loop))

As a generic answer, not specifically directed at this task: In many cases, you can significantly speed up any program by making improvements at a high level. Like calculating data once instead of multiple times, avoiding unnecessary work completely, using caches in the best way, and so on. These things are much easier to do in a high level language.
Writing assembler code, it is possible to improve on what an optimising compiler does, but it is hard work. And once it's done, your code is much harder to modify, so it is much more difficult to add algorithmic improvements. Sometimes the processor has functionality that you cannot use from a high level language, inline assembly is often useful in these cases and still lets you use a high level language.
In the Euler problems, most of the time you succeed by building something, finding why it is slow, building something better, finding why it is slow, and so on and so on. That is very, very hard using assembler. A better algorithm at half the possible speed will usually beat a worse algorithm at full speed, and getting the full speed in assembler isn't trivial.

Even without looking at assembly, the most obvious reason is that /= 2 is probably optimized as >>=1 and many processors have a very quick shift operation. But even if a processor doesn't have a shift operation, the integer division is faster than floating point division.
Edit: your milage may vary on the "integer division is faster than floating point division" statement above. The comments below reveal that the modern processors have prioritized optimizing fp division over integer division. So if someone were looking for the most likely reason for the speedup which this thread's question asks about, then compiler optimizing /=2 as >>=1 would be the best 1st place to look.
On an unrelated note, if n is odd, the expression n*3+1 will always be even. So there is no need to check. You can change that branch to
{
n = (n*3+1) >> 1;
count += 2;
}
So the whole statement would then be
if (n & 1)
{
n = (n*3 + 1) >> 1;
count += 2;
}
else
{
n >>= 1;
++count;
}

The simple answer:
doing a MOV RBX, 3 and MUL RBX is expensive; just ADD RBX, RBX twice
ADD 1 is probably faster than INC here
MOV 2 and DIV is very expensive; just shift right
64-bit code is usually noticeably slower than 32-bit code and the alignment issues are more complicated; with small programs like this you have to pack them so you are doing parallel computation to have any chance of being faster than 32-bit code
If you generate the assembly listing for your C++ program, you can see how it differs from your assembly.

Strange uint32_t to float array conversion

I have the following code snippet:
#include <cstdio>
#include <cstdint>
static const size_t ARR_SIZE = 129;
int main()
{
uint32_t value = 2570980487;
uint32_t arr[ARR_SIZE];
for (int x = 0; x < ARR_SIZE; ++x)
arr[x] = value;
float arr_dst[ARR_SIZE];
for (int x = 0; x < ARR_SIZE; ++x)
{
arr_dst[x] = static_cast<float>(arr[x]);
}
printf("%s\n", arr_dst[ARR_SIZE - 1] == arr_dst[ARR_SIZE - 2] ? "OK" : "WTF??!!");
printf("magic = %0.10f\n", arr_dst[ARR_SIZE - 2]);
printf("magic = %0.10f\n", arr_dst[ARR_SIZE - 1]);
return 0;
}
If I compile it under MS Visual Studio 2015 I can see that the output is:
WTF??!!
magic = 2570980352.0000000000
magic = 2570980608.0000000000
So the last arr_dst element is different from the previous one, yet these two values were obtained by converting the same value, which populates the arr array!
Is it a bug?
I noticed that if I modify the conversion loop in the following manner, I get the "OK" result:
for (int x = 0; x < ARR_SIZE; ++x)
{
if (x == 0)
x = 0;
arr_dst[x] = static_cast<float>(arr[x]);
}
So this probably is some issue with vectorizing optimisation.
This behavior does not reproduce on gcc 4.8. Any ideas?

A 32-bit IEEE-754 binary float, such as MSVC++ uses, provides only 6-7 decimal digits of precision. Your starting value is well within the range of that type, but it seems not to be exactly representable by that type, as indeed is the case for most values of type uint32_t.
At the same time, the floating-point unit of an x86 or x86_64 processor uses a wider representation even than MSVC++'s 64-bit double. It seems likely that after the loop exits, the last-computed array element remains in an FPU register, in its extended precision form. The program may then use that value directly from the register instead of reading it back from memory, which it is obligated to do with previous elements.
If the program performs the == comparison by promoting the narrower representation to the wider instead of the other way around, then the two values might indeed compare unequal, as the round-trip from extended precision to float and back loses precision. In any event, both values are converted to type double when passed to printf(); if indeed they compared unequal, then it is likely that the results of those conversions differ, too.
I'm not up on MSVC++ compile options, but very likely there is one that would quash this behavior. Such options sometimes go by names such as "strict math" or "strict fp". Be aware, however, that turning on such an option (or turning off its opposite) can be very costly in an FP-heavy program.

Converting between unsigned and float is not simple on x86; there's no single instruction for it (until AVX512). A common technique is to convert as signed and then fixup the result. There are multiple ways of doing this. (See this Q&A for some manually-vectorized methods with C intrinsics, not all of which have perfectly-rounded results.)
MSVC vectorizes the first 128 with one strategy, and then uses a different strategy (which wouldn't vectorize) for the last scalar element, which involves converting to double and then from double to float.
gcc and clang produce the 2570980608.0 result from their vectorized and scalar methods. 2570980608 - 2570980487 = 121, and 2570980487 - 2570980352 = 135 (with no rounding of inputs/outputs), so gcc and clang produce the correctly rounded result in this case (less than 0.5ulp of error). IDK if that's true for every possible uint32_t (but there are only 2^32 of them, we could exhaustively check). MSVC's end result for the vectorized loop has slightly more than 0.5ulp of error, but the scalar method is correctly rounded for this input.
IEEE math demands that + - * / and sqrt produce correctly rounded results (less than 0.5ulp of error), but other functions (like log) don't have such a strict requirement. IDK what the requirements are on rounding for int->float conversions, so IDK if what MSVC does is strictly legal (if you didn't use /fp:fast or anything).
See also Bruce Dawson's Floating-Point Determinism blog post (part of his excellent series about FP math), although he doesn't mention integer<->FP conversions.
We can see in the asm linked by the OP what MSVC did (stripped down to only the interesting instructions and commented by hand):
; Function compile flags: /Ogtp
# assembler macro constants
_arr_dst$ = -1040 ; size = 516
_arr$ = -520 ; size = 516
_main PROC ; COMDAT
00013 mov edx, 129
00018 mov eax, -1723986809 ; this is your unsigned 2570980487
0001d mov ecx, edx
00023 lea edi, DWORD PTR _arr$[esp+1088] ; edi=arr
0002a rep stosd ; memset in chunks of 4B
# arr[0..128] = 2570980487 at this point
0002c xor ecx, ecx ; i = 0
# xmm2 = 0.0 in each element (i.e. all-zero)
# xmm3 = __xmm#4f8000004f8000004f8000004f800000 (a constant repeated in each of 4 float elements)
####### The vectorized unsigned->float conversion strategy:
$LL7#main: ; do{
00030 movups xmm0, XMMWORD PTR _arr$[esp+ecx*4+1088] ; load 4 uint32_t
00038 cvtdq2ps xmm1, xmm0 ; SIGNED int to Single-precision float
0003b movaps xmm0, xmm1
0003e cmpltps xmm0, xmm2 ; xmm0 = (xmm0 < 0.0)
00042 andps xmm0, xmm3 ; mask the magic constant
00045 addps xmm0, xmm1 ; x += (x<0.0) ? magic_constant : 0.0f;
# There's no instruction for converting from unsigned to float, so compilers use inconvenient techniques like this to correct the result of converting as signed.
00048 movups XMMWORD PTR _arr_dst$[esp+ecx*4+1088], xmm0 ; store 4 floats to arr_dst
; and repeat the same thing again, with addresses that are 16B higher (+1104)
; i.e. this loop is unrolled by two
0006a add ecx, 8 ; i+=8 (two vectors of 4 elements)
0006d cmp ecx, 128
00073 jb SHORT $LL7#main ; }while(i<128)
#### End of vectorized loop
# and then IDK what MSVC smoking; both these values are known at compile time. Is /Ogtp not full optimization?
# I don't see a branch target that would let execution reach this code
# other than by falling out of the loop that ends with ecx=128
00075 cmp ecx, edx
00077 jae $LN21#main ; if(i>=129): always false
0007d sub edx, ecx ; edx = 129-128 = 1
... some more ridiculous known-at-compile-time jumping later ...
######## The scalar unsigned->float conversion strategy for the last element
$LC15#main:
00140 mov eax, DWORD PTR _arr$[esp+ecx*4+1088]
00147 movd xmm0, eax
# eax = xmm0[0] = arr[128]
0014b cvtdq2pd xmm0, xmm0 ; convert the last element TO DOUBLE
0014f shr eax, 31 ; shift the sign bit to bit 1, so eax = 0 or 1
; then eax indexes a 16B constant, selecting either 0 or 0x41f0... (as whatever double that represents)
00152 addsd xmm0, QWORD PTR __xmm#41f00000000000000000000000000000[eax*8]
0015b cvtpd2ps xmm0, xmm0 ; double -> float
0015f movss DWORD PTR _arr_dst$[esp+ecx*4+1088], xmm0 ; and store it
00165 inc ecx ; ++i;
00166 cmp ecx, 129 ; } while(i<129)
0016c jb SHORT $LC15#main
# Yes, this is a loop, which always runs exactly once for the last element
By way of comparison, clang and gcc also don't optimize the whole thing away at compile time, but they do realize that they don't need a cleanup loop, and just do a single scalar store or convert after the respective loops. (clang actually fully unrolls everything unless you tell it not to.)
See the code on the Godbolt compiler explorer.
gcc just converts the upper and lower 16b halves to float separately, and combines them with a multiply by 65536 and add.
Clang's unsigned -> float conversion strategy is interesting: it never uses a cvt instruction at all. I think it stuffs the two 16-bit halves of the unsigned integer into the mantissa of two floats directly (with some tricks to set the exponents (bitwise boolean stuff and an ADDPS), then adds the low and high half together like gcc does.
Of course, if you compile to 64-bit code, the scalar conversion can just zero-extend the uint32_t to 64-bit and convert that as a signed int64_t to float. Signed int64_t can represent every value of uint32_t, and x86 can convert a 64-bit signed int to float efficiently. But that doesn't vectorize.

I did an investigation on a PowerPC imeplementation (Freescale MCP7450) as they IMHO are far better documented than any voodoo Intel comes up with.
As it turns out the floating point unit, FPU, and vector unit may have different rounding for floating point operations. The FPU can be configured to use one of four rounding modes; round to nearest (default), truncate, towards positive infinity and towards negative infinity. The vector unit however is only able to round to nearest, with a few select instructions having specific rounding rules. The internal precision of the FPU is 106-bit. The vector unit fulfills IEEE-754 but the documentation does not state much more.
Looking at your result the conversion 2570980608 is closer to the original integer, suggesting the FPU has better internal precision than the vector unit OR different rounding modes.

Is < faster than <=?

Is if (a < 901) faster than if (a <= 900)?
Not exactly as in this simple example, but there are slight performance changes on loop complex code. I suppose this has to do something with generated machine code in case it's even true.

No, it will not be faster on most architectures. You didn't specify, but on x86, all of the integral comparisons will be typically implemented in two machine instructions:
A test or cmp instruction, which sets EFLAGS
And a Jcc (jump) instruction, depending on the comparison type (and code layout):
jne - Jump if not equal --> ZF = 0
jz - Jump if zero (equal) --> ZF = 1
jg - Jump if greater --> ZF = 0 and SF = OF
(etc...)
Example (Edited for brevity) Compiled with $ gcc -m32 -S -masm=intel test.c
if (a < b) {
// Do something 1
}
Compiles to:
mov eax, DWORD PTR [esp+24] ; a
cmp eax, DWORD PTR [esp+28] ; b
jge .L2 ; jump if a is >= b
; Do something 1
.L2:
And
if (a <= b) {
// Do something 2
}
Compiles to:
mov eax, DWORD PTR [esp+24] ; a
cmp eax, DWORD PTR [esp+28] ; b
jg .L5 ; jump if a is > b
; Do something 2
.L5:
So the only difference between the two is a jg versus a jge instruction. The two will take the same amount of time.
I'd like to address the comment that nothing indicates that the different jump instructions take the same amount of time. This one is a little tricky to answer, but here's what I can give: In the Intel Instruction Set Reference, they are all grouped together under one common instruction, Jcc (Jump if condition is met). The same grouping is made together under the Optimization Reference Manual, in Appendix C. Latency and Throughput.
Latency — The number of clock cycles that are required for the
execution core to complete the execution of all of the μops that form
an instruction.
Throughput — The number of clock cycles required to
wait before the issue ports are free to accept the same instruction
again. For many instructions, the throughput of an instruction can be
significantly less than its latency
The values for Jcc are:
Latency Throughput
Jcc N/A 0.5
with the following footnote on Jcc:
Selection of conditional jump instructions should be based on the recommendation of section Section 3.4.1, “Branch Prediction Optimization,” to improve the predictability of branches. When branches are predicted successfully, the latency of jcc is effectively zero.
So, nothing in the Intel docs ever treats one Jcc instruction any differently from the others.
If one thinks about the actual circuitry used to implement the instructions, one can assume that there would be simple AND/OR gates on the different bits in EFLAGS, to determine whether the conditions are met. There is then, no reason that an instruction testing two bits should take any more or less time than one testing only one (Ignoring gate propagation delay, which is much less than the clock period.)
Edit: Floating Point
This holds true for x87 floating point as well: (Pretty much same code as above, but with double instead of int.)
fld QWORD PTR [esp+32]
fld QWORD PTR [esp+40]
fucomip st, st(1) ; Compare ST(0) and ST(1), and set CF, PF, ZF in EFLAGS
fstp st(0)
seta al ; Set al if above (CF=0 and ZF=0).
test al, al
je .L2
; Do something 1
.L2:
fld QWORD PTR [esp+32]
fld QWORD PTR [esp+40]
fucomip st, st(1) ; (same thing as above)
fstp st(0)
setae al ; Set al if above or equal (CF=0).
test al, al
je .L5
; Do something 2
.L5:
leave
ret

Historically (we're talking the 1980s and early 1990s), there were some architectures in which this was true. The root issue is that integer comparison is inherently implemented via integer subtractions. This gives rise to the following cases.
Comparison Subtraction
---------- -----------
A < B --> A - B < 0
A = B --> A - B = 0
A > B --> A - B > 0
Now, when A < B the subtraction has to borrow a high-bit for the subtraction to be correct, just like you carry and borrow when adding and subtracting by hand. This "borrowed" bit was usually referred to as the carry bit and would be testable by a branch instruction. A second bit called the zero bit would be set if the subtraction were identically zero which implied equality.
There were usually at least two conditional branch instructions, one to branch on the carry bit and one on the zero bit.
Now, to get at the heart of the matter, let's expand the previous table to include the carry and zero bit results.
Comparison Subtraction Carry Bit Zero Bit
---------- ----------- --------- --------
A < B --> A - B < 0 0 0
A = B --> A - B = 0 1 1
A > B --> A - B > 0 1 0
So, implementing a branch for A < B can be done in one instruction, because the carry bit is clear only in this case, , that is,
;; Implementation of "if (A < B) goto address;"
cmp A, B ;; compare A to B
bcz address ;; Branch if Carry is Zero to the new address
But, if we want to do a less-than-or-equal comparison, we need to do an additional check of the zero flag to catch the case of equality.
;; Implementation of "if (A <= B) goto address;"
cmp A, B ;; compare A to B
bcz address ;; branch if A < B
bzs address ;; also, Branch if the Zero bit is Set
So, on some machines, using a "less than" comparison might save one machine instruction. This was relevant in the era of sub-megahertz processor speed and 1:1 CPU-to-memory speed ratios, but it is almost totally irrelevant today.

Assuming we're talking about internal integer types, there's no possible way one could be faster than the other. They're obviously semantically identical. They both ask the compiler to do precisely the same thing. Only a horribly broken compiler would generate inferior code for one of these.
If there was some platform where < was faster than <= for simple integer types, the compiler should always convert <= to < for constants. Any compiler that didn't would just be a bad compiler (for that platform).

I see that neither is faster. The compiler generates the same machine code in each condition with a different value.
if(a < 901)
cmpl $900, -4(%rbp)
jg .L2
if(a <=901)
cmpl $901, -4(%rbp)
jg .L3
My example if is from GCC on x86_64 platform on Linux.
Compiler writers are pretty smart people, and they think of these things and many others most of us take for granted.
I noticed that if it is not a constant, then the same machine code is generated in either case.
int b;
if(a < b)
cmpl -4(%rbp), %eax
jge .L2
if(a <=b)
cmpl -4(%rbp), %eax
jg .L3

For floating point code, the <= comparison may indeed be slower (by one instruction) even on modern architectures. Here's the first function:
int compare_strict(double a, double b) { return a < b; }
On PowerPC, first this performs a floating point comparison (which updates cr, the condition register), then moves the condition register to a GPR, shifts the "compared less than" bit into place, and then returns. It takes four instructions.
Now consider this function instead:
int compare_loose(double a, double b) { return a <= b; }
This requires the same work as compare_strict above, but now there's two bits of interest: "was less than" and "was equal to." This requires an extra instruction (cror - condition register bitwise OR) to combine these two bits into one. So compare_loose requires five instructions, while compare_strict requires four.
You might think that the compiler could optimize the second function like so:
int compare_loose(double a, double b) { return ! (a > b); }
However this will incorrectly handle NaNs. NaN1 <= NaN2 and NaN1 > NaN2 need to both evaluate to false.

Maybe the author of that unnamed book has read that a > 0 runs faster than a >= 1 and thinks that is true universally.
But it is because a 0 is involved (because CMP can, depending on the architecture, replaced e.g. with OR) and not because of the <.

At the very least, if this were true a compiler could trivially optimise a <= b to !(a > b), and so even if the comparison itself were actually slower, with all but the most naive compiler you would not notice a difference.

TL;DR answer
For most combinations of architecture, compiler and language, < will not be faster than <=.
Full answer
Other answers have concentrated on x86 architecture, and I don't know the ARM architecture (which your example assembler seems to be) well enough to comment specifically on the code generated, but this is an example of a micro-optimisation which is very architecture specific, and is as likely to be an anti-optimisation as it is to be an optimisation.
As such, I would suggest that this sort of micro-optimisation is an example of cargo cult programming rather than best software engineering practice.
Counterexample
There are probably some architectures where this is an optimisation, but I know of at least one architecture where the opposite may be true. The venerable Transputer architecture only had machine code instructions for equal to and greater than or equal to, so all comparisons had to be built from these primitives.
Even then, in almost all cases, the compiler could order the evaluation instructions in such a way that in practice, no comparison had any advantage over any other. Worst case though, it might need to add a reverse instruction (REV) to swap the top two items on the operand stack. This was a single byte instruction which took a single cycle to run, so had the smallest overhead possible.
Summary
Whether or not a micro-optimisation like this is an optimisation or an anti-optimisation depends on the specific architecture you are using, so it is usually a bad idea to get into the habit of using architecture specific micro-optimisations, otherwise you might instinctively use one when it is inappropriate to do so, and it looks like this is exactly what the book you are reading is advocating.

They have the same speed. Maybe in some special architecture what he/she said is right, but in the x86 family at least I know they are the same. Because for doing this the CPU will do a substraction (a - b) and then check the flags of the flag register. Two bits of that register are called ZF (zero Flag) and SF (sign flag), and it is done in one cycle, because it will do it with one mask operation.

This would be highly dependent on the underlying architecture that the C is compiled to. Some processors and architectures might have explicit instructions for equal to, or less than and equal to, which execute in different numbers of cycles.
That would be pretty unusual though, as the compiler could work around it, making it irrelevant.

You should not be able to notice the difference even if there is any. Besides, in practice, you'll have to do an additional a + 1 or a - 1 to make the condition stand unless you're going to use some magic constants, which is a very bad practice by all means.

When I wrote the first version of this answer, I was only looking at the title question about < vs. <= in general, not the specific example of a constant a < 901 vs. a <= 900. Many compilers always shrink the magnitude of constants by converting between < and <=, e.g. because x86 immediate operand have a shorter 1-byte encoding for -128..127.
For ARM, being able to encode as an immediate depends on being able to rotate a narrow field into any position in a word. So cmp r0, #0x00f000 would be encodeable, while cmp r0, #0x00efff would not be. So the make-it-smaller rule for comparison vs. a compile-time constant doesn't always apply for ARM. AArch64 is either shift-by-12 or not, instead of an arbitrary rotation, for instructions like cmp and cmn, unlike 32-bit ARM and Thumb modes.
< vs. <= in general, including for runtime-variable conditions
In assembly language on most machines, a comparison for <= has the same cost as a comparison for <. This applies whether you're branching on it, booleanizing it to create a 0/1 integer, or using it as a predicate for a branchless select operation (like x86 CMOV). The other answers have only addressed this part of the question.
But this question is about the C++ operators, the input to the optimizer. Normally they're both equally efficient; the advice from the book sounds totally bogus because compilers can always transform the comparison that they implement in asm. But there is at least one exception where using <= can accidentally create something the compiler can't optimize.
As a loop condition, there are cases where <= is qualitatively different from <, when it stops the compiler from proving that a loop is not infinite. This can make a big difference, disabling auto-vectorization.
Unsigned overflow is well-defined as base-2 wrap around, unlike signed overflow (UB). Signed loop counters are generally safe from this with compilers that optimize based on signed-overflow UB not happening: ++i <= size will always eventually become false. (What Every C Programmer Should Know About Undefined Behavior)
void foo(unsigned size) {
unsigned upper_bound = size - 1; // or any calculation that could produce UINT_MAX
for(unsigned i=0 ; i <= upper_bound ; i++)
...
Compilers can only optimize in ways that preserve the (defined and legally observable) behaviour of the C++ source for all possible input values, except ones that lead to undefined behaviour.
(A simple i <= size would create the problem too, but I thought calculating an upper bound was a more realistic example of accidentally introducing the possibility of an infinite loop for an input you don't care about but which the compiler must consider.)
In this case, size=0 leads to upper_bound=UINT_MAX, and i <= UINT_MAX is always true. So this loop is infinite for size=0, and the compiler has to respect that even though you as the programmer probably never intend to pass size=0. If the compiler can inline this function into a caller where it can prove that size=0 is impossible, then great, it can optimize like it could for i < size.
Asm like if(!size) skip the loop; do{...}while(--size); is one normally-efficient way to optimize a for( i<size ) loop, if the actual value of i isn't needed inside the loop (Why are loops always compiled into "do...while" style (tail jump)?).
But that do{}while can't be infinite: if entered with size==0, we get 2^n iterations. (Iterating over all unsigned integers in a for loop C makes it possible to express a loop over all unsigned integers including zero, but it's not easy without a carry flag the way it is in asm.)
With wraparound of the loop counter being a possibility, modern compilers often just "give up", and don't optimize nearly as aggressively.
Example: sum of integers from 1 to n
Using unsigned i <= n defeats clang's idiom-recognition that optimizes sum(1 .. n) loops with a closed form based on Gauss's n * (n+1) / 2 formula.
unsigned sum_1_to_n_finite(unsigned n) {
unsigned total = 0;
for (unsigned i = 0 ; i < n+1 ; ++i)
total += i;
return total;
}
x86-64 asm from clang7.0 and gcc8.2 on the Godbolt compiler explorer
# clang7.0 -O3 closed-form
cmp edi, -1 # n passed in EDI: x86-64 System V calling convention
je .LBB1_1 # if (n == UINT_MAX) return 0; // C++ loop runs 0 times
# else fall through into the closed-form calc
mov ecx, edi # zero-extend n into RCX
lea eax, [rdi - 1] # n-1
imul rax, rcx # n * (n-1) # 64-bit
shr rax # n * (n-1) / 2
add eax, edi # n + (stuff / 2) = n * (n+1) / 2 # truncated to 32-bit
ret # computed without possible overflow of the product before right shifting
.LBB1_1:
xor eax, eax
ret
But for the naive version, we just get a dumb loop from clang.
unsigned sum_1_to_n_naive(unsigned n) {
unsigned total = 0;
for (unsigned i = 0 ; i<=n ; ++i)
total += i;
return total;
}
# clang7.0 -O3
sum_1_to_n(unsigned int):
xor ecx, ecx # i = 0
xor eax, eax # retval = 0
.LBB0_1: # do {
add eax, ecx # retval += i
add ecx, 1 # ++1
cmp ecx, edi
jbe .LBB0_1 # } while( i<n );
ret
GCC doesn't use a closed-form either way, so the choice of loop condition doesn't really hurt it; it auto-vectorizes with SIMD integer addition, running 4 i values in parallel in the elements of an XMM register.
# "naive" inner loop
.L3:
add eax, 1 # do {
paddd xmm0, xmm1 # vect_total_4.6, vect_vec_iv_.5
paddd xmm1, xmm2 # vect_vec_iv_.5, tmp114
cmp edx, eax # bnd.1, ivtmp.14 # bound and induction-variable tmp, I think.
ja .L3 #, # }while( n > i )
"finite" inner loop
# before the loop:
# xmm0 = 0 = totals
# xmm1 = {0,1,2,3} = i
# xmm2 = set1_epi32(4)
.L13: # do {
add eax, 1 # i++
paddd xmm0, xmm1 # total[0..3] += i[0..3]
paddd xmm1, xmm2 # i[0..3] += 4
cmp eax, edx
jne .L13 # }while( i != upper_limit );
then horizontal sum xmm0
and peeled cleanup for the last n%3 iterations, or something.
It also has a plain scalar loop which I think it uses for very small n, and/or for the infinite loop case.
BTW, both of these loops waste an instruction (and a uop on Sandybridge-family CPUs) on loop overhead. sub eax,1/jnz instead of add eax,1/cmp/jcc would be more efficient. 1 uop instead of 2 (after macro-fusion of sub/jcc or cmp/jcc). The code after both loops writes EAX unconditionally, so it's not using the final value of the loop counter.

You could say that line is correct in most scripting languages, since the extra character results in slightly slower code processing.
However, as the top answer pointed out, it should have no effect in C++, and anything being done with a scripting language probably isn't that concerned about optimization.

Only if the people who created the computers are bad with boolean logic. Which they shouldn't be.
Every comparison (>= <= > <) can be done in the same speed.
What every comparison is, is just a subtraction (the difference) and seeing if it's positive/negative.
(If the msb is set, the number is negative)
How to check a >= b? Sub a-b >= 0 Check if a-b is positive.
How to check a <= b? Sub 0 <= b-a Check if b-a is positive.
How to check a < b? Sub a-b < 0 Check if a-b is negative.
How to check a > b? Sub 0 > b-a Check if b-a is negative.
Simply put, the computer can just do this underneath the hood for the given op:
a >= b == msb(a-b)==0
a <= b == msb(b-a)==0
a > b == msb(b-a)==1
a < b == msb(a-b)==1
and of course the computer wouldn't actually need to do the ==0 or ==1 either.
for the ==0 it could just invert the msb from the circuit.
Anyway, they most certainly wouldn't have made a >= b be calculated as a>b || a==b lol

In C and C++, an important rule for the compiler is the “as-if” rule: If doing X has the exact same behavior as if you did Y, then the compiler is free to choose which one it uses.
In your case, “a < 901” and “a <= 900” always have the same result, so the compiler is free to compile either version. If one version was faster, for whatever reason, then any quality compiler would produce code for the version that is faster. So unless your compiler produced exceptionally bad code, both versions would run at equal speed.
Now if you had a situation where two bits of code will always produce the same result, but it is hard to prove for the compiler, and/or it is hard for the compiler to prove which if any version is faster, then you might get different code running at different speeds.
PS The original example might run at different speeds if the processor supports single byte constants (faster) and multi byte constants (slower), so comparing against 255 (1 byte) might be faster than comparing against 256 (two bytes). I’d expect the compiler to do whatever is faster.

Only if computation path depends on data:
a={1,1,1,1,1000,1,1,1,1}
while (i<=4)
{
for(j from 0 to a[i]){ do_work(); }
i++;
}
will compute 250 times more than while(i<4)
Real-world sample would be computing mandelbrot-set. If you include a pixel that iterates 1000000 times, it will cause a lag but the coincidence with <= usage probability is too low.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js