How does a C/C++ compiler optimize division by non-powers-of-two? [duplicate] - c++

I've been reading about div and mul assembly operations, and I decided to see them in action by writing a simple program in C:
File division.c
#include <stdlib.h>
#include <stdio.h>
int main()
{
    size_t i = 9;
    size_t j = i / 5;
    printf("%zu\n",j);
    return 0;
}
And then generating assembly language code with:
gcc -S division.c -O0 -masm=intel
But looking at generated division.s file, it doesn't contain any div operations! Instead, it does some kind of black magic with bit shifting and magic numbers. Here's a code snippet that computes i/5:
mov rax, QWORD PTR [rbp-16] ; Move i (=9) to RAX
movabs rdx, -3689348814741910323 ; Move some magic number to RDX (?)
mul rdx ; Multiply 9 by magic number
mov rax, rdx ; Take only the upper 64 bits of the result
shr rax, 2 ; Shift these bits 2 places to the right (?)
mov QWORD PTR [rbp-8], rax ; Magically, RAX contains 9/5=1 now,
; so we can assign it to j
What's going on here? Why doesn't GCC use div at all? How does it generate this magic number and why does everything work?

Integer division is one of the slowest arithmetic operations you can perform on a modern processor, with latency of up to dozens of cycles and poor throughput. (For x86, see Agner Fog's instruction tables and microarch guide).
If you know the divisor ahead of time, you can avoid the division by replacing it with a set of other operations (multiplications, additions, and shifts) which have the equivalent effect. Even if several operations are needed, it's often still a heck of a lot faster than the integer division itself.
Implementing the C / operator this way instead of with a multi-instruction sequence involving div is just GCC's default way of doing division by constants. It doesn't require optimizing across operations and doesn't change anything even for debugging. (Using -Os for small code size does get GCC to use div, though.) Using a multiplicative inverse instead of division is like using lea instead of mul and add.
As a result, you only tend to see div or idiv in the output if the divisor isn't known at compile-time.
For information on how the compiler generates these sequences, as well as code to let you generate them for yourself (almost certainly unnecessary unless you're working with a braindead compiler), see libdivide.

Dividing by 5 is the same as multiplying by 1/5, which is again the same as multiplying by 4/5 and shifting right 2 bits. The value concerned is CCCCCCCCCCCCCCCD in hex, which is 4/5 written after a hexadecimal point (i.e. the binary for four fifths is 0.110011001100 recurring - see below for why). I think you can take it from here! You might want to check out fixed point arithmetic (though note it's rounded to an integer at the end).
As to why, multiplication is faster than division, and when the divisor is fixed, this is a faster route.
See Reciprocal Multiplication, a tutorial for a detailed writeup about how it works, explaining in terms of fixed-point. It shows how the algorithm for finding the reciprocal works, and how to handle signed division and modulo.
Let's consider for a minute why 0.CCCCCCCC... (hex) or 0.110011001100... binary is 4/5. Divide the binary representation by 4 (shift right 2 places), and we'll get 0.001100110011... which by trivial inspection can be added to the original to get 0.111111111111..., which is obviously equal to 1, the same way 0.9999999... in decimal is equal to one. Therefore, we know that x + x/4 = 1, so 5x/4 = 1, x = 4/5. This is then represented as CCCCCCCCCCCCCCCD in hex for rounding (as the binary digit beyond the last one present would be a 1).
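To see the whole trick in one place, here is a minimal C sketch (not part of the original answer) that reproduces the compiler's i/5 sequence. It uses unsigned __int128, a GCC/Clang extension, to stand in for the widening mul that leaves the upper 64 bits in RDX:
#include <stdint.h>
#include <stdio.h>

/* 0xCCCCCCCCCCCCCCCD is 4/5 in 0.64 fixed point, rounded up */
static uint64_t div5(uint64_t i)
{
    const uint64_t magic = 0xCCCCCCCCCCCCCCCDull;
    uint64_t hi = (uint64_t)(((unsigned __int128)i * magic) >> 64); /* mul rdx; keep the upper half */
    return hi >> 2;                                                 /* shr rax, 2 */
}

int main(void)
{
    for (uint64_t i = 0; i < 1000000; i++)
        if (div5(i) != i / 5)
            return 1;                                  /* never taken: the rounded-up constant is exact */
    printf("%llu\n", (unsigned long long)div5(9));     /* prints 1 */
    return 0;
}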

In general multiplication is much faster than division. So if we can get away with multiplying by the reciprocal instead, we can significantly speed up division by a constant.
A wrinkle is that we cannot represent the reciprocal exactly (unless the division was by a power of two but in that case we can usually just convert the division to a bit shift). So to ensure correct answers we have to be careful that the error in our reciprocal does not cause errors in our final result.
-3689348814741910323 is 0xCCCCCCCCCCCCCCCD which is a value of just over 4/5 expressed in 0.64 fixed point.
When we multiply a 64 bit integer by a 0.64 fixed point number we get a 64.64 result. We truncate the value to a 64-bit integer (effectively rounding it towards zero) and then perform a further shift which divides by four and again truncates. By looking at the bit level it is clear that we can treat both truncations as a single truncation.
This clearly gives us at least an approximation of division by 5 but does it give us an exact answer correctly rounded towards zero?
To get an exact answer the error needs to be small enough not to push the answer over a rounding boundary.
The exact answer to a division by 5 will always have a fractional part of 0, 1/5, 2/5, 3/5 or 4/5. Therefore a positive error of less than 1/5 in the multiplied and shifted result will never push the result over a rounding boundary.
The error in our constant is (1/5) * 2^-64. The value of i is less than 2^64 so the error after multiplying is less than 1/5. After the division by 4 the error is less than (1/5) * 2^-2.
(1/5) * 2^-2 < 1/5 so the answer will always be equal to doing an exact division and rounding towards zero.
Unfortunately this doesn't work for all divisors.
If we try to represent 4/7 as a 0.64 fixed point number with rounding away from zero we end up with an error of (6/7) * 2^-64. After multiplying by an i value of just under 2^64 we end up with an error just under 6/7 and after dividing by four we end up with an error of just under 1.5/7 which is greater than 1/7.
So to implement division by 7 correctly we need to multiply by a 0.65 fixed point number. We can implement that by multiplying by the lower 64 bits of our fixed point number, then adding the original number (this may overflow into the carry bit) then doing a rotate through carry.
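As a rough sketch of that idea for 7 (my code, not the answerer's): the 65-bit value is ceil(2^67/7) = 2^64 + 0x2492492492492493, i.e. 4/7 rounded up in 0.65 fixed point. Multiplying by the low 64 bits, adding the original number into a 65-bit sum (the carry), and shifting right by a further 3 bits gives the quotient. unsigned __int128 (a GCC/Clang extension) holds the 65-bit sum here instead of the carry flag and a rotate:
#include <stdint.h>
#include <assert.h>

static uint64_t div7(uint64_t n)
{
    const uint64_t m_lo = 0x2492492492492493ull;       /* low 64 bits of ceil(2^67/7) */
    uint64_t hi = (uint64_t)(((unsigned __int128)n * m_lo) >> 64);
    unsigned __int128 sum = (unsigned __int128)hi + n; /* 65-bit sum: may carry into bit 64 */
    return (uint64_t)(sum >> 3);                       /* total shift is 64 + 3 = 67 */
}

int main(void)
{
    for (uint64_t n = 0; n < 10000000; n++)
        assert(div7(n) == n / 7);
    assert(div7(UINT64_MAX) == UINT64_MAX / 7);
    return 0;
}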

Here is a link to a paper describing the algorithm that produces the values and code I see with Visual Studio (in most cases), and that I assume is still used in GCC for division of a variable integer by a constant integer.
http://gmplib.org/~tege/divcnst-pldi94.pdf
In the article, a uword has N bits, a udword has 2N bits, n = numerator = dividend, d = denominator = divisor, ℓ is initially set to ceil(log2(d)), shpre is pre-shift (used before multiply) = e = number of trailing zero bits in d, shpost is post-shift (used after multiply), prec is precision = N - e = N - shpre. The goal is to optimize calculation of n/d using a pre-shift, multiply, and post-shift.
Scroll down to figure 6.2, which defines how a udword multiplier (max size is N+1 bits) is generated, but doesn't clearly explain the process. I'll explain this below.
Figure 4.2 and figure 6.2 show how the multiplier can be reduced to an N bit or smaller multiplier for most divisors. Equation 4.5 explains how the formula used to deal with N+1 bit multipliers in figures 4.1 and 4.2 was derived.
In the case of modern X86 and other processors, multiply time is fixed, so pre-shift doesn't help on these processors, but it still helps to reduce the multiplier from N+1 bits to N bits. I don't know if GCC or Visual Studio have eliminated pre-shift for X86 targets.
Going back to Figure 6.2: the numerator (dividend) for mlow and mhigh can be larger than a udword only when the denominator (divisor) > 2^(N-1) (when ℓ == N => mlow = 2^(2N)); in this case the optimized replacement for n/d is a compare (if n >= d, q = 1, else q = 0), so no multiplier is generated. The initial values of mlow and mhigh will be N+1 bits, and two udword/uword divides can be used to produce each N+1 bit value (mlow or mhigh). Using X86 in 64 bit mode as an example:
; upper 8 bytes of dividend = 2^(ℓ) = (upper part of 2^(N+ℓ))
; lower 8 bytes of dividend for mlow = 0
; lower 8 bytes of dividend for mhigh = 2^(N+ℓ-prec) = 2^(ℓ+shpre) = 2^(ℓ+e)
dividend dq 2 dup(?) ;16 byte dividend
divisor dq 1 dup(?) ; 8 byte divisor
; ...
mov rcx,divisor
mov rdx,0
mov rax,dividend+8 ;upper 8 bytes of dividend
div rcx ;after div, rax == 1
mov rax,dividend ;lower 8 bytes of dividend
div rcx
mov rdx,1 ;rdx:rax = N+1 bit value = 65 bit value
You can test this with GCC. You've already seen how j = i/5 is handled. Take a look at how j = i/7 is handled (which should be the N+1 bit multiplier case).
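If you'd rather not set up the two chained div instructions, the same mlow/mhigh values can be computed directly with unsigned __int128 (a GCC/Clang extension). This is my own sketch for d = 7, where ℓ = 3 and e = 0, not code from the paper:
#include <stdio.h>

int main(void)
{
    unsigned __int128 one = 1;
    unsigned d = 7, l = 3, e = 0;              /* l = ceil(log2(7)), e = trailing zero bits of 7 */
    unsigned __int128 mlow  = (one << (64 + l)) / d;                      /* 2^(N+l) / d             */
    unsigned __int128 mhigh = ((one << (64 + l)) + (one << (l + e))) / d; /* (2^(N+l) + 2^(l+e)) / d */
    printf("mlow  = 2^64 + %llu\n", (unsigned long long)(mlow  - (one << 64)));
    printf("mhigh = 2^64 + %llu\n", (unsigned long long)(mhigh - (one << 64)));
    return 0;
}
For 7 both values need 65 bits; the low 64 bits of mhigh come out as 2635249153387078803 = 0x2492492492492493, which is the constant that shows up in the /7 sequences below.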
On most current processors, multiply has a fixed timing, so a pre-shift is not needed. For X86, the end result is a two instruction sequence for most divisors, and a five instruction sequence for divisors like 7 (in order to emulate an N+1 bit multiplier as shown in equation 4.5 and figure 4.2 of the pdf file). Example X86-64 code:
; rbx = dividend, rax = 64 bit (or less) multiplier, rcx = post shift count
; two instruction sequence for most divisors:
mul rbx ;rdx = upper 64 bits of product
shr rdx,cl ;rdx = quotient
;
; five instruction sequence for divisors like 7
; to emulate 65 bit multiplier (rbx = lower 64 bits of multiplier)
mul rbx ;rdx = upper 64 bits of product
sub rbx,rdx ;rbx -= rdx
shr rbx,1 ;rbx >>= 1
add rdx,rbx ;rdx = upper 64 bits of corrected product
shr rdx,cl ;rdx = quotient
; ...
To explain the need for the 5 instruction sequence: a simple 3 instruction sequence could overflow. Let u64() mean the upper 64 bits (all that is needed for the quotient).
mul rbx ;rdx = u64(dvnd*mplr)
add rdx,rbx ;rdx = u64(dvnd*(2^64 + mplr)), could overflow
shr rdx,cl
To handle this case, cl = post_shift-1. rax = multiplier - 2^64, rbx = dividend. u64() is upper 64 bits. Note that rax = rax<<1 - rax. Quotient is:
u64( ( rbx * (2^64 + rax) )>>(cl+1) )
u64( ( rbx * (2^64 + rax<<1 - rax) )>>(cl+1) )
u64( ( (rbx * 2^64) + (rbx * rax)<<1 - (rbx * rax) )>>(cl+1) )
u64( ( (rbx * 2^64) - (rbx * rax) + (rbx * rax)<<1 )>>(cl+1) )
u64( ( (((rbx * 2^64) - (rbx * rax))>>1) + (rbx*rax) )>>(cl) )
mul rbx ; (rbx*rax)
sub rbx,rdx ; (rbx*2^64)-(rbx*rax)
shr rbx,1 ;( (rbx*2^64)-(rbx*rax))>>1
add rdx,rbx ;( ((rbx*2^64)-(rbx*rax))>>1)+(rbx*rax)
shr rdx,cl ;((((rbx*2^64)-(rbx*rax))>>1)+(rbx*rax))>>cl
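Here is the same 5 instruction sequence transliterated into C for divisor 7 (my sketch, assuming the 65-bit multiplier 2^64 + 0x2492492492492493 and post shift 3, so cl = 2). Only the widening multiply needs unsigned __int128 (a GCC/Clang extension); everything after it stays in 64 bits, which is the point of the rearrangement:
#include <stdint.h>
#include <assert.h>

static uint64_t div7_5insn(uint64_t n)
{
    const uint64_t mplr = 0x2492492492492493ull;                   /* rax = multiplier - 2^64           */
    uint64_t hi = (uint64_t)(((unsigned __int128)n * mplr) >> 64); /* mul rbx                           */
    uint64_t t  = (n - hi) >> 1;                                   /* sub rbx,rdx ; shr rbx,1           */
    return (hi + t) >> 2;                                          /* add rdx,rbx ; shr rdx,cl (cl = 2) */
}

int main(void)
{
    for (uint64_t n = 0; n < 10000000; n++)
        assert(div7_5insn(n) == n / 7);
    assert(div7_5insn(UINT64_MAX) == UINT64_MAX / 7);
    return 0;
}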

I will answer from a slightly different angle: because it is allowed to.
C and C++ are defined against an abstract machine. The compiler transforms the program, which is specified in terms of that abstract machine, into code for the concrete machine, following the as-if rule.
The compiler is allowed to make ANY changes as long as it doesn't change the observable behaviour as specified by the abstract machine. There is no reasonable expectation that the compiler will transform your code in the most straightforward way possible (even though a lot of C programmers assume it will). Usually it does this because it wants to optimize performance compared to the straightforward approach (as discussed in the other answers at length).
If under any circumstances the compiler "optimizes" a correct program to something that has a different observable behaviour, that is a compiler bug.
If there is any undefined behaviour in your code (signed integer overflow is a classic example), this contract is void.

Related

Assembler division x86 strange numbers [duplicate]


Why does division by 3 require a rightshift (and other oddities) on x86?

I have the following C/C++ function:
unsigned div3(unsigned x) {
    return x / 3;
}
When compiled using clang 10 at -O3, this results in:
div3(unsigned int):
mov ecx, edi # tmp = x
mov eax, 2863311531 # result = 3^-1
imul rax, rcx # result *= tmp
shr rax, 33 # result >>= 33
ret
What I do understand is: division by 3 is equivalent to multiplying with the multiplicative inverse 3^-1 mod 2^32, which is 2863311531.
There are some things that I don't understand though:
Why do we need to use ecx/rcx at all? Can't we multiply rax with edi directly?
Why do we multiply in 64-bit mode? Wouldn't it be faster to multiply eax and ecx?
Why are we using imul instead of mul? I thought modular arithmetic would be all unsigned.
What's up with the 33-bit rightshift at the end? I thought we can just drop the highest 32-bits.
Edit 1
For those who don't understand what I mean by 3^-1 mod 2^32, I am talking about the multiplicative inverse here.
For example:
// multiplying with inverse of 3:
15 * 2863311531 = 42949672965
42949672965 mod 2^32 = 5
// using fixed-point multiplication
15 * 2863311531 = 42949672965
42949672965 >> 33 = 5
// simply dividing by 3
15 / 3 = 5
So multiplying with 2863311531 is actually equivalent to dividing by 3. I assumed clang's optimization is based on modular arithmetic, when it's really based on fixed point arithmetic.
Edit 2
I have now realized that the multiplicative inverse can only be used for divisions without a remainder. For example, multiplying 1 by 3^-1 is equal to 3^-1, not zero. Only fixed point arithmetic has correct rounding.
Unfortunately, clang does not make any use of modular arithmetic which would just be a single imul instruction in this case, even when it could. The following function has the same compile output as above.
unsigned div3(unsigned x) {
    __builtin_assume(x % 3 == 0);
    return x / 3;
}
(Canonical Q&A about fixed-point multiplicative inverses for exact division that work for every possible input: Why does GCC use multiplication by a strange number in implementing integer division? - not quite a duplicate because it only covers the math, not some of the implementation details like register width and imul vs. mul.)
Can't we multiply rax with edi directly?
We can't imul rax, rdi because the calling convention allows the caller to leave garbage in the high bits of RDI; only the EDI part contains the value. This is a non-issue when inlining; writing a 32-bit register does implicitly zero-extend to the full 64-bit register, so the compiler will usually not need an extra instruction to zero-extend a 32-bit value.
(zero-extending into a different register is better because of limitations on mov-elimination, if you can't avoid it).
Taking your question even more literally, no, x86 doesn't have any multiply instructions that zero-extend one of their inputs to let you multiply a 32-bit and a 64-bit register. Both inputs must be the same width.
Why do we multiply in 64-bit mode?
(terminology: all of this code runs in 64-bit mode. You're asking why 64-bit operand-size.)
You could mul edi to multiply EAX with EDI to get a 64-bit result split across EDX:EAX, but mul edi is 3 uops on Intel CPUs, vs. most modern x86-64 CPUs having fast 64-bit imul. (Although imul r64, r64 is slower on AMD Bulldozer-family, and on some low-power CPUs.) https://uops.info/ and https://agner.org/optimize/ (instruction tables and microarch PDF)
(Fun fact: mul rdi is actually cheaper on Intel CPUs, only 2 uops. Perhaps something to do with not having to do extra splitting on the output of the integer multiply unit, like mul edi would have to split the 64-bit low half multiplier output into EDX and EAX halves, but that happens naturally for 64x64 => 128-bit mul.)
Also the part you want is in EDX so you'd need another mov eax, edx to deal with it. (Again, because we're looking at code for a stand-alone definition of the function, not after inlining into a caller.)
GCC 8.3 and earlier did use 32-bit mul instead of 64-bit imul (https://godbolt.org/z/5qj7d5). That was not crazy for -mtune=generic when Bulldozer-family and old Silvermont CPUs were more relevant, but those CPUs are farther in the past for more recent GCC, and its generic tuning choices reflect that. Unfortunately GCC also wasted a mov instruction copying EDI to EAX, making this way look even worse :/
# gcc8.3 -O3 (default -mtune=generic)
div3(unsigned int):
mov eax, edi # 1 uop, stupid wasted instruction
mov edx, -1431655765 # 1 uop (same 32-bit constant, just printed differently)
mul edx # 3 uops on Sandybridge-family
mov eax, edx # 1 uop
shr eax # 1 uop
ret
# total of 7 uops on SnB-family
Would only be 6 uops with mov eax, 0xAAAAAAAB / mul edi, but still worse than:
# gcc9.3 -O3 (default -mtune=generic)
div3(unsigned int):
mov eax, edi # 1 uop
mov edi, 2863311531 # 1 uop
imul rax, rdi # 1 uop
shr rax, 33 # 1 uop
ret
# total 4 uops, not counting ret
Unfortunately, 64-bit 0x00000000AAAAAAAB can't be represented as a 32-bit sign-extended immediate, so imul rax, rcx, 0xAAAAAAAB isn't encodeable. It would mean 0xFFFFFFFFAAAAAAAB.
Why are we using imul instead of mul? I thought modular arithmetic would be all unsigned.
It is unsigned. Signedness of the inputs only affects the high half of the result, but imul reg, reg doesn't produce the high half. Only the one-operand forms of mul and imul are full multiplies that do NxN => 2N, so only they need separate signed and unsigned versions.
Only imul has the faster and more flexible low-half-only forms. The only thing that's signed about imul reg, reg is that it sets OF based on signed overflow of the low half. It wasn't worth spending more opcodes and more transistors just to have a mul r,r whose only difference from imul r,r is the FLAGS output.
Intel's manual (https://www.felixcloutier.com/x86/imul) even points out the fact that it can be used for unsigned.
What's up with the 33-bit rightshift at the end? I thought we can just drop the highest 32-bits.
No, there's no multiplier constant that would give the exact right answer for every possible input x if you implemented it that way. The "as-if" optimization rule doesn't allow approximations, only implementations that produce the exact same observable behaviour for every input the program uses. Without knowing a value-range for x other than full range of unsigned, compilers don't have that option. (-ffast-math only applies to floating point; if you want faster approximations for integer math, code them manually like below):
See Why does GCC use multiplication by a strange number in implementing integer division? for more about the fixed-point multiplicative inverse method compilers use for exact division by compile time constants.
For an example of this not working in the general case, see my edit to an answer on Divide by 10 using bit shifts? which proposed
// Warning: INEXACT FOR LARGE INPUTS
// this fast approximation can just use the high half,
// so on 32-bit machines it avoids one shift instruction vs. exact division
int32_t div10(int32_t dividend)
{
    int64_t invDivisor = 0x1999999A;
    return (int32_t) ((invDivisor * dividend) >> 32);
}
Its first wrong answer (if you loop from 0 upward) is div10(1073741829) = 107374183 when 1073741829/10 is actually 107374182. (It rounded up instead of toward 0 like C integer division is supposed to.)
From your edit, I see you were actually talking about using the low half of a multiply result, which apparently works perfectly for exact multiples all the way up to UINT_MAX.
As you say, it completely fails when the division would have a remainder, e.g. 16 * 0xaaaaaaab = 0xaaaaaab0 when truncated to 32-bit, not 5.
unsigned div3_exact_only(unsigned x) {
    __builtin_assume(x % 3 == 0); // or an equivalent with if() __builtin_unreachable()
    return x / 3;
}
Yes, if that math works out, it would be legal and optimal for compilers to implement that with 32-bit imul. They don't look for this optimization because it's rarely a known fact. IDK if it would be worth adding compiler code to even look for the optimization, in terms of compile time, not to mention compiler maintenance cost in developer time. It's not a huge difference in runtime cost, and it's rarely going to be possible. It is nice, though.
div3_exact_only:
imul eax, edi, 0xAAAAAAAB # 1 uop, 3c latency
ret
However, it is something you can do yourself in source code, at least for known type widths like uint32_t:
uint32_t div3_exact_only(uint32_t x) {
    return x * 0xaaaaaaabU;
}
What's up with the 33-bit right shift at the end? I thought we can just drop the highest 32-bits.
Instead of 3^(-1) mod 2^32 you have to think more about 0.3333333, where the 0 before the . is located in the upper 32 bits and the 3333 is located in the lower 32 bits.
This fixed point operation works fine, but the result is obviously shifted to the upper part of rax, therefore the CPU must shift the result down again after the operation.
Why are we using imul instead of mul? I thought modular arithmetic would be all unsigned.
There is no MUL instruction equivalent to the IMUL instruction. The IMUL variant that is used takes two registers:
a <= a * b
There is no MUL instruction that does that. MUL instructions are more expensive because they store the result as 128 bits in two registers.
Of course you could use the legacy instructions, but this does not change the fact that the result is stored in two registers.
If you look at my answer to the prior question:
Why does GCC use multiplication by a strange number in implementing integer division?
It contains a link to a pdf article that explains this (my answer clarifies the stuff that isn't explained well in this pdf article):
https://gmplib.org/~tege/divcnst-pldi94.pdf
Note that one extra bit of precision is needed for some divisors, such as 7: the multiplier would normally require 33 bits, and the product would normally require 65 bits, but this can be avoided by handling the 2^32 bit separately with 3 additional instructions as shown in my prior answer and below.
Take a look at the generated code if you change to
unsigned div7(unsigned x) {
    return x / 7;
}
So to explain the process, let L = ceil(log2(divisor)). For the question above, L = ceil(log2(3)) == 2. The right shift count would initially be 32+L = 34.
To generate a multiplier with a sufficient number of bits, two potential multipliers are generated: mhi will be the multiplier to be used, and the shift count will be 32+L.
mhi = (2^(32+L) + 2^(L))/3 = 5726623062
mlo = (2^(32+L) )/3 = 5726623061
Then a check is made to see if the number of required bits can be reduced:
while((L > 0) && ((mhi>>1) > (mlo>>1))){
mhi = mhi>>1;
mlo = mlo>>1;
L = L-1;
}
if(mhi >= 2^32){
mhi = mhi-2^32
L = L-1;
; use 3 additional instructions for missing 2^32 bit
}
... mhi>>1 = 5726623062>>1 = 2863311531
... mlo>>1 = 5726623061>>1 = 2863311530 (mhi>>1) > (mlo>>1)
... mhi = mhi>>1 = 2863311531
... mlo = mlo>>1 = 2863311530
... L = L-1 = 1
... the next loop exits since now (mhi>>1) == (mlo>>1)
So the multiplier is mhi = 2863311531 and the shift count = 32+L = 33.
On a modern X86, multiply and shift instructions are constant time, so there's no point in reducing the multiplier (mhi) to less than 32 bits, so the while(...) above is changed to an if(...).
In the case of 7, the loop exits on the first iteration, and requires 3 extra instructions to handle the 2^32 bit, so that mhi is <= 32 bits:
L = ceil(log2(7)) = 3
mhi = (2^(32+L) + 2^(L))/7 = 4908534053
mhi = mhi-2^32 = 613566757
Let ecx = dividend; the simple approach could overflow on the add:
mov eax, 613566757 ; eax = mhi
mul ecx ; edx:eax = ecx*mhi
add edx, ecx ; edx:eax = ecx*(mhi + 2^32), potential overflow
shr edx, 3
To avoid the potential overflow, note that eax = eax*2 - eax:
(ecx*eax)           = ((ecx*eax)<<1) - (ecx*eax)
(ecx*(eax+2^32))    = ((ecx*eax)<<1) + (ecx*2^32) - (ecx*eax)
(ecx*(eax+2^32))>>3 = (((ecx*eax)<<1) + (ecx*2^32) - (ecx*eax))>>3
                    = ((ecx*eax) + (((ecx*2^32)-(ecx*eax))>>1))>>2
so the actual code, using u32() to mean upper 32 bits:
... visual studio generated code for div7, dividend is ecx
mov eax, 613566757
mul ecx ; edx = u32( (ecx*eax) )
sub ecx, edx ; ecx = u32( ((ecx*2^32)-(ecx*eax)) )
shr ecx, 1 ; ecx = u32( (((ecx*2^32)-(ecx*eax))>>1) )
lea eax, DWORD PTR [edx+ecx] ; eax = u32( (ecx*eax)+(((ecx*2^32)-(ecx*eax))>>1) )
shr eax, 2 ; eax = u32(((ecx*eax)+(((ecx*2^32)-(ecx*eax))>>1))>>2)
If a remainder is wanted, then the following steps can be used:
mhi and L are generated based on divisor during compile time
...
quotient = (x*mhi)>>(32+L)
product = quotient*divisor
remainder = x - product
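Putting the /3 numbers from above into code (a sketch of mine, using a 64-bit multiply for the 32x32 product rather than whatever a particular compiler emits): mhi = 2863311531, shift = 32 + L = 33, plus the remainder steps:
#include <stdint.h>
#include <stdio.h>

static void divmod3(uint32_t x, uint32_t *q, uint32_t *r)
{
    *q = (uint32_t)(((uint64_t)x * 2863311531u) >> 33); /* quotient  = (x*mhi)>>(32+L)      */
    *r = x - *q * 3u;                                   /* remainder = x - quotient*divisor */
}

int main(void)
{
    uint32_t q, r;
    divmod3(4294967295u, &q, &r);
    printf("%u %u\n", (unsigned)q, (unsigned)r);        /* prints 1431655765 0 */
    return 0;
}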
x/3 is approximately (x * (2^32/3)) / 2^32. So we can perform a single 32x32->64 bit multiplication, take the higher 32 bits, and get approximately x/3.
There is some error because we cannot multiply exactly by 2^32/3, only by this number rounded to an integer. We get more precision using x/3 ≈ (x * (2^33/3)) / 2^33. (We can't use 2^34/3 because that is > 2^32). And that turns out to be good enough to get x/3 in all cases exactly. You would prove this by checking that the formula gives a result of k if the input is 3k or 3k+2.
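A quick way to run that check (my sketch; the constant is ceil(2^33/3) = 2863311531, and k is sampled rather than exhaustive to keep the run short):
#include <stdint.h>
#include <assert.h>

int main(void)
{
    const uint64_t inv = 2863311531u;                   /* ceil(2^33 / 3) */
    for (uint64_t k = 0; 3*k + 2 <= 0xFFFFFFFFull; k += 12345) {
        assert((3*k * inv) >> 33 == k);                 /* input 3k   -> quotient k */
        assert(((3*k + 2) * inv) >> 33 == k);           /* input 3k+2 -> quotient k */
    }
    return 0;
}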

Bit fiddling: Negate an int, only if it's negative

I'm working through a codebase that uses the following encoding to indicate sampling with replacement: We maintain an array of ints as indicators of position and presence in the sample, where positive ints indicate a position in another array and negative ints indicate that we should not use the data point in this iteration.
Example:
data_points: [...] // Objects vector of length 5 == position.size()
std::vector<int> position: [3, 4, -3, 1, -2]
would mean that the first element in data_points should go to bucket 3, the second to bucket 4, and the fourth to bucket 1.
The negative values indicate that for this iteration we won't be using those data points, namely the 3rd and 5th data points are marked as excluded because their values in position are negative and have been set with position[i] = ~position[i].
The trick is that we might perform this multiple times, but the positions of the data points in the index should not change. So in the next iteration, if we want to exclude data point 1, and include data point 5 we could do something like:
position[0] = ~position[0] // Bit-wise complement flips the sign on ints and subtracts 1
position[4] = ~position[4]
This will change the position vector into
std::vector<int> position: [-4, 4, -3, 1, 1]
Now on to the question: At the end of each round I want to reset all signs to positive, i.e. position should become [3, 4, 3, 1, 2].
Is there a bit-fiddling trick that will allow me to do this without having an if condition for the sign of the value?
Also, because I'm new to bit fiddling like this, why/how does taking the bit complement of a signed, positive int give us its mathematical complement? (i.e. the same value with the sign flipped)
Edit: The above is wrong, the complement of a (int) will give -(a + 1) and depends on the representation of ints as noted in the answers. So the original question of simply taking the positive value of an existing one does not apply, we actually need to perform a bitwise complement to get the original value.
Example:
position[0] = 1
position[0] = ~position[0] // Now position[0] is -2
// So if we did
position[0] = std::abs(position[0]) // This is wrong, position[0] is now 2!
// If we want to reset it to 1, we need to do another complement
position[0] = ~position[0] // Now position[0] is again 1
I recommend not attempting bit fiddling. Partly because you're dealing with signed numbers, and you would throw away portability if you fiddle. Partly because bit fiddling is not as readable as reusable functions.
A simple solution:
std::for_each(position.begin(), position.end(), [](int& v) {
    v = std::abs(v);
});
why/how does taking the bit complement of a signed, positive int give us its mathematical complement? (i.e. the same value with the sign flipped)
It doesn't. Not in general anyway. It does so only on systems that use 1's complement representation for negative numbers, and the reason for that is simply because that's how the representation is specified. A negative number is represented by binary complement of the positive value.
Most commonly used representation these days is 2's complement which doesn't work that way.
Probably the first go-to source for bit twiddling hacks: The eponymous site
int v;           // we want to find the absolute value of v
unsigned int r;  // the result goes here
int const mask = v >> (sizeof(int) * CHAR_BIT - 1);
r = (v + mask) ^ mask;
However, I would question the assumption that position[i] = std::abs(position[i]) has worse performance. You should definitely have profiling results that demonstrate the bit hack being superior before you check in that kind of code.
Feel free to play around with a quick benchmark (with disassembly) of both - I don't think there is a difference in speed:
gcc 8.2
clang 6.0
Also take a look at the assembly that is actually generated:
https://godbolt.org/z/Ghcw_c
Evidently, clang sees your bithack and is not impressed - it generates a conditional move in all cases (which is branch-free). gcc does as you tell it, but has two more implementations of abs in store, some making use of register semantics of the target architecture.
And if you get into (auto-)vectorization, things get even more muddy. You'll have to profile regardless.
Conclusion: Just write std::abs - your compiler will do all the bit-twiddling for you.
To answer the extended question: Again, first write the obvious and intuitive code and then check that your compiler does the right thing: Look ma, no branches!
If you let it have its fun with auto-vectorization then you probably won't understand (or be good at judging) the assembly, so you have to profile anyway. Concrete example:
https://godbolt.org/z/oaaOwJ. clang likes to also unroll the auto-vectorized loop while gcc is more conservative. In any case, it's still branch-free.
Chances are that your compiler understands the nitty-gritty details of instruction scheduling on your target platform better than you. If you don't obscure your intent with bit-magic, it'll do a good job by itself. If it's still a hot spot in your code, you can then go and see if you can hand-craft a better version (but that will probably have to be in assembly).
Use functions to signify intent. Allow the compiler's optimiser to do a better job than you ever will.
#include <cmath>
void include_index(int& val)
{
    val = std::abs(val);
}

void exclude_index(int& val)
{
    val = -std::abs(val);
}

bool is_included(int const& val)
{
    return val > 0;
}
Sample output from godbolt's gcc8 x86_64 compiler (note that it's all bit-twiddling and there are no conditional jumps - the bane of high performance computing):
include_index(int&):
mov eax, DWORD PTR [rdi]
sar eax, 31
xor DWORD PTR [rdi], eax
sub DWORD PTR [rdi], eax
ret
exclude_index(int&):
mov eax, DWORD PTR [rdi]
mov edx, DWORD PTR [rdi]
sar eax, 31
xor edx, eax
sub eax, edx
mov DWORD PTR [rdi], eax
ret
is_included(int const&):
mov eax, DWORD PTR [rdi]
test eax, eax
setg al
ret
https://godbolt.org/z/ni6DOk
Also, because I'm new to to bit fiddling like this, why/how does taking the bit complement of a signed, positive int give us its mathematical complement? (i.e. the same value with the sign flipped)
This question deserves an answer all by itself since everyone will tell you that this is how you do it, but nobody ever tells you why.
Notice that 1 - 0 = 1 and 1 - 1 = 0. What this means is that if we do 1 - b, where b is a single bit, the result is the opposite of b, or not b (~b). Also notice how this subtraction will never produce a borrow; this is very important, since b can be at most 1.
Also notice that subtracting a number with n bits simply means performing n 1-bit subtractions, while taking care of borrows. But our special case will never produce a borrow.
Basically we have created a mathematical definition for the bitwise not operation. To flip a bit b, do 1 - b. If we want to flip an n bit number, do this for every bit. But doing n subtractions in sequence is the same as subtracting two n bit numbers. So if we want to calculate the bitwise not of an 8-bit number, a, we simply do 11111111 - a, and the same for any n bit number. Once again this works because subtracting a bit from 1 will never produce a borrow.
But what is the sequence of n "1" bits? It's the value 2^n - 1. So taking the bitwise not of a number, a, is the same as calculating 2^n - 1 - a.
Now, numbers inside a computer are stored as numbers modulo 2^n. This is because we only have a limited number of bits available. You may know that if you work with 8 bits and you do 255 + 1 you'll get 0. This is because an 8-bit number is a number modulo 2^8 = 256, and 255 + 1 = 256. 256 is obviously equal to 0 modulo 256.
But why not do the same backwards? By this logic, 0 - 1 = 255, right? This is indeed correct. Mathematically, -1 and 255 are "congruent" modulo 256. Congruent essentially means equal to, but it's used to differentiate between regular equality and equality in modular arithmetic.
Notice in fact how 0 is also congruent to 256 modulo 256. So 0 - 1 = 256 - 1 = 255. 256 is our modulus, 2^8. But if bitwise not is defined as 2^n - 1 - a, then we have ~a = 2^8 - 1 - a. You'll notice how we have that - 1 in the middle. We can remove that by adding 1.
So we now have ~a + 1 = 2^n - 1 - a + 1 = 2^n - a. But 2^n - a is the negative of a modulo 2^n. And so here we have our negative number. This is called two's complement, and it's used in pretty much every modern processor, because it's the mathematical definition of a negative number in modular arithmetic modulo 2^n, and because numbers inside a processor work as if they were in modulo 2^n the math just works out by itself. You can add and subtract without doing any extra steps. Multiplication and division do require "sign extension", but that's just a quirk of how those operations are defined; the meaning of the number doesn't change when extending the sign.
Of course with this method you lose a bit, because you now have half of the numbers being positive and the other half negative, but you can't just magically add a bit to your processor, so the new range of values you can represent is from -2^(n-1) to 2^(n-1) - 1 inclusive.
Alternatively, you can keep the number as it is and not add 1 at the end. This is known as the one's complement. Of course, this is not quite the same as the mathematical negative, so adding, subtracting, multiplying and dividing don't just work out of the box, you need extra steps to adjust the result. This is why two's complement is the de facto standard for signed arithmetic. There's also the problem that, in one's complement, both 0 and 2^n - 1 represent the same quantity, zero, while in two's complement, negative 0 is still correctly 0 (because ~0 + 1 = 2^n - 1 + 1 = 2^n = 0). I think one's complement is used in the Internet Protocol as a checksum, but other than that it has very limited purpose.
But be aware, "de facto" standard means that this is just what everyone does, but there is no rule that says that it MUST be done this way, so always make sure to check the documentation of your target architecture to make sure you're doing the right thing. Even though, let's be honest, the chances of finding a one's complement processor out there nowadays are pretty much nil unless you're working on some extremely specific architecture, but still, better safe than sorry.
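If you want to see the identity from the derivation above on real hardware, here is a tiny sketch of mine using 8-bit values so the numbers stay readable; it assumes the usual two's complement machine:
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint8_t a = 3;                      /* n = 8 bits */
    printf("%d\n", (uint8_t)~a);        /* 252 = (2^8 - 1) - a */
    printf("%d\n", (uint8_t)(~a + 1));  /* 253 = 2^8 - a       */
    printf("%d\n", (int8_t)(~a + 1));   /* -3: the same bits read back as a signed value */
    return 0;
}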
"Is there a bit-fiddling trick that will allow me to do this without having an if condition for the sign of the value?"
Do you need to keep the numeric value of the number being changed from a negative value?
If not, you can use std::max to set the negative values to zero
iValue = std::max(iValue, 0); // returns zero if iValue is less than zero
If you need to keep the numeric value, but just change from negative to positive, then
iValue = std::abs(iValue); // always returns a positive value of iValue

Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly?

I wrote these two solutions for Project Euler Q14, in assembly and in C++. They implement an identical brute-force approach for testing the Collatz conjecture. The assembly solution was assembled with:
nasm -felf64 p14.asm && gcc p14.o -o p14
The C++ was compiled with:
g++ p14.cpp -o p14
Assembly, p14.asm:
section .data
fmt db "%d", 10, 0
global main
extern printf
section .text
main:
mov rcx, 1000000
xor rdi, rdi ; max i
xor rsi, rsi ; i
l1:
dec rcx
xor r10, r10 ; count
mov rax, rcx
l2:
test rax, 1
jpe even
mov rbx, 3
mul rbx
inc rax
jmp c1
even:
mov rbx, 2
xor rdx, rdx
div rbx
c1:
inc r10
cmp rax, 1
jne l2
cmp rdi, r10
cmovl rdi, r10
cmovl rsi, rcx
cmp rcx, 2
jne l1
mov rdi, fmt
xor rax, rax
call printf
ret
C++, p14.cpp:
#include <iostream>
int sequence(long n) {
    int count = 1;
    while (n != 1) {
        if (n % 2 == 0)
            n /= 2;
        else
            n = 3*n + 1;
        ++count;
    }
    return count;
}
int main() {
    int max = 0, maxi;
    for (int i = 999999; i > 0; --i) {
        int s = sequence(i);
        if (s > max) {
            max = s;
            maxi = i;
        }
    }
    std::cout << maxi << std::endl;
}
I know about the compiler optimizations to improve speed and everything, but I don’t see many ways to further optimize my assembly solution (speaking programmatically, not mathematically).
The C++ code uses modulus every term and division every other term, while the assembly code only uses a single division every other term.
But the assembly is taking on average 1 second longer than the C++ solution. Why is this? I am asking mainly out of curiosity.
Execution times
My system: 64-bit Linux on 1.4 GHz Intel Celeron 2955U (Haswell microarchitecture).
g++ (unoptimized): avg 1272 ms.
g++ -O3: avg 578 ms.
asm (div) (original): avg 2650 ms.
asm (shr): avg 679 ms.
#johnfound asm (assembled with NASM): avg 501 ms.
#hidefromkgb asm: avg 200 ms.
#hidefromkgb asm, optimized by #Peter Cordes: avg 145 ms.
#Veedrac C++: avg 81 ms with -O3, 305 ms with -O0.
If you think a 64-bit DIV instruction is a good way to divide by two, then no wonder the compiler's asm output beat your hand-written code, even with -O0 (compile fast, no extra optimization, and store/reload to memory after/before every C statement so a debugger can modify variables).
See Agner Fog's Optimizing Assembly guide to learn how to write efficient asm. He also has instruction tables and a microarch guide for specific details for specific CPUs. See also the x86 tag wiki for more perf links.
See also this more general question about beating the compiler with hand-written asm: Is inline assembly language slower than native C++ code?. TL:DR: yes if you do it wrong (like this question).
Usually you're fine letting the compiler do its thing, especially if you try to write C++ that can compile efficiently. Also see is assembly faster than compiled languages?. One of the answers links to these neat slides showing how various C compilers optimize some really simple functions with cool tricks. Matt Godbolt's CppCon2017 talk “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” is in a similar vein.
even:
mov rbx, 2
xor rdx, rdx
div rbx
On Intel Haswell, div r64 is 36 uops, with a latency of 32-96 cycles, and a throughput of one per 21-74 cycles. (Plus the 2 uops to set up RBX and zero RDX, but out-of-order execution can run those early). High-uop-count instructions like DIV are microcoded, which can also cause front-end bottlenecks. In this case, latency is the most relevant factor because it's part of a loop-carried dependency chain.
shr rax, 1 does the same unsigned division: It's 1 uop, with 1c latency, and can run 2 per clock cycle.
For comparison, 32-bit division is faster, but still horrible vs. shifts. idiv r32 is 9 uops, 22-29c latency, and one per 8-11c throughput on Haswell.
As you can see from looking at gcc's -O0 asm output (Godbolt compiler explorer), it only uses shift instructions. clang -O0 does compile naively like you thought, even using 64-bit IDIV twice. (When optimizing, compilers do use both outputs of IDIV when the source does a division and modulus with the same operands, if they use IDIV at all)
GCC doesn't have a totally-naive mode; it always transforms through GIMPLE, which means some "optimizations" can't be disabled. This includes recognizing division-by-constant and using shifts (power of 2) or a fixed-point multiplicative inverse (non power of 2) to avoid IDIV (see div_by_13 in the above godbolt link).
gcc -Os (optimize for size) does use IDIV for non-power-of-2 division,
unfortunately even in cases where the multiplicative inverse code is only slightly larger but much faster.
Helping the compiler
(summary for this case: use uint64_t n)
First of all, it's only interesting to look at optimized compiler output. (-O3).
-O0 speed is basically meaningless.
Look at your asm output (on Godbolt, or see How to remove "noise" from GCC/clang assembly output?). When the compiler doesn't make optimal code in the first place: Writing your C/C++ source in a way that guides the compiler into making better code is usually the best approach. You have to know asm, and know what's efficient, but you apply this knowledge indirectly. Compilers are also a good source of ideas: sometimes clang will do something cool, and you can hand-hold gcc into doing the same thing: see this answer and what I did with the non-unrolled loop in #Veedrac's code below.)
This approach is portable, and in 20 years some future compiler can compile it to whatever is efficient on future hardware (x86 or not), maybe using new ISA extension or auto-vectorizing. Hand-written x86-64 asm from 15 years ago would usually not be optimally tuned for Skylake. e.g. compare&branch macro-fusion didn't exist back then. What's optimal now for hand-crafted asm for one microarchitecture might not be optimal for other current and future CPUs. Comments on #johnfound's answer discuss major differences between AMD Bulldozer and Intel Haswell, which have a big effect on this code. But in theory, g++ -O3 -march=bdver3 and g++ -O3 -march=skylake will do the right thing. (Or -march=native.) Or -mtune=... to just tune, without using instructions that other CPUs might not support.
My feeling is that guiding the compiler to asm that's good for a current CPU you care about shouldn't be a problem for future compilers. They're hopefully better than current compilers at finding ways to transform code, and can find a way that works for future CPUs. Regardless, future x86 probably won't be terrible at anything that's good on current x86, and the future compiler will avoid any asm-specific pitfalls while implementing something like the data movement from your C source, if it doesn't see something better.
Hand-written asm is a black-box for the optimizer, so constant-propagation doesn't work when inlining makes an input a compile-time constant. Other optimizations are also affected. Read https://gcc.gnu.org/wiki/DontUseInlineAsm before using asm. (And avoid MSVC-style inline asm: inputs/outputs have to go through memory which adds overhead.)
In this case: your n has a signed type, and gcc uses the SAR/SHR/ADD sequence that gives the correct rounding. (IDIV and arithmetic-shift "round" differently for negative inputs, see the SAR insn set ref manual entry). (IDK if gcc tried and failed to prove that n can't be negative, or what. Signed-overflow is undefined behaviour, so it should have been able to.)
You should have used uint64_t n, so it can just SHR. And so it's portable to systems where long is only 32-bit (e.g. x86-64 Windows).
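In source terms the suggested change is just this (a sketch: same logic as the question's function, only the type of n changes):
#include <stdint.h>

int sequence(uint64_t n) {
    int count = 1;
    while (n != 1) {
        if (n % 2 == 0)
            n /= 2;          /* unsigned, so this is a single shr */
        else
            n = 3*n + 1;
        ++count;
    }
    return count;
}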
BTW, gcc's optimized asm output looks pretty good (using unsigned long n): the inner loop it inlines into main() does this:
# from gcc5.4 -O3 plus my comments
# edx= count=1
# rax= uint64_t n
.L9: # do{
lea rcx, [rax+1+rax*2] # rcx = 3*n + 1
mov rdi, rax
shr rdi # rdi = n>>1;
test al, 1 # set flags based on n%2 (aka n&1)
mov rax, rcx
cmove rax, rdi # n= (n%2) ? 3*n+1 : n/2;
add edx, 1 # ++count;
cmp rax, 1
jne .L9 #}while(n!=1)
cmp/branch to update max and maxi, and then do the next n
The inner loop is branchless, and the critical path of the loop-carried dependency chain is:
3-component LEA (3 cycles)
cmov (2 cycles on Haswell, 1c on Broadwell or later).
Total: 5 cycles per iteration, latency bottleneck. Out-of-order execution takes care of everything else in parallel with this (in theory: I haven't tested with perf counters to see if it really runs at 5c/iter).
The FLAGS input of cmov (produced by TEST) is faster to produce than the RAX input (from LEA->MOV), so it's not on the critical path.
Similarly, the MOV->SHR that produces CMOV's RDI input is off the critical path, because it's also faster than the LEA. MOV on IvyBridge and later has zero latency (handled at register-rename time). (It still takes a uop, and a slot in the pipeline, so it's not free, just zero latency). The extra MOV in the LEA dep chain is part of the bottleneck on other CPUs.
The cmp/jne is also not part of the critical path: it's not loop-carried, because control dependencies are handled with branch prediction + speculative execution, unlike data dependencies on the critical path.
Beating the compiler
GCC did a pretty good job here. It could save one code byte by using inc edx instead of add edx, 1, because nobody cares about P4 and its false-dependencies for partial-flag-modifying instructions.
It could also save all the MOV instructions, and the TEST: SHR sets CF= the bit shifted out, so we can use cmovc instead of test / cmovz.
### Hand-optimized version of what gcc does
.L9: #do{
lea rcx, [rax+1+rax*2] # rcx = 3*n + 1
shr rax, 1 # n>>=1; CF = n&1 = n%2
cmovc rax, rcx # n= (n&1) ? 3*n+1 : n/2;
inc edx # ++count;
cmp rax, 1
jne .L9 #}while(n!=1)
See #johnfound's answer for another clever trick: remove the CMP by branching on SHR's flag result as well as using it for CMOV: zero only if n was 1 (or 0) to start with. (Fun fact: SHR with count != 1 on Nehalem or earlier causes a stall if you read the flag results. That's how they made it single-uop. The shift-by-1 special encoding is fine, though.)
Avoiding MOV doesn't help with the latency at all on Haswell (Can x86's MOV really be "free"? Why can't I reproduce this at all?). It does help significantly on CPUs like Intel pre-IvB, and AMD Bulldozer-family, where MOV is not zero-latency (and Ice Lake with updated microcode). The compiler's wasted MOV instructions do affect the critical path. BD's complex-LEA and CMOV are both lower latency (2c and 1c respectively), so it's a bigger fraction of the latency. Also, throughput bottlenecks become an issue, because it only has two integer ALU pipes. See #johnfound's answer, where he has timing results from an AMD CPU.
Even on Haswell, this version may help a bit by avoiding some occasional delays where a non-critical uop steals an execution port from one on the critical path, delaying execution by 1 cycle. (This is called a resource conflict). It also saves a register, which may help when doing multiple n values in parallel in an interleaved loop (see below).
LEA's latency depends on the addressing mode, on Intel SnB-family CPUs. 3c for 3 components ([base+idx+const], which takes two separate adds), but only 1c with 2 or fewer components (one add). Some CPUs (like Core2) do even a 3-component LEA in a single cycle, but SnB-family doesn't. Worse, Intel SnB-family standardizes latencies so there are no 2c uops, otherwise 3-component LEA would be only 2c like Bulldozer. (3-component LEA is slower on AMD as well, just not by as much).
So lea rcx, [rax + rax*2] / inc rcx is only 2c latency, faster than lea rcx, [rax + rax*2 + 1], on Intel SnB-family CPUs like Haswell. Break-even on BD, and worse on Core2. It does cost an extra uop, which normally isn't worth it to save 1c latency, but latency is the major bottleneck here and Haswell has a wide enough pipeline to handle the extra uop throughput.
Neither gcc, icc, nor clang (on godbolt) used SHR's CF output, always using an AND or TEST. Silly compilers. :P They're great pieces of complex machinery, but a clever human can often beat them on small-scale problems. (Given thousands to millions of times longer to think about it, of course! Compilers don't use exhaustive algorithms to search for every possible way to do things, because that would take too long when optimizing a lot of inlined code, which is what they do best. They also don't model the pipeline in the target microarchitecture, at least not in the same detail as IACA or other static-analysis tools; they just use some heuristics.)
Simple loop unrolling won't help; this loop bottlenecks on the latency of a loop-carried dependency chain, not on loop overhead / throughput. This means it would do well with hyperthreading (or any other kind of SMT), since the CPU has lots of time to interleave instructions from two threads. This would mean parallelizing the loop in main, but that's fine because each thread can just check a range of n values and produce a pair of integers as a result.
Interleaving by hand within a single thread might be viable, too. Maybe compute the sequence for a pair of numbers in parallel, since each one only takes a couple registers, and they can all update the same max / maxi. This creates more instruction-level parallelism.
The trick is deciding whether to wait until all the n values have reached 1 before getting another pair of starting n values, or whether to break out and get a new start point for just one that reached the end condition, without touching the registers for the other sequence. Probably it's best to keep each chain working on useful data, otherwise you'd have to conditionally increment its counter.
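As a rough sketch of what keeping two chains busy could look like in C (my assumption about the structure, not tuned or taken from the answer; it uses the simpler wait-for-both strategy, with conditional counter updates):
#include <stdint.h>

// Step two independent chains in one loop so their dependency chains can overlap.
// A chain that has reached 1 is parked at 1 and its counter stops; swapping in a
// fresh start value at that point is the refinement discussed above.
static void collatz_pair(uint64_t a, uint64_t b, unsigned *count_a, unsigned *count_b)
{
    unsigned ca = 1, cb = 1;
    while (a != 1 || b != 1) {
        ca += (a != 1);
        cb += (b != 1);
        a = (a == 1) ? 1 : (a & 1) ? 3*a + 1 : a >> 1;
        b = (b == 1) ? 1 : (b & 1) ? 3*b + 1 : b >> 1;
    }
    *count_a = ca;
    *count_b = cb;
}
Whether the compiler turns those selects into CMOVs or branches is worth checking in the asm output, of course.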
You could maybe even do this with SSE packed-compare stuff to conditionally increment the counter for vector elements where n hadn't reached 1 yet. And then to hide the even longer latency of a SIMD conditional-increment implementation, you'd need to keep more vectors of n values up in the air. Maybe only worth it with a 256b vector (4x uint64_t).
I think the best strategy to make detection of a 1 "sticky" is to mask the vector of all-ones that you add to increment the counter. So after you've seen a 1 in an element, the increment-vector will have a zero, and +=0 is a no-op.
Untested idea for manual vectorization
# starting with YMM0 = [ n_d, n_c, n_b, n_a ] (64-bit elements)
# ymm4 = _mm256_set1_epi64x(1): increment vector
# ymm5 = all-zeros: count vector
.inner_loop:
vpaddq ymm1, ymm0, ymm0 # ymm1 = 2*n
vpaddq ymm1, ymm1, ymm0 # ymm1 = 3*n
vpaddq ymm1, ymm1, set1_epi64(1) # ymm1= 3*n + 1. Maybe could do this more efficiently?
vpsllq ymm3, ymm0, 63 # shift the low bit (n&1) up to the sign bit
vpsrlq ymm0, ymm0, 1 # n /= 2
# FP blend between integer insns may cost extra bypass latency, but integer blends don't have 1 bit controlling a whole qword.
vblendvpd ymm0, ymm0, ymm1, ymm3 # variable blend controlled by the sign bit of each 64-bit element. I might have the source operands backwards, I always have to look this up.
# ymm0 = updated n in each element.
vpcmpeqq ymm1, ymm0, set1_epi64(1)
vpandn ymm4, ymm1, ymm4 # zero out elements of ymm4 where the compare was true
vpaddq ymm5, ymm5, ymm4 # count++ in elements where n has never been == 1
vptest ymm4, ymm4
jnz .inner_loop
# Fall through when all the n values have reached 1 at some point, and our increment vector is all-zero
vextracti128 xmm0, ymm5, 1
vpmaxq .... crap this doesn't exist
# Actually just delay doing a horizontal max until the very very end. But you need some way to record max and maxi.
You can and should implement this with intrinsics instead of hand-written asm.
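Here's roughly what the same idea looks like with intrinsics (my untested AVX2 sketch, not from the answer; the helper name collatz4 is made up, and it keeps the FP-domain blend and the sticky increment mask described above):
#include <immintrin.h>

// Four chains in one ymm register; a lane that reaches 1 keeps cycling through
// 4, 2, 1, but its counter is frozen because its lane of `inc` is zeroed for good.
// Compile with -mavx2.
static __m256i collatz4(__m256i n, __m256i *counts_out)
{
    const __m256i one = _mm256_set1_epi64x(1);
    __m256i inc    = one;                        // per-lane increment, goes to 0 when done
    __m256i counts = _mm256_setzero_si256();

    for (;;) {
        __m256i n3p1 = _mm256_add_epi64(_mm256_add_epi64(n, n), n);   // 3*n
        n3p1 = _mm256_add_epi64(n3p1, one);                           // 3*n + 1
        __m256i odd  = _mm256_slli_epi64(n, 63);                      // low bit -> sign bit
        __m256i half = _mm256_srli_epi64(n, 1);                       // n / 2
        // blendv selects by the sign bit of each 64-bit lane (FP domain, see caveat above)
        n = _mm256_castpd_si256(_mm256_blendv_pd(_mm256_castsi256_pd(half),
                                                 _mm256_castsi256_pd(n3p1),
                                                 _mm256_castsi256_pd(odd)));
        __m256i done = _mm256_cmpeq_epi64(n, one);     // lanes that have hit 1
        inc    = _mm256_andnot_si256(done, inc);       // sticky: finished lanes stop counting
        counts = _mm256_add_epi64(counts, inc);        // count++ where still running
        if (_mm256_testz_si256(inc, inc))              // all lanes have finished at least once
            break;
    }
    *counts_out = counts;
    return n;
}
A horizontal max over counts (plus remembering which start values produced it) would still be done in scalar code at the very end, as the asm sketch notes.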
Algorithmic / implementation improvement:
Besides just implementing the same logic with more efficient asm, look for ways to simplify the logic, or avoid redundant work. e.g. memoize to detect common endings to sequences. Or even better, look at 8 trailing bits at once (gnasher's answer)
#EOF points out that tzcnt (or bsf) could be used to do multiple n/=2 iterations in one step. That's probably better than SIMD vectorizing; no SSE or AVX instruction can do that. It's still compatible with doing multiple scalar ns in parallel in different integer registers, though.
So the loop might look like this:
goto loop_entry;        // C++ structured like the asm, for illustration only
do {
    n = n*3 + 1;
loop_entry:
    shift = _tzcnt_u64(n);
    n >>= shift;
    count += shift;
} while(n != 1);
This may do significantly fewer iterations, but variable-count shifts are slow on Intel SnB-family CPUs without BMI2. 3 uops, 2c latency. (They have an input dependency on the FLAGS because count=0 means the flags are unmodified. They handle this as a data dependency, and take multiple uops because a uop can only have 2 inputs (pre-HSW/BDW anyway)). This is the kind that people complaining about x86's crazy-CISC design are referring to. It makes x86 CPUs slower than they would be if the ISA was designed from scratch today, even in a mostly-similar way. (i.e. this is part of the "x86 tax" that costs speed / power.) SHRX/SHLX/SARX (BMI2) are a big win (1 uop / 1c latency).
It also puts tzcnt (3c on Haswell and later) on the critical path, so it significantly lengthens the total latency of the loop-carried dependency chain. It does remove any need for a CMOV, or for preparing a register holding n>>1, though. #Veedrac's answer overcomes all this by deferring the tzcnt/shift for multiple iterations, which is highly effective (see below).
We can safely use BSF or TZCNT interchangeably, because n can never be zero at that point. TZCNT's machine-code decodes as BSF on CPUs that don't support BMI1. (Meaningless prefixes are ignored, so REP BSF runs as BSF).
TZCNT performs much better than BSF on AMD CPUs that support it, so it can be a good idea to use REP BSF, even if you don't care about setting ZF if the input is zero rather than the output. Some compilers do this when you use __builtin_ctzll even with -mno-bmi.
They perform the same on Intel CPUs, so just save the byte if that's all that matters. TZCNT on Intel (pre-Skylake) still has a false dependency on the supposedly write-only output operand, just like BSF, to support the undocumented behaviour that BSF with input = 0 leaves its destination unmodified. So you need to work around that unless you're optimizing only for Skylake, and there's nothing to gain from the extra REP byte. (Intel often goes above and beyond what the x86 ISA manual requires, to avoid breaking widely-used code that depends on something it shouldn't, or that is retroactively disallowed. e.g. Windows 9x assumes no speculative prefetching of TLB entries, which was safe when the code was written, before Intel updated the TLB management rules.)
Anyway, LZCNT/TZCNT on Haswell have the same false dep as POPCNT: see this Q&A. This is why in gcc's asm output for #Veedrac's code, you see it breaking the dep chain with xor-zeroing on the register it's about to use as TZCNT's destination when it doesn't use dst=src. Since TZCNT/LZCNT/POPCNT never leave their destination undefined or unmodified, this false dependency on the output on Intel CPUs is a performance bug / limitation. Presumably it's worth some transistors / power to have them behave like other uops that go to the same execution unit. The only perf upside is interaction with another uarch limitation: they can micro-fuse a memory operand with an indexed addressing mode on Haswell, but on Skylake where Intel removed the false dep for LZCNT/TZCNT they "un-laminate" indexed addressing modes while POPCNT can still micro-fuse any addr mode.
Improvements to ideas / code from other answers:
#hidefromkgb's answer has a nice observation that you're guaranteed to be able to do one right shift after a 3n+1. You can compute this even more efficiently than just leaving out the checks between steps. The asm implementation in that answer is broken, though (it depends on OF, which is undefined after SHRD with a count > 1), and slow: ROR rdi,2 is faster than SHRD rdi,rdi,2, and using two CMOV instructions on the critical path is slower than an extra TEST that can run in parallel.
I put tidied / improved C (which guides the compiler to produce better asm), and tested+working faster asm (in comments below the C) up on Godbolt: see the link in #hidefromkgb's answer. (This answer hit the 30k char limit from the large Godbolt URLs, but shortlinks can rot and were too long for goo.gl anyway.)
Also improved the output-printing to convert to a string and make one write() instead of writing one char at a time. This minimizes impact on timing the whole program with perf stat ./collatz (to record performance counters), and I de-obfuscated some of the non-critical asm.
#Veedrac's code
I got a minor speedup from right-shifting as much as we know needs doing, and checking to continue the loop. From 7.5s for limit=1e8 down to 7.275s, on Core2Duo (Merom), with an unroll factor of 16.
code + comments on Godbolt. Don't use this version with clang; it does something silly with the defer-loop. Using a tmp counter k and then adding it to count later changes what clang does, but that slightly hurts gcc.
See discussion in comments: Veedrac's code is excellent on CPUs with BMI1 (i.e. not Celeron/Pentium)
Claiming that a C++ compiler can produce more optimal code than a competent assembly language programmer is a very bad mistake, and especially so in this case. A human can always make the code better than the compiler can, and this particular situation is a good illustration of that claim.
The timing difference you're seeing is because the assembly code in the question is very far from optimal in the inner loops.
(The below code is 32-bit, but can be easily converted to 64-bit)
For example, the sequence function can be optimized to only 5 instructions:
.seq:
inc esi ; counter
lea edx, [3*eax+1] ; edx = 3*n+1
shr eax, 1 ; eax = n/2
cmovc eax, edx ; if CF eax = edx
jnz .seq ; jmp if n<>1
The whole code looks like:
include "%lib%/freshlib.inc"
#BinaryType console, compact
options.DebugMode = 1
include "%lib%/freshlib.asm"
start:
InitializeAll
mov ecx, 999999
xor edi, edi ; max
xor ebx, ebx ; max i
.main_loop:
xor esi, esi
mov eax, ecx
.seq:
inc esi ; counter
lea edx, [3*eax+1] ; edx = 3*n+1
shr eax, 1 ; eax = n/2
cmovc eax, edx ; if CF eax = edx
jnz .seq ; jmp if n<>1
cmp edi, esi
cmovb edi, esi
cmovb ebx, ecx
dec ecx
jnz .main_loop
OutputValue "Max sequence: ", edi, 10, -1
OutputValue "Max index: ", ebx, 10, -1
FinalizeAll
stdcall TerminateAll, 0
In order to compile this code, FreshLib is needed.
In my tests (on a 1 GHz AMD A4-1200 processor), the above code is approximately four times faster than the C++ code from the question (when compiled with -O0: 430 ms vs. 1900 ms), and more than two times faster (430 ms vs. 830 ms) when the C++ code is compiled with -O3.
The output of both programs is the same: max sequence = 525 on i = 837799.
For more performance: A simple change is observing that after n = 3n+1, n will be even, so you can divide by 2 immediately. And n won't be 1, so you don't need to test for it. So you could save a few if statements and write:
while (n % 2 == 0) n /= 2;
if (n > 1) for (;;) {
    n = (3*n + 1) / 2;
    if (n % 2 == 0) {
        do n /= 2; while (n % 2 == 0);
        if (n == 1) break;
    }
}
Here's a big win: If you look at the lowest 8 bits of n, all the steps until you have divided by 2 eight times are completely determined by those eight bits. For example, if the last eight bits are 0x01 (that is, in binary your number is ???? 0000 0001), then the next steps are:
3n+1 -> ???? 0000 0100
/ 2 -> ???? ?000 0010
/ 2 -> ???? ??00 0001
3n+1 -> ???? ??00 0100
/ 2 -> ???? ???0 0010
/ 2 -> ???? ???? 0001
3n+1 -> ???? ???? 0100
/ 2 -> ???? ???? ?010
/ 2 -> ???? ???? ??01
3n+1 -> ???? ???? ??00
/ 2 -> ???? ???? ???0
/ 2 -> ???? ???? ????
So all these steps can be predicted, and 256k + 1 is replaced with 81k + 1. Something similar will happen for all combinations. So you can make a loop with a big switch statement:
k = n / 256;
m = n % 256;
switch (m) {
    case 0: n = 1 * k + 0; break;
    case 1: n = 81 * k + 1; break;
    case 2: n = 81 * k + 2; break;
    ...
    case 155: n = 729 * k + 445; break;
    ...
}
Run the loop until n ≤ 128, because at that point n could become 1 with fewer than eight divisions by 2, and doing eight or more steps at a time would make you miss the point where you reach 1 for the first time. Then continue the "normal" loop - or have a table prepared that tells you how many more steps are needed to reach 1.
PS. I strongly suspect Peter Cordes' suggestion would make it even faster. There will be no conditional branches at all except one, and that one will be predicted correctly except when the loop actually ends. So the code would be something like
static const unsigned int multipliers [256] = { ... };
static const unsigned int adders [256] = { ... };

while (n > 128) {
    size_t lastBits = n % 256;
    n = (n >> 8) * multipliers [lastBits] + adders [lastBits];
}
In practice, you would measure whether processing the last 9, 10, 11, or 12 bits of n at a time would be faster. For each additional bit, the number of entries in the table doubles, and I expect a slowdown when the tables don't fit into L1 cache anymore.
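To make the table idea concrete, here's one way the 256-entry tables could be generated (a hypothetical helper, not from the answer; it tracks the value as a*k + b, and the parity of the whole value equals the parity of b as long as a is still even):
#include <stdio.h>

int main(void)
{
    unsigned multipliers[256], adders[256];
    for (unsigned m = 0; m < 256; ++m) {
        unsigned a = 256, b = m;          // current value is a*k + b
        while ((a & 1) == 0) {            // a becomes odd after exactly eight halvings
            if (b & 1) { a *= 3; b = 3*b + 1; }   // odd:  n -> 3n + 1
            else       { a /= 2; b /= 2; }        // even: n -> n / 2
        }
        multipliers[m] = a;               // a is now 3^r
        adders[m]      = b;
    }
    // m = 1 should reproduce the worked example above: n = 81*k + 1
    printf("case 1: n = %u * k + %u\n", multipliers[1], adders[1]);
    return 0;
}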
PPS. If you need the number of operations: In each iteration we do exactly eight divisions by two, and a variable number of (3n + 1) operations, so an obvious method to count the operations would be another array. But we can actually calculate the number of steps (based on number of iterations of the loop).
We could redefine the problem slightly: replace n with (3n + 1) / 2 if odd, and replace n with n / 2 if even. Then every iteration will do exactly 8 steps, but you could consider that cheating :-) So assume there were r operations n <- 3n+1 and s operations n <- n/2. The result will be almost exactly n' = n * 3^r / 2^s, because n <- 3n+1 means n <- 3n * (1 + 1/3n). Taking the logarithm, we find r = (s + log2 (n' / n)) / log2 (3).
If we do the loop until n ≤ 1,000,000 and have a precomputed table how many iterations are needed from any start point n ≤ 1,000,000 then calculating r as above, rounded to the nearest integer, will give the right result unless s is truly large.
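A quick numerical sanity check of that formula (illustrative C only; the start value 27 is arbitrary, and you may need to link with -lm):
#include <math.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t n0 = 27, n = n0;
    unsigned r = 0, s = 0;                 // counts of 3n+1 and n/2 steps
    while (n != 1) {
        if (n & 1) { n = 3*n + 1; ++r; }
        else       { n >>= 1;     ++s; }
    }
    // r = (s + log2(n'/n)) / log2(3), with n' = 1 here
    double estimate = (s + log2((double)n / (double)n0)) / log2(3.0);
    printf("r = %u, estimate = %.2f\n", r, estimate);   // estimate should round to r
    return 0;
}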
On a rather unrelated note: more performance hacks!
[the first «conjecture» has been finally debunked by #ShreevatsaR; removed]
When traversing the sequence, we can only get 3 possible cases in the 2-neighborhood of the current element N (shown first):
[even] [odd]
[odd] [even]
[even] [even]
To leap past these 2 elements means to compute (N >> 1) + N + 1, ((N << 1) + N + 1) >> 1 and N >> 2, respectively.
Let's prove that for both cases (1) and (2) it is possible to use the first formula, (N >> 1) + N + 1.
Case (1) is obvious. Case (2) implies (N & 1) == 1, so if we assume (without loss of generality) that N is 2-bit long and its bits are ba from most- to least-significant, then a = 1, and the following holds:
(N << 1) + N + 1:        (N >> 1) + N + 1:
       b10                      b1
        b1                       b
    +    1                  +    1
      ----                     ---
      bBb0                     bBb
where B = !b. Right-shifting the first result gives us exactly what we want.
Q.E.D.: (N & 1) == 1 ⇒ (N >> 1) + N + 1 == ((N << 1) + N + 1) >> 1.
As proven, we can traverse the sequence 2 elements at a time, using a single ternary operation. Another 2× time reduction.
The resulting algorithm looks like this:
#include <inttypes.h>
#include <stdio.h>

uint64_t sequence(uint64_t size, uint64_t *path) {
    uint64_t n, i, c, maxi = 0, maxc = 0;
    for (n = i = (size - 1) | 1; i > 2; n = i -= 2) {
        c = 2;
        while ((n = ((n & 3)? (n >> 1) + n + 1 : (n >> 2))) > 2)
            c += 2;
        if (n == 2)
            c++;
        if (c > maxc) {
            maxi = i;
            maxc = c;
        }
    }
    *path = maxc;
    return maxi;
}

int main() {
    uint64_t maxi, maxc;
    maxi = sequence(1000000, &maxc);
    printf("%" PRIu64 ", %" PRIu64 "\n", maxi, maxc);
    return 0;
}
Here we compare n > 2 because the process may stop at 2 instead of 1 if the total length of the sequence is odd.
[EDIT:]
Let's translate this into assembly!
MOV RCX, 1000000;
DEC RCX;
AND RCX, -2;
XOR RAX, RAX;
MOV RBX, RAX;
#main:
XOR RSI, RSI;
LEA RDI, [RCX + 1];
#loop:
ADD RSI, 2;
LEA RDX, [RDI + RDI*2 + 2];
SHR RDX, 1;
SHRD RDI, RDI, 2; ror rdi,2 would do the same thing
CMOVL RDI, RDX; Note that SHRD leaves OF = undefined with count>1, and this doesn't work on all CPUs.
CMOVS RDI, RDX;
CMP RDI, 2;
JA #loop;
LEA RDX, [RSI + 1];
CMOVE RSI, RDX;
CMP RAX, RSI;
CMOVB RAX, RSI;
CMOVB RBX, RCX;
SUB RCX, 2;
JA #main;
MOV RDI, RCX;
ADD RCX, 10;
PUSH RDI;
PUSH RCX;
#itoa:
XOR RDX, RDX;
DIV RCX;
ADD RDX, '0';
PUSH RDX;
TEST RAX, RAX;
JNE #itoa;
PUSH RCX;
LEA RAX, [RBX + 1];
TEST RBX, RBX;
MOV RBX, RDI;
JNE #itoa;
POP RCX;
INC RDI;
MOV RDX, RDI;
#outp:
MOV RSI, RSP;
MOV RAX, RDI;
SYSCALL;
POP RAX;
TEST RAX, RAX;
JNE #outp;
LEA RAX, [RDI + 59];
DEC RDI;
SYSCALL;
Use these commands to compile:
nasm -f elf64 file.asm
ld -o file file.o
See the C and an improved/bugfixed version of the asm by Peter Cordes on Godbolt. (editor's note: Sorry for putting my stuff in your answer, but my answer hit the 30k char limit from Godbolt links + text!)
C++ programs are translated to assembly during the generation of machine code from the source code, so it would be wrong to say assembly is inherently slower than C++. Moreover, the generated binary code differs from compiler to compiler, so a smart C++ compiler may produce binary code that is more optimal and efficient than a poor assembly programmer's code.
However I believe your profiling methodology has certain flaws. The following are general guidelines for profiling:
Make sure your system is in its normal/idle state. Stop all running processes (applications) that you started or that use CPU intensively (or poll over the network).
Your data set must be large enough.
Your test must run for something more than 5-10 seconds.
Do not rely on just one sample. Perform your test N times. Collect results and calculate the mean or median of the result.
From comments:
But, this code never stops (because of integer overflow) !?! Yves Daoust
For many numbers it will not overflow.
If it does overflow for one of those unlucky initial seeds, the overflowed number will very likely converge toward 1 without another overflow.
Still, this poses an interesting question: is there some seed number that becomes cyclic because of overflow?
Any simple final converging series starts with a power-of-two value (obvious enough?).
2^64 will overflow to zero, which is an undefined infinite loop according to the algorithm (it ends only with 1), but the most optimized solution in the answers will finish, because shr rax producing ZF=1 ends its loop.
Can we produce 2^64? If the starting number is 0x5555555555555555, it's an odd number, so the next number is 3n+1 = 0xFFFFFFFFFFFFFFFF + 1 = 0. That's theoretically an undefined state of the algorithm, but the optimized answer of johnfound will recover by exiting on ZF=1. The cmp rax,1 of Peter Cordes will end in an infinite loop (QED variant 1, "cheapo" through the undefined 0 number).
How about some more complex number, which will create cycle without 0?
Frankly, I'm not sure; my math theory is too hazy to have any serious idea of how to deal with it rigorously. But intuitively I would say the series will converge to 1 for every number > 0, as the 3n+1 formula will sooner or later turn every non-2 prime factor of the original (or an intermediate) number into some power of 2. So we don't need to worry about an infinite loop for the original series; only overflow can hamper us.
So I just put a few numbers into a spreadsheet and took a look at 8-bit truncated numbers.
There are three values overflowing to 0: 227, 170 and 85 (85 going directly to 0, the other two progressing toward 85).
But there's no value creating a cyclic overflow seed.
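That check is easy to reproduce in a few lines of C (my quick brute force, assuming 8-bit wraparound and a step cap so a hypothetical cycle would simply be skipped):
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    // Run the Collatz step on 8-bit truncated values and report seeds that
    // wrap around to 0 before ever reaching 1.
    for (unsigned start = 2; start < 256; ++start) {
        uint8_t n = (uint8_t)start;
        for (int step = 0; step < 1000 && n > 1; ++step)
            n = (n & 1) ? (uint8_t)(3*n + 1) : (uint8_t)(n >> 1);
        if (n == 0)
            printf("%u overflows to 0\n", start);   // prints 85, 170 and 227
    }
    return 0;
}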
Funnily enough, I checked which is the first number to suffer from 8-bit truncation, and 27 is already affected! It reaches the value 9232 in the proper non-truncated series (the first truncated value is 322, at the 12th step), and the maximum value reached for any of the 2-255 input numbers in the non-truncated way is 13120 (for 255 itself); the maximum number of steps to converge to 1 is about 128 (+-2, not sure if "1" is to be counted, etc...).
Interestingly enough (for me), the number 9232 is the maximum for many other source numbers; what's so special about it? :-O 9232 = 0x2410 ... hmmm... no idea.
Unfortunately I can't get any deep grasp of this series (why it converges, and what the implications of truncating it to k bits are), but with a cmp number,1 terminating condition it's certainly possible to put the algorithm into an infinite loop for a particular input value that ends as 0 after truncation.
But the value 27 overflowing in the 8-bit case is sort of alarming: it looks like, if you count the number of steps to reach the value 1, you will get a wrong result for the majority of numbers in the total k-bit set of integers. For 8-bit integers, 146 of the 256 numbers have their series affected by truncation (some of them may still hit the correct number of steps by accident; I'm too lazy to check).
You did not post the code generated by the compiler, so there's some guesswork here, but even without having seen it, one can say that this:
test rax, 1
jpe even
... has a 50% chance of mispredicting the branch, and that will be expensive.
The compiler almost certainly does both computations (which costs negligibly more, since the div/mod has quite long latency, so the multiply-add is "free") and follows up with a CMOV. Which, of course, has a zero percent chance of being mispredicted.
For the Collatz problem, you can get a significant boost in performance by caching the "tails". This is a time/memory trade-off. See: memoization
(https://en.wikipedia.org/wiki/Memoization). You could also look into dynamic programming solutions for other time/memory trade-offs.
Example python implementation:
import sys

inner_loop = 0

def collatz_sequence(N, cache):
    global inner_loop
    l = [ ]
    stop = False
    n = N
    tails = [ ]
    while not stop:
        inner_loop += 1
        tmp = n
        l.append(n)
        if n <= 1:
            stop = True
        elif n in cache:
            stop = True
        elif n % 2:
            n = 3*n + 1
        else:
            n = n // 2
        tails.append((tmp, len(l)))
    for key, offset in tails:
        if key not in cache:
            cache[key] = l[offset:]
    return l

def gen_sequence(l, cache):
    for elem in l:
        yield elem
        if elem in cache:
            yield from gen_sequence(cache[elem], cache)
            return  # raising StopIteration inside a generator is an error since PEP 479

if __name__ == "__main__":
    le_cache = {}
    for n in range(1, 4711, 5):
        l = collatz_sequence(n, le_cache)
        print("{}: {}".format(n, len(list(gen_sequence(l, le_cache)))))
    print("inner_loop = {}".format(inner_loop))
As a generic answer, not specifically directed at this task: In many cases, you can significantly speed up any program by making improvements at a high level. Like calculating data once instead of multiple times, avoiding unnecessary work completely, using caches in the best way, and so on. These things are much easier to do in a high level language.
Writing assembler code, it is possible to improve on what an optimising compiler does, but it is hard work. And once it's done, your code is much harder to modify, so it is much more difficult to add algorithmic improvements. Sometimes the processor has functionality that you cannot use from a high level language, inline assembly is often useful in these cases and still lets you use a high level language.
In the Euler problems, most of the time you succeed by building something, finding why it is slow, building something better, finding why it is slow, and so on and so on. That is very, very hard using assembler. A better algorithm at half the possible speed will usually beat a worse algorithm at full speed, and getting the full speed in assembler isn't trivial.
Even without looking at assembly, the most obvious reason is that /= 2 is probably optimized as >>= 1, and many processors have a very quick shift operation. But even if a processor doesn't have a shift operation, the integer division is faster than floating point division.
Edit: your mileage may vary on the "integer division is faster than floating point division" statement above. The comments below reveal that modern processors have prioritized optimizing fp division over integer division. So if someone were looking for the most likely reason for the speedup this thread's question asks about, then the compiler optimizing /= 2 as >>= 1 would be the best first place to look.
On an unrelated note, if n is odd, the expression n*3+1 will always be even. So there is no need to check. You can change that branch to
{
    n = (n*3+1) >> 1;
    count += 2;
}
So the whole statement would then be
if (n & 1)
{
    n = (n*3 + 1) >> 1;
    count += 2;
}
else
{
    n >>= 1;
    ++count;
}
The simple answer:
doing a MOV RBX, 3 and MUL RBX is expensive; computing 3*n with a single LEA (or a couple of ADDs) is much cheaper
ADD 1 is probably faster than INC here
MOV 2 and DIV is very expensive; just shift right
64-bit code is usually noticeably slower than 32-bit code and the alignment issues are more complicated; with small programs like this you have to pack them so you are doing parallel computation to have any chance of being faster than 32-bit code
If you generate the assembly listing for your C++ program, you can see how it differs from your assembly.

Re-implement modulo using bit shifts?

I'm writing some code for a very limited system where the mod operator is very slow. In my code a modulo needs to be used about 180 times per second, and I figured that removing it as much as possible would significantly increase the speed of my code; as of now one cycle of my main loop does not run in 1/60 of a second as it should. I was wondering if it was possible to re-implement the modulo using only bit shifts, as is possible with multiplication and division. So here is my code so far in C++ (if I can perform a modulo using assembly it would be even better). How can I remove the modulo without using division or multiplication?
while(input > 0)
{
    out = (out << 3) + (out << 1);
    out += input % 10;
    input = (input >> 8) + (input >> 1);
}
EDIT: Actually I realized that I need to do it way more than 180 times per second, seeing as the value of input can be a very large number, up to 40 digits.
What you can do with simple bitwise operations is taking a power-of-two modulo (divisor) of the value (dividend) by ANDing it with divisor - 1. A few examples:
unsigned int val = 123; // initial value
unsigned int rem;
rem = val & 0x3; // remainder after value is divided by 4.
// Equivalent to 'val % 4'
rem = val % 5; // remainder after value is divided by 5.
// Because 5 isn't power of two, we can't simply AND it with 5-1(=4).
Why does it work? Let's consider the bit pattern for the value 123, which is 1111011, and then the divisor 4, which has the bit pattern 00000100. As we know by now, the divisor has to be a power of two (as 4 is), and we need to decrement it by one (from 4 to 3 in decimal), which yields the bit pattern 00000011. After we bitwise-AND the original 123 and 3, the resulting bit pattern will be 00000011. That turns out to be 3 in decimal. The reason we need a power-of-two divisor is that once we decrement it by one, we get all the less significant bits set to 1 and the rest set to 0. Once we do the bitwise-AND, it 'cancels out' the more significant bits from the original value and leaves us with simply the remainder of the original value divided by the divisor.
However, applying something specific like this to arbitrary divisors is not going to work unless you know your divisors beforehand (at compile time, and even then it requires divisor-specific codepaths); resolving it at run time is not feasible, especially not in your case where performance matters.
Also there's a previous question related to the subject which probably has interesting information on the matter from different points of view.
Actually, division by constants is a well-known compiler optimization, and in fact gcc is already doing it.
This simple code snippet:
int mod(int val) {
return val % 10;
}
Generates the following code on my rather old gcc with -O3:
_mod:
push ebp
mov edx, 1717986919
mov ebp, esp
mov ecx, DWORD PTR [ebp+8]
pop ebp
mov eax, ecx
imul edx
mov eax, ecx
sar eax, 31
sar edx, 2
sub edx, eax
lea eax, [edx+edx*4]
mov edx, ecx
add eax, eax
sub edx, eax
mov eax, edx
ret
If you disregard the function epilogue/prologue, basically two muls (indeed on x86 we're lucky and can use lea for one) and some shifts and adds/subs. I know that I already explained the theory behind this optimization somewhere, so I'll see if I can find that post before explaining it yet again.
Now on modern CPUs that's certainly faster than accessing memory (even if you hit the cache), but whether it's faster for your obviously somewhat more ancient CPU is a question that can only be answered with benchmarking (and also make sure your compiler is doing that optimization, otherwise you can always just "steal" the gcc version here ;) ). Especially considering that it depends on getting the high bits of a widening multiply efficiently.
Note that this code is not size-independent; to be exact, the magic number changes (and maybe also parts of the adds/shifts), but that can be adapted.
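For reference, here is roughly what that instruction sequence computes, back-translated into C (my own sketch; it assumes arithmetic right shifts for negative signed values, which the generated code relies on anyway):
// 1717986919 = 0x66666667 is about 2^34 / 10, so the high half of the 64-bit
// product shifted right by 2 gives the quotient; the (val >> 31) term is the
// rounding fixup for negative inputs.
int mod10(int val)
{
    long long prod = (long long)val * 1717986919;  // imul: full 64-bit product
    int q = (int)(prod >> 34) - (val >> 31);       // sar edx,2 / sar eax,31 / sub
    return val - q * 10;                           // lea + add + sub: val - 10*q
}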
Doing modulo 10 with bit shifts is going to be hard and ugly, since bit shifts are inherently binary (on any machine you're going to be running on today). If you think about it, bit shifts are simply multiply or divide by 2.
But there's an obvious space-time trade you could make here: set up a table of values for out and out % 10 and look it up. Then the line becomes
out += tab[out]
and with any luck at all, that will turn out to be one 16-bit add and a store operation.
If you want to do modulo 10 and shifts, maybe you can adapt double dabble algorithm to your needs?
This algorithm is used to convert binary numbers to decimal without using modulo or division.
Every power of 16 ends in 6. If you represent the number as a sum of powers of 16 (i.e. break it into nybbles), then each term contributes to the last digit in the same way, except the one's place.
0x481A % 10 = ( 0x4 * 6 + 0x8 * 6 + 0x1 * 6 + 0xA ) % 10
Note that 6 = 5 + 1, and the 5's will cancel out if there are an even number of them. So just sum the nybbles (except the last one) and add 5 if the result is odd.
0x481A % 10 = ( 0x4 + 0x8 + 0x1 /* sum = 13 */
+ 5 /* so add 5 */ + 0xA /* and the one's place */ ) % 10
= 28 % 10
This reduces the 16-bit, 4-nybble modulo to a number at most 0xF * 4 + 5 = 65. In binary, that annoyingly still spans more than one nybble, so you would need to repeat the algorithm (although the second pass hardly counts).
But the 286 should have reasonably efficient BCD addition that you can use to perform the sum and obtain the result in one pass. (That requires converting each nybble to BCD manually; I don't know enough about the platform to say how to optimize that or whether it's problematic.)
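Here's a small C sketch of that nybble trick for a 16-bit value (just to illustrate the arithmetic; the original suggestion is aimed at assembly/BCD, and the final small reduction is left as a plain % here):
#include <stdio.h>

// v % 10: every power of 16 ends in 6 = 5 + 1, and the 5s cancel in pairs,
// so sum the upper nybbles, add 5 if that sum is odd, add the low nybble,
// then reduce the small result (a real implementation would use a tiny
// table or another pass instead of the final %).
unsigned mod10_nybbles(unsigned v)          /* v < 0x10000 */
{
    unsigned sum = ((v >> 4) & 0xF) + ((v >> 8) & 0xF) + ((v >> 12) & 0xF);
    if (sum & 1)
        sum += 5;
    sum += v & 0xF;
    return sum % 10;
}

int main(void)
{
    printf("%u\n", mod10_nybbles(0x481A));  /* 18458 % 10 = 8 */
    return 0;
}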