Assembler division x86 strange numbers [duplicate] - c++

I've been reading about div and mul assembly operations, and I decided to see them in action by writing a simple program in C:
File division.c
#include <stdlib.h>
#include <stdio.h>
int main()
{
size_t i = 9;
size_t j = i / 5;
printf("%zu\n",j);
return 0;
}
And then generating assembly language code with:
gcc -S division.c -O0 -masm=intel
But looking at generated division.s file, it doesn't contain any div operations! Instead, it does some kind of black magic with bit shifting and magic numbers. Here's a code snippet that computes i/5:
mov rax, QWORD PTR [rbp-16] ; Move i (=9) to RAX
movabs rdx, -3689348814741910323 ; Move some magic number to RDX (?)
mul rdx ; Multiply 9 by magic number
mov rax, rdx ; Take only the upper 64 bits of the result
shr rax, 2 ; Shift these bits 2 places to the right (?)
mov QWORD PTR [rbp-8], rax ; Magically, RAX contains 9/5=1 now,
; so we can assign it to j
What's going on here? Why doesn't GCC use div at all? How does it generate this magic number and why does everything work?

Integer division is one of the slowest arithmetic operations you can perform on a modern processor, with latency up to the dozens of cycles and bad throughput. (For x86, see Agner Fog's instruction tables and microarch guide).
If you know the divisor ahead of time, you can avoid the division by replacing it with a set of other operations (multiplications, additions, and shifts) which have the equivalent effect. Even if several operations are needed, it's often still a heck of a lot faster than the integer division itself.
Implementing the C / operator this way instead of with a multi-instruction sequence involving div is just GCC's default way of doing division by constants. It doesn't require optimizing across operations and doesn't change anything even for debugging. (Using -Os for small code size does get GCC to use div, though.) Using a multiplicative inverse instead of division is like using lea instead of mul and add
As a result, you only tend to see div or idiv in the output if the divisor isn't known at compile-time.
For information on how the compiler generates these sequences, as well as code to let you generate them for yourself (almost certainly unnecessary unless you're working with a braindead compiler), see libdivide.

Dividing by 5 is the same as multiplying 1/5, which is again the same as multiplying by 4/5 and shifting right 2 bits. The value concerned is CCCCCCCCCCCCCCCD in hex, which is the binary representation of 4/5 if put after a hexadecimal point (i.e. the binary for four fifths is 0.110011001100 recurring - see below for why). I think you can take it from here! You might want to check out fixed point arithmetic (though note it's rounded to an integer at the end).
As to why, multiplication is faster than division, and when the divisor is fixed, this is a faster route.
See Reciprocal Multiplication, a tutorial for a detailed writeup about how it works, explaining in terms of fixed-point. It shows how the algorithm for finding the reciprocal works, and how to handle signed division and modulo.
Let's consider for a minute why 0.CCCCCCCC... (hex) or 0.110011001100... binary is 4/5. Divide the binary representation by 4 (shift right 2 places), and we'll get 0.001100110011... which by trivial inspection can be added the original to get 0.111111111111..., which is obviously equal to 1, the same way 0.9999999... in decimal is equal to one. Therefore, we know that x + x/4 = 1, so 5x/4 = 1, x=4/5. This is then represented as CCCCCCCCCCCCD in hex for rounding (as the binary digit beyond the last one present would be a 1).

In general multiplication is much faster than division. So if we can get away with multiplying by the reciprocal instead we can significantly speed up division by a constant
A wrinkle is that we cannot represent the reciprocal exactly (unless the division was by a power of two but in that case we can usually just convert the division to a bit shift). So to ensure correct answers we have to be careful that the error in our reciprocal does not cause errors in our final result.
-3689348814741910323 is 0xCCCCCCCCCCCCCCCD which is a value of just over 4/5 expressed in 0.64 fixed point.
When we multiply a 64 bit integer by a 0.64 fixed point number we get a 64.64 result. We truncate the value to a 64-bit integer (effectively rounding it towards zero) and then perform a further shift which divides by four and again truncates By looking at the bit level it is clear that we can treat both truncations as a single truncation.
This clearly gives us at least an approximation of division by 5 but does it give us an exact answer correctly rounded towards zero?
To get an exact answer the error needs to be small enough not to push the answer over a rounding boundary.
The exact answer to a division by 5 will always have a fractional part of 0, 1/5, 2/5, 3/5 or 4/5 . Therefore a positive error of less than 1/5 in the multiplied and shifted result will never push the result over a rounding boundary.
The error in our constant is (1/5) * 2-64. The value of i is less than 264 so the error after multiplying is less than 1/5. After the division by 4 the error is less than (1/5) * 2−2.
(1/5) * 2−2 < 1/5 so the answer will always be equal to doing an exact division and rounding towards zero.
Unfortunately this doesn't work for all divisors.
If we try to represent 4/7 as a 0.64 fixed point number with rounding away from zero we end up with an error of (6/7) * 2-64. After multiplying by an i value of just under 264 we end up with an error just under 6/7 and after dividing by four we end up with an error of just under 1.5/7 which is greater than 1/7.
So to implement divison by 7 correctly we need to multiply by a 0.65 fixed point number. We can implement that by multiplying by the lower 64 bits of our fixed point number, then adding the original number (this may overflow into the carry bit) then doing a rotate through carry.

Here is link to a document of an algorithm that produces the values and code I see with Visual Studio (in most cases) and that I assume is still used in GCC for division of a variable integer by a constant integer.
http://gmplib.org/~tege/divcnst-pldi94.pdf
In the article, a uword has N bits, a udword has 2N bits, n = numerator = dividend, d = denominator = divisor, ℓ is initially set to ceil(log2(d)), shpre is pre-shift (used before multiply) = e = number of trailing zero bits in d, shpost is post-shift (used after multiply), prec is precision = N - e = N - shpre. The goal is to optimize calculation of n/d using a pre-shift, multiply, and post-shift.
Scroll down to figure 6.2, which defines how a udword multiplier (max size is N+1 bits), is generated, but doesn't clearly explain the process. I'll explain this below.
Figure 4.2 and figure 6.2 show how the multiplier can be reduced to a N bit or less multiplier for most divisors. Equation 4.5 explains how the formula used to deal with N+1 bit multipliers in figure 4.1 and 4.2 was derived.
In the case of modern X86 and other processors, multiply time is fixed, so pre-shift doesn't help on these processors, but it still helps to reduce the multiplier from N+1 bits to N bits. I don't know if GCC or Visual Studio have eliminated pre-shift for X86 targets.
Going back to Figure 6.2. The numerator (dividend) for mlow and mhigh can be larger than a udword only when denominator (divisor) > 2^(N-1) (when ℓ == N => mlow = 2^(2N)), in this case the optimized replacement for n/d is a compare (if n>=d, q = 1, else q = 0), so no multiplier is generated. The initial values of mlow and mhigh will be N+1 bits, and two udword/uword divides can be used to produce each N+1 bit value (mlow or mhigh). Using X86 in 64 bit mode as an example:
; upper 8 bytes of dividend = 2^(ℓ) = (upper part of 2^(N+ℓ))
; lower 8 bytes of dividend for mlow = 0
; lower 8 bytes of dividend for mhigh = 2^(N+ℓ-prec) = 2^(ℓ+shpre) = 2^(ℓ+e)
dividend dq 2 dup(?) ;16 byte dividend
divisor dq 1 dup(?) ; 8 byte divisor
; ...
mov rcx,divisor
mov rdx,0
mov rax,dividend+8 ;upper 8 bytes of dividend
div rcx ;after div, rax == 1
mov rax,dividend ;lower 8 bytes of dividend
div rcx
mov rdx,1 ;rdx:rax = N+1 bit value = 65 bit value
You can test this with GCC. You're already seen how j = i/5 is handled. Take a look at how j = i/7 is handled (which should be the N+1 bit multiplier case).
On most current processors, multiply has a fixed timing, so a pre-shift is not needed. For X86, the end result is a two instruction sequence for most divisors, and a five instruction sequence for divisors like 7 (in order to emulate a N+1 bit multiplier as shown in equation 4.5 and figure 4.2 of the pdf file). Example X86-64 code:
; rbx = dividend, rax = 64 bit (or less) multiplier, rcx = post shift count
; two instruction sequence for most divisors:
mul rbx ;rdx = upper 64 bits of product
shr rdx,cl ;rdx = quotient
;
; five instruction sequence for divisors like 7
; to emulate 65 bit multiplier (rbx = lower 64 bits of multiplier)
mul rbx ;rdx = upper 64 bits of product
sub rbx,rdx ;rbx -= rdx
shr rbx,1 ;rbx >>= 1
add rdx,rbx ;rdx = upper 64 bits of corrected product
shr rdx,cl ;rdx = quotient
; ...
To explain the 5 instruction sequence, a simple 3 instruction sequence could overflow. Let u64() mean upper 64 bits (all that is needed for quotient)
mul rbx ;rdx = u64(dvnd*mplr)
add rdx,rbx ;rdx = u64(dvnd*(2^64 + mplr)), could overflow
shr rdx,cl
To handle this case, cl = post_shift-1. rax = multiplier - 2^64, rbx = dividend. u64() is upper 64 bits. Note that rax = rax<<1 - rax. Quotient is:
u64( ( rbx * (2^64 + rax) )>>(cl+1) )
u64( ( rbx * (2^64 + rax<<1 - rax) )>>(cl+1) )
u64( ( (rbx * 2^64) + (rbx * rax)<<1 - (rbx * rax) )>>(cl+1) )
u64( ( (rbx * 2^64) - (rbx * rax) + (rbx * rax)<<1 )>>(cl+1) )
u64( ( ((rbx * 2^64) - (rbx * rax))>>1) + (rbx*rax) )>>(cl ) )
mul rbx ; (rbx*rax)
sub rbx,rdx ; (rbx*2^64)-(rbx*rax)
shr rbx,1 ;( (rbx*2^64)-(rbx*rax))>>1
add rdx,rbx ;( ((rbx*2^64)-(rbx*rax))>>1)+(rbx*rax)
shr rdx,cl ;((((rbx*2^64)-(rbx*rax))>>1)+(rbx*rax))>>cl

I will answer from a slightly different angle: Because it is allowed to do it.
C and C++ are defined against an abstract machine. The compiler transforms this program in terms of the abstract machine to concrete machine following the as-if rule.
The compiler is allowed to make ANY changes as long as it doesn't change the observable behaviour as specified by the abstract machine. There is no reasonable expectation that the compiler will transform your code in the most straightforward way possible (even when a lot of C programmer assume that). Usually, it does this because the compiler wants to optimize the performance compared to the straightforward approach (as discussed in the other answers at length).
If under any circumstances the compiler "optimizes" a correct program to something that has a different observable behaviour, that is a compiler bug.
Any undefined behaviour in our code (signed integer overflow is a classical example) and this contract is void.

Related

How does a C/C++ compiler optimize division by non-powers-of-two? [duplicate]

I've been reading about div and mul assembly operations, and I decided to see them in action by writing a simple program in C:
File division.c
#include <stdlib.h>
#include <stdio.h>
int main()
{
size_t i = 9;
size_t j = i / 5;
printf("%zu\n",j);
return 0;
}
And then generating assembly language code with:
gcc -S division.c -O0 -masm=intel
But looking at generated division.s file, it doesn't contain any div operations! Instead, it does some kind of black magic with bit shifting and magic numbers. Here's a code snippet that computes i/5:
mov rax, QWORD PTR [rbp-16] ; Move i (=9) to RAX
movabs rdx, -3689348814741910323 ; Move some magic number to RDX (?)
mul rdx ; Multiply 9 by magic number
mov rax, rdx ; Take only the upper 64 bits of the result
shr rax, 2 ; Shift these bits 2 places to the right (?)
mov QWORD PTR [rbp-8], rax ; Magically, RAX contains 9/5=1 now,
; so we can assign it to j
What's going on here? Why doesn't GCC use div at all? How does it generate this magic number and why does everything work?
Integer division is one of the slowest arithmetic operations you can perform on a modern processor, with latency up to the dozens of cycles and bad throughput. (For x86, see Agner Fog's instruction tables and microarch guide).
If you know the divisor ahead of time, you can avoid the division by replacing it with a set of other operations (multiplications, additions, and shifts) which have the equivalent effect. Even if several operations are needed, it's often still a heck of a lot faster than the integer division itself.
Implementing the C / operator this way instead of with a multi-instruction sequence involving div is just GCC's default way of doing division by constants. It doesn't require optimizing across operations and doesn't change anything even for debugging. (Using -Os for small code size does get GCC to use div, though.) Using a multiplicative inverse instead of division is like using lea instead of mul and add
As a result, you only tend to see div or idiv in the output if the divisor isn't known at compile-time.
For information on how the compiler generates these sequences, as well as code to let you generate them for yourself (almost certainly unnecessary unless you're working with a braindead compiler), see libdivide.
Dividing by 5 is the same as multiplying 1/5, which is again the same as multiplying by 4/5 and shifting right 2 bits. The value concerned is CCCCCCCCCCCCCCCD in hex, which is the binary representation of 4/5 if put after a hexadecimal point (i.e. the binary for four fifths is 0.110011001100 recurring - see below for why). I think you can take it from here! You might want to check out fixed point arithmetic (though note it's rounded to an integer at the end).
As to why, multiplication is faster than division, and when the divisor is fixed, this is a faster route.
See Reciprocal Multiplication, a tutorial for a detailed writeup about how it works, explaining in terms of fixed-point. It shows how the algorithm for finding the reciprocal works, and how to handle signed division and modulo.
Let's consider for a minute why 0.CCCCCCCC... (hex) or 0.110011001100... binary is 4/5. Divide the binary representation by 4 (shift right 2 places), and we'll get 0.001100110011... which by trivial inspection can be added the original to get 0.111111111111..., which is obviously equal to 1, the same way 0.9999999... in decimal is equal to one. Therefore, we know that x + x/4 = 1, so 5x/4 = 1, x=4/5. This is then represented as CCCCCCCCCCCCD in hex for rounding (as the binary digit beyond the last one present would be a 1).
In general multiplication is much faster than division. So if we can get away with multiplying by the reciprocal instead we can significantly speed up division by a constant
A wrinkle is that we cannot represent the reciprocal exactly (unless the division was by a power of two but in that case we can usually just convert the division to a bit shift). So to ensure correct answers we have to be careful that the error in our reciprocal does not cause errors in our final result.
-3689348814741910323 is 0xCCCCCCCCCCCCCCCD which is a value of just over 4/5 expressed in 0.64 fixed point.
When we multiply a 64 bit integer by a 0.64 fixed point number we get a 64.64 result. We truncate the value to a 64-bit integer (effectively rounding it towards zero) and then perform a further shift which divides by four and again truncates By looking at the bit level it is clear that we can treat both truncations as a single truncation.
This clearly gives us at least an approximation of division by 5 but does it give us an exact answer correctly rounded towards zero?
To get an exact answer the error needs to be small enough not to push the answer over a rounding boundary.
The exact answer to a division by 5 will always have a fractional part of 0, 1/5, 2/5, 3/5 or 4/5 . Therefore a positive error of less than 1/5 in the multiplied and shifted result will never push the result over a rounding boundary.
The error in our constant is (1/5) * 2-64. The value of i is less than 264 so the error after multiplying is less than 1/5. After the division by 4 the error is less than (1/5) * 2−2.
(1/5) * 2−2 < 1/5 so the answer will always be equal to doing an exact division and rounding towards zero.
Unfortunately this doesn't work for all divisors.
If we try to represent 4/7 as a 0.64 fixed point number with rounding away from zero we end up with an error of (6/7) * 2-64. After multiplying by an i value of just under 264 we end up with an error just under 6/7 and after dividing by four we end up with an error of just under 1.5/7 which is greater than 1/7.
So to implement divison by 7 correctly we need to multiply by a 0.65 fixed point number. We can implement that by multiplying by the lower 64 bits of our fixed point number, then adding the original number (this may overflow into the carry bit) then doing a rotate through carry.
Here is link to a document of an algorithm that produces the values and code I see with Visual Studio (in most cases) and that I assume is still used in GCC for division of a variable integer by a constant integer.
http://gmplib.org/~tege/divcnst-pldi94.pdf
In the article, a uword has N bits, a udword has 2N bits, n = numerator = dividend, d = denominator = divisor, ℓ is initially set to ceil(log2(d)), shpre is pre-shift (used before multiply) = e = number of trailing zero bits in d, shpost is post-shift (used after multiply), prec is precision = N - e = N - shpre. The goal is to optimize calculation of n/d using a pre-shift, multiply, and post-shift.
Scroll down to figure 6.2, which defines how a udword multiplier (max size is N+1 bits), is generated, but doesn't clearly explain the process. I'll explain this below.
Figure 4.2 and figure 6.2 show how the multiplier can be reduced to a N bit or less multiplier for most divisors. Equation 4.5 explains how the formula used to deal with N+1 bit multipliers in figure 4.1 and 4.2 was derived.
In the case of modern X86 and other processors, multiply time is fixed, so pre-shift doesn't help on these processors, but it still helps to reduce the multiplier from N+1 bits to N bits. I don't know if GCC or Visual Studio have eliminated pre-shift for X86 targets.
Going back to Figure 6.2. The numerator (dividend) for mlow and mhigh can be larger than a udword only when denominator (divisor) > 2^(N-1) (when ℓ == N => mlow = 2^(2N)), in this case the optimized replacement for n/d is a compare (if n>=d, q = 1, else q = 0), so no multiplier is generated. The initial values of mlow and mhigh will be N+1 bits, and two udword/uword divides can be used to produce each N+1 bit value (mlow or mhigh). Using X86 in 64 bit mode as an example:
; upper 8 bytes of dividend = 2^(ℓ) = (upper part of 2^(N+ℓ))
; lower 8 bytes of dividend for mlow = 0
; lower 8 bytes of dividend for mhigh = 2^(N+ℓ-prec) = 2^(ℓ+shpre) = 2^(ℓ+e)
dividend dq 2 dup(?) ;16 byte dividend
divisor dq 1 dup(?) ; 8 byte divisor
; ...
mov rcx,divisor
mov rdx,0
mov rax,dividend+8 ;upper 8 bytes of dividend
div rcx ;after div, rax == 1
mov rax,dividend ;lower 8 bytes of dividend
div rcx
mov rdx,1 ;rdx:rax = N+1 bit value = 65 bit value
You can test this with GCC. You're already seen how j = i/5 is handled. Take a look at how j = i/7 is handled (which should be the N+1 bit multiplier case).
On most current processors, multiply has a fixed timing, so a pre-shift is not needed. For X86, the end result is a two instruction sequence for most divisors, and a five instruction sequence for divisors like 7 (in order to emulate a N+1 bit multiplier as shown in equation 4.5 and figure 4.2 of the pdf file). Example X86-64 code:
; rbx = dividend, rax = 64 bit (or less) multiplier, rcx = post shift count
; two instruction sequence for most divisors:
mul rbx ;rdx = upper 64 bits of product
shr rdx,cl ;rdx = quotient
;
; five instruction sequence for divisors like 7
; to emulate 65 bit multiplier (rbx = lower 64 bits of multiplier)
mul rbx ;rdx = upper 64 bits of product
sub rbx,rdx ;rbx -= rdx
shr rbx,1 ;rbx >>= 1
add rdx,rbx ;rdx = upper 64 bits of corrected product
shr rdx,cl ;rdx = quotient
; ...
To explain the 5 instruction sequence, a simple 3 instruction sequence could overflow. Let u64() mean upper 64 bits (all that is needed for quotient)
mul rbx ;rdx = u64(dvnd*mplr)
add rdx,rbx ;rdx = u64(dvnd*(2^64 + mplr)), could overflow
shr rdx,cl
To handle this case, cl = post_shift-1. rax = multiplier - 2^64, rbx = dividend. u64() is upper 64 bits. Note that rax = rax<<1 - rax. Quotient is:
u64( ( rbx * (2^64 + rax) )>>(cl+1) )
u64( ( rbx * (2^64 + rax<<1 - rax) )>>(cl+1) )
u64( ( (rbx * 2^64) + (rbx * rax)<<1 - (rbx * rax) )>>(cl+1) )
u64( ( (rbx * 2^64) - (rbx * rax) + (rbx * rax)<<1 )>>(cl+1) )
u64( ( ((rbx * 2^64) - (rbx * rax))>>1) + (rbx*rax) )>>(cl ) )
mul rbx ; (rbx*rax)
sub rbx,rdx ; (rbx*2^64)-(rbx*rax)
shr rbx,1 ;( (rbx*2^64)-(rbx*rax))>>1
add rdx,rbx ;( ((rbx*2^64)-(rbx*rax))>>1)+(rbx*rax)
shr rdx,cl ;((((rbx*2^64)-(rbx*rax))>>1)+(rbx*rax))>>cl
I will answer from a slightly different angle: Because it is allowed to do it.
C and C++ are defined against an abstract machine. The compiler transforms this program in terms of the abstract machine to concrete machine following the as-if rule.
The compiler is allowed to make ANY changes as long as it doesn't change the observable behaviour as specified by the abstract machine. There is no reasonable expectation that the compiler will transform your code in the most straightforward way possible (even when a lot of C programmer assume that). Usually, it does this because the compiler wants to optimize the performance compared to the straightforward approach (as discussed in the other answers at length).
If under any circumstances the compiler "optimizes" a correct program to something that has a different observable behaviour, that is a compiler bug.
Any undefined behaviour in our code (signed integer overflow is a classical example) and this contract is void.

Why does division by 3 require a rightshift (and other oddities) on x86?

I have the following C/C++ function:
unsigned div3(unsigned x) {
return x / 3;
}
When compiled using clang 10 at -O3, this results in:
div3(unsigned int):
mov ecx, edi # tmp = x
mov eax, 2863311531 # result = 3^-1
imul rax, rcx # result *= tmp
shr rax, 33 # result >>= 33
ret
What I do understand is: division by 3 is equivalent to multiplying with the multiplicative inverse 3-1 mod 232 which is 2863311531.
There are some things that I don't understand though:
Why do we need to use ecx/rcx at all? Can't we multiply rax with edi directly?
Why do we multiply in 64-bit mode? Wouldn't it be faster to multiply eax and ecx?
Why are we using imul instead of mul? I thought modular arithmetic would be all unsigned.
What's up with the 33-bit rightshift at the end? I thought we can just drop the highest 32-bits.
Edit 1
For those who don't understand what I mean by 3-1 mod 232, I am talking about the multiplicative inverse here.
For example:
// multiplying with inverse of 3:
15 * 2863311531 = 42949672965
42949672965 mod 2^32 = 5
// using fixed-point multiplication
15 * 2863311531 = 42949672965
42949672965 >> 33 = 5
// simply dividing by 3
15 / 3 = 5
So multiplying with 42949672965 is actually equivalent to dividing by 3. I assumed clang's optimization is based on modular arithmetic, when it's really based on fixed point arithmetic.
Edit 2
I have now realized that the multiplicative inverse can only be used for divisions without a remainder. For example, multiplying 1 times 3-1 is equal to 3-1, not zero. Only fixed point arithmetic has correct rounding.
Unfortunately, clang does not make any use of modular arithmetic which would just be a single imul instruction in this case, even when it could. The following function has the same compile output as above.
unsigned div3(unsigned x) {
__builtin_assume(x % 3 == 0);
return x / 3;
}
(Canonical Q&A about fixed-point multiplicative inverses for exact division that work for every possible input: Why does GCC use multiplication by a strange number in implementing integer division? - not quite a duplicate because it only covers the math, not some of the implementation details like register width and imul vs. mul.)
Can't we multiply rax with edi directly?
We can't imul rax, rdi because the calling convention allows the caller to leave garbage in the high bits of RDI; only the EDI part contains the value. This is a non-issue when inlining; writing a 32-bit register does implicitly zero-extend to the full 64-bit register, so the compiler will usually not need an extra instruction to zero-extend a 32-bit value.
(zero-extending into a different register is better because of limitations on mov-elimination, if you can't avoid it).
Taking your question even more literally, no, x86 doesn't have any multiply instructions that zero-extend one of their inputs to let you multiply a 32-bit and a 64-bit register. Both inputs must be the same width.
Why do we multiply in 64-bit mode?
(terminology: all of this code runs in 64-bit mode. You're asking why 64-bit operand-size.)
You could mul edi to multiply EAX with EDI to get a 64-bit result split across EDX:EAX, but mul edi is 3 uops on Intel CPUs, vs. most modern x86-64 CPUs having fast 64-bit imul. (Although imul r64, r64 is slower on AMD Bulldozer-family, and on some low-power CPUs.) https://uops.info/ and https://agner.org/optimize/ (instruction tables and microarch PDF)
(Fun fact: mul rdi is actually cheaper on Intel CPUs, only 2 uops. Perhaps something to do with not having to do extra splitting on the output of the integer multiply unit, like mul edi would have to split the 64-bit low half multiplier output into EDX and EAX halves, but that happens naturally for 64x64 => 128-bit mul.)
Also the part you want is in EDX so you'd need another mov eax, edx to deal with it. (Again, because we're looking at code for a stand-alone definition of the function, not after inlining into a caller.)
GCC 8.3 and earlier did use 32-bit mul instead of 64-bit imul (https://godbolt.org/z/5qj7d5). That was not crazy for -mtune=generic when Bulldozer-family and old Silvermont CPUs were more relevant, but those CPUs are farther in the past for more recent GCC, and its generic tuning choices reflect that. Unfortunately GCC also wasted a mov instruction copying EDI to EAX, making this way look even worse :/
# gcc8.3 -O3 (default -mtune=generic)
div3(unsigned int):
mov eax, edi # 1 uop, stupid wasted instruction
mov edx, -1431655765 # 1 uop (same 32-bit constant, just printed differently)
mul edx # 3 uops on Sandybridge-family
mov eax, edx # 1 uop
shr eax # 1 uop
ret
# total of 7 uops on SnB-family
Would only be 6 uops with mov eax, 0xAAAAAAAB / mul edi, but still worse than:
# gcc9.3 -O3 (default -mtune=generic)
div3(unsigned int):
mov eax, edi # 1 uop
mov edi, 2863311531 # 1 uop
imul rax, rdi # 1 uop
shr rax, 33 # 1 uop
ret
# total 4 uops, not counting ret
Unfortunately, 64-bit 0x00000000AAAAAAAB can't be represented as a 32-bit sign-extended immediate, so imul rax, rcx, 0xAAAAAAAB isn't encodeable. It would mean 0xFFFFFFFFAAAAAAAB.
Why are we using imul instead of mul? I thought modular arithmetic would be all unsigned.
It is unsigned. Signedness of the inputs only affects the high half of the result, but imul reg, reg doesn't produce the high half. Only the one-operand forms of mul and imul are full multiplies that do NxN => 2N, so only they need separate signed and unsigned versions.
Only imul has the faster and more flexible low-half-only forms. The only thing that's signed about imul reg, reg is that it sets OF based on signed overflow of the low half. It wasn't worth spending more opcodes and more transistors just to have a mul r,r whose only difference from imul r,r is the FLAGS output.
Intel's manual (https://www.felixcloutier.com/x86/imul) even points out the fact that it can be used for unsigned.
What's up with the 33-bit rightshift at the end? I thought we can just drop the highest 32-bits.
No, there's no multiplier constant that would give the exact right answer for every possible input x if you implemented it that way. The "as-if" optimization rule doesn't allow approximations, only implementations that produce the exact same observable behaviour for every input the program uses. Without knowing a value-range for x other than full range of unsigned, compilers don't have that option. (-ffast-math only applies to floating point; if you want faster approximations for integer math, code them manually like below):
See Why does GCC use multiplication by a strange number in implementing integer division? for more about the fixed-point multiplicative inverse method compilers use for exact division by compile time constants.
For an example of this not working in the general case, see my edit to an answer on Divide by 10 using bit shifts? which proposed
// Warning: INEXACT FOR LARGE INPUTS
// this fast approximation can just use the high half,
// so on 32-bit machines it avoids one shift instruction vs. exact division
int32_t div10(int32_t dividend)
{
int64_t invDivisor = 0x1999999A;
return (int32_t) ((invDivisor * dividend) >> 32);
}
Its first wrong answer (if you loop from 0 upward) is div10(1073741829) = 107374183 when 1073741829/10 is actually 107374182. (It rounded up instead of toward 0 like C integer division is supposed to.)
From your edit, I see you were actually talking about using the low half of a multiply result, which apparently works perfectly for exact multiples all the way up to UINT_MAX.
As you say, it completely fails when the division would have a remainder, e.g. 16 * 0xaaaaaaab = 0xaaaaaab0 when truncated to 32-bit, not 5.
unsigned div3_exact_only(unsigned x) {
__builtin_assume(x % 3 == 0); // or an equivalent with if() __builtin_unreachable()
return x / 3;
}
Yes, if that math works out, it would be legal and optimal for compilers to implement that with 32-bit imul. They don't look for this optimization because it's rarely a known fact. IDK if it would be worth adding compiler code to even look for the optimization, in terms of compile time, not to mention compiler maintenance cost in developer time. It's not a huge difference in runtime cost, and it's rarely going to be possible. It is nice, though.
div3_exact_only:
imul eax, edi, 0xAAAAAAAB # 1 uop, 3c latency
ret
However, it is something you can do yourself in source code, at least for known type widths like uint32_t:
uint32_t div3_exact_only(uint32_t x) {
return x * 0xaaaaaaabU;
}
What's up with the 33-bit right shift at the end? I thought we can just drop the highest 32-bits.
Instead of 3^(-1) mod 3 you have to think more about 0.3333333 where the 0 before the . is located in the upper 32 bit and the the 3333 is located in the lower 32 bit.
This fixed point operation works fine, but the result is obviously shifted to the upper part of rax, therefor the CPU must shift the result down again after the operation.
Why are we using imul instead of mul? I thought modular arithmetic would be all unsigned.
There is no MUL instruction equivalent to the IMUL instruction. The IMUL variant that is used takes two registers:
a <= a * b
There is no MUL instruction that does that. MUL instructions are more expensive because they store the result as 128 Bit in two registers.
Of course you could use the legacy instructions, but this does not change the fact that the result is stored in two registers.
If you look at my answer to the prior question:
Why does GCC use multiplication by a strange number in implementing integer division?
It contains a link to a pdf article that explains this (my answer clarifies the stuff that isn't explained well in this pdf article):
https://gmplib.org/~tege/divcnst-pldi94.pdf
Note that one extra bit of precision is needed for some divisors, such as 7, the multiplier would normally require 33 bits, and the product would normally require 65 bits, but this can be avoided by handling the 2^32 bit separately with 3 additional instructions as shown in my prior answer and below.
Take a look at the generated code if you change to
unsigned div7(unsigned x) {
return x / 7;
}
So to explain the process, let L = ceil(log2(divisor)). For the question above, L = ceil(log2(3)) == 2. The right shift count would initially be 32+L = 34.
To generate a multiplier with a sufficient number of bits, two potential multipliers are generated: mhi will be the multiplier to be used, and the shift count will be 32+L.
mhi = (2^(32+L) + 2^(L))/3 = 5726623062
mlo = (2^(32+L) )/3 = 5726623061
Then a check is made to see if the number of required bits can be reduced:
while((L > 0) && ((mhi>>1) > (mlo>>1))){
mhi = mhi>>1;
mlo = mlo>>1;
L = L-1;
}
if(mhi >= 2^32){
mhi = mhi-2^32
L = L-1;
; use 3 additional instructions for missing 2^32 bit
}
... mhi>>1 = 5726623062>>1 = 2863311531
... mlo>>1 = 5726623061>>1 = 2863311530 (mhi>>1) > (mlo>>1)
... mhi = mhi>>1 = 2863311531
... mlo = mhi>>1 = 2863311530
... L = L-1 = 1
... the next loop exits since now (mhi>>1) == (mlo>>1)
So the multiplier is mhi = 2863311531 and the shift count = 32+L = 33.
On an modern X86, multiply and shift instructions are constant time, so there's no point in reducing the multiplier (mhi) to less than 32 bits, so that while(...) above is changed to an if(...).
In the case of 7, the loop exits on the first iteration, and requires 3 extra instructions to handle the 2^32 bit, so that mhi is <= 32 bits:
L = ceil(log2(7)) = 3
mhi = (2^(32+L) + 2^(L))/7 = 4908534053
mhi = mhi-2^32 = 613566757
Let ecx = dividend, the simple approach could overflow on the add:
mov eax, 613566757 ; eax = mhi
mul ecx ; edx:eax = ecx*mhi
add edx, ecx ; edx:eax = ecx*(mhi + 2^32), potential overflow
shr edx, 3
To avoid the potential overflow, note that eax = eax*2 - eax:
(ecx*eax) = (ecx*eax)<<1) -(ecx*eax)
(ecx*(eax+2^32)) = (ecx*eax)<<1)+ (ecx*2^32)-(ecx*eax)
(ecx*(eax+2^32))>>3 = ((ecx*eax)<<1)+ (ecx*2^32)-(ecx*eax) )>>3
= (((ecx*eax) )+(((ecx*2^32)-(ecx*eax))>>1))>>2
so the actual code, using u32() to mean upper 32 bits:
... visual studio generated code for div7, dividend is ecx
mov eax, 613566757
mul ecx ; edx = u32( (ecx*eax) )
sub ecx, edx ; ecx = u32( ((ecx*2^32)-(ecx*eax)) )
shr ecx, 1 ; ecx = u32( (((ecx*2^32)-(ecx*eax))>>1) )
lea eax, DWORD PTR [edx+ecx] ; eax = u32( (ecx*eax)+(((ecx*2^32)-(ecx*eax))>>1) )
shr eax, 2 ; eax = u32(((ecx*eax)+(((ecx*2^32)-(ecx*eax))>>1))>>2)
If a remainder is wanted, then the following steps can be used:
mhi and L are generated based on divisor during compile time
...
quotient = (x*mhi)>>(32+L)
product = quotient*divisor
remainder = x - product
x/3 is approximately (x * (2^32/3)) / 2^32. So we can perform a single 32x32->64 bit multiplication, take the higher 32 bits, and get approximately x/3.
There is some error because we cannot multiply exactly by 2^32/3, only by this number rounded to an integer. We get more precision using x/3 ≈ (x * (2^33/3)) / 2^33. (We can't use 2^34/3 because that is > 2^32). And that turns out to be good enough to get x/3 in all cases exactly. You would prove this by checking that the formula gives a result of k if the input is 3k or 3k+2.

Bit fiddling: Negate an int, only if it's negative

I'm working through a codebase that uses the following encoding to indicate sampling with replacement: We maintain an array of ints as indicators of position and presence in the sample, where positive ints indicate a position in another array and negative ints indicate that we should not use the data point in this iteration.
Example:
data_points: [...] // Objects vector of length 5 == position.size()
std::vector<int> position: [3, 4, -3, 1, -2]
would mean that the first element in data_points should go to bucket 3, the second to bucket 4, and the fourth to bucket 1.
The negative values indicate that for this iteration we won't be using those data points, namely the 3rd and 5th data points are marked as excluded because their values in position are negative and have been set with position[i] = ~position[i].
The trick is that we might perform this multiple times, but the positions of the data points in the index should not change. So in the next iteration, if we want to exclude data point 1, and include data point 5 we could do something like:
position[0] = ~position[0] // Bit-wise complement flips the sign on ints and subtracts 1
position[4] = ~position[4]
This will change the position vector into
std::vector<int> position: [-4, 4, -3, 1, 1]
Now on to the question: At the end of each round I want to reset all signs to positive, i.e. position should become [3, 4, 3, 1, 2].
Is there a bit-fiddling trick that will allow me to do this without having an if condition for the sign of the value?
Also, because I'm new to to bit fiddling like this, why/how does taking the bit complement of a signed, positive int give us its mathematical complement? (i.e. the same value with the sign flipped)
Edit: The above is wrong, the complement of a (int) will give -(a + 1) and depends on the representation of ints as noted in the answers. So the original question of simply taking the positive value of an existing one does not apply, we actually need to perform a bitwise complement to get the original value.
Example:
position[0] = 1
position[0] = ~position[0] // Now position[0] is -2
// So if we did
position[0] = std::abs(position[0]) // This is wrong, position[0] is now 2!
// If we want to reset it to 1, we need to do another complement
position[0] = ~position[0] // Now position[0] is again 1
I recommend not attempting bit fiddling. Partly because you're dealing with signed numbers, and you would throw away portability if you fiddle. Partly because bit fiddling is not as readable as reusable functions.
A simple solution:
std::for_each(position.begin(), position.end(), [](int v) {
return std::abs(v);
});
why/how does taking the bit complement of a signed, positive int give us its mathematical complement? (i.e. the same value with the sign flipped)
It doesn't. Not in general anyway. It does so only on systems that use 1's complement representation for negative numbers, and the reason for that is simply because that's how the representation is specified. A negative number is represented by binary complement of the positive value.
Most commonly used representation these days is 2's complement which doesn't work that way.
Probably the first go-to source for bit twiddling hacks: The eponymous site
int v; // we want to find the absolute value of v
unsigned int r; // the result goes here
int const mask = v >> sizeof(int) * CHAR_BIT - 1;
r = (v + mask) ^ mask;
However, I would question the assumption that position[i] = std::abs(position[i]) has worse performance. You should definitely have profiling results that demonstrate the bit hack being superior before you check in that kind of code.
Feel free to play around with a quick benchmark (with disassembly) of both - I don't think there is a difference in speed:
gcc 8.2
clang 6.0
Also take a look at the assembly that is actually generated:
https://godbolt.org/z/Ghcw_c
Evidently, clang sees your bithack and is not impressed - it generates a conditional move in all cases (which is branch-free). gcc does as you tell it, but has two more implementations of abs in store, some making use of register semantics of the target architecture.
And if you get into (auto-)vectorization, things get even more muddy. You'll have to profile regardless.
Conclusion: Just write std::abs - your compiler will do all the bit-twiddling for you.
To answer the extended question: Again, first write the obvious and intuitive code and then check that your compiler does the right thing: Look ma, no branches!
If you let it have its fun with auto-vectorization then you probably won't understand (or be good at judging) the assembly, so you have to profile anyway. Concrete example:
https://godbolt.org/z/oaaOwJ. clang likes to also unroll the auto-vectorized loop while gcc is more conservative. In any case, it's still branch-free.
Chances are that your compiler understands the nitty-gritty details of instruction scheduling on your target platform better than you. If you don't obscure your intent with bit-magic, it'll do a good job by itself. If it's still a hot spot in your code, you can then go and see if you can hand-craft a better version (but that will probably have to be in assembly).
Use functions to signify intent. Allow the compiler's optimiser to do a better job than you ever will.
#include <cmath>
void include_index(int& val)
{
val = std::abs(val);
}
void exclude_index(int& val)
{
val = -std::abs(val);
}
bool is_included(int const& val)
{
return val > 0;
}
Sample output from godbolt's gcc8 x86_64 compiler (note that it's all bit-twiddling and there are no conditional jumps - the bane of high performance computing):
include_index(int&):
mov eax, DWORD PTR [rdi]
sar eax, 31
xor DWORD PTR [rdi], eax
sub DWORD PTR [rdi], eax
ret
exclude_index(int&):
mov eax, DWORD PTR [rdi]
mov edx, DWORD PTR [rdi]
sar eax, 31
xor edx, eax
sub eax, edx
mov DWORD PTR [rdi], eax
ret
is_included(int const&):
mov eax, DWORD PTR [rdi]
test eax, eax
setg al
ret
https://godbolt.org/z/ni6DOk
Also, because I'm new to to bit fiddling like this, why/how does taking the bit complement of a signed, positive int give us its mathematical complement? (i.e. the same value with the sign flipped)
This question deserves an answer all by itself since everyone will tell you that this is how you do it, but nobody ever tells you why.
Notice that 1 - 0 = 1 and 1 - 1 = 0. What this means is that if we do 1 - b, where b is a single bit, the result is the opposite of b, or not b (~b). Also notice how this subtraction will never produce a borrow, this is very important, since b can only be at most 1.
Also notice that subtracting a number with n bits simply means performing n 1-bit subtractions, while taking care of borrows. But our special case will never procude a borrow.
Basically we have created a mathematical definiton for the bitwise not operation. To flip a bit b, do 1 - b. If we wanna flip an n bit number, do this for every bit. But doing n subtractions in sequence is the same as subtracting two n bit numbers. So if we wanna calculate the bitwise not of an 8-bit number, a, we simply do 11111111 - a, and the same for any n bit number. Once again this works because subtracting a bit from 1 will never produce a borrow.
But what is the sequence of n "1" bits? It's the value 2^n - 1. So taking the bitwise not of a number, a, is the same as calculating 2^n - 1 - a.
Now, numbers inside a computer are stored as numbers modulo 2^n. This is because we only have a limited number of bits available. You may know that if you work with 8 bits and you do 255 + 1 you'll get 0. This is because an 8-bit number is a number modulo 2^8 = 256, and 255 + 1 = 256. 256 is obviously equal to 0 modulo 256.
But why not do the same backwards? By this logic, 0 - 1 = 255, right? This is indeed correct. Mathematically, -1 and 255 are "congruent" modulo 256. Congruent essentially means equal to, but it's used to differentiate between regular equality and equality in modular arithmetic.
Notice in fact how 0 is also congruent to 256 modulo 256. So 0 - 1 = 256 - 1 = 255. 256 is our modulus, 2^8. But if bitwise not is defined as 2^n - 1 - a, then we have ~a = 2^8 - 1 - a. You'll notice how we have that - 1 in the middle. We can remove that by adding 1.
So we now have ~a + 1 = 2^n - 1 - a + 1 = 2^n - a. But 2^n - a is the negative a modulo n. And so here we have our negative number. This is called two's complement, and it's used in pretty much every modern processor, because it's the mathematical definition of a negative number in modular arithmetic modulo 2^n, and because numbers inside a processor work as if they were in modulo 2^n the math just works out by itself. You can add and subtract without doing any extra steps. Multiplication and division do require "sign extension", but that's just a quirk of how those operations are defined, the meaning of the number doesn't change when extending the sign.
Of course with this method you lose a bit, because you now have half of the numbers being positive and the other half negative, but you can't just magically add a bit to your processor, so the new range of value you can represent is from -2^(n-1) to 2^(n-1) - 1 inclusive.
Alternatively, you can keep the number as it is and not add 1 at the end. This is known as the one's complement. Of course, this is not quite the same as the mathematical negative, so adding, subtracting, multiplying and dividing don't just work out of the box, you need extra steps to adjust the result. This is why two's complement is the de facto standard for signed arithmetic. There's also the problem that, in one's complement, both 0 and 2^n - 1 represent the same quantity, zero, while in two's complement, negative 0 is still correctly 0 (because ~0 + 1 = 2^n - 1 + 1 = 2^n = 0). I think one's complement is used in the Internet Protocol as a checksum, but other than that it has very limited purpose.
But be aware, "de facto" standard means that this is just what everyone does, but there is no rule that says that it MUST be done this way, so always make sure to check the documentation of you target architecture to make sure you're doing the right thing. Even though let's be honest, chances of finding a useful one's complement processor out there nowadays are pretty much null unless you're working on some extremely specific architecture, but still, better be safe than sorry.
"Is there a bit-fiddling trick that will allow me to do this without having an if condition for the sign of the value?"
Do you need to keep the numeric value of the number being changed from a negative value?
If not, you can use std::max to set the negative values to zero
iValue = std::max(iValue, 0); // returns zero if iValue is less than zero
If you need to keep the numeric value, but just change from negative to positive, then
iValue = std::abs(iValue); // always returns a positive value of iValue

re implement modulo using bit shifts?

I'm writing some code for a very limited system where the mod operator is very slow. In my code a modulo needs to be used about 180 times per second and I figured that removing it as much as possible would significantly increase the speed of my code, as of now one cycle of my mainloop does not run in 1/60 of a second as it should. I was wondering if it was possible to re-implement the modulo using only bit shifts like is possible with multiplication and division. So here is my code so far in c++ (if i can perform a modulo using assembly it would be even better). How can I remove the modulo without using division or multiplication?
while(input > 0)
{
out = (out << 3) + (out << 1);
out += input % 10;
input = (input >> 8) + (input >> 1);
}
EDIT: Actually I realized that I need to do it way more than 180 times per second. Seeing as the value of input can be a very large number up to 40 digits.
What you can do with simple bitwise operations is taking a power-of-two modulo(divisor) of the value(dividend) by AND'ing it with divisor-1. A few examples:
unsigned int val = 123; // initial value
unsigned int rem;
rem = val & 0x3; // remainder after value is divided by 4.
// Equivalent to 'val % 4'
rem = val % 5; // remainder after value is divided by 5.
// Because 5 isn't power of two, we can't simply AND it with 5-1(=4).
Why it works? Let's consider a bit pattern for the value 123 which is 1111011 and then the divisor 4, which has the bit pattern of 00000100. As we know by now, the divisor has to be power-of-two(as 4 is) and we need to decrement it by one(from 4 to 3 in decimal) which yields us the bit pattern 00000011. After we bitwise-AND both the original 123 and 3, the resulting bit pattern will be 00000011. That turns out to be 3 in decimal. The reason why we need a power-of-two divisor is that once we decrement them by one, we get all the less significant bits set to 1 and the rest are 0. Once we do the bitwise-AND, it 'cancels out' the more significant bits from the original value, and leaves us with simply the remainder of the original value divided by the divisor.
However, applying something specific like this for arbitrary divisors is not going to work unless you know your divisors beforehand(at compile time, and even then requires divisor-specific codepaths) - resolving it run-time is not feasible, especially not in your case where performance matters.
Also there's a previous question related to the subject which probably has interesting information on the matter from different points of view.
Actually division by constants is a well known optimization for compilers and in fact, gcc is already doing it.
This simple code snippet:
int mod(int val) {
return val % 10;
}
Generates the following code on my rather old gcc with -O3:
_mod:
push ebp
mov edx, 1717986919
mov ebp, esp
mov ecx, DWORD PTR [ebp+8]
pop ebp
mov eax, ecx
imul edx
mov eax, ecx
sar eax, 31
sar edx, 2
sub edx, eax
lea eax, [edx+edx*4]
mov edx, ecx
add eax, eax
sub edx, eax
mov eax, edx
ret
If you disregard the function epilogue/prologue, basically two muls (indeed on x86 we're lucky and can use lea for one) and some shifts and adds/subs. I know that I already explained the theory behind this optimization somewhere, so I'll see if I can find that post before explaining it yet again.
Now on modern CPUs that's certainly faster than accessing memory (even if you hit the cache), but whether it's faster for your obviously a bit more ancient CPU is a question that can only be answered with benchmarking (and also make sure your compiler is doing that optimization, otherwise you can always just "steal" the gcc version here ;) ). Especially considering that it depends on an efficient mulhs (ie higher bits of a multiply instruction) to be efficient.
Note that this code is not size independent - to be exact the magic number changes (and maybe also parts of the add/shifts), but that can be adapted.
Doing modulo 10 with bit shifts is going to be hard and ugly, since bit shifts are inherently binary (on any machine you're going to be running on today). If you think about it, bit shifts are simply multiply or divide by 2.
But there's an obvious space-time trade you could make here: set up a table of values for out and out % 10 and look it up. Then the line becomes
out += tab[out]
and with any luck at all, that will turn out to be one 16-bit add and a store operation.
If you want to do modulo 10 and shifts, maybe you can adapt double dabble algorithm to your needs?
This algorithm is used to convert binary numbers to decimal without using modulo or division.
Every power of 16 ends in 6. If you represent the number as a sum of powers of 16 (i.e. break it into nybbles), then each term contributes to the last digit in the same way, except the one's place.
0x481A % 10 = ( 0x4 * 6 + 0x8 * 6 + 0x1 * 6 + 0xA ) % 10
Note that 6 = 5 + 1, and the 5's will cancel out if there are an even number of them. So just sum the nybbles (except the last one) and add 5 if the result is odd.
0x481A % 10 = ( 0x4 + 0x8 + 0x1 /* sum = 13 */
+ 5 /* so add 5 */ + 0xA /* and the one's place */ ) % 10
= 28 % 10
This reduces the 16-bit, 4-nybble modulo to a number at most 0xF * 4 + 5 = 65. In binary, that is annoyingly still 3 nybbles so you would need to repeat the algorithm (although one of them doesn't really count).
But the 286 should have reasonably efficient BCD addition that you can use to perform the sum and obtain the result in one pass. (That requires converting each nybble to BCD manually; I don't know enough about the platform to say how to optimize that or whether it's problematic.)

By Left shifting, can a number be set to ZERO

I have a doubt in Left Shift Operator
int i = 1;
i <<= (sizeof (int) *8);
cout << i;
It prints 1.
i has been initialized to 1.
And while moving the bits till the size of the integer, it fills the LSB with 0's and as 1 crosses the limit of integer, i was expecting the output to be 0.
How and Why it is 1?
Let's say sizeof(int) is 4 on your platform. Then the expression becomes:
i = i << 32;
The standard says:
6.5.7-3
If the value of the right operand is negative or is greater than or
equal to the width of the promoted left operand, the behavior is
undefined.
As cnicutar said, your example exhibits undefined behaviour. That means that the compiler is free to do whatever the vendor seems fit, including making demons fly out your nose or just doing nothing to the value at hand.
What you can do to convince yourself, that left shifting by the number of bits will produce 0 is this:
int i = 1;
i <<= (sizeof (int) *4);
i <<= (sizeof (int) *4);
cout << i;
Expanding on the previous answer...
On the x86 platform your code would get compiled down to something like this:
; 32-bit ints:
mov cl, 32
shl dword ptr i, cl
The CPU will shift the dword in the variable i by the value contained in the cl register modulo 32. So, 32 modulo 32 yields 0. Hence, the shift doesn't really occur. And that's perfectly fine per the C standard. In fact, what the C standard says in 6.5.7-3 is because the aforementioned CPU behavior was quite common back in the day and influenced the standard.
As already mentioned by others, according to C standard the behavior of the shift is undefined.
That said, the program prints 1. A low-level explanation of why it prints 1 is as follows:
When compiling without optimizations the compiler (GCC, clang) emits the SHL instruction:
...
mov $32,%ecx
shll %cl,0x1c(%esp)
...
The Intel documentation for SHL instruction says:
SAL/SAR/SHL/SHR—Shift
The count is masked to 5 bits (or 6 bits if in 64-bit mode and REX.W is used). The count range is limited to 0 to 31 (or 63 if 64-bit mode and REX.W is used).
Masking the shift count 32 (binary 00100000) to 5 bits yields 0 (binary 00000000). Therefore the shll %cl,0x1c(%esp) instruction isn't doing any shifting and leaves the value of i unchanged.