Replace gcc asm function with __uint128_t equivalent [duplicate] - c++

Consider the following code:
unsigned long long div(unsigned long long a, unsigned long long b, unsigned long long c) {
    unsigned __int128 d = (unsigned __int128)a * (unsigned __int128)b;
    return d / c;
}
When compiled with x86-64 gcc 10 or clang 10, both with -O3, it emits a call to __udivti3 instead of a DIVQ instruction:
div:
mov rax, rdi
mov r8, rdx
sub rsp, 8
xor ecx, ecx
mul rsi
mov r9, rax
mov rsi, rdx
mov rdx, r8
mov rdi, r9
call __udivti3
add rsp, 8
ret
At least in my testing, the former is much slower than the (already slow) latter, hence the question: is there a way to make a modern compiler emit DIVQ for the above code?
Edit: Let's assume the quotient fits into a 64-bit register.

div will fault if the quotient doesn't fit in 64 bits. Doing (a*b) / c with mul + a single div isn't safe in the general case (it doesn't implement the abstract-machine semantics for every possible input), therefore a compiler can't generate asm that way for x86-64. For example, div(1ULL << 63, 1ULL << 63, 1) has the 128-bit quotient 2**126: __udivti3 computes it and the return statement truncates it to 64 bits, which is well defined, whereas a single div with that value in RDX:RAX would raise #DE.
Even if you do give the compiler enough info to figure out that the division can't overflow (i.e. that high_half < divisor), unfortunately gcc/clang still won't ever optimize it to a single div with a non-zero high-half dividend (RDX).
You need an intrinsic or inline asm to explicitly do 128 / 64-bit => 64-bit division. e.g. Intrinsics for 128 multiplication and division has GNU C inline asm that looks right for low/high halves separately.
Unfortunately GNU C doesn't have an intrinsic for this. MSVC does, though: Unsigned 128-bit division on 64-bit machine has links.
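For example, here is a minimal GNU C inline-asm sketch (my own, modeled on the approach in the linked answer). It assumes, as the edit above stipulates, that the quotient fits in 64 bits; otherwise div raises #DE and the program crashes:

unsigned long long div128by64(unsigned long long a, unsigned long long b,
                              unsigned long long c)
{
    unsigned __int128 d = (unsigned __int128)a * b;
    unsigned long long lo = (unsigned long long)d;
    unsigned long long hi = (unsigned long long)(d >> 64);
    unsigned long long quot, rem;
    asm ("divq %[c]"                 // divides RDX:RAX by c: quotient -> RAX, remainder -> RDX
         : "=a" (quot), "=d" (rem)
         : [c] "r" (c), "a" (lo), "d" (hi));
    return quot;
}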

Related

Clang generates strange output when dividing two integers

I have written the following very simple code which I am experimenting with in godbolt's compiler explorer:
#include <cstdint>
uint64_t func(uint64_t num, uint64_t den)
{
    return num / den;
}
GCC produces the following output, which I would expect:
func(unsigned long, unsigned long):
mov rax, rdi
xor edx, edx
div rsi
ret
However Clang 13.0.0 produces the following, involving shifts and a jump even:
func(unsigned long, unsigned long): # #func(unsigned long, unsigned long)
mov rax, rdi
mov rcx, rdi
or rcx, rsi
shr rcx, 32
je .LBB0_1
xor edx, edx
div rsi
ret
.LBB0_1:
xor edx, edx
div esi
ret
When using uint32_t, clang's output is once again "simple" and what I would expect.
It seems this might be some sort of optimization, since clang 10.0.1 produces the same output as GCC; however, I cannot understand what is happening. Why is clang producing this longer assembly?
The assembly checks whether either num or den is at least 2**32 by ORing them together, shifting right by 32 bits, and testing whether the result is 0.
Depending on the outcome, either a 64-bit division (div rsi) or a 32-bit division (div esi) is performed.
Presumably this code is generated because 32-bit div is significantly cheaper than 64-bit div on many x86-64 CPUs, so the compiler writers judged that the savings outweigh the cost of the extra check and possible branch misprediction.
If I understand correctly, it simply checks whether both operands fit in 32 bits and uses the cheaper 32-bit div when they do, falling back to the 64-bit div otherwise, as sketched below.
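Here is a C-level sketch of that check (my own illustration, not clang's actual transformation; the helper name is made up):

#include <cstdint>
uint64_t func_sketch(uint64_t num, uint64_t den)
{
    // If neither operand has bits set above bit 31, a 32-bit division suffices.
    if (((num | den) >> 32) == 0)
        return (uint32_t)num / (uint32_t)den;  // cheap 32-bit div
    return num / den;                          // full 64-bit div
}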

Why is different assembly code generated? Which is better?

#include <cstdint>
uint64_t hr1(const uint64_t x, const bool a, const int n) noexcept
{
    if (a) {
        return x | (a << n);
    }
    return x;
}
uint64_t hr2(const uint64_t x, const bool a, const int n)
{
    return x | ((a ? 1ull : 0) << n);
}
https://godbolt.org/z/gy_65H
hr1(unsigned long, bool, int):
mov rax, rdi
test sil, sil
jne .L4
ret
.L4:
mov ecx, edx
mov esi, 1
sal esi, cl
movsx rsi, esi
or rax, rsi
ret
hr2(unsigned long, bool, int):
mov ecx, edx
movzx esi, sil
sal rsi, cl
mov rax, rsi
or rax, rdi
ret
Why can't clang and gcc optimize the first function like the second?
The functions do not have identical behavior. In particular in the first one a will undergo integer promotion to int in a << n, so that the shift will have undefined behavior if n >= std::numeric_limits<int>::digits (typically 31).
This is not the case in the second function where a ? 1ull : 0 will result in the common type of unsigned long long, so that the shift will have well-defined behavior for all non-negative values n < std::numeric_limits<unsigned long long>::digits (typically 64) which is most likely more than std::numeric_limits<int>::digits (typically 31).
You should cast a and 1 to uint64_t in both shifts to make the code well behaved for all sensible inputs (i.e. 0 <= n < 64).
Even after fixing that, the functions do not have equal behavior. The second function will have undefined behavior if n >= 64 or n < 0, no matter what the value of a is, while the first function has well-defined behavior for a == false. The compiler must guarantee that this case returns x unmodified, no matter how large (or negative) the value of n is.
The second function therefore in principle gives the compiler more freedom to optimize since the range of valid input values is much smaller.
Of course, if the function gets inlined (likely), the compiler may use what it knows about the possible range of values in the call arguments for a and n and optimize further based on that.
This isn't the issue here though; GCC will compile the first function to similar assembly if e.g.
uint64_t hr1(const uint64_t x, const bool a, const int n) noexcept
{
    return a ? x | (uint64_t{1} << n) : x | (uint64_t{0} << n);
}
is used (which has the same valid inputs as hr2). I don't know which of the two assemblies will perform better. I suppose you will have to benchmark that or wait for some expert on that to show up.
Both ways look over-complicated (and the first one is buggy for n>=32). To promote a bool to a uint64_t 0 or 1, just use uint64_t(a) or a C-style cast. You don't need a ? 1ull : 0.
The simple branchless way is probably good, unless you expect a to be highly predictable (e.g. usually one way, or correlated with earlier branching. Modern TAGE predictors use recent branch history to index the BHT / BTB.)
uint64_t hr2(uint64_t x, bool a, int n) {
    return x | (uint64_t(a) << n);
}
If you want to make this more complicated to avoid UB when n is out of range, write your C++ to wrap the shift count the same way x86 shift instructions do, so the compiler doesn't need any extra instructions.
#include <limits>
uint64_t hr3(uint64_t x, bool a, int n) {
    using shiftwidth = decltype(x);
    const int mask = std::numeric_limits<shiftwidth>::digits - 1;
    // wrap the count to the shift width to avoid UB
    // x86 does this for free for 32 and 64-bit shifts.
    return x | (shiftwidth(a) << (n & mask));
}
Both versions compile identically for x86 (because the simple version has to work for all inputs without UB).
This compiles decently if you have BMI2 (for single-uop variable-count shifts on Intel), otherwise it's not great. (https://agner.org/optimize/ and https://uops.info/) But even then there are missed optimizations from GCC:
# GCC9.2 -O3 -march=skylake
hr3(unsigned long, bool, int):
movzx esi, sil # zero-extend the bool to 64-bit, 1 cycle latency because GCC failed to use a different register
shlx rsi, rsi, rdx # the shift
mov rax, rsi # stupid GCC didn't put the result in RAX
or rax, rdi # retval = shift | x
ret
This could have been
# hand optimized, and clang 9.0 -O3 -march=skylake
movzx eax, sil # mov-elimination works between different regs
shlx rax, rax, rdx # don't need to take advantage of copy-and-shift
or rax, rdi
ret
It turns out that clang9.0 actually does emit this efficient version with -O3 -march=skylake or znver1. (Godbolt).
This is cheap enough (3 uops) it's not worth branching for, except to break the data dependency on n in case x and a are likely to be ready earlier than n.
But without BMI2, the shift would take a mov ecx, edx, and a 3-uop (on Intel SnB-family) shl rax, cl. AMD has single-uop variable-count shifts even for the legacy versions that do write flags (except when CL=0 and they have to leave FLAGS unmodified; that's why it costs more on Intel). GCC is still dumb and zero-extends in place instead of into RAX. Clang gets it right (and takes advantage of the unofficial calling convention feature where narrow function args are sign or zero-extended to 32-bit so it can use mov instead of movzx) https://godbolt.org/z/9wrYEN
Clang compiles an if() to branchless code using CMOV, so that's significantly worse than the simple version that uses uint64_t(a) << n. It's a missed optimization that it doesn't compile my hr1 the same as my hr3; they should compile the same.
GCC actually branches and then uses mov reg, 1 / shl / or for the if version. Again it could compile it the same as hr3 if it chose to. (It can assume that a=1 implies n<=63, otherwise the if version would have shift UB.)
The missed optimization in both is failure to use bts, which implements reg |= 1<<(n&63).
Especially for gcc: after branching it knows it's shifting a constant 1, so the tail of the function should be bts rax, rdx, which is 1 uop with 1c latency on Intel, 2 uops on AMD Zen1 / Zen2. GCC and clang do know how to use bts for the simple case of a compile-time-constant a=1, though: https://godbolt.org/z/rkhbzH
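For reference, the pattern that does get recognized looks like this (a stand-alone example of my own; per the Godbolt link above it should compile to a single bts):

#include <cstdint>
uint64_t set_bit(uint64_t x, int n)
{
    return x | (uint64_t{1} << (n & 63));   // constant 1, masked count: bts-friendly
}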
There's no way that I know of to hand-hold GCC or clang into using bts otherwise, and I wouldn't recommend inline-assembly for this unless it's in the most critical inner loop of something and you're prepared to check that it doesn't hurt other optimizations, and to maintain it. i.e. just don't.
But ideally GCC / clang would do something like this when BMI2 isn't available:
# hand optimized, compilers should do this but don't.
mov rax, rdi # x
bts rdi, rdx # x | 1<<(n&63)
test sil, sil
cmovnz rax, rdi # return a ? x_with_bit_set : x;
ret
Doesn't require BMI2, but still only 4 uops on Broadwell and later. (And 5 uops on AMD Bulldozer / Zen). Critical path latencies:
x -> retval: 2 cycles (through (MOV and BTS) -> CMOV) on Broadwell and later. 3 cycles on earlier Intel (2 uop cmov) and on any AMD (2 uop BTS).
n -> retval: same as x (through BTS -> CMOV).
a -> retval: 2 cycles (through TEST -> CMOV) on Broadwell and later, and all AMD. 3 cycles on earlier Intel (2 uop cmov).
This is pretty obviously better than what clang emits for any version without -march=skylake or other BMI2, and better still than what GCC emits (unless branchy turns out to be a good strategy).
One way that clang will use BTS:
If we mask the shift count for the branchy version, then clang will actually branch, and on the branch where the if body runs it implements it with bts as I described above. https://godbolt.org/z/BtT4w6
uint64_t hr1(uint64_t x, bool a, int n) noexcept
{
    if (a) {
        return x | (uint64_t(a) << (n&63));
    }
    return x;
}
clang 9.0 -O3 (without -march=)
hr1(unsigned long, bool, int):
mov rax, rdi
test sil, sil
je .LBB0_2 # if(a) {
bts rax, rdx # x |= 1<<(n&63)
.LBB0_2: # }
ret
So if branchy is good for your use-case, then this way of writing it compiles well with clang.
These stand-alone versions might end up different after inlining into a real caller.
For example, a caller might save a MOV instruction if it can have the shift count n already in CL. Or the decision on whether to do if-conversion from an if to a branchless sequence might be different.
Or if n is a compile-time constant, that means we don't need BMI2 to save uops on the shift anymore; immediate shifts are fully efficient on all modern CPUs (single uop).
And of course if a is a compile time constant then it's either nothing to do or optimizes to a bts.
Further reading: see the performance links in https://stackoverflow.com/tags/x86/info for more about how to decide if asm is efficient by looking at it.

MSVC compiler generates mov ecx, ecx that looks useless [duplicate]

This question already has answers here:
Is mov %esi, %esi a no-op or not on x86-64?
Why did GCC generate mov %eax,%eax and what does it mean?
Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register?
I have some C++ code that is being compiled to the following assembly using MSVC compiler v14.24:
00007FF798252D4C vmulsd xmm1,xmm1,xmm7
00007FF798252D50 vcvttsd2si rcx,xmm1
00007FF798252D55 vmulsd xmm1,xmm7,mmword ptr [rbx+28h]
00007FF798252D5A mov ecx,ecx
00007FF798252D5C imul rdx,rcx,0BB8h
00007FF798252D63 vcvttsd2si rcx,xmm1
00007FF798252D68 mov ecx,ecx
00007FF798252D6A add rdx,rcx
00007FF798252D6D add rdx,rdx
00007FF798252D70 cmp byte ptr [r14+rdx*8+8],0
00007FF798252D76 je applyActionMovements+15Dh (07FF798252D8Dh)
As you can see, the compiler added two
mov ecx,ecx
instructions that don't make any sense to me, because they move data from and to the same register.
Is there something that I'm missing?
Here is a small Godbolt reproducer: https://godbolt.org/z/UFo2qe
int arr[4000][3000];
inline int foo(double a, double b) {
    return arr[static_cast<unsigned int>(a * 100)][static_cast<unsigned int>(b * 100)];
}
int bar(double a, double b) {
    if (foo(a, b)) {
        return 0;
    }
    return 1;
}
That's an inefficient way to zero-extend ECX into RCX. More efficient would be mov into a different register so mov-elimination could work.
Duplicates of:
Why did GCC generate mov %eax,%eax and what does it mean?
Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register?
But your specific test-case needs zero-extension for a slightly non-obvious reason:
x86 only has conversion between FP and signed integers (until AVX512). FP -> unsigned int is efficiently possible on x86-64 by doing FP -> int64_t and then taking the low 32 bits as unsigned int.
This is what this sequence is doing:
vcvttsd2si rcx,xmm1 ; double -> int64_t, unsigned int result in ECX
mov ecx,ecx ; zero-extend to promote unsigned to ptrdiff_t for indexing
add rdx,rcx ; 64-bit integer math on the zero-extended result
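In C terms, the compiler is effectively doing something like this (a sketch with made-up names, not MSVC's actual transformation):

#include <cstdint>
uint32_t double_to_uint(double d)
{
    int64_t wide = (int64_t)d;   // vcvttsd2si: FP -> signed 64-bit
    return (uint32_t)wide;       // keep the low 32 bits; when that value is then used
                                 // as a 64-bit array index it has to be zero-extended
                                 // again, which is what mov ecx,ecx does
}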

Intel DRNG giving only giving 4 bytes of data instead of 8

I am trying to implement Intel DRNG in c++.
According to its guide to generate a 64 bit unsigned long long the code should be:
int rdrand64_step (unsigned long long *rand)
{
    unsigned char ok;
    asm volatile ("rdrand %0; setc %1"
                  : "=r" (*rand), "=qm" (ok));
    return (int) ok;
}
However, the output of this function is only giving me 32 bits of data, as shown:
bd4a749d
d461c2a8
8f666eee
d1d5bcc4
c6f4a412
Any reason why this is happening?
More info: the IDE I'm using is Code::Blocks.
Use int _rdrand64_step (unsigned __int64* val) from immintrin.h instead of writing inline asm. You don't need it, and there are many reasons (including this one) to avoid it: https://gcc.gnu.org/wiki/DontUseInlineAsm
In this case, the problem is that you're probably compiling 32-bit code, so of course 64-bit rdrand is not encodeable. But the way you used inline-asm ended up giving you a 32-bit rdrand, and storing garbage from another register for the high half.
gcc -Wall -O3 -m32 -march=ivybridge (and similar for clang) produces (on Godbolt):
In function 'rdrand64_step':
<source>:7:1: warning: unsupported size for integer register
rdrand64_step:
push ebx
rdrand ecx; setc al
mov edx, DWORD PTR [esp+8] # load the pointer arg
movzx eax, al
mov DWORD PTR [edx], ecx
mov DWORD PTR [edx+4], ebx # store garbage in the high half of *rand
pop ebx
ret
I guess you called this function with a caller that happened to have ebx=0. Or else you used a different compiler that did something different. Maybe something else happens after inlining. If you looked at disassembly of what you actually compiled, you could explain exactly what's going on.
If you'd used the intrinsic, you would have gotten error: '_rdrand64_step' was not declared in this scope, because immintrin.h only declares it in 64-bit mode (and with a -march setting that implies rdrand support, or with -mrdrnd. Best option: use -march=native if you're building on the target machine).
You'd also get significantly more efficient code for a retry loop, at least with clang:
unsigned long long use_intrinsic(void) {
    unsigned long long rand;
    while(!_rdrand64_step(&rand)); // TODO: retry limit in case RNG is broken.
    return rand;
}
use_intrinsic: # #use_intrinsic
.LBB2_1: # =>This Inner Loop Header: Depth=1
rdrand rax
jae .LBB2_1
ret
That avoids setcc and then testing the result, which is of course redundant. gcc6 has syntax for returning flag results from inline asm. You can also use asm goto and put a jcc inside the asm, jumping to a return 1; label or falling through to a return 0;. (The inline-asm docs have an example of doing this: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html. See also the inline-assembly tag wiki.)
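For reference, a minimal sketch of that flag-output syntax (assuming GCC 6 or later targeting x86-64; the function name is mine, not from the question):

int rdrand64_step_flagout (unsigned long long *rand)
{
    int ok;
    asm volatile ("rdrand %0"
                  : "=r" (*rand), "=@ccc" (ok));  // "=@ccc": carry flag as a 0/1 int
    return ok;                                    // no setc needed; the compiler can
                                                  // branch on the flag directly
}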
Using your inline-asm, clang (in 64-bit mode) compiles it to:
use_asm:
.LBB1_1:
rdrand rax
setb byte ptr [rsp - 1]
cmp byte ptr [rsp - 1], 0
je .LBB1_1
ret
(clang makes bad decisions for constraints with multiple options that include memory.)
gcc7.2 and ICC17 actually end up with better code from the asm than from the intrinsic. They use cmovc to get a 0 or 1 and then test that. It's pretty dumb. But that's a gcc/ICC missed optimization that will hopefully be fixed.

Tricks compiler uses to compile basic arithmetic operations of 128-bit integer

I played on GodBolt to see x86-64 gcc(6.3) compiles the following codes:
typedef __int128_t int128_t;
typedef __uint128_t uint128_t;
uint128_t mul_to_128(uint64_t x, uint64_t y) {
    return uint128_t(x) * uint128_t(y);
}
uint128_t mul(uint128_t x, uint128_t y) {
    return x * y;
}
uint128_t div(uint128_t x, uint128_t y) {
    return x / y;
}
and I got:
mul_to_128(unsigned long, unsigned long):
mov rax, rdi
mul rsi
ret
mul(unsigned __int128, unsigned __int128):
imul rsi, rdx
mov rax, rdi
imul rcx, rdi
mul rdx
add rcx, rsi
add rdx, rcx
ret
div(unsigned __int128, unsigned __int128):
sub rsp, 8
call __udivti3 //what is this???
add rsp, 8
ret
3 questions:
The 1st function (cast 64-bit uints to 128-bit, then multiply them) is much simpler than the multiplication of two 128-bit uints (2nd function): basically just one multiplication. If you multiply two maximal 64-bit uints, the result definitely overflows a 64-bit register... How does it produce a 128-bit result with just one 64-bit-by-64-bit multiplication?
I cannot really read the second result... my guess was to break each 64-bit number into two 32-bit numbers (say, hi as the higher 4 bytes and lo as the lower 4 bytes), and assemble the result like (hi1*hi2)<<64 + (hi1*lo2)<<32 + (hi2*lo1)<<32 + (lo1*lo2). Apparently I was wrong... because it uses only 3 multiplications (2 of them are even imul... signed multiplication??? why???). Can anyone tell me what gcc is thinking? And is it optimal?
I cannot even understand the assembly of the division... push stack -> call something called __udivti3 -> pop stack... Is __udivti3 something big? (like a table lookup?) And what does gcc try to push before the call?
the godbolt link: https://godbolt.org/g/sIIaM3
You're right that multiplying two unsigned 64-bit values can produce a 128-bit result. Funny thing, hardware designers know that, too. <g> So multiplying two 64-bit values produces a 128-bit result by storing the lower half of the result in one 64-bit register and the upper half of the result in another 64-bit register. The compiler-writer knows which registers are used, and when you call mul_to_128 it will look for the results in the appropriate registers.
In the second example, think of the values as a1*2^64 + a0 and b1*2^64 + b0 (that is, split each 128-bit value into two parts, the upper 64 bits and the lower 64 bits). When you multiply those you get a1*b1*2^64*2^64 + a1*b0*2^64 + a0*b1*2^64 + a0*b0. That's essentially what the assembly code is doing. The parts of the result that overflow 128 bits are ignored.
In the third example, __udivti3 is a function that does the division. It's not simple, so it doesn't get expanded inline.
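Here is a C++ sketch of that decomposition (my own illustration, assuming unsigned __int128 is available): only the a0*b0 product needs the widening 64x64 -> 128 multiply, and the cross terms only need their low 64 bits, which is why imul works even for unsigned operands.

#include <cstdint>
unsigned __int128 mul128_sketch(unsigned __int128 x, unsigned __int128 y)
{
    uint64_t a1 = (uint64_t)(x >> 64), a0 = (uint64_t)x;
    uint64_t b1 = (uint64_t)(y >> 64), b0 = (uint64_t)y;
    unsigned __int128 lo = (unsigned __int128)a0 * b0;  // the one widening mul
    uint64_t cross = a1 * b0 + a0 * b1;                  // low halves only (imul)
    // a1*b1 would only affect bits >= 128, so it is dropped entirely.
    return lo + ((unsigned __int128)cross << 64);
}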
The mul rsi will produce a 128 bit result in rdx:rax, as any instruction set reference will tell you.
The imul is used to get a 64 bit result. It works even for unsigned. Again, the instruction set reference says: "The two- and three-operand forms may also be used with unsigned operands because the lower half of the product
is the same regardless if the operands are signed or unsigned." Other than that, yes, basically it's doing the double-width equivalent of what you described. Only 3 multiplies, because the result of the 4th would not fit in the 128-bit output anyway.
__udivti3 is just a helper function, you can look at its disassembly to see what it's doing.