I played around on Godbolt to see how x86-64 gcc (6.3) compiles the following code:
#include <cstdint>

typedef __int128_t int128_t;
typedef __uint128_t uint128_t;

uint128_t mul_to_128(uint64_t x, uint64_t y) {
    return uint128_t(x) * uint128_t(y);
}

uint128_t mul(uint128_t x, uint128_t y) {
    return x * y;
}

uint128_t div(uint128_t x, uint128_t y) {
    return x / y;
}
and I got:
mul_to_128(unsigned long, unsigned long):
mov rax, rdi
mul rsi
ret
mul(unsigned __int128, unsigned __int128):
imul rsi, rdx
mov rax, rdi
imul rcx, rdi
mul rdx
add rcx, rsi
add rdx, rcx
ret
div(unsigned __int128, unsigned __int128):
sub rsp, 8
call __udivti3 //what is this???
add rsp, 8
ret
3 questions:
The 1st function (cast the 64-bit uints to 128-bit, then multiply them) is much simpler than the multiplication of two 128-bit uints (2nd function): basically just one multiplication. If you multiply two maximal 64-bit uints, the result definitely overflows a 64-bit register... How does it produce a 128-bit result with just one 64-bit-by-64-bit multiplication???
I cannot really read the second result... my guess was to break a 64-bit number into two 32-bit numbers (say, hi as the higher 4 bytes and lo as the lower 4 bytes) and assemble the result like (hi1*hi2)<<64 + (hi1*lo2)<<32 + (hi2*lo1)<<32 + (lo1*lo2). Apparently I was wrong... it uses only 3 multiplications (2 of them are even imul... signed multiplication??? why???). Can anyone tell me what gcc is thinking? And is it optimal?
I cannot even understand the assembly of the division... push the stack -> call something called __udivti3 -> then pop the stack... is __udivti3 something big? (like a table look-up?) And what stuff does gcc try to push before the call?
the godbolt link: https://godbolt.org/g/sIIaM3
You're right that multiplying two unsigned 64-bit values can produce a 128-bit result. Funny thing, hardware designers know that, too. <g> So multiplying two 64-bit values produces a 128-bit result by storing the lower half of the result in one 64-bit register and the upper half of the result in another 64-bit register. The compiler-writer knows which registers are used, and when you call mul_to_128 it will look for the results in the appropriate registers.
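As a minimal sketch (my own illustration, not part of the answer above), you can pull the two halves out of the __uint128_t result yourself; in the x86-64 SysV convention they are exactly what the single mul leaves in RAX and RDX:

#include <cstdint>

// Sketch: the widening 64x64 -> 128 multiply and where its halves end up.
void mul_to_128_split(uint64_t x, uint64_t y, uint64_t& lo, uint64_t& hi) {
    unsigned __int128 p = (unsigned __int128)x * y;
    lo = (uint64_t)p;          // low 64 bits  -> returned in RAX
    hi = (uint64_t)(p >> 64);  // high 64 bits -> returned in RDX
}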
In the second example, think of the values as a1*2^64 + a0 and b1*2^64 + b0 (that is, split each 128-bit value into two parts, the upper 64 bits and the lower 64 bits). When you multiply those you get a1*b1*2^64*2^64 + a1*b0*2^64 + a0*b1*2^64 + a0*b0. That's essentially what the assembly code is doing. The parts of the result that overflow 128 bits are ignored.
In the third example, __udivti3 is a function that does the division. It's not simple, so it doesn't get expanded inline.
The mul rsi will produce a 128-bit result in rdx:rax, as any instruction set reference will tell you.
The imul is used to get a 64-bit result. It works even for unsigned. Again, the instruction set reference says: "The two- and three-operand forms may also be used with unsigned operands because the lower half of the product is the same regardless if the operands are signed or unsigned." Other than that, yes, basically it's doing the double width equivalent of what you described. Only 3 multiplies, because the result of the 4th would not fit in the output 128 bits anyway.
__udivti3 is just a helper function, you can look at its disassembly to see what it's doing.
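For reference, the helper's prototype in libgcc is roughly the following (the name encodes "unsigned divide, TImode", i.e. 128-bit integers); it is an ordinary software division routine, not a table look-up. The sub rsp, 8 / add rsp, 8 pair just keeps the stack 16-byte aligned across the call; nothing is actually pushed.

// Roughly the libgcc prototype; the compiler simply emits a call to it.
extern "C" unsigned __int128 __udivti3(unsigned __int128 a, unsigned __int128 b);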
Related
Consider the following code:
unsigned long long div(unsigned long long a, unsigned long long b, unsigned long long c) {
    unsigned __int128 d = (unsigned __int128)a * (unsigned __int128)b;
    return d / c;
}
When compiled with x86-64 gcc 10 or clang 10, both with -O3, it emits a call to __udivti3 instead of a DIVQ instruction:
div:
mov rax, rdi
mov r8, rdx
sub rsp, 8
xor ecx, ecx
mul rsi
mov r9, rax
mov rsi, rdx
mov rdx, r8
mov rdi, r9
call __udivti3
add rsp, 8
ret
At least in my testing, the former is much slower than the (already slow) latter, hence the question: is there a way to make a modern compiler emit DIVQ for the above code?
Edit: Let's assume the quotient fits into 64-bits register.
div will fault if the quotient doesn't fit in 64 bits. Doing (a*b) / c with mul + a single div isn't safe in the general case (doesn't implement the abstract-machine semantics for every possible input), therefore a compiler can't generate asm that way for x86-64.
Even if you do give the compiler enough info to figure out that the division can't overflow (i.e. that high_half < divisor), unfortunately gcc/clang still won't ever optimize it to a single div with a non-zero high-half dividend (RDX).
You need an intrinsic or inline asm to explicitly do 128 / 64-bit => 64-bit division. e.g. Intrinsics for 128 multiplication and division has GNU C inline asm that looks right for low/high halves separately.
Unfortunately GNU C doesn't have an intrinsic for this. MSVC does, though: Unsigned 128-bit division on 64-bit machine has links.
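For illustration, a minimal GNU C inline-asm wrapper for a 128/64-bit => 64-bit division might look like this (my own sketch, not code from the linked answers; the caller must guarantee hi < d, otherwise div faults):

#include <cstdint>

// Sketch: divide the 128-bit value hi:lo by d with one div instruction.
// Quotient comes back in RAX, remainder in RDX.
static inline uint64_t div128by64(uint64_t hi, uint64_t lo, uint64_t d, uint64_t& rem) {
    uint64_t q;
    __asm__("divq %[d]"
            : "=a"(q), "=d"(rem)
            : [d] "r"(d), "a"(lo), "d"(hi)
            : "cc");
    return q;
}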
I have the following C++ function, which simply sums the three elements of the given input array.
#include <array>
using namespace std;
int sum(array<int, 3> ar) {
    int res = 0;
    for (int idx = 0; idx < ar.size(); idx++) {
        res += ar[idx];
    }
    return res;
}
This code, compiled with Clang and the compiler flag -O3 (gcc and icc produce the same code), produces the following x86-64 assembly:
sum(std::array<int, 3ul>):
mov rax, rdi
shr rax, 32
add eax, edi
add eax, esi
ret
My current interpretation of the assembly is that the following happens:
64 bits are moved from the 64-bit input register rdi into the 64-bit output register rax. This corresponds to two 32-bit ints.
shr shifts the contents of rax right by 32 bits, thus keeping only the first 32-bit int contained in rdi.
The contents of the 32-bit input register edi are added to the 32-bit output register eax.
The contents of the second 32-bit input register esi are added to eax.
eax is returned.
I am however left with some questions:
Can the computer simply switch between 32-bit and 64-bit registers as is done in the first two instructions?
Shouldn't the use of shr result in the first int being added two times, because the second int is shifted out? (Does this have to do with endianness?)
As an extra note: the compiler produces the same assembly instructions when supplied with a range-based for loop.
#include <array>
using namespace std;

int sum(array<int, 3> ar) {
    int res = 0;
    for (const auto& in : ar) {
        res += in;
    }
    return res;
}
You can find the example here: https://godbolt.org/z/s3fera7ca
The array is packed into registers for parameter passing as if it was a simple struct of 3 ints.
So, two 32-bit int elements are passed in the first argument register, and the remaining one in the second argument register.
How those first two are packed into one register may seem somewhat arbitrary, given that there is no memory involved in this example. To be clear, the registers themselves have no notion of endianness: endianness is introduced by numeric data that spans more than one memory address, not by anything in or of the registers. Registers can only be named (in machine code instructions), not addressed, so there is no concept of endianness within a register.
However, for other operations that do involve storing and loading that same structure from memory, it is efficient if that packing follows the endianness of the processor, so that is the logical choice for the designers of an ABI, who specify (by rules) where the first, second and third elements of a struct go when passed as parameters in registers.
When the processor's endianness is followed, programs can use one quad-word load or store plus one double-word load or store to copy the struct: a 64-bit operation followed by a 32-bit operation. If the processor's natural endianness weren't followed in the registers (which would actually still work), then three double-word load or store operations would be needed instead to get the array elements in the proper order from/into memory.
By following the natural endianness, machine code can mix 64-bit and 32-bit load and store operations even though the structure holds only 32-bit items.
How does edi fit into this?
edi is the first element of the array/structure. rdi >> 32 is the second, as it is packed into the upper 32 bits of rdi, while the first element is packed into the lower 32 bits of rdi. And esi is the third.
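As a rough sketch of that packing (my own illustration of the SysV parameter passing, not code from the answer), the compiled function effectively receives and unpacks this:

#include <cstdint>

// Sketch: the three ints arrive packed into two 64-bit argument registers.
int sum_unpacked(uint64_t rdi, uint64_t rsi) {
    int32_t a0 = (int32_t)(rdi & 0xFFFFFFFF); // ar[0]: low 32 bits of rdi (edi)
    int32_t a1 = (int32_t)(rdi >> 32);        // ar[1]: high 32 bits of rdi
    int32_t a2 = (int32_t)(rsi & 0xFFFFFFFF); // ar[2]: low 32 bits of rsi (esi)
    return a0 + a1 + a2;
}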
I have written the following very simple code which I am experimenting with in godbolt's compiler explorer:
#include <cstdint>
uint64_t func(uint64_t num, uint64_t den)
{
    return num / den;
}
GCC produces the following output, which I would expect:
func(unsigned long, unsigned long):
mov rax, rdi
xor edx, edx
div rsi
ret
However, Clang 13.0.0 produces the following, which even involves shifts and a jump:
func(unsigned long, unsigned long): # #func(unsigned long, unsigned long)
mov rax, rdi
mov rcx, rdi
or rcx, rsi
shr rcx, 32
je .LBB0_1
xor edx, edx
div rsi
ret
.LBB0_1:
xor edx, edx
div esi
ret
When using uint32_t, clang's output is once again "simple" and what I would expect.
It seems this might be some sort of optimization, since clang 10.0.1 produces the same output as GCC; however, I cannot understand what is happening. Why is clang producing this longer assembly?
The assembly is checking whether either num or den is 2**32 or larger, by shifting the OR of the two right by 32 bits and checking whether the result is 0.
Depending on the outcome, either a 64-bit division (div rsi) or a 32-bit division (div esi) is performed.
Presumably this code is generated because the compiler writer thinks the cost of an unnecessarily wide 64-bit division outweighs the additional checks and the potential branch.
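As a rough C++ paraphrase of what the generated code does (my own sketch, not anything the compiler emits as source):

#include <cstdint>

// Sketch: clang's branchy output is effectively this.
uint64_t func_equiv(uint64_t num, uint64_t den) {
    if (((num | den) >> 32) == 0) {
        // Both operands fit in 32 bits: the 32-bit divide is much cheaper.
        return (uint32_t)num / (uint32_t)den;
    }
    return num / den; // full 64-bit divide
}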
If I understand correctly, it just checks whether either of the operands is larger than 32 bits and uses a different div for the "fits in 32 bits" case and for the larger one.
#include <cstdint>

uint64_t hr1(const uint64_t x, const bool a, const int n) noexcept
{
    if (a) {
        return x | (a << n);
    }
    return x;
}

uint64_t hr2(const uint64_t x, const bool a, const int n)
{
    return x | ((a ? 1ull : 0) << n);
}
https://godbolt.org/z/gy_65H
hr1(unsigned long, bool, int):
mov rax, rdi
test sil, sil
jne .L4
ret
.L4:
mov ecx, edx
mov esi, 1
sal esi, cl
movsx rsi, esi
or rax, rsi
ret
hr2(unsigned long, bool, int):
mov ecx, edx
movzx esi, sil
sal rsi, cl
mov rax, rsi
or rax, rdi
ret
Why can't clang and gcc optimize the first function like the second?
The functions do not have identical behavior. In particular in the first one a will undergo integer promotion to int in a << n, so that the shift will have undefined behavior if n >= std::numeric_limits<int>::digits (typically 31).
This is not the case in the second function where a ? 1ull : 0 will result in the common type of unsigned long long, so that the shift will have well-defined behavior for all non-negative values n < std::numeric_limits<unsigned long long>::digits (typically 64) which is most likely more than std::numeric_limits<int>::digits (typically 31).
You should cast a and 1 to uint64_t in both shifts to make the code well behaved for all sensible inputs (i.e. 0 <= n < 64).
Even after fixing that, the functions do not have equal behavior. The second function will have undefined behavior if n >= 64 or n < 0, no matter what the value of a is, while the first function has well-defined behavior for a == false. The compiler must guarantee that this case returns x unmodified, no matter how large (or negative) the value of n is.
The second function therefore in principle gives the compiler more freedom to optimize since the range of valid input values is much smaller.
Of course, if the function gets inlined (likely), the compiler may use what it knows about the possible range of values in the call arguments for a and n and optimize further based on that.
This isn't the issue here, though; GCC will compile the first function to similar assembly if e.g.
uint64_t hr1(const uint64_t x, const bool a, const int n) noexcept
{
    return a ? x | (uint64_t{1} << n) : x | (uint64_t{0} << n);
}
is used (which has the same valid inputs as hr2). I don't know which of the two assemblies will perform better. I suppose you will have to benchmark that or wait for some expert on that to show up.
Both ways look over-complicated (and the first one is buggy for n>=32). To promote a bool to a uint64_t 0 or 1, just use uint64_t(a) or a C-style cast. You don't need a ? 1ull : 0.
The simple branchless way is probably good, unless you expect a to be highly predictable (e.g. usually one way, or correlated with earlier branching. Modern TAGE predictors use recent branch history to index the BHT / BTB.)
uint64_t hr2(uint64_t x, bool a, int n) {
    return x | (uint64_t(a) << n);
}
If you want to make this more complicated to avoid UB when n is out of range, write your C++ to wrap the shift count the same way x86 shift instructions do, so the compiler doesn't need any extra instructions.
#include <cstdint>
#include <limits>

uint64_t hr3(uint64_t x, bool a, int n) {
    using shiftwidth = decltype(x);
    const int mask = std::numeric_limits<shiftwidth>::digits - 1;
    // Wrap the count to the shift width to avoid UB.
    // x86 does this for free for 32 and 64-bit shifts.
    return x | (shiftwidth(a) << (n & mask));
}
Both versions compile identically for x86 (because the simple version has to work for all inputs without UB).
This compiles decently if you have BMI2 (for single-uop variable-count shifts on Intel), otherwise it's not great. (https://agner.org/optimize/ and https://uops.info/) But even then there are missed optimizations from GCC:
# GCC9.2 -O3 -march=skylake
hr3(unsigned long, bool, int):
movzx esi, sil # zero-extend the bool to 64-bit, 1 cycle latency because GCC failed to use a different register
shlx rsi, rsi, rdx # the shift
mov rax, rsi # stupid GCC didn't put the result in RAX
or rax, rdi # retval = shift | x
ret
This could have been
# hand optimized, and clang 9.0 -O3 -march=skylake
movzx eax, sil # mov-elimination works between different regs
shlx rax, rax, rdx # don't need to take advantage of copy-and-shift
or rax, rdi
ret
It turns out that clang9.0 actually does emit this efficient version with -O3 -march=skylake or znver1. (Godbolt).
This is cheap enough (3 uops) it's not worth branching for, except to break the data dependency on n in case x and a are likely to be ready earlier than n.
But without BMI2, the shift would take a mov ecx, edx, and a 3-uop (on Intel SnB-family) shl rax, cl. AMD has single-uop variable-count shifts even for the legacy versions that do write flags (except when CL=0 and they have to leave FLAGS unmodified; that's why it costs more on Intel). GCC is still dumb and zero-extends in place instead of into RAX. Clang gets it right (and takes advantage of the unofficial calling convention feature where narrow function args are sign or zero-extended to 32-bit so it can use mov instead of movzx) https://godbolt.org/z/9wrYEN
Clang compiles an if() to branchless using CMOV, so that's significantly worse than the simple version that uses uint64_t(a) << n. It's a missed optimization that it doesn't compile my hr1 the same as my hr3; the two are equivalent.
GCC actually branches and then uses mov reg, 1 / shl / or for the if version. Again it could compile it the same as hr3 if it chose to. (It can assume that a=1 implies n<=63, otherwise the if version would have shift UB.)
The missed optimization in both is failure to use bts, which implements reg |= 1<<(n&63)
Especially for gcc: after branching, it knows it's shifting a constant 1, so the tail of the function should be bts rax, rdx, which is 1 uop with 1c latency on Intel, 2 uops on AMD Zen1 / Zen2. GCC and clang do know how to use bts for the simple case of a compile-time-constant a=1, though: https://godbolt.org/z/rkhbzH
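That simple constant-1 case looks something like this (a hedged sketch of my own; the godbolt link above is what actually shows the bts output):

#include <cstdint>

// With a compile-time-constant bit value of 1, compilers recognize this pattern and emit bts.
uint64_t set_bit(uint64_t x, int n) {
    return x | (uint64_t(1) << (n & 63));
}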
There's no way that I know of to hand-hold GCC or clang into using bts otherwise, and I wouldn't recommend inline-assembly for this unless it's in the most critical inner loop of something and you're prepared to check that it doesn't hurt other optimizations, and to maintain it. i.e. just don't.
But ideally GCC / clang would do something like this when BMI2 isn't available:
# hand optimized, compilers should do this but don't.
mov rax, rdi # x
bts rdi, rdx # x | 1<<(n&63)
test sil, sil
cmovnz rax, rdi # return a ? x_with_bit_set : x;
ret
Doesn't require BMI2, but still only 4 uops on Broadwell and later. (And 5 uops on AMD Bulldozer / Zen). Critical path latencies:
x -> retval: 2 cycles (through (MOV and BTS) -> CMOV) on Broadwell and later. 3 cycles on earlier Intel (2 uop cmov) and on any AMD (2 uop BTS).
n -> retval: same as x (through BTS -> CMOV).
a -> retval: 2 cycles (through TEST -> CMOV) on Broadwell and later, and all AMD. 3 cycles on earlier Intel (2 uop cmov).
This is pretty obviously better than what clang emits for any version without -march=skylake or other BMI2, and it beats what GCC emits by an even wider margin (unless branchy turns out to be a good strategy).
One way that clang will use BTS:
If we mask the shift count for the branchy version, then clang will actually branch, and on the branch where the if body runs it implements it with bts as I described above. https://godbolt.org/z/BtT4w6
uint64_t hr1(uint64_t x, bool a, int n) noexcept
{
    if (a) {
        return x | (uint64_t(a) << (n & 63));
    }
    return x;
}
clang 9.0 -O3 (without -march=)
hr1(unsigned long, bool, int):
mov rax, rdi
test sil, sil
je .LBB0_2 # if(a) {
bts rax, rdx # x |= 1<<(n&63)
.LBB0_2: # }
ret
So if branchy is good for your use-case, then this way of writing it compiles well with clang.
These stand-alone versions might end up different after inlining into a real caller.
For example, a caller might save a MOV instruction if it can have the shift count n already in CL. Or the decision on whether to do if-conversion from an if to a branchless sequence might be different.
Or if n is a compile-time constant, that means we don't need BMI2 to save uops on the shift anymore; immediate shifts are fully efficient on all modern CPUs (single uop).
And of course if a is a compile time constant then it's either nothing to do or optimizes to a bts.
Further reading: see the performance links in https://stackoverflow.com/tags/x86/info for more about how to decide if asm is efficient by looking at it.
Assuming I have a: usize and a negative b: isize, how do I achieve the following semantics: reduce a by the absolute value of b, in the fastest manner possible?
I already thought of a - (b.abs() as usize), but I'm wondering if there is a faster way. Something with bit manipulation, perhaps?
Why do you assume this is slow? If that code is put in a function and compiled, on x86-64 linux, it generates the following:
_ZN6simple20h0f921f89f1d823aeeaaE:
mov rax, rsi
neg rax
cmovl rax, rsi
sub rdi, rax
mov rax, rdi
ret
That's assuming it doesn't get inlined... which I had to work at for a few minutes to prevent the optimiser from doing in order to get the above.
That's not to say it definitely couldn't be done faster, but I'm unconvinced it could be done faster by much.
If b is guaranteed to be negative, then you can just do a + b.
In Rust, we must first cast one of the operands to the same type as the other one, then we must use wrapping_add instead of simply using operator + as debug builds panic on overflow (an overflow occurs when using + on usize because negative numbers become very large positive numbers after the cast).
fn main() {
    let a: usize = 5;
    let b: isize = -2;
    let c: usize = a.wrapping_add(b as usize);
    println!("{}", c); // prints 3
}
With optimizations, wrapping_add compiles to a single add instruction.