Consider this C++ code:
#include <cstdint>
// returns a if less than b or if b is INT32_MIN
int32_t special_min(int32_t a, int32_t b)
{
    return a < b || b == INT32_MIN ? a : b;
}
GCC with -fwrapv correctly realizes that subtracting 1 from b can eliminate the special case, and it generates this code for x86-64:
lea edx, [rsi-1]
mov eax, edi
cmp edi, edx
cmovg eax, esi
ret
But without -fwrapv it generates worse code:
mov eax, esi
cmp edi, esi
jl .L4
cmp esi, -2147483648
je .L4
ret
.L4:
mov eax, edi
ret
I understand that -fwrapv is needed if I write C++ code which relies on signed overflow. But:
The above C++ code does not depend on signed overflow (it is valid standard C++).
We all know that signed overflow has a specific behavior on x86-64.
The compiler knows it is compiling for x86-64.
If I wrote "hand optimized" C++ code trying to implement that optimization, I understand -fwrapv would be required, otherwise the compiler could decide the signed overflow is UB and do whatever it wants in the case where b == INT32_MIN. But here the compiler is in control, and I don't see what stops it from using the optimization without -fwrapv. Is there some reason it isn't allowed to?
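For concreteness, a hand-written version of that transformation (my sketch, not part of the original question) might look like the following. It sidesteps signed-overflow UB by doing the wrap-around subtraction in uint32_t, and relies on the conversion back to int32_t being modular (implementation-defined before C++20, but what every x86-64 compiler does):

#include <cstdint>
// Sketch: for b != INT32_MIN, a <= b-1 is the same as a < b; for b == INT32_MIN,
// b-1 wraps to INT32_MAX, so a <= b-1 is always true, matching the special case.
int32_t special_min_by_hand(int32_t a, int32_t b)
{
    int32_t bm1 = static_cast<int32_t>(static_cast<uint32_t>(b) - 1u); // wrapping b-1
    return a <= bm1 ? a : b;
}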
This kind of missed optimization has happened before in GCC, like not fully treating signed int add as associative even though it's compiling for a 2's complement target with wrapping addition. So it optimizes better for unsigned. IIRC, the reason was something like GCC losing track of some information it had about the operations, and thus being conservative? I forget if that ever got fixed.
I can't find where I've seen this before on SO with a reply from a GCC dev about the internals; maybe it was in a GCC bug report? I think it was with something like a+b+c+d+e (not) re-associating into a tree of dependencies to shorten the critical path. But unfortunately it's still present in current GCC:
int sum(int a, int b, int c, int d, int e, int f) {
    return a+b+c+d+e+f;
    // gcc and clang make one stupid dep chain
}
int sumv2(int a, int b, int c, int d, int e, int f) {
    return (a+b)+(c+d)+(e+f);
    // clang pessimizes this back to 1 chain, GCC doesn't
}
unsigned sumu(unsigned a, unsigned b, unsigned c, unsigned d, unsigned e, unsigned f) {
    return a+b+c+d+e+f;
    // gcc and clang make one stupid dep chain
}
unsigned sumuv2(unsigned a, unsigned b, unsigned c, unsigned d, unsigned e, unsigned f) {
    return (a+b)+(c+d)+(e+f);
    // GCC and clang pessimize back to 1 chain for unsigned
}
On Godbolt for x86-64 System V at -O3, clang and gcc -fwrapv make the same asm for all 4 functions, as you'd expect.
GCC (without -fwrapv) makes the same asm for sumu as for sumuv2 (summing into r8d, the reg that held e). But GCC makes different asm for sum and sumv2, because they use signed int:
# gcc -O3 *without* -fwrapv
# The same order of operations as the C source
sum(int, int, int, int, int, int):
add edi, esi # a += b
add edi, edx # ((a+b) + c) ...
add edi, ecx # sum everything into EDI
add edi, r8d
lea eax, [rdi+r9]
ret
# also as written, the source order of operations:
sumv2(int, int, int, int, int, int):
add edi, esi # a+=b
add edx, ecx # c+=d
add r8d, r9d # e+=f
add edi, edx # a += c
lea eax, [rdi+r8] # retval = a + e
ret
So ironically GCC makes better asm when it doesn't re-associate the source. That's assuming that all 6 inputs are ready at once. If out-of-order exec of earlier code only produced the input registers 1 per cycle, the final result here would be ready only 1 cycle after the final input was ready, assuming that final input was f.
But if the last input was a or b, the result wouldn't be ready until 5 cycles later with the single chain like GCC and clang use when they can, vs. 3 cycles worst case for the tree reduction, or 2 cycles best case (if e or f were ready last).
(Update: -mtune=znver2 makes GCC re-associate into a tree, thanks @amonakov. So this is a tuning choice with a default that seems strange to me, at least for this specific problem-size. See GCC source, search for reassoc to see costs for other tuning settings; most of them are 1,1,1,1 which is insane, especially for floating-point. This might be why GCC fails to use multiple vector accumulators when unrolling FP loops, defeating the purpose.)
But anyway, this is a case of GCC only re-associating signed int with -fwrapv. So clearly it limits itself more than necessary without -fwrapv.
Related: Compiler optimizations may cause integer overflow. Is that okay? - it is of course legal, and failure to do it is a missed optimization.
GCC isn't totally hamstrung by signed int; it will auto-vectorize int sum += arr[i], and it does manage to optimize Why doesn't GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)? for signed int a.
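For reference, the kind of signed-int reduction loop GCC does auto-vectorize (my example, not from the original post) looks like this; vectorizing it is legal even without -fwrapv, because any execution of the abstract machine that avoids signed overflow gets the same answer from a wrapping vectorized sum:

int sum_array(const int *arr, int n)
{
    int sum = 0;
    for (int i = 0; i < n; ++i)
        sum += arr[i];     // GCC auto-vectorizes this signed reduction
    return sum;
}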
Related
Consider the following code:
unsigned long long div(unsigned long long a, unsigned long long b, unsigned long long c) {
    unsigned __int128 d = (unsigned __int128)a*(unsigned __int128)b;
    return d/c;
}
When compiled with x86-64 gcc 10 or clang 10, both with -O3, it emits a call to __udivti3 instead of a DIVQ instruction:
div:
mov rax, rdi
mov r8, rdx
sub rsp, 8
xor ecx, ecx
mul rsi
mov r9, rax
mov rsi, rdx
mov rdx, r8
mov rdi, r9
call __udivti3
add rsp, 8
ret
At least in my testing, the former is much slower than the (already) slow latter, hence the question: is there a way to make a modern compiler emit DIVQ for the above code?
Edit: Let's assume the quotient fits into a 64-bit register.
div will fault if the quotient doesn't fit in 64 bits. Doing (a*b) / c with mul + a single div isn't safe in the general case (doesn't implement the abstract-machine semantics for every possible input), therefore a compiler can't generate asm that way for x86-64.
Even if you do give the compiler enough info to figure out that the division can't overflow (i.e. that high_half < divisor), unfortunately gcc/clang still won't ever optimize it to a single div with a non-zero high-half dividend (RDX).
You need an intrinsic or inline asm to explicitly do 128 / 64-bit => 64-bit division. e.g. Intrinsics for 128 multiplication and division has GNU C inline asm that looks right for low/high halves separately.
Unfortunately GNU C doesn't have an intrinsic for this. MSVC does, though: Unsigned 128-bit division on 64-bit machine has links.
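For illustration, a 128/64 => 64-bit division in GNU C inline asm looks roughly like this (my sketch of the idea; the linked answers have more complete versions). The caller must guarantee high < divisor, otherwise div faults:

#include <cstdint>
static inline uint64_t div128by64(uint64_t high, uint64_t low,
                                  uint64_t divisor, uint64_t *remainder)
{
    uint64_t quot, rem;
    asm("divq %[v]"                        // RDX:RAX / divisor
        : "=a"(quot), "=d"(rem)            // quotient in RAX, remainder in RDX
        : [v] "r"(divisor), "a"(low), "d"(high));
    *remainder = rem;
    return quot;
}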
#include <cstdint>
uint64_t hr1(const uint64_t x, const bool a, const int n) noexcept
{
    if (a) {
        return x | (a << n);
    }
    return x;
}
uint64_t hr2(const uint64_t x, const bool a, const int n)
{
    return x | ((a ? 1ull : 0) << n);
}
https://godbolt.org/z/gy_65H
hr1(unsigned long, bool, int):
mov rax, rdi
test sil, sil
jne .L4
ret
.L4:
mov ecx, edx
mov esi, 1
sal esi, cl
movsx rsi, esi
or rax, rsi
ret
hr2(unsigned long, bool, int):
mov ecx, edx
movzx esi, sil
sal rsi, cl
mov rax, rsi
or rax, rdi
ret
Why can't clang and gcc optimize the first function like the second?
The functions do not have identical behavior. In particular in the first one a will undergo integer promotion to int in a << n, so that the shift will have undefined behavior if n >= std::numeric_limits<int>::digits (typically 31).
This is not the case in the second function where a ? 1ull : 0 will result in the common type of unsigned long long, so that the shift will have well-defined behavior for all non-negative values n < std::numeric_limits<unsigned long long>::digits (typically 64) which is most likely more than std::numeric_limits<int>::digits (typically 31).
You should cast a and 1 to uint64_t in both shifts to make the code well behaved for all sensible inputs (i.e. 0 <= n < 64).
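A minimal sketch of that fix (function names are mine):

#include <cstdint>
uint64_t hr1_fixed(const uint64_t x, const bool a, const int n) noexcept
{
    if (a) {
        return x | (uint64_t(a) << n);   // shift done in uint64_t, defined for 0 <= n < 64
    }
    return x;
}
uint64_t hr2_fixed(const uint64_t x, const bool a, const int n)
{
    return x | (uint64_t(a) << n);
}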
Even after fixing that, the functions still do not have identical behavior. The second function has undefined behavior if n >= 64 or n < 0, no matter what the value of a is, while the first function has well-defined behavior for a == false: the compiler must guarantee that this case returns x unmodified, no matter how large (or negative) the value of n is.
The second function therefore in principle gives the compiler more freedom to optimize since the range of valid input values is much smaller.
Of course, if the function gets inlined (likely), the compiler may use what it knows about the possible range of values in the call arguments for a and n and optimize further based on that.
This isn't the issue here though: GCC will compile the first function to similar assembly if e.g.
uint64_t hr1(const uint64_t x, const bool a, const int n) noexcept
{
    return a ? x | (uint64_t{1} << n) : x | (uint64_t{0} << n);
}
is used (which has the same valid inputs as hr2). I don't know which of the two assemblies will perform better. I suppose you will have to benchmark that or wait for some expert on that to show up.
Both ways look over-complicated (and the first one is buggy for n>=32). To promote a bool to a uint64_t 0 or 1, just use uint64_t(a) or a C-style cast. You don't need a ? 1ull : 0.
The simple branchless way is probably good, unless you expect a to be highly predictable (e.g. usually one way, or correlated with earlier branching. Modern TAGE predictors use recent branch history to index the BHT / BTB.)
uint64_t hr2(uint64_t x, bool a, int n) {
    return x | (uint64_t(a) << n);
}
If you want to make this more complicated to avoid UB when n is out of range, write your C++ to wrap the shift count the same way x86 shift instructions do, so the compiler doesn't need any extra instructions.
#include <limits>
uint64_t hr3(uint64_t x, bool a, int n) {
    using shiftwidth = decltype(x);
    const int mask = std::numeric_limits<shiftwidth>::digits - 1;
    // wrap the count to the shift width to avoid UB
    // x86 does this for free for 32 and 64-bit shifts.
    return x | (shiftwidth(a) << (n & mask));
}
Both versions compile identically for x86 (because the simple version has to work for all inputs without UB).
This compiles decently if you have BMI2 (for single-uop variable-count shifts on Intel), otherwise it's not great. (https://agner.org/optimize/ and https://uops.info/) But even then there are missed optimizations from GCC:
# GCC9.2 -O3 -march=skylake
hr3(unsigned long, bool, int):
movzx esi, sil # zero-extend the bool to 64-bit, 1 cycle latency because GCC failed to use a different register
shlx rsi, rsi, rdx # the shift
mov rax, rsi # stupid GCC didn't put the result in RAX
or rax, rdi # retval = shift | x
ret
This could have been
# hand optimized, and clang 9.0 -O3 -march=skylake
movzx eax, sil # mov-elimination works between different regs
shlx rax, rax, rdx # don't need to take advantage of copy-and-shift
or rax, rdi
ret
It turns out that clang9.0 actually does emit this efficient version with -O3 -march=skylake or znver1. (Godbolt).
This is cheap enough (3 uops) it's not worth branching for, except to break the data dependency on n in case x and a are likely to be ready earlier than n.
But without BMI2, the shift would take a mov ecx, edx, and a 3-uop (on Intel SnB-family) shl rax, cl. AMD has single-uop variable-count shifts even for the legacy versions that do write flags (except when CL=0 and they have to leave FLAGS unmodified; that's why it costs more on Intel). GCC is still dumb and zero-extends in place instead of into RAX. Clang gets it right (and takes advantage of the unofficial calling convention feature where narrow function args are sign or zero-extended to 32-bit so it can use mov instead of movzx) https://godbolt.org/z/9wrYEN
Clang compiles an if() to branchless using CMOV, so that's significantly worse than the simple version that uses uint64_t(a) << n. It's a missed optimization that it doesn't compile my hr1 the same as my hr3; they produce identical results for every input that avoids UB.
GCC actually branches and then uses mov reg, 1 / shl / or for the if version. Again it could compile it the same as hr3 if it chose to. (It can assume that a=1 implies n<=63, otherwise the if version would have shift UB.)
The missed optimization in both is the failure to use bts, which implements reg |= 1<<(n&63).
Especially for gcc after branching, so it knows it's shifting a constant 1, the tail of the function should be bts rax, rdx, which is 1 uop with 1c latency on Intel, 2 uops on AMD Zen1 / Zen2. GCC and clang do know how to use bts for the simple case of a compile-time-constant a=1, though: https://godbolt.org/z/rkhbzH
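My reconstruction of what that constant-a case presumably looks like in source (per the answer above, gcc and clang compile this pattern to bts):

#include <cstdint>
uint64_t set_bit(uint64_t x, int n)
{
    return x | (uint64_t{1} << (n & 63));   // known-set bit: per the answer, compilers use bts here
}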
There's no way that I know of to hand-hold GCC or clang into using bts otherwise, and I wouldn't recommend inline-assembly for this unless it's in the most critical inner loop of something and you're prepared to check that it doesn't hurt other optimizations, and to maintain it. i.e. just don't.
But ideally GCC / clang would do something like this when BMI2 isn't available:
# hand optimized, compilers should do this but don't.
mov rax, rdi # x
bts rdi, rdx # x | 1<<(n&63)
test sil, sil
cmovnz rax, rdi # return a ? x_with_bit_set : x;
ret
Doesn't require BMI2, but still only 4 uops on Broadwell and later. (And 5 uops on AMD Bulldozer / Zen). Critical path latencies:
x -> retval: 2 cycles (through (MOV and BTS) -> CMOV) on Broadwell and later. 3 cycles on earlier Intel (2 uop cmov) and on any AMD (2 uop BTS).
n -> retval: same as x (through BTS -> CMOV).
a -> retval: 2 cycles (through TEST -> CMOV) on Broadwell and later, and all AMD. 3 cycles on earlier Intel (2 uop cmov).
This is pretty obviously better than what clang emits for any version without -march=skylake or other BMI2, and even better than what GCC emits (unless branchy turns out to be a good strategy).
One way that clang will use BTS:
If we mask the shift count for the branchy version, then clang will actually branch, and on the branch where the if body runs it implements it with bts as I described above. https://godbolt.org/z/BtT4w6
uint64_t hr1(uint64_t x, bool a, int n) noexcept
{
    if (a) {
        return x | (uint64_t(a) << (n&63));
    }
    return x;
}
clang 9.0 -O3 (without -march=)
hr1(unsigned long, bool, int):
mov rax, rdi
test sil, sil
je .LBB0_2 # if(a) {
bts rax, rdx # x |= 1<<(n&63)
.LBB0_2: # }
ret
So if branchy is good for your use-case, then this way of writing it compiles well with clang.
These stand-alone versions might end up different after inlining into a real caller.
For example, a caller might save a MOV instruction if it can have the shift count n already in CL. Or the decision on whether to do if-conversion from an if to a branchless sequence might be different.
Or if n is a compile-time constant, that means we don't need BMI2 to save uops on the shift anymore; immediate shifts are fully efficient on all modern CPUs (single uop).
And of course if a is a compile time constant then it's either nothing to do or optimizes to a bts.
Further reading: see the performance links in https://stackoverflow.com/tags/x86/info for more about how to decide if asm is efficient by looking at it.
I am currently trying to improve the speed of my program.
I was wondering whether it would help to replace all if-statements of the type:
bool a=1;
int b=0;
if(a){b++;}
with this:
bool a=1;
int b=0;
b+=a;
I am unsure whether the conversion from bool to int could be a problem time-wise.
One rule of thumb when programming is to not micro-optimise.
Another rule is to write clear code.
But in this case, another rule applies: if you are writing optimised code, avoid code that can cause branches, since a mispredicted branch can cause a costly pipeline flush.
Bear in mind also that there are no bool and int types as such at the assembly level, just registers, so you will probably find that the conversion is optimised out. Therefore
b += a;
wins for me; it's also clearer.
Compilers are allowed to assume that the underlying value of a bool isn't messed up (i.e. that it is always 0 or 1), so optimizing compilers can avoid the branch.
If we look at the generated code for this artificial test
int with_if_bool(bool a, int b) {
    if(a){b++;}
    return b;
}
int with_if_char(unsigned char a, int b) {
    if(a){b++;}
    return b;
}
int without_if(bool a, int b) {
    b += a;
    return b;
}
clang will exploit this fact and generate the exact same branchless code that sums a and b for the bool version, and instead generate actual comparisons with zero in the unsigned char case (although it's still branchless code):
with_if_bool(bool, int): # #with_if_bool(bool, int)
lea eax, [rdi + rsi]
ret
with_if_char(unsigned char, int): # #with_if_char(unsigned char, int)
cmp dil, 1
sbb esi, -1
mov eax, esi
ret
without_if(bool, int): # #without_if(bool, int)
lea eax, [rdi + rsi]
ret
gcc will instead treat bool just as if it were an unsigned char, without exploiting its properties, generating code similar to clang's unsigned char case.
with_if_bool(bool, int):
mov eax, esi
cmp dil, 1
sbb eax, -1
ret
with_if_char(unsigned char, int):
mov eax, esi
cmp dil, 1
sbb eax, -1
ret
without_if(bool, int):
movzx edi, dil
lea eax, [rdi+rsi]
ret
Finally, Visual C++ will treat the bool and the unsigned char versions equally, just as gcc does, although with more naive codegen (it uses a conditional move instead of performing arithmetic with the flags register, which IIRC traditionally used to be less efficient; I don't know about current machines).
a$ = 8
b$ = 16
int with_if_bool(bool,int) PROC ; with_if_bool, COMDAT
test cl, cl
lea eax, DWORD PTR [rdx+1]
cmove eax, edx
ret 0
int with_if_bool(bool,int) ENDP ; with_if_bool
a$ = 8
b$ = 16
int with_if_char(unsigned char,int) PROC ; with_if_char, COMDAT
test cl, cl
lea eax, DWORD PTR [rdx+1]
cmove eax, edx
ret 0
int with_if_char(unsigned char,int) ENDP ; with_if_char
a$ = 8
b$ = 16
int without_if(bool,int) PROC ; without_if, COMDAT
movzx eax, cl
add eax, edx
ret 0
int without_if(bool,int) ENDP ; without_if
In all cases, no branches are generated; the only difference is that, on most compilers, some more complex code is generated that depends on a cmp or a test, creating a longer dependency chain.
That being said, I would worry about this kind of micro-optimization only if you actually run your code under a profiler, and the results point to this specific code (or to some tight loop that involve it); in general you should write sensible, semantically correct code and focus on using the correct algorithms/data structures. Micro-optimization comes later.
In my program, this wouldn't work directly, as a is actually the result of a comparison: b += (a == c)
This should be even better for the optimizer, as it doesn't even have any doubt about where the bool is coming from - it can just decide straight from the flags register. As you can see, here gcc produces quite similar code for the two cases, clang exactly the same, while VC++ as usual produces something that is more conditional-ish (a cmov) in the if case.
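For reference, that pattern (names are mine) is just:

int count_equal(int b, int a, int c)
{
    b += (a == c);   // the bool result of == converts to 0 or 1, no branch needed
    return b;
}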
I am trying to implement Intel DRNG in c++.
According to its guide to generate a 64 bit unsigned long long the code should be:
int rdrand64_step (unsigned long long *rand)
{
    unsigned char ok;
    asm volatile ("rdrand %0; setc %1"
                  : "=r" (*rand), "=qm" (ok));
    return (int) ok;
}
However, this function is only giving me 32 bits of output in rand, as shown:
bd4a749d
d461c2a8
8f666eee
d1d5bcc4
c6f4a412
Any reason why this is happening?
More info: the IDE I'm using is Code::Blocks.
Use int _rdrand64_step (unsigned __int64* val) from immintrin.h instead of writing inline asm. You don't need it, and there are many reasons (including this one) to avoid it: https://gcc.gnu.org/wiki/DontUseInlineAsm
In this case, the problem is that you're probably compiling 32-bit code, so of course 64-bit rdrand is not encodeable. But the way you used inline-asm ended up giving you a 32-bit rdrand, and storing garbage from another register for the high half.
gcc -Wall -O3 -m32 -march=ivybridge (and similar for clang) produces (on Godbolt):
In function 'rdrand64_step':
7 : <source>:7:1: warning: unsupported size for integer register
rdrand64_step:
push ebx
rdrand ecx; setc al
mov edx, DWORD PTR [esp+8] # load the pointer arg
movzx eax, al
mov DWORD PTR [edx], ecx
mov DWORD PTR [edx+4], ebx # store garbage in the high half of *rand
pop ebx
ret
I guess you called this function with a caller that happened to have ebx=0. Or else you used a different compiler that did something different. Maybe something else happens after inlining. If you looked at disassembly of what you actually compiled, you could explain exactly what's going on.
If you'd used the intrinsic, you would have gotten error: '_rdrand64_step' was not declared in this scope, because immintrin.h only declares it in 64-bit mode (and with a -march setting that implies rdrand support, or with -mrdrnd. Best option: use -march=native if you're building on the target machine).
You'd also get significantly more efficient code for a retry loop, at least with clang:
unsigned long long use_intrinsic(void) {
    unsigned long long rand;
    while(!_rdrand64_step(&rand)); // TODO: retry limit in case RNG is broken.
    return rand;
}
use_intrinsic: # #use_intrinsic
.LBB2_1: # =>This Inner Loop Header: Depth=1
rdrand rax
jae .LBB2_1
ret
That avoids setcc and then testing the result, which is of course redundant. gcc6 has syntax for returning flag results from inline asm. You can also use asm goto and put a jcc inside the asm, jumping to a C label that does return 1;, or falling through to a return 0;. (The inline-asm docs have an example of doing this: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html. See also the inline-assembly tag wiki.)
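A sketch of what the flag-output version might look like (mine, not from the original answer), using GCC 6's "=@ccc" carry-flag output constraint so no setc is needed in the template at all:

static inline int rdrand64_step_flagout(unsigned long long *rand)
{
    int ok;
    asm volatile ("rdrand %0"
                  : "=r" (*rand), "=@ccc" (ok));   // ok = carry flag (1 on success)
    return ok;
}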
Using your inline-asm, clang (in 64-bit mode) compiles it to:
use_asm:
.LBB1_1:
rdrand rax
setb byte ptr [rsp - 1]
cmp byte ptr [rsp - 1], 0
je .LBB1_1
ret
(clang makes bad decisions for constraints with multiple options that include memory.)
gcc7.2 and ICC17 actually end up with better code from the asm than from the intrinsic. They use cmovc to get a 0 or 1 and then test that, which is pretty dumb. But that's a gcc/ICC missed optimization that will hopefully be fixed.
The x86-64 ABI specifies two return registers: rax and rdx, both 64-bits (8 bytes) in size.
Assuming that x86-64 is the only targeted platform, which of these two functions:
uint64_t f(uint64_t * const secondReturnValue) {
    /* Calculate a and b. */
    *secondReturnValue = b;
    return a;
}
std::pair<uint64_t, uint64_t> g() {
    /* Calculate a and b, same as in f() above. */
    return { a, b };
}
would yield better performance, given the current state of C/C++ compilers targeting x86-64? Are there any pitfalls performance-wise using one or the other version? Are compilers (GCC, Clang) always able to optimize the std::pair to be returned in rax and rdx?
UPDATE: Generally, returning a pair is faster if the compiler optimizes out the std::pair methods (examples of binary output with GCC 5.3.0 and Clang 3.8.0). If f() is not inlined, the compiler must generate code to write a value to memory, e.g.:
movq b, (%rdi)
movq a, %rax
retq
But in case of g() it suffices for the compiler to do:
movq a, %rax
movq b, %rdx
retq
Because instructions for writing values to memory are generally slower than instructions for writing values to registers, the second version should be faster.
Since the ABI specifies that in some particular cases two registers have to be used for the 2-word result, any conforming compiler has to obey that rule.
However, for such tiny functions I guess that most of the performance will come from inlining.
You may want to compile and link with g++ -flto -O2 using link-time optimizations.
I guess that the second function (returning a pair through 2 registers) might be slightly faster, and that perhaps in some situations the GCC compiler could inline and optimize the first into the second.
But you really should benchmark if you care that much.
Note that the ABI specifies packing any small struct into registers for passing/returning (if it contains only integer types), so returning a std::pair<uint32_t, uint32_t> means the two values have to be shift+ORed into rax.
This is probably still better than a round trip through memory, because setting up space for a pointer, and passing that pointer as an extra arg, has some overhead. (Other than that, though, a round-trip through L1 cache is pretty cheap, like ~5c latency. The store/load are almost certainly going to hit in L1 cache, because stack memory is used all the time. Even if it misses, store-forwarding can still happen, so execution doesn't stall until the ROB fills because the store can't retire. See Agner Fog's microarch guide and other stuff at the x86 tag wiki.)
Anyway, here's the kind of code you get from gcc 5.3 -O2, using functions that take args instead of returning compile-time constant values (which would lead to movabs rax, 0x...):
#include <cstdint>
#include <utility>
#define type_t uint32_t
type_t f(type_t * const secondReturnValue, type_t x) {
    *secondReturnValue = x+4;
    return x+2;
}
lea eax, [rsi+4] # LEA is an add-and-shift instruction that uses memory-operand syntax and encoding
mov DWORD PTR [rdi], eax
lea eax, [rsi+2]
ret
std::pair<type_t, type_t> g(type_t x) { return {x+2, x+4}; }
lea eax, [rdi+4]
lea edx, [rdi+2]
sal rax, 32
or rax, rdx
ret
type_t use_pair(std::pair<type_t, type_t> pair) {
    return pair.second + pair.first;
}
mov rax, rdi
shr rax, 32
add eax, edi
ret
So it's really not bad at all. Two or three insns in the caller and callee to pack and unpack a pair of uint32_t values. Nowhere near as good as returning a pair of uint64_t values, though.
If you're specifically optimizing for x86-64, and care what happens for non-inlined functions with multiple return values, then prefer returning std::pair<uint64_t, uint64_t> (or int64_t, obviously), even if you assign those pairs to narrower integers in the caller. Note that in the x32 ABI (-mx32), pointers are only 32bits. Don't assume pointers are 64bit when optimizing for x86-64, if you care about that ABI.
If either member of the pair is 64bit, they use separate registers. It doesn't do anything stupid like splitting one value between the high half of one reg and the low half of another.