#include <cstdint>
uint64_t hr1(const uint64_t x, const bool a, const int n) noexcept
{
    if (a) {
        return x | (a << n);
    }
    return x;
}
uint64_t hr2(const uint64_t x, const bool a, const int n)
{
    return x | ((a ? 1ull : 0) << n);
}
https://godbolt.org/z/gy_65H
hr1(unsigned long, bool, int):
mov rax, rdi
test sil, sil
jne .L4
ret
.L4:
mov ecx, edx
mov esi, 1
sal esi, cl
movsx rsi, esi
or rax, rsi
ret
hr2(unsigned long, bool, int):
mov ecx, edx
movzx esi, sil
sal rsi, cl
mov rax, rsi
or rax, rdi
ret
Why can't clang and gcc optimize the first function the same way as the second?
The functions do not have identical behavior. In particular in the first one a will undergo integer promotion to int in a << n, so that the shift will have undefined behavior if n >= std::numeric_limits<int>::digits (typically 31).
This is not the case in the second function where a ? 1ull : 0 will result in the common type of unsigned long long, so that the shift will have well-defined behavior for all non-negative values n < std::numeric_limits<unsigned long long>::digits (typically 64) which is most likely more than std::numeric_limits<int>::digits (typically 31).
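The difference in promoted types can be checked directly with static_assert; a minimal sketch (the variable names are mine, just to give decltype something to inspect):

```cpp
#include <type_traits>

// In a << n the bool undergoes integer promotion to int,
// while (a ? 1ull : 0) has the common type unsigned long long.
bool a_ = true;
int n_ = 0;
static_assert(std::is_same<decltype(a_ << n_), int>::value,
              "bool << int is done at int width");
static_assert(std::is_same<decltype((a_ ? 1ull : 0) << n_), unsigned long long>::value,
              "the ternary forces a 64-bit shift");
```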
You should cast a and 1 to uint64_t in both shifts to make the code well behaved for all sensible inputs (i.e. 0 <= n < 64).
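A sketch of the first function with that fix applied (hr1_fixed is my name, not from the question):

```cpp
#include <cstdint>

// Same logic as hr1, but the shift happens at uint64_t width,
// so it's well-defined for all 0 <= n < 64.
uint64_t hr1_fixed(uint64_t x, bool a, int n) noexcept
{
    if (a) {
        return x | (uint64_t{1} << n);  // inside this branch a is true, so 1 == uint64_t(a)
    }
    return x;
}
```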
Even after fixing that the functions do not have equal behavior. The second function will have undefined behavior if n >= 64 or n < 0 no matter what the value of a is while the first function has well-defined behavior for a == false. The compiler must guarantee that this case returns x unmodified, no matter how large (or negative) the value of n is.
The second function therefore in principle gives the compiler more freedom to optimize since the range of valid input values is much smaller.
Of course, if the function gets inlined (likely), the compiler may use what it knows about the possible range of values in the call arguments for a and n and optimize further based on that.
This isn't the issue here though, GCC will compile to similar assembly for the first function if e.g.
uint64_t hr1(const uint64_t x, const bool a, const int n) noexcept
{
    return a ? x | (uint64_t{1} << n) : x | (uint64_t{0} << n);
}
is used (which has the same valid inputs as hr2). I don't know which of the two assemblies will perform better. I suppose you will have to benchmark that or wait for some expert on that to show up.
Both ways look over-complicated (and the first one is buggy for n>=32). To promote a bool to a uint64_t 0 or 1, just use uint64_t(a) or a C-style cast. You don't need a ? 1ull : 0.
The simple branchless way is probably good, unless you expect a to be highly predictable (e.g. usually one way, or correlated with earlier branching. Modern TAGE predictors use recent branch history to index the BHT / BTB.)
uint64_t hr2(uint64_t x, bool a, int n) {
    return x | (uint64_t(a) << n);
}
If you want to make this more complicated to avoid UB when n is out of range, write your C++ to wrap the shift count the same way x86 shift instructions do, so the compiler doesn't need any extra instructions.
#include <limits>
uint64_t hr3(uint64_t x, bool a, int n) {
    using shiftwidth = decltype(x);
    const int mask = std::numeric_limits<shiftwidth>::digits - 1;
    // wrap the count to the shift width to avoid UB
    // x86 does this for free for 32 and 64-bit shifts.
    return x | (shiftwidth(a) << (n & mask));
}
Both versions compile identically for x86 (because the simple version has to work for all inputs without UB).
This compiles decently if you have BMI2 (for single-uop variable-count shifts on Intel), otherwise it's not great. (https://agner.org/optimize/ and https://uops.info/) But even then there are missed optimizations from GCC:
# GCC9.2 -O3 -march=skylake
hr3(unsigned long, bool, int):
movzx esi, sil # zero-extend the bool to 64-bit, 1 cycle latency because GCC failed to use a different register
shlx rsi, rsi, rdx # the shift
mov rax, rsi # stupid GCC didn't put the result in RAX
or rax, rdi # retval = shift | x
ret
This could have been
# hand optimized, and clang 9.0 -O3 -march=skylake
movzx eax, sil # mov-elimination works between different regs
shlx rax, rax, rdx # don't need to take advantage of copy-and-shift
or rax, rdi
ret
It turns out that clang9.0 actually does emit this efficient version with -O3 -march=skylake or znver1. (Godbolt).
This is cheap enough (3 uops) it's not worth branching for, except to break the data dependency on n in case x and a are likely to be ready earlier than n.
But without BMI2, the shift would take a mov ecx, edx, and a 3-uop (on Intel SnB-family) shl rax, cl. AMD has single-uop variable-count shifts even for the legacy versions that do write flags (except when CL=0 and they have to leave FLAGS unmodified; that's why it costs more on Intel). GCC is still dumb and zero-extends in place instead of into RAX. Clang gets it right (and takes advantage of the unofficial calling convention feature where narrow function args are sign or zero-extended to 32-bit so it can use mov instead of movzx) https://godbolt.org/z/9wrYEN
Clang compiles an if() to branchless using CMOV, so that's significantly worse than the simple version that uses uint64_t(a) << n. It's a missed optimization that it doesn't compile my hr1 the same as my hr3; they're equivalent for every input that avoids UB.
GCC actually branches and then uses mov reg, 1 / shl / or for the if version. Again it could compile it the same as hr3 if it chose to. (It can assume that a=1 implies n<=63, otherwise the if version would have shift UB.)
The missed optimization in both is failure to use bts, which implements reg |= 1<<(n&63)
Especially for gcc after branching so it knows it's shifting a constant 1, the tail of the function should be bts rax, rdx which is 1 uop with 1c latency on Intel, 2 uops on AMD Zen1 / Zen2. GCC and clang do know how to use bts for the simple case of a compile-time-constant a=1, though: https://godbolt.org/z/rkhbzH
There's no way that I know of to hand-hold GCC or clang into using bts otherwise, and I wouldn't recommend inline-assembly for this unless it's in the most critical inner loop of something and you're prepared to check that it doesn't hurt other optimizations, and to maintain it. i.e. just don't.
But ideally GCC / clang would do something like this when BMI2 isn't available:
# hand optimized, compilers should do this but don't.
mov rax, rdi # x
bts rdi, rdx # x | 1<<(n&63)
test sil, sil
cmovnz rax, rdi # return a ? x_with_bit_set : x;
ret
Doesn't require BMI2, but still only 4 uops on Broadwell and later. (And 5 uops on AMD Bulldozer / Zen). Critical path latencies:
x -> retval: 2 cycles (through (MOV and BTS) -> CMOV) on Broadwell and later. 3 cycles on earlier Intel (2 uop cmov) and on any AMD (2 uop BTS).
n -> retval: same as x (through BTS -> CMOV).
a -> retval: 2 cycles (through TEST -> CMOV) on Broadwell and later, and all AMD. 3 cycles on earlier Intel (2 uop cmov).
This is pretty obviously better than what clang emits for any version without -march=skylake or other BMI2, and better still than what GCC emits (unless branchy turns out to be a good strategy).
One way that clang will use BTS:
If we mask the shift count for the branchy version, then clang will actually branch, and on the branch where the if body runs it implements it with bts as I described above. https://godbolt.org/z/BtT4w6
uint64_t hr1(uint64_t x, bool a, int n) noexcept
{
    if (a) {
        return x | (uint64_t(a) << (n&63));
    }
    return x;
}
clang 9.0 -O3 (without -march=)
hr1(unsigned long, bool, int):
mov rax, rdi
test sil, sil
je .LBB0_2 # if(a) {
bts rax, rdx # x |= 1<<(n&63)
.LBB0_2: # }
ret
So if branchy is good for your use-case, then this way of writing it compiles well with clang.
These stand-alone versions might end up different after inlining into a real caller.
For example, a caller might save a MOV instruction if it can have the shift count n already in CL. Or the decision on whether to do if-conversion from an if to a branchless sequence might be different.
Or if n is a compile-time constant, that means we don't need BMI2 to save uops on the shift anymore; immediate shifts are fully efficient on all modern CPUs (single uop).
And of course if a is a compile time constant then it's either nothing to do or optimizes to a bts.
Further reading: see the performance links in https://stackoverflow.com/tags/x86/info for more about how to decide if asm is efficient by looking at it.
Related
Consider this C++ code:
#include <cstdint>
// returns a if less than b or if b is INT32_MIN
int32_t special_min(int32_t a, int32_t b)
{
    return a < b || b == INT32_MIN ? a : b;
}
GCC with -fwrapv correctly realizes that subtracting 1 from b can eliminate the special case, and it generates this code for x86-64:
lea edx, [rsi-1]
mov eax, edi
cmp edi, edx
cmovg eax, esi
ret
But without -fwrapv it generates worse code:
mov eax, esi
cmp edi, esi
jl .L4
cmp esi, -2147483648
je .L4
ret
.L4:
mov eax, edi
ret
I understand that -fwrapv is needed if I write C++ code which relies on signed overflow. But:
The above C++ code does not depend on signed overflow (it is valid standard C++).
We all know that signed overflow has a specific behavior on x86-64.
The compiler knows it is compiling for x86-64.
If I wrote "hand optimized" C++ code trying to implement that optimization, I understand -fwrapv would be required, otherwise the compiler could decide the signed overflow is UB and do whatever it wants in the case where b == INT32_MIN. But here the compiler is in control, and I don't see what stops it from using the optimization without -fwrapv. Is there some reason it isn't allowed to?
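(For reference, a version of that hand optimization that stays well-defined without -fwrapv, by doing the wraparound in unsigned arithmetic; special_min_wrap is a name I made up for this sketch:)

```cpp
#include <cstdint>

// b - 1 computed with unsigned wraparound: INT32_MIN - 1 wraps to INT32_MAX,
// so "a < b || b == INT32_MIN" collapses to a single a <= b-1 comparison.
// The cast back to int32_t is implementation-defined before C++20 but
// modular on gcc/clang (and guaranteed modular since C++20).
int32_t special_min_wrap(int32_t a, int32_t b)
{
    int32_t bm1 = (int32_t)((uint32_t)b - 1u);
    return a <= bm1 ? a : b;
}
```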
This kind of missed optimization has happened before in GCC, like not fully treating signed int add as associative even though it's compiling for a 2's complement target with wrapping addition. So it optimizes better for unsigned. IIRC, the reason was something like GCC losing track of some information it had about the operations, and thus being conservative? I forget if that ever got fixed.
I can't find where I've seen this before on SO with a reply from a GCC dev about the internals; maybe it was in a GCC bug report? I think it was with something like a+b+c+d+e (not) re-associating into a tree of dependencies to shorten the critical path. But unfortunately it's still present in current GCC:
int sum(int a, int b, int c, int d, int e, int f) {
return a+b+c+d+e+f;
// gcc and clang make one stupid dep chain
}
int sumv2(int a, int b, int c, int d, int e, int f) {
return (a+b)+(c+d)+(e+f);
// clang pessimizes this back to 1 chain, GCC doesn't
}
unsigned sumu(unsigned a, unsigned b, unsigned c, unsigned d, unsigned e, unsigned f) {
return a+b+c+d+e+f;
// gcc and clang make one stupid dep chain
}
unsigned sumuv2(unsigned a, unsigned b, unsigned c, unsigned d, unsigned e, unsigned f) {
return (a+b)+(c+d)+(e+f);
// GCC and clang pessimize back to 1 chain for unsigned
}
Godbolt for x86-64 System V at -O3, clang and gcc -fwrapv make the same asm for all 4 functions, as you'd expect.
GCC (without -fwrapv) makes the same asm for sumu as for sumuv2 (summing into r8d, the reg that held e.) But GCC makes different asm for sum and sumv2, because they use signed int.
# gcc -O3 *without* -fwrapv
# The same order of operations as the C source
sum(int, int, int, int, int, int):
add edi, esi # a += b
add edi, edx # ((a+b) + c) ...
add edi, ecx # sum everything into EDI
add edi, r8d
lea eax, [rdi+r9]
ret
# also as written, the source order of operations:
sumv2(int, int, int, int, int, int):
add edi, esi # a+=b
add edx, ecx # c+=d
add r8d, r9d # e+=f
add edi, edx # a += c
lea eax, [rdi+r8] # retval = a + e
ret
So ironically GCC makes better asm when it doesn't re-associate the source. That's assuming that all 6 inputs are ready at once. If out-of-order exec of earlier code only produced the input registers 1 per cycle, the final result here would be ready only 1 cycle after the final input was ready, assuming that final input was f.
But if the last input was a or b, the result wouldn't be ready until 5 cycles later with the single chain like GCC and clang use when they can. vs. 3 cycles worst case for the tree reduction, 2 cycle best case (if e or f were ready last).
(Update: -mtune=znver2 makes GCC re-associate into a tree, thanks @amonakov. So this is a tuning choice with a default that seems strange to me, at least for this specific problem-size. See GCC source, search for reassoc to see costs for other tuning settings; most of them are 1,1,1,1 which is insane, especially for floating-point. This might be why GCC fails to use multiple vector accumulators when unrolling FP loops, defeating the purpose.)
But anyway, this is a case of GCC only re-associating signed int with -fwrapv. So clearly it limits itself more than necessary without -fwrapv.
Related: Compiler optimizations may cause integer overflow. Is that okay? - it is of course legal, and failure to do it is a missed optimization.
GCC isn't totally hamstrung by signed int; it will auto-vectorize int sum += arr[i], and it does manage to optimize Why doesn't GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)? for signed int a.
Consider the following code:
unsigned long long div(unsigned long long a, unsigned long long b, unsigned long long c) {
    unsigned __int128 d = (unsigned __int128)a*(unsigned __int128)b;
    return d/c;
}
When compiled with x86-64 gcc 10 or clang 10, both with -O3, it emits __udivti3, instead of DIVQ instruction:
div:
mov rax, rdi
mov r8, rdx
sub rsp, 8
xor ecx, ecx
mul rsi
mov r9, rax
mov rsi, rdx
mov rdx, r8
mov rdi, r9
call __udivti3
add rsp, 8
ret
At least in my testing, the former is much slower than the (already) slow latter, hence the question: is there a way to make a modern compiler emit DIVQ for the above code?
Edit: Let's assume the quotient fits into a 64-bit register.
div will fault if the quotient doesn't fit in 64 bits. Doing (a*b) / c with mul + a single div isn't safe in the general case (doesn't implement the abstract-machine semantics for every possible input), therefore a compiler can't generate asm that way for x86-64.
Even if you do give the compiler enough info to figure out that the division can't overflow (i.e. that high_half < divisor), unfortunately gcc/clang still won't ever optimize it to a single div with a non-zero high-half dividend (RDX).
You need an intrinsic or inline asm to explicitly do 128 / 64-bit => 64-bit division. e.g. Intrinsics for 128 multiplication and division has GNU C inline asm that looks right for low/high halves separately.
Unfortunately GNU C doesn't have an intrinsic for this. MSVC does, though: Unsigned 128-bit division on 64-bit machine has links.
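A minimal sketch of such inline asm, assuming x86-64 and that the caller guarantees hi < c so the quotient fits in 64 bits (otherwise divq raises #DE); div128_64 is my name for it:

```cpp
#include <cstdint>

// 128/64 -> 64 unsigned division using the divq instruction directly.
// divq divides RDX:RAX by the operand, leaving quotient in RAX, remainder in RDX.
// Precondition: hi < c, or the CPU faults.
uint64_t div128_64(uint64_t hi, uint64_t lo, uint64_t c)
{
    uint64_t quot, rem;
    asm("divq %[c]"
        : "=a"(quot), "=d"(rem)
        : "a"(lo), "d"(hi), [c] "r"(c));
    return quot;
}
```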
C++17 std::clamp is a template function that returns the input value if it is not less than the given minimum and not greater than the given maximum; otherwise it returns the minimum or the maximum, respectively.
The goal is to optimize it, assuming the following:
The type parameter is 32 bit or 64 bit integer
The input value is way more likely to be already in range than out of range, so likely to be returned
The input value is likely to be computed shortly before, the minimum and maximum is likely to be known in advance
Let's ignore references, which may complicate optimization, but in practice are not useful for integer values
For both the standard implementation, and a naive implementation, the assembly generated by gcc and clang does not seem to favor the in range assumption above. Both of these:
#include <algorithm>
int clamp1(int v, int minv, int maxv)
{
    return std::clamp(v, minv, maxv);
}
int clamp2(int v, int minv, int maxv)
{
    if (maxv < v)
    {
        return maxv;
    }
    if (v < minv)
    {
        return minv;
    }
    return v;
}
Compile into two cmov (https://godbolt.org/z/oedd9Yfro):
mov eax, esi
cmp edi, esi
cmovge eax, edi
cmp edx, edi
cmovl eax, edi
Trying to tell the compilers to favor the in-range case with __builtin_expect (gcc seems to ignore C++20 [[likely]]):
int clamp3(int v, int minv, int maxv)
{
    if (__builtin_expect(maxv < v, 0))
    {
        return maxv;
    }
    if (__builtin_expect(v < minv, 0))
    {
        return minv;
    }
    return v;
}
The result for gcc and clang are now different (https://godbolt.org/z/s4vedo1br).
gcc still fully avoids branches, using two cmov. clang has one branch instead of the expected two (annotation mine):
clamp3(int, int, int):
mov eax, edx ; result = maxv
cmp edi, esi ; v, minv
cmovge esi, edi ; if (v >= minv) minv = v
cmp edx, edi ; maxv, v
jl .LBB0_2 ; if (maxv < v) goto LBB0_2
mov eax, esi ; result = minv (after it was updated from v if no clamping)
.LBB0_2:
ret
Questions:
Are there significant disadvantages in using conditional jumps that are expected to go the same way each time, given that gcc avoids them?
Is the clang version with one conditional jump better than it would have been with two jumps?
Not using cmov is suggested in the Intel® 64 and IA-32 Architectures Optimization Reference Manual (June 2021 version, page 3-5):
Assembly/Compiler Coding Rule 2. (M impact, ML generality) Use the SETCC and CMOV
instructions to eliminate unpredictable conditional branches where possible. Do not do this for
predictable branches. Do not use these instructions to eliminate all unpredictable conditional branches
(because using these instructions will incur execution overhead due to the requirement for executing
both paths of a conditional branch). In addition, converting a conditional branch to SETCC or CMOV
trades off control flow dependence for data dependence and restricts the capability of the out-of-order
engine. When tuning, note that all Intel 64 and IA-32 processors usually have very high branch
prediction rates. Consistently mispredicted branches are generally rare. Use these instructions only if
the increase in computation time is less than the expected cost of a mispredicted branch.
My question is about what the compiler is doing in this case that optimizes the code way more than what I would think is possible.
Given this enum:
enum MyEnum {
    Entry1,
    Entry2,
    ... // Entry3..27 are the same, omitted for size.
    Entry28,
    Entry29
};
And this function:
bool MyFunction(MyEnum e)
{
    if (
        e == MyEnum::Entry1 ||
        e == MyEnum::Entry3 ||
        e == MyEnum::Entry8 ||
        e == MyEnum::Entry14 ||
        e == MyEnum::Entry15 ||
        e == MyEnum::Entry18 ||
        e == MyEnum::Entry21 ||
        e == MyEnum::Entry22 ||
        e == MyEnum::Entry25)
    {
        return true;
    }
    return false;
}
For the function, MSVC generates this assembly when compiled with -Ox optimization flag (Godbolt):
bool MyFunction(MyEnum) PROC ; MyFunction
cmp ecx, 24
ja SHORT $LN5#MyFunction
mov eax, 20078725 ; 01326085H
bt eax, ecx
jae SHORT $LN5#MyFunction
mov al, 1
ret 0
$LN5#MyFunction:
xor al, al
ret 0
Clang generates similar (slightly better, one less jump) assembly when compiled with -O3 flag:
MyFunction(MyEnum): # #MyFunction(MyEnum)
cmp edi, 24
ja .LBB0_2
mov eax, 20078725
mov ecx, edi
shr eax, cl
and al, 1
ret
.LBB0_2:
xor eax, eax
ret
What is happening here? I see that even if I add more enum comparisons to the function, the assembly that is generated does not actually become "more", it's only this magic number (20078725) that changes. That number depends on how many enum comparisons are happening in the function. I do not understand what is happening here.
The reason why I am looking at this is that I was wondering if it is good to write the function as above, or alternatively like this, with bitwise comparisons:
bool MyFunction2(MyEnum e)
{
    if (
        e == MyEnum::Entry1 |
        e == MyEnum::Entry3 |
        e == MyEnum::Entry8 |
        e == MyEnum::Entry14 |
        e == MyEnum::Entry15 |
        e == MyEnum::Entry18 |
        e == MyEnum::Entry21 |
        e == MyEnum::Entry22 |
        e == MyEnum::Entry25)
    {
        return true;
    }
    return false;
}
This results in this generated assembly with MSVC:
bool MyFunction2(MyEnum) PROC ; MyFunction2
xor edx, edx
mov r9d, 1
cmp ecx, 24
mov eax, edx
mov r8d, edx
sete r8b
cmp ecx, 21
sete al
or r8d, eax
mov eax, edx
cmp ecx, 20
cmove r8d, r9d
cmp ecx, 17
sete al
or r8d, eax
mov eax, edx
cmp ecx, 14
cmove r8d, r9d
cmp ecx, 13
sete al
or r8d, eax
cmp ecx, 7
cmove r8d, r9d
cmp ecx, 2
sete dl
or r8d, edx
test ecx, ecx
cmove r8d, r9d
test r8d, r8d
setne al
ret 0
Since I do not understand what happens in the first case, I can not really judge which one is more efficient in my case.
Quite smart! The first comparison with 24 does a rough range check - if the value is above 24 it bails out (negative values also fail, since the comparison is unsigned); this is important because the instructions that follow, which operate on the magic number, only handle bit indices in the range [0, 31].
For the rest, the magic number is just a bitmask, with the bits corresponding to the "good" values set.
>>> bin(20078725)
'0b1001100100110000010000101'
It's easy to spot the first and third bits (counting from 1 and from right) set, the 8th, 14th, 15th, ...
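One way to see where 20078725 comes from is to build it from the matched enumerators (Entry1 has the value 0, so EntryN corresponds to bit N-1). A sketch, with make_mask being my name:

```cpp
#include <cstdint>

// OR together one bit per matched enumerator:
// Entry1, 3, 8, 14, 15, 18, 21, 22, 25 -> bits 0, 2, 7, 13, 14, 17, 20, 21, 24
constexpr uint32_t make_mask()
{
    const int bits[] = {0, 2, 7, 13, 14, 17, 20, 21, 24};
    uint32_t m = 0;
    for (int b : bits)
        m |= 1u << b;
    return m;
}
static_assert(make_mask() == 20078725, "matches the compiler's magic number");
```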
MSVC checks it "directly" using the BT (bit test) instruction and branching; clang instead shifts it right by the appropriate amount (to get the relevant bit into the lowest position) and isolates it by ANDing with 1 (avoiding a branch).
The C code corresponding to the clang version would be something like:
bool MyFunction(MyEnum e) {
    if (unsigned(e) > 24) return false;
    return (20078725 >> e) & 1;
}
as for the MSVC version, it's more like
inline bool bit_test(unsigned val, int bit) {
    return val & (1 << bit);
}
bool MyFunction(MyEnum e) {
    if (unsigned(e) > 24) return false;
    return bit_test(20078725, e);
}
(I kept the bit_test function separated to emphasize that it's actually a single instruction in assembly; that val & (1<<bit) thing has no correspondence to the original assembly.)
As for the if-based code, it's quite bad - it uses a lot of CMOV and ORs the results together, which is both longer code, and will probably serialize execution. I suspect the corresponding clang code will be better. OTOH, you wrote this code using bitwise OR (|) instead of the more semantically correct logical OR (||), and the compiler is strictly following your orders (typical of MSVC).
Another possibility to try instead could be a switch - but I don't think there's much to gain compared to the code already generated for the first snippet, which looks pretty good to me.
Ok, doing a quick test with all the versions against all compilers, we can see that:
the C translation of the clang output above results in pretty much the same code (= to the clang output) in all compilers; similarly for the MSVC translation;
the bitwise or version is the same as the logical or version (= good) in both CLang and gcc;
in general, gcc does essentially the same thing as CLang except for the switch case;
switch results are varied:
CLang does best, by generating the exact same code;
both gcc and MSVC generate jump-table based code, which in this case is less good; however:
gcc prefers to emit a table of QWORDs, trading size for simplicity of the setup code;
MSVC instead emits a table of BYTEs, paying it in setup code size; I couldn't get gcc to emit similar code even changing -O3 to -Os (optimize for size).
Ah, the old immediate bitmap trick.
GCC does this too, at least for a switch.
x86 asm casetable implementation. Unfortunately GCC9 has a regression for some cases: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91026#c3 ; GCC8 and earlier do a better job.
Another example of using it, this time for code-golf (fewest bytes of code, in this case x86 machine code) to detect certain letters: User Appreciation Challenge #1: Dennis ♦
The basic idea is to use the input as an index into a bitmap of true/false results.
First you have to range-check because the bitmap is fixed-width, and x86 shifts wrap the shift count. We don't want high inputs to alias into the range where some values should return true. That's what the cmp edi, 24 / ja pair is doing.
(If the range between the lowest and highest true values was from 120 to 140, for example, it might start with a sub edi,120 to range-shift everything before the cmp.)
Then you use bitmap & (1<<e) (the bt instruction), or (bitmap >> e) & 1 (shr / and) to check the bit in the bitmap that tells you whether that e value should return true or false.
There are many ways to implement that check, logically equivalent but with performance differences.
If the range was wider than 32, it would have to use 64-bit operand-size. If it was wider than 64, the compiler might not attempt this optimization at all. Or might still do it for some of the conditions that are in a narrow range.
Using an even larger bitmap (in .rodata memory) would be possible but probably not something most compilers will invent for you. Either with bt [mem],reg (inefficient) or manually indexing a dword and checking that the same way this code checks the immediate bitmap. If you had a lot of high-entropy ranges it might be worth checking 2x 64-bit immediate bitmap, branchy or branchless...
Clang/LLVM has other tricks up its sleeve for efficiently comparing against multiple values (when it doesn't matter which one is hit), e.g. broadcast a value into a SIMD register and use a packed compare. That isn't dependent on the values being in a dense range. (Clang generates worse code for 7 comparisons than for 8 comparisons)
that optimizes the code way more than what I would think is possible.
These kinds of optimizations come from smart human compiler developers that notice common patterns in source code and think of clever ways to implement them. Then get compilers to recognize those patterns and transform their internal representation of the program logic to use the trick.
Turns out that switch and switch-like if() statements are common, so they get aggressive optimizations.
Compilers are far from perfect, but sometimes they do come close to living up to what people often claim; that compilers will optimize your code for you so you can write it in a human-readable way and still have it run near-optimally. This is sometimes true over the small scale.
Since I do not understand what happens in the first case, I can not really judge which one is more efficient in my case.
The immediate bitmap is vastly more efficient. There's no data memory access in either one so no cache miss loads. The only "expensive" instruction is a variable-count shift (3 uops on mainstream Intel, because of x86's annoying FLAGS-setting semantics; BMI2 shrx is only 1 uop and avoid having to mov the number to ecx.) https://agner.org/optimize. And see other performance analysis links in https://stackoverflow.com/tags/x86/info.
Each instruction in the cmp/cmov chain is at least 1 uop, and there's a pretty long dependency chain through each cmov because MSVC didn't bother to break it into 2 or more parallel chains. But regardless it's just a lot of uops, far more than the bitmap version, so worse for throughput (ability for out-of-order exec to overlap the work with surrounding code) as well as latency.
bt is also cheap: 1 uop on modern AMD and Intel. (bts, btr, btc are 2 on AMD, still 1 on Intel).
The branch in the immediate-bitmap version could have been a setna / and to make it branchless, but especially for this enum definition the compiler expected that it would be in range. It could have increased branch predictability by only requiring e <= 31, not e <= 24.
Since the enum only goes up to 29, and IIRC it's UB to have out-of-range enum values, it could actually optimize it away entirely.
Even if the e>24 branch doesn't predict very well, it's still probably better overall. Given current compilers, we only get a choice between the nasty chain of cmp/cmov or branch + bitmap. Unless we turn the asm logic back into C to hand-hold compilers into making the asm we want; then we can maybe get branchless with an AND or CMOV to make it always zero for out-of-range e.
But if we're lucky, profile-guided optimization might let some compilers make the bitmap range check branchless. (In asm the behaviour of shl reg, cl with cl > 31 or 63 is well-defined: on x86 it simply masks the count. In a C equivalent, you could use bitmap >> (e&31) which can still optimize to a shr; compilers know that x86 shr masks the count so they can optimize that away. But not for other ISAs that saturate the shift count...)
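A branchless C++ sketch along those lines, masking the count so the shift is always defined and ANDing with the range check instead of branching (in_set_branchless is my name, not something a compiler produces):

```cpp
#include <cstdint>

// Branchless equivalent of MyFunction: out-of-range e yields false via
// the (e <= 24) mask; e & 31 keeps the shift count in range, which
// compilers can drop on x86 because shr masks the count anyway.
bool in_set_branchless(unsigned e)
{
    uint32_t bit = (20078725u >> (e & 31)) & 1u;
    return bit & (e <= 24u);
}
```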
There are lots of ways to implement the bitmap check that are pretty much equivalent. e.g. you could even use the CF output of shr, set according to the last bit shifted out. At least if you make sure CF has a known state ahead of time for the cl=0 case.
When you want an integer bool result, right-shifting seems to make more sense than bt / setcc, but with shr costing 3 uops on Intel it might actually be best to use bt reg,reg / setc al. Especially if you only need a bool, and can use EAX as your bitmap destination so the previous value of EAX is definitely ready before setcc. (Avoiding a false dependency on some unrelated earlier dep chain.)
BTW, MSVC has other silliness: as What is the best way to set a register to zero in x86 assembly: xor, mov or and? explains, xor al,al is totally stupid compared to xor eax,eax when you want to zero AL. If you don't need to leave the upper bytes of RAX unmodified, zero the full register with a zeroing idiom.
And of course branching just to return 0 or return 1 makes little sense, unless you expect it to be very predictable and want to break the data dependency. I'd expect that setc al to read the CF result of bt would make more sense.
I am trying to implement Intel DRNG in c++.
According to its guide to generate a 64 bit unsigned long long the code should be:
int rdrand64_step (unsigned long long *rand)
{
    unsigned char ok;
    asm volatile ("rdrand %0; setc %1"
                  : "=r" (*rand), "=qm" (ok));
    return (int) ok;
}
However, the output of this function rand is only 32 bits, as shown:
bd4a749d
d461c2a8
8f666eee
d1d5bcc4
c6f4a412
Any reason why this is happening?
More info: the IDE I'm using is Code::Blocks.
Use int _rdrand64_step (unsigned __int64* val) from immintrin.h instead of writing inline asm. You don't need it, and there are many reasons (including this one) to avoid it: https://gcc.gnu.org/wiki/DontUseInlineAsm
In this case, the problem is that you're probably compiling 32-bit code, so of course 64-bit rdrand is not encodeable. But the way you used inline-asm ended up giving you a 32-bit rdrand, and storing garbage from another register for the high half.
gcc -Wall -O3 -m32 -march=ivybridge (and similar for clang) produces (on Godbolt):
In function 'rdrand64_step':
7 : <source>:7:1: warning: unsupported size for integer register
rdrand64_step:
push ebx
rdrand ecx; setc al
mov edx, DWORD PTR [esp+8] # load the pointer arg
movzx eax, al
mov DWORD PTR [edx], ecx
mov DWORD PTR [edx+4], ebx # store garbage in the high half of *rand
pop ebx
ret
I guess you called this function with a caller that happened to have ebx=0. Or else you used a different compiler that did something different. Maybe something else happens after inlining. If you looked at disassembly of what you actually compiled, you could explain exactly what's going on.
If you'd used the intrinsic, you would have gotten error: '_rdrand64_step' was not declared in this scope, because immintrin.h only declares it in 64-bit mode (and with a -march setting that implies rdrand support, or -mrdrnd. Best option: use -march=native if you're building on the target machine).
You'd also get significantly more efficient code for a retry loop, at least with clang:
unsigned long long use_intrinsic(void) {
    unsigned long long rand;
    while(!_rdrand64_step(&rand));  // TODO: retry limit in case RNG is broken.
    return rand;
}
use_intrinsic: # #use_intrinsic
.LBB2_1: # =>This Inner Loop Header: Depth=1
rdrand rax
jae .LBB2_1
ret
That avoids setcc and then testing that, which is of course redundant. gcc6 has syntax for returning flag results from inline asm. You can also use asm goto and put a jcc inside the asm, jumping to a label: return 1; target or falling through to a return 0. (The inline-asm docs have an example of doing this. https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html. See also the inline-assembly tag wiki.)
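A sketch of the flag-output version (GCC 6+ / clang syntax; assumes x86-64 with RDRAND available at run time, and rdrand64_flagout is my name for it):

```cpp
// "=@ccc" asks the compiler for the carry flag directly as an output,
// so no setcc instruction is needed inside the asm or after it.
static inline int rdrand64_flagout(unsigned long long *val)
{
    int ok;
    asm volatile ("rdrand %0" : "=r" (*val), "=@ccc" (ok));
    return ok;  // 1 on success (CF set), 0 if the RNG wasn't ready
}
```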
Using your inline-asm, clang (in 64-bit mode) compiles it to:
use_asm:
.LBB1_1:
rdrand rax
setb byte ptr [rsp - 1]
cmp byte ptr [rsp - 1], 0
je .LBB1_1
ret
(clang makes bad decisions for constraints with multiple options that include memory.)
gcc7.2 and ICC17 actually end up with better code from the asm than from the intrinsic. They use cmovc to get a 0 or 1 and then test that. It's pretty dumb. But that's a gcc/ICC missed optimization that will hopefully be fixed.