Intel DRNG giving only 4 bytes of data instead of 8 - C++

I am trying to use the Intel DRNG in C++.
According to its guide, to generate a 64-bit unsigned long long the code should be:
int rdrand64_step (unsigned long long *rand)
{
    unsigned char ok;
    asm volatile ("rdrand %0; setc %1"
                  : "=r" (*rand), "=qm" (ok));
    return (int) ok;
}
However, the output of this function in rand is only 32 bits, as shown:
bd4a749d
d461c2a8
8f666eee
d1d5bcc4
c6f4a412
Any reason why this is happening?
More info: the IDE I'm using is Code::Blocks.

Use int _rdrand64_step (unsigned __int64* val) from immintrin.h instead of writing inline asm. You don't need it, and there are many reasons (including this one) to avoid it: https://gcc.gnu.org/wiki/DontUseInlineAsm
In this case, the problem is that you're probably compiling 32-bit code, so of course 64-bit rdrand is not encodeable. But the way you used inline-asm ended up giving you a 32-bit rdrand, and storing garbage from another register for the high half.
gcc -Wall -O3 -m32 -march=ivybridge (and similar for clang) produces (on Godbolt):
In function 'rdrand64_step':
<source>:7:1: warning: unsupported size for integer register
rdrand64_step:
push ebx
rdrand ecx; setc al
mov edx, DWORD PTR [esp+8] # load the pointer arg
movzx eax, al
mov DWORD PTR [edx], ecx
mov DWORD PTR [edx+4], ebx # store garbage in the high half of *rand
pop ebx
ret
I guess you called this function with a caller that happened to have ebx=0. Or else you used a different compiler that did something different. Maybe something else happens after inlining. If you looked at disassembly of what you actually compiled, you could explain exactly what's going on.
If you'd used the intrinsic, you would have gotten error: '_rdrand64_step' was not declared in this scope, because immintrin.h only declares it in 64-bit mode (and with a -march setting that implies rdrand support, or with -mrdrnd. Best option: use -march=native if you're building on the target machine).
You'd also get significantly more efficient code for a retry loop, at least with clang:
unsigned long long use_intrinsic(void) {
    unsigned long long rand;
    while(!_rdrand64_step(&rand)); // TODO: retry limit in case RNG is broken.
    return rand;
}
use_intrinsic: # #use_intrinsic
.LBB2_1: # =>This Inner Loop Header: Depth=1
rdrand rax
jae .LBB2_1
ret
That avoids setcc and then testing that, which is of course redundant. gcc6 has syntax for returning flag results from inline asm. You can also use asm goto and put a jcc inside the asm, jumping to a label where you return 1, or falling through to a return 0. (The inline-asm docs have an example of doing this: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html. See also the inline-assembly tag wiki.)
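Here's a minimal sketch (mine, not from the answer) of what that flag-output syntax looks like for this case, assuming GCC 6+ or a recent clang targeting x86-64; CF from rdrand becomes the return value with no setcc inside the asm:
static inline int rdrand64_flagout(unsigned long long *rand)
{
    int ok;
    asm volatile ("rdrand %0"
                  : "=r" (*rand), "=@ccc" (ok));   // "=@ccc": ok = carry flag
    return ok;
}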
Using your inline-asm, clang (in 64-bit mode) compiles it to:
use_asm:
.LBB1_1:
rdrand rax
setb byte ptr [rsp - 1]
cmp byte ptr [rsp - 1], 0
je .LBB1_1
ret
(clang makes bad decisions for constraints with multiple options that include memory.)
gcc7.2 and ICC17 actually end up with better code from the asm than from the intrinsic. They use cmovc to get a 0 or 1 and then test that. It's pretty dumb. But that's a gcc/ICC missed optimization that will hopefully be fixed.

Related

Why does MSVC generate nop instructions for atomic loads on x64?

If you compile code such as
#include <atomic>
int load(std::atomic<int> *p) {
    return p->load(std::memory_order_acquire) + p->load(std::memory_order_acquire);
}
you see that MSVC generates NOP padding after each memory load:
int load(std::atomic<int> *) PROC
mov edx, DWORD PTR [rcx]
npad 1
mov eax, DWORD PTR [rcx]
npad 1
add eax, edx
ret 0
Why is this? Is there any way to avoid it without relaxing the memory order (which would affect the correctness of the code)?
p->load() may eventually use the _ReadWriteBarrier compiler intrinsic.
According to this: https://developercommunity.visualstudio.com/t/-readwritebarrier-intrinsic-emits-unnecessary-code/1538997
the nops get inserted because of the flag /volatileMetadata which is now on by default. You can return to the old behavior by adding /volatileMetadata-, but doing so will result in worse performance if your code is ever run emulated. It’ll still be emulated correctly, but the emulator will have to pessimistically assume every load/store needs a barrier.
And compiling with /volatileMetadata- does indeed remove the npad.
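For example, assuming the MSVC command-line driver and a placeholder file name, the flag is just appended to the usual compile command:
cl /O2 /volatileMetadata- atomic_load.cpp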

Replace gcc asm function with __uint128_t equivalent [duplicate]

Consider the following code:
unsigned long long div(unsigned long long a, unsigned long long b, unsigned long long c) {
    unsigned __int128 d = (unsigned __int128)a*(unsigned __int128)b;
    return d/c;
}
When compiled with x86-64 gcc 10 or clang 10, both with -O3, it emits a call to __udivti3 instead of a divq instruction:
div:
mov rax, rdi
mov r8, rdx
sub rsp, 8
xor ecx, ecx
mul rsi
mov r9, rax
mov rsi, rdx
mov rdx, r8
mov rdi, r9
call __udivti3
add rsp, 8
ret
At least in my testing, the former is much slower than the (already) slow latter, hence the question: is there a way to make a modern compiler emit divq for the above code?
Edit: Let's assume the quotient fits into 64-bits register.
div will fault if the quotient doesn't fit in 64 bits. Doing (a*b) / c with mul + a single div isn't safe in the general case (doesn't implement the abstract-machine semantics for every possible input), therefore a compiler can't generate asm that way for x86-64.
Even if you do give the compiler enough info to figure out that the division can't overflow (i.e. that high_half < divisor), unfortunately gcc/clang still won't ever optimize it to a single div with a non-zero high-half dividend (RDX).
You need an intrinsic or inline asm to explicitly do 128 / 64-bit => 64-bit division. e.g. Intrinsics for 128 multiplication and division has GNU C inline asm that looks right for low/high halves separately.
Unfortunately GNU C doesn't have an intrinsic for this. MSVC does, though: Unsigned 128-bit division on 64-bit machine has links.
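A hedged sketch (mine, along the lines of the inline asm in those links) of such a wrapper; the caller must guarantee that the high half of the dividend is less than the divisor, otherwise divq faults with #DE:
#include <cstdint>

static inline uint64_t div128_by_64(uint64_t high, uint64_t low, uint64_t divisor)
{
    uint64_t quotient, remainder;
    asm ("divq %[d]"                         // divides RDX:RAX by the operand
         : "=a" (quotient), "=d" (remainder)
         : [d] "r" (divisor), "a" (low), "d" (high));
    return quotient;
}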

Error : Invalid Character '(' in mnemonic

Hi, I am trying to compile the assembly code below on Linux using gcc 7.5, but I am somehow getting the error:
Error : Invalid Character '(' in mnemonic
bool InterlockedCompareAndStore128(int *dest,int *newVal,int *oldVal)
{
    asm(
        "push %rbx\n"
        "push %rdi\n"
        "mov %rcx, %rdi\n"    // ptr to dest -> RDI
        "mov 8(%rdx), %rcx\n" // newVal -> RCX:RBX
        "mov (%rdx), %rbx\n"
        "mov 8(%r8), %rdx\n"  // oldVal -> RDX:RAX
        "mov (%r8), %rax\n"
        "lock (%rdi), cmpxchg16b\n"
        "mov $0, %rax\n"
        "jnz exit\n"
        "inc1 %rax\n"
        "exit:;\n"
        "pop %rdi\n"
        "pop %rbx\n"
    );
}
Can anyone suggest how to resolve this? I checked many online links and tutorials on assembly code but could not pin down the exact issue.
Thanks for the help in advance.
On Windows, I have seen the above function implemented as:
function InterlockedCompareExchange128;
asm
.PUSHNV RBX
MOV R10,RCX
MOV RBX,R8
MOV RCX,RDX
MOV RDX,[R9+8]
MOV RAX,[R9]
LOCK CMPXCHG16B [R10]
MOV [R9+8],RDX
MOV [R9],RAX
SETZ AL
MOVZX EAX, AL
end;
For PUSHNV, I could not find anything related to this on Linux. So, basically, I am trying to implement the same functionality in C++ on Linux.
The question here was about Invalid Character '(' in mnemonic which the other answer addresses.
However, OP's code has a number of issues beyond that problem. Here are (what I think are) two better approaches to this problem. Note that I've changed the order of the parameters and made them const.
This one continues to use inline asm, but uses Extended asm instead of Basic. While I'm of the don't use inline asm school of thought, this might be useful or at least educational.
bool InterlockedCompareAndStore128B(__int64 *dest, const __int64 *oldVal, const __int64 *newVal)
{
    bool result;
    __int64 ovl = oldVal[0];
    __int64 ovh = oldVal[1];
    asm volatile ("lock cmpxchg16b %[ptr]"
                  : "=@ccz" (result), [ptr] "+m" (*dest),
                    "+d" (ovh), "+a" (ovl)
                  : "c" (newVal[1]), "b" (newVal[0])
                  : "cc", "memory");
    // cmpxchg16b changes rdx:rax to the current value in dest. Useful if you need
    // to loop until you succeed, but OP's code doesn't save the values, so I'm
    // just following that spec.
    //oldVal[0] = ovl;
    //oldVal[1] = ovh;
    return result;
}
In addition to solving the problems with the original code, it's also inlineable and shorter. The constraints likely make it harder to read, but the fact that there's only 1 line of asm might help offset that. If you want to understand what the constraints mean, check out GCC's machine-constraints documentation (scroll down to the x86 family) and the description of flag output constraints in the Extended Asm docs (again, scroll down to the x86 family).
As an alternative, this code uses a gcc builtin and allows the compiler to generate the appropriate asm instructions. Note that this must be built with -mcx16 for best results.
bool InterlockedCompareAndStore128C(__int128 *dest, const __int128 *oldVal, const __int128 *newVal)
{
    // While a sensible person would use __atomic_compare_exchange_n and let gcc generate
    // cmpxchg16b, gcc decided they needed to turn this into a big hairy function call:
    // https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878
    // In short, if someone wants to compare/exchange against readonly memory, you can't just
    // use cmpxchg16b cuz it would crash. Why would anyone try to exchange memory that can't
    // be written to? Apparently because it's expected to *not* crash if the compare fails
    // and nothing gets written. So no one gets to use that 1 line instruction and everyone
    // gets an entire routine (that uses MUTEX instead of lockfree) to support this absurd
    // border case. Sounds dumb to me, but that's where things stand as of 2021-05-07.
    // Use the legacy function instead.
    bool b = __sync_bool_compare_and_swap(dest, *oldVal, *newVal);
    return b;
}
For the kibitzers in the crowd, here's the code generated by -m64 -O3 -mcx16 for that last one:
InterlockedCompareAndStore128C(__int128*, __int128 const*, __int128 const*):
mov rcx, rdx
push rbx
mov rax, QWORD PTR [rsi]
mov rbx, QWORD PTR [rcx]
mov rdx, QWORD PTR [rsi+8]
mov rcx, QWORD PTR [rcx+8]
lock cmpxchg16b XMMWORD PTR [rdi]
pop rbx
sete al
ret
If someone wants to fiddle, here's the godbolt link.
There are a number of problems with this code, and I'm not convinced I'm doing you any favors by telling you how to fix the specific problem.
But the short answer is that
"lock (%rdi), cmpxchg16b\n"
should be
"lock cmpxchg16b (%rdi)\n"
Tada, now it compiles. Well, it would if inc1 were a real instruction.
But I can't help but notice that the pointers here are int *, which is 4 bytes, not 16. And that this function is not declared as naked. And using Extended asm would save you from having to push all these registers around by hand, making this code a lot slower than it needs to be.
But most of all, you should really use the builtins, like __atomic_compare_exchange because inline asm is error prone, not portable, and really hard to maintain.
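For completeness, a hedged sketch (function name is mine) of that builtin approach; build with -mcx16, and note that per the gcc bug discussed above, gcc may still emit a libatomic call instead of an inline cmpxchg16b:
bool cas128(__int128 *dest, __int128 *expected, __int128 desired)
{
    // On failure, *expected is updated with the value currently in *dest.
    return __atomic_compare_exchange_n(dest, expected, desired,
                                       false,               // strong CAS
                                       __ATOMIC_SEQ_CST,    // success memory order
                                       __ATOMIC_SEQ_CST);   // failure memory order
}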

Why is different assembly code generated? Which is better?

#include <cstdint>
uint64_t hr1(const uint64_t x, const bool a, const int n) noexcept
{
    if (a) {
        return x | (a << n);
    }
    return x;
}

uint64_t hr2(const uint64_t x, const bool a, const int n)
{
    return x | ((a ? 1ull : 0) << n);
}
https://godbolt.org/z/gy_65H
hr1(unsigned long, bool, int):
mov rax, rdi
test sil, sil
jne .L4
ret
.L4:
mov ecx, edx
mov esi, 1
sal esi, cl
movsx rsi, esi
or rax, rsi
ret
hr2(unsigned long, bool, int):
mov ecx, edx
movzx esi, sil
sal rsi, cl
mov rax, rsi
or rax, rdi
ret
Why can't clang and gcc optimize the first function like the second?
The functions do not have identical behavior. In particular in the first one a will undergo integer promotion to int in a << n, so that the shift will have undefined behavior if n >= std::numeric_limits<int>::digits (typically 31).
This is not the case in the second function where a ? 1ull : 0 will result in the common type of unsigned long long, so that the shift will have well-defined behavior for all non-negative values n < std::numeric_limits<unsigned long long>::digits (typically 64) which is most likely more than std::numeric_limits<int>::digits (typically 31).
You should cast a and 1 to uint64_t in both shifts to make the code well behaved for all sensible inputs (i.e. 0 <= n < 64).
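For illustration, a hedged sketch (not from the question) of hr1 with that cast applied, so the shift happens at 64-bit width:
#include <cstdint>

uint64_t hr1_cast(const uint64_t x, const bool a, const int n) noexcept
{
    if (a) {
        return x | (uint64_t(a) << n);   // 64-bit shift; still UB for n >= 64 or n < 0
    }
    return x;
}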
Even after fixing that the functions do not have equal behavior. The second function will have undefined behavior if n >= 64 or n < 0 no matter what the value of a is while the first function has well-defined behavior for a == false. The compiler must guarantee that this case returns x unmodified, no matter how large (or negative) the value of n is.
The second function therefore in principle gives the compiler more freedom to optimize since the range of valid input values is much smaller.
Of course, if the function gets inlined (likely), the compiler may use what it knows about the possible range of values in the call arguments for a and n and optimize further based on that.
This isn't the issue here though, GCC will compile to similar assembly for the first function if e.g.
uint64_t hr1(const uint64_t x, const bool a, const int n) noexcept
{
    return a ? x | (uint64_t{1} << n) : x | (uint64_t{0} << n);
}
is used (which has the same valid inputs as hr2). I don't know which of the two assemblies will perform better. I suppose you will have to benchmark that or wait for some expert on that to show up.
Both ways look over-complicated (and the first one is buggy for n>=32). To promote a bool to a uint64_t 0 or 1, just use uint64_t(a) or a C-style cast. You don't need a ? 1ull : 0.
The simple branchless way is probably good, unless you expect a to be highly predictable (e.g. usually one way, or correlated with earlier branching. Modern TAGE predictors use recent branch history to index the BHT / BTB.)
uint64_t hr2(uint64_t x, bool a, int n) {
    return x | (uint64_t(a) << n);
}
If you want to make this more complicated to avoid UB when n is out of range, write your C++ to wrap the shift count the same way x86 shift instructions do, so the compiler doesn't need any extra instructions.
#include <limits>
uint64_t hr3(uint64_t x, bool a, int n) {
    using shiftwidth = decltype(x);
    const int mask = std::numeric_limits<shiftwidth>::digits - 1;
    // wrap the count to the shift width to avoid UB
    // x86 does this for free for 32 and 64-bit shifts.
    return x | (shiftwidth(a) << (n & mask));
}
Both versions compile identically for x86 (because the simple version has to work for all inputs without UB).
This compiles decently if you have BMI2 (for single-uop variable-count shifts on Intel), otherwise it's not great. (https://agner.org/optimize/ and https://uops.info/) But even then there are missed optimizations from GCC:
# GCC9.2 -O3 -march=skylake
hr3(unsigned long, bool, int):
movzx esi, sil # zero-extend the bool to 64-bit, 1 cycle latency because GCC failed to use a different register
shlx rsi, rsi, rdx # the shift
mov rax, rsi # stupid GCC didn't put the result in RAX
or rax, rdi # retval = shift | x
ret
This could have been
# hand optimized, and clang 9.0 -O3 -march=skylake
movzx eax, sil # mov-elimination works between different regs
shlx rax, rax, rdx # don't need to take advantage of copy-and-shift
or rax, rdi
ret
It turns out that clang9.0 actually does emit this efficient version with -O3 -march=skylake or znver1. (Godbolt).
This is cheap enough (3 uops) it's not worth branching for, except to break the data dependency on n in case x and a are likely to be ready earlier than n.
But without BMI2, the shift would take a mov ecx, edx, and a 3-uop (on Intel SnB-family) shl rax, cl. AMD has single-uop variable-count shifts even for the legacy versions that do write flags (except when CL=0 and they have to leave FLAGS unmodified; that's why it costs more on Intel). GCC is still dumb and zero-extends in place instead of into RAX. Clang gets it right (and takes advantage of the unofficial calling convention feature where narrow function args are sign or zero-extended to 32-bit so it can use mov instead of movzx) https://godbolt.org/z/9wrYEN
Clang compiles an if() to branchless code using CMOV, so that's significantly worse than the simple version that uses uint64_t(a) << n. It's a missed optimization that it doesn't compile my hr1 the same as my hr3; they're equivalent.
GCC actually branches and then uses mov reg, 1 / shl / or for the if version. Again it could compile it the same as hr3 if it chose to. (It can assume that a=1 implies n<=63, otherwise the if version would have shift UB.)
The missed optimization in both is the failure to use bts, which implements reg |= 1<<(n&63).
Especially for gcc after branching, so it knows it's shifting a constant 1, the tail of the function should be bts rax, rdx, which is 1 uop with 1c latency on Intel, 2 uops on AMD Zen1 / Zen2. GCC and clang do know how to use bts for the simple case of a compile-time-constant a=1, though: https://godbolt.org/z/rkhbzH
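For reference, a hedged sketch (mine) of that simple constant-1 case:
#include <cstdint>

uint64_t set_bit(uint64_t x, int n) {
    return x | (uint64_t{1} << (n & 63));   // per the godbolt link, gcc/clang emit bts here
}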
There's no way that I know of to hand-hold GCC or clang into using bts otherwise, and I wouldn't recommend inline-assembly for this unless it's in the most critical inner loop of something and you're prepared to check that it doesn't hurt other optimizations, and to maintain it. i.e. just don't.
But ideally GCC / clang would do something like this when BMI2 isn't available:
# hand optimized, compilers should do this but don't.
mov rax, rdi # x
bts rdi, rdx # x | 1<<(n&63)
test sil, sil
cmovnz rax, rdi # return a ? x_with_bit_set : x;
ret
Doesn't require BMI2, but still only 4 uops on Broadwell and later. (And 5 uops on AMD Bulldozer / Zen). Critical path latencies:
x -> retval: 2 cycles (through (MOV and BTS) -> CMOV) on Broadwell and later. 3 cycles on earlier Intel (2 uop cmov) and on any AMD (2 uop BTS).
n -> retval: same as x (through BTS -> CMOV).
a -> retval: 2 cycles (through TEST -> CMOV) on Broadwell and later, and all AMD. 3 cycles on earlier Intel (2 uop cmov).
This is pretty obviously better than what clang emits for any version without -march=skylake or other BMI2, and better still than what GCC emits (unless branchy turns out to be a good strategy).
One way that clang will use BTS:
If we mask the shift count for the branchy version, then clang will actually branch, and on the branch where the if body runs it implements it with bts as I described above. https://godbolt.org/z/BtT4w6
uint64_t hr1(uint64_t x, bool a, int n) noexcept
{
    if (a) {
        return x | (uint64_t(a) << (n&63));
    }
    return x;
}
clang 9.0 -O3 (without -march=)
hr1(unsigned long, bool, int):
mov rax, rdi
test sil, sil
je .LBB0_2 # if(a) {
bts rax, rdx # x |= 1<<(n&63)
.LBB0_2: # }
ret
So if branchy is good for your use-case, then this way of writing it compiles well with clang.
These stand-alone versions might end up different after inlining into a real caller.
For example, a caller might save a MOV instruction if it can have the shift count n already in CL. Or the decision on whether to do if-conversion from an if to a branchless sequence might be different.
Or if n is a compile-time constant, that means we don't need BMI2 to save uops on the shift anymore; immediate shifts are fully efficient on all modern CPUs (single uop).
And of course if a is a compile time constant then it's either nothing to do or optimizes to a bts.
Further reading: see the performance links in https://stackoverflow.com/tags/x86/info for more about how to decide if asm is efficient by looking at it.

What is the compiler doing here that allows comparison of many values to be done with few actual comparisons?

My question is about what the compiler is doing in this case that optimizes the code way more than what I would think is possible.
Given this enum:
enum MyEnum {
    Entry1,
    Entry2,
    ... // Entry3..27 are the same, omitted for size.
    Entry28,
    Entry29
};
And this function:
bool MyFunction(MyEnum e)
{
    if (
        e == MyEnum::Entry1 ||
        e == MyEnum::Entry3 ||
        e == MyEnum::Entry8 ||
        e == MyEnum::Entry14 ||
        e == MyEnum::Entry15 ||
        e == MyEnum::Entry18 ||
        e == MyEnum::Entry21 ||
        e == MyEnum::Entry22 ||
        e == MyEnum::Entry25)
    {
        return true;
    }
    return false;
}
For the function, MSVC generates this assembly when compiled with -Ox optimization flag (Godbolt):
bool MyFunction(MyEnum) PROC ; MyFunction
cmp ecx, 24
ja SHORT $LN5#MyFunction
mov eax, 20078725 ; 01326085H
bt eax, ecx
jae SHORT $LN5#MyFunction
mov al, 1
ret 0
$LN5#MyFunction:
xor al, al
ret 0
Clang generates similar (slightly better, one less jump) assembly when compiled with -O3 flag:
MyFunction(MyEnum): # #MyFunction(MyEnum)
cmp edi, 24
ja .LBB0_2
mov eax, 20078725
mov ecx, edi
shr eax, cl
and al, 1
ret
.LBB0_2:
xor eax, eax
ret
What is happening here? I see that even if I add more enum comparisons to the function, the assembly that is generated does not actually become "more", it's only this magic number (20078725) that changes. That number depends on how many enum comparisons are happening in the function. I do not understand what is happening here.
The reason why I am looking at this is that I was wondering if it is good to write the function as above, or alternatively like this, with bitwise comparisons:
bool MyFunction2(MyEnum e)
{
    if (
        e == MyEnum::Entry1 |
        e == MyEnum::Entry3 |
        e == MyEnum::Entry8 |
        e == MyEnum::Entry14 |
        e == MyEnum::Entry15 |
        e == MyEnum::Entry18 |
        e == MyEnum::Entry21 |
        e == MyEnum::Entry22 |
        e == MyEnum::Entry25)
    {
        return true;
    }
    return false;
}
This results in this generated assembly with MSVC:
bool MyFunction2(MyEnum) PROC ; MyFunction2
xor edx, edx
mov r9d, 1
cmp ecx, 24
mov eax, edx
mov r8d, edx
sete r8b
cmp ecx, 21
sete al
or r8d, eax
mov eax, edx
cmp ecx, 20
cmove r8d, r9d
cmp ecx, 17
sete al
or r8d, eax
mov eax, edx
cmp ecx, 14
cmove r8d, r9d
cmp ecx, 13
sete al
or r8d, eax
cmp ecx, 7
cmove r8d, r9d
cmp ecx, 2
sete dl
or r8d, edx
test ecx, ecx
cmove r8d, r9d
test r8d, r8d
setne al
ret 0
Since I do not understand what happens in the first case, I can not really judge which one is more efficient in my case.
Quite smart! The first comparison with 24 does a rough range check - if it's more than 24 or less than 0 it will bail out; this is important, as the instructions that follow, which operate on the magic number, have a hard cap of [0, 31] on the operand range.
For the rest, the magic number is just a bitmask, with the bits corresponding to the "good" values set.
>>> bin(20078725)
'0b1001100100110000010000101'
It's easy to spot the first and third bits (counting from 1 and from right) set, the 8th, 14th, 15th, ...
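For concreteness, here's a hedged sketch (mine) showing how the constant follows from the enumerator values (Entry1 == 0, Entry3 == 2, ..., Entry25 == 24):
constexpr unsigned bitmap =
    (1u << 0)  | (1u << 2)  | (1u << 7)  | (1u << 13) | (1u << 14) |
    (1u << 17) | (1u << 20) | (1u << 21) | (1u << 24);
static_assert(bitmap == 20078725, "matches the compilers' magic number");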
MSVC checks it "directly" using the BT (bit test) instruction and branching; clang instead shifts it right by the appropriate amount (to get the relevant bit into the lowest-order position) and keeps just that bit by ANDing with 1 (avoiding a branch).
The C code corresponding to the clang version would be something like:
bool MyFunction(MyEnum e) {
    if(unsigned(e) > 24) return false;
    return (20078725 >> e) & 1;
}
as for the MSVC version, it's more like
inline bool bit_test(unsigned val, int bit) {
    return val & (1<<bit);
}

bool MyFunction(MyEnum e) {
    if(unsigned(e) > 24) return false;
    return bit_test(20078725, e);
}
(I kept the bit_test function separate to emphasize that it's actually a single instruction in assembly; that val & (1<<bit) thing has no direct correspondence in the original assembly.)
As for the code generated from the second snippet, it's quite bad - it uses a lot of CMOVs and ORs the results together, which is both longer code and will probably serialize execution. I suspect the corresponding clang code will be better. OTOH, you wrote this code using bitwise OR (|) instead of the more semantically correct logical OR (||), and the compiler is strictly following your orders (typical of MSVC).
Another possibility to try instead could be a switch - but I don't think there's much to gain compared to the code already generated for the first snippet, which looks pretty good to me.
Ok, doing a quick test with all the versions against all compilers, we can see that:
the C translation of the CLang output above results in pretty much that same code (= to the clang output) in all compilers; similarly for the MSVC translation;
the bitwise or version is the same as the logical or version (= good) in both CLang and gcc;
in general, gcc does essentially the same thing as CLang except for the switch case;
switch results are varied:
CLang does best, by generating the exact same code;
both gcc and MSVC generate jump-table based code, which in this case is less good; however:
gcc prefers to emit a table of QWORDs, trading size for simplicity of the setup code;
MSVC instead emits a table of BYTEs, paying for it in setup code size; I couldn't get gcc to emit similar code even when changing -O3 to -Os (optimize for size).
Ah, the old immediate bitmap trick.
GCC does this too, at least for a switch.
x86 asm casetable implementation. Unfortunately GCC9 has a regression for some cases: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91026#c3 ; GCC8 and earlier do a better job.
Another example of using it, this time for code-golf (fewest bytes of code, in this case x86 machine code) to detect certain letters: User Appreciation Challenge #1: Dennis ♦
The basic idea is to use the input as an index into a bitmap of true/false results.
First you have to range-check because the bitmap is fixed-width, and x86 shifts wrap the shift count. We don't want high inputs to alias into the range where there are some that should return true. That's what the cmp edi, 24 / ja is doing.
(If the range between the lowest and highest true values was from 120 to 140, for example, it might start with a sub edi,120 to range-shift everything before the cmp.)
Then you use bitmap & (1<<e) (the bt instruction), or (bitmap >> e) & 1 (shr / and) to check the bit in the bitmap that tells you whether that e value should return true or false.
There are many ways to implement that check, logically equivalent but with performance differences.
If the range was wider than 32, it would have to use 64-bit operand-size. If it was wider than 64, the compiler might not attempt this optimization at all. Or might still do it for some of the conditions that are in a narrow range.
Using an even larger bitmap (in .rodata memory) would be possible but probably not something most compilers will invent for you. Either with bt [mem],reg (inefficient) or manually indexing a dword and checking that the same way this code checks the immediate bitmap. If you had a lot of high-entropy ranges it might be worth checking 2x 64-bit immediate bitmap, branchy or branchless...
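A hedged sketch (mine, not from the answer) of that in-memory variant: index a dword from a static table and test one bit, the same way the immediate bitmap is checked, but scaling past 64 possible values:
#include <cstdint>

static const uint32_t table[8] = { 20078725u, 0, 0, 0, 0, 0, 0, 0 };   // covers inputs 0..255

bool in_set(unsigned v) {
    return v < 256 && ((table[v >> 5] >> (v & 31)) & 1);
}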
Clang/LLVM has other tricks up its sleeve for efficiently comparing against multiple values (when it doesn't matter which one is hit), e.g. broadcast a value into a SIMD register and use a packed compare. That isn't dependent on the values being in a dense range. (Clang generates worse code for 7 comparisons than for 8 comparisons)
that optimizes the code way more than what I would think is possible.
These kinds of optimizations come from smart human compiler developers that notice common patterns in source code and think of clever ways to implement them. Then get compilers to recognize those patterns and transform their internal representation of the program logic to use the trick.
Turns out that switch and switch-like if() statements are common, and aggressive optimizations are common.
Compilers are far from perfect, but sometimes they do come close to living up to what people often claim; that compilers will optimize your code for you so you can write it in a human-readable way and still have it run near-optimally. This is sometimes true over the small scale.
Since I do not understand what happens in the first case, I can not really judge which one is more efficient in my case.
The immediate bitmap is vastly more efficient. There's no data memory access in either one so no cache miss loads. The only "expensive" instruction is a variable-count shift (3 uops on mainstream Intel, because of x86's annoying FLAGS-setting semantics; BMI2 shrx is only 1 uop and avoids having to mov the count to ecx.) https://agner.org/optimize. And see other performance analysis links in https://stackoverflow.com/tags/x86/info.
Each instruction in the cmp/cmov chain is at least 1 uop, and there's a pretty long dependency chain through each cmov because MSVC didn't bother to break it into 2 or more parallel chains. But regardless it's just a lot of uops, far more than the bitmap version, so worse for throughput (ability for out-of-order exec to overlap the work with surrounding code) as well as latency.
bt is also cheap: 1 uop on modern AMD and Intel. (bts, btr, btc are 2 on AMD, still 1 on Intel).
The branch in the immediate-bitmap version could have been a setna / and to make it branchless, but especially for this enum definition the compiler expected that it would be in range. It could have increased branch predictability by only requiring e <= 31, not e <= 24.
Since the enum only goes up to 29, and IIRC it's UB to have out-of-range enum values, it could actually optimize it away entirely.
Even if the e>24 branch doesn't predict very well, it's still probably better overall. Given current compilers, we only get a choice between the nasty chain of cmp/cmov or branch + bitmap. Unless we turn the asm logic back into C to hand-hold compilers into making the asm we want; then we can maybe get branchless code with an AND or CMOV to make the result always zero for out-of-range e.
But if we're lucky, profile-guided optimization might let some compilers make the bitmap range check branchless. (In asm the behaviour of shl reg, cl with cl > 31 or 63 is well-defined: on x86 it simply masks the count. In a C equivalent, you could use bitmap >> (e&31) which can still optimize to a shr; compilers know that x86 shr masks the count so they can optimize that away. But not for other ISAs that saturate the shift count...)
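A hedged sketch (mine) of that branchless idea: mask the shift count (which x86 shr does anyway, so the mask costs nothing) and AND in the range check, so out-of-range e gives zero without a branch:
bool MyFunctionBranchless(unsigned e) {
    return ((20078725u >> (e & 31)) & 1) & unsigned(e <= 24);
}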
There are lots of ways to implement the bitmap check that are pretty much equivalent. e.g. you could even use the CF output of shr, set according to the last bit shifted out. At least if you make sure CF has a known state ahead of time for the cl=0 case.
When you want an integer bool result, right-shifting seems to make more sense than bt / setcc, but with shr costing 3 uops on Intel it might actually be best to use bt reg,reg / setc al. Especially if you only need a bool, and can use EAX as your bitmap destination so the previous value of EAX is definitely ready before setcc. (Avoiding a false dependency on some unrelated earlier dep chain.)
BTW, MSVC has other silliness: as What is the best way to set a register to zero in x86 assembly: xor, mov or and? explains, xor al,al is totally stupid compared to xor eax,eax when you want to zero AL. If you don't need to leave the upper bytes of RAX unmodified, zero the full register with a zeroing idiom.
And of course branching just to return 0 or return 1 makes little sense, unless you expect it to be very predictable and want to break the data dependency. I'd expect that setc al would make more sense to read the CF result of bt.