Find position of first (lowest) set bit in 32-bit number - c++

I need to get the index of the set bit in a 32-bit number which always has exactly one bit set. What is the fastest way to do this in C++ or asm?
For example
input: 0x00000001, 0x10000000
output: 0, 28

#ifdef __GNUC__, use __builtin_ctz(unsigned) to Count Trailing Zeros (GCC manual). GCC, clang, and ICC all support it on all target ISAs. (On ISAs where there's no native instruction, it will call a GCC helper function.)
Leading vs. trailing refers to printing order, MSB-first: 8-bit binary 00000010 has 6 leading zeros and one trailing zero. (Widened to 32 bits, it has 24+6 = 30 leading zeros.)
For 64-bit integers, use __builtin_ctzll(unsigned long long). It's unfortunate that GNU C bitscan builtins don't take fixed-width types (especially the leading zeros versions), but unsigned is always 32-bit on GNU C for x86 (although not for AVR or MSP430). unsigned long long is always uint64_t on all GNU C targets I'm aware of.
On x86, it will compile to bsf or tzcnt depending on tuning + target options. tzcnt is a single uop with 3 cycle latency on modern Intel, and only 2 uops with 2 cycle latency on AMD (perhaps a bit-reverse to feed an lzcnt uop?) https://agner.org/optimize/ / https://uops.info/. Either way it's directly supported by fast hardware, and is much faster than anything you can do in pure C++. About the same cost as x * 1234567 (on Intel CPUs, bsf/tzcnt has the same cost as imul r, r, imm, in front-end uops, back-end port, and latency.)
The builtin has undefined behaviour for inputs with no bits set, allowing it to avoid any extra checks if it might run as bsf.
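A minimal sketch of a portable wrapper (assuming the input always has exactly one bit set, as in the question; the function name is made up):

#include <stdint.h>

// Returns the 0-based index of the single set bit.
// Undefined for x == 0, matching __builtin_ctz.
static inline unsigned bit_index(uint32_t x) {
#ifdef __GNUC__
    return __builtin_ctz(x);            // compiles to bsf/tzcnt on x86
#else
    unsigned i = 0;                      // portable fallback, much slower
    while (!(x & 1)) { x >>= 1; ++i; }
    return i;
#endif
}
// bit_index(0x00000001) == 0, bit_index(0x10000000) == 28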
In other compilers (specifically MSVC), you might want an intrinsic for TZCNT, like _mm_tzcnt_32 from immintrin.h. (Intel intrinsics guide). Or you might need to include intrin.h (MSVC) or x86intrin.h for non-SIMD intrinsics.
Unlike GCC/clang, MSVC doesn't stop you from using intrinsics for ISA extensions you haven't enabled for the compiler to use on its own.
MSVC also has _BitScanForward / _BitScanReverse for actual BSF/BSR, but the leave-destination-unmodified behaviour that AMD guarantees (and Intel also implements) is still not exposed by these intrinsics, despite their pointer-output API.
VS: unexpected optimization behavior with _BitScanReverse64 intrinsic - pointer-output is assumed to always be written :/
_BitScanForward _BitScanForward64 missing (VS2017) Snappy - correct headers
How to use MSVC intrinsics to get the equivalent of this GCC code?
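For reference, a minimal MSVC sketch using _BitScanForward from <intrin.h> (the wrapper name is made up; the result is only meaningful for non-zero inputs):

#include <intrin.h>

// Returns the 0-based index of the lowest set bit; x must be non-zero.
static unsigned lowest_set_bit(unsigned long x)
{
    unsigned long index;
    _BitScanForward(&index, x);   // returns 0 and leaves index unspecified if x == 0
    return (unsigned)index;
}
// lowest_set_bit(0x10000000) == 28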
TZCNT decodes as BSF on CPUs without BMI1 because its machine-code encoding is rep bsf. They give identical results for non-zero inputs, so compilers can and do always just use tzcnt because that's much faster on AMD. (They're the same speed on Intel so there's no downside. And on Skylake and later, tzcnt has no false output dependency. BSF does, because it leaves its output unmodified for input=0.)
(The situation is less convenient for bsr vs. lzcnt: bsr returns the bit-index, lzcnt returns the leading-zero count. So for best performance on AMD, you need to know that your code will only run on CPUs supporting BMI1 / TBM so the compiler can use lzcnt)
Note that with exactly 1 bit set, scanning from either direction will find the same bit. So 31 - lzcnt = bsr is the same in this case as bsf = tzcnt. Possibly useful if porting to another ISA that only has leading-zero count and no bit-reverse instruction.
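For example (a sketch assuming GNU C builtins), with exactly one bit set both directions give the same index:

#include <stdint.h>

// Both return 28 for x = 0x10000000 when exactly one bit is set.
unsigned idx_from_tz(uint32_t x) { return __builtin_ctz(x); }
unsigned idx_from_lz(uint32_t x) { return 31 - __builtin_clz(x); }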
Related:
Why does breaking the "output dependency" of LZCNT matter? Modern compilers generally know to break the false dependency for lzcnt/tzcnt/popcnt. bsf/bsr have one too, and I think GCC knows about that as well, but ironically it might not break it.
How can x86 bsr/bsf have fixed latency, not data dependent? Doesn't it loop over bits like the pseudocode shows? - the pseudocode is not the hardware implementation.
https://en.wikipedia.org/wiki/Find_first_set has more about bitscan functions across ISAs. Including POSIX ffs() which returns a 1-based index and has to do extra work to account for the possibility of the input being 0.
Compilers do recognize ffs() and inline it like a builtin (like they do for memcpy or sqrt), but don't always manage to optimize away all the work their canned sequence does to implement it when you actually want a 0-based index. It's especially hard to tell the compiler there's only 1 bit set.

Related

What is "MAX" referring to in the intel intrinsics documentation?

Within the Intel Intrinsics Guide, some operations are defined using the term "MAX". An example is __m256 _mm256_mask_permutexvar_ps (__m256 src, __mmask8 k, __m256i idx, __m256 a), which is defined as
FOR j := 0 to 7
    i := j*32
    id := idx[i+2:i]*32
    IF k[j]
        dst[i+31:i] := a[id+31:id]
    ELSE
        dst[i+31:i] := 0
    FI
ENDFOR
dst[MAX:256] := 0
Please take note of the last line within this definition: dst[MAX:256] := 0. What is MAX referring to, and is this line even adding any valuable information? If I had to make an assumption, then MAX probably means the number of bits within the vector, which is 256 in the case of _mm256. This however does not seem to change anything for the definition of the operation and might as well have been omitted. But why is it there then?
This pseudo-code only makes sense for assembly documentation, where it was copied from, not for intrinsics. (HTML scrape of Intel's vol.2 PDF documenting the corresponding vpermps asm instruction.)
...
ENDFOR
DEST[MAXVL-1:VL] ← 0
(The same asm doc entry covers VL = 128, 256, and 512-bit versions, the vector width of the instruction.)
In asm, a YMM register is the low half of a ZMM register, and writing a YMM zeroes the upper bits out to the CPU's max supported vector width (just like writing EAX zero-extends into RAX).
The intrinsic you picked is for the masked version, so it requires AVX-512 (EVEX encoding), thus VLMAX is at least 512 (see footnote 1). If the mask is a constant all-ones, it could get optimized to the AVX2 VEX encoding, but both still zero the high bits of the full register out to VLMAX.
This is meaningless for intrinsics
The intrinsics API just has __m256 and __m512 types; an __m256 is not implicitly the low half of an __m512. You can use _mm512_castps256_ps512 to get a __m512 with your __m256 as the low half, but the API documentation says "the upper 256 bits of the result are undefined". So if you use it on a function arg, it doesn't force it to vmovaps ymm7, ymm0 or something to zero-extend into a ZMM register in case the caller left high garbage.
If you use _mm512_castps256_ps512 on a __m256 that came from an intrinsic in this function, it pretty much always will happen to compile with a zeroed high half whether it stayed in a reg or got stored/reloaded, but that's not guaranteed by the API. (If the compiler chose to combine a previous calculation with something else, using a 512-bit operation, you could plausibly end up with a non-zero high half.) If you want high zeros, there's no equivalent to _mm256_set_m128 (__m128 hi, __m128 lo), so you need some other explicit way.
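One explicit way, as a sketch (assuming AVX-512DQ for vinsertf32x8; the helper name is made up), is to insert the 256-bit value into an explicitly zeroed 512-bit vector:

#include <immintrin.h>

// Zero-extend a __m256 into a __m512 with guaranteed-zero upper 256 bits.
static inline __m512 zext256_to_512(__m256 lo) {
    return _mm512_insertf32x8(_mm512_setzero_ps(), lo, 0);
}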
Footnote 1: Or with some hypothetical future extension, VLMAX aka MAXVL could be even wider. It's determined by the current value of XCR0. This documentation is telling you these instructions will still zero out to whatever that is.
(I haven't looked into whether changing VLMAX is possible on a machine supporting AVX-512, or if it's read-only. IDK how the CPU would handle it if you can change it, like maybe not running 512-bit instructions at all. Mainstream OSes certainly don't do this even if it's possible with privileged operations.)
SSE didn't have any defined mechanism for extension to wider vectors, and some existing code (notably Windows kernel drivers) manually saved/restored a few XMM registers for their own use. To support that, AVX decided that legacy SSE would leave the high part of YMM/ZMM registers unmodified. But to run existing machine code using non-VEX legacy SSE encodings efficiently, it needed expensive state transitions (Haswell and Ice Lake) and/or false dependencies (Skylake): Why is this SSE code 6 times slower without VZEROUPPER on Skylake?
Intel wasn't going to make this mistake again, so they defined AVX as zeroing out to whatever vector width the CPU supports, and document it clearly in every AVX and AVX-512 instruction encoding. Thus VEX and EVEX can be mixed freely, even being useful to save machine-code size:
What is the most efficient way to clear a single or a few ZMM registers on Knights Landing?
What is the penalty of mixing EVEX and VEX encoded scheme? (none), with an answer discussing more details of why SSE/AVX penalties are a thing.
https://software.intel.com/en-us/forums/intel-isa-extensions/topic/301853 Agner Fog's 2008 post on Intel's forums about AVX, when it was first announced, pointing out the problem created by the lack of foresight with SSE.
Does vzeroall zero registers ymm16 to ymm31? - interestingly no; since they're not accessible via legacy SSE instructions, they can't be part of a dirty-uppers problem.
Bits in the registers are numbered with high indices on the “left” and low indices on the “right”. This matches how we write and talk about binary numerals: 10010₂ is the binary numeral for 18, with bit number 4, representing 2⁴ = 16, on the left and bit number 0, representing 2⁰ = 1, on the right.
R[m:n] denotes the set of bits of register R from m down to n, with m being the “left” end of the set and n being the “right” end. If m is less than n, then it is the empty set. Therefore, for registers with 512 bits, dst[511:256] := 0 says to set bits 511 to 256 to zero, and, for registers with 256 bits, dst[255:256] := 0 says to do nothing.
dst[MAX:256] := 0 sets all bits at and above bit 256 to zero. It is only relevant to registers having more than 256 bits. So MAX is 255 if the register is 256 bits long, or 511 if the processor is using 512-bit registers (making the assignment a no-op in the 256-bit case).

Instruction/intrinsic for taking higher half of uint64_t in C++?

Imagine the following code:
Try it online!
uint64_t x = 0x81C6E3292A71F955ULL;
uint32_t y = (uint32_t) (x >> 32);
y receives the high 32-bit half of the 64-bit integer. My question is whether there exists any intrinsic function or any CPU instruction that does this in a single operation, without doing a move and a shift?
At least Clang (linked in the Try-it-online above) creates two instructions, mov rax, rdi and shr rax, 32, for this, so either Clang doesn't do such an optimization, or there exists no such special instruction.
It would be great if there were a single instruction like an imaginary movhi dst_reg, src_reg.
If there was a better way to do this bitfield-extraction for an arbitrary uint64_t, compilers would already use it. (At least in theory; compilers do have missed optimizations, and their choices sometimes favour latency even if it costs more uops.)
You only need intrinsics for things that you can't express efficiently in pure C, in ways the compiler can already easily understand. (Or if your compiler is dumb and can't spot the obvious.)
You could maybe imagine cases where the input value comes from the multiply of two 32-bit values, then it might be worthwhile on some CPUs for the compiler to use widening mul r32 to already generate the result in two separate 32-bit registers, instead of imul r64, r64 + shr reg,32, if it can easily use EAX/EDX. But other than gcc -mtune=silvermont or other tuning options, you can't make the compiler do it that way.
shr reg, 32 has 1 cycle latency, and can run on more than 1 execution port on most modern x86 microarchitectures (https://uops.info/). The only thing one might wish for is that it could put the result in a different register, without overwriting the input.
Most modern non-x86 ISAs are RISC-like with 3-operand instructions, so a shift instruction can copy-and-shift, unlike x86 shifts where the compiler needs a mov in addition to shr if it also needs the original 64-bit value later, or (in the case of a tiny function) needs the return value in a different register.
And some ISAs have bitfield-extract instructions. PowerPC even has a fun rotate-and-mask instruction (rlwinm) (with the mask being a bit-range specified by immediates), and it's a different instruction from a normal shift. Compilers will use it as appropriate - no need for an intrinsic. https://devblogs.microsoft.com/oldnewthing/20180810-00/?p=99465
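For illustration, a plain shift-and-mask in C is all the compiler needs to see to use such an instruction where one exists (a sketch; the field position and width are arbitrary example values):

#include <stdint.h>

// Extract an 8-bit field starting at bit 12; on ISAs with a bitfield-extract
// (or rotate-and-mask) instruction, compilers can emit it for this pattern.
static inline uint32_t extract_field(uint64_t x) {
    return (uint32_t)((x >> 12) & 0xFF);
}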
x86 with BMI2 has rorx rax, rdi, 32 to copy-and-rotate, instead of being stuck shifting within the same register. A function returning uint32_t could/should use that instead of mov+shr, in the stand-alone version that doesn't inline because the caller already has to ignore high garbage in RAX. (Both x86-64 System V and Windows x64 define the return value as only the register width matching the C type of the arg; e.g. returning uint32_t means that the high 32 bits of RAX are not part of the return value, and can hold anything. Usually they're zero because writing a 32-bit register implicitly zero-extends to 64, but something like return bar() where bar returns uint64_t can just leave RAX untouched without having to truncate it; in fact an optimized tailcall is possible.)
There's no intrinsic for rorx; compilers are just supposed to know when to use it. (But gcc/clang -O3 -march=haswell miss this optimization.) https://godbolt.org/z/ozjhcc8Te
If a compiler was doing this in a loop, it could have 32 in a register for shrx reg,reg,reg as a copy-and-shift. Or more silly, it could use pext with 0xffffffffULL << 32 as the mask. But that's strictly worse than shrx because of the higher latency.
AMD TBM (Bulldozer-family only, not Zen) had an immediate form of bextr (bitfield-extract), and it ran efficiently as 1 uop (https://agner.org/optimize/). https://godbolt.org/z/bn3rfxzch shows gcc11 -O3 -march=bdver4 (Excavator) uses bextr rax, rdi, 0x2020, while clang misses that optimization. gcc -march=znver1 uses mov + shr because Zen dropped Trailing Bit Manipulation along with the XOP extension.
Standard BMI1 bextr needs position/len in a register, and on Intel CPUs is 2 uops so it's garbage for this. It does have an intrinsic, but I recommend not using it. mov+shr is faster on Intel CPUs.

Optimized code in VC++ and ASM

Good evening. Sorry, I used Google Translate.
I use NASM with VC++ on x86 and I'm learning how to use MASM on x64.
Is there any way to specify where each argument of an assembly function goes, and where its return value comes out, so that the compiler can place the data there in the fastest way? Can we also specify which registers will be used, so the compiler knows which data is still live and can make the best use of it?
For example, since there is no intrinsic function that applies exactly the IDIV r/m64 (64-bit signed integer division of assembly language), we may need to implement it. IDIV requires that the low part of the dividend/numerator be in RAX, the high part in RDX, and the divisor/denominator in any register or in a region of memory. At the end, the quotient is in RAX and the remainder in RDX. We may therefore want to develop functions like this (I put in some useless operands just to illustrate):
void DivLongLongInt( long long NumLow , long long NumHigh , long long Den , long long *Quo , long long *Rem ){
    __asm(
        // Specify used register: [rax], specify pre location: NumLow --> [rax]
        reg(rax)=NumLow ,
        // Specify used register: [rdx], specify pre location: NumHigh --> [rdx]
        reg(rdx)=NumHigh ,
        // Specify required memory: memory64bits [den], specify pre location: Den --> [den]
        mem[64](den)=Den ,
        // Specify used register: [st0], specify pre location: Const(12.5) --> [st0]
        reg(st0)=25*0.5 ,
        // Specify used register: [bh]
        reg(bh) ,
        // Specify required memory: memory64bits [nothing]
        mem[64](nothing) ,
        // Specify used register: [st1]
        reg(st1)
    ){
        // Specify code
        IDIV [den]
    }(
        // Specify pos location: [rax] --> *Quo
        *Quo=reg(rax) ,
        // Specify pos location: [rdx] --> *Rem
        *Rem=reg(rdx)
    ) ;
}
Is it possible to do something at least close to that?
Thanks for all help.
If there is no way to do this, it's a shame, because it would certainly be a great way to implement high-level functions with assembly-level features. I think it's a simple interface between C++ and ASM that should already exist, enabling assembly code to be embedded inline and at a high level, practically as simple C++ code.
As others have mentioned, MSVC does not support any form of inline assembly when targeting x86-64.
Inline assembly is supported only in x86-32 builds, and even there, it is rather limited in what you can do. In particular, you can't specify inputs and outputs, so the use of inline assembly necessarily entails a lot of shuffling of values back and forth between registers and memory, which is precisely the opposite of what you want when writing high-performance code. Unless there is something that you cannot possibly do any other way except by causing the manual emission of machine code, you should avoid the inline assembler. Its original purpose was to do things like generate OUT instructions and call ROM BIOS interrupts in obsolete 8-bit and 16-bit programming environments. It made it into the 32-bit compiler for compatibility purposes, but the team drew the line with 64-bit.
Intrinsics are now the recommended solution, because these play much better with the optimizer. Virtually any SIMD code that you need the compiler to generate can be accomplished using intrinsics, just as you would on most any other compiler targeting x86, so not only are you getting better code, but you're also getting slightly more portable code.
Even on Gnu-style compilers that support extended asm blocks, which give you the type of input/output operand power that you are looking for, there are still lots of good reasons to avoid the use of inline asm. Intrinsics are still a better solution there, as is finding a way to represent what you want in C and persuading the compiler to generate the assembly code that you wish it to emit.
The only exception is cases where there are no intrinsics available. The IDIV instruction is, unfortunately, one of those cases. (There are intrinsics available for 128-bit multiplication. They go by various names: either Windows-specific or compiler-specific.)
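For example, a sketch of the Windows-specific multiplication intrinsic, _umul128 from <intrin.h> (x64 only), which returns the low half of the product and writes the high half through a pointer:

#include <intrin.h>
#include <stdint.h>

// High 64 bits of the full 128-bit product of two unsigned 64-bit values.
uint64_t mulhi_u64(uint64_t a, uint64_t b) {
    uint64_t hi;
    (void)_umul128(a, b, &hi);
    return hi;
}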
On Gnu compilers that support 128-bit integer types as an extension on 64-bit targets, you can get the compiler to generate the code for you:
#include <stdint.h>

__int128_t dividend = 1234;     // GCC/clang extension: 128-bit signed integer
int64_t divisor = 64;
int64_t quotient = (dividend / divisor);
Now, this is generally compiled as a call to their library function that does 128-bit division, rather than an inline IDIV instruction that returns a 64-bit quotient. Presumably, this is because of the need to handle overflows, as David mentioned. Actually, it's worse than that. No C or C++ implementation can use the DIV/IDIV instructions because they are non-conforming. These instructions will result in overflow exceptions, whereas the standard says that the result should be truncated. (With multiplication, you do get inline IMUL/MUL instruction(s) because these don't have the overflow problem, since they return 128-bit results.)
This isn't actually as big of a loss as you might think. You seem to be assuming that the 64-bit IDIV instruction is really fast. It isn't. Although the actual numbers vary depending on the number of significant bits in the absolute value of the dividend, your values probably are quite large if you actually need the range of a 128-bit integer. Looking at Agner Fog's instruction tables will give you some idea of the performance you can expect on various architectures. It's getting faster on newer architectures (especially on the newer AMD processors; it's still sluggish on Intel), but it still has pretty substantial latencies. Just because it's one instruction doesn't mean that it runs in one cycle or anything like that. A single instruction might be good for code density when you're optimizing for size and worried about a call to a library function evicting other instructions from your cache, but division is a slow enough operation that this usually doesn't matter. In fact, division is so slow that compilers try very hard not to use it—whenever possible, they will do multiplication by the reciprocal, which is significantly faster. And if you're really needing to do multiplications quickly, you should look into parallelizing them with SIMD instructions, which all have intrinsics available.
Back to MSVC (although everything I said in the last paragraph still applies, of course), there are no 128-bit integer types, so if you need to implement this type of division, you will need to write the code in an external assembly module and link it in. The code is pretty simple, and Visual Studio has excellent, built-in support for assembling code with MASM and linking it directly into your project:
; Windows 64-bit calling convention passes parameters as follows:
; RCX == first 64-bit integer parameter (low bits of dividend)
; RDX == second 64-bit integer parameter (high bits of dividend)
; R8  == third 64-bit integer parameter (divisor)
; R9  == fourth 64-bit integer parameter (pointer to remainder)
Div128x64 PROC
    mov     rax, rcx
    idiv    r8              ; 128-bit divide (RDX:RAX / R8)
    mov     [r9], rdx       ; store remainder
    ret                     ; return, with quotient in RAX
Div128x64 ENDP
Then you just prototype that in your C++ code as:
extern "C" int64_t Div128x64(int64_t loDividend,
                             int64_t hiDividend,
                             int64_t divisor,
                             int64_t* pRemainder);
and you're done. Call it as desired.
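A minimal usage sketch (the numbers are arbitrary example values):

#include <cstdint>

extern "C" int64_t Div128x64(int64_t loDividend, int64_t hiDividend,
                             int64_t divisor, int64_t* pRemainder);

int main() {
    int64_t rem;
    // Divide the 128-bit value hi:lo = 0:1000 by 7.
    int64_t quo = Div128x64(1000, 0, 7, &rem);   // quo == 142, rem == 6
    (void)quo;
    return 0;
}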
The equivalent can be written for unsigned division, using the DIV instruction.
No, you don't get intelligent register allocation, but this isn't really a big deal with register renaming in the front end that can often elide register-register moves entirely (in other words, MOVs become zero-latency operations). Plus, the IDIV instruction is so restrictive anyway in terms of its operands, since they are hardcoded to RAX and RDX, that it's pretty unlikely a scheduler would be able to keep the values in those registers anyway, at least for any non-trivial piece of code.
Beware that once you write the necessary code to check for the possibility of overflows, or worse—the code to handle exceptions—this will very likely end up performing the same or worse as a library function that does a proper 128-bit division, so you should arguably just write and use that (until such time as Microsoft sees fit to provide one). That can be written in C (also see implementation of __divti3 library function for Gnu compilers), which makes it a candidate for inlining and otherwise plays better with the optimizer.
No, it is not possible to do this. MSVC doesn't support inline assembly for x64 builds. Instead, you should use intrinsics; almost everything is available. The sad thing is, as far as I know, 128-bit idiv is missing from the intrinsics.
A note: you can solve your issue with two movs (to put the inputs in the correct registers). And you should not worry about that; current CPUs handle mov very well. Putting movs into the code may not slow it down at all. And div is very expensive compared to a mov, so it doesn't matter too much.

RDRAND and RDSEED intrinsics on various compilers?

Does Intel C++ compiler and/or GCC support the following Intel intrinsics, like MSVC does since 2012 / 2013?
#include <immintrin.h> // for the following intrinsics
int _rdrand16_step(uint16_t*);
int _rdrand32_step(uint32_t*);
int _rdrand64_step(uint64_t*);
int _rdseed16_step(uint16_t*);
int _rdseed32_step(uint32_t*);
int _rdseed64_step(uint64_t*);
And if these intrinsics are supported, since which version are they supported (with a compile-time constant to check, please)?
Both GCC and Intel compiler support them. GCC support was introduced at the end of 2010. They require the header <immintrin.h>.
GCC support has been present since at least version 4.6, but there doesn't seem to be any specific compile-time constant - you can just check __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 6).
All the major compilers support Intel's intrinsics for rdrand and rdseed via <immintrin.h>.
Somewhat recent versions of some compilers are needed for rdseed, e.g. GCC9 (2019) or clang7 (2018), although those have been stable for a good while by now. If you'd rather use an older compiler, or not enable ISA-extension options like -march=skylake, a library wrapper function (footnote 1) instead of the intrinsic is a good choice. (Inline asm is not necessary; I wouldn't recommend it unless you want to play with it.)
#include <immintrin.h>
#include <stdint.h>

// gcc -march=native or haswell or znver1 or whatever, or manually enable -mrdrnd
uint64_t rdrand64(){
    unsigned long long ret;              // not uint64_t; GCC/clang wouldn't compile.
    do{}while( !_rdrand64_step(&ret) );  // retry until success.
    return ret;
}
// and the equivalent for _rdseed64_step,
// and 32 and 16-bit sizes with unsigned and unsigned short.
Some compilers define __RDRND__ when the instruction is enabled at compile-time. GCC/clang since they supported the intrinsic at all, but only much later ICC (19.0). And with ICC, -march=ivybridge doesn't imply -mrdrnd or define __RDRND__ until 2021.1.
ICX is LLVM-based and behaves like clang.
MSVC doesn't define any macros; its handling of intrinsics is designed around runtime feature detection only, unlike gcc/clang where the easy way is compile-time CPU feature options.
Why do{}while() instead of while(){}? Turns out ICC compiles to a less-dumb loop with do{}while(), not uselessly peeling a first iteration. Other compilers don't benefit from that hand-holding, and it's not a correctness problem for ICC.
Why unsigned long long instead of uint64_t? The type has to agree with the pointer type expected by the intrinsic, or C and especially C++ compilers will complain, regardless of the object-representations being identical (64-bit unsigned). On Linux for example, uint64_t is unsigned long, but GCC/clang's immintrin.h define int _rdrand64_step(unsigned long long*), same as on Windows. So you always need unsigned long long ret with GCC/clang. MSVC is a non-problem as it can (AFAIK) only target Windows, where unsigned long long is the only 64-bit unsigned type.
But ICC defines the intrinsic as taking unsigned long* when compiling for GNU/Linux, according to my testing on https://godbolt.org/. So to be portable to ICC, you actually need #ifdef __INTEL_COMPILER; even in C++ I don't know a way to use auto or other type-deduction to declare a variable that matches it.
Compiler versions to support intrinsics
Tested on Godbolt; its earliest version of MSVC is 2015, and ICC 2013, so I can't go back any further. Support for _rdrand16_step / 32 / 64 were all introduced at the same time in any given compiler. 64 requires 64-bit mode.
          | CPU                    | gcc | clang | MSVC                | ICC
rdrand    | Ivy Bridge / Excavator | 4.6 | 3.2   | before 2015 (19.10) | before 13.0.1, but 19.0 for -mrdrnd defining __RDRND__. 2021.1 for -march=ivybridge to enable -mrdrnd
rdseed    | Broadwell / Zen 1      | 9.1 | 7.0   | before 2015 (19.10) | before(?) 13.0.1, but 19.0 also added -mrdrnd and -mrdseed options
The earliest GCC and clang versions don't recognize -march=ivybridge, only -mrdrnd. (GCC 4.9 and clang 3.6 for -march=ivybridge, not that you specifically want to target Ivy Bridge if modern CPUs are more relevant. So use a non-ancient compiler and set a CPU option appropriate for the CPUs you actually care about, or at least a -mtune= with a more recent CPU.)
Intel's new oneAPI / ICX compilers all support rdrand/rdseed, and are based on LLVM internals so they work similarly to clang for CPU options. (It doesn't define __INTEL_COMPILER, which is good because it's different from ICC.)
GCC and clang only let you use intrinsics for instructions you've told the compiler the target supports. Use -march=native if compiling for your own machine, or use -march=skylake or something to enable all the ISA extensions for the CPU you're targeting. But if you need your program to run on old CPUs and only use RDRAND or RDSEED after runtime detection, only those functions need __attribute__((target("rdrnd"))) or rdseed (see the sketch after the option list below), and they won't be able to inline into functions with different target options. Or using a separately-compiled library would be easier (footnote 1).
-mrdrnd: enabled by -march=ivybridge or -march=znver1 (or bdver4 Excavator APUs) and later
-mrdseed: enabled by -march=broadwell or -march=znver1 or later
Normally if you're going to enable one CPU feature, it makes sense to enable others that CPUs of that generation will have, and to set tuning options. But rdrand isn't something the compiler will use on its own (unlike BMI2 shlx for more efficient variable-count shifts, or AVX/SSE for auto-vectorization and array/struct copying and init). So enabling -mrdrnd globally likely won't make your program crash on pre-Ivy Bridge CPUs, if you check CPU features and don't actually run code that uses _rdrand64_step on CPUs without the feature.
But if you are only going to run your code on some specific kind of CPU or later, gcc -O3 -march=haswell is a good choice. (-march also implies -mtune=haswell, and tuning for Ivy Bridge specifically is not what you want for modern CPUs. You could -march=ivybridge -mtune=skylake to set an older baseline of CPU features, but still tune for newer CPUs.)
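A sketch of the per-function target-attribute approach mentioned above (GNU C only; the function name is made up), so the rest of the file can be built without -mrdrnd:

#include <immintrin.h>
#include <stdint.h>

// Only this function is allowed to use the intrinsic; the caller is expected
// to have verified RDRAND support via CPUID before calling it.
__attribute__((target("rdrnd")))
static uint64_t rdrand64_after_detection(void) {
    unsigned long long ret;
    do{}while( !_rdrand64_step(&ret) );   // retry until success
    return ret;
}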
Wrappers that compile everywhere
This is valid C++ and C. For C, you probably want static inline instead of inline so you don't need to manually instantiate an extern inline version in a .c in case a debug build decided not to inline. (Or use __attribute__((always_inline)) in GNU C.)
The 64-bit versions are only defined for x86-64 targets, because asm instructions can only use 64-bit operand-size in 64-bit mode. I didn't #ifdef __RDRND__ or #if defined(__i386__)||defined(__x86_64__), on the assumption that you'd only include this for x86(-64) builds at all, not cluttering the ifdefs more than necessary. It does only define the rdseed wrappers if that's enabled at compile time, or for MSVC where there's no way to enable them or to detect it.
There are some commented __attribute__((target("rdseed"))) examples you can uncomment if you want to do it that way instead of compiler options. rdrand16 / rdseed16 are intentionally omitted as not being normally useful. rdrand runs the same speed for different operand-sizes, and even pulls the same amount of data from the CPU's internal RNG buffer, optionally throwing away part of it for you.
#include <immintrin.h>
#include <stdint.h>

#if defined(__x86_64__) || defined (_M_X64)
// Figure out which 64-bit type the output arg uses
#ifdef __INTEL_COMPILER   // Intel declares the output arg type differently from everyone(?) else
// ICC for Linux declares rdrand's output as unsigned long, but must be long long for a Windows ABI
typedef uint64_t intrin_u64;
#else
// GCC/clang headers declare it as unsigned long long even for Linux where long is 64-bit, but uint64_t is unsigned long and not compatible
typedef unsigned long long intrin_u64;
#endif

//#if defined(__RDRND__) || defined(_MSC_VER)   // conditional definition if you want
inline
uint64_t rdrand64(){
    intrin_u64 ret;
    do{}while( !_rdrand64_step(&ret) );   // retry until success.
    return ret;
}
//#endif

#if defined(__RDSEED__) || defined(_MSC_VER)
inline
uint64_t rdseed64(){
    intrin_u64 ret;
    do{}while( !_rdseed64_step(&ret) );   // retry until success.
    return ret;
}
#endif  // RDSEED
#endif  // x86-64

//__attribute__((target("rdrnd")))
inline
uint32_t rdrand32(){
    unsigned ret;   // Intel documents this as unsigned int, not necessarily uint32_t
    do{}while( !_rdrand32_step(&ret) );   // retry until success.
    return ret;
}

#if defined(__RDSEED__) || defined(_MSC_VER)
//__attribute__((target("rdseed")))
inline
uint32_t rdseed32(){
    unsigned ret;   // Intel documents this as unsigned int, not necessarily uint32_t
    do{}while( !_rdseed32_step(&ret) );   // retry until success.
    return ret;
}
#endif
The fact that Intel's intrinsics API is supported at all implies that unsigned int is a 32-bit type, regardless of whether uint32_t is defined as unsigned int or unsigned long if any compilers do that.
On the Godbolt compiler explorer we can see how these compile. Clang and MSVC do what we'd expect, just a 2-instruction loop until rdrand leaves CF=1
# clang 7.0 -O3 -march=broadwell     (MSVC -O2 does the same.)
rdrand64():
.LBB0_1:                        # =>This Inner Loop Header: Depth=1
        rdrand  rax
        jae     .LBB0_1         # synonym for jnc - jump if Not Carry
        ret
# same for the other functions.
Unfortunately GCC is not so good, even current GCC12.1 makes weird asm:
# gcc 12.1 -O3 -march=broadwell
rdrand64():
        mov     edx, 1
.L2:
        rdrand  rax
        mov     QWORD PTR [rsp-8], rax   # store into the red-zone where retval is allocated
        cmovc   eax, edx                 # materialize a 0 or 1 from CF (rdrand zeros EAX when it clears CF=0, otherwise copy the 1)
        test    eax, eax                 # then test+branch on it
        je      .L2                      # could have just been jnc after rdrand
        mov     rax, QWORD PTR [rsp-8]   # reload retval
        ret
rdseed64():
.L7:
        rdseed  rax
        mov     QWORD PTR [rsp-8], rax   # dead store into the red-zone
        jnc     .L7
        ret
ICC makes the same asm as long as we use a do{}while() retry loop; with a while() {} it's even worse, doing an rdrand and checking before entering the loop for the first time.
Footnote 1: rdrand/rdseed library wrappers
librdrand or Intel's libdrng have wrapper functions with retry loops like I showed, and ones that fill a buffer of bytes or an array of uint32_t or uint64_t. (Consistently taking uint64_t*, without the intrinsic's unsigned long long* type mismatch on some targets.)
A library is also a good choice if you're doing runtime CPU feature detection, so you don't have to mess around with __attribute__((target)) stuff. However you do it, that limits inlining of a function using the intrinsics anyway, so a small static library is equivalent.
libdrng also provides RdRand_isSupported() and RdSeed_isSupported(), so you don't need to do your own CPUID check.
But if you're going to build with -march= something newer than Ivy Bridge / Broadwell or Excavator / Zen1 anyway, inlining a 2-instruction retry loop (like clang compiles it to) is about the same code-size as a function call-site, but doesn't clobber any registers. rdrand is quite slow so that's probably not a big deal, but it also means no extra library dependency.
Performance / internals of rdrand / rdseed
For more details about the HW internals on Intel (not AMD's version), see Intel's docs. For the actual TRNG logic, see Understanding Intel's Ivy Bridge Random Number Generator - it's a metastable latch that settles to 0 or 1 due to thermal noise. Or at least Intel says it is; it's basically impossible to truly verify where the rdrand bits actually come from in a CPU you bought. Worst case, still much better than nothing if you're mixing it with other entropy sources, like Linux does for /dev/random.
For more on the fact that there's a buffer that cores pull from, see some SO answers from the engineer who designed the hardware and wrote librdrand, such as this and this about its exhaustion / performance characteristics on Ivy Bridge, the first generation to feature it.
Infinite retry count?
The asm instructions set the carry flag (CF) = 1 in FLAGS on success, when it put a random number in the destination register. Otherwise CF=0 and the output register = 0. You're intended to call it in a retry loop, that's (I assume) why the intrinsic has the word step in the name; it's one step of generating a single random number.
In theory, a microcode update could change things so it always indicates failure, e.g. if a problem is discovered in some CPU model that makes the RNG untrustworthy (by the standards of the CPU vendor). The hardware RNG also has some self-diagnostics, so it's in theory possible for a CPU to decide that the RNG is broken and not produce any outputs. I haven't heard of any CPUs ever doing this, but I haven't gone looking. And a future microcode update is always possible.
Either of these could lead to an infinite retry loop. That's not great, but unless you want to write a bunch of code to report on that situation, it's at least an observable behaviour that users could potentially deal with in the unlikely event it ever happened.
But occasional temporary failure is normal and expected, and must be handled. Preferably by retrying without telling the user about it.
If there wasn't a random number ready in its buffer, the CPU can report failure instead of stalling this core for potentially even longer. That design choice might be related to interrupt latency, or just keeping it simpler without having to build retrying into the microcode.
Ivy Bridge can't pull data from the DRNG faster than it can keep up, according to the designer, even with all cores looping rdrand, but later CPUs can. Therefore it is important to actually retry.
@jww has had some experience with deploying rdrand in libcrypto++, and found that with a retry count set too low, there were reports of occasional spurious failure. He's had good results from infinite retries, which is why I chose that for this answer. (I suspect he would have heard reports from users with broken CPUs that always fail, if that were a thing.)
Intel's library functions that include a retry loop take a retry count. That's likely to handle the permanent-failure case which, as I said, I don't think happens in any real CPUs yet. Without a limited retry count, you'd loop forever.
An infinite retry count allows a simple API returning the number by value, without silly limitations like OpenSSL's functions that use 0 as an error return: they can't randomly generate a 0!
If you did want a finite retry count, I'd suggest a very high one, like maybe 1 million, so it takes maybe half a second or a second of spinning to give up on a broken CPU, with negligible chance of having one thread starve that long if it's repeatedly unlucky in contending for access to the internal queue.
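A sketch of such a finite-retry wrapper (the limit of 1 million is the arbitrary example value discussed above):

#include <immintrin.h>
#include <stdint.h>
#include <stdbool.h>

// Returns false only if the hardware RNG keeps reporting failure.
static bool rdrand64_retry(uint64_t *out) {
    unsigned long long val;
    for (long i = 0; i < 1000000; i++) {
        if (_rdrand64_step(&val)) { *out = (uint64_t)val; return true; }
    }
    return false;
}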
https://uops.info/ measured a throughput of one per 3554 cycles on Skylake, one per 1352 on Alder Lake P-cores, 1230 on E-cores, and one per 1809 cycles on Zen 2. The Skylake version ran thousands of uops; the others were in the low double digits. Ivy Bridge had one per 110 cycle throughput, but in Haswell it was already up to 2436 cycles, though still a double-digit number of uops.
These abysmal performance numbers on recent Intel CPUs are probably due to microcode updates to work around problems that weren't anticipated when the HW was designed. Agner Fog measured one per 460 cycle throughput for rdrand and rdseed on Skylake when it was new, each costing 16 uops. The thousands of uops are probably extra buffer flushing hooked into the microcode for those instructions by recent updates. Agner measured Haswell at 17 uops, 320 cycles when it was new. See RdRand Performance As Bad As ~3% Original Speed With CrossTalk/SRBDS Mitigation on Phoronix:
As explained in the earlier article, mitigating CrossTalk involves locking the entire memory bus before updating the staging buffer and unlocking it after the contents have been cleared. This locking and serialization now involved for those instructions is very brutal on the performance, but thankfully most real-world workloads shouldn't be making too much use of these instructions.
Locking the memory bus sounds like it could hurt performance even of other cores, if it's like cache-line splits for locked instructions.
(Those cycle numbers are core clock cycle counts; if the DRNG doesn't run on the same clock as the core, those might vary by CPU model. I wonder if uops.info's testing is running rdrand on multiple cores of the same hardware, since Coffee Lake is twice the uops as Skylake, and 1.4x as many cycles per random number. Unless that's just higher clocks leading to more microcode retries?)
The Microsoft compiler does not have intrinsic support for the RDSEED and RDRAND instructions.
But you may implement these instructions using NASM or MASM. Assembly code is available at:
https://software.intel.com/en-us/articles/intel-digital-random-number-generator-drng-software-implementation-guide
For Intel Compiler, you can use header to determine the version. You can use following macros to determine the version and sub-version:
__INTEL_COMPILER //Major Version
__INTEL_COMPILER_UPDATE // Minor Update.
For instance if you use ICC15.0 Update 3 compiler, it will show that you have
__INTEL_COMPILER = 1500
__INTEL_COMPILER_UPDATE = 3
For further details on pre-defined macros you can go to: https://software.intel.com/en-us/node/524490

Intrinsics for CPUID like informations?

Considering that I'm coding in C++, I would like, if possible, to use an intrinsics-like solution to read useful information about the hardware; my concerns/considerations are:
I don't know assembly that well; it would be a considerable investment just to get this kind of information (although it looks like, on the CPU side, it's just about flipping values and reading registers).
there are at least 2 popular syntaxes for asm (Intel and AT&T), so it's fragmented
strangely enough, intrinsics are more popular and better supported than asm code these days
not all of the compilers on my radar right now support inline asm; MSVC 64-bit is one. I'm afraid that I will find other similar gaps while digging more into the feature sets of the different compilers that I have to use.
considering the trend, I think it is more productive for me to bet on intrinsics; it should also be way easier than any asm code.
And the last question that I have to answer is: how do I do a similar thing with intrinsics? Because I haven't found anything other than the CPUID opcode to get this kind of information at all.
After some digging I have found a useful built-in function that is gcc specific.
The only problem is that this kind of function is really limited (basically you have only 2 functions, 1 for the CPU "name" and 1 for the set of features);
an example is
#include <stdio.h>

int main()
{
    if (__builtin_cpu_supports("mmx")) {
        printf("\nI got MMX !\n");
    } else
        printf("\nWhat ? MMX ? What is that ?\n");
    return (0);
}
and apparently these built-in functions work under mingw-w64 too.
Gcc includes a cpuid interface:
http://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/config/i386/cpuid.h
These don't seem to be well documented, but example usage can be found here:
http://gcc.gnu.org/git/?p=gcc.git;a=blob_plain;f=gcc/config/i386/driver-i386.c
Note that you must use __cpuid_count() and not __cpuid() when the initial value of ecx matters, such as with avx/avx2 detection.
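A small sketch with gcc's cpuid.h (the helper name is made up; bit positions are from Intel's CPUID documentation):

#include <cpuid.h>

// AVX2 is reported in leaf 7, subleaf 0, EBX bit 5, so ECX must be zeroed
// explicitly -- hence __cpuid_count, not __cpuid.
// (A robust version should first check that the maximum basic leaf is >= 7.)
static int has_avx2(void)
{
    unsigned int eax, ebx, ecx, edx;
    __cpuid_count(7, 0, eax, ebx, ecx, edx);
    return (ebx >> 5) & 1;
}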
As user2485710 pointed out, gcc can do all the cpu feature detection work for you. As of gcc 4.8.1, the full list of features supported by __builtin_cpu_supports() is: cmov, mmx, popcnt, sse, sse2, sse3, ssse3, sse4.1, sse4.2, avx and avx2.
Intrinsics such as this are also generally compiler specific.
MS VC++ has a __cpuid (and a __cpuidex) to generate a CPUID op code.
At least as far as I know, gcc/g++ doesn't provide an equivalent to that though. Inline assembly seems to be the only option available.
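For comparison, a sketch of the MSVC side (regs[0..3] receive EAX, EBX, ECX, EDX; bit positions are from Intel's CPUID documentation):

#include <intrin.h>

int has_sse42(void)
{
    int regs[4];
    __cpuid(regs, 1);                // leaf 1
    return (regs[2] >> 20) & 1;      // ECX bit 20 = SSE4.2
}
int has_avx2(void)
{
    int regs[4];
    __cpuidex(regs, 7, 0);           // leaf 7, subleaf 0
    return (regs[1] >> 5) & 1;       // EBX bit 5 = AVX2
}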
For x86/x64, Intel provides an intrinsic called _may_i_use_cpu_feature. You can find it under the General Support category of the Intel Intrinsics Guide page. Below is a rip of Intel's documentation.
GCC supposedly follows Intel with respect to intrinsics, so it should be available under GCC. It's not clear to me whether Microsoft provides it, because they provide most (but not all) Intel intrinsics.
I'm not aware of anything for ARM. As far as I know, there is no __builtin_cpu_supports("neon"), __builtin_cpu_supports("crc32"), __builtin_cpu_supports("aes"), __builtin_cpu_supports("pmull"), __builtin_cpu_supports("sha"), etc under ARM. For ARM you have to perform CPU feature probing.
Synopsis
int _may_i_use_cpu_feature (unsigned __int64 a)
#include "immintrin.h"
Description
Dynamically query the processor to determine if the processor-specific feature(s) specified
in a are available, and return true or false (1 or 0) if the set of features is
available. Multiple features may be OR'd together. This intrinsic does not check the
processor vendor. See the valid feature flags below:
Operation
_FEATURE_GENERIC_IA32
_FEATURE_FPU
_FEATURE_CMOV
_FEATURE_MMX
_FEATURE_FXSAVE
_FEATURE_SSE
_FEATURE_SSE2
_FEATURE_SSE3
_FEATURE_SSSE3
_FEATURE_SSE4_1
_FEATURE_SSE4_2
_FEATURE_MOVBE
_FEATURE_POPCNT
_FEATURE_PCLMULQDQ
_FEATURE_AES
_FEATURE_F16C
_FEATURE_AVX
_FEATURE_RDRND
_FEATURE_FMA
_FEATURE_BMI
_FEATURE_LZCNT
_FEATURE_HLE
_FEATURE_RTM
_FEATURE_AVX2
_FEATURE_KNCNI
_FEATURE_AVX512F
_FEATURE_ADX
_FEATURE_RDSEED
_FEATURE_AVX512ER
_FEATURE_AVX512PF
_FEATURE_AVX512CD
_FEATURE_SHA
_FEATURE_MPX
For great-grandchildren, this is how to obtain CPU vendor name with GCC, tested on Win7 x64
#include <cpuid.h>
...
int eax, ebx, ecx, edx;
char vendor[13];
__cpuid(0, eax, ebx, ecx, edx);
memcpy(vendor, &ebx, 4);
memcpy(vendor + 4, &edx, 4);
memcpy(vendor + 8, &ecx, 4);
vendor[12] = '\0';
printf("CPU: %s\n", vendor);