How to emulate _mm256_loadu_epi32 with gcc or clang?

How to emulate _mm256_loadu_epi32 with gcc or clang? - c++

Intel's intrinsic guide lists the intrinsic _mm256_loadu_epi32:
_m256i _mm256_loadu_epi32 (void const* mem_addr);
/*
Instruction: vmovdqu32 ymm, m256
CPUID Flags: AVX512VL + AVX512F
Description
Load 256-bits (composed of 8 packed 32-bit integers) from memory into dst.
mem_addr does not need to be aligned on any particular boundary.
Operation
a[255:0] := MEM[mem_addr+255:mem_addr]
dst[MAX:256] := 0
*/
But clang and gcc do not provide this intrinsic. Instead they provide (in file avx512vlintrin.h) only the masked versions
_mm256_mask_loadu_epi32 (__m256i, __mmask8, void const *);
_mm256_maskz_loadu_epi32 (__mmask8, void const *);
which boil down to the same instruction vmovdqu32. My question: how can I emulate _mm256_loadu_epi32:
inline _m256i _mm256_loadu_epi32(void const* mem_addr)
{
/* code using vmovdqu32 and compiles with gcc */
}
without writing assembly, i.e. using only intrinsics available?

Just use _mm256_loadu_si256 like a normal person. The only thing the AVX512 intrinsic gives you is a nicer prototype (const void* instead of const __m256i*) so you don't have to write ugly casts.
#chtz suggests out that you might still want to write a wrapper function yourself to get the void* prototype. But don't call it _mm256_loadu_epi32; some future GCC version will probably add that for compat with Intel's docs and break your code.
From another perspective, it's unfortunate that compilers don't treat it as an AVX1 intrinsic, but I guess compilers which don't optimize intrinsics, and which let you use intrinsics from ISA extensions you haven't enabled, need this kind of clue to know when they can use ymm16-31.
You don't even want the compiler to emit vmovdqu32 ymm when you're not masking; vmovdqu ymm is shorter and does exactly the same thing, with no penalty for mixing with EVEX-encoded instructions. The compiler can always use an vmovdqu32 or 64 if it wants to load into ymm16..31, otherwise you want it to use a shorter VEX-coded AVX1 vmovdqu.
I'm pretty sure that GCC treats _mm256_maskz_epi32(0xffu,ptr) exactly the same as _mm256_loadu_si256((const __m256i*)ptr) and makes the same asm regardless of which one you use. It can optimize away the 0xffu mask and simply use an unmasked load, but there's no need for that extra complication in your source.
But unfortunately GCC9 and earlier will pessimize to vmovdqu32 ymm0, [mem] when AVX512VL is enabled (e.g. -march=skylake-avx512) even when you write _mm256_loadu_si256. This was a missed-optimization, GCC Bug 89346.
It doesn't matter which 256-bit load intrinsic you use (except for aligned vs. unaligned) as long as there's no masking.
Related:
error: '_mm512_loadu_epi64' was not declared in this scope
What is the difference between _mm512_load_epi32 and _mm512_load_si512?

Related

MSVC's instrinsics __emulu and _umul128 in GCC/CLang

In MSVC there exist instrinsics __emulu() and _umul128(). First does u32*u32->u64 multiplication and second u64*u64->u128 multiplication.
Do same intrinsics exist for CLang/GCC?
Closest I found are _mulx_u32() and _mulx_u64() mentioned in Intel's Guide. But they produce mulx instruction which needs BMI2 support. While MSVC's intrinsics produce regular mul instruction. Also _mulx_u32() is not available in -m64 mode, while __emulu() and _umul128() both exist in 32 and 64 bit mode of MSVC.
You may try online 32-bit code and 64-bit code.
Of cause for 32-bit one may do return uint64_t(a) * uint64_t(b); (see it online) hoping that compiler will guess correctly and optimize to using u32*u32->u64 multiplication instead of u64*u64->u64. But is there a way to be sure about this? Not to rely on compiler's guess that both arguments are 32-bit (i.e. higher part of uint64_t is zeroed)? To have some intrinsics like __emulu() that make you sure about code.
There is __int128 in GCC/CLang (see code online) but again we have to rely on compiler's guess that we actually multiply 64-bit numbers (i.e. higher part of int128 is zeroed). Is there a way to be sure without compiler guessing, if there exist some intrinsics for that?
BTW, both uint64_t (for 32-bit) and __int128 (for 64-bit) produce correct mul instruction instead of mulx in GCC/CLang. But again we have to rely that compiler guesses correctly that higher part of uint64_t and __int128 is zeroed.
Of cause I can look into assembler code that GCC/Clang have optimized and guessed correctly, but looking at assembler once doesn't guarantee that same will happen always in all circumstances. And I don't know of a way in C++ to statically assert that compiler did correct guess about assembler instructions.

You already have the answer. Use uint64_t and __uint128_t. No intrinsics needed. This is available with modern GCC and Clang for all 64-bit targets. See Is there a 128 bit integer in gcc?
#include <stdint.h>
typedef __uint128_t uint128_t;
// 32*32=64 multiplication
f(uint32_t a, uint32_t b) {
uint64_t ab = (uint64_t)a * b;
}
//64*64=128 multiplication
f(uint64_t a, uint64_t b) {
uint128_t ab = (uint128_t)a * b;
}
Note that the cast must be on the operands, or at least on one operand. Casting the result would not work since it would do the multiplication with the shorter type and extend the result.
But is there a way to be sure about this? Not to rely on compiler's guess
You get exactly the same guarantee as with compiler intrinsics: that the value of the result is correct. There are never any guarantees about optimization. Just because you used intrinsics doesn't guarantee that the compiler will emit the “obvious” assembly instruction. The only way to have this guarantee is to use inline assembly, and for a simple operation like this it's likely to hurt performance because it would restrict the ways in which the compiler optimizes register usage.

c++ AVX512 intrinsic equivalent of _mm256_broadcast_ss()?

I'm rewriting a code from AVX2 to AVX512.
What's the equivalent I can use to broadcast a single float number to a _mm512 vector? In AVX2 it is _mm256_broadcast_ss() but I can't find something like _mm512_broadcast_ss().

AVX512 doesn't need a special intrinsic for the memory source version1. You can simply use _mm512_set1_ps (which takes a float, not a float*). The compiler should use a memory-source broadcast if that's efficient. (Potentially even folded into a broadcast memory source for an ALU instruction instead of a separate load; AVX512 can do that for 512-bit vectors.)
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm512_set1_ps&expand=5236,4980
Footnote 1: The reason for _mm256_broadcast_ss even existing separately from _mm256_set1_ps is probably because of AVX1 vbroadcastss ymm, [mem] vs. AVX2 vbroadcastss ymm, xmm. Some compilers like MSVC and ICC let you use intrinsics without enabling the ISA extensions for the compiler to use anywhere, so there needed to be an intrinsic for only the AVX1 memory-source version specifically.
With AVX512, both memory and register source forms were introduced with AVX512F so there's no need to give users of those compilers a way to micro-manage which asm is allowed.

Using AVX to xor two zmm (512 bit) registers

I would like to bit-wisr xor zmm0 with zmm1.
I read around the internet and tried:
asm volatile(
"vmovdqa64 (%0),%%zmm0;\n"
"vmovdqa64 (%1),%%zmm1;\n"
"vpxorq %%zmm1, %%zmm0;\n"
"vmovdqa64 %%zmm0,(%0);\n"
:: "r"(p_dst), "r" (p_src)
: );
But the compiler gives "Error: number of operands mismatch for `vpxorq'".
What am I doing wrong?

Inline asm for this is pointless (https://gcc.gnu.org/wiki/DontUseInlineAsm), and your code is unsafe and inefficient even if you fixed the syntax error by adding the 3rd operand.
Use the intrinsic _mm512_xor_epi64( __m512i a, __m512i b); as documented in Intel's asm manual entry for pxor. Look at the compiler-generated asm if you want to see how it's done.
Unsafe because you don't have a "memory" clobber to tell the compiler that you read/write memory, and you don't declare clobbers on zmm0 or zmm1.
And inefficient for many reasons, including forcing the addressing modes and not using a memory source operand. And not letting the compiler pick which registers to use.
Just fixing the asm syntax so it compiles will go from having an obvious compile-time bug to a subtle and dangerous runtime bug that might only be visible with optimization enabled.
See https://stackoverflow.com/tags/inline-assembly/info for more about inline asm. But again, there is basically zero reason to use it for most SIMD because you can get the compiler to make asm that's just as efficient as what you can do by hand, and more efficient than this.

Most AVX512 instructions use 3+ operands, i.e. you need to add additional operand - dst register (it can be the same as one of the other operands).
This is also true for AVX2 version, see https://www.felixcloutier.com/x86/pxor:
VPXOR ymm1, ymm2, ymm3/m256
VPXORD zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst
Note, that the above is intel syntax and would roughly translate into *mm1 = *mm2 ^ **mm3, in your case I guess you wanted to use "vpxorq %%zmm1, %%zmm0, %%zmm0;\n"
Be advised, that using inline assembly is generally a bad practice reserved for really special occasions. SIMD programming is better (faster, easier) done by using intrinsics supported by all major compilers. You can browse them here: https://software.intel.com/sites/landingpage/IntrinsicsGuide/

Compiling legacy GCC code with AVX vector warnings

I've been trying to search on google but couldn't find anything useful.
typedef int64_t v4si __attribute__ ((vector_size(32)));
//warning: AVX vector return without AVX enabled changes the ABI [-Wpsabi]
// so isn't AVX already automatically enabled?
// What does it mean "without AVX enabled"?
// What does it mean "changes the ABI"?
inline v4si v4si_gt0(v4si x_);
//warning: The ABI for passing parameters with 32-byte alignment has changed in GCC 4.6
//So why there's warning and what does it mean?
// Why only this parameter got warning?
// And all other v4si parameter/arguments got no warning?
void set_quota(v4si quota);

That's not legacy code. __attribute__ ((vector_size(32))) means a 32 byte vector, i.e. 256 bit, which (on x86) means AVX. (GNU C Vector Extensions)
AVX isn't enabled unless you use -mavx (or a -march setting that includes it). Without that, the compiler isn't allowed to generate code that uses AVX instructions, because those would trigger an illegal-instruction fault on older CPUs that don't support AVX.
So the compiler can't pass or return 256b vectors in registers, like the normal calling convention specifies. Probably it treats them the same as structs of that size passed by value.
See the ABI links in the x86 tag wiki, or the x86 Calling Conventions page on Wikipedia (mostly doesn't mention vector registers).
Since the GNU C Vector Extensions syntax isn't tied to any particular hardware, using a 32 byte vector will still compile to correct code. It will perform badly, but it will still work even if the compiler can only use SSE instructions. (Last I saw, gcc was known to do a very bad job of generating code to deal with vectors wider than the target machine supports. You'd get significantly better code for a machine with 16B vectors from using vector_size(16) manually.)
Anyway, the point is that you get a warning instead of a compiler error because __attribute__ ((vector_size(32))) doesn't imply AVX specifically, but AVX or some other 256b vector instruction set is required for it to compile to good code.

RDRAND and RDSEED intrinsics on various compilers?

Does Intel C++ compiler and/or GCC support the following Intel intrinsics, like MSVC does since 2012 / 2013?
#include <immintrin.h> // for the following intrinsics
int _rdrand16_step(uint16_t*);
int _rdrand32_step(uint32_t*);
int _rdrand64_step(uint64_t*);
int _rdseed16_step(uint16_t*);
int _rdseed32_step(uint32_t*);
int _rdseed64_step(uint64_t*);
And if these intrinsics are supported, since which version are they supported (with compile-time-constant please)?

Both GCC and Intel compiler support them. GCC support was introduced at the end of 2010. They require the header <immintrin.h>.
GCC support has been present since at least version 4.6, but there doesn't seem to be any specific compile-time constant - you can just check __GNUC_MAJOR__ > 4 || (__GNUC_MAJOR__ == 4 && __GNUC_MINOR__ >= 6).

All the major compilers support Intel's intrinsics for rdrand and rdseed via <immintrin.h>.
Somewhat recent versions of some compilers are needed for rdseed, e.g. GCC9 (2019) or clang7 (2018), although those have been stable for a good while by now. If you'd rather use an older compiler, or not enable ISA-extension options like -march=skylake, a library1 wrapper function instead of the intrinsic is a good choice. (Inline asm is not necessary, I wouldn't recommend it unless you want to play with it.)
#include <immintrin.h>
#include <stdint.h>
// gcc -march=native or haswell or znver1 or whatever, or manually enable -mrdrnd
uint64_t rdrand64(){
unsigned long long ret; // not uint64_t, GCC/clang wouldn't compile.
do{}while( !_rdrand64_step(&ret) ); // retry until success.
return ret;
}
// and equivalent for _rdseed64_step
// and 32 and 16-bit sizes with unsigned and unsigned short.
Some compilers define __RDRND__ when the instruction is enabled at compile-time. GCC/clang since they supported the intrinsic at all, but only much later ICC (19.0). And with ICC, -march=ivybridge doesn't imply -mrdrnd or define __RDRND__ until 2021.1.
ICX is LLVM-based and behaves like clang.
MSVC doesn't define any macros; its handling of intrinsics is designed around runtime feature detection only, unlike gcc/clang where the easy way is compile-time CPU feature options.
Why do{}while() instead of while(){}? Turns out ICC compiles to a less-dumb loop with do{}while(), not uselessly peeling a first iteration. Other compilers don't benefit from that hand-holding, and it's not a correctness problem for ICC.
Why unsigned long long instead of uint64_t? The type has to agree with the pointer type expected by the intrinsic, or C and especially C++ compilers will complain, regardless of the object-representations being identical (64-bit unsigned). On Linux for example, uint64_t is unsigned long, but GCC/clang's immintrin.h define int _rdrand64_step(unsigned long long*), same as on Windows. So you always need unsigned long long ret with GCC/clang. MSVC is a non-problem as it can (AFAIK) only target Windows, where unsigned long long is the only 64-bit unsigned type.
But ICC defines the intrinsic as taking unsigned long* when compiling for GNU/Linux, according to my testing on https://godbolt.org/. So to be portable to ICC, you actually need #ifdef __INTEL_COMPILER; even in C++ I don't know a way to use auto or other type-deduction to declare a variable that matches it.
Compiler versions to support intrinsics
Tested on Godbolt; its earliest version of MSVC is 2015, and ICC 2013, so I can't go back any further. Support for _rdrand16_step / 32 / 64 were all introduced at the same time in any given compiler. 64 requires 64-bit mode.
CPU
gcc
clang
MSVC
ICC
rdrand
Ivy Bridge / Excavator
4.6
3.2
before 2015 (19.10)
before 13.0.1, but 19.0 for -mrdrnd defining __RDRND__. 2021.1 for -march=ivybridge to enable -mrdrnd
rdseed
Broadwell / Zen 1
9.1
7.0
before 2015 (19.10)
before(?) 13.0.1, but 19.0 also added -mrdrnd and -mrdseed options)
The earliest GCC and clang versions don't recognize -march=ivybridge only -mrdrnd. (GCC 4.9 and clang 3.6 for Ivy Bridge, not that you specifically want to use IvyBridge if modern CPUs are more relevant. So use a non-ancient compiler and set a CPU option appropriate for CPUs you actually care about, or at least a -mtune= with a more recent CPU.)
Intel's new oneAPI / ICX compilers all support rdrand/rdseed, and are based on LLVM internals so they work similarly to clang for CPU options. (It doesn't define __INTEL_COMPILER, which is good because it's different from ICC.)
GCC and clang only let you use intrinsics for instructions you've told the compiler the target supports. Use -march=native if compiling for your own machine, or use -march=skylake or something to enable all the ISA extensions for the CPU you're targeting. But if you need your program to run on old CPUs and only use RDRAND or RDSEED after runtime detection, only those functions need __attribute__((target("rdrnd"))) or rdseed, and won't be able to inline into functions with different target options. Or using a separately-compiled library would be easier1.
-mrdrnd: enabled by -march=ivybridge or -march=znver1 (or bdver4 Exavator APUs) and later
-mrdseed: enabled by -march=broadwell or -march=znver1 or later
Normally if you're going to enable one CPU feature, it makes sense to enable others that CPUs of that generation will have, and to set tuning options. But rdrand isn't something the compiler will use on its own (unlike BMI2 shlx for more efficient variable-count shifts, or AVX/SSE for auto-vectorization and array/struct copying and init). So enabling -mrdrnd globally likely won't make your program crash on pre-Ivy Bridge CPUs, if you check CPU features and don't actually run code that uses _rdrand64_step on CPUs without the feature.
But if you are only going to run your code on some specific kind of CPU or later, gcc -O3 -march=haswell is a good choice. (-march also implies -mtune=haswell, and tuning for Ivy Bridge specifically is not what you want for modern CPUs. You could -march=ivybridge -mtune=skylake to set an older baseline of CPU features, but still tune for newer CPUs.)
Wrappers that compile everywhere
This is valid C++ and C. For C, you probably want static inline instead of inline so you don't need to manually instantiate an extern inline version in a .c in case a debug build decided not to inline. (Or use __attribute__((always_inline)) in GNU C.)
The 64-bit versions are only defined for x86-64 targets, because asm instructions can only use 64-bit operand-size in 64-bit mode. I didn't #ifdef __RDRND__ or #if defined(__i386__)||defined(__x86_64__), on the assumption that you'd only include this for x86(-64) builds at all, not cluttering the ifdefs more than necessary. It does only define the rdseed wrappers if that's enabled at compile time, or for MSVC where there's no way to enable them or to detect it.
There are some commented __attribute__((target("rdseed"))) examples you can uncomment if you want to do it that way instead of compiler options. rdrand16 / rdseed16 are intentionally omitted as not being normally useful. rdrand runs the same speed for different operand-sizes, and even pulls the same amount of data from the CPU's internal RNG buffer, optionally throwing away part of it for you.
#include <immintrin.h>
#include <stdint.h>
#if defined(__x86_64__) || defined (_M_X64)
// Figure out which 64-bit type the output arg uses
#ifdef __INTEL_COMPILER // Intel declares the output arg type differently from everyone(?) else
// ICC for Linux declares rdrand's output as unsigned long, but must be long long for a Windows ABI
typedef uint64_t intrin_u64;
#else
// GCC/clang headers declare it as unsigned long long even for Linux where long is 64-bit, but uint64_t is unsigned long and not compatible
typedef unsigned long long intrin_u64;
#endif
//#if defined(__RDRND__) || defined(_MSC_VER) // conditional definition if you want
inline
uint64_t rdrand64(){
intrin_u64 ret;
do{}while( !_rdrand64_step(&ret) ); // retry until success.
return ret;
}
//#endif
#if defined(__RDSEED__) || defined(_MSC_VER)
inline
uint64_t rdseed64(){
intrin_u64 ret;
do{}while( !_rdseed64_step(&ret) ); // retry until success.
return ret;
}
#endif // RDSEED
#endif // x86-64
//__attribute__((target("rdrnd")))
inline
uint32_t rdrand32(){
unsigned ret; // Intel documents this as unsigned int, not necessarily uint32_t
do{}while( !_rdrand32_step(&ret) ); // retry until success.
return ret;
}
#if defined(__RDSEED__) || defined(_MSC_VER)
//__attribute__((target("rdseed")))
inline
uint32_t rdseed32(){
unsigned ret; // Intel documents this as unsigned int, not necessarily uint32_t
do{}while( !_rdseed32_step(&ret) ); // retry until success.
return ret;
}
#endif
The fact that Intel's intrinsics API is supported at all implies that unsigned int is a 32-bit type, regardless of whether uint32_t is defined as unsigned int or unsigned long if any compilers do that.
On the Godbolt compiler explorer we can see how these compile. Clang and MSVC do what we'd expect, just a 2-instruction loop until rdrand leaves CF=1
# clang 7.0 -O3 -march=broadwell MSVC -O2 does the same.
rdrand64():
.LBB0_1: # =>This Inner Loop Header: Depth=1
rdrand rax
jae .LBB0_1 # synonym for jnc - jump if Not Carry
ret
# same for other functions.
Unfortunately GCC is not so good, even current GCC12.1 makes weird asm:
# gcc 12.1 -O3 -march=broadwell
rdrand64():
mov edx, 1
.L2:
rdrand rax
mov QWORD PTR [rsp-8], rax # store into the red-zone where retval is allocated
cmovc eax, edx # materialize a 0 or 1 from CF. (rdrand zeros EAX when it clears CF=0, otherwise copy the 1)
test eax, eax # then test+branch on it
je .L2 # could have just been jnc after rdrand
mov rax, QWORD PTR [rsp-8] # reload retval
ret
rdseed64():
.L7:
rdseed rax
mov QWORD PTR [rsp-8], rax # dead store into the red-zone
jnc .L7
ret
ICC makes the same asm as long as we use a do{}while() retry loop; with a while() {} it's even worse, doing an rdrand and checking before entering the loop for the first time.
Footnote 1: rdrand/rdseed library wrappers
librdrand or Intel's libdrng have wrapper functions with retry loops like I showed, and ones that fill a buffer of bytes or array of uint32_t* or uint64_t*. (Consistently taking uint64_t*, no unsigned long long* on some targets).
A library is also a good choice if you're doing runtime CPU feature detection, so you don't have to mess around with __attribute__((target)) stuff. However you do it, that limits inlining of a function using the intrinsics anyway, so a small static library is equivalent.
libdrng also provides RdRand_isSupported() and RdSeed_isSupported(), so you don't need to do your own CPUID check.
But if you're going to build with -march= something newer than Ivy Bridge / Broadwell or Excavator / Zen1 anyway, inlining a 2-instruction retry loop (like clang compiles it to) is about the same code-size as a function call-site, but doesn't clobber any registers. rdrand is quite slow so that's probably not a big deal, but it also means no extra library dependency.
Performance / internals of rdrand / rdseed
For more details about the HW internals on Intel (not AMD's version), see Intel's docs. For the actual TRNG logic, see Understanding Intel's Ivy Bridge Random Number Generator - it's a metastable latch that settles to 0 or 1 due to thermal noise. Or at least Intel says it is; it's basically impossible to truly verify where the rdrand bits actually come from in a CPU you bought. Worst case, still much better than nothing if you're mixing it with other entropy sources, like Linux does for /dev/random.
For more on the fact that there's a buffer that cores pull from, see some SO answers from the engineer who designed the hardware and wrote librdrand, such as this and this about its exhaustion / performance characteristics on Ivy Bridge, the first generation to feature it.
Infinite retry count?
The asm instructions set the carry flag (CF) = 1 in FLAGS on success, when it put a random number in the destination register. Otherwise CF=0 and the output register = 0. You're intended to call it in a retry loop, that's (I assume) why the intrinsic has the word step in the name; it's one step of generating a single random number.
In theory, a microcode update could change things so it always indicates failure, e.g. if a problem is discovered in some CPU model that makes the RNG untrustworthy (by the standards of the CPU vendor). The hardware RNG also has some self-diagnostics, so it's in theory possible for a CPU to decide that the RNG is broken and not produce any outputs. I haven't heard of any CPUs ever doing this, but I haven't gone looking. And a future microcode update is always possible.
Either of these could lead to an infinite retry loop. That's not great, but unless you want to write a bunch of code to report on that situation, it's at least an observable behaviour that users could potentially deal with in the unlikely event it ever happened.
But occasional temporary failure is normal and expected, and must be handled. Preferably by retrying without telling the user about it.
If there wasn't a random number ready in its buffer, the CPU can report failure instead of stalling this core for potentially even longer. That design choice might be related to interrupt latency, or just keeping it simpler without having to build retrying into the microcode.
Ivy Bridge can't pull data from the DRNG faster than it can keep up, according to the designer, even with all cores looping rdrand, but later CPUs can. Therefore it is important to actually retry.
#jww has had some experience with deploying rdrand in libcrypto++, and found that with a retry count set too low, there were reports of occasional spurious failure. He's had good results from infinite retries, which is why I chose that for this answer. (I suspect he would have heard reports from users with broken CPUs that always fail, if that was a thing.)
Intel's library functions that include a retry loop take a retry count. That's likely to handle the permanent-failure case which, as I said, I don't think happens in any real CPUs yet. Without a limited retry count, you'd loop forever.
An infinite retry count allows a simple API returning the number by value, without silly limitations like OpenSSL's functions that use 0 as an error return: they can't randomly generate a 0!
If you did want a finite retry count, I'd suggest very high. Like maybe 1 million, so it takes maybe have a second or a second of spinning to give up on a broken CPU, with negligible chance of having one thread starve that long if it's repeatedly unlucky in contending for access to the internal queue.
https://uops.info/ measured a throughput on Skylake of one per 3554 cycles on Skylake, one per 1352 on Alder Lake P-cores, 1230 on E-cores. One per 1809 cycles on Zen2. The Skylake version ran thousands of uops, the others were in the low double digits. Ivy Bridge had 110 cycle throughput, but in Haswell it was already up to 2436 cycles, but still a double-digit number of uops.
These abysmal performance numbers on recent Intel CPUs are probably due to microcode updates to work around problems that weren't anticipated when the HW was designed. Agner Fog measured one per 460 cycle throughput for rdrand and rdseed on Skylake when it was new, each costing 16 uops. The thousands of uops are probably extra buffer flushing hooked into the microcode for those instructions by recent updates. Agner measured Haswell at 17 uops, 320 cycles when it was new. See RdRand Performance As Bad As ~3% Original Speed With CrossTalk/SRBDS Mitigation on Phoronix:
As explained in the earlier article, mitigating CrossTalk involves locking the entire memory bus before updating the staging buffer and unlocking it after the contents have been cleared. This locking and serialization now involved for those instructions is very brutal on the performance, but thankfully most real-world workloads shouldn't be making too much use of these instructions.
Locking the memory bus sounds like it could hurt performance even of other cores, if it's like cache-line splits for locked instructions.
(Those cycle numbers are core clock cycle counts; if the DRNG doesn't run on the same clock as the core, those might vary by CPU model. I wonder if uops.info's testing is running rdrand on multiple cores of the same hardware, since Coffee Lake is twice the uops as Skylake, and 1.4x as many cycles per random number. Unless that's just higher clocks leading to more microcode retries?)

Microsoft compiler does not have intrinsics support for RDSEED and RDRAND instruction.
But, you may implement these instruction using NASM or MASM. Assembly code is available at:
https://software.intel.com/en-us/articles/intel-digital-random-number-generator-drng-software-implementation-guide
For Intel Compiler, you can use header to determine the version. You can use following macros to determine the version and sub-version:
__INTEL_COMPILER //Major Version
__INTEL_COMPILER_UPDATE // Minor Update.
For instance if you use ICC15.0 Update 3 compiler, it will show that you have
__INTEL_COMPILER = 1500
__INTEL_COMPILER_UPDATE = 3
For further details on pre-defined macros you can go to: https://software.intel.com/en-us/node/524490

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

How to emulate _mm256_loadu_epi32 with gcc or clang? - c++

Related

MSVC's instrinsics __emulu and _umul128 in GCC/CLang

c++ AVX512 intrinsic equivalent of _mm256_broadcast_ss()?

Using AVX to xor two zmm (512 bit) registers

Compiling legacy GCC code with AVX vector warnings

RDRAND and RDSEED intrinsics on various compilers?

Categories

Resources