Forcing AVX intrinsics to use SSE instructions instead - c++

Unfortunately I have an AMD piledriver cpu, which seems to have problems with AVX instructions:
Memory writes with the 256-bit AVX registers are exceptionally slow. The measured throughput is 5 - 6 times slower than on the previous model (Bulldozer), and 8 - 9 times slower than two 128-bit writes.
In my own experience, I've found mm256 intrinsics to be much slower than mm128, and I'm assuming it's because of the above reason.
I really want to code for the newest instruction set AVX though, while still being able to test builds on my machine at a reasonable speed. Is there a way to force mm256 intrinsics to use SSE instructions instead? I'm using VS 2015.
If there is no easy way, what about a hard way. Replace <immintrin.h> with a custom made header containing my own definitions for the intrinsics which can be coded to use SSE? Not sure how plausible this is, prefer easier way if possible before I go through that work.

Use Agner Fog's Vector Class Library and add this to the command line in Visual Studio: -D__SSE4_2__ -D__XOP__.
Then use an AVX sized vector such as Vec8f for eight floats. When you compile without AVX enable it will use the file vectorf256e.h which emulates AVX with two SSE registers. For example Vec8f inherits from Vec256fe which starts like this:
class Vec256fe {
protected:
__m128 y0; // low half
__m128 y1; // high half
If you compile with /arch:AVX -D__XOP__ the VCL will instead use the file vectorf256.h and one AVX register. Then your code works for AVX and SSE with only a compiler switch change.
If you don't want to use XOP don't use -D__XOP__.
As Peter Cordes pointed out in his answer, if you your goal is only to avoid 256-bit load/stores then you may still want VEX encoded instructions (though it's not clear this will make a difference except in some special cases). You can do that with the vector class like this
Vec8f a;
Vec4f lo = a.get_low(); // a is a Vec8f type
Vec4f hi = a.get_high();
lo.store(&b[0]); // b is a float array
hi.store(&b[4]);
then compile with /arch:AVX -D__XOP__.
Another option would be be one source file that uses Vecnf and then do
//foo.cpp
#include "vectorclass.h"
#if SIMDWIDTH == 4
typedef Vec4f Vecnf;
#else
typedef Vec8f Vecnf;
#endif
and compile like this
cl /O2 /DSIMDWIDTH=4 foo.cpp /Fofoo_sse
cl /O2 /DSIMDWIDTH=4 /arch:AVX /D__XOP__ foo.cpp /Fofoo_avx128
cl /O2 /DSIMDWIDTH=8 /arch:AVX foo.cpp /Fofoo_avx256
This would create three executables with one source file. Instead of linking them you could just compile them with /c and them make a CPU dispatcher. I used XOP with avx128 because I don't think there is a good reason to use avx128 except on AMD.

You don't want to use SSE instructions. What you want is for 256b stores to be done as two separate 128b stores, still with VEX-coded 128b instructions. i.e. 128b AVX vmovups.
gcc has -mavx256-split-unaligned-load and ...-store options (enabled as part of -march=sandybridge for example, presumably also for Bulldozer-family (-march=bdver2 is piledriver). That doesn't solve the problem when the compiler knows the memory is aligned, though.
You could override the normal 256b store intrinsic with a macro like
// maybe enable this for all BD family CPUs?
#if defined(__bdver2) | defined(PILEDRIVER) | defined(SPLIT_256b_STORES)
#define _mm256_storeu_ps(addr, data) do{ \
_mm_storeu_ps( ((float*)(addr)) + 0, _mm256_extractf128_ps((data),0)); \
_mm_storeu_ps( ((float*)(addr)) + 4, _mm256_extractf128_ps((data),1)); \
}while(0)
#endif
gcc defines __bdver2 (Bulldozer version 2) for Piledriver (-march=bdver2).
You could do the same for (aligned) _mm256_store_ps, or just always use the unaligned intrinsic.
Compilers optimize the _mm256_extractf128(data,0) to a simple cast. I.e. it should just compile to
vmovups [rdi], xmm0 ; if data is in xmm0 and addr is in rdi
vextractf128 [rdi+16], xmm0, 1
However, testing on godbolt shows that gcc and clang are dumb, and extract to a register and then store. ICC correctly generates the two-instruction sequence.

Related

c++ AVX512 intrinsic equivalent of _mm256_broadcast_ss()?

I'm rewriting a code from AVX2 to AVX512.
What's the equivalent I can use to broadcast a single float number to a _mm512 vector? In AVX2 it is _mm256_broadcast_ss() but I can't find something like _mm512_broadcast_ss().
AVX512 doesn't need a special intrinsic for the memory source version1. You can simply use _mm512_set1_ps (which takes a float, not a float*). The compiler should use a memory-source broadcast if that's efficient. (Potentially even folded into a broadcast memory source for an ALU instruction instead of a separate load; AVX512 can do that for 512-bit vectors.)
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm512_set1_ps&expand=5236,4980
Footnote 1: The reason for _mm256_broadcast_ss even existing separately from _mm256_set1_ps is probably because of AVX1 vbroadcastss ymm, [mem] vs. AVX2 vbroadcastss ymm, xmm. Some compilers like MSVC and ICC let you use intrinsics without enabling the ISA extensions for the compiler to use anywhere, so there needed to be an intrinsic for only the AVX1 memory-source version specifically.
With AVX512, both memory and register source forms were introduced with AVX512F so there's no need to give users of those compilers a way to micro-manage which asm is allowed.

How to emulate _mm256_loadu_epi32 with gcc or clang?

Intel's intrinsic guide lists the intrinsic _mm256_loadu_epi32:
_m256i _mm256_loadu_epi32 (void const* mem_addr);
/*
Instruction: vmovdqu32 ymm, m256
CPUID Flags: AVX512VL + AVX512F
Description
Load 256-bits (composed of 8 packed 32-bit integers) from memory into dst.
mem_addr does not need to be aligned on any particular boundary.
Operation
a[255:0] := MEM[mem_addr+255:mem_addr]
dst[MAX:256] := 0
*/
But clang and gcc do not provide this intrinsic. Instead they provide (in file avx512vlintrin.h) only the masked versions
_mm256_mask_loadu_epi32 (__m256i, __mmask8, void const *);
_mm256_maskz_loadu_epi32 (__mmask8, void const *);
which boil down to the same instruction vmovdqu32. My question: how can I emulate _mm256_loadu_epi32:
inline _m256i _mm256_loadu_epi32(void const* mem_addr)
{
/* code using vmovdqu32 and compiles with gcc */
}
without writing assembly, i.e. using only intrinsics available?
Just use _mm256_loadu_si256 like a normal person. The only thing the AVX512 intrinsic gives you is a nicer prototype (const void* instead of const __m256i*) so you don't have to write ugly casts.
#chtz suggests out that you might still want to write a wrapper function yourself to get the void* prototype. But don't call it _mm256_loadu_epi32; some future GCC version will probably add that for compat with Intel's docs and break your code.
From another perspective, it's unfortunate that compilers don't treat it as an AVX1 intrinsic, but I guess compilers which don't optimize intrinsics, and which let you use intrinsics from ISA extensions you haven't enabled, need this kind of clue to know when they can use ymm16-31.
You don't even want the compiler to emit vmovdqu32 ymm when you're not masking; vmovdqu ymm is shorter and does exactly the same thing, with no penalty for mixing with EVEX-encoded instructions. The compiler can always use an vmovdqu32 or 64 if it wants to load into ymm16..31, otherwise you want it to use a shorter VEX-coded AVX1 vmovdqu.
I'm pretty sure that GCC treats _mm256_maskz_epi32(0xffu,ptr) exactly the same as _mm256_loadu_si256((const __m256i*)ptr) and makes the same asm regardless of which one you use. It can optimize away the 0xffu mask and simply use an unmasked load, but there's no need for that extra complication in your source.
But unfortunately GCC9 and earlier will pessimize to vmovdqu32 ymm0, [mem] when AVX512VL is enabled (e.g. -march=skylake-avx512) even when you write _mm256_loadu_si256. This was a missed-optimization, GCC Bug 89346.
It doesn't matter which 256-bit load intrinsic you use (except for aligned vs. unaligned) as long as there's no masking.
Related:
error: '_mm512_loadu_epi64' was not declared in this scope
What is the difference between _mm512_load_epi32 and _mm512_load_si512?

GCC w/ inline assembly & -Ofast generating extra code for memory operand

I am inputting the address of an index into a table into an extended inline assembly operation, but GCC is producing an extra lea instruction when it is not necessary, even when using -Ofast -fomit-frame-pointer or -Os -f.... GCC is using RIP-relative addresses.
I was creating a function for converting two consecutive bits into a two-part XMM mask (1 quadword mask per bit). To do this, I am using _mm_cvtepi8_epi64 (internally vpmovsxbq) with a memory operand from a 8-byte table with the bits as index.
When I use the intrinsic, GCC produces the exactly same code as using the extended inline assembly.
I can directly embed the memory operation into the ASM template, but that would force RIP-relative addressing always (and I don't like forcing myself into workarounds).
typedef uint64_t xmm2q __attribute__ ((vector_size (16)));
// Used for converting 2 consecutive bits (as index) into a 2-elem XMM mask (pmovsxbq)
static const uint16_t MASK_TABLE[4] = { 0x0000, 0x0080, 0x8000, 0x8080 };
xmm2q mask2b(uint64_t mask) {
assert(mask < 4);
#ifdef USE_ASM
xmm2q result;
asm("vpmovsxbq %1, %0" : "=x" (result) : "m" (MASK_TABLE[mask]));
return result;
#else
// bad cast (UB?), but input should be `uint16_t*` anyways
return (xmm2q) _mm_cvtepi8_epi64(*((__m128i*) &MASK_TABLE[mask]));
#endif
}
Output assembly with -S (with USE_ASM and without):
__Z6mask2by: ## #_Z6mask2by
.cfi_startproc
## %bb.0:
leaq __ZL10MASK_TABLE(%rip), %rax
vpmovsxbq (%rax,%rdi,2), %xmm0
retq
.cfi_endproc
What I was expecting (I've removed all the extra stuff):
__Z6mask2by:
vpmovsxbq __ZL10MASK_TABLE(%rip,%rdi,2), %xmm0
retq
The only RIP-relative addressing mode is RIP + rel32. RIP + reg is not available.
(In machine code, 32-bit code used to have 2 redundant ways to encode [disp32]. x86-64 uses the shorter (no SIB) form as RIP relative, the longer SIB form as [sign_extended_disp32]).
If you compile for Linux with -fno-pie -no-pie, GCC will be able to access static data with a 32-bit absolute address, so it can use a mode like __ZL10MASK_TABLE(,%rdi,2). This isn't possible for MacOS, where the base address is always above 2^32; 32-bit absolute addressing is completely unsupported on x86-64 MacOS.
In a PIE executable (or PIC code in general like a library), you need a RIP-relative LEA to set up for indexing a static array. Or any other case where the static address won't fit in 32 bits and/or isn't a link-time constant.
Intrinsics
Yes, intrinsics make it very inconvenient to express a pmovzx/sx load from a narrow source because pointer-source versions of the intrinsics are missing.
*((__m128i*) &MASK_TABLE[mask] isn't safe: if you disable optimization, you might well get a movdqa 16-byte load but the address will be misaligned. It's only safe when the compiler folds the load into a memory operand for pmovzxbq which has a 2-byte memory operand therefore not requiring alignment.
In fact current GCC does compile your code with a movdqa 16-byte load like movdqa xmm0, XMMWORD PTR [rax+rdi*2] before a reg-reg pmovzx. This is obviously a missed optimization. :( clang/LLVM (which MacOS installs as gcc) does fold the load into pmovzx.
The safe way is _mm_cvtepi8_epi64( _mm_cvtsi32_si128(MASK_TABLE[mask]) ) or something, and then hoping the compiler optimizes away the zero-extend from 2 to 4 bytes and folds the movd into a load when you enable optimization. Or maybe try _mm_loadu_si32 for a 32-bit load even though you really want 16. But last time I tried, compilers sucked at folding a 64-bit load intrinsic into a memory operand for pmovzxbw for example. GCC and clang still fail at it, but ICC19 succeeds. https://godbolt.org/z/IdgoKV
I've written about this before:
Loading 8 chars from memory into an __m256 variable as packed single precision floats
How to merge a scalar into a vector without the compiler wasting an instruction zeroing upper elements? Design limitation in Intel's intrinsics?
Your integer -> vector strategy
Your choice of pmovsx seems odd. You don't need sign-extension, so I would have picked pmovzx (_mm_cvt_epu8_epi64). It's not actually more efficient on any CPUs, though.
A lookup table does work here with only a small amount of static data needed. If your mask range was any bigger, you'd maybe want to look into
is there an inverse instruction to the movemask instruction in intel avx2? for alternative strategies like broadcast + AND + (shift or compare).
If you do this often, using a whole cache line of 4x 16-byte vector constants might be best so you don't need a pmovzx instruction, just index into an aligned table of xmm2 or __m128i vectors which can be a memory source for any other SSE instruction. Use alignas(64) to get all the constants in the same cache line.
You could also consider (intrinsics for) pdep + movd xmm0, eax + pmovzxbq reg-reg if you're targeting Intel CPUs with BMI2. (pdep is slow on AMD, though).

RDRAND and RDSEED intrinsics on various compilers?

Does Intel C++ compiler and/or GCC support the following Intel intrinsics, like MSVC does since 2012 / 2013?
#include <immintrin.h> // for the following intrinsics
int _rdrand16_step(uint16_t*);
int _rdrand32_step(uint32_t*);
int _rdrand64_step(uint64_t*);
int _rdseed16_step(uint16_t*);
int _rdseed32_step(uint32_t*);
int _rdseed64_step(uint64_t*);
And if these intrinsics are supported, since which version are they supported (with compile-time-constant please)?
Both GCC and Intel compiler support them. GCC support was introduced at the end of 2010. They require the header <immintrin.h>.
GCC support has been present since at least version 4.6, but there doesn't seem to be any specific compile-time constant - you can just check __GNUC_MAJOR__ > 4 || (__GNUC_MAJOR__ == 4 && __GNUC_MINOR__ >= 6).
All the major compilers support Intel's intrinsics for rdrand and rdseed via <immintrin.h>.
Somewhat recent versions of some compilers are needed for rdseed, e.g. GCC9 (2019) or clang7 (2018), although those have been stable for a good while by now. If you'd rather use an older compiler, or not enable ISA-extension options like -march=skylake, a library1 wrapper function instead of the intrinsic is a good choice. (Inline asm is not necessary, I wouldn't recommend it unless you want to play with it.)
#include <immintrin.h>
#include <stdint.h>
// gcc -march=native or haswell or znver1 or whatever, or manually enable -mrdrnd
uint64_t rdrand64(){
unsigned long long ret; // not uint64_t, GCC/clang wouldn't compile.
do{}while( !_rdrand64_step(&ret) ); // retry until success.
return ret;
}
// and equivalent for _rdseed64_step
// and 32 and 16-bit sizes with unsigned and unsigned short.
Some compilers define __RDRND__ when the instruction is enabled at compile-time. GCC/clang since they supported the intrinsic at all, but only much later ICC (19.0). And with ICC, -march=ivybridge doesn't imply -mrdrnd or define __RDRND__ until 2021.1.
ICX is LLVM-based and behaves like clang.
MSVC doesn't define any macros; its handling of intrinsics is designed around runtime feature detection only, unlike gcc/clang where the easy way is compile-time CPU feature options.
Why do{}while() instead of while(){}? Turns out ICC compiles to a less-dumb loop with do{}while(), not uselessly peeling a first iteration. Other compilers don't benefit from that hand-holding, and it's not a correctness problem for ICC.
Why unsigned long long instead of uint64_t? The type has to agree with the pointer type expected by the intrinsic, or C and especially C++ compilers will complain, regardless of the object-representations being identical (64-bit unsigned). On Linux for example, uint64_t is unsigned long, but GCC/clang's immintrin.h define int _rdrand64_step(unsigned long long*), same as on Windows. So you always need unsigned long long ret with GCC/clang. MSVC is a non-problem as it can (AFAIK) only target Windows, where unsigned long long is the only 64-bit unsigned type.
But ICC defines the intrinsic as taking unsigned long* when compiling for GNU/Linux, according to my testing on https://godbolt.org/. So to be portable to ICC, you actually need #ifdef __INTEL_COMPILER; even in C++ I don't know a way to use auto or other type-deduction to declare a variable that matches it.
Compiler versions to support intrinsics
Tested on Godbolt; its earliest version of MSVC is 2015, and ICC 2013, so I can't go back any further. Support for _rdrand16_step / 32 / 64 were all introduced at the same time in any given compiler. 64 requires 64-bit mode.
CPU
gcc
clang
MSVC
ICC
rdrand
Ivy Bridge / Excavator
4.6
3.2
before 2015 (19.10)
before 13.0.1, but 19.0 for -mrdrnd defining __RDRND__. 2021.1 for -march=ivybridge to enable -mrdrnd
rdseed
Broadwell / Zen 1
9.1
7.0
before 2015 (19.10)
before(?) 13.0.1, but 19.0 also added -mrdrnd and -mrdseed options)
The earliest GCC and clang versions don't recognize -march=ivybridge only -mrdrnd. (GCC 4.9 and clang 3.6 for Ivy Bridge, not that you specifically want to use IvyBridge if modern CPUs are more relevant. So use a non-ancient compiler and set a CPU option appropriate for CPUs you actually care about, or at least a -mtune= with a more recent CPU.)
Intel's new oneAPI / ICX compilers all support rdrand/rdseed, and are based on LLVM internals so they work similarly to clang for CPU options. (It doesn't define __INTEL_COMPILER, which is good because it's different from ICC.)
GCC and clang only let you use intrinsics for instructions you've told the compiler the target supports. Use -march=native if compiling for your own machine, or use -march=skylake or something to enable all the ISA extensions for the CPU you're targeting. But if you need your program to run on old CPUs and only use RDRAND or RDSEED after runtime detection, only those functions need __attribute__((target("rdrnd"))) or rdseed, and won't be able to inline into functions with different target options. Or using a separately-compiled library would be easier1.
-mrdrnd: enabled by -march=ivybridge or -march=znver1 (or bdver4 Exavator APUs) and later
-mrdseed: enabled by -march=broadwell or -march=znver1 or later
Normally if you're going to enable one CPU feature, it makes sense to enable others that CPUs of that generation will have, and to set tuning options. But rdrand isn't something the compiler will use on its own (unlike BMI2 shlx for more efficient variable-count shifts, or AVX/SSE for auto-vectorization and array/struct copying and init). So enabling -mrdrnd globally likely won't make your program crash on pre-Ivy Bridge CPUs, if you check CPU features and don't actually run code that uses _rdrand64_step on CPUs without the feature.
But if you are only going to run your code on some specific kind of CPU or later, gcc -O3 -march=haswell is a good choice. (-march also implies -mtune=haswell, and tuning for Ivy Bridge specifically is not what you want for modern CPUs. You could -march=ivybridge -mtune=skylake to set an older baseline of CPU features, but still tune for newer CPUs.)
Wrappers that compile everywhere
This is valid C++ and C. For C, you probably want static inline instead of inline so you don't need to manually instantiate an extern inline version in a .c in case a debug build decided not to inline. (Or use __attribute__((always_inline)) in GNU C.)
The 64-bit versions are only defined for x86-64 targets, because asm instructions can only use 64-bit operand-size in 64-bit mode. I didn't #ifdef __RDRND__ or #if defined(__i386__)||defined(__x86_64__), on the assumption that you'd only include this for x86(-64) builds at all, not cluttering the ifdefs more than necessary. It does only define the rdseed wrappers if that's enabled at compile time, or for MSVC where there's no way to enable them or to detect it.
There are some commented __attribute__((target("rdseed"))) examples you can uncomment if you want to do it that way instead of compiler options. rdrand16 / rdseed16 are intentionally omitted as not being normally useful. rdrand runs the same speed for different operand-sizes, and even pulls the same amount of data from the CPU's internal RNG buffer, optionally throwing away part of it for you.
#include <immintrin.h>
#include <stdint.h>
#if defined(__x86_64__) || defined (_M_X64)
// Figure out which 64-bit type the output arg uses
#ifdef __INTEL_COMPILER // Intel declares the output arg type differently from everyone(?) else
// ICC for Linux declares rdrand's output as unsigned long, but must be long long for a Windows ABI
typedef uint64_t intrin_u64;
#else
// GCC/clang headers declare it as unsigned long long even for Linux where long is 64-bit, but uint64_t is unsigned long and not compatible
typedef unsigned long long intrin_u64;
#endif
//#if defined(__RDRND__) || defined(_MSC_VER) // conditional definition if you want
inline
uint64_t rdrand64(){
intrin_u64 ret;
do{}while( !_rdrand64_step(&ret) ); // retry until success.
return ret;
}
//#endif
#if defined(__RDSEED__) || defined(_MSC_VER)
inline
uint64_t rdseed64(){
intrin_u64 ret;
do{}while( !_rdseed64_step(&ret) ); // retry until success.
return ret;
}
#endif // RDSEED
#endif // x86-64
//__attribute__((target("rdrnd")))
inline
uint32_t rdrand32(){
unsigned ret; // Intel documents this as unsigned int, not necessarily uint32_t
do{}while( !_rdrand32_step(&ret) ); // retry until success.
return ret;
}
#if defined(__RDSEED__) || defined(_MSC_VER)
//__attribute__((target("rdseed")))
inline
uint32_t rdseed32(){
unsigned ret; // Intel documents this as unsigned int, not necessarily uint32_t
do{}while( !_rdseed32_step(&ret) ); // retry until success.
return ret;
}
#endif
The fact that Intel's intrinsics API is supported at all implies that unsigned int is a 32-bit type, regardless of whether uint32_t is defined as unsigned int or unsigned long if any compilers do that.
On the Godbolt compiler explorer we can see how these compile. Clang and MSVC do what we'd expect, just a 2-instruction loop until rdrand leaves CF=1
# clang 7.0 -O3 -march=broadwell MSVC -O2 does the same.
rdrand64():
.LBB0_1: # =>This Inner Loop Header: Depth=1
rdrand rax
jae .LBB0_1 # synonym for jnc - jump if Not Carry
ret
# same for other functions.
Unfortunately GCC is not so good, even current GCC12.1 makes weird asm:
# gcc 12.1 -O3 -march=broadwell
rdrand64():
mov edx, 1
.L2:
rdrand rax
mov QWORD PTR [rsp-8], rax # store into the red-zone where retval is allocated
cmovc eax, edx # materialize a 0 or 1 from CF. (rdrand zeros EAX when it clears CF=0, otherwise copy the 1)
test eax, eax # then test+branch on it
je .L2 # could have just been jnc after rdrand
mov rax, QWORD PTR [rsp-8] # reload retval
ret
rdseed64():
.L7:
rdseed rax
mov QWORD PTR [rsp-8], rax # dead store into the red-zone
jnc .L7
ret
ICC makes the same asm as long as we use a do{}while() retry loop; with a while() {} it's even worse, doing an rdrand and checking before entering the loop for the first time.
Footnote 1: rdrand/rdseed library wrappers
librdrand or Intel's libdrng have wrapper functions with retry loops like I showed, and ones that fill a buffer of bytes or array of uint32_t* or uint64_t*. (Consistently taking uint64_t*, no unsigned long long* on some targets).
A library is also a good choice if you're doing runtime CPU feature detection, so you don't have to mess around with __attribute__((target)) stuff. However you do it, that limits inlining of a function using the intrinsics anyway, so a small static library is equivalent.
libdrng also provides RdRand_isSupported() and RdSeed_isSupported(), so you don't need to do your own CPUID check.
But if you're going to build with -march= something newer than Ivy Bridge / Broadwell or Excavator / Zen1 anyway, inlining a 2-instruction retry loop (like clang compiles it to) is about the same code-size as a function call-site, but doesn't clobber any registers. rdrand is quite slow so that's probably not a big deal, but it also means no extra library dependency.
Performance / internals of rdrand / rdseed
For more details about the HW internals on Intel (not AMD's version), see Intel's docs. For the actual TRNG logic, see Understanding Intel's Ivy Bridge Random Number Generator - it's a metastable latch that settles to 0 or 1 due to thermal noise. Or at least Intel says it is; it's basically impossible to truly verify where the rdrand bits actually come from in a CPU you bought. Worst case, still much better than nothing if you're mixing it with other entropy sources, like Linux does for /dev/random.
For more on the fact that there's a buffer that cores pull from, see some SO answers from the engineer who designed the hardware and wrote librdrand, such as this and this about its exhaustion / performance characteristics on Ivy Bridge, the first generation to feature it.
Infinite retry count?
The asm instructions set the carry flag (CF) = 1 in FLAGS on success, when it put a random number in the destination register. Otherwise CF=0 and the output register = 0. You're intended to call it in a retry loop, that's (I assume) why the intrinsic has the word step in the name; it's one step of generating a single random number.
In theory, a microcode update could change things so it always indicates failure, e.g. if a problem is discovered in some CPU model that makes the RNG untrustworthy (by the standards of the CPU vendor). The hardware RNG also has some self-diagnostics, so it's in theory possible for a CPU to decide that the RNG is broken and not produce any outputs. I haven't heard of any CPUs ever doing this, but I haven't gone looking. And a future microcode update is always possible.
Either of these could lead to an infinite retry loop. That's not great, but unless you want to write a bunch of code to report on that situation, it's at least an observable behaviour that users could potentially deal with in the unlikely event it ever happened.
But occasional temporary failure is normal and expected, and must be handled. Preferably by retrying without telling the user about it.
If there wasn't a random number ready in its buffer, the CPU can report failure instead of stalling this core for potentially even longer. That design choice might be related to interrupt latency, or just keeping it simpler without having to build retrying into the microcode.
Ivy Bridge can't pull data from the DRNG faster than it can keep up, according to the designer, even with all cores looping rdrand, but later CPUs can. Therefore it is important to actually retry.
#jww has had some experience with deploying rdrand in libcrypto++, and found that with a retry count set too low, there were reports of occasional spurious failure. He's had good results from infinite retries, which is why I chose that for this answer. (I suspect he would have heard reports from users with broken CPUs that always fail, if that was a thing.)
Intel's library functions that include a retry loop take a retry count. That's likely to handle the permanent-failure case which, as I said, I don't think happens in any real CPUs yet. Without a limited retry count, you'd loop forever.
An infinite retry count allows a simple API returning the number by value, without silly limitations like OpenSSL's functions that use 0 as an error return: they can't randomly generate a 0!
If you did want a finite retry count, I'd suggest very high. Like maybe 1 million, so it takes maybe have a second or a second of spinning to give up on a broken CPU, with negligible chance of having one thread starve that long if it's repeatedly unlucky in contending for access to the internal queue.
https://uops.info/ measured a throughput on Skylake of one per 3554 cycles on Skylake, one per 1352 on Alder Lake P-cores, 1230 on E-cores. One per 1809 cycles on Zen2. The Skylake version ran thousands of uops, the others were in the low double digits. Ivy Bridge had 110 cycle throughput, but in Haswell it was already up to 2436 cycles, but still a double-digit number of uops.
These abysmal performance numbers on recent Intel CPUs are probably due to microcode updates to work around problems that weren't anticipated when the HW was designed. Agner Fog measured one per 460 cycle throughput for rdrand and rdseed on Skylake when it was new, each costing 16 uops. The thousands of uops are probably extra buffer flushing hooked into the microcode for those instructions by recent updates. Agner measured Haswell at 17 uops, 320 cycles when it was new. See RdRand Performance As Bad As ~3% Original Speed With CrossTalk/SRBDS Mitigation on Phoronix:
As explained in the earlier article, mitigating CrossTalk involves locking the entire memory bus before updating the staging buffer and unlocking it after the contents have been cleared. This locking and serialization now involved for those instructions is very brutal on the performance, but thankfully most real-world workloads shouldn't be making too much use of these instructions.
Locking the memory bus sounds like it could hurt performance even of other cores, if it's like cache-line splits for locked instructions.
(Those cycle numbers are core clock cycle counts; if the DRNG doesn't run on the same clock as the core, those might vary by CPU model. I wonder if uops.info's testing is running rdrand on multiple cores of the same hardware, since Coffee Lake is twice the uops as Skylake, and 1.4x as many cycles per random number. Unless that's just higher clocks leading to more microcode retries?)
Microsoft compiler does not have intrinsics support for RDSEED and RDRAND instruction.
But, you may implement these instruction using NASM or MASM. Assembly code is available at:
https://software.intel.com/en-us/articles/intel-digital-random-number-generator-drng-software-implementation-guide
For Intel Compiler, you can use header to determine the version. You can use following macros to determine the version and sub-version:
__INTEL_COMPILER //Major Version
__INTEL_COMPILER_UPDATE // Minor Update.
For instance if you use ICC15.0 Update 3 compiler, it will show that you have
__INTEL_COMPILER = 1500
__INTEL_COMPILER_UPDATE = 3
For further details on pre-defined macros you can go to: https://software.intel.com/en-us/node/524490

_ftol2_sse, are there faster options?

I have code which calls a lot of
int myNumber = (int)(floatNumber);
which takes up, in total, around 10% of my CPU time (according to profiler). While I could leave it at that, I wonder if there are faster options, so I tried searching around, and stumbled upon
http://devmaster.net/forums/topic/7804-fast-int-float-conversion-routines/
http://stereopsis.com/FPU.html
I tried implementing the Real2Int() function given there, but it gives me wrong results, and runs slower. Now I wonder, are there faster implementations to floor double / float values to integers, or is the SSE2 version as fast as it gets? The pages I found date back a bit, so it might just be outdated, and newer STL is faster at this.
The current implementation does:
013B1030 call _ftol2_sse (13B19A0h)
013B19A0 cmp dword ptr [___sse2_available (13B3378h)],0
013B19A7 je _ftol2 (13B19D6h)
013B19A9 push ebp
013B19AA mov ebp,esp
013B19AC sub esp,8
013B19AF and esp,0FFFFFFF8h
013B19B2 fstp qword ptr [esp]
013B19B5 cvttsd2si eax,mmword ptr [esp]
013B19BA leave
013B19BB ret
Related questions I found:
Fast float to int conversion and floating point precision on ARM (iPhone 3GS/4)
What is the fastest way to convert float to int on x86
Since both are old, or are ARM based, I wonder if there are current ways to do this. Note that it says the best conversion is one that doesn't happen, but I need to have it, so that will not be possible.
It's going to be hard to beat that if you are targeting generic x86 hardware. The runtime doesn't know for sure that the target machine has an SSE unit. If it did, it could do what the x64 compiler does and inline a cvttss2si opcode. But since the runtime has to check whether an SSE unit is available, you are left with the current implementation. That's what the implementation of ftol2_sse does. And what's more it passes the value in an x87 register and then transfers it to an SSE register if an SSE unit is available.
You could tell the x86 compiler to target machines that have SSE units. Then the compiler would indeed emit a simple cvttss2si opcode inline. That's going to be as fast as you can get. But if you run the code on an older machine then it will fail. Perhaps you could supply two versions, one for machines with SSE, and one for those without.
That's not going to gain you all that much. It's just going to avoid all the overhead of ftol2_sse that happens before you actually reach the cvttss2si opcode that does the work.
To change the compiler settings from the IDE, use Project > Properties > Configuration Properties > C/C++ > Code Generation > Enable Enhanced Instruction Set. On the command line it is /arch:SSE or /arch:SSE2.
For double I don't think you will be able to improve the results much but if you have a lot of floats to convert that using a packed conversion could help, the following is nasm code:
global _start
section .data
align 16
fv1: dd 1.1, 2.5, 2.51, 3.6
section .text
_start:
cvtps2dq xmm1, [fv1] ; Convert four 32-bit(single precision) floats to 32-bit(double word) integers and place the result in xmm1
There should be intrinsics code that allows you to do the same thing in an easier way but I am not as familiar with using intrinsics libraries. Although you are not using gcc this article Auto-vectorization with gcc 4.7 is an eye opener on how hard it can be to get the compiler to generate good vectorized code.
If you need speed and a large base of target machines, you'd better introduce a fast SSE version of all your algorithms, as well as a generic one -- and choose the algorithms to be executed at much higher level.
This would also mean that also the ABI is optimized for SSE; and that you can vectorize the calculation when available and that also the control logic is optimized for the architecture.
btw. even FLD; FIST sequence should take no longer than ~7 clock cycles on Pentium.