I'm trying to implement a faster version of the following code fragment:
Eigen::VectorXd dTX = (( (XPSF.array() - x0).square() + (ZPSF.array() - z0).square() ).sqrt() + txShift)*fs/c + t0*fs;
Eigen::VectorXd Zsq = ZPSF.array().square();
Eigen::MatrixXd idxt(XPSF.size(),nc);
for (int i = 0; i < nc; i++) {
    idxt.col(i) = ((XPSF.array() - xe(i)).square() + Zsq.array()).sqrt()*fs/c + dTX.array();
    idxt.col(i) = (abs(XPSF.array()-xe(i)) <= ZPSF.array()*0.5/fnumber).select(idxt.col(i),-1);
}
The sample array sizes I'm working with right now are:
XPSF: Column Vector of 591*192 coefficients (113,472 total values in the column vector)
ZPSF: Same size as XPSF
xe: RowVector of 192 coefficients
idxt: Matrix of 113,472x192 size
Current runs with gcc, -msse2 and -O3 optimization yield an average time of ~0.08 seconds for the first line of the loop and ~0.03 seconds for the second line. I know that runtimes are platform dependent, but I believe this can still be much faster: a commercial software package performs the operations I'm trying to do here in about two orders of magnitude less time. Also, I suspect my code is a bit amateurish right now!
I've tried reading over the Eigen documentation to understand how vectorization works, where it is implemented, and how much of this code might be "implicitly" parallelized by Eigen, but I've struggled to keep track of the details. I'm also a bit new to C++ in general, but I've seen the documentation and other resources regarding std::thread and have tried to combine it with this code, though without much success.
Any advice would be appreciated.
Update 2
I would upvote Soleil's answer, because it contains helpful information, if I had the reputation score for it. However, I should clarify that I would like to first figure out what optimizations I can do without a GPU. I'm convinced that (OpenMP aside) Eigen's inherent multithreading and vectorization won't speed it up any further (unless there are unnecessary temporaries being generated). How could I use something like std::thread to explicitly parallelize this? I'm struggling to combine std::thread and Eigen to this end.
OpenMP
If your CPU has enough cores and threads, usually a simple and quick first step is to invoke OpenMP by adding the pragma:
#pragma omp parallel for
for (int i = 0; i < nc; i++)
and compile with /openmp (cl) or -fopenmp (gcc), or just -ftree-parallelize-loops=n with gcc to auto-parallelize the loop.
This distributes the loop iterations across the parallel threads your CPU can handle (8 threads with the 7700HQ).
In general you also can set a clause num_threads(n) where n is the desired number of threads:
#pragma omp parallel for num_threads(8)
Where I used 8 since the 7700HQ can handle 8 concurrent threads.
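Applied to the loop from the question, a minimal sketch might look like the following (the variable names and types are assumed from the question; compile with -fopenmp on gcc or /openmp with cl). Each iteration writes only its own column of idxt, so no synchronization is needed:
#include <Eigen/Dense>

// Sketch only: the caller is expected to have sized idxt to XPSF.size() x nc,
// as in the question.
void fill_idxt(const Eigen::VectorXd& XPSF, const Eigen::VectorXd& ZPSF,
               const Eigen::RowVectorXd& xe, const Eigen::VectorXd& dTX,
               Eigen::MatrixXd& idxt, int nc, double fs, double c, double fnumber)
{
    const Eigen::ArrayXd Zsq = ZPSF.array().square();
    #pragma omp parallel for
    for (int i = 0; i < nc; i++) {
        idxt.col(i) = ((XPSF.array() - xe(i)).square() + Zsq).sqrt()*fs/c + dTX.array();
        idxt.col(i) = (abs(XPSF.array() - xe(i)) <= ZPSF.array()*0.5/fnumber).select(idxt.col(i), -1);
    }
}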
TBB
You can also parallelize the loop explicitly with TBB's tbb::parallel_for (sketched below); note that #pragma unroll is a compiler-specific unrolling hint rather than a TBB feature:
#pragma unroll
for (int i = 0; i < nc; i++)
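With TBB proper, a minimal tbb::parallel_for sketch for the same loop (assuming the variables from the question, and linking with -ltbb) could be:
#include <tbb/parallel_for.h>

// Sketch only: XPSF, ZPSF, xe, Zsq, dTX, idxt, nc, fs, c and fnumber are assumed
// to be defined as in the question; each iteration touches a distinct column.
tbb::parallel_for(0, nc, [&](int i) {
    idxt.col(i) = ((XPSF.array() - xe(i)).square() + Zsq.array()).sqrt()*fs/c + dTX.array();
    idxt.col(i) = (abs(XPSF.array() - xe(i)) <= ZPSF.array()*0.5/fnumber).select(idxt.col(i), -1);
});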
threading integrated with eigen
With Eigen's OpenMP-based multithreading you can control the number of threads in any of these ways:
OMP_NUM_THREADS=n ./my_program
omp_set_num_threads(n);
Eigen::setNbThreads(n);
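A short sketch of setting this from code before the hot loop (4 is just an example value; Eigen's multithreading only has an effect when it is built with OpenMP enabled):
#include <Eigen/Core>
#include <omp.h>

int main()
{
    omp_set_num_threads(4);   // cap OpenMP threads process-wide
    Eigen::setNbThreads(4);   // cap the threads Eigen's parallel paths may use
    // ... the rest of the program ...
    return 0;
}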
remarks with multithreading with eigen
However, in the FAQ:
"currently Eigen parallelizes only general matrix-matrix products (bench), so it doesn't by itself take much advantage of parallel hardware."
In general, the improvement with OpenMP is not guaranteed, so benchmark the release build. Another approach is to make sure that you're using vectorized instructions.
Again, from the FAQ/vectorization:
How can I enable vectorization?
You just need to tell your compiler to enable the corresponding
instruction set, and Eigen will then detect it. If it is enabled by
default, then you don't need to do anything. On GCC and clang you can
simply pass -march=native to let the compiler enable all the instruction
sets supported by your CPU.
On the x86 architecture, SSE is not enabled by default by most
compilers. You need to enable SSE2 (or newer) manually. For example,
with GCC, you would pass the -msse2 command-line option.
On the x86-64 architecture, SSE2 is generally enabled by default, but
you can enable AVX and FMA for better performance.
On PowerPC, you have to use the following flags: -maltivec
-mabi=altivec, for AltiVec, or -mvsx for VSX-capable systems.
On 32-bit ARM NEON, the following: -mfpu=neon -mfloat-abi=softfp|hard,
depending if you are on a softfp/hardfp system. Most current
distributions are using a hard floating-point ABI, so go for the
latter, or just leave the default and just pass -mfpu=neon.
On 64-bit ARM, SIMD is enabled by default, you don't have to do
anything extra.
On S390X SIMD (ZVector), you have to use a recent gcc (version >5.2.1)
compiler, and add the following flags: -march=z13 -mzvector.
multithreading with cuda
Given the size of your arrays, you want to try offloading to a GPU to reach microsecond times; in that case you would (typically) have as many threads as there are elements in your array.
For a simple start, if you have an NVIDIA card, you want to look at cuBLAS, which also lets you use the tensor cores (fused multiply-add, etc.) of the latest generations, unlike a regular kernel.
Since Eigen is a header-only library, it makes sense that you could use it in a CUDA kernel.
You may also implement everything "by hand" (i.e., without Eigen) with regular kernels. This is nonsense in terms of engineering, but common practice in education/university projects, in order to understand everything.
multithreading with OneAPI and Intel GPU
Since you have a Skylake architecture, you can also offload your loop to your CPU's integrated GPU with OneAPI:
// Unroll loop as specified by the unroll factor.
#pragma unroll unroll_factor
for (int i = 0; i < nc; i++)
(from the sample).
I'm implementing AES256/GCM encryption and authentication using Crypto++ library. My code is compiled using Visual Studio 2008 as a C++/MFC project. This is a somewhat older project that uses a previous version of the library, Cryptopp562.
I'm curious if the resulting compiled code will use Intel's AES-NI instructions? And if so, what happens if the hardware (older CPU) does not support it?
EDIT: Here's an example of code that I'm testing it with:
// Crypto++ headers (adjust include paths to your project setup):
#include <aes.h>
#include <gcm.h>
#include <osrng.h>

int nIV_Length = 12;
int nAES_KeyLength = 32;
BYTE* iv = new BYTE[nIV_Length];
BYTE* key = new BYTE[nAES_KeyLength];
int nLnPlainText = 128;
BYTE* pDataPlainText = new BYTE[nLnPlainText];
memset(pDataPlainText, 0, nLnPlainText);    // dummy plaintext for this test
CryptoPP::AutoSeededRandomPool rng;
rng.GenerateBlock(iv, nIV_Length);
rng.GenerateBlock(key, nAES_KeyLength);     // the key needs random data as well
CryptoPP::GCM<CryptoPP::AES>::Encryption enc;
enc.SetKeyWithIV(key, nAES_KeyLength, iv, nIV_Length);
BYTE* pDataOut_AES_GCM = new BYTE[nLnPlainText];
memset(pDataOut_AES_GCM, 0, nLnPlainText);
BYTE mac[16] = {0};
enc.EncryptAndAuthenticate(pDataOut_AES_GCM, mac, sizeof(mac), iv, nIV_Length, NULL, 0, pDataPlainText, nLnPlainText);
delete[] pDataPlainText;
delete[] pDataOut_AES_GCM;
delete[] key;
delete[] iv;
If you run code containing AES-NI instructions on x86 hardware which does not support these instructions, you should get invalid instruction errors. Unless the code does something smart (such as looking at CPUID to decide whether to run AES-NI optimized code, or something else), this can also be used to detect whether AES-NI instructions are actually used.
Otherwise you can always use a debugger, and set breakpoints at the AES-NI instructions to see whether your process ever uses that portion of code.
According to Crypto++ release notes AES-NI support was added in version 5.6.1. Looking at the source code of version 5.6.5 Crypto++, if AES-NI support was enabled at compile time, then it uses run-time checks (the HasAESNI() function, probably utilizing CPUID) to decide whether to use these intrinsics. See rijndael.cpp (and cpu.cpp for the CPUID code) in its source code for details.
I'm curious if the resulting compiled code will use Intel's AES-NI instructions?
Crypto++ 5.6.1 added support for AES-NI and Carryless Multiplies under GCM. It is used when two or three conditions are met. First, you are using a version of the library with the support. From the homepage under News (or the README):
8/9/2010 - Version 5.6.1 released
added support for AES-NI and CLMUL instruction sets in AES and GMAC/GCM
Second, the compiler, assembler and the linker must support the instructions. For Crypto++, that means you must use at least MSVC 2008 SP1, GCC 4.3, and Binutils 2.19. For MSVC, if you look at config.h, it's guarded as follows (__AES__ is there for GCC and friends, too):
#if ... (_MSC_FULL_VER >= 150030729) ...
#define CRYPTOPP_BOOL_AESNI_INTRINSICS_AVAILABLE 1
#else
#define CRYPTOPP_BOOL_AESNI_INTRINSICS_AVAILABLE 0
#endif
You can look up _MSC_FULL_VER numbers at Visual Studio version. Ironically, I've never seen a similar page on MSDN even though the service packs matter; you have to go to a Chinese site. For example, checked iterators showed up in VS2005 SP1 (IIRC).
For Linux and GCC compatibles, the GNUmakefile checks the version of the compiler and assembler. If they are too old, then the makefile adds CRYPTOPP_DISABLE_AESNI to the command line to disable the support even if __AES__ is defined.
CRYPTOPP_DISABLE_AESNI shows up more often than you think. For example, if you download OpenBSD 6.0 (the current version), then CRYPTOPP_DISABLE_AESNI will be present because their assembler is so old. They are mostly stuck at the pre-GPL-2 version of their tools (apparently they did not agree to the license changes).
Third, the CPU must support both AES and SSE4 instructions (the reason for the SSE4 instructions is explained below). These checks are performed at runtime, and the function of interest is HasAESNI() from cpu.h (there's also a HasSSE4()):
//! \brief Determines AES-NI availability
//! \returns true if AES-NI is determined to be available, false otherwise
//! \details HasAESNI() is a runtime check performed using CPUID
inline bool HasAESNI()
{
    if (!g_x86DetectionDone)
        DetectX86Features();
    return g_hasAESNI;
}
The caveat of Item (3) is the library needed to be compiled with support from Item (2). If Item (2) did not include compile time support, then Item (3) cannot offer runtime support.
With respect to Item (3) and runtime support, we recently had to tune it. It seems some low-end Atom processors, like the D2500, have SSE2, SSE3, SSSE3 and AES-NI, but not SSE4.1 or SSE4.2. According to Intel ARK, it's an optional configuration of the processor. We received one bug report about an illegal SSE4 instruction in the AES-NI codepath, so we had to add a HasSSE4() check. See PR 172, Check for SSE4 support before using SSE4.1 instruction.
And if so, what happens if the hardware (older CPU) does not support it?
Nothing. The default CXX implementation is used rather than the hardware accelerated AES.
You might be interested to know we also have other AES hardware acceleration, including ARMv8 Crypto and VIA Padlock. We also provide other hardware acceleration, like CRC32, Carryless-Multiplies and SHA. They all function the same way - compile time support is translated into runtime support.
(Comment): I just set a breakpoint on DetectX86Features method in cpu.cpp ... and it never triggered ...
This can be tricky for two reasons. First, the calls may be inlined in release builds, so the code is shaped a little differently than you would expect.
Second, there's a global random number generator accessed by GlobalRNG(). GlobalRNG() is AES in OFB mode. When initializers run for the test.cpp translation unit, the GlobalRNG() is created which causes DetectX86Features() to run very early (before control enters main).
You may have better luck with observing the low level details with WinDbg.
It's also worth mentioning that AES/GCM can be sped up by interleaving AES with GCM. I believe the idea is to perform 4 rounds of AES key calculation and 1 CLMUL in parallel. Crypto++ does not take advantage of it, but OpenSSL takes the opportunity. I don't know what Botan or mbedTLS do.
Just to finish up my question, here's my findings.
The method that forks the execution to hardware supported AES-NI instructions, vs software implemented ones in Crypto++ library for my code sample above, is Rijndael::Enc::AdvancedProcessBlocks located in rijndael.cpp. It starts as such:
size_t Rijndael::Enc::AdvancedProcessBlocks(const byte *inBlocks, const byte *xorBlocks, byte *outBlocks, size_t length, word32 flags) const
{
#if CRYPTOPP_BOOL_AESNI_INTRINSICS_AVAILABLE
    if (HasAESNI())
        return AESNI_AdvancedProcessBlocks(AESNI_Enc_Block, AESNI_Enc_4_Blocks, (MAYBE_CONST __m128i *)(const void *)m_key.begin(), m_rounds, inBlocks, xorBlocks, outBlocks, length, flags);
#endif
The CRYPTOPP_BOOL_AESNI_INTRINSICS_AVAILABLE preprocessor variable will be defined if you're building the Crypto++ library with at least Visual Studio 2008 with SP1 (note that SP1 is important.) Such dependency is necessary to be able to use AES-NI intrinsics (such as _mm_aesenc_si128 and _mm_aesenclast_si128) to generate Intel's AES-NI machine code instructions.
So adding a breakpoint at the beginning of Rijndael::Enc::AdvancedProcessBlocks will let you debug it right from Visual Studio. No outside debugger needed.
If you then step into the AESNI_AdvancedProcessBlocks method, the actual AES encryption will be processed in one of the AESNI_Enc_* methods; in the Release build for an x86 configuration, the disassembly there shows the actual aesenc and aesenclast machine instructions.
So to answer my original question, for the code sample in my post above to be able to utilize Intel's AES-NI instructions one needs to build both the code sample and Crypto++ library with at least Visual Studio 2008 with SP1. (Just building it with Visual Studio 2008, or earlier version, will not do the job, even if the CPU that the code runs on supports AES-NI instructions.) After that, no other steps seem to be necessary. The library will detect the presence of AES-NI instructions automatically (HasAESNI() function) and will use them when available. Otherwise it will default to a software implementation.
Lastly, just from curiosity I decided to see how much difference would hardware vs software AES-GCM encryption would produce in speed. I used the following code snippet (from my code sample above):
int nCntTest = 100000;
DWORD dwmsIniTicks = ::GetTickCount();
for(int i = 0; i < nCntTest; i++)
{
    enc.EncryptAndAuthenticate(pDataOut_AES_GCM, mac, sizeof(mac), iv, nIV_Length, NULL, 0, pDataPlainText, nLnPlainText);
}
DWORD dwmsElapsed = ::GetTickCount() - dwmsIniTicks;
bool bHaveHwAES_Support = false;
#if CRYPTOPP_BOOL_AESNI_INTRINSICS_AVAILABLE
bHaveHwAES_Support = CryptoPP::HasAESNI();
#endif
_tprintf(L"\nTimed %d AES256-GCM encryptions %s hardware encryption of %d bytes: %u ms\n",
    nCntTest, bHaveHwAES_Support ? L"with" : L"without",
    nLnPlainText, dwmsElapsed);
Here are the two results, one with and one without hardware AES support (timing screenshots omitted).
This is obviously not an all-encompassing test. I ran it on my desktop with an "Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz" CPU.
But the good news is that AES-GCM encryption seems to be very fast, even without hardware AES support.
I've been trying to search on google but couldn't find anything useful.
typedef int64_t v4si __attribute__ ((vector_size(32)));
//warning: AVX vector return without AVX enabled changes the ABI [-Wpsabi]
// so isn't AVX already automatically enabled?
// What does it mean "without AVX enabled"?
// What does it mean "changes the ABI"?
inline v4si v4si_gt0(v4si x_);
//warning: The ABI for passing parameters with 32-byte alignment has changed in GCC 4.6
//So why there's warning and what does it mean?
// Why only this parameter got warning?
// And all other v4si parameter/arguments got no warning?
void set_quota(v4si quota);
That's not legacy code. __attribute__ ((vector_size(32))) means a 32-byte vector, i.e. 256-bit, which (on x86) means AVX. (GNU C Vector Extensions)
AVX isn't enabled unless you use -mavx (or a -march setting that includes it). Without that, the compiler isn't allowed to generate code that uses AVX instructions, because those would trigger an illegal-instruction fault on older CPUs that don't support AVX.
So the compiler can't pass or return 256b vectors in registers, like the normal calling convention specifies. Probably it treats them the same as structs of that size passed by value.
See the ABI links in the x86 tag wiki, or the x86 Calling Conventions page on Wikipedia (mostly doesn't mention vector registers).
Since the GNU C Vector Extensions syntax isn't tied to any particular hardware, using a 32 byte vector will still compile to correct code. It will perform badly, but it will still work even if the compiler can only use SSE instructions. (Last I saw, gcc was known to do a very bad job of generating code to deal with vectors wider than the target machine supports. You'd get significantly better code for a machine with 16B vectors from using vector_size(16) manually.)
Anyway, the point is that you get a warning instead of a compiler error because __attribute__ ((vector_size(32))) doesn't imply AVX specifically, but AVX or some other 256b vector instruction set is required for it to compile to good code.
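As a concrete illustration (my own example, not from the question), compiling the snippet below with gcc/g++ and -mavx (or a -march= setting that includes AVX) removes the AVX-related ABI warning, because the compiler is then allowed to pass and return the 32-byte vector in a YMM register:
#include <stdint.h>

typedef int64_t v4di __attribute__ ((vector_size(32)));  // 4 x int64_t = 256 bits

// Without -mavx, gcc warns that passing/returning this type "changes the ABI";
// with -mavx it is passed and returned in YMM registers.
v4di add4(v4di a, v4di b)
{
    return a + b;   // element-wise add, per the GNU C vector extensions
}

// Example build line: g++ -O2 -mavx -c vec.cpp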
Does Intel C++ compiler and/or GCC support the following Intel intrinsics, like MSVC does since 2012 / 2013?
#include <immintrin.h> // for the following intrinsics
int _rdrand16_step(uint16_t*);
int _rdrand32_step(uint32_t*);
int _rdrand64_step(uint64_t*);
int _rdseed16_step(uint16_t*);
int _rdseed32_step(uint32_t*);
int _rdseed64_step(uint64_t*);
And if these intrinsics are supported, since which version are they supported (with compile-time-constant please)?
Both GCC and Intel compiler support them. GCC support was introduced at the end of 2010. They require the header <immintrin.h>.
GCC support has been present since at least version 4.6, but there doesn't seem to be any specific compile-time constant for it - you can just check __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 6).
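For instance, a guard based on those macros might look like this (a sketch; the macro name HAVE_RDRAND_INTRINSICS is mine):
#if defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 6))
#define HAVE_RDRAND_INTRINSICS 1   // intrinsics exist; code still needs -mrdrnd or a suitable -march
#else
#define HAVE_RDRAND_INTRINSICS 0
#endif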
All the major compilers support Intel's intrinsics for rdrand and rdseed via <immintrin.h>.
Somewhat recent versions of some compilers are needed for rdseed, e.g. GCC9 (2019) or clang7 (2018), although those have been stable for a good while by now. If you'd rather use an older compiler, or not enable ISA-extension options like -march=skylake, a library1 wrapper function instead of the intrinsic is a good choice. (Inline asm is not necessary, I wouldn't recommend it unless you want to play with it.)
#include <immintrin.h>
#include <stdint.h>
// gcc -march=native or haswell or znver1 or whatever, or manually enable -mrdrnd
uint64_t rdrand64(){
    unsigned long long ret; // not uint64_t, GCC/clang wouldn't compile.
    do{}while( !_rdrand64_step(&ret) ); // retry until success.
    return ret;
}
// and equivalent for _rdseed64_step
// and 32 and 16-bit sizes with unsigned and unsigned short.
Some compilers define __RDRND__ when the instruction is enabled at compile time. GCC/clang have done so since they first supported the intrinsic, but ICC only much later (19.0). And with ICC, -march=ivybridge doesn't imply -mrdrnd or define __RDRND__ until 2021.1.
ICX is LLVM-based and behaves like clang.
MSVC doesn't define any macros; its handling of intrinsics is designed around runtime feature detection only, unlike gcc/clang where the easy way is compile-time CPU feature options.
Why do{}while() instead of while(){}? Turns out ICC compiles to a less-dumb loop with do{}while(), not uselessly peeling a first iteration. Other compilers don't benefit from that hand-holding, and it's not a correctness problem for ICC.
Why unsigned long long instead of uint64_t? The type has to agree with the pointer type expected by the intrinsic, or C and especially C++ compilers will complain, regardless of the object-representations being identical (64-bit unsigned). On Linux for example, uint64_t is unsigned long, but GCC/clang's immintrin.h define int _rdrand64_step(unsigned long long*), same as on Windows. So you always need unsigned long long ret with GCC/clang. MSVC is a non-problem as it can (AFAIK) only target Windows, where unsigned long long is the only 64-bit unsigned type.
But ICC defines the intrinsic as taking unsigned long* when compiling for GNU/Linux, according to my testing on https://godbolt.org/. So to be portable to ICC, you actually need #ifdef __INTEL_COMPILER; even in C++ I don't know a way to use auto or other type-deduction to declare a variable that matches it.
Compiler versions to support intrinsics
Tested on Godbolt; its earliest version of MSVC is 2015, and ICC 2013, so I can't go back any further. Support for _rdrand16_step / 32 / 64 were all introduced at the same time in any given compiler. 64 requires 64-bit mode.
         | CPU                    | gcc | clang | MSVC                | ICC
rdrand   | Ivy Bridge / Excavator | 4.6 | 3.2   | before 2015 (19.10) | before 13.0.1, but 19.0 for -mrdrnd defining __RDRND__; 2021.1 for -march=ivybridge to enable -mrdrnd
rdseed   | Broadwell / Zen 1      | 9.1 | 7.0   | before 2015 (19.10) | before(?) 13.0.1, but 19.0 also added -mrdrnd and -mrdseed options
The earliest GCC and clang versions don't recognize -march=ivybridge, only -mrdrnd. (GCC 4.9 and clang 3.6 added -march=ivybridge; not that you'd specifically want to target Ivy Bridge if modern CPUs are more relevant. So use a non-ancient compiler and set a CPU option appropriate for the CPUs you actually care about, or at least a -mtune= with a more recent CPU.)
Intel's new oneAPI / ICX compilers all support rdrand/rdseed, and are based on LLVM internals so they work similarly to clang for CPU options. (It doesn't define __INTEL_COMPILER, which is good because it's different from ICC.)
GCC and clang only let you use intrinsics for instructions you've told the compiler the target supports. Use -march=native if compiling for your own machine, or use -march=skylake or something to enable all the ISA extensions for the CPU you're targeting. But if you need your program to run on old CPUs and only use RDRAND or RDSEED after runtime detection, only those functions need __attribute__((target("rdrnd"))) or rdseed, and won't be able to inline into functions with different target options. Or using a separately-compiled library would be easier1.
-mrdrnd: enabled by -march=ivybridge or -march=znver1 (or bdver4 Excavator APUs) and later
-mrdseed: enabled by -march=broadwell or -march=znver1 or later
Normally if you're going to enable one CPU feature, it makes sense to enable others that CPUs of that generation will have, and to set tuning options. But rdrand isn't something the compiler will use on its own (unlike BMI2 shlx for more efficient variable-count shifts, or AVX/SSE for auto-vectorization and array/struct copying and init). So enabling -mrdrnd globally likely won't make your program crash on pre-Ivy Bridge CPUs, if you check CPU features and don't actually run code that uses _rdrand64_step on CPUs without the feature.
But if you are only going to run your code on some specific kind of CPU or later, gcc -O3 -march=haswell is a good choice. (-march also implies -mtune=haswell, and tuning for Ivy Bridge specifically is not what you want for modern CPUs. You could -march=ivybridge -mtune=skylake to set an older baseline of CPU features, but still tune for newer CPUs.)
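If you go the per-function __attribute__((target("rdrnd"))) route mentioned above instead of global -m options, a minimal sketch (my own; GCC/clang, x86-64 only) might look like this. The caller is still responsible for a runtime CPUID check before using it on unknown hardware:
#include <immintrin.h>

// Compiled with RDRAND enabled just for this function, even if the rest of the
// translation unit is built without -mrdrnd. Only call it after runtime detection
// has confirmed the CPU actually supports RDRAND.
__attribute__((target("rdrnd")))
unsigned long long rdrand64_target(void)
{
    unsigned long long ret;
    do {} while (!_rdrand64_step(&ret));   // retry until CF=1 (success)
    return ret;
}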
Wrappers that compile everywhere
This is valid C++ and C. For C, you probably want static inline instead of inline so you don't need to manually instantiate an extern inline version in a .c in case a debug build decided not to inline. (Or use __attribute__((always_inline)) in GNU C.)
The 64-bit versions are only defined for x86-64 targets, because asm instructions can only use 64-bit operand-size in 64-bit mode. I didn't #ifdef __RDRND__ or #if defined(__i386__)||defined(__x86_64__), on the assumption that you'd only include this for x86(-64) builds at all, not cluttering the ifdefs more than necessary. It does only define the rdseed wrappers if that's enabled at compile time, or for MSVC where there's no way to enable them or to detect it.
There are some commented __attribute__((target("rdseed"))) examples you can uncomment if you want to do it that way instead of compiler options. rdrand16 / rdseed16 are intentionally omitted as not being normally useful. rdrand runs the same speed for different operand-sizes, and even pulls the same amount of data from the CPU's internal RNG buffer, optionally throwing away part of it for you.
#include <immintrin.h>
#include <stdint.h>
#if defined(__x86_64__) || defined (_M_X64)
// Figure out which 64-bit type the output arg uses
#ifdef __INTEL_COMPILER // Intel declares the output arg type differently from everyone(?) else
// ICC for Linux declares rdrand's output as unsigned long, but must be long long for a Windows ABI
typedef uint64_t intrin_u64;
#else
// GCC/clang headers declare it as unsigned long long even for Linux where long is 64-bit, but uint64_t is unsigned long and not compatible
typedef unsigned long long intrin_u64;
#endif
//#if defined(__RDRND__) || defined(_MSC_VER) // conditional definition if you want
inline
uint64_t rdrand64(){
    intrin_u64 ret;
    do{}while( !_rdrand64_step(&ret) ); // retry until success.
    return ret;
}
//#endif
#if defined(__RDSEED__) || defined(_MSC_VER)
inline
uint64_t rdseed64(){
    intrin_u64 ret;
    do{}while( !_rdseed64_step(&ret) ); // retry until success.
    return ret;
}
#endif // RDSEED
#endif // x86-64
//__attribute__((target("rdrnd")))
inline
uint32_t rdrand32(){
    unsigned ret; // Intel documents this as unsigned int, not necessarily uint32_t
    do{}while( !_rdrand32_step(&ret) ); // retry until success.
    return ret;
}
#if defined(__RDSEED__) || defined(_MSC_VER)
//__attribute__((target("rdseed")))
inline
uint32_t rdseed32(){
    unsigned ret; // Intel documents this as unsigned int, not necessarily uint32_t
    do{}while( !_rdseed32_step(&ret) ); // retry until success.
    return ret;
}
#endif
The fact that Intel's intrinsics API is supported at all implies that unsigned int is a 32-bit type, regardless of whether uint32_t is defined as unsigned int or unsigned long if any compilers do that.
On the Godbolt compiler explorer we can see how these compile. Clang and MSVC do what we'd expect: just a 2-instruction loop until rdrand leaves CF=1.
# clang 7.0 -O3 -march=broadwell    (MSVC -O2 does the same)
rdrand64():
.LBB0_1: # =>This Inner Loop Header: Depth=1
        rdrand  rax
        jae     .LBB0_1              # synonym for jnc - jump if Not Carry
        ret
# same for other functions.
Unfortunately GCC is not so good, even current GCC12.1 makes weird asm:
# gcc 12.1 -O3 -march=broadwell
rdrand64():
        mov     edx, 1
.L2:
        rdrand  rax
        mov     QWORD PTR [rsp-8], rax   # store into the red-zone where retval is allocated
        cmovc   eax, edx                 # materialize a 0 or 1 from CF (rdrand zeros EAX when it clears CF=0, otherwise copy the 1)
        test    eax, eax                 # then test+branch on it
        je      .L2                      # could have just been jnc after rdrand
        mov     rax, QWORD PTR [rsp-8]   # reload retval
        ret
rdseed64():
.L7:
        rdseed  rax
        mov     QWORD PTR [rsp-8], rax   # dead store into the red-zone
        jnc     .L7
        ret
ICC makes the same asm as long as we use a do{}while() retry loop; with a while() {} it's even worse, doing an rdrand and checking before entering the loop for the first time.
Footnote 1: rdrand/rdseed library wrappers
librdrand or Intel's libdrng have wrapper functions with retry loops like I showed, and ones that fill a buffer of bytes or an array of uint32_t or uint64_t. (They consistently take uint64_t*, not unsigned long long* on some targets.)
A library is also a good choice if you're doing runtime CPU feature detection, so you don't have to mess around with __attribute__((target)) stuff. However you do it, that limits inlining of a function using the intrinsics anyway, so a small static library is equivalent.
libdrng also provides RdRand_isSupported() and RdSeed_isSupported(), so you don't need to do your own CPUID check.
But if you're going to build with -march= something newer than Ivy Bridge / Broadwell or Excavator / Zen1 anyway, inlining a 2-instruction retry loop (like clang compiles it to) is about the same code-size as a function call-site, but doesn't clobber any registers. rdrand is quite slow so that's probably not a big deal, but it also means no extra library dependency.
Performance / internals of rdrand / rdseed
For more details about the HW internals on Intel (not AMD's version), see Intel's docs. For the actual TRNG logic, see Understanding Intel's Ivy Bridge Random Number Generator - it's a metastable latch that settles to 0 or 1 due to thermal noise. Or at least Intel says it is; it's basically impossible to truly verify where the rdrand bits actually come from in a CPU you bought. Worst case, still much better than nothing if you're mixing it with other entropy sources, like Linux does for /dev/random.
For more on the fact that there's a buffer that cores pull from, see some SO answers from the engineer who designed the hardware and wrote librdrand, such as this and this about its exhaustion / performance characteristics on Ivy Bridge, the first generation to feature it.
Infinite retry count?
The asm instructions set the carry flag (CF) = 1 in FLAGS on success, when it put a random number in the destination register. Otherwise CF=0 and the output register = 0. You're intended to call it in a retry loop, that's (I assume) why the intrinsic has the word step in the name; it's one step of generating a single random number.
In theory, a microcode update could change things so it always indicates failure, e.g. if a problem is discovered in some CPU model that makes the RNG untrustworthy (by the standards of the CPU vendor). The hardware RNG also has some self-diagnostics, so it's in theory possible for a CPU to decide that the RNG is broken and not produce any outputs. I haven't heard of any CPUs ever doing this, but I haven't gone looking. And a future microcode update is always possible.
Either of these could lead to an infinite retry loop. That's not great, but unless you want to write a bunch of code to report on that situation, it's at least an observable behaviour that users could potentially deal with in the unlikely event it ever happened.
But occasional temporary failure is normal and expected, and must be handled. Preferably by retrying without telling the user about it.
If there wasn't a random number ready in its buffer, the CPU can report failure instead of stalling this core for potentially even longer. That design choice might be related to interrupt latency, or just keeping it simpler without having to build retrying into the microcode.
Ivy Bridge can't pull data from the DRNG faster than it can keep up, according to the designer, even with all cores looping rdrand, but later CPUs can. Therefore it is important to actually retry.
@jww has had some experience with deploying rdrand in libcrypto++, and found that with a retry count set too low, there were reports of occasional spurious failure. He's had good results from infinite retries, which is why I chose that for this answer. (I suspect he would have heard reports from users with broken CPUs that always fail, if that was a thing.)
Intel's library functions that include a retry loop take a retry count. That's likely to handle the permanent-failure case which, as I said, I don't think happens in any real CPUs yet. Without a limited retry count, you'd loop forever.
An infinite retry count allows a simple API returning the number by value, without silly limitations like OpenSSL's functions that use 0 as an error return: they can't randomly generate a 0!
If you did want a finite retry count, I'd suggest a very high one, like maybe 1 million, so it takes maybe half a second or a second of spinning to give up on a broken CPU, with negligible chance of having one thread starve that long if it's repeatedly unlucky in contending for access to the internal queue.
https://uops.info/ measured a throughput of one per 3554 cycles on Skylake, one per 1352 on Alder Lake P-cores, and one per 1230 on E-cores. One per 1809 cycles on Zen2. The Skylake version ran thousands of uops, the others were in the low double digits. Ivy Bridge had one-per-110-cycle throughput, but by Haswell it was already up to 2436 cycles, still with a double-digit number of uops.
These abysmal performance numbers on recent Intel CPUs are probably due to microcode updates to work around problems that weren't anticipated when the HW was designed. Agner Fog measured one per 460 cycle throughput for rdrand and rdseed on Skylake when it was new, each costing 16 uops. The thousands of uops are probably extra buffer flushing hooked into the microcode for those instructions by recent updates. Agner measured Haswell at 17 uops, 320 cycles when it was new. See RdRand Performance As Bad As ~3% Original Speed With CrossTalk/SRBDS Mitigation on Phoronix:
As explained in the earlier article, mitigating CrossTalk involves locking the entire memory bus before updating the staging buffer and unlocking it after the contents have been cleared. This locking and serialization now involved for those instructions is very brutal on the performance, but thankfully most real-world workloads shouldn't be making too much use of these instructions.
Locking the memory bus sounds like it could hurt performance even of other cores, if it's like cache-line splits for locked instructions.
(Those cycle numbers are core clock cycle counts; if the DRNG doesn't run on the same clock as the core, those might vary by CPU model. I wonder if uops.info's testing is running rdrand on multiple cores of the same hardware, since Coffee Lake is twice the uops as Skylake, and 1.4x as many cycles per random number. Unless that's just higher clocks leading to more microcode retries?)
The Microsoft compiler does not have intrinsic support for the RDSEED and RDRAND instructions.
But you may implement these instructions using NASM or MASM. Assembly code is available at:
https://software.intel.com/en-us/articles/intel-digital-random-number-generator-drng-software-implementation-guide
For the Intel compiler, you can use predefined macros to determine the version and sub-version:
__INTEL_COMPILER //Major Version
__INTEL_COMPILER_UPDATE // Minor Update.
For instance, if you use the ICC 15.0 Update 3 compiler, it will show that you have
__INTEL_COMPILER = 1500
__INTEL_COMPILER_UPDATE = 3
For further details on pre-defined macros you can go to: https://software.intel.com/en-us/node/524490
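For example, a version guard built from these macros might look like this (a sketch; the 15.0 Update 3 threshold is just an example):
#if defined(__INTEL_COMPILER) && \
    (__INTEL_COMPILER > 1500 || \
     (__INTEL_COMPILER == 1500 && __INTEL_COMPILER_UPDATE >= 3))
// ICC 15.0 Update 3 or newer
#endif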
Considering that I'm coding in C++, if possible I would like to use an Intrinsics-like solution to read useful information about the hardware; my concerns/considerations are:
I don't know assembly that well, so it would be a considerable investment just to get this kind of information (although it looks like, for the CPU, it's just about flipping values and reading registers).
there are at least 2 popular syntaxes for asm (Intel and AT&T), so it's fragmented
strangely enough, Intrinsics are more popular and better supported than asm code these days
not all the compilers that are on my radar right now support inline asm; MSVC 64-bit is one. I'm afraid that I will find other similar flaws while digging more into the feature sets of the different compilers that I have to use.
considering the trend, I think it is more productive for me to bet on Intrinsics; it should also be much easier than any asm code.
And the last question that I have to answer is: how do I do a similar thing with intrinsics? Because I haven't found anything other than the CPUID opcode to get this kind of information at all.
After some digging I have found some useful built-in functions that are gcc specific.
The only problem is that these functions are really limited (basically you have only 2 functions, 1 for the CPU "name" and 1 for the set of registers).
An example is:
#include <stdio.h>
int main()
{
    if (__builtin_cpu_supports("mmx")) {
        printf("\nI got MMX !\n");
    } else
        printf("\nWhat ? MMX ? What is that ?\n");
    return (0);
}
and apparently these built-in functions work under mingw-w64 too.
Gcc includes a cpuid interface:
http://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/config/i386/cpuid.h
These don't seem to be well documented, but example usage can be found here:
http://gcc.gnu.org/git/?p=gcc.git;a=blob_plain;f=gcc/config/i386/driver-i386.c
Note that you must use __cpuid_count() and not __cpuid() when the initial value of ecx matters, such as with avx/avx2 detection.
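For example, a minimal AVX2 check using __cpuid_count() (my own sketch, using GCC's <cpuid.h>):
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    // AVX2 is reported in CPUID leaf 7, subleaf 0, EBX bit 5; the initial ECX
    // value (the subleaf) matters here, hence __cpuid_count() and not __cpuid().
    if (__get_cpuid_max(0, 0) >= 7) {
        __cpuid_count(7, 0, eax, ebx, ecx, edx);
        printf("AVX2: %s\n", (ebx & (1u << 5)) ? "yes" : "no");
    }
    return 0;
}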
As user2485710 pointed out, gcc can do all the cpu feature detection work for you. As of gcc 4.8.1, the full list of features supported by __builtin_cpu_supports() is: cmov, mmx, popcnt, sse, sse2, sse3, ssse3, sse4.1, sse4.2, avx and avx2.
Intrinsics such as this are also generally compiler specific.
MS VC++ has a __cpuid (and a __cpuidex) to generate a CPUID op code.
At least as far as I know, gcc/g++ doesn't provide an equivalent to that though. Inline assembly seems to be the only option available.
For x86/x64, Intel provides an intrinsic called _may_i_use_cpu_feature. You can find it under the General Support category of the Intel Intrinsics Guide page. Below is a rip of Intel's documentation.
GCC supposedly follows Intel with respect to intrinsics, so it should be available under GCC. It's not clear to me if Microsoft provides it, because they provide most (but not all) Intel intrinsics.
I'm not aware of anything for ARM. As far as I know, there is no __builtin_cpu_supports("neon"), __builtin_cpu_supports("crc32"), __builtin_cpu_supports("aes"), __builtin_cpu_supports("pmull"), __builtin_cpu_supports("sha"), etc under ARM. For ARM you have to perform CPU feature probing.
Synopsis
int _may_i_use_cpu_feature (unsigned __int64 a)
#include "immintrin.h"
Description
Dynamically query the processor to determine if the processor-specific feature(s) specified
in a are available, and return true or false (1 or 0) if the set of features is
available. Multiple features may be OR'd together. This intrinsic does not check the
processor vendor. See the valid feature flags below:
Operation
_FEATURE_GENERIC_IA32
_FEATURE_FPU
_FEATURE_CMOV
_FEATURE_MMX
_FEATURE_FXSAVE
_FEATURE_SSE
_FEATURE_SSE2
_FEATURE_SSE3
_FEATURE_SSSE3
_FEATURE_SSE4_1
_FEATURE_SSE4_2
_FEATURE_MOVBE
_FEATURE_POPCNT
_FEATURE_PCLMULQDQ
_FEATURE_AES
_FEATURE_F16C
_FEATURE_AVX
_FEATURE_RDRND
_FEATURE_FMA
_FEATURE_BMI
_FEATURE_LZCNT
_FEATURE_HLE
_FEATURE_RTM
_FEATURE_AVX2
_FEATURE_KNCNI
_FEATURE_AVX512F
_FEATURE_ADX
_FEATURE_RDSEED
_FEATURE_AVX512ER
_FEATURE_AVX512PF
_FEATURE_AVX512CD
_FEATURE_SHA
_FEATURE_MPX
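A minimal usage sketch of this intrinsic (my own example; it is specific to the Intel compilers), OR'ing two of the flags above:
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    // Returns 1 only if all of the OR'd features are available on this processor.
    if (_may_i_use_cpu_feature(_FEATURE_AVX2 | _FEATURE_FMA))
        printf("AVX2 and FMA available\n");
    else
        printf("AVX2 and FMA not available\n");
    return 0;
}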
For great-grandchildren, this is how to obtain the CPU vendor name with GCC; tested on Win7 x64:
#include <cpuid.h>
#include <stdio.h>   // printf
#include <string.h>  // memcpy
...
int eax, ebx, ecx, edx;
char vendor[13];
__cpuid(0, eax, ebx, ecx, edx);
memcpy(vendor, &ebx, 4);
memcpy(vendor + 4, &edx, 4);
memcpy(vendor + 8, &ecx, 4);
vendor[12] = '\0';
printf("CPU: %s\n", vendor);