Get SSE version without __asm on x64 - c++

I'm trying to build slightly modified versions of some functions of the VS2010 CRT library. All is well except for the parts where it accesses a global variable that presumably holds the instruction set architecture (ISA) version:
if (__isa_available > __ISA_AVAILABLE_SSE2)
// ...
else if (__isa_available == __ISA_AVAILABLE_SSE2)
// ...
The values it should hold I found in an assembly file
How and where __isa_available is assigned a value is nowhere to be found (I've tried a find-in-files in all my directories...)
MSDN refers to the CPUID example to determine the instruction set. The problem with that is it uses __asm blocks and those are not allowed in my x64 build.
Does anyone know how to quickly assign the correct value to __isa_available?

Microsoft dropped support for inline assembly in x64 builds and introduced compiler intrinsics instead. You can find more information about CPUID in the new format here (with example).
The advantage of intrinsics over inline assembly is that they are compatible with both x86 and x64 without additional code and are easier to use.

VC++ has an intrinsic that allows you to use CPUID without inline ASM:
__cpuid in intrin.h
On that same website is an extensive code sample, too.
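As a rough sketch of how the intrinsic route could look, the snippet below reads CPUID leaf 1 through __cpuid on MSVC, or through __get_cpuid from <cpuid.h> on GCC/Clang, and maps the result to a small level value. The IsaLevel enum and the function names are hypothetical and chosen for illustration only; they are not the CRT's own constants, so the mapping to __isa_available values would still have to be taken from the CRT sources.

```cpp
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>   // __cpuid (MSVC intrinsic)
#define DEMO_CPUID_MSVC 1
#elif defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>    // __get_cpuid (GCC/Clang helper)
#define DEMO_CPUID_GNU 1
#endif

// Hypothetical level names for illustration; NOT the CRT's __ISA_AVAILABLE_* values.
enum IsaLevel { ISA_NONE = 0, ISA_SSE2 = 1, ISA_SSE42 = 2 };

// Reads ECX/EDX of CPUID leaf 1. Returns false when CPUID is not usable here.
inline bool read_cpuid_leaf1(std::uint32_t& ecx, std::uint32_t& edx)
{
#if defined(DEMO_CPUID_MSVC)
    int regs[4] = {0, 0, 0, 0};          // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    ecx = static_cast<std::uint32_t>(regs[2]);
    edx = static_cast<std::uint32_t>(regs[3]);
    return true;
#elif defined(DEMO_CPUID_GNU)
    unsigned int a = 0, b = 0, c = 0, d = 0;
    if (!__get_cpuid(1, &a, &b, &c, &d))
        return false;
    ecx = c;
    edx = d;
    return true;
#else
    (void)ecx;
    (void)edx;
    return false;                        // not an x86 target
#endif
}

// Maps the feature bits to the illustrative levels above.
inline IsaLevel detect_isa_level()
{
    std::uint32_t ecx = 0, edx = 0;
    if (!read_cpuid_leaf1(ecx, edx))
        return ISA_NONE;
    const bool sse2  = ((edx >> 26) & 1u) != 0;  // CPUID.1:EDX bit 26
    const bool sse42 = ((ecx >> 20) & 1u) != 0;  // CPUID.1:ECX bit 20
    if (sse2 && sse42)
        return ISA_SSE42;
    return sse2 ? ISA_SSE2 : ISA_NONE;
}
```

Calling detect_isa_level() once at startup and storing the result in your own global would give the modified CRT functions something to compare against, without any __asm block.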


Does compiled Crypto++ library code that uses AES/GCM encryption utilize Intel's AES-NI instructions?

I'm implementing AES256/GCM encryption and authentication using Crypto++ library. My code is compiled using Visual Studio 2008 as a C++/MFC project. This is a somewhat older project that uses a previous version of the library, Cryptopp562.
I'm curious if the resulting compiled code will use Intel's AES-NI instructions? And if so, what happens if the hardware (older CPU) does not support it?
EDIT: Here's an example of code that I'm testing it with:
int nIV_Length = 12;
int nAES_KeyLength = 32;
BYTE* iv = new BYTE[nIV_Length];
BYTE* key = new BYTE[nAES_KeyLength];
int nLnPlainText = 128;
BYTE* pDataPlainText = new BYTE[nLnPlainText];
CryptoPP::AutoSeededRandomPool rng;
rng.GenerateBlock(iv, nIV_Length);
CryptoPP::GCM<CryptoPP::AES>::Encryption enc;
enc.SetKeyWithIV(key, nAES_KeyLength, iv, nIV_Length);
BYTE* pDataOut_AES_GCM = new BYTE[nLnPlainText];
memset(pDataOut_AES_GCM, 0, nLnPlainText);
BYTE mac[16] = {0};
enc.EncryptAndAuthenticate(pDataOut_AES_GCM, mac, sizeof(mac), iv, nIV_Length, NULL, 0, pDataPlainText, nLnPlainText);
delete[] pDataPlainText;
delete[] pDataOut_AES_GCM;
delete[] key;
delete[] iv;
If you run code containing AES-NI instructions on x86 hardware which does not support these instructions, you should get invalid instruction errors. Unless the code does something smart (such as looking at CPUID to decide whether to run AES-NI optimized code, or something else), this can also be used to detect whether AES-NI instructions are actually used.
Otherwise you can always use a debugger, and set breakpoints at the AES-NI instructions to see whether your process ever uses that portion of code.
According to Crypto++ release notes AES-NI support was added in version 5.6.1. Looking at the source code of version 5.6.5 Crypto++, if AES-NI support was enabled at compile time, then it uses run-time checks (the HasAESNI() function, probably utilizing CPUID) to decide whether to use these intrinsics. See rijndael.cpp (and cpu.cpp for the CPUID code) in its source code for details.
I'm curious if the resulting compiled code will use Intel's AES-NI instructions?
Crypto++ 5.6.1 added support for AES-NI and Carryless Multiplies under GCM. It is used when two or three conditions are met. First, you are using a version of the library with the support. From the homepage under News (or the README):
8/9/2010 - Version 5.6.1 released
added support for AES-NI and CLMUL instruction sets in AES and GMAC/GCM
Second, the compiler, assembler and linker must support the instructions. For Crypto++, that means you use at least MSVC 2008 SP1, GCC 4.3, and Binutils 2.19. For MSVC, if you look at config.h, it's guarded as follows (__AES__ is there for GCC and friends, too):
#if ... (_MSC_FULL_VER >= 150030729) ...
You can look up _MSC_FULL_VER numbers at Visual Studio version. Ironically, I've never seen a similar page on MSDN even though the service packs matter. You have to go to a Chinese site. For example, checked iterators showed up in VS2005 SP1 (IIRC).
For Linux and GCC compatibles, the GNUmakefile checks the version of the compiler and assembler. If they are too old, then the makefile adds CRYPTOPP_DISABLE_AESNI to the command line to disable the support even if __AES__ is defined.
CRYPTOPP_DISABLE_AESNI shows up more often than you think. For example, if you download OpenBSD 6.0 (the current version), then CRYPTOPP_DISABLE_AESNI will be present because their assembler is so old. They are mostly stuck at the pre-GPLv3 versions of their tools (apparently they did not agree to the license changes).
Third, the CPU supports both AES and SSE4 instructions (the reason for the SSE4 instructions is explained below). These checks are performed at runtime, and the function of interest is called HasAESNI() from cpu.h (there's also a HasSSE4()):
//! \brief Determines AES-NI availability
//! \returns true if AES-NI is determined to be available, false otherwise
//! \details HasAESNI() is a runtime check performed using CPUID
inline bool HasAESNI()
{
    if (!g_x86DetectionDone)
        DetectX86Features();
    return g_hasAESNI;
}
The caveat of Item (3) is the library needed to be compiled with support from Item (2). If Item (2) did not include compile time support, then Item (3) cannot offer runtime support.
With respect to Item (3) and runtime support, we recently had to tune it. It seems some low-end Atom processors, like the D2500, have SSE2, SSE3, SSSE3 and AES-NI, but not SSE4.1 or SSE4.2. According to Intel ARK, it's an optional configuration of the processor. We received one bug report about an illegal SSE4 instruction in the AES-NI codepath, so we had to add a HasSSE4() check. See PR 172, Check for SSE4 support before using SSE4.1 instruction.
And if so, what happens if the hardware (older CPU) does not support it?
Nothing. The default CXX implementation is used rather than the hardware accelerated AES.
You might be interested to know we also have other AES hardware acceleration, including ARMv8 Crypto and VIA Padlock. We also provide other hardware acceleration, like CRC32, Carryless-Multiplies and SHA. They all function the same way - compile time support is translated into runtime support.
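The "compile time support is translated into runtime support" pattern can be sketched roughly as follows. This is not Crypto++ code: DEMO_AESNI_COMPILED, encrypt_with_aesni(), encrypt_with_software() and the other names are hypothetical stand-ins, and the CPUID path assumes a GCC/Clang toolchain with <cpuid.h>. It only illustrates the two-stage decision: a preprocessor guard decides whether the accelerated path exists at all, and a cached CPUID check decides whether it is taken.

```cpp
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>   // __get_cpuid (GCC/Clang helper)
#endif

// Stand-ins for the two code paths; they only report which one ran.
inline const char* encrypt_with_aesni()    { return "aesni"; }
inline const char* encrypt_with_software() { return "software"; }

// Runtime half: CPUID leaf 1, ECX bit 25 (AES-NI) and ECX bit 19 (SSE4.1).
inline bool cpu_has_aesni_and_sse41()
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return ((ecx >> 25) & 1u) != 0 && ((ecx >> 19) & 1u) != 0;
#else
    return false;    // not an x86 target in this sketch
#endif
}

// Compile-time half: a hypothetical macro that a build system would set to 0
// when the compiler or assembler is too old to emit the instructions.
#ifndef DEMO_AESNI_COMPILED
#define DEMO_AESNI_COMPILED 1
#endif

inline const char* encrypt_block_dispatch()
{
#if DEMO_AESNI_COMPILED
    // Detection runs once; later calls reuse the cached answer.
    static const bool use_hw = cpu_has_aesni_and_sse41();
    if (use_hw)
        return encrypt_with_aesni();
#endif
    return encrypt_with_software();
}
```

If the guard macro is 0, the hardware branch is never compiled in, which mirrors the caveat above: without compile-time support there is nothing for the runtime check to select.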
(Comment): I just set a breakpoint on DetectX86Features method in cpu.cpp ... and it never triggered ...
This can be tricky for two reasons. First, the calls may be inlined in release builds so the code is shaped a little differently than you would expect.
Second, there's a global random number generator accessed by GlobalRNG(). GlobalRNG() is AES in OFB mode. When initializers run for the test.cpp translation unit, the GlobalRNG() is created which causes DetectX86Features() to run very early (before control enters main).
You may have better luck with observing the low level details with WinDbg.
It's also worth mentioning that AES/GCM can be sped up by interleaving AES with GCM. I believe the idea is to perform 4 rounds of AES key calculation and 1 CLMUL in parallel. Crypto++ does not take advantage of it, but OpenSSL takes the opportunity. I don't know what Botan or mbedTLS do.
Just to finish up my question, here's my findings.
The method that forks the execution to hardware supported AES-NI instructions, vs software implemented ones in Crypto++ library for my code sample above, is Rijndael::Enc::AdvancedProcessBlocks located in rijndael.cpp. It starts as such:
size_t Rijndael::Enc::AdvancedProcessBlocks(const byte *inBlocks, const byte *xorBlocks, byte *outBlocks, size_t length, word32 flags) const
{
#if CRYPTOPP_BOOL_AESNI_INTRINSICS_AVAILABLE
    if (HasAESNI())
        return AESNI_AdvancedProcessBlocks(AESNI_Enc_Block, AESNI_Enc_4_Blocks, (MAYBE_CONST __m128i *)(const void *)m_key.begin(), m_rounds, inBlocks, xorBlocks, outBlocks, length, flags);
#endif
    // ...
The CRYPTOPP_BOOL_AESNI_INTRINSICS_AVAILABLE preprocessor variable will be defined if you're building the Crypto++ library with at least Visual Studio 2008 with SP1 (note that SP1 is important.) Such dependency is necessary to be able to use AES-NI intrinsics (such as _mm_aesenc_si128 and _mm_aesenclast_si128) to generate Intel's AES-NI machine code instructions.
So adding a breakpoint at the beginning of Rijndael::Enc::AdvancedProcessBlocks will let you debug it right from Visual Studio. No outside debugger needed.
If you then step into the AESNI_AdvancedProcessBlocks method, the actual AES encryption will be processed in one of the AESNI_Enc_* methods. Here's how the actual aesenc and aesenclast machine instructions may look for the x86 configuration in a Release build:
So to answer my original question, for the code sample in my post above to be able to utilize Intel's AES-NI instructions one needs to build both the code sample and Crypto++ library with at least Visual Studio 2008 with SP1. (Just building it with Visual Studio 2008, or earlier version, will not do the job, even if the CPU that the code runs on supports AES-NI instructions.) After that, no other steps seem to be necessary. The library will detect the presence of AES-NI instructions automatically (HasAESNI() function) and will use them when available. Otherwise it will default to a software implementation.
Lastly, just out of curiosity, I decided to see how much difference in speed there is between hardware and software AES-GCM encryption. I used the following code snippet (added to my code sample above):
int nCntTest = 100000;
DWORD dwmsIniTicks = ::GetTickCount();
for(int i = 0; i < nCntTest; i++)
{
    enc.EncryptAndAuthenticate(pDataOut_AES_GCM, mac, sizeof(mac), iv, nIV_Length, NULL, 0, pDataPlainText, nLnPlainText);
}
DWORD dwmsElapsed = ::GetTickCount() - dwmsIniTicks;
bool bHaveHwAES_Support = false;
bHaveHwAES_Support = CryptoPP::HasAESNI();
_tprintf(L"\nTimed %d AES256-GCM encryptions %s hardware encryption of %d bytes: %u ms\n",
    nCntTest, bHaveHwAES_Support ? L"with" : L"without",
    nLnPlainText, dwmsElapsed);
Here are two results:
This is obviously not an all-encompassing test. I ran it on my desktop with the "Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz" CPU.
But the good news is that AES-GCM encryption seems to be very fast, even without a hardware AES support.

HP-UX Itanium Compare and Swap

I am developing C/C++ cross-platform code, and the last platform is Itanium-based HP-UX. Relevant machine and processor information can be found at the end of the question.
I need to implement or find an atomic compare and swap for the machine and compiler specifications given below.
I have found a few possibilities for solutions, but haven't been able to find how to use them.
The first possible solution is through the use of _Asm_cmpxchg (documentation here). I'm unable to find what header to include for this or how to get it to compile.
The second possible solution is to write my own inline assembly with the direct use of the cmpxchg and cmpxchg8b commands, but I haven't been able to find how to correctly do this either. I've found various resources, most of which are directly writing assembly, not for the processor architecture I require, or don't show a specific enough example.
I found more documentation about the cmpxchg and cmpxchg8 instructions (as well as tzcnt and lzcnt, which are nice to have but not necessary) here. If you are viewing it in Google Chrome, the absolute page numbers are 234 for cmpxchg and 236 for cmpxchg8.
Limitations: I am unable to use a third party library due to constraints beyond my control.
Result of uname -smr: HP-UX B.11.31 ia64
Processor Model: Intel(R) Itanium(R) Processor 9340
Compiler -v: aCC: HP C/aC++ B3910B A.06.28
Update: I was able to get _Asm_cmpxchg to compile, but it doesn't seem to work (the value remains unchanged). For parameters, I passed _SZ_W for the _Asm_sz, _SEM_ACQ for _Asm_sem, _LDHINT_NONE for _Asm_ldhint, a pointer to the original 32 bit integer value for r3, and the desired new value for r2. I'm guessing at the meaning of the parameters, given that documentation is very lackluster.
I ended up finding the solution on my own, using option 1. Below is the sample code to get it to work:
bool compare_and_swap(unsigned int* var, unsigned int oldval, unsigned int newval)
{
    // Move the old value into register _AREG_CCV because this is the register
    // that var will be compared against
    _Asm_mov_to_ar(_AREG_CCV, oldval);
    // Do the compare and swap
    return oldval == _Asm_cmpxchg(
        _SZ_W /* 4 byte word */,
        _SEM_ACQ /* acquire the semaphore */,
        var /* r3: pointer to the value to update */,
        newval /* r2: desired new value */,
        _LDHINT_NONE /* locality hint */);
}
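For the other platforms of a cross-platform code base, the same contract (return true only if the value was equal to oldval and has been replaced by newval) can be sketched with the standard library. This is only an illustration under the assumption that the toolchain provides C++11 <atomic>; whether aCC A.06.28 does was not checked, and compare_and_swap_portable is a hypothetical name.

```cpp
#include <atomic>

// Same contract as compare_and_swap above, expressed with std::atomic.
inline bool compare_and_swap_portable(std::atomic<unsigned int>& var,
                                      unsigned int oldval,
                                      unsigned int newval)
{
    // compare_exchange_strong overwrites its first argument when the
    // comparison fails, so work on a local copy of oldval.
    unsigned int expected = oldval;
    return var.compare_exchange_strong(expected, newval,
                                       std::memory_order_acquire,   // on success
                                       std::memory_order_acquire);  // on failure
}
```

The acquire ordering is chosen here to resemble the _SEM_ACQ completer used in the Itanium version; a different ordering may suit other uses.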

vtbl2 intrinsics on ARM64 missing

I have some code that uses the vtbl2_u8 ARM Neon intrinsic function. When I compile with armv7 or armv7s architectures, this code compiles (and executes) correctly. However, when I try to compile targeting arm64, I get errors:
simd.h: error: call to unavailable function 'vtbl2_u8'
My Xcode version is 6.1, iPhone SDK 8.1. Looking at arm64_neon_internal.h, the definition for vtbl2_u8 has an __attribute__(unavailable). There is a definition for vtbl2q_u8, but it takes different parameter types. Is there a direct replacement for the vtbl2 intrinsic for arm64?
As documented in the ARM NEON intrinsics reference, vtbl2_u8 is expected to be provided by compilers providing an ARM C Language Extensions implementation for AArch64 state in ARMv8-A. Note that the same document would suggest that vtbl2q_u8 is an Xcode extension, rather than an intrinsic which is expected to be supported by ACLE compilers.
The answer to your question then, is there should be no need for a replacement for vtbl2_u8, as it should be provided. However, that doesn't help you with your real problem, which is how you can use the instruction with a compiler which does not provide it.
Looking at what you have available in Xcode, and what vtbl2_u8 is documented to map to, I think you should be able to emulate the expected behaviour with:
uint8x8_t vtbl2_u8 (uint8x8x2_t a, uint8x8_t b)
{
    /* Build the 128-bit vector mask from the two 64-bit halves. */
    uint8x16_t new_mask = vcombine_u8 (a.val[0], a.val[1]);
    /* Use an Xcode specific intrinsic. */
    return vtbl1q_u8 (new_mask, b);
}
Though I don't have an Xcode toolchain to test with, so you'll have to confirm that does what you expect.
If this appears in performance-critical code, you may find that the vcombine_u8 is an unacceptable extra instruction. Fundamentally, a uint8x8x2_t lives in two consecutive registers, which gives a different layout between AArch64 and AArch32 (where Q0 was D0:D1). The vtbl2_u8 intrinsic requires a 128-bit mask.
Rewriting the producer of the uint8x8x2_t data to produce a uint8x16_t is the only other workaround for this, and is probably the one likely to work best. Note that even in compilers which provide the vtbl2_u8 intrinsic (trunk GCC and Clang at time of writing), an instruction performing the vcombine_u8 is inserted, so you may still be seeing extra move instructions behind the scenes.
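To confirm that an emulation behaves as expected, it can help to compare it against a plain scalar model of the lookup. The sketch below is such a model in portable C++ with no NEON dependency; tbl2_reference is a hypothetical helper name. It assumes the usual table-lookup semantics: indices 0 to 15 select a byte from the 16-byte table formed by the two halves, and any larger index produces 0.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar model of a two-register table lookup: 16 table bytes, 8 index bytes.
// lo holds table bytes 0..7, hi holds table bytes 8..15.
inline std::array<std::uint8_t, 8> tbl2_reference(
    const std::array<std::uint8_t, 8>& lo,
    const std::array<std::uint8_t, 8>& hi,
    const std::array<std::uint8_t, 8>& idx)
{
    std::array<std::uint8_t, 8> out{};      // zero-initialised
    for (std::size_t i = 0; i < out.size(); ++i) {
        const std::uint8_t k = idx[i];
        if (k < 8)
            out[i] = lo[k];                 // first half of the table
        else if (k < 16)
            out[i] = hi[k - 8];             // second half of the table
        else
            out[i] = 0;                     // out-of-range index yields 0
    }
    return out;
}
```

Feeding the same inputs to the NEON version (after storing its result with vst1_u8) and to this model should give identical bytes, including for out-of-range indices.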

Intrinsics for CPUID like informations?

Since I'm coding in C++, I would like, if possible, to use an intrinsics-like solution to read useful information about the hardware. My concerns and considerations are:
I don't know assembly that well, so it would be a considerable investment just to get this kind of information (although it looks like CPUID is just about setting values and reading registers).
there are at least 2 popular syntaxes for asm (Intel and AT&T), so it's fragmented
strangely enough, intrinsics are more popular and better supported than asm code these days
not all the compilers on my radar right now support inline asm; 64-bit MSVC is one that doesn't. I'm afraid I will find other similar gaps while digging further into the feature sets of the different compilers that I have to use.
considering the trend, I think it is more productive for me to bet on intrinsics; they should also be much easier than any asm code.
And the last question I have to answer is: how do I do a similar thing with intrinsics? I haven't found anything other than CPUID opcodes to get this kind of information.
After some digging I found some useful built-in functions that are GCC-specific.
The only problem is that these functions are really limited (basically you have only 2 functions, 1 for the CPU "name" and 1 for the set of registers).
An example is:
#include <stdio.h>
int main()
{
    if (__builtin_cpu_supports("mmx")) {
        printf("\nI got MMX !\n");
    } else
        printf("\nWhat ? MMX ? What is that ?\n");
    return (0);
}
and apparently these built-in functions work under mingw-w64 too.
GCC includes a cpuid interface in gcc/config/i386/cpuid.h.
These don't seem to be well documented, but example usage can be found in gcc/config/i386/driver-i386.c.
Note that you must use __cpuid_count() and not __cpuid() when the initial value of ecx matters, such as with avx/avx2 detection.
As user2485710 pointed out, gcc can do all the cpu feature detection work for you. As of gcc 4.8.1, the full list of features supported by __builtin_cpu_supports() is: cmov, mmx, popcnt, sse, sse2, sse3, ssse3, sse4.1, sse4.2, avx and avx2.
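As an illustration of the __cpuid_count() note, here is a minimal sketch that queries leaf 7, subleaf 0, where the AVX2 flag lives (EBX bit 5). It assumes GCC or Clang with <cpuid.h>; cpuid_reports_avx2 is a hypothetical helper name. It reports only the raw CPU flag and does not verify that the operating system saves the YMM registers, which __builtin_cpu_supports() is better placed to handle.

```cpp
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>   // __get_cpuid_max, __cpuid_count (GCC/Clang)
#endif

// True if CPUID leaf 7, subleaf 0 reports AVX2 (EBX bit 5).
inline bool cpuid_reports_avx2()
{
#if defined(__i386__) || defined(__x86_64__)
    // Leaf 7 is only meaningful if the CPU reports it as a supported basic leaf.
    if (__get_cpuid_max(0, nullptr) < 7)
        return false;
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    // The second argument is loaded into ECX before CPUID executes;
    // plain __cpuid() would leave ECX unspecified for this leaf.
    __cpuid_count(7, 0, eax, ebx, ecx, edx);
    return ((ebx >> 5) & 1u) != 0;
#else
    return false;    // not an x86 target
#endif
}
```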
Intrinsics such as this are also generally compiler specific.
MS VC++ has a __cpuid (and a __cpuidex) to generate a CPUID op code.
At least as far as I know, gcc/g++ doesn't provide an equivalent to that though. Inline assembly seems to be the only option available.
For x86/x64, Intel provides an intrinsic called _may_i_use_cpu_feature. You can find it under the General Support category of the Intel Intrinsics Guide page. Below is a rip of Intel's documentation.
GCC supposedly follows Intel with respect to intrinsics, so it should be available under GCC. It's not clear to me if Microsoft provides it because they provide most (but not all) Intel intrinsics.
I'm not aware of anything for ARM. As far as I know, there is no __builtin_cpu_supports("neon"), __builtin_cpu_supports("crc32"), __builtin_cpu_supports("aes"), __builtin_cpu_supports("pmull"), __builtin_cpu_supports("sha"), etc under ARM. For ARM you have to perform CPU feature probing.
int _may_i_use_cpu_feature (unsigned __int64 a)
#include "immintrin.h"
Dynamically query the processor to determine if the processor-specific feature(s) specified
in a are available, and return true or false (1 or 0) if the set of features is
available. Multiple features may be OR'd together. This intrinsic does not check the
processor vendor. See the valid feature flags below:
For great-grandchildren, this is how to obtain CPU vendor name with GCC, tested on Win7 x64
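#include <cpuid.h>
#include <stdio.h>
#include <string.h>
int main()
{
    unsigned int eax, ebx, ecx, edx;
    char vendor[13];
    /* Leaf 0 returns the vendor string in EBX, EDX, ECX (in that order). */
    __cpuid(0, eax, ebx, ecx, edx);
    memcpy(vendor, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);
    vendor[12] = '\0';
    printf("CPU: %s\n", vendor);
    return 0;
}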

Can I make clang generate absolute addresses for function pointers?

Here's a simplified version of some code I'm working with right now:
int Do_A_Thing(void)
{
    /* ... */
}
void Some_Function(void)
{
    int (*fn_do_thing)(void) = Do_A_Thing;
    /* ... */
}
When I compile this in xcode 4.1, the assembler it generates sets the fn_do_thing variable like so:
0x2006: calll 0x200b ;
0x200b: popl %eax ; get EIP
0x200c: movl 1333(%eax), %eax
I.e. it generates a relative address for the place to find the Do_A_Thing function - "current instruction plus 1333", which according to the map file is a "non-lazy pointer" to the function.
When I compile similar code on Windows with Visual Studio, windows generates a fixed address instead of doing it relatively like this. If Do_A_Thing lives at, for example, 0x40050914, it just sets fn_do_thing to 0x40050914. By contrast, xcode sets it to "where I am + some offset".
Is there any way to make xcode generate an absolute address to set the function pointer to, like visual studio does? Or is there some reason that wouldn't work? I have noticed that every time I run the program, the Do_A_Thing function (and all other functions) seem to load at a different address.
You're looking at position independent code (more specifically Position Independent Executable) in action. This, as you noticed, allows the OS to load the binary anywhere in memory, which provides numerous security improvements for potentially insecure code.
You can disable it by removing a linker option in Xcode (-Wl,-pie).
Note that on x86_64 (amd64), instructions can operate relative to the instruction pointer, which improves the efficiency of this technique (and makes it basically "free" in performance cost).
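One way to observe the effect is to print the function's run-time address and compare it across runs. The sketch below does that; do_a_thing and call_and_report are hypothetical stand-ins for the functions in the question. With a position-independent executable and address space randomization the printed value would be expected to change between runs, while the call through the pointer behaves the same either way.

```cpp
#include <cstdio>

// Hypothetical stand-in for Do_A_Thing from the question.
int do_a_thing() { return 42; }

// Prints the run-time address of do_a_thing, then calls it through the pointer.
inline int call_and_report()
{
    int (*fn)() = do_a_thing;
    // Converting a function pointer to void* is conditionally supported in C++;
    // mainstream desktop toolchains accept it.
    std::printf("do_a_thing is at %p\n", reinterpret_cast<void*>(fn));
    return fn();
}
```

Whatever code the compiler emits to compute the address, the pointer holds an ordinary absolute address once the program is running.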