vtbl2 intrinsics on ARM64 missing

I have some code that uses the vtbl2_u8 ARM Neon intrinsic function. When I compile with armv7 or armv7s architectures, this code compiles (and executes) correctly. However, when I try to compile targeting arm64, I get errors:
simd.h: error: call to unavailable function 'vtbl2_u8'
My Xcode version is 6.1, iPhone SDK 8.1. Looking at arm64_neon_internal.h, the definition for vtbl2_u8 has an __attribute__((unavailable)). There is a definition for vtbl2q_u8, but it takes different parameter types. Is there a direct replacement for the vtbl2 intrinsic on arm64?

As documented in the ARM NEON intrinsics reference (http://infocenter.arm.com/help/topic/com.arm.doc.ihi0073a/IHI0073A_arm_neon_intrinsics_ref.pdf), vtbl2_u8 is expected to be provided by compilers implementing the ARM C Language Extensions (ACLE) for the AArch64 state in ARMv8-A. Note that the same document suggests that vtbl2q_u8 is an Xcode extension, rather than an intrinsic which ACLE compilers are expected to support.
The answer to your question, then, is that there should be no need for a replacement for vtbl2_u8, as it should be provided by the compiler. However, that doesn't help with your real problem: how to use the operation with a compiler which does not provide it.
Looking at what you have available in Xcode, and what vtbl2_u8 is documented to map to, I think you should be able to emulate the expected behaviour with:
uint8x8_t vtbl2_u8 (uint8x8x2_t a, uint8x8_t b)
{
    /* Build the 128-bit vector mask from the two 64-bit halves. */
    uint8x16_t new_mask = vcombine_u8 (a.val[0], a.val[1]);
    /* Use an Xcode-specific intrinsic. */
    return vtbl1q_u8 (new_mask, b);
}
I don't have an Xcode toolchain to test with, though, so you'll have to confirm that this does what you expect.
If this appears in performance-critical code, you may find that the vcombine_u8 is an unacceptable extra instruction. Fundamentally, a uint8x8x2_t lives in two consecutive registers, which gives a different layout between AArch64 and AArch32 (where Q0 was D0:D1), while the TBL instruction behind vtbl2_u8 requires its mask in a single 16-byte register.
Rewriting the producer of the uint8x8x2_t data to produce a uint8x16_t instead is the only other workaround, and is probably the one likely to work best. Note that even in compilers which provide the vtbl2_u8 intrinsic (trunk GCC and Clang at the time of writing), an instruction performing the vcombine_u8 is inserted, so you may still see extra move instructions behind the scenes.
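For example, if the producer can be changed to build the table as a single 128-bit vector from the start, the lookup needs no vcombine_u8 at all. A minimal sketch, using the same Xcode-specific vtbl1q_u8 as above and equally untested:

#include <arm_neon.h>

/* Sketch: the producer hands us the table as one 128-bit vector,
   so it already lives in a single Q register. */
uint8x8_t lookup_bytes(uint8x16_t table, uint8x8_t indices)
{
    return vtbl1q_u8(table, indices);
}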

Related

Does compiled Crypto++ library code that uses AES/GCM encryption utilize Intel's AES-NI instructions?

I'm implementing AES256/GCM encryption and authentication using Crypto++ library. My code is compiled using Visual Studio 2008 as a C++/MFC project. This is a somewhat older project that uses a previous version of the library, Cryptopp562.
I'm curious if the resulting compiled code will use Intel's AES-NI instructions? And if so, what happens if the hardware (older CPU) does not support it?
EDIT: Here's an example of the code that I'm testing with:
int nIV_Length = 12;
int nAES_KeyLength = 32;
BYTE* iv = new BYTE[nIV_Length];
BYTE* key = new BYTE[nAES_KeyLength];
int nLnPlainText = 128;
BYTE* pDataPlainText = new BYTE[nLnPlainText];
CryptoPP::AutoSeededRandomPool rng;
rng.GenerateBlock(iv, nIV_Length);
CryptoPP::GCM<CryptoPP::AES>::Encryption enc;
enc.SetKeyWithIV(key, nAES_KeyLength, iv, nIV_Length);
BYTE* pDataOut_AES_GCM = new BYTE[nLnPlainText];
memset(pDataOut_AES_GCM, 0, nLnPlainText);
BYTE mac[16] = {0};
enc.EncryptAndAuthenticate(pDataOut_AES_GCM, mac, sizeof(mac), iv, nIV_Length, NULL, 0, pDataPlainText, nLnPlainText);
delete[] pDataPlainText;
delete[] pDataOut_AES_GCM;
delete[] key;
delete[] iv;
If you run code containing AES-NI instructions on x86 hardware which does not support these instructions, you should get invalid instruction errors. Unless the code does something smart (such as looking at CPUID to decide whether to run AES-NI optimized code, or something else), this can also be used to detect whether AES-NI instructions are actually used.
Otherwise you can always use a debugger, and set breakpoints at the AES-NI instructions to see whether your process ever uses that portion of code.
According to Crypto++ release notes AES-NI support was added in version 5.6.1. Looking at the source code of version 5.6.5 Crypto++, if AES-NI support was enabled at compile time, then it uses run-time checks (the HasAESNI() function, probably utilizing CPUID) to decide whether to use these intrinsics. See rijndael.cpp (and cpu.cpp for the CPUID code) in its source code for details.
I'm curious if the resulting compiled code will use Intel's AES-NI instructions?
Crypto++ 5.6.1 added support for AES-NI and Carryless Multiplies under GCM. It is used when three conditions are met. First, you are using a version of the library with the support. From the homepage under News (or the README):
8/9/2010 - Version 5.6.1 released
added support for AES-NI and CLMUL instruction sets in AES and GMAC/GCM
Second, the compiler, assembler and linker must support the instructions. For Crypto++, that means you use at least MSVC 2008 SP1, GCC 4.3, and Binutils 2.19. For MSVC, if you look at config.h, it's guarded as follows (__AES__ is there for GCC and friends, too):
#if ... (_MSC_FULL_VER >= 150030729) ...
#define CRYPTOPP_BOOL_AESNI_INTRINSICS_AVAILABLE 1
#else
#define CRYPTOPP_BOOL_AESNI_INTRINSICS_AVAILABLE 0
#endif
You can look up _MSC_FULL_VER numbers at Visual Studio version. Ironically, I've never seen a similar page on MSDN, even though the service packs matter; you have to go to a Chinese site. For example, checked iterators showed up in VS2005 SP1 (IIRC).
For Linux and GCC compatibles, the GNUmakefile checks the version of the compiler and assembler. If they are too old, then the makefile adds CRYPTOPP_DISABLE_AESNI to the command line to disable the support even if __AES__ is defined.
CRYPTOPP_DISABLE_AESNI shows up more often than you think. For example, if you download OpenBSD 6.0 (the current version), then CRYPTOPP_DISABLE_AESNI will be present because their assembler is so old. They are mostly stuck at the last GPLv2 versions of their tools (apparently they did not agree to the license changes).
Third, the CPU supports both AES-NI and SSE4 instructions (the reason for the SSE4 requirement is explained below). These checks are performed at runtime, and the function of interest is HasAESNI() from cpu.h (there's also a HasSSE4()):
//! \brief Determines AES-NI availability
//! \returns true if AES-NI is determined to be available, false otherwise
//! \details HasAESNI() is a runtime check performed using CPUID
inline bool HasAESNI()
{
    if (!g_x86DetectionDone)
        DetectX86Features();
    return g_hasAESNI;
}
The caveat for Item (3) is that the library needs to have been compiled with the support from Item (2). If Item (2) did not provide compile-time support, then Item (3) cannot offer runtime support.
With respect to Item (3) and runtime support, we recently had to tune it. It seems some low-end Atom processors, like the D2500, have SSE2, SSE3, SSSE3 and AES-NI, but not SSE4.1 or SSE4.2. According to Intel ARK, it's an optional configuration of the processor. We received one bug report about an illegal SSE4 instruction in the AES-NI codepath, so we had to add a HasSSE4() check. See PR 172, Check for SSE4 support before using SSE4.1 instruction.
And if so, what happens if the hardware (older CPU) does not support it?
Nothing. The default CXX implementation is used rather than the hardware accelerated AES.
You might be interested to know we also have other AES hardware acceleration, including ARMv8 Crypto and VIA Padlock. We also provide other hardware acceleration, like CRC32, Carryless-Multiplies and SHA. They all function the same way - compile time support is translated into runtime support.
(Comment): I just set a breakpoint on DetectX86Features method in cpu.cpp ... and it never triggered ...
This can be tricky for two reasons. First, the calls may be inlined in release builds, so the code is shaped a little differently than you would expect.
Second, there's a global random number generator accessed by GlobalRNG(). GlobalRNG() is AES in OFB mode. When initializers run for the test.cpp translation unit, the GlobalRNG() is created which causes DetectX86Features() to run very early (before control enters main).
You may have better luck observing the low-level details with WinDbg.
It's also worth mentioning that AES/GCM can be sped up by interleaving the AES with the GCM. I believe the idea is to perform 4 rounds of AES key calculation and 1 CLMUL in parallel. Crypto++ does not take advantage of it, but OpenSSL does. I don't know what Botan or mbedTLS do.
Just to finish up my question, here are my findings.
In the Crypto++ library, the method that forks execution between hardware-supported AES-NI instructions and the software implementation, for my code sample above, is Rijndael::Enc::AdvancedProcessBlocks, located in rijndael.cpp. It starts as such:
size_t Rijndael::Enc::AdvancedProcessBlocks(const byte *inBlocks, const byte *xorBlocks, byte *outBlocks, size_t length, word32 flags) const
{
#if CRYPTOPP_BOOL_AESNI_INTRINSICS_AVAILABLE
    if (HasAESNI())
        return AESNI_AdvancedProcessBlocks(AESNI_Enc_Block, AESNI_Enc_4_Blocks, (MAYBE_CONST __m128i *)(const void *)m_key.begin(), m_rounds, inBlocks, xorBlocks, outBlocks, length, flags);
#endif
The CRYPTOPP_BOOL_AESNI_INTRINSICS_AVAILABLE preprocessor variable will be defined if you're building the Crypto++ library with at least Visual Studio 2008 with SP1 (note that SP1 is important). This dependency is necessary to be able to use the AES-NI intrinsics (such as _mm_aesenc_si128 and _mm_aesenclast_si128) that produce Intel's AES-NI machine instructions.
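For illustration, here is a minimal sketch of how those intrinsics drive one AES-128 block encryption. This is not Crypto++ code; the round keys are assumed to be expanded already, and the function name is made up:

#include <wmmintrin.h>  /* AES-NI intrinsics; -maes on GCC/Clang */

__m128i aes128_encrypt_block(__m128i block, const __m128i roundkeys[11])
{
    block = _mm_xor_si128(block, roundkeys[0]);          /* initial AddRoundKey */
    for (int i = 1; i < 10; ++i)
        block = _mm_aesenc_si128(block, roundkeys[i]);   /* rounds 1-9 */
    return _mm_aesenclast_si128(block, roundkeys[10]);   /* final round, no MixColumns */
}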
So adding a breakpoint at the beginning of Rijndael::Enc::AdvancedProcessBlocks will let you debug it right from Visual Studio. No outside debugger needed.
If you then step into the AESNI_AdvancedProcessBlocks method, the actual AES encryption is performed in one of the AESNI_Enc_* methods, which is where the actual aesenc and aesenclast machine instructions show up in the disassembly for the x86 configuration in a Release build.
So to answer my original question: for the code sample in my post above to utilize Intel's AES-NI instructions, one needs to build both the code sample and the Crypto++ library with at least Visual Studio 2008 with SP1. (Just building it with Visual Studio 2008, or an earlier version, will not do the job, even if the CPU that the code runs on supports AES-NI instructions.) After that, no other steps seem to be necessary. The library detects the presence of AES-NI instructions automatically (the HasAESNI() function) and uses them when available. Otherwise it falls back to a software implementation.
Lastly, just out of curiosity, I decided to see how much of a difference hardware vs software AES-GCM encryption makes in speed. I used the following code snippet (based on my code sample above):
int nCntTest = 100000;
DWORD dwmsIniTicks = ::GetTickCount();
for(int i = 0; i < nCntTest; i++)
{
    enc.EncryptAndAuthenticate(pDataOut_AES_GCM, mac, sizeof(mac), iv, nIV_Length, NULL, 0, pDataPlainText, nLnPlainText);
}
DWORD dwmsElapsed = ::GetTickCount() - dwmsIniTicks;
bool bHaveHwAES_Support = false;
#if CRYPTOPP_BOOL_AESNI_INTRINSICS_AVAILABLE
bHaveHwAES_Support = CryptoPP::HasAESNI();
#endif
_tprintf(L"\nTimed %d AES256-GCM encryptions %s hardware encryption of %d bytes: %u ms\n",
    nCntTest, bHaveHwAES_Support ? L"with" : L"without",
    nLnPlainText, dwmsElapsed);
Here are the two results, with and without hardware AES support (the original screenshots of the timings are not reproduced here).
This is obviously not an all-encompassing test. I ran it on my desktop with the "Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz" CPU.
But the good news is that AES-GCM encryption seems to be very fast, even without hardware AES support.

Compiling legacy GCC code with AVX vector warnings

I've been trying to search on google but couldn't find anything useful.
typedef int64_t v4si __attribute__ ((vector_size(32)));
//warning: AVX vector return without AVX enabled changes the ABI [-Wpsabi]
// so isn't AVX already automatically enabled?
// What does it mean "without AVX enabled"?
// What does it mean "changes the ABI"?
inline v4si v4si_gt0(v4si x_);
//warning: The ABI for passing parameters with 32-byte alignment has changed in GCC 4.6
//So why there's warning and what does it mean?
// Why only this parameter got warning?
// And all other v4si parameter/arguments got no warning?
void set_quota(v4si quota);
That's not legacy code. __attribute__ ((vector_size(32))) means a 32-byte vector, i.e. 256-bit, which (on x86) means AVX. (GNU C Vector Extensions)
AVX isn't enabled unless you use -mavx (or a -march setting that includes it). Without that, the compiler isn't allowed to generate code that uses AVX instructions, because those would trigger an illegal-instruction fault on older CPUs that don't support AVX.
So the compiler can't pass or return 256b vectors in registers, like the normal calling convention specifies. Probably it treats them the same as structs of that size passed by value.
See the ABI links in the x86 tag wiki, or the x86 Calling Conventions page on Wikipedia (mostly doesn't mention vector registers).
Since the GNU C Vector Extensions syntax isn't tied to any particular hardware, using a 32 byte vector will still compile to correct code. It will perform badly, but it will still work even if the compiler can only use SSE instructions. (Last I saw, gcc was known to do a very bad job of generating code to deal with vectors wider than the target machine supports. You'd get significantly better code for a machine with 16B vectors from using vector_size(16) manually.)
Anyway, the point is that you get a warning instead of a compiler error because __attribute__ ((vector_size(32))) doesn't imply AVX specifically, but AVX or some other 256b vector instruction set is required for it to compile to good code.
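So the practical fix is to compile with AVX enabled when you actually want 256-bit vectors. A minimal sketch (the GCC flags in the comment are examples):

#include <stdint.h>

/* Build with AVX enabled so 256-bit vectors use the native ABI, e.g.:
     gcc -mavx -O2 -c vec.c      (or -march=sandybridge, -march=native, ...)
   With -mavx in effect, the -Wpsabi warnings about passing/returning
   these types go away, because they can live in YMM registers. */
typedef int64_t v4si __attribute__ ((vector_size(32)));

v4si v4si_add(v4si a, v4si b)
{
    /* Element-wise add: one vpaddq with AVX2, two 128-bit halves otherwise. */
    return a + b;
}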

HP-UX Itanium Compare and Swap

I am developing C/C++ cross-platform code, and the last platform is Itanium-based HP-UX. Relevant machine and processor information can be found at the end of the question.
I need to implement or find an atomic compare and swap for the machine and compiler specifications given below.
I have found a few possibilities for solutions, but haven't been able to find how to use them.
The first possible solution is through the use of _Asm_cmpxchg (documentation here). I'm unable to find what header to include for this or how to get it to compile.
The second possible solution is to write my own inline assembly with direct use of the cmpxchg and cmpxchg8b commands, but I haven't been able to find how to correctly do this either. The various resources I've found are for directly writing assembly, are not for the processor architecture I require, or don't show a specific enough example.
I found more documentation about the cmpxchg and cmpxchg8 instructions (as well as tzcnt and lzcnt, which are two that are nice to have, but not necessary) here. If you are viewing in Google Chrome, the absolute page values are 234 for cmpxchg and 236 for cmpxchg8.
Limitations: I am unable to use a third party library due to constraints beyond my control.
Result of uname -smr: HP-UX B.11.31 ia64
Processor Model: Intel(R) Itanium(R) Processor 9340
Compiler -v: aCC: HP C/aC++ B3910B A.06.28
Update: I was able to get _Asm_cmpxchg to compile, but it doesn't seem to work (the value remains unchanged). For parameters, I passed _SZ_W for the _Asm_sz, _SEM_ACQ for _Asm_sem, _LDHINT_NONE for _Asm_ldhint, a pointer to the original 32-bit integer value for r3, and the desired new value for r2. I'm guessing at the meaning of the parameters, given that the documentation is very lackluster.
I ended up finding the solution on my own, using option 1. Below is the sample code to get it to work:
bool compare_and_swap(unsigned int* var, unsigned int oldval, unsigned int newval)
{
    // Move the old value into register _AREG_CCV, because this is the
    // register that *var will be compared against.
    _Asm_mov_to_ar(_AREG_CCV, oldval);
    // Do the compare and swap.
    return oldval == _Asm_cmpxchg(
        _SZ_W /* 4-byte word */,
        _SEM_ACQ /* acquire memory-ordering semantics */,
        var,
        newval,
        _LDHINT_NONE /* locality hint */);
}
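For what it's worth, a typical way to use such a primitive is in a retry loop. A sketch of an atomic increment built on top of the function above (a hypothetical wrapper, untested on HP-UX):

unsigned int atomic_increment(volatile unsigned int* var)
{
    unsigned int oldval, newval;
    do {
        oldval = *var;            /* snapshot the current value */
        newval = oldval + 1;      /* compute the value we want  */
    } while (!compare_and_swap((unsigned int*)var, oldval, newval));
    return newval;                /* the value we installed     */
}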

Why is there assert( sizeof( bool ) == 1 ) in Doom 3 source?

Here's the assert. In what reasonable circumstances can it fail, and why is the game checking it?
Some platforms define bool to be the same size as int. At least older versions of Mac OS X (and likely other RISC BSD ports) were like this. Presumably the code uses bool arrays with an assumption of efficiency. Doom has been ported to a lot of platforms so it's probably very cagey about such things.
It has to be done at runtime because there is no standard macro specifying sizeof(bool), and compile-time checks didn't work with non-macro expressions until C++11's static_assert.
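For comparison, a sketch of the two forms of the check:

#include <cassert>

// C++11 and later: fails at compile time if the assumption is broken.
static_assert(sizeof(bool) == 1, "code assumes sizeof(bool) == 1");

int main()
{
    // Pre-C++11 (what Doom 3 does): the check has to happen at runtime.
    assert(sizeof(bool) == 1);
    return 0;
}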
I think I have come across the answer you were looking for. Doom 3 is cross-platform, and on x86 platforms gcc defines bool with a size of 1. On Mac OS X PowerPC, on the other hand, gcc (the compiler used by Apple at the time) defaults to 4. Use the -mone-byte-bool switch to change it to 1.
From http://linux.die.net/man/1/g++
-mone-byte-bool
Override the defaults for "bool" so that "sizeof(bool)==1". By
default "sizeof(bool)" is 4 when compiling for Darwin/PowerPC and 1
when compiling for Darwin/x86, so this option has no effect on x86.
Warning: The -mone-byte-bool switch causes GCC to generate code
that is not binary compatible with code generated without that
switch. Using this switch may require recompiling all other
modules in a program, including system libraries. Use this switch
to conform to a non-default data model.
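To see the difference in practice, a tiny test program (the sizes are per the man page above; I haven't run this on a PowerPC Mac):

#include <cstdio>

int main()
{
    // Darwin/PowerPC gcc: prints 4 by default, 1 with -mone-byte-bool.
    // Darwin/x86 gcc:     prints 1 either way.
    std::printf("sizeof(bool) = %u\n", (unsigned)sizeof(bool));
    return 0;
}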

Intrinsics for CPUID-like information?

Considering that I'm coding in C++, I would like, if possible, to use an intrinsics-like solution to read useful information about the hardware; my concerns/considerations are:
I don't know assembly that well; it would be a considerable investment just to get this kind of information (although it looks like, for the CPU, it's just about flipping values and reading registers).
there are at least 2 popular syntaxes for asm (Intel and AT&T), so it's fragmented
strangely enough, intrinsics are more popular and better supported than asm code these days
not all the compilers on my radar right now support inline asm; 64-bit MSVC is one, and I'm afraid I will find other similar gaps while digging into the feature sets of the different compilers I have to use.
considering the trend, I think it is more productive for me to bet on intrinsics; it should also be much easier than any asm code.
And the last question I have to answer is: how do I do a similar thing with intrinsics? I haven't found anything other than CPUID opcodes to get this kind of information.
After some digging, I found some useful built-in functions that are gcc-specific. The only problem is that these functions are really limited (basically you have only 2 of them: 1 for the CPU "name" and 1 for the set of registers).
An example is:
#include <stdio.h>

int main()
{
    if (__builtin_cpu_supports("mmx")) {
        printf("\nI got MMX !\n");
    } else {
        printf("\nWhat ? MMX ? What is that ?\n");
    }
    return (0);
}
And apparently these built-in functions work under mingw-w64 too.
Gcc includes a cpuid interface:
http://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/config/i386/cpuid.h
These don't seem to be well documented, but example usage can be found here:
http://gcc.gnu.org/git/?p=gcc.git;a=blob_plain;f=gcc/config/i386/driver-i386.c
Note that you must use __cpuid_count() and not __cpuid() when the initial value of ecx matters, such as with avx/avx2 detection.
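For example, AVX2 is reported in CPUID leaf 7, subleaf 0 (EBX bit 5), so the initial ECX value matters and __cpuid_count() is the one to use. A small sketch:

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    /* Leaf 7, subleaf 0: structured extended feature flags.
       __cpuid(7, ...) would leave ECX undefined, hence __cpuid_count. */
    __cpuid_count(7, 0, eax, ebx, ecx, edx);
    printf("AVX2: %s\n", (ebx & (1u << 5)) ? "yes" : "no");
    return 0;
}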
As user2485710 pointed out, gcc can do all the cpu feature detection work for you. As of gcc 4.8.1, the full list of features supported by __builtin_cpu_supports() is: cmov, mmx, popcnt, sse, sse2, sse3, ssse3, sse4.1, sse4.2, avx and avx2.
Intrinsics such as this are also generally compiler specific.
MS VC++ has a __cpuid (and a __cpuidex) to generate a CPUID op code.
At least as far as I know, gcc/g++ doesn't provide an equivalent to that though. Inline assembly seems to be the only option available.
For x86/x64, Intel provides an intrinsic called _may_i_use_cpu_feature. You can find it under the General Support category of the Intel Intrinsics Guide page. Below is a rip of Intel's documentation.
GCC supposedly follows Intel with respect to intrinsics, so it should be available under GCC. It's not clear to me whether Microsoft provides it; they provide most (but not all) of the Intel intrinsics.
I'm not aware of anything for ARM. As far as I know, there is no __builtin_cpu_supports("neon"), __builtin_cpu_supports("crc32"), __builtin_cpu_supports("aes"), __builtin_cpu_supports("pmull"), __builtin_cpu_supports("sha"), etc under ARM. For ARM you have to perform CPU feature probing.
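On Linux, for instance, that probing is typically done via the hwcap bits the kernel exposes through the auxiliary vector. A sketch for AArch64 (constant names from <asm/hwcap.h>):

#include <stdio.h>
#include <sys/auxv.h>    /* getauxval */
#include <asm/hwcap.h>   /* HWCAP_AES, HWCAP_PMULL, HWCAP_CRC32, ... */

int main(void)
{
    unsigned long hwcap = getauxval(AT_HWCAP);
    printf("AES:   %s\n", (hwcap & HWCAP_AES)   ? "yes" : "no");
    printf("PMULL: %s\n", (hwcap & HWCAP_PMULL) ? "yes" : "no");
    printf("CRC32: %s\n", (hwcap & HWCAP_CRC32) ? "yes" : "no");
    return 0;
}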
Synopsis
int _may_i_use_cpu_feature (unsigned __int64 a)
#include "immintrin.h"
Description
Dynamically query the processor to determine if the processor-specific feature(s) specified
in a are available, and return true or false (1 or 0) if the set of features is
available. Multiple features may be OR'd together. This intrinsic does not check the
processor vendor. See the valid feature flags below:
Operation
_FEATURE_GENERIC_IA32
_FEATURE_FPU
_FEATURE_CMOV
_FEATURE_MMX
_FEATURE_FXSAVE
_FEATURE_SSE
_FEATURE_SSE2
_FEATURE_SSE3
_FEATURE_SSSE3
_FEATURE_SSE4_1
_FEATURE_SSE4_2
_FEATURE_MOVBE
_FEATURE_POPCNT
_FEATURE_PCLMULQDQ
_FEATURE_AES
_FEATURE_F16C
_FEATURE_AVX
_FEATURE_RDRND
_FEATURE_FMA
_FEATURE_BMI
_FEATURE_LZCNT
_FEATURE_HLE
_FEATURE_RTM
_FEATURE_AVX2
_FEATURE_KNCNI
_FEATURE_AVX512F
_FEATURE_ADX
_FEATURE_RDSEED
_FEATURE_AVX512ER
_FEATURE_AVX512PF
_FEATURE_AVX512CD
_FEATURE_SHA
_FEATURE_MPX
For great-grandchildren, this is how to obtain the CPU vendor name with GCC, tested on Win7 x64:
#include <cpuid.h>
#include <stdio.h>   /* printf */
#include <string.h>  /* memcpy */
...
int eax, ebx, ecx, edx;
char vendor[13];
/* CPUID leaf 0 returns the vendor string in EBX, EDX, ECX order. */
__cpuid(0, eax, ebx, ecx, edx);
memcpy(vendor, &ebx, 4);
memcpy(vendor + 4, &edx, 4);
memcpy(vendor + 8, &ecx, 4);
vendor[12] = '\0';
printf("CPU: %s\n", vendor);