Compiling legacy GCC code with AVX vector warnings - c++

I've been trying to search on Google but couldn't find anything useful.
typedef int64_t v4si __attribute__ ((vector_size(32)));
//warning: AVX vector return without AVX enabled changes the ABI [-Wpsabi]
// So isn't AVX already automatically enabled?
// What does "without AVX enabled" mean?
// What does "changes the ABI" mean?
inline v4si v4si_gt0(v4si x_);
//warning: The ABI for passing parameters with 32-byte alignment has changed in GCC 4.6
// So why is there a warning, and what does it mean?
// Why does only this parameter get a warning,
// while all the other v4si parameters/arguments get none?
void set_quota(v4si quota);

That's not legacy code. __attribute__ ((vector_size(32))) means a 32 byte vector, i.e. 256 bit, which (on x86) means AVX. (GNU C Vector Extensions)
AVX isn't enabled unless you use -mavx (or a -march setting that includes it). Without that, the compiler isn't allowed to generate code that uses AVX instructions, because those would trigger an illegal-instruction fault on older CPUs that don't support AVX.
So the compiler can't pass or return 256b vectors in registers, like the normal calling convention specifies. Probably it treats them the same as structs of that size passed by value.
See the ABI links in the x86 tag wiki, or the x86 Calling Conventions page on Wikipedia (mostly doesn't mention vector registers).
Since the GNU C Vector Extensions syntax isn't tied to any particular hardware, using a 32 byte vector will still compile to correct code. It will perform badly, but it will still work even if the compiler can only use SSE instructions. (Last I saw, gcc was known to do a very bad job of generating code to deal with vectors wider than the target machine supports. You'd get significantly better code for a machine with 16B vectors from using vector_size(16) manually.)
Anyway, the point is that you get a warning instead of a compiler error because __attribute__ ((vector_size(32))) doesn't imply AVX specifically, but AVX or some other 256b vector instruction set is required for it to compile to good code.
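For concreteness, a minimal sketch (the file name and function are just placeholders) of what "enabling AVX" changes, reusing the typedef from the question:

// g++ -O2 example.cpp        warns: AVX vector return without AVX enabled [-Wpsabi]
// g++ -O2 -mavx example.cpp  no warning: a, b and the result travel in ymm registers
#include <stdint.h>

typedef int64_t v4si __attribute__ ((vector_size(32)));  // 4 x 64-bit = 256 bits

v4si v4si_add(v4si a, v4si b)
{
    return a + b;   // without -mavx, gcc has to do this in 128-bit (or smaller) pieces
}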

Related

Intrinsic load operation: Unsuited alignment, but the code still runs [duplicate]

I am looking at the generated assembly for my code (using Visual Studio 2017) and noticed that _mm_load_ps is often (always?) compiled to movups.
The data I'm using _mm_load_ps on is defined like this:
struct alignas(16) Vector {
    float v[4];
};
// often embedded in other structs like this
struct AABB {
    Vector min;
    Vector max;
    bool intersection(/* parameters */) const;
};
Now when I'm using this construct, the following will happen:
// this code
__m128 bb_min = _mm_load_ps(min.v);
// generates this
movups xmm4, XMMWORD PTR [r8]
I'm expecting movaps because of alignas(16). Do I need something else to convince the compiler to use movaps in this case?
EDIT: My question is different from this question because I'm not getting any crashes. The struct is specifically aligned and I'm also using aligned allocation. Rather, I'm curious why the compiler is switching _mm_load_ps (the intrinsic for aligned memory) to movups. If I know the struct was allocated at an aligned address and I'm accessing it through this, it would be safe to use movaps, right?
On recent versions of Visual Studio and the Intel Compiler (recent as in post-2013?), the compiler rarely ever generates aligned SIMD load/stores anymore.
When compiling for AVX or higher:
The Microsoft compiler (>VS2013?) doesn't generate aligned loads. But it still generates aligned stores.
The Intel compiler (> Parallel Studio 2012?) doesn't do it at all anymore. But you'll still see them in ICC-compiled binaries inside their hand-optimized libraries like memset().
As of GCC 6.1, it still generates aligned load/stores when you use the aligned intrinsics.
The compiler is allowed to do this because it's not a loss of functionality when the code is written correctly. All processors starting from Nehalem have no penalty for unaligned load/stores when the address is aligned.
Microsoft's stance on this issue is that it "helps the programmer by not crashing". Unfortunately, I can't find the original source for this statement from Microsoft anymore. In my opinion, this achieves the exact opposite of that because it hides misalignment penalties. From the correctness standpoint, it also hides incorrect code.
Whatever the case is, unconditionally using unaligned load/stores does simplify the compiler a bit.
New Revelations:
Starting Parallel Studio 2018, the Intel Compiler no longer generates aligned moves at all - even for pre-Nehalem targets.
Starting from Visual Studio 2017, the Microsoft Compiler also no longer generates aligned moves at all - even when targeting pre-AVX hardware.
Both cases result in inevitable performance degradation on older processors. But it seems that this is intentional as both Intel and Microsoft no longer care about old processors.
The only load/store intrinsics that are immune to this are the non-temporal load/stores. There is no unaligned equivalent of them, so the compiler has no choice.
So if you want to just test for correctness of your code, you can substitute in the load/store intrinsics for non-temporal ones. But be careful not to let something like this slip into production code since NT load/stores (NT-stores in particular) are a double-edged sword that can hurt you if you don't know what you're doing.
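If you want to try that substitution, here is a hedged sketch (the debug_ helper names are made up; it assumes SSE4.1 for _mm_stream_load_si128) of NT-based stand-ins that fault on a misaligned address instead of silently working:

#include <smmintrin.h>   // SSE4.1: _mm_stream_load_si128

// For alignment testing only -- do not ship NT loads/stores as drop-in replacements.
static inline __m128 debug_load_ps(float* p)
{
    // movntdqa requires a 16-byte-aligned address, so a misaligned p faults
    return _mm_castsi128_ps(_mm_stream_load_si128((__m128i*)p));
}

static inline void debug_store_ps(float* p, __m128 v)
{
    _mm_stream_ps(p, v);   // movntps also faults if p is not 16-byte aligned
}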

MSVC's intrinsics __emulu and _umul128 in GCC/Clang

In MSVC there exist the intrinsics __emulu() and _umul128(). The first does u32*u32->u64 multiplication and the second u64*u64->u128 multiplication.
Do the same intrinsics exist for Clang/GCC?
The closest I found are _mulx_u32() and _mulx_u64(), mentioned in Intel's Guide. But they produce the mulx instruction, which needs BMI2 support, while MSVC's intrinsics produce a regular mul instruction. Also, _mulx_u32() is not available in -m64 mode, while __emulu() and _umul128() both exist in 32-bit and 64-bit mode of MSVC.
You may try online 32-bit code and 64-bit code.
Of course, for 32-bit one may do return uint64_t(a) * uint64_t(b); (see it online), hoping that the compiler will guess correctly and optimize to a u32*u32->u64 multiplication instead of u64*u64->u64. But is there a way to be sure about this, rather than relying on the compiler's guess that both arguments are 32-bit (i.e. the higher part of each uint64_t is zeroed)? To have some intrinsic like __emulu() that makes you sure about the code?
There is __int128 in GCC/Clang (see code online), but again we have to rely on the compiler's guess that we are actually multiplying 64-bit numbers (i.e. the higher part of each __int128 is zeroed). Is there a way to be sure without the compiler guessing, if some intrinsics exist for that?
BTW, both uint64_t (for 32-bit) and __int128 (for 64-bit) produce the correct mul instruction instead of mulx in GCC/Clang. But again we have to rely on the compiler guessing correctly that the higher part of the uint64_t and __int128 is zeroed.
Of course I can look at the assembly that GCC/Clang produced and see that they guessed correctly, but looking at the assembly once doesn't guarantee that the same will always happen in all circumstances. And I don't know of a way in C++ to statically assert that the compiler made the correct guess about the assembly instructions.
You already have the answer. Use uint64_t and __uint128_t. No intrinsics needed. This is available with modern GCC and Clang for all 64-bit targets. See Is there a 128 bit integer in gcc?
#include <stdint.h>
typedef __uint128_t uint128_t;

// 32*32 = 64-bit multiplication
uint64_t mul_32_32(uint32_t a, uint32_t b) {
    return (uint64_t)a * b;
}

// 64*64 = 128-bit multiplication
uint128_t mul_64_64(uint64_t a, uint64_t b) {
    return (uint128_t)a * b;
}
Note that the cast must be on the operands, or at least on one operand. Casting the result would not work since it would do the multiplication with the shorter type and extend the result.
But is there a way to be sure about this, rather than relying on the compiler's guess?
You get exactly the same guarantee as with compiler intrinsics: that the value of the result is correct. There are never any guarantees about optimization. Just because you used intrinsics doesn't guarantee that the compiler will emit the “obvious” assembly instruction. The only way to have this guarantee is to use inline assembly, and for a simple operation like this it's likely to hurt performance because it would restrict the ways in which the compiler optimizes register usage.
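For completeness, a hedged sketch of that inline-assembly route for GCC/Clang on x86-64, mirroring _umul128's shape (low half returned, high half through a pointer). As noted above, the plain (uint128_t)a * b version normally compiles to the same single mul and optimizes better:

#include <stdint.h>

static inline uint64_t umul128_asm(uint64_t a, uint64_t b, uint64_t* hi)
{
    uint64_t lo;
    __asm__("mulq %3"                  // RDX:RAX = RAX * b
            : "=a"(lo), "=d"(*hi)      // low half in RAX, high half in RDX
            : "a"(a), "rm"(b)
            : "cc");
    return lo;
}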

How to emulate _mm256_loadu_epi32 with gcc or clang?

Intel's intrinsic guide lists the intrinsic _mm256_loadu_epi32:
__m256i _mm256_loadu_epi32 (void const* mem_addr);
/*
Instruction: vmovdqu32 ymm, m256
CPUID Flags: AVX512VL + AVX512F
Description
Load 256-bits (composed of 8 packed 32-bit integers) from memory into dst.
mem_addr does not need to be aligned on any particular boundary.
Operation
a[255:0] := MEM[mem_addr+255:mem_addr]
dst[MAX:256] := 0
*/
But clang and gcc do not provide this intrinsic. Instead they provide (in file avx512vlintrin.h) only the masked versions
_mm256_mask_loadu_epi32 (__m256i, __mmask8, void const *);
_mm256_maskz_loadu_epi32 (__mmask8, void const *);
which boil down to the same instruction vmovdqu32. My question: how can I emulate _mm256_loadu_epi32:
inline __m256i _mm256_loadu_epi32(void const* mem_addr)
{
/* code using vmovdqu32 and compiles with gcc */
}
without writing assembly, i.e. using only intrinsics available?
Just use _mm256_loadu_si256 like a normal person. The only thing the AVX512 intrinsic gives you is a nicer prototype (const void* instead of const __m256i*) so you don't have to write ugly casts.
@chtz points out that you might still want to write a wrapper function yourself to get the void* prototype. But don't call it _mm256_loadu_epi32; some future GCC version will probably add that for compat with Intel's docs and break your code.
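A hedged sketch of such a wrapper (the name is a placeholder chosen to avoid that collision):

#include <immintrin.h>

static inline __m256i loadu_256_epi32(const void* mem_addr)
{
    return _mm256_loadu_si256((const __m256i*)mem_addr);
}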
From another perspective, it's unfortunate that compilers don't treat it as an AVX1 intrinsic, but I guess compilers which don't optimize intrinsics, and which let you use intrinsics from ISA extensions you haven't enabled, need this kind of clue to know when they can use ymm16-31.
You don't even want the compiler to emit vmovdqu32 ymm when you're not masking; vmovdqu ymm is shorter and does exactly the same thing, with no penalty for mixing with EVEX-encoded instructions. The compiler can always use a vmovdqu32 or 64 if it wants to load into ymm16..31, otherwise you want it to use the shorter VEX-coded AVX1 vmovdqu.
I'm pretty sure that GCC treats _mm256_maskz_loadu_epi32(0xffu, ptr) exactly the same as _mm256_loadu_si256((const __m256i*)ptr) and makes the same asm regardless of which one you use. It can optimize away the 0xffu mask and simply use an unmasked load, but there's no need for that extra complication in your source.
But unfortunately GCC9 and earlier will pessimize to vmovdqu32 ymm0, [mem] when AVX512VL is enabled (e.g. -march=skylake-avx512) even when you write _mm256_loadu_si256. This was a missed-optimization, GCC Bug 89346.
It doesn't matter which 256-bit load intrinsic you use (except for aligned vs. unaligned) as long as there's no masking.
Related:
error: '_mm512_loadu_epi64' was not declared in this scope
What is the difference between _mm512_load_epi32 and _mm512_load_si512?

vtbl2 intrinsics on ARM64 missing

I have some code that uses the vtbl2_u8 ARM Neon intrinsic function. When I compile with armv7 or armv7s architectures, this code compiles (and executes) correctly. However, when I try to compile targeting arm64, I get errors:
simd.h: error: call to unavailable function 'vtbl2_u8'
My Xcode version is 6.1, iPhone SDK 8.1. Looking at arm64_neon_internal.h, the definition for vtbl2_u8 has an __attribute__(unavailable). There is a definition for vtbl2q_u8, but it takes different parameter types. Is there a direct replacement for the vtbl2 intrinsic for arm64?
As documented in the ARM NEON intrinsics reference ( http://infocenter.arm.com/help/topic/com.arm.doc.ihi0073a/IHI0073A_arm_neon_intrinsics_ref.pdf ), vtbl2_u8 is expected to be provided by compilers providing an ARM C Language Extensions implementation for AArch64 state in ARMv8-A. Note that the same document would suggest that vtbl2q_u8 is an Xcode extension, rather than an intrinsic which is expected to be supported by ACLE compilers.
The answer to your question then, is there should be no need for a replacement for vtbl2_u8, as it should be provided. However, that doesn't help you with your real problem, which is how you can use the instruction with a compiler which does not provide it.
Looking at what you have available in Xcode, and what vtbl2_u8 is documented to map to, I think you should be able to emulate the expected behaviour with:
#include <arm_neon.h>

uint8x8_t vtbl2_u8 (uint8x8x2_t a, uint8x8_t b)
{
    /* Build the 128-bit vector mask from the two 64-bit halves. */
    uint8x16_t new_mask = vcombine_u8 (a.val[0], a.val[1]);
    /* Use an Xcode-specific intrinsic. */
    return vtbl1q_u8 (new_mask, b);
}
Though I don't have an Xcode toolchain to test with, so you'll have to confirm that does what you expect.
If this is appearing in performance critical code, you may find that the vcombine_u8 is an unacceptable extra instruction. Fundamentally a uint8x8x2_t lives in two consecutive registers, which gives a different layout between AArch64 and AArch32 (where Q0 was D0:D1). On AArch64, the TBL instruction behind vtbl2_u8 wants its 16-byte lookup table in a single 128-bit register, hence the combine.
Rewriting the producer of the uint8x8x2_t data to produce a uint8x16_t is the only other workaround for this, and is probably the one likely to work best. Note that even in compilers which provide the vtbl2_u8 intrinsic (trunk GCC and Clang at time of writing), an instruction performing the vcombine_u8 is inserted, so you may still be seeing extra move instructions behind the scenes.
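For illustration, a hypothetical use of the emulation above (names made up), which looks up the first 8 bytes of the combined 16-byte table in reverse order:

#include <arm_neon.h>

uint8x8_t reverse_first_half(uint8x8x2_t table)
{
    static const uint8_t idx[8] = { 7, 6, 5, 4, 3, 2, 1, 0 };
    uint8x8_t indices = vld1_u8(idx);
    return vtbl2_u8(table, indices);   // uses the emulated version where the intrinsic is unavailable
}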

Aliasing of NEON vector data types

Does NEON support aliasing of the vector data types with their scalar components?
E.g.(Intel SSE)
typedef long long __m128i __attribute__ ((__vector_size__ (16), __may_alias__));
The above will allow me to do:
__m128i* somePtr;
somePtr++;//advance to the next block
Aliasing à la Intel will allow me to advance my pointer to the next block I want to process without managing extra counts and indexes.
The __may_alias__ attribute on __m128i should be viewed as a work-around which makes it possible to write strict-aliasing-correct code even though Intel completely screwed up the signatures of some of the SSE load/store intrinsics. (The 8-byte loading _mm_loadl_epi64(const __m128i*) is the most hilarious example, but there are others). ARM got their intrinsics right, so __may_alias__ isn't needed.
Just use pointers to the element type, and use explicit loads and stores. In my experience that leads to better code being generated, and is probably also more portable. (Does the ARM C language spec even allow pointers to NEON types? I wouldn't be surprised if they didn't).
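A minimal sketch of that style (the function and names are illustrative): advance a plain element pointer and use explicit loads/stores instead of aliasing a vector-type pointer over the buffer.

#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

void add_one_bytes(uint8_t* dst, const uint8_t* src, size_t n)
{
    for (size_t i = 0; i + 16 <= n; i += 16) {
        uint8x16_t v = vld1q_u8(src + i);   // explicit 16-byte load
        v = vaddq_u8(v, vdupq_n_u8(1));     // add 1 to each byte
        vst1q_u8(dst + i, v);               // explicit 16-byte store
    }
}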
NEON intrinsics implementation does NOT support aliasing of the vector data types with their scalar components.
GCC supports a bunch of intrinsics when you specify -mfpu=neon. One of the ones you'd be interested in, I'm guessing, is int32x4_t. More info about all the types available can be found on the ARM site.