AVX512 intrinsic equivalent of _mm256_broadcast_ss()?

I'm rewriting code from AVX2 to AVX512.
What's the equivalent I can use to broadcast a single float to a __m512 vector? In AVX2 it's _mm256_broadcast_ss(), but I can't find anything like _mm512_broadcast_ss().

AVX512 doesn't need a special intrinsic for the memory-source version (see footnote 1). You can simply use _mm512_set1_ps (which takes a float, not a float*). The compiler should use a memory-source broadcast if that's efficient, potentially even folded into a broadcast memory operand for an ALU instruction instead of a separate load; AVX512 can do that for 512-bit vectors.
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm512_set1_ps&expand=5236,4980
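For instance, a minimal sketch (the function and array names are made up) of broadcasting a float loaded from memory; with optimization enabled this typically becomes a single vbroadcastss zmm, [mem], or it may get folded into a {1to16} broadcast memory operand of the multiply:

#include <immintrin.h>
#include <cstddef>

// Hypothetical example: multiply an array by one float taken from memory.
// _mm512_set1_ps(*scale_ptr) should compile to vbroadcastss zmm, [mem]
// (or be folded into a broadcast memory operand of vmulps).
void scale_array(float* dst, const float* src, const float* scale_ptr, std::size_t n)
{
    __m512 scale = _mm512_set1_ps(*scale_ptr);
    for (std::size_t i = 0; i + 16 <= n; i += 16)
        _mm512_storeu_ps(dst + i, _mm512_mul_ps(_mm512_loadu_ps(src + i), scale));
    // tail elements (n % 16) omitted for brevity
}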
Footnote 1: The reason for _mm256_broadcast_ss even existing separately from _mm256_set1_ps is probably because of AVX1 vbroadcastss ymm, [mem] vs. AVX2 vbroadcastss ymm, xmm. Some compilers like MSVC and ICC let you use intrinsics without enabling the ISA extensions for the compiler to use anywhere, so there needed to be an intrinsic for only the AVX1 memory-source version specifically.
With AVX512, both memory and register source forms were introduced with AVX512F so there's no need to give users of those compilers a way to micro-manage which asm is allowed.

Related

How to emulate _mm256_loadu_epi32 with gcc or clang?

Intel's intrinsic guide lists the intrinsic _mm256_loadu_epi32:
__m256i _mm256_loadu_epi32 (void const* mem_addr);
/*
Instruction: vmovdqu32 ymm, m256
CPUID Flags: AVX512VL + AVX512F
Description
Load 256-bits (composed of 8 packed 32-bit integers) from memory into dst.
mem_addr does not need to be aligned on any particular boundary.
Operation
a[255:0] := MEM[mem_addr+255:mem_addr]
dst[MAX:256] := 0
*/
But clang and gcc do not provide this intrinsic. Instead they provide (in file avx512vlintrin.h) only the masked versions
_mm256_mask_loadu_epi32 (__m256i, __mmask8, void const *);
_mm256_maskz_loadu_epi32 (__mmask8, void const *);
which boil down to the same instruction vmovdqu32. My question: how can I emulate _mm256_loadu_epi32:
inline __m256i _mm256_loadu_epi32(void const* mem_addr)
{
/* code using vmovdqu32 and compiles with gcc */
}
without writing assembly, i.e. using only intrinsics available?
Just use _mm256_loadu_si256 like a normal person. The only thing the AVX512 intrinsic gives you is a nicer prototype (const void* instead of const __m256i*) so you don't have to write ugly casts.
@chtz points out that you might still want to write a wrapper function yourself to get the void* prototype. But don't call it _mm256_loadu_epi32; some future GCC version will probably add that intrinsic for compatibility with Intel's docs and break your code.
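A minimal sketch of such a wrapper (the name loadu_si256_void is just a placeholder, not a standard intrinsic):

#include <immintrin.h>

// Hypothetical wrapper: same asm as _mm256_loadu_si256, but takes void*
// so callers don't need a cast. Don't name it _mm256_loadu_epi32.
static inline __m256i loadu_si256_void(const void* mem_addr)
{
    return _mm256_loadu_si256(static_cast<const __m256i*>(mem_addr));
}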
From another perspective, it's unfortunate that compilers don't treat it as an AVX1 intrinsic, but I guess compilers which don't optimize intrinsics, and which let you use intrinsics from ISA extensions you haven't enabled, need this kind of clue to know when they can use ymm16-31.
You don't even want the compiler to emit vmovdqu32 ymm when you're not masking; vmovdqu ymm is shorter and does exactly the same thing, with no penalty for mixing with EVEX-encoded instructions. The compiler can always use a vmovdqu32 or vmovdqu64 if it wants to load into ymm16..31; otherwise you want it to use the shorter VEX-coded AVX1 vmovdqu.
I'm pretty sure that GCC treats _mm256_maskz_loadu_epi32(0xffu, ptr) exactly the same as _mm256_loadu_si256((const __m256i*)ptr) and makes the same asm regardless of which one you use. It can optimize away the 0xffu mask and simply use an unmasked load, but there's no need for that extra complication in your source.
But unfortunately GCC 9 and earlier will pessimize to vmovdqu32 ymm0, [mem] when AVX512VL is enabled (e.g. -march=skylake-avx512) even when you write _mm256_loadu_si256. This was a missed optimization, GCC bug 89346.
It doesn't matter which 256-bit load intrinsic you use (except for aligned vs. unaligned) as long as there's no masking.
Related:
error: '_mm512_loadu_epi64' was not declared in this scope
What is the difference between _mm512_load_epi32 and _mm512_load_si512?

Find position of first (lowest) set bit in 32-bit number

I need to get the position of the single set bit in a 32-bit number (there is always exactly one bit set). What's the fastest way, in C++ or asm?
For example
input: 0x00000001, 0x10000000
output: 0, 28
#ifdef __GNUC__, use __builtin_ctz(unsigned) to Count Trailing Zeros (GCC manual). GCC, clang, and ICC all support it on all target ISAs. (On ISAs where there's no native instruction, it will call a GCC helper function.)
Leading vs. Trailing is when written in printing order, MSB-first, like 8-bit binary 00000010 has 6 leading zeros and one trailing zero. (And when cast to 32-bit binary, will have 24+6 = 30 leading zeros.)
For 64-bit integers, use __builtin_ctzll(unsigned long long). It's unfortunate that GNU C bitscan builtins don't take fixed-width types (especially the leading zeros versions), but unsigned is always 32-bit on GNU C for x86 (although not for AVR or MSP430). unsigned long long is always uint64_t on all GNU C targets I'm aware of.
On x86, it will compile to bsf or tzcnt depending on tuning + target options. tzcnt is a single uop with 3 cycle latency on modern Intel, and only 2 uops with 2 cycle latency on AMD (perhaps a bit-reverse to feed an lzcnt uop?) https://agner.org/optimize/ / https://uops.info/. Either way it's directly supported by fast hardware, and is much faster than anything you can do in pure C++. About the same cost as x * 1234567 (on Intel CPUs, bsf/tzcnt has the same cost as imul r, r, imm, in front-end uops, back-end port, and latency.)
The builtin has undefined behaviour for inputs with no bits set, allowing it to avoid any extra checks if it might run as bsf.
In other compilers (specifically MSVC), you might want an intrinsic for TZCNT, like _mm_tzcnt_32 from immintrin.h. (Intel intrinsics guide). Or you might need to include intrin.h (MSVC) or x86intrin.h for non-SIMD intrinsics.
Unlike GCC/clang, MSVC doesn't stop you from using intrinsics for ISA extensions you haven't enabled for the compiler to use on its own.
MSVC also has _BitScanForward / _BitScanReverse for actual BSF/BSR, but the leave-destination-unmodified behaviour that AMD guarantees (and Intel also implements) is still not exposed by these intrinsics, despite their pointer-output API.
VS: unexpected optimization behavior with _BitScanReverse64 intrinsic - pointer-output is assumed to always be written :/
_BitScanForward _BitScanForward64 missing (VS2017) Snappy - correct headers
How to use MSVC intrinsics to get the equivalent of this GCC code?
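Putting the GNU and MSVC paths together, here's a hedged sketch of a portable helper (the name lowest_set_bit_index is made up; like the builtins, it's undefined for x == 0):

#include <stdint.h>
#ifdef _MSC_VER
#include <intrin.h>
#endif

// Returns the 0-based index of the lowest set bit. Undefined for x == 0.
static inline unsigned lowest_set_bit_index(uint32_t x)
{
#ifdef __GNUC__
    return (unsigned)__builtin_ctz(x);   // compiles to tzcnt or bsf
#elif defined(_MSC_VER)
    unsigned long idx;
    _BitScanForward(&idx, x);            // bsf; or _tzcnt_u32(x) if BMI1 is known to be available
    return (unsigned)idx;
#else
    unsigned i = 0;                      // portable fallback: slow bit loop
    while (!(x & 1u)) { x >>= 1; i++; }
    return i;
#endif
}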
TZCNT decodes as BSF on CPUs without BMI1 because its machine-code encoding is rep bsf. They give identical results for non-zero inputs, so compilers can and do always just use tzcnt because that's much faster on AMD. (They're the same speed on Intel so no downside. And on Skylake and later, tzcnt has no false output dependency; BSF does, because it leaves its output unmodified for input=0.)
(The situation is less convenient for bsr vs. lzcnt: bsr returns the bit-index, lzcnt returns the leading-zero count. So for best performance on AMD, you need to know that your code will only run on CPUs supporting lzcnt (the ABM / BMI1 feature levels) so the compiler can use lzcnt.)
Note that with exactly 1 bit set, scanning from either direction will find the same bit. So 31 - lzcnt = bsr is the same in this case as bsf = tzcnt. Possibly useful if porting to another ISA that only has leading-zero count and no bit-reverse instruction.
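For instance, a GNU C sketch of that leading-zero-count route (helper name is made up; valid only because exactly one bit is set):

#include <stdint.h>

// 31 - clz(x) == ctz(x) when x has exactly one bit set.
static inline unsigned idx_from_clz(uint32_t x)
{
    return 31u - (unsigned)__builtin_clz(x);
}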
Related:
Why does breaking the "output dependency" of LZCNT matter? Modern compilers generally know to break the false dependency for lzcnt/tzcnt/popcnt. bsf/bsr have one, too; I think GCC is smart about that as well, but ironically it might not be in all cases.
How can x86 bsr/bsf have fixed latency, not data dependent? Doesn't it loop over bits like the pseudocode shows? - the pseudocode is not the hardware implementation.
https://en.wikipedia.org/wiki/Find_first_set has more about bitscan functions across ISAs. Including POSIX ffs() which returns a 1-based index and has to do extra work to account for the possibility of the input being 0.
Compilers do recognize ffs() and inline it like a builtin (like they do for memcpy or sqrt), but don't always manage to optimize away all the work their canned sequence does to implement it when you actually want a 0-based index. It's especially hard to tell the compiler there's only 1 bit set.

Using AVX to xor two zmm (512 bit) registers

I would like to bitwise-XOR zmm0 with zmm1.
I read around the internet and tried:
asm volatile(
"vmovdqa64 (%0),%%zmm0;\n"
"vmovdqa64 (%1),%%zmm1;\n"
"vpxorq %%zmm1, %%zmm0;\n"
"vmovdqa64 %%zmm0,(%0);\n"
:: "r"(p_dst), "r" (p_src)
: );
But the compiler gives "Error: number of operands mismatch for `vpxorq'".
What am I doing wrong?
Inline asm for this is pointless (https://gcc.gnu.org/wiki/DontUseInlineAsm), and your code is unsafe and inefficient even if you fixed the syntax error by adding the 3rd operand.
Use the intrinsic _mm512_xor_epi64( __m512i a, __m512i b); as documented in Intel's asm manual entry for pxor. Look at the compiler-generated asm if you want to see how it's done.
Unsafe because you don't have a "memory" clobber to tell the compiler that you read/write memory, and you don't declare clobbers on zmm0 or zmm1.
And inefficient for many reasons, including forcing the addressing modes and not using a memory source operand. And not letting the compiler pick which registers to use.
Just fixing the asm syntax so it compiles will go from having an obvious compile-time bug to a subtle and dangerous runtime bug that might only be visible with optimization enabled.
See https://stackoverflow.com/tags/inline-assembly/info for more about inline asm. But again, there is basically zero reason to use it for most SIMD because you can get the compiler to make asm that's just as efficient as what you can do by hand, and more efficient than this.
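For reference, a minimal intrinsics sketch of what the asm above was trying to do (the function name xor512 is just illustrative; it assumes 64-byte-aligned pointers, otherwise use the loadu/storeu variants):

#include <immintrin.h>

// dst ^= src for one 512-bit vector. The compiler picks the registers and
// can fold the src load into a memory operand of vpxorq.
void xor512(void* p_dst, const void* p_src)
{
    __m512i d = _mm512_load_si512(p_dst);
    __m512i s = _mm512_load_si512(p_src);
    _mm512_store_si512(p_dst, _mm512_xor_epi64(d, s));
}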
Most AVX512 instructions use 3+ operands, i.e. you need to add an additional operand: the destination register (it can be the same as one of the source operands).
This is also true for AVX2 version, see https://www.felixcloutier.com/x86/pxor:
VPXOR ymm1, ymm2, ymm3/m256
VPXORD zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst
Note that the above is Intel syntax and would roughly translate into *mm1 = *mm2 ^ *mm3; in your case I guess you wanted to use "vpxorq %%zmm1, %%zmm0, %%zmm0;\n"
Be advised that using inline assembly is generally bad practice, reserved for really special occasions. SIMD programming is better (faster, easier) done using intrinsics, which are supported by all major compilers. You can browse them here: https://software.intel.com/sites/landingpage/IntrinsicsGuide/

Compiling legacy GCC code with AVX vector warnings

I've been trying to search on google but couldn't find anything useful.
typedef int64_t v4si __attribute__ ((vector_size(32)));
//warning: AVX vector return without AVX enabled changes the ABI [-Wpsabi]
// so isn't AVX already automatically enabled?
// What does it mean "without AVX enabled"?
// What does it mean "changes the ABI"?
inline v4si v4si_gt0(v4si x_);
//warning: The ABI for passing parameters with 32-byte alignment has changed in GCC 4.6
//So why there's warning and what does it mean?
// Why only this parameter got warning?
// And all other v4si parameter/arguments got no warning?
void set_quota(v4si quota);
That's not legacy code. __attribute__ ((vector_size(32))) means a 32 byte vector, i.e. 256 bit, which (on x86) means AVX. (GNU C Vector Extensions)
AVX isn't enabled unless you use -mavx (or a -march setting that includes it). Without that, the compiler isn't allowed to generate code that uses AVX instructions, because those would trigger an illegal-instruction fault on older CPUs that don't support AVX.
So the compiler can't pass or return 256b vectors in registers, like the normal calling convention specifies. Probably it treats them the same as structs of that size passed by value.
See the ABI links in the x86 tag wiki, or the x86 Calling Conventions page on Wikipedia (mostly doesn't mention vector registers).
Since the GNU C Vector Extensions syntax isn't tied to any particular hardware, using a 32 byte vector will still compile to correct code. It will perform badly, but it will still work even if the compiler can only use SSE instructions. (Last I saw, gcc was known to do a very bad job of generating code to deal with vectors wider than the target machine supports. You'd get significantly better code for a machine with 16B vectors from using vector_size(16) manually.)
Anyway, the point is that you get a warning instead of a compiler error because __attribute__ ((vector_size(32))) doesn't imply AVX specifically, but AVX or some other 256b vector instruction set is required for it to compile to good code.
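As a concrete illustration (a hedged sketch with made-up file and function names), compiling a small translation unit with and without AVX enabled shows what the warning is about:

// vec_demo.cpp (hypothetical), GNU C vector extensions
#include <stdint.h>
typedef int64_t v4di __attribute__ ((vector_size(32)));   // 4 x int64_t = 256 bits

v4di add_pairs(v4di a, v4di b) { return a + b; }

// g++ -O2 -c vec_demo.cpp          -> -Wpsabi warning: args passed in memory, add split into SSE halves
// g++ -O2 -mavx2 -c vec_demo.cpp   -> no warning: args passed in ymm registers, single vpaddq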

Forcing AVX intrinsics to use SSE instructions instead

Unfortunately I have an AMD piledriver cpu, which seems to have problems with AVX instructions:
Memory writes with the 256-bit AVX registers are exceptionally slow. The measured throughput is 5 - 6 times slower than on the previous model (Bulldozer), and 8 - 9 times slower than two 128-bit writes.
In my own experience, I've found mm256 intrinsics to be much slower than mm128, and I'm assuming it's because of the above reason.
I really want to code for the newest instruction set AVX though, while still being able to test builds on my machine at a reasonable speed. Is there a way to force mm256 intrinsics to use SSE instructions instead? I'm using VS 2015.
If there is no easy way, what about a hard way. Replace <immintrin.h> with a custom made header containing my own definitions for the intrinsics which can be coded to use SSE? Not sure how plausible this is, prefer easier way if possible before I go through that work.
Use Agner Fog's Vector Class Library and add this to the command line in Visual Studio: -D__SSE4_2__ -D__XOP__.
Then use an AVX-sized vector such as Vec8f for eight floats. When you compile without AVX enabled, it will use the file vectorf256e.h, which emulates AVX with two SSE registers. For example, Vec8f inherits from Vec256fe, which starts like this:
class Vec256fe {
protected:
__m128 y0; // low half
__m128 y1; // high half
If you compile with /arch:AVX -D__XOP__ the VCL will instead use the file vectorf256.h and one AVX register. Then your code works for AVX and SSE with only a compiler switch change.
If you don't want to use XOP don't use -D__XOP__.
As Peter Cordes pointed out in his answer, if your goal is only to avoid 256-bit loads/stores, then you may still want VEX-encoded instructions (though it's not clear this will make a difference except in some special cases). You can do that with the vector class like this:
Vec8f a;
Vec4f lo = a.get_low(); // a is a Vec8f type
Vec4f hi = a.get_high();
lo.store(&b[0]); // b is a float array
hi.store(&b[4]);
then compile with /arch:AVX -D__XOP__.
Another option would be to have one source file that uses Vecnf and then do
//foo.cpp
#include "vectorclass.h"
#if SIMDWIDTH == 4
typedef Vec4f Vecnf;
#else
typedef Vec8f Vecnf;
#endif
and compile like this
cl /O2 /DSIMDWIDTH=4 foo.cpp /Fofoo_sse
cl /O2 /DSIMDWIDTH=4 /arch:AVX /D__XOP__ foo.cpp /Fofoo_avx128
cl /O2 /DSIMDWIDTH=8 /arch:AVX foo.cpp /Fofoo_avx256
This would create three executables from one source file. Instead of linking them, you could just compile them with /c and then make a CPU dispatcher. I used XOP with avx128 because I don't think there is a good reason to use avx128 except on AMD.
You don't want to use SSE instructions. What you want is for 256b stores to be done as two separate 128b stores, still with VEX-coded 128b instructions. i.e. 128b AVX vmovups.
gcc has -mavx256-split-unaligned-load and ...-store options (enabled as part of -march=sandybridge for example, and presumably also for Bulldozer-family; -march=bdver2 is Piledriver). That doesn't solve the problem when the compiler knows the memory is aligned, though.
You could override the normal 256b store intrinsic with a macro like
// maybe enable this for all BD family CPUs?
#if defined(__bdver2) || defined(PILEDRIVER) || defined(SPLIT_256b_STORES)
#define _mm256_storeu_ps(addr, data) do{ \
_mm_storeu_ps( ((float*)(addr)) + 0, _mm256_extractf128_ps((data),0)); \
_mm_storeu_ps( ((float*)(addr)) + 4, _mm256_extractf128_ps((data),1)); \
}while(0)
#endif
gcc defines __bdver2 (Bulldozer version 2) for Piledriver (-march=bdver2).
You could do the same for (aligned) _mm256_store_ps, or just always use the unaligned intrinsic.
Compilers optimize the _mm256_extractf128_ps((data),0) to a simple cast. I.e. it should just compile to
vmovups [rdi], xmm0          ; if data is in ymm0 (low half = xmm0) and addr is in rdi
vextractf128 [rdi+16], ymm0, 1
However, testing on godbolt shows that gcc and clang are dumb, and extract to a register and then store. ICC correctly generates the two-instruction sequence.
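If you'd rather avoid the macro, here's a hedged sketch of an inline-function alternative (the name split_storeu_ps is made up) that makes the low-half cast explicit:

#include <immintrin.h>

// Store a __m256 as two VEX-coded 128-bit stores.
// The low half is just a cast (no instruction); the high half is vextractf128.
static inline void split_storeu_ps(float* addr, __m256 data)
{
    _mm_storeu_ps(addr,     _mm256_castps256_ps128(data));
    _mm_storeu_ps(addr + 4, _mm256_extractf128_ps(data, 1));
}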