Does the compiler actually use my "omp declare simd" functions? - c++

Have a look at this example I constructed for a 4D dot product:
#pragma omp declare simd
double dot(double x0, double y0, double z0, double w0, double x1, double y1, double z1, double w1)
{
return x0 * x1 + y0 * y1 + z0 * z1 + w0 * w1;
}
#define SIMD 4
int main(int argc, char **argv)
{
double x[SIMD];
double y[SIMD];
double z[SIMD];
double w[SIMD];
double r[SIMD];
for (int i = 0; i < SIMD; i++)
{
x[i] = y[i] = z[i] = 1;
w[i] = 0;
}
#pragma omp simd
for (int i = 0; i < SIMD; i++)
{
r[i] = dot(x[i], y[i], z[i], w[i], x[i], y[i], z[i], w[i]);
}
double s = 0;
for (int i = 0; i < SIMD; i++)
{
s += r[i];
}
return s;
}
In the compiler output you can see that it generates a few functions called _XXXXXXvvvvvvvv_dot. I assume that these are the functions used for different lengths of input for the dot function, or at least that is what they are supposed to be. However, these functions do not seem to be actually used by the compiler. Line 94 of the output reads call dot(…). Does that call one of these functions? What do I have to do to use them?

Don't try to call the SIMD versions manually: let the compiler do that from a loop that it's auto-vectorizing.
You didn't enable optimization so GCC doesn't auto-vectorize your loops. Thus it only calls the scalar version of the function.
The GCC default is -O0 - anti-optimize for debugging, so of course the code is total garbage and not actually auto-vectorized (no addpd or mulpd instructions).
Enable optimization with -O3. GCC will simply inline the calls when it can see the definition. The #pragma omp declare simd thing lets the compiler emit calls to vectorized versions of the function even if it can't see the definition. (Or for larger functions that it chooses not to inline.)
You can use __attribute__((noinline)) on dot to see how it works even for your small function:
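For reference, the change amounts to adding the attribute to the definition, something like this (a sketch of the modification, keeping the original body):
#pragma omp declare simd
__attribute__((noinline))
double dot(double x0, double y0, double z0, double w0, double x1, double y1, double z1, double w1)
{
    return x0 * x1 + y0 * y1 + z0 * z1 + w0 * w1;
}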
On Godbolt with GCC9.1 -O3 -fopenmp, with that change:
# gcc9.1 -O3 -fopenmp
main:
sub rsp, 40
movapd xmm0, XMMWORD PTR .LC0[rip] # {1, 1}
pxor xmm7, xmm7 # {0, 0}
movapd xmm3, xmm7
movapd xmm6, xmm0 # duplicate the 1,1 vector for several args
movapd xmm5, xmm0
movapd xmm4, xmm0
movapd xmm2, xmm0
movapd xmm1, xmm0
call _ZGVbN2vvvvvvvv_dot(double, double, double, double, double, double, double, double)
movaps XMMWORD PTR [rsp], xmm0 # store to the stack
movaps XMMWORD PTR [rsp+16], xmm0 # twice
pxor xmm0, xmm0 # 0.0
addsd xmm0, QWORD PTR [rsp] # 0 + v[0]
addsd xmm0, QWORD PTR [rsp+8] # ... += v[1]
addsd xmm0, QWORD PTR [rsp+16]
addsd xmm0, QWORD PTR [rsp+24] # stupid inefficient horizontal sum
add rsp, 40
cvttsd2si eax, xmm0 # truncate to integer as main's return value
ret
With your tiny #define SIMD 4, main doesn't actually need to loop at all; two 16-byte vectors are sufficient. The arrays with compile-time-constant initializers get optimized away; GCC just materializes the constants into registers with pxor-zeroing for 0.0 and loading + copying from static constant data for 1.0.
So anyway, there's only one call to a SIMD version of dot(), but this is it. I think GCC knows that the same call will give the same result, which is why it calls once but stores the result twice.
IDK why GCC's OpenMP horizontal sum is so dumb. Obviously it would be better to addpd xmm0,xmm0 instead of storing it twice, and a shuffle could avoid a store/reload. Also using an addsd to do 0.0 + x is pointless; just use the low element of the register that you stored from.
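For example, the improvement described above could be written with intrinsics roughly like this (a sketch only, assuming the two-element result of the SIMD call is available as a __m128d; this is not what GCC emits):
#include <emmintrin.h>
static inline double hsum_stored_twice(__m128d v)
{
    __m128d doubled = _mm_add_pd(v, v);              // the result is stored twice, so the total is 2*v[0] + 2*v[1]
    __m128d hi = _mm_unpackhi_pd(doubled, doubled);  // shuffle the high element down instead of store/reload
    return _mm_cvtsd_f64(_mm_add_sd(doubled, hi));
}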
The scalar version of dot() has the usual C++ name mangling for a function. The other versions have special name-mangling conventions, maybe specific to GCC's OpenMP, IDK.
Interestingly, gcc makes a few different versions of dot, including an AVX version using YMM registers. And some that spill to the stack and use scalar math in a loop; IDK why those exist.
So I guess that means that even if you compile this source file without -march=skylake-avx512, another loop that is compiled that way can still emit a call to _ZGVeN8vvvvvvvv_dot and get the AVX512 definition:
_ZGVeN8vvvvvvvv_dot(double, double, double, double, double, double, double, double):
vmulpd zmm1, zmm1, zmm5
vfmadd132pd zmm0, zmm1, zmm4
vfmadd231pd zmm0, zmm2, zmm6
vfmadd231pd zmm0, zmm3, zmm7
Strangely I don't see an AVX+FMA definition that uses FMA on YMM regs, only SSE2 and AVX definitions that use vmulpd / vaddpd.

Related

how to set a int32 value at some index within an m128i with only SSE2?

Is there an SSE2 intrinsic that can set a single int32 value within an m128i?
For example, setting the value 1000 at index 1 of an m128i that already contains 1,2,3,4 (resulting in 1,1000,3,4)?
If SSE4.1 is available, use
__m128i _mm_insert_epi32 (__m128i a, int i, const int imm8)
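For instance, a minimal usage sketch (assuming SSE4.1 is enabled, e.g. with -msse4.1; the wrapper name is made up):
__m128i set_lane1_sse4(__m128i a, int value)
{
    return _mm_insert_epi32(a, value, 1); // overwrite 32-bit lane 1
}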
If you are limited to SSE2, you need to split it into two calls of
__m128i _mm_insert_epi16 (__m128i a, int i, const int imm8)
return _mm_insert_epi16(_mm_insert_epi16(a, 1000, 2), 0, 3);
to set 1000 in lane 1 of a, interpreted as a vector of ints.
With SSSE3 available (needed for _mm_alignr_epi8), I would presume that a shifting/shuffling sequence would be more efficient:
a = _mm_shuffle_epi32(a, 0xe0); // shuffle as 3 2 0 ?
__m128i b = _mm_cvtsi32_si128(value);
b = _mm_alignr_epi8(b, a, 4); // value 3 2 0
return _mm_shuffle_epi32(b, 0x9c); // 3 2 value 0
If the value is in a 64-bit register, one can use _mm_cvtsi64_si128 instead.
GCC is able to convert the store/load sequence to pinsrd xmm0, eax, 1 when SSE4.1 is enabled, but gives quite a convoluted sequence without -msse4:
movd eax, xmm0
movaps XMMWORD PTR [rsp-24], xmm0
movabs rdx, 4294967296000
or rax, rdx
mov QWORD PTR [rsp-24], rax
movdqa xmm0, XMMWORD PTR [rsp-24]
ret
OTOH clang sticks to the store, modify-on-stack, reload paradigm:
movaps xmmword ptr [rsp - 24], xmm0
mov dword ptr [rsp - 20], 1000
movaps xmm0, xmmword ptr [rsp - 24]
ret
Probably the overall winner is the store/modify/load combo, which also has a free programmable index. All the others require hard-coded immediates, including those using the insert intrinsics.
__m128i store_modify_load(__m128i a, int value, size_t index) {
alignas(16) int32_t tmp[4] = {};
_mm_store_si128(reinterpret_cast<__m128i*>(tmp), a);
tmp[index] = value;
return _mm_load_si128(reinterpret_cast<__m128i*>(tmp));
}
See the produced assembly on Godbolt.

What is the fastest way to convert a large c-array of char8 to short16?

My raw data is a bunch of C arrays of (unsigned) char (8-bit), each of length > 1000000.
I want to add them together (vector addition), following the rule in the code below.
Result: a C array of (unsigned) short (16-bit).
I have read through the SSE and AVX/AVX2 references, but the closest thing I found is a call that multiplies two 256-bit registers: every other 32-bit element is multiplied, and each pair produces a 64-bit result that fits into the 256-bit register (_mm256_mul_epi32, _mm256_mul_epu32).
Figure:
https://www.codeproject.com/Articles/874396/Crunching-Numbers-with-AVX-and-AVX
Sample code:
static inline void adder(uint16_t *canvas, uint8_t *addon, uint64_t count)
{
for (uint64_t i=0; i<count; i++)
canvas[i] += static_cast<uint16_t>(addon[i]);
}
Thanks
Adding onto wim's answer (which is a good answer) and taking Bathsheba's comment into account: it's well worthwhile both trusting the compiler and examining what your compiler outputs, both to learn how to do this and to check that it's doing what you'd want. Running a slightly modified version of your code through Godbolt (for MSVC, GCC and Clang) gives some imperfect results.
This is especially true if you limit yourself to SSE2 and below, which this answer assumes (and which is what I tested).
All compilers both vectorise and unroll the code and use punpcklbw to 'unpack' the uint8_t's into uint16_t's and then run a SIMD add and store. This is good. However, MSVC tends to spill unnecessarily in the inner loop, Clang only uses punpcklbw and not punpckhbw, which means it loads the source data twice, and GCC gets the SIMD part right but has higher overhead for the loop constraints.
So theoretically if you wanted to improve these versions you can roll your own using intrinsics which would look something like:
static inline void adder2(uint16_t *canvas, uint8_t *addon, uint64_t count)
{
uint64_t count32 = (count / 32) * 32;
__m128i zero = _mm_set_epi32(0, 0, 0, 0);
uint64_t i = 0;
for (; i < count32; i+= 32)
{
uint8_t* addonAddress = (addon + i);
// Load data 32 bytes at a time and widen the input
// to `uint16_t`'s into 4 temp xmm registers.
__m128i input = _mm_loadu_si128((__m128i*)(addonAddress + 0));
__m128i temp1 = _mm_unpacklo_epi8(input, zero);
__m128i temp2 = _mm_unpackhi_epi8(input, zero);
__m128i input2 = _mm_loadu_si128((__m128i*)(addonAddress + 16));
__m128i temp3 = _mm_unpacklo_epi8(input2, zero);
__m128i temp4 = _mm_unpackhi_epi8(input2, zero);
// Load data we need to update
uint16_t* canvasAddress = (canvas + i);
__m128i canvas1 = _mm_loadu_si128((__m128i*)(canvasAddress + 0));
__m128i canvas2 = _mm_loadu_si128((__m128i*)(canvasAddress + 8));
__m128i canvas3 = _mm_loadu_si128((__m128i*)(canvasAddress + 16));
__m128i canvas4 = _mm_loadu_si128((__m128i*)(canvasAddress + 24));
// Update the values
__m128i output1 = _mm_add_epi16(canvas1, temp1);
__m128i output2 = _mm_add_epi16(canvas2, temp2);
__m128i output3 = _mm_add_epi16(canvas3, temp3);
__m128i output4 = _mm_add_epi16(canvas4, temp4);
// Store the values
_mm_storeu_si128((__m128i*)(canvasAddress + 0), output1);
_mm_storeu_si128((__m128i*)(canvasAddress + 8), output2);
_mm_storeu_si128((__m128i*)(canvasAddress + 16), output3);
_mm_storeu_si128((__m128i*)(canvasAddress + 24), output4);
}
// Mop up
for (; i<count; i++)
canvas[i] += static_cast<uint16_t>(addon[i]);
}
Examining the output for this, it is strictly better than any of gcc/clang/msvc. So if you want to get the absolute last drop of perf (and have a fixed architecture) then something like the above is a possibility. However, it's a really small improvement, as the compilers already handle this almost perfectly, so I'd actually recommend not doing this and just trusting the compiler.
If you do think you can improve the compiler, remember to always test and profile to make sure you actually are.
Indeed the comments are right: the compiler can do the vectorization for you.
I have modified your code a bit to improve the auto-vectorization.
With gcc -O3 -march=haswell -std=c++14 (gcc version 8.2), the following code:
#include <cstdint>
#include <immintrin.h>
void cvt_uint8_int16(uint16_t * __restrict__ canvas, uint8_t * __restrict__ addon, int64_t count) {
int64_t i;
/* If you know that n is always a multiple of 32 then insert */
/* n = n & 0xFFFFFFFFFFFFFFE0u; */
/* This leads to cleaner code. Now assume n is a multiple of 32: */
count = count & 0xFFFFFFFFFFFFFFE0u;
for (i = 0; i < count; i++){
canvas[i] += static_cast<uint16_t>(addon[i]);
}
}
compiles to:
cvt_uint8_int16(unsigned short*, unsigned char*, long):
and rdx, -32
jle .L5
add rdx, rsi
.L3:
vmovdqu ymm2, YMMWORD PTR [rsi]
add rsi, 32
add rdi, 64
vextracti128 xmm1, ymm2, 0x1
vpmovzxbw ymm0, xmm2
vpaddw ymm0, ymm0, YMMWORD PTR [rdi-64]
vpmovzxbw ymm1, xmm1
vpaddw ymm1, ymm1, YMMWORD PTR [rdi-32]
vmovdqu YMMWORD PTR [rdi-64], ymm0
vmovdqu YMMWORD PTR [rdi-32], ymm1
cmp rdx, rsi
jne .L3
vzeroupper
.L5:
Clang produces code which is a bit different: it loads 128-bit (char) vectors and converts them with vpmovzxbw.
GCC loads 256-bit (char) vectors and converts the upper and the lower 128 bits separately, which is probably slightly less efficient.
Nevertheless, your problem is likely bandwidth limited anyway (since length > 1000000).
You can also vectorize the code with intrinsics (not tested):
void cvt_uint8_int16_with_intrinsics(uint16_t * __restrict__ canvas, uint8_t * __restrict__ addon, int64_t count) {
int64_t i;
/* Assume n is a multiple of 16 */
for (i = 0; i < count; i=i+16){
__m128i x = _mm_loadu_si128((__m128i*)&addon[i]);
__m256i y = _mm256_loadu_si256((__m256i*)&canvas[i]);
__m256i x_u16 = _mm256_cvtepu8_epi16(x);
__m256i sum = _mm256_add_epi16(y, x_u16);
_mm256_storeu_si256((__m256i*)&canvas[i], sum);
}
}
This leads to similar results as the auto-vectorized code.
In contrast to the manually-optimized approaches presented in wim's and Mike's great answers, let's also have a quick look at what a completely vanilla C++ implementation would give us:
std::transform(addon, addon + count, canvas, canvas, std::plus<void>());
Try it out here. You'll see that even without any real effort on your part, the compiler is already able to produce vectorized code that is quite good, given that it cannot make any assumptions concerning the alignment and size of your buffers, and there are also some potential aliasing issues (due to the use of uint8_t, which, unfortunately, forces the compiler to assume that the pointer may alias any other object). Also, note that the code is basically identical to what you'd get from a C-style implementation (depending on the compiler, the C++ version has a few instructions more or a few instructions less):
void f(uint16_t* canvas, const uint8_t* addon, size_t count)
{
for (size_t i = 0; i < count; ++i)
canvas[i] += addon[i];
}
However, the generic C++ solution works on any combination of different kinds of container and element types as long as the element types can be added. So—as also pointed out in the other answers—while it is certainly possible to get a slightly more efficient implementation from manual optimization, one can go a long way just by writing plain C++ code (if done right). Before resorting to manually writing SSE intrinsics, consider that a generic C++ solution is more flexible, easier to maintain, and, especially, more portable. By the simple flip of the target architecture switch, you can have it produce code of similar quality not only for SSE, but AVX, or even ARM with NEON and whatever other instruction sets you may happen to want to run on. If you need your code to be perfect down to the last instruction for one particular use case on one particular CPU, then yes, intrinsics or even inline assembly is probably the way to go. But in general, I would also suggest to instead focus on writing your C++ code in a way that enables and guides the compiler to generate the assembly you want rather than generating the assembly yourself. For example, by using the (non-standard but generally available) restrict qualifier and borrowing the trick with letting the compiler know that your count is always a multiple of 32
void f(std::uint16_t* __restrict__ canvas, const std::uint8_t* __restrict__ addon, std::size_t count)
{
assert(count % 32 == 0);
count = count & -32;
std::transform(addon, addon + count, canvas, canvas, std::plus<void>());
}
you get (-std=c++17 -DNDEBUG -O3 -mavx)
f(unsigned short*, unsigned char const*, unsigned long):
and rdx, -32
je .LBB0_3
xor eax, eax
.LBB0_2: # =>This Inner Loop Header: Depth=1
vpmovzxbw xmm0, qword ptr [rsi + rax] # xmm0 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
vpmovzxbw xmm1, qword ptr [rsi + rax + 8] # xmm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
vpmovzxbw xmm2, qword ptr [rsi + rax + 16] # xmm2 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
vpmovzxbw xmm3, qword ptr [rsi + rax + 24] # xmm3 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
vpaddw xmm0, xmm0, xmmword ptr [rdi + 2*rax]
vpaddw xmm1, xmm1, xmmword ptr [rdi + 2*rax + 16]
vpaddw xmm2, xmm2, xmmword ptr [rdi + 2*rax + 32]
vpaddw xmm3, xmm3, xmmword ptr [rdi + 2*rax + 48]
vmovdqu xmmword ptr [rdi + 2*rax], xmm0
vmovdqu xmmword ptr [rdi + 2*rax + 16], xmm1
vmovdqu xmmword ptr [rdi + 2*rax + 32], xmm2
vmovdqu xmmword ptr [rdi + 2*rax + 48], xmm3
add rax, 32
cmp rdx, rax
jne .LBB0_2
.LBB0_3:
ret
which is really not bad…

Horizontal XOR in AVX

Is there a way to XOR horizontally an AVX register—specifically, to XOR the four 64-bit components of a 256-bit register?
The goal is to get the XOR of all 4 64-bit components of an AVX register. It would essentially be doing the same thing as a horizontal add (_mm256_hadd_epi32()), except that I want to XOR instead of ADD.
The scalar code is:
inline uint64_t HorizontalXor(__m256i t) {
return t.m256i_u64[0] ^ t.m256i_u64[1] ^ t.m256i_u64[2] ^ t.m256i_u64[3];
}
As stated in the comments, the fastest code very likely uses scalar operations, doing everything in the integer registers. All you need to do is extract the four packed 64-bit integers, then you have three XOR instructions, and you're done. This can be done pretty efficiently, and it leaves the result in an integer register, which is what your sample code suggests that you would want.
MSVC already generates pretty good code for the scalar function that you show as an example in the question:
inline uint64_t HorizontalXor(__m256i t) {
return t.m256i_u64[0] ^ t.m256i_u64[1] ^ t.m256i_u64[2] ^ t.m256i_u64[3];
}
Assuming that t is in ymm1, the resulting disassembly will be something like this:
vextractf128 xmm0, ymm1, 1
vpextrq rax, xmm0, 1
vmovq rcx, xmm1
xor rax, rcx
vpextrq rcx, xmm1, 1
vextractf128 xmm0, ymm1, 1
xor rax, rcx
vmovq rcx, xmm0
xor rax, rcx
…with the result left in RAX. If this accurately reflects what you need (a scalar uint64_t result), then this code would be sufficient.
You can slightly improve it by using intrinsics:
inline uint64_t _mm256_hxor_epu64(__m256i x)
{
const __m128i temp = _mm256_extracti128_si256(x, 1);
return (uint64_t&)x
^ (uint64_t)(_mm_extract_epi64(_mm256_castsi256_si128(x), 1))
^ (uint64_t&)(temp)
^ (uint64_t)(_mm_extract_epi64(temp, 1));
}
Then you'll get the following disassembly (again, assuming that x is in ymm1):
vextracti128 xmm2, ymm1, 1
vpextrq rcx, xmm2, 1
vpextrq rax, xmm1, 1
xor rax, rcx
vmovq rcx, xmm1
xor rax, rcx
vmovq rcx, xmm2
xor rax, rcx
Notice that we were able to elide one extraction instruction, and that we've ensured VEXTRACTI128 was used instead of VEXTRACTF128 (although, this choice probably does not matter).
You'll see similar output on other compilers. For example, here's GCC 7.1 (with x assumed to be in ymm0):
vextracti128 xmm2, ymm0, 0x1
vpextrq rax, xmm0, 1
vmovq rdx, xmm2
vpextrq rcx, xmm2, 1
xor rax, rdx
vmovq rdx, xmm0
xor rax, rdx
xor rax, rcx
The same instructions are there, but they've been slightly reordered. The intrinsics allow the compiler's scheduler to order as it deems best. Clang 4.0 schedules them differently yet:
vmovq rax, xmm0
vpextrq rcx, xmm0, 1
xor rcx, rax
vextracti128 xmm0, ymm0, 1
vmovq rdx, xmm0
xor rdx, rcx
vpextrq rax, xmm0, 1
xor rax, rdx
And, of course, this ordering is always subject to change when the code gets inlined.
On the other hand, if you want the result to be in an AVX register, then you first need to decide how you want it to be stored. I guess you would just store the single 64-bit result as a scalar, something like:
inline __m256i _mm256_hxor(__m256i x)
{
const __m128i temp = _mm256_extracti128_si256(x, 1);
return _mm256_set1_epi64x((uint64_t&)x
^ (uint64_t)(_mm_extract_epi64(_mm256_castsi256_si128(x), 1))
^ (uint64_t&)(temp)
^ (uint64_t)(_mm_extract_epi64(temp, 1)));
}
But now you're doing a lot of data shuffling, negating any performance boost that you would possibly see from vectorizing the code.
Speaking of which, I'm not really sure how you got yourself into a situation where you need to do horizontal operations like this in the first place. SIMD operations are designed to scale vertically, not horizontally. If you're still in the implementation phase, it may be appropriate to reconsider the design. In particular, you should be generating the 4 integer values in 4 different AVX registers, rather than packing them all into one.
If you actually want 4 copies of the result packed into an AVX register, then you could do something like this:
inline __m256i _mm256_hxor(__m256i x)
{
const __m256i temp = _mm256_xor_si256(x,
_mm256_permute2f128_si256(x, x, 1));
return _mm256_xor_si256(temp,
_mm256_shuffle_epi32(temp, _MM_SHUFFLE(1, 0, 3, 2)));
}
This still exploits a bit of parallelism by doing two XORs at once, meaning that only two XOR operations are required in all, instead of three.
If it helps to visualize it, this basically does:
A B C D ⟵ input
XOR XOR XOR XOR
C D A B ⟵ permuted input
=====================================
A^C B^D A^C B^D ⟵ intermediate result
XOR XOR XOR XOR
B^D A^C B^D A^C ⟵ shuffled intermediate result
======================================
A^C^B^D A^C^B^D A^C^B^D A^C^B^D ⟵ final result
On practically all compilers, these intrinsics will produce the following assembly code:
vperm2f128 ymm0, ymm1, ymm1, 1 ; input is in YMM1
vpxor ymm2, ymm0, ymm1
vpshufd ymm1, ymm2, 78
vpxor ymm0, ymm1, ymm2
(I had come up with this on my way to bed after first posting this answer, and planned to come back and update the answer, but I see that wim beat me to the punch on posting it. Oh well, it's still a better approach than what I first had, so it still merits being included here.)
And, of course, if you wanted this in an integer register, you would just need a simple VMOVQ:
vperm2f128 ymm0, ymm1, ymm1, 1 ; input is in YMM1
vpxor ymm2, ymm0, ymm1
vpshufd ymm1, ymm2, 78
vpxor ymm0, ymm1, ymm2
vmovq rax, xmm0
The question is, would this be faster than the scalar code above. And the answer is, yes, probably. Although you are doing the XORs using the AVX execution units, instead of the completely separate integer execution units, there are fewer AVX shuffles/permutes/extracts that need to be done, which means less overhead. So I might also have to eat my words on scalar code being the fastest implementation. But it really depends on what you're doing and how the instructions can be scheduled/interleaved.
Vectorization is likely to be useful if the input of the horizontal XOR function is already in an AVX register, i.e. your t is the result of some SIMD computation.
Otherwise, scalar code is likely to be faster, as already mentioned by Cody Gray.
Often you can do horizontal SIMD operations in about log_2(SIMD_width) 'steps'.
In this case one step is a 'shuffle/permute' and an 'xor'. This is slightly more efficient than Cody Gray's _mm256_hxor function:
__m256i _mm256_hxor_v2(__m256i x)
{
__m256i x0 = _mm256_permute2x128_si256(x,x,1); // swap the 128 bit high and low lane
__m256i x1 = _mm256_xor_si256(x,x0);
__m256i x2 = _mm256_shuffle_epi32(x1,0b01001110); // swap 64 bit lanes
__m256i x3 = _mm256_xor_si256(x1,x2);
return x3;
}
This compiles to:
vperm2i128 $1, %ymm0, %ymm0, %ymm1
vpxor %ymm1, %ymm0, %ymm0
vpshufd $78, %ymm0, %ymm1
vpxor %ymm1, %ymm0, %ymm0
If you want the result in a scalar register:
uint64_t _mm256_hxor_v2_uint64(__m256i x)
{
__m256i x0 = _mm256_permute2x128_si256(x,x,1);
__m256i x1 = _mm256_xor_si256(x,x0);
__m256i x2 = _mm256_shuffle_epi32(x1,0b01001110);
__m256i x3 = _mm256_xor_si256(x1,x2);
return _mm_cvtsi128_si64x(_mm256_castsi256_si128(x3)) ;
}
Which compiles to:
vperm2i128 $1, %ymm0, %ymm0, %ymm1
vpxor %ymm1, %ymm0, %ymm0
vpshufd $78, %ymm0, %ymm1
vpxor %ymm1, %ymm0, %ymm0
vmovq %xmm0, %rax
Full test code:
#include <stdio.h>
#include <x86intrin.h>
#include <stdint.h>
/* gcc -O3 -Wall -m64 -march=broadwell hor_xor.c */
int print_vec_uint64(__m256i v);
__m256i _mm256_hxor_v2(__m256i x)
{
__m256i x0 = _mm256_permute2x128_si256(x,x,1);
__m256i x1 = _mm256_xor_si256(x,x0);
__m256i x2 = _mm256_shuffle_epi32(x1,0b01001110);
__m256i x3 = _mm256_xor_si256(x1,x2);
/* Uncomment the next few lines to print the values of the intermediate variables */
/*
printf("3...0 = 3 2 1 0\n");
printf("x = ");print_vec_uint64(x );
printf("x0 = ");print_vec_uint64(x0 );
printf("x1 = ");print_vec_uint64(x1 );
printf("x2 = ");print_vec_uint64(x2 );
printf("x3 = ");print_vec_uint64(x3 );
*/
return x3;
}
uint64_t _mm256_hxor_v2_uint64(__m256i x)
{
__m256i x0 = _mm256_permute2x128_si256(x,x,1);
__m256i x1 = _mm256_xor_si256(x,x0);
__m256i x2 = _mm256_shuffle_epi32(x1,0b01001110);
__m256i x3 = _mm256_xor_si256(x1,x2);
return _mm_cvtsi128_si64x(_mm256_castsi256_si128(x3)) ;
}
int main() {
__m256i x = _mm256_set_epi64x(0x7, 0x5, 0x2, 0xB);
// __m256i x = _mm256_set_epi64x(4235566778345231, 1123312566778345423, 72345566778345673, 967856775433457);
printf("x = ");print_vec_uint64(x);
__m256i y = _mm256_hxor_v2(x);
printf("y = ");print_vec_uint64(y);
uint64_t z = _mm256_hxor_v2_uint64(x);
printf("z = %10lX \n",z);
return 0;
}
int print_vec_uint64(__m256i v){
uint64_t t[4];
_mm256_storeu_si256((__m256i *)t,v);
printf("%10lX %10lX %10lX %10lX \n",t[3],t[2],t[1],t[0]);
return 0;
}
An implementation of a direct analogue of _mm256_hadd_epi32() for XOR would look something like this:
#include <immintrin.h>
template<int imm> inline __m256i _mm256_shuffle_epi32(__m256i a, __m256i b)
{
return _mm256_castps_si256(_mm256_shuffle_ps(_mm256_castsi256_ps(a), _mm256_castsi256_ps(b), imm));
}
inline __m256i _mm256_hxor_epi32(__m256i a, __m256i b)
{
return _mm256_xor_si256(_mm256_shuffle_epi32<0x88>(a, b), _mm256_shuffle_epi32<0xDD>(a, b));
}
int main()
{
__m256i a = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
__m256i b = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8);
__m256i c = _mm256_hxor_epi32(a, b);
return 0;
}

Is shufps slower than memory access?

The title may seem nonsense but let me explain. I was studying a program the other day when I encountered the following assembly code:
movaps xmm3, xmmword ptr [rbp-30h]
lea rdx, [rdi+1320h]
movaps xmm5, xmm3
movaps xmm6, xmm3
movaps xmm0, xmm3
movss dword ptr [rdx], xmm3
shufps xmm5, xmm3, 55h
shufps xmm6, xmm3, 0AAh
shufps xmm0, xmm3, 0FFh
movaps xmm4, xmm3
movss dword ptr [rdx+4], xmm5
movss dword ptr [rdx+8], xmm6
movss dword ptr [rdx+0Ch], xmm0
mulss xmm4, xmm3
and it seems like mostly it just copies four floats from [rbp-30h] to [rdx]. Those shufps instructions are used just to select one of the four floats in xmm3 (e.g. shufps xmm5, xmm3, 55h selects the second float and places it in xmm5).
This makes me wonder if the compiler did so because shufps is actually faster than memory access (something like movss xmm0, dword ptr [rbp-30h], movss dword ptr [rdx], xmm0).
So I wrote some tests to compare these two approaches and found that shufps is always slower than multiple memory accesses. Now I'm thinking maybe the use of shufps has nothing to do with performance. It might just be there to obfuscate the code so decompilers cannot produce clean code easily (I tried with IDA Pro and it was indeed overly complicated).
While I'll probably never use shufps explicitly anyway (by using _mm_shuffle_ps for example) in any practical programs as the compiler is most likely smarter than me, I still want to know why the compiler that compiled the program generated such code. It's neither faster nor smaller. It makes no sense.
Anyway I'll provide the tests I wrote below.
#include <Windows.h>
#include <iostream>
using namespace std;
__declspec(noinline) DWORD profile_routine(void (*routine)(void *), void *arg, int iterations = 1)
{
DWORD startTime = GetTickCount();
while (iterations--)
{
routine(arg);
}
DWORD timeElapsed = GetTickCount() - startTime;
return timeElapsed;
}
struct Struct
{
float x, y, z, w;
};
__declspec(noinline) Struct shuffle1(float *arr)
{
float x = arr[3];
float y = arr[2];
float z = arr[0];
float w = arr[1];
return {x, y, z, w};
}
#define SS0 (0x00)
#define SS1 (0x55)
#define SS2 (0xAA)
#define SS3 (0xFF)
__declspec(noinline) Struct shuffle2(float *arr)
{
Struct r;
__m128 packed = *reinterpret_cast<__m128 *>(arr);
__m128 x = _mm_shuffle_ps(packed, packed, SS3);
__m128 y = _mm_shuffle_ps(packed, packed, SS2);
__m128 z = _mm_shuffle_ps(packed, packed, SS0);
__m128 w = _mm_shuffle_ps(packed, packed, SS1);
_mm_store_ss(&r.x, x);
_mm_store_ss(&r.y, y);
_mm_store_ss(&r.z, z);
_mm_store_ss(&r.w, w);
return r;
}
void profile_shuffle_r1(void *arg)
{
float *arr = static_cast<float *>(arg);
Struct q = shuffle1(arr);
arr[0] += q.w;
arr[1] += q.z;
arr[2] += q.y;
arr[3] += q.x;
}
void profile_shuffle_r2(void *arg)
{
float *arr = static_cast<float *>(arg);
Struct q = shuffle2(arr);
arr[0] += q.w;
arr[1] += q.z;
arr[2] += q.y;
arr[3] += q.x;
}
int main(int argc, char **argv)
{
int n = argc + 3;
float arr1[4], arr2[4];
for (int i = 0; i < 4; i++)
{
arr1[i] = static_cast<float>(n + i);
arr2[i] = static_cast<float>(n + i);
}
int iterations = 20000000;
DWORD time1 = profile_routine(profile_shuffle_r1, arr1, iterations);
cout << "time1 = " << time1 << endl;
DWORD time2 = profile_routine(profile_shuffle_r2, arr2, iterations);
cout << "time2 = " << time2 << endl;
return 0;
}
In the test above, I have two shuffle methods shuffle1 and shuffle2 that do the same thing. When compiled with MSVC -O2, it produces the following code:
shuffle1:
mov eax,dword ptr [rdx+0Ch]
mov dword ptr [rcx],eax
mov eax,dword ptr [rdx+8]
mov dword ptr [rcx+4],eax
mov eax,dword ptr [rdx]
mov dword ptr [rcx+8],eax
mov eax,dword ptr [rdx+4]
mov dword ptr [rcx+0Ch],eax
mov rax,rcx
ret
shuffle2:
movaps xmm2,xmmword ptr [rdx]
mov rax,rcx
movaps xmm0,xmm2
shufps xmm0,xmm2,0FFh
movss dword ptr [rcx],xmm0
movaps xmm0,xmm2
shufps xmm0,xmm2,0AAh
movss dword ptr [rcx+4],xmm0
movss dword ptr [rcx+8],xmm2
shufps xmm2,xmm2,55h
movss dword ptr [rcx+0Ch],xmm2
ret
shuffle1 is always at least 30% faster than shuffle2 on my machine. I did notice shuffle2 has two more instructions and shuffle1 actually uses eax instead of xmm0, so I thought that if I added some junk arithmetic operations, the result would be different.
So I modified them as the following:
__declspec(noinline) Struct shuffle1(float *arr)
{
float x0 = arr[3];
float y0 = arr[2];
float z0 = arr[0];
float w0 = arr[1];
float x = x0 + y0 + z0;
float y = y0 + z0 + w0;
float z = z0 + w0 + x0;
float w = w0 + x0 + y0;
return {x, y, z, w};
}
#define SS0 (0x00)
#define SS1 (0x55)
#define SS2 (0xAA)
#define SS3 (0xFF)
__declspec(noinline) Struct shuffle2(float *arr)
{
Struct r;
__m128 packed = *reinterpret_cast<__m128 *>(arr);
__m128 x0 = _mm_shuffle_ps(packed, packed, SS3);
__m128 y0 = _mm_shuffle_ps(packed, packed, SS2);
__m128 z0 = _mm_shuffle_ps(packed, packed, SS0);
__m128 w0 = _mm_shuffle_ps(packed, packed, SS1);
__m128 yz = _mm_add_ss(y0, z0);
__m128 x = _mm_add_ss(x0, yz);
__m128 y = _mm_add_ss(w0, yz);
__m128 wx = _mm_add_ss(w0, x0);
__m128 z = _mm_add_ss(z0, wx);
__m128 w = _mm_add_ss(y0, wx);
_mm_store_ss(&r.x, x);
_mm_store_ss(&r.y, y);
_mm_store_ss(&r.z, z);
_mm_store_ss(&r.w, w);
return r;
}
and now the assembly looks a bit more fair as they have the same number of instructions and both need to use xmm registers.
shuffle1:
movss xmm5,dword ptr [rdx+8]
mov rax,rcx
movss xmm3,dword ptr [rdx+0Ch]
movaps xmm0,xmm5
movss xmm2,dword ptr [rdx]
addss xmm0,xmm3
movss xmm4,dword ptr [rdx+4]
movaps xmm1,xmm2
addss xmm1,xmm5
addss xmm0,xmm2
addss xmm1,xmm4
movss dword ptr [rcx],xmm0
movaps xmm0,xmm4
addss xmm0,xmm2
addss xmm4,xmm3
movss dword ptr [rcx+4],xmm1
addss xmm0,xmm3
addss xmm4,xmm5
movss dword ptr [rcx+8],xmm0
movss dword ptr [rcx+0Ch],xmm4
ret
shuffle2:
movaps xmm4,xmmword ptr [rdx]
mov rax,rcx
movaps xmm3,xmm4
movaps xmm5,xmm4
shufps xmm5,xmm4,0AAh
movaps xmm2,xmm4
shufps xmm2,xmm4,0FFh
movaps xmm0,xmm5
addss xmm0,xmm3
shufps xmm4,xmm4,55h
movaps xmm1,xmm4
addss xmm1,xmm2
addss xmm2,xmm0
addss xmm4,xmm0
addss xmm3,xmm1
addss xmm5,xmm1
movss dword ptr [rcx],xmm2
movss dword ptr [rcx+4],xmm4
movss dword ptr [rcx+8],xmm3
movss dword ptr [rcx+0Ch],xmm5
ret
but it doesn't matter. shuffle1 is still 30% faster!
Without the broader context, it is hard to say for sure, but ... when optimizing for newer processors, you have to consider the usage of the different ports. See Agner Fog's instruction tables: http://www.agner.org/optimize/instruction_tables.pdf
In this case, while it may seem unlikely, there are a few possibilities that jump out at me if we're assuming that the assembly is, in fact, optimized.
1. This could appear in a stretch of code where the out-of-order scheduler happens to have more of port 5 available (on Haswell, for example) than ports 2 and 3 (again, using Haswell as an example).
2. Similar to #1, but the same effect might be observed with hyperthreading. This code may be intended not to steal read operations from the sibling hyperthread.
3. Lastly, and this is specific to this sort of optimization and where I've used something similar: say you have a branch that is near 100% predictable at run time, but not at compile time. Let's imagine, hypothetically, that right after the branch there is a read that is often a cache miss. You want to read as soon as possible. The out-of-order scheduler will read ahead and begin executing that read if you don't use the read ports. This could make the shufps instructions essentially "free" to execute. Here's that example:
MOV ecx, [some computed, mostly constant at run-time global]
label loop:
ADD rdi, 16
ADD rbp, 16
CALL shuffle
SUB ecx, 1
JNE loop
MOV rax, [rdi]
;do a read that could be "predicted" properly
MOV rbx, [rax]
Honestly though, it just looks like poorly written assembly or poorly generated machine code, so I wouldn't put much thought into it. The example I'm giving is pretty darned unlikely.
You don't show whether the later code actually uses the results of broadcasting each element to all 4 positions of a vector (e.g. 0x55 is _MM_SHUFFLE(1,1,1,1)). If you already need that for a ...ps instruction later, then you need those shuffles anyway, so there'd be no reason to also do scalar loads.
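For example, a hypothetical caller like this would genuinely need all four broadcasts (a sketch only; the function and names are made up, not taken from the program in question):
#include <xmmintrin.h>
// Each broadcast lane of v scales a whole row, so the shuffles are not wasted work.
void scale_rows(const __m128 rows[4], __m128 v, __m128 out[4])
{
    out[0] = _mm_mul_ps(rows[0], _mm_shuffle_ps(v, v, _MM_SHUFFLE(0,0,0,0)));
    out[1] = _mm_mul_ps(rows[1], _mm_shuffle_ps(v, v, _MM_SHUFFLE(1,1,1,1))); // 0x55
    out[2] = _mm_mul_ps(rows[2], _mm_shuffle_ps(v, v, _MM_SHUFFLE(2,2,2,2))); // 0xAA
    out[3] = _mm_mul_ps(rows[3], _mm_shuffle_ps(v, v, _MM_SHUFFLE(3,3,3,3))); // 0xFF
}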
If you don't need the broadcast values, and the only visible side-effect is the stores to memory, this is just a hilariously bad missed optimization by either a human programmer using intrinsics, and/or by a compiler. Just like in your examples of MSVC output for your test functions.
Keep in mind that some compilers (like ICC and MSVC) don't really optimize intrinsics, so if you write 3x _mm_shuffle_ps you get 3x shufps, so this poor decision could have been made by a human using intrinsics, not a compiler.
But Clang on the other hand does aggressively optimize shuffle intrinsics. clang optimizes both of your shuffle functions to one movaps load, one shufps (or pshufd), and one movups store. This is optimal for most CPUs, getting the work done in the fewest instructions and uops.
(gcc auto-vectorizes shuffle1 but not shuffle2. MSVC fails at everything, just using scalar for shuffle1)
(If you just need each scalar float at the bottom of an xmm register for ...ss instructions, you can use the shuffle that creates your store vector as one of them, because it has a different low element than the input. You'd movaps copy first, though, or use pshufd, to avoid destroying the reg with the original low element.)
If tuning specifically for a CPU with slow movups stores (like Intel pre-Nehalem) and the result isn't known to be aligned, then you'd still use one shufps but store the result with movlps and movhps. This is what gcc does if you compile with -mtune=core2.
You apparently know your input vector is aligned, so it still makes a huge amount of sense to load it with movaps. K8 will split a movaps into two 8-byte load uops, but most other x86-64 CPUs can do 16-byte aligned loads as a single uop. (Pentium M / Core 1 were the last mainstream Intel CPUs to split 128-bit vector ops like that, and they didn't support 64-bit mode.)
vbroadcastss requires AVX, so without AVX if you want a dword from memory broadcast into an XMM register, you have to use a shuffle instruction that needs a port 5 ALU uop. (vbroadcastss xmm0, [rsi+4] decodes to a pure load uop on Intel CPUs, no ALU uop needed, so it has 2 per clock throughput instead of 1.)
Old CPUs like Merom and K8 have slow shuffle units that are only 64 bits wide, so shufps is pretty slow because it's a full 128-bit shuffle with granularity smaller than 64 bits. You might consider doing 2x movsd or movq loads to feed pshuflw, which is fast because it only shuffles the low 64 bits. But only if you're specifically tuning for old CPUs.
// for gcc, I used __attribute__((ms_abi)) to target the Windows x64 calling convention
Struct shuffle3(float *arr)
{
Struct r;
__m128 packed = _mm_load_ps(arr);
__m128 xyzw = _mm_shuffle_ps(packed, packed, _MM_SHUFFLE(1,0,2,3));
_mm_storeu_ps(&r.x, xyzw);
return r;
}
shuffle1 and shuffle3 both compile to identical code with gcc and clang (on the Godbolt compiler explorer), because they auto-vectorize the scalar assignments. The only difference is using a movups load for shuffle1, because nothing guarantees 16-byte alignment there. (If we promised the compiler an aligned pointer for the pure C scalar version, then it would be exactly identical.)
# MSVC compiles shuffle3 like this as well
# from gcc9.1 -O3 (default baseline x86-64, tune=generic)
shuffle3(float*):
movaps xmm0, XMMWORD PTR [rdx] # MSVC still uses movups even for _mm_load_ps
mov rax, rcx # return the retval pointer
shufps xmm0, xmm0, 75
movups XMMWORD PTR [rcx], xmm0 # store to the hidden retval pointer
ret
With -mtune=core2, gcc still auto-vectorizes shuffle1. It uses split unaligned loads because we didn't promise the compiler aligned memory.
For shuffle3, it does use movaps but still splits _mm_storeu_ps into movlps + movhps. (This is one of the interesting effects that tuning options can have. They don't let the compiler use new instructions, just change the selection among existing ones.)
# gcc9.1 -O3 -mtune=core2 # auto-vectorizing shuffle1
shuffle1(float*):
movq xmm0, QWORD PTR [rdx]
mov rax, rcx
movhps xmm0, QWORD PTR [rdx+8]
shufps xmm0, xmm0, 75
movlps QWORD PTR [rcx], xmm0 # store in 2 halves
movhps QWORD PTR [rcx+8], xmm0
ret
MSVC doesn't have tuning options, and doesn't auto-vectorize shuffle1.

SSE/NEON table lookup optimization

I have the following lookup and interpolation code to optimize (float table of size 128).
It will be used with the Intel compiler on Windows, GCC on OSX, and GCC with NEON on OSX.
for(unsigned int i = 0 ; i < 4 ; i++)
{
const int iIdx = (int)m_fIndex[i];
const float frac = m_fIndex[i] - iIdx;
m_fResult[i] = sftable[iIdx].val + sftable[iIdx].val2 * frac;
}
I vectorized everything with SSE/NEON (the macros map to SSE/NEON instructions).
VEC_INT iIdx = VEC_FLOAT2INT(m_fIndex);
VEC_FLOAT frac = VEC_SUB(m_fIndex, VEC_INT2FLOAT(iIdx));
m_fResult[0] = sftable[iIdx[0]].val2;
m_fResult[1] = sftable[iIdx[1]].val2;
m_fResult[2] = sftable[iIdx[2]].val2;
m_fResult[3] = sftable[iIdx[3]].val2;
m_fResult=VEC_MUL( m_fResult,frac);
frac[0] = sftable[iIdx[0]].val1;
frac[1] = sftable[iIdx[1]].val1;
frac[2] = sftable[iIdx[2]].val1;
frac[3] = sftable[iIdx[3]].val1;
m_fResult=VEC_ADD( m_fResult,frac);
I think that the table access and the move into aligned memory are the real bottleneck here.
I am not good with assembler, but there are a lot of unpcklps and mov instructions:
10026751 mov eax,dword ptr [esp+4270h]
10026758 movaps xmm3,xmmword ptr [eax+16640h]
1002675F cvttps2dq xmm5,xmm3
10026763 cvtdq2ps xmm4,xmm5
10026766 movd edx,xmm5
1002676A movdqa xmm6,xmm5
1002676E movdqa xmm1,xmm5
10026772 psrldq xmm6,4
10026777 movdqa xmm2,xmm5
1002677B movd ebx,xmm6
1002677F subps xmm3,xmm4
10026782 psrldq xmm1,8
10026787 movd edi,xmm1
1002678B psrldq xmm2,0Ch
10026790 movdqa xmmword ptr [esp+4F40h],xmm5
10026799 mov ecx,dword ptr [eax+edx*8+10CF4h]
100267A0 movss xmm0,dword ptr [eax+edx*8+10CF4h]
100267A9 mov dword ptr [eax+166B0h],ecx
100267AF movd ecx,xmm2
100267B3 mov esi,dword ptr [eax+ebx*8+10CF4h]
100267BA movss xmm4,dword ptr [eax+ebx*8+10CF4h]
100267C3 mov dword ptr [eax+166B4h],esi
100267C9 mov edx,dword ptr [eax+edi*8+10CF4h]
100267D0 movss xmm7,dword ptr [eax+edi*8+10CF4h]
100267D9 mov dword ptr [eax+166B8h],edx
100267DF movss xmm1,dword ptr [eax+ecx*8+10CF4h]
100267E8 unpcklps xmm0,xmm7
100267EB unpcklps xmm4,xmm1
100267EE unpcklps xmm0,xmm4
100267F1 mulps xmm0,xmm3
100267F4 movaps xmmword ptr [eax+166B0h],xmm0
100267FB mov ebx,dword ptr [esp+4F40h]
10026802 mov edi,dword ptr [esp+4F44h]
10026809 mov ecx,dword ptr [esp+4F48h]
10026810 mov esi,dword ptr [esp+4F4Ch]
10026817 movss xmm2,dword ptr [eax+ebx*8+10CF0h]
10026820 movss xmm5,dword ptr [eax+edi*8+10CF0h]
10026829 movss xmm3,dword ptr [eax+ecx*8+10CF0h]
10026832 movss xmm6,dword ptr [eax+esi*8+10CF0h]
1002683B unpcklps xmm2,xmm3
1002683E unpcklps xmm5,xmm6
10026841 unpcklps xmm2,xmm5
10026844 mulps xmm2,xmm0
10026847 movaps xmmword ptr [eax+166B0h],xmm2
When profiling, there is not much benefit from the SSE version on Windows.
Do you have any suggestions on how to improve this?
Are there any side effects to be expected with NEON/GCC?
Currently I am considering vectorizing just the first part and doing the table readout and interpolation in a scalar loop, hoping that it will benefit from compiler optimization.
OSX? Then it has nothing to do with NEON.
BTW, NEON cannot handle LUTs this large anyway (I don't know about SSE in this respect).
Verify first whether SSE can handle LUTs of this size; if so, I suggest using a different compiler, since GCC tends to make intrinsucks out of intrinsics.
That’s some of the absolute worst compiler codegen I’ve ever seen (assuming that the optimizer is enabled). Worth filing a bug against GCC.
Major issues:
Loading val and val2 for each lookup separately.
Getting the index for val and val2 into a GPR separately.
Writing the vector of indices to the stack and then loading them into GPRs.
In order to get compilers to generate better code (one load for each table line), you may need to load each table line as though it were a double, then cast the line to a vector of two floats and swizzle the lines to get homogeneous vectors. On both NEON and SSE, this should require only four loads and three or four unpacks (much better than the current eight loads + six unpacks).
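For SSE, a minimal sketch of that idea might look like the following (the Entry struct, its val1/val2 layout, and the function name are assumptions based on the question, not your actual code); it uses four 64-bit loads and four unpack/move shuffles:
#include <emmintrin.h>
struct Entry { float val1, val2; }; // assumed layout of one sftable line
static inline void gather4(const Entry *sftable, const int idx[4], __m128 *val1, __m128 *val2)
{
    // one 64-bit load per table line brings in val1 and val2 together
    __m128 l0 = _mm_castpd_ps(_mm_load_sd((const double *)&sftable[idx[0]]));
    __m128 l1 = _mm_castpd_ps(_mm_load_sd((const double *)&sftable[idx[1]]));
    __m128 l2 = _mm_castpd_ps(_mm_load_sd((const double *)&sftable[idx[2]]));
    __m128 l3 = _mm_castpd_ps(_mm_load_sd((const double *)&sftable[idx[3]]));
    // swizzle the four lines into homogeneous vectors
    __m128 lo01 = _mm_unpacklo_ps(l0, l1);   // val1_0 val1_1 val2_0 val2_1
    __m128 lo23 = _mm_unpacklo_ps(l2, l3);   // val1_2 val1_3 val2_2 val2_3
    *val1 = _mm_movelh_ps(lo01, lo23);       // val1 of all four lines
    *val2 = _mm_movehl_ps(lo23, lo01);       // val2 of all four lines
}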
Getting rid of the superfluous stack traffic may be harder. Make sure that the optimizer is turned on. Fixing the multiple-load issue will halve the stack traffic because you’ll only generate each index once, but to get rid of it entirely it might be necessary to write assembly instead of intrinsics (or to use a newer compiler version).
One of the reasons why the compiler creates "funky" code (with lots of re-loads) here is that it must assume, for correctness, that the data in the sftable[] array may change. To make the generated code better, restructure it to look like:
VEC_INT iIdx = VEC_FLOAT2INT(m_fIndex);
VEC_FLOAT frac = VEC_SUB(m_fIndex ,VEC_INT2FLOAT(iIdx);
VEC_FLOAT fracnew;
// make it explicit that all you want is _four loads_
typeof(*sftable) tbl[4] = {
sftable[iIdx[0]], sftable[iIdx[1]], sftable[iIdx[2]], sftable[iIdx[3]]
};
m_fResult[0] = tbl[0].val2;
m_fResult[1] = tbl[1].val2;
m_fResult[2] = tbl[2].val2;
m_fResult[3] = tbl[3].val2;
fracnew[0] = tbl[0].val1;
fracnew[1] = tbl[1].val1;
fracnew[2] = tbl[2].val1;
fracnew[3] = tbl[3].val1;
m_fResult=VEC_MUL( m_fResult,frac);
m_fResult=VEC_ADD( m_fResult,fracnew);
frac = fracnew;
It might make sense (due to the interleaved layout of what you have in sftable[]) to use intrinsics because both vector float arrays fResult and frac are quite probably loadable from tbl[] with a single instruction (unpack hi/lo in SSE, unzip in Neon). The "main" table lookup isn't vectorizable without the help of something like AVX2's VGATHER instruction, but it doesn't have to be more than four loads.