Horizontal XOR in AVX - c++

Is there a way to XOR horizontally an AVX register—specifically, to XOR the four 64-bit components of a 256-bit register?
The goal is to get the XOR of all 4 64-bit components of an AVX register. It would essentially be doing the same thing as a horizontal add (_mm256_hadd_epi32()), except that I want to XOR instead of ADD.
The scalar code is:
inline uint64_t HorizontalXor(__m256i t) {
return t.m256i_u64[0] ^ t.m256i_u64[1] ^ t.m256i_u64[2] ^ t.m256i_u64[3];
}

As stated in the comments, the fastest code very likely uses scalar operations, doing everything in the integer registers. All you need to do is extract the four packed 64-bit integers, then you have three XOR instructions, and you're done. This can be done pretty efficiently, and it leaves the result in an integer register, which is what your sample code suggests that you would want.
MSVC already generates pretty good code for the scalar function that you show as an example in the question:
inline uint64_t HorizontalXor(__m256i t) {
return t.m256i_u64[0] ^ t.m256i_u64[1] ^ t.m256i_u64[2] ^ t.m256i_u64[3];
}
Assuming that t is in ymm1, the resulting disassembly will be something like this:
vextractf128 xmm0, ymm1, 1
vpextrq rax, xmm0, 1
vmovq rcx, xmm1
xor rax, rcx
vpextrq rcx, xmm1, 1
vextractf128 xmm0, ymm1, 1
xor rax, rcx
vmovq rcx, xmm0
xor rax, rcx
…with the result left in RAX. If this accurately reflects what you need (a scalar uint64_t result), then this code would be sufficient.
You can slightly improve it by using intrinsics:
inline uint64_t _mm256_hxor_epu64(__m256i x)
{
const __m128i temp = _mm256_extracti128_si256(x, 1);
return (uint64_t&)x
^ (uint64_t)(_mm_extract_epi64(_mm256_castsi256_si128(x), 1))
^ (uint64_t&)(temp)
^ (uint64_t)(_mm_extract_epi64(temp, 1));
}
Then you'll get the following disassembly (again, assuming that x is in ymm1):
vextracti128 xmm2, ymm1, 1
vpextrq rcx, xmm2, 1
vpextrq rax, xmm1, 1
xor rax, rcx
vmovq rcx, xmm1
xor rax, rcx
vmovq rcx, xmm2
xor rax, rcx
Notice that we were able to elide one extraction instruction, and that we've ensured VEXTRACTI128 was used instead of VEXTRACTF128 (although, this choice probably does not matter).
You'll see similar output on other compilers. For example, here's GCC 7.1 (with x assumed to be in ymm0):
vextracti128 xmm2, ymm0, 0x1
vpextrq rax, xmm0, 1
vmovq rdx, xmm2
vpextrq rcx, xmm2, 1
xor rax, rdx
vmovq rdx, xmm0
xor rax, rdx
xor rax, rcx
The same instructions are there, but they've been slightly reordered. The intrinsics allow the compiler's scheduler to order as it deems best. Clang 4.0 schedules them differently yet:
vmovq rax, xmm0
vpextrq rcx, xmm0, 1
xor rcx, rax
vextracti128 xmm0, ymm0, 1
vmovq rdx, xmm0
xor rdx, rcx
vpextrq rax, xmm0, 1
xor rax, rdx
And, of course, this ordering is always subject to change when the code gets inlined.
On the other hand, if you want the result to be in an AVX register, then you first need to decide how you want it to be stored. I guess you would just store the single 64-bit result as a scalar, something like:
inline __m256i _mm256_hxor(__m256i x)
{
const __m128i temp = _mm256_extracti128_si256(x, 1);
return _mm256_set1_epi64x((uint64_t&)x
^ (uint64_t)(_mm_extract_epi64(_mm256_castsi256_si128(x), 1))
^ (uint64_t&)(temp)
^ (uint64_t)(_mm_extract_epi64(temp, 1)));
}
But now you're doing a lot of data shuffling, negating any performance boost that you would possibly see from vectorizing the code.
Speaking of which, I'm not really sure how you got yourself into a situation where you need to do horizontal operations like this in the first place. SIMD operations are designed to scale vertically, not horizontally. If you're still in the implementation phase, it may be appropriate to reconsider the design. In particular, you should be generating the 4 integer values in 4 different AVX registers, rather than packing them all into one.
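For illustration, here is a minimal sketch of how that advice typically plays out (a hypothetical example, not the question's code): keep a vector accumulator, XOR vertically inside the loop, and do the horizontal reduction only once at the very end.
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// Sketch: vertical XOR accumulation; the horizontal reduction happens once, after the loop.
uint64_t xor_reduce(const uint64_t* data, size_t n)   // n assumed to be a multiple of 4
{
    __m256i acc = _mm256_setzero_si256();
    for (size_t i = 0; i < n; i += 4) {
        __m256i v = _mm256_loadu_si256((const __m256i*)(data + i));
        acc = _mm256_xor_si256(acc, v);               // vertical XOR, fully parallel
    }
    // One horizontal reduction at the end (scalar here, for simplicity).
    alignas(32) uint64_t lanes[4];
    _mm256_store_si256((__m256i*)lanes, acc);
    return lanes[0] ^ lanes[1] ^ lanes[2] ^ lanes[3];
}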
If you actually want 4 copies of the result packed into an AVX register, then you could do something like this:
inline __m256i _mm256_hxor(__m256i x)
{
const __m256i temp = _mm256_xor_si256(x,
_mm256_permute2f128_si256(x, x, 1));
return _mm256_xor_si256(temp,
_mm256_shuffle_epi32(temp, _MM_SHUFFLE(1, 0, 3, 2)));
}
This still exploits a bit of parallelism by doing two XORs at once, meaning that only two XOR operations are required in all, instead of three.
If it helps to visualize it, this basically does:
A B C D ⟵ input
XOR XOR XOR XOR
C D A B ⟵ permuted input
=====================================
A^C B^D A^C B^D ⟵ intermediate result
XOR XOR XOR XOR
B^D A^C B^D A^C ⟵ shuffled intermediate result
======================================
A^C^B^D A^C^B^D A^C^B^D A^C^B^D ⟵ final result
On practically all compilers, these intrinsics will produce the following assembly code:
vperm2f128 ymm0, ymm1, ymm1, 1 ; input is in YMM1
vpxor ymm2, ymm0, ymm1
vpshufd ymm1, ymm2, 78
vpxor ymm0, ymm1, ymm2
(I had come up with this on my way to bed after first posting this answer, and planned to come back and update the answer, but I see that wim beat me to the punch on posting it. Oh well, it's still a better approach than what I first had, so it still merits being included here.)
And, of course, if you wanted this in an integer register, you would just need a simple VMOVQ:
vperm2f128 ymm0, ymm1, ymm1, 1 ; input is in YMM1
vpxor ymm2, ymm0, ymm1
vpshufd ymm1, ymm2, 78
vpxor ymm0, ymm1, ymm2
vmovq rax, xmm0
The question is whether this would be faster than the scalar code above. And the answer is: yes, probably. Although you are doing the XORs in the AVX execution units instead of the completely separate integer execution units, there are fewer AVX shuffles/permutes/extracts that need to be done, which means less overhead. So I might also have to eat my words on scalar code being the fastest implementation. But it really depends on what you're doing and how the instructions can be scheduled/interleaved.

Vectorization is likely to be useful if the input of the horizontal xor-function is already in
an AVX register, i.e. your t is the result of some SIMD computation.
Otherwise, scalar code is likely to be faster, as already mentioned by @Cody Gray.
Often you can do horizontal SIMD operations in about log_2(SIMD_width) 'steps'.
In this case one step is a 'shuffle/permute' and a 'xor'. This is slightly more efficient than @Cody Gray's _mm256_hxor function:
__m256i _mm256_hxor_v2(__m256i x)
{
__m256i x0 = _mm256_permute2x128_si256(x,x,1); // swap the 128 bit high and low lane
__m256i x1 = _mm256_xor_si256(x,x0);
__m256i x2 = _mm256_shuffle_epi32(x1,0b01001110); // swap 64 bit lanes
__m256i x3 = _mm256_xor_si256(x1,x2);
return x3;
}
This compiles to:
vperm2i128 $1, %ymm0, %ymm0, %ymm1
vpxor %ymm1, %ymm0, %ymm0
vpshufd $78, %ymm0, %ymm1
vpxor %ymm1, %ymm0, %ymm0
If you want the result in a scalar register:
uint64_t _mm256_hxor_v2_uint64(__m256i x)
{
__m256i x0 = _mm256_permute2x128_si256(x,x,1);
__m256i x1 = _mm256_xor_si256(x,x0);
__m256i x2 = _mm256_shuffle_epi32(x1,0b01001110);
__m256i x3 = _mm256_xor_si256(x1,x2);
return _mm_cvtsi128_si64x(_mm256_castsi256_si128(x3)) ;
}
Which compiles to:
vperm2i128 $1, %ymm0, %ymm0, %ymm1
vpxor %ymm1, %ymm0, %ymm0
vpshufd $78, %ymm0, %ymm1
vpxor %ymm1, %ymm0, %ymm0
vmovq %xmm0, %rax
Full test code:
#include <stdio.h>
#include <x86intrin.h>
#include <stdint.h>
/* gcc -O3 -Wall -m64 -march=broadwell hor_xor.c */
int print_vec_uint64(__m256i v);
__m256i _mm256_hxor_v2(__m256i x)
{
__m256i x0 = _mm256_permute2x128_si256(x,x,1);
__m256i x1 = _mm256_xor_si256(x,x0);
__m256i x2 = _mm256_shuffle_epi32(x1,0b01001110);
__m256i x3 = _mm256_xor_si256(x1,x2);
/* Uncomment the next few lines to print the values of the intermediate variables */
/*
printf("3...0 = 3 2 1 0\n");
printf("x = ");print_vec_uint64(x );
printf("x0 = ");print_vec_uint64(x0 );
printf("x1 = ");print_vec_uint64(x1 );
printf("x2 = ");print_vec_uint64(x2 );
printf("x3 = ");print_vec_uint64(x3 );
*/
return x3;
}
uint64_t _mm256_hxor_v2_uint64(__m256i x)
{
__m256i x0 = _mm256_permute2x128_si256(x,x,1);
__m256i x1 = _mm256_xor_si256(x,x0);
__m256i x2 = _mm256_shuffle_epi32(x1,0b01001110);
__m256i x3 = _mm256_xor_si256(x1,x2);
return _mm_cvtsi128_si64x(_mm256_castsi256_si128(x3)) ;
}
int main() {
__m256i x = _mm256_set_epi64x(0x7, 0x5, 0x2, 0xB);
// __m256i x = _mm256_set_epi64x(4235566778345231, 1123312566778345423, 72345566778345673, 967856775433457);
printf("x = ");print_vec_uint64(x);
__m256i y = _mm256_hxor_v2(x);
printf("y = ");print_vec_uint64(y);
uint64_t z = _mm256_hxor_v2_uint64(x);
printf("z = %10lX \n",z);
return 0;
}
int print_vec_uint64(__m256i v){
uint64_t t[4];
_mm256_storeu_si256((__m256i *)t,v);
printf("%10lX %10lX %10lX %10lX \n",t[3],t[2],t[1],t[0]);
return 0;
}
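For reference, with the small example input the output should look something like this (hand-checked: 0x7 ^ 0x5 ^ 0x2 ^ 0xB = 0xB):
x =          7          5          2          B
y =          B          B          B          B
z =          B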

An implementation of a direct analogue of _mm256_hadd_epi32() for XOR would look something like this:
#include <immintrin.h>
template<int imm> inline __m256i _mm256_shuffle_epi32(__m256i a, __m256i b)
{
return _mm256_castps_si256(_mm256_shuffle_ps(_mm256_castsi256_ps(a), _mm256_castsi256_ps(b), imm));
}
inline __m256i _mm256_hxor_epi32(__m256i a, __m256i b)
{
return _mm256_xor_si256(_mm256_shuffle_epi32<0x88>(a, b), _mm256_shuffle_epi32<0xDD>(a, b));
}
int main()
{
__m256i a = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
__m256i b = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8);
__m256i c = _mm256_hxor_epi32(a, b);
return 0;
}
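Note that, like _mm256_hadd_epi32() itself, this pairs elements within each 128-bit lane, so the output interleaves results from a and b per lane: a0^a1, a2^a3, b0^b1, b2^b3 in the low lane and a4^a5, a6^a7, b4^b5, b6^b7 in the high lane. A standalone check (a sketch, assuming the _mm256_shuffle_epi32 and _mm256_hxor_epi32 helpers above are in scope):
#include <immintrin.h>
#include <cstdint>
#include <cstdio>

int main()
{
    __m256i a = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m256i b = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8);
    __m256i c = _mm256_hxor_epi32(a, b);
    alignas(32) uint32_t out[8];
    _mm256_store_si256(reinterpret_cast<__m256i*>(out), c);
    // Expected for this input: 1 1 3 7 | 1 1 3 15
    for (int i = 0; i < 8; ++i)
        std::printf("%u ", (unsigned)out[i]);
    std::printf("\n");
    return 0;
}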

Related

OMP SIMD logical AND on unsigned long long

I have been playing around with SIMD OMP instructions and I am not getting the compiler to emit ANDPS in my scenario.
What I'm trying to do:
This is an implementation of this problem (tldr: find a pair of users with a common friend). My approach is to pack 64 bits (whether somebody is a friend or not) into an unsigned long long.
My SIMD approach: take the AND of two relationship vectors and reduce with an OR, which fits the OMP reduction pattern nicely.
g++ invocation (on a 2019 Intel i7 MacBook Pro):
g++-11 friends.cpp -S -O3 -fopenmp -fsanitize=address -Wshadow -Wall -march=native --std=c++17;
My implementation is below:
#include <vector>
#include <algorithm>
#include "iostream"
#include <cmath>
#include <numeric>
typedef long long ll;
typedef unsigned long long ull;
using namespace std;
ull find_sol(vector<vector<ull>> & input_data, int q) {
bool not_friend = false;
ull cnt = 0;
int size_arr = (int) input_data[0].size();
for (int i = 0; i < q; ++i) // from these friends
{
for (int j = i+1; j < q; ++j) // to these friends
{
int step = j/64;
int remainder = j - 64*step;
not_friend = (input_data[i].at(step) >> remainder) % 2 == 0;
if(not_friend){
bool counter = false;
vector<ull> & v1 = input_data[i];
vector<ull> & v2 = input_data[j];
#pragma omp simd reduction(|:counter)
for (int c = 0; c < size_arr; ++c)
{
__asm__ ("entry");
counter |= (v1[c] & v2[c])>0;
__asm__ ("exit");
}
if(counter>0)
cnt++;
}
}
}
return cnt << 1;
}
int main(){
int q;
cin >> q;
vector<vector<ull>> input_data(q,vector<ull>(1 + q/64,0ULL));
for (int i = 0; i < q; ++i)
{
string s;
cin >> s;
for (int j = 0; j < 1 + q/64; ++j)
{
string str = s.substr(j*64,64);
reverse(str.begin(),str.end());
ull ul = std::stoull(str,nullptr,2);
input_data.at(i).at(j) = ul;
}
}
cout << find_sol(input_data,q) << endl;
}
Looking at the assembly inside the loop, I would expect some SIMD instructions (specifically andps) but I can't see them. What's preventing my compiler from emitting them? Also, is there a way for the compiler to emit a warning about what's wrong? (That would be very helpful.)
entry
# 0 "" 2
cmpb $0, (%rbx)
jne L53
movq (%r8), %rdx
leaq 0(,%rax,8), %rdi
addq %rdi, %rdx
movq %rdx, %r15
shrq $3, %r15
cmpb $0, (%r15,%rcx)
jne L54
cmpb $0, (%r11)
movq (%rdx), %rdx
jne L55
addq (%r9), %rdi
movq %rdi, %r15
shrq $3, %r15
cmpb $0, (%r15,%rcx)
jne L56
andq (%rdi), %rdx
movzbl (%r12), %edx
setne %dil
cmpb %r13b, %dl
jg L21
testb %dl, %dl
jne L57
L21:
orb %dil, -32(%r10)
EDIT 1:
Following Peter's 1st and 2nd suggestions, I moved the marker out of the loop and replaced the binarization with a simple OR. I'm still not getting SIMD instructions though:
ull counter = 0;
vector<ull> & v1 = input_data[i];
vector<ull> & v2 = input_data[j];
__asm__ ("entry" :::);
#pragma omp simd reduction(|:counter)
for (int c = 0; c < size_arr; ++c)
{
counter |= v1[c] & v2[c];
}
__asm__ ("exit" :::);
if(counter!=0)
cnt++;
First problem: asm. In recent GCC, non-empty Basic Asm statements like __asm__ ("entry"); have an implicit ::: "memory" clobber, making it impossible for the compiler to combine array accesses across iterations. Maybe try __asm__ ("entry" :::); if you really want these markers. (Extended asm without a memory clobber).
Or better, use better tools for looking at compiler output, such as the Godbolt compiler explorer (https://godbolt.org/) which lets you right click on a source line and go to the corresponding asm. (Optimization can make this a bit wonky, so sometimes you have to find the asm and mouseover it to make sure it comes from that source line.)
See How to remove "noise" from GCC/clang assembly output?
Second problem: -fsanitize=address makes it harder for the compiler to optimize. I only looked at GCC output without that option.
Vectorizing the OR reduction
After fixing those showstoppers:
You're forcing the compiler to booleanize to an 8-bit bool inside the inner loop, instead of just reducing the integer AND results with |= into a variable of the same type. (Which you check once after the loop.) This is probably part of why GCC has a hard time; it often makes a mess with different-sized integer types when it vectorizes at all.
(v1[c] & v2[c]) > 0 would need SSE4.1 pcmpeqq, vs. just SIMD OR in the loop and checking counter for != 0 after the loop. (You had bool counter, which was really surprising given counter>0 as a semantically weird way to check an unsigned value for non-zero. Even more unexpected for a bool.)
After changing that, GCC auto-vectorizes the way I expected without OpenMP, if you use -O3 (which includes -ftree-vectorize). It of course uses vpand, not vandps, since FP booleans have lower throughput on some CPUs. (You didn't say what -march=native is for you; if you only had AVX1, e.g. on Sandybridge, then vandps is plausible.)
ull counter = 0;
// #pragma omp simd reduction(|:counter)
for (int c = 0; c < size_arr; ++c)
{
//__asm__ ("entry");
counter |= (v1[c] & v2[c]);
//__asm__ ("exit");
}
if(counter != 0)
cnt++;
From the Godbolt compiler explorer (which you should use instead of littering your code with asm statements)
# g++ 11.2 -O3 -march=skylake **without** OpenMP
.L7: # the vector part of the inner-most loop
vmovdqu ymm2, YMMWORD PTR [rsi+rax]
vpand ymm0, ymm2, YMMWORD PTR [rcx+rax]
add rax, 32
vpor ymm1, ymm1, ymm0
cmp rax, r8
jne .L7
vextracti128 xmm0, ymm1, 0x1
vpor xmm0, xmm0, xmm1
vpsrldq xmm1, xmm0, 8
... (horizontal OR reduction of that one SIMD vector, eventually vmovq to RAX)
GCC OpenMP does vectorize, but badly / weirdly
With OpenMP, there is a vectorized version of the loop, but it sucks a lot, doing shuffles and gather loads, and storing results into a local buffer which it later reads. I don't know OpenMP that well, but unless you're using it wrong, this is a major missed optimization. Possibly it's scaling a loop counter with multiplies instead of incrementing a pointer, which is just horrible.
(Godbolt)
# g++ 11.2 -Wall -O3 -fopenmp -march=skylake -std=gnu++17
# with the #pragma uncommented
.L10:
vmovdqa ymm0, ymm3
vpermq ymm0, ymm0, 216
vpshufd ymm1, ymm0, 80 # unpack for 32x32 => 64-bit multiplies?
vpmuldq ymm1, ymm1, ymm4
vpshufd ymm0, ymm0, 250
vpmuldq ymm0, ymm0, ymm4
vmovdqa ymm7, ymm6 # ymm6 = set1(-1) outside the loop, gather mask
add rsi, 64
vpaddq ymm1, ymm1, ymm5
vpgatherqq ymm2, QWORD PTR [0+ymm1*1], ymm7
vpaddq ymm0, ymm0, ymm5
vmovdqa ymm7, ymm6
vpgatherqq ymm1, QWORD PTR [0+ymm0*1], ymm7
vpand ymm0, ymm1, YMMWORD PTR [rsi-32] # memory source = one array
vpand ymm1, ymm2, YMMWORD PTR [rsi-64]
vpor ymm0, ymm0, YMMWORD PTR [rsp+64] # OR with old contents of local buffer
vpor ymm1, ymm1, YMMWORD PTR [rsp+32]
vpaddd ymm3, ymm3, ymm4
vmovdqa YMMWORD PTR [rsp+32], ymm1 # and store back into it.
vmovdqa YMMWORD PTR [rsp+64], ymm0
cmp r9, rsi
jne .L10
mov edi, DWORD PTR [rsp+16] # outer loop tail
cmp DWORD PTR [rsp+20], edi
je .L7
This buffer of 64 bytes is read at the top of .L7 (an outer loop)
.L7:
vmovdqa ymm2, YMMWORD PTR [rsp+32]
vpor ymm1, ymm2, YMMWORD PTR [rsp+64]
vextracti128 xmm0, ymm1, 0x1
vpor xmm0, xmm0, xmm1
vpsrldq xmm1, xmm0, 8
vpor xmm0, xmm0, xmm1
vmovq rsi, xmm0
cmp rsi, 1 # sets CF unless RSI=0
sbb r13, -1 # R13 -= -1 +CF i.e. increment if CF=0
IDK if there's a way to hand-hold the compiler into making better asm; perhaps with pointer-width loop counters?
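A sketch of that idea (untested), as a drop-in for the inner-loop snippet above, using raw pointers and a pointer-width counter:
        ull counter = 0;
        const ull *p1 = v1.data();
        const ull *p2 = v2.data();
        const size_t n = v1.size();
        #pragma omp simd reduction(|:counter)
        for (size_t c = 0; c < n; ++c)
            counter |= p1[c] & p2[c];
        if (counter != 0)
            cnt++;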
GCC5.4 -O3 -fopenmp -march=haswell -std=gnu++17 makes sane asm, with just vpand / vpor and an array index increment in the loop. The stuff outside the loop is a bit different with OpenMP vs. plain vectorization, with OpenMP using vector store / scalar reload for the horizontal OR reduction of the final vector.

how to set a int32 value at some index within an m128i with only SSE2?

Is there an SSE2 intrinsic that can set a single int32 value within an __m128i?
For example, setting the value 1000 at index 1 of an __m128i that already contains 1,2,3,4 (resulting in 1,1000,3,4)?
If SSE4 available, use
__m128i _mm_insert_epi32 (__m128i a, int i, const int imm8)
If you are limited to SSE2, you need to split it into two calls of
__m128i _mm_insert_epi16 (__m128i a, int i, const int imm8)
return _mm_insert_epi16(_mm_insert_epi16(a, 1000, 2), 0, 3);
to set 1000 in lane 1 of a, interpreted as a vector of ints.
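For a general 32-bit value (not just 1000, whose upper half happens to be zero), the same idea looks like this sketch (set_lane1_sse2 is a made-up name):
#include <emmintrin.h>  // SSE2

// Sketch: set 32-bit lane 1 of `a` using only SSE2, by writing the two 16-bit
// halves of that lane separately (32-bit lane 1 occupies 16-bit lanes 2 and 3).
static inline __m128i set_lane1_sse2(__m128i a, int value)
{
    a = _mm_insert_epi16(a, value & 0xFFFF, 2);          // low 16 bits
    a = _mm_insert_epi16(a, (value >> 16) & 0xFFFF, 3);  // high 16 bits
    return a;
}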
With SSSE3 available (which _mm_alignr_epi8 requires), I would presume that a shifting/shuffling sequence would be more efficient:
a = _mm_shuffle_epi32(a, 0xe0); // shuffle as 3 2 0 ?
__m128i b = _mm_cvtsi32_si128(value);
b = _mm_alignr_epi8(b, a, 4); // value 3 2 0
return _mm_shuffle_epi32(b, 0x5c); // 3 2 value 0
If the value is in a 64-bit register, one can use _mm_cvtsi64_si128 instead.
GCC is able to convert the store/load sequence to pinsrd xmm0, eax, 1 when SSE4 is enabled, but gives quite a convoluted sequence without -msse4.
movd eax, xmm0
movaps XMMWORD PTR [rsp-24], xmm0
movabs rdx, 4294967296000
or rax, rdx
mov QWORD PTR [rsp-24], rax
movdqa xmm0, XMMWORD PTR [rsp-24]
ret
OTOH clang keeps to the store / modify-on-stack / load pattern.
movaps xmmword ptr [rsp - 24], xmm0
mov dword ptr [rsp - 20], 1000
movaps xmm0, xmmword ptr [rsp - 24]
ret
Probably the overall winner is the store/modify/load combo, which also has a free programmable index. All the others require hard-coded immediates, including those using the insert intrinsics.
__m128i store_modify_load(__m128i a, int value, size_t index) {
alignas(16) int32_t tmp[4] = {};
_mm_store_si128(reinterpret_cast<__m128i*>(tmp), a);
tmp[index] = value;
return _mm_load_si128(reinterpret_cast<__m128i*>(tmp));
}
See the produced assembly in godbolt.

What is the fastest way to convert a large c-array of char8 to short16?

My raw data is a bunch of C arrays of (unsigned) char (8-bit), each of length > 1000000.
I want to add them together (vector addition), following the rule in the code below.
Result:
a C array of (unsigned) short (16-bit).
I have read through the SSE and AVX/AVX2 intrinsics, but there is only a similar call
that multiplies two 256-bit registers: the low 32 bits of each 64-bit element are multiplied together, and each pair of 32-bit values produces a 64-bit result that fits into the 256-bit register (_mm256_mul_epi32, _mm256_mul_epu32).
Figure:
https://www.codeproject.com/Articles/874396/Crunching-Numbers-with-AVX-and-AVX
Sample code:
static inline void adder(uint16_t *canvas, uint8_t *addon, uint64_t count)
{
for (uint64_t i=0; i<count; i++)
canvas[i] += static_cast<uint16_t>(addon[i]);
}
Thanks
Adding onto @wim's answer (which is a good answer) and taking @Bathsheba's comment into account, it's well worthwhile to trust the compiler, but also to examine what your compiler outputs, both to learn how to do this and to check that it's doing what you'd want. Running a slightly modified version of your code through godbolt (for MSVC, gcc and clang) gives some non-perfect answers.
This is especially true if you limit yourself to SSE2 and below, which is what this answer assumes (and what I tested).
All compilers both vectorise and unroll the code, using punpcklbw to 'unpack' the uint8_t's into uint16_t's and then running a SIMD add and store. This is good. However, MSVC tends to spill unnecessarily in the inner loop, and clang only uses punpcklbw and not punpckhbw, which means it loads the source data twice. GCC gets the SIMD part right but has higher overhead for the loop constraints.
So theoretically if you wanted to improve these versions you can roll your own using intrinsics which would look something like:
static inline void adder2(uint16_t *canvas, uint8_t *addon, uint64_t count)
{
uint64_t count32 = (count / 32) * 32;
__m128i zero = _mm_set_epi32(0, 0, 0, 0);
uint64_t i = 0;
for (; i < count32; i+= 32)
{
uint8_t* addonAddress = (addon + i);
// Load data 32 bytes at a time and widen the input
// to uint16_t's into 4 temp xmm registers.
__m128i input = _mm_loadu_si128((__m128i*)(addonAddress + 0));
__m128i temp1 = _mm_unpacklo_epi8(input, zero);
__m128i temp2 = _mm_unpackhi_epi8(input, zero);
__m128i input2 = _mm_loadu_si128((__m128i*)(addonAddress + 16));
__m128i temp3 = _mm_unpacklo_epi8(input2, zero);
__m128i temp4 = _mm_unpackhi_epi8(input2, zero);
// Load data we need to update
uint16_t* canvasAddress = (canvas + i);
__m128i canvas1 = _mm_loadu_si128((__m128i*)(canvasAddress + 0));
__m128i canvas2 = _mm_loadu_si128((__m128i*)(canvasAddress + 8));
__m128i canvas3 = _mm_loadu_si128((__m128i*)(canvasAddress + 16));
__m128i canvas4 = _mm_loadu_si128((__m128i*)(canvasAddress + 24));
// Update the values
__m128i output1 = _mm_add_epi16(canvas1, temp1);
__m128i output2 = _mm_add_epi16(canvas2, temp2);
__m128i output3 = _mm_add_epi16(canvas3, temp3);
__m128i output4 = _mm_add_epi16(canvas4, temp4);
// Store the values
_mm_storeu_si128((__m128i*)(canvasAddress + 0), output1);
_mm_storeu_si128((__m128i*)(canvasAddress + 8), output2);
_mm_storeu_si128((__m128i*)(canvasAddress + 16), output3);
_mm_storeu_si128((__m128i*)(canvasAddress + 24), output4);
}
// Mop up
for (; i<count; i++)
canvas[i] += static_cast<uint16_t>(addon[i]);
}
Examining the output for this, it is strictly better than any of gcc/clang/msvc. So if you want to get the absolute last drop of perf (and have a fixed architecture) then something like the above is a possibility. However, it's a really small improvement, as the compilers already handle this almost perfectly, so I'd actually recommend not doing this and just trusting the compiler.
If you do think you can improve the compiler, remember to always test and profile to make sure you actually are.
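For instance, a minimal timing sketch (assuming the adder() and adder2() functions above are in scope; buffer size and repeat count are arbitrary, and on large buffers memory bandwidth will likely dominate, so expect the two versions to measure very close):
#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

int main()
{
    const uint64_t count = 1u << 20;
    std::vector<uint16_t> canvas(count, 0);
    std::vector<uint8_t>  addon(count, 1);

    // Time 1000 passes of a given adder function over the same buffers.
    auto time_ms = [&](auto fn) {
        auto start = std::chrono::steady_clock::now();
        for (int rep = 0; rep < 1000; ++rep)
            fn(canvas.data(), addon.data(), count);
        auto stop = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
    };

    std::cout << "adder : " << time_ms(adder)  << " ms\n";
    std::cout << "adder2: " << time_ms(adder2) << " ms\n";
}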
Indeed the comments are right: the compiler can do the vectorization for you.
I have modified your code a bit to improve the auto-vectorization.
With gcc -O3 -march=haswell -std=c++14 (gcc version 8.2), the following code:
#include <cstdint>
#include <immintrin.h>
void cvt_uint8_int16(uint16_t * __restrict__ canvas, uint8_t * __restrict__ addon, int64_t count) {
int64_t i;
/* If you know that count is always a multiple of 32 then insert */
/* count = count & 0xFFFFFFFFFFFFFFE0u; */
/* This leads to cleaner code. Now assume count is a multiple of 32: */
count = count & 0xFFFFFFFFFFFFFFE0u;
for (i = 0; i < count; i++){
canvas[i] += static_cast<uint16_t>(addon[i]);
}
}
compiles to:
cvt_uint8_int16(unsigned short*, unsigned char*, long):
and rdx, -32
jle .L5
add rdx, rsi
.L3:
vmovdqu ymm2, YMMWORD PTR [rsi]
add rsi, 32
add rdi, 64
vextracti128 xmm1, ymm2, 0x1
vpmovzxbw ymm0, xmm2
vpaddw ymm0, ymm0, YMMWORD PTR [rdi-64]
vpmovzxbw ymm1, xmm1
vpaddw ymm1, ymm1, YMMWORD PTR [rdi-32]
vmovdqu YMMWORD PTR [rdi-64], ymm0
vmovdqu YMMWORD PTR [rdi-32], ymm1
cmp rdx, rsi
jne .L3
vzeroupper
.L5:
Clang produces code which is a bit different: it loads 128-bit (char) vectors and converts them with vpmovzxbw.
GCC loads 256-bit (char) vectors and converts the upper and the lower 128 bits
separately, which is probably slightly less efficient.
Nevertheless, your problem is likely bandwidth limited anyway (since length > 1000000).
You can also vectorize the code with intrinsics (not tested):
void cvt_uint8_int16_with_intrinsics(uint16_t * __restrict__ canvas, uint8_t * __restrict__ addon, int64_t count) {
int64_t i;
/* Assume n is a multiple of 16 */
for (i = 0; i < count; i=i+16){
__m128i x = _mm_loadu_si128((__m128i*)&addon[i]);
__m256i y = _mm256_loadu_si256((__m256i*)&canvas[i]);
__m256i x_u16 = _mm256_cvtepu8_epi16(x);
__m256i sum = _mm256_add_epi16(y, x_u16);
_mm256_storeu_si256((__m256i*)&canvas[i], sum);
}
}
This leads to similar results as the auto-vectorized code.
In contrast to the manually-optimized approaches presented in wim's and Mike's great answers, let's also have a quick look at what a completely vanilla C++ implementation would give us:
std::transform(addon, addon + count, canvas, canvas, std::plus<void>());
Try it out on Godbolt. You'll see that even without any real effort on your part, the compiler is already able to produce vectorized code that is quite good, given that it cannot make any assumptions concerning alignment and size of your buffers, and there are also some potential aliasing issues (due to the use of uint8_t which, unfortunately, forces the compiler to assume that the pointer may alias any other object). Also, note that the code is basically identical to what you'd get from a C-style implementation (depending on the compiler, the C++ version has a few instructions more or a few instructions less):
void f(uint16_t* canvas, const uint8_t* addon, size_t count)
{
for (size_t i = 0; i < count; ++i)
canvas[i] += addon[i];
}
However, the generic C++ solution works on any combination of container and element types, as long as the element types can be added. So, as also pointed out in the other answers, while it is certainly possible to get a slightly more efficient implementation from manual optimization, one can go a long way just by writing plain C++ code (if done right). Before resorting to manually writing SSE intrinsics, consider that a generic C++ solution is more flexible, easier to maintain, and, especially, more portable. By the simple flip of the target architecture switch, you can have it produce code of similar quality not only for SSE, but also for AVX, or even ARM with NEON and whatever other instruction sets you may happen to want to run on. If you need your code to be perfect down to the last instruction for one particular use case on one particular CPU, then yes, intrinsics or even inline assembly is probably the way to go. But in general, I would suggest instead focusing on writing your C++ code in a way that enables and guides the compiler to generate the assembly you want, rather than generating the assembly yourself. For example, by using the (non-standard but generally available) restrict qualifier and borrowing the trick of letting the compiler know that your count is always a multiple of 32
void f(std::uint16_t* __restrict__ canvas, const std::uint8_t* __restrict__ addon, std::size_t count)
{
assert(count % 32 == 0);
count = count & -32;
std::transform(addon, addon + count, canvas, canvas, std::plus<void>());
}
you get (-std=c++17 -DNDEBUG -O3 -mavx)
f(unsigned short*, unsigned char const*, unsigned long):
and rdx, -32
je .LBB0_3
xor eax, eax
.LBB0_2: # =>This Inner Loop Header: Depth=1
vpmovzxbw xmm0, qword ptr [rsi + rax] # xmm0 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
vpmovzxbw xmm1, qword ptr [rsi + rax + 8] # xmm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
vpmovzxbw xmm2, qword ptr [rsi + rax + 16] # xmm2 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
vpmovzxbw xmm3, qword ptr [rsi + rax + 24] # xmm3 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
vpaddw xmm0, xmm0, xmmword ptr [rdi + 2*rax]
vpaddw xmm1, xmm1, xmmword ptr [rdi + 2*rax + 16]
vpaddw xmm2, xmm2, xmmword ptr [rdi + 2*rax + 32]
vpaddw xmm3, xmm3, xmmword ptr [rdi + 2*rax + 48]
vmovdqu xmmword ptr [rdi + 2*rax], xmm0
vmovdqu xmmword ptr [rdi + 2*rax + 16], xmm1
vmovdqu xmmword ptr [rdi + 2*rax + 32], xmm2
vmovdqu xmmword ptr [rdi + 2*rax + 48], xmm3
add rax, 32
cmp rdx, rax
jne .LBB0_2
.LBB0_3:
ret
which is really not bad…

Move an int64_t to the high quadwords of an AVX2 __m256i vector

This question is similar to [1]. However, I didn't quite understand how it addressed inserting into the high quadwords of a ymm using a GPR. Additionally, I want the operation not to use any intermediate memory accesses.
Can it be done with AVX2 or below (I don't have AVX512)?
[1] How to move double in %rax into particular qword position on %ymm or %zmm? (Kaby Lake or later)
My answer on the linked question didn't show a way to do that because it can't be done very efficiently without AVX512F for a masked broadcast (vpbroadcastq zmm0{k1}, rax). But it's actually not all that bad using a scratch register, about the same cost as a vpinsrq + an immediate blend.
(On Intel, 3 uops total. 2 uops for port 5 (vmovq + broadcast), and an immediate blend that can run on any port.
See https://agner.org/optimize/).
I updated my answer there with asm for this. In C++ with Intel's intrinsics, you'd do something like:
#include <immintrin.h>
#include <stdint.h>
// integer version. An FP version would still use _mm256_set1_epi64x, then a cast
template<unsigned elem>
static inline
__m256i merge_epi64(__m256i v, int64_t newval)
{
static_assert(elem <= 3, "a __m256i only has 4 qword elements");
__m256i splat = _mm256_set1_epi64x(newval);
constexpr unsigned dword_blendmask = 0b11 << (elem*2); // vpblendd uses 2 bits per qword
return _mm256_blend_epi32(v, splat, dword_blendmask);
}
Clang compiles this nearly perfectly efficiently for all 4 possible element positions, which really shows off how nice its shuffle optimizer is. It takes advantage of all the special cases. And as a bonus, it comments its asm to show you which elements come from where in blends and shuffles.
From the Godbolt compiler explorer, some test functions to see what happens with args in regs.
__m256i merge3(__m256i v, int64_t newval) {
return merge_epi64<3> (v, newval);
}
// and so on for 2..0
# clang7.0 -O3 -march=haswell
merge3(long long __vector(4), long):
vmovq xmm1, rdi
vpbroadcastq ymm1, xmm1
vpblendd ymm0, ymm0, ymm1, 192 # ymm0 = ymm0[0,1,2,3,4,5],ymm1[6,7]
# 192 = 0xC0 = 0b11000000
ret
merge2(long long __vector(4), long):
vmovq xmm1, rdi
vinserti128 ymm1, ymm0, xmm1, 1 # Runs on more ports than vbroadcast on AMD Ryzen
# But it introduced a dependency on v (ymm0) before the blend for no reason, for the low half of ymm1. Could have used xmm1, xmm1.
vpblendd ymm0, ymm0, ymm1, 48 # ymm0 = ymm0[0,1,2,3],ymm1[4,5],ymm0[6,7]
ret
merge1(long long __vector(4), long):
vmovq xmm1, rdi
vpbroadcastq xmm1, xmm1 # only an *XMM* broadcast, 1c latency instead of 3.
vpblendd ymm0, ymm0, ymm1, 12 # ymm0 = ymm0[0,1],ymm1[2,3],ymm0[4,5,6,7]
ret
merge0(long long __vector(4), long):
vmovq xmm1, rdi
# broadcast optimized away, newval is already in the low element
vpblendd ymm0, ymm0, ymm1, 3 # ymm0 = ymm1[0,1],ymm0[2,3,4,5,6,7]
ret
Other compilers blindly broadcast to the full YMM and then blend, even for elem=0. You can specialize the template, or add if() conditions in the template that will optimize away. e.g. splat = (elem?) set1() : v; to save the broadcast for elem==0. You could capture the other optimizations, too, if you wanted.
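A sketch of that kind of specialization (untested; assumes C++17 for if constexpr), where the elem==0 case skips the broadcast because the vmovq already leaves newval in the low element:
template<unsigned elem>
static inline __m256i merge_epi64(__m256i v, int64_t newval)
{
    static_assert(elem <= 3, "a __m256i only has 4 qword elements");
    __m256i splat;
    if constexpr (elem == 0)
        splat = _mm256_castsi128_si256(_mm_cvtsi64_si128(newval)); // just vmovq; the unselected upper elements don't matter
    else
        splat = _mm256_set1_epi64x(newval);
    constexpr unsigned dword_blendmask = 0b11u << (elem * 2);      // vpblendd uses 2 bits per qword
    return _mm256_blend_epi32(v, splat, dword_blendmask);
}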
GCC 8.x and earlier use a normally-bad way of broadcasting the integer: they store/reload. This avoids using any ALU shuffle ports because broadcast-loads are free on Intel CPUs, but it introduces store-forwarding latency into the chain from the integer to the final vector result.
This is fixed in current trunk for gcc9, but I don't know if there's a workaround to get non-silly code-gen with earlier gcc. Normally -march=<an intel uarch> favours ALU instead of store/reload for integer -> vector and vice versa, but in this case the cost model still picks store/reload with -march=haswell.
# gcc8.2 -O3 -march=haswell
merge0(long long __vector(4), long):
push rbp
mov rbp, rsp
and rsp, -32 # align the stack even though no YMM is spilled/loaded
mov QWORD PTR [rsp-8], rdi
vpbroadcastq ymm1, QWORD PTR [rsp-8] # 1 uop on Intel
vpblendd ymm0, ymm0, ymm1, 3
leave
ret
; GCC trunk: g++ (GCC-Explorer-Build) 9.0.0 20190103 (experimental)
; MSVC and ICC do this, too. (For MSVC, make sure to compile with -arch:AVX2)
merge0(long long __vector(4), long):
vmovq xmm2, rdi
vpbroadcastq ymm1, xmm2
vpblendd ymm0, ymm0, ymm1, 3
ret
For a runtime-variable element position, the shuffle still works but you'd have to create a blend mask vector with the high bit set in the right element. e.g. with a vpmovsxbq load from mask[3-elem] in alignas(8) int8_t mask[] = { 0,0,0,-1,0,0,0 };. But vpblendvb or vblendvpd is slower than an immediate blend, especially on Haswell, so avoid that if possible.
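A sketch of that runtime-variable version (untested; the helper name and exact mask handling are mine, following the hint above):
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// Sketch: runtime-variable element position (elem must be 0..3).
// vpblendvb is slower than an immediate vpblendd, especially on Haswell.
static inline __m256i merge_epi64_var(__m256i v, int64_t newval, unsigned elem)
{
    alignas(8) static const int8_t masksrc[7] = { 0, 0, 0, -1, 0, 0, 0 };
    __m256i splat = _mm256_set1_epi64x(newval);
    int32_t m4;                                       // the 4 bytes starting at masksrc[3-elem]
    memcpy(&m4, &masksrc[3 - elem], sizeof(m4));
    __m256i mask = _mm256_cvtepi8_epi64(_mm_cvtsi32_si128(m4)); // vpmovsxbq: -1 -> all-ones qword
    return _mm256_blendv_epi8(v, splat, mask);                  // vpblendvb
}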

Is shufps slower than memory access?

The title may seem like nonsense, but let me explain. I was studying a program the other day when I encountered the following assembly code:
movaps xmm3, xmmword ptr [rbp-30h]
lea rdx, [rdi+1320h]
movaps xmm5, xmm3
movaps xmm6, xmm3
movaps xmm0, xmm3
movss dword ptr [rdx], xmm3
shufps xmm5, xmm3, 55h
shufps xmm6, xmm3, 0AAh
shufps xmm0, xmm3, 0FFh
movaps xmm4, xmm3
movss dword ptr [rdx+4], xmm5
movss dword ptr [rdx+8], xmm6
movss dword ptr [rdx+0Ch], xmm0
mulss xmm4, xmm3
and it seems like it mostly just copies four floats from [rbp-30h] to [rdx]. Those shufps instructions are used just to select one of the four floats in xmm3 (e.g. shufps xmm5, xmm3, 55h selects the second float and places it in xmm5).
This makes me wonder if the compiler did so because shufps is actually faster than memory access (something like movss xmm0, dword ptr [rbp-30h], movss dword ptr [rdx], xmm0).
So I wrote some tests to compare these two approaches and found shufps always slower than multiple memory accesses. Now I'm thinking maybe the use of shufps has nothing to do with performance. It might just be there to obfuscate the code so decompilers cannot produce clean code easily (tried with IDA pro and it was indeed overly complicated).
While I'll probably never use shufps explicitly anyway (by using _mm_shuffle_ps for example) in any practical programs as the compiler is most likely smarter than me, I still want to know why the compiler that compiled the program generated such code. It's neither faster nor smaller. It makes no sense.
Anyway I'll provide the tests I wrote below.
#include <Windows.h>
#include <iostream>
using namespace std;
__declspec(noinline) DWORD profile_routine(void (*routine)(void *), void *arg, int iterations = 1)
{
DWORD startTime = GetTickCount();
while (iterations--)
{
routine(arg);
}
DWORD timeElapsed = GetTickCount() - startTime;
return timeElapsed;
}
struct Struct
{
float x, y, z, w;
};
__declspec(noinline) Struct shuffle1(float *arr)
{
float x = arr[3];
float y = arr[2];
float z = arr[0];
float w = arr[1];
return {x, y, z, w};
}
#define SS0 (0x00)
#define SS1 (0x55)
#define SS2 (0xAA)
#define SS3 (0xFF)
__declspec(noinline) Struct shuffle2(float *arr)
{
Struct r;
__m128 packed = *reinterpret_cast<__m128 *>(arr);
__m128 x = _mm_shuffle_ps(packed, packed, SS3);
__m128 y = _mm_shuffle_ps(packed, packed, SS2);
__m128 z = _mm_shuffle_ps(packed, packed, SS0);
__m128 w = _mm_shuffle_ps(packed, packed, SS1);
_mm_store_ss(&r.x, x);
_mm_store_ss(&r.y, y);
_mm_store_ss(&r.z, z);
_mm_store_ss(&r.w, w);
return r;
}
void profile_shuffle_r1(void *arg)
{
float *arr = static_cast<float *>(arg);
Struct q = shuffle1(arr);
arr[0] += q.w;
arr[1] += q.z;
arr[2] += q.y;
arr[3] += q.x;
}
void profile_shuffle_r2(void *arg)
{
float *arr = static_cast<float *>(arg);
Struct q = shuffle2(arr);
arr[0] += q.w;
arr[1] += q.z;
arr[2] += q.y;
arr[3] += q.x;
}
int main(int argc, char **argv)
{
int n = argc + 3;
float arr1[4], arr2[4];
for (int i = 0; i < 4; i++)
{
arr1[i] = static_cast<float>(n + i);
arr2[i] = static_cast<float>(n + i);
}
int iterations = 20000000;
DWORD time1 = profile_routine(profile_shuffle_r1, arr1, iterations);
cout << "time1 = " << time1 << endl;
DWORD time2 = profile_routine(profile_shuffle_r2, arr2, iterations);
cout << "time2 = " << time2 << endl;
return 0;
}
In the test above, I have two shuffle methods shuffle1 and shuffle2 that do the same thing. When compiled with MSVC -O2, it produces the following code:
shuffle1:
mov eax,dword ptr [rdx+0Ch]
mov dword ptr [rcx],eax
mov eax,dword ptr [rdx+8]
mov dword ptr [rcx+4],eax
mov eax,dword ptr [rdx]
mov dword ptr [rcx+8],eax
mov eax,dword ptr [rdx+4]
mov dword ptr [rcx+0Ch],eax
mov rax,rcx
ret
shuffle2:
movaps xmm2,xmmword ptr [rdx]
mov rax,rcx
movaps xmm0,xmm2
shufps xmm0,xmm2,0FFh
movss dword ptr [rcx],xmm0
movaps xmm0,xmm2
shufps xmm0,xmm2,0AAh
movss dword ptr [rcx+4],xmm0
movss dword ptr [rcx+8],xmm2
shufps xmm2,xmm2,55h
movss dword ptr [rcx+0Ch],xmm2
ret
shuffle1 is always at least 30% faster than shuffle2 on my machine. I did notice shuffle2 has two more instructions and shuffle1 actually uses eax instead of xmm0 so I thought if I add some junk arithmetic operations, the result would be different.
So I modified them as the following:
__declspec(noinline) Struct shuffle1(float *arr)
{
float x0 = arr[3];
float y0 = arr[2];
float z0 = arr[0];
float w0 = arr[1];
float x = x0 + y0 + z0;
float y = y0 + z0 + w0;
float z = z0 + w0 + x0;
float w = w0 + x0 + y0;
return {x, y, z, w};
}
#define SS0 (0x00)
#define SS1 (0x55)
#define SS2 (0xAA)
#define SS3 (0xFF)
__declspec(noinline) Struct shuffle2(float *arr)
{
Struct r;
__m128 packed = *reinterpret_cast<__m128 *>(arr);
__m128 x0 = _mm_shuffle_ps(packed, packed, SS3);
__m128 y0 = _mm_shuffle_ps(packed, packed, SS2);
__m128 z0 = _mm_shuffle_ps(packed, packed, SS0);
__m128 w0 = _mm_shuffle_ps(packed, packed, SS1);
__m128 yz = _mm_add_ss(y0, z0);
__m128 x = _mm_add_ss(x0, yz);
__m128 y = _mm_add_ss(w0, yz);
__m128 wx = _mm_add_ss(w0, x0);
__m128 z = _mm_add_ss(z0, wx);
__m128 w = _mm_add_ss(y0, wx);
_mm_store_ss(&r.x, x);
_mm_store_ss(&r.y, y);
_mm_store_ss(&r.z, z);
_mm_store_ss(&r.w, w);
return r;
}
and now the assembly looks a bit more fair as they have the same number of instructions and both need to use xmm registers.
shuffle1:
movss xmm5,dword ptr [rdx+8]
mov rax,rcx
movss xmm3,dword ptr [rdx+0Ch]
movaps xmm0,xmm5
movss xmm2,dword ptr [rdx]
addss xmm0,xmm3
movss xmm4,dword ptr [rdx+4]
movaps xmm1,xmm2
addss xmm1,xmm5
addss xmm0,xmm2
addss xmm1,xmm4
movss dword ptr [rcx],xmm0
movaps xmm0,xmm4
addss xmm0,xmm2
addss xmm4,xmm3
movss dword ptr [rcx+4],xmm1
addss xmm0,xmm3
addss xmm4,xmm5
movss dword ptr [rcx+8],xmm0
movss dword ptr [rcx+0Ch],xmm4
ret
shuffle2:
movaps xmm4,xmmword ptr [rdx]
mov rax,rcx
movaps xmm3,xmm4
movaps xmm5,xmm4
shufps xmm5,xmm4,0AAh
movaps xmm2,xmm4
shufps xmm2,xmm4,0FFh
movaps xmm0,xmm5
addss xmm0,xmm3
shufps xmm4,xmm4,55h
movaps xmm1,xmm4
addss xmm1,xmm2
addss xmm2,xmm0
addss xmm4,xmm0
addss xmm3,xmm1
addss xmm5,xmm1
movss dword ptr [rcx],xmm2
movss dword ptr [rcx+4],xmm4
movss dword ptr [rcx+8],xmm3
movss dword ptr [rcx+0Ch],xmm5
ret
but it doesn't matter. shuffle1 is still 30% faster!
Without the broader context, it is hard to say for sure, but... when optimizing for newer processors, you have to consider the usage of the different ports. See Agner's instruction tables here: http://www.agner.org/optimize/instruction_tables.pdf
In this case, while it may seem unlikely, there are a few possibilities that jump out at me if we're assuming that the assembly is, in fact, optimized.
1. This could appear in a stretch of code where the out-of-order scheduler happens to have more of port 5 available (on Haswell, for example) than ports 2 and 3 (again, using Haswell as an example).
2. Similar to #1, but the same effect might be observed with hyperthreading. This code may be intended to not steal read operations from the sibling hyperthread.
3. Lastly, specific to this sort of optimization, here is where I've used something similar. Say you have a branch that is near 100% predictable at run time, but not at compile time. Let's imagine, hypothetically, that right after the branch there is a read that is often a cache miss. You want to read as soon as possible. The out-of-order scheduler will read ahead and begin executing that read if you don't use the read ports. This could make the shufps instructions essentially "free" to execute. Here's that example:
MOV ecx, [some computed, mostly constant at run-time global]
label loop:
ADD rdi, 16
ADD rbp, 16
CALL shuffle
SUB ecx, 1
JNE loop
MOV rax, [rdi]
;do a read that could be "predicted" properly
MOV rbx, [rax]
Honestly though, it just looks like poorly written assembly or poorly generated machine code, so I wouldn't put much thought into it. The example I'm giving is pretty darned unlikely.
You don't show whether the later code actually uses the results of broadcasting each element to all 4 positions of a vector (e.g. 0x55 is _MM_SHUFFLE(1,1,1,1)). If you already need that for a ...ps instruction later, then you need those shuffles anyway, so there'd be no reason to also do scalar loads.
If you don't, and the only visible side-effect is the stores to memory, this is just a hilariously bad missed optimization by either a human programmer using intrinsics, and/or by a compiler. Just like in your examples of MSVC output for your test functions.
Keep in mind that some compilers (like ICC and MSVC) don't really optimize intrinsics, so if you write 3x _mm_shuffle_ps you get 3x shufps, so this poor decision could have been made by a human using intrinsics, not a compiler.
But Clang on the other hand does aggressively optimize shuffle intrinsics. clang optimizes both of your shuffle functions to one movaps load, one shufps (or pshufd), and one movups store. This is optimal for most CPUs, getting the work done in the fewest instructions and uops.
(gcc auto-vectorizes shuffle1 but not shuffle2. MSVC fails at everything, just using scalar for shuffle1)
(If you just need each scalar float at the bottom of an xmm register for ...ss instructions, you can use the shuffle that creates your store vector as one of them, because it has a different low element than the input. You'd movaps copy first, though, or use pshufd, to avoid destroying the reg with the original low element.)
If tuning specifically for a CPU with slow movups stores (like Intel pre-Nehalem) and the result isn't known to be aligned, then you'd still use one shufps but store the result with movlps and movhps. This is what gcc does if you compile with -mtune=core2.
You apparently know your input vector is aligned, so it still makes a huge amount of sense to load it with movaps. K8 will split a movaps into two 8-byte load uops, but most other x86-64 CPUs can do 16-byte aligned loads as a single uop. (Pentium M / Core 1 were the last mainstream Intel CPUs to split 128-bit vector ops like that, and they didn't support 64-bit mode.)
vbroadcastss requires AVX, so without AVX if you want a dword from memory broadcast into an XMM register, you have to use a shuffle instruction that needs a port 5 ALU uop. (vbroadcastss xmm0, [rsi+4] decodes to a pure load uop on Intel CPUs, no ALU uop needed, so it has 2 per clock throughput instead of 1.)
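To make that concrete, a small sketch (hypothetical helper names; the first function needs AVX to compile, and the exact instruction selection is up to the compiler):
#include <immintrin.h>

// With AVX: vbroadcastss xmm, [mem] is a pure load uop on Intel, no shuffle port needed.
__m128 bcast_load_avx(const float* p) { return _mm_broadcast_ss(p); }

// SSE only: typically compiles to movss + shufps (a port 5 ALU uop on Intel).
__m128 bcast_load_sse(const float* p) { return _mm_set1_ps(*p); }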
Old CPUs like Merom and K8 have slow shuffle units that are only 64 bits wide, so shufps is pretty slow because it's a full 128-bit shuffle with granularity smaller than 64 bits. You might consider doing 2x movsd or movq loads to feed pshuflw, which is fast because it only shuffles the low 64 bits. But only if you're specifically tuning for old CPUs.
// for gcc, I used __attribute__((ms_abi)) to target the Windows x64 calling convention
Struct shuffle3(float *arr)
{
Struct r;
__m128 packed = _mm_load_ps(arr);
__m128 xyzw = _mm_shuffle_ps(packed, packed, _MM_SHUFFLE(1,0,2,3));
_mm_storeu_ps(&r.x, xyzw);
return r;
}
shuffle1 and shuffle3 both compile to identical code with gcc and clang (on the Godbolt compiler explorer), because they auto-vectorize the scalar assignments. The only difference is using a movups load for shuffle1, because nothing guarantees 16-byte alignment there. (If we promised the compiler an aligned pointer for the pure C scalar version, then it would be exactly identical.)
# MSVC compiles shuffle3 like this as well
# from gcc9.1 -O3 (default baseline x86-64, tune=generic)
shuffle3(float*):
movaps xmm0, XMMWORD PTR [rdx] # MSVC still uses movups even for _mm_load_ps
mov rax, rcx # return the retval pointer
shufps xmm0, xmm0, 75
movups XMMWORD PTR [rcx], xmm0 # store to the hidden retval pointer
ret
With -mtune=core2, gcc still auto-vectorizes shuffle1. It uses split unaligned loads because we didn't promise the compiler aligned memory.
For shuffle3, it does use movaps but still splits _mm_storeu_ps into movlps + movhps. (This is one of the interesting effects that tuning options can have. They don't let the compiler use new instructions, just change the selection among existing ones.)
# gcc9.1 -O3 -mtune=core2 # auto-vectorizing shuffle1
shuffle1(float*):
movq xmm0, QWORD PTR [rdx]
mov rax, rcx
movhps xmm0, QWORD PTR [rdx+8]
shufps xmm0, xmm0, 75
movlps QWORD PTR [rcx], xmm0 # store in 2 halves
movhps QWORD PTR [rcx+8], xmm0
ret
MSVC doesn't have tuning options, and doesn't auto-vectorize shuffle1.