I've come across another problem when dealing with AVX code. I have 4 ymm registers that need to be split vertically into 4 other ymm registers
(i.e. ymm0(ABCD) -> ymm4(A...), ymm5(B...), ymm6(C...), ymm7(D...)).
Here is an example:
// a, b, c, d are __m256d structs with [] operators to access xyzw
__m256d A = _mm256_setr_pd(a[0], b[0], c[0], d[0]);
__m256d B = _mm256_setr_pd(a[1], b[1], c[1], d[1]);
__m256d C = _mm256_setr_pd(a[2], b[2], c[2], d[2]);
__m256d D = _mm256_setr_pd(a[3], b[3], c[3], d[3]);
Just putting Paul's comment into an answer:
My question is about how to do a matrix transposition, which is easily done in AVX as indicated by the link he provided.
Here's my implementation for those who come across this:
void Transpose(__m256d* A, __m256d* T)
{
    // Stage 1: within each 128-bit lane, gather the even (0b0000) or odd (0b1111) elements of a row pair.
    __m256d t0 = _mm256_shuffle_pd(A[0], A[1], 0b0000);
    __m256d t1 = _mm256_shuffle_pd(A[0], A[1], 0b1111);
    __m256d t2 = _mm256_shuffle_pd(A[2], A[3], 0b0000);
    __m256d t3 = _mm256_shuffle_pd(A[2], A[3], 0b1111);
    // Stage 2: combine 128-bit halves; 0x20 takes both low halves, 0x31 both high halves.
    T[0] = _mm256_permute2f128_pd(t0, t2, 0b0100000);
    T[1] = _mm256_permute2f128_pd(t1, t3, 0b0100000);
    T[2] = _mm256_permute2f128_pd(t0, t2, 0b0110001);
    T[3] = _mm256_permute2f128_pd(t1, t3, 0b0110001);
}
This function cuts the number of instructions roughly in half at full optimization compared to my previous attempt.
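To see why the two stages produce a transpose, here is a small scalar model of the shuffle and permute steps. It is only a sketch: `Vec4`, `shuffle_pd`, `permute2f128_pd` and `transpose_model` are hypothetical stand-ins that mimic the documented behaviour of the intrinsics on plain arrays, and the permute model ignores the zeroing bits of the immediate.

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<double, 4>;  // hypothetical stand-in for one __m256d

// Scalar model of _mm256_shuffle_pd: bit i of imm selects the odd (1) or even (0)
// element of a 128-bit pair; results 0 and 2 come from a, results 1 and 3 from b.
Vec4 shuffle_pd(const Vec4& a, const Vec4& b, int imm) {
    return { a[0 + ((imm >> 0) & 1)],
             b[0 + ((imm >> 1) & 1)],
             a[2 + ((imm >> 2) & 1)],
             b[2 + ((imm >> 3) & 1)] };
}

// Scalar model of _mm256_permute2f128_pd for selectors 0..3
// (0 = a.low, 1 = a.high, 2 = b.low, 3 = b.high); zeroing bits are not modelled.
Vec4 permute2f128_pd(const Vec4& a, const Vec4& b, int imm) {
    auto half = [&](int sel, int i) {
        const Vec4& src = (sel & 2) ? b : a;
        return src[(sel & 1) * 2 + i];
    };
    const int lo = imm & 3;
    const int hi = (imm >> 4) & 3;
    return { half(lo, 0), half(lo, 1), half(hi, 0), half(hi, 1) };
}

// Same data flow as the intrinsic version, on plain arrays.
void transpose_model(const Vec4 A[4], Vec4 T[4]) {
    const Vec4 t0 = shuffle_pd(A[0], A[1], 0x0);
    const Vec4 t1 = shuffle_pd(A[0], A[1], 0xF);
    const Vec4 t2 = shuffle_pd(A[2], A[3], 0x0);
    const Vec4 t3 = shuffle_pd(A[2], A[3], 0xF);
    T[0] = permute2f128_pd(t0, t2, 0x20);
    T[1] = permute2f128_pd(t1, t3, 0x20);
    T[2] = permute2f128_pd(t0, t2, 0x31);
    T[3] = permute2f128_pd(t1, t3, 0x31);
}

// Fills a 4x4 matrix with distinct values and checks T[c][r] == A[r][c].
void check_transpose_model() {
    Vec4 A[4], T[4];
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            A[r][c] = 10.0 * r + c;
    transpose_model(A, T);
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            assert(T[c][r] == A[r][c]);
}
```

After stage 1, t0 holds elements 0 and 2 of rows 0 and 1 interleaved, so stage 2 only has to move whole 128-bit halves into place.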
I will preface this by saying that I am a complete beginner at SIMD intrinsics.
Essentially, I have a CPU which supports the AVX2 intrinsics (Intel(R) Core(TM) i5-7500T CPU @ 2.70GHz). I would like to know the fastest way to compute the dot product of two std::vector<float> of size 512.
I have done some digging online and found this and this, and this Stack Overflow question suggests using the following function: __m256 _mm256_dp_ps(__m256 m1, __m256 m2, const int mask);. However, these all suggest different ways of performing the dot product, and I am not sure which is the correct (and fastest) way to do it.
In particular, I am looking for the fastest way to perform the dot product for vectors of size 512 (because I know the vector size affects the implementation).
Thank you for your help
Edit 1:
I am also a little confused about the -mavx2 gcc flag. If I use these AVX2 functions, do I need to add the flag when I compile? Also, is gcc able to do these optimizations for me (say if I use the -Ofast gcc flag) if I write a naive dot product implementation?
Edit 2
If anyone has the time and energy, I would very much appreciate if you could write a full implementation. I am sure other beginners would also value this information.
_mm256_dp_ps is only useful for dot-products of 2 to 4 elements; for longer vectors use vertical SIMD in a loop and reduce to scalar at the end. Using _mm256_dp_ps and _mm256_add_ps in a loop would be much slower.
GCC and clang require you to enable (with command line options) ISA extensions that you use intrinsics for, unlike MSVC and ICC.
The code below is probably close to theoretical performance limit of your CPU. Untested.
Compile it with clang or gcc -O3 -march=native. (Requires at least -mavx -mfma, but the -mtune options implied by -march are good too, and so are -mpopcnt and the other things -march=native enables. Tune options are critical to this compiling efficiently for most CPUs with FMA, specifically -mno-avx256-split-unaligned-load: Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?)
Or compile it with MSVC -O2 -arch:AVX2
#include <immintrin.h>
#include <vector>
#include <assert.h>
// CPUs support RAM access like this: "ymmword ptr [rax+64]"
// Using templates with offset int argument to make easier for compiler to emit good code.
// Multiply 8 floats by another 8 floats.
template<int offsetRegs>
inline __m256 mul8( const float* p1, const float* p2 )
{
    constexpr int lanes = offsetRegs * 8;
    const __m256 a = _mm256_loadu_ps( p1 + lanes );
    const __m256 b = _mm256_loadu_ps( p2 + lanes );
    return _mm256_mul_ps( a, b );
}
// Returns acc + ( p1 * p2 ), for 8-wide float lanes.
template<int offsetRegs>
inline __m256 fma8( __m256 acc, const float* p1, const float* p2 )
{
    constexpr int lanes = offsetRegs * 8;
    const __m256 a = _mm256_loadu_ps( p1 + lanes );
    const __m256 b = _mm256_loadu_ps( p2 + lanes );
    return _mm256_fmadd_ps( a, b, acc );
}
// Compute dot product of float vectors, using 8-wide FMA instructions.
float dotProductFma( const std::vector<float>& a, const std::vector<float>& b )
{
    assert( a.size() == b.size() );
    assert( 0 == ( a.size() % 32 ) );
    if( a.empty() )
        return 0.0f;

    const float* p1 = a.data();
    const float* const p1End = p1 + a.size();
    const float* p2 = b.data();

    // Process initial 32 values. Nothing to add yet, just multiplying.
    __m256 dot0 = mul8<0>( p1, p2 );
    __m256 dot1 = mul8<1>( p1, p2 );
    __m256 dot2 = mul8<2>( p1, p2 );
    __m256 dot3 = mul8<3>( p1, p2 );
    p1 += 8 * 4;
    p2 += 8 * 4;

    // Process the rest of the data.
    // The code uses FMA instructions to multiply + accumulate, consuming 32 values per loop iteration.
    // Unrolling manually for 2 reasons:
    // 1. To reduce data dependencies. With a single register, every loop iteration would depend on the previous result.
    // 2. Unrolled code checks for exit condition 4x less often, therefore more CPU cycles spent computing useful stuff.
    while( p1 < p1End )
    {
        dot0 = fma8<0>( dot0, p1, p2 );
        dot1 = fma8<1>( dot1, p1, p2 );
        dot2 = fma8<2>( dot2, p1, p2 );
        dot3 = fma8<3>( dot3, p1, p2 );
        p1 += 8 * 4;
        p2 += 8 * 4;
    }

    // Add 32 values into 8
    const __m256 dot01 = _mm256_add_ps( dot0, dot1 );
    const __m256 dot23 = _mm256_add_ps( dot2, dot3 );
    const __m256 dot0123 = _mm256_add_ps( dot01, dot23 );
    // Add 8 values into 4
    const __m128 r4 = _mm_add_ps( _mm256_castps256_ps128( dot0123 ), _mm256_extractf128_ps( dot0123, 1 ) );
    // Add 4 values into 2
    const __m128 r2 = _mm_add_ps( r4, _mm_movehl_ps( r4, r4 ) );
    // Add 2 lower values into the final result
    const __m128 r1 = _mm_add_ss( r2, _mm_movehdup_ps( r2 ) );
    // Return the lowest lane of the result vector.
    // The intrinsic below compiles into noop, modern compilers return floats in the lowest lane of xmm0 register.
    return _mm_cvtss_f32( r1 );
}
Possible further improvements:
Unroll by 8 vectors instead of 4. I’ve checked gcc 9.2 asm output, compiler only used 8 vector registers out of the 16 available.
Make sure both input vectors are aligned, e.g. use a custom allocator which calls _aligned_malloc / _aligned_free on msvc, or aligned_alloc / free on gcc & clang. Then replace _mm256_loadu_ps with _mm256_load_ps.
To auto-vectorize a simple scalar dot product, you'd also need OpenMP SIMD or -ffast-math (implied by -Ofast) to let the compiler treat FP math as associative even though it's not (because of rounding). But GCC won't use multiple accumulators when auto-vectorizing, even if it does unroll, so you'd bottleneck on FMA latency, not load throughput.
(2 loads per FMA means the throughput bottleneck for this code is vector loads, not actual FMA operations.)
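The multiple-accumulator idea does not depend on intrinsics, so it can be illustrated in plain C++. The following is a minimal sketch, not the answer's code: `dot_four_accumulators` is a hypothetical name, it works on scalars rather than 8-wide vectors, and unlike the intrinsic version it accepts any length by handling leftovers in a scalar tail loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dot product with four independent partial sums. Each accumulator forms its
// own dependency chain, so consecutive multiply-adds do not have to wait for
// each other the way they would with a single running sum.
float dot_four_accumulators(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += a[i + 0] * b[i + 0];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }
    // Scalar tail for lengths that are not a multiple of 4.
    float tail = 0.0f;
    for (; i < n; ++i)
        tail += a[i] * b[i];
    // Combine the partial sums pairwise, as the vector code does at the end.
    return (acc0 + acc1) + (acc2 + acc3) + tail;
}
```

Because the additions happen in a different order than in a naive loop, the result can differ from it in the last bits for general inputs. That reordering is exactly what the compiler is not allowed to do on its own without -ffast-math or OpenMP SIMD.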
I am trying to rewrite some C++ source code that includes SSE instructions into plain C++. I know I will lose performance, but it's an experiment I am trying to perform.
I was wondering if there is a C++ equivalent for doing the same as _mm_unpackhi_pd and _mm_unpacklo_pd. I have zero knowledge about SSE.
Here is a snippet of the code I am trying to convert, for reference. Any knowledge or tips would be helpful. Thank you.
for (unsigned chunk = 0; chunk < chunks; chunk++)
{
    unsigned start = chunk * chunksize;
    unsigned end =
        std::min((chunk + 1) * chunksize, (unsigned)2 * w);
    __m128d a2b2 =
        _mm_load_pd(d_origx +
                    ((2 * init_G_offset + start) & n2_m_1));
    unsigned i2_mod_B = 0;
    for (unsigned i = start; i < end; i += 2)
    {
        __m128d ab = a2b2;
        a2b2 =
            _mm_load_pd(d_origx +
                        ((origx_offset + i) & n2_m_1));
        __m128d cd = _mm_load_pd(d_filter + i);
        __m128d cc = _mm_unpacklo_pd(cd, cd);
        __m128d dd = _mm_unpackhi_pd(cd, cd);
        __m128d a0a1 = _mm_unpacklo_pd(ab, a2b2);
        __m128d b0b1 = _mm_unpackhi_pd(ab, a2b2);
        __m128d ac = _mm_mul_pd(cc, a0a1);
        __m128d ad = _mm_mul_pd(dd, a0a1);
        __m128d bc = _mm_mul_pd(cc, b0b1);
        __m128d bd = _mm_mul_pd(dd, b0b1);
        __m128d ac_m_bd = _mm_sub_pd(ac, bd);
        __m128d ad_p_bc = _mm_add_pd(ad, bc);
        __m128d ab_times_cd = _mm_unpacklo_pd(ac_m_bd, ad_p_bc);
        __m128d a2b2_times_cd =
            _mm_unpackhi_pd(ac_m_bd, ad_p_bc);
        __m128d xy = _mm_load_pd(d_x_sampt + i2_mod_B);
        __m128d x2y2 = _mm_load_pd(d_x_sampt + i2_mod_B + 2);
        __m128d st = _mm_add_pd(xy, ab_times_cd);
        __m128d s2t2 = _mm_add_pd(x2y2, a2b2_times_cd);
        _mm_store_pd(d_x_sampt + i2_mod_B, st);
        _mm_store_pd(d_x_sampt + i2_mod_B + 2, s2t2);
        i2_mod_B += 4;
    }
}
Below you find the description of the two functions, I've also linked each function to its reference page. The whole reference is available here:
__m128d _mm_unpackhi_pd (__m128d a, __m128d b)
Unpack and interleave double-precision (64-bit) floating-point
elements from the high half of a and b, and store the results in dst.
__m128d _mm_unpacklo_pd (__m128d a, __m128d b)
Unpack and interleave double-precision (64-bit) floating-point
elements from the low half of a and b, and store the results in dst.
Exactly how to implement it depends on your representation, but basically you return a new value composed of the high (or low) half of a concatenated with the high (or low) half of b. For example:
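/* A C array cannot be returned by value, so the two doubles are wrapped in a struct. */
typedef struct { double v[2]; } __m128d;

__m128d _mm_unpackhi_pd(__m128d a, __m128d b) {
    __m128d res;
    res.v[0] = a.v[1];
    res.v[1] = b.v[1];
    return res;
}

__m128d _mm_unpacklo_pd(__m128d a, __m128d b) {
    __m128d res;
    res.v[0] = a.v[0];
    res.v[1] = b.v[0];
    return res;
}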
Weird timing on this question… I found this issue while implementing this function for SIMDe, and it's only 17 days old. If you want to use SIMDe as a reference, these functions are in sse2.h along with a lot of others. The code in SIMDe is a bit more complex than what's above, but that's mostly just to match the implementations of the other _mm_unpack* functions.
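For the snippet in the question, it may help to know what the unpack/multiply sequence computes: it multiplies two complex numbers, (a, b) and (a2, b2), by the same complex number (c, d). The sketch below replays the inner loop's data flow on a two-double struct. `V2` and the helper names are hypothetical stand-ins for `__m128d` and the `_pd` intrinsics, and the loads, stores and index arithmetic are left out.

```cpp
#include <cassert>

struct V2 { double lo, hi; };  // hypothetical stand-in for __m128d

V2 unpacklo(V2 a, V2 b) { return {a.lo, b.lo}; }
V2 unpackhi(V2 a, V2 b) { return {a.hi, b.hi}; }
V2 mul(V2 a, V2 b) { return {a.lo * b.lo, a.hi * b.hi}; }
V2 add(V2 a, V2 b) { return {a.lo + b.lo, a.hi + b.hi}; }
V2 sub(V2 a, V2 b) { return {a.lo - b.lo, a.hi - b.hi}; }

// Computes (a + bi)(c + di) and (a2 + b2 i)(c + di) with the same sequence of
// operations as the question's inner loop. Each V2 holds (real, imaginary).
void complex_mul_pair(V2 ab, V2 a2b2, V2 cd, V2& ab_times_cd, V2& a2b2_times_cd) {
    const V2 cc = unpacklo(cd, cd);        // (c, c)
    const V2 dd = unpackhi(cd, cd);        // (d, d)
    const V2 a0a1 = unpacklo(ab, a2b2);    // (a, a2): both real parts
    const V2 b0b1 = unpackhi(ab, a2b2);    // (b, b2): both imaginary parts
    const V2 ac_m_bd = sub(mul(cc, a0a1), mul(dd, b0b1));  // real parts of both products
    const V2 ad_p_bc = add(mul(dd, a0a1), mul(cc, b0b1));  // imaginary parts of both products
    ab_times_cd = unpacklo(ac_m_bd, ad_p_bc);
    a2b2_times_cd = unpackhi(ac_m_bd, ad_p_bc);
}

// (1 + 2i)(5 + 6i) = -7 + 16i and (3 + 4i)(5 + 6i) = -9 + 38i.
void check_complex_mul_pair() {
    V2 p, q;
    complex_mul_pair({1, 2}, {3, 4}, {5, 6}, p, q);
    assert(p.lo == -7 && p.hi == 16);
    assert(q.lo == -9 && q.hi == 38);
}
```

Seen this way, a plain C++ port of the loop could multiply the pairs directly as complex numbers and skip the unpack steps entirely; they only exist to arrange data for two-wide SIMD arithmetic.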
I have two SSE registers (128 bits is one register) and I want to add them up. I know how I can add corresponding words in them, for example I can do it with _mm_add_epi16 if I use 16bit words in registers, but what I want is something like _mm_add_epi128 (which does not exist), which would use register as one big word.
Is there any way to perform this operation, even if multiple instructions are needed?
I was thinking about using _mm_add_epi64, detecting overflow in the right word and then adding 1 to the left word in register if needed, but I would also like this approach to work for 256bit registers (AVX2), and this approach seems too complicated for that.
To add two 128-bit numbers x and y to give z with SSE you can do it like this
z = _mm_add_epi64(x,y);
c = _mm_unpacklo_epi64(_mm_setzero_si128(), unsigned_lessthan(z,x));
z = _mm_sub_epi64(z,c);
This is based on this link how-can-i-add-and-subtract-128-bit-integers-in-c-or-c.
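In scalar terms, the three lines above add the two 64-bit halves independently and then propagate a carry from the low half to the high half. The carry is detected by checking whether the low sum wrapped around, i.e. whether it is smaller than one of its inputs. A minimal sketch, with `U128` and `add_u128` as hypothetical names:

```cpp
#include <cassert>
#include <cstdint>

struct U128 { std::uint64_t lo, hi; };  // hypothetical 128-bit value as two halves

// Adds two 128-bit numbers. Unsigned addition wraps modulo 2^64, so the low
// sum is smaller than x.lo exactly when a carry out of the low half occurred.
U128 add_u128(U128 x, U128 y) {
    U128 z;
    z.lo = x.lo + y.lo;
    const std::uint64_t carry = (z.lo < x.lo) ? 1u : 0u;
    z.hi = x.hi + y.hi + carry;
    return z;
}

// Same operands as the SSE test program: (3, 0xfff...f) + (1, 2) = (5, 1).
void check_add_u128() {
    const U128 z = add_u128(U128{0xffffffffffffffffULL, 3}, U128{2, 1});
    assert(z.lo == 1 && z.hi == 5);
}
```

In the SSE version the comparison yields an all-ones mask (the value -1) instead of 1, which is why the carry is subtracted rather than added.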
The function unsigned_lessthan is defined below. It's complicated without AMD XOP (actually I found a simpler version for SSE4.2 if XOP is not available - see the end of my answer). Probably some of the other people here can suggest a better method. Here is some code showing this works.
#include <stdint.h>
#include <x86intrin.h>
#include <stdio.h>
inline __m128i unsigned_lessthan(__m128i a, __m128i b) {
#ifdef __XOP__  // AMD XOP instruction set
    return _mm_comgt_epu64(b,a);
#else  // SSE2 instruction set
    // The operands are swapped below, so "bigger" means b > a, i.e. a < b.
    __m128i sign32  = _mm_set1_epi32(0x80000000);      // sign bit of each dword
    __m128i aflip   = _mm_xor_si128(b,sign32);         // b with sign bits flipped
    __m128i bflip   = _mm_xor_si128(a,sign32);         // a with sign bits flipped
    __m128i equal   = _mm_cmpeq_epi32(b,a);            // a == b, dwords
    __m128i bigger  = _mm_cmpgt_epi32(aflip,bflip);    // b > a, dwords
    __m128i biggerl = _mm_shuffle_epi32(bigger,0xA0);  // b > a, low dwords copied to high dwords
    __m128i eqbig   = _mm_and_si128(equal,biggerl);    // high part equal and low part bigger
    __m128i hibig   = _mm_or_si128(bigger,eqbig);      // high part bigger or high part equal and low part bigger
    __m128i big     = _mm_shuffle_epi32(hibig,0xF5);   // result copied to low part
    return big;
#endif
}
int main() {
    __m128i x,y,z,c;
    x = _mm_set_epi64x(3,0xffffffffffffffffll);
    y = _mm_set_epi64x(1,0x2ll);
    z = _mm_add_epi64(x,y);
    c = _mm_unpacklo_epi64(_mm_setzero_si128(), unsigned_lessthan(z,x));
    z = _mm_sub_epi64(z,c);
    int out[4];
    //int64_t out[2];
    _mm_storeu_si128((__m128i*)out, z);
    printf("%d %d\n", out[2], out[0]);
}
The only potentially efficient way to add 128-bit or 256-bit numbers with SSE is with XOP. The only option with AVX would be XOP2, which does not exist yet. And even if you have XOP it may only be efficient to add two 128-bit or 256-bit numbers in parallel (you could do four with AVX if XOP2 existed) to avoid the horizontal instructions such as _mm_unpacklo_epi64.
The best solution in general is to push the registers onto the stack and use scalar arithmetic. Assuming you have two 256-bit registers x4 and y4 you can add them like this:
__m256i x4, y4, z4;
uint64_t x[4], y[4], z[4];
_mm256_storeu_si256((__m256i*)x, x4);
_mm256_storeu_si256((__m256i*)y, y4);
add_u256(x, y, z);
z4 = _mm256_loadu_si256((__m256i*)z);
void add_u256(uint64_t x[4], uint64_t y[4], uint64_t z[4]) {
    uint64_t c1 = 0, c2 = 0, tmp;
    //add low 128-bits
    z[0] = x[0] + y[0];
    z[1] = x[1] + y[1];
    c1 += z[1]<x[1];
    tmp = z[1];
    z[1] += z[0]<x[0];
    c1 += z[1]<tmp;
    //add high 128-bits + carry from low 128-bits
    z[2] = x[2] + y[2];
    c2 += z[2]<x[2];
    tmp = z[2];
    z[2] += c1;
    c2 += z[2]<tmp;
    z[3] = x[3] + y[3] + c2;
}

int main() {
    uint64_t x[4], y[4], z[4];
    x[0] = -1; x[1] = -1; x[2] = 1; x[3] = 1;
    y[0] = 1; y[1] = 1; y[2] = 1; y[3] = 1;
    //z = x + y (x3,x2,x1,x0) = (2,3,1,0)
    //x[0] = -1; x[1] = -1; x[2] = 1; x[3] = 1;
    //y[0] = 1; y[1] = 0; y[2] = 1; y[3] = 1;
    //z = x + y (x3,x2,x1,x0) = (2,3,0,0)
    add_u256(x, y, z);
    for(int i=3; i>=0; i--) printf("%llu ", (unsigned long long)z[i]);
    printf("\n");
}
Edit: based on a comment by Stephen Canon at saturated-substraction-avx-or-sse4-2 I discovered there is a more efficient way to compare unsigned 64-bit numbers with SSE4.2 if XOP is not available.
__m128i a,b;
__m128i sign64 = _mm_set1_epi64x(0x8000000000000000L);
__m128i aflip = _mm_xor_si128(a, sign64);
__m128i bflip = _mm_xor_si128(b, sign64);
__m128i cmp = _mm_cmpgt_epi64(aflip,bflip);
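The sign-flip trick works because XORing with the top bit maps the unsigned range onto the signed range while preserving order, so a signed comparison of the flipped values gives the unsigned result. A scalar sketch of the same idea, with `unsigned_greater_via_signed` as a hypothetical name:

```cpp
#include <cassert>
#include <cstdint>

// Returns a > b (unsigned) using only a signed comparison, mirroring the
// XOR + _mm_cmpgt_epi64 sequence. The conversion to int64_t relies on the
// usual two's-complement wraparound (guaranteed from C++20, and what
// mainstream compilers do before that).
bool unsigned_greater_via_signed(std::uint64_t a, std::uint64_t b) {
    const std::uint64_t sign64 = 0x8000000000000000ULL;
    const std::int64_t aflip = static_cast<std::int64_t>(a ^ sign64);
    const std::int64_t bflip = static_cast<std::int64_t>(b ^ sign64);
    return aflip > bflip;
}

// Spot checks around the values where signed and unsigned order disagree.
void check_unsigned_greater_via_signed() {
    const std::uint64_t big = 0xffffffffffffffffULL;
    const std::uint64_t mid = 0x8000000000000000ULL;
    assert(unsigned_greater_via_signed(big, 1));
    assert(!unsigned_greater_via_signed(1, big));
    assert(unsigned_greater_via_signed(mid, mid - 1));
    assert(!unsigned_greater_via_signed(mid, mid));
}
```

Without the flip, a plain signed comparison would treat 0xffffffffffffffff as -1 and report it as smaller than 1.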
Following are the SSE intrinsics for which I require NEON intrinsics as I am converting some SSE code to run on iOS.
Sets the four single-precision, floating-point values to the four inputs.
(__m128 _mm_set_ps(float z , float y , float x , float w );)
Return Value:
r0 := w
r1 := x
r2 := y
r3 := z
Loads four single-precision, floating-point values. The address does not need to be 16-byte aligned.
__m128 _mm_loadu_ps(float * p);
Return Value:
r0 := p[0]
r1 := p[1]
r2 := p[2]
r3 := p[3]
Stores four single-precision, floating-point values. The address does not need to be 16-byte aligned.
void _mm_storeu_ps(float *p, __m128 a);
Return Value:
p[0] := a0
p[1] := a1
p[2] := a2
p[3] := a3
Adds the 4 signed or unsigned 32-bit integers in a to the 4 signed or unsigned 32-bit integers in b.
__m128i _mm_add_epi32 (__m128i a, __m128i b);
Return Value:
r0 := a0 + b0
r1 := a1 + b1
r2 := a2 + b2
r3 := a3 + b3
Note: Avoid unaligned memory access whenever possible. So, I need a way to convert unaligned access to aligned access (probably using padding).
I'm not very familiar with NEON intrinsics, but I can give you the equivalent NEON instructions. You'll find the appropriate macro easily then.
_mm_set_ps: If the values are already in S registers, you just have to reinterpret them as D registers.
Otherwise, you can fill a D register with a vmov instruction:
vmov.i32 d0, r0, r1
_mm_loadu_ps:
vld1.32 q0, [r0]
_mm_storeu_ps:
vst1.32 q0, [r0]
_mm_add_epi32:
vadd.u32 q0, q1, q2
My initial attempt looked like this (suppose we want to multiply a matrix by a vector):
__m128 mat[n]; /* rows */
__m128 vec[n] = {1,1,1,1};
float outvector[n];
for (int row=0;row<n;row++) {
for(int k =3; k < 8; k = k+ 4)
__m128 mrow = mat[k];
__m128 v = vec[row];
__m128 sum = _mm_mul_ps(mrow,v);
sum= _mm_hadd_ps(sum,sum); /* adds adjacent-two floats */
But this clearly doesn't work. How do I approach this?
I should load 4 at a time....
The other question is: if my array is very big (say n = 1000), how can I make it 16-byte aligned? Is that even possible?
OK... I'll use a row-major matrix convention. Each row of [m] requires (2) __m128 elements to yield 8 floats. The 8x1 vector v is a column vector. Since you're using the haddps instruction, I'll assume SSE3 is available. Finding r = [m] * v :
void mul (__m128 r[2], const __m128 m[8][2], const __m128 v[2])
{
    __m128 t0, t1, t2, t3, r0, r1, r2, r3;

    /* rows 0-3, first half of each row */
    t0 = _mm_mul_ps(m[0][0], v[0]);
    t1 = _mm_mul_ps(m[1][0], v[0]);
    t2 = _mm_mul_ps(m[2][0], v[0]);
    t3 = _mm_mul_ps(m[3][0], v[0]);
    t0 = _mm_hadd_ps(t0, t1);
    t2 = _mm_hadd_ps(t2, t3);
    r0 = _mm_hadd_ps(t0, t2);

    /* rows 0-3, second half of each row */
    t0 = _mm_mul_ps(m[0][1], v[1]);
    t1 = _mm_mul_ps(m[1][1], v[1]);
    t2 = _mm_mul_ps(m[2][1], v[1]);
    t3 = _mm_mul_ps(m[3][1], v[1]);
    t0 = _mm_hadd_ps(t0, t1);
    t2 = _mm_hadd_ps(t2, t3);
    r1 = _mm_hadd_ps(t0, t2);

    /* rows 4-7, first half of each row */
    t0 = _mm_mul_ps(m[4][0], v[0]);
    t1 = _mm_mul_ps(m[5][0], v[0]);
    t2 = _mm_mul_ps(m[6][0], v[0]);
    t3 = _mm_mul_ps(m[7][0], v[0]);
    t0 = _mm_hadd_ps(t0, t1);
    t2 = _mm_hadd_ps(t2, t3);
    r2 = _mm_hadd_ps(t0, t2);

    /* rows 4-7, second half of each row */
    t0 = _mm_mul_ps(m[4][1], v[1]);
    t1 = _mm_mul_ps(m[5][1], v[1]);
    t2 = _mm_mul_ps(m[6][1], v[1]);
    t3 = _mm_mul_ps(m[7][1], v[1]);
    t0 = _mm_hadd_ps(t0, t1);
    t2 = _mm_hadd_ps(t2, t3);
    r3 = _mm_hadd_ps(t0, t2);

    r[0] = _mm_add_ps(r0, r1);
    r[1] = _mm_add_ps(r2, r3);
}
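The repeated pattern of three horizontal adds reduces four product vectors to one vector holding the four row sums. A scalar model makes this easier to follow; `V4`, `hadd_ps` and `four_row_sums` are hypothetical names that mimic the documented lane order of `_mm_hadd_ps` on plain arrays.

```cpp
#include <array>
#include <cassert>

using V4 = std::array<float, 4>;  // hypothetical stand-in for __m128

// Scalar model of _mm_hadd_ps: adjacent pairs of a fill the low two lanes,
// adjacent pairs of b fill the high two lanes.
V4 hadd_ps(const V4& a, const V4& b) {
    return { a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
}

// Two levels of hadd turn four vectors into one vector whose lane i is the
// sum of all four elements of t_i.
V4 four_row_sums(const V4& t0, const V4& t1, const V4& t2, const V4& t3) {
    return hadd_ps(hadd_ps(t0, t1), hadd_ps(t2, t3));
}

void check_four_row_sums() {
    const V4 s = four_row_sums({1, 2, 3, 4}, {10, 20, 30, 40},
                               {0.5f, 0.5f, 0.5f, 0.5f}, {1, 1, 1, 1});
    assert(s[0] == 10.0f && s[1] == 100.0f && s[2] == 2.0f && s[3] == 4.0f);
}
```

In the function above, r0 and r1 are such row-sum vectors for the first and second half of rows 0 to 3, so adding them gives the first four entries of the result.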
As for alignment, a variable of type __m128 should be automatically aligned on the stack. With dynamic memory, this is not a safe assumption. Some malloc / new implementations may only return memory guaranteed to be 8-byte aligned.
The intrinsics header provides _mm_malloc and _mm_free. The align parameter should be (16) in this case.
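If you would rather keep using std::vector than call _mm_malloc directly, one option is a small allocator built on the aligned form of operator new. This is a sketch that assumes a C++17 compiler; `AlignedAllocator`, `AlignedFloats` and `is_aligned` are hypothetical names, not part of any intrinsics header.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Minimal allocator that requests storage with a fixed alignment (C++17).
template <class T, std::size_t Alignment>
struct AlignedAllocator {
    using value_type = T;

    AlignedAllocator() = default;
    template <class U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}

    // Needed explicitly because of the non-type template parameter.
    template <class U>
    struct rebind { using other = AlignedAllocator<U, Alignment>; };

    T* allocate(std::size_t n) {
        void* p = ::operator new(n * sizeof(T), static_cast<std::align_val_t>(Alignment));
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(p, static_cast<std::align_val_t>(Alignment));
    }
};

template <class T, class U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return true; }
template <class T, class U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return false; }

// 32 bytes covers both SSE (16) and AVX (32) aligned loads.
using AlignedFloats = std::vector<float, AlignedAllocator<float, 32>>;

bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

void check_aligned_floats() {
    AlignedFloats v(1000, 1.0f);
    assert(is_aligned(v.data(), 32));
}
```

Note that `AlignedFloats` is a different type from `std::vector<float>`, so functions taking the latter by reference would need to accept a pointer and length (or be templated) instead.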
Intel has developed a Small Matrix Library for matrices with sizes ranging from 1×1 to 6×6. Application Note AP-930 Streaming SIMD Extensions - Matrix Multiplication describes in detail the algorithm for multiplying two 6×6 matrices. This should be adaptable to other size matrices with some effort.