AVX2: Computing the dot product of 512-element float arrays - C++

I will preface this by saying that I am a complete beginner at SIMD intrinsics.
Essentially, I have a CPU which supports the AVX2 instruction set (Intel(R) Core(TM) i5-7500T CPU @ 2.70GHz), and I would like to know the fastest way to compute the dot product of two std::vector<float> of size 512.
I have done some digging online and found this and this, and this Stack Overflow question suggests using the function __m256 _mm256_dp_ps(__m256 m1, __m256 m2, const int mask);. However, these all suggest different ways of performing the dot product, and I am not sure which is the correct (and fastest) way to do it.
In particular, I am looking for the fastest way to perform the dot product for a vector of size 512 (because I know the vector size affects the implementation).
Thank you for your help.
Edit 1:
I am also a little confused about the -mavx2 gcc flag. If I use these AVX2 intrinsics, do I need to add the flag when I compile? Also, is gcc able to do these optimizations for me (say if I use the -Ofast gcc flag) if I write a naive dot product implementation?
Edit 2
If anyone has the time and energy, I would very much appreciate it if you could write a full implementation. I am sure other beginners would also value this information.

_mm256_dp_ps is only useful for dot-products of 2 to 4 elements; for longer vectors use vertical SIMD in a loop and reduce to scalar at the end. Using _mm256_dp_ps and _mm256_add_ps in a loop would be much slower.
GCC and clang require you to enable (with command line options) ISA extensions that you use intrinsics for, unlike MSVC and ICC.
The code below is probably close to the theoretical performance limit of your CPU. Untested.
Compile it with clang or gcc -O3 -march=native. (It requires at least -mavx -mfma, but the -mtune options implied by -march are good too, and so are -mpopcnt and the other things -march=native enables. Tune options are critical for this to compile efficiently on most CPUs with FMA, specifically -mno-avx256-split-unaligned-load: Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?)
Or compile it with MSVC -O2 -arch:AVX2
#include <immintrin.h>
#include <vector>
#include <assert.h>
// CPUs support RAM access like this: "ymmword ptr [rax+64]"
// Using templates with an int offset argument to make it easier for the compiler to emit good code.
// Multiply 8 floats by another 8 floats.
template<int offsetRegs>
inline __m256 mul8( const float* p1, const float* p2 )
{
constexpr int lanes = offsetRegs * 8;
const __m256 a = _mm256_loadu_ps( p1 + lanes );
const __m256 b = _mm256_loadu_ps( p2 + lanes );
return _mm256_mul_ps( a, b );
}
// Returns acc + ( p1 * p2 ), for 8-wide float lanes.
template<int offsetRegs>
inline __m256 fma8( __m256 acc, const float* p1, const float* p2 )
{
constexpr int lanes = offsetRegs * 8;
const __m256 a = _mm256_loadu_ps( p1 + lanes );
const __m256 b = _mm256_loadu_ps( p2 + lanes );
return _mm256_fmadd_ps( a, b, acc );
}
// Compute dot product of float vectors, using 8-wide FMA instructions.
float dotProductFma( const std::vector<float>& a, const std::vector<float>& b )
{
assert( a.size() == b.size() );
assert( 0 == ( a.size() % 32 ) );
if( a.empty() )
return 0.0f;
const float* p1 = a.data();
const float* const p1End = p1 + a.size();
const float* p2 = b.data();
// Process initial 32 values. Nothing to add yet, just multiplying.
__m256 dot0 = mul8<0>( p1, p2 );
__m256 dot1 = mul8<1>( p1, p2 );
__m256 dot2 = mul8<2>( p1, p2 );
__m256 dot3 = mul8<3>( p1, p2 );
p1 += 8 * 4;
p2 += 8 * 4;
// Process the rest of the data.
// The code uses FMA instructions to multiply + accumulate, consuming 32 values per loop iteration.
// Unrolling manually for 2 reasons:
// 1. To reduce data dependencies. With a single register, every loop iteration would depend on the previous result.
// 2. Unrolled code checks for exit condition 4x less often, therefore more CPU cycles spent computing useful stuff.
while( p1 < p1End )
{
dot0 = fma8<0>( dot0, p1, p2 );
dot1 = fma8<1>( dot1, p1, p2 );
dot2 = fma8<2>( dot2, p1, p2 );
dot3 = fma8<3>( dot3, p1, p2 );
p1 += 8 * 4;
p2 += 8 * 4;
}
// Add 32 values into 8
const __m256 dot01 = _mm256_add_ps( dot0, dot1 );
const __m256 dot23 = _mm256_add_ps( dot2, dot3 );
const __m256 dot0123 = _mm256_add_ps( dot01, dot23 );
// Add 8 values into 4
const __m128 r4 = _mm_add_ps( _mm256_castps256_ps128( dot0123 ), _mm256_extractf128_ps( dot0123, 1 ) );
// Add 4 values into 2
const __m128 r2 = _mm_add_ps( r4, _mm_movehl_ps( r4, r4 ) );
// Add 2 lower values into the final result
const __m128 r1 = _mm_add_ss( r2, _mm_movehdup_ps( r2 ) );
// Return the lowest lane of the result vector.
// The intrinsic below compiles into a no-op; modern compilers return floats in the lowest lane of the xmm0 register.
return _mm_cvtss_f32( r1 );
}
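For completeness, calling it is straightforward (a quick sketch; size 512 satisfies the multiple-of-32 assertion):
std::vector<float> a( 512 ), b( 512 );
// ... fill a and b ...
const float dot = dotProductFma( a, b );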
Possible further improvements:
Unroll by 8 vectors instead of 4. I've checked gcc 9.2 asm output; the compiler only used 8 of the 16 available vector registers.
Make sure both input vectors are aligned, e.g. use a custom allocator which calls _aligned_malloc / _aligned_free on MSVC, or aligned_alloc / free on gcc & clang. Then replace _mm256_loadu_ps with _mm256_load_ps.
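A minimal sketch of such an allocator, assuming C++17 aligned operator new (AlignedAllocator and AlignedFloatVector are hypothetical names, untested):
#include <cstddef>
#include <new>
#include <vector>
// Minimal C++17 aligned-allocator sketch.
template<class T, std::size_t Alignment>
struct AlignedAllocator
{
    using value_type = T;
    AlignedAllocator() = default;
    template<class U>
    AlignedAllocator( const AlignedAllocator<U, Alignment>& ) noexcept {}
    template<class U> struct rebind { using other = AlignedAllocator<U, Alignment>; };
    T* allocate( std::size_t n )
    {
        // Aligned operator new is standard since C++17.
        return static_cast<T*>( ::operator new( n * sizeof( T ), std::align_val_t{ Alignment } ) );
    }
    void deallocate( T* p, std::size_t ) noexcept
    {
        ::operator delete( p, std::align_val_t{ Alignment } );
    }
    bool operator==( const AlignedAllocator& ) const noexcept { return true; }
    bool operator!=( const AlignedAllocator& ) const noexcept { return false; }
};
// 32-byte aligned storage, so _mm256_load_ps is safe on .data().
using AlignedFloatVector = std::vector<float, AlignedAllocator<float, 32>>;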
To auto-vectorize a simple scalar dot product, you'd also need OpenMP SIMD or -ffast-math (implied by -Ofast) to let the compiler treat FP math as associative even though it's not (because of rounding). But GCC won't use multiple accumulators when auto-vectorizing, even if it does unroll, so you'd bottleneck on FMA latency, not load throughput.
(2 loads per FMA means the throughput bottleneck for this code is vector loads, not actual FMA operations.)
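For reference, the naive scalar version discussed above could look like this (a sketch; compile with -fopenmp-simd or -Ofast so the compiler is actually allowed to vectorize the reduction):
#include <cstddef>
// Naive scalar dot product; the pragma lets the compiler reassociate the additions and vectorize.
float dotProductScalar( const float* a, const float* b, std::size_t n )
{
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for( std::size_t i = 0; i < n; i++ )
        sum += a[i] * b[i];
    return sum;
}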

Related

How would you convert a "while" iterator into SIMD instructions?

This is the scalar code I actually had, which I've replicated (x4), storing the data for SIMD:
waveTable *waveTables[4];
for (int i = 0; i < 4; i++) {
int waveTableIindex = 0;
while ((phaseIncrement[i] >= mWaveTables[waveTableIindex].mTopFreq) && (waveTableIindex < kNumWaveTableSlots)) {
waveTableIindex++;
}
waveTables[i] = &mWaveTables[waveTableIindex];
}
Its not "faster" at all, of course. How would you do the same with simd, saving cpu? Any tips/starting point?
I'm with SSE2.
Here's the context of the computation.
topFreq for each wave table is calculated starting from the max harmonic count (x2, due to Nyquist), and is multiplied by 2 for every wave table (later halving the number of harmonics available for each table):
double topFreq = 1.0 / (maxHarmonic * 2);
while (maxHarmonic) {
// fill the table in with the needed harmonics
// ... makeWaveTable() code
// prepare for next table
topFreq *= 2;
maxHarmonic >>= 1;
}
Then, during processing, for each sample I need to "catch" the correct wave table to use, based on the osc's freq (i.e. phase increment):
freq = clamp(freq, 20.0f, 22050.0f);
phaseIncrement = freq * vSampleTime;
So, for example (with vSampleTime = 1/44100 and maxHarmonic = 500), 30 Hz is wavetable 0, 50 Hz is wavetable 1, and so on.
Assuming your values are FP32, I would do it like this. Untested.
const __m128 phaseIncrements = _mm_loadu_ps( phaseIncrement );
__m128i indices = _mm_setzero_si128();
__m128i activeIndices = _mm_set1_epi32( -1 );
for( size_t idx = 0; idx < kNumWaveTableSlots; idx++ )
{
// Broadcast the mTopFreq value into FP32 vector. If you build this for AVX1, will become 1 very fast instruction.
const __m128 topFreq = _mm_set1_ps( mWaveTables[ idx ].mTopFreq );
// Compare for phaseIncrements >= topFreq
const __m128 cmp_f32 = _mm_cmpge_ps( phaseIncrements, topFreq );
// The following line compiles into no instruction, it's only to please the type checker
__m128i cmp = _mm_castps_si128( cmp_f32 );
// Bitwise AND with activeIndices
cmp = _mm_and_si128( cmp, activeIndices );
// The following line increments the indices vector by 1, only the lanes where cmp was TRUE
indices = _mm_sub_epi32( indices, cmp );
// Update the set of active lane indices
activeIndices = cmp;
// The vector may become completely zero, meaning all 4 lanes have encountered at least 1 entry where topFreq > phaseIncrements
if( 0 == _mm_movemask_epi8( activeIndices ) )
break;
}
// The indices vector holds 4 32-bit integers.
// Each lane contains the index of the first table entry whose mTopFreq exceeds the corresponding lane of phaseIncrements,
// or kNumWaveTableSlots if no such entry was found.
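To get the four indices back into your waveTables array, something like this should work (a sketch reusing the names from your question):
// Store the 4 computed indices and resolve the table pointers in scalar code.
alignas(16) int32_t idx[4];
_mm_store_si128( reinterpret_cast<__m128i*>( idx ), indices );
for( int i = 0; i < 4; i++ )
    waveTables[i] = &mWaveTables[ idx[i] ];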
There is no standard way to write SIMD instructions in C++. A compiler may produce SIMD instructions when appropriate, as long as you've configured it to target a CPU that supports such instructions and enabled the relevant optimisations. You can use standard algorithms with std::execution::unsequenced_policy to help the compiler understand that SIMD is appropriate.
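For example, a sketch using std::transform_reduce (C++17) with the unsequenced policy (std::execution::unseq is C++20; whether SIMD is actually emitted still depends on your compiler and standard library):
#include <execution>
#include <numeric>
#include <vector>
float dot_unseq( const std::vector<float>& a, const std::vector<float>& b )
{
    // The unsequenced policy only grants permission to vectorize; it is a hint, not a guarantee.
    return std::transform_reduce( std::execution::unseq,
                                  a.begin(), a.end(), b.begin(), 0.0f );
}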
If you are using GCC/G++ or Clang, there is a non-standard language extension for vector types, using __attribute__((vector_size(xx))). See the GCC manual for details:
https://gcc.gnu.org/onlinedocs/gcc-11.2.0/gcc/Vector-Extensions.html#Vector-Extensions
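A sketch of a dot product using that extension (assumes the length is a multiple of 8; the compiler maps the vector type onto AVX registers when the target supports them):
#include <cstddef>
#include <cstring>
typedef float v8sf __attribute__(( vector_size( 32 ) ));  // 8 floats = 256 bits
float dot_vecext( const float* a, const float* b, std::size_t n )
{
    v8sf acc = { 0, 0, 0, 0, 0, 0, 0, 0 };
    for( std::size_t i = 0; i < n; i += 8 )
    {
        v8sf va, vb;
        std::memcpy( &va, a + i, sizeof( va ) );  // unaligned loads
        std::memcpy( &vb, b + i, sizeof( vb ) );
        acc += va * vb;  // element-wise multiply and accumulate
    }
    float sum = 0.0f;
    for( int k = 0; k < 8; k++ )  // horizontal reduction at the end
        sum += acc[k];
    return sum;
}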

Fastest way to perform AVX inner product operations with mixed (float, double) input vectors

I need to build a single-precision floating-point inner product routine for mixed single/double-precision floating-point vectors, exploiting the AVX instruction set for SIMD registers with 256 bits.
Problem: one input vector is float (x), while the other is double (yD).
Hence, before computing the actual inner product, I need to convert my input yD vector data from double to float.
Using the SSE2 instruction set, I was able to implement very fast code doing what I needed, with performance very close to the case where both vectors x and y are float:
void vector_operation(const size_t i)
{
__m128 X = _mm_load_ps(x + i);
__m128 Y = _mm_movelh_ps(_mm_cvtpd_ps(_mm_load_pd(yD + i + 0)), _mm_cvtpd_ps(_mm_load_pd(yD + i + 2)));
//inner-products accumulation
res = _mm_add_ps(res, _mm_mul_ps(X, Y));
}
Now, hoping for a further speed-up, I implemented a corresponding version with the AVX instruction set:
inline void vector_operation(const size_t i)
{
__m256 X = _mm256_load_ps(x + i);
__m128 yD1 = _mm_cvtpd_ps(_mm_load_pd(yD + i + 0));
__m128 yD2 = _mm_cvtpd_ps(_mm_load_pd(yD + i + 2));
__m128 yD3 = _mm_cvtpd_ps(_mm_load_pd(yD + i + 4));
__m128 yD4 = _mm_cvtpd_ps(_mm_load_pd(yD + i + 6));
__m128 Ylow = _mm_movelh_ps(yD1, yD2);
__m128 Yhigh = _mm_movelh_ps(yD3, yD4);
//Pack __m128 data inside __m256
__m256 Y = _mm256_permute2f128_ps(_mm256_castps128_ps256(Ylow), _mm256_castps128_ps256(Yhigh), 0x20);
//inner-products accumulation
res = _mm256_add_ps(res, _mm256_mul_ps(X, Y));
}
I also tested other AVX implementations using, for example, casting and insertion operations instead of permuting data. Performance was comparably poor compared to the case where both the x and y vectors are float.
The problem with the AVX code is that no matter how I implemented it, its performance is far inferior to that achieved using float-only x and y vectors (i.e. where no double-to-float conversion is needed).
The conversion from double to float for the yD vector seems pretty fast, while a lot of time is lost in the line where the data is inserted into the __m256 Y register.
Do you know if this is a well-known issue with AVX?
Do you have a solution that could preserve good performances?
Thanks in advance!
I rewrote your function and took better advantage of what AVX has to offer. I also used fused multiply-add at the end; if you can't use FMA, just replace that line with addition and multiplication. I only now see that I wrote an implementation that uses unaligned loads and yours uses aligned loads, but I'm not gonna lose any sleep over it. :)
__m256 foo(float*x, double* yD, const size_t i, __m256 res_prev)
{
__m256 X = _mm256_loadu_ps(x + i);
__m128 yD21 = _mm256_cvtpd_ps(_mm256_loadu_pd(yD + i + 0));
__m128 yD43 = _mm256_cvtpd_ps(_mm256_loadu_pd(yD + i + 4));
__m256 Y = _mm256_set_m128(yD43, yD21);
return _mm256_fmadd_ps(X, Y, res_prev);
}
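A hypothetical driver loop for it might look like this (x, yD and n are assumed to exist, with n a multiple of 8; the final horizontal reduction can be done as in the dot-product answer above):
__m256 acc = _mm256_setzero_ps();
for( size_t i = 0; i < n; i += 8 )
    acc = foo( x, yD, i, acc );
// ...then horizontally sum the 8 lanes of acc into a single float.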
I did a quick benchmark and compared the running times of your implementation and mine. I tried two different benchmark approaches with several repetitions, and every time my code was around 15% faster. I used the MSVC 14.1 compiler and compiled the program with the /O2 and /arch:AVX2 flags.
EDIT: this is the disassembly of the function:
vcvtpd2ps xmm3,ymmword ptr [rdx+r8*8+20h]
vcvtpd2ps xmm2,ymmword ptr [rdx+r8*8]
vmovups ymm0,ymmword ptr [rcx+r8*4]
vinsertf128 ymm3,ymm2,xmm3,1
vfmadd213ps ymm0,ymm3,ymmword ptr [r9]
EDIT 2: this is the disassembly of your AVX implementation of the same algorithm:
vcvtpd2ps xmm0,xmmword ptr [rdx+r8*8+30h]
vcvtpd2ps xmm1,xmmword ptr [rdx+r8*8+20h]
vmovlhps xmm3,xmm1,xmm0
vcvtpd2ps xmm0,xmmword ptr [rdx+r8*8+10h]
vcvtpd2ps xmm1,xmmword ptr [rdx+r8*8]
vmovlhps xmm2,xmm1,xmm0
vperm2f128 ymm3,ymm2,ymm3,20h
vmulps ymm0,ymm3,ymmword ptr [rcx+r8*4]
vaddps ymm0,ymm0,ymmword ptr [r9]

Why do two consecutive gather instructions perform worse than equivalent elementary ops?

I am upgrading some code from SSE to AVX2. In general I can see that gather instructions are quite useful and benefit performance. However, I encountered a case where gather instructions are less efficient than decomposing the gather operations into simpler ones.
In the code below, I have a vector of int32 b, a vector of double xi, and 4 int32 indices packed in a 128-bit register bidx. I need to gather first from vector b, then from vector xi. I.e., in pseudo code, I need to do:
__m128i i = b[idx];
__m256d x = xi[i];
In the function below, I implement this in two ways using an #ifdef: via gather instructions, yielding a throughput of 290 Miter/sec, and via elementary operations, yielding a throughput of 325 Miter/sec.
Can somebody explain what is going on? Thanks
inline void resolve( const __m256d& z, const __m128i& bidx, int32_t j
, const int32_t *b, const double *xi, int32_t* ri )
{
__m256d x;
__m128i i;
#if 0 // this code uses two gather instructions in sequence
i = _mm_i32gather_epi32(b, bidx, 4); // i = b[bidx]
x = _mm256_i32gather_pd(xi, i, 8); // x = xi[i]
#else // this code does not use gather instructions
union {
__m128i vec;
int32_t i32[4];
} u;
x = _mm256_set_pd
( xi[(u.i32[3] = b[_mm_extract_epi32(bidx,3)])]
, xi[(u.i32[2] = b[_mm_extract_epi32(bidx,2)])]
, xi[(u.i32[1] = b[_mm_extract_epi32(bidx,1)])]
, xi[(u.i32[0] = b[_mm_cvtsi128_si32(bidx) ])]
);
i = u.vec;
#endif
// here we use x and i
__m256 ps256 = _mm256_castpd_ps(_mm256_cmp_pd(z, x, _CMP_LT_OS));
__m128 lo128 = _mm256_castps256_ps128(ps256);
__m128 hi128 = _mm256_extractf128_ps(ps256, 1);
__m128 blend = _mm_shuffle_ps(lo128, hi128, 0 + (2<<2) + (0<<4) + (2<<6));
__m128i lt = _mm_castps_si128(blend); // this is 0 or -1
i = _mm_add_epi32(i, lt);
_mm_storeu_si128(reinterpret_cast<__m128i*>(ri)+j, i);
}
Since your 'resolve' function is marked as inline, I suppose it's called in a high-frequency loop. Then you might also have a look at how the input parameters depend on each other outside the 'resolve' function. The compiler might be able to optimize the inlined code better across loop boundaries when using the scalar code variant.

Location of Intel's __assume affects performance

I am using an 8-th order finite difference time stepping function (for 2D acoustic wave equation) shown below.
I am observing a substantial (up to 25%) performance increase from placing Intel's __assume statement inside the inner loop, compared to placing it at the beginning of the function body. (This happens regardless of the number of OpenMP threads.)
The code is compiled with the Intel 2016 update 1 compiler on Linux, with the -O3 optimization option, for an AVX-capable architecture (Xeon E5-2695 v2).
Is it a compiler problem?
/* Finite difference, 8-th order scheme for acoustic 2D equation.
p - current pressure
q - previous and next pressure
c - velocity
n0 x n1 - problem size
p1 - stride
*/
void fdtd_2d( float const* const __restrict__ p,
float * const __restrict__ q,
float const* const __restrict__ c,
int const n0,
int const n1,
int const p1 )
{
// Stencil coefficients.
static const float C[5] = { -5.6944444e+0f, 1.6000000e+0f, -2.0000000e-1f, 2.5396825e-2f, -1.7857143e-3f };
// INTEL OPTIMIZER PROBLEM?
// PLACING THE FOLLOWING LINE INSIDE THE LOOP BELOW
// INSTEAD OF HERE SPEEDS UP THE CODE!
// __assume( p1 % 16 == 0 );
#pragma omp parallel for default(none)
for ( int i1 = 0; i1 < n1; ++i1 )
{
float const* const __restrict__ ps = p + i1 * p1;
float * const __restrict__ qs = q + i1 * p1;
float const* const __restrict__ cs = c + i1 * p1;
#pragma omp simd aligned( ps, qs, cs : 64 )
for ( int i0 = 0; i0 < n0; ++i0 )
{
// INTEL OPTIMIZER PROBLEM?
// PLACING THE FOLLOWING LINE HERE
// INSTEAD OF THE ABOVE SPEEDS UP THE CODE!
__assume( p1 % 16 == 0 );
// Laplacian cross stencil:
// center and 4 points up, down, left and right from the center
auto lap = C[0] * ps[i0];
for ( int r = 1; r <= 4; ++r )
lap += C[r] * ( ps[i0 + r] + ps[i0 - r] + ps[i0 + r * p1] + ps[i0 - r * p1] );
qs[i0] = 2.0f * ps[i0] - qs[i0] + cs[i0] * lap;
}
}
}
I was pointed to the following on Intel's website:
Clauses such as __assume_aligned and __assume tell the compiler that the property holds at the particular point in the program where the clause appears. So the statement __assume_aligned(a, 64); means the pointer a is aligned at 64 bytes whenever program execution reaches this point. Compiler may propagate that property to other points in the program (such as a later loop), but this behavior is not guaranteed (it is possible that compiler has to make conservative assumptions and cannot apply the property safely for a later loop in the same function).
So when I place __assume at the beginning of the function body, the assumption is not propagated into the inner loops, which results in less optimal code.
Still, my expectation was reasonable: since p1 is declared const, the compiler could have propagated the assumption.

Cannot access memory as SSE type on x86 but works fine on x64

I've got some code written using the MSVC SSE intrinsics.
__m128 zero = _mm_setzero_ps();
__m128 center = _mm_load_ps(&sphere.origin.x);
__m128 boxmin = _mm_load_ps(&rhs.BottomLeftClosest.x);
__m128 boxmax = _mm_load_ps(&rhs.TopRightFurthest.x);
__m128 e = _mm_add_ps(_mm_max_ps(_mm_sub_ps(boxmin, center), zero), _mm_max_ps(_mm_sub_ps(center, boxmax), zero));
e = _mm_mul_ps(e, e);
__declspec(align(16)) float arr[4];
_mm_store_ps(arr, e);
float r = sphere.radius;
return (arr[0] + arr[1] + arr[2] <= r * r);
The Math::Vector type (which is the type of sphere.origin, rhs.BottomLeftClosest, and rhs.TopRightFurthest) is effectively an array of 3 floats. I aligned them to 16 bytes, and this code executes fine on x64. But on x86 I get an access violation reading a null pointer. Any advice on where this comes from?
__m128 center = _mm_load_ps(&sphere.origin.x);
_mm_load_ps() requires that the passed pointer is 16-byte aligned. There's no evidence that you ensured that sphere.origin.x is aligned properly. You'll need to use _mm_loadu_ps() instead if you can't provide that guarantee.
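Two common ways to address it (a sketch; the padded Vector layout below is an assumption, not your actual type):
// Option 1: guarantee 16-byte alignment (and a readable 4th float) on the type itself.
struct alignas(16) Vector { float x, y, z, pad; };
// Option 2: keep the existing layout and use the unaligned load, which has no alignment requirement.
__m128 center = _mm_loadu_ps( &sphere.origin.x );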