Just tell me which one is faster: sub or mul?
My target platform is X86; FPU and SSE.
example:
'LerpColorSolution1' uses multiply.
'LerpColorSolution2' uses subtract.
which is faster ?
void LerpColorSolution1(const float* a, const float* b, float alpha, float* out)
{
out[0] = a[0] + (b[0] - a[0]) * alpha;
out[1] = a[1] + (b[1] - a[1]) * alpha;
out[2] = a[2] + (b[2] - a[2]) * alpha;
out[3] = a[3] + (b[3] - a[3]) * alpha;
}
void LerpColorSolution2(const float* a, const float* b, float alpha, float* out)
{
float f = 1.0f - alpha;
out[0] = a[0]*f + b[0] * alpha;
out[1] = a[1]*f + b[1] * alpha;
out[2] = a[2]*f + b[2] * alpha;
out[3] = a[3]*f + b[3] * alpha;
}
Thanks to all ;)
Just for fun: assuming that you (or your compiler) vectorize both of your approaches (because of course you would if you're chasing performance), and you're targeting a recent x86 processor...
A direct translation of "LerpColorSolution1" into AVX instructions is as follows:
VSUBPS dst, a, b // a[] - b[]
VSHUFPS alpha, alpha, alpha, 0 // splat alpha
VMULPS dst, alpha, dst // alpha*(a[] - b[])
VADDPS dst, a, dst // a[] + alpha*(a[] - b[])
The long latency chain for this sequence is sub-mul-add, which has a total latency of 3+5+3 = 11 cycles on most recent Intel processors. Throughput (assuming that you do nothing but these operations) is limited by port 1 utilization, with a theoretical peak of one LERP every two cycles. (I'm intentionally overlooking load/store traffic and focusing solely on the mathematical operation being performed here).
If we look at your "LerpColorSolution2":
VSHUFPS alpha, alpha, alpha, 0 // splat alpha
VSUBPS dst, one, alpha // 1.0f - alpha, assumes "1.0f" kept in reg.
VMULPS tmp, alpha, b // alpha*b[]
VMULPS dst, dst, a // (1-alpha)*a[]
VADDPS dst, dst, tmp // (1-alpha)*a[] + alpha*b[]
Now the long latency chain is shuffle-sub-mul-add, which has a total latency of 1+3+5+3 = 12 cycles; Throughput is now limited by ports 0 and 1, but still has a peak of one LERP every two cycles. You need to retire one additional µop for each LERP operation, which may make throughput slightly slower depending on the surrounding context.
So your first solution is slightly better; (which isn't surprising -- even without this detail of analysis, the rough guideline "fewer operations is better" is a good rule of thumb).
Haswell tilts things significantly in favor of the first solution; using FMA it requires only one µop on each of ports 0,1, and 5, allowing for a theoretical throughput of one LERP per cycle; while FMA also improves solution 2, it still requires four µops, including three that need to execute on port 0 or 1. This limits solution 2 to a theoretical peak of one LERP every 1.5 cycles -- 50% slower than solution 1.
Related
I will preface this by saying that I am a complete beginner at SIMD intrinsics.
Essentially, I have a CPU which supports the AVX2 instrinsic (Intel(R) Core(TM) i5-7500T CPU # 2.70GHz). I would like to know the fastest way to compute the dot product of two std::vector<float> of size 512.
I have done some digging online and found this and this, and this stack overflow question suggests using the following function __m256 _mm256_dp_ps(__m256 m1, __m256 m2, const int mask);, However, these all suggest different ways of performing the dot product I am not sure what is the correct (and fastest) way to do it.
In particular, I am looking for the fastest way to perform dot product for a vector of size 512 (because I know the vector size effects the implementation).
Thank you for your help
Edit 1:
I am also a little confused about the -mavx2 gcc flag. If I use these AVX2 functions, do I need to add the flag when I compile? Also, is gcc able to do these optimizations for me (say if I use the -OFast gcc flag) if I write a naive dot product implementation?
Edit 2
If anyone has the time and energy, I would very much appreciate if you could write a full implementation. I am sure other beginners would also value this information.
_mm256_dp_ps is only useful for dot-products of 2 to 4 elements; for longer vectors use vertical SIMD in a loop and reduce to scalar at the end. Using _mm256_dp_ps and _mm256_add_ps in a loop would be much slower.
GCC and clang require you to enable (with command line options) ISA extensions that you use intrinsics for, unlike MSVC and ICC.
The code below is probably close to theoretical performance limit of your CPU. Untested.
Compile it with clang or gcc -O3 -march=native. (Requires at least -mavx -mfma, but -mtune options implied by -march are good, too, and so are the other -mpopcnt and other things arch=native enables. Tune options are critical to this compiling efficiently for most CPUs with FMA, specifically -mno-avx256-split-unaligned-load: Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?)
Or compile it with MSVC -O2 -arch:AVX2
#include <immintrin.h>
#include <vector>
#include <assert.h>
// CPUs support RAM access like this: "ymmword ptr [rax+64]"
// Using templates with offset int argument to make easier for compiler to emit good code.
// Multiply 8 floats by another 8 floats.
template<int offsetRegs>
inline __m256 mul8( const float* p1, const float* p2 )
{
constexpr int lanes = offsetRegs * 8;
const __m256 a = _mm256_loadu_ps( p1 + lanes );
const __m256 b = _mm256_loadu_ps( p2 + lanes );
return _mm256_mul_ps( a, b );
}
// Returns acc + ( p1 * p2 ), for 8-wide float lanes.
template<int offsetRegs>
inline __m256 fma8( __m256 acc, const float* p1, const float* p2 )
{
constexpr int lanes = offsetRegs * 8;
const __m256 a = _mm256_loadu_ps( p1 + lanes );
const __m256 b = _mm256_loadu_ps( p2 + lanes );
return _mm256_fmadd_ps( a, b, acc );
}
// Compute dot product of float vectors, using 8-wide FMA instructions.
float dotProductFma( const std::vector<float>& a, const std::vector<float>& b )
{
assert( a.size() == b.size() );
assert( 0 == ( a.size() % 32 ) );
if( a.empty() )
return 0.0f;
const float* p1 = a.data();
const float* const p1End = p1 + a.size();
const float* p2 = b.data();
// Process initial 32 values. Nothing to add yet, just multiplying.
__m256 dot0 = mul8<0>( p1, p2 );
__m256 dot1 = mul8<1>( p1, p2 );
__m256 dot2 = mul8<2>( p1, p2 );
__m256 dot3 = mul8<3>( p1, p2 );
p1 += 8 * 4;
p2 += 8 * 4;
// Process the rest of the data.
// The code uses FMA instructions to multiply + accumulate, consuming 32 values per loop iteration.
// Unrolling manually for 2 reasons:
// 1. To reduce data dependencies. With a single register, every loop iteration would depend on the previous result.
// 2. Unrolled code checks for exit condition 4x less often, therefore more CPU cycles spent computing useful stuff.
while( p1 < p1End )
{
dot0 = fma8<0>( dot0, p1, p2 );
dot1 = fma8<1>( dot1, p1, p2 );
dot2 = fma8<2>( dot2, p1, p2 );
dot3 = fma8<3>( dot3, p1, p2 );
p1 += 8 * 4;
p2 += 8 * 4;
}
// Add 32 values into 8
const __m256 dot01 = _mm256_add_ps( dot0, dot1 );
const __m256 dot23 = _mm256_add_ps( dot2, dot3 );
const __m256 dot0123 = _mm256_add_ps( dot01, dot23 );
// Add 8 values into 4
const __m128 r4 = _mm_add_ps( _mm256_castps256_ps128( dot0123 ), _mm256_extractf128_ps( dot0123, 1 ) );
// Add 4 values into 2
const __m128 r2 = _mm_add_ps( r4, _mm_movehl_ps( r4, r4 ) );
// Add 2 lower values into the final result
const __m128 r1 = _mm_add_ss( r2, _mm_movehdup_ps( r2 ) );
// Return the lowest lane of the result vector.
// The intrinsic below compiles into noop, modern compilers return floats in the lowest lane of xmm0 register.
return _mm_cvtss_f32( r1 );
}
Possible further improvements:
Unroll by 8 vectors instead of 4. I’ve checked gcc 9.2 asm output, compiler only used 8 vector registers out of the 16 available.
Make sure both input vectors are aligned, e.g. use a custom allocator which calls _aligned_malloc / _aligned_free on msvc, or aligned_alloc / free on gcc & clang. Then replace _mm256_loadu_ps with _mm256_load_ps.
To auto-vectorize a simple scalar dot product, you'd also need OpenMP SIMD or -ffast-math (implied by -Ofast) to let the compiler treat FP math as associative even though it's not (because of rounding). But GCC won't use multiple accumulators when auto-vectorizing, even if it does unroll, so you'd bottleneck on FMA latency, not load throughput.
(2 loads per FMA means the throughput bottleneck for this code is vector loads, not actual FMA operations.)
Here's the code I'm trying to convert: the double version of VDT's Pade Exp fast_ex() approx (here's the old repo resource):
inline double fast_exp(double initial_x){
double x = initial_x;
double px=details::fpfloor(details::LOG2E * x +0.5);
const int32_t n = int32_t(px);
x -= px * 6.93145751953125E-1;
x -= px * 1.42860682030941723212E-6;
const double xx = x * x;
// px = x * P(x**2).
px = details::PX1exp;
px *= xx;
px += details::PX2exp;
px *= xx;
px += details::PX3exp;
px *= x;
// Evaluate Q(x**2).
double qx = details::QX1exp;
qx *= xx;
qx += details::QX2exp;
qx *= xx;
qx += details::QX3exp;
qx *= xx;
qx += details::QX4exp;
// e**x = 1 + 2x P(x**2)/( Q(x**2) - P(x**2) )
x = px / (qx - px);
x = 1.0 + 2.0 * x;
// Build 2^n in double.
x *= details::uint642dp(( ((uint64_t)n) +1023)<<52);
if (initial_x > details::EXP_LIMIT)
x = std::numeric_limits<double>::infinity();
if (initial_x < -details::EXP_LIMIT)
x = 0.;
return x;
}
I got this:
__m128d PExpSSE_dbl(__m128d x) {
__m128d initial_x = x;
__m128d half = _mm_set1_pd(0.5);
__m128d one = _mm_set1_pd(1.0);
__m128d log2e = _mm_set1_pd(1.4426950408889634073599);
__m128d p1 = _mm_set1_pd(1.26177193074810590878E-4);
__m128d p2 = _mm_set1_pd(3.02994407707441961300E-2);
__m128d p3 = _mm_set1_pd(9.99999999999999999910E-1);
__m128d q1 = _mm_set1_pd(3.00198505138664455042E-6);
__m128d q2 = _mm_set1_pd(2.52448340349684104192E-3);
__m128d q3 = _mm_set1_pd(2.27265548208155028766E-1);
__m128d q4 = _mm_set1_pd(2.00000000000000000009E0);
__m128d px = _mm_add_pd(_mm_mul_pd(log2e, x), half);
__m128d t = _mm_cvtepi64_pd(_mm_cvttpd_epi64(px));
px = _mm_sub_pd(t, _mm_and_pd(_mm_cmplt_pd(px, t), one));
__m128i n = _mm_cvtpd_epi64(px);
x = _mm_sub_pd(x, _mm_mul_pd(px, _mm_set1_pd(6.93145751953125E-1)));
x = _mm_sub_pd(x, _mm_mul_pd(px, _mm_set1_pd(1.42860682030941723212E-6)));
__m128d xx = _mm_mul_pd(x, x);
px = _mm_mul_pd(xx, p1);
px = _mm_add_pd(px, p2);
px = _mm_mul_pd(px, xx);
px = _mm_add_pd(px, p3);
px = _mm_mul_pd(px, x);
__m128d qx = _mm_mul_pd(xx, q1);
qx = _mm_add_pd(qx, q2);
qx = _mm_mul_pd(xx, qx);
qx = _mm_add_pd(qx, q3);
qx = _mm_mul_pd(xx, qx);
qx = _mm_add_pd(qx, q4);
x = _mm_div_pd(px, _mm_sub_pd(qx, px));
x = _mm_add_pd(one, _mm_mul_pd(_mm_set1_pd(2.0), x));
n = _mm_add_epi64(n, _mm_set1_epi64x(1023));
n = _mm_slli_epi64(n, 52);
// return?
}
But I'm not able to finish the last lines - i.e. this code:
if (initial_x > details::EXP_LIMIT)
x = std::numeric_limits<double>::infinity();
if (initial_x < -details::EXP_LIMIT)
x = 0.;
return x;
How would you convert in SSE2?
Than of course I need to check the whole, since I'm not quite sure I've converted it correctly.
EDIT: I found the SSE conversion of float exp - i.e. from this:
/* multiply by power of 2 */
z *= details::uint322sp((n + 0x7f) << 23);
if (initial_x > details::MAXLOGF) z = std::numeric_limits<float>::infinity();
if (initial_x < details::MINLOGF) z = 0.f;
return z;
to this:
n = _mm_add_epi32(n, _mm_set1_epi32(0x7f));
n = _mm_slli_epi32(n, 23);
return _mm_mul_ps(z, _mm_castsi128_ps(n));
Yup, dividing two polynomials can often give you a better tradeoff between speed and precision than one huge polynomial. As long as there's enough work to hide the divpd throughput. (The latest x86 CPUs have pretty decent FP divide throughput. Still bad vs. multiply, but it's only 1 uop so it doesn't stall the pipeline if you use it rarely enough, i.e. mixed with lots of multiplies. Including in the surrounding code that uses exp)
However, _mm_cvtepi64_pd(_mm_cvttpd_epi64(px)); won't work with SSE2. Packed-conversion intrinsics to/from 64-bit integers requires AVX512DQ.
To do packed rounding to the nearest integer, ideally you'd use SSE4.1 _mm_round_pd(x, _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC), (or truncation towards zero, or floor or ceil towards -+Inf).
But we don't actually need that.
The scalar code ends up with int n and double px both representing the same numeric value. It uses the bad/buggy floor(val+0.5) idiom instead of rint(val) or nearbyint(val) to round to nearest, and then converts that already-integer double to an int (with C++'s truncation semantics, but that doesn't matter because the double value's already an exact integer.)
With SIMD intrinsics, it appears to be easiest to just convert to 32-bit integer and back.
__m128i n = _mm_cvtpd_epi32( _mm_mul_pd(log2e, x) ); // round to nearest
__m128d px = _mm_cvtepi32_pd( n );
Rounding to int with the desired mode, then converting back to double, is equivalent to double->double rounding and then grabbing an int version of that like the scalar version does. (Because you don't care what happens for doubles too large to fit in an int.)
cvtsd2si and si2sd instructions are 2 uops each, and shuffle the 32-bit integers to packed in the low 64 bits of a vector. So to set up for 64-bit integer shifts to stuff the bits into a double again, you'll need to shuffle. The top 64 bits of n will be zeros, so we can use that to create 64-bit integer n lined up with the doubles:
n = _mm_shuffle_epi32(n, _MM_SHUFFLE(3,1,2,0)); // 64-bit integers
But with just SSE2, there are workarounds. Converting to 32-bit integer and back is one option: you don't care about inputs too small or too large. But packed-conversion between double and int costs at least 2 uops on Intel CPUs each way, so a total of 4. But only 2 of those uops need the FMA units, and your code probably doesn't bottleneck on port 5 with all those multiplies and adds.
Or add a very large number and subtract it again: large enough that each double is 1 integer apart, so normal FP rounding does what you want. (This works for inputs that won't fit in 32 bits, but not double > 2^52. So either way that would work.) Also see How to efficiently perform double/int64 conversions with SSE/AVX? which uses that trick. I couldn't find an example on SO, though.
Related:
Fastest Implementation of Exponential Function Using AVX and Fastest Implementation of Exponential Function Using SSE have versions with other speed / precision tradeoffs, for _ps (packed single-precision float).
Fast SSE low precision exponential using double precision operations is at the other end of the spectrum, but still for double.
How many clock cycles does cost AVX/SSE exponentiation on modern x86_64 CPU? discusses some existing libraries like SVML, and Agner Fog's VCL (GPL licensed). And glibc's libmvec.
Then of course I need to check the whole, since I'm not quite sure I've converted it correctly.
iterating over all 2^64 double bit-patterns is impractical, unlike for float where there are only 4 billion, but maybe iterating over all doubles that have the low 32 bits of their mantissa all zero would be a good start. i.e. check in a loop with
bitpatterns = _mm_add_epi64(bitpatterns, _mm_set1_epi64x( 1ULL << 32 ));
doubles = _mm_castsi128_pd(bitpatterns);
https://randomascii.wordpress.com/2014/01/27/theres-only-four-billion-floatsso-test-them-all/
For those last few lines, correcting the input for out-of-range inputs:
The float version you quote just leaves out the range-check entirely. This is obviously the fastest way, if your inputs will always be in range or if you don't care about what happens for out-of-range inputs.
Alternate cheaper range-checking (maybe only for debugging) would be to turn out-of-range values into NaN by ORing the packed-compare result into the result. (An all-ones bit-pattern represents a NaN.)
__m128d out_of_bounds = _mm_cmplt_pd( limit, abs(initial_x) ); // abs = mask off the sign bit
result = _mm_or_pd(result, out_of_bounds);
In general, you can vectorize simple condition setting of a value using branchless compare + blend. Instead of if(x) y=0;, you have the SIMD equivalent of y = (condition) ? 0 : y;, on a per-element basis. SIMD compares produce a mask of all-zero / all-one elements so you can use it to blend.
e.g. in this case cmppd the input and blendvpd the output if you have SSE4.1. Or with just SSE2, and/andnot/or to blend. See SSE intrinsics for comparison (_mm_cmpeq_ps) and assignment operation for a _ps version of both, _pd is identical.
In asm it will look like this:
; result in xmm0 (in need of fixups for out of range inputs)
; initial_x in xmm2
; constants:
; xmm5 = limit
; xmm6 = +Inf
cmpltpd xmm2, xmm5 ; xmm2 = input_x < limit ? 0xffff... : 0
andpd xmm0, xmm2 ; result = result or 0
andnpd xmm2, xmm6 ; xmm2 = 0 or +Inf (In that order because we used ANDN)
orpd xmm0, xmm2 ; result |= 0 or +Inf
; xmm0 = (input < limit) ? result : +Inf
(In an earlier version of the answer, I thought I was maybe saving a movaps to copy a register, but this is just a bog-standard blend. It destroys initial_x, so the compiler needs to copy that register at some point while calculating result, though.)
Optimizations for this special condition
Or in this case, 0.0 is represented by an all-zero bit-pattern, so do a compare that will produce true if in-range, and AND the output with that. (To leave it unchanged or force it to +0.0). This is better than _mm_blendv_pd, which costs 2 uops on most Intel CPUs (and the AVX 128-bit version always costs 2 uops on Intel). And it's not worse on AMD or Skylake.
+-Inf is represented by a bit-pattern of significand=0, exponent=all-ones. (Any other value in the significand represents +-NaN.) Since too-large inputs will presumably still leave non-zero significands, we can't just AND the compare result and OR that into the final result. I think we need to do a regular blend, or something as expensive (3 uops and a vector constant).
It adds 2 cycles of latency to the final result; both the ANDNPD and ORPD are on the critical path. The CMPPD and ANDPD aren't; they can run in parallel with whatever you do to compute the result.
Hopefully your compiler will actually use ANDPS and so on, not PD, for everything except the CMP, because it's 1 byte shorter but identical because they're both just bitwise ops. I wrote ANDPD just so I didn't have to explain this in comments.
You might be able to shorten the critical path latency by combining both fixups before applying to the result, so you only have one blend. But then I think you also need to combine the compare results.
Or since your upper and lower bounds are the same magnitude, maybe you can compare the absolute value? (mask off the sign bit of initial_x and do _mm_cmplt_pd(abs_initial_x, _mm_set1_pd(details::EXP_LIMIT))). But then you have to sort out whether to zero or set to +Inf.
If you had SSE4.1 for _mm_blendv_pd, you could use initial_x itself as the blend control for the fixup that might need applying, because blendv only cares about the sign bit of the blend control (unlike with the AND/ANDN/OR version where all bits need to match.)
__m128d fixup = _mm_blendv_pd( _mm_setzero_pd(), _mm_set1_pd(INFINITY), initial_x ); // fixup = (initial_x signbit) ? 0 : +Inf
// see below for generating fixup with an SSE2 integer arithmetic-shift
const signbit_mask = _mm_castsi128_pd(_mm_set1_epi64x(0x7fffffffffffffff)); // ~ set1(-0.0)
__m128d abs_init_x = _mm_and_pd( initial_x, signbit_mask );
__m128d out_of_range = _mm_cmpgt_pd(abs_init_x, details::EXP_LIMIT);
// Conditionally apply the fixup to result
result = _mm_blendv_pd(result, fixup, out_of_range);
Possibly use cmplt instead of cmpgt and rearrange if you care what happens for initial_x being a NaN. Choosing the compare so false applies the fixup instead of true will mean that an unordered comparison results in either 0 or +Inf for an input of -NaN or +NaN. This still doesn't do NaN propagation. You could _mm_cmpunord_pd(initial_x, initial_x) and OR that into fixup, if you want to make that happen.
Especially on Skylake and AMD Bulldozer/Ryzen where SSE2 blendvpd is only 1 uop, this should be pretty nice. (The VEX encoding, vblendvpd is 2 uops, having 3 inputs and a separate output.)
You might still be able to use some of this idea with only SSE2, maybe creating fixup by doing a compare against zero and then _mm_and_pd or _mm_andnot_pd with the compare result and +Infinity.
Using an integer arithmetic shift to broadcast the sign bit to every position in the double isn't efficient: psraq doesn't exist, only psraw/d. Only logical shifts come in 64-bit element size.
But you could create fixup with just one integer shift and mask, and a bitwise invert
__m128i ix = _mm_castsi128_pd(initial_x);
__m128i ifixup = _mm_srai_epi32(ix, 11); // all 11 bits of exponent field = sign bit
ifixup = _mm_and_si128(ifixup, _mm_set1_epi64x(0x7FF0000000000000ULL) ); // clear other bits
// ix = the bit pattern for 0 (non-negative x) or +Inf (negative x)
__m128d fixup = _mm_xor_si128(ifixup, _mm_set1_epi32(-1)); // bitwise invert
Then blend fixup into result for out-of-range inputs as normal.
Cheaply checking abs(initial_x) > details::EXP_LIMIT
If the exp algorithm was already squaring initial_x, you could compare against EXP_LIMIT squared. But it's not, xx = x*x only happens after some calculation to create x.
If you have AVX512F/VL, VFIXUPIMMPD might be handy here. It's designed for functions where the special case outputs are from "special" inputs like NaN and +-Inf, negative, positive, or zero, saving a compare for those cases. (e.g. for after a Newton-Raphson reciprocal(x) for x=0.)
But both of your special cases need compares. Or do they?
If you square your input and subtract, it only costs one FMA to do initial_x * initial_x - details::EXP_LIMIT * details::EXP_LIMIT to create a result that's negative for abs(initial_x) < details::EXP_LIMIT, and non-negative otherwise.
Agner Fog reports that vfixupimmpd is only 1 uop on Skylake-X.
I need to build a single-precision floating-point inner product routine for mixed single/double-precision floating-point vectors, exploiting the AVX instruction set for SIMD registers with 256 bits.
Problem: one input vector is float (x), while the other is double (yD).
Hence, before to compute the true inner product operations, I need to convert my input yD vector data from double to float.
Using the SSE2 instruction set, I was able to implement a very fast code doing what I needed, and with speed performances very close to the case when both vectors x and y were float:
void vector_operation(const size_t i)
{
__m128 X = _mm_load_ps(x + i);
__m128 Y = _mm_movelh_ps(_mm_cvtpd_ps(_mm_load_pd(yD + i + 0)), _mm_cvtpd_ps(_mm_load_pd(yD + i + 2)));
//inner-products accumulation
res = _mm_add_ps(res, _mm_mul_ps(X, Y));
}
Now, with the hope to further speed-up, I implemented a correpsonding version with AVX instruction set:
inline void vector_operation(const size_t i)
{
__m256 X = _mm256_load_ps(x + i);
__m128 yD1 = _mm_cvtpd_ps(_mm_load_pd(yD + i + 0));
__m128 yD2 = _mm_cvtpd_ps(_mm_load_pd(yD + i + 2));
__m128 yD3 = _mm_cvtpd_ps(_mm_load_pd(yD + i + 4));
__m128 yD4 = _mm_cvtpd_ps(_mm_load_pd(yD + i + 6));
__m128 Ylow = _mm_movelh_ps(yD1, yD2);
__m128 Yhigh = _mm_movelh_ps(yD3, yD4);
//Pack __m128 data inside __m256
__m256 Y = _mm256_permute2f128_ps(_mm256_castps128_ps256(Ylow), _mm256_castps128_ps256(Yhigh), 0x20);
//inner-products accumulation
res = _mm256_add_ps(res, _mm256_mul_ps(X, Y));
}
I also tested other AVX implementations using, for example, casting and insertion operations instead of perfmuting data. Performances were comparably poor compared to the case where both x and y vectors were float.
The problem with the AVX code is that no matter how I implemented it, its performance is by far inferior to the ones achieved by using only float x and y vectors (i.e. no double-float conversion is needed).
The conversion from double to float for the yD vector seems pretty fast, while a lot of time is lost in the line where data is inserted in the _m256 Y register.
Do you know if this is a well-known issue with AVX?
Do you have a solution that could preserve good performances?
Thanks in advance!
I rewrote your function and took better advantage of what AVX has to offer. I also used fused multiply-add at the end; if you can't use FMA, just replace that line with addition and multiplication. I only now see that I wrote an implementation that uses unaligned loads and yours uses aligned loads, but I'm not gonna lose any sleep over it. :)
__m256 foo(float*x, double* yD, const size_t i, __m256 res_prev)
{
__m256 X = _mm256_loadu_ps(x + i);
__m128 yD21 = _mm256_cvtpd_ps(_mm256_loadu_pd(yD + i + 0));
__m128 yD43 = _mm256_cvtpd_ps(_mm256_loadu_pd(yD + i + 4));
__m256 Y = _mm256_set_m128(yD43, yD21);
return _mm256_fmadd_ps(X, Y, res_prev);
}
I did a quick benhmark and compared running times of your and my implementation. I tried two different benchmark approaches with several repetitions and every time my code was around 15% faster. I used MSVC 14.1 compiler and compiled the program with /O2 and /arch:AVX2 flags.
EDIT: this is the disassembly of the function:
vcvtpd2ps xmm3,ymmword ptr [rdx+r8*8+20h]
vcvtpd2ps xmm2,ymmword ptr [rdx+r8*8]
vmovups ymm0,ymmword ptr [rcx+r8*4]
vinsertf128 ymm3,ymm2,xmm3,1
vfmadd213ps ymm0,ymm3,ymmword ptr [r9]
EDIT 2: this is the disassembly of your AVX implementation of the same algorithm:
vcvtpd2ps xmm0,xmmword ptr [rdx+r8*8+30h]
vcvtpd2ps xmm1,xmmword ptr [rdx+r8*8+20h]
vmovlhps xmm3,xmm1,xmm0
vcvtpd2ps xmm0,xmmword ptr [rdx+r8*8+10h]
vcvtpd2ps xmm1,xmmword ptr [rdx+r8*8]
vmovlhps xmm2,xmm1,xmm0
vperm2f128 ymm3,ymm2,ymm3,20h
vmulps ymm0,ymm3,ymmword ptr [rcx+r8*4]
vaddps ymm0,ymm0,ymmword ptr [r9]
I have a 256 bit AVX register containing 4 single precision complex numbers stored as real, imaginary, real, imaginary, etc. I'm currently writing the entire 256 bit register back to memory and summing it there, but that seems inefficient.
How can the complex number horizontal sum be performed using AVX (or AVX2) intrinsics? I would accept an answer using assembly if there is not an answer with comparable efficiency using intrinsics.
Edit: To clarify, if the register contains AR, AI, BR, BI, CR, CI, DR, DI, I want to compute the complex number (AR + BR + CR + DR, AI + BI + CI + DI). If the result is in a 256 bit register, I can extract the 2 single precision floating point numbers.
Edit2: Potential solution, though not necessarily optimal...
float hsum_ps_sse3(__m128 v) {
__m128 shuf = _mm_movehdup_ps(v); // broadcast elements 3,1 to 2,0
__m128 sums = _mm_add_ps(v, shuf);
shuf = _mm_movehl_ps(shuf, sums); // high half -> low half
sums = _mm_add_ss(sums, shuf);
return _mm_cvtss_f32(sums);
}
float sumReal = 0.0;
float sumImaginary = 0.0;
__m256i mask = _mm256_set_epi32 (7, 5, 3, 1, 6, 4, 2, 0);
// Separate real and imaginary.
__m256 permutedSum = _mm256_permutevar8x32_ps(sseSum0, mask);
__m128 realSum = _mm256_extractf128_ps(permutedSum , 0);
__m128 imaginarySum = _mm256_extractf128_ps(permutedSum , 1);
// Horizontally sum real and imaginary.
sumReal = hsum_ps_sse3(realSum);
sumImaginary = hsum_ps_sse3(imaginarySum);
One fairly straightforward solution which requires only AVX (not AVX2):
__m128i v0 = _mm256_castps256_ps128(v); // get low 2 complex values
__m128i v1 = _mm256_extractf128_ps(v, 1); // get high 2 complex values
v0 = _mm_add_ps(v0, v1); // add high and low
v1 = _mm_shuffle_ps(v0, v0, _MM_SHUFFLE(1, 0, 3, 2));
v0 = _mm_add_ps(v0, v1); // combine two halves of result
The result will be in v0 as { sum.re, sum.im, sum.re, sum.im }.
I'm trying to downsample an Image using Neon. So I tried to exercise neon by writing a function that subtracts two images using neon and I have succeeded.
Now I came back to write the bilinear interpolation using neon intrinsics.
Right now I have two problems, getting 4 pixels from one row and one column and also compute the interpolated value (gray) from 4 pixels or if it is possible from 8 pixels from one row and one column. I tried to think about it, but I think the algorithm should be rewritten at all ?
void resizeBilinearNeon( uint8_t *src, uint8_t *dest, float srcWidth, float srcHeight, float destWidth, float destHeight)
{
int A, B, C, D, x, y, index;
float x_ratio = ((float)(srcWidth-1))/destWidth ;
float y_ratio = ((float)(srcHeight-1))/destHeight ;
float x_diff, y_diff;
for (int i=0;i<destHeight;i++) {
for (int j=0;j<destWidth;j++) {
x = (int)(x_ratio * j) ;
y = (int)(y_ratio * i) ;
x_diff = (x_ratio * j) - x ;
y_diff = (y_ratio * i) - y ;
index = y*srcWidth+x ;
uint8x8_t pixels_r = vld1_u8 (src[index]);
uint8x8_t pixels_c = vld1_u8 (src[index+srcWidth]);
// Y = A(1-w)(1-h) + B(w)(1-h) + C(h)(1-w) + Dwh
gray = (int)(
pixels_r[0]*(1-x_diff)*(1-y_diff) + pixels_r[1]*(x_diff)*(1-y_diff) +
pixels_c[0]*(y_diff)*(1-x_diff) + pixels_c[1]*(x_diff*y_diff)
) ;
dest[i*w2 + j] = gray ;
}
}
Neon will definitely help with downsampling in an arbitrary ratio using bilinear filtering. The key being clever use of vtbl.8 instruction, that is able to perform a parallel look-up-table for 8 consecutive destination pixels from pre-loaded array:
d0 = a [b] c [d] e [f] g h, d1 = i j k l m n o p
d2 = q r s t u v [w] x, d3 = [y] z [A] B [C][D] E F ...
d4 = G H I J K L M N, d5 = O P Q R S T U V ...
One can easily calculate the fractional positions for the pixels in brackets:
[b] [d] [f] [w] [y] [A] [C] [D], accessed with vtbl.8 d6, {d0,d1,d2,d3}
The row below would be accessed with vtbl.8 d7, {d2,d3,d4,d5}
Incrementing vadd.8 d6, d30 ; with d30 = [1 1 1 1 1 ... 1] gives lookup indices for the pixels right of the origin etc.
There's no reason for getting the pixels from two rows other than illustrating it's possible and that the method can be used to implement also slight distortions if needed.
In real time applications using e.g. of lanzcos can be a bit overkill, but still feasible using NEON. Downsampling of larger factors require of course (heavy) filtering, but can be easily achieved with iteratively averaging and decimating by 2:1 and only at the end using fractional sampling.
For any 8 consecutive pixels to write, one can calculate the vector
x_positions = (X + [0 1 2 3 4 5 6 7]) * source_width / target_width;
y_positions = (Y + [0 0 0 0 0 0 0 0]) * source_height / target_height;
ptr = to_int(x_positions) + y_positions * stride;
x_position += (ptr & 7); // this pointer arithmetic goes only for 8-bit planar
ptr &= ~7; // this is to adjust read pointer to qword alignment
vld1.8 {d0,d1}, [r0]
vld1.8 {d2,d3], [r0], r2 // wasn't this possible? (use r2==stride)
d4 = int_part_of (x_positions);
d5 = d4 + 1;
d6 = fract_part_of (x_positions);
d7 = fract_part_of (y_positions);
vtbl.8 d8,d4,{d0,d1} // read top row
vtbl.8 d9,d5,{d0,d1} // read top row +1
MIX(d8,d9,d6) // horizontal mix of ptr[] & ptr[1]
vtbl.8 d10,d4,{d2,d3} // read bottom row
vtbl.8 d11,d5,{d2,d3} // read bottom row
MIX(d10,d11,d6) // horizontal mix of ptr[1024] & ptr[1025]
MIX(d8,d10,d7)
// MIX (dst, src, fract) is a macro that somehow does linear blending
// should be doable with ~3-4 instructions
To calculate the integer parts, it's enough to use 8.8 bit resolution (one really doesn't have to calculate 666+[0 1 2 3 .. 7]) and keep all intermediate results in simd register.
Disclaimer -- this is conceptual pseudo c / vector code. In SIMD there are two parallel tasks to be optimized: what's the minimum amount of arithmetic operations needed and how to minimize unnecessary shuffling / copying of data. In this respect too NEON with three register approach is much better suited to serious DSP than SSE. The second respect is the amount of multiplication instruction and the third advantage the interleaving instructions.
#MarkRansom is not correct about nearest neighbor versus 2x2 bilinear interpolation; bilinear using 4 pixels will produce better output than nearest neighbor. He is correct that averaging the appropriate number of pixels (more than 4 if scaling by > 2:1) will produce better output still. However, NEON will not help with image downsampling unless the scaling is done by an integer ratio.
The maximum benefit of NEON and other SIMD instruction sets is to be able to process 8 or 16 pixels at once using the same operations. By accessing individual elements the way you are, you lose all the SIMD benefit. Another problem is that moving data from NEON to ARM registers is a slow operation. Downsampling images is best done by a GPU or optimized ARM instructions.