I would like to speed up a part of my code, but I'm not sure there is a better way to do the following calculation:
float invSum = 1.0f / float(sum);
for (int i = 0; i < numBins; ++i)
{
histVec[i] *= invSum;
}
for (int i = 0; i < numBins; ++i)
{
float midPoint = (float)i*binSize + binOffset;
float f = histVec[i];
fmean += f * midPoint;
}
for (int i = 0; i < numBins; ++i)
{
float midPoint = (float)i*binSize + binOffset;
float f = histVec[i];
float diff = midPoint - fmean;
var += f * hwk::sqr(diff);
}
numBins in the for-loops is typically 10, but this bit of code is called very often (80 frames per second, at least 8 calls per frame).
I tried to use some SSE methods, but they only speed this code up slightly. I think I could avoid calculating midPoint twice, but I am not sure how. Is there a better way to compute fmean and var?
Here is the SSE code:
// make hist contain a multiple of 4 valid values
for (int i = numBins; i < ((numBins + 3) & ~3); i++)
hist[i] = 0;
// find sum of bins in inHist
__m128i iSum4 = _mm_set1_epi32(0);
for (int i = 0; i < numBins; i += 4)
{
__m128i a = *((__m128i *) &inHist[i]);
iSum4 = _mm_add_epi32(iSum4, a);
}
int iSum = iSum4.m128i_i32[0] + iSum4.m128i_i32[1] + iSum4.m128i_i32[2] + iSum4.m128i_i32[3];
//float stdevB, meanB;
if (iSum == 0.0f)
{
stdev = 0.0;
mean = 0.0;
}
else
{
// Set histVec to normalised values in inHist
__m128 invSum = _mm_set1_ps(1.0f / float(iSum));
for (int i = 0; i < numBins; i += 4)
{
__m128i a = *((__m128i *) &inHist[i]);
__m128 b = _mm_cvtepi32_ps(a);
__m128 c = _mm_mul_ps(b, invSum);
_mm_store_ps(&histVec[i], c);
}
float binSize = 256.0f / (float)numBins;
float halfBinSize = binSize * 0.5f;
float binOffset = halfBinSize;
__m128 binSizeMask = _mm_set1_ps(binSize);
__m128 binOffsetMask = _mm_set1_ps(binOffset);
__m128 fmean4 = _mm_set1_ps(0.0f);
for (int i = 0; i < numBins; i += 4)
{
__m128i idx4 = _mm_set_epi32(i + 3, i + 2, i + 1, i);
__m128 idx_m128 = _mm_cvtepi32_ps(idx4);
__m128 histVec4 = _mm_load_ps(&histVec[i]);
__m128 midPoint4 = _mm_add_ps(_mm_mul_ps(idx_m128, binSizeMask), binOffsetMask);
fmean4 = _mm_add_ps(fmean4, _mm_mul_ps(histVec4, midPoint4));
}
fmean4 = _mm_hadd_ps(fmean4, fmean4); // 01 23 01 23
fmean4 = _mm_hadd_ps(fmean4, fmean4); // 0123 0123 0123 0123
float fmean = fmean4.m128_f32[0];
//fmean4 = _mm_set1_ps(fmean);
__m128 var4 = _mm_set1_ps(0.0f);
for (int i = 0; i < numBins; i+=4)
{
__m128i idx4 = _mm_set_epi32(i + 3, i + 2, i + 1, i);
__m128 idx_m128 = _mm_cvtepi32_ps(idx4);
__m128 histVec4 = _mm_load_ps(&histVec[i]);
__m128 midPoint4 = _mm_add_ps(_mm_mul_ps(idx_m128, binSizeMask), binOffsetMask);
__m128 diff4 = _mm_sub_ps(midPoint4, fmean4);
var4 = _mm_add_ps(var4, _mm_mul_ps(histVec4, _mm_mul_ps(diff4, diff4)));
}
var4 = _mm_hadd_ps(var4, var4); // 01 23 01 23
var4 = _mm_hadd_ps(var4, var4); // 0123 0123 0123 0123
float var = var4.m128_f32[0];
stdev = sqrt(var);
mean = fmean;
}
I might be doing something wrong, since I don't get as much improvement as I was expecting.
Is there something in the SSE code that might be slowing things down?
(editor's note: the SSE part of this question was originally asked as https://stackoverflow.com/questions/31837817/foor-loop-optimisation-sse-comparison, which was closed as a duplicate.)
I only just realized that your data array starts out as an array of int, since you didn't have declarations in your code. I can see in the SSE version that you start with integers, and only store a float version of it later.
Keeping everything integer will let us do the loop-counter-vector with a simple ivec = _mm_add_epi32(ivec, _mm_set1_epi32(4)). Aki Suihkonen's answer has some transformations that should let it optimize a lot better. In particular, the auto-vectorizer should be able to do more even without -ffast-math. In fact, it does quite well. You could do better with intrinsics, especially by saving some 32-bit vector multiplies and shortening the dependency chain.
My old answer, based on just trying to optimize your code as written, assuming FP input:
You may be able to combine all 3 loops into one, using the algorithm @Jason linked to. It might not be profitable, though, since it involves a division. For small numbers of bins, probably just loop multiple times.
Start by reading the guides at http://agner.org/optimize/. A couple of the techniques in his Optimising Assembly guide will speed up your SSE attempt (which I edited into this question for you).
combine your loops where possible, so you do more with the data for each time it's loaded / stored.
multiple accumulators to hide the latency of loop-carried dependency chains. (Even FP add takes 3 cycles on recent Intel CPUs.) This won't apply for really short arrays like your case.
instead of int->float conversion on every iteration, use a float loop counter as well as the int loop counter. (add a vector of _mm_set1_ps(4.0f) every iteration.) _mm_set... with variable args is something to avoid in loops, when possible. It takes several instructions (esp. when each arg to setr has to be calculated separately.)
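The float-counter tip can be illustrated in scalar form first. This is only a sketch: weightedMean is a hypothetical helper name, and the histogram is assumed to be normalised already. The loop body carries a float copy of the counter, so it needs no int->float conversion:

```cpp
// Hypothetical helper: mean of the bin midpoints, weighted by a normalised histogram.
// A float copy of the loop counter is carried alongside the int counter.
float weightedMean(const float* hist, int numBins, float binSize, float binOffset)
{
    float mean = 0.0f;
    float fi = 0.0f;                              // float copy of i
    for (int i = 0; i < numBins; ++i, fi += 1.0f)
    {
        mean += hist[i] * (fi * binSize + binOffset);
    }
    return mean;
}
```

With intrinsics, the same idea becomes a vector that starts at {0, 1, 2, 3} and has _mm_set1_ps(4.0f) added every iteration.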
gcc -O3 manages to auto-vectorize the first loop, but not the others. With -O3 -ffast-math, it auto-vectorizes more. -ffast-math allows it to do FP operations in a different order than the code specifies. e.g. adding up the array in 4 elements of a vector, and only combining the 4 accumulators at the end.
Telling gcc that the input pointer is aligned by 16 lets gcc auto-vectorize with a lot less overhead (no scalar loops for unaligned portions).
// return mean
float fpstats(float histVec[], float sum, float binSize, float binOffset, long numBins, float *variance_p)
{
numBins += 3;
numBins &= ~3; // round up to multiple of 4. This is just a quick hack to make the code fast and simple.
histVec = (float*)__builtin_assume_aligned(histVec, 16);
float invSum = 1.0f / float(sum);
float var = 0, fmean = 0;
for (int i = 0; i < numBins; ++i)
{
histVec[i] *= invSum;
float midPoint = (float)i*binSize + binOffset;
float f = histVec[i];
fmean += f * midPoint;
}
for (int i = 0; i < numBins; ++i)
{
float midPoint = (float)i*binSize + binOffset;
float f = histVec[i];
float diff = midPoint - fmean;
// var += f * hwk::sqr(diff);
var += f * (diff * diff);
}
*variance_p = var;
return fmean;
}
gcc generates some weird code for the 2nd loop.
# broadcasting fmean after the 1st loop
subss %xmm0, %xmm2 # fmean, D.2466
shufps $0, %xmm2, %xmm2 # vect_cst_.16
.L5: ## top of 2nd loop
movdqa %xmm3, %xmm5 # vect_vec_iv_.8, vect_vec_iv_.8
cvtdq2ps %xmm3, %xmm3 # vect_vec_iv_.8, vect__32.9
movq %rcx, %rsi # D.2465, D.2467
addq $1, %rcx #, D.2465
mulps %xmm1, %xmm3 # vect_cst_.11, vect__33.10
salq $4, %rsi #, D.2467
paddd %xmm7, %xmm5 # vect_cst_.7, vect_vec_iv_.8
addps %xmm2, %xmm3 # vect_cst_.16, vect_diff_39.15
mulps %xmm3, %xmm3 # vect_diff_39.15, vect_powmult_53.17
mulps (%rdi,%rsi), %xmm3 # MEM[base: histVec_10, index: _107, offset: 0B], vect__41.18
addps %xmm3, %xmm4 # vect__41.18, vect_var_42.19
cmpq %rcx, %rax # D.2465, bnd.26
ja .L8 #, ### <--- This is insane.
haddps %xmm4, %xmm4 # vect_var_42.19, tmp160
haddps %xmm4, %xmm4 # tmp160, vect_var_42.21
.L2:
movss %xmm4, (%rdx) # var, *variance_p_44(D)
ret
.p2align 4,,10
.p2align 3
.L8:
movdqa %xmm5, %xmm3 # vect_vec_iv_.8, vect_vec_iv_.8
jmp .L5 #
So instead of just jumping back to the top every iteration, gcc decides to jump ahead to copy a register, and then unconditionally jmp back to the top of the loop. The uop loop buffer may remove the front-end overhead of this silliness, but gcc should have structured the loop so it didn't copy xmm5->xmm3 and then xmm3->xmm5 every iteration. It should have the conditional jump just go to the top of the loop.
Also note the technique gcc used to get a float version of the loop counter: start with an integer vector of 1 2 3 4, and add set1_epi32(4). Use that as an input for packed int->float cvtdq2ps. On Intel HW, that instruction runs on the FP-add port, and has 3 cycle latency, same as packed FP add. gcc probably would have done better to just add a vector of set1_ps(4.0), even though this creates a 3-cycle loop-carried dependency chain, instead of a 1-cycle vector int add with a 3-cycle convert forking off on every iteration.
small iteration count
You say this will often be used on exactly 10 bins? A specialized version for just 10 bins could give a big speedup, by avoiding all the loop overhead and keeping everything in registers.
With that small a problem size, you can have the FP weights just sitting there in memory, instead of re-computing them with integer->float conversion every time.
Also, 10 bins is going to mean a lot of horizontal operations relative to the amount of vertical operations, since you only have 2 and a half vectors worth of data.
If exactly 10 is really common, specialize a version for that. If under-16 is common, specialize a version for that. (They can and should share the const float weights[] = { 0.0f, 1.0f, 2.0f, ...}; array.)
You probably will want to use intrinsics for the specialized small-problem versions, rather than auto-vectorization.
Having zero-padding after the end of the useful data in your array might still be a good idea in your specialized version(s). However, you can load the last 2 floats and clear the upper 64b of a vector register with a movq instruction. (__m128i _mm_cvtsi64_si128 (__int64 a)). Cast this to __m128 and you're good to go.
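A scalar sketch of what a 10-bin specialization with a shared weights table could look like. The names kWeights and mean10 are made up for illustration, and the histogram is assumed to be normalised already (entries sum to 1). An intrinsics version would follow the same shape:

```cpp
// Shared table of bin indices as floats, so no int->float conversion is needed.
static const float kWeights[16] = { 0.0f, 1.0f, 2.0f,  3.0f,  4.0f,  5.0f,  6.0f,  7.0f,
                                    8.0f, 9.0f, 10.0f, 11.0f, 12.0f, 13.0f, 14.0f, 15.0f };

// Assumption: hist has exactly 10 valid bins and sums to 1.
float mean10(const float* hist, float binSize, float binOffset)
{
    float acc = 0.0f;
    for (int i = 0; i < 10; ++i)      // fixed trip count, easy for the compiler to unroll
        acc += hist[i] * kWeights[i];
    // sum(h[i] * (i*binSize + binOffset)) == binSize * sum(h[i]*i) + binOffset,
    // because sum(h[i]) == 1.
    return acc * binSize + binOffset;
}
```

Scaling by binSize and adding binOffset once at the end keeps the per-bin work down to a single multiply-add.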
As peterchen mentioned, these operations are very trivial for current desktop processors. The function is linear, i.e. O(n). What's the typical size of numBins? If it's rather large (say, over 1000000), parallelization will help. This could be simple using a library like OpenMP. If numBins starts approaching MAXINT, you may consider GPGPU as an option (CUDA/OpenCL).
All that considered, you should try profiling your application. Chances are good that, if there is a performance constraint, it's not in this method. Michael Abrash's definition of "high-performance code" has helped me greatly in determining if/when to optimize:
Before we can create high-performance code, we must understand what high performance is. The objective (not always attained) in creating high-performance software is to make the software able to carry out its appointed tasks so rapidly that it responds instantaneously, as far as the user is concerned. In other words, high-performance code should ideally run so fast that any further improvement in the code would be pointless. Notice that the above definition most emphatically does not say anything about making the software as fast as possible.
Reference:
The Graphics Programming Black Book
The overall function to be calculated is
std = sqrt(SUM_i { hist[i]/sum * (midpoint_i - mean_midpoint)^2 })
Using the identity
Var (aX + b) = Var (X) * a^2
one can reduce the complexity of the overall operation considerably
1) midpoint of a bin doesn't need offset b
2) no need to prescale the bin array elements by the bin width
and
3) no need to normalize histogram entries with reciprocal of sum
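Spelled out as a short derivation. The symbols are introduced here only for this purpose: S is the sum of all hist[i], M is the sum of i * hist[i], p_i = hist[i] / S, and the midpoint of bin i is a*i + b with bin width a and offset b:

```latex
\begin{aligned}
\mu &= \sum_i p_i\, i = \frac{M}{S}, \qquad
\bar m = \sum_i p_i\,(a i + b) = a\mu + b,\\
\operatorname{Var} &= \sum_i p_i\,(a i + b - \bar m)^2
 = a^2 \sum_i p_i\,(i - \mu)^2
 = \frac{a^2}{S^3} \sum_i \mathrm{hist}[i]\,(iS - M)^2 .
\end{aligned}
```

So the standard deviation is a * sqrt(SUM_i { hist[i] * (i*S - M)^2 } / S^3), and the sum inside the square root can be accumulated entirely in integers.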
The optimized calculation goes as follows
float calcVariance(int histBin[], float binWidth)
{
int i;
// 64-bit accumulators: sum^3 and the weighted squares overflow a 32-bit int quickly
long long sum = 0;
long long mid = 0;
long long var = 0;
for (i = 0; i < 10; i++)
{
sum += histBin[i];
mid += i*histBin[i];
}
for (i = 0; i < 10; i++)
{
long long diff = i * sum - mid; // because mid is prescaled by sum
var += histBin[i] * diff * diff;
}
// var is scaled by sum^2 (from diff^2) and still needs the 1/sum normalisation
return sqrt((double)var / ((double)sum * sum * sum)) * binWidth;
}
Minor changes are required if it's float histBin[];
Also I second padding histBin size to a multiple of 4 for better vectorization.
EDIT
Another way to calculate this with floats in the inner loop:
float inv_sum = 1.0f / (float)sum;
float mid_sum = mid * inv_sum;
float var = 0.0f;
for (i = 0; i < 10; i++)
{
float diff = (float)i - mid_sum;
var += (float)histBin[i] * diff * diff;
}
return sqrt(var * inv_sum) * binWidth;
Perform the scaling on the global results only and keep integers as long as possible.
Group all computation in a single loop, using Σ(X-m)²/N = ΣX²/N - m².
// Accumulate the histogram
int mean= 0, var= 0;
for (int i = 0; i < numBins; ++i)
{
mean+= i * histVec[i];
var+= i * i * histVec[i];
}
// Compute the reduced mean and variance
float fmean= (float(mean) / sum);
float fvar= float(var) / sum - fmean * fmean;
// Rescale
fmean= fmean * binSize + binOffset;
fvar= fvar * binSize * binSize;
The required integer type will depend on the maximum value in the bins. The SSE optimization of the loop can exploit the _mm_madd_epi16 instruction.
If the number of bins is as small as 10, consider fully unrolling the loop. Precompute the i and i² vectors in a table.
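A self-contained scalar sketch of the single-pass idea, for reference. HistStats and histStats are hypothetical names, and the 64-bit accumulators and double arithmetic are assumptions made here to postpone overflow and limit cancellation in ΣX²/N - m². Pick the narrowest types your bin counts allow:

```cpp
#include <cstdint>

struct HistStats { float mean; float var; };   // hypothetical result type

// One pass over integer bins, using sum((x-m)^2)/N == sum(x^2)/N - m^2.
HistStats histStats(const int* hist, int numBins, float binSize, float binOffset)
{
    std::int64_t n = 0, s1 = 0, s2 = 0;
    for (int i = 0; i < numBins; ++i)
    {
        n  += hist[i];
        s1 += std::int64_t(i) * hist[i];
        s2 += std::int64_t(i) * i * hist[i];
    }
    if (n == 0) return { 0.0f, 0.0f };          // empty histogram

    // Reduced mean and variance, in units of bin indices
    const double m = double(s1) / double(n);
    const double v = double(s2) / double(n) - m * m;

    // Rescale to the value domain
    return { float(m * binSize + binOffset), float(v * binSize * binSize) };
}
```

For example, bins {1, 0, 1} with binSize 2 and binOffset 1 have midpoints 1 and 5, so the mean is 3 and the variance is 4.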
In the lucky case that the data fits in 16 bits and the sums in 32 bits, the accumulation is done with something like
static short I[16]= { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 0, 0, 0, 0, 0 };
static short I2[16]= { 0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 0, 0, 0, 0, 0, 0 };
// First group
__m128i i= _mm_load_si128((__m128i*)&I[0]);
__m128i i2= _mm_load_si128((__m128i*)&I2[0]);
__m128i h= _mm_load_si128((__m128i*)&inHist[0]);
__m128i mean= _mm_madd_epi16(i, h);
__m128i var= _mm_madd_epi16(i2, h);
// Second group
i= _mm_load_si128((__m128i*)&I[8]);
i2= _mm_load_si128((__m128i*)&I2[8]);
h= _mm_load_si128((__m128i*)&inHist[8]);
mean= _mm_add_epi32(mean, _mm_madd_epi16(i, h));
var= _mm_add_epi32(var, _mm_madd_epi16(i2, h));
CAUTION: unchecked
This is the code I actually had (for a scalar code) which I've replicated (x4) storing data into simd:
waveTable *waveTables[4];
for (int i = 0; i < 4; i++) {
int waveTableIindex = 0;
while ((phaseIncrement[i] >= mWaveTables[waveTableIindex].mTopFreq) && (waveTableIindex < kNumWaveTableSlots)) {
waveTableIindex++;
}
waveTables[i] = &mWaveTables[waveTableIindex];
}
It's not "faster" at all, of course. How would you do the same with SIMD, to save CPU? Any tips or starting points?
I'm on SSE2.
Here's the context of the computation.
topFreq for each wave table is calculated starting from the maximum number of harmonics (x2, due to Nyquist), and is multiplied by 2 for every subsequent wave table (while the number of harmonics available for each table is halved):
double topFreq = 1.0 / (maxHarmonic * 2);
while (maxHarmonic) {
// fill the table in with the needed harmonics
// ... makeWaveTable() code
// prepare for next table
topFreq *= 2;
maxHarmonic >>= 1;
}
Then, during processing, for each sample, I need to pick the correct wave table to use, based on the oscillator's frequency (i.e. phase increment):
freq = clamp(freq, 20.0f, 22050.0f);
phaseIncrement = freq * vSampleTime;
So, for example (with vSampleTime = 1/44100 and maxHarmonic = 500), 30 Hz is wavetable 0, 50 Hz is wavetable 1, and so on.
Assuming your values are FP32, I would do it like this. Untested.
const __m128 phaseIncrements = _mm_loadu_ps( phaseIncrement );
__m128i indices = _mm_setzero_si128();
__m128i activeIndices = _mm_set1_epi32( -1 );
for( size_t idx = 0; idx < kNumWaveTableSlots; idx++ )
{
// Broadcast the mTopFreq value into FP32 vector. If you build this for AVX1, will become 1 very fast instruction.
const __m128 topFreq = _mm_set1_ps( mWaveTables[ idx ].mTopFreq );
// Compare for phaseIncrements >= topFreq
const __m128 cmp_f32 = _mm_cmpge_ps( phaseIncrements, topFreq );
// The following line compiles into no instruction, it's only to please the type checker
__m128i cmp = _mm_castps_si128( cmp_f32 );
// Bitwise AND with activeIndices
cmp = _mm_and_si128( cmp, activeIndices );
// The following line increments the indices vector by 1, only the lanes where cmp was TRUE
indices = _mm_sub_epi32( indices, cmp );
// Update the set of active lane indices
activeIndices = cmp;
// The vector may become completely zero, meaning all 4 lanes have encountered at least 1 value where topFreq > phaseIncrements
if( 0 == _mm_movemask_epi8( activeIndices ) )
break;
}
// The indices vector keeps 4 32-bit integers
// Each lane contains the index of the first table entry whose mTopFreq is greater than the corresponding lane of phaseIncrements
// Or kNumWaveTableSlots if no such entry was found
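As a cross-check for the vector code, a scalar reference for a single lane might look like the following. It is a sketch that assumes the mTopFreq values are in ascending order (they are doubled from table to table in the question), and selectWaveTable and topFreqs are hypothetical names. Because the thresholds are sorted, counting how many of them are <= phaseIncrement gives the same index as the original while loop, without an early exit:

```cpp
#include <cstddef>

// Scalar reference for one lane. topFreqs stands in for mWaveTables[k].mTopFreq
// and must be sorted in ascending order.
std::size_t selectWaveTable(float phaseIncrement, const float* topFreqs, std::size_t numTables)
{
    std::size_t index = 0;
    for (std::size_t k = 0; k < numTables; ++k)
        index += (phaseIncrement >= topFreqs[k]) ? 1 : 0;   // count thresholds passed
    return index;                                           // == numTables if none is larger
}
```

Comparing this against each lane of the indices vector for a sweep of frequencies is a cheap way to validate the untested SIMD loop.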
There is no standard way to write SIMD instructions in C++. A compiler may produce SIMD instructions when appropriate, as long as you've configured it to target a CPU that supports such instructions and enabled the relevant optimisations. You can use standard algorithms with the std::execution::unsequenced_policy (C++20) to help the compiler understand that SIMD is appropriate.
If you are using GCC/G++ or Clang, there is a non-standard language extension for vector types, using __attribute__ ((vector_size (xx))). See the GCC manual for details:
https://gcc.gnu.org/onlinedocs/gcc-11.2.0/gcc/Vector-Extensions.html#Vector-Extensions
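A minimal sketch of the vector extension, assuming GCC or Clang. The names v4f and multiplyAdd are made up for illustration:

```cpp
// GCC/Clang only: a vector of 4 floats (16 bytes).
typedef float v4f __attribute__((vector_size(16)));

// Element-wise a*x + y. The usual arithmetic operators work on the whole vector,
// and the compiler chooses the instructions for the target CPU.
v4f multiplyAdd(v4f a, v4f x, v4f y)
{
    return a * x + y;
}
```

Individual elements can be read with ordinary subscripting, e.g. r[0], which is handy for getting results back out.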
For some real-time DSP application I need to compute the absolute values of a complex valued vector.
The straightforward implementation would look like this:
void computeAbsolute (std::complex<float>* complexSourceVec,
float* realValuedDestinationVec,
int vecLength)
{
for (int i = 0; i < vecLength; ++i)
realValuedDestinationVec[i] = std::abs (complexSourceVec[i]);
}
I want to replace this implementation with an AVX2-optimized version based on AVX2 intrinsics. What would be the most efficient way to implement it?
Note: The source data is handed to me by an API I have no access to, so there is no chance to change the layout of the complex input vector for better efficiency.
Inspired by the answer of Dan M., I first implemented his version with some tweaks:
First I changed it to use the wider 256-bit registers, then marked the temporary re and im arrays with __attribute__((aligned (32))) to be able to use aligned loads.
void computeAbsolute1 (const std::complex<float>* cplxIn, float* absOut, const int length)
{
for (int i = 0; i < length; i += 8)
{
float re[8] __attribute__((aligned (32))) = {cplxIn[i].real(), cplxIn[i + 1].real(), cplxIn[i + 2].real(), cplxIn[i + 3].real(), cplxIn[i + 4].real(), cplxIn[i + 5].real(), cplxIn[i + 6].real(), cplxIn[i + 7].real()};
float im[8] __attribute__((aligned (32))) = {cplxIn[i].imag(), cplxIn[i + 1].imag(), cplxIn[i + 2].imag(), cplxIn[i + 3].imag(), cplxIn[i + 4].imag(), cplxIn[i + 5].imag(), cplxIn[i + 6].imag(), cplxIn[i + 7].imag()};
__m256 x4 = _mm256_load_ps (re);
__m256 y4 = _mm256_load_ps (im);
__m256 b4 = _mm256_sqrt_ps (_mm256_add_ps (_mm256_mul_ps (x4,x4), _mm256_mul_ps (y4,y4)));
_mm256_storeu_ps (absOut + i, b4);
}
}
However, manually shuffling the values this way seemed like a task that could be sped up somehow. This is the solution I came up with, which runs 2 - 3 times faster in a quick test compiled by clang with full optimization:
#include <complex>
#include <immintrin.h>
void computeAbsolute2 (const std::complex<float>* __restrict cplxIn, float* __restrict absOut, const int length)
{
for (int i = 0; i < length; i += 8)
{
// load 8 complex values (--> 16 floats overall) into two SIMD registers
__m256 inLo = _mm256_loadu_ps (reinterpret_cast<const float*> (cplxIn + i ));
__m256 inHi = _mm256_loadu_ps (reinterpret_cast<const float*> (cplxIn + i + 4));
// separates the real and imaginary parts, however the values are in the wrong order
__m256 re = _mm256_shuffle_ps (inLo, inHi, _MM_SHUFFLE (2, 0, 2, 0));
__m256 im = _mm256_shuffle_ps (inLo, inHi, _MM_SHUFFLE (3, 1, 3, 1));
// do the heavy work on the unordered vectors
__m256 abs = _mm256_sqrt_ps (_mm256_add_ps (_mm256_mul_ps (re, re), _mm256_mul_ps (im, im)));
// reorder values prior to storing
__m256d ordered = _mm256_permute4x64_pd (_mm256_castps_pd(abs), _MM_SHUFFLE(3,1,2,0));
_mm256_storeu_ps (absOut + i, _mm256_castpd_ps(ordered));
}
}
I think I'll go with that implementation if no one comes up with a faster solution
This compiles efficiently with gcc and clang (on the Godbolt compiler explorer).
It's really hard (if possible at all) to write a "highly optimized AVX2" version of complex abs, since the way complex numbers are defined in the standard prevents a lot of optimization (specifically due to all the inf/nan corner cases).
However, if you don't care about correctness in those corner cases, you can just use -ffast-math and some compilers will optimize the code for you. See gcc output: https://godbolt.org/z/QbZlBI
You can also take this output and create your own abs function with inline assembly.
But yes, as was already mentioned, if you really need performance, you probably want to swap std::complex for something else.
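If the corner cases do not matter for your data, one option short of -ffast-math is to write the naive formula yourself. This is a sketch, not a drop-in replacement: computeAbsoluteNaive is a hypothetical name, and the intermediate squares can overflow or underflow in cases where std::abs would still return a finite result.

```cpp
#include <cmath>
#include <complex>

// Magnitude as sqrt(re^2 + im^2), without the overflow/inf/nan handling of std::abs.
// std::complex<float> is layout-compatible with float[2], so the input can be
// read as interleaved re/im pairs.
void computeAbsoluteNaive(const std::complex<float>* src, float* dst, int length)
{
    const float* p = reinterpret_cast<const float*>(src);
    for (int i = 0; i < length; ++i)
    {
        const float re = p[2 * i];
        const float im = p[2 * i + 1];
        dst[i] = std::sqrt(re * re + im * im);
    }
}
```

A loop in this form is also a convenient scalar tail for lengths that are not a multiple of the vector width.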
I was able to get a decent output for your specific case with all the required shuffles by manually filling small re and im arrays. See: https://godbolt.org/z/sWAAXo
This could be trivially extended for ymm registers.
Anyway, here is the ultimate solution adapted from this SO answer which uses intrinsics in combination with clever compiler optimizations:
#include <complex>
#include <cassert>
#include <immintrin.h>
static inline void cabs_soa4(const float *re, const float *im, float *b) {
__m128 x4 = _mm_loadu_ps(re);
__m128 y4 = _mm_loadu_ps(im);
__m128 b4 = _mm_sqrt_ps(_mm_add_ps(_mm_mul_ps(x4,x4), _mm_mul_ps(y4,y4)));
_mm_storeu_ps(b, b4);
}
void computeAbsolute (const std::complex<float>* src,
float* realValuedDestinationVec,
int vecLength)
{
for (int i = 0; i < vecLength; i += 4) {
float re[4] = {src[i].real(), src[i + 1].real(), src[i + 2].real(), src[i + 3].real()};
float im[4] = {src[i].imag(), src[i + 1].imag(), src[i + 2].imag(), src[i + 3].imag()};
cabs_soa4(re, im, realValuedDestinationVec + i); // write each group of 4 results to its own slot
}
}
which compiles to this simple loop:
_Z15computeAbsolutePKSt7complexIfEPfi:
test edx, edx
jle .L5
lea eax, [rdx-1]
shr eax, 2
sal rax, 5
lea rax, [rdi+32+rax]
.L3:
vmovups xmm0, XMMWORD PTR [rdi]
vmovups xmm2, XMMWORD PTR [rdi+16]
add rdi, 32
vshufps xmm1, xmm0, xmm2, 136
vmulps xmm1, xmm1, xmm1
vshufps xmm0, xmm0, xmm2, 221
vfmadd132ps xmm0, xmm1, xmm0
vsqrtps xmm0, xmm0
vmovups XMMWORD PTR [rsi], xmm0
cmp rax, rdi
jne .L3
.L5:
ret
https://godbolt.org/z/Yu64Wg
I need to build a single-precision floating-point inner product routine for mixed single/double-precision floating-point vectors, exploiting the AVX instruction set for SIMD registers with 256 bits.
Problem: one input vector is float (x), while the other is double (yD).
Hence, before computing the actual inner product, I need to convert my input yD vector data from double to float.
Using the SSE2 instruction set, I was able to implement very fast code doing what I needed, with performance very close to the case where both vectors x and y were float:
void vector_operation(const size_t i)
{
__m128 X = _mm_load_ps(x + i);
__m128 Y = _mm_movelh_ps(_mm_cvtpd_ps(_mm_load_pd(yD + i + 0)), _mm_cvtpd_ps(_mm_load_pd(yD + i + 2)));
//inner-products accumulation
res = _mm_add_ps(res, _mm_mul_ps(X, Y));
}
Now, hoping for a further speed-up, I implemented a corresponding version with the AVX instruction set:
inline void vector_operation(const size_t i)
{
__m256 X = _mm256_load_ps(x + i);
__m128 yD1 = _mm_cvtpd_ps(_mm_load_pd(yD + i + 0));
__m128 yD2 = _mm_cvtpd_ps(_mm_load_pd(yD + i + 2));
__m128 yD3 = _mm_cvtpd_ps(_mm_load_pd(yD + i + 4));
__m128 yD4 = _mm_cvtpd_ps(_mm_load_pd(yD + i + 6));
__m128 Ylow = _mm_movelh_ps(yD1, yD2);
__m128 Yhigh = _mm_movelh_ps(yD3, yD4);
//Pack __m128 data inside __m256
__m256 Y = _mm256_permute2f128_ps(_mm256_castps128_ps256(Ylow), _mm256_castps128_ps256(Yhigh), 0x20);
//inner-products accumulation
res = _mm256_add_ps(res, _mm256_mul_ps(X, Y));
}
I also tested other AVX implementations using, for example, casting and insertion operations instead of permuting data. Performance was comparably poor compared to the case where both x and y vectors were float.
The problem with the AVX code is that no matter how I implement it, its performance is far inferior to what is achieved by using only float x and y vectors (i.e. when no double-to-float conversion is needed).
The conversion from double to float for the yD vector seems pretty fast, while a lot of time is lost in the line where data is inserted into the __m256 Y register.
Do you know if this is a well-known issue with AVX?
Do you have a solution that could preserve good performances?
Thanks in advance!
I rewrote your function and took better advantage of what AVX has to offer. I also used fused multiply-add at the end; if you can't use FMA, just replace that line with addition and multiplication. I only now see that I wrote an implementation that uses unaligned loads and yours uses aligned loads, but I'm not gonna lose any sleep over it. :)
__m256 foo(float*x, double* yD, const size_t i, __m256 res_prev)
{
__m256 X = _mm256_loadu_ps(x + i);
__m128 yD21 = _mm256_cvtpd_ps(_mm256_loadu_pd(yD + i + 0));
__m128 yD43 = _mm256_cvtpd_ps(_mm256_loadu_pd(yD + i + 4));
__m256 Y = _mm256_set_m128(yD43, yD21);
return _mm256_fmadd_ps(X, Y, res_prev);
}
I did a quick benchmark and compared the running times of your implementation and mine. I tried two different benchmark approaches with several repetitions, and every time my code was around 15% faster. I used the MSVC 14.1 compiler and compiled the program with the /O2 and /arch:AVX2 flags.
EDIT: this is the disassembly of the function:
vcvtpd2ps xmm3,ymmword ptr [rdx+r8*8+20h]
vcvtpd2ps xmm2,ymmword ptr [rdx+r8*8]
vmovups ymm0,ymmword ptr [rcx+r8*4]
vinsertf128 ymm3,ymm2,xmm3,1
vfmadd213ps ymm0,ymm3,ymmword ptr [r9]
EDIT 2: this is the disassembly of your AVX implementation of the same algorithm:
vcvtpd2ps xmm0,xmmword ptr [rdx+r8*8+30h]
vcvtpd2ps xmm1,xmmword ptr [rdx+r8*8+20h]
vmovlhps xmm3,xmm1,xmm0
vcvtpd2ps xmm0,xmmword ptr [rdx+r8*8+10h]
vcvtpd2ps xmm1,xmmword ptr [rdx+r8*8]
vmovlhps xmm2,xmm1,xmm0
vperm2f128 ymm3,ymm2,ymm3,20h
vmulps ymm0,ymm3,ymmword ptr [rcx+r8*4]
vaddps ymm0,ymm0,ymmword ptr [r9]
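For checking the output of either SIMD version, a scalar reference can be useful. The following is only a sketch with a hypothetical name, dotMixedReference. It narrows each double to float before the multiply, mirroring what the cvtpd2ps-based versions do per lane:

```cpp
#include <cstddef>

// Scalar reference for the mixed float/double inner product.
float dotMixedReference(const float* x, const double* yD, std::size_t n)
{
    float res = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        res += x[i] * static_cast<float>(yD[i]);   // double -> float, then multiply
    return res;
}
```

The SIMD versions accumulate in 8 separate lanes and sum them at the end, so their results can differ from this reference by rounding. Compare with a tolerance rather than for exact equality.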
Consider the following code:
Matrix4x4 perspective(const ViewFrustum &frustum) {
float l = frustum.l;
float r = frustum.r;
float b = frustum.b;
float t = frustum.t;
float n = frustum.n;
float f = frustum.f;
return {
{ 2 * n / (r - l), 0, (r + l) / (r - l), 0 },
{ 0, 2 * n / (t - b), (t + b) / (t - b), 0 },
{ 0, 0, -((f + n) / (f - n)), -(2 * n * f / (f - n)) },
{ 0, 0, -1, 0 }
};
}
In order to improve the readability of the matrix construction, I have to either make copies of the values from the frustum struct, or references to them. However, I don't actually need either copies or indirection.
Is it possible to have some kind of "reference" that would be resolved at compile time, kind of like a symbolic link? It would have the same effect as:
Matrix4x4 perspective(const ViewFrustum &frustum) {
#define l frustum.l
#define r frustum.r
#define b frustum.b
#define t frustum.t
#define n frustum.n
#define f frustum.f
return {
{ 2 * n / (r - l), 0, (r + l) / (r - l), 0 },
{ 0, 2 * n / (t - b), (t + b) / (t - b), 0 },
{ 0, 0, -((f + n) / (f - n)), -(2 * n * f / (f - n)) },
{ 0, 0, -1, 0 }
};
#undef l
#undef r
#undef b
#undef t
#undef n
#undef f
}
But without the preprocessor (or is using it acceptable here?). I suppose it isn't really needed, or could be avoided in this particular case by making those 6 values arguments to a function directly (though it would be a bit irritating having to call the function like that, and even then I could make an inline proxy function).
But I was just wondering if this is somehow possible in general? I could not find anything like it. I think it would come in handy for locally shortening descriptive names that are going to be used a lot, without actually having to lose the original names.
Well, that's what C++ references are for:
const float &l = frustum.l;
const float &r = frustum.r;
const float &b = frustum.b;
const float &t = frustum.t;
const float &n = frustum.n;
const float &f = frustum.f;
Most modern compilers will optimize out the references, and use the values from the frustum object verbatim, in the following expression, by resolving the references at compile-time.
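One further option, offered here only as a sketch: since C++17, a structured binding can introduce all six short names in one line. This assumes ViewFrustum is a plain aggregate whose public members are declared in the order l, r, b, t, n, f. The struct below is a stand-in, and perspectiveX is a hypothetical function that returns just the first matrix entry to keep the example short:

```cpp
// Assumption: a plain aggregate with the members in this order.
struct ViewFrustum { float l, r, b, t, n, f; };

// Returns only the first matrix entry, 2n / (r - l).
float perspectiveX(const ViewFrustum& frustum)
{
    const auto& [l, r, b, t, n, f] = frustum;   // names bound to the members, no copies
    return 2 * n / (r - l);
}
```

The bound names refer to the members of frustum directly, so this behaves like the six separate const references.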
Obligatory disclaimer: do not prematurely optimize.
Let me compare your naive perspective function, containing
float l = frustum.l;
float r = frustum.r;
float b = frustum.b;
float t = frustum.t;
float n = frustum.n;
float f = frustum.f;
with the #define version and @Sam Varshavchik's solution with references.
We assume that our compiler is optimizing, and optimizing at least decently.
Assembly output for all three versions: https://godbolt.org/g/G06Bx8.
Notice that the reference and define versions are exactly the same, as expected. But the naive version differs a lot. It first loads all the values from memory:
movss (%rdi), %xmm2 # xmm2 = mem[0],zero,zero,zero
movss 4(%rdi), %xmm1 # xmm1 = mem[0],zero,zero,zero
movss 8(%rdi), %xmm0 # xmm0 = mem[0],zero,zero,zero
movss %xmm0, 12(%rsp) # 4-byte Spill
movss 12(%rdi), %xmm0 # xmm0 = mem[0],zero,zero,zero
movss %xmm0, 8(%rsp) # 4-byte Spill
movss 16(%rdi), %xmm3 # xmm3 = mem[0],zero,zero,zero
movaps %xmm3, 16(%rsp) # 16-byte Spill
movss 20(%rdi), %xmm0
And then it never again references the %rdi (frustum) memory. The reference and define versions, on the other hand, load values as they are needed.
This happens because the implementation of the Vector4 constructor is hidden from the optimizer, which can't assume that the constructor doesn't modify frustum, so in the reference and define versions it must insert loads even when such loads are redundant.
So the naive version can be even faster than the "optimized" one, under certain circumstances.
In general, you can use plain references, as long as you are in the local scope. Modern compilers "see through them" and just treat them as aliases (notice that this actually applies even to pointers).
However, when dealing with stuff on the small side, copying to a local variable is, if anything, generally beneficial. frustum.r is one layer of indirection away (frustum is actually a pointer under the hood), so accessing it is costlier than it may seem, and if you have function calls in the middle of your function the compiler may not be able to prove that its value isn't changing, so the access may need to be repeated.
Local variables instead are normally directly on the stack (cheap) or straight in registers (cheapest), and, most importantly, given that they usually have no interaction with "the outside", the compiler has an easier time reasoning about them, so it can be more aggressive with optimizations; also, when actually performing the computations those values are going to be copied in registers and on the stack anyway.
So go ahead and use copies: at worst the compiler will probably do the same, at best you may have helped it optimize things.
I need to know the sign of the value which has the max absolute value stored in an __m128. This is the solution I have now:
int getMaxSign(__m128 const& vec) {
static const __m128 SIGN_BIT_MASK =
_mm_castsi128_ps(_mm_set1_epi32(0x80000000));
// This creates an int, where sign(a) is 1 if a is negative, 0 o.w.:
// sign(a3)<<3 | sign(a2)<<2 | sign(a1)<<1 | sign(a0)
const int signMask = _mm_movemask_ps(vec);
// Get the absolute value of the vector;
__m128 absValsMMX = _mm_andnot_ps(SIGN_BIT_MASK, vec);
// Figure out the horizontal max
__declspec(align(16)) float absVals[4];
_mm_store_ps(absVals, absValsMMX);
const float maxVal = std::max(std::max(absVals[0], absVals[1]), absVals[2]);
return (maxVal == absVals[0] ? signMask & 0x1 :
(maxVal == absVals[1] ? signMask & 0x2 : signMask & 0x4));
}
In this case, the result is nonzero if the value with the maximum absolute value was negative, and 0 otherwise, but I don't actually care what the convention is. Another thing is that I am representing homogeneous vectors using these __m128s, so I know that the last value will always be 0.
This seems like a lot of work to do for a relatively simple task. How can I do this faster?
Thanks!
Here is one possible implementation (in C):
int getMaxSign(const __m128 v)
{
__m128 v1, vmax, vmin, vsign;
int signBits;
v1 = (__m128)_mm_alignr_epi8((__m128i)v, (__m128i)v, 4); // v1 = v rotated by 1 element
vmax = _mm_max_ps(v, v1); // generate horizontal max/min
vmin = _mm_min_ps(v, v1);
vmax = _mm_max_ps(vmax, (__m128)_mm_alignr_epi8((__m128i)vmax, (__m128i)vmax, 8));
vmin = _mm_min_ps(vmin, (__m128)_mm_alignr_epi8((__m128i)vmin, (__m128i)vmin, 8));
vsign = _mm_add_ps(vmax, vmin); // add max and min to get sign of abs max
signBits = _mm_extract_ps(vsign, 0); // _mm_extract_ps returns the raw float bits as an int
return signBits < 0; // sign bit set: return 1 for negative
}
Although this looks like a lot of code, it's only about 9 SSE instructions, and there are no memory accesses, no branches, and very little scalar code.
Note that both SSSE3 and SSE4.1 instructions are used in the above.
Here is a second version which only requires SSSE3:
int getMaxSign(const __m128 v)
{
__m128 v1, vmax, vmin, vsign;
int mask;
v1 = (__m128)_mm_alignr_epi8((__m128i)v, (__m128i)v, 4); // v1 = v rotated by 1 element
vmax = _mm_max_ps(v, v1); // generate horizontal max/min
vmin = _mm_min_ps(v, v1);
vmax = _mm_max_ps(vmax, (__m128)_mm_alignr_epi8((__m128i)vmax, (__m128i)vmax, 8));
vmin = _mm_min_ps(vmin, (__m128)_mm_alignr_epi8((__m128i)vmin, (__m128i)vmin, 8));
vsign = _mm_add_ps(vmax, vmin); // add max and min to get sign of abs max
mask = _mm_movemask_epi8((__m128i)vsign);
return (mask & 8) != 0; // return 1 for negative
}
This generates 12 instructions:
pshufd $57, %xmm0, %xmm1
movdqa %xmm0, %xmm2
minps %xmm1, %xmm2
pshufd $78, %xmm2, %xmm3
minps %xmm3, %xmm2
maxps %xmm1, %xmm0
pshufd $78, %xmm0, %xmm1
maxps %xmm1, %xmm0
addps %xmm2, %xmm0
pmovmskb %xmm0, %eax
shrl $3, %eax
andl $1, %eax
Note how the compiler craftily changes palignr to pshufd and also implements the final scalar test using just a shrl and an andl.
Note for Visual Studio C/C++: to cast between __m128 and __m128i you'll need to use _mm_castps_si128 and _mm_castsi128_ps, e.g.
mask = _mm_movemask_epi8((__m128i)vsign);
would need to be changed to:
mask = _mm_movemask_epi8(_mm_castps_si128(vsign));
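As a possible variation, the same rotate/min/max idea can be sketched with plain SSE float intrinsics only, which sidesteps the __m128/__m128i casts altogether. The name getMaxSignSse is made up for this sketch; it uses _mm_shuffle_ps for the rotations (the same shuffle constants the compiler chose above, 57 and 78) and _mm_movemask_ps to read the sign bit.

```cpp
#include <xmmintrin.h>  // SSE

// Hypothetical SSE-only variant: returns 1 if the element with the
// largest absolute value is negative, 0 otherwise.
static inline int getMaxSignSse(__m128 v)
{
    // v rotated by 1 element: (v1, v2, v3, v0)
    __m128 v1 = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 3, 2, 1));
    __m128 vmax = _mm_max_ps(v, v1);
    __m128 vmin = _mm_min_ps(v, v1);
    // combine with a copy rotated by 2 elements to finish the horizontal max/min
    vmax = _mm_max_ps(vmax, _mm_shuffle_ps(vmax, vmax, _MM_SHUFFLE(1, 0, 3, 2)));
    vmin = _mm_min_ps(vmin, _mm_shuffle_ps(vmin, vmin, _MM_SHUFFLE(1, 0, 3, 2)));
    // max + min has the sign of whichever has the larger magnitude
    __m128 vsign = _mm_add_ps(vmax, vmin);
    // bit 0 of the mask is the sign bit of element 0
    return _mm_movemask_ps(vsign) & 1;
}
```

For example, getMaxSignSse(_mm_set_ps(0.0f, 1.0f, -3.0f, 2.0f)) looks at the elements (2, -3, 1, 0) and returns 1, because -3 has the largest magnitude.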
If your numbers are discrete, and properly spaced, and drawing from a limited subset, there are other possibilities.
If you're guaranteed that a, b, and c are integers, for instance, then you can raise the vector element-wise to an odd power and then dot it with <1, 1, 1>. If we raise it to the fifth power, for instance, it will give you <a^5, b^5, c^5>. If |a| is the largest and |a| = 2, then we know that b and c are each -1, 0, or 1, so the value of a^5 will dominate and the dot product will have its sign. For instance, if X = <a=-2, b=1, c=0>, then X^5 = <-32, 1, 0>. When you dot this with <1, 1, 1> you get -31, whose sign reflects that of the largest absolute value.
As the absolute value of the largest number increases, its relative margin over the other terms shrinks, so a larger exponent is needed. For instance, if we have <-8, 7, 7>, then the algorithm above gives X^5 = <-32768, 16807, 16807>; you dot that with <1, 1, 1> and get 846, so the algorithm fails with exponent 5. If we bump the exponent up to 7, we get <-2097152, 823543, 823543>, which dotted with <1, 1, 1> gives us -450066, which has the correct sign. Eventually round-off errors will also break this method. But I'm hoping it might give some insight into other alternatives, if you know the limits on your dataset.
As a footnote, remember that X^5 = (X*X) * (X*X) * X, so you do one multiply to get X^2, multiply that by itself to get X^4, and then multiply by X - three multiplies total. You need an odd exponent to preserve sign.
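A rough SSE sketch of the fifth-power variant, assuming small integer-valued floats so that every intermediate value is exactly representable. The function name signOfFifthPowerSum is made up, and the horizontal sum shown is just one of several ways to do it.

```cpp
#include <xmmintrin.h>  // SSE

// Hypothetical helper: returns 1 if the sum of the fifth powers of the
// four elements is negative, 0 otherwise.
static inline int signOfFifthPowerSum(__m128 x)
{
    __m128 x2 = _mm_mul_ps(x, x);    // X^2
    __m128 x4 = _mm_mul_ps(x2, x2);  // X^4
    __m128 x5 = _mm_mul_ps(x4, x);   // X^5, three multiplies in total
    // horizontal sum (the dot product with <1, 1, 1, 1>):
    // t = (x5[0] + x5[2], x5[1] + x5[3], ...)
    __m128 t = _mm_add_ps(x5, _mm_movehl_ps(x5, x5));
    // element 0 of t becomes x5[0] + x5[1] + x5[2] + x5[3]
    t = _mm_add_ss(t, _mm_shuffle_ps(t, t, _MM_SHUFFLE(1, 1, 1, 1)));
    return _mm_movemask_ps(t) & 1;   // sign bit of the sum
}
```

With <-2, 1, 0> this returns 1 as expected; with <-8, 7, 7> it returns 0, which reproduces the failure of exponent 5 described above.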
m = min(a,b,c);
M = max(a,b,c);
// return abs(m)>abs(M) ? sign(m): sign(M); // was
return sign(m+M);
As Paul_R correctly noticed, the sign comes simply from the sum of the min and max values: whichever of the two has the larger absolute value wins.
But the idea can be exploited further: the sum of the min and max is the same as the sum of all the elements minus the middle one, which can be found with at most 3 comparisons.
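In scalar C++ the min/max version could look roughly like this. The name signOfAbsMax and the convention of returning 1 for negative are assumptions made for the sketch.

```cpp
#include <algorithm>

// Hypothetical helper: returns 1 if the element with the largest
// absolute value is negative, 0 otherwise (including exact ties).
inline int signOfAbsMax(float a, float b, float c)
{
    const float m = std::min({a, b, c});
    const float M = std::max({a, b, c});
    // m + M takes the sign of whichever of the two has the larger magnitude
    return (m + M) < 0.0f ? 1 : 0;
}
```

For (2, -3, 1), m = -3 and M = 2, so m + M = -1 and the function returns 1.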
return sign(a+b+c - middle(a,b,c)); // or
return sign(a*aw + b*bw + c*cw); // where each of aw, bw, cw is 0 or 1
aw, bw, and cw could be derived from the number of comparisons each element wins (which I think has to be planned carefully for the case where 2 or 3 values are equal).
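A small sketch of the "sum minus middle" form, written for integers here because with floats the extra additions can round differently from m + M. The helper names are made up, and this particular median uses four min/max operations rather than the minimal number of comparisons.

```cpp
#include <algorithm>

// Hypothetical helper: median of three values.
inline int middleOf3(int a, int b, int c)
{
    return std::max(std::min(a, b), std::min(std::max(a, b), c));
}

// Hypothetical helper: returns 1 if the element with the largest
// absolute value is negative, 0 otherwise.
inline int signViaMiddle(int a, int b, int c)
{
    // a + b + c - middle == min + max; long long avoids int overflow
    const long long sum = static_cast<long long>(a) + b + c - middleOf3(a, b, c);
    return sum < 0 ? 1 : 0;
}
```

Taking the median through min/max also handles repeated values without special cases, since any of the equal elements can serve as the middle one.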
And further:
x = abs(b)>abs(a)?b:a;
return sign(x+c);
Possibly even further:
s = sign(a + b); // store the sign of larger of a or b
a = abs(a); b=abs(b);
a = max(a,b) | s; // somehow copy the sign.
return sign(a+c);
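One way to spell out the "somehow copy the sign" step in scalar C++ is std::copysign; this is only a sketch with a made-up function name, and an SSE version would presumably use bit masks instead.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper: returns 1 if the element with the largest
// absolute value is negative, 0 otherwise.
inline int signOfAbsMaxCopysign(float a, float b, float c)
{
    // a + b carries the sign of whichever of a, b has the larger magnitude
    const float s = a + b;
    // larger magnitude of a and b, with that sign copied onto it
    const float x = std::copysign(std::max(std::fabs(a), std::fabs(b)), s);
    // repeat the same trick against c
    return (x + c) < 0.0f ? 1 : 0;
}
```

For (1, 2, -5), x becomes 2 and x + c = -3, so the result is 1.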