SSE: convert __m128 and __m128i into two __m128d - c++

Two related questions.
This is what my code needs to do with a fairly large amount of data. It is done inside inner loops, and performance is important.
Convert an array of __int32 into doubles (or convert __m128i into two __m128d).
Convert an array of floats into doubles (or convert __m128 into two __m128d).
Basically, I need functions with the following signatures:
void convert_int_to_double(__int32 const * input, double * output);
void convert_float_to_double(float const * input, double * output);
Input and output pointers are aligned and the number of elements is a multiple of 4. The main problem is how to quickly unpack __m128 into two __m128d.

The intrinsics _mm_cvtepi32_pd and _mm_cvtps_pd convert the values to double.
This should be the loop:
__m128i* base_addr = ...;
for( int i = 0; i < cnt; ++i )
{
    __m128i epi32 = _mm_load_si128( base_addr + i );
    __m128d v0 = _mm_cvtepi32_pd( epi32 );
    epi32 = _mm_srli_si128( epi32, 8 );
    __m128d v1 = _mm_cvtepi32_pd( epi32 );
    ....
}
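Putting the pieces together, a minimal untested sketch of both requested functions could look like the following. It assumes an element count is added to the signatures from the question (the stated ones take no count) and uses int32_t in place of MSVC's __int32:
#include <emmintrin.h>   // SSE2
#include <cstdint>       // int32_t (stand-in for MSVC's __int32)

void convert_int_to_double(int32_t const* input, double* output, int count)
{
    for (int i = 0; i < count; i += 4)
    {
        __m128i epi32 = _mm_load_si128(reinterpret_cast<const __m128i*>(input + i));
        __m128d lo = _mm_cvtepi32_pd(epi32);                    // elements 0,1
        __m128d hi = _mm_cvtepi32_pd(_mm_srli_si128(epi32, 8)); // elements 2,3
        _mm_store_pd(output + i, lo);
        _mm_store_pd(output + i + 2, hi);
    }
}

void convert_float_to_double(float const* input, double* output, int count)
{
    for (int i = 0; i < count; i += 4)
    {
        __m128 ps = _mm_load_ps(input + i);
        __m128d lo = _mm_cvtps_pd(ps);                          // elements 0,1
        __m128d hi = _mm_cvtps_pd(_mm_movehl_ps(ps, ps));       // elements 2,3
        _mm_store_pd(output + i, lo);
        _mm_store_pd(output + i + 2, hi);
    }
}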


Convert "__m256 with random-bits" into float values of [0, 1] range

I have a __m256 value that holds random bits.
I would like to "interpret" it, to obtain another __m256 that holds float values in a uniform [0.0f, 1.0f] range.
I am planning to do it using:
__m256 randomBits = /* generated random bits, uniformly distributed */;
__m256 invFloatRange = _mm256_set1_ps( numeric_limits<float>::min() ); // min() is the smallest increment of float precision
__m256 float01 = _mm256_mul_ps( randomBits, invFloatRange );
// float01 is now ready to be used
Question 1:
However, will this cause a problem in the very rare cases where randomBits has all bits set to 1 and is therefore NaN?
What can I do to protect myself from this?
I want float01 to always be a usable number.
Question 2:
Will the [0, 1] range remain uniform after I obtain it using the above approach? I know float has varying precision at different magnitudes.
Reinterpreting an int32_t as a float, one can do:
auto const one = _mm256_set1_epi32(0x3f800000); // bit pattern of 1.0f
a = _mm256_and_si256(a, _mm256_set1_epi32(0x007fffff));
a = _mm256_or_si256(a, one);
return _mm256_sub_ps(_mm256_castsi256_ps(a), _mm256_castsi256_ps(one));
The and/or sequence reuses the 23 LSBs of the input to produce a uniform distribution of values in the range 1.0f <= a < 2.0f; the bias of 1.0f is then removed.
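Wrapped up as a self-contained helper, the trick above looks roughly like this (untested sketch; the function name is illustrative):
#include <immintrin.h>   // AVX2

// Takes 32 random bits per lane, returns 8 floats uniformly distributed in [0.0f, 1.0f).
inline __m256 bits_to_float01(__m256i randomBits)
{
    const __m256i one  = _mm256_set1_epi32(0x3f800000);                          // bit pattern of 1.0f
    __m256i mant = _mm256_and_si256(randomBits, _mm256_set1_epi32(0x007fffff));  // keep the 23 mantissa bits
    __m256i bits = _mm256_or_si256(mant, one);                                   // floats in [1.0f, 2.0f)
    return _mm256_sub_ps(_mm256_castsi256_ps(bits), _mm256_castsi256_ps(one));   // subtract the 1.0f bias
}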
As @Soonts has pointed out, floats can be created uniformly in the [0, 1] range:
https://stackoverflow.com/a/54873925/9007125
I ended up using the answer below:
https://stackoverflow.com/a/54893167/9007125
// Converts __m256i values into __m256 values that contain floats in the [0,1) range.
// https://stackoverflow.com/a/54893167/9007125
inline void int_rand_int_toFloat01( const __m256i* m256i_vals,
                                    __m256* m256f_vals ){ //<-- stores here.
    static const __m256 c = _mm256_set1_ps(0x1.0p-24f); // or (1.0f / (uint32_t(1) << 24))
    const __m256i* rnd = m256i_vals;
    __m256* output = m256f_vals;
    // remember that '_mm256_cvtepi32_ps' converts 32-bit ints into 32-bit floats
    __m256 converted = _mm256_cvtepi32_ps(_mm256_srli_epi32(*rnd, 8));
    *output = _mm256_mul_ps( converted, c );
}
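Hypothetical usage of the helper above (untested); rand_bits256() is a placeholder for whatever vectorized PRNG supplies the random 32-bit lanes:
__m256i bits = rand_bits256();          // placeholder PRNG, not a real intrinsic
__m256  u01;
int_rand_int_toFloat01( &bits, &u01 );  // u01 now holds 8 floats in [0, 1)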

AVX2: Computing dot product of 512 float arrays

I will preface this by saying that I am a complete beginner at SIMD intrinsics.
Essentially, I have a CPU which supports the AVX2 instruction set (Intel(R) Core(TM) i5-7500T CPU @ 2.70GHz). I would like to know the fastest way to compute the dot product of two std::vector<float> of size 512.
I have done some digging online and found this and this, and this Stack Overflow question suggests using the following function: __m256 _mm256_dp_ps(__m256 m1, __m256 m2, const int mask);. However, these all suggest different ways of performing the dot product, and I am not sure which is the correct (and fastest) way to do it.
In particular, I am looking for the fastest way to perform the dot product for a vector of size 512 (because I know the vector size affects the implementation).
Thank you for your help.
Edit 1:
I am also a little confused about the -mavx2 gcc flag. If I use these AVX2 functions, do I need to add the flag when I compile? Also, is gcc able to do these optimizations for me (say if I use the -Ofast gcc flag) if I write a naive dot product implementation?
Edit 2:
If anyone has the time and energy, I would very much appreciate if you could write a full implementation. I am sure other beginners would also value this information.
_mm256_dp_ps is only useful for dot-products of 2 to 4 elements; for longer vectors use vertical SIMD in a loop and reduce to scalar at the end. Using _mm256_dp_ps and _mm256_add_ps in a loop would be much slower.
GCC and clang require you to enable (with command line options) ISA extensions that you use intrinsics for, unlike MSVC and ICC.
The code below is probably close to the theoretical performance limit of your CPU. Untested.
Compile it with clang or gcc -O3 -march=native. (It requires at least -mavx -mfma, but the -mtune options implied by -march are good too, as are the other things -march=native enables, such as -mpopcnt. Tune options are critical for this to compile efficiently for most CPUs with FMA, specifically -mno-avx256-split-unaligned-load: Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?)
Or compile it with MSVC -O2 -arch:AVX2
#include <immintrin.h>
#include <vector>
#include <assert.h>

// CPUs support RAM access like this: "ymmword ptr [rax+64]"
// Using templates with an offset int argument to make it easier for the compiler to emit good code.

// Multiply 8 floats by another 8 floats.
template<int offsetRegs>
inline __m256 mul8( const float* p1, const float* p2 )
{
    constexpr int lanes = offsetRegs * 8;
    const __m256 a = _mm256_loadu_ps( p1 + lanes );
    const __m256 b = _mm256_loadu_ps( p2 + lanes );
    return _mm256_mul_ps( a, b );
}

// Returns acc + ( p1 * p2 ), for 8-wide float lanes.
template<int offsetRegs>
inline __m256 fma8( __m256 acc, const float* p1, const float* p2 )
{
    constexpr int lanes = offsetRegs * 8;
    const __m256 a = _mm256_loadu_ps( p1 + lanes );
    const __m256 b = _mm256_loadu_ps( p2 + lanes );
    return _mm256_fmadd_ps( a, b, acc );
}

// Compute dot product of float vectors, using 8-wide FMA instructions.
float dotProductFma( const std::vector<float>& a, const std::vector<float>& b )
{
    assert( a.size() == b.size() );
    assert( 0 == ( a.size() % 32 ) );
    if( a.empty() )
        return 0.0f;

    const float* p1 = a.data();
    const float* const p1End = p1 + a.size();
    const float* p2 = b.data();

    // Process initial 32 values. Nothing to add yet, just multiplying.
    __m256 dot0 = mul8<0>( p1, p2 );
    __m256 dot1 = mul8<1>( p1, p2 );
    __m256 dot2 = mul8<2>( p1, p2 );
    __m256 dot3 = mul8<3>( p1, p2 );
    p1 += 8 * 4;
    p2 += 8 * 4;

    // Process the rest of the data.
    // The code uses FMA instructions to multiply + accumulate, consuming 32 values per loop iteration.
    // Unrolling manually for 2 reasons:
    // 1. To reduce data dependencies. With a single register, every loop iteration would depend on the previous result.
    // 2. Unrolled code checks for exit condition 4x less often, therefore more CPU cycles spent computing useful stuff.
    while( p1 < p1End )
    {
        dot0 = fma8<0>( dot0, p1, p2 );
        dot1 = fma8<1>( dot1, p1, p2 );
        dot2 = fma8<2>( dot2, p1, p2 );
        dot3 = fma8<3>( dot3, p1, p2 );
        p1 += 8 * 4;
        p2 += 8 * 4;
    }

    // Add 32 values into 8
    const __m256 dot01 = _mm256_add_ps( dot0, dot1 );
    const __m256 dot23 = _mm256_add_ps( dot2, dot3 );
    const __m256 dot0123 = _mm256_add_ps( dot01, dot23 );
    // Add 8 values into 4
    const __m128 r4 = _mm_add_ps( _mm256_castps256_ps128( dot0123 ), _mm256_extractf128_ps( dot0123, 1 ) );
    // Add 4 values into 2
    const __m128 r2 = _mm_add_ps( r4, _mm_movehl_ps( r4, r4 ) );
    // Add 2 lower values into the final result
    const __m128 r1 = _mm_add_ss( r2, _mm_movehdup_ps( r2 ) );
    // Return the lowest lane of the result vector.
    // The intrinsic below compiles into a noop, modern compilers return floats in the lowest lane of the xmm0 register.
    return _mm_cvtss_f32( r1 );
}
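A hypothetical usage sketch (untested); the vector sizes must be equal and a multiple of 32, as the asserts above require:
#include <cstdio>

int main()
{
    std::vector<float> a( 512, 1.0f ), b( 512, 2.0f );
    std::printf( "%f\n", dotProductFma( a, b ) );   // 512 * (1.0 * 2.0) = 1024.000000
}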
Possible further improvements:
Unroll by 8 vectors instead of 4. I’ve checked gcc 9.2 asm output; the compiler only used 8 vector registers out of the 16 available.
Make sure both input vectors are aligned, e.g. use a custom allocator which calls _aligned_malloc / _aligned_free on MSVC, or aligned_alloc / free on gcc & clang. Then replace _mm256_loadu_ps with _mm256_load_ps.
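A minimal sketch (untested) of that last suggestion for gcc & clang, using C++17 std::aligned_alloc with plain buffers (on MSVC, _aligned_malloc / _aligned_free would take its place); the helper name is illustrative:
#include <cstdlib>       // std::aligned_alloc, std::free
#include <immintrin.h>

// Allocate n floats on a 32-byte boundary so _mm256_load_ps can be used.
// aligned_alloc requires the byte count to be a multiple of the alignment,
// which holds here as long as n is a multiple of 32.
float* alloc_floats_avx( std::size_t n )
{
    return static_cast<float*>( std::aligned_alloc( 32, n * sizeof(float) ) );
}
// ...fill the buffers, switch mul8/fma8 to _mm256_load_ps( p1 + lanes ),
// and release the memory with std::free().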
To auto-vectorize a simple scalar dot product, you'd also need OpenMP SIMD or -ffast-math (implied by -Ofast) to let the compiler treat FP math as associative even though it's not (because of rounding). But GCC won't use multiple accumulators when auto-vectorizing, even if it does unroll, so you'd bottleneck on FMA latency, not load throughput.
(2 loads per FMA means the throughput bottleneck for this code is vector loads, not actual FMA operations.)
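For comparison, the naive scalar loop that the auto-vectorization remark refers to might look like this (untested sketch, same includes as above):
// Naive scalar dot product. With -O3 -march=native plus -ffast-math (or an
// OpenMP SIMD reduction) the compiler is allowed to vectorize this reduction,
// but GCC keeps a single vector accumulator, so the loop is bound by FMA
// latency rather than load throughput, as noted above.
float dotProductScalar( const std::vector<float>& a, const std::vector<float>& b )
{
    float sum = 0.0f;
    for( std::size_t i = 0; i < a.size(); i++ )
        sum += a[i] * b[i];
    return sum;
}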

Why do two consecutive gather instructions perform worse than equivalent elementary ops?

I am upgrading some code from SSE to AVX2. In general I can see that gather instructions are quite useful and benefit performance. However, I encountered a case where gather instructions are less efficient than decomposing the gather operations into simpler ones.
In the code below, I have a vector of int32 b, a vector of double xi and 4 int32 indices packed in a 128-bit register bidx. I need to gather first from vector b, then from vector xi. I.e., in pseudo-code, I need to do:
__m128i i = b[idx];
__m256d x = xi[i];
In the function below, I implement this in two ways using an #ifdef: via gather instructions, yielding a throughput of 290 Miter/sec, and via elementary operations, yielding a throughput of 325 Miter/sec.
Can somebody explain what is going on? Thanks.
inline void resolve( const __m256d& z, const __m128i& bidx, int32_t j
                   , const int32_t *b, const double *xi, int32_t* ri )
{
    __m256d x;
    __m128i i;
#if 0 // this code uses two gather instructions in sequence
    i = _mm_i32gather_epi32(b, bidx, 4);  // i = b[bidx]
    x = _mm256_i32gather_pd(xi, i, 8);    // x = xi[i]
#else // this code does not use gather instructions
    union {
        __m128i vec;
        int32_t i32[4];
    } u;
    x = _mm256_set_pd
        ( xi[(u.i32[3] = b[_mm_extract_epi32(bidx,3)])]
        , xi[(u.i32[2] = b[_mm_extract_epi32(bidx,2)])]
        , xi[(u.i32[1] = b[_mm_extract_epi32(bidx,1)])]
        , xi[(u.i32[0] = b[_mm_cvtsi128_si32(bidx) ])]
        );
    i = u.vec;
#endif
    // here we use x and i
    __m256 ps256 = _mm256_castpd_ps(_mm256_cmp_pd(z, x, _CMP_LT_OS));
    __m128 lo128 = _mm256_castps256_ps128(ps256);
    __m128 hi128 = _mm256_extractf128_ps(ps256, 1);
    __m128 blend = _mm_shuffle_ps(lo128, hi128, 0 + (2<<2) + (0<<4) + (2<<6));
    __m128i lt = _mm_castps_si128(blend); // this is 0 or -1
    i = _mm_add_epi32(i, lt);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(ri)+j, i);
}
Since your 'resolve' function is marked inline, I suppose it is called in a high-frequency loop. You might then also look at how the input parameters depend on each other outside the 'resolve' function. The compiler might be able to optimize the inlined code better across loop boundaries when using the scalar code variant.

Grayscale bilinear patch extraction - SSE optimization

My program makes intensive use of small sub-images extracted using bilinear interpolation from larger grayscale images.
I am using the following function for this purpose:
bool extract_patch_bilin(const cv::Point2f &patch_ctr, const cv::Mat_<uchar> &img, cv::Mat_<uchar> &patch)
{
const int hsize = patch.rows/2;
// ...
// Precondition checks: patch is a preallocated square matrix and both patch and image have continuous buffers
// ...
int floorx=(int)floor(patch_ctr.x)-hsize, floory=(int)floor(patch_ctr.y)-hsize;
if(floorx<0 || img.cols-1<floorx+patch.cols || floory<0 || img.rows-1<floory+patch.rows)
return false;
float x=patch_ctr.x-hsize-floorx;
float y=patch_ctr.y-hsize-floory;
float xy = x*y;
float w00=1-x-y+xy, w01=x-xy, w10=y-xy, w11=xy;
int img_stride = img.cols-patch.cols;
uchar* buff_img0 = (uchar*)img.data+img.cols*floory+floorx;
uchar* buff_img1 = buff_img0+img.cols;
uchar* buff_patch = (uchar*)patch.data;
for(int v=0; v<patch.rows; ++v,buff_img0+=img_stride,buff_img1+=img_stride) {
for(int u=0; u<patch.cols; ++u,++buff_patch,++buff_img0,++buff_img1)
buff_patch[0] = cv::saturate_cast<uchar>(buff_img0[0]*w00+buff_img0[1]*w01+buff_img1[0]*w10+buff_img1[1]*w11);
}
return true;
}
Long story short, I am already using parallelization in other parts of the program, and I am considering using SSE to optimize the execution of this function, because I am mostly using 8x8 patches and it seems like a good idea to process bunches of 8 pixels at a time using SSE.
However, I am not sure how to deal with the multiplication by the float interpolation weights (i.e. w00, w01, w10 and w11). These weights are necessarily positive and smaller than 1, hence the multiplication cannot overflow the unsigned char datatype.
Does anyone know how to proceed?
EDIT:
I tried to do this as follows (assuming 16x16 patches), but there is no significant speed-up:
bool extract_patch_bilin_16x16(const cv::Point2f& patch_ctr, const cv::Mat_<uchar> &img, cv::Mat_<uchar> &patch)
{
// ...
// Precondition checks
// ...
const int hsize = patch.rows/2;
int floorx=(int)floor(patch_ctr.x)-hsize, floory=(int)floor(patch_ctr.y)-hsize;
// Check that the full extracted patch is inside the image
if(floorx<0 || img.cols-1<floorx+patch.cols || floory<0 || img.rows-1<floory+patch.rows)
return false;
// Compute the constant bilinear weights
float x=patch_ctr.x-hsize-floorx;
float y=patch_ctr.y-hsize-floory;
float xy = x*y;
float w00=1-x-y+xy, w01=x-xy, w10=y-xy, w11=xy;
// Prepare image resampling loop
int img_stride = img.cols-patch.cols;
uchar* buff_img0 = (uchar*)img.data+img.cols*floory+floorx;
uchar* buff_img1 = buff_img0+img.cols;
uchar* buff_patch = (uchar*)patch.data;
// Precompute weighting variables
const __m128i CONST_0 = _mm_setzero_si128();
__m128i w00x256_32i = _mm_set1_epi32(cvRound(w00*256));
__m128i w01x256_32i = _mm_set1_epi32(cvRound(w01*256));
__m128i w10x256_32i = _mm_set1_epi32(cvRound(w10*256));
__m128i w11x256_32i = _mm_set1_epi32(cvRound(w11*256));
__m128i w00x256_16i = _mm_packs_epi32(w00x256_32i,w00x256_32i);
__m128i w01x256_16i = _mm_packs_epi32(w01x256_32i,w01x256_32i);
__m128i w10x256_16i = _mm_packs_epi32(w10x256_32i,w10x256_32i);
__m128i w11x256_16i = _mm_packs_epi32(w11x256_32i,w11x256_32i);
// Process pixels
int ngroups = patch.rows>>4;
for(int v=0; v<patch.rows; ++v,buff_img0+=img_stride,buff_img1+=img_stride) {
for(int g=0; g<ngroups; ++g,buff_patch+=16,buff_img0+=16,buff_img1+=16) {
////////////////////////////////
// Load the data (16 pixels in one load)
////////////////////////////////
__m128i val00 = _mm_loadu_si128((__m128i*)buff_img0);
__m128i val01 = _mm_loadu_si128((__m128i*)(buff_img0+1));
__m128i val10 = _mm_loadu_si128((__m128i*)buff_img1);
__m128i val11 = _mm_loadu_si128((__m128i*)(buff_img1+1));
////////////////////////////////
// Process the lower 8 values
////////////////////////////////
// Unpack into 16-bits integers
__m128i val00_lo = _mm_unpacklo_epi8(val00,CONST_0);
__m128i val01_lo = _mm_unpacklo_epi8(val01,CONST_0);
__m128i val10_lo = _mm_unpacklo_epi8(val10,CONST_0);
__m128i val11_lo = _mm_unpacklo_epi8(val11,CONST_0);
// Multiply with the integer weights
__m128i w256val00_lo = _mm_mullo_epi16(val00_lo,w00x256_16i);
__m128i w256val01_lo = _mm_mullo_epi16(val01_lo,w01x256_16i);
__m128i w256val10_lo = _mm_mullo_epi16(val10_lo,w10x256_16i);
__m128i w256val11_lo = _mm_mullo_epi16(val11_lo,w11x256_16i);
// Divide by 256 to get the approximate result of the multiplication with floating-point weights
__m128i wval00_lo = _mm_srli_epi16(w256val00_lo,8);
__m128i wval01_lo = _mm_srli_epi16(w256val01_lo,8);
__m128i wval10_lo = _mm_srli_epi16(w256val10_lo,8);
__m128i wval11_lo = _mm_srli_epi16(w256val11_lo,8);
// Add pairwise
__m128i sum0_lo = _mm_add_epi16(wval00_lo,wval01_lo);
__m128i sum1_lo = _mm_add_epi16(wval10_lo,wval11_lo);
__m128i final_lo = _mm_add_epi16(sum0_lo,sum1_lo);
////////////////////////////////
// Process the higher 8 values
////////////////////////////////
// Unpack into 16-bits integers
__m128i val00_hi = _mm_unpackhi_epi8(val00,CONST_0);
__m128i val01_hi = _mm_unpackhi_epi8(val01,CONST_0);
__m128i val10_hi = _mm_unpackhi_epi8(val10,CONST_0);
__m128i val11_hi = _mm_unpackhi_epi8(val11,CONST_0);
// Multiply with the integer weights
__m128i w256val00_hi = _mm_mullo_epi16(val00_hi,w00x256_16i);
__m128i w256val01_hi = _mm_mullo_epi16(val01_hi,w01x256_16i);
__m128i w256val10_hi = _mm_mullo_epi16(val10_hi,w10x256_16i);
__m128i w256val11_hi = _mm_mullo_epi16(val11_hi,w11x256_16i);
// Divide by 256 to get the approximate result of the multiplication with floating-point weights
__m128i wval00_hi = _mm_srli_epi16(w256val00_hi,8);
__m128i wval01_hi = _mm_srli_epi16(w256val01_hi,8);
__m128i wval10_hi = _mm_srli_epi16(w256val10_hi,8);
__m128i wval11_hi = _mm_srli_epi16(w256val11_hi,8);
// Add pairwise
__m128i sum0_hi = _mm_add_epi16(wval00_hi,wval01_hi);
__m128i sum1_hi = _mm_add_epi16(wval10_hi,wval11_hi);
__m128i final_hi = _mm_add_epi16(sum0_hi,sum1_hi);
////////////////////////////////
// Repack all values
////////////////////////////////
__m128i final_val = _mm_packus_epi16(final_lo,final_hi);
_mm_storeu_si128((__m128i*)buff_patch,final_val);
}
}
return true;
}
Any idea what could be done to improve the speed-up?
I would consider sticking to integers: your weights are multiples of 1/64, so working in 8.6 fixed point is enough, and that fits in 16-bit numbers.
Bilinear interpolation is best done as three linear ones (two along Y, then one along X; you can reuse the second Y interpolation for the neighboring patch).
To perform a linear interpolation between two values, you will pre-store once and for all the interpolation weights P and Q (8 to 1 and 0 to 7), and multiply and add them in pairs, like V0*P[i] + V1*Q[i]. This is done efficiently with the PMADDUBSW instruction (after appropriate data interleaving and replication of the values V0 and V1, with PUNPCKLBW and the like).
In the end, divide by the total weight (PSRLW) and rescale to bytes (PACKUSWB). (This step can be performed once only, combining the two interpolations.)
You could think of doubling all the weights, so that the final scaling is by 8 bits and PACKUSWB alone would suffice, but unfortunately it saturates the values and there is no non-saturating equivalent.
It could be that precomputing all 64 interpolation weights and summing the four bilinear terms is better.
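A minimal sketch (untested, names illustrative) of the single linear-interpolation step described above, using 8.6 fixed-point weights with P + Q == 64:
#include <tmmintrin.h>   // SSSE3: _mm_maddubs_epi16 (PMADDUBSW)

// One linear interpolation step between two rows of 16 unsigned-byte pixels:
// out[i] = (v0[i]*P + v1[i]*Q) >> 6, with P + Q == 64 (8.6 fixed point).
static inline __m128i lerp_rows_16px( __m128i v0, __m128i v1, int P, int Q )
{
    // Interleave the two rows: v0[0],v1[0],v0[1],v1[1],... (PUNPCKLBW/PUNPCKHBW)
    const __m128i lo = _mm_unpacklo_epi8( v0, v1 );
    const __m128i hi = _mm_unpackhi_epi8( v0, v1 );
    // Replicate the weight pair the same way: P,Q,P,Q,... (signed bytes, 0..64)
    const __m128i w  = _mm_set1_epi16( (short)((Q << 8) | P) );
    // v0*P + v1*Q per output pixel, as 16-bit sums (max 255*64, no saturation)
    __m128i sum_lo = _mm_maddubs_epi16( lo, w );
    __m128i sum_hi = _mm_maddubs_epi16( hi, w );
    // Divide by the total weight (PSRLW by 6) and repack to bytes (PACKUSWB)
    return _mm_packus_epi16( _mm_srli_epi16( sum_lo, 6 ), _mm_srli_epi16( sum_hi, 6 ) );
}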
UPDATE:
If the goal is to interpolate with fixed coefficients for all pixel quads (actually achieving a subpixel translation), the strategy is different.
You will load a run of 8 (16?) pixels corresponding to the upper-left corners, a run of 8 shifted one pixel to the right (corresponding to the upper-right corners), and similarly for the next row (the bottom corners); multiply and add in pairs (PMADDUBSW) the pixel values with the corresponding interpolation weights, and combine the pairs (PADDW). Store the weights with replication.
Another option is to avoid PMADD and perform separate multiplies (PMULLW) and adds (PADDW). This will simplify the reorganization scheme.
After scaling (as above), you end up with a run of 8 interpolated values.
This can work as well for variable interpolation weights, as long as you interpolate exactly one pixel per quad.
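A rough sketch (untested, names illustrative) of the fixed-coefficient variant described in this update, producing 16 interpolated pixels per call, with the four weights pre-scaled so that w00 + w01 + w10 + w11 == 64:
#include <tmmintrin.h>   // SSSE3: _mm_maddubs_epi16 (PMADDUBSW)

static inline __m128i bilinear16( const unsigned char* row0, const unsigned char* row1,
                                  int w00, int w01, int w10, int w11 )
{
    const __m128i top  = _mm_loadu_si128( (const __m128i*)row0 );       // upper-left pixels
    const __m128i topR = _mm_loadu_si128( (const __m128i*)(row0 + 1) ); // upper-right pixels
    const __m128i bot  = _mm_loadu_si128( (const __m128i*)row1 );       // lower-left pixels
    const __m128i botR = _mm_loadu_si128( (const __m128i*)(row1 + 1) ); // lower-right pixels

    const __m128i wTop = _mm_set1_epi16( (short)((w01 << 8) | w00) );   // w00,w01 replicated
    const __m128i wBot = _mm_set1_epi16( (short)((w11 << 8) | w10) );   // w10,w11 replicated

    // Low 8 output pixels: top*w00 + topR*w01 + bot*w10 + botR*w11, as 16-bit sums
    __m128i lo = _mm_add_epi16( _mm_maddubs_epi16( _mm_unpacklo_epi8( top, topR ), wTop ),
                                _mm_maddubs_epi16( _mm_unpacklo_epi8( bot, botR ), wBot ) );
    // High 8 output pixels
    __m128i hi = _mm_add_epi16( _mm_maddubs_epi16( _mm_unpackhi_epi8( top, topR ), wTop ),
                                _mm_maddubs_epi16( _mm_unpackhi_epi8( bot, botR ), wBot ) );

    // Divide by the total weight (64) and repack to unsigned bytes
    return _mm_packus_epi16( _mm_srli_epi16( lo, 6 ), _mm_srli_epi16( hi, 6 ) );
}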