Can I get faster than _mm256_i32gather_epi32? - C++

I wrote gamma conversion code for 4K video:
/** gamma0
input range : 0 ~ 1,023
output range : 0 ~ ?
*/
v00 = _mm256_unpacklo_epi16(v0, _mm256_setzero_si256());
v01 = _mm256_unpackhi_epi16(v0, _mm256_setzero_si256());
v10 = _mm256_unpacklo_epi16(v1, _mm256_setzero_si256());
v11 = _mm256_unpackhi_epi16(v1, _mm256_setzero_si256());
v20 = _mm256_unpacklo_epi16(v2, _mm256_setzero_si256());
v21 = _mm256_unpackhi_epi16(v2, _mm256_setzero_si256());
v00 = _mm256_i32gather_epi32(csv->gamma0LUT, v00, 4);
v01 = _mm256_i32gather_epi32(csv->gamma0LUT, v01, 4);
v10 = _mm256_i32gather_epi32(csv->gamma0LUTc, v10, 4);
v11 = _mm256_i32gather_epi32(csv->gamma0LUTc, v11, 4);
v20 = _mm256_i32gather_epi32(csv->gamma0LUTc, v20, 4);
v21 = _mm256_i32gather_epi32(csv->gamma0LUTc, v21, 4);
I want to implement a "10-bit input to 10~13-bit output" LUT (look-up table), but AVX2 gather instructions only support 32-bit (or wider) elements.
So I had to widen the indices to 32 bits and use the _mm256_i32gather_epi32 instruction.
This part is the most severe performance bottleneck. Is there any way to improve it?

Since the context of your question is still a bit vague for me, just some general ideas you could try (some may be just slightly better or even worse compared to what you have at the moment, all code below is untested):
LUT with 16 bit values using _mm256_i32gather_epi32
Even though it loads 32bit values, you can still use a multiplier of 2 as last argument of _mm256_i32gather_epi32. You should make sure that 2 bytes before and after your LUT are readable.
static const int16_t LUT[1024+2] = { 0, val0, val1, ..., val1022, val1023, 0};
__m256i high_idx = _mm256_srli_epi32(v, 16);
__m256i low_idx = _mm256_blend_epi16(v, _mm256_setzero_si256(), 0xAA);
__m256i high_val = _mm256_i32gather_epi32((int const*)(LUT+0), high_idx, 2);
__m256i low_val = _mm256_i32gather_epi32((int const*)(LUT+1), low_idx, 2);
__m256i values = _mm256_blend_epi16(low_val, high_val, 0xAA);
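To see why the padding and the LUT+0 / LUT+1 offsets line up, here is a scalar sketch that emulates a single gather lane. It assumes a little-endian machine, and the names load32_at and lookup_pair as well as the use of unsigned 16-bit entries are illustrative choices, not part of the code above.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Emulates one lane of _mm256_i32gather_epi32(base, idx, 2):
// a 32-bit load from byte offset 2*idx (little-endian assumed).
static uint32_t load32_at(const uint16_t* base, uint32_t idx) {
    uint32_t w;
    std::memcpy(&w, reinterpret_cast<const unsigned char*>(base) + 2u * idx, sizeof w);
    return w;
}

// padded holds N+2 entries: padded[0] and padded[N+1] are padding,
// padded[i + 1] is the LUT value for index i.
// packed holds two 16-bit indices: lo in bits 0..15, hi in bits 16..31.
static uint32_t lookup_pair(const std::vector<uint16_t>& padded, uint32_t packed) {
    uint32_t hi_idx = packed >> 16;
    uint32_t lo_idx = packed & 0xFFFFu;
    // Load starting at padded[hi_idx]: the wanted value padded[hi_idx + 1]
    // lands in the upper 16 bits of the loaded word.
    uint32_t hi_val = load32_at(padded.data() + 0, hi_idx);
    // Load starting at padded[lo_idx + 1]: the wanted value lands in the
    // lower 16 bits; the upper 16 bits may come from the trailing padding.
    uint32_t lo_val = load32_at(padded.data() + 1, lo_idx);
    return (lo_val & 0x0000FFFFu) | (hi_val & 0xFFFF0000u);
}
```

The two padding entries are exactly the ones touched when hi_idx is 0 and when lo_idx is the last valid index.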
Join two values into one LUT-entry
For small-ish LUTs, you could calculate an index from two neighboring indexes as (idx_hi << 10) + idx_low and look up the corresponding tuple directly. However, instead of 2KiB you would have a 4 MiB LUT in your case, which likely hurts caching -- but you only have half the number of gather instructions.
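A minimal sketch of how such a paired table could be built from an existing 1024-entry table. The function name build_pair_lut and the choice to pack both 16-bit results into one 32-bit entry are assumptions made for illustration.

```cpp
#include <cstdint>
#include <vector>

// Builds a table indexed by (idx_hi << 10) + idx_lo whose entries hold
// both results: lut[idx_lo] in bits 0..15 and lut[idx_hi] in bits 16..31.
// lut must have 1024 entries; the result has 1024 * 1024 entries (4 MiB).
std::vector<uint32_t> build_pair_lut(const std::vector<uint16_t>& lut) {
    std::vector<uint32_t> pair(1u << 20);
    for (uint32_t hi = 0; hi < 1024; ++hi)
        for (uint32_t lo = 0; lo < 1024; ++lo)
            pair[(hi << 10) + lo] = uint32_t(lut[lo]) | (uint32_t(lut[hi]) << 16);
    return pair;
}
```

One 32-bit gather from this table then yields two finished 16-bit outputs, at the price of a working set that no longer fits in L1 or L2 cache on most CPUs.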
Polynomial approximation
Mathematically, all continuous functions on a finite interval can be approximated by a polynomial. You could either convert your values to float, evaluate the polynomial, and convert the result back, or do it directly with fixed-point multiplications (note that _mm256_mulhi_epi16/_mm256_mulhi_epu16 compute (a * b) >> 16, which is convenient if one factor is actually in [0, 1)).
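As a rough scalar model of the fixed-point route, the sketch below mimics one lane of _mm256_mulhi_epu16 and uses it in a Horner evaluation of a quadratic. The function names, the quadratic degree, and the assumption that intermediate sums fit in 16 bits are all illustrative; real gamma curves will need fitted coefficients.

```cpp
#include <cstdint>

// Scalar model of one lane of _mm256_mulhi_epu16: (a * b) >> 16.
static uint16_t mulhi_u16(uint16_t a, uint16_t b) {
    return uint16_t((uint32_t(a) * uint32_t(b)) >> 16);
}

// Evaluates c0 + c1*x + c2*x^2 by Horner's rule, where x is a Q0.16
// fraction (x / 65536 lies in [0, 1)) and the coefficients are plain
// integers. Assumes the intermediate sums fit in 16 bits.
static uint16_t poly2_q16(uint16_t x, uint16_t c0, uint16_t c1, uint16_t c2) {
    uint16_t t = uint16_t(c1 + mulhi_u16(x, c2));   // c1 + c2*x
    return uint16_t(c0 + mulhi_u16(x, t));          // c0 + x*(c1 + c2*x)
}
```

Each mulhi truncates, so the error grows with the polynomial degree; a 10-bit input would first be shifted left by 6 to occupy the Q0.16 range.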
8 bit, 16 entry LUT with linear interpolation
SSSE3/AVX2 provide a pshufb instruction which can be used as an 8-bit LUT with 16 entries (and an implicit 0 entry).
Proof-of-concept implementation:
// LUT must hold the same 16 byte entries in both 128-bit lanes, since vpshufb works per lane
__m256i idx = _mm256_srli_epi16(v, 6); // shift highest 4 bits to the right
idx = _mm256_mullo_epi16(idx, _mm256_set1_epi16(0x0101)); // duplicate idx, maybe _mm256_shuffle_epi8 is better?
idx = _mm256_sub_epi8(idx, _mm256_set1_epi16(0x0100)); // subtract 1 from upper idx, 0 is mapped to 0xff
__m256i lut_vals = _mm256_shuffle_epi8(LUT, idx); // [LUT[idx-1], LUT[idx]], implicitly: LUT[-1] = 0
// get fractional part of input value:
__m256i dv = _mm256_and_si256(v, _mm256_set1_epi16(0x3f)); // lowest 6 bits
dv = _mm256_mullo_epi16(dv, _mm256_set1_epi16((short)0xff01)); // dv = [-dv, dv]
dv = _mm256_add_epi8(dv, _mm256_set1_epi16(0x4000)); // dv = [0x40-(v&0x3f), (v&0x3f)];
__m256i res = _mm256_maddubs_epi16(lut_vals, dv); // LUT[idx-1]*(0x40-dv) + LUT[idx]*dv; switch order depending on whether LUT values are (un)signed.
// probably shift res to the right, depending on the scale of your LUT values
You could also combine this with first doing a linear or quadratic approximation and just calculating the difference to your target function.
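For checking a SIMD version, a scalar reference of the intended arithmetic can help. The sketch below assumes that LUT entry k holds the 8-bit function value at input (k + 1) * 64 and that the value at input 0 is the implicit zero entry; the function name lut16_interp is made up for this example.

```cpp
#include <cstdint>

// Scalar reference for a 16-segment piecewise-linear LUT over 10-bit input.
// lut[k] is the 8-bit function value at input (k + 1) * 64; the value at
// input 0 is the implicit zero entry. The result is scaled by 64.
static uint16_t lut16_interp(uint16_t v, const uint8_t lut[16]) {
    unsigned idx  = (v >> 6) & 0xF;   // top 4 of 10 bits: segment number
    unsigned frac = v & 0x3F;         // low 6 bits: position inside the segment
    unsigned left  = (idx == 0) ? 0u : lut[idx - 1];
    unsigned right = lut[idx];
    return uint16_t(left * (64 - frac) + right * frac);
}
```

The largest possible result is 255 * 64 = 16320, which is why the pairwise sums fit comfortably in the signed 16-bit lanes that pmaddubsw produces.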

Related

Convert "__m256 with random-bits" into float values of [0, 1] range

I have a __m256 value that holds random bits.
I would like to "interpret" it, to obtain another __m256 that holds float
values in a uniform [0.0f, 1.0f] range.
Planning to do it using:
__m256 randomBits = /* generated random bits, uniformly distributed */;
__m256 invFloatRange = _mm256_set1_ps( numeric_limits<float>::min() ); //min is a smallest increment of float precision
__m256 float01 = _mm256_mul_ps(randomBits, invFloatRange);
//float01 is now ready to be used
Question 1:
However, will this cause a problem in very rare cases where randomBits has all bits as 1 and is therefore NAN?
What can I do to protect myself from this?
I want the float01 to always be a usable number
Question 2:
Will the [0 to 1] range remain uniform after I obtain it using the above approach? I know float has varying precision at different magnitudes
By reinterpreting an int32_t as a float, one can do the following:
auto const one = _mm256_set1_epi32(0x3f800000); // bit pattern of 1.0f
a = _mm256_and_si256(a, _mm256_set1_epi32(0x007fffff));
a = _mm256_or_si256(a, one);
return _mm256_sub_ps(_mm256_castsi256_ps(a), _mm256_castsi256_ps(one));
The and/or sequence will reuse the 23 LSBs of the input sequence to produce a uniform distribution of values between 1.0f <= a < 2.0f. And then the bias of 1.0f is removed.
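A scalar sketch of the same mantissa trick, useful for checking edge cases such as an all-ones input. The function name bits_to_unit_float is an assumption; 0x3f800000 is the IEEE 754 bit pattern of 1.0f.

```cpp
#include <cstdint>
#include <cstring>

// Maps 32 random bits to a float in [0.0f, 1.0f) using only the 23 low bits.
static float bits_to_unit_float(uint32_t bits) {
    const uint32_t one = 0x3f800000u;           // bit pattern of 1.0f
    uint32_t u = (bits & 0x007fffffu) | one;    // value in [1.0f, 2.0f)
    float f;
    std::memcpy(&f, &u, sizeof f);              // reinterpret, no numeric conversion
    return f - 1.0f;                            // shift down to [0.0f, 1.0f)
}
```

Because the exponent field is always overwritten, no input pattern can produce a NaN or infinity, which answers Question 1 for this approach.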
As @Soonts has pointed out, floats can be created uniformly in the [0, 1] range:
https://stackoverflow.com/a/54873925/9007125
I ended up using the answer below:
https://stackoverflow.com/a/54893167/9007125
//converts __m256i values into __m256 values, that contains floats in [0,1] range.
//https://stackoverflow.com/a/54893167/9007125
inline void int_rand_int_toFloat01( const __m256i* m256i_vals,
__m256* m256f_vals){ //<-- stores here.
const static __m256 c = _mm256_set1_ps(0x1.0p-24f); // or (1.0f / (uint32_t(1) << 24));
__m256i* rnd = ((__m256i*)m256i_vals);
__m256* output = ((__m256*)m256f_vals);
// remember that '_mm256_cvtepi32_ps' will convert 32-bit ints into a 32-bit floats
__m256 converted = _mm256_cvtepi32_ps(_mm256_srli_epi32(*rnd, 8));
*output = _mm256_mul_ps( converted, c);
}
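For comparison, a scalar sketch of what each lane of that function computes. The name top24_to_unit_float is hypothetical; the arithmetic (shift right by 8, convert, multiply by 2^-24) follows the code above.

```cpp
#include <cstdint>

// Keeps the top 24 bits of the input and scales them into [0.0f, 1.0f).
static float top24_to_unit_float(uint32_t bits) {
    // Every integer below 2^24 is exactly representable as a float,
    // so the conversion does not round.
    return float(bits >> 8) * (1.0f / 16777216.0f);   // 16777216 = 2^24
}
```

The largest result is 1 - 2^-24, so 1.0f itself is never returned, and the shift guarantees the signed conversion used by _mm256_cvtepi32_ps never sees a negative value.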

Find min/max value from a __m128i

I want to find the minimum/maximum value in an array of bytes using SIMD operations. So far I have been able to go through the array and store the minimum/maximum values in a __m128i variable, but that means the value I am looking for is mixed in with others (15 others, to be exact).
I've found these discussions here and here for integers, and this page for floats, but I don't understand how _mm_shuffle* works. So my questions are:
What SIMD operations do I have to perform in order to extract the minimum / maximum byte (or unsigned byte) value from the __m128i variable?
How does _mm_shuffle* work? I don't get it when I look to the "minimal" documentation online. I know it is related to the _MM_SHUFFLE macro, but I don't get the example.
Here is an example for horizontal max for uint8_t:
#include "tmmintrin.h" // requires SSSE3
__m128i _mm_hmax_epu8(const __m128i v)
{
__m128i vmax = v;
vmax = _mm_max_epu8(vmax, _mm_alignr_epi8(vmax, vmax, 1));
vmax = _mm_max_epu8(vmax, _mm_alignr_epi8(vmax, vmax, 2));
vmax = _mm_max_epu8(vmax, _mm_alignr_epi8(vmax, vmax, 4));
vmax = _mm_max_epu8(vmax, _mm_alignr_epi8(vmax, vmax, 8));
return vmax;
}
The max value will be returned in all elements. If you need the value as a scalar then use _mm_extract_epi8.
It should be fairly obvious how to adapt this for min, and for signed min/max.
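To make the cascade easier to follow, here is a scalar model of it. It relies on the fact that _mm_alignr_epi8(v, v, n) rotates the 16 bytes so that element i receives element (i + n) mod 16; the function name hmax_model is an assumption.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Scalar model of the rotate-and-max cascade: after steps 1, 2, 4 and 8,
// every element holds the maximum of all 16 inputs.
std::array<uint8_t, 16> hmax_model(std::array<uint8_t, 16> v) {
    for (int shift = 1; shift < 16; shift *= 2) {
        std::array<uint8_t, 16> rotated;
        for (int i = 0; i < 16; ++i)
            rotated[i] = v[(i + shift) % 16];   // models _mm_alignr_epi8(v, v, shift)
        for (int i = 0; i < 16; ++i)
            v[i] = std::max(v[i], rotated[i]);  // models _mm_max_epu8
    }
    return v;
}
```

Replacing std::max with std::min gives the model for the min variant; each step doubles the number of inputs that every element has seen.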
Alternatively, convert to words and use phminposuw (not tested)
int hminu8(__m128i x)
{
__m128i l = _mm_unpacklo_epi8(x, _mm_setzero_si128());
__m128i h = _mm_unpackhi_epi8(x, _mm_setzero_si128());
l = _mm_minpos_epu16(l);
h = _mm_minpos_epu16(h);
return _mm_extract_epi16(_mm_min_epu16(l, h), 0);
}
By my quick count, the latency is a bit worse than a min/shuffle cascade, but the throughput a bit better. The linked answer with phminposuw is probably better though. Adapted for unsigned bytes (but not tested)
uint8_t hminu8(__m128i x)
{
x = _mm_min_epu8(x, _mm_srli_epi16(x, 8));
x = _mm_minpos_epu16(x);
return _mm_cvtsi128_si32(x);
}
You could use it for max too, but with a bit of overhead: complement the input and result.
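The complement trick rests on the identity max(x) = ~min(~x) for unsigned bytes. A scalar sketch, with the hypothetical name hmax_via_min:

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Computes the maximum of 16 unsigned bytes using only a minimum:
// complement every input, take the minimum, complement the result.
uint8_t hmax_via_min(const std::array<uint8_t, 16>& v) {
    uint8_t m = 0xFF;
    for (uint8_t b : v)
        m = std::min<uint8_t>(m, uint8_t(~b));  // min over complemented bytes
    return uint8_t(~m);                          // complement back
}
```

In the SIMD version the two complements would each be a single XOR with an all-ones vector.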

Query points on the vertices of a Hamming cube

I have N points that lie only on the vertices of a cube, of dimension D, where D is something like 3.
A vertex may not contain any point. So every point has coordinates in {0, 1}^D. I am only interested in query time, as long as the memory cost is reasonable ( not exponential in N for example :) ).
Given a query that lies on one of the cube's vertices and an input parameter r, find all the vertices (thus points) that have hamming distance <= r with the query.
What's the way to go in a c++ environment?
I am thinking of a kd-tree, but I am not sure and want help, any input, even approximative, would be appreciated! Since hamming distance comes into play, bitwise manipulations should help (e.g. XOR).
There is a nice bithack to go from one bitmask with k bits set to the lexicographically next permutation, which means it's fairly simple to loop through all masks with k bits set. XORing these masks with an initial value gives all the values at hamming distance exactly k away from it.
So for D dimensions, where D is less than 32 (otherwise change the types),
uint32_t limit = (1u << D) - 1;
for (int k = 1; k <= r; k++) {
uint32_t diff = (1u << k) - 1;
while (diff <= limit) {
// v is the input vertex
uint32_t vertex = v ^ diff;
// use it
diff = nextBitPermutation(diff);
}
}
Where nextBitPermutation may be implemented in C++ as something like (if you have __builtin_ctz)
uint32_t nextBitPermutation(uint32_t v) {
// see https://graphics.stanford.edu/~seander/bithacks.html#NextBitPermutation
uint32_t t = v | (v - 1);
return (t + 1) | (((~t & -~t) - 1) >> (__builtin_ctz(v) + 1));
}
Or for MSVC (not tested)
uint32_t nextBitPermutation(uint32_t v) {
// see https://graphics.stanford.edu/~seander/bithacks.html#NextBitPermutation
uint32_t t = v | (v - 1);
unsigned long tzc;
_BitScanForward(&tzc, v); // v != 0 so the return value doesn't matter
return (t + 1) | (((~t & -~t) - 1) >> (tzc + 1));
}
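A self-contained sketch of the whole enumeration that compiles without compiler-specific builtins. It swaps in Gosper's hack (which uses a division) in place of the ctz-based formula, and it also emits the query vertex itself for distance 0; the names next_same_popcount and hamming_ball are assumptions for this example.

```cpp
#include <cstdint>
#include <vector>

// Gosper's hack: next larger integer with the same number of set bits.
// Precondition: x != 0.
static uint32_t next_same_popcount(uint32_t x) {
    uint32_t c = x & (0u - x);          // lowest set bit
    uint32_t r = x + c;                 // ripple the lowest run of ones upward
    return (((r ^ x) >> 2) / c) | r;    // refill the low end with the remaining ones
}

// All vertices of the D-cube (D < 32) within Hamming distance r of v,
// ordered by increasing distance.
std::vector<uint32_t> hamming_ball(uint32_t v, int D, int r) {
    std::vector<uint32_t> out;
    out.push_back(v);                   // distance 0: the query itself
    const uint32_t limit = (1u << D) - 1;
    for (int k = 1; k <= r && k <= D; ++k) {
        for (uint32_t diff = (1u << k) - 1; diff <= limit;
             diff = next_same_popcount(diff))
            out.push_back(v ^ diff);    // flip exactly k coordinates
    }
    return out;
}
```

Each candidate still has to be checked against the set of occupied vertices, for example with a bitset or hash set keyed by the vertex value.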
If D is really low, 4 or lower, the old popcnt-with-pshufb works really well and generally everything just lines up well, like this:
uint16_t query(int vertex, int r, int8_t* validmask)
{
// validmask should be array of 16 int8_t's,
// 0 for a vertex that doesn't exist, -1 if it does
__m128i valid = _mm_loadu_si128((__m128i*)validmask);
__m128i t0 = _mm_set1_epi8(vertex);
__m128i r0 = _mm_set1_epi8(r + 1);
__m128i all = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
__m128i popcnt_lut = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
__m128i dist = _mm_shuffle_epi8(popcnt_lut, _mm_xor_si128(t0, all));
__m128i close_enough = _mm_cmpgt_epi8(r0, dist);
__m128i result = _mm_and_si128(close_enough, valid);
return _mm_movemask_epi8(result);
}
This should be fairly fast; fast compared to the bithack above (nextBitPermutation, which is fairly heavy, is used a lot there) and also compared to looping over all vertices and testing whether they are in range (even with builtin popcnt, that automatically takes at least 16 cycles and the above shouldn't, assuming everything is cached or even permanently in a register). The downside is the result is annoying to work with, since it's a mask of which vertices both exist and are in range of the queried point, not a list of them. It would combine well with doing some processing on data associated with the points though.
This also scales down to D=3 of course, just make none of the points >= 8 valid. D>4 can be done similarly but it takes more code then, and since this is really a brute force solution that is only fast due to parallelism it fundamentally gets slower exponentially in D.
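A scalar reference with the same inputs and the same result layout can be handy for validating the SIMD query. This is only a sketch; the name query_reference is made up, and it assumes D <= 4 so that vertices are numbered 0 to 15.

```cpp
#include <cstdint>

// Scalar reference for the D <= 4 query: bit i of the result is set when
// vertex i exists (validmask[i] != 0) and popcount(vertex ^ i) <= r.
uint16_t query_reference(int vertex, int r, const int8_t* validmask) {
    uint16_t result = 0;
    for (int i = 0; i < 16; ++i) {
        int x = (vertex ^ i) & 0xF;
        int dist = 0;
        for (; x != 0; x &= x - 1) ++dist;   // clear the lowest set bit each round
        if (dist <= r && validmask[i] != 0)
            result |= uint16_t(1u << i);
    }
    return result;
}
```

For example, with all 16 vertices valid, a query at vertex 0 with r = 1 sets the bits for vertices 0, 1, 2, 4 and 8.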

AVX, Horizontal Sum of Single Precision Complex Numbers?

I have a 256 bit AVX register containing 4 single precision complex numbers stored as real, imaginary, real, imaginary, etc. I'm currently writing the entire 256 bit register back to memory and summing it there, but that seems inefficient.
How can the complex number horizontal sum be performed using AVX (or AVX2) intrinsics? I would accept an answer using assembly if there is not an answer with comparable efficiency using intrinsics.
Edit: To clarify, if the register contains AR, AI, BR, BI, CR, CI, DR, DI, I want to compute the complex number (AR + BR + CR + DR, AI + BI + CI + DI). If the result is in a 256 bit register, I can extract the 2 single precision floating point numbers.
Edit2: Potential solution, though not necessarily optimal...
float hsum_ps_sse3(__m128 v) {
__m128 shuf = _mm_movehdup_ps(v); // broadcast elements 3,1 to 2,0
__m128 sums = _mm_add_ps(v, shuf);
shuf = _mm_movehl_ps(shuf, sums); // high half -> low half
sums = _mm_add_ss(sums, shuf);
return _mm_cvtss_f32(sums);
}
float sumReal = 0.0;
float sumImaginary = 0.0;
__m256i mask = _mm256_set_epi32 (7, 5, 3, 1, 6, 4, 2, 0);
// Separate real and imaginary.
__m256 permutedSum = _mm256_permutevar8x32_ps(sseSum0, mask);
__m128 realSum = _mm256_extractf128_ps(permutedSum , 0);
__m128 imaginarySum = _mm256_extractf128_ps(permutedSum , 1);
// Horizontally sum real and imaginary.
sumReal = hsum_ps_sse3(realSum);
sumImaginary = hsum_ps_sse3(imaginarySum);
One fairly straightforward solution which requires only AVX (not AVX2):
__m128 v0 = _mm256_castps256_ps128(v); // get low 2 complex values
__m128 v1 = _mm256_extractf128_ps(v, 1); // get high 2 complex values
v0 = _mm_add_ps(v0, v1); // add high and low
v1 = _mm_shuffle_ps(v0, v0, _MM_SHUFFLE(1, 0, 3, 2));
v0 = _mm_add_ps(v0, v1); // combine two halves of result
The result will be in v0 as { sum.re, sum.im, sum.re, sum.im }.
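A scalar sketch that mirrors the two folding steps, which may help when verifying the vector code. The function name and the use of std::array are assumptions for illustration.

```cpp
#include <array>

// v holds 4 complex numbers interleaved as re0, im0, re1, im1, ...
// Step 1 folds the upper 128-bit half onto the lower one; step 2 folds
// the two remaining complex values together.
std::array<float, 2> complex_hsum_reference(const std::array<float, 8>& v) {
    float half[4];
    for (int i = 0; i < 4; ++i)
        half[i] = v[i] + v[i + 4];                    // models the add of low and high halves
    return { half[0] + half[2], half[1] + half[3] };  // { sum.re, sum.im }
}
```

Note that the order of the floating-point additions matches the vector code, so the results should agree bit for bit with it, though not necessarily with a plain left-to-right sum.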

Grayscale bilinear patch extraction - SSE optimization

My program makes an intensive use of small sub-images extracted using bilinear interpolation from larger grayscale images.
I am using the following function for this purpose:
bool extract_patch_bilin(const cv::Point2f &patch_ctr, const cv::Mat_<uchar> &img, cv::Mat_<uchar> &patch)
{
const int hsize = patch.rows/2;
// ...
// Precondition checks: patch is a preallocated square matrix and both patch and image have continuous buffers
// ...
int floorx=(int)floor(patch_ctr.x)-hsize, floory=(int)floor(patch_ctr.y)-hsize;
if(floorx<0 || img.cols-1<floorx+patch.cols || floory<0 || img.rows-1<floory+patch.rows)
return false;
float x=patch_ctr.x-hsize-floorx;
float y=patch_ctr.y-hsize-floory;
float xy = x*y;
float w00=1-x-y+xy, w01=x-xy, w10=y-xy, w11=xy;
int img_stride = img.cols-patch.cols;
uchar* buff_img0 = (uchar*)img.data+img.cols*floory+floorx;
uchar* buff_img1 = buff_img0+img.cols;
uchar* buff_patch = (uchar*)patch.data;
for(int v=0; v<patch.rows; ++v,buff_img0+=img_stride,buff_img1+=img_stride) {
for(int u=0; u<patch.cols; ++u,++buff_patch,++buff_img0,++buff_img1)
buff_patch[0] = cv::saturate_cast<uchar>(buff_img0[0]*w00+buff_img0[1]*w01+buff_img1[0]*w10+buff_img1[1]*w11);
}
return true;
}
Long story short, I am already using parallelization in other parts of the program, and I am considering using SSE to optimize the execution of this function, because I am mostly using 8x8 patches and it seems like a good idea to process bunches of 8 pixels at a time using SSE.
However, I am not sure how to deal with the multiplication by the float interpolation weights (i.e. w00, w01, w10 and w11). These weights are necessarily positive and smaller than 1, hence the multiplication cannot overflow the unsigned char datatype.
Does anyone know how to proceed ?
EDIT:
I tried to do this as follows (assuming 16x16 patches), but there is no significant speed-up:
bool extract_patch_bilin_16x16(const cv::Point2f& patch_ctr, const cv::Mat_<uchar> &img, cv::Mat_<uchar> &patch)
{
// ...
// Precondition checks
// ...
const int hsize = patch.rows/2;
int floorx=(int)floor(patch_ctr.x)-hsize, floory=(int)floor(patch_ctr.y)-hsize;
// Check that the full extracted patch is inside the image
if(floorx<0 || img.cols-1<floorx+patch.cols || floory<0 || img.rows-1<floory+patch.rows)
return false;
// Compute the constant bilinear weights
float x=patch_ctr.x-hsize-floorx;
float y=patch_ctr.y-hsize-floory;
float xy = x*y;
float w00=1-x-y+xy, w01=x-xy, w10=y-xy, w11=xy;
// Prepare image resampling loop
int img_stride = img.cols-patch.cols;
uchar* buff_img0 = (uchar*)img.data+img.cols*floory+floorx;
uchar* buff_img1 = buff_img0+img.cols;
uchar* buff_patch = (uchar*)patch.data;
// Precompute weighting variables
const __m128i CONST_0 = _mm_setzero_si128();
__m128i w00x256_32i = _mm_set1_epi32(cvRound(w00*256));
__m128i w01x256_32i = _mm_set1_epi32(cvRound(w01*256));
__m128i w10x256_32i = _mm_set1_epi32(cvRound(w10*256));
__m128i w11x256_32i = _mm_set1_epi32(cvRound(w11*256));
__m128i w00x256_16i = _mm_packs_epi32(w00x256_32i,w00x256_32i);
__m128i w01x256_16i = _mm_packs_epi32(w01x256_32i,w01x256_32i);
__m128i w10x256_16i = _mm_packs_epi32(w10x256_32i,w10x256_32i);
__m128i w11x256_16i = _mm_packs_epi32(w11x256_32i,w11x256_32i);
// Process pixels
int ngroups = patch.rows>>4;
for(int v=0; v<patch.rows; ++v,buff_img0+=img_stride,buff_img1+=img_stride) {
for(int g=0; g<ngroups; ++g,buff_patch+=16,buff_img0+=16,buff_img1+=16) {
////////////////////////////////
// Load the data (16 pixels in one load)
////////////////////////////////
__m128i val00 = _mm_loadu_si128((__m128i*)buff_img0);
__m128i val01 = _mm_loadu_si128((__m128i*)(buff_img0+1));
__m128i val10 = _mm_loadu_si128((__m128i*)buff_img1);
__m128i val11 = _mm_loadu_si128((__m128i*)(buff_img1+1));
////////////////////////////////
// Process the lower 8 values
////////////////////////////////
// Unpack into 16-bits integers
__m128i val00_lo = _mm_unpacklo_epi8(val00,CONST_0);
__m128i val01_lo = _mm_unpacklo_epi8(val01,CONST_0);
__m128i val10_lo = _mm_unpacklo_epi8(val10,CONST_0);
__m128i val11_lo = _mm_unpacklo_epi8(val11,CONST_0);
// Multiply with the integer weights
__m128i w256val00_lo = _mm_mullo_epi16(val00_lo,w00x256_16i);
__m128i w256val01_lo = _mm_mullo_epi16(val01_lo,w01x256_16i);
__m128i w256val10_lo = _mm_mullo_epi16(val10_lo,w10x256_16i);
__m128i w256val11_lo = _mm_mullo_epi16(val11_lo,w11x256_16i);
// Divide by 256 to get the approximate result of the multiplication with floating-point weights
__m128i wval00_lo = _mm_srli_epi16(w256val00_lo,8);
__m128i wval01_lo = _mm_srli_epi16(w256val01_lo,8);
__m128i wval10_lo = _mm_srli_epi16(w256val10_lo,8);
__m128i wval11_lo = _mm_srli_epi16(w256val11_lo,8);
// Add pairwise
__m128i sum0_lo = _mm_add_epi16(wval00_lo,wval01_lo);
__m128i sum1_lo = _mm_add_epi16(wval10_lo,wval11_lo);
__m128i final_lo = _mm_add_epi16(sum0_lo,sum1_lo);
////////////////////////////////
// Process the higher 8 values
////////////////////////////////
// Unpack into 16-bits integers
__m128i val00_hi = _mm_unpackhi_epi8(val00,CONST_0);
__m128i val01_hi = _mm_unpackhi_epi8(val01,CONST_0);
__m128i val10_hi = _mm_unpackhi_epi8(val10,CONST_0);
__m128i val11_hi = _mm_unpackhi_epi8(val11,CONST_0);
// Multiply with the integer weights
__m128i w256val00_hi = _mm_mullo_epi16(val00_hi,w00x256_16i);
__m128i w256val01_hi = _mm_mullo_epi16(val01_hi,w01x256_16i);
__m128i w256val10_hi = _mm_mullo_epi16(val10_hi,w10x256_16i);
__m128i w256val11_hi = _mm_mullo_epi16(val11_hi,w11x256_16i);
// Divide by 256 to get the approximate result of the multiplication with floating-point weights
__m128i wval00_hi = _mm_srli_epi16(w256val00_hi,8);
__m128i wval01_hi = _mm_srli_epi16(w256val01_hi,8);
__m128i wval10_hi = _mm_srli_epi16(w256val10_hi,8);
__m128i wval11_hi = _mm_srli_epi16(w256val11_hi,8);
// Add pairwise
__m128i sum0_hi = _mm_add_epi16(wval00_hi,wval01_hi);
__m128i sum1_hi = _mm_add_epi16(wval10_hi,wval11_hi);
__m128i final_hi = _mm_add_epi16(sum0_hi,sum1_hi);
////////////////////////////////
// Repack all values
////////////////////////////////
__m128i final_val = _mm_packus_epi16(final_lo,final_hi);
_mm_storeu_si128((__m128i*)buff_patch,final_val);
}
}
return true;
}
Any idea what could be done to improve the speed-up ?
I would consider sticking to integers: your weights are multiples of 1/64, so fixed-point 8.6 is enough, and that fits in 16-bit numbers.
Bilinear interpolation is best done as three linear ones (two on Y then one on X; you can reuse the second Y interpolation for the neighboring patch).
To perform a linear interpolation between two values, you pre-store the interpolation weights P and Q once and for all (8 to 1 and 0 to 7), and multiply and add them in pairs like V0.P[i]+V1.Q[i]. This is efficiently done using the PMADDUBSW instruction (after appropriate data interleaving, and replication of the values V0 and V1, with PUNPCKLBW and the like).
In the end, divide by the total weight (PSRLW), rescale to bytes (PACKUSWB). (This step can be performed once only, combining the two interpolations.)
You could think of doubling all weights, so that the final scaling is by 8 bits, and PACKUSWB would suffice, but unfortunately it saturates the values and there is no unsaturated equivalent.
It could be that precomputing all 64 interpolation weights and summing the four bilinear terms is better.
UPDATE:
If the goal is to interpolate with fixed coefficients for all pixels quads (actually achieving subpixel translation), the strategy is different.
You will load a run of 8 (16 ?) pixels corresponding to the upper-left corners, a run of 8 shifted one pixel to the right (corresponding to the upper-right corners), and similarly for the next row (bottom corners); multiply and add in pairs (PMADDUBSW) the pixel values to the corresponding interpolation weights, and combine the pairs (PADDW). Store the weights with replication.
Another option will be to avoid the (PMADD) and perform separate multiplies (PMULLW) and adds (PADDW). This will simplify the reorganization scheme.
After scaling (as above), you end up with a run of 8 interpolated values.
This can work as well for variable interpolation weights, as long as you interpolate exactly one pixel per quad.
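The fixed-point scheme can be sketched in scalar form as follows, assuming the sub-pixel offsets are quantized to eighths so that the four weights are multiples of 1/64. The function name bilinear_q6 and the rounding term are assumptions added for illustration.

```cpp
#include <cstdint>

// fx, fy: sub-pixel offsets in eighths (0..7). The four weights are
// multiples of 1/64 and always sum to 64, so the weighted sum of 8-bit
// pixels stays below 2^14 and fits in a 16-bit lane.
uint8_t bilinear_q6(uint8_t p00, uint8_t p01, uint8_t p10, uint8_t p11,
                    unsigned fx, unsigned fy) {
    unsigned w00 = (8 - fx) * (8 - fy);
    unsigned w01 = fx * (8 - fy);
    unsigned w10 = (8 - fx) * fy;
    unsigned w11 = fx * fy;
    unsigned sum = p00 * w00 + p01 * w01 + p10 * w10 + p11 * w11;
    return uint8_t((sum + 32) >> 6);   // divide by 64 with rounding
}
```

Summing all four products before the single shift avoids the per-term truncation that occurs when each product is shifted separately.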