Seeded Random Uniform float generator using SIMD? [duplicate] - c++

I have a __m256 value that holds random bits.
I would like to "interpret" it, to obtain another __m256 that holds float
values in a uniform [0.0f, 1.0f] range.
Planning to do it using:
__m256 randomBits = /* generated random bits, uniformly distributed */;
__m256 invFloatRange = _mm256_set1_ps( numeric_limits<float>::min() ); //min is a smallest increment of float precision
__m256 float01 = _mm256_mul_ps(randomBits, invFloatRange);
//float01 is now ready to be used
Question 1:
However, will this cause a problem in very rare cases where randomBits has all bits as 1 and is therefore NAN?
What can I do to protect myself from this?
I want the float01 to always be a usable number
Question 2:
Will the [0 to 1] range remain uniform after I obtain it using the above approach? I know float has varying precision at different magnitudes

When reinterpreting an int32_t as a float, one can do the following:
auto const one = _mm256_set1_epi32(0x3f800000); // bit pattern of 1.0f
a = _mm256_and_si256(a, _mm256_set1_epi32(0x007fffff));
a = _mm256_or_si256(a, one);
return _mm256_sub_ps(_mm256_castsi256_ps(a), _mm256_castsi256_ps(one));
The and/or sequence reuses the 23 LSBs of the input as the mantissa and forces the exponent of 1.0f, which produces uniformly distributed values with 1.0f <= a < 2.0f. Subtracting 1.0f then removes the bias and leaves a result in [0.0f, 1.0f). Because the exponent bits are overwritten, the result can never be NaN or infinity.
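To make the bit manipulation easier to follow, here is a minimal scalar sketch of what one lane of that sequence computes. The function name bits_to_float01 is made up for this illustration, and the sketch assumes IEEE 754 single-precision floats.

```cpp
#include <cstdint>
#include <cstring>

// Scalar model of one lane: keep the 23 low bits as the mantissa, force the
// exponent of 1.0f (0x3f800000), then subtract 1.0f to land in [0, 1).
// bits_to_float01 is a hypothetical name used only for this sketch.
inline float bits_to_float01(std::uint32_t bits) {
    const std::uint32_t one_bits = 0x3f800000u;         // bit pattern of 1.0f
    std::uint32_t u = (bits & 0x007fffffu) | one_bits;  // 1.0f <= value < 2.0f
    float f;
    std::memcpy(&f, &u, sizeof f);                      // reinterpret the bits, no conversion
    return f - 1.0f;                                    // remove the bias
}
```

Even an all-ones input cannot produce NaN here, because the sign and exponent bits are replaced before the value is used as a float.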

As @Soonts has pointed out, floats can be created uniformly in the [0, 1] range:
https://stackoverflow.com/a/54873925/9007125
I ended up using the answer below:
https://stackoverflow.com/a/54893167/9007125
// Converts __m256i values into __m256 values that contain floats in the [0, 1) range.
// https://stackoverflow.com/a/54893167/9007125
inline void int_rand_int_toFloat01(const __m256i* m256i_vals,
                                   __m256* m256f_vals) { // <-- stores here.
    const __m256 c = _mm256_set1_ps(0x1.0p-24f); // or (1.0f / (uint32_t(1) << 24))
    // Keep only the top 24 bits: the value is then non-negative and fits a float mantissa exactly.
    // Remember that '_mm256_cvtepi32_ps' converts signed 32-bit ints into 32-bit floats.
    __m256 converted = _mm256_cvtepi32_ps(_mm256_srli_epi32(*m256i_vals, 8));
    *m256f_vals = _mm256_mul_ps(converted, c);
}
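As a cross-check for the vector version, the same shift/convert/scale steps can be written for a single value. This is only a scalar sketch; uint_to_float01 is a hypothetical helper name.

```cpp
#include <cstdint>

// Scalar model of one lane of the shift / convert / scale sequence.
// uint_to_float01 is a hypothetical name used only for this sketch.
inline float uint_to_float01(std::uint32_t r) {
    const float scale = 1.0f / 16777216.0f;                  // 2^-24
    std::int32_t top24 = static_cast<std::int32_t>(r >> 8);  // 0 .. 2^24 - 1, never negative
    return static_cast<float>(top24) * scale;                // exact: 24 bits fit in a float
}
```

The largest possible result is 1 - 2^-24, so the output range is [0, 1) with 2^24 equally spaced values.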

Related

can i speed up more than _mm256_i32gather_epi32

I made a gamma conversion code for 4k video
/** gamma0
input range : 0 ~ 1,023
output range : 0 ~ ?
*/
v00 = _mm256_unpacklo_epi16(v0, _mm256_setzero_si256());
v01 = _mm256_unpackhi_epi16(v0, _mm256_setzero_si256());
v10 = _mm256_unpacklo_epi16(v1, _mm256_setzero_si256());
v11 = _mm256_unpackhi_epi16(v1, _mm256_setzero_si256());
v20 = _mm256_unpacklo_epi16(v2, _mm256_setzero_si256());
v21 = _mm256_unpackhi_epi16(v2, _mm256_setzero_si256());
v00 = _mm256_i32gather_epi32(csv->gamma0LUT, v00, 4);
v01 = _mm256_i32gather_epi32(csv->gamma0LUT, v01, 4);
v10 = _mm256_i32gather_epi32(csv->gamma0LUTc, v10, 4);
v11 = _mm256_i32gather_epi32(csv->gamma0LUTc, v11, 4);
v20 = _mm256_i32gather_epi32(csv->gamma0LUTc, v20, 4);
v21 = _mm256_i32gather_epi32(csv->gamma0LUTc, v21, 4);
I want to implement a LUT (look-up table) with 10-bit input and 10- to 13-bit output, but AVX2 gather instructions only support 32-bit (or 64-bit) elements.
So I had to widen the indices to 32 bits and use _mm256_i32gather_epi32.
This is the most severe performance bottleneck in my code. Is there any way to improve it?
Since the context of your question is still a bit vague to me, here are just some general ideas you could try (some may be only slightly better, or even worse, than what you have at the moment; all code below is untested):
LUT with 16 bit values using _mm256_i32gather_epi32
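Even though the gather loads 32-bit values, you can still use a scale of 2 as the last argument of _mm256_i32gather_epi32, so that each lane reads 4 bytes starting at byte offset 2*idx. Because those reads overlap the neighboring entries, make sure that the 2 bytes before and after your LUT are readable.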
static const int16_t LUT[1024+2] = { 0, val0, val1, ..., val1022, val1023, 0};
__m256i high_idx = _mm256_srli_epi32(v, 16);
__m256i low_idx = _mm256_blend_epi16(v, _mm256_setzero_si256(), 0xAA);
__m256i high_val = _mm256_i32gather_epi32((int const*)(LUT+0), high_idx, 2);
__m256i low_val = _mm256_i32gather_epi32((int const*)(LUT+1), low_idx, 2);
__m256i values = _mm256_blend_epi16(low_val, high_val, 0xAA);
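To see why the guard entries and the LUT+0 / LUT+1 base pointers are needed, here is a scalar sketch that emulates one 32-bit lane of the two overlapping gathers. The helper names load32_scale2 and lookup_pair are invented for this sketch, and it assumes a little-endian target, as on x86.

```cpp
#include <cstdint>
#include <cstring>

// Emulates one lane of _mm256_i32gather_epi32(base, idx, 2): a 32-bit load
// from byte offset 2 * idx. Assumes a little-endian target (hypothetical helper).
inline std::uint32_t load32_scale2(const std::int16_t* base, std::uint32_t idx) {
    std::uint32_t word;
    std::memcpy(&word, reinterpret_cast<const unsigned char*>(base) + 2 * idx, sizeof word);
    return word;
}

// `padded` holds 1 guard entry, then the table values, then 1 guard entry.
// `packed` carries two 16-bit indices: one in the low half, one in the high half.
inline std::uint32_t lookup_pair(const std::int16_t* padded, std::uint32_t packed) {
    std::uint32_t hi = load32_scale2(padded + 0, packed >> 16);      // wanted value lands in bits 16..31
    std::uint32_t lo = load32_scale2(padded + 1, packed & 0xffffu);  // wanted value lands in bits 0..15
    return (hi & 0xffff0000u) | (lo & 0x0000ffffu);                  // same role as the final blend
}
```

The load for the high index touches the entry before the wanted one, and the load for the low index touches the entry after it, which is where the two guard entries come from.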
Join two values into one LUT-entry
For small-ish LUTs, you could calculate an index from two neighboring indexes as (idx_hi << 10) + idx_low and look up the corresponding tuple directly. However, instead of 2KiB you would have a 4 MiB LUT in your case, which likely hurts caching -- but you only have half the number of gather instructions.
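A possible way to build such a combined table is sketched below. The function name build_pair_lut and the choice of packing both results into one 32-bit entry are assumptions of this sketch, not something taken from the question.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds a table indexed by (idx_hi << bits) + idx_lo whose entries hold both
// results, so one 32-bit lookup replaces two. With bits = 10 and 32-bit
// entries the table takes 2^20 * 4 bytes = 4 MiB.
// build_pair_lut is a hypothetical name; it assumes lut.size() == (1 << bits).
inline std::vector<std::uint32_t> build_pair_lut(const std::vector<std::uint16_t>& lut,
                                                 unsigned bits) {
    const std::uint32_t n = 1u << bits;
    std::vector<std::uint32_t> pairs(static_cast<std::size_t>(n) * n);
    for (std::uint32_t hi = 0; hi < n; ++hi)
        for (std::uint32_t lo = 0; lo < n; ++lo)
            pairs[(static_cast<std::size_t>(hi) << bits) + lo] =
                (static_cast<std::uint32_t>(lut[hi]) << 16) | lut[lo];
    return pairs;
}
```

Whether halving the number of gathers outweighs the extra cache pressure of the larger table would have to be measured on the actual workload.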
Polynomial approximation
Mathematically, any continuous function on a closed, bounded interval can be approximated arbitrarily well by a polynomial. You could either convert your values to float, evaluate the polynomial and convert the result back, or do it directly with fixed-point multiplications (note that _mm256_mulhi_epi16/_mm256_mulhi_epu16 compute (a * b) >> 16, which is convenient if one factor actually represents a value in [0, 1)).
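The fixed-point variant can be sketched in scalar form as follows. The function names and the coefficients are placeholders, not a fitted gamma curve, and the sketch assumes the intermediate sums do not overflow 16 bits.

```cpp
#include <cstdint>

// One lane of _mm256_mulhi_epu16: the high 16 bits of the 32-bit product.
inline std::uint16_t mulhi_u16(std::uint16_t a, std::uint16_t b) {
    return static_cast<std::uint16_t>((static_cast<std::uint32_t>(a) * b) >> 16);
}

// Evaluates c0 + c1*t + c2*t^2 by Horner's rule, where t = x / 65536 is a
// fraction in [0, 1). poly2_fixed and its coefficients are hypothetical;
// intermediate sums are assumed not to overflow 16 bits.
inline std::uint16_t poly2_fixed(std::uint16_t x, std::uint16_t c0,
                                 std::uint16_t c1, std::uint16_t c2) {
    std::uint16_t acc = c2;
    acc = static_cast<std::uint16_t>(c1 + mulhi_u16(acc, x));  // c1 + c2*t
    acc = static_cast<std::uint16_t>(c0 + mulhi_u16(acc, x));  // c0 + (c1 + c2*t)*t
    return acc;
}
```

For a 10-bit input, x would first be shifted left by 6 so that the full input range maps onto [0, 1).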
8 bit, 16 entry LUT with linear interpolation
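SSSE3/AVX2 provide the pshufb instruction (_mm_shuffle_epi8/_mm256_shuffle_epi8), which can be used as an 8-bit LUT with 16 entries (and an implicit 0 entry when the high bit of the index byte is set). With the 256-bit version the shuffle works within each 128-bit lane, so the 16-entry table has to be present in both lanes.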
Proof-of-concept implementation:
__m256i idx = _mm256_srli_epi16(v, 6); // shift highest 4 bits to the right
idx = _mm256_mullo_epi16(idx, _mm256_set1_epi16(0x0101)); // duplicate idx, maybe _mm256_shuffle_epi8 is better?
idx = _mm256_sub_epi8(idx, _mm256_set1_epi16(0x0001)); // subtract 1 from lower idx, 0 is mapped to 0xff
__m256i lut_vals = _mm256_shuffle_epi8(LUT, idx); // implicitly: LUT[-1] = 0
// get fractional part of input value:
__m256i dv = _mm256_and_si256(v, _mm256_set1_epi16(0x003f)); // lowest 6 bits of each 16-bit lane
dv = _mm256_mullo_epi16(dv, _mm256_set1_epi16(0xff01)); // dv = [-dv, dv]
dv = _mm256_add_epi8(dv, _mm256_set1_epi16(0x4000)); // dv = [0x40-(v&0x3f), (v&0x3f)];
__m256i res = _mm256_maddubs_epi16(lut_vals, dv); // switch order depending on whether LUT values are (un)signed.
// probably shift res to the right, depending on the scale of your LUT values
You could also combine this with first doing a linear or quadratic approximation and just calculating the difference to your target function.
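When testing a SIMD version like the one above, a scalar reference for the interpolation is handy. The sketch below uses a 17-entry table of segment endpoints, which is an assumption made here for clarity and differs from the implicit-zero layout used with pshufb; lerp_lut10 is a hypothetical name.

```cpp
#include <cstdint>

// Scalar reference for a 16-segment piecewise-linear LUT on a 10-bit input.
// `knots` has 17 entries (segment endpoints); this layout is an assumption of
// the sketch, not the layout of the pshufb version.
inline std::uint16_t lerp_lut10(const std::uint8_t knots[17], std::uint16_t v) {
    unsigned idx  = (v >> 6) & 0xfu;  // top 4 of the 10 bits select the segment
    unsigned frac = v & 0x3fu;        // low 6 bits: position inside the segment
    // The two weights sum to 64, so the result is scaled by 64.
    return static_cast<std::uint16_t>(knots[idx] * (64u - frac) + knots[idx + 1] * frac);
}
```

As in the vector code, the result still carries a factor of 64 and would normally be shifted right afterwards, depending on the scale of the table values.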

AVX, Horizontal Sum of Single Precision Complex Numbers?

I have a 256 bit AVX register containing 4 single precision complex numbers stored as real, imaginary, real, imaginary, etc. I'm currently writing the entire 256 bit register back to memory and summing it there, but that seems inefficient.
How can the complex number horizontal sum be performed using AVX (or AVX2) intrinsics? I would accept an answer using assembly if there is not an answer with comparable efficiency using intrinsics.
Edit: To clarify, if the register contains AR, AI, BR, BI, CR, CI, DR, DI, I want to compute the complex number (AR + BR + CR + DR, AI + BI + CI + DI). If the result is in a 256 bit register, I can extract the 2 single precision floating point numbers.
Edit2: Potential solution, though not necessarily optimal...
float hsum_ps_sse3(__m128 v) {
__m128 shuf = _mm_movehdup_ps(v); // broadcast elements 3,1 to 2,0
__m128 sums = _mm_add_ps(v, shuf);
shuf = _mm_movehl_ps(shuf, sums); // high half -> low half
sums = _mm_add_ss(sums, shuf);
return _mm_cvtss_f32(sums);
}
float sumReal = 0.0;
float sumImaginary = 0.0;
__m256i mask = _mm256_set_epi32 (7, 5, 3, 1, 6, 4, 2, 0);
// Separate real and imaginary.
__m256 permutedSum = _mm256_permutevar8x32_ps(sseSum0, mask);
__m128 realSum = _mm256_extractf128_ps(permutedSum , 0);
__m128 imaginarySum = _mm256_extractf128_ps(permutedSum , 1);
// Horizontally sum real and imaginary.
sumReal = hsum_ps_sse3(realSum);
sumImaginary = hsum_ps_sse3(imaginarySum);
One fairly straightforward solution which requires only AVX (not AVX2):
__m128 v0 = _mm256_castps256_ps128(v); // get low 2 complex values
__m128 v1 = _mm256_extractf128_ps(v, 1); // get high 2 complex values
v0 = _mm_add_ps(v0, v1); // add high and low
v1 = _mm_shuffle_ps(v0, v0, _MM_SHUFFLE(1, 0, 3, 2));
v0 = _mm_add_ps(v0, v1); // combine two halves of result
The result will be in v0 as { sum.re, sum.im, sum.re, sum.im }.
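The data flow of those two additions can be checked against a plain scalar model. The struct and function names below are hypothetical and only serve this sketch.

```cpp
#include <array>

struct ComplexSum { float re; float im; };  // hypothetical result type for this sketch

// Scalar model of the data flow: v = {AR, AI, BR, BI, CR, CI, DR, DI}.
inline ComplexSum hsum_complex(const std::array<float, 8>& v) {
    // Step 1: add the high 128-bit half to the low half -> {A + C, B + D}.
    std::array<float, 4> h = {v[0] + v[4], v[1] + v[5], v[2] + v[6], v[3] + v[7]};
    // Step 2: add the two 64-bit halves of that result -> full complex sum.
    return {h[0] + h[2], h[1] + h[3]};
}
```

Note that the order of additions matches the vector code, so the floating-point rounding should match as well, which is useful when comparing results bit for bit.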

How to quantize floating point to unsigned byte in GLSL

I use a floating-point texture as a data buffer in GLSL and need to save the data to a normal texture (each color channel of a pixel is 1 byte). In my case the floating-point values are in [-2048.0, 2048.0], so I have to quantize [-2048.0, 2048.0] to [0, 255]. I think the C++ code for this would look like this:
//*quantization*
float fvalue = ... ; // floating point data
fvalue /= 16.0f; // [-128.0, 128.0]
fvalue = roundf(fvalue); // [-128, 128]
if(fvalue > 127.0f) fvalue = 127.0f;
else if(fvalue < -128.0f) fvalue = -128.0f;
u_char byte = (int)fvalue + 128; // [0, 255]
//*inverse quantization*
u_char byte = ...; // [0, 255]
float fvalue = byte - 128; // [-128, 127]
fvalue *= 16.0f; // [-2048, 2032] (it can't be helped?)
I'm not certain this code is good, and I'm even less sure what the right approach is in GLSL (GLSL handles a byte value in [0, 255] as a floating-point value in [0.0, 1.0]). My code is:
//*quantization*
vec3 F = ...; //F is floating vector [-2048.0, 2048.0]
F /= 16; // [-128.0, 128.0]
F /= 256; // [-0.5, 0.5]
F += vec3(0.50f); // [0.0, 1.0]
gl_FragData[0] = vec4(F, 1.0);
//*inverse quantization*
vec3 F = texture2D(...); //byte data [0.0, 1.0]
F -= vec3(0.50f); //byte data [-0.5, 0.5]
F *= 256; //[-128, 128]
F *= 16; //[-2048, 2048]
This didn't work well. However, if I change F += vec3(0.50f); to F += vec3(0.51f); and F -= vec3(0.50f); to F -= vec3(0.51f);, it seems to work well. But I don't think the value 0.51f is reasonable. In fact, this works well on one piece of hardware but not on another.
I want to know a good way to quantize (and inverse-quantize) float values.
I found a way that works "well", but I can't reasonably explain why it works, so I don't know whether it is a general method.
//*quantization*
vec3 F = ...; //F is floating vector [-2048.0, 2048.0]
F += 2048;
F /= 16;
F /= 255;
gl_FragData[0] = vec4(F, 1.0);
//*inverse quantization*
vec3 F = texture2D(...); //byte data [0.0, 1.0]
F *= 255.0;
F *= 16.0;
F -= vec3(2048 + 8); //adding bias -16.0/2.0
F = 2.0 * F * qp * Q / 16.0;
First of all, each pixel having 1 byte does not adequately convey what you are trying to describe. This so-called "normal texture" is more accurately referred to as "unsigned normalized" (often shortened to unorm).
You want an 8-bit unorm texture here (ideally with multiple components); these are textures that store fixed-point data and are treated like floating-point (in the range [0.0,1.0]) when sampled by normalizing the data to its intrinsic range (e.g. promoting to floating-point and dividing by 255.0).
Given what was just described, you simply need to transform the original data [-2048.0,2048.0] into [0.0,1.0] and then multiply by 255.
This is rather undesirable though, because you will lose the ability to represent the original range without severe aliasing. Instead, multiply by 4294967295 (256^4 - 1) and pack 8 bits into R, 8 bits into G, 8 bits into B and 8 bits into A. You have made no attempt to pack the components in the shader shown.
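The packing arithmetic can be sketched on the CPU side in C++ as follows. The names pack_rgba and unpack_rgba are invented here, and the sketch uses double precision; a shader version would need a different formulation, because a 32-bit float cannot represent 2^32 - 1 exactly.

```cpp
#include <array>
#include <cmath>
#include <cstdint>

// Maps x in [-2048, 2048] to a 32-bit unsigned normalized value and splits it
// into four bytes (R = most significant). Hypothetical helper, CPU-side only.
inline std::array<std::uint8_t, 4> pack_rgba(double x) {
    double t = (x + 2048.0) / 4096.0;       // [0, 1]
    t = std::fmin(std::fmax(t, 0.0), 1.0);  // clamp out-of-range input
    auto q = static_cast<std::uint32_t>(std::llround(t * 4294967295.0));
    return {static_cast<std::uint8_t>(q >> 24), static_cast<std::uint8_t>(q >> 16),
            static_cast<std::uint8_t>(q >> 8),  static_cast<std::uint8_t>(q)};
}

// Inverse of pack_rgba: reassemble the 32-bit value and map it back.
inline double unpack_rgba(const std::array<std::uint8_t, 4>& p) {
    std::uint32_t q = (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16) |
                      (std::uint32_t(p[2]) << 8)  |  std::uint32_t(p[3]);
    return q / 4294967295.0 * 4096.0 - 2048.0;
}
```

With 32 bits the quantization step is roughly 4096 / 2^32, compared with a step of 16 when only a single 8-bit channel is used.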

SSE intrinsics: masking a float and using bitwise and?

Basically the problem is related to x86 assembler, where you have a number that you want to set to either zero or the number itself using an AND. If you AND that number with negative one you get back the number itself, but if you AND it with zero you get zero.
Now the problem I'm having with SSE intrinsics is that floats aren't the same in binary as doubles (or maybe I'm mistaken). Anyway, here's the code. I've tried using all kinds of floats to mask the second and third numbers (127.0f and 99.0f respectively), but no luck.
#include <xmmintrin.h>
#include <stdio.h>
void print_4_bit_num(const char * label, __m128 var)
{
float *val = (float *) &var;
printf("%s: %f %f %f %f\n",
label, val[3], val[2], val[1], val[0]);
}
int main()
{
__m128 v1 = _mm_set_ps(1.0f, 127.0f, 99.0f, 1.0f);
__m128 v2 = _mm_set_ps(1.0f, 65535.0f, 127.0f, 0.0f);
__m128 v = _mm_and_ps(v1, v2);
print_4_bit_num("v1", v1);
print_4_bit_num("v2", v2);
print_4_bit_num("v ", v);
return 0;
}
You need to use a bitwise (integer) mask when you AND, so to e.g. clear alternate values in a vector you might do something like this:
__m128 v1 = _mm_set_ps(1.0f, 127.0f, 99.0f, 1.0f);
__m128 v2 = _mm_castsi128_ps(_mm_set_epi32(0, -1, 0, -1));
__m128 v = _mm_and_ps(v1, v2); // => v = { 0.0f, 127.0f, 0.0f, 1.0f }
You can cast any SSE vector to any SSE vector type of the same size (128 bit, or 256 bit), and you will get the exact same bits as before; the cast does not generate any instructions. Obviously if you cast 4 floats to 2 doubles you get nonsense, but in your case you cast the floats to some integer type, do the AND, and cast the result back.
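To illustrate why a float such as 65535.0f does not work as a mask while an all-ones integer does, here is a scalar sketch of what one lane of _mm_and_ps does. The helper names and_bits and bits_of are made up for this example, and IEEE 754 single precision is assumed.

```cpp
#include <cstdint>
#include <cstring>

// Returns the raw bit pattern of a float (hypothetical helper).
inline std::uint32_t bits_of(float x) {
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    return u;
}

// One lane of _mm_and_ps: AND the raw bits of a float with a 32-bit mask.
inline float and_bits(float x, std::uint32_t mask) {
    std::uint32_t u = bits_of(x) & mask;
    std::memcpy(&x, &u, sizeof x);
    return x;
}
```

A mask of 0xffffffff keeps the value and a mask of 0 clears it, whereas and_bits(127.0f, bits_of(65535.0f)) mixes exponent and mantissa bits of two unrelated numbers and yields a different value.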
If you have SSE4.1 (which I bet you do), you should consider _mm_blendv_ps(a,b,mask). This only uses the sign bit of its mask argument and essentially implements the vectorised mask<0?b:a.
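The selection rule can be modeled per lane as below. This is a scalar sketch with a hypothetical function name, assuming IEEE 754 single precision.

```cpp
#include <cstdint>
#include <cstring>

// One lane of _mm_blendv_ps(a, b, mask): only the sign bit of mask matters.
// blendv_lane is a hypothetical name used only for this sketch.
inline float blendv_lane(float a, float b, float mask) {
    std::uint32_t m;
    std::memcpy(&m, &mask, sizeof m);
    return (m >> 31) ? b : a;  // sign bit set -> take b, otherwise take a
}
```

Because only the sign bit is inspected, a mask of -0.0f also selects b, so the rule is not exactly the same as a numeric mask < 0 comparison.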