Compare the sign bit in SSE Intrinsics - c++

How would one create a mask using SSE intrinsics which indicates whether the signs of two packed floats (__m128s) are the same? For example, if comparing a and b where a is [1.0 -1.0 0.0 2.0] and b is [1.0 1.0 1.0 1.0], the desired mask would be [true false true true].

Here's one solution:
const __m128i MASK = _mm_set1_epi32(0xffffffff);
__m128 a = _mm_setr_ps(1,-1,0,2);
__m128 b = _mm_setr_ps(1,1,1,1);
__m128 f = _mm_xor_ps(a,b);
__m128i i = _mm_castps_si128(f);
i = _mm_srai_epi32(i,31);
i = _mm_xor_si128(i,MASK);
f = _mm_castsi128_ps(i);
// i = (0xffffffff, 0, 0xffffffff, 0xffffffff)
// f = (0xffffffff, 0, 0xffffffff, 0xffffffff)
In this snippet, both i and f will have the same bitmask. I assume you want it in the __m128 type so I added the f = _mm_castsi128_ps(i); to convert it back from an __m128i.
Note that this code is sensitive to the sign of the zero. So 0.0 and -0.0 will affect the results.
Explanations:
The way the code works is as follows:
f = _mm_xor_ps(a,b); // XOR the sign bits (well, all the bits actually)
i = _mm_castps_si128(f); // Reinterpret as integers. There's no instruction here, just a cast.
i = _mm_srai_epi32(i,31); // Arithmetic shift broadcasts the sign bit into all the bits.
i = _mm_xor_si128(i,MASK); // Invert all the bits.
f = _mm_castsi128_ps(i); // Reinterpret back as floats. Again, there's no instruction here.

Have a look at the _mm_movemask_ps intrinsic, which extracts the most significant bit (i.e. the sign bit) of each of the 4 floats. See http://msdn.microsoft.com/en-us/library/4490ys29.aspx
For example, if you have [1.0 -1.0 0.0 2.0], then movemask_ps will return 2, or 0010 in binary, because only the second element is negative. So if you get movemask_ps for each vector and compare the results (perhaps bitwise NOT XOR), that will indicate whether all the signs are the same.
a = [1.0 -1.0 0.0 2.0]
b = [1.0 1.0 1.0 1.0]
movemask_ps a = 2
movemask_ps b = 0
NOT (a XOR b) = 0xD, or binary 1101
Hence the signs are the same except in the second vector element.
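A minimal sketch of that approach (assuming a and b are the __m128 vectors from the question; the NOT-XOR happens on the 4-bit scalar masks, so the result is an integer bitmask rather than a vector):
int ma = _mm_movemask_ps(a);    // one bit per element: the sign bit of each float in a
int mb = _mm_movemask_ps(b);    // same for b
int same = ~(ma ^ mb) & 0xF;    // bit i set => element i of a and b have the same sign
bool all_same = (same == 0xF);  // true if every pair of signs matches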

Related

can i speed up more than _mm256_i32gather_epi32

I made a gamma-conversion routine for 4K video:
/** gamma0
input range : 0 ~ 1,023
output range : 0 ~ ?
*/
v00 = _mm256_unpacklo_epi16(v0, _mm256_setzero_si256());
v01 = _mm256_unpackhi_epi16(v0, _mm256_setzero_si256());
v10 = _mm256_unpacklo_epi16(v1, _mm256_setzero_si256());
v11 = _mm256_unpackhi_epi16(v1, _mm256_setzero_si256());
v20 = _mm256_unpacklo_epi16(v2, _mm256_setzero_si256());
v21 = _mm256_unpackhi_epi16(v2, _mm256_setzero_si256());
v00 = _mm256_i32gather_epi32(csv->gamma0LUT, v00, 4);
v01 = _mm256_i32gather_epi32(csv->gamma0LUT, v01, 4);
v10 = _mm256_i32gather_epi32(csv->gamma0LUTc, v10, 4);
v11 = _mm256_i32gather_epi32(csv->gamma0LUTc, v11, 4);
v20 = _mm256_i32gather_epi32(csv->gamma0LUTc, v20, 4);
v21 = _mm256_i32gather_epi32(csv->gamma0LUTc, v21, 4);
I want to implement a "10-bit input to 10~13-bit output" LUT (look-up table), but AVX2 gathers only support 32-bit (or 64-bit) elements.
So the indices were unavoidably widened to 32 bits and the lookup implemented with the _mm256_i32gather_epi32 intrinsic.
This is the most severe performance bottleneck in the code. Is there any way to improve it?
Since the context of your question is still a bit vague to me, here are just some general ideas you could try (some may be only slightly better, or even worse, than what you have at the moment; all code below is untested):
LUT with 16 bit values using _mm256_i32gather_epi32
Even though it loads 32-bit values, you can still use a multiplier of 2 as the last argument of _mm256_i32gather_epi32. You should make sure that the 2 bytes before and after your LUT are readable.
static const int16_t LUT[1024+2] = { 0, val0, val1, ..., val1022, val1023, 0};
__m256i high_idx = _mm256_srli_epi32(v, 16);                                  // upper 16-bit index of each pair
__m256i low_idx = _mm256_blend_epi16(v, _mm256_setzero_si256(), 0xAA);        // zero the odd 16-bit lanes, keeping the lower index
__m256i high_val = _mm256_i32gather_epi32((int const*)(LUT+0), high_idx, 2);  // wanted value lands in the upper 16 bits of each lane
__m256i low_val = _mm256_i32gather_epi32((int const*)(LUT+1), low_idx, 2);    // wanted value lands in the lower 16 bits of each lane
__m256i values = _mm256_blend_epi16(low_val, high_val, 0xAA);                 // merge: odd 16-bit lanes from high_val, even from low_val
Join two values into one LUT-entry
For small-ish LUTs, you could calculate an index from two neighboring indexes as (idx_hi << 10) + idx_low and look up the corresponding tuple directly. However, instead of 2KiB you would have a 4 MiB LUT in your case, which likely hurts caching -- but you only have half the number of gather instructions.
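A rough sketch of that idea (untested, and not from the original answer): assume a hypothetical pairedLUT with 1 << 20 uint32_t entries, where pairedLUT[(hi << 10) | lo] packs gamma(hi) in its upper 16 bits and gamma(lo) in its lower 16 bits, and that v still holds the packed 16-bit inputs, two per 32-bit lane:
__m256i hi = _mm256_srli_epi32(v, 16);                                      // upper 10-bit index of each pair
__m256i lo = _mm256_and_si256(v, _mm256_set1_epi32(0x3FF));                 // lower 10-bit index of each pair
__m256i joined = _mm256_or_si256(_mm256_slli_epi32(hi, 10), lo);            // (hi << 10) | lo
__m256i both = _mm256_i32gather_epi32((int const*)pairedLUT, joined, 4);    // one gather fetches both results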
Polynomial approximation
Mathematically, all continuous functions on a finite interval can be approximated by a polynomial. You could either convert your values to float, evaluate the polynomial, and convert back, or do it directly with fixed-point multiplications (note that _mm256_mulhi_epi16/_mm256_mulhi_epu16 compute (a * b) >> 16, which is convenient if one factor is actually in [0, 1)).
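For illustration only (not from the original answer), a fixed-point Horner evaluation of a quadratic could look like this, assuming v0 holds the packed 10-bit inputs as in the question and c2/c1/c0 are 16-bit constants fitted offline to the gamma curve and pre-scaled so the Q16 arithmetic works out:
__m256i x = _mm256_slli_epi16(v0, 6);   // scale the 10-bit input to a Q16 fraction in [0, 1)
__m256i y = _mm256_mulhi_epu16(c2, x);  // (c2 * x) >> 16
y = _mm256_add_epi16(y, c1);
y = _mm256_mulhi_epu16(y, x);           // (((c2 * x) >> 16) + c1) * x >> 16
y = _mm256_add_epi16(y, c0);            // result scale depends on how c0..c2 were chosen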
8 bit, 16 entry LUT with linear interpolation
SSE/AVX2 provides a pshufb instruction which can be used as an 8-bit LUT with 16 entries (and an implicit 0 entry).
Proof-of-concept implementation:
__m256i idx = _mm256_srli_epi16(v, 6); // shift highest 4 bits to the right
idx = _mm256_mullo_epi16(idx, _mm256_set1_epi16(0x0101)); // duplicate idx, maybe _mm256_shuffle_epi8 is better?
idx = _mm256_sub_epi8(idx, _mm256_set1_epi16(0x0001)); // subtract 1 from lower idx, 0 is mapped to 0xff
__m256i lut_vals = _mm256_shuffle_epi8(LUT, idx); // implicitly: LUT[-1] = 0
// get fractional part of input value:
__m256i dv = _mm256_and_si256(v, _mm256_set1_epi8(0x3f)); // lowest 6 bits
dv = _mm256_mullo_epi16(dv, _mm256_set1_epi16(0xff01)); // dv = [-dv, dv]
dv = _mm256_add_epi8(dv, _mm256_set1_epi16(0x4000)); // dv = [0x40-(v&0x3f), (v&0x3f)];
__m256i res = _mm256_maddubs_epi16(lut_vals, dv); // switch order depending on whether LUT values are (un)signed.
// probably shift res to the right, depending on the scale of your LUT values
You could also combine this with first doing a linear or quadratic approximation and just calculating the difference to your target function.

Convert "__m256 with random-bits" into float values of [0, 1] range

I have a __m256 value that holds random bits.
I would like to "interpret" it, to obtain another __m256 that holds float
values in a uniform [0.0f, 1.0f] range.
Planning to do it using:
__m256 randomBits = /* generated random bits, uniformly distribution */;
__m256 invFloatRange = _mm256_set1_ps( numeric_limits<float>::min() ); //min is a smallest increment of float precision
__m256 float01 = _mm256_mul_ps(randomBits, invFloatRange);
//float01 is now ready to be used
Question 1:
However, will this cause a problem in very rare cases where randomBits has all bits as 1 and is therefore NAN?
What can I do to protect myself from this?
I want the float01 to always be a usable number
Question 2:
Will the [0 to 1] range remain uniform after I obtain it using the above approach? I know float has varying precision at different magnitudes
Reinterpreting an int32_t as a float, one can do:
auto const one = _mm256_set1_epi32(0x3f800000); // bit pattern of 1.0f
a = _mm256_and_si256(a, _mm256_set1_epi32(0x007fffff)); // keep the 23 mantissa bits
a = _mm256_or_si256(a, one); // exponent of 1.0f => value in [1.0f, 2.0f)
return _mm256_sub_ps(_mm256_castsi256_ps(a), _mm256_castsi256_ps(one));
The and/or sequence reuses the 23 LSBs of the input to produce a uniform distribution of values in the range 1.0f <= a < 2.0f, and then the bias of 1.0f is removed.
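Wrapped up as a self-contained helper (a sketch; the function name is mine, and it assumes the random bits arrive as an __m256i):
static inline __m256 bits_to_unit_float(__m256i bits) {
    const __m256i one = _mm256_set1_epi32(0x3f800000);                    // bit pattern of 1.0f
    __m256i mant = _mm256_and_si256(bits, _mm256_set1_epi32(0x007fffff)); // keep 23 mantissa bits
    __m256 f12 = _mm256_castsi256_ps(_mm256_or_si256(mant, one));         // uniform in [1.0f, 2.0f)
    return _mm256_sub_ps(f12, _mm256_castsi256_ps(one));                  // uniform in [0.0f, 1.0f)
}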
As #Soonts has pointed out, floats can be created uniformly in [0, 1] range:
https://stackoverflow.com/a/54873925/9007125
I ended up using the answer below:
https://stackoverflow.com/a/54893167/9007125
//converts __m256i values into __m256 values that contain floats in the [0,1) range.
//https://stackoverflow.com/a/54893167/9007125
inline void int_rand_int_toFloat01( const __m256i* m256i_vals,
                                    __m256* m256f_vals){ // <-- stores here
    const static __m256 c = _mm256_set1_ps(0x1.0p-24f); // or (1.0f / (uint32_t(1) << 24))
    const __m256i* rnd = m256i_vals;
    __m256* output = m256f_vals;
    // '_mm256_cvtepi32_ps' converts 32-bit ints into 32-bit floats
    __m256 converted = _mm256_cvtepi32_ps(_mm256_srli_epi32(*rnd, 8));
    *output = _mm256_mul_ps( converted, c);
}
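Hypothetical usage (next_random_bits and rng_state are placeholders for whatever vectorized PRNG you use; they are not part of the original post):
__m256i bits = next_random_bits(&rng_state); // placeholder PRNG producing 8x 32 random bits
__m256 u01;
int_rand_int_toFloat01(&bits, &u01);         // u01 now holds 8 floats in [0, 1)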

How to convert scalar code of the double version of VDT's Pade Exp fast_ex() approx into SSE2?

Here's the code I'm trying to convert: the double version of VDT's Pade Exp fast_ex() approx (here's the old repo resource):
inline double fast_exp(double initial_x){
double x = initial_x;
double px=details::fpfloor(details::LOG2E * x +0.5);
const int32_t n = int32_t(px);
x -= px * 6.93145751953125E-1;
x -= px * 1.42860682030941723212E-6;
const double xx = x * x;
// px = x * P(x**2).
px = details::PX1exp;
px *= xx;
px += details::PX2exp;
px *= xx;
px += details::PX3exp;
px *= x;
// Evaluate Q(x**2).
double qx = details::QX1exp;
qx *= xx;
qx += details::QX2exp;
qx *= xx;
qx += details::QX3exp;
qx *= xx;
qx += details::QX4exp;
// e**x = 1 + 2x P(x**2)/( Q(x**2) - P(x**2) )
x = px / (qx - px);
x = 1.0 + 2.0 * x;
// Build 2^n in double.
x *= details::uint642dp(( ((uint64_t)n) +1023)<<52);
if (initial_x > details::EXP_LIMIT)
x = std::numeric_limits<double>::infinity();
if (initial_x < -details::EXP_LIMIT)
x = 0.;
return x;
}
I got this:
__m128d PExpSSE_dbl(__m128d x) {
__m128d initial_x = x;
__m128d half = _mm_set1_pd(0.5);
__m128d one = _mm_set1_pd(1.0);
__m128d log2e = _mm_set1_pd(1.4426950408889634073599);
__m128d p1 = _mm_set1_pd(1.26177193074810590878E-4);
__m128d p2 = _mm_set1_pd(3.02994407707441961300E-2);
__m128d p3 = _mm_set1_pd(9.99999999999999999910E-1);
__m128d q1 = _mm_set1_pd(3.00198505138664455042E-6);
__m128d q2 = _mm_set1_pd(2.52448340349684104192E-3);
__m128d q3 = _mm_set1_pd(2.27265548208155028766E-1);
__m128d q4 = _mm_set1_pd(2.00000000000000000009E0);
__m128d px = _mm_add_pd(_mm_mul_pd(log2e, x), half);
__m128d t = _mm_cvtepi64_pd(_mm_cvttpd_epi64(px));
px = _mm_sub_pd(t, _mm_and_pd(_mm_cmplt_pd(px, t), one));
__m128i n = _mm_cvtpd_epi64(px);
x = _mm_sub_pd(x, _mm_mul_pd(px, _mm_set1_pd(6.93145751953125E-1)));
x = _mm_sub_pd(x, _mm_mul_pd(px, _mm_set1_pd(1.42860682030941723212E-6)));
__m128d xx = _mm_mul_pd(x, x);
px = _mm_mul_pd(xx, p1);
px = _mm_add_pd(px, p2);
px = _mm_mul_pd(px, xx);
px = _mm_add_pd(px, p3);
px = _mm_mul_pd(px, x);
__m128d qx = _mm_mul_pd(xx, q1);
qx = _mm_add_pd(qx, q2);
qx = _mm_mul_pd(xx, qx);
qx = _mm_add_pd(qx, q3);
qx = _mm_mul_pd(xx, qx);
qx = _mm_add_pd(qx, q4);
x = _mm_div_pd(px, _mm_sub_pd(qx, px));
x = _mm_add_pd(one, _mm_mul_pd(_mm_set1_pd(2.0), x));
n = _mm_add_epi64(n, _mm_set1_epi64x(1023));
n = _mm_slli_epi64(n, 52);
// return?
}
But I'm not able to finish the last lines - i.e. this code:
if (initial_x > details::EXP_LIMIT)
x = std::numeric_limits<double>::infinity();
if (initial_x < -details::EXP_LIMIT)
x = 0.;
return x;
How would you convert this to SSE2?
Then of course I need to check the whole, since I'm not quite sure I've converted it correctly.
EDIT: I found the SSE conversion of float exp - i.e. from this:
/* multiply by power of 2 */
z *= details::uint322sp((n + 0x7f) << 23);
if (initial_x > details::MAXLOGF) z = std::numeric_limits<float>::infinity();
if (initial_x < details::MINLOGF) z = 0.f;
return z;
to this:
n = _mm_add_epi32(n, _mm_set1_epi32(0x7f));
n = _mm_slli_epi32(n, 23);
return _mm_mul_ps(z, _mm_castsi128_ps(n));
Yup, dividing two polynomials can often give you a better tradeoff between speed and precision than one huge polynomial, as long as there's enough other work that you don't bottleneck on divpd throughput. (The latest x86 CPUs have pretty decent FP divide throughput. It's still bad vs. multiply, but it's only 1 uop so it doesn't stall the pipeline if you use it rarely enough, i.e. mixed with lots of multiplies, including in the surrounding code that uses exp.)
However, _mm_cvtepi64_pd(_mm_cvttpd_epi64(px)); won't work with SSE2. Packed-conversion intrinsics to/from 64-bit integers require AVX512DQ.
To do packed rounding to the nearest integer, ideally you'd use SSE4.1 _mm_round_pd(x, _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC), (or truncation towards zero, or floor or ceil towards -+Inf).
But we don't actually need that.
The scalar code ends up with int n and double px both representing the same numeric value. It uses the bad/buggy floor(val+0.5) idiom instead of rint(val) or nearbyint(val) to round to nearest, and then converts that already-integer double to an int (with C++'s truncation semantics, but that doesn't matter because the double value's already an exact integer.)
With SIMD intrinsics, it appears to be easiest to just convert to 32-bit integer and back.
__m128i n = _mm_cvtpd_epi32( _mm_mul_pd(log2e, x) ); // round to nearest
__m128d px = _mm_cvtepi32_pd( n );
Rounding to int with the desired mode, then converting back to double, is equivalent to double->double rounding and then grabbing an int version of that like the scalar version does. (Because you don't care what happens for doubles too large to fit in an int.)
cvtpd2dq and cvtdq2pd are 2 uops each, and cvtpd2dq packs the 32-bit integers into the low 64 bits of the vector. So to set up for 64-bit integer shifts to stuff the bits into a double again, you'll need to shuffle. The top 64 bits of n will be zeros, so we can use that to create 64-bit integer n lined up with the doubles:
n = _mm_shuffle_epi32(n, _MM_SHUFFLE(3,1,2,0)); // 64-bit integers
But with just SSE2, there are workarounds. Converting to 32-bit integer and back is one option if you don't care about inputs too small or too large. Packed conversion between double and int costs at least 2 uops on Intel CPUs each way, so a total of 4, but only 2 of those uops need the FMA units, and your code probably doesn't bottleneck on port 5 (shuffles) with all those multiplies and adds.
Or add a very large number and subtract it again: large enough that each double is 1 integer apart, so normal FP rounding does what you want. (This even works for inputs that won't fit in 32 bits, just not for doubles > 2^52, so either way would work here.) Also see How to efficiently perform double/int64 conversions with SSE/AVX?, which uses that trick.
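A sketch of that magic-number trick (assuming the default round-to-nearest FP mode, and that the value being rounded is small enough that adding 1.5 * 2^52 keeps it exactly representable; px_unrounded stands for the LOG2E * x product before rounding):
const __m128d round_magic = _mm_set1_pd(6755399441055744.0);  // 2^52 + 2^51
__m128d rounded = _mm_sub_pd(_mm_add_pd(px_unrounded, round_magic), round_magic); // rounded to nearest integer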
Related:
Fastest Implementation of Exponential Function Using AVX and Fastest Implementation of Exponential Function Using SSE have versions with other speed / precision tradeoffs, for _ps (packed single-precision float).
Fast SSE low precision exponential using double precision operations is at the other end of the spectrum, but still for double.
How many clock cycles does cost AVX/SSE exponentiation on modern x86_64 CPU? discusses some existing libraries like SVML, and Agner Fog's VCL (GPL licensed). And glibc's libmvec.
Then of course I need to check the whole, since I'm not quite sure I've converted it correctly.
Iterating over all 2^64 double bit-patterns is impractical, unlike for float where there are only 4 billion, but maybe iterating over all doubles that have the low 32 bits of their mantissa all zero would be a good start, i.e. check in a loop with:
bitpatterns = _mm_add_epi64(bitpatterns, _mm_set1_epi64x( 1ULL << 32 ));
doubles = _mm_castsi128_pd(bitpatterns);
https://randomascii.wordpress.com/2014/01/27/theres-only-four-billion-floatsso-test-them-all/
For those last few lines, correcting the input for out-of-range inputs:
The float version you quote just leaves out the range-check entirely. This is obviously the fastest way, if your inputs will always be in range or if you don't care about what happens for out-of-range inputs.
Alternate cheaper range-checking (maybe only for debugging) would be to turn out-of-range values into NaN by ORing the packed-compare result into the result. (An all-ones bit-pattern represents a NaN.)
__m128d out_of_bounds = _mm_cmplt_pd( limit, abs(initial_x) ); // abs = mask off the sign bit
result = _mm_or_pd(result, out_of_bounds);
In general, you can vectorize simple condition setting of a value using branchless compare + blend. Instead of if(x) y=0;, you have the SIMD equivalent of y = (condition) ? 0 : y;, on a per-element basis. SIMD compares produce a mask of all-zero / all-one elements so you can use it to blend.
e.g. in this case cmppd the input and blendvpd the output if you have SSE4.1. Or with just SSE2, and/andnot/or to blend. See SSE intrinsics for comparison (_mm_cmpeq_ps) and assignment operation for a _ps version of both; _pd is identical.
In asm it will look like this:
; result in xmm0 (in need of fixups for out of range inputs)
; initial_x in xmm2
; constants:
; xmm5 = limit
; xmm6 = +Inf
cmpltpd xmm2, xmm5 ; xmm2 = input_x < limit ? 0xffff... : 0
andpd xmm0, xmm2 ; result = result or 0
andnpd xmm2, xmm6 ; xmm2 = 0 or +Inf (In that order because we used ANDN)
orpd xmm0, xmm2 ; result |= 0 or +Inf
; xmm0 = (input < limit) ? result : +Inf
(In an earlier version of the answer, I thought I was maybe saving a movaps to copy a register, but this is just a bog-standard blend. It destroys initial_x, so the compiler needs to copy that register at some point while calculating result, though.)
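The same blend written with SSE2 intrinsics might look like this (a sketch; limit and plus_inf are assumed to be pre-broadcast __m128d constants holding details::EXP_LIMIT and +Inf):
__m128d in_range = _mm_cmplt_pd(initial_x, limit);    // all-ones where initial_x < limit
__m128d kept = _mm_and_pd(result, in_range);          // keep result where in range, else 0
__m128d forced = _mm_andnot_pd(in_range, plus_inf);   // +Inf where out of range, else 0
result = _mm_or_pd(kept, forced);                     // (initial_x < limit) ? result : +Inf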
Optimizations for this special condition
Or in this case, 0.0 is represented by an all-zero bit-pattern, so do a compare that will produce true if in-range, and AND the output with that. (To leave it unchanged or force it to +0.0). This is better than _mm_blendv_pd, which costs 2 uops on most Intel CPUs (and the AVX 128-bit version always costs 2 uops on Intel). And it's not worse on AMD or Skylake.
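A sketch of that zeroing fixup (neg_limit is assumed to be _mm_set1_pd(-details::EXP_LIMIT)):
__m128d not_too_small = _mm_cmpge_pd(initial_x, neg_limit); // all-ones unless initial_x < -EXP_LIMIT
result = _mm_and_pd(result, not_too_small);                 // underflowing lanes become +0.0, others unchanged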
+-Inf is represented by a bit-pattern of significand=0, exponent=all-ones. (Any other value in the significand represents +-NaN.) Since too-large inputs will presumably still leave non-zero significands, we can't just AND the compare result and OR that into the final result. I think we need to do a regular blend, or something as expensive (3 uops and a vector constant).
It adds 2 cycles of latency to the final result; both the ANDNPD and ORPD are on the critical path. The CMPPD and ANDPD aren't; they can run in parallel with whatever you do to compute the result.
Hopefully your compiler will actually use ANDPS and so on, not PD, for everything except the CMP, because they're 1 byte shorter and, being pure bitwise ops, behave identically. I wrote ANDPD just so I didn't have to explain this in comments.
You might be able to shorten the critical path latency by combining both fixups before applying to the result, so you only have one blend. But then I think you also need to combine the compare results.
Or since your upper and lower bounds are the same magnitude, maybe you can compare the absolute value? (mask off the sign bit of initial_x and do _mm_cmplt_pd(abs_initial_x, _mm_set1_pd(details::EXP_LIMIT))). But then you have to sort out whether to zero or set to +Inf.
If you had SSE4.1 for _mm_blendv_pd, you could use initial_x itself as the blend control for the fixup that might need applying, because blendv only cares about the sign bit of the blend control (unlike with the AND/ANDN/OR version where all bits need to match.)
__m128d fixup = _mm_blendv_pd( _mm_set1_pd(INFINITY), _mm_setzero_pd(), initial_x ); // fixup = (initial_x signbit) ? 0 : +Inf
// see below for generating fixup with an SSE2 integer arithmetic-shift
const __m128d signbit_mask = _mm_castsi128_pd(_mm_set1_epi64x(0x7fffffffffffffff)); // ~ set1(-0.0)
__m128d abs_init_x = _mm_and_pd( initial_x, signbit_mask );
__m128d out_of_range = _mm_cmpgt_pd(abs_init_x, _mm_set1_pd(details::EXP_LIMIT));
// Conditionally apply the fixup to result
result = _mm_blendv_pd(result, fixup, out_of_range);
Possibly use cmplt instead of cmpgt and rearrange if you care what happens for initial_x being a NaN. Choosing the compare so false applies the fixup instead of true will mean that an unordered comparison results in either 0 or +Inf for an input of -NaN or +NaN. This still doesn't do NaN propagation. You could _mm_cmpunord_pd(initial_x, initial_x) and OR that into fixup, if you want to make that happen.
Especially on Skylake and AMD Bulldozer/Ryzen where SSE2 blendvpd is only 1 uop, this should be pretty nice. (The VEX encoding, vblendvpd is 2 uops, having 3 inputs and a separate output.)
You might still be able to use some of this idea with only SSE2, maybe creating fixup by doing a compare against zero and then _mm_and_pd or _mm_andnot_pd with the compare result and +Infinity.
Using an integer arithmetic shift to broadcast the sign bit to every position in the double isn't efficient: psraq doesn't exist, only psraw/d. Only logical shifts come in 64-bit element size.
But you could create fixup with just one integer shift plus an ANDN that masks and inverts in one go:
__m128i ix = _mm_castpd_si128(initial_x);
__m128i sign_all = _mm_srai_epi32(ix, 11); // broadcast the sign into the whole 11-bit exponent field of the high dword
// ~sign_all & exponent_mask = the bit pattern for +Inf (non-negative x) or 0 (negative x)
__m128i ifixup = _mm_andnot_si128(sign_all, _mm_set1_epi64x(0x7FF0000000000000ULL));
__m128d fixup = _mm_castsi128_pd(ifixup); // fixup = (initial_x signbit) ? 0 : +Inf
Then blend fixup into result for out-of-range inputs as normal.
Cheaply checking abs(initial_x) > details::EXP_LIMIT
If the exp algorithm was already squaring initial_x, you could compare against EXP_LIMIT squared. But it's not, xx = x*x only happens after some calculation to create x.
If you have AVX512F/VL, VFIXUPIMMPD might be handy here. It's designed for functions where the special case outputs are from "special" inputs like NaN and +-Inf, negative, positive, or zero, saving a compare for those cases. (e.g. for after a Newton-Raphson reciprocal(x) for x=0.)
But both of your special cases need compares. Or do they?
If you square your input and subtract, it only costs one FMA to do initial_x * initial_x - details::EXP_LIMIT * details::EXP_LIMIT to create a result that's negative for abs(initial_x) < details::EXP_LIMIT, and non-negative otherwise.
Agner Fog reports that vfixupimmpd is only 1 uop on Skylake-X.
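As a sketch of that squared-limit check (this needs FMA3, so it goes beyond plain SSE2; the sign of the result could then be used directly as a blendv control, selecting the normal result where negative and the fixup where non-negative):
// negative => |initial_x| < EXP_LIMIT (in range); non-negative => out of range
__m128d range_check = _mm_fmsub_pd(initial_x, initial_x,
                                   _mm_set1_pd(details::EXP_LIMIT * details::EXP_LIMIT));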

How to quantize floating point to unsigned byte in GLSL

I used a floating-point texture as a data buffer in GLSL and need to save the data into a normal texture (each pixel's color is 1 byte). In my situation the floating-point range is [-2048.0, 2048.0], so I have to quantize [-2048.0, 2048.0] to [0, 255]. I think the C++ code for this problem is something like:
//*quantization*
float fvalue = ... ; // floating point data
fvalue /= 16.0f; // [-128.0, 128.0]
fvalue = roundf(fvalue); // [-128, 128]
if(fvalue > 127.0f) fvalue = 127.0f;
else if(fvalue < -128.0f) fvalue = -128.0f;
u_char byte = (int)fvalue + 128; // [0, 255]
//*inverse quantization*
u_char byte = ...; // [0, 255]
float fvalue = byte - 128; // [-128, 127]
fvalue *= 16.0f; // [-2048, 2032] (it can't be helped?)
I'm not certain this code is good, and moreover I'm not really sure what works well in GLSL (GLSL handles byte values [0, 255] as floating point [0.0, 1.0]). My code is:
//*quantization*
vec3 F = ...; //F is floating vector [-2048.0, 2048.0]
F /= 16; // [-128.0, 128.0]
F /= 256; // [-0.5, 0.5]
F += vec3(0.50f); // [0.0, 1.0]
gl_FragData[0] = vec4(F, 1.0);
//*inverse quantization*
vec3 F = texture2D(...); //byte data [0.0, 1.0]
F -= vec3(0.50f); //byte data [-0.5, 0.5]
F *= 256; //[-128, 128]
F *= 16; //[-2048, 2048]
This didn't work well. However, if I rewrite F += vec3(0.50f); to F += vec3(0.51f); and likewise F -= vec3(0.50f); to F -= vec3(0.51f);, it seems to work well. But I don't think the value 0.51f is reasonable. In fact, this works well on one piece of hardware but doesn't work well on another.
I want to know the good way to quantize (also inv-quantize) float values.
I found a way that works "well". I'm afraid I can't reasonably explain why it works, so I don't know whether this is a versatile method.
//*quantization*
vec3 F = ...; //F is floating vector [-2048.0, 2048.0]
F += 2048;
F /= 16;
F /= 255;
gl_FragData[0] = vec4(F, 1.0);
//*inverse quantization*
vec3 F = texture2D(...); //byte data [0.0, 1.0]
F *= 255.0;
F *= 16.0;
F -= vec3(2048 + 8); //adding bias -16.0/2.0
F = 2.0 * F * qp * Q / 16.0;
First of all, each pixel having 1 byte does not adequately convey what you are trying to describe. This so-called "normal texture" is more accurately referred to as "unsigned normalized" (often shortened to unorm).
You want an 8-bit unorm texture here (ideally with multiple components); these are textures that store fixed-point data and are treated like floating-point (in the range [0.0,1.0]) when sampled by normalizing the data to its intrinsic range (e.g. promoting to floating-point and dividing by 255.0).
Given what was just described, you simply need to transform the original data from [-2048.0, 2048.0] into [0.0, 1.0] (i.e. F01 = (F + 2048.0) / 4096.0) and then multiply by 255.
This is rather undesirable though, because you will lose the ability to represent the original range without severe aliasing. Instead, multiply by 4294967295 (256^4 - 1) and pack 8 bits into R, 8 bits into G, 8 bits into B and 8 bits into A. You have made no attempt to pack the components in the shader shown.