SSE rounds down when it should round up - c++

I am working on an application that converts float samples in the range -1.0 to 1.0 to signed 16-bit. To ensure the output of the optimized (SSE) routines is accurate, I have written a set of tests that run the non-optimized version against the SSE version and compare their output.
Before I start I have confirmed that the SSE rounding mode is set to nearest.
In my test case the formula is:
ratio = 65536 / 2
output = round(input * ratio)
For the most part the results are accurate, but I am seeing a failure for one particular input, -0.8499908447265625.
-0.8499908447265625 * (65536 / 2) = -27852.5
The normal code correctly rounds this to -27853, but the SSE code rounds this to -27852.
Here is the SSE code in use:
void Float_S16(const float *in, int16_t *out, const unsigned int samples)
{
    static float ratio = 65536.0f / 2.0f;
    static __m128 mul = _mm_set_ps1(ratio);
    for(unsigned int i = 0; i < samples; i += 4, in += 4, out += 4)
    {
        __m128 xin;
        __m128i con;
        xin = _mm_load_ps(in);
        xin = _mm_mul_ps(xin, mul);
        con = _mm_cvtps_epi32(xin);
        out[0] = _mm_extract_epi16(con, 0);
        out[1] = _mm_extract_epi16(con, 2);
        out[2] = _mm_extract_epi16(con, 4);
        out[3] = _mm_extract_epi16(con, 6);
    }
}
Self Contained Example as requested:
/* standard math */
float ratio = 65536.0f / 2.0f;
float in [4] = {-1.0, -0.8499908447265625, 0.0, 1.0};
int16_t out[4];
for(int i = 0; i < 4; ++i)
out[i] = round(in[i] * ratio);
/* sse math */
static __m128 mul = _mm_set_ps1(ratio);
__m128 xin;
__m128i con;
xin = _mm_load_ps(in);
xin = _mm_mul_ps(xin, mul);
con = _mm_cvtps_epi32(xin);
int16_t outSSE[4];
outSSE[0] = _mm_extract_epi16(con, 0);
outSSE[1] = _mm_extract_epi16(con, 2);
outSSE[2] = _mm_extract_epi16(con, 4);
outSSE[3] = _mm_extract_epi16(con, 6);
printf("Standard = %d, SSE = %d\n", out[1], outSSE[1]);

Although the SSE rounding mode defaults to "round to nearest", this is not the familiar rounding method that we all learned in school but a variation known as banker's rounding (aka unbiased rounding, convergent rounding, statistician's rounding, Dutch rounding, Gaussian rounding or odd–even rounding), which rounds halfway cases to the nearest even integer. This method is generally considered statistically better than the traditional one because it does not bias results in one direction. You will see the same behaviour with functions such as rint(), and it is also the default rounding mode for IEEE 754.
Note also that whereas the standard library function round() rounds halfway cases away from zero (the traditional method), the SSE4.1 instruction ROUNDPS (_mm_round_ps) uses banker's rounding when asked to round to nearest.
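To make the difference concrete, here is a minimal scalar sketch. The helper names are made up for illustration; std::lround stands in for the "school" rule and std::lrint for the current floating-point rounding mode, which is assumed to be the default round-to-nearest-even (the same rule _mm_cvtps_epi32 follows unless MXCSR has been changed).

```cpp
#include <cmath>

// "School" rounding: halfway cases go away from zero (what round() does).
long round_half_away(double x) { return std::lround(x); }

// Rounding in the current FP rounding mode. Assumption: the mode is the
// default, round-to-nearest with ties to even (banker's rounding).
long round_half_even(double x) { return std::lrint(x); }
```

With the value from the question, -0.8499908447265625 * 32768 is exactly -27852.5, so round_half_away gives -27853 while round_half_even gives -27852, which matches the two results the tests disagree on.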

That's the default behaviour for all floating point processing, not just SSE. Round half to even or banker's rounding is the default rounding mode according to the IEEE 754 standard.
The reason this is used is that consistently rounding up (or down) results in a half-point error that accumulates when applied over even a moderate number of operations. The half points can result in some pretty significant errors - significant enough that they became a plot point in Superman 3.
Rounding half to even (or to odd), though, produces both negative and positive half-point errors that tend to cancel each other out when applied over many operations.
This is also desirable in SSE operations. SSE operations are typically used in signal processing (audio, image), engineering and statistical scenarios where a consistent rounding error would appear as noise and require additional processing to remove (if possible). Banker's rounding avoids introducing this systematic bias.
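The accumulation argument can be checked with a small sketch. The function names are illustrative only, and the default rounding mode is assumed for std::nearbyint. Each function sums the rounding error over the halfway values 0.5, 1.5, 2.5, and so on.

```cpp
#include <cmath>

// Accumulated (rounded - exact) error over 0.5, 1.5, ..., (n-1)+0.5 when
// every halfway case is rounded away from zero, as std::round does.
double bias_half_away(int n) {
    double bias = 0.0;
    for (int k = 0; k < n; ++k) {
        const double v = k + 0.5;
        bias += std::round(v) - v;
    }
    return bias;
}

// Same sum with round-half-to-even. Assumption: std::nearbyint runs in the
// default rounding mode (round to nearest, ties to even).
double bias_half_even(int n) {
    double bias = 0.0;
    for (int k = 0; k < n; ++k) {
        const double v = k + 0.5;
        bias += std::nearbyint(v) - v;
    }
    return bias;
}
```

For n = 1000 the first sum is 500 (every value contributes +0.5), while the second is 0 because the errors alternate between -0.5 and +0.5.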


Precise conversion of 32-bit unsigned integer into a float in range (-1;1)

According to articles like this, half of the floating-point numbers are in the interval [-1,1]. Could you suggest how to make use of this fact to replace the naive conversion of a 32-bit unsigned integer into a floating-point number (while keeping the uniform distribution)?
Naive code:
uint32_t i = /* randomly generated */;
float f = (float)i / (1ui32<<31) - 1.0f;
The problem here is that first the number i is converted into float losing up to 8 lower bits of precision. Only then the number is scaled to [0;2) interval, and then to [-1;1) interval.
Please, suggest the solution in C or C++ for x86_64 CPU or CUDA if you know it.
Update: the solution with a double is good for x86_64, but is too slow in CUDA. Sorry I didn't expect such a response. Any ideas how to achieve this without using double-precision floating-point?
You can do the calculation using double instead so you don't lose any precision on the uint32_t value, then assign the result to a float.
float f = (double)i / (1ui32<<31) - 1.0;
If you drop the uniform-distribution constraint, it is doable with 32-bit integer arithmetic alone:
float i32_to_f32(int x)
{
    int exp;
    union _f32 // semi result
    {
        float f; // 32bit floating point
        DWORD u; // 32 bit uint
    } y;
    // edge cases
    if (x== 0x00000000) return 0.0f;
    if (x< -0x1FFFFFFF) return -1.0f;
    if (x> +0x1FFFFFFF) return +1.0f;
    // conversion
    y.u=0; // reset bits
    if (x<0){ y.u|=0x80000000; x=-x; } // sign (31 bits left)
    exp=((x>>23)&63)-64; // upper 6 bits -> exponent -1,...,-64 (not 7bits to avoid denormalized numbers)
    y.u|=(exp+127)<<23; // exponent bias and bit position
    y.u|=x&0x007FFFFF; // mantissa
    return y.f;
}
int f32_to_i32(float x)
{
    int exp,man,i;
    union _f32 // semi result
    {
        float f; // 32bit floating point
        DWORD u; // 32 bit uint
    } y;
    // edge cases
    if (x== 0.0f) return 0x00000000;
    if (x<=-1.0f) return -0x1FFFFFFF;
    if (x>=+1.0f) return +0x1FFFFFFF;
    // conversion
    y.f=x; // access the bit pattern of x
    exp=(y.u>>23)&255; exp-=127; // exponent bias and bit position
    if (exp<-64) return 0;
    man=y.u&0x007FFFFF; // mantissa
    i =(exp<<23)&0x1F800000;
    i|= man;
    if (y.u>=0x80000000) i=-i; // sign
    return i;
}
I chose to use only 29 bits + sign = ~ 30 bits of integer to avoid denormalized numbers havoc which I am too lazy to encode (it would get you 30 or even 31 bits but much slower and complicated).
But the distribution is not linear nor uniform at all:
in Red is the float in range <-1,+1> and Blue is integer in range <-1FFFFFFF,+1FFFFFFF>.
On the other hand there is no rounding at all in both conversions ...
PS. I think there might be a way to somewhat linearize the result by using a precomputed LUT for the 6 bit exponent (64 values).
The thing to realize is that while (float)i does lose 8 bits of precision (so it has 24 bits of precision), the result only has 24 bits of precision as well. So this precision loss is not necessarily a bad thing (it is actually more complicated, because if i is smaller it will lose fewer than 8 bits, but things work out well).
So we just need to fix the range, so the originally non-negative value gets mapped to INT_MIN..INT_MAX.
This expression works: (float)(int)(value^0x80000000)/0x80000000.
Here's how it works:
The (int)(value^0x80000000) part flips the sign bit, so 0x0 gets mapped to INT_MIN, and 0xffffffff gets mapped to INT_MAX.
Then there is conversion to float. This is where some rounding happens, and we lose precision (but it is not a problem).
Then just divide by 0x80000000 to get into the range [-1..1]. As this division just adjusts the exponent part, this division doesn't lose any precision.
So there is only one rounding; the other operations don't lose precision. This chain of operations should have the same effect as calculating the result in infinite precision and then rounding to float (this theoretical rounding has the same effect as the rounding in step 2).
But, to be absolutely sure, I've verified with brute force checking all the 32-bit values that this expression results in the same value as (float)((double)value/0x80000000-1.0).
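A minimal sketch of that expression next to the double-based reference, for readers who want to repeat the check. The function names and the strided comparison are illustrative assumptions, not the brute-force test described above. The cast to a signed 32-bit integer is assumed to wrap, which is implementation-defined before C++20 but is what mainstream compilers do.

```cpp
#include <cassert>
#include <cstdint>

// Single-precision path: flip the top bit, reinterpret as signed, convert
// to float (the only rounding step), then divide by 2^31 (exact).
float u32_to_unit_float(std::uint32_t value) {
    const std::int32_t centered = static_cast<std::int32_t>(value ^ 0x80000000u);
    return static_cast<float>(centered) / 2147483648.0f;
}

// Reference path through double, rounded to float once at the end.
float u32_to_unit_float_ref(std::uint32_t value) {
    return static_cast<float>(static_cast<double>(value) / 2147483648.0 - 1.0);
}

// Counts disagreements over every stride-th 32-bit input.
std::uint32_t count_mismatches(std::uint32_t stride) {
    assert(stride != 0);
    std::uint32_t mismatches = 0;
    for (std::uint64_t v = 0; v <= 0xFFFFFFFFull; v += stride) {
        const std::uint32_t u = static_cast<std::uint32_t>(v);
        if (u32_to_unit_float(u) != u32_to_unit_float_ref(u)) ++mismatches;
    }
    return mismatches;
}
```

Note that the largest inputs round up to exactly 1.0f in both paths, so the result range is closed at the top.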
I suggest (if you want to avoid division and use an accurately float-representable start value of 1.0*2^-31):
float e = i * ldexp(1.0,-31) - 1.0;
Any ideas how to achieve this without using double-precision floating-point?
Without assuming too much about the insides of float:
Shift u until the most significant bit is set, halving the float conversion value.
"keeping the uniform distribution"
50% of the uint32_t values will be in the [0.5 ... 1.0)
25% of the uint32_t values will be in the [0.25 ... 0.5)
12.5% of the uint32_t values will be in the [0.125 ... 0.25)
6.25% of the uint32_t values will be in the [0.0625 ... 0.125)
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
float ui32to0to1(uint32_t u) {
  if (u) {
    float band = 1.0f/(1llu<<32);
    while ((u & 0x80000000) == 0) {
      u <<= 1;
      band *= 0.5f;
    }
    return (float)u * band;
  }
  return 0.0f;
}
Some test code to show functional equivalence to double.
int test(uint32_t u) {
  volatile float f0 = (float) ((double)u / (1llu<<32));
  volatile float f1 = ui32to0to1(u);
  if (f0 != f1) {
    printf("%8lX %.7e %.7e\n", (unsigned long) u, f0, f1);
    return 1;
  }
  return 0;
}
int main(void) {
  for (int i=0; i<100000000; i++) {
    test(rand()*65535u ^ rand());
  }
  return 0;
}
Various optimizations are possible, especially with assuming properties of float. Yet for an initial answer, I'll stick to a general approach.
For improved efficiency, the loop needs only to iterate from 32 down to FLT_MANT_DIG which is usually 24.
#include <float.h> // FLT_MANT_DIG
float ui32to0to1(uint32_t u) {
  float band = 1.0f/(1llu<<32);
  for (int i = 32; (i>FLT_MANT_DIG && ((u & 0x80000000) == 0)); i--) {
    u <<= 1;
    band *= 0.5f;
  }
  return (float)u * band;
}
This answer maps [0, 2^32-1] to [0.0, 1.0).
To map [0, 2^32-1] to (-1.0, 1.0), use the following. It can form -0.0.
if (u >= 0x80000000) {
  return ui32to0to1((u - 0x80000000)*2);
} else {
  return -ui32to0to1((0x7FFFFFFF - u)*2);
}

Is there a way to optimize this function?

For an application I'm working on, I need to take two integers and add them together using a particular mathematical formula. This ends up looking like this:
int16_t add_special(int16_t a, int16_t b) {
    float limit = std::numeric_limits<int16_t>::max();//32767 as a floating point value
    float a_fl = a, b_fl = b;
    float numerator = a_fl + b_fl;
    float denominator = 1 + a_fl * b_fl / std::pow(limit, 2);
    float final_value = numerator / denominator;
    return static_cast<int16_t>(std::round(final_value));
}
Any readers with a passing familiarity with physics will recognize that this formula is the same as what is used to calculate the sum of near-speed-of-light velocities, and the calculation here intentionally mirrors that computation.
The code as-written gives the results I need: for low numbers, they nearly add together normally, but for high numbers, they converge to the maximum value of 32767, i.e.
add_special(10, 15) == 25
add_special(100, 200) == 300
add_special(1000, 3000) == 3989
add_special(10000, 25000) == 28390
add_special(30000, 30000) == 32640
Which all appears to be correct.
The problem, however, is that the function as-written involves first transforming the numbers into floating point values before transforming them back into integers. This seems like a needless detour for numbers that I know, as a principle of its domain, will never not be integers.
Is there a faster, more optimized way to perform this computation? Or is this the most optimized version of this function I can create?
I'm building for x86-64, using MSVC 14.X, although methods that also work for GCC would be beneficial. Also, I'm not interested in SSE/SIMD optimizations at this stage; I'm mostly just looking at the elementary operations being performed on the data.
You can avoid floating-point numbers and do the whole computation in an integral type:
constexpr int16_t add_special(int16_t a, int16_t b) {
    std::int64_t limit = std::numeric_limits<int16_t>::max();
    std::int64_t a_fl = a;
    std::int64_t b_fl = b;
    return static_cast<int16_t>(((limit * limit) * (a_fl + b_fl)
                                 + ((limit * limit + a_fl * b_fl) / 2)) /* Handle round */
                                / (limit * limit + a_fl * b_fl));
}
but according to Benchmark, it is not faster for those values.
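One thing to watch in the integer version is rounding for negative sums, since integer division truncates toward zero. Below is a hedged sketch of a variant that rounds halves away from zero on both sides. The name add_special_int and the handling of a non-positive denominator are assumptions made for illustration, not part of the answer above.

```cpp
#include <cstdint>

// Integer-only sketch of round(L^2 * (a + b) / (L^2 + a*b)) with L = 32767,
// rounding halfway cases away from zero for both signs.
std::int16_t add_special_int(std::int16_t a, std::int16_t b) {
    constexpr std::int64_t limit2 = 32767LL * 32767LL;
    const std::int64_t num = limit2 * (std::int64_t{a} + b);
    const std::int64_t den = limit2 + std::int64_t{a} * b;
    // Assumption: callers never pass a pair with a*b <= -32767^2 (for example
    // 32767 and -32767); returning 0 here is only a placeholder choice.
    if (den <= 0) return 0;
    const std::int64_t half = den / 2;
    const std::int64_t q = (num >= 0) ? (num + half) / den : (num - half) / den;
    return static_cast<std::int16_t>(q);
}
```

This reproduces the values listed in the question, and add_special_int(-1000, -3000) yields -3989, mirroring the positive case.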
As noted by Johannes Overmann, a big performance boost is gained by avoiding std::round, at the cost of some (little) discrepancies in the results, though.
I tried some other little changes HERE, where it seems that the following is a faster approach (at least for that architecture)
constexpr int32_t i_max = std::numeric_limits<int16_t>::max();
constexpr int64_t i_max_2 = static_cast<int64_t>(i_max) * i_max;
int16_t my_add_special(int16_t a, int16_t b)
{
    // integer multiplication instead of floating point division
    double numerator = (a + b) * i_max_2;
    double denominator = i_max_2 + a * b;
    // Approximated rounding instead of std::round
    return 0.5 + numerator / denominator;
}
Use 32767.0*32767.0 (which is a constant) instead of std::pow(limit, 2).
Use integer values as much as possible, potentially with fixed point. Only the two divisions are a problem. Use floats just for them, if necessary (depends on the input data ranges).
Make it inline if the function is small and if it is appropriate.
Something like:
int16_t add_special(int16_t a, int16_t b) {
    float numerator = int32_t(a) + int32_t(b); // Cannot overflow.
    float denominator = 1 + (int32_t(a) * int32_t(b)) / (32767.0 * 32767.0); // Cannot overflow either.
    return (numerator / denominator) + 0.5; // Relying on implementation defined rounding. Not good but potentially faster than std::round().
}
The only risk with the above is the omission of the explicit rounding, so you will get some implicit rounding.

How to convert scalar code of the double version of VDT's Pade Exp fast_ex() approx into SSE2?

Here's the code I'm trying to convert: the double version of VDT's Pade Exp fast_ex() approx (here's the old repo resource):
inline double fast_exp(double initial_x){
double x = initial_x;
double px=details::fpfloor(details::LOG2E * x +0.5);
const int32_t n = int32_t(px);
x -= px * 6.93145751953125E-1;
x -= px * 1.42860682030941723212E-6;
const double xx = x * x;
// px = x * P(x**2).
px = details::PX1exp;
px *= xx;
px += details::PX2exp;
px *= xx;
px += details::PX3exp;
px *= x;
// Evaluate Q(x**2).
double qx = details::QX1exp;
qx *= xx;
qx += details::QX2exp;
qx *= xx;
qx += details::QX3exp;
qx *= xx;
qx += details::QX4exp;
// e**x = 1 + 2x P(x**2)/( Q(x**2) - P(x**2) )
x = px / (qx - px);
x = 1.0 + 2.0 * x;
// Build 2^n in double.
x *= details::uint642dp(( ((uint64_t)n) +1023)<<52);
if (initial_x > details::EXP_LIMIT)
x = std::numeric_limits<double>::infinity();
if (initial_x < -details::EXP_LIMIT)
x = 0.;
return x;
}
I got this:
__m128d PExpSSE_dbl(__m128d x) {
__m128d initial_x = x;
__m128d half = _mm_set1_pd(0.5);
__m128d one = _mm_set1_pd(1.0);
__m128d log2e = _mm_set1_pd(1.4426950408889634073599);
__m128d p1 = _mm_set1_pd(1.26177193074810590878E-4);
__m128d p2 = _mm_set1_pd(3.02994407707441961300E-2);
__m128d p3 = _mm_set1_pd(9.99999999999999999910E-1);
__m128d q1 = _mm_set1_pd(3.00198505138664455042E-6);
__m128d q2 = _mm_set1_pd(2.52448340349684104192E-3);
__m128d q3 = _mm_set1_pd(2.27265548208155028766E-1);
__m128d q4 = _mm_set1_pd(2.00000000000000000009E0);
__m128d px = _mm_add_pd(_mm_mul_pd(log2e, x), half);
__m128d t = _mm_cvtepi64_pd(_mm_cvttpd_epi64(px));
px = _mm_sub_pd(t, _mm_and_pd(_mm_cmplt_pd(px, t), one));
__m128i n = _mm_cvtpd_epi64(px);
x = _mm_sub_pd(x, _mm_mul_pd(px, _mm_set1_pd(6.93145751953125E-1)));
x = _mm_sub_pd(x, _mm_mul_pd(px, _mm_set1_pd(1.42860682030941723212E-6)));
__m128d xx = _mm_mul_pd(x, x);
px = _mm_mul_pd(xx, p1);
px = _mm_add_pd(px, p2);
px = _mm_mul_pd(px, xx);
px = _mm_add_pd(px, p3);
px = _mm_mul_pd(px, x);
__m128d qx = _mm_mul_pd(xx, q1);
qx = _mm_add_pd(qx, q2);
qx = _mm_mul_pd(xx, qx);
qx = _mm_add_pd(qx, q3);
qx = _mm_mul_pd(xx, qx);
qx = _mm_add_pd(qx, q4);
x = _mm_div_pd(px, _mm_sub_pd(qx, px));
x = _mm_add_pd(one, _mm_mul_pd(_mm_set1_pd(2.0), x));
n = _mm_add_epi64(n, _mm_set1_epi64x(1023));
n = _mm_slli_epi64(n, 52);
// return?
}
But I'm not able to finish the last lines - i.e. this code:
if (initial_x > details::EXP_LIMIT)
x = std::numeric_limits<double>::infinity();
if (initial_x < -details::EXP_LIMIT)
x = 0.;
return x;
How would you convert in SSE2?
Then of course I need to check the whole, since I'm not quite sure I've converted it correctly.
EDIT: I found the SSE conversion of float exp - i.e. from this:
/* multiply by power of 2 */
z *= details::uint322sp((n + 0x7f) << 23);
if (initial_x > details::MAXLOGF) z = std::numeric_limits<float>::infinity();
if (initial_x < details::MINLOGF) z = 0.f;
return z;
to this:
n = _mm_add_epi32(n, _mm_set1_epi32(0x7f));
n = _mm_slli_epi32(n, 23);
return _mm_mul_ps(z, _mm_castsi128_ps(n));
Yup, dividing two polynomials can often give you a better tradeoff between speed and precision than one huge polynomial. As long as there's enough work to hide the divpd throughput. (The latest x86 CPUs have pretty decent FP divide throughput. Still bad vs. multiply, but it's only 1 uop so it doesn't stall the pipeline if you use it rarely enough, i.e. mixed with lots of multiplies. Including in the surrounding code that uses exp)
However, _mm_cvtepi64_pd(_mm_cvttpd_epi64(px)); won't work with SSE2. Packed-conversion intrinsics to/from 64-bit integers require AVX512DQ.
To do packed rounding to the nearest integer, ideally you'd use SSE4.1 _mm_round_pd(x, _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC), (or truncation towards zero, or floor or ceil towards -+Inf).
But we don't actually need that.
The scalar code ends up with int n and double px both representing the same numeric value. It uses the bad/buggy floor(val+0.5) idiom instead of rint(val) or nearbyint(val) to round to nearest, and then converts that already-integer double to an int (with C++'s truncation semantics, but that doesn't matter because the double value's already an exact integer.)
With SIMD intrinsics, it appears to be easiest to just convert to 32-bit integer and back.
__m128i n = _mm_cvtpd_epi32( _mm_mul_pd(log2e, x) ); // round to nearest
__m128d px = _mm_cvtepi32_pd( n );
Rounding to int with the desired mode, then converting back to double, is equivalent to double->double rounding and then grabbing an int version of that like the scalar version does. (Because you don't care what happens for doubles too large to fit in an int.)
The cvtpd2dq and cvtdq2pd instructions are 2 uops each, and cvtpd2dq packs the 32-bit integers into the low 64 bits of a vector. So to set up for 64-bit integer shifts to stuff the bits into a double again, you'll need to shuffle. The top 64 bits of n will be zeros, so we can use that to create 64-bit integer n lined up with the doubles:
n = _mm_shuffle_epi32(n, _MM_SHUFFLE(3,1,2,0)); // 64-bit integers
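Putting the convert, shuffle, bias and shift steps together, a minimal SSE2 sketch of "round each lane, then build 2^n" might look like this. The function names are illustrative. It assumes the default MXCSR rounding mode and that each rounded value stays within -1022..1023, so the result is a normal double.

```cpp
#include <emmintrin.h>  // SSE2

// Returns 2^round(v) in each lane. Only the low 12 bits of n + 1023 survive
// the shift, so the zero-extension done by the shuffle is harmless for
// negative n in the assumed range.
__m128d pow2_of_rounded(__m128d v) {
    __m128i n = _mm_cvtpd_epi32(v);                     // two int32 in the low 64 bits
    n = _mm_shuffle_epi32(n, _MM_SHUFFLE(3, 1, 2, 0));  // one int32 per 64-bit lane
    n = _mm_add_epi64(n, _mm_set1_epi64x(1023));        // add the exponent bias
    n = _mm_slli_epi64(n, 52);                          // move into the exponent field
    return _mm_castsi128_pd(n);
}

// Helper for inspecting lane 0 or 1 of a vector.
double lane(__m128d v, int i) {
    double out[2];
    _mm_storeu_pd(out, v);
    return out[i];
}
```

For example, lanes holding 3.2 and -1.6 round to 3 and -2, giving 8.0 and 0.25.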
But with just SSE2, there are workarounds. Converting to 32-bit integer and back is one option: you don't care about inputs too small or too large. But packed-conversion between double and int costs at least 2 uops on Intel CPUs each way, so a total of 4. But only 2 of those uops need the FMA units, and your code probably doesn't bottleneck on port 5 with all those multiplies and adds.
Or add a very large number and subtract it again: large enough that each double is 1 integer apart, so normal FP rounding does what you want. (This works for inputs that won't fit in 32 bits, but not double > 2^52. So either way that would work.) Also see How to efficiently perform double/int64 conversions with SSE/AVX? which uses that trick. I couldn't find an example on SO, though.
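The add-and-subtract trick is easiest to see in scalar form; the packed version would use _mm_add_pd and _mm_sub_pd with the same constant. This is a sketch under stated assumptions: |x| is well below 2^51, the rounding mode is the default, and the compiler is not allowed to reassociate floating-point math (no -ffast-math). The function name is made up.

```cpp
// Round to the nearest integer (ties to even) by pushing x into the range
// where consecutive doubles are exactly 1 apart, then pulling it back.
double round_via_magic(double x) {
    const double magic = 6755399441055744.0;  // 2^52 + 2^51
    volatile double shifted = x + magic;       // volatile: keep the add and sub separate
    return shifted - magic;
}
```

Because the addition itself performs the rounding, halfway cases follow the FP rounding mode: 2.5 becomes 2.0 and 3.5 becomes 4.0.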
Fastest Implementation of Exponential Function Using AVX and Fastest Implementation of Exponential Function Using SSE have versions with other speed / precision tradeoffs, for _ps (packed single-precision float).
Fast SSE low precision exponential using double precision operations is at the other end of the spectrum, but still for double.
How many clock cycles does cost AVX/SSE exponentiation on modern x86_64 CPU? discusses some existing libraries like SVML, and Agner Fog's VCL (GPL licensed). And glibc's libmvec.
Then of course I need to check the whole, since I'm not quite sure I've converted it correctly.
Iterating over all 2^64 double bit-patterns is impractical, unlike for float where there are only 4 billion, but maybe iterating over all doubles that have the low 32 bits of their mantissa all zero would be a good start, i.e. check in a loop with
bitpatterns = _mm_add_epi64(bitpatterns, _mm_set1_epi64x( 1ULL << 32 ));
doubles = _mm_castsi128_pd(bitpatterns);
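A scalar version of such a test loop could look like the sketch below. All names are hypothetical, std::exp is used as the reference, and the range is restricted to positive finite inputs so that bit patterns and values are ordered the same way. A vectorized exp would be wrapped in a function with the same signature.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Convert between a double and its 64-bit pattern without aliasing issues.
std::uint64_t to_bits(double d) { std::uint64_t b; std::memcpy(&b, &d, sizeof b); return b; }
double from_bits(std::uint64_t b) { double d; std::memcpy(&d, &b, sizeof d); return d; }

// Largest relative error of approx() against std::exp over the doubles in
// [lo, hi] whose low 32 significand bits are zero.
// Assumes 0 < lo <= hi, both finite, and exp(hi) finite.
double max_relative_error(double (*approx)(double), double lo, double hi) {
    const std::uint64_t step = 1ULL << 32;
    const std::uint64_t last = to_bits(hi);
    double worst = 0.0;
    for (std::uint64_t bits = (to_bits(lo) + step - 1) & ~(step - 1);
         bits <= last; bits += step) {
        const double x = from_bits(bits);
        const double expected = std::exp(x);
        const double err = std::fabs(approx(x) - expected) / expected;
        if (err > worst) worst = err;
    }
    return worst;
}

// Hypothetical stand-ins for the function under test.
double exact_exp(double x) { return std::exp(x); }
double sloppy_exp(double x) { return std::exp(x) * (1.0 + 1e-9); }
```

Covering [1.0, 2.0] this way visits about a million inputs; widening the range to all exponents is what makes the full sweep expensive.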
For those last few lines, correcting the input for out-of-range inputs:
The float version you quote just leaves out the range-check entirely. This is obviously the fastest way, if your inputs will always be in range or if you don't care about what happens for out-of-range inputs.
Alternate cheaper range-checking (maybe only for debugging) would be to turn out-of-range values into NaN by ORing the packed-compare result into the result. (An all-ones bit-pattern represents a NaN.)
__m128d out_of_bounds = _mm_cmplt_pd( limit, abs(initial_x) ); // abs = mask off the sign bit
result = _mm_or_pd(result, out_of_bounds);
In general, you can vectorize simple condition setting of a value using branchless compare + blend. Instead of if(x) y=0;, you have the SIMD equivalent of y = (condition) ? 0 : y;, on a per-element basis. SIMD compares produce a mask of all-zero / all-one elements so you can use it to blend.
e.g. in this case cmppd the input and blendvpd the output if you have SSE4.1. Or with just SSE2, and/andnot/or to blend. See SSE intrinsics for comparison (_mm_cmpeq_ps) and assignment operation for a _ps version of both, _pd is identical.
In asm it will look like this:
; result in xmm0 (in need of fixups for out of range inputs)
; initial_x in xmm2
; constants:
; xmm5 = limit
; xmm6 = +Inf
cmpltpd xmm2, xmm5 ; xmm2 = input_x < limit ? 0xffff... : 0
andpd xmm0, xmm2 ; result = result or 0
andnpd xmm2, xmm6 ; xmm2 = 0 or +Inf (In that order because we used ANDN)
orpd xmm0, xmm2 ; result |= 0 or +Inf
; xmm0 = (input < limit) ? result : +Inf
(In an earlier version of the answer, I thought I was maybe saving a movaps to copy a register, but this is just a bog-standard blend. It destroys initial_x, so the compiler needs to copy that register at some point while calculating result, though.)
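In intrinsics, the same compare-and-blend pattern for both range fixups can be sketched with SSE2 only. The function name, the helper and the limit parameter are assumptions for illustration; limit stands in for details::EXP_LIMIT.

```cpp
#include <emmintrin.h>  // SSE2
#include <limits>

// Forces result to +Inf where initial_x > limit and to +0.0 where
// initial_x < -limit; other lanes pass through unchanged.
__m128d clamp_exp_result(__m128d result, __m128d initial_x, double limit) {
    const __m128d inf = _mm_set1_pd(std::numeric_limits<double>::infinity());
    const __m128d too_big = _mm_cmpgt_pd(initial_x, _mm_set1_pd(limit));
    const __m128d too_small = _mm_cmplt_pd(initial_x, _mm_set1_pd(-limit));
    // AND / ANDNOT / OR blend: +Inf where too_big, otherwise keep result.
    result = _mm_or_pd(_mm_and_pd(too_big, inf), _mm_andnot_pd(too_big, result));
    // +0.0 is the all-zero pattern, so clearing the flagged lanes is enough.
    return _mm_andnot_pd(too_small, result);
}

// Helper for inspecting lane 0 or 1 of a vector.
double get_lane(__m128d v, int i) {
    double out[2];
    _mm_storeu_pd(out, v);
    return out[i];
}
```

With an illustrative limit of 708.0, inputs of 800.0 and -800.0 produce +Inf and 0.0, while in-range inputs leave the computed result untouched.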
Optimizations for this special condition
Or in this case, 0.0 is represented by an all-zero bit-pattern, so do a compare that will produce true if in-range, and AND the output with that. (To leave it unchanged or force it to +0.0). This is better than _mm_blendv_pd, which costs 2 uops on most Intel CPUs (and the AVX 128-bit version always costs 2 uops on Intel). And it's not worse on AMD or Skylake.
+-Inf is represented by a bit-pattern of significand=0, exponent=all-ones. (Any other value in the significand represents +-NaN.) Since too-large inputs will presumably still leave non-zero significands, we can't just AND the compare result and OR that into the final result. I think we need to do a regular blend, or something as expensive (3 uops and a vector constant).
It adds 2 cycles of latency to the final result; both the ANDNPD and ORPD are on the critical path. The CMPPD and ANDPD aren't; they can run in parallel with whatever you do to compute the result.
Hopefully your compiler will actually use ANDPS and so on, not PD, for everything except the CMP, because it's 1 byte shorter but identical because they're both just bitwise ops. I wrote ANDPD just so I didn't have to explain this in comments.
You might be able to shorten the critical path latency by combining both fixups before applying to the result, so you only have one blend. But then I think you also need to combine the compare results.
Or since your upper and lower bounds are the same magnitude, maybe you can compare the absolute value? (mask off the sign bit of initial_x and do _mm_cmplt_pd(abs_initial_x, _mm_set1_pd(details::EXP_LIMIT))). But then you have to sort out whether to zero or set to +Inf.
If you had SSE4.1 for _mm_blendv_pd, you could use initial_x itself as the blend control for the fixup that might need applying, because blendv only cares about the sign bit of the blend control (unlike with the AND/ANDN/OR version where all bits need to match.)
__m128d fixup = _mm_blendv_pd( _mm_set1_pd(INFINITY), _mm_setzero_pd(), initial_x ); // fixup = (initial_x signbit) ? 0 : +Inf
// see below for generating fixup with an SSE2 integer arithmetic-shift
const __m128d signbit_mask = _mm_castsi128_pd(_mm_set1_epi64x(0x7fffffffffffffff)); // ~ set1(-0.0)
__m128d abs_init_x = _mm_and_pd( initial_x, signbit_mask );
__m128d out_of_range = _mm_cmpgt_pd(abs_init_x, _mm_set1_pd(details::EXP_LIMIT));
// Conditionally apply the fixup to result
result = _mm_blendv_pd(result, fixup, out_of_range);
Possibly use cmplt instead of cmpgt and rearrange if you care what happens for initial_x being a NaN. Choosing the compare so false applies the fixup instead of true will mean that an unordered comparison results in either 0 or +Inf for an input of -NaN or +NaN. This still doesn't do NaN propagation. You could _mm_cmpunord_pd(initial_x, initial_x) and OR that into fixup, if you want to make that happen.
Especially on Skylake and AMD Bulldozer/Ryzen where the non-VEX (SSE4.1) blendvpd is only 1 uop, this should be pretty nice. (The VEX encoding, vblendvpd is 2 uops, having 3 inputs and a separate output.)
You might still be able to use some of this idea with only SSE2, maybe creating fixup by doing a compare against zero and then _mm_and_pd or _mm_andnot_pd with the compare result and +Infinity.
Using an integer arithmetic shift to broadcast the sign bit to every position in the double isn't efficient: psraq doesn't exist, only psraw/d. Only logical shifts come in 64-bit element size.
But you could create fixup with just one integer shift and mask, and a bitwise invert
__m128i ix = _mm_castpd_si128(initial_x);
__m128i isign = _mm_srai_epi32(ix, 11); // all 11 bits of exponent field = sign bit
// andnot inverts and masks in one step, clearing the other bits:
// ifixup = the bit pattern for +Inf (non-negative x) or 0 (negative x)
__m128i ifixup = _mm_andnot_si128(isign, _mm_set1_epi64x(0x7FF0000000000000LL));
__m128d fixup = _mm_castsi128_pd(ifixup);
Then blend fixup into result for out-of-range inputs as normal.
Cheaply checking abs(initial_x) > details::EXP_LIMIT
If the exp algorithm was already squaring initial_x, you could compare against EXP_LIMIT squared. But it's not, xx = x*x only happens after some calculation to create x.
If you have AVX512F/VL, VFIXUPIMMPD might be handy here. It's designed for functions where the special case outputs are from "special" inputs like NaN and +-Inf, negative, positive, or zero, saving a compare for those cases. (e.g. for after a Newton-Raphson reciprocal(x) for x=0.)
But both of your special cases need compares. Or do they?
If you square your input and subtract, it only costs one FMA to do initial_x * initial_x - details::EXP_LIMIT * details::EXP_LIMIT to create a result that's negative for abs(initial_x) < details::EXP_LIMIT, and non-negative otherwise.
Agner Fog reports that vfixupimmpd is only 1 uop on Skylake-X.

OpenCL kernel float division gives different result

I have an OpenCL kernel for some computation. I found that only one thread gives a different result from the CPU code. I am using VS2010, x64, release mode.
While checking the OpenCL code with some examples, I found some interesting results. Here are the test examples from the kernel code.
I tested 3 cases in the OpenCL kernel; the precision is checked with printf("%.10f", fval);
case 1:
float fval = (10296184.0) / (float)(x*y*z); // which gives result fval = 3351.6225585938
float fval = (10296184.0f) / (float)(x*y*z); // which gives result fval = 3351.6225585938
The variables are int x, y, z.
Their values are computed by some operations and are x=12, y=16, z=16.
case 2:
float fval = (10296184.0) / (float)(12*16*16); // which gives result fval = 3351.6223144531
float fval = (10296184.0f) / (float)(12*16*16); // which gives result fval = 3351.6223144531
case 3:
However, when I compute the difference of fval by using above two expressions, the result is 0 if using 10296184.0.
float fval = (10296184.0) / (float)(x*y*z) - (10296184.0) / (float)(12*16*16); // which gives result fval = 0.0000000000
float fval = (10296184.0f) / (float)(x*y*z) - (10296184.0f) / (float)(12*16*16); // which gives result fval = 0.0001812663
Could anyone explain the reason or give me some hints?
Some observations:
The two float values differ by 1 ULP. So the results differ by a minimum amount.
// Float ULP in the 2's place here
// v
0x1.a2f3ea0000000p+11 3351.622314... // OP's lower float value
0x1.a2f3eaaaaaaabp+11 3351.622395... // higher precision quotient
0x1.a2f3ec0000000p+11 3351.622558... // OP's higher float value
(10296184.0) / (float)(12*16*16) is calculated at compile time and is the closer result to the expected mathematical answer.
float fval = (10296184.0) / (float)(x*y*z) is calculated at run time.
Considering that float variables are being used, it is surprising that the code does this division with double math. This is a double constant divided by a double (the promoted float product), resulting in a double quotient that is converted to float and then saved. I'd expect 10296184.0f (note the f) to have been used; then the math could all have been done in float.
C allows different rounding modes, denoted by FLT_ROUNDS. These may differ at compile time and run time and may explain the difference. Knowing the result of fegetround() (which returns the current rounding direction) would help.
OP may have employed various compiler optimizations that sacrifice precision for speed.
C does not specify the precision of math operations, yet results good to the last ULP should be expected for * / + - sqrt() modf() on quality platforms. I suspect the code suffers from a weak math implementation.
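The 1 ULP relationship between the two reported values can be reproduced on the host with a short sketch. The function names are illustrative, and the host is assumed to evaluate float division in single precision with correct rounding (as x86-64 does). For context, the OpenCL full profile permits single-precision division an error of up to 2.5 ULP by default, so a run-time result one ULP away from the correctly rounded value is within what a device is allowed to return.

```cpp
#include <cmath>

// Single-precision quotient as computed by the host CPU.
// Assumption: IEEE 754 division, evaluated in float precision.
float rounded_quotient(float num, float den) { return num / den; }

// The next representable float above x, i.e. x plus one ULP.
float one_ulp_up(float x) { return std::nextafter(x, INFINITY); }
```

Here rounded_quotient(10296184.0f, 3072.0f) is 3351.622314453125 (the compile-time value in the question), and one_ulp_up of that is 3351.62255859375 (the run-time value).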

How can I check whether a double has a fractional part?

Basically I have two variables:
double halfWidth = Width / 2;
double halfHeight = Height / 2;
As they are being divided by 2, they will either be a whole number or a decimal. How can I check whether they are a whole number or a .5?
You can use modf, this should be sufficient:
double intpart;
if( modf( halfWidth, &intpart) == 0 )
// your code here
First, you need to make sure that you're using double-precision floating-point math:
double halfWidth = Width / 2.0;
double halfHeight = Height / 2.0;
Because one of the operands is a double (namely, 2.0), this will force the compiler to convert Width and Height to doubles before doing the math (assuming they're not already doubles). Once converted, the division will be done in double-precision floating-point. So it will have a decimal, where appropriate.
The next step is to simply check it with modf.
double temp;
if(modf(halfWidth, &temp) != 0) {
    //Has fractional part.
} else {
    //No fractional part.
}
You may discard a fractional part and compare the result with the original value using floor():
if (floor(halfWidth) == halfWidth) {
    // halfWidth is a whole number
} else {
    // halfWidth has a non-zero fractional part
}
As rightly pointed out by @Dávid Laczkó, it's a better solution than modf() because there's no need for an additional variable.
And according to my benchmarks (Linux, gcc 8.3.0, optimizations -O0...-O3), the floor() call consumes less CPU time than modf() on modern notebook and server processors. The difference even grows with compiler optimizations enabled, probably because modf() takes two arguments whereas floor() takes only one.
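Both checks, together with the integer-division pitfall mentioned earlier, fit in a few lines. The helper names below are made up for illustration.

```cpp
#include <cmath>

// floor-based check: true when x is not a whole number.
bool has_fractional_part(double x) { return std::floor(x) != x; }

// modf-based check: same answer, but needs a variable for the integer part.
bool has_fractional_part_modf(double x) {
    double int_part;
    return std::modf(x, &int_part) != 0.0;
}

// Dividing by 2.0 (not 2) keeps the .5 when n is an odd integer.
double half_of(int n) { return n / 2.0; }
```

For example, half_of(5) is 2.5 and has a fractional part, whereas the integer expression 5 / 2 is already 2 before it is ever stored in a double.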