C++ half library has lower precision for positive numbers

I know that I am using a capability that is not built into C++; however, this library seems to be so commonly used that I am surprised to see this error pop up.
For those of you who do not know the library, it can be found here. Essentially, it is supposed to add support for 16-bit floating-point (lower precision) numbers.
My problem is that the precision of half floats appears to diminish for positive numbers.
In this code, I am generating a bunch of points to be rendered to the screen. {xs1, ys1} holds the sigmoid calculated in float precision. {xs3, ys3} holds the same values cast to half precision.
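vector<float> xs1, ys1, xs3, ys3;
int res = 200000;
for (int i = 0; i < res; i++)
{
    float perc = float(i) / float(res);         // 0 .. 1
    float fx = ((perc - 0.5f) * 2.0f) * 8.0f;   // -8 .. 8
    half hx = half(fx);
    float fy = MFunctions::sigmoid(fx);
    half hy = half(fy);
    // float-precision points
    xs1.push_back(fx); ys1.push_back(fy);
    // half-precision points, converted back to float for plotting
    xs3.push_back(float(hx)); ys3.push_back(float(hy));
}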
Here are the results (looking at zoomed in portions of the graph this generates with a window width of 2.2 and a window height of 0.02 units):
When looking at the float precision graph {xs1, ys1}, both corners of the sigmoid function are smooth:
However, when looking at the half precision graph {xs3, ys3} the corner in the positive x axis shows a stepping effect while the corner in the negative x axis shows a lower resolution but smooth graph:
I am not sure why this is happening since the only difference between positive and negative numbers should be a sign bit.
Am I doing something wrong, or is this a flaw in the half library?

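The sigmoid function's output lies in [0, 1], so what you see is normal: in the bottom picture the values are close to 1, where a half float has far coarser absolute resolution than it has close to 0. Floating-point precision is relative to the magnitude of the value, not to its sign; negative x values merely happen to map to outputs near 0.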
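To put rough numbers on this: IEEE 754 half precision stores 10 explicit significand bits, so the gap between neighbouring representable values doubles with every power of two. The sketch below estimates that gap using plain double arithmetic. `half_spacing` is a hypothetical helper, not part of the half library, and it assumes the library follows the IEEE binary16 layout.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper: distance between adjacent IEEE 754 binary16 values
// near x (assumes 10 explicit significand bits, minimum normal exponent -14).
double half_spacing(double x)
{
    if (x == 0.0)
        return std::ldexp(1.0, -24);      // smallest subnormal step
    int e = 0;
    std::frexp(std::fabs(x), &e);         // |x| = m * 2^e with m in [0.5, 1)
    int exponent = std::max(e - 1, -14);  // floor(log2|x|), clamped for subnormals
    return std::ldexp(1.0, exponent - 10);
}
```

Under these assumptions `half_spacing(0.75)` is 2^-11 (about 0.00049), so a window 0.02 units tall just below 1 contains only about 41 distinct values, while `half_spacing(0.01)` is 2^-17, which is 64 times finer.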


Evenly distribute grid cells horizontally/vertically?

I'm trying to draw a grid inside a window of game_width=640 and game_height=480. The number of grid cells is predefined. I want to evenly distribute the cells horizontally and vertically.
void GamePaint(HDC dc)
{
    int numcells = 11;
    for(int i = 1; i <= numcells; i++)
    {
        int y = i * (game_height / numcells); // first attempt, see below
        MoveToEx(dc, 0, y, NULL);
        LineTo(dc, game_width, y);
    }
    // solving the horizontal equation will in turn solve the vertical one so no need to show the vertical code
}
The first equation that came to mind was:
i * (game_height / numcells)
The notion behind this is to divide the total height by the number of cells to get the even cell size, which is then multiplied by i in each iteration of the loop to get the correct y coordinate of the beginning of the horizontal line.
The problem with that is that it seems to leave an extra small cell at the end:
I figured this must have something to do with the integer division, so we come to the second equation:
(int)(i * ((float)game_height / numcells))
The idea is to avoid the integer division, do a float division, multiply by i like before and cast the result back to int. This works well, no extra small cell at the end!
What's driving me nuts is this equation:
i * game_height / numcells
which seems to have the same effect as the previous equation, but of course with the added benefit of not doing any casting. I can't figure out why this doesn't suffer from the integer division issues in the first equation.
Note that mathematically: X * (Y / Z) == X * Y / Z
So there's definitely a problem with integer division in the first equation.
Here's a video watching these 3 equations in the debugger watch window.
You can see an increasing gap between the result of the first equation vs the second and third (which always had the same result) as i increases.
Why does the third equation give the same correct results as the second one, without suffering from the integer division error of the first? I just can't seem to wrap my head around it...
Any help is appreciated.
The answer is in the order of operations performed. The first equation
int y = i * (game_height / numcells);
performs the division first, and magnifies the rounding error by multiplying with i. The further you move down/to the right, the larger the accumulated rounding error becomes.
The last equation
int y = i * game_height / numcells;
is evaluated left to right. The relative error gets smaller because the dividend (i * game_height) is larger. You still have a rounding error, but it doesn't accumulate. The reason you do not wind up with extra space at the final evaluation is that i is then equal to numcells, so the two cancel each other out. You will always see y == game_height in the final iteration.
Floating point operations are more accurate than the first equation: there, the rounding error is in the interval [0 .. numcells) for each line, whereas dividing in floating point and truncating the final result reduces that to [0 .. 1), the same bound the last equation achieves. You will see more evenly distributed lines than with the first equation.
Note: On Windows you can use MulDiv instead of the integer equation, to prevent common errors such as transient overflows.
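The difference between the two integer equations can be checked with a small sketch. The function names are hypothetical; the values 480 and 11 are the ones from the question.

```cpp
// Hypothetical helpers mirroring the first and the last equation.

// Divide first: the truncation error of (height / cells) is multiplied by i.
int y_divide_first(int i, int height, int cells)
{
    return i * (height / cells);
}

// Multiply first: only a single truncation, applied to the final quotient.
int y_multiply_first(int i, int height, int cells)
{
    return i * height / cells;
}
```

With game_height = 480 and numcells = 11, 480 / 11 truncates to 43, so `y_divide_first(11, 480, 11)` is 473 and leaves a 7 pixel strip at the bottom, while `y_multiply_first(11, 480, 11)` is exactly 480.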

My own double-precision cos() implementation returns NaN in a shader but works well on the CPU. What is going wrong?

As I said, I want to implement my own double-precision cos() function in a GLSL compute shader, because the built-in version only exists for float.
This is my code:
double faculty[41];//values are calculated at the beginning of main()
double myCOS(double x)
{
    double sum,tempExp,sign;
    sum = 1.0;
    tempExp = 1.0;
    sign = -1.0;
    for(int i = 1; i <= 30; i++)
    {
        tempExp *= x;          // x^i
        if(i % 2 == 0){        // only even powers contribute to cos
            sum = sum + (sign * (tempExp / faculty[i]));
            sign *= -1.0;
        }
    }
    return sum;
}
The result of this code is that sum turns out to be NaN on the shader, whereas on the CPU the algorithm works well.
I tried to debug this code too and I got the following information:
faculty[i] is positive and not zero for all entries
tempExp is positive in each step
none of the other variables are NaN during each step
the first time sum is NaN is at the step with i=4
And now my question: what exactly can go wrong if every variable is a number and nothing is divided by zero, especially when the algorithm works on the CPU?
Let me guess:
First you determined the problem is in the loop, and you use only the following operations: +, *, /.
The rules for generating NaN from these operations are:
The divisions 0/0 and ±∞/±∞
The multiplications 0×±∞ and ±∞×0
The additions ∞ + (−∞), (−∞) + ∞ and equivalent subtractions
You ruled out the possibility for 0/0 and ±∞/±∞ by stating that faculty[] is correctly initialized.
The variable sign is always 1.0 or -1.0 so it cannot generate the NaN through the * operation.
What remains is the + operation, if tempExp ever becomes ±∞.
So x is probably too large on entry to your function and tempExp overflows to ±∞, which makes sum ±∞ too. At the next iteration you trigger the NaN-generating operation ∞ + (−∞). This is because you multiply one side of the addition by sign, and sign switches between positive and negative at each iteration.
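This overflow scenario can be reproduced on the CPU with a minimal sketch. It mirrors the loop from the question but accumulates the factorial on the fly, and `first_nan_step` is a hypothetical name. The input 1e200 is an arbitrary value chosen to force an overflow; whether the shader input was anywhere near that large is an assumption.

```cpp
#include <cmath>

// Returns the loop index at which sum first becomes NaN, or 0 if it never does.
int first_nan_step(double x)
{
    double sum = 1.0, tempExp = 1.0, sign = -1.0, factorial = 1.0;
    for (int i = 1; i <= 30; i++) {
        tempExp *= x;      // overflows to +inf once x^i exceeds the double range
        factorial *= i;    // i!
        if (i % 2 == 0) {
            sum = sum + (sign * (tempExp / factorial));
            sign *= -1.0;
        }
        if (std::isnan(sum))
            return i;
    }
    return 0;
}
```

For x = 1e200, tempExp is already infinite at i = 2, sum becomes −∞ there, and the addition at i = 4 evaluates −∞ + ∞, so the function returns 4, the same step the question reports.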
You're trying to approximate cos(x) around 0.0, so you should use the properties of the cos() function to reduce your input to a value near 0.0, ideally in the range [0, pi/4]. For instance, remove multiples of 2*pi, and get the values of cos() in [pi/4, pi/2] by computing sin(x) around 0.0, and so on.
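A minimal CPU-side sketch of the first reduction step, written in C++ rather than GLSL: it only removes multiples of 2*pi (bringing x into [-pi, pi]) and then sums the Taylor series. It does not implement the further reduction to [0, pi/4], and `cos_reduced` is a hypothetical name.

```cpp
#include <cmath>

// Cosine via Taylor series after removing multiples of 2*pi.
double cos_reduced(double x)
{
    const double two_pi = 6.283185307179586476925286766559;
    x = std::remainder(x, two_pi);  // now |x| <= pi, cos is 2*pi periodic
    double term = 1.0;              // (-1)^k * x^(2k) / (2k)!
    double sum = 1.0;
    for (int k = 1; k <= 15; ++k) {
        // Build each term from the previous one so no large power
        // or factorial is ever formed on its own.
        term *= -x * x / ((2.0 * k - 1.0) * (2.0 * k));
        sum += term;
    }
    return sum;
}
```

Because every term is derived from the previous one and |x| never exceeds pi, no intermediate value can overflow, whatever the original input was. The accuracy for very large inputs is still limited by how well 2*pi is represented.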
What can go dramatically wrong is a loss of precision. cos(x) is usually implemented by range reduction followed by a dedicated implementation for the range [0, pi/2]. Range reduction uses cos(x+2*pi) = cos(x). But this range reduction isn't perfect. For starters, pi cannot be represented exactly in finite-precision arithmetic.
Now what happens if you try something as absurd as cos(1<<30) ? It's quite possible that the range reduction algorithm introduces an error in x that's larger than 2*pi, in which case the outcome is meaningless. Returning NaN in such cases is reasonable.

Oh where has my precision gone with OpenMesh vector arithmetic?

Using doubles I would expect to have about 15 significant decimal digits of precision. I know that many decimal numbers are not exactly representable in floating point notation, so I would get an approximation for 1/3 for example. However, using a double I would expect an approximation that is correct to about 15 significant digits. I would also expect to retain that level of accuracy when doing arithmetic.
However, in the following example, I try to calculate the area of a triangle using Heron's formula and OpenMesh::Vec3d, which is backed by OpenMesh::VectorDataT<double,3>, and end up with a result that is only accurate to 5 significant digits.
The correct result is area = 8.19922e-8, but I'm getting area=8.1992238711962083e-8. Any ideas where this is coming from?
The suggestion that this might result from the instability in Heron's Formula is a good one, but unfortunately is not the case in this example. I have added code which calculates the stable variation on Heron for those who might be interested. In this example, u.norm()>v.norm()>w.norm().
#include <cmath>
#include <OpenMesh/Core/Mesh/PolyMesh_ArrayKernelT.hh>

int main()
{
    //triangle vertices
    OpenMesh::Vec3d x(0.051051, 0.057411, 0.001355);
    OpenMesh::Vec3d y(0.050981, 0.057337, -0.000678);
    OpenMesh::Vec3d z(0.050949, 0.057303, 0.0);
    //edge vectors
    OpenMesh::Vec3d u = x-y;
    OpenMesh::Vec3d v = x-z;
    OpenMesh::Vec3d w = y-z;
    //Heron's Formula
    double semiP = (u.norm() + v.norm() + w.norm())/2.0;
    double area = std::sqrt(semiP * (semiP - u.norm()) * (semiP - v.norm()) * (semiP - w.norm()) );
    //Heron's Formula for small angles
    double areaSmall = std::sqrt((u.norm() + (v.norm()+w.norm()))*(w.norm()-(u.norm()-v.norm()))*(w.norm()+(u.norm()-v.norm()))*(u.norm()+(v.norm()-w.norm())))/4.0;
    return 0;
}
Heron's formula is numerically unstable. If you have a very "flat" triangle with small angles, the sum of the two small sides is almost the long side, so one of the terms gets very small. If, for example, a and b are the small sides,
(s - c)
will be very small, because
s = (a + b + c)/2
is nearly equal to c.
The Wikipedia article about Heron's formula mentions a stable alternative:
Arrange the sides such that a > b > c and use
A = 1/4*sqrt((a + (b + c))*(c - (a - b))*(c + (a - b))*(a + (b - c)))
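As a minimal sketch, both variants can be written for plain double side lengths, independent of OpenMesh. The function names are hypothetical, and the stable variant sorts its arguments itself so the caller does not have to.

```cpp
#include <algorithm>
#include <cmath>

// Textbook Heron: s - a, s - b or s - c can lose digits for flat triangles.
double heron_naive(double a, double b, double c)
{
    double s = (a + b + c) / 2.0;
    return std::sqrt(s * (s - a) * (s - b) * (s - c));
}

// Stable variant: order the sides so that a >= b >= c and keep the
// parentheses exactly as written in the formula.
double heron_stable(double a, double b, double c)
{
    if (a < b) std::swap(a, b);
    if (b < c) std::swap(b, c);
    if (a < b) std::swap(a, b);
    return 0.25 * std::sqrt((a + (b + c)) * (c - (a - b)) *
                            (c + (a - b)) * (a + (b - c)));
}
```

For a well-conditioned triangle such as 3, 4, 5 both functions return 6; they only start to differ for needle-like triangles, where the naive version loses digits in one of the differences.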
To 75 decimal places, the correct area of your triangle is
If I replace the nine double constants you have with their decimal equivalents, I get
It would appear that you are not getting what you're expecting because you're expecting something unreasonable.
Any calculation involving subtraction will result in a loss of precision, if the values are at all close to each other. How many significant digits do you expect from this subtraction?
- 1.23456789000000
Both operands have 15 digits of precision, but the result only has 5.

Converting polygon coordinates from Double to Long for use with Clipper library

I have two polygons with their vertices stored as Double coordinates. I'd like to find the intersecting area of these polygons, so I'm looking at the Clipper library (C++ version). The problem is, Clipper only works with integer math (it uses the Long type).
Is there a way I can safely transform both my polygons with the same scale factor, convert their coordinates to Longs, perform the Intersection algorithm with Clipper, and scale the resulting intersection polygon back down with the same factor, and convert it back to a Double without too much loss of precision?
I can't quite get my head around how to do that.
You can use a simple multiplier to convert between the two:
#include <limits>
#include <stdexcept>

/* Using power-of-two because it is exactly representable and makes
the scaling operation (not the rounding!) lossless. The value 1024
preserves roughly three decimal digits. */
double const scale = 1024.0;
// representable range
double const min_value = std::numeric_limits<long>::min() / scale;
double const max_value = std::numeric_limits<long>::max() / scale;

long to_long(double v)
{
    if(v < 0)
    {
        if(v < min_value)
            throw std::out_of_range("value below representable range");
        return static_cast<long>(v * scale - 0.5); // round half away from zero
    }
    else
    {
        if(v > max_value)
            throw std::out_of_range("value above representable range");
        return static_cast<long>(v * scale + 0.5); // round half away from zero
    }
}
Note that the larger you make the scale, the higher your precision will be, but it also lowers the range. Effectively, this converts a floating-point number into a fixed-point number.
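The round trip the question asks about (scale up, clip, scale back down) can be sketched as follows. The helper names and the use of long long are assumptions made for illustration, range checking is left out for brevity, and the call into Clipper itself is not shown.

```cpp
#include <cmath>

// Assumed scale; a power of two so the multiplication itself is exact.
const double kScale = 1024.0;

// Hypothetical helper: double coordinate to fixed-point integer,
// rounded to the nearest integer.
long long to_fixed(double v)
{
    return std::llround(v * kScale);
}

// Hypothetical helper: fixed-point integer back to a double coordinate.
double from_fixed(long long n)
{
    return static_cast<double>(n) / kScale;
}
```

With this scale a coordinate survives the round trip with an error of at most 1/2048, so `from_fixed(to_fixed(0.1))` gives 0.099609375. Applying the same `kScale` to both polygons keeps them consistent with each other.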
Lastly, you should be able to locate code to compute intersections between line segments using floating-point math easily, so I wonder why you want to use exactly Clipper.

Fastest way to compute angle with x-axis

What is the fastest way to calculate the angle between a line and the x-axis?
I need to define a function that is injective on the interval [π, 2π] (I need the angle between the uppermost point and any point below it).
PointType * top = UPPERMOST_POINT;
PointType * targ = TARGET_POINT;
double targetVectorX = targ->x - top->x;
double targetVectorY = targ->y - top->y;
// first try: cosine of the angle
double magnitudeTarVec = sqrt(targetVectorX*targetVectorX + targetVectorY*targetVectorY);
double angle = targetVectorX / magnitudeTarVec;
// second try (#2, slower)
angle = atan2(targetVectorY, targetVectorX);
I do not need the angle itself (in radians or degrees); any value is fine as long as comparing the values for two points tells me which angle is bigger. (For example, the value in the first try lies between -1 and 1, since it is the cosine of the angle.)
Check for y being zero, as atan2 does; then the quotient x/y will be plenty fine (assuming I understand you correctly).
I just wrote "Fastest way to sort vectors by angle without actually computing that angle" about the general question of finding functions monotonic in the angle, without any code or connections to C++ or the likes. Based on the currently accepted answer there I'd now suggest
double angle = copysign( // magnitude of first argument with sign of second
    1. - targetVectorX/(fabs(targetVectorX) + fabs(targetVectorY)),
    targetVectorY);
The great benefit compared to the currently accepted answer here is the fact that you won't have to worry about infinite values, since all non-zero vectors (i.e. targetVectorX and targetVectorY are not both equal to zero at the same time) will result in finite pseudoangle values. The resulting pseudoangles will be in the range [−2 … 2] for real angles [−π … π], so the signs and the discontinuity are just like you'd get them from atan2.
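Wrapped into a standalone function, the idea can be sketched like this. `pseudoangle` is a hypothetical name, and the caller is assumed to rule out the zero vector.

```cpp
#include <cmath>

// Monotonic stand-in for atan2(y, x), mapping [-pi, pi] onto [-2, 2].
// The result is not meaningful when x and y are both zero (0/0).
double pseudoangle(double x, double y)
{
    return std::copysign(1.0 - x / (std::fabs(x) + std::fabs(y)), y);
}
```

The four axis directions map to 0 for (1, 0), 1 for (0, 1), 2 for (-1, 0) and -1 for (0, -1), so comparing two pseudoangles orders the vectors the same way comparing their atan2 values would, without any call to sqrt or a trigonometric function.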