Float increments precision problems with UI - c++

Here is my problem, I have several parameters that I need to increment by 0.1.
But my UI only renders x.x , x.xx, x.xxx for floats so since 0.1f is not really 0.1 but something like 0.10000000149011612 on the long run my ui will render -0.00 and that doesn't make much sense. How to prevent that for all the possible cases of UI.
Thank you.

Use integers and divide by 10 (or 1000 etc...) just before displaying. Your parameters will store an integer number of tenths, and you'll increment them by 1 tenth.

If you know that your floating point value will always be a multiple of 0.1, you can round it after every increment to make sure it maintains a sensible value. It still won't be exact (because it physically can't be), but at least the errors won't accumulate and it will display properly.
Instead of:
x += delta;
Do:
x = floor((x + delta) / precision + 0.5) * precision;
Edit: It's useful to turn the rounding into a stand-alone function and decouple it from the increment:
inline double round(double value, double precision = 1.0)
{
return floor(value / precision + 0.5) * precision;
}
x = round(x + 0.1, 0.1);

Related

constrain a value -pi to pi for precision buff

What is the best way to constrain any value from -pi to pi ?
I currently have:
if (fAngle > XM_PI) {
fAngle = fAngle - XM_2PI;
}
else if (fAngle < -XM_PI) {
fAngle = fAngle - -XM_2PI;
}
However, I fear those if's should instead be while's
For reference, under the Exploit Symmetrical Functions section:
https://developer.arm.com/solutions/graphics-and-gaming/developer-guides/learn-the-basics/understanding-numerical-precision/mitigating-loss-of-precision
Extra bit of precision!
Adding or subtracting XM_2PI cannot restore any accuracy that has been lost. In fact, it adds noise, generally losing more accuracy, because XM_2PI is necessarily only an approximation of 2π. It has some error itself, so adding or subtracting it adds or subtracts the error in the approximation.
What it can do is keep you from losing more accuracy by ensuring that future results remain low in magnitude, thus remaining in a region where the floating-point format has more precision than if the number grew beyond 4, 8, 16, or other points where the exponent changes and the absolute precision becomes worse.
If you already have some value x outside [−π, π] and want its sine or cosine, you should get the best result by using sin(x) or cos(x) directly. Good implementations of sin and cos will reduce the argument using a high-precision value for 2π, so you will get a better result than using sin(x-XM_PI) or cos(x-XM_PI) (unless, by chance, the various errors in these happen to cancel).
So your task with trigonometric functions is not to reduce values you already have but to design your algorithms to keep values from growing. Adding or subtracting 2π is a reasonable way to do this. However, when you do it, add or subtract an extended-precision version of 2π, not just XM_2PI. You can do this by representing 2π as XM_2PI (which should be the value representable in floating-point that is closest to 2π) plus some residue r. r should be the value representable in floating-point that is closest to 2π−XM_2PI. You can calculate that with extended-precision software such as GMP or Maple and can likely find it online. (I do not have it handy or I would paste it here; anybody else is welcome to edit it in.) Then you would update your angle with fAngle = fAngle - XM_2PI - r; or fAngle = fAngle + XM_2PI + r;.
An exception is if you have the angle measured in some unit that you can represent or reduce exactly, such as in degrees (which you can reduce by 360º with no error as long as the number of degrees itself is represented with no error) or in time (such as number of seconds for some function with a period of a day or other rational number of seconds, so you can again reduce with no error). In that case, you can let the angle grow as long as you can represent it exactly, and you would reduce it modulo the period prior to converting it to radians.
The simplest coding way is to use the math library function remainder, as in
fAngle = remainder( fangle, XM_2PI);
STATIC_INLINE_PURE float const __vectorcall constrain(float const fAngle)
{
static constexpr double const
dPI(std::numbers::pi),
d2PI(2.0 * std::numbers::pi),
dResidue(-1.74845553146951715461909770965576171875e-07); // abs difference between d2PI(double precision) and XM_2PI(float precision)
double dAngle(fAngle);
dAngle = std::remainder(dAngle, d2PI);
if (dAngle > dPI) {
dAngle = dAngle - d2PI - dResidue;
}
else if (dAngle < -dPI) {
dAngle = dAngle + d2PI + dResidue;
}
return((float)dAngle);
}

How to increase accuracy of floating point second derivative calculation?

I've written a simple program to calculate the first and second derivative of a function, using function pointers. My program computes the correct answers (more or less), but for some functions, the accuracy is less than I would like.
This is the function I am differentiating:
float f1(float x) {
return (x * x);
}
These are the derivative functions, using the central finite difference method:
// Function for calculating the first derivative.
float first_dx(float (*fx)(float), float x) {
float h = 0.001;
float dfdx;
dfdx = (fx(x + h) - fx(x - h)) / (2 * h);
return dfdx;
}
// Function for calculating the second derivative.
float second_dx(float (*fx)(float), float x) {
float h = 0.001;
float d2fdx2;
d2fdx2 = (fx(x - h) - 2 * fx(x) + fx(x + h)) / (h * h);
return d2fdx2;
}
Main function:
int main() {
pc.baud(9600);
float x = 2.0;
pc.printf("**** Function Pointers ****\r\n");
pc.printf("Value of f(%f): %f\r\n", x, f1(x));
pc.printf("First derivative: %f\r\n", first_dx(f1, x));
pc.printf("Second derivative: %f\r\n\r\n", second_dx(f1, x));
}
This is the output from the program:
**** Function Pointers ****
Value of f(2.000000): 4.000000
First derivative: 3.999948
Second derivative: 1.430511
I'm happy with the accuracy of the first derivative, but I believe the second derivative is too far off (it should be equal to ~2.0).
I have a basic understanding of how floating point numbers are represented and why they are sometimes inaccurate, but how can I make this second derivative result more accurate? Could I be using something better than the central finite difference method, or is there a way I can get better results with the current method?
The accuracy can be increased by choosing a type which has more precision. float is currently defined as an IEEE-754 32-bit number, giving you a precision of ~7.225 decimal places.
What you want is the 64-bit counterpart: double with ~15.955 decimal places accuracy.
That should be sufficient for your calculation, however worth mentioning is boosts implementation which offers a quadruple-precision floating point number (128-bit).
Finally The GNU Multiple Precision Arithmetic Library offers types with an arbitrary number of decimal places for precision.
Go analytical. ;-) probably not an option given "with the current
method".
Use double instead of float.
Vary the epsilon (h), and combine the results in some way. For example you could try 0.00001, 0.000001, 0.0000001 and average them. In fact, you'd want the result with the smallest h that doesn't overflow/underflow. But it's not clear how to detect overflow and underflow.

The result of own double precision cos() implemention in a shader is NaN, but works well on the CPU. What is going wrong?

as i said, i want implement my own double precision cos() function in a compute shader with GLSL, because there is just a built-in version for float.
This is my code:
double faculty[41];//values are calculated at the beginning of main()
double myCOS(double x)
{
double sum,tempExp,sign;
sum = 1.0;
tempExp = 1.0;
sign = -1.0;
for(int i = 1; i <= 30; i++)
{
tempExp *= x;
if(i % 2 == 0){
sum = sum + (sign * (tempExp / faculty[i]));
sign *= -1.0;
}
}
return sum;
}
The result of this code is, that the sum turns out to be NaN on the shader, but on the CPU the algorithm is working well.
I tried to debug this code too and I got the following information:
faculty[i] is positive and not zero for all entries
tempExp is positive in each step
none of the other variables are NaN during each step
the first time sum is NaN is at the step with i=4
and now my question: What exactly can go wrong if each variable is a number and nothing is divided by zero especially when the algorithm works on the CPU?
Let me guess:
First you determined the problem is in the loop, and you use only the following operations: +, *, /.
The rules for generating NaN from these operations are:
The divisions 0/0 and ±∞/±∞
The multiplications 0×±∞ and ±∞×0
The additions ∞ + (−∞), (−∞) + ∞ and equivalent subtractions
You ruled out the possibility for 0/0 and ±∞/±∞ by stating that faculty[] is correctly initialized.
The variable sign is always 1.0 or -1.0 so it cannot generate the NaN through the * operation.
What remains is the + operation if tempExp ever become ±∞.
So probably tempExp is too high on entry of your function and becomes ±∞, this will make sum to be ±∞ too. At the next iteration you will trigger the NaN generating operation through: ∞ + (−∞). This is because you multiply one side of the addition by sign and sign switches between positive and negative at each iteration.
You're trying to approximate cos(x) around 0.0. So you should use the properties of the cos() function to reduce your input value to a value near 0.0. Ideally in the range [0, pi/4]. For instance, remove multiples of 2*pi, and get the values of cos() in [pi/4, pi/2] by computing sin(x) around 0.0 and so on.
What can go dramatically wrong is a loss of precision. cos(x) usually is implemented by range reduction followed by a dedicated implementation for the range [0, pi/2]. Range reduction uses cos(x+2*pi) = cos(x). But this range reduction isn't perfect. For starters, pi cannot be exactly represented in finite math.
Now what happens if you try something as absurd as cos(1<<30) ? It's quite possible that the range reduction algorithm introduces an error in x that's larger than 2*pi, in which case the outcome is meaningless. Returning NaN in such cases is reasonable.

A small number is rounded to zero

I have the following values:
i->fitness = 160
sum_fitness = 826135
I do the operation:
i->roulette = (int)(((i->fitness / sum_fitness)*100000) + 0.5);
But i keep getting 0 in i->roulette.
I also tried to save i->fitness / sum_fitness in a double variable and only then applying the other operations, but also this gets a 0.
I'm thinking that's because 160/826135 is such a small number, then it rounds it down to 0.
How can i overcome this?
Thank you
edit:
Thanks everyone, i eventually did this:
double temp = (double)(i->fitness);
i->roulette = (int)(((temp / sum_fitness)*100000) + 0.5);
And it worked.
All the answers are similar so it's hard to choose one.
You line
i->roulette = (int)(((i->fitness / sum_fitness)*100000) + 0.5);
is casting the value to int which is why any float operation is truncated
try
i->roulette = (((i->fitness / sum_fitness)*100000) + 0.5);
and make sure that either 'sum_fitness' or 'i->fitness' is of of a float or double type to make the division a floating point division -- if they are not you will need to cast one of them before dividing, like this
i->roulette = (((i->fitness / (double)sum_fitness)*100000) + 0.5);
If you want to make this as a integer calculation you could also try to change the order of the division and multiplication, like
i->roulette = ( i->fitness *100000) / sum_fitness;
which would work as long as you don't get any integer overflow, which in your case would occur only if fitness risk to be above 2000000.
I'm thinking that's because 160/826135 is such a small number, then it rounds it down to 0.
It is integer division, and it is truncated to the integral part. So yes, it is 0, but there is no rounding. 99/100 would also be 0.
You could fix it like by casting the numerator or the denominator to double:
i->roulette = ((i->fitness / static_cast<double>(sum_fitness))*100000) + 0.5;

What is the optimum epsilon/dx value to use within the finite difference method?

double MyClass::dx = ?????;
double MyClass::f(double x)
{
return 3.0*x*x*x - 2.0*x*x + x - 5.0;
}
double MyClass::fp(double x) // derivative of f(x), that is f'(x)
{
return (f(x + dx) - f(x)) / dx;
}
When using finite difference method for derivation, it is critical to choose an optimum dx value. Mathematically, dx must be as small as possible. However, I'm not sure if it is a correct choice to choose it the smallest positive double precision number (i.e.; 2.2250738585072014 x 10−308).
Is there an optimal numeric interval or exact value to choose a dx in to make the calculation error as small as possible?
(I'm using 64-bit compiler. I will run my program on a Intel i5 processor.)
Choosing the smallest possible value is almost certainly wrong: if dx were that smallest number, then f(x + dx) would be exactly equal to f(x) due to rounding.
So you have a tradeoff: Choose dx too small, and you lose precision to rounding errors. Choose it too large, and your result will be imprecise due to changes in the derivative as x changes.
To judge the numeric errors, consider (f(x + dx) - f(x))/f(x)1 mathematically. The numerator denotes the difference you want to compute, but the denominator denotes the magnitude of numbers you're dealing with. If that fraction is about 2‒k, then you can expect approximately k bits of precision in your result.
If you know your function, you can compute what error you'd get from choosing dx too large. You can then balence things, so that the error incurred from this is about the same as the error incurred from rounding. But if you know the function, you might be better off by providing a function that directly computes the derivative, like in your example with the polygonal f.
The Wikipedia section that pogorskiy pointed out suggests a value of sqrt(ε)x, or approximately 1.5e-8 * x. Without any more detailed knowledge about the function, such a rule of thumb will provide a reasonable default. Also note that that same section suggests not dividing by dx, but instead by (x + dx) - x, as this takes rounding errors incurred by computing x + dx into account. But I guess that whole article is full of suggestions you might use.
1 This formula really should divide by f(x), not by dx, even though a past editor thought differently. I'm attempting to compare the amount of significant bits remaining after the division, not the slope of the tangent.
Why not just use the Power Rule to derive the derivative, you'll get an exact answer:
f(x) = 3x^3 - 2x^2 + x - 5
f'(x) = 9x^2 - 4x + 1
Therefore:
f(x) = 3.0 * x * x * x - 2.0 * x * x + x - 5.0
fp(x) = 9.0 * x * x - 4.0 * x + 1.0