How do you properly "snap" to a value? - c++

Let say I've set a step of 0.1 in my application.
So, whatever fp value I get, I just need 1 digit after the comma.
So, 47.93434 must be 47.9 (or at least, the nearest fp representable value).
If I write this:
double value = 47.9;
It correctly "snap" to the nearest fp value it can get, which is:
47.89999999999999857891452847979962825775146484375 // fp value
101111.11100110011001100110011001100110011001100110011 // binary
Now, suppose "I don't write those values", but I got them from a software.
And than I need to snap it. I wrote this function:
inline double SnapValue(double value, double step) {
return round(value / step) * step;
}
But it returns these values:
47.900000000000005684341886080801486968994140625 // fp value
101111.11100110011001100110011001100110011001100110100 // binary
which is formally a little far than the first example (its the "next" fp value, 011 + 1).
How would you get the first value (which is "more correct") for each input value?
Here's the testing code.
NOTE: the step can be different - i.e. step = 0.25 need to snap value around the nearest 0.25 X. Example: a step of 0.25 will return values as 0, 0.25, 0.50, 0.75, 1.0, 1.25 and so on. Thus, given an input of 1.30, it need to wrap to the nearest snapped value - i.e. 1.25.

You could try to use rational values instead of floating point. The latter are often inaccurate already, so not really an ideal match for a step.
inline double snap(double original, int numerator, int denominator)
{
return round(original * denominator / numerator) * numerator / denominator;
}
Say you want steps of 0.4, then use 2 / 5:
snap(1.7435, 2, 5) = round(4.35875) * 2 / 5 = 4 * 2 / 5 = 1.6 (or what comes closest to it)

Related

Interpolation search?

I have a uniform 1D grid with value {0.1, 0.22, 0.35, 0.5, 0.78, 0.92}. These values are equally positioned from position 0 to 5 like following:
value 0.1 0.22 0.35 0.5 0.78 0.92
|_________|_________|_________|_________|_________|
position 0 1 2 3 4 5
Now I like to extract/interpolated value positioned, say, at 2.3, which should be
val(2.3) = val(2)*(3-2.3) + val(3)*(2.3-2)
= 0.35*0.7 + 0.5*0.3
= 0.3950
So how should I do it in a optimized way in C++? I am on Visual Studio 2017.
I can think of a binary search, but is any some std methods/or better way to do the job? Thanks.
You can get the integer part of the interpolation value and use that to index the two values you need to interpolate between. No need to use binary search as you are always know between which two values you interpolate. Only need to look out for indices that are outside of the values if that could ever happen.
This only works if the values are always mapped to integer indices starting with zero.
#include <cmath>
float get( const std::vector<float>& val, float p )
{
// let's assume p is always valid so it is good as index
const int a = static_cast<int>(p); // round down
const float t = p - a;
return std::lerp(val[a], val[a+1], t);
}
Edit:
std::lerp is a c++20 feature. If you use earlier versions you can use the following implementation which should be good enough:
float lerp(float a, float b, float t)
{
return a + (b - a) * t;
}

How does Cpp work with large numbers in calculations?

I have a code that tries to solve an integral of a function in a given interval numerically, using the method of Trapezoidal Rule (see the formula in Trapezoid method ), now, for the function sin(x) in the interval [-pi/2.0,pi/2.0], the integral is waited to be zero.
In this case, I take the number of partitions 'n' equal to 4. The problem is that when I have pi with 20 decimal places it is zero, with 14 decimal places it is 8.72e^(-17), then with 11 decimal places, it is zero, with 8 decimal places it is 8.72e^(-17), with 3 decimal places it is zero. I mean, the integral is zero or a number near zero for different approximations of pi, but it doesn't have a clear trend.
I would appreciate your help in understanding why this happens. (I did run it in Dev-C++).
#include <iostream>
#include <math.h>
using namespace std;
#define pi 3.14159265358979323846
//Pi: 3.14159265358979323846
double func(double x){
return sin(x);
}
int main() {
double x0 = -pi/2.0, xf = pi/2.0;
int n = 4;
double delta_x = (xf-x0)/(n*1.0);
double sum = (func(x0)+func(xf))/2.0;
double integral;
for (int k = 1; k<n; k++){
// cout<<"func: "<<func(x0+(k*delta_x))<<" "<<"last sum: "<<sum<<endl;
sum = sum + func(x0+(k*delta_x));
// cout<<"func + last sum= "<<sum<<endl;
}
integral = delta_x*sum;
cout<<"The value for the integral is: "<<integral<<endl;
return 0;
}
OP is integrating y=sin(x) from -a to +a. The various tests use different values of a, all near pi/2.
The approach uses a linear summation of values near -1.0, down to 0 and then up to near 1.0.
This summation is sensitive to calculation error with the last terms as the final math sum is expected to be 0.0. Since the start/end a varies, the error varies.
A more stable result would be had adding the extreme f = sin(f(k)) values first. e.g. sum += sin(f(k=1)), then sum += sin(f(k=3)), then sum += sin(f(k=2)) rather than k=1,2,3. In particular the formation of term x=f(k=3) is likely a bit off from the negative of its x=f(k=1) earlier term, further compounding the issue.
Welcome to the world or numerical analysis.
Problem exists if code used all float or all long double, just different degrees.
Problem is not due to using an inexact value of pi (Exact value is impossible with FP as pi is irrational and all finite FP are rational).
Much is due to the formation of x. Could try the below to form the x symmetrically about 0.0. Compare exactly x generated this way to x the original way.
x = (x0-x1)/2 + ((k - n/2)*delta_x)
Print out the exact values computed for deeper understanding.
printf("x:%a y:%a\n", x0+(k*delta_x), func(x0+(k*delta_x)));

Precision issues when converting a decimal number to its rational equivalent

I have problem of converting a double (say N) to p/q form (rational form), for this I have the following strategy :
Multiply double N by a large number say $k = 10^{10}$
then p = y*k and q = k
Take gcd(p,q) and find p = p/gcd(p,q) and q = p/gcd(p,q)
when N = 8.2 , Answer is correct if we solve using pen and paper, but as 8.2 is represented as 8.19999999 in N (double), it causes problem in its rational form conversion.
I tried it doing other way as : (I used a large no. 10^k instead of 100)
if(abs(y*100 - round(y*100)) < 0.000001) y = round(y*100)/100
But this approach also doesn't give right representation all the time.
Is there any way I could carry out the equivalent conversion from double to p/q ?
Floating point arithmetic is very difficult. As has been mentioned in the comments, part of the difficulty is that you need to represent your numbers in binary.
For example, the number 0.125 can be represented exactly in binary:
0.125 = 2^-3 = 0b0.001
But the number 0.12 cannot.
To 11 significant figures:
0.12 = 0b0.00011110101
If this is converted back to a decimal then the error becomes obvious:
0b0.00011110101 = 0.11962890625
So if you write:
double a = 0.2;
What the machine actually does is find the closest binary representation of 0.2 that it can hold within a double data type. This is an approximation since as we saw above, 0.2 cannot be exactly represented in binary.
One possible approach is to define an 'epsilon' which determines how close your number can be to the nearest representable binary floating point.
Here is a good article on floating points:
https://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/
have problem of converting a double (say N) to p/q form
... when N = 8.2
A typical double cannot encode 8.2 exactly. Instead the closest representable double is about
8.19999999999999928945726423989981412887573...
8.20000000000000106581410364015027880668640... // next closest
When code does
double N = 8.2;
It will be the 8.19999999999999928945726423989981412887573... that is converted into rational form.
Converting a double to p/q form:
Multiply double N by a large number say $k = 10^{10}$
This may overflow the double. First step should be to determine if the double is large, it which case, it is a whole number.
Do not multiple by some power of 10 as double certainly uses a binary encoding. Multiplication by 10, 100, etc. may introduce round-off error.
C implementations of double overwhelmingly use a binary encoding, so that FLT_RADIX == 2.
Then every finite double x has a significand that is a fraction of some integer over some power of 2: a binary fraction of DBL_MANT_DIG digits #Richard Critten. This is often 53 binary digits.
Determine the exponent of the double. If large enough or x == 0.0, the double is a whole number.
Otherwise, scale a numerator and denominator by DBL_MANT_DIG. While the numerator is even, halve both the numerator and denominator. As the denominator is a power-of-2, no other prime values are needed for simplification consideration.
#include <float.h>
#include <math.h>
#include <stdio.h>
void form_ratio(double x) {
double numerator = x;
double denominator = 1.0;
if (isfinite(numerator) && x != 0.0) {
int expo;
frexp(numerator, &expo);
if (expo < DBL_MANT_DIG) {
expo = DBL_MANT_DIG - expo;
numerator = ldexp(numerator, expo);
denominator = ldexp(1.0, expo);
while (fmod(numerator, 2.0) == 0.0 && denominator > 1.0) {
numerator /= 2.0;
denominator /= 2.0;
}
}
}
int pre = DBL_DECIMAL_DIG;
printf("%.*g --> %.*g/%.*g\n", pre, x, pre, numerator, pre, denominator);
}
int main(void) {
form_ratio(123456789012.0);
form_ratio(42.0);
form_ratio(1.0 / 7);
form_ratio(867.5309);
}
Output
123456789012 --> 123456789012/1
42 --> 42/1
0.14285714285714285 --> 2573485501354569/18014398509481984
867.53089999999997 --> 3815441248019913/4398046511104

Oh where has my precision gone with OpenMesh vector arithmetic?

Using doubles I would expect to have about 15 decimal points of precision. I know that many decimal numbers are not exactly representable in floating point notation, so I would get an approximation for 1/3 for example. However, using a double I would expect an approximation that was correct to about 15 decimal points. I would also expect to retain that level of accuracy when doing arithmetic.
However, in the following example, I try to calculate the area of a triangle using Heron's formula and OpenMesh::Vec3d which are backed by OpenMesh::VectorDataT<double,3> and end up with a result that is only accurate to 5 decimal points.
The correct result is area = 8.19922e-8, but I'm getting area=8.1992238711962083e-8. Any ideas where this is coming from?
The suggestion that this might result from the instability in Heron's Formula is a good one, but unfortunately is not the case in this example. I have added code which calculates the stable variation on Heron for those who might be interested. In this example, u.norm()>v.norm()>w.norm().
#include <OpenMesh/Core/Mesh/PolyMesh_ArrayKernelT.hh>
int main()
{
//triangle vertices
OpenMesh::Vec3d x(0.051051, 0.057411, 0.001355);
OpenMesh::Vec3d y(0.050981, 0.057337, -0.000678);
OpenMesh::Vec3d z(0.050949, 0.057303, 0.0);
//edge vectors
OpenMesh::Vec3d u = x-y;
OpenMesh::Vec3d v = x-z;
OpenMesh::Vec3d w = y-z;
//Heron's Formula
double semiP = (u.norm() + v.norm() + w.norm())/2.0;
double area = sqrt(semiP * (semiP - u.norm()) * (semiP - v.norm()) * (semiP - w.norm()) );
//Heron's Formula for small angles
double areaSmall = sqrt((u.norm() + (v.norm()+w.norm()))*(w.norm()-(u.norm()-v.norm()))*(w.norm()+(u.norm()-v.norm()))*(u.norm()+(v.norm()-w.norm())))/4.0;
}
Heron's formula is numerically unstable. If you have a very "flat" triangle with small angles, the sum of the two small sides is almost the long side, so one of the terms gets very small. If, for example, a and b are the small sides,
(s - c)
will be very small, because
s = (a + b + c)/2
is nearly equal to c.
The wikipedia article about herons formula mentions a stable alternative:
Arrange the sides such that a > b > c and use
A = 1/4*sqrt((a + (b + c))*(c - (a - b))*(c + (a - b))*(a + (b - c)))
To 75 decimal places, the correct area of your triangle is
0.000000081992238711963087279421583920293974467992093148008322378721298327364.
If I replace the nine double constants you have with their decimal equivalents, I get
0.000000081992238711965902754749500279615357792172906541206211853522524016959
It would appear that you are not getting what you're expecting because you're expecting something unreasonable.
Any calculation involving subtraction will result in a loss of precision, if the values are at all close to each other. How many significant digits do you expect from this subtraction?
1.23456789012345
- 1.23456789000000
----------------
0.00000000012345
Both operands have 15 digits of precision, but the result only has 5.

Float increments precision problems with UI

Here is my problem, I have several parameters that I need to increment by 0.1.
But my UI only renders x.x , x.xx, x.xxx for floats so since 0.1f is not really 0.1 but something like 0.10000000149011612 on the long run my ui will render -0.00 and that doesn't make much sense. How to prevent that for all the possible cases of UI.
Thank you.
Use integers and divide by 10 (or 1000 etc...) just before displaying. Your parameters will store an integer number of tenths, and you'll increment them by 1 tenth.
If you know that your floating point value will always be a multiple of 0.1, you can round it after every increment to make sure it maintains a sensible value. It still won't be exact (because it physically can't be), but at least the errors won't accumulate and it will display properly.
Instead of:
x += delta;
Do:
x = floor((x + delta) / precision + 0.5) * precision;
Edit: It's useful to turn the rounding into a stand-alone function and decouple it from the increment:
inline double round(double value, double precision = 1.0)
{
return floor(value / precision + 0.5) * precision;
}
x = round(x + 0.1, 0.1);