C++ float comparison - c++

I want to compare floats. I had a problem when comparing equality so I used epsilon and it was solved
inline bool isEqual(float x, float y)
{
const float epsilon = 1e-5;
return abs(x - y) <= epsilon * abs(x);
}
but I also want to compare other comparisons such as '>' and '<='
I have two floats = 49 but when executing f1 > f2 it returns true.
I have tried this function:
inline bool isSmallerOrEqual(float x, float y)
{
const float epsilon = 0.01;
return epsilon * abs(x) <= epsilon * abs(y);
}
It worked but not for all values.
Any ideas ?

First, your isEquals function may be wrong, you're using the relative epsilon version. This version is works better with extremely large numbers, but breaks down when using extremely small numbers. The other version would be to use absolute epsilon. This version works well with small numbers (including extremely small numbers), but breaks down with extremely large numbers.
inline bool epsilonEquals(const float x, const float y, const float epsilon = 1E-5f)
{
return abs(x - y) <= epsilon;
}
Second, calling it isEquals is inaccurate, the function is generally known as Epsilon Equals because it doesn't validate that 2 numbers are equal, it validates that they are reasonably close to each other within a small margin of error (epsilon).
Third, if you want to check that 2 numbers are less than or equal to each other then you generally don't need or even want an epsilon equals function involved, doing so only increases the likelihood of false positive. If you do want to use epsilon comparison you can take advantage of paulmckenzie's method:
inline bool epsilonLessThanOrEqualTo(const float x, const float y, const float epsilon = 1E-5f)
{
return x <= y || epsilonEquals(x, y, epsilon);
}
If you want to have an epsilon not-equals method you can simply check that the absolute difference between the 2 numbers is larger than epsilon.

I was trying to add float values (all have one precision) through for loop, and then compare the result value with a normal declared float. But I was not getting the answers right.
I solved it by converting the floats to int by multiplying each with 10, then compare it with value*10

Related

How to increase accuracy of floating point second derivative calculation?

I've written a simple program to calculate the first and second derivative of a function, using function pointers. My program computes the correct answers (more or less), but for some functions, the accuracy is less than I would like.
This is the function I am differentiating:
float f1(float x) {
return (x * x);
}
These are the derivative functions, using the central finite difference method:
// Function for calculating the first derivative.
float first_dx(float (*fx)(float), float x) {
float h = 0.001;
float dfdx;
dfdx = (fx(x + h) - fx(x - h)) / (2 * h);
return dfdx;
}
// Function for calculating the second derivative.
float second_dx(float (*fx)(float), float x) {
float h = 0.001;
float d2fdx2;
d2fdx2 = (fx(x - h) - 2 * fx(x) + fx(x + h)) / (h * h);
return d2fdx2;
}
Main function:
int main() {
pc.baud(9600);
float x = 2.0;
pc.printf("**** Function Pointers ****\r\n");
pc.printf("Value of f(%f): %f\r\n", x, f1(x));
pc.printf("First derivative: %f\r\n", first_dx(f1, x));
pc.printf("Second derivative: %f\r\n\r\n", second_dx(f1, x));
}
This is the output from the program:
**** Function Pointers ****
Value of f(2.000000): 4.000000
First derivative: 3.999948
Second derivative: 1.430511
I'm happy with the accuracy of the first derivative, but I believe the second derivative is too far off (it should be equal to ~2.0).
I have a basic understanding of how floating point numbers are represented and why they are sometimes inaccurate, but how can I make this second derivative result more accurate? Could I be using something better than the central finite difference method, or is there a way I can get better results with the current method?
The accuracy can be increased by choosing a type which has more precision. float is currently defined as an IEEE-754 32-bit number, giving you a precision of ~7.225 decimal places.
What you want is the 64-bit counterpart: double with ~15.955 decimal places accuracy.
That should be sufficient for your calculation, however worth mentioning is boosts implementation which offers a quadruple-precision floating point number (128-bit).
Finally The GNU Multiple Precision Arithmetic Library offers types with an arbitrary number of decimal places for precision.
Go analytical. ;-) probably not an option given "with the current
method".
Use double instead of float.
Vary the epsilon (h), and combine the results in some way. For example you could try 0.00001, 0.000001, 0.0000001 and average them. In fact, you'd want the result with the smallest h that doesn't overflow/underflow. But it's not clear how to detect overflow and underflow.

Is there any "standard" way to calculate the numerical gradient?

I am trying to calculate the numerical gradient of a smooth function in c++. And the parameter value could vary from zero to a very large number(maybe 1e10 to 1e20?)
I used the function f(x,y) = 10*x^3 + y^3 as a testbench, but I found that if x or y is too large, I can't get correct gradient.
Here is my code to calculate the graidient:
#include <iostream>
#include <cmath>
#include <cassert>
using namespace std;
double f(double x, double y)
{
// black box expensive function
return 10 * pow(x, 3) + pow(y, 3);
}
int main()
{
// double x = -5897182590.8347721;
// double y = 269857217.0017581;
double x = 1.13041e+19;
double y = -5.49756e+14;
const double epsi = 1e-4;
double f1 = f(x, y);
double f2 = f(x, y+epsi);
double f3 = f(x, y-epsi);
cout << f1 << endl;
cout << f2 << endl;
cout << f3 << endl;
cout << f1 - f2 << endl; // 0
cout << f2 - f3 << endl; // 0
return 0;
}
If I use the above code to calculate the gradient, the gradient would be zero!
The testbench function, 10*x^3 + y^3, is just a demo, the real problem I need to solve is actually a black box function.
So, is there any "standard" way to calculate the numerical gradient?
In the first place, you should use the central difference scheme, which is more accurate (by cancellation of one more term of the Taylor develoment).
(f(x + h) - f(x - h)) / 2h
rather than
(f(x + h) - f(x)) / h
Then the choice of h is critical and using a fixed constant is the worst thing you can do. Because for small x, h will be too large so that the approximation formula no more works, and for large x, h will be too small, resulting in severe truncation error.
A much better choice is to take a relative value, h = x√ε, where ε is the machine epsilon (1 ulp), which gives a good tradeoff.
(f(x(1 + √ε)) - f(x(1 - √ε))) / 2x√ε
Beware that when x = 0, a relative value cannot work and you need to fall back to a constant. But then, nothing tells you which to use !
You need to consider the precision needed.
At first glance, since |y| = 5.49756e14 and epsi = 1e-4, you need at least ⌈log2(5.49756e14)-log2(1e-4)⌉ = 63 bits of significand precision (that is the number of bits used to encode the digits of your number, also known as mantissa) for y and y+epsi to be considered different.
The double-precision floating-point format only has 53 bits of significand precision (assuming it is 8 bytes). So, currently, f1, f2 and f3 are exactly the same because y, y+epsi and y-epsi are equal.
Now, let's consider the limit : y = 1e20, and the result of your function, 10x^3 + y^3. Let's ignore x for now, so let's take f = y^3. Now we can calculate the precision needed for f(y) and f(y+epsi) to be different : f(y) = 1e60 and f(epsi) = 1e-12. This gives a minimum significand precision of ⌈log2(1e60)-log2(1e-12)⌉ = 240 bits.
Even if you were to use the long double type, assuming it is 16 bytes, your results would not differ : f1, f2 and f3 would still be equal, even though y and y+epsi would not.
If we take x into account, the maximum value of f would be 11e60 (with x = y = 1e20). So the upper limit on precision is ⌈log2(11e60)-log2(1e-12)⌉ = 243 bits, or at least 31 bytes.
One way to solve your problem is to use another type, maybe a bignum used as fixed-point.
Another way is to rethink your problem and deal with it differently. Ultimately, what you want is f1 - f2. You can try to decompose f(y+epsi). Again, if you ignore x, f(y+epsi) = (y+epsi)^3 = y^3 + 3*y^2*epsi + 3*y*epsi^2 + epsi^3. So f(y+epsi) - f(y) = 3*y^2*epsi + 3*y*epsi^2 + epsi^3.
The only way to calculate gradient is calculus.
Gradient is a vector:
g(x, y) = Df/Dx i + Df/Dy j
where (i, j) are unit vectors in x and y directions, respectively.
One way to approximate derivatives is first order differences:
Df/Dx ~ (f(x2, y)-f(x1, y))/(x2-x1)
and
Df/Dy ~ (f(x, y2)-f(x, y1))/(y2-y1)
That doesn't look like what you're doing.
You have a closed form expression:
g(x, y) = 30*x^2 i + 3*y^2 j
You can plug in values for (x, y) and calculate the gradient exactly at any point. Compare that to your differences and see how well your approximation is doing.
How you implement it numerically is your responsibility. (10^19)^3 = 10^57, right?
What is the size of double on your machine? Is it a 64 bit IEEE double precision floating point number?
Use
dx = (1+abs(x))*eps, dfdx = (f(x+dx,y) - f(x,y)) / dx
dy = (1+abs(y))*eps, dfdy = (f(x,y+dy) - f(x,y)) / dy
to get meaningful step sizes for large arguments.
Use eps = 1e-8 for one-sided difference formulas, eps = 1e-5 for central difference quotients.
Explore automatic differentiation (see autodiff.org) for derivatives without difference quotients and thus much smaller numerical errors.
We can examine the behaviour of the error in the derivative using the following program - it calculates the 1-sided derivative and the central difference based derivative using a varying step size. Here I'm using x and y ~ 10^10, which is smaller than what you were using, but should illustrate the same point.
#include <iostream>
#include <cmath>
#include <cassert>
using namespace std;
double f(double x, double y) {
return 10 * pow(x, 3) + pow(y, 3);
}
double f_x(double x, double y) {
return 3 * 10 * pow(x,2);
}
double f_y(double x, double y) {
return 3 * pow(y,2);
}
int main()
{
// double x = -5897182590.8347721;
// double y = 269857217.0017581;
double x = 1.13041e+10;
double y = -5.49756e+10;
//double x = 10.1;
//double y = -5.2;
double epsi = 1e8;
for(int i=0; i<60; ++i) {
double dfx_n = (f(x+epsi,y) - f(x,y))/epsi;
double dfx_cd = (f(x+epsi,y) - f(x-epsi,y))/(2*epsi);
double dfx = f_x(x,y);
cout<<epsi<<" "<<fabs(dfx-dfx_n)<<" "<<fabs(dfx - dfx_cd)<<std::endl;
epsi/=1.5;
}
return 0;
}
The output shows that a 1-sided difference gets us an optimal error of about 1.37034e+13 at a step length of about 100.0. Note that while this error looks large, as a relative error it is 3.5746632302764072e-09 (since the exact value is 3.833e+21)
In comparison the 2-sided difference gets an optimal error of about 1.89493e+10 with a step size of about 45109.3. This is three-orders of magnitude better, (with a much larger step-size).
How can we work out the step size? The link in the comments of Yves Daosts answer gives us a ballpark value:
h=x_c sqrt(eps) for 1-Sided, and h=x_c cbrt(eps) for 2-Sided.
But either way, if the required step size for decent accuracy at x ~ 10^10 is 100.0, the required step size with x ~ 10^20 is going to be 10^10 larger too. So the problem is simply that your step size is way too small.
This can be verified by increasing the starting step-size in the above code and resetting the x/y values to the original values.
Then expected derivative is O(1e39), best 1-sided error of about O(1e31) occurs near a step length of 5.9e10, best 2-sided error of about O(1e29) occurs near a step length of 6.1e13.
As numerical differentiation is ill conditioned (which means a small error could alter your result significantly) you should consider to use Cauchy's integral formula. This way you can calculate the n-th derivative with an integral. This will lead to less problems with considering accuracy and stability.

Points on the same line

I was doing a practice question and it was something like this,We are given N pair of coordinates (x,y) and we are given a central point too which is (x0,y0).We were asked to find maximum number of points lying on a line passing from (x0,y0).
My approach:- I tried to maintain a hash map having slope as the key and I thought to get the maximum second value to get maximum number of points on the same line.Something like this
mp[(yi-y0)/(xi-x0))]++; //i from 0 to n
And iterating map and doing something line this
if(it->second >max) //it is the iterator
max=it->second;
and printing max at last;
Problem With my approach- Whenever I get (xi-x0) as 0 I get runtime error.I also tried atan(slope) so that i would get degrees instead of some not defined value but still its not working.
What i expect->How to remove this runtime error and is my approach correct for finding maximum points on a line passing from a point(x0,y0).
P.S -My native language is not english so please ignore if something goes wrong.I tried my best to make everything clear If i am not clear enough please tell me
I'm assuming no other points have the same coordinates as your "origin".
If all your coordinates happen to be integers, you can keep a rational number (i.e. a pair of integers, i.e. a numerator and a denominator) as the slope, instead of a single real number.
The slope is DeltaY / DeltaX, so all you have to do is keep the pair of numbers separate. You just need to take care to divide the pair by their greatest common divisor, and handle the case where DeltaX is zero. For example:
pair<int, int> CalcSlope (int x0, int y0, int x1, int y1)
{
int dx = abs(x1 - x0), dy = abs(y1 - y0);
int g = GCD(dx, dy);
return {dy / g, dx / g};
}
Now just use the return value of CalcSlope() as your map key.
In case you need it, here's one way to calculate the GCD:
int GCD (int a, int b)
{
if (0 == b) return a;
else return gcd(b, a % b);
}
You have already accepted an answer, but I would like to share my approach anyway. This method uses the fact that three points a, b, and c are covariant if and only if
(a.first-c.first)*(b.second-c.second) - (a.second-c.second)*(b.first-c.first) == 0
You can use this property to create a custom comparison object like this
struct comparePoints {
comparePoints(int x0 = 0, int y0 = 0) : _x0(x0), _y0(y0) {}
bool operator()(const point& a, const point& b) {
return (a.first-_x0)*(b.second-_y0) - (b.first-_x0)*(a.second-_y0) < 0;
}
private:
int _x0, _y0;
};
which you can then use as a comparison object of a map according to
comparePoints comparator(x0, y0);
map<pair<int, int>, int, comparePoints> counter(comparator);
You can then add points to this map similar to what you did before:
if (!(x == x0 && y == y0))
counter[{x,y}]++;
By using comparitor as a comparison object, two keys a, b in the map are considered equal if !comparator(a, b) && !comparator(b,a), which is true if and only if a, b and {x0,y0} are collinear.
The advantage of this method is that you don't need to divide the coordinates which avoids rounding errors and problems with dividing by zero, or calculate the atan which is a costly operation.
Move everything so that the zero point is at the origin:
(xi, yi) -= (x0, y0)
Then for each point (xi, yi), find the greatest common divisor of xi and yi and divide both numbers by it:
k = GCD(xi, yi)
(xi', yi`) = (yi/k, yi/k)
Now points that are on the same ray will map to equal points. If (xi, yi) is on the same ray as (xj, yj) then (xi', yi') = (xj', yj').
Now find the largest set of equal points (don't forget any (xi, yi) = (0, 0)) and you have your answer.
You've a very original approach here !
Nevertheless, a vertical line has a infinite slope and this is the problem here: dividing by 0 is not allowed.
Alternative built on your solution (slope):
...
int mpvertical=0; // a separate couner for verticals
if (xi-x0)
mp[(yi-y0)/(xi-x0))]++;
else if (yi-y0)
mpvertical++;
// else the point (xi,yi) is the point (x0,y0): it shall not be counted)
This is cumbersome, because you have everything in the map plus this extra counter. But it will be exact. A workaround could be to count such points in mp[std::numeric_limits<double>::max()], but this would be an approximation.
Remark: the case were xi==x0 AND yi==y0 corresponds to your origin point. These points have to be discarded as they belong to every line line.
Trigonomic alternative (angle):
This uses the general atan2 formula used to converting cartesian coordinates into polar coordinates, to get the angle:
if (xi!=x0 && yi!=y0) // the other case can be ignored
mp[ 2*atan((yi-y0)/((xi-x0)+sqrt(pow(xi-x0,2)+pow(yi-y0,2)))) ]++;
so your key for mp will be an angle between -pi and +pi. No more extra case, but slightly more calculations.
You can hide these extra details and use the slighltly more optimized build in function:
if (xi!=x0 && yi!=y0) // the other case can be ignored
mp[ atan2(yi-y0, xi-x0) ]++;
you can give this approach a try
struct vec2
{
vec2(float a,float b):x(a),y(b){}
float x,y;
};
bool isColinear(vec2 a, vec2 b, vec2 c)
{
return fabs((a.y-b.y)*(a.x-c.x) - (a.y-c.y)*(a.x-b.x)) <= 0.000001 ;
}

Make sure c++ decimal comparison is correct

I have two double variable. double a = 0.10000, double b = 0.1. How can I make sure the comparison (a == b) is always true ?
If you are being paranoid about using == on doubles or floats (which you should be) you can always check that they are close within a small tolerance.
bool same = fabs(a-b) < 0.000001;
The other answers here require you to scale the tolerance factor manually, which I wouldn't advise. For instance if you are comparing two numbers less than one millionth, one answer will always say the two numbers are "close enough." The other answer instead leaves it to the caller to specify which is equally error-prone.
I would instead suggest something like the following function. It will return 0 if the two doubles are within the stated range of each other, otherwise -1 (if d1 is smaller), or +1. Using fabs() may require you to link with the math library, such as with -lm.
#include <algorithm> // for max()
#include <cmath> // for fabs()
int double_compare( double d1, double d2 ) {
double dEpsilon = .00000001;
double dLarger = std::max( std::fabs(d1), std::fabs(d2) );
double dRange = dLarger * dEpsilon;
if ( std::fabs( d1 - d2 ) < dRange )
return 0;
return d1 < d2 ? -1: 1;
}
New answer to old question, but using epsilons is the way to go, check this example:
bool equals(const double a, const double b, const double maxRelativDiff = numeric_limits<double>::epsilon()) {
double difference = fabs(a - b);
const auto absoluteA = fabs(a);
const auto absoluteB = fabs(b);
double biggerBoi = (absoluteB > absoluteA) ? absoluteB : absoluteA; // Get the bigger number
return difference <= (biggerBoi * maxRelativDiff);
}
In this case you're checking if they are equal up to maxRelativDiff, so 0.0001 == 0.0001.
Check: https://en.cppreference.com/w/cpp/types/numeric_limits/epsilon

comparing float variable [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Most effective way for float and double comparison
How dangerous is it to compare floating point values?
I have const float M = 0.000001; and float input;. I want to not equality check on them. But I know direct check has side effect M != input. So, my question how I can compare two float value without side effect ?
const double epsilon = 1e-12;
if(fabs(input - M) < epsilon) //input == M
{
//...
}
if(fabs(input - M) >= epsilon) // input != M
{
//...
}
The smaller the value of epsilon the more accurate the comparison is, therefore the more the probablity that it will tell you that two values are not equal whereas you wanted them to be considered equal. The larger the value of epsilon, the more the probability that it will tell you the results are equal when in fact you wanted them to be not equal. The value of epsilon should be chosen in accordance with the specifics of the task at hand.
When comparing floats, you have to compare them for being "close" instead of "equal." There are multiple ways to define "close" based on what you need. However, a typical approach could be something like:
namespace FloatCmp {
const float Eps = 1e-6f;
bool eq(float a, float b, float eps = Eps) {
return fabs(a - b) < eps;
}
//etc. for neq, lt, gt, ...
}
Then, use FloatCmp::eq() instead of == to compare floats.