Minimum floating point number (closest to zero) - c++

I'm trying to find the minimum value (closest to zero) that I can store in a single-precision floating point number. Using the <limits> header I can get the value, but if I make it much smaller, the float can still hold it and gives the right result. Here is a test program, compiled with g++ 5.3.0.
#include <limits>
#include <iostream>
#include <math.h>
using namespace std;
int main()
{
float a = numeric_limits<float>::max();
float b = numeric_limits<float>::min();
a = a*2;
b = b/pow(2,23);
cout << a << endl;
cout << b << endl;
}
As I expected, "a" gives infinity, but "b" keeps holding the correct result even after dividing the minimum value by 2^23; only after that does it give 0.
The value that numeric_limits<float>::min() gives is 2^(-126), which I believe is the correct answer, but why is the float in my program holding such small numbers?

std::numeric_limits::min for floating-point types gives the smallest non-zero value that can be represented without loss of precision. std::numeric_limits::denorm_min gives the smallest representable positive value. With IEEE representations that's a subnormal value (previously called denormalized). (std::numeric_limits::lowest, by contrast, gives the most negative finite value.)
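The relationship between these limits can be checked directly. The following is a minimal sketch, assuming IEEE 754 floats, default rounding and no flush-to-zero mode; `FloatLimits`, `float_limits` and `min_divided_by_2_pow_23` are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <limits>

// Hypothetical helper: collects the three float limits discussed above.
struct FloatLimits {
    float smallest_normal;     // numeric_limits<float>::min()
    float smallest_subnormal;  // numeric_limits<float>::denorm_min()
    float most_negative;       // numeric_limits<float>::lowest() (C++11)
};

FloatLimits float_limits() {
    typedef std::numeric_limits<float> lim;
    assert(lim::is_iec559);  // this sketch assumes IEEE 754 floats
    FloatLimits r = { lim::min(), lim::denorm_min(), lim::lowest() };
    return r;
}

// Divides min() by 2^23, as the question does. The quotient 2^-149 is
// still representable, but only as a subnormal value.
float min_divided_by_2_pow_23() {
    return std::numeric_limits<float>::min() / 8388608.0f;  // 8388608 = 2^23
}
```

Calling `min_divided_by_2_pow_23()` from `main` and comparing it with `float_limits().smallest_subnormal` shows why the question's "b" survives the division: it lands exactly on the smallest subnormal.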

From wikipedia https://en.wikipedia.org/wiki/Single-precision_floating-point_format:
The minimum positive normal value is 2^−126 ≈ 1.18 × 10^−38 and the
minimum positive (denormal) value is 2^−149 ≈ 1.4 × 10^−45.
So, for
cout << (float)pow(2,-149)
<< "-->" << (float)pow(2,-150)
<< "-->" << (float)pow(2,-151) << endl;
I'm getting:
1.4013e-45-->0-->0

I'm trying to find the minimum value (closest to zero) that I can
store in a single-precision floating point number
0 is the closest value to 0 that you can store in any precision float. In fact, you can store it two ways, as there is a positive and negative 0.
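The two zeros can be told apart only by their sign bit. A small sketch, assuming a C++11 `<cmath>` for `std::signbit`; `is_negative_zero` and `signed_zero_demo` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

// Returns true when x is a zero whose sign bit is set (i.e. -0.0).
bool is_negative_zero(double x) {
    return x == 0.0 && std::signbit(x);
}

void signed_zero_demo() {
    double pz = 0.0, nz = -0.0;
    assert(pz == nz);            // the two zeros compare equal
    assert(!std::signbit(pz));   // positive zero: sign bit clear
    assert(std::signbit(nz));    // negative zero: sign bit set
}
```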

Related

how to make 69.99*100 print 6999 instead of 6998?

I want to have the right 6999 but the code prints 6998, is there any way to implement it in C/C++?
#include <iostream>
using namespace std;
int main() {
double x = 69.99;
int xi = x*100;
cout << xi << endl;
return 0;
}
Your compiler is probably using the IEEE 754 double precision floating point format for representing the C++ data type double. This format cannot represent the number 69.99 exactly. It is stored as 69.989999999999994884. When you multiply this value by 100, the result is slightly smaller than 6999.
When implicitly converting floating-point numbers to integers, the number is always rounded towards zero (positive numbers get rounded down, negative ones get rounded up).
If you don't want to always round the result towards zero, you can change the line
int xi = x*100;
to
long xi = lround( x*100 );
which will not always round the number towards zero, but will instead always round it to the nearest integer.
Note that you must #include <cmath> to be able to use std::lround.
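The difference between the two conversions can be seen side by side. This is a minimal sketch assuming IEEE 754 doubles; `to_cents` and `rounding_demo` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

// Converts an amount to whole hundredths, rounding to the nearest integer.
long to_cents(double amount) {
    return std::lround(amount * 100);
}

void rounding_demo() {
    double x = 69.99;
    int truncated = static_cast<int>(x * 100);  // conversion truncates toward zero
    assert(truncated == 6998);                  // the result reported in the question
    assert(to_cents(x) == 6999);                // rounding to nearest gives 6999
}
```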

sum of double numbers in c++ [duplicate]

I want to calculate the sum of three double numbers and I expect to get 1.
double a=0.0132;
double b=0.9581;
double c=0.0287;
cout << "sum= "<< a+b+c <<endl;
if (a+b+c != 1)
cout << "error" << endl;
The sum is equal to 1 but I still get the error! I also tried:
cout<< a+b+c-1
and it gives me -1.11022e-16
I could fix the problem by changing the code to
if (a+b+c-1 > 0.00001)
cout << "error" << endl;
and it works (no error). How can a negative number be greater than a positive number, and why don't the numbers add up to 1?
Maybe it is something basic with summation and under/overflow but I really appreciate your help.
Thanks
Rational numbers are infinitely precise. Computers are finite.
Precision loss is a well known problem in computer programming.
The real question is, how can you remedy it?
Consider using an approximation function when comparing floats for equality.
#include <algorithm> // std::max
#include <cmath>
#include <iostream>
#include <limits>
using namespace std;
// Relative comparison: the allowed difference scales with the larger magnitude.
template <typename T>
bool ApproximatelyEqual(const T dX, const T dY)
{
return std::abs(dX - dY) <= std::max(std::abs(dX), std::abs(dY))
* std::numeric_limits<T>::epsilon();
}
int main() {
double a=0.0132;
double b=0.9581;
double c=0.0287;
//Evaluates to true and does not print error.
if (!ApproximatelyEqual(a+b+c,1.0)) cout << "error" << endl;
}
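As for the check `a+b+c-1 > 0.00001` in the question: it reports no error because the difference is negative, and a negative number is never greater than a positive threshold. The same one-sided check would also accept a sum of 0.5. A symmetric check takes the absolute value first. The sketch below uses an absolute tolerance; `within_tolerance` and `tolerance_demo` are hypothetical names, and the tolerance is the one from the question.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: true when x and y differ by at most tol in either direction.
bool within_tolerance(double x, double y, double tol) {
    return std::fabs(x - y) <= tol;
}

void tolerance_demo() {
    double sum = 0.0132 + 0.9581 + 0.0287;
    // The one-sided check from the question does not flag 0.5 as different from 1.
    assert(!(0.5 - 1 > 0.00001));
    // The symmetric check does flag it, while still accepting the computed sum.
    assert(!within_tolerance(0.5, 1.0, 0.00001));
    assert(within_tolerance(sum, 1.0, 0.00001));
}
```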
Floating point numbers in C++ have a binary representation. This means that most numbers that can be exactly represented by a decimal fraction with only a few digits cannot be exactly represented by floating point numbers. That's where your error comes from.
One example: 0.1 (decimal) is a periodic fraction in binary:
0.000110011001100110011001100...
Therefore it cannot be represented exactly with any finite number of bits in binary encoding.
In order to avoid this type of error, you can use BCD (binary coded decimal) numbers which are supported by some special libraries. The drawbacks are slower calculation speed (not directly supported by the CPU) and slightly higher memory usage.
Another option is to represent the number as a general fraction and store the numerator and denominator as separate integers.
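The fraction approach can be sketched in a few lines. This is an illustration only, not a full rational-number class: `Fraction`, `gcd_ll` and `add` are hypothetical names, denominators are assumed positive, and overflow is not handled.

```cpp
#include <cassert>

// Hypothetical minimal fraction type: value = num / den, with den > 0.
struct Fraction {
    long long num;
    long long den;
};

// Greatest common divisor by Euclid's algorithm (result is non-negative).
long long gcd_ll(long long a, long long b) {
    while (b != 0) { long long t = a % b; a = b; b = t; }
    return a < 0 ? -a : a;
}

// Adds two fractions exactly and reduces the result to lowest terms.
Fraction add(Fraction x, Fraction y) {
    Fraction r = { x.num * y.den + y.num * x.den, x.den * y.den };
    long long g = gcd_ll(r.num, r.den);
    r.num /= g;
    r.den /= g;
    return r;
}

void fraction_demo() {
    // 0.0132, 0.9581 and 0.0287 from the question, written as exact fractions.
    Fraction a = {132, 10000}, b = {9581, 10000}, c = {287, 10000};
    Fraction sum = add(add(a, b), c);
    assert(sum.num == 1 && sum.den == 1);  // exactly 1, no rounding involved
}
```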

IEEE 754 floating point, what is the largest number < 1?

When using IEEE 754 floating point representation (double type in c++), numbers that are very close to (representable) integers are rounded to their closest integer and represented exactly. Is that true?
Exactly how close does a number have to be to the nearest representable integer before it is rounded?
Is this distance constant?
For example, given that 1 can be represented exactly, what is the largest double less than 1?
When using IEEE 754 floating point representation (double type in
c++), numbers that are very close to (representable) integers are
rounded to their closest integer and represented exactly.
This depends upon whether the number is closer to the integer than to other values representable. 0.99999999999999994 is not equal to 1, but 0.99999999999999995 is.
Is this distance constant?
No. The gap between adjacent representable values grows with magnitude, in particular with larger exponents in the representation. Larger exponents imply larger intervals to be covered by the mantissa, which in turn implies less absolute precision.
For example, what is the largest double less than 1?
std::nexttoward(1.0, 0.0). E.g. 0.999999999999999889 on Coliru.
You will find much more definitive statements regarding the opposite direction from 1.0
The difference between 1.0 and the next larger number is documented here:
std::numeric_limits<double>::epsilon()
The way floating point works, the next smaller number should be exactly half as far away from 1.0 as the next larger number.
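This "half as far" relationship can be verified with std::nextafter. A minimal sketch, assuming IEEE 754 doubles and a C++11 `<cmath>`; `neighbours_of_one_demo` is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

void neighbours_of_one_demo() {
    const double eps   = std::numeric_limits<double>::epsilon();  // gap from 1.0 upward
    const double below = std::nextafter(1.0, 0.0);  // largest double less than 1
    const double above = std::nextafter(1.0, 2.0);  // smallest double greater than 1
    assert(below < 1.0 && 1.0 < above);
    assert(above - 1.0 == eps);      // upward gap is epsilon
    assert(1.0 - below == eps / 2);  // downward gap is half of that
}
```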
The first IEEE double below 1 can be written unambiguously as 0.99999999999999989, but is exactly 0.99999999999999988897769753748434595763683319091796875.
The distance is not constant; it depends on the exponent (and thus the magnitude) of the number. Eventually the gap becomes larger than 1, meaning that even integers can no longer all be represented exactly ("even" in the sense of "integers too", not as opposed to odd; odd integers are in fact the first to get rounded) and will get rounded somewhat (or, eventually, a lot).
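For doubles, the point where the gap first exceeds 1 is 2^53. A short sketch, assuming IEEE 754 doubles with the default round-to-nearest mode and no extended-precision intermediates; `large_gap_demo` is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

void large_gap_demo() {
    const double two53 = 9007199254740992.0;  // 2^53
    const double bumped = two53 + 1.0;        // 2^53 + 1 is odd and not representable
    assert(bumped == two53);                  // so the sum rounds back to 2^53
    const double inf = std::numeric_limits<double>::infinity();
    assert(std::nextafter(two53, inf) - two53 == 2.0);  // gap above 2^53 is 2
    assert(two53 - std::nextafter(two53, 0.0) == 1.0);  // gap below 2^53 is still 1
}
```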
The binary representation of increasing IEEE floating point numbers can be seen as an increasing integer representation:
Sample Hack (Intel):
#include <cstdint>
#include <cstring>
#include <iostream>
#include <limits>
int main() {
double one = 1;
// Copy the bits with memcpy; casting the pointer would violate strict aliasing.
std::uint64_t one_representation;
std::memcpy(&one_representation, &one, sizeof one);
std::uint64_t lesser_representation = one_representation - 1;
double lesser;
std::memcpy(&lesser, &lesser_representation, sizeof lesser);
std::cout.precision(std::numeric_limits<double>::digits10 + 1);
std::cout << std::hex; // affects only the integer output
std::cout << lesser
<< " [" << lesser_representation
<< "] < " << one
<< " [" << one_representation
<< "]\n";
}
Output:
0.9999999999999999 [3fefffffffffffff] < 1 [3ff0000000000000]
As the integer representation advances toward its limit, the difference between consecutive floating point numbers increases each time the exponent bits change.
See also: http://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/
When using IEEE 754 floating point representation (double type in c++), numbers that are very close to exact integers are rounded to the closest integer and represented exactly. Is that true?
This is false.
Exactly how close does a number have to be to the nearest int before it is rounded?
When you do a binary to string conversion the floating point number gets rounded to the current precision (for printf family of functions the default precision is 6) using the current rounding mode.
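The effect of output precision can be shown with a string stream: the stored value is not 1, but it prints as 1 at the default precision of 6. A minimal sketch, assuming a C++11 `<cmath>`; `format` and `printing_demo` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <iomanip>
#include <sstream>
#include <string>

// Formats x with the given number of significant digits (the stream default is 6).
std::string format(double x, int precision) {
    std::ostringstream out;
    out << std::setprecision(precision) << x;
    return out.str();
}

void printing_demo() {
    const double below_one = std::nextafter(1.0, 0.0);  // largest double less than 1
    assert(below_one != 1.0);               // the stored value is distinct from 1
    assert(format(below_one, 6) == "1");    // rounding happens only when printing
    assert(format(below_one, 17) != "1");   // enough digits reveal the difference
}
```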

How to get the lowest representable floating point value in C++

I have a program where I need to set a variable to the lowest representable (non infinite) double-precision floating point number in C++. How am I able to set a variable to the lowest double-precision floating point value?
I tried using std::numeric_limits. I am not using C++11 so I am unable to try to use the lowest() function. I tried to use max(), but when I tried it, it returned infinity. I also tried subtracting a value from max() in the hope that I would then get a representable number.
double max_value = std::numeric_limits<double>::max();
cout << "Test 1: " << max_value << endl;
max_value = max_value - 1;
cout << "Test 2: " << max_value << endl;
double low_value = - std::numeric_limits<double>::max();
cout << "Test 3: " << low_value << endl;
cout << "Test 4: " << low_value + 1 << endl;
Output:
Test 1: inf
Test 2: inf
Test 3: -inf
Test 4: -inf
How can I set low_value in the example above to the lowest representable double?
Once you have -inf (you got it), you can get the lowest finite value with the nextafter function on (-inf,0).
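A minimal sketch of that suggestion, assuming a C++11 `<cmath>` (with an older compiler, the C99 `nextafter` from `<math.h>` may be available instead); `lowest_finite_double` and `lowest_demo` are hypothetical names.

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>
#include <limits>

// Lowest finite double, obtained by stepping from -infinity toward zero.
double lowest_finite_double() {
    const double neg_inf = -std::numeric_limits<double>::infinity();
    return std::nextafter(neg_inf, 0.0);
}

void lowest_demo() {
    const double low = lowest_finite_double();
    assert(std::isfinite(low));                          // no longer infinite
    assert(low == -DBL_MAX);                             // matches the <cfloat> macro
    assert(low == -std::numeric_limits<double>::max());  // and numeric_limits
}
```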
EDIT: Depending on the context, this may be better than -DBL_MAX in case DBL_MAX is represented in decimal (thus in an inexact way). However the C standard requires that floating constants be evaluated in the default rounding mode (i.e. to nearest). In the particular case of GCC, DBL_MAX is a long double value cast to double; however the long double value seems to have enough digits so that, once converted from decimal to long double, the value is exactly representable as a double, so that the cast is exact and the active rounding mode doesn't affect it. As you can see, this is rather tricky, and one may want to check on various platforms that it is correct under any context. In a similar way, I have serious doubts about the correctness of the definition of DBL_EPSILON by GCC on PowerPC (where the long double type is implemented as a double-double arithmetic) since there are many long double values extremely close to a power of two.
The standard library <cfloat>/<float.h> provides macros defining floating point implementation parameters.
The question is somewhat ambiguous - it is not clear whether you mean the smallest magnitude representable non-zero value (which would be DBL_MIN) or the lowest representable value (given by -DBL_MAX). Either way - take your pick as necessary.
It turned out that there was a bug in the iostream that I was using to print the values. I switched to using cstdio instead of iostream. The values were then printed as expected.
double low_value = - std::numeric_limits<double>::max();
cout <<"cout: " << low_value << endl;
printf("printf: %f\n",low_value);
Output:
cout: inf
printf: 179769...

C++: Cosine is wrong, should be zero. 3Pi/2

I have a program and I'm trying to calculate cos(M_PI*3/2) and instead of getting 0, as I should, I get -1.83697e-016
What am I missing here? I am in radians as I need to be.
First, M_PI is not a very portable macro and is usually good to about 15 decimal places, depending on the compiler you use - my guess is you're using Microsoft's C++ compiler.
Second, if you want a more accurate (and portable) version, use the Boost Math library:
http://www.boost.org/doc/libs/1_55_0/libs/math/doc/html/math_toolkit/tutorial/non_templ.html
Third, as Kay has pointed out, pi is an irrational number, and therefore no number of bits (or digits in base 10) would be enough to represent it exactly. Therefore, what you're actually calculating is not cos(3*pi/2) exactly, but "the cosine of 3/2 times the closest approximation of pi given the available bits". That argument will NOT be exactly 3*pi/2, and therefore the result won't be exactly zero.
Finally, if you want custom precision for your mathematical constants, read this: http://www.boost.org/doc/libs/1_55_0/libs/math/doc/html/math_toolkit/tutorial/user_def.html
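If pulling in Boost is not an option, one common standard-library alternative (not the Boost approach described above) is to derive pi from `std::acos(-1.0)` and compare the cosine against a tolerance rather than against exact zero. The sketch below assumes that `acos(-1.0)` returns the double nearest to pi, which is typical but not guaranteed; `portable_pi`, `near_zero` and `cosine_demo` are hypothetical names, and the tolerance 1e-15 is an arbitrary choice for values of this magnitude.

```cpp
#include <cassert>
#include <cmath>

// Assumption: acos(-1.0) yields the double nearest to pi on this platform.
double portable_pi() {
    return std::acos(-1.0);
}

// True when |x| is within an absolute tolerance of zero.
bool near_zero(double x, double tol) {
    return std::fabs(x) <= tol;
}

void cosine_demo() {
    const double c = std::cos(3 * portable_pi() / 2);
    assert(near_zero(c, 1e-15));  // tiny residue, treated as zero
}
```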
The number M_PI is only an approximation of π. The cosine that you get back is also an approximation, and it's a pretty good one - it has fifteen correct digits after the decimal point.
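Given the discrete nature of double values, a common margin of error against which to test results of magnitude around 1 is numeric_limits<double>::epsilon(), the gap between 1.0 and the next larger double (it is not a universal tolerance for other magnitudes):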
#include <iostream>
#include <limits>
#include <cmath>
using namespace std;
int main()
{
double x = cos(M_PI*3/2);
cout << "x = " << x << endl;
cout << "numeric_limits<double>::epsilon() = "
<< numeric_limits<double>::epsilon() << endl;
cout << "Is x sufficiently close to 0? "
<< (abs(x) < numeric_limits<double>::epsilon() ? "yes" : "no") << endl;
return 0;
}
Output:
x = -1.83697e-16
numeric_limits<double>::epsilon() = 2.22045e-16
Is x sufficiently close to 0? yes
As you can see, the absolute value of -1.83697e-16 is within the margin of error given by epsilon 2.22045e-16.
Pi is irrational, so the computer cannot represent it perfectly. The small deviation from the "correct" value of pi causes the error in the output. Being 1.83697 × 10^-16 off is still pretty good.
If you want to learn more about the restrictions of actual systems and the impact of small errors in the input, refer to http://en.wikipedia.org/wiki/Numerical_stability.