When using IEEE 754 floating point representation (double type in C++), numbers that are very close to (representable) integers are rounded to their closest integer and represented exactly. Is that true?
Exactly how close does a number have to be to the nearest representable integer before it is rounded?
Is this distance constant?
For example, given that 1 can be represented exactly, what is the largest double less than 1?
When using IEEE 754 floating point representation (double type in C++), numbers that are very close to (representable) integers are rounded to their closest integer and represented exactly.
This depends on whether the number is closer to the integer than to any other representable value. 0.99999999999999994 is not equal to 1, but 0.99999999999999995 is.
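A minimal sketch to check this yourself, assuming double is IEEE 754 binary64 and literals are rounded to nearest:

#include <iostream>

int main() {
    // Each decimal literal is rounded to the nearest representable double:
    // 0.99999999999999994 is closer to the largest double below 1.0,
    // while 0.99999999999999995 is closer to 1.0 itself.
    std::cout << std::boolalpha
              << (0.99999999999999994 == 1.0) << '\n'   // false
              << (0.99999999999999995 == 1.0) << '\n';  // true
}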
Is this distance constant?
No. The absolute distance between adjacent representable values grows with larger magnitudes - in particular with larger exponents in the representation. Larger exponents imply larger intervals to be covered by the mantissa, which in turn implies less precision overall.
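A small sketch showing how the gap to the next representable double grows with magnitude (std::nextafter returns the adjacent representable value):

#include <cmath>
#include <cstdio>

int main() {
    // The gap between adjacent doubles doubles with each exponent step.
    double xs[] = {1.0, 1024.0, 1e16};
    for (double x : xs) {
        std::printf("gap above %g is %g\n", x, std::nextafter(x, INFINITY) - x);
    }
}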
For example, what is the largest double less than 1?
std::nexttoward(1.0, 0.0). E.g. 0.999999999999999889 on Coliru.
You will find much more definitive statements regarding the opposite direction, i.e. the first double above 1.0.
The difference between 1.0 and the next larger number is documented here:
std::numeric_limits<double>::epsilon()
The way floating point works, the next smaller number should be exactly half as far away from 1.0 as the next larger number.
The first IEEE double below 1 can be written unambiguously as 0.99999999999999989, but is exactly 0.99999999999999988897769753748434595763683319091796875.
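Both claims are easy to confirm with std::nextafter; a minimal sketch, assuming binary64:

#include <cmath>
#include <cstdio>
#include <limits>

int main() {
    double below = std::nextafter(1.0, 0.0);  // largest double < 1
    double above = std::nextafter(1.0, 2.0);  // smallest double > 1
    double eps = std::numeric_limits<double>::epsilon();
    // 1 - below is exactly eps/2, while above - 1 is exactly eps.
    std::printf("1 - below = %.17g, eps/2 = %.17g\n", 1.0 - below, eps / 2);
    std::printf("above - 1 = %.17g, eps   = %.17g\n", above - 1.0, eps);
}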
The distance is not constant; it depends on the exponent (and thus the magnitude) of the number. Eventually the gap becomes larger than 1, meaning that even integers will get rounded ("even" as in "also", not as opposed to odd - although odd integers are indeed the first to get rounded) - somewhat at first and, eventually, a lot.
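For example, just above 2^53 the gap between consecutive doubles is already 2, so the odd integers there cannot be represented; a minimal sketch:

#include <cstdio>

int main() {
    double a = 9007199254740992.0;  // 2^53, exactly representable
    double b = 9007199254740993.0;  // 2^53 + 1, rounds to 2^53
    std::printf("%.0f == %.0f ? %d\n", a, b, a == b);  // prints 1
}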
The bit patterns of increasing positive IEEE floating point numbers, reinterpreted as integers, are themselves increasing:
Sample Hack (Intel):
#include <cstdint>
#include <iostream>
#include <limits>
int main() {
    double one = 1;
    // Type punning through reinterpret_cast formally violates strict
    // aliasing (undefined behavior), but works on common compilers.
    std::uint64_t one_representation = *reinterpret_cast<std::uint64_t*>(&one);
    std::uint64_t lesser_representation = one_representation - 1;

    std::cout.precision(std::numeric_limits<double>::digits10 + 1);
    std::cout << std::hex;  // integers print in hex; doubles are unaffected
    std::cout << *reinterpret_cast<double*>(&lesser_representation)
              << " [" << lesser_representation
              << "] < " << *reinterpret_cast<double*>(&one_representation)
              << " [" << one_representation
              << "]\n";
}
Output:
0.9999999999999999 [3fefffffffffffff] < 1 [3ff0000000000000]
As the integer representation advances towards its limits, the difference between consecutive floating point numbers doubles every time the exponent bits increment.
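For reference, the same bit stepping can be done without the aliasing hack; a minimal sketch using std::memcpy (C++20's std::bit_cast would also work):

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    double one = 1.0;
    std::uint64_t bits;
    std::memcpy(&bits, &one, sizeof bits);  // well-defined type punning
    --bits;                                 // step down to the next double
    double below;
    std::memcpy(&below, &bits, sizeof below);
    std::printf("%.17g < %.17g\n", below, one);
}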
See also: http://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/
When using IEEE 754 floating point representation (double type in C++), numbers that are very close to exact integers are rounded to the closest integer and represented exactly. Is that true?
This is false.
Exactly how close does a number have to be to the nearest int before it is rounded?
When you do a binary-to-string conversion, the floating point number gets rounded to the current precision (for the printf family of functions the default precision is 6) using the current rounding mode.
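A short sketch of that default precision at work:

#include <cstdio>

int main() {
    // The stored double is slightly below 1, but the default %f
    // precision of 6 rounds the printed text up to 1.000000.
    std::printf("%f\n", 0.9999999);    // 1.000000
    std::printf("%.7f\n", 0.9999999);  // 0.9999999
}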
Related
The IEEE 754 double precision floating point format has a binary precision of 53 bits, which translates into log10(2^53) ~ 16 significant decimal digits.
If the double precision format is used to store a floating point number in a 64-bit word in memory, with 52 bits for the significand and 1 hidden bit, but a larger precision is used to output the number to the screen, what data is actually read from memory and written to the output?
How can it even be read, when the total length of the word is 64 bit, does the read-from-memory operation on the machine just simply read more bits and interprets them as an addition to the significand of the number?
For example, take the number 0.1. It does not have an exact binary floating point representation regardless of the precision used, because it has an indefinitely repeating binary floating point pattern in the significand.
If 0.1 is stored with the double precision, and printed to the screen with the precision >16 like this in the C++ language:
#include <iostream>
#include <iomanip>

using namespace std;

int main()
{
    double x = 0.1;
    cout << setprecision(50) << "x = " << x << endl;
}
The output (on my machine at the point of execution), is:
x = 0.1000000000000000055511151231257827021181583404541
If correct rounding is used with 2 guard bits and 1 sticky bit, can I trust the decimal values given by the first three non-zero digits in the error 5.551115123125783e-17?
Every binary fraction is exactly equal to some decimal fraction. If, as is usually the case, double is a binary floating point type, each double number has an exactly equal decimal representation.
For what follows, I am assuming your system uses IEEE 754 64-bit binary floating point to represent double. That is not required by the standard, but is very common. The closest number to 0.1 in that format has exact value 0.1000000000000000055511151231257827021181583404541015625
Although this number has a lot of digits, it is exactly equal to 3602879701896397/2^55. Multiplying both numerator and denominator by 5^55 converts it to a decimal fraction (the denominator becomes 10^55), while increasing the number of digits in the numerator.
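That ratio can be checked directly, since both operands are exactly representable and the correctly rounded quotient is the double itself; a minimal sketch:

#include <iostream>

int main() {
    // 2^55 = 36028797018963968; the double nearest to 0.1 is exactly
    // 3602879701896397 / 2^55, so the division reproduces it.
    std::cout << std::boolalpha
              << (0.1 == 3602879701896397.0 / 36028797018963968.0)  // true
              << '\n';
}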
One common approach, consistent with the result in the question, is to use round-to-nearest to the number of digits required by the format. That will indeed give useful information about the rounding error on conversion of a string to double.
IEEE 754 numbers are not uniformly spaced, the larger the numbers, the bigger the difference between two consecutive representable numbers.
My long double in C++ is 16 bytes. So what is the largest "safe" whole number n that can be represented with this type?
I call n "safe" if n - 1 can be represented but n + 1 cannot.
The IEEE 754 standard defines the parameters of various numerical types:
https://en.wikipedia.org/wiki/IEEE_754
for long doubles which are 128 bits long (IEEE binary128), the mantissa (the part of the floating point number that contains the significant digits) is 113 bits, so it can represent every integer with full precision up to 2^113. It can represent floating point numbers which are larger, but beyond that you start losing integer precision because the low-order digits get rounded. (Note that on x86, a 16-byte long double is often the 80-bit extended format padded to 16 bytes, with a 64-bit significand, in which case the bound is 2^64.)
As I understand it, you're asking for the largest contiguously representable integer. It is exactly:
std::pow(std::numeric_limits<long double>::radix, std::numeric_limits<long double>::digits)
or expressed in math: radix^digits, where (quoted from cppreference)
The value of std::numeric_limits<T>::radix is the base of the number system used in the representation of the type. It is 2 for all binary numeric types, but it may be, for example, 10 for IEEE 754 decimal floating-point types ...
The value of std::numeric_limits<T>::digits is the number of digits in base-radix that can be represented by the type T without change. ... For floating-point types, this is the number of digits in the mantissa
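A small sketch that evaluates this expression and checks the question's "safe" property:

#include <cmath>
#include <iostream>
#include <limits>

int main() {
    // n = radix^digits: every integer in [0, n] is exactly representable.
    long double n = std::pow(
        static_cast<long double>(std::numeric_limits<long double>::radix),
        std::numeric_limits<long double>::digits);
    std::cout.precision(std::numeric_limits<long double>::max_digits10);
    std::cout << "n - 1 = " << n - 1.0L << '\n'   // representable
              << "n     = " << n << '\n'
              << "n + 1 = " << n + 1.0L << '\n';  // rounds back to n
}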
In C++ you can get a lot of information using <limits>. For example:
#include <limits>
#include <iostream>
#include <iomanip>
int main() {
    auto p = std::numeric_limits<long double>::max_digits10;
    std::cout << "Max Long Double: "
              << std::setprecision(p)
              << std::setw(p + 7)
              << std::numeric_limits<long double>::max()
              << std::endl;
}
Prints:
Max Long Double: 1.18973149535723176502e+4932
I'm seeing some error when simply assigning a floating point value which contains only 4 significant figures. I wrote a short program to debug and I don't understand what the problem is. After verifying the limits of a float on my platform is seems like there shouldn't be any error. What's causing this?
#include <stdlib.h>
#include <stdio.h>
#include <limits>
#include <iostream>
int main() {
    printf("float size: %zu\n", sizeof(float));
    printf("float max: %e\n", std::numeric_limits<float>::max());
    printf("float significant figures: %i\n", std::numeric_limits<float>::digits10);

    float a = 760.5e6;
    printf("%.9f\n", a);
    std::cout.precision(9);
    std::cout << a << std::endl;

    double b = 760.5e6;
    printf("%.9f\n", b);
    std::cout << b << std::endl;
    return 0;
}
The output:
float size: 4
float max: 3.402823e+38
float significant figures: 6
760499968.000000000
760499968
760500000.000000000
760500000
A float has 24 bits of precision, which is roughly equivalent to 7 decimal digits. A double has 53 bits of precision, which is roughly equivalent to 16 decimal digits.
As mentioned in the comments, 760.5e6 is not exactly representable by float; however, it is exactly representable by double. This is why the printed results for double are exact, and those from float are not.
It is legal to request printing of more decimal digits than are representable by your floating point number, as you did. The results you report are not an error -- they are simply the result of the decimal printing algorithm doing the best it can.
The stored number in your float is 760499968. This is expected behavior for an IEEE 754 binary32 floating point number, which is what floats usually are.
IEEE 754 floating point numbers are stored in three parts: a sign bit, an exponent, and a mantissa. Since all these values are stored as bits the resulting number is sort of the binary equivalent of scientific notation. The mantissa bits are one less than the number of binary digits allowed as significant figures in the binary scientific notation.
Just like with decimal scientific numbers, if the exponent exceeds the significant figures, you're going to lose integer precision.
The analogy only extends so far: the mantissa is a modification of the coefficient found in the decimal scientific notation you might be familiar with, and there are certain bit patterns that have special meaning in the standard.
The ultimate result of this storage mechanism is that the integer 760500000 cannot be exactly represented by IEEE 754 binary32 with its 23-bit mantissa: floats lose integer-level precision after 2^(mantissa_bits + 1) = 2^24 = 16777216; the first unrepresentable integer is 16777217. The closest floats to 760500000 are 760499968 and 760500032; 760500000 lies exactly halfway between them, and the former is chosen by the round-ties-to-even rule. Printing the integer at a greater precision than the floating point number can represent then naturally results in apparent inaccuracies.
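A sketch showing the spacing of floats near this value, using std::nextafter:

#include <cmath>
#include <cstdio>

int main() {
    float a = 760.5e6f;
    // Near 7.6e8, consecutive floats are 64 apart; 760500000 falls exactly
    // halfway between 760499968 and 760500032, and the tie rounds to even.
    std::printf("stored value: %.0f\n", a);                            // 760499968
    std::printf("gap to next:  %.0f\n", std::nextafter(a, 1e9f) - a);  // 64
}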
A double, which is 64 bits in your case, naturally has more precision than a float, which is 32 bits in your case. Therefore, this is an expected result.
The standard does not require that a type represent all numbers below std::numeric_limits<T>::max() with full precision.
The number you display is off only in the 8th digit and after. That is well within the 6 digits of accuracy you are guaranteed for a float. If you only printed 6 digits, the output would get rounded and you'd see the value you expect.
printf("%0.6g\n", a);
See http://ideone.com/ZiHYuT
I have come seeking knowledge.
I am trying to understand floating point numbers.
I am trying to figure out why, when I print the largest floating point number, it does not print correctly.
The largest value is (2 - 2^(-23)) * 2^127, i.e. the maximum significand times the largest exponent:
1.99999988079071044921875 * 1.7014118346046923173168730371588e+38 = 3.4028234663852885981170418348451e+38
This should be the largest single-precision floating point number:
340282346638528859811704183484510000000.0
So,
float i = 340282346638528859811704183484510000000.0;
printf("Float %.38f", i);
Output: 340282346638528860000000000000000000000.0
Obviously the number is being rounded up, so I am trying to figure out just exactly what is going on.
My questions are:
The Wikipedia documentation states that 3.4028234663852885981170418348451e+38 is the largest number that can be represented in IEEE-754 single-precision floating point.
Is the number stored in the floating point register = 0 11111110 11111111111111111111111 and it is just being displayed incorrectly?
If I write printf("Float %.38f", FLT_MAX); I get the same answer.
Perhaps the computer I am using does not use IEEE-754?
I understand errors with calculations, but I don't understand why the number
340282346638528860000000000000000000000.0 is the largest floating point number that can be accurately represented.
Maybe the Mantissa * Exponent is causing calculation errors? If that is true, then 340282346638528860000000000000000000000.0 would be the largest number that can be faithfully represented without calculation errors. I guess that would make sense. Just need a blessing.
Thanks,
Looks like the culprit is printf() (I guess because float is implicitly converted to double when passed to it):
#include <iostream>
#include <limits>
int main()
{
    std::cout.precision(38);
    std::cout << std::numeric_limits<float>::max() << std::endl;
}
Output is:
3.4028234663852885981170418348451692544e+38
With float as binary32, the largest finite float is
340282346638528859811704183484516925440.0
printf("%.1f", FLT_MAX) is not obliged to print exactly to 38+ significant digits, so seeing output like the below is not unexpected.
340282346638528860000000000000000000000.0
printf() will print floating point values accurately to DECIMAL_DIG significant digits. DECIMAL_DIG is at least 10. If more than DECIMAL_DIG digits of significance are requested, a compliant printf() may round the result at some point. C11dr §7.21.6.1 6 goes into detail.
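A sketch using DECIMAL_DIG from <cfloat> (available since C11/C++11):

#include <cfloat>
#include <cstdio>

int main() {
    // DECIMAL_DIG significant digits are enough to round-trip any
    // supported floating point value through decimal text.
    std::printf("%.*e\n", DECIMAL_DIG - 1, (double)FLT_MAX);
}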
float f = 1.1;
double d = 1.1;
if(f == d) // returns false!
Why is it so?
The important factors under consideration with float or double numbers are:
Precision & Rounding
Precision:
The precision of a floating point number is how many digits it can represent without losing any information it contains.
Consider the fraction 1/3. The decimal representation of this number is 0.33333333333333…, with the 3's repeating to infinity. A number of infinite length would require infinite memory to be stored with exact precision, but a float or double typically occupies only 4 or 8 bytes. Float and double values can therefore store only a certain number of digits, and the rest are bound to get lost. There is no accurate way of representing a number that requires more precision than the variable can hold.
Rounding:
There is a non-obvious difference between binary and decimal (base 10) numbers.
Consider the fraction 1/10. In decimal, this can be easily represented as 0.1, and 0.1 can be thought of as an easily representable number. However, in binary, 0.1 is represented by the infinite sequence: 0.00011001100110011…
An example:
#include <iostream>
#include <iomanip>

int main()
{
    using namespace std;
    cout << setprecision(17);

    double dValue = 0.1;
    cout << dValue << endl;
}
The output is:
0.10000000000000001
and not
0.1.
This is because the double had to round the approximation due to its limited memory, which results in a number that is not exactly 0.1. Such a scenario is called a rounding error.
Whenever you compare two close float or double numbers, such rounding errors kick in and the comparison can yield incorrect results. This is the reason you should never compare floating point values using ==.
The best you can do is to take their difference and check if it is less than an epsilon.
abs(x - y) < epsilon
Try running this code, the results will make the reason obvious.
#include <iomanip>
#include <iostream>
int main()
{
    std::cout << std::setprecision(100) << (double)1.1 << std::endl;
    std::cout << std::setprecision(100) << (float)1.1 << std::endl;
    std::cout << std::setprecision(100) << (double)((float)1.1) << std::endl;
}
The output:
1.100000000000000088817841970012523233890533447265625
1.10000002384185791015625
1.10000002384185791015625
Neither float nor double can represent 1.1 exactly. When you do the comparison, the float is implicitly converted to double. That conversion is exact, so the comparison ends up comparing two different approximations of 1.1, and yields false.
Generally you shouldn't compare floats to floats, doubles to doubles, or floats to doubles using ==.
The best practice is to subtract them, and check if the absolute value of the difference is less than a small epsilon.
if (std::fabs(f - d) < std::numeric_limits<float>::epsilon())
{
    // ...
}
One reason is that floating point numbers are (more or less) binary fractions, and can only approximate many decimal numbers. Many decimal numbers must necessarily be converted to repeating binary "decimals", which have to be cut off at some finite length. This introduces a rounding error.
From wikipedia:
For instance, 1/5 cannot be represented exactly as a floating point number using a binary base but can be represented exactly using a decimal base.
In your particular case, float and double will round the repeating binary fraction that represents 1.1 differently. You will be hard pressed to get them to be "equal" after their corresponding conversions have introduced different levels of rounding error.
The code I gave above solves this by simply checking if the values are within a very short delta. Your comparison changes from "are these values equal?" to "are these values within a small margin of error from each other?"
Also, see this question: What is the most effective way for float and double comparison?
There are also a lot of other oddities about floating point numbers that break a simple equality comparison. Check this article for a description of some of them:
http://www.cygnus-software.com/papers/comparingfloats/comparingfloats.htm
The IEEE 754 32-bit float can store: 1.1000000238...
The IEEE 754 64-bit double can store: 1.1000000000000000888...
See why they're not "equal"?
In IEEE 754, fractions are stored in powers of 2:
2^(-1), 2^(-2), 2^(-3), ...
1/2, 1/4, 1/8, ...
Now we need a way to represent the fractional part 0.1 of 1.1. This is (a simplified version of) the 32-bit IEEE 754 representation (float):
2^(-4) + 2^(-5) + 2^(-8) + 2^(-9) + 2^(-12) + 2^(-13) + ... + 2^(-24) + 2^(-25) + 2^(-27)
00011001100110011001101
1.10000002384185791015625
With 64-bit double, it's even more accurate. It doesn't stop at 2^(-25), it keeps going for about twice as much. (2^(-48) + 2^(-49) + 2^(-51), maybe?)
Resources
IEEE 754 Converter (32-bit)
Floats and doubles are stored in a binary format that can not represent every number exactly (it's impossible to represent the infinitely many possible different numbers in a finite space).
As a result they round. A float has to round more than a double, because it is smaller, so 1.1 rounded to the nearest valid float is different from 1.1 rounded to the nearest valid double.
To see which numbers are valid floats and doubles, see Floating Point.