How does casting from float to double work in C++? [duplicate] - c++

The mantissa bits in a float variable are 23 in total while in a double variable 53.
This means that the digits that can be represented precisely by
a float variable are log10(2^23) = 6.92368990027 = 6 in total
and by a double variable log10(2^53) = 15.9545897702 = 15
Let's look at this code:
float b = 1.12391;
double d = b;
cout<<setprecision(15)<<d<<endl;
It prints
1.12390995025635
however this code:
double b = 1.12391;
double d = b;
cout<<setprecision(15)<<d<<endl;
prints 1.12391
Can someone explain why I get different results? I converted a float variable of 6 digits to double, the compiler must know that these 6 digits are important. Why? Because I'm not using more digits that can't all be represented correctly in a float variable. So instead of printing the correct value it decides to print something else.

Converting from float to double preserves the value. For this reason, in the first snippet, d contains exactly the approximation to float precision of 112391/100000. The rational 112391/100000 is stored in the float format as 9428040 / 2^23. If you carry out this division, the result is exactly 1.12390995025634765625: the float approximation is not very close. cout << prints the representation to 14 digits after the decimal point. The first omitted digit is 7, so the last printed digit, 4, is rounded up to 5.
In the second snippet, d contains the approximation to double precision of the value of 112391/100000, 1.123909999999999964614971759147010743618011474609375 (in other words 5061640657197974 / 2^52). This approximation is much closer to the rational. If it were printed with 14 digits after the decimal point, the last digits would all be zeroes (after rounding, because the first omitted digit would be 9). cout << does not print trailing zeroes, so you see 1.12391 as output.
Because I'm not using more digits that can't all be represented correctly in a float variable
When you apply log10 to 2^23 (incorrectly: it should be 2^24), you get the number of decimal digits that can be stored in a float. Because float's representation is not decimal, the digits after these seven or so are not zeroes in general. They are the digits that happen to be there in decimal for the closest binary representation that the compiler chose for the number you wrote.
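To check the explanation above on your own machine, here is a minimal sketch. The helper names `widen` and `show` are made up for illustration, and it assumes an IEEE-754 float and double, as the answer does.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Widening a float to double is exact: no digits are invented or repaired.
double widen(float f) {
    return f;
}

// Format v with n significant digits, the way `cout << setprecision(n) << v` would.
std::string show(double v, int n) {
    std::ostringstream out;
    out << std::setprecision(n) << v;
    return out.str();
}
```

With these, `show(widen(1.12391f), 15)` gives the 1.12390995025635 from the question, while `show(1.12391, 15)` gives 1.12391, because the two variables hold different binary approximations from the start.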

float b = 1.12391;
The problem is here, and here:
double b = 1.12391;
These assignments are already imprecise. Calculations or casts using them will therefore also be imprecise.

You're mistaken in assuming that the first 6 digits will be precisely the same. When we say that float is precise to within 6 (decimal) digits, we mean that the relative difference between the actual and intended value is less than 10^-6. Here, 1.12390995 and 1.12391 differ by 0.00000005 (5 × 10^-8). That's much better than the 10^-6 you can rely on.
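The relative difference can be measured directly. This is only a sketch, with `relative_error` being a hypothetical helper and IEEE-754 arithmetic assumed:

```cpp
#include <cmath>

// Relative error introduced by storing `intended` in a float.
double relative_error(double intended) {
    float stored = static_cast<float>(intended);   // rounds to the nearest float
    double actual = static_cast<double>(stored);   // exact widening
    return std::fabs(actual - intended) / std::fabs(intended);
}
```

For 1.12391 the result is a little over 4e-8, comfortably below the 10^-6 bound.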

Related

C++ `digits10` is 6 for IEEE float, but the first non-representable integer already has 8 digits?

C++'s std::numeric_limits<float>::digits10, is described on cppref as such:
The value of std::numeric_limits<T>::digits10 is the number of base-10 digits that can be represented by the type T without change, that is, any number with this many decimal digits can be converted to a value of type T and back to decimal form, without change due to rounding or overflow.
A similar description exists for the C cousin FLT_DIG.
The value given is:
float FLT_DIG /* 6 for IEEE float */
However, it is shown here on S.O. that all integers up to 16,777,216 (2^24) are exactly representable in a 32 bit IEEE float type. And if I can count, that number has 8 digits, so the value for digits10 should actually be 7, now shouldn't it?
Rather obviously, I misunderstand something about digits10 here, so what does this actually tell me?
Practical applicability:
I was asked if we could store all numbers from 0.00 - 86,400.00 exactly in an IEEE 32 bit float.
Now, I'm very confident that we could store all numbers from 0 - 8,640,000 in an IEEE 32 bit float, but does this hold for the same "integer" range shifted by 2 digits to the left?
(Restricting this answer to IEEE754 float).
8.589973e9 and 8.589974e9 both map to 8589973504. That's a counter-example for an assertion that the 7th significant figure is preserved.
Since no such counter-example exists on the 6th significant figure, std::numeric_limits<float>::digits10 and FLT_DIG are 6.
Indeed integers can be represented exactly up to the 24th power of 2. (16,777,216 and 16,777,217 both map to 16,777,216). That's because a float has a 24 bit significand.
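Both collisions mentioned above are easy to reproduce. A small sketch, assuming IEEE-754 float and a correctly rounding `std::strtof`; `parse` and `float_digits10` are names chosen here for illustration:

```cpp
#include <cstdlib>
#include <limits>

// Decimal digits guaranteed to survive text -> float -> text.
constexpr int float_digits10 = std::numeric_limits<float>::digits10;

// Convert decimal text to the nearest float.
float parse(const char* text) {
    return std::strtof(text, nullptr);
}
```

`parse("8.589973e9")` and `parse("8.589974e9")` compare equal, and so do `parse("16777216")` and `parse("16777217")`.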
As the other answer and comment establishes, digits10 covers all "exponent ranges", that is it has to hold for 1234567 as well as for 1.234567 and 12345670000 -- and this only holds for 6 digits!
Counter example for 7 digits:
8.589973e9 vs. 8.589974e9 (from the cppref example)
Sometimes it is easy enough to look for counter examples.
#include <stdio.h>
#include <stdlib.h>  /* atof */
#include <string.h>

int main(void) {
    int p6 = 1e6;
    int p7 = 1e7;
    /* For each decimal exponent, report the first 7-digit value that
       does not survive the trip text -> float -> text. */
    for (int expo = 0; expo < 29; expo++) {
        for (int frac = p6; frac < p7; frac++) {
            char s[30];
            sprintf(s, "%d.%06de%+03d", frac / p6, frac % p6, expo);
            float f = atof(s);
            char t[30];
            sprintf(t, "%.6e", f);
            if (strcmp(s, t)) {
                printf("<%s> %.10e <%s>\n", s, f, t);
                break;
            }
        }
    }
    puts("Done");
}
Output
<8.589973e+09> 8.5899735040e+09 <8.589974e+09>
<8.796103e+12> 8.7961035080e+12 <8.796104e+12>
<9.007203e+15> 9.0072024760e+15 <9.007202e+15>
<9.223377e+18> 9.2233775344e+18 <9.223378e+18>
<9.444738e+21> 9.4447374693e+21 <9.444737e+21>
<9.671414e+24> 9.6714134744e+24 <9.671413e+24>
<9.903522e+27> 9.9035214949e+27 <9.903521e+27>
<1.000000e+28> 9.9999994421e+27 <9.999999e+27> This is an interesting one
Another point of view:
Consider that between each pair of consecutive powers of 2, a float like IEEE binary encodes 2^23 values distributed linearly.
Example: Between 2^0 and 2^1, or 1.0 and 2.0,
The difference between float values is 1.0/2^23 or about 1.192e-07.
Written in text form "1.dddddd", a 7 digit number, the numbers have a difference of 1.000e-06.
So for every step of a decimal text number, there are about 8.4 float values.
No problems encoding these 7 digit numbers.
In this range, 8 digits ("1.ddddddd", a step of 1.000e-07) would already be too many, because that step is smaller than the float spacing.
Example: Between 2^23 and 2^24, or 8,388,608.0 and 16,777,216.0,
The difference between float values is 2^23/2^23 or 1.0.
The numbers near the low end written in text form "8or9.dddddd*10^6", a 7 significant digit number, have a difference of 1.0.
No problems encoding these 7 digit numbers.
Example: Between 2^33 and 2^34, or 8,589,934,592.0 and 17,179,869,184.0,
The difference between float values is 2^33/2^23 or 1,024.0.
Numbers near the low end written in text form "8or9.dddddd*10^9", a 7 significant digit number, have a difference of 1,000.0.
Now we have a problem. Starting from 8,589,934,592.0, the next 1024 numbers in text form have only 1000 different float encodings.
7 digits in the form d.dddddd * 10^expo is too many combinations to uniquely encode using float.
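The spacing figures in these examples can be read off with `std::nextafter`. A minimal sketch, assuming IEEE-754 float; `gap_above` is a hypothetical helper name:

```cpp
#include <cmath>
#include <limits>

// Distance from x to the next representable float above it.
float gap_above(float x) {
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}
```

`gap_above(1.0f)` is 2^-23, `gap_above(8388608.0f)` is 1, and `gap_above(8589934592.0f)` is 1024, which is wider than the 1000 between neighbouring 7-digit decimals in that range.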

Can you help me to understand what "significant digits" means in floating point math?

From what I'm learning, once I convert a floating point value to a decimal one, the "significant digits" I need are a fixed number (17 for double, for example): 17 in total, counting digits before and after the decimal separator.
So for example this code:
#include <iostream>
#include <limits>

typedef std::numeric_limits<double> dbl;

int main()
{
    // max_digits10 is 17 for an IEEE-754 double
    std::cout.precision(dbl::max_digits10);
    //std::cout << std::fixed;
    double value1 = 1.2345678912345678912345;
    double value2 = 123.45678912345678912345;
    double value3 = 123456789123.45678912345;
    std::cout << value1 << std::endl;
    std::cout << value2 << std::endl;
    std::cout << value3 << std::endl;
}
will correctly "show me" 17 values:
1.2345678912345679
123.45678912345679
123456789123.45679
But if I increase precision for the cout (i.e. std::cout.precision(100)), I can see there are other numbers after the 17 range:
1.2345678912345678934769921397673897445201873779296875
123.456789123456786683163954876363277435302734375
123456789123.456787109375
Why should ignore them? They are stored within the variables/double as well, so they will affect the whole "math" later (division, multiplication, sum, and so on).
What does it means "significant digits"? There is other...
Can you help me to understand what “significant digits” means in floating point math?
With FP numbers, as with mathematical real numbers, the significant digits are the digits of a value starting at the first non-zero digit and running, depending on context, to 1) the decimal point, 2) the last non-zero digit, or 3) the last printed digit.
123. // 3 significant decimal digits
123.125 // 6 significant decimal digits
0.0078125 // 5 significant decimal digits
0x0.00123p45 // 3 significant hexadecimal digits
123000.0 // 3, 6, or 7 significant decimal digits depending on context
When concerned about decimal significant digits and FP types like double, the issue is often "How many decimal significant digits are needed or of concern?"
Nearly all C FP implementations use a binary encoding such that all finite FP values are exact sums of powers of 2. Each finite FP value is exact. The common encoding gives most double values 53 binary digits in their significand, so 53 significant binary digits. How this appears as a decimal is often the source of confusion.
// Example 0.1 is not an exact sum of powers of 2 so a nearby value is used.
double x = 0.1;
// x takes on the exact value of
// 0.1000000000000000055511151231257827021181583404541015625
// aka 0x1.999999999999ap-4
// aka base2: 0.000110011001100110011001100110011001100110011001100110011010
// The preceding and subsequent doubles
// 0.09999999999999999167332731531132594682276248931884765625
// 0.10000000000000001942890293094023945741355419158935546875
// 123456789012345678901234567890123456789012345678901234567890
Looking at the above, one could say x has over 50 decimal significant digits. Yet the value matches the intended 0.1 to 16 decimal significant digits. Or, since the preceding and subsequent possible double values differ in the 17th place, one could say x has 17 decimal significant digits.
What does it means "significant digits"?
Various meanings of significant digits exist, but for C, 2 common ones are:
The number of significant decimal digits for which every textual value survives conversion to double and back as expected. This is typically 15. C specifies this as DBL_DIG, which must be at least 10.
The number of significant decimal digits that a double must be printed with to distinguish it from every other double. This is typically 17. C specifies this as DBL_DECIMAL_DIG, which must be at least 10.
Why should ignore them?
It depends on coding goals. Rarely are all digits of the exact value needed. (DBL_TRUE_MIN might have 752 of them.) For most applications, DBL_DECIMAL_DIG is enough. In select apps, DBL_DIG will do. So usually, ignoring digits past 17 does not cause problems.
Keep in mind that floating-point values are not real numbers. There are gaps between the values, and all those extra digits, while meaningful for real numbers, don’t reflect any difference in the floating-point value. When you convert a floating-point value to text, having std::numeric_limits<...>::max_digits10 digits ensures that you can convert the text string back to floating-point and get the original value. The extra digits don’t affect the result.
The extra digits that you see when you ask for more digits are the result of the conversion algorithm trying to do what you asked. The algorithm typically just keeps extracting digits until it reaches the desired precision; it could be written to start outputting zeros after it’s written max_digits10 digits, but that’s an additional complication that nobody bothers with. It wouldn’t really be helpful.
Just to add to Pete Becker's answer, I think you're confusing the problem of finding the exact decimal representation of a binary mantissa with the problem of finding some decimal representation uniquely representing that binary mantissa (given some fixed rounding scheme).
Now, regarding the first problem, you always need a finite number of decimal digits to exactly represent a binary mantissa ( because 2 divides 10 ).
For example, you need 18 decimal digits to exactly represent the binary 1.0000000000000001, being 1.00000762939453125 in decimal.
but you need just 17 digits to represent it uniquely as 1.0000076293945312 because no other number having exact value 1.0000076293945312xyz... where 0<=x<5 can exist as a double ( more precisely, the next and prior exactly representable values being 1.0000076293945314720446049250313080847263336181640625 and 1.0000076293945310279553950749686919152736663818359375 ).
Of course, this does not mean that given some decimal number you can ignore all digits past the 17th; it just means that if you apply the same rounding scheme used to produce the decimal at the 17th position and assign it back to a double you'll get the same original double.
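The round-trip property is simple to try out. The following is a sketch only, assuming an IEEE-754 double and correctly rounding library conversions; `to_text` and `from_text` are illustrative names:

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Print v with the given number of significant decimal digits.
std::string to_text(double v, int digits) {
    std::ostringstream out;
    out << std::setprecision(digits) << v;
    return out.str();
}

// Parse decimal text back to the nearest double.
double from_text(const std::string& text) {
    return std::stod(text);
}
```

With 17 digits (`max_digits10`), `from_text(to_text(v, 17)) == v` holds for the values in the question. With 15 digits, the first value from the question no longer comes back unchanged, even though the printed text looks tidier.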

Why does cout.precision() increase floating-point's precision?

I understand that single floating-point numbers have the precision of about 6 digits, so it's not surprising that the following program will output 2.
#include <iostream>
using namespace std;

int main(void) {
    //cout.precision(7);
    float f = 1.999998; //this number gets rounded up to the nearest hundred thousandths
    cout << f << endl;  //so f should equal 2
    return 0;
}
But when cout.precision(7) is included, in fact anywhere before cout << f << endl;, the program outputs the whole 1.999998. This could only mean that f stored the whole floating-point number without rounding, right?
I know that cout.precision() should not, in any way, affect floating-point storage. Is there an explanation for this behavior? Or is it just on my machine?
I understand that single floating-point numbers have the precision of about 6 digits
About six decimal digits, or exactly 24 binary digits (23 stored explicitly plus one implicit leading bit).
this number gets rounded up to the nearest hundred thousand
No it doesn't. It gets rounded to the nearest value representable with 24 binary digits. Not the same thing, and not commensurable with it.
Why does cout.precision() increase floating-point's precision?
It doesn't. It affects how it is printed.
As already written in the comments: The number is stored in binary.
cout.precision() does not affect the storage of the floating point value; it affects only the output precision.
The default precision for std::cout is 6, and your number is 7 digits long including the parts before and after the decimal point. Therefore when you set precision to 7, there is enough precision to represent your number, but when you don't set the precision, rounding is performed.
Remember this only affects how the numbers are displayed, not how they are stored. Investigate IEEE floating point if you are interested in learning how floating point numbers are stored.
Try changing the number before the decimal point to see how it affects the rounding, e.g. float f = 10.9998 and float f = 10.99998.
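To see that only the printed text changes, you can format the same float twice. A minimal sketch, assuming IEEE-754 float; `render` is a hypothetical helper that mimics `cout << f` after `cout.precision(digits)`:

```cpp
#include <sstream>
#include <string>

// Format f with the given stream precision; f itself is never modified.
std::string render(float f, int digits) {
    std::ostringstream out;
    out.precision(digits);
    out << f;
    return out.str();
}
```

`render(1.999998f, 6)` gives "2" and `render(1.999998f, 7)` gives "1.999998", from one and the same stored value, which is not equal to 2.0f.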

C - Printing a float - loss of precision when casting to int [duplicate]

I'm trying to make a function that enables me to print floats.
Right now, I'm encountering two strange behaviors :
Sometimes, values like 1.3 come out as 1.2999999 instead of 1.3000000, and sometimes values like 1.234567 come out as 1.2345672 instead of 1.2345670.
Here's the source code :
int ft_putflt(float f)
{
    int ret;
    int intpart;
    int i;

    ret = 0;
    i = 0;
    intpart = (int)f;
    ft_putnbr(intpart);
    ret = ft_nbrlen(intpart) + 8;
    write(1, ".", 1);
    while (i++ < 7)
    {
        f *= 10;
        ft_putchar(48 + ((int)f % 10));
    }
    return (ret);
}
ft_putnbr is OK AFAIK.
ft_putchar is a simple call to "write(1, &c, 1)".
test values (value : output)
1.234567 : 1.2345672 (!)
1.2345670 : 1.2345672 (!)
1.0000001 : 1.0000001 OK
0.1234567 : 0.1234567 OK
0.67 : 0.6700000 OK
1.3 : 1.3000000 OK (fixed it)
1.321012 : 1.3210119 (!)
1.3210121 : 1.3210122 (!)
This all seems a bit mystic to me... Loss of precision when casting to int maybe ?
Yes, you lose precision when messing with floats and ints.
If both floats have differing magnitude and both are using the complete precision range (of about 7 decimal digits) then yes, you will see some loss in the last places, because floats are stored in the form of (sign) (mantissa) × 2^(exponent). If two values have differing exponents and you add them, then the smaller value will get reduced to fewer digits in the mantissa (because it has to adapt to the larger exponent):
PS> [float]([float]0.0000001 + [float]1)
1
In relation to integers, a normal 32-bit integer is capable of representing values exactly which do not fit exactly into a float. A float can still store approximately the same number, but no longer exactly. Of course, this only applies to numbers that are large enough, i.e. longer than 24 bits. Because a float has 24 bits of precision and (32-bit) integers have 32, float will still be able to retain the magnitude and most of the significant digits, but the last places may differ:
PS> [float]2100000050 + [float]100
2100000100
This is inherent in the use of finite-precision numerical representation schemes. Given any number that can be represented, A, there is some number that is the smallest number greater than A that can be represented, call that B. Numbers between A and B cannot be represented exactly and must be approximated.
For example, let's consider using six decimal digits because that's an easier system to understand as a starting point. If A is .333333, then B is .333334. Numbers between A and B, such as 1/3, cannot be exactly represented. So if you take 1/3 and add it to itself twice (or multiply it by 3), you will get .999999, not 1. You should expect to see imprecision at the limits of the representation.
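One way to avoid piling a fresh rounding error onto every `f *= 10` step is to widen to double once, scale, and round a single time. This is a sketch rather than a drop-in replacement for ft_putflt: `format_fixed7` is a made-up name, it returns a string instead of writing to fd 1, and it assumes the scaled value fits in a `long long`.

```cpp
#include <cmath>
#include <cstdio>
#include <string>

// Format f with exactly 7 digits after the decimal point.
std::string format_fixed7(float f) {
    // Widening is exact; the only rounding is the single std::round below.
    double scaled = std::round(std::fabs(static_cast<double>(f)) * 1e7);
    long long n = static_cast<long long>(scaled);
    char buf[48];
    std::snprintf(buf, sizeof buf, "%s%lld.%07lld",
                  f < 0 ? "-" : "", n / 10000000, n % 10000000);
    return buf;
}
```

This turns 1.3f into 1.3000000 and 1.234567f into 1.2345670. It cannot recover digits the float never held: for values of 1 or more, the seventh decimal is the eighth significant digit, so it may still be off by one.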

Why the float point value is this?

I was reading the chapter about floating-point numbers in C++ Primer Plus.
It gave an example as shown below:
#include <iostream>

int main(){
    using namespace std;
    float a = 2.34E+22f;
    float b = a + 1.0f;
    cout << "a =" << a << endl;
    cout << "b -a =" << b - a << endl;
    return 0;
}
And its result is:
a = 2.34e+22
b -a = 0
The explanation from the book is and I quote:
"The problem is that 2.34E+22 represents a number with 23 digits to the left of the
decimal. By adding 1, you are attempting to add 1 to the 23rd digit in that number. But
type float can represent only the first 6 or 7 digits in a number, so trying to change the
23rd digit has no effect on the value."
But I do not understand it. Could anybody help me understand, in simple terms, why b - a is 0?
The float type in C/C++ is stored in the standard 'single precision' format. The numbers are of the form
±m*2^e
where m is an integer between 2^23 and 2^24, and e is an integer. (I'm leaving out a lot of details that are not relevant here.)
So how does this explain what you are seeing?
First of all, the numbers in your code are always "rounded" to the nearest floating-point number. So the value 2.34e+22 is actually rounded to 10391688*2^51, which is 23400001102275227418624. This is the value of a.
Second, the result of a floating-point operation is always rounded to the nearest floating-point number. So, if you add 1 to a, the exact result is 23400001102275227418625, which is again rounded to the nearest floating-point number, which is still, of course, 23400001102275227418624. So b gets the same value as a. Since both numbers are equal, b - a == 0.
You can add numbers that are much larger than 1 to a and still get the same result. The reason is again the fact that the result is rounded, and the closest floating-point number will still be a when you add numbers up to at least 2^50, or about 1e+15.
b - a is 0 because b and a are equal.
When you add a too small number to a large number, it's as if you didn't add anything at all.
In this case, "too small" would be anything less than about 2.34e+15 i.e. 7 digits smaller.
The single precision floating point type float (assuming IEEE-754) consists of 1 sign bit, 8 exponent bits, and 23 fraction bits.
The fraction part has only 23 bits, so it distinguishes 2^23 values, which is slightly less than 10^7. When you add a rather small number to 2.34E+22f, the precision of float limits the result's representation, so b ends up with the same value as a.
Both the existing answers are correct (from Mark Ransom and Yu Hao).
The very short explanation is that float values are not very precise. The values are rounded off after 6 or 7 decimal digits. For very large numbers, this imprecision means that small changes to the value get rounded away to nothing. Even + 1 or + 100 can be a "very small change" if it's being done to 1000000000.