Detecting precision loss when converting from double to float - c++

I am writing a piece of code in which I have to convert double values to float. I am using boost::numeric_cast to do this conversion, which will alert me of any overflow/underflow. However, I am also interested in knowing whether the conversion resulted in any precision loss.
For example
double source = 1988.1012;
float dest = numeric_cast<float>(source);
Produces a dest with the value 1988.1.
Is there any way I can detect this kind of precision loss/rounding?

You could cast the float back to a double and compare this double to the original - that should give you a fair indication as to whether there was a loss of precision.

float dest = numeric_cast<float>(source);
double residual = source - numeric_cast<double>(dest);
Hence, residual contains the "loss" you're looking for.
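A minimal self-contained sketch of that round-trip check (using plain static_cast, which performs the same narrowing as numeric_cast for an in-range value; the 1e-9 tolerance is just an illustrative choice):
#include <cmath>
#include <iostream>

int main() {
    double source = 1988.1012;

    float dest = static_cast<float>(source);                 // narrowing conversion
    double residual = source - static_cast<double>(dest);    // what the float could not hold

    std::cout << "residual = " << residual << '\n';

    // Treat the conversion as "lossy" if the relative error exceeds a tolerance
    // you care about; the threshold here is purely illustrative.
    if (std::fabs(residual) > std::fabs(source) * 1e-9)
        std::cout << "precision was lost in the double -> float conversion\n";
}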

Look at these articles for single-precision and double-precision floats. First of all, a float has 8 bits for the exponent vs. 11 for a double, so anything bigger than about 2^128 (roughly 3.4e38) or smaller than about 2^-126 (roughly 1.2e-38) in magnitude will overflow or underflow a float, as you mentioned. A float has 23 bits for the actual digits of the number, vs. 52 bits for a double, so you have a lot more digits of precision in a double than in a float.
Say you have a number like 1.1123. This number may not actually be encoded as 1.1123, because the digits in a floating-point number are really added up as binary fractions. For example, if the bits in the mantissa were 11001, then the value would be formed by 1 (implicit) + 1 * 1/2 + 1 * 1/4 + 0 * 1/8 + 0 * 1/16 + 1 * 1/32 + 0 * (1/64 + 1/128 + ...). So the exact value cannot be encoded unless these fractions add up to exactly that number, which is rare. Therefore, there will almost always be some precision loss.

You're going to have a certain level of precision loss, as per Dave's answer. If, however, you want to quantify it and raise an exception when it exceeds a certain threshold, you will have to open up the floating-point number itself and parse out the mantissa and exponent, then do some analysis to determine whether you've exceeded your tolerance.
But the good news is that it's usually the standard IEEE 754 floating-point format. :-)
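If you just want the "raise an exception past a tolerance" behaviour, you do not strictly need to pull the number apart bit by bit; a relative-error check built on the residual idea from the previous answer is usually enough. A sketch (the helper name and the default tolerance of 1e-7, roughly float's machine epsilon, are my own illustrative choices):
#include <cmath>
#include <stdexcept>

// Hypothetical helper: converts double -> float, throwing if the conversion
// loses more relative precision than the caller is willing to accept.
float to_float_checked(double source, double tolerance = 1e-7) {
    float dest = static_cast<float>(source);
    double relative_error = (source == 0.0)
        ? 0.0
        : std::fabs((source - static_cast<double>(dest)) / source);
    if (relative_error > tolerance)
        throw std::range_error("double -> float conversion lost too much precision");
    return dest;
}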

Related

double precision error when converting to scientific notation

I'm building a program to convert double values into scientific format (mantissa, exponent). Then I noticed the following:
369.7900000000000 -> 3.6978999999999997428
68600000 -> 6.8599999999999994316
I noticed the same pattern for several other values as well. The maximum fractional error is
0.000 000 000 000 001 = 1e-15
I know about the inaccuracy of representing double values in a computer. Can it be concluded that the maximum fractional error we would get is 1e-15? What is significant about this?
I went through most of the questions about floating-point precision on Stack Overflow, but I didn't see any about the maximum fractional error in 64 bits.
To be clear about the computation I do, here is my code snippet:
int exp = 0;
double norm = 68600000;
if (norm)
{
    while (norm >= 10.0)
    {
        norm /= 10.0;
        exp++;
    }
    while (norm < 1.0)
    {
        norm *= 10.0;
        exp--;
    }
}
Now I get
norm = 6.8599999999999994316;
exp = 7
The number you are getting is related to the machine epsilon for the double data type.
A double is 64 bits long, with 1 bit for the sign, 11 bits for the exponent, and 52 bits for the mantissa fraction. A double's value is given by
1.mmmmm... * (2^exp)
With only 52 bits in the mantissa, a value as small as 2^-53 is completely lost when added to 1.0, because it falls below the least significant bit of the result. In binary, 1.0 + 2^-52 (the smallest increment that does survive) would be
1.000...00 + 0.000...01 = 1.000.....01
Anything of 2^-53 or below leaves 1.0 unchanged; you can verify for yourself that 1.0 + 2^-53 == 1.0 in a program.
This number 2^-52 ≈ 2.22e-16 is called the machine epsilon, and it is an upper bound on the relative error introduced by round-off in a single floating-point operation on double values.
Similarly, float has 23 bits in its mantissa and so its machine epsilon is 2^-23 = 1.19e-7.
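Both figures are easy to confirm in a short program (assuming the usual IEEE 754 double and float):
#include <cmath>
#include <iostream>
#include <limits>

int main() {
    std::cout << "double epsilon: " << std::numeric_limits<double>::epsilon() << '\n'; // ~2.22e-16
    std::cout << "float  epsilon: " << std::numeric_limits<float>::epsilon()  << '\n'; // ~1.19e-07

    std::cout << std::boolalpha;
    std::cout << (1.0 + std::ldexp(1.0, -53) == 1.0) << '\n';  // true:  2^-53 is lost entirely
    std::cout << (1.0 + std::ldexp(1.0, -52) == 1.0) << '\n';  // false: 2^-52 survives
}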
The reason you are getting 1e-15 may be because errors accumulate as you perform many arithmetic operations, but I can't say because I don't know the exact calculations you are doing.
EDIT: I've looked into the relative error for your problem with 68600000.
First off, you may be interested to know that round-off error can change the result of your computation if you break it into steps:
686.0/10.0 = 68.59999999999999431566
686.0/10.0/10.0 = 6.85999999999999943157
686.0/100.0 = 6.86000000000000031974
In the first line, the closest double to 68.6 is lower than the actual value, but in the third line we see the closest double to 6.86 is greater.
If we look at the absolute error e_abs = abs(v - v_approx) of your program, we see that it is
6.8600000 - 6.85999999999999943156581139192 ~= 5.684e-16
However, the relative error e_rel = abs((v - v_approx) / v) = abs(e_abs / v) would be
5.684e-16 / 6.86 ~= 8.286e-17
which is indeed below our machine epsilon of 2.22e-16.
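The two different roundings are easy to reproduce (a quick check; the exact digits printed may vary slightly with the compiler's formatting):
#include <iomanip>
#include <iostream>

int main() {
    std::cout << std::setprecision(20);
    double a = 686.0 / 10.0 / 10.0;   // two divisions, two roundings
    double b = 686.0 / 100.0;         // one division, one rounding
    std::cout << a << '\n';           // 6.8599999999999994316...
    std::cout << b << '\n';           // 6.8600000000000003197...
    std::cout << b - a << '\n';       // the two results are one ulp apart, ~8.88e-16
}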
If you want to know all the details about floating-point arithmetic, the famous paper to read is "What Every Computer Scientist Should Know About Floating-Point Arithmetic".

How can I avoid this float number rounding issue in C++?

With the code below, I get the result "4.31 43099".
double f = atof("4.31");
long ff = f * 10000L;
std::cout << f << ' ' << ff << '\n';
If I change "double f" to "float f", I get the expected result "4.31 43100". I am not sure that changing double to float is a good solution. Is there any good way to make sure I get 43100?
You're not going to be able to eliminate the errors in floating-point arithmetic (though with proper analysis you can calculate the error). For casual usage, one thing you can do to get more intuitive results is to replace the built-in float-to-integral conversion (which truncates) with normal rounding:
double f = atof("4.31");
long ff = std::round(f * 10000L);
std::cout << f << ' ' << ff << '\n';
This should output what you expect: 4.31 43100
Also, there's no point in using 10000L, because no matter what kind of integral type you use, it still gets converted to f's floating-point type for the multiplication. Just use std::round(f * 10000.0);
The problem is that floating point is inexact by nature when talking about decimal numbers. A decimal number can be rounded either up or down when converted to binary, depending on which value is closest.
In this case you just want to make sure that if the number was rounded down, it's rounded up instead. You do this by adding the smallest amount possible to the value, which is done with the nextafter function if you have C++11:
long ff = std::nextafter(f, 1.1*f) * 10000L;
If you don't have nextafter you can approximate it with numeric_limits.
long ff = (f * (1.0 + std::numeric_limits<double>::epsilon())) * 10000L;
I just saw your comment that you only use 4 decimal places, so this would be simpler but less robust:
long ff = (f * 1.0000001) * 10000L;
With standard C types, I doubt it.
There are many values that cannot be represented in those bits; they would actually demand more space to be stored, so the floating-point hardware just uses the closest value it can represent.
Floating-point numbers cannot store all the values you might think they could; there is only a limited number of bits, and you can't put more than about 4 billion different values into 32 bits. That's just the first restriction.
Floating-point values (in C) are represented as: a sign (one sign bit), a power (bits that define the power of two for the number), and a significand (the bits that actually make up the number).
The actual number is sign * significand * 2^(power - bias).
A double has 1 sign bit, 11 bits of power (stored with a bias so that it is non-negative, but that is not the point) and 52 bits to represent the value.
That is a lot, but not enough to represent all values, especially those that cannot be easily written as a finite sum of powers of two, like the binary 1010.101101(101). For example, it cannot precisely represent values like 1/3 = 0.333333(3). That's the second restriction.
Try to read these; a decent understanding of the advantages and disadvantages of floating-point arithmetic can be very handy:
http://en.wikipedia.org/wiki/Floating_point and http://homepage.cs.uiowa.edu/~atkinson/m170.dir/overton.pdf
There have been some confused answers here! What is happening is this: 4.31 can't be exactly represented as either a single- or double-precision number. It turns out that the nearest representable single-precision number is a little more than 4.31, while the nearest representable double-precision number is a little less than 4.31. When a floating-point value is assigned to an integer variable, it is rounded towards zero (not towards the nearest integer!).
So if f is single-precision, f * 10000L is greater than 43100, so it is rounded down to 43100. And if f is double-precision, f * 10000L is less than 43100, so it is rounded down to 43099.
The comment by n.m. suggests f * 10000L + 0.5, which I think is the best solution.
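You can see both effects by printing the stored values with extra digits (assuming IEEE 754 float and double, which is what virtually every current platform uses):
#include <iomanip>
#include <iostream>

int main() {
    float  f = 4.31f;
    double d = 4.31;
    std::cout << std::setprecision(20);
    std::cout << f << '\n';                    // 4.3100004196166992188  (a little above 4.31)
    std::cout << d << '\n';                    // 4.3099999999999996092  (a little below 4.31)
    std::cout << (long)(f * 10000L) << '\n';   // 43100 (truncated from ~43100.004)
    std::cout << (long)(d * 10000L) << '\n';   // 43099 (truncated from ~43099.99999999999)
}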

C++ floating-point console output issue

float x = 384.951257;
std::cout << std::fixed << std::setprecision(6) << x << std::endl;
The output is 384.951263. Why? I'm using gcc.
float is usually only 32-bit. With about 3 bits per decimal digit (2^10 roughly equals 10^3), that means it can't possibly represent more than about 11 decimal digits, and accounting for other information it also needs to represent, such as magnitude, let's say 6-7 decimal digits. Hey, that's what you got!
Check e.g. Wikipedia for details.
Use double or long double for better precision. double is the default in C++. E.g., the literal 3.14 is of type double.
Floats have a limited resolution, so the value gets rounded when you assign it to x.
All answers here talk as though the issue is due to floating-point numbers and their capacity, but those are just implementation details; the issue is deeper than that. It occurs whenever decimal numbers are represented in the binary number system. Even something as simple as (0.1)₁₀ is not precisely representable in binary, since binary can only represent exactly those numbers that are a finite fraction whose denominator is a power of 2. Unfortunately, that does not include most of the numbers that have a finite fractional representation in base 10, like 0.1.
The single-precision float datatype usually maps to binary32, as the IEEE 754 standard calls it. It has 32 bits, partitioned into 1 sign bit, 8 exponent bits and 23 significand bits (excluding the hidden/implicit bit). Thus we have to calculate up to 24 significand bits when converting to binary32.
Other answers here evade the actual calculations involved, so I'll try to do them; the method is explained in greater detail here. Let's convert the real number into a binary number:
Integer part: (384)₁₀ = (110000000)₂ (using the usual method of successive division by 2)
Fractional part: (0.951257)₁₀ can be converted by successive multiplication by 2, taking the integer part at each step:
0.951257 * 2 = 1.902514
0.902514 * 2 = 1.805028
0.805028 * 2 = 1.610056
0.610056 * 2 = 1.220112
0.220112 * 2 = 0.440224
0.440224 * 2 = 0.880448
0.880448 * 2 = 1.760896
0.760896 * 2 = 1.521792
0.521792 * 2 = 1.043584
0.043584 * 2 = 0.087168
0.087168 * 2 = 0.174336
0.174336 * 2 = 0.348672
0.348672 * 2 = 0.697344
0.697344 * 2 = 1.394688
0.394688 * 2 = 0.789376
Gathering the obtained fractional bits, we have (0.111100111000010)₂. The overall number in binary is (110000000.111100111000010)₂; this has 24 significand bits, as required.
Converting this back to decimal gives 384 + (15585 / 16384) = (384.951232...)₁₀. With round-to-nearest, the last bit is rounded up, which gives the (384.951263)₁₀ that you see.
This can be verified with an online IEEE 754 converter.
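If you want to inspect the bit pattern on your own machine, here is a small sketch (assuming float is IEEE 754 binary32, the usual case):
#include <bitset>
#include <cstdint>
#include <cstring>
#include <iomanip>
#include <iostream>

int main() {
    float x = 384.951257f;

    // Reinterpret the 32 bits of the float; memcpy avoids aliasing problems.
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);

    std::cout << "sign     " << (bits >> 31) << '\n';                 // 0
    std::cout << "exponent " << std::bitset<8>(bits >> 23) << '\n';   // 10000111, i.e. 135 - 127 = 8
    std::cout << "mantissa " << std::bitset<23>(bits) << '\n';        // the 23 stored significand bits
    std::cout << std::setprecision(12) << x << '\n';                  // 384.951263428
}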

Discarding 4 precision points?

I have a floating-point number, but I only want it to 2 decimal places. How do I get this in C++?
float foo(float num) { // num=1234.567891
// code
return num2; // returns 1234.560000
}
A simple way would be:
float foo(float num) {
return floor(num * 100) / 100;
}
You may also consider:
float foo(float num) {
return (int)(num * 100) / 100.0f;
}
There might be differences with negative numbers. The only information I can get from your question is that for positive numbers you want floor (and not round, for example).
You can attempt to round a floating point number to a certain number of decimal places, by multiplying and rounding/truncating.
So num = floor(num * 100.f) / 100.f; would try to truncate to two decimal places.
However, note that this is not the same as fixing the precision of the float. By definition, the radix point in a floating-point number floats, and a float only has around 7 significant decimal digits of precision.
So your original number cannot be as precise as "1234.567891" - this is too much precision for a float.
And perhaps more importantly, your output from such a function may also not be precise. "1234.56" cannot be exactly represented by a float, so the returned value will not be "1234.560000".
The larger the integer portion of the number, the fewer digits can be represented after the decimal point. Indeed, beyond a certain magnitude a float cannot represent any fractional part at all.
Floating-point types have their own internal precision. You can't change it. You can fiddle with the value in various ways, but ultimately the floating-point types manage their own precision, and you won't succeed in limiting what they do. Even if you manage to force a bunch of zeros into the low bits of some value, subsequent computations will still use the full precision, and you've only introduced some random noise into the computation.
What you can do is control the number of significant digits on input and output. And that's almost always the right way to manage precision.
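For instance, controlling the digits at output time (rather than trying to change the stored value) might look like this; the example values are just illustrative:
#include <cmath>
#include <iomanip>
#include <iostream>

int main() {
    float num = 1234.5678f;

    // Truncating the stored value, as in the answers above: the result is only
    // approximately 1234.56, because that value has no exact float representation.
    float truncated = std::floor(num * 100.0f) / 100.0f;
    std::cout << std::setprecision(10) << truncated << '\n';        // e.g. 1234.560059

    // Controlling the number of digits on output instead:
    std::cout << std::fixed << std::setprecision(2) << num << '\n'; // 1234.57
}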

Converting a decimal number in scientific notation to IEEE 754

I've read a few texts and threads showing how to convert from a decimal number to IEEE 754, but I am still confused about how to convert a number given in scientific notation without expanding the decimal.
The number I am particularly working with is 9.07 * 10^23, but any number would do; I will figure out how to do it for my particular example.
I'm assuming you want the result to be the floating-point number closest to the decimal number, and that you are using double-precision floating-point numbers.
For most numbers, there is a way to do it relatively quickly. Here's how it works in a nutshell.
You need to split the number into either a product or a fraction of numbers that have an exact representation as a floating-point number. The largest power of 10 that is exactly representable is 10^22. So, to get 9.07e+23 in floating-point form, we can write:
9.07e+23 = 907 * 10^21
According to the IEEE-754 standard, a single floating-point operation is guaranteed to be correctly rounded, so the above product, computed as a product of 2 double precision floating-point numbers, will give the correctly rounded result.
If you were to use this in a conversion function, you would probably store the powers of 10 in an array.
Note that you can't use this method for 9.07e-23. This number equals 907 / 10^25, so the denominator would be too large to be exactly representable (the largest exact power of 10 is 10^22). In this situation, and in other dealings with very large or very small numbers, you have to use some form of high-precision arithmetic.
See Fast Path Decimal to Floating-Point Conversion for further details and examples.
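A quick sketch for this particular number (907.0 and 1e21 are both exactly representable, so the single multiplication is correctly rounded; strtod is used only as a cross-check):
#include <cstdlib>
#include <iomanip>
#include <iostream>

int main() {
    // 9.07e+23 = 907 * 10^21; both factors are exact doubles,
    // so one multiplication yields the correctly rounded result.
    double viaProduct = 907.0 * 1e21;

    // The standard library's own decimal-to-double conversion, for comparison.
    double viaStrtod = std::strtod("9.07e23", nullptr);

    std::cout << std::setprecision(17)
              << viaProduct << '\n'
              << viaStrtod  << '\n'
              << std::boolalpha
              << (viaProduct == viaStrtod) << '\n';  // true with a correctly rounding strtod
}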
Converting a number from a decimal string to binary IEEE is fairly straight-forward if you know how to do IEEE floating-point addition and multiplication. (or if you're using any basic programming language like C/C++)
There's a lot of different approaches to this, but the easiest is to evaluate 9.07 * 10^23 directly.
First, start with 9.07:
9.07 = 9 + 0 * 10^-1 + 7 * 10^-2
Now evaluate 10^23. This can be done by starting with 10 and using any powering algorithm.
Then multiply the results together.
Here's a simple implementation in C/C++:
double mantissa = 9;
mantissa += 0 / 10.;
mantissa += 7 / 100.;
double exp = 1;
for (int i = 0; i < 23; i++){
exp *= 10;
}
double result = mantissa * exp;
Now, going backwards (IEEE -> decimal) is a lot harder.
Again, there are a lot of different approaches. Here's the easiest one I can think of.
I'll use 1.0011101b * 2^40 as the example. (the mantissa is in binary)
First, convert the mantissa to decimal: (this should be easy, since there's no exponent)
1.0011101b * 2^40 = 1.22656 * 2^40
Now, "scale" the number such that the binary exponent vanishes. This is done by multiplying by an appropriate power of 10 to "get rid" of the binary exponent.
1.22656 * 2^40 = 1.22656 * (2^40 * 10^-12) * 10^12
= 1.22656 * (1.09951) * 10^12
= 1.34861 * 10^12
So the answer is:
1.0011101b * 2^40 = 1.34861 * 10^12
In this example, 10^12 was needed to "scale away" the 2^40. The power of 10 that is needed is simply:
power of 10 = (power of 2) * log(2)/log(10)
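A rough C++ sketch of that reverse direction (std::frexp recovers the binary exponent, and the division by pow(10, e10) introduces its own tiny rounding error, so this is only illustrative; the hand calculation above truncated the significand to 1.22656, which is why it shows 1.34861 rather than 1.34862):
#include <cmath>
#include <cstdio>

int main() {
    // The example from above: 1.0011101b * 2^40, i.e. 1.2265625 * 2^40
    double x = std::ldexp(1.2265625, 40);

    int e2;
    std::frexp(x, &e2);                      // x = m * 2^e2 with 0.5 <= m < 1

    // power of 10 = (power of 2) * log(2)/log(10), rounded down
    int e10 = static_cast<int>(std::floor((e2 - 1) * std::log10(2.0)));

    // "Scale away" the binary exponent by dividing out 10^e10
    double m10 = x / std::pow(10.0, e10);
    if (m10 >= 10.0) { m10 /= 10.0; ++e10; } // the estimate can be off by one

    std::printf("%.5f * 10^%d\n", m10, e10); // 1.34862 * 10^12
}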