I'm currently trying to learn about floating point representation in depth, so I played around a bit. While doing so, I stumbled on some strange behaviour; I can't really work out what's happening, and I'd be very grateful for some insight. Apologies if this has been answered, I found it quite hard to google!
#include <iostream>
#include <cmath>
using namespace std;
int main() {
    float minVal = pow(2,-149); // set to smallest float possible
    float nextCheck = static_cast<float>(minVal/2.0f); // divide by two
    bool isZero = (static_cast<float>(minVal/2.0f) == 0.0f); // this evaluates to false
    bool isZero2 = (nextCheck == 0.0f); // this evaluates to true
    cout << nextCheck << " " << isZero << " " << isZero2 << endl;
    // this outputs 0 0 1
    return 0;
}
Essentially what's happening is:
I set minVal to be the smallest float that can be represented using
single precision
Dividing by 2 should yield 0 -- we're at the minimum
Indeed, isZero2 does return true, but isZero returns false.
What's going on -- I would have thought them to be identical? Is the compiler trying to be clever, saying that dividing any number cannot possibly yield zero?
Thanks for your help!
The reason isZero and isZero2 can evaluate to different values, and isZero can be false, is that the C++ compiler is allowed to implement intermediate floating-point operations with more precision than the type of the expression would indicate, but the extra precision has to be dropped on assignment.
Typically, when generating code for the historical 387 FPU, the generated instructions work on either the 80-bit extended-precision type or, if the FPU is set to a 53-bit significand (e.g. on Windows), a peculiar floating-point type with a 53-bit significand and a 15-bit exponent.
Either way, minVal/2.0f is evaluated exactly, because the wider exponent range can represent it, but assigning the result to nextCheck rounds it to zero.
If you are using GCC, there is the additional problem that -fexcess-precision=standard has not yet been implemented for the C++ front-end, meaning that the code generated by g++ does not implement exactly what the standard recommends.
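To see the excess precision disappear, one workaround is to force the quotient through a named float object before comparing. This is a minimal sketch, assuming an IEEE binary32 float; the volatile qualifier is my addition to force an actual store and is not from the original question:
#include <cmath>
#include <iostream>

int main() {
    float minVal = std::pow(2, -149);       // smallest positive (denormal) float

    // Storing the quotient in a volatile float forces it to be rounded to
    // float precision, discarding any wider intermediate the FPU may have kept.
    volatile float forced = minVal / 2.0f;

    std::cout << (forced == 0.0f) << '\n';  // expected to print 1 (true)
    return 0;
}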
In C++, we know that we can find the smallest positive normalized double-precision value using std::numeric_limits<double>::min(). The value turns out to be 2.22507e-308 when printed.
Now, if this minimum value (minval) is subtracted from a given double value (say val) and the result is divided by that same value, (val - minval) / val, I was expecting the quotient to come out just below 1, so that floor((val - minval) / val) would round it down to 0.
To my surprise, the answer is delivered as 1. Can someone please explain this anomalous behavior?
Consider the following code:
#include <cmath>
#include <iostream>
#include <limits>
using namespace std;

int main()
{
    double minval = std::numeric_limits<double>::min(), wg = 8038,
           ans = floor((wg - minval) / wg); // expecting the answer to round to 0
    cout << ans; // but the answer actually resulted as 1!
}
A double typically has around 16 digits of precision.
You're starting with 8038. For simplicity, I'm going to call that 8.038e3. Since we have around 16 digits of precision, the smallest number we can subtract from it and still get a result different from 8038 is roughly 8.038e(3-16) = 8.038e-13.
8038 - 2.2e-308 is like reducing the mass of the universe by one electron, and expecting that to affect the mass of the universe by a significant amount.
Actually, relatively speaking, 8038-2.2e-308 is a much smaller change than removing a whole electron from the universe--more like removing a minuscule fraction of a single electron from the universe, if that were possible. Even if we were to assume that string theory were correct, even removing one string from the universe would still be a huge change compared to subtracting 2.2e-308 from 8038.
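To make the scale concrete, here is a small sketch (the variable names mirror the question; the use of std::nextafter to measure the gap is my addition) that prints the spacing just below 8038 and confirms that subtracting minval rounds straight back to 8038:
#include <cmath>
#include <iostream>
#include <limits>

int main() {
    double wg = 8038.0;
    double minval = std::numeric_limits<double>::min();   // ~2.22507e-308

    // Distance from 8038 down to the next representable double (one ULP below).
    double gap = wg - std::nextafter(wg, 0.0);
    std::cout << "gap below 8038: " << gap << '\n';        // about 9.1e-13

    // minval is vastly smaller than half that gap, so the subtraction rounds
    // back to 8038 and the quotient is exactly 1.
    std::cout << ((wg - minval) == wg) << '\n';            // 1
    std::cout << std::floor((wg - minval) / wg) << '\n';   // 1
}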
The comments and the previous answer correctly attribute the cause to floating-point precision issues, but some additional detail is needed to explain the behaviour fully. Even when the exact result of a subtraction cannot be represented in the finite precision of floating-point numbers, the subtraction is not simply discarded: an inexact, correctly rounded result is still produced.
As an example, consider the code below.
#include <cmath>
#include <iomanip>
#include <iostream>
#include <vector>
using namespace std;

int main()
{
    double b, c, d;
    vector<double> a{0.07, 0.08, 0.09, 0.1, 0.2, 0.3, 0.4, 0.6, 0.7};
    cout << "Subtraction Possible?" << "\t" << "Floor Result" << "\n";
    for( int i = 0; i < 9; i++ ) {
        b = std::nextafter( a[i], 0 );  // largest double strictly below a[i]
        c = a[i] - b;                   // the gap (one ULP) just below a[i]
        d = 1e-17;                      // the amount we try to subtract
        if( d > c )
            cout << "True" << "\t";
        else
            cout << "False" << "\t";
        cout << setprecision(52) << floor((a[i] - d)/a[i]) << "\n";
    }
    return 0;
}
The code takes the double-precision values in vector a and subtracts 1e-17 from each of them. Note that, using std::nextafter, the gap between 0.07 and the largest double below it comes out as 1.387778780781445675529539585113525390625e-17. This means 1e-17 is smaller than the gap below any of the listed numbers, so, naively, the subtraction should not be able to change any of them. If the subtraction results were simply discarded, the answer would always stay 1, yet it turns out that sometimes the answer is 0 and other times 1.
This can be observed from the output of the C++ program as shown below:
Subtraction Possible? Floor Result
False 0
False 0
False 0
False 0
False 1
False 1
False 1
False 1
False 1
The reason lies buried in the floating-point specification of the IEEE 754 document. The standard states that even when the result of an operation cannot be represented exactly, rounding must still be carried out. I quote page 27, Section 4.3 of the IEEE 754-2019 document:
Except where stated otherwise, every operation shall be performed as if it first produced an
intermediate result correct to infinite precision and with unbounded range, and then rounded that result
according to one of the attributes in this clause
The statement is repeated in Section 5.1 on page 29:
Unless otherwise specified, each of the computational
operations specified by this standard that returns a numeric result shall be performed as if it first produced
an intermediate result correct to infinite precision and with unbounded range, and then rounded that
intermediate result, if necessary, to fit in the destination’s format (see Clause 4 and Clause 7).
The g++ compiler (which I have been testing) interprets the standard correctly by implementing the round-to-nearest behaviour stated in Section 4.3.1 of the IEEE 754 document. The implication is that even when a[i] - d is not exactly representable, a numeric result is delivered as if the subtraction first produced an intermediate result correct to infinite precision and with unbounded range, and then rounded that intermediate result. Hence, a[i] - d may or may not equal a[i], which means the answer may or may not be 1, depending on whether the exact value of a[i] - d is closer to a[i] or to the next representable value below a[i].
In the same way, 8038 - 2.22507e-308 is closer to 8038 than to the double below it, so it is rounded (round to nearest) back to 8038 and the final answer is 1. The point is simply that this behaviour follows from the implementation of the standard and is not something arbitrary.
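As a hedged illustration of that rounding rule, here is a sketch using two of the values from the vector above; whether a[i] - 1e-17 rounds back to a[i] depends on whether 1e-17 is smaller than half the gap just below a[i]:
#include <cmath>
#include <iostream>

int main() {
    const double d = 1e-17;
    for (double a : {0.07, 0.3}) {
        double gapBelow = a - std::nextafter(a, 0.0);   // one ULP step below a
        std::cout << a << ": half-gap = " << gapBelow / 2
                  << ((a - d == a) ? ", rounds back to a (floor gives 1)"
                                   : ", rounds to the double below a (floor gives 0)")
                  << '\n';
    }
}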
I found the references below on floating-point numbers very useful. I would recommend reading Cleve Moler's (the founder of MATLAB) piece on floating-point numbers before going through the IEEE specification, for a quick and easy understanding of their behaviour.
"IEEE Standard for Floating-Point Arithmetic," in IEEE Std 754-2019 (Revision of IEEE 754-2008) , vol., no., pp.1-84, 22 July 2019, doi: 10.1109/IEEESTD.2019.8766229.
Moler, Cleve. “Floating Points.” MATLAB News and Notes. Fall, 1996.
I am calculating the number of significant digits past the decimal point. My program discards any digits that lie more than 7 orders of magnitude past the decimal point. Expecting some error with doubles, I accounted for very small numbers popping up when subtracting ints from doubles, even when it looked like the result should equal zero (to my knowledge this is due to how computers store and compute their numbers). My confusion is why my program does not handle this unexpected number, given this random test value.
Having put many cout statements it would seem that it messes up when it tries to cast the final 2. Whenever it casts it casts to 1 instead.
bool flag = true;
long double test = 2029.00012;
int count = 0;
while (flag)
{
    test = test - static_cast<int>(test);
    if (test <= 0.00001)
    {
        flag = false;
    }
    test *= 10;
    count++;
}
The solution I found was to cast to int only once at the beginning (since rounding may otherwise produce a negative value and terminate the loop prematurely) and to round from then on. The interesting thing is that both trunc and floor had the same issue, seemingly turning what should be a 2 into a 1.
My Professor and I were both quite stumped as I fully expected small numbers to appear (most were in the 10^-10 range), but was not expecting that casting, truncing, and flooring would all also fail.
It is important to understand that not all rational numbers are representable in finite precision. Also, the set of numbers representable in finite precision in base ten is different from the set of numbers representable in finite precision in base two. Finally, it is important to understand that your CPU probably represents floating-point numbers in binary.
2029.00012 in particular happens to be a number that is not representable in a double precision IEEE 754 floating point (and it indeed is a double precision literal; you may have intended to use long double instead). It so happens that the closest number that is representable is 2029.000119999999924402800388634204864501953125. So, you're counting the significant digits of that number, not the digits of the literal that you used.
If the intention of 0.00001 was to stop counting digits when the number is close to a whole number, it is not sufficient to check whether the value is less than the threshold, but also whether it is greater than 1 - threshold, as the representation error can go either way:
if(test <= 0.00001 || test >= 1 - 0.00001)
After all, you could multiply 0.99999999999999999999999999 by 10 many times before the fractional part ever got close to zero, even though that number is very close to a whole number.
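A minimal sketch applying the two-sided threshold (and using a long double literal so the value actually has long double precision); the exact stopping rule and the way the count is kept are my assumptions rather than the asker's specification, but it prints 5 for this input:
#include <iostream>

int main() {
    long double test = 2029.00012L;   // note the L suffix: a true long double literal
    const long double eps = 0.00001L;
    int count = 0;

    while (true) {
        test -= static_cast<int>(test);          // keep only the fractional part
        // Stop when the fraction is close to 0 *or* close to 1, because the
        // representation error can fall on either side of a whole number.
        if (test <= eps || test >= 1.0L - eps)
            break;
        test *= 10.0L;
        ++count;
    }
    std::cout << count << " digits past the decimal point\n";   // 5 for 2029.00012
}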
As multiple people have already commented, that won't work because of limitations of floating-point numbers. You had a somewhat correct intuition when you said that you expected "some error" with doubles, but that is ultimately not enough. Running your specific program on my machine, the closest representable double to 2029.00012 is 2029.0001199999999244 (this is actually a truncated value, but it shows the series of 9's well enough). For that reason, when you multiply by 10, you keep finding new significant digits.
Ultimately, the issue is that you are manipulating a base-2 real number like it's a base-10 number. This is actually quite difficult. The most notorious use cases for this are printing and parsing floating-point numbers, and a lot of sweat and blood went into that. For example, it wasn't that long ago that you could trick the official Java implementation into looping endlessly trying to convert a String to a double.
Your best shot might be to just reuse all that hard work. Print to 7 digits of precision, and subtract the number of trailing zeroes from the result:
#include <iostream>
#include <sstream>
#include <iomanip>
#include <string>
int main() {
    long double d = 2029.00012;
    std::stringstream stream;
    stream << std::fixed << std::setprecision(7) << d;
    auto double_string = stream.str();                        // e.g. "2029.0001200"
    auto first_decimal_index = double_string.find('.') + 1;   // index of the first digit after '.'
    auto last_nonzero_index = double_string.find_last_not_of('0');
    if (last_nonzero_index == std::string::npos) {
        std::cout << "7 significant digits\n";
    } else if (last_nonzero_index < first_decimal_index) {
        std::cout << -static_cast<int>(first_decimal_index - last_nonzero_index + 1)
                  << " significant digits\n";
    } else {
        std::cout << (last_nonzero_index - first_decimal_index + 1) << " significant digits\n";
    }
}
It feels unsatisfactory, but:
it correctly prints 5;
the "satisfactory" alternative is possibly significantly harder to implement.
It seems to me that your second-best alternative is to read up on floating-point printing algorithms and implement just enough of them to get the length of the value that you're going to print, and that's not exactly an introductory-level task. If you decide to go this route, the current state of the art is the Grisu2 algorithm. Grisu2 has the notable benefit that it will always print the shortest base-10 string that produces the given floating-point value, which is what you seem to be after.
If you want sane results, you can't just truncate the digits, because sometimes the floating point number will be a hair less than the rounded number. If you want to fix this via a fluke, change your initialization to be
long double test = 2029.00012L;
If you want to fix it for real,
bool flag = true;
long double test = 2029.00012;
int count = 0;
while (flag)
{
    test = test - static_cast<int>(test + 0.000005);
    if (test <= 0.00001)
    {
        flag = false;
    }
    test *= 10;
    count++;
}
My apologies for butchering your haphazard indentation; I can't abide it. According to one of my CS professors, "ideally, a computer scientist never has to worry about the underlying hardware." I'd guess your CS professor might have similar thoughts.
Here is my code:
#include <iostream>
#include <cmath>
using namespace std;
int main()
{
    int n, i, num, m, k = 0;
    cout << "Enter a number :\n";
    cin >> num;
    n = log10(num);
    while (n > 0) {
        i = pow(10, n);
        m = num / i;
        k = k + pow(m, 3);
        num = num % i;
        --n;
        cout << m << endl;
        cout << num << endl;
    }
    k = k + pow(num, 3);
    return 0;
}
When I input 111 it gives me this
1
12
1
2
I am using Code::Blocks. I don't know what is wrong.
Whenever I use pow expecting an integer result, I add .5 so I use (int)(pow(10,m)+.5) instead of letting the compiler automatically convert pow(10,m) to an int.
I have read many places telling me others have done exhaustive tests of some of the situations in which I add that .5 and found zero cases where it makes a difference. But accurately identifying the conditions in which it isn't needed can be quite hard. Using it when it isn't needed does no real harm.
If it makes a difference, it is a difference you want. If it doesn't make a difference, it had a tiny cost.
In the posted code, I would adjust every call to pow that way, not just the one I used as an example.
There is no equally easy fix for your use of log10, but it may be subject to the same problem. Since you expect a non-integer answer and want that answer truncated down to an integer, adding .5 would be very wrong. So you may need a slightly more involved workaround for the fundamental problem of working with floating point. I'm not certain, but assuming 32-bit integers, I think adding 1e-10 to the result of log10 before converting to int is never enough to turn log10(10^n - 1) into log10(10^n), yet is always enough to correct an error that might have done the reverse.
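For instance, here is a hedged rework of the posted program with both adjustments applied (the +.5 on each pow call and the 1e-10 on log10); removing the debugging couts is my choice. For input 111 it should print 3, the sum of the cubes of the digits:
#include <cmath>
#include <iostream>

int main() {
    int num = 111, k = 0;

    // Adding a tiny epsilon before truncating log10 guards against a result
    // like 1.9999999 truncating to 1 when the true answer is 2.
    int n = static_cast<int>(std::log10(num) + 1e-10);

    while (n > 0) {
        // Adding .5 before truncating pow rounds a result like 99.999... up to 100.
        int i = static_cast<int>(std::pow(10, n) + 0.5);
        int m = num / i;
        k += static_cast<int>(std::pow(m, 3) + 0.5);
        num %= i;
        --n;
    }
    k += static_cast<int>(std::pow(num, 3) + 0.5);
    std::cout << k << '\n';   // 3 for input 111 (1^3 + 1^3 + 1^3)
}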
pow does floating-point exponentiation.
Floating point functions and operations are inexact, you cannot ever rely on them to give you the exact value that they would appear to compute, unless you are an expert on the fine details of IEEE floating point representations and the guarantees given by your library functions.
(and furthermore, floating-point numbers might even be incapable of representing the integers you want exactly)
This is particularly problematic when you convert the result to an integer, because the result is truncated toward zero: int x = 0.999999; sets x == 0, not x == 1. Even the tiniest error in the wrong direction completely spoils the result.
You could round to the nearest integer, but that has problems too; e.g. with sufficiently large numbers, your floating point numbers might not have enough precision to be near the result you want. Or if you do enough operations (or unstable operations) with the floating point numbers, the errors can accumulate to the point you get the wrong nearest integer.
If you want to do exact, integer arithmetic, then you should use functions that do so. e.g. write your own ipow function that computes integer exponentiation without any floating-point operations at all.
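A minimal sketch of such an integer-only power function (the name ipow and the long long result type are my choices, assuming the result fits):
#include <iostream>

// Exponentiation by squaring with no floating point anywhere, so the result
// is exact as long as it fits in long long.
long long ipow(long long base, unsigned exp) {
    long long result = 1;
    while (exp > 0) {
        if (exp & 1) result *= base;   // fold in the factor for this bit
        base *= base;                  // square the base for the next bit
        exp >>= 1;
    }
    return result;
}

int main() {
    std::cout << ipow(10, 2) << '\n';  // prints exactly 100
    std::cout << ipow(2, 10) << '\n';  // prints exactly 1024
}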
What is wrong with my code? It converts inches and feet and compares them in meters. If I enter 12 for inches and 1 for feet, it says that the numbers are not equal. Is this a known issue with g++? Can somebody explain this to me?
#include <iostream>
#include <cmath>
using namespace std;
int main()
{
    double in, ft, m1, m2;
    cin >> in >> ft;
    m1 = in * 0.0254;
    m2 = ft * 0.3048;
    cout << m1 << '\t' << m2 << '\n' << endl;
    // to show that both numbers are equal
    if (m1 == m2) cout << "yay";
    else cout << "boo";
}
Does anybody else have this issue?
@Josh, add this to your code and run it:
cout << m2 - m1;
You will be surprised; the answer is not zero.
For the problem in this code, changing the data type from double to float fixes it:
float in, ft, m1, m2;
The reason that the numbers don't match is that computers use a binary representation of numbers which leads to inaccuracies when trying to represent decimal numbers.
You think the number is 0.3048 (because that's what you coded) - but when compiled, the computer can only represent this as the nearest equivalent in binary format (see IEEE floating point for more info). So the number might be something extremely close to 0.3048, but not precisely that.
After you've done your calculations, you compare the numbers - but if the two are not absolutely identical in their binary representations, they won't match.
One simple way to solve it (but by no means the only solution) is to subtract the two operands and check how close to zero the difference is. If:
fabs(a - b) < 0.00001
(an arbitrary amount), then you can presume the values are the same.
What you're seeing is a result of inexact floating-point representation. Binary floating-point numbers cannot represent all base-10 decimal values exactly. Thus, when you do something as simple as multiplying 12 * 0.0254 you get the very odd result 0.3047999...6, whereas computing 1 * 0.3048 gives the expected 0.3048. The problem is that 0.0254 isn't stored exactly; instead, the closest representable approximation (something like 0.0253999999...98) is used. The difference is small, but it becomes noticeable when the inexact value is used in a calculation and then compared to another value that doesn't suffer from the same rounding, such as 0.3048.
A basic rule to keep in mind is that you should never compare floating-point values for equality; instead, compare them in a manner that allows for an acceptable error. For example, instead of comparing values like this:
if(val1 == val2)...
use something like
if(fabs(val1 - val2) < 0.0000001)...
so that the two variables will be considered equal if their values differ by less than 1/10,000,000 (which is pretty close :-).
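Putting that advice back into the original program, here is a hedged sketch; the helper name nearly_equal and the 1e-9 tolerance are arbitrary choices of mine, not a standard API:
#include <cmath>
#include <iostream>

// Tolerance-based comparison instead of operator==.
bool nearly_equal(double a, double b, double tol = 1e-9) {
    return std::fabs(a - b) < tol;
}

int main() {
    double in = 12.0, ft = 1.0;
    double m1 = in * 0.0254;   // inches to metres
    double m2 = ft * 0.3048;   // feet to metres

    std::cout << (m1 == m2) << '\n';            // may print 0: exact comparison fails
    std::cout << nearly_equal(m1, m2) << '\n';  // prints 1: equal within tolerance
}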
I am running long simulations. I record the results into a vector to compute statistics about the data. I realized that, in theory, those samples could be the result of a division by zero; this is only theoretical, and I am pretty sure it's not the case. In order to avoid rerunning the simulation after modifying the code, I was wondering what happens in that case. Would I be able to tell whether a division by 0 has occurred? Will I get error messages? (Exceptions are not being handled at the moment.)
Thanks
For IEEE floats, division of a finite nonzero float by 0 is well defined and results in +infinity (if the value was greater than zero) or -infinity (if the value was less than zero). The result of 0.0/0.0 is NaN. If you use integers, the behaviour is undefined.
Note that C standard says (6.5.5):
The result of the / operator is the quotient from the division of
the first operand by the second; the result of the % operator is the
remainder. In both operations, if the value of the second operand is
zero, the behavior is undefined.
So something/0 is undefined (by the standard) both for integral types and for floating-point types. Nevertheless, most implementations exhibit the aforementioned behaviour (+-INF or NaN).
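A small sketch of the IEEE behaviour described above, computed through variables rather than literals:
#include <cmath>
#include <iostream>

int main() {
    double pos = 1.0, neg = -1.0, zero = 0.0;

    std::cout << pos / zero << '\n';              // inf
    std::cout << neg / zero << '\n';              // -inf
    std::cout << zero / zero << '\n';             // nan (exact text is implementation-defined)
    std::cout << std::isnan(zero / zero) << '\n'; // 1

    // int i = 1 / 0;   // integer division by zero is undefined behaviour: don't do this
    return 0;
}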
If you're talking about integers, then your program will most likely crash upon division by zero.
If you're talking about floats, then division by zero is allowed and the result is INF or -INF. It is then up to your code whether the program crashes, handles the value gracefully, or continues with undefined/unexpected results.
If you use IEEE floats, dividing by zero yields infinity or NaN rather than an error: if op1 is 0, you get NaN; if op1 is greater than 0, you get +infinity; if op1 is less than 0, you get -infinity. If you divide by 0 in integer arithmetic, you will typically get a runtime error such as "Floating point exception".
#include <iostream>
#include <cmath>
#include <string>
using namespace std;

int main()
{
    double a = 123, b = 0;
    double result = a / b;   // IEEE division of a positive finite value by zero
    string isInfinite = isinf(result) ? "is" : "is not";
    cout << "result=" << result << " " << isInfinite << " infinity" << endl;
}
result=inf is infinity
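To answer the original question about recognising after the fact whether a division by zero slipped into the recorded samples: any affected entry shows up as inf or NaN, which can be detected with std::isinf and std::isnan. A hedged sketch follows; the vector name samples and its contents are hypothetical:
#include <cmath>
#include <iostream>
#include <vector>

int main() {
    double zero = 0.0;
    // Hypothetical recorded results; the last two simulate x/0 and 0/0.
    std::vector<double> samples{1.0, 2.5, 1.0 / zero, zero / zero};

    bool suspicious = false;
    for (double s : samples)
        if (std::isinf(s) || std::isnan(s))
            suspicious = true;

    std::cout << (suspicious ? "division by zero likely occurred"
                             : "all samples finite") << '\n';
}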