Double overflow? - c++

I have always wondered what happens when a double reaches its max value, so I decided to write this code:
#include <stdint.h>
#include <iostream>
#define UINT64_SIZE 18446744073709551615

int main() {
    std::uint64_t i = UINT64_SIZE;
    double d1 = ((double)(i+1)) / UINT64_SIZE;
    double d2 = (((double)(i)) / UINT64_SIZE)*16;
    double d3 = ((double)(i * 16)) / UINT64_SIZE;
    std::cout << d1 << " " << d2 << " " << d3;
}
I was expecting something like this:
0 16 0
But this is my output:
0 16 1
What is going on here? Why are the values of d3 and d1 different?
EDIT:
I decided to change my code to this to see the result:
#include <stdint.h>
#include <iostream>
#define UINT64_SIZE 18446744073709551615

int main() {
    std::uint64_t i = UINT64_SIZE;
    double d1 = ((double)(i+1.0)) / UINT64_SIZE; //what?
    double d2 = (((double)(i)) / UINT64_SIZE)*16;
    double d3 = ((double)(i * 16.0)) / UINT64_SIZE;
    std::cout << d1 << " " << d2 << " " << d3;
}
The result I get now is this:
1 16 16
However, shouldn't d1 and d3 still be the same value?

double "overflows" by losing precision, not by wrapping around to 0 (as unsigned integers do).
d1
So, when you add 1.0 to a very big value (18446744073709551615), you don't get 0 in a double, but something like 18446744073709551610 (note the trailing 10 instead of 15) or 18446744073709551620 (note the trailing 20 instead of 15); the least significant digit(s) are rounded away.
Now you're dividing two almost identical values. The exact result would be something like 0.9(9)9 or 1.0(0)1, but since a double cannot hold such a small difference, it again loses precision and the result rounds to 1.0.
d3
Almost the same: when you multiply a huge value by 16, you get a rounded result (the less significant digits are thrown away); dividing it by UINT64_SIZE then gives you "almost" 16, which rounds to 16.

This is a case of loss of precision. Consider the following.
#include <stdint.h>
#include <iostream>
#define UINT64_SIZE 18446744073709551615

int main() {
    std::uint64_t i = UINT64_SIZE;
    auto a = i;
    auto b = i * 16;
    auto c = (double)b;
    auto d = (uint64_t)c;
    std::cout << a << std::endl;
    std::cout << b << std::endl;
    std::cout << c << std::endl;
    std::cout << d << std::endl;
    return 0;
}
On my system the output is as follows.
18446744073709551615
18446744073709551600
1.8446744073709552e+19
9223372036854775808
double simply doesn't have enough precision in this case.
Edit: There is also a rounding issue. When you perform the division by UINT64_SIZE, the denominator is promoted to double and you are left with a quotient between 0.0 and 1.0. The fractional digits are not simply dropped; the actual value is very near 1.0 and is rounded up to 1 when printed by std::cout.
In your question you ask "what happens in case a double reaches its max value". Note that in the example you provided, no double is ever near its maximum value; only its precision is exceeded. When a double's precision is exceeded, the excess precision is discarded.

Related

Double precision issues when converting it to a large integer

Precision is the number of digits in a number. Scale is the number of
digits to the right of the decimal point in a number. For example, the
number 123.45 has a precision of 5 and a scale of 2.
I need to convert a double with a maximum scale of 7 (i.e. it may have up to 7 digits after the decimal point) to a __int128. However, given a number, I don't know in advance the actual scale the number has.
#include <iostream>
#include <iomanip>
#include <limits>
#include <string>
#include "json.hpp"
using json = nlohmann::json;

static std::ostream& operator<<(std::ostream& o, const __int128& x) {
    if (x == std::numeric_limits<__int128>::min()) return o << "-170141183460469231731687303715884105728";
    if (x < 0) return o << "-" << -x;
    if (x < 10) return o << (char)(x + '0');
    return o << x / 10 << (char)(x % 10 + '0');
}

int main()
{
    std::string str = R"({"time": [0.143]})";
    std::cout << "input: " << str << std::endl;
    json j = json::parse(str);
    std::cout << "output: " << j.dump(4) << std::endl;
    double d = j["time"][0].get<double>();
    __int128_t d_128_bad = d * 10000000;
    __int128_t d_128_good = __int128(d * 1000) * 10000;
    std::cout << std::setprecision(16) << std::defaultfloat << d << std::endl;
    std::cout << "d_128_bad: " << d_128_bad << std::endl;
    std::cout << "d_128_good: " << d_128_good << std::endl;
}
Output:
input: {"time": [0.143]}
output: {
    "time": [
        0.143
    ]
}
0.143
d_128_bad: 1429999
d_128_good: 1430000
As you can see, the converted double is not the expected 1430000; instead it is 1429999. I know the reason is that a floating-point number cannot always be represented exactly. The problem could be solved if I knew the number of digits after the decimal point.
For example,
I can instead use __int128_t(d * 1000) * 10000. However, I don't know the scale of a given number, which might have a maximum scale of 7.
Question> Is there a possible solution for this? Also, I need to do this conversion very fast.
I'm not familiar with this library, but it does appear to have a mechanism to get a JSON object's string representation (dump()). I would suggest you parse that string into your value rather than going through the double intermediate representation, since in that case you will know the scale of the value as it was written.

How to avoid floating point format error

I am facing the following issue:
when I multiply two numbers, I get different results depending on the values of those numbers. I tried to experiment with types but didn't get the expected result.
#include <stdio.h>
#include <iostream>
#include <fstream>
#include <iomanip>
#include <math.h>

int main()
{
    const double value1_39 = 1.39;
    const long long m_100000 = 100000;
    const long long m_10000 = 10000;
    const double m_10000double = 10000;
    const long long longLongResult_1 = value1_39 * m_100000;
    const double doubleResult_1 = value1_39 * m_100000;
    const long long longLongResult_2 = value1_39 * m_10000;
    const double doubleResult_2 = value1_39 * m_10000;
    const long long longLongResult_3 = value1_39 * m_10000double;
    const double doubleResult_3 = value1_39 * m_10000double;
    std::cout << std::setprecision(6) << value1_39 << '\n';
    std::cout << std::setprecision(6) << longLongResult_1 << '\n';
    std::cout << std::setprecision(6) << doubleResult_1 << '\n';
    std::cout << std::setprecision(6) << longLongResult_2 << '\n';
    std::cout << std::setprecision(6) << doubleResult_2 << '\n';
    std::cout << std::setprecision(6) << longLongResult_3 << '\n';
    std::cout << std::setprecision(6) << doubleResult_3 << '\n';
    return 0;
}
Result seen in the debugger:
Variable Value
value1_39 1.3899999999999999
m_100000 100000
m_10000 10000
m_10000double 10000
longLongResult_1 139000
doubleResult_1 139000
longLongResult_2 13899
doubleResult_2 13899.999999999998
longLongResult_3 13899
doubleResult_3 13899.999999999998
Result seen in cout:
1.39
139000
139000
13899
13900
13899
13900
I know the problem lies in the nature of how floating-point numbers are stored: the computer keeps the data as fractions in base 2.
My question is: how do I get 1.39 * 10000 to come out as 13900? (I do get 139000 when multiplying the same value by 100000.) Is there any trick that can help achieve my goal?
I have some ideas in mind but am not sure whether they are good enough:
1) parse the string to get the numbers to the left and right of the dot
2) multiply the number by 100, then divide by 100 when the calculation is done
but each of these solutions has its drawbacks. I am wondering whether there is any nice trick for this.
As the comments already said, no, there is no general solution. The problem is due to the nature of floating-point numbers being stored in base 2 (as you already said). Floating-point types are defined in IEEE 754. Any value whose fractional part is not a finite sum of powers of two can't be stored precisely in base 2.
To be more specific
You CAN store:
1.25 (2^0 + 2^-2)
0.75 (2^-1 + 2^-2)
because there is an exact representation.
You CAN'T store:
1.1
1.4
because these have infinitely repeating representations in the base-2 system. You can try to round, or use an arbitrary-precision floating-point library (though even those have their limits in memory and speed) with much greater precision than float, and then cast back to float after the multiplication.
There are also a lot of other related problems when it comes to floating points. You will find out that the result of 10^20 + 2 is only 10^20 because you have a fixed digit resolution (6-7 digits for float and 15-16 digits for double). When you calculate with numbers that have huge differences in magnitude the smaller ones will just "disappear".
Question: Why does multiplying 1.39 by 100000 give exactly 139000, but multiplying by 10000 does not give exactly 13900?
Each product is rounded to the nearest representable double, so whether a particular product lands exactly on an integer is essentially luck: the rounding of 1.39 * 100000 happens to hit 139000 exactly, while 1.39 * 10000 lands on 13899.999999999998, just below 13900. (Compiler, OS, and hardware details can also influence how such borderline cases come out.)

Small numerical error when calculating Weight Average

Here is a part of a physics engine.
The simplified function centerOfMass calculates the 1D center of mass of two rigid bodies (demo):
#include <iostream>
#include <iomanip>

float centerOfMass(float pos1, float m1, float pos2, float m2){
    return (pos1*m1 + pos2*m2) / (m1 + m2);
}

int main(){
    float a = 5.55709743f;
    float b = centerOfMass(a, 50, 0, 0);
    std::cout << std::setprecision(9) << a << '\n'; // 5.55709743
    std::cout << std::setprecision(9) << b << '\n'; // 5.55709696
}
I need b to be precisely = 5.55709743.
The tiny difference can sometimes (in my real case, about 5% of the time) introduce a nasty physics divergence.
There are ways to work around it, e.g. heavy conditional checking.
However, that is very error-prone for me.
Question: How do I solve the calculation error while keeping the code clean, fast, and easy to maintain?
By the way, if it can't be done elegantly, I would probably need to make the caller more resistant to such numerical errors.
Edit
(clarify duplicate question)
Yes, the cause is the precision error from the storage/computing format (mentioned in Is floating point math broken?).
However, this question asks about how to neutralize its symptom in a very specific case.
You are trying to get 9 decimal digits of precision, but the datatype float only has about 7 decimal digits.
Use double instead. (demo)
Use double, not float. An IEEE 754 double has about 15-16 significant decimal digits of precision.
#include <iostream>
#include <iomanip>

double centerOfMass(double pos1, double m1, double pos2, double m2) {
    return (pos1*m1 + pos2*m2) / (m1 + m2);
}

int main() {
    double a = 5.55709743;
    double b = centerOfMass(a, 50, 0, 0);
    std::cout << std::setprecision(16) << a << '\n'; // 5.55709743
    std::cout << std::setprecision(16) << b << '\n'; // 5.55709743
    std::cout << std::setprecision(16) << (b - a) << '\n'; // 0
}
For the example given, centerOfMass(a, 50, 0, 0), the following will give exact results for all values of a, but of course the example does not look realistic.
double centerOfMass(double pos1, double m1, double pos2, double m2) {
    double divisor = m1 + m2;
    return pos1*(m1/divisor) + pos2*(m2/divisor);
}

Problematic output of fmod (long double, long double)

It seems that the output of fmod (long double, long double) in this test is problematic.
Any suggestions?
g++ --version
g++ (GCC) 4.9.2
uname -srvmpio
CYGWIN_NT-6.1 1.7.34(0.285/5/3) 2015-02-04 12:12 i686 unknown unknown Cygwin
g++ test1.cpp
// No errors, no warnings
./a.exe
l1 = 4294967296
l2 = 72057594037927934
l3 = 4294967294
d1 = 4294967296
d2 = 72057594037927934
d3 = 0 // Expected 4294967294
// -------- Program test1.cpp --------
#include <iostream>
#include <iomanip>
#include <cmath>

int main (int argc, char** argv)
{
    long long l1 = 4294967296;
    long long l2 = 72057594037927934;
    long long l3 = l2 % l1;
    long double d1 = static_cast<long double>(l1);
    long double d2 = static_cast<long double>(l2);
    long double d3 = fmod (d2, d1);
    std::cout << "l1 = " << l1 << std::endl;
    std::cout << "l2 = " << l2 << std::endl;
    std::cout << "l3 = " << l3 << std::endl;
    std::cout << std::endl;
    std::cout << "d1 = " << std::setprecision(18) << d1 << std::endl;
    std::cout << "d2 = " << std::setprecision(18) << d2 << std::endl;
    std::cout << "d3 = " << std::setprecision(18) << d3 << std::endl;
    return 0;
}
// -----------------------
Floating-point types, including long double, cannot represent all integral values in their range; the larger the magnitude, the sparser the exactly representable integers become.
A consequence is that converting large integral values to long double (or any floating-point type) does not necessarily preserve the value; the converted value is only the closest approximation the floating-point type allows.
From there, if your two conversions have produced different values, it would be very lucky if the result of fmod() were exactly the value you seek.
Also be aware of which fmod you are calling: the C function fmod() takes double arguments, so unless the long double overload of std::fmod (or fmodl) is actually selected, your long double values will be converted to double. On platforms where long double is wider than double, the set of values a double can represent is a proper subset of what a long double can represent; among other things, that means a smaller range of integral values that can be represented exactly.
The usual suggestion is not to do such things. If you can do required operations using integer operations (like %) then do so. And, if you use floating point, you need to allow for and manage the loss of precision associated with using floating point.

Floating point limits code not producing correct results

I am racking my brain trying to figure out why this code does not get the right result. I am looking for the hexadecimal representations of the floating-point positive and negative overflow/underflow levels. The code is based on this site and a Wikipedia entry:
7f7f ffff ≈ 3.4028234 × 10^38 (max single precision) -- from the Wikipedia entry, corresponds to positive overflow
Here's the code:
#include <iostream>
#include <cstdio>
#include <cstdlib>
#include <cmath>
using namespace std;

int main(void) {
    float two = 2;
    float twentyThree = 23;
    float one27 = 127;
    float one49 = 149;
    float posOverflow, negOverflow, posUnderflow, negUnderflow;
    posOverflow = two - (pow(two, -twentyThree) * pow(two, one27));
    negOverflow = -(two - (pow(two, one27) * pow(two, one27)));
    negUnderflow = -pow(two, -one49);
    posUnderflow = pow(two, -one49);
    cout << "Positive overflow occurs when value greater than: " << hex << *(int*)&posOverflow << endl;
    cout << "Neg overflow occurs when value less than: " << hex << *(int*)&negOverflow << endl;
    cout << "Positive underflow occurs when value greater than: " << hex << *(int*)&posUnderflow << endl;
    cout << "Neg overflow occurs when value greater than: " << hex << *(int*)&negUnderflow << endl;
}
The output is:
Positive overflow occurs when value greater than: f3800000
Neg overflow occurs when value less than: 7f800000
Positive underflow occurs when value greater than: 1
Neg overflow occurs when value greater than: 80000001
To get the hexadecimal representation of the floating point, I am using a method described here:
Why isn't the code working? I know it'll work if positive overflow = 7f7f ffff.
Your expression for the highest representable positive float is wrong. The page you linked uses (2-pow(2, -23)) * pow(2, 127), and you have 2 - (pow(2, -23) * pow(2, 127)). Similarly for the smallest representable negative float.
Your underflow expressions look correct, however, and so do the hexadecimal outputs for them.
Note that posOverflow and negOverflow are simply +FLT_MAX and -FLT_MAX. But note that your posUnderflow and negUnderflow are actually smaller than FLT_MIN (because they are denormal, and FLT_MIN is the smallest positive normal float).
Floating point loses precision as the number gets bigger. A number of the magnitude 2^127 does not change when you add 2 to it.
Other than that, I'm not really following your code. Using words to spell out numbers makes it hard for me to read.
Here is the standard way to get the floating-point limits of your machine:
#include <limits>
#include <iostream>
#include <iomanip>

std::ostream &show_float( std::ostream &s, float f ) {
    s << f << " = ";
    std::ostream s_hex( s.rdbuf() );
    s_hex << std::hex << std::setfill( '0' );
    for ( char const *c = reinterpret_cast< char const * >( & f );
          c != reinterpret_cast< char const * >( & f + 1 );
          ++ c ) {
        s_hex << std::setw( 2 ) << ( static_cast< unsigned int >( * c ) & 0xff );
    }
    return s;
}

int main() {
    std::cout << std::hex;
    std::cout << "Positive overflow occurs when value greater than: ";
    show_float( std::cout, std::numeric_limits< float >::max() ) << '\n';
    std::cout << "Neg overflow occurs when value less than: ";
    show_float( std::cout, - std::numeric_limits< float >::max() ) << '\n';
    std::cout << "Positive underflow occurs when value less than: ";
    show_float( std::cout, std::numeric_limits< float >::min() ) << '\n';
    std::cout << "Neg underflow occurs when value greater than: ";
    show_float( std::cout, - std::numeric_limits< float >::min() ) << '\n';
}
output:
Positive overflow occurs when value greater than: 3.40282e+38 = ffff7f7f
Neg overflow occurs when value less than: -3.40282e+38 = ffff7fff
Positive underflow occurs when value less than: 1.17549e-38 = 00008000
Neg underflow occurs when value greater than: -1.17549e-38 = 00008080
The output depends on the endianness of the machine. Here the bytes are reversed due to little-endian order.
Note, "underflow" in this case isn't a catastrophic zero result, but just denormalization which gradually reduces precision. (It may be catastrophic to performance, though.) You might also check numeric_limits< float >::denorm_min() which produces 1.4013e-45 = 01000000.
Your code assumes integers have the same size as a float (so do all but a few of the posts on the page you've linked, btw.) You probably want something along the lines of:
for (size_t s = 0; s < sizeof(myVar); ++s) {
    unsigned char byte = reinterpret_cast<unsigned char*>(&myVar)[s];
    // the s-th byte is byte
}
that is, something akin to the templated code on that page.
Your compiler may not be using those specific IEEE 754 types. You'll need to check its documentation.
Also, consider using std::numeric_limits<float>::min()/max() or the <cfloat> FLT_ constants for determining some of those values.