Problematic output of fmod (long double, long double) - C++

Problematic output of fmod (long double, long double)
It seems that the output of fmod (long double, long double) in this test is problematic.
Any suggestions?
g++ --version
g++ (GCC) 4.9.2
uname -srvmpio
CYGWIN_NT-6.1 1.7.34(0.285/5/3) 2015-02-04 12:12 i686 unknown unknown Cygwin
g++ test1.cpp
// No errors, no warnings
./a.exe
l1 = 4294967296
l2 = 72057594037927934
l3 = 4294967294
d1 = 4294967296
d2 = 72057594037927934
d3 = 0 // Expected 4294967294
// -------- Program test1.cpp --------
#include <iostream>
#include <iomanip>
#include <cmath>
int main (int argc, char** argv)
{
    long long l1 = 4294967296;
    long long l2 = 72057594037927934;
    long long l3 = l2 % l1;
    long double d1 = static_cast<long double>(l1);
    long double d2 = static_cast<long double>(l2);
    long double d3 = fmod (d2, d1);
    std::cout << "l1 = " << l1 << std::endl;
    std::cout << "l2 = " << l2 << std::endl;
    std::cout << "l3 = " << l3 << std::endl;
    std::cout << std::endl;
    std::cout << "d1 = " << std::setprecision(18) << d1 << std::endl;
    std::cout << "d2 = " << std::setprecision(18) << d2 << std::endl;
    std::cout << "d3 = " << std::setprecision(18) << d3 << std::endl;
    return 0;
}
// -----------------------

Floating point types, including long double, cannot represent all integral values in their range. In practice, the larger the magnitude of a value, the less likely it is to be representable exactly.
A consequence is that converting large integral values to long double (or any floating point type) does not necessarily preserve the value - the converted value is only the closest approximation the floating point type allows.
From there, if either of your two conversions has changed the value, it would take luck for fmod() to return exactly the value you seek.
As called here, fmod() is also not the long double overload: the unqualified name resolves to the C library function taking double arguments, which means your long double values will be converted to double. The set of values a double can represent is a subset of the set a long double can represent; among other things, that means a smaller range of integral values can be represented exactly.
The usual suggestion is not to do such things. If you can do the required operations with integer arithmetic (like %), do so. And if you use floating point, you need to allow for and manage the loss of precision that comes with it.
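For illustration, here is a minimal sketch (mine, not from the original post) that keeps the whole computation in long double by calling the std::fmod overload (or equivalently the C fmodl function). It assumes an x87-style 80-bit long double with a 64-bit significand, in which both test values happen to be exactly representable:
#include <cmath>
#include <iostream>

int main() {
    long double d1 = 4294967296.0L;        // 2^32, exactly representable
    long double d2 = 72057594037927934.0L; // 2^56 - 2, fits in 64 significand bits
    long double d3 = std::fmod(d2, d1);    // long double overload, no narrowing
    std::cout << static_cast<long long>(d3) << std::endl; // 4294967294
    return 0;
}
On such an implementation the remainder matches the integer % result exactly; on a platform where long double is just double, the original problem would reappear.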

Related

Output of strtoull() loses precision when converted to double and then back to uint64_t

Consider the following:
#include <iostream>
#include <cstdint>
#include <cstdlib> // std::strtoull

int main() {
    std::cout << std::hex
              << "0x" << std::strtoull("0xFFFFFFFFFFFFFFFF", 0, 16) << std::endl
              << "0x" << uint64_t(double(std::strtoull("0xFFFFFFFFFFFFFFFF", 0, 16))) << std::endl
              << "0x" << uint64_t(double(uint64_t(0xFFFFFFFFFFFFFFFF))) << std::endl;
    return 0;
}
Which prints:
0xffffffffffffffff
0x0
0xffffffffffffffff
The first number is just the result of converting ULLONG_MAX, from a string to a uint64_t, which works as expected.
However, if I cast the result to double and then back to uint64_t, then it prints 0, the second number.
Normally, I would attribute this to the limited precision of floating point, but what further puzzles me is that if I cast ULLONG_MAX from uint64_t to double and then back to uint64_t, the result is correct (third number).
Why the discrepancy between the second and the third result?
EDIT (by #Radoslaw Cybulski)
For another what-is-going-on-here try this code:
#include <iostream>
#include <cstdint>
#include <cstdlib> // std::strtoull
using namespace std;

int main() {
    uint64_t z1 = std::strtoull("0xFFFFFFFFFFFFFFFF", 0, 16);
    uint64_t z2 = 0xFFFFFFFFFFFFFFFFull;
    std::cout << z1 << " " << uint64_t(double(z1)) << "\n";
    std::cout << z2 << " " << uint64_t(double(z2)) << "\n";
    return 0;
}
which happily prints:
18446744073709551615 0
18446744073709551615 18446744073709551615
The number that is closest to 0xFFFFFFFFFFFFFFFF and is representable by a double (assuming 64 bit IEEE) is 18446744073709551616, i.e. 2^64. That is greater than 0xFFFFFFFFFFFFFFFF (2^64 - 1), so the number is outside the representable range of uint64_t.
About the conversion back to integer, the standard says (quoting the latest draft):
[conv.fpint]
A prvalue of a floating-point type can be converted to a prvalue of an integer type.
The conversion truncates; that is, the fractional part is discarded.
The behavior is undefined if the truncated value cannot be represented in the destination type.
Why the discrepancy between the second and the third result?
Because the behaviour of the program is undefined.
Although it is mostly pointless to analyse the reasons for differences in UB, since the scope of variation is limitless, my guess in this case is that one value is a compile time constant, so its conversion is done by the compiler, while the other comes from a library function invoked at runtime, so its conversion is done by the CPU - and the two need not agree.
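Not part of the original answer, but one way to keep the round trip well defined is to range-check before converting back. A minimal sketch (the helper name to_u64 is mine), assuming C++17 for std::optional:
#include <cstdint>
#include <optional>

// Convert double -> uint64_t only when the truncated value is representable,
// avoiding the undefined behaviour described in [conv.fpint].
std::optional<std::uint64_t> to_u64(double d) {
    constexpr double kLimit = 18446744073709551616.0; // 2^64, exactly representable
    if (!(d >= 0.0) || d >= kLimit)                   // !(d >= 0.0) also rejects NaN
        return std::nullopt;
    return static_cast<std::uint64_t>(d);
}
With such a guard, the out-of-range double(z1) from the example above would be rejected instead of silently invoking undefined behaviour.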

Unexpected result after converting uint64_t to double

In the following code:
#include <iostream>
...
uint64_t t1 = 1510763846;
uint64_t t2 = 1510763847;
double d1 = (double)t1;
double d2 = (double)t2;
// d1 == t2 => evaluates to true somehow?
// t1 == d2 => evaluates to true somehow?
// d1 == d2 => evaluates to true somehow?
// t1 == t2 => evaluates to false, of course.
std::cout << std::fixed
          << "uint64_t: " << t1 << ", " << t2 << ", "
          << "double: " << d1 << ", " << d2 << ", " << (d2 + 1) << std::endl;
I get this output:
uint64_t: 1510763846, 1510763847, double: 1510763904.000000, 1510763904.000000, 1510763905.000000
And I don't understand why. This answer: biggest integer that can be stored in a double says that an integral number up to 2^53 (9007199254740992) can be stored in a double without losing precision.
I actually get errors when I start doing calculations with the doubles, so it's not only a printing issue. (e.g. 1510763846 and 1510763847 both give 1510763904)
It's also very weird that the double can just be added to and then come out correct (d2+1 == 1510763905.000000)
Rationale: I'm converting these numbers to doubles because I need to work with them in Lua, which only supports floating point numbers. I'm sure I'm compiling the Lua lib with double as the lua_Number type, not float.
std::cout << sizeof(t1) << ", " << sizeof(d2) << std::endl;
Outputs
8, 8
I'm using VS 2012 with target MachineX86, toolkit v110_xp. Floating point model "Precise (/fp:precise)"
Addendum
With the help of people who replied and this article Why are doubles added incorrectly in a specific Visual Studio 2008 project?, I've been able to pinpoint the problem. A library is using a function like _set_controlfp, _control87, _controlfp or __control87_2 to change the precision of my executable to "single". That is why a uint64_t conversion to a double behaves as if it's a float.
When doing a file search for the above function names and "MCW_PC", which is used for Precision Control, I found the following libraries that might have set it:
Android NDK
boost::math
boost::numeric
DirectX (We're using June 2010)
FMod (non-EX)
Pyro particle engine
Now I'd like to rephrase my question:
How do I make sure converting from a uint64_t to a double goes correctly every time, without:
having to call _fpreset() each and every time a possible conversion occurs (think about the function parameters)
having to worry about a library's thread changing the floating point precision in between my _fpreset() and the conversion?
Naive code would be something like this:
double toDouble(uint64_t i)
{
    double d;
    do {
        _fpreset();
        d = i;
        _fpreset();
    } while (d != i);
    return d;
}

double toDouble(int64_t i)
{
    double d;
    do {
        _fpreset();
        d = i;
        _fpreset();
    } while (d != i);
    return d;
}
This solution assumes the odds of a thread messing with the floating point precision twice are astronomically small. The problem is that the values I'm working with are timers representing real-world values, so I shouldn't be taking any chances. Is there a silver bullet for this problem?
From ieee754 floating point conversion it looks like your implementation of double is effectively behaving as float. The standard permits this insofar as it only requires that double provide at least as much precision as float. The most accurate single-precision representation of 1510763846 is 1.510763904E9.
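One possible mitigation (my sketch, not from the original answers, and specific to MSVC on 32-bit x86): save the x87 precision-control bits, force 53-bit precision just for the conversion, then restore them:
#include <float.h>   // MSVC: _controlfp_s, _PC_53, _MCW_PC
#include <cstdint>

double toDouble(std::uint64_t i) {
    unsigned int saved = 0, tmp = 0;
    _controlfp_s(&saved, 0, 0);           // mask 0: just read the control word
    _controlfp_s(&tmp, _PC_53, _MCW_PC);  // force 53-bit (double) precision
    double d = static_cast<double>(i);
    _controlfp_s(&tmp, saved, _MCW_PC);   // restore the previous precision bits
    return d;
}
This still cannot defend against another thread flipping the control word between the two calls, which is the part of the question with no portable silver bullet; building with SSE2 code generation (/arch:SSE2) is often suggested as a way around it, since x87 precision control does not affect SSE2 arithmetic.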

float to int conversion going wrong (even though the float is already an int)

I was writing a little function to calculate the binomial coefficient using the tgamma function provided by C++. tgamma returns floating point values, but I wanted to return an integer. Please take a look at this example program comparing three ways of converting the float back to an int:
#include <iostream>
#include <cmath>

int BinCoeffnear(int n, int k) {
    return std::nearbyint(std::tgamma(n + 1) / (std::tgamma(k + 1) * std::tgamma(n - k + 1)));
}

int BinCoeffcast(int n, int k) {
    return static_cast<int>(std::tgamma(n + 1) / (std::tgamma(k + 1) * std::tgamma(n - k + 1)));
}

int BinCoeff(int n, int k) {
    return (int) std::tgamma(n + 1) / (std::tgamma(k + 1) * std::tgamma(n - k + 1));
}

int main()
{
    int n = 7;
    int k = 2;
    std::cout << "Correct: " << std::tgamma(7 + 1) / (std::tgamma(2 + 1) * std::tgamma(7 - 2 + 1)); // returns 21
    std::cout << " BinCoeff: " << BinCoeff(n, k);       // returns 20
    std::cout << " StaticCast: " << BinCoeffcast(n, k); // returns 20
    std::cout << " nearby int: " << BinCoeffnear(n, k); // returns 21
    return 0;
}
Why is it that, even though the calculation produces a floating point value that prints as 21, 'normal' conversion fails and only nearbyint returns the correct value? What is the nicest way to implement this?
EDIT: according to the C++ documentation here, tgamma(int) returns a double.
From this std::tgamma reference:
If arg is a natural number, std::tgamma(arg) is the factorial of arg-1. Many implementations calculate the exact integer-domain factorial if the argument is a sufficiently small integer.
It seems that the compiler you're using is doing that, calculating the factorial of 7 for the expression std::tgamma(7+1).
The result might differ between compilers, and also between optimization levels. As demonstrated by Jonas there is a big difference between optimized and unoptimized builds.
The remark by #nos is on point. Note that the first statement,
std::cout << "Correct: " <<
std::tgamma(7+1) / (std::tgamma(2+1)*std::tgamma(7-2+1));
prints a double value and does not perform a floating point to integer conversion.
The result of your calculation in floating point is indeed less than 21, yet this double precision value is printed by cout as 21.
On my machine (x86_64, gnu libc, g++ 4.8, optimization level 0) setting cout.precision(18) makes the results explicit.
Correct: 20.9999999999999964 BinCoeff: 20 StaticCast: 20 nearby int: 21
In this case it is practical to replace the integer operations with floating point operations, but one has to keep in mind that the result must be an integer. The right tool here is std::round.
The problem with std::nearbyint is that, depending on the current rounding mode, it may produce different results:
std::fesetround(FE_DOWNWARD); // declared in <cfenv>
std::cout << " nearby int: " << BinCoeffnear(n,k);
would return 20.
So with std::round the BinCoeff function might look like
int BinCoeffRound(int n, int k) {
    return static_cast<int>(
        std::round(
            std::tgamma(n + 1) /
            (std::tgamma(k + 1) * std::tgamma(n - k + 1))
        ));
}
Floating-point numbers have rounding errors associated with them. Here is a good article on the subject: What Every Computer Scientist Should Know About Floating-Point Arithmetic.
In your case the floating-point number holds a value very close but less than 21. Rules for implicit floating–integral conversions say:
The fractional part is truncated, that is, the fractional part is
discarded.
Whereas std::nearbyint:
Rounds the floating-point argument arg to an integer value in floating-point format, using the current rounding mode.
In this case the floating-point number will be exactly 21 and the following implicit conversion would return 21.
The first cout outputs 21 because of the rounding that happens in cout by default. See std::setprecision.
What is the nicest way to implement this?
Use the exact integer factorial function that takes and returns unsigned int instead of tgamma.
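For instance, a minimal exact-arithmetic sketch (the function name binCoeffExact is mine, not from the answer):
// Computes C(n, k) in integer arithmetic; after i steps the running value
// equals C(n - k + i, i), so the division by i is always exact.
unsigned long long binCoeffExact(unsigned n, unsigned k) {
    if (k > n) return 0;
    if (k > n - k) k = n - k;   // symmetry: C(n, k) == C(n, n - k)
    unsigned long long result = 1;
    for (unsigned i = 1; i <= k; ++i)
        result = result * (n - k + i) / i;
    return result;
}
binCoeffExact(7, 2) yields 21 with no rounding involved, and it stays exact until the result itself overflows unsigned long long.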
The problem is in how the floating point values are handled. A result that should be 21 may actually be stored as something like 20.99999, so converting to int drops the fractional part. Instead of converting to int immediately, first round it up by calling the ceil function, which is declared in cmath (or math.h).
This code returns 21 in all three cases:
#include <iostream>
#include <cmath>

int BinCoeffnear(int n, int k) {
    return std::nearbyint(std::tgamma(n + 1) / (std::tgamma(k + 1) * std::tgamma(n - k + 1)));
}

int BinCoeffcast(int n, int k) {
    return static_cast<int>(ceil(std::tgamma(n + 1) / (std::tgamma(k + 1) * std::tgamma(n - k + 1))));
}

int BinCoeff(int n, int k) {
    return (int) ceil(std::tgamma(n + 1) / (std::tgamma(k + 1) * std::tgamma(n - k + 1)));
}

int main()
{
    int n = 7;
    int k = 2;
    std::cout << "Correct: " << (std::tgamma(7 + 1) / (std::tgamma(2 + 1) * std::tgamma(7 - 2 + 1))); // returns 21
    std::cout << " BinCoeff: " << BinCoeff(n, k);       // now returns 21
    std::cout << " StaticCast: " << BinCoeffcast(n, k); // now returns 21
    std::cout << " nearby int: " << BinCoeffnear(n, k); // returns 21
    std::cout << "\n" << (int)(2.9995) << "\n";         // truncation: prints 2
}
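A caveat to this approach (my note, not the answerer's): ceil is only safe when the floating point error is always downward. If an intermediate result lands slightly above the exact integer, ceil overshoots, which is why the std::round answer above is more robust:
#include <cmath>
#include <iostream>

int main() {
    // the value below is the first representable double above 21
    std::cout << std::ceil(21.000000000000004) << "\n";  // prints 22
    std::cout << std::round(21.000000000000004) << "\n"; // prints 21
    return 0;
}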

Double overflow?

I have always wondered what happens when a double reaches its maximum value, so I decided to write this code:
#include <cstdint>
#include <iostream>
#define UINT64_SIZE 18446744073709551615ull

int main() {
    std::uint64_t i = UINT64_SIZE;
    double d1 = ((double)(i + 1)) / UINT64_SIZE;
    double d2 = (((double)(i)) / UINT64_SIZE) * 16;
    double d3 = ((double)(i * 16)) / UINT64_SIZE;
    std::cout << d1 << " " << d2 << " " << d3;
}
I was expecting something like this:
0 16 0
But this is my output:
0 16 1
What is going on here? Why are the values of d3 and d1 different?
EDIT:
I decided to change my code to this to see the result:
#include <cstdint>
#include <iostream>
#define UINT64_SIZE 18446744073709551615ull

int main() {
    std::uint64_t i = UINT64_SIZE;
    double d1 = ((double)(i + 1.0)) / UINT64_SIZE; // what?
    double d2 = (((double)(i)) / UINT64_SIZE) * 16;
    double d3 = ((double)(i * 16.0)) / UINT64_SIZE;
    std::cout << d1 << " " << d2 << " " << d3;
}
The result I get now is this:
1 16 16
However, shouldn't d1 and d3 still be the same value?
double overflows by losing precision, not by wrapping around to 0 (as unsigned integers do).
d1
So, when you add 1.0 to a very big value (18446744073709551615), you don't get 0 in a double, but something like 18446744073709551610 (note the last digits, 10 instead of 15) or 18446744073709551620 (20 instead of 15) - the less significant digits are rounded away.
Now you're dividing two almost identical values. The exact result would be 0.9(9)9 or 1.0(0)1, but since a double cannot hold such a tiny difference, it again loses precision and rounds to 1.0.
d3
Almost the same: when you multiply the huge value by 16 you get a rounded result (the less significant digits are thrown away), and by dividing it you get "almost" 16, which is rounded to 16.
This is a case of loss of precision. Consider the following.
#include <cstdint>
#include <iostream>
#define UINT64_SIZE 18446744073709551615ull

int main() {
    std::uint64_t i = UINT64_SIZE;
    auto a = i;
    auto b = i * 16;
    auto c = (double)b;
    auto d = (uint64_t)c;
    std::cout << a << std::endl;
    std::cout << b << std::endl;
    std::cout << c << std::endl;
    std::cout << d << std::endl;
    return 0;
}
On my system the output is as follows.
18446744073709551615
18446744073709551600
1.8446744073709552e+19
9223372036854775808
double simply doesn't have enough precision in this case.
Edit: there is also a rounding problem. When you perform the division by UINT64_SIZE, the denominator is promoted to double and you are left with a decimal value between 0.0 and 1.0. The decimals are not rounded off; the actual value is very near 1.0 and is rounded up when pushed to std::cout.
In your question you ask "what happens in case a double reaches its max value". Note that in the example you provided no double is ever near its maximum value; only its precision is exceeded. When a double's precision is exceeded, the excess precision is discarded.
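To make the gap concrete, here is a small sketch (mine, assuming 64-bit IEEE double) showing the representable neighbours of 2^64:
#include <cmath>
#include <cstdio>

int main() {
    double d = 18446744073709551615.0;              // 2^64 - 1 rounds up to exactly 2^64
    std::printf("%.1f\n", d);                       // 18446744073709551616.0
    std::printf("%.1f\n", std::nextafter(d, 0.0));  // 18446744073709549568.0
    // The doubles adjacent to 2^64 are 2048 apart; nothing in between exists.
    return 0;
}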

Analysis of float/double precision in 32 decimal digits

From a .c file of another guy, I saw this:
const float c = 0.70710678118654752440084436210485f;
where he wants to avoid the computation of sqrt(1/2).
Can this really be stored somehow in plain C/C++? I mean, without losing precision. It seems impossible to me.
I am using C++, but I do not believe the precision difference between these two languages is big (if there is any); that's why I did not test it.
So, I wrote these few lines, to have a look at the behaviour of the code:
std::cout << "Number: 0.70710678118654752440084436210485\n";
const float f = 0.70710678118654752440084436210485f;
std::cout << "float: " << std::setprecision(32) << f << std::endl;
const double d = 0.70710678118654752440084436210485; // no f extension
std::cout << "double: " << std::setprecision(32) << d << std::endl;
const double df = 0.70710678118654752440084436210485f;
std::cout << "doublef: " << std::setprecision(32) << df << std::endl;
const long double ld = 0.70710678118654752440084436210485;
std::cout << "l double: " << std::setprecision(32) << ld << std::endl;
const long double ldl = 0.70710678118654752440084436210485l; // l suffix!
std::cout << "l doublel: " << std::setprecision(32) << ldl << std::endl;
The output is this:
* ** ***
v v v
Number: 0.70710678118654752440084436210485 // 32 decimal digits
float: 0.707106769084930419921875 // 24 >> >>
double: 0.70710678118654757273731092936941
doublef: 0.707106769084930419921875 // same as float
l double: 0.70710678118654757273731092936941 // same as double
l doublel: 0.70710678118654752438189403651592 // suffix l
where * is the last accurate digit of float, ** the last accurate digit of double and *** the last accurate digit of long double.
The output of double has 32 decimal digits, since I have set the precision of std::cout at that value.
float output has 24, as expected, as said here:
float has 24 binary bits of precision, and double has 53.
I would expect the last output to be the same as the one before it, i.e. I thought that the f suffix would not prevent the number from becoming a double. I think that when I write this:
const double df = 0.70710678118654752440084436210485f;
what happens is that the number first becomes a float and is then stored as a double, so beyond float's 24 bits of precision there is nothing left to recover, and that's why the double's extra precision stops there.
Am I correct?
From this answer I found some relevant information:
float x = 0 has an implicit typecast from int to float.
float x = 0.0f does not have such a typecast.
float x = 0.0 has an implicit typecast from double to float.
[EDIT]
About __float128, it is not standard, thus it's out of the competition. See more here.
From the standard:
There are three floating point types: float, double, and long double.
The type double provides at least as much precision as float, and the
type long double provides at least as much precision as double. The
set of values of the type float is a subset of the set of values of
the type double; the set of values of the type double is a subset of
the set of values of the type long double. The value representation of
floating-point types is implementation-defined.
So you can see the issue with this question: the standard doesn't actually say how precise the floating point types are.
In terms of standard implementations, you need to look at IEEE754, which means the other two answers from Irineau and Davidmh are perfectly valid approaches to the problem.
As to suffix letters to indicate type, again looking at the standard:
The type of a floating literal is double unless explicitly specified by
a suffix. The suffixes f and F specify float, the suffixes l and L specify
long double.
So your attempt to create a long double will just have the same precision as the double literal you are assigning to it unless you use the L suffix.
I understand that some of these answers may not seem satisfactory, but there is a lot of background reading to be done on the relevant standards before you can dismiss answers. This answer is already longer than intended so I won't try and explain everything here.
And as a final note: Since the precision is not clearly defined, why not have a constant that's longer than it needs to be? Seems to make sense to always define a constant that is precise enough to always be representable regardless of type.
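In that spirit, a small sketch (mine; the constant names are illustrative) of defining one over-precise literal and deriving each width from it, assuming C++11 for constexpr:
// One sufficiently precise literal; the L suffix keeps long double precision,
// and each cast rounds it to the destination type's nearest value.
constexpr long double kInvSqrt2L = 0.70710678118654752440084436210485L;
constexpr double      kInvSqrt2d = static_cast<double>(kInvSqrt2L);
constexpr float       kInvSqrt2f = static_cast<float>(kInvSqrt2L);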
Python's numerical library, numpy, has a very convenient float info function. All the types are equivalent to C's:
For C's float:
print numpy.finfo(numpy.float32)
Machine parameters for float32
---------------------------------------------------------------------
precision= 6 resolution= 1.0000000e-06
machep= -23 eps= 1.1920929e-07
negep = -24 epsneg= 5.9604645e-08
minexp= -126 tiny= 1.1754944e-38
maxexp= 128 max= 3.4028235e+38
nexp = 8 min= -max
---------------------------------------------------------------------
For C's double:
print numpy.finfo(numpy.float64)
Machine parameters for float64
---------------------------------------------------------------------
precision= 15 resolution= 1.0000000000000001e-15
machep= -52 eps= 2.2204460492503131e-16
negep = -53 epsneg= 1.1102230246251565e-16
minexp= -1022 tiny= 2.2250738585072014e-308
maxexp= 1024 max= 1.7976931348623157e+308
nexp = 11 min= -max
---------------------------------------------------------------------
And for C's long double:
print numpy.finfo(numpy.float128)
Machine parameters for float128
---------------------------------------------------------------------
precision= 18 resolution= 1e-18
machep= -63 eps= 1.08420217249e-19
negep = -64 epsneg= 5.42101086243e-20
minexp=-16382 tiny= 3.36210314311e-4932
maxexp= 16384 max= 1.18973149536e+4932
nexp = 15 min= -max
---------------------------------------------------------------------
So, not even long double (numpy's float128, which here is x86 80-bit extended precision despite the name) will give you the 32 digits you want. But do you really need them all?
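The C++ counterpart of numpy's finfo is std::numeric_limits; a quick sketch (mine) that prints the corresponding parameters on a given platform:
#include <iostream>
#include <limits>

int main() {
    // significand bits (base-2 digits) and guaranteed decimal digits per type
    std::cout << "float:       " << std::numeric_limits<float>::digits
              << " bits, " << std::numeric_limits<float>::digits10 << " decimal digits\n"
              << "double:      " << std::numeric_limits<double>::digits
              << " bits, " << std::numeric_limits<double>::digits10 << " decimal digits\n"
              << "long double: " << std::numeric_limits<long double>::digits
              << " bits, " << std::numeric_limits<long double>::digits10 << " decimal digits\n";
    return 0;
}
On x86 Linux this typically reports 24/6, 53/15, and 64/18, matching the numpy tables above.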
Some compilers implement the binary128 floating point format, standardized in IEEE 754-2008. Using gcc, for example, the type is __float128. That floating point format has about 34 decimal digits of precision (log10(2^113) ≈ 34.0).
You can use the Boost Multiprecision library and its float128 wrapper. That implementation will use native types if available, or a drop-in replacement otherwise.
Let's extend your experiment with that new non-standard type __float128, with a recent g++ (4.8):
// Compiled with g++ -Wall -lquadmath essai.cpp
#include <iostream>
#include <iomanip>
#include <quadmath.h>
#include <sstream>
std::ostream& operator<<(std::ostream& out, __float128 f) {
    char buf[200];
    std::ostringstream format;
    format << "%." << (std::min)(190L, out.precision()) << "Qf";
    quadmath_snprintf(buf, 200, format.str().c_str(), f);
    out << buf;
    return out;
}

int main() {
    std::cout.precision(32);
    std::cout << "Number: 0.70710678118654752440084436210485\n";
    const float f = 0.70710678118654752440084436210485f;
    std::cout << "float: " << std::setprecision(32) << f << std::endl;
    const double d = 0.70710678118654752440084436210485; // no f suffix
    std::cout << "double: " << std::setprecision(32) << d << std::endl;
    const double df = 0.70710678118654752440084436210485f;
    std::cout << "doublef: " << std::setprecision(32) << df << std::endl;
    const long double ld = 0.70710678118654752440084436210485;
    std::cout << "l double: " << std::setprecision(32) << ld << std::endl;
    const long double ldl = 0.70710678118654752440084436210485l; // l suffix!
    std::cout << "l doublel: " << std::setprecision(32) << ldl << std::endl;
    const __float128 f128 = 0.70710678118654752440084436210485;
    const __float128 f128f = 0.70710678118654752440084436210485f; // f suffix
    const __float128 f128l = 0.70710678118654752440084436210485l; // l suffix
    const __float128 f128q = 0.70710678118654752440084436210485q; // q suffix
    std::cout << "f128: " << f128 << std::endl;
    std::cout << "f f128: " << f128f << std::endl;
    std::cout << "l f128: " << f128l << std::endl;
    std::cout << "q f128: " << f128q << std::endl;
}
The output is:
* ** *** ****
v v v v
Number: 0.70710678118654752440084436210485
float: 0.707106769084930419921875
double: 0.70710678118654757273731092936941
doublef: 0.707106769084930419921875
l double: 0.70710678118654757273731092936941
l doublel: 0.70710678118654752438189403651592
f128: 0.70710678118654757273731092936941
f f128: 0.70710676908493041992187500000000
l f128: 0.70710678118654752438189403651592
q f128: 0.70710678118654752440084436210485
where * is the last accurate digit of float, ** the last accurate digit of
double, *** the last accurate digit of long double, and **** is the
last accurate digit of __float128.
As said by another answer, the C++ standard does not say what the precision of the various floating point types is (just as it does not say what the size of the integral types is); it only specifies minimum precision/size for those types. But the IEEE 754 standard does specify all that! The FPUs of a lot of architectures implement IEEE 754, and recent versions of gcc implement the binary128 type of that standard through the extension __float128.
As for the explanation of your code, or mine, an expression like 0.70710678118654752440084436210485f is a floating-point literal. It has a type, that is defined by its suffix, here f for float. And thus the value of the literal correspond to the nearest value of the given type from the given number. That explains why, for example, the precision of "doublef" is the same as for "float", in your code. In recent gcc versions, there is an extension, that allows to define floating-point literals of type __float128, with the Q suffix (Quadruple-precision).