Working on Mac OS X 10.6.2, Intel, with i686-apple-darwin10-g++-4.2.1, and compiling with the -arch x86_64 flag, I just noticed that while...
std::numeric_limits<long double>::max_exponent10 = 4932
...as is expected, when a long double is actually set to a value with an exponent greater than 308, it becomes inf; i.e., in reality it only has 64-bit precision instead of 80-bit.
Also, sizeof() is showing long doubles to be 16 bytes, as they should be.
Finally, using <limits.h> gives the same results as <limits>.
Does anyone know where the discrepancy might be?
#include <iostream>
#include <limits>
int main() {
    long double x = 1e308, y = 1e309;
    std::cout << std::numeric_limits<long double>::max_exponent10 << std::endl;
    std::cout << x << '\t' << y << std::endl;
    std::cout << sizeof(x) << std::endl;
}
gives
4932
1e+308 inf
16
It's because 1e309 is a literal that gives a double. You need to use a long-double literal 1e309L.
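A minimal sketch of the fix (assuming an 80-bit long double, as on this Mac target; with compilers where long double is just a 64-bit double, e.g. MSVC, you would still get inf):
#include <iostream>

int main() {
    long double y = 1e309L;       // L suffix: a long double literal, within its range
    std::cout << y << std::endl;  // prints 1e+309 instead of inf on this target
}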
Related
Smallest double is 2.22507e-308. Is there any way I can use smaller numbers?
I found a library called gmp, but I have no idea how to use it, the documentation is not clear at all, and I'm not sure if it works on Windows.
I don't expect you to give me instructions, but maybe at least some piece of advice.
If you need really big precision, then give gmp a chance; a minimal sketch follows. I am sure it works on Windows too.
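Since the gmp documentation can be hard to get into, here is a hedged sketch of its C floating-point interface (mpf_t), usable from C++ as well; treat it as a starting point, not a reference. Link with -lgmp; the <gmpxx.h> mpf_class interface is friendlier if you prefer C++.
#include <gmp.h>
#include <cstdio>

int main() {
    mpf_t x;
    mpf_init2(x, 256);             // 256 bits of mantissa precision
    mpf_set_str(x, "1e-400", 10);  // far below double's ~2.22507e-308 minimum
    gmp_printf("x = %.10Fe\n", x); // %Fe prints an mpf_t in scientific notation
    mpf_clear(x);
    return 0;
}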
If you just need bigger precision than double, try long double. It may or may not give you more, depends on your compiler and target platform.
In my case it does give more (gcc 6, x86_64 Linux):
Test program:
#include <iostream>
#include <limits>

int main() {
    std::cout << "float:"
              << " bytes=" << sizeof(float)
              << " min=" << std::numeric_limits<float>::min()
              << std::endl;
    std::cout << "double:"
              << " bytes=" << sizeof(double)
              << " min=" << std::numeric_limits<double>::min()
              << std::endl;
    std::cout << "long double:"
              << " bytes=" << sizeof(long double)
              << " min=" << std::numeric_limits<long double>::min()
              << std::endl;
}
Output:
float: bytes=4 min=1.17549e-38
double: bytes=8 min=2.22507e-308
long double: bytes=16 min=3.3621e-4932
If your compiler/architecture allows it, you could use something like long double, which compiles to an 80-bit float (though I think it aligns to 128 bits, so there's a bit of wasted space) and has more range and precision than a typical double value. Not all compilers will do that though, and on many compilers long double is equivalent to a double, at 64 bits.
"gmp" is one library you could use for extended-precision floats. I generally recommend Boost.Multiprecision, which includes gmp support, though personally I'd use cpp_bin_float or cpp_dec_float for my multiprecision needs (the former is IEEE 754 compliant, the latter isn't).
As for how to use them: I haven't used gmp, so I can't comment on its syntax, but cpp_bin_float is pretty easy to use:
#include <iostream>
#include <boost/multiprecision/cpp_bin_float.hpp>

typedef boost::multiprecision::cpp_bin_float_quad quad;

int main() {
    quad a = 34;
    quad b = 17.95467;
    b += a;
    for (int i = 0; i < 10; i++) {
        b *= b;
    }
    std::cout << "This might be rather big: " << b << std::endl;
}
If you change your compiler to gcc or Intel's, the type long double will be supported with bigger precision (80-bit). With the default Visual Studio compiler, I have no advice for you.
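A quick sketch to check what your own toolchain actually gives you (the numbers in the comments are typical values, not guarantees):
#include <iostream>
#include <limits>

int main() {
    // On gcc/x86_64, long double is typically the 80-bit extended type (18 digits);
    // with the Visual Studio compiler it is the same as double (15 digits).
    std::cout << "long double digits10: " << std::numeric_limits<long double>::digits10 << '\n';
    std::cout << "sizeof(long double):  " << sizeof(long double) << '\n';
}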
Problematic output of fmod (long double, long double)
It seems that the output of fmod(long double, long double) in this test is problematic.
Any suggestions?
g++ --version
g++ (GCC) 4.9.2
uname -srvmpio
CYGWIN_NT-6.1 1.7.34(0.285/5/3) 2015-02-04 12:12 i686 unknown unknown Cygwin
g++ test1.cpp
// No errors, no warnings
./a.exe
l1 = 4294967296
l2 = 72057594037927934
l3 = 4294967294
d1 = 4294967296
d2 = 72057594037927934
d3 = 0 // Expected 4294967294
// -------- Program test1.cpp --------
#include <iostream>
#include <iomanip>
#include <cmath>

int main (int argc, char** argv)
{
    long long l1 = 4294967296;
    long long l2 = 72057594037927934;
    long long l3 = l2 % l1;

    long double d1 = static_cast<long double>(l1);
    long double d2 = static_cast<long double>(l2);
    long double d3 = fmod (d2, d1);

    std::cout << "l1 = " << l1 << std::endl;
    std::cout << "l2 = " << l2 << std::endl;
    std::cout << "l3 = " << l3 << std::endl;
    std::cout << std::endl;
    std::cout << "d1 = " << std::setprecision(18) << d1 << std::endl;
    std::cout << "d2 = " << std::setprecision(18) << d2 << std::endl;
    std::cout << "d3 = " << std::setprecision(18) << d3 << std::endl;
    return 0;
}
// -----------------------
Floating point types, including long double, cannot represent all integral values in their range. In practice, values of larger magnitude are less likely to be representable exactly.
A consequence is that converting large integral values to long double (or any floating point type) does not necessarily preserve the value - the converted value is only the closest approximation possible given the limits of the floating point type.
From there, if your two conversions have produced different values, it would be sheer luck if the result of fmod() were exactly the value you seek.
The fmod() you are calling is also not an overload that accepts long double arguments - it takes doubles - which means your long double values will be converted to double. The set of values a double can represent is a subset of the set a long double can represent. Among other things, that means a smaller range of integral values can be represented exactly.
The usual suggestion is not to do such things. If you can do the required operations using integer operations (like %), then do so. And if you use floating point, you need to allow for and manage the loss of precision that comes with it.
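That said, if you do want the long double computation, here is a sketch of forcing the long double version explicitly (assuming an 80-bit long double, whose 64-bit mantissa happens to hold these particular values exactly):
#include <cmath>
#include <iostream>

int main() {
    long double d1 = 4294967296.0L;        // 2^32: exactly representable
    long double d2 = 72057594037927934.0L; // needs 55 mantissa bits: fits an 80-bit
                                           // long double, but not a 64-bit double
    long double d3 = fmodl(d2, d1);        // C's long double version; std::fmod
                                           // also has a long double overload in C++
    std::cout << d3 << std::endl;          // expected: 4294967294
}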
I have recently become interested in learning about programming in C++, because I want to get a deeper understanding of the way computers work and handle instructions. I thought I would try out the data types, but I don't really understand what's happening with my output...
#include <iostream>
#include <iomanip>
using namespace std;

int main() {
    float fValue = 123.456789;
    cout << setprecision(20) << fixed << fValue << endl;
    cout << "Size of float: " << sizeof(float) << endl;

    double dValue = 123.456789;
    cout << setprecision(20) << fixed << dValue << endl;
    cout << "Size of double: " << sizeof(double) << endl;

    long double lValue = 123.456789;
    cout << setprecision(20) << fixed << lValue << endl;
    cout << "Size of long double: " << sizeof(long double) << endl;
    return 0;
}
The output I expected would be something like:
123.45678710937500000000
Size of float: 4
123.45678900000000000000
Size of double: 8
123.45678900000000000000
Size of long double: 16
This is my actual output:
123.45678710937500000000
Size of float: 4
123.45678900000000000000
Size of double: 8
-6518427077408613100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.00000000000000000000
Size of long double: 12
Any ideas on what happened would be much appreciated, thanks!
Edit:
System:
Windows 10 Pro Technical Preview
64-bit Operating System, x64-based processor
Eclipse CDT 8.5
From the patch that fixed this in earlier versions:
MinGW uses the Microsoft runtime DLL msvcrt.dll. Here lies a problem: while gcc creates 80 bits long doubles, the MS runtime accepts 64 bit long doubles only.
This bug happens to me when I use 4.8.1 revision 4 from MinGW-get (the most recent version it offers), but not when I use 4.8.1 revision 5.
So you are not using long double wrong (although you would get better accuracy with long double lValue = 123.456789L, to make sure 123.456789 isn't taken as a double and then converted to a long double).
The easiest way to fix this would be to simply change the version of MinGW you are using to 4.9 or 4.7, depending on what you need (you can get 4.9 here).
If you are willing to instead use printf, you could change to printf("%Lf", ...), and either:
add the flag -posix when you compile with g++
add #define __USE_MINGW_ANSI_STDIO 1 before #include <cstdio> (found this in the original patch); a minimal sketch of this option follows
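For instance (a minimal sketch; the define must come before any standard header, and on an affected MinGW this should print the expected digits):
#define __USE_MINGW_ANSI_STDIO 1  // use MinGW's own long-double-aware stdio
#include <cstdio>

int main() {
    long double lValue = 123.456789L;  // L suffix keeps the literal a long double
    printf("%.20Lf\n", lValue);        // %Lf is the long double conversion specifier
    return 0;
}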
Finally, you can even just cast to a double whenever you try to print out the long double (there is some loss of accuracy, but it shouldn't matter when just printing out numbers).
To find more details, you can also look at my blog post on this issue.
Update: If you want to continue to use MinGW 4.8, you can also just download a different distribution of MinGW, which didn't have that problem for me.
From a .c file of another guy, I saw this:
const float c = 0.70710678118654752440084436210485f;
where he wants to avoid the computation of sqrt(1/2).
Can this really be stored somehow in plain C/C++? I mean, without losing precision. It seems impossible to me.
I am using C++, but I do not believe that the precision difference between these two languages is too big (if any); that's why I did not test it.
So, I wrote these few lines, to have a look at the behaviour of the code:
std::cout << "Number: 0.70710678118654752440084436210485\n";
const float f = 0.70710678118654752440084436210485f;
std::cout << "float: " << std::setprecision(32) << f << std::endl;
const double d = 0.70710678118654752440084436210485; // no f suffix
std::cout << "double: " << std::setprecision(32) << d << std::endl;
const double df = 0.70710678118654752440084436210485f;
std::cout << "doublef: " << std::setprecision(32) << df << std::endl;
const long double ld = 0.70710678118654752440084436210485;
std::cout << "l double: " << std::setprecision(32) << ld << std::endl;
const long double ldl = 0.70710678118654752440084436210485l; // l suffix!
std::cout << "l doublel: " << std::setprecision(32) << ldl << std::endl;
The output is this:
Number: 0.70710678118654752440084436210485 // 32 decimal digits
float: 0.707106769084930419921875 // 24 decimal digits
double: 0.70710678118654757273731092936941
doublef: 0.707106769084930419921875 // same as float
l double: 0.70710678118654757273731092936941 // same as double
l doublel: 0.70710678118654752438189403651592 // suffix l
Here the float value is accurate up to the 7th decimal digit, the double up to the 16th, and the long double (with the l suffix) up to the 18th.
The output of double has 32 decimal digits, since I have set the precision of std::cout at that value.
The float output has 24, as expected, as said here:
float has 24 binary bits of precision, and double has 53.
I would expect the last output to be the same as the second-to-last, i.e. that the f suffix would not prevent the number from becoming a double. I think that when I write this:
const double df = 0.70710678118654752440084436210485f;
what happens is that the number first becomes a float and is then stored as a double, so after the 24th decimal digit it has zeroes, and that's why the double precision stops there.
Am I correct?
From this answer I found some relevant information:
float x = 0 has an implicit typecast from int to float.
float x = 0.0f does not have such a typecast.
float x = 0.0 has an implicit typecast from double to float.
[EDIT]
About __float128: it is not standard, thus it's out of consideration. See more here.
From the standard:
There are three floating point types: float, double, and long double.
The type double provides at least as much precision as float, and the
type long double provides at least as much precision as double. The
set of values of the type float is a subset of the set of values of
the type double; the set of values of the type double is a subset of
the set of values of the type long double. The value representation of
floating-point types is implementation-defined.
So you can see the issue with your question: the standard doesn't actually say how precise floats are.
In terms of standard implementations, you need to look at IEEE 754, which means the other two answers, from Irineau and Davidmh, are perfectly valid approaches to the problem.
As to suffix letters to indicate type, again looking at the standard:
The type of a floating literal is double unless explicitly specified by
a suffix. The suffixes f and F specify float, the suffixes l and L specify
long double.
So your attempt to create a long double will just have the same precision as the double literal you are assigning to it unless you use the L suffix.
I understand that some of these answers may not seem satisfactory, but there is a lot of background reading to be done on the relevant standards before you can dismiss them. This answer is already longer than intended, so I won't try to explain everything here.
And as a final note: since the precision is not clearly defined, why not have a constant that's longer than it needs to be? It seems to make sense to always define a constant with more digits than any type can represent, so that each type gets the most accurate value it can hold.
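For example (a sketch; the constant name is made up), a single over-long literal serves every width, since each type simply keeps the closest value it can represent:
#include <iostream>
#include <iomanip>

// More digits than any of the three standard types can hold; the L suffix
// makes it a long double literal, and narrower types get their best approximation.
const long double kInvSqrt2 = 0.70710678118654752440084436210485L;

int main() {
    std::cout << std::setprecision(32)
              << static_cast<float>(kInvSqrt2)  << '\n'
              << static_cast<double>(kInvSqrt2) << '\n'
              << kInvSqrt2 << '\n';
}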
Python's numerical library, numpy, has a very convenient float info function. All the types are equivalent to C's:
For C's float:
print numpy.finfo(numpy.float32)
Machine parameters for float32
---------------------------------------------------------------------
precision= 6 resolution= 1.0000000e-06
machep= -23 eps= 1.1920929e-07
negep = -24 epsneg= 5.9604645e-08
minexp= -126 tiny= 1.1754944e-38
maxexp= 128 max= 3.4028235e+38
nexp = 8 min= -max
---------------------------------------------------------------------
For C's double:
print numpy.finfo(numpy.float64)
Machine parameters for float64
---------------------------------------------------------------------
precision= 15 resolution= 1.0000000000000001e-15
machep= -52 eps= 2.2204460492503131e-16
negep = -53 epsneg= 1.1102230246251565e-16
minexp= -1022 tiny= 2.2250738585072014e-308
maxexp= 1024 max= 1.7976931348623157e+308
nexp = 11 min= -max
---------------------------------------------------------------------
And for C's long double:
print numpy.finfo(numpy.float128)
Machine parameters for float128
---------------------------------------------------------------------
precision= 18 resolution= 1e-18
machep= -63 eps= 1.08420217249e-19
negep = -64 epsneg= 5.42101086243e-20
minexp=-16382 tiny= 3.36210314311e-4932
maxexp= 16384 max= 1.18973149536e+4932
nexp = 15 min= -max
---------------------------------------------------------------------
So, not even long double (128 bits) will give you the 32 digits you want. But do you really need them all?
Some compilers have an implementation of the binary128 floating point format, standardized by IEEE 754-2008. Using gcc, for example, the type is __float128. That floating point format has about 34 decimal digits of precision (log(2^113)/log(10)).
You can use the Boost Multiprecision library and its wrapper float128. That implementation will either use the native type, if available, or a drop-in replacement.
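A minimal sketch of that Boost route (assuming gcc with libquadmath available; compile with -lquadmath):
#include <iostream>
#include <iomanip>
#include <boost/multiprecision/float128.hpp>

int main() {
    using boost::multiprecision::float128;
    // Constructing from a string avoids going through a narrower literal type.
    float128 x("0.70710678118654752440084436210485");
    std::cout << std::setprecision(32) << x << std::endl;
}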
Let's extend your experiment with that new non-standard type __float128, with a recent g++ (4.8):
// Compiled with g++ -Wall -lquadmath essai.cpp
#include <iostream>
#include <iomanip>
#include <algorithm>
#include <quadmath.h>
#include <sstream>

std::ostream& operator<<(std::ostream& out, __float128 f) {
    char buf[200];
    std::ostringstream format;
    format << "%." << std::min<std::streamsize>(190, out.precision()) << "Qf";
    quadmath_snprintf(buf, 200, format.str().c_str(), f);
    out << buf;
    return out;
}

int main() {
    std::cout.precision(32);
    std::cout << "Number: 0.70710678118654752440084436210485\n";
    const float f = 0.70710678118654752440084436210485f;
    std::cout << "float: " << std::setprecision(32) << f << std::endl;
    const double d = 0.70710678118654752440084436210485; // no f suffix
    std::cout << "double: " << std::setprecision(32) << d << std::endl;
    const double df = 0.70710678118654752440084436210485f;
    std::cout << "doublef: " << std::setprecision(32) << df << std::endl;
    const long double ld = 0.70710678118654752440084436210485;
    std::cout << "l double: " << std::setprecision(32) << ld << std::endl;
    const long double ldl = 0.70710678118654752440084436210485l; // l suffix!
    std::cout << "l doublel: " << std::setprecision(32) << ldl << std::endl;
    const __float128 f128 = 0.70710678118654752440084436210485;
    const __float128 f128f = 0.70710678118654752440084436210485f; // f suffix
    const __float128 f128l = 0.70710678118654752440084436210485l; // l suffix
    const __float128 f128q = 0.70710678118654752440084436210485q; // q suffix
    std::cout << "f128: " << f128 << std::endl;
    std::cout << "f f128: " << f128f << std::endl;
    std::cout << "l f128: " << f128l << std::endl;
    std::cout << "q f128: " << f128q << std::endl;
}
The output is:
Number: 0.70710678118654752440084436210485
float: 0.707106769084930419921875
double: 0.70710678118654757273731092936941
doublef: 0.707106769084930419921875
l double: 0.70710678118654757273731092936941
l doublel: 0.70710678118654752438189403651592
f128: 0.70710678118654757273731092936941
f f128: 0.70710676908493041992187500000000
l f128: 0.70710678118654752438189403651592
q f128: 0.70710678118654752440084436210485
Here the float value is accurate up to the 7th decimal digit, the double up to the 16th, the long double up to the 18th, and the __float128 initialized with the q suffix reproduces all 32 digits.
As said in another answer, the C++ standard does not say what the precision of the various floating point types is (just as it does not say what the sizes of the integral types are); it only specifies their minimal precision/size. But the IEEE 754 standard does specify all of that! The FPUs of a lot of architectures implement IEEE 754, and recent versions of gcc implement the standard's binary128 type with the extension __float128.
As for the explanation of your code, or mine: an expression like 0.70710678118654752440084436210485f is a floating-point literal. It has a type, defined by its suffix, here f for float. The value of the literal is therefore the value of the given type nearest to the given number. That explains why, for example, the precision of "doublef" is the same as that of "float" in your code. In recent gcc versions, there is an extension that allows defining floating-point literals of type __float128, with the Q suffix (Quadruple precision).
I still have not run it through enough tests; however, for some reason, with certain non-negative inputs, this function will sometimes pass back a negative value. I have done a lot of manual testing with different values in a calculator, but I have yet to see it display this same behavior.
I was wondering if someone would take a look and see if I am missing something.
float calcPop(int popRand1, int popRand2, int popRand3, float pERand, float pSRand)
{
    return ((((((23000 * popRand1) * popRand2) * pERand) * pSRand) * popRand3) / 8);
}
The variables all contain randomly generated values:
popRand1: between 1 and 30
popRand2: between 10 and 30
popRand3: between 50 and 100
pSRand: between 1 and 1000
pERand: between 1.0f and 5500.0f which is then multiplied by 0.001f before being passed to the function above
Edit:
Alright, so after following the execution a bit more closely, it is not the fault of this function directly. It produces a very large positive float which then flips negative when I use this code later on:
pPMax = (int)pPStore;
pPStore is a float that holds popCalc's return.
So the question now is, how do I stop the formula from doing this? Testing even with very high values in Calculator has never displayed this behavior. Is there something in how the compiler processes the order of operations that is causing this, or are my values simply going too high?
In this case it seems that when you convert back to an int after the function returns, it is possible to exceed the maximum value of an int. My suggestion is to use a type that can represent a greater range of values.
#include <iostream>
#include <limits>
#include <boost/multiprecision/cpp_int.hpp>

int main(int argc, char* argv[])
{
    std::cout << "int min: " << std::numeric_limits<int>::min() << std::endl;
    std::cout << "int max: " << std::numeric_limits<int>::max() << std::endl;
    std::cout << "long min: " << std::numeric_limits<long>::min() << std::endl;
    std::cout << "long max: " << std::numeric_limits<long>::max() << std::endl;
    std::cout << "long long min: " << std::numeric_limits<long long>::min() << std::endl;
    std::cout << "long long max: " << std::numeric_limits<long long>::max() << std::endl;

    boost::multiprecision::cpp_int bigint = 113850000000;
    int smallint = 113850000000; // overflows int: the result is implementation-defined (expect a compiler warning)
    std::cout << bigint << std::endl;
    std::cout << smallint << std::endl;

    std::cin.get();
    return 0;
}
As you can see here, there are other types which have a bigger range. If these do not suffice, I believe the latest Boost version has just the thing for you.
Throw an exception:
if (pPStore > static_cast<float>(INT_MAX)) {
    throw std::overflow_error("exceeds integer size");
} else {
    pPMax = static_cast<int>(pPStore);
}
or use float instead of int.
When you multiply the maximum values of each term together you get a value around 1.42312e+12, which is somewhat larger than a 32-bit integer can hold, so let's see what the standard has to say about floating point-to-integer conversions, in 4.9/1:
A prvalue of a floating point type can be converted to a prvalue of an
integer type. The conversion truncates; that is, the fractional part
is discarded. The behavior is undefined if the truncated value cannot
be represented in the destination type.
So we learn that for a large segment of possible result values your function can generate, the conversion back to a 32 bit integer would be undefined, which includes making negative numbers.
You have a few options here. You could use a 64-bit integer type (long or long long, possibly) to hold the value instead of truncating down to int; a sketch follows below.
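A sketch of that first option (the value here is made up, near the function's maximum result):
#include <iostream>

int main() {
    float pPStore = 1.42312e12f;                        // near the maximum result
    long long pPMax = static_cast<long long>(pPStore);  // 64 bits hold up to ~9.2e18
    std::cout << pPMax << std::endl;
    return 0;
}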
Alternately you could scale down the results of your function by a factor of around 1000 or so, to keep the maximal results within the range of values that a 32 bit integer could hold.