This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
long double vs double
I am new to programming and I am unable to understand the difference between between long double and double in C and C++. I tried to Google it but was unable to understand it and got confused. Can anyone please help.?
To quote the C++ standard, §3.9.1 ¶8:
There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined. Integral and floating types are collectively called arithmetic types. Specializations of the standard template std::numeric_limits (18.3) shall specify the maximum and minimum values of each arithmetic type for an implementation.
That is to say that double takes at least as much memory for its representation as float and long double at least as much as double. That extra memory is used for more precise representation of a number.
On x86 systems, float is typically 4 bytes long and can store numbers as large as about 3×10³⁸ and about as small as 1.4×10⁻⁴⁵. It is an IEEE 754 single-precision number that stores about 7 decimal digits of a fractional number.
Also on x86 systems, double is 8 bytes long and can store numbers in the IEEE 754 double-precision format, which has a much larger range and stores numbers with more precision, about 15 decimal digits. On some other platforms, double may not be 8 bytes long and may indeed be the same as a single-precision float.
The standard only requires that long double is at least as precise as double, so some compilers will simply treat long double as if it is the same as double. But, on most x86 chips, the 10-byte extended precision format 80-bit number is available through the CPU's floating-point unit, which provides even more precision than 64-bit double, with about 21 decimal digits of precision.
Some compilers instead support a 16-byte (128-bit) IEEE 754 quadruple precision number format with yet more precise representations and a larger range.
It depends on your compiler but the following code can show you the number of bytes that each type requires:
int main() {
printf("%d\n", sizeof(double)); // some compilers print 8
printf("%d\n", sizeof(long double)); // some compilers print 16
return 0;
}
A long <type> data type may hold larger values then a <type> data type, depending on the compiler.
Related
Particularly I'm interested if int32_t is always losslessly converted to double.
Does the following code always return true?
int is_lossless(int32_t i)
{
double d = i;
int32_t i2 = d;
return (i2 == i);
}
What is for int64_t?
When is integer to floating point conversion lossless?
When the floating point type has enough precision and range to encode all possible values of the integer type.
Does the following int32_t code always return true? --> Yes.
Does the following int64_t code always return true? --> No.
As DBL_MAX is at least 1E+37, the range is sufficient for at least int122_t, let us look to precision.
With common double, with its base 2, sign bit, 53 bit significand, and exponent, all values of int54_t with its 53 value bits can be represented exactly. INT54_MIN is also representable. With this double, it has DBL_MANT_DIG == 53 and in this case that is the number of base-2 digits in the floating-point significand.
The smallest magnitude non-representable value would be INT54_MAX + 2. Type int55_t and wider have values not exactly representable as a double.
With uintN_t types, there is 1 more value bit. The typical double can then encode all uint53_t and narrower.
With other possible double encodings, as C specifies DBL_DIG >= 10, all values of int34_t can round trip.
Code is always true with int32_t, regardless of double encoding.
What is for int64_t?
UB potential with int64_t.
The conversion in int64_t i ... double d = i;, when inexact, makes for a implementation defined result of the 2 nearest candidates. This is often a round to nearest. Then i values near INT64_MAX can convert to a double one more than INT64_MAX.
With int64_t i2 = d;, the conversion of the double value one more than INT64_MAX to int64_t is undefined behavior (UB).
A simple prior test to detect this:
#define INT64_MAX_P1 ((INT64_MAX/2 + 1) * 2.0)
if (d == INT64_MAX_P1) return false; // not lossless
Question: Does the following code always return true?
Always is a big statement and therefore the answer is no.
The C++ Standard makes no mention whether or not the floating-point types which are known to C++ (float, double and long double) are of the IEEE-754 type. The standard explicitly states:
There are three floating-point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined. [Note: This document imposes no requirements on the accuracy of floating-point operations; see also [support.limits]. — end note] Integral and floating-point types are collectively called arithmetic types. Specialisations of the standard library template std::numeric_limits shall specify the maximum and minimum values of each arithmetic type for an implementation.
source: C++ standard: basic fundamentals
Most commonly, the type double represents the IEEE 754 double-precision binary floating-point format binary64, and can be depicted as:
and decoded as:
However, there is a plethora of other floating-point formats out there that are decoded differently and not necessarly have the same properties as the well known IEEE-754. Nonetheless, they are all-by-all similar:
They are n bits long
One bit represents the sign
m bits represent the significant with or without a hidden first bit
e bits represent some form of an exponent of a given base (2 or 10)
To know Whether or not a double can represent all 32-bit signed integer or not, you must answer the following question (assuming our floating-point number is in base 2):
Does my floating-point representation have a hidden first bit in the significant? If so, assume m=m+1
A 32bit signed integer is represented by 1 sign bit and 31 bits representing the number. Is the significant large enough to hold those 31 bits?
Is the exponent large enough that it can represent a number of the form 1.xxxxx 2^31?
If you can answer yes to the last two questions, then yes a int32 can always be represented by the double that is implemented on this particular system.
Note: I ignored decimal32 and decimal64 numbers, as I have no direct knowledge about them.
Note : my answer supposes the double follow IEEE 754, and both int32_t and int64_tare 2's complement.
Does the following code always return true?
the mantissa/significand of a double is longer than 32b so int32_t => double is always done without error because there is no possible precision error (and there is no possible overflow/underflow, the exponent cover more than the needed range of values)
What is for int64_t?
but 53 bits of mantissa/significand (including 1 implicit) of a double is not enough to save 64b of a int64_t => int64_t having upper and lower bits enough distant cannot be store in a double without precision error (there is still no possible overflow/underflow, the exponent still cover more than the needed range of values)
If your platform uses IEEE754 for the double, then yes, any int32_t can be represented perfectly in a double. This is not the case for all possible values that an int64_t can have.
(It is possible on some platforms to tweak the mantissa / exponent sizes of floating point types to make the transformation lossy, but such a type would not be an IEEE754 double.)
To test for IEEE754, use
static_assert(std::numeric_limits<double>::is_iec559, "IEEE 754 floating point");
This question already has answers here:
Representing integers in doubles
(4 answers)
Closed 5 years ago.
My question is whether all integer values are guaranteed to have a perfect double representation.
Consider the following code sample that prints "Same":
// Example program
#include <iostream>
#include <string>
int main()
{
int a = 3;
int b = 4;
double d_a(a);
double d_b(b);
double int_sum = a + b;
double d_sum = d_a + d_b;
if (double(int_sum) == d_sum)
{
std::cout << "Same" << std::endl;
}
}
Is this guaranteed to be true for any architecture, any compiler, any values of a and b? Will any integer i converted to double, always be represented as i.0000000000000 and not, for example, as i.000000000001?
I tried it for some other numbers and it always was true, but was unable to find anything about whether this is coincidence or by design.
Note: This is different from this question (aside from the language) since I am adding the two integers.
Disclaimer (as suggested by Toby Speight): Although IEEE 754 representations are quite common, an implementation is permitted to use any other representation that satisfies the requirements of the language.
The doubles are represented in the form mantissa * 2^exponent, i.e. some of the bits are used for the non-integer part of the double number.
bits range precision
float 32 1.5E-45 .. 3.4E38 7- 8 digits
double 64 5.0E-324 .. 1.7E308 15-16 digits
long double 80 1.9E-4951 .. 1.1E4932 19-20 digits
The part in the fraction can also used to represent an integer by using an exponent which removes all the digits after the dot.
E.g. 2,9979 · 10^4 = 29979.
Since a common int is usually 32 bit you can represent all ints as double, but for 64 bit integers of course this is no longer true. To be more precise (as LThode noted in a comment): IEEE 754 double-precision can guarantee this for up to 53 bits (52 bits of significand + the implicit leading 1 bit).
Answer: yes for 32 bit ints, no for 64 bit ints.
(This is correct for server/desktop general-purpose CPU environments, but other architectures may behave differently.)
Practical Answer as Malcom McLean puts it: 64 bit doubles are an adequate integer type for almost all integers that are likely to count things in real life.
For the empirically inclined, try this:
#include <iostream>
#include <limits>
using namespace std;
int main() {
double test;
volatile int test_int;
for(int i=0; i< std::numeric_limits<int>::max(); i++) {
test = i;
test_int = test;
// compare int with int:
if (test_int != i)
std::cout<<"found integer i="<<i<<", test="<<test<<std::endl;
}
return 0;
}
Success time: 0.85 memory: 15240 signal:0
Subquestion:
Regarding the question for fractional differences. Is it possible to have a integer which converts to a double which is just off the correct value by a fraction, but which converts back to the same integer due to rounding?
The answer is no, because any integer which converts back and forth to the same value, actually represents the same integer value in double. For me the simplemost explanation (suggested by ilkkachu) for this is that using the exponent 2^exponent the step width must always be a power of two. Therefore, beyond the largest 52(+1 sign) bit integer, there are never two double values with a distance smaller than 2, which solves the rounding issue.
No. Suppose you have a 64-bit integer type and a 64-bit floating-point type (which is typical for a double). There are 2^64 possible values for that integer type and there are 2^64 possible values for that floating-point type. But some of those floating-point values (in fact, most of them) do not represent integer values, so the floating-point type can represent fewer integer values than the integer type can.
The answer is no. This only works if ints are 32 bit, which, while true on most platforms, isn't guaranteed by the standard.
The two integers can share the same double representation.
For example, this
#include <iostream>
int main() {
int64_t n = 2397083434877565865;
if (static_cast<double>(n) == static_cast<double>(n - 1)) {
std::cout << "n and (n-1) share the same double representation\n";
}
}
will print
n and (n-1) share the same double representation
I.e. both 2397083434877565865 and 2397083434877565864 will convert to the same double.
Note that I used int64_t here to guarantee 64-bit integers, which - depending on your platform - might also be what int is.
You have 2 different questions:
Are all integer values perfectly represented as doubles?
That was already answered by other people (TL;DR: it depends on the precision of int and double).
Consider the following code sample that prints "Same": [...] Is this guaranteed to be true for any architecture, any compiler, any values of a and b?
Your code adds two ints and then converts the result to double. The sum of ints will overflow for certain values, but the sum of the two separately-converted doubles will not (typically). For those values the results will differ.
The short answer is "possibly". The portable answer is "not everywhere".
It really depends on your platform, and in particular, on
the size and representation of double
the range of int
For platforms using IEEE-754 doubles, it may be true if int is 53-bit or smaller. For platforms where int is larger than double, it's obviously false.
You may want be able to investigate the properties on your runtime host, using std::numeric_limits and std::nextafter.
This question already has answers here:
Representing integers in doubles
(4 answers)
Closed 5 years ago.
My question is whether all integer values are guaranteed to have a perfect double representation.
Consider the following code sample that prints "Same":
// Example program
#include <iostream>
#include <string>
int main()
{
int a = 3;
int b = 4;
double d_a(a);
double d_b(b);
double int_sum = a + b;
double d_sum = d_a + d_b;
if (double(int_sum) == d_sum)
{
std::cout << "Same" << std::endl;
}
}
Is this guaranteed to be true for any architecture, any compiler, any values of a and b? Will any integer i converted to double, always be represented as i.0000000000000 and not, for example, as i.000000000001?
I tried it for some other numbers and it always was true, but was unable to find anything about whether this is coincidence or by design.
Note: This is different from this question (aside from the language) since I am adding the two integers.
Disclaimer (as suggested by Toby Speight): Although IEEE 754 representations are quite common, an implementation is permitted to use any other representation that satisfies the requirements of the language.
The doubles are represented in the form mantissa * 2^exponent, i.e. some of the bits are used for the non-integer part of the double number.
bits range precision
float 32 1.5E-45 .. 3.4E38 7- 8 digits
double 64 5.0E-324 .. 1.7E308 15-16 digits
long double 80 1.9E-4951 .. 1.1E4932 19-20 digits
The part in the fraction can also used to represent an integer by using an exponent which removes all the digits after the dot.
E.g. 2,9979 · 10^4 = 29979.
Since a common int is usually 32 bit you can represent all ints as double, but for 64 bit integers of course this is no longer true. To be more precise (as LThode noted in a comment): IEEE 754 double-precision can guarantee this for up to 53 bits (52 bits of significand + the implicit leading 1 bit).
Answer: yes for 32 bit ints, no for 64 bit ints.
(This is correct for server/desktop general-purpose CPU environments, but other architectures may behave differently.)
Practical Answer as Malcom McLean puts it: 64 bit doubles are an adequate integer type for almost all integers that are likely to count things in real life.
For the empirically inclined, try this:
#include <iostream>
#include <limits>
using namespace std;
int main() {
double test;
volatile int test_int;
for(int i=0; i< std::numeric_limits<int>::max(); i++) {
test = i;
test_int = test;
// compare int with int:
if (test_int != i)
std::cout<<"found integer i="<<i<<", test="<<test<<std::endl;
}
return 0;
}
Success time: 0.85 memory: 15240 signal:0
Subquestion:
Regarding the question for fractional differences. Is it possible to have a integer which converts to a double which is just off the correct value by a fraction, but which converts back to the same integer due to rounding?
The answer is no, because any integer which converts back and forth to the same value, actually represents the same integer value in double. For me the simplemost explanation (suggested by ilkkachu) for this is that using the exponent 2^exponent the step width must always be a power of two. Therefore, beyond the largest 52(+1 sign) bit integer, there are never two double values with a distance smaller than 2, which solves the rounding issue.
No. Suppose you have a 64-bit integer type and a 64-bit floating-point type (which is typical for a double). There are 2^64 possible values for that integer type and there are 2^64 possible values for that floating-point type. But some of those floating-point values (in fact, most of them) do not represent integer values, so the floating-point type can represent fewer integer values than the integer type can.
The answer is no. This only works if ints are 32 bit, which, while true on most platforms, isn't guaranteed by the standard.
The two integers can share the same double representation.
For example, this
#include <iostream>
int main() {
int64_t n = 2397083434877565865;
if (static_cast<double>(n) == static_cast<double>(n - 1)) {
std::cout << "n and (n-1) share the same double representation\n";
}
}
will print
n and (n-1) share the same double representation
I.e. both 2397083434877565865 and 2397083434877565864 will convert to the same double.
Note that I used int64_t here to guarantee 64-bit integers, which - depending on your platform - might also be what int is.
You have 2 different questions:
Are all integer values perfectly represented as doubles?
That was already answered by other people (TL;DR: it depends on the precision of int and double).
Consider the following code sample that prints "Same": [...] Is this guaranteed to be true for any architecture, any compiler, any values of a and b?
Your code adds two ints and then converts the result to double. The sum of ints will overflow for certain values, but the sum of the two separately-converted doubles will not (typically). For those values the results will differ.
The short answer is "possibly". The portable answer is "not everywhere".
It really depends on your platform, and in particular, on
the size and representation of double
the range of int
For platforms using IEEE-754 doubles, it may be true if int is 53-bit or smaller. For platforms where int is larger than double, it's obviously false.
You may want be able to investigate the properties on your runtime host, using std::numeric_limits and std::nextafter.
If I have an int, convert it to a double, then convert the double back to an int, am I guaranteed to get the same value back that I started with? In other words, given this function:
int passThroughDouble(int input)
{
double d = input;
return d;
}
Am I guaranteed that passThroughDouble(x) == x for all ints x?
No it isn't. The standard says nothing about the relative sizes of int and double.
If int is a 64-bit integer and double is the standard IEEE double-precision, then it will already fail for numbers bigger than 2^53.
That said, int is still 32-bit on the majority of environments today. So it will still hold in many cases.
If we restrict consideration to the "traditional" IEEE-754-style representation of floating-point types, then you can expect this conversion to be value-preserving if and only if the mantissa of the type double has as many bits as there are non-sign bits in type int.
Mantissa of a classic IEEE-754 double type is 53-bit wide (including the "implied" leading bit), which means that you can represent integers in [-2^53, +2^53] range precisely. Everything out of this range will generally lose precision.
So, it all depends on how wide your int is compared to your double. The answer depends on the specific platform. With 32-bit int and IEEE-754 double the equality should hold.
Is it safe to cast a UINT64 to a float? I realize that UINT64 does not hold decimals, so my float will be whole numbers. However, my function to return my delta-time returns a UINT64, which isn't a very useful type for the function I'm currently working with. I'm assuming a simple static_cast<float>(uint64value) will not work?
Large values of UINT64, (an 8 byte value) may be truncated if you cast them to a float, which is only 4 bytes.
Define safe - you can easily lose a lot of digits of precision if the 64-bit value is large, but apart from that (which is presumably a known issue that you don't mind about), the conversion should be safe. If your compiler doesn't handle it correctly, get a better compiler.
You might try performing your arithmetic in a long double or double first:
typedef long double real_type
real_type x = static_cast<real_type>(long1);
real_type y = static_cast<real_type>(long2);
real_type z = x / y;
float result = static_cast<float>(real_type);
Rule of thumb: int can be cast to and back from double
It is safe to cast to and back from float but you will be limited to rather small numbers, about 16 million, and if you exceed the allowed magnitude you will silently lose lower-order precision. With double, you can use much larger integers.
Assuming an IEEE 754 underlying floating point system, you will be able to accurately cast integers of 23 bits to and from float and 52 bits to and from double. Actually, you get one more bit because of the hidden bit, so you can fit an integer up to and including 1FFFFFFFFFFFFF or 9007199254740991 in a double.
So every single 32-bit integer has an exact representation in double; it can be cast to and back safely, and the ordinary arithmetic operations on them will produce exact results.
Indeed, this is what JavaScript does for every integer numeric operation. People who say "floating point is inaccurate" are drastically oversimplifying the matter.
Safe? What do you mean by safe? As far as the precision is concerned, IEEE-754 float has a 23-(+1)bit mantissa. By forcefully converting a 64-bit value into a "rounded" 24 bit value, you'll inflict a massive loss of precision in the wide range of least-significant bits. Is this loss acceptable in your application? Frankly, if your original value really makes use of the 64-bit range, forcing it into something as small as float doesn't sound as a good idea to me.
why wouldn't static_cast work?
Max uint64 is 2^64 = 1.84467441 × 10^19
According to this max 32-bit float is
9.999999×10^96.
Should work... having problems?
http://en.wikipedia.org/wiki/Decimal32_floating-point_format