Particularly I'm interested if int32_t is always losslessly converted to double.
Does the following code always return true?
int is_lossless(int32_t i)
{
double d = i;
int32_t i2 = d;
return (i2 == i);
}
What is for int64_t?
When is integer to floating point conversion lossless?
When the floating point type has enough precision and range to encode all possible values of the integer type.
Does the following int32_t code always return true? --> Yes.
Does the following int64_t code always return true? --> No.
As DBL_MAX is at least 1E+37, the range is sufficient for at least int122_t, let us look to precision.
With common double, with its base 2, sign bit, 53 bit significand, and exponent, all values of int54_t with its 53 value bits can be represented exactly. INT54_MIN is also representable. With this double, it has DBL_MANT_DIG == 53 and in this case that is the number of base-2 digits in the floating-point significand.
The smallest magnitude non-representable value would be INT54_MAX + 2. Type int55_t and wider have values not exactly representable as a double.
With uintN_t types, there is 1 more value bit. The typical double can then encode all uint53_t and narrower.
With other possible double encodings, as C specifies DBL_DIG >= 10, all values of int34_t can round trip.
Code is always true with int32_t, regardless of double encoding.
What is for int64_t?
UB potential with int64_t.
The conversion in int64_t i ... double d = i;, when inexact, makes for a implementation defined result of the 2 nearest candidates. This is often a round to nearest. Then i values near INT64_MAX can convert to a double one more than INT64_MAX.
With int64_t i2 = d;, the conversion of the double value one more than INT64_MAX to int64_t is undefined behavior (UB).
A simple prior test to detect this:
#define INT64_MAX_P1 ((INT64_MAX/2 + 1) * 2.0)
if (d == INT64_MAX_P1) return false; // not lossless
Question: Does the following code always return true?
Always is a big statement and therefore the answer is no.
The C++ Standard makes no mention whether or not the floating-point types which are known to C++ (float, double and long double) are of the IEEE-754 type. The standard explicitly states:
There are three floating-point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined. [Note: This document imposes no requirements on the accuracy of floating-point operations; see also [support.limits]. — end note] Integral and floating-point types are collectively called arithmetic types. Specialisations of the standard library template std::numeric_limits shall specify the maximum and minimum values of each arithmetic type for an implementation.
source: C++ standard: basic fundamentals
Most commonly, the type double represents the IEEE 754 double-precision binary floating-point format binary64, and can be depicted as:
and decoded as:
However, there is a plethora of other floating-point formats out there that are decoded differently and not necessarly have the same properties as the well known IEEE-754. Nonetheless, they are all-by-all similar:
They are n bits long
One bit represents the sign
m bits represent the significant with or without a hidden first bit
e bits represent some form of an exponent of a given base (2 or 10)
To know Whether or not a double can represent all 32-bit signed integer or not, you must answer the following question (assuming our floating-point number is in base 2):
Does my floating-point representation have a hidden first bit in the significant? If so, assume m=m+1
A 32bit signed integer is represented by 1 sign bit and 31 bits representing the number. Is the significant large enough to hold those 31 bits?
Is the exponent large enough that it can represent a number of the form 1.xxxxx 2^31?
If you can answer yes to the last two questions, then yes a int32 can always be represented by the double that is implemented on this particular system.
Note: I ignored decimal32 and decimal64 numbers, as I have no direct knowledge about them.
Note : my answer supposes the double follow IEEE 754, and both int32_t and int64_tare 2's complement.
Does the following code always return true?
the mantissa/significand of a double is longer than 32b so int32_t => double is always done without error because there is no possible precision error (and there is no possible overflow/underflow, the exponent cover more than the needed range of values)
What is for int64_t?
but 53 bits of mantissa/significand (including 1 implicit) of a double is not enough to save 64b of a int64_t => int64_t having upper and lower bits enough distant cannot be store in a double without precision error (there is still no possible overflow/underflow, the exponent still cover more than the needed range of values)
If your platform uses IEEE754 for the double, then yes, any int32_t can be represented perfectly in a double. This is not the case for all possible values that an int64_t can have.
(It is possible on some platforms to tweak the mantissa / exponent sizes of floating point types to make the transformation lossy, but such a type would not be an IEEE754 double.)
To test for IEEE754, use
static_assert(std::numeric_limits<double>::is_iec559, "IEEE 754 floating point");
Related
When int64_t is cast to double and doesn't have an exact match, to my knowledge I get a sort of best-effort-nearest-value equivalent in double. For example, 9223372036854775000 in int64_t appears to end up as 9223372036854774784.0 in double:
#include <stdio.h>
int main(int argc, const char **argv) {
printf("Corresponding double: %f\n", (double)9223372036854775000LL);
// Outputs: 9223372036854774784.000000
return 0;
}
It appears to me as if an int64_t cast to a double always ends up on as a clean non-fractional number, even in this higher number range where double has really low precision. However, I just observed this from random attempts. Is this guaranteed to happen for any value of int64_t cast to a double?
And if I cast this non-fractional double back to int64_t, will I always get the exact corresponding 64bit int with the .0 chopped off? (Assuming it doesn't overflow during the conversion back.) Like here:
#include <inttypes.h>
#include <stdio.h>
int main(int argc, const char **argv) {
printf("Corresponding double: %f\n", (double)9223372036854775000LL);
// Outputs: 9223372036854774784.000000
printf("Corresponding int to corresponding double: %" PRId64 "\n",
(int64_t)((double)9223372036854775000LL));
// Outputs: 9223372036854774784
return 0;
}
Or can it be imprecise and get me the "wrong" int in some corner cases?
Intuitively and from my tests the answer to both points appears to be "yes", but if somebody with a good formal understanding of the floating point standards and the maths behind it could confirm this that would be really helpful to me. I would also be curious if any known more aggressive optimizations like gcc's -Ofast are known to break any of this.
In general case yes, both should be true. The floating point base needs to be - if not 2, then at least integer and given that, an integer converted to nearest floating point value can never produce non-zero fractions - either the precision suffices or the lowest-order integer digits in the base of the floating type would be zeroed. For example in your case your system uses ISO/IEC/IEEE 60559 binary floating point numbers. When inspected in base 2, it can be seen that the trailing digits of the value are indeed zeroed:
>>> bin(9223372036854775000)
'0b111111111111111111111111111111111111111111111111111110011011000'
>>> bin(9223372036854774784)
'0b111111111111111111111111111111111111111111111111111110000000000'
The conversion of a double without fractions to an integer type, given that the value of the double falls within the range of the integer type should be exact...
Though you still might encounter a quality-of-implementation issue, or an outright bug - for example MSVC currently has a compiler bug where a round-trip conversion of unsigned 32-bit value with MSB set (or just double value between 2³¹ and 2³²-1 converted to unsigned int) would "overflow" in the conversion and always result in exactly 2³¹.
The following assumes the value being converted is positive. The behavior of negative numbers is analogous.
C 2018 6.3.1.4 2 specifies conversions from integer to real and says:
… If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner.
This tells us that some integer value x being converted to floating-point can produce a non-integer only if one of the two representable values bounding x is not an integer and x is not representable.
5.2.4.2.2 specifies the model used for floating-point numbers. Each finite floating-point number is represented by a sequence of digits in a certain base b scaled by be for some exponent e. (b is an integer greater than 1.) Then, if one of the two values bounding x, say p is not an integer, the scaling must be such that the lowest digit in that floating-point number represents a fraction. But if this is the case, then setting all of the digits in p that represent fractions to 0 must produce a new floating-point number that is an integer. If x < p, this integer must be x, and therefore x is representable in the floating-point format. On the other hand, if p < x, we can add enough to each digit that represents a fraction to make it 0 (and produce a carry to the next higher digit). This will also produce an integer representable in the floating-point type1, and it must be x.
Therefore, if conversion of an integer x to the floating-point type would produce a non-integer, x must be representable in the type. But then conversion to the floating-point type must produce x. So it is never possible to produce a non-integer.
Footnote
1 It is possible this will carry out of all the digits, as when applying it to a three-digit decimal number 9.99, which produces 10.00. In this case, the value produced is the next power of b, if it is in range of the floating-point format. If it is not, the C standard does not define the behavior. Also note the C standard sets minimum requirements on the range that floating-point formats must support which preclude any format from not being able to represent 1, which avoids a degenerate case in which a conversion could produce a number like .999 because it was the largest representable finite value.
When a 64bit int is cast to 64bit float ... and doesn't have an exact match, will it always land on a non-fractional number?
Is this guaranteed to happen for any value of int64_t cast to a double?
For common double: Yes, it always land on a non-fractional number
When there is no match, the result is the closest floating point representable value above or below, depending on rounding mode. Given the characteristics of common double, these 2 bounding values are also whole numbers. When the value is not representable, there is first a nearby whole number one.
... if I cast this non-fractional double back to int64_t, will I always get the exact corresponding 64bit int with the .0 chopped off?
No. Edge cases near INT64_MAX fail as the converted value could become a FP value above INT64_MAX. Then conversion back to the integer type incurs: "the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised." C17dr § 6.3.1.3 3
#include <limits.h>
#include <string.h>
int main() {
long long imaxm1 = LLONG_MAX - 1;
double max = (double) imaxm1;
printf("%lld\n%f\n", imaxm1, max);
long long imax = (long long) max;
printf("%lld\n", imax);
}
9223372036854775806
9223372036854775808.000000
9223372036854775807 // Value here is implementation defined.
Deeper exceptions
(Question variation) When an N bit integer type is cast to a floating point and doesn't have an exact match, will it always land on a non-fractional number?
Integer type range exceeds finite float point
Conversion to infinity: With common float, and uint128_t, UINT128_MAX converts to infinity. This is readily possible with extra wide integer types.
int main() {
unsigned __int128 imaxm1 = 0xFFFFFFFFFFFFFFFF;
imaxm1 <<= 64;
imaxm1 |= 0xFFFFFFFFFFFFFFFF;
double fmax = (float) imaxm1;
double max = (double) imaxm1;
printf("%llde27\n%f\n%f\n", (long long) (imaxm1/1000000000/1000000000/1000000000),
fmax, max);
}
340282366920e27
inf
340282366920938463463374607431768211456.000000
Floating point precession deep more than range
On some unicorn implementation, with very wide FP precision and small range, the largest finite could, in theory, not practice, be a non-whole number. Then with an even wider integer type, the conversion could result in this non-whole number value. I do not see this as a legit concern of OP's.
I've built a custom version of frexp:
auto frexp(float f) noexcept
{
static_assert(std::numeric_limits<float>::is_iec559);
constexpr uint32_t ExpMask = 0xff;
constexpr int32_t ExpOffset = 126;
constexpr int MantBits = 23;
uint32_t u;
std::memcpy(&u, &f, sizeof(float)); // well defined bit transformation from float to int
int exp = ((u >> MantBits) & ExpMask) - ExpOffset; // extract the 8 bits of the exponent (it has an offset of 126)
// divide by 2^exp (leaving mantissa intact while placing "0" into the exponent)
u &= ~(ExpMask << MantBits); // zero out the exponent bits
u |= ExpOffset << MantBits; // place 126 into exponent bits (representing 0)
std::memcpy(&f, &u, sizeof(float)); // copy back to f
return std::make_pair(exp, f);
}
By checking is_iec559 I'm making sure that float fulfills
the requirements of IEC 559 (IEEE 754) standard.
My question is: Does this mean that the bit operations I'm doing are well defined and do what I want? If not, is there a way to fix it?
I tested it for some random values and it seems to be correct, at least on Windows 10 compiled with msvc and on wandbox. Note however, that (on purpose) I'm not handling the edge cases of subnormals, NaN, and inf.
If anyone wonders why I'm doing this: In benchmarks I found that this version of frexp is up to 15 times faster than std::frexp on Windows 10. I haven't tested other platforms yet. But I want to make sure that this not just works by coincident and may brake in future.
Edit:
As mentioned in the comments, endianess could be an issue. Does anybody know?
"Does this mean that the bit operations I'm doing are well defined..."
The TL;DR;, by the strict definition of "well defined": no.
Your assumptions are likely correct but not well defined, because there are no guarantees about the bit width, or the implementation of float. From § 3.9.1:
there are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined.
The is_iec559 clause only qualifies with:
True if and only if the type adheres to IEC 559 standard
If a literal genie wrote you a terrible compiler, and made float = binary16, double = binary32, and long double = binary64, and made is_iec559 true for all the types, it would still adhere to the standard.
does that mean that I can extract exponent and mantissa in a well defined way?
The TL;DR;, by the limited guarantees of the C++ standard: no.
Assume you use float32_t and is_iec559 is true, and logically deduced from all the rules that it could only be binary32 with no trap representations, and you correctly argued that memcpy is a well defined for conversion between arithmetic types of the same width, and won't break strict aliasing. Even with all those assumptions, the behavior might be well defined but it's only likely and not guaranteed that you can extract the mantissa this way.
The IEEE 754 standard and 2's complement regard bit string encodings, and the behavior of memcpy is described using bytes. While it's plausible to assume the bit string of uint32_t and float32_t would be encoded the same way (e.g. same endianness), there's no guarantee in the standard for that. If the bit strings are stored differently and you shift and mask the copied integer representation to get the mantissa, the answer will be incorrect, despite the memcpy behavior being well defined.
As mentioned in the comments, endianess could be an issue. Does anybody know?
At least a few architectures have used different endianness for floating point registers and integer registers. The same link says that except for small embedded processors, this isn't a concern. I trust Wikipedia entirely for all topics and refuse to do any further research.
Consider two very simple multiplications below:
double result1;
long double result2;
float var1=3.1;
float var2=6.789;
double var3=87.45;
double var4=234.987;
result1=var1*var2;
result2=var3*var4;
Are multiplications by default done in a higher precision than the operands? I mean in case of first multiplication is it done in double precision and in case of second one in x86 architecture is it done in 80-bit extended-precision or we should cast operands in expressions to the higher precision ourselves like below?
result1=(double)var1*(double)var2;
result2=(long double)var3*(long double)var4;
What about other operations(add, division and remainder)? For example when adding more than two positive single-precision values, using extra significant bits of double-precision can decrease round-off errors if used to hold intermediate results of expression.
Precision of floating-point computations
C++11 incorporates the definition of FLT_EVAL_METHOD from C99 in cfloat.
FLT_EVAL_METHOD
Possible values:
-1 undetermined
0 evaluate just to the range and precision of the type
1 evaluate float and double as double, and long double as long double.
2 evaluate all as long double
If your compiler defines FLT_EVAL_METHOD as 2, then the computations of r1 and r2, and of s1 and s2 below are respectively equivalent:
double var3 = …;
double var4 = …;
double r1 = var3 * var4;
double r2 = (long double)var3 * (long double)var4;
long double s1 = var3 * var4;
long double s2 = (long double)var3 * (long double)var4;
If your compiler defines FLT_EVAL_METHOD as 2, then in all four computations above, the multiplication is done at the precision of the long double type.
However, if the compiler defines FLT_EVAL_METHOD as 0 or 1, r1 and r2, and respectively s1 and s2, aren't always the same. The multiplications when computing r1 and s1 are done at the precision of double. The multiplications when computing r2 and s2 are done at the precision of long double.
Getting wide results from narrow arguments
If you are computing results that are destined to be stored in a wider result type than the type of the operands, as are result1 and result2 in your question, you should always convert the arguments to a type at least as wide as the target, as you do here:
result2=(long double)var3*(long double)var4;
Without this conversion (if you write var3 * var4), if the compiler's definition of FLT_EVAL_METHOD is 0 or 1, the product will be computed in the precision of double, which is a shame, since it is destined to be stored in a long double.
If the compiler defines FLT_EVAL_METHOD as 2, then the conversions in (long double)var3*(long double)var4 are not necessary, but they do not hurt either: the expression means exactly the same thing with and without them.
Digression: if the destination format is as narrow as the arguments, when is extended-precision for intermediate results better?
Paradoxically, for a single operation, rounding only once to the target precision is best. The only effect of computing a single multiplication in extended precision is that the result will be rounded to extended precision and then to double precision. This makes it less accurate. In other words, with FLT_EVAL_METHOD 0 or 1, the result r2 above is sometimes less accurate than r1 because of double-rounding, and if the compiler uses IEEE 754 floating-point, never better.
The situation is different for larger expressions that contain several operations. For these, it is usually better to compute intermediate results in extended precision, either through explicit conversions or because the compiler uses FLT_EVAL_METHOD == 2. This question and its accepted answer show that when computing with 80-bit extended precision intermediate computations for binary64 IEEE 754 arguments and results, the interpolation formula u2 * (1.0 - u1) + u1 * u3 always yields a result between u2 and u3 for u1 between 0 and 1. This property may not hold for binary64-precision intermediate computations because of the larger rounding errors then.
The usual arthimetic conversions for floating point types are applied before multiplication, division, and modulus:
The usual arithmetic conversions are performed on the operands and determine the type of the result.
§5.6 [expr.mul]
Similarly for addition and subtraction:
The usual arithmetic conversions are performed for operands of arithmetic or enumeration type.
§5.7 [expr.add]
The usual arithmetic conversions for floating point types are laid out in the standard as follows:
Many binary operators that expect operands of arithmetic or enumeration type cause conversions and yield result types in a similar way. The purpose is to yield a common type, which is also the type of the result. This pattern is called the usual arithmetic conversions, which are defined as follows:
[...]
— If either operand is of type long double, the other shall be converted to long double.
— Otherwise, if either operand is double, the other shall be converted to double.
— Otherwise, if either operand is float, the other shall be converted to float.
§5 [expr]
The actual form/precision of these floating point types is implementation-defined:
The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined.
§3.9.1 [basic.fundamental]
For floating point multiplication: FP multipliers use internally double the width of the operands to generate an intermediate result, which equals the real result within an infinite precision, and then round it to the target precision. Thus you should not worry about multiplication. The result is correctly rounded.
For floating point addition, the result is also correctly rounded as standard FP adders use extra sufficient 3 guard bits to compute a correctly rounded result.
For division, remainder and other complicated functions, like transcendentals such as sin, log, exp, etc... it depends mainly on the architecture and the used libraries. I recommend you to use the MPFR library if you seek correctly rounded results for division or any other complicated function.
Not a direct answer to your question, but for constant floating-point values (such as the ones specified in your question), the method that yields the least amount of precision-loss would be using the rational representation of each value as an integer numerator divided by an integer denominator, and perform as many integer-multiplications as possible before the actual floating-point-division.
For the floating-point values specified in your question:
int var1_num = 31;
int var1_den = 10;
int var2_num = 6789;
int var2_den = 1000;
int var3_num = 8745;
int var3_den = 100;
int var4_num = 234987;
int var4_den = 1000;
double result1 = (double)(var1_num*var2_num)/(var1_den*var2_den);
long double result2 = (long double)(var3_num*var4_num)/(var3_den*var4_den);
If any of the integer-products are too large to fit in an int, then you can use larger integer types:
unsigned int
signed long
unsigned long
signed long long
unsigned long long
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
long double vs double
I am new to programming and I am unable to understand the difference between between long double and double in C and C++. I tried to Google it but was unable to understand it and got confused. Can anyone please help.?
To quote the C++ standard, §3.9.1 ¶8:
There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined. Integral and floating types are collectively called arithmetic types. Specializations of the standard template std::numeric_limits (18.3) shall specify the maximum and minimum values of each arithmetic type for an implementation.
That is to say that double takes at least as much memory for its representation as float and long double at least as much as double. That extra memory is used for more precise representation of a number.
On x86 systems, float is typically 4 bytes long and can store numbers as large as about 3×10³⁸ and about as small as 1.4×10⁻⁴⁵. It is an IEEE 754 single-precision number that stores about 7 decimal digits of a fractional number.
Also on x86 systems, double is 8 bytes long and can store numbers in the IEEE 754 double-precision format, which has a much larger range and stores numbers with more precision, about 15 decimal digits. On some other platforms, double may not be 8 bytes long and may indeed be the same as a single-precision float.
The standard only requires that long double is at least as precise as double, so some compilers will simply treat long double as if it is the same as double. But, on most x86 chips, the 10-byte extended precision format 80-bit number is available through the CPU's floating-point unit, which provides even more precision than 64-bit double, with about 21 decimal digits of precision.
Some compilers instead support a 16-byte (128-bit) IEEE 754 quadruple precision number format with yet more precise representations and a larger range.
It depends on your compiler but the following code can show you the number of bytes that each type requires:
int main() {
printf("%d\n", sizeof(double)); // some compilers print 8
printf("%d\n", sizeof(long double)); // some compilers print 16
return 0;
}
A long <type> data type may hold larger values then a <type> data type, depending on the compiler.
If I have an int, convert it to a double, then convert the double back to an int, am I guaranteed to get the same value back that I started with? In other words, given this function:
int passThroughDouble(int input)
{
double d = input;
return d;
}
Am I guaranteed that passThroughDouble(x) == x for all ints x?
No it isn't. The standard says nothing about the relative sizes of int and double.
If int is a 64-bit integer and double is the standard IEEE double-precision, then it will already fail for numbers bigger than 2^53.
That said, int is still 32-bit on the majority of environments today. So it will still hold in many cases.
If we restrict consideration to the "traditional" IEEE-754-style representation of floating-point types, then you can expect this conversion to be value-preserving if and only if the mantissa of the type double has as many bits as there are non-sign bits in type int.
Mantissa of a classic IEEE-754 double type is 53-bit wide (including the "implied" leading bit), which means that you can represent integers in [-2^53, +2^53] range precisely. Everything out of this range will generally lose precision.
So, it all depends on how wide your int is compared to your double. The answer depends on the specific platform. With 32-bit int and IEEE-754 double the equality should hold.