What precision are floating-point arithmetic operations done in?


Consider two very simple multiplications below:
double result1;
long double result2;
float var1=3.1;
float var2=6.789;
double var3=87.45;
double var4=234.987;
result1=var1*var2;
result2=var3*var4;
Are multiplications done by default in a higher precision than their operands? That is, is the first multiplication done in double precision, and is the second one, on the x86 architecture, done in 80-bit extended precision? Or should we cast the operands to the higher precision ourselves, as below?
result1=(double)var1*(double)var2;
result2=(long double)var3*(long double)var4;
What about other operations (addition, division, and remainder)? For example, when adding more than two positive single-precision values, holding the intermediate results in double precision provides extra significant bits that can reduce round-off error.

Precision of floating-point computations
C++11 incorporates the definition of FLT_EVAL_METHOD from C99 in cfloat.
FLT_EVAL_METHOD
Possible values:
-1 undetermined
0 evaluate just to the range and precision of the type
1 evaluate float and double as double, and long double as long double.
2 evaluate all as long double
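To find out which of these values applies to a given compiler and set of flags, the macro can simply be inspected. The following is a minimal sketch; the helper name describe_flt_eval_method is made up for illustration, and negative values other than -1 are lumped together as implementation-defined.

```cpp
#include <cfloat>
#include <string>

// Hypothetical helper (not part of any library): turns the compiler's
// FLT_EVAL_METHOD value into a short human-readable description.
std::string describe_flt_eval_method() {
    switch (FLT_EVAL_METHOD) {
        case 0:  return "operations use the range and precision of their own type";
        case 1:  return "float and double operations are evaluated as double";
        case 2:  return "all operations are evaluated as long double";
        default: return "indeterminate or implementation-defined";
    }
}
```

Printing the returned string from main is enough to see what a particular build does; the value can change with compiler flags that select the floating-point unit.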
If your compiler defines FLT_EVAL_METHOD as 2, then the computations of r1 and r2, and of s1 and s2 below are respectively equivalent:
double var3 = …;
double var4 = …;
double r1 = var3 * var4;
double r2 = (long double)var3 * (long double)var4;
long double s1 = var3 * var4;
long double s2 = (long double)var3 * (long double)var4;
If your compiler defines FLT_EVAL_METHOD as 2, then in all four computations above, the multiplication is done at the precision of the long double type.
However, if the compiler defines FLT_EVAL_METHOD as 0 or 1, r1 and r2, and respectively s1 and s2, aren't always the same. The multiplications when computing r1 and s1 are done at the precision of double. The multiplications when computing r2 and s2 are done at the precision of long double.
Getting wide results from narrow arguments
If you are computing results that are destined to be stored in a wider result type than the type of the operands, as are result1 and result2 in your question, you should always convert the arguments to a type at least as wide as the target, as you do here:
result2=(long double)var3*(long double)var4;
Without this conversion (if you write var3 * var4), if the compiler's definition of FLT_EVAL_METHOD is 0 or 1, the product will be computed in the precision of double, which is a shame, since it is destined to be stored in a long double.
If the compiler defines FLT_EVAL_METHOD as 2, then the conversions in (long double)var3*(long double)var4 are not necessary, but they do not hurt either: the expression means exactly the same thing with and without them.
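The same advice applies to float operands stored into a double. As a minimal sketch (both helper names are hypothetical), the two variants look like this:

```cpp
// Hypothetical helpers for illustration only.

// Multiplies in the operand type; with FLT_EVAL_METHOD == 0 the product is
// rounded to float precision before it is widened to double.
double narrow_product(float a, float b) {
    return a * b;
}

// Converts first, so the multiplication itself is carried out in double
// (or wider) regardless of FLT_EVAL_METHOD.
double wide_product(float a, float b) {
    return static_cast<double>(a) * static_cast<double>(b);
}
```

Assuming IEEE 754 binary32 and binary64, wide_product(4097.0f, 4097.0f) returns exactly 16785409.0, a value that needs 25 significant bits and therefore cannot be represented as a float; narrow_product can only return a neighbouring value when the product is computed in float.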
Digression: if the destination format is as narrow as the arguments, when is extended-precision for intermediate results better?
Paradoxically, for a single operation, rounding only once to the target precision is best. The only effect of computing a single multiplication in extended precision is that the result will be rounded to extended precision and then to double precision. This makes it less accurate. In other words, with FLT_EVAL_METHOD 0 or 1, the result r2 above is sometimes less accurate than r1 because of double-rounding, and if the compiler uses IEEE 754 floating-point, never better.
The situation is different for larger expressions that contain several operations. For these, it is usually better to compute intermediate results in extended precision, either through explicit conversions or because the compiler uses FLT_EVAL_METHOD == 2. This question and its accepted answer show that when computing with 80-bit extended precision intermediate computations for binary64 IEEE 754 arguments and results, the interpolation formula u2 * (1.0 - u1) + u1 * u3 always yields a result between u2 and u3 for u1 between 0 and 1. This property may not hold for binary64-precision intermediate computations because of the larger rounding errors then.
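A sketch of the interpolation formula with explicit extended-precision intermediates might look as follows. The function name is hypothetical, and the benefit only exists where long double is actually wider than double (it is not on every platform).

```cpp
// Hypothetical helper: evaluates u2 * (1.0 - u1) + u1 * u3 with every
// intermediate result held in long double, rounding to double only once
// at the end.
double interpolate_extended(double u1, double u2, double u3) {
    const long double t = u1;
    const long double r = static_cast<long double>(u2) * (1.0L - t)
                        + t * static_cast<long double>(u3);
    return static_cast<double>(r);
}
```

The explicit conversions make the intent independent of FLT_EVAL_METHOD; with FLT_EVAL_METHOD == 2 the plain double expression would already be evaluated this way.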

The usual arithmetic conversions for floating-point types are applied before multiplication, division, and modulus:
The usual arithmetic conversions are performed on the operands and determine the type of the result.
§5.6 [expr.mul]
Similarly for addition and subtraction:
The usual arithmetic conversions are performed for operands of arithmetic or enumeration type.
§5.7 [expr.add]
The usual arithmetic conversions for floating point types are laid out in the standard as follows:
Many binary operators that expect operands of arithmetic or enumeration type cause conversions and yield result types in a similar way. The purpose is to yield a common type, which is also the type of the result. This pattern is called the usual arithmetic conversions, which are defined as follows:
[...]
— If either operand is of type long double, the other shall be converted to long double.
— Otherwise, if either operand is double, the other shall be converted to double.
— Otherwise, if either operand is float, the other shall be converted to float.
§5 [expr]
The actual form/precision of these floating point types is implementation-defined:
The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined.
§3.9.1 [basic.fundamental]
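The conversion rules quoted above can be checked at compile time. This is a small sketch; product_has_type is a made-up helper, not a standard facility. Note that it only verifies the type of the result, which says nothing about the precision the hardware uses internally.

```cpp
#include <type_traits>
#include <utility>

// Hypothetical helper: true when multiplying an A by a B yields a Result.
template <typename A, typename B, typename Result>
constexpr bool product_has_type() {
    return std::is_same<decltype(std::declval<A>() * std::declval<B>()),
                        Result>::value;
}

static_assert(product_has_type<float, float, float>(),
              "float * float stays float");
static_assert(product_has_type<float, double, double>(),
              "float is converted to double");
static_assert(product_has_type<double, long double, long double>(),
              "double is converted to long double");
static_assert(product_has_type<int, float, float>(),
              "int is converted to float");
```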

For floating-point multiplication: FP multipliers internally use twice the width of the operands to generate an intermediate result that equals the exact (infinitely precise) product, and then round it to the target precision. Thus you need not worry about multiplication; the result is correctly rounded.
For floating-point addition, the result is also correctly rounded, since standard FP adders carry three extra bits (guard, round, and sticky) that suffice to compute a correctly rounded result.
IEEE 754 also requires division to be correctly rounded and the remainder operation to be exact. For other complicated functions, such as the transcendentals sin, log, and exp, accuracy depends mainly on the architecture and the libraries used. I recommend the MPFR library if you need correctly rounded results for such functions.

Not a direct answer to your question, but for constant floating-point values (such as the ones specified in your question), the method that yields the least amount of precision-loss would be using the rational representation of each value as an integer numerator divided by an integer denominator, and perform as many integer-multiplications as possible before the actual floating-point-division.
For the floating-point values specified in your question:
int var1_num = 31;
int var1_den = 10;
int var2_num = 6789;
int var2_den = 1000;
int var3_num = 8745;
int var3_den = 100;
int var4_num = 234987;
int var4_den = 1000;
double result1 = (double)(var1_num*var2_num)/(var1_den*var2_den);
long double result2 = (long double)(var3_num*var4_num)/(var3_den*var4_den);
If any of the integer-products are too large to fit in an int, then you can use larger integer types:
unsigned int
signed long
unsigned long
signed long long
unsigned long long
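As a sketch of the same idea with a wider integer type, assuming the caller has already verified that neither integer product overflows long long (the helper name is hypothetical):

```cpp
// Hypothetical helper: computes (n1/d1) * (n2/d2) by multiplying the integer
// numerators and denominators first and dividing only once at the end.
// Assumption: n1 * n2 and d1 * d2 both fit in long long; no overflow check
// is performed here.
long double product_of_ratios(long long n1, long long d1,
                              long long n2, long long d2) {
    return static_cast<long double>(n1 * n2)
         / static_cast<long double>(d1 * d2);
}
```

For the values in the question, product_of_ratios(31, 10, 6789, 1000) divides 210459 by 10000, so only the final division can introduce a rounding error.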

Related

When a 64bit int is cast to 64bit float in C/C++ and doesn't have an exact match, will it always land on a non-fractional number?

When int64_t is cast to double and doesn't have an exact match, to my knowledge I get a sort of best-effort-nearest-value equivalent in double. For example, 9223372036854775000 in int64_t appears to end up as 9223372036854774784.0 in double:
#include <stdio.h>
int main(int argc, const char **argv) {
printf("Corresponding double: %f\n", (double)9223372036854775000LL);
// Outputs: 9223372036854774784.000000
return 0;
}
It appears to me that an int64_t cast to a double always ends up as a clean non-fractional number, even in this higher number range where double has really low precision. However, I just observed this from random attempts. Is this guaranteed to happen for any value of int64_t cast to a double?
And if I cast this non-fractional double back to int64_t, will I always get the exact corresponding 64bit int with the .0 chopped off? (Assuming it doesn't overflow during the conversion back.) Like here:
#include <inttypes.h>
#include <stdio.h>
int main(int argc, const char **argv) {
printf("Corresponding double: %f\n", (double)9223372036854775000LL);
// Outputs: 9223372036854774784.000000
printf("Corresponding int to corresponding double: %" PRId64 "\n",
(int64_t)((double)9223372036854775000LL));
// Outputs: 9223372036854774784
return 0;
}
Or can it be imprecise and get me the "wrong" int in some corner cases?
Intuitively and from my tests the answer to both points appears to be "yes", but if somebody with a good formal understanding of the floating point standards and the maths behind it could confirm this that would be really helpful to me. I would also be curious if any known more aggressive optimizations like gcc's -Ofast are known to break any of this.

In the general case, yes, both should be true. The floating-point base must be an integer, even if it is not 2, and given that, an integer converted to the nearest floating-point value can never acquire a non-zero fractional part: either the precision suffices, or the lowest-order integer digits in the base of the floating type are zeroed. For example, your system uses ISO/IEC/IEEE 60559 binary floating-point numbers. Inspected in base 2, the trailing digits of the value are indeed zeroed:
>>> bin(9223372036854775000)
'0b111111111111111111111111111111111111111111111111111110011011000'
>>> bin(9223372036854774784)
'0b111111111111111111111111111111111111111111111111111110000000000'
The conversion of a double without fractions to an integer type, given that the value of the double falls within the range of the integer type should be exact...
Though you still might encounter a quality-of-implementation issue, or an outright bug - for example MSVC currently has a compiler bug where a round-trip conversion of unsigned 32-bit value with MSB set (or just double value between 2³¹ and 2³²-1 converted to unsigned int) would "overflow" in the conversion and always result in exactly 2³¹.
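The first claim is easy to spot-check for individual values. A minimal sketch, assuming double is IEEE 754 binary64 (the helper name is made up):

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: converts v to double and reports whether the result
// has no fractional part.
bool converts_to_whole_number(std::int64_t v) {
    const double d = static_cast<double>(v);
    return std::trunc(d) == d;
}
```

This only tests the values it is given; it is an illustration of the property, not a proof of it.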

The following assumes the value being converted is positive. The behavior of negative numbers is analogous.
C 2018 6.3.1.4 2 specifies conversions from integer to real and says:
… If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner.
This tells us that some integer value x being converted to floating-point can produce a non-integer only if one of the two representable values bounding x is not an integer and x is not representable.
5.2.4.2.2 specifies the model used for floating-point numbers. Each finite floating-point number is represented by a sequence of digits in a certain base b scaled by b^e for some exponent e. (b is an integer greater than 1.) Then, if one of the two values bounding x, say p, is not an integer, the scaling must be such that the lowest digit in that floating-point number represents a fraction. But if this is the case, then setting all of the digits in p that represent fractions to 0 must produce a new floating-point number that is an integer. If x < p, this integer must be x, and therefore x is representable in the floating-point format. On the other hand, if p < x, we can add enough to each digit that represents a fraction to make it 0 (and produce a carry to the next higher digit). This will also produce an integer representable in the floating-point type (see footnote 1), and it must be x.
Therefore, if conversion of an integer x to the floating-point type would produce a non-integer, x must be representable in the type. But then conversion to the floating-point type must produce x. So it is never possible to produce a non-integer.
Footnote
1 It is possible this will carry out of all the digits, as when applying it to a three-digit decimal number 9.99, which produces 10.00. In this case, the value produced is the next power of b, if it is in range of the floating-point format. If it is not, the C standard does not define the behavior. Also note the C standard sets minimum requirements on the range that floating-point formats must support which preclude any format from not being able to represent 1, which avoids a degenerate case in which a conversion could produce a number like .999 because it was the largest representable finite value.

When a 64bit int is cast to 64bit float ... and doesn't have an exact match, will it always land on a non-fractional number?
Is this guaranteed to happen for any value of int64_t cast to a double?
For common double: Yes, it always lands on a non-fractional number.
When there is no exact match, the result is the closest representable floating-point value above or below, depending on the rounding mode. Given the characteristics of common double, these two bounding values are also whole numbers: an integer that is not representable is so large that neighboring double values are at least a whole number apart.
... if I cast this non-fractional double back to int64_t, will I always get the exact corresponding 64bit int with the .0 chopped off?
No. Edge cases near INT64_MAX fail as the converted value could become a FP value above INT64_MAX. Then conversion back to the integer type incurs: "the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised." C17dr § 6.3.1.3 3
#include <limits.h>
#include <stdio.h>
int main() {
long long imaxm1 = LLONG_MAX - 1;
double max = (double) imaxm1;
printf("%lld\n%f\n", imaxm1, max);
long long imax = (long long) max;
printf("%lld\n", imax);
}
9223372036854775806
9223372036854775808.000000
9223372036854775807 // Value here is implementation defined.
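One way to avoid the out-of-range conversion is to test the double before converting it back. The following is a sketch under the assumption that double is IEEE 754 binary64, so that 2^63 is exactly representable; the helper name is hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: converts d to int64_t only when the value is in range.
// Returns false (leaving out untouched) for NaN, infinities, and any value
// outside [-2^63, 2^63).
bool to_int64_checked(double d, std::int64_t& out) {
    const double two63 = 9223372036854775808.0;  // 2^63, exact in binary64
    if (!(d >= -two63 && d < two63)) {
        return false;  // the negated form also rejects NaN
    }
    out = static_cast<std::int64_t>(d);  // truncates any fractional part
    return true;
}
```

With this check, the value produced from LLONG_MAX - 1 above (2^63 as a double) is rejected instead of being converted.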
Deeper exceptions
(Question variation) When an N bit integer type is cast to a floating point and doesn't have an exact match, will it always land on a non-fractional number?
Integer type range exceeds finite float point
Conversion to infinity: With common float, and uint128_t, UINT128_MAX converts to infinity. This is readily possible with extra wide integer types.
#include <stdio.h>
int main() {
unsigned __int128 imaxm1 = 0xFFFFFFFFFFFFFFFF;
imaxm1 <<= 64;
imaxm1 |= 0xFFFFFFFFFFFFFFFF;
double fmax = (float) imaxm1;
double max = (double) imaxm1;
printf("%llde27\n%f\n%f\n", (long long) (imaxm1/1000000000/1000000000/1000000000),
fmax, max);
}
340282366920e27
inf
340282366920938463463374607431768211456.000000
Floating-point precision exceeds range
On some unicorn implementation, with very wide FP precision and a small range, the largest finite value could, in theory though not in practice, be a non-whole number. Then, with an even wider integer type, the conversion could result in this non-whole value. I do not see this as a legitimate concern for the OP.

When is integer to floating point conversion lossless?

Particularly I'm interested if int32_t is always losslessly converted to double.
Does the following code always return true?
int is_lossless(int32_t i)
{
double d = i;
int32_t i2 = d;
return (i2 == i);
}
What about int64_t?

When is integer to floating point conversion lossless?
When the floating point type has enough precision and range to encode all possible values of the integer type.
Does the following int32_t code always return true? --> Yes.
Does the following int64_t code always return true? --> No.
As DBL_MAX is at least 1E+37, the range is sufficient for at least int122_t, so let us look at precision.
With the common double (base 2, a sign bit, a 53-bit significand, and an exponent), all values of int54_t, with its 53 value bits, can be represented exactly. INT54_MIN is also representable. Such a double has DBL_MANT_DIG == 53, which in this case is the number of base-2 digits in the floating-point significand.
The smallest magnitude non-representable value would be INT54_MAX + 2. Type int55_t and wider have values not exactly representable as a double.
With uintN_t types, there is 1 more value bit. The typical double can then encode all uint53_t and narrower.
With other possible double encodings, as C specifies DBL_DIG >= 10, all values of int34_t can round trip.
Code is always true with int32_t, regardless of double encoding.
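The precision and range argument can be expressed as a compile-time check with std::numeric_limits. This is a sketch with a made-up helper name; it only handles base-2 floating-point formats and conservatively reports false for anything else.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: true when every value of the integer type Int is
// exactly representable as a double. digits counts value bits (the sign bit
// is excluded for signed types).
template <typename Int>
constexpr bool double_represents_all() {
    return std::numeric_limits<double>::radix == 2
        && std::numeric_limits<double>::digits
               >= std::numeric_limits<Int>::digits
        && std::numeric_limits<double>::max_exponent
               > std::numeric_limits<Int>::digits;
}
```

On a platform with IEEE 754 binary64 double, double_represents_all<std::int32_t>() is true and double_represents_all<std::int64_t>() is false.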
What about int64_t?
UB potential with int64_t.
The conversion in int64_t i ... double d = i;, when inexact, yields an implementation-defined choice between the two nearest candidates. This is often round-to-nearest. Values of i near INT64_MAX can then convert to a double one more than INT64_MAX.
With int64_t i2 = d;, the conversion of the double value one more than INT64_MAX to int64_t is undefined behavior (UB).
A simple prior test to detect this:
#define INT64_MAX_P1 ((INT64_MAX/2 + 1) * 2.0)
if (d == INT64_MAX_P1) return false; // not lossless
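Put together with the function from the question, the test could be used as follows. This is a sketch assuming IEEE 754 binary64 double; the function name is hypothetical.

```cpp
#include <cstdint>

// 2^63 as a double, built from integer arithmetic as in the answer so that
// no decimal literal has to be rounded.
#define INT64_MAX_P1 ((INT64_MAX / 2 + 1) * 2.0)

// Hypothetical int64_t variant of is_lossless from the question.
bool is_lossless_int64(std::int64_t i) {
    const double d = static_cast<double>(i);
    if (d == INT64_MAX_P1) {
        return false;  // converting d back would be undefined behavior
    }
    // d now lies within [-2^63, 2^63), so the conversion back is defined.
    return static_cast<std::int64_t>(d) == i;
}
```

No lower-bound test is needed because INT64_MIN is a power of two in magnitude and converts exactly.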

Question: Does the following code always return true?
Always is a big statement and therefore the answer is no.
The C++ Standard makes no mention whether or not the floating-point types which are known to C++ (float, double and long double) are of the IEEE-754 type. The standard explicitly states:
There are three floating-point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined. [Note: This document imposes no requirements on the accuracy of floating-point operations; see also [support.limits]. — end note] Integral and floating-point types are collectively called arithmetic types. Specialisations of the standard library template std::numeric_limits shall specify the maximum and minimum values of each arithmetic type for an implementation.
source: C++ standard: basic fundamentals
Most commonly, the type double represents the IEEE 754 double-precision binary floating-point format binary64, which stores a sign bit, an 11-bit exponent, and a 52-bit fraction.
However, there is a plethora of other floating-point formats out there that are decoded differently and do not necessarily have the same properties as the well-known IEEE-754. Nonetheless, they are all in all similar:
They are n bits long
One bit represents the sign
m bits represent the significand with or without a hidden first bit
e bits represent some form of an exponent of a given base (2 or 10)
To know whether or not a double can represent all 32-bit signed integers, you must answer the following questions (assuming our floating-point number is in base 2):
Does my floating-point representation have a hidden first bit in the significand? If so, assume m=m+1
A 32-bit signed integer is represented by 1 sign bit and 31 bits representing the number. Is the significand large enough to hold those 31 bits?
Is the exponent large enough that it can represent a number of the form 1.xxxxx × 2^31?
If you can answer yes to the last two questions, then yes, an int32 can always be represented by the double that is implemented on this particular system.
Note: I ignored decimal32 and decimal64 numbers, as I have no direct knowledge about them.

Note: my answer assumes that double follows IEEE 754 and that both int32_t and int64_t are two's complement.
Does the following code always return true?
The mantissa/significand of a double is longer than 32 bits, so int32_t => double is always done without error: there is no possible precision loss (and no possible overflow/underflow, since the exponent covers more than the needed range of values).
What about int64_t?
But the 53 bits of mantissa/significand (including 1 implicit bit) of a double are not enough to hold the 64 bits of an int64_t, so an int64_t whose highest and lowest set bits are far enough apart cannot be stored in a double without precision error (there is still no possible overflow/underflow, since the exponent still covers more than the needed range of values).

If your platform uses IEEE754 for the double, then yes, any int32_t can be represented perfectly in a double. This is not the case for all possible values that an int64_t can have.
(It is possible on some platforms to tweak the mantissa / exponent sizes of floating point types to make the transformation lossy, but such a type would not be an IEEE754 double.)
To test for IEEE754, use
static_assert(std::numeric_limits<double>::is_iec559, "IEEE 754 floating point");

How does DBL_MAX addition work?

Code
#include<stdio.h>
#include<limits.h>
#include<float.h>
int f( double x, double y, double z){
return (x+y)+z == x+(y+z);
}
int ff( long long x, long long y, long long z){
return (x+y)+z == x+(y+z);
}
int main()
{
printf("%d\n",f(DBL_MAX,DBL_MAX,-DBL_MAX));
printf("%d\n",ff(LLONG_MAX,LLONG_MAX,-LLONG_MAX));
return 0;
}
Output
0
1
I am unable to understand why both functions work differently. What is happening here?

In the eyes of the C++ and the C standard, the integer version definitely and the floating-point version potentially invoke Undefined Behavior, because the result of the computation x + y is not representable in the type the arithmetic is performed in.† So both functions may yield or even do anything.
However, many real world platforms offer additional guarantees for floating point operations and implement integers in a certain way that lets us explain the results you get.
Considering f, we note that many popular platforms implement floating point math as described in IEEE 754. Following the rules of that standard, we get for the LHS:
DBL_MAX + DBL_MAX = INF
and
INF - DBL_MAX = INF.
The RHS yields
DBL_MAX - DBL_MAX = 0
and
DBL_MAX + 0 = DBL_MAX
and thus LHS != RHS.
Moving on to ff: Many platforms perform signed integer computation in two's complement. Two's complement addition is associative, so the comparison will yield true as long as the optimizer does not change it to something that contradicts two's complement rules.
The latter is entirely possible (for example see this discussion), so you cannot rely on signed integer overflow doing what I explained above. However, it seems that it "was nice" in this case.
†Note that this never applies to unsigned integer arithmetic. In C++, unsigned integers implement arithmetic modulo 2^NumBits where NumBits is the number of bits of the type. In this arithmetic, every integer can be represented by picking a representative of its equivalence class in [0, 2^NumBits - 1]. So this arithmetic can never overflow.
For those doubting that the floating point case is potential UB: N4140 5/4 [expr] says
If during the evaluation of an expression, the result is not mathematically defined or not in the range of
representable values for its type, the behavior is undefined.
which is the case. The inf and NaN behavior is allowed, but not required, in C++ and C floating-point math. It is only required if std::numeric_limits<T>::is_iec559 is true for the floating-point type in question. (Or in C, if the implementation defines __STDC_IEC_559__; otherwise, the Annex F rules need not apply.) If either of the IEC indicators guarantees IEEE semantics, the behavior is well defined and does what I described above.
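The step-by-step evaluation described above can be mirrored in code. This is a sketch with hypothetical function names; the floating-point result is only meaningful when std::numeric_limits<double>::is_iec559 is true, and each intermediate is stored in a double so that it is not kept in a wider register format.

```cpp
#include <cfloat>
#include <limits>

// Hypothetical rewrite of f from the question with named intermediates.
bool sum_is_associative(double x, double y, double z) {
    const double xy = x + y;     // DBL_MAX + DBL_MAX overflows to +inf
    const double lhs = xy + z;   // +inf - DBL_MAX is still +inf
    const double yz = y + z;     // DBL_MAX - DBL_MAX is 0
    const double rhs = x + yz;   // DBL_MAX + 0 is DBL_MAX
    return lhs == rhs;
}

// Unsigned arithmetic wraps modulo 2^N, so regrouping never changes the sum
// and no overflow can occur.
bool unsigned_sum_is_associative(unsigned long long x,
                                 unsigned long long y,
                                 unsigned long long z) {
    return (x + y) + z == x + (y + z);
}
```

The comments on the intermediates describe the call sum_is_associative(DBL_MAX, DBL_MAX, -DBL_MAX) from the question; for other arguments the intermediates differ accordingly.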

C++ calculation with type "long"

I have an inline function that does a frequency-to-period conversion. The calculation has to be done in type long, not type double; otherwise, it may cause rounding errors. The function then converts the result back to double. In the code below, which line keeps the calculation in type long, whether the parameter bar is 100, 100.0 or 33.3333?
double foo(long bar)
{
return 1000000/bar;
return 1000000.0/bar;
return (long)1000000/bar;
return (long)1000000.0/bar;
}
I tried it myself, and the 4th line works. But I am wondering about the concept of type conversion in this case.
EDIT:
One of the errors is 1000000/37038 = 26, not 26.9993.

return 1000000/bar;
This will do the math as a long.
return 1000000.0/bar;
This will do the math as a double.
return (long)1000000.0/bar;
This is equivalent to the first -- 1000000.0 is a double, but then you cast it to long before the division, so the division will be done on longs.

This problem, as you posed it, doesn't make sense.
bar is of an integral type, so 1000000/bar will surely be no larger in magnitude than 1000000, which can be represented exactly by a double (see the note at the end of this answer), so there is no way in which performing the calculation entirely in integer arithmetic can give better precision. In fact, you will get integer division, which in this case is never more precise, since it truncates the fractional part. The only place a long to double conversion could cause a problem here is the conversion of bar to double, but if it exceeds the range of double the final result of the division will be 0, as it would be anyway in integer arithmetic.
Still:
1000000/bar
performs a division between longs: 1000000 is an int or a long, depending on the platform, bar is a long; the first operand gets promoted to a long if necessary and then an integer division is performed.
1000000.0/bar
performs a division between doubles: 1000000.0 is a double literal, so bar gets promoted to double before the division.
(long)1000000/bar
is equivalent to the first one: the cast has precedence over the division, and forces 1000000 (which is either a long or an int) to be a long; bar is a long, division between longs is performed.
(long)1000000.0/bar
is equivalent to the previous one: 1000000.0 is a double, but you cast it to a long and then integer division is performed.
The C standard, to which the C++ standard delegates the matter, asks for a minimum of 10 decimal digits for the mantissa of doubles (DBL_DIG) and at least 10**37 as representable power of ten before going out of range (DBL_MAX_10_EXP) (C99, annex E, ¶4).
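The difference between the two kinds of division is easy to see with the value from the question's edit. A minimal sketch with hypothetical function names:

```cpp
// Integer division: both operands are integers, so the quotient is truncated
// before it is converted to double for the return value.
double period_integer(long bar) {
    return 1000000 / bar;
}

// Floating-point division: the literal is a double, so bar is converted to
// double and the quotient keeps its fractional part.
double period_floating(long bar) {
    return 1000000.0 / bar;
}
```

With bar equal to 37038, period_integer returns 26.0, while period_floating returns a value just below 27. Both functions assume bar is non-zero.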

The first line (and the third, more verbosely) will do the math as long (which in C++ always truncates the result toward zero) and then return the integral value as a double. I don't understand what you're saying in your question about bar being 33.3333, because that's not a possible long value.

Is int->double->int guaranteed to be value-preserving?

If I have an int, convert it to a double, then convert the double back to an int, am I guaranteed to get the same value back that I started with? In other words, given this function:
int passThroughDouble(int input)
{
double d = input;
return d;
}
Am I guaranteed that passThroughDouble(x) == x for all ints x?

No it isn't. The standard says nothing about the relative sizes of int and double.
If int is a 64-bit integer and double is the standard IEEE double-precision, then it will already fail for numbers bigger than 2^53.
That said, int is still 32-bit on the majority of environments today. So it will still hold in many cases.

If we restrict consideration to the "traditional" IEEE-754-style representation of floating-point types, then you can expect this conversion to be value-preserving if and only if the mantissa of the type double has at least as many bits as there are non-sign bits in type int.
The mantissa of a classic IEEE-754 double is 53 bits wide (including the "implied" leading bit), which means that you can represent integers in the [-2^53, +2^53] range precisely. Everything outside this range will generally lose precision.
So, it all depends on how wide your int is compared to your double. The answer depends on the specific platform. With 32-bit int and IEEE-754 double the equality should hold.
