C++ calculation with type "long" - c++

I have a inline function does a frequency to period conversion. The calculation precision has to be using type long, not type double. Otherwise, it may cause some rounding errors. The function then converts the result back to double. I was wondering in below code, which line would keep the calculation in type long. No matter the parameter bar is 100, 100.0 or 33.3333.
double foo(long bar)
{
return 1000000/bar;
return 1000000.0/bar;
return (long)1000000/bar;
return (long)1000000.0/bar;
}
I tried it myself, and the 4th line works. But just wondering the concept of type conversion in this case.
EDIT:
One of the error is 1000000/37038 = 26, not 26.9993.

return 1000000/bar;
This will do the math as a long.
return 1000000.0/bar;
This will do the math as a double.
return (long)1000000.0/bar;
This is equivalent to the first -- 1000000.0 is a double, but then you cast it to long before the division, so the division will be done on longs.

This problem, as you posed it, doesn't make sense.
bar is of an integral type, so 1000000/bar will surely be less than 1000000, which can be represented exactly by a double1, so there's no way in which performing the calculation all in integral arithmetic can give better precision - actually, you will get integer division, that in this case is less precise for any value of bar, since it will truncate the decimal part. The only way you can have a problem in a long to double conversion here is in bar conversion to double, but if it exceeds the range of double the final result of the division will be 0, as it would be anyway in integer arithmetic.
Still:
1000000/bar
performs a division between longs: 1000000 is an int or a long, depending on the platform, bar is a long; the first operand gets promoted to a long if necessary and then an integer division is performed.
1000000.0/bar
performs a division between doubles: 1000000.0 is a double literal, so bar gets promoted to double before the division.
(long)1000000/bar
is equivalent to the first one: the cast has precedence over the division, and forces 1000000 (which is either a long or an int) to be a long; bar is a long, division between longs is performed.
(long)1000000.0/bar
is equivalent to the previous one: 1000000.0 is a double, but you cast it to a long and then integer division is performed.
The C standard, to which the C++ standard delegates the matter, asks for a minimum of 10 decimal digits for the mantissa of doubles (DBL_DIG) and at least 10**37 as representable power of ten before going out of range (DBL_MAX_10_EXP) (C99, annex E, ¶4).

The first line (and third more verbosely) will do the math as long (whihc in C++ always truncates down any result) and then return the integral value as a double. I don't understand what you're saying in your question about bar being 33.3333 because that's not a possible long value.

Related

When a 64bit int is cast to 64bit float in C/C++ and doesn't have an exact match, will it always land on a non-fractional number?

When int64_t is cast to double and doesn't have an exact match, to my knowledge I get a sort of best-effort-nearest-value equivalent in double. For example, 9223372036854775000 in int64_t appears to end up as 9223372036854774784.0 in double:
#include <stdio.h>
int main(int argc, const char **argv) {
printf("Corresponding double: %f\n", (double)9223372036854775000LL);
// Outputs: 9223372036854774784.000000
return 0;
}
It appears to me as if an int64_t cast to a double always ends up on as a clean non-fractional number, even in this higher number range where double has really low precision. However, I just observed this from random attempts. Is this guaranteed to happen for any value of int64_t cast to a double?
And if I cast this non-fractional double back to int64_t, will I always get the exact corresponding 64bit int with the .0 chopped off? (Assuming it doesn't overflow during the conversion back.) Like here:
#include <inttypes.h>
#include <stdio.h>
int main(int argc, const char **argv) {
printf("Corresponding double: %f\n", (double)9223372036854775000LL);
// Outputs: 9223372036854774784.000000
printf("Corresponding int to corresponding double: %" PRId64 "\n",
(int64_t)((double)9223372036854775000LL));
// Outputs: 9223372036854774784
return 0;
}
Or can it be imprecise and get me the "wrong" int in some corner cases?
Intuitively and from my tests the answer to both points appears to be "yes", but if somebody with a good formal understanding of the floating point standards and the maths behind it could confirm this that would be really helpful to me. I would also be curious if any known more aggressive optimizations like gcc's -Ofast are known to break any of this.
In general case yes, both should be true. The floating point base needs to be - if not 2, then at least integer and given that, an integer converted to nearest floating point value can never produce non-zero fractions - either the precision suffices or the lowest-order integer digits in the base of the floating type would be zeroed. For example in your case your system uses ISO/IEC/IEEE 60559 binary floating point numbers. When inspected in base 2, it can be seen that the trailing digits of the value are indeed zeroed:
>>> bin(9223372036854775000)
'0b111111111111111111111111111111111111111111111111111110011011000'
>>> bin(9223372036854774784)
'0b111111111111111111111111111111111111111111111111111110000000000'
The conversion of a double without fractions to an integer type, given that the value of the double falls within the range of the integer type should be exact...
Though you still might encounter a quality-of-implementation issue, or an outright bug - for example MSVC currently has a compiler bug where a round-trip conversion of unsigned 32-bit value with MSB set (or just double value between 2³¹ and 2³²-1 converted to unsigned int) would "overflow" in the conversion and always result in exactly 2³¹.
The following assumes the value being converted is positive. The behavior of negative numbers is analogous.
C 2018 6.3.1.4 2 specifies conversions from integer to real and says:
… If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner.
This tells us that some integer value x being converted to floating-point can produce a non-integer only if one of the two representable values bounding x is not an integer and x is not representable.
5.2.4.2.2 specifies the model used for floating-point numbers. Each finite floating-point number is represented by a sequence of digits in a certain base b scaled by be for some exponent e. (b is an integer greater than 1.) Then, if one of the two values bounding x, say p is not an integer, the scaling must be such that the lowest digit in that floating-point number represents a fraction. But if this is the case, then setting all of the digits in p that represent fractions to 0 must produce a new floating-point number that is an integer. If x < p, this integer must be x, and therefore x is representable in the floating-point format. On the other hand, if p < x, we can add enough to each digit that represents a fraction to make it 0 (and produce a carry to the next higher digit). This will also produce an integer representable in the floating-point type1, and it must be x.
Therefore, if conversion of an integer x to the floating-point type would produce a non-integer, x must be representable in the type. But then conversion to the floating-point type must produce x. So it is never possible to produce a non-integer.
Footnote
1 It is possible this will carry out of all the digits, as when applying it to a three-digit decimal number 9.99, which produces 10.00. In this case, the value produced is the next power of b, if it is in range of the floating-point format. If it is not, the C standard does not define the behavior. Also note the C standard sets minimum requirements on the range that floating-point formats must support which preclude any format from not being able to represent 1, which avoids a degenerate case in which a conversion could produce a number like .999 because it was the largest representable finite value.
When a 64bit int is cast to 64bit float ... and doesn't have an exact match, will it always land on a non-fractional number?
Is this guaranteed to happen for any value of int64_t cast to a double?
For common double: Yes, it always land on a non-fractional number
When there is no match, the result is the closest floating point representable value above or below, depending on rounding mode. Given the characteristics of common double, these 2 bounding values are also whole numbers. When the value is not representable, there is first a nearby whole number one.
... if I cast this non-fractional double back to int64_t, will I always get the exact corresponding 64bit int with the .0 chopped off?
No. Edge cases near INT64_MAX fail as the converted value could become a FP value above INT64_MAX. Then conversion back to the integer type incurs: "the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised." C17dr § 6.3.1.3 3
#include <limits.h>
#include <string.h>
int main() {
long long imaxm1 = LLONG_MAX - 1;
double max = (double) imaxm1;
printf("%lld\n%f\n", imaxm1, max);
long long imax = (long long) max;
printf("%lld\n", imax);
}
9223372036854775806
9223372036854775808.000000
9223372036854775807 // Value here is implementation defined.
Deeper exceptions
(Question variation) When an N bit integer type is cast to a floating point and doesn't have an exact match, will it always land on a non-fractional number?
Integer type range exceeds finite float point
Conversion to infinity: With common float, and uint128_t, UINT128_MAX converts to infinity. This is readily possible with extra wide integer types.
int main() {
unsigned __int128 imaxm1 = 0xFFFFFFFFFFFFFFFF;
imaxm1 <<= 64;
imaxm1 |= 0xFFFFFFFFFFFFFFFF;
double fmax = (float) imaxm1;
double max = (double) imaxm1;
printf("%llde27\n%f\n%f\n", (long long) (imaxm1/1000000000/1000000000/1000000000),
fmax, max);
}
340282366920e27
inf
340282366920938463463374607431768211456.000000
Floating point precession deep more than range
On some unicorn implementation, with very wide FP precision and small range, the largest finite could, in theory, not practice, be a non-whole number. Then with an even wider integer type, the conversion could result in this non-whole number value. I do not see this as a legit concern of OP's.

why c++ is rounding of big numbers to ceil and small numbers to floor [duplicate]

This question already has answers here:
Is floating point math broken?
(31 answers)
Closed 2 years ago.
"x","y" are two long long type variables in c++ to which i have assigned two different numbers.
Variable type is long long but i have assigned decimals to the integer.
so i expected that it will trim the decimal part and display only integer part.
it trimmed off the numbers after the decimal and retured an integer.
Output :
i was expecting floor() of x but it returned some integer ending with 6 instead of 5, i mean it returned ceil(x).but in the second case it returned floor(y).
and its only occurring when the integer is too long.
So what might be the possible reason for this ?
I am using minGW c++17 version on visual studio code .. but same is happening with online compiler also.
Each initialization involves two conversions, first from the decimal numeral in the source text to double, and then from double to long long.
Let’s discuss the second declaration first. Because 2.001 is a double constant, the decimal source text 2.001 must be converted to double. Assuming your C implementation uses IEEE-754 binary64, the result is 2.000999999999999889865875957184471189975738525390625. Then, for the initialization, this double value is converted to long long. This conversion to an integer type discards the fraction, so the result is 2.
In the first declaration, when 9223372036854775.001 is converted to double, the result is 9223372036854776. This is because the two double numbers nearest 9223372036854775.001 are 9223372036854774 and 9223372036854776. The latter one is closer, so it is chosen. Then this double value is converted to long long. There is no fraction part, so the result is simply 9223372036854776.
Thus, the first conversion rounds up because it is not simply converting to the nearest long long value. It first has to round to the nearest double value. And, at the scale of that number, the double format does not have enough resolution to represent every integer. It is representing only every second integer: 9223372036854770, …772, …774, …776, …778, and so on. So 9223372036854775 is not a candidate.

Outside the range of representable values when casting double to long long with UBSAN enabled

Compiling the code below with UBSAN enabled causes this error:
Runtime error: value 9.22337e+18 is outside the range of representable
values of type 'long long'
double a = (double)LLONG_MAX; // or (double)LLONG_MAX - 1
long long b = (long long)a;
Is this a known issue? My understanding is that the value is not outside the range.
What you are seeing is expected behavior.
LLONG_MAX is a 64-bit integer on your system, needs 63 bits to fully represent its value (since one bit is used to hold the sign). A typical double (using IEEE arithmetic) only has 53 bits of precision.
When you assign an integer to a double, the integer value needs to be converted to a double. When the integer needs more bits to represent its value than the double can store, this conversion will result in a loss of some of the precision of the value, which is accomplished thru rounding. It is an implementation detail if this will round up or round down, but under IEEE arithmetic rules the rounding will be to nearest.
Because on your system the long value is rounded up, the resulting double value cannot be stored in a long long, resulting in your runtime error.

Loss of precision for int to float conversion

In C++, the conversion of an integer value of type I to a floating point type F will be exact — as static_cast<I>(static_cast<F>(i)) == i — if the range of I is a part of the range of integral values of F.
Is it possible, and if yes how, to calculate the loss of precision of static_cast<F>(i) (without using another floating point type with a wider range)?
As a start, I tried to code a function that would return if a conversion is safe or not (safe, meaning no loss of precision), but I must admit I am not so sure about its correctness.
template <class F, class I>
bool is_cast_safe(I value)
{
return std::abs(alue) < std::numeric_limits<F>::digits;
}
std::cout << is_cast_safe<float>(4) << std::endl; // true
std::cout << is_cast_safe<float>(0x1000001) << std::endl; // false
Thanks in advance.
is_cast_safe can be implemented with:
static const F One = 1;
F ULP = std::scalbn(One, std::ilogb(value) - std::numeric_limits<F>::digits + 1);
I U = std::max(ULP, One);
return value % U;
This sets ULP to the value of the least digit position in the result of converting value to F. ilogb returns the position (as an exponent of the floating-point radix) for the highest digit position, and subtracting one less than the number of digits adjusts to the lowest digit position. Then scalbn gives us the value of that position, which is the ULP.
Then value can be represented exactly in F if and only if it is a multiple of the ULP. To test that, we convert the ULP to I (but substitute 1 if it is less than 1), and then take the remainder of value divided by the ULP (or 1).
Also, if one is concerned the conversion to F might overflow, code can be inserted to handle this as well.
Calculating the actual amount of the change is trickier. The conversion to floating-point could round up or down, and the rule for choosing is implementation-defined, although round-to-nearest-ties-to-even is common. So the actual change cannot be calculated from the floating-point properties we are given in numeric_limits. It must involve performing the conversion and doing some work in floating-point. This definitely can be done, but it is a nuisance. I think an approach that should work is:
Assume value is non-negative. (Negative values can be handled similarly but are omitted for now for simplicity.)
First, test for overflow in conversion to F. This in itself is tricky, as the behavior is undefined if the value is too large. Some similar considerations were addressed in this answer to a question about safely converting from floating-point to integer (in C).
If the value does not overflow, then convert it. Let the result be x. Divide x by the floating-point radix r, producing y. If y is not an integer (which can be tested using fmod or trunc) the conversion was exact.
Otherwise, convert y to I, producing z. This is safe because y is less than the original value, so it must fit in I.
Then the error due to conversion is (z-value/r)*r + value%r.
I loss = abs(static_cast<I>(static_cast<F>(i))-i) should do the job. The only exception if i's magnitude is large, so static_cast<F>(i) would generate an out-of-I-range F.
(I supposed here that I abs(I) is available)

What precision are floating-point arithmetic operations done in?

Consider two very simple multiplications below:
double result1;
long double result2;
float var1=3.1;
float var2=6.789;
double var3=87.45;
double var4=234.987;
result1=var1*var2;
result2=var3*var4;
Are multiplications by default done in a higher precision than the operands? I mean in case of first multiplication is it done in double precision and in case of second one in x86 architecture is it done in 80-bit extended-precision or we should cast operands in expressions to the higher precision ourselves like below?
result1=(double)var1*(double)var2;
result2=(long double)var3*(long double)var4;
What about other operations(add, division and remainder)? For example when adding more than two positive single-precision values, using extra significant bits of double-precision can decrease round-off errors if used to hold intermediate results of expression.
Precision of floating-point computations
C++11 incorporates the definition of FLT_EVAL_METHOD from C99 in cfloat.
FLT_EVAL_METHOD
Possible values:
-1 undetermined
0 evaluate just to the range and precision of the type
1 evaluate float and double as double, and long double as long double.
2 evaluate all as long double
If your compiler defines FLT_EVAL_METHOD as 2, then the computations of r1 and r2, and of s1 and s2 below are respectively equivalent:
double var3 = …;
double var4 = …;
double r1 = var3 * var4;
double r2 = (long double)var3 * (long double)var4;
long double s1 = var3 * var4;
long double s2 = (long double)var3 * (long double)var4;
If your compiler defines FLT_EVAL_METHOD as 2, then in all four computations above, the multiplication is done at the precision of the long double type.
However, if the compiler defines FLT_EVAL_METHOD as 0 or 1, r1 and r2, and respectively s1 and s2, aren't always the same. The multiplications when computing r1 and s1 are done at the precision of double. The multiplications when computing r2 and s2 are done at the precision of long double.
Getting wide results from narrow arguments
If you are computing results that are destined to be stored in a wider result type than the type of the operands, as are result1 and result2 in your question, you should always convert the arguments to a type at least as wide as the target, as you do here:
result2=(long double)var3*(long double)var4;
Without this conversion (if you write var3 * var4), if the compiler's definition of FLT_EVAL_METHOD is 0 or 1, the product will be computed in the precision of double, which is a shame, since it is destined to be stored in a long double.
If the compiler defines FLT_EVAL_METHOD as 2, then the conversions in (long double)var3*(long double)var4 are not necessary, but they do not hurt either: the expression means exactly the same thing with and without them.
Digression: if the destination format is as narrow as the arguments, when is extended-precision for intermediate results better?
Paradoxically, for a single operation, rounding only once to the target precision is best. The only effect of computing a single multiplication in extended precision is that the result will be rounded to extended precision and then to double precision. This makes it less accurate. In other words, with FLT_EVAL_METHOD 0 or 1, the result r2 above is sometimes less accurate than r1 because of double-rounding, and if the compiler uses IEEE 754 floating-point, never better.
The situation is different for larger expressions that contain several operations. For these, it is usually better to compute intermediate results in extended precision, either through explicit conversions or because the compiler uses FLT_EVAL_METHOD == 2. This question and its accepted answer show that when computing with 80-bit extended precision intermediate computations for binary64 IEEE 754 arguments and results, the interpolation formula u2 * (1.0 - u1) + u1 * u3 always yields a result between u2 and u3 for u1 between 0 and 1. This property may not hold for binary64-precision intermediate computations because of the larger rounding errors then.
The usual arthimetic conversions for floating point types are applied before multiplication, division, and modulus:
The usual arithmetic conversions are performed on the operands and determine the type of the result.
§5.6 [expr.mul]
Similarly for addition and subtraction:
The usual arithmetic conversions are performed for operands of arithmetic or enumeration type.
§5.7 [expr.add]
The usual arithmetic conversions for floating point types are laid out in the standard as follows:
Many binary operators that expect operands of arithmetic or enumeration type cause conversions and yield result types in a similar way. The purpose is to yield a common type, which is also the type of the result. This pattern is called the usual arithmetic conversions, which are defined as follows:
[...]
— If either operand is of type long double, the other shall be converted to long double.
— Otherwise, if either operand is double, the other shall be converted to double.
— Otherwise, if either operand is float, the other shall be converted to float.
§5 [expr]
The actual form/precision of these floating point types is implementation-defined:
The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined.
§3.9.1 [basic.fundamental]
For floating point multiplication: FP multipliers use internally double the width of the operands to generate an intermediate result, which equals the real result within an infinite precision, and then round it to the target precision. Thus you should not worry about multiplication. The result is correctly rounded.
For floating point addition, the result is also correctly rounded as standard FP adders use extra sufficient 3 guard bits to compute a correctly rounded result.
For division, remainder and other complicated functions, like transcendentals such as sin, log, exp, etc... it depends mainly on the architecture and the used libraries. I recommend you to use the MPFR library if you seek correctly rounded results for division or any other complicated function.
Not a direct answer to your question, but for constant floating-point values (such as the ones specified in your question), the method that yields the least amount of precision-loss would be using the rational representation of each value as an integer numerator divided by an integer denominator, and perform as many integer-multiplications as possible before the actual floating-point-division.
For the floating-point values specified in your question:
int var1_num = 31;
int var1_den = 10;
int var2_num = 6789;
int var2_den = 1000;
int var3_num = 8745;
int var3_den = 100;
int var4_num = 234987;
int var4_den = 1000;
double result1 = (double)(var1_num*var2_num)/(var1_den*var2_den);
long double result2 = (long double)(var3_num*var4_num)/(var3_den*var4_den);
If any of the integer-products are too large to fit in an int, then you can use larger integer types:
unsigned int
signed long
unsigned long
signed long long
unsigned long long