Converting a long double to double with upward (or downward) rounding - c++

Assume that we are working on a platform where the type long double has strictly greater precision than the 64-bit double type. What is the fastest way to convert a given long double to an ordinary double-precision number with some prescribed rounding (upward, downward, round-to-nearest, etc.)?
To make the question more precise, let me give a concrete example: Let N be a given long double, and let M be the double that minimizes the (real) value M - N subject to M ≥ N, i.e. the smallest double not less than N. This M would be the upward-rounded conversion I want to find.
Can I get away with setting the rounding mode of the FP environment appropriately and performing a simple cast (e.g. (double) N)?
Clarification: You can assume that the platform supports the IEEE Standard for Floating-Point Arithmetic (IEEE 754).

Can I get away with setting the rounding mode of the FP environment appropriately and performing a simple cast (e.g. (double) N)?
Yes, as long as the compiler implements IEEE 754 (most of them do, at least roughly). Conversion from one floating-point format to another is one of the operations to which, according to IEEE 754, the rounding mode applies. In order to convert from long double to double rounding upward, set the rounding mode to upward and do the conversion.
In C99, which should be accepted by C++ compilers as well (C++11 standardized the same facilities in <cfenv>):
#include <fenv.h>
#pragma STDC FENV_ACCESS ON
…
fesetround(FE_UPWARD);
double d = (double) ld;
PS: you may discover that your compiler does not implement #pragma STDC FENV_ACCESS ON properly. Welcome to the club.
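For completeness, a minimal self-contained sketch of this approach (the helper name to_double_up and the sample value are mine; it assumes a platform where long double really is wider than double, e.g. x86 extended precision):
#include <fenv.h>
#include <stdio.h>
#pragma STDC FENV_ACCESS ON

/* Convert ld to the smallest double not less than it,
   restoring the caller's rounding mode afterwards. */
double to_double_up(long double ld)
{
    int saved = fegetround();   /* remember the current mode */
    fesetround(FE_UPWARD);
    double d = (double) ld;
    fesetround(saved);          /* restore it */
    return d;
}

int main(void)
{
    long double ld = 1.0L + 0x1p-60L; /* representable as long double, not as double */
    printf("%a\n", to_double_up(ld)); /* expect 0x1.0000000000001p+0 */
    return 0;
}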

Related

C/C++: Are IEEE 754 float addition/multiplication/... and int-to-float conversion standardized?

Example:
#include <math.h>
#include <stdio.h>
int main()
{
    float f1 = 1;
    float f2 = 4.f * 3.f;
    float f3 = 1.f / 1024.f;
    float f4 = 3.f - 2.f;
    printf("%a\n", f1);
    printf("%a\n", f2);
    printf("%a\n", f3);
    printf("%a\n", f4);
    return 0;
}
Output on gcc/clang as expected:
0x1p+0
0x1.8p+3
0x1p-10
0x1p+0
As one can see, the results look "reasonable". However, there are multiple valid ways to display these numbers in %a notation, and an implementation could also have produced numbers that are merely very close to these.
Is it guaranteed in C and in C++ that IEEE 754 floating arithmetic like addition, multiplication and int-to-float conversion yield the same results, on all machines and with all compilers (i.e. that the resulting floats are all bit-wise equal)?
No, unless the macro __STDC_IEC_559__ is defined.
Basically, the standard does not require IEEE 754 compatible floating point, so most compilers will use whatever floating-point support the hardware provides. If the hardware provides IEEE-compatible floating point, most compilers for that target will use it and predefine the __STDC_IEC_559__ macro.
If the macro is defined, it guarantees the bit representation (but not the byte order) of float and double as 32-bit and 64-bit IEEE 754. This in turn guarantees bit-exact results of double arithmetic (but note that the C standard allows float arithmetic to happen at either 32-bit or 64-bit precision).
The C standard requires that float-to-int conversion be the same as the trunc function if the result is in range for the resulting type, but unfortunately IEEE 754 doesn't actually define the behavior of library functions, just of basic arithmetic. The C standard also allows the compiler to reorder operations in violation of IEEE 754 (which might affect precision), but most compilers that support IEEE 754 will not do that without a command-line option.
Anecdotal evidence also suggests that some compilers do not define the macro even though they should, while other compilers define it when they should not (i.e. they do not follow all the requirements of IEEE 754 strictly). These cases should probably be considered compiler bugs.
Is it guaranteed in C and in C++ that IEEE 754 floating arithmetic like addition, multiplication and int-to-float conversion yield the same results, on all machines and with all compilers (i.e. that the resulting floats are all bit-wise equal)?
No
If the exceptional compiler defines __STDC_IEC_559__, then almost yes.
An implementation that defines __STDC_IEC_559__ shall conform to the specifications in this annex.
C17dr Annex F (normative) IEC 60559 floating-point arithmetic
IEEE 754 floating arithmetic like addition, multiplication and int-to-float conversion yields consistent results when FLT_EVAL_METHOD == 0. When FLT_EVAL_METHOD > 0, wider floating-point math may be used for many operations, causing different results. Yet even with FLT_EVAL_METHOD == 0, I have doubts that all FP code will compute exactly the same result.
For highly portable FP code, a variation tolerance should be expected.
OP is also looking for bit-wise equality. FP has endian issues too, so two implementations could meet all IEEE 754 criteria, yet differ in endianness.
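A quick diagnostic sketch for checking what a given implementation claims (both macros are standard C; whether a compiler sets them truthfully is, as noted above, another matter):
#include <float.h>
#include <stdio.h>

int main(void)
{
#ifdef __STDC_IEC_559__
    puts("__STDC_IEC_559__ defined: Annex F conformance claimed");
#else
    puts("__STDC_IEC_559__ not defined");
#endif
    /* 0: evaluate in the type's own precision; >0: wider evaluation */
    printf("FLT_EVAL_METHOD = %d\n", (int) FLT_EVAL_METHOD);
    return 0;
}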
Realize that both the C and C++ standards strive to be inclusive of unusual architectures. They would never mandate strict adherence to IEEE-754.
Also realize that the systems that do use IEEE-754 will rely on the processor architecture to implement it correctly. Your actual question then is how well do the processors conform to the IEEE-754 rules, which is hard to answer with authority. The Intel Pentium famously had a bug that produced wrong results for a tiny subset of operations.
I don't know if the conversion of integers to float is as tightly specified as other operations, but I suspect it is. A 32-bit IEEE-754 float has 24 bits of mantissa, and can therefore hold any 24-bit integer without loss of precision. That would be the range from -16777216 to 16777216. I would be very disappointed in any implementation that couldn't perform the operation 100% reliably. Outside of that range there are integers that can't be represented exactly as floats, so rounding must be applied to determine the final value. For example, there are no valid floats between 2147483520 and 2147483648, so what should happen if you try to convert 2147483583 or 2147483585? I honestly don't know what the result will be, or whether that result would be correct.
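With the default round-to-nearest mode, those two conversions land on opposite sides of the midpoint 2147483584; a small sketch illustrating this (my own example, assuming round-to-nearest-even):
#include <stdio.h>

int main(void)
{
    /* The representable floats surrounding these values are
       2147483520 and 2147483648; the midpoint is 2147483584. */
    float a = (float) 2147483583LL;  /* 63 above 2147483520 -> rounds down */
    float b = (float) 2147483585LL;  /* 63 below 2147483648 -> rounds up   */
    printf("%.1f\n%.1f\n", a, b);    /* expect 2147483520.0 and 2147483648.0 */
    return 0;
}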

Does the same floating-point calculation producing different results when performed twice indicate IEEE 754 non-conformance?

In a specific online judge running 32-bit GCC 7.3.0, this:
#include <iostream>

volatile float three = 3.0f, seven = 7.0f;

int main()
{
    float x = three / seven;
    std::cout << x << '\n';
    float y = three / seven;
    std::cout << (x == y) << '\n';
}
Outputs
0.428571
0
To me this seems like it violates IEEE 754, since the standard requires basic operations to be correctly rounded. Now I know there are a couple reasons for IEEE 754 floating-point calculations to be non-deterministic as discussed here, but I don't see how any of them applies to this example. Here are some of the things I considered:
Excess precision and contraction: I'm doing a single calculation and assigning the result to a float, which should force both of the values to be rounded to float precision.
Compile-time calculations: three and seven are volatile so both calculations must be done at runtime.
Floating-point flags: The calculations are done in the same thread almost immediately after each other, so the flags should be the same.
Does this necessarily indicate that the online judge system doesn't conform to IEEE 754?
Also, removing the statement printing x, adding a statement to print y, or making y volatile all changes the result. This seems to contradict my understanding of the C++ standard which I think requires the assignments to round off any excess precision.
Thanks to geza for pointing out that this is a known issue. I would still like a definitive answer on whether this conforms to the C++ standard and IEEE 754 though, since the C++ standard appears to require assignments to round off excess precision. Here's the quote from draft N4860 [expr.pre]:
The values of the floating-point operands and the results of floating-point expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby.50
50) The cast and assignment operators must still perform their specific conversions as described in 7.6.1.3, 7.6.3, 7.6.1.8 and 7.6.19.
Does this necessarily indicate that the online judge system doesn't conform to IEEE 754?
Yes, with minor caveats.
One, C++ cannot just “conform” to IEEE 754. There has to be some specification of how things in C++ bind (connect) to IEEE 754, such as statements that the float format is IEEE-754 binary32, that x / y uses IEEE-754 division, and so on. C++ 2017 draft N4659 refers to LIA-1, but I do not see that it clearly requires LIA-1 be used even if std::numeric_limits<float>::is_iec559 reports true, and LIA-1 apparently only suggests language bindings.
The C++ standard tells us the fact that std::numeric_limits<float>::is_iec559 reports true means the float type conforms to ISO/IEC/IEEE 60559, which is effectively IEEE 754-2008. But, in addition to the binding problem, I do not see a statement in the C++ standard that nullifies 8 [expr] 13 (“The values of the floating operands and the results of floating expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby.”) when is_iec559 is true. Although it is true that the cast and conversion operators must “perform their specific conversions” (footnote 64), and this forces float y = three / seven; to produce the correct IEEE-754 binary32 results even if binary64 or Intel’s 80-bit floating-point are used for the division, it might not force it to produce the correct result if only a little excess precision is used. (If at least 48 bits of precision are used, no double-rounding errors occur for division when rounded to the 24-bits of the binary32 format. If fewer excess bits are used, there may be some cases that experience double rounding errors.)
I believe the intent of is_iec559 is to indicate a sensible binding, and the behavior shown in the question does violate this. In particular, the defect shown in the question is caused by failing to round the excess precision used in the division to the actual float type; it is not caused by the hypothetical use of less-than-enough excess precision mentioned above.
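Consistent with the observation in the question that making y volatile changes the result, the usual workaround is to force each result through an actual float object; a sketch (my own adaptation of the question's code):
#include <iostream>

volatile float three = 3.0f, seven = 7.0f;

int main()
{
    // Storing to a volatile float forces the excess-precision quotient
    // to be rounded to binary32 before the comparison.
    volatile float x = three / seven;
    volatile float y = three / seven;
    std::cout << (x == y) << '\n';  // expect 1
}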

Should I use bit manipulation on floating point numbers

I'm writing an algorithm to round a floating-point number. The input will be a 64-bit IEEE 754 double type number, very close to X.5, where X is an integer less than 32. The first solution that came to my mind is to use a bit mask to mask off the least significant bits, as they represent very small fractions of 2^-n (given that the exponent is not large).
But the problem is, should I do that? Is there any other way to accomplish the same thing? I feel that using bit operations on floating point is very controversial. Thanks!
The language I'm using is C++, by the way.
Edit:
Thanks, guys, for your comments. I appreciate it! Let's say I have a float number, which can be 1.4999999... or 21.50000012.... I want to round it to 1.5 or 21.5. My goal is to round any number to its nearest X.5 form, since such a value can be stored exactly in an IEEE 754 floating-point number.
If your compiler guarantees that you are using IEEE 754 floating-point, I would recommend that you round according to the method delineated in this blog post: add, and then immediately subtract a large constant so as to send the value in the binade of floating-point numbers where the ULP is 0.5. You won't find any faster method, and it does not involve any bit manipulation.
The appropriate constant to round a number between 0 and 32 to the nearest half-unit for IEEE 754 double-precision is 2251799813685248.0 (2^51).
Summary: use x = x + 2251799813685248.0 - 2251799813685248.0;.
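A minimal demonstration of the trick (volatile is used here only to keep the compiler from folding the arithmetic at compile time; it assumes round-to-nearest-even and no excess precision, as discussed):
#include <stdio.h>

int main(void)
{
    /* In [2^51, 2^52) the ULP of a double is 0.5, so adding and then
       subtracting 2^51 rounds x to the nearest multiple of 0.5. */
    volatile double x = 21.50000012;
    x = x + 2251799813685248.0;
    x = x - 2251799813685248.0;
    printf("%.1f\n", x);  /* expect 21.5 */
    return 0;
}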
You can use any of the functions round(), floor(), ceil(), rint(), nearbyint(), and trunc(). All do rounding in different modes, and all are standard C99. The only thing you need to do is to link against the standard math library by specifying -lm as a compiler flag.
As to trying to achieve rounding by bit manipulations, I would stay away from that: a) it will be much slower than using the functions above (they generally use hardware facilities where possible), b) it is reinventing the wheel with a lot of potential for bugs, and c) the newer C standards don't like you doing bit manipulations on floating-point types: they use the so-called strict aliasing rules that disallow you to just cast a double* to a uint64_t*. You would either need to do your bit manipulation by casting to an unsigned char* and manipulating the IEEE number byte by byte, or you would have to use memcpy() to copy the bit representation from a double variable into a uint64_t and back again. A lot of hassle for something already available in the form of standardized functions and hardware support.
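For reference, a sketch of the memcpy() approach mentioned above (the helper names are mine):
#include <stdint.h>
#include <string.h>

/* Strict-aliasing-safe type punning between double and uint64_t. */
uint64_t double_bits(double d)
{
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    return u;
}

double bits_double(uint64_t u)
{
    double d;
    memcpy(&d, &u, sizeof d);
    return d;
}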
You want to round x to the nearest value of the form d.5. For a general number you write:
round(x+0.5)-0.5
For a number close to d.5, less than 0.25 away, you can use Pascal's offering:
round(2*x)*0.5
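A quick check of both formulas against the values from the question (assuming C99 round() semantics, i.e. ties away from zero):
#include <math.h>
#include <stdio.h>

int main(void)
{
    double a = 1.4999999, b = 21.50000012;
    printf("%.1f %.1f\n", round(a + 0.5) - 0.5, round(b + 0.5) - 0.5);
    printf("%.1f %.1f\n", round(2 * a) * 0.5, round(2 * b) * 0.5);
    /* all four values are expected to print as 1.5 and 21.5 */
    return 0;
}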
If you're looking for a bit trick and are guaranteed to have doubles in the ranges you describe, then you could do something like this (inline as you see fit):
#include <string.h>  // memcpy

void RoundNearestHalf(double &d)
{
    unsigned __int64 bits;
    memcpy(&bits, &d, sizeof bits);  // read the representation without aliasing UB
    unsigned const maskshift = (unsigned)((bits >> 52) - 1023);
    unsigned __int64 const setmask = 0x0008000000000000ULL >> maskshift;
    unsigned __int64 const clearmask = ~(0x0007FFFFFFFFFFFFULL >> maskshift);
    bits |= setmask;    // set the 2^-1 (one-half) bit
    bits &= clearmask;  // clear every mantissa bit below it
    memcpy(&d, &bits, sizeof d);
}
maskshift is the unbiased exponent. For the input range, we know this will be non-negative and no more than 4 (the trick will work for higher values too, but no more than 51). We use this value to make a setmask which sets the 2^-1 (one-half) place in the mantissa, and clearmask which clears all bits in the mantissa of lower value than 2^-1. The result is d rounded to the nearest half.
Note that it would be worth profiling this against other implementations, perhaps using the standard library to determine whether or not its actually faster.
I can't speak about C++ for sure, but in C99 support for the IEEE 754 standard for floating point is optional (Annex F is normative, but an implementation need not conform to it). In C99, if the __STDC_IEC_559__ macro is defined, then the implementation declares that IEC 559 (which is more or less IEEE 754) is used for floating point.
I think it should be pointed out that there are functions to handle many types of rounding for you.

Banker's rounding with Visual C++? [duplicate]

I'm porting a CUDA code to C++ and using Visual Studio 2010. The CUDA code uses the rint function, which does not seem to be present in the Visual Studio 2010 math.h, so it seems that I need to implement it by myself.
According to this link, the CUDA rint function
rounds x to the nearest integer value in floating-point format, with halfway cases rounded towards zero.
I think I could use casting to int, which discards the fractional part, effectively rounding towards zero, so I ended up with the following function
inline double rint(double x)
{
    int temp = (x >= 0. ? (int)(x + 0.5) : (int)(x - 0.5));
    return (double)temp;
}
which has two different castings, one to int and one to double.
I have three questions:
Is the above function fully equivalent to CUDA rint for "small" numbers? Will it fail for "large" numbers that cannot be represented as an int?
Is there any more computationally efficient way (rather than using two casts) of defining rint?
Thank you very much in advance.
The cited description of rint() in the CUDA documentation is incorrect. Roundings to integer with floating-point result map the IEEE-754 (2008) specified rounding modes as follows:
trunc() // round towards zero
floor() // round down (towards negative infinity)
ceil() // round up (towards positive infinity)
rint() // round to nearest or even (i.e. ties are rounded to even)
round() // round to nearest, ties away from zero
Generally, these functions work as described in the C99 standard. For rint(), the standard specifies that the function rounds according to the current rounding mode (which defaults to round to nearest or even). Since CUDA does not support dynamic rounding modes, all functions that are defined to use the current rounding mode use the rounding mode "round to nearest or even". Here are some examples showing the difference between round() and rint():
argument   rint()   round()
1.5        2.0      2.0
2.5        2.0      3.0
3.5        4.0      4.0
4.5        4.0      5.0
round() can be emulated fairly easily along the lines of the code you posted, I am not aware of a simple emulation for rint(). Please note that you would not want to use an intermediate cast to integer, as 'int' supports a narrower numeric range than integers that are exactly representable by a 'double'. Instead use trunc(), ceil(), floor() as appropriate.
Since rint() is part of both the current C and C++ standards, I am a bit surprised that MSVC does not include this function; I would suggest checking MSDN to see whether a substitute is offered. If your platforms are SSE4 capable, you could use the SSE intrinsics _mm_round_sd(), _mm_round_pd() defined in smmintrin.h, with the rounding mode set to _MM_FROUND_TO_NEAREST_INT, to implement the functionality of CUDA's rint().
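A sketch of that SSE4.1 variant (the wrapper name rint_sse41 is mine; it requires SSE4.1 hardware and, with gcc/clang, a flag such as -msse4.1):
#include <stdio.h>
#include <smmintrin.h>  /* SSE4.1 intrinsics */

double rint_sse41(double a)
{
    __m128d v = _mm_set_sd(a);
    /* round the low element to the nearest integer, ties to even */
    v = _mm_round_sd(v, v, _MM_FROUND_TO_NEAREST_INT);
    return _mm_cvtsd_f64(v);
}

int main(void)
{
    printf("%.1f %.1f\n", rint_sse41(2.5), rint_sse41(3.5)); /* 2.0 4.0 */
    return 0;
}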
While (in my experience), the SSE intrinsics are portable across Windows, Linux, and Mac OS X, you may want to avoid hardware specific code. In this case, you could try the following code (lightly tested):
#include <math.h>   /* fabs */
#include <float.h>  /* _copysign with MSVC */

double my_rint(double a)
{
    const double two_to_52 = 4.5035996273704960e+15;
    double fa = fabs(a);
    double r = two_to_52 + fa;
    if (fa >= two_to_52) {
        r = a;
    } else {
        r = r - two_to_52;
        r = _copysign(r, a);
    }
    return r;
}
Note that MSVC 2010 seems to lack the standard copysign() function as well, so I had to substitute _copysign(). The above code assumes that the current rounding mode is round-to-nearest-even (which it is by default). By adding 2^52 it makes sure that rounding occurs at the integer unit bit. Note that this also assumes that pure double-precision computation is performed. On platforms that use some higher precision for intermediate results one might need to declare 'fa' and 'r' as volatile.
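A quick test of my_rint() against the rint() column of the table above:
#include <stdio.h>

/* my_rint() from the listing above is assumed to be defined elsewhere. */
double my_rint(double a);

int main(void)
{
    const double args[] = { 1.5, 2.5, 3.5, 4.5 };
    for (int i = 0; i < 4; ++i)
        printf("%.1f -> %.1f\n", args[i], my_rint(args[i]));
    /* expected: 2.0, 2.0, 4.0, 4.0 (ties rounded to even) */
    return 0;
}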

Converting float to double

How expensive is the conversion of a float to a double? Is it as trivial as an int to long conversion?
EDIT: I'm assuming a platform where float is 4 bytes and double is 8 bytes
Platform considerations
This depends on the platform used for float computation. With the x87 FPU the conversion is free, as the register content is the same; the only price you may sometimes pay is the memory traffic, but in many cases there is no traffic at all, as you can simply use the value without any conversion. x87 is actually a strange beast in this respect: it is hard to properly distinguish between floats and doubles on it, as the instructions and registers used are the same; what differs is the load/store instructions, and the computation precision itself is controlled using status bits. Using mixed float/double computations may produce unexpected results (and there are compiler command-line options to control the exact behaviour and optimization strategies because of this).
When you use SSE (and sometimes Visual Studio uses SSE by default), it may be different, as you may need to transfer the value in the FPU registers or do something explicit to perform the conversion.
Memory savings performance
As a summary, and answering your comment elsewhere: if you want to store the results of float computations into 32-bit storage, the result will be the same speed or faster, because:
If you do this on x87, the conversion is free; the only difference is that fstp dword[] will be used instead of fstp qword[].
If you do this with SSE enabled, you may even see some performance gain, as some float computations can be done with SSE once the precision of the computation is only float instead of the default double.
In all cases the memory traffic is lower.
Float to double conversions happen for free on some platforms (PPC, x86 if your compiler/runtime uses the "to hell with what type you told me to use, i'm going to evaluate everything in long double anyway, nyah nyah" evaluation mode).
On an x86 environment where floating-point evaluation is actually done in the specified type using SSE registers, conversions between float and double are about as expensive as a floating-point add or multiply (i.e., unlikely to be a performance consideration unless you're doing a lot of them).
In an embedded environment that lacks hardware floating-point, they can be somewhat costly.
I can't imagine it'd be too much more complex. The big difference between converting int to long and converting float to double is that the int types have two components (sign and value) while floating point numbers have three components (sign, mantissa, and exponent).
IEEE 754 single precision is encoded in 32 bits using 1 bit for the sign, 8 bits for the exponent, and 23 bits for the significand. However, it uses a hidden bit, so the significand is 24 bits (p = 24), even though it is encoded using only 23 bits.
-- David Goldberg, What Every Computer Scientist Should Know About Floating-Point Arithmetic
So, converting from float to double keeps the same sign bit, copies the float's 23 mantissa bits into the high-order bits of the double's 52-bit mantissa (padding the rest with zeros), and re-biases the float's 8-bit exponent to fit the double's 11-bit exponent field.
This behavior may even be guaranteed by IEEE 754... I haven't checked it, so I'm not sure.
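To make that concrete, here is a sketch of bit-level float-to-double widening for normal numbers (my own illustration; zeros, subnormals, infinities and NaNs would need extra cases, which is why the conversion is best left to the compiler or hardware):
#include <stdint.h>
#include <stdio.h>
#include <string.h>

double widen(float f)
{
    uint32_t s;
    memcpy(&s, &f, sizeof s);
    uint64_t sign = (uint64_t)(s >> 31) << 63;
    uint64_t exp  = (uint64_t)(((s >> 23) & 0xFF) - 127 + 1023) << 52; /* re-bias */
    uint64_t man  = (uint64_t)(s & 0x7FFFFF) << 29;  /* 23 bits -> top of 52 */
    uint64_t bits = sign | exp | man;
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

int main(void)
{
    printf("%f\n", widen(250.84f));  /* matches (double)250.84f: 250.839996 */
    return 0;
}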
This is specific to the C++ implementation you are using. In C++, the default floating-point type is double. A compiler should issue a warning for the following code:
float a = 3.45;
because the double value 3.45 is being assigned to a float. If you need to use float specifically, suffix the value with f:
float a = 3.45f;
The point is, all floating-point literals are by default double. It's safe to stick to this default if you are not sure of the implementation details of your compiler and don't have a significant understanding of floating-point computation. Avoid the cast.
Also see section 4.5 of The C++ Programming Language.
Probably a bit slower than converting int to long, as the memory required is larger and the manipulation is more complex.
A good reference about memory alignment issues
Maybe this helps:
#include <stdlib.h>
#include <stdio.h>
#include <conio.h>
double _ftod(float fValue)
{
    char czDummy[30];
    sprintf(czDummy, "%9.5f", fValue);  // format the float as decimal text
    double dValue = strtod(czDummy, NULL);
    return dValue;
}
int main(int argc, char* argv[])
{
    float fValue(250.84f);
    double dValue = _ftod(fValue);  // good conversion
    double dValue2 = fValue;        // wrong conversion
    printf("%f\n", dValue);   // 250.840000
    printf("%f\n", dValue2);  // 250.839996
    getch();
    return 0;
}
}