How can I convert an integer to float with rounding towards zero? - c++

When an integer is converted to floating-point, and the value cannot be directly represented by the destination type, the nearest value is usually selected (required by IEEE-754).
I would like to convert an integer to floating-point with rounding towards zero in case the integer value cannot be directly represented by the floating-point type.
Example:
int i = 2147483647;
float nearest = static_cast<float>(i); // 2147483648 (likely)
float towards_zero = convert(i); // 2147483520

Since C++11, one can use fesetround(), the floating-point environment rounding direction manager. There are four standard rounding directions and an implementation is permitted to add additional rounding directions.
#include <cfenv> // for fesetround() and FE_* macros
#include <iostream> // for cout and endl
#include <iomanip> // for setprecision()
#pragma STDC FENV_ACCESS ON
int main(){
    int i = 2147483647;
    std::cout << std::setprecision(10);
    std::fesetround(FE_DOWNWARD);
    std::cout << "round down " << i << " : " << static_cast<float>(i) << std::endl;
    std::cout << "round down " << -i << " : " << static_cast<float>(-i) << std::endl;
    std::fesetround(FE_TONEAREST);
    std::cout << "round to nearest " << i << " : " << static_cast<float>(i) << std::endl;
    std::cout << "round to nearest " << -i << " : " << static_cast<float>(-i) << std::endl;
    std::fesetround(FE_TOWARDZERO);
    std::cout << "round toward zero " << i << " : " << static_cast<float>(i) << std::endl;
    std::cout << "round toward zero " << -i << " : " << static_cast<float>(-i) << std::endl;
    std::fesetround(FE_UPWARD);
    std::cout << "round up " << i << " : " << static_cast<float>(i) << std::endl;
    std::cout << "round up " << -i << " : " << static_cast<float>(-i) << std::endl;
    return 0;
}
Compiled under g++ 7.5.0, the resulting executable outputs
round down 2147483647 : 2147483520
round down -2147483647 : -2147483648
round to nearest 2147483647 : 2147483648
round to nearest -2147483647 : -2147483648
round toward zero 2147483647 : 2147483520
round toward zero -2147483647 : -2147483520
round up 2147483647 : 2147483648
round up -2147483647 : -2147483520
Omitting the #pragma doesn't seem to change anything under g++.
chux comments correctly that the standard doesn't explicitly state that fesetround() affects rounding in static_cast<float>(i). For a guarantee that the set rounding direction affects the conversion, use std::nearbyint and its -f and -l variants. See also std::rint and its many type-specific variants.
I probably should have looked up the format specifier to use a space for positive integers and floats, rather than stuffing it into the preceding string constants.
(I haven't tested the following snippet.) Your convert() function would be something like
float convert(int i, int direction = FE_TOWARDZERO){
    float retVal = 0.;
    int prevdirection = std::fegetround();
    std::fesetround(direction);
    retVal = static_cast<float>(i);
    std::fesetround(prevdirection);
    return retVal;
}

You can use std::nextafter.
int i = 2147483647;
float nearest = static_cast<float>(i); // 2147483648 (likely)
float towards_zero = std::nextafter(nearest, 0.f); // 2147483520
But you have to check whether static_cast<float>(i) is exact: if it is, nextafter would still take one step towards 0, which you probably don't want.
Your convert function might look like this:
float convert(int x){
    if(std::abs(long(static_cast<float>(x))) <= std::abs(long(x)))
        return static_cast<float>(x);
    return std::nextafter(static_cast<float>(x), 0.f);
}
It may be that sizeof(int) == sizeof(long), or even sizeof(int) == sizeof(long long); in those cases the behavior of long(...) is undefined when static_cast<float>(x) exceeds the range of representable values. Depending on the compiler it might still work in these cases.
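(Untested sketch.) A variant that sidesteps the long-width issue: long long is guaranteed to be at least 64 bits, so even float(INT_MAX) == 2^31 converts back without overflow. The name convert_ll is mine:

```cpp
#include <cmath>
#include <cstdlib>

// Hypothetical variant of convert(): long long is at least 64 bits,
// so float(INT_MAX) == 2^31 always fits when converting back.
float convert_ll(int x) {
    long long back = static_cast<long long>(static_cast<float>(x));
    if (std::llabs(back) <= std::llabs(static_cast<long long>(x)))
        return static_cast<float>(x); // conversion did not round away from zero
    return std::nextafter(static_cast<float>(x), 0.f);
}
```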

I understand the question to be restricted to platforms that use IEEE-754 binary floating-point arithmetic, and where float maps to IEEE-754 (2008) binary32. This answer assumes this to be the case.
As other answers have pointed out, if the tool chain and the platform supports this, use the facilities supplied by fenv.h to set the rounding mode for the conversion as desired.
Where those are not available, or slow, it is not difficult to emulate the truncation during int to float conversion. Basically, normalize the integer until the most significant bit is set, recording the required shift count. Now, shift the normalized integer into place to form the mantissa, compute the exponent based on the normalization shift count, and add in the sign bit based on the sign of the original integer. The process of normalization can be sped up significantly if a clz (count leading zeros) primitive is available, maybe as an intrinsic.
The exhaustively tested code below demonstrates this approach for 32-bit integers, see function int32_to_float_rz. I successfully built it as both C and C++ code with the Intel compiler version 13.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <fenv.h>
float int32_to_float_rz (int32_t a)
{
uint32_t i = (uint32_t)a;
int shift = 0;
float r;
// take absolute value of integer
if (a < 0) i = 0 - i;
// normalize integer so MSB is set
if (!(i > 0x0000ffffU)) { i <<= 16; shift += 16; }
if (!(i > 0x00ffffffU)) { i <<= 8; shift += 8; }
if (!(i > 0x0fffffffU)) { i <<= 4; shift += 4; }
if (!(i > 0x3fffffffU)) { i <<= 2; shift += 2; }
if (!(i > 0x7fffffffU)) { i <<= 1; shift += 1; }
// form mantissa with explicit integer bit
i = i >> 8;
// add in exponent, taking into account integer bit of mantissa
if (a != 0) i += (127 + 31 - 1 - shift) << 23;
// add in sign bit
if (a < 0) i |= 0x80000000;
// reinterpret bit pattern as 'float'
memcpy (&r, &i, sizeof r);
return r;
}
#pragma STDC FENV_ACCESS ON
float int32_to_float_rz_ref (int32_t a)
{
float r;
int orig_mode = fegetround ();
fesetround (FE_TOWARDZERO);
r = (float)a;
fesetround (orig_mode);
return r;
}
int main (void)
{
int32_t arg;
float res, ref;
arg = 0;
do {
res = int32_to_float_rz (arg);
ref = int32_to_float_rz_ref (arg);
if (res != ref) {
printf ("error # %08x: res=% 14.6a ref=% 14.6a\n", arg, res, ref);
return EXIT_FAILURE;
}
arg++;
} while (arg);
return EXIT_SUCCESS;
}

A C implementation-dependent solution that I am confident has a C++ counterpart:
Temporarily change the rounding mode; the conversion uses it to determine which way to go in inexact cases.
the nearest value is usually selected (required by IEEE-754).
Is not entirely accurate. The inexact case is rounding mode dependent.
C does not specify this behavior. C allows this behavior, as it is implementation-defined.
If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner.
#include <fenv.h>
float convert(int i) {
    #pragma STDC FENV_ACCESS ON
    int save_round = fegetround();
    fesetround(FE_TOWARDZERO);
    float f = (float) i;
    fesetround(save_round);
    return f;
}

A specified approach.
"the nearest value is usually selected (required by IEEE-754)" implies OP expects IEEE-754 is involved. Many C/C++ implementation do follow much of IEEE-754, yet adherence to that spec is not required. The following relies on C specifications.
Conversion of an integer type to a floating point type is specified as below. Notice conversion is not specified to depend on rounding mode.
When a value of integer type is converted to a real floating type, if the value being converted can be represented exactly in the new type, it is unchanged. If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner. C17dr § 6.3.1.4 2
When the result is not exact, is the converted value the nearest higher or the nearest lower?
A round trip int --> float --> int is warranted.
Round tripping needs to watch out for convert(near_INT_MAX) converting to outside the int range.
Rather than rely on long or long long having a wider range than int (C does not specify this property), let code compare on the negative side as INT_MIN (with 2's complement) can be expected to convert exactly to a float.
float convert(int i) {
    int n = (i < 0) ? i : -i; // n <= 0
    float f = (float) n;
    int rt_n = (int) f; // Overflow not expected on the negative side
    // If f rounded away from 0.0 ...
    if (rt_n < n) {
        f = nextafterf(f, 0.0); // Move toward 0.0
    }
    return (i < 0) ? f : -f;
}

Changing the rounding mode is somewhat expensive, although I think some modern x86 CPUs do rename MXCSR so it doesn't have to drain the out-of-order execution back-end.
If you care about performance, benchmarking njuffa's pure integer version (using shift = __builtin_clz(i); i<<=shift;) against the rounding-mode-changing version would make sense. (Make sure to test in the context you want to use it in; it's so small that it matters how well it overlaps with surrounding code.)
AVX-512 can use rounding-mode overrides on a per-instruction basis, letting you use a custom rounding mode for the conversion at basically the same cost as a normal int->float. (Only available on Intel Skylake-server and Ice Lake CPUs so far, unfortunately.)
#include <immintrin.h>
float int_to_float_trunc_avx512f(int a) {
const __m128 zero = _mm_setzero_ps(); // SSE scalar int->float are badly designed to merge into another vector, instead of zero-extend. Short-sighted Pentium-3 decision never changed for AVX or AVX512
__m128 v = _mm_cvt_roundsi32_ss (zero, a, _MM_FROUND_TO_ZERO |_MM_FROUND_NO_EXC);
return _mm_cvtss_f32(v); // the low element of a vector already is a scalar float so this is free.
}
_mm_cvt_roundi32_ss is a synonym, IDK why Intel defined both i and si names, or if some compilers might only have one.
This compiles efficiently with all 4 mainstream x86 compilers (GCC/clang/MSVC/ICC) on the Godbolt compiler explorer.
# gcc10.2 -O3 -march=skylake-avx512
int_to_float_trunc_avx512f:
vxorps xmm0, xmm0, xmm0
vcvtsi2ss xmm0, xmm0, {rz-sae}, edi
ret
int_to_float_plain:
vxorps xmm0, xmm0, xmm0 # GCC is always cautious about false dependencies, spending an extra instruction to break it, like we did with setzero()
vcvtsi2ss xmm0, xmm0, edi
ret
In a loop, the same zeroed register can be reused as a merge target, allowing the vxorps zeroing to be hoisted out of a loop.
Using _mm_undefined_ps() instead of _mm_setzero_ps(), we can get ICC to skip zeroing XMM0 before converting into it, like clang does for plain (float)i in this case. But ironically, clang which is normally cavalier and reckless about false dependencies compiles _mm_undefined_ps() the same as setzero in this case.
The performance in practice of vcvtsi2ss (scalar integer to scalar single-precision float) is the same whether you use a rounding-mode override or not (2 uops on Ice Lake, same latency: https://uops.info/). The AVX-512 EVEX encoding is 2 bytes longer than the AVX1 VEX encoding.
Rounding mode overrides also suppress FP exceptions (like "inexact"), so you couldn't check the FP environment to later detect if the conversion happened to be exact (no rounding). But in this case, converting back to int and comparing would be fine. (You can do that without risk of overflow because of the rounding towards 0).
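As a sketch of that last point (untested, and it assumes fesetround() affects the cast, as discussed in other answers): because round-toward-zero guarantees |float(i)| <= |i|, the back-conversion to int cannot overflow, so a plain compare detects whether the conversion was exact. The function name is mine:

```cpp
#include <cfenv>
#pragma STDC FENV_ACCESS ON

// With round-toward-zero, the float's magnitude never exceeds |i|,
// so casting back to int is safe; equality means no rounding occurred.
bool exact_float_conversion(int i) {
    int old_mode = std::fegetround();
    std::fesetround(FE_TOWARDZERO);
    float f = static_cast<float>(i);
    std::fesetround(old_mode);
    return static_cast<int>(f) == i;
}
```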

A simple solution is to use a higher precision floating point for comparison. As long as the high precision floating point can exactly represent all integers, we can accurately compare whether the float result was greater.
double should be sufficient with 32 bit integers, and long double is sufficient for 64 bit on most systems, but it's good practice to verify it.
#include <climits> // for CHAR_BIT
#include <cmath>   // for std::abs and std::nextafter
#include <limits>  // for std::numeric_limits

float convert(int x) {
    static_assert(std::numeric_limits<double>::digits
                  >= sizeof(int) * CHAR_BIT);
    float f = x;
    double d = x;
    return std::abs(f) > std::abs(d)
        ? std::nextafter(f, 0.f)
        : f;
}

Shift the integer right by an arithmetic shift until the number of bits agrees with the precision of the floating point arithmetic. Count the shifts.
Convert the integer to float. The result is now precise.
Multiply the resulting float by a power of two corresponding to the number of shifts.
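(Untested sketch of the steps above.) One caveat: an arithmetic right shift of a negative value truncates toward negative infinity, i.e. away from zero, so this sketch shifts the magnitude instead and restores the sign at the end. The function name is mine:

```cpp
#include <cmath>

// Shift the magnitude until it fits in float's 24-bit significand,
// counting the shifts; the int->float conversion is then exact, and
// ldexp scales back by the counted power of two (also exact).
float int_to_float_rz_shift(int x) {
    unsigned m = (x < 0) ? 0u - (unsigned)x : (unsigned)x; // |x| without overflow
    int shifts = 0;
    while (m >= (1u << 24)) { // until the value fits in 24 bits
        m >>= 1;              // dropping low bits truncates the magnitude
        ++shifts;
    }
    float f = std::ldexp((float)m, shifts); // multiply by 2^shifts, exactly
    return (x < 0) ? -f : f;
}
```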

For nonnegative values, this can be done by taking the integer value and shifting right until the highest set bit is less than 24 bits (i.e. the precision of IEEE single) from the right, then shifting back.
For negative values, beware that an arithmetic right shift rounds toward negative infinity, which for a negative number is away from zero. To truncate toward zero you can instead shift the magnitude of the value and negate the result at the end; taking the magnitude as unsigned also avoids the undefined behavior of negating INT_MIN.
Note also that we're already dealing with implementation-defined behavior here, as we're assuming float is IEEE 754 and int is two's complement.
float round_to_zero(int x)
{
    int cnt = 0;
    if (x >= 0) {
        while (x != (x & 0xffffff)) {
            x >>= 1;
            cnt++;
        }
        return x << cnt;
    } else {
        // An arithmetic shift of x itself would round away from zero,
        // so shift the magnitude instead and negate at the end.
        unsigned m = 0u - (unsigned)x;
        while (m != (m & 0xffffffu)) {
            m >>= 1;
            cnt++;
        }
        return -(float)(m << cnt);
    }
}

Related

Sum signed 32-bit int with unsigned 64bit int

On my application, I receive two signed 32-bit int values and I have to store them. I have to create a sort of counter and I don't know when it will be reset, but I'll receive big values frequently. Because of that, in order to store these values, I decided to use two unsigned 64-bit int variables.
The following could be a simple version of the counter.
struct Counter
{
    unsigned int elementNr;
    unsigned __int64 totalLen1;
    unsigned __int64 totalLen2;
    void UpdateCounter(int len1, int len2)
    {
        if(len1 > 0 && len2 > 0)
        {
            ++elementNr;
            totalLen1 += len1;
            totalLen2 += len2;
        }
    }
};
I know that if a smaller type is cast to a bigger one (e.g. int to long) there should be no issues. However, passing from a 32-bit representation to a 64-bit representation and from signed to unsigned at the same time is something new for me.
Reading around, I understood that len1 should be expanded from 32 bits to 64 bits and then sign extension applied. Because unsigned int and signed int have the same rank (Section 4.13), the latter should be converted.
If len1 stores a negative value, passing from signed to unsigned will return a wrong value, which is why I check for positivity at the beginning of the function. However, for positive values there should be no issues, I think.
For clarity I could rewrite UpdateCounter(int len1, int len2) like this:
void UpdateCounter(int len1, int len2)
{
if(len1 > 0 && len2 > 0)
{
++elementNr;
__int64 tmp = len1;
totalLen1 += static_cast<unsigned __int64>(tmp);
tmp = len2;
totalLen2 += static_cast<unsigned __int64>(tmp);
}
}
Might there be some side effects that I have not considered.
Is there another better and safer way to do that?
A little background, just for reference: binary operators such as arithmetic addition work on operands of the same type (the specific CPU instruction to which they are translated depends on the number representation, which must be the same for both instruction operands).
When you write something like this (using fixed width integer types to be explicit):
int32_t a = <some value>;
uint64_t sum = 0;
sum += a;
As you already know this involves an implicit conversion, more specifically the usual arithmetic conversions based on integer conversion rank.
So the expression sum += a; is equivalent to sum += static_cast<uint64_t>(a);: a, having the lesser rank, is converted to the unsigned 64-bit type.
Let's see what happens in this example:
int32_t a = 60;
uint64_t sum = 100;
sum += static_cast<uint64_t>(a);
std::cout << "a=" << static_cast<uint64_t>(a) << " sum=" << sum << '\n';
The output is:
a=60 sum=160
So all is ok as expected. Let's see what happens when adding a negative number:
int32_t a = -60;
uint64_t sum = 100;
sum += static_cast<uint64_t>(a);
std::cout << "a=" << static_cast<uint64_t>(a) << " sum=" << sum << '\n';
The output is:
a=18446744073709551556 sum=40
The result is 40 as expected: this relies on the two's complement integer representation (note: unsigned integer overflow is not undefined behaviour) and all is ok, of course as long as you ensure that the sum does not become negative.
Coming back to your question: you won't have any surprises if you always add positive numbers, or at least ensure that sum will never become negative... until you reach the maximum representable value std::numeric_limits<uint64_t>::max() (2^64-1 = 18446744073709551615 ~ 1.8E19).
If you continue to add numbers indefinitely sooner or later you'll reach that limit (this is valid also for your counter elementNr).
You'd overflow the 64-bit unsigned integer by adding 2^31-1 (2147483647) every millisecond for approximately three months, so in this case it may be advisable to check:
#include <limits>
//...
void UpdateCounter(const int32_t len1, const int32_t len2)
{
if( len1>0 )
{
if( static_cast<decltype(totalLen1)>(len1) <= std::numeric_limits<decltype(totalLen1)>::max()-totalLen1 )
{
totalLen1 += len1;
}
else
{// Would overflow!!
// Do something
}
}
}
When I have to accumulate numbers and I don't have particular requirements about accuracy, I often use double, because the maximum representable value is incredibly high (std::numeric_limits<double>::max() is about 1.79769E+308): to reach overflow I would need to add 2^32-1 = 4294967295 every picosecond for about 1E+279 years.
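A minimal sketch of that approach (the struct and its names are mine): each increment is an int, and a double represents integers exactly up to 2^53, so the running total stays exact far longer than any realistic counter lifetime; past 2^53 it degrades gracefully (losing low-order digits) instead of wrapping around.

```cpp
#include <cstdint>

// Accumulate 32-bit lengths in a double: exact up to 2^53 (~9.0e15),
// then gradual precision loss rather than modular wraparound.
struct DoubleCounter {
    double total = 0.0;
    void add(std::int32_t len) {
        if (len > 0) total += len; // skip non-positive values, as in UpdateCounter
    }
};
```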

Truncating a double floating point at a certain number of digits

I have written the following routine, which is supposed to truncate a C++ double at the n'th decimal place.
double truncate(double number_val, int n)
{
double factor = 1;
double previous = std::trunc(number_val); // remove integer portion
number_val -= previous;
for (int i = 0; i < n; i++) {
number_val *= 10;
factor *= 10;
}
number_val = std::trunc(number_val);
number_val /= factor;
number_val += previous; // add back integer portion
return number_val;
}
Usually, this works great... but I have found that some numbers, most notably those that do not seem to have an exact representation within a double, have issues.
For example, if the input is 2.0029, and I want to truncate it at the fifth place, internally, the double appears to be stored as something somewhere between 2.0028999999999999996 and 2.0028999999999999999, and truncating this at the fifth decimal place gives 2.00289, which might be right in terms of how the number is being stored, but is going to look like the wrong answer to an end user.
If I were rounding instead of truncating at the fifth decimal, everything would be fine, of course, and if I give a double whose decimal representation has more than n digits past the decimal point it works fine as well, but how do I modify this truncation routine so that inaccuracies due to imprecision in the double type and its decimal representation will not affect the result that the end user sees?
I think I may need some sort of rounding/truncation hybrid to make this work, but I'm not sure how I would write it.
Edit: thanks for the responses so far but perhaps I should clarify that this value is not producing output necessarily but this truncation operation can be part of a chain of many different user specified actions on floating point numbers. Errors that accumulate within the double precision over multiple operations are fine, but no single operation, such as truncation or rounding, should produce a result that differs from its actual ideal value by more than half of an epsilon, where epsilon is the smallest magnitude represented by the double precision with the current exponent. I am currently trying to digest the link provided by iinspectable below on floating point arithmetic to see if it will help me figure out how to do this.
Edit: well the link gave me one idea, which is sort of hacky but it should probably work which is to put a line like number_val += std::numeric_limits<double>::epsilon() right at the top of the function before I start doing anything else with it. Dunno if there is a better way, though.
Edit: I had an idea while I was on the bus today, which I haven't had a chance to thoroughly test yet, but it works by rounding the original number to 16 significant decimal digits, and then truncating that:
double truncate(double number_val, int n)
{
bool negative = false;
if (number_val == 0) {
return 0;
} else if (number_val < 0) {
number_val = -number_val;
negative = true;
}
int pre_digits = std::log10(number_val) + 1;
if (pre_digits < 17) {
int post_digits = 17 - pre_digits;
double factor = std::pow(10, post_digits);
number_val = std::round(number_val * factor) / factor;
factor = std::pow(10, n);
number_val = std::trunc(number_val * factor) / factor;
} else {
number_val = std::round(number_val);
}
if (negative) {
number_val = -number_val;
}
return number_val;
}
Since a double precision floating point number only can have about 16 digits of precision anyways, this just might work for all practical purposes, at a cost of at most only one digit of precision that the double would otherwise perhaps support.
I would like to further note that this question differs from the suggested duplicate above in that a) this is using C++, and not Java... I don't have a DecimalFormatter convenience class, and b) I am wanting to truncate, not round, the number at the given digit (within the precision limits otherwise allowed by the double datatype), and c) as I have stated before, the result of this function is not supposed to be a printable string... it is supposed to be a native floating point number that the end user of this function might choose to further manipulate. Accumulated errors over multiple operations due to imprecision in the double type are acceptable, but any single operation should appear to perform correctly to the limits of the precision of the double datatype.
OK, if I understand this right, you've got a floating point number and you want to truncate it to n digits:
10.099999
^^ n = 2
becomes
10.09
^^
But your function is truncating the number to an approximately close value:
10.08999999
^^
Which is then displayed as 10.08?
How about you keep your truncate formula, which does truncate as well as it can, and use std::setprecision and std::fixed to round the truncated value to the required number of decimal places? (Assuming it is std::cout you're using for output?)
#include <iostream>
#include <iomanip>
using std::cout;
using std::setprecision;
using std::fixed;
using std::endl;
int main() {
double foo = 10.08995; // let's imagine this is the output of `truncate`
cout << foo << endl; // displays 10.0899
cout << setprecision(2) << fixed << foo << endl; // rounds to 10.09
}
I've set up a demo on wandbox for this.
I've looked into this. It's hard because you have inaccuracies due to the binary floating-point representation, then further inaccuracies due to the decimal conversion. 0.1 cannot be represented exactly in binary floating point. However, you can use the standard library function sprintf with a %g argument, which should round accurately for you.
#include <cstdio>  // for sprintf
#include <cstdlib> // for strtod

char out[64];
double x = 0.11111111;
int n = 3;
double xrounded;
sprintf(out, "%.*g", n, x);
xrounded = strtod(out, 0);
Get double as a string
If you are looking just to print the output, then it is very easy and straightforward using stringstream:
#include <cmath>
#include <iostream>
#include <iomanip>
#include <limits>
#include <sstream>
using namespace std;
string truncateAsString(double n, int precision) {
stringstream ss;
double remainder = static_cast<double>((int)floor((n - floor(n)) * precision) % precision);
ss << setprecision(numeric_limits<double> ::max_digits10 + __builtin_ctz(precision))<< floor(n);
if (remainder)
ss << "." << remainder;
cout << ss.str() << endl;
return ss.str();
}
int main(void) {
double a = 9636346.59235;
int precision = 1000; // as many digits as you add zeroes. 3 zeroes means precision of 3.
string s = truncateAsString(a, precision);
return 0;
}
Getting the divided floating point with an exact value
Maybe you are looking for true value for your floating point, you can use boost multiprecision library
The Boost.Multiprecision library can be used for computations requiring precision exceeding that of standard built-in types such as float, double and long double. For extended-precision calculations, Boost.Multiprecision supplies a template data type called cpp_dec_float. The number of decimal digits of precision is fixed at compile-time via template parameter.
Demonstration
#include <boost/math/constants/constants.hpp>
#include <boost/multiprecision/cpp_dec_float.hpp>
#include <iostream>
#include <limits>
#include <cmath>
#include <iomanip>
using boost::multiprecision::cpp_dec_float_50;
cpp_dec_float_50 truncate(cpp_dec_float_50 n, int precision) {
cpp_dec_float_50 remainder = static_cast<cpp_dec_float_50>((int)floor((n - floor(n)) * precision) % precision) / static_cast<cpp_dec_float_50>(precision);
return floor(n) + remainder;
}
int main(void) {
int precision = 100000; // as many digits as you add zeroes. 5 zeroes means precision of 5.
cpp_dec_float_50 n = 9636346.59235789;
n = truncate(n, precision); // first part is remainder, floor(n) is int value truncated.
cout << setprecision(numeric_limits<cpp_dec_float_50> ::max_digits10 + __builtin_ctz(precision)) << n << endl; // __builtin_ctz(precision) will equal the number of trailing 0, exactly the precision we need!
return 0;
}
Output:
9636346.59235
NB: Requires sudo apt-get install libboost-all-dev

What happens if I set an int equal to the sum of the 2 largest ints?

int sum;
The largest number represented by an int N is 2^0 + 2^1 + ... + 2^(sizeof(int)*8-1). What happens if I set sum = N + N? I'm somewhat new to programming, just so you know.
If the result of an int addition exceeds the range of values that can be represented in an int (INT_MIN .. INT_MAX), you have an overflow.
For signed integer types, the behavior of integer overflow is undefined.
On many implementations, you'll usually get a result that's consistent with ignoring all but the low-order N bits of the mathematical result (where N is the number of bits in an int) -- but the language does not guarantee that.
Furthermore, a compiler is permitted to assume that your code's behavior is defined, and to perform optimizations based on that assumption.
For example, with clang++ 3.0, this program:
#include <climits>
#include <iostream>
int max() { return INT_MAX; }
int main()
{
int x = max();
int y = x + 1;
if (x < y) {
std::cout << x << " < " << y << "\n";
}
}
prints nothing when compiled with -O0, but prints
2147483647 < -2147483648
when compiled with -O1 or higher.
Summary: Don't do that.
(Incidentally, the largest representable value of type int is more simply expressed as 2^(N-1) - 1, where N is the width in bits of type int -- or even more simply as INT_MAX. For a typical system with 32-bit int, that's 2^31 - 1, or 2147483647. You're also assuming that sizeof is in units of 8 bits; in fact it's in units of CHAR_BIT bits, where CHAR_BIT is at least 8, but can (very rarely) be larger.)
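A quick sketch checking that relationship (the variable names are mine; the computation avoids shifting into the sign bit, which would overflow):

```cpp
#include <climits>
#include <limits>

// INT_MAX == 2^(N-1) - 1, where N = sizeof(int) * CHAR_BIT (assuming
// no padding bits). Computed as (2^(N-2) - 1) * 2 + 1 to stay in range.
constexpr int int_width = sizeof(int) * CHAR_BIT;
constexpr int max_from_width = ((1 << (int_width - 2)) - 1) * 2 + 1;
static_assert(max_from_width == INT_MAX, "formula matches INT_MAX");
static_assert(std::numeric_limits<int>::max() == INT_MAX, "same limit");
```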
You will have an integer overflow, as the number will be too big for an int; if you want to do this you will have to declare it as a long.
Hope this helps

How to detect double precision floating point overflow and underflow?

I have following variables:
double dblVar1;
double dblVar2;
They may have big values but less than double max.
I have various arithmetic on above variables like addition, multiplication and power:
double dblVar3 = dblVar1 * dblVar2;
double dblVar4 = dblVar1 + dblVar2;
double dblVar5 = pow(dblVar1, 2);
In all above I have to check overflow and underflow. How can I achieve this in C++?
A lot depends on context. To be perfectly portable, you have to
check before the operation, e.g. (for addition):
if ( (a < 0.0) == (b < 0.0)
&& std::abs( b ) > std::numeric_limits<double>::max() - std::abs( a ) ) {
// Addition would overflow...
}
Similar logic can be used for the four basic operators.
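For instance, a sketch of the analogous pre-check for multiplication (the function name is mine): a * b overflows when |a| exceeds max / |b|, for nonzero operands (up to rounding right at the boundary).

```cpp
#include <cmath>
#include <limits>

// Pre-check whether a * b would exceed the finite double range.
bool mul_would_overflow(double a, double b) {
    if (a == 0.0 || b == 0.0) return false; // product is zero
    return std::abs(a) > std::numeric_limits<double>::max() / std::abs(b);
}
```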
If all of the machines you target support IEEE (which is
probably the case if you don't have to consider mainframes), you
can just do the operations, then use isfinite or isinf on
the results.
For underflow, the first question is whether a gradual underflow
counts as underflow or not. If not, then simply checking if the
results are zero and a != -b would do the trick. If you want
to detect gradual underflow (which is probably only present if
you have IEEE), then you can use isnormal—this will
return false if the results correspond to gradual underflow.
(Unlike overflow, you test for underflow after the operation.)
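A sketch of those post-operation tests on IEEE machines (the function names are mine):

```cpp
#include <cmath>

// isinf detects overflow to infinity; a nonzero finite result that is
// not normal is a subnormal, i.e. gradual underflow occurred.
bool overflowed(double r) { return std::isinf(r); }
bool gradually_underflowed(double r) {
    return r != 0.0 && std::isfinite(r) && !std::isnormal(r);
}
```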
POSIX, C99, C++11 have <fenv.h> (and <cfenv> for C++11) which have functions to test the IEEE754 exceptions flags (which have nothing to do with C++ exceptions, it would be too easy):
int feclearexcept(int);
int fegetexceptflag(fexcept_t *, int);
int feraiseexcept(int);
int fesetexceptflag(const fexcept_t *, int);
int fetestexcept(int);
The flag is a bitfield with the following bits defined:
FE_DIVBYZERO
FE_INEXACT
FE_INVALID
FE_OVERFLOW
FE_UNDERFLOW
So you can clear them before the operations and then test them after. You'll have to check the documentation for the effect of library functions on them.
With a decent compiler (which supports the newest C++ standard), you can use these functions:
#include <cfenv>
#include <iostream>
int main() {
std::feclearexcept(FE_OVERFLOW);
std::feclearexcept(FE_UNDERFLOW);
double overflowing_var = 1000;
double underflowing_var = 0.01;
std::cout << "Overflow flag before: " << (bool)std::fetestexcept(FE_OVERFLOW) << std::endl;
std::cout << "Underflow flag before: " << (bool)std::fetestexcept(FE_UNDERFLOW) << std::endl;
for(int i = 0; i < 20; ++i) {
overflowing_var *= overflowing_var;
underflowing_var *= underflowing_var;
}
std::cout << "Overflow flag after: " << (bool)std::fetestexcept(FE_OVERFLOW) << std::endl;
std::cout << "Underflow flag after: " << (bool)std::fetestexcept(FE_UNDERFLOW) << std::endl;
}
/** Output:
Overflow flag before: 0
Underflow flag before: 0
Overflow flag after: 1
Underflow flag after: 1
*/
ISO C99 defines functions to query and manipulate the floating-point status word. You can use these functions to check for untrapped exceptions when it's convenient, rather than worrying about them in the middle of a calculation.
It provides
FE_INEXACT
FE_DIVBYZERO
FE_UNDERFLOW
FE_OVERFLOW
FE_INVALID
For example
{
double f;
int raised;
feclearexcept (FE_ALL_EXCEPT);
f = compute ();
raised = fetestexcept (FE_OVERFLOW | FE_INVALID);
if (raised & FE_OVERFLOW) { /* ... */ }
if (raised & FE_INVALID) { /* ... */ }
/* ... */
}
http://www.gnu.org/software/libc/manual/html_node/Status-bit-operations.html

How to perform a bitwise operation on floating point numbers

I tried this:
float a = 1.4123;
a = a & (1 << 3);
I get a compiler error saying that the operand of & cannot be of type float.
When I do:
float a = 1.4123;
a = (int)a & (1 << 3);
I get the program running. The only thing is that the bitwise operation is done on the integer representation of the number obtained after truncating it.
The following is also not allowed.
float a = 1.4123;
a = (void*)a & (1 << 3);
I don't understand why int can be cast to void* but not float.
I am doing this to solve the problem described in Stack Overflow question How to solve linear equations using a genetic algorithm?.
At the language level, there's no such thing as a "bitwise operation on floating-point numbers". Bitwise operations in C/C++ work on the value representation of a number. And the value representation of floating-point numbers is not defined in C/C++ (unsigned integers are an exception in this regard, as their value representation is fully specified by the standard). Floating-point numbers don't have bits at the level of value representation, which is why you can't apply bitwise operations to them.
All you can do is analyze the bit content of the raw memory occupied by the floating-point number. For that you need to either use a union as suggested below or (equivalently, and only in C++) reinterpret the floating-point object as an array of unsigned char objects, as in
float f = 5;
unsigned char *c = reinterpret_cast<unsigned char *>(&f);
// inspect memory from c[0] to c[sizeof f - 1]
And please, don't try to reinterpret a float object as an int object, as other answers suggest. That doesn't make much sense, and is not guaranteed to work in compilers that follow strict-aliasing rules in optimization. The correct way to inspect memory content in C++ is by reinterpreting it as an array of [signed/unsigned] char.
Also note that you technically aren't guaranteed that floating-point representation on your system is IEEE754 (although in practice it is unless you explicitly allow it not to be, and then only with respect to -0.0, ±infinity and NaN).
If you are trying to change the bits in the floating-point representation, you could do something like this:
union fp_bit_twiddler {
float f;
int i;
} q;
q.f = a;
q.i &= (1 << 3);
a = q.f;
As AndreyT notes, accessing a union like this invokes undefined behavior, and the compiler could grow arms and strangle you. Do what he suggests instead.
You can work around the strict-aliasing rule and perform bitwise operations on a float type-punned as a uint32_t (if your implementation defines it, which most do) without undefined behavior by using memcpy():
#include <cstdint> // for uint32_t
#include <cstring> // for std::memcpy

float a = 1.4123f;
uint32_t b;
static_assert(sizeof a == sizeof b, "float must be 32 bits wide");
std::memcpy(&b, &a, sizeof a);
// perform bitwise operation
b &= 1u << 3;
std::memcpy(&a, &b, sizeof a);
float a = 1.4123;
unsigned int* inta = reinterpret_cast<unsigned int*>(&a);
*inta = *inta & (1 << 3);
Have a look at the following. Inspired by fast inverse square root:
#include <iostream>
using namespace std;
int main()
{
float x, td = 2.0;
int ti = *(int*) &td;
cout << "Cast int: " << ti << endl;
ti = ti>>4;
x = *(float*) &ti;
cout << "Recast float: " << x << endl;
return 0;
}
FWIW, there is a real use case for bit-wise operations on floating point (I just ran into it recently) - shaders written for OpenGL implementations that only support older versions of GLSL (1.2 and earlier did not have support for bit-wise operators), and where there would be loss of precision if the floats were converted to ints.
The bit-wise operations can be implemented on floating point numbers using remainders (modulo) and inequality checks. For example:
float A = 0.625; //value to check; ie, 160/256
float mask = 0.25; //bit to check; ie, 1/4
bool result = (mod(A, 2.0 * mask) >= mask); //non-zero if bit 0.25 is on in A
The above assumes that A is between [0..1) and that there is only one "bit" in mask to check, but it could be generalized for more complex cases.
This idea is based on some of the info found in is-it-possible-to-implement-bitwise-operators-using-integer-arithmetic
If there is not even a built-in mod function, then that can also be implemented fairly easily. For example:
float mod(float num, float den)
{
return num - den * floor(num / den);
}
#mobrule:
Better:
#include <stdint.h>
...
union fp_bit_twiddler {
float f;
uint32_t u;
} q;
/* mutatis mutandis ... */
For these values int will likely be ok, but generally, you should use
unsigned ints for bit shifting to avoid the effects of arithmetic shifts. And
the uint32_t will work even on systems whose ints are not 32 bits.
The Python implementation in Floating point bitwise operations (Python recipe) works by representing numbers in binary that extends infinitely to the left as well as to the right of the radix point. Because floating-point numbers have a signed zero on most architectures, it uses ones' complement for representing negative numbers (well, actually it just pretends to do so and uses a few tricks to achieve the appearance).
I'm sure it can be adapted to work in C++, but care must be taken so as to not let the right shifts overflow when equalizing the exponents.
Bitwise operators should NOT be used on floats, as float representations are hardware specific, regardless of similarity on whatever hardware you might have. Which project/job do you want to risk on "well, it worked on my machine"? Instead, for C++, you can get a similar "feel" for the bit-shift operators by overloading them on an "object" wrapper for a float:
// Simple object wrapper for float type as templates want classes.
class Float
{
float m_f;
public:
Float( const float & f )
: m_f( f )
{
}
operator float() const
{
return m_f;
}
};
float operator>>( const Float & left, int right )
{
float temp = left;
for( ; right > 0; --right ) // halve once per shift step
{
temp /= 2.0f;
}
return temp;
}
float operator<<( const Float & left, int right )
{
float temp = left;
for( ; right > 0; --right ) // double once per shift step
{
temp *= 2.0f;
}
return temp;
}
int main( int argc, char ** argv )
{
int a1 = 40 >> 2;
int a2 = 40 << 2;
int a3 = 13 >> 2;
int a4 = 256 >> 2;
int a5 = 255 >> 2;
float f1 = Float( 40.0f ) >> 2;
float f2 = Float( 40.0f ) << 2;
float f3 = Float( 13.0f ) >> 2;
float f4 = Float( 256.0f ) >> 2;
float f5 = Float( 255.0f ) >> 2;
}
You will have a remainder, which you can throw away based on your desired implementation.