How to perform a bitwise operation on floating point numbers

How to perform a bitwise operation on floating point numbers - c++

I tried this:
float a = 1.4123;
a = a & (1 << 3);
I get a compiler error saying that the operand of & cannot be of type float.
When I do:
float a = 1.4123;
a = (int)a & (1 << 3);
I get the program running. The only thing is that the bitwise operation is done on the integer representation of the number obtained after rounding off.
The following is also not allowed.
float a = 1.4123;
a = (void*)a & (1 << 3);
I don't understand why int can be cast to void* but not float.
I am doing this to solve the problem described in Stack Overflow question How to solve linear equations using a genetic algorithm?.

At the language level, there's no such thing as "bitwise operation on floating-point numbers". Bitwise operations in C/C++ work on value-representation of a number. And the value-representation of floating point numbers is not defined in C/C++ (unsigned integers are an exception in this regard, as their shift is defined as-if they are stored in 2's complement). Floating point numbers don't have bits at the level of value-representation, which is why you can't apply bitwise operations to them.
All you can do is analyze the bit content of the raw memory occupied by the floating-point number. For that you need to either use a union as suggested below or (equivalently, and only in C++) reinterpret the floating-point object as an array of unsigned char objects, as in
float f = 5;
unsigned char *c = reinterpret_cast<unsigned char *>(&f);
// inspect memory from c[0] to c[sizeof f - 1]
And please, don't try to reinterpret a float object as an int object, as other answers suggest. That doesn't make much sense, and is not guaranteed to work in compilers that follow strict-aliasing rules in optimization. The correct way to inspect memory content in C++ is by reinterpreting it as an array of [signed/unsigned] char.
Also note that you technically aren't guaranteed that floating-point representation on your system is IEEE754 (although in practice it is unless you explicitly allow it not to be, and then only with respect to -0.0, ±infinity and NaN).

If you are trying to change the bits in the floating-point representation, you could do something like this:
union fp_bit_twiddler {
float f;
int i;
} q;
q.f = a;
q.i &= (1 << 3);
a = q.f;
As AndreyT notes, accessing a union like this invokes undefined behavior, and the compiler could grow arms and strangle you. Do what he suggests instead.

You can work around the strict-aliasing rule and perform bitwise operations on a float type-punned as an uint32_t (if your implementation defines it, which most do) without undefined behavior by using memcpy():
float a = 1.4123f;
uint32_t b;
std::memcpy(&b, &a, 4);
// perform bitwise operation
b &= 1u << 3;
std::memcpy(&a, &b, 4);

float a = 1.4123;
unsigned int* inta = reinterpret_cast<unsigned int*>(&a);
*inta = *inta & (1 << 3);

Have a look at the following. Inspired by fast inverse square root:
#include <iostream>
using namespace std;
int main()
{
float x, td = 2.0;
int ti = *(int*) &td;
cout << "Cast int: " << ti << endl;
ti = ti>>4;
x = *(float*) &ti;
cout << "Recast float: " << x << endl;
return 0;
}

FWIW, there is a real use case for bit-wise operations on floating point (I just ran into it recently) - shaders written for OpenGL implementations that only support older versions of GLSL (1.2 and earlier did not have support for bit-wise operators), and where there would be loss of precision if the floats were converted to ints.
The bit-wise operations can be implemented on floating point numbers using remainders (modulo) and inequality checks. For example:
float A = 0.625; //value to check; ie, 160/256
float mask = 0.25; //bit to check; ie, 1/4
bool result = (mod(A, 2.0 * mask) >= mask); //non-zero if bit 0.25 is on in A
The above assumes that A is between [0..1) and that there is only one "bit" in mask to check, but it could be generalized for more complex cases.
This idea is based on some of the info found in is-it-possible-to-implement-bitwise-operators-using-integer-arithmetic
If there is not even a built-in mod function, then that can also be implemented fairly easily. For example:
float mod(float num, float den)
{
return num - den * floor(num / den);
}

#mobrule:
Better:
#include <stdint.h>
...
union fp_bit_twiddler {
float f;
uint32_t u;
} q;
/* mutatis mutandis ... */
For these values int will likely be ok, but generally, you should use
unsigned ints for bit shifting to avoid the effects of arithmetic shifts. And
the uint32_t will work even on systems whose ints are not 32 bits.

The Python implementation in Floating point bitwise operations (Python recipe) of floating point bitwise operations works by representing numbers in binary that extends infinitely to the left as well as to the right from the fractional point. Because floating point numbers have a signed zero on most architectures it uses ones' complement for representing negative numbers (well, actually it just pretends to do so and uses a few tricks to achieve the appearance).
I'm sure it can be adapted to work in C++, but care must be taken so as to not let the right shifts overflow when equalizing the exponents.

Bitwise operators should NOT be used on floats, as floats are hardware specific, regardless of similarity on what ever hardware you might have. Which project/job do you want to risk on "well it worked on my machine"? Instead, for C++, you can get a similar "feel" for the bit shift operators by overloading the stream operator on an "object" wrapper for a float:
// Simple object wrapper for float type as templates want classes.
class Float
{
float m_f;
public:
Float( const float & f )
: m_f( f )
{
}
operator float() const
{
return m_f;
}
};
float operator>>( const Float & left, int right )
{
float temp = left;
for( right; right > 0; --right )
{
temp /= 2.0f;
}
return temp;
}
float operator<<( const Float & left, int right )
{
float temp = left;
for( right; right > 0; --right )
{
temp *= 2.0f;
}
return temp;
}
int main( int argc, char ** argv )
{
int a1 = 40 >> 2;
int a2 = 40 << 2;
int a3 = 13 >> 2;
int a4 = 256 >> 2;
int a5 = 255 >> 2;
float f1 = Float( 40.0f ) >> 2;
float f2 = Float( 40.0f ) << 2;
float f3 = Float( 13.0f ) >> 2;
float f4 = Float( 256.0f ) >> 2;
float f5 = Float( 255.0f ) >> 2;
}
You will have a remainder, which you can throw away based on your desired implementation.

Related

How can I convert an integer to float with rounding towards zero?

When an integer is converted to floating-point, and the value cannot be directly represented by the destination type, the nearest value is usually selected (required by IEEE-754).
I would like to convert an integer to floating-point with rounding towards zero in case the integer value cannot be directly represented by the floating-point type.
Example:
int i = 2147483647;
float nearest = static_cast<float>(i); // 2147483648 (likely)
float towards_zero = convert(i); // 2147483520

Since C++11, one can use fesetround(), the floating-point environment rounding direction manager. There are four standard rounding directions and an implementation is permitted to add additional rounding directions.
#include <cfenv> // for fesetround() and FE_* macros
#include <iostream> // for cout and endl
#include <iomanip> // for setprecision()
#pragma STDC FENV_ACCESS ON
int main(){
int i = 2147483647;
std::cout << std::setprecision(10);
std::fesetround(FE_DOWNWARD);
std::cout << "round down " << i << " : " << static_cast<float>(i) << std::endl;
std::cout << "round down " << -i << " : " << static_cast<float>(-i) << std::endl;
std::fesetround(FE_TONEAREST);
std::cout << "round to nearest " << i << " : " << static_cast<float>(i) << std::endl;
std::cout << "round to nearest " << -i << " : " << static_cast<float>(-i) << std::endl;
std::fesetround(FE_TOWARDZERO);
std::cout << "round toward zero " << i << " : " << static_cast<float>(i) << std::endl;
std::cout << "round toward zero " << -i << " : " << static_cast<float>(-i) << std::endl;
std::fesetround(FE_UPWARD);
std::cout << "round up " << i << " : " << static_cast<float>(i) << std::endl;
std::cout << "round up " << -i << " : " << static_cast<float>(-i) << std::endl;
return(0);
}
Compiled under g++ 7.5.0, the resulting executable outputs
round down 2147483647 : 2147483520
round down -2147483647 : -2147483648
round to nearest 2147483647 : 2147483648
round to nearest -2147483647 : -2147483648
round toward zero 2147483647 : 2147483520
round toward zero -2147483647 : -2147483520
round up 2147483647 : 2147483648
round up -2147483647 : -2147483520
Omitting the #pragma doesn't seem to change anything under g++.
#chux comments correctly that the standard doesn't explicitly state that fesetround() affects rounding in static_cast<float>(i). For a guarantee that the set rounding direction affects the conversion, use std::nearbyint and its -f and -l variants. See also std::rint and its many type-specific variants.
I probably should have looked up the format specifier to use a space for positive integers and floats, rather than stuffing it into the preceding string constants.
(I haven't tested the following snippet.) Your convert() function would be something like
float convert(int i, int direction = FE_TOWARDZERO){
float retVal = 0.;
int prevdirection = std::fegetround();
std::fesetround(direction);
retVal = static_cast<float>(i);
std::fesetround(prevdirection);
return(retVal);
}

You can use std::nextafter.
int i = 2147483647;
float nearest = static_cast<float>(i); // 2147483648 (likely)
float towards_zero = std::nextafter(nearest, 0.f); // 2147483520
But you have to check, if static_cast<float>(i) is exact, if so, nextafter would go one step towards 0, which you probably don't want.
Your convert function might look like this:
float convert(int x){
if(std::abs(long(static_cast<float>(x))) <= std::abs(long(x)))
return static_cast<float>(x);
return std::nextafter(static_cast<float>(x), 0.f);
}
It may be that sizeof(int)==sizeof(long) or even sizeof(int)==sizeof(long long) in this case long(...) may behave undefined, when the static_cast<float>(x) exceeds the possible values. Depending on the compiler it might still work in this cases.

I understand the question to be restricted to platforms that use IEEE-754 binary floating-point arithmetic, and where float maps to IEEE-754 (2008) binary32. This answer assumes this to be the case.
As other answers have pointed out, if the tool chain and the platform supports this, use the facilities supplied by fenv.h to set the rounding mode for the conversion as desired.
Where those are not available, or slow, it is not difficult to emulate the truncation during int to float conversion. Basically, normalize the integer until the most significant bit is set, recording the required shift count. Now, shift the normalized integer into place to form the mantissa, compute the exponent based on the normalization shift count, and add in the sign bit based on the sign of the original integer. The process of normalization can be sped up significantly if a clz (count leading zeros) primitive is available, maybe as an intrinsic.
The exhaustively tested code below demonstrates this approach for 32-bit integers, see function int32_to_float_rz. I successfully built it as both C and C++ code with the Intel compiler version 13.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <fenv.h>
float int32_to_float_rz (int32_t a)
{
uint32_t i = (uint32_t)a;
int shift = 0;
float r;
// take absolute value of integer
if (a < 0) i = 0 - i;
// normalize integer so MSB is set
if (!(i > 0x0000ffffU)) { i <<= 16; shift += 16; }
if (!(i > 0x00ffffffU)) { i <<= 8; shift += 8; }
if (!(i > 0x0fffffffU)) { i <<= 4; shift += 4; }
if (!(i > 0x3fffffffU)) { i <<= 2; shift += 2; }
if (!(i > 0x7fffffffU)) { i <<= 1; shift += 1; }
// form mantissa with explicit integer bit
i = i >> 8;
// add in exponent, taking into account integer bit of mantissa
if (a != 0) i += (127 + 31 - 1 - shift) << 23;
// add in sign bit
if (a < 0) i |= 0x80000000;
// reinterpret bit pattern as 'float'
memcpy (&r, &i, sizeof r);
return r;
}
#pragma STDC FENV_ACCESS ON
float int32_to_float_rz_ref (int32_t a)
{
float r;
int orig_mode = fegetround ();
fesetround (FE_TOWARDZERO);
r = (float)a;
fesetround (orig_mode);
return r;
}
int main (void)
{
int32_t arg;
float res, ref;
arg = 0;
do {
res = int32_to_float_rz (arg);
ref = int32_to_float_rz_ref (arg);
if (res != ref) {
printf ("error # %08x: res=% 14.6a ref=% 14.6a\n", arg, res, ref);
return EXIT_FAILURE;
}
arg++;
} while (arg);
return EXIT_SUCCESS;
}

A C implementation dependent solution that I am confident has a C++ counterpart.
Temporarily change the rounding mode the conversion uses that to determine which way to go in inexact cases.
the nearest value is usually selected (required by IEEE-754).
Is not entirely accurate. The inexact case is rounding mode dependent.
C does not specify this behavior. C allows this behavior, as it is implementation-defined.
If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner.
#include <fenv.h>
float convert(int i) {
#pragma STDC FENV_ACCESS ON
int save_round = fegetround();
fesetround(FE_TOWARDZERO);
float f = (float) i;
fesetround(save_round);
return f;
}

A specified approach.
"the nearest value is usually selected (required by IEEE-754)" implies OP expects IEEE-754 is involved. Many C/C++ implementation do follow much of IEEE-754, yet adherence to that spec is not required. The following relies on C specifications.
Conversion of an integer type to a floating point type is specified as below. Notice conversion is not specified to depend on rounding mode.
When a value of integer type is converted to a real floating type, if the value being converted can be represented exactly in the new type, it is unchanged. If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner. C17dr § 6.3.1.4 2
When the result it not exact, the converted value the nearest higher or nearest lower?
A round trip int --> float --> int is warranted.
Round tripping needs to watch out for convert(near_INT_MAX) converting to outside the int range.
Rather than rely on long or long long having a wider range than int (C does not specify this property), let code compare on the negative side as INT_MIN (with 2's complement) can be expected to convert exactly to a float.
float convert(int i) {
int n = (i < 0) ? i : -i; // n <= 0
float f = (float) n;
int rt_n = (int) f; // Overflow not expected on the negative side
// If f rounded away from 0.0 ...
if (rt_n < n) {
f = nextafterf(f, 0.0); // Move toward 0.0
}
return (i < 0) f : -f;
}

Changing the rounding mode is somewhat expensive, although I think some modern x86 CPUs do rename MXCSR so it doesn't have to drain the out-of-order execution back-end.
If you care about performance, benchmarking njuffa's pure integer version (using shift = __builtin_clz(i); i<<=shift;) against the rounding-mode-changing version would make sense. (Make sure to test in the context you want to use it in; it's so small that it matters how well it overlaps with surrounding code.)
AVX-512 can use rounding-mode overrides on a per-instruction basis, letting you use a custom rounding mode for the conversion basically the same cost as a normal int->float. (Only available on Intel Skylake-server, and Ice Lake CPUs so far, unfortunately.)
#include <immintrin.h>
float int_to_float_trunc_avx512f(int a) {
const __m128 zero = _mm_setzero_ps(); // SSE scalar int->float are badly designed to merge into another vector, instead of zero-extend. Short-sighted Pentium-3 decision never changed for AVX or AVX512
__m128 v = _mm_cvt_roundsi32_ss (zero, a, _MM_FROUND_TO_ZERO |_MM_FROUND_NO_EXC);
return _mm_cvtss_f32(v); // the low element of a vector already is a scalar float so this is free.
}
_mm_cvt_roundi32_ss is a synonym, IDK why Intel defined both i and si names, or if some compilers might only have one.
This compiles efficiently with all 4 mainstream x86 compilers (GCC/clang/MSVC/ICC) on the Godbolt compiler explorer.
# gcc10.2 -O3 -march=skylake-avx512
int_to_float_trunc_avx512f:
vxorps xmm0, xmm0, xmm0
vcvtsi2ss xmm0, xmm0, {rz-sae}, edi
ret
int_to_float_plain:
vxorps xmm0, xmm0, xmm0 # GCC is always cautious about false dependencies, spending an extra instruction to break it, like we did with setzero()
vcvtsi2ss xmm0, xmm0, edi
ret
In a loop, the same zeroed register can be reused as a merge target, allowing the vxorps zeroing to be hoisted out of a loop.
Using _mm_undefined_ps() instead of _mm_setzero_ps(), we can get ICC to skip zeroing XMM0 before converting into it, like clang does for plain (float)i in this case. But ironically, clang which is normally cavalier and reckless about false dependencies compiles _mm_undefined_ps() the same as setzero in this case.
The performance in practice of vcvtsi2ss (scalar integer to scalar single-precision float) is the same whether you use a rounding-mode override or not (2 uops on Ice Lake, same latency: https://uops.info/). The AVX-512 EVEX encoding is 2 bytes longer than the AVX1.
Rounding mode overrides also suppress FP exceptions (like "inexact"), so you couldn't check the FP environment to later detect if the conversion happened to be exact (no rounding). But in this case, converting back to int and comparing would be fine. (You can do that without risk of overflow because of the rounding towards 0).

A simple solution is to use a higher precision floating point for comparison. As long as the high precision floating point can exactly represent all integers, we can accurately compare whether the float result was greater.
double should be sufficient with 32 bit integers, and long double is sufficient for 64 bit on most systems, but it's good practice to verify it.
float convert(int x) {
static_assert(std::numeric_limits<double>::digits
>= sizeof(int) * CHAR_BIT);
float f = x;
double d = x;
return std::abs(f) > std::abs(d)
? std::nextafter(f, 0.f)
: f;
}

Shift the integer right by an arithmetic shift until the number of bits agrees with the precision of the floating point arithmetic. Count the shifts.
Convert the integer to float. The result is now precise.
Multiply the resulting float by a power of two corresponding to the number of shifts.

For nonnegative values, this can be done by taking the integer value and shifting right until the highest set bit is less than 24 bits (i.e. the precision of IEEE single) from the right, then shifting back.
For negative values, you would shift right until all bits from 24 and up are set, then shift back. For the shift back, you'll first need to cast the value to unsigned to avoid undefined behavior of left-shifting a negative value, then cast the result back to int before converting to float.
Note also that the conversion from unsigned to signed is implementation defined, however we're already dealing with ID as we're assuming float is IEEE754 and int is two's complement.
float rount_to_zero(int x)
{
int cnt = 0;
if (x >= 0) {
while (x != (x & 0xffffff)) {
x >>= 1;
cnt++;
}
return x << cnt;
} else {
while (~0xffffff != (x & ~0xffffff)) {
x >>= 1;
cnt++;
}
return (int)((unsigned)x << cnt);
}
}

Save a float into an integer without losing floating point precision

I want to save the value of a float variable named f in the third element of an array named i in a way that the floating point part isn't wiped (i.e. I don't want to save 1 instead of 1.5). After that, complete the last line in a way that we see 1.5 in the output (don't use cout<<1.5; or cout<<f; or some similar tricks!)
float f=1.5;
int i[3];
i[2] = ... ;
cout<<... ;
Does anybody have any idea?

Use type-punning with union if they have the same size under a compilation environment:
static_assert(sizeof(int) == sizeof(float));
int castFloatToInt(float f) {
union { float f; int i; } u;
u.f = f;
return u.i;
}
float castIntToFloat(int i) {
union { float f; int i; } u;
u.i = i;
return u.f;
}
// ...
float f=1.5;
int i[3];
i[2] = castFloatToInt(f);
cout << castIntToFloat(i);
Using union is the way to prevent aliasing problem, otherwise compiler may generate incorrect results due to optimization.
This is a common technique for manipulating bits of float directly. Although normally uint32_t will be used instead.

Generally speaking, you cannot store a float in an int without loss of precision.
You could multiply your number with a factor, store it and after that divide again to get some decimal places out of it.
Note that this will not work for all numbers and you have to choose your factor carefully.
float f = 1.5f;
const float factor = 10.0f;
int i[3];
i[2] = static_cast<int>(f * factor);
std::cout << static_cast<float>(i[2]) / factor;

If we can assume that int is 32 bits then you can do it with type-punning:
float f = 1.5;
int i[3];
i[2] = *(int *)&f;
cout << *(float *)&i[2];
but this is getting into Undefined Behaviour territory (breaking aliasing rules), since it accesses a type via a pointer to a different (incompatible) type.
LIVE DEMO

Fast log2(float x) implementation C++

I need a Very-Fast implementation of log2(float x) function in C++.
I found a very interesting implementation (and extremely fast!)
#include <intrin.h>
inline unsigned long log2(int x)
{
unsigned long y;
_BitScanReverse(&y, x);
return y;
}
But this function is good only for integer values in input.
Question: Is there any way to convert this function to double type input variable?
UPD:
I found this implementation:
typedef unsigned long uint32;
typedef long int32;
static inline int32 ilog2(float x)
{
uint32 ix = (uint32&)x;
uint32 exp = (ix >> 23) & 0xFF;
int32 log2 = int32(exp) - 127;
return log2;
}
which is much faster than the previous example, but the output is unsigned type.
Is it possible to make this function return a double type?
Thanks in advance!

If you just need the integer part of the logarithm, then you can extract that directly from the floating point number.
Portably:
#include <cmath>
int log2_fast(double d) {
int result;
std::frexp(d, &result);
return result-1;
}
Possibly faster, but relying on unspecified and undefined behaviour:
int log2_evil(double d) {
return ((reinterpret_cast<unsigned long long&>(d) >> 52) & 0x7ff) - 1023;
}

MSVC + GCC compatible version that give XX.XXXXXXX +-0.0054545
float mFast_Log2(float val) {
union { float val; int32_t x; } u = { val };
register float log_2 = (float)(((u.x >> 23) & 255) - 128);
u.x &= ~(255 << 23);
u.x += 127 << 23;
log_2 += ((-0.3358287811f) * u.val + 2.0f) * u.val -0.65871759316667f;
return (log_2);
}

Edit: See link by Job in the comments below for a better version.
Fast log() function (5× faster approximately)
Maybe of interest for you. The code works here; It is not infinitely precise though. As the code is broken on the web page (the > have been removed) I'll post it here:
inline float fast_log2 (float val)
{
int * const exp_ptr = reinterpret_cast <int *> (&val);
int x = *exp_ptr;
const int log_2 = ((x >> 23) & 255) - 128;
x &= ~(255 << 23);
x += 127 << 23;
*exp_ptr = x;
val = ((-1.0f/3) * val + 2) * val - 2.0f/3; // (1)
return (val + log_2);
}
inline float fast_log (const float &val)
{
return (fast_log2 (val) * 0.69314718f);
}

You can take a look into this implementation, but :
it may not work on some platforms
might not beat std::log

C++11 added std::log2 into <cmath>.

This is an improvement on the first answer which does not rely on IEEE implementation, although I imagine that it is only fast on IEEE machines where frexp() is basically a costless function.
Instead of discarding the fraction that frexp returns, one can use it to linearly interpolate. The fraction value is between 0.5 and 1.0 if it is positive, so we stretch between 0.0 and 1.0 and add it to the exponent.
In practice, it looks like this fast evaluation is good to about 5-10%, always returning a value that is a little low. I'm sure it could be made better by tweaking the 2* scaling factor.
#include <cmath>
double log2_fast(double d) {
int exponent;
double fraction = std::frexp(d, &exponent);
return (result-1) + 2* (fraction - 0.5);
}
You can verify that this is reasonable fast approximation with this:
#include <cmath>
int main()
{
for(double x=0.001;x<1000;x+=0.1)
{
std::cout << x << " " << std::log2(x) << " " << log2_fast(x) << "\n";
}
}

No, but if you only need the integeral part of the result and don't insist on portability, there is even faster one. Because all you need is to extract the exponent part of the float!

This function is not C++, it's MSVC++ specific. Also, I highly doubt that any such intrinsics exist. And if they did, the Standard function would simply be configured to use it. So just call the Standard-provided library.

How do I use bitwise operators on a "double" on C++?

I was asked to get the internal binary representation of different types in C. My program currently works fine with 'int' but I would like to use it with "double" and "float". My code looks like this:
template <typename T>
string findBin(T x) {
string binary;
for(int i = 4096 ; i >= 1; i/=2) {
if((x & i) != 0) binary += "1";
else binary += "0";
}
return binary;
}
The program fails when I try to instantiate the template using a "double" or a "float".

Succinctly, you don't.
The bitwise operators do not make sense when applied to double or float, and the standard says that the bitwise operators (~, &, |, ^, >>, <<, and the assignment variants) do not accept double or float operands.
Both double and float have 3 sections - a sign bit, an exponent, and the mantissa. Suppose for a moment that you could shift a double right. The exponent, in particular, means that there is no simple translation to shifting a bit pattern right - the sign bit would move into the exponent, and the least significant bit of the exponent would shift into the mantissa, with completely non-obvious sets of meanings. In IEEE 754, there's an implied 1 bit in front of the actual mantissa bits, which also complicates the interpretation.
Similar comments apply to any of the other bit operators.
So, because there is no sane or useful interpretation of the bit operators to double values, they are not allowed by the standard.
From the comments:
I'm only interested in the binary representation. I just want to print it, not do anything useful with it.
This code was written several years ago for SPARC (big-endian) architecture.
#include <stdio.h>
union u_double
{
double dbl;
char data[sizeof(double)];
};
union u_float
{
float flt;
char data[sizeof(float)];
};
static void dump_float(union u_float f)
{
int exp;
long mant;
printf("32-bit float: sign: %d, ", (f.data[0] & 0x80) >> 7);
exp = ((f.data[0] & 0x7F) << 1) | ((f.data[1] & 0x80) >> 7);
printf("expt: %4d (unbiassed %5d), ", exp, exp - 127);
mant = ((((f.data[1] & 0x7F) << 8) | (f.data[2] & 0xFF)) << 8) | (f.data[3] & 0xFF);
printf("mant: %16ld (0x%06lX)\n", mant, mant);
}
static void dump_double(union u_double d)
{
int exp;
long long mant;
printf("64-bit float: sign: %d, ", (d.data[0] & 0x80) >> 7);
exp = ((d.data[0] & 0x7F) << 4) | ((d.data[1] & 0xF0) >> 4);
printf("expt: %4d (unbiassed %5d), ", exp, exp - 1023);
mant = ((((d.data[1] & 0x0F) << 8) | (d.data[2] & 0xFF)) << 8) | (d.data[3] & 0xFF);
mant = (mant << 32) | ((((((d.data[4] & 0xFF) << 8) | (d.data[5] & 0xFF)) << 8) | (d.data[6] & 0xFF)) << 8) | (d.data[7] & 0xFF);
printf("mant: %16lld (0x%013llX)\n", mant, mant);
}
static void print_value(double v)
{
union u_double d;
union u_float f;
f.flt = v;
d.dbl = v;
printf("SPARC: float/double of %g\n", v);
// image_print(stdout, 0, f.data, sizeof(f.data));
// image_print(stdout, 0, d.data, sizeof(d.data));
dump_float(f);
dump_double(d);
}
int main(void)
{
print_value(+1.0);
print_value(+2.0);
print_value(+3.0);
print_value( 0.0);
print_value(-3.0);
print_value(+3.1415926535897932);
print_value(+1e126);
return(0);
}
The commented out 'image_print()` function prints an arbitrary set of bytes in hex, with various minor tweaks. Contact me if you want the code (see my profile).
If you're using Intel (little-endian), you'll probably need to tweak the code to deal with the reverse bit order. But it shows how you can do it - using a union.

You cannot directly apply bitwise operators to float or double, but you can still access the bits indirectly by putting the variable in a union with a character array of the appropriate size, then reading the bits from those characters. For example:
string BitsFromDouble(double value) {
union {
double doubleValue;
char asChars[sizeof(double)];
};
doubleValue = value; // Write to the union
/* Extract the bits. */
string result;
for (size i = 0; i < sizeof(double); ++i)
result += CharToBits(asChars[i]);
return result;
}
You may need to adjust your routine to work on chars, which usually don't range up to 4096, and there may also be some weirdness with endianness here, but the basic idea should work. It won't be cross-platform compatible, since machines use different endianness and representations of doubles, so be careful how you use this.

Bitwise operators don't generally work with "binary representation" (also called object representation) of any type. Bitwise operators work with value representation of the type, which is generally different from object representation. That applies to int as well as to double.
If you really want to get to the internal binary representation of an object of any type, as you stated in your question, you need to reinterpret the object of that type as an array of unsigned char objects and then use the bitwise operators on these unsigned chars
For example
double d = 12.34;
const unsigned char *c = reinterpret_cast<unsigned char *>(&d);
Now by accessing elements c[0] through c[sizeof(double) - 1] you will see the internal representation of type double. You can use bitwise operations on these unsigned char values, if you want to.
Note, again, that in general case in order to access internal representation of type int you have to do the same thing. It generally applies to any type other than char types.

Do a bit-wise cast of a pointer to the double to long long * and dereference.
Example:
inline double bit_and_d(double* d, long long mask) {
long long t = (*(long long*)d) & mask;
return *(double*)&t;
}
Edit: This is almost certainly going to run afoul of gcc's enforcement of strict aliasing. Use one of the various workarounds for that. (memcpy, unions, __attribute__((__may_alias__)), etc)

Other solution is to get a pointer to the floating point variable and cast it to a pointer to integer type of the same size, and then get value of the integer this pointer points to. Now you have an integer variable with same binary representation as the floating point one and you can use your bitwise operator.
string findBin(float f) {
string binary;
for(long i = 4096 ; i >= 1; i/=2) {
long x = * ( long * ) &y;
if((x & i) != 0) binary += "1";
else binary += "0";
}
return binary;
}
But remember: you have to cast to a type with same size. Otherwise unpredictable things may happen (like buffer overflow, access violation etc.).

As others have said, you can use a bitwise operator on a double by casting double* to long long* (or sometimes just long*).
int main(){
double * x = (double*)malloc(sizeof(double));
*x = -5.12345;
printf("%f\n", *x);
*((long*)x) &= 0x7FFFFFFFFFFFFFFF;
printf("%f\n", *x);
return 0;
}
On my computer, this code prints:
-5.123450
5.123450

C++ floating point to integer type conversions

What are the different techniques used to convert float type of data to integer in C++?
#include <iostream>
using namespace std;
struct database {
int id, age;
float salary;
};
int main() {
struct database employee;
employee.id = 1;
employee.age = 23;
employee.salary = 45678.90;
/*
How can i print this value as an integer
(with out changing the salary data type in the declaration part) ?
*/
cout << endl << employee.id << endl << employee.
age << endl << employee.salary << endl;
return 0;
}

What you are looking for is 'type casting'. typecasting (putting the type you know you want in brackets) tells the compiler you know what you are doing and are cool with it. The old way that is inherited from C is as follows.
float var_a = 9.99;
int var_b = (int)var_a;
If you had only tried to write
int var_b = var_a;
You would have got a warning that you can't implicitly (automatically) convert a float to an int, as you lose the decimal.
This is referred to as the old way as C++ offers a superior alternative, 'static cast'; this provides a much safer way of converting from one type to another. The equivalent method would be (and the way you should do it)
float var_x = 9.99;
int var_y = static_cast<int>(var_x);
This method may look a bit more long winded, but it provides much better handling for situations such as accidentally requesting a 'static cast' on a type that cannot be converted. For more information on the why you should be using static cast, see this question.

Normal way is to:
float f = 3.4;
int n = static_cast<int>(f);

Size of some float types may exceed the size of int.
This example shows a safe conversion of any float type to int using the int safeFloatToInt(const FloatType &num); function:
#include <iostream>
#include <limits>
using namespace std;
template <class FloatType>
int safeFloatToInt(const FloatType &num) {
//check if float fits into integer
if ( numeric_limits<int>::digits < numeric_limits<FloatType>::digits) {
// check if float is smaller than max int
if( (num < static_cast<FloatType>( numeric_limits<int>::max())) &&
(num > static_cast<FloatType>( numeric_limits<int>::min())) ) {
return static_cast<int>(num); //safe to cast
} else {
cerr << "Unsafe conversion of value:" << num << endl;
//NaN is not defined for int return the largest int value
return numeric_limits<int>::max();
}
} else {
//It is safe to cast
return static_cast<int>(num);
}
}
int main(){
double a=2251799813685240.0;
float b=43.0;
double c=23333.0;
//unsafe cast
cout << safeFloatToInt(a) << endl;
cout << safeFloatToInt(b) << endl;
cout << safeFloatToInt(c) << endl;
return 0;
}
Result:
Unsafe conversion of value:2.2518e+15
2147483647
43
23333

For most cases (long for floats, long long for double and long double):
long a{ std::lround(1.5f) }; //2l
long long b{ std::llround(std::floor(1.5)) }; //1ll

Check out the boost NumericConversion library. It will allow to explicitly control how you want to deal with issues like overflow handling and truncation.

I believe you can do this using a cast:
float f_val = 3.6f;
int i_val = (int) f_val;

the easiest technique is to just assign float to int, for example:
int i;
float f;
f = 34.0098;
i = f;
this will truncate everything behind floating point or you can round your float number before.

One thing I want to add. Sometimes, there can be precision loss. You may want to add some epsilon value first before converting. Not sure why that works... but it work.
int someint = (somedouble+epsilon);

This is one way to convert IEEE 754 float to 32-bit integer if you can't use floating point operations. It has also a scaler functionality to include more digits to the result. Useful values for scaler are 1, 10 and 100.
#define EXPONENT_LENGTH 8
#define MANTISSA_LENGTH 23
// to convert float to int without floating point operations
int ownFloatToInt(int floatBits, int scaler) {
int sign = (floatBits >> (EXPONENT_LENGTH + MANTISSA_LENGTH)) & 1;
int exponent = (floatBits >> MANTISSA_LENGTH) & ((1 << EXPONENT_LENGTH) - 1);
int mantissa = (floatBits & ((1 << MANTISSA_LENGTH) - 1)) | (1 << MANTISSA_LENGTH);
int result = mantissa * scaler; // possible overflow
exponent -= ((1 << (EXPONENT_LENGTH - 1)) - 1); // exponent bias
exponent -= MANTISSA_LENGTH; // modify exponent for shifting the mantissa
if (exponent <= -(int)sizeof(result) * 8) {
return 0; // underflow
}
if (exponent > 0) {
result <<= exponent; // possible overflow
} else {
result >>= -exponent;
}
if (sign) result = -result; // handle sign
return result;
}

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

How to perform a bitwise operation on floating point numbers - c++

float a = 1.4123; unsigned int* inta = reinterpret_cast<unsigned int>(&a); inta = *inta & (1 << 3);

Have a look at the following. Inspired by fast inverse square root: #include <iostream> using namespace std; int main() { float x, td = 2.0; int ti = (int) &td; cout << "Cast int: " << ti << endl; ti = ti>>4; x = (float) &ti; cout << "Recast float: " << x << endl; return 0; }

Related

How can I convert an integer to float with rounding towards zero?

Save a float into an integer without losing floating point precision

Fast log2(float x) implementation C++

How do I use bitwise operators on a "double" on C++?

C++ floating point to integer type conversions

Categories

Resources

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

How to perform a bitwise operation on floating point numbers - c++

float a = 1.4123; unsigned int* inta = reinterpret_cast<unsigned int*>(&a); *inta = *inta & (1 << 3);

Have a look at the following. Inspired by fast inverse square root: #include <iostream> using namespace std; int main() { float x, td = 2.0; int ti = *(int*) &td; cout << "Cast int: " << ti << endl; ti = ti>>4; x = *(float*) &ti; cout << "Recast float: " << x << endl; return 0; }

Related

How can I convert an integer to float with rounding towards zero?

Save a float into an integer without losing floating point precision

Fast log2(float x) implementation C++

How do I use bitwise operators on a "double" on C++?

C++ floating point to integer type conversions

Categories

Resources

float a = 1.4123; unsigned int* inta = reinterpret_cast<unsigned int>(&a); inta = *inta & (1 << 3);

Have a look at the following. Inspired by fast inverse square root: #include <iostream> using namespace std; int main() { float x, td = 2.0; int ti = (int) &td; cout << "Cast int: " << ti << endl; ti = ti>>4; x = (float) &ti; cout << "Recast float: " << x << endl; return 0; }