Float bitwise ops

Float bitwise ops - c++

I want convert the storage of a floating point number to an integer (the 'number' value is not required to be equal).
If a floating point number (eg 10), is represented in binary (at least on my implementation) as:
01000001001000000000000000000000
Then it should (also on my implementation) represent the integer value 1092616192.
I am currently doing this through the following:
union UNFI {
float a;
int b;
};
UNFI n;
n.a = 10;
int bits = n.b; // foo!
for (int i=31; i>=0; --i) {
bool bit = ((bits >> i) & 1);
std::cout << bit;
}
Which is achieving what I would like, but it is, in its nature, undefined behavior. Therefore I would like to know what the 'correct' way of achieving this outcome is.
From the C99 standard:
With one exception, if the value of a member of a union object is used when the most recent store to the object was to a different member,the behavior is implementation-defined.
Is that not undefined behavior?

The correct way is to call memcpy. Most compilers will optimize this to an efficient single-word memory access just like the union, but on platforms with extra alignment requirements, they will do the right thing. And it won't ever trigger a signalling NaN.
float a = 10;
char bits[sizeof a];
memcpy(bits, &a, sizeof a);
or
int b;
static_assert(sizeof b == sizeof a);
memcpy(&b, &a, sizeof a);

What's undefined?
The layout of an IEEE float is very well defined, the only extra complication is the byte ordering on your platform (and how big is an int)

What you're after is simple type hosing. What you are doing seems like the easiest thing. Another alternative I suppose would be pointer conversion.
int const bits = *(reinterpret_cast<int *>(&n))

Related

Convert an integer's binary data to float

Lets say I have an integer:
unsigned long long int data = 4599331010119547059;
Now I want to convert this data to a double. I basically want to change the type, but keep the bits exactly as they were. For the given example, the float value is 0.31415926536.
How can I do that in C++? I saw some methods using Union but many advised against using this approach.

Since C++20, you can use std::bit_cast:
std::bit_cast<double>(data)
Prior to C++20, you can use std::memcpy:
double d;
static_assert(sizeof d == sizeof data);
std::memcpy(&d, &data, sizeof d);
Note that result will vary depending on floating point representation (IEEE-754 is ubiquitous though) as well as whether floating point and integer types have the same endianness.

Taking the question on its face value (assuming you have a valid reason to do this!) this is the only proper way of doing this in current C++ standard:
int i = get_int();
float x;
static_assert(sizeof(float) == sizeof(int), "!!!");
memcpy(&x, &i, sizeof(x));

You can use reinterpret_cast:
float f = reinterpret_cast<float&>(data);
For your value, I don't get 0.314... but that's how you could do it.

Adding binary integer without using + or -

add two integers without using + or -.
This is my solution.
class Solution {
public:
int getSum(int a, int b) {
int temp=a & b;
a=a^b;
while (temp>0){
b=temp<<1;
temp=a & b;
a=a^b;
}
return a;
}
};
But it doesn't work on the cases when a=-12,
b=-8.
Compare it side by side with another people's working solution, he has:
class Solution {
public:
int getSum(int a, int b) {
int sum = a;
while (b != 0)
{
sum = a ^ b;//calculate sum of a and b without thinking the carry
b = (a & b) << 1;//calculate the carry
a = sum;//add sum(without carry) and carry
}
return sum;
}
};
which is bascially the same. Why my solution doesn't work?

Strictly speaking both your solution and the one you are comparing with are incorrect, unless you make specific assumptions about the representation of signedintegral types. The reason yours differs is order of operations.
The explanation is written in the C standard itself. For example, from the 2011 ISO C standard (ISO/IEC 9899:2011) Section 6.5, para 4.
Some operators (the unary operator ~ , and the binary operators <<, >>, &, ^, and |, collectively described as bitwise operators) shall have operands that have integral type. These operators return values that depend on the internal representations of integers, and thus have implementation-defined and undefined aspects for signed types.
These concerns hit home with expressions like a & b if either a or b is negative .... and your example has both. a << 1 gives a similar concern if a is negative.
To eliminate your problem, you need to work with unsigned values (for which bitwise operators have well defined behaviour). If you need to deal with negative values, simply keep track of sign in another way (e.g. another variable of type bool).
In practice, bitwise operations work as expected for signed types with a twos-complement representation. The problem with relying on that, however, is that an implementation is not required to use such a representation.

ULP comparison code

The following code snippet is scattered all over the web and seems to be used in multiple different projects with very little changes:
union Float_t {
Float_t(float num = 0.0f) : f(num) {}
// Portable extraction of components.
bool Negative() const { return (i >> 31) != 0; }
int RawMantissa() const { return i & ((1 << 23) - 1); }
int RawExponent() const { return (i >> 23) & 0xFF; }
int i;
float f;
};
inline bool AlmostEqualUlpsAndAbs(float A, float B, float maxDiff, int maxUlpsDiff)
{
// Check if the numbers are really close -- needed
// when comparing numbers near zero.
float absDiff = std::fabs(A - B);
if (absDiff <= maxDiff)
return true;
Float_t uA(A);
Float_t uB(B);
// Different signs means they do not match.
if (uA.Negative() != uB.Negative())
return false;
// Find the difference in ULPs.
return (std::abs(uA.i - uB.i) <= maxUlpsDiff);
}
See, for example here or here or here.
However, I don't understand what is going on here. To my (maybe naive) understanding, the floating-point member variable f is initialized in the constructor, but the integer member i is not.
I'm not terribly familiar with the binary operators that are used here, but I fail to understand how accesses of uA.i and uB.i produce anything but random numbers, given that no line in the code actually connects the values of f and i in any meaningful way.
If somebody could enlighten my on why (and how) exactly this code produces the desired result, I would be very delighted!

A lot of Undefined Behaviour are being exploited here. First assumption is that fields of union can be accessed in place of each other, which is, in itself, UB. Furthermore, coder assumes that: sizeof(int) == sizeof(float), that floats have a given length of mantissa and exponent, that all union members are aligned to zero, that the binary representation of float coincides with the binary representation with int in a very specific way. In short, this will work as long as you're on x86, have specific int and float types and you say a prayer at every sunrise and sunset.
What you probably didn't note is that this is a union, therefore int i and float f is usually aligned in a specific manner in a common memory array by most compilers. This is, in general, still UB and you can't even safely assume that the same physical bits of memory will be used without restricting yourself to a specific compiler and a specific architecture. All that's guaranteed is, the address of both members will be the same (but there might be alignment and/or typedness issues). Assuming that your compiler uses the same physical bits (which is by no means guaranteed by standard) and they both start at offset 0 and have the same size, then i will represent the binary storage format of f.. as long as nothing changes in your architecture. Word of advice? Do not use it until you don't have to. Stick to floating point operations for AlmostEquals(), you can implement it like that. It's the very final pass of optimization when we consider these specialities and we usually do it in a separate branch, you shouldn't plan your code around it.

How does one safely static_cast between unsigned int and int?

I have an 8-character string representing a hexadecimal number and I need to convert it to an int. This conversion has to preserve the bit pattern for strings "80000000" and higher, i.e., those numbers should come out negative. Unfortunately, the naive solution:
int hex_str_to_int(const string hexStr)
{
stringstream strm;
strm << hex << hexStr;
unsigned int val = 0;
strm >> val;
return static_cast<int>(val);
}
doesn't work for my compiler if val > MAX_INT (the returned value is 0). Changing the type of val to int also results in a 0 for the larger numbers. I've tried several different solutions from various answers here on SO and haven't been successful yet.
Here's what I do know:
I'm using HP's C++ compiler on OpenVMS (using, I believe, an Itanium processor).
sizeof(int) will be at least 4 on every architecture my code will run on.
Casting from a number > INT_MAX to int is implementation-defined. On my machine, it usually results in a 0 but interestingly casting from long to int results in INT_MAX when the value is too big.
This is surprisingly difficult to do correctly, or at least it has been for me. Does anyone know of a portable solution to this?
Update:
Changing static_cast to reinterpret_cast results in a compiler error. A comment prompted me to try a C-style cast: return (int)val in the code above, and it worked. On this machine. Will that still be safe on other architectures?

Quoting the C++03 standard, §4.7/3 (Integral Conversions):
If the destination type is signed, the value is unchanged if it can be represented in the destination type (and bit-field width); otherwise, the value is implementation-defined.
Because the result is implementation-defined, by definition it is impossible for there to be a truly portable solution.

While there are ways to do this using casts and conversions, most rely on undefined behavior that happen to have well-defined behaviors on some machines / with some compilers. Instead of relying on undefined behavior, copy the data:
int signed_val;
std::memcpy (&signed_val, &val, sizeof(int));
return signed_val;

You can negate an unsigned twos-complement number by taking the complement and adding one. So let's do that for negatives:
if (val < 0x80000000) // positive values need no conversion
return val;
if (val == 0x80000000) // Complement-and-addition will overflow, so special case this
return -0x80000000; // aka INT_MIN
else
return -(int)(~val + 1);
This assumes that your ints are represented with 32-bit twos-complement representation (or have similar range). It does not rely on any undefined behavior related to signed integer overflow (note that the behavior of unsigned integer overflow is well-defined - although that should not happen here either!).
Note that if your ints are not 32-bit, things get more complex. You may need to use something like ~(~0U >> 1) instead of 0x80000000. Further, if your ints are no twos-complement, you may have overflow issues on certain values (for example, on a ones-complement machine, -0x80000000 cannot be represented in a 32-bit signed integer). However, non-twos-complement machines are very rare today, so this is unlikely to be a problem.

Here's another solution that worked for me:
if (val <= INT_MAX) {
return static_cast<int>(val);
}
else {
int ret = static_cast<int>(val & ~INT_MIN);
return ret | INT_MIN;
}
If I mask off the high bit, I avoid overflow when casting. I can then OR it back safely.

C++20 will have std::bit_cast that copies bits verbatim:
#include <bit>
#include <cassert>
#include <iostream>
int main()
{
int i = -42;
auto u = std::bit_cast<unsigned>(i);
// Prints 4294967254 on two's compliment platforms where int is 32 bits
std::cout << u << "\n";
auto roundtripped = std::bit_cast<int>(u);
assert(roundtripped == i);
std::cout << roundtripped << "\n"; // Prints -42
return 0;
}
cppreference shows an example of how one can implement their own bit_cast in terms of memcpy (under Notes).
While OpenVMS is not likely to gain C++20 support anytime soon, I hope this answer helps someone arriving at the same question via internet search.

unsigned int u = ~0U;
int s = *reinterpret_cast<int*>(&u); // -1
Сontrariwise:
int s = -1;
unsigned int u = *reinterpret_cast<unsigned int*>(&s); // all ones

Packing 32bit floats into 30 bits (c++)

Here are the goals I'm trying to achieve:
I need to pack 32 bit IEEE floats into 30 bits.
I want to do this by decreasing the size of mantissa by 2 bits.
The operation itself should be as fast as possible.
I'm aware that some precision will be lost, and this is acceptable.
It would be an advantage, if this operation would not ruin special cases like SNaN, QNaN, infinities, etc. But I'm ready to sacrifice this over speed.
I guess this questions consists of two parts:
1) Can I just simply clear the least significant bits of mantissa? I've tried this, and so far it works, but maybe I'm asking for trouble... Something like:
float f;
int packed = (*(int*)&f) & ~3;
// later
f = *(float*)&packed;
2) If there are cases where 1) will fail, then what would be the fastest way to achieve this?
Thanks in advance

You actually violate the strict aliasing rules (section 3.10 of the C++ standard) with these reinterpret casts. This will probably blow up in your face when you turn on the compiler optimizations.
C++ standard, section 3.10 paragraph 15 says:
If a program attempts to access the stored value of an object through an lvalue of other than one of the following types the behavior is undefined
the dynamic type of the object,
a cv-qualified version of the dynamic type of the object,
a type similar to the dynamic type of the object,
a type that is the signed or unsigned type corresponding to the dynamic type of the object,
a type that is the signed or unsigned type corresponding to a cv-qualified version of the dynamic type of the object,
an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union),
a type that is a (possibly cv-qualified) base class type of the dynamic type of the object,
a char or unsigned char type.
Specifically, 3.10/15 doesn't allow us to access a float object via an lvalue of type unsigned int. I actually got bitten myself by this. The program I wrote stopped working after turning on optimizations. Apparently, GCC didn't expect an lvalue of type float to alias an lvalue of type int which is a fair assumption by 3.10/15. The instructions got shuffled around by the optimizer under the as-if rule exploiting 3.10/15 and it stopped working.
Under the following assumptions
float really corresponds to a 32bit IEEE-float,
sizeof(float)==sizeof(int)
unsigned int has no padding bits or trap representations
you should be able to do it like this:
/// returns a 30 bit number
unsigned int pack_float(float x) {
unsigned r;
std::memcpy(&r,&x,sizeof r);
return r >> 2;
}
float unpack_float(unsigned int x) {
x <<= 2;
float r;
std::memcpy(&r,&x,sizeof r);
return r;
}
This doesn't suffer from the "3.10-violation" and is typically very fast. At least GCC treats memcpy as an intrinsic function. In case you don't need the functions to work with NaNs, infinities or numbers with extremely high magnitude you can even improve accuracy by replacing "r >> 2" with "(r+1) >> 2":
unsigned int pack_float(float x) {
unsigned r;
std::memcpy(&r,&x,sizeof r);
return (r+1) >> 2;
}
This works even if it changes the exponent due to a mantissa overflow because the IEEE-754 coding maps consecutive floating point values to consecutive integers (ignoring +/- zero). This mapping actually approximates a logarithm quite well.

Blindly dropping the 2 LSBs of the float may fail for small number of unusual NaN encodings.
A NaN is encoded as exponent=255, mantissa!=0, but IEEE-754 doesn't say anything about which mantiassa values should be used. If the mantissa value is <= 3, you could turn a NaN into an infinity!

You should encapsulate it in a struct, so that you don't accidentally mix the usage of the tagged float with regular "unsigned int":
#include <iostream>
using namespace std;
struct TypedFloat {
private:
union {
unsigned int raw : 32;
struct {
unsigned int num : 30;
unsigned int type : 2;
};
};
public:
TypedFloat(unsigned int type=0) : num(0), type(type) {}
operator float() const {
unsigned int tmp = num << 2;
return reinterpret_cast<float&>(tmp);
}
void operator=(float newnum) {
num = reinterpret_cast<int&>(newnum) >> 2;
}
unsigned int getType() const {
return type;
}
void setType(unsigned int type) {
this->type = type;
}
};
int main() {
const unsigned int TYPE_A = 1;
TypedFloat a(TYPE_A);
a = 3.4;
cout << a + 5.4 << endl;
float b = a;
cout << a << endl;
cout << b << endl;
cout << a.getType() << endl;
return 0;
}
I can't guarantee its portability though.

How much precision do you need? If 16-bit float is enough (sufficient for some types of graphics), then ILM's 16-bit float ("half"), part of OpenEXR is great, obeys all kinds of rules (http://www.openexr.com/), and you'll have plenty of space left over after you pack it into a struct.
On the other hand, if you know the approximate range of values they're going to take, you should consider fixed point. They're more useful than most people realize.

I can't select any of the answers as the definite one, because most of them have valid information, but not quite what I was looking for. So I'll just summarize my conclusions.
The method for conversion I've posted in my question's part 1) is clearly wrong by C++ standard, so other methods to extract float's bits should be used.
And most important... as far as I understand from reading the responses and other sources about IEEE754 floats, it's ok to drop the least significant bits from mantissa. It will mostly affect only precision, with one exception: sNaN. Since sNaN is represented by exponent set to 255, and mantissa != 0, there can be situation where mantissa would be <= 3, and dropping last two bits would convert sNaN to +/-Infinity. But since sNaN are not generated during floating point operations on CPU, its safe under controlled environment.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js