Treating a hexadecimal value as single precision or double precision value - c++

Is there a way I could initialize a float variable from a hexadecimal number? Say I have the single-precision representation of 4, which is 0x40800000 in hex. I want to do something like float a = 0x40800000, but in that case the hex value is taken as an integer. What can I do to make it be treated as the bit pattern of a floating point number?

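One option is to use type punning via a union. This is defined behaviour in C since C99 (previously it was implementation defined). In C++, reading a union member other than the one last written is technically undefined behaviour, although common compilers such as GCC document it as a supported extension.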
union {
float f;
uint32_t u;
} un;
un.u = 0x40800000;
float a = un.f;
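As you tagged this C++, you could also use reinterpret_cast. Note that dereferencing the resulting pointer breaks the strict aliasing rule, so this is technically undefined behaviour even though it works on most compilers; std::memcpy is the portable alternative.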
uint32_t u = 0x40800000;
float a = *reinterpret_cast<float*>(&u);
Before doing either of these, you should also confirm that the two types are the same size, ideally at compile time:
static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits wide");
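A variant that avoids both the union and the pointer cast is to copy the bytes with std::memcpy. The following is a minimal sketch, assuming float is a 32-bit IEEE 754 type; the helper name bits_to_float is made up for illustration.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: reinterpret a 32-bit pattern as a float.
// Assumes float is a 32-bit IEEE 754 type on this platform.
float bits_to_float(std::uint32_t bits) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "float must be 32 bits wide");
    float result;
    std::memcpy(&result, &bits, sizeof result);  // byte copy, no aliasing violation
    return result;
}
```

Under those assumptions, bits_to_float(0x40800000u) yields 4.0f. Compilers typically optimise the memcpy into a single register move.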

You can do this if you introduce a temporary integer variable, cast its address to a pointer to the floating point type and dereference that pointer. You must be careful about the sizes of the types involved, and know that they may differ between platforms. This also breaks the strict aliasing rule, so it is not guaranteed to work everywhere. With my compiler, for example, this works:
unsigned i = 0x40800000;
float a = *(float*)&i;
printf("%f\n", a);
// output 4.000000

I'm not sure how you're getting the value "0x40800000".
If it's coming in as an int you can just do:
const auto foo = 0x40800000;
auto a = *(const float*)&foo;
If it's coming in as a string you can do:
unsigned bits;
float a;
sscanf("0x40800000", "0x%x", &bits); // %x expects an unsigned int*, not a float*
memcpy(&a, &bits, sizeof a);

Related

Convert an integer's binary data to float

Let's say I have an integer:
unsigned long long int data = 4599331010119547059;
Now I want to convert this data to a double. I basically want to change the type, but keep the bits exactly as they were. For the given example, the double value is 0.31415926536.
How can I do that in C++? I saw some methods using a union, but many advised against that approach.
Since C++20, you can use std::bit_cast (declared in <bit>):
double d = std::bit_cast<double>(data);
Prior to C++20, you can use std::memcpy:
double d;
static_assert(sizeof d == sizeof data);
std::memcpy(&d, &data, sizeof d);
Note that the result will vary depending on the floating point representation (IEEE-754 is ubiquitous though) as well as on whether floating point and integer types have the same endianness.
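For code that has to build both before and after C++20, the memcpy approach can be wrapped in a small template. This is only a sketch of the idea; bit_cast_compat is a made-up name, not a standard function, and unlike std::bit_cast it is not constexpr.

```cpp
#include <cstring>
#include <type_traits>

// Hypothetical pre-C++20 stand-in for std::bit_cast.
// Copies the object representation of src into a value of type To.
template <typename To, typename From>
To bit_cast_compat(const From& src) {
    static_assert(sizeof(To) == sizeof(From), "sizes must match");
    static_assert(std::is_trivially_copyable<To>::value,
                  "To must be trivially copyable");
    static_assert(std::is_trivially_copyable<From>::value,
                  "From must be trivially copyable");
    To dst;
    std::memcpy(&dst, &src, sizeof dst);
    return dst;
}
```

Assuming a 64-bit IEEE 754 double, bit_cast_compat<double>(4599331010119547059ULL) gives approximately 0.31415926536, the value from the question.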
Taking the question at face value (assuming you have a valid reason to do this!), this is the only proper way of doing it in C++ before C++20:
int i = get_int();
float x;
static_assert(sizeof(float) == sizeof(int), "!!!");
memcpy(&x, &i, sizeof(x));
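You can use reinterpret_cast, although this breaks the strict aliasing rule and is technically undefined behaviour:
double d = reinterpret_cast<double&>(data);
Note that the target type must be double, not float. Reinterpreting the 64-bit value as a float reads only four of its bytes, and then you don't get 0.314...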

C++ floating point representation

I am trying to create a float from a hexadecimal representation I got from here. For the representation of 32.50002, the site shows the IEEE 754 hexadecimal representation as 0x42020005.
In my code, I have this: float f = 0x42020005;. However, when I print the value, I get 1.10E+9 instead of 32.50002. Why is this?
I am using Microsoft Visual C++ 2010.
When you assign a value to a float variable via =, you don’t assign its internal representation, you assign its value. 0x42020005 in decimal is 1107427333, and that’s the value you are assigning.
The underlying representation of a float cannot be retrieved in a platform independent way. However, making some assumptions (namely, that the float is in fact using the 32-bit IEEE 754 format), we can cheat a bit:
float f;
uint32_t rep = 0x42020005;
std::memcpy(&f, &rep, sizeof f);
This gives the desired result.
0x42020005 is actually an int with the value 1107427333.
You can try this code, which uses a union:
union IntFloat {
uint32_t i;
float f;
};
and use it when you need to convert the value:
union IntFloat val;
val.i = 0x42020005;
printf("%f\n", val.f);
0x42020005 is an int with value of 1107427333.
float f = 0x42020005; is equivalent to
float f = 1107427333;
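The difference between assigning the value and copying the representation can be put side by side. This is a small sketch assuming a 32-bit IEEE 754 float; the function names from_value and from_bits are invented for the comparison.

```cpp
#include <cstdint>
#include <cstring>

// Converts the *value* 0x42020005 (1107427333) to the nearest float,
// which is what `float f = 0x42020005;` does.
float from_value() {
    return static_cast<float>(0x42020005);
}

// Copies the *bit pattern* 0x42020005 into a float instead.
// Assumes float is IEEE 754 binary32.
float from_bits() {
    std::uint32_t rep = 0x42020005u;
    float f;
    static_assert(sizeof f == sizeof rep, "float must be 32 bits wide");
    std::memcpy(&f, &rep, sizeof f);
    return f;
}
```

Here from_value() is roughly 1.1e9, while from_bits() is roughly 32.50002.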

Endianness of float numbers

I want to convert float numbers from little endian to big endian but am not able to do it.
I have successfully converted the endianness of int numbers, but can somebody help with float numbers please?
#include <cstring> // for std::memcpy
#include <algorithm> // for std::reverse
#include <iterator> // For C++11 std::begin() and std::end()
// converting from float to bytes for writing out
float f = 10.0;
char c[sizeof f];
std::memcpy(c,&f,sizeof f);
std::reverse(std::begin(c),std::end(c)); // begin() and end() are C++11. For C++98 say std::reverse(c,c + sizeof f);
// ... write c to network, file, whatever ...
going the other direction:
char c[] = { 41, 36, 42, 59 };
static_assert(sizeof(float) == sizeof c,"");
std::reverse(std::begin(c),std::end(c));
float f;
std::memcpy(&f,c,sizeof f);
The representation of floating point values is implementation defined, so the values resulting from this could be different between different implementations. That is, 10.0 byte swapped could be 1.15705e-041, or something else, or it might not be a valid floating point number at all.
However, any implementation which uses IEEE 754 (which most do, and which you can check by seeing if std::numeric_limits<float>::is_iec559 is true) should give you the same results. (std::numeric_limits is from #include <limits>.)
The above code converts a float to bytes, modifies the bytes, and then converts those bytes back to float. If you have some byte values that you want to read as a float then you could set the values of the char array to your bytes and then use memcpy() as shown above (by the line after std::reverse()) to put those bytes into f.
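The two directions described above can be collected into a pair of functions. This is a sketch under the assumption that float uses IEEE 754 (checked with is_iec559); the names reversed_bytes and from_reversed_bytes are hypothetical. The swapped data is kept as bytes rather than as a float, since a byte-swapped pattern may not be a meaningful floating point value.

```cpp
#include <algorithm>
#include <array>
#include <cstring>
#include <limits>

static_assert(std::numeric_limits<float>::is_iec559,
              "this sketch assumes IEEE 754 floats");

using FloatBytes = std::array<unsigned char, sizeof(float)>;

// float -> bytes in the opposite byte order, ready to be written out.
FloatBytes reversed_bytes(float value) {
    FloatBytes bytes;
    std::memcpy(bytes.data(), &value, sizeof value);
    std::reverse(bytes.begin(), bytes.end());
    return bytes;
}

// bytes read in the opposite byte order -> float.
float from_reversed_bytes(FloatBytes bytes) {
    std::reverse(bytes.begin(), bytes.end());
    float value;
    std::memcpy(&value, bytes.data(), sizeof value);
    return value;
}
```

A round trip such as from_reversed_bytes(reversed_bytes(10.0f)) gives back 10.0f.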
Often people will recommend using reinterpret_cast for this sort of thing but I think it's good to avoid casts. People often use them incorrectly and get undefined behavior without realizing it. In this case reinterpret_cast can be used legally, but I still think it's better to avoid it.
Although it does reduce 4 lines to 1...
std::reverse(reinterpret_cast<char*>(&f),reinterpret_cast<char*>(&f) + sizeof f);
And here's an example of why you shouldn't use reinterpret_cast. The following will probably work but may result in undefined behavior. Since it works you probably wouldn't even notice you've done anything wrong, which is one of the least desirable outcomes possible.
char c[] = { 41, 36, 42, 59 };
static_assert(sizeof(float) == sizeof c,"");
float f = *reinterpret_cast<float*>(&c[0]);
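A common way to do such things is to use a union. (This is well defined in C99 and later. In C++, reading a member other than the one last written is technically undefined behaviour, although major compilers support it.)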
union float_int {
float m_float;
int32_t m_int;
};
That way you can convert your float into an integer, and since you already know how to convert the endianness of an integer, you're all good.
For a double it goes like this:
union double_int {
double m_float;
int64_t m_int;
};
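int32_t and int64_t are usually available in <cstdint> (stdint.h in C); Boost offers equivalents and Qt has its own set of definitions. Just make sure that the size of the integer is exactly equal to the size of the floating point type. On some systems long double is wider than double. If long double is 16 bytes and your compiler provides a 128-bit integer (there is no standard int128_t, but GCC and Clang offer __int128 on 64-bit targets), you can write:
union long_double_int {
long double m_float;
__int128 m_int;
};
If no 128-bit integer is available, you can use a struct like this (anonymous structs inside a union are a compiler extension in C++):
union long_double_int {
long double m_float;
struct {
int64_t m_int_low;
int64_t m_int_hi;
};
};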
Which could make you think that in all cases, instead of using an int, you could use bytes:
union float_int {
float m_float;
unsigned char m_bytes[4];
};
And that's when you discover that you don't need all the usual shifts used when doing such a conversion... because you can also declare:
union char_int {
int m_int;
unsigned char m_bytes[4];
};
Now your code looks very simple:
float_int fi;
char_int ci;
fi.m_float = my_float;
ci.m_bytes[0] = fi.m_bytes[3];
ci.m_bytes[1] = fi.m_bytes[2];
ci.m_bytes[2] = fi.m_bytes[1];
ci.m_bytes[3] = fi.m_bytes[0];
// now ci.m_int is the float in the other endian
fwrite(&ci, 1, 4, f);
[...snip...]
fread(&ci, 1, 4, f);
// here ci.m_int is the float in the other endian, so restore:
fi.m_bytes[0] = ci.m_bytes[3];
fi.m_bytes[1] = ci.m_bytes[2];
fi.m_bytes[2] = ci.m_bytes[1];
fi.m_bytes[3] = ci.m_bytes[0];
my_float = fi.m_float;
// now my_float was restored from the file
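Obviously the endianness is swapped in this example. You probably also need to know whether you actually need to do such a swap if your program is to be compiled on both little-endian and big-endian computers. Many Unix-like systems define a BYTE_ORDER macro for this (for example in <endian.h> on glibc) that you compare against LITTLE_ENDIAN and BIG_ENDIAN; since C++20 there is also std::endian in <bit>.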
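One way to sidestep the question of the host byte order altogether is to build the bytes with shifts on the integer value. The sketch below assumes a 32-bit IEEE 754 float whose byte order matches that of the integer types; float_to_big_endian and big_endian_to_float are hypothetical names.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Serialise a float as four bytes, most significant byte first,
// regardless of the host's byte order.
std::array<unsigned char, 4> float_to_big_endian(float value) {
    static_assert(sizeof(float) == 4, "float must be 32 bits wide");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return {{
        static_cast<unsigned char>(bits >> 24),
        static_cast<unsigned char>(bits >> 16),
        static_cast<unsigned char>(bits >> 8),
        static_cast<unsigned char>(bits)
    }};
}

// Inverse: rebuild the float from four big-endian bytes.
float big_endian_to_float(const std::array<unsigned char, 4>& bytes) {
    const std::uint32_t bits = (std::uint32_t(bytes[0]) << 24) |
                               (std::uint32_t(bytes[1]) << 16) |
                               (std::uint32_t(bytes[2]) << 8) |
                               std::uint32_t(bytes[3]);
    float value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

Because the shifts operate on the value of bits rather than on its memory layout, the same code produces the same byte sequence on little-endian and big-endian hosts, so no BYTE_ORDER check is needed.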

Encode Multiple ints into a double

I would like to encode a pair of ints in a double. For example, say I have a function:
foo(int a, int b)
but instead I want just one double to represent the two ints (ie) :
foo(double aAndB)
Currently I am doing it by having one int on either side of the decimal place (ie 10 and 15 would become 10.15) and then converting it to a stringstream tokenising and extracting the two numbers.
However, this has an obvious flaw when it comes to numbers like 10 and 10 ie it becomes 10.1.
Is there a way to do this through some tricky mathematical method so that I can pass a function a double that represents 2 ints?
Thanks.
Since (usually) a double has 64 bits in it and each int has 32 bits, you'd think that you could just store the bits into the double directly, e.g.:
int32_t i1 = rand();
int32_t i2 = rand();
uint64_t x = ((uint64_t)(uint32_t)i1 << 32) | (uint32_t)i2; // go through uint32_t so a negative i2 is not sign-extended over i1
double theDouble;
memcpy(&theDouble, &x, sizeof(theDouble));
... and doing that "almost works". That is, it works okay for many possible values of i1 and i2 -- but not for all of them. In particular, for IEEE754 floating point format, any value whose exponent bits are all set (0x7ff) is treated as infinity (mantissa zero) or "NaN" (mantissa non-zero), and the floating point hardware can (and does) convert different NaN-equivalent bit-patterns back to its preferred NaN bit-pattern when passing a double as an argument, etc.
Because of this, stuffing two 32-bit integers into a double will appear to work in most cases, but if you test it with all possible input values you'll find some cases where the values unexpectedly mutated during their stay inside the double, and came out as different values when you decoded them again.
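To see which inputs are at risk, one can check whether a given pair lands on a NaN bit pattern. This is a minimal sketch assuming a 64-bit IEEE 754 double; pack_raw and packs_to_nan are names invented for the illustration.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Place hi in the upper 32 bits and lo in the lower 32 bits of a double.
double pack_raw(std::int32_t hi, std::int32_t lo) {
    const std::uint64_t bits =
        (static_cast<std::uint64_t>(static_cast<std::uint32_t>(hi)) << 32) |
        static_cast<std::uint32_t>(lo);
    double d;
    static_assert(sizeof d == sizeof bits, "double must be 64 bits wide");
    std::memcpy(&d, &bits, sizeof d);
    return d;
}

// True when the packed pair is a NaN, i.e. a pattern the hardware
// is allowed to rewrite while the double is being passed around.
bool packs_to_nan(std::int32_t hi, std::int32_t lo) {
    return std::isnan(pack_raw(hi, lo));
}
```

For instance, packs_to_nan(0x7FF80000, 1) is true, whereas packs_to_nan(10, 20) is false. Any hi whose bits 20 to 30 are all set falls into the risky range.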
Of course, you could get around this by being careful only to set the mantissa bits of the double, but that will only give you 26 bits per integer, so you would only be able to store integer values of +/- 33,554,432 or so. Maybe that's okay, depending on your use case.
My advice is, find a different way to do whatever you're trying to do. Storing non-floating-point data in a floating point variable is asking for trouble, especially if you want your code to be at all portable.
If you're lucky and an int is half the size of a double you can store the ints like this:
int a = 10;
int b = 20;
double d;
*(int *)&d = a;
*((int *)&d + 1) = b;
int outa = *((int *)&d);
int outb = *(((int *)&d) + 1);
printf("%d %d\n", outa, outb);
This isn't portable and doesn't work in general (it also breaks the strict aliasing rule). If a double and an int have the same number of bits, what you want is impossible.
A double can exactly represent an integer up to 53 bits. If you want to hold a 26-bit and a 27-bit integer, it's very easy: double combined = bits27*67108864.0 + bits26;
Note that 67108864 is 2^26.
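The matching decode step divides by the same power of two. Below is a sketch of both directions, assuming the caller guarantees that the first value fits in 27 bits and the second in 26 bits (both unsigned); pack_27_26 and unpack_27_26 are hypothetical names.

```cpp
#include <cmath>
#include <cstdint>

const double kTwoPow26 = 67108864.0;  // 2^26

// Combine a 27-bit and a 26-bit unsigned value into one double.
// The result is an integer below 2^53, so it is represented exactly.
double pack_27_26(std::uint32_t bits27, std::uint32_t bits26) {
    return bits27 * kTwoPow26 + bits26;
}

// Split the double back into its two parts.
void unpack_27_26(double combined, std::uint32_t& bits27, std::uint32_t& bits26) {
    const double high = std::floor(combined / kTwoPow26);  // exact: power-of-two divide
    bits27 = static_cast<std::uint32_t>(high);
    bits26 = static_cast<std::uint32_t>(combined - high * kTwoPow26);
}
```

Unlike copying raw bits into the double, this only ever produces ordinary finite values, so nothing is altered when the double is passed around.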
Try to define a union like this:
struct two_int {
int a;
int b;
};
union encoding {
struct two_int a;
double c;
};
But doing it like this may introduce portability problems. Please double check whether this approach is appropriate for your case.
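You can do it by packing the two values into one integer with binary masks, storing that integer in the double, and extracting the fields again afterwards. Bitwise operators do not work on a double directly, so the masking has to be done on an integer.
For example:
double encode(int a, int b)
{
// pack both 8-bit values into one integer, then convert it to double
unsigned packed = (a & 0xFF) | ((b & 0xFF) << 8);
return packed;
}
void decode(double d, int &a, int &b)
{
unsigned packed = static_cast<unsigned>(d);
a = packed & 0xFF;
b = (packed >> 8) & 0xFF;
}
In encode, a ends up in the lower 8 bits of the packed value and b in the next 8 bits, so this version only handles values from 0 to 255.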
If you are always passing two ints to this one parameter then it makes no sense to pass a double. Instead pass either the two ints as separate ints, or wrap them up in a struct.
The way you are doing it leaves you no opportunity to detect the difference between a true double and two ints. And so I conclude that you will lose no functionality by doing what I describe above.
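Wrapping the two ints in a struct could look like the sketch below. IntPair is a made-up type, and the body of foo is a placeholder, since the question does not say what the function does.

```cpp
// Hypothetical aggregate carrying both values.
struct IntPair {
    int a;
    int b;
};

// Placeholder body: the real foo from the question is unknown.
int foo(IntPair values) {
    return values.a + values.b;
}
```

A call then reads foo(IntPair{10, 10}), and both values arrive intact without any encoding step.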

convert double number to (IEEE 754) 64-bit binary string representation in c++

I have a double, and I want to represent it as an IEEE 754 64-bit binary string.
Currently I'm using code like this:
double noToConvert;
unsigned long* valueRef = reinterpret_cast<unsigned long*>(&noToConvert);
bitset<64> lessSignificative(*valueRef);
bitset<64> mostSignificative(*(++valueRef));
mostSignificative <<= 32;
mostSignificative |= lessSignificative;
RowVectorXd binArray = RowVectorXd::Zero(mostSignificative.size());
for(unsigned int i = 0; i <mostSignificative.size();i++)
{
(mostSignificative[i] == 0) ? (binArray(i) = 0) : (binArray(i) = 1);
}
The above code works fine without any problem. But as you can see, I'm using reinterpret_cast and unsigned long, so this code is very much compiler dependent. Could anyone show me how to write code that is platform independent and does not use any libraries? I'm OK with using the standard library and even bitset, but I don't want to use any machine or compiler dependent code.
Thanks in advance.
If you're willing to assume that double is the IEEE-754 double type:
#include <cstdint>
#include <cstring>
uint64_t getRepresentation(const double number) {
uint64_t representation;
memcpy(&representation, &number, sizeof representation);
return representation; // the original snippet was missing this return
}
If you don't even want to make that assumption:
#include <cstring>
char *getRepresentation(const double number) {
char *representation = new char[sizeof number];
memcpy(representation, &number, sizeof number);
return representation;
}
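Going from the copied bits to the binary string the question asks for takes one more step with std::bitset. This is a minimal sketch assuming double is 64 bits wide; to_binary_string is a hypothetical name.

```cpp
#include <bitset>
#include <cstdint>
#include <cstring>
#include <string>

// Returns the 64 bits of a double as text, most significant bit first.
std::string to_binary_string(double number) {
    static_assert(sizeof(double) == sizeof(std::uint64_t),
                  "double must be 64 bits wide");
    std::uint64_t bits;
    std::memcpy(&bits, &number, sizeof bits);
    return std::bitset<64>(bits).to_string();
}
```

On an IEEE 754 platform, to_binary_string(1.0) is "001111111111" followed by 52 zeros: one sign bit, the 11-bit biased exponent 1023, and an all-zero fraction.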
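Why not use a union? (Reading the member that was not last written is technically undefined behaviour in C++, but it commonly works.)
bitset<64> binarize(double input){
union binarizeUnion
{
double doubleVal;
unsigned long long intVal;
} binTransfer;
binTransfer.doubleVal=input;
return bitset<64>(binTransfer.intVal);
}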
The simplest way to get this is to memcpy the double into an array of char:
char double_as_char[sizeof(double)];
memcpy(double_as_char, &noToConvert, sizeof(double_as_char));
and then extract the bits from double_as_char. C and C++ define that in the standard as legal.
Now, if you want to actually extract the various components of a double, you can use the following:
bool sign = std::signbit(noToConvert); // comparing against -0.0 would also match +0.0
int exponent;
double normalized_mantissa = frexp(fabs(noToConvert), &exponent);
unsigned long long mantissa = normalized_mantissa * (1ull << 53);
Since the value returned by frexp is in [0.5, 1), you need to shift it one extra bit to get all the bits in the mantissa as an integer. Then you just need to map that into the binary representation you want, although you'll have to adjust the exponent to include the implicit bias as well.
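Putting the pieces together, the fields can be reassembled into the 64-bit pattern. The sketch below assumes the IEEE 754 binary64 layout and only handles normal finite values (not zero, subnormals, infinity or NaN); ieee754_bits_from_fields is a hypothetical name.

```cpp
#include <cmath>
#include <cstdint>

// Rebuild the binary64 bit pattern of a normal, finite double
// from its sign, exponent and significand.
std::uint64_t ieee754_bits_from_fields(double value) {
    const std::uint64_t sign = std::signbit(value) ? 1u : 0u;

    int exponent = 0;
    // frexp returns a fraction in [0.5, 1) with value == fraction * 2^exponent.
    const double fraction = std::frexp(std::fabs(value), &exponent);

    // Scale to a 53-bit integer whose top bit is the implicit leading 1.
    const std::uint64_t significand =
        static_cast<std::uint64_t>(std::ldexp(fraction, 53));

    // IEEE 754 normalises to [1, 2), so the exponent is one lower; bias is 1023.
    const std::uint64_t biased_exponent =
        static_cast<std::uint64_t>(exponent - 1 + 1023);

    // Drop the implicit leading 1 to get the 52 stored fraction bits.
    const std::uint64_t stored_fraction =
        significand & ((std::uint64_t(1) << 52) - 1);

    return (sign << 63) | (biased_exponent << 52) | stored_fraction;
}
```

For example, 1.0 maps to 0x3FF0000000000000 and -6.5 to 0xC01A000000000000, matching what a memcpy of the same values produces on an IEEE 754 platform.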
The function print_raw_double_binary() in my article Displaying the Raw Fields of a Floating-Point Number should be close to what you want. You'd probably want to replace the casting of double to int with a union, since the former violates "strict aliasing" (although even use of a union to access something different than what is stored is technically illegal).