I have an unsigned long long (or uint64_t) value and want to convert it to a double. The double shall have the same bit pattern as the long value. This way I can set the bits of the double "by hand".
unsigned long long bits = 1ULL;
double result = /* some magic here */ bits;
I am looking for a way to do this.
The portable way to do this is with memcpy. (You may also be able to do it conditionally with reinterpret_cast or a union, but those aren't guaranteed to be portable because they violate the strict aliasing rules.)
// First, check at compile time that the sizes match
static_assert(sizeof(result) == sizeof(bits), "double and unsigned long long must be the same size");
memcpy(&result, &bits, sizeof(bits));
But before you do, make sure you know exactly what you're doing and what floating-point representation is being used (IEEE 754 is the popular/common choice). You'll want to avoid all kinds of problem values like infinity, NaN, and denormal numbers.
Beware of union and reinterpret_cast<double*>(&bits): both of these methods are UB. Pretty much all you can do is memcpy.
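Put together, a minimal self-contained sketch of the memcpy approach (it assumes IEEE-754 binary64; the bit pattern below is simply the representation of 1.0 under that format and is only illustrative):
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    std::uint64_t bits = 0x3FF0000000000000ULL;   // IEEE-754 binary64 pattern of 1.0 (illustrative)
    double result;

    static_assert(sizeof(result) == sizeof(bits), "double and uint64_t must be the same size");
    std::memcpy(&result, &bits, sizeof(result));  // copy the raw bytes; no value conversion happens

    std::printf("%f\n", result);                  // prints 1.000000 on an IEEE-754 platform
}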
Since C++20 we have std::bit_cast() to do such conversions.
Example:
#include <bit>       // std::bit_cast (C++20)
#include <cstdint>

double d = 1.5;
uint64_t i = std::bit_cast<uint64_t>(d);  // use the same bits in an integer
double dd = std::bit_cast<double>(i);     // back to floating point again
The following uses a void pointer.
unsigned long long bits = 1ULL;
void* tempPtr = (void*)&bits;
double result = *(double*)tempPtr;
Related
Let's say I have an integer:
unsigned long long int data = 4599331010119547059;
Now I want to convert this data to a double. I basically want to change the type but keep the bits exactly as they were. For the given example, the resulting double value is 0.31415926536.
How can I do that in C++? I saw some methods using a union, but many advised against that approach.
Since C++20, you can use std::bit_cast:
std::bit_cast<double>(data)
Prior to C++20, you can use std::memcpy:
double d;
static_assert(sizeof d == sizeof data);
std::memcpy(&d, &data, sizeof d);
Note that the result will vary depending on the floating-point representation (IEEE-754 is ubiquitous, though) as well as on whether the floating-point and integer types have the same endianness.
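For the specific value in the question, a short sketch of the C++20 route (it assumes IEEE-754 binary64 and matching endianness, as noted above):
#include <bit>
#include <cstdint>
#include <iostream>

int main() {
    unsigned long long int data = 4599331010119547059ULL;

    double d = std::bit_cast<double>(data);   // reinterpret the same 64 bits as a double
    std::cout << d << '\n';                   // ~0.314159 on an IEEE-754 platform
}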
Taking the question at face value (assuming you have a valid reason to do this!), this is the only proper way of doing it in the current C++ standard:
int i = get_int();
float x;
static_assert(sizeof(float) == sizeof(int), "!!!");
memcpy(&x, &i, sizeof(x));
You can use reinterpret_cast:
float f = reinterpret_cast<float&>(data);
For your value, I don't get 0.314... but that's how you could do it.
If I have an int, convert it to a double, then convert the double back to an int, am I guaranteed to get the same value back that I started with? In other words, given this function:
int passThroughDouble(int input)
{
double d = input;
return d;
}
Am I guaranteed that passThroughDouble(x) == x for all ints x?
No, it isn't. The standard says nothing about the relative sizes of int and double.
If int is a 64-bit integer and double is the standard IEEE double-precision, then it will already fail for numbers bigger than 2^53.
That said, int is still 32-bit on the majority of environments today. So it will still hold in many cases.
If we restrict consideration to the "traditional" IEEE-754-style representation of floating-point types, then you can expect this conversion to be value-preserving if and only if the mantissa of the type double has as many bits as there are non-sign bits in type int.
The mantissa of a classic IEEE-754 double is 53 bits wide (including the "implied" leading bit), which means that you can represent integers in the [-2^53, +2^53] range exactly. Everything outside this range will generally lose precision.
So it all depends on how wide your int is compared to your double. The answer depends on the specific platform: with a 32-bit int and an IEEE-754 double, the equality should hold.
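A small self-contained check along these lines (it assumes an IEEE-754 double with a 53-bit significand; the particular values are only illustrative):
#include <cstdint>
#include <iostream>

int main() {
    // Every 32-bit int survives the round trip through an IEEE-754 double.
    std::int32_t small = 2147483647;   // INT32_MAX
    std::cout << (static_cast<std::int32_t>(static_cast<double>(small)) == small) << '\n';   // prints 1

    // A 64-bit value just above 2^53 generally does not survive.
    std::int64_t big = (std::int64_t{1} << 53) + 1;
    std::cout << (static_cast<std::int64_t>(static_cast<double>(big)) == big) << '\n';        // prints 0
}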
I have a double and I want to represent it as an IEEE 754 64-bit binary string.
Currently I'm using code like this:
double noToConvert;
unsigned long* valueRef = reinterpret_cast<unsigned long*>(&noToConvert);
bitset<64> lessSignificative(*valueRef);
bitset<64> mostSignificative(*(++valueRef));
mostSignificative <<= 32;
mostSignificative |= lessSignificative;
RowVectorXd binArray = RowVectorXd::Zero(mostSignificative.size());
for(unsigned int i = 0; i <mostSignificative.size();i++)
{
(mostSignificative[i] == 0) ? (binArray(i) = 0) : (binArray(i) = 1);
}
The above code works fine without any problem. But as you can see, I'm using reinterpret_cast and unsigned long, so this code is very much compiler-dependent. Could anyone show me how to write code that is platform-independent, without using any external libraries? I'm OK with the standard library and even bitset, but I don't want any machine- or compiler-dependent code.
Thanks in advance.
If you're willing to assume that double is the IEEE-754 double type:
#include <cstdint>
#include <cstring>
uint64_t getRepresentation(const double number) {
    uint64_t representation;
    memcpy(&representation, &number, sizeof representation);
    return representation;
}
If you don't even want to make that assumption:
#include <cstring>
char *getRepresentation(const double number) {
char *representation = new char[sizeof number];
memcpy(representation, &number, sizeof number);
return representation;
}
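For the original goal of a 64-bit binary string, a hedged usage sketch built on the uint64_t variant above (the value and the printing choice are only for illustration):
#include <bitset>
#include <cstdint>
#include <cstring>
#include <iostream>

// Same idea as the uint64_t variant above, repeated here so the snippet is self-contained.
std::uint64_t getRepresentation(const double number) {
    std::uint64_t representation;
    std::memcpy(&representation, &number, sizeof representation);
    return representation;
}

int main() {
    double value = 3.141592653589793;
    std::cout << std::bitset<64>(getRepresentation(value)).to_string() << '\n';   // 64-character binary string
}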
Why not use the union?
bitset<64> binarize(unsigned long long input){
    union binarizeUnion
    {
        unsigned long long intVal;   // the value whose bits we want
        bitset<64> bits;
    } binTransfer = { input };       // initialise the integer member
    return binTransfer.bits;
}
The simplest way to get this is to memcpy the double into an array of char:
char double_as_char[sizeof(double)];
memcpy(double_as_char, &noToConvert, sizeof(double_as_char));
and then extract the bits from double_as_char. Both the C and C++ standards allow this.
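For completeness, a small sketch of dumping those bytes (the printed order is the machine's native byte order, so it differs between little- and big-endian systems; the value is only an example):
#include <cstdio>
#include <cstring>

int main() {
    double noToConvert = 3.141592653589793;

    char double_as_char[sizeof(double)];
    std::memcpy(double_as_char, &noToConvert, sizeof(double_as_char));

    // Print each byte in the order it sits in memory.
    for (std::size_t i = 0; i < sizeof(double_as_char); ++i)
        std::printf("%02x ", static_cast<unsigned char>(double_as_char[i]));
    std::printf("\n");
}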
Now, if you want to actually extract the various components of a double, you can use the following:
bool sign = noToConvert <= -0.0;
int exponent;
double normalized_mantissa = frexp(noToConvert, &exponent);
unsigned long long mantissa = normalized_mantissa * (1ull << 53);
Since the value returned by frexp is in [0.5, 1), you need to shift it one extra bit to get all the bits of the mantissa as an integer. Then you just need to map that into the binary representation you want, although you'll have to adjust the exponent to include the implicit bias as well.
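Putting those pieces together, a hedged sketch for normal, nonzero values (it assumes IEEE-754 binary64; the +1022 bias below accounts for frexp returning a mantissa in [0.5, 1) rather than [1, 2)):
#include <cmath>
#include <cstdint>
#include <cstdio>

int main() {
    double noToConvert = 3.141592653589793;   // example value only

    bool sign = noToConvert < 0.0;            // sign bit (ignores -0.0 and NaN corner cases)

    int exponent;
    double normalized_mantissa = std::frexp(std::fabs(noToConvert), &exponent);

    // frexp yields a value in [0.5, 1); scaling by 2^53 gives the full 53-bit significand.
    std::uint64_t significand = static_cast<std::uint64_t>(normalized_mantissa * (1ULL << 53));

    // Assemble the assumed binary64 fields: biased exponent and 52-bit fraction
    // (the leading significand bit is implicit, so it is masked off).
    std::uint64_t biased_exponent = static_cast<std::uint64_t>(exponent + 1022);
    std::uint64_t fraction = significand & ((1ULL << 52) - 1);

    std::uint64_t bits = (static_cast<std::uint64_t>(sign) << 63)
                       | (biased_exponent << 52)
                       | fraction;
    std::printf("%016llx\n", static_cast<unsigned long long>(bits));   // 400921fb54442d18 for the example
}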
The function print_raw_double_binary() in my article Displaying the Raw Fields of a Floating-Point Number should be close to what you want. You'd probably want to replace the casting of double to int with a union, since the former violates "strict aliasing" (although even use of a union to access something different than what is stored is technically illegal).
I am trying to convert a char* to a double and back to a char* again. The following code works fine if the application is built as 32-bit, but it doesn't work for a 64-bit application. The problem occurs when you try to convert back from the integer to char*. For example, if hello = 0x000000013fcf7888 then converted = 0x000000003fcf7888; only the last 32 bits are right.
#include <iostream>
#include <stdlib.h>
#include <tchar.h>
using namespace std;
int _tmain(int argc, _TCHAR* argv[]){
char* hello = "hello";
unsigned int hello_to_int = (unsigned int)hello;
double hello_to_double = (double)hello_to_int;
cout<<hello<<endl;
cout<<hello_to_int<<"\n"<<hello_to_double<<endl;
unsigned int converted_int = (unsigned int)hello_to_double;
char* converted = reinterpret_cast<char*>(converted_int);
cout<<converted_int<<"\n"<<converted<<endl;
getchar();
return 0;
}
On 64-bit Windows, pointers are 64-bit while int is 32-bit. This is why you're losing data in the upper 32 bits when casting. Instead of int, use long long to hold the intermediate result.
char* hello = "hello";
unsigned long long hello_to_int = (unsigned long long)hello;
Make similar changes for the reverse conversion. But this is not guaranteed to make the conversions function correctly because a double can easily represent the entire 32-bit integer range without loss of precision but the same is not true for a 64-bit integer.
Also, this isn't going to work
unsigned int converted_int = (unsigned int)hello_to_double;
That conversion will simply truncate any digits after the decimal point in the floating-point representation. The problem exists even if you change the data type to unsigned long long. You'd need to reinterpret the bits instead (e.g. reinterpret_cast<unsigned long long&>(hello_to_double)) rather than convert the value to make it work.
Even after all that, you may still run into trouble depending on the value of the pointer. The conversion to double may cause the value to be a signalling NaN, for instance, in which case your code might throw an exception.
Simple answer is, unless you're trying this out for fun, don't do conversions like these.
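If you do try it for fun anyway, here is a hedged sketch of a round trip that stays well-defined: std::uintptr_t (not part of the original code) holds the pointer value, and memcpy moves the raw bits in and out of the double, so the double is only ever copied, never used in arithmetic:
#include <cstdint>
#include <cstring>
#include <iostream>

int main() {
    const char* hello = "hello";

    // Pointer -> integer -> pointer is well-defined with a wide-enough integer type.
    std::uintptr_t as_int = reinterpret_cast<std::uintptr_t>(hello);
    const char* back = reinterpret_cast<const char*>(as_int);
    std::cout << back << '\n';                        // prints "hello"

    // To smuggle the bits through a double, copy them instead of converting the value.
    double carrier = 0.0;
    static_assert(sizeof(carrier) >= sizeof(as_int), "double too small for the pointer bits");
    std::memcpy(&carrier, &as_int, sizeof(as_int));

    std::uintptr_t recovered = 0;
    std::memcpy(&recovered, &carrier, sizeof(recovered));
    std::cout << reinterpret_cast<const char*>(recovered) << '\n';   // prints "hello" again
}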
You can't cast a char* to int on 64-bit Windows because an int is 32 bits, while a char* is 64 bits because it's a pointer. Since a double is always 64 bits, you might be able to get away with casting between a double and char*.
A couple of issues with encoding any integer (specifically, a collection of bits) into a floating point value:
Conversions from 64-bit integers to doubles can be lossy. A double has 53 bits of actual precision, so integers with magnitude above 2^53 will not necessarily be represented exactly.
If you decide to reinterpret the bits of a pointer as a double instead (via union or reinterpret_cast), you will still have issues if you happen to encode a pointer as a set of bits that is not a valid double representation. Unless you can guarantee that the double value never gets written back by the FPU, the FPU can silently transform an invalid double into another invalid double (see NaN), i.e., a double value that represents the same value but has different bits. (See this for issues related to using floating point formats as bits.)
You can probably safely get away with encoding a 32-bit pointer in a double, as that will definitely fit within the 53-bit precision range.
only the last 32 bits are right.
That's because an int on your platform is only 32 bits long. Note that reinterpret_cast only guarantees that you can convert a pointer to an integer of sufficient size (not your case), and back.
If it works on any system, anywhere, just call yourself lucky and move on. Converting a pointer to an integer is one thing (as long as the integer is large enough, you can get away with it), but a double is a floating-point number; what you are doing simply doesn't make any sense, because a double is NOT necessarily capable of representing any arbitrary number. A double has range and precision limitations, and limits on how it represents things. It can represent numbers across a wide range of values, but it can't represent EVERY number in that range.
Remember that a double has two components: the mantissa and the exponent. Together, these allow you to represent either very big or very small numbers, but the mantissa has a limited number of bits. If you run out of bits in the mantissa, you're going to lose some bits of the number you are trying to represent.
Apparently you got away with it under certain circumstances, but you're asking it to do something it wasn't made for, and for which it is manifestly inappropriate.
Just don't do that - it's not supposed to work.
This is as expected.
Typically a char* is going to be 32 bits on a 32-bit system, 64 bits on a 64-bit system; double is typically 64 bits on both systems. (These sizes are typical, and probably correct for Windows; the language permits a lot more variations.)
Conversion from a pointer to a floating-point type is, as far as I know, undefined. That doesn't just mean that the result of the conversion is undefined; the behavior of a program that attempts to perform such a conversion is undefined. If you're lucky, the program will crash or fail to compile.
But you're converting from a pointer to an integer (which is permitted, but implementation-defined) and then from an integer to a double (which is permitted and meaningful for meaningful numeric values -- but converted pointer values are not numerically meaningful). You're losing information because not all of the 64 bits of a double are used to represent the magnitude of the number; typically 11 or so bits are used to represent the exponent.
What you're doing quite simply makes no sense.
What exactly are you trying to accomplish? Whatever it is, there's surely a better way to do it.
Is it safe to cast a UINT64 to a float? I realize that UINT64 does not hold decimals, so my float will be whole numbers. However, my function to return my delta-time returns a UINT64, which isn't a very useful type for the function I'm currently working with. I'm assuming a simple static_cast<float>(uint64value) will not work?
Large values of a UINT64 (an 8-byte value) may lose precision if you cast them to a float, which is only 4 bytes.
Define safe - you can easily lose a lot of digits of precision if the 64-bit value is large, but apart from that (which is presumably a known issue that you don't mind about), the conversion should be safe. If your compiler doesn't handle it correctly, get a better compiler.
You might try performing your arithmetic in a long double or double first:
typedef long double real_type;

real_type x = static_cast<real_type>(long1);
real_type y = static_cast<real_type>(long2);
real_type z = x / y;

float result = static_cast<float>(z);
Rule of thumb: int can be cast to and back from double
It is safe to cast to and back from float as well, but you will be limited to rather small numbers (about 16 million), and if you exceed that magnitude you will silently lose low-order precision. With double, you can use much larger integers.
Assuming an IEEE 754 underlying floating-point system, you will be able to accurately cast integers of 23 bits to and from float and 52 bits to and from double. Actually, you get one more bit because of the hidden bit, so you can fit an integer up to and including 0x1FFFFFFFFFFFFF, or 9007199254740991, in a double.
So every single 32-bit integer has an exact representation in double; it can be cast to and back safely, and the ordinary arithmetic operations on them will produce exact results.
Indeed, this is what JavaScript does for every integer numeric operation. People who say "floating point is inaccurate" are drastically oversimplifying the matter.
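A small illustration of that rule of thumb (it assumes IEEE-754 float and double; the value is chosen because 2^24 + 1 is the first integer a float cannot hold exactly):
#include <cstdint>
#include <iostream>

int main() {
    std::uint64_t value = (std::uint64_t{1} << 24) + 1;   // 16777217

    // float: only ~24 bits of significand, so the round trip loses the low bit.
    std::cout << (static_cast<std::uint64_t>(static_cast<float>(value)) == value) << '\n';   // prints 0

    // double: exact for integers up to 2^53, so the round trip is lossless here.
    std::cout << (static_cast<std::uint64_t>(static_cast<double>(value)) == value) << '\n';  // prints 1
}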
Safe? What do you mean by safe? As far as precision is concerned, an IEEE-754 float has a 23(+1)-bit mantissa. By forcefully converting a 64-bit value into a "rounded" 24-bit value, you'll inflict a massive loss of precision across the least-significant bits. Is this loss acceptable in your application? Frankly, if your original value really makes use of the 64-bit range, forcing it into something as small as a float doesn't sound like a good idea to me.
Why wouldn't static_cast work?
Max uint64 is 2^64 ≈ 1.84467441 × 10^19.
The largest finite 32-bit IEEE-754 binary float is about 3.4 × 10^38 (the decimal32 format linked below goes even higher, to 9.999999 × 10^96), so the value fits comfortably in range; you only lose precision beyond float's 24-bit significand.
Should work... having problems?
http://en.wikipedia.org/wiki/Decimal32_floating-point_format