Cast double to int64_t and back without losing information - c++

This question has probably been asked, but I searched and could not find the answer.
I'm implementing a toy virtual machine, where the OpCodes take the form:
std::tuple<int8_t, int64_t, int64_t> // instruction op1, op2
I'm trying to pack a double into one of the operands and read it back again when processing it. This doesn't work reliably.
double d = ...
auto a = static_cast<int64_t>(d);
auto b = static_cast<double>(a);
// sometimes, b != d
Is there a way to pack the bit representation of the double into an int64_t, and then read that bit pattern back to get exactly the same double as before?

static_cast performs a value conversion, so the fractional part is always lost. memcpy is what you are after.
double d = ...
int64_t a;
memcpy(&a, &d, sizeof(a));
double d2;
memcpy(&d2, &a, sizeof(d2));
Still, I would probably instead make the operands a union with a double and an int64_t (plus possibly other types that are interesting for your VM).
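A minimal sketch of the union idea, under the assumption that the opcode tells the VM which member of the operand was written. The names Operand, Instruction, kPushDouble, make_push_double and read_double_operand are hypothetical and only serve the illustration. In C++ only the member that was last written should be read, which is why the opcode is checked before the read.

```cpp
#include <cassert>   // assert, also used by the usage note
#include <cstdint>

// Hypothetical operand type: the opcode decides which member is active.
union Operand {
    std::int64_t i;
    double d;
};

struct Instruction {
    std::int8_t op;  // hypothetical opcode value
    Operand a;
    Operand b;
};

// Hypothetical opcode meaning "operand a holds a double".
constexpr std::int8_t kPushDouble = 1;

inline Instruction make_push_double(double value) {
    Instruction ins{};
    ins.op = kPushDouble;
    ins.a.d = value;  // a.d becomes the active member of operand a
    ins.b.i = 0;      // operand b is unused here and holds an integer zero
    return ins;
}

inline double read_double_operand(const Instruction& ins) {
    assert(ins.op == kPushDouble);  // only read the member that was written
    return ins.a.d;
}
```

With this, read_double_operand(make_push_double(0.1)) gives back exactly 0.1, because the double is stored and read as a double and never converted.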

One way to make it work is to reinterpret the block of memory as int64_t/double, i.e. to do pointer casts:
double d = ...
auto *a = (int64_t*)&d;
auto *d2 = (double*)a;
auto b = *d2;
assert(d == b);
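Note that this assumes double and int64_t are the same size (64 bits), which the standard does not guarantee. Dereferencing a pointer obtained by casting between unrelated types also formally violates the strict aliasing rule, so memcpy is the safer way to get the same result.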
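The size assumption can be checked at compile time instead of being remembered. The sketch below combines a static_assert with memcpy; pack_double and unpack_double are hypothetical helper names, not part of any library.

```cpp
#include <cassert>   // assert, for the check in the usage note
#include <cstdint>
#include <cstring>

// Refuse to compile on a platform where the two types differ in size.
static_assert(sizeof(double) == sizeof(std::int64_t),
              "double must be 64 bits wide for this packing");

// Copies the object representation of d into an integer, bit for bit.
inline std::int64_t pack_double(double d) {
    std::int64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    return bits;
}

// Copies the bits back, yielding the original double.
inline double unpack_double(std::int64_t bits) {
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}
```

A round trip such as assert(unpack_double(pack_double(0.1)) == 0.1); then holds, because no value conversion ever takes place.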

Related

Safe, signed subtraction of large unsigned ints

I'm working with a protocol where I don't have control of the input types, and I need to compute the difference of two 64-bit unsigned integers (currently stored in std::uint64_t). The difference might be negative or positive. I don't want to do this:
uint64_t a{1};
uint64_t b{2};
int64_t x = a - b; // -1; correct, but what if a and b were /enormous/?
So I was looking at Boost's safe_numerics here. The large-values case is handled as I would like:
boost::safe_numerics::safe<uint64_t> a{UINT64_MAX};
boost::safe_numerics::safe<uint64_t> b{1};
boost::safe_numerics::safe<int64_t> x = a - b;
// ^^ Throws "converted unsigned value too large: positive overflow error"
Great! But ... they're a little too safe:
boost::safe_numerics::safe<uint64_t> a{1}; //UINT64_MAX;
boost::safe_numerics::safe<uint64_t> b{2};
boost::safe_numerics::safe<int64_t> x = a - b;
// ^^ Throws "subtraction result cannot be negative: negative overflow error"
// ... even though `x` is signed
I have a suspicion that it's a - b that actually throws, not the assignment. But I've tried every kind of cast in the book to get a - b into a safe, signed integer, but no joy.
There are some inelegant ways to deal with this, like comparing a and b to always subtract the smaller from the larger. Or I can do a lot of casting with boost::numeric_cast, or old-school range checking. Or...god forbid...I just throw myself when a or b exceed 63 bits, but all that is a bit lame.
But my real question is: Why does Boost detect a negative overflow in the final example above? Am I using safe_numerics incorrectly?
I am targeting C++17 with gcc on a 64-bit system and using Boost 1.71.
The behavior I was looking for is actually implemented in boost::safe_numerics::checked_result:
https://www.boost.org/doc/libs/develop/libs/safe_numerics/doc/html/checked_result.html
checked::subtract allows a negative result when the difference of two unsigned integers is negative (and is being stored in a signed integer of adequate size). But it throws when the result does not fit. For example:
using namespace std;
using namespace boost::safe_numerics;
safe<uint64_t> a{2};
safe<uint64_t> b{1};
checked_result<int64_t> x0 = checked::subtract<int64_t>(b, a);
assert(x0 == -1);
checked_result<int64_t> x1 = checked::subtract<int64_t>(a, b);
assert(x1 == 1);
a = UINT64_MAX;
checked_result<int64_t> x2 = checked::subtract<int64_t>(a, b); // throws
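For comparison, the "compare, then subtract the smaller from the larger" approach mentioned in the question can be sketched in standard C++17 without Boost. The function name signed_difference is hypothetical, and returning an empty std::optional on overflow is a design choice of this sketch, not the behaviour of safe_numerics.

```cpp
#include <cassert>   // assert, for the checks in the usage note
#include <cstdint>
#include <limits>
#include <optional>

// Returns a - b as a signed 64-bit value, or std::nullopt if it does not fit.
inline std::optional<std::int64_t> signed_difference(std::uint64_t a, std::uint64_t b) {
    constexpr std::uint64_t max_pos =
        static_cast<std::uint64_t>(std::numeric_limits<std::int64_t>::max());
    if (a >= b) {
        const std::uint64_t diff = a - b;  // cannot wrap because a >= b
        if (diff > max_pos) return std::nullopt;
        return static_cast<std::int64_t>(diff);
    }
    const std::uint64_t magnitude = b - a;  // size of the negative result
    if (magnitude > max_pos + 1) return std::nullopt;
    if (magnitude == max_pos + 1) {
        // Exactly INT64_MIN, whose magnitude has no positive counterpart.
        return std::numeric_limits<std::int64_t>::min();
    }
    return -static_cast<std::int64_t>(magnitude);
}
```

Here signed_difference(1, 2) yields -1, while signed_difference(UINT64_MAX, 1) yields an empty optional, so the caller decides how to react to an out-of-range result.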

Converting a double to intptr_t in C++

Assume the following:
#include <iostream>
int main()
{
double a = 5.6;
intptr_t b = (intptr_t) a;
double c = (double) b;
return 0;
}
c will be 5. My question is, since intptr_t is also 64 bits on a 64 bit machine (same as double), how come the precision bits are not saved during casting?
Although intptr_t is meant to be wide enough to hold a pointer, its underlying type is still an integer. Thus
intptr_t b = (intptr_t)a;
is still truncating the double, just as if you had written:
int b = (int)a;
What you want to do is take the address of a:
intptr_t b = (intptr_t)&a;
And then convert it back:
double c = *(double*)b;
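The two behaviours can be put side by side in a short sketch. The function names are hypothetical. Note that the second function stores the address of the double, not its bits, so the integer is only meaningful while the original object is alive.

```cpp
#include <cassert>   // assert, for the checks in the usage note
#include <cstdint>

// Value conversion: the fractional part is discarded.
inline std::intptr_t truncate_to_intptr(double value) {
    return static_cast<std::intptr_t>(value);
}

// Address round trip: the integer records where the double lives,
// and the pointer recovered from it reads the original object.
inline double read_via_address(const double& value) {
    const std::intptr_t address = reinterpret_cast<std::intptr_t>(&value);
    return *reinterpret_cast<const double*>(address);
}
```

For double a = 5.6, truncate_to_intptr(a) is 5 and read_via_address(a) is 5.6.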

C++ double pointer array to float conversion

What is a correct way to convert double to float in C++? Is the conversion implicit?
Question 1: Consider double d = 5.0; and float f;
Which one is correct?
f = d;
f = (float)d;
f = static_cast<float>(d);
Question 2: Now consider we have
char *buffer = readAllBuffer();
double *d = (double*)(buffer + offset);
float f;
Which one is now correct?
f = d[0];
f = (float)d[0];
f = static_cast<float>(d[0]);
Thanks in advance!
They all boil down to the same thing, and the use of arrays is a red herring. You can indeed write
float f = d;
Some folk argue that a static_cast makes code more readable as it sticks out so clearly. It can also defeat warnings that some compilers might issue if a less long-winded form is used.
Since the values a double can represent are a superset of those a float can represent, converting a double to a float might lose precision. Finally, note that for
float f1 = whatever;
double d1 = f1;
float f2 = d1;
, the C++ standard insists that f1 and f2 must be the same value.
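Both directions can be checked with a small sketch. The helper names are hypothetical, and the comments assume the usual IEEE 754 float and double formats.

```cpp
#include <cassert>   // assert, for the checks in the usage note

// float -> double -> float: expected to return true for any non-NaN float.
inline bool float_round_trips(float f1) {
    const double d1 = f1;                    // widening keeps the value
    const float f2 = static_cast<float>(d1); // narrowing restores it
    return f1 == f2;
}

// double -> float -> double: true only if the double fits in a float exactly.
inline bool double_survives_float(double d) {
    const float f = static_cast<float>(d);
    return static_cast<double>(f) == d;
}
```

Under those assumptions float_round_trips(0.1f) is true, double_survives_float(0.5) is true, and double_survives_float(0.1) is false, because 0.1 needs more significand bits than a float has.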
You do have one major issue. This is not allowed:
double *d = (double*)(buffer + offset);
It violates strict aliasing and quite possibly alignment requirements. Instead you need to use memcpy:
double d;
memcpy(&d, buffer + offset, sizeof d);
float f = d;
Any of the cast alternatives can be substituted for the last line. The important change is to stop dereferencing a pointer with the wrong type and alignment and to make a bytewise copy instead.
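Wrapped into a function, the bytewise copy might look like the sketch below. read_double_at is a hypothetical name, and the caller is assumed to guarantee that at least sizeof(double) bytes are available at the given offset.

```cpp
#include <cassert>   // assert, for the checks in the usage note
#include <cstddef>
#include <cstring>

// Reads a double stored at an arbitrary (possibly unaligned) byte offset.
inline double read_double_at(const char* buffer, std::size_t offset) {
    double d;
    std::memcpy(&d, buffer + offset, sizeof d);  // no aliasing or alignment issue
    return d;
}
```

The result can then be converted as before, for example float f = static_cast<float>(read_double_at(buffer, offset));.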

Treating a hexadecimal value as single precision or double precision value

Is there a way I could initialize a float variable with a hexadecimal number? Say I have the single precision representation of 4, which is 0x40800000 in hex. I want to write something like float a = 0x40800000, but then the hex value is taken as an integer. What can I do to make it be treated as a floating point bit pattern?
One option is to use type punning via a union. This is defined behaviour in C since C99 (previously this was implementation defined).
union {
float f;
uint32_t u;
} un;
un.u = 0x40800000;
float a = un.f;
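As you tagged this C++, you could also use reinterpret_cast, although dereferencing the resulting pointer formally violates the strict aliasing rule; memcpy avoids that.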
uint32_t u = 0x40800000;
float a = *reinterpret_cast<float*>(&u);
Before doing either of these, you should also confirm that they're the same size:
assert(sizeof(float) == sizeof(uint32_t));
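A memcpy-based variant with the size check moved to compile time might look like this sketch. float_from_bits and bits_from_float are hypothetical names, and the expected values assume the IEEE 754 single precision format.

```cpp
#include <cassert>   // assert, for the checks in the usage note
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "float must be 32 bits wide for this conversion");

// Interprets a 32-bit pattern as a float without any value conversion.
inline float float_from_bits(std::uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// The inverse: exposes the bit pattern of a float.
inline std::uint32_t bits_from_float(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

Under that assumption float_from_bits(0x40800000u) is 4.0f.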
You can do this if you introduce a temporary integer variable, cast its address to a pointer to a floating point type, and dereference that pointer. You must be careful about the sizes of the types involved, and know that they may differ between platforms. With my compiler, for example, this works:
unsigned i = 0x40800000;
float a = *(float*)&i;
printf("%f\n", a);
// output: 4.000000
I'm not sure how you're getting the value "0x40800000".
If that's coming in as an int you can just do:
const auto foo = 0x40800000;
auto a = *(float*)&foo;
If that's coming in as a string you can do:
unsigned u;
sscanf("0x40800000", "0x%x", &u); // %x expects an unsigned*, not a float*
float a;
memcpy(&a, &u, sizeof a);
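For the string case, a sketch using std::strtoul followed by memcpy is another option. The function name float_from_hex_string is hypothetical, error handling is left out, and the result assumes a 32-bit IEEE 754 float.

```cpp
#include <cassert>   // assert, for the checks in the usage note
#include <cstdint>
#include <cstdlib>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "float must be 32 bits wide for this conversion");

// Parses hexadecimal text (an optional "0x" prefix is accepted by strtoul
// in base 16) and reinterprets the resulting 32 bits as a float.
inline float float_from_hex_string(const char* text) {
    const unsigned long parsed = std::strtoul(text, nullptr, 16);
    const std::uint32_t bits = static_cast<std::uint32_t>(parsed);
    float result;
    std::memcpy(&result, &bits, sizeof result);
    return result;
}
```

With this, float_from_hex_string("0x40800000") is 4.0f.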

Encode Multiple ints into a double

I would like to encode a pair of ints in a double. For example, say I wanted to pass a function:
foo(int a, int b)
but instead I want just one double to represent the two ints, i.e.:
foo(double aAndB)
Currently I am doing it by having one int on either side of the decimal point (i.e. 10 and 15 would become 10.15), then converting it to a stringstream, tokenising, and extracting the two numbers.
However, this has an obvious flaw when it comes to numbers like 10 and 10, i.e. it becomes 10.1.
Is there a way to do this through some tricky mathematical method so that I can pass a function a double that represents 2 ints?
Thanks.
Since (usually) a double has 64 bits in it and each int has 32 bits, you'd think that you could just store the bits into the double directly, e.g.:
int32_t i1 = rand();
int32_t i2 = rand();
uint64_t x = (((uint64_t)(uint32_t)i1)<<32) | (uint32_t)i2; // go through uint32_t so a negative i2 cannot overwrite i1's bits
double theDouble;
memcpy(&theDouble, &x, sizeof(theDouble));
... and doing that "almost works". That is, it works okay for many possible values of i1 and i2, but not for all of them. In particular, in the IEEE 754 floating point format, any value whose exponent bits are all set (0x7ff) is treated as NaN (or as infinity when the mantissa bits are all zero), and the floating point hardware can (and does) convert different NaN bit patterns to its preferred NaN bit pattern when passing a double as an argument, etc.
Because of this, stuffing two 32-bit integers into a double will appear to work in most cases, but if you test it with all possible input values you'll find some cases where the values unexpectedly mutated during their stay inside the double, and came out as different values when you decoded them again.
Of course, you could get around this by being careful only to set the mantissa bits of the double, but that will only give you 26 bits per integer, so you would only be able to store integer values of +/- 33,554,432 or so. Maybe that's okay, depending on your use case.
My advice is, find a different way to do whatever you're trying to do. Storing non-floating-point data in a floating point variable is asking for trouble, especially if you want your code to be at all portable.
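One such different way, sketched below, keeps the packed pair in a 64-bit integer instead of a double, so no floating point hardware ever touches the bits. The names pack_ints, unpack_hi and unpack_lo are hypothetical.

```cpp
#include <cassert>   // assert, for the checks in the usage note
#include <cstdint>

// Packs two 32-bit ints into one 64-bit integer: hi in the upper half,
// lo in the lower half. Going through uint32_t avoids sign extension.
inline std::uint64_t pack_ints(std::int32_t hi, std::int32_t lo) {
    return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(hi)) << 32)
         | static_cast<std::uint32_t>(lo);
}

// The conversions back to int32_t are modular: guaranteed since C++20 and
// implemented that way by mainstream compilers before that.
inline std::int32_t unpack_hi(std::uint64_t packed) {
    return static_cast<std::int32_t>(static_cast<std::uint32_t>(packed >> 32));
}

inline std::int32_t unpack_lo(std::uint64_t packed) {
    return static_cast<std::int32_t>(static_cast<std::uint32_t>(packed & 0xFFFFFFFFu));
}
```

Every pair of 32-bit values survives this round trip, including negative ones such as pack_ints(-1, 5).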
If you're lucky and an int is half the size of a double, you can store the ints like this:
int a = 10;
int b = 20;
double d;
*(int *)&d = a;
*((int *)&d + 1) = b;
int outa = *((int *)&d);
int outb = *(((int *)&d) + 1);
printf("%d %d\n", outa, outb);
This is not portable and does not work in general. If a double and an int have the same number of bits, what you want is impossible.
A double can exactly represent an integer up to 53 bits. If you want to hold a 26-bit and a 27-bit integer, it's very easy: double combined = bits27*67108864.0 + bits26;
Note that 67108864 is 2^26.
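The formula and its inverse can be sketched as follows, for unsigned inputs that fit in 27 and 26 bits. The names combine, high_part and low_part are hypothetical. Every combined value stays below 2^53, so the double holds it exactly.

```cpp
#include <cassert>   // assert, used for the range checks below
#include <cstdint>

constexpr std::uint32_t kLowBits = 26;
constexpr std::uint64_t kLowRange = std::uint64_t{1} << kLowBits;  // 67108864

// Stores bits27 above bits26 in the integer value of a double.
inline double combine(std::uint32_t bits27, std::uint32_t bits26) {
    assert(bits27 < (1u << 27));  // must fit in 27 bits
    assert(bits26 < kLowRange);   // must fit in 26 bits
    return bits27 * 67108864.0 + bits26;
}

// Recovers the 27-bit part by dropping the low 26 bits.
inline std::uint32_t high_part(double combined) {
    return static_cast<std::uint32_t>(static_cast<std::uint64_t>(combined) >> kLowBits);
}

// Recovers the 26-bit part by masking off everything above it.
inline std::uint32_t low_part(double combined) {
    return static_cast<std::uint32_t>(static_cast<std::uint64_t>(combined) & (kLowRange - 1));
}
```

For example, combine(10, 10) decodes to 10 and 10, which is the case that broke the decimal point scheme in the question.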
Try to define a union like this:
struct two_int {
int a;
int b;
};
union encoding {
struct two_int a;
double c;
};
But doing it like this may introduce portability problems. Please double check whether this approach is appropriate for your case.
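You can do it by packing both values into an integer with binary masks and storing that integer as the value of the double. Bitwise operators are not defined for double, so the masking has to happen on an integer type.
For example:
#include <cstdint>
// Packs the low 8 bits of a and b into the value of a double.
double encode(int a, int b)
{
std::uint32_t bits = 0;
bits = bits | (a & 0xFF);
bits = bits | ((b & 0xFF) << 8);
return bits; // at most 16 bits are used, so the conversion is exact
}
// Recovers the two 8-bit values stored by encode.
void decode(double d, int &a, int &b)
{
std::uint32_t bits = static_cast<std::uint32_t>(d);
a = bits & 0xFF;
b = (bits >> 8) & 0xFF;
}
In encode, a occupies the lower 8 bits of the packed integer and b the next 8 bits, so only values from 0 to 255 survive the round trip.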
If you are always passing two ints to this one parameter then it makes no sense to pass a double. Instead pass either the two ints as separate ints, or wrap them up in a struct.
The way you are doing it leaves you no opportunity to detect the difference between a true double and two ints. And so I conclude that you will lose no functionality by doing what I describe above.