Converting a double to intptr_t in C++

Assume the following:
#include <cstdint>
#include <iostream>

int main()
{
    double a = 5.6;
    intptr_t b = (intptr_t) a;
    double c = (double) b;
    return 0;
}
c will be 5. My question is: since intptr_t is also 64 bits wide on a 64-bit machine (the same size as double), how come the fractional bits are not preserved during the cast?

Although intptr_t is meant to hold a pointer value, its underlying type is still an integer. Thus
intptr_t b = (intptr_t)a;
is still truncating the double, similar to if you'd just written:
int b = (int)a;
What you want to do is take the address of a:
intptr_t b = (intptr_t)&a;
and then convert it back:
double c = *(double*)b;
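A complete sketch contrasting the two conversions, assuming a 64-bit platform where intptr_t can hold a double*:
#include <cstdint>
#include <iostream>

int main()
{
    double a = 5.6;

    intptr_t truncated = (intptr_t)a;   // value conversion: 5.6 becomes 5
    double c = (double)truncated;       // c == 5.0, the fraction is gone

    intptr_t addr = (intptr_t)&a;       // store the address instead
    double d = *(double*)addr;          // d == 5.6, nothing lost

    std::cout << c << " " << d << std::endl;  // prints: 5 5.6
    return 0;
}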

Related

Cast double to int64_t and back without losing information

This question has probably been asked before, but I searched and could not find the answer.
I'm implementing a toy virtual machine, where the OpCodes take the form:
std::tuple<int8_t, int64_t, int64_t> // instruction, op1, op2
I'm trying to pack a double into one of the operands and read it back again when processing it. This doesn't work reliably.
double d = ...
auto a = static_cast<int64_t>(d);
auto b = static_cast<double>(a);
// sometimes, b != d
Is there a way to pack the bit representation of the double into an int64_t, and then read that bit pattern back to get the same exact double as before?
static_cast performs a value conversion; the fractional part is always lost. memcpy is what you are after.
double d = ...
int64_t a;
memcpy(&a, &d, sizeof(a));
double d2;
memcpy(&d2, &a, sizeof(d2));
Still, I would probably instead make the operands a union with a double and an int64_t (plus possibly other types that are interesting for your VM).
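A minimal sketch of that union idea, with hypothetical member names (note that in C++, reading a union member other than the one last written is technically undefined behaviour, though compilers support it in practice):
#include <cstdint>

// hypothetical operand type for the toy VM described above
union Operand {
    int64_t i;
    double  d;
};

Operand packDouble(double value) {
    Operand op;
    op.d = value;   // writing d stores the full bit pattern; op.i reads it back
    return op;
}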
One way to make it work is to reinterpret the block of memory as int64_t/double, i.e. to do pointer casts:
double d = ...
auto *a = (int64_t*)&d;
auto *d2 = (double*)a;
auto b = *d2;
assert(d == b);
Note that both answers assume double and int64_t are the same size (64 bits). int64_t is exactly 64 bits by definition; the standard does not require double to be 64 bits, although it is an IEEE-754 64-bit type on virtually every modern platform. Also be aware that these pointer casts technically violate strict aliasing, which the memcpy approach above avoids.
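A compile-time check makes the size assumption explicit:
#include <cstdint>

static_assert(sizeof(double) == sizeof(int64_t),
              "double is not 64 bits on this platform");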

Treating a hexadecimal value as single precision or double precision value

Is there a way I could initialize a float variable with a hexadecimal number? Say I have the single-precision representation of 4, which is 0x40800000 in hex. If I write float a = 0x40800000, the hex value is taken as an integer. What can I do to make it treat those bits as a floating-point number?
One option is to use type punning via a union. This is defined behaviour in C since C99 (previously it was implementation defined); in C++ it is technically undefined behaviour, although most compilers support it.
union {
    float f;
    uint32_t u;
} un;
un.u = 0x40800000;
float a = un.f;
As you tagged this C++, you could also use reinterpret_cast, though be aware that dereferencing the result formally violates C++'s strict aliasing rules (most compilers accept it anyway).
uint32_t u = 0x40800000;
float a = *reinterpret_cast<float*>(&u);
Before doing either of these, you should also confirm that they're the same size:
assert(sizeof(float) == sizeof(uint32_t));
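For a portable alternative that sidesteps the aliasing rules entirely, memcpy the bytes (a minimal sketch, assuming an IEEE-754 binary32 float):
#include <cstdint>
#include <cstring>

uint32_t u = 0x40800000;
float a;
memcpy(&a, &u, sizeof a);  // a == 4.0f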
You can do this if you introduce a temporary integer variable, cast its address to a float pointer, and dereference it. You must be careful about the sizes of the types involved, and know that they may change. With my compiler, for example, this works:
unsigned i = 0x40800000;
float a = *(float*)&i;
printf("%f\n", a);
// output 4.000000
I'm not sure how you're getting the value "0x40800000".
If that's coming in as an int you can just do:
const auto foo = 0x40800000;
auto a = *(float*)&foo;
If that's coming in as a string, note that %x requires an unsigned integer destination; scanning directly into a float is undefined behaviour. Read into an integer first, then copy the bits:
unsigned u;
sscanf("0x40800000", "0x%x", &u);
float a;
memcpy(&a, &u, sizeof a);

Division of integers in C++ not working as expected

I'm new here, so really sorry if this is too basic, but what am I missing here? This is just dummy code:
#include <iostream>
using namespace std;
int main() {
    unsigned int a, b, c;
    int d;

    a = 10E06;
    b = 25E06;
    c = 4096;
    d = (a - b)/c;

    std::cout << d << std::endl;
    return 0;
}
cout prints 1044913 instead of -3662. If I cast a and b to long, the problem is solved. Is this a problem of overflow or something?
That's because (a-b) itself is unsigned:
#include <iostream>
using namespace std;
int main() {
    unsigned int a, b, c;
    int d;

    a = 10E06;
    b = 25E06;
    c = 4096;
    d = (a - b)/c;

    std::cout << (a-b) << std::endl; // 4279967296
    std::cout << d << std::endl;     // 1044913
    return 0;
}
The conversion from unsigned to int happens when d is assigned to, not before.
So (a - b)/c must be unsigned, since a, b, and c are.
Operations between unsigned numbers yield unsigned numbers. It's up to you to make sure the operations make sense, or protect against the opposite.
If you have unsigned int a = 2, b = 3; what do you think the value of a - b would be?
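A quick check of that question (assuming a 32-bit unsigned int):
#include <iostream>

int main() {
    unsigned int a = 2, b = 3;
    std::cout << a - b << std::endl;  // 4294967295, i.e. 2^32 - 1, not -1
}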
Since a, b, and c are all declared as unsigned, the result of the computation (a - b)/c will be unsigned. Since the values you provided cannot be properly represented by an unsigned type, things get a little messy. The unsigned value is then assigned to d, and even though d is signed, the value is already garbled.
I will also note that the notation 10E06 represents a floating-point number that is then implicitly converted to an unsigned int. Depending on the particular floating-point value provided, this may or may not convert as expected.
You want your result to be signed, so you should declare your variables as signed int or just int. That will give the desired result. If you cast a and b to long, a - b will be long and hence signed. The following is a solution.
int main() {
    int a, b, c;
    int d;

    a = 10E06;
    b = 25E06;
    c = 4096;
    d = (a - b)/c;

    std::cout << d << std::endl;
    return 0;
}
If you also want fractional results you should use double or float (for this particular case that prints -3662.11 rather than -3662).
int main() {
    double a, b, c;
    double d;

    a = 10E06;
    b = 25E06;
    c = 4096;
    d = (a - b)/c;

    std::cout << d << std::endl;
    return 0;
}
Because of the way C++ (and many other C-based languages) deals with operators, when unsigned numbers are put into an expression, that expression yields an unsigned value; it is not held in some mysterious inter-type state, as you might expect.
Step-by-step:
(a - b) subtracts 25E06 from 10E06, which would normally give -15E06, but the result is unsigned, so it wraps around to a whole bunch of junk.
This junk is then divided by c, and both inputs are unsigned, so the output is also unsigned.
Lastly, this is stuffed into a signed int, remaining at 1044913.
"unsigned int" is a type just like float and bool, even though it requires two keywords. If you want it to turn into a signed int for that calculation, you must either make sure a, b, and c are all signed (remove the unsigned keyword), or cast them as such when putting them into the expression, like this: d = ((signed)a - (signed)b) / (signed)c;

double datatype casting works on Windows but not on Linux

I have a string pointer in C which holds bigint data,
e.g. 9223372036854775807, i.e. 2^63.
I want to cast this to double, but as you know a double has only 15/16 significant decimal digits available, so the remaining bits are discarded. The above number, which is very large, would therefore be cast to 9.22337203685476E+18, i.e. 922337203685476000.
This makes comparing the original value and the cast value fail. This usually happens on Linux. The thing is, why does this not happen on Windows?
Is it compiler dependent, or is it something unknown to me?
That value is 2^63 - 1, which cannot be exactly represented with a double. The closest value that can be represented is 2^63. And that's what you get if you use e.g. sscanf or atof:
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const char *str = "9223372036854775807";
    double d;
    double e;

    sscanf(str, "%lf", &d);
    e = atof(str);

    printf("%f\n", d); // 9223372036854775808.000000
    printf("%f\n", e); // 9223372036854775808.000000
}
See http://ideone.com/zz2NIF.

convert double number to (IEEE 754) 64-bit binary string representation in c++

I have a double number that I want to represent as an IEEE 754 64-bit binary string.
Currently I'm using code like this:
double noToConvert;
unsigned long* valueRef = reinterpret_cast<unsigned long*>(&noToConvert);
bitset<64> lessSignificative(*valueRef);
bitset<64> mostSignificative(*(++valueRef));
mostSignificative <<= 32;
mostSignificative |= lessSignificative;
RowVectorXd binArray = RowVectorXd::Zero(mostSignificative.size());
for(unsigned int i = 0; i < mostSignificative.size(); i++)
{
    (mostSignificative[i] == 0) ? (binArray(i) = 0) : (binArray(i) = 1);
}
The above code works fine without any problem. But as you can see, I'm using reinterpret_cast and unsigned long, so this code is very much compiler dependent. Could anyone show me how to write code that is platform independent and doesn't use any libraries? I'm OK if we use the standard library, and even bitset, but I don't want any machine- or compiler-dependent code.
Thanks in advance.
If you're willing to assume that double is the IEEE-754 double type:
#include <cstdint>
#include <cstring>
uint64_t getRepresentation(const double number) {
    uint64_t representation;
    memcpy(&representation, &number, sizeof representation);
    return representation;
}
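A possible usage, reusing the getRepresentation function above to produce the binary string the question asks for:
#include <bitset>
#include <iostream>

int main() {
    std::cout << std::bitset<64>(getRepresentation(1.0)) << std::endl;
    // prints 001111111111 followed by 52 zeros (0x3FF0000000000000)
}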
If you don't even want to make that assumption:
#include <cstring>
char *getRepresentation(const double number) {
    char *representation = new char[sizeof number];
    memcpy(representation, &number, sizeof number);
    return representation;
}
Why not use a union? (It needs to hold the double itself alongside a 64-bit integer, so that writing the double lets you read its bits back out as the integer.)
bitset<64> binarize(double input){
    union binarizeUnion
    {
        double dblVal;
        unsigned long long intVal;
    } binTransfer;
    binTransfer.dblVal = input;             // write the double
    return bitset<64>(binTransfer.intVal);  // read its bits back as an integer
}
The simplest way to get this is to memcpy the double into an array of char:
char double_as_char[sizeof(double)];
memcpy(double_as_char, &noToConvert, sizeof(double_as_char));
and then extract the bits from double_as_char. Both the C and C++ standards define this as legal.
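For example, the bytes can be dumped in hex straight from that array (byte order is platform dependent):
#include <cstdio>

for (size_t i = 0; i < sizeof(double_as_char); ++i)
    printf("%02x ", (unsigned char)double_as_char[i]);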
Now, if you want to actually extract the various components of a double, you can use the following:
#include <cmath>   // frexp, fabs, signbit

bool sign = std::signbit(noToConvert);   // true for negative values, including -0.0
int exponent;
double normalized_mantissa = frexp(noToConvert, &exponent);
unsigned long long mantissa = fabs(normalized_mantissa) * (1ull << 53);
Since the value returned by frexp is in [0.5, 1), you need to shift it one extra bit to get all the bits of the mantissa as an integer. Then you just need to map that into the binary representation you want, although you'll have to adjust the exponent to include the implicit bias as well.
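For comparison, a sketch that pulls the same fields straight out of the bit pattern, assuming an IEEE-754 binary64 double:
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    double x = 4.0;
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);

    unsigned sign     = bits >> 63;                 // 1 sign bit
    unsigned exponent = (bits >> 52) & 0x7FF;       // 11 exponent bits, biased by 1023
    uint64_t mantissa = bits & ((1ull << 52) - 1);  // 52 mantissa bits (implicit leading 1)

    printf("sign=%u exponent=%d mantissa=%llu\n",
           sign, (int)exponent - 1023, (unsigned long long)mantissa);
    // prints: sign=0 exponent=2 mantissa=0, i.e. 4.0 == 1.0 * 2^2
}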
The function print_raw_double_binary() in my article Displaying the Raw Fields of a Floating-Point Number should be close to what you want. You'd probably want to replace the casting of double to int with a union, since the former violates "strict aliasing" (although even using a union to access something other than what was stored is technically illegal).