I would like to encode a pair of ints in a double. For example, say I wanted to pass two ints to a function:
foo(int a, int b)
but instead I want a single double to represent the two ints, i.e.:
foo(double aAndB)
Currently I am doing it by putting one int on either side of the decimal point (i.e. 10 and 15 become 10.15), then writing the double to a stringstream, tokenising it and extracting the two numbers.
However, this has an obvious flaw with numbers like 10 and 10, which become 10.1.
Is there a way to do this through some tricky mathematical method so that I can pass a function a double that represents 2 ints?
Thanks.
Since (usually) a double has 64 bits in it and each int has 32 bits, you'd think that you could just store the bits into the double directly, e.g.:
int32_t i1 = rand();
int32_t i2 = rand();
int64_t x = (((int64_t)i1)<<32) | ((int64_t)i2);
double theDouble;
memcpy(&theDouble, &x, sizeof(theDouble));
... and doing that "almost works". That is, it works okay for many possible values of i1 and i2 -- but not for all of them. In particular, in the IEEE 754 floating point format, any value whose exponent bits are all set (0x7ff) and whose mantissa is non-zero is a NaN, and the floating point hardware can (and does) convert different NaN bit-patterns to its preferred NaN bit-pattern when passing a double as an argument, etc.
Because of this, stuffing two 32-bit integers into a double will appear to work in most cases, but if you test it with all possible input values you'll find some cases where the values unexpectedly mutated during their stay inside the double, and came out as different values when you decoded them again.
Of course, you could get around this by being careful only to set the mantissa bits of the double, but that will only give you 26 bits per integer, so you would only be able to store integer values of +/- 33,554,432 or so. Maybe that's okay, depending on your use case.
My advice is, find a different way to do whatever you're trying to do. Storing non-floating-point data in a floating point variable is asking for trouble, especially if you want your code to be at all portable.
If you're lucky and an int is half a double you can store the ints like this:
int a = 10;
int b = 20;
double d;
*(int *)&d = a;
*((int *)&d + 1) = b;
int outa = *((int *)&d);
int outb = *(((int *)&d) + 1);
printf("%d %d\n", outa, outb);
This doesn't work in general and is not portable. If a double and an int have the same number of bits, what you want is impossible.
A double can exactly represent an integer up to 53 bits. If you want to hold a 26-bit and a 27-bit integer, it's very easy: double combined = bits27*67108864.0 + bits26;
Note that 67108864 is 2^26.
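The multiplication above can be paired with a matching decode step. A minimal sketch, assuming both inputs are non-negative and fit in 27 and 26 bits respectively; the names pack, unpack_hi and unpack_lo are made up for illustration:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// 2^26: the multiplier that moves the 27-bit value above the 26-bit one.
constexpr double kShift = 67108864.0;

// Combine a 27-bit and a 26-bit non-negative integer into one double.
// The result is an integer below 2^53, so the double holds it exactly.
double pack(std::uint32_t bits27, std::uint32_t bits26)
{
    assert(bits27 < (1u << 27) && bits26 < (1u << 26));
    return bits27 * kShift + bits26;
}

// Recover the upper (27-bit) value.
std::uint32_t unpack_hi(double combined)
{
    return static_cast<std::uint32_t>(std::floor(combined / kShift));
}

// Recover the lower (26-bit) value.
std::uint32_t unpack_lo(double combined)
{
    return static_cast<std::uint32_t>(std::fmod(combined, kShift));
}
```

With this scheme the pair 10 and 10 decodes back to 10 and 10, because the two values never share digits the way they do in the "10.1" encoding.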
Try to define a union like this:
struct two_int {
int a;
int b;
};
union encoding {
struct two_int a;
double c;
};
But doing it like this may introduce portability problems. Please double-check whether this approach is appropriate for your case.
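You can do it by using a binary mask to pack the values and to extract them again. The bitwise operators are not defined for double, so the masking has to be done on an integer, which is then converted to a double.
For example:
double encode(int a, int b)
{
    // Pack a into bits 0-7 and b into bits 8-15 of an integer.
    unsigned int packed = (a & 0xFF) | ((b & 0xFF) << 8);
    return packed; // small integers convert to double exactly
}
void decode(double d, int &a, int &b)
{
    unsigned int packed = static_cast<unsigned int>(d);
    a = packed & 0xFF;
    b = (packed >> 8) & 0xFF;
}
In the encode part, a ends up in the lower 8 bits of the packed value and b in the next 8 bits, so this version only handles values from 0 to 255.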
If you are always passing two ints to this one parameter then it makes no sense to pass a double. Instead pass either the two ints as separate ints, or wrap them up in a struct.
The way you are doing it leaves you no opportunity to detect the difference between a true double and two ints. And so I conclude that you will lose no functionality by doing what I describe above.
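Wrapping the two ints in a struct could look like the sketch below. The type name IntPair and the body of foo are placeholders, since the question does not say what foo actually does:

```cpp
#include <cassert>

// Hypothetical wrapper type; the name IntPair is illustrative only.
struct IntPair
{
    int a;
    int b;
};

// foo takes a single argument but still sees both integers unchanged.
int foo(IntPair p)
{
    return p.a + p.b; // placeholder body, not from the question
}
```

A call such as foo(IntPair{10, 10}) passes both values intact, with no encoding or decoding step to get wrong.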
Related
Let's say I have an integer:
unsigned long long int data = 4599331010119547059;
Now I want to convert this data to a double. I basically want to change the type, but keep the bits exactly as they were. For the given example, the float value is 0.31415926536.
How can I do that in C++? I saw some methods using a union, but many advised against that approach.
Since C++20, you can use std::bit_cast:
std::bit_cast<double>(data)
Prior to C++20, you can use std::memcpy:
double d;
static_assert(sizeof d == sizeof data);
std::memcpy(&d, &data, sizeof d);
Note that the result will vary depending on the floating point representation (IEEE-754 is ubiquitous though) as well as on whether floating point and integer types have the same endianness.
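For code that has to build before C++20, the memcpy approach can be wrapped in a small helper. This is a sketch only: bit_copy is a made-up name, not a standard function, and the expected value assumes IEEE-754 doubles and a 64-bit unsigned long long:

```cpp
#include <cassert>
#include <cstring>
#include <type_traits>

// Pre-C++20 stand-in for std::bit_cast: copies the object representation.
template <class To, class From>
To bit_copy(const From& src)
{
    static_assert(sizeof(To) == sizeof(From), "sizes must match");
    static_assert(std::is_trivially_copyable<To>::value &&
                  std::is_trivially_copyable<From>::value,
                  "types must be trivially copyable");
    To dst;
    std::memcpy(&dst, &src, sizeof dst);
    return dst;
}
```

Under those assumptions, bit_copy<double>(data) for the value in the question gives approximately 0.31415926536, and bit_copy<unsigned long long> on that double returns the original integer.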
Taking the question on its face value (assuming you have a valid reason to do this!) this is the only proper way of doing this in current C++ standard:
int i = get_int();
float x;
static_assert(sizeof(float) == sizeof(int), "!!!");
memcpy(&x, &i, sizeof(x));
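You can use reinterpret_cast, although it violates the strict-aliasing rules and is therefore undefined behaviour:
double d = reinterpret_cast<double&>(data);
Note that casting to float& instead reads only 4 of the 8 bytes, which is why that version does not give 0.314... for your value.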
I'm trying to convert a float value to an integer, modify the int value, then reconvert back to a float value. However, the decimals' value gets lost and I'm pretty sure I used the static_cast<>() function wrong in my code.
My code is a binary multiplier, which shifts the binary value f times to left. For example, when I'm doing something like 1.2 x 2, I'm only getting 2 instead of 2.4.
int mantissa;
int f;
int exp;
float result = mantissa + 0x800000;
int resultInt = static_cast<int>(result);
int expF = log2(abs(f));
int expM = exp + expF;
int newExp = (127 + 23 - expM);
resultInt >>= newExp;
float result2 = resultInt;
Bit shifting will not work for floating point values because the bits are laid out differently. They have to preserve the decimal location as well as the digits (hence the floating "point" value).
An integer, on the other hand, works well with bit shifting due to how well it maps from decimal-to-binary, but does not store a decimal point anywhere. Thus, when casting, you lose that information.
In short, it is impossible to multiply a decimal value directly using bit shifting the same way you can with an integer.
However, you can multiply the floating point by 10 until all digits are on the left side of the decimal, then cast to an integer. It may eat up performance depending on how it's implemented, but it's certainly possible to preserve all information this way. It's difficult to answer the question beyond that without understanding your intentions.
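If the goal is only to scale a float by a power of two, the standard library already offers the floating point counterpart of a shift: std::ldexp changes the exponent and leaves the significand alone. A minimal sketch, assuming that is what the multiplier in the question is meant to do; scale_by_pow2 is a made-up name:

```cpp
#include <cassert>
#include <cmath>

// Multiply x by 2^shift by adjusting the exponent, not by shifting bits.
float scale_by_pow2(float x, int shift)
{
    return std::ldexp(x, shift);
}
```

Here scale_by_pow2(1.2f, 1) yields 2.4f, keeping the fractional digits that the integer cast in the question throws away.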
Is there a way I could initialize a float variable with a hexadecimal number? Say I have the single precision representation of 4, which is 0x40800000 in hex. I want to do something like float a = 0x40800000, but in that case the hex value is taken as an integer. What can I do to make it be treated as a floating point bit pattern?
One option is to use type punning via a union. This is defined behaviour in C since C99 (previously this was implementation defined).
union {
float f;
uint32_t u;
} un;
un.u = 0x40800000;
float a = un.f;
As you tagged this C++, you could also use reinterpret_cast.
uint32_t u = 0x40800000;
float a = *reinterpret_cast<float*>(&u);
Before doing either of these, you should also confirm that they're the same size:
assert(sizeof(float) == sizeof(uint32_t));
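In C++ (as opposed to C), reading the inactive union member and dereferencing the reinterpret_cast pointer are both formally undefined, so a memcpy-based version is the safer choice. A minimal sketch, assuming IEEE-754 single precision; float_from_bits is a hypothetical helper name:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret a 32-bit pattern as a float without breaking strict aliasing.
float float_from_bits(std::uint32_t bits)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```

With that helper, float a = float_from_bits(0x40800000); gives 4.0f, and the size check happens at compile time rather than through a runtime assert.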
You can do this if you introduce a temporary integer variable, cast its address to a pointer to the floating point type and dereference that pointer. You must be careful about the sizes of the types involved, and know that they may change between platforms. With my compiler, for example, this works:
unsigned i = 0x40800000;
float a = *(float*)&i;
printf("%f\n", a);
// output 4.00000
I'm not sure how you're getting the value "0x40800000".
If that's coming in as an int you can just do:
const auto foo = 0x40800000;
auto a = *(float*)&foo;
If that's coming in as a string you can do:
unsigned u; // %x expects a pointer to unsigned, not to float
sscanf("0x40800000", "0x%x", &u);
float a;
memcpy(&a, &u, sizeof a);
I have an unsigned long long (or uint64_t) value and want to convert it to a double. The double shall have the same bit pattern as the long value. This way I can set the bits of the double "by hand".
unsigned long long bits = 1ULL;
double result = /* some magic here */ bits;
I am looking for a way to do this.
The portable way to do this is with memcpy (you may also be able to conditionally do it with reinterpret_cast or a union, but those aren't certain to be portable because they violate the letter of the strict-alias rules):
// First, static assert that the sizes are the same
memcpy(&result, &bits, sizeof(bits));
But before you do make sure you know exactly what you're doing and what floating point representation is being used (although IEEE754 is a popular/common choice). You'll want to avoid all kinds of problem values like infinity, NaN, and denormal numbers.
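One way to spot the problem values mentioned above is to classify the result right after the copy. A sketch, assuming IEEE-754 doubles; classify_bits is a made-up helper name, while std::fpclassify and the FP_* macros come from <cmath>:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Copy a 64-bit pattern into a double and report its floating point category
// (FP_NORMAL, FP_SUBNORMAL, FP_ZERO, FP_INFINITE or FP_NAN).
int classify_bits(std::uint64_t bits)
{
    static_assert(sizeof(double) == sizeof(std::uint64_t), "double must be 64 bits");
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return std::fpclassify(d);
}
```

The pattern 1ULL from the question, for instance, is classified as FP_SUBNORMAL, so the resulting double is a denormal number rather than anything close to 1.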
Beware of union and reinterpret_cast<double*>(&bits), for both of these methods are UB. Pretty much all you can do is memcpy.
Since C++20 we have std::bit_cast() to do such conversions.
Example:
double d = 1.5;
uint64_t i = std::bit_cast<uint64_t>(d); //use the same bits in an integer
double dd = std::bit_cast<double>(i); //back to floating point again
The following uses a void pointer.
unsigned long long bits = 1ULL;
void* tempPtr=(void*)&bits;
double result = *(double*)tempPtr;
In C++, what's the generic way to convert any floating point value (float) to fixed point (int, 16:16 or 24:8)?
EDIT: For clarification, fixed-point values have two parts to them: an integer part and a fractional part. The integer part can be represented by a signed or unsigned integer data type. The fractional part is represented by an unsigned integer data type.
Let's make an analogy with money for the sake of clarity. The fractional part may represent cents -- a fractional part of a dollar. The range of the 'cents' data type would be 0 to 99. If an 8-bit unsigned integer were to be used for fixed-point math, then the fractional part would be split into 256 equal parts.
I hope that clears things up.
Here you go:
// A signed fixed-point 16:16 class
class FixedPoint_16_16
{
short intPart;
unsigned short fracPart;
public:
FixedPoint_16_16(double d)
{
*this = d; // calls operator=
}
FixedPoint_16_16& operator=(double d)
{
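intPart = static_cast<short>(std::floor(d)); // needs <cmath>
// Scale only the fractional part, which floor() leaves in [0, 1); needs <limits>
fracPart = static_cast<unsigned short>(
(std::numeric_limits<unsigned short>::max() + 1.0) * (d - intPart));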
return *this;
}
// Other operators can be defined here
};
EDIT: Here's a more general class based on another common way to deal with fixed-point numbers (and which KPexEA pointed out):
template <class BaseType, size_t FracDigits>
class fixed_point
{
const static BaseType factor = 1 << FracDigits;
BaseType data;
public:
fixed_point(double d)
{
*this = d; // calls operator=
}
fixed_point& operator=(double d)
{
data = static_cast<BaseType>(d*factor);
return *this;
}
BaseType raw_data() const
{
return data;
}
// Other operators can be defined here
};
fixed_point<int, 8> fp1(1.5); // Will be signed 24:8 (if int is 32-bits)
fixed_point<unsigned int, 16> fp2(1.5); // Will be unsigned 16:16 (if int is 32-bits)
A cast from float to integer will throw away the fractional portion so if you want to keep that fraction around as fixed point then you just multiply the float before casting it. The below code will not check for overflow mind you.
If you want 16:16
double f = 1.2345;
int n;
n=(int)(f*65536);
if you want 24:8
double f = 1.2345;
int n;
n=(int)(f*256);
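The overflow check left out above could be added along these lines. This is a sketch rather than the answer's own code: the function names are made up, and it rounds to nearest instead of truncating as the plain cast does:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Convert to signed 16:16 fixed point, rounding to nearest.
// Returns false instead of overflowing when d is out of range (or NaN).
bool to_fixed_16_16(double d, std::int32_t& out)
{
    const double scaled = std::round(d * 65536.0);
    if (!(scaled >= -2147483648.0 && scaled <= 2147483647.0))
        return false;
    out = static_cast<std::int32_t>(scaled);
    return true;
}

// Convert a 16:16 value back to double; exact because 32 bits fit in a double.
double from_fixed_16_16(std::int32_t n)
{
    return n / 65536.0;
}
```

For example, 1.5 converts to 98304 (1.5 * 65536) and back to exactly 1.5, while 40000.0 is rejected because it does not fit in 16 integer bits.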
Edit: My first comment applies to before Kevin's edit, but I'll leave it here for posterity. Answers change so quickly here sometimes!
The problem with Kevin's approach is that with Fixed Point you are normally packing into a guaranteed word size (typically 32bits). Declaring the two parts separately leaves you to the whim of your compiler's structure packing. Yes you could force it, but it does not work for anything other than 16:16 representation.
KPexEA is closer to the mark by packing everything into int - although I would use "signed long" to try and be explicit on 32bits. Then you can use his approach for generating the fixed point value, and bit slicing to extract the component parts again. His suggestion also covers the 24:8 case.
( And everyone else who suggested just static_cast.....what were you thinking? ;) )
I gave the answer to the guy that wrote the best answer, but I really used a related question's code that points here.
It used templates and was easy to ditch dependencies on the boost lib.
This is fine for converting from floating point to integer, but the O.P. also wanted fixed point.
Now how you'd do that in C++, I don't know (C++ not being something I can think in readily). Perhaps try a scaled-integer approach, i.e. use a 32 or 64 bit integer and programmatically allocate the last, say, 6 digits to what's on the right hand side of the decimal point.
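The scaled-integer idea might look like this in C++. A rough sketch, assuming a 64-bit integer with six decimal digits reserved for the fraction; kScale and the function names are invented for the example:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Six decimal digits are reserved for the fractional part.
constexpr std::int64_t kScale = 1000000;

// Store x as an integer count of millionths, rounded to nearest.
std::int64_t to_scaled(double x)
{
    return std::llround(x * kScale);
}

// Whole-number part of a scaled value (truncates toward zero).
std::int64_t whole_part(std::int64_t scaled)
{
    return scaled / kScale;
}

// Fractional part in millionths; carries the sign of the value.
std::int64_t frac_part(std::int64_t scaled)
{
    return scaled % kScale;
}
```

So 1.25 is stored as 1250000, and 12.345678 splits back into 12 and 345678. Unlike the binary 16:16 layout, this keeps decimal fractions such as 0.1 exact.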
There isn't any built in support in C++ for fixed point numbers. Your best bet would be to write a wrapper 'FixedInt' class that takes doubles and converts them.
As for a generic method to convert... the int part is easy enough, just grab the integer part of the value and store it in the upper bits... decimal part would be something along the lines of:
for (int i = 1; i <= precision; i++)
{
    float weight = 1.f / (float)(1 << i); // bit i of the fraction is worth 2^-i
    if (decimal_part >= weight)
    {
        decimal_part -= weight;
        fixint_value |= (1 << (precision - i));
    }
}
although this is likely to contain bugs still