How to convert hex to an IEEE 754 32-bit float in C++

I am trying to convert hex values stored as int into floating-point numbers using the IEEE 754 32-bit rules. I am specifically struggling with getting the right values for the mantissa and exponent. The values are stored in a file as hex. I want four significant figures. Below is my code.
float floatizeMe(unsigned int myNumba) {
    //// myNumba comes in as 32 bits or 8 byte
    unsigned int sign = (myNumba & 0x007fffff) >> 31;
    unsigned int exponent = ((myNumba & 0x7f800000) >> 23) - 0x7F;
    unsigned int mantissa = (myNumba & 0x007fffff);
    float value = 0;
    float mantissa2;
    cout << endl << "mantissa is : " << dec << mantissa << endl;
    unsigned int m1 = mantissa & 0x00400000 >> 23;
    unsigned int m2 = mantissa & 0x00200000 >> 22;
    unsigned int m3 = mantissa & 0x00080000 >> 21;
    unsigned int m4 = mantissa & 0x00040000 >> 20;
    mantissa2 = m1 * (2 ^ -1) + m2*(2 ^ -2) + m3*(2 ^ -3) + m4*(2 ^ -4);
    cout << "\nsign is: " << dec << sign << endl;
    cout << "exponent is : " << dec << exponent << endl;
    cout << "mantissa 2 is : " << dec << mantissa2 << endl;
    // if above this number it is negative
    if (sign == 1)
        sign = -1;
    // if above this number it is positive
    else {
        sign = 1;
    }
    value = (-1^sign) * (1+mantissa2) * (2 ^ exponent);
    cout << dec << "Float value is: " << value << "\n\n\n";
    return value;
}
int main()
{
    ifstream myfile("input.txt");
    if (myfile.is_open())
    {
        unsigned int a, b, b1; // Hex
        float c, d, e;         // Dec
        int choice;
        unsigned int ex1 = 0;
        unsigned int ex2 = 1;
        myfile >> std::hex;
        myfile >> a >> b;
        floatizeMe(a);
        myfile.close();
    }
    return 0;
}

I suspect you mean for the ^ in
mantissa2 = m1 * (2 ^ -1) + m2*(2 ^ -2) + m3*(2 ^ -3) + m4*(2 ^ -4);
to mean "to the power of". There is no such operator in C or C++. The ^ operator is the bit-wise XOR operator.

Assuming your CPU follows the IEEE standard, you can also use a union. Something like this:
union
{
    int num;
    float fnum;
} my_union;
Then store the integer values into my_union.num and read them as float by getting my_union.fnum.
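For example (a minimal sketch; note that reading the union member that was not most recently written is technically undefined behaviour in C++, though it is well defined in C and supported by major compilers):
my_union.num = 0x40490FDB;          // bit pattern of pi as an IEEE-754 single
std::cout << my_union.fnum << "\n"; // prints 3.14159...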

We needed to convert IEEE-754 single and double precision numbers (using 32-bit and 64-bit encodings). We were using a C compiler (Vector CANoe/CANalyzer CAPL script) with a restricted set of functions and ended up developing the function below; it can easily be tested using any online C compiler. Note that a 64-bit integer type (uint64_t) is needed to hold the double-precision masks:
#include <stdio.h>
#include <math.h>
#include <stdint.h>

double ConvertNumberToFloat(uint64_t number, int isDoublePrecision)
{
    int mantissaShift = isDoublePrecision ? 52 : 23;
    uint64_t exponentMask = isDoublePrecision ? 0x7FF0000000000000ULL : 0x7f800000ULL;
    int bias = isDoublePrecision ? 1023 : 127;
    int signShift = isDoublePrecision ? 63 : 31;

    int sign = (int)((number >> signShift) & 0x01);
    int exponent = (int)((number & exponentMask) >> mantissaShift) - bias;

    // Sum the mantissa bits as negative powers of two.
    // (This assumes a normalized number; denormals, infinities and NaNs,
    // i.e. exponent field all zeros or all ones, are not handled.)
    int power = -1;
    double total = 0.0;
    for (int i = 0; i < mantissaShift; i++)
    {
        int calc = (int)((number >> (mantissaShift - i - 1)) & 0x01);
        total += calc * pow(2.0, power);
        power--;
    }

    double value = (sign ? -1 : 1) * pow(2.0, exponent) * (total + 1.0);
    return value;
}

int main()
{
    // Single Precision
    uint64_t singleValue = 0x40490FDB; // 3.141592...
    float singlePrecision = (float)ConvertNumberToFloat(singleValue, 0);
    printf("IEEE754 Single (from 32bit 0x%08X): %.7f\n", (unsigned)singleValue, singlePrecision);

    // Double Precision
    uint64_t doubleValue = 0x400921FB54442D18ULL; // 3.141592653589793...
    double doublePrecision = ConvertNumberToFloat(doubleValue, 1);
    printf("IEEE754 Double (from 64bit 0x%016llX): %.16f\n", (unsigned long long)doubleValue, doublePrecision);
    return 0;
}

Just do the following (but of course make sure you have the right endianness when reading bytes into the integer in the first place):
float int_bits_to_float(int32_t ieee754_bits) {
    float flt;
    *((int*) &flt) = ieee754_bits;
    return flt;
}
Works for me... this of course assumes that float has 32 bits, and is in IEEE754 format, on your architecture (which is almost always the case).
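If you want to avoid the strict-aliasing violation of the pointer cast, a std::memcpy-based variant (a sketch; it compiles to the same code on mainstream compilers) has well-defined behaviour:
#include <cstdint>
#include <cstring>

float int_bits_to_float_safe(std::int32_t ieee754_bits) {
    float flt;
    static_assert(sizeof flt == sizeof ieee754_bits, "float must be 32 bits");
    std::memcpy(&flt, &ieee754_bits, sizeof flt); // copy the raw bits across
    return flt;
}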

There are a number of very basic errors in your code.
The most visible is repeatedly using ^ for "power of". ^ is the XOR-operator, and for "power" you must use the function pow(base, exponent) in math.h.
Next, "I want to have four significant figures" (presumably for the mantissa), but you only extract four bits. Four bits can encode only 0..15, which is about a digit-and-a-half. To get four significant digits, you'd need at least log(10,000)/log(2) ≈ 13.288, or at least 14 bits (but preferably 17, so you get one full extra digit to get better rounding).
You extract the wrong bit for sign, and then you use it the wrong way. Yes, if it is 0 then sign = 1 and if 1 then sign = -1, but you use it in the final calculation as
value = (-1^sign) * ...
(again with a ^, although even pow does not make any sense here). You ought to have used sign * .. straight away.
exponent was declared an unsigned int, but that fails for negative values. It needs to be signed for pow(2, exponent) (corrected from your (2 ^ exponent)).
On the positive side, (1+mantissa2) is indeed correct.
With all of those points taken together, and ignoring the fact that you actually ask for only 4 significant digits, I get the following code. Note that I rearranged the initial bit shifting and extracting for convenience – I shift mantissa to the left, rather than the right, so I can test against 0 in its calculation.
(Ah, I missed this!) Using sign straight away does not work because it was declared as an unsigned int. Therefore, where you think you give it the value -1, it actually gets the value 4294967295 (more precise: the value of UINT_MAX from limits.h).
The easiest way to get rid of this is not multiplying by sign but only test it, and negate value if it is set.
float floatizeMe(unsigned int myNumba)
{
    //// myNumba comes in as 32 bits, i.e. 4 bytes
    unsigned int sign = myNumba >> 31;
    signed int exponent = ((myNumba >> 23) & 0xff) - 0x7F;
    unsigned int mantissa = myNumba << 9;
    float value = 0;
    float mantissa2;

    cout << endl << "input is : " << hex << myNumba << endl;
    cout << endl << "mantissa is : " << hex << mantissa << endl;

    value = 0.5f;
    mantissa2 = 0.0f;
    while (mantissa)
    {
        if (mantissa & 0x80000000)
            mantissa2 += value;
        mantissa <<= 1;
        value *= 0.5f;
    }

    cout << "\nsign is: " << sign << endl;
    cout << "exponent is : " << hex << exponent << endl;
    cout << "mantissa 2 is : " << mantissa2 << endl;

    /* REMOVE:
    // if above this number it is negative
    if (sign == 1)
        sign = -1;
    // if above this number it is positive
    else {
        sign = 1;
    } */

    /* value = sign * (1.0f + mantissa2) * (pow (2, exponent)); */
    value = (1.0f + mantissa2) * (pow(2, exponent));
    if (sign) value = -value;

    cout << dec << "Float value is: " << value << "\n\n\n";
    return value;
}
With the above, you get correct results for values such as 0x3e4ccccd (0.2000000030) and 0x40490FDB (3.1415927410).
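A minimal driver to reproduce those two checks (assuming floatizeMe from above is in scope):
int main()
{
    floatizeMe(0x3e4ccccd); // prints 0.2
    floatizeMe(0x40490FDB); // prints 3.14159
}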
All said and done, if your input is already in IEEE-754 format (albeit in hex), then a simple cast ought to be enough.

As well as being much simpler, this also avoids any rounding/precision errors.
float value = reinterpret_cast<float&>(myNumba);
If you still want to inspect the parts separately, use the library function std::frexp afterwards. Or if you don't like the type punning, at least use std::ldexp to apply the exponent rather than your explicit maths, which is vulnerable to rounding/precision errors and overflow.
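For instance, a sketch of decomposing and reassembling the value from above:
#include <cmath>
int exponent;
float fraction = std::frexp(value, &exponent);   // value == fraction * 2^exponent, |fraction| in [0.5, 1)
float rebuilt  = std::ldexp(fraction, exponent); // puts the parts back together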
An alternate to both of these is to use a union type, as described in this answer.

Related

I implemented my own square root function in C++ to get precision up to 9 points but it's not working

I want to get the square root of a number up to 9 decimal places, so I did something like below, but I am not getting the expected precision. Here e is the tolerance, which is finer than 10^-9, yet ans only shows 5 digits of precision. What am I doing wrong here?
#include <iostream>
using namespace std;

long double squareRoot(long double n)
{
    long double x = n;
    long double y = 1;
    long double e = 0.00000000000001;
    while (x - y > e)
    {
        x = (x + y) / 2;
        y = n / x;
    }
    cout << x << "\n";
    return x;
}

int main()
{
    int arr[] = {2, 3, 4, 5, 6};
    int size = sizeof(arr) / sizeof(arr[0]);
    long double ans = 0.0;
    for (int i = 0; i < size; i++)
    {
        ans += squareRoot(arr[i]);
    }
    cout << ans << "\n";
    return 0;
}
The output is
1.41421
1.73205
2
2.23607
2.44949
9.83182
What should I do to get precision up to 9 decimal places?
There are two places at which precision plays a role:
precision of the value itself
precision of the output stream
You can only get output in desired precision if both value and stream are precise enough.
In your case, the calculated value doesn't seem to be the problem; however, the default stream precision is only six significant digits, i.e. no matter how precise your long double value actually is, the stream will stop after six digits, rounding the last one appropriately. So you'll need to increase the stream precision to the desired nine digits:
std::cout << std::setprecision(9); // needs #include <iomanip>
// or alternatively:
std::cout.precision(9);
Precision is kept until a new one is set, in contrast to e.g. std::setw, which applies only to the next value.
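For example, a minimal sketch:
#include <iomanip>
#include <iostream>

int main()
{
    double x = 1.41421356237309515;
    std::cout << x << "\n";                         // 1.41421 (default six digits)
    std::cout << std::setprecision(9) << x << "\n"; // 1.41421356
    std::cout << x << "\n";                         // still 1.41421356 - the precision sticks
}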
Try this:
cout << setprecision(10) << x << "\n";
cout << setprecision(10) << ans << "\n";

Convert hexadecimal string to binary and separate into bits in C++

I need to convert a hexadecimal string to binary and then pass the bits into different variables.
For example, my input is:
std::string hex = "E136";
How do I convert the string into binary output 1110 0001 0011 0110?
After that I need to pass the bit 0 to variable A, bits 1-9 to variable B and bits 10-15 to variable C.
Thanks in advance
How do I convert the string [...]?
Start with a result value of zero; then, for each character (starting at the first, the most significant one), determine its value (in the range [0:15]), multiply the result so far by 16, and add the current value. For your given example, this results in
(((0 * 16 + v('E')) * 16 + v('1')) * 16 + v('3')) * 16 + v('6')
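A sketch of that scheme, with the digit-value function v() written out as a hypothetical helper:
#include <string>

// hypothetical helper: maps '0'..'9' to 0..9 and 'a'..'f'/'A'..'F' to 10..15
unsigned v(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    return (c & ~0x20) - 'A' + 10; // clearing bit 5 folds lower case onto upper case
}

unsigned parse_hex(const std::string& s)
{
    unsigned result = 0;
    for (char c : s)
        result = result * 16 + v(c); // multiply the result so far by the base, add the digit
    return result;                   // parse_hex("E136") == 0xE136 == 57654
}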
There are standard library functions doing the stuff for you, such as std::strtoul:
char* end;
unsigned long value = strtoul(hex.c_str(), &end, 16); // 16 is the base
The end pointer useful to check if you have read the entire string:
if (*end == 0)
{
    // end of string reached
}
else
{
    // some part of the string was left; you might consider this
    // an error (could occur if e.g. "f10s12" was passed - then
    // end would point to the 's')
}
If you don't care for end checking, you can just pass nullptr instead.
Don't convert back to a string afterwards; you can get the required values by masking (&) and bit-shifting (>>), e.g. getting bits 1-9:
uint32_t b = value >> 1 & 0x1ffU;
Working on integers is much more efficient than working on strings. Only when you want to print out the final result should you convert back to a string (if using a std::ostream, operator<< already does that work for you...).
While playing with this sample, I realized that I gave a wrong recommendation:
std::setbase(2) does not work by standard. Ouch! (SO: Why doesn't std::setbase(2) switch to binary output?)
For conversion of numbers to a string of binary digits, something else must be used. I made this small sample. Though the separation of bits is covered as well, my main focus was on output with different bases (and IMHO it is worth another answer):
#include <algorithm>
#include <cstdlib> // for strtoul()
#include <iomanip>
#include <iostream>
#include <string>

std::string bits(unsigned value, unsigned w)
{
    std::string text;
    for (unsigned i = 0; i < w || value; ++i) {
        text += '0' + (value & 1); // bit -> character '0' or '1'
        value >>= 1;               // shift right one bit
    }
    // text is right to left -> must be reversed
    std::reverse(text.begin(), text.end());
    // done
    return text;
}

void print(const char *name, unsigned value)
{
    std::cout
        << name << ": "
        // decimal output
        << std::setbase(10) << std::setw(5) << value
        << " = "
        // binary output
#if 0 // OLD, WRONG:
        // std::setbase(2) is not supported by standard - Ouch!
        << "0b" << std::setw(16) << std::setfill('0') << std::setbase(2) << value
#else // NEW:
        << "0b" << bits(value, 16)
#endif // 0
        << " = "
        // hexadecimal output
        << "0x" << std::setw(4) << std::setfill('0') << std::setbase(16) << value
        << '\n';
}

int main()
{
    std::string hex = "E136";
    unsigned value = strtoul(hex.c_str(), nullptr, 16);
    print("hex", value);
    // bit 0 -> a
    unsigned a = value & 0x0001;
    // bits 1 ... 9 -> b
    unsigned b = (value & 0x03FE) >> 1;
    // bits 10 ... 15 -> c
    unsigned c = (value & 0xFC00) >> 10;
    // report
    print(" a ", a);
    print(" b ", b);
    print(" c ", c);
    // done
    return 0;
}
Output:
hex: 57654 = 0b1110000100110110 = 0xe136
a : 00000 = 0b0000000000000000 = 0x0000
b : 00155 = 0b0000000010011011 = 0x009b
c : 00056 = 0b0000000000111000 = 0x0038
Concerning the bit operations:
The binary bitwise AND operator (&) is used to set all unintended bits to 0. The second value can be understood as a mask. It would be more obvious if I had used binary literals, but those were not supported in C++ before C++14. Hex codes do nearly as well, as a hex digit always represents the same pattern of 4 bits (since 16 = 2^4). After some practice, you usually learn to "see" the bits in the hex code.
About the right shift (>>), I was not quite sure. The OP didn't require that the bits be moved anywhere - only that they be separated into distinct variables. So these right-shifts might be superfluous.
So, this question which seemed trivial led to a surprising enlightenment (for me).

Am I doing double to float conversion here?

const double dBLEPTable_8_BLKHAR[4096] = {
    0.00000000000000000000000000000000,
    -0.00000000239150987901837200000000,
    -0.00000000956897738824125100000000,
    -0.00000002153888378764179400000000,
    -0.00000003830892270073604800000000,
    -0.00000005988800189093979000000000,
    -0.00000008628624126316708500000000,
    -0.00000011751498329992671000000000,
    -0.00000015358678995269770000000000,
    -0.00000019451544774895524000000000,
    -0.00000024031597312124120000000000,
    -0.00000029100459975062165000000000
};
If I change the double above to float, am I incurring conversion CPU cycles when I perform operations on the array contents? Or is the "conversion" sorted out at compile time?
Say, dBLEPTable_8_BLKHAR[1] + dBLEPTable_8_BLKHAR[2], something simple like this?
On a related note, how many trailing decimal places should a float be able to store?
This is C++.
Any good compiler will convert the initializers during compile time. However, you also asked
am I incurring conversion CPU cycles when I perform operations on the array contents?
and that depends on the code performing the operations. If your expression combines array elements with variables of double type, then the operation will be performed at double precision, and the array elements will be promoted (converted) before the arithmetic takes place.
If you just combine array elements with variables of float type (including other array elements), then the operation is performed on floats and the language doesn't require any promotion (But if your hardware only implements double precision operations, conversion might still be done. Such hardware surely makes the conversions very cheap, though.)
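As an illustration of those promotion rules (a sketch, not code from the question):
float table[] = { 0.5f, 0.25f };
double d = 1.0;

double r1 = table[0] + d;        // table[0] is promoted to double; double-precision add
float  r2 = table[0] + table[1]; // both operands are float; no promotion required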
Ben Voigt's answer addresses most parts of your question.
But you also ask:
On a related note, how many trailing decimal places should a float be able to store
It depends on the value of the number you are trying to store. For large numbers there are no decimals at all; in fact, the format can't even give you a precise value for the integer part. For instance:
float x = BIG_NUMBER;
float y = x + 1;
if (x == y)
{
    // The code gets here if BIG_NUMBER is very high!
}
else
{
    // The code gets here if BIG_NUMBER is not so high!
}
If BIG_NUMBER is 2^23, the next greater representable number is (2^23 + 1).
If BIG_NUMBER is 2^24, the next greater representable number is (2^24 + 2).
The value (2^24 + 1) cannot be stored.
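You can verify the 2^24 boundary directly (a minimal sketch):
float x = 16777216.0f; // 2^24, exactly representable
float y = x + 1.0f;    // 2^24 + 1 has no float representation; rounds back to 2^24
// x == y now evaluates to true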
For very small numbers (i.e. close to zero), you will have a lot of decimal places.
Floating point must be used with great care because it is imprecise.
http://en.wikipedia.org/wiki/Single-precision_floating-point_format
For small numbers you can experiment with the program below.
Change the exp variable to set the starting point. The program will show you what the step size is for the range and the first four valid numbers.
int main(int argc, char* argv[])
{
    int exp = -27; // <--- !!!!!!!!!!!
                   // Change this to set starting point for the range
                   // Starting point will be 2 ^ exp
    float f;
    unsigned int *d = (unsigned int *)&f; // Brute force to set f in binary format
    unsigned int e;

    cout.precision(100);

    // Calculate step size for this range
    e = ((127 - 23) + exp) << 23;
    *d = e;
    cout << "Step size = " << fixed << f << endl;
    cout << "First 4 numbers in range:" << endl;

    // Calculate first four valid numbers in this range
    e = (127 + exp) << 23;
    *d = e | 0x00000000;
    cout << hex << "0x" << *d << " = " << fixed << f << endl;
    *d = e | 0x00000001;
    cout << hex << "0x" << *d << " = " << fixed << f << endl;
    *d = e | 0x00000002;
    cout << hex << "0x" << *d << " = " << fixed << f << endl;
    *d = e | 0x00000003;
    cout << hex << "0x" << *d << " = " << fixed << f << endl;
    return 0;
}
For exp = -27 the output will be:
Step size = 0.0000000000000008881784197001252323389053344726562500000000000000000000000000000000000000000000000000
First 4 numbers in range:
0x32000000 = 0.0000000074505805969238281250000000000000000000000000000000000000000000000000000000000000000000000000
0x32000001 = 0.0000000074505814851022478251252323389053344726562500000000000000000000000000000000000000000000000000
0x32000002 = 0.0000000074505823732806675252504646778106689453125000000000000000000000000000000000000000000000000000
0x32000003 = 0.0000000074505832614590872253756970167160034179687500000000000000000000000000000000000000000000000000
const double dBLEPTable_8_BLKHAR[4096] = {
If you change the double in that line to float, then one of two things will happen:
At compile time, the compiler will convert numbers such as -0.00000000239150987901837200000000 to the float values that best represent them, and will then store that data directly into the array.
At runtime, during the program initialization (before main() is called!) the runtime that the compiler generated will fill that array with data of type float.
Either way, once you get to main() and to code that you've written, all of that data will be stored as float variables.

Two bytes into one

First off, I apologize if this is a duplicate, but my Google-fu seems to be failing me today.
I'm in the middle of writing an image format module for Photoshop, and one of the save options for this format includes a 4-bit alpha channel. Of course, the data I have to convert is 8-bit/1-byte alpha, so I essentially need to take every two bytes of alpha and merge them into one.
My attempt (below), I believe, has a lot of room for improvement:
for (int x = 0, w = 0; x < alphaData.size(); x += 2, w++)
{
    short ashort = (alphaData[x] << 8) + alphaData[x+1];
    alphaFinal[w] = (unsigned char)ashort;
}
alphaData and alphaFinal are vectors that contains the 8-bit alpha data and the 4-bit alpha data, respectively. I realize that reducing two bytes into the value of one, is bound to result in loss of "resolution", but I can't help but think there's a better way of doing this.
For extra information, here's the loop that does the reverse (converts 4-bit alpha from the format to 8-bit for Photoshop)
alphaData serves the same purpose as above, and imgData is an unsigned char vector that holds the raw image data. (The alpha data is tacked on after the actual RGB data for the image in this particular variant of the format.)
for (int b = alphaOffset, x2 = 0; b < (alphaOffset + dataLength); b++, x2 += 2)
{
    unsigned char lo = (imgData[b] & 15);
    unsigned char hi = ((imgData[b] >> 4) & 15);
    alphaData[x2] = lo * 17;
    alphaData[x2+1] = hi * 17;
}
Are you sure that it's
    alphaData[x2] = lo * 17;
    alphaData[x2+1] = hi * 17;
and not
    alphaData[x2] = lo * 16;
    alphaData[x2+1] = hi * 16;
In any case, to generate the values that work with the decoding function you have posted, you just have to reverse the operations. So multiplying by 17 becomes dividing by 17 and the shifts and masks get reordered to look like this:
for (int x = 0, w = 0; x < alphaData.size(); x += 2, w++)
{
    unsigned char alpha1 = alphaData[x] / 17;
    unsigned char alpha2 = alphaData[x+1] / 17;
    Assert(alpha1 < 16 && alpha2 < 16);
    alphaFinal[w] = (alpha2 << 4) | alpha1;
}
short ashort = (alphaData[x] << 8) + alphaData[x+1];
alphaFinal[w] = (unsigned char)ashort;
You're actually losing alphaData[x] in alphaFinal: you shift alphaData[x] left by 8 bits, and the cast to unsigned char then keeps only the low 8 bits.
Also, your for loop is unsafe: if for some reason alphaData.size() is odd, you'll run out of range.
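A loop bound like the following avoids reading past the end when the size is odd (a sketch):
for (size_t x = 0, w = 0; x + 1 < alphaData.size(); x += 2, w++)
{
    // both alphaData[x] and alphaData[x+1] are guaranteed to be in range here
}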
What you want to do, I think, is to truncate an 8-bit value to a 4-bit one, not to combine two 8-bit values. In other words, you want to drop the four least significant bits of each alpha value, not combine two different alpha values.
So, basically, you want to right-shift by 4:
output = (input >> 4); /* truncate four bits */
In case you're not familiar with binary shifts, take this random 8-bit number:
10110110
>> 1
= 01011011
>> 1
= 00101101
>> 1
= 00010110
>> 1
= 00001011
so,
10110110
>> 4
= 00001011
and to reverse, left-shift instead...
input = (output << 4); /* expand four bits */
which, using the result from that same random 8-bit number as before, would be
00001011
<< 4
= 10110000
Obviously, as you noted, 4 bits of precision are lost. But you'd be surprised how little it's noticed in a fully-composited work.
This code:
for (int x = 0, w = 0; x < alphaData.size(); x += 2, w++)
{
    short ashort = (alphaData[x] << 8) + alphaData[x+1];
    alphaFinal[w] = (unsigned char)ashort;
}
is broken. Given:
#include <iostream>
using std::cout;
using std::endl;

typedef unsigned char uchar;

int main() {
    uchar x0 = 1;                  // for alphaData[x]
    uchar x1 = 2;                  // for alphaData[x+1]
    short ashort = (x0 << 8) + x1; // The value 0x0102
    uchar afinal = (uchar)ashort;  // truncates to 0x02
    cout << std::hex
         << "x0 = 0x" << (unsigned)x0 << " << 8 = 0x" << (x0 << 8) << endl
         << "x1 = 0x" << (unsigned)x1 << endl
         << "ashort = 0x" << ashort << endl
         << "afinal = 0x" << (unsigned)afinal << endl
         ;
}
If you are saying that your source stream contains sequences of 4-bit pairs stored in 8-bit storage values, which you need to re-store as a single 8-bit value, then what you want is:
for (int x = 0, w = 0; x < alphaData.size(); x += 2, w++)
{
    unsigned char aleft  = alphaData[x] & 0x0f;     // 4 bits
    unsigned char aright = alphaData[x + 1] & 0x0f; // 4 bits
    alphaFinal[w] = (aleft << 4) | (aright);
}
"<<4" is equivalent to "*16", as ">>4" is equivalent to "/16".

How to write a good round_double function in C++?

I'm trying to write a good round_double function that rounds a double to a specified precision:
1.
double round_double(double num, int prec)
{
    for (int i = 0; i < abs(prec); ++i)
        if (prec > 0)
            num *= 10.0;
        else
            num /= 10.0;
    double result = (long long)floor(num + 0.5);
    for (int i = 0; i < abs(prec); ++i)
        if (prec > 0)
            result /= 10.0;
        else
            result *= 10.0;
    return result;
}
2.
double round_double(double num, int prec)
{
    double tmp = pow(10.0, prec);
    double result = (long long)floor(num * tmp + 0.5);
    result /= tmp;
    return result;
}
These functions do what I want, but in my opinion they are not good enough, because starting from precision = 13-14 they return bad results.
The reason I'm sure a good round_double is possible is that simply printing the number via cout at a specified precision (say 18) prints a better result than my function returns.
For example, this piece of code:
int prec = 18;
double num = 10.123456789987654321;
cout << setiosflags(ios::showpoint | ios::fixed)
     << setprecision(prec) << "round_double(" << num << ", "
     << prec << ") = " << round_double(num, prec) << endl;
will print round_double(10.123456789987655000, 18) = -9.223372036854776500 for the first round_double, and round_double(10.123456789987655000, 18) = -9.223372036854776500 for the second one.
How do I write a good round_double function in C++? Or does one already exist?
Don't cast to long long; that forces a conversion to an integer with a limited range, beyond what 10^13 requires (about 19 digits for 64-bit with no whole-number part). Just calling floor should be enough.
double round_double(double num, int prec)
{
    double tmp = pow(10.0, prec);
    double result = floor(num * tmp + 0.5);
    result /= tmp;
    return result;
}
Note that Mike is also correct: you have a limited range you can represent in double itself. That isn't so great if you need clean decimal responses, but the long long cast is the cause of your totally wacky numbers.
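A quick check of the corrected version (a sketch; the printed value is approximate):
#include <cmath>
#include <iomanip>
#include <iostream>

// round_double as defined above

int main()
{
    std::cout << std::setprecision(18)
              << round_double(10.123456789987654321, 13) << "\n"; // ~10.1234567899877
}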
The problem is the floating-point representation. A binary representation cannot represent all decimal numbers exactly, and only has a finite precision.
double usually means a 64-bit binary representation as specified by IEEE754, with a 52-bit fractional part. This gives a precision of approximately 16 decimal digits.
If you need more precision than that, then the best option is probably to use an arbitrary-precision arithmetic library such as GMP. Your compiler may or may not offer a long double type with a higher precision than double.
EDIT: sorry, I didn't notice that you're getting completely incorrect results. As another answer says, this is due to the conversion to long long overflowing.
Another approach is to round based on binary-digits of precision. Sample implementation below - not sure if it's useful to you, but since you got me playing I thought I'd throw it out there.
Notes:
this uses the ieee754.h header common on Linux systems: it could easily be ported to Windows, but this is undeniably bit hackery and whether it's appropriate in any given production code is a case-by-case call.
you could approximate some decimal near-equivalent, e.g. multiply the desired decimal precision by 10 and divide by 3 (based on 2^10 ~= 10^3).
The input number (10.1234...) with 1 bit of precision is 8; with 2 it's 10, etc.
Separately, IMHO decimal rounding is best done at output time, or when using a decimal-capable representation (e.g. storing an int mantissa and power-of-10 exponent).
#include <ieee754.h>
#include <iostream>
#include <iomanip>

double round_double(double d, int precision)
{
    ieee754_double* p = reinterpret_cast<ieee754_double*>(&d);
    std::cout << "mantissa 0:" << std::hex << p->ieee.mantissa0
              << ", 1:" << p->ieee.mantissa1 << '\n';
    unsigned mask0 = precision < 20 ? 0x000FFFFF << (20 - precision) :
                                      0x000FFFFF;
    unsigned mask1 = precision < 20 ? 0 :
                     precision == 53 ? 0xFFFFFFFF :
                                       0xFFFFFFFE << (32 + 20 - precision);
    std::cout << "masks 0:" << mask0 << ", 1: " << mask1 << '\n';
    p->ieee.mantissa0 &= mask0;
    p->ieee.mantissa1 &= mask1;
    std::cout << "mantissa' 0:" << p->ieee.mantissa0
              << ", 1:" << p->ieee.mantissa1 << '\n';
    return d;
}

int main()
{
    double num = 10.123456789987654321;
    for (int prec = 1; prec <= 53; ++prec)
        std::cout << std::setiosflags(std::ios::showpoint | std::ios::fixed)
                  << std::setprecision(60)
                  << "round_double(" << num << ", " << prec << ") = "
                  << round_double(num, prec) << std::endl;
}
Output...
mantissa 0:43f35, 1:ba76eea7
masks 0:fff80000, 1: 0
mantissa' 0:0, 1:0
round_double(10.123456789987654858009591407608240842819213867187500000000000, 1) = 8.000000000000000000000000000000000000000000000000000000000000
mantissa 0:43f35, 1:ba76eea7
masks 0:fffc0000, 1: 0
mantissa' 0:40000, 1:0
round_double(10.123456789987654858009591407608240842819213867187500000000000, 2) = 10.000000000000000000000000000000000000000000000000000000000000
mantissa 0:43f35, 1:ba76eea7
masks 0:fffe0000, 1: 0
mantissa' 0:40000, 1:0
round_double(10.123456789987654858009591407608240842819213867187500000000000, 3) = 10.000000000000000000000000000000000000000000000000000000000000
mantissa 0:43f35, 1:ba76eea7
masks 0:ffff0000, 1: 0
mantissa' 0:40000, 1:0
round_double(10.123456789987654858009591407608240842819213867187500000000000, 4) = 10.000000000000000000000000000000000000000000000000000000000000
mantissa 0:43f35, 1:ba76eea7
masks 0:ffff8000, 1: 0
mantissa' 0:40000, 1:0
round_double(10.123456789987654858009591407608240842819213867187500000000000, 5) = 10.000000000000000000000000000000000000000000000000000000000000
mantissa 0:43f35, 1:ba76eea7
masks 0:ffffc000, 1: 0
mantissa' 0:40000, 1:0
round_double(10.123456789987654858009591407608240842819213867187500000000000, 6) = 10.000000000000000000000000000000000000000000000000000000000000
mantissa 0:43f35, 1:ba76eea7
masks 0:ffffe000, 1: 0
mantissa' 0:42000, 1:0
round_double(10.123456789987654858009591407608240842819213867187500000000000, 7) = 10.062500000000000000000000000000000000000000000000000000000000
mantissa 0:43f35, 1:ba76eea7
masks 0:fffff000, 1: 0
mantissa' 0:43000, 1:0
round_double(10.123456789987654858009591407608240842819213867187500000000000, 8) = 10.093750000000000000000000000000000000000000000000000000000000
mantissa 0:43f35, 1:ba76eea7
masks 0:7ffff800, 1: 0
mantissa' 0:43800, 1:0
round_double(10.123456789987654858009591407608240842819213867187500000000000, 9) = 10.109375000000000000000000000000000000000000000000000000000000
mantissa 0:43f35, 1:ba76eea7
masks 0:3ffffc00, 1: 0
mantissa' 0:43c00, 1:0
round_double(10.123456789987654858009591407608240842819213867187500000000000, a) = 10.117187500000000000000000000000000000000000000000000000000000
mantissa 0:43f35, 1:ba76eea7
masks 0:1ffffe00, 1: 0
mantissa' 0:43e00, 1:0
round_double(10.123456789987654858009591407608240842819213867187500000000000, b) = 10.121093750000000000000000000000000000000000000000000000000000
mantissa 0:43f35, 1:ba76eea7
masks 0:fffff00, 1: 0
mantissa' 0:43f00, 1:0
round_double(10.123456789987654858009591407608240842819213867187500000000000, c) = 10.123046875000000000000000000000000000000000000000000000000000
etc....