Exact binary representation of a double [duplicate] - c++

I have a very small double var, and when I print it I get -0. (using C++).
Now in order to get better precision I tried using
cout.precision(18); // I think 18 is the max precision I can get.
cout.setf(ios::fixed,ios::floatfield);
cout<<var; // var is a double.
but it just writes -0.00000000000...
I want to see the exact binary representation of the var.
In other words I want to see what binary number is written in the stack memory/register for this var.

union myUnion {
double dValue;
uint64_t iValue;
};
myUnion myValue;
myValue.dValue=123.456;
cout << myValue.iValue;
Update:
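The version above will work for most purposes, but it assumes 64 bit doubles. This version makes no assumption about the size of a double and generates a binary representation, printing the bytes in memory order (so on a little-endian machine the least significant byte comes first):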
double someDouble=123.456;
unsigned char rawBytes[sizeof(double)];
memcpy(rawBytes,&someDouble,sizeof(double));
//The C++ standard does not guarantee 8-bit bytes
unsigned char startMask=1;
while (0!=static_cast<unsigned char>(startMask<<1)) {
startMask<<=1;
}
bool hasLeadBit=false; //set this to true if you want to see leading zeros
size_t byteIndex;
for (byteIndex=0;byteIndex<sizeof(double);++byteIndex) {
unsigned char bitMask=startMask;
while (0!=bitMask) {
if (0!=(bitMask&rawBytes[byteIndex])) {
std::cout<<"1";
hasLeadBit=true;
} else if (hasLeadBit) {
std::cout<<"0";
}
bitMask>>=1;
}
}
if (!hasLeadBit) {
std::cout<<"0";
}

This way is guaranteed to work by the standard, provided double and uint64_t have the same size:
double d = -0.0;
uint64_t u;
memcpy(&u, &d, sizeof(d));
std::cout << std::hex << u;

Try:
printf("0x%08x\n", myFloat);
This should work for a 32 bit variable, to display it in hex. I've never tried using this technique to see a 64 bit variable, but I think it's:
printf("%016llx\n", myDouble);
EDIT: tested the 64-bit version and it definitely works on Win32 (I seem to recall the need for uppercase LL on GCC.. maybe)
EDIT2: if you really want binary, you are best off using one of the other answers to get a uint64_t version of your double, and then looping:
for ( int i = 63; i >= 0; i-- )
{
printf( "%d", (int)((myUint64 >> i ) & 1) ); // cast so the argument matches %d
}

Related

How to deal with the sign bit of integer representations with odd bit counts?

Let's assume we have a representation of -63 as signed seven-bit integer within a uint16_t. How can we convert that number to float and back again, when we don't know the representation type (like two's complement).
An application for such an encoding could be that several numbers are stored in one int16_t. The bit-count could be known for each number and the data is read/written from a third-party library (see for example the encoding format of tivxDmpacDofNode() here: https://software-dl.ti.com/jacinto7/esd/processor-sdk-rtos-jacinto7/latest/exports/docs/tiovx/docs/user_guide/group__group__vision__function__dmpac__dof.html --- but this is just an example). An algorithm should be developed that makes the compiler create the right encoding/decoding independent from the actual representation type. Of course it is assumed that the compiler uses the same representation type as the library does.
One way that seems to work well, is to shift the bits such that their sign bit coincides with the sign bit of an int16_t and let the compiler do the rest. Of course this makes an appropriate multiplication or division necessary.
Please see this example:
#include <cmath>
#include <cstdint>
#include <iostream>
int main()
{
// -63 as signed seven-bits representation
uint16_t data = 0b1000001;
// Shift 9 bits to the left
int16_t correct_sign_data = static_cast<int16_t>(data << 9);
float f = static_cast<float>(correct_sign_data);
// Undo effect of shifting
f /= pow(2, 9);
std::cout << f << std::endl;
// Now back to signed bits
f *= pow(2, 9);
uint16_t bits = static_cast<uint16_t>(static_cast<int16_t>(f)) >> 9;
std::cout << "Equals: " << (data == bits) << std::endl;
return 0;
}
I have two questions:
This example uses actually a number with known representation type (two's complement) converted by https://www.exploringbinary.com/twos-complement-converter/. Is the bit-shifting still independent from that and would it work also for other representation types?
Is this the canonical and/or most elegant way to do it?
Clarification:
I know the bit width of the integers I would like to convert (please check the link to the TIOVX example above), but the integer representation type is not specified.
The intention is to write code that can be recompiled without changes on a system with another integer representation type and still correctly converts from int to float and/or back.
My claim is that the example source code above does exactly that (except that the example input data is hardcoded and it would have to be different if the integer representation type were not two's complement). Am I right? Could such a "portable" solution be written also with a different (more elegant/canonical) technique?
Your question is ambiguous as to whether you intend to truly store odd-bit integers, or odd-bit floats represented by custom-encoded odd-bit integers. I'm assuming by "not knowing" the bit-width of the integer, that you mean that the bit-width isn't known at compile time, but is discovered at runtime as your custom values are parsed from a file, for example.
Edit by author of original post:
The assumption in the original question that the presented code is independent from the actual integer representation type, is wrong (as explained in the comments). Integer types are not specified, for example it is not clear that the leftmost bit is the sign bit. Therefore the presented code also contains assumptions, they are just different (and most probably worse) than the assumption "integer representation type is two's complement".
Here's a simple example of storing an odd-bit integer. I provide a simple struct that lets you decide how many bits are in your integer. However, for simplicity in this example, I used uint8_t, which obviously has a maximum of 8 bits. There are several different assumptions and simplifications made here, so if you want help on any specific nuance, please specify more in the comments and I will edit this answer.
One key detail is to properly mask off your n-bit integer after performing 2's complement conversions.
Also please note that I have basically ignored overflow concerns and bit-width switching concerns that may or may not be a problem depending on how you intend to use your custom-width integers and the maximum bit-width you intend to support.
#include <cstdint>
#include <iostream>
#include <string>
struct CustomInt {
int bitCount = 7;
uint8_t value;
uint8_t mask = 0;
CustomInt(int _bitCount, uint8_t _value) {
bitCount = _bitCount;
value = _value;
mask = 0;
for (int i = 0; i < bitCount; ++i) {
mask |= (1 << i);
}
}
bool isNegative() {
return (value >> (bitCount - 1)) & 1;
}
int toInt() {
bool negative = isNegative();
uint8_t tempVal = value;
if (negative) {
tempVal = ((~tempVal) + 1) & mask;
}
int ret = tempVal;
return negative ? -ret : ret;
}
float toFloat() {
return toInt(); //Implied truncation!
}
void setFromFloat(float f) {
int intVal = f; //Implied truncation!
bool negative = f < 0;
if (negative) {
intVal = -intVal;
}
value = intVal;
if (negative) {
value = ((~value) + 1) & mask;
}
}
};
int main() {
CustomInt test(7, 0b01001110); // -50. Would be 78 if this were a normal 8-bit integer
std::cout << test.toFloat() << std::endl;
}

Casting float to int inconsistent across MinGw and Clang

Using C++, I'm trying to cast a float value to an int using these instructions :
#include <iostream>
int main() {
float NbrToCast = 1.8f;
int TmpNbr = NbrToCast * 10;
std::cout << TmpNbr << "\n";
}
I understand the value 1.8 cannot be precisely represented as a float and is actually stored as 1.79999995.
Thus, I would expect that multiplying this value by ten would result in 17.9999995, and then casting it to an int would give 17.
When compiling and running this code with MinGW (v4.9.2 32bits) on Windows 7, I get the expected result (17).
When compiling and running this code with CLang (v600.0.57) on my Mac (OS X 10.11), I get 18 as a result, which is not what I was expecting but which seems more correct in a mathematical way !
Why do I get this difference ?
Is there a way to have a consistent behavior no matter the OS or the compiler ?
Like Yuushi said in the comments, I guess the rounding rules may differ for each compiler. Having a portable solution on such a topic probably means you need to write your own rounding method.
So in your case you probably need to check the value of the digit after 7 and increment the value or not. Let's say something like:
#include <cmath>
#include <iostream>
// Defined before main() so the call below compiles.
int RoundingFloatToInt(const float &val)
{
float intPart, fractPart;
fractPart = std::modf(val, &intPart); // split into integer and fractional parts
int result = intPart;
if (fractPart > 0.5)
{
result++;
}
return result;
}
int main() {
float NbrToCast = 1.8f;
float TmpNbr = NbrToCast * 10;
std::cout << RoundingFloatToInt(TmpNbr) << "\n";
}
(code not tested at all but you have the idea)
If you need performance, it's probably not great but I think it should be portable.

FP number's exponent field is not what I expected, why?

I've been stumped on this one for days. I've written this program from a book called Write Great Code Volume 1 Understanding the Machine Chapter four.
The project is to do Floating Point operations in C++. I plan to implement the other operations in C++ on my own; the book uses HLA (High Level Assembly) in the project for other operations like multiplication and division.
I wanted to display the exponent and other field values after they've been extracted from the FP number, for debugging. Yet I have a problem: when I look at these values in memory they are not what I think they should be. Key words: what I think. I believe I understand the IEEE FP format; it's fairly simple and I understand all I've read so far in the book.
The big problem is why the Rexponent variable seems to be almost unpredictable; in this example with the given values it's 5. Why is that? By my guess it should be two. Two because the decimal point is two digits right of the implied one.
I've commented the actual values that are produced in the program into the code so you don't have to run the program to get a sense of what's happening (at least in the important parts).
It is unfinished at this point. The entire project has not been created on my computer yet.
Here is the code (quoted from the file which I copied from the book and then modified):
#include<iostream>
typedef long unsigned real; //typedef our long unsigned ints in to the label "real" so we don't confuse it with other datatypes.
using namespace std; //Just so I don't have to type out std::cout any more!
#define asreal(x) (*((float *) &x)) //Reinterpret the address of x as a float pointer and dereference it, so the compiler doesn't convert (truncate) our FP values.
inline int extractExponent(real from) {
return ((from >> 23) & 0xFF) - 127; //Shift right 23 bits; & with eight ones (0xFF == 1111_1111 ) and make bias with the value by subtracting all ones from it.
}
void fpadd ( real left, real right, real *dest) {
//Left operand field containers
long unsigned int Lexponent = 0;
long unsigned Lmantissa = 0;
int Lsign = 0;
//RIGHT operand field containers
long unsigned int Rexponent = 0;
long unsigned Rmantissa = 0;
int Rsign = 0;
//Resulting operand field containers
long int Dexponent = 0;
long unsigned Dmantissa = 0;
int Dsign = 0;
std::cout << "Size of datatype: long unsigned int is: " << sizeof(long unsigned int); //For debugging
//Properly initialize the above variable's:
//Left
Lexponent = extractExponent(left); //Zero. This value is NOT a flat zero when displayed because we subtract 127 from the exponent after extracting it! //Value is: 0xffffff81
Lmantissa = extractMantissa (left); //Zero. We don't do anything to this number except add a whole number one to it. //Value is: 0x00000000
Lsign = extractSign(left); //Simple.
//Right
Rexponent = extractExponent(right); //Value is: 0x00000005 <-- why???
Rmantissa = extractMantissa (right);
Rsign = extractSign(right);
}
int main (int argc, char *argv[]) {
real a, b, c;
asreal(a) = -0.0;
asreal(b) = 45.67;
fpadd(a,b, &c);
printf("Sum of A and B is: %f", c);
std::cin >> a;
return 0;
}
Help would be much appreciated; I'm several days in to this project and very frustrated!
"in this example with the given values it's 5. Why is that?"
The floating point number 45.67, stored as a double, is internally represented as
2^5 * 1.0110110101011100001010001111010111000010100011110110
which actually represents the number
45.6700000000000017053025658242404460906982421875
This is as close as you can get to 45.67 inside a double. A float keeps only the first 23 fraction bits (giving 45.6699981689453125), but the exponent is the same.
If all you are interested in is the exponent of a number, simply compute its base 2 logarithm and round down. Since 45.67 is between 32 (2^5) and 64 (2^6), the exponent is 5.
Computers use binary representation for all numbers. Hence, the exponent is for base two, not base ten. int(log2(45.67)) = 5.

read float and double from binary data in C++

I need to be able to read in a float or double from binary data in C++, similarly to Python's struct.unpack function. My issue is that the data I am receiving will always be big-endian. I have dealt with this for integer values as described here, but working byte by byte does not work with floating point values. I need a way to extract floating point values (both 32 bit floats and 64 bit doubles) in C++, similar to how you would use struct.unpack(">f", num) or struct.unpack(">d", num) in Python.
here's an example of what I have tried:
struct.unpack("d", num) ==> *(double*) str; // if str is a char* containing the data
That works fine if str is little-endian, but not if it is big-endian, as I know it will always be. The problem is that I do not know what the native endianness of the environment will be, so I need to be able to extract the binary data as big-endian at all times.
If you look at the linked question, you'll see this is easily done using bitwise-ors and bitshifts for integer values, but that method does not work for floating point.
NOTE I should have pointed this out earlier, but I cannot use c++11 or any third party libraries other than Boost.
Why would working byte by byte not work with floating point values?
Just extract 32bit integer as usual, then reinterpret it as float: float f = *(float*)&i
And the same for 64bit integers and double
#include <algorithm> // std::swap
void ByteSwap(void * data, int size)
{
char * ptr = (char *) data;
for (int i = 0; i < size/2; ++i)
std::swap(ptr[i], ptr[size-1-i]);
}
bool LittleEndian()
{
int test = 1;
return *((char *)&test) == 1;
}
if (LittleEndian())
ByteSwap(&my_double, sizeof(double));

scanf not taking in long double

I have a problem with scanf not reading long double in the code below:
(please excuse my poor English)
#include <iostream>
#include <cstdlib>
#include <math.h>
using namespace std;
int main()
{
int n;
scanf("%d",&n);
long double a,b,c,ha,hb,hc,ma,cosa,r,l,res,area;
for (int i=0;i<n;i++)
{
scanf("%Lf %Lf %Lf %Lf",&a,&ha,&hb,&hc);//this is where the problem lies,
//i need to read 4 long double a,ha,hb,hc
printf("%Lf %Lf %Lf %Lf\n",a,ha,hb,hc);//but it returned wrong answer so
//i used printf to check, ps: the code works with float but not with double or
//long double
ha*=3;hb*=3;hc*=3;
c=(a*ha)/hc; b=(a*ha)/hb;
ma=sqrt(0.5*b*b+0.5*c*c-0.25*a*a);
cosa=ha/ma;
r=(2*ma)/3;
l=b*(b-sqrt(a*a-hb*hb))/ha;
res=sqrt(l*l+r*r-2*cosa*l*r);
area=a*ha/2;
printf("%.3Lf %.3Lf\n",area,res);
}
system("PAUSE");
return 0;
}
here's the input:
2
3.0 0.8660254038 0.8660254038 0.8660254038
657.8256599140 151.6154399062 213.5392629932 139.4878846649
and here's what's shown in the command line:
2
3.0 0.8660254038 0.8660254038 0.8660254038
3.000000 -4824911833695204400000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000.000000 284622047019579100000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000.0
00000 0.866025
-2.000 0.000
657.8256599140 151.6154399062 213.5392629932 139.4878846649
657.825660 -0.000000 28969688850499604000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000.000000 213.539263
-2.000 0.000
Press any key to continue . . .
I want to know why scanf won't take in long double in the code and how to fix it.
Thanks in advance!
Dev-c++ uses MinGW, which uses the gcc compiler and the Microsoft runtime library. Unfortunately, those components disagree on the underlying type to be used for long double (64 vs. 80 or 96 bits, I think). Windows assumes long double is the same size as double; gcc makes long double bigger.
Either choice is valid, but the combination results in a broken C and C++ implementation.
If you don't need the extra range and precision, you can read into a double and store into a long double.
Otherwise, you can write or borrow a custom string to long double converter, or just use a different implementation.
EDIT
More details:
Microsoft's own compiler and runtime library are consistent in treating long double as 64 bits, the same size as double. The language standard permits this (it requires long double to be at least as wide as double, but places the same requirements on both), but it does seem odd that it doesn't take advantage of the x86's 80-bit floating-point hardware.
gcc on x86 treats long double as 96 bits (sizeof (long double) == 12). I think only 80 of those bits are significant; the extra 16 bits are for alignment purposes.
MinGW uses gcc as its compiler, but uses Microsoft's runtime library. For most language features, this works fine, but the mismatch for long double means that you can do computations with long double, but you can't pass long double values (or pointers to them) to the runtime library. It's a bug in MinGW.
There are workarounds within MinGW. You can define the macro __USE_MINGW_ANSI_STDIO, either by passing -D__USE_MINGW_ANSI_STDIO on the gcc command line or by adding a line
#define __USE_MINGW_ANSI_STDIO
to your source files. (It has to be defined before #include <stdio.h>.) A commenter, paulsm4, says that the -ansi and -posix options cause MinGW to use its own conforming library (I have no reason to doubt this, but I'm not currently able to confirm it). Or you can call __mingw_printf() directly.
Assuming you're on Windows, Cygwin might be a good alternative (it uses gcc, but it doesn't use Microsoft's runtime library). Or you can use long double internally, but double for I/O.
You are a lucky lucky man. This won't solve the general problem of long double on MinGW, but I'll explain what is happening to your problem. Now, in a far far day when you'll be able to upvote, I want your upvote. :-) (but I don't want this to be marked as the correct response. It's the response to what you need, but not to what you asked (the general problem in your title scanf not taking in long double) ).
First, the solution: use float. Use %f in scanf/printf. The results come out exactly equal to the ones given as the solution on your site. As a side note, if you want to printf with some decimals, do as shown in the last printf: %.10f will print 10 decimals after the decimal separator.
Second: why you had a problem with doubles: the res=sqrt() calculates a square root. Using floats, l*l+r*r-2*cosa*l*r == 0.0; using doubles it's -1.0781242565371940e-010, so something near zero BUT NEGATIVE!!! So the sqrt(-something) is NaN (Not a Number), a special value of double/float. You can check if a number is NaN by doing res != res. This works because NaN != NaN (note that this isn't guaranteed by older C standards, but many compilers on Intel platforms do it: http://www.gnu.org/s/hello/manual/libc/Infinity-and-NaN.html). And this explains why the printf printed something like -1.#IO.
You can avoid most of your conversion problems by actually using C++ instead of using legacy C-functions:
#include <algorithm>
#include <cmath>
#include <iostream>
#include <iterator>
int main()
{
long double a = 0.0;
long double ha = 0.0;
long double hb = 0.0;
long double hc = 0.0;
int n = 0;
std::cout << "Enter Count: ";
std::cin >> n;
for (int i = 0; i < n; i++)
{
std::cout << "Enter A, Ha, Hb, Hc: ";
std::cin >> a >> ha >> hb >> hc;
std::cout.precision(10);
std::cout << "You Entered: "
<< a << " " << ha << " " << hb << " " << hc << std::endl;
ha *= 3;
hb *= 3;
hc *= 3;
long double c = (a * ha) / hc;
long double b = (a * ha) / hb;
long double ma = static_cast<long double>(std::sqrt(0.5 * b * b + 0.5 * c * c - 0.25 * a * a));
long double cosa = ha / ma;
long double r = (2 * ma) / 3;
long double l = b * (b - static_cast<long double>(std::sqrt(a * a - hb * hb))) / ha;
long double res = static_cast<long double>(std::sqrt(l * l + r * r - 2 * cosa * l * r));
long double area = a * ha / 2.0;
std::cout << "Area = " << area << std::endl;
}
return 0;
}
Don't know if this is of use to you but you could have a look at it.
long long int XDTOI(long double VALUE)
{
union
{
long double DWHOLE;
struct
{
unsigned int DMANTISSALO:32;
unsigned int DMANTISSAHI:32;
unsigned int DEXPONENT:15;
unsigned int DNEGATIVE:1;
unsigned int DEMPTY:16;
} DSPLIT;
} DKEY;
union
{
unsigned long long int WHOLE;
struct
{
unsigned int ARRAY[2];
} SPLIT;
} KEY;
int SIGNBIT,RSHIFT;
unsigned long long int BIGNUMBER;
long long int ACTUAL;
DKEY.DWHOLE=VALUE; SIGNBIT=DKEY.DSPLIT.DNEGATIVE;
RSHIFT=(63-(DKEY.DSPLIT.DEXPONENT-16383));
KEY.SPLIT.ARRAY[0]=DKEY.DSPLIT.DMANTISSALO;
KEY.SPLIT.ARRAY[1]=DKEY.DSPLIT.DMANTISSAHI;
BIGNUMBER=KEY.WHOLE;
BIGNUMBER>>=RSHIFT;
ACTUAL=((long long int)(BIGNUMBER));
if(SIGNBIT==1) ACTUAL=(-ACTUAL);
return ACTUAL;
}
Sadly enough, long double has faulty printing in GCC/Windows. However, you can guarantee that long double still does higher precision calculations in the background when you're doing arithmetic and trigonometry, because it would store at least 80 or 96 bits.
Therefore, I recommend this workaround for various things:
Use scanf on doubles, but cast them to long doubles after. You don't need precision in input parsing, anyway.
double x;
scanf("%lf", &x); // no biggie
long double y = x;
Make sure you use long double versions of functions in the <math.h> library. The normal ones will just cast your long double to double, so the higher precision will become useless.
long double sy = sqrtl(y); // not sqrt
long double ay = 2.0L * acosl(y); // note the L-suffix in the constant
To print your long double, just cast them back to double and use "%lf". Double can have at most 15 significant digits so it is more than enough. Of course, if it's not enough, you ought to switch to Linux's GCC.
printf("%.15lf\n", (double) y);
Most programs actually don't need the extra digits for the output. The thing is, even the early digits lose their precision at the slightest use of sqrt or trig functions. THEREFORE, it's OK to keep the double at least just for printing, but what's important is that you still use long double for the rough calculations to not lose the precision you worked hard to invest.