Bit representation of float using an int pointer - c++

I have the following exercise:
Implement a function void float_to_bits(float x) which prints the bit
representation of x. Hint: Casting a float to an int truncates the
fractional part, but no information is lost casting a float pointer to
an int pointer.
Now, I know that a float is represented by a sign bit, some bits for its mantissa, some bits for the basis and some bits for the exponent. How many bits are used depends on my system.
The problem we are facing here is that our number basically has two parts. Let's consider 8.7: the bit representation of this number would be (to my understanding) the following: 1000.0111
Now, floats are stored with a leading zero, so 8.8 would become 0.88*10^1
So I somehow have to get all the information out of my memory. I don't really see how I should do that. What is that hint pointing me to? What's the difference between an integer pointer and a float pointer?
Currently I have this:
void float_to_bits() {
float a = 4.2345678f;
int* b;
b = (int*)(&a);
*b = a;
std::cout << *(b) << "\n";
}
But I really don't get the bigger picture behind the hint here. How do I get the mantissa, the exponent, the sign and the basis? I also tried playing around with the bitwise operators >> and <<, but I just don't see how they should help me here, since they won't change the pointer's position. They are useful to get e.g. the bit representation of an integer, but that's about it; I have no idea what use they'd be here.

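The hint your teacher gave is misleading: casting pointers between different types is at best implementation-defined, and reading a float through an int pointer violates the strict aliasing rule, which is undefined behavior in C++. However, memcpy(...)ing an object to a suitably sized array of unsigned char is well defined. The content of the resulting array can then be decomposed into bits. Here is a quick hack to represent the bits using hexadecimal values (the bytes are printed in memory order, so the output depends on the machine's endianness):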
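#include <iostream>
#include <iomanip>
#include <cstring>
int main() {
    float f = 8.7f;
    // Copy the object representation into a byte array (well defined, unlike a pointer cast).
    unsigned char bytes[sizeof(float)];
    std::memcpy(bytes, &f, sizeof(float));
    std::cout << std::hex << std::setfill('0');
    // Each byte is widened to int so it prints as a number rather than a character.
    for (int b: bytes) {
        std::cout << std::setw(2) << b;
    }
    std::cout << '\n';
}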
Note that IEEE 754 binary floating-point formats do not store the full significand (the standard doesn't use mantissa as a term) except for denormalized values. A 32-bit float stores:
1 bit for the sign
8 bits for the exponent
23 bits for the normalized significand with the non-zero high bit being implied

The hint shows you how to get the float's bits into an integer without going through a value conversion.
When you assign a floating-point value to an integer, the processor removes the fractional part: int i = (int) 4.502f; results in i = 4.
But when you make an int pointer (int*) point to a float's location,
no conversion is made, not even when you read the value through the int*.
To show the representation, I like seeing hex numbers,
which is why my first example was given in hex
(each hexadecimal digit represents 4 binary digits).
But it is also possible to print in binary,
and there are many ways to do it (I like this one best!).
An annotated example follows:
#include <iostream>
#include <bitset>
using namespace std;
int main()
{
float a = 4.2345678f; // allocate space for a float. Call it 'a' and put the floating point value of `4.2345678f` in it.
unsigned int* b; // allocate space for a pointer (address), call it 'b' (hint to the compiler: it will point to an integer)
b = (unsigned int*)(&a); // GREAT, exactly what you needed! Take the float 'a' and get its address with '&'.
// By default it is an address pointing at a float (float*), so you correctly cast it to (int*).
// Bottom line: set 'b' to the address of 'a', but treat that address as the address of an int!
// The hint implied that this won't cause a type conversion:
// int someInt = a; // would cause `someInt = 4`; the same goes for your line below:
// *b = a; // <<<< this was your error.
// 1st, it isn't required, as 'b' already points to the address of 'a' and hence sees its bits.
// 2nd, it sets the value pointed to by 'b' to 'a' (including the conversion to int = 4);
// the value in 'a' actually changes too because of this instruction.
cout << a << " in binary " << bitset<32>(*b) << endl;
cout << "Sign " << bitset<1>(*b >> 31) << endl; // 1 bit (31)
cout << "Exp " << bitset<8>(*b >> 23) << endl; // 8 bits (23-30)
cout << "Mantissa " << bitset<23>(*b) << endl; // 23 bits (0-22)
}

Related

How to set precision of a float?

For a number a = 1.263839, we can do -
float a = 1.263839;
cout << fixed << setprecision(2) << a <<endl;
output :- 1.26
But what if I want to set the precision of a number and store it, for example
convert 1.263839 to 1.26 without printing it?
Quoting the question: "But what if I want to set the precision of a number and store it"
You can store the desired precision in a variable:
int precision = 2;
You can then later use this stored precision when converting the float to a string:
std::cout << std::setprecision(precision) << a;
I think OP wants to convert from 1.263839 to 1.26 without printing the number.
If this is your goal, then you first must realise that 1.26 is not representable in the most commonly used floating-point representations. The closest representable 32-bit binary IEEE 754 value is 1.2599999904632568359375.
So, assuming such representation, the best that you can hope for is some value that is very close to 1.26. In best case the one I showed, but since we need to calculate the value, keep in mind that some tiny error may be involved beyond the inability to precisely represent the value (at least in theory; there is no error with your example input using the algorithm below, but the possibility of accuracy loss should always be considered with floating point math).
The calculation is as follows:
Let P be the number of digits after the decimal point that you want to round to (2 in this case).
Let D be 10^P (100 in this case).
Multiply the input by D.
std::round to the nearest integer.
Divide by D.
P.S. Sometimes you might not want to round to the nearest, but instead want std::floor or std::ceil to the precision. This is slightly trickier. Simply std::floor(val * D) / D is wrong. For example 9.70 floored to two decimals that way would become 9.69, which would be undesirable.
What you can do in this case is multiply with one magnitude of precision, round to nearest, then divide the extra magnitude and proceed:
Let P be the number of digits after the decimal point that you want to round to (2 in this case).
Let D be 10^P (100 in this case).
Multiply the input by D * 10.
std::round to the nearest integer.
Divide by 10.
std::floor or std::ceil.
Divide by D.
You would need to truncate it. Possibly the easiest way is to multiply it by a factor (in case of 2 decimal places, by a factor of 100), then truncate or round it, and lastly divide by the very same factor.
Now, mind you, that floating-point precision issues might occur, and that even after those operations your float might not be 1.26, but 1.26000000000003 instead.
If your goal is to store a number with a small, fixed number of digits of precision after the decimal point, you can do that by storing it as an integer with an implicit power-of-ten multiplier:
#include <stdio.h>
#include <math.h>
// Given a floating point value and the number of digits
// after the decimal-point that you want to preserve,
// returns an integer encoding of the value.
int ConvertFloatToFixedPrecision(float floatVal, int numDigitsAfterDecimalPoint)
{
return (int) roundf(floatVal*powf(10.0f, numDigitsAfterDecimalPoint));
}
// Given an integer encoding of your value (as returned
// by the above function), converts it back into a floating
// point value again.
float ConvertFixedPrecisionBackToFloat(int fixedPrecision, int numDigitsAfterDecimalPoint)
{
return ((float) fixedPrecision) / powf(10.0f, numDigitsAfterDecimalPoint);
}
int main(int argc, char ** arg)
{
const float val = 1.263839;
int fixedTwoDigits = ConvertFloatToFixedPrecision(val, 2);
printf("fixedTwoDigits=%i\n", fixedTwoDigits);
float backToFloat = ConvertFixedPrecisionBackToFloat(fixedTwoDigits, 2);
printf("backToFloat=%f\n", backToFloat);
return 0;
}
When run, the above program prints this output:
fixedTwoDigits=126
backToFloat=1.260000
If you're talking about storing exactly 1.26 in your variable, chances are you can't (there may be an off chance that exactly 1.26 works, but let's assume it doesn't for a moment) because floating point numbers don't work like that. There are always little inaccuracies because of the way computers handle floating point decimal numbers. Even if you could get 1.26 exactly, the inaccuracies can return the moment you try to use it in a calculation.
That said, you can use some math and truncation tricks to get very close:
#include <iostream>
using std::cout;
int main()
{
    // our float
    float a = 1.263839;
    // the scale factor for the precision we're trying to accomplish
    int precision = 100; // 2 decimal places
    // because it's an int, this will keep the 126 but lose everything else
    int truncated = a * precision; // multiplying by the precision ensures we keep that many digits
    // convert it back to a float
    // Of course, we need to ensure we're doing floating point division
    float b = static_cast<float>(truncated) / precision;
    cout << "a: " << a << "\n";
    cout << "b: " << b << "\n";
    return 0;
}
Output:
a: 1.26384
b: 1.26
Note that this is not really 1.26 here, but it is very close.
This can be demonstrated by using setprecision():
cout << "a: " << std::setprecision(10) << a << "\n";
cout << "b: " << std::setprecision(10) << b << "\n";
Output:
a: 1.263839006
b: 1.25999999
So again, it's not exactly 1.26, but very close, and slightly closer than you were before.
Using a stringstream would be an easy way to achieve that:
#include <iostream>
#include <iomanip>
#include <sstream>
using namespace std;
int main() {
stringstream s("");
s << fixed << setprecision(2) << 1.263839;
float a;
s >> a;
cout << a; //Outputs 1.26
return 0;
}

c++ print number in hexadecimal right after floor function

I've noticed some weird behaviour in C++ which I don't understand.
I'm trying to print a truncated double in hexadecimal representation.
The output of this code is 17, which is a decimal representation:
double a = 17.123;
cout << hex << floor(a) << '\n';
while the output of this code is 11, which is the output I want:
double a = 17.123;
long long aASll = floor(a);
cout << hex << aASll << '\n';
As a double can hold really big numbers, I'm afraid of getting wrong output when storing the truncated number in a long long variable. Any suggestions or improvements?
Quoting CPPreference's documentation page for std::hex (and friends)
Modifies the default numeric base for integer I/O.
This suggests that std::hex does not have any effect on floating point inputs. The best you are going to get is
cout << hex << static_cast<long long>(floor(a)) << '\n';
or a function that does the same.
uintmax_t from <cstdint> may be useful to get the largest available integer if the values are always positive. After all, what is a negative hex number?
Since a double value can easily exceed the maximum resolution of available integers, this won't cover the whole range. If the floored values exceed what can fit in an integer type, you are going to have to do the conversion by hand or use a big integer library.
Side note: std::hexfloat does something very different and does not work correctly in all compilers due to some poor wording in the current Standard that has since been hammered out and should be corrected in the next revision.
Just write your own version of floor and have it return an integral value. For example:
#include <cmath>
#include <iostream>
using namespace std;
long long floorAsLongLong(double d)
{
    return (long long)floor(d);
}
int main() {
    double a = 17.123;
    cout << hex << floorAsLongLong(a) << endl;
}

How long double fits so many characters in just 12 bytes?

I made an example, a C++ factorial.
When entering a large number, 1754 for example, it calculates a number that apparently would not fit in a long double.
#include <iostream>
#include <string>
using namespace std;
int main()
{
    unsigned int n;
    long double fatorial = 1;
    cout << "Enter number: ";
    cin >> n;
    for (unsigned int i = 1; i <= n; ++i)
    {
        fatorial *= i;
    }
    string s = to_string(fatorial);
    cout << "Factorial of " << n << " = " << fatorial << " = " << s;
    return 0;
}
Important note:
This was compiled with GCC on Windows; with Visual Studio, long double behaves like a double.
Is the issue in how the value is stored, or in the to_string function?
std::to_string(factorial) will return a string containing the same result as std::sprintf(buf, "%Lf", value).
In turn, %Lf prints the entire integer part of a long double, a period and 6 decimal digits of the fractional part.
Since factorial is a very large number, you end up with a very long string.
However, note that this has nothing to do with long double. A simpler example with e.g. double is:
#include <iostream>
#include <string>
int main()
{
std::cout << std::to_string(1e300) << '\n';
return 0;
}
This will print:
10000000000000000525047602 [...300 decimal digits...] 540160.000000
The decimal digits are not exactly zero because the number is not exactly 1e300 but the closest to it that can be represented in the floating-point type.
It doesn't fit that many characters. Rather, to_string produces that many characters from the data.
Here is a toy program:
std::string my_to_string( bool b ) {
if (b)
return "This is a string that never ends, it goes on and on my friend, some people started typing it not knowing what it was, and now they still are typing it it just because this is the string that never ends, it goes on and on my friend, some people started typing it not knowing what it was, and now they still are typing it just because...";
else
return "no it isn't, I can see the end right ^ there";
}
bool stores exactly 1 bit of data. But the string it produces from calling my_to_string can be as long as you want.
double's to_string is like that. It generates far more characters than there is "information" in the double.
This is because it is encoded as a base 10 number on output. Inside the double, it is encoded as a combination of an unsigned number, a sign bit, and an exponential part.
The "value" is then roughly "1+number/2^constant", times +/- one for the sign, times "2^exponential part".
There are only a certain number of "bits of precision" in base 2; if you printed it in base 2 (or hex, or any power-of-2 base) the double would have a few non-zero digits, then a pile of 0s afterwards (or, if small, it would have 0.0000...000 then a handful of non-zero digits).
But when converted to base 10 there isn't a pile of zero digits in it.
Take 0b100000000 -- aka 2^8. This is 256 in base 10 -- it has no trailing 0s at all!
This is because floating point numbers only store an approximation of the actual value. If you look at the actual exact value of 1754! you'll see that your result becomes completely different after the first ~18 digits. The digits after that are just the result of writing (a multiple of) a large power of two in decimal.

shifting the binary numbers in c++

#include <iostream>
int main()
{
using namespace std;
int number, result;
cout << "Enter a number: ";
cin >> number;
result = number << 1;
cout << "Result after bitshifting: " << result << endl;
}
If the user inputs 12, the program outputs 24.
In a binary representation, 12 is 0b1100. However, the result the program prints is 24 in decimal, not 8 (0b1000).
Why does this happen? How may I get the result I expect?
Why does the program output 24?
You are right, 12 is 0b1100 in its binary representation. That being said, it is also 0b001100 if you want. In this case, bitshifting to the left gives you 0b011000, which is 24. The program produces the expected result.
Where does this stop?
You are using an int variable. Its size is typically 4 bytes (32 bits) on mainstream 32-bit and 64-bit platforms. However, it is a bad idea to rely on int's size. Use the fixed-width types from stdint.h (<cstdint> in C++) when you need variables of a specific size.
A word of warning for bitshifting over signed types
Using the << bitshift operator on negative values is undefined behavior before C++20, and the behaviour of >> on negative values is implementation-defined before C++20. In your case, I would recommend using an unsigned int (or just unsigned, which is the same), because int is signed.
How to get the result you expect?
If you know the size (in bits) of the number the user inputs, you can use a bitmask using the & (bitwise AND) operator. e.g.
result = (number << 1) & 0b1111; // 0xF would also do the same

C++ Precision: String to Double

I am having a problem with precision of a double after performing some operations on a converted string to double.
#include <iostream>
#include <sstream>
#include <math.h>
using namespace std;
// conversion function
void convert(const char * a, const int i, double &out)
{
double val;
istringstream in(a);
in >> val;
cout << "char a -- " << a << endl;
cout << "val ----- " << val << endl;
val *= i;
cout << "modified val --- " << val << endl;
cout << "FMOD ----- " << fmod(val, 1) << endl;
out = val;
}
This isn't the case for all numbers entered as a string, so the error isn't constant.
It only affects some numbers (34.38 seems to be affected consistently).
At the moment, it returns this when I pass in a = 34.38 and i = 100:
char a -- 34.38
val ----- 34.38
modified val --- 3438
FMOD ----- 4.54747e-13
This will work if I change val to a float, as there is lower precision, but I need a double.
This also reproduces when I use atof, sscanf and strtod instead of sstream.
In C++, what is the best way to correctly convert a string to a double, and actually return an accurate value?
Thanks.
This is almost an exact duplicate of so many questions here - basically there is no exact representation of 34.38 in binary floating point, so your 34 + 19/50 is represented as a 34 + k/n where n is a power of two, and there is no exact power of two which has 50 as a factor, so there is no exact value of k possible.
If you set the output precision, you can see that the best double representation is not exact:
cout << fixed << setprecision ( 20 );
gives
char a -- 34.38
val ----- 34.38000000000000255795
modified val --- 3438.00000000000045474735
FMOD ----- 0.00000000000045474735
So in answer to your question, you are already using the best way to convert a string to a double (though boost lexical cast wraps up your two or three lines into one line, so might save you writing your own function). The result is due to the representation used by doubles, and would apply to any finite representation based on binary floating point.
With floats, the multiplication happens to be rounded down rather than up, so you happen to get an exact result. This is not behaviour you can depend on.
The "problem" here is simply that 34.38 cannot be exactly represented in double-precision floating point. You should read this article which describes why it's impossible to represent decimal values exactly in floating point.
If you were to examine "34.38 * 100" in hex (as per "format hex" in MATLAB for example), you'd see:
40aadc0000000001
Notice the final digit.