Why do I get such a precise floating-point number from std::cout? - c++

The program
#include <climits>
#include <iostream>

int main()
{
    long long ll = LLONG_MAX;
    float f = ll;
    std::cout << ll << '\n';
    std::cout << std::fixed << f << '\n';
    return 0;
}
gives:
9223372036854775807
9223372036854775808.000000
How is that possible? If a 23-bit mantissa can hold a maximum value of only 8,388,607, why does cout output a 64-bit number?

You stored 2^63 - 1 in a float, and it was rounded to 2^63 = 9223372036854775808, because powers of 2 are exactly representable.
The next larger number that is exactly representable is 2^63 + 2^40 = 9223373136366403584.
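A small sketch that makes the rounding visible by printing the float's neighbors with std::nextafter (the exact values assume a 32-bit IEEE 754 float):

#include <climits>
#include <cmath>
#include <iomanip>
#include <iostream>

int main()
{
    float f = LLONG_MAX;  // rounds to 2^63, the nearest representable float
    std::cout << std::fixed << std::setprecision(0)
              << f << '\n'                              // 9223372036854775808
              << std::nextafter(f, 0.0f) << '\n'        // 9223371487098961920
              << std::nextafter(f, HUGE_VALF) << '\n';  // 9223373136366403584
}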

long long on your platform is a 64-bit data type, so LLONG_MAX has the value 2^63 - 1. You are right that this can't be stored exactly in a float, which has only 23 mantissa bits, but 2^63, which is one more than LLONG_MAX, is easily stored in a float: it holds a significand of 1 and an exponent of 63 (i.e. 1.0 × 2^63), and there you have it.
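You can see that decomposition with std::frexp, with the caveat that frexp returns a fraction in [0.5, 1), so 2^63 comes back as 0.5 × 2^64:

#include <climits>
#include <cmath>
#include <iostream>

int main()
{
    float f = LLONG_MAX;  // rounds to 2^63
    int exp;
    float frac = std::frexp(f, &exp);
    std::cout << frac << " * 2^" << exp << '\n';  // 0.5 * 2^64, i.e. 1.0 * 2^63
}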

Related

Largest uint64 which can be accurately represented in a float in C/C++

I understand that floating-point precision has only so many bits. It comes as no surprise that the following code thinks that (float)(UINT64_MAX) and (float)(UINT64_MAX - 1) are equal. I am trying to write a function which would detect this type of, for lack of a proper term, "conversion overflow". I thought I could somehow use FLT_MAX but that's not correct. What's the right way to do this?
#include <iostream>
#include <cstdint>

int main()
{
    uint64_t x1(UINT64_MAX);
    uint64_t x2(UINT64_MAX - 1);
    float f1(static_cast<float>(x1));
    float f2(static_cast<float>(x2));
    std::cout << f1 << " == " << f2 << " = " << (f1 == f2) << std::endl;
    return 0;
}
What's the right way to do this?
When FLT_RADIX == 2, we are looking for a uint64_t of the form below, where n is the maximum number of bits encodable in a float value. This is usually 24; see FLT_MANT_DIG from <float.h>.
111...(total of n binary digits)...111000...(64-n bits, all zero)...000
With n == 24 that is 0xFFFFFF0000000000, in decimal 18446742974197923840, which can be computed as
~((1ull << (64 - FLT_MANT_DIG)) - 1)
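To answer the "detect conversion overflow" part, here is a minimal round-trip sketch (converts_exactly is a hypothetical helper name, and IEEE 754 binary32 float is assumed):

#include <cmath>
#include <cstdint>
#include <iostream>

// True if x survives the round trip uint64_t -> float -> uint64_t.
bool converts_exactly(std::uint64_t x)
{
    float f = static_cast<float>(x);
    // Converting a float >= 2^64 back to uint64_t is undefined behavior,
    // so first reject values that rounded up to 2^64.
    if (f >= std::ldexp(1.0f, 64))
        return false;
    return static_cast<std::uint64_t>(f) == x;
}

int main()
{
    std::cout << std::boolalpha
              << converts_exactly(0xFFFFFF0000000000ull) << '\n'  // true
              << converts_exactly(UINT64_MAX) << '\n';            // false
}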
The following function gives you the highest integer exactly representable in a floating-point type such that all smaller positive integers are also exactly representable:
#include <cmath>
#include <limits>

template <typename T>
T max_representable_integer()
{
    return std::scalbn(T(1.0), std::numeric_limits<T>::digits);
}
It does the computation in the floating-point type because for some types the result may not be representable in a uint64_t.
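A possible usage sketch, assuming IEEE 754 float and double (the printed values follow from std::numeric_limits<T>::digits being 24 and 53):

#include <cmath>
#include <iomanip>
#include <iostream>
#include <limits>

template <typename T>
T max_representable_integer()
{
    return std::scalbn(T(1.0), std::numeric_limits<T>::digits);
}

int main()
{
    std::cout << std::fixed << std::setprecision(0)
              << max_representable_integer<float>() << '\n'    // 16777216 (2^24)
              << max_representable_integer<double>() << '\n';  // 9007199254740992 (2^53)
}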

Convert double to integer mantissa and exponents

I am trying to extract the mantissa and exponent parts from a double.
For the test value 0.15625, the expected mantissa and exponent are 5 and -5 respectively (5 × 2^-5).
double value = 0.15625;
int exp;
double mantissa = frexp(value, &exp);
Result: mantissa = 0.625 and exp = -2.
Here the mantissa returned is a fraction. For my use case (ASN.1 encoding), the mantissa should be an integer. I understand that by right-shifting the mantissa and adjusting the exponent, I can convert the binary fraction to an integer. In the example, 0.625 base 10 is 0.101 base 2, so 3 bits have to be shifted to get an integer. But I am finding it difficult to find a generic algorithm.
So my question is: how do I calculate the number of bits to be shifted to convert a binary fraction to an integer?
#include <cmath> // For frexp.
#include <iomanip> // For fixed and setprecision.
#include <iostream> // For cout.
#include <limits> // For properties of floating-point format.
int main(void)
{
    double value = 0.15625;

    // Separate value into significand in [.5, 1) and exponent.
    int exponent;
    double significand = std::frexp(value, &exponent);

    // Scale significand by the number of digits in it, to produce an integer.
    significand = std::scalbn(significand, std::numeric_limits<double>::digits);

    // Adjust exponent to compensate for the scaling.
    exponent -= std::numeric_limits<double>::digits;

    // Set stream to print significand in full.
    std::cout << std::fixed << std::setprecision(0);

    // Output triple with significand, base, and exponent.
    std::cout << "(" << significand << ", "
              << std::numeric_limits<double>::radix << ", " << exponent << ")\n";
}
Sample output:
(5629499534213120, 2, -55)
(If the value is zero, you might wish to force the exponent to zero, for aesthetic reasons. Mathematically, any exponent would be correct.)
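The question asked for the minimal form (5 and -5). A hedged follow-up sketch that strips trailing zero bits from the integer significand to reach that form (it assumes the significand fits in int64_t, which holds for any finite double's 53-bit significand):

#include <cmath>
#include <cstdint>
#include <iostream>
#include <limits>

int main()
{
    double value = 0.15625;

    int exponent;
    double fraction = std::frexp(value, &exponent);

    // Scale the fraction up to a 53-bit integer significand.
    auto significand = static_cast<std::int64_t>(
        std::scalbn(fraction, std::numeric_limits<double>::digits));
    exponent -= std::numeric_limits<double>::digits;

    // Strip trailing zero bits so the significand is odd (minimal form).
    while (significand != 0 && (significand & 1) == 0) {
        significand >>= 1;
        ++exponent;
    }

    std::cout << significand << " * 2^" << exponent << '\n';  // 5 * 2^-5
}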

Why (int)pow(2, 32) == -2147483648

On the Internet I found the following problem:
int a = (int)pow(2, 32);
cout << a;
What does it print on the screen?
At first I thought the result would be 0,
but after I wrote the code and executed it, I got -2147483648. Why?
Also I noticed that even (int)(pow(2, 32) - pow(2, 31)) equals -2147483648.
Can anyone explain why (int)pow(2, 32) equals -2147483648?
Assuming int is 32 bits (or less) on your machine, this is undefined behavior.
From the standard, conv.fpint:
A prvalue of a floating-point type can be converted to a prvalue of an integer type. The conversion truncates; that is, the fractional part is discarded. The behavior is undefined if the truncated value cannot be represented in the destination type.
Most commonly int is 32 bits, and it can represent values in the interval [-2^31, 2^31 - 1], which is [-2147483648, 2147483647]. The result of std::pow(2, 32) is a double that represents the exact value 2^32. Since 2^32 exceeds the range that can be represented by int, the conversion is undefined behavior, which means the result could be anything.
The same goes for your second example: pow(2, 32) - pow(2, 31) is simply the double representation of 2^31, which (just barely) exceeds the range that can be represented by a 32-bit int.
The correct way to do this would be to convert to a large enough integral type, e.g. int64_t:
std::cout << static_cast<int64_t>(std::pow(2, 32)) << "\n"; // prints 4294967296
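If the result must stay in an int, a minimal range-check sketch before the cast (a hedged alternative; both int bounds convert to double exactly, so the comparisons are safe):

#include <cmath>
#include <iostream>
#include <limits>

int main()
{
    double d = std::pow(2, 32);

    // Guard the conversion instead of invoking undefined behavior.
    if (d >= std::numeric_limits<int>::min() &&
        d <= std::numeric_limits<int>::max()) {
        std::cout << static_cast<int>(d) << '\n';
    } else {
        std::cout << "out of range for int: " << d << '\n';
    }
}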
The behavior you are seeing relates to the use of two's complement to represent signed integers. For 3-bit numbers the values range over [-4, 3]; for 32-bit numbers they range from -(2^31) to 2^31 - 1, i.e. -2147483648 to 2147483647.
This happens because the result of the operation overflows the int data type by exceeding its maximum value, so don't cast to int; cast to long instead (a 64-bit type on this platform):
#include <iostream>
#include <cmath>
#include <climits>
using namespace std;
int main() {
    cout << (int)pow(2, 32) << endl;
    // 2147483647 here, but this is undefined behavior; the question's platform printed -2147483648
    cout << INT_MIN << endl;
    // -2147483648
    cout << INT_MAX << endl;
    // 2147483647
    cout << (long)pow(2, 32) << endl;
    // 4294967296
    cout << LONG_MIN << endl;
    // -9223372036854775808
    cout << LONG_MAX << endl;
    // 9223372036854775807
    return 0;
}

c++: a program to find the average of very high numbers?

So I'm trying to make a C++ program that can find the average of very large numbers (the range was < 10^19).
Here's my attempt:
#include <iostream>
int main()
{
long double a,b,result;
std::cin>>a;
std::cin>>b;
result=(a+b)/2;
std::cout<<result<<"\n";
}
But somehow I did not get the result I expected. My teacher said there was a "trick" and that there was no need to even use double, but I searched and researched and did not find the trick. Any help?
When using floating-point numbers you have to consider their precision. In base 10 it is given by std::numeric_limits<T>::digits10, and the following program prints those values (they may depend on your platform):
#include <iostream>
#include <limits>
int main() {
    std::cout << "float: " << std::numeric_limits<float>::digits10 << "\n";
    std::cout << "double: " << std::numeric_limits<double>::digits10 << "\n";
    std::cout << "long double: " << std::numeric_limits<long double>::digits10 << "\n";
    return 0;
}
On ideone I get:
float: 6
double: 15
long double: 18
Which is consistent with 32-bit, 64-bit, and 80-bit floating-point numbers, respectively.
Since 10^19 is above 18 digits (it has 20), the type you have chosen lacks the necessary precision to represent all numbers below it, and no amount of computation can recover the lost data.
Let's switch back to integral types: while their range is more limited, they have a higher degree of precision for the same number of bits. A 64-bit signed integer has a maximum of 9,223,372,036,854,775,807 and the unsigned version goes up to 18,446,744,073,709,551,615. For comparison, 10^19 is 10,000,000,000,000,000,000.
A uint64_t (from <cstdint>) gives you the necessary building block; however, you'll be teetering on the edge of overflow: 2 times 10^19 is too much.
You now have to find a way to compute the average without adding the two numbers together.
For two integers M and N with M <= N: (M + N) / 2 = M + (N - M) / 2, and the right-hand side never overflows, because N - M is no larger than N.
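A minimal sketch of that trick with unsigned 64-bit integers (the function name is illustrative):

#include <cstdint>
#include <iostream>
#include <utility>

// Overflow-free average, using (M + N) / 2 == M + (N - M) / 2 for M <= N.
std::uint64_t average(std::uint64_t m, std::uint64_t n)
{
    if (m > n)
        std::swap(m, n);      // ensure m <= n
    return m + (n - m) / 2;   // n - m cannot overflow
}

int main()
{
    std::uint64_t a = 18446744073709551615ull;  // UINT64_MAX
    std::uint64_t b = 18446744073709551613ull;
    std::cout << average(a, b) << '\n';         // 18446744073709551614
}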

Precision loss of 1 when converting double to float

For the program below I am getting a precision loss of 1, which I am unable to understand. Need help.
#include <iostream>
#include <limits>
using namespace std;

int main()
{
    typedef std::numeric_limits<double> dbl;
    cout.precision(dbl::digits10);
    double x = -53686781.0;
    float xFloat = (float) x;
    cout << "x :: " << x << "\n";
    cout << "xFloat :: " << xFloat << "\n";
}
Output:
x :: -53686781
xFloat :: -53686780
53686781 looks like this in binary: 11001100110011000111111101. That's 26 bits.
Your float can only store 24 bits in its significand portion, so you end up with 110011001100110001111111 stored in it. The last two binary digits, 01, are rounded away (round-to-nearest rounds down here, so it looks like truncation).
And 11001100110011000111111100 is 53686780.
As simple as that.
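A small verification sketch that reproduces the dropped bits by hand (the mask mimics the round-down that happens in this particular case):

#include <bitset>
#include <cstdint>
#include <iostream>

int main()
{
    std::uint32_t v = 53686781;                  // 26 significant bits
    std::cout << std::bitset<26>(v) << '\n';     // 11001100110011000111111101

    // Keep only the top 24 bits, as a float's significand would:
    std::uint32_t kept = v & ~std::uint32_t{3};  // clear the low 2 bits
    std::cout << std::bitset<26>(kept) << '\n'   // 11001100110011000111111100
              << kept << '\n';                   // 53686780
}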
For normal floats p = 24 (23 explicitly stored mantissa bits plus an implicit leading bit), which gives about 7 decimal digits of precision, as already mentioned. Double has p = 53, which gives about 15-16 decimal digits.
The wiki page is actually pretty good.