Convert double to integer mantissa and exponent - c++

I am trying to extract the mantissa and exponent parts from a double.
For the test value 0.15625, the expected mantissa and exponent are 5 and -5 respectively (5 * 2^-5).
double value = 0.15625;
int exp;
double mantissa = frexp(value, &exp);
Result: mantissa = 0.625 and exp = -2.
Here the mantissa returned is a fraction. For my use case (ASN.1 encoding), the mantissa should be an integer. I understand that by shifting the binary point of the mantissa to the right and adjusting the exponent accordingly, I can convert the binary fraction to an integer. In the example, 0.625 in base 10 is 0.101 in base 2, so the point needs to be shifted by 3 bits to get the integer 5. But I am finding it difficult to come up with a generic algorithm.
So my question is: how do I calculate the number of bits to shift to convert the binary fraction to an integer?

#include <cmath> // For frexp.
#include <iomanip> // For fixed and setprecision.
#include <iostream> // For cout.
#include <limits> // For properties of floating-point format.
int main(void)
{
    double value = 0.15625;

    // Separate value into significand in [.5, 1) and exponent.
    int exponent;
    double significand = std::frexp(value, &exponent);

    // Scale significand by number of digits in it, to produce an integer.
    significand = std::scalbn(significand, std::numeric_limits<double>::digits);

    // Adjust exponent to compensate for scaling.
    exponent -= std::numeric_limits<double>::digits;

    // Set stream to print significand in full.
    std::cout << std::fixed << std::setprecision(0);

    // Output triple with significand, base, and exponent.
    std::cout << "(" << significand << ", "
              << std::numeric_limits<double>::radix << ", " << exponent << ")\n";
}
Sample output:
(5629499534213120, 2, -55)
(If the value is zero, you might wish to force the exponent to zero, for aesthetic reasons. Mathematically, any exponent would be correct.)
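If you want the smallest integer mantissa instead, as in the (5, -5) form the question asks for, you can strip the trailing zero bits after scaling and bump the exponent accordingly. A minimal sketch building on the code above (assuming a positive input and that the scaled significand fits in a long long, which easily holds 53 bits):
#include <cmath>    // For std::frexp and std::scalbn.
#include <iostream> // For std::cout.
#include <limits>   // For properties of the floating-point format.
int main()
{
    double value = 0.15625; // Positive input assumed for simplicity.
    int exponent;
    double significand = std::frexp(value, &exponent);
    // Scale the fractional significand up to an integer, adjusting the exponent.
    long long mantissa =
        (long long)std::scalbn(significand, std::numeric_limits<double>::digits);
    exponent -= std::numeric_limits<double>::digits;
    // Strip trailing zero bits to get the smallest integer mantissa.
    while (mantissa != 0 && (mantissa & 1) == 0) {
        mantissa >>= 1;
        ++exponent;
    }
    std::cout << "(" << mantissa << ", 2, " << exponent << ")\n";
}
For value = 0.15625 this prints (5, 2, -5).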

Related

Why do I get such a precise floating-point number from std::cout?

The program
#include <climits>  // For LLONG_MAX.
#include <iostream>
int main()
{
    long long ll = LLONG_MAX;
    float f = ll;
    std::cout << ll << '\n';
    std::cout << std::fixed << f << '\n';
    return 0;
}
gives:
9223372036854775807
9223372036854775808.000000
How is this possible? If a 23-bit mantissa can hold a maximum value of only 8,388,607, why does cout output a 64-bit number?
You stored 2^63 - 1 in a float, which was rounded to 2^63 = 9223372036854775808. Powers of 2 are exactly representable.
The next exactly representable number above it is 2^63 + 2^40 = 9223373136366403584.
long long for you is a 64-bit data type, so LLONG_MAX has the value 2^63 - 1. You are right that this can't be stored exactly in a float, which has only 23 bits of mantissa, but 2^63, which is one more than LLONG_MAX, is easily stored in a float: a significand of 1.0 with an exponent of 63, and there you have it.
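To see those two neighbors directly, a small sketch (assuming IEEE 754 single precision) using std::nextafterf from <cmath>:
#include <cmath>   // For std::nextafterf.
#include <iomanip> // For std::setprecision.
#include <iostream>
#include <limits>
int main()
{
    float f = 9223372036854775808.0f; // 2^63, exactly representable.
    float next = std::nextafterf(f, std::numeric_limits<float>::infinity());
    std::cout << std::fixed << std::setprecision(0);
    std::cout << f << '\n';    // 9223372036854775808
    std::cout << next << '\n'; // 9223373136366403584, i.e. 2^63 + 2^40
}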

Why does typecasting a double to int seem to round it after the 16th digit?

//g++ 5.4.0
#include <iostream>
int main()
{
    std::cout << "Hello, world!\n";
    std::cout << (int)0.9999999999999999 << std::endl;  // 16 digits after decimal
    std::cout << (int)0.99999999999999999 << std::endl; // 17 digits after decimal
}
Output:
Hello, world!
0
1
Why does this happen?
The most accurate representation of 0.99999999999999999 is 1.0. 1)
The most accurate representation of 0.9999999999999999 is 0.999999999999999888977697537484.
1) In 64-bit IEEE 754 double-precision floating-point representation.
Since the conversion truncates rather than rounds, one gives 1 and the other gives 0 when converted to an integer type.
The most accurate floating point representation of the value 0.99999999999999999 (17 digits after the decimal) is exactly 1.0.
The most accurate floating point representation of the value 0.9999999999999999 (16 digits after the decimal) is less than 1.0.
The conversion to int truncates one to 0 and the other to 1.
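A quick way to check what the two literals actually store is to print them with extra precision; a minimal sketch:
#include <iomanip>  // For std::setprecision.
#include <iostream>
int main()
{
    // Print the stored values to 30 significant digits.
    std::cout << std::setprecision(30)
              << 0.9999999999999999  << '\n'  // 0.999999999999999888977697537484
              << 0.99999999999999999 << '\n'; // 1
}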

Losing precision with floating point numbers (double) in c++

I'm trying to assign a big double value to a variable and print it to the console. The number I supply is different from what is displayed as output. Is it possible to assign the double value and output it without loss of precision? Here is the C++ code:
#include <cstdlib>  // For EXIT_SUCCESS.
#include <iomanip>  // For std::fixed and std::setprecision.
#include <iostream>
#include <limits>
int main( int argc, char *argv[] ) {
    // turn off scientific notation on floating point numbers
    std::cout << std::fixed << std::setprecision( 3 );
    // maximum double value on my machine
    std::cout << std::numeric_limits<double>::max() << std::endl;
    // string representation of the double value I want to get
    std::cout << "123456789123456789123456789123456789.01" << std::endl;
    // value I supplied
    double d = 123456789123456789123456789123456789.01;
    // prints 123456789123456784102659645885120512.000 instead of 123456789123456789123456789123456789.01
    std::cout << d << std::endl;
    return EXIT_SUCCESS;
}
Could you please help me understand the problem?
C++ built-in floating-point types have finite precision. double is usually implemented as IEEE 754 double precision, meaning it has 53 bits of significand (the "value") precision, 11 bits of exponent, and 1 sign bit.
The number 123456789123456789123456789123456789 requires far more than 53 bits to represent, so a typical double cannot possibly represent it accurately. If you want such large numbers with perfect precision, you need to use some sort of "big number" library.
For more information on floating-point formats and their inaccuracies, you can read What Every Computer Scientist Should Know About Floating-Point Arithmetic.
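As one illustration of the "big number" route, GMP's C++ interface (gmpxx.h, which also appears in the last answer on this page) stores integers exactly. A minimal sketch that keeps just the integer part of the value:
#include <gmpxx.h>  // GMP C++ interface.
#include <iostream>
int main()
{
    // The integer part alone needs about 117 bits; GMP integers are exact.
    mpz_class exact("123456789123456789123456789123456789");
    std::cout << exact << '\n'; // printed back exactly
}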

conversion of double to string to double throws exception

The following code throws an std::out_of_range exception in Visual Studio 2013 where in my opinion it shouldn't:
#include <string>
#include <limits>
int main(int argc, char ** argv)
{
    double maxDbl = std::stod(std::to_string(std::numeric_limits<double>::max()));
    return 0;
}
I tested the code also with gcc 4.9.2 and there it does not throw an exception. The issue seems to be caused by an inaccurate string representation after the conversion to string. In Visual Studio std::to_string(std::numeric_limits<double>::max()) yields
179769313486231610000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.000000
which indeed seems too large. In gcc, however, it yields
179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
which seems to be smaller than the passed value.
However, isn't std::numeric_limits<double>::max() supposed to return the maximum finite representable floating-point number?
So why do the string representations differ? What am I missing here?
Direct answer
Gcc (and Clang and VS2015) correctly return the integer value of (2^1024 - 1) - (2^(1024-53) - 1), that is, the value represented by an all-ones 52-bit stored significand (53 significant bits counting the implicit leading 1) and an unbiased exponent of 1023 (2^1024 - 1 would be the integer with 1024 one bits; I just subtract all the bits below the 53 significant ones of the IEEE 754 format).
I can confirm that a large-integer library gives 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368L
The previous exactly representable floating-point value would be 2^971 less (971 = 1023 - 52), that is: 179769313486231550856124328384506240234343437157459335924404872448581845754556114388470639943126220321960804027157371570809852884964511743044087662767600909594331927728237078876188760579532563768698654064825262115771015791463983014857704008123419459386245141723703148097529108423358883457665451722744025579520L
The next value on that grid, 2^971 greater (exactly 2^1024, no longer representable as a finite double), would be:
179769313486231590772930519078902473361797697894230657273430081157732675805500963132708477322407536021120113879871393357658789768814416622492847430639474124377767893424865485276302219601246094119453082952085005768838150682342462881473913110540827237163350510684586298239947245938479716304835356329624224137216L
But the value produced by MSVC 2013 and earlier is near 2^1024 + 2^971, that is: 179769313486231610731333614426100589925524828262616317947942685512308090830973387504827396012048193870699768806228404251083258210739369062217227314575410731769485876273179688476358949112102859294830297395714877595371718127781702814782017661749531126051903195165027873311156314696040132728420308633064323416064L
As it is greater than any value representable in IEEE 754 double precision, it cannot be decoded back to a double.
At most, one could argue that any value between 2^1024 - 2^971 (std::numeric_limits<double>::max()) and 2^1024 could be rounded to std::numeric_limits<double>::max(), but values greater than 2^1024 are clearly an overflow.
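To check the 2^1024 - 2^971 identity numerically, std::ldexp can build max exactly without overflowing; a minimal sketch:
#include <cmath>   // For std::ldexp.
#include <iostream>
#include <limits>
int main()
{
    // max = (2 - 2^-52) * 2^1023 = 2^1024 - 2^971, built without overflow.
    double max = std::ldexp(2.0 - std::ldexp(1.0, -52), 1023);
    std::cout << std::boolalpha
              << (max == std::numeric_limits<double>::max()) << '\n'; // true
}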
Discussion on accuracy
Only about 16 decimal digits are accurate in a double; all the other printed digits can be seen as garbage or random values, since they do not depend on the value itself but only on the way you choose to compute them. Just try to subtract 1e+288 (already a big value) from maxDbl and look at what happens:
double maxLess = maxDbl - 1.e+288;
if (maxLess == maxDbl) {
    std::cout << "Unchanged" << std::endl;
} else {
    std::cout << "Changed" << std::endl;
}
You should see ... Unchanged. That is because 1e+288 is far below half an ulp of maxDbl (2^970, about 1e+292), so the subtraction rounds back to maxDbl.
It just looks like VS2013 is a little inconsistent in the way it rounds floating-point values: it rounded maxDbl upward, to one bit higher than the maximum actually representable value, and then could not decode that string later.
The problem is that the standard chose to use a %f format, which gives a false sense of accuracy. If you want to see an equivalent problem in gcc, just use:
#include <iostream>
#include <string>
#include <limits>
#include <iomanip>
#include <sstream>
int main() {
    double max = std::numeric_limits<double>::max();
    std::ostringstream ostr;
    ostr << std::setprecision(16) << max;
    std::string smax = ostr.str();
    std::cout << smax << std::endl;
    double m2 = std::stod(smax);
    std::cout << m2 << std::endl;
    return 0;
}
Rounded to 16 digits, maxDbl prints (correctly) as 1.797693134862316e+308, but that string can no longer be decoded back.
And this one :
#include <iostream>
#include <string>
#include <limits>
int main() {
    double maxDbl = std::numeric_limits<double>::max();
    std::string smax = std::to_string(maxDbl);
    std::cout << smax << std::endl;
    std::string smax2 = "179769313486231570800000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.000000";
    double max2 = std::stod(smax2);
    if (max2 == maxDbl) {
        std::cout << smax2 << " is same double as " << smax << std::endl;
    }
    return 0;
}
Displays :
179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
179769313486231570800000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.000000 is same double as 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
TL/DR: What I mean is that a big enough double value can of course be represented by an exact integer (per IEEE 754). But it also represents all the reals between halfway to the previous representable value and halfway to the next one. So any integer in that range could be an acceptable representation for the double, and the value rounded to 16 decimal digits should be acceptable; yet current standard libraries only decode the max floating-point value if it is truncated (not rounded up) at 16 decimal digits. VS2013, however, produced a number above the top of that range, which is in any case an error.
Reference
IEEE floating point on Wikipedia

How to extract base 10 mantissa and exponent using gmplib in C++

I need to extract the significand and exponent of a double in C++ using gmplib.
Ex: double a = 1.234;
I would like to extract 1234 as the significand and -3 as the exponent, so that a = 1234e-3. I heard that gmplib supports this kind of function, but I am not sure how to use the library.
Please share some sample code using this library.
It seems that you're looking for mpf_class::get_str(), which will break the floating-point value 1.234 into the string "1234" and the exponent 1, because 1.234 == 0.1234 * 10^1.
You will need to subtract the size of the string from that exponent to fit your requirements.
#include <iostream>
#include <string>
#include <gmpxx.h>
int main()
{
    double a = 1.125; // 1.234 cannot be stored in a double exactly, try "1.234"
    mpf_class f(a);
    mp_exp_t exp;
    std::string significand = f.get_str(exp);
    std::cout << "significand = " << significand
              << " exponent = " << exp - (int)significand.size() << '\n';
}
This prints
~ $ ./test
significand = 1125 exponent = -3
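(Note: when building against GMP's C++ interface, the link line typically needs both libraries, e.g. g++ test.cpp -lgmpxx -lgmp.)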