How to detect double precision floating point overflow and underflow? - c++

I have the following variables:
double dblVar1;
double dblVar2;
Their values may be large, but they are always less than the maximum double.
I perform various arithmetic operations on these variables, such as addition, multiplication and power:
double dblVar3 = dblVar1 * dblVar2;
double dblVar4 = dblVar1 + dblVar2;
double dblVar5 = pow(dblVar1, 2);
For all of the above, I need to detect overflow and underflow. How can I achieve this in C++?

A lot depends on context. To be perfectly portable, you have to
check before the operation, e.g. (for addition):
if ( (a < 0.0) == (b < 0.0)
&& std::abs( b ) > std::numeric_limits<double>::max() - std::abs( a ) ) {
// Addition would overflow...
}
Similar logic can be used for the four basic operators.
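As a rough illustration of how the same pre-check idea carries over to multiplication, here is a minimal sketch. The helper name mul_would_overflow is hypothetical (it is not from the answer above), it assumes both inputs are finite, and because the division itself rounds, the answer may be off right at the boundary.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: true if a * b would exceed the largest finite double.
// Assumes both inputs are finite.
bool mul_would_overflow(double a, double b) {
    const double abs_a = std::abs(a);
    const double abs_b = std::abs(b);
    // With |a| <= 1 the product cannot be larger in magnitude than |b|.
    if (abs_a <= 1.0) return false;
    // |a| * |b| > max  <=>  |b| > max / |a|.
    // The division cannot overflow because |a| > 1.
    return abs_b > std::numeric_limits<double>::max() / abs_a;
}
```

The check runs before the multiplication, so it does not depend on what the hardware does when the product is out of range.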
If all of the machines you target support IEEE (which is
probably the case if you don't have to consider mainframes), you
can just do the operations, then use isfinite or isinf on
the results.
For underflow, the first question is whether a gradual underflow
counts as underflow or not. If not, then simply checking if the
results are zero and a != -b would do the trick. If you want
to detect gradual underflow (which is probably only present if
you have IEEE), then you can use isnormal—this will
return false if the results correspond to gradual underflow.
(Unlike overflow, you test for underflow after the operation.)
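A minimal sketch of the after-the-fact checks described above, for a multiplication. It assumes IEEE arithmetic and finite inputs; the enum FpStatus and the function classify_product are names made up for this illustration.

```cpp
#include <cmath>

// Hypothetical result categories for this sketch.
enum class FpStatus { Ok, Overflow, GradualUnderflow, UnderflowToZero };

// Performs a * b, then inspects the result.
// Assumes IEEE arithmetic and finite inputs.
FpStatus classify_product(double a, double b) {
    const double r = a * b;
    // Overflow: finite inputs produced an infinity.
    if (std::isinf(r)) return FpStatus::Overflow;
    // Complete underflow: a zero result from two nonzero factors.
    if (r == 0.0 && a != 0.0 && b != 0.0) return FpStatus::UnderflowToZero;
    // Gradual underflow: the result is a subnormal (denormalized) number.
    if (std::fpclassify(r) == FP_SUBNORMAL) return FpStatus::GradualUnderflow;
    return FpStatus::Ok;
}
```

For addition, the zero test would compare a against -b instead of checking that both operands are nonzero.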

POSIX, C99, C++11 have <fenv.h> (and <cfenv> for C++11) which have functions to test the IEEE754 exceptions flags (which have nothing to do with C++ exceptions, it would be too easy):
int feclearexcept(int);
int fegetexceptflag(fexcept_t *, int);
int feraiseexcept(int);
int fesetexceptflag(const fexcept_t *, int);
int fetestexcept(int);
The flag is a bitfield with the following bits defined:
FE_DIVBYZERO
FE_INEXACT
FE_INVALID
FE_OVERFLOW
FE_UNDERFLOW
So you can clear them before the operations and then test them after. You'll have to check the documentation for the effect of library functions on them.

With a compiler that supports C++11 or later, you can use these functions:
#include <cfenv>
#include <iostream>
int main() {
std::feclearexcept(FE_OVERFLOW);
std::feclearexcept(FE_UNDERFLOW);
double overflowing_var = 1000;
double underflowing_var = 0.01;
std::cout << "Overflow flag before: " << (bool)std::fetestexcept(FE_OVERFLOW) << std::endl;
std::cout << "Underflow flag before: " << (bool)std::fetestexcept(FE_UNDERFLOW) << std::endl;
for(int i = 0; i < 20; ++i) {
overflowing_var *= overflowing_var;
underflowing_var *= underflowing_var;
}
std::cout << "Overflow flag after: " << (bool)std::fetestexcept(FE_OVERFLOW) << std::endl;
std::cout << "Underflow flag after: " << (bool)std::fetestexcept(FE_UNDERFLOW) << std::endl;
}
/** Output:
Overflow flag before: 0
Underflow flag before: 0
Overflow flag after: 1
Underflow flag after: 1
*/

ISO C99 defines functions to query and manipulate the floating-point status word. You can use these functions to check for untrapped exceptions when it's convenient, rather than worrying about them in the middle of a calculation.
It provides
FE_INEXACT
FE_DIVBYZERO
FE_UNDERFLOW
FE_OVERFLOW
FE_INVALID
For example
{
double f;
int raised;
feclearexcept (FE_ALL_EXCEPT);
f = compute ();
raised = fetestexcept (FE_OVERFLOW | FE_INVALID);
if (raised & FE_OVERFLOW) { /* ... */ }
if (raised & FE_INVALID) { /* ... */ }
/* ... */
}
http://www.gnu.org/software/libc/manual/html_node/Status-bit-operations.html

Related

How can I convert an integer to float with rounding towards zero?

When an integer is converted to floating-point, and the value cannot be directly represented by the destination type, the nearest value is usually selected (required by IEEE-754).
I would like to convert an integer to floating-point with rounding towards zero in case the integer value cannot be directly represented by the floating-point type.
Example:
int i = 2147483647;
float nearest = static_cast<float>(i); // 2147483648 (likely)
float towards_zero = convert(i); // 2147483520
Since C++11, one can use fesetround(), the floating-point environment rounding direction manager. There are four standard rounding directions and an implementation is permitted to add additional rounding directions.
#include <cfenv> // for fesetround() and FE_* macros
#include <iostream> // for cout and endl
#include <iomanip> // for setprecision()
#pragma STDC FENV_ACCESS ON
int main(){
int i = 2147483647;
std::cout << std::setprecision(10);
std::fesetround(FE_DOWNWARD);
std::cout << "round down " << i << " : " << static_cast<float>(i) << std::endl;
std::cout << "round down " << -i << " : " << static_cast<float>(-i) << std::endl;
std::fesetround(FE_TONEAREST);
std::cout << "round to nearest " << i << " : " << static_cast<float>(i) << std::endl;
std::cout << "round to nearest " << -i << " : " << static_cast<float>(-i) << std::endl;
std::fesetround(FE_TOWARDZERO);
std::cout << "round toward zero " << i << " : " << static_cast<float>(i) << std::endl;
std::cout << "round toward zero " << -i << " : " << static_cast<float>(-i) << std::endl;
std::fesetround(FE_UPWARD);
std::cout << "round up " << i << " : " << static_cast<float>(i) << std::endl;
std::cout << "round up " << -i << " : " << static_cast<float>(-i) << std::endl;
return(0);
}
Compiled under g++ 7.5.0, the resulting executable outputs
round down 2147483647 : 2147483520
round down -2147483647 : -2147483648
round to nearest 2147483647 : 2147483648
round to nearest -2147483647 : -2147483648
round toward zero 2147483647 : 2147483520
round toward zero -2147483647 : -2147483520
round up 2147483647 : 2147483648
round up -2147483647 : -2147483520
Omitting the #pragma doesn't seem to change anything under g++.
@chux comments correctly that the standard doesn't explicitly state that fesetround() affects rounding in static_cast<float>(i). For a guarantee that the set rounding direction affects the conversion, use std::nearbyint and its -f and -l variants. See also std::rint and its many type-specific variants.
I probably should have looked up the format specifier to use a space for positive integers and floats, rather than stuffing it into the preceding string constants.
(I haven't tested the following snippet.) Your convert() function would be something like
float convert(int i, int direction = FE_TOWARDZERO){
float retVal = 0.;
int prevdirection = std::fegetround();
std::fesetround(direction);
retVal = static_cast<float>(i);
std::fesetround(prevdirection);
return(retVal);
}
You can use std::nextafter.
int i = 2147483647;
float nearest = static_cast<float>(i); // 2147483648 (likely)
float towards_zero = std::nextafter(nearest, 0.f); // 2147483520
But you have to check whether static_cast<float>(i) is exact; if it is, nextafter would still go one step towards 0, which you probably don't want.
Your convert function might look like this:
float convert(int x){
if(std::abs(long(static_cast<float>(x))) <= std::abs(long(x)))
return static_cast<float>(x);
return std::nextafter(static_cast<float>(x), 0.f);
}
It may be that sizeof(int)==sizeof(long) or even sizeof(int)==sizeof(long long). In that case, long(...) has undefined behavior when static_cast<float>(x) is outside the range of long. Depending on the compiler, it might still work in these cases.
I understand the question to be restricted to platforms that use IEEE-754 binary floating-point arithmetic, and where float maps to IEEE-754 (2008) binary32. This answer assumes this to be the case.
As other answers have pointed out, if the tool chain and the platform supports this, use the facilities supplied by fenv.h to set the rounding mode for the conversion as desired.
Where those are not available, or slow, it is not difficult to emulate the truncation during int to float conversion. Basically, normalize the integer until the most significant bit is set, recording the required shift count. Now, shift the normalized integer into place to form the mantissa, compute the exponent based on the normalization shift count, and add in the sign bit based on the sign of the original integer. The process of normalization can be sped up significantly if a clz (count leading zeros) primitive is available, maybe as an intrinsic.
The exhaustively tested code below demonstrates this approach for 32-bit integers, see function int32_to_float_rz. I successfully built it as both C and C++ code with the Intel compiler version 13.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <fenv.h>
float int32_to_float_rz (int32_t a)
{
uint32_t i = (uint32_t)a;
int shift = 0;
float r;
// take absolute value of integer
if (a < 0) i = 0 - i;
// normalize integer so MSB is set
if (!(i > 0x0000ffffU)) { i <<= 16; shift += 16; }
if (!(i > 0x00ffffffU)) { i <<= 8; shift += 8; }
if (!(i > 0x0fffffffU)) { i <<= 4; shift += 4; }
if (!(i > 0x3fffffffU)) { i <<= 2; shift += 2; }
if (!(i > 0x7fffffffU)) { i <<= 1; shift += 1; }
// form mantissa with explicit integer bit
i = i >> 8;
// add in exponent, taking into account integer bit of mantissa
if (a != 0) i += (127 + 31 - 1 - shift) << 23;
// add in sign bit
if (a < 0) i |= 0x80000000;
// reinterpret bit pattern as 'float'
memcpy (&r, &i, sizeof r);
return r;
}
#pragma STDC FENV_ACCESS ON
float int32_to_float_rz_ref (int32_t a)
{
float r;
int orig_mode = fegetround ();
fesetround (FE_TOWARDZERO);
r = (float)a;
fesetround (orig_mode);
return r;
}
int main (void)
{
int32_t arg;
float res, ref;
arg = 0;
do {
res = int32_to_float_rz (arg);
ref = int32_to_float_rz_ref (arg);
if (res != ref) {
printf ("error # %08x: res=% 14.6a ref=% 14.6a\n", arg, res, ref);
return EXIT_FAILURE;
}
arg++;
} while (arg);
return EXIT_SUCCESS;
}
A C implementation dependent solution that I am confident has a C++ counterpart.
Temporarily change the rounding mode; the conversion uses it to determine which way to go in inexact cases.
"the nearest value is usually selected (required by IEEE-754)."
is not entirely accurate. The inexact case is rounding mode dependent.
C does not specify this behavior. C allows this behavior, as it is implementation-defined.
If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner.
#include <fenv.h>
float convert(int i) {
#pragma STDC FENV_ACCESS ON
int save_round = fegetround();
fesetround(FE_TOWARDZERO);
float f = (float) i;
fesetround(save_round);
return f;
}
A specified approach.
"the nearest value is usually selected (required by IEEE-754)" implies OP expects IEEE-754 is involved. Many C/C++ implementations do follow much of IEEE-754, yet adherence to that spec is not required. The following relies on C specifications.
Conversion of an integer type to a floating point type is specified as below. Notice conversion is not specified to depend on rounding mode.
When a value of integer type is converted to a real floating type, if the value being converted can be represented exactly in the new type, it is unchanged. If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner. C17dr § 6.3.1.4 2
When the result is not exact, is the converted value the nearest higher or the nearest lower?
A round trip int --> float --> int is warranted.
Round tripping needs to watch out for convert(near_INT_MAX) converting to outside the int range.
Rather than rely on long or long long having a wider range than int (C does not specify this property), let code compare on the negative side as INT_MIN (with 2's complement) can be expected to convert exactly to a float.
float convert(int i) {
int n = (i < 0) ? i : -i; // n <= 0
float f = (float) n;
int rt_n = (int) f; // Overflow not expected on the negative side
// If f rounded away from 0.0 ...
if (rt_n < n) {
f = nextafterf(f, 0.0); // Move toward 0.0
}
return (i < 0) ? f : -f;
}
Changing the rounding mode is somewhat expensive, although I think some modern x86 CPUs do rename MXCSR so it doesn't have to drain the out-of-order execution back-end.
If you care about performance, benchmarking njuffa's pure integer version (using shift = __builtin_clz(i); i<<=shift;) against the rounding-mode-changing version would make sense. (Make sure to test in the context you want to use it in; it's so small that it matters how well it overlaps with surrounding code.)
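For reference, a minimal sketch of what the clz-based normalization could look like. It assumes GCC or Clang (for __builtin_clz), a 32-bit unsigned int, and that float is IEEE-754 binary32; the function name int32_to_float_rz_clz is made up here to distinguish it from the loop-free version above.

```cpp
#include <cstdint>
#include <cstring>

// Sketch: int32 -> float with truncation (round toward zero).
// Assumes GCC/Clang (__builtin_clz) and IEEE-754 binary32 float.
float int32_to_float_rz_clz(std::int32_t a) {
    if (a == 0) return 0.0f;  // __builtin_clz(0) is undefined
    // magnitude as unsigned, which also handles INT32_MIN
    std::uint32_t mag = (a < 0) ? 0u - static_cast<std::uint32_t>(a)
                                : static_cast<std::uint32_t>(a);
    // normalize so the most significant bit is set
    const int shift = __builtin_clz(mag);
    mag <<= shift;
    // keep the top 24 bits as mantissa; the dropped bits are the truncation
    std::uint32_t bits = mag >> 8;
    // add the exponent; the explicit integer bit carries into the exponent field
    bits += static_cast<std::uint32_t>(127 + 31 - 1 - shift) << 23;
    // add the sign bit
    if (a < 0) bits |= 0x80000000u;
    float r;
    std::memcpy(&r, &bits, sizeof r);  // reinterpret bit pattern as float
    return r;
}
```

Whether this beats changing the rounding mode is something only a benchmark in the real calling context can tell.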
AVX-512 can use rounding-mode overrides on a per-instruction basis, letting you use a custom rounding mode for the conversion basically the same cost as a normal int->float. (Only available on Intel Skylake-server, and Ice Lake CPUs so far, unfortunately.)
#include <immintrin.h>
float int_to_float_trunc_avx512f(int a) {
const __m128 zero = _mm_setzero_ps(); // SSE scalar int->float are badly designed to merge into another vector, instead of zero-extend. Short-sighted Pentium-3 decision never changed for AVX or AVX512
__m128 v = _mm_cvt_roundsi32_ss (zero, a, _MM_FROUND_TO_ZERO |_MM_FROUND_NO_EXC);
return _mm_cvtss_f32(v); // the low element of a vector already is a scalar float so this is free.
}
_mm_cvt_roundi32_ss is a synonym, IDK why Intel defined both i and si names, or if some compilers might only have one.
This compiles efficiently with all 4 mainstream x86 compilers (GCC/clang/MSVC/ICC) on the Godbolt compiler explorer.
# gcc10.2 -O3 -march=skylake-avx512
int_to_float_trunc_avx512f:
vxorps xmm0, xmm0, xmm0
vcvtsi2ss xmm0, xmm0, {rz-sae}, edi
ret
int_to_float_plain:
vxorps xmm0, xmm0, xmm0 # GCC is always cautious about false dependencies, spending an extra instruction to break it, like we did with setzero()
vcvtsi2ss xmm0, xmm0, edi
ret
In a loop, the same zeroed register can be reused as a merge target, allowing the vxorps zeroing to be hoisted out of a loop.
Using _mm_undefined_ps() instead of _mm_setzero_ps(), we can get ICC to skip zeroing XMM0 before converting into it, like clang does for plain (float)i in this case. But ironically, clang which is normally cavalier and reckless about false dependencies compiles _mm_undefined_ps() the same as setzero in this case.
The performance in practice of vcvtsi2ss (scalar integer to scalar single-precision float) is the same whether you use a rounding-mode override or not (2 uops on Ice Lake, same latency: https://uops.info/). The AVX-512 EVEX encoding is 2 bytes longer than the AVX1.
Rounding mode overrides also suppress FP exceptions (like "inexact"), so you couldn't check the FP environment to later detect if the conversion happened to be exact (no rounding). But in this case, converting back to int and comparing would be fine. (You can do that without risk of overflow because of the rounding towards 0).
A simple solution is to use a higher precision floating point for comparison. As long as the high precision floating point can exactly represent all integers, we can accurately compare whether the float result was greater.
double should be sufficient with 32 bit integers, and long double is sufficient for 64 bit on most systems, but it's good practice to verify it.
float convert(int x) {
static_assert(std::numeric_limits<double>::digits
>= sizeof(int) * CHAR_BIT);
float f = x;
double d = x;
return std::abs(f) > std::abs(d)
? std::nextafter(f, 0.f)
: f;
}
Shift the integer right by an arithmetic shift until the number of bits agrees with the precision of the floating point arithmetic. Count the shifts.
Convert the integer to float. The result is now precise.
Multiply the resulting float by a power of two corresponding to the number of shifts.
For nonnegative values, this can be done by taking the integer value and shifting right until the highest set bit is less than 24 bits (i.e. the precision of IEEE single) from the right, then shifting back.
For negative values, an arithmetic right shift rounds toward negative infinity rather than toward zero, so apply the same shifts to the magnitude instead: take the absolute value as an unsigned (which also handles INT_MIN without overflow), shift right and back as above, then negate the resulting float.
Note that this already relies on implementation-defined behavior, as we're assuming float is IEEE754 and int is a 32-bit two's complement type.
float round_to_zero(int x)
{
// work on the magnitude so that the truncating shifts round toward zero
unsigned mag = (x < 0) ? 0u - (unsigned)x : (unsigned)x;
int cnt = 0;
// shift right until at most 24 significant bits remain
while (mag != (mag & 0xffffffu)) {
mag >>= 1;
cnt++;
}
// after shifting back, at most 24 significant bits remain, so the conversion is exact
float f = (float)(mag << cnt);
return (x < 0) ? -f : f;
}

Why does std::round(sin(pi/6)) not equal 1?

The cppreference documentation states that std::round will specifically round away from zero in "halfway cases." While this is true with the literal 0.5, it's not true with std::sin(pi/6). I thought this might be a floating point error, so I printed the value but it's exactly 0.5. After inspecting the binary representation however, I can see that they are indeed represented differently. I've provided the code I used to make these inspections below.
#include <iostream>
#include <stdio.h>
#include <cmath>
int main(int argc, char * argv[])
{
double const pi = std::acos(-1);
double const a = std::sin(pi/6);
double const b = 0.5;
std::cout << "round(" << a << ") = " << std::round(a) << "\n";
auto pa = reinterpret_cast<const unsigned char *>(&a);
auto pb = reinterpret_cast<const unsigned char *>(&b);
std::cout << "a = 0x";
for (size_t i = 0; i != sizeof(double); ++i) {
printf("%02x", pa[i]);
}
std::cout << "\nb = 0x";
for (size_t i = 0; i != sizeof(double); ++i) {
printf("%02x", pb[i]);
}
std::cout << "\n";
}
round(0.5) = 0
round(0.5) = 1
a = 0xffffffffffffdf3f
b = 0x000000000000e03f
So my question is: is this rounding behavior part of the C++ specification, or is it a bug? And in any case, is there some general way that I can "correct" the representation of the value returned by sin? I'm not sure what format it's in because, based on what I know of IEEE-754, it looks like it should be NaN. Although from what I understand, C++ doesn't guarantee IEEE-754 floating point representation?
The issue is that you're not printing the value with enough significant digits. When I increase precision with std::setprecision(20), I get: round(0.49999999999999994449) = 0.
You can see this for yourself by either changing the code or entering 3fdfffffffffffff into the bottom Hexadecimal field of this online calculator: https://baseconvert.com/ieee-754-floating-point
The representation looks like NaN because you're reading it backwards. x86/x64 have little-endian floating point numbers. So you should read it from the high address to the low address, yielding 0x3fdfffff..., which is of course slightly less than 0.5.
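One way to sidestep the byte-order confusion is to copy the double into a 64-bit integer and print that as a single number. A minimal sketch, assuming double is a 64-bit IEEE-754 type and that integers and doubles share the same byte order (true on x86/x64); the helper name bits_of is made up for this example.

```cpp
#include <cstdint>
#include <cstring>

// Returns the bit pattern of a double as one 64-bit integer,
// so it reads most significant digit first regardless of byte order.
std::uint64_t bits_of(double value) {
    static_assert(sizeof(std::uint64_t) == sizeof(double),
                  "this sketch expects a 64-bit double");
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}
```

Printed with std::hex, bits_of(0.5) gives 3fe0000000000000, while the largest double below 0.5 gives 3fdfffffffffffff, matching the two values in the question.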

How to get the coefficient from a std::decimal?

Background
I want to write an is_even( decimal::decimal64 d ) function that returns true if the least-significant digit is even.
Unfortunately, I can't seem to find any methods to extract the coefficient from a decimal64.
Code
#include <iostream>
#include <decimal/decimal>
using namespace std;
static bool is_even( decimal::decimal64 d )
{
return true; // fix this - want to: return coefficient(d)%2==0;
}
int main()
{
auto d1 = decimal::make_decimal64( 60817ull, -4 ); // not even
auto d2 = decimal::make_decimal64( 60816ull, -4 ); // is even
cout << decimal64_to_float( d1 ) << " " << is_even( d1 ) << endl;
cout << decimal64_to_float( d2 ) << " " << is_even( d2 ) << endl;
return 0;
}
It's a little odd that there's no provided function to recover the coefficient of a decimal; but you can just multiply by 10 raised to its negative exponent:
bool is_even(decimal::decimal64 d)
{
auto q = quantexpd64(d);
auto coeff = static_cast<long long>(d * decimal::make_decimal64(1, -q));
return coeff % 2 == 0;
}
assert(!is_even(decimal::make_decimal64(60817ull, -4)));
assert(is_even(decimal::make_decimal64(60816ull, -4)));
I would use corresponding fmod function if possible.
static bool is_even( decimal::decimal64 d )
{
auto e = quantexpd64(d);
auto divisor = decimal::make_decimal64(2, e);
return decimal::fmodd64(d, divisor) == decimal::make_decimal64(0,0);
}
It constructs a divisor that is 2*10^e where e is exponent of the tested value. Then it performs fmod and checks whether it is equal to a decimal 0. (NOTE: operator== for decimal is said to be IEEE 754-2008 conformant so we don't need to take care of -0.0).
An alternative would be to multiply the number by 10^-e (to "normalize" it), cast it to an integer type and check the modulo in the traditional way. I think this is @ecatmur's proposal. However, the "normalization" might fail if the value goes out of the bounds of the chosen integer type.
I think fmod is better when it comes to overflows. You are guaranteed to be able to hold 2*10^e given that d is a proper decimal (i.e. not a NaN or an inf).
One caveat I see is the definition of least significant digit. The above methods assume that least significant digit is denoted by e, which sometimes might be counterintuitive. I.e. is decimal(21,2) even? Then is decimal(2100,0)?

if(a == b) doesn't work for doubles in a for loop

I am at the moment trying to code a titration curve simulator. But I am running into some trouble with comparing two values.
I have created a small working example that perfectly replicates the bug that I encounter:
#include <iostream>
#include <math.h>
using namespace std;
int main()
{
double a, b;
a = 5;
b = 0;
for(double i = 0; i<=(2*a); i+=0.1){
b = i;
cout << "a=" << a << "; b="<<b;
if(a==b)
cout << "Equal!" << endl;
else
cout << endl;
}
return 0;
}
The output at the relevant section is
a=5; b=5
However, if I change the iteration increment from i+=0.1 to i+=1 or i+=0.5 I get an output of
a=5; b=5Equal!
as you would expect.
I am compiling with g++ on linux using no further flags and I am frankly at a loss how to solve this problem. Any pointers (or even a full-blown solution to my problem) are very appreciated.
Unlike with integers, repeatedly adding floats/doubles doesn't produce exactly the same result as multiplying them.
So the best practice is to check whether the absolute value of their difference is small enough.
If you have some idea on the size of the numbers, you can use a constant:
if (fabs(a - b) < EPS) // equal
If you don't (much slower!):
double a1 = fabs(a), b1 = fabs(b);
double mn = min(a1,b1), mx = max(a1,b1);
if (mn / mx > (1- EPS)) // equal
Note:
In your code, you can use std::abs instead. Same for std::min/max. The code is clearer/shorter when using the C functions.
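The two checks above can also be combined into one helper. This is only a sketch: the name nearly_equal and the default tolerances are assumptions picked for illustration, and suitable values depend on the magnitudes in your simulation.

```cpp
#include <algorithm>
#include <cmath>

// Sketch: combined absolute/relative tolerance comparison.
// The default tolerances are arbitrary illustrative values.
bool nearly_equal(double a, double b,
                  double abs_tol = 1e-12, double rel_tol = 1e-9) {
    const double diff = std::abs(a - b);
    // The absolute check handles values at or near zero.
    if (diff <= abs_tol) return true;
    // The relative check scales the tolerance with the larger magnitude.
    return diff <= rel_tol * std::max(std::abs(a), std::abs(b));
}
```

In the loop from the question, if (nearly_equal(a, b)) would then replace if (a == b).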
I would recommend restructuring your loop to iterate using integers and then converting the integers into doubles, like this:
double step = 0.1;
for(int i = 0; i*step<=2*a; ++i){
b = i*step;
cout << "a=" << a << "; b="<<b;
if(a==b)
cout << "Equal!" << endl;
else
cout << endl;
}
This still isn't perfect. You possibly have some loss of precision in the multiplication; however, the floating point errors don't accumulate like they do when iterating using floating point values.
Floating point arithmetic is... interesting. Testing equality is annoying with floats/doubles in most languages because it is impossible to accurately represent many numbers in IEEE floating point math. Basically, where you might compute an expression to be 5.0, the compiler might compute it to be 4.9999999, because it's the closest representable number in the IEEE standard.
Because these numbers are slightly different, you end up with an inequality. Because it's unmaintainable to try to predict which number you will see at compile time, you can't/shouldn't attempt to hard code either one of them into your source to test equality with. As a hard rule, avoid directly checking equality of floating point numbers.
Instead, test that they are extremely close to being equal with something like the following:
template<typename T>
bool floatEqual(const T& a, const T& b) {
auto delta = std::abs(a * 0.03); // std::abs (from <cmath>) keeps the tolerance non-negative when a is negative
auto minAccepted = a - delta;
auto maxAccepted = a + delta;
return b > minAccepted && b < maxAccepted;
}
This checks whether b is within a range of + or - 3% of the value of a.

How to print a C++ double with the correct number of significant decimal digits?

When dealing with floating point values in Java, calling the toString() method gives a printed value that has the correct number of floating point significant figures. However, in C++, printing a float via stringstream will round the value after 5 or less digits. Is there a way to "pretty print" a float in C++ to the (assumed) correct number of significant figures?
EDIT: I think I am being misunderstood. I want the output to be of dynamic length, not a fixed precision. I am familiar with setprecision. If you look at the java source for Double, it calculates the number of significant digits somehow, and I would really like to understand how it works and/or how feasible it is to replicate this easily in C++.
/*
* FIRST IMPORTANT CONSTRUCTOR: DOUBLE
*/
public FloatingDecimal( double d )
{
long dBits = Double.doubleToLongBits( d );
long fractBits;
int binExp;
int nSignificantBits;
// discover and delete sign
if ( (dBits&signMask) != 0 ){
isNegative = true;
dBits ^= signMask;
} else {
isNegative = false;
}
// Begin to unpack
// Discover obvious special cases of NaN and Infinity.
binExp = (int)( (dBits&expMask) >> expShift );
fractBits = dBits&fractMask;
if ( binExp == (int)(expMask>>expShift) ) {
isExceptional = true;
if ( fractBits == 0L ){
digits = infinity;
} else {
digits = notANumber;
isNegative = false; // NaN has no sign!
}
nDigits = digits.length;
return;
}
isExceptional = false;
// Finish unpacking
// Normalize denormalized numbers.
// Insert assumed high-order bit for normalized numbers.
// Subtract exponent bias.
if ( binExp == 0 ){
if ( fractBits == 0L ){
// not a denorm, just a 0!
decExponent = 0;
digits = zero;
nDigits = 1;
return;
}
while ( (fractBits&fractHOB) == 0L ){
fractBits <<= 1;
binExp -= 1;
}
nSignificantBits = expShift + binExp +1; // recall binExp is - shift count.
binExp += 1;
} else {
fractBits |= fractHOB;
nSignificantBits = expShift+1;
}
binExp -= expBias;
// call the routine that actually does all the hard work.
dtoa( binExp, fractBits, nSignificantBits );
}
After this function, it calls dtoa( binExp, fractBits, nSignificantBits ); which handles a bunch of cases - this is from OpenJDK6
For more clarity, an example:
Java:
double test1 = 1.2593;
double test2 = 0.004963;
double test3 = 1.55558742563;
System.out.println(test1);
System.out.println(test2);
System.out.println(test3);
Output:
1.2593
0.004963
1.55558742563
C++:
std::cout << test1 << "\n";
std::cout << test2 << "\n";
std::cout << test3 << "\n";
Output:
1.2593
0.004963
1.55559
I think you are talking about how to print the minimum number of floating point digits that allow you to read the exact same floating point number back. This paper is a good introduction to this tricky problem.
http://grouper.ieee.org/groups/754/email/pdfq3pavhBfih.pdf
The dtoa function looks like David Gay's work, you can find the source here http://www.netlib.org/fp/dtoa.c (although this is C not Java).
Gay also wrote a paper about his method. I don't have a link but it's referenced in the above paper so you can probably google it.
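Short of porting dtoa, a simple (and much slower) way to get the same kind of output is to try increasing precisions until the printed text reads back as the same double. A minimal sketch, assuming the C library's snprintf and strtod are correctly rounded; the function name shortest_repr is made up here, and this is not the algorithm from the paper.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Sketch: fewest significant digits that still round-trip.
// Tries precisions 1..17; 17 digits are always enough for an IEEE double.
std::string shortest_repr(double value) {
    char buf[64];
    for (int precision = 1; precision <= 17; ++precision) {
        std::snprintf(buf, sizeof buf, "%.*g", precision, value);
        // Stop as soon as parsing the text gives back the same double.
        if (std::strtod(buf, nullptr) == value) break;
    }
    return buf;
}
```

For the three values in the question this yields 1.2593, 0.004963 and 1.55558742563.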
Is there a way to "pretty print" a float in C++ to the (assumed) correct number of significant figures?
Yes, you can do it with C++20 std::format, for example:
double test1 = 1.2593;
double test2 = 0.004963;
double test3 = 1.55558742563;
std::cout << std::format("{}", test1) << "\n";
std::cout << std::format("{}", test2) << "\n";
std::cout << std::format("{}", test3) << "\n";
prints
1.2593
0.004963
1.55558742563
The default format will give you the shortest decimal representation with a round-trip guarantee like in Java.
Since this is a new feature and may not be supported by some standard libraries yet, you can use the {fmt} library, which std::format is based on. {fmt} also provides the print function that makes this even easier and more efficient (godbolt):
fmt::print("{}", 1.2593);
Disclaimer: I'm the author of {fmt} and C++20 std::format.
You can use the ios_base::precision technique where you can specify the number of digits you want
For example
#include <iostream>
using namespace std;
int main () {
double f = 3.14159;
cout.unsetf(ios::floatfield); // floatfield not set
cout.precision(5);
cout << f << endl;
cout.precision(10);
cout << f << endl;
cout.setf(ios::fixed,ios::floatfield); // floatfield set to fixed
cout << f << endl;
return 0;
}
The above code outputs
3.1416
3.14159
3.1415900000
There is a utility called numeric_limits:
#include <limits>
...
int num10 = std::numeric_limits<double>::digits10;
int max_num10 = std::numeric_limits<double>::max_digits10;
Note that IEEE numbers are not represented exactly by decimal digits. These are binary quantities. A more accurate number is the number of binary bits:
int bits = std::numeric_limits<double>::digits;
To pretty print all the significant digits, set the stream precision with this (use max_digits10 instead if the printed value must read back as exactly the same double):
out.precision(std::numeric_limits<double>::digits10);
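As a small sketch of the numeric_limits approach, the helper below formats a double with max_digits10 significant digits, which is enough for the text to read back as the same value. The function name to_roundtrip_string is hypothetical. Unlike Java's toString(), this does not give the shortest representation; it may print trailing digits that look like noise.

```cpp
#include <limits>
#include <sstream>
#include <string>

// Sketch: format a double with enough digits to round-trip.
std::string to_roundtrip_string(double value) {
    std::ostringstream out;
    // max_digits10 is 17 for an IEEE double
    out.precision(std::numeric_limits<double>::max_digits10);
    out << value;
    return out.str();
}
```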