printing very large floating point numbers - c++

#include <stdio.h>
#include <float.h>

int main()
{
    printf("%f\n", FLT_MAX);
}
Output from GNU:
340282346638528859811704183484516925440.000000
Output from Visual Studio:
340282346638528860000000000000000000000.000000
Do the C and C++ standards allow both results? Or do they mandate a specific result?
Note that FLT_MAX = 2^128-2^104 = 340282346638528859811704183484516925440.

I think the relevant part of the C99 standard is the "Recommended practice" from 7.19.6.1 p.13:
For e, E, f, F, g, and G conversions, if the number of significant decimal digits is at most
DECIMAL_DIG, then the result should be correctly rounded. If the number of
significant decimal digits is more than DECIMAL_DIG but the source value is exactly
representable with DECIMAL_DIG digits, then the result should be an exact
representation with trailing zeros. Otherwise, the source value is bounded by two
adjacent decimal strings L < U, both having DECIMAL_DIG significant digits; the value
of the resultant decimal string D should satisfy L <= D <= U, with the extra stipulation that
the error should have a correct sign for the current rounding direction.
My impression is that this allows some leeway in what may be printed in this case; so my conclusion is that both VS and GCC are compliant here.
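As a quick illustration of how much of that output is guaranteed to be meaningful, here is a sketch that prints FLT_MAX both ways (the exact digits assume IEEE-754 binary32/binary64, where DECIMAL_DIG is typically 17):

#include <cfloat>
#include <cstdio>

int main()
{
    // %f requests 39 integer digits plus 6 fractional ones here, far more than
    // DECIMAL_DIG, so only the L <= D <= U bound of the recommended practice applies.
    std::printf("%f\n", FLT_MAX);
    // At most DECIMAL_DIG significant digits: required to be correctly rounded.
    std::printf("%.*e\n", DECIMAL_DIG - 1, FLT_MAX);
}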

Both are allowed by the C standard (C++ just imports the C standard).
From a draft version in section 5.2.4.2.2 part 10
The values given in the following list shall be replaced by constant expressions with
implementation-defined values that are greater than or equal to those shown:
— maximum representable finite floating-point number, (1 − b^−p) b^emax
FLT_MAX 1E+37
and Visual C++ 2012 has
#define FLT_MAX 3.402823466e+38F /* max value */

The code itself is flawed in that it uses %f for a value with more significance than a float or double holds. By doing so you are asking to see "behind the curtain" at the meaningless guard bits or other floating-point noise generated in the conversion to decimal.
Clearly you should not expect any consistency in the metal filings generated after making an engine at Honda versus at Toyota, never mind any sensible expectation of such consistency.
The proper way to display such numbers is with one of the "scientific" formats such as %g, provided the precision is not over-specified. On IEEE-754 implementations, 7 decimal digits are significant for float, 15-16 for double, about 19 for long double, and 34 for __float128. So, for the example you have given, %.15g would be proper, assuming an IEEE-754 implementation.
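For instance, a minimal sketch of that advice (the printed digits assume IEEE-754 binary32/binary64):

#include <cfloat>
#include <cstdio>

int main()
{
    std::printf("%.7g\n", FLT_MAX);  // 3.402823e+38: float's ~7 significant digits
    std::printf("%.15g\n", DBL_MAX); // 1.79769313486232e+308
}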

Related

Can multiplying a pair of almost-one values ever yield a result of 1.0?

I have two floating point values, a and b. I can guarantee they are values in the domain (0, 1). Is there any circumstance where a * b could equal one? I intend to calculate 1/(1 - a * b), and wish to avoid a divide by zero.
My instinct is that it cannot, because the result should be equal to or smaller than a or b. But instincts are a poor replacement for understanding the correct behavior.
I do not get to specify the rounding mode, so if there's a rounding mode where I could get into trouble, I want to know about it.
Edit: I did not specify whether the compiler was IEEE compliant or not because I cannot guarantee that the compiler/CPU running my software will indeed be IEEE compliant.
I have two floating point values, a and b…
Since this says we have “values,” not “variables,” it admits a possibility that 1 - a*b may evaluate to 1. When writing about software, people sometimes use names as placeholders for more complicated expressions. For example, one might have an expression a that is sin(x)/x and an expression b that is 1-y*y and then ask about computing 1 - a*b when the code is actually 1 - (sin(x)/x)*(1-y*y). This would be a problem because C++ allows extra precision to be used when evaluating floating-point expressions.
The most common instances of this are that the compiler uses long double arithmetic while computing expressions containing double operands, or that it uses a fused multiply-add instruction while computing an expression of the form x + y*z.
Suppose expressions a and b have been computed with excess precision and are positive values less than 1 in that excess precision. E.g., for illustration, suppose double were implemented with four decimal digits but a and b were computed with long double with six decimal digits. a and b could both be .999999. Then a*b is .999998000001 before rounding, .999998 after rounding to six digits. Now suppose that at this point in the computation, the compiler converts from long double to double, perhaps because it decides to store this intermediate value on the stack temporarily while it computes some other things from nearby expressions. Converting it to four-digit double produces 1.000, because that is the four-decimal-digit number nearest .999998. When the compiler later loads this from the stack and continues evaluation, we have 1 - 1.000, and the result is zero.
On the other hand, if a and b are variables, I expect your expression is safe. When a value is assigned to a variable or is converted with a cast operation, the C++ standard requires it to be converted to the nominal type; the result must be a value in the nominal type, without any "extra precision." Then, given 0 < a < 1 and 0 < b < 1, the mathematical value (that is, the value computed without floating-point rounding) a•b is less than a and is less than b. Then rounding of a•b to the nominal type cannot produce a value greater than a or b with any IEEE-754 rounding method, so it cannot produce 1. (The only requirement here is that the rounding method never skip over values: it might be constrained to round in a particular direction, upward or downward or toward zero or whatever, but it never goes past a representable value in that direction to get to a value farther away from the unrounded result. Since we know a•b is bounded above by both a and b, rounding cannot produce any result greater than the lesser of a and b.)
Formally, the C++ standard does not impose any requirements on the accuracy of floating-point results. So a C++ implementation could use a bonkers rounding mode that produced 3.14 for .9*.9. Aside from implementations flushing subnormals to zero, I am not aware of any C++ implementations that do not obey the requirement above. Flushing subnormals to zero will not affect calculations in 1 - a*b when a and b are near 1. (In a perverse floating-point format, with an exponent range narrower than the significand and no subnormal values, .9999 could be representable while .0001 is not because the exponent required for it is out of range. Then 1-.9999*.9999, which would produce .0002 in normal four-digit arithmetic, would produce 0 due to underflow. No such formats are in normal hardware.)
So, if a and b are variables, 0 < a < 1 and 0 < b < 1, and your C++ implementation is reasonable (may use extra precision, may flush subnormals, does not use perverse floating-point formats or rounding), then 1 - a*b does not evaluate to zero.
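A minimal sketch of the safe pattern described above (reciprocal_of_one_minus_product is a hypothetical name; the point is that the named assignment forces rounding to double, discarding any excess precision):

#include <cassert>

double reciprocal_of_one_minus_product(double a, double b)
{
    assert(a > 0 && a < 1 && b > 0 && b < 1);
    double ab = a * b;   // assignment rounds to double; ab <= min(a, b) < 1
    return 1 / (1 - ab); // so the denominator cannot be zero
}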
There is a mathematical proof that it will never be >= 1. I don't have it handy; you may want to ask on the math Stack Exchange site if you are interested in studying the proof. But your instincts are correct: it will never be >= 1.
Now, we must be careful because floating point arithmetic is only an approximation of math and has limitations. I'm not an expert on these limitations, but the floating-point standard is very carefully designed and provides certain guarantees. I'm pretty sure one of them includes (or implies) that x * y where x < 1 and y < 1 is guaranteed to be < 1.
You can check that even the largest float or double below 1, multiplied by itself, yields a result below 1; any multiplication of numbers smaller than that must give a smaller result.
Here is the code I ran, with the results in comments:
#include <math.h>  // nextafterf, nextafter

float a = nextafterf(1, 0);  // 0.999999940, the largest float below 1
double b = nextafter(1, 0);  // 0.99999999999999989, the largest double below 1
float c = a * a;             // 0.999999881
double d = b * b;            // 0.99999999999999978

When a 64bit int is cast to 64bit float in C/C++ and doesn't have an exact match, will it always land on a non-fractional number?

When int64_t is cast to double and doesn't have an exact match, to my knowledge I get a sort of best-effort-nearest-value equivalent in double. For example, 9223372036854775000 in int64_t appears to end up as 9223372036854774784.0 in double:
#include <stdio.h>

int main(int argc, const char **argv) {
    printf("Corresponding double: %f\n", (double)9223372036854775000LL);
    // Outputs: 9223372036854774784.000000
    return 0;
}
It appears to me as if an int64_t cast to a double always ends up as a clean non-fractional number, even in this higher number range where double has really low precision. However, I just observed this from random attempts. Is this guaranteed to happen for any value of int64_t cast to a double?
And if I cast this non-fractional double back to int64_t, will I always get the exact corresponding 64bit int with the .0 chopped off? (Assuming it doesn't overflow during the conversion back.) Like here:
#include <inttypes.h>
#include <stdio.h>

int main(int argc, const char **argv) {
    printf("Corresponding double: %f\n", (double)9223372036854775000LL);
    // Outputs: 9223372036854774784.000000
    printf("Corresponding int to corresponding double: %" PRId64 "\n",
           (int64_t)((double)9223372036854775000LL));
    // Outputs: 9223372036854774784
    return 0;
}
Or can it be imprecise and get me the "wrong" int in some corner cases?
Intuitively and from my tests the answer to both points appears to be "yes", but if somebody with a good formal understanding of the floating point standards and the maths behind it could confirm this that would be really helpful to me. I would also be curious if any known more aggressive optimizations like gcc's -Ofast are known to break any of this.
In the general case, yes, both should be true. The floating-point base needs to be, if not 2, at least an integer; given that, an integer converted to the nearest floating-point value can never produce a non-zero fraction: either the precision suffices, or the lowest-order integer digits in the base of the floating type are zeroed. For example, your system uses ISO/IEC/IEEE 60559 binary floating-point numbers. When inspected in base 2, it can be seen that the trailing digits of the value are indeed zeroed:
>>> bin(9223372036854775000)
'0b111111111111111111111111111111111111111111111111111110011011000'
>>> bin(9223372036854774784)
'0b111111111111111111111111111111111111111111111111111110000000000'
The conversion of a double without fractions to an integer type, given that the value of the double falls within the range of the integer type, should be exact...
Though you still might encounter a quality-of-implementation issue, or an outright bug - for example MSVC currently has a compiler bug where a round-trip conversion of unsigned 32-bit value with MSB set (or just double value between 2³¹ and 2³²-1 converted to unsigned int) would "overflow" in the conversion and always result in exactly 2³¹.
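A sketch of the round-trip that bug report describes (a hypothetical repro, not a confirmed test case; on a correct implementation the output is 2147483649):

#include <cstdio>

int main()
{
    unsigned int u = 0x80000001u;        // 2^31 + 1, MSB set
    double d = u;                        // exactly 2147483649.0
    unsigned int back = (unsigned int)d; // in range, so this is well-defined
    std::printf("%u\n", back);           // the buggy compiler reportedly prints 2147483648
}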
The following assumes the value being converted is positive. The behavior of negative numbers is analogous.
C 2018 6.3.1.4 2 specifies conversions from integer to real and says:
… If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner.
This tells us that some integer value x being converted to floating-point can produce a non-integer only if one of the two representable values bounding x is not an integer and x is not representable.
5.2.4.2.2 specifies the model used for floating-point numbers. Each finite floating-point number is represented by a sequence of digits in a certain base b scaled by b^e for some exponent e. (b is an integer greater than 1.) Then, if one of the two values bounding x, say p, is not an integer, the scaling must be such that the lowest digit in that floating-point number represents a fraction. But if this is the case, then setting all of the digits in p that represent fractions to 0 must produce a new floating-point number that is an integer. If x < p, this integer must be x, and therefore x is representable in the floating-point format. On the other hand, if p < x, we can add enough to each digit that represents a fraction to make it 0 (and produce a carry to the next higher digit). This will also produce an integer representable in the floating-point type¹, and it must be x.
Therefore, if conversion of an integer x to the floating-point type would produce a non-integer, x must be representable in the type. But then conversion to the floating-point type must produce x. So it is never possible to produce a non-integer.
Footnote
¹ It is possible this will carry out of all the digits, as when applying it to a three-digit decimal number 9.99, which produces 10.00. In this case, the value produced is the next power of b, if it is in range of the floating-point format. If it is not, the C standard does not define the behavior. Also note the C standard sets minimum requirements on the range that floating-point formats must support which preclude any format from not being able to represent 1, which avoids a degenerate case in which a conversion could produce a number like .999 because it was the largest representable finite value.
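The proof can be sanity-checked empirically, though a test is of course not a substitute for the argument above (output assumes IEEE-754 binary64):

#include <cmath>
#include <cstdint>
#include <cstdio>

int main()
{
    int64_t x = 9223372036854775000LL; // not exactly representable in double
    double d = (double)x;
    // trunc(d) == d exactly when d holds a whole number
    std::printf("%.1f is %s\n", d, std::trunc(d) == d ? "whole" : "fractional");
}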
When a 64bit int is cast to 64bit float ... and doesn't have an exact match, will it always land on a non-fractional number?
Is this guaranteed to happen for any value of int64_t cast to a double?
For a common double: yes, it always lands on a non-fractional number.
When there is no exact match, the result is the closest representable floating-point value above or below, depending on the rounding mode. Given the characteristics of a common double, in the range where int64_t values are inexact, both of these bounding values are themselves whole numbers, so the conversion still lands on a whole number.
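For example, a sketch showing two adjacent doubles just below 2⁶³ (values assume IEEE-754 binary64, where the spacing between adjacent doubles in that binade is 1024):

#include <cmath>
#include <cstdio>

int main()
{
    double d = 9223372036854774784.0;        // largest double below 2^63
    double up = std::nextafter(d, INFINITY); // 2^63 itself
    std::printf("%.1f\n%.1f\n", d, up);      // both whole numbers, 1024 apart
}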
... if I cast this non-fractional double back to int64_t, will I always get the exact corresponding 64bit int with the .0 chopped off?
No. Edge cases near INT64_MAX fail, as the converted value can be a FP value above INT64_MAX. Converting that back to the integer type is then undefined behavior: "If the value of the integral part cannot be represented by the integer type, the behavior is undefined." C17 § 6.3.1.4 1
#include <limits.h>
#include <stdio.h>

int main() {
    long long imaxm1 = LLONG_MAX - 1;
    double max = (double) imaxm1;
    printf("%lld\n%f\n", imaxm1, max);
    long long imax = (long long) max;
    printf("%lld\n", imax);
}
9223372036854775806
9223372036854775808.000000
9223372036854775807 // The back-conversion here is undefined behavior; this is what one implementation produced.
Deeper exceptions
(Question variation) When an N-bit integer type is cast to a floating-point type and doesn't have an exact match, will it always land on a non-fractional number?
Integer type range exceeds the finite floating-point range
Conversion to infinity: with a common float and a uint128_t, UINT128_MAX converts to infinity. This is readily possible with extra-wide integer types.
#include <stdio.h>

int main() {
    unsigned __int128 imaxm1 = 0xFFFFFFFFFFFFFFFF;
    imaxm1 <<= 64;
    imaxm1 |= 0xFFFFFFFFFFFFFFFF;  // imaxm1 = UINT128_MAX
    double fmax = (float) imaxm1;  // rounds to float: infinity
    double max = (double) imaxm1;  // fits within double's range
    printf("%llde27\n%f\n%f\n",
           (long long) (imaxm1/1000000000/1000000000/1000000000),
           fmax, max);
}
340282366920e27
inf
340282366920938463463374607431768211456.000000
Floating-point precision deeper than range
On some unicorn implementation with very wide FP precision and a small exponent range, the largest finite value could, in theory though not in practice, be a non-whole number. Then, with an even wider integer type, the conversion could result in this non-whole value. I do not see this as a legitimate concern of OP's.

When is integer to floating point conversion lossless?

Particularly I'm interested if int32_t is always losslessly converted to double.
Does the following code always return true?
#include <stdint.h>

int is_lossless(int32_t i)
{
    double d = i;
    int32_t i2 = d;
    return (i2 == i);
}
What about int64_t?
When is integer to floating point conversion lossless?
When the floating point type has enough precision and range to encode all possible values of the integer type.
Does the following int32_t code always return true? --> Yes.
Does the following int64_t code always return true? --> No.
As DBL_MAX is at least 1E+37, the range is sufficient for at least int122_t, so let us look at precision.
With a common double, with its base 2, sign bit, 53-bit significand, and exponent, all values of int54_t with its 53 value bits can be represented exactly; INT54_MIN is also representable. Such a double has DBL_MANT_DIG == 53, which is the number of base-2 digits in the floating-point significand.
The smallest-magnitude non-representable value would be INT54_MAX + 2 (that is, 2⁵³ + 1). Type int55_t and wider have values not exactly representable as a double.
With uintN_t types, there is 1 more value bit; the typical double can then encode all of uint53_t and narrower exactly.
With other possible double encodings, as C specifies DBL_DIG >= 10, all values of int34_t can round trip.
The code always returns true with int32_t, regardless of the double encoding.
What about int64_t?
UB potential with int64_t.
The conversion in int64_t i; ... double d = i;, when inexact, yields an implementation-defined choice between the 2 nearest candidates. This is often a round to nearest, so i values near INT64_MAX can convert to a double one more than INT64_MAX.
With int64_t i2 = d;, the conversion of a double value one more than INT64_MAX back to int64_t is undefined behavior (UB).
A simple prior test to detect this:
#define INT64_MAX_P1 ((INT64_MAX/2 + 1) * 2.0)
if (d == INT64_MAX_P1) return false; // not lossless
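Putting the pieces together, a sketch of a round-trip check that sidesteps the UB case (assumes a base-2 double; is_lossless_i64 is a hypothetical helper):

#include <cstdint>
#include <cstdio>

// 2^63, exactly representable as a double, computed without int64_t overflow
#define INT64_MAX_P1 ((INT64_MAX / 2 + 1) * 2.0)

bool is_lossless_i64(int64_t i)
{
    double d = (double)i;
    if (d >= INT64_MAX_P1) return false; // converting d back would be UB
    return (int64_t)d == i;
}

int main()
{
    std::printf("%d\n", is_lossless_i64((int64_t)1 << 53));       // 1: exact
    std::printf("%d\n", is_lossless_i64(((int64_t)1 << 53) + 1)); // 0: rounds to 2^53
}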
Question: Does the following code always return true?
Always is a big statement, and therefore the answer is no.
The C++ standard makes no mention of whether the floating-point types known to C++ (float, double and long double) are of the IEEE-754 type. The standard explicitly states:
There are three floating-point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined. [Note: This document imposes no requirements on the accuracy of floating-point operations; see also [support.limits]. — end note] Integral and floating-point types are collectively called arithmetic types. Specialisations of the standard library template std​::​numeric_­limits shall specify the maximum and minimum values of each arithmetic type for an implementation.
source: C++ standard: basic fundamentals
Most commonly, the type double represents the IEEE 754 double-precision binary floating-point format binary64: 1 sign bit, 11 exponent bits, and 52 explicit significand bits, decoded (for normal numbers) as (−1)^sign × 1.fraction × 2^(exponent − 1023).
However, there is a plethora of other floating-point formats out there that are decoded differently and do not necessarily have the same properties as the well-known IEEE-754. Nonetheless, they are all broadly similar:
They are n bits long
One bit represents the sign
m bits represent the significand, with or without a hidden first bit
e bits represent some form of an exponent of a given base (2 or 10)
To know whether or not a double can represent all 32-bit signed integers, you must answer the following questions (assuming our floating-point number is in base 2):
Does my floating-point representation have a hidden first bit in the significand? If so, count m as m+1.
A 32-bit signed integer is represented by 1 sign bit and 31 bits representing the number. Is the significand large enough to hold those 31 bits?
Is the exponent large enough that it can represent a number of the form 1.xxxxx · 2³¹?
If you can answer yes to the last two questions, then yes, an int32 can always be represented by the double that is implemented on this particular system.
Note: I ignored decimal32 and decimal64 numbers, as I have no direct knowledge of them.
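Those questions can be put to the implementation directly via std::numeric_limits; here is a sketch for the base-2 case, mirroring the checklist above:

#include <limits>

// Fails to compile on an implementation whose double cannot hold every int32_t.
static_assert(std::numeric_limits<double>::radix == 2,
              "checklist assumes a base-2 double");
static_assert(std::numeric_limits<double>::digits >= 31,
              "significand must hold int32's 31 value bits");
static_assert(std::numeric_limits<double>::max_exponent >= 32,
              "exponent must reach 2^31");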
Note: my answer supposes the double follows IEEE 754, and both int32_t and int64_t are 2's complement.
Does the following code always return true?
The mantissa/significand of a double is longer than 32 bits, so int32_t => double is always done without error: there is no possible precision error, and there is no possible overflow/underflow since the exponent covers more than the needed range of values.
What is for int64_t?
But the 53 bits of mantissa/significand of a double (including 1 implicit bit) are not enough to store all 64 bits of an int64_t => an int64_t whose upper and lower set bits are far enough apart cannot be stored in a double without precision error (there is still no possible overflow/underflow; the exponent still covers more than the needed range of values).
If your platform uses IEEE754 for the double, then yes, any int32_t can be represented perfectly in a double. This is not the case for all possible values that an int64_t can have.
(It is possible on some platforms to tweak the mantissa / exponent sizes of floating point types to make the transformation lossy, but such a type would not be an IEEE754 double.)
To test for IEEE754, use
static_assert(std::numeric_limits<double>::is_iec559, "IEEE 754 floating point");

Java-style printing of `double` in C++

Is there any combination of stream manipulators (or any other method in the standard C++) that would allow me to get the "right" number of digits when printing double in C++?
By the "right" number I mean the number of digits as defined here:
How many digits must be printed for the fractional part of m or a? There must be at least one digit to represent the fractional part, and beyond that as many, but only as many, more digits as are needed to uniquely distinguish the argument value from adjacent values of type double. That is, suppose that x is the exact mathematical value represented by the decimal representation produced by this method for a finite nonzero argument d. Then d must be the double value nearest to x; or if two double values are equally close to x, then d must be one of them and the least significant bit of the significand of d must be 0.
In a bit of a simplistic example, let's suppose that we have three double values: DD, D0 and D1. DD is the "middle", D1 has mantissa larger by 1, D0 smaller by 1.
When printed to some very large arbitrary precision, they produce the following values (the numbers in the example are completely off the wall):
D0 => 1.299999999701323987
DD => 1.300000000124034353
D1 => 1.300000000524034353
(EPSILON, the value of least significant bit of mantissa at 0 exponent, is ~ 0.0000000004)
In that case, the method above would produce
D0 => 1.2999999997
DD => 1.3
D1 => 1.3000000005
If I understand you correctly, you want std::to_chars.
value is converted to a string as if by std::printf in the default ("C") locale. The conversion specifier is f or e (resolving in favor of f in case of a tie), chosen according to the requirement for a shortest representation: the string representation consists of the smallest number of characters such that there is at least one digit before the radix point (if present) and parsing the representation using the corresponding std::from_chars function recovers value exactly. If there are several such representations, one with the smallest difference to value is chosen, resolving any remaining ties using rounding according to std::round_to_nearest.
It's a low-level function, so printing the result requires some work:
#include <charconv>
#include <iostream>

int main()
{
    double val = 0.1234;
    char buf[64];
    *std::to_chars(buf, buf + sizeof buf, val).ptr = '\0';
    std::cout << buf << '\n';
}
This function needs an up-to-date standard library: GCC (libstdc++) 11 or newer, or Clang (libc++) 14 or newer. MSVC also supports it.
Absent the C++17 library function that isn’t implemented everywhere yet, you can just try precisions (with whatever equivalent of %g) until one rounds back exactly. If you don’t care about subnormal numbers (where even two digits can be too many), start at 15 for IEEE 754 doubles; that never produces meaningless trailing digits. You’ll never need to go past 17.
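A sketch of that retry loop (print_shortest is a hypothetical helper; the digit range 15..17 assumes IEEE-754 binary64):

#include <cstdio>
#include <cstdlib>

// Print d using the fewest significant digits (15..17) that round-trip exactly.
void print_shortest(double d)
{
    char buf[32];
    for (int prec = 15; prec <= 17; ++prec) {
        std::snprintf(buf, sizeof buf, "%.*g", prec, d);
        if (std::strtod(buf, nullptr) == d) break; // parses back exactly: done
    }
    std::puts(buf);
}

int main()
{
    print_shortest(0.1);       // 0.1
    print_shortest(1.0 / 3.0); // 0.3333333333333333 (needs 16 digits)
}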

C++ double operator+ [duplicate]

Possible Duplicates:
Incorrect floating point math?
Float compile-time calculation not happening?
Strange stuff going on today, I'm about to lose it...
#include <iomanip>
#include <iostream>
using namespace std;

int main()
{
    cout << setprecision(14);
    cout << (1/9+1/9+4/9) << endl;
}
This code outputs 0 on MSVC 9.0 x64 and x86 and on GCC 4.4 x64 and x86 (default options and strict math...). And as far as I remember, 1/9+1/9+4/9 = 6/9 = 2/3 != 0
1/9 is zero, because 1 and 9 are integers and divided by integer division. The same applies to 4/9.
If you want to express floating-point division through arithmetic literals, you have to either use floating-point literals 1.0/9 + 1.0/9 + 4.0/9 (or 1/9. + 1/9. + 4/9. or 1.f/9 + 1.f/9 + 4.f/9) or explicitly cast one operand to the desired floating-point type (double) 1/9 + (double) 1/9 + (double) 4/9.
P.S. Finally my chance to answer this question :)
Use a decimal point in your calculations to force floating-point math, optionally along with one of the suffixes f, F, l, or L on your numbers. A number without a decimal point and without one of those suffixes is not considered a floating-point literal.
C++03 2.13.3-1 on Floating literals:
A floating literal consists of an integer part, a decimal point, a fraction part, an e or E, an optionally signed integer exponent, and an optional type suffix. The integer and fraction parts both consist of a sequence of decimal (base ten) digits. Either the integer part or the fraction part (not both) can be omitted; either the decimal point or the letter e (or E) and the exponent (not both) can be omitted. The integer part, the optional decimal point and the optional fraction part form the significant part of the floating literal. The exponent, if present, indicates the power of 10 by which the significant part is to be scaled. If the scaled value is in the range of representable values for its type, the result is the scaled value if representable, else the larger or smaller representable value nearest the scaled value, chosen in an implementation-defined manner. The type of a floating literal is double unless explicitly specified by a suffix. The suffixes f and F specify float, the suffixes l and L specify long double. If the scaled value is not in the range of representable values for its type, the program is ill-formed.
They are all integers. So 1/9 is 0. 4/9 is also 0. And 0 + 0 + 0 = 0. So the result is 0. If you want fractions, cast your fractions to floats.
1/9(=0)+1/9(=0)+4/9(=0) = 0
Well, in C++ (and many other languages), 1/9+1/9+4/9 is zero, because it is integer arithmetic.
You probably want to write 1/9.0+1/9.0+4/9.0
Unless you explicitly write a decimal point, the numbers C++ uses are integers, so 1/9 = 4/9 = 0 and 0 + 0 + 0 = 0.
You should simply write the decimal form, 1.0 etc...
By the C rules of types, you're doing all-integer math there: 1/9 and 4/9 are both truncated to 0 (as integers). If you wrote 1.0/9.0 etc., it would use double-precision math and do what you want.
You might make it a habit to use more parentheses. They cost little time, make clear what you intend, and ensure you get what you wanted. Well mostly... ;)
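For completeness, a corrected sketch of the original program (any one floating-point operand per division suffices):

#include <iomanip>
#include <iostream>
using namespace std;

int main()
{
    cout << setprecision(14);
    cout << (1.0/9 + 1.0/9 + 4.0/9) << endl; // 0.66666666666667
}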