Modeling infinity for the largest double value - C++

The question is about modeling infinity in C++ for the double data type. I need it in a header file, so I cannot use functions like numeric_limits.
Is there a defined constant that represents the largest value?

Floating point numbers (such as doubles) can actually hold positive and negative infinity. The constant INFINITY should be in your math.h header.
Went standard diving and found the text:
4 The macro INFINITY expands to a constant expression of type float
representing positive or unsigned infinity, if available; else to a
positive constant of type float that overflows at translation time.
In Section 7.12 Mathematics <math.h>
Then of course you have the helper macro isinf to test for infinity (which is also in math.h).
7.12.3.3 The isinf macro
int isinf(real-floating x);
Description: The isinf macro determines whether its argument value is an infinity (positive or negative). First, an argument represented in a format wider than its semantic type is converted to its semantic type. Then determination is based on the type of the argument.
Returns: The isinf macro returns a nonzero value if and only if its argument has an infinite value.
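A minimal sketch of how the two fit together (assuming an implementation where INFINITY expands to a true infinity):
#include <cmath>
#include <cstdio>

int main() {
    double d = INFINITY;  // positive infinity from <cmath>/math.h
    std::printf("%d\n", (int)std::isinf(d));   // prints 1: d is infinite
    std::printf("%d\n", (int)std::isinf(-d));  // prints 1: negative infinity too
}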

numeric_limits functions are all constexpr so they work just fine as compile time constants (assuming you're using the current version of C++). So std::numeric_limits<double>::infinity() ought to work in any context.
Even if you're using an older version, this will still work anywhere that you don't require a compile time constant. It's not clear from your question if your use really needs a compile time constant or not; just being in a header doesn't necessarily require it.
If you are using an older version, and you really do need a compile time constant, the macro INFINITY in cmath should work for you. It's actually the float value for infinity, but it can be converted to a double.
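For example, a header-only constant might look like this (a minimal sketch assuming C++11; the file and constant names are made up for illustration):
// constants.h (hypothetical header)
#include <limits>
static_assert(std::numeric_limits<double>::has_infinity, "double must have an infinity");
constexpr double kInfinity = std::numeric_limits<double>::infinity();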

Not sure why you can't use std::numeric_limits in a header file. But there is also this carried over from ANSI C:
#include <cfloat>
DBL_MAX

Maybe in your C++ environment you have float.h, see http://www.gamedev.net/topic/392211-max-value-for-double-/ (DBL_MAX)

I thought the answer was "42.0" ;)
This article might be of interest:
http://www.cygnus-software.com/papers/comparingfloats/comparingfloats.htm
Or this:
http://www.cplusplus.com/reference/clibrary/cfloat/
MAXimum — maximum finite representable floating-point number (the values shown are the minimums the C standard requires, not the actual values):
FLT_MAX >= 1E+37
DBL_MAX >= 1E+37
LDBL_MAX >= 1E+37

From Wikipedia:
0x 7ff0 0000 0000 0000 = Infinity
0x fff0 0000 0000 0000 = −Infinity
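You can check those bit patterns yourself; a minimal sketch, assuming IEEE 754 doubles and a 64-bit uint64_t:
#include <cstdint>
#include <cstring>
#include <cstdio>
#include <limits>

int main() {
    double inf = std::numeric_limits<double>::infinity();
    std::uint64_t bits;
    std::memcpy(&bits, &inf, sizeof bits);  // well-defined way to inspect the representation
    std::printf("0x%016llx\n", (unsigned long long)bits);  // 0x7ff0000000000000 on IEEE 754
}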

DBL_MAX can be used. It is found in float.h as follows:
#define DBL_MAX 1.7976931348623158e+308 /* max value */

#include <cmath>
...
double d = INFINITY;
You can find INFINITY defined in <cmath> (math.h):
A constant expression of type float representing positive or unsigned infinity, if available; else a positive constant of type float that overflows at translation time.

Wouldn't this work?
const double infinity = 1.0/0.0;

Related

Detect overflow when converting integral to floating types

The C standard, which C++ relies on for these matters as well, as far as I know, has the following section:
When a value of integer type is converted to a real floating type, if the value being converted can be represented exactly in the new type, it is unchanged. If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner. If the value being converted is outside the range of values that can be represented, the behavior is undefined.
Is there any way I can check for the last case? It seems to me that this last undefined behaviour is unavoidable. If I have an integral value i and naively check something like
i <= FLT_MAX
I will (apart from other problems related to precision) already trigger it because the comparison first converts i to a float (in this case or to any other floating type in general), so if it is out of range, we get undefined behaviour.
Or is there some guarantee about the relative sizes of integral and floating types that would imply something like "float can always represent all values of int (not necessarily exactly of course)" or at least "long double can always hold everything" so that we could do comparisons in that type? I couldn't find anything like that, though.
This is mainly a theoretical exercise, so I'm not interested in answers along the lines of "on most architectures these conversions always work". Let's try to find a way to detect this kind of overflow without assuming anything beyond the C(++) standard! :)
FLT_MAX and DBL_MAX are at least 1E+37 per the C spec, so all integers with |value| of 122 bits or fewer will convert to a float without overflow on all compliant platforms. The same holds for double.
To solve this in the general case for integers of 128/256/etc. bits, both FLT_MAX and some_big_integer_MAX need to be reduced.
Perhaps by taking the log of both (bit_count() is a user-supplied helper; a sketch follows below):
if(bit_count(unsigned_big_integer_MAX) > logbf(FLT_MAX)) problem();
Or, if the integer type lacks padding bits:
if(sizeof(unsigned_big_integer_MAX)*CHAR_BIT > logbf(FLT_MAX)) problem();
Note: mixing an FP function like logbf() with exact integer math may produce an edge condition where the compare comes out wrong.
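For completeness, a sketch of what that user-supplied bit_count() could look like, counting the value bits of an all-ones unsigned maximum (true for unsigned types without padding):
#include <cstdint>

// hypothetical helper: number of value bits in an all-ones unsigned maximum
int bit_count(std::uintmax_t max_value) {
    int bits = 0;
    while (max_value > 0) {  // each shift discards one value bit
        ++bits;
        max_value >>= 1;
    }
    return bits;
}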
Macro magic can use obtuse tests like the following, which take advantage of the fact that BIGINT_MAX is certainly a power-of-2 minus 1 and that dividing FLT_MAX by a power of 2 is certainly exact (unless FLT_RADIX == 10).
This pre-processor code will complain if conversion from a big integer type to float will be inexact for some big integer.
#define POW2_61 0x2000000000000000u
#if BIGINT_MAX/POW2_61 > POW2_61
  // BIGINT is at least a 122 bit integer
  #define BIGINT_MAX_PLUS1_div_POW2_61 ((BIGINT_MAX/2 + 1)/(POW2_61/2))
  #if BIGINT_MAX_PLUS1_div_POW2_61 > POW2_61
    #warning TBD code for an integer wider than 183 bits
  #else
    _Static_assert(BIGINT_MAX_PLUS1_div_POW2_61 <= FLT_MAX/POW2_61,
        "bigint too big for float");
  #endif
#endif
[Edit 2]
Is there any way I can check for the last case?
This code will complain if conversion from a big integer type to float will be inexact for a select big integer.
Of course the test needs to occur before the conversion is attempted.
Given various rounding modes or a rare FLT_RADIX == 10, the best that can readily be had is a test that aims a bit low: when it reports true, the conversion will work, yet a very small range of big integers that report false do in fact convert OK.
Below is a more refined idea that I need to mull over for a bit, yet I hope it provides a coding idea for the test the OP is looking for.
#include <float.h>
#include <stdbool.h>
#include <stdint.h>

#define POW2_60 0x1000000000000000u
#define POW2_62 0x4000000000000000u
#define MAX_FLT_MIN 1e37
#define MAX_FLT_MIN_LOG2 (122 /* 122.911.. */)

bool intmax_to_float_OK(intmax_t x) {
#if INTMAX_MAX/POW2_60 < POW2_62
  (void) x;
  return true; // all big integer values convert without overflow
#elif INTMAX_MAX/POW2_60/POW2_60 < POW2_62
  return x/POW2_60 < (FLT_MAX/POW2_60);
#elif INTMAX_MAX/POW2_60/POW2_60/POW2_60 < POW2_62
  return x/POW2_60/POW2_60 < (FLT_MAX/POW2_60/POW2_60);
#else
  #error TBD code
#endif
}
Here's a C++ template function that returns the largest positive integer that fits into both of the given types.
#include <climits>
#include <cmath>
#include <limits>
#include <type_traits>

template<typename float_type, typename int_type>
int_type max_convertible()
{
    // value bits of int_type; the parentheses around the ternary matter,
    // otherwise ?: would swallow the whole subtraction
    static const int int_bits = sizeof(int_type) * CHAR_BIT
                                - (std::is_signed<int_type>() ? 1 : 0);
    if ((int)std::ceil(std::log2(std::numeric_limits<float_type>::max())) > int_bits)
        return std::numeric_limits<int_type>::max();
    return (int_type) std::numeric_limits<float_type>::max();
}
If the number you're converting is larger than the return from this function, it can't be converted. Unfortunately, I'm having trouble finding a combination of types to test it with: it's very hard to find an integer type that won't fit into the smallest floating point type.
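A usage sketch of that guard (the types and the sample value are chosen just for illustration):
#include <iostream>

int main() {
    long long v = 123456789012345LL;
    if (v <= max_convertible<float, long long>()) {
        float f = static_cast<float>(v);  // cannot overflow, though it may round
        std::cout << f << '\n';
    }
}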

Is a 32-bit normalized floating point number that hasn't been operated on the same on any platform/compiler?

For example:
float a = 3.14159f;
If I was to inspect the bits in this number (or any other normalized floating point number), what are the chances that the bits are different in some other platform/compiler combination, or is that possible?
Not necessarily: the C++ standard doesn't define the floating point representation (it doesn't even define the representation of signed integers), although most platforms probably orient themselves on the same IEEE standard (IEEE 754-2008?).
Your question can be rephrased as: Will the final assertion in the following code always be upheld, no matter what platform you run it on?
#include <cassert>
#include <cstring>
#include <cstdint>
#include <limits>

#if __cplusplus < 201103L // no static_assert prior to C++11
#define static_assert(a,b) assert(a)
#endif

int main() {
    float f = 3.14159f;
    std::uint32_t i = 0x40490fd0; // IEC 559/IEEE 754 representation
    static_assert(std::numeric_limits<float>::is_iec559, "floating point must be IEEE 754");
    static_assert(sizeof(f) == sizeof(i), "float must be 32 bits wide");
    assert(std::memcmp(&f, &i, sizeof(f)) == 0);
}
Answer: There's nothing in the C++ standard that guarantees that the assertion will be upheld. Yet, on most sane platforms the assertion will hold and the code won't abort, no matter if the platform is big- or little-endian. As long as you only care that your code works on some known set of platforms, it'll be OK: you can verify that the tests pass there :)
Realistically speaking, some compilers might use a sub-par decimal-to-IEEE-754 conversion routine that doesn't properly round the result, so if you specify f to enough digits of precision, it might be a couple of LSBs of mantissa off from the value that would be nearest to the decimal representation. And then the assertion won't hold anymore. For such platforms, you might wish to test a couple mantissa LSBs around the desired one.
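A sketch of that relaxed check (assumes IEEE 754 floats; the helper name and the tolerance of 2 LSBs are made up for illustration):
#include <cstdint>
#include <cstring>

// true when f's bit pattern is within `tolerance` LSBs of `expected`
bool within_lsbs(float f, std::uint32_t expected, std::uint32_t tolerance) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    std::uint32_t diff = bits > expected ? bits - expected : expected - bits;
    return diff <= tolerance;  // e.g. within_lsbs(3.14159f, 0x40490fd0, 2)
}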

printing very large floating point numbers

#include <stdio.h>
#include <float.h>

int main()
{
    printf("%f\n", FLT_MAX);
}
Output from GNU:
340282346638528859811704183484516925440.000000
Output from Visual Studio:
340282346638528860000000000000000000000.000000
Do the C and C++ standards allow both results? Or do they mandate a specific result?
Note that FLT_MAX = 2^128-2^104 = 340282346638528859811704183484516925440.
I think the relevant part of the C99 standard is the "Recommended practice" from 7.19.6.1 p.13:
For e, E, f, F, g, and G conversions, if the number of significant decimal digits is at most
DECIMAL_DIG, then the result should be correctly rounded. If the number of
significant decimal digits is more than DECIMAL_DIG but the source value is exactly
representable with DECIMAL_DIG digits, then the result should be an exact
representation with trailing zeros. Otherwise, the source value is bounded by two
adjacent decimal strings L < U, both having DECIMAL_DIG significant digits; the value
of the resultant decimal string D should satisfy L <= D <= U, with the extra stipulation that
the error should have a correct sign for the current rounding direction.
My impression is that this allows some leeway in what may be printed in this case; so my conclusion is that both VS and GCC are compliant here.
Both are allowed by the C standard (C++ just imports the C standard).
From a draft version in section 5.2.4.2.2 part 10
The values given in the following list shall be replaced by constant expressions with
implementation-defined values that are greater than or equal to those shown:
— maximum representable finite floating-point number, (1 − b^(−p)) · b^(emax)
FLT_MAX 1E+37
and visual C++ 2012 has
#define FLT_MAX 3.402823466e+38F /* max value */
The code itself is flawed in that it uses %f for a value larger than the significance held in a float or double. By doing so you are asking to see "behind the curtain" at the meaningless guard bits and other floating point noise generated in the conversion to decimal.
Clearly you should not expect any consistency in the metal filings generated after making an engine at Honda versus at Toyota. Never mind any sensible expectation of such consistency.
The proper way to display such numbers is by using one of the "scientific" formats such as %g, provided precision is not over-specified. On IEEE-754 implementations, 7 decimal figures are significant for float, 15-16 for double, about 19 for long double, and 34 for __float128. So, for the example you have given, %.15g would be proper, assuming it is on an IEEE-754 implementation.
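Applied to the example, a sketch (digits in the last places may differ on non-IEEE implementations):
#include <stdio.h>
#include <float.h>

int main()
{
    printf("%.9g\n", FLT_MAX);   /* 9 significant digits round-trip an IEEE-754 float */
    printf("%.17g\n", DBL_MAX);  /* 17 digits do the same for a double */
}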

double variables in C++ are showing equal even when they are not

I just wrote the following code in C++:
double variable1;
double variable2;
variable1 = numeric_limits<double>::max() - 50;
variable2 = variable1;
variable1 = variable1 + 5;
cout << "\nVariable1==Variable2 ? " << (variable1 == variable2);
The answer to the cout statement comes out 1, even though variable1 and variable2 are not equal. Can someone help me with this? Why is this happening?
I knew the concept of imprecise floating point math but didn't think this would happen when comparing two doubles directly. Also, I get the same result when I replace variable1 with:
double variable1 = (numeric_limits<double>::max() - 10000000000000);
The comparison still shows them as equal. How much would I have to subtract to see them start differing?
The maximum value for a double is 1.7976931348623157E+308. Due to lack of precision, adding and removing small values such as 50 and 5 does not actually change the value of the variable. Thus they stay the same.
There isn't enough precision in a double to differentiate between M and M-45 where M is the largest value that can be represented by a double.
Imagine you're counting atoms to the nearest million. "123,456 million atoms" plus 1 atom is still "123,456 million atoms" because there's no space in the "millions" counting system for the 1 extra atom to make any difference.
numeric_limits<double>::max()
is a huuuuuge number. But the greater the absolute value of a double, the smaller its precision. Apparently in this case max-50 and max-45 are indistinguishable from double's point of view.
You should read the floating point comparison guide. In short, here are some examples:
float a = 0.15 + 0.15;
float b = 0.1 + 0.2;
if (a == b) // can be false!
if (a >= b) // can also be false!
The comparison with an epsilon value is what most people do.
#include <cmath> // for fabs
#define EPSILON 0.00000001
bool AreSame(double a, double b)
{
    return fabs(a - b) < EPSILON;
}
In your case, that max value is REALLY big. Adding or subtracting 50 does nothing. Thus they look the same because of the size of the number. See #RichieHindle's answer.
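The fixed EPSILON above stops working at magnitudes like max(); a relative comparison that scales the tolerance with the operands is a common refinement (a sketch; the function name and the 1e-12 factor are arbitrary):
#include <algorithm>
#include <cmath>

bool nearly_equal(double a, double b, double rel_eps = 1e-12)
{
    // scale the tolerance by the larger magnitude so the test behaves
    // sensibly both near 1.0 and near numeric_limits<double>::max()
    return std::fabs(a - b) <= rel_eps * std::max(std::fabs(a), std::fabs(b));
}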
Here are some additional resources for research.
See this blog post.
Also, there was a stack overflow question on this very topic (language agnostic).
From the C++03 standard:
3.9.1/ [...] The value representation of floating-point types is
implementation-defined
and
5/ [...] If during the evaluation of an expression, the result is not
mathematically defined or not in the range of representable values for
its type, the behavior is undefined, unless such an expression is a
constant expression (5.19), in which case the program is ill-formed.
and
18.2.1.2.4/ (about numeric_limits<T>::max()) Maximum finite value.
This implies that once you add something to std::numeric_limits<T>::max(), the behavior of the program is implementation defined if T is floating point, perfectly defined if T is an unsigned type, and undefined otherwise.
If you happen to have std::numeric_limits<T>::is_iec559 == true, then the behavior is defined by IEEE 754. I don't have it handy, so I cannot tell whether variable1 is finite or infinite in this case. It seems (according to some lecture notes on IEEE 754 on the internet) that it depends on the rounding mode.
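For what it's worth, under the default round-to-nearest mode the result stays finite, because 5 is far below half an ULP near max(); a quick sketch (output assumes IEEE 754 doubles):
#include <cmath>
#include <cstdio>
#include <limits>

int main() {
    double m = std::numeric_limits<double>::max();
    std::printf("%d\n", (int)((m - 50) + 5 == m - 50));  // prints 1: the +5 is rounded away
    std::printf("%g\n", m - std::nextafter(m, 0.0));     // one ULP near max: about 2e292
}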
Please read What Every Computer Scientist Should Know About Floating-Point Arithmetic.

minimum double value in C/C++

Is there a standard and/or portable way to represent the smallest negative value (e.g. to use negative infinity) in a C(++) program?
DBL_MIN in float.h is the smallest positive normalized number.
-DBL_MAX in ANSI C, which is defined in float.h.
Floating point numbers (IEEE 754) are symmetrical, so if you can represent the greatest value (DBL_MAX or numeric_limits<double>::max()), just prepend a minus sign.
And then there is the cool way (type punning; assumes IEEE 754 doubles and, strictly speaking, breaks the aliasing rules):
double f;
(*((uint64_t*)&f)) = ~(1ULL<<52); // 0xFFEFFFFFFFFFFFFF, the bit pattern of -DBL_MAX
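A strict-aliasing-safe variant of the same trick goes through memcpy (still assuming IEEE 754 doubles):
#include <cstdint>
#include <cstring>

double negative_dbl_max() {
    std::uint64_t bits = ~(std::uint64_t{1} << 52);  // sign = 1, max exponent, all-ones mantissa
    double f;
    std::memcpy(&f, &bits, sizeof f);  // defined behavior, unlike the pointer cast above
    return f;  // equals -DBL_MAX
}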
In C, use
#include <float.h>
const double lowest_double = -DBL_MAX;
In C++ prior to C++11, use
#include <limits>
const double lowest_double = -std::numeric_limits<double>::max();
In C++11 and onwards, use
#include <limits>
constexpr double lowest_double = std::numeric_limits<double>::lowest();
Try this:
-1 * numeric_limits<double>::max()
Reference: numeric_limits
This class is specialized for each of the fundamental types, with its members returning or set to the different values that define the properties that type has in the specific platform in which it compiles.
Are you looking for actual infinity or the minimal finite value? If the former, use
-numeric_limits<double>::infinity()
which only works if
numeric_limits<double>::has_infinity
Otherwise, you should use
numeric_limits<double>::lowest()
which was introduced in C++11.
If lowest() is not available, you can fall back to
-numeric_limits<double>::max()
which may differ from lowest() in principle, but normally doesn't in practice.
A truly portable C++ solution
As of C++11 you can use numeric_limits<double>::lowest().
According to the standard, it returns exactly what you're looking for:
A finite value x such that there is no other finite value y where y < x.
Meaningful for all specializations in which is_bounded != false.
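A quick demo (the value shown assumes IEEE 754 doubles):
#include <iostream>
#include <limits>

int main() {
    std::cout << std::numeric_limits<double>::lowest() << '\n';  // -1.79769e+308 on IEEE 754
}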
Lots of non-portable C++ answers here!
There are many answers going for -std::numeric_limits<double>::max().
Fortunately, they will work well in most cases. Floating point encoding schemes decompose a number into a mantissa and an exponent, and most of them (e.g. the popular IEEE 754) use a distinct sign bit, which doesn't belong to the mantissa. This allows transforming the largest positive value into the smallest negative one just by flipping the sign.
Why aren't these portable?
The standard doesn't impose any floating point standard.
I agree that my argument is a little bit theoretic, but suppose that some eccentric compiler maker would use a revolutionary encoding scheme with a mantissa encoded in some variation of two's complement. Two's complement encodings are not symmetric: for example, for a signed 8-bit char the maximum positive value is 127, but the minimum negative is -128. So we could imagine some floating point encoding showing similar asymmetric behavior.
I'm not aware of any encoding scheme like that, but the point is that the standard doesn't guarantee that the sign flipping yields the intended result. So this popular answer (sorry guys!) can't be considered a fully portable standard solution! /* at least not if you didn't assert that numeric_limits<double>::is_iec559 is true */
- std::numeric_limits<double>::max()
should work just fine
Numeric limits
Is there a standard and/or portable way to represent the smallest negative value (e.g. to use negative infinity) in a C(++) program?
C approach.
Many implementations support +/- infinities, so the most negative double value is -INFINITY.
#include <math.h>
double most_negative = -INFINITY;
Is there a standard and/or portable way ....?
Now we need to also consider other cases:
No infinities
Simply -DBL_MAX.
Only an unsigned infinity.
I'd expect that in this case the OP would prefer -DBL_MAX.
De-normal values greater in magnitude than DBL_MAX.
This is an unusual case, likely outside OP's concern. When double is encoded as a pair of floating point values to achieve the desired range/precision (see double-double), there exists a maximum normal double and perhaps a greater de-normal one. I have seen debate about whether DBL_MAX should refer to the greatest normal value or to the greatest of both.
Fortunately this paired approach usually includes an -infinity, so the most negative value remains -INFINITY.
For more portability, code can go down the route
// HUGE_VAL is designed to be infinity or DBL_MAX (when infinities are not implemented)
// .. yet is problematic with an unsigned infinity.
double most_negative1 = -HUGE_VAL;
// Fairly portable, unless system does not understand "INF"
double most_negative2 = strtod("-INF", (char **) NULL);
// Pragmatic
double most_negative3 = strtod("-1.0e999999999", (char **) NULL);
// Somewhat time-consuming
double most_negative4 = pow(-DBL_MAX, 0xFFFF /* odd value */);
// My suggestion
double most_negative5 = (-DBL_MAX)*DBL_MAX;
The original question concerns infinity.
So, why not use
#define Infinity ((double)(42 / 0.0))
according to the IEEE definition?
You can negate that of course.
If you do not have float exceptions enabled (which you shouldn't imho), you can simply say:
double neg_inf = -1/0.0;
This yields negative infinity. If you need a float, you can either cast the result
float neg_inf = (float)-1/0.0;
or use single precision arithmetic
float neg_inf = -1.0f/0.0f;
The result is always the same: there is exactly one representation of negative infinity in both single and double precision, and they convert to each other as you would expect.