Minimum double value in C/C++

Is there a standard and/or portable way to represent the smallest negative value (e.g. to use negative infinity) in a C(++) program?
DBL_MIN in float.h is the smallest positive number, not the most negative one.

-DBL_MAX in ANSI C; DBL_MAX is defined in float.h.

Floating point numbers (IEEE 754) are symmetrical, so if you can represent the greatest value (DBL_MAX or numeric_limits<double>::max()), just prepend a minus sign.
And then there is the cool way, assembling the bit pattern of -DBL_MAX by hand (use memcpy for the type pun; writing through a casted pointer would violate strict aliasing):
#include <cstdint>
#include <cstring>

double f;
const uint64_t bits = ~(1ULL << 52); // sign bit set, maximum finite exponent (0x7FE), all mantissa bits set
std::memcpy(&f, &bits, sizeof f);    // f is now -DBL_MAX

In C, use
#include <float.h>
const double lowest_double = -DBL_MAX;
In C++ before C++11, use
#include <limits>
const double lowest_double = -std::numeric_limits<double>::max();
In C++11 and onwards, use
#include <limits>
constexpr double lowest_double = std::numeric_limits<double>::lowest();
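A quick sanity check of the above, as a sketch (the equality holds on IEEE 754 platforms, where the value range is symmetric):
#include <iostream>
#include <limits>

int main() {
    std::cout << std::boolalpha
              << (std::numeric_limits<double>::lowest()
                  == -std::numeric_limits<double>::max()) // true wherever double is IEEE 754
              << '\n';
}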

Try this:
-1 * numeric_limits<double>::max()
Reference: numeric_limits
This class is specialized for each of the fundamental types, with its members returning or set to the different values that define the properties that type has in the specific platform in which it compiles.

Are you looking for actual infinity or the minimal finite value? If the former, use
-numeric_limits<double>::infinity()
which only works if
numeric_limits<double>::has_infinity
Otherwise, you should use
numeric_limits<double>::lowest()
which was introduced in C++11.
If lowest() is not available, you can fall back to
-numeric_limits<double>::max()
which may differ from lowest() in principle, but normally doesn't in practice.
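Putting that together, a small sketch of the fallback chain described above (the names are exactly the standard ones; only the program around them is mine):
#include <iostream>
#include <limits>

int main() {
    if (std::numeric_limits<double>::has_infinity)
        std::cout << -std::numeric_limits<double>::infinity() << '\n'; // actual -inf
    std::cout << std::numeric_limits<double>::lowest() << '\n';  // C++11: minimal finite value
    std::cout << -std::numeric_limits<double>::max() << '\n';    // pre-C++11 fallback
}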

A truly portable C++ solution
As of C++11 you can use numeric_limits<double>::lowest().
According to the standard, it returns exactly what you're looking for:
A finite value x such that there is no other finite value y where y < x.
Meaningful for all specializations in which is_bounded != false.
Lots of non-portable C++ answers here!
There are many answers going for -std::numeric_limits<double>::max().
Fortunately, they will work well in most cases. Floating point encoding schemes decompose a number into a mantissa and an exponent, and most of them (e.g. the popular IEEE 754) use a distinct sign bit that doesn't belong to the mantissa. This makes it possible to turn the largest positive value into the smallest negative one just by flipping the sign.
Why aren't these portable?
The standard doesn't impose any floating point standard.
I agree that my argument is a little theoretical, but suppose that some eccentric compiler maker used a revolutionary encoding scheme with a mantissa encoded in some variation of two's complement. Two's-complement encodings are not symmetric: for a signed 8-bit char, the maximum positive value is 127, but the minimum negative value is -128. So we could imagine a floating point encoding showing similarly asymmetric behavior.
I'm not aware of any encoding scheme like that, but the point is that the standard doesn't guarantee that flipping the sign yields the intended result. So this popular answer (sorry guys!) can't be considered a fully portable, standard solution! /* at least not if you haven't asserted that numeric_limits<double>::is_iec559 is true */
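If you want to keep the convenient sign flip but make the assumption explicit, a minimal sketch is to assert IEEE 754 conformance at compile time:
#include <limits>

static_assert(std::numeric_limits<double>::is_iec559,
              "double is not IEEE 754, so -max() may not be the lowest value");
constexpr double lowest_double = -std::numeric_limits<double>::max();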

-std::numeric_limits<double>::max()
should work just fine
Reference: Numeric limits

Is there a standard and/or portable way to represent the smallest negative value (e.g. to use negative infinity) in a C(++) program?
C approach.
Many implementations support +/- infinities, so the most negative double value is -INFINITY.
#include <math.h>
double most_negative = -INFINITY;
Is there a standard and/or portable way ....?
Now we need to also consider other cases:
No infinities
Simply -DBL_MAX.
Only an unsigned infinity.
I'd expect that in this case the OP would prefer -DBL_MAX.
De-normal values greater in magnitude than DBL_MAX.
This is an unusual case, likely outside OP's concern. When double is encoded as a pair of floating-point values to achieve the desired range/precision (see double-double), there exists a maximum normal double and perhaps a greater de-normal one. I have seen debate about whether DBL_MAX should refer to the greatest normal value or the greatest of both.
Fortunately this paired approach usually includes a -infinity, so the most negative value remains -INFINITY.
For more portability, code can go down this route:
#include <float.h>
#include <math.h>
#include <stdlib.h>

// HUGE_VAL is designed to be infinity, or DBL_MAX when infinities are not implemented,
// .. yet is problematic with an unsigned infinity.
double most_negative1 = -HUGE_VAL;
// Fairly portable, unless the system does not understand "INF"
double most_negative2 = strtod("-INF", (char **) NULL);
// Pragmatic
double most_negative3 = strtod("-1.0e999999999", (char **) NULL);
// Somewhat time-consuming
double most_negative4 = pow(-DBL_MAX, 0xFFFF /* odd value */);
// My suggestion
double most_negative5 = (-DBL_MAX)*DBL_MAX;

The original question concerns infinity.
So, why not use
#define Infinity ((double)(42 / 0.0))
according to the IEEE definition?
You can negate that of course.

If you do not have float exceptions enabled (which you shouldn't imho), you can simply say:
double neg_inf = -1/0.0;
This yields negative infinity. If you need a float, you can cast the result
float neg_inf = (float)(-1/0.0);
or use single precision arithmetic
float neg_inf = -1.0f/0.0f;
The result is always the same: there is exactly one representation of negative infinity in both single and double precision, and they convert to each other as you would expect.
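A little demo sketch (assuming IEEE 754 and floating point exceptions disabled, as above; your compiler may warn about the constant division by zero) showing that the different spellings all produce the same negative infinity:
#include <cmath>
#include <iostream>

int main() {
    double d = -1 / 0.0;      // double negative infinity
    float  f = -1.0f / 0.0f;  // single precision negative infinity
    std::cout << std::boolalpha
              << (d == -INFINITY) << ' '                 // true
              << (static_cast<double>(f) == d) << '\n';  // true: widening preserves -inf
}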


If std::numeric_limits<float>::is_iec559 is true, does that mean that I can extract exponent and mantissa in a well defined way?

I've built a custom version of frexp:
#include <cstdint>   // uint32_t, int32_t
#include <cstring>   // std::memcpy
#include <limits>    // std::numeric_limits
#include <utility>   // std::make_pair

auto frexp(float f) noexcept
{
    static_assert(std::numeric_limits<float>::is_iec559);
    constexpr uint32_t ExpMask = 0xff;
    constexpr int32_t ExpOffset = 126;
    constexpr int MantBits = 23;
    uint32_t u;
    std::memcpy(&u, &f, sizeof(float)); // well defined bit transformation from float to int
    int exp = ((u >> MantBits) & ExpMask) - ExpOffset; // extract the 8 exponent bits (offset 126 rather than the IEEE bias 127, so the mantissa lands in [0.5, 1))
    // divide by 2^exp (leaving the mantissa intact while zeroing the exponent relative to the offset)
    u &= ~(ExpMask << MantBits); // zero out the exponent bits
    u |= ExpOffset << MantBits;  // place 126 into the exponent bits (exponent 0 relative to the chosen offset)
    std::memcpy(&f, &u, sizeof(float)); // copy back to f
    return std::make_pair(exp, f);
}
By checking is_iec559 I'm making sure that float fulfills
the requirements of IEC 559 (IEEE 754) standard.
My question is: Does this mean that the bit operations I'm doing are well defined and do what I want? If not, is there a way to fix it?
I tested it for some random values and it seems to be correct, at least on Windows 10 compiled with msvc and on wandbox. Note however, that (on purpose) I'm not handling the edge cases of subnormals, NaN, and inf.
If anyone wonders why I'm doing this: in benchmarks I found that this version of frexp is up to 15 times faster than std::frexp on Windows 10. I haven't tested other platforms yet. But I want to make sure that it doesn't just work by coincidence and might break in the future.
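For reference, a minimal sketch of the kind of test I ran, comparing against std::frexp (assumes C++17 for the structured bindings and that the function above is in scope):
#include <cmath>
#include <iostream>

int main() {
    for (float x : {0.25f, 1.0f, 3.14159f, 12345.678f}) {
        int std_exp;
        float std_mant = std::frexp(x, &std_exp);
        auto [my_exp, my_mant] = frexp(x); // picks the custom float overload above
        std::cout << x << ": std=(" << std_mant << ", " << std_exp
                  << ") custom=(" << my_mant << ", " << my_exp << ")\n";
    }
}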
Edit:
As mentioned in the comments, endianness could be an issue. Does anybody know?
"Does this mean that the bit operations I'm doing are well defined..."
The TL;DR, by the strict definition of "well defined": no.
Your assumptions are likely correct, but not well defined, because there are no guarantees about the bit width or the implementation of float. From § 3.9.1:
there are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined.
The is_iec559 clause only qualifies with:
True if and only if the type adheres to IEC 559 standard
If a literal genie wrote you a terrible compiler, and made float = binary16, double = binary32, and long double = binary64, and made is_iec559 true for all the types, it would still adhere to the standard.
does that mean that I can extract exponent and mantissa in a well defined way?
The TL;DR, by the limited guarantees of the C++ standard: no.
Assume you use float32_t, and is_iec559 is true, and you logically deduced from all the rules that it could only be binary32 with no trap representations, and you correctly argued that memcpy is well defined for conversion between arithmetic types of the same width and won't break strict aliasing. Even with all those assumptions, the behavior might be well defined, but it's only likely, not guaranteed, that you can extract the mantissa this way.
The IEEE 754 standard and 2's complement regard bit string encodings, and the behavior of memcpy is described using bytes. While it's plausible to assume the bit string of uint32_t and float32_t would be encoded the same way (e.g. same endianness), there's no guarantee in the standard for that. If the bit strings are stored differently and you shift and mask the copied integer representation to get the mantissa, the answer will be incorrect, despite the memcpy behavior being well defined.
As mentioned in the comments, endianness could be an issue. Does anybody know?
At least a few architectures have used different endianness for floating point registers and integer registers. The same link says that except for small embedded processors, this isn't a concern. I trust Wikipedia entirely for all topics and refuse to do any further research.

Using scientific notation in for loops

I've recently come across some code which has a loop of the form
for (int i = 0; i < 1e7; i++){
}
I question the wisdom of doing this since 1e7 is of floating point type, and will cause i to be promoted when evaluating the stopping condition. Should this be cause for concern?
The elephant in the room here is that the range of an int could be as small as -32767 to +32767, and the behaviour on assigning a larger value than this to such an int is undefined.
But, as for your main point: indeed it should concern you, as it is a very bad habit. Things could go wrong, since yes, 1e7 is a floating point literal of type double.
The fact that i will be converted to a floating point due to type promotion rules is somewhat moot: the real damage is done if there is unexpected truncation of the apparent integral literal. By way of a "proof by example", consider first the loop
for (std::uint64_t i = std::numeric_limits<std::uint64_t>::max() - 1024; i ++< 18446744073709551615ULL; ){
std::cout << i << "\n";
}
This outputs every consecutive value of i in the range, as you'd expect. Note that std::numeric_limits<std::uint64_t>::max() is 18446744073709551615ULL, which is 1 less than the 64th power of 2. (Here I'm using a slide-like "operator" ++< which is useful when working with unsigned types. Many folk consider --> and ++< as obfuscating but in scientific programming they are common, particularly -->.)
Now on my machine, a double is an IEEE 754 64 bit floating point. (Such a scheme is particularly good at representing powers of 2 exactly: a double can represent powers of 2 up to 2^1023 exactly.) So 18,446,744,073,709,551,616 (the 64th power of 2) can be represented exactly as a double, and the literal 1.8446744073709551615e19 rounds to exactly that value. The nearest representable double below it is 18,446,744,073,709,549,568 (which is 2048 less).
So now let's write the loop as
for (std::uint64_t i = std::numeric_limits<std::uint64_t>::max() - 1024; i ++< 1.8446744073709551615e19; ){
std::cout << i << "\n";
}
On my machine that will only output one value of i: 18,446,744,073,709,550,592. The starting value 18,446,744,073,709,550,591 converts to the double 18,446,744,073,709,549,568 and passes the test, but every subsequent value of i converts (rounding to nearest) to 2^64 itself, so the comparison fails. This proves that 1.8446744073709551615e19 is of floating point type. If the compiler were allowed to treat the literal as an integral type, the output of the two loops would be equivalent.
It will work, assuming that your int is at least 32 bits.
However, if you really want to use exponential notation, you would do better to define an integer constant outside the loop and use proper casting, like this:
const int MAX_INDEX = static_cast<int>(1.0e7);
...
for (int i = 0; i < MAX_INDEX; i++) {
...
}
Considering this, I'd say it is much better to write
const int MAX_INDEX = 10000000;
or if you can use C++14
const int MAX_INDEX = 10'000'000;
1e7 is a literal of type double, and usually double is 64-bit IEEE 754 format with a 52-bit mantissa. Roughly every tenth power of 2 corresponds to a third power of 10, so a double should be able to represent integers up to at least 10^(5·3) = 10^15 exactly. And if int is 32-bit then int has roughly 10^(3·3) = 10^9 as its max value (asking Google, "2**31 - 1" = 2 147 483 647, i.e. about twice the rough estimate).
So, in practice it's safe on current desktop systems and larger.
But C++ allows int to be just 16 bits, and on e.g. an embedded system with that small int, one would have Undefined Behavior.
If the intention is to loop for an exact integer number of iterations, for example when iterating over exactly all the elements in an array, then comparing against a floating point value is maybe not such a good idea, solely for accuracy reasons: the implicit conversion of an integer to floating point rounds to a nearby representable value, so there's typically no real danger of out-of-bounds access; the loop will just stop short.
Now the question is: when do these effects actually kick in? Will your program experience them? The floating point representation usually used these days is IEEE 754. For integers of moderate magnitude, a floating point value is exact. C double precision floats have 52 explicit mantissa bits, which (with the implicit leading bit) gives you exact integer precision up to 2^53, on the order of 9·10^15. Without the suffix f to request a single precision literal, the literal will be double precision, and the implicit conversion will target that as well. So as long as your loop end condition is less than 2^53 it will work reliably!
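To see that boundary concretely, a small sketch (assuming an IEEE 754 double):
#include <iostream>

int main() {
    double d = 9007199254740992.0;       // 2^53, the last point of exact integer precision
    std::cout << (d + 1.0 == d) << '\n'; // 1: 2^53 + 1 is not representable, rounds back down
    std::cout << (d - 1.0 == d) << '\n'; // 0: 2^53 - 1 is still exactly representable
}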
Now one question you have to think about on the x86 architecture is efficiency. The very first 80x87 FPUs came as a separate package (and later a separate chip), and as a result getting values into the FPU registers is a bit awkward at the x86 assembly level. Depending on what your intentions are it might make a difference in runtime for a realtime application; but that's premature optimization.
TL;DR: Is it safe to do? Most certainly yes. Will it cause trouble? It could cause numerical problems. Could it invoke undefined behavior? Depends on how you use the loop end condition; but if i is used to index an array and the array length somehow ended up in a floating point variable, the conversion is not by itself going to cause a logical problem. Is it a smart thing to do? Depends on the application.

Is a 32 bit normalized floating point number that hasn't been operated on the same on any platform/compiler?

For example:
float a = 3.14159f;
If I was to inspect the bits in this number (or any other normalized floating point number), what are the chances that the bits are different in some other platform/compiler combination, or is that possible?
Not necessarily: the C++ standard doesn't define the floating point representation (it doesn't even define the representation of signed integers), although most platforms probably orient themselves toward the same IEEE standard (IEEE 754-2008?).
Your question can be rephrased as: Will the final assertion in the following code always be upheld, no matter what platform you run it on?
#include <cassert>
#include <cstring>
#include <cstdint>
#include <limits>
#if __cplusplus < 201103L // no static_assert prior to C++11
#define static_assert(a,b) assert(a)
#endif
int main() {
    float f = 3.14159f;
    std::uint32_t i = 0x40490fd0; // IEC 559/IEEE 754 representation of 3.14159f
    static_assert(std::numeric_limits<float>::is_iec559, "floating point must be IEEE 754");
    static_assert(sizeof(f) == sizeof(i), "float must be 32 bits wide");
    assert(std::memcmp(&f, &i, sizeof(f)) == 0);
}
Answer: There's nothing in the C++ standard that guarantees that the assertion will be upheld. Yet, on most sane platforms the assertion will hold and the code won't abort, no matter if the platform is big- or little-endian. As long as you only care that your code works on some known set of platforms, it'll be OK: you can verify that the tests pass there :)
Realistically speaking, some compilers might use a sub-par decimal-to-IEEE-754 conversion routine that doesn't properly round the result, so if you specify f to enough digits of precision, it might be a couple of LSBs of mantissa off from the value that would be nearest to the decimal representation. And then the assertion won't hold anymore. For such platforms, you might wish to test a couple mantissa LSBs around the desired one.
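A sketch of that tolerant check, reusing the constant from the example above (the ±2 LSB tolerance is an arbitrary choice):
#include <cassert>
#include <cstdint>
#include <cstring>

int main() {
    float f = 3.14159f;
    std::uint32_t actual, expected = 0x40490fd0;
    std::memcpy(&actual, &f, sizeof f);
    // allow the compiler's decimal-to-binary conversion to be a couple of mantissa LSBs off
    std::int64_t diff = static_cast<std::int64_t>(actual) - static_cast<std::int64_t>(expected);
    assert(diff <= 2 && diff >= -2);
}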

Fast compare IEEE float greater than zero by cheating

I am working on a platform that has terrible stalls when comparing floats to zero. As an optimization I have seen the following code used:
inline bool GreaterThanZero( float value )
{
    const int value_as_int = *(int*)&value;
    return ( value_as_int > 0 );
}
Looking at the generated assembly the stalls are gone and the function is more performant.
Does this work? I'm confused because all of the optimizations for IEEE tricks use SIGNMASKS and lots of AND/OR operations (https://www.lomont.org/papers/2005/CompareFloat.pdf for example). Does the cast to a signed int help? Testing in a simple harness detects no problems.
Any insight would be good.
The expression *(int*)&value > 0 tests if value is any positive float, from the smallest positive denormal (which has the same representation as 0x00000001) to the largest finite float (with representation 0x7f7fffff) and +inf (which has the same representation as 0x7f800000). The trick detects as positive a number of, but not all, NaN representations (the NaN representations above 0x7f800001). It is fine if you don't care about some values of NaN making the test true.
This all works because of the representation of IEEE 754 formats.
The bit manipulation functions that you saw in the literature for the purpose of emulating IEEE 754 operations were probably aiming for perfect emulation, taking into account the particular behaviors of NaN and signed zeroes. For instance, the variation *(int*)&value >= 0 would not be equivalent to value >= 0.0f, because -0.0f, represented as 0x80000000 as an unsigned int and thus as -0x80000000 as a signed one, makes the latter condition true and the former one false. This can make such functions quite complicated.
Does the cast to a signed int help?
Well, yes, because the sign bits of float and int are in the same place and both indicate a positive number when unset. But the condition value > 0.0f could be implemented by re-interpreting value as an unsigned integer too.
Note: the conversion to int* of the address of value breaks strict aliasing rules, but this may be acceptable if your compiler guarantees that it gives meaning to these programs (perhaps with a command-line option).
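If the aliasing issue is a concern, a sketch of the same trick with a well-defined type pun via memcpy (C++20's std::bit_cast would work too); compilers typically optimize the copy away:
#include <cstring>

inline bool GreaterThanZero(float value)
{
    int value_as_int;                                        // assumes 32-bit int, IEEE 754 float
    std::memcpy(&value_as_int, &value, sizeof value_as_int); // defined behavior, unlike the cast
    return value_as_int > 0;
}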

Double variables in C++ are showing equal even when they are not

I just wrote the following code in C++:
double variable1;
double variable2;
variable1 = numeric_limits<double>::max() - 50;
variable2 = variable1;
variable1 = variable1 + 5;
cout << "\nVariable1==Variable2 ? " << (variable1 == variable2);
The answer to the cout statement comes out as 1, even though variable1 and variable2 are not equal. Can someone help me with this? Why is this happening?
I knew about the concept of imprecise floating point math but didn't think this would happen when comparing two doubles directly. Also I am getting the same result when I replace variable1 with:
double variable1=(numeric_limits<double>::max()-10000000000000);
The comparison still shows them as equal. How much would I have to subtract to see them start differing?
The maximum value for a double is 1.7976931348623157E+308. Due to lack of precision, adding and removing small values such as 50 and 5 does not actually change the value of the variable. Thus they stay the same.
There isn't enough precision in a double to differentiate between M and M-45 where M is the largest value that can be represented by a double.
Imagine you're counting atoms to the nearest million. "123,456 million atoms" plus 1 atom is still "123,456 million atoms" because there's no space in the "millions" counting system for the 1 extra atom to make any difference.
numeric_limits<double>::max()
is a huuuuuge number. But the greater the absolute value of a double, the smaller its precision gets. Apparently in this case max-50 and max-45 are indistinguishable from double's point of view. (Near max, adjacent doubles are 2^971 apart, roughly 2·10^292, so that is the order of magnitude you would have to subtract before the values start differing.)
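To put a number on "huuuuuge", a sketch (assuming an IEEE 754 double) that measures the gap between max() and its nearest neighbor:
#include <cmath>
#include <iostream>
#include <limits>

int main() {
    double m = std::numeric_limits<double>::max();
    std::cout << (m - 50 == m) << '\n';              // 1: 50 is far below half a gap here
    std::cout << m - std::nextafter(m, 0.0) << '\n'; // ~1.99e292, the gap at this magnitude
}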
You should read the floating point comparison guide. In short, here are some examples:
float a = 0.15 + 0.15;
float b = 0.1 + 0.2;
if (a == b) // can be false!
if (a >= b) // can also be false!
The comparison with an epsilon value is what most people do.
#include <cmath>

#define EPSILON 0.00000001

bool AreSame(double a, double b)
{
    return fabs(a - b) < EPSILON;
}
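A fixed absolute epsilon breaks down for very large or very small magnitudes (as in your max() example), which is why the linked guide also suggests a relative comparison. A sketch (AreSameRelative is a hypothetical name, not from the guide):
#include <cmath>

bool AreSameRelative(double a, double b, double relEps = 1e-9)
{
    // scale the tolerance by the larger magnitude of the two operands
    return std::fabs(a - b) <= relEps * std::fmax(std::fabs(a), std::fabs(b));
}
For example, AreSameRelative(0.15 + 0.15, 0.1 + 0.2) is true, even though the two sums differ in their last bit.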
In your case, that max value is REALLY big. Adding or subtracting 50 does nothing. Thus they look the same because of the size of the number. See #RichieHindle's answer.
Here are some additional resources for research.
See this blog post.
Also, there was a stack overflow question on this very topic (language agnostic).
From the C++03 standard:
3.9.1/ [...] The value representation of floating-point types is
implementation-defined
and
5/ [...] If during the evaluation of an expression, the result is not mathematically defined or not in the range of representable values for its type, the behavior is undefined, unless such an expression is a constant expression (5.19), in which case the program is ill-formed.
and
18.2.1.2.4/ (about numeric_limits<T>::max()) Maximum finite value.
This implies that once you add something to std::numeric_limits<T>::max(), the behavior of the program is implementation defined if T is floating point, perfectly defined if T is an unsigned type, and undefined otherwise.
If you happen to have std::numeric_limits<T>::is_iec559 == true, then the behavior is defined by IEEE 754. I don't have it handy, so I cannot tell whether variable1 is finite or infinite in this case. It seems (according to some lecture notes on IEEE 754 on the internet) that it depends on the rounding mode: under the default round-to-nearest, max() + 5 rounds back to max(), since 5 is far smaller than half the gap between adjacent doubles at that magnitude.
Please read What Every Computer Scientist Should Know About Floating-Point Arithmetic.