Accurately creating a floating point value in the range [a,b) - c++

Consider
auto x = a + (b-a)*v;
which is meant to create a value in the range [a,b) from the factor v in [0,1.0). From a purely mathematical point of view, x >= a and x < b. But how do we prove or ensure that this also holds in floating point?
a, b, and v are non-negative, finite floating point values of the same type (double or float), b > a (I originally said b >= a, which is obviously incompatible with my requirements on x), and v <= nextafter(1.0, 0) (that is, v is just below 1.0).
It seems obvious that b-a > 0, and therefore (b-a)*v >= 0, so that we don't need to check:
if (x<a) return a;
But is this also redundant?
if (x>=b) return std::nextafter(b,a);
Might the compiler (under optimization) rewrite the expressions in a way that affects these questions?
Does the floating-point representation matter? (I'm mostly interested in the most common one, iec559 / IEEE 754.)
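For concreteness, a minimal sketch of the complete function under discussion (mapToRange is just an illustrative name, and IEEE 754 double is assumed):
#include <cmath>

// Illustrative sketch only; the two guarded returns are exactly the
// checks whose necessity is being asked about.
double mapToRange(double a, double b, double v)
{
    double x = a + (b - a) * v;
    if (x < a)                       // is this check redundant?
        return a;
    if (x >= b)                      // and is this one?
        return std::nextafter(b, a); // largest representable value below b
    return x;
}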

It seems obvious that b-a > 0, and therefore (b-a)*v >= 0, so that we don't need to check: if (x<a) return a;
The property b - a > 0 is correct in IEEE 754 but I wouldn't say that it is obvious. At the time of floating-point standardization, Kahan fought for this property resulting from “gradual underflow” to be true. Other proposals did not have subnormal numbers and did not make it true. You could have b > a and b - a == 0 in these other proposals, for instance by taking a to be the smallest positive number and b its successor.
Even without gradual underflow, on a system that mis-implements IEEE 754 by flushing subnormals to zero, b > a implies b - a >= 0, so that there is no need to worry about x being below a.
But is this also redundant? if (x>=b) return std::nextafter(b,a);
This test is not redundant even in IEEE 754. For an example take b to be the successor of a. For all values of v above 0.5, in the default round-to-nearest mode, the result of a + (b-a)*v is b, which you were trying to avoid.
My examples are built with unusual values for a and b because that saves me from writing a program to find counter-examples through brute-force, but do not assume that other, more likely, pairs of values for a and b do not exhibit the problem. If I was looking for additional counter-examples, I would in particular look for pairs of values for which the floating-point subtraction b - a rounds up.
EDIT: Oh alright, here is another counter-example:
Take a to be the successor of the successor of -1.0 (that is, in double-precision, using C99's hexadecimal notation, -0x1.ffffffffffffep-1) and b to be 3.0. Then b - a rounds up to 4.0, and taking v to be the predecessor of 1.0, a + (b - a) * v rounds up to 3.0.
The floating-point subtraction b - a rounding up is not necessary for some values of a and b making a counter-example, as shown here: taking a as the successor of 1.0 and b as 2.0 also works.
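A small self-contained demonstration of that last counter-example (a sketch, assuming IEEE 754 double and the default round-to-nearest mode):
#include <cmath>
#include <iostream>

int main()
{
    double a = std::nextafter(1.0, 2.0); // successor of 1.0
    double b = 2.0;
    double v = std::nextafter(1.0, 0.0); // predecessor of 1.0

    double x = a + (b - a) * v;
    std::cout << (x == b) << '\n'; // prints 1: x has rounded up to b
}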

Related

Can multiplying a pair of almost-one values ever yield a result of 1.0?

I have two floating point values, a and b. I can guarantee they are values in the domain (0, 1). Is there any circumstance where a * b could equal one? I intend to calculate 1/(1 - a * b), and wish to avoid a divide by zero.
My instinct is that it cannot, because the result should be equal to or smaller than a or b. But instincts are a poor replacement for understanding the correct behavior.
I do not get to specify the rounding mode, so if there's a rounding mode where I could get into trouble, I want to know about it.
Edit: I did not specify whether the compiler was IEEE compliant or not because I cannot guarantee that the compiler/CPU running my software will indeed be IEEE compliant.
I have two floating point values, a and b…
Since this says we have “values,” not “variables,” it admits the possibility that a*b may evaluate to 1 (and hence 1 - a*b to 0). When writing about software, people sometimes use names as placeholders for more complicated expressions. For example, one might have an expression a that is sin(x)/x and an expression b that is 1-y*y and then ask about computing 1 - a*b when the code is actually 1 - (sin(x)/x)*(1-y*y). This would be a problem because C++ allows extra precision to be used when evaluating floating-point expressions.
The most common instances of this are that the compiler uses long double arithmetic while computing expressions containing double operands, or that it uses a fused multiply-add instruction while computing an expression of the form x + y*z.
Suppose expressions a and b have been computed with excess precision and are positive values less than 1 in that excess precision. E.g., for illustration, suppose double were implemented with four decimal digits but a and b were computed with long double with six decimal digits. a and b could both be .999999. Then a*b is .999998000001 before rounding, .999998 after rounding to six digits. Now suppose that at this point in the computation, the compiler converts from long double to double, perhaps because it decides to store this intermediate value on the stack temporarily while it computes some other things from nearby expressions. Converting it to four-digit double produces 1.000, because that is the four-decimal-digit number nearest .999998. When the compiler later loads this from the stack and continues evaluation, we have 1 - 1.000, and the result is zero.
On the other hand, if a and b are variables, I expect your expression is safe. When a value is assigned to a variable or is converted with a cast operation, the C++ standard requires it to be converted to the nominal type; the result must be a value in the nominal type, without any “extra precision.” Then, given 0 < a < 1 and 0 < b < 1, the mathematical value (that is, without floating-point rounding) a•b is less than a and is less than b. Then rounding of a•b to the nominal type cannot produce a value greater than a or b with any IEEE-754 rounding method, so it cannot produce 1. (The only requirement here is that the rounding method never skip over values—it might be constrained to round in a particular direction, upward or downward or toward zero or whatever, but it never goes past a representable value in that direction to get to a value farther away from the unrounded result. Since we know a•b is bounded above by both a and b, rounding cannot produce any result greater than the lesser of a and b.)
Formally, the C++ standard does not impose any requirements on the accuracy of floating-point results. So a C++ implementation could use a bonkers rounding mode that produced 3.14 for .9*.9. Aside from implementations flushing subnormals to zero, I am not aware of any C++ implementations that do not obey the requirement above. Flushing subnormals to zero will not affect calculations in 1 - a*b when a and b are near 1. (In a perverse floating-point format, with an exponent range narrower than the significand and no subnormal values, .9999 could be representable while .0001 is not because the exponent required for it is out of range. Then 1-.9999*.9999, which would produce .0002 in normal four-digit arithmetic, would produce 0 due to underflow. No such formats are in normal hardware.)
So, if a and b are variables, 0 < a < 1 and 0 < b < 1, and your C++ implementation is reasonable (may use extra precision, may flush subnormals, does not use perverse floating-point formats or rounding), then 1 - a*b does not evaluate to zero.
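As a hedged illustration of the variables-versus-values distinction: when a and b really stand for larger expressions, routing them through named variables (or casts) forces them to be rounded to the nominal type, discarding any excess precision before the final subtraction. The expressions sin(x)/x and 1 - y*y below are just the placeholders from the example above.
#include <cmath>

double safe_ratio(double x, double y)
{
    double a = std::sin(x) / x; // assignment rounds to double, dropping any excess precision
    double b = 1.0 - y * y;     // likewise
    // Provided a and b end up in (0, 1), a*b cannot round up to 1,
    // so the denominator below cannot be zero.
    return 1.0 / (1.0 - a * b);
}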
There is a mathematical proof that it will never be >= 1. I don't have it handy.... you may want to ask on the math stack overflow site if you are interested in studying the proof. But your instincts are correct. It will never be >= 1.
Now, we must be careful because floating point arithmetic is only an approximation of math and has limitations. I'm not an expert on these limitations, but the floating-point standard is very carefully designed and provides certain guarantees. I'm pretty sure one of them includes (or implies) that x * y where x < 1 and y < 1 is guaranteed to be < 1.
You can check that even when taking the largest float or double below 1 and multiplying it by itself, the result is still below 1. Multiplying any smaller numbers must give an even smaller result.
Here is the code I ran, with the results in comments:
#include <cmath>

float a = std::nextafterf(1.0f, 0.0f); // 0.999999940
double b = std::nextafter(1.0, 0.0);   // 0.99999999999999989
float c = a * a;                       // 0.999999881
double d = b * b;                      // 0.99999999999999978

Behaviour of negative zero (-0.0) in comparison with positive zero (+0.0)

In my code,
float f = -0.0; // Negative
compared with negative zero
f == -0.0f
the result is true.
But with
float f = 0.0; // Positive
compared with negative zero
f == -0.0f
the result is also true, instead of false.
Why is the result true in both cases?
Here is a MCVE to test it (live on coliru):
#include <iostream>

int main()
{
    float f = -0.0;
    std::cout << "==== > " << f << std::endl << std::endl;

    if (f == -0.0f)
    {
        std::cout << "true" << std::endl;
    }
    else
    {
        std::cout << "false" << std::endl;
    }
}
Output:
==== > -0 // negative zero is printed here
true
C++11 introduced functions like std::signbit() which can detect signed zeros, and std::copysign() which can copy the sign bit between floating point values, if the implementation supports signed zero (e.g. due to using IEEE floating point). The specifications of those functions don't actually require that an implementation support distinct positive and negative zeros. That sort of thing aside, I'm unaware of any references in a C++ standard that even mentions signed zeros, let alone what should be the result of comparing them.
The C++ standards also do not stipulate any floating point representation - that is implementation-defined.
Although not definitive, these observations suggest that support of signed zeros, or the result of comparing them, would be determined by what floating point representation the implementation supports.
IEEE-754 is a common (albeit not the only) floating point representation used by modern implementations (i.e. compilers on their host systems). The current (published in 2008) version of IEEE-754, "IEEE Standard for Floating-Point Arithmetic", Section 5.11, second paragraph, says (emphasis mine):
Four mutually exclusive relations are possible: less than, equal, greater than, and unordered. The last case arises when at least one operand is NaN. Every NaN shall compare unordered with everything, including itself. Comparisons shall ignore the sign of zero (so +0 = −0). Infinite operands of the same sign shall compare equal.
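For illustration, a short sketch (assuming an IEEE 754 implementation) showing that == ignores the sign of zero while std::signbit and std::copysign can still observe it:
#include <cmath>
#include <iostream>

int main()
{
    float pz = 0.0f;
    float nz = -0.0f;

    std::cout << (pz == nz) << '\n';              // 1: comparison ignores the sign of zero
    std::cout << std::signbit(pz) << '\n';        // 0
    std::cout << std::signbit(nz) << '\n';        // 1: the two zeros are still distinguishable
    std::cout << std::copysign(1.0f, nz) << '\n'; // -1: sign bit taken from the negative zero
}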
Floating point arithmetic in C++ is often IEEE-754. This standard differs from the mathematical definition of the real number set.
This standard defines two different representations for the value zero: positive zero and negative zero. It also requires that those two representations compare equal, so by definition:
+0.0 == -0.0
As to why it is so, in his paper What Every Computer Scientist Should Know About Floating Point Arithmetic, David Goldberg, 1991-03 (linked from the IEEE-754 page on the IEEE website), writes:
In IEEE arithmetic, it is natural to define log 0 = -∞ and log x to be a NaN when x < 0. Suppose that x represents a small negative number that has underflowed to zero. Thanks to signed zero, x will be negative, so log can return a NaN. However, if there were no signed zero, the log function could not distinguish an underflowed negative number from 0, and would therefore have to return -∞.
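In the same spirit, a small hedged example (assuming IEC 559 semantics, where dividing a finite nonzero value by zero yields an infinity) of a place where the sign of zero is observable even though the two zeros compare equal:
#include <iostream>

int main()
{
    std::cout << (0.0 == -0.0) << '\n'; // 1: they compare equal
    std::cout << 1.0 / 0.0 << '\n';     // inf
    std::cout << 1.0 / -0.0 << '\n';    // -inf: the sign of the zero survives
}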
That's because a signed negative zero must compare equal to zero: i.e. -0.0 == 0.0, -0.0f == 0.0f, and -0.0L == 0.0L are all true.
It's a requirement of any floating point scheme supported by a C++ compiler.
(Note that most platforms these days use IEEE754 floating point, and this behaviour is explicitly documented in that specification.)
Because 0.0f and -0.0f are the same value: the negative of a zero is still zero.

Floating Point, is an equality comparison enough to prevent division by zero?

// value will always be in the range of [0.0 - maximum]
float obtainRatio(float value, float maximum){
    if(maximum != 0.f){
        return value / maximum;
    }else{
        return 0.f;
    }
}
The range of maximum can be anything, including negative numbers. The range of value can also be anything, though the function is only required to make "sense" when the input is in the range of [0.0 - maximum]. The output should always be in the range of [0.0 - 1.0].
I have two questions about this:
Is this equality comparison enough to ensure the function never divides by zero?
If maximum is a degenerate value (extremely small or extremely large), is there a chance the function will return a result outside of [0.0 - 1.0] (assuming value is in the right range)?
Here is a late answer clarifying some concepts in relation to the question:
Just return value / maximum
In floating-point, division by zero is not a fatal error like integer division by zero is.
Since you know that value is between 0.0 and maximum, the only division by zero that can occur is 0.0 / 0.0, which is defined as producing NaN. The floating-point value NaN is a perfectly acceptable value for function obtainRatio to return, and is in fact a much better exceptional value to return than 0.0, as your proposed version is returning.
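A minimal sketch of that simplification (assuming IEEE 754 floats, where 0.0f / 0.0f quietly produces NaN rather than trapping):
#include <cmath>

float obtainRatio(float value, float maximum)
{
    // Under the precondition 0 <= value <= maximum, the only possible
    // division by zero is 0.0f / 0.0f, which yields NaN.
    return value / maximum;
}

// A caller can detect the exceptional case if it needs to:
// if (std::isnan(obtainRatio(v, m))) { /* both inputs were zero */ }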
Superstitions about floating-point are only superstitions
There is nothing approximate about the definition of <= between floats. a <= b does not sometimes evaluate to true when a is just a little above b. If a and b are two finite float variables, a <= b evaluates to true exactly when the rational represented by a is less than or equal to the rational represented by b. The only little glitch one may perceive is actually not a glitch but a strict interpretation of the rule above: +0.0 <= -0.0 evaluates to true, because “the rational represented by +0.0” and “the rational represented by -0.0” are both 0.
Similarly, there is nothing approximate about == between floats: two finite float variables a and b make a == b true if and only if the rational represented by a and the rational represented by b are the same.
Within an if (f != 0.0) condition, the value of f cannot be a representation of zero, and thus a division by f cannot be a division by zero. The division can still overflow. In the particular case of value / maximum, there cannot be an overflow because your function requires 0 ≤ value ≤ maximum. And we don't need to wonder whether ≤ in the precondition means the relation between rationals or the relation between floats, since the two are essentially the same.
This said
C99 allows extra precision for floating-point expressions, which has been in the past wrongly interpreted by compiler makers as a license to make floating-point behavior erratic (to the point that the program if (m != 0.) { if (m == 0.) printf("oh"); } could be expected to print “oh” in some circumstances).
In reality, a C99 compiler that offers IEEE 754 floating-point and defines FLT_EVAL_METHOD to a nonnegative value cannot change the value of m after it has been tested. The variable m was set to a value representable as float when it was last assigned, and that value either is a representation of 0 or it isn't. Only operations and constants can have excess precision (See the C99 standard, 5.2.4.2.2:8).
In the case of GCC, recent versions do what is proper with -fexcess-precision=standard, implied by -std=c99.
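A quick way to see what evaluation method your implementation uses (a sketch; FLT_EVAL_METHOD comes from <cfloat>):
#include <cfloat>
#include <cstdio>

int main()
{
    // 0: evaluate in the nominal type; 1: evaluate float and double in double;
    // 2: evaluate everything in long double; negative: indeterminable.
    std::printf("FLT_EVAL_METHOD = %d\n", FLT_EVAL_METHOD);
}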
Further reading
David Monniaux's description of the sad state of floating-point in C a few years ago (first version published in 2007). David's report does not try to interpret the C99 standard but describes the reality of floating-point computation in C as it was then, with real examples. The situation has much improved since, thanks to improved standard-compliance in compilers that care and thanks to the SSE2 instruction set that renders the entire issue moot.
The 2008 mailing list post by Joseph S. Myers describing the then-current situation with floats in GCC (bad), how he interpreted the standard (good) and how he was implementing his interpretation in GCC (GOOD).
In this case, with the limited range, it should be OK. In general, a check for zero will prevent division by zero, but there is still a chance of overflow if the divisor is close to zero and the dividend is a large number. In this case, however, the dividend will be small whenever the divisor is small (both can be close to zero without causing overflow).

double variables in c++ are showing equal even when they are not

I just wrote the following code in C++:
double variable1;
double variable2;
variable1 = numeric_limits<double>::max() - 50;
variable2 = variable1;
variable1 = variable1 + 5;
cout << "\nVariable1==Variable2 ? " << (variable1 == variable2);
The answer to the cout statement comes out 1, even when variable2 and variable1 are not equal. Can someone help me with this? Why is this happening?
I knew about imprecise floating-point math, but didn't think this would happen when comparing two doubles directly. Also, I am getting the same result when I replace variable1 with:
double variable1=(numeric_limits<double>::max()-10000000000000);
The comparison still shows them as equal. How much would I have to subtract to see them start differing?
The maximum value for a double is 1.7976931348623157E+308. Due to lack of precision, adding or subtracting small values such as 50 and 5 does not actually change the value of the variable. Thus the two variables stay the same.
There isn't enough precision in a double to differentiate between M and M-45 where M is the largest value that can be represented by a double.
Imagine you're counting atoms to the nearest million. "123,456 million atoms" plus 1 atom is still "123,456 million atoms" because there's no space in the "millions" counting system for the 1 extra atom to make any difference.
numeric_limits<double>::max()
is a huuuuuge number. But the greater the absolute value of a double, the less absolute precision it has. Apparently in this case max-50 and (max-50)+5 are indistinguishable from double's point of view.
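To put a number on it (a hedged sketch, assuming IEEE 754 double): the gap between max and the next representable value below it is about 2e292, so subtracting anything much smaller than that is simply absorbed.
#include <cmath>
#include <iostream>
#include <limits>

int main()
{
    double m   = std::numeric_limits<double>::max();
    double gap = m - std::nextafter(m, 0.0); // distance to the next double below max

    std::cout << gap << '\n';            // roughly 2e292
    std::cout << (m - 50 == m) << '\n';  // 1: the subtraction is absorbed
    std::cout << (m - gap == m) << '\n'; // 0: subtracting a whole gap is visible
}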
You should read the floating point comparison guide. In short, here are some examples:
float a = 0.15 + 0.15;
float b = 0.1 + 0.2;
if(a == b) // can be false!
if(a >= b) // can also be false!
The comparison with an epsilon value is what most people do.
#include <cmath>

#define EPSILON 0.00000001

bool AreSame(double a, double b)
{
    return fabs(a - b) < EPSILON;
}
In your case, that max value is REALLY big. Adding or subtracting 50 does nothing. Thus they look the same because of the size of the number. See #RichieHindle's answer.
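For values as large as numeric_limits<double>::max(), a fixed absolute epsilon like the one above is far too small to matter; a relative tolerance, in the spirit of the comparison guide linked above, scales with the magnitudes involved. A hedged sketch:
#include <algorithm>
#include <cmath>

// Illustrative only: the tolerance scales with the larger of the two
// magnitudes, so it behaves sensibly for both tiny and huge values.
bool NearlyEqual(double a, double b, double relEps = 1e-12)
{
    return std::fabs(a - b) <= relEps * std::max(std::fabs(a), std::fabs(b));
}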
Here are some additional resources for research.
See this blog post.
Also, there was a stack overflow question on this very topic (language agnostic).
From the C++03 standard:
3.9.1/ [...] The value representation of floating-point types is
implementation-defined
and
5/ [...] If during the evaluation of an expression, the result is not
mathematically defined or not in the range of representable values for
its type, the behavior is undefined, unless such an expression is a
constant expression (5.19), in which case the program is ill-formed.
and
18.2.1.2.4/ (about numeric_limits<T>::max()) Maximum finite value.
This implies that once you add something to std::numeric_limits<T>::max(), the behavior of the program is implementation defined if T is floating point, perfectly defined if T is an unsigned type, and undefined otherwise.
If you happen to have std::numeric_limits<T>::is_iec559 == true, then the behavior is defined by IEEE 754. I don't have the standard handy, so I cannot tell whether variable1 is finite or infinite in this case. It seems (according to some lecture notes on IEEE 754 found on the internet) that it depends on the rounding mode.
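A small check (assuming IEEE 754 doubles and the default round-to-nearest mode) illustrating that point:
#include <iostream>
#include <limits>

int main()
{
    std::cout << std::numeric_limits<double>::is_iec559 << '\n'; // 1 on typical platforms

    double m = std::numeric_limits<double>::max();
    // In round-to-nearest, adding 5 is nowhere near half the distance to the
    // overflow threshold, so the result stays finite and equal to m.  Under
    // round-toward-positive-infinity it would overflow to +inf instead.
    std::cout << ((m - 50) + 5 == m) << '\n'; // 1
}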
Please read What Every Computer Scientist Should Know About Floating-Point Arithmetic.