glsl What happens if infinity multiply on 0? - glsl

In my glsl vertex shader, lets say I have following code:
float len = k/0;
Now len is infinity. What happens if I multiply it on 0? Does result remains "infinity", or it becomes 0? In other words, what happens if infinity multiply on 0? Mathematically should be 0...

Mathematically, it is not well-defined, because infinity is not well-defined (at least, in the normal number system). The output of inf*0 on a IEEE-754-compliant system (which I think GLSL guarantees, perhaps with the exception of denormals?) is NaN.
See here for more information on the effect of various operations on various special floating-point values.

Related

Is it possible in floating point to return 0.0 subtracting two different values?

Due to the floating point "approx" nature, its possible that two different sets of values return the same value.
Example:
#include <iostream>
int main() {
std::cout.precision(100);
double a = 0.5;
double b = 0.5;
double c = 0.49999999999999994;
std::cout << a + b << std::endl; // output "exact" 1.0
std::cout << a + c << std::endl; // output "exact" 1.0
}
But is it also possible with subtraction? I mean: is there two sets of different values (keeping one value of them) that return 0.0?
i.e. a - b = 0.0 and a - c = 0.0, given some sets of a,b and a,c with b != c??
The IEEE-754 standard was deliberately designed so that subtracting two values produces zero if and only if the two values are equal, except that subtracting an infinity from itself produces NaN and/or an exception.
Unfortunately, C++ does not require conformance to IEEE-754, and many C++ implementations use some features of IEEE-754 but do not fully conform.
A not uncommon behavior is to “flush” subnormal results to zero. This is part of a hardware design to avoid the burden of handling subnormal results correctly. If this behavior is in effect, the subtraction of two very small but different numbers can yield zero. (The numbers would have to be near the bottom of the normal range, having some significand bits in the subnormal range.)
Sometimes systems with this behavior may offer a way of disabling it.
Another behavior to beware of is that C++ does not require floating-point operations to be carried out precisely as written. It allows “excess precision” to be used in intermediate operations and “contractions” of some expressions. For example, a*b - c*d may be computed by using one operation that multiplies a and b and then another that multiplies c and d and subtracts the result from the previously computed a*b. This latter operation acts as if c*d were computed with infinite precision rather than rounded to the nominal floating-point format. In this case, a*b - c*d may produce a non-zero result even though a*b == c*d evaluates to true.
Some C++ implementations offer ways to disable or limit such behavior.
Gradual underflow feature of IEEE floating point standard prevents this. Gradual underflow is achieved by subnormal (denormal) numbers, which are spaced evenly (as opposed to logarithmically, like normal floating point) and located between the smallest negative and positive normal numbers with zeroes in the middle. As they are evenly spaced, the addition of two subnormal numbers of differing signedness (i.e. subtraction towards zero) is exact and therefore won't reproduce what you ask. The smallest subnormal is (much) less than the smallest distance between normal numbers, and therefore any subtraction between unequal normal numbers is going to be closer to a subnormal than zero.
If you disable IEEE conformance using a special denormals-are-zero (DAZ) or flush-to-zero (FTZ) mode of the CPU, then indeed you could subtract two small, close numbers which would otherwise result in a subnormal number, which would be treated as zero due to the mode of the CPU. A working example (Linux):
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON); // system specific
double d = std::numeric_limits<double>::min(); // smallest normal
double n = std::nextafter(d, 10.0); // second smallest normal
double z = d - n; // a negative subnormal (flushed to zero)
std::cout << (z == 0) << '\n' << (d == n);
This should print
1
0
First 1 indicates that result of subtraction is exactly zero, while the second 0 indicates that the operands are not equal.
Unfortunately the answer is dependent on your implementation and the way it is configured. C and C++ don't demand any specific floating point representation or behavior. Most implementations use the IEEE 754 representations, but they don't always precisely implement IEEE 754 arithmetic behaviour.
To understand the answer to this question we must first understand how floating point numbers work.
A naive floating point representation would have an exponent, a sign and a mantissa. It's value would be
(-1)s2(e – e0)(m/2M)
Where:
s is the sign bit, with a value of 0 or 1.
e is the exponent field
e0 is the exponent bias. It essentially sets the overall range of the floating point number.
M is the number of mantissa bits.
m is the mantissa with a value between 0 and 2M-1
This is similar in concept to the scientific notation you were taught in school.
However this format has many different representations of the same number, nearly a whole bit's worth of encoding space is wasted. To fix this we can add an "implicit 1" to the mantissa.
(-1)s2(e – e0)(1+(m/2M))
This format has exactly one representation of each number. However there is a problem with it, it can't represent zero or numbers close to zero.
To fix this IEEE floating point reserves a couple of exponent values for special cases. An exponent value of zero is reserved for representing small numbers known as subnormals. The highest possible exponent value is reserved for NaNs and infinities (which I will ignore in this post since they aren't relevant here). So the definition now becomes.
(-1)s2(1 – e0)(m/2M) when e = 0
(-1)s2(e – e0)(1+(m/2M)) when e >0 and e < 2E-1
With this representation smaller numbers always have a step size that is less than or equal to that for larger ones. So provided the result of the subtraction is smaller in magnitude than both operands it can be represented exactly. In particular results close to but not exactly zero can be represented exactly.
This does not apply if the result is larger in magnitude than one or both of the operands, for example subtracting a small value from a large value or subtracting two values of opposite signs. In those cases the result may be imprecise but it clearly can't be zero.
Unfortunately FPU designers cut corners. Rather than including the logic to handle subnormal numbers quickly and correctly they either did not support (non-zero) subnormals at all or provided slow support for subnormals and then gave the user the option to turn it on and off. If support for proper subnormal calculations is not present or is disabled and the number is too small to represent in normalized form then it will be "flushed to zero".
So in the real world under some systems and configurations subtracting two different very-small floating point numbers can result in a zero answer.
Excluding funny numbers like NAN, I don't think it's possible.
Let's say a and b are normal finite IEEE 754 floats, and |a - b| is less than or equal to both |a| and |b| (otherwise it's clearly not zero).
That means the exponent is <= both a's and b's, and so the absolute precision is at least as high, which makes the subtraction exactly representable. That means that if a - b == 0, then it is exactly zero, so a == b.

Can you get a "nan" from overflow in C++?

I'm writing a program that uses a very long recursion (about 50,000) and some very large vectors (also 50,000 in length of type double) to store the result of each recursion before averaging them. At the end of the program, I expect to get a number output.
However, some of the results I got was "nan". The mysterious thing is, if I reduce the number of recursions the program will work just fine. So I'm guessing this might be something to do with the size of the vector. So my question is, if you get an overflow in a very long vector (or say array), what will be the effect? Will you get an "nan" just like in my case?
Another mysterious thing about my program is that I have tried some even larger recursions (100,000), but the output was normal. But when I changed a parameter value, so that each numbers stored in the vector will become larger (although they are still of type double), the output becomes "nan". Will the maximum capacity of a vector be dependent on the size of the number it stores?
You didn't tell us what your recursion is, but it is fairly easy to generate NaNs with a long sequence of operations if you are using square root, pow, inverse sine, or inverse cosine.
Suppose your calculation produces a quantity, call it x, that is supposed to be the sine of some angle θ, and suppose the underlying math dictates that x must always be between -1 and 1, inclusive. You calculate θ by taking the inverse sine of x.
Here's the problem: Arithmetic done on a computer is but an approximation of the arithmetic of the real numbers. Addition and multiplication with IEEE floating point numbers are not transitive. You might well get a value of 1.0000000000000002 for x instead of 1. Take the inverse sine of this value and you get a NaN.
A standard trick is to protect against those near misses that result from numerical errors. Don't use the built-in asin, acos, sqrt, and pow. Use wrappers that protects against things like asin(1.0000000000000002) and sqrt(-1e-16). Make the former pi/2 rather than NaN, and make the latter zero. This is admittedly a kludge, and doing this can get you in trouble. What if the problem is that your calculations are formulated incorrectly? It's legitimate to treat 1.0000000000000002 as 1, but it's best not to treat a value of 100 as if it were 1. A value of 100 to your asin wrapper is best treated by throwing an exception rather than truncating to 1.
There's one other problem with infinities and NaNs: They propagate. An Inf or NaN in one single computation quickly becomes an Inf or a NaN in hundreds, then thousands of values. I usually make the floating point machinery raise a floating point exception on obtaining an Inf or NaN instead of continuing on. (Note well: Floating point exceptions are not C++ exceptions.) When you do this, your program will bomb unless you have a signal handler in place. That's not necessarily a bad thing. You can run the program in the debugger and find exactly where the problem arose. Without these floating point exceptions it is very hard to find the source of the problem.
Depends on the exact natur of your computations. If you just add up numbers which aren't NaN, the result shouldn't be NaN, either. It might be +infinity, though.
But you will get NaN if e.g. some part of your computation yields +infinity, another -infinity, and you later add those two results.
Assuming that your architecture conforms to IEEE 754, this http://en.wikipedia.org/wiki/NaN#Creation tells the situations in which arithmetic operations return NaN.

glsl infinity constant

Does GLSL have any pre-defined constants for +/-infinity or NaN? I'm doing this as a workaround but I wonder if there is a cleaner way:
// GLSL FRAGMENT SHADER
#version 410
<snip>
const float infinity = 1. / 0.;
void main ()
{
<snip>
}
I am aware of the isinf function but I need to assign infinity to a variable so that does not help me.
Like Nicol mentioned, there are no pre-defined constants.
However, from OpenGL 4.1 on, your solution is at least guaranteed to work and correctly generate an infinite value.
See for example in glsl 4.4:
4.7.1 Range and Precision
...
However, dividing a non-zero by 0 results in the
appropriately signed IEEE Inf: If both positive and negative zeros are implemented, the correctly signed
Inf will be generated, otherwise positive Inf is generated.
Be careful when you use an older version of OpenGL though:
For example in glsl 4.0 it says:
4.1.4 Floats
...
Similarly, treatment of conditions such as divide by 0 may lead to an unspecified result, but in no case should such a condition lead to the interruption or termination of processing.
There are no pre-defined constants for it, but there is the isinf function to test if something is infinity.
While I'm at it, are there constants for other things like FLT_MAX FLT_EPSILON etc the way there are in C?
No, there are not.
This might work?
const float pos_infinity = uintBitsToFloat(0x7F800000);
const float neg_infinity = uintBitsToFloat(0xFF800000);
"If the encoding of a floating point infinity is passed in parameter x, the resulting floating-point value is the corresponding (positive or negative) floating point infinity"

Why are FLT_MAX and FLT_MIN not positive and negative infinity, and what is their use?

Logically speaking, given the nature of floating point values, the maximum and minimum representable values of a float are positive and negative infinity, respectively.
Why, then, are FLT_MAX and FLT_MIN not set to them? I understand that this is "just how the standard called for". But then, what use could FLT_MAX or FLT_MIN have as they currently lie in the middle of the representable numeric range of float? Other numeric limits have some utility because they make guarantees about comparisons (e.g. "No INT can test greater than INT_MAX"). Without that kind of guarantee, what use are these float limits at all?
A motivating example for C++:
#include <vector>
#include <limits>
template<typename T>
T find_min(const std::vector<T> &vec)
{
T result = std::numeric_limits<T>::max();
for (std::vector<T>::const_iterator p = vec.start() ; p != vec.end() ; ++p)
if (*p < result) result = *p;
return result;
}
This code works fine if T is an integral type, but not if it is a floating point type. This is annoying. (Yes yes, the standard library provides min_element, but that is not the point. The point is the pattern.)
The purpose of FLT_MIN/MAX is to tell you what the smallest and largest representable floating-point numbers are. Infinity isn't a number; it's a limit.
what use could FLT_MAX or FLT_MIN have as they currently lie in the middle of the representable numeric range of float?
They do not lie in the middle of the representable range. There is no positive float value x which you can add to FLT_MAX and get a representable number. You will get +INF. Which, as previously stated, is not a number.
This code works fine if T is an integral type, but not if it is a floating point type. This is annoying. (Yes yes, the standard library provides min_element, but that is not the point. The point is the pattern.)
And how doesn't it "work fine?" It gives you the smallest value. The only situation where it doesn't "work fine" is if the table contains only +INF. And even in that case, it returns an actual number, not an error-code. Which is probably the better option anyway.
FLT_MAX is defined in section 5.2.4.2.2(9) as
maximum representable finite floating-point number
Positive infinity is not finite.
FLT_MIN is defined in section 5.2.4.2.2(10) as
minimum normalized positive floating-point number
Negative infinity is neither normalized nor positive.
Unlike integer types, floating-point types are (almost?) universally symmetric about zero, and I think the C floating-point model requires this.
On two's-complement systems (i.e., almost all modern systems), INT_MIN is -INT_MAX-1; on other systems, it may be -INT_MAX. (Quibble: a two's-complement system can have INT_MIN equal to -INT_MAX if the lowest representable value is treated as a trap representation.) So INT_MIN conveys information that INT_MAX by itself doesn't.
And a macro for the smallest positive value would not be particularly useful; that's just 1.
In floating-point, on the other hand, the negative value with the greatest magnitude is just -FLT_MAX (or -DBL_MAX, or -LDBL_MAX).
As for why they're not Infinity, there's already a way to represent infinite values (at least in C99): the macro INFINITY. That might cause problems for some C++ applications, but these were defined for C, which doesn't have things like std::numeric_limits<T>::max().
Furthermore, not all floating-point systems have representations for infinity (or NaN).
If FLT_MAX were INFINITY (on systems that support it), then there would probably need to be another macro for the largest representable real value.
I would say the broken pattern you're seeing is only an artifact of poor naming in C, whereas in C++ with numeric_limits and templates, it's an actual semantic flaw that breaks template code that wants to handle both integer and floating point values. Of course you can write a little bit of extra code to test if you have an integer or floating point type (e.g. if ((T)1/2) /* floating point */ else /* integer */) and the problem goes away.
As for why somebody would care about the values FLT_MIN and FLT_MAX give you, they're useful for avoiding underflow and overflow. For example, suppose I need to compute sqrt(x²-1). This is well-defined for any floating point x greater than or equal to 1, but performing the squaring, subtraction, and square root could easily overflow and render the result meaningless when x is large. One might want to test whether x > FLT_MAX/x and handle this case some other way (such as simply returning x :-).

How do I check and handle numbers very close to zero

I have some math (in C++) which seems to be generating some very small, near zero, numbers (I suspect the trig function calls are my real problem), but I'd like to detect these cases so that I can study them in more detail.
I'm currently trying out the following, is it correct?
if ( std::abs(x) < DBL_MIN ) {
log_debug("detected small num, %Le, %Le", x, y);
}
Second, the nature of the mathematics is trigonometric in nature (aka using a lot of radian/degree conversions and sin/cos/tan calls, etc), what sort of transformations can I do to avoid mathematical errors?
Obviously for multiplications I can use a log transform - what else?
Contrary to widespread belief, DBL_MIN is not the smallest positive double value but the smallest positive normalized double value. Typically - for 64-bit ieee754 doubles - it's 2-1022, while the smallest positive double value is 2-1074. Therefore
I'm currently trying out the following, is it correct?
if ( std::abs(x) < DBL_MIN ) {
log_debug("detected small num, %Le, %Le", x, y);
}
may have an affirmative answer. The condition checks whether x is a denormalized (also called subnormal) number or ±0.0. Without knowing more about your specific situation, I cannot tell if that test is appropriate. Denormalized numbers can be legitimate results of calculations or the consequence of rounding where the correct result would be 0. It is also possible that rounding produces numbers of far greater magnitude than DBL_MIN when the mathematically correct result would be 0, so a much larger threshold could be sensible.
If x is a double, then one problem with this approach is that you can't distinguish between x being legitimately zero, and x being a positive value smaller than DBL_MIN. So this will work if you know x can never be legitimately zero, and you want to see when underflow occurs.
You could also try catching the SIGFPE signal, which will fire on a POSIX-compliant system any time there's a math error including floating-point underflow. See: http://en.wikipedia.org/wiki/SIGFPE
EDIT: To be clear, DBL_MIN is NOT the largest negative value that a double can hold, it is the smallest positive normalized value that a double can hold. So your approach is fine as long as the value can't be zero.
Another useful constant is DBL_EPSILON which is the smallest double value that can be added to 1.0 without getting 1.0 back. Note that this is a much larger value than DBL_MIN. But it may be useful to you since you're doing trigonometric functions that may tend toward 1 instead of tending toward 0.
Since you are using C++, the most idiomatic is to use std::numeric_limits from header <limits>.
For instance:
template <typename T>
bool is_close_to_zero(T x)
{
return std::abs(x) < std::numeric_limits<T>::epsilon();
}
The actual tolerance to be used heavily depends on your problem. Please complete your question with a concrete use case so that I can enhance my answer.
There is also std::numeric_limits<T>::min() and std::numeric_limits<T>::denorm_min() that may be useful. The first one is the smallest positive non-denormalized value of type T (equal to FLT/DBL/LDBL_MIN from <cfloat>), the second one is the smallest positive value of type T (no <cfloat> equivalent).
[You may find this document useful to read if you aren't at ease with floating point numbers representation.]
The first if check will actually only be true when your value is zero.
For your second question, you imply lots of conversions. Instead, pick one unit (deg or rad) and do all your computational operations in that unit. Then at the very end do a single conversion to the other value if you need to.