I am running long simulations and record the results into a vector to compute statistics about the data. I realized that, in theory, those samples could be the result of a division by zero; this is purely theoretical, and I am fairly sure it is not actually happening. To avoid rerunning the simulation after modifying the code, I was wondering what happens in that case. Would I be able to tell whether a division by 0 has occurred? Will I get error messages? (Exceptions are not being handled at the moment.)
Thanks
For IEEE floats, division of a finite nonzero float by zero is well defined: the result is +infinity if the dividend was greater than zero, or -infinity if it was less than zero. The result of 0.0/0.0 is NaN. If you use integers, the behaviour is undefined.
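To see the three cases concretely, here is a minimal sketch (assuming IEEE-754 doubles; you can check std::numeric_limits<double>::is_iec559 on your platform):

#include <iostream>
#include <limits>

int main()
{
    double zero = 0.0; // a variable avoids a compile-time division-by-zero diagnostic
    std::cout << std::numeric_limits<double>::is_iec559 << "\n"; // 1 if IEEE-754
    std::cout << 1.0 / zero << "\n";  // inf
    std::cout << -1.0 / zero << "\n"; // -inf
    std::cout << 0.0 / zero << "\n";  // nan (or -nan)
}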
Note that C standard says (6.5.5):
The result of the / operator is the quotient from the division of
the first operand by the second; the result of the % operator is the
remainder. In both operations, if the value of the second operand is
zero, the behavior is undefined.
So something/0 is undefined (by the standard) both for integral and floating-point types. Nevertheless, most implementations exhibit the aforementioned IEEE behaviour (+-INF or NaN) for floating point.
If you're talking integers, then your program will typically crash on division by zero (the standard makes it undefined, but a crash is the usual outcome).
If you're talking floats, then division by zero is allowed and the result is INF or -INF (or NaN for 0.0/0.0). It is then up to your code whether the program crashes, handles the value gracefully, or continues with undefined/unexpected results.
If you use IEEE floats, the division will return infinity or NaN rather than trapping. If op1 is 0, you will get NaN (0/0 is mathematically undefined). If op1 is greater than 0, you will get Infinity. If op1 is less than 0, you will get -Infinity. If you divide by 0 with integers, you will typically get the error "Floating point exception".
#include <iostream>
#include <string>
#include <cmath>
using namespace std;

int main()
{
    double a = 123, b = 0;
    double result = a / b; // IEEE: finite nonzero / 0.0 gives infinity
    string isInfinite = isinf(result) ? "is" : "is not";
    cout << "result=" << result << " " << isInfinite << " infinity" << endl;
}
result=inf is infinity
In C++, we know that we can find the smallest positive normalized double-precision value using std::numeric_limits<double>::min(). The value turns out to be 2.22507e-308 when printed.
Now if this minimum value (call it minval) is subtracted from a given double value (say val), and the difference is then divided by val, i.e. (val - minval) / val, I was expecting the result of floor((val - minval) / val) to be 0.
To my surprise, the answer is delivered as 1. Can someone please explain this anomalous behavior?
Consider the following code:
#include <iostream>
#include <cmath>
#include <limits>
using namespace std;

int main()
{
    double minval = std::numeric_limits<double>::min(), wg = 8038,
           ans = floor((wg - minval) / wg); // expecting the answer to round to 0
    cout << ans;                            // but the answer actually comes out as 1!
}
A double typically has around 16 digits of precision.
You're starting with 8038. For simplicity, I'm going to call that 8.038e3. Since we have around 16 digits of precision, the smallest number we can subtract from that and still get a result different from 8038 is on the order of 8.038e(3-16) = 8.038e-13.
8038 - 2.2e-308 is like reducing the mass of the universe by one electron, and expecting that to affect the mass of the universe by a significant amount.
Actually, relatively speaking, 8038-2.2e-308 is a much smaller change than removing a whole electron from the universe--more like removing a minuscule fraction of a single electron from the universe, if that were possible. Even if we were to assume that string theory were correct, even removing one string from the universe would still be a huge change compared to subtracting 2.2e-308 from 8038.
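If you want to check the claim on your own machine, here is a quick sketch (nothing beyond what was stated above):

#include <iostream>
#include <limits>

int main()
{
    double wg = 8038;
    double minval = std::numeric_limits<double>::min(); // ~2.22507e-308
    // The exact difference rounds back to 8038, so the subtraction has no effect:
    std::cout << (wg - minval == wg) << "\n"; // prints 1
}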
The comments and the previous answer correctly attribute the cause to floating-point precision, but some additional detail is needed to explain the behavior fully. In fact, even when the result of a subtraction cannot be represented with the finite precision of floating-point numbers, the subtraction is not simply discarded: the exact result is computed and then rounded to a representable value.
As an example, consider the code below.
#include <iostream>
#include <iomanip>
#include <vector>
#include <cmath>
using namespace std;

int main()
{
    double b, c, d;
    vector<double> a{0.07, 0.08, 0.09, 0.1, 0.2, 0.3, 0.4, 0.6, 0.7};
    cout << "Subtraction Possible?" << "\t" << "Floor Result" << "\n";
    for (int i = 0; i < 9; i++) {
        b = std::nextafter(a[i], 0); // largest double smaller than a[i]
        c = a[i] - b;                // the representable gap below a[i]
        d = 1e-17;                   // value smaller than that gap
        if (d > c)
            cout << "True" << "\t";
        else
            cout << "False" << "\t";
        cout << setprecision(52) << floor((a[i] - d) / a[i]) << "\n";
    }
    return 0;
}
The code takes different double-precision values (the elements of vector a) and subtracts 1e-17 from each. Note that, using std::nextafter, the gap between 0.07 and the next smaller representable double is 1.387778780781445675529539585113525390625e-17. This means that 1e-17 is smaller than the smallest change that can be represented around any of these numbers. Hence, theoretically, the subtraction should have no effect for any of the numbers listed in vector a. If the subtraction results were simply discarded, the answer would always stay 1, but it turns out that sometimes the answer is 0 and other times 1.
This can be observed from the output of the C++ program as shown below:
Subtraction Possible? Floor Result
False 0
False 0
False 0
False 0
False 1
False 1
False 1
False 1
False 1
The reasons lie buried within the floating-point specification prescribed in the IEEE 754 document. The standard specifically states that even in cases where the result of an operation cannot be represented, rounding must be carried out. I quote Page 27, Section 4.3 of the IEEE 754-2019 document:
Except where stated otherwise, every operation shall be performed as if it first produced an
intermediate result correct to infinite precision and with unbounded range, and then rounded that result
according to one of the attributes in this clause
The statement is repeated in Section 5.1 on Page 29, as shown below:
Unless otherwise specified, each of the computational
operations specified by this standard that returns a numeric result shall be performed as if it first produced
an intermediate result correct to infinite precision and with unbounded range, and then rounded that
intermediate result, if necessary, to fit in the destination’s format (see Clause 4 and Clause 7).
The g++ compiler (which I have been testing) correctly and precisely interprets the standard by implementing the round-to-nearest behavior stated in Section 4.3.1 of the IEEE 754 document. This has the implication that even when the exact value of a[i] - d is not representable, a numeric result is delivered as if the subtraction first produced an intermediate result correct to infinite precision and with unbounded range, and then rounded that intermediate result. Hence it may or may not be the case that a[i] - d == a[i]: the answer may be 1 or 0 depending on whether the exact difference is closer to a[i] or to the next representable value below a[i].
It turns out that 8038 - 2.22507e-308 is closer to 8038, so the result is rounded (using round-to-nearest) to 8038 and the final answer is 1. The point is that this behavior results from the compiler's implementation of the standard and is not something arbitrary.
I found the references below on floating-point numbers very useful. I recommend reading Cleve Moler's (founder of MATLAB) note on floating-point numbers before going through the IEEE specification, for a quick and easy understanding of their behavior.
"IEEE Standard for Floating-Point Arithmetic," in IEEE Std 754-2019 (Revision of IEEE 754-2008) , vol., no., pp.1-84, 22 July 2019, doi: 10.1109/IEEESTD.2019.8766229.
Moler, Cleve. “Floating Points.” MATLAB News and Notes. Fall, 1996.
I have read some articles about NaN, but none of them mentioned all the situations in which it appears. For example, I compiled this code and received nan.
Why doesn't it give inf?
#include <iostream>
using namespace std;

int main()
{
    double input, counter, pow = 1, sum = 0, sign = 1.0;
    cin >> input;
    for (counter = 1; pow / counter >= 1e-4; counter++)
    {
        pow *= input;
        sum += sign * pow / counter;
        sign = -sign;
    }
    cout << sum << endl;
}
The result is :
nan
With input of “2”, your program adds two infinities of opposite signs, which generates a NaN. This occurs because repeatedly multiplying pow by two causes it to become infinity, and the alternating sign results in a positive infinity being added to a negative infinity in sum from the previous iteration or vice-versa.
However, it is not clear why you see any output, as counter++ becomes ineffective once counter reaches 2^53 (in typical C++ implementations), because then the double format lacks the precision to represent 2^53+1, so the result of adding one to 2^53 is rounded to 2^53. So counter stops changing, and the loop continues forever.
One possibility is that your compiler is generating code that always terminates the loop, because this is allowed by the “Forward progress” clause (4.7.2 in draft n4659) of the C++ standard. It says the compiler can assume your loop will not continue forever without doing something useful (like writing output or calling exit), and that allows the compiler to generate code that exits the loop even though it would otherwise continue forever with input of “2”.
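If you want to verify the 2^53 claim yourself, this small sketch shows the increment being rounded away:

#include <iostream>

int main()
{
    double counter = 9007199254740992.0; // 2^53
    std::cout << (counter + 1.0 == counter) << "\n"; // prints 1: the increment is rounded away
}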
Per the IEEE-754 standard, operations that produce NaN as a result include:
operations on a NaN,
multiplication of zero by an infinity,
subtraction of two infinities of the same sign or addition of two infinities of opposite signs,
division of zero by zero or an infinity by an infinity,
remainder when the divisor is zero or the dividend is infinity,
square root of a value less than zero,
various exceptions in some utility and mathematical routines (such as pow, see IEEE-754 9.2, 5.3.2, and 5.3.3).
C++ implementations do not always conform to IEEE-754, but these are generally good guidelines for sources of NaNs.
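For illustration, here is a small sketch exercising several of the sources listed above (assuming an IEEE-754 conforming implementation; whether it prints nan or -nan varies by platform):

#include <cmath>
#include <iostream>
#include <limits>

int main()
{
    double zero = 0.0;
    double inf = std::numeric_limits<double>::infinity();
    std::cout << inf + -inf << "\n";           // infinities of opposite signs added
    std::cout << zero * inf << "\n";           // zero multiplied by infinity
    std::cout << zero / zero << "\n";          // zero divided by zero
    std::cout << inf / inf << "\n";            // infinity divided by infinity
    std::cout << std::fmod(1.0, zero) << "\n"; // remainder with zero divisor
    std::cout << std::sqrt(-1.0) << "\n";      // square root of a negative value
}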
Is there a difference in accuracy between pow(a/b,x) and pow(b/a,-x)?
If there is, does raising a number less than 1 to a positive power or a number greater than 1 to a negative power produce more accurate result?
Edit: Let's assume x86_64 processor and gcc compiler.
Edit: I tried comparing using some random numbers. For example:
printf("%.20f",pow(8.72138221/1.761329479,-1.51231)) // 0.08898783049228660424
printf("%.20f",pow(1.761329479/8.72138221, 1.51231)) // 0.08898783049228659037
So, it looks like there is a difference (albeit minuscule in this case), but maybe someone who knows about the algorithm implementation could comment on what the maximum difference is, and under what conditions.
Here's one way to answer such questions, to see how floating-point behaves. This is not a 100% correct way to analyze this question, but it gives a general idea.
Let's generate random numbers, calculate v0 = pow(a/b, n) and v1 = pow(b/a, -n) in float precision, and calculate ref = pow(a/b, n) in double precision, rounded to float. We use ref as the reference value (we suppose that double has much more precision than float, so we can trust ref to be the best possible value; this is true for IEEE-754 most of the time). Then we sum the differences v0-ref and v1-ref, where each difference is measured as the number of floating-point numbers between the value and ref.
Note that the results may depend on the ranges of a, b and n (and on the quality of the random generator; if it is really bad, it may give a biased result). Here I've used a=[0..1], b=[0..1] and n=[-2..2]. Furthermore, this answer supposes that float and double division/pow use the same kind of algorithm, with the same characteristics.
For my computer, the summed differences are 2604828 and 2603684, which means there is no significant precision difference between the two approaches.
Here's the code (note, this code supposes IEEE-754 arithmetic):
#include <cmath>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Distance between two same-sign floats, measured in representable values.
long long int diff(float a, float b) {
    unsigned int ai, bi;
    memcpy(&ai, &a, 4);
    memcpy(&bi, &b, 4);
    long long int diff = (long long int)ai - bi;
    if (diff < 0) diff = -diff;
    return diff;
}

int main() {
    long long int e0 = 0;
    long long int e1 = 0;
    for (int i = 0; i < 10000000; i++) {
        float a = 1.0f * rand() / RAND_MAX;
        float b = 1.0f * rand() / RAND_MAX;
        float n = 4.0f * rand() / RAND_MAX - 2.0f;
        if (a == 0 || b == 0) continue;
        float v0 = std::pow(a / b, n);
        float v1 = std::pow(b / a, -n);
        float ref = std::pow((double)a / b, n); // double-precision reference, rounded to float
        e0 += diff(ref, v0);
        e1 += diff(ref, v1);
    }
    printf("%lld %lld\n", e0, e1);
}
... between pow(a/b,x) and pow(b/a,-x) ... does raising a number less than 1 to a positive power or a number greater than 1 to a negative power produce more accurate result?
Whichever division is more accurate.
Consider z = x^y = 2^(y * log2(x)).
Roughly: the error in y * log2(x) is magnified by the value of z to form the error in z. x^y is very sensitive to the error in x. The larger the |log2(x)|, the greater the concern.
In OP's case, both pow(a/b,p) and pow(b/a,-p), in general, have the same y * log2(x) and same z and similar errors in z. It is a question of how x, y are formed:
a/b and b/a, in general, both have the same error of +/- 0.5*unit in the last place and so both approaches are of similar error.
Yet with select values of a/b vs. b/a, one quotient will be more exact and it is that approach with the lower pow() error.
pow(7777777/4,-p) can be expected to be more accurate than pow(4/7777777,p).
Lacking assurance about the error in the division, the general rule applies: no major difference.
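As a quick illustration of the power-of-two point above (treat this as a sketch; the printed digits depend on your libm, and the exponent is borrowed from the question):

#include <cmath>
#include <cstdio>

int main()
{
    double p = 1.51231; // arbitrary exponent for illustration
    // 7777777.0/4 is exact in binary (dividing by a power of two),
    // while 4/7777777.0 carries a rounding error, so the first form
    // feeds pow() an error-free x.
    std::printf("%.20g\n", std::pow(7777777.0 / 4.0, -p));
    std::printf("%.20g\n", std::pow(4.0 / 7777777.0, p));
}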
In general, the form with the positive power is slightly better, although by so little it will likely have no practical effect. Specific cases could be distinguished. For example, if either a or b is a power of two, it ought to be used as the denominator, as the division then has no rounding error.
In this answer, I assume IEEE-754 binary floating-point with round-to-nearest-ties-to-even and that the values involved are in the normal range of the floating-point format.
Given a, b, and x with values a, b, and x, and an implementation of pow that computes the representable value nearest the ideal mathematical value (actual implementations are generally not this good), pow(a/b, x) computes (a/b•(1+e0))^x•(1+e1), where e0 is the rounding error that occurs in the division and e1 is the rounding error that occurs in the pow, and pow(b/a, -x) computes (b/a•(1+e2))^−x•(1+e3), where e2 and e3 are the rounding errors in this division and this pow, respectively.
Each of the errors, e0…e3 lies in the interval [−u/2, u/2], where u is the unit of least precision (ULP) of 1 in the floating-point format. (The notation [p, q] is the interval containing all values from p to q, including p and q.) In case a result is near the edge of a binade (where the floating-point exponent changes and the significand is near 1), the lower bound may be −u/4. At this time, I will not analyze this case.
Rewriting, these are (a/b)^x•(1+e0)^x•(1+e1) and (a/b)^x•(1+e2)^−x•(1+e3). This reveals the primary difference is in (1+e0)^x versus (1+e2)^−x. The 1+e1 versus 1+e3 is also a difference, but this is just the final rounding. [I may consider further analysis of this later but omit it for now.]
Consider (1+e0)^x and (1+e2)^−x. The potential values of the first expression span [(1−u/2)^x, (1+u/2)^x], while the second spans [(1+u/2)^−x, (1−u/2)^−x]. When x > 0, the second interval is longer than the first:
The length of the first is (1+u/2)^x − (1−u/2)^x.
The length of the second is (1/(1−u/2))^x − (1/(1+u/2))^x.
Multiplying the latter by (1−u^2/2^2)^x produces ((1−u^2/2^2)/(1−u/2))^x − ((1−u^2/2^2)/(1+u/2))^x = (1+u/2)^x − (1−u/2)^x, which is the length of the first interval.
1−u^2/2^2 < 1, so (1−u^2/2^2)^x < 1 for positive x.
Since the first length equals the second length times a number less than one, the first interval is shorter.
Thus, the form in which the exponent is positive is better in the sense that it has a shorter interval of potential results.
Nonetheless, this difference is very slight. I would not be surprised if it were unobservable in practice. Also, one might be concerned with the probability distribution of errors rather than the range of potential errors. I suspect this would also favor positive exponents.
For evaluating rounding errors like the ones in your case, it might be useful to use a multi-precision library, such as Boost.Multiprecision. Then you can compare results across various precisions, e.g., with the following program:
#include <iomanip>
#include <iostream>
#include <boost/multiprecision/cpp_bin_float.hpp>
#include <boost/multiprecision/cpp_dec_float.hpp>
namespace mp = boost::multiprecision;
template <typename FLOAT>
void comp() {
    FLOAT a = 8.72138221;
    FLOAT b = 1.761329479;
    FLOAT c = 1.51231;
    FLOAT e = mp::pow(a / b, -c);
    FLOAT f = mp::pow(b / a, c);
    std::cout << std::fixed << std::setw(40) << std::setprecision(40) << e << std::endl;
    std::cout << std::fixed << std::setw(40) << std::setprecision(40) << f << std::endl;
}

int main() {
    std::cout << "Double: " << std::endl;
    comp<mp::cpp_bin_float_double>();
    std::cout << std::endl;

    std::cout << "Double extended: " << std::endl;
    comp<mp::cpp_bin_float_double_extended>();
    std::cout << std::endl;

    std::cout << "Quad: " << std::endl;
    comp<mp::cpp_bin_float_quad>();
    std::cout << std::endl;

    std::cout << "Dec-100: " << std::endl;
    comp<mp::cpp_dec_float_100>();
    std::cout << std::endl;
}
Its output reads, on my platform:
Double:
0.0889878304922865903670015086390776559711
0.0889878304922866181225771242679911665618
Double extended:
0.0889878304922865999079806265115166752366
0.0889878304922865999012043629334822725241
Quad:
0.0889878304922865999004910375213273866639
0.0889878304922865999004910375213273505527
Dec-100:
0.0889878304922865999004910375213273881004
0.0889878304922865999004910375213273881004
Live demo: https://wandbox.org/permlink/tAm4sBIoIuUy2lO6
For double, the first calculation was more accurate; however, I guess one cannot draw any general conclusions here.
Also, note that your input numbers are not exactly representable with the IEEE 754 double-precision floating-point type (none of them is). The question is whether you care about the accuracy of calculations with those exact numbers or with their closest representations.
I'm currently trying to learn about floating point representation in depth, so I played around a bit. While doing so, I stumbled on some strange behaviour; I can't really work out what's happening, and I'd be very grateful for some insight. Apologies if this has been answered, I found it quite hard to google!
#include <iostream>
#include <cmath>
using namespace std;
int main(){
    float minVal = pow(2,-149); // set to smallest float possible
    float nextCheck = static_cast<float>(minVal/2.0f); // divide by two
    bool isZero = (static_cast<float>(minVal/2.0f) == 0.0f); // this evaluates to false
    bool isZero2 = (nextCheck == 0.0f); // this evaluates to true
    cout << nextCheck << " " << isZero << " " << isZero2 << endl;
    // this outputs 0 0 1
    return 0;
}
Essentially what's happening is:
I set minVal to be the smallest float that can be represented using single precision
Dividing by 2 should yield 0 -- we're at the minimum
Indeed, isZero2 does return true, but isZero returns false.
What's going on -- I would have thought them to be identical? Is the compiler trying to be clever, saying that dividing any number cannot possibly yield zero?
Thanks for your help!
The reason isZero and isZero2 can evaluate to different values, and isZero can be false, is that the C++ compiler is allowed to implement intermediate floating-point operations with more precision than the type of the expression would indicate, but the extra precision has to be dropped on assignment.
Typically, when generating code for the 387 historical FPU, the generated instructions work on either the 80-bit extended-precision type, or, if the FPU is set to a 53-bit significand (e.g. on Windows), a strange floating-point type with 53-bit significands and 15-bit exponents.
Either way, minVal/2.0f is evaluated exactly because the exponent range allows to represent it, but assigning it to nextCheck rounds it to zero.
If you are using GCC, there is the additional problem that -fexcess-precision=standard has not yet been implemented for the C++ front-end, meaning that the code generated by g++ does not implement exactly what the standard recommends.
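If you want to probe this on your own setup, here is a small sketch. FLT_EVAL_METHOD from <cfloat> reports how intermediate results are evaluated (0 means in the expression's own type, 2 means in long double, as with the 387):

#include <cfloat>
#include <cmath>
#include <iostream>

int main()
{
    std::cout << "FLT_EVAL_METHOD = " << FLT_EVAL_METHOD << "\n";

    float minVal = std::pow(2.0f, -149.0f); // smallest positive (denormal) float
    float stored = minVal / 2.0f;           // assignment forces rounding to float

    // With excess precision, the unassigned temporary may compare nonzero
    // even though the stored value rounds to zero.
    std::cout << (minVal / 2.0f == 0.0f) << " " << (stored == 0.0f) << "\n";
}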
I'm compiling my code via the following command:
icc -ltbb test.cxx -o test
Then when I run the program:
time ./mp6 100 > output.modified
Floating exception
4.871u 0.405s 0:05.28 99.8% 0+0k 0+0io 0pf+0w
I get a "Floating exception". This following is code in C++ that I had before the exception and after:
// before
if (j < E[i]) {
temp += foo(0, trr[i], ex[i+j*N]);
}
// after
temp += (j < E[i])*foo(0, trr[i], ex[i+j*N]);
This is Boolean algebra... (j < E[i]) is either going to be a 0 or a 1, so the multiplication results in either 0 or the result of foo(). I don't see why this would cause a floating exception.
This is what foo() does:
int foo(int s, int t, int e) {
    switch(s % 4) {
    case 0:
        return abs(t - e)/e;
    case 1:
        return (t == e) ? 0 : 1;
    case 2:
        return (t < e) ? 5 : (t - e)/t;
    case 3:
        return abs(t - e)/t;
    }
    return 0;
}
foo() isn't a function I wrote, so I'm not too sure what it does... but I don't think the problem is with foo(). Is there something about Boolean algebra that I don't understand, or something that works differently in C++ than I realize? Any ideas why this causes an exception?
Thanks,
Hristo
You are almost certainly dividing by zero in foo.
A simple program of
int main()
{
    int bad = 0;
    return 25/bad;
}
also prints
Floating point exception
on my system.
So, you should check whether e is 0 when s % 4 is zero, or whether t is 0 when s % 4 is 2 or 3. Then return whatever value makes sense for your situation instead of trying to divide by zero.
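Here is a hedged sketch of what such a guard could look like; the fallback values of 0 are placeholders, and only you can decide what makes sense for your data:

#include <cstdlib> // std::abs

// Hypothetical guarded variant of foo(): identical cases, but every division
// checks its divisor first instead of risking an integer divide by zero.
int foo_guarded(int s, int t, int e)
{
    switch (s % 4) {
    case 0:
        return (e == 0) ? 0 : std::abs(t - e) / e;
    case 1:
        return (t == e) ? 0 : 1;
    case 2:
        return (t < e) ? 5 : ((t == 0) ? 0 : (t - e) / t);
    case 3:
        return (t == 0) ? 0 : std::abs(t - e) / t;
    }
    return 0;
}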
@hristo: C++ will still evaluate the right-hand side of a multiplication even if the left-hand side is zero. It doesn't matter that the result should be zero; what matters is that foo was called and evaluated and caused an error.
Sample source:
#include <iostream>
int maybe_cause_exception(bool cause_it)
{
    int divisor = cause_it ? 0 : 10;
    return 10 / divisor;
}

int main()
{
    std::cout << "Do not raise exception: " << maybe_cause_exception(false) << std::endl;

    int x = 0;
    std::cout << "Before 'if' statement..." << std::endl;
    if(x)
    {
        std::cout << "Inside if: " << maybe_cause_exception(true) << std::endl;
    }
    std::cout << "Past 'if' statement." << std::endl;

    std::cout << "Cause exception: " << x * maybe_cause_exception(true) << std::endl;
    return 0;
}
Output:
Do not raise exception: 1
Before 'if' statement...
Past 'if' statement.
Floating point exception
Is it possible you are dividing by 0? It could be that an integer division by 0 is surfacing as a "Floating exception".
When you have the if, the computation isn't done if a division by 0 would happen. When you do the "Boolean algebra", the computation is done regardless, resulting in a divide by 0 error.
You're thinking that it will be temp += 0*foo(...); so it doesn't need to call foo (because 0 times anything will always be 0), but that's not how the compiler works. Both sides of a * have to be evaluated.
While I cannot tell you the exact cause of your floating-point exception, I can provide some information you might find useful in investigating future floating-point errors. I believe Mark has already shed some light on why you are having this particular problem.
The most portable way of determining if a floating-point exception condition has occurred and its cause is to use the floating-point exception facilities provided by C99 in fenv.h. There are 11 functions defined in fenv.h for manipulating the floating-point environment (see fenv(3) man page). You may also find this article to be of interest.
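As a minimal sketch of those facilities (note that strict compilers may want the FENV_ACCESS pragma enabled, and optimizers can fold floating-point constants, hence the volatile):

#include <cfenv>
#include <cstdio>

int main()
{
    std::feclearexcept(FE_ALL_EXCEPT);  // start with a clean set of flags

    volatile double zero = 0.0;         // volatile keeps the compiler from folding this away
    volatile double r = 1.0 / zero;
    (void)r;

    if (std::fetestexcept(FE_DIVBYZERO))
        std::puts("FE_DIVBYZERO was raised");
    if (std::fetestexcept(FE_INVALID))
        std::puts("FE_INVALID was raised");
}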
On POSIX compliant systems, SIGFPE is sent to a process when it performs an erroneous arithmetic operation, and this does not necessarily involve floating-point arithmetic. If the SIGFPE signal is handled and SA_SIGINFO is specified in the sa_flags for the call to sigaction(2), the si_code member of the siginfo_t structure specifies the reason for the fault.
From the wikipedia SIGFPE article:
A common oversight is to consider division by zero the only source of SIGFPE conditions. On some architectures (IA-32 included[citation needed]), integer division of INT_MIN, the smallest representable negative integer value, by −1 triggers the signal because the quotient, a positive number, is not representable.
When I suggested replacing the branch with multiplication by one or zero, I did not consider that the if statement may be guarding against a numerical exception. The multiplication trick still evaluates the expression, but effectively throws it away. For a small enough expression, such a trick is better than a conditional, but you have to make sure the expression can be evaluated safely.
You can still use the multiplication trick if you slightly transform the denominator.
Instead of x/t, use x/(t + !t). This does not affect anything when the denominator is nonzero (you are adding zero then), but it allows the case t = 0 to be computed safely, with the result then thrown away by the multiplication by zero.
And sorry, but be careful with my suggestions; I do not know all the details of your program. Plus, I tend to go wild about replacing branches with "clever" Boolean expressions.
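For completeness, a tiny demonstration of the t + !t transformation, with guard playing the role of (j < E[i]) from the question:

#include <cstdio>

int main()
{
    int x = 10, t = 0;
    int guard = 0;                       // the condition that would have been the if
    int result = guard * (x / (t + !t)); // t + !t == 1 when t == 0, so no SIGFPE
    std::printf("%d\n", result);         // prints 0
}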