I want to write a class representing a Markov chain (let's name it MC). It has a constructor, which takes the state transition matrix (that is, a vector<vector<double>>). I suppose it is a good idea to check that it is really a square matrix (has the same number of rows and columns) and is really a transition matrix: all the numbers in it are probabilities, that is, no less than 0.0 and no greater than 1.0, and for every row the sum of its elements is 1.0. However, there is a problem which arises from floating point limitations: for example, the sum 0.3 + 0.3 + 0.3 + 0.1, evaluated in double precision, will not be exactly equal to 1.0, so the check will not be that easy. I see two possible solutions to that problem:
Choose some epsilon and compare with epsilon error. Of course it will now accept some matrices violating the transition matrix property, but in general, if someone occasionally passes some bad data into the constructor, he will get an exception.
Don't check anything, rely on the class' user, if he passes something bad, it is completely his fault, and the behavior of the class will be unexpected.
Which approach is better and more "real-world"? I like the first, but again, I am not sure how I should choose epsilon.
Do the second one.
Your class isn't in the business of summing up lists of floating-point numbers and deciding what's "close enough" to 1 and what isn't. Your user is. Your class represents Markov chains. You won't be able to choose a value of epsilon so that your class represents Markov chains in a useful way.
Think about the operations you're going to implement. Maybe you're going to have a function that hits a probability distribution on the states of the chain with the chain's transition matrix. Should that function check whether the input probability distribution is a probability distribution within some epsilon?
Your function almost certainly won't preserve the "is a probability distribution" property; you'll get some drift due to rounding error away from the space of probability distributions as you repeatedly hit your probability distribution by your Markov chain. You can correct this by normalising afterward, but that causes even more inaccuracy.
Now think about the "given a Markov chain and an integer k, return the Markov chain formed by iterating the input chain k times" operation. You can see that this will accumulate roundoff and suffer from much the same problems as "hit probability distribution with Markov chain."
Wouldn't it suck if you only had a choice between stuff that breaks after 12 hours of use and stuff that's unnecessarily inaccurate?
(Checking the squareness and matrixness of the square matrix argument is, of course, totally reasonable.)
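A minimal sketch of what this answer suggests, assuming the class is called MC as in the question. The member names (p_, step, size) are made up for illustration. The constructor only checks that the argument is a non-empty square matrix; it deliberately does not sum rows or compare anything against an epsilon.

```cpp
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical sketch of the MC class from the question.
class MC {
public:
    using Matrix = std::vector<std::vector<double>>;

    // Validates shape only; whether the rows sum to 1 is left to the caller.
    explicit MC(Matrix transition) : p_(std::move(transition)) {
        if (p_.empty())
            throw std::invalid_argument("transition matrix is empty");
        for (const auto& row : p_) {
            if (row.size() != p_.size())
                throw std::invalid_argument("transition matrix is not square");
        }
    }

    std::size_t size() const { return p_.size(); }

    // One step of the chain: returns the row vector dist * P.
    // No check that dist is a probability distribution.
    std::vector<double> step(const std::vector<double>& dist) const {
        if (dist.size() != p_.size())
            throw std::invalid_argument("distribution has the wrong length");
        std::vector<double> next(p_.size(), 0.0);
        for (std::size_t i = 0; i < p_.size(); ++i)
            for (std::size_t j = 0; j < p_.size(); ++j)
                next[j] += dist[i] * p_[i][j];
        return next;
    }

private:
    Matrix p_;
};
```

A caller who wants row-sum validation can still do it before constructing the object, with a tolerance that makes sense for their data.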
I'm writing a C++ program that takes the FFT of a real input signal containing double values and returns a vector X containing std::complex<double> values. Once I have the vector of results I then attempt to calculate the magnitude and phase of the result.
I am running into an issue with calculating the phase angle when one of the outputs is "zero". Zero is in quotes because when a calculation that results in 0 returns a double, the returned value will be very near zero, but not quite exactly zero.
For example, at index 3 my output array has the calculated "zero" value:
X[3] = 3.0531133177191805e-16 - i*5.5511151231257827e-17
I am trying to use the standard library std::arg function that is supposed to return the phase angle of a complex number. std::arg(X[3])
While X[3] is essentially 0, it is not EXACTLY 0, and the way phase is calculated this causes a problem, because the calculation uses the ratio of the imaginary part to the real part, which is far from 0!
Doing the actual calculation results in a far from desirable result.
How can I make C++ realize that the result is really 0 so I can get the correct phase angle?
I'm looking for a more elegant solution than using an arbitrary hard-coded "epsilon" value to compare the double to, but so far searching online I haven't had any luck coming up with something better.
If you are computing the floating-point FFT of an input signal, then that signal will include noise, thus have a signal-to-noise ratio, including sensor noise, thermal noise, quantization noise, timing jitter noise, etc.
Thus the threshold for discarding FFT results as below your noise floor most likely isn't a matter of computational mathematics, but part of your physical or electronic data acquisition analysis. You will have to plug that number in, and set the phase to 0.0 or NaN or whatever your default flagging value is for a non-useful (at or below the noise floor) FFT result.
It was brought to my attention that my original answer will not work when the input to the FFT has been scaled. I believe I have an actual valid solution now... The original answer is kept below so that the comments still make sense.
From the comments on this answer and others, I've gathered that calculating the exact rounding error in the language may technically be possible but it is definitely not practical. The best practical solution seems to be to allow the user to provide their own noise threshold (in dB) and ignore any data points whose power level falls below that threshold. It would be impossible to come up with a generic threshold for all situations, but the user can provide a reasonable threshold based on the signal-to-noise ratio of the signal being analyzed and pass that in.
A generic phase calculation function is shown below that calculates the phase angles for a vector of complex data points.
std::vector<double> Phase(const std::vector<std::complex<double>>& X, double threshold, double amplitude)
{
    const std::size_t N = X.size();
    std::vector<double> X_phase(N);
    std::transform(X.begin(), X.end(), X_phase.begin(), [threshold, amplitude](const std::complex<double>& value) {
        // Power level in dB relative to the maximum amplitude
        const double level = 10.0 * std::log10(std::norm(value) / std::pow(amplitude, 2.0));
        // At or below the threshold the phase is not meaningful; report 0.0
        return level > threshold ? std::arg(value) : 0.0;
    });
    return X_phase;
}
This function takes 3 arguments:
The vector of complex signal data you want to calculate the phase of.
A sensible threshold -- Can be calculated from the signal-to-noise ratio of whatever measurement device was used to capture the signal. If your signal contains no noise other than the rounding errors of the language itself you can set this to some arbitrary really low value, like -120dB.
The maximum possible amplitude of your input signal. If your signal is calculated, this should simply be set to the amplitude of your signal. If your signal is measured, this should be set to the maximum amplitude the measuring device is capable of measuring (If your signal comes from reading an audio file, often its data will be normalized between -1.0 and 1.0. In this case you would just set the amplitude value to 1.0).
This new implementation still provides me with the correct results, but is much more robust. By leaving the threshold calculation to the user they can set the most sensible value themselves based on the characteristics of the measurement device used to measure their input signal.
Please let me know if you notice any errors or any ways I can further improve the design!
Original Answer
I found a solution that seems generic enough.
In the #include <limits> header, there is a constant value for std::numeric_limits<double>::digits10.
According to the documentation:
The value of std::numeric_limits<T>::digits10 is the number of base-10 digits that can be represented by the type T without change, that is, any number with this many significant decimal digits can be converted to a value of type T and back to decimal form, without change due to rounding or overflow.
Using this I can filter out any output values that have a magnitude lower than this limit:
Calculate the phase of X[3]:
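const std::size_t N = X.size();
// Smallest scaled magnitude treated as non-zero: 10^(-digits10)
const double limit = std::pow(10.0, -std::numeric_limits<double>::digits10);
// Keep X[3] if it is above the limit, otherwise replace it with an exact zero
const std::complex<double> tmp = std::abs(X[3]) / N > limit
    ? X[3]
    : std::complex<double>(0.0, 0.0);
double phase = std::arg(tmp);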
This effectively filters out any values that are not precisely zero due to rounding errors within the C++ language itself. It will NOT however filter out garbage data caused by noise in the input signal.
After adding this to my phase calculation I get the expected results.
The map from complex numbers to magnitude and phase is discontinuous at 0.
This is a discontinuity caused by the choice of coordinates you are using.
The solution will depend on why you chose those coordinates in a situation where values near the discontinuity are possible.
It isn't "really" zero. If you factored in error bars properly, your answer would really be a small magnitude (hopefully) and an unconstrained angle.
While going through this post on SO, I saw the user #skrebbel state that the Google testing framework does a good and fast job of comparing floats and doubles. So I wrote the following code to check that claim, and apparently I am missing something, since I was expecting to enter the almost-equal branch. This is my code:
float left = 0.1234567;
float right= 0.1234566;
const FloatingPoint<float> lhs(left), rhs(right);
if (lhs.AlmostEquals(rhs))
{
std::cout << "EQUAL"; //Shouldnt it have entered here ?
}
Any suggestions would be appreciated.
You can use
ASSERT_NEAR(val1, val2, abs_error);
where abs_error is the acceptable absolute difference that you choose yourself (say, 0.0000001), if the default tolerance of AlmostEquals is too small; see https://github.com/google/googletest/blob/master/googletest/docs/advanced.md#floating-point-comparison
Your left and right are not “almost equal” because they are too far apart, farther than the default tolerance of AlmostEquals. The code in one of the answers in the question you linked to shows a tolerance of 4 ULP, but your numbers are 14 ULP apart (using IEEE 754 32-bit binary and correctly rounding software). (An ULP is the minimum increment of the floating-point value. It is small for floating-point numbers of small magnitude and large for large numbers, so it is approximately relative to the magnitude of the numbers.)
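To see the ULP distance for yourself, here is a small sketch. ulpDistance is a made-up helper name, not part of googletest, and it assumes float is IEEE 754 binary32 and that both inputs are finite with the same sign. Under those assumptions adjacent floats have adjacent bit patterns, so the difference of the patterns is the distance in ULPs.

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: distance in ULPs between two finite floats of the
// same sign, assuming IEEE 754 binary32 representation.
std::int64_t ulpDistance(float a, float b) {
    std::int32_t ia = 0;
    std::int32_t ib = 0;
    // memcpy is the portable way to read the bit pattern of a float.
    std::memcpy(&ia, &a, sizeof a);
    std::memcpy(&ib, &b, sizeof b);
    return std::llabs(static_cast<std::int64_t>(ia) - static_cast<std::int64_t>(ib));
}
```

For the two values in the question, ulpDistance(0.1234567f, 0.1234566f) comes out as 14, well above a 4 ULP tolerance.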
You should never perform any floating-point comparison without understanding what errors may be in the values you are comparing and what comparison you are performing.
People often misstate that you cannot test floating-point values for equality. This is false; executing a == b is a perfect operation. It returns true if and only if a is equal to b (that is, a and b are numbers with exactly the same value). The actual problem is that they are trying to calculate a correct function given incorrect input. == is a function: It takes two inputs and returns a value. Obviously, if you give any function incorrect inputs, it may return an incorrect result. So the problem here is not floating-point comparison; it is incorrect inputs. You cannot generally calculate a sum, a product, a square root, a logarithm, or any other function correctly given incorrect input. Therefore, when using floating-point, you must design an algorithm to work with approximate values (or, in special cases, use great care to ensure no errors are introduced).
Often people try to work around errors in their floating-point values by accepting as equal numbers that are slightly different. This decreases false negatives (indications of inequality due to prior computing errors) at the expense of increasing false positives (indications of equality caused by lax acceptance). Whether this exchange of one kind of error for another is acceptable depends on the application. There is no general solution, which is why functions like AlmostEquals are generally bad.
The errors in floating-point values are the results of preceding operations and values. These errors can range from zero to infinity, depending on circumstances. Because of this, one should never simply accept the default tolerance of a function such as AlmostEquals. Instead, one should calculate the tolerance, which is specific to their applications, needs, and computations, and use that calculated tolerance (or not use a comparison at all).
Another problem is that functions such as AlmostEquals are often written using tolerances that are specified relative to the values being compared. However, the errors in the values may have been affected by intermediate values of vastly different magnitude, so the final error might be a function of data that is not present in the values being compared.
“Approximate” floating-point comparisons may be acceptable in code that is testing other code because most bugs are likely to cause large errors, so a lax acceptance of equality will allow good code to continue but will report bugs in most bad code. However, even in this situation, you must set the expected result and the permitted error tolerance appropriately. The AlmostEquals code appears to hard-code the error tolerance.
(Not sure if this 100% applies to the original question but this is what I came for when I stumbled upon it)
There also exist ASSERT_FLOAT_EQ and EXPECT_FLOAT_EQ (or the corresponding versions for double) which you can use if you don't want to worry about tolerable errors yourself.
Docs: https://github.com/google/googletest/blob/master/docs/reference/assertions.md#floating-point-comparison-floating-point
I am trying to implement a root finding algorithm. I am using the hybrid Newton-Raphson algorithm found in numerical recipes that works pretty nicely. But I have a problem in bracketing the root.
While implementing the root finding algorithm I realised that in several cases my functions have 1 real root and all the others are imaginary (several of them, usually 6 or 9). The only root I am interested in is the real one, so the problem is not there. The thing is that the function approaches the root like a cubic function, just touching the y=0 axis at a single point...
The Newton-Raphson method needs a bracket whose endpoints give function values of different sign, and all the bracketing methods I found don't work for this specific case.
What can I do? It is pretty important to find that root in my program...
EDIT: more problems: sometimes, due to really small numerical errors, say a variation of 1e-6 in some value, the "cubic" function does NOT have that real root; it is just complex with a negligible imaginary part... (checked with matlab)
EDIT 2: Much more information about the problem.
Ok, I need root finding algorithm.
Info I have:
The root I need to find is between [0-1] , if there are more roots outside that part I am not interested in them.
The root is real, there may be imaginary roots, but I don't want them.
Probably all the rest of the roots will be imaginary
The root may be double at that point, but I think that actually doesn't matter in numerical analysis problems
I need to use the root finding algorithm several times during the overall calculations, but the function will always be a polynomial
In one of the particular cases of the root finding, my polynomial will be similar to a quadratic function that touches Y=0 with the point. Example of a real case:
The coefficient may not be 100% precise and that really slight imprecision may make the function not to touch the Y=0 axis.
I cannot solve for this specific case because in other cases it may be that the polynomial is pretty normal and doesn't make any "strange" thing.
The method I am actually using is NewtonRaphson hybrid, where if the derivative is really small it makes a bisection instead of NewRaph (found in numerical recipes).
Matlab's answer to the function on the image:
roots:
0.853553390593276 + 0.353553390593278i
0.853553390593276 - 0.353553390593278i
0.146446609406726 + 0.353553390593273i
0.146446609406726 - 0.353553390593273i
0.499999999999996 + 0.000000040142134i
0.499999999999996 - 0.000000040142134i
The function is a real example I prepared where I know that the answer I want is 0.5
Note:
I still haven't completely checked some of the answers you people have given me (Thank you!); I am just trying to give all the information I already have to complete the question.
Assuming you have a one-dimensional polynomial problem (which I assume from the imaginary solutions) you can use Sturm sequences to bracket all real roots. See Sturm's theorem.
Welcome to the wonderful world of numerical methods. Watch your hairline; it might start receding as you pull your hair out in frustration.
First off, with numerical root finding, you are toast if you can't bracket the problem. Newton Raphson is nice for polishing off a solution once you get close, and it only works if the derivative near the root is well away from zero. You always need to have some slower technique at hand as a backup because Newton Raphson can send you off to never-never land (i.e., somewhere well outside the bracket). If your function is not a polynomial, the first thing to try is Brent's method. If your function is a polynomial, try Laguerre's method or Jenkins-Traub.
BTW, it sounds like you have a pathological problem. You shouldn't expect particularly good performance. Pathological problems are, well, pathological.
Addendum
If you are having problems with things that appear to be roots, but aren't, you need to take care how you evaluate your function. If you do have a polynomial, form each term of the polynomial, sort by absolute value, and add smallest to largest. This produces better accuracy most of the time, but fails if you have large terms whose sum is nearly zero. If that's the case, you might want to add those canceling terms separately, add the rest smallest to largest, and then compute a grand total -- and you're still kinda screwed. That big addition that nearly cancels loses a lot of precision. There's no escape other than extended precision arithmetic.
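A rough sketch of the "sort by absolute value, add smallest to largest" idea, in case it helps. The function name evalSortedTerms and the ascending coefficient order (coeffs[0] is the constant term) are my own assumptions. As noted above, this does nothing about large terms that nearly cancel.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical helper: evaluates coeffs[0] + coeffs[1]*x + coeffs[2]*x^2 + ...
// by forming every term, sorting by magnitude, and summing smallest first.
double evalSortedTerms(const std::vector<double>& coeffs, double x) {
    std::vector<double> terms;
    terms.reserve(coeffs.size());
    double power = 1.0;  // x^i for the current term
    for (double c : coeffs) {
        terms.push_back(c * power);
        power *= x;
    }
    std::sort(terms.begin(), terms.end(),
              [](double a, double b) { return std::fabs(a) < std::fabs(b); });
    double sum = 0.0;
    for (double t : terms) sum += t;
    return sum;
}
```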
Ander, thanks for responding to my question (about the interval); sorry for the delay in following up - work has been very busy. Also - before I found the additional information you've provided - I had in mind to explain quite a few things about how to handle this and was contemplating how to present that. However, I now believe your case is not too difficult and we can get at it without too much additional stuff, since you apparently have an explicit polynomial expression (coefficients to the various powers).
Let's start with a simple case, to pinpoint the approach.
Step 1.
If you have a 2nd degree polynomial, its derivative is first order and has a simple zero (which you can find by bracketing or simply by explicitly solving the equation). (Yes, I know there's a closed formula for the roots of a 2nd degree polynomial also, but for the sake of the current argument, let us forget that).
The zeros of the 2nd degree polynomial are then located one at the left side and one at the right side of the zero of the derivative. So, if you also have the interval where the roots of the original function (the 2nd degree polynomial) are to be found, you now have two intervals - left and right of the derivative-zero, each with one zero.
It is important to realize that the original function is MONOTONIC on each subinterval (decreasing on one of them, increasing on the other). Therefore, simply by checking the function values at the ends of the (sub)interval you can determine whether or not they actually bracket a zero. If not, there's a multiple zero (double, in this case) exactly at the zero of the derivative IF the function is zero there (otherwise, it is a double imaginary root of which you've now found the real part).
In case the zero of the derivative lies OUTSIDE the total interval, you will have at most one root inside your interval and you need to check only that particular (sub)interval.
Step 2.
Consider now a 3rd order polynomial.
Its derivative is 2nd order.
The derivative of THAT 2nd order polynomial is again 1st order and you proceed as before to get two subintervals to find the roots of the derivative of the original function. These two roots give you THREE (at most) intervals where you will find the 3 roots of the original (3rd order) function.
And also here, you will have intervals (3) where the original function is monotonic (alternatingly increasing/decreasing), making the analysis per subinterval quite easy.
Again, zeros may coincide (2 or even all 3) and may in addition turn out to be complex-valued (i.e. having non-zero imaginary parts). The analysis of the cases is straightforward: check function values at the borders of the intervals to assess whether or not there's a sign-change (function is monotonic on each subinterval) and/or whether the function is zero at one of the subinterval-borders.
Step 3.
Generalize this with the known polynomial. Let's say - your example - it is 6th order:
a) construct the 5th derivative (i.e. reducing the original to a 1st order polynomial). Compute its zero (it is at precisely 0.5 in your example). In this case you're already done, but suppose you don't realize that. So you have now 2 intervals 0..0.5 and 0.5..1
b) construct the 4th derivative. Inspect its values at the subinterval-boundaries (0, 0.5, 1)
For each subinterval determine if it has a real zero inside. If so, you re-partition your original interval in 3 subintervals, using the two found zeros (you forget about the zero of the 5th derivative). If they coincide (at the previous cut, 0.5) you stick with that 0.5 (don't care whether you've found a true double zero of your 4th derivative there or a "double imaginary") and still have only 2 intervals, but for the sake of the argument let's say you now have 3.
c) construct the 3rd derivative and do likewise as before. You will then have 4 (at most) intervals.
d) And so on. After having processed the 2nd derivative in this fashion you have 5 (at most) intervals, and after processing the 1st derivative you have 6 intervals (or fewer...) and knowing the function is monotonic on each subinterval, you'll quickly determine in each of them if there's a real root, as always using the known monotonicity of the function in each of the final subintervals.
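The recursion in the steps above can be sketched roughly as follows. This is only an illustration under my own assumptions: coefficients are stored in ascending order, all function names (evalPoly, derivative, signChange, bisect, realRoots) are made up, and the caller supplies tol, the tolerance for deciding that the function "touches" zero at a cut. Roots closer together than 1e-9 are merged, which is an arbitrary choice.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Coefficients in ascending order: c[0] + c[1]*x + c[2]*x^2 + ...
using Poly = std::vector<double>;

double evalPoly(const Poly& c, double x) {
    double acc = 0.0;
    for (std::size_t i = c.size(); i-- > 0;) acc = acc * x + c[i];
    return acc;
}

Poly derivative(const Poly& c) {
    Poly d;
    for (std::size_t i = 1; i < c.size(); ++i)
        d.push_back(static_cast<double>(i) * c[i]);
    return d;
}

bool signChange(double a, double b) {
    return (a < 0.0 && b > 0.0) || (a > 0.0 && b < 0.0);
}

// Plain bisection; requires a strict sign change between lo and hi.
double bisect(const Poly& c, double lo, double hi) {
    double flo = evalPoly(c, lo);
    for (int i = 0; i < 200; ++i) {
        const double mid = 0.5 * (lo + hi);
        const double fmid = evalPoly(c, mid);
        if (fmid == 0.0) return mid;
        if (signChange(flo, fmid)) { hi = mid; } else { lo = mid; flo = fmid; }
    }
    return 0.5 * (lo + hi);
}

// Real roots of c in [a, b], in ascending order. The roots of the derivative
// cut [a, b] into pieces on which c is monotonic.
std::vector<double> realRoots(const Poly& c, double a, double b, double tol) {
    std::vector<double> roots;
    // Append r unless it duplicates the previous root (1e-9 is arbitrary).
    auto add = [&roots](double r) {
        if (roots.empty() || r - roots.back() > 1e-9) roots.push_back(r);
    };
    if (c.size() < 2) return roots;  // constant: nothing to report
    if (c.size() == 2) {             // linear: solve directly
        if (c[1] != 0.0) {
            const double r = -c[0] / c[1];
            if (r >= a && r <= b) add(r);
        }
        return roots;
    }
    std::vector<double> cuts{a};
    for (double r : realRoots(derivative(c), a, b, tol)) cuts.push_back(r);
    cuts.push_back(b);
    for (std::size_t i = 0; i + 1 < cuts.size(); ++i) {
        const double flo = evalPoly(c, cuts[i]);
        const double fhi = evalPoly(c, cuts[i + 1]);
        if (signChange(flo, fhi)) add(bisect(c, cuts[i], cuts[i + 1]));
        else if (std::fabs(flo) <= tol) add(cuts[i]);  // touches zero at a cut
    }
    if (std::fabs(evalPoly(c, b)) <= tol) add(b);
    return roots;
}
```

For example, realRoots({0.25, -1.0, 1.0}, 0.0, 1.0, 1e-12) describes (x - 0.5)^2 and reports the single touching root at 0.5, which a pure sign-change search would miss.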
Adding a note on numerical accuracy at evaluating a function:
A first (probably sufficient, in this case) method to reduce noise is NOT to evaluate your function in the way suggested by the original form (i.e. a6 x^6 + a5 x^5 +..), but to rewrite it as:
a0 + x*(a1 + x*(a2 + x*(a3 + x*(a4 + x*(a5 + x*a6)))))
So, in evaluating you proceed:
tmp = a6
tmp = x*tmp + a5
tmp = x*tmp + a4
etcetera.
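As a loop, the nested evaluation above (Horner's scheme) might look like this. The function name horner and the ascending coefficient order (a[0] is a0, a.back() is the leading coefficient) are assumptions made for the sketch.

```cpp
#include <cstddef>
#include <vector>

// Evaluates a[0] + x*(a[1] + x*(a[2] + ...)) starting from the
// highest-order coefficient, exactly as in the tmp = x*tmp + a_k steps.
double horner(const std::vector<double>& a, double x) {
    double tmp = 0.0;
    for (std::size_t i = a.size(); i-- > 0;) {
        tmp = x * tmp + a[i];
    }
    return tmp;
}
```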
In case this little rewriting is not sufficient for numerical stability, you should rewrite your polynomial as (for instance) a Chebyshev polynomial expansion and evaluate that one with its recurrence relations. Both (getting the expansion and applying the recurrence relations for evaluation) are rather simple. I can explain, if you need help, but I guess it won't be necessary here.
In all cases, you HAVE to allow for some inaccuracy, i.e. accept that a computation will, generally speaking, NEVER give you the mathematically exact function value. So the assessment whether the function is presumably zero at some point must include some "tolerance", there's no way around this, unfortunately; the best you can aim for is to minimize the noise.
Well, if your function touches zero but never crosses it, you seem to be looking for a minimum (or a maximum). In which case, you're better off telling computer to do exactly that --- either find the root of a derivative (if you can calculate it analytically), or use a minimization routine. Then check that the function value at the minimum is 'close enough' to zero.
Just to reiterate what was already said by other people:
don't start with Newton-Raphson method; it's almost always better to start with Brent or even a straightforward bisection (provided you can bracket the root).
An instability where 'small numerical errors' of the order of 1e-6 have bad effects is worth investigating. Immediate suspects: mixing floats and doubles, loss of precision somewhere etc.
EDIT: So, depending on some parameters, your function has either a zero crossing, or a minimum with zero value, is this correct? In this case, what I'd do is this: use a simple and robust bracketing strategy (e.g. start from [-1, 1], multiply the endpoints by 1.1, check the signs, keep multiplying, something like this). If that succeeds, there's a zero crossing, use a root finding routine. If bracketing fails, use minimization.
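A sketch of the expanding-bracket idea, with made-up names (Bracket, expandBracket) and defaults taken from the numbers mentioned above: start from [-1, 1] and multiply both endpoints by 1.1 each step. The step limit of 200 is an arbitrary assumption. If found comes back false, that is the cue to fall back to minimization.

```cpp
#include <functional>

struct Bracket {
    double lo;
    double hi;
    bool found;  // true if f(lo) and f(hi) have strictly opposite signs
};

// Hypothetical helper: grows [lo, hi] geometrically until the endpoint
// values differ in sign, or gives up after maxSteps attempts.
Bracket expandBracket(const std::function<double(double)>& f,
                      double lo = -1.0, double hi = 1.0,
                      double factor = 1.1, int maxSteps = 200) {
    for (int i = 0; i < maxSteps; ++i) {
        const double flo = f(lo);
        const double fhi = f(hi);
        if ((flo < 0.0 && fhi > 0.0) || (flo > 0.0 && fhi < 0.0)) {
            return {lo, hi, true};
        }
        lo *= factor;
        hi *= factor;
    }
    return {lo, hi, false};
}
```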
Using Newton-Raphson is an act of desperation. You are much better off finding the continued fraction that represents your function and calculating that. A CF will converge much faster and will produce the real root(s). Also, because the CF produces a ratio of two integers you have tight control over numeric precision and don't have to worry about accumulation of rounding errors and other similar hair-pulling-out problems.
To find the real roots of any polynomial function refer to "A Continued Fraction Algorithm for Approximating All Real Polynomial Roots" by David Rosen (1978).
------------ ADDENDUM 1 --- 11 OCT-----------------
Ok, you are solving a sextic. You have several options. The simplest is to use a Taylor approximation (say to the 3rd degree) in conjunction with Halley's method. This is much superior to Newton because it has cubic convergence and you can detect imaginary solutions. The disadvantage is that you will have rounding problems which may result in an incorrect answer.
The ideal option is to find the continued fraction that represents the monic root, because this CF will be computable as an integer ratio of any desired precision, thus eliminating the problem of rounding.
One approach to computing this CF is via the Jacobi-Perron algorithm. See the paper Hendy and Jeans: http://www.ams.org/mcom/1981-36-154/S0025-5718-1981-0606514-X/S0025-5718-1981-0606514-X.pdf. This paper shows the exact algorithm for computing cubic and quartic roots via CF approximation.
Note that if the sextic is reducible then it can be converted into a quartic and quadratic: http://elib.mi.sanu.ac.rs/files/journals/tm/21/tm1124.pdf. The quartic is then solvable by the algorithm in the Hendy paper.
The general solution to generate a CF for a sextic can be done via the Rogers-Ramanujan CF. See the following paper for the method: http://arxiv.org/pdf/1111.6023v2. This will generate the CF for any sextic.
As in your case, you are interested in the real factorization of a real polynomial. One may see that all complex roots come in conjugate pairs which correspond to a real quadratic factor. By finding this real quadratic and completing the square to get the form (x-r)^2 + s you will be able to see the "real" even order root r with an "error" given by s. If s > 0 is too large, you may discard it as probably being complex. If s < 0 is also large, then you have two faraway real roots given by x = r ± √(-s). If s is very small then you might suspect r is a real double root and keep it.
Finding such a quadratic factor may be done using Bairstow's method, which actually applies a two-dimensional Newton method. This gives x^2 + ux + v and r = -u/2; s = v - r^2.
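Once a quadratic factor x^2 + ux + v is in hand (from Bairstow's method or otherwise), the classification described above could be sketched like this. The names QuadraticRoots and classifyQuadratic are made up, and tol is whatever "very small" means for your problem; finding the factor itself is not shown.

```cpp
#include <cmath>
#include <vector>

struct QuadraticRoots {
    double r;                  // centre: x^2 + u*x + v = (x - r)^2 + s
    double s;                  // "error" term after completing the square
    std::vector<double> real;  // real roots accepted under the tolerance
};

// Hypothetical helper: completes the square and decides which real roots,
// if any, the factor x^2 + u*x + v contributes.
QuadraticRoots classifyQuadratic(double u, double v, double tol) {
    const double r = -u / 2.0;
    const double s = v - r * r;
    std::vector<double> real;
    if (std::fabs(s) <= tol) {
        real.push_back(r);  // treat as a real double root at r
    } else if (s < 0.0) {
        const double d = std::sqrt(-s);
        real.push_back(r - d);  // two distinct real roots
        real.push_back(r + d);
    }
    // s > tol: complex conjugate pair, no real roots reported
    return {r, s, real};
}
```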
I'm writing a program that uses a very long recursion (about 50,000) and some very large vectors (also 50,000 in length of type double) to store the result of each recursion before averaging them. At the end of the program, I expect to get a number output.
However, some of the results I got were "nan". The mysterious thing is, if I reduce the number of recursions the program will work just fine. So I'm guessing this might be something to do with the size of the vector. So my question is, if you get an overflow in a very long vector (or say array), what will be the effect? Will you get a "nan" just like in my case?
Another mysterious thing about my program is that I have tried some even larger recursions (100,000), but the output was normal. But when I changed a parameter value, so that each number stored in the vector becomes larger (although they are still of type double), the output becomes "nan". Will the maximum capacity of a vector be dependent on the size of the numbers it stores?
You didn't tell us what your recursion is, but it is fairly easy to generate NaNs with a long sequence of operations if you are using square root, pow, inverse sine, or inverse cosine.
Suppose your calculation produces a quantity, call it x, that is supposed to be the sine of some angle θ, and suppose the underlying math dictates that x must always be between -1 and 1, inclusive. You calculate θ by taking the inverse sine of x.
Here's the problem: Arithmetic done on a computer is but an approximation of the arithmetic of the real numbers. Addition and multiplication with IEEE floating point numbers are not associative. You might well get a value of 1.0000000000000002 for x instead of 1. Take the inverse sine of this value and you get a NaN.
A standard trick is to protect against those near misses that result from numerical errors. Don't use the built-in asin, acos, sqrt, and pow. Use wrappers that protect against things like asin(1.0000000000000002) and sqrt(-1e-16). Make the former pi/2 rather than NaN, and make the latter zero. This is admittedly a kludge, and doing this can get you in trouble. What if the problem is that your calculations are formulated incorrectly? It's legitimate to treat 1.0000000000000002 as 1, but it's best not to treat a value of 100 as if it were 1. A value of 100 to your asin wrapper is best treated by throwing an exception rather than truncating to 1.
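Such wrappers might look roughly like this. The names safeAsin and safeSqrt and the default slack of 1e-12 are assumptions for the sketch; pick a slack that matches the rounding error you actually expect. Inputs slightly outside the domain are clamped, inputs far outside throw.

```cpp
#include <cmath>
#include <stdexcept>

// Hypothetical wrapper: clamps near misses to [-1, 1], throws on real errors.
double safeAsin(double x, double slack = 1e-12) {
    if (x > 1.0) {
        if (x - 1.0 > slack) throw std::domain_error("asin argument far above 1");
        return std::asin(1.0);   // pi/2
    }
    if (x < -1.0) {
        if (-1.0 - x > slack) throw std::domain_error("asin argument far below -1");
        return std::asin(-1.0);  // -pi/2
    }
    return std::asin(x);
}

// Hypothetical wrapper: treats tiny negative inputs as zero.
double safeSqrt(double x, double slack = 1e-12) {
    if (x < 0.0) {
        if (-x > slack) throw std::domain_error("sqrt argument far below 0");
        return 0.0;
    }
    return std::sqrt(x);
}
```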
There's one other problem with infinities and NaNs: They propagate. An Inf or NaN in one single computation quickly becomes an Inf or a NaN in hundreds, then thousands of values. I usually make the floating point machinery raise a floating point exception on obtaining an Inf or NaN instead of continuing on. (Note well: Floating point exceptions are not C++ exceptions.) When you do this, your program will bomb unless you have a signal handler in place. That's not necessarily a bad thing. You can run the program in the debugger and find exactly where the problem arose. Without these floating point exceptions it is very hard to find the source of the problem.
Depends on the exact nature of your computations. If you just add up finite numbers, the result shouldn't be NaN. It might be +infinity, though.
But you will get NaN if e.g. some part of your computation yields +infinity, another -infinity, and you later add those two results.
Assuming that your architecture conforms to IEEE 754, this http://en.wikipedia.org/wiki/NaN#Creation tells the situations in which arithmetic operations return NaN.
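A small illustration of the two cases, assuming IEEE 754 doubles and no fast-math style compiler options. sumAll is a made-up helper name. Summing large finite values overflows to +infinity, and only combining +infinity with -infinity produces NaN.

```cpp
#include <cmath>
#include <limits>
#include <numeric>
#include <vector>

// Hypothetical helper: plain left-to-right sum of a vector of doubles.
double sumAll(const std::vector<double>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0);
}

// Returns true if the overflow-then-cancel scenario yields NaN.
bool overflowThenCancelIsNan() {
    const double big = std::numeric_limits<double>::max();
    const double up = sumAll({big, big});      // overflows to +infinity
    const double down = sumAll({-big, -big});  // overflows to -infinity
    return std::isinf(up) && std::isinf(down) && std::isnan(up + down);
}
```

The length of the vector plays no role here; what matters is the magnitude of the stored values.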
I am writing a simulation program that proceeds in discrete steps. The simulation consists of many nodes, each of which has a floating-point value associated with it that is re-calculated on every step. The result can be positive, negative or zero.
In the case where the result is zero or less something happens. So far this seems straightforward - I can just do something like this for each node:
if (value <= 0.0f) something_happens();
A problem has arisen, however, after some recent changes I made to the program in which I re-arranged the order in which certain calculations are done. In a perfect world the values would still come out the same after this re-arrangement, but because of the imprecision of floating point representation they come out very slightly different. Since the calculations for each step depend on the results of the previous step, these slight variations in the results can accumulate into larger variations as the simulation proceeds.
Here's a simple example program that demonstrates the phenomena I'm describing:
float f1 = 0.000001f, f2 = 0.000002f;
f1 += 0.000004f; // This part happens first here
f1 += (f2 * 0.000003f);
printf("%.16f\n", f1);
f1 = 0.000001f, f2 = 0.000002f;
f1 += (f2 * 0.000003f);
f1 += 0.000004f; // This time this happens second
printf("%.16f\n", f1);
The output of this program is
0.0000050000057854
0.0000050000062402
even though in exact arithmetic the order of the additions would not matter, so both results should be the same. Note: I understand perfectly well why this is happening - that's not the issue. The problem is that these variations can mean that sometimes a value that used to come out negative on step N, triggering something_happens(), now may come out negative a step or two earlier or later, which can lead to very different overall simulation results because something_happens() has a large effect.
What I want to know is whether there is a good way to decide when something_happens() should be triggered that is not going to be affected by the tiny variations in calculation results that result from re-ordering operations so that the behavior of newer versions of my program will be consistent with the older versions.
The only solution I've so far been able to think of is to use some value epsilon like this:
if (value < epsilon) something_happens();
but because the tiny variations in the results accumulate over time I need to make epsilon quite large (relatively speaking) to ensure that the variations don't result in something_happens() being triggered on a different step. Is there a better way?
I've read this excellent article on floating point comparison, but I don't see how any of the comparison methods described could help me in this situation.
Note: Using integer values instead is not an option.
Edit: The possibility of using doubles instead of floats has been raised. This wouldn't solve my problem since the variations would still be there, they'd just be of a smaller magnitude.
I've worked with simulation models for 2 years and the epsilon approach is the sanest way to compare your floats.
Generally, using suitable epsilon values is the way to go if you need to use floating point numbers. Here are a few things which may help:
If your values are in a known range and you don't need divisions, you may be able to scale the problem and use exact operations on integers. In general, though, these conditions don't hold.
A variation is to use rational numbers to do exact computations. This still has restrictions on the operations available and it typically has severe performance implications: you trade performance for accuracy.
The rounding mode can be changed. This can be used to compute an interval rather than an individual value (possibly with 3 values resulting from rounding up, rounding down, and rounding to nearest). Again, it won't work for everything but you may get an error estimate out of this.
Keeping track of the value and a number of operations (possibly multiple counters) may also be used to estimate the current size of the error.
To experiment with different numeric representations (float, double, interval, etc.) you might want to implement your simulation as templates parameterized on the numeric type.
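The last suggestion, parameterizing the simulation on the numeric type, could look roughly like the sketch below. The update rule (multiply by a decay factor, subtract a drain), the reset to 1 and the name run_steps are all invented for illustration; the real per-node calculation from the question is not known.

```cpp
#include <cstddef>
#include <vector>

// Runs a toy simulation with the numeric type Real and returns how many
// times the "value <= 0" condition fired.
// The update rule and the reset value are placeholders, not the
// questioner's actual model.
template <typename Real>
std::size_t run_steps(std::vector<Real>& nodes, Real decay, Real drain, int steps) {
    std::size_t triggers = 0;
    for (int s = 0; s < steps; ++s) {
        for (Real& v : nodes) {
            v = v * decay - drain;   // hypothetical per-step recalculation
            if (v <= Real(0)) {      // the something_happens() condition
                ++triggers;
                v = Real(1);         // hypothetical effect of something_happens()
            }
        }
    }
    return triggers;
}
```

Instantiating the same code as run_steps<float> and run_steps<double> on identical inputs and comparing when the triggers fire gives a cheap indication of how sensitive the model is to precision.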
There are many books written on estimating and minimizing errors when using floating point arithmetic. This is the topic of numerical mathematics.
In most cases I'm aware of, people experiment briefly with some of the methods mentioned above, conclude that the model is imprecise anyway, and don't bother with the effort. Also, using something other than float may yield better results but is often just too slow, even with double, because of the doubled memory footprint and the reduced opportunity for SIMD operations.
I recommend that you single-step, preferably in assembly mode, through the calculations while doing the same arithmetic on a calculator. You should be able to determine which calculation orderings yield results of lower quality than you expect and which ones work. You will learn from this and probably write better-ordered calculations in the future.
In the end - given the examples of numbers you use - you will probably need to accept the fact that you won't be able to do equality comparisons.
As to the epsilon approach, you usually need one epsilon for every possible exponent. For the single-precision floating point format you would need 256 single-precision values, as the exponent is 8 bits wide. Some exponent values are reserved for special cases (zero and denormals, infinities and NaN), but for simplicity it is better to have a 256-member vector than to do a lot of testing as well.
One way to do this could be to determine your base epsilon for the case where the exponent is 0, i.e. the value to be compared against is in the range 1.0 <= x < 2.0. Preferably the epsilon should be chosen to be a power of two, i.e. a value that can be exactly represented in the single-precision format - that way you know exactly what you are testing against and won't have to think about rounding problems in the epsilon as well. For exponent -1 you would use your base epsilon divided by 2, for -2 divided by 4, and so on. As you approach the lowest and the highest parts of the exponent range you gradually run out of precision - bit by bit - so you need to be aware that extreme values can cause the epsilon method to fail.
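A minimal sketch of this exponent-scaled epsilon follows. Instead of a 256-entry table it computes the scaled value on the fly with std::frexp and std::ldexp, which is a substitution made here for brevity. The base value 2^-20 and the names kBaseEpsilon, epsilon_for and nearly_equal are assumptions chosen for the example, not recommendations.

```cpp
#include <cmath>

// Base tolerance for values in [1, 2). 2^-20 is a power of two, so it is
// exactly representable in float. The particular value is an assumption.
constexpr float kBaseEpsilon = 1.0f / 1048576.0f;  // 2^-20

// Returns the base epsilon scaled by the binary exponent of x, so that
// [1, 2) gets kBaseEpsilon, [0.5, 1) gets half of it, [2, 4) twice, etc.
inline float epsilon_for(float x) {
    if (x == 0.0f || !std::isfinite(x)) {
        return kBaseEpsilon;         // assumed fallback for zero, inf and NaN
    }
    int exp = 0;
    std::frexp(x, &exp);             // x = m * 2^exp with 0.5 <= |m| < 1
    return std::ldexp(kBaseEpsilon, exp - 1);
}

// Equality test using the epsilon that belongs to the larger magnitude.
inline bool nearly_equal(float a, float b) {
    float scale = std::fmax(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= epsilon_for(scale);
}
```

Near the ends of the exponent range the scaled epsilon itself underflows or overflows, which is the failure mode mentioned above.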
If it absolutely has to be floats then using an epsilon value may help but may not eliminate all problems. I would recommend using doubles for the spots in the code you know for sure will have variation.
Another way is to use floats to emulate doubles. There are many techniques out there, and the most basic one is to use 2 floats and do a little bit of math to keep most of the number in one float and the remainder in the other (I saw a great guide on this; if I find it I'll link it).
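One common building block for the two-float idea is the error-free "TwoSum" transformation usually attributed to Knuth; whether the guide mentioned above uses it is not known. The sketch below assumes float expressions are evaluated in float precision (typical for SSE code on x86-64) and that -ffast-math style optimizations are off. FloatPair, two_sum, add and to_double are names invented for this example.

```cpp
// A value stored as hi + lo, where lo holds what did not fit into hi.
struct FloatPair {
    float hi;
    float lo;
};

// TwoSum: returns the rounded sum in hi and the exact rounding error in lo.
inline FloatPair two_sum(float a, float b) {
    float s = a + b;
    float b_virtual = s - a;          // the part of b that made it into s
    float a_virtual = s - b_virtual;  // the part of a that made it into s
    float err = (a - a_virtual) + (b - b_virtual);
    return {s, err};
}

// Adds a plain float to a pair and renormalizes the result.
inline FloatPair add(FloatPair x, float y) {
    FloatPair t = two_sum(x.hi, y);
    float lo = t.lo + x.lo;           // collect the low-order parts
    return two_sum(t.hi, lo);
}

// Combines both halves for inspection or printing.
inline double to_double(FloatPair x) {
    return static_cast<double>(x.hi) + static_cast<double>(x.lo);
}
```

For example, adding 2^-30 to 1.0f in plain float leaves 1.0f unchanged, whereas the pair keeps the small addend in lo until enough of it has accumulated to affect hi.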
Certainly you should be using doubles instead of floats. This will probably reduce the number of flipped nodes significantly.
Generally, using an epsilon threshold is only useful when you are comparing two floating-point numbers for equality, not when you are comparing them to see which is bigger. So (for most models, at least) using epsilon won't gain you anything at all -- it will just change the set of flipped nodes, it won't make that set smaller. If your model itself is chaotic, then it's chaotic.