How can I get consistent program behavior when using floats?

How can I get consistent program behavior when using floats? - c++

I am writing a simulation program that proceeds in discrete steps. The simulation consists of many nodes, each of which has a floating-point value associated with it that is re-calculated on every step. The result can be positive, negative or zero.
In the case where the result is zero or less something happens. So far this seems straightforward - I can just do something like this for each node:
if (value <= 0.0f) something_happens();
A problem has arisen, however, after some recent changes I made to the program in which I re-arranged the order in which certain calculations are done. In a perfect world the values would still come out the same after this re-arrangement, but because of the imprecision of floating point representation they come out very slightly different. Since the calculations for each step depend on the results of the previous step, these slight variations in the results can accumulate into larger variations as the simulation proceeds.
Here's a simple example program that demonstrates the phenomena I'm describing:
float f1 = 0.000001f, f2 = 0.000002f;
f1 += 0.000004f; // This part happens first here
f1 += (f2 * 0.000003f);
printf("%.16f\n", f1);
f1 = 0.000001f, f2 = 0.000002f;
f1 += (f2 * 0.000003f);
f1 += 0.000004f; // This time this happens second
printf("%.16f\n", f1);
The output of this program is
0.0000050000057854
0.0000050000062402
even though addition is commutative so both results should be the same. Note: I understand perfectly well why this is happening - that's not the issue. The problem is that these variations can mean that sometimes a value that used to come out negative on step N, triggering something_happens(), now may come out negative a step or two earlier or later, which can lead to very different overall simulation results because something_happens() has a large effect.
What I want to know is whether there is a good way to decide when something_happens() should be triggered that is not going to be affected by the tiny variations in calculation results that result from re-ordering operations so that the behavior of newer versions of my program will be consistent with the older versions.
The only solution I've so far been able to think of is to use some value epsilon like this:
if (value < epsilon) something_happens();
but because the tiny variations in the results accumulate over time I need to make epsilon quite large (relatively speaking) to ensure that the variations don't result in something_happens() being triggered on a different step. Is there a better way?
I've read this excellent article on floating point comparison, but I don't see how any of the comparison methods described could help me in this situation.
Note: Using integer values instead is not an option.
Edit the possibility of using doubles instead of floats has been raised. This wouldn't solve my problem since the variations would still be there, they'd just be of a smaller magnitude.

I've worked with simulation models for 2 years and the epsilon approach is the sanest way to compare your floats.

Generally, using suitable epsilon values is the way to go if you need to use floating point numbers. Here are a few things which may help:
If your values are in a known range you and you don't need divisions you may be able to scale the problem and use exact operations on integers. In general, the conditions don't apply.
A variation is to use rational numbers to do exact computations. This still has restrictions on the operations available and it typically has severe performance implications: you trade performance for accuracy.
The rounding mode can be changed. This can be use to compute an interval rather than an individual value (possibly with 3 values resulting from round up, round down, and round closest). Again, it won't work for everything but you may get an error estimate out of this.
Keeping track of the value and a number of operations (possible multiple counters) may also be used to estimate the current size of the error.
To possibly experiment with different numeric representations (float, double, interval, etc.) you might want to implement your simulation as templates parameterized for the numeric type.
There are many books written on estimating and minimizing errors when using floating point arithmetic. This is the topic of numerical mathematics.
Most cases I'm aware of experiment briefly with some of the methods mentioned above and conclude that the model is imprecise anyway and don't bother with the effort. Also, doing something else than using float may yield better result but is just too slow, even using double due to the doubled memory footprint and the smaller opportunity of using SIMD operations.

I recommend that you single step - preferably in assembly mode - through the calculations while doing the same arithmetic on a calculator. You should be able to determine which calculation orderings yield results of lesser quality than you expect and which that work. You will learn from this and probably write better-ordered calculations in the future.
In the end - given the examples of numbers you use - you will probably need to accept the fact that you won't be able to do equality comparisons.
As to the epsilon approach you usually need one epsilon for every possible exponent. For the single-precision floating point format you would need 256 single precision floating point values as the exponent is 8 bits wide. Some exponents will be the result of exceptions but for simplicity it is better to have a 256 member vector than to do a lot of testing as well.
One way to do this could be to determine your base epsilon in the case where the exponent is 0 i e the value to be compared against is in the range 1.0 <= x < 2.0. Preferably the epsilon should be chosen to be base 2 adapted i e a value that can be exactly represented in a single precision floating point format - that way you know exactly what you are testing against and won't have to think about rounding problems in the epsilon as well. For exponent -1 you would use your base epsilon divided by two, for -2 divided by 4 and so on. As you approach the lowest and the highest parts of the exponent range you gradually run out of precision - bit by bit - so you need to be aware that extreme values can cause the epsilon method to fail.

If it absolutely has to be floats then using an epsilon value may help but may not eliminate all problems. I would recommend using doubles for the spots in the code you know for sure will have variation.
Another way is to use floats to emulate doubles, there are many techniques out there and the most basic one is to use 2 floats and do a little bit of math to save most of the number in one float and the remainder in the other (saw a great guide on this, if I find it I'll link it).

Certainly you should be using doubles instead of floats. This will probably reduce the number of flipped nodes significantly.
Generally, using an epsilon threshold is only useful when you are comparing two floating-point number for equality, not when you are comparing them to see which is bigger. So (for most models, at least) using epsilon won't gain you anything at all -- it will just change the set of flipped nodes, it wont make that set smaller. If your model itself is chaotic, then it's chaotic.

Related

How do I handle values close to zero in c++?

I am trying to code an iterative function which takes an initial
double t = /*formula 1*/;
and then computes
for (auto i = 0; i < bigNumber; ++i)
{
temp = /*formula 2*/;
t = t*temp;
}
This works fine, except in the cases where the initial t is so small that C++ automatically sets it equal to zero (it is NOT actually supposed to be zero).
Then of course t will forever remain zero since we multiply it by itself, and that's the problem.
I tried solving this by setting t equal to some very small, but non-zero, number in case C++ had set it to zero, but this doesn't work, because then, I end up with the opposite problem, as t eventually blows up, once we have iterated it enough times.
How do I solve this problem?
Possibly worth mentioning:
The first formula (formula 1) involves stuff like exp(-verybignumber) and the second formula involves stuff like pow(i, -1), meaning it becomes very small with higher iterations.

Floating-point arithmetic isn't trivial, as you just discovered. This is not really related to C++, but to the IEEE 754 standard.
One of the things you need to need to ensure is that you stay within the normal numbers. That is, ensure your values throughout your computation do not get too small or too large.
In some cases, this is easy and maybe rescaling the input data is enough. In other cases, maybe you have to rethink your equations (steps) to avoid this.
Sometimes you can simply get away using a bigger type, e.g. long double or even __float128 (quad, check libquadmath).
Other solutions are to employ arbitrary-precision numbers (use a library like GMP and MPFR; do not attempt to do it yourself as a beginner) or even symbolic computation. It all depends on what performance you require.
Note that there are many other pitfalls when dealing with floating-point arithmetic.

Comparing Floating Point Nos - Google Test Framework

While going through this post at SO by the user #skrebbel who stated that the google testing framework does a good and fast job for comparing floats and doubles. So I wrote the following code to check the validity of the code and apparently it seems like I am missing something here , since I was expecting to enter the almost equal to section here this is my code
float left = 0.1234567;
float right= 0.1234566;
const FloatingPoint<float> lhs(left), rhs(right);
if (lhs.AlmostEquals(rhs))
{
std::cout << "EQUAL"; //Shouldnt it have entered here ?
}
Any suggetsions would be appreciated.

You can use
ASSERT_NEAR(val1, val2, abs_error);
where you can give the acceptable - your chosen one, like, say 0.0000001 - difference as abs_error, if the default one is too small, see here https://github.com/google/googletest/blob/master/googletest/docs/advanced.md#floating-point-comparison

Your left and right are not “almost equal” because they are too far apart, farther than the default tolerance of AlmostEquals. The code in one of the answers in the question you linked to shows a tolerance of 4 ULP, but your numbers are 14 ULP apart (using IEEE 754 32-bit binary and correctly rounding software). (An ULP is the minimum increment of the floating-point value. It is small for floating-point numbers of small magnitude and large for large numbers, so it is approximately relative to the magnitude of the numbers.)
You should never perform any floating-point comparison without understanding what errors may be in the values you are comparing and what comparison you are performing.
People often misstate that you cannot test floating-point values for equality. This is false; executing a == b is a perfect operation. It returns true if and only if a is equal to b (that is, a and b are numbers with exactly the same value). The actual problem is that they are trying to calculate a correct function given incorrect input. == is a function: It takes two inputs and returns a value. Obviously, if you give any function incorrect inputs, it may return an incorrect result. So the problem here is not floating-point comparison; it is incorrect inputs. You cannot generally calculate a sum, a product, a square root, a logarithm, or any other function correctly given incorrect input. Therefore, when using floating-point, you must design an algorithm to work with approximate values (or, in special cases, use great care to ensure no errors are introduced).
Often people try to work around errors in their floating-point values by accepting as equal numbers that are slightly different. This decreases false negatives (indications of inequality due to prior computing errors) at the expense of increasing false positives (indications of equality caused by lax acceptance). Whether this exchange of one kind of error for another is acceptable depends on the application. There is no general solution, which is why functions like AlmostEquals are generally bad.
The errors in floating-point values are the results of preceding operations and values. These errors can range from zero to infinity, depending on circumstances. Because of this, one should never simply accept the default tolerance of a function such as AlmostEquals. Instead, one should calculate the tolerance, which is specific to their applications, needs, and computations, and use that calculated tolerance (or not use a comparison at all).
Another problem is that functions such as AlmostEquals are often written using tolerances that are specified relative to the values being compared. However, the errors in the values may have been affected by intermediate values of vastly different magnitude, so the final error might be a function of data that is not present in the values being compared.
“Approximate” floating-point comparisons may be acceptable in code that is testing other code because most bugs are likely cause large errors, so a lax acceptance of equality will allow good code to continue but will report bugs in most bad code. However, even in this situation, you must set the expected result and the permitted error tolerance appropriately. The AlmostEquals code appears to hard-code the error tolerance.

(Not sure if this 100% applies to the original question but this is what I came for when I stumbled upon it)
There also exist ASSERT_FLOAT_EQ and EXPECT_FLOAT_EQ (or the corresponding versions for double) which you can use if you don't want to worry about tolerable errors yourself.
Docs: https://github.com/google/googletest/blob/master/docs/reference/assertions.md#floating-point-comparison-floating-point

Preventing Rounding Errors

I was just reading about rounding errors in C++. So, if I'm making a math intense program (or any important calculations) should I just drop floats all together and use only doubles or is there an easier way to prevent rounding errors?

Obligatory lecture: What Every Programmer Should Know About Floating-Point Arithmetic.
Also, try reading IEEE Floating Point standard.
You'll always get rounding errors. Unless you use an infinite arbitrary precision library, like gmplib. You have to decide if your application really needs this kind of effort.
Or, you could use integer arithmetic, converting to floats only when needed. This is still hard to do, you have to decide if it's worth it.
Lastly, you can use float or double taking care not to make assumption about values at the limit of representation's precision. I'd wish this Valgrind plugin was implemented (grep for float)...

The rounding errors are normally very insignificant, even using floats. Mathematically-intense programs like games, which do very large numbers of floating-point computations, often still use single-precision.

This might work if your highest number is less than 10 billion and you're using C++ double precision.
if ( ceil(10000*(x + 0.00001)) > ceil(100000*(x - 0.00001))) {
x = ceil(10000*(x + 0.00004)) / 10000;
}
This should allow at least the last digit to be off +/- 9. I'm assuming dividing by 1000 will always just move a decimal place. If not, then maybe it could be done in binary.
You would have to apply it after every operation that is not +, -, *, or a comparison. For example, you can't do two divisions in the same formula because you'd have to apply it to each division.
If that doesn't work, you could work in integers by scaling the numbers up and always use integer division. If you need advanced functions maybe there is a package that does deterministic integer math. Integer division is required in a lot of financial settings because of round off error being subject to exploit like in the movie "The Office".

Accurate evaluation of 1/1 + 1/2 + ... 1/n row

I need to evaluate the sum of the row: 1/1+1/2+1/3+...+1/n. Considering that in C++ evaluations are not complete accurate, the order of summation plays important role. 1/n+1/(n-1)+...+1/2+1/1 expression gives the more accurate result.
So I need to find out the order of summation, which provides the maximum accuracy.
I don't even know where to begin.
Preferred language of realization is C++.
Sorry for my English, if there are any mistakes.

For large n you'd better use asymptotic formulas, like the ones on http://en.wikipedia.org/wiki/Harmonic_number;
Another way is to use exp-log transformation. Basically:
H_n = 1 + 1/2 + 1/3 + ... + 1/n = log(exp(1 + 1/2 + 1/3 + ... + 1/n)) = log(exp(1) * exp(1/2) * exp(1/3) * ... * exp(1/n)).
Exponents and logarithms can be calculated pretty quickly and accuratelly by your standard library. Using multiplication you should get much more accurate results.
If this is your homework and you are required to use simple addition, you'll better add from the smallest one to the largest one, as others suggested.

The reason for the lack of accuracy is the precision of the float, double, and long double types. They only store so many "decimal" places. So adding a very small value to a large value has no effect, the small term is "lost" in the larger one.
The series you're summing has a "long tail", in the sense that the small terms should add up to a large contribution. But if you sum in descending order, then after a while each new small term will have no effect (even before that, most of its decimal places will be discarded). Once you get to that point you can add a billion more terms, and if you do them one at a time it still has no effect.
I think that summing in ascending order should give best accuracy for this kind of series, although it's possible there are some odd corner cases where errors due to rounding to powers of (1/2) might just so happen to give a closer answer for some addition orders than others. You probably can't really predict this, though.

I don't even know where to begin.
Here: What Every Computer Scientist Should Know About Floating-Point Arithmetic

Actually, if you're doing the summation for large N, adding in order from smallest to largest is not the best way -- you can still get into a situation where the numbers you're adding are too small relative to the sum to produce an accurate result.
Look at the problem this way: You have N summations, regardless of ordering, and you wish to have the least total error. Thus, you should be able to get the least total error by minimizing the error of each summation -- and you minimize the error in a summation by adding values as nearly close to each other as possible. I believe that following that chain of logic gives you a binary tree of partial sums:
Sum[0,i] = value[i]
Sum[1,i/2] = Sum[0,i] + Sum[0,i+1]
Sum[j+1,i/2] = Sum[j,i] + Sum[j,i+1]
and so on until you get to a single answer.
Of course, when N is not a power of two, you'll end up with leftovers at each stage, which you need to carry over into the summations at the next stage.
(The margins of StackOverflow are of course too small to include a proof that this is optimal. In part because I haven't taken the time to prove it. But it does work for any N, however large, as all of the additions are adding values of nearly identical magnitude. Well, all but log(N) of them in the worst not-power-of-2 case, and that's vanishingly small compared to N.)

http://en.wikipedia.org/wiki/Arbitrary-precision_arithmetic
You can find libraries with ready for use implementation for C/C++.
For example http://www.apfloat.org/apfloat/

Unless you use some accurate closed-form representation, a small-to-large ordered summation is likely to be most accurate simple solution (it's not clear to me why a log-exp would help - that's a neat trick, but you're not winning anything with it here, as far as I can tell).
You can further gain precision by realizing that after a while, the sum will become "quantized": Effectively, when you have 2 digits of precision, adding 1.3 to 41 results in 42, not 42.3 - but you achieve almost a precision doubling by maintaining an "error" term. This is called Kahan Summation. You'd compute the error term (42-41-1.3 == -0.3) and correct that in the next addition by adding 0.3 to the next term before you add it in again.
Kahan Summation in addition to a small-to-large ordering is liable to be as accurate as you'll ever need to get. I seriously doubt you'll ever need anything better for the harmonic series - after all, even after 2^45 iterations (crazy many) you'd still only be dealing with a numbers that are at least 1/2^45 large, and a sum that's on the order of 45 (<2^6), for an order of magnitude difference of 51 powers-of-two - i.e. even still representable in a double precision variable if you add in the "wrong" order.
If you go small-to-large, and use Kahan Summation, the sun's probably going to extinguish before today's processors reach a percent of error - and you'll run into other tricky accuracy issues just due to the individual term error on that scale first anyhow (being that a number of the order of 2^53 or larger cannot be represented accurately as a double at all anyhow.)

I'm not sure about the order of summation playing an important role, I havent heard that before. I guess you want to do this in floating point arithmetic so the first thing is to think more inline of (1.0/1.0 + 1.0/2.0+1.0/3.0) - otherwise the compiler will do integer division
to determine order of evaluation, maybe a for loop or brackets?
e.g.
float f = 0.0;
for (int i=n; i>0; --i)
{
f += 1.0/static_cast<float>(i);
}
oh forgot to say, compilers will normally have switches to determine floating point evaluation mode. this is maybe related to what you say on order of summation - in visual C+ these are found in code-generation compile settings, in g++ there're options -float that handle this
actually, the other guy is right - you should do summation in order of smallest component first; so
1/n + 1/(n-1) .. 1/1
this is because the precision of a floating point number is linked to the scale, if you start at 1 you'll have 23 bits of precision relative to 1.0. if you start at a smaller number the precision is relative to the smaller number, so you'll get 23 bits of precision relative to 1xe-200 or whatever. then as the number gets bigger rounding error will occur, but the overall error will be less than the other direction

As all your numbers are rationals, the easiest (and also maybe the fastest, as it will have to do less floating point operations) would be to do the computations with rationals (tuples of 2 integers p,q), and then do just one floating point division at the end.
update to use this technique effectively you will need to use bigints for p & q, as they grow quite fast...
A fast prototype in Lisp, that has built in rationals shows:
(defun sum_harmonic (n acc)
(if (= n 0) acc (sum_harmonic (- n 1) (+ acc (/ 1 n)))))
(sum_harmonic 10 0)
7381/2520
[2.9289682]
(sum_harmonic 100 0)
14466636279520351160221518043104131447711/278881500918849908658135235741249214272
[5.1873775]
(sum_harmonic 1000 0)
53362913282294785045591045624042980409652472280384260097101349248456268889497101
75750609790198503569140908873155046809837844217211788500946430234432656602250210
02784256328520814055449412104425101426727702947747127089179639677796104532246924
26866468888281582071984897105110796873249319155529397017508931564519976085734473
01418328401172441228064907430770373668317005580029365923508858936023528585280816
0759574737836655413175508131522517/712886527466509305316638415571427292066835886
18858930404520019911543240875811114994764441519138715869117178170195752565129802
64067621009251465871004305131072686268143200196609974862745937188343705015434452
52373974529896314567498212823695623282379401106880926231770886197954079124775455
80493264757378299233527517967352480424636380511370343312147817468508784534856780
21888075373249921995672056932029099390891687487672697950931603520000
[7.485471]
So, the next better option could be to mantain the list of floating points and to reduce it summing the two smallest numbers in each step...

Why would I use 2's complement to compare two doubles instead of comparing their differences against an epsilon value?

Referenced here and here...Why would I use two's complement over an epsilon method? It seems like the epsilon method would be good enough for most cases.
Update: I'm purely looking for a theoretical reason why you'd use one over the other. I've always used the epsilon method.
Has anyone used the 2's complement comparison successfully? Why? Why Not?

the second link you reference mentions an article that has quite a long description of the issue:
http://www.cygnus-software.com/papers/comparingfloats/comparingfloats.htm
but unless you are tweaking performance I would stick with epsilon so people can debug your code

The bits method might be faster. I say might because on modern (multicore, highly pipelined) processors it is often impossible to guess what is really faster.
Code the simplest most obviously correct implementation, then measure, then optomise.

In short, when comparing two floats with unknown origins, picking an epsilon that is valid is almost impossible.
For example:
What is a good epsilon when comparing distance in miles between Atlanta GA, Dallas TX and some place in Ohio?
What is a good epsilon when comparing distance in miles between my left foot, my right foot and the computer under my desk?
EDIT:
Ok, I'm getting a fair number of people not understanding why you wouldn't know what your epsilon is.
Back in the old days of lore, I wrote two programs that worked with NeverWinter Nights (a game made by BioWare). One of the programs took a binary model and converted it to ASCII. The other program took an ASCII model and compiled it into binary. One of the tests I wrote was to take all of BioWare's binary models, decompile them to ASCII and then back to binary. Then I compared my binary version with original one from BioWare. One of the problems during the comparison was dealing with some of the slight variances in floating point values. So instead of coming up with a bunch of different EPSILONS for each type of floating point number (vertex, normal, etc), I wanted to use something such as this twos compliment compare. Thus avoiding the whole multiple EPSILON issue.
The same type of issue can apply to any type of software that processes 3rd party data and then needs to validate their results with the original. In these cases you might not even know what the floating point values represent, you just have to compare them. We ran into this issue with our industrial automation software.
EDIT:
LOL, this has been voted up and down by different people.
I'll boil the problem down to this, given two arbitrary floating point numbers, how do you decide what epsilon to use? You can't.
How can you compare 1e23 and 1.0001e23 with an epsilon and still compare 1e-23 and 5.2e-23 using the same epsilon? Sure, you can do some dynamic epsilon tricks, but that is the whole point to the integer compare (which does NOT require the integers be exact).
The integer compare is able to compare two floats using an epsilon relative to the magnitude of the numbers.
EDIT
Steve, lets look at what you said in the comments:
"But you know what equality means to you... Hence, you should be able to find an appropriate epsilon".
Turn this statement around to say:
"If you know what equality means to you, then you should be able to find an appropriate epsilon."
The whole point to what I am trying to say is that there are applications where we don't know what equality means in the absolute sense, thus we have to resort to a relative compare which is what the integer version is trying to do.

When it comes to speed, follow these rules:
If you're not a very experienced developer, don't optimize.
If you are an experienced developer, don't optimize yet.
Do the easiest method.
Alex

Oskar's right. Don't screw with this unless you really, really need that performance.
And you don't. If you were in the situation that did, you wouldn't have needed to ask the question -- you'd already know. If you think you do, then you don't. Your performance problems lie elsewhere. Just use the readable version.

Using any method that compares bitwise will result in trouble when fractions are represented by approximations. All floating point numbers with fractions that are not denominated in powers of two (1/2, 1/4, 1/8, 1/65536, &c) are approximated. So, of course, are all irrational numbers.
float third = 1/3;
float two=2.0;
float another_two=third*6.0;
if(two != another_two)
print ("Approximation!\n");
The only time comparing bitwise would work is when you derive the floating point numbers exactly the same way or they are exact representations (whole numbers, fraction powers of two). Even then, there can be multiple representations of some numbers, though I have never seen this in a working system.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js