When I run a solver to solve for x given y and the relationship y = f(x), I have to keep checking whether the previous guess of x and the current guess of x still differ by at least the machine precision. What is the most efficient way of doing this in C++?
If I have x1 and x2, then I want to check whether mantissa(x1) - mantissa(x2) < 1e-14, or the equivalent power of 2. Is there a predefined function which does this check for us? I did notice a Stack Overflow response that describes an efficient way to get the mantissa of a floating point number using unions. But is there a machine-independent implementation of this in Boost or GSL etc.?
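I am not aware of a single predefined "mantissas differ by less than 1e-14" check in the standard library (Boost.Math does have float_distance and float_next in boost/math/special_functions/next.hpp for counting representable values between two doubles, if Boost is an option). A portable sketch using only <cmath> and <limits>; the helper name and the 4-ULP tolerance are illustrative choices, not from any particular library:

#include <cmath>
#include <cstdio>
#include <limits>

// Stop iterating when the two guesses are within a few ULPs (relative test)
// or below the smallest normal double (absolute floor near zero).
bool converged(double x_prev, double x_curr) {
    const double eps = std::numeric_limits<double>::epsilon();   // 2^-52 for double
    double diff  = std::fabs(x_curr - x_prev);
    double scale = std::fmax(std::fabs(x_prev), std::fabs(x_curr));
    return diff <= 4.0 * eps * scale || diff < std::numeric_limits<double>::min();
}

int main() {
    double a = 1.0;
    double b = std::nextafter(1.0, 2.0);     // b is exactly one ULP above a
    std::printf("%d\n", converged(a, b));    // prints 1: within tolerance
    return 0;
}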
I have to find the set of integers that minimize this objective function:
The constraints are:
every x must be a non-negative integer
T, A and B are known double values.
I have been looking at the OR-Tools C++ library in order to solve this problem, specifically at the CP-SAT solver.
Is it the right tool for such problems?
If yes, would it be feasible to convert all the doubles to ints in the objective function?
If not, what else do you suggest? (I'm also open to other open source C++ libraries)
It will fit in the CP-SAT solver. You will need to scale floating point coefficients to integers.
The objective function accepts floating point coefficients.
But (x1 + A1)^2 will propagate better if you keep it in that form instead of expanding it to A1^2 + 2 * A1 * x1 + x1^2, which fits into the linear-with-double-coefficients limitation of CP-SAT provided you use a temporary variable sx1 = x1 * x1.
Then make sure to use at least 8 workers for that. (parameters num_search_workers:8).
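Here is a rough sketch of that modelling approach. Since the actual objective isn't shown above, it assumes terms of the form (x_i + A_i)^2; the A values, bounds and scaling factor are made up, and the OR-Tools C++ names used here (CpModelBuilder, AddMultiplicationEquality, LinearExpr::WeightedSum, SolveWithParameters) should be checked against the headers of the version you have installed:

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>

#include "ortools/sat/cp_model.h"
#include "ortools/sat/cp_model_solver.h"
#include "ortools/sat/sat_parameters.pb.h"

int main() {
  using namespace operations_research;
  using namespace operations_research::sat;

  // Assumed objective: minimize sum_i (x_i + A_i)^2 over non-negative integers x_i.
  const std::vector<double> A = {1.5, -2.25};   // hypothetical data
  const int64_t kScale = 1000;                  // scales the doubles to integers
  const int64_t kXMax = 1000;                   // assumed upper bound on each x_i

  CpModelBuilder cp_model;
  std::vector<IntVar> x, squares;
  for (double a_double : A) {
    const int64_t a = static_cast<int64_t>(std::llround(a_double * kScale));
    const int64_t bound = kScale * kXMax + std::abs(a);
    IntVar xi = cp_model.NewIntVar(Domain(0, kXMax));
    // ti represents kScale * xi + a, i.e. (x_i + A_i) scaled by kScale.
    IntVar ti = cp_model.NewIntVar(Domain(-bound, bound));
    cp_model.AddEquality(LinearExpr::WeightedSum({xi, ti}, {kScale, -1}), -a);
    // si = ti * ti, keeping the square as a product so it propagates well.
    IntVar si = cp_model.NewIntVar(Domain(0, bound * bound));
    cp_model.AddMultiplicationEquality(si, {ti, ti});
    x.push_back(xi);
    squares.push_back(si);
  }
  cp_model.Minimize(LinearExpr::Sum(squares));   // objective is scaled by kScale^2

  SatParameters params;
  params.set_num_search_workers(8);              // as recommended above
  const CpSolverResponse response = SolveWithParameters(cp_model.Build(), params);
  if (response.status() == CpSolverStatus::OPTIMAL ||
      response.status() == CpSolverStatus::FEASIBLE) {
    for (size_t i = 0; i < x.size(); ++i) {
      std::printf("x[%zu] = %lld\n", i,
                  static_cast<long long>(SolutionIntegerValue(response, x[i])));
    }
  }
  return 0;
}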
Now, I believe there are least-squares solvers that are better suited for this.
Subtracting two floating point numbers in C++ can lead to catastrophic cancellation.
Is there any advantage to writing y * (1 - x / y) instead of y - x, in terms of numerical stability?
Haha, no. Catastrophic cancellation occurs the moment you decide you want to find the difference between two numbers, if those two numbers are nearly equal and far from 0. It doesn't matter how wacky a formula you use to do it -- the information required for a precise difference calculation has already been discarded.
In order to avoid the situation, you need to calculate and store different numbers. That is, instead of calculating and storing x and y, say, you'd calculate a=x and b=y-x. Or a=(x+y)/2 and b=(x-y)/2. You wouldn't calculate them from x and y -- that would have exactly the same problem -- rather, you'd calculate and store them as the actual numbers, and calculate x and y as needed. (Obviously this requires a choice of a and b which can actually be calculated, given your use case.)
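A tiny illustration of that point (the values are chosen only so the rounding is visible in float):

#include <cstdio>

int main() {
    // The "true" values differ by exactly 1e-7, but each one is rounded
    // when it is stored as a float.
    float x = 1.0000001f;                // stored as roughly 1.00000012
    float y = 1.0000002f;                // stored as roughly 1.00000024
    std::printf("%.9g\n", y - x);        // ~1.192e-07: the subtraction itself is exact,
                                         // but the inputs had already lost the answer
    float b = 1e-7f;                     // the difference stored as its own number
    std::printf("%.9g\n", b);            // ~1e-07, good to full float precision
    return 0;
}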
What you call "catastrophic cancellation" I call Sterbenz's theorem: If a/2 <= b <= 2a, then the difference b-a is exactly representable.
The term "catastrophic cancellation" is something of a misnomer. The subtraction is totally innocent; the catastrophe had already occurred when you rounded away the part of the number you wanted.
I'd think the multiplication & division would give you more trouble, but nevertheless, in the long version you are STILL subtracting two floating point numbers.
The error you get from y - x is y times the error you get from 1 - (x/y); since you later multiply by y, you end up with the same result.
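A quick sketch that just prints both expressions for two nearly equal doubles; neither form recovers the digits that were rounded away when x and y were stored:

#include <cstdio>

int main() {
    // x and y agree in their first 15 digits; their "true" difference is 1e-15.
    double x = 0.123456789012345;
    double y = 0.123456789012346;
    std::printf("y - x           = %.17g\n", y - x);
    std::printf("y * (1 - x / y) = %.17g\n", y * (1.0 - x / y));
    // Both results carry the same uncertainty from the initial rounding of
    // x and y; rewriting the expression buys no extra digits.
    return 0;
}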
Is the floating point implementation of the exp() function in cmath equivalent to a truncated Taylor series expansion of a very high order? One possible source of error we should keep in mind is the finite number of bits available to represent the answer.
Is the floating point implementation of exp() function in cmath equivalent to a truncated Taylor series expansion of a very high order?
Equivalent to? Yes. That's because any decent implementation of exp() has an error of half an ULP (unit of least precision) or so. Ignoring problems with finite precision arithmetic, one can always construct a truncated Taylor series that does the same.
However, no decent implementation of exp() will use a Taylor expansion. That would be very very slow, and wouldn't achieve the desired accuracy. It would be a downright stupid implementation. Much better is to use the fact that there is a strong relation between 2^x and e^x and the fact that 2^x is fairly easy to compute given the almost universal power-of-2 representation of floating point numbers.
Just an example of how you could calculate exp(x):
If x is large enough then the result is +inf. If x is negative enough then the result is 0.
Let k = round (x / ln 2). Then exp (x) = 2^k * exp (x - k ln 2). 2^k is very easy to calculate. A small problem is to calculate x - k ln 2 without any rounding error. That's quite easy: Let L1 = ln 2 rounded to say 35 bits, and L2 = ln 2 - L1. k is a smallish integer, so k * L1 has no rounding error, nor has x - k * L1; then we subtract k * L2 which is small and therefore has little rounding error.
To do this quicker (without a division), we calculate k = round (x * (1 / ln 2)). And we check whether x is close to zero, so the whole calculation isn't needed. Anyway, we now calculate exp (x) where the result is between sqrt (1/2) and sqrt (2).
You could calculate exp (x) using a Taylor polynomial. Instead you would probably use a Chebyshev polynomial, minimising the truncation error with a much lower degree. With some care you can find a polynomial with a truncation error substantially less than the lowest bit of the result.
It depends on the implementation in the compiler, the C runtime and the processor. However, whoever computes the exponential is unlikely to use the Taylor expansion since better methods exist.
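To make the recipe above concrete, here is a toy version in C++. It follows the same range-reduction idea (k = round(x / ln 2), two-part ln 2, then a low-degree polynomial) but is only accurate to a few parts per million, nowhere near the half-ULP of a real libm; the split of ln 2 is the usual fdlibm-style one and the overflow/underflow thresholds are approximate.

#include <cmath>
#include <cstdio>
#include <limits>

// Toy exp() by range reduction, for illustration only.
double toy_exp(double x) {
    if (x > 709.0)  return std::numeric_limits<double>::infinity();  // approx overflow threshold
    if (x < -745.0) return 0.0;                                      // approx underflow threshold

    const double inv_ln2 = 1.4426950408889634;      // 1 / ln 2
    const double ln2_hi  = 0.6931471803691238;      // ln 2, high part
    const double ln2_lo  = 1.9082149292705877e-10;  // ln 2 - ln2_hi

    // k = round(x / ln 2), computed with a multiplication instead of a division.
    double k = std::nearbyint(x * inv_ln2);

    // r = x - k*ln2, done in two pieces so the reduction is nearly exact.
    double r = (x - k * ln2_hi) - k * ln2_lo;

    // Low-degree polynomial for exp(r) on |r| <= ln(2)/2. A real implementation
    // would use a minimax (Chebyshev-like) polynomial, not a truncated Taylor series.
    double p = 1.0 + r * (1.0 + r * (0.5 + r * (1.0 / 6 + r * (1.0 / 24 + r * (1.0 / 120)))));

    // exp(x) = 2^k * exp(r); ldexp performs the 2^k scaling exactly.
    return std::ldexp(p, static_cast<int>(k));
}

int main() {
    for (double x : {-5.0, -0.1, 0.0, 1.0, 10.0})
        std::printf("x = %5.1f   toy = %.12g   std = %.12g\n", x, toy_exp(x), std::exp(x));
    return 0;
}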
As per glibc, it may use its own implementation which says this in the comment (from sysdeps/ieee754/dbl-64/e_exp.c):
/* An ultimate exp routine. Given an IEEE double machine number x */
/* it computes the correctly rounded (to nearest) value of e^x */
Or it may use hardware supported processor instructions for floating point computations, as with x86 FPU. In both cases you are likely to get a correctly rounded value with full precision.
That depends on which C library implementation you're using. In the very popular glibc, it isn't.
I am writing a simulation program that proceeds in discrete steps. The simulation consists of many nodes, each of which has a floating-point value associated with it that is re-calculated on every step. The result can be positive, negative or zero.
In the case where the result is zero or less something happens. So far this seems straightforward - I can just do something like this for each node:
if (value <= 0.0f) something_happens();
A problem has arisen, however, after some recent changes I made to the program in which I re-arranged the order in which certain calculations are done. In a perfect world the values would still come out the same after this re-arrangement, but because of the imprecision of floating point representation they come out very slightly different. Since the calculations for each step depend on the results of the previous step, these slight variations in the results can accumulate into larger variations as the simulation proceeds.
Here's a simple example program that demonstrates the phenomenon I'm describing:
#include <cstdio>

int main()
{
    float f1 = 0.000001f, f2 = 0.000002f;
    f1 += 0.000004f;            // This part happens first here
    f1 += (f2 * 0.000003f);
    printf("%.16f\n", f1);

    f1 = 0.000001f, f2 = 0.000002f;
    f1 += (f2 * 0.000003f);
    f1 += 0.000004f;            // This time this happens second
    printf("%.16f\n", f1);
    return 0;
}
The output of this program is
0.0000050000057854
0.0000050000062402
even though real-number addition is commutative and associative, so both results should be the same. Note: I understand perfectly well why this is happening - that's not the issue. The problem is that these variations can mean that sometimes a value that used to come out negative on step N, triggering something_happens(), now may come out negative a step or two earlier or later, which can lead to very different overall simulation results because something_happens() has a large effect.
What I want to know is whether there is a good way to decide when something_happens() should be triggered that is not going to be affected by the tiny variations in calculation results that result from re-ordering operations so that the behavior of newer versions of my program will be consistent with the older versions.
The only solution I've so far been able to think of is to use some value epsilon like this:
if (value < epsilon) something_happens();
but because the tiny variations in the results accumulate over time I need to make epsilon quite large (relatively speaking) to ensure that the variations don't result in something_happens() being triggered on a different step. Is there a better way?
I've read this excellent article on floating point comparison, but I don't see how any of the comparison methods described could help me in this situation.
Note: Using integer values instead is not an option.
Edit: the possibility of using doubles instead of floats has been raised. This wouldn't solve my problem, since the variations would still be there; they'd just be of a smaller magnitude.
I've worked with simulation models for 2 years and the epsilon approach is the sanest way to compare your floats.
Generally, using suitable epsilon values is the way to go if you need to use floating point numbers. Here are a few things which may help:
If your values are in a known range and you don't need divisions, you may be able to scale the problem and use exact operations on integers. In general, these conditions don't apply.
A variation is to use rational numbers to do exact computations. This still has restrictions on the operations available and it typically has severe performance implications: you trade performance for accuracy.
The rounding mode can be changed. This can be used to compute an interval rather than an individual value (possibly with 3 values resulting from rounding up, rounding down, and rounding to nearest). Again, it won't work for everything, but you may get an error estimate out of this.
Keeping track of the value and a number of operations (possibly multiple counters) may also be used to estimate the current size of the error.
To experiment with different numeric representations (float, double, interval, etc.) you might want to implement your simulation as a template parameterized on the numeric type (see the sketch after this answer).
There are many books written on estimating and minimizing errors when using floating point arithmetic. This is the topic of numerical mathematics.
In most cases I'm aware of, people experiment briefly with some of the methods mentioned above, conclude that the model is imprecise anyway, and don't bother with the effort. Doing something other than using float may yield better results but is often too slow; even using double hurts, due to the doubled memory footprint and the reduced opportunity to use SIMD operations.
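A bare-bones sketch of the template suggestion above; the node structure and the update rule are placeholders, the point is only that the numeric type becomes a parameter you can swap (float, double, an interval type, ...):

#include <cstdio>

// Hypothetical node type and update step, parameterized on the numeric type.
template <typename Real>
struct Node {
    Real value;
};

template <typename Real>
void step(Node<Real>& node) {
    // Stand-in for the real per-step calculation.
    node.value += Real(0.000002) * Real(0.000003);
    node.value += Real(0.000004);
    if (node.value <= Real(0)) {
        // something_happens();
    }
}

int main() {
    Node<float>  nf{0.000001f};
    Node<double> nd{0.000001};
    step(nf);
    step(nd);
    std::printf("float:  %.16f\ndouble: %.16f\n", nf.value, nd.value);
    return 0;
}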
I recommend that you single step - preferably in assembly mode - through the calculations while doing the same arithmetic on a calculator. You should be able to determine which calculation orderings yield results of lesser quality than you expect and which work. You will learn from this and probably write better-ordered calculations in the future.
In the end - given the examples of numbers you use - you will probably need to accept the fact that you won't be able to do equality comparisons.
As for the epsilon approach, you usually need one epsilon for every possible exponent. For the single-precision floating point format you would need 256 single-precision values, as the exponent is 8 bits wide. Some exponents are reserved for special cases, but for simplicity it is better to have a 256-member table than to do a lot of extra testing.
One way to do this could be to determine your base epsilon for the case where the exponent is 0, i.e. the value to be compared against is in the range 1.0 <= x < 2.0. Preferably the epsilon should be chosen to be base-2 adapted, i.e. a value that can be exactly represented in the single-precision floating point format - that way you know exactly what you are testing against and won't have to think about rounding problems in the epsilon as well. For exponent -1 you would use your base epsilon divided by two, for -2 divided by 4, and so on. As you approach the lowest and the highest parts of the exponent range you gradually run out of precision - bit by bit - so you need to be aware that extreme values can cause the epsilon method to fail.
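A sketch of that scheme using the standard library instead of a 256-entry table; frexp and ldexp are from <cmath>, and the base epsilon here is just an example value:

#include <cmath>
#include <cstdio>

// Scale a base epsilon, chosen for values in [1.0, 2.0), to the binade of the
// value being compared against.
float scaled_epsilon(float reference, float base_eps) {
    int e = 0;
    std::frexp(reference, &e);           // reference = m * 2^e with 0.5 <= |m| < 1
    return std::ldexp(base_eps, e - 1);  // base_eps * 2^(e-1)
}

int main() {
    const float base_eps = std::ldexp(1.0f, -18);   // 2^-18, exactly representable
    std::printf("eps near 1.5:  %g\n", scaled_epsilon(1.5f, base_eps));
    std::printf("eps near 0.75: %g\n", scaled_epsilon(0.75f, base_eps));
    std::printf("eps near 3000: %g\n", scaled_epsilon(3000.0f, base_eps));
    return 0;
}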
If it absolutely has to be floats then using an epsilon value may help but may not eliminate all problems. I would recommend using doubles for the spots in the code you know for sure will have variation.
Another way is to use floats to emulate doubles; there are many techniques out there, and the most basic one is to use 2 floats and do a little bit of math to save most of the number in one float and the remainder in the other (I saw a great guide on this; if I find it I'll link it).
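The most basic building block of those float-float techniques is the error-free addition below (Knuth's TwoSum); this is only the addition primitive, not a full double-float arithmetic package:

#include <cstdio>

struct FloatFloat {
    float hi;   // most of the number
    float lo;   // the part that did not fit into hi
};

// Knuth's TwoSum: hi = a + b rounded to float, lo = the exact rounding error.
// (Assumes the compiler does not reorder the operations, e.g. no -ffast-math.)
FloatFloat two_sum(float a, float b) {
    float s  = a + b;
    float bb = s - a;
    float lo = (a - (s - bb)) + (b - bb);
    return {s, lo};
}

int main() {
    // 1e-9 is far below the precision of a float near 1.0, so it would normally vanish.
    FloatFloat r = two_sum(1.0f, 1e-9f);
    std::printf("hi = %.9g, lo = %.9g\n", r.hi, r.lo);   // hi = 1, lo keeps the 1e-9
    return 0;
}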
Certainly you should be using doubles instead of floats. This will probably reduce the number of flipped nodes significantly.
Generally, using an epsilon threshold is only useful when you are comparing two floating-point numbers for equality, not when you are comparing them to see which is bigger. So (for most models, at least) using epsilon won't gain you anything at all -- it will just change the set of flipped nodes, it won't make that set smaller. If your model itself is chaotic, then it's chaotic.
So we have a function like (pow(e,(-a*x)))/(sqrt(x)) where a, e are const floats. We have some float eps = pow(10,(-4)). We need to find out starting from which x the integral of that function from x to infinity is less than eps. We cannot use special built-in integration functions, just standard math operators. The point is to achieve maximum evaluation speed.
If you perform the u-substitution u=sqrt(x), your integral will become 2 * integral e^(-au^2) du. With one more substitution you can reduce it to a standard normal. Once you have it in standard normal form, this reduces to calculating erf(x). The substitutions can be done abstractly for any a, and the results hardcoded for simplicity and speed.
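Following that substitution through gives a closed form for the tail; a sketch assuming a > 0 and the C++11 std::erfc from <cmath> (no hand-rolled series needed if the library function is allowed):

#include <cmath>
#include <cstdio>

// integral from x to infinity of exp(-a*t) / sqrt(t) dt
//   = 2 * integral from sqrt(x) to infinity of exp(-a*u*u) du     (u = sqrt(t))
//   = sqrt(pi/a) * erfc(sqrt(a*x))
double tail(double a, double x) {
    const double kPi = 3.141592653589793;
    return std::sqrt(kPi / a) * std::erfc(std::sqrt(a * x));
}

int main() {
    const double a = 1.0, eps = 1e-4;
    for (double x : {0.0, 1.0, 5.0, 10.0})
        std::printf("x = %4.1f   tail = %.6g   below eps: %d\n",
                    x, tail(a, x), tail(a, x) < eps);
    return 0;
}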
To calculate this integral you need to calculate the error function. If you use gcc you can find an erf(...) function in math.h, but it doesn't take parameters to control the precision. You can also evaluate the error function's value yourself using its Taylor series. With a given eps it is possible to calculate the exact number of terms of the series needed.
Hmm, no one seems to understand the question. The question is: given some function f, find the smallest x such that the integral of f from x to +infinity is less than eps. That's the question. So basically we try x = 0, then x = 0.1, then x = 0.2 ... until the integral, for all intents and purposes, vanishes.
For example, given the bell curve for IQ of programmers on SO, at what IQ is the cumulative intelligence of programmers with higher IQ vanishingly small? If we pick x = 100 we know at least half the programmers will have a higher IQ than 100, if we pick 120, how many are left? What about 200? If we have 10,000 programmers here and eps = 1/10000 we're basically asking what IQ the top 0.01% of SO contributors have.
The question is: what is the most efficient way to find this number, given that nothing is known about f other than that it decreases fast enough that its integral from x to infinity approaches zero as x approaches infinity?
The general answer is: you must start with a guess of some kind. If the result is too big, double your guess, and keep going until you satisfy the requirement. Then go back to the last value you had (which didn't satisfy it) and do a binary chop to find the smallest x satisfying the requirement.
To make a good guess is hard. One way is to use a Chebyshev approximation of the function, integrate it analytically, solve the problem with the resulting polynomial, and use the solution as your starting guess. The assumption is that all functions look like polynomials of sufficiently high order in any given range.
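A sketch of that doubling-plus-binary-chop search; the tail() function here reuses the erfc closed form from the earlier answer, but any routine that estimates the integral from x to infinity would slot in (the 60-iteration cap and the starting guess of 1.0 are arbitrary choices):

#include <cmath>
#include <cstdio>

// Integral from x to infinity of exp(-a*t)/sqrt(t), via the erfc closed form.
double tail(double a, double x) {
    const double kPi = 3.141592653589793;
    return std::sqrt(kPi / a) * std::erfc(std::sqrt(a * x));
}

// Smallest x (to within floating point resolution) with tail(a, x) < eps.
double smallest_x_below_eps(double a, double eps) {
    double lo = 0.0;                    // tail(lo) is still >= eps
    double hi = 1.0;
    while (tail(a, hi) >= eps) {        // double the guess until the tail is small enough
        lo = hi;
        hi *= 2.0;
    }
    for (int i = 0; i < 60; ++i) {      // binary chop between the last two guesses
        double mid = 0.5 * (lo + hi);
        if (tail(a, mid) < eps) hi = mid; else lo = mid;
    }
    return hi;
}

int main() {
    const double a = 1.0, eps = 1e-4;
    double x = smallest_x_below_eps(a, eps);
    std::printf("x = %.6f   tail(x) = %g\n", x, tail(a, x));
    return 0;
}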