Bracketing algorithm when root finding. Single root in "quadratic" function - c++

I am trying to implement a root finding algorithm. I am using the hybrid Newton-Raphson algorithm found in numerical recipes that works pretty nicely. But I have a problem in bracketing the root.
While implementing the root finding algorithm I realised that in several cases my functions have 1 real root and all the other imaginary (several of them, usually 6 or 9). The only root I am interested is in the real one so the problem is not there. The thing is that the function approaches the root like a cubic function, touching with the point the y=0 axis...
Newton-Rapson method needs some brackets of different sign and all the bracketing methods I found don't work for this specific case.
What can I do? It is pretty important to find that root in my program...
EDIT: more problems: sometimes due to reaaaaaally small numerical errors, say a variation of 1e-6 in some value the "cubic" function does NOT have that real root, it is just imaginary with a neglectable imaginary part... (checked with matlab)
EDIT 2: Much more information about the problem.
Ok, I need root finding algorithm.
Info I have:
The root I need to find is between [0-1] , if there are more roots outside that part I am not interested in them.
The root is real, there may be imaginary roots, but I don't want them.
Probably all the rest of the roots will be imaginary
The root may be double in that point, but I think that actually doesn't mater in numerical analysis problems
I need to use the root finding algorithm several times during the overall calculations, but the function will always be a polynomial
In one of the particular cases of the root finding, my polynomial will be similar to a quadratic function that touches Y=0 with the point. Example of a real case:
The coefficient may not be 100% precise and that really slight imprecision may make the function not to touch the Y=0 axis.
I cannot solve for this specific case because in other cases it may be that the polynomial is pretty normal and doesn't make any "strange" thing.
The method I am actually using is NewtonRaphson hybrid, where if the derivative is really small it makes a bisection instead of NewRaph (found in numerical recipes).
Matlab's answer to the function on the image:
roots:
0.853553390593276 + 0.353553390593278i
0.853553390593276 - 0.353553390593278i
0.146446609406726 + 0.353553390593273i
0.146446609406726 - 0.353553390593273i
0.499999999999996 + 0.000000040142134i
0.499999999999996 - 0.000000040142134i
The function is a real example I prepared where I know that the answer I want is 0.5
Note:
I still haven't check completely some of the answers I you people have give me (Thank you!), I am just trying to give al the information I already have to complete the question.

Assuming you have a one-dimensional polynomial problem (which I assume from the imaginary solutions) you can use Sturm sequences to bracket all real roots. See Sturm's theorem.

Welcome to the wonderful world of numerical methods. Watch your hairline; it might start receding as you pull your hair out in frustration.
First off, with numerical root finding, you are toast if you can't bracket the problem. Newton Raphson is nice for polishing off a solution once you get close, and it only works if the derivative near the root is well away from zero. You always need to have some slower technique at hand as a backup because Newton Raphson can send you off to never-never land (i.e., somewhere well outside the bracket). If your function is not a polynomial, the first thing to try is Brent's method. If your function is a polynomial, try Laguerre's method or Jenkins-Traub.
BTW, it sounds like you have a pathological problem. You shouldn't expect particularly good performance. Pathological problems are, well, pathological.
Addendum
If you are having problems with things that appear to be roots, but aren't, you need to take care how you evaluate your function. If you do have a polynomial, form each term of the polynomial, sort by absolute value, and add smallest to largest. This produces better accuracy most of the time, but fails if you have large terms whose sum is nearly zero. If that's the case, you might want to add those canceling terms separately, add the rest smallest to largest, and then compute a grand total -- and your still kinda screwed. That big addition that nearly cancels loses a lot of precision. There's no escape other than extended precision arithmetic.

Ander, thanks for responding to my question (about the interval); sorry for the delay in following up - I have very busy work. Also - before I found the additional information you've provided - I had in mind to explain quite a few things how to handle this and was contemplating how to present that. However, I now believe your case is not too difficult and we can get at it without too much additional stuff, since you apparently have an explicit polynomial expression (coefficients to the various powers).
Let's start with a simple case, to pinpoint the approach.
Step 1.
If you have a 2nd degree polynomial, its derivative is first order and has a simple zero (which you can find by bracketing or simply by explicitly solving the equation). (Yes, I know there's a closed formula for the roots of a 2nd degree polynomial also, but for the sake of the current argument, let us forget that).
The zero's of the 2nd degree polynomial are then located one at the left side and one at the right side of the zero of the derivative. So, if you also have the interval where the roots of the original function (the 2nd degree polynomial) are to be found, you now have two intervals - left and right of the derivative-zero, each with one zero.
It is important to realize that the original function is MONOTONIC on each subinterval (decreasing on one of them, increasing on the other). Therefore, simply by checking the function values at the ends of the (sub)interval you can determine whether or not they actually bracket a zero. If not, there's a multiple zero (double, in this case) exactly at the zero of the derivative IF the function is zero there (otherwise, it is a double imaginary root of which you've now found the real part).
In case the zero of the derivative lies OUTSIDE the total interval, you will have at most one root inside your interval and you need to check only that particular (sub)interval.
Step 2.
Consider now a 3rd order polynomial.
Its derivative is 2nd order.
The derivative of THAT 2nd order polynomial is again 1st order and you proceed as before to get two subintervals to find the roots of the derivative of the original function. These two roots give you THREE (at most) intervals where you will find the 3 roots of the original (3rd order) function.
And also here, you will have intervals (3) where the original function is monotonic (alternatingly increasing/decreasing), making the analysis per subinterval quite easy.
Again, zeros may coincide (2 or even all 3) and may in addition turn out to be complex-valued (i.e. having non-zero imaginary parts). The analysis of the cases is straightforward: check function values at the borders of the intervals to assess whether not there's a sign-change (function is monotonic on each subinterval) and/or whether the function is zero at one of the subinterval-borders.
Step 4.
Generalize this with the known polynomial. Let's say - your example - it is 6th order:
a) construct the 5th derivative (i.e. reducing the original to a 1st order polynomial). Compute it's zero (it is at precisely 0.5 in your example). In this case you're already done, but suppose you don't realize that. So you have now 2 intervals 0..0.5 and 0.5..1
b) construct the 4th derivative. Inspect its values at the subinterval-boundaries (0, 0.5, 1)
For each subinterval determine if it has a real zero inside. If so, you re-partition your original interval in 3 subintervals, using the two found zeros (you forget about the zero of the 5th derivative). If they coincide (at the previous cut, 0.5) you stick with that 0.5 (don't care whether you've found a true double zero of your 4th derivative there or a "double imaginary") and still have only 2 intervals, but for the sake of the argument let's say you now have 3.
c) construct the 3rd derivative and do likewise as before. You will then have 4 (at most) intervals.
d) And so on. After having processed the 2nd derivative in this fashion you have 5 (at most) intervals, and after processing the 1st derivative you have 6 intervals (or less...) and knowing the function is monotonic on each subinterval, you'll quickly determine in each of them if there's a real root, as always using the know monotonicity of the function in each of the final subintervals.
Adding a note on numerical accuracy at evaluating a function:
A first (probably sufficient, in this case) method to reduce noise is NOT to evaluate your function in the way suggested by the original form (i.e. a6 x*6 + a5 x*5 +..), but to rewrite it as:
a0 + x*(a1 + x*(a2 + x*(a3 + x*(a4 + x*(a5 + x*a6)))))
So, in evaluating you proceed:
tmp = a6
tmp = x*tmp + a5
tmp = x*tmp + a4
etcetera.
In case this little rewriting is not sufficient for numerical stability, you should rewrite your polynomial in (for instance) a chebyshev-polynomial expansion and evaluate that one with its recurrence relations. Both (getting the expansion and applying the recurrence relations for evaluation) are rather simple. I can explain, if you need help, but I guess it won't be necessary here.
In all cases, you HAVE to allow for some inaccuracy, i.e. accept that a computation will, generally speaking, NEVER give you the mathematically exact function value. So the assessment whether the function is presumably zero at some point must include some "tolerance", there's no way around this, unfortunately; the best you can aim for is to minimize the noise.

Well, if your function touches zero but never crosses it, you seem to be looking for a minimum (or a maximum). In which case, you're better off telling computer to do exactly that --- either find the root of a derivative (if you can calculate it analytically), or use a minimization routine. Then check that the function value at the minimum is 'close enough' to zero.
Just to reiterate what was already said by other people:
don't start with Newton-Raphson method; it's almost always better to start with Brent or even a straightforward bisection (provided you can bracket the root).
An instability where 'small numerical errors' of the order of 1e-6 have bad effects is worth investigating. Immediate suspects: mixing floats and doubles, loss of precision somewhere etc.
EDIT: So, depending on some parameters, your function has either a zero crossing, or a minimum with zero value, is this correct? In this case, what I'd do is this: use a simple and robust bracketing strategy (e.g. start from [-1, 1], multiply the endpoints by 1.1, check the signs, keep multiplying, something like this). If that succeeds, there's a zero crossing, use a root finding routine. If bracketing fails, use minimization.

Using Newton-Raphson is an act of desperation. You are much better off finding the continued fraction that represents your function and calculating that. A CF will converge much faster and will produce the real root(s). Also, because the CF produces a ratio of two integers you have tight control over numeric precision and don't have to worry about accumulation of rounding errors and other similar hair-pulling-out problems.
To find the real roots of any polynomial function refer to "A Continued Fraction Algorithm for Approximating All Real Polynomial Roots" by David Rosen (1978).
------------ ADDENDUM 1 --- 11 OCT-----------------
Ok, you are solving a sextic. You have several options. The simplest is to use a Taylor approximation (say to the 3rd degree) in conjunction with Halley's method. This is much superior to Newton because it has cubic convergence and you can detect imaginary solutions. The disadvantage is that you will have rounding problems which may result in an incorrect answer.
The ideal option is to find the continued fraction that represents the monic root, because this CF will be computable as an integer ratio of any desired precision, thus elminating the problem of rounding.
One approach to computing this CF is via the Jacobi-Perron algorithm. See the paper Hendy and Jeans: http://www.ams.org/mcom/1981-36-154/S0025-5718-1981-0606514-X/S0025-5718-1981-0606514-X.pdf. This paper shows the exact algorithm for computing cubic and quartic roots via CF approximation.
Note that if the sextic is reducible then it can converted into a quartic and quadratic: http://elib.mi.sanu.ac.rs/files/journals/tm/21/tm1124.pdf. The quartic is then solvable by the algorithm in the Hendy paper.
The general solution to generate a CF for a sextic can be done via the Rogers-Ramunajan CF. See the following paper for the method: http://arxiv.org/pdf/1111.6023v2. This will generate the CF for any sextic.

As in your case, you are interested in the real factorization of a real polynomial. One may see that all complex roots come in conjugate pairs which correspond to a real quadratic factor. By finding this real quadratic and completing the square to get the form (x-r)^2 + s you will be able to see the "real" even order root r with an "error" given by s. If s > 0 is too large, you may discard it as probably being complex. If s < 0 is also large, then you have two faraway real roots given by x = r ± √(-s). If s is very small then you might suspect r is a real double root and keep it.
Finding such a quadratic factor may be done using Bairstow's method, which actually applies a two-dimensional Newton method. This gives x^2 + ux + v and r = -u/2; s = v - r^2.

Related

Efficient way to perform summation to infinite in c++

I am new to c++ and I am trying to implement a function that has a summation between k=1 and infinite:
Do you know how summation to infinite can be implemented efficiently in c++?
Mathematically, summing to infinite is perfectly valid (in some contexts). Formally, the sum you've written is the same as taking the limit of the summation for all k from 1 through n as n tends towards infinite.
Computers, however, cannot compute such a summation via brute force; it would take infinite time to iterate over a loop infinitely many times. Other than taking an approximation, you might be able to find a closed form equation for this sum. Unfortunately, it's not just a simple geometric series, so finding a closed form solution may not be possible, and is almost certainly non-trivial.
Do you know how summation to infinite can be implemented efficiently in c++?
Nothing of this sort, doing X untill infinity to arrive at any finite result can be done in any language. This is impossible, there is not sugar coated way to say this, but it is how it is for digital computers.
You can likely however make approximations. Example, a sin x is simply an expression that goes on to infinity. But to compute sin x we don't run the loop till infinity, there will be approximations made saying the terms after these are so small they are within the error bounds. And hence ignored.
You have to do a similar approach. Assuming t-t1 is always positive, there will come a k where the term (k.pi/2)^2 is so big that 1/e^(...) will be very small, say below 0.00001 and say you want your result to be in an error range of +- 0.001 and so if the series is convergent (look up converging series) then you can likely ignore these terms. Essentially it's a pen and paper problem first that you can then translate to your code.

Checking that floating points arguments are correct

I want to write a class representing Markov chain (let's name it MC). It has a constructor, which takes the state transition matrix (that is, vector<vector<double>>. I suppose, it is a good idea to check it is really a matrix (has the same number of rows and columns) and is really a transition matrix: all the numbers in it are probabilities, that is, no less than 0.0 and no greater than 1.0, and for every row the sum of its elements is 1.0. However, there is a problem which arises from floating point limitations: for example, the sum 0.3 + 0.3 + 0.3 + 0.1 will not be equal to 1.0, so the check will not be that easy. So I see two possible solutions of that problem:
Choose some epsilon and compare with epsilon error. Of course it will now accept some matrices violating the transition matrix property, but in general, if someone occasionally passes some bad data into the constructor, he will get an exception.
Don't check anything, rely on the class' user, if he passes something bad, it is completely his fault, and the behavior of the class will be unexpected.
What approach is better and more "real-world"? I like the first, but again, not sure how should I choose epsilon.
Do the second one.
Your class isn't in the business of summing up lists of floating-point numbers and deciding what's "close enough" to 1 and what isn't. Your user is. Your class represents Markov chains. You won't be able to choose a value of epsilon so that your class represents Markov chains in a useful way.
Think about the operations you're going to implement. Maybe you're going to have a function that hits a probability distribution on the states of the chain with the chain's transition matrix. Should that function check whether the input probability distribution is a probability distribution within some epsilon?
Your function almost certainly won't preserve the "is a probability distribution" property; you'll get some drift due to rounding error away from the space of probability distributions as you repeatedly hit your probability distribution by your Markov chain. You can correct this by normalising afterward, but that causes even more inaccuracy.
Now think about the "given a Markov chain and an integer k, return the Markov chain formed by iterating the input chain k times" operation. You can see that this will accumulate roundoff and suffer from much the same problems as "hit probability distribution with Markov chain."
Wouldn't it suck if you only had a choice between stuff that breaks after 12 hours of use and stuff that's unnecessarily inaccurate?
(Checking the squareness and matrixness of the square matrix argument is, of course, totally reasonable.)

How can I get consistent program behavior when using floats?

I am writing a simulation program that proceeds in discrete steps. The simulation consists of many nodes, each of which has a floating-point value associated with it that is re-calculated on every step. The result can be positive, negative or zero.
In the case where the result is zero or less something happens. So far this seems straightforward - I can just do something like this for each node:
if (value <= 0.0f) something_happens();
A problem has arisen, however, after some recent changes I made to the program in which I re-arranged the order in which certain calculations are done. In a perfect world the values would still come out the same after this re-arrangement, but because of the imprecision of floating point representation they come out very slightly different. Since the calculations for each step depend on the results of the previous step, these slight variations in the results can accumulate into larger variations as the simulation proceeds.
Here's a simple example program that demonstrates the phenomena I'm describing:
float f1 = 0.000001f, f2 = 0.000002f;
f1 += 0.000004f; // This part happens first here
f1 += (f2 * 0.000003f);
printf("%.16f\n", f1);
f1 = 0.000001f, f2 = 0.000002f;
f1 += (f2 * 0.000003f);
f1 += 0.000004f; // This time this happens second
printf("%.16f\n", f1);
The output of this program is
0.0000050000057854
0.0000050000062402
even though addition is commutative so both results should be the same. Note: I understand perfectly well why this is happening - that's not the issue. The problem is that these variations can mean that sometimes a value that used to come out negative on step N, triggering something_happens(), now may come out negative a step or two earlier or later, which can lead to very different overall simulation results because something_happens() has a large effect.
What I want to know is whether there is a good way to decide when something_happens() should be triggered that is not going to be affected by the tiny variations in calculation results that result from re-ordering operations so that the behavior of newer versions of my program will be consistent with the older versions.
The only solution I've so far been able to think of is to use some value epsilon like this:
if (value < epsilon) something_happens();
but because the tiny variations in the results accumulate over time I need to make epsilon quite large (relatively speaking) to ensure that the variations don't result in something_happens() being triggered on a different step. Is there a better way?
I've read this excellent article on floating point comparison, but I don't see how any of the comparison methods described could help me in this situation.
Note: Using integer values instead is not an option.
Edit the possibility of using doubles instead of floats has been raised. This wouldn't solve my problem since the variations would still be there, they'd just be of a smaller magnitude.
I've worked with simulation models for 2 years and the epsilon approach is the sanest way to compare your floats.
Generally, using suitable epsilon values is the way to go if you need to use floating point numbers. Here are a few things which may help:
If your values are in a known range you and you don't need divisions you may be able to scale the problem and use exact operations on integers. In general, the conditions don't apply.
A variation is to use rational numbers to do exact computations. This still has restrictions on the operations available and it typically has severe performance implications: you trade performance for accuracy.
The rounding mode can be changed. This can be use to compute an interval rather than an individual value (possibly with 3 values resulting from round up, round down, and round closest). Again, it won't work for everything but you may get an error estimate out of this.
Keeping track of the value and a number of operations (possible multiple counters) may also be used to estimate the current size of the error.
To possibly experiment with different numeric representations (float, double, interval, etc.) you might want to implement your simulation as templates parameterized for the numeric type.
There are many books written on estimating and minimizing errors when using floating point arithmetic. This is the topic of numerical mathematics.
Most cases I'm aware of experiment briefly with some of the methods mentioned above and conclude that the model is imprecise anyway and don't bother with the effort. Also, doing something else than using float may yield better result but is just too slow, even using double due to the doubled memory footprint and the smaller opportunity of using SIMD operations.
I recommend that you single step - preferably in assembly mode - through the calculations while doing the same arithmetic on a calculator. You should be able to determine which calculation orderings yield results of lesser quality than you expect and which that work. You will learn from this and probably write better-ordered calculations in the future.
In the end - given the examples of numbers you use - you will probably need to accept the fact that you won't be able to do equality comparisons.
As to the epsilon approach you usually need one epsilon for every possible exponent. For the single-precision floating point format you would need 256 single precision floating point values as the exponent is 8 bits wide. Some exponents will be the result of exceptions but for simplicity it is better to have a 256 member vector than to do a lot of testing as well.
One way to do this could be to determine your base epsilon in the case where the exponent is 0 i e the value to be compared against is in the range 1.0 <= x < 2.0. Preferably the epsilon should be chosen to be base 2 adapted i e a value that can be exactly represented in a single precision floating point format - that way you know exactly what you are testing against and won't have to think about rounding problems in the epsilon as well. For exponent -1 you would use your base epsilon divided by two, for -2 divided by 4 and so on. As you approach the lowest and the highest parts of the exponent range you gradually run out of precision - bit by bit - so you need to be aware that extreme values can cause the epsilon method to fail.
If it absolutely has to be floats then using an epsilon value may help but may not eliminate all problems. I would recommend using doubles for the spots in the code you know for sure will have variation.
Another way is to use floats to emulate doubles, there are many techniques out there and the most basic one is to use 2 floats and do a little bit of math to save most of the number in one float and the remainder in the other (saw a great guide on this, if I find it I'll link it).
Certainly you should be using doubles instead of floats. This will probably reduce the number of flipped nodes significantly.
Generally, using an epsilon threshold is only useful when you are comparing two floating-point number for equality, not when you are comparing them to see which is bigger. So (for most models, at least) using epsilon won't gain you anything at all -- it will just change the set of flipped nodes, it wont make that set smaller. If your model itself is chaotic, then it's chaotic.

Need pow(-1,1.2) to be 1

I am using math.h with GCC and GSL. I was wondering how to get this to evaluate?
I was hoping that the pow function would recognize pow(-1,1.2) as ((-1)^6)^(1/5). But it doesn't.
Does anybody know of a c++ library that will recognize these? Perhaps somebody has a decomposition routine they could share.
Mathematically, pow(-1, 1.2) is simply not defined. There are no powers with fractional exponents of negative numbers, and I hope there is no library that will simply return some arbitray value for such an expression. Would you also expect things like
pow(-1, 0.5) = ((-1)^2)^(1/4) = 1
which obviously isn't desirable.
Moreover, the floating point number 1.2 isn't even exactly equal to 6/5. The closest double precision number to 1.2 is
1.1999999999999999555910790149937383830547332763671875
Given this, what result would you expect now for pow(-1, 1.2)?
If you want to raise negative numbers to powers -- especially fractional powers -- use the cpow() method. You'll need to include <complex> to use it.
It seems like you're looking for pow(abs(x), y).
Explanation: you seem to be thinking in terms of
xy = (xN)(y/N)
If we choose that N === 2, then you have
(x2)y/2 = ((x2)1/2)y
But
(x2)1/2 = |x|
Substituting gives
|x|y
This is a stretch, because the above manipulations only work for non-negative x, but you're the one who chose to use that assumption.
Sounds like you want to perform a complex power (cpow()) and then take the magnitude (abs()) of that after.
>>> abs(cmath.exp(1.2*cmath.log(-1)))
1.0
>>> abs(cmath.exp(1.2*cmath.log(-293.2834)))
913.57662451612202
pow(a,b) is often thought of, defined as, and implemented as exp(log(a)*b) where log(a) is natural logarithm of a. log(a) is not defined for a<=0 in real numbers. So you need to either write a function with special case for negative a and integer b and/or b=1/(some_integer). It's easy to special-case for integer b, but for b=1/(some_integer) it's prone to round-off problems, like Sven Marnach pointed out.
Maybe for your domain pow(-a,b) should always be -pow(a,b)? But then you'd just implement such function, so I assume the question warrants more explanation .
Like duskwuff suggested, a much more robust and "mathematical" solution is to use complex functions log and exp, but it's much more "complex" (excuse my pun) than it seems on the surface (even though there's cpow function). And it'll be much slower if you have to compute a lot of pow()s.
Now there's an important catch with complex numbers that may or may not be relevant to your problem domain: when done right, the result of pow(a,b) is not one, but often a few complex numbers, but in the cases you care about, one of them will be complex number with nearly-zero imaginary part (it'll be non-zero due to roundoff errors) which you can simply ignore and/or not compute in your code.
To demonstrate it, consider what pow(-1,.5) is. It's a number X such that X^2==-1. Guess what? There are 2 such numbers: i and -i. Generally, pow(-1, 1/N) has exactly N solutions, although you're interested in only one of them.
If the imaginary part of all results of pow(a,b) is significant, it means you are passing wrong values. For single-precision floating point values in the range you describe, 1e-6*max(abs(a),abs(b)) would be a good starting point for defining the "significant enough" threshold. The extreme "wrong values" would be pow(-1,0.5) which would return 0 + 1i (0 in real part, 1 in imaginary part). Here the imaginary part is huge relative to the input and real part, so you know you screwed up your input values.
In any reasonable single-return-result implementation of cpow() , cpow(-1,0.3333) will probably return something like -1+0.000001i and ignore two other values with significant imaginary parts. So you can just take that real value and that's your answer.
Use std::complex. Without that, the roots of unity don't make much sense. With it they make a whole lot of sense.

Accurate evaluation of 1/1 + 1/2 + ... 1/n row

I need to evaluate the sum of the row: 1/1+1/2+1/3+...+1/n. Considering that in C++ evaluations are not complete accurate, the order of summation plays important role. 1/n+1/(n-1)+...+1/2+1/1 expression gives the more accurate result.
So I need to find out the order of summation, which provides the maximum accuracy.
I don't even know where to begin.
Preferred language of realization is C++.
Sorry for my English, if there are any mistakes.
For large n you'd better use asymptotic formulas, like the ones on http://en.wikipedia.org/wiki/Harmonic_number;
Another way is to use exp-log transformation. Basically:
H_n = 1 + 1/2 + 1/3 + ... + 1/n = log(exp(1 + 1/2 + 1/3 + ... + 1/n)) = log(exp(1) * exp(1/2) * exp(1/3) * ... * exp(1/n)).
Exponents and logarithms can be calculated pretty quickly and accuratelly by your standard library. Using multiplication you should get much more accurate results.
If this is your homework and you are required to use simple addition, you'll better add from the smallest one to the largest one, as others suggested.
The reason for the lack of accuracy is the precision of the float, double, and long double types. They only store so many "decimal" places. So adding a very small value to a large value has no effect, the small term is "lost" in the larger one.
The series you're summing has a "long tail", in the sense that the small terms should add up to a large contribution. But if you sum in descending order, then after a while each new small term will have no effect (even before that, most of its decimal places will be discarded). Once you get to that point you can add a billion more terms, and if you do them one at a time it still has no effect.
I think that summing in ascending order should give best accuracy for this kind of series, although it's possible there are some odd corner cases where errors due to rounding to powers of (1/2) might just so happen to give a closer answer for some addition orders than others. You probably can't really predict this, though.
I don't even know where to begin.
Here: What Every Computer Scientist Should Know About Floating-Point Arithmetic
Actually, if you're doing the summation for large N, adding in order from smallest to largest is not the best way -- you can still get into a situation where the numbers you're adding are too small relative to the sum to produce an accurate result.
Look at the problem this way: You have N summations, regardless of ordering, and you wish to have the least total error. Thus, you should be able to get the least total error by minimizing the error of each summation -- and you minimize the error in a summation by adding values as nearly close to each other as possible. I believe that following that chain of logic gives you a binary tree of partial sums:
Sum[0,i] = value[i]
Sum[1,i/2] = Sum[0,i] + Sum[0,i+1]
Sum[j+1,i/2] = Sum[j,i] + Sum[j,i+1]
and so on until you get to a single answer.
Of course, when N is not a power of two, you'll end up with leftovers at each stage, which you need to carry over into the summations at the next stage.
(The margins of StackOverflow are of course too small to include a proof that this is optimal. In part because I haven't taken the time to prove it. But it does work for any N, however large, as all of the additions are adding values of nearly identical magnitude. Well, all but log(N) of them in the worst not-power-of-2 case, and that's vanishingly small compared to N.)
http://en.wikipedia.org/wiki/Arbitrary-precision_arithmetic
You can find libraries with ready for use implementation for C/C++.
For example http://www.apfloat.org/apfloat/
Unless you use some accurate closed-form representation, a small-to-large ordered summation is likely to be most accurate simple solution (it's not clear to me why a log-exp would help - that's a neat trick, but you're not winning anything with it here, as far as I can tell).
You can further gain precision by realizing that after a while, the sum will become "quantized": Effectively, when you have 2 digits of precision, adding 1.3 to 41 results in 42, not 42.3 - but you achieve almost a precision doubling by maintaining an "error" term. This is called Kahan Summation. You'd compute the error term (42-41-1.3 == -0.3) and correct that in the next addition by adding 0.3 to the next term before you add it in again.
Kahan Summation in addition to a small-to-large ordering is liable to be as accurate as you'll ever need to get. I seriously doubt you'll ever need anything better for the harmonic series - after all, even after 2^45 iterations (crazy many) you'd still only be dealing with a numbers that are at least 1/2^45 large, and a sum that's on the order of 45 (<2^6), for an order of magnitude difference of 51 powers-of-two - i.e. even still representable in a double precision variable if you add in the "wrong" order.
If you go small-to-large, and use Kahan Summation, the sun's probably going to extinguish before today's processors reach a percent of error - and you'll run into other tricky accuracy issues just due to the individual term error on that scale first anyhow (being that a number of the order of 2^53 or larger cannot be represented accurately as a double at all anyhow.)
I'm not sure about the order of summation playing an important role, I havent heard that before. I guess you want to do this in floating point arithmetic so the first thing is to think more inline of (1.0/1.0 + 1.0/2.0+1.0/3.0) - otherwise the compiler will do integer division
to determine order of evaluation, maybe a for loop or brackets?
e.g.
float f = 0.0;
for (int i=n; i>0; --i)
{
f += 1.0/static_cast<float>(i);
}
oh forgot to say, compilers will normally have switches to determine floating point evaluation mode. this is maybe related to what you say on order of summation - in visual C+ these are found in code-generation compile settings, in g++ there're options -float that handle this
actually, the other guy is right - you should do summation in order of smallest component first; so
1/n + 1/(n-1) .. 1/1
this is because the precision of a floating point number is linked to the scale, if you start at 1 you'll have 23 bits of precision relative to 1.0. if you start at a smaller number the precision is relative to the smaller number, so you'll get 23 bits of precision relative to 1xe-200 or whatever. then as the number gets bigger rounding error will occur, but the overall error will be less than the other direction
As all your numbers are rationals, the easiest (and also maybe the fastest, as it will have to do less floating point operations) would be to do the computations with rationals (tuples of 2 integers p,q), and then do just one floating point division at the end.
update to use this technique effectively you will need to use bigints for p & q, as they grow quite fast...
A fast prototype in Lisp, that has built in rationals shows:
(defun sum_harmonic (n acc)
(if (= n 0) acc (sum_harmonic (- n 1) (+ acc (/ 1 n)))))
(sum_harmonic 10 0)
7381/2520
[2.9289682]
(sum_harmonic 100 0)
14466636279520351160221518043104131447711/278881500918849908658135235741249214272
[5.1873775]
(sum_harmonic 1000 0)
53362913282294785045591045624042980409652472280384260097101349248456268889497101
75750609790198503569140908873155046809837844217211788500946430234432656602250210
02784256328520814055449412104425101426727702947747127089179639677796104532246924
26866468888281582071984897105110796873249319155529397017508931564519976085734473
01418328401172441228064907430770373668317005580029365923508858936023528585280816
0759574737836655413175508131522517/712886527466509305316638415571427292066835886
18858930404520019911543240875811114994764441519138715869117178170195752565129802
64067621009251465871004305131072686268143200196609974862745937188343705015434452
52373974529896314567498212823695623282379401106880926231770886197954079124775455
80493264757378299233527517967352480424636380511370343312147817468508784534856780
21888075373249921995672056932029099390891687487672697950931603520000
[7.485471]
So, the next better option could be to mantain the list of floating points and to reduce it summing the two smallest numbers in each step...