Error when Calculating the Approximate Value of e^x [duplicate] - c++

This question already has answers here:
Calculating e^x without using any functions
(4 answers)
Closed 8 years ago.
I am fairly new to C++ and am writing a program to calculate the approximate value of e^x, given by the formula:
1 + X + X^2/2! + ... + X^n/n! (for values of n from 1-100)
The program calculates the value perfectly until the user enters a number for "xValue" larger than 60 (i.e. 61 or greater). I am unsure why this is and would really appreciate some feedback:
void calculate_sum(CalculateEx& numberSum)
{
    double factoralSum;

    numberSum.xTotal = numberSum.xValue;
    numberSum.xTotal++;

    for (double counter = 2; counter <= 100; counter++)
    {
        factoralSum = 1;
        for (double factoral = 1; factoral <= counter; factoral++)
        {
            factoralSum *= factoral;
        }
        numberSum.xNextValue = pow(numberSum.xValue, counter) / factoralSum;
        numberSum.xTotal += numberSum.xNextValue;
    }
    return;
}

Don't calculate the next term from scratch; store the previous one: x^(n+1)/(n+1)! == (x^n)/n! * x/(n+1). This way you won't have to store the values of x^n and especially n! separately (they are simply too big to fit in any reasonable type), whereas the values of x^n/n! converge to 0 as n rises.
Something like this would do:
double prevValue = 1;
double sum = prevValue;
for (size_t power = 1; power <= limit; ++power) {
    prevValue *= x / power;   // x^power / power!  from  x^(power-1) / (power-1)!
    sum += prevValue;
}

Even a double can only fit so many digits. The computer always has a limit.
I know nothing about scientific computing, but I suppose if you wanted greater precision you might have to find a quad-precision floating point number or something.

Your program is attempting to calculate numbers that are out of the range of normal doubles. You can verify this by printing the value of factoralSum after the loop in which it is computed. If you insist on using the Taylor expansion, you may want to check the value of DBL_MAX in <float.h>.
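As a quick sanity check (a sketch of my own, not from the answer, reusing the question's variable names), you can print each intermediate value next to DBL_MAX and see where, or whether, anything actually blows up:

#include <cfloat>   // DBL_MAX
#include <cmath>    // std::pow
#include <cstdio>

int main() {
    double xValue = 61.0;
    double factoralSum = 1.0;
    for (double counter = 2; counter <= 100; counter++) {
        factoralSum *= counter;                                  // counter!
        double term = std::pow(xValue, counter) / factoralSum;   // x^n / n!
        std::printf("n=%3.0f  n!=%.3e  x^n/n!=%.3e  (DBL_MAX=%.3e)\n",
                    counter, factoralSum, term, DBL_MAX);
    }
}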
Java has a class called BigDecimal, which lets you create numbers with arbitrarily large precision. In C++, you may want to reference this question: Is there a C++ equivalent to Java's BigDecimal?

Related

#inf c++ visual studio

I came across a problem when calculating a sum of doubles. When I set iteration to 100000, the function Asian_call_MC still returns a number. However, when I set iteration to around 500000 and above, it begins to return 1.#INF. Can someone tell me why this happens and how to solve it? I am using Visual Studio 2013 to write C++ code.
double Avg_Price(double init_p, double impl_vol, double drift, int step, double deltasqrt)
{
    //Calculate the average price of one sample path
    //delta = T/ step
    //drift = (risk_free - div_y - impl_vol*impl_vol / 2)*(T / step)
    double Sa = 0.0;
    double St = init_p;
    for (int i = 0; i < step; i++)
    {
        St = St*exp(drift + impl_vol*deltasqrt*normal_gen());
        //Sa = Sa * i / (i + 1) + St / (i + 1);
        Sa += St;
    }
    Sa = Sa / double(step);
    return Sa;
}

double Asian_call_MC(double strike_p, double T, double init_p, double impl_vol, double risk_free, double div_y, int iter, int step)
{
    //Calculate constants in advance to reduce computation time
    double drift = (risk_free - div_y - impl_vol*impl_vol / 2)*double(T / step);
    double deltasqrt = sqrt(double(T / step));
    //Generate x1, average x and y
    double cur_p = Avg_Price(init_p, impl_vol, drift, step, deltasqrt);
    double pay_o = 0.0;
    double x = max(cur_p - strike_p, 0.0);
    //double y = pow(x, 2.0);
    //Generate x2 to xn
    for (int i = 0; i < iter; i++)
    {
        cur_p = Avg_Price(init_p, impl_vol, drift, step, deltasqrt);
        x = max(cur_p - strike_p, 0.0);
        //double q = double(i) / double(i + 1);
        //pay_o = pay_o *i/(i+1) + x / (i + 1);
        pay_o += x;
        //y = (1 - (1 / (i + 1)))*y + x*x / (i + 1);
    }
    //pay_o = pay_o / double(iter);
    //stdev = sqrt((y - pow(pay_o , 2)) / (iter - 1));
    //return pay_o*exp(-risk_free*T) ;
    return pay_o;
}
When you are increasing the number of iterations, you are increasing the value of the sum. At some point, the value overflows what is possible to contain within a double, thus returning the 1.#INF value that represents infinity. It does this because the calculated value is greater than what can be held in a double.
To fix the problem, you'll need to change the variable that you're holding the sum with to something that can hold a greater number than a double. The starting point would be using a long double.
Another option would be to build some of the logic that you have after the for loop into it, so you're dealing with smaller numbers. How to do this will vary depending on what exactly you're trying to calculate.
It looks like you want to compute mean values. The way most people learn to calculate a mean is to sum up all the values, then divide the sum by the number of values which contributed to the sum.
This method has a few problems associated with it -- for example, adding many values together might give a sum which is too large for the variable holding it.
Another technique is often used, which accumulates a "running" mean instead of a sum. The running mean's value is always the mean value for all samples already accumulated, so it never blows up into an overflow (floating-point infinity) value (except when one of the accumulated samples was infinity).
The example below demonstrates how to calculate a running mean. It also calculates the sum and shows how sum/count compares to the running mean (to show that they are the same -- I haven't let it run long enough to overflow the sum).
The example uses the C-Library rand(), for demonstration purposes -- I just needed something to calculate mean values from.
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <iomanip>

int main() {
    srand(static_cast<unsigned>(time(0)));

    double count = 0;
    double running_mean = 0;
    double sum = 0;

    auto start = time(0);
    auto end = start + 5;
    while (time(0) < end) {
        double sample = rand();
        count += 1;
        running_mean += (sample - running_mean) / count;
        sum += sample;
    }

    std::cout << std::setprecision(12);
    std::cout << "running mean:" << running_mean << " count:" << count << '\n';

    double sum_mean = sum / count;
    std::cout << "sum:" << sum << " sum/count:" << sum_mean << '\n';
}
Edit: He already tried this --
the technique appeared in commented-out lines that I missed in the OP's code
Unlike computing the average value by accumulating a grand sum, the running mean technique cannot overflow simply because many samples have been accumulated. So knowing that he already tried this and that it didn't help, the probable cause becomes that one of the iteration's terms is, itself, INF. As soon as a single INF term is added, the accumulated sum or mean will become INF and stay INF.
The most likely culprit is normal_gen(), used inside the argument of the call to exp. The name normal_gen() sounds like a source of normally-distributed random values. The usual implementation employs a Box–Muller transform, which cannot produce values more than about 7 standard deviations away from the mean. So if a Box–Muller generator were causing the INF, it would probably occur within fewer iterations than reported. However, more advanced generators can produce more extreme values -- ideally a Normal distribution has a nonzero probability of producing any finite real value.
If a randomly large Normal sample is what was causing the problem, its correlation with increased iteration count would not be that more iterations inflate the sum, to the point of overflowing by adding more values -- it would be that more iterations gave the program a better chance to hit an unlikely random value which would result in an INF term.
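If that is the suspicion, a cheap way to confirm it (a sketch of my own; add_if_finite is a hypothetical helper, not part of the question's code) is to test each sampled payoff with std::isfinite before accumulating it:

#include <cmath>      // std::isfinite
#include <iostream>

// Add a sample to a running sum only if it is finite, and report the first
// offending value so the culprit can be traced back to normal_gen()/exp().
bool add_if_finite(double sample, double& sum, long iteration)
{
    if (!std::isfinite(sample)) {
        std::cerr << "non-finite sample " << sample
                  << " at iteration " << iteration << '\n';
        return false;
    }
    sum += sample;
    return true;
}

Inside the Monte Carlo loop you would then call add_if_finite(x, pay_o, i) instead of pay_o += x.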
You're overflowing what a double can hold. INF is short for infinity, which is the error value you get when you overflow a floating-point type.
A long double may or may not help depending on your compiler. In Microsoft C++ I believe long double and double are both 64 bits, so no luck there.
Check out the Boost.Multiprecision library; it has larger types, if you really need something that big and can't redo your math. I see you're multiplying a bunch and then dividing. Can you multiply some, then divide, then multiply some more, to keep the intermediate values smaller?

Calculate this factorial term in C++ with basic datatypes

I am solving a programming problem, and in the end the problem boils down to calculating following term:
n!/(n1!n2!n3!....nm!)
n<50000
(n1+n2+n3...nm)<n
I am given that the final answer will fit in 8 bytes. I am using C++. How should I calculate this? I am able to come up with some tricks, but nothing concrete and generalized.
EDIT:
I would not like to use external libraries.
EDIT 1:
Added conditions; the result will definitely fit in a 64-bit int.
If the result is guaranteed to be an integer, work with the factored representation.
By Legendre's formula, you can express all these factorials by the sequence of exponents of the primes in the range (2, n).
By deducting the exponents of the factorials in the denominator from those in the numerator, you will obtain exponents for the whole quotient. The computation will then reduce to a product of primes that will never overflow the 8 bytes.
For example,
25! = 2^22 · 3^10 · 5^6 · 7^3 · 11^2 · 13 · 17 · 19 · 23
15! = 2^11 · 3^6 · 5^3 · 7^2 · 11 · 13
10! = 2^8 · 3^4 · 5^2 · 7
yields
25!/(15! · 10!) = 2^3 · 5 · 11 · 17 · 19 · 23 = 3268760
The exponents of, say, 3 are found by (using integer division)
25/3 + 25/9 = 10
15/3 + 15/9 = 6
10/3 + 10/9 = 4
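Sketched in C++, the approach above might look like this (my own illustration; prime_exponent and multinomial are hypothetical names, and the assumption that the result fits in 64 bits comes from the question's edit — the partial products then never overflow because each one divides the final answer):

#include <cstdint>
#include <vector>

// Exponent of prime p in n!  (Legendre's formula: n/p + n/p^2 + ...)
long long prime_exponent(long long n, long long p)
{
    long long e = 0;
    while (n) { n /= p; e += n; }
    return e;
}

// n! / (n1! * n2! * ... * nm!), assuming the result fits in 64 bits.
long long multinomial(long long n, const std::vector<long long>& parts)
{
    std::vector<bool> is_prime(n + 1, true);   // simple sieve up to n
    long long result = 1;
    for (long long p = 2; p <= n; ++p) {
        if (!is_prime[p]) continue;
        for (long long q = p * p; q <= n; q += p) is_prime[q] = false;

        long long e = prime_exponent(n, p);               // exponent in numerator
        for (long long k : parts) e -= prime_exponent(k, p);  // minus denominators
        while (e-- > 0) result *= p;                      // divides the final answer
    }
    return result;
}

For the example above, multinomial(25, {15, 10}) gives 3268760.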
If all the input (not necessarily the output) is made of integers, you could try to count prime factors. You create an array of size n+1 (a prime factor of a number up to n can be as large as n itself) and fill it with the counts of each prime factor of n!:
vector<int> v = vector<int>(n+1, 0);   // a prime factor of m <= n can be as large as n
int m = 2;
while (m <= n) {
    int i = 2;
    int a = m;
    while (a > 1) {
        while (a % i == 0) {
            v[i]++;
            a /= i;
        }
        i++;
    }
    m++;
}
Then you iterate over the n_k (1 <= k <= m) and decrease the counts for each of their prime factors. This is pretty much the same code as above except that you replace the v[i]++ by v[i]--. Of course you need to call it with the vector v previously obtained.
After that, the vector v contains the count of each prime factor in your expression and you just need to reconstruct the result as
int result = 1;
for (int i = 2; i < (int)v.size(); i++) {
    for (int j = 0; j < v[i]; j++) {
        result *= i;
    }
}
return result;
Note: you should use long long int instead of int above, but I stick to int for simplicity.
Edit: As mentioned in another answer, it would be faster to use Legendre's formula to fill/unfill the vector v.
What you can do is to use the properties of the logarithm:
log(AB) = log(A) + log(B)
log(A/B) = log(A) - log(B)
and
X = e^(log(X))
So you can first compute the logarithm of your quantity, then exponentiate back:
log(N!/(n1!n2!...nk!)) = log(1) + ... + log(N) - [log(n1!) + ... + log(nk!)]
then expand log(n1!) etc., so you end up writing everything in terms of logarithms of single numbers. Then take the exponential of your result to obtain the initial value of the expression.
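In C++ this can be written compactly with std::lgamma, which returns log(Γ(x)) = log((x-1)!) directly (a sketch of my own; the accuracy caveat below applies here too):

#include <cmath>
#include <vector>

// log of n!/(n1! n2! ... nm!) via lgamma, then exponentiate back.
double multinomial_via_logs(long long n, const std::vector<long long>& parts)
{
    double log_result = std::lgamma(double(n) + 1.0);    // log(n!)
    for (long long k : parts)
        log_result -= std::lgamma(double(k) + 1.0);      // subtract log(k!)
    return std::exp(log_result);                         // only double-precision accurate
}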
As #T.C. mentioned, this method may not be to accurate, although in typical scenarios you'll have many terms reduced. Alternatively, you expand each factorial into a list that stores the terms in its product, e.g. 6! will be stored in a list {1,2,3,4,5,6}. You do the same for the denominator terms. Then you start removing common elements. Finally, you can take gcd's and reduce everything to coprime factors, then compute the result.

C++ program which calculates ln for a given variable x without using any ready functions

I've searched for an equation that calculates the ln of a number x and found this series:
ln(x) = (x-1) - (x-1)^2/2 + (x-1)^3/3 - (x-1)^4/4 + ...
and I've written this code to implement it:
double ln = x-1;
for (int i=2; i<=5; i++)
{
    double tmp = 1;
    for (int j=1; j<=i; j++)
        tmp *= (x-1);
    if (i%2==0)
        ln -= (tmp/i);
    else
        ln += (tmp/i);
}
cout << "ln: " << setprecision(10) << ln << endl;
cout << "ln: " << setprecision(10) << ln << endl ;
but unfortunately I'm getting outputs completely different from the output on my calculator, especially for large numbers. Can anyone tell me where the problem is?
The equation you link to is an infinite series, as implied by the ellipsis following the main part of the equation and as indicated more explicitly by the summation form given on the same page.
In your case, you are only computing the first few terms. Later terms add small refinements to the result to come closer to the actual value, but ultimately computing all the infinite steps would require infinite time.
However, what you can do is approximate your response to something like:
double ln(double x) {
    // validate 0 < x < 2
    double threshold = 1e-5;  // set this to whatever threshold you want

    double base = x - 1;      // Base of the numerator; exponent will be explicit
    int den = 1;              // Denominator of the nth term
    int sign = 1;             // Used to swap the sign of each term
    double term = base;       // First term
    double prev = 0;          // Previous sum
    double result = term;     // Kick it off

    while (fabs(prev - result) > threshold) {
        den++;
        sign *= -1;
        term *= base;
        prev = result;
        result += sign * term / den;
    }
    return result;
}
Caution: I haven't actually tested this so it may need some tweaking.
What this does is compute each term until the absolute difference between two consecutive terms is less than some threshold you establish.
Now this is not a particularly efficient way to do this. It's better to work with the functions the language you're using (in this case C++) provides to compute the natural log (which another poster has, I believe already shown to you). But there may be some value in trying this for yourself to see how it works.
Also, as barak manos notes below, this Taylor series only converges on the range (0, 2), so you will need to validate the value of x lies in that range before trying to run actual computation.
I believe the natural log in C++ is simply log.
It wouldn't hurt to use long and long double instead of int and double. This may get a little more accuracy on some larger values. Also, extending your series only 5 terms deep limits your accuracy.
Using a series like this is basically an approximation of the logarithmic answer.
This version should be somewhat faster:
#include <cstdint>
#include <cstring>

double const scale  = 1.5390959186233239e-16;
double const offset = -709.05401552996614;

double fast_ln(double x)
{
    uint64_t xbits;
    memcpy(&xbits, &x, 8);
    // if memcpy is not allowed, copy byte by byte:
    // for (int i = 0; i < 8; ++i) ((char*)&xbits)[i] = ((const char*)&x)[i];
    return xbits * scale + offset;
}
The trick is that this uses a 64-bit integer * 64-bit floating-point multiply, which involves a conversion of the integer to floating-point. Said floating-point representation is similar to scientific notation and requires a logarithm to find the appropriate exponent... but it is done purely in hardware and is very fast.
However it is doing a linear approximation within each octave, which is not very accurate. Using a lookup table for those bits would be far better.
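One way the lookup-table refinement could look (a sketch of my own, not derived from the constants above): keep the exponent trick, but index a small table of log2(1 + i/256) with the top mantissa bits instead of approximating the mantissa linearly. All names here are illustrative.

#include <cstdint>
#include <cstring>
#include <cmath>

static double log2_table[256];

void init_log2_table()
{
    for (int i = 0; i < 256; ++i)
        log2_table[i] = std::log2(1.0 + i / 256.0);   // precomputed once
}

double table_ln(double x)   // assumes x > 0, finite, normalised
{
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    int exponent = int((bits >> 52) & 0x7ff) - 1023;   // unbiased exponent
    int index = int((bits >> 44) & 0xff);              // top 8 mantissa bits
    return (exponent + log2_table[index]) * 0.6931471805599453;  // log2 -> ln
}

Call init_log2_table() once before using table_ln; the accuracy is then limited by the table resolution rather than by the linear approximation.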
That formula won't work for large inputs, because it would require you to take into consideration the highest-degree terms, which you can't, because there are infinitely many.
It will only work for small inputs, where only the first terms of your series are relevant.
You can find ways to do that here: http://en.wikipedia.org/wiki/Pollard%27s_rho_algorithm_for_logarithms
and here: http://www.netlib.org/cephes/qlibdoc.html#qlog
This should work. You just needed the part where, if x >= 2, you shrink x by half and add 0.6931. The reason for 0.6931 is that it is ln(2). If you wanted, you could add if (x >= 1024) return myLN(x/1024) + 6.9315, where 6.9315 is ln(1024). This will add speed for big values of x. The for loop limit of 100 could be much lower, like 20. I believe that to get an exact result for an integer it's 17.
double myLN(double x) {
    if (x >= 2) {
        return myLN(x/2.0) + 0.6931;
    }
    x = x - 1;
    double total = 0.0;
    double xToTheIPower = x;
    for (unsigned i = 1; i < 100; i++) {
        if (i % 2 == 1) {
            total += xToTheIPower / i;
        } else {
            total -= xToTheIPower / i;
        }
        xToTheIPower *= x;
    }
    return total;
}

Generate random values with fixed sum in C++ [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I saw tons of answers to this question on the web but, believe me, I still don't get the solution to this problem. I have an array of values of size "n". I also have a defined value "sum". What I want is to generate "n" random values in such a way that their sum is equal to "sum", preferably uniformly distributed; otherwise (for example) having the first random number equal to "sum" and the rest equal to zero is not that nice. I need two algorithms which accomplish this task: one with positive integers and one with positive floats. Thanks a lot in advance!
First generate n random values. Then sum them up: randomSum. Calculate the coefficient sum/randomSum. Then multiply all the random values by that coefficient.
Integers would pose a problem... Rounding too (probably).
You can generate n numbers with a normal distribution, then normalize them to your sum.
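For the floating-point case, the scaling idea from the two answers above might look like this (an illustrative sketch of my own; it uses uniform values for simplicity, and integer outputs would still need the rounding fix-up mentioned above):

#include <random>
#include <vector>
#include <iostream>

// Generate n positive doubles that add up (to within rounding) to target_sum.
std::vector<double> random_with_sum(std::size_t n, double target_sum)
{
    std::mt19937 gen{std::random_device{}()};
    std::uniform_real_distribution<double> dist(0.0, 1.0);

    std::vector<double> values(n);
    double raw_sum = 0.0;
    for (double& v : values) { v = dist(gen); raw_sum += v; }

    double scale = target_sum / raw_sum;      // the coefficient sum/randomSum
    for (double& v : values) v *= scale;
    return values;
}

int main()
{
    for (double v : random_with_sum(5, 100.0)) std::cout << v << ' ';
    std::cout << '\n';
}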
You can generate n values defined by this: ((Sum - sumOfGeneratedValues) / (n - numberOfGeneratedValues)) ± X (with X the maximal deviance).
Example:
SUM = 100, N = 5
± 10
Rand around (100 - 0) / (5 - 0) --> 20 ±10 (so between 10 and 30)
Value1 = 17;
Rand around (100 - 17) / (5 - 1) --> 21 ±10 (so between 11 and 31)
... etc
The deviance would make your random values uniform :)
You have a loop where the number of iterations is equal to the number of random numbers you want, minus 1. For the first iteration, you find a random number between 0 and the sum. You then subtract that random number from the sum, and on the next iteration you get another random number between 0 and the remaining sub-sum, and so on.
It's probably easier in pseudocode:
int sum = 10;
int n = 5;                          // 5 random numbers summed to equal sum
int subSum = sum;
int[] randomNumbers = new int[n];

for (int i = 0; i < n - 1; i++)
{
    randomNumbers[i] = rand(0, subSum);  // get random number between 0 and subSum
    subSum -= randomNumbers[i];
}
randomNumbers[n - 1] = subSum;           // leftovers go to the last random number
My C++ is very (very, very) rusty. So let's assume you already know how to get a random number between x and y with the function random(x,y). Then here is some pseudocode in some other C-derived language:
int n = ....;    // your input
int sum = ....;  // your input
int[] answer = new int[n];
int tmpsum = 0;

for (int i = 0; i < n; i++) {
    int exactpart = sum / n;
    int random = (exactpart / 2) + random(0, exactpart);
    answer[i] = (tmpsum + random > sum) ? sum - tmpsum : random;
    tmpsum += answer[i];
}
answer[n-1] += sum - tmpsum;

In which order should floats be added to get the most precise result?

This was a question I was asked at a recent interview and I want to know the answer (I don't actually remember the theory of numerical analysis, so please help me :)
If we have some function, which accumulates floating-point numbers:
std::accumulate(v.begin(), v.end(), 0.0);
v is a std::vector<float>, for example.
Would it be better to sort these numbers before accumulating them?
Which order would give the most precise answer?
I suspect that sorting the numbers in ascending order would actually make the numerical error less, but unfortunately I can't prove it myself.
P.S. I do realize this probably has nothing to do with real world programming, just being curious.
Your instinct is basically right, sorting in ascending order (of magnitude) usually improves things somewhat. Consider the case where we're adding single-precision (32 bit) floats, and there are 1 billion values equal to 1 / (1 billion), and one value equal to 1. If the 1 comes first, then the sum will come to 1, since 1 + (1 / 1 billion) is 1 due to loss of precision. Each addition has no effect at all on the total.
If the small values come first, they will at least sum to something, although even then I have 2^30 of them, whereas after 2^25 or so I'm back in the situation where each one individually isn't affecting the total any more. So I'm still going to need more tricks.
That's an extreme case, but in general adding two values of similar magnitude is more accurate than adding two values of very different magnitudes, since you "discard" fewer bits of precision in the smaller value that way. By sorting the numbers, you group values of similar magnitude together, and by adding them in ascending order you give the small values a "chance" of cumulatively reaching the magnitude of the bigger numbers.
Still, if negative numbers are involved it's easy to "outwit" this approach. Consider three values to sum, {1, -1, 1 billionth}. The arithmetically correct sum is 1 billionth, but if my first addition involves the tiny value then my final sum will be 0. Of the 6 possible orders, only 2 are "correct" - {1, -1, 1 billionth} and {-1, 1, 1 billionth}. All 6 orders give results that are accurate at the scale of the largest-magnitude value in the input (0.0000001% out), but for 4 of them the result is inaccurate at the scale of the true solution (100% out). The particular problem you're solving will tell you whether the former is good enough or not.
In fact, you can play a lot more tricks than just adding them in sorted order. If you have lots of very small values, a middle number of middling values, and a small number of large values, then it might be most accurate to first add up all the small ones, then separately total the middling ones, add those two totals together then add the large ones. It's not at all trivial to find the most accurate combination of floating-point additions, but to cope with really bad cases you can keep a whole array of running totals at different magnitudes, add each new value to the total that best matches its magnitude, and when a running total starts to get too big for its magnitude, add it into the next total up and start a new one. Taken to its logical extreme, this process is equivalent to performing the sum in an arbitrary-precision type (so you'd do that). But given the simplistic choice of adding in ascending or descending order of magnitude, ascending is the better bet.
It does have some relation to real-world programming, since there are some cases where your calculation can go very badly wrong if you accidentally chop off a "heavy" tail consisting of a large number of values each of which is too small to individually affect the sum, or if you throw away too much precision from a lot of small values that individually only affect the last few bits of the sum. In cases where the tail is negligible anyway you probably don't care. For example if you're only adding together a small number of values in the first place and you're only using a few significant figures of the sum.
There is also an algorithm designed for this kind of accumulation operation, called Kahan Summation, that you should probably be aware of.
According to Wikipedia,
The Kahan summation algorithm (also known as compensated summation) significantly reduces the numerical error in the total obtained by adding a sequence of finite precision floating point numbers, compared to the obvious approach. This is done by keeping a separate running compensation (a variable to accumulate small errors).
In pseudocode, the algorithm is:
function kahanSum(input)
    var sum = input[1]
    var c = 0.0             // A running compensation for lost low-order bits.
    for i = 2 to input.length
        y = input[i] - c    // So far, so good: c is zero.
        t = sum + y         // Alas, sum is big, y small, so low-order digits of y are lost.
        c = (t - sum) - y   // (t - sum) recovers the high-order part of y; subtracting y recovers -(low part of y)
        sum = t             // Algebraically, c should always be zero. Beware eagerly optimising compilers!
    next i                  // Next time around, the lost low part will be added to y in a fresh attempt.
    return sum
I tried out the extreme example in the answer supplied by Steve Jessop.
#include <iostream>
#include <iomanip>
#include <cmath>

int main()
{
    long billion = 1000000000;
    double big = 1.0;
    double small = 1e-9;
    double expected = 2.0;

    double sum = big;
    for (long i = 0; i < billion; ++i)
        sum += small;
    std::cout << std::scientific << std::setprecision(1) << big << " + " << billion << " * " << small << " = " <<
        std::fixed << std::setprecision(15) << sum <<
        " (difference = " << std::fabs(expected - sum) << ")" << std::endl;

    sum = 0;
    for (long i = 0; i < billion; ++i)
        sum += small;
    sum += big;
    std::cout << std::scientific << std::setprecision(1) << billion << " * " << small << " + " << big << " = " <<
        std::fixed << std::setprecision(15) << sum <<
        " (difference = " << std::fabs(expected - sum) << ")" << std::endl;

    return 0;
}
I got the following result:
1.0e+00 + 1000000000 * 1.0e-09 = 2.000000082740371 (difference = 0.000000082740371)
1000000000 * 1.0e-09 + 1.0e+00 = 1.999999992539933 (difference = 0.000000007460067)
The error in the first line is more than ten times bigger than in the second.
If I change the doubles to floats in the code above, I get:
1.0e+00 + 1000000000 * 1.0e-09 = 1.000000000000000 (difference = 1.000000000000000)
1000000000 * 1.0e-09 + 1.0e+00 = 1.031250000000000 (difference = 0.968750000000000)
Neither answer is even close to 2.0 (but the second is slightly closer).
Using the Kahan summation (with doubles) as described by Daniel Pryden:
#include <iostream>
#include <iomanip>
#include <cmath>

int main()
{
    long billion = 1000000000;
    double big = 1.0;
    double small = 1e-9;
    double expected = 2.0;

    double sum = big;
    double c = 0.0;
    for (long i = 0; i < billion; ++i) {
        double y = small - c;
        double t = sum + y;
        c = (t - sum) - y;
        sum = t;
    }

    std::cout << "Kahan sum = " << std::fixed << std::setprecision(15) << sum <<
        " (difference = " << std::fabs(expected - sum) << ")" << std::endl;

    return 0;
}
I get exactly 2.0:
Kahan sum = 2.000000000000000 (difference = 0.000000000000000)
And even if I change the doubles to floats in the code above, I get:
Kahan sum = 2.000000000000000 (difference = 0.000000000000000)
It would seem that Kahan is the way to go!
There is a class of algorithms that solve this exact problem, without the need to sort or otherwise re-order the data.
In other words, the summation can be done in one pass over the data. This also makes such algorithms applicable in situations where the dataset is not known in advance, e.g. if the data arrives in real time and the running sum needs to be maintained.
Here is the abstract of a recent paper:
We present a novel, online algorithm for exact summation of a stream
of floating-point numbers. By “online” we mean that the algorithm
needs to see only one input at a time, and can take an arbitrary
length input stream of such inputs while requiring only constant
memory. By “exact” we mean that the sum of the internal array of our
algorithm is exactly equal to the sum of all the inputs, and the
returned result is the correctly-rounded sum. The proof of correctness
is valid for all inputs (including nonnormalized numbers but modulo
intermediate overflow), and is independent of the number of summands
or the condition number of the sum. The algorithm asymptotically needs
only 5 FLOPs per summand, and due to instruction-level parallelism
runs only about 2--3 times slower than the obvious, fast-but-dumb
“ordinary recursive summation” loop when the number of summands is
greater than 10,000. Thus, to our knowledge, it is the fastest, most
accurate, and most memory efficient among known algorithms. Indeed, it
is difficult to see how a faster algorithm or one requiring
significantly fewer FLOPs could exist without hardware improvements.
An application for a large number of summands is provided.
Source: Algorithm 908: Online Exact Summation of Floating-Point Streams.
Building on Steve's answer of first sorting the numbers in ascending order, I'd introduce two more ideas:
Decide on the difference in exponent of two numbers above which you might decide that you would lose too much precision.
Then add the numbers up in order until the exponent of the accumulator is too large for the next number, then put the accumulator onto a temporary queue and start the accumulator with the next number. Continue until you exhaust the original list.
You repeat the process with the temporary queue (having sorted it) and with a possibly larger difference in exponent.
I think this will be quite slow if you have to calculate exponents all the time.
I had a quick go with a program and the result was 1.99903
I think you can do better than sorting the numbers before you accumulate them, because during the process of accumulation, the accumulator gets bigger and bigger. If you have a large amount of similar numbers, you will start to lose precision quickly. Here is what I would suggest instead:
while the list has multiple elements
    remove the two smallest elements from the list
    add them and put the result back in
the single element in the list is the result
Of course this algorithm will be most efficient with a priority queue instead of a list. C++ code:
template <typename Queue>
void reduce(Queue& queue)
{
    typedef typename Queue::value_type vt;
    while (queue.size() > 1)
    {
        vt x = queue.top();
        queue.pop();
        vt y = queue.top();
        queue.pop();
        queue.push(x + y);
    }
}
driver:
#include <iterator>
#include <queue>

template <typename Iterator>
typename std::iterator_traits<Iterator>::value_type
reduce(Iterator begin, Iterator end)
{
    typedef typename std::iterator_traits<Iterator>::value_type vt;

    std::priority_queue<vt> positive_queue;
    positive_queue.push(0);
    std::priority_queue<vt> negative_queue;
    negative_queue.push(0);

    for (; begin != end; ++begin)
    {
        vt x = *begin;
        if (x < 0)
        {
            negative_queue.push(x);
        }
        else
        {
            positive_queue.push(-x);
        }
    }

    reduce(positive_queue);
    reduce(negative_queue);

    return negative_queue.top() - positive_queue.top();
}
The numbers in the queue are negative because top yields the largest number, but we want the smallest. I could have provided more template arguments to the queue, but this approach seems simpler.
This doesn't quite answer your question, but a clever thing to do is to run the sum twice, once with rounding mode "round up" and once with "round down". Compare the two answers, and you know /how/ inaccurate your results are, and if you therefore need to use a cleverer summing strategy. Unfortunately, most languages don't make changing the floating point rounding mode as easy as it should be, because people don't know that it's actually useful in everyday calculations.
Take a look at Interval arithmetic where you do all maths like this, keeping highest and lowest values as you go. It leads to some interesting results and optimisations.
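On implementations that honour the floating-point environment, the two-rounding-modes comparison can be done with <cfenv> (a sketch of my own; whether the compiler actually respects the mode change depends on optimisation settings and on #pragma STDC FENV_ACCESS support):

#include <cfenv>
#include <numeric>
#include <vector>
#include <iostream>

// Sum the same data twice with opposite rounding modes; the true sum lies
// between the two results, so their gap bounds the rounding error of a
// plain left-to-right sum.
int main()
{
    std::vector<float> v(1000000, 1.0f / 3.0f);

    std::fesetround(FE_DOWNWARD);
    double low = std::accumulate(v.begin(), v.end(), 0.0f);

    std::fesetround(FE_UPWARD);
    double high = std::accumulate(v.begin(), v.end(), 0.0f);

    std::fesetround(FE_TONEAREST);   // restore the default mode
    std::cout << "low: " << low << "  high: " << high
              << "  spread: " << high - low << '\n';
}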
The simplest sort that improves accuracy is to sort by ascending absolute value. That lets the smallest-magnitude values have a chance to accumulate or cancel before interacting with the larger-magnitude values that would trigger a loss of precision.
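That simplest version might look like this (a sketch of my own):

#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// Sort by ascending magnitude, then accumulate; small values get a chance to
// combine before they meet the large ones.
double magnitude_sorted_sum(std::vector<float> v)   // taken by value: we reorder a copy
{
    std::sort(v.begin(), v.end(),
              [](float a, float b) { return std::fabs(a) < std::fabs(b); });
    return std::accumulate(v.begin(), v.end(), 0.0);   // accumulate in double
}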
That said, you can do better by tracking multiple non-overlapping partial sums. Here is a paper describing the technique and presenting a proof-of-accuracy: www-2.cs.cmu.edu/afs/cs/project/quake/public/papers/robust-arithmetic.ps
That algorithm and other approaches to exact floating point summation are implemented in simple Python at: http://code.activestate.com/recipes/393090/ At least two of those can be trivially converted to C++.
For IEEE 754 single or double precision or known format numbers, another alternative is to use an array of numbers (passed by caller, or in a class for C++) indexed by the exponent. When adding numbers into the array, only numbers with the same exponent are added (until an empty slot is found and the number stored). When a sum is called for, the array is summed from smallest to largest to minimize truncation. Single precision example:
/* clear array */
void clearsum(float asum[256])
{
    size_t i;
    for (i = 0; i < 256; i++)
        asum[i] = 0.f;
}

/* add a number into array */
void addtosum(float f, float asum[256])
{
    size_t i;
    while (1) {
        /* i = exponent of f */
        i = ((size_t)((*(unsigned int *)&f) >> 23)) & 0xff;
        if (i == 0xff) {        /* max exponent, could be overflow */
            asum[i] += f;
            return;
        }
        if (asum[i] == 0.f) {   /* if empty slot store f */
            asum[i] = f;
            return;
        }
        f += asum[i];           /* else add slot to f, clear slot */
        asum[i] = 0.f;          /* and continue until empty slot */
    }
}

/* return sum from array */
float returnsum(float asum[256])
{
    float sum = 0.f;
    size_t i;
    for (i = 0; i < 256; i++)
        sum += asum[i];
    return sum;
}
double precision example:
/* clear array */
void clearsum(double asum[2048])
{
    size_t i;
    for (i = 0; i < 2048; i++)
        asum[i] = 0.;
}

/* add a number into array */
void addtosum(double d, double asum[2048])
{
    size_t i;
    while (1) {
        /* i = exponent of d */
        i = ((size_t)((*(unsigned long long *)&d) >> 52)) & 0x7ff;
        if (i == 0x7ff) {      /* max exponent, could be overflow */
            asum[i] += d;
            return;
        }
        if (asum[i] == 0.) {   /* if empty slot store d */
            asum[i] = d;
            return;
        }
        d += asum[i];          /* else add slot to d, clear slot */
        asum[i] = 0.;          /* and continue until empty slot */
    }
}

/* return sum from array */
double returnsum(double asum[2048])
{
    double sum = 0.;
    size_t i;
    for (i = 0; i < 2048; i++)
        sum += asum[i];
    return sum;
}
Your floats should be added in double precision. That alone will give you more additional precision than any other technique here. For a bit more precision and significantly more speed, you can create, say, four sums, and add them up at the end.
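A sketch of that suggestion (my own illustration, with an assumed name four_lane_sum):

#include <cstddef>
#include <vector>

// Accumulate floats into four double partial sums (striding through the data),
// then combine; the four independent chains also help the CPU pipeline.
double four_lane_sum(const std::vector<float>& v)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < v.size(); ++i) s0 += v[i];   // leftover elements
    return (s0 + s1) + (s2 + s3);
}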
If you are adding double precision numbers, use long double for the sum - however, this will only have a positive effect in implementations where long double actually has more precision than double (typically x86, PowerPC depending on compiler settings).
Regarding sorting, it seems to me that if you expect cancellation then the numbers should be added in descending order of magnitude, not ascending. For instance:
((-1 + 1) + 1e-20) will give 1e-20
but
((1e-20 + 1) - 1) will give 0
In the first equation the two large numbers cancel out, whereas in the second the 1e-20 term gets lost when added to 1, since there is not enough precision to retain it.
Also, pairwise summation is pretty decent for summing lots of numbers.
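A minimal recursive version of pairwise summation, for reference (a sketch of my own):

#include <cstddef>

// Pairwise (cascade) summation: split the range in half, sum each half
// recursively, and add the two partial sums. The error growth is O(log n)
// rather than O(n) for the naive left-to-right loop.
double pairwise_sum(const float* data, std::size_t n)
{
    if (n <= 8) {                       // small base case: plain loop
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i) s += data[i];
        return s;
    }
    std::size_t half = n / 2;
    return pairwise_sum(data, half) + pairwise_sum(data + half, n - half);
}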