summing array of doubles with large value span : proper algorithm - c++

I have an algorithm where I need to sum (a lot of time) double numbers ranging in the e-40 to the e+40.
Array Example (randomly dumped from real application):
-2.06991e-05
7.58132e-06
-3.91367e-06
7.38921e-07
-5.33143e-09
-4.13195e-11
4.01724e-14
6.03221e-17
-4.4202e-20
6.58873
-1.22257
-0.0606178
0.00036508
2.67599e-07
0
-627.061
-59.048
5.92985
0.0885884
0.000276455
-2.02579e-07
It goes without saying the I am aware of the rounding effect this will cause, I am trying to keep it under control : the final result should not have any missing information in the fractional part of the double or, if not avoidable result should be at least n-digit accurate (with n defined). End result needs something like 5 digits plus exponent.
After some decent thinking, I ended up with following algorithm :
Sort the array so that the largest absolute value comes first, closest to zero last.
Add everything in a loop
The idea is that in this case, any cancellation of large values (negatives and positive) will not impact latter smaller values.
In short :
(10e40 - 10e40) + 1 = 1 : result is as expected
(1 + 10e-40) - 10e40 = 0 : not good
I ended up using std::multiset (benchmark on my PC gave 20% higher speed with long double compared to normal doubles - I am fine with doubles resolution) with a custom sort function using std:fabs.
It's still quite slow (it takes 5 seconds to do the whole thing) and I still have this feeling of "you missed something in your algo". Any recommandation :
for speed optimization. Is there a better way to sort the intermediate products ? Sorting a set of 40 intermediate results (typically) takes about 70% of the total execution time.
for missed issues. Is there a chance to still lose critical data (one that should have been in the fractional part of the final result) ?
On a bigger picture, I am implementing real coefficient polynomial classes of pure imaginary variable (electrical impedances : Z(jw)). Z is a big polynom representing a user defined system, with coefficient exponent ranging very far.
The "big" comes from adding things like Zc1 = 1/jC1w to Zc2 = 1/jC2w :
Zc1 + Zc2 = (C1C2(jw)^2 + 0(jw))/(C1+C2)(jw)
In this case, with C1 and C2 in nanofarad (10e-9), C1C2 is already in 10e-18 (and it only started...)
my sort function use a manhattan distance of complex variables (because, mine are either pure real or pure imaginary) :
struct manhattan_complex_distance
{
bool operator() (std::complex<long double> a, std::complex<long double> b)
{
return std::fabs(std::real(a) + std::imag(a)) > std::fabs(std::real(b) + std::imag(b));
}
};
and my multi set in action :
std:complex<long double> get_value(std::vector<std::complex<long double>>& frequency_vector)
{
//frequency_vector is precalculated once for all to have at index n the value (jw)^n.
std::multiset<std::complex<long double>, manhattan_distance> temp_list;
for (int i=0; i<m_coeficients.size(); ++i)
{
// element of : ℝ * ℂ
temp_list.insert(m_coeficients[i] * frequency_vector[i]);
}
std::complex<long double> ret=0;
for (auto i:temp_list)
{
// it is VERY important to start adding the big values before adding the small ones.
// in informatics, 10^60 - 10^60 + 1 = 1; while 1 + 10^60 - 10^60 = 0. Of course you'd expected to get 1, not 0.
ret += i;
}
return ret;
}
The project I have is c++11 enabled (mainly for improvement of the math lib and complex number tools)
ps : I refactored the code to make is easy to read, in reality all complexes and long double names are template : I can change the polynomial type in no time or use the class for regular polynomial of ℝ

As GuyGreer suggested, you can use Kahan summation:
double sum = 0.0;
double c = 0.0;
for (double value : values) {
double y = value - c;
double t = sum + y;
c = (t - sum) - y;
sum = t;
}
EDIT: You should also consider using Horner's method to evaluate the polynomial.
double value = coeffs[degree];
for (auto i = degree; i-- > 0;) {
value *= x;
value += coeffs[i];
}

Sorting the data is on the right track. But you definitely should be summing from smallest magnitude to largest, not from largest to smallest. Summing from largest to smallest, by the time you get to the smallest, aligning the next value with the current sum is liable to cause most or all of the bits of the next value to 'fall off the end'. Summing instead from smallest to largest, the smallest values get a chance to accumulate a decent-sized sum, for which more bits will get into the largest. Combined with Kahan summation, that should yield a fairly accurate sum.

First: have your math keep track of error. Replace your doubles with error-aware types, and when you add or multiply together two doubles it also calculates the maximium error.
This is about the only way you can guarantee that your code produces accurate results while being reasonably fast.
Second, don't use a multiset. The associative containers are not for sorting, they are for maintaining a sorted collection, while being able to incrementally add or remove elements from it efficiently.
The ability to add/remove elements incrementally means it is node-based, and node-based means it is slow in general.
If you simply want a sorted collection, start with a vector then std::sort it.
Next, to minimize error, keep a list of positive and negative elements. Start with zero as your sum. Now pick the smallest of either the positive or negative elements such that the total of your sum and that element is closest to zero.
Do so with elements that calculate their error bounds.
At the end, determine if you have 5 digits of precision, or not.
These error-propogating doubles should be ideally used as early on in the algorithm as possible.

Related

Efficient way of ensuring newness of a set

Given set N = {1,...,n}, consider P different pre-existing subsets of N. A subset, S_p, is characterized by the 0-1 n vector x_p where the ith element is 0 or 1 depending on whether the ith (of n) items is part of the subset or not. Let us call such x_ps indicator vectors.
For e.g., if N={1,2,3,4,5}, subset {1,2,5} is represented by vector (1,0,0,1,1).
Now, given P pre-existing subsets and their associated vectors x_ps.
A candidate subset denoted by vector yis computed.
What is the most efficient way of checking whether y is already part of the set of P pre-existing subsets or whether y is indeed a new subset not part of the P subsets?
The following are the methods I can think of:
(Method 1) Basically, we have to do an element by element check against all pre-existing sets. Pseudocode follows:
for(int p = 0; p < P; p++){
//(check if x_p == y by doing an element by element comparison)
int i;
for(i = 0; i < n; i++){
if(x_pi != y_i){
i = 999999;
}
}
if(i < 999999)
return that y is pre-existing
}
return that y is new
(Method 2) Another thought that comes to mind is to store the decimal equivalent of the indicator vectors x_ps (where the indicator vectors are taken to be binary representations) and compare it with the decimal equivalent of y. That is, if set of P pre-existing sets is: { (0,1,0,0,1), (1,0,1,1,0) }, the stored decimals for this set would be {9, 22}. If y is (0,1,1,0,0), we compute 12 and check this against the set {9, 22}. The benefit of this method is that for each new y, we don't have to check against the n elements of every pre-existing set. We can just compare the decimal numbers.
Question 1. It appears to me that (Method 2) should be more efficient than (Method 1). For (Method 2), is there an efficient way (inbuilt library function in C/C++) that converts the x_ps and y from binary to decimal? What should be data type of these indicator variables? For e.g., bool y[5]; or char y[5];?
Question 2. Is there any method more efficient than (Method 2)?
As you've noticed, there's a trivial isomorphism between your indicator vectors and N-bit integers. That means the answer to your question 2 is "no": the tools available for maintain a set and testing membership in it are the same as integers (hash tables bring the normal approach). A commented mentioned Bloom fillers, which can efficiently test membership at the risk of some false positives, but Bloom filters are generally for much larger data sizes than you're looking at.
As for your question 1: Method 2 is reasonable, and it's even easier than you think. While vector<bool> doesn't give you an easy way to turn it into integer blocks, on implementations I'm aware of it's already implemented this way (the C++ standard allows special treatment of that particular vector type, something that is generally considered nowadays to have been a poor decision, but which occasionally yields some benefit). And those vectors are hashable. So just keep an unordered_set<vector<bool>> around, and you'll get performance which is reasonably close to the optimum. (If you know N at compile time you may want to prefer bitset to vector<bool>.)
Method 2 can be optimized by calculating the decimal equivalent of the given subset and hashing it using modulus 1e9+7. This results in different decimal numbers every time since N<=1000(No collision occurs).
#define M 1000000007 //big prime number
unordered_set<long long> subset; //containing decimal representation of all the
//previous found subsets
/*fast computation of power of 2*/
long long Pow(long long num,long long pow){
long long result=1;
while(pow)
{
if(pow&1)
{
result*=num;
result%=M;
}
num*=num;
num%=M;
pow>>=1;
}
return result;
}
/*checks if subset pre exists*/
bool check(vector<bool> booleanVector){
long long result=0;
for(int i=0;i<booleanVector.size();i++)
if(booleanVector[i])
result+=Pow(2,i);
return (subset.find(result)==subset.end());
}

Evaluating multiplication with exponential function

I'm trying to come up with a good way to evaluate the following function
double foo(std::vector<double> const& x, double c = 0.95)
{
auto N = x.size(); // Small power of 2 such as 512 or 1024
double sum = 0;
for (auto i = 0; i != N; ++i) {
sum += (x[i] * pow(c, double(i)/N));
}
return sum;
}
My two main concerns with this naive implementation are performance and accuracy. So I suspect that the most trivial improvement would be to reverse the loop order: for (auto i = N-1; i != -1; --i) (The -1 wraps around, this is OK). This improves accuracy by adding smaller terms first.
While this is good for accuracy, it keeps the performance problem of pow. Numerically, pow(c, double(i)/N) is pow(c, (i-1)/N) * pow(c, 1/N). And the latter is a constant. So in theory we can replace pow with repeated multiplication. While good for performance, this hurts accuracy - errors will accumulate.
I suspect that there's a significantly better algorithm hiding in here. For instance, the fact that N is a power of two means that there is a middle term x[N/2] that's multiplied with sqrt(c). That hints at a recursive solution.
On a somewhat related numerical observation, this looks like a signal multiplication with an exponential, so I naturally think : "FFT, trivial convolution=shift, IFFT", but that seems to offer no real benefit in terms of accuracy or performance.
So, is this a well-known problem with known solutions?
The task is a polynomial evaluation. The method for a single evaluation with the least operation count is the Horner scheme. In general a low operation count will reduce the accumulation of floating point noise.
As the example value c=0.95 is close to 1, any root will be still closer to 1 and thus lose accuracy. Avoid that by computing the difference to 1 directly, z=1-c^(1/n), via
z = -expm1(log(c)/N).
Now you have to evaluate the polynomial
sum of x[i] * (1-z)^i
which can be done by careful modification of the Horner scheme. Instead of
for(i=N; i-->0; ) {
res = res*(1-z)+x[i]
}
use
for(i=N; i-->0; ) {
res = (res+x[i])-res*z
}
which is mathematically equivalent but has the loss of digits in 1-z happening as late as possible without using more involved method like doubly accurate addition.
In tests those two methods contrary to the intent gave almost the same results, a substantial improvement could be observed by separating the result into its value at c=1, z=0 and a multiple of z as in
double res0 = 0, resz=0;
int i;
for(i=N; i-->0; ) {
/* res0+z*resz = (res0+z*resz)*(1-z)+x[i]; */
resz = resz - res0 -z*resz;
res0 = res0 + x[i];
}
The test case that showed this improvement was for the coefficient sequence of
f(u) = (1-u/N)^(N-2)*(1-u)
where for N=1000 the evaluations result in
c z=1-c^(1/N) f(1-z) diff for 1st proc diff for 3rd proc
0.950000 0.000051291978909 0.000018898570629 1.33289104579937e-17 4.43845264361253e-19
0.951000 0.000050239954368 0.000018510931892 1.23765066121009e-16 -9.24959978401696e-19
0.952000 0.000049189034371 0.000018123700958 1.67678642238461e-17 -5.38712954453735e-19
0.953000 0.000048139216599 0.000017736876972 -2.86635949350855e-17 -2.37169225231204e-19
...
0.994000 0.000006018054217 0.000002217256601 1.31645860662263e-17 1.15619997300212e-19
0.995000 0.000005012529261 0.000001846785028 -4.15668713370839e-17 -3.5363625547867e-20
0.996000 0.000004008013365 0.000001476685973 8.48811716443534e-17 8.470329472543e-22
0.997000 0.000003004504507 0.000001106958687 1.44711343873661e-17 -2.92226366802734e-20
0.998000 0.000002002000667 0.000000737602425 5.6734266807093e-18 -6.56450534122083e-21
0.999000 0.000001000499833 0.000000368616443 -3.72557383333555e-17 1.47701370177469e-20
Yves' answer inspired me.
It seems that the best approach is to not calculate pow(c, 1.0/N) directly, but indirectly:
cc[0]=c; cc[1]=sqrt(cc[0]), cc[2]=sqrt(cc[1]),... cc[logN] = sqrt(cc[logN-1])
Or in binary,
cc[0]=c, cc[1]=c^0.1, cc[2]=c^0.01, cc[3]=c^0.001, ....
Now if we need x[0b100100] * c^0.100100, we can calculate that as x[0b100100]* c^0.1 * c^0.0001. I don't need to precalculate a table of size N, as geza suggested. A table of size log(N) is probably sufficient, and it can be created by repeatedly taking square roots.
[edit]
As pointed out in a comment thread on another answer, pairwise summation is very effective in keeping errors under control. And it happens to combine extremely nicely with this answer.
We start by observing that we sum
x[0] * c^0.0000000
x[1] * c^0.0000001
x[2] * c^0.0000010
x[3] * c^0.0000011
...
So, we run log(N) iterations. In iteration 1, we add the N/2 pairs x[i]+x[i+1]*c^0.000001 and store the result in x[i/2]. In iteration 2, we add the pairs x[i]+x[i+1]*c^0.000010, etcetera. The chief difference with normal pairwise summation is that this is a multiply-and-add in each step.
We see now that in each iteration, we're using the same multiplier pow(c, 2^i/N), which means we only need to calculate log(N) multipliers. It's also quite cache-efficient, as we're doing only contiguous memory access. It also allows for easy SIMD parallelization, especially when you have FMA instructions.
If N is a power of 2, you can replace the evaluations of the powers by geometric means, using
a^(i+j)/2 = √(a^i.a^j)
and recursively subdivide from c^N/N.c^0/N. With preorder recursion, you can make sure to accumulate by increasing weights.
Anyway, the speedup of sqrt vs. pow might be marginal.
You can also stop recursion at a certain level and continue linearly, with mere products.
You could mix repeated multiplication by pow(c, 1./N) with some explicit pow calls. I.e. every 16th iteration or so do a real pow and otherwise move forward with the multiply. This should yield large performance benefits at negligible accuracy cost.
Depending on how much c varies, you might even be able to precompute and replace all pow calls with a lookup, or just the ones needed in the above method (= smaller lookup table = better caching).

Pick a matrix cell according to its probability

I have a 2D matrix of positive real values, stored as follow:
vector<vector<double>> matrix;
Each cell can have a value equal or greater to 0, and this value represents the possibility of the cell to be chosen. In particular, for example, a cell with a value equals to 3 has three times the probability to be chosen compared to a cell with value 1.
I need to select N cells of the matrix (0 <= N <= total number of cells) randomly, but according to their probability to be selected.
How can I do that?
The algorithm should be as fast as possible.
I describe two methods, A and B.
A works in time approximately N * number of cells, and uses space O(log number of cells). It is good when N is small.
B works in time approximately (number of cells + N) * O(log number of cells), and uses space O(number of cells). So, it is good when N is large (or even, 'medium') but uses a lot more memory, in practice it might be slower in some regimes for that reason.
Method A:
The first thing you need to do is normalize the entries. (It's not clear to me if you assume they are normalized or not.) That means, sum all the entries and divide by the sum. (This part is potentially slow, so it's better if you assume or require that it already happened.)
Then you sample like this:
Choose a random [i,j] entry of the matrix (by choosing i,j each uniformly randomly from the range of integers 0 to n-1).
Choose a uniformly random real number p in the range [0, 1].
Check if matrix[i][j] > p. If so, return the pair [i][j]. If not, go back to step 1.
Why does this work? The probability that we end at step 3 with any particular output, is equal to, the probability that [i][j] was selected (this is the same for each entry), times the probality that the number p was small enough. This is proportional to the value matrix[i][j], so the sampling is choosing each entry with the correct proportions. It's also possible that at step 3 we go back to the start -- does that bias things? Basically, no. The reason is, suppose we arbitrarily choose a number k and then consider the distribution of the algorithm, conditioned on stopping exactly after k rounds. Conditioned on the assumption that we stop at the k'th round, no matter what value k we choose, the distribution we sample has to be exactly right by the above argument. Since if we eliminate the case that p is too small, the other possibilities all have their proportions correct. Since the distribution is perfect for each value of k that we might condition on, and the overall distribution (not conditioned on k) is an average of the distributions for each value of k, the overall distribution is perfect also.
If you want to analyze the number of rounds that typically needed in a rigorous way, you can do it by analyzing the probability that we actually stop at step 3 for any particular round. Since the rounds are independent, this is the same for every round, and statistically, it means that the running time of the algorithm is poisson distributed. That means it is tightly concentrated around its mean, and we can determine the mean by knowing that probability.
The probability that we stop at step 3 can be determined by considering the conditional probability that we stop at step 3, given that we chose any particular entry [i][j]. By the formulas for conditional expectation, you get that
Pr[ stop at step 3 ] = sum_{i,j} ( 1/(n^2) * Matrix[i,j] )
Since we assumed the matrix is normalized, this sum reduces to just 1/n^2. So, the expected number of rounds is about n^2 (that is, n^2 up to a constant factor) no matter what the entries in the matrix are. You can't hope to do a lot better than that I think -- that's about the same amount of time it takes to just read all the entries of the matrix, and it's hard to sample from a distribution that you cannot even read all of.
Note: What I described is a way to correctly sample a single element -- to get N elements from one matrix, you can just repeat it N times.
Method B:
Basically you just want to compute a histogram and sample inversely from it, so that you know you get exactly the right distribution. Computing the histogram is expensive, but once you have it, getting samples is cheap and easy.
In C++ it might look like this:
// Make histogram
typedef unsigned int uint;
typedef std::pair<uint, uint> upair;
typedef std::map<double, upair> histogram_type;
histogram_type histogram;
double cumulative = 0.0f;
for (uint i = 0; i < Matrix.size(); ++i) {
for (uint j = 0; j < Matrix[i].size(); ++j) {
cumulative += Matrix[i][j];
histogram[cumulative] = std::make_pair(i,j);
}
}
std::vector<upair> result;
for (uint k = 0; k < N; ++k) {
// Do a sample (this should never repeat... if it does not find a lower bound you could also assert false quite reasonably since it means something is wrong with rand() implementation)
while(1) {
double p = cumulative * rand(); // Or, for best results use std::mt19937 or boost::mt19937 and sample a real in the range [0,1] here.
histogram_type::iterator it = histogram::lower_bound(p);
if (it != histogram.end()) {
result.push_back(it->second);
break;
}
}
}
return result;
Here the time to make the histogram is something like number of cells * O(log number of cells) since inserting into the map takes time O(log n). You need an ordered data structure in order to get cheap lookup N * O(log number of cells) later when you do repeated sampling. Possibly you could choose a more specialized data structure to go faster, but I think there's only limited room for improvement.
Edit: As #Bob__ points out in comments, in method (B) a written there is potentially going to be some error due to floating point round-off if the matrices are quite large, even using type double, at this line:
cumulative += Matrix[i][j];
The problem is that, if cumulative is much larger than Matrix[i][j] beyond what the floating point precision can handle then these each time this statement is executed you may observe significant errors which accumulate to introduce significant inaccuracy.
As he suggests, if that happens, the most straightforward way to fix it is to sort the values Matrix[i][j] first. You could even do this in the general implementation to be safe -- sorting these guys isn't going to take more time asymptotically than you already have anyways.

m smallest values of vector with size n (c++11)

I need the average of the nClose smallest values (except the first zero) in a vector with n elements where we know that nClose + 1 < n, there are only non-negative numbers, and the vector contains at least one zero value. Furthermore, nClose will be a lot smaller than n, say that nClose will be around 10 and n will be around 500.
Normally I will use min_element to find the minimum, however this is useless here since I need several values. At the moment I use the following code
sort(diff.begin(), diff.end());
double sum = accumulate(diff.begin() + 1, diff.begin() + 1 + nClose, 0);
double avg = sum / nClose;
Due to the sort it runs in O(n log n) where we can do it in O(nClose*n) by just find the minimum and remove it, then repeat this for nClose times. Know one of you how to accomplish this with the algorithms of c++11?
You can use std::nth_element for that.
nth_element(diff.begin(),diff.begin()+nClose+1, diff.end());
double sum = accumulate(diff.begin(), diff.begin() + 1 + nClose, 0);
double avg = sum / nClose;
Regarding your remark about finding the minimum and removing it: This would probably be even less efficient than your current solution, as removing the nth element requires all elements after the nth position to be moved one position to the left, effectively turning your algorithm into something like O(nClose*n^2).
Also, while this should be a pretty efficient solution, I'd warn you against putting too much weight on algorithmic complexity, as the constants may actually play a much bigger role than any advantage in Big O notation.

In which order should floats be added to get the most precise result?

This was a question I was asked at my recent interview and I want to know (I don't actually remember the theory of the numerical analysis, so please help me :)
If we have some function, which accumulates floating-point numbers:
std::accumulate(v.begin(), v.end(), 0.0);
v is a std::vector<float>, for example.
Would it be better to sort these numbers before accumulating them?
Which order would give the most precise answer?
I suspect that sorting the numbers in ascending order would actually make the numerical error less, but unfortunately I can't prove it myself.
P.S. I do realize this probably has nothing to do with real world programming, just being curious.
Your instinct is basically right, sorting in ascending order (of magnitude) usually improves things somewhat. Consider the case where we're adding single-precision (32 bit) floats, and there are 1 billion values equal to 1 / (1 billion), and one value equal to 1. If the 1 comes first, then the sum will come to 1, since 1 + (1 / 1 billion) is 1 due to loss of precision. Each addition has no effect at all on the total.
If the small values come first, they will at least sum to something, although even then I have 2^30 of them, whereas after 2^25 or so I'm back in the situation where each one individually isn't affecting the total any more. So I'm still going to need more tricks.
That's an extreme case, but in general adding two values of similar magnitude is more accurate than adding two values of very different magnitudes, since you "discard" fewer bits of precision in the smaller value that way. By sorting the numbers, you group values of similar magnitude together, and by adding them in ascending order you give the small values a "chance" of cumulatively reaching the magnitude of the bigger numbers.
Still, if negative numbers are involved it's easy to "outwit" this approach. Consider three values to sum, {1, -1, 1 billionth}. The arithmetically correct sum is 1 billionth, but if my first addition involves the tiny value then my final sum will be 0. Of the 6 possible orders, only 2 are "correct" - {1, -1, 1 billionth} and {-1, 1, 1 billionth}. All 6 orders give results that are accurate at the scale of the largest-magnitude value in the input (0.0000001% out), but for 4 of them the result is inaccurate at the scale of the true solution (100% out). The particular problem you're solving will tell you whether the former is good enough or not.
In fact, you can play a lot more tricks than just adding them in sorted order. If you have lots of very small values, a middle number of middling values, and a small number of large values, then it might be most accurate to first add up all the small ones, then separately total the middling ones, add those two totals together then add the large ones. It's not at all trivial to find the most accurate combination of floating-point additions, but to cope with really bad cases you can keep a whole array of running totals at different magnitudes, add each new value to the total that best matches its magnitude, and when a running total starts to get too big for its magnitude, add it into the next total up and start a new one. Taken to its logical extreme, this process is equivalent to performing the sum in an arbitrary-precision type (so you'd do that). But given the simplistic choice of adding in ascending or descending order of magnitude, ascending is the better bet.
It does have some relation to real-world programming, since there are some cases where your calculation can go very badly wrong if you accidentally chop off a "heavy" tail consisting of a large number of values each of which is too small to individually affect the sum, or if you throw away too much precision from a lot of small values that individually only affect the last few bits of the sum. In cases where the tail is negligible anyway you probably don't care. For example if you're only adding together a small number of values in the first place and you're only using a few significant figures of the sum.
There is also an algorithm designed for this kind of accumulation operation, called Kahan Summation, that you should probably be aware of.
According to Wikipedia,
The Kahan summation algorithm (also known as compensated summation) significantly reduces the numerical error in the total obtained by adding a sequence of finite precision floating point numbers, compared to the obvious approach. This is done by keeping a separate running compensation (a variable to accumulate small errors).
In pseudocode, the algorithm is:
function kahanSum(input)
var sum = input[1]
var c = 0.0 //A running compensation for lost low-order bits.
for i = 2 to input.length
y = input[i] - c //So far, so good: c is zero.
t = sum + y //Alas, sum is big, y small, so low-order digits of y are lost.
c = (t - sum) - y //(t - sum) recovers the high-order part of y; subtracting y recovers -(low part of y)
sum = t //Algebraically, c should always be zero. Beware eagerly optimising compilers!
next i //Next time around, the lost low part will be added to y in a fresh attempt.
return sum
I tried out the extreme example in the answer supplied by Steve Jessop.
#include <iostream>
#include <iomanip>
#include <cmath>
int main()
{
long billion = 1000000000;
double big = 1.0;
double small = 1e-9;
double expected = 2.0;
double sum = big;
for (long i = 0; i < billion; ++i)
sum += small;
std::cout << std::scientific << std::setprecision(1) << big << " + " << billion << " * " << small << " = " <<
std::fixed << std::setprecision(15) << sum <<
" (difference = " << std::fabs(expected - sum) << ")" << std::endl;
sum = 0;
for (long i = 0; i < billion; ++i)
sum += small;
sum += big;
std::cout << std::scientific << std::setprecision(1) << billion << " * " << small << " + " << big << " = " <<
std::fixed << std::setprecision(15) << sum <<
" (difference = " << std::fabs(expected - sum) << ")" << std::endl;
return 0;
}
I got the following result:
1.0e+00 + 1000000000 * 1.0e-09 = 2.000000082740371 (difference = 0.000000082740371)
1000000000 * 1.0e-09 + 1.0e+00 = 1.999999992539933 (difference = 0.000000007460067)
The error in the first line is more than ten times bigger in the second.
If I change the doubles to floats in the code above, I get:
1.0e+00 + 1000000000 * 1.0e-09 = 1.000000000000000 (difference = 1.000000000000000)
1000000000 * 1.0e-09 + 1.0e+00 = 1.031250000000000 (difference = 0.968750000000000)
Neither answer is even close to 2.0 (but the second is slightly closer).
Using the Kahan summation (with doubles) as described by Daniel Pryden:
#include <iostream>
#include <iomanip>
#include <cmath>
int main()
{
long billion = 1000000000;
double big = 1.0;
double small = 1e-9;
double expected = 2.0;
double sum = big;
double c = 0.0;
for (long i = 0; i < billion; ++i) {
double y = small - c;
double t = sum + y;
c = (t - sum) - y;
sum = t;
}
std::cout << "Kahan sum = " << std::fixed << std::setprecision(15) << sum <<
" (difference = " << std::fabs(expected - sum) << ")" << std::endl;
return 0;
}
I get exactly 2.0:
Kahan sum = 2.000000000000000 (difference = 0.000000000000000)
And even if I change the doubles to floats in the code above, I get:
Kahan sum = 2.000000000000000 (difference = 0.000000000000000)
It would seem that Kahan is the way to go!
There is a class of algorithms that solve this exact problem, without the need to sort or otherwise re-order the data.
In other words, the summation can be done in one pass over the data. This also makes such algorithms applicable in situations where the dataset is not known in advance, e.g. if the data arrives in real time and the running sum needs to be maintained.
Here is the abstract of a recent paper:
We present a novel, online algorithm for exact summation of a stream
of floating-point numbers. By “online” we mean that the algorithm
needs to see only one input at a time, and can take an arbitrary
length input stream of such inputs while requiring only constant
memory. By “exact” we mean that the sum of the internal array of our
algorithm is exactly equal to the sum of all the inputs, and the
returned result is the correctly-rounded sum. The proof of correctness
is valid for all inputs (including nonnormalized numbers but modulo
intermediate overflow), and is independent of the number of summands
or the condition number of the sum. The algorithm asymptotically needs
only 5 FLOPs per summand, and due to instruction-level parallelism
runs only about 2--3 times slower than the obvious, fast-but-dumb
“ordinary recursive summation” loop when the number of summands is
greater than 10,000. Thus, to our knowledge, it is the fastest, most
accurate, and most memory efficient among known algorithms. Indeed, it
is difficult to see how a faster algorithm or one requiring
significantly fewer FLOPs could exist without hardware improvements.
An application for a large number of summands is provided.
Source: Algorithm 908: Online Exact Summation of Floating-Point Streams.
Building on Steve's answer of first sorting the numbers in ascending order, I'd introduce two more ideas:
Decide on the difference in exponent of two numbers above which you might decide that you would lose too much precision.
Then add the numbers up in order until the exponent of the accumulator is too large for the next number, then put the accumulator onto a temporary queue and start the accumulator with the next number. Continue until you exhaust the original list.
You repeat the process with the temporary queue (having sorted it) and with a possibly larger difference in exponent.
I think this will be quite slow if you have to calculate exponents all the time.
I had a quick go with a program and the result was 1.99903
I think you can do better than sorting the numbers before you accumulate them, because during the process of accumulation, the accumulator gets bigger and bigger. If you have a large amount of similar numbers, you will start to lose precision quickly. Here is what I would suggest instead:
while the list has multiple elements
remove the two smallest elements from the list
add them and put the result back in
the single element in the list is the result
Of course this algorithm will be most efficient with a priority queue instead of a list. C++ code:
template <typename Queue>
void reduce(Queue& queue)
{
typedef typename Queue::value_type vt;
while (queue.size() > 1)
{
vt x = queue.top();
queue.pop();
vt y = queue.top();
queue.pop();
queue.push(x + y);
}
}
driver:
#include <iterator>
#include <queue>
template <typename Iterator>
typename std::iterator_traits<Iterator>::value_type
reduce(Iterator begin, Iterator end)
{
typedef typename std::iterator_traits<Iterator>::value_type vt;
std::priority_queue<vt> positive_queue;
positive_queue.push(0);
std::priority_queue<vt> negative_queue;
negative_queue.push(0);
for (; begin != end; ++begin)
{
vt x = *begin;
if (x < 0)
{
negative_queue.push(x);
}
else
{
positive_queue.push(-x);
}
}
reduce(positive_queue);
reduce(negative_queue);
return negative_queue.top() - positive_queue.top();
}
The numbers in the queue are negative because top yields the largest number, but we want the smallest. I could have provided more template arguments to the queue, but this approach seems simpler.
This doesn't quite answer your question, but a clever thing to do is to run the sum twice, once with rounding mode "round up" and once with "round down". Compare the two answers, and you know /how/ inaccurate your results are, and if you therefore need to use a cleverer summing strategy. Unfortunately, most languages don't make changing the floating point rounding mode as easy as it should be, because people don't know that it's actually useful in everyday calculations.
Take a look at Interval arithmetic where you do all maths like this, keeping highest and lowest values as you go. It leads to some interesting results and optimisations.
The simplest sort that improves accuracy is to sort by the ascending absolute value. That lets the smallest magnitude values have a chance to accumulate or cancel before interacting with larger magnitude values that have would trigger a loss of precision.
That said, you can do better by tracking multiple non-overlapping partial sums. Here is a paper describing the technique and presenting a proof-of-accuracy: www-2.cs.cmu.edu/afs/cs/project/quake/public/papers/robust-arithmetic.ps
That algorithm and other approaches to exact floating point summation are implemented in simple Python at: http://code.activestate.com/recipes/393090/ At least two of those can be trivially converted to C++.
For IEEE 754 single or double precision or known format numbers, another alternative is to use an array of numbers (passed by caller, or in a class for C++) indexed by the exponent. When adding numbers into the array, only numbers with the same exponent are added (until an empty slot is found and the number stored). When a sum is called for, the array is summed from smallest to largest to minimize truncation. Single precision example:
/* clear array */
void clearsum(float asum[256])
{
size_t i;
for(i = 0; i < 256; i++)
asum[i] = 0.f;
}
/* add a number into array */
void addtosum(float f, float asum[256])
{
size_t i;
while(1){
/* i = exponent of f */
i = ((size_t)((*(unsigned int *)&f)>>23))&0xff;
if(i == 0xff){ /* max exponent, could be overflow */
asum[i] += f;
return;
}
if(asum[i] == 0.f){ /* if empty slot store f */
asum[i] = f;
return;
}
f += asum[i]; /* else add slot to f, clear slot */
asum[i] = 0.f; /* and continue until empty slot */
}
}
/* return sum from array */
float returnsum(float asum[256])
{
float sum = 0.f;
size_t i;
for(i = 0; i < 256; i++)
sum += asum[i];
return sum;
}
double precision example:
/* clear array */
void clearsum(double asum[2048])
{
size_t i;
for(i = 0; i < 2048; i++)
asum[i] = 0.;
}
/* add a number into array */
void addtosum(double d, double asum[2048])
{
size_t i;
while(1){
/* i = exponent of d */
i = ((size_t)((*(unsigned long long *)&d)>>52))&0x7ff;
if(i == 0x7ff){ /* max exponent, could be overflow */
asum[i] += d;
return;
}
if(asum[i] == 0.){ /* if empty slot store d */
asum[i] = d;
return;
}
d += asum[i]; /* else add slot to d, clear slot */
asum[i] = 0.; /* and continue until empty slot */
}
}
/* return sum from array */
double returnsum(double asum[2048])
{
double sum = 0.;
size_t i;
for(i = 0; i < 2048; i++)
sum += asum[i];
return sum;
}
Your floats should be added in double precision. That will give you more additional precision than any other technique can. For a bit more precision and significantly more speed, you can create say four sums, and add them up at the end.
If you are adding double precision numbers, use long double for the sum - however, this will only have a positive effect in implementations where long double actually has more precision than double (typically x86, PowerPC depending on compiler settings).
Regarding sorting, it seems to me that if you expect cancellation then the numbers should be added in descending order of magnitude, not ascending. For instance:
((-1 + 1) + 1e-20) will give 1e-20
but
((1e-20 + 1) - 1) will give 0
In the first equation that two large numbers are cancelled out, whereas in the second the 1e-20 term gets lost when added to 1, since there is not enough precision to retain it.
Also, pairwise summation is pretty decent for summing lots of numbers.