In a given range (a, b) ( a <= b, 2 <= a, b <= 1000000 ) find all natural numbers that can be expressed in format x ^ n ( x and n are natural numbers ). If there are more than one possibility to present expressed number, present it with a bigger exponential value.
40 110
49 = 7^2; 64 = 2^6; 81 = 3^4; 100 = 10^2;
#include <iostream>
#include <fstream>
#include <cmath>

int Power(int number, int base);

int main()
{
    int a, b;
    std::ifstream fin("U1.txt");
    fin >> a >> b;
    for (int i = a; i <= b; i++)
    {
        int max_power = 0;
        int min_base = 10;
        bool found = false;
        for (int j = 2; j <= 10; j++)
        {
            int power = Power(i, j);
            if (power > 0)
            {
                if (max_power < power) { max_power = power; }
                if (min_base > j) { min_base = j; }
                found = true;
            }
        }
        if (found)
        {
            std::cout << i << " = " << min_base << " ^ " << max_power << "; ";
        }
    }
    return 0;
}

int Power(int number, int base)
{
    int power = (log(number) / log(base) + 0.5);
    if (pow(base, power) == number)
    {
        return power;
    }
    return 0;
}
I solved the problem. However, I don't understand a few things:
How does the int Power(int number, int base) function work? Why is the log function used? Why is 0.5 added after dividing the two logarithms? I found the idea on the Internet.
I am not sure if this solution works in all cases. I didn't know what could be the biggest value of the base number so my for (int j = 2; j <= 10; j++) loop is going from 2 to 10. If there is a number whose base is bigger, the solution won't work.
Are there any easier ways to solve this problem?

How does the function work?
That's something the OP should have asked the authors of that snippet (assuming it was copied verbatim or close).
The intent seems to be to check whether a whole number power exists such that, together with the integral arguments number and base, the following equation is satisfied:
number = base ^ power
The function returns it or 0 if it doesn't exist, meaning that number is not an integral power of some integral base. To do so,
it uses a property of the logarithms:
n = b^p
log(n) = p log(b)
p = log(n) / log(b)
it rounds the number[1] to the "closest" integer, to avoid cases where the limited precision of floating-point types and operations would have yielded incorrect results with a simple truncation.
In the comments I've already made the example of std::log(1000)/std::log(10), which may produce a double result close to 3.0, but less than 3.0 (something like 2.9999999999999996). When stored in an int it would be truncated to 2.
It checks if the number found is the exact power which solves the previous equation, but that comparison has the same problems I mentioned before.
pow(base, power) == number // It compares a double with an int
Just like std::log, std::pow returns a double value, making all the calculations performed with those functions prone to subtle numerical errors (either by rounding or by accumulation when multiple operations are involved). It's often preferable to use integral types and operations, if possible, when accuracy (or absolute exactness[2]) is needed.
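To make the "use integral types" advice concrete, here is a minimal sketch of an integer-only replacement for Power. The name exact_power and the use of long long are my assumptions, not part of the original snippet, and it assumes number stays within the problem's range (at most 1'000'000) so the multiplication cannot overflow.

```cpp
#include <cassert>

// Hypothetical helper (not from the original post): returns p >= 1 such that
// base^p == number, or 0 if no such p exists. Only integer arithmetic is used,
// so there is no rounding to worry about.
int exact_power(long long number, long long base)
{
    assert(number >= 1 && base >= 2);
    long long value = base;      // base^1
    int power = 1;
    while (value < number) {     // multiply until value reaches or passes number
        value *= base;
        ++power;
    }
    return value == number ? power : 0;
}
```

Like the original function, it reports an exponent of 1 when number equals base, so a caller that only wants exponents of 2 or more has to filter that case out.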
Is the algorithm correct?
I didn't know what could be the biggest value of the base number so my for loop is going from 2 to 10
That's just wrong. One of the constraints of the problem is b <= 1'000'000, but the posted solution never tries a base greater than 10, so it misses, for example, 121 = 11^2.
An estimate of the greatest possible base is the square root of said b.
Are there any easier ways to solve this problem?
Easiness is subjective and we don't know all the requirements and constraints of OP's assignment. I'll describe an alternative solution without posting the code I wrote to test it[3].
OP's code considers all the numbers between a and b checking for every (well, up to 10) base if there exists a whole power.
My proposal uses only integral variables, of a wide enough type, say long (any 32-bit integer is enough).
The outer loop starts from base = 2 and increments it by one at every step.
Inside this loop, exponent is set to 2 and value to base * base
If value is greater than b, the algorithm stops.
While value is less than a, update it (multiplying it by base) and increment exponent by one. We need to find the first power of base which is greater than or equal to a.
While value is less than or equal to b, store the triplet of variables value, base and exponent in a suitable container.
Consider a std::map<long, std::pair<long, long>>, it lets us associate all the values with the corresponding pair of base and exponent. Also, it could be later traversed to obtain all the values in ascending order.
The assignment requires, in case of multiple powers, to present only the one with the bigger exponent. In the example, it shows 64 = 2^6, ignoring 64 = 4^3. Note that the needed one is the one with the smaller base, so it's enough to ignore any further value if it's already present in the map.
value and exponent are updated as before.
Note that this algorithm only considers bases up to the square root of b (in the outer loop) and the number of iterations of the inner loop is much more limited (with base = 2, it would be less than 20, since 2^20 > 1'000'000; greater bases would stop sooner and sooner).
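The steps above can be sketched as follows. This is only one possible reading of the description, not the code the answer alludes to; the function name find_powers is hypothetical, and the assert encodes the constraints from the problem statement.

```cpp
#include <cassert>
#include <map>
#include <utility>

// Sketch of the described algorithm (find_powers is a made-up name).
// Returns value -> (base, exponent) for every perfect power in [a, b],
// keeping the smallest base (hence the largest exponent) for each value.
std::map<long, std::pair<long, long>> find_powers(long a, long b)
{
    assert(2 <= a && a <= b && b <= 1000000);
    std::map<long, std::pair<long, long>> powers;
    for (long base = 2; base * base <= b; ++base) {
        long exponent = 2;
        long value = base * base;
        while (value < a) {      // first power of base that is >= a
            value *= base;
            ++exponent;
        }
        while (value <= b) {     // record every power of base up to b
            // emplace leaves an existing entry untouched, so the entry
            // inserted earlier (with the smaller base) is the one kept.
            powers.emplace(value, std::make_pair(base, exponent));
            value *= base;
            ++exponent;
        }
    }
    return powers;
}
```

Traversing the returned map prints the values in ascending order; for the sample input 40 110 it holds 49, 64, 81 and 100.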
[1] See e.g. Why do lots of (old) programs use floor(0.5 + input) instead of round(input)?
[2] See e.g. The most efficient way to implement an integer based power function pow(int, int)
why do we iterate to root(n) to check if n is a perfect number

while checking if a number n is perfect or not why do we check till square root of (n)?
also can some body explain the if conditions in the following loop
for(int i=2;i<sqrt(n);i++)
sum+=i; //Initially ,sum=1
According to number theory, any number greater than 1 has at least 2 divisors (1 and the number itself), and if the number A is a divisor of the number B, then the number B / A is also a divisor of the number B. Now consider a pair of numbers X, Y, such that X * Y == B. If X == Y == sqrt(B), then it is obvious that X, Y <= sqrt(B). If we try to increase Y, then we have to reduce X so that their product is still equal to B. So it turns out that among any pair of numbers X, Y whose product is B, at least one of the numbers will be <= sqrt(B). Therefore it is enough to find all divisors of B which are <= sqrt(B).
As for the if condition: when B is a perfect square, sqrt(B) is a divisor of B, and its partner B / sqrt(B) is also a divisor but is equal to sqrt(B). The if is there so that this divisor is not added twice (but you have to understand that it will never be executed, because your loop runs up to sqrt(n) exclusive).
It's pretty simple according to number theory:
If n has a factor i, it also has a factor n/i (1)
If we know all factors from 1 to sqrt(n), the rest can be calculated by applying (1)
So that's why you only have to check from 1 to sqrt(n). However, your code never reaches the clause i==n/i, which is the same as i == sqrt(n), so if n is a perfect square, sqrt(n) is never counted.
#include <iostream>
#include <cmath>
using namespace std;

int main()
{
    int n; cin >> n;
    int sum = 1;
    for(int i=2;i<sqrt(n);i++)
    {
        if(n%i==0)   // only divisors of n contribute to the sum
        {
            if(i==n/i) { sum+=i; }
            else { sum+=i+(n/i); }
        }
    }
    cout << sum;
}
Input : 9
Output : 1
As you can see, the factor 3 = sqrt(9) is missed completely. To avoid this, use i <= sqrt(n), or to avoid using sqrt(), use i <= n/i or i*i <= n.
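A minimal sketch of the corrected loop, using i * i <= n so that no call to sqrt() is needed. The function names proper_divisor_sum and is_perfect are mine, not from the question.

```cpp
#include <cassert>

// Hypothetical helper: sum of the proper divisors of n (every divisor except n).
long long proper_divisor_sum(long long n)
{
    assert(n >= 1);
    if (n == 1) return 0;                    // 1 has no proper divisors
    long long sum = 1;                       // 1 divides every n > 1
    for (long long i = 2; i * i <= n; ++i) { // includes i == sqrt(n)
        if (n % i == 0) {
            sum += i;
            if (i != n / i) sum += n / i;    // add the paired divisor only once
        }
    }
    return sum;
}

// A number is perfect when it equals the sum of its proper divisors.
bool is_perfect(long long n)
{
    return n > 1 && proper_divisor_sum(n) == n;
}
```

With this version the input 9 gives a divisor sum of 4 (1 + 3) instead of 1.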
Edit :
As @HansOlsson and @Bathsheba mentioned, no odd square is a perfect number (pretty easy to prove; in fact, no odd perfect number is known at all), and for even squares there is a proof as well. So the sqrt(n) problem could be ignored in this particular case.
However, in other cases where you just need to iterate over the factors, an error may occur. It's better to use the right method from the start than to track bugs down afterwards when using this for something else.
A related post : Why do we check up to the square root of a prime number to determine if it is prime?
The code uses the trick of finding two factors at once, since if i divides n then n/i divides n as well, and normally adds both of them (else-clause).
However, you are missing the error in the code: it loops while i<sqrt(n) but has code to handle i*i=n (the then-clause - and it should only add i once in that case), which doesn't make sense as both of these cannot be true at the same time.
So the loop should be to <=sqrt(n), even if there are no square perfect numbers. (At least I haven't seen any square perfect numbers, and I wouldn't be surprised if there's a simple proof that they don't exist at all.)

How to get rid of 2 numbers' common divisors

So I have a function that divides a pair of numbers until they no longer have any common divisors:
void simplify(int &x, int &y){
    for (int i = 2;;++i){
        if (x < i && y < i){
            return;
        }
        while (1){
            if (!(x % i) && !(y % i)){
                x /= i;
                y /= i;
            } else {
                break;
            }
        }
    }
}
How can I make it more efficient? I know one problem in this solution is that it tests for divisibility by composite numbers, when the inputs can no longer have any of their factors by the time it gets to them, so it's just wasted calculations. Can I do this without the program knowing a set of primes beforehand or computing them during the function's runtime?
Use the Euclidean algorithm¹:
1. Let a be the larger of two given positive integers and b be the smaller.
2. Let r be the remainder of a divided by b.
3. If r is zero, we are done, and b is the greatest common divisor.
4. Otherwise, let a take the value of b, let b take the value of r, and go to step 2.
Once you have the greatest common divisor, you can divide the original two numbers by it, which will yield two numbers with the same ratio but without any common factors greater than one.
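As a rough sketch of those steps in C++ (the name gcd_euclid is my own, and I assume both inputs are non-negative):

```cpp
#include <cassert>

// Greatest common divisor using the remainder form of Euclid's algorithm.
// If a < b on entry, the first iteration simply swaps the two values.
int gcd_euclid(int a, int b)
{
    assert(a >= 0 && b >= 0);
    while (b != 0) {
        int r = a % b;   // remainder of a divided by b
        a = b;
        b = r;
    }
    return a;            // the last non-zero value is the gcd
}

// Divides both numbers by their gcd so they share no factor greater than one.
void simplify(int &x, int &y)
{
    const int d = gcd_euclid(x, y);
    if (d > 1) {         // also skips the x == y == 0 case, where d is 0
        x /= d;
        y /= d;
    }
}
```

The number of loop iterations grows with the number of digits of the inputs, not with their value, which is what makes this so much faster than trial division.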
1 Euclid, Elements, book VII, propositions 1 and 2, circa 300 BCE.
Euclid used subtraction, which has been changed here to remainder.
Once this algorithm is working, you might consider the slightly more intricate Binary GCD, which replaces division (which is slow on some processors) with subtraction and bit operations.
Sounds like a job for the C++17 library feature gcd.
#include <numeric>

void simplify(int &x, int &y)
{
    const auto d = std::gcd(x, y);
    x /= d;
    y /= d;
}
Sieve of Eratosthenes calculator -- running into memory issues and crashing with numbers >=1,000,000

I'm not exactly sure why this is. I tried changing the variables to long long, and I even tried a few other things, but it's either about the inefficiency of my code (it literally does the whole process of finding all primes up to the number, then checks against the number to see if it's divisible by each prime; very inefficient, but it's my first attempt at this and I feel pretty accomplished having it work at all)
or the fact that it overflows the stack. I'm not sure where exactly, but all I know is that it MUST be related to memory and the way it's dealing with the number.
If I had to guess, I'd say it's a memory issue happening during the prime number generation up to that number; that's where it dies even if I remove the check against the input number.
I'll post my code. Just be aware that I didn't change long long back to int in a few places, and I also have a SquareRoot variable that is not used, because it was supposed to help memory efficiency but was not effective the way I tried to do it. I just never deleted it. I will clean up the code when and if I can successfully finish it.
As far as I am aware, though, it DOES work pretty reliably for 999,999 and down. I actually checked it against other calculators of the same type and it seemingly does generate the proper answers.
If anyone can help or explain what I screwed up here, you're helping a guy trying to learn on his own without any school or anything, so it's appreciated.
#include <iostream>
#include <cmath>

void sieve(int ubound, int primes[]);

int main()
{
    long long n;
    int i;
    std::cout << "Input Number: ";
    std::cin >> n;
    if (n < 2) {
        return 1;
    }
    long long upperbound = n;
    int A[upperbound];
    int SquareRoot = sqrt(upperbound);
    sieve(upperbound, A);
    for (i = 0; i < upperbound; i++) {
        if (A[i] == 1 && upperbound % i == 0) {
            std::cout << " " << i << " ";
        }
    }
    return 0;
}

void sieve(int ubound, int primes[])
{
    long long i, j, m;
    for (i = 0; i < ubound; i++) {
        primes[i] = 1;
    }
    primes[0] = 0, primes[1] = 0;
    for (i = 2; i < ubound; i++) {
        for(j = i * i; j < ubound; j += i) {
            primes[j] = 0;
        }
    }
}
If you used legal C++ constructs instead of non-standard variable length arrays, your code would run (whether it produces the correct answers is another question).
The issue is more than likely that you're exceeding the limits of the stack when you declare arrays with a million or more elements.
Therefore instead of this:
long long upperbound = n;
int A[upperbound];
Use std::vector:
#include <vector>
long long upperbound = n;
std::vector<int> A(upperbound);
and then pass the vector's underlying array to sieve, for example sieve(upperbound, A.data());
The std::vector does not use the stack space to allocate its elements (unless you have written an allocator for it that uses the stack).
As a matter of fact, you don't even need to pass upperbound to sieve, as a std::vector knows its own size by calling the size() member function. But I leave that as an exercise.
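One way the exercise could turn out is sketched below. Note that I changed the interface: sieve here takes only the limit and returns the table, and I swapped std::vector<int> for std::vector<bool> to save memory. Both choices are my assumptions, not something the answer prescribes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch: returns a table where is_prime[i] is true exactly when i is prime,
// for 0 <= i < limit. The vector's elements live on the heap, not the stack.
std::vector<bool> sieve(std::size_t limit)
{
    assert(limit >= 2);                          // the question's program rejects n < 2
    std::vector<bool> is_prime(limit, true);
    is_prime[0] = false;
    is_prime[1] = false;
    for (std::size_t i = 2; i * i < limit; ++i) {
        if (!is_prime[i]) continue;              // multiples were already crossed out
        for (std::size_t j = i * i; j < limit; j += i)
            is_prime[j] = false;
    }
    return is_prime;
}
```

Because the outer loop stops once i * i reaches the limit and skips composite i, it also does far less work than the loop in the question.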
Live example using 2,000,000
First of all, read and apply PaulMcKenzie's advice. That's the most important thing. I'm only addressing some teeny bits of your question that remained open.
It seems that you are trying to factor the number that you misleadingly called upperbound. The mysterious role of the square root of this number is related to this fact: if the number is composite at all - and hence can be computed as the product of some prime factors - then the smallest of these prime factors cannot be greater than the square root of the number. In fact, only one factor can possibly be greater, all others cannot exceed the square root.
However, in its present form your code cannot draw advantage from this fact. The trial division loop as it stands now has to run up to number_to_be_factored / 2 in order not to miss any factors because its body looks like this:
if (sieve[i] == 1 && number_to_be_factored % i == 0) {
std::cout << " " << i << " ";
You can factor much more efficiently if you refactor your code a bit: when you have found the smallest prime factor p of your number then the remaining factors to be found must be precisely those of rest = number_to_be_factored / p (or n = n / p, if you will), and none of the remaining factors can be smaller than p. However, don't forget that p might occur more than once as a factor.
During any round of the proceedings you only need to consider the prime factors between p and the square root of the current number; if none of those primes divides the current number then it must be prime. To test whether p exceeds the square root of some number n you can use if (p * p > n), which is computationally more efficient than actually computing the square root.
Hence the square root occurs in two different roles:
the square root of the number to be factored limits the amount of sieving that needs to be done
during the trial division loop, the square root of the current number gives an upper bound for the highest prime factor that you need to consider
That's two faces of the same coin but two different usages in the actual code.
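A minimal sketch of the trial division loop described above. To keep it short it tries every integer p instead of only sieved primes (a composite p can never divide the current number, because its prime factors have already been divided out); the name prime_factors is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: prime factors of n in ascending order, with repeats.
std::vector<long long> prime_factors(long long n)
{
    assert(n >= 2);
    std::vector<long long> factors;
    for (long long p = 2; p * p <= n; ++p) {  // stop once p exceeds sqrt(current n)
        while (n % p == 0) {                  // p may occur more than once
            factors.push_back(p);
            n /= p;                           // continue with the remaining cofactor
        }
    }
    if (n > 1) factors.push_back(n);          // whatever remains is itself prime
    return factors;
}
```

Since n shrinks every time a factor is found, the bound p * p <= n tightens as the loop runs, which is the second role of the square root mentioned above.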
Note: once you get your code working by applying PaulMcKenzie's advice, you might also consider posting over on Code Review.

Calculate this factorial term in C++ with basic datatypes

I am solving a programming problem, and in the end the problem boils down to calculating the following term: N! / (n1! n2! ... nk!)
I am given that the final answer will fit in 8 byte. I am using C++. How should I calculate this. I am able to come up with some tricks but nothing concrete and generalized.
I would not like to use external libraries.
Added conditions: the result will definitely fit in a 64-bit int.
If the result is guaranteed to be an integer, work with the factored representation.
By the theorem of Legendre, you can express all these factorials by the sequence of exponents of the primes in the range (2,n).
By deducting the exponents of the factorials in the denominator from those in the numerator, you will obtain exponents for the whole quotient. The computation will then reduce to a product of primes that will never overflow the 8 bytes.
For example,
25! = 2^22.3^10.5^6.7^3.11^2.13.17.19.23
15! = 2^11.3^6.5^3.7^2.11.13
10! = 2^8.3^4.5^2.7
25!/(15!.10!) = 2^3.5.11.17.19.23 = 3268760
The exponents of, say, 3 are found by
25/3 + 25/9 = 10
15/3 + 15/9 = 6
10/3 + 10/9 = 4
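Here is a hedged sketch of how the exponent bookkeeping could look in code. The function names are mine; it assumes, as the question states, that the quotient is an integer and fits in 64 bits.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Exponent of the prime p in n! (Legendre's formula): the sum of floor(n / p^k).
// Dividing n repeatedly avoids computing p^k, so nothing can overflow.
std::uint64_t legendre_exponent(std::uint64_t n, std::uint64_t p)
{
    std::uint64_t e = 0;
    for (n /= p; n > 0; n /= p) e += n;
    return e;
}

// Hypothetical helper: N! / (n_1! * n_2! * ... * n_k!), assuming the quotient
// is an integer that fits in 64 bits.
std::uint64_t factorial_quotient(std::uint64_t N, const std::vector<std::uint64_t>& parts)
{
    std::vector<bool> composite(N + 1, false);   // small sieve to enumerate primes <= N
    std::uint64_t result = 1;
    for (std::uint64_t p = 2; p <= N; ++p) {
        if (composite[p]) continue;
        for (std::uint64_t m = p * p; m <= N; m += p) composite[m] = true;
        std::uint64_t e = legendre_exponent(N, p);
        for (std::uint64_t n : parts) {
            std::uint64_t d = legendre_exponent(n, p);
            assert(d <= e);                      // otherwise the quotient is not an integer
            e -= d;
        }
        for (; e > 0; --e) result *= p;          // multiply in the surviving power of p
    }
    return result;
}
```

For the example above, factorial_quotient(25, {15, 10}) multiplies 2^3, 5, 11, 17, 19 and 23, and no intermediate value is ever larger than the final result.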
If all the input (not necessarily the output) is made of integers, you could try to count prime factors. You create an array of size n + 1 (a prime factor of a number up to n can be as large as n itself) and fill it with the counts of each prime factor in n!:
vector<int> v(n + 1, 0);
int m = 2;
while (m <= n) {
    int i = 2;
    int a = m;
    while (a > 1) {
        while (a % i == 0) {
            v[i]++;
            a /= i;
        }
        i++;
    }
    m++;
}
Then you iterate over the n_k (1 <= k <= m) and you decrease the count for each prime factor. This is pretty much the same code as above except that you replace the v[i]++ by v[i] --. Of course you need to call it with vector v previously obtained.
After that the vector v contains the list of count of prime factors in your expression and you just need to reconstruct the result as
int result = 1;
for (int i = 2; i < v.size(); i++) {
    result *= pow(i, v[i]);
}
return result;
Note : you should use long long int instead of int above but I stick to int for simplicity
Edit : As mentioned in another answer, it would be better to use Legendre theorem to fill / unfill the vector v faster.
What you can do is to use the properties of the logarithm:
log(AB) = log(A) + log(B)
log(A/B) = log(A) - log(B)
X = e^(log(X))
So you can first compute the logarithm of your quantity, then exponentiate back:
log(N!/(n1!n2!...nk!)) = log(1) + ... + log(N) - [log(n1!) + ... + log(nk!)]
then expand log(n1!) etc. so you end up writing everything in terms of logarithm of single numbers. Then take the exponential of your result to obtain the initial value of the factorial.
As @T.C. mentioned, this method may not be too accurate, although in typical scenarios many terms will cancel. Alternatively, you expand each factorial into a list that stores the terms in its product, e.g. 6! will be stored in a list {1,2,3,4,5,6}. You do the same for the denominator terms. Then you start removing common elements. Finally, you can take gcd's and reduce everything to coprime factors, then compute the result.
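For what it's worth, the logarithm route can be sketched with std::lgamma, since lgamma(n + 1) equals log(n!). The function name is made up, and the accuracy caveat above applies in full: the rounded result can only be trusted while it is far below 2^53.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper: N! / (n_1! * ... * n_k!) computed through logarithms.
// The result is rounded to the nearest integer; for large results the rounding
// error of the floating-point steps can make it wrong.
long long factorial_quotient_log(int N, const std::vector<int>& parts)
{
    assert(N >= 0);
    double log_value = std::lgamma(N + 1.0);   // log(N!)
    for (int n : parts) {
        assert(n >= 0);
        log_value -= std::lgamma(n + 1.0);     // subtract log(n_i!)
    }
    return std::llround(std::exp(log_value));
}
```

This is handy as a quick sanity check against an exact method, not as a replacement for one.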

In which order should floats be added to get the most precise result?

This was a question I was asked at my recent interview and I want to know (I don't actually remember the theory of the numerical analysis, so please help me :)
If we have some function, which accumulates floating-point numbers:
std::accumulate(v.begin(), v.end(), 0.0);
v is a std::vector<float>, for example.
Would it be better to sort these numbers before accumulating them?
Which order would give the most precise answer?
I suspect that sorting the numbers in ascending order would actually make the numerical error less, but unfortunately I can't prove it myself.
P.S. I do realize this probably has nothing to do with real world programming, just being curious.
Your instinct is basically right, sorting in ascending order (of magnitude) usually improves things somewhat. Consider the case where we're adding single-precision (32 bit) floats, and there are 1 billion values equal to 1 / (1 billion), and one value equal to 1. If the 1 comes first, then the sum will come to 1, since 1 + (1 / 1 billion) is 1 due to loss of precision. Each addition has no effect at all on the total.
If the small values come first, they will at least sum to something, although even then I have 2^30 of them, whereas after 2^25 or so I'm back in the situation where each one individually isn't affecting the total any more. So I'm still going to need more tricks.
That's an extreme case, but in general adding two values of similar magnitude is more accurate than adding two values of very different magnitudes, since you "discard" fewer bits of precision in the smaller value that way. By sorting the numbers, you group values of similar magnitude together, and by adding them in ascending order you give the small values a "chance" of cumulatively reaching the magnitude of the bigger numbers.
Still, if negative numbers are involved it's easy to "outwit" this approach. Consider three values to sum, {1, -1, 1 billionth}. The arithmetically correct sum is 1 billionth, but if my first addition involves the tiny value then my final sum will be 0. Of the 6 possible orders, only 2 are "correct" - {1, -1, 1 billionth} and {-1, 1, 1 billionth}. All 6 orders give results that are accurate at the scale of the largest-magnitude value in the input (0.0000001% out), but for 4 of them the result is inaccurate at the scale of the true solution (100% out). The particular problem you're solving will tell you whether the former is good enough or not.
In fact, you can play a lot more tricks than just adding them in sorted order. If you have lots of very small values, a middle number of middling values, and a small number of large values, then it might be most accurate to first add up all the small ones, then separately total the middling ones, add those two totals together then add the large ones. It's not at all trivial to find the most accurate combination of floating-point additions, but to cope with really bad cases you can keep a whole array of running totals at different magnitudes, add each new value to the total that best matches its magnitude, and when a running total starts to get too big for its magnitude, add it into the next total up and start a new one. Taken to its logical extreme, this process is equivalent to performing the sum in an arbitrary-precision type (so you'd do that). But given the simplistic choice of adding in ascending or descending order of magnitude, ascending is the better bet.
It does have some relation to real-world programming, since there are some cases where your calculation can go very badly wrong if you accidentally chop off a "heavy" tail consisting of a large number of values each of which is too small to individually affect the sum, or if you throw away too much precision from a lot of small values that individually only affect the last few bits of the sum. In cases where the tail is negligible anyway you probably don't care. For example if you're only adding together a small number of values in the first place and you're only using a few significant figures of the sum.
There is also an algorithm designed for this kind of accumulation operation, called Kahan Summation, that you should probably be aware of.
According to Wikipedia,
The Kahan summation algorithm (also known as compensated summation) significantly reduces the numerical error in the total obtained by adding a sequence of finite precision floating point numbers, compared to the obvious approach. This is done by keeping a separate running compensation (a variable to accumulate small errors).
In pseudocode, the algorithm is:
function kahanSum(input)
    var sum = input[1]
    var c = 0.0             //A running compensation for lost low-order bits.
    for i = 2 to input.length
        y = input[i] - c    //So far, so good: c is zero.
        t = sum + y         //Alas, sum is big, y small, so low-order digits of y are lost.
        c = (t - sum) - y   //(t - sum) recovers the high-order part of y; subtracting y recovers -(low part of y)
        sum = t             //Algebraically, c should always be zero. Beware eagerly optimising compilers!
    next i                  //Next time around, the lost low part will be added to y in a fresh attempt.
    return sum
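A direct C++ transcription of that pseudocode for the std::vector<float> from the question might look like this. The name kahan_sum is my own, and I start the sum at zero instead of input[1] so that an empty vector is handled too.

```cpp
#include <cassert>
#include <vector>

// Compensated (Kahan) summation of single-precision values.
// Flags such as -ffast-math may let the compiler simplify the compensation
// away, so build without them.
float kahan_sum(const std::vector<float>& values)
{
    float sum = 0.0f;
    float c = 0.0f;            // running compensation for lost low-order bits
    for (float x : values) {
        float y = x - c;       // apply the correction from the previous step
        float t = sum + y;     // low-order bits of y may be lost here
        c = (t - sum) - y;     // recovers the negative of what was lost
        sum = t;
    }
    return sum;
}
```

As an example, for a 1.0f followed by ten thousand values of 1e-8f a plain float loop stays at exactly 1, because each addend is below half a unit in the last place of 1.0f, while the compensated loop ends up close to 1.0001.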
I tried out the extreme example in the answer supplied by Steve Jessop.
#include <iostream>
#include <iomanip>
#include <cmath>

int main()
{
    long billion = 1000000000;
    double big = 1.0;
    double small = 1e-9;
    double expected = 2.0;

    double sum = big;
    for (long i = 0; i < billion; ++i)
        sum += small;
    std::cout << std::scientific << std::setprecision(1) << big << " + " << billion << " * " << small << " = " <<
        std::fixed << std::setprecision(15) << sum <<
        " (difference = " << std::fabs(expected - sum) << ")" << std::endl;

    sum = 0;
    for (long i = 0; i < billion; ++i)
        sum += small;
    sum += big;
    std::cout << std::scientific << std::setprecision(1) << billion << " * " << small << " + " << big << " = " <<
        std::fixed << std::setprecision(15) << sum <<
        " (difference = " << std::fabs(expected - sum) << ")" << std::endl;

    return 0;
}
I got the following result:
1.0e+00 + 1000000000 * 1.0e-09 = 2.000000082740371 (difference = 0.000000082740371)
1000000000 * 1.0e-09 + 1.0e+00 = 1.999999992539933 (difference = 0.000000007460067)
The error in the first line is more than ten times bigger than in the second.
If I change the doubles to floats in the code above, I get:
1.0e+00 + 1000000000 * 1.0e-09 = 1.000000000000000 (difference = 1.000000000000000)
1000000000 * 1.0e-09 + 1.0e+00 = 1.031250000000000 (difference = 0.968750000000000)
Neither answer is even close to 2.0 (but the second is slightly closer).
Using the Kahan summation (with doubles) as described by Daniel Pryden:
#include <iostream>
#include <iomanip>
#include <cmath>

int main()
{
    long billion = 1000000000;
    double big = 1.0;
    double small = 1e-9;
    double expected = 2.0;

    double sum = big;
    double c = 0.0;
    for (long i = 0; i < billion; ++i) {
        double y = small - c;
        double t = sum + y;
        c = (t - sum) - y;
        sum = t;
    }

    std::cout << "Kahan sum = " << std::fixed << std::setprecision(15) << sum <<
        " (difference = " << std::fabs(expected - sum) << ")" << std::endl;

    return 0;
}
I get exactly 2.0:
Kahan sum = 2.000000000000000 (difference = 0.000000000000000)
And even if I change the doubles to floats in the code above, I get:
Kahan sum = 2.000000000000000 (difference = 0.000000000000000)
It would seem that Kahan is the way to go!
There is a class of algorithms that solve this exact problem, without the need to sort or otherwise re-order the data.
In other words, the summation can be done in one pass over the data. This also makes such algorithms applicable in situations where the dataset is not known in advance, e.g. if the data arrives in real time and the running sum needs to be maintained.
Here is the abstract of a recent paper:
We present a novel, online algorithm for exact summation of a stream
of floating-point numbers. By “online” we mean that the algorithm
needs to see only one input at a time, and can take an arbitrary
length input stream of such inputs while requiring only constant
memory. By “exact” we mean that the sum of the internal array of our
algorithm is exactly equal to the sum of all the inputs, and the
returned result is the correctly-rounded sum. The proof of correctness
is valid for all inputs (including nonnormalized numbers but modulo
intermediate overflow), and is independent of the number of summands
or the condition number of the sum. The algorithm asymptotically needs
only 5 FLOPs per summand, and due to instruction-level parallelism
runs only about 2--3 times slower than the obvious, fast-but-dumb
“ordinary recursive summation” loop when the number of summands is
greater than 10,000. Thus, to our knowledge, it is the fastest, most
accurate, and most memory efficient among known algorithms. Indeed, it
is difficult to see how a faster algorithm or one requiring
significantly fewer FLOPs could exist without hardware improvements.
An application for a large number of summands is provided.
Source: Algorithm 908: Online Exact Summation of Floating-Point Streams.
Building on Steve's answer of first sorting the numbers in ascending order, I'd introduce two more ideas:
Decide on the difference in exponent of two numbers above which you might decide that you would lose too much precision.
Then add the numbers up in order until the exponent of the accumulator is too large for the next number, then put the accumulator onto a temporary queue and start the accumulator with the next number. Continue until you exhaust the original list.
You repeat the process with the temporary queue (having sorted it) and with a possibly larger difference in exponent.
I think this will be quite slow if you have to calculate exponents all the time.
I had a quick go with a program and the result was 1.99903
I think you can do better than sorting the numbers before you accumulate them, because during the process of accumulation, the accumulator gets bigger and bigger. If you have a large amount of similar numbers, you will start to lose precision quickly. Here is what I would suggest instead:
while the list has multiple elements
remove the two smallest elements from the list
add them and put the result back in
the single element in the list is the result
Of course this algorithm will be most efficient with a priority queue instead of a list. C++ code:
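template <typename Queue>
void reduce(Queue& queue)
{
    typedef typename Queue::value_type vt;
    while (queue.size() > 1)
    {
        vt x = queue.top();
        queue.pop();
        vt y = queue.top();
        queue.pop();
        queue.push(x + y);
    }
}

#include <iterator>
#include <queue>

template <typename Iterator>
typename std::iterator_traits<Iterator>::value_type
reduce(Iterator begin, Iterator end)
{
    typedef typename std::iterator_traits<Iterator>::value_type vt;
    std::priority_queue<vt> positive_queue;   // holds -x for every x >= 0
    std::priority_queue<vt> negative_queue;   // holds x for every x < 0
    for (; begin != end; ++begin)
    {
        vt x = *begin;
        if (x < 0)
            negative_queue.push(x);
        else
            positive_queue.push(-x);
    }
    reduce(positive_queue);
    reduce(negative_queue);
    // An empty queue contributes zero; calling top() on it would be undefined.
    vt negative = negative_queue.empty() ? vt() : negative_queue.top();
    vt positive = positive_queue.empty() ? vt() : -positive_queue.top();
    return negative + positive;
}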
The numbers in the queue are negative because top yields the largest number, but we want the smallest. I could have provided more template arguments to the queue, but this approach seems simpler.
This doesn't quite answer your question, but a clever thing to do is to run the sum twice, once with rounding mode "round up" and once with "round down". Compare the two answers, and you know /how/ inaccurate your results are, and if you therefore need to use a cleverer summing strategy. Unfortunately, most languages don't make changing the floating point rounding mode as easy as it should be, because people don't know that it's actually useful in everyday calculations.
Take a look at Interval arithmetic where you do all maths like this, keeping highest and lowest values as you go. It leads to some interesting results and optimisations.
The simplest sort that improves accuracy is to sort by ascending absolute value. That lets the smallest-magnitude values have a chance to accumulate or cancel before interacting with larger-magnitude values that would trigger a loss of precision.
That said, you can do better by tracking multiple non-overlapping partial sums. There is a paper describing the technique and presenting a proof of accuracy.
That algorithm and other approaches to exact floating-point summation have been implemented in simple Python. At least two of those can be trivially converted to C++.
For IEEE 754 single or double precision or known format numbers, another alternative is to use an array of numbers (passed by caller, or in a class for C++) indexed by the exponent. When adding numbers into the array, only numbers with the same exponent are added (until an empty slot is found and the number stored). When a sum is called for, the array is summed from smallest to largest to minimize truncation. Single precision example:
/* clear array */
void clearsum(float asum[256])
{
    size_t i;
    for(i = 0; i < 256; i++)
        asum[i] = 0.f;
}

/* add a number into array */
void addtosum(float f, float asum[256])
{
    size_t i;
    while(1){
        /* i = exponent of f */
        i = ((size_t)((*(unsigned int *)&f)>>23))&0xff;
        if(i == 0xff){          /* max exponent, could be overflow */
            asum[i] += f;
            return;
        }
        if(asum[i] == 0.f){     /* if empty slot store f */
            asum[i] = f;
            return;
        }
        f += asum[i];           /* else add slot to f, clear slot */
        asum[i] = 0.f;          /* and continue until empty slot */
    }
}

/* return sum from array */
float returnsum(float asum[256])
{
    float sum = 0.f;
    size_t i;
    for(i = 0; i < 256; i++)
        sum += asum[i];
    return sum;
}
double precision example:
/* clear array */
void clearsum(double asum[2048])
{
    size_t i;
    for(i = 0; i < 2048; i++)
        asum[i] = 0.;
}

/* add a number into array */
void addtosum(double d, double asum[2048])
{
    size_t i;
    while(1){
        /* i = exponent of d */
        i = ((size_t)((*(unsigned long long *)&d)>>52))&0x7ff;
        if(i == 0x7ff){         /* max exponent, could be overflow */
            asum[i] += d;
            return;
        }
        if(asum[i] == 0.){      /* if empty slot store d */
            asum[i] = d;
            return;
        }
        d += asum[i];           /* else add slot to d, clear slot */
        asum[i] = 0.;           /* and continue until empty slot */
    }
}

/* return sum from array */
double returnsum(double asum[2048])
{
    double sum = 0.;
    size_t i;
    for(i = 0; i < 2048; i++)
        sum += asum[i];
    return sum;
}
Your floats should be added in double precision. That will give you more additional precision than any other technique can. For a bit more precision and significantly more speed, you can create say four sums, and add them up at the end.
If you are adding double precision numbers, use long double for the sum - however, this will only have a positive effect in implementations where long double actually has more precision than double (typically x86, PowerPC depending on compiler settings).
Regarding sorting, it seems to me that if you expect cancellation then the numbers should be added in descending order of magnitude, not ascending. For instance:
((-1 + 1) + 1e-20) will give 1e-20
((1e-20 + 1) - 1) will give 0
In the first equation the two large numbers cancel out, whereas in the second the 1e-20 term gets lost when added to 1, since there is not enough precision to retain it.
Also, pairwise summation is pretty decent for summing lots of numbers.
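In case it helps, here is a minimal recursive sketch of pairwise summation. The name pairwise_sum and the base-case size of 8 are arbitrary choices of mine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pairwise (cascade) summation of values[begin, end): split the range in half,
// sum each half recursively, then add the two partial sums. The operands of
// each addition then tend to have similar magnitude.
float pairwise_sum(const std::vector<float>& values, std::size_t begin, std::size_t end)
{
    assert(begin <= end && end <= values.size());
    if (end - begin <= 8) {                  // small base case: plain loop
        float sum = 0.0f;
        for (std::size_t i = begin; i < end; ++i) sum += values[i];
        return sum;
    }
    const std::size_t mid = begin + (end - begin) / 2;
    return pairwise_sum(values, begin, mid) + pairwise_sum(values, mid, end);
}
```

Call it as pairwise_sum(v, 0, v.size()). The recursion depth is only about log2 of the number of elements, and no sorting or extra storage is required.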