How does Montgomery Multiplication work in speeding up the encryption process for computing c=m^e%n as used in RSA encryption?
I understand that Montgomery multiplication can efficiently compute a*b%n, but when trying to find m^e%n, is there a more efficient way than looping e times and doing one Montgomery multiplication per iteration?
mpz_class mod(mpz_class &m, mpz_class &exp, mpz_class &n) {
    //End goal is to return m^exp%n
    // cout << "Begin mod";
    mpz_class orig_m = m;  //the original message
    mpz_class loc_m = m;   //local value of m (to be changed as you cycle through)
    cout << "m: " << m << " exp: " << exp << " n: " << n << endl;
    //Conversion to the Montgomery world
    mpz_class mm_xp = (loc_m*r)%n;
    mpz_class mm_yp = (orig_m*r)%n;
    for(int i = 0; i < exp-1; i++) //Repeat the multiplication "exp" number of times
    {
        mm(mm_xp, mm_yp, n); //Montgomery multiplication; returns m*orig_m%n, but in Montgomery-world form
    }
    mm_xp = (mm_xp*r_p)%n; //convert from the Montgomery world back to normal numbers
    return mm_xp;
}
I'm using the GMP libraries so I can work with larger numbers here. r and r_p are pre-calculated in a separate function and are global. In this example I'm working in powers of 10 (though I realize it would be more efficient to work with powers of 2).
I convert to Montgomery form prior to the multiplications and repeatedly multiply by m in the for loop, converting back to the normal world at the end of the m^e step. I'm curious to know whether there is another way to compute m^e%n, rather than just cycling through a for loop. As of now I believe this to be the bottleneck of the computation, though I could very well be wrong.
The actual Montgomery multiplication step occurs in the function below.
void mm(mpz_class &ret, const mpz_class &y, const mpz_class &n)
{
    mpz_class a = ret*y;
    while(a%r != 0)
    {
        a += n;
    }
    ret = a/r; //ret*y%n in Montgomery form
    // cout << ret << endl;
}
Is this at all how RSA encryption works with the Montgomery multiplication optimization?
No, you do not want to do e multiplications of m by itself to compute RSA.
You normally want to compute m^e mod n by repeated squaring (there are other possibilities, but this is a simple one that's adequate for many typical purposes).
In a previous post on RSA, I included an implementation that used a pow_mod function. That, in turn, used a mul_mod function. Montgomery multiplication is (basically) an implementation of that mul_mod function that's better suited to working with large numbers. To make it useful, however, you just about need something on at least the general order of the pow_mod function, not just a loop to make e calls to mul_mod.
Given the magnitude of numbers involved in real use of RSA, trying to compute m^e mod n using just repeated multiplication would probably take years (quite possibly quite a few years) to complete even a single encryption. In other words, a different algorithm isn't just a nice optimization--it's absolutely necessary for use to be practical at all.
To put this in algorithmic terms, raising A to the power B using plain multiplication is basically O(B). Doing it with the repeated-squaring algorithm shown there, it's basically O(log B) instead. If B is at all large, the difference between the two is immense.
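To make that concrete against the code in the question, here is a rough sketch of square-and-multiply written on top of your mm(), r and r_p globals (so it assumes, as your code does, that r_p is the inverse of r mod n). It is meant as an illustration rather than a drop-in replacement:

mpz_class pow_mod_montgomery(const mpz_class &m, const mpz_class &e, const mpz_class &n)
{
    mpz_class base = (m * r) % n;   // m converted into the Montgomery world
    mpz_class result = r % n;       // 1 converted into the Montgomery world
    mpz_class exp = e;
    while (exp > 0) {
        if (mpz_odd_p(exp.get_mpz_t()))
            mm(result, base, n);    // multiply into the result only for set bits of e
        mm(base, base, n);          // square the base on every iteration
        exp >>= 1;
    }
    return (result * r_p) % n;      // convert back out of the Montgomery world
}

The loop runs once per bit of e instead of once per unit of e, which is where the O(log B) behaviour comes from.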
Related
I am using the Eigen C++ library for linear algebra operations.
There is a variable v in my code of type VectorXd, and I need to calculate its sum, so I call v.sum().
However, when I updated my program to a new version, although the values in v remain the same (read from the same input file), the sum() function gives a slightly different value.
Here is a piece of code that explains my problem:
double vsum1 = v.sum();
double vsum2 = 0; // compare with manually calculated sum
for(size_t i = 0; i < v.size(); ++i)
{
    vsum2 += v(i);
}
cout << "sum1: " << vsum1 << endl;
cout << "sum2: " << vsum2 << endl;
for the old version, the result is
sum1: 94.8117866666666487
sum2: 94.8117866666666202
for the new version, the result is
sum1: 94.8117866666666345
sum2: 94.8117866666666202
The manually calculated sum vsum2 remains unchanged, so I think the original vector v didn't change; then why would sum() give a different result? Is it because of some SIMD optimization performed by Eigen?
The difference is actually negligible, but it leads to a failure of our regression test.
5gon12eder's comment is right. Eigen 3.3 performs AVX vectorization if it is available (four doubles at once), compared to SSE only in Eigen 3.2 (two doubles at once). In any case, you must use some tolerance when comparing floating-point numbers to account for round-off errors. You can take inspiration from Eigen's unit tests.
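For the regression test itself, a simple relative/absolute tolerance check is usually enough. The helper name and tolerance values below are mine, chosen only for illustration (Eigen's isApprox() does the analogous thing for whole vectors and matrices):

#include <algorithm>
#include <cmath>

bool approx_equal(double a, double b, double rel_tol = 1e-12, double abs_tol = 1e-15)
{
    // true if a and b differ by no more than abs_tol, or by no more than
    // rel_tol relative to the larger magnitude of the two
    return std::fabs(a - b) <= std::max(abs_tol, rel_tol * std::max(std::fabs(a), std::fabs(b)));
}

The test would then compare approx_equal(vsum1, vsum2) rather than vsum1 == vsum2.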
Given n (n <= 1,000,000) positive integers, each smaller than 1,000,000, the task is to calculate the sum of the bitwise XOR (^ in C/C++) of every distinct pair of the given numbers.
Time limit is 1 second.
For example, if the 3 integers 7, 3 and 5 are given, the answer should be 7^3 + 7^5 + 3^5 = 12.
My approach is:
#include <bits/stdc++.h>
using namespace std;

int num[1000001];

int main()
{
    int n, i, sum, j;
    scanf("%d", &n);
    sum = 0;
    for(i = 0; i < n; i++)
        scanf("%d", &num[i]);
    for(i = 0; i < n-1; i++)
    {
        for(j = i+1; j < n; j++)
        {
            sum += (num[i]^num[j]);
        }
    }
    printf("%d\n", sum);
    return 0;
}
But my code fails to run within 1 second. How can I rewrite it so that it runs in 1 second?
Edit: this is actually an online judge problem, and I am getting "CPU Limit Exceeded" with the code above.
You need to compute around 1e12 xors in order to brute force this. Modern processors can do around 1e10 such operations per second. So brute force cannot work; therefore they are looking for you to figure out a better algorithm.
So you need to find a way to determine the answer without computing all those xors.
Hint: can you think of a way to do it if all the input numbers were either zero or one (one bit)? And then extend it to numbers of two bits, three bits, and so on?
When optimising your code you can go 3 different routes:
Optimising the algorithm.
Optimising the calls to language and library functions.
Optimising for the particular architecture.
There may very well be a quicker mathematical way of XORing every pair combination and then summing them up, but I don't know of one. In any case, on contemporary processors you'll be shaving off microseconds at best, because you are doing only basic operations (XOR and addition).
Optimising for the architecture also makes little sense. It normally becomes important with repetitive branching, and you have nothing like that here.
The biggest problem in your algorithm is reading from the standard input. Although "scanf" takes only 5 characters in your source code, in machine language it is the bulk of your program. Unfortunately, if the data actually changes each time you run your code, there is no way around reading from stdin, and it makes no difference whether you use scanf, std::cin >>, or even attempt to implement your own routine to read characters from input and convert them into ints.
All this assumes that you don't expect a human being to enter thousands of numbers in less than one second. I guess you are running your code via: myprogram < data.
This function grows quadratically (thanks @rici). At around 25,000 positive integers, each being 999,999 (the worst case), the for-loop calculation alone already takes approximately a second. Trying to make this work for input as you have specified, with up to 1 million positive integers, just doesn't seem possible.
With the hint in Alan Stokes's answer, you can get linear complexity instead of quadratic with the following:
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

std::size_t xor_sum(const std::vector<std::uint32_t>& v)
{
    std::size_t res = 0;
    for (std::size_t b = 0; b != 32; ++b) {
        // number of values with bit b set...
        const std::size_t count_1 =
            std::count_if(v.begin(), v.end(),
                          [b](std::uint32_t n) { return (n >> b) & 0x01; });
        // ...and number of values with bit b clear
        const std::size_t count_0 = v.size() - count_1;
        // each (0,1) pair contributes 1 << b to the total
        res += (count_0 * count_1) << b;
    }
    return res;
}
Explanation:
x^y = Sum_b((x&b) ^ (y&b)), where b runs over the single-bit masks from 1<<0 to 1<<31.
For a given bit, let count_0 and count_1 be the number of values with that bit equal to 0 and 1 respectively. The 0^0 and 1^1 pairs contribute nothing, and there are count_0 * count_1 pairs of the form 0^1, each contributing that bit's value to the sum.
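As a quick sanity check against the example from the question (a hypothetical driver; it assumes the xor_sum above is in the same translation unit):

#include <cassert>

int main()
{
    // 7^3 + 7^5 + 3^5 = 4 + 2 + 6 = 12
    std::vector<std::uint32_t> v{7, 3, 5};
    assert(xor_sum(v) == 12);
}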
I have an assignment that asks us to write a C++ program that takes input from the user for the amount of numbers on a lottery ticket and the amount of numbers in a lottery drawing. It should then calculate the odds of the user getting the numbers correct. This is (more or less) my first program in C++, so I am new to this. What I have so far is below. I am seeking help with making the program work. I can get values in for the declared variables, but cannot figure out how to write down what it is I actually need to do, which is a factorial function. I know the function, I just don't know how to express it in C++.
From what I understand at this point is that it should look something like this:
for (int i = 1; i <= k; i++) {
    result = (result * (n+1-i)) / i;
}
or something to that effect? At least this is what I have come across in the past couple of hours of searching for an answer online. I think I am getting close to figuring it out, but I am at a roadblock.
I don't want someone to just tell me the answer. If you could explain to me what I am doing wrong and what I can do to fix it that would be most helpful for me.
#include <iostream>
#include <iomanip>
using namespace std;

int main (int argc, char** argv)
{
    int n, k;
    int odds;
    cout << "How many numbers are printed on the lottery ticket? ";
    cin >> n;
    cout << "How many numbers are selected in the lottery drawing? ";
    cin >> k;
    cout << "You entered " << n << " for how many numbers are printed on the lottery ticket, and "
         << k << " for how many numbers are selected in the lottery drawing." << endl;
    for (int i = 1; i <= k; i++)
    {
        odds = (n * (n-k++))/k;
        cout << odds;
    }
    return 0;
}
When I run this I just get an endless stream of "3-3-3-3....". It's non-stop. At one point I was getting a number as the output (one VERY large incorrect number), but while I was tinkering with it I couldn't get it back.
Any guidance would be appreciated.
This seems slightly difficult for a first assignment, unless you're most of the way through a computer science curriculum and only new to C++.
The formula for the odds, which is commonly known as "number of combinations", is frequently written in terms of factorials. But you can't manipulate those factorials effectively on a computer; they are far too large for any of the built-in data types.
Instead, it's important to cancel like terms from numerator and denominator. Interleaving multiplications and divisions can help even more.
I've previously posted working code for number of combinations on another question:
Number of combinations (N choose R) in C++
Your current code actually does have things interleaved pretty well, but you haven't been at all careful with the meanings of i and k and n, and you've also got undefined behavior from both reading and writing a variable between sequence points.
Specifically, this is illegal because the k in the denominator is unstable, since it is in the process of being incremented:
odds = n*(n-k++)/k;
You shouldn't be changing k here at all. The value varying from 1 to k is i. So this becomes:
odds = n * (n-i) / i;
You need all the terms to accumulate across loop iterations, so you should be multiplying by the previous odds value:
odds = odds * (n - i) / i;
You do need n - 0 in the numerator, but no 0 in the denominator. You've chosen to make i one-based, so it's the numerator that needs to be adjusted:
odds = odds * (n + 1 - i) / i;
And now your code is extremely close to mine. Depending on your values of n and k you might still overflow. Changing the data type of odds to long long or double should help with that.
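Putting those pieces together, a minimal sketch of the corrected loop (my illustration, not your full program) could look like this, with odds starting at 1 and widened to long long so larger n and k don't overflow:

long long odds = 1;                      // accumulates C(n, k), the number of combinations
for (int i = 1; i <= k; i++)
{
    odds = odds * (n + 1 - i) / i;       // exact at every step: odds equals C(n, i) after iteration i
}
cout << "The odds are 1 in " << odds << endl;

The division is exact at each step because odds * (n + 1 - i) is always a multiple of i when odds holds C(n, i-1), which is why interleaving the multiplications and divisions works.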
This is the formula you need:
http://en.wikipedia.org/wiki/Lottery_mathematics
Make sure that you have the mathematics well in hand. Start with a function that implements that formula.
Once you have the formula in hand, you'll realize that the naive student factorial will never work. The biggest naive factorial you can have with a long is 20!; after that it overflows.
The right way to do it is logarithms and gamma function:
https://en.wikipedia.org/wiki/Gamma_function
So that formula will turn into:
ln(n! / (k! (n-k)!)) = ln(n!) - ln(k!) - ln((n-k)!)
But since gamma(n+1) = n!, this becomes:
lngamma(n+1) - lngamma(k+1) - lngamma(n-k+1)
The gamma function returns doubles, not integers or longs. It'll behave much better for you.
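For illustration, here is a small sketch using std::lgamma (the log-gamma function) from <cmath>; the function name is mine, and the double result should be rounded if you need an exact integer count:

#include <cmath>

double combinations(double n, double k)
{
    // C(n, k) = n! / (k! (n-k)!), computed via logarithms to avoid overflow
    return std::exp(std::lgamma(n + 1.0) - std::lgamma(k + 1.0) - std::lgamma(n - k + 1.0));
}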
I'm currently writing my own AES/RSA encryption program in C++ for Unix. I've been going through the literature for about a week now, and I've started to wrap my head around it all, but I'm still left with some pressing questions:
1) Based on my understanding, an RSA key in its most basic form is the combination of the product of the two primes (R) used and the exponents. It's obvious to me that storing the key in such a form in plaintext would defeat the purpose of encrypting anything at all. Therefore, in what form can I store my generated public and private keys? Ask the user for a password and do some "simple" shift/replacing on the individual digits of the key with an ASCII table? Or is there some other standard I haven't run across? Also, when the keys are generated, are R and the respective exponent simply stored sequentially? i.e. ##primeproduct####exponent##? In that case, how would a decryption algorithm parse the key into the two separate values?
2) How would I go about programmatically generating the private exponent, given that I've decided to use 65537 as my public exponent for all encryptions? I've got the equation P*Q = 1 mod M, where P and Q are the exponents and M is the result of Euler's totient function. Is this simply a matter of generating random numbers and testing their relative primality to the public exponent until you hit pay dirt? I know you can't simply start from 1 and increment until you find such a number, as anyone could simply do the same thing and get your private exponent themselves.
3) When generating the character equivalence set, I understand that the numbers used in the set must be less than, and relatively prime to, P*Q. Again, this is a matter of testing the relative primality of numbers to P*Q. Is the speed of testing relative primality independent of the size of the numbers you're working with? Or are special algorithms necessary?
Thanks in advance to anyone who takes the time to read and answer, cheers!
There are some standard formats for storing/exchanging RSA keys such as RFC 3447. For better or worse, most (many, anyway) use ASN.1 encoding, which adds more complexity than most people like, all by itself. A few use Base64 encoding, which is a lot easier to implement.
As far as what constitutes a key goes: in its most basic form, you're correct; the public key includes the modulus (usually called n) and an exponent (usually called e).
To compute a key pair, you start from two large prime numbers, usually called p and q. You compute the modulus n as p * q. You also compute a number (often called r) that's (p-1) * (q-1).
e is then a more or less randomly chosen number that's prime relative to r. Warning: you don't want e to be really small though -- log(e) >= log(n)/4 as a bare minimum.
You then compute d (the private decryption key) as a number satisfying the relation:
d * e = 1 (mod r)
You typically compute this using Euclid's algorithm, though there are other options (see below). Again, you don't want d to be really small either, so if it works out to a really small number, you probably want to try another value for e, and compute a new d to match.
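For what it's worth, the extended Euclidean algorithm mentioned above can be sketched as follows (a hypothetical helper, assuming the values fit in a long long; with real key sizes you would use a big-integer type):

// Returns x such that (a * x) % m == 1, assuming gcd(a, m) == 1.
long long mod_inverse(long long a, long long m)
{
    long long old_r = a, r = m;
    long long old_s = 1, s = 0;
    while (r != 0) {
        long long q = old_r / r;
        long long t = old_r - q * r; old_r = r; r = t;
        t = old_s - q * s; old_s = s; s = t;
    }
    // old_r is now gcd(a, m); old_s is the Bezout coefficient of a
    long long x = old_s % m;
    return x < 0 ? x + m : x;
}

Computing d is then just mod_inverse(e, r).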
There is another way to compute your e and d. You can start by finding some number K that's congruent to 1 mod r, then factor it. Put the prime factors together to get two factors of roughly equal size, and use them as e and d.
As far as an attacker computing your d goes: you need r to compute this, and knowing r depends on knowing p and q. That's exactly why/where/how factoring comes into breaking RSA. If you factor n, then you know p and q. From them, you can find r, and from r you can compute the d that matches a known e.
So, let's work through the math to create a key pair. We're going to use primes that are much too small to be effective, but should be sufficient to demonstrate the ideas involved.
So let's start by picking a p and q (of course, both need to be primes):
p = 9999991
q = 11999989
From those we compute n and r:
n = 119999782000099
r = 119999760000120
Next we need to either pick e or else compute K, then factor it to get e and d. For the moment, we'll go with your suggestion of e = 65537 (since 65537 is prime, the only possibility for it and r not being relatively prime would be if r were an exact multiple of 65537, which we can quite easily verify is not the case).
From that, we need to compute our d. We can do that fairly easily (though not necessarily very quickly) using the "Extended" version of Euclid's algorithm, (as you mentioned) Euler's Totient, Gauss' method, or any of a number of others.
For the moment, I'll compute it using Gauss' method:
template <class num>
num gcd(num a, num b) {
num r;
while (b > 0) {
r = a % b;
a = b;
b = r;
}
return a;
}
template <class num>
num find_inverse(num a, num p) {
num g, z;
if (gcd(a, p) > 1) return 0;
z = 1;
while (a > 1) {
z += p;
if ((g=gcd(a, z))> 1) {
a /= g;
z /= g;
}
}
return z;
}
The result we get is:
d = 38110914516113
Then we can plug these into an implementation of RSA, and use them to encrypt and decrypt a message.
So, let's encrypt "Very Secret Message!". Using the e and n given above, that encrypts to:
74603288122996
49544151279887
83011912841578
96347106356362
20256165166509
66272049143842
49544151279887
22863535059597
83011912841578
49544151279887
96446347654908
20256165166509
87232607087245
49544151279887
68304272579690
68304272579690
87665372487589
26633960965444
49544151279887
15733234551614
And, using the d given above, that decrypts back to the original. Code to do the encryption/decryption (using hard-coded keys and modulus) looks like this:
#include <iostream>
#include <iterator>
#include <algorithm>
#include <string>
#include <vector>
#include <functional>

typedef unsigned long long num;

const num e_key = 65537;
const num d_key = 38110914516113;
const num n = 119999782000099;

template <class T>
T mul_mod(T a, T b, T m) {
    if (m == 0) return a * b;
    T r = T();
    while (a > 0) {
        if (a & 1)
            if ((r += b) > m) r %= m;
        a >>= 1;
        if ((b <<= 1) > m) b %= m;
    }
    return r;
}

template <class T>
T pow_mod(T a, T n, T m) {
    T r = 1;
    while (n > 0) {
        if (n & 1)
            r = mul_mod(r, a, m);
        a = mul_mod(a, a, m);
        n >>= 1;
    }
    return r;
}

int main() {
    std::string msg = "Very Secret Message!";
    std::vector<num> encrypted;

    std::cout << "Original message: " << msg << '\n';

    std::transform(msg.begin(), msg.end(),
                   std::back_inserter(encrypted),
                   [&](num val) { return pow_mod(val, e_key, n); });

    std::cout << "Encrypted message:\n";
    std::copy(encrypted.begin(), encrypted.end(), std::ostream_iterator<num>(std::cout, "\n"));
    std::cout << "\n";

    std::cout << "Decrypted message: ";
    std::transform(encrypted.begin(), encrypted.end(),
                   std::ostream_iterator<char>(std::cout, ""),
                   [](num val) { return pow_mod(val, d_key, n); });
    std::cout << "\n";
}
To have even a hope of security, you need to use a much larger modulus though--hundreds of bits at the very least (and perhaps a thousand or more for the paranoid). You could do that with a normal arbitrary precision integer library, or routines written specifically for the task at hand. RSA is inherently fairly slow, so at one time most implementations used code with lots of hairy optimization to do the job. Nowadays, hardware is fast enough that you can probably get away with a fairly average large-integer library fairly easily (especially since in real use, you only want to use RSA to encrypt/decrypt a key for a symmetrical algorithm, not to encrypt the raw data).
Even with a modulus of suitable size (and the code modified to support the large numbers needed), this is still what's sometimes referred to as "textbook RSA", and it's not really suitable for much in the way of real encryption. For example, right now, it's encrypting one byte of the input at a time. This leaves noticeable patterns in the encrypted data. It's trivial to look at the encrypted data above and see that the second and seventh words are identical--because both are the encrypted form of e (which also occurs a couple of other places in the message).
As it stands right now, this can be attacked as a simple substitution code. e is the most common letter in English, so we can (correctly) guess that the most common word in the encrypted data represents e (and relative frequencies of letters in various languages are well known). Worse, we can also look at things like pairs and triplets of letters to improve the attack. For example, if we see the same word twice in succession in the encrypted data, we know we're seeing a double letter, which can only be a few letters in normal English text. Bottom line: even though RSA itself can be quite strong, the way of using it shown above definitely is not.
To prevent that problem, with a (say) 512-bit key, we'd also process the input in 512-bit chunks. That means we only get a repetition if two 512-bit stretches of the original input are entirely identical. Even if that happens, it's relatively difficult to guess what that would be, so although it's undesirable, it's not nearly as vulnerable as the byte-by-byte version shown above. In addition, you always want to pad the input to a multiple of the size being encrypted.
Reference
https://crypto.stackexchange.com/questions/1448/definition-of-textbook-rsa
This was a question I was asked at a recent interview and I want to know the answer (I don't actually remember the theory of numerical analysis, so please help me :)
If we have some function, which accumulates floating-point numbers:
std::accumulate(v.begin(), v.end(), 0.0);
v is a std::vector<float>, for example.
Would it be better to sort these numbers before accumulating them?
Which order would give the most precise answer?
I suspect that sorting the numbers in ascending order would actually make the numerical error less, but unfortunately I can't prove it myself.
P.S. I do realize this probably has nothing to do with real world programming, just being curious.
Your instinct is basically right, sorting in ascending order (of magnitude) usually improves things somewhat. Consider the case where we're adding single-precision (32 bit) floats, and there are 1 billion values equal to 1 / (1 billion), and one value equal to 1. If the 1 comes first, then the sum will come to 1, since 1 + (1 / 1 billion) is 1 due to loss of precision. Each addition has no effect at all on the total.
If the small values come first, they will at least sum to something, although even then I have 2^30 of them, whereas after 2^25 or so I'm back in the situation where each one individually isn't affecting the total any more. So I'm still going to need more tricks.
That's an extreme case, but in general adding two values of similar magnitude is more accurate than adding two values of very different magnitudes, since you "discard" fewer bits of precision in the smaller value that way. By sorting the numbers, you group values of similar magnitude together, and by adding them in ascending order you give the small values a "chance" of cumulatively reaching the magnitude of the bigger numbers.
Still, if negative numbers are involved it's easy to "outwit" this approach. Consider three values to sum, {1, -1, 1 billionth}. The arithmetically correct sum is 1 billionth, but if my first addition involves the tiny value then my final sum will be 0. Of the 6 possible orders, only 2 are "correct" - {1, -1, 1 billionth} and {-1, 1, 1 billionth}. All 6 orders give results that are accurate at the scale of the largest-magnitude value in the input (0.0000001% out), but for 4 of them the result is inaccurate at the scale of the true solution (100% out). The particular problem you're solving will tell you whether the former is good enough or not.
In fact, you can play a lot more tricks than just adding them in sorted order. If you have lots of very small values, a moderate number of middling values, and a small number of large values, then it might be most accurate to first add up all the small ones, then separately total the middling ones, add those two totals together, then add the large ones. It's not at all trivial to find the most accurate combination of floating-point additions, but to cope with really bad cases you can keep a whole array of running totals at different magnitudes, add each new value to the total that best matches its magnitude, and when a running total starts to get too big for its magnitude, add it into the next total up and start a new one. Taken to its logical extreme, this process is equivalent to performing the sum in an arbitrary-precision type (so you'd do that). But given the simplistic choice of adding in ascending or descending order of magnitude, ascending is the better bet.
It does have some relation to real-world programming, since there are some cases where your calculation can go very badly wrong if you accidentally chop off a "heavy" tail consisting of a large number of values each of which is too small to individually affect the sum, or if you throw away too much precision from a lot of small values that individually only affect the last few bits of the sum. In cases where the tail is negligible anyway you probably don't care. For example if you're only adding together a small number of values in the first place and you're only using a few significant figures of the sum.
There is also an algorithm designed for this kind of accumulation operation, called Kahan Summation, that you should probably be aware of.
According to Wikipedia,
The Kahan summation algorithm (also known as compensated summation) significantly reduces the numerical error in the total obtained by adding a sequence of finite precision floating point numbers, compared to the obvious approach. This is done by keeping a separate running compensation (a variable to accumulate small errors).
In pseudocode, the algorithm is:
function kahanSum(input)
    var sum = input[1]
    var c = 0.0           //A running compensation for lost low-order bits.
    for i = 2 to input.length
        y = input[i] - c  //So far, so good: c is zero.
        t = sum + y       //Alas, sum is big, y small, so low-order digits of y are lost.
        c = (t - sum) - y //(t - sum) recovers the high-order part of y; subtracting y recovers -(low part of y)
        sum = t           //Algebraically, c should always be zero. Beware eagerly optimising compilers!
    next i                //Next time around, the lost low part will be added to y in a fresh attempt.
    return sum
I tried out the extreme example in the answer supplied by Steve Jessop.
#include <iostream>
#include <iomanip>
#include <cmath>

int main()
{
    long billion = 1000000000;
    double big = 1.0;
    double small = 1e-9;
    double expected = 2.0;

    double sum = big;
    for (long i = 0; i < billion; ++i)
        sum += small;
    std::cout << std::scientific << std::setprecision(1) << big << " + " << billion << " * " << small << " = " <<
        std::fixed << std::setprecision(15) << sum <<
        " (difference = " << std::fabs(expected - sum) << ")" << std::endl;

    sum = 0;
    for (long i = 0; i < billion; ++i)
        sum += small;
    sum += big;
    std::cout << std::scientific << std::setprecision(1) << billion << " * " << small << " + " << big << " = " <<
        std::fixed << std::setprecision(15) << sum <<
        " (difference = " << std::fabs(expected - sum) << ")" << std::endl;

    return 0;
}
I got the following result:
1.0e+00 + 1000000000 * 1.0e-09 = 2.000000082740371 (difference = 0.000000082740371)
1000000000 * 1.0e-09 + 1.0e+00 = 1.999999992539933 (difference = 0.000000007460067)
The error in the first line is more than ten times bigger than in the second.
If I change the doubles to floats in the code above, I get:
1.0e+00 + 1000000000 * 1.0e-09 = 1.000000000000000 (difference = 1.000000000000000)
1000000000 * 1.0e-09 + 1.0e+00 = 1.031250000000000 (difference = 0.968750000000000)
Neither answer is even close to 2.0 (but the second is slightly closer).
Using the Kahan summation (with doubles) as described by Daniel Pryden:
#include <iostream>
#include <iomanip>
#include <cmath>

int main()
{
    long billion = 1000000000;
    double big = 1.0;
    double small = 1e-9;
    double expected = 2.0;

    double sum = big;
    double c = 0.0;
    for (long i = 0; i < billion; ++i) {
        double y = small - c;
        double t = sum + y;
        c = (t - sum) - y;
        sum = t;
    }

    std::cout << "Kahan sum = " << std::fixed << std::setprecision(15) << sum <<
        " (difference = " << std::fabs(expected - sum) << ")" << std::endl;

    return 0;
}
I get exactly 2.0:
Kahan sum = 2.000000000000000 (difference = 0.000000000000000)
And even if I change the doubles to floats in the code above, I get:
Kahan sum = 2.000000000000000 (difference = 0.000000000000000)
It would seem that Kahan is the way to go!
There is a class of algorithms that solve this exact problem, without the need to sort or otherwise re-order the data.
In other words, the summation can be done in one pass over the data. This also makes such algorithms applicable in situations where the dataset is not known in advance, e.g. if the data arrives in real time and the running sum needs to be maintained.
Here is the abstract of a recent paper:
We present a novel, online algorithm for exact summation of a stream of floating-point numbers. By “online” we mean that the algorithm needs to see only one input at a time, and can take an arbitrary length input stream of such inputs while requiring only constant memory. By “exact” we mean that the sum of the internal array of our algorithm is exactly equal to the sum of all the inputs, and the returned result is the correctly-rounded sum. The proof of correctness is valid for all inputs (including nonnormalized numbers but modulo intermediate overflow), and is independent of the number of summands or the condition number of the sum. The algorithm asymptotically needs only 5 FLOPs per summand, and due to instruction-level parallelism runs only about 2--3 times slower than the obvious, fast-but-dumb “ordinary recursive summation” loop when the number of summands is greater than 10,000. Thus, to our knowledge, it is the fastest, most accurate, and most memory efficient among known algorithms. Indeed, it is difficult to see how a faster algorithm or one requiring significantly fewer FLOPs could exist without hardware improvements. An application for a large number of summands is provided.
Source: Algorithm 908: Online Exact Summation of Floating-Point Streams.
Building on Steve's answer of first sorting the numbers in ascending order, I'd introduce two more ideas:
Decide on the difference in exponent of two numbers above which you might decide that you would lose too much precision.
Then add the numbers up in order until the exponent of the accumulator is too large for the next number, then put the accumulator onto a temporary queue and start the accumulator with the next number. Continue until you exhaust the original list.
You repeat the process with the temporary queue (having sorted it) and with a possibly larger difference in exponent.
I think this will be quite slow if you have to calculate exponents all the time.
I had a quick go with a program and the result was 1.99903
I think you can do better than sorting the numbers before you accumulate them, because during the process of accumulation, the accumulator gets bigger and bigger. If you have a large amount of similar numbers, you will start to lose precision quickly. Here is what I would suggest instead:
while the list has multiple elements
    remove the two smallest elements from the list
    add them and put the result back in
the single element in the list is the result
Of course this algorithm will be most efficient with a priority queue instead of a list. C++ code:
template <typename Queue>
void reduce(Queue& queue)
{
    typedef typename Queue::value_type vt;
    while (queue.size() > 1)
    {
        vt x = queue.top();
        queue.pop();
        vt y = queue.top();
        queue.pop();
        queue.push(x + y);
    }
}
driver:
#include <iterator>
#include <queue>

template <typename Iterator>
typename std::iterator_traits<Iterator>::value_type
reduce(Iterator begin, Iterator end)
{
    typedef typename std::iterator_traits<Iterator>::value_type vt;
    std::priority_queue<vt> positive_queue;
    positive_queue.push(0);
    std::priority_queue<vt> negative_queue;
    negative_queue.push(0);
    for (; begin != end; ++begin)
    {
        vt x = *begin;
        if (x < 0)
        {
            negative_queue.push(x);
        }
        else
        {
            positive_queue.push(-x);
        }
    }
    reduce(positive_queue);
    reduce(negative_queue);
    return negative_queue.top() - positive_queue.top();
}
The numbers in the queue are negative because top yields the largest number, but we want the smallest. I could have provided more template arguments to the queue, but this approach seems simpler.
This doesn't quite answer your question, but a clever thing to do is to run the sum twice, once with rounding mode "round up" and once with "round down". Compare the two answers, and you know how inaccurate your results are, and whether you therefore need to use a cleverer summing strategy. Unfortunately, most languages don't make changing the floating-point rounding mode as easy as it should be, because people don't know that it's actually useful in everyday calculations.
Take a look at Interval arithmetic where you do all maths like this, keeping highest and lowest values as you go. It leads to some interesting results and optimisations.
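If you want to experiment with the rounding-mode trick, <cfenv> exposes the mode switch. A rough sketch follows (illustrative only: compilers generally need floating-point environment access enabled, e.g. -frounding-math or the FENV_ACCESS pragma, before they reliably honour the mode change):

#include <cfenv>
#include <vector>

double sum_with_rounding(const std::vector<double>& v, int mode)
{
    const int old_mode = std::fegetround();
    std::fesetround(mode);              // e.g. FE_UPWARD or FE_DOWNWARD
    double s = 0.0;
    for (double x : v)
        s += x;
    std::fesetround(old_mode);          // restore the caller's rounding mode
    return s;
}

// The gap between sum_with_rounding(v, FE_UPWARD) and
// sum_with_rounding(v, FE_DOWNWARD) brackets the rounding error.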
The simplest sort that improves accuracy is to sort by ascending absolute value. That lets the smallest magnitude values have a chance to accumulate or cancel before interacting with larger magnitude values that would trigger a loss of precision.
That said, you can do better by tracking multiple non-overlapping partial sums. Here is a paper describing the technique and presenting a proof-of-accuracy: www-2.cs.cmu.edu/afs/cs/project/quake/public/papers/robust-arithmetic.ps
That algorithm and other approaches to exact floating point summation are implemented in simple Python at: http://code.activestate.com/recipes/393090/ At least two of those can be trivially converted to C++.
For IEEE 754 single or double precision or known format numbers, another alternative is to use an array of numbers (passed by caller, or in a class for C++) indexed by the exponent. When adding numbers into the array, only numbers with the same exponent are added (until an empty slot is found and the number stored). When a sum is called for, the array is summed from smallest to largest to minimize truncation. Single precision example:
#include <stddef.h> /* for size_t */

/* clear array */
void clearsum(float asum[256])
{
    size_t i;
    for(i = 0; i < 256; i++)
        asum[i] = 0.f;
}

/* add a number into array */
void addtosum(float f, float asum[256])
{
    size_t i;
    while(1){
        /* i = exponent of f */
        i = ((size_t)((*(unsigned int *)&f)>>23))&0xff;
        if(i == 0xff){      /* max exponent, could be overflow */
            asum[i] += f;
            return;
        }
        if(asum[i] == 0.f){ /* if empty slot store f */
            asum[i] = f;
            return;
        }
        f += asum[i];       /* else add slot to f, clear slot */
        asum[i] = 0.f;      /* and continue until empty slot */
    }
}

/* return sum from array */
float returnsum(float asum[256])
{
    float sum = 0.f;
    size_t i;
    for(i = 0; i < 256; i++)
        sum += asum[i];
    return sum;
}
double precision example:
#include <stddef.h> /* for size_t */

/* clear array */
void clearsum(double asum[2048])
{
    size_t i;
    for(i = 0; i < 2048; i++)
        asum[i] = 0.;
}

/* add a number into array */
void addtosum(double d, double asum[2048])
{
    size_t i;
    while(1){
        /* i = exponent of d */
        i = ((size_t)((*(unsigned long long *)&d)>>52))&0x7ff;
        if(i == 0x7ff){    /* max exponent, could be overflow */
            asum[i] += d;
            return;
        }
        if(asum[i] == 0.){ /* if empty slot store d */
            asum[i] = d;
            return;
        }
        d += asum[i];      /* else add slot to d, clear slot */
        asum[i] = 0.;      /* and continue until empty slot */
    }
}

/* return sum from array */
double returnsum(double asum[2048])
{
    double sum = 0.;
    size_t i;
    for(i = 0; i < 2048; i++)
        sum += asum[i];
    return sum;
}
Your floats should be added in double precision. That will give you more additional precision than any other technique can. For a bit more precision and significantly more speed, you can create say four sums, and add them up at the end.
If you are adding double precision numbers, use long double for the sum - however, this will only have a positive effect in implementations where long double actually has more precision than double (typically x86, PowerPC depending on compiler settings).
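As a sketch of the "create say four sums" idea mentioned above for a vector of floats (names and structure are mine): each float is widened to double before the addition, and the independent accumulators let the additions overlap:

#include <cstddef>
#include <vector>

double sum_in_double(const std::vector<float>& v)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {   // four independent partial sums
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < v.size(); ++i)             // leftover elements
        s0 += v[i];
    return (s0 + s1) + (s2 + s3);
}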
Regarding sorting, it seems to me that if you expect cancellation then the numbers should be added in descending order of magnitude, not ascending. For instance:
((-1 + 1) + 1e-20) will give 1e-20
but
((1e-20 + 1) - 1) will give 0
In the first equation the two large numbers cancel out, whereas in the second the 1e-20 term gets lost when added to 1, since there is not enough precision to retain it.
Also, pairwise summation is pretty decent for summing lots of numbers.
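For completeness, here is a small recursive sketch of pairwise summation (my own illustration): split the range in half, sum each half, and add the two partial sums, which keeps the rounding error growing roughly like O(log n) instead of O(n) for a plain left-to-right loop:

#include <cstddef>

double pairwise_sum(const double* data, std::size_t n)
{
    if (n == 0) return 0.0;
    if (n == 1) return data[0];
    const std::size_t half = n / 2;        // split, sum each half, combine
    return pairwise_sum(data, half) + pairwise_sum(data + half, n - half);
}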