addition instead of subtraction in Kahan algorithm - c++

This is the Kahan summation algorithm from Wikipedia:
function KahanSum(input)
    var sum = 0.0
    var c = 0.0
    for i = 1 to input.length do
        y = input[i] - c    // why subtraction?
        t = sum + y
        c = (t - sum) - y
        sum = t
    return sum
Is there a specific reason why it uses subtraction (as opposed to addition)? If I swap the operands in the computation of c, can I use addition instead? Somehow, that would make more sense to me:
function KahanSum(input)
    var sum = 0.0
    var c = 0.0
    for i = 1 to input.length do
        y = input[i] + c    // addition instead of subtraction
        t = sum + y
        c = y - (t - sum)   // swapped operands
        sum = t
    return sum
Or is there some weird difference between floating point addition and subtraction I don't know about yet?
Also, is there any difference between (t - sum) - y and t - sum - y in the original algorithm? Aren't the parentheses redundant, since - is left-associative anyway?

As far as I can tell, your method is exactly equivalent to the one from Wikipedia. The only difference is that the sign of c -- and therefore its meaning -- is reversed. In the Wikipedia algorithm, c is the "wrong" part of the sum; c=0.0001 means that the sum is a little bigger than it should be. In your version, c is the "correction" to the sum; c=-0.0001 means that the sum should be made a little smaller.
And I think the parentheses are for readability. They're for us, not the machine.

Your two algorithms are equivalent. The only difference during execution will be the sign of c. The reason Kahan's version uses subtraction is that there c represents the error, which is conventionally the computed minus the correct value.
In the sense that parentheses specify the order of operations, that order is absolutely necessary. In fact, it is what makes this algorithm work!
Since subtraction is left-associative, as it is in most languages, a - b - c evaluates as (a - b) - c, so the two spellings are the same. But the expression in the Kahan algorithm is (t - sum) - y, and it must actually be evaluated in that order; it must not be algebraically simplified to t - (sum + y), even though the two are equal in exact arithmetic.
Floating-point addition and subtraction are not associative. For expressions that are equivalent in standard arithmetic, you may get different results depending on the order in which you perform the operations.
Let's work with 3 decimal digits of precision, for the sake of clarity. This means that if we get a result with 4 digits, we have to round it.
Now compare (a - b) - c with the mathematically equivalent a - (b + c) for some specific values:
(998 - 997) - 5 = 1 - 5 = -4
with
998 - (997 + 5) = 998 - Round(1002)
= 998 - 1000 = -2
So the second approach is less accurate.
In the Kahan algorithm, t and sum will usually be relatively large compared to y. So you often get a situation like in the example above where you would get a less accurate result if you don't do the operations in the correct order.
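For reference, here is a minimal C++ sketch of both variants (my own transcription of the pseudocode, not from either answer). Summing many small values into a large one shows that they produce the same compensated result; only the sign of c differs internally:

#include <cstdio>
#include <vector>

// Wikipedia variant: c is the excess in the sum, so it is subtracted from the next input.
double kahan_sub(const std::vector<double>& v) {
    double sum = 0.0, c = 0.0;
    for (double x : v) {
        double y = x - c;
        double t = sum + y;
        c = (t - sum) - y;   // excess introduced by this addition
        sum = t;
    }
    return sum;
}

// Questioner's variant: c is the correction, so it is added to the next input.
double kahan_add(const std::vector<double>& v) {
    double sum = 0.0, c = 0.0;
    for (double x : v) {
        double y = x + c;
        double t = sum + y;
        c = y - (t - sum);   // same magnitude as above, opposite sign
        sum = t;
    }
    return sum;
}

int main() {
    std::vector<double> v(1000000, 0.1);
    v[0] = 1e9;
    std::printf("%.6f %.6f\n", kahan_sub(v), kahan_add(v));   // identical output
}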


Correctness of multiplication with overflow detection

The following C++ template detects overflows from multiplying two unsigned integers.
#include <iostream>
using std::cerr; using std::endl;

template<typename UInt> UInt safe_multiply(UInt a, UInt b) {
    UInt x = a * b; // x := ab mod n, for n := 2^#bits > 0
    if (a != 0 && x / a != b)
        cerr << "Overflow for " << a << " * " << b << "." << endl;
    return x;
}
Can you give a proof that this algorithm detects every potential overflow, regardless of how many bits UInt uses?
The case a = 0 cannot result in overflows, so we can consider a >= 1.
It seems that the correctness proof boils down to leading the assumptions a*b >= n and floor(x/a) = b to a contradiction, since x / a actually means floor(x/a).
When assuming floor(x/a) = b and writing x = a*b - k*n for an integer k, this leads to the straightforward consequence a*b <= x, thus k*n <= 0, which (for k >= 1) contradicts n > 0.
So it remains to show k >= 1, or there must be another way.
As for whether the last claim is true, WolframAlpha fails to confirm it (also when phrased with exponents).
However, it asserts that the original assumptions have no integer solutions, so the algorithm seems to be correct indeed.
But it doesn't provide an explanation. So why is it correct?
I am looking for the smallest possible explanation that is still mathematically sound, ideally one that fits in a single-line comment. Maybe I am missing something trivial, or the problem is not as easy as it looks.
On a side note, I used Codecogs Equation Editor for the LaTeX markup images, which apparently looks bad in dark mode, so consider switching to light mode or, if you know, please tell me how to use different images depending on the client settings. It is just \bg{white} vs. \bg{black} as part of the image URLs.
To be clear, I'll use the multiplication and division symbols (*, /) mathematically.
Also, for convenience let's name the set N = {0, 1, ..., n - 1}.
Let's clear up what unsigned multiplication is:
Unsigned multiplication for some magnitude, n, is a modulo-n operation on unsigned-n inputs (inputs that are in N) that results in an unsigned-n output (i.e. also in N).
In other words, the result of unsigned multiplication, x, is x = a*b (mod n), and, additionally, we know that x,a,b are in N.
It's important to be able to expand many modular formulas like so: x = a*b - k*n, where k is an integer - but in our case x,a,b are in N so this implies that k is in N.
Now, let's mathematically say what we're trying to prove:
Given positive integers, a,n, and non-negative integers x,b, where x,a,b are in N, and x = a*b (mod n), then a*b >= n (overflow) implies floor(x/a) != b.
Proof:
If overflow (a*b >= n) then x >= n - k*n = (1 - k)*n (for k in N),
As x < n then (1 - k)*n < n, so k > 0.
This means x <= a*b - n.
So, floor(x/a) <= floor([a*b - n]/a) = floor(b - n/a) = b - ceil(n/a) <= b - 1, since 1 <= a < n implies ceil(n/a) >= 1.
Abbreviated: floor(x/a) <= b - 1
Therefore floor(x/a) != b
The multiplication gives either the mathematically correct result, or a result that is off by some multiple of 2^#bits (2^64 for a 64-bit type). Since you check for a = 0, the division always gives the correct result for its input. But in the case of overflow, the input is off by 2^#bits or more, so the test will fail as you hoped.
The last bit is that unsigned operations don’t have undefined behaviour except for division by zero, so your code is fine.
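To complement the proofs, here is a small exhaustive check (my own sketch, not part of either answer) over an 8-bit unsigned type. It re-implements the detection condition from safe_multiply and compares it against an exact product computed in a wider type; it should report no mismatches if the claim holds:

#include <cstdint>
#include <iostream>

template<typename UInt>
bool overflow_detected(UInt a, UInt b) {      // the condition used by safe_multiply
    UInt x = static_cast<UInt>(a * b);        // x = a*b mod 2^#bits
    return a != 0 && x / a != b;
}

int main() {
    for (unsigned a = 0; a < 256; ++a) {
        for (unsigned b = 0; b < 256; ++b) {
            bool truth = a * b > 255;          // exact product, no wrap-around
            bool detected = overflow_detected<std::uint8_t>(
                static_cast<std::uint8_t>(a), static_cast<std::uint8_t>(b));
            if (truth != detected)
                std::cout << "mismatch at " << a << " * " << b << "\n";
        }
    }
    std::cout << "done\n";
}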

Fast integer solution of x(x-1)/2 = c

Given a non-negative integer c, I need an efficient algorithm to find the largest integer x such that
x*(x-1)/2 <= c
Equivalently, I need an efficient and reliably accurate algorithm to compute:
x = floor((1 + sqrt(1 + 8*c))/2) (1)
For the sake of definiteness I tagged this question C++, so the answer should be a function written in that language. You can assume that c is an unsigned 32-bit int.
Also, if you can prove that (1) (or an equivalent expression involving floating-point arithmetic) always gives the right result, that's a valid answer too, since floating-point on modern processors can be faster than integer algorithms.
If you're willing to assume IEEE doubles with correct rounding for all operations including square root, then the expression that you wrote (plus a cast to double) gives the right answer on all inputs.
Here's an informal proof. Since c is a 32-bit unsigned integer being converted to a floating-point type with a 53-bit significand, 1 + 8*(double)c is exact, and sqrt(1 + 8*(double)c) is correctly rounded. 1 + sqrt(1 + 8*(double)c) is accurate to within one ulp, since the last term being less than 2**((32 + 3)/2) = 2**17.5 implies that the unit in the last place of the latter term is less than 1, and thus (1 + sqrt(1 + 8*(double)c))/2 is accurate to within one ulp, since division by 2 is exact.
The last piece of business is the floor. The problem cases here are when (1 + sqrt(1 + 8*(double)c))/2 is rounded up to an integer. This happens if and only if sqrt(...) rounds up to an odd integer. Since the argument of sqrt is an integer, the worst cases look like sqrt(z**2 - 1) for positive odd integers z, and we bound
z - sqrt(z**2 - 1) = z * (1 - sqrt(1 - 1/z**2)) >= 1/(2*z)
by Taylor expansion. Since z is less than 2**17.5, the gap to the nearest integer is at least 1/2**18.5 on a result of magnitude less than 2**17.5, which means that this error cannot result from a correctly rounded sqrt.
Adopting Yakk's simplification, we can write
(uint32_t)(0.5 + sqrt(0.25 + 2.0*c))
without further checking.
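As a quick check of that argument, here is a minimal sketch (mine, not part of the answer) that evaluates expression (1) in doubles and brute-force verifies the defining inequality over a prefix of the input range:

#include <cassert>
#include <cmath>
#include <cstdint>

uint32_t triangular_root(uint32_t c) {
    // floor((1 + sqrt(1 + 8c)) / 2); the cast truncates, which is floor for positive values
    return static_cast<uint32_t>((1.0 + std::sqrt(1.0 + 8.0 * static_cast<double>(c))) / 2.0);
}

int main() {
    for (uint64_t c = 0; c < 1000000; ++c) {   // spot-check a prefix of the 32-bit range
        uint64_t x = triangular_root(static_cast<uint32_t>(c));
        assert(x * (x - 1) / 2 <= c);          // x satisfies x*(x-1)/2 <= c
        assert((x + 1) * x / 2 > c);           // and is the largest such x
    }
}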
If we start with the quadratic formula, we quickly reach x = sqrt(1/4 + 2c), rounded to the nearest integer (rounding up at 1/2 or higher).
Now, if you do that calculation in floating point, there can be inaccuracies.
There are two approaches to deal with these inaccuracies. The first would be to carefully determine how big they are, determine if the calculated value is close enough to a half for them to be important. If they aren't important, simply return the value. If they are, we can still bound the answer to being one of two values. Test those two values in integer math, and return.
However, we can do away with that careful bit, and note that sqrt(1/4 + 2c) is going to have an error less than 0.5 if the values are 32 bits, and we use doubles. (We cannot make this guarantee with floats, as by 2^31 the float cannot handle +0.5 without rounding).
In essence, we use the quadratic formula to reduce it to two possibilities, and then test those two.
#include <cassert>
#include <cmath>
#include <cstdint>

uint64_t eval(uint64_t x) {
    return x*(x-1)/2;
}

unsigned solve(unsigned c) {
    double test = std::sqrt( 0.25 + 2.*c );
    if ( eval(test+1.) <= c )   // the double argument is truncated to uint64_t here
        return test+1.;
    assert( eval(test) <= c );
    return test;
}
Note that converting a positive double to an integral type rounds towards 0. You can insert floors if you want.
This may be a bit tangential to your question. But, what caught my attention is the specific formula. You are trying to find the triangular root of T(n-1) (where T(n) is the nth triangular number).
I.e.:
T(n) = n * (n + 1) / 2
and
T(n) - n = T(n-1) = n * (n - 1) / 2
From the nifty trick described here, for T(n) we have:
n = int(sqrt(2 * c))
Looking for n such that T(n-1) <= c in this case doesn't change the definition of n, for the same reason as in the original question.
Computationally, this saves a few operations, so it's theoretically faster than the exact solution (1). In reality, it's probably about the same.
Neither this solution nor the one presented by David is as "exact" as your (1), though.
[plot: floor((1 + sqrt(1 + 8*c))/2) (blue) vs int(sqrt(2 * c)) (red) vs exact (white line)]
[plot: floor((1 + sqrt(1 + 8*c))/2) (blue) vs int(sqrt(0.25 + 2 * c) + 0.5) (red) vs exact (white line)]
My real point is that triangular numbers are a fun set of numbers that are connected to squares, Pascal's triangle, Fibonacci numbers, et al.
As such there are loads of identities around them which might be used to rearrange the problem in a way that didn't require a square root.
Of particular interest may be that T(n) + T(n-1) = n^2
I'm assuming you know that you're working with a triangular number, but if you didn't realize that, searching for triangular roots yields a few questions such as this one which are along the same topic.

Generating random numbers given a uniform random number generator

I was asked to generate a random number between a and b, inclusive, using random(0,1). random(0,1) generates a uniform random number between 0 and 1.
I answered
(a+(((1+random(0,1))*b))%(b-a))
My interviewer was not satisfied with my usage of b in this piece of the expression:
(((1+random(0,1))*b))
Then I tried changing my answer to:
int*z=(int*)malloc(sizeof(int));
(a+(((1+random(0,1))*(*z)))%(b-a));
Later the question changed to generate random(1,7) from random(1,5). I responded with:
A = rand(1,5)%3
B = (rand(1,5)+1)%3
C = (rand(1,5)+2)%3
rand(1,7) = rand(1,5)+ (A+B+C)%3
Were my answers correct?
I think you were confused between a random integral-number generator and a random floating-point number generator. In C++, rand() generates a random integral number between 0 and RAND_MAX (commonly 32767). Thus to generate a random number from 1 to 10, we write rand() % 10 + 1. As such, to generate a random number from integer a to integer b, we write rand() % (b - a + 1) + a.
The interviewer told you that you had a random generator from 0 to 1. It means floating-point number generator.
How to get the answer mathematically:
Rule 1: Shift the question to a simple form such that the lower bound is 0.
Rule 2: Scale the range by multiplication.
Rule 3: Re-shift to the required range.
For example: to generate R such that
a <= R <= b.
Applying rule 1, we get a-a <= R - a <= b-a
0 <= R - a <= b - a.
Think of R - a as R1. How to generate R1 such that R1 has range from 0 to (b-a)?
R1 = rand(0, 1) * (b-a) // by applying rule 2
Now substitute R1 by R - a
R - a = rand(0,1) * (b-a) ==> R = a + rand(0,1) * (b-a)
==== 2nd question - without explanation ====
We have 1 <= R1 <= 5
==> 0 <= R1 - 1 <= 4
==> 0 <= (R1 - 1)/4 <= 1
==> 0 <= 6 * (R1 - 1)/4 <= 6
==> 1 <= 1 + 6 * (R1 - 1)/4 <= 7
Thus, Rand(1,7) = 1 + 6 * (rand(1,5) - 1) / 4
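Putting those two formulas into code, a minimal sketch (the names random01, random_ab and random17_from_random15 are mine; it assumes a uniform floating-point generator over [0,1)):

#include <random>

double random01() {                            // stand-in for the interviewer's random(0,1)
    static std::mt19937 gen{std::random_device{}()};
    static std::uniform_real_distribution<double> dist(0.0, 1.0);
    return dist(gen);
}

double random_ab(double a, double b) {         // R = a + random(0,1) * (b - a)
    return a + random01() * (b - a);
}

double random17_from_random15(double r15) {    // Rand(1,7) = 1 + 6 * (rand(1,5) - 1) / 4
    return 1.0 + 6.0 * (r15 - 1.0) / 4.0;
}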
random(a,b) from random(0,1):
random(0,1)*(b-a)+a
random(c,d) from random(a,b):
(random(a,b)-a)/(b-a)*(d-c)+c
or, simplified for your case (a=1,b=5,c=1,d=7):
random(1,5) * 1.5 - 0.5
(note: I assume we're talking about float values and that rounding errors are negligible)
random(a,b) from random(c,d) = a + (b-a)*((random(c,d) - c)/(d-c))
No?
[random(0,1)*(b-a)] + a, I think, would give random numbers between a and b.
([random(1,5)-1]/4)*6 + 1 should give random numbers in the range (1,7).
I am not sure whether the above will destroy the uniform distribution.
Were my answers correct?
I think there are some problems.
First off, I'm assuming that random() returns a floating point value - otherwise to generate any useful distribution of a larger range of numbers using random(0,1) would require repeated calls to generate a pool of bits to work with.
I'm also going to assume C/C++ is the intended platform, since the question is tagged as such.
Given these assumptions, one problem with your answers is that C/C++ do not allow the use of the % operator on floating point types.
But even if we imagine that the % operator was replaced with a function that performed a modulo operation with floating point arguments in a reasonable way, there are still some problems. In your initial answer, if b (or the uninitialized *z allocated in your second attempt - I'm assuming this is a kind of bizarre way to get an arbitrary value, or is something else intended?) is zero (say the range given for a and b is (-5, 0)), then your result will be decidedly non-uniform. The result would always be b.
Finally, I'm certainly no statistician, but in your final answer (to generate random(1,7) from random(1,5)), I'm pretty sure that A+B+C would be non-uniform and would therefore introduce a bias in the result.
I think that there is a nicer answer to this. There is one value (with probability approaching zero) for which this overflows the range, and that is why the modulus is there.
Take a random number x in the interval [0,1].
Increment your upper_bound (which could be a parameter) by one.
Calculate (int(random() / (1.0 / upper_bound)) % upper_bound) + 1 + lower_bound.
This ought to return a number in your desired interval.
Given random(0,5), you can generate random(0,7) in the following way:
A = random(0,5)*random(0,5)
Now the range of A is 0-25.
If we simply take A modulo 7, we can get random numbers, but they won't be truly random: for values of A from 21-25 you will get the remainders 0-4 after the modulo operation, hence taking modulo 7 over the range (0,25) will bias the output towards 0-4. This is because 7 does not evenly divide the range: the largest multiple of 7 less than or equal to 25 is 7*3 = 21, and it is the numbers in the incomplete range from 21-25 that cause the bias.
Easiest way to fix this problem is to discard those numbers (from 21-25) and to keep trying again until a number in the suitable range comes up.
Obviously, this is true when we assume that we want random integers.
However, to get random float numbers we need to modify the range accordingly, as described in the posts above.
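For the integer case, here is a minimal sketch of the discard-and-retry (rejection) idea (my own illustration, not the answer above verbatim: it combines two calls as 5*(a-1)+b, which is uniform over 1-25, rather than multiplying them, since a product of two uniform values is not uniform):

#include <random>

int rand15() {                                  // stand-in for the given rand(1,5)
    static std::mt19937 gen{std::random_device{}()};
    static std::uniform_int_distribution<int> dist(1, 5);
    return dist(gen);
}

int rand17() {                                  // uniform over 1..7 by rejection sampling
    for (;;) {
        int v = 5 * (rand15() - 1) + rand15();  // uniform over 1..25
        if (v <= 21)                            // keep only the complete cycles of 7
            return (v - 1) % 7 + 1;
        // otherwise discard and try again
    }
}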

Differentiate between a number (of type double) with decimal places and one without - c++

I am trying to implement a simple decimation algorithm in c++. I have two arrays, say p & q, where the subscripts are related to each other by the following relation:
p[k] = q[0.5*k]. This means that the following sequence should hold valid:
p[0] = q[0]
p[1] = 0
p[2] = q[1]
p[3] = 0
p[4] = q[2]
and so on...
Please note that p[k] takes on a value if and only if the result of (0.5*k) contains no decimal places (or has only 0 in the decimal part), without any rounding off etc.
My question is: Is there a way to distinguish between an integer (a number with no decimal places or only 0 in decimal, say 2.0) and a number with decimal places in C++, provided both are cast to double?
eg.) 2.0 is an integer cast to double. 2.1 is a number with decimal places.
eg. 2) 0.9*2 should put 0 into array p while 0.9*10 should put q[9] into array p.
If I use the statement, (int) (0.5*k), then I end up with an integer in every case, irrespective of the value of k.
Edit: The 0.5 in the above case is only illustrative. It could be any number, say 2, 2.5, 0.9, 0.95 etc.)
Any help is most welcome,
Thanks,
Sriram.
Assuming k is of an integer type, you could use if (k % 2 == 0) ... to check if k is divisible by two:
if (k % 2 == 0)
    p[k] = q[k / 2];
else
    p[k] = 0;
This can also be expressed using the ternary operator:
p[k] = (k % 2 == 0) ? q[k / 2] : 0;
Presuming that the coefficient can be anything else:
p[floor(coef*k)] = (fabs(coef*k-floor(coef*k))<1E-6)?q[k]:0;
The short syntax for what you want to do could be this:
p[k] = k % 2 ? 0 : q[k/2];
Is there a way to distinguish between a whole number and an integer in C++?
Define whole number, and define integer in this context. I'm confused!
Are you talking about the difference as explained here?
If you want to detect whether a number is integer or not, then probably this may help:
#include <cmath>

bool IsInteger(double d)
{
    double intpart;
    return std::modf(d, &intpart) == 0.0;
}
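As a usage sketch (mine, not from the answer), here is how such a test could plug into the decimation loop for an arbitrary coefficient:

#include <cmath>
#include <cstddef>

bool IsInteger(double d)
{
    double intpart;
    return std::modf(d, &intpart) == 0.0;
}

// p[k] = q[coef*k] when coef*k is a whole number, 0 otherwise.
void decimate(const double* q, std::size_t qlen, double* p, std::size_t plen, double coef)
{
    for (std::size_t k = 0; k < plen; ++k) {
        double idx = coef * k;
        if (IsInteger(idx) && idx < qlen)
            p[k] = q[static_cast<std::size_t>(idx)];
        else
            p[k] = 0.0;
    }
}

Note that an exact test like this can be defeated by rounding in coef*k; the tolerance-based comparison shown earlier (fabs(coef*k - floor(coef*k)) < 1E-6) is the more forgiving variant.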
k % 2 is in a couple of answers in this thread.
However, this is not useful in answering the OP's question. Note the edit:
"Edit: The 0.5 in the above case is only illustrative. It could be any number, say 2, 2.5, 0.9, 0.95 etc.)"
k % 2 only works because the value chosen was 0.5. It won't hold true for any other values.
Therefore, unless I'm missing something entirely, the simplest approach I can think of is the following:
Subtract the floor of the number from the number itself. If the result is > 0, it is not an integer.
Unless you have expressions that result in irrational numbers, you could use Boost.Rational to represent your indices.
@Aix's suggestion of k%2 looks like it'd combine nicely with the ?: operator:
p[k] = (k%2) ? 0 : q[k/2];

Fast fixed point pow, log, exp and sqrt

I've got a fixed point class (10.22) and I have a need of a pow, a sqrt, an exp and a log function.
Alas I have no idea where to even start on this. Can anyone provide me with some links to useful articles or, better yet, provide me with some code?
I'm assuming that once I have exp and log functions, it becomes relatively easy to implement pow and sqrt, as they just become:
pow( x, y ) => exp( y * log( x ) )
sqrt( x ) => pow( x, 0.5 )
It's just those exp and log functions that I'm finding difficult (though I remember a few of my log rules, I can't remember much else about them).
Presumably, there would also be a faster method for sqrt and pow, so any pointers on that front would be appreciated, even if it's just to say use the methods I outline above.
Please note: This HAS to be cross platform and in pure C/C++ code so I cannot use any assembler optimisations.
A very simple solution is to use a decent table-driven approximation. You don't actually need a lot of data if you reduce your inputs correctly. exp(a)==exp(a/2)*exp(a/2), which means you really only need to calculate exp(x) for 1 < x < 2. Over that range, a Runge-Kutta approximation would give reasonable results with ~16 entries IIRC.
Similarly, sqrt(a) == 2 * sqrt(a/4) == sqrt(4*a) / 2 which means you need only table entries for 1 < a < 4. Log(a) is a bit harder: log(a) == 1 + log(a/e). This is a rather slow iteration, but log(1024) is only 6.9 so you won't have many iterations.
You'd use a similar "integer-first" algorithm for pow: pow(x,y)==pow(x, floor(y)) * pow(x, frac(y)). This works because pow(double, int) is trivial (divide and conquer).
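For the pow(double, int) part, here is a minimal sketch (plain doubles rather than a fixed-point class, my own illustration of the divide-and-conquer reduction; pow_int and pow_split are made-up names):

#include <cmath>

// x^n for integer n >= 0 by binary exponentiation.
double pow_int(double x, unsigned n) {
    double result = 1.0;
    while (n > 0) {
        if (n & 1) result *= x;   // multiply in the current bit of the exponent
        x *= x;                   // square the base for the next bit
        n >>= 1;
    }
    return result;
}

// pow(x, y) == pow(x, floor(y)) * pow(x, frac(y)) for x > 0, y >= 0;
// the fractional part is handled here with exp/log just for illustration.
double pow_split(double x, double y) {
    double ip;
    double fp = std::modf(y, &ip);
    return pow_int(x, static_cast<unsigned>(ip)) * std::exp(fp * std::log(x));
}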
[edit] For the integral component of log(a), it may be useful to store a table 1, e, e^2, e^3, e^4, e^5, e^6, e^7 so you can reduce log(a) == n + log(a/e^n) by a simple hardcoded binary search of a in that table. The improvement from 7 to 3 steps isn't so big, but it means you only have to divide once by e^n instead of n times by e.
[edit 2]
And for that last log(a/e^n) term, you can use log(a/e^n) = log((a/e^n)^8)/8 - each iteration produces 3 more bits by table lookup. That keeps your code and table size small. This is typically code for embedded systems, and they don't have large caches.
[edit 3]
That's still not too smart on my side. log(a) = log(2) + log(a/2). You can just store the fixed-point value log2=0.6931471805599, count the number of leading zeroes, shift a into the range used for your lookup table, and multiply that shift (integer) by the fixed-point constant log2. Can be as low as 3 instructions.
Using e for the reduction step just gives you a "nice" log(e)=1.0 constant but that's false optimization. 0.6931471805599 is just as good a constant as 1.0; both are 32-bit constants in 10.22 fixed point. Using 2 as the constant for range reduction allows you to use a bit shift for a division.
[edit 5]
And since you're storing it in Q10.22, you can better store log(65536)=11.09035488. (16 x log(2)). The "x16" means that we've got 4 more bits of precision available.
You still get the trick from edit 2, log(a/2^n) = log((a/2^n)^8)/8. Basically, this gets you a result (a + b/8 + c/64 + d/512) * 0.6931471805599 - with b,c,d in the range [0,7]. a.bcd really is an octal number. Not a surprise since we used 8 as the power. (The trick works equally well with power 2, 4 or 16.)
[edit 4]
Still had an open end. pow(x, frac(y)) is just pow(sqrt(x), 2 * frac(y)), and we have a decent 1/sqrt(x). That gives us the far more efficient approach. Say frac(y)=0.101 binary, i.e. 1/2 plus 1/8. Then that means x^0.101 is (x^(1/2) * x^(1/8)). But x^(1/2) is just sqrt(x) and x^(1/8) is sqrt(sqrt(sqrt(x))). Saving one more operation, Newton-Raphson NR(x) gives us 1/sqrt(x), so we calculate 1.0/(NR(x)*NR(NR(NR(x)))). We only invert the end result; we don't use the sqrt function directly.
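And a minimal sketch (again plain doubles, my own illustration, without the 1/sqrt inversion trick) of walking the binary digits of frac(y) with repeated square roots, as described:

#include <cmath>

// x^frac for x > 0 and 0 <= frac < 1, one binary digit of frac at a time.
double pow_frac(double x, double frac) {
    double result = 1.0;
    double root = x;
    for (int i = 0; i < 24 && frac > 0.0; ++i) {   // 24 digits is plenty for an illustration
        root = std::sqrt(root);                     // root = x^(1/2^(i+1))
        frac *= 2.0;
        if (frac >= 1.0) {                          // this digit of frac is 1
            result *= root;
            frac -= 1.0;
        }
    }
    return result;
}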
Below is an example C implementation of Clay S. Turner's fixed-point log base 2 algorithm[1]. The algorithm doesn't require any kind of look-up table. This can be useful on systems where memory constraints are tight and the processor lacks an FPU, such as is the case with many microcontrollers. Log base e and log base 10 are then also supported by using the property of logarithms that, for any base n:
log_n(x) = log_m(x) / log_m(n)
where, for this algorithm, m equals 2.
A nice feature of this implementation is that it supports variable precision: the precision can be determined at runtime, at the expense of range. The way I've implemented it, the processor (or compiler) must be capable of doing 64-bit math for holding some intermediate results. It can be easily adapted to not require 64-bit support, but the range will be reduced.
When using these functions, x is expected to be a fixed-point value scaled according to the
specified precision. For instance, if precision is 16, then x should be scaled by 2^16 (65536). The result is a fixed-point value with the same scale factor as the input. A return value of INT32_MIN represents negative infinity. A return value of INT32_MAX indicates an error and errno will be set to EINVAL, indicating that the input precision was invalid.
#include <errno.h>
#include <stddef.h>
#include "log2fix.h"
#define INV_LOG2_E_Q1DOT31 UINT64_C(0x58b90bfc) // Inverse log base 2 of e
#define INV_LOG2_10_Q1DOT31 UINT64_C(0x268826a1) // Inverse log base 2 of 10
int32_t log2fix (uint32_t x, size_t precision)
{
    int32_t b = 1U << (precision - 1);
    int32_t y = 0;

    if (precision < 1 || precision > 31) {
        errno = EINVAL;
        return INT32_MAX; // indicates an error
    }

    if (x == 0) {
        return INT32_MIN; // represents negative infinity
    }

    // Normalize x into [1, 2) in the given fixed-point format,
    // accumulating the integer part of the result in y.
    while (x < 1U << precision) {
        x <<= 1;
        y -= 1U << precision;
    }

    while (x >= 2U << precision) {
        x >>= 1;
        y += 1U << precision;
    }

    // Compute the fractional bits by repeated squaring.
    uint64_t z = x;

    for (size_t i = 0; i < precision; i++) {
        z = z * z >> precision;
        if (z >= 2ULL << precision) {
            z >>= 1;
            y += b;
        }
        b >>= 1;
    }

    return y;
}
int32_t logfix (uint32_t x, size_t precision)
{
    uint64_t t;

    t = log2fix(x, precision) * INV_LOG2_E_Q1DOT31;
    return t >> 31;
}

int32_t log10fix (uint32_t x, size_t precision)
{
    uint64_t t;

    t = log2fix(x, precision) * INV_LOG2_10_Q1DOT31;
    return t >> 31;
}
The code for this implementation also lives at Github, along with a sample/test program that illustrates how to use this function to compute and display logarithms from numbers read from standard input.
[1] C. S. Turner, "A Fast Binary Logarithm Algorithm", IEEE Signal Processing Mag., pp. 124,140, Sep. 2010.
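A quick usage sketch (mine, not the sample program mentioned above), assuming the log2fix function above with 16 fractional bits:

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include "log2fix.h"   // assumed to declare log2fix as defined above

int main() {
    const std::size_t precision = 16;                                       // Q16.16
    std::uint32_t x = static_cast<std::uint32_t>(4.5 * (1u << precision));  // 4.5 scaled by 2^16
    std::int32_t y = log2fix(x, precision);                                 // result is Q16.16 too
    std::printf("log2(4.5) ~= %f\n", y / double(1u << precision));          // ~2.169925
}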
A good starting point is Jack Crenshaw's book, "Math Toolkit for Real-Time Programming". It has a good discussion of algorithms and implementations for various transcendental functions.
Check my fixed point sqrt implementation using only integer operations.
It was fun to invent. Quite old now.
https://groups.google.com/forum/?hl=fr%05aacf5997b615c37&fromgroups#!topic/comp.lang.c/IpwKbw0MAxw/discussion
Otherwise check the CORDIC set of algorithms. That's the way to implement all the functions you listed and the trigonometric functions.
EDIT : I published the reviewed source on GitHub here