Given a non-negative integer c, I need an efficient algorithm to find the largest integer x such that
x*(x-1)/2 <= c
Equivalently, I need an efficient and reliably accurate algorithm to compute:
x = floor((1 + sqrt(1 + 8*c))/2) (1)
For the sake of definiteness I tagged this question C++, so the answer should be a function written in that language. You can assume that c is a 32-bit unsigned int.
Also, if you can prove that (1) (or an equivalent expression involving floating-point arithmetic) always gives the right result, that's a valid answer too, since floating-point on modern processors can be faster than integer algorithms.
If you're willing to assume IEEE doubles with correct rounding for all operations including square root, then the expression that you wrote (plus a cast to double) gives the right answer on all inputs.
Here's an informal proof. Since c is a 32-bit unsigned integer being converted to a floating-point type with a 53-bit significand, 1 + 8*(double)c is exact, and sqrt(1 + 8*(double)c) is correctly rounded. Because that square-root term is less than 2**((32 + 3)/2) = 2**17.5, the unit in the last place of the sum 1 + sqrt(1 + 8*(double)c) is less than 1, and the sum is accurate to within one ulp. Consequently (1 + sqrt(1 + 8*(double)c))/2 is also accurate to within one ulp, since division by 2 is exact.
The last piece of business is the floor. The problem cases here are when (1 + sqrt(1 + 8*(double)c))/2 is rounded up to an integer. This happens if and only if sqrt(...) rounds up to an odd integer. Since the argument of sqrt is an integer, the worst cases look like sqrt(z**2 - 1) for positive odd integers z, and we bound
z - sqrt(z**2 - 1) = z * (1 - sqrt(1 - 1/z**2)) >= 1/(2*z)
by Taylor expansion. Since z is less than 2**17.5, the gap to the nearest integer is at least 1/2**18.5 on a result of magnitude less than 2**17.5, which means that this error cannot result from a correctly rounded sqrt.
Adopting Yakk's simplification, we can write
(uint32_t)(0.5 + sqrt(0.25 + 2.0*c))
without further checking.
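For reference, a minimal sketch of that expression as a function (the function name is mine; it assumes IEEE doubles with a correctly rounded sqrt, as in the argument above):

#include <cmath>
#include <cstdint>

uint32_t triangular_root(uint32_t c) {
    // 0.25 + 2.0*c is exact in a double (at most 35 significant bits), and the
    // analysis above shows the rounded result never crosses an integer boundary
    // the wrong way, so truncation gives the exact answer.
    return static_cast<uint32_t>(0.5 + std::sqrt(0.25 + 2.0 * c));
}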
If we start with the quadratic formula, we quickly reach x = sqrt(1/4 + 2c), rounded to the nearest integer (rounding up at a fractional part of 1/2 or higher).
Now, if you do that calculation in floating point, there can be inaccuracies.
There are two approaches to deal with these inaccuracies. The first would be to carefully determine how big they are, determine if the calculated value is close enough to a half for them to be important. If they aren't important, simply return the value. If they are, we can still bound the answer to being one of two values. Test those two values in integer math, and return.
However, we can do away with that careful bit, and note that sqrt(1/4 + 2c) is going to have an error less than 0.5 if the values are 32 bits, and we use doubles. (We cannot make this guarantee with floats, as by 2^31 the float cannot handle +0.5 without rounding).
In essence, we use the quadratic formula to reduce the problem to two possibilities, and then test those two.
#include <cassert>
#include <cmath>
#include <cstdint>

uint64_t eval(uint64_t x) {
    return x * (x - 1) / 2;
}

unsigned solve(unsigned c) {
    double test = std::sqrt(0.25 + 2.0 * c);
    uint64_t t = static_cast<uint64_t>(test);   // truncation towards zero
    if (eval(t + 1) <= c)                       // the true answer is t or t + 1
        return static_cast<unsigned>(t + 1);
    assert(eval(t) <= c);
    return static_cast<unsigned>(t);
}
Note that converting a positive double to an integral type rounds towards 0. You can insert floors if you want.
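As a quick sanity check (my addition, not part of the answer), the result can be verified against the defining inequality over a range of inputs:

#include <cassert>

void check_solve() {
    for (unsigned c = 0; c < 1000000; ++c) {
        unsigned x = solve(c);
        // x is the largest value satisfying x*(x-1)/2 <= c
        assert(eval(x) <= c && eval(x + 1ULL) > c);
    }
}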
This may be a bit tangential to your question. But what caught my attention is the specific formula. You are trying to find the triangular root of T(n-1) (where T(n) is the nth triangular number).
I.e.:
T(n) = n * (n + 1) / 2
and
T(n) - n = T(n-1) = n * (n - 1) / 2
From the nifty trick described here, for T(n) we have:
n = int(sqrt(2 * c))
Looking for n such that T(n-1) ≤ c in this case doesn't change the definition of n, for the same reason as in the original question.
Computationally, this saves a few operations, so it's theoretically faster than the exact solution (1). In reality, it's probably about the same.
Neither this solution nor the one presented by David is as "exact" as your (1), though.
[Plot: floor((1 + sqrt(1 + 8*c))/2) (blue) vs int(sqrt(2 * c)) (red) vs exact (white line)]
[Plot: floor((1 + sqrt(1 + 8*c))/2) (blue) vs int(sqrt(0.25 + 2 * c) + 0.5) (red) vs exact (white line)]
My real point is that triangular numbers are a fun set of numbers that are connected to squares, Pascal's triangle, Fibonacci numbers, et al.
As such there are loads of identities around them which might be used to rearrange the problem in a way that didn't require a square root.
Of particular interest may be that T(n) + T(n-1) = n^2
I'm assuming you know that you're working with a triangular number, but if you didn't realize that, searching for triangular roots yields a few questions, such as this one, on the same topic.
I am working on a cryptocurrency and there is a calculation that nodes must make:
average /= total;
double ratio = average/DESIRED_BLOCK_TIME_SEC;
int delta = -round(log2(ratio));
It is required that every node gets the exact same result no matter what architecture or standard library is used by the system. My understanding is that log2 might have different implementations that yield very slightly different results, or that flags like -ffast-math could affect the output.
Is there a simple way to convert the above calculation into something that is verifiably portable across different architectures (fixed point?), or am I overthinking the precision that is needed (given that I round the answer at the end)?
EDIT: Average is a long and total is an int... so average ends up rounded to the closest second.
DESIRED_BLOCK_TIME_SEC = 30.0 (it's a float) that is #defined
For this kind of calculation to be exact, one must either calculate all the divisions and logarithms exactly -- or one can work backwards.
-round(log2(x)) == round(log2(1/x)), meaning that one of the divisions can be turned around to get (1/x) >= 1.
round(log2(x)) == floor(log2(x * sqrt(2))) == binary_log((int)(x*sqrt(2))).
One minor detail here is whether (double)sqrt(2) rounds down or up. If it rounds up, then there might exist one or more values where x * sqrt2 == 2^n + epsilon (after rounding), whereas if it rounds down, we would get 2^n - epsilon. One would give the integer value n, the other n-1. Which is correct?
Naturally, the correct one is the one whose ratio to the theoretical midpoint x * sqrt(2) is smaller:
x * sqrt(2) / 2^(n-1) < 2^n / (x * sqrt(2)) -- multiply by x*sqrt(2)
x^2 * 2 / 2^(n-1) < 2^n -- multiply by 2^(n-1)
x^2 * 2 < 2^(2*n-1)
In order for this comparison to be exact, x^2 or pow(x,2) must be exact as well on the boundary -- and it matters what range the original values are in. A similar analysis can and should be done while expanding x = a/b, so that the inexactness of the division can be mitigated at the cost of possible overflow in the multiplication...
Then again, I wonder how all the other similar applications handle the corner cases, which may not even exist -- and those could be brute force searched assuming that average and total are small enough integers.
EDIT
Because average is an integer, it makes sense to tabulate the exact integer values that lie on the boundaries of -round(log2(average/30.0)).
From octave: d = -round(log2((1:1000000)/30.0)); find(d(2:end) ~= d(1:end-1))
1 2 3 6 11 22 43 85 170 340 679 1358 2716
5431 10862 21723 43445 86890 173779 347558 695115
All the averages in [1, 2) -> 5
All the averages in [2, 3) -> 4
All the averages in [3, 6) -> 3
...
All the averages in [43445, 86890) -> -11
int a = find_lower_bound(average, table); // linear or binary search
return 5 - a;
No floating point arithmetic needed
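A sketch of that lookup in C++ (my own illustration, not part of the answer): it assumes DESIRED_BLOCK_TIME_SEC == 30 and uses the boundary values from the octave output above; the function name is hypothetical.

#include <algorithm>
#include <cstdint>
#include <iterator>

int delta_from_average(int64_t average) {   // assumes average >= 1
    // First averages at which -round(log2(average/30.0)) drops by one more step,
    // taken from the octave output above ([1,2) -> 5, [2,3) -> 4, ...).
    static const int64_t boundaries[] = {
        2, 3, 6, 11, 22, 43, 85, 170, 340, 679, 1358, 2716,
        5431, 10862, 21723, 43445, 86890, 173779, 347558, 695115
    };
    // Number of boundaries that are <= average.
    int steps = static_cast<int>(
        std::upper_bound(std::begin(boundaries), std::end(boundaries), average)
        - std::begin(boundaries));
    return 5 - steps;
}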
Given 2 numbers where A <= B (say, for example, A = 9 and B = 10), I am trying to get the percentage of how much smaller A is compared to B. I need to have the percentage as an int, e.g. if the result is 10.00%, the int should be 1000.
Here is my code:
int A = 9;
int B = 10;
int percentage = (((1 - (double)A/B) / 0.01)) * 100;
My code returns 999 instead of 1000. Some precision related to the usage of double is lost.
Is there a way to avoid losing precision in my case?
Seems the formula you're looking for is
int result = 10000 - (A*10000+B/2)/B;
The idea is to do all computations in integers, delaying the division.
To do the rounding, half of the denominator is added before performing the division (otherwise the division truncates, and because the expression is 10000 minus that term, the result would be rounded upward).
For example, with A=9 and B=11 the percentage is 18.181818..., which rounds to 18.18; the computation without the rounding term would give 1819 instead of the expected result 1818.
Note that the computation is done entirely in integers, so there is a risk of overflow for large values of A and B. For example, if int is 32 bits then A can only be up to around 200000 before risking an overflow when computing A*10000.
Using A*10000LL instead of A*10000 in the formula will trade some speed to raise the limit to a much bigger value.
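For illustration, here is the 64-bit variant wrapped in a function (the function name is mine, not from the answer):

int percent_smaller_x100(int A, int B) {
    // 10000 corresponds to 100.00%; adding B/2 before dividing rounds to nearest.
    return 10000 - static_cast<int>((A * 10000LL + B / 2) / B);
}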
Of course there may be precision loss with floating-point numbers. Either use fixed-point arithmetic as #6502 answered, or add a bias to the result to get the intended answer.
You should instead do
assert(B != 0);
int percentage = ((A<0) == (B<0) ? 0.5 : -0.5) + (((1 - (double)A/B) / 0.01)) * 100;
Because of precision loss, the result of (((1 - (double)A/B) / 0.01)) * 100 may be slightly less or more than intended. If you add an extra 0.5, it is guaranteed to be slightly more than intended. Now when you cast this value to an integer, you get the intended answer. (You get the floor or ceiling value depending on whether the fractional part of the result was above or below 0.5.)
I tried
float floatpercent = (((1 - (double)A/B) / 0.01)) * 100;
int percentage = (int) floatpercent;
cout<< percentage;
displays 1000
I suspect precision loss in the automatic cast to int as the root problem in your code.
[I alluded to this in a comment to the original question, but I thought I'd post it as an answer.]
The core problem is that the form of expression you're using amplifies the unavoidable floating-point loss of precision when representing simple decimal fractions.
Your expression (with casts stripped out for now, using standard precedence to also avoid some parens)
((1 - A/B) / 0.01) * 100
is quite a complicated way of representing what you want, although it's algebraically correct. Unfortunately, floating-point numbers can only precisely represent numbers like 1/2, 1/4, 1/8, etc., their multiples, and sums of those. In particular, none of 9/10, 1/10, or 1/100 has a precise representation.
The above expression introduces these errors twice: first in the calculation of A/B, and then in the division by 0.01. These two imprecise values are then divided, which further amplifies the inherent error.
The most direct way to write what you meant (again without needed casts) is
((B-A) / B) * 10000
This produces the correct answer and considerably easier to read, I would suggest, than the original. The fully correct C form is
((B - A) / (double)B) * 10000
I've tested this and it works reliably. As others have noted, it's generally better to work with doubles instead of floats, as their extra precision makes them less prone (but not immune) to this sort of difficulty.
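A small test program (my addition) that puts the two forms side by side; the original form produced 999 on the questioner's IEEE-754 system, and this answer's rewritten form gives 1000, as described above:

#include <cstdio>

int main() {
    int A = 9, B = 10;
    int original  = (((1 - (double)A / B) / 0.01)) * 100;   // the questioner's expression
    int rewritten = ((B - A) / (double)B) * 10000;          // the rewritten expression
    std::printf("original = %d, rewritten = %d\n", original, rewritten);
}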
Let's say I am given integers x and y (satisfying x <= y with ones digit of 0 so they are, in particular, divisible by two). Then I know that their average avg = ((x+y) / 2) is an integer as well. I would like to find this midpoint rounded up to a resolution of 100. In other words if my two inputs are 75200 and 75300 then the avg is 75250 and rounded up to the nearest 100 (but without exceeding or equaling the bigger number) forces the answer to be 75200.
How can I implement this logic without first dividing everything by 100 and using the following floating point arithmetic:
x + std::floor((y - x) * .5 * 100 + .5)*0.01
In other words, how can I do the above without floating point values but obtain the same behavior at the resolution of 100 instead of 0.01?
To compute the average you can do
avg = (x + y) / 2
(BTW, integer addition and division by 2 are very cheap operations even on small microcontrollers.)
To round this to the nearest multiple of 100 (corresponding to your floating-point example) you can do
result = ((avg + 50) / 100) * 100
as integer division rounds down to the nearest integer. By changing the 50 to 0 you can always round down, while changing it to 99 always rounds up.
Edit: Note that this method for rounding doesn't work for negative numbers. Since integer division rounds towards zero, in that case you'll need to subtract the 50, subtract 99 to always round down and subtract 0 to always round up.
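As a short sketch (my own wrapper around the two expressions above, for non-negative inputs; the function name is mine):

#include <cstdint>

int64_t midpoint_to_nearest_100(int64_t x, int64_t y) {
    int64_t avg = (x + y) / 2;          // exact, since x and y have the same parity
    return ((avg + 50) / 100) * 100;    // nearest multiple of 100 (halves round up)
}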
Your problematic example requires strong conditions:
the difference between x and y needs to be not greater than 100
y % 100 must be 0
So for most cases, a simple rounded average is perfect for you:
avg100 = avg - (avg % 100) + 100
The tricky part is fixing the remaining error without a condition - if you want to avoid conditions, or slow operations.
For this, the best way is to use a multiplication, and split the expression into two:
avg100 = avg - (avg % 100)
avg100 += 100 * !!(y - avg100)
For most cases, y is greater than avg100. In that case, the !! operator will return 1. In the rare case when they are equal, it returns 0 and the value is left unchanged.
(I don't know whether the compiler will really generate code without conditions for the '!!' operator, but I don't have a better idea, and if it is possible, I think it will. If not, this code is still short and easy to understand.)
Also, you can calculate the average using the following expression:
avg = y - (y-x)/2
Or even change the division into a bit shift as an optimization.
This doesn't require both of the numbers to be even, just that they have the same parity.
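Put together, the statements above could look like this (a sketch only; the function name is mine):

#include <cstdint>

int64_t midpoint_rounded_up(int64_t x, int64_t y) {   // assumes 0 <= x <= y, same parity
    int64_t avg = y - (y - x) / 2;          // the average, as in the last expression above
    int64_t avg100 = avg - (avg % 100);     // round down to a multiple of 100
    avg100 += 100 * !!(y - avg100);         // add 100, except when avg100 already equals y
    return avg100;
}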
This is the Kahan summation algorithm from Wikipedia:
function KahanSum(input)
    var sum = 0.0
    var c = 0.0
    for i = 1 to input.length do
        y = input[i] - c    // why subtraction?
        t = sum + y
        c = (t - sum) - y
        sum = t
    return sum
Is there a specific reason why it uses subtraction (as opposed to addition)? If I swap the operands in the computation of c, can I use addition instead? Somehow, that would make more sense to me:
function KahanSum(input)
    var sum = 0.0
    var c = 0.0
    for i = 1 to input.length do
        y = input[i] + c    // addition instead of subtraction
        t = sum + y
        c = y - (t - sum)   // swapped operands
        sum = t
    return sum
Or is there some weird difference between floating point addition and subtraction I don't know about yet?
Also, is there any difference between (t - sum) - y and t - sum - y in the original algorithm? Aren't the parenthesis redundant, since - is left-associative, anyway?
As far as I can tell, your method is exactly equivalent to the one from Wikipedia. The only difference is that the sign of c -- and therefore its meaning -- is reversed. In the Wikipedia algorithm, c is the "wrong" part of the sum; c=0.0001 means that the sum is a little bigger than it should be. In your version, c is the "correction" to the sum; c=-0.0001 means that the sum should be made a little smaller.
And I think the parentheses are for readability. They're for us, not the machine.
Your two algorithms are equivalent. The only difference during execution will be the sign of c. Kahan's version uses subtraction because there c represents the error, the computed value minus the correct one; in your version c is the correct minus the computed value, which is why it gets added.
In the sense that parentheses specify the order of operations, the parentheses are absolutely necessary. In fact, they are what makes this algorithm work!
When subtraction is left-associative, as it is in most languages, a - b - c evaluates as (a - b) - c so the two are the same. But the subtraction in the Kahan algorithm is a - (b - c), and that should not be evaluated as a - b + c.
Floating-point addition and subtraction are not associative. For expressions that are equivalent in standard arithmetic, you may get different results depending on the order in which you perform the operations.
Let's work with 3 decimal digits of precision, for the sake of clarity. This means that if we get a result with 4 digits, we have to round it.
Now compare (a - b) - c with the mathematically equivalent a - (b + c) for some specific values:
(998 - 997) - 5 = 1 - 5 = -4
with
998 - (997 + 5) = 998 - Round(1002)
= 998 - 1000 = -2
So the second approach is less accurate.
In the Kahan algorithm, t and sum will usually be relatively large compared to y. So you often get a situation like in the example above where you would get a less accurate result if you don't do the operations in the correct order.
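For concreteness, here is the original subtraction-based pseudocode transcribed into C++ (my transcription, not code from the question):

#include <vector>

double kahan_sum(const std::vector<double>& input) {
    double sum = 0.0;
    double c = 0.0;                  // running compensation for lost low-order bits
    for (double x : input) {
        double y = x - c;            // subtract the error carried over from the last step
        double t = sum + y;          // if sum is large and y small, low-order bits of y are lost here
        c = (t - sum) - y;           // (t - sum) recovers what was actually added;
                                     // subtracting y isolates the lost part
        sum = t;
    }
    return sum;
}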
I've got a fixed point class (10.22) and I have a need of a pow, a sqrt, an exp and a log function.
Alas I have no idea where to even start on this. Can anyone provide me with some links to useful articles or, better yet, provide me with some code?
I'm assuming that once I have an exp function it becomes relatively easy to implement pow and sqrt, as they just become:
pow( x, y ) => exp( y * log( x ) )
sqrt( x ) => pow( x, 0.5 )
It's just those exp and log functions that I'm finding difficult (though I remember a few of my log rules, I can't remember much else about them).
Presumably there would also be a faster method for sqrt and pow, so any pointers on that front would be appreciated, even if it's just to say to use the methods I outline above.
Please note: this HAS to be cross-platform and in pure C/C++ code, so I cannot use any assembler optimisations.
A very simple solution is to use a decent table-driven approximation. You don't actually need a lot of data if you reduce your inputs correctly. exp(a)==exp(a/2)*exp(a/2), which means you really only need to calculate exp(x) for 1 < x < 2. Over that range, a Runge-Kutta approximation would give reasonable results with ~16 entries IIRC.
Similarly, sqrt(a) == 2 * sqrt(a/4) == sqrt(4*a) / 2 which means you need only table entries for 1 < a < 4. Log(a) is a bit harder: log(a) == 1 + log(a/e). This is a rather slow iteration, but log(1024) is only 6.9 so you won't have many iterations.
You'd use a similar "integer-first" algorithm for pow: pow(x,y)==pow(x, floor(y)) * pow(x, frac(y)). This works because pow(double, int) is trivial (divide and conquer).
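As an illustration of that divide-and-conquer step (my sketch, using double for brevity rather than the asker's fixed-point type):

double pow_int(double x, unsigned n) {   // x^n by exponentiation by squaring
    double result = 1.0;
    while (n > 0) {
        if (n & 1) result *= x;   // fold in the current bit of the exponent
        x *= x;                   // square the base for the next bit
        n >>= 1;
    }
    return result;
}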
[edit] For the integral component of log(a), it may be useful to store a table 1, e, e^2, e^3, e^4, e^5, e^6, e^7 so you can reduce log(a) == n + log(a/e^n) by a simple hardcoded binary search of a in that table. The improvement from 7 to 3 steps isn't so big, but it means you only have to divide once by e^n instead of n times by e.
[edit 2]
And for that last log(a/e^n) term, you can use log(a/e^n) = log((a/e^n)^8)/8 - each iteration produces 3 more bits by table lookup. That keeps your code and table size small. This is typically code for embedded systems, and they don't have large caches.
[edit 3]
That's still not too smart on my side. log(a) = log(2) + log(a/2). You can just store the fixed-point value log2=0.6931471805599, count the number of leading zeroes, shift a into the range used for your lookup table, and multiply that shift (an integer) by the fixed-point constant log2. It can be as low as 3 instructions.
Using e for the reduction step just gives you a "nice" log(e)=1.0 constant, but that's a false optimization. 0.6931471805599 is just as good a constant as 1.0; both are 32-bit constants in 10.22 fixed point. Using 2 as the constant for range reduction allows you to use a bit shift for the division.
[edit 5]
And since you're storing it in Q10.22, it's better to store log(65536)=11.09035488 (which is 16 x log(2)). The "x16" means that we've got 4 more bits of precision available.
You still get the trick from edit 2, log(a/2^n) = log((a/2^n)^8)/8. Basically, this gets you a result (a + b/8 + c/64 + d/512) * 0.6931471805599 - with b,c,d in the range [0,7]. a.bcd really is an octal number. Not a surprise since we used 8 as the power. (The trick works equally well with power 2, 4 or 16.)
[edit 4]
Still had an open end. pow(x, frac(y)) is just pow(sqrt(x), 2 * frac(y)), and we have a decent 1/sqrt(x). That gives us a far more efficient approach. Say frac(y)=0.101 binary, i.e. 1/2 plus 1/8. Then x^0.101 is (x^(1/2) * x^(1/8)). But x^(1/2) is just sqrt(x) and x^(1/8) is sqrt(sqrt(sqrt(x))). Saving one more operation, Newton-Raphson NR(x) gives us 1/sqrt(x), so we calculate 1.0/(NR(x)*NR(NR(NR(x)))). We only invert the end result; we don't use the sqrt function directly.
Below is an example C implementation of Clay S. Turner's fixed-point log base 2 algorithm[1]. The algorithm doesn't require any kind of look-up table. This can be useful on systems where memory constraints are tight and the processor lacks an FPU, such as is the case with many microcontrollers. Log base e and log base 10 are then also supported by using the property of logarithms that, for any base n:
logₙ(x) = logₘ(x) / logₘ(n)
where, for this algorithm, m equals 2.
A nice feature of this implementation is that it supports variable precision: the precision can be determined at runtime, at the expense of range. The way I've implemented it, the processor (or compiler) must be capable of doing 64-bit math for holding some intermediate results. It can be easily adapted to not require 64-bit support, but the range will be reduced.
When using these functions, x is expected to be a fixed-point value scaled according to the
specified precision. For instance, if precision is 16, then x should be scaled by 2^16 (65536). The result is a fixed-point value with the same scale factor as the input. A return value of INT32_MIN represents negative infinity. A return value of INT32_MAX indicates an error and errno will be set to EINVAL, indicating that the input precision was invalid.
#include <errno.h>
#include <stddef.h>
#include "log2fix.h"
#define INV_LOG2_E_Q1DOT31 UINT64_C(0x58b90bfc) // Inverse log base 2 of e
#define INV_LOG2_10_Q1DOT31 UINT64_C(0x268826a1) // Inverse log base 2 of 10
int32_t log2fix (uint32_t x, size_t precision)
{
    if (precision < 1 || precision > 31) {
        errno = EINVAL;
        return INT32_MAX; // indicates an error
    }

    if (x == 0) {
        return INT32_MIN; // represents negative infinity
    }

    int32_t b = 1U << (precision - 1); // weight of the first fractional bit of the result
    int32_t y = 0;

    // Normalize x into [1, 2) in the given fixed-point scale, accumulating the
    // integer part of the logarithm in y.
    while (x < 1U << precision) {
        x <<= 1;
        y -= 1U << precision;
    }

    while (x >= 2ULL << precision) {
        x >>= 1;
        y += 1U << precision;
    }

    // Repeatedly square the normalized value; each squaring yields one more bit
    // of the fractional part of the logarithm.
    uint64_t z = x;

    for (size_t i = 0; i < precision; i++) {
        z = z * z >> precision;
        if (z >= 2ULL << precision) {
            z >>= 1;
            y += b;
        }
        b >>= 1;
    }

    return y;
}
int32_t logfix (uint32_t x, size_t precision)
{
uint64_t t;
t = log2fix(x, precision) * INV_LOG2_E_Q1DOT31;
return t >> 31;
}
int32_t log10fix (uint32_t x, size_t precision)
{
uint64_t t;
t = log2fix(x, precision) * INV_LOG2_10_Q1DOT31;
return t >> 31;
}
The code for this implementation also lives at Github, along with a sample/test program that illustrates how to use this function to compute and display logarithms from numbers read from standard input.
[1] C. S. Turner, "A Fast Binary Logarithm Algorithm", IEEE Signal Processing Mag., pp. 124,140, Sep. 2010.
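A small usage sketch (my addition, not from the linked repository), assuming log2fix from the listing above is compiled into the same program: compute log2(4.5) with 16 fractional bits.

#include <stdint.h>
#include <stdio.h>

int32_t log2fix(uint32_t x, size_t precision);   // from the listing above

int main(void)
{
    size_t precision = 16;
    uint32_t x = (uint32_t)(4.5 * (1u << precision));              // 4.5 in Q16.16
    int32_t y = log2fix(x, precision);
    printf("log2(4.5) ~= %f\n", y / (double)(1u << precision));    // about 2.17
    return 0;
}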
A good starting point is Jack Crenshaw's book, "Math Toolkit for Real-Time Programming". It has a good discussion of algorithms and implementations for various transcendental functions.
Check my fixed point sqrt implementation using only integer operations.
It was fun to invent. Quite old now.
https://groups.google.com/forum/?hl=fr%05aacf5997b615c37&fromgroups#!topic/comp.lang.c/IpwKbw0MAxw/discussion
Otherwise check the CORDIC set of algorithms. That's the way to implement all the functions you listed and the trigonometric functions.
EDIT : I published the reviewed source on GitHub here