Efficient implementation of Faulhaber's Formula - C++

I want an efficient implementation of Faulhaber's formula.
I want the answer as
F(N,K) % P
where F(N,K) is an implementation of Faulhaber's formula and P is a prime number.
Note: N is very large, up to 10^16, and K is up to 3000.
I tried the double series implementation from the given site, but it is too slow for very large N and K. Can anyone help make this implementation more efficient, or describe some other way to implement the formula?

How about using Schultz' (1980) idea, outlined below the double series implementation (mathworld.wolfram.com/PowerSum.html) that you mentioned?
From Wolfram MathWorld:
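Schultz (1980) showed that the sum S_p(n) can be found by writing
$$S_p(n) = c_{p+1} n^{p+1} + \cdots + c_1 n$$
and solving the system of p+1 equations
$$\sum_{i=j+1}^{p+1} (-1)^{i-j+1} \binom{i}{j} c_i = \delta_{j,p}$$
obtained for j=0, 1, ..., p (Guo and Qi 1999), where delta (j,p) is the Kronecker delta.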
Below is an attempt in Haskell that seems to work. It returns a result for n=10^16, p=1000 in about 36 seconds on my old laptop PC.
{-# OPTIONS_GHC -O2 #-}
import Math.Combinatorics.Exact.Binomial
import Data.Ratio
import Data.List (foldl')
kroneckerDelta a b | a == b    = 1 % 1
                   | otherwise = 0 % 1
g a b = ((-1)^(a - b +1) * choose a b) % 1
coefficients :: Integral a => a -> a -> [Ratio a] -> [Ratio a]
coefficients p j cs
  | j < 0     = cs
  | otherwise = coefficients p (j - 1) (c:cs)
  where
    c = f / g (j + 1) j
    f = foldl h (kroneckerDelta j p) (zip [j + 2..p + 1] cs)
    h accum (i,cf) = accum - g i j * cf
trim r = let n = numerator r
             d = denominator r
             l = div n d
         in (mod l (10^9 + 7),(n - d * l) % d)
s n p = numerator (a % 1 + b) where
  (a,b) = foldl' (\(i',r') (i,r) -> (mod (i' + i) (10^9 + 7),r' + r)) (0,0)
          (zipWith (\c i -> trim (c * n^i)) (coefficients p p []) [1..p + 1])
main = print (s (10^16) 1000)
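Since the question asks for the result modulo a prime, the same back-substitution can also be carried out entirely in modular arithmetic, which avoids big rationals. The following C++ is only a sketch under assumptions that are not stated in the question: P is prime with K + 1 < P < 2^32 (so every pivot j + 1 is invertible and 64-bit products do not overflow), and the names `pow_mod` and `power_sum_mod` are made up for illustration. It solves, for j = K down to 0, the triangular system sum over i = j+1..K+1 of (-1)^(i-j+1) * C(i,j) * c_i = delta(j,K), then evaluates the polynomial at N mod P.

```cpp
#include <cstdint>
#include <vector>

using u64 = std::uint64_t;

// Square-and-multiply; assumes mod < 2^32 so that products fit in 64 bits.
u64 pow_mod(u64 base, u64 exp, u64 mod) {
    u64 result = 1 % mod;
    base %= mod;
    while (exp > 0) {
        if (exp & 1) result = result * base % mod;
        base = base * base % mod;
        exp >>= 1;
    }
    return result;
}

// 1^k + 2^k + ... + n^k modulo a prime p, assuming k + 1 < p < 2^32.
u64 power_sum_mod(u64 n, unsigned k, u64 p) {
    // Factorials and inverse factorials modulo p, used for binomials and pivots.
    std::vector<u64> fact(k + 2), inv_fact(k + 2);
    fact[0] = 1;
    for (unsigned i = 1; i <= k + 1; ++i) fact[i] = fact[i - 1] * i % p;
    inv_fact[k + 1] = pow_mod(fact[k + 1], p - 2, p);  // Fermat inverse
    for (unsigned i = k + 1; i > 0; --i) inv_fact[i - 1] = inv_fact[i] * i % p;
    auto choose = [&](unsigned a, unsigned b) {
        return fact[a] * inv_fact[b] % p * inv_fact[a - b] % p;
    };

    // c[i] is the coefficient of n^i, for i = 1 .. k+1.
    std::vector<u64> c(k + 2, 0);
    for (unsigned j = k + 1; j-- > 0;) {           // j = k, k-1, ..., 0
        u64 rhs = (j == k) ? 1 : 0;                // Kronecker delta(j, k)
        for (unsigned i = j + 2; i <= k + 1; ++i) {
            u64 term = choose(i, j) * c[i] % p;
            if ((i - j + 1) % 2 == 0)
                rhs = (rhs + p - term) % p;        // sign +1: subtract the term
            else
                rhs = (rhs + term) % p;            // sign -1: add the term
        }
        // The pivot is (+1) * C(j+1, j) = j + 1; its inverse is j! / (j+1)!.
        c[j + 1] = rhs * (fact[j] * inv_fact[j + 1] % p) % p;
    }

    // Evaluate c[1]*n + c[2]*n^2 + ... + c[k+1]*n^(k+1) modulo p.
    u64 x = n % p, power = 1, sum = 0;
    for (unsigned i = 1; i <= k + 1; ++i) {
        power = power * x % p;
        sum = (sum + c[i] * power) % p;
    }
    return sum;
}
```

A call such as `power_sum_mod(10000000000000000ULL, 3000, 1000000007)` does on the order of K^2 modular multiplications. If P can be as small as K + 1 or larger than 2^32, this sketch does not apply as written.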

I've discovered my own algorithm to calculate the coefficients of the polynomial obtained from Faulhaber's formula; the algorithm, its proof and several implementations can be found at github.com/fcard/PolySum. This question inspired me to include a C++ implementation (using the GMP library for arbitrary precision numbers), which, as of the time of writing and minus several usability features, is:
#include <gmpxx.h>
#include <vector>
namespace polysum {
    typedef std::vector<mpq_class> mpq_row;
    typedef std::vector<mpq_class> mpq_column;
    typedef std::vector<mpq_row> mpq_matrix;
    mpq_matrix make_matrix(size_t n) {
        mpq_matrix A(n+1, mpq_row(n+2, 0));
        A[0] = mpq_row(n+2, 1);
        for (size_t i = 1; i < n+1; i++) {
            for (size_t j = i; j < n+1; j++) {
                A[i][j] += A[i-1][j];
                A[i][j] *= (j - i + 2);
            }
            A[i][n+1] = A[i][n-1];
        }
        A[n][n+1] = A[n-1][n+1];
        return A;
    }
    void reduced_row_echelon(mpq_matrix& A) {
        size_t n = A.size() - 1;
        for (size_t i = n; i+1 > 0; i--) {
            A[i][n+1] /= A[i][i];
            A[i][i] = 1;
            for (size_t j = i-1; j+1 > 0; j--) {
                auto p = A[j][i];
                A[j][i] = 0;
                A[j][n+1] -= A[i][n+1] * p;
            }
        }
    }
    mpq_column sum_coefficients(size_t n) {
        auto A = make_matrix(n);
        reduced_row_echelon(A);
        mpq_column result;
        for (auto row: A) {
            result.push_back(row[n+1]);
        }
        return result;
    }
}
We can use the above like so:
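#include <gmpxx.h>
#include "polysum.h"
// N is an unsigned long so that values such as 10^16 fit on platforms where long is 64 bits
mpq_class power_sum(size_t K, unsigned long N) {
    auto coeffs = polysum::sum_coefficients(K);
    mpq_class result(0);
    mpz_class power(1);
    for (size_t i = 0; i <= K; i++) {
        power *= N; // power == N^(i+1), computed exactly rather than through floating-point pow
        result += coeffs[i] * mpq_class(power);
    }
    return result;
}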
The full implementation provides a Polynomial class that is printable and callable, as well as a polysum function to construct one as a sum of another polynomial.
#include <iostream>
#include "polysum.h"
void power_sum_print(size_t K, unsigned int N) {
    auto F = polysum::polysum(K);
    std::cout << "polynomial: " << F << "\n";
    std::cout << "result: " << F(N) << "\n";
}
As for efficiency, the above calculates the result for K=1000 and N=1e16 in about 1.75 seconds on my computer, compared to the much more mature and optimized SymPy implementation, which takes about 90 seconds on the same machine, and Mathematica, which takes 30 seconds. For K=3000 the above takes about 4 minutes; Mathematica took almost 20 minutes (but uses much less memory), and I left SymPy running all night but it didn't finish, maybe because it ran out of memory.
Among the optimizations that can be done here are making the matrix sparse and taking advantage of the fact that only half of the rows and columns need to be calculated. The Rust version in the linked repository implements the sparse and rows optimizations, and takes about 0.7 seconds to calculate K=1000 and about 45 seconds to calculate K=3000 (using 105 MB and 2.9 GB of memory respectively). The Haskell version implements all three optimizations and takes about 1 second for K=1000 and about 34 seconds for K=3000 (using 60 MB and 880 MB of memory respectively). The completely unoptimized Python implementation takes about 12 seconds for K=1000 but runs out of memory for K=3000.
It's looking like this method is the fastest regardless of the language used, but the research is ongoing. Since Schultz's method also boils down to solving a system of n+1 equations and should be able to be optimized the same way, it will depend on whether his matrix is faster to calculate or not. Also, memory usage is not scaling well at all, and Mathematica is still the clear winner there, using only 80 MB for K=3000. We'll see.

Related

How do I speed up this program to find fibonacci sequence

I am doing this coding question where they ask you to enter numbers N and M, and you are supposed to output the Nth fibonacci number mod M. My code runs rather slowly and I would like to learn how to speed it up.
#include<bits/stdc++.h>
using namespace std;
long long fib(long long N)
{
if (N <= 1)
return N;
return fib(N-1) + fib(N-2);
}
int main ()
{
long long N;
cin >> N;
long long M;
cin >> M;
long long b;
b = fib(N) % M;
cout << b;
getchar();
return 0;
}
While the program you wrote is pretty much the go-to example of recursion in education, it is really a pretty damn bad algorithm as you have found out. Try to write up the call tree for fib(7) and you will find that the number of calls you make balloons dramatically.
There are many ways of speeding it up and keeping it from recalculating the same values over and over. Somebody already linked to a bunch of algorithms in the comments - a simple loop can easily make it linear in N instead of exponential.
One problem with this though is that fibonacci numbers grow pretty fast: You can hold fib(93) in a 64 bit integer, but fib(94) overflows it.
However, you don't want the N'th Fibonacci number, you want the N'th mod M. This changes the challenge a bit, because as long as M is smaller than INT64_MAX / 2 you can calculate fib(N) mod M for any N.
Turn your attention to Modular arithmetic and the congruence relations. Specifically the one for addition, which says (changed to C++ syntax and simplified a bit):
If a1 % m == b1 and a2 % m == b2 then (a1 + a2) % m == (b1 + b2) % m
Or, to give an example: 17 % 3 == 2, 22 % 3 == 1 => (17 + 22) % 3 == (2 + 1) % 3 == 3 % 3 == 0
This means that you can put the modulo operator into the middle of your algorithm so that you never add big numbers together and never overflow. This way you can easily calculate f.ex. fib(10000) mod 237.
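As a rough sketch of that idea, the loop below keeps only the last two values and reduces them modulo M after every addition, so the intermediate sums stay below 2*M. The function name `fib_mod` and the use of unsigned 64-bit integers are choices made for this illustration, and it assumes 1 <= m <= 2^63 so that the sum of two residues cannot overflow.

```cpp
#include <cstdint>

// Returns fib(n) mod m, with fib(0) = 0 and fib(1) = 1.
// Assumes 1 <= m <= 2^63 so that a + b never overflows 64 bits.
std::uint64_t fib_mod(std::uint64_t n, std::uint64_t m) {
    std::uint64_t a = 0;       // fib(i) mod m
    std::uint64_t b = 1 % m;   // fib(i + 1) mod m
    for (std::uint64_t i = 0; i < n; ++i) {
        std::uint64_t next = (a + b) % m;  // reduce right away: the sum is below 2*m
        a = b;
        b = next;
    }
    return a;
}
```

This is linear in N, which is fine for something like `fib_mod(10000, 237)`; for very large N a logarithmic method (matrix powers or fast doubling) would be the next step.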
There is one simple optimization: call fib without calculating duplicate values. Using a loop instead of recursion also speeds up the process:
int fib(int N) {
    int f0 = 0;
    int f1 = 1;
    for (int i = 0; i < N; i++) {
        int tmp = f0 + f1;
        f0 = f1;
        f1 = tmp;
    }
    return f0; // after N iterations f0 holds fib(N); f1 would be fib(N+1)
}
You can apply the modulo operator suggested by @Frodyne on top of this.
1st observation is that you can turn the recursion into a simple loop:
#include <cstdint>
std::uint64_t fib(std::uint16_t n) {
if (!n)
return 0;
std::uint64_t result[]{ 0,1 };
bool select = 1;
for (auto i = 1; i < n; ++i , select=!select)
{
result[!select] += result[select];
};
return result[select];
};
next you can memoize it:
#include <cstdint>
#include <vector>
std::uint64_t fib(std::uint16_t n) {
static std::vector<std::uint64_t> result{0,1};
if (result.size()>n)
return result[n];
std::uint64_t back[]{ result.crbegin()[1],result.back() };
bool select = 1;
result.reserve(n + 1);
for (auto i=result.size(); i < result.capacity();++i, select = !select)
result.push_back(back[!select] += back[select]);
return result[n];
};
Another option would be an algebraic formula.
cheers,
FM.

Minimum cuts on a rectangle to make into squares

I'm trying to solve this problem:
Given an a×b rectangle, your task is to cut it into squares. On each move you can select a rectangle and cut it into two rectangles in such a way that all side lengths remain integers. What is the minimum possible number of moves?
My logic is that the minimum number of cuts means the minimum number of squares; I don't know if it's the correct approach.
First I see which side is smaller. I need bigSide/smallSide cuts to get squares with side smallSide, and I am then left with a rectangle of sides smallSide and bigSide%smallSide. I repeat this until either side is 0 or both sides are equal.
#include <iostream>
int main() {
int a, b; std::cin >> a >> b; // sides of the rectangle
int res = 0;
while (a != 0 && b != 0) {
if (a > b) {
if (a % b == 0)
res += a / b - 1;
else
res += a / b;
a = a % b;
} else if (b > a) {
if (b % a == 0)
res += b / a - 1;
else
res += b / a;
b = b % a;
} else {
break;
}
}
std::cout << res;
return 0;
}
When the input is 404 288, my code gives 18, but the right answer is actually 10.
What am I doing wrong?
It seems clear to me that the problem defines each move as cutting a rectangle into two rectangles along the integer lines, and then asks for the minimum number of such cuts. As you can see, there is a clear recursive nature in this problem. Once you cut a rectangle into two parts, you can recurse and cut each of them into squares with minimum moves and then sum up the answers. The problem is that the recursion might lead to exponential time complexity, which leads us directly to dynamic programming. You have to use memoization to solve it efficiently (worst case time O(a*b*(a+b))). Here is what I'd suggest doing:
#include <iostream>
#include <vector>
using std::vector;
int min_cuts(int a, int b, vector<vector<int> > &mem) {
    int min = mem[a][b];
    // if already computed, just return the value
    // (0 is a valid answer for squares, so test for >= 0)
    if (min >= 0)
        return min;
    // if one side is divisible by the other,
    // store min-cuts in 'min'
    if (a%b==0)
        min= a/b-1;
    else if (b%a==0)
        min= b/a -1;
    // if there's no obvious solution, recurse
    else {
        // recurse on height; i <= a/2 so that the middle cut is tried too
        for (int i=1; i<=a/2; i++) {
            int m = min_cuts(i,b, mem);
            int n = min_cuts(a-i, b, mem);
            if (min<0 or m+n+1<min)
                min = m + n + 1;
        }
        // recurse on width; j <= b/2 for the same reason
        for (int j=1; j<=b/2; j++) {
            int m = min_cuts(a,j, mem);
            int n = min_cuts(a, b-j, mem);
            if (min<0 or m+n+1<min)
                min = m + n + 1;
        }
    }
    mem[a][b] = min;
    return min;
}
int main() {
    int a, b; std::cin >> a >> b; // sides of the rectangle
    // -1 means the problem is not solved yet,
    vector<vector<int> > mem(a+1, vector<int>(b+1, -1));
    int res = min_cuts(a,b,mem);
    std::cout << res << std::endl;
    return 0;
}
The reason the for loops only go up to a/2 and b/2 is that cutting the paper is symmetric: cutting along line i is the same as cutting along line a-i if you flip the paper. This little optimization roughly halves the work per subproblem.
Another little hack: transposing the paper does not change the result, meaning min_cuts(a,b)=min_cuts(b,a), so you can potentially halve the computation again. Any major further improvement, say a greedy algorithm, would take more thinking (if one exists at all).
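To sanity-check the recurrence on small inputs, here is a plain bottom-up table version. It is a sketch written for this illustration (the name `min_cuts_table` is made up, and it assumes a, b >= 1); it tries every cut position up to the middle in both directions and skips the divisibility shortcut.

```cpp
#include <algorithm>
#include <vector>

// dp[h][w] = minimum number of cuts that turn an h x w rectangle into squares.
// Assumes a >= 1 and b >= 1.
int min_cuts_table(int a, int b) {
    std::vector<std::vector<int>> dp(a + 1, std::vector<int>(b + 1, 0));
    for (int h = 1; h <= a; ++h) {
        for (int w = 1; w <= b; ++w) {
            if (h == w) continue;      // already a square: 0 cuts
            int best = h * w;          // loose upper bound (unit squares need h*w - 1 cuts)
            for (int i = 1; i <= h / 2; ++i)   // split the height into i and h - i
                best = std::min(best, dp[i][w] + dp[h - i][w] + 1);
            for (int j = 1; j <= w / 2; ++j)   // split the width into j and w - j
                best = std::min(best, dp[h][j] + dp[h][w - j] + 1);
            dp[h][w] = best;
        }
    }
    return dp[a][b];
}
```

A 5x6 rectangle shows why the greedy idea from the question can lose: cutting off the 5x5 square first leaves a 5x1 strip and costs 5 cuts in total, while splitting into 2x6 and 3x6 needs only 4, which is what the table returns.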
The current answer is a good start, especially the suggestions to use memoization or dynamic programming, and potentially efficient enough.
Obviously, all answerers used the first with a sub-par data-structure. Vector-of-Vector has much space and performance overhead, using a (strict) lower triangular matrix stored in an array is much more efficient.
Using the maximum value as sentinel (easier with unsigned) would also reduce complexity.
Finally, let's move to dynamic programming instead of memoization to simplify and get even more efficient:
#include <algorithm>
#include <memory>
#include <utility>
constexpr unsigned min_cuts(unsigned a, unsigned b) {
    if (a < b)
        std::swap(a, b);
    if (a == b || !b)
        return 0;
    const auto triangle = [](std::size_t n) { return n * (n - 1) / 2; };
    const auto p = std::make_unique_for_overwrite<unsigned[]>(triangle(a));
    /* const! */ unsigned zero = 0;
    const auto f = [&](auto a, auto b) -> auto& {
        if (a < b)
            std::swap(a, b);
        return a == b ? zero : p[triangle(a - 1) + b - 1];
    };
    for (auto i = 1u; i <= a; ++i) {
        for (auto j = 1u; j < i; ++j) {
            auto r = -1u;
            for (auto k = i / 2; k; --k)
                r = std::min(r, f(k, j) + f(i - k, j));
            for (auto k = j / 2; k; --k)
                r = std::min(r, f(k, i) + f(j - k, i));
            f(i, j) = ++r;
        }
    }
    return f(a, b);
}

Calculating Champernowne constant C10 using Boost

I am attempting to calculate the Champernowne constant C10 using the following formula:
In the above formula, I substitute b for 10 to calculate C10. I want to be able to calculate the constant to any precision using Boost's cpp_dec_float.
Here is my code:
#include <boost/multiprecision/cpp_dec_float.hpp>
const long long PRECISION = 100;
typedef boost::multiprecision::number<
boost::multiprecision::cpp_dec_float<PRECISION> > arbFloat;
arbFloat champernowne()
{
arbFloat c, sub, n, k;
std::string precomp_c, postcomp_c;
for(n = 1; n == 1 || precomp_c != postcomp_c; ++n) {
for(k = 1; k <= n; ++k) {
sub += floor(log10(k));
}
precomp_c = static_cast<std::string>(c);
c += n / pow(10, n + sub);
postcomp_c = static_cast<std::string>(c);
}
return c;
}
Here's a breakdown of the code:
I begin by defining a variable arbFloat which has a precision of 100 digits (this is changed often — so I don't want to use cpp_dec_float_100).
The formula has two blocks of summation, so I implement them using two for-loops. In the innermost for-loop I calculate the summation beginning with k = 1 conditional upon k <= n for floor(log10(k)).
I have verified that using floor() and log10() on cpp_dec_float returns variables with correct precision.
Because the outermost summation goes until infinity, I have to stop calculations at some point. To check whether the precision has been exceeded, I cast c to a string before I calculate c += n / pow(10, n + sub) - and then I cast it to a string after I do the calculation. If the strings are the same, I end the calculations because the precision has been exceeded (further calculations would be redundant).
I have also used this set up (with string casting and comparison to check exceeded precision) to calculate other variables - and it works very well.
Next I calculate the outermost summation of c += n / pow(10, n + sub) - using pow() in this manner does maintain the precision. Finally, I return c.
When I run this program, I get the following variable:
0.1234567891001100120001300001400000150000001600000001700000000180000000001900000000002000000000000210
vs. the real Champernowne constant C10:
0.1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253546
Only the first 11 digits are correct, and the rest are not. I am not able to find where I am going wrong. I have tried the following:
Tried replacing c += n / pow(10, n + sub) with c += n / pow(static_cast<arbFloat>(10), n + sub) to check if pow() was not maintaining precision - but it didn't change anything.
Tried replacing floor() with a method of casting log10(k) to a string and "rounding" the string (keep only characters before .) - but it didn't change anything.
Tried changing k <= n to k < n, k <= n + 1 - just in case I was misinterpreting the summation - but that only made it more inaccurate.
If I need to explain more, let me know. Any help would be much appreciated!
The previous value of sub is being carried forward on each iteration; declare it inside the loop.
arbFloat champernowne() {
arbFloat c;
for (int n = 1;; ++n) {
arbFloat sub;
for (int k = 1; k <= n; ++k) {
sub += floor(log10(k));
}
arbFloat const last = c;
c += n / pow(10, n + sub);
if (c == last) {
break;
}
}
return c;
}
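One way to see why `sub` has to start from zero for every n: the exponent n + sum over k = 1..n of floor(log10(k)) is the total number of digits in 1, 2, ..., n, which is the decimal position where the last digit of n lands. The sketch below checks this with plain integers and strings rather than Boost; the helper names (`digit_count`, `term_exponent`, `champernowne_digits`) are made up for this illustration.

```cpp
#include <string>

// Number of decimal digits of k (k >= 1), i.e. floor(log10(k)) + 1, using integers only.
unsigned digit_count(unsigned long long k) {
    unsigned d = 0;
    do { ++d; k /= 10; } while (k != 0);
    return d;
}

// Exponent of the n-th term: n + sum_{k=1}^{n} floor(log10(k)).
// The inner sum is rebuilt from zero for every n, as in the corrected loop.
unsigned long long term_exponent(unsigned n) {
    unsigned long long sub = 0;
    for (unsigned k = 1; k <= n; ++k) sub += digit_count(k) - 1;
    return n + sub;
}

// The digits after the decimal point, built by concatenating 1, 2, ..., upto.
std::string champernowne_digits(unsigned upto) {
    std::string s;
    for (unsigned n = 1; n <= upto; ++n) s += std::to_string(n);
    return s;
}
```

For example, `term_exponent(10)` is 11, so the term 10 / 10^11 places "10" in positions 10 and 11. If `sub` keeps accumulating across iterations, the exponents grow too fast and the terms land further and further to the right, which matches the growing runs of zeros in the output shown in the question.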

Program to display a sum in C++?

I was given a task to write a program that displays:
I coded this:
#include<iostream.h>
#include<conio.h>
void main()
{
clrscr();
int a, n = 1, f = 1;
float s = 0;
cin >> a;
while(n <= a)
{
f = f * n;
s += 1 / (float)f;
n = n + 1;
}
cout << s;
getch();
}
So this displays -
s = 1 + 1/2! + 1/3! + 1/4! .... + 1/a!, including odd and even factorials.
For the past two hours I am trying to figure out how can I modify this code so that it displays the desired result. But I couldn't figure it out yet.
Question:
What changes should I make to my code?
You need to accumulate the sum while checking the counter and only add the even factorials (factorial here is assumed to be a helper that returns i! as a double):
int n;
double sum = 1;
cin >> n;
for(int i = 2; i <= n; ++i){
    if(i % 2 == 0) sum += 1 / factorial(i);
}
In your code:
while(n <= a)
{
f = f * n;
// checks if n is even;
// n even if the remainder of the division by 2 is zero
if(n % 2 == 0){
s += 1 / (float)f;
}
n = n + 1;
}
12! is the largest factorial that fits in a 32-bit integer. You should use double for all the numbers. For even factorials, start with f = 1 (0!) and update f = f * (n-1) * n, where n = 2, 4, 6, 8, ... .
You have almost everything you need in place (assuming you don't want to make design changes based on the issues brought up in the comments).
All you need to change is what you multiply f by in each step. To build up n! you are multiplying by n in each step. To build up (2n)! you would multiply by 2*n*(2*n-1)
Edit: Your second theory about what the instructor wants would need only slightly more of a change. Your inner loop could be replaced by
while(n < a)
{
f = f * n * (n+1);
s += 1 / f;
n = n + 2;
}
Edit2: To run your program I made several changes for I/O things you did that don't work in my copy of GCC. Hopefully those won't distract from the main point of the following code. I also added a second, more complicated and more accurate method of computing the answer to see how much was lost in floating point rounding.
So this code computes the answer twice, once by the method I suggested you change your code to and once by a more accurate method (using double instead of float and adding the numbers in the more accurate order via a recursive function). Then it displays your answer and the difference between the two answers.
Running it shows that the version I suggested gets all the displayed digits correct. For the values of a I tried, it is off only by tiny amounts that would need more display precision to notice:
#include<iostream>
using namespace std;
double fac_sum(int n, int a, double f)
{
if ( n > a )
return 0;
f *= n * (n-1);
return fac_sum(n+2, a, f) + 1 / f;
}
int main()
{
int a, n = 1;
float f = 1;
float s = 0;
cin >> a;
while(n < a)
{
f = f * n * (n+1);
s += 1 / f;
n = n + 2;
}
cout << s;
cout << " approx error was " << fac_sum( 2, a, 1.0)-s;
return 0;
}
For 8 that displays 0.54308 approx error was -3.23568e-08
I hope you understand the e-08 notation meaning the error is in the 8'th digit to the right of the .
Edit3: I changed f to float in this post because I had copied/tested thinking f was float, so parts of my answer didn't make sense when f was int

overflow possibilities in modular exponentiation by squaring

I am looking to implement Fermat's little theorem for prime testing. Here's the code I have written:
lld expo(lld n, lld p) //2^p mod n
{
if(p==0)
return 1;
lld exp=expo(n,p/2);
if(p%2==0)
return (exp*exp)%n;
else
return (((exp*exp)%n)*2)%n;
}
bool ifPseudoPrime(lld n)
{
if(expo(n,n)==2)
return true;
else
return false;
}
NOTE: I took the value of a(<=n-1) as 2.
Now, the number n can go as large as 10^18. This means that the variable exp can reach values near 10^18, which further implies that the expression (exp*exp) can reach as high as 10^36, causing overflow. How do I avoid this?
I tested this and it ran fine till 10^9. I am using C++
If the modulus is close to the limit of the largest integer type you can use, things get somewhat complicated. If you can't use a library that implements biginteger arithmetic, you can roll a modular multiplication yourself by splitting the factors in low-order and high-order parts.
If the modulus m is so large that 2*(m-1) overflows, things get really fussy, but if 2*(m-1) doesn't overflow, it's bearable.
Let us suppose you have and use a 64-bit unsigned integer type.
You can calculate the modular product by splitting the factors into low and high 32 bits, the product then splits into
a = a1 + (a2 << 32) // 0 <= a1, a2 < (1 << 32)
b = b1 + (b2 << 32) // 0 <= b1, b2 < (1 << 32)
a*b = a1*b1 + (a1*b2 << 32) + (a2*b1 << 32) + (a2*b2 << 64)
To calculate a*b (mod m) with m <= (1 << 63), reduce each of the four products modulo m,
p1 = (a1*b1) % m;
p2 = (a1*b2) % m;
p3 = (a2*b1) % m;
p4 = (a2*b2) % m;
and the simplest way to incorporate the shifts is
for(i = 0; i < 32; ++i) {
p2 *= 2;
if (p2 >= m) p2 -= m;
}
the same for p3 and with 64 iterations for p4. Then
s = p1+p2;
if (s >= m) s -= m;
s += p3;
if (s >= m) s -= m;
s += p4;
if (s >= m) s -= m;
return s;
That way is not very fast, but for the few multiplications needed here, it may be fast enough. A small speedup should be obtained by reducing the number of shifts; first calculate (p4 << 32) % m,
for(i = 0; i < 32; ++i) {
p4 *= 2;
if (p4 >= m) p4 -= m;
}
then all of p2, p3 and the current value of p4 need to be multiplied with 2^32 modulo m,
p4 += p3;
if (p4 >= m) p4 -= m;
p4 += p2;
if (p4 >= m) p4 -= m;
for(i = 0; i < 32; ++i) {
p4 *= 2;
if (p4 >= m) p4 -= m;
}
s = p4+p1;
if (s >= m) s -= m;
return s;
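Put together, the pieces above might look like the following. This is a sketch assembled for illustration, not the answerer's exact code: the names `shift_mod`, `mul_mod` and `pow_mod` are made up, and it assumes a 64-bit unsigned type and 1 <= m <= 2^63 so that doubling a residue and adding two residues cannot overflow.

```cpp
#include <cstdint>

using u64 = std::uint64_t;

// Doubles x modulo m `times` times; requires x < m <= 2^63.
u64 shift_mod(u64 x, int times, u64 m) {
    for (int i = 0; i < times; ++i) {
        x *= 2;
        if (x >= m) x -= m;
    }
    return x;
}

// (a * b) mod m without a 128-bit type, by splitting into 32-bit halves.
u64 mul_mod(u64 a, u64 b, u64 m) {
    a %= m;
    b %= m;
    const u64 mask = 0xFFFFFFFFu;
    u64 a1 = a & mask, a2 = a >> 32;   // a = a1 + (a2 << 32)
    u64 b1 = b & mask, b2 = b >> 32;   // b = b1 + (b2 << 32)
    u64 p1 = (a1 * b1) % m;            // each partial product fits in 64 bits
    u64 p2 = (a1 * b2) % m;
    u64 p3 = (a2 * b1) % m;
    u64 p4 = (a2 * b2) % m;
    p4 = shift_mod(p4, 32, m);         // p4 * 2^32
    p4 += p3; if (p4 >= m) p4 -= m;
    p4 += p2; if (p4 >= m) p4 -= m;
    p4 = shift_mod(p4, 32, m);         // (p4 * 2^32 + p3 + p2) * 2^32
    u64 s = p4 + p1; if (s >= m) s -= m;
    return s;
}

// base^exp mod m by repeated squaring, using mul_mod for every product.
u64 pow_mod(u64 base, u64 exp, u64 m) {
    u64 result = 1 % m;
    base %= m;
    while (exp > 0) {
        if (exp & 1) result = mul_mod(result, base, m);
        base = mul_mod(base, base, m);
        exp >>= 1;
    }
    return result;
}
```

For the Fermat test from the question, `pow_mod(2, n, n) == 2 % n` would then replace `expo(n,n)==2`. Each `mul_mod` call costs 64 doubling steps, so this trades speed for staying within 64 bits.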
You can perform your multiplications in several stages. For example, say you want to compute X*Y mod n. Take X and Y and write them as X = 10^9*X_1 + X_0, Y = 10^9*Y_1 + Y_0. Then compute all four products X_i*Y_j mod n, and finally compute X*Y = 10^18*(X_1*Y_1 mod n) + 10^9*((X_0*Y_1 + X_1*Y_0) mod n) + X_0*Y_0, reducing mod n again at the end. Note that in this case, you are operating with numbers half the size of the maximum allowed.
If splitting in two parts does not suffice (I suspect this is the case), split in three parts using the same scheme. Splitting in three should work.
A simpler approach is just to multiply the school way. It corresponds to the previous approach, but writing one number in as many parts as digits it has.
Good luck!