Cholesky decomposition in Halide - c++

I'm trying to implement a Cholesky decomposition in Halide. Part of a common algorithm such as Crout's consists of an iteration over a triangular matrix: the diagonal elements of the decomposition are computed by subtracting a partial column sum from the diagonal element of the input matrix. The column sum is taken over the squared elements of a triangular part of the input matrix, excluding the diagonal element.
Using BLAS, the code would look as follows in C++:
double* a;  /* input matrix, n x n, column-major */
int n;      /* dimension */
const int c__1 = 1;
const double c_b12 = 1.;
const double c_b10 = -1.;
for (int j = 0; j < n; ++j) {
    /* subtract the partial row sum of squares from the diagonal element */
    double ajj = a[j + j * n] - ddot(&j, &a[j], &n, &a[j], &n);
    ajj = sqrt(ajj);
    a[j + j * n] = ajj;
    if (j < n - 1) {
        int i__2 = n - j - 1;
        /* update the column below the diagonal: a(j+1:, j) -= A(j+1:, :j) * a(j, :j)^T */
        dgemv("No transpose", &i__2, &j, &c_b10, &a[j + 1], &n,
              &a[j], &n, &c_b12, &a[j + 1 + j * n], &c__1);
        double d__1 = 1. / ajj;
        dscal(&i__2, &d__1, &a[j + 1 + j * n], &c__1);
    }
}
My question is whether a pattern like this is expressible in Halide at all, and if so, what it would look like.

I think Andrew may have a more complete answer, but in the interest of a timely response, you can use an RDom predicate (introduced via RDom::where) to enumerate triangular regions (or their generalization to more dimensions). A sketch of the pattern is:
Halide::RDom triangular(0, extent, 0, extent);
triangular.where(triangular.x < triangular.y);
Then use triangular in a reduction.
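To make the pattern concrete, here is a minimal sketch (all names are mine, and it only illustrates the triangular reduction, not the full factorization): for every row y it sums the squares of the elements strictly left of the diagonal, which is the partial sum the question subtracts from the diagonal entry.

#include "Halide.h"
using namespace Halide;

int main() {
    const int n = 8;            // example size
    Buffer<double> A(n, n);     // A(x, y): column x, row y
    A.fill(1.0);

    // 2D domain restricted to the strict lower triangle
    RDom r(0, n, 0, n, "r");
    r.where(r.x < r.y);

    Var y("y");
    Func rowsq("rowsq");
    rowsq(y) = 0.0;
    rowsq(r.y) += A(r.x, r.y) * A(r.x, r.y);

    // out(y) = sum over x < y of A(x, y)^2
    Buffer<double> out = rowsq.realize({n});
    return 0;
}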

I once had a fast Cholesky written in Halide. Unfortunately I can't find the code. I put the outer loop in C and wrote a good block-panel update routine that operated on something like a 32-wide panel at a time. This was before Halide had triangular iteration, so maybe you can do better now.

Related

Need help understanding this line in an FFT algorithm

In my program I have a function that performs the fast Fourier transform. I know there are very good implementations freely available, but this is a learning exercise, so I don't want to use those. I ended up finding a comment with the following implementation (it originated from the Italian entry for the FFT):
void transform(complex<double>* f, int N)
{
    ordina(f, N);  // first: bit-reversal reordering
    complex<double>* W = (complex<double>*)malloc(N / 2 * sizeof(complex<double>));
    W[1] = polar(1., -2. * M_PI / N);  // primitive N-th root of unity, e^(-2πi/N)
    W[0] = 1;
    for (int i = 2; i < N / 2; i++)
        W[i] = pow(W[1], i);           // W[i] = e^(-2πi·i/N)
    int n = 1;
    int a = N / 2;
    for (int j = 0; j < log2(N); j++) {
        for (int k = 0; k < N; k++) {
            if (!(k & n)) {
                // butterfly between elements k and k + n
                complex<double> temp = f[k];
                complex<double> Temp = W[(k * a) % (n * a)] * f[k + n];
                f[k] = temp + Temp;
                f[k + n] = temp - Temp;
            }
        }
        n *= 2;
        a = a / 2;
    }
    free(W);
}
I've made a lot of changes by now, but this was my starting point. One of the changes I made was to not cache the twiddle factors, because I decided to see if it's needed first. Now I've decided I do want to cache them. The way this implementation seems to do it is it has this array W of length N/2, where index k holds the value e^(-2πik/N). What I don't understand is this expression:
W[(k * a) % (n * a)]
Note that n * a is always equal to N/2. I get that this is supposed to be equal to e^(-2πi(k mod n)/(2n)), and I can see that (k·a) mod (n·a) = a·(k mod n), which this relies on. I also get that modulo can be used here because the twiddle factors are cyclic. But there's one thing I don't get: this is a length-N DFT, and yet only N/2 twiddle factors are ever calculated. Shouldn't the array be of length N, and the modulo be by N?
The twiddle factors are equally spaced points on the unit circle, and there is an even number of points because N is a power of two. After going around half of the circle (starting at 1 and, with the negative exponent, stepping clockwise below the X-axis), the second half is a repeat of the first half reflected through the origin. That is why Temp is subtracted the second time: the subtraction is the negation of the twiddle factor.
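To see why N/2 factors suffice, write w = e^(-2πi/N), so W[k] = w^k. Then
w^(k + N/2) = e^(-2πi(k + N/2)/N) = e^(-πi) · w^k = -w^k,
so the second half of the table is just the negated first half, which is exactly what the subtraction in f[k + n] = temp - Temp exploits.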

needed dtw like in R package

There is a function in the dtw package:
dtw(x, y=NULL, dist.method="Euclidean", step.pattern=symmetric2, window.type="none", keep.internals=FALSE, distance.only=FALSE, open.end=FALSE, open.begin=FALSE, ... )
The function offers three step patterns for calculating the distance:
symmetric1, symmetric2, asymmetric
I am interested in the method step.pattern = symmetric2.
I have a C++ function that works exactly like symmetric1:
#include <Rcpp.h>
#include <algorithm>
#include <cmath>
using namespace Rcpp;

// [[Rcpp::export]]
double dtw_rcpp(const NumericVector& x, const NumericVector& y) {
    size_t n = x.size(), m = y.size();
    NumericMatrix res = no_init(n + 1, m + 1);
    std::fill(res.begin(), res.end(), R_PosInf);
    res(0, 0) = 0;
    double cost = 0;
    // band half-width |n - m|, computed without unsigned wrap-around
    size_t w = (n > m) ? (n - m) : (m - n);
    for (size_t i = 1; i <= n; ++i) {
        size_t jmin = (i > w) ? (i - w) : 1;
        size_t jmax = std::min(m, i + w);
        for (size_t j = jmin; j <= jmax; ++j) {
            cost = std::abs(x[i - 1] - y[j - 1]);
            // symmetric1: all three steps pay the local distance once
            res(i, j) = cost + std::min(std::min(res(i - 1, j), res(i, j - 1)),
                                        res(i - 1, j - 1));
        }
    }
    return res(n, m);
}
What do I need to change in this C++ function so that it uses the symmetric2 step pattern?
I do not understand how symmetric2 works.
The documentation says very little about it:
1. Well-known step patterns
These common transition types are used in quite a lot of implementations.
symmetric1 (or White-Neely) is the commonly used quasi-symmetric, no local constraint, non-normalizable. It is biased in favor of oblique steps. symmetric2 is normalizable, symmetric, with no local slope constraints. Since one diagonal step costs as much as the two equivalent steps along the sides, it can be normalized dividing by N+M (query+reference lengths).
I could not work it out from the source code because I am a beginner programmer.
I do not speak English, so forgive me for the mistakes.
Thank you.
OP is asking about dynamic time warping alignments in R. Printing the symmetric2 object should clarify the recursion rule:
g[i,j] = min(
    g[i-1,j-1] + 2 * d[i,j],
    g[i  ,j-1] +     d[i,j],
    g[i-1,j  ] +     d[i,j]
)
g is the global cost matrix, d the local distance. I can't comment on the rest of your code.
If you only need the distance value under this specific step pattern, and no other features, the code may be much simplified (see e.g. the pseudocode on Wikipedia).
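In case it helps, here is a minimal sketch (mine, and simplified: the warping window from your function is dropped) of your Rcpp function with the symmetric2 recursion above substituted in; the only real change is that the diagonal step pays the local distance twice.

#include <Rcpp.h>
#include <algorithm>
#include <cmath>
using namespace Rcpp;

// [[Rcpp::export]]
double dtw_symmetric2(const NumericVector& x, const NumericVector& y) {
    size_t n = x.size(), m = y.size();
    NumericMatrix res = no_init(n + 1, m + 1);
    std::fill(res.begin(), res.end(), R_PosInf);
    res(0, 0) = 0;
    for (size_t i = 1; i <= n; ++i) {
        for (size_t j = 1; j <= m; ++j) {
            double cost = std::abs(x[i - 1] - y[j - 1]);
            res(i, j) = std::min(std::min(res(i - 1, j) + cost,      // vertical
                                          res(i, j - 1) + cost),     // horizontal
                                 res(i - 1, j - 1) + 2 * cost);      // diagonal counts twice
        }
    }
    // Divide by (n + m) if you want the normalized distance the docs mention.
    return res(n, m);
}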

How to perform 1-dimensional "valid" convolution? [closed]

I'm trying to implement a 1-dimensional convolution in "valid" mode (Matlab definition) in C++.
It seems pretty simple, but I haven't been able to find code doing that in C++ (or any other language that I could adapt, for that matter). If my vector size is a power of two, I can use a 2D convolution, but I would like to find something that would work for any input and kernel.
So how do I perform a 1-dimensional convolution in "valid" mode, given an input vector of size I and a kernel of size K (the output should normally be a vector of size I - K + 1)?
Pseudocode is also accepted.
You could use one of the following implementations:
Full convolution:
template<typename T>
std::vector<T> conv(std::vector<T> const &f, std::vector<T> const &g) {
    int const nf = f.size();
    int const ng = g.size();
    int const n  = nf + ng - 1;
    std::vector<T> out(n, T());
    for (auto i(0); i < n; ++i) {
        // clamp j so that both f[j] and g[i - j] stay in range
        int const jmn = (i >= ng - 1) ? i - (ng - 1) : 0;
        int const jmx = (i <  nf - 1) ? i : nf - 1;
        for (auto j(jmn); j <= jmx; ++j) {
            out[i] += (f[j] * g[i - j]);
        }
    }
    return out;
}
f : First sequence (1D signal).
g : Second sequence (1D signal).
returns a std::vector of size f.size() + g.size() - 1, which is the result of the discrete convolution, a.k.a. the Cauchy product ((f * g) = (g * f)).
Valid convolution:
template<typename T>
std::vector<T> conv_valid(std::vector<T> const &f, std::vector<T> const &g) {
    int const nf = f.size();
    int const ng = g.size();
    // treat the shorter sequence as the kernel
    std::vector<T> const &min_v = (nf < ng) ? f : g;
    std::vector<T> const &max_v = (nf < ng) ? g : f;
    int const n = std::max(nf, ng) - std::min(nf, ng) + 1;
    std::vector<T> out(n, T());
    for (auto i(0); i < n; ++i) {
        // slide the reversed shorter sequence across the longer one
        for (int j(min_v.size() - 1), k(i); j >= 0; --j) {
            out[i] += min_v[j] * max_v[k];
            ++k;
        }
    }
    return out;
}
f : First sequence (1D signal).
g : Second sequence (1D signal).
returns a std::vector of size std::max(f.size(), g.size()) - std::min(f.size(), g.size()) + 1, which is the result of the valid (i.e., without the padding) discrete convolution, a.k.a. the Cauchy product ((f * g) = (g * f)).
In order to perform a 1-D valid convolution on a std::vector (call the input invec and the output outvec, for the sake of the example) of size l, it is enough to create the right boundaries by setting the loop parameters correctly, and then perform the convolution as usual, i.e.:
for (size_t i = K/2; i < l - K/2; ++i)
{
    outvec[i - K/2] = 0.0;
    for (size_t j = 0; j < K + 1; j++)
    {
        outvec[i - K/2] += invec[i - K/2 + j] * kernel[j];
    }
}
Note the starting and final values of i.
This works for any 1-D kernel of any size, provided the kernel is not bigger than the vector ;)
Note that I've used the variable K as you've described it, but personally I would have understood the 'size' differently - a matter of taste, I guess. In this example, the total length of the kernel vector is K+1. I've also assumed that outvec already has l - K elements (BTW: the output vector has l - K elements, not l - K + 1 as you have written), so no push_back() is needed.
I don't understand why you need to implement a convolution function. Doesn't Matlab have a built-in 1D convolution function?
Putting that aside, you can implement convolution given a Fourier transform function. You need to be careful about the length of the input and output vectors. The length of the result is I + K - 1 (not I - K + 1, right?). Extend each input vector with zeros to length N, where N is the smallest power of 2 greater than or equal to I + K - 1. Take the Fourier transform of the inputs, then multiply the results element by element. Take the inverse Fourier transform of that product, and return the first I + K - 1 elements (throw the rest away). That's your convolution.
You may need to throw in a scaling factor of 1/N somewhere since there is no universally-agreed scaling for Fourier transforms, and I don't remember what Matlab assumes for that.
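For what it's worth, here is a hedged sketch of that recipe (the FFT is a textbook radix-2 Cooley-Tukey; all names are mine, and it computes the full convolution, from which the valid part can be cut out):

#include <cmath>
#include <complex>
#include <utility>
#include <vector>

using cd = std::complex<double>;

// In-place radix-2 FFT; invert = true gives the inverse transform,
// including the 1/N scaling mentioned above.
void fft(std::vector<cd>& a, bool invert) {
    const size_t n = a.size();
    if (n < 2) return;
    // bit-reversal permutation
    for (size_t i = 1, j = 0; i < n; ++i) {
        size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }
    // butterflies of increasing length
    for (size_t len = 2; len <= n; len <<= 1) {
        const double ang = 2 * M_PI / len * (invert ? 1 : -1);
        const cd wlen(std::cos(ang), std::sin(ang));
        for (size_t i = 0; i < n; i += len) {
            cd w(1);
            for (size_t k = 0; k < len / 2; ++k) {
                cd u = a[i + k], v = w * a[i + k + len / 2];
                a[i + k] = u + v;
                a[i + k + len / 2] = u - v;
                w *= wlen;
            }
        }
    }
    if (invert)
        for (cd& x : a) x /= static_cast<double>(n);
}

// Full convolution: zero-pad to a power of two >= I + K - 1, transform,
// multiply pointwise, inverse-transform, keep the first I + K - 1 values.
std::vector<double> conv_fft(const std::vector<double>& f,
                             const std::vector<double>& g) {
    const size_t out_len = f.size() + g.size() - 1;
    size_t n = 1;
    while (n < out_len) n <<= 1;
    std::vector<cd> fa(f.begin(), f.end()), ga(g.begin(), g.end());
    fa.resize(n);
    ga.resize(n);
    fft(fa, false);
    fft(ga, false);
    for (size_t i = 0; i < n; ++i) fa[i] *= ga[i];
    fft(fa, true);
    std::vector<double> out(out_len);
    for (size_t i = 0; i < out_len; ++i) out[i] = fa[i].real();
    return out;
}

The valid part is then the middle I - K + 1 values (for I >= K), i.e. elements K - 1 through I - 1 of the full result.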

How to compute sum of evenly spaced binomial coefficients

How to find the sum of evenly spaced binomial coefficients modulo M?
i.e. (C(n,a) + C(n,a+r) + C(n,a+2r) + C(n,a+3r) + ... + C(n,a+kr)) % M = ?
given: 0 <= a < r, a + kr <= n < a + (k+1)r, n < 10^5, r < 100
My first attempt was:
int res = 0;
int mod = 1000000009;
for (int k = 0; a + r*k <= n; k++) {
    res = (res + mod_nCr(n, a + r*k, mod)) % mod;
}
but this is not efficient. So after reading here and this paper, I found out the above sum is equivalent to:
(1/r) * sum over 0 <= j < r of ω^(-ja) * (1 + ω^j)^n, where ω = e^(i2π/r) is a primitive r-th root of unity.
What would the code look like to find this sum in O(r)?
Edit:
n can go up to 10^5 and r can go up to 100.
Original problem source: https://www.codechef.com/APRIL14/problems/ANUCBC
Editorial for the problem from the contest: https://discuss.codechef.com/t/anucbc-editorial/5113
After revisiting this post 6 years later, I'm unable to recall how I transformed the original problem statement into my version; nonetheless, I've shared the link to the original solution in case anyone wants to look at the correct solution approach.
Binomial coefficients are coefficients of the polynomial (1+x)^n. The sum of the coefficients of x^a, x^(a+r), etc. is the coefficient of x^a in (1+x)^n in the ring of polynomials mod x^r-1. Polynomials mod x^r-1 can be specified by an array of coefficients of length r. You can compute (1+x)^n mod (x^r-1, M) by repeated squaring, reducing mod x^r-1 and mod M at each step. This takes about log_2(n)r^2 steps and O(r) space with naive multiplication. It is faster if you use the Fast Fourier Transform to multiply or exponentiate the polynomials.
For example, suppose n=20 and r=5.
(1+x)    = {1,1,0,0,0}
(1+x)^2  = {1,2,1,0,0}
(1+x)^4  = {1,4,6,4,1}
(1+x)^8  = {1,8,28,56,70,56,28,8,1}
  mod x^5-1: {1+56,8+28,28+8,56+1,70}
           = {57,36,36,57,70}
(1+x)^16 = {3249,4104,5400,9090,13380,9144,8289,7980,4900}
  mod x^5-1: {3249+9144,4104+8289,5400+7980,9090+4900,13380}
           = {12393,12393,13380,13990,13380}
(1+x)^20 = (1+x)^16 * (1+x)^4
         = {12393,12393,13380,13990,13380} * {1,4,6,4,1}
         = {12393,61965,137310,191440,211585,203373,149620,67510,13380}
  mod x^5-1: {215766,211585,204820,204820,211585}
This tells you the sums for the 5 possible values of a. For example, for a=1, 211585 = 20c1+20c6+20c11+20c16 = 20+38760+167960+4845.
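A hedged sketch of this approach in C++ (names and structure are mine; it computes (1+x)^n mod (x^r - 1, M) by repeated squaring and reads off the coefficient of x^a):

#include <cstdint>
#include <vector>

using std::vector;

// multiply two length-r coefficient vectors mod (x^r - 1, M)
vector<int64_t> polymul(const vector<int64_t>& p, const vector<int64_t>& q,
                        int64_t M) {
    const size_t r = p.size();
    vector<int64_t> res(r, 0);
    for (size_t i = 0; i < r; ++i)
        for (size_t j = 0; j < r; ++j)
            res[(i + j) % r] = (res[(i + j) % r] + p[i] * q[j]) % M;
    return res;
}

// C(n,a) + C(n,a+r) + C(n,a+2r) + ... mod M
int64_t spaced_binomial_sum(int64_t n, int r, int a, int64_t M) {
    vector<int64_t> base(r, 0), acc(r, 0);
    base[0 % r] = 1;   // the polynomial 1 + x ...
    base[1 % r] += 1;  // ... written mod x^r - 1 (handles r == 1 too)
    acc[0] = 1 % M;    // the constant polynomial 1
    while (n > 0) {    // fast exponentiation: O(r^2 log n)
        if (n & 1) acc = polymul(acc, base, M);
        base = polymul(base, base, M);
        n >>= 1;
    }
    return acc[a];
}

With the worked example above, spaced_binomial_sum(20, 5, 1, 1000000009) returns 211585.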
Something like this, but you have to check a, n and r, because I just plugged in values without checking the constraints:
#include <complex>
#include <cmath>
#include <iostream>
using namespace std;

int main( void )
{
    const int r = 10;
    const int a = 2;
    const int n = 4;
    complex<double> i(0., 1.), res(0., 0.), w;
    w = exp( i * 2. * M_PI / (double)r );  // primitive r-th root of unity
    for( int j(0); j < r; ++j )
        res += pow( w, -j * a ) * pow( 1. + pow( w, j ), n ) / (double)r;
    cout << res.real() << endl;            // imaginary parts cancel (up to rounding)
    return 0;
}
The mod operation is expensive; try to avoid it as much as possible:
uint64_t res = 0;
int mod = 1000000009;
for (int k = 0; a + r*k <= n; k++) {
    res += mod_nCr(n, a + r*k, mod);
    if (res >= (uint64_t)mod)
        res -= mod;  // each term is < mod, so one subtraction is enough
}
I did not test this code
I don't know if you got anywhere with this question, but the key to implementing this formula is to realize that the powers w^i can be treated as formal, independent symbols and therefore form a ring. In simpler terms, you should think of implementing
(1+x)^n % (x^r - 1), i.e. finding (1+x)^n in the ring Z[x]/(x^r - 1).
If that's confusing, here is an easy implementation outline:
1. Make a vector of size r. O(r) space.
2. Initialize the vector with zeros everywhere. O(r) time.
3. Set the first two elements of that vector to 1. O(1).
4. Calculate (x+1)^n using fast exponentiation; each multiplication takes O(r^2) and there are log n multiplications, so O(r^2 log n) time.
5. Return the element of the vector at index a (the first element when a = 0). O(1).
Complexity
O(r^2 log n) time and O(r) space. The r^2 factor can be reduced to r log r using a Fourier transform.
How is the multiplication done? It is regular polynomial multiplication, with the exponents reduced mod r:
vector<long long> p1(r, 0);
vector<long long> p2(r, 0);
p1[0] = p1[1] = 1;
p2[0] = p2[1] = 1;
Now we want to do the multiplication:
vector<long long> res(r, 0);
for (int i = 0; i < r; i++)
{
    for (int j = 0; j < r; j++)
    {
        // exponents wrap around mod r; in the real problem also reduce
        // the coefficients mod M here to avoid overflow
        res[(i + j) % r] += (p1[i] * p2[j]);
    }
}
return res[0];
I have implemented this part before; if you are still confused about something, let me know. I would prefer that you implement the code yourself, but if you need the code, let me know.

How do you multiply a matrix by itself?

This is what I have so far, but I do not think it is right.
for (int i = 0; i < 5; i++)
{
    for (int j = 0; j < 5; j++)
    {
        matrix[i][j] += matrix[i][j] * matrix[i][j];
    }
}
Suggestion: if it's not homework, don't write your own linear algebra routines; use any of the many peer-reviewed libraries that are out there.
Now, about your code: if you want to do a term-by-term product, then you're doing it wrong. What you're doing is assigning to each value its square plus the original value (n*n + n, or (1+n)*n, whichever you like best).
But if you want to do an authentic matrix multiplication in the algebraic sense, remember that you have to take the scalar product of the first matrix's rows with the second matrix's columns (or the other way around, I'm not very sure now)... something like:
for i in rows:
    for j in cols:
        result(i,j) = m(i,:) · m(:,j)
and the scalar product "·" is
v · w = sum(v(i) * w(i)) for all i in the range of the indices.
Of course, with this method you cannot do the product in place, because you'll need the values that you're overwriting in the next steps.
Also, explaining Tyler McHenry's comment a little bit further: as a consequence of having to multiply rows by columns, the "inner dimensions" (I'm not sure if that's the correct terminology) of the matrices must match (if A is m x n and B is n x o, then A*B is m x o), so in your case, a matrix can be squared only if it's square (he he he).
And if you just want to play a little bit with matrices, then you can try Octave, for example; squaring a matrix is as easy as M*M or M**2.
I don't think you can multiply a matrix by itself in-place.
for (int i = 0; i < 5; i++) {
    for (int j = 0; j < 5; j++) {
        product[i][j] = 0;
        for (int k = 0; k < 5; k++) {
            product[i][j] += matrix[i][k] * matrix[k][j];
        }
    }
}
Even if you use a less naïve matrix multiplication (i.e. something other than this O(n^3) algorithm), you still need extra storage.
That's not any matrix multiplication definition I've ever seen. The standard definition is
for (i = 1 to m)
    for (j = 1 to n)
        result(i, j) = 0
        for (k = 1 to s)
            result(i, j) += a(i, k) * b(k, j)
to give the algorithm in a sort of pseudocode. In this case, a is an m x s matrix and b is s x n, the result is m x n, and subscripts begin at 1.
Note that multiplying a matrix in place is going to get the wrong answer, since you're going to be overwriting values before using them.
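A tiny concrete illustration of the overwrite problem (my example, not from the answer above): squaring m = {{1,2},{3,4}} should give {{7,10},{15,22}}, but an in-place version writes m[0][0] = 7 first and then uses that 7, instead of the original 1, when it forms m[1][0].

#include <cstdio>

int main() {
    int m[2][2] = {{1, 2}, {3, 4}};
    int p[2][2];  // separate storage keeps the original values intact
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++) {
            p[i][j] = 0;
            for (int k = 0; k < 2; k++)
                p[i][j] += m[i][k] * m[k][j];
        }
    std::printf("%d %d\n%d %d\n", p[0][0], p[0][1], p[1][0], p[1][1]);  // 7 10 / 15 22
    return 0;
}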
It's been too long since I've done matrix math (and I only did a little bit of it at that), but the += operator takes the value of matrix[i][j] and adds to it the value of matrix[i][j] * matrix[i][j], which I don't think is what you want to do.
Well, it looks like what it's doing is squaring each entry, then adding the square to that entry. Is that what you want it to do? If not, then change it.