Armadillo code which is the same as MATLAB code but much slower - c++

Recently I need to convert matlab code to c++. It includes matrix operations so I use Armadillo.But it turned out that Armadillo is much slower and I wonder why.
Here's the matlab code:
for j=1:sample
for i=1:length(t)
k=[1:1:N0];
AA=sqrt(2*k*dw).*(cos(dw.*k*i*dt).*Xk(j,k)+sin(dw.*k*i*dt).*Yk(j,k));
Ag2(i)=sum(AA);
end
Ag(j,:)=Ag2.*gt;
end
Here's the Armadillo code:
for (int j = 0; j < sample; j++){
for (int i = 0; i < t.n_cols; i++){
AA = sqrt(2 * k * dw)
% (cos(dw * k * (i + 1) * dt) % Xk.row(j) + sin(dw * k * (i + 1) * dt) % Yk.row(j));
Ag2(i) = sum(AA);
}
Ag.row(j) = Ag2 % gt;
}
where Xk,Yk,Ag,Ag2 are matrices, AA,k are vectors and others are double.
Matlab code runs for 20s and Armadillo code runs for 380s.
Anybody got any idea?Your help is really appreciated.
Thanks!

Related

Cholesky decomposition in Halide

I'm trying to implement a Cholesky decomposition in Halide. Part of common algorithm such as crout consists of an iteration over a triangular matrix. In a way that, the diagonal elements of the decomposition are computed by subtracting a partial column sum from the diagonal element of the input matrix. Column sum is calculated over squared elements of a triangular part of the input matrix, excluding the diagonal element.
Using BLAS the code would in C++ look as follows:
double* a; /* input matrix */
int n; /* dimension */
const int c__1 = 1;
const double c_b12 = 1.;
const double c_b10 = -1.;
for (int j = 0; j < n; ++j) {
double ajj = a[j + j * n] - ddot(&j, &a[j + n], &n, &a[j + n], &n);
ajj = sqrt(ajj);
a[j + j * n] = ajj;
if (j < n) {
int i__2 = n - j;
dgemv("No transpose", &i__2, &j, &c_b10, &a[j + 1 + n], &n, &a[j + n], &b, &c_b12, &a[j + 1 + j * n], &c__1);
double d__1 = 1. / ajj;
dscal(&i__2, &d__1, &a[j + 1 + j * n], &c__1);
}
}
My question is if a pattern like this is in general expressible by Halide? And if so, how would it look like?
I think Andrew may have a more complete answer, but in the interest of a timely response, you can use an RDom predicate (introduced via RDom::where) to enumerate triangular regions (or their generalization to more dimensions). A sketch of the pattern is:
Halide::RDom triangular(0, extent, 0, extent);
triangular.where(triangular.x < triangular.y);
Then use triangular in a reduction.
I once had a fast Cholesky written in Halide. Unfortunately I can't find the code. I put the outer loop in C and wrote a good block-panel update routine that operated on something like a 32-wide panel at a time. This was before Halide had triangular iteration, so maybe you can do better now.

How to optimize this math operation for speed

I'm trying to optimize a function taking a good chunk of execution time, which computes the following math operation many times. Is there anyway to make this operation faster?
float total = (sqrt(
((point_A[j].length)*(point_A[j].length))+
((point_B[j].width)*(point_B[j].width))+
((point_C[j].height)*(point_C[j].height))
));
If memory is cheap then you could do the following thereby improving CPU cache hit rate. Since you haven't posted more details, so I will make some assumptions here.
long tmp_len_square[N*3];
for (int j = 0; j < N; ++j) {
tmp_len_square[3 * j] = (point_A[j].length)*(point_A[j].length);
}
for (int j = 0; j < N; ++j) {
tmp_len_square[(3 * j) + 1] = (point_B[j].width)*(point_B[j].width);
}
for (int j = 0; j < N; ++j) {
tmp_len_square[(3 * j) + 2] = (point_C[j].height)*(point_C[j].height);
}
for (int j = 0; j < N; ++j) {
float total = sqrt(tmp_len_square[3 * j] +
tmp_len_square[(3 * j) + 1] +
tmp_len_square[(3 * j) + 2]);
// ...
}
Rearrange the data into this:
float *pointA_length;
float *pointB_width;
float *pointC_height;
That may require some level of butchering of your data structures, so you'll have to choose whether it's worth it or not.
Now what we can do is write this:
void process_points(float* Alengths, float* Bwidths, float* Cheights,
float* output, int n)
{
for (int i = 0; i < n; i++) {
output[i] = sqrt(Alengths[i] * Alengths[i] +
Bwidths[i] * Bwidths[i] +
Cheights[i] * Cheights[i]);
}
}
Writing it like this allows it to be auto-vectorized. For example, GCC targeting AVX and with -fno-math-errno -ftree-vectorize, can vectorize that loop. It does that with a lot of cruft though. __restrict__ and alignment attributes only improve that a little. So here's a hand-vectorized version as well: (not tested)
void process_points(float* Alengths,
float* Bwidths,
float* Cheights,
float* output, int n)
{
for (int i = 0; i < n; i += 8) {
__m256 a = _mm256_load_ps(Alengths + i);
__m256 b = _mm256_load_ps(Bwidths + i);
__m256 c = _mm256_load_ps(Cheights + i);
__m256 asq = _mm256_mul_ps(a, a);
__m256 sum = _mm256_fmadd_ps(c, c, _mm256_fmadd_ps(b, b, asq));
__m256 hsum = _mm256_mul_ps(sum, _mm256_set1_ps(0.5f));
__m256 invsqrt = _mm256_rsqrt_ps(sum);
__m256 s = _mm256_mul_ps(invsqrt, invsqrt);
invsqrt = _mm256_mul_ps(sum, _mm256_fnmadd_ps(hsum, s, _mm256_set1_ps(1.5f)));
_mm256_store_ps(output + i, _mm256_mul_ps(sum, invsqrt));
}
}
This makes a number of assumptions:
all pointers are 32-aligned.
n is a multiple of 8, or at least the buffers have enough padding that they're never accessed out of bounds.
the input buffers are not aliased with the output buffer (they could be aliased among themselves, but .. why)
the slightly reduced accuracy of the square root computed this way is OK (accurate to approximately 22 bits, instead of correctly rounded).
the sum of squares computed with fmadd can be slightly different than if it's computed using multiplies and adds, I assume that's OK too
your target supports AVX/FMA so this will actually run
The method for computing the square root I used here is using an approximate reciprocal square root, an improvement step (y = y * (1.5 - (0.5 * x * y * y))) and then a multiplication by x because x * 1/sqrt(x) = x/sqrt(x) = sqrt(x).
You can eventually try to optimize the sqrt function itself. May I suggest you to have a look at this link:
Best Square Root Method
Your question could be improved by adding a little more context. Is your code required to be portable, or are you targeting a particular compiler, or a specific processor or processor family? Perhaps you're willing to accept a general baseline version with target-specific optimised versions selected at runtime?
Also, there's very little context for the line of code you give. Is it in a tight loop? Or is it scattered in a bunch of places in conditional code in such a loop?
I'm going to assume that it's in a tight loop thus:
for (int j=0; j<total; ++j)
length[j] = sqrt(
(point_A[j].length)*(point_A[j].length) +
(point_B[j].width)*(point_B[j].width) +
(point_C[j].height)*(point_C[j].height));
I'm also going to assume that your target processor is multi-core, and that the arrays are distinct (or that the relevant elements are distinct), then an easy win is to annotate for OpenMP:
#pragma omp parallel for
for (int j=0; j<total; ++j)
length[j] = sqrt((point_A[j].length)*(point_A[j].length) +
(point_B[j].width)*(point_B[j].width) +
(point_C[j].height)*(point_C[j].height));
Compile with g++ -O3 -fopenmp -march=native (or substitute native with your desired target processor architecture).
If you know your target, you might be able to benefit from parallelisation of loops with the gcc flag -ftree-parallelize-loops=n - look in the manual.
Now measure your performance change (I'm assuming that you measured the original, given that this is an optimisation question). If it's still not fast enough for you, then it's time to consider changing your data structures, algorithms, or individual lines of code.

Exponential operator in C++ loop

I wrote C++ codes and matlab codes to test speed. My C++ code is:
int nrow = dim[0], ncol = dim[1];
double tmp, ldot;
for (int k = ncol - 1; k >= 0; --k){
grad[k] = 0;
for (int j = nrow - 1; j >= 0; --j){
tmp = exp(eta[j + nrow * k]);
ldot = (-Z[j + nrow * k] + tmp / (1 + tmp));
grad[k] += A[j] * ldot;
}
}
My matlab code is:
prob = exp(eta);
prob = prob./(1+prob);
ldot = prob - Z;
grad=sum(repmat(A,1,nGWAS).*ldot);
I run each code 100 times, it took over 5 seconds for C++ and only 1.2 seconds for matlab.
Anyone can help my here? Thanks.
The folks at matlab know very well how to optimize matrix access.
You chose to access it column by column. My initial guess is that the matrix is laid out in memory row by row. This causes your code to run over the whole matrix ncol times. Cache misses all over the place.

How to compute sum of evenly spaced binomial coefficients

How to find sum of evenly spaced Binomial coefficients modulo M?
ie. (nCa + nCa+r + nCa+2r + nCa+3r + ... + nCa+kr) % M = ?
given: 0 <= a < r, a + kr <= n < a + (k+1)r, n < 105, r < 100
My first attempt was:
int res = 0;
int mod=1000000009;
for (int k = 0; a + r*k <= n; k++) {
res = (res + mod_nCr(n, a+r*k, mod)) % mod;
}
but this is not efficient. So after reading here
and this paper I found out the above sum is equivalent to:
summation[ω-ja * (1 + ωj)n / r], for 0 <= j < r; and ω = ei2π/r is a primitive rth root of unity.
What can be the code to find this sum in Order(r)?
Edit:
n can go upto 105 and r can go upto 100.
Original problem source: https://www.codechef.com/APRIL14/problems/ANUCBC
Editorial for the problem from the contest: https://discuss.codechef.com/t/anucbc-editorial/5113
After revisiting this post 6 years later, I'm unable to recall how I transformed the original problem statement into mine version, nonetheless, I shared the link to the original solution incase anyone wants to have a look at the correct solution approach.
Binomial coefficients are coefficients of the polynomial (1+x)^n. The sum of the coefficients of x^a, x^(a+r), etc. is the coefficient of x^a in (1+x)^n in the ring of polynomials mod x^r-1. Polynomials mod x^r-1 can be specified by an array of coefficients of length r. You can compute (1+x)^n mod (x^r-1, M) by repeated squaring, reducing mod x^r-1 and mod M at each step. This takes about log_2(n)r^2 steps and O(r) space with naive multiplication. It is faster if you use the Fast Fourier Transform to multiply or exponentiate the polynomials.
For example, suppose n=20 and r=5.
(1+x) = {1,1,0,0,0}
(1+x)^2 = {1,2,1,0,0}
(1+x)^4 = {1,4,6,4,1}
(1+x)^8 = {1,8,28,56,70,56,28,8,1}
{1+56,8+28,28+8,56+1,70}
{57,36,36,57,70}
(1+x)^16 = {3249,4104,5400,9090,13380,9144,8289,7980,4900}
{3249+9144,4104+8289,5400+7980,9090+4900,13380}
{12393,12393,13380,13990,13380}
(1+x)^20 = (1+x)^16 (1+x)^4
= {12393,12393,13380,13990,13380}*{1,4,6,4,1}
{12393,61965,137310,191440,211585,203373,149620,67510,13380}
{215766,211585,204820,204820,211585}
This tells you the sums for the 5 possible values of a. For example, for a=1, 211585 = 20c1+20c6+20c11+20c16 = 20+38760+167960+4845.
Something like that, but you have to check a, n and r because I just put anything without regarding about the condition:
#include <complex>
#include <cmath>
#include <iostream>
using namespace std;
int main( void )
{
const int r = 10;
const int a = 2;
const int n = 4;
complex<double> i(0.,1.), res(0., 0.), w;
for( int j(0); j<r; ++j )
{
w = exp( i * 2. * M_PI / (double)r );
res += pow( w, -j * a ) * pow( 1. + pow( w, j ), n ) / (double)r;
}
return 0;
}
the mod operation is expensive, try avoiding it as much as possible
uint64_t res = 0;
int mod=1000000009;
for (int k = 0; a + r*k <= n; k++) {
res += mod_nCr(n, a+r*k, mod);
if(res > mod)
res %= mod;
}
I did not test this code
I don't know if you reached something or not in this question, but the key to implementing this formula is to actually figure out that w^i are independent and therefore can form a ring. In simpler terms you should think of implement
(1+x)^n%(x^r-1) or finding out (1+x)^n in the ring Z[x]/(x^r-1)
If confused I will give you an easy implementation right now.
make a vector of size r . O(r) space + O(r) time
initialization this vector with zeros every where O(r) space +O(r) time
make the first two elements of that vector 1 O(1)
calculate (x+1)^n using the fast exponentiation method. each multiplication takes O(r^2) and there are log n multiplications therefore O(r^2 log(n) )
return first element of the vector.O(1)
Complexity
O(r^2 log(n) ) time and O(r) space.
this r^2 can be reduced to r log(r) using fourier transform.
How is the multiplication done, this is regular polynomial multiplication with mod in the power
vector p1(r,0);
vector p2(r,0);
p1[0]=p1[1]=1;
p2[0]=p2[1]=1;
now we want to do the multiplication
vector res(r,0);
for(int i=0;i<r;i++)
{
for(int j=0;j<r;j++)
{
res[(i+j)%r]+=(p1[i]*p2[j]);
}
}
return res[0];
I have implemented this part before, if you are still cofused about something let me know. I would prefer that you implement the code yourself, but if you need the code let me know.

OpenCV Python binds incredibly slow iterations through image data

I recently took some code that tracked an object based on color in OpenCV c++ and rewrote it in the python bindings.
The overall results and method were the same minus syntax obviously. But, when I perform the below code on each frame of a video it takes almost 2-3 seconds to complete where as the c++ variant, also below, is instant in comparison and I can iterate between frames as fast as my finger can press a key.
Any ideas or comments?
cv.PyrDown(img, dsimg)
for i in range( 0, dsimg.height ):
for j in range( 0, dsimg.width):
if dsimg[i,j][1] > ( _RED_DIFF + dsimg[i,j][2] ) and dsimg[i,j][1] > ( _BLU_DIFF + dsimg[i,j][0] ):
res[i,j] = 255
else:
res[i,j] = 0
for( int i =0; i < (height); i++ )
{
for( int j = 0; j < (width); j++ )
{
if( ( (data[i * step + j * channels + 1]) > (RED_DIFF + data[i * step + j * channels + 2]) ) &&
( (data[i * step + j * channels + 1]) > (BLU_DIFF + data[i * step + j * channels]) ) )
data_r[i *step_r + j * channels_r] = 255;
else
data_r[i * step_r + j * channels_r] = 0;
}
}
Thanks
Try using numpy to do your calculation, rather than nested loops. You should get C-like performance for simple calculations like this from numpy.
For example, your nested for loops can be replaced with a couple of numpy expressions...
I'm not terribly familiar with opencv, but I think the python bindings now have a numpy array interface, so your example above should be as simple as:
cv.PyrDown(img, dsimg)
data = np.asarray(dsimg)
blue, green, red = data.T
res = (green > (_RED_DIFF + red)) & (green > (_BLU_DIFF + blue))
res = res.astype(np.uint8) * 255
res = cv.fromarray(res)
(Completely untested, of course...) Again, I'm really not terribly familar with opencv, but nested python for loops are not the way to go about modifying an image element-wise, regardless...
Hope that helps a bit, anyway!