I am trying to construct a sparse matrix L of the form

L = sum_{i=1}^{p} Hi^T * Hi

where L and Hi are respectively a very sparse N-by-N matrix and a 1-by-N row vector. The final L matrix should have a density of around 1%.
Armadillo provides an arma::sp_mat class that seems to suit my needs. The assembly of L then looks like this:
arma::sp_mat L(N, N);
arma::sp_mat Hi(1, N);
for (int i = 0; i < p; ++i) {
    // The non-zero terms in Hi are populated here
    L += Hi.t() * Hi;
}
The number of non-zero elements in Hi is constant with i. I do not have much experience with sparse matrices, but I was expecting the incremental assembly of L to run at a roughly constant speed.
Yet, it seems that the speed at which Hi.t() * Hi is added to L decreases over time. Am I doing something wrong in the way I assemble L? Should I preconstruct L by specifying which of its components I know will not be zero?
It seems that the non-zero structure of L grows with every L += Hi.t() * Hi: each addition forces the compressed storage of L to be rebuilt, and that rebuild gets more expensive as non-zeros accumulate. This was likely the cause of the decrease in speed.
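If the full set of non-zero locations can be collected up front, batch construction avoids the repeated rebuilds entirely. A minimal sketch, assuming the triplets of all p outer products can be gathered first (total_nnz is a hypothetical precomputed count, and the exact batch-constructor overload may vary with the Armadillo version):

#include <armadillo>

arma::umat locations(2, total_nnz); // row 0: row indices, row 1: column indices
arma::vec  values(total_nnz);
// ... fill locations and values from the p outer products Hi.t() * Hi ...
arma::sp_mat L(true, locations, values, N, N); // true: sum values at duplicate locations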
Related
Consider fast matrix multiplication of XDX^T for X an n by m matrix, and D an m by m diagonal matrix. Here m>>n (suppose n around 1000, m around 100000). In my application, X is a fixed matrix and values of D can change at every iteration.
What would be a fast way to calculate this? At the moment I am just doing simple multiplication in C++.
EDIT: I should clarify my current procedure; it is not "simple multiplication". In particular, I am column-wise multiplying X by the square root of the diagonal entries of D to get A := XD^{1/2}. Then I am directly calculating A*t(A) (the multiplication of an n by m matrix with its transpose).
Thank you.
If you know that D is diagonal, then you can just do simple multiplication. Hopefully, you are not multiplying the zeros.
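For what it's worth, here is a minimal sketch of that procedure in Eigen (the library used in the posts below); the names are illustrative, and the symmetric rank update computes only one triangle of the n-by-n result:

#include <Eigen/Dense>

// X is n x m with m >> n; d (length m) holds the diagonal of D,
// assumed non-negative so the square root is real.
Eigen::MatrixXd A = X * d.cwiseSqrt().asDiagonal();        // A = X * D^{1/2}, O(n*m)
Eigen::MatrixXd H = Eigen::MatrixXd::Zero(X.rows(), X.rows());
H.selfadjointView<Eigen::Lower>().rankUpdate(A);           // H += A * A^T, lower triangle only
Eigen::MatrixXd Hfull = H.selfadjointView<Eigen::Lower>(); // mirror into a full matrix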
I have implemented a Gauss-Newton optimization process which involves calculating the increment by solving a linearized system Hx = b. The H matrix is calculated by H = J.transpose() * W * J and b is calculated from b = J.transpose() * (W * e), where e is the error vector. The Jacobian J here is an n-by-6 matrix, where n is in the thousands and stays unchanged across iterations, and W is an n-by-n diagonal weight matrix which changes across iterations (some diagonal elements will be set to zero). However, I encountered a speed issue.
When I do not add the weight matrix W, namely H = J.transpose()*J and b = J.transpose()*e, my Gauss-Newton process runs very fast, in 0.02 sec for 30 iterations. However, when I add the W matrix, which is defined outside the iteration loop, it becomes very slow (0.3~0.7 sec for 30 iterations), and I don't understand whether it is a problem with my code or whether it normally takes this long.
Everything here is an Eigen matrix or vector.
I defined my W matrix using the .asDiagonal() function in the Eigen library from a vector of inverse variances, then just used it in the calculation of H and b. Then it gets very slow. I wish to get some hints about the potential reasons for this huge slowdown.
EDIT:
There are only two matrices. The Jacobian is definitely dense. The weight matrix is generated from a vector by the function vec.asDiagonal(), which comes from the dense library, so I assume it is also dense.
The code is really simple and the only difference that's causing the time change is the addition of the weight matrix. Here is a code snippet:
for (int iter = 0; iter < max_iter; ++iter) {
    // obtain error vector
    error = ...
    // calculate H and b - the fast one
    Eigen::MatrixXf H = J.transpose() * J;
    Eigen::VectorXf b = J.transpose() * error;
    // calculate H and b - the slow one (used instead of the two lines above)
    Eigen::MatrixXf H = J.transpose() * weight_ * J;
    Eigen::VectorXf b = J.transpose() * (weight_ * error);
    // obtain delta and update state
    del = H.ldlt().solve(b);
    T <- T(del) // this is pseudo code, meaning update T with del
}
It is in a function in a class, and the weight matrix, currently defined as a class variable for debugging purposes, can be accessed by the function and is defined before the function is called.
I guess that weight_ is declared as a dense MatrixXf? If so, then replace it with w.asDiagonal() everywhere you use weight_, or make the latter an alias of the asDiagonal expression:
auto weight = w.asDiagonal();
This way Eigen will know that weight is a diagonal matrix and the computations will be optimized as expected.
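A minimal self-contained sketch of the fix (sizes are illustrative); with the diagonal expression, Eigen never materializes an n-by-n weight matrix, so the extra cost over the unweighted version is only O(n) per product:

#include <Eigen/Dense>

int main() {
    const int n = 5000;
    Eigen::MatrixXf J = Eigen::MatrixXf::Random(n, 6);
    Eigen::VectorXf e = Eigen::VectorXf::Random(n);
    Eigen::VectorXf w = Eigen::VectorXf::Ones(n);   // vector of inverse variances

    // Diagonal expression: no n-by-n matrix is ever formed.
    Eigen::MatrixXf H = J.transpose() * w.asDiagonal() * J;
    Eigen::VectorXf b = J.transpose() * (w.asDiagonal() * e);
    return 0;
}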
Because the matrix product involves only the diagonal, you can replace it with coefficient-wise multiplication, like so:
MatrixXd m;
VectorXd w;
w.setLinSpaced(5, 2, 6);  // w = [2, 3, 4, 5, 6]
m.setOnes(5, 5);
// scale column j of m by w(j); equivalent to m * w.asDiagonal()
std::cout << (m.array().rowwise() * w.array().transpose()).matrix() << "\n";
Likewise, the matrix vector product can be written as:
(w.array() * error.array()).matrix()
This avoids the zero elements in the matrix. Without an MCVE for me to base this on, YMMV...
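Applied to the H and b from the question, the same trick scales the rows of J by w directly (a sketch reusing w, the weight vector, from above):

// Scale row i of J by w(i); equivalent to w.asDiagonal() * J.
Eigen::MatrixXf WJ = (J.array().colwise() * w.array()).matrix();
Eigen::MatrixXf H  = J.transpose() * WJ;
Eigen::VectorXf b  = J.transpose() * (w.array() * error.array()).matrix();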
UPDATE: the (sparse) three-dimensional matrix v in my question below is symmetric: v(i1,i2,i3) = v(j1,j2,j3) where (j1,j2,j3) is any of the 6 permutations of (i1,i2,i3), i.e.
v(i1,i2,i3) = v(i1,i3,i2) = v(i2,i3,i1) = v(i2,i1,i3) = v(i3,i1,i2) = v(i3,i2,i1).
Moreover, v(i1,i2,i3) != 0 only when i1 != i2 && i1 != i3 && i2 != i3.
E.g. v(i,i,j) = 0, v(i, k, k) = 0, v(k, j, k) = 0, etc...
I thought that with this additional information, I could already get a significant speed-up by doing the following:
Remark: v contains duplicate values (a triplet (i,j,k) has 6 permutations, and the values of v for these 6 are the same).
So I defined a more compact matrix u that contains only the non-duplicate entries of v. The indices of u are (i1,i2,i3) where i1 < i2 < i3. The length of u is equal to the length of v divided by 6.
I computed the sum by iterating over the new value vector and the new index vectors.
With this, I only got a small speed-up. I realized that instead of iterating N times doing one multiplication each time, I now iterate N/6 times doing 6 multiplications each time, which is pretty much the same as before :(
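For reference, the compact iteration for y1 looks like this (a sketch with hypothetical names: J1[k] < J2[k] < J3[k] index the unique triplets and u[k] holds their common value):

for (std::size_t k = 0; k < N / 6; ++k) {
    const std::size_t i = J1[k], j = J2[k], l = J3[k];
    const double w = u[k];
    // the 6 permutations of (i, j, l) collapse into these 3 updates
    y1[i] += w * (x2[j] * x3[l] + x2[l] * x3[j]);
    y1[j] += w * (x2[i] * x3[l] + x2[l] * x3[i]);
    y1[l] += w * (x2[i] * x3[j] + x2[j] * x3[i]);
}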
Hope somebody could come up with a better solution.
--- (Original question) ---
In my program I have an expensive operation that is repeated every iteration.
I have three n-dimensional vectors x1, x2 and x3 that are supposed to change every iteration.
I have four N-dimensional vectors I1, I2, I3 and v that are pre-defined and will not change, where:
I1, I2 and I3 contain the indices of respectively x1, x2 and x3 (the elements in I_i are between 0 and n-1)
v is a vector of values.
For example, we can see v as a (reshaped) sparse three-dimensional matrix: each index k of v corresponds to a triplet (i1,i2,i3) = (I1[k], I2[k], I3[k]) of indices of x1, x2, x3.
I want to compute at each iteration three n-dimensional vectors y1, y2 and y3 defined by:
y1[i1] = sum_{i2,i3} v(i1,i2,i3)*x2(i2)*x3(i3)
y2[i2] = sum_{i1,i3} v(i1,i2,i3)*x1(i1)*x3(i3)
y3[i3] = sum_{i1,i2} v(i1,i2,i3)*x1(i1)*x2(i2)
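In the flattened representation, computing y1 is a single pass over the N non-zeros (a sketch assuming std::vector storage; y2 and y3 are analogous):

#include <algorithm> // std::fill

std::fill(y1.begin(), y1.end(), 0.0);
for (std::size_t k = 0; k < N; ++k)
    y1[I1[k]] += v[k] * x2[I2[k]] * x3[I3[k]];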
More precisely what the program does is:
Repeat:
Compute y1 then update x1 = f(y1)
Compute y2 then update x2 = f(y2)
Compute y3 then update x3 = f(y3)
where f is some external function.
I would like to know if there is a C++ library that helps me to do so as fast as possible. Using for loops is just too slow.
Thank you very much for your help!
Update: It looks like it's not easy to get a better solution than the straightforward for loops. If the vector of indices I1 above is sorted in non-decreasing order, can we compute y1 faster?
For example: I1 = [0 0 0 0 1 1 2 2 2 3 3 3 ... n-1 n-1].
The simple answer is no, at least not trivially. Your access pattern (e.g. x2(i2)*x3(i3)) does not access contiguous memory (at least not in a way known at compile time), but rather goes through a layer of indirection. Because of this, SIMD instructions are pretty useless, as they work on contiguous chunks of memory. What you may want to consider is creating a copy of xM gathered according to iM, removing the layer of indirection. This should reduce the cache misses that xM(iM) generates, and since it is accessed N times, that may reduce some of the wall time (assuming N is large).
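A sketch of that gathering idea (hypothetical buffer names; the gathers are still indirect, but the hot loop then streams contiguous memory):

#include <vector>

std::vector<double> g2(N), g3(N);
for (std::size_t k = 0; k < N; ++k) { g2[k] = x2[I2[k]]; g3[k] = x3[I3[k]]; }

// v, g2 and g3 are now read sequentially; only y1 is written through I1,
// and if I1 is sorted (as in the update above) those writes are sequential too.
for (std::size_t k = 0; k < N; ++k)
    y1[I1[k]] += v[k] * g2[k] * g3[k];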
If maximal accuracy is not critical, you may want to consider using an FFT method instead of the direct convolution (at least, that's how I understood your question; feel free to correct me if I'm wrong).
Assuming you are doing a convolution and the vectors (a and b, same size as in your question) are large, the result (c) can be calculated naïvely as
// O(n^2) naive (full) convolution: c(i) = sum_j a(j) * b(i - j),
// with c of size a.size() + b.size() - 1, zero-initialized
for (int i = 0; i < a.size(); i++)
    for (int j = 0; j < b.size(); j++)
        c(i + j) += a(i) * b(j);
Using the convolution theorem, you can take the Fourier transform of both a and b, perform an element-wise multiplication, and then take the inverse Fourier transform of the result to get c (the result will probably differ a little due to floating point):
// O(n log(n)); note A, B, and C are vectors of complex floating point numbers
fft.fwd(A, a);              // A = FFT(a); the destination comes first in Eigen's FFT API
fft.fwd(B, b);              // B = FFT(b)
C = A.array() * B.array();  // element-wise product in the frequency domain
fft.inv(c, C);              // c = IFFT(C)
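A self-contained sketch using Eigen's unsupported FFT module; note the zero-padding, without which the FFT computes a circular rather than linear convolution:

#include <unsupported/Eigen/FFT>
#include <Eigen/Dense>

int main() {
    Eigen::VectorXf a = Eigen::VectorXf::Random(1024);
    Eigen::VectorXf b = Eigen::VectorXf::Random(1024);

    // Zero-pad to the size of the linear convolution result.
    const int n = a.size() + b.size() - 1;
    Eigen::VectorXf ap = Eigen::VectorXf::Zero(n), bp = Eigen::VectorXf::Zero(n);
    ap.head(a.size()) = a;
    bp.head(b.size()) = b;

    Eigen::FFT<float> fft;
    Eigen::VectorXcf A, B;
    fft.fwd(A, ap);                          // destination first
    fft.fwd(B, bp);
    Eigen::VectorXcf C = A.cwiseProduct(B);  // element-wise product
    Eigen::VectorXf c;
    fft.inv(c, C);                           // c holds the linear convolution of a and b
    return 0;
}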
Just as Z boson recommended, I am using a column-major matrix format in order to avoid having to use the dot product. I don't see a feasible way to avoid it when multiplying a vector with a matrix, though. The matrix multiplication trick requires efficient extraction of rows (or columns, if we transpose the product). To multiply a vector by a matrix, we therefore transpose:
(b * A)^T = A^T * b^T
A is a matrix, b a row vector, which, after being transposed, becomes a column vector. Its rows are just single scalars and the vector * matrix product implementation becomes an inefficient implementation of dot products of columns of (non-transposed) matrix A with b. Is there a way to avoid performing these dot products? The only way I see that could do it, would involve row extraction, which is inefficient with the column-major matrix format.
This can be understood from my original post on this (my first on SO): efficient-4x4-matrix-vector-multiplication-with-sse-horizontal-add-and-dot-prod. The rest of the discussion applies to 4x4 matrices.
Here are two methods to do matrix times vector (v = Mu, where v and u are column vectors):
method 1) v1 = dot(row1, u), v2 = dot(row2, u), v3 = dot(row3, u), v4 = dot(row4, u)
method 2) v = u1*col1 + u2*col2 + u3*col3 + u4*col4.
The first method is more familiar from math class, while the second is more efficient for a SIMD computer. The second method uses vectorized math (like numpy), e.g.
u1*col1 = (u1*col1x, u1*col1y, u1*col1z, u1*col1w).
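A minimal SSE sketch of method 2 for a 4x4 column-major matrix (hypothetical function name; unaligned loads are used so M needs no particular alignment):

#include <immintrin.h>

// v = M * u, where M points to 16 floats stored column by column.
void mat4_mul_vec4(const float* M, const float* u, float* v) {
    __m128 acc = _mm_mul_ps(_mm_set1_ps(u[0]), _mm_loadu_ps(M + 0));             // u1*col1
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(u[1]), _mm_loadu_ps(M + 4)));   // + u2*col2
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(u[2]), _mm_loadu_ps(M + 8)));   // + u3*col3
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(u[3]), _mm_loadu_ps(M + 12)));  // + u4*col4
    _mm_storeu_ps(v, acc);
}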
Now let's look at vector times matrix (v = uM where v and u are row vectors)
method 1) v1 = dot(col1, u), v2 = dot(col2, u), v3 = dot(col3, u), v4 = dot(col4, u)
method 2) v = u1*row1 + u2*row2 + u3*row3 + u4*row4.
Now the roles of columns and rows have swapped but method 2 is still the efficient method to use on a SIMD computer.
To do matrix times vector efficiently on a SIMD computer, the matrix should be stored in column-major order. To do vector times matrix efficiently on a SIMD computer, the matrix should be stored in row-major order.
As far as I understand, OpenGL uses column-major ordering and does matrix times vector, while DirectX uses row-major ordering and does vector times matrix.
If you have three matrix transformations that you do in order M1 first then M2 then M3 with matrix times vector you write it as
v = M3*M2*M1*u //u and v are column vectors - OpenGL form
With vector times matrix you write
v = u*M1*M2*M3 //u and v are row vectors - DirectX form
Neither form is better than the other in terms of efficiency. It's just a question of notation (and causing confusion which is useful when you have competition).
It's important to note that for matrix*matrix row-major versus column-major storage is irrelevant.
If you want to know why the vertical SIMD instructions are faster than the horizontal ones, that's a separate question which should be asked on its own, but in short the horizontal ones really act in serial rather than in parallel and are broken up into several micro-ops (which is why, ironically, dppd is faster than dpps).
I have implemented an MCMC algorithm in C++ using the Eigen library. The main part of the algorithm is a loop in which first some matrix calculations are performed, after which the determinant of the resulting matrix is obtained and added to the output. E.g.:
MatrixXd delta0;
NumericVector out(3);
out[0] = 0;
out[1] = 0;
for (int i = 0; i < s; i++) {
    ...
    delta0 = V * (A.cast<double>() - (A + B).cast<double>() * theta.asDiagonal());
    ...
    double I = delta0.determinant();
    out[1] += I;
    out[2] += std::sqrt(I);
}
return out;
Now on certain matrices I unfortunately observe a numerical underflow, so that the determinant is output as zero (which it actually isn't).
How can I avoid this underflow?
One solution would be to obtain, instead of the determinant, the log of the determinant. However,
1. I do not know how to do this;
2. how could I then add up these logs?
Any help is greatly appreciated.
There are 2 main options that come to my mind:
The product of the eigenvalues of a square matrix is the determinant of that matrix, therefore a sum of the logarithms of the eigenvalues is the logarithm of the determinant. Assume det(A) = a and det(B) = b for compact notation. After applying the aforementioned to two matrices A and B, we end up with log(a) and log(b); then the following is true:
log(a + b) = log(a) + log(1 + e ^ (log(b) - log(a)))
Yes, we get a logarithm of the sum. What would you do with it next? That depends on what you need. If you have to remove the logarithm via e^(log(a + b)) = a + b, then you might be lucky that the value of a + b no longer underflows, but in some cases it can still underflow as well.
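A small sketch of staying in log space (an assumed helper, not from the posts above):

#include <cmath>
#include <utility>

// log(a + b) from log(a) and log(b), using the identity above.
double log_add(double log_a, double log_b) {
    if (log_b > log_a) std::swap(log_a, log_b); // keep the larger term outside
    return log_a + std::log1p(std::exp(log_b - log_a));
}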
Perform clever preconditioning; there might be tons of options here, and you had better read about them from trusted sources, as this is a serious topic. The simplest (and probably the cheapest) example of preconditioning for this particular problem is to recall that det(c * A) = (c ^ n) * det(A), where A is an n by n matrix: premultiply your matrix by some c, compute the determinant, and then divide it by c ^ n to get the actual one.
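A sketch of that scaling trick, undoing the factor in log space so the final division cannot itself underflow (the choice of c is illustrative):

const int n = delta0.rows();
double c = 1.0 / delta0.cwiseAbs().maxCoeff();   // brings entries to O(1)
double det_scaled = (c * delta0).determinant();  // hopefully no longer underflows
double log_abs_det = std::log(std::abs(det_scaled)) - n * std::log(c); // log|det(delta0)|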
Update
I thought about one more option. If on the last stages of #1 or #2 you still experience underflow too frequently, then it might be a good idea to increase precision specifically for these last operations, for example, by utilizing GNU MPFR.
You can use Householder elimination to get the QR decomposition of delta0. Then the determinant of the Q part is +/-1 (depending on whether you did an even or odd number of reflections) and the determinant of the R part is the product of the diagonal elements. Both of these are easy to compute without running into underflow hell, and you might not even care about the first.
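A sketch of this with Eigen's Householder QR (matrixQR() stores R in its upper triangle, and |det(Q)| = 1, so only the diagonal of R matters):

#include <Eigen/Dense>
#include <cmath>

double log_abs_det(const Eigen::MatrixXd& M) {
    Eigen::HouseholderQR<Eigen::MatrixXd> qr(M);
    return qr.matrixQR().diagonal().array().abs().log().sum(); // sum of log|r_ii|
}

Recent Eigen versions also provide HouseholderQR::logAbsDeterminant(), which does the same thing; the sums of logs can then be combined with the log_add helper sketched above.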