Matrix multiplication optimization - c++

I am performing a series of matrix multiplications with fairly large matrices. To run through all of these operations takes a long time, and I need my program to do this in a large loop. I was wondering if anyone has any ideas to speed this up? I just started using Eigen, so I have very limited knowledge.
I was using ROOT-cern's built in TMatrix class, but the speed for performing the matrix operations is very poor. I set up some diagonal matrices using Eigen with the hope that it handled the multiplication operation in a more optimal way. It may, but I cannot really see the performance difference.
// setup matrices
int size = 8000;
Eigen::MatrixXf a(size*2,size);
// fill matrix a....
Eigen::MatrixXf r(2*size,2*size); // diagonal matrix of row sums of a
// fill matrix r
Eigen::MatrixXf c(size,size); // diagonal matrix of col sums of a
// fill matrix c
// transpose a in place
a.transposeInPlace();
Eigen::MatrixXf c_dia;
c_dia = c.diagonal().asDiagonal();
Eigen::MatrixXf r_dia;
r_dia = r.diagonal().asDiagonal();
// calc car
Eigen::MatrixXf car;
car = c_dia*a*r_dia;

You are doing way too much work here. If you have diagonal matrices, only store the diagonal (and directly use that for products). Once you store a diagonal matrix in a square matrix, the information of the structure is lost to Eigen.
Also, you don't need to store the transposed variant of a, just use a.transpose() inside a product (that is only a minor issue here ...)
// setup matrices
int size = 8000;
Eigen::MatrixXf a(size*2,size);
// fill matrix a....
a.setRandom();
Eigen::VectorXf r = a.rowwise().sum(); // diagonal matrix of row sums of a
Eigen::VectorXf c = a.colwise().sum(); // diagonal matrix of col sums of a
Eigen::MatrixXf car = c.asDiagonal() * a.transpose() * r.asDiagonal();
Finally, of course make sure to compile with optimization enabled, and enable vectorization if available (with gcc or clang compile with -O2 -march=native).

Related

Is there a something like a sparse cube in armadillo or some way of using sparse matrices as slices in a cube?

I am using armadillos sparse matrices. But now I would like to use something like a "sparse cube" which does not exist in armadillo. writing sparse matrices into a cube with cube.slice(some_sparse_matrix) converts everything back to a dense cube.
I am using sparse matrices in order to multiply a vector with. for larger vectors/matrices the sparse variant is much faster. Now I have to sum up the multiplications of several sparse matrices with several vectors.
would a std:vector be a way?
In my experience it is faster to use armadillos functions (for example a subvector or arma::span() or arma::sum() )) as opposed to write loops myself. So I was wondering what would be the fastest way of doing this.
It's possible to approximate a sparse cube using the field class, like so.
arma::uword number_of_matrices = 10;
arma::uword number_of_rows = 5000;
arma::uword number_of_cols = 5000;
arma::field<arma::sp_mat> F(number_of_matrices);
F.for_each( [&](arma::sp_mat& X) { X.set_size(number_of_rows, number_of_cols); } );
F(0)(1,2) = 456.7; // write to element (1,2) in matrix 0
F(1)(2,3) = 567.8; // write to element (2,3) in matrix 1
F.print("F:"); // show all matrices
Your compiler must support at least C++11 for this to work.

Matrix multiplication very slow in Eigen

I have implemented a Gauss-Newton optimization process which involves calculating the increment by solving a linearized system Hx = b. The H matrx is calculated by H = J.transpose() * W * J and b is calculated from b = J.transpose() * (W * e) where e is the error vector. Jacobian here is a n-by-6 matrix where n is in thousands and stays unchanged across iterations and W is a n-by-n diagonal weight matrix which will change across iterations (some diagonal elements will be set to zero). However I encountered a speed issue.
When I do not add the weight matrix W, namely H = J.transpose()*J and b = J.transpose()*e, my Gauss-Newton process can run very fast in 0.02 sec for 30 iterations. However when I add the W matrix which is defined outside the iteration loop, it becomes so slow (0.3~0.7 sec for 30 iterations) and I don't understand if it is my coding problem or it normally takes this long.
Everything here are Eigen matrices and vectors.
I defined my W matrix using .asDiagonal() function in Eigen library from a vector of inverse variances. then just used it in the calculation for H ad b. Then it gets very slow. I wish to get some hints about the potential reasons for this huge slowdown.
EDIT:
There are only two matrices. Jacobian is definitely dense. Weight matrix is generated from a vector by the function vec.asDiagonal() which comes from the dense library so I assume it is also dense.
The code is really simple and the only difference that's causing the time change is the addition of the weight matrix. Here is a code snippet:
for (int iter=0; iter<max_iter; ++iter) {
// obtain error vector
error = ...
// calculate H and b - the fast one
Eigen::MatrixXf H = J.transpose() * J;
Eigen::VectorXf b = J.transpose() * error;
// calculate H and b - the slow one
Eigen::MatrixXf H = J.transpose() * weight_ * J;
Eigen::VectorXf b = J.transpose() * (weight_ * error);
// obtain delta and update state
del = H.ldlt().solve(b);
T <- T(del) // this is pseudo code, meaning update T with del
}
It is in a function in a class, and weight matrix now for debug purposes is defined as a class variable that can be accessed by the function and is defined before the function is called.
I guess that weight_ is declared as a dense MatrixXf? If so, then replace it by w.asDiagonal() everywhere you use weight_, or make the later an alias to the asDiagonal expression:
auto weight = w.asDiagonal();
This way Eigen will knows that weight is a diagonal matrix and computations will be optimized as expected.
Because the matrix multiplication is just the diagonal, you can change it to use coefficient wise multiplication like so:
MatrixXd m;
VectorXd w;
w.setLinSpaced(5, 2, 6);
m.setOnes(5,5);
std::cout << (m.array().rowwise() * w.array().transpose()).matrix() << "\n";
Likewise, the matrix vector product can be written as:
(w.array() * error.array()).matrix()
This avoids the zero elements in the matrix. Without an MCVE for me to base this on, YMMV...

Principal Component Analysis with Eigen Library

I'm trying to compute the 2 major principal components from a dataset in C++ with Eigen.
The way I do it at the moment is to normalize the data between [0, 1] and then center the mean. After that I compute the covariance matrix and run an eigenvalue decomposition on it. I know SVD is faster, but I'm confused about the computed components.
Here is the major code about how I do it (where traindata is my MxN sized input matrix):
Eigen::VectorXf normalize(Eigen::VectorXf vec) {
for (int i = 0; i < vec.size(); i++) { // normalize each feature.
vec[i] = (vec[i] - minCoeffs[i]) / scalingFactors[i];
}
return vec;
}
// Calculate normalization coefficients (globals of type Eigen::VectorXf).
maxCoeffs = traindata.colwise().maxCoeff();
minCoeffs = traindata.colwise().minCoeff();
scalingFactors = maxCoeffs - minCoeffs;
// For each datapoint.
for (int i = 0; i < traindata.rows(); i++) { // Normalize each datapoint.
traindata.row(i) = normalize(traindata.row(i));
}
// Mean centering data.
Eigen::VectorXf featureMeans = traindata.colwise().mean();
Eigen::MatrixXf centered = traindata.rowwise() - featureMeans;
// Compute the covariance matrix.
Eigen::MatrixXf cov = centered.adjoint() * centered;
cov = cov / (traindata.rows() - 1);
Eigen::SelfAdjointEigenSolver<Eigen::MatrixXf> eig(cov);
// Normalize eigenvalues to make them represent percentages.
Eigen::VectorXf normalizedEigenValues = eig.eigenvalues() / eig.eigenvalues().sum();
// Get the two major eigenvectors and omit the others.
Eigen::MatrixXf evecs = eig.eigenvectors();
Eigen::MatrixXf pcaTransform = evecs.rightCols(2);
// Map the dataset in the new two dimensional space.
traindata = traindata * pcaTransform;
The result of this code is something like this:
To confirm my results, I tried the same with WEKA. So what I did is to use the normalize and the center filter, in this order. Then the principal component filter and save + plot the output. The result is this:
Technically I should have done the same, however the outcome is so different. Can anyone see if I made a mistake?
When scaling to 0,1, you modify the local variable vec but forgot to update traindata.
Moreover, this can be done more easily this way:
RowVectorXf minCoeffs = traindata.colwise().maxCoeff();
RowVectorXf minCoeffs = traindata.colwise().minCoeff();
RowVectorXf scalingFactors = maxCoeffs - minCoeffs;
traindata = (traindata.rowwise()-minCoeffs).array().rowwise() / scalingFactors.array();
that is, using row-vectors and array features.
Let me also add that the symmetric eigenvalue decomposition is actually faster than SVD. The true advantage of SVD in this case is that it avoids squaring the entries, but since your input data are normalized and centered, and that you only care about the largest eigenvalues, there is no accuracy concern here.
The reason was that Weka standardized the dataset. This means it scales each feature's variance to unit variance. When I did this, the plots looked the same. Technically my approach was correct as well.

sparse sparse product A^T*A optim in Eigen lib

In the case of multiple of same matrix matA, like
matA.transpose()*matA,
You don't have to compute all result product, because the result matrix is symmetric(so only if the m>n), in my specific case is always symmetric! square.
So its enough the compute only for. ex. lower triangular part and rest only copy..... because the results of the multiple 2nd and 3rd row, resp.col, is the same like 3rd and 2nd.....And etc....
So my question is , exist way how to tell Eigen, to compute only lower part. and optionally save to only lower trinaguler part the product?
DATA = SparseMatrix<double>((SparseMatrix<double>(matA.transpose()) * matA).pruned()).toDense();
According to the documentation, you can evaluate the lower triangle of a matrix with:
m1.triangularView<Eigen::Lower>() = m2 + m3;
or in your case:
m1.triangularView<Eigen::Lower>() = matA.transpose()*matA;
(where it says "Writing to a specific triangular part: (only the referenced triangular part is evaluated)"). Otherwise, in the line you've written
Eigen will calculate the entire sparse matrix matA.transpose()*matA.
Regarding saving the resulting m1 matrix, it is the same as saving whatever type of matrix it is (Eigen::MatrixXt or Eigen::SparseMatrix<t>). If m1 is sparse, then it will be only half the size of a straightforward matA.transpose()*matA. If m1 is dense, then it will be the full square matrix.
https://eigen.tuxfamily.org/dox/classEigen_1_1SparseSelfAdjointView.html
The symmetric rank update is defined as:
B = B + alpha * A * A^T
where alpha is a scalar. In your case, you are doing A^T * A, so you should pass the transposed matrix instead. The resulting matrix will only store the upper or lower portion of the matrix, whichever you prefer. For example:
SparseMatrix<double> B;
B.selfadjointView<Lower>().rankUpdate(A.transpose());

Diagonalization of a 2x2 self-adjoined (hermitian) matrix

Diagonalizing a 2x2 hermitian matrix is simple, it can be done analytically. However, when it comes to calculating the eigenvalues and eigenvectors over >10^6 times, it is important to do it as efficient as possible. Especially if the off-diagonal elements can vanish it is not possible to use one formula for the eigenvectors: An if-statement is necessary, which of course slows down the code. Thus, I thought using Eigen, where it's stated that the diagonalization of 2x2 and 3x3 matrices is optimized, would be still a good choice:
using
const std::complex<double> I ( 0.,1. );
inline double block_distr ( double W )
{
return (-W/2. + rand() * W/RAND_MAX);
}
a test-loop would be
...
SelfAdjointEigenSolver<Matrix<complex< double >, 2, 2> > ces;
Matrix<complex< double >, 2, 2> X;
for (int i = 0 ; i <iter_MAX; ++i) {
a00=block_distr(100.);
a11=block_distr(100.);
re_a01=block_distr(100.);
im_a01=block_distr(100.);
X(0,0)=a00;
X(1,0)=re_a01-I*im_a01;
//only the lower triangular part is referenced! X(0,1)=0.; <--- not necessary
X(1,1)=a11;
ces.compute(X,ComputeEigenvectors);
}
Writing the loop without Eigen, using directly the formulas for eigenvalues and eigenvectors of a hermitian matrix and an if-statement to check if the off diagonal is zero, is a factor of 5 faster. Am I not using Eigen properly or is such an overhead normal? Are there other lib.s which are optimized for small self-adjoint matrices?
By default, the iterative method is used. To use the analytical version for the 2x2 and 3x3, you have to call the computeDirect function:
ces.computeDirect(X);
but it is unlikely to be faster than your implementation of the analytic formulas.