Eigen: how to speed up a += coeffs * coeffs.transpose() - c++

I need to compute many (about 400k) solutions of small linear least-squares problems. Each problem contains 10-300 equations with only 7 variables.
To solve these problems I use the Eigen library. Solving each problem directly takes too much time, so I transform each one into a 7x7 system of linear equations by working out the derivatives by hand.
I get a nice speed-up, but I want to increase performance further.
I used Valgrind to profile my program and found that the operation with the highest self cost is operator+= on an Eigen matrix. This operation costs more than ten calls to a.ldlt().solve(b);
I use this operator to build the A matrix and B vector of each system of equations:
// I call this code to solve each problem
const int nVars = 7;
// I really need double precision
Eigen::Matrix<double, nVars, nVars> a = Eigen::Matrix<double, nVars, nVars>::Zero();
Eigen::Matrix<double, nVars, 1> b = Eigen::Matrix<double, nVars, 1>::Zero();
Eigen::Matrix<double, nVars, 1> equationCoeffs;
//............................
// Somewhere in a big loop.
//equationCoeffs and z are updated on each iteration
a += equationCoeffs * equationCoeffs.transpose();
b += equationCoeffs * z;
where z is some scalar.
So my question is: how can I improve the performance of these operations?
PS: Sorry for my poor English.

Instead of forming the matrix and vector components of the normal equation by hand, one equation at a time, you might try to allocate a large enough matrix once (e.g. 300 x 7) to store all coefficients and then let Eigen's optimized matrix-matrix product kernels do the job for you:
Matrix<double,Dynamic,nVars> D(300,nVars);
VectorXd f(300);
for(...)
{
    int nb_equations = ...;
    for(int i = 0; i < nb_equations; ++i)
    {
        // equationCoeffs and z are recomputed for each equation
        D.row(i) = equationCoeffs.transpose();
        f(i) = z;
    }
    a = D.topRows(nb_equations).transpose() * D.topRows(nb_equations);
    b = D.topRows(nb_equations).transpose() * f.head(nb_equations);
    // solve ax=b
}
You might benchmark both column-major and row-major storage for the matrix D to see which one is best.
Another possible approach would be to declare a as an 8x8 matrix and equationCoeffs and b as 8x1 vectors, making sure that equationCoeffs(7)==0. This way you maximize SIMD usage. Then use a.topLeftCorner<7,7>() and b.head<7>() when calling LDLT. You might even combine this strategy with the previous one.
Finally, if your CPU supports AVX or FMA, you might use the devel branch and compile with -mavx or -mfma to get a significant speedup.
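To make the padding idea concrete, here is a minimal sketch in plain C++ (std::array instead of Eigen types, so it compiles without any library). The names PaddedVec, PaddedMat, padCoeffs and accumulate are hypothetical and are not Eigen API. The only point is that both loops run over a fixed length of 8, which a compiler can usually vectorize cleanly, while the zero in slot 7 keeps the extra row and column at zero.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kVars = 7;    // number of unknowns, as in the question
constexpr std::size_t kPadded = 8;  // padded size; slot 7 must stay zero

using PaddedVec = std::array<double, kPadded>;
using PaddedMat = std::array<PaddedVec, kPadded>;

// Hypothetical helper: copy the 7 coefficients into a padded vector.
PaddedVec padCoeffs(const std::array<double, kVars>& c) {
    PaddedVec p{};  // value-initialised, so the pad slot p[7] is 0.0
    for (std::size_t i = 0; i < kVars; ++i) p[i] = c[i];
    return p;
}

// a += c * c^T and b += c * z, computed over the padded size.
void accumulate(PaddedMat& a, PaddedVec& b, const PaddedVec& c, double z) {
    assert(c[kVars] == 0.0);  // the pad slot must be zero (equationCoeffs(7)==0)
    for (std::size_t i = 0; i < kPadded; ++i) {
        const double ci = c[i];
        for (std::size_t j = 0; j < kPadded; ++j) a[i][j] += ci * c[j];
        b[i] += ci * z;
    }
}
```

With Eigen itself the same layout would use Matrix<double,8,8> and Matrix<double,8,1>, passing only the top-left 7x7 block and the first 7 entries to LDLT as described above. Whether this is actually faster on a given machine has to be measured.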

If you can use g++ 5.1, you might want to take a look at OpenMP
( http://openmp.org/mp-documents/OpenMP4.0.0.Examples.pdf ).
g++ 5.1 (or gcc 5.1 for C) also has some basic support for OpenACC, so you can try that as well. More complete OpenACC implementations should appear in the future.
Also, if you have access to the Intel compiler (icc, icpc), simply switching to it sped up my code.
If you can use Nvidia's nvcc, you might use the Thrust library
( http://docs.nvidia.com/cuda/thrust/#axzz3g8xJPGHe ); there is a lot of sample code on their GitHub as well
( https://github.com/thrust/thrust ). However, using Thrust is not so straightforward and needs some real thinking.
EDIT:
Thrust also requires an Nvidia GPU.
For AMD cards I believe there is a library called ArrayFire, which looks very similar to Thrust (I have not tried that one yet).

I have a single problem Ax=b with 480k float variables. The matrix A is sparse, and solving it with Eigen's BiCGSTAB took 4.8 seconds.
I had also worked with ViennaCL before, so I tried to solve the same problem there, and it took only 1.2 seconds. The speed-up comes from processing on the GPU.

Related

Best way of solving sparse linear systems in C++ - GPU Possible?

I am currently working on a project where we need to solve
|Ax - b|^2.
In this case, A is a very sparse matrix and A'A has at most 5 nonzero elements in each row.
We are working with images and the dimension of A'A is NxN where N is the number of pixels. In this case N = 76800. We plan to go to RGB and then the dimension will be 3Nx3N.
In matlab solving (A'A)\(A'b) takes about 0.15 s, using doubles.
I have now done some experimenting with Eigen's sparse solvers. I have tried:
SimplicialLLT
SimplicialLDLT
SparseQR
ConjugateGradient
and some different orderings. By far the best so far is
SimplicialLDLT
which takes about 0.35 - 0.5 s using AMDOrdering.
When I use ConjugateGradient, for example, it takes roughly 6 s, using 0 as the initial guess.
The code for solving the problem is:
A_tot.makeCompressed();
// Create solver
Eigen::SimplicialLDLT<Eigen::SparseMatrix<float>, Eigen::Lower, Eigen::AMDOrdering<int> > solver;
// Eigen::ConjugateGradient<Eigen::SparseMatrix<float>, Eigen::Lower> cg;
solver.analyzePattern(A_tot);
t1 = omp_get_wtime();
solver.compute(A_tot);
if (solver.info() != Eigen::Success)
{
std::cerr << "Decomposition Failed" << std::endl;
getchar();
}
Eigen::VectorXf opt = solver.solve(b_tot);
t2 = omp_get_wtime();
std::cout << "Time for normal equations: " << t2 - t1 << std::endl;
This is the first time I have used sparse matrices and sparse solvers in C++. For this project speed is crucial, and getting below 0.1 s is the minimum requirement.
I would like to get some feedback on which would be the best strategy here. For example, one is supposed to be able to use SuiteSparse and OpenMP with Eigen. What are your experiences with these types of problems? Is there a way of extracting the structure, for example? And should ConjugateGradient really be that slow?
Edit:
Thanks for some valuable comments! Tonight I have been reading a bit about cuSPARSE from Nvidia. It seems to be able to do factorisation and even solve systems. In particular, it seems one can reuse the pattern and so forth. The question is how fast this could be and what the possible overhead is.
Given that the amount of data in my matrix A is the same as in an image, the memory copying should not be such an issue. Some years ago I wrote software for real-time 3D reconstruction, where you copy data for each frame, and even a slow version still runs at over 50 Hz. So if the factorization is much faster, is it a possible speed-up? In particular, the rest of the project will be on the GPU, so if one can solve it there directly and keep the solution, it is no drawback, I guess.
A lot has happened in the field of CUDA and I am not really up to date.
Here are two links I found: Benchmark?, MatlabGPU
Your matrix is extremely sparse and corresponds to a discretization on a 2D domain, so it is expected that SimplicialLDLT is the fastest here. Since the sparsity pattern is fixed, call analyzePattern once, and then factorize instead of compute. This should save some milliseconds. Moreover, since you're working on a regular grid, you might also try to bypass the re-ordering step using NaturalOrdering (not 100% sure, you have to bench). If that's still not fast enough, you might search for a Cholesky solver tailored for skyline matrices (the Cholesky factorization is much simpler and thus faster in this case).

Eigen equivalent to Octave/MATLAB mldivide for rectangular matrices

I'm using Eigen v3.2.7.
I have a medium-sized rectangular matrix X (170x17) and column vector Y (170x1) and I'm trying to solve them using Eigen. Octave solves this problem fine using X\Y, but Eigen is returning incorrect values for these matrices (but not smaller ones). However, I suspect that the problem is how I'm using Eigen, rather than Eigen itself.
auto X = Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic>{170, 17};
auto Y = Eigen::Matrix<T, Eigen::Dynamic, 1>{170};
// Assign their values...
const auto theta = X.colPivHouseholderQr().solve(Y).eval(); // Wrong!
According to the Eigen documentation, the ColPivHouseholderQR solver is for general matrices and pretty robust, but to make sure I've also tried the FullPivHouseholderQR. The results were identical.
Is there some special magic that Octave's mldivide does that I need to implement manually for Eigen?
Update
This spreadsheet has the two input matrices, plus Octave's and my result matrices.
Replacing auto doesn't make a difference, nor would I expect it to because construction cannot be a lazy operation, and I have to call .eval() on the solve result because the next thing I do with the result matrix is get at the raw data (using .data()) on tail and head operations. The expression template versions of the result of those block operations do not have a .data() member, so I have to force evaluation beforehand - in other words theta is the concrete type already, not an expression template.
The result for (X*theta-Y).norm()/Y.norm() is:
2.5365e-007
And the result for (X.transpose()*X*theta-X.transpose()*Y).norm() / (X.transpose()*Y).norm() is:
2.80096e-007
As I'm currently using single precision float for my basic numerical type, that's pretty much zero for both.
According to your verifications, the solution you get is perfectly fine. If you want more accuracy, then use double-precision floating-point numbers. Note that MATLAB/Octave use double precision by default.
Moreover, it may well be that your problem is not full rank, in which case it admits an infinite number of solutions. ColPivHouseholderQR picks one, somewhat arbitrarily. On the other hand, mldivide will pick the minimal-norm one, which you can also obtain with Eigen::BDCSVD (Eigen 3.3) or the slower Eigen::JacobiSVD.

How can I get eigenvalues and eigenvectors fast and accurate?

I need to compute the eigenvalues and eigenvectors of a big matrix (about 1000*1000 or even more). Matlab works very fast but it does not guarantee accuracy. I need this to be pretty accurate (about 1e-06 error is ok) and within a reasonable time (an hour or two is ok).
My matrix is symmetric and pretty sparse. The exact values are: ones on the diagonal, and on the diagonal below the main diagonal, and on the diagonal above it. Example:
How can I do this? C++ is the most convenient to me.
MATLAB does not guarantee accuracy
I find this claim unreasonable. On what grounds do you say that you can find a (significantly) more accurate implementation than MATLAB's highly refined computational algorithms?
AND... using MATLAB's eig, the following is computed in less than half a second:
%// Generate the input matrix
X = ones(1000);
A = triu(X, -1) + tril(X, 1) - X;
%// Compute eigenvalues
v = eig(A);
It's fast alright!
I need this to be pretty accurate (about 1e-06 error is OK)
Remember that solving eigenvalues accurately is related to finding the roots of the characteristic polynomial. This specific 1000x1000 matrix is very ill-conditioned:
>> cond(A)
ans =
1.6551e+003
A general rule of thumb is that for a condition number of 10^k, you may lose up to k digits of accuracy (on top of what the numerical method itself loses to finite-precision arithmetic).
So in your case, I'd expect the results to be accurate up to an approximate error of 10^-3.
If you're not opposed to using a third party library, I've had great success using the Armadillo linear algebra libraries.
For the example below, arma is the namespace they like to use, vec is a vector, mat is a matrix.
arma::vec getEigenValues(arma::mat M) {
return arma::eig_sym(M);
}
You can also serialize the data directly into MATLAB and vice versa.
Your matrix is a tridiagonal (symmetric) Toeplitz matrix. I'd guess that Eigen and Matlab's eig have special cases to handle such matrices. There is a closed-form solution for the eigenvalues in this case (reference (PDF)). In Matlab for your matrix this is simply:
n = size(A,1);
k = (1:n).';
v = 1-2*cos(pi*k./(n+1));
This can be further optimized by noting that the eigenvalues are centered about 1 and thus only half of them need to be computed:
n = size(A,1);
if mod(n,2) == 0
k = (1:n/2).';
u = 2*cos(pi*k./(n+1));
v = 1+[u;-u];
else
k = (1:(n-1)/2).';
u = 2*cos(pi*k./(n+1));
v = 1+[u;0;-u];
end
I'm not sure how you're going to get anything faster and more accurate than that (other than performing a refinement step using the eigenvectors and optimization) with simple code. The above should be easy to translate to C++ (or use Matlab's codegen to generate C/C++ code that uses this or eig). However, your matrix is still ill-conditioned. Just remember that estimates of accuracy are worst case.
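As a rough illustration of that translation, the closed-form expression above could look like this in C++. The function name tridiagOnesEigenvalues is made up for this sketch, and it covers only the eigenvalues of the specific matrix in the question (ones on the main, sub- and super-diagonal), not the eigenvectors.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Eigenvalues of the n x n symmetric tridiagonal Toeplitz matrix with ones on
// the main diagonal and on the diagonals directly above and below it:
//   v_k = 1 - 2*cos(pi*k/(n+1)),  k = 1..n
std::vector<double> tridiagOnesEigenvalues(std::size_t n) {
    assert(n > 0);  // the formula is for a non-empty matrix
    const double pi = std::acos(-1.0);
    std::vector<double> v(n);
    for (std::size_t k = 1; k <= n; ++k) {
        v[k - 1] = 1.0 - 2.0 * std::cos(pi * static_cast<double>(k) /
                                        static_cast<double>(n + 1));
    }
    return v;  // ascending, because cos is decreasing on (0, pi)
}
```

For n = 2 the matrix is [1 1; 1 1], whose eigenvalues are 0 and 2, which is a quick sanity check of the formula.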

Compute rank of Matrix

I need to calculate the rank of a 4096x4096 sparse matrix, and I use C/C++ code.
I found some libraries (like Armadillo) that do it, but they're too slow (almost 5 minutes).
I've also tried two open-source alternatives to Matlab (FreeMat and Octave), but both crashed when I tried to run a test script.
5 minutes isn't so much, but I must get the rank of something like a million matrices, so the faster the better.
Does anyone know a fast library for rank computation?
The Eigen library supports sparse matrices, try it out.
Computing the algebraic rank is O(n^3), where n is the matrix size, so it's inherently slow. You need eg. to perform pivoting, and this is slow and inaccurate if your matrix is not well conditioned (for n = 4096, a typical matrix is very ill conditioned).
Now, what is the rank ? It is the dimension of the image. It is very difficult to compute when n is large and it'll be spoiled by any small numerical inaccuracy of the input. For n = 4096, unless you happen to have particularly well conditioned matrices, this will prevent you from doing anything useful with a pivoting algorithm.
The best way is in fact to fix a cutoff epsilon, compute the singular values s_1 > ... > s_n and take as the rank the lowest integer r such that sum(s_i^2, i > r) < epsilon^2 * sum(s_i^2).
You thus need a sparse SVD routine, eg. from there.
This may not be faster, but to the very least it will be correct.
You can ask for fewer singular values than the full set to speed things up. This is a tough problem, and with no info on the background and how you got these matrices, there is nothing more we can do.
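The cutoff rule above can be sketched as follows, assuming the singular values have already been computed by some SVD routine and are sorted in non-increasing order. The function name numericalRank is hypothetical, and treating an all-zero input as rank 0 is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// s: singular values sorted so that s[0] >= s[1] >= ... >= s[n-1] >= 0.
// Returns the smallest r such that sum_{i>r} s_i^2 < eps^2 * sum_i s_i^2.
std::size_t numericalRank(const std::vector<double>& s, double eps) {
    assert(eps > 0.0);
    double total = 0.0;
    for (double x : s) total += x * x;
    if (total == 0.0) return 0;  // zero matrix: rank 0 (assumed convention)

    const double threshold = eps * eps * total;
    std::size_t r = s.size();
    double tail = 0.0;  // sum of s_i^2 over the values already discarded
    // Discard the smallest values while the discarded energy stays below the cutoff.
    while (r > 0 && tail + s[r - 1] * s[r - 1] < threshold) {
        tail += s[r - 1] * s[r - 1];
        --r;
    }
    return r;
}
```

Because only the tail sum matters, a partial SVD that returns just the largest values is enough as long as the total sum of squares (the squared Frobenius norm of the matrix) is known separately.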
Try the following code (the documentation is here).
It is an example for calculating the rank of the matrix A with Eigen library:
MatrixXd A(2,2);
A << 1 , 0, 1, 0;
FullPivLU<MatrixXd> luA(A);
int rank = luA.rank();

Worse performance using Eigen than using my own class

A couple of weeks ago I asked a question about the performance of matrix multiplication.
I was told that in order to enhance the performance of my program I should use some specialised matrix classes rather than my own class.
StackOverflow users recommended:
uBLAS
EIGEN
BLAS
At first I wanted to use uBLAS; however, reading the documentation, it turned out that this library doesn't support matrix-matrix multiplication.
In the end I decided to use the Eigen library. So I replaced my matrix class with Eigen::MatrixXd; however, it turned out that my application now runs even slower than before.
The time before using Eigen was 68 seconds, and after replacing my matrix class with the Eigen matrix the program runs for 87 seconds.
The parts of the program which take the most time look like this:
TemplateClusterBase* TemplateClusterBase::TransformTemplateOne( vector<Eigen::MatrixXd*>& pointVector, Eigen::MatrixXd& rotation ,Eigen::MatrixXd& scale,Eigen::MatrixXd& translation )
{
for (int i=0;i<pointVector.size();i++ )
{
//Eigen::MatrixXd outcome =
Eigen::MatrixXd outcome = (rotation*scale)* (*pointVector[i]) + translation;
//delete prototypePointVector[i]; // ((rotation*scale)* (*prototypePointVector[i]) + translation).ConvertToPoint();
MatrixHelper::SetX(*prototypePointVector[i],MatrixHelper::GetX(outcome));
MatrixHelper::SetY(*prototypePointVector[i],MatrixHelper::GetY(outcome));
//assosiatedPointIndexVector[i] = prototypePointVector[i]->associatedTemplateIndex = i;
}
return this;
}
and
Eigen::MatrixXd AlgorithmPointBased::UpdateTranslationMatrix( int clusterIndex )
{
double membershipSum = 0,outcome = 0;
double currentPower = 0;
Eigen::MatrixXd outcomePoint = Eigen::MatrixXd(2,1);
outcomePoint << 0,0;
Eigen::MatrixXd templatePoint;
for (int i=0;i< imageDataVector.size();i++)
{
currentPower =0;
membershipSum += currentPower = pow(membershipMatrix[clusterIndex][i],m);
outcomePoint.noalias() += (*imageDataVector[i] - (prototypeVector[clusterIndex]->rotationMatrix*prototypeVector[clusterIndex]->scalingMatrix* ( *templateCluster->templatePointVector[prototypeVector[clusterIndex]->assosiatedPointIndexVector[i]]) ))*currentPower ;
}
outcomePoint.noalias() = outcomePoint/=membershipSum;
return outcomePoint; //.ConvertToMatrix();
}
As you can see, these functions perform a lot of matrix operations. That is why I thought using Eigen would speed up my application. Unfortunately (as I mentioned above), the program runs slower.
Is there any way to speed up these functions?
Maybe if I used DirectX matrix operations I would get better performance? (However, I have a laptop with an integrated graphics card.)
If you're using Eigen's MatrixXd types, those are dynamically sized. You should get much better results from using the fixed-size types, e.g. Matrix4d, Vector4d.
Also, make sure you're compiling such that the code can get vectorized; see the relevant Eigen documentation.
Re your thought on using the Direct3D extensions library stuff (D3DXMATRIX etc): it's OK (if a bit old fashioned) for graphics geometry (4x4 transforms etc), but it's certainly not GPU accelerated (just good old SSE, I think). Also, note that it's single precision (float) only (you seem to be set on using doubles). Personally I'd much prefer to use Eigen unless I was actually coding a Direct3D app.
Make sure to have compiler optimization switched on (e.g. at least -O2 on gcc). Eigen is heavily templated and will not perform very well if you don't turn on optimization.
Which version of Eigen are you using? They recently released 3.0.1, which is supposed to be faster than 2.x. Also, make sure you play a bit with the compiler options. For example, make sure SSE is being used in Visual Studio:
C/C++ --> Code Generation --> Enable Enhanced Instruction Set
You should profile and then optimize first the algorithm, then the implementation. In particular, the posted code is quite inefficient:
for (int i=0;i<pointVector.size();i++ )
{
Eigen::MatrixXd outcome = (rotation*scale)* (*pointVector[i]) + translation;
I don't know the library, so I won't even try to guess the number of unnecessary temporaries that you are creating, but a simple refactor:
Eigen::MatrixXd tmp = rotation*scale;
for (int i=0;i<pointVector.size();i++ )
{
Eigen::MatrixXd outcome = tmp*(*pointVector[i]) + translation;
can save you a good number of expensive multiplications (and again, probably new temporary matrices that get discarded right away).
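For illustration only, here is the same hoisting written as a self-contained sketch with plain fixed-size 2D types instead of Eigen matrices. Vec2, Mat2, multiply and transformPoints are hypothetical names, and the 2x1 point size is an assumption taken from the MatrixXd(2,1) in the question's second function.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Vec2 = std::array<double, 2>;
using Mat2 = std::array<std::array<double, 2>, 2>;

// Plain 2x2 matrix product.
Mat2 multiply(const Mat2& a, const Mat2& b) {
    Mat2 r{};
    for (std::size_t i = 0; i < 2; ++i)
        for (std::size_t j = 0; j < 2; ++j)
            r[i][j] = a[i][0] * b[0][j] + a[i][1] * b[1][j];
    return r;
}

// out[i] = (rotation * scale) * in[i] + translation
void transformPoints(const Mat2& rotation, const Mat2& scale,
                     const Vec2& translation,
                     const std::vector<Vec2>& in, std::vector<Vec2>& out) {
    assert(in.size() == out.size());
    const Mat2 m = multiply(rotation, scale);  // loop-invariant: computed once
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i][0] = m[0][0] * in[i][0] + m[0][1] * in[i][1] + translation[0];
        out[i][1] = m[1][0] * in[i][0] + m[1][1] * in[i][1] + translation[1];
    }
}
```

The loop body now does one small matrix-vector product per point and allocates nothing, and the points sit contiguously in one vector rather than behind a vector of pointers.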
A couple of points.
Why are you multiplying rotation*scale inside of the loop when that product will have the same value each iteration? That is a lot of wasted effort.
You are using dynamically sized matrices rather than fixed sized matrices. Someone else mentioned this already, and you said you shaved off 2 sec.
You are passing arguments as a vector of pointers to matrices. This adds an extra pointer indirection and destroys any guarantee of data locality, which will give poor cache performance.
I hope this isn't insulting, but are you compiling in Release or Debug? Eigen is very slow in debug builds, because it uses lots of trivial templated functions that are optimized out of release but remain in debug.
Looking at your code, I am hesitant to blame Eigen for performance problems. However, most linear algebra libraries (including Eigen) are not really designed for your use case of lots of tiny matrices. In general, Eigen will be better optimized for 100x100 or larger matrices. You very well may be better off using your own matrix class or the DirectX math helper classes. The DirectX math classes are completely independent from your video card.
Looking back at your previous post and the code in there, my suggestion would be to use your old code, but improve its efficiency by moving things around. I'm posting on that previous question to keep the answers separate.