Best way of solving sparse linear systems in C++ - GPU Possible? - c++

I am currently working on a project where we need to solve
|Ax - b|^2.
In this case, A is a very sparse matrix and A'A has at most 5 nonzero elements in each row.
We are working with images and the dimension of A'A is NxN where N is the number of pixels. In this case N = 76800. We plan to go to RGB and then the dimension will be 3Nx3N.
In matlab solving (A'A)\(A'b) takes about 0.15 s, using doubles.
I have now done some experimenting with Eigens sparse solvers. I have tried:
SimplicialLLT
SimplicialLDLT
SparseQR
ConjugateGradient
and some different orderings. The by far best so far is
SimplicialLDLT
which takes about 0.35 - 0.5 using AMDOrdering.
When I for example use ConjugateGradient it takes roughly 6 s, using 0 as initilization.
The code for solving the problem is:
A_tot.makeCompressed();
// Create solver
Eigen::SimplicialLDLT<Eigen::SparseMatrix<float>, Eigen::Lower, Eigen::AMDOrdering<int> > solver;
// Eigen::ConjugateGradient<Eigen::SparseMatrix<float>, Eigen::Lower> cg;
solver.analyzePattern(A_tot);
t1 = omp_get_wtime();
solver.compute(A_tot);
if (solver.info() != Eigen::Success)
{
std::cerr << "Decomposition Failed" << std::endl;
getchar();
}
Eigen::VectorXf opt = solver.solve(b_tot);
t2 = omp_get_wtime();
std::cout << "Time for normal equations: " << t2 - t1 << std::endl;
This is the first time I use sparse matrices in C++ and its solvers. For this project speed is crucial and below 0.1 s is a minimum.
I would like to get some feedback on which would be the best strategy here. For example one is supposed to be able to use SuiteSparse and OpenMP in Eigen. What are your experiences about these types of problems? Is there a way of extracting the structure for example? And should conjugateGradient really be that slow?
Edit:
Thanks for som valuable comments! Tonight I have been reading a bit about cuSparse on Nvidia. It seems to be able to do factorisation an even solve systems. In particular it seems one can reuse pattern and so forth. The question is how fast could this be and what is the possible overhead?
Given that the amount of data in my matrix A is the same as in an image, the memory copying should not be such an issue. I did some years ago software for real-time 3D reconstruction and then you copy data for each frame and a slow version still runs in over 50 Hz. So if the factorization is much faster it is a possible speed-up? In particualar the rest of the project will be on the GPU, so if one can solve it there directly and keep the solution it is no drawback I guess.
A lot has happened in the field of Cuda and I am not really up to date.
Here are two links I found: Benchmark?, MatlabGPU

Your matrix is extremely sparse and corresponds to a discretization on a 2D domain, so it is expected that SimplicialLDLT is the fastest here. Since the sparsity pattern is fixed, call analyzePattern once, and then factorize instead of compute. This should save some milliseconds. Moreover, since you're working on a regular grid, you might also try to bypass the re-ordering step using NaturalOrdering (not 100% sure, you have to bench). If that's still not fast enough, you might search for a Cholesky solver tailored for skyline matrices (the Cholesky factorization is much simpler and thus faster in this case).

Related

Eigen sparse solvers give drastically different solutions for the same linear system

I am trying to solve a sparse linear system as quickly as possible using eigen.
The docs give you 4 sparse solvers toc hoose from (but really it;s more like these three):
SimplicialLLT
#include<Eigen/SparseCholesky> Direct LLt factorization SPD Fill-in reducing LGPL
SimplicialLDLT is often preferable
SimplicialLDLT
#include<Eigen/SparseCholesky> Direct LDLt factorization SPD Fill-in reducing LGPL
Recommended for very sparse and not too large problems (e.g., 2D Poisson eq.)
SparseLU
#include<Eigen/SparseLU> LU factorization Square Fill-in reducing, Leverage fast dense algebra MPL2
optimized for small and large problems with irregular patterns
When I use the last solver, i.e. I do:
Eigen::SparseLU<Eigen::SparseMatrix<Scalar>> solver(bijection);
Assert(solver.info() == Eigen::Success, "Matrix is degenerate.");
solver.compute(bijection);
Assert(solver.info() == Eigen::Success, "Matrix is degenerate.");
Eigen::VectorXf vertices_u = solver.solve(u);
Assert(solver.info() == Eigen::Success, "Matrix is degenerate.");
Eigen::VectorXf vertices_v = solver.solve(v);
Assert(solver.info() == Eigen::Success, "Matrix is degenerate.");
I get the correct result, which graphically looks like this:
If I use simplicialLDLT, i.e. if I change the solver line and nothing else to:
Eigen::SimplicialLDLT<Eigen::SparseMatrix<Scalar>> solver(bijection);
I get this degenerate monstrosity:
Basically the two solvers are returining wildely different results for the exact same sparse system. How is this possible?
None of the error checks return false, so in both versions the matrices are considered to be fine.
https://eigen.tuxfamily.org/dox/group__TopicSparseSystems.html => SimplicialLDLT only for SPD matrices. You might try a least squares approach https://snaildove.github.io/2017/08/01/positive_definite_and_least_square/

How can I solve linear equation AX=b for sparse matrix

I have sparse matrix A(120,000*120,000) and a vector b(120,000) and I would like to solve the linear system AX=b using Eigen library. I tried to follow the documentation but I always have an error. I also tried change the matrix to dense and solve the system
Eigen::MatrixXd H(N,N);
HH =Eigen:: MatrixXd(A);
Eigen::ColPivHouseholderQR<Eigen::MatrixXd> dec(H);
Eigen::VectorXd RR = dec.solve(b);
but I got a memory error.
Please help me find the way to solve this problem.
For sparse systems, we generally use iterative methods. One reason is that direct solvers like LU, QR... factorizations fill your initial matrix (fill in the sense that initially zero components are replaced with nonzero ones). In most of the cases the resulting dense matrix can not fit into the memory anymore (-> your memory error). In short, iterative solvers do not change your sparsity pattern as they only involve matrix-vector products (no filling).
That said, you must first know if your system is symmetric positive definite (aka SPD) in such case you can use the conjugate gradient method. Otherwise you must use methods for unsymmetric systems like BiCGSTAB or GMRES.
You must know that most of the time we use a preconditioner, especially if your system is ill-conditioned.
Looking at Eigen doc I found(example without preconditioner afaik):
int n = 10000;
VectorXd x(n), b(n);
SparseMatrix<double> A(n,n);
/* ... fill A and b ... */
BiCGSTAB<SparseMatrix<double> > solver;
solver.compute(A);
x = solver.solve(b);
std::cout << "#iterations: " << solver.iterations() << std::endl;
std::cout << "estimated error: " << solver.error() << std::endl;
which is maybe a good start (use Conjugate Gradient Method if your matrix is SPD). Note that here, there is no preconditioner, hence convergence will be certainly rather slow.
Recap: big matrix -> iterative method + preconditioner.
This is really a first "basic/naive" explanation, you can found futher information about the theory in Saad's book: Iterative Methods for Sparse Linear Systems. Again this topic is huge you can found plenty of other books on this subject.
I am not an Eigen user, however there are precondtioners here.
-> Incomplete LU (unsymmetric systems) or Incomplete Cholesky (SPD systems) are generally good all purpose preconditioners, there are the firsts to test.

Eigen: how to speed up a += coeffs * coeffs.transpose()

I need to compute many (about 400k) solutions of small linear least square problems. Each problem contains 10-300 equations with only 7 variables.
To solve these problems i use eigen library. Straight solving takes too much time and i transform each problem to solving 7x7 system of linear equations by deriving derivatives by my hand.
I recieve nice speed-up but i want to increase performance again.
I use vagrind to profile my program and i found that operation with highest self cost is operator += of eigen matrix. This operation takes more than ten calls of a.ldlt().solve(b);
I use this operator to compose A matrix and B vector of each system of equations
//I cal these code to solve each problem
const int nVars = 7;
//i really need double precision
Eigen::Matrix<double, nVars, nVars> a = Eigen::Matrix<double, nVars, nVars>::Zero();
Eigen::Matrix<double, nVars, 1> b = Eigen::Matrix<double, nVars, 1>::Zero();
Eigen::Matrix<double, nVars, 1> equationCoeffs;
//............................
//Somewhere in big cycle.
//equationCoeffs and z are updated on each iteration
a += equationCoeffs * equationCoeffs.transpose();
b += equationCoeffs * z;
Where z is some scalar
So my question is: How can i improve performance of these operations?
PS Sorry for my poor English
Instead of forming the matrix and vector components of the normal equation by hand, one equation at a time, you might try to allocate a large enough matrix once (e.g. 300 x 7) to store all coefficients and then let Eigen's optimized matrix-matrix product kernels do the job for you:
Matrix<double,Dynamic,nbVars> D(300,nbVars);
VectorXd f(300);
for(...)
{
int nb_equations = ...;
for(i=0..nb_equations-1)
{
D.row(i) = equationCoeffs;
f(i) = z;
}
a = D.topRows(nb_equations).transpose() * D.topRows(nb_equations);
b = D.topRows(nb_equations).transpose() * f.head(nb_equations);
// solve ax=b
}
You might bench with both a column-major and row-major storage for the matrix D to see which one is best.
Another possible approach would be to declare a, equationCoeffs, and b as 8x8 or 8x1 matrix or vectors making sure that equationCoeffs(7)==0. This way you maximize SIMD usage. Then use a.topLeftCorners<7,7>(), b.head<7>() when calling LDLT. You might even combine this strategy with the previous one.
Finally, if your CPU support AVX or FMA, you might use the devel branch and compile with -mavx or -mfma to get a significant speedup.
If you can use g++5.1, you might want to take a look at OpenMP
( http://openmp.org/mp-documents/OpenMP4.0.0.Examples.pdf ).
G++5.1 (or gcc5.1 for C) also has some basic support for OpenACC, you can try that as well. There should be more implementation of OpenACC in the future.
Also if you have access to intel compiler (icc, icpc) it speeded up my code even just by using it.
If you can use nvidia's nvcc, you might use the thrust library
( http://docs.nvidia.com/cuda/thrust/#axzz3g8xJPGHe ), there's a lot of sample code on their github as well
( https://github.com/thrust/thrust ). However, using thrust is not so straight forward and needs some real thinking.
EDIT:
Thrust also requires Nvidia GPU.
For AMD cards I believe there is a library called ArrayFire, which looks very similar to Thrust (I have not tried that one, yet)
I have a single problem Ax=b with 480k float variables. The matrix A is sparse and solving it with Eigen BiCGSTAB took 4.8 seconds.
I also worked with ViennaCL before, so I tried to solve the same problem there, and it took only 1.2 seconds. The increase in spead is realised
by the processing on the GPU.

Matlab Hilbert Transform in C++

First, please excuse my ignorance in this field, I'm a programmer by trade but have been stuck in a situation a little beyond my expertise (in math and signals processing).
I have a Matlab script that I need to port to a C++ program (without compiling the matlab code into a DLL). It uses the hilbert() function with one argument. I'm trying to find a way to implement the same thing in C++ (i.e. have a function that also takes only one argument, and returns the same values).
I have read up on ways of using FFT and IFFT to build it, but can't seem to get anything as simple as the Matlab version. The main thing is that I need it to work on a 128*2000 matrix, and nothing I've found in my search has showed me how to do that.
I would be OK with either a complex value returned, or just the absolute value. The simpler it is to integrate into the code, the better.
Thank you.
The MatLab function hilbert() does actually not compute the Hilbert transform directly but instead it computes the analytical signal, which is the thing one needs in most cases.
It does it by taking the FFT, deleting the negative frequencies (setting the upper half of the array to zero) and applying the inverse FFT. It would be straight forward in C/C++ (three lines of code) if you've got a decent FFT implementation.
This looks pretty good, as long as you can deal with the GPL license. Part of a much larger numerical computing resource.
Simple code below. (Note: this was part of a bigger project). The value for L is based on the your determination of your order, N. With N = 2L-1. Round N to an odd number. xbar below is based on the signal you define as the input to your designed system. This was implemented in MATLAB.
L = 40;
n = -L:L; % index n from [-40,-39,....,-1,0,1,...,39,40];
h = (1 - (-1).^n)./(pi*n); %impulse response of Hilbert Transform
h(41) = 0; %Corresponds to the 0/0 term (for 41st term, 0, in n vector above)
xhat = conv(h,xbar); %resultant from Hilbert Transform H(w);
plot(abs(xhat))
Not a true answer to your question but maybe a way of making you sleep better. I believe that you won't be able to be much faster than Matlab in the particular case of what is basically ffts on a matrix. That is where Matlab excels!
Matlab FFTs are computed using FFTW, the de-facto fastest FFT algorithm written in C which seem to be also parallelized by Matlab. On top of that, quoting from http://www.mathworks.com/help/matlab/ref/fftw.html:
For FFT dimensions that are powers of 2, between 214 and 222, MATLAB
software uses special preloaded information in its internal database
to optimize the FFT computation.
So don't feel bad if your code is slightly slower...

Compute rank of Matrix

I need to calculate rank of 4096x4096 sparse matrix, and I use C/C++ code.
I found some libraries (like Armadillo) that do it but they're too slow (almost 5 minutes).
I've also tried two Open Source version of Matlab (Freemat and Octave) but both crashed when I tried to make a test with a script.
5 minutes isn't so much but I must get rank from something like a million of matrix so the faster the better.
Someone knows a fast library for rank computation?
The Eigen library supports sparse matrices, try it out.
Computing the algebraic rank is O(n^3), where n is the matrix size, so it's inherently slow. You need eg. to perform pivoting, and this is slow and inaccurate if your matrix is not well conditioned (for n = 4096, a typical matrix is very ill conditioned).
Now, what is the rank ? It is the dimension of the image. It is very difficult to compute when n is large and it'll be spoiled by any small numerical inaccuracy of the input. For n = 4096, unless you happen to have particularly well conditioned matrices, this will prevent you from doing anything useful with a pivoting algorithm.
The best way is in fact to fix a cutoff epsilon, compute the singular values s_1 > ... > s_n and take as the rank the lowest integer r such that sum(s_i^2, i > r) < epsilon^2 * sum(s_i^2).
You thus need a sparse SVD routine, eg. from there.
This may not be faster, but to the very least it will be correct.
You can ask for less singular values that you need to speed up things. This is a tough problem, and with no info on the background and how you got these matrices, there is nothing more we can do.
Try the following code (the documentation is here).
It is an example for calculating the rank of the matrix A with Eigen library:
MatrixXd A(2,2);
A << 1 , 0, 1, 0;
FullPivLU<MatrixXd> luA(A);
int rank = luA.rank();