Fixed-size SVD and solver in CUDA (on the device) - C++

I implemented a program on the GPU (CUDA) which only uses the host (in C++) to launch kernels. During the calculation on the device I need to compute SVDs and solve systems of dense, fixed-size 3x3 matrices.
I have my own SVD and solver implementation, but it is not numerically stable (and thus not usable). Since I'm rather new to C++ and CUDA, I would prefer to use a library instead (numerical code is very tricky).
Now I am having trouble finding such a library:
cuSOLVER is not callable from the device.
CULA is not callable from the device (and seems to be abandoned).
Eigen looks promising (it should be callable from the device?) but the status of its CUDA support is unclear (the docs say it is experimental). Some people say it works, others report compile errors.
Preferably I would also like to be able to do general matrix operations with the library (transpose, inversion, sum, multiply, ...), as my own implementations of those would likely be less efficient and less numerically stable.
Any ideas on how to achieve this?
UPDATE:
Seems like Eigen supports basic functions like *, +, and transpose, and even eigenvalues, but SVD, inverse, etc. are not yet supported. This is at the time of writing.

According to the website, a subset of features works for fixed size matrices (3x3 in your case) from Eigen 3.3. The current stable release is 3.2.6 while 3.3 is in alpha. I don't know if specifically SVD is supported in CUDA. I would recommend trying a small MCVE to see if it works (as well as the other functions you require), and if so, implementing it in your project.
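For instance, a minimal kernel along these lines (my own sketch; assumes Eigen 3.3+ compiled with nvcc, and the kernel name and memory layout are just for illustration) would quickly show which operations nvcc accepts in device code:

#include <Eigen/Dense>

// Each thread works on one 3x3 matrix stored contiguously in global memory.
__global__ void testEigenKernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    Eigen::Map<const Eigen::Matrix3f> A(in + 9 * i);  // map, no copy
    Eigen::Matrix3f B = A.transpose() * A;            // basic fixed-size ops
    Eigen::Map<Eigen::Matrix3f>(out + 9 * i) = B;
}

If that compiles and runs, try adding the decompositions you need (e.g. Eigen::JacobiSVD) one at a time and see where nvcc gives up.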

I'm having a similar problem: I want to generate random vectors within a kernel function, which requires performing Cholesky/eigenvalue decompositions of NxN (N<=5) covariance matrices. Since, as you noted, the MAGMA and CULA libraries are not available from the device, and there seems to be no cuSOLVER device API yet, I've resorted to implementing these myself, following algorithms outlined in, for example, Numerical Recipes in C. As for solving linear systems, I'd suggest checking out cuBLAS (the level-2 functions), as it provides some basic functionality. If you want to invert matrices, I'd suggest cublas<t>matinvBatched(). I haven't used it myself; I'll give it a try during the weekend, but from the description it sounds promising. Hope others will chime in on this thread with better solutions...
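In case it helps, a fixed-size Cholesky factorization in the spirit of the Numerical Recipes routine is small enough to inline in device code; a minimal 3x3 sketch (my own, untested on your setup):

// Factor a symmetric positive-definite A into L*L^T (lower triangular L).
// Returns false if A is not positive definite.
__device__ bool cholesky3x3(const float A[3][3], float L[3][3])
{
    for (int j = 0; j < 3; ++j) {
        float d = A[j][j];
        for (int k = 0; k < j; ++k) d -= L[j][k] * L[j][k];
        if (d <= 0.0f) return false;       // not positive definite
        L[j][j] = sqrtf(d);
        for (int i = j + 1; i < 3; ++i) {
            float s = A[i][j];
            for (int k = 0; k < j; ++k) s -= L[i][k] * L[j][k];
            L[i][j] = s / L[j][j];
        }
        for (int i = 0; i < j; ++i) L[i][j] = 0.0f;  // zero the upper triangle
    }
    return true;
}

The same pattern scales to N<=5 by replacing the bound; for eigenvalues of small symmetric matrices, the cyclic Jacobi method is the usual numerically robust choice.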

Related

C++: How do I solve a very large sparse linear system

I am trying to solve a very large and sparse system of linear equations in C++. Currently, I am using BiCGSTAB from Eigen. It works fine for small matrices, but it takes far too long for matrices of the size I need, which is 40804x40804 (and it could be even larger in the future).
I have a very long script, but I simply used the following format:
SparseMatrix<double> sj(40804,40804);
VectorXd c_(40804), sf(40804);
sj.reserve(VectorXi::Constant(40804,36)); // a very good estimate of the non-zeros per column
// ... fill in the actual entries of sj (only the entries reserved above)
sj.makeCompressed();
BiCGSTAB<SparseMatrix<double> > handler;
handler.analyzePattern(sj);
handler.factorize(sj);
c_.setZero();
c_ = handler.solve(sf);
This takes way too long! And yes, a solution does exist. MATLAB's sparse solver seems to handle this very well, but I need it in C++ in order to connect it to a server.
I would really appreciate it if you could help me!
You should consider using one of the advanced sparse direct solvers: CHOLMOD.
Sparse direct solvers are a fundamental tool in computational analysis, providing a very general method for obtaining high-quality results for almost any problem. CHOLMOD is a high-performance library for sparse Cholesky factorization.
I guarantee that this package will definitely help you. Moreover, CHOLMOD has supported GPU acceleration since 2012, with version 4.0.0. In SuiteSparse-4.3.1, performance has been further improved, providing speedups of 3x or greater vs. the CPU for the sparse factorization operation.
If your matrices are representations of graphs, you can also consider METIS in combination with CHOLMOD, which means you can do partitioning/domain decomposition on the graphs and then solve in parallel with CHOLMOD.
SuiteSparse is a powerful tool with support for linear (KLU) and direct solvers.
Here are the GitHub link, the user guide, and SuiteSparse's home page.
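Since the question already uses Eigen, note that Eigen ships a CHOLMOD backend, so switching is a small change; a minimal sketch (assuming SuiteSparse/CHOLMOD is installed and the matrix is symmetric positive definite):

#include <Eigen/Sparse>
#include <Eigen/CholmodSupport>

// ... build sj and sf exactly as before, then:
sj.makeCompressed();
Eigen::CholmodSupernodalLLT<Eigen::SparseMatrix<double> > solver;
solver.compute(sj);                     // analyze + factorize in one call
Eigen::VectorXd c_ = solver.solve(sf);  // direct solve for the right-hand side

If the matrix is symmetric but not positive definite, a direct solver such as Eigen::SparseLU would be the fallback.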

Matrix handling for large sequences

I'm going to implement an algorithm like the Needleman-Wunsch or Smith-Waterman algorithm for large sequences, so I'm going to need a way to create matrices of different sizes. My question is: which library gives me the best performance for that and is easy to use?
P.S.: I know that OpenCV and Boost can handle matrices, but I don't know if they are good for doing operations on them.
If “the best performance” is the requirement, then you have to look at NVIDIA CUDA or Intel MKL. Libraries like Boost uBLAS concentrate not on performance, but on usability.

Looking for testing matrices/systems for iterative linear solver

I am currently working on a C++-based library for large, sparse linear algebra problems (yes, I know many such libraries exist, but I'm rolling my own mostly to learn about iterative solvers, sparse storage containers, etc..).
I am to the point where I am using my solvers within other programming projects of mine, and would like to test the solvers against problems that are not my own. Primarily, I am looking to test against symmetric sparse systems that are positive definite. I have found several sources for such system matrices such as:
Matrix Market
UF Sparse Matrix Collection
That being said, I have not yet found any sources of good test matrices that include the entire system (system matrix and RHS). This would be great to have in order to check results. Any tips on where I can find such full systems, or alternatively, what I might do to generate a "good" RHS for the system matrices I can get online? I am currently just filling the right-hand side with random values, or all ones, but suspect that this is not necessarily the best way.
I would suggest using a right-hand-side vector obtained from a predefined 'goal' solution x:
b = A*x
Then you have a goal solution, x, and a resulting solution, x', from the solver.
This means you can compare the error (the difference between the goal and resulting solutions) as well as the residual (A*x' - b).
Note that for a careful evaluation of an iterative solver you'll also need to consider what to use for the initial guess.
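Concretely (sketched here with Eigen for illustration; your own library would expose the analogous operations):

#include <Eigen/Sparse>
#include <iostream>

void testSolver(const Eigen::SparseMatrix<double>& A)
{
    Eigen::VectorXd xGoal = Eigen::VectorXd::Ones(A.cols()); // predefined goal solution
    Eigen::VectorXd b = A * xGoal;                           // consistent right-hand side

    Eigen::ConjugateGradient<Eigen::SparseMatrix<double> > cg(A);
    Eigen::VectorXd x = cg.solve(b);                         // zero initial guess by default

    std::cout << "error:    " << (x - xGoal).norm() << "\n"
              << "residual: " << (A * x - b).norm() << "\n";
}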
The online collections of matrices primarily contain the left-hand-side matrix, but some do include right-hand sides, and some have solution vectors too:
http://www.cise.ufl.edu/research/sparse/matrices/rhs.txt
By the way, for the UF sparse matrix collection I'd suggest this link instead:
http://www.cise.ufl.edu/research/sparse/matrices/
I haven't used it yet (I'm about to), but GiNaC seems like the best thing I've found for C++. It is an open-source C++ library for symbolic computation; I don't know what its performance is like for this kind of numerical work.
http://www.ginac.de/
It would help to specify which kind of problems you are solving; different problems require different right-hand sides to be of any use for checking validity. What I'd suggest is getting some example code from projects like DUNE Numerics (which I'm working with right now), FEniCS, or deal.II, which are already using solvers on real matrices; they generally have functionality to output your matrix to some kind of file (DUNE Numerics can output matrices and right-hand sides in MATLAB-compatible files).
These you can then feed to your solvers, and then again use the libraries' functionality to create output data (DUNE Numerics, for example, uses the VTK format). That way, you get to analyse the data using powerful tools.
You may have to learn a little bit about compiling and using those libraries, but it is not much, and I believe the functionality you'll get is worth the time invested.
I guess even a single well-defined and reasonably complex problem should be good enough for testing your library; well, actually two: one for Ax = b problems and another for Ax = λBx (generalized eigenvalue problems).

Matrix classes in C++

I'm doing some linear algebra math, and was looking for some really lightweight and simple to use matrix class that could handle different dimensions: 2x2, 2x1, 3x1 and 1x2 basically.
I presume such class could be implemented with templates and using some specialization in some cases, for performance.
Anybody know of any simple implementation available for use? I don't want "bloated" implementations, as I'll be running this in an embedded environment where memory is constrained.
Thanks
You could try Blitz++ -- or Boost's uBLAS
I've recently looked at a variety of C++ matrix libraries, and my vote goes to Armadillo.
The library is heavily templated and header-only.
Armadillo also leverages templates to implement a delayed evaluation framework (resolved at compile time) to minimize temporaries in the generated code (resulting in reduced memory usage and increased performance).
However, these advanced features are only a burden to the compiler and not your implementation running in the embedded environment, because most Armadillo code 'evaporates' during compilation due to its design approach based on templates.
And despite all that, one of its main design goals has been ease of use - the API is deliberately similar in style to Matlab syntax (see the comparison table on the site).
Additionally, although Armadillo can work standalone, you might want to consider using it with the LAPACK (and BLAS) implementations available on your platform to improve performance. A good option would be, for instance, OpenBLAS (or ATLAS). Check Armadillo's FAQ; it covers some important topics.
A quick search on Google dug up this presentation showing that Armadillo has already been used in embedded systems.
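To give a flavour of that Matlab-like API, here is a minimal sketch using Armadillo's fixed-size types (which are stack-allocated, which also suits a memory-constrained environment):

#include <armadillo>

int main()
{
    arma::mat22 A;                   // fixed-size 2x2, stack allocated
    A << 1.0 << 2.0 << arma::endr
      << 3.0 << 4.0 << arma::endr;

    arma::vec2 b;
    b(0) = 5.0;
    b(1) = 6.0;

    arma::vec x = arma::solve(A, b); // solve A*x = b
    arma::mat C = A.t() * A;         // expression templates avoid temporaries

    x.print("x:");
    C.print("C:");
    return 0;
}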
std::valarray is pretty lightweight.
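For example, a 2x2 matrix is just a 4-element valarray with row-major indexing (the indexing helper below is mine, not part of the standard library):

#include <cstddef>
#include <valarray>

// Row-major element access for an r x c matrix stored in a valarray.
inline double& at(std::valarray<double>& m, std::size_t cols,
                  std::size_t i, std::size_t j)
{
    return m[i * cols + j];
}

int main()
{
    std::valarray<double> m(4); // a 2x2 matrix
    at(m, 2, 0, 1) = 3.0;       // m(0,1) = 3
    m *= 2.0;                   // element-wise operations come for free
    return 0;
}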
I use the Newmat library for matrix computations. It's open source and easy to use, although I'm not sure it fits your definition of lightweight (it includes over 50 source files, which Visual Studio compiles into a 1.8 MB static library).
CML matrix is pretty good, but may not be lightweight enough for an embedded environment. Check it out anyway: http://cmldev.net/?p=418
Another option, although it may be too late, is:
https://launchpad.net/lwmatrix
I for one wasn't able to find a simple enough library, so I wrote one myself: http://koti.welho.com/aarpikar/lib/
I think it should be able to handle different matrix dimensions (2x2, 3x3, 3x1, etc.) by simply setting some rows or columns to zero. It won't be the fastest approach, since internally all operations are done on 4x4 matrices. Although in theory there might exist processors that can handle 4x4 operations in one tick, I would much rather believe in the existence of such processors than go optimizing those low-level matrix calculations. :)
How about just storing the matrix in an array, like:
2x3 matrix = {2,3,val1,val2,...,val6}
This is really simple, and addition operations are trivial. However, you need to write your own multiplication function.
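A minimal sketch of that idea (keeping the dimensions in a struct rather than in the first two array slots, which amounts to the same thing; all names are mine):

#include <cstddef>
#include <vector>

struct FlatMatrix {
    std::size_t rows, cols;
    std::vector<double> v; // row-major values

    FlatMatrix(std::size_t r, std::size_t c) : rows(r), cols(c), v(r * c, 0.0) {}
    double& at(std::size_t i, std::size_t j) { return v[i * cols + j]; }
    double  at(std::size_t i, std::size_t j) const { return v[i * cols + j]; }
};

// Multiplication has to be written by hand, as noted above.
FlatMatrix multiply(const FlatMatrix& a, const FlatMatrix& b)
{
    FlatMatrix c(a.rows, b.cols);
    for (std::size_t i = 0; i < a.rows; ++i)
        for (std::size_t k = 0; k < a.cols; ++k)   // i-k-j order is cache friendly
            for (std::size_t j = 0; j < b.cols; ++j)
                c.at(i, j) += a.at(i, k) * b.at(k, j);
    return c;
}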

3D convolution in C++

I'm looking for some source code implementing 3D convolution. Ideally, I need C++ code or CUDA code. I'd appreciate it if anybody could point me to a nice and fast implementation :-)
Cheers
You understand that convolution is normally done using an FFT? See, for example, http://en.wikipedia.org/wiki/Convolution
So you need an FFT library.
Fastest method to compute convolution suggests http://www.fftw.org/ (for a traditional CPU).
For CUDA, use cuFFT - http://www.gsic.titech.ac.jp/~ccwww/tebiki/tesla_e/tesla6_e.html
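To sketch the FFT route on the CPU (assuming FFTW 3, and that both volumes have already been zero-padded to a common nx*ny*nz size to avoid wrap-around; link with -lfftw3):

#include <fftw3.h>
#include <vector>

// Circular convolution of a and b into out; out must hold nx*ny*nz doubles.
void fftConvolve3d(std::vector<double>& a, std::vector<double>& b,
                   std::vector<double>& out, int nx, int ny, int nz)
{
    const int nc = nx * ny * (nz / 2 + 1); // size of the r2c transform output
    fftw_complex* A = fftw_alloc_complex(nc);
    fftw_complex* B = fftw_alloc_complex(nc);

    fftw_plan pa = fftw_plan_dft_r2c_3d(nx, ny, nz, a.data(), A, FFTW_ESTIMATE);
    fftw_plan pb = fftw_plan_dft_r2c_3d(nx, ny, nz, b.data(), B, FFTW_ESTIMATE);
    fftw_execute(pa);
    fftw_execute(pb);

    const double scale = 1.0 / (double(nx) * ny * nz); // FFTW is unnormalized
    for (int i = 0; i < nc; ++i) {                     // pointwise complex multiply
        const double re = A[i][0] * B[i][0] - A[i][1] * B[i][1];
        const double im = A[i][0] * B[i][1] + A[i][1] * B[i][0];
        A[i][0] = re * scale;
        A[i][1] = im * scale;
    }

    fftw_plan pc = fftw_plan_dft_c2r_3d(nx, ny, nz, A, out.data(), FFTW_ESTIMATE);
    fftw_execute(pc);

    fftw_destroy_plan(pa); fftw_destroy_plan(pb); fftw_destroy_plan(pc);
    fftw_free(A); fftw_free(B);
}

The cuFFT version follows the same plan/execute/multiply/inverse structure with cufftPlan3d and a small pointwise-multiply kernel.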
Are you a registered developer? If so, you should download the 3.0 SDK and check out the FDTD3d sample, which shows a 3D convolution as applied to an explicit finite-differences app. In the 2.3 SDK there was a similar sample called 3dfd (which has now been replaced).
It may be more efficient to use this approach rather than an FFT if your impulse response is short.
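For reference, the direct approach is simple enough to sketch (a naive, unoptimized version; the FDTD3d sample adds tiling and shared memory on top of this idea):

#include <vector>

// in/out are nx*ny*nz volumes, kern is kx*ky*kz, all stored x-fastest.
// True convolution (kernel flipped); out-of-range samples are treated as zero.
void convolve3d(const std::vector<float>& in, std::vector<float>& out,
                int nx, int ny, int nz,
                const std::vector<float>& kern, int kx, int ky, int kz)
{
    const int hx = kx / 2, hy = ky / 2, hz = kz / 2;
    for (int z = 0; z < nz; ++z)
    for (int y = 0; y < ny; ++y)
    for (int x = 0; x < nx; ++x) {
        float acc = 0.0f;
        for (int dz = 0; dz < kz; ++dz)
        for (int dy = 0; dy < ky; ++dy)
        for (int dx = 0; dx < kx; ++dx) {
            const int sx = x - (dx - hx); // kernel flip = convolution
            const int sy = y - (dy - hy);
            const int sz = z - (dz - hz);
            if (sx < 0 || sx >= nx || sy < 0 || sy >= ny || sz < 0 || sz >= nz)
                continue;                 // zero padding at the borders
            acc += in[(sz * ny + sy) * nx + sx] * kern[(dz * ky + dy) * kx + dx];
        }
        out[(z * ny + y) * nx + x] = acc;
    }
}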
Intel has a very good example, using SSE + OpenMP, plus a serial version of it. The code is primarily meant to profile the serial and parallel approaches, but it is done in a nice way. http://software.intel.com/en-us/articles/16bit-3d-convolution-sse4openmp-implementation-on-penryn-cpu/