Solving eigen on GPU - CUDA - c++

I recently write a C++ program using eigen. But I found that my poor CPI cannot handle the large computation load. I estimate it demands at least a day of computation and I am short of time, especially since I may need debugging time too. I'm considering using GPU acceleration, but I don't know how to adapt my code for CUDA (or some other platforms, but I prefer CUDA).
Are there any convenient methods to apply GPU acceleration on eigen program?

cuSolver is a linar algebra library by NVIDIA. It provides fast and efficient eigen solver among tons of other linear algebra facilities.
You might also want to take a look at the SDK examples (7_Libraries).
Here's the link to the documentation.
Other useful libraries are
MAGMA - https://developer.nvidia.com/magma
CULA - http://www.culatools.com/

Related

Fixed size SVD and solver in CUDA (in the device)

I implemented a program on the GPU (CUDA) which only uses the host (in C++) to start new kernels. During the calculation on the device I need SVD and solving systems of 3x3 (dense) matrices, fixed size.
I've got my own SVD and solver implementation but it is not numerical stable (thus not usable). Due to me being rather new with C++ and CUDA I would prefer to use a library instead. (numerical stuff is very tricky)
Now I have trouble finding that library:
cuSOLVER is not callable from the device
cuLA is not callable form the device (and abandoned so it seems)
Eigen looks promising (should be callable from device?) but it is unclear what the status is on CUDA support (it says experimental). I find people saying it works, others got compile errors?
Preferable I would also being able to do general matrix operations with the library (transpose, inversion, sum, multiply, ...) as my own implementations will likely be less efficient and numerically stable for those.
Any ideas on how to achieve this?
UPDATE:
Seems like Eigen supports basic functions like *,+, transpose and even eigenvalues but SVD, inverse ect is not yet supported. This is at the time of writing.
According to the website, a subset of features works for fixed size matrices (3x3 in your case) from Eigen 3.3. The current stable release is 3.2.6 while 3.3 is in alpha. I don't know if specifically SVD is supported in CUDA. I would recommend trying a small MCVE to see if it works (as well as the other functions you require), and if so, implementing it in your project.
I'm having a similar problem; want to generate random vectors within a kernel function which requires performing cholesky/eigenvalue decompositions of NxN (N<=5) covariance matrices. Since, as you noted, the MAGMA and CULA libraries are not available from the device, and there seems to be no cuSOLVER device API yet, I've resorted to implementing these myself following algorithms outlined in, for example, Numerical Recipes in C. As for solving linear systems, I'd suggest checking out the cuBLAS (level 2 functions), as it provides some basic functionality. If you want to invert matrices, I'd suggest cublasmatinvBatched(). I haven't used it myself, will give it a try during the weekend, but from the description it sounds promising. Hope others will chime into this thread with better solutions...

GPGPU programming architecture for HSA in C++ for Matrix Math

GPU Compute Programmers,
I have a C++ program which currently relies on the ACML (LAPACK) to invert and multiple fairly large matrices of single precision fp values (E.g. 4,000 x 4,000). These matrices are very sparse although they do not always fit nicely into a diagonal matrix so I cannot presently reduce them. The other thing about this program is I have to do this invert and multiply several times (serially) as part of a Newton Rapson. However, I have several thousand permutations which can be done in parallel, each with a small change to the matrix before again calculating and inverting the Jacobian. This is all single precision fp, and seems perfectly suited for the GPU. My question is this...
I suspect I will need to use the AMD Accelerated Parallel Processing Math Libraries (APPML) for OpenGL as that is the only thing (non-CUDA, I want to be GPU agnostic) I know of which is available with BLAS functionality. My problem is I do not see the LAPACK dgetrf and dgetri functions included in APPML (yes, these are fp64 but I don't need that precision). Would C++ AMP be a better alternative? I am very interested in HSA features of passing pointers rather than copying data as there is a lot of data in flight here and some calculations still are done on the CPU. I believe that copy overhead would kill me otherwise. Ultimately, performance is the key and I want to make the right architectural decisions to set myself up for the most performance I can wring out of HSA GPUs coming out over the next 6 months.
I am using VS 2013 Ultimate preview and would be able to take advantage of C++ AMP for these HSA capabilities. I just want to make sure I am making the right long term architectural decision now while my program is in its infancy. Here is a link and snippet from some interesting data I found on Anandtech:
http://anandtech.com/show/7118/windows-81-and-vs2013-bring-gpu-computing-updates-to-direct3d-and-c-amp-
C++ AMP, Microsoft's C++ extension for GPU computing, has also been updated with the upcoming VS2013. I think the biggest feature update is that C++ AMP programs will also gain a shared memory feature on APUs/SoCs where the compiler and runtime will be able to eliminate extra data copies between CPU and GPU. This feature will also be available only on Windows 8.1 and it is likely built on top of the "map default buffer" as Microsoft's AMP implementation uses Direct3D under the hood. C++ AMP also brings some other nice additions including enhanced texture support and better debugging abilities.
Any thoughts, additional questions or discussion would be greatly appreciated!

BLAS+Multiple Precision+MPI

I am writing a scientific application for my Maths PhD in C++, it's based on some heavy linear algebra, mostly BLAS level 3 routines. The sizes of the matrices employed vary considerably, ideally I would like to be able to deal with very large matrices of order 10000 and higher. So far I have used Intel MKL, multi-threaded, scales nicely onto 8 cores. My algorithm produces the correct results, however is very unstable, in double precision arithmetic, due to the accumulating errors, resulting from high powers being taken. Additionally, as I have access to a large supercomputer cluster, and my algorithm can be easily scaled across multiple nodes, I would like to employ MPI to scale the application across hundreds of nodes.
My goal is to find a templated BLAS library that:
Supports Multiple Precision Arithmetic,
Supports Multi-threading,
Supports MPI
My findings so far:
MTL4 - Matrix Template library 4 seems to do all of the above, however the open source edition will only run on one core, and the supercomputing edition is quite costly.
Eigen - appears not to support multicore? Does it support multicore and MPI if linked with MKL?
Armadillo - does all the above?
I would greatly appreciate any insights and recommendations
Kind Regards,
Maria
Depending on your matrix problem, the Tpetra package of Trilinos might be worth a look. It's templated on the scalar type, so you might use multiple precision types. It targets large scale applications on supercomputers so one can expect good parallel performances.
Hope it helps!
Edit: and it's free!

Library for solving system of linear equations with tridiagonal matrix?

I am modelling physical system with heat conduction, and to do numerical calculations I need to solve system of linear equations with tridiagonal matrix. I am using this algorithm to get results: http://en.wikipedia.org/wiki/Tridiagonal_matrix_algorithm But I am afraid that my method is straightforward and not optimal. What C++ library should be used to solve that system in the fastest way? I should also mention that matrix is not changed often (only right part of the equation is changed). Thanks!
Check out Eigen.
The performance of this algorithm is likely dominated by floating-point division. Use SSE2 to perform two divisions (of ci and di) at once, and you will get close to optimal performance.
It is worth looking at the LAPACK and BLAS interfaces, of which there are several implementation libraries. Originally netlib which is open-source, and then other such as MKL that you have to pay for. The function dgtsv does what you are looking for. The open-source netlib versions don't do any explicit SIMD instructions, but MKL does and will perform best on intel chips.

ublas vs. matrix template library (MTL4)

I'm writing a software for hyperbolic partial differential equations in c++. Almost all notations are vector and matrix ones. On top of that, I need the linear algebra solver. And yes, the vector's and matrix's sizes can vary considerably (from say 1000 to sizes that can be solved only by distributed memory computing, eg. clusters or similar architecture). If I had lived in utopia, I'd had had linear solver which scales great for clusters, GPUs and multicores.
When thinking about the data structure that should represent the variables, I came accros the boost.ublas and MTL4.
Both libraries are blas level 3 compatible, MTL4 implements sparse solver and is much faster than ublas. They both don't have implemented support for multicore processors, not to mention parallelization for distributed memory computations. On the other hand, the development of MTL4 depends on sole effort of 2 developers (at least as I understood), and I'm sure there is a reason that the ublas is in the boost library. Furthermore, intel's mkl library includes the example for binding their structure with ublas.
I'd like to bind my data and software to the data structure that will be rock solid, developed and maintained for long period of time.
Finally, the question. What is your experience with the use of ublas and/or mtl4, and what would you recommend?
thanx,
mightydodol
With your requirements, I would probably go for BOOST::uBLAS. Indeed, a good deployment of uBLAS should be roughly on par with MTL4 regarding speed.
The reason is that there exist bindings for ATLAS (hence shared-memory parallelization that you can efficiently optimize for your computer), and also vendor-tuned implementations like the Intel Math Kernel Library or HP MLIB.
With these bindings, uBLAS with a well-tuned ATLAS / BLAS library doing the math should be fast enough. If you link against a given BLAS / ATLAS, you should be roughly on par with MTL4 linked against the same BLAS / ATLAS using the compiler flag -DMTL_HAS_BLAS, and most likely faster than the MTL4 without BLAS according to their own observation (example see here, where GotoBLAS outperforms MTL4).
To sum up, speed should not be your decisive factor as long as you are willing to use some BLAS library. Usability and support is more important. You have to decide, whether MTL or uBLAS is better suited for you. I tend towards uBLAS given that it is part of BOOST, and MTL4 currently only supports BLAS selectively. You might also find this slightly dated comparison of scientific C++ packages interesting.
One big BUT: for your requirements (extremely big matrices), I would probably skip the "syntactic sugar" uBLAS or MTL, and call the "metal" C interface of BLAS / LAPACK directly. But that's just me... Another advantage is that it should be easier than to switch to ScaLAPACK (distributed memory LAPACK, have never used it) for bigger problems. Just to be clear: for house-hold problems, I would not suggest calling a BLAS library directly.
If you're programming vectors, matrices, and linear algebra in C++, I'd look at Eigen:
http://eigen.tuxfamily.org/
It's faster than uBLAS (not sure about MTL4) and much cleaner syntax.
For new projects, it's probably best to stay away from Boost's uBlas. The uBlas FAQ even has this warning since late 2012:
Q: Should I use uBLAS for new projects?
... the last major improvement of uBLAS was in 2008 and no significant change was committed since 2009. ... Performance? There are faster alternatives. Cutting edge? uBLAS is more than 10 years old and missed all new stuff from C++11.
There is one C++ library missing in this list: FLENS
http://flens.sf.net
Disclaimer: Yes, this is my baby
It is header only
Comes with a simple, non-performant, generic (i.e. templated) C++ reference implemenation of BLAS.
If available you can use an optimized BLAS implementation as backend. In this case its like using BLAS directly (some Benchmark I should update).
You can use overloaded operators instead of calling BLAS functions.
It comes with its own, stand-alone, generic re-implemenation of a bunch of LAPACK functions. We call this port FLENS-LAPACK.
FLENS-LAPACK has exactly the same accuracy and performance as Netlib's LAPACK. And in my experience (FLENS-)LAPACK+ATLAS or (FLENS-)LAPACK+OpenBLAS gives you the same performance as ACML or MKL.
FLENS has a different policy regarding the creation of temporary vector/matrices in the evaluation of linear algebra expressions. The FLENS policy is: Never create them!!!. However, in a special debug-mode we allow the creation of temporaries "when necessary". This "when necessary" policy thing is the default in other libraries like Eigen or Armadillo or in Matlab.
You can see the performance differences directly here:
http://www.osl.iu.edu/research/mtl/mtl4/doc/performance.php3
Both are reasonable libraries to use in terms of their interfaces, I don't think that because uBLAS got through the BOOST review process it's necessarily way more robust. I've had my share of nightmares with unobvious side effects and unintended consequences from uBLAS implementations.
That's not to say uBLAS is bad, it's really good, but I think given the dramatic performances differences for MTL these days, it's worth using it instead of uBLAS even though it's arguably a bit more risky becuase of it's "only 2 developer" support group.
At the end of the day, it's about speed with a matrix library, go with MTL4.
From my own experience, MTL4 is much faster than uBLAS and it is also faster than Eigen.
There is a parallel version of MTL4. Just have a look at simunova