Working with many fixed-size matrices in CUDA kernels - C++

I am looking to work with about 4000 fixed-size (3x3, 4x4) matrices, doing things such as matrix inversion and eigendecomposition.
It seems to me the best way to parallelize this would be to let each of the many GPU threads work on a single instance of the problem.
Is there a reasonable way to do this? I have read http://www.culatools.com/blog/2011/12/09/batched-operations/, but as far as I can tell, batched operations are always something that is "being worked on" with no solution in sight. Three years later, I hope there is a good solution.
So far, I have looked at:
Using Eigen in CUDA kernels: http://eigen.tuxfamily.org/dox-devel/TopicCUDA.html. But this support is in its infancy: it doesn't seem to work well and some things are not implemented. Moreover, I am not sure whether it is optimized for CUDA at all. There is almost no documentation, and the only code example is a test file (eigen/test/cuda_basic.cu). When I tried using Eigen in CUDA kernels, simple things like declaring an Eigen::MatrixXf in a kernel did not survive compilation with nvcc V7.0.27 and Eigen 3.2.90 (mercurial).
Using the cuBLAS device API library to run BLAS routines within a kernel. It seems cuBLAS and its ilk are written to parallelize even within a single small matrix, which seems like overkill and is likely slow for the 3x3 and 4x4 matrices I am interested in. Also, I'm not sure if there is anything like cuBLAS that can also do eigendecomposition or SVD. (As far as I know, CULA does not support calling its routines from within kernels.)
Batch-processing kernels using CUDA streams. This is suggested in Section 2.1.7, "Batching Kernels", of the cuBLAS documentation for CUDA Toolkit v7.0. But "in practice it is not possible to have more than 16 concurrent kernels executing at the same time", so this would be terrible for processing 4000 small matrices. From the CULA blog post linked above: "One could, in theory, use a CUDA stream per problem and launch one problem at a time. This would be ill-performing for two reasons. First is that the number of threads per block would be far too low; [...] Second is that the overhead incurred by launching thousands of operations in this manner would be unacceptable, because the launch code is as expensive (if not more expensive) as just performing the matrix on the CPU."
Implementing my own matrix multiplication and eigendecomposition in kernels. This is likely to be very slow and could also be time consuming to implement; a sketch of the kind of one-thread-per-matrix inversion kernel I have in mind is below.
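For reference, this is roughly the kind of hand-rolled, one-thread-per-matrix inversion kernel I have in mind (cofactor/adjugate formula; it assumes the matrices are stored contiguously in row-major order and are non-singular, and all names are placeholders):

// Placeholder sketch: invert `batch` contiguous row-major 3x3 matrices, one thread each.
__global__ void invert3x3Batched(const float *A, float *Ainv, int batch)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= batch) return;

    const float *m = A    + 9 * i;   // the 3x3 matrix owned by this thread
    float       *r = Ainv + 9 * i;

    // Cofactors of the first row (reused for the determinant expansion).
    float c0 = m[4] * m[8] - m[5] * m[7];
    float c1 = m[5] * m[6] - m[3] * m[8];
    float c2 = m[3] * m[7] - m[4] * m[6];
    float det = m[0] * c0 + m[1] * c1 + m[2] * c2;
    float inv = 1.0f / det;          // assumes a non-singular matrix

    // Adjugate divided by the determinant, written back row-major.
    r[0] = c0 * inv;
    r[1] = (m[2] * m[7] - m[1] * m[8]) * inv;
    r[2] = (m[1] * m[5] - m[2] * m[4]) * inv;
    r[3] = c1 * inv;
    r[4] = (m[0] * m[8] - m[2] * m[6]) * inv;
    r[5] = (m[2] * m[3] - m[0] * m[5]) * inv;
    r[6] = c2 * inv;
    r[7] = (m[1] * m[6] - m[0] * m[7]) * inv;
    r[8] = (m[0] * m[4] - m[1] * m[3]) * inv;
}

which I would launch as, e.g., invert3x3Batched<<<(4000 + 255) / 256, 256>>>(d_A, d_Ainv, 4000);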
At this point I am tempted to give up on doing this on the GPU at all. It is a pity, since I was hoping for real-time performance from an algorithm that requires inverting 4000 3x3 matrices about 100 times every 0.1 seconds.

The cuBLAS functions getrfBatched and getriBatched are designed for batch inversion of small matrices. This should be quicker than either dynamic parallelism or streams (your 2nd and 3rd approaches). A batch solver that can do matrix inversions is also available in source code form; you will need to log in as a registered developer at developer.nvidia.com to access it.
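A minimal sketch of the batched-inversion call sequence (assuming all matrices live contiguously on the device; the pointer-array setup and the variable names are just one way to do it, not something the API requires):

// Sketch: batched inversion of 4000 3x3 matrices via cuBLAS LU (names illustrative).
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

int main()
{
    const int n = 3, batch = 4000;

    // One contiguous allocation for all input matrices and all inverses.
    float *d_A, *d_Ainv;
    cudaMalloc(&d_A,    sizeof(float) * n * n * batch);
    cudaMalloc(&d_Ainv, sizeof(float) * n * n * batch);
    // ... fill d_A with the 3x3 matrices (column-major, lda = n) ...

    // The batched routines take an array of device pointers, one per matrix.
    std::vector<float*> hA(batch), hAinv(batch);
    for (int i = 0; i < batch; ++i) {
        hA[i]    = d_A    + i * n * n;
        hAinv[i] = d_Ainv + i * n * n;
    }
    float **d_Aptrs, **d_Ainvptrs;
    cudaMalloc(&d_Aptrs,    sizeof(float*) * batch);
    cudaMalloc(&d_Ainvptrs, sizeof(float*) * batch);
    cudaMemcpy(d_Aptrs,    hA.data(),    sizeof(float*) * batch, cudaMemcpyHostToDevice);
    cudaMemcpy(d_Ainvptrs, hAinv.data(), sizeof(float*) * batch, cudaMemcpyHostToDevice);

    int *d_pivots, *d_info;
    cudaMalloc(&d_pivots, sizeof(int) * n * batch);
    cudaMalloc(&d_info,   sizeof(int) * batch);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // LU-factorize every matrix in one call, then invert from the factors.
    cublasSgetrfBatched(handle, n, d_Aptrs, n, d_pivots, d_info, batch);
    cublasSgetriBatched(handle, n, (const float* const*)d_Aptrs, n, d_pivots,
                        d_Ainvptrs, n, d_info, batch);

    // ... check d_info for singular matrices, then use d_Ainv ...
    cublasDestroy(handle);
    cudaFree(d_A); cudaFree(d_Ainv); cudaFree(d_Aptrs); cudaFree(d_Ainvptrs);
    cudaFree(d_pivots); cudaFree(d_info);
    return 0;
}

If I remember correctly, cuBLAS also has a fused matinvBatched routine for matrix sizes up to 32 that avoids the two-step LU sequence; check the cuBLAS documentation for the toolkit version you are using.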
Regarding "Also, I'm not sure if there is anything like cuBLAS that can also do eigendecomposition or SVD. (As far as I know, CULA does not support calling its routines from within kernels)":
cuSOLVER provides some eigensolver functions. However, they are neither batched nor callable from device code, so beyond that you're left with streams as the only option.
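For the specific case of 3x3 symmetric matrices there is a workaround (this is my suggestion, not a cuSOLVER feature): the eigenvalues have a closed-form, trigonometric solution of the characteristic polynomial, so they can be computed one matrix per thread inside your own kernel. A sketch of the device function, assuming row-major storage:

// Sketch: closed-form eigenvalues of a symmetric row-major 3x3 matrix m, written to eig[3].
__device__ void eig3x3Symmetric(const float *m, float *eig)
{
    float p1 = m[1]*m[1] + m[2]*m[2] + m[5]*m[5];
    if (p1 == 0.0f) {                            // matrix is already diagonal
        eig[0] = m[0]; eig[1] = m[4]; eig[2] = m[8];
        return;
    }
    float q  = (m[0] + m[4] + m[8]) / 3.0f;      // trace / 3
    float p2 = (m[0]-q)*(m[0]-q) + (m[4]-q)*(m[4]-q) + (m[8]-q)*(m[8]-q) + 2.0f*p1;
    float p  = sqrtf(p2 / 6.0f);

    // B = (A - q*I) / p,   r = det(B) / 2
    float b00 = (m[0]-q)/p, b11 = (m[4]-q)/p, b22 = (m[8]-q)/p;
    float b01 = m[1]/p, b02 = m[2]/p, b12 = m[5]/p;
    float r = 0.5f * (b00*(b11*b22 - b12*b12)
                    - b01*(b01*b22 - b12*b02)
                    + b02*(b01*b12 - b11*b02));
    r = fminf(fmaxf(r, -1.0f), 1.0f);            // guard against rounding error

    float phi = acosf(r) / 3.0f;
    eig[0] = q + 2.0f*p*cosf(phi);                               // largest
    eig[2] = q + 2.0f*p*cosf(phi + 2.0f*3.14159265f/3.0f);       // smallest
    eig[1] = 3.0f*q - eig[0] - eig[2];                           // middle
}

Eigenvectors can then be recovered from the null space of A - lambda*I for each eigenvalue. This does not carry over to 4x4 or non-symmetric matrices, though; there a small per-thread Jacobi or QR iteration is a common fallback.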

Related

Matrix multiplication - is CUDA worth it?

I have a problem which involves many matrix multiplications (classical and Kronecker products). I read that GPUs are suited for this task, and since speed is my main objective I was thinking about using CUDA with C++. However, I would have to learn CUDA first, so before I start wasting my time I thought I should ask wiser people. Can CUDA speed up my calculations? The matrices are generally quite small, around 20x50. Sometimes a third dimension is involved, so it becomes a 20x50x10 array. I can only multiply a couple of matrices at each time step (10-100), but I need to do several million iterations one after another (a Monte Carlo simulation). Currently I am using Armadillo and MATLAB.
You would see some speedup if your matrices were bigger; as it stands, you will be facing data-bandwidth bottlenecks that outweigh the computation time.
Something worth considering is to look for mathematical tricks that would allow you (depending on your computations) to combine multiple instances into bigger matrices and then transfer and compute those. Usually this is quite difficult and not always doable, though; a sketch of the idea is below.
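To make that idea concrete: if, for example, the instances happened to share the same left-hand 20x50 matrix, stacking the right-hand matrices side by side turns many small multiplications into a single larger GEMM, which uses the GPU far better. A hypothetical cuBLAS sketch (the shared-operand assumption and all names are mine, not from the question):

// Sketch under an assumed structure: C_i = A * B_i for i = 1..k in one GEMM,
// with the B_i stored side by side in d_Bstack and the C_i written side by side.
#include <cublas_v2.h>

void multiplyCombined(cublasHandle_t handle,
                      const float *d_A,       // 20 x 50, column-major
                      const float *d_Bstack,  // 50 x (10*k):  B_1 | B_2 | ... | B_k
                      float       *d_Cstack,  // 20 x (10*k):  C_1 | C_2 | ... | C_k
                      int k)
{
    const int m = 20, inner = 50, n = 10 * k;
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, inner, &alpha, d_A, m, d_Bstack, inner, &beta, d_Cstack, m);
}

Whether something like this applies depends entirely on the structure of your Monte Carlo update; if the left-hand matrices differ per instance, a batched GEMM call is the closer fit.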

Parallelize a method from inside a CUDA device function / kernel

I've got an already parallelized CUDA kernel that does some tasks which require frequent interpolation.
So there's a kernel
__global__ void complexStuff(...)
which calls this interpolation device function one or more times:
__device__ void interpolate(...)
The interpolation algorithm does a WENO interpolation successively over three dimensions. This is a highly parallelizable task which I would urgently like to parallelize!
It is clear that a kernel like complexStuff() can easily be parallelized by calling it from host code using the <<<...>>> syntax; as mentioned, complexStuff() is already parallelized this way.
But it's not clear to me how to parallelize something / create new threads from inside a CUDA device function ... is this even possible? Does anyone know?
You might want to consider dynamic parallelism (some resources here, here, and here) in order to call a CUDA kernel from inside another CUDA kernel. It requires a device of compute capability 3.5 or higher, and it comes with a number of restrictions and limitations that may degrade performance (mentioned in the third link).
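A minimal sketch of what a device-side launch looks like, assuming compute capability >= 3.5 and compilation with relocatable device code (e.g. nvcc -arch=sm_35 -rdc=true); the bodies, names, and parameters are placeholders, not your actual code:

// Placeholder fine-grained kernel: one thread per interpolation sub-task.
__global__ void interpolateKernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i];   // ... one fine-grained piece of the WENO interpolation ...
    }
}

__global__ void complexStuff(float *out, const float *in, int n)
{
    // ... coarse-grained work done by this thread ...

    // Device-side launch: the child grid parallelizes the interpolation that
    // would otherwise run serially inside this single thread.
    interpolateKernel<<<(n + 255) / 256, 256>>>(out, in, n);
    cudaDeviceSynchronize();   // wait for the child grid before using `out`
}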
My suggestion is to first consider launching your CUDA kernel with the complexStuff(...) amount of work multiplied by the interpolate(...) amount of work. In other words, statically work out the maximum number of fine-grained parallel jobs you need to do, then configure your kernel so that block threads perform those fine-grained jobs. Note that this is just speculation without knowing your actual code; a sketch of the flattening is below.
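A sketch of that flattening, assuming the total work factors into numCoarseJobs * numFineJobs independent pieces (all names here are placeholders):

// Placeholder sketch: one flat grid covers both levels of parallelism.
__global__ void complexStuffFlat(float *data, int numCoarseJobs, int numFineJobs)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= numCoarseJobs * numFineJobs) return;

    int coarseJob = idx / numFineJobs;   // which complexStuff() instance this is
    int fineJob   = idx % numFineJobs;   // which interpolate() sub-task within it
    // ... do the fine-grained interpolation work for (coarseJob, fineJob) ...
}

launched with enough blocks to cover numCoarseJobs * numFineJobs threads.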

Understanding Dlib Kernel implementation

I'm starting to use dlib, and I have a hard time understanding the way kernels are implemented. I started with the kernel k-means algorithm, as I know this clustering method. However, I cannot figure out where the kernel is computed. The input data is a matrix (not a kernel) and the algorithm never transforms the data into a kernel.
I would expect a kernel class returning a square matrix, but I have not seen anything like this!
I want to use dlib to implement a clustering algorithm using kernels, and dlib sounds like a good fit for this. Does anyone have documentation on how it is implemented, or can anyone explain to me how it works?
Thanks for your help!
A kernel is basically just a function that takes two input samples and outputs a single number. So yes, sometimes you will see code that then computes an N by N matrix of all the possible kernel function outputs for N samples. However, this is a somewhat naive implementation strategy since it requires O(N^2) RAM. So most real world kernel method software uses some kind of delayed evaluation or caching strategy to avoid this problem.
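As a toy illustration (plain C++, not dlib's API): a kernel function and the naive N x N Gram matrix built from it look roughly like this, and it is exactly this O(N^2) matrix that delayed evaluation and caching avoid materializing:

// Toy sketch, not dlib code: a kernel is just k(x, y) -> scalar.
#include <cmath>
#include <vector>

// Radial basis function kernel: k(x, y) = exp(-gamma * ||x - y||^2).
double rbf_kernel(const std::vector<double>& x, const std::vector<double>& y,
                  double gamma)
{
    double d2 = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
        d2 += (x[i] - y[i]) * (x[i] - y[i]);
    return std::exp(-gamma * d2);
}

// The naive strategy: precompute every pairwise kernel value (O(N^2) RAM).
std::vector<std::vector<double>> gram_matrix(
    const std::vector<std::vector<double>>& samples, double gamma)
{
    std::size_t n = samples.size();
    std::vector<std::vector<double>> K(n, std::vector<double>(n));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            K[i][j] = rbf_kernel(samples[i], samples[j], gamma);
    return K;
}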
In the kernel k-means implementation in dlib this is done with the kcentroid object. Inside the kcentroid you can see that it invokes the kernel function in a number of places and does all the "kernel stuff". You can read over the documentation for the kcentroid to understand what it does. Although, if you are just getting started with kernel methods, you will really need to get a book on the subject. I highly recommend picking one of these:
Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond by Bernhard Schölkopf and Alexander J. Smola
Kernel Methods for Pattern Analysis by John Shawe-Taylor and Nello Cristianini
For a set of N data points, the kernel is usually specified by an NxN matrix whose (i,j)th entry gives the value of the kernel between data point i and data point j. This works for kernel methods as long as the matrix is symmetric and positive semi-definite, which is guaranteed to be true for a true kernel.

Want to improve the computing speed of matrix calculation, OpenMP or CUDA?

My program has a bunch of matrix multiplications and inversions, which are time consuming.
My computer: CPU: Intel i7; GPU: NVIDIA Quadro NVS 3100M (512 MB)
Which one is better for improving computing speed? OpenMP or CUDA?
(PS: I think that, in general, a GPU has more cores than a CPU, so CUDA could give a bigger speedup than OpenMP?)
From my experience (I worked with both for a school project): in most cases the calculation time for a medium-sized array, say less than 2000 x 2000, is almost the same. The actual calculation time depends on the load on your machine (with OpenMP you will usually be sharing a cluster with other people, so make sure you are running your application alone to get a better result).
But if you are good at CUDA, the GPU is very powerful for this kind of calculation; when I was working on my CUDA project there were lots of good materials on the official website. OpenMP, on the other hand, is just a library/compiler extension, and if you are comfortable with C or C++ it should not be a problem to use (though the OpenMP tooling can be buggy, so don't trust it blindly and log everything).
Assuming you have some experience with CUDA, it is not hard to find good examples. But CUDA is awkward to debug, so I recommend trying OpenMP first; it should be easier.
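To give a sense of how little code OpenMP needs compared to writing a CUDA kernel, here is a sketch of a naive matrix multiply parallelized with a single pragma (compile with -fopenmp or /openmp; illustrative, not tuned):

// Sketch: C = A * B for n x n row-major matrices; the outer loop is split across threads.
#include <vector>

void matmul(const std::vector<float>& A, const std::vector<float>& B,
            std::vector<float>& C, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}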
I'd guess it depends on what your application is and how you go about trying to implement improvements. Keep in mind that every optimization has trade-offs. For instance, GPUs typically run single precision much faster than double precision, and there are compiler options that let you bypass some aspects of the IEEE standard, which brings some extra speed at the expense of precision, and so on.

Poor performance for calculating eigenvalues and eigenvectors on GPU

In some code we need to compute eigenvectors and eigenvalues for the generalized eigenvalue problem with symmetric real matrices (Ax = lambda Bx). This code uses DSPGVX from LAPACK. We wanted to speed it up on the GPU using a MAGMA function. We asked on this forum and were pointed to this routine:
http://icl.cs.utk.edu/magma/docs/zhegvx_8cpp.html
The size of our matrices (N) goes from 100 to 50000 and even more, related to the number of atoms in a molecule. We observe:
a) for N bigger than about 2500, MAGMA simply does not work; it fails with a segmentation fault
b) MAGMA always runs slower than sequential LAPACK, around 10 times slower
Is this behavior normal, and can we overcome it? Can anybody point to a reference where someone working on similar problems gets a decent speedup?
Thanks
In my experience you may be able to gain greater performance benefits by switching to a better eigensolver. The best solver that I know of is ARPACK. You will gain the most benefit if your matrices have some structure, for example if they are sparse. This solver is also most efficient if you only need to extract a small fraction of the total number of eigenpairs.
I would start off by trying this solver on your problems running just on the CPU. You may find that this alone gives sufficient performance for your needs. If not, it is relatively easy to move ARPACK's computational core to the GPU; alternatively, there are parallel versions of ARPACK available.
Have you tried CULA (http://www.culatools.com/)? CULA is LAPACK ported to CUDA, so at least in theory it should have one of the best implementations of the generalized eigenvalue problem. I think the single-precision version is free, so you could give it a try.