Matrix multiplication - is CUDA worth it? - C++

I have a problem that involves many matrix multiplications (classical and Kronecker products). I read that GPUs are well suited to this task, and since speed is my main objective I was thinking about using CUDA with C++. However, I would have to learn CUDA first, so before I start wasting my time I thought I should ask wiser people first. Can CUDA speed up my calculations? The matrices are generally quite small, around 20x50. Sometimes a third dimension is involved, so it becomes a 20x50x10 array. I can only multiply a couple of matrices at each step in time (10-100), but I need to run several million iterations one after the other (a Monte Carlo simulation). Currently I am using Armadillo and MATLAB.

You would see some speed-ups if your matrices were bigger; as it stands, you will be facing data-transfer bandwidth bottlenecks that cost more than the computation itself.
Something worth considering is to look for mathematical tricks that (based on your computations) would let you combine multiple instances into bigger matrices, then transfer and compute those in one go. But usually this is quite difficult and probably not always doable.
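If you do go down that road, newer cuBLAS releases expose batched GEMM calls that amortize launch overhead over many small multiplications. A minimal sketch (the wrapper function, sizes, and names are illustrative, not from the question; cublasSgemmStridedBatched itself is a real cuBLAS routine, available since CUDA 8):

#include <cublas_v2.h>
#include <cuda_runtime.h>

// Multiply `batch` pairs of column-major MxK and KxN matrices in one call.
// dA, dB, dC are device buffers holding the matrices back to back.
void batched_gemm(cublasHandle_t h, const float* dA, const float* dB,
                  float* dC, int M, int N, int K, int batch)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemmStridedBatched(h, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K, &alpha,
                              dA, M, (long long)M * K,   // stride from A_i to A_{i+1}
                              dB, K, (long long)K * N,
                              &beta,
                              dC, M, (long long)M * N,
                              batch);
}

Even then, the transfer cost per Monte Carlo iteration will dominate for 20x50 matrices; batching only pays off if the whole simulation loop stays on the device.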

Related

Multithreading concept questions

I just had to write a program in which I have to do matrix multiplication using threads, where there's a thread for every multiplication.
Now I'm wondering a few things:
Are there really any advantages to using threads to multiply a 3x2 matrix and a 2x3 matrix? For something that small, isn't sequential code still more efficient? If I'm wrong, are there any advantages or disadvantages for something so small? The complication just seems too great for something that size.
On the other hand, would a 10000x10000 matrix benefit from using threads? I would assume so, since locality comes into play, but I'm still wrapping my head around when multithreading is more efficient and when it isn't.
Thanks!
Generally, you never want multiple threads updating values that share a cache line; that kills performance. You also want to utilize the SIMD units within each thread. Both are typically achieved by processing the data in blocks (look up the terms register blocking / cache blocking). Ideally, you also want to create only as many threads as the hardware concurrency allows (to prevent expensive context switching). For data parallelism (such as matrix multiplication) this is easier; for task parallelism, thread pools are typically employed.
For small matrices like 3x2, multithreading would definitely be much, much slower than sequential processing. For larger matrices, you need to measure to find the threshold where multithreading becomes faster; that threshold depends on too many parameters to give a generic answer.
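To make that concrete, here is a minimal sketch (all names hypothetical) of the data-parallel layout described above: a fixed pool of std::thread workers, each owning a contiguous band of output rows, rather than a thread per scalar multiplication:

#include <algorithm>
#include <thread>
#include <vector>

// C = A * B for row-major n x n matrices; C is assumed zero-initialized.
// Each worker writes a disjoint band of rows of C, so no two threads
// update values from the same cache lines.
void matmul_threads(const float* A, const float* B, float* C, int n)
{
    unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w)
        pool.emplace_back([=] {
            int lo = static_cast<int>(n *  w      / workers);
            int hi = static_cast<int>(n * (w + 1) / workers);
            for (int i = lo; i < hi; ++i)
                for (int k = 0; k < n; ++k)      // i-k-j order streams B row by row
                    for (int j = 0; j < n; ++j)
                        C[i * n + j] += A[i * n + k] * B[k * n + j];
        });
    for (auto& t : pool) t.join();
}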
Also, I don't understand what you mean by
there's a thread for every multiplication
Do you want to create a single thread for every multiplication of two scalars? That would create zillions of threads for large matrices, which would be terribly slow.

Working with many fixed-size matrices in CUDA kernels

I am looking to work with about 4000 fixed-size (3x3, 4x4) matrices, doing things such as matrix inversion and eigendecomposition.
It seems to me the best way to parallelize this would be to let each of the many GPU threads work on a single instance of the problem.
Is there a reasonable way to do this? I have read: http://www.culatools.com/blog/2011/12/09/batched-operations/ but as far as I can tell, it's always something that is "being worked on" with no solution in sight. Three years later, I hope there is a good solution.
So far, I have looked at:
Using Eigen in CUDA kernels: http://eigen.tuxfamily.org/dox-devel/TopicCUDA.html. But this is in its infancy: it doesn't seem to work well, and some things are not implemented. Moreover, I am not sure it is optimized for CUDA at all. There is almost no documentation, and the only code example is a test file (eigen/test/cuda_basic.cu). When I tried using Eigen in CUDA kernels, simple things like declaring an Eigen::MatrixXf in a kernel did not survive compilation with nvcc V7.0.27 and Eigen 3.2.90 (mercurial).
Using the cuBLAS device API library to run BLAS routines within a kernel. It seems cuBLAS and its ilk are written to extract parallelism even within small matrices, which seems overkill and likely slow for the 3x3 and 4x4 matrices I am interested in. Also, I'm not sure if there is anything like cuBLAS that can also do eigendecomposition or SVD. (As far as I know, CULA does not support calling its routines from within kernels.)
Batch processing kernels using CUDA streams. In Section 2.1.7, "Batching Kernels", of the cuBLAS documentation for CUDA Toolkit v7.0, this is suggested. But "in practice it is not possible to have more than 16 concurrent kernels executing at the same time", and consequently it would be terrible for processing 4000 small matrices. In the aforementioned CULA blog post, I quote: "One could, in theory, use a CUDA stream per problem and launch one problem at a time. This would be ill-performing for two reasons. First is that the number of threads per block would be far too low; [...] Second is that the overhead incurred by launching thousands of operations in this manner would be unacceptable, because the launch code is as expensive (if not more expensive) as just performing the matrix on the CPU."
Implementing my own matrix multiplication and eigendecomposition in kernels. This is likely to be very slow, and may in addition be time consuming to implement.
At this point I am tempted to give up on doing this on the GPU at all. It is a pity, since I was hoping for real time performance for an algorithm that requires inverting 4000 3x3 matrices about 100 times every 0.1 seconds.
The cuBLAS functions getrfBatched and getriBatched are designed for batch inversion of small matrices. This should be quicker than either dynamic parallelism or streams (your second and third approaches). Also, a batch solver is available in source code form that can do matrix inversions; you will need to log in as a registered developer at developer.nvidia.com to access that link.
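A minimal sketch of that batched path (error checking omitted; the cublasSgetrfBatched / cublasSgetriBatched entry points are real, the surrounding buffer layout is just one illustrative choice):

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Invert `batch` n x n column-major matrices stored back to back in dA,
// writing the inverses to dAinv. The batched API takes an array of device
// pointers that itself lives in device memory; dA is overwritten with the
// LU factors by getrf.
void invert_batched(cublasHandle_t h, float* dA, float* dAinv, int n, int batch)
{
    std::vector<float*> hA(batch), hC(batch);
    for (int i = 0; i < batch; ++i) {
        hA[i] = dA    + (size_t)i * n * n;
        hC[i] = dAinv + (size_t)i * n * n;
    }
    float **dAarr, **dCarr; int *dPiv, *dInfo;
    cudaMalloc(&dAarr, batch * sizeof(float*));
    cudaMalloc(&dCarr, batch * sizeof(float*));
    cudaMalloc(&dPiv,  (size_t)batch * n * sizeof(int));
    cudaMalloc(&dInfo, batch * sizeof(int));
    cudaMemcpy(dAarr, hA.data(), batch * sizeof(float*), cudaMemcpyHostToDevice);
    cudaMemcpy(dCarr, hC.data(), batch * sizeof(float*), cudaMemcpyHostToDevice);

    cublasSgetrfBatched(h, n, dAarr, n, dPiv, dInfo, batch);           // LU-factor all matrices
    cublasSgetriBatched(h, n, dAarr, n, dPiv, dCarr, n, dInfo, batch); // back-solve to inverses

    cudaFree(dAarr); cudaFree(dCarr); cudaFree(dPiv); cudaFree(dInfo);
}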
Also, I'm not sure if there is anything like cuBLAS that can also do eigendecomposition or SVD. (As far as I know, CULA does not support calling its routines from within kernels).
cuSOLVER provides some eigensolver functions. However, they are neither batched nor callable from device code, so beyond that you're left with streams as the only option.

How one would implement a 2-for particle-interaction loop using CUDA, and what is the resulting complexity?

This algorithm receives a world (a list) of particles (3-dimensional vectors) and calls an interaction function between them. Roughly, in C++:

void tick(std::vector<Particle>& world) {
    for (std::size_t i = 0; i < world.size(); ++i)
        for (std::size_t j = 0; j < world.size(); ++j)
            world[i] = interact(world[i], world[j]);
}
Where interact is a function that takes two particles and returns another one, and could be anything, for example:

Particle interact(const Particle& a, const Particle& b) { return (a + b) * 0.5f; }
You can easily determine that this algorithm is O(N^2) on the CPU. In my attempt to learn CUDA, I'm not sure how it could be implemented on the GPU. What would be the general structure of such an algorithm, and what would be the resulting complexity? What if we knew the interact function did nothing when two particles are distant enough? Could we optimize it for locality?
What would be the general structure of such an algorithm, and what would be the resulting complexity?
This is essentially the n-body problem, solved using a direct particle-particle approach. It has been written about a lot. The order of the algorithm is O(N^2) on the GPU, just as it is on the CPU.
The core algorithm as implemented in CUDA doesn't change much, except that it takes advantage of local block (shared) memory and optimizes for it. Essentially, the implementation still comes down to two loops.
A good place to start is Chapter 31 of GPU Gems 3, "Fast N-Body Simulation with CUDA".
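The structure from that chapter reduces to the following sketch: one thread per particle, with the inner loop staged through shared memory one block-sized tile at a time. (Names and the interact body are placeholders echoing the question, not the paper's code.)

__device__ float3 interact(float3 a, float3 b)   // the question's toy rule
{
    return make_float3((a.x + b.x) * 0.5f, (a.y + b.y) * 0.5f, (a.z + b.z) * 0.5f);
}

__global__ void tick(const float3* in, float3* out, int n)
{
    extern __shared__ float3 tile[];             // one tile of particles per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float3 acc = (i < n) ? in[i] : make_float3(0.f, 0.f, 0.f);

    for (int base = 0; base < n; base += blockDim.x) {   // outer loop, tile by tile
        int j = base + threadIdx.x;
        if (j < n) tile[threadIdx.x] = in[j];            // cooperative load
        __syncthreads();
        for (int k = 0; k < blockDim.x && base + k < n; ++k)
            acc = interact(acc, tile[k]);                // inner loop within the tile
        __syncthreads();
    }
    if (i < n) out[i] = acc;
}
// launch: tick<<<(n + 255) / 256, 256, 256 * sizeof(float3)>>>(d_in, d_out, n);

Note the total work is still O(N^2); the tiling changes the constant (memory traffic), not the order.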
Could we optimize it for locality?
Yes. Many n-body algorithms attempt to optimize for locality: since gravitational and E-M forces decrease as a power of the distance between particles, distant particles can be ignored or their contribution approximated. Which of these approximation approaches to take largely depends on the type of system you are trying to simulate.
The following is a good overview of some of the more popular approaches:
Seminar presentation, N-body algorithms

Want to improve the computing speed of matrix calculation, OpenMP or CUDA?

My program has a bunch of matrix multiplication and inversion, which is time consuming.
My computer: CPU: Intel i7; GPU: NVIDIA Quadro NVS 3100M, 512 MB
Which one is better for improving computing speed? OpenMP or CUDA?
(P.S. I think that, in general, a GPU has more cores than a CPU, so CUDA could give a several-fold larger improvement than OpenMP?)
From my experience (I worked with both for a school project): in most conditions, the calculation time for a medium-sized array, say less than 2000x2000, is almost the same. The actual calculation time depends on the load on your machine; with OpenMP you are often sharing a cluster with other people, so make sure you run your application alone to get a better result.
But if you are good at CUDA, the GPU is very powerful for this kind of calculation, and when I was working on my CUDA project there were lots of good materials on the official website. OpenMP, on the other hand, is just a library (strictly, a compiler extension), and if you are good at C or C++ it should not be any problem to use (but OpenMP compiler support can be buggy; don't trust it blindly, and log everything).
I assumed you have experience with CUDA, and it is not hard to find good examples, I think. But CUDA is hard to debug, so I recommend you try OpenMP first; it should be easier.
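For the OpenMP side, the entry cost really is low. A minimal sketch (hypothetical function; square row-major matrices, C zero-initialized) is a single pragma on an ordinary triple loop, compiled with -fopenmp (GCC/Clang) or /openmp (MSVC):

#include <omp.h>

// C += A * B for row-major n x n matrices. The one pragma splits the
// outer i loop across all available cores.
void matmul_omp(const double* A, const double* B, double* C, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k) {
            double a = A[i * n + k];
            for (int j = 0; j < n; ++j)
                C[i * n + j] += a * B[k * n + j];
        }
}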
I'd guess it depends on what your application is and how you go about implementing the improvements. Keep in mind that every optimization has trade-offs. For instance, GPUs typically favor lower-precision floating point (single rather than double), and there are compiler options that let you bypass some aspects of the IEEE standard, buying extra speed at the expense of precision, etc.
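For example, with nvcc the IEEE-relaxing switches look like this (an illustrative build line, not from the thread):

# -use_fast_math trades IEEE conformance for speed; it implies
# --ftz=true --prec-div=false --prec-sqrt=false --fmad=true
nvcc -O3 -use_fast_math simulation.cu -o simulation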

Poor performance for calculating eigenvalues and eigenvectors on GPU

In some code, we need to get eigenvectors and eigenvalues for the generalized eigenvalue problem with symmetric real matrices (Ax = lambda*Bx). This code uses DSPGVX from LAPACK. We wanted to speed it up on the GPU using a MAGMA function. We asked on this forum and got an answer pointing to this:
http://icl.cs.utk.edu/magma/docs/zhegvx_8cpp.html
The size of our matrices (N) ranges from 100 to 50000 and even more, related to the number of atoms in a molecule. We observe that:
a) for N bigger than roughly 2500, MAGMA just does not work; we get a segmentation fault;
b) MAGMA always runs slower than sequential LAPACK, around 10 times slower.
Is this behavior normal, and can we overcome it? Can anybody point to a reference where someone working on similar problems got a decent speedup?
Thanks
In my experience, you may be able to gain greater performance benefits by switching to a better eigensolver. The best solver that I know of is ARPACK. You will gain the most benefit if your matrices have some structure, for example if they are sparse. This solver is also most efficient if you only need to extract a small fraction of the total number of eigenpairs.
I would start by trying this solver on your problems running just on the CPU. You may find that this alone gives sufficient performance for your needs. If not, it is relatively easy to move the calculation core of ARPACK to the GPU. There are also parallel versions of ARPACK available.
Have you tried CULA (http://www.culatools.com/)? CULA is LAPACK ported to CUDA (developed by EM Photonics in cooperation with NVIDIA), so at least in theory it should have one of the best implementations of the generalized eigenvalue problem. I think the single-precision version is free, so you could give it a try.