I am just beginning to learn CUDA programming and I am interested in how to handle calculations on large matrices that exceed the block/thread limits.
For example, I have seen code that shows how to perform tiled matrix multiplication, but it fails when the block size and grid size are too small. In that code, if the block size and grid size are each set to 1, only the first element of the final matrix is computed.
The answer is simple: call the kernel with larger block and grid sizes. But what happens when I want to perform a matrix multiplication with 8 million rows and 6 million columns - something arbitrarily large for which no grid and block size is sufficient on any modern GPU?
Where can I find example code, or an algorithm, for working with this sort of thing? I believe the simple case is a matrix multiplication algorithm that works when called with <<<1,1>>>, and any algorithm that can handle that call should be able to handle any larger matrix.
The main problem with very large matrices is not the number of blocks or the number of threads. The main problem is that you cannot fit the whole matrix in the GPU's DRAM. So to do the multiplication, you need to manually tile the input matrices into tiles that fit in the GPU's memory, run the matrix multiplication on each tile on the GPU with as many threads as you need, and then copy the tile result back to the host (CPU).
When you are working on these big tiles on the GPU, you need to launch thousands of threads to get the performance you need. Launching only one thread does not help you in any way.
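A minimal sketch of that host-side tiling (it assumes square tiles of size T that divide the matrix dimensions evenly, and uses a deliberately naive device kernel just to show the structure - in practice you would call cuBLAS or a shared-memory tiled kernel on each tile):

#include <cuda_runtime.h>

// Accumulate one T x T tile product: C_tile += A_tile * B_tile
__global__ void tileMatMul(const float* A, const float* B, float* C, int T)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < T && col < T) {
        float sum = 0.0f;
        for (int k = 0; k < T; ++k)
            sum += A[row * T + k] * B[k * T + col];
        C[row * T + col] += sum;
    }
}

// C = A * B with A (M x K), B (K x N), C (M x N), all row-major on the host.
void largeMatMul(const float* hA, const float* hB, float* hC,
                 int M, int N, int K, int T)
{
    float *dA, *dB, *dC;
    cudaMalloc(&dA, T * T * sizeof(float));
    cudaMalloc(&dB, T * T * sizeof(float));
    cudaMalloc(&dC, T * T * sizeof(float));

    dim3 block(16, 16);
    dim3 grid((T + 15) / 16, (T + 15) / 16);

    for (int i = 0; i < M; i += T)            // tile row of C
        for (int j = 0; j < N; j += T) {      // tile column of C
            cudaMemset(dC, 0, T * T * sizeof(float));
            for (int k = 0; k < K; k += T) {  // accumulate over the K dimension
                // copy the (i,k) tile of A and the (k,j) tile of B to the device
                cudaMemcpy2D(dA, T * sizeof(float), hA + (size_t)i * K + k,
                             K * sizeof(float), T * sizeof(float), T,
                             cudaMemcpyHostToDevice);
                cudaMemcpy2D(dB, T * sizeof(float), hB + (size_t)k * N + j,
                             N * sizeof(float), T * sizeof(float), T,
                             cudaMemcpyHostToDevice);
                tileMatMul<<<grid, block>>>(dA, dB, dC, T);
            }
            // copy the finished C tile back to the host
            cudaMemcpy2D(hC + (size_t)i * N + j, N * sizeof(float), dC,
                         T * sizeof(float), T * sizeof(float), T,
                         cudaMemcpyDeviceToHost);
        }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}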
For more information you can look at this paper:
CUDA Based Fast Implementation of Very Large Matrix Computation
I just found it by googling "large matrix multiplication CUDA"
I would like to write, in C++, a TensorFlow sparse matrix dense vector (SpMV) multiplication: y = Ax.
The sparse matrix, A, is stored in CSR format. The usual sparsity of A is between 50% and 90%. The goal is to reach a better or similar time than that of dense matrix dense vector (DMV) multiplication.
Please note that I have already viewed the following posts: Q1 Q2 Q3. However, I still am wondering about the following:
How does SpMV multiplication compare to DMV in terms of time? Since the sparsity is relatively high, I assume that SpMV should be better given the reduction in the number of operations - yes?
What should I take into account to make SpMV equal to or better than DMV in terms of time? Why do people say that DMV will perform better than SpMV? Does the storage representation make a difference?
Are there any recommended libraries that do SpMV in C++, for either CPU or GPU?
This question is relevant to my other question here: (CSCC: Convolution Split Compression Calculation Algorithm for Deep Neural Network)
To answer the edited question:
Unless the matrix is very sparse (<10% nonzeros on the CPU, probably <1% on the GPU), you will likely not benefit from the sparsity. While the number of floating point operations is reduced, the amount of storage is at least doubled (column or row index plus value), memory access is irregular (you have an indirection via the index for the right-hand side), it becomes far more difficult to vectorize (or to achieve coalescing on the GPU), and if you parallelize you have to deal with the fact that rows have varying length, so a static schedule is likely to be suboptimal.
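To make the indirection and the varying row lengths concrete, here is a minimal CSR SpMV sketch with one thread per row (the names are illustrative, not from any particular library):

// y = A * x for an n-row CSR matrix: rowPtr has n+1 entries, colIdx/values have nnz entries.
__global__ void csr_spmv(int n, const int* rowPtr, const int* colIdx,
                         const double* values, const double* x, double* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {
        double sum = 0.0;
        // rows have different lengths, so threads in a warp do different amounts of work
        for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
            sum += values[j] * x[colIdx[j]];   // indirect, generally uncoalesced read of x
        y[row] = sum;
    }
}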
Beyond the points above, yes, the storage representation matters. For example, a COO matrix stores two indices and the value, while CSR/CSC store only one but require an additional offset array, which makes them more complex to build on the fly. Especially on the GPU, storage formats matter if you want to achieve at least some coalescing. This paper looks into how storage formats affect performance on the GPU: https://onlinelibrary.wiley.com/doi/full/10.1111/cgf.13957
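As a tiny worked example (purely for illustration, not taken from the paper), the same 3x3 matrix in both formats:

// Matrix:  [10  0 20]
//          [ 0 30  0]
//          [40  0  0]

// COO: one (row, col, value) triple per nonzero
int    coo_row[] = {0, 0, 1, 2};
int    coo_col[] = {0, 2, 1, 0};
double coo_val[] = {10, 20, 30, 40};

// CSR: column index + value per nonzero, plus one offset array of length nrows+1
int    csr_rowptr[] = {0, 2, 3, 4};   // row i occupies entries rowptr[i] .. rowptr[i+1]-1
int    csr_col[]    = {0, 2, 1, 0};
double csr_val[]    = {10, 20, 30, 40};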
For something generic, try Eigen, or cuSPARSE on the GPU. There are plenty of others that perform better for specific use cases, but this part of the question isn't clearly answerable.
Beyond the matrix format itself, even the ordering of entries in your matrix can have a massive impact on performance, which is why the Cuthill-McKee algorithm is often used to reduce matrix bandwidth (and thereby improve cache performance).
I want to generate the LU decomposition of large dense matrices (N > 10^7). The LU decomposition I am currently using is based on Adaptive Cross Approximation and takes a very long time to execute for larger N. Can anybody suggest LU decomposition techniques that parallelise well (using OpenMP) and take a shorter time?
Note:
I write the code in C++ and use a Xeon processor (128 threads) and the Eigen library.
The entries of the matrix are filled through a kernel function of the form exp(-(x1-x2)^2).
Storage of the matrix is not a problem: I'm working on a Xeon processor with enough memory, and in any case I do not store the full matrix; whenever I need an entry, I evaluate the kernel function and generate a double for that cell.
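For reference, the on-the-fly evaluation is essentially the following (simplified; the point values x are placeholders for my actual data):

#include <cmath>
#include <vector>

// A(i, j) is generated on demand from the kernel function and never stored
double entry(const std::vector<double>& x, int i, int j)
{
    double d = x[i] - x[j];
    return std::exp(-d * d);
}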
I'm working with convolutional deep neural networks (CRBMs in particular). I need to perform several 'valid' convolutions and several 'full' convolutions, and I need to optimize their speed. I'm working in C++.
In the case of the 'valid' convolution I have been able to speed it up by around 50% by rearranging the image (the equivalent of im2col in Matlab) and multiplying it by the matrix of kernels (one kernel per line), replacing several convolutions with one matrix multiplication.
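For concreteness, a simplified single-channel version of that rearrangement (an im2col-style transform; variable names are just for illustration) looks like this - each output position becomes one row of length KH*KW, so multiplying the result by the kernel matrix computes all the convolutions at once:

#include <vector>

// im2col for 'valid' convolution: image is H x W (row-major), kernel is KH x KW,
// output is (H-KH+1) x (W-KW+1); one row per output pixel.
std::vector<float> im2col_valid(const std::vector<float>& img,
                                int H, int W, int KH, int KW)
{
    int OH = H - KH + 1, OW = W - KW + 1;
    std::vector<float> cols((size_t)OH * OW * KH * KW);
    for (int oy = 0; oy < OH; ++oy)
        for (int ox = 0; ox < OW; ++ox) {
            float* row = &cols[((size_t)oy * OW + ox) * KH * KW];
            for (int ky = 0; ky < KH; ++ky)
                for (int kx = 0; kx < KW; ++kx)
                    row[ky * KW + kx] = img[(size_t)(oy + ky) * W + (ox + kx)];
        }
    return cols;
}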
Unfortunately, I haven't been able to gain anything with the 'full' convolution. I know that I can also rearrange the image (using an equivalent of convmtx2 in Matlab), but simply rearranging the image into a convolution matrix is already much slower (almost an order of magnitude) than doing all the convolutions directly. The convolution matrix becomes large very quickly.
Is there a good practical way of speeding up 'full' convolution (with multiple kernels) with matrix multiplication in C++? Perhaps an efficient way of computing convmtx2?
I've been experimenting with CUDA kernels for days to perform a fast 2D convolution between a 500x500 image (though I could also vary the dimensions) and a very small 2D kernel (a 3x3 Laplacian kernel - too small to take huge advantage of all the CUDA threads).
I created a classic CPU implementation (two for loops, as easy as you would think) and then started creating CUDA kernels.
After a few disappointing attempts to perform a faster convolution I ended up with this code:
http://www.evl.uic.edu/sjames/cs525/final.html (see the Shared Memory section). It basically has a 16x16 thread block load all the convolution data it needs into shared memory and then perform the convolution.
No luck: the CPU is still a lot faster. I didn't try the FFT approach because the CUDA SDK states that it is efficient with large kernel sizes.
Whether or not you read everything I wrote, my question is:
how can I perform a fast 2D convolution between a relatively large image and a very small kernel (3x3) with CUDA?
You are right that a 3x3 kernel is not suitable for an FFT-based approach. The best way to deal with this would be to push the kernel into constant memory (or, if you are using a Fermi+ card, this should not matter too much).
Since you know the kernel size, the fastest way to do this would be to read chunks of the input image/signal into shared memory and perform an unrolled multiply-and-add operation.
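A minimal sketch of what I mean (a 3x3 filter already copied into constant memory with cudaMemcpyToSymbol, a 16x16 block staging a tile plus halo in shared memory, borders clamped; strictly speaking it computes a correlation, which for a symmetric Laplacian kernel is the same thing):

#define KS 3    // kernel size
#define BS 16   // block size
__constant__ float d_kernel[KS * KS];   // fill with cudaMemcpyToSymbol before launching

// launch with block(BS, BS) and grid((width+BS-1)/BS, (height+BS-1)/BS)
__global__ void conv3x3(const float* in, float* out, int width, int height)
{
    __shared__ float tile[BS + KS - 1][BS + KS - 1];

    int x = blockIdx.x * BS + threadIdx.x;
    int y = blockIdx.y * BS + threadIdx.y;

    // cooperatively load the (BS+2) x (BS+2) input tile, clamping at the borders
    for (int dy = threadIdx.y; dy < BS + KS - 1; dy += BS)
        for (int dx = threadIdx.x; dx < BS + KS - 1; dx += BS) {
            int gx = (int)blockIdx.x * BS + dx - KS / 2;
            int gy = (int)blockIdx.y * BS + dy - KS / 2;
            gx = min(max(gx, 0), width - 1);
            gy = min(max(gy, 0), height - 1);
            tile[dy][dx] = in[gy * width + gx];
        }
    __syncthreads();

    if (x < width && y < height) {
        float sum = 0.0f;
        #pragma unroll
        for (int ky = 0; ky < KS; ++ky) {    // fully unrolled 3x3 multiply-add
            #pragma unroll
            for (int kx = 0; kx < KS; ++kx)
                sum += d_kernel[ky * KS + kx] * tile[threadIdx.y + ky][threadIdx.x + kx];
        }
        out[y * width + x] = sum;
    }
}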
--
If you are willing to use libraries to perform this operation, ArrayFire and OpenCV have highly optimized convolution routines that can save you a lot of development time.
I am not too familiar with OpenCV, but in ArrayFire you can do something like the following.
array kernel = array(3, 3, h_kernel, afHost); // Transfer the kernel to gpu
array image = array(w, h, h_image , afHost); // Transfer the image to gpu
array result = convolve2(image, kernel); // Performs 2D convolution
EDIT
The added benefit of using ArrayFire is that its batched operation allows you to perform convolutions in parallel. You can read about how convolutions support batch operations over here.
For example, if you had 10 images that you want to convolve using the same kernel, you could do something like the following:
array kernel = array(3, 3, h_kernel, afHost); // Transfer the kernel to gpu
array images = array(w, h, 10, h_images, afHost); // Transfer the images to gpu
array res = convolve2(images, kernel); // Perform all operations simultaneously
--
Full Disclosure: I work at AccelerEyes and actively work on ArrayFire.
I have CUDA code in which I have implemented several C2C 2D FFTs. They all use the same plan, but for some reason the times for the 2D FFTs are large and seem to vary quite a bit: FFTs of the same data size seem to take anywhere from 0.4s to 1.8s.
This is for a 1920x1080 FFT. Do those times seem reasonable?
Anyhow - I have had good experience with CUDA 1D batched FFTs being fast. Is it the same to take a 1D FFT across the rows, and then again across the columns of a matrix, to give the same result as this 2D FFT? I have seen 1D FFTs take a few hundredths of a second on larger data sets before, so I was hoping I could improve some of these results.
Thanks
A 2D transform of a 1K by 1K image requires 2K 1D transforms. Therefore those times seem reasonable.
For more information have a look at: http://paulbourke.net/miscellaneous/dft/
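For what it's worth, cuFFT lets you express exactly that row/column decomposition with batched, strided 1D plans and no explicit transpose. A minimal sketch for an H x W row-major complex image already on the device (error checking omitted):

#include <cufft.h>

void fft2d_as_1d(cufftComplex* data, int W, int H)
{
    // 1D FFTs along every row: elements are contiguous, consecutive rows are W apart
    cufftHandle rowPlan;
    int nRow[1] = { W };
    cufftPlanMany(&rowPlan, 1, nRow,
                  nRow, 1, W,       // input layout: stride 1, distance W between transforms
                  nRow, 1, W,       // output layout: same (in-place)
                  CUFFT_C2C, H);    // H row transforms in one batch
    cufftExecC2C(rowPlan, data, data, CUFFT_FORWARD);

    // 1D FFTs along every column: elements are W apart, consecutive columns are 1 apart
    cufftHandle colPlan;
    int nCol[1] = { H };
    cufftPlanMany(&colPlan, 1, nCol,
                  nCol, W, 1,
                  nCol, W, 1,
                  CUFFT_C2C, W);    // W column transforms in one batch
    cufftExecC2C(colPlan, data, data, CUFFT_FORWARD);

    cufftDestroy(rowPlan);
    cufftDestroy(colPlan);
}

That said, cufftPlan2d performs the same row/column decomposition internally, so if the 2D plan looks slow it is worth checking whether plan creation or host-device transfers are being included in the timing.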