I've been experimenting with CUDA kernels for days to perform a fast 2D convolution between a 500x500 image (though I could also vary the dimensions) and a very small 2D kernel (a 3x3 Laplacian kernel, too small to take real advantage of all the CUDA threads).
I created a classic CPU implementation (two nested for loops, as simple as you would expect) and then started writing CUDA kernels.
After a few disappointing attempts to perform a faster convolution I ended up with this code:
http://www.evl.uic.edu/sjames/cs525/final.html (see the Shared Memory section). It basically lets a 16x16 thread block load all the convolution data it needs into shared memory and then performs the convolution.
No luck: the CPU is still a lot faster. I didn't try the FFT approach because the CUDA SDK states that it is efficient for large kernel sizes.
Whether or not you read everything I wrote, my question is:
how can I perform a fast 2D convolution between a relatively large image and a very small kernel (3x3) with CUDA?
You are right that a 3x3 kernel is not suitable for an FFT-based approach. The best way to deal with this would be to push the kernel into constant memory (although if you are using a Fermi+ card, this should not matter too much).
Since you know the kernel size, the fastest way to do this would be to read chunks of the input image/signal into shared memory and perform an unrolled multiply-and-add operation, as in the sketch below.
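Here is a minimal sketch of that approach, assuming a single-channel row-major float image, border pixels handled by clamping, and names (conv3x3Shared, TILE) that are my own rather than from the linked code:

__constant__ float d_kernel[9];            // 3x3 kernel, copied once from the host

#define TILE 16                            // threads per block in each dimension

__global__ void conv3x3Shared(const float* in, float* out, int width, int height)
{
    // Shared tile with a 1-pixel halo on every side.
    __shared__ float tile[TILE + 2][TILE + 2];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Load the tile: each thread loads its own pixel, and edge threads also load the halo.
    for (int dy = threadIdx.y; dy < TILE + 2; dy += TILE)
        for (int dx = threadIdx.x; dx < TILE + 2; dx += TILE)
        {
            int gx = min(max(blockIdx.x * TILE + dx - 1, 0), width  - 1);
            int gy = min(max(blockIdx.y * TILE + dy - 1, 0), height - 1);
            tile[dy][dx] = in[gy * width + gx];
        }
    __syncthreads();

    if (x >= width || y >= height) return;

    // Fully unrolled 3x3 multiply-and-add using the constant-memory kernel.
    float sum = 0.0f;
    #pragma unroll
    for (int ky = 0; ky < 3; ++ky)
        #pragma unroll
        for (int kx = 0; kx < 3; ++kx)
            sum += d_kernel[ky * 3 + kx] * tile[threadIdx.y + ky][threadIdx.x + kx];

    out[y * width + x] = sum;
}

// Host side (error checking omitted):
//   cudaMemcpyToSymbol(d_kernel, h_kernel, 9 * sizeof(float));
//   dim3 block(TILE, TILE);
//   dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
//   conv3x3Shared<<<grid, block>>>(d_in, d_out, width, height);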
--
If you are willing to use libraries to perform this operation, ArrayFire and OpenCV have highly optimized convolution routines that can save you a lot of development time.
I am not too familiar with OpenCV, but in ArrayFire you can do something like the following.
array kernel = array(3, 3, h_kernel, afHost); // Transfer the kernel to gpu
array image = array(w, h, h_image , afHost); // Transfer the image to gpu
array result = convolve2(image, kernel); // Performs 2D convolution
EDIT
An added benefit of using ArrayFire is that its batched operation allows you to perform multiple convolutions in parallel. You can read about how convolutions support batch operations over here.
For example, if you had 10 images that you wanted to convolve using the same kernel, you could do something like the following:
array kernel = array(3, 3, h_kernel, afHost); // Transfer the kernel to gpu
array images = array(w, h, 10, h_images, afHost); // Transfer the images to gpu
array res = convolve2(images, kernel); // Perform all operations simultaneously
--
Full Disclosure: I work at AccelerEyes and actively work on ArrayFire.
Related
We are currently developing an optimized convolution algorithm for ARM, in C++. We are using Ubuntu 18.04 for aarch64 on an ARM development board, Odroid N2+ to be exact.
To test the algorithm against other popular methods (im2col+gemm, FFT, Winograd, ...), we downloaded PyTorch and compiled its C++ API, libtorch, natively in our development environment.
Unfortunately, it seems there isn't a way to change the convolution algorithm through the given API.
Is there a way to change the convolution algorithm in Pytorch? (Or libtorch?) If not, are there any other frameworks or APIs that would provide optimized implementations of different convolution algorithms for our environment? (ARMv8, aarch64, C++)
We had some experience in modifying several CNN platforms. For orientation-independent pattern recognition tasks, one optimized computation approach is to use symmetrical kernels for all convolution layers and to replace the first flattened layer with a merged convolution layer (see "Transformationally Identical and Invariant Convolutional Neural Networks [TI-CNN] through Symmetric Element Operators", https://arxiv.org/abs/1806.03636).
We successfully implemented fast algorithms on the Caffe, PyTorch, and Darknet (GPU & CPU) platforms by changing the im2col+gemm code (2-6 times faster) and the FFT code (~1.5 times faster), and by using only a fraction of the memory buffer, with dih4 symmetrical kernels. For a 3x3 dih4 kernel, naive convolution only takes 3 multiplications per output pixel (as the sketch below illustrates), so Winograd has no advantage and is not needed in the symmetrical-kernel mode.
The TensorFlow GPU convolution process uses the cuDNN library for im2col+gemm, so we could not change its im2col. We can still force all its kernels to be symmetric (either at the C++ or Python level), but computation and memory cannot be saved, beyond reducing 0-360 degrees to 0-45 degrees (the other 7 wedge-shaped angle sections are identical to 0-45 degrees when dih4 symmetrical kernels are employed) if rotation data augmentation is used.
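To make the dih4 point concrete, here is a small illustration of my own (not code from any of the platforms above): a dih4-symmetric 3x3 kernel has only three distinct weights, so each interior output pixel needs just three multiplications.

// A dih4-symmetric 3x3 kernel has the form
//     b a b
//     a c a
//     b a b
// so the per-pixel work reduces to one multiply for the centre, one for the
// summed edge neighbours, and one for the summed corner neighbours.
float convolve_dih4_pixel(const float* img, int width, int x, int y,
                          float c, float a, float b)
{
    // Valid for interior pixels; borders need the usual special handling.
    const float* row0 = img + (y - 1) * width + x;
    const float* row1 = img +  y      * width + x;
    const float* row2 = img + (y + 1) * width + x;

    float edges   = row0[0]  + row1[-1] + row1[1] + row2[0];   // 4 edge neighbours
    float corners = row0[-1] + row0[1]  + row2[-1] + row2[1];  // 4 corner neighbours

    return c * row1[0] + a * edges + b * corners;              // 3 multiplications
}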
When writing your own custom op in TensorFlow with GPU support, the guide suggests computing the gradients using Python. Elsewhere, people have used C++ with libraries such as Eigen to implement the gradients in the same way, more efficiently.
My question is: with the custom operation's argument Tensors provided as pointers to device memory when training on a GPU (is this correct?):
OpKernelContext* context
const Tensor& grad = context->input(0);
Can copying data between host and device be avoided by computing the gradients for the operation in CUDA on the GPU?
Will this reduce compute time? (I know this is dependent on how well the gradient computation lends itself to parallel computation, but let's assume it does.)
Is there any reason why this shouldn't be done? Are the potential speed increases too marginal for it to be worthwhile?
You can simply build an op in CUDA and then call it inside the Python definition of your gradient. This way you can speed up the gradient computation greatly and do not have to copy between GPU and CPU memory.
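As a rough sketch of what such a GPU gradient op can look like (the op name MyGrad and the launcher MyGradLauncher are hypothetical, and the REGISTER_OP shape registration and error handling are omitted), the Tensor inputs are already device pointers when the kernel is registered for DEVICE_GPU, so a CUDA kernel can work on them directly:

#include "tensorflow/core/framework/op_kernel.h"

using namespace tensorflow;

// Hypothetical CUDA launcher defined in a .cu.cc file; it launches a kernel
// that writes the gradient into `out` entirely on the device.
void MyGradLauncher(const float* grad, float* out, int n);

class MyGradOp : public OpKernel {
 public:
  explicit MyGradOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}

  void Compute(OpKernelContext* context) override {
    const Tensor& grad = context->input(0);

    Tensor* output = nullptr;
    OP_REQUIRES_OK(context,
                   context->allocate_output(0, grad.shape(), &output));

    // With a DEVICE_GPU registration these .data() pointers already live in
    // device memory, so no host/device copies are needed here.
    MyGradLauncher(grad.flat<float>().data(),
                   output->flat<float>().data(),
                   static_cast<int>(grad.NumElements()));
  }
};

REGISTER_KERNEL_BUILDER(Name("MyGrad").Device(DEVICE_GPU), MyGradOp);

On the Python side you would then register a gradient function (for example with tf.RegisterGradient) that simply calls this compiled op, so the data never leaves the GPU.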
I'm working with convolutional deep neural networks (CRBMs in particular). I need to perform several 'valid' convolutions and several 'full' convolutions, and I need to optimize their speed. I'm working in C++.
In the case of the 'valid' convolution, I have been able to speed it up by around 50% by rearranging the image (the equivalent of im2col in Matlab) and multiplying (a matrix multiply) by the matrix of kernels (one per row), replacing several convolutions with one matrix multiplication (see the sketch below).
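As an aside, a minimal version of that rearrangement for the 'valid' case can be sketched as follows (single-channel row-major image, one patch per column; for true convolution, as opposed to cross-correlation, the rows of the kernel matrix must hold the flipped kernels):

#include <vector>

// Rearrange all kxk patches of an (h x w) row-major image into columns, so that
// a (num_kernels x k*k) kernel matrix times this (k*k x num_patches) matrix
// performs all 'valid' convolutions as one matrix multiplication.
std::vector<float> im2col_valid(const float* img, int h, int w, int k)
{
    int out_h = h - k + 1;
    int out_w = w - k + 1;
    int num_patches = out_h * out_w;

    std::vector<float> cols(static_cast<size_t>(k) * k * num_patches);

    for (int y = 0; y < out_h; ++y)
        for (int x = 0; x < out_w; ++x)
        {
            int patch = y * out_w + x;                       // column index
            for (int ky = 0; ky < k; ++ky)
                for (int kx = 0; kx < k; ++kx)
                {
                    int row = ky * k + kx;                   // row index
                    cols[static_cast<size_t>(row) * num_patches + patch] =
                        img[(y + ky) * w + (x + kx)];
                }
        }
    return cols;
}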
Unfortunately, I haven't been able to gain anything with 'full' convolution. I know that I can also rearrange the image (using an equivalent of convmtx2 in Matlab), but simply rearranging the image matrix into a convolution matrix is already much slower (by almost an order of magnitude) than doing all the convolutions. The convolution matrix also becomes large very quickly.
Is there a good, practical way of speeding up 'full' convolution (with multiple kernels) with matrix multiplication in C++? Perhaps an efficient way of computing convmtx2?
I am just beginning to learn CUDA programming, and I am interested in how to handle the calculation of large matrices which surpass the block/thread sizes.
For example, I have seen code which shows how to perform tiled matrix multiplication, but it fails when the block size and grid size are too small. In that code, if the block size and grid size are each set to 1, then only the first element of the final matrix will be computed.
The answer is simple: call the kernel with larger block and grid sizes. But what happens when I want to perform a matrix multiplication with 8 million rows and 6 million columns, something arbitrarily large for which there cannot be a proper grid and block size on any modern GPU?
Where can I find example code or an algorithm for how to work with this sort of thing? I believe the simplest case is a matrix multiplication algorithm which works if called with <<<1,1>>>; any algorithm which can handle that call should be able to handle any larger matrix.
The main problem with a very large matrix is not the number of blocks or the number of threads. The main problem is that you cannot fit the whole matrix in the GPU's DRAM. So to do the multiplication, you need to manually use tiling to divide the input matrix into tiles that fit in the GPU's memory. Then you run the matrix multiplication on each tile on the GPU, with as many threads as you need, and return the tile result back to the host (CPU).
When you are working on these big tiles on the GPU, you need to launch thousands of threads to get the performance you need; launching only one thread does not help you in any way (a kernel sketch that stays correct even for a tiny launch follows at the end of this answer).
For more information, you can look at this paper:
CUDA Based Fast Implementation of Very Large Matrix Computation
I just found it by googling "large matrix multiplication CUDA"
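As for the <<<1,1>>> part of the question, a grid-stride loop is the usual pattern for making a kernel correct under any launch configuration. Below is a minimal sketch of my own (a naive multiply with no shared-memory tiling); it still assumes the operands fit in GPU memory, so the host-side tiling described above remains necessary for truly huge matrices.

// C = A * B, with A of size (m x k), B of size (k x n), C of size (m x n), all row-major.
__global__ void matmulGridStride(const float* A, const float* B, float* C,
                                 int m, int n, int k)
{
    // size_t arithmetic avoids overflow when m * n is very large.
    size_t total  = static_cast<size_t>(m) * n;
    size_t start  = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    size_t stride = static_cast<size_t>(gridDim.x) * blockDim.x;

    // Each thread strides over output elements, so any grid/block configuration
    // (including <<<1,1>>>) produces the full, correct result.
    for (size_t idx = start; idx < total; idx += stride)
    {
        size_t row = idx / n;
        size_t col = idx % n;

        float sum = 0.0f;
        for (int i = 0; i < k; ++i)
            sum += A[row * k + i] * B[static_cast<size_t>(i) * n + col];

        C[row * n + col] = sum;
    }
}

// Example launches (d_A, d_B, d_C are device pointers):
//   matmulGridStride<<<256, 256>>>(d_A, d_B, d_C, m, n, k);   // sensible configuration
//   matmulGridStride<<<1, 1>>>(d_A, d_B, d_C, m, n, k);       // correct, but very slow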
What's the most efficient way to do image pyramiding in CUDA? I have written my own kernels to do so, but I imagine we can do better.
Binding to an OpenGL texture using OpenGL interop and using the hardware mipmapping would probably be much faster. Any pointers on how to do this, or on other approaches?
Mipmaps are set up when accessed/initialized in OpenGL/DirectX. A CUDA kernel can do the same thing if you allocate a texture 50% wider (or higher) than the initial texture and use the kernel to down-sample the texture, writing the result beside the original. The kernel will probably work best where each thread evaluates a pixel in the next down-sampled image.
It's up to you to determine the sampling scheme and choose appropriate weights for combining the pixels. Try bilinear to start with; once that's working you can set up trilinear (cubic) or other sampling schemes such as anisotropic. Simple sampling (linear and cubic) will likely be more efficient, since coalesced memory access will occur (refer to the CUDA SDK programming guide).
You will probably need to tile the kernel execution, since the thread count is limited for a parallel invocation (too many pixels, too few threads = use tiling to chunk the parallel execution). You might also find Mesa3D useful as a reference (it's an open-source implementation of OpenGL).
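As a concrete starting point, here is a minimal sketch of my own for one down-sampling step (a plain 2x2 box average on a row-major float buffer rather than an OpenGL texture), with one thread per pixel of the next level as suggested above:

// Produce the next pyramid level by averaging each 2x2 block of the source.
__global__ void downsample2x2(const float* src, int srcW, int srcH,
                              float* dst, int dstW, int dstH)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= dstW || y >= dstH) return;

    int sx = 2 * x;
    int sy = 2 * y;

    // Clamp so odd-sized source images are handled at the right/bottom edge.
    int sx1 = min(sx + 1, srcW - 1);
    int sy1 = min(sy + 1, srcH - 1);

    float sum = src[sy  * srcW + sx] + src[sy  * srcW + sx1]
              + src[sy1 * srcW + sx] + src[sy1 * srcW + sx1];

    dst[y * dstW + x] = 0.25f * sum;
}

// Host side: call once per level, halving the dimensions each time, e.g.
//   dim3 block(16, 16);
//   dim3 grid((dstW + 15) / 16, (dstH + 15) / 16);
//   downsample2x2<<<grid, block>>>(d_src, srcW, srcH, d_dst, dstW, dstH);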