I'm working with convolutional deep neural networks (CRBMs in particular). I need to perform several 'valid' convolutions and several 'full' convolutions, and I need to optimize their speed. I'm working in C++.
In the case of the 'valid' convolution I have been able to speed it up by around 50% by rearranging the image (the equivalent of im2col in MATLAB) and multiplying it by the matrix of kernels (one kernel per row), replacing several convolutions with one matrix multiplication.
Unfortunately, I haven't been able to gain anything with the 'full' convolution. I know that I can also rearrange the image (using an equivalent of convmtx2 in MATLAB), but simply rearranging the image into a convolution matrix is already much slower (by almost an order of magnitude) than doing all the convolutions directly, and the convolution matrix becomes very large very quickly.
Is there a good, practical way of speeding up 'full' convolution (with multiple kernels) with matrix multiplication in C++? Perhaps an efficient way of computing convmtx2?
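For reference, here is a minimal sketch of the im2col + GEMM approach described above for the 'valid' case (single channel, naive GEMM standing in for a BLAS call; the function names are just illustrative):

    #include <cstddef>
    #include <vector>

    // Rearrange image patches into columns so that a 'valid' convolution with
    // several kernels becomes a single matrix multiplication (im2col + GEMM).
    // cols gets (kH*kW) rows and (outH*outW) columns, stored row-major.
    void im2col_valid(const std::vector<float>& img, int H, int W,
                      int kH, int kW, std::vector<float>& cols)
    {
        const int outH = H - kH + 1, outW = W - kW + 1;
        cols.assign(static_cast<std::size_t>(kH) * kW * outH * outW, 0.0f);
        for (int ki = 0; ki < kH; ++ki)
            for (int kj = 0; kj < kW; ++kj)
                for (int oi = 0; oi < outH; ++oi)
                    for (int oj = 0; oj < outW; ++oj)
                        cols[(ki * kW + kj) * (outH * outW) + oi * outW + oj] =
                            img[(oi + ki) * W + (oj + kj)];
    }

    // Naive GEMM: out(nKernels x nCols) = kernels(nKernels x kSize) * cols(kSize x nCols).
    // In practice this would be a call to an optimized BLAS sgemm.
    void gemm(const std::vector<float>& kernels, const std::vector<float>& cols,
              int nKernels, int kSize, int nCols, std::vector<float>& out)
    {
        out.assign(static_cast<std::size_t>(nKernels) * nCols, 0.0f);
        for (int k = 0; k < nKernels; ++k)
            for (int p = 0; p < kSize; ++p)
                for (int c = 0; c < nCols; ++c)
                    out[k * nCols + c] += kernels[k * kSize + p] * cols[p * nCols + c];
    }

Note that a 'full' convolution is equivalent to a 'valid' convolution on the image zero-padded by kH-1 / kW-1 on each side, so in principle the same im2col path could be reused; whether that beats doing the convolutions directly is exactly what I'm asking.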
We are currently developing an optimized convolution algorithm for ARM, in C++. We are using Ubuntu 18.04 for aarch64 on an ARM development board, an Odroid N2+ to be exact.
To test our algorithm against other popular methods (im2col+gemm, FFT, Winograd...), we downloaded PyTorch and compiled its C++ API, libtorch, natively in our development environment.
Unfortunately, there doesn't seem to be a way to choose the convolution algorithm through the given API.
Is there a way to select the convolution algorithm in PyTorch (or libtorch)? If not, are there other frameworks or APIs that provide optimized implementations of different convolution algorithms for our environment (ARMv8, aarch64, C++)?
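For context, the comparison on the libtorch side boils down to timing torch::conv2d, which does not expose the algorithm choice; a rough sketch of such a benchmark (tensor sizes are arbitrary placeholders):

    #include <torch/torch.h>
    #include <chrono>
    #include <iostream>

    int main() {
        // Arbitrary NCHW input and a bank of 3x3 kernels.
        torch::Tensor input  = torch::randn({1, 64, 56, 56});
        torch::Tensor weight = torch::randn({128, 64, 3, 3});

        // Warm-up so one-time initialization does not distort the timing.
        torch::conv2d(input, weight);

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < 100; ++i)
            torch::conv2d(input, weight, /*bias=*/{}, /*stride=*/1, /*padding=*/1);
        auto t1 = std::chrono::steady_clock::now();

        std::cout << std::chrono::duration<double, std::milli>(t1 - t0).count() / 100
                  << " ms per conv2d call\n";
        return 0;
    }

Whatever backend ATen dispatches to internally is what gets measured; as far as we can tell, there is no parameter here to force im2col+gemm, FFT, or Winograd.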
We have had some experience modifying several CNN platforms. For orientation-independent pattern recognition tasks, one optimized computation approach is to use symmetric kernels in all convolution layers and to replace the first flattened layer with a merged convolution layer (see "Transformationally Identical and Invariant Convolutional Neural Networks [TI-CNN] through Symmetric Element Operators", https://arxiv.org/abs/1806.03636).
We successfully implemented fast algorithms on the Caffe, PyTorch, and Darknet platforms (GPU & CPU) by modifying the im2col+gemm code (2-6 times faster) and the FFT code (~1.5 times faster), and using only a fraction of the memory buffer, thanks to dih4-symmetric kernels. For a 3x3 kernel, naive convolution with a dih4 kernel takes only 3 multiplications per output element, so Winograd has no advantage and is not needed in the symmetric-kernel mode.
The TensorFlow GPU convolution path uses the cuDNN library for im2col+gemm, so we could not change its im2col. We can still force all its kernels to be symmetric (at either the C++ or Python level), but no computation or memory is saved; the only gain, when rotation data augmentation is used, is that the 0-360 degree range reduces to 0-45 degrees (the other 7 wedges of angle sections are identical to 0-45 degrees when dih4-symmetric kernels are employed).
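To illustrate the 3-multiplication point: a dih4-symmetric 3x3 kernel has only three distinct weights (center, edge, corner), so the symmetric input positions can be summed first and each distinct weight multiplied once per output. A rough single-channel sketch ('valid' region only; the function name is just illustrative):

    #include <cstddef>
    #include <vector>

    // 'Valid' convolution with a dih4-symmetric 3x3 kernel. Such a kernel has
    // only three distinct weights: w_center, w_edge (4 edge taps) and
    // w_corner (4 corner taps), so each output needs only 3 multiplications.
    void conv3x3_dih4(const std::vector<float>& img, int H, int W,
                      float w_center, float w_edge, float w_corner,
                      std::vector<float>& out)
    {
        const int outH = H - 2, outW = W - 2;
        out.assign(static_cast<std::size_t>(outH) * outW, 0.0f);
        for (int i = 0; i < outH; ++i) {
            for (int j = 0; j < outW; ++j) {
                const float* p = &img[static_cast<std::size_t>(i) * W + j];
                float center  = p[W + 1];
                float edges   = p[1] + p[W] + p[W + 2] + p[2 * W + 1];
                float corners = p[0] + p[2] + p[2 * W] + p[2 * W + 2];
                // Sum first, then multiply once per distinct weight: 3 multiplies.
                out[static_cast<std::size_t>(i) * outW + j] =
                    w_center * center + w_edge * edges + w_corner * corners;
            }
        }
    }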
I would like to write, in C++ for TensorFlow, a sparse matrix dense vector (SpMV) multiplication: y = Ax.
The sparse matrix, A, is stored in CSR format. The usual sparsity of A is between 50% and 90%. The goal is to reach a better or similar time than that of dense matrix dense vector (DMV) multiplication.
Please note that I have already viewed the following posts: Q1 Q2 Q3. However, I still am wondering about the following:
How does SpMV multiplication compare to DMV in terms of time? Since the sparsity is relatively high, I assume SpMV should be better given the reduction in the number of operations. Is that right?
What should I take into account to make SpMV match or beat DMV in terms of time? Why do people say that DMV will perform better than SpMV? Does the storage representation make a difference?
Are there any recommended libraries that do SpMV in C++, for either CPU or GPU?
This question is relevant to my other question here: (CSCC: Convolution Split Compression Calculation Algorithm for Deep Neural Network)
To answer the edited question:
Unless the matrix is very sparse (< 10% nonzeros on the CPU, probably < 1% on the GPU), you will likely not benefit from the sparsity. While the number of floating-point operations is reduced, the amount of storage is at least doubled (column or row index plus value), memory access is irregular (you reach the right-hand side through an indirection via the index), it becomes far more difficult to vectorize (or to achieve coalescing on the GPU), and if you parallelize you have to deal with the fact that rows have varying lengths, so a static schedule is likely to be suboptimal.
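To make the indirection concrete, a plain CSR SpMV kernel looks roughly like this; note the indexed load of x, which is the irregular access mentioned above:

    #include <cstddef>
    #include <vector>

    // y = A * x with A in CSR format: row_ptr has n_rows + 1 entries,
    // col_idx/values hold the column index and value of each nonzero.
    void spmv_csr(const std::vector<int>& row_ptr,
                  const std::vector<int>& col_idx,
                  const std::vector<double>& values,
                  const std::vector<double>& x,
                  std::vector<double>& y)
    {
        const std::size_t n_rows = row_ptr.size() - 1;
        y.assign(n_rows, 0.0);
        for (std::size_t i = 0; i < n_rows; ++i) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
                sum += values[k] * x[col_idx[k]];   // indirect access via col_idx
            y[i] = sum;
        }
    }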
Beyond the points above, yes, the storage representation matters. For example, a COO matrix stores two indices plus the value per nonzero, while CSR/CSC store only one index but require an additional offset array, which makes them more complex to build on the fly. Especially on the GPU, storage formats matter if you want to achieve at least some coalescing. This paper looks into how storage formats affect performance on the GPU: https://onlinelibrary.wiley.com/doi/full/10.1111/cgf.13957
For something generic, try Eigen, or cuSPARSE on the GPU. There are plenty of others that perform better for specific use cases, but this part of the question isn't clearly answerable.
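A minimal Eigen example of that generic route (assembly via triplets, then the sparse product; the size and entries are placeholders):

    #include <Eigen/Dense>
    #include <Eigen/Sparse>
    #include <vector>

    int main() {
        const int n = 1000;

        // Collect the nonzeros as (row, col, value) triplets, then compress.
        std::vector<Eigen::Triplet<double>> entries;
        entries.emplace_back(0, 0, 1.0);              // ... one per nonzero ...

        // Default storage is column-major (CSC); use Eigen::RowMajor for CSR.
        Eigen::SparseMatrix<double> A(n, n);
        A.setFromTriplets(entries.begin(), entries.end());

        Eigen::VectorXd x = Eigen::VectorXd::Random(n);
        Eigen::VectorXd y = A * x;                    // sparse matrix * dense vector
        return 0;
    }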
Beyond the matrix format itself, even the ordering of entries in your matrix can have a massive impact on performance, which is why the Cuthill-McKee algorithm is often used to reduce matrix bandwidth (and thereby improve cache performance).
I have a problem that involves many matrix multiplications (classical and Kronecker products). I read that GPUs are well suited for this task, and since speed is my main objective I was thinking about using CUDA with C++. However, I would have to learn CUDA first, so before I start wasting my time I thought I should ask wiser people first. Can CUDA speed up my calculations? The matrices are generally quite small, around 20x50. Sometimes a third dimension is involved, so it becomes a 20x50x10 array. I can only multiply a couple of matrices at one step in time (10-100), but I need to run several million iterations one after another (a Monte Carlo simulation). Currently I am using Armadillo and MATLAB.
You would see some speedups if your matrices were bigger; at these sizes you will be limited by data-transfer bandwidth to the GPU more than by computation time.
Something worth considering is whether there are mathematical tricks that would allow you (based on your computations) to combine multiple instances into bigger matrices, and then transfer and compute those. Usually this is quite difficult and not always doable.
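As a hedged illustration of the "combine into bigger matrices" idea, using Armadillo (which you already have): when the same A multiplies several right-hand sides, concatenating them turns many tiny products into one larger one, which lets BLAS (or a GPU) work on a bigger operand and reduces per-call overhead. The sizes below just mirror the ones in the question:

    #include <armadillo>
    #include <vector>

    int main() {
        const int k = 100;                              // products per time step
        arma::mat A(20, 50, arma::fill::randu);

        // k independent right-hand sides of size 50x10.
        std::vector<arma::mat> Bs;
        for (int i = 0; i < k; ++i)
            Bs.emplace_back(50, 10, arma::fill::randu);

        // Stack them side by side and do one 50 x (10*k) product instead of k small ones.
        arma::mat Bbig(50, 10 * k);
        for (int i = 0; i < k; ++i)
            Bbig.cols(10 * i, 10 * i + 9) = Bs[i];

        arma::mat Cbig = A * Bbig;                      // column block i equals A * Bs[i]
        return 0;
    }

This only applies when the left factor is shared; if every product has a different A, batching needs a different trick (e.g. block-diagonal assembly or a batched GEMM).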
I want to compute the LU decomposition of large dense matrices (N > 10^7). The LU decomposition I'm currently using is based on Adaptive Cross Approximation and takes a very long time to execute for larger N. Can anybody suggest LU decomposition techniques that parallelize well (using OpenMP) and take a shorter time?
Note:
- I write the code in C++ and make use of a Xeon processor (128 threads) and the Eigen library.
- The entries in the matrix are filled through a kernel function of the form exp(-(x1-x2)^2).
Storage of the matrix is not a problem: I'm working on a Xeon processor with enough memory, and moreover I'm not storing the full matrix; whenever I need an entry of the matrix, I use the kernel function to generate a double value for that cell.
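For reference, this is roughly how one block of the matrix is generated on the fly from the kernel and handed to Eigen's dense LU; the point coordinates and the block size are placeholders (the full N > 10^7 matrix obviously cannot be assembled densely):

    #include <Eigen/Dense>
    #include <cmath>
    #include <vector>

    // Kernel entry A(i, j) = exp(-(x_i - x_j)^2), generated on demand.
    double kernel_entry(const std::vector<double>& x, int i, int j)
    {
        const double d = x[i] - x[j];
        return std::exp(-d * d);
    }

    int main()
    {
        const int n = 4096;                        // size of one block, placeholder
        std::vector<double> x(n);
        for (int i = 0; i < n; ++i) x[i] = i * 1e-3;

        Eigen::MatrixXd A(n, n);
        #pragma omp parallel for                   // filling the block is embarrassingly parallel
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                A(i, j) = kernel_entry(x, i, j);

        // Dense LU with partial pivoting on the assembled block.
        Eigen::PartialPivLU<Eigen::MatrixXd> lu(A);
        return 0;
    }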
I am just beginning to learn some CUDA programming and I am interested in how to handle the calculation of large matrices which surpass the block/thread sizes.
For example, I have seen code which shows how to perform tiled matrix multiplication, but it fails when the block size and grid size are too small. In that code, if the block size and grid size are each set to 1, then only the first element of the final matrix is computed.
The answer is simple: call the kernel with larger block and grid sizes. But what happens when I want to perform a matrix multiplication with 8 million rows and 6 million columns, something arbitrarily large for which there cannot be a proper grid and block size for any modern GPU?
Where can I find example code or an algorithm for how to work with this sort of thing? I believe the simple case would be a matrix multiplication algorithm that works if called with <<<1,1>>>, and any algorithm which can account for this call should be able to account for any larger matrix.
The main problem with a very large matrix is not the number of blocks or the number of threads. The main problem is that you cannot fit the whole matrix into the GPU's DRAM. So, to do the multiplication, you need to manually use tiling to divide the input matrices into tiles that fit into the GPU's memory, run the matrix multiplication on each tile on the GPU with as many threads as you need, and then return the tile result back to the host (CPU).
While you are working on these big tiles on the GPU, you still need to launch thousands of threads to get the performance you need; launching only one thread does not help you in any way.
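A rough host-side sketch of that tiling loop; multiply_tile_on_gpu is a hypothetical placeholder for whatever GPU GEMM you call (a hand-written CUDA kernel or a cuBLAS routine), and the tile size T has to be chosen so that three T x T tiles fit in device memory:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Hypothetical wrapper around a GPU GEMM on T x T tiles:
    // C_tile += A_tile * B_tile. The implementation (device allocation,
    // copies, kernel launch) is not shown here.
    void multiply_tile_on_gpu(const double* A_tile, const double* B_tile,
                              double* C_tile, int T);

    // C (M x N) = A (M x K) * B (K x N), processed in T x T tiles so that only
    // three tiles need to reside on the GPU at any time. Edge handling is
    // omitted: M, N and K are assumed to be multiples of T.
    void tiled_matmul(const double* A, const double* B, double* C,
                      int M, int N, int K, int T)
    {
        std::vector<double> A_tile((std::size_t)T * T), B_tile((std::size_t)T * T),
                            C_tile((std::size_t)T * T);
        for (int bi = 0; bi < M; bi += T) {
            for (int bj = 0; bj < N; bj += T) {
                std::fill(C_tile.begin(), C_tile.end(), 0.0);
                for (int bk = 0; bk < K; bk += T) {
                    // Gather the current tiles of A and B from host memory.
                    for (int i = 0; i < T; ++i)
                        for (int j = 0; j < T; ++j) {
                            A_tile[(std::size_t)i * T + j] = A[(std::size_t)(bi + i) * K + (bk + j)];
                            B_tile[(std::size_t)i * T + j] = B[(std::size_t)(bk + i) * N + (bj + j)];
                        }
                    multiply_tile_on_gpu(A_tile.data(), B_tile.data(), C_tile.data(), T);
                }
                // Write the finished C tile back to host memory.
                for (int i = 0; i < T; ++i)
                    for (int j = 0; j < T; ++j)
                        C[(std::size_t)(bi + i) * N + (bj + j)] = C_tile[(std::size_t)i * T + j];
            }
        }
    }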
For more information you can look at this paper:
CUDA Based Fast Implementation of Very Large Matrix Computation
I just found it by googling "large matrix multiplication CUDA".