I would like to write sparse matrix dense vector (SpMV) multiplication, y = Ax, in C++ for TensorFlow.
The sparse matrix A is stored in CSR format, and its sparsity is usually between 50% and 90%. The goal is to reach a time better than or similar to that of dense matrix dense vector (DMV) multiplication.
Please note that I have already viewed the following posts: Q1 Q2 Q3. However, I am still wondering about the following:
How does SpMV multiplication compare to DMV in terms of time? Since the sparsity is relatively high, I assume SpMV should be faster given the reduction in the number of operations - yes?
What should I take into account to make SpMV match or beat DMV in time? Why do people say that DMV will perform better than SpMV? Does the storage representation make a difference?
Are there any recommended libraries that do SpMV in C++, for either CPU or GPU?
This question is relevant to my other question here: (CSCC: Convolution Split Compression Calculation Algorithm for Deep Neural Network)
To answer the edited question:
Unless the matrix is very sparse (<10% nonzeros on CPU, probably <1% on GPU), you will likely not benefit from the sparsity. While the number of floating point operations is reduced, several costs work against you:
- the amount of storage is at least double (column or row index + value);
- memory access is irregular (you have an indirection via the index for the right-hand side);
- it becomes far more difficult to vectorize (or to achieve coalescing on the GPU);
- if you parallelize, you have to deal with rows of varying length, so a static schedule is likely to be suboptimal.
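For reference, here is a minimal CSR SpMV sketch in C++ (the names row_ptr/col_idx/vals are just the usual CSR conventions, not from the question); the indirect read of x in the inner loop is exactly the irregular access mentioned above:

    #include <vector>

    // y = A * x with A in CSR format.
    // Row i's nonzeros live in positions [row_ptr[i], row_ptr[i+1])
    // of col_idx and vals; row_ptr has nrows + 1 entries.
    void spmv_csr(int nrows,
                  const std::vector<int>& row_ptr,
                  const std::vector<int>& col_idx,
                  const std::vector<double>& vals,
                  const std::vector<double>& x,
                  std::vector<double>& y) {
        for (int i = 0; i < nrows; ++i) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
                sum += vals[k] * x[col_idx[k]];  // indirection via col_idx
            y[i] = sum;
        }
    }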
Beyond the points above, yes, the storage representation matters. For example, a COO matrix stores two indices plus the value per nonzero, while CSR/CSC store only one index but require an additional offset array, which makes them more complex to build on the fly. Especially on the GPU, storage formats matter if you want to achieve at least some coalescing. This paper looks into how storage formats affect performance on the GPU: https://onlinelibrary.wiley.com/doi/full/10.1111/cgf.13957
For something generic, try Eigen, or cuSPARSE on the GPU. There are plenty of other libraries that perform better for specific use cases, but this part of the question isn't clearly answerable.
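As a sketch of the Eigen route (Eigen's SparseMatrix is CSC by default and CSR-like with the RowMajor option; the diagonal fill here is purely illustrative):

    #include <Eigen/Dense>
    #include <Eigen/Sparse>
    #include <vector>

    int main() {
        const int n = 1000;
        // Build the sparse matrix from (row, col, value) triplets.
        std::vector<Eigen::Triplet<double>> triplets;
        for (int i = 0; i < n; ++i)
            triplets.emplace_back(i, i, 2.0);  // illustrative pattern
        Eigen::SparseMatrix<double, Eigen::RowMajor> A(n, n);  // CSR-like storage
        A.setFromTriplets(triplets.begin(), triplets.end());

        Eigen::VectorXd x = Eigen::VectorXd::Random(n);
        Eigen::VectorXd y = A * x;  // sparse matrix * dense vector
        return 0;
    }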
Beyond the matrix format itself, even the ordering of entries in your matrix can have a massive impact on performance, which is why the Cuthill-McKee algorithm is often used to reduce matrix bandwidth (and thereby improve cache performance).
Related
Given an array A of size 10^5.
Then m operations are given (m is very large, m >> the size of A); each operation adds t at position p:
A[p]+=t
Finally, I output the value of each position of the whole array.
Is there any constant-factor optimization to speed up these intermediate modification operations?
For example, if I sort the positions, I can apply the modifications sequentially and avoid random access. However, the sort itself incurs an additional cost. Is there any other way to speed this up?
Re-executing all operations after sorting can be an order of magnitude faster than executing them directly, but the cost of the sort is too high.
On architectures with many cores, the best solution is certainly to perform atomic updates of A[p] in parallel. This assumes the number of cores is big enough for the parallelism to not only mitigate the overhead of the atomic operations but also beat the serial implementation. This can be done quite easily with OpenMP or with native C++ threads/atomics. The number of cores should not be too large either, otherwise the number of conflicts may grow significantly, causing contention and thus decreasing performance; this should be fine here since the array is pretty big. This solution also assumes the accesses are fairly uniformly random. If they are not (e.g. a normal distribution), the contention can be too high for the method to be efficient.
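A minimal OpenMP sketch of the atomic approach (pos/val holding the m operations are illustrative names, not from the question):

    #include <cstdint>
    #include <vector>

    // Apply A[pos[i]] += val[i] for all operations in parallel,
    // protecting each update with an atomic.
    void accumulate_atomic(std::vector<double>& A,
                           const std::vector<int64_t>& pos,
                           const std::vector<double>& val) {
        #pragma omp parallel for schedule(static)
        for (int64_t i = 0; i < (int64_t)pos.size(); ++i) {
            #pragma omp atomic
            A[pos[i]] += val[i];
        }
    }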
An alternative solution is to split the accesses between N threads spatially. The array range can be statically split into N (roughly equal) parts. All threads read all the inputs, but only the thread owning the target range of the output array writes into it. The array parts can then be combined afterwards. This method works well with few threads and a uniform data distribution. When the distribution is not uniform at all (e.g. a normal distribution), a pre-computation step may be needed to adjust the array ranges owned by the threads. For example, one can compute the median, or even the quartiles, to better balance the work between threads. Computing quartiles can be done with a selection algorithm like Floyd-Rivest (std::nth_element should not be too bad, although I expect it to use a kind of introselect algorithm that is often a bit slower). The pre-computation may be expensive, but it should still be significantly faster than a full sort. OpenMP is certainly a good fit for implementing this.
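A sketch of the spatial splitting (in this variant each thread writes directly into its owned slice of the shared array, so no combining pass is needed; the quartile-based rebalancing is left out):

    #include <cstdint>
    #include <omp.h>
    #include <vector>

    // Every thread scans all operations but applies only those whose
    // target index falls into its own slice of A.
    void accumulate_owned_ranges(std::vector<double>& A,
                                 const std::vector<int64_t>& pos,
                                 const std::vector<double>& val) {
        const int64_t n = (int64_t)A.size();
        #pragma omp parallel
        {
            const int t  = omp_get_thread_num();
            const int nt = omp_get_num_threads();
            const int64_t lo = n * t / nt;        // owned range [lo, hi)
            const int64_t hi = n * (t + 1) / nt;
            for (size_t i = 0; i < pos.size(); ++i) {
                const int64_t p = pos[i];
                if (p >= lo && p < hi)
                    A[p] += val[i];               // only the owner writes here
            }
        }
    }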
Another alternative implementation is simply to perform the reduction separately in each thread and then sum the per-thread arrays into a global array. This solution works well in your case (since m >> the size of A), assuming the number of cores is not too big; if it is, one needs to mix this method with the first one. This last method is probably the simplest efficient one.
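A sketch of this per-thread reduction (OpenMP 4.5 can express it more tersely with an array-section reduction; here the private copies and the final sum are explicit):

    #include <cstdint>
    #include <vector>

    // Each thread accumulates into a private copy of A, then the
    // copies are summed into the shared array one thread at a time.
    void accumulate_private_copies(std::vector<double>& A,
                                   const std::vector<int64_t>& pos,
                                   const std::vector<double>& val) {
        const size_t n = A.size();
        #pragma omp parallel
        {
            std::vector<double> local(n, 0.0);     // private, so no atomics
            #pragma omp for nowait
            for (int64_t i = 0; i < (int64_t)pos.size(); ++i)
                local[pos[i]] += val[i];

            #pragma omp critical
            for (size_t j = 0; j < n; ++j)          // cheap since m >> |A|
                A[j] += local[j];
        }
    }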
Besides, @Jérôme Richard's answer targets parallel thread computing.
I would suggest an idea of partial sorting, something like "merge sort, just a few iterations" or "bucket sort, only within buckets" (note, these are different). Preferably, set the bucket size to the page size for better overall performance at the OS level, especially considering that m is extraordinarily big. The cost of the partial sort would be amortized by the savings in cache misses and page swaps.
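A sketch of the bucket variant (the 4 KiB page assumption and the chunk size are mine):

    #include <cstdint>
    #include <utility>
    #include <vector>

    // Group operations by destination chunk (~one 4 KiB page of doubles),
    // then apply one chunk at a time so writes stay page/cache local.
    void accumulate_bucketed(std::vector<double>& A,
                             const std::vector<int64_t>& pos,
                             const std::vector<double>& val) {
        const int64_t chunk = 512;  // 4096 bytes / sizeof(double)
        const int64_t nbuckets = ((int64_t)A.size() + chunk - 1) / chunk;
        std::vector<std::vector<std::pair<int64_t, double>>> buckets(nbuckets);
        for (size_t i = 0; i < pos.size(); ++i)
            buckets[pos[i] / chunk].push_back({pos[i], val[i]});
        for (auto& b : buckets)
            for (auto& [p, t] : b)
                A[p] += t;
    }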
And if this is an interview question, I would ask for more details about m, p, t, data sparsity, distribution, hardware, CPU, memory, power consumption, latency, etc., and customize the design accordingly for each new condition.
I need to solve a large set of (independent) Ax=b linear problems.
This cannot be parallelized (or more specifically, this is within each processor's responsibility anyway).
The Ax=b systems are small (say 10x10 at most) but dense (usually all terms are non-zero), and both the A matrices and the RHS vectors are completely different and independent.
What is the most efficient/practical way of solving a large set of small Ax=b problems using PETSc?
I.e., how costly would it be to have a single A matrix and a single b vector that are modified each time and solved for each system?
After considering all the options, I found PETSc not very efficient at handling this situation: the matrices had to be rebuilt/repopulated with different values all the time, and it ended up being relatively memory-costly.
I ended up putting PETSc aside for this matter.
Since your A matrices are small, why not try a direct method to solve them? For example, the SuperLU MT (multithreaded) version or the SuperLU serial version.
If you have all the matrices at once, you can make a large block-diagonal matrix out of them, and this will be efficient (vectorized and such).
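Outside PETSc, and in the same spirit as the direct-method suggestion above, each small system can simply be factored and solved independently; a minimal Eigen sketch (assuming the systems are available as Eigen types, which is my assumption, not the question's):

    #include <Eigen/Dense>
    #include <vector>

    // Solve many independent small dense systems A_i x_i = b_i
    // with one LU factorization (partial pivoting) per system.
    std::vector<Eigen::VectorXd> solve_all(const std::vector<Eigen::MatrixXd>& As,
                                           const std::vector<Eigen::VectorXd>& bs) {
        std::vector<Eigen::VectorXd> xs(As.size());
        for (size_t i = 0; i < As.size(); ++i)
            xs[i] = As[i].partialPivLu().solve(bs[i]);
        return xs;
    }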
I have a problem which involves many matrix multiplications (classical and Kronecker products). I read that GPUs are suited to this task, and since speed is my main objective, I was thinking about using CUDA with C++. However, I would have to learn CUDA first, so before I start wasting my time I thought I should ask wiser people. Can CUDA speed up my calculations? The matrices are generally quite small, around 20x50. Sometimes a third dimension is involved, so it becomes a 20x50x10 array. I can only multiply a couple of matrices at each step in time (10-100), but I need to run several million iterations one after the other (a Monte Carlo simulation). Currently I am using Armadillo and MATLAB.
You would see some speedups if your matrices were bigger; as it is, you will be facing data-bandwidth bottlenecks worse than computation-time delays.
Something worth considering is to look for mathematical tricks that would allow you (based on your computations) to combine multiple instances into bigger matrices and then transfer and compute them together. But usually this is quite difficult and probably not always doable.
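A related option that needs no mathematical trick is cuBLAS's batched GEMM, which multiplies many small matrices in a single launch; a host-side sketch (dimensions and batch count are illustrative; data upload, error checking, and cleanup are omitted):

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <vector>

    int main() {
        const int m = 20, k = 50, n = 10, batch = 100;
        cublasHandle_t handle;
        cublasCreate(&handle);

        // One small device matrix per batch entry.
        std::vector<float*> hA(batch), hB(batch), hC(batch);
        for (int i = 0; i < batch; ++i) {
            cudaMalloc(&hA[i], m * k * sizeof(float));
            cudaMalloc(&hB[i], k * n * sizeof(float));
            cudaMalloc(&hC[i], m * n * sizeof(float));
            // ... cudaMemcpy the actual data into hA[i], hB[i] ...
        }
        // The API takes device-side arrays of device pointers.
        float **dA, **dB, **dC;
        cudaMalloc(&dA, batch * sizeof(float*));
        cudaMalloc(&dB, batch * sizeof(float*));
        cudaMalloc(&dC, batch * sizeof(float*));
        cudaMemcpy(dA, hA.data(), batch * sizeof(float*), cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB.data(), batch * sizeof(float*), cudaMemcpyHostToDevice);
        cudaMemcpy(dC, hC.data(), batch * sizeof(float*), cudaMemcpyHostToDevice);

        const float alpha = 1.0f, beta = 0.0f;
        // C_i = A_i * B_i for all i in one call (column-major, as BLAS expects).
        cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                           &alpha, (const float**)dA, m,
                           (const float**)dB, k,
                           &beta, dC, m, batch);

        cublasDestroy(handle);
        return 0;
    }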
I am just beginning to learn some CUDA programming and I am interested in how to handle calculations on large matrices which surpass the block/thread sizes.
For example, I have seen code which shows how to perform tiled matrix multiplication, but it fails when the block size and grid size are too small: in the mentioned code, if the block size and grid size are each set to 1, then only the first element of the final matrix is computed.
The answer is simple: call the kernel with larger block and grid sizes. But what happens when I want to perform a matrix multiplication with 8 million rows and 6 million columns - something arbitrarily large for which there cannot be a proper grid and block size on any modern GPU?
Where can I find example code, or an algorithm, for how to work with this sort of thing? I believe that the simple case should be a matrix multiplication algorithm which works if called with <<<1,1>>>, and any algorithm which can account for this call should be able to account for any larger matrix.
The main problem with a very large matrix is not the number of blocks or the number of threads. The main problem is that you cannot fit the whole matrix in the GPU's DRAM. So to do the multiplication, you need to manually use tiling to divide the input matrix into tiles that fit in the GPU's memory. Then you run the matrix multiplication on each tile on the GPU, with as many threads as you need, and return the tile result back to the host (CPU).
When you are working on these big tiles on the GPU, you need to launch thousands of threads to get the performance you need. Launching only one thread does not help you in any way.
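For the launch-configuration half of the question: a standard pattern is the grid-stride loop, which stays correct for any grid/block size, including <<<1,1>>> (just very slowly). A sketch assuming row-major operands already resident in GPU memory; the host-side tiling above is still needed once the matrices exceed device memory:

    // C = A * B, A is MxK, B is KxN, C is MxN, all row-major.
    // Each thread strides over output elements by the total thread count,
    // so correctness never depends on the launch configuration.
    __global__ void matmul_grid_stride(const float* A, const float* B, float* C,
                                       int M, int K, int N) {
        const long long total  = (long long)M * N;
        const long long stride = (long long)blockDim.x * gridDim.x;
        for (long long idx = (long long)blockIdx.x * blockDim.x + threadIdx.x;
             idx < total; idx += stride) {
            const int row = (int)(idx / N);
            const int col = (int)(idx % N);
            float sum = 0.0f;
            for (int k = 0; k < K; ++k)
                sum += A[(long long)row * K + k] * B[(long long)k * N + col];
            C[idx] = sum;
        }
    }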
For more information, you can look at this paper:
CUDA Based Fast Implementation of Very Large Matrix Computation
I just found it by googling "large matrix multiplication CUDA"
I have a big piece of software written in Fortran 77, running on Linux, that uses multidimensional arrays for storing time-indexed matrices for several different series, but the indexing order is a row-major ordering like C's. However, Fortran is column-major, so there are cache-miss penalties when the values are accessed sequentially.
As an example, I have two series of 100,000 10x10 matrices. I'm storing them as:
MATRIX(2, 100000, 10, 10)
but as I understand it, if the values are intended to be accessed linearly, the optimal declaration for Fortran would be
MATRIX(10, 10, 100000, 2)
Right now the effort of refactoring all the code to use the other ordering would be big, so I would like to get an idea of the potential improvement first.
Is there any way I could easily measure the impact of having the wrong ordering, to estimate the potential improvement of changing it?
Maybe some automatic measurement of cache misses, or of the cache misses associated with those arrays.
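For instance, I could count misses directly with something like perf stat -e cache-references,cache-misses ./prog, or with Valgrind's cachegrind. To get a ballpark of the layout effect on its own, I could also time a small C++ analogue of the two orderings (a sketch; the shape matches my MATRIX(2,100000,10,10) example, everything else is made up):

    #include <chrono>
    #include <cstdio>
    #include <vector>

    // Sum the same 2 x 100000 x 10 x 10 data set twice: once with
    // stride-1 access, once with the large strides a column-major
    // MATRIX(2,100000,10,10) produces when read matrix-by-matrix.
    int main() {
        const int S = 2, T = 100000, R = 10, C = 10;
        std::vector<double> a((size_t)S * T * R * C, 1.0);
        using clk = std::chrono::steady_clock;
        double sum1 = 0.0, sum2 = 0.0;

        auto t0 = clk::now();
        for (size_t i = 0; i < a.size(); ++i)  // contiguous (good layout)
            sum1 += a[i];
        auto t1 = clk::now();

        for (int s = 0; s < S; ++s)            // strided (current layout)
            for (int t = 0; t < T; ++t)
                for (int r = 0; r < R; ++r)
                    for (int c = 0; c < C; ++c)
                        sum2 += a[((size_t)c * R + r) * T * S + (size_t)t * S + s];
        auto t2 = clk::now();

        std::printf("contiguous: %lld us, strided: %lld us (%.0f %.0f)\n",
            (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count(),
            (long long)std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count(),
            sum1, sum2);
        return 0;
    }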