scatterv and sendrecv of columns of a matrix in MPI [duplicate] - c++

This question already has answers here:
MPI_Scatter - sending columns of 2D array
(3 answers)
Closed 8 years ago.
My program needs to scatter a matrix between processes. The matrix is represented in memory by a 1D array. In my first version I scattered the matrix between processes by rows. Each process then sends some rows of its local matrix to the others, so that every process can carry out its part of the computation properly; these rows are exchanged with the MPI_Sendrecv function. Up to this point everything works fine.
Now it has occurred to me that if the matrix has many more columns than rows, it would be better to scatter it by columns instead: fewer elements of the local matrices would have to be sent between processes, which should improve the program's scalability. The thing is... how can I scatter the matrix by columns? And then, how can each process select the proper columns to send to the others?

If possible, try changing your 1D array from row-major order to column-major order, scatter it, perform the computation, gather it back, and then convert it from column-major back to row-major. Depending on your matrix, the cost of the back-and-forth transformation might be greater than the savings obtained from parallelizing along the columns. See the boost::multi_array documentation on storage ordering: http://www.boost.org/doc/libs/1_55_0b1/libs/multi_array/doc/user.html#sec_storage
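Alternatively, you can avoid the transpose entirely and scatter columns straight out of the row-major array with an MPI derived datatype, which is the approach taken in the duplicate question linked above. Below is a minimal sketch, assuming one column per process and a rows x nprocs matrix of doubles: a column is described with MPI_Type_vector, and the type's extent is shrunk with MPI_Type_create_resized so that consecutive columns start one element apart.

#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int rows = 4;                  // assume a rows x nprocs matrix
    std::vector<double> matrix;          // row-major storage on the root
    if (rank == 0) {
        matrix.resize(rows * nprocs);
        for (int i = 0; i < rows * nprocs; ++i) matrix[i] = i;
    }

    // A column of a row-major matrix: 'rows' elements, each 'nprocs' apart.
    MPI_Datatype col, col_resized;
    MPI_Type_vector(rows, 1, nprocs, MPI_DOUBLE, &col);
    // Shrink the extent to one double so column i starts i elements into the buffer.
    MPI_Type_create_resized(col, 0, sizeof(double), &col_resized);
    MPI_Type_commit(&col_resized);

    std::vector<double> mycol(rows);     // each rank receives one contiguous column
    MPI_Scatter(rank == 0 ? matrix.data() : nullptr, 1, col_resized,
                mycol.data(), rows, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    MPI_Type_free(&col_resized);
    MPI_Type_free(&col);
    MPI_Finalize();
    return 0;
}

A type built the same way (with the local matrix's width as the stride) also works when processes exchange individual columns via MPI_Sendrecv, since one element of the resized type is exactly one column.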

Related

Sparse Matrix Vs Dense Matrix Multiplication C++ Tensorflow

I would like to write sparse matrix dense vector (SpMV) multiplication in C++ Tensorflow: y = Ax
The sparse matrix, A, is stored in CSR format. The usual sparsity of A is between 50-90%. The goal is to reach a time similar to or better than that of dense matrix dense vector (DMV) multiplication.
Please note that I have already viewed the following posts: Q1 Q2 Q3. However, I still am wondering about the following:
How does SpMV compare to DMV in terms of time? Since the sparsity is relatively high, I assume SpMV should be faster given the reduction in the number of operations - yes?
What should I take into account to make SpMV match or beat DMV in time? Why do people say that DMV will perform better than SpMV? Does the storage representation make a difference?
Are there any recommended libraries that implement SpMV in C++, for either CPU or GPU?
This question is relevant to my other question here: (CSCC: Convolution Split Compression Calculation Algorithm for Deep Neural Network)
To answer the edited question:
Unless the matrix is very sparse (<10% nonzeros on CPU, probably <1% on GPU), you will likely not benefit from the sparsity. While the number of floating point operations is reduced, the amount of storage is at least double (column or row index plus value), memory access is irregular (you have an indirection via the index into the right-hand side), it becomes far more difficult to vectorize (or to achieve coalescing on the GPU), and if you parallelize you have to deal with the fact that rows are of varying length, so a static schedule is likely to be suboptimal.
Beyond the points above, yes, the storage representation matters. For example, a COO matrix stores two indices and the value per entry, while CSR/CSC store only one index per entry but require an additional offset array, which makes them more complex to build on the fly. Especially on the GPU, storage formats matter if you want to achieve at least some coalescing. This paper looks into how storage formats affect performance on the GPU: https://onlinelibrary.wiley.com/doi/full/10.1111/cgf.13957
For something generic, try Eigen on the CPU or cuSPARSE on the GPU. There are plenty of other libraries that perform better for specific use cases, but this part of the question isn't clearly answerable.
Beyond the matrix format itself, even the ordering of entries in your matrix can have a massive impact on performance, which is why the Cuthill-McKee algorithm is often used to reduce matrix bandwidth (and thereby improve cache performance).
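To make the Eigen suggestion concrete, here is a minimal SpMV sketch (assuming Eigen 3.x; the fill pattern is made up just to get roughly 10% density):

#include <Eigen/Sparse>
#include <Eigen/Dense>
#include <vector>

int main() {
    const int n = 1000;
    std::vector<Eigen::Triplet<double>> entries;
    // Fill ~10% of the entries; a real matrix would come from your application.
    for (int i = 0; i < n; ++i)
        for (int j = i % 10; j < n; j += 10)
            entries.emplace_back(i, j, 1.0);

    // RowMajor sparse storage in Eigen is CSR, matching the question's format;
    // the default ColMajor would give CSC instead.
    Eigen::SparseMatrix<double, Eigen::RowMajor> A(n, n);
    A.setFromTriplets(entries.begin(), entries.end());

    Eigen::VectorXd x = Eigen::VectorXd::Random(n);
    Eigen::VectorXd y = A * x;   // sparse matrix * dense vector
    return 0;
}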

Statistical sampling of code [duplicate]

This question already has answers here:
How to efficiently calculate a running standard deviation
(17 answers)
Closed 5 years ago.
This may be a very open ended question.
I have to quickly measure the time taken by some sections of code. I'm using std::chrono::high_resolution_clock. I have to run the code for many iterations and measure the duration of each.
So here is the problem: I can track the minimum and maximum durations and compute the average using the sample count, so I only need to store four values. But I would also like to know how the data is distributed. Calculating the standard deviation or a histogram seems to require storing all the data points, which means either one giant pre-allocated data structure or a dynamically growing one - both of which would affect the code being measured on my embedded system.
Is there a way to calculate standard deviation for this sample using the standard deviation of the previous sample?
Calculating the standard deviation or a histogram seems to require storing all the data points
That's trivially false. You can calculate a running standard deviation with Welford's algorithm, which just requires one extra variable besides the running mean and the current count of elements.
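A minimal sketch of Welford's algorithm in C++ (the struct name is mine; only three scalars of state are kept):

#include <cmath>
#include <cstdint>

struct RunningStats {
    std::uint64_t n = 0;
    double mean = 0.0;
    double m2 = 0.0;   // running sum of squared deviations from the current mean

    void push(double x) {
        ++n;
        const double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);   // note: uses the already-updated mean
    }
    double variance() const { return n > 1 ? m2 / (n - 1) : 0.0; }  // sample variance
    double stddev() const { return std::sqrt(variance()); }
};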
As for the histogram, you don't need to keep all the data - you just need to keep a count for each bin and increment the right bin each time you get a new sample. Of course, for this simple approach to pay off you need to know the expected range and the number of bins in advance. If that isn't possible, you can start with small bins over a small range and grow the bin size (merging adjacent bins) whenever you encounter an element outside the current range. Again, all of this requires only a fixed amount of memory (one integer per bin and two values for the range).
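And a minimal sketch of the fixed-range variant, under the assumption that the range [lo, hi) and the bin count are known up front:

#include <array>
#include <cstddef>

template <std::size_t Bins>
struct Histogram {
    double lo, hi;                          // half-open sample range [lo, hi)
    std::array<std::size_t, Bins> counts{}; // one counter per bin, zero-initialized

    void push(double x) {
        if (x < lo || x >= hi) return;      // or widen/merge bins here, as above
        const std::size_t bin =
            static_cast<std::size_t>((x - lo) / (hi - lo) * Bins);
        ++counts[bin];
    }
};

// Usage: Histogram<64> h{0.0, 1.0}; h.push(sample);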

Calculating (very) large matrix products with CUDA

I am just beginning to learn CUDA programming, and I am interested in how to handle the calculation of large matrices whose dimensions surpass the block/thread limits.
For example, I have seen code that shows how to perform tiled matrix multiplication, but it fails when the block size and grid size are too small. In the code I saw, if the block size and grid size are each set to 1, then only the first element of the final matrix is computed.
The answer is simple: call the kernel with larger block and grid sizes. But what happens when I want to perform a matrix multiplication with 8 million rows and 6 million columns - something arbitrarily large for which there cannot be a proper grid and block size on any modern GPU?
Where can I find example code, or an algorithm, for how to work with this sort of thing? I believe the base case should be a matrix multiplication algorithm that works when called with <<<1,1>>>; any algorithm that can handle that call should be able to handle any larger matrix.
The main problem with very large matrices is not the number of blocks or the number of threads. The main problem is that you cannot fit the whole matrix in the GPU's DRAM. So to do the multiplication, you need to manually tile the input matrices into pieces that do fit in the GPU's memory, run the matrix multiplication on each tile on the GPU with as many threads as you need, and then copy each tile's result back to the host (CPU).
When you are working on these big tiles on the GPU, you need to launch thousands of threads to get the performance you need; launching only one thread does not help you in any way.
For more information you can look at this paper:
CUDA Based Fast Implementation of Very Large Matrix Computation
I just found it by googling "large matrix multiplication CUDA".
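On the question's <<<1,1>>> point specifically: kernels are often written with a grid-stride loop, so the same code is correct for any launch configuration, including a single thread (it will just be slow). A minimal sketch with made-up names; note this handles the element count, not the out-of-memory problem described above:

__global__ void scale(const float* in, float* out, size_t n, float alpha) {
    const size_t stride = (size_t)gridDim.x * blockDim.x;
    // Each thread walks the array in steps of the total thread count, so the
    // kernel covers all n elements no matter how few threads were launched.
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        out[i] = alpha * in[i];
}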

Impact of a row-major ordered program in Fortran

I have a large piece of software written in Fortran 77, running on Linux, that uses multidimensional arrays to store time-indexed matrices for several different series, but the indexing order is row-major, as in C. Fortran, however, stores arrays in column-major order, so there are cache-miss penalties when the values are accessed sequentially.
As an example, I have two series of 100,000 10x10 matrices. I'm storing them as:
MATRIX(2, 100000, 10, 10)
but as I understand it, if the values are meant to be accessed linearly, the optimal declaration for Fortran would be
MATRIX(10, 10, 100000, 2)
Right now, refactoring all the code to the other ordering would be a big effort, but I would like to have an idea of the potential payoff first.
Is there any way I could easily measure the impact of having the wrong ordering, to estimate the improvement I would get by changing it?
Maybe some automatic measurement of cache misses, or of the cache misses associated with those arrays.
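One low-effort way to estimate the penalty before committing to the refactor: on Linux, perf stat -e cache-references,cache-misses ./program or Valgrind's cachegrind tool will count cache misses for the whole run, and cachegrind can attribute them to source lines. You can also time the two traversal orders directly on a synthetic array of the same shape. A minimal C++ sketch of the stride effect (C++ is row-major, so the fast and slow loops are mirrored relative to Fortran):

#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int n = 4096;
    std::vector<double> a(static_cast<size_t>(n) * n, 1.0);
    double sum = 0.0;

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i)            // unit stride: cache friendly
        for (int j = 0; j < n; ++j)
            sum += a[static_cast<size_t>(i) * n + j];
    auto t1 = std::chrono::steady_clock::now();
    for (int j = 0; j < n; ++j)            // stride n: roughly one miss per access
        for (int i = 0; i < n; ++i)
            sum += a[static_cast<size_t>(i) * n + j];
    auto t2 = std::chrono::steady_clock::now();

    std::printf("unit stride: %lld ms, stride-n: %lld ms (sum=%f)\n",
        static_cast<long long>(std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()),
        static_cast<long long>(std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count()),
        sum);
    return 0;
}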

Is a "2D fft" the same as two 1D fft's?

I have a CUDA code in which I have implemented several C2C 2D FFTs. They all use the same plan, but for some reason the times on the 2D FFTs are large and vary quite a bit: same-size FFTs take anywhere from 0.4 s to 1.8 s.
This is for a 1920x1080 FFT. Do those times seem reasonable?
Anyhow, I have had good experience with CUDA batched 1D FFTs being fast. Is taking a 1D FFT across the rows, and then again across the columns of a matrix, the same as this 2D FFT - does it give the same result? I have seen 1D FFTs finish in a few hundredths of a second on larger data sets, so I was hoping to improve some of these results.
Thanks
Yes - a 2D FFT is separable: 1D transforms along the rows followed by 1D transforms along the columns give the same result as the 2D transform. A 2D transform of a 1K by 1K image therefore requires 2K 1D transforms (one per row plus one per column), so those times seem reasonable.
For more information have a look at: http://paulbourke.net/miscellaneous/dft/
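If you want to try the row-then-column formulation with cuFFT explicitly, the advanced data layout of cufftPlanMany can describe both passes. A minimal sketch, with a made-up function name and no error checking, for the question's 1920x1080 size (in practice a single cufftPlan2d plan is the simpler route and is usually at least as fast):

#include <cufft.h>

void fft2d_via_1d(cufftComplex* d_data, int nx /*1920 columns*/, int ny /*1080 rows*/) {
    // Pass 1: ny transforms of length nx; rows are contiguous in memory.
    int n_row[1] = {nx};
    cufftHandle rows;
    cufftPlanMany(&rows, 1, n_row,
                  n_row, 1, nx,      // input:  stride 1, consecutive rows nx apart
                  n_row, 1, nx,      // output: same layout (in place)
                  CUFFT_C2C, ny);
    cufftExecC2C(rows, d_data, d_data, CUFFT_FORWARD);
    cufftDestroy(rows);

    // Pass 2: nx transforms of length ny; column elements are nx apart,
    // and consecutive columns start 1 element apart.
    int n_col[1] = {ny};
    cufftHandle cols;
    cufftPlanMany(&cols, 1, n_col,
                  n_col, nx, 1,
                  n_col, nx, 1,
                  CUFFT_C2C, nx);
    cufftExecC2C(cols, d_data, d_data, CUFFT_FORWARD);
    cufftDestroy(cols);
}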