Small matrices in ScaLAPACK: How to deal with empty blocks - fortran

I have a large matrix describing a physical system. The last two rows are fundamentally different from the others, and therefore need to be set up separately. Furthermore, it makes no sense to distribute each of these rows over different processes. I want to set up the two rows on the 0th process and then copy them to the global matrix.
What do I have? - A distributed M x N matrix where the upper (M-2) x N block is already filled.
What do I want to do? - Compute the last 2 x N block on the 0th process and then copy it over with PDGEMR2D.
What is the problem? - I need to call PDGEMR2D on all processes. The to-be-copied matrix (I think it's usually called a) therefore needs to be allocated and have a ScaLAPACK descriptor on all processes. On the 0th process the local matrix is 2 x N; on all other processes it is 0 x N.
How do I deal with the empty submatrices?
Usually, to get the ScaLAPACK descriptors, I would call descinit with the local number of rows as LLD. However, this number needs to be >= 1, while on the processes with the empty matrices it is 0.
(Note that Fortran lets you allocate arrays with 0 elements; this is purely a ScaLAPACK issue.)
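For illustration, here is a minimal sketch of the descriptor setup described above, written in C/C++ rather than Fortran and assuming the usual trailing-underscore binding to the Fortran routine DESCINIT; ctxt, N and the grid coordinates are placeholder names. A commonly used way around the LLD >= 1 requirement is to pass max(1, local_rows) even on processes whose local part is empty:

// C prototype for ScaLAPACK's DESCINIT (all arguments passed by reference;
// the trailing underscore in the symbol name is compiler-convention dependent)
extern "C" void descinit_(int *desc, const int *m, const int *n,
                          const int *mb, const int *nb,
                          const int *irsrc, const int *icsrc,
                          const int *ictxt, const int *lld, int *info);

void make_bottom_rows_descriptor(int ctxt, int N, int myrow, int mycol,
                                 int desc[9])
{
    // Block sizes MB = 2, NB = N with source process (0,0) keep the whole
    // 2 x N matrix on process (0,0); every other process owns an empty part.
    const int m = 2, mb = 2, rsrc = 0, csrc = 0;
    int local_rows = (myrow == 0 && mycol == 0) ? 2 : 0;
    int lld = local_rows > 0 ? local_rows : 1;  // DESCINIT requires LLD >= 1
    int info;
    descinit_(desc, &m, &N, &mb, &N, &rsrc, &csrc, &ctxt, &lld, &info);
}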

Related

Change locally allocated MPI memory in program

I have an M x N array 'A' that is to be distributed over 'np' processors using MPI along the second dimension (i.e. N is the direction that is scattered). Each processor is initially allocated M x N/np memory by fftw_mpi_local_size_2d (I use this function from FFTW's MPI interface because, according to the FFTW3 manual, it is efficient for SIMD).
initialisation:
alloc_local=fftw_mpi_local_size_2d(M,N,MPI_COMM_WORLD,local_n,local_n_offset)
pointer1=fftw_alloc_real(alloc_local)
call c_f_pointer(pointer1, A, [M, local_n])
At this point, each processor has a slab of A of size M x local_n (= N/np).
When doing a Fourier transform A(x,y) -> A(x,ky), y runs vertically downwards in the array A (it is not the MPI-partitioned axis). In Fourier space I have to store (M+2) x local_n elements (for a 1D real array of length M, the Fourier-space result has M+2 elements when using FFTW3's dfftw_execute_dft_r2c).
These Fourier-space operations I can do independently on every processor, in dummy matrices.
There is one operation where I have to apply a Fourier transform in y and a Fourier cosine transform in x consecutively. To parallelise the operations done entirely in Fourier space, I want to gather my y-transformed arrays, each of size (M+2) x local_n, into a larger (M+2) x N array and, after a transpose, scatter them back so that the y direction is the partitioned one, i.e. (N x (M+2)) --scatter--> (N x (M+2)/np). But each processor was initially allocated only M x local_n addresses.
Even if M = N, each processor would then need N x (local_n + 2/np) elements. I could resolve this by increasing the allocated memory for one processor.
I don't want to start out with (N+2, N) and (N+2, local_n) arrays, because that would increase the memory requirement for many arrays, while the gymnastics above has to be done only once per iteration.
No, you cannot easily change the allocated size of a Fortran array (MPI does not play any role here). What you can do is use a different array for the receive buffer, deallocate the array and allocate it with the new size, or allocate it with a large enough size in the first place. Different choices will be appropriate in different situations. Without seeing your code I would go for the first one, but the last one cannot be excluded.
Note that FFTW3 has parallel (1D MPI decomposition, which is what you use) transforms built-in, including multidimensional transforms.

Thread indexing for image rows / columns in CUDA

So I'm working on a CUDA program, and I'm experiencing some issues when it comes to indexing blocks and threads. Basically, I'm trying to implement the Pixel Sort algorithm in CUDA. (With one modification, we are only processing either rows or columns, not both at the same time)
The way I was visualizing it was to simply run N blocks with 1 thread each (for the number of rows, or columns) and have each block process that row / column independently of each other.
So if we want to sort on columns, we launch the kernel like this (there are a couple of extra parameters that are only relevant to our specific processing, so I've left them out for simplicity):
pixel_sort<<<cols, 1>>>(d_image, d_imageSort, rows, cols, ...);
Then in the kernel, I access the block index with
int tid = blockIdx.x;
This allows me to work with one row's / column's data per block, but it has some issues: it runs slower than our serial implementation of the algorithm for smaller images, and it crashes outright when the image size becomes too large.
An alternative thread scheme I was considering would be to map each of the image's pixels to one thread, however I have a couple of questions on this.
If we were to launch N blocks with M threads, representing N cols with M rows, how do we avoid the 512 (or 1024?) thread-per-block limit? Can we just have each thread process multiple pixels in the column in this case? What would the indexing look like in the kernel?
The algorithm basically requires that we work on the entire column, so each thread cannot just do some work on its own pixel; the threads have to communicate, presumably using shared memory. Would it be a valid strategy to have one "master" thread per block that does the actual sorting calculations, with all of the other threads just participating through shared memory?
Other Notes:
Our image data is read in through OpenCV, and has the RGBA values stored in a uchar4 array
If you have a single thread per block, you very quickly run into occupancy issues. If your goal is a full row sort (for columns, you could transpose the image before sending it to the GPU to take advantage of global memory coalescing), the fastest approach that gets a decent result is probably a radix or merge sort on a per-row basis, essentially following the steps in http://mgarland.org/files/papers/nvr-2008-001.pdf. You could assign k blocks of m threads to each row such that k*m >= image width; you would then be launching k*(image height) blocks, and your grid would be of size (k, height, 1).
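To make the indexing for that (k, height, 1) grid concrete, here is a hedged sketch (not the asker's actual kernel; pixel_sort_pass, d_image, width and height are illustrative names):

#include <cuda_runtime.h>

// One image row per blockIdx.y; k blocks of m threads along x cover the row.
__global__ void pixel_sort_pass(uchar4 *d_image, int width, int height)
{
    int row = blockIdx.y;                               // which image row
    int col = blockIdx.x * blockDim.x + threadIdx.x;    // position within the row
    if (row >= height || col >= width) return;          // guard the partial last block

    uchar4 px = d_image[row * width + col];
    // The per-row radix/merge-sort passes from the linked paper would go here,
    // with the block's threads cooperating (e.g. through shared memory).
    d_image[row * width + col] = px;
}

// Launch with k*m >= width and a grid of size (k, height, 1):
// int m = 256, k = (width + m - 1) / m;
// pixel_sort_pass<<<dim3(k, height, 1), m>>>(d_image, width, height);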
As for your specific questions:
You cannot get around the 512/1024 thread-per-block limit, you would have to restructure your algorithm.
A "master" thread would generally be poor design, causing stalls, overhead, and not taking full advantage of the many cores. You may sometimes need to utilize a single thread, say to output/broadcast a result, but mostly you want to avoid it. See the linked article for sample algorithms that mostly avoid this.

about organizing threads in cuda

General question: must the number of threads be equal to the number of elements I want to process? Example: if I have a matrix M[a][b], must I allocate exactly a*b threads, or can I allocate more threads than I need (more than a*b)? After all, a thread that lands on element a*b+1 would access out of range, wouldn't it? Or is the solution to add a condition (only work if the index is in range a*b)?
Specific question: let M[x][y] be a matrix with x rows and y columns, where 1000 <= x <= 300000 and y <= 100. How can I organize the threads so that the scheme works for any input x and y? I want each thread to handle one element of the matrix. CC = 2.1. Thanks!
General answer: It depends on the problem.
In most cases a natural one-to-one mapping of the problem onto the grid of threads is fine to start with, but you want to keep in mind:
Achieving high occupancy.
Maximizing GPU resources usage and memory throughput.
Working with valid data.
Sometimes this may require using a single thread to process many elements, or many threads to process a single element.
For instance, imagine a series of independent operations A, B and C that need to be applied to an array of elements. You could run three different kernels, but it might be a better choice to allocate a grid containing three times more threads than there are elements and distinguish the operations by one of the grid dimensions (or anything else). On the other hand, you might have a problem that benefits from maximizing shared memory usage (e.g. transforming an image): you could use a block of 16 threads to process a 5x5 image window, where each thread calculates some statistics of a 2x2 slice.
The choice is yours; the best advice is not always to go with the obvious. Try different approaches and choose whatever works best.
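To make the bounds-check part of the general question concrete, here is a minimal sketch of a one-thread-per-element mapping with a guard (kernel and parameter names are illustrative, and the element operation is a placeholder):

// Each thread handles one element of an x-by-y (row-major) matrix; threads
// that fall outside the matrix in the edge blocks simply return.
__global__ void process_elements(float *M, int x, int y)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= x || col >= y) return;   // surplus threads do nothing

    M[row * y + col] *= 2.0f;           // placeholder per-element work
}

// Launch with the grid rounded up to cover the matrix. With 16x16 blocks,
// x <= 300000 gives at most 18750 blocks in y, well under the 65535 limit
// of compute capability 2.x:
// dim3 block(16, 16);
// dim3 grid((y + block.x - 1) / block.x, (x + block.y - 1) / block.y);
// process_elements<<<grid, block>>>(d_M, x, y);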

Access a matrix rapidly

I need to access a two-dimensional matrix from C++ code. If the matrix is mat[n][m], I have to access (in a for-loop) these positions:
mat[x][y], mat[x-1][y-m-1], mat[x-1][y], mat[x][y-1]
At the next iteration I have to do:
x=x+1
And then, again:
mat[x][y], mat[x-1][y-m-1], mat[x-1][y], mat[x][y-1]
What could be the best way to have these positions nearest in memory to speedup my code?
If you are iterating horizontally, arrange your matrix as mat[y][x], especially if it is an array of arrays (the layout of the matrix isn't clear from your question).
Since you didn't provide sufficient information, it's hard to say which way is better.
You could try to unroll your loop to get contiguous memory accesses.
For example, read mat[x][y] for 4 consecutive iterations, then mat[x-1][y-m-1] for 4, then mat[x-1][y] for 4, then mat[x][y-1] for 4. After that, process the 4 loaded sets of data in one iteration.
I suspect the bottleneck is not the memory access itself but the calculation of the memory addresses. This access pattern can be written with SIMD loads, so you could cut the address-calculation cost to roughly a quarter.
If you have to process your task sequentially, you could try not to use multidimensional subscripts. For example:
for( x=0; x<n; x++ )
doSomething( mat[x][y] );
could be done with:
for( x=y; x<n*m; x+=m )
doSomething( mat[0][x] );
The second form avoids one lea instruction per access.
If I get this right, you loop through your entire array, although you only mention x = x + 1 as an update (nothing for y). I would then see the array as one-dimensional, with a single counter i going from 0 to the total array length. Then the four values to access in each loop would be
mat[i], mat[i-S-m-1], mat[i-S], mat[i-1]
where S is the stride (row or column length, depending on your representation). This requires fewer address computations, regardless of memory layout. It also needs fewer index checks/updates because there is only one counter i. Plus, S+m+1 is constant, so you can define it as such.
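A minimal sketch of that single-counter scheme, assuming a contiguous row-major n x m matrix stored in one block of memory so that the stride S equals m (function and variable names are illustrative):

#include <cstddef>
#include <vector>

void process(std::vector<double> &mat, int n, int m)
{
    const int S = m;             // row stride of the flattened n x m matrix
    const int off = S + m + 1;   // constant offset to mat[x-1][y-m-1]
    for (std::size_t i = off; i < static_cast<std::size_t>(n) * m; ++i) {
        double a = mat[i];         // mat[x][y]
        double b = mat[i - off];   // mat[x-1][y-m-1]
        double c = mat[i - S];     // mat[x-1][y]
        double d = mat[i - 1];     // mat[x][y-1]
        // ... use a, b, c, d; a real loop may also need to handle the
        // wrap-around at row boundaries that the single counter hides ...
        (void)a; (void)b; (void)c; (void)d;
    }
}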

Calling multiple kernels, global memory performances - CUDA

I have four CUDA kernels working on matrices in the following way:
convolution<<<>>>(A,B);
multiplybyElement1<<<>>>(B);
multiplybyElement2<<<>>>(A);
multiplybyElement3<<<>>>(C);
// A + B + C with CUBLAS' cublasSaxpy
Every kernel (except the first one, the convolution) basically performs an element-wise multiplication of its matrix by a fixed value hardcoded in constant memory (to speed things up).
Should I join these kernels into a single one by calling something like
multiplyBbyX_AbyY_CbyZ<<<>>>(B,A,C)
?
The data should already be in device global memory, so merging probably wouldn't help, but I'm not totally sure.
If I understood correctly, you're asking if you should merge the three "multiplybyElement" kernels into one, where each of those kernels reads an entire (different) matrix, multiplying each element by a constant, and storing the new scaled matrix.
Given that these kernels will be memory bandwidth bound (practically no computation, just one multiply for every element) there is unlikely to be any benefit from merging the kernels unless your matrices are small, in which case you would be making inefficient use of the GPU since the kernels will execute in series (same stream).
If merging the kernels means that you can do only one pass over the memory, then you may see a 3x speedup.
Can you multiply up the fixed values up front and then do a single multiply in a single kernel?
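For illustration, a hedged sketch of what such a merged element-wise pass might look like, assuming the matrices are flat float arrays of n elements each (scaleABC and the constants kA, kB, kC are placeholder names):

__constant__ float kA, kB, kC;   // the fixed per-matrix factors

// One launch scales all three matrices; each matrix is still read and
// written exactly once, so the pass stays memory-bandwidth bound.
__global__ void scaleABC(float *A, float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        A[i] *= kA;
        B[i] *= kB;
        C[i] *= kC;
    }
}

// scaleABC<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);
// (kA, kB, kC would be set beforehand with cudaMemcpyToSymbol.)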