I have an M x N array 'A' that is to be distributed over 'np' processors using MPI along the second dimension (i.e. N is the direction that is scattered). Each processor is initially allocated M x N/np elements by fftw_mpi_local_size_2d (I have used this FFTW MPI function because, according to the FFTW3 manual, its allocations are efficient for SIMD).
initialisation:
alloc_local=fftw_mpi_local_size_2d(M,N,MPI_COMM_WORLD,local_n,local_n_offset)
pointer1=fftw_alloc_real(alloc_local)
call c_f_pointer(pointer1, A, [M, local_n])
At this point, each processor holds a slab of A of size M x local_n, where local_n = N/np.
While doing a Fourier transform A(x,y) -> A(x,ky), y here runs vertically downwards in the array A (it is not the MPI-partitioned axis). In Fourier space I have to store (M+2) x local_n elements (for a 1-D real array of length M, the Fourier-space representation needs M+2 real elements when using FFTW3's dfftw_execute_dft_r2c).
These Fourier-space operations I can do in dummy work arrays on every processor independently.
There is one operation where I have to apply a y Fourier transform and an x Fourier cosine transform consecutively. To parallelise the operations done entirely in Fourier space, I want to gather my y-transformed arrays, each of size (M+2) x local_n, into a bigger (M+2) x N array, transpose it, and scatter it back so that the y direction becomes the partitioned one, i.e. (N x (M+2)) ----scatter---> (N x (M+2)/np). But each processor has been allocated only M x local_n addresses initially.
Even with M = N, each processor would still need N x (local_n + 2/np) elements, i.e. slightly more than it was allocated. I could resolve this by increasing the allocated memory for one processor.
I don't want to start out with (N+2, N) and (N+2, local_n) arrays, because that would increase the memory requirement for a lot of arrays, while the gymnastics above has to be done only once per iteration.
No, you cannot easily change the allocated size of a Fortran array (MPI does not play any role here). What you can do is use a different array for the receive buffer, deallocate the array and allocate it with the new size, or allocate it with a large enough size in the first place. Different choices will be appropriate in different situations. Without seeing your code I would go for the first one, but the last one cannot be excluded.
Note that FFTW3 has parallel (1D MPI decomposition, which is what you use) transforms built-in, including multidimensional transforms.
I have a large matrix describing a physical system. The last two rows are fundamentally different from the others, and therefore need to be set up separately. Furthermore, it makes no sense to distribute each of these rows over different processes. I want to set up the two rows on the 0th process and then copy them to the global matrix.
What do I have? - A distributed M x N matrix where the upper (M-2) x N block is already filled.
What do I want to do? - Calculate the last 2 x N elements on the 0th process and then copy them with PDGEMR2D
What is the problem? - I need to call PDGEMR2D on all processes. The to-be-copied matrix (I think it's usually called a) therefore needs to be allocated and have a scalapack descriptor on all processes. On the 0th process, the local matrix is 2 x N, on all other processes it is 0 x N.
How do I deal with the empty submatrices?
Usually, to get the ScaLAPACK descriptors I would call descinit with the local number of rows as LLD. However, this number needs to be >= 1, but on the processes with the empty matrices it is 0.
(Note that Fortran lets you allocate arrays with 0 elements - this is purely a ScaLAPACK issue.)
I'm trying to use armadillo to do linear regression as in the following function:
void compute_weights()
{
    printf("transpose\n");
    const mat &xt(X.t());   // 64 x n
    printf("inverse\n");
    mat xd;
    printf("mul\n");
    xd = (xt * X);          // 64 x 64
    printf("inv\n");
    xd = xd.i();
    printf("mul2\n");
    xd = xd * xt;           // 64 x n
    printf("mul3\n");
    W = xd * Y;
}
I've split this up so I could see what was going on with the program getting so huge. The matrix X has 64 columns and over 23 million rows. The transpose isn't too bad, but that first multiply causes the memory footprint to completely blow up. Now, as I understand it, if I multiply X.t() * X, each element of the product is the dot product of a row of X.t() with a column of X (i.e. of two columns of X), and the result should be a 64x64 matrix.
Sure, it should take a long time, but why would the memory suddenly blow up to nearly 30 gigabytes?
Then it seems to hang on to that memory, and then when it gets to the second multiply, it's just too much, and the OS kills it for getting so huge.
Is there a way to compute products without so much memory usage? Can that memory be reclaimed? Is there a better way to represent these calculations?
You don't stand a chance of doing this whole multiplication in one shot unless you use a huge workstation. As hbrerkere said, your initial consumption is about 22 GB, so you either need to be ready for that, or find another way.
If you don't have such a workstation, another way is to do the multiplication yourself, and parallelize it. Here's how you do it:
Don't load the whole matrix into memory, but load parts of it.
Load like a million rows of X, and store it somewhere.
Load a million columns of Y
Use std::transform with the binary operator std::multiplies to multiply the parts you loaded (this will utilize your processor's vectorization, and make it fast), and fill in the partial result you calculated.
Load the next part of your matrices, and repeat
This won't be as efficient, but it will work. Another option is to keep using Armadillo but decompose your matrix into smaller matrices whose products yield sub-results (see the sketch after this answer).
Both methods are much slower than the full multiplication for 2 reasons:
The overhead of loading and deleting data from memory
Matrix multiplication is already an O(N^3) problem, and splitting it into blocks does not reduce the total amount of arithmetic; it only adds bookkeeping overhead on top of the loads and stores.
Good luck!
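As a rough sketch of that block idea in Armadillo itself (the block size, the helper name, and treating Y as a single column are assumptions, not from the question): X'X (64 x 64) and X'Y (64 x 1) can be accumulated from row blocks of X, so no large intermediate beyond X itself is ever formed, and the final solve is tiny.

#include <algorithm>
#include <armadillo>
using namespace arma;

// Accumulate X'X and X'Y block by block, then solve the small normal equations.
void compute_weights_blocked(const mat& X, const vec& Y, vec& W)
{
    const uword block = 1000000;                  // rows per block (tunable)
    mat XtX(X.n_cols, X.n_cols, fill::zeros);     // 64 x 64 accumulator
    vec XtY(X.n_cols, fill::zeros);               // 64 x 1 accumulator

    for (uword r0 = 0; r0 < X.n_rows; r0 += block)
    {
        const uword r1 = std::min(r0 + block, X.n_rows) - 1;
        const mat Xb = X.rows(r0, r1);            // one block of rows
        XtX += Xb.t() * Xb;                       // partial 64 x 64 product
        XtY += Xb.t() * Y.subvec(r0, r1);         // partial 64 x 1 product
    }
    W = solve(XtX, XtY);                          // small, cheap solve
}

The same loop works if each block is read from disk instead of sliced from an in-memory X, which is where the real memory saving comes from.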
You can compute the weights using far less memory by using the QR decomposition (you might want to look up 'least squares QR').
Briefly:
Use Householder transformations to (implicitly) find an orthogonal Q so that

Q'*X = R, where R is upper triangular,

and at the same time transform Y:

Q'*Y = y

Then solve

R*W = y for W, using only the top 64 rows of R and y.
If you are willing to overwrite X and Y, then this requires no extra memory; otherwise you will need a copy of X and a copy of Y.
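If you would rather not code the Householder sweep yourself, Armadillo can do the QR-based least squares for you. A minimal sketch, assuming X (n x 64) and Y (n x 1) fit in memory once; note that qr_econ materialises the thin Q, which is the same size as X, so this is not the no-extra-memory in-place variant described above:

#include <armadillo>
using namespace arma;

// Least squares via an economy QR: X = Q*R with Q n x 64 and R 64 x 64.
vec weights_via_qr(const mat& X, const vec& Y)
{
    mat Q, R;
    qr_econ(Q, R, X);                     // thin QR decomposition
    return solve(trimatu(R), Q.t() * Y);  // back-substitution on a 64 x 64 system
}

Alternatively, arma::solve(X, Y) with a non-square X performs a least-squares solve through LAPACK without ever forming X'X.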
I wrote a cellular automaton program that stores data in a matrix (an array of arrays). For a 300*200 matrix I can achieve 60 or more iterations per second using static memory allocation (e.g. std::array).
I would like to produce matrices of different sizes without recompiling the program every time, i.e. the user enters a size and then the simulation for that matrix size begins. However, if I use dynamic memory allocation (e.g. std::vector), the simulation drops to ~2 iterations per second. How can I solve this problem? One option I've resorted to is to pre-allocate a static array larger than what I anticipate the user will select (e.g. 2000*2000), but this seems wasteful and still limits user choice to some degree.
I'm wondering if I can either
a) allocate memory once and then somehow "freeze" it for ordinary static array performance?
b) or perform more efficient operations on the std::vector? For reference, I am only performing matrix[x][y] == 1 and matrix[x][y] = 1 operations on the matrix.
According to this question/answer, there is no difference in performance between std::vector or using pointers.
EDIT:
I've rewritten the matrix, as per UmNyobe's suggestion, to be a single array, accessed via matrix[y*size_x + x]. Using dynamic memory allocation (sized once at launch), I doubled the performance to 5 iterations per second.
As per PaulMcKenzie's comment, I compiled a release build and got the performance I was looking for (60 or more iterations per second). However, this is the foundation for more, so I still wanted to quantify the benefit of one method over the other more thoroughly. I used std::chrono::high_resolution_clock to time each iteration, and found the performance difference between dynamic and static arrays (after switching to the single-array matrix representation) to be within the margin of error (450-600 microseconds per iteration).
The performance during debugging is a slight concern however, so I think I'll keep both, and switch to a static array when debugging.
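(For reference, the timing loop is essentially the following; step() is a hypothetical stand-in for one grid update, not the actual program's function:)

#include <chrono>
#include <cstdio>

void step();  // hypothetical: advance the cellular automaton by one iteration

void time_iterations(int iterations)
{
    using clock = std::chrono::high_resolution_clock;
    for (int i = 0; i < iterations; ++i)
    {
        const auto t0 = clock::now();
        step();
        const auto t1 = clock::now();
        const auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
        std::printf("iteration %d: %lld us\n", i, static_cast<long long>(us));
    }
}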
For reference, I am only performing matrix[x][y] ...

Red flag! Are you using vector<vector<int>> for your matrix representation? This is a mistake, as the rows of your matrix will be far apart in memory. You should use a single vector of size rows x cols and index it as matrix[y * cols + x].
Furthermore, you should index first by row and then by column, i.e. matrix[y][x] rather than matrix[x][y], and your algorithm should traverse the data the same way. The reason is that with matrix[y][x] the elements (x, y) and (x + 1, y) are adjacent in memory, whereas with matrix[x][y] they are a whole row apart.
Even if there is a performance decrease from std::array to std::vector (a std::array can keep its elements on the stack, which is faster), a decent algorithm will perform within the same order of magnitude using either container.
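A minimal sketch of the single-vector layout recommended above (the Grid name and its members are illustrative, not taken from the original code):

#include <cstddef>
#include <cstdint>
#include <vector>

// Row-major grid in one contiguous allocation; element (x, y) lives at y * width + x.
struct Grid
{
    int width;
    int height;
    std::vector<std::uint8_t> cells;

    Grid(int w, int h) : width(w), height(h), cells(static_cast<std::size_t>(w) * h, 0) {}

    std::uint8_t& at(int x, int y)       { return cells[static_cast<std::size_t>(y) * width + x]; }
    std::uint8_t  at(int x, int y) const { return cells[static_cast<std::size_t>(y) * width + x]; }
};

// Usage corresponding to the operations in the question:
//   Grid g(300, 200);
//   g.at(x, y) = 1;
//   if (g.at(x, y) == 1) { ... }

Scanning with x in the inner loop then walks memory linearly, which is the point of the single-vector layout.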
I have a sparse matrix structure that I am using in conjunction with CUBLAS to implement a linear solver class. I anticipate that the dimensions of the sparse matrices I will be solving will be fairly large (on the order of 10^7 by 10^7).
I also anticipate that the solver will need to be used many times, and that a portion of this matrix will need to be updated several times (between computing solutions) as well.
Copying the entire matrix structure from system memory to device memory could become quite a performance bottleneck, since only a fraction of the matrix entries will ever need to be changed at a given time.
What I would like to be able to do is to have a way to update only a particular sub-set / sub-matrix rather than recopy the entire matrix structure from system memory to device memory each time I need to change the matrix.
The matrix data structure would reside on the CUDA device in arrays:
d_col, d_row, and d_val
On the system side I would have corresponding arrays I, J, and val.
So ideally, I would only want to change the subsets of d_val that correspond to the values in the system array, val, that changed.
Note that I do not anticipate that any entries will be added to or removed from the matrix, only that existing entries will change in value.
Naively I would think that to implement this, I would have an integer array or vector on the host side, e.g. updateInds , that would track the indices of entries in val that have changed, but I'm not sure how to efficiently tell the CUDA device to update the corresponding values of d_val.
In essence: how do I change the entries in a CUDA device-side array (d_val) at indices updateInds[1], updateInds[2], ..., updateInds[n] to a new set of values val[updateInds[1]], val[updateInds[2]], ..., val[updateInds[n]], without recopying the entire val array from system memory into the CUDA device memory array d_val?
As long as you only want to change the numerical values of the value array associated with CSR (or CSC, or COO) sparse matrix representation, the process is not complicated.
Suppose I have code like this (excerpted from the CUDA conjugate gradient sample):
checkCudaErrors(cudaMalloc((void **)&d_val, nz*sizeof(float)));
...
cudaMemcpy(d_val, val, nz*sizeof(float), cudaMemcpyHostToDevice);
Now, subsequent to this point in the code, let's suppose I need to change some values in the d_val array, corresponding to changes I have made in val:
for (int i = 10; i < 25; i++)
    val[i] = 4.0f;
The process to move these particular changes is conceptually the same as if you were updating an array using memcpy, but we will use cudaMemcpy to update the d_val array on the device:
cudaMemcpy(d_val+10, val+10, 15*sizeof(float), cudaMemcpyHostToDevice);
Since these values were all contiguous, I can use a single cudaMemcpy call to effect the transfer.
If I have several disjoint regions similar to above, it will require several calls to cudaMemcpy, one for each region. If, by chance, the regions are equally spaced and of equal length:
for (int i = 10; i < 15; i++)
    val[i] = 1.0f;
for (int i = 20; i < 25; i++)
    val[i] = 2.0f;
for (int i = 30; i < 35; i++)
    val[i] = 4.0f;
then it would also be possible to perform this transfer using a single call to cudaMemcpy2D. The method is outlined here.
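For the three equally spaced regions above (elements 10-14, 20-24 and 30-34), such a call could look like this (a sketch reusing the d_val and val variables from the earlier snippets):

cudaMemcpy2D(d_val + 10,           // destination: start of the first region on the device
             10 * sizeof(float),   // destination pitch: distance between region starts
             val + 10,             // source: start of the first region on the host
             10 * sizeof(float),   // source pitch
             5 * sizeof(float),    // width of each region, in bytes
             3,                    // number of regions ("rows")
             cudaMemcpyHostToDevice);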
Notes:
cudaMemcpy2D is slower than you might expect compared to a cudaMemcpy operation on the same number of elements.
CUDA API calls have some inherent overhead. If a large part of the matrix is to be updated in a scattered fashion, it may still be actually quicker to just transfer the whole d_val array, taking advantage of the fact that this can be done using a single cudaMemcpy operation.
The method described here cannot be used if non-zero values change their location in the sparse matrix. In that case, I cannot provide a general answer for how to surgically update a CSR sparse matrix on the device. And certain relatively simple changes could necessitate updating most of the array data (3 vectors) anyway.
I need to access a two-dimensional matrix in C++ code. If the matrix is mat[n][m], I have to access (in a for-loop) these positions:
mat[x][y], mat[x-1][y-m-1], mat[x-1][y], mat[x][y-1]
At the next iteration I have to do:
x=x+1
And then, again:
mat[x][y], mat[x-1][y-m-1], mat[x-1][y], mat[x][y-1]
What could be the best way to have these positions nearest in memory to speedup my code?
If you are iterating horizontally, arrange your matrix as mat[y][x], especially if it is an array of arrays (the layout of the matrix isn't clear in your question).
Since you didn't provide sufficient information, it's hard to say which way is better.
You could try to unroll your loop for contiguous memory access.
For example, read from mat[x][y] 4 times, then mat[x-1][y-m-1] 4 times, then mat[x-1][y] 4 times, then mat[x][y-1] 4 times. After that, you process the loaded 4 sets of data in one iteration.
I bet the bottleneck is not the memory access itself but the calculation of the memory addresses. This access pattern can be written with SIMD loads, so you could cut roughly 3/4 of the address-calculation cost.
If you have to process your task sequentially, you could try not to use multidimensional subscripts. For example:
for( x = 0; x < n; x++ )
    doSomething( mat[x][y] );

could be done with:

for( x = y; x < n*m; x += m )
    doSomething( mat[0][x] );
In the second way you avoid one lea instruction.
If I get this right, you loop through your entire array, although you only mention x = x + 1 as an update (nothing for y). I would then see the array as one-dimensional, with a single counter i going from 0 to the total array length. Then the four values to access in each loop would be
mat[i], mat[i-S-m-1], mat[i-S], mat[i-1]
where S is the stride (the row or column length, depending on your representation). This requires fewer address computations, regardless of memory layout. It also takes fewer index checks/updates, because there is only one counter i. Plus, S+m+1 is constant, so you could define it as such.
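A sketch of that single-index scheme on a row-major flat array (S, m, process(), and the bounds handling are placeholders matching the notation above, not code from the question):

void process(double a, double b, double c, double d);  // hypothetical consumer of the four values

void sweep(const double* mat, int S, int m, int total)
{
    const int off1 = S + m + 1;   // mat[i - S - m - 1]
    const int off2 = S;           // mat[i - S]
    const int off3 = 1;           // mat[i - 1]

    // Start far enough in that every offset stays inside the array.
    for (int i = off1; i < total; ++i)
        process(mat[i], mat[i - off1], mat[i - off2], mat[i - off3]);
}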