The issue I am trying to resolve is the following:
The C++ serial code I have computes across a large 2D matrix. To optimize this process, I wish to split this large 2D matrix and run on 4 nodes (say) using MPI. The only communication that occurs between nodes is the sharing of edge values at the end of each time step. Every node shares the edge array data, A[i][j], with its neighbor.
Based on reading about MPI, I have the following scheme to be implemented.
if (myrank == 0)
{
for (i= 0 to x)
for (y= 0 to y)
{
C++ CODE IMPLEMENTATION
....
MPI_SEND(A[x][0], A[x][1], A[x][2], Destination= 1.....)
MPI_RECEIVE(B[0][0], B[0][1]......Sender = 1.....)
MPI_BARRIER
}
if (myrank == 1)
{
for (i = x+1 to xx)
for (y = 0 to y)
{
C++ CODE IMPLEMENTATION
....
MPI_SEND(B[x][0], B[x][1], B[x][2], Destination= 0.....)
MPI_RECEIVE(A[0][0], A[0][1]......Sender = 1.....)
MPI BARRIER
}
I wanted to know if my approach is correct and also would appreciate any guidance on other MPI functions too look into for implementation.
Thanks,
Ashwin.
Just to amplify Joel's points a bit:
This goes much easier if you allocate your arrays so that they're contiguous (something C's "multidimensional arrays" don't give you automatically:)
int **alloc_2d_int(int rows, int cols) {
int *data = (int *)malloc(rows*cols*sizeof(int));
int **array= (int **)malloc(rows*sizeof(int*));
for (int i=0; i<rows; i++)
array[i] = &(data[cols*i]);
return array;
}
/*...*/
int **A;
/*...*/
A = alloc_2d_init(N,M);
Then, you can do sends and recieves of the entire NxM array with
MPI_Send(&(A[0][0]), N*M, MPI_INT, destination, tag, MPI_COMM_WORLD);
and when you're done, free the memory with
free(A[0]);
free(A);
Also, MPI_Recv is a blocking recieve, and MPI_Send can be a blocking send. One thing that means, as per Joel's point, is that you definately don't need Barriers. Further, it means that if you have a send/recieve pattern as above, you can get yourself into a deadlock situation -- everyone is sending, no one is recieving. Safer is:
if (myrank == 0) {
MPI_Send(&(A[0][0]), N*M, MPI_INT, 1, tagA, MPI_COMM_WORLD);
MPI_Recv(&(B[0][0]), N*M, MPI_INT, 1, tagB, MPI_COMM_WORLD, &status);
} else if (myrank == 1) {
MPI_Recv(&(A[0][0]), N*M, MPI_INT, 0, tagA, MPI_COMM_WORLD, &status);
MPI_Send(&(B[0][0]), N*M, MPI_INT, 0, tagB, MPI_COMM_WORLD);
}
Another, more general, approach is to use MPI_Sendrecv:
int *sendptr, *recvptr;
int neigh = MPI_PROC_NULL;
if (myrank == 0) {
sendptr = &(A[0][0]);
recvptr = &(B[0][0]);
neigh = 1;
} else {
sendptr = &(B[0][0]);
recvptr = &(A[0][0]);
neigh = 0;
}
MPI_Sendrecv(sendptr, N*M, MPI_INT, neigh, tagA, recvptr, N*M, MPI_INT, neigh, tagB, MPI_COMM_WORLD, &status);
or nonblocking sends and/or recieves.
First you don't need that much barrier
Second, you should really send your data as a single block as multiple send/receive blocking their way will result in poor performances.
This question has already been answered quite thoroughly by Jonathan Dursi; however, as Jonathan Leffler has pointed out in his comment to Jonathan Dursi's answer, C's multi-dimensional arrays are a contiguous block of memory. Therefore, I would like to point out that for a not-too-large 2d array, a 2d array could simply be created on the stack:
int A[N][M];
Since, the memory is contiguous, the array can be sent as it is:
MPI_Send(A, N*M, MPI_INT,1, tagA, MPI_COMM_WORLD);
On the receiving side, the array can be received into a 1d array of size N*M (which can then be copied into a 2d array if necessary):
int A_1d[N*M];
MPI_Recv(A_1d, N*M, MPI_INT,0,tagA, MPI_COMM_WORLD,&status);
//copying the array to a 2d-array
int A_2d[N][M];
for (int i = 0; i < N; i++){
for (int j = 0; j < M; j++){
A_2d[i][j] = A_1d[(i*M)+j]
}
}
Copying the array does cause twice the memory to be used, so it would be better to simply use A_1d by accessing its elements through A_1d[(i*M)+j].
Related
I have a task to speed up a program using MPI.
Let's assume I have a large 2d array (1000x1000 or bigger) on the input. I have a working sequential program that divides, so the 2d array into chunks (for example 10x10) and calculates the result which is double for each chuck. (so we have a function which argument is 2d array 10x10 and a result is a double number).
My first idea to speed up:
Create 1d array of size N*N (for example 10x10 = 100) and Send array to another process
double* buffer = new double[dataPortionSize];
//copy some data to buffer
MPI_Send(buffer, dataPortionSize, MPI_DOUBLE, currentProcess, 1, MPI_COMM_WORLD);
Recieve it in another process, calculate result, send back the result
double* buf = new double[dataPortionSize];
MPI_Recv(buf, dataPortionSize, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, status);
double result = function->calc(buf);
MPI_Send(&result, 1, MPI_DOUBLE, 0, 3, MPI_COMM_WORLD);
This program was much slower than the sequential version. It looks like MPI needs a lot of time to pass an array to another process.
My second idea:
Pass the whole 2d input array to all processes
// data is protected field in base class, it is injected during runtime
MPI_Send(&(data[0][0]), dataSize * dataSize, MPI_DOUBLE, currentProcess, 1, MPI_COMM_WORLD);
And receive data like this
double **arrayAlloc( int size ) {
double **result; result = new double [ size ];
for ( int i = 0; i < size; i++ )
result[ i ] = new double[ size ];
return result;
}
double **data = arrayAlloc(dataSize);
MPI_Recv(&data[0][0], dataSize * dataSize, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, status);
Unfortunately, I got a bunch of errors during execution:
Those crashes are pretty random. It happened 2 times that the program ended successfully
My third idea:
Pass memory address to all processes, but I found this:
MPI processes cannot read each others' memory, and virtual addressing makes one process' pointer completely meaningless to another.
Does anyone have an idea how to speed up it? I understand that the key thing for speed to is pass array/arrays to processes in an efficient way, but I don't have an idea how to do this.
You have multiple issues here. I'll try to go through them in some arbitrary order.
As someone else explained, your second attempt fails because MPI expects you to work with a single consecutive array, not an array of pointers. So you want to allocate something like matrix = new double[rows * cols] and then access individual rows as &matrix[row * cols] or an individual value as matrix[row * cols + col]
This would be a data structure that you can send, receive, scatter, and gather with MPI. It would also be faster in general.
You are correct to assume that MPI takes time to transfer data. Even best case it is the cost of a memcpy. Usually significantly more. If your program is doing too little work before transferring data, it will not be faster.
Your first attempt may have failed because the first process doesn't do anything useful while waiting for the result. You didn't include the receive operation in your code sample. However, if you wrote something like this:
for(int block = 0; block < nblocks; ++block) {
generate_data(buf);
MPI_Send(buf, ...);
MPI_Recv(buf, ...);
}
Then you cannot expect a speedup because the process is not doing anything useful while waiting for the result. You can avoid this with double buffering. Let the first process generate the next data block before waiting in the receive operation for the result. Something like this:
generate_data(0, input); /* 0-th block */
MPI_Send(input, ...);
for(int block = 1; block < nblocks; ++block) {
generate_data(block, input); /* 1st up to nth block */
MPI_Recv(output, ...); /* 0-th up to n-1-th block */
MPI_Send(input, ...);
}
MPI_Recv(output, ...); /* n-th block */
Now calculations in both processes can overlap.
You shouldn't use MPI_Send and MPI_Recv to begin with! MPI is designed for collective operations like MPI_Scatter and MPI_Gather. What you should do, is generate N blocks for N processes, MPI_Scatter them across all processes. Then let each process compute their result. Then MPI_Gather them back at the root process.
Even better, let every process work independently, if possible. Of course this depends on your data but if you can generate and process data blocks independently from one another, don't do any communication. Just let them all work alone. Something like this:
int rank, worldsize;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &worldsize);
for(int block = rank; block < nblocks; block += worldsize) {
process_data(block);
}
I have an array of index which I want each worker do something based on these indexes.
the size of the array might be more than the total number of ranks, so my first question is if there is another way except master-worker load balancing here? I want to have a balances system and also I want to assign each index to each ranks.
I was thinking about master-worker, and in this approach master rank (0) is giving each index to other ranks. but when I was running my code with 3 rank and 15 index my code is halting in while loop for sending the index 4. I was wondering If anybody can help me to find the problem
if(pCurrentID == 0) { // Master
MPI_Status status;
int nindices = 15;
int mesg[1] = {0};
int initial_id = 0;
int recv_mesg[1] = {0};
// -- send out initial ids to workers --//
while (initial_id < size - 1) {
if (initial_id < nindices) {
MPI_Send(mesg, 1, MPI_INT, initial_id + 1, 1, MPI_COMM_WORLD);
mesg[0] += 1;
++initial_id;
}
}
//-- hand out id to workers dynamically --//
while (mesg[0] != nindices) {
MPI_Probe(MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, &status);
int isource = status.MPI_SOURCE;
MPI_Recv(recv_mesg, 1, MPI_INT, isource, 1, MPI_COMM_WORLD, &status);
MPI_Send(mesg, 1, MPI_INT, isource, 1, MPI_COMM_WORLD);
mesg[0] += 1;
}
//-- hand out ending signals once done --//
for (int rank = 1; rank < size; ++rank) {
mesg[0] = -1;
MPI_Send(mesg, 1, MPI_INT, rank, 0, MPI_COMM_WORLD);
}
} else {
MPI_Status status;
int id[1] = {0};
// Get the surrounding fragment id
MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
int itag = status.MPI_TAG;
MPI_Recv(id, 1, MPI_INT, 0, itag, MPI_COMM_WORLD, &status);
int jfrag = id[0];
if (jfrag < 0) break;
// do something
MPI_Send(id, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
}
I have an array of index which I want each worker do something based
on these indexes. the size of the array might be more than the total
number of ranks, so my first question is if there is another way
except master-worker load balancing here? I want to have a balances
system and also I want to assign each index to each ranks.
No, but if the work performed per array index takes roughly the same amount of time, you can simply scatter the array among the processes.
I was thinking about master-worker, and in this approach master rank
(0) is giving each index to other ranks. but when I was running my
code with 3 rank and 15 index my code is halting in while loop for
sending the index 4. I was wondering If anybody can help me to find
the problem
As already pointed out in the comments, the problem is that you are missing (in the workers side) the loop of querying the master for work.
The load-balancer can be implemented as follows:
The master initial sends an iteration to the other workers;
Each worker waits for a message from the master;
Afterwards the master calls MPI_Recv from MPI_ANY_SOURCE and waits for another worker to request work;
After the worker finished working on its first iteration it sends its rank to the master, signaling the master to send a new iteration;
The master reads the rank sent by the worker in step 4., checks the array for a new index, and if there is still a valid index, send it to the worker. Otherwise, sends a special message signaling the worker that there is no more work to be performed. That message can be for instance -1;
When the worker receive the special message it stops working;
The master stops working when all the workers have receive the special message.
An example of such approach:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
int main(int argc,char *argv[]){
MPI_Init(NULL,NULL); // Initialize the MPI environment
int rank;
int size;
MPI_Status status;
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
MPI_Comm_size(MPI_COMM_WORLD,&size);
int work_is_done = -1;
if(rank == 0){
int max_index = 10;
int index_simulator = 0;
// Send statically the first iterations
for(int i = 1; i < size; i++){
MPI_Send(&index_simulator, 1, MPI_INT, i, i, MPI_COMM_WORLD);
index_simulator++;
}
int processes_finishing_work = 0;
do{
int process_that_wants_work = 0;
MPI_Recv(&process_that_wants_work, 1, MPI_INT, MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, &status);
if(index_simulator < max_index){
MPI_Send(&index_simulator, 1, MPI_INT, process_that_wants_work, 1, MPI_COMM_WORLD);
index_simulator++;
}
else{ // send special message
MPI_Send(&work_is_done, 1, MPI_INT, process_that_wants_work, 1, MPI_COMM_WORLD);
processes_finishing_work++;
}
} while(processes_finishing_work < size - 1);
}
else{
int index_to_work = 0;
MPI_Recv(&index_to_work, 1, MPI_INT, 0, rank, MPI_COMM_WORLD, &status);
// Work with the iterations index_to_work
do{
MPI_Send(&rank, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
MPI_Recv(&index_to_work, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
if(index_to_work != work_is_done)
// Work with the iterations index_to_work
}while(index_to_work != work_is_done);
}
printf("Process {%d} -> I AM OUT\n", rank);
MPI_Finalize();
return 0;
}
You can improve upon the aforementioned approach by reducing: 1) the number of messages sent and 2) the time waiting for them. For the former you can try to use a chunking strategy (i.e., sending more than one index per MPI communication). For the latter you can try to play around with nonblocking MPI communications or have two threads per process one to receive/send the work another to actually perform the work. This multithreading approach would also allow the master process to actually work with the array indices, but it significantly complicates the approach.
I am trying to fake shared memory currently while using the MPI library in C++. I have an array A of size n+1, where n is given from the user, and have processor 0 generate the integers for that array. I need to share the array that processor 0 created with all the other processes. So as a result I Bcast it to the others... However when I go to have each processor print out their array I get a signal 11 (Segmentation Fault) from a processor that isn't zero. If I comment that section out it runs with no problems. I would like to be able to see that my array was sent and stored correctly in all of the processors.
int *A=new int[n+1];
if(my_rank==0)
{
srand(1251);
A[0]=0;
for(int i=1; i<=n; i++){
A[i]=rand()%100;}
MPI_Bcast(&A, n+1, MPI_INT, 0, MPI_COMM_WORLD);
}
else {
MPI_Bcast(&A, n+1, MPI_INT, 0, MPI_COMM_WORLD);
}
cout<<"My rank is "<<my_rank<<" and this is my array:"<<endl;
for (int i=0; i<=n; i++)
{cout<<A[i]<<" "<<endl;}
cout<<endl;
You are incorrectly passing &A as address to MPI_Bcast. This is the address of the pointer, MPI needs the address of the data i.e. A.
MPI_Bcast(A, n+1, MPI_INT, 0, MPI_COMM_WORLD);
Move that code outside of the if/else block. It is the same call for all ranks.
I am learning MPI, and trying to create examples of some of the functions. I've gotten several to work, but I am having issues with MPI_Gather. I had a much more complex fitting test, but I trimmed it down to the most simple code. I am still, however, getting the following error:
root#master:/home/sgeadmin# mpirun ./expfitTest5
Assertion failed in file src/mpid/ch3/src/ch3u_request.c at line 584: FALSE
memcpy argument memory ranges overlap, dst_=0x1187e30 src_=0x1187e40 len_=400
internal ABORT - process 0
I am running one master instance and two node instances through AWS EC2. I have all the appropriate libraries installed, as I've gotten other MPI examples to work. My program is:
int main()
{
int world_size, world_rank;
int nFits = 100;
double arrCount[100];
double *rBuf = NULL;
MPI_Init(NULL,NULL);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
assert(world_size!=1);
int nElements = nFits/(world_size-1);
if(world_rank>0){
for(int k = 0; k < nElements; k++)
{
arrCount[k] = k;
}}
MPI_Barrier(MPI_COMM_WORLD);
if(world_rank==0)
{
rBuf = (double*) malloc( nFits*sizeof(double));
}
MPI_Gather(arrCount, nElements, MPI_DOUBLE, rBuf, nElements, MPI_DOUBLE, 0, MPI_COMM_WORLD);
if(world_rank==0){
for(int i = 0; i < nFits; i++)
{
cout<<rBuf[i]<<"\n";
}}
MPI_Finalize();
exit(0);
}
Is there something I am not understanding in malloc or MPI_Gather? I've compared my code to other samples, and can't find any differences.
The root process in a gather operation does participate in the operation. I.e. it sends data to it's own receive buffer. That also means you must allocate memory for it's part in the receive buffer.
Now you could use MPI_Gatherv and specify a recvcounts[0]/sendcount at root of 0 to follow your example closely. But usually you would prefer to write an MPI application in a way that the root participates equally in the operation, i.e. int nElements = nFits/world_size.
I have a Finite Element code that uses blocking receives and non-blocking sends. Each element has 3 incoming faces and 3 outgoing faces. The mesh is split up among many processors, so sometimes the boundary conditions come from the elements processor, or from neighboring processors. Relevant parts of the code are:
std::vector<task>::iterator it = All_Tasks.begin();
std::vector<task>::iterator it_end = All_Tasks.end();
int task = 0;
for (; it != it_end; it++, task++)
{
for (int f = 0; f < 3; f++)
{
// Get the neighbors for each incoming face
Neighbor neighbor = subdomain.CellSets[(*it).cellset_id_loc].neighbors[incoming[f]];
// Get buffers from boundary conditions or neighbor processors
if (neighbor.processor == rank)
{
subdomain.Set_buffer_from_bc(incoming[f]);
}
else
{
// Get the flag from the corresponding send
target = GetTarget((*it).angle_id, (*it).group_id, (*it).cell_id);
if (incoming[f] == x)
{
int size = cells_y*cells_z*groups*angles*4;
MPI_Status status;
MPI_Recv(&subdomain.X_buffer[0], size, MPI_DOUBLE, neighbor.processor, target, MPI_COMM_WORLD, &status);
}
if (incoming[f] == y)
{
int size = cells_x*cells_z*groups*angles * 4;
MPI_Status status;
MPI_Recv(&subdomain.Y_buffer[0], size, MPI_DOUBLE, neighbor.processor, target, MPI_COMM_WORLD, &status);
}
if (incoming[f] == z)
{
int size = cells_x*cells_y*groups*angles * 4;
MPI_Status status;
MPI_Recv(&subdomain.Z_buffer[0], size, MPI_DOUBLE, neighbor.processor, target, MPI_COMM_WORLD, &status);
}
}
}
... computation ...
for (int f = 0; f < 3; f++)
{
// Get the outgoing neighbors for each face
Neighbor neighbor = subdomain.CellSets[(*it).cellset_id_loc].neighbors[outgoing[f]];
if (neighbor.IsOnBoundary)
{
// store the buffer into the boundary information
}
else
{
target = GetTarget((*it).angle_id, (*it).group_id, neighbor.cell_id);
if (outgoing[f] == x)
{
int size = cells_y*cells_z*groups*angles * 4;
MPI_Request request;
MPI_Isend(&subdomain.X_buffer[0], size, MPI_DOUBLE, neighbor.processor, target, MPI_COMM_WORLD, &request);
}
if (outgoing[f] == y)
{
int size = cells_x*cells_z*groups*angles * 4;
MPI_Request request;
MPI_Isend(&subdomain.Y_buffer[0], size, MPI_DOUBLE, neighbor.processor, target, MPI_COMM_WORLD, &request);
}
if (outgoing[f] == z)
{
int size = cells_x*cells_y*groups*angles * 4;
MPI_Request request;
MPI_Isend(&subdomain.Z_buffer[0], size, MPI_DOUBLE, neighbor.processor, target, MPI_COMM_WORLD, &request);
}
}
}
}
A processor can do a lot of tasks before it needs information from other processors. I need a non-blocking send so that the code can keep working, but I'm pretty sure the receives are overwriting the send buffers before they get sent.
I've tried timing this code, and it's taking 5-6 seconds for the call to MPI_Recv, even though the message it's trying to receive has been sent. My theory is that the Isend is starting, but not actually sending anything until the Recv is called. The message itself is on the order of 1 MB. I've looked at benchmarks and messages of this size should take a very small fraction of a second to send.
My question is, in this code, is the buffer that was sent being overwritten, or just the local copy? Is there a way to 'add' to a buffer when I'm sending, rather than writing to the same memory location? I want the Isend to write to a different buffer every time it's called so the information isn't being overwritten while the messages wait to be received.
** EDIT **
A related question that might fix my problem: Can MPI_Test or MPI_Wait give information about an MPI_Isend writing to a buffer, i.e. return true if the Isend has written to the buffer, but that buffer has yet to be received?
** EDIT 2 **
I have added more information about my problem.
So it looks like I just have to bite the bullet and allocate enough memory in the send buffers to accommodate all the messages, and then just send portions of the buffer when I send.