MPI_Isend/Recv- Is there a deadlock? - c++

I have a total of 8 messages being passed on 4 nodes using MPI. I noticed that there were two messages whose arrays did not provide meaningful results. I have copied an excerpt of the code below? These are some related questions I had based on the code/results below:
Does the MPI_Isend also require a wait? I am not sure if there is a deadlock. I also tried just passing these two variables from one node to the other, and the array values were still NULL.
Will MPI_SendRecv improve the efficiency of the code as suggested here Non Blocking communication in MPI and MPI Wait Issue. Not all information is passed correctly? If so, how/why? Would also appreciate some pointers on setting that up.
Thanks!
Source Code:
if ((my_rank) == 0)
{
MPI_Irecv(A, Rows, MPI_DOUBLE, my_rank+1, MPI_ANY_TAG, MPI_COMM_WORLD, &request[6]);
MPI_Wait(&request[6], &status[6]);
}
if ((my_rank) == 1)
{
MPI_Isend(AA, Rows, MPI_DOUBLE, my_rank-1, 0, MPI_COMM_WORLD, &request[6]);
}
if ((my_rank) == 2)
{
MPI_Isend(B, Rows, MPI_DOUBLE, my_rank+1, 0, MPI_COMM_WORLD, &request[7]);
}
if ((my_rank) == 3)
{
MPI_Irecv(BB, Rows, MPI_DOUBLE, my_rank-1, MPI_ANY_TAG, MPI_COMM_WORLD, &request[7]);
MPI_Wait(&request[7], &status[7]);
}

Yes, All non-blocking calls (MPI_Isend, MPI_Irecv etc) require a matching MPI_Wait. The call is not guaranteed to complete until MPI_Wait is called. You should not change the contents of the buffer until after MPI_Wait returns.
https://computing.llnl.gov/tutorials/mpi/
To use SendRecv, same task has to send a message and wait to receive a message. That pattern doesnt hold true for your code.

Related

Dynamic load balancing master-worker

I have an array of index which I want each worker do something based on these indexes.
the size of the array might be more than the total number of ranks, so my first question is if there is another way except master-worker load balancing here? I want to have a balances system and also I want to assign each index to each ranks.
I was thinking about master-worker, and in this approach master rank (0) is giving each index to other ranks. but when I was running my code with 3 rank and 15 index my code is halting in while loop for sending the index 4. I was wondering If anybody can help me to find the problem
if(pCurrentID == 0) { // Master
MPI_Status status;
int nindices = 15;
int mesg[1] = {0};
int initial_id = 0;
int recv_mesg[1] = {0};
// -- send out initial ids to workers --//
while (initial_id < size - 1) {
if (initial_id < nindices) {
MPI_Send(mesg, 1, MPI_INT, initial_id + 1, 1, MPI_COMM_WORLD);
mesg[0] += 1;
++initial_id;
}
}
//-- hand out id to workers dynamically --//
while (mesg[0] != nindices) {
MPI_Probe(MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, &status);
int isource = status.MPI_SOURCE;
MPI_Recv(recv_mesg, 1, MPI_INT, isource, 1, MPI_COMM_WORLD, &status);
MPI_Send(mesg, 1, MPI_INT, isource, 1, MPI_COMM_WORLD);
mesg[0] += 1;
}
//-- hand out ending signals once done --//
for (int rank = 1; rank < size; ++rank) {
mesg[0] = -1;
MPI_Send(mesg, 1, MPI_INT, rank, 0, MPI_COMM_WORLD);
}
} else {
MPI_Status status;
int id[1] = {0};
// Get the surrounding fragment id
MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
int itag = status.MPI_TAG;
MPI_Recv(id, 1, MPI_INT, 0, itag, MPI_COMM_WORLD, &status);
int jfrag = id[0];
if (jfrag < 0) break;
// do something
MPI_Send(id, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
}
I have an array of index which I want each worker do something based
on these indexes. the size of the array might be more than the total
number of ranks, so my first question is if there is another way
except master-worker load balancing here? I want to have a balances
system and also I want to assign each index to each ranks.
No, but if the work performed per array index takes roughly the same amount of time, you can simply scatter the array among the processes.
I was thinking about master-worker, and in this approach master rank
(0) is giving each index to other ranks. but when I was running my
code with 3 rank and 15 index my code is halting in while loop for
sending the index 4. I was wondering If anybody can help me to find
the problem
As already pointed out in the comments, the problem is that you are missing (in the workers side) the loop of querying the master for work.
The load-balancer can be implemented as follows:
The master initial sends an iteration to the other workers;
Each worker waits for a message from the master;
Afterwards the master calls MPI_Recv from MPI_ANY_SOURCE and waits for another worker to request work;
After the worker finished working on its first iteration it sends its rank to the master, signaling the master to send a new iteration;
The master reads the rank sent by the worker in step 4., checks the array for a new index, and if there is still a valid index, send it to the worker. Otherwise, sends a special message signaling the worker that there is no more work to be performed. That message can be for instance -1;
When the worker receive the special message it stops working;
The master stops working when all the workers have receive the special message.
An example of such approach:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
int main(int argc,char *argv[]){
MPI_Init(NULL,NULL); // Initialize the MPI environment
int rank;
int size;
MPI_Status status;
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
MPI_Comm_size(MPI_COMM_WORLD,&size);
int work_is_done = -1;
if(rank == 0){
int max_index = 10;
int index_simulator = 0;
// Send statically the first iterations
for(int i = 1; i < size; i++){
MPI_Send(&index_simulator, 1, MPI_INT, i, i, MPI_COMM_WORLD);
index_simulator++;
}
int processes_finishing_work = 0;
do{
int process_that_wants_work = 0;
MPI_Recv(&process_that_wants_work, 1, MPI_INT, MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, &status);
if(index_simulator < max_index){
MPI_Send(&index_simulator, 1, MPI_INT, process_that_wants_work, 1, MPI_COMM_WORLD);
index_simulator++;
}
else{ // send special message
MPI_Send(&work_is_done, 1, MPI_INT, process_that_wants_work, 1, MPI_COMM_WORLD);
processes_finishing_work++;
}
} while(processes_finishing_work < size - 1);
}
else{
int index_to_work = 0;
MPI_Recv(&index_to_work, 1, MPI_INT, 0, rank, MPI_COMM_WORLD, &status);
// Work with the iterations index_to_work
do{
MPI_Send(&rank, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
MPI_Recv(&index_to_work, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
if(index_to_work != work_is_done)
// Work with the iterations index_to_work
}while(index_to_work != work_is_done);
}
printf("Process {%d} -> I AM OUT\n", rank);
MPI_Finalize();
return 0;
}
You can improve upon the aforementioned approach by reducing: 1) the number of messages sent and 2) the time waiting for them. For the former you can try to use a chunking strategy (i.e., sending more than one index per MPI communication). For the latter you can try to play around with nonblocking MPI communications or have two threads per process one to receive/send the work another to actually perform the work. This multithreading approach would also allow the master process to actually work with the array indices, but it significantly complicates the approach.

Allocation of pointers in MPI collective communications

I wonder how MPI collective communications such as Bcast, Scatter, Gather etc. behave when the send buffer is allocated in root but it is not allocated in the other ranks.
For example:
rowptr = (int*)malloc(sizeof(int) * (row_count + 1));
MPI_Scatterv(all_rows, rowCounts, rowDispls, MPI_INT,
rowptr, row_count, MPI_INT, MASTER, MPI_COMM_WORLD);
Where all_rows is only allocated in MASTER (rank == 0) process. What is the behavior of MPI under this situation.
Or in the following case;
MPI_Scatter(eCounts, 1, MPI_INT, &elm_count, 1, MASTER, MPI_COMM_WORLD);
where eCounts is int[] and elm_count is int, but eCount is allocated only in MASTER.
Should I also allocate send buffers even if they are not used in other ranks?
From the MPI 3.1 standard (chapter 5.6 page 160)
The send buffer is ignored for all non-root processes.
[...]
All arguments to the function are significant on process root, while on other processes, only arguments recvbuf, recvcount, recvtype, root, and comm are significant.
Same story for MPI_Gather() but replace recv* with send*.
All arguments are significant in the case of MPI_Bcast() (the buffer is a send buffer on the root rank, and a receive buffer on the other ranks).

MPI_send MPI_recv failed while increase the arry size

I am trying to write a 3D parallel computing Poisson solver using OpenMPI ver 1.6.4.
The following parts are my code for parallel computing using blocking send receive.
The following variable is declared in another file.
int px = lx*meshx; //which is meshing point in x axis.
int py = ly*meshy;
int pz = lz*meshz;
int L = px * py * pz
The following code works well while
lx=ly=lz=10;
meshx=meshy=2, meshz=any int number.
The send recv parts failed while meshx and meshy are larger than 4.
The program hanging there waiting for sending or receiving data.
But it works if I only send data from one processor to another, not exchange the data.
(ie : send from rank 0 to 1, but dont send from 1 to 0)
I can't understand how this code works while meshx and meshy is small but failed while mesh number x y in large.
Does blocking send receive process will interrupt itself or I confuse the processor in my code?Does it matter with my array size?
#include "MPI-practice.h"
# include <iostream>
# include <math.h>
# include <string.h>
# include <time.h>
# include <sstream>
# include <string>
# include "mpi.h"
using namespace std;
extern int px,py,pz;
extern int L;
extern double simTOL_phi;
extern vector<double> phi;
int main(int argc, char *argv[]){
int numtasks, taskid, offset_A, offset_B, DD_loop,s,e;
double errPhi(0),errPhi_sum(0);
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
MPI_Status status;
if((pz-1)%numtasks!=0){
//cerr << "can not properly divide meshing points."<<endl;
exit(0);
}
offset_A=(pz-1)/numtasks*px*py;
offset_B=((pz-1)/numtasks+1)*px*py;
s=offset_A*taskid;
e=offset_A*taskid+offset_B;
int pz_offset_A=(pz-1)/numtasks;
int pz_offset_B=(pz-1)/numtasks+1;
stringstream name1;
string name2;
Setup_structure();
Initialize();
Build_structure();
if (taskid==0){
//master processor
ofstream output;
output.open("time", fstream::out | fstream::app);
output.precision(6);
clock_t start,end;
start=clock();
do{
errPhi_sum=0;
errPhi=Poisson_inner(taskid,numtasks,pz_offset_A,pz_offset_B);
//Right exchange
MPI_Send(&phi[e-px*py], px*py, MPI_DOUBLE, taskid+1, 1, MPI_COMM_WORLD);
MPI_Recv(&phi[e], px*py, MPI_DOUBLE, taskid+1, 1, MPI_COMM_WORLD, &status);
MPI_Allreduce ( &errPhi, &errPhi_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD );
}while(errPhi_sum>simTOL_phi);
end=clock();
output << "task "<< 0 <<" = "<< (end-start)/CLOCKS_PER_SEC <<endl<<endl;
Print_to_file("0.txt");
//recv from slave
for (int i=1;i<numtasks;i++){
MPI_Recv(&phi[offset_A*i], offset_B, MPI_DOUBLE, i, 1, MPI_COMM_WORLD, &status);
}
Print_to_file("sum.txt");
}
else{
//slave processor
do{
errPhi=Poisson_inner(taskid,numtasks,pz_offset_A,pz_offset_B);
//Left exchange
MPI_Send(&phi[s+px*py], px*py, MPI_DOUBLE, taskid-1, 1, MPI_COMM_WORLD);
MPI_Recv(&phi[s], px*py, MPI_DOUBLE, taskid-1, 1, MPI_COMM_WORLD, &status);
//Right exchange
if(taskid!=numtasks-1){
MPI_Send(&phi[e-px*py], px*py, MPI_DOUBLE, taskid+1, 1, MPI_COMM_WORLD);
MPI_Recv(&phi[e], px*py, MPI_DOUBLE, taskid+1, 1, MPI_COMM_WORLD, &status);
}
MPI_Allreduce ( &errPhi, &errPhi_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD );
}while(errPhi_sum>simTOL_phi);
//send back master
MPI_Send(&phi[s], offset_B, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
name1<<taskid<<".txt";
name2=name1.str();
Print_to_file(name2.c_str());
}
MPI_Finalize();
}
Replace all coupled MPI_Send/MPI_Recv calls with a calls to MPI_Sendrecv. For example, this
MPI_Send(&phi[e-px*py], px*py, MPI_DOUBLE, taskid+1, 1, MPI_COMM_WORLD);
MPI_Recv(&phi[e], px*py, MPI_DOUBLE, taskid+1, 1, MPI_COMM_WORLD, &status);
becomes
MPI_Sendrecv(&phi[e-px*py], px*py, MPI_DOUBLE, taskid+1, 1,
&phi[e], px*px, MPI_DOUBLE, taskid+1, 1,
MPI_COMM_WORLD, &status);
MPI_Sendrecv uses non-blocking operations internally and thus it does not deadlock, even if two ranks are sending to each other at the same time. The only requirement (as usual) is that each send is matched by a receive.
The problem is in your inner most loop. Both tasks do a blocking send at the same time, which then hangs. It doesn't hang with smaller data sets, as the MPI library has a big enough buffer to hold the data. But once you increase that beyond the buffer size, the send blocks both processes. Since neither process are trying to receive, neither buffer can empty and the program deadlocks.
To fix it, have the slave first receive from the master, then send data back. If your send/receive don't conflict, you can switch the order of the functions. Otherwise you need to create a temporary buffer to hold it.

mpi hello world not working

I write simple hello-world program on Visual c++ 2010 express with MPI library and cant understand, why my code not working.
MPI_Init( NULL, NULL );
MPI_Comm_size(MPI_COMM_WORLD,&size);
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
int a, b = 5;
MPI_Status st;
MPI_Send( &b, 1, MPI_INT, 0,0, MPI_COMM_WORLD );
MPI_Recv( &a, 1, MPI_INT, 0,0, MPI_COMM_WORLD, &st );
MPI_Send tells me "DEADLOCK: attempting to send a message to the local process without a prior matching receive". If i write Recv first, program stucks there (no data, blocking receive).
What i`m doint wrong?
My studio is visual c++ 2010 express. MPI from HPC SDK 2008 (32 bit).
You need something like this:
assert(size >= 2);
if (rank == 0)
MPI_Send( &b, 1, MPI_INT, 1,0, MPI_COMM_WORLD );
if (rank == 1)
MPI_Recv( &a, 1, MPI_INT, 0,0, MPI_COMM_WORLD, &st );
The idea of MPI is that the whole system operates in lockstep. And sometimes you do need to be aware of which participant you are in the "world." In this case, assuming you have two members (as per my assert), you need to make one of them send and the other receive.
Note also that I changed the "dest" parameter of the send, because 0 needs to send to 1 therefore 1 needs to receive from 0.
You can later do it the other way around if you wish (if each needs to tell the other something), but in such a case you may find even more efficient ways to do it using "collective operations" where you can exchange (both send and receive) with all the peers.
In your example code, you're sending to and receiving from rank 0. If you are only running your MPI program with 1 process (which makes no sense, but we'll accept it for the sake of argument), you could make this work by using non-blocking calls instead of the blocking version. It would change your program to look like this:
MPI_Init( NULL, NULL );
MPI_Comm_size(MPI_COMM_WORLD,&size);
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
int a, b = 5;
MPI_Status st[2];
MPI_Request request[2];
MPI_Isend( &b, 1, MPI_INT, 0,0, MPI_COMM_WORLD, &request[0] );
MPI_Irecv( &a, 1, MPI_INT, 0,0, MPI_COMM_WORLD, &request[1] );
MPI_Waitall( request, st );
That would let both the send and the receive complete at the same time. The reason your MPI version doesn't like your original code (which is very nice of it to tell you such a thing) is because the call to MPI_SEND could block until the matching MPI_RECV is done, which in this case wouldn't occur because it would only get called after the MPI_SEND is over, which is a circular dependency.
In MPI, when you add an 'I' before an MPI call, it means "Immediate", as in, the call will return immediately and complete all the work later, when you call MPI_WAIT (or some version of it, like MPI_WAITALL in this example). So what we did here was to make the send and receive return immediately, basically just telling MPI that we intend to do a send and receive with rank 0 at some point in the future, then later (the next line), we tell MPI to go ahead and finish those calls now.
The benefit of using the immediate version of these calls is that theoretically, MPI can do some things in the background to let the send and receive calls make progress while your application is doing something else that doesn't rely on the result of that data. Then, when you finish the call to MPI_WAIT* later, the data is available and you can do whatever you need to do.

Non Blocking communication in MPI and MPI Wait Issue. Not all information is passed correctly

I noticed that not all my MPI_Isend/MPI_IRecv were being executed. I think it may perhaps be either the order in which I do my send and receive or the fact that the code doesn't wait until all the commands are executed. I have copied the excerpt from the code below. Could you suggest as to what I could be doing incorrectly?
Thanks!
MPI_Status status[8];
MPI_Request request[8];
....
....
if ((my_rank) == 0)
{
MPI_Isend(eastedge0, Rows, MPI_DOUBLE, my_rank+1, 0, MPI_COMM_WORLD, &request[0]);
MPI_Irecv(westofwestedge0, Rows, MPI_DOUBLE, my_rank+1, MPI_ANY_TAG, MPI_COMM_WORLD, &request[6]);
MPI_Wait(&request[6], &status[6]);
}
if ((my_rank) == 1)
{
MPI_Irecv(eastofeastedge1, Rows, MPI_DOUBLE, my_rank-1, MPI_ANY_TAG, MPI_COMM_WORLD, &request[0]);
MPI_Wait(&request[0], &status[0]);
MPI_Isend(westedge1, Rows, MPI_DOUBLE, my_rank-1, 0, MPI_COMM_WORLD, &request[6]);
}
Either rank 0 or 1 could still be sending data after this block of code has been executed (as you don't wait on the send request object). This could cause problems if you modify the data before it has finished sending.
For this particular example, perhaps MPI_Sendrecv would be useful?
For every call to a non-blocking MPI call, there has to be a corresponding wait. You are missing one wait per process.