I am new to MPI. I want to send three ints to each of three slave nodes, have each slave create dynamic arrays, and send each array back to the master. Following this post, I modified the code, and it's close to the right answer. But I hit a breakpoint when receiving the array from slave #3 (m == 3) in the receiver code. Thank you in advance!
My code is as follows:
#include <mpi.h>
#include <iostream>
#include <stdlib.h>
int main(int argc, char** argv)
{
int firstBreakPt, lateralBreakPt;
//int reMatNum1, reMatNum2;
int tmpN;
int breakPt[3][2]={{3,5},{6,9},{4,7}};
int myid, numprocs;
MPI_Status status;
// double *reMat1;
// double *reMat2;
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
tmpN = 15;
if (myid==0)
{
// send three parameters to slaves;
for (int i=1;i<numprocs;i++)
{
MPI_Send(&tmpN,1,MPI_INT,i,0,MPI_COMM_WORLD);
firstBreakPt = breakPt[i-1][0];
lateralBreakPt = breakPt[i-1][1];
//std::cout<<i<<" "<<breakPt[i-1][0] <<" "<<breakPt[i-1][1]<<std::endl;
MPI_Send(&firstBreakPt,1,MPI_INT,i,1,MPI_COMM_WORLD);
MPI_Send(&lateralBreakPt,1,MPI_INT,i,2,MPI_COMM_WORLD);
}
// receive arrays from slaves;
for (int m =1; m<numprocs; m++)
{
MPI_Probe(m, 3, MPI_COMM_WORLD, &status);
int nElems3, nElems4;
MPI_Get_elements(&status, MPI_DOUBLE, &nElems3);
// Allocate buffer of appropriate size
double *result3 = new double[nElems3];
MPI_Recv(result3,nElems3,MPI_DOUBLE,m,3,MPI_COMM_WORLD,&status);
std::cout<<"Tag is 3, ID is "<<m<<std::endl;
for (int ii=0;ii<nElems3;ii++)
{
std::cout<<result3[ii]<<std::endl;
}
MPI_Probe(m, 4, MPI_COMM_WORLD, &status);
MPI_Get_elements(&status, MPI_DOUBLE, &nElems4);
// Allocate buffer of appropriate size
double *result4 = new double[nElems4];
MPI_Recv(result4,nElems4,MPI_DOUBLE,m,4,MPI_COMM_WORLD,&status);
std::cout<<"Tag is 4, ID is "<<m<<std::endl;
for (int ii=0;ii<nElems4;ii++)
{
std::cout<<result4[ii]<<std::endl;
}
}
}
else
{
// receive three paramters from master;
MPI_Recv(&tmpN,1,MPI_INT,0,0,MPI_COMM_WORLD,&status);
MPI_Recv(&firstBreakPt,1,MPI_INT,0,1,MPI_COMM_WORLD,&status);
MPI_Recv(&lateralBreakPt,1,MPI_INT,0,2,MPI_COMM_WORLD,&status);
// width
int width1 = (rand() % (tmpN-firstBreakPt+1))+ firstBreakPt;
int width2 = (rand() % (tmpN-lateralBreakPt+1))+ lateralBreakPt;
// create dynamic arrays
double *reMat1 = new double[width1*width1];
double *reMat2 = new double[width2*width2];
for (int n=0;n<width1; n++)
{
for (int j=0;j<width1; j++)
{
reMat1[n*width1+j]=(double)rand()/RAND_MAX + (double)rand()/(RAND_MAX*RAND_MAX);
//a[i*Width+j]=1.00;
}
}
for (int k=0;k<width2; k++)
{
for (int h=0;h<width2; h++)
{
reMat2[k*width2+h]=(double)rand()/RAND_MAX + (double)rand()/(RAND_MAX*RAND_MAX);
//a[i*Width+j]=1.00;
}
}
// send it back to master
MPI_Send(reMat1,width1*width1,MPI_DOUBLE,0,3,MPI_COMM_WORLD);
MPI_Send(reMat2,width2*width2,MPI_DOUBLE,0,4,MPI_COMM_WORLD);
}
MPI_Finalize();
std::cin.get();
return 0;
}
P.S. This code is the right answer.
Use collective MPI operations, as Zulan suggested. For example, the first thing your code does is have the root send the same value to all the slaves, which is a broadcast, i.e., MPI_Bcast(). Then the root sends a different value to each slave, which is a scatter, i.e., MPI_Scatter().
The last operation is that the slave processes send variably-sized data to the root, for which the MPI_Gatherv() function exists. However, to use this function, you need to:
allocate the incoming buffer on the root (there is no malloc() for reMat1 and reMat2 in the first if-branch of your code), so the root needs to know the total element count,
tell MPI_Gatherv() on the root how many elements will be received from each slave and where to put them.
This problem can easily be solved with a so-called parallel prefix; look at MPI_Scan() or MPI_Exscan().
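A minimal sketch of the bookkeeping on the root, assuming the per-rank element counts have already been collected there (for example with one int gathered from each rank). The helper names make_displs and total_count are hypothetical, and no MPI call is made here. The displacements that MPI_Gatherv() expects are the exclusive prefix sum of the counts:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: given the number of elements each rank will send,
// return the offset (in elements) at which each rank's block starts in the
// root's receive buffer. This is an exclusive prefix sum of the counts.
std::vector<int> make_displs(const std::vector<int>& counts) {
    std::vector<int> displs(counts.size(), 0);
    for (std::size_t i = 1; i < counts.size(); ++i) {
        displs[i] = displs[i - 1] + counts[i - 1];
    }
    return displs;
}

// Total number of elements the root must allocate for the receive buffer.
int total_count(const std::vector<int>& counts) {
    int total = 0;
    for (int c : counts) {
        total += c;
    }
    return total;
}
```

For example, with counts {0, 9, 36, 16} (the root contributes nothing), the displacements are {0, 0, 9, 45} and the root allocates 61 doubles before the gather.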
Here you create randomized widths
int width1 = (rand() % (tmpN-firstBreakPt+1))+ firstBreakPt;
int width2 = (rand() % (tmpN-lateralBreakPt+1))+ lateralBreakPt;
which you later use to send data back to process 0
MPI_Send(reMat1,width1*width1,MPI_DOUBLE,0,3,MPI_COMM_WORLD);
But process 0 expects a different number of elements
MPI_Recv(reMat1,firstBreakPt*tmpN*firstBreakPt*tmpN,MPI_DOUBLE,m,3,MPI_COMM_WORLD,&status);
which causes problems. Process 0 does not know what sizes each slave process generated, so you have to send the sizes back, the same way you sent the sizes to the slaves.
I am writing code in which each processor must interact with multiple other processors.
Example: I have 12 processors, and Processor 0 has to communicate with, say, 1, 2, 10 and 9. Let's call them the neighbours of Processor 0. Similarly,
Processor 1 has to communicate with, say, 5 and 3.
Processor 2 has to communicate with 5, 1, 0, 10 and 11,
and so on.
The flow of data is two-way, i.e. Processor 0 must send data to 1, 2, 10 and 9 and also receive data from them.
Also, there is no problem in the tag calculation.
I have created a code which works like this:
for(all neighbours)
{
store data in vector<double> x;
MPI_Send(x)
}
MPI_BARRIER();
for(all neighbours)
{
MPI_Recv(x);
do work with x
}
Now I am testing this algorithm for different sizes of x and different arrangements of neighbours. The code works for some but does not work for others; it simply ends in deadlock.
I have also tried:
for(all neighbours)
{
store data in vector<double> x;
MPI_Isend(x)
}
MPI_Test();
for(all neighbours)
{
MPI_Recv(x);
do work with x
}
The result is the same, although the deadlock is replaced by NaN in the result, as MPI_Test() tells me that some of the MPI_Isend() operations are not complete and the code jumps immediately to MPI_Recv().
Can anyone guide me in this matter? What am I doing wrong? Or is my fundamental approach itself incorrect?
EDIT: I am attaching a code snippet for better understanding of the problem. I am basically working on parallelizing an unstructured 3D CFD solver.
I have attached one of the files, with some explanation. I am not broadcasting; I am looping over the neighbours of the parent processor to send the data across the interface (this can be defined as a boundary between two zones).
So, if I have 12 processors and Processor 0 has to communicate with 1, 2, 10 and 9, then 0 is the parent processor and 1, 2, 10 and 9 are its neighbours.
As the file was too long and is part of the solver, to keep things simple I have kept only the MPI function in it.
void Reader::MPI_InitializeInterface_Values() {
double nbr_interface_id;
Interface *interface;
MPI_Status status;
MPI_Request send_request, recv_request;
int err, flag;
int err2;
char buffer[MPI_MAX_ERROR_STRING];
int len;
int count;
for (int zone_no = 0; zone_no<this->GetNumberOfZones(); zone_no++) { // Number of zone per processor is 1, so basically each zone is an independent processor
UnstructuredGrid *zone = this->ZoneList[zone_no];
int no_of_interface = zone->GetNumberOfInterfaces();
// int count;
long int count_send = 0;
long int count_recv = 0;
long int max_size = 10000; // can be set from test case later
int max_size2 = 199;
int proc_no = FlowSolution::processor_number;
for (int interface_no = 0; interface_no < no_of_interface; interface_no++) { // interface is defined as a boundary between two zones
interface = zone->GetInterface(interface_no);
int no_faces = interface->GetNumberOfFaces();
if (no_faces != 0) {
std::vector< double > Variable_send; // The vector which stores the data to be sent across the interface
std::vector< double > Variable_recieve;
int total_size = FlowSolution::VariableOrder.size() * no_faces;
Variable_send.resize(total_size);
Variable_recieve.resize(total_size);
int nbr_proc_no = zone->GetInterface(interface_no)->GetNeighborZoneId(); // neighbour of parent processor
int j = 0;
nbr_interface_id = interface->GetShared_Interface_ID();
for (std::map<VARIABLE, int>::iterator iterator = FlowSolution::VariableOrder.begin(); iterator != FlowSolution::VariableOrder.end(); iterator++) {
for (int face_no = 0; face_no < no_faces; face_no++) {
Face *face = interface->GetFace(face_no);
int owner_id = face->Getinterface_Original_face_owner_id();
double value_send = zone->GetInterface(interface_no)->GetFace(face_no)->GetCell(owner_id)->GetPresentFlowSolution()->GetVariableValue((*iterator).first);
Variable_send[j] = value_send;
j++;
}
}
count_send = nbr_proc_no * max_size + nbr_interface_id; // tag for data to be sent
err2 = MPI_Isend(&Variable_send.front(), total_size, MPI_DOUBLE, nbr_proc_no, count_send, MPI_COMM_WORLD, &send_request);
}// end of sending
} // all the processors have sent data to their corresponding neighbours
MPI_Barrier(MPI_COMM_WORLD);
for (int interface_no = 0; interface_no < no_of_interface; interface_no++) { // loop over of neighbours of the current processor to receive data
interface = zone->GetInterface(interface_no);
int no_faces = interface->GetNumberOfFaces();
if (no_faces != 0) {
std::vector< double > Variable_recieve; // The vector which collects the data sent across the interface from
int total_size = FlowSolution::VariableOrder.size() * no_faces;
Variable_recieve.resize(total_size);
count_recv = proc_no * max_size + interface_no; // tag to receive data
int nbr_proc_no = zone->GetInterface(interface_no)->GetNeighborZoneId();
nbr_interface_id = interface->GetShared_Interface_ID();
MPI_Irecv(&Variable_recieve.front(), total_size, MPI_DOUBLE, nbr_proc_no, count_recv, MPI_COMM_WORLD, &recv_request);
/* Now some work is done using received data */
int j = 0;
for (std::map<VARIABLE, int>::iterator iterator = FlowSolution::VariableOrder.begin(); iterator != FlowSolution::VariableOrder.end(); iterator++) {
for (int face_no = 0; face_no < no_faces; face_no++) {
double value_recieve = Variable_recieve[j];
j++;
Face *face = interface->GetFace(face_no);
int owner_id = face->Getinterface_Original_face_owner_id();
interface->GetFictitiousCell(face_no)->GetPresentFlowSolution()->SetVariableValue((*iterator).first, value_recieve);
double value1 = face->GetCell(owner_id)->GetPresentFlowSolution()->GetVariableValue((*iterator).first);
double face_value = 0.5 * (value1 + value_recieve);
interface->GetFace(face_no)->GetPresentFlowSolution()->SetVariableValue((*iterator).first, face_value);
}
}
// Variable_recieve.clear();
}
}// end of receiving
}
Working from the problem statement:
Processor 0 has to send to 1, 2, 9 and 10, and receive from them.
Processor 1 has to send to 5 and 3, and receive from them.
Processor 2 has to send to 0, 1, 5, 10 and 11, and receive from them.
There are 12 total processors.
You can make life easier if you just run a 12-step program:
Step 1: Processor 0 sends, others receive as needed, then the converse occurs.
Step 2: Processor 1 sends, others receive as needed, then the converse occurs.
...
Step 12: Profit - there's nothing left to do (because every other processor has already interacted with Processor 11).
Each step can be implemented as an MPI_Scatterv (some sendcounts will be zero), followed by an MPI_Gatherv. 22 total calls and you're done.
There may be several possible reasons for a deadlock, so you have to be more specific. For example, the standard says: "When standard send operations are used, then a deadlock situation may occur where both processes are blocked because buffer space is not available."
You should use both MPI_Isend and MPI_Irecv, with a separate request for every operation. The general structure should be:
MPI_Request req[2*n];
MPI_Irecv(..., &req[0]);
// ...
MPI_Irecv(..., &req[n-1]);
MPI_Isend(..., &req[n]);
// ...
MPI_Isend(..., &req[2*n-1]);
MPI_Waitall(2*n, req, MPI_STATUSES_IGNORE);
The problem can be solved by using MPI_Allgatherv. All I did was set the send counts so that only the processors I wanted to communicate with had a nonzero send count. The other processors had a send count of 0.
This solved my problem
Thank you everyone for your answers!
I am learning MPI, and trying to create examples of some of the functions. I've gotten several to work, but I am having issues with MPI_Gather. I had a much more complex fitting test, but I trimmed it down to the most simple code. I am still, however, getting the following error:
root#master:/home/sgeadmin# mpirun ./expfitTest5
Assertion failed in file src/mpid/ch3/src/ch3u_request.c at line 584: FALSE
memcpy argument memory ranges overlap, dst_=0x1187e30 src_=0x1187e40 len_=400
internal ABORT - process 0
I am running one master instance and two node instances through AWS EC2. I have all the appropriate libraries installed, as I've gotten other MPI examples to work. My program is:
int main()
{
int world_size, world_rank;
int nFits = 100;
double arrCount[100];
double *rBuf = NULL;
MPI_Init(NULL,NULL);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
assert(world_size!=1);
int nElements = nFits/(world_size-1);
if(world_rank>0){
for(int k = 0; k < nElements; k++)
{
arrCount[k] = k;
}}
MPI_Barrier(MPI_COMM_WORLD);
if(world_rank==0)
{
rBuf = (double*) malloc( nFits*sizeof(double));
}
MPI_Gather(arrCount, nElements, MPI_DOUBLE, rBuf, nElements, MPI_DOUBLE, 0, MPI_COMM_WORLD);
if(world_rank==0){
for(int i = 0; i < nFits; i++)
{
cout<<rBuf[i]<<"\n";
}}
MPI_Finalize();
exit(0);
}
Is there something I am not understanding in malloc or MPI_Gather? I've compared my code to other samples, and can't find any differences.
The root process in a gather operation does participate in the operation, i.e. it sends data to its own receive buffer. That also means you must allocate memory for its part in the receive buffer.
Now you could use MPI_Gatherv and specify a recvcounts[0]/sendcount at root of 0 to follow your example closely. But usually you would prefer to write an MPI application in a way that the root participates equally in the operation, i.e. int nElements = nFits/world_size.
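To make the size mismatch concrete, here is a small sketch; the helper names block_counts and gather_recv_size are hypothetical and nothing here calls MPI. It assumes the run in the question used three ranks, which is consistent with the 400-byte copy (50 doubles) in the error message:

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: split n_items over n_ranks so that every rank,
// including the root, gets a block; the first (n_items % n_ranks) ranks
// get one extra item. Uneven blocks need MPI_Gatherv rather than MPI_Gather.
std::vector<int> block_counts(int n_items, int n_ranks) {
    assert(n_ranks > 0);
    std::vector<int> counts(n_ranks, n_items / n_ranks);
    for (int r = 0; r < n_items % n_ranks; ++r) {
        ++counts[r];
    }
    return counts;
}

// Elements the root's receive buffer must hold for MPI_Gather when every
// rank (root included) contributes per_rank elements.
int gather_recv_size(int per_rank, int n_ranks) {
    return per_rank * n_ranks;
}
```

With three ranks, nElements = 100/(3-1) = 50, so MPI_Gather needs room for gather_recv_size(50, 3) = 150 doubles, while rBuf holds only 100. Letting the root take part, block_counts(100, 3) gives {34, 33, 33}, which sums to nFits.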
I have a finite element code that uses blocking receives and non-blocking sends. Each element has 3 incoming faces and 3 outgoing faces. The mesh is split up among many processors, so the boundary conditions sometimes come from the element's own processor and sometimes from neighboring processors. Relevant parts of the code are:
std::vector<task>::iterator it = All_Tasks.begin();
std::vector<task>::iterator it_end = All_Tasks.end();
int task = 0;
for (; it != it_end; it++, task++)
{
for (int f = 0; f < 3; f++)
{
// Get the neighbors for each incoming face
Neighbor neighbor = subdomain.CellSets[(*it).cellset_id_loc].neighbors[incoming[f]];
// Get buffers from boundary conditions or neighbor processors
if (neighbor.processor == rank)
{
subdomain.Set_buffer_from_bc(incoming[f]);
}
else
{
// Get the flag from the corresponding send
target = GetTarget((*it).angle_id, (*it).group_id, (*it).cell_id);
if (incoming[f] == x)
{
int size = cells_y*cells_z*groups*angles*4;
MPI_Status status;
MPI_Recv(&subdomain.X_buffer[0], size, MPI_DOUBLE, neighbor.processor, target, MPI_COMM_WORLD, &status);
}
if (incoming[f] == y)
{
int size = cells_x*cells_z*groups*angles * 4;
MPI_Status status;
MPI_Recv(&subdomain.Y_buffer[0], size, MPI_DOUBLE, neighbor.processor, target, MPI_COMM_WORLD, &status);
}
if (incoming[f] == z)
{
int size = cells_x*cells_y*groups*angles * 4;
MPI_Status status;
MPI_Recv(&subdomain.Z_buffer[0], size, MPI_DOUBLE, neighbor.processor, target, MPI_COMM_WORLD, &status);
}
}
}
... computation ...
for (int f = 0; f < 3; f++)
{
// Get the outgoing neighbors for each face
Neighbor neighbor = subdomain.CellSets[(*it).cellset_id_loc].neighbors[outgoing[f]];
if (neighbor.IsOnBoundary)
{
// store the buffer into the boundary information
}
else
{
target = GetTarget((*it).angle_id, (*it).group_id, neighbor.cell_id);
if (outgoing[f] == x)
{
int size = cells_y*cells_z*groups*angles * 4;
MPI_Request request;
MPI_Isend(&subdomain.X_buffer[0], size, MPI_DOUBLE, neighbor.processor, target, MPI_COMM_WORLD, &request);
}
if (outgoing[f] == y)
{
int size = cells_x*cells_z*groups*angles * 4;
MPI_Request request;
MPI_Isend(&subdomain.Y_buffer[0], size, MPI_DOUBLE, neighbor.processor, target, MPI_COMM_WORLD, &request);
}
if (outgoing[f] == z)
{
int size = cells_x*cells_y*groups*angles * 4;
MPI_Request request;
MPI_Isend(&subdomain.Z_buffer[0], size, MPI_DOUBLE, neighbor.processor, target, MPI_COMM_WORLD, &request);
}
}
}
}
A processor can do a lot of tasks before it needs information from other processors. I need a non-blocking send so that the code can keep working, but I'm pretty sure the receives are overwriting the send buffers before they get sent.
I've tried timing this code, and it's taking 5-6 seconds for the call to MPI_Recv, even though the message it's trying to receive has been sent. My theory is that the Isend is starting, but not actually sending anything until the Recv is called. The message itself is on the order of 1 MB. I've looked at benchmarks and messages of this size should take a very small fraction of a second to send.
My question is, in this code, is the buffer that was sent being overwritten, or just the local copy? Is there a way to 'add' to a buffer when I'm sending, rather than writing to the same memory location? I want the Isend to write to a different buffer every time it's called so the information isn't being overwritten while the messages wait to be received.
** EDIT **
A related question that might fix my problem: Can MPI_Test or MPI_Wait give information about an MPI_Isend writing to a buffer, i.e. return true if the Isend has written to the buffer, but that buffer has yet to be received?
** EDIT 2 **
I have added more information about my problem.
So it looks like I just have to bite the bullet and allocate enough memory in the send buffers to accommodate all the messages, and then just send portions of the buffer when I send.
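One way to read that conclusion, as a sketch only: give every message that may be in flight its own slot inside one large allocation, so a later receive or computation never touches data that a pending MPI_Isend still owns. The SendSlots type and its members are hypothetical and not part of the original code; the MPI calls themselves are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical pool: one contiguous allocation divided into fixed-size
// slots, one slot per message that may be in flight at the same time.
struct SendSlots {
    std::size_t slot_size;        // doubles per message
    std::vector<double> storage;  // n_slots * slot_size doubles

    SendSlots(std::size_t n_slots, std::size_t slot_size_)
        : slot_size(slot_size_), storage(n_slots * slot_size_) {}

    // Pointer to the start of slot k; this pointer and slot_size would be
    // the buffer and count arguments of one MPI_Isend.
    double* slot(std::size_t k) {
        assert((k + 1) * slot_size <= storage.size());
        return storage.data() + k * slot_size;
    }
};
```

A slot should only be reused once MPI_Wait or MPI_Test reports that the request started on it has completed, which in turn means the MPI_Request of every MPI_Isend has to be kept rather than discarded.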
I am trying to do an all-to-one communication out-of-order. Basically I have multiple floating point arrays of the same size, identified by an integer id.
Each message should look like:
<int id><float array data>
On the receiver side, it knows exactly how many arrays are there, and thus sets up exact number of recvs. Upon receiving a message, it parses the id and put data into the right place. The problem is that a message could be sent from any other processes to the receiving process. (e.g. the producers have a work queue structure, and process whichever id is available on the queue.)
Since MPI only guarantees in-order delivery between a given pair of processes, I can't trivially put the integer id and the FP data in two messages; otherwise the receiver might not be able to match an id with its data. MPI doesn't allow two types of data in one send either.
I can only think of two approaches.
1) The receiver has an array of size m (source[m]), m being the number of sending nodes. A sender sends the id first, then the data. The receiver saves the id to source[i] after receiving an integer message from sender i. Upon receiving an FP array from sender i, it checks source[i], gets the id, and moves the data to the right place. This works because MPI guarantees in-order P2P communication. It requires the receiver to keep state information for each sender. To make matters worse, if a single sending process can have two ids sent before the data (e.g. multi-threaded), this mechanism won't work.
2) Treat the id and the FP data as bytes, and copy them into a send buffer. Send them as MPI_CHAR, and the receiver casts them back to an integer and an FP array. Then I need to pay the additional cost of copying things into a byte buffer on the sender side. The total temporary buffer also grows as I increase the number of threads within an MPI process.
Neither is a perfect solution. I don't want to lock anything inside a process. I wonder if any of you have better suggestions.
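For reference, approach 2 could be sketched roughly like this. The function names pack_message and unpack_message are hypothetical, and the sketch assumes sender and receiver share the same byte order and type sizes (a homogeneous cluster); the actual send and receive calls are omitted.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Pack an integer id followed by n floats into one byte buffer, which
// could then be sent as buf.size() elements of MPI_CHAR.
std::vector<char> pack_message(int id, const float* data, std::size_t n) {
    std::vector<char> buf(sizeof(int) + n * sizeof(float));
    std::memcpy(buf.data(), &id, sizeof(int));
    if (n > 0) {
        std::memcpy(buf.data() + sizeof(int), data, n * sizeof(float));
    }
    return buf;
}

// Inverse of pack_message: recover the id and copy the floats out.
// memcpy is used instead of pointer casts to avoid alignment and
// aliasing problems.
int unpack_message(const std::vector<char>& buf, std::vector<float>& out) {
    int id = 0;
    std::memcpy(&id, buf.data(), sizeof(int));
    out.resize((buf.size() - sizeof(int)) / sizeof(float));
    if (!out.empty()) {
        std::memcpy(out.data(), buf.data() + sizeof(int),
                    out.size() * sizeof(float));
    }
    return id;
}
```

The extra copy on each side is exactly the cost described in approach 2; the sketch only shows that the id and the data can travel in a single message.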
Edit: The code will be run on a shared cluster with infiniband. The machines will be randomly assigned. So I don't think TCP sockets will be able to help me here. In addition, IPoIB looks expensive. I do need the full 40Gbps speed for communication, and keep CPU doing the computation.
You can specify MPI_ANY_SOURCE as the source rank in the receive function, then sort the messages using their tags, which is easier than creating custom messages. Here's a simplified example:
#include <stdio.h>
#include "mpi.h"
int main() {
MPI_Init(NULL,NULL);
int rank=0;
int size=1;
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
MPI_Comm_size(MPI_COMM_WORLD,&size);
// Receiver is the last node for simplicity in the arrays
if (rank == size-1) {
// Receiver has size-1 slots
float data[size-1];
MPI_Request request[size-1];
// Use tags to sort receives
for (int tag=0;tag<size-1;++tag){
printf("Receiver for id %d\n",tag);
// Non-blocking receive
MPI_Irecv(data+tag,1,MPI_FLOAT,
MPI_ANY_SOURCE,tag,MPI_COMM_WORLD,&request[tag]);
}
// Wait for all requests to complete
printf("Waiting...\n");
MPI_Waitall(size-1,request,MPI_STATUSES_IGNORE);
for (size_t i=0;i<size-1;++i){
printf("%f\n",data[i]);
}
} else {
// Producer
int id = rank;
float data = rank;
printf("Sending {%d}{%f}\n",id,data);
MPI_Send(&data,1,MPI_FLOAT,size-1,id,MPI_COMM_WORLD);
}
return MPI_Finalize();
}
As somebody already wrote, you can use MPI_ANY_SOURCE to receive from any source. To send two different kind of data in a single send you can use a derived datatype:
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"
#define asize 10
typedef struct data_ {
int id;
float array[asize];
} data;
int main() {
MPI_Init(NULL,NULL);
int rank = -1;
int size = -1;
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
MPI_Comm_size(MPI_COMM_WORLD,&size);
data buffer;
// Define and commit a new datatype
int blocklength [2];
MPI_Aint displacement[2];
MPI_Datatype datatypes [2];
MPI_Datatype mpi_tdata;
MPI_Aint startid,startarray;
MPI_Get_address(&(buffer.id),&startid);
MPI_Get_address(&(buffer.array[0]),&startarray);
blocklength [0] = 1;
blocklength [1] = asize;
displacement[0] = 0;
displacement[1] = startarray - startid;
datatypes [0] = MPI_INT;
datatypes [1] = MPI_FLOAT;
MPI_Type_create_struct(2,blocklength,displacement,datatypes,&mpi_tdata);
MPI_Type_commit(&mpi_tdata);
if (rank == 0) {
int count = 0;
MPI_Status status;
while (count < size-1 ) {
// Blocking receive from any sender
printf("Receiving message %d\n",count);
MPI_Recv(&buffer,1,mpi_tdata,MPI_ANY_SOURCE,0,MPI_COMM_WORLD,&status);
printf("Message tag %d, first entry %g\n",buffer.id,buffer.array[0]);
// Counting the received messages
count++;
}
} else {
// Initialize buffer to be sent
buffer.id = rank;
for (int ii = 0; ii < asize; ii++) {
buffer.array[ii] = 10*rank + ii;
}
// Send buffer
MPI_Send(&buffer,1,mpi_tdata,0,0,MPI_COMM_WORLD);
}
MPI_Type_free(&mpi_tdata);
MPI_Finalize();
return 0;
}
The issue I am trying to resolve is the following:
The C++ serial code I have computes across a large 2D matrix. To optimize this process, I wish to split this large 2D matrix and run on 4 nodes (say) using MPI. The only communication that occurs between nodes is the sharing of edge values at the end of each time step. Every node shares the edge array data, A[i][j], with its neighbor.
Based on reading about MPI, I have the following scheme to be implemented.
if (myrank == 0)
{
for (i= 0 to x)
for (y= 0 to y)
{
C++ CODE IMPLEMENTATION
....
MPI_SEND(A[x][0], A[x][1], A[x][2], Destination= 1.....)
MPI_RECEIVE(B[0][0], B[0][1]......Sender = 1.....)
MPI_BARRIER
}
if (myrank == 1)
{
for (i = x+1 to xx)
for (y = 0 to y)
{
C++ CODE IMPLEMENTATION
....
MPI_SEND(B[x][0], B[x][1], B[x][2], Destination= 0.....)
MPI_RECEIVE(A[0][0], A[0][1]......Sender = 1.....)
MPI BARRIER
}
I wanted to know if my approach is correct, and I would also appreciate any guidance on other MPI functions to look into for the implementation.
Thanks,
Ashwin.
Just to amplify Joel's points a bit:
This goes much easier if you allocate your arrays so that they're contiguous (something C's "multidimensional arrays" don't give you automatically:)
int **alloc_2d_int(int rows, int cols) {
int *data = (int *)malloc(rows*cols*sizeof(int));
int **array= (int **)malloc(rows*sizeof(int*));
for (int i=0; i<rows; i++)
array[i] = &(data[cols*i]);
return array;
}
/*...*/
int **A;
/*...*/
A = alloc_2d_int(N,M);
Then, you can do sends and receives of the entire NxM array with
MPI_Send(&(A[0][0]), N*M, MPI_INT, destination, tag, MPI_COMM_WORLD);
and when you're done, free the memory with
free(A[0]);
free(A);
Also, MPI_Recv is a blocking receive, and MPI_Send can be a blocking send. One thing that means, as per Joel's point, is that you definitely don't need barriers. Further, it means that if you have a send/receive pattern as above, you can get yourself into a deadlock situation -- everyone is sending, no one is receiving. Safer is:
if (myrank == 0) {
MPI_Send(&(A[0][0]), N*M, MPI_INT, 1, tagA, MPI_COMM_WORLD);
MPI_Recv(&(B[0][0]), N*M, MPI_INT, 1, tagB, MPI_COMM_WORLD, &status);
} else if (myrank == 1) {
MPI_Recv(&(A[0][0]), N*M, MPI_INT, 0, tagA, MPI_COMM_WORLD, &status);
MPI_Send(&(B[0][0]), N*M, MPI_INT, 0, tagB, MPI_COMM_WORLD);
}
Another, more general, approach is to use MPI_Sendrecv:
int *sendptr, *recvptr;
int sendtag, recvtag;
int neigh = MPI_PROC_NULL;
if (myrank == 0) {
sendptr = &(A[0][0]);
recvptr = &(B[0][0]);
sendtag = tagA;
recvtag = tagB;
neigh = 1;
} else {
sendptr = &(B[0][0]);
recvptr = &(A[0][0]);
sendtag = tagB;
recvtag = tagA;
neigh = 0;
}
// the send tag on one rank must match the receive tag on the other
MPI_Sendrecv(sendptr, N*M, MPI_INT, neigh, sendtag, recvptr, N*M, MPI_INT, neigh, recvtag, MPI_COMM_WORLD, &status);
or nonblocking sends and/or receives.
First, you don't need that many barriers.
Second, you should really send your data as a single block, as multiple blocking sends and receives will result in poor performance.
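As an illustration of "a single block", here is a sketch that copies one edge column of a row-major matrix into a contiguous buffer so that it can go out in one send. The function name pack_column and the flat std::vector layout are assumptions made for the example, not something taken from the question's code.

```cpp
#include <cstddef>
#include <vector>

// Copy column `col` of a row-major rows x cols matrix into a contiguous
// buffer, so the whole edge can be handed to a single send call.
std::vector<double> pack_column(const std::vector<double>& a,
                                std::size_t rows, std::size_t cols,
                                std::size_t col) {
    std::vector<double> edge(rows);
    for (std::size_t i = 0; i < rows; ++i) {
        edge[i] = a[i * cols + col];
    }
    return edge;
}
```

An edge that is a whole row is already contiguous in this layout and needs no copy. For columns, a strided derived datatype such as one built with MPI_Type_vector is an alternative to copying.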
This question has already been answered quite thoroughly by Jonathan Dursi; however, as Jonathan Leffler has pointed out in his comment to Jonathan Dursi's answer, C's multi-dimensional arrays are a contiguous block of memory. Therefore, I would like to point out that for a not-too-large 2d array, a 2d array could simply be created on the stack:
int A[N][M];
Since the memory is contiguous, the array can be sent as it is:
MPI_Send(A, N*M, MPI_INT,1, tagA, MPI_COMM_WORLD);
On the receiving side, the array can be received into a 1d array of size N*M (which can then be copied into a 2d array if necessary):
int A_1d[N*M];
MPI_Recv(A_1d, N*M, MPI_INT,0,tagA, MPI_COMM_WORLD,&status);
//copying the array to a 2d-array
int A_2d[N][M];
for (int i = 0; i < N; i++){
for (int j = 0; j < M; j++){
A_2d[i][j] = A_1d[(i*M)+j];
}
}
Copying the array does cause twice the memory to be used, so it would be better to simply use A_1d by accessing its elements through A_1d[(i*M)+j].
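The index arithmetic can be hidden behind a small wrapper, sketched below. Grid2D and its members are hypothetical names; the only point is that one contiguous allocation serves both as the 2D array and as the message buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major wrapper: one contiguous allocation, 2D indexing.
struct Grid2D {
    int rows, cols;
    std::vector<int> cells;  // rows * cols ints, stored row after row

    Grid2D(int r, int c)
        : rows(r), cols(c), cells(static_cast<std::size_t>(r) * c, 0) {}

    // Element (i, j) lives at flat index i * cols + j.
    int& at(int i, int j) {
        assert(i >= 0 && i < rows && j >= 0 && j < cols);
        return cells[static_cast<std::size_t>(i) * cols + j];
    }

    // Pointer suitable as the buffer argument of MPI_Send or MPI_Recv,
    // with a count of rows * cols and type MPI_INT.
    int* data() { return cells.data(); }
};
```

With this layout, receiving directly into data() avoids the second array and the copy loop altogether.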