I am writing an MPI C++ code for data exchange; below is the sample code:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int size, rank;
    int dest, tag, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("SIZE = %d RANK = %d\n", size, rank);

    if (rank == 0)
    {
        double data[500];
        for (i = 0; i < 500; i++)
        {
            data[i] = i;
        }
        dest = 1;
        tag = 1;
        MPI_Send(data, 500, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
It looks like 500 is the maximum I can send. If the number of elements increases to 600, the code seems to stop at MPI_Send without making further progress. I suspect there is some limitation on the amount of data that can be transferred with MPI_Send. Can someone enlighten me?
Thanks in advance,
Kan
Long story short, your program is incorrect, and you are lucky it did not hang with a small count.
Per the MPI standard, you cannot assume MPI_Send() will return unless a matching MPI_Recv() has been posted.
From a pragmatic point of view, "short" messages are generally sent in eager mode, so MPI_Send() likely returns immediately.
"Long" messages, on the other hand, usually involve a rendezvous protocol and hence hang until a matching receive has been posted.
What counts as "short" or "long" depends on several factors, including the interconnect you are using. And once again, you cannot assume MPI_Send() will always return immediately just because a message is small enough.
This is my first question in the community; if there is anything wrong with my format, please tell me.
I am using MPI_Gatherv to collect data. I may have to collect something like vector<vector<int>>. I heard MPI_Gatherv can only work with a flat vector, so I decided to send the data vector by vector. Below is an example of my idea. However, it fails at MPI_Finalize() with error 0xC0000005. If I delete the MPI_Gatherv(&c, count.at(i), MPI_INT, &temre, &count.at(i), displs, MPI_INT, 0, MPI_COMM_WORLD) call, it works. I wonder whether it has something to do with address conflicts.
Thanks for any help!
#include <mpi.h>
#include <vector>
using namespace std;

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int gsize;
    int myrank;
    MPI_Comm_size(MPI_COMM_WORLD, &gsize);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    vector<vector<int>> a, reci;
    vector<int> count(1, 10);
    vector<int> b(10, 1), NU(1, 0);
    a.push_back(b);
    reci.push_back(NU);
    for (int i = 0; i < myrank*3+2; i++)
    {
        b.push_back(i);
        count.push_back(11+i);
        a.push_back(b);
        reci.push_back(NU);
    }
    vector<int> c, temre(1, 1);
    int displs[1] = {0};
    for (int i = 0; i < myrank*3+3; i++)
    {
        c = a.at(i);
        MPI_Gatherv(&c, count.at(i), MPI_INT,
                    &temre, &count.at(i), displs, MPI_INT, 0, MPI_COMM_WORLD);
        reci.at(i).swap(temre);
    }
    MPI_Finalize();
    return 0;
}
Many thanks for any comments and answers.
After several days of work, I found the error. For vectors in MPI, you must use &c[0] instead of &c in
MPI_Gatherv(&c, count.at(i), MPI_INT, &temre, &count.at(i), displs, MPI_INT, 0, MPI_COMM_WORLD); (and likewise &temre[0] instead of &temre).
However, it currently works in serial but not in parallel. I am trying to fix that problem and will post the working code.
Thanks again for any help!
You have to realize that MPI is strictly a C library, so you cannot directly pass a std::vector as a send/receive buffer. The buffer has to be an int* (or a pointer to whatever element type you use). So in your case: MPI_Gatherv(c.data(), ...
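For instance, a sketch of the call with only the buffer arguments corrected, reusing the variable names from the question (note that for a real multi-rank run the root would also need temre resized to the total receive size and recvcounts/displs arrays with one entry per rank):

// Sketch: pass pointers to the vectors' storage, not the vector objects themselves.
MPI_Gatherv(c.data(), count.at(i), MPI_INT,
            temre.data(), &count.at(i), displs, MPI_INT,
            0, MPI_COMM_WORLD);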
I am trying to send a message from one process to all other MPI processes and also receive a message from each of them. It is basically an all-to-all communication where every process sends a message to every other process (except itself) and receives a message from every other process.
The following example code snippet shows what I am trying to achieve. Now, the problem with MPI_Send is its behavior: for a small message size it acts as non-blocking, but for larger messages (on my machine, BUFFER_SIZE 16400) it blocks. I am aware that this is how MPI_Send behaves. As a workaround, I replaced the code below with a combined blocking send and receive, MPI_Sendrecv, for example: MPI_Sendrecv(intSendPack, BUFFER_SIZE, MPI_INT, processId, MPI_TAG, intReceivePack, BUFFER_SIZE, MPI_INT, processId, MPI_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE). I make this call for all processes of MPI_COMM_WORLD inside a loop over every rank, and the approach gives me what I am trying to achieve (all-to-all communication). However, it takes a lot of time, which I want to cut down with a more time-efficient approach. I have tried MPI_Scatter and MPI_Gather to perform the all-to-all communication, but one issue is that the buffer size (16400) may differ between iterations of the MPI_all_to_all function call. Here I am using MPI_TAG to differentiate calls in different iterations, which I cannot do with the scatter and gather functions.
#define BUFFER_SIZE 16400

void MPI_all_to_all(int MPI_TAG)
{
    int size;
    int rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int* intSendPack = new int[BUFFER_SIZE]();
    int* intReceivePack = new int[BUFFER_SIZE]();

    for (int prId = 0; prId < size; prId++) {
        if (prId != rank) {
            MPI_Send(intSendPack, BUFFER_SIZE, MPI_INT, prId, MPI_TAG,
                     MPI_COMM_WORLD);
        }
    }
    for (int sId = 0; sId < size; sId++) {
        if (sId != rank) {
            MPI_Recv(intReceivePack, BUFFER_SIZE, MPI_INT, sId, MPI_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }

    delete[] intSendPack;
    delete[] intReceivePack;
}
I want to know if there is a way I can perform all-to-all communication using a more efficient communication model. I am not tied to MPI_Send; if there is some other way that achieves what I am trying to do, I am happy with that. Any help or suggestion is much appreciated.
This is a benchmark that lets you compare the performance of collective vs. point-to-point communication in an all-to-all exchange:
#include <iostream>
#include <algorithm>
#include <mpi.h>

#define BUFFER_SIZE 16384

void point2point(int*, int*, int, int);

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank_id = 0, com_sz = 0;
    double t0 = 0.0, tf = 0.0;
    MPI_Comm_size(MPI_COMM_WORLD, &com_sz);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank_id);

    int* intSendPack = new int[BUFFER_SIZE]();
    int* result = new int[BUFFER_SIZE*com_sz]();
    std::fill(intSendPack, intSendPack + BUFFER_SIZE, rank_id);
    std::fill(result + BUFFER_SIZE*rank_id, result + BUFFER_SIZE*(rank_id+1), rank_id);

    // Send-Receive
    t0 = MPI_Wtime();
    point2point(intSendPack, result, rank_id, com_sz);
    MPI_Barrier(MPI_COMM_WORLD);
    tf = MPI_Wtime();
    if (!rank_id)
        std::cout << "Send-receive time: " << tf - t0 << std::endl;

    // Collective
    std::fill(result, result + BUFFER_SIZE*com_sz, 0);
    std::fill(result + BUFFER_SIZE*rank_id, result + BUFFER_SIZE*(rank_id+1), rank_id);
    t0 = MPI_Wtime();
    MPI_Allgather(intSendPack, BUFFER_SIZE, MPI_INT, result, BUFFER_SIZE, MPI_INT, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);
    tf = MPI_Wtime();
    if (!rank_id)
        std::cout << "Allgather time: " << tf - t0 << std::endl;

    MPI_Finalize();
    delete[] intSendPack;
    delete[] result;
    return 0;
}

// Send/receive communication
void point2point(int* send_buf, int* result, int rank_id, int com_sz)
{
    MPI_Status status;
    // Exchange and store the data
    for (int i = 0; i < com_sz; i++){
        if (i != rank_id){
            MPI_Sendrecv(send_buf, BUFFER_SIZE, MPI_INT, i, 0,
                         result + i*BUFFER_SIZE, BUFFER_SIZE, MPI_INT, i, 0, MPI_COMM_WORLD, &status);
        }
    }
}
Here every rank contributes its own array intSendPack to the array result on every other rank, and result should end up identical on all ranks. result is flat: each rank occupies BUFFER_SIZE entries starting at rank_id*BUFFER_SIZE. After the point-to-point run, result is reset to its initial state before the collective run.
Time is measured by placing an MPI_Barrier before the second MPI_Wtime call, so the reported value reflects the maximum time over all ranks.
I ran the benchmark on one node of NERSC Cori KNL using Slurm. I ran each case a few times just to make sure the values are consistent and I was not looking at an outlier, but you should run it maybe 10 or so times to collect more proper statistics.
Here are some thoughts:
For a small number of processes (5) and a large buffer size (16384), collective communication is about twice as fast as point-to-point, and it becomes about 4-5 times faster when moving to a larger number of ranks (64).
In this benchmark there is not much difference between the recommended Slurm settings for that specific machine and the default settings, but in real, larger programs with more communication the difference is very significant (a job that runs in less than a minute with the recommended settings can run for 20-30 minutes or more with the defaults). The point is: check your settings, it may make a difference.
What you were seeing with Send/Receive for larger messages was actually a deadlock. I saw it too for the message size shown in this benchmark. In case you missed them, there are two worthwhile SO posts on it: a buffering explanation and a word on deadlocking.
In summary, adjust this benchmark to represent your code more closely and run it on your system, but collective communication in all-to-all or one-to-all situations should be faster because of dedicated optimizations such as superior algorithms for arranging the communication. A 2-5 times speedup is considerable, since communication often contributes the most to the overall time.
I have a code which I compile and run using Open MPI. Lately, I wanted to run this same code using Intel MPI, but the code does not work as expected.
I dug into the code and found out that MPI_Send behaves differently in the two implementations.
I was advised on a different forum to use MPI_Isend instead of MPI_Send, but that would require a lot of work to modify the code. Is there any workaround in Intel MPI to make it work just like it does in Open MPI, maybe some flags, increasing a buffer, or something else? Thanks in advance for your answers.
#include <stdio.h>
#include <mpi.h>

#define MPI_TAG 100   /* tag value assumed; the original post does not show the definition */

int main(int argc, char **argv) {
    int numRanks;
    int rank;
    char cmd[] = "Hello world";
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numRanks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < numRanks; i++) {
            printf("Calling MPI_Send() from rank %d to %d\n", rank, i);
            MPI_Send(&cmd, sizeof(cmd), MPI_CHAR, i, MPI_TAG, MPI_COMM_WORLD);
            printf("Returned from MPI_Send()\n");
        }
    }
    MPI_Recv(&cmd, sizeof(cmd), MPI_CHAR, 0, MPI_TAG, MPI_COMM_WORLD, &status);
    printf("%d received from 0 %s\n", rank, cmd);

    MPI_Finalize();
    return 0;
}
OpenMPI Result
# mpirun --allow-run-as-root -n 2 helloworld_openmpi
Calling MPI_Send() from rank 0 to 0
Returned from MPI_Send()
Calling MPI_Send() from rank 0 to 1
Returned from MPI_Send()
0 received from 0 Hello world
1 received from 0 Hello world
Intel MPI Result
# mpiexec.hydra -n 2 /root/helloworld_intel
Calling MPI_Send() from rank 0 to 0
Stuck at MPI_Send.
It is incorrect to assume MPI_Send() will return before a matching receive is posted, so your code is incorrect with respect to the MPI standard, and you were lucky it did not hang with Open MPI.
MPI implementations usually send small messages eagerly, so MPI_Send() can return immediately, but this is an implementation choice not mandated by the standard, and what counts as a "small" message depends on the library version, the interconnect you are using, and other factors.
The only safe and portable choice here is to write correct code.
FWIW, MPI_Bcast(cmd, ...) is a better fit here, assuming all ranks already know the string length plus the NUL terminator.
Last but not least, the buffer argument is cmd and not &cmd.
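A sketch of that broadcast variant (the buffer length is known on every rank because "Hello world" has a fixed size in the posted code):

#include <stdio.h>
#include <mpi.h>

/* Sketch: rank 0 is the broadcast root; after MPI_Bcast every rank,
   including rank 0, holds the same contents of cmd. No MPI_Recv loop
   and no tag are needed. */
int main(int argc, char **argv)
{
    int rank;
    char cmd[12] = "";   /* 12 bytes: "Hello world" plus the NUL terminator */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        snprintf(cmd, sizeof(cmd), "Hello world");

    MPI_Bcast(cmd, sizeof(cmd), MPI_CHAR, 0, MPI_COMM_WORLD);
    printf("%d received from 0 %s\n", rank, cmd);

    MPI_Finalize();
    return 0;
}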
I have an MPI program in which the worker ranks (rank != 0) make a bunch of MPI_Send calls and the master rank (rank == 0) receives all of these messages. However, I run into a fatal error: MPI_Recv(...) failed, Out of memory.
Here is the code that I am compiling in Visual Studio 2010.
I run the executable like so:
mpiexec -n 3 MPIHelloWorld.exe
#include <mpi.h>

int main(int argc, char* argv[]){
    int numprocs, rank, namelen, num_threads, thread_id;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);

    if(rank == 0){
        for(int k=1; k<numprocs; k++){
            for(int i=0; i<1000000; i++){
                double x;
                MPI_Recv(&x, 1, MPI_DOUBLE, k, i, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
        }
    }
    else{
        for(int i=0; i<1000000; i++){
            double x = 5;
            MPI_Send(&x, 1, MPI_DOUBLE, 0, i, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}
If I run with only 2 processes, the program does not crash, so it seems the problem arises when the MPI_Send calls from a third rank (i.e. a second worker node) accumulate.
If I decrease the number of iterations to 100,000, then I can run with 3 processes without crashing. However, the amount of data being sent in one million iterations is only ~8 MB (8 bytes per double * 1,000,000 iterations), so I don't think "Out of Memory" refers to physical memory like RAM.
Any insight is appreciated, thanks!
MPI_Send may store the data in a system buffer so that it is ready to send. The size of this buffer and where it lives are implementation specific (I remember hearing that it can even be in the interconnect). In my case (Linux with MPICH) I don't get a memory error. One way to explicitly control this buffering is to use MPI_Buffer_attach together with MPI_Bsend. There may also be a way to change the system buffer size (e.g. the MP_BUFFER_MEM environment variable on IBM systems).
However, this situation of unmatched messages piling up should probably not occur in practice. In your example above, the order of the k and i loops could be swapped to prevent this build-up of messages.
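As an illustration only (a sketch, not necessarily the right fix for this program), attaching a user buffer and switching the workers to buffered sends could look roughly like this; the buffer is sized for all outstanding messages plus the per-message overhead:

#include <mpi.h>
#include <stdlib.h>

/* Sketch: buffered sends on a worker rank. The attached buffer must be large
   enough for every message that may be outstanding at once, plus
   MPI_BSEND_OVERHEAD per message (roughly 100 MB for 1,000,000 doubles). */
void worker_bsend(int n_msgs)
{
    int buf_size = n_msgs * ((int)sizeof(double) + MPI_BSEND_OVERHEAD);
    char *buf = (char*) malloc(buf_size);
    MPI_Buffer_attach(buf, buf_size);

    for (int i = 0; i < n_msgs; i++) {
        double x = 5;
        /* per-iteration tag mirrors the question's code */
        MPI_Bsend(&x, 1, MPI_DOUBLE, 0, i, MPI_COMM_WORLD);
    }

    /* MPI_Buffer_detach blocks until all buffered messages have been delivered */
    MPI_Buffer_detach(&buf, &buf_size);
    free(buf);
}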
I am learning MPI and trying to create examples for some of the functions. I've gotten several to work, but I am having issues with MPI_Gather. I had a much more complex fitting test, but I trimmed it down to the simplest code. I am still, however, getting the following error:
root#master:/home/sgeadmin# mpirun ./expfitTest5
Assertion failed in file src/mpid/ch3/src/ch3u_request.c at line 584: FALSE
memcpy argument memory ranges overlap, dst_=0x1187e30 src_=0x1187e40 len_=400
internal ABORT - process 0
I am running one master instance and two node instances through AWS EC2. I have all the appropriate libraries installed, as I've gotten other MPI examples to work. My program is:
#include <mpi.h>
#include <iostream>
#include <cassert>
#include <cstdlib>
using namespace std;

int main()
{
    int world_size, world_rank;
    int nFits = 100;
    double arrCount[100];
    double *rBuf = NULL;

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    assert(world_size != 1);

    int nElements = nFits/(world_size-1);

    if (world_rank > 0) {
        for (int k = 0; k < nElements; k++)
        {
            arrCount[k] = k;
        }
    }
    MPI_Barrier(MPI_COMM_WORLD);

    if (world_rank == 0)
    {
        rBuf = (double*) malloc(nFits*sizeof(double));
    }

    MPI_Gather(arrCount, nElements, MPI_DOUBLE, rBuf, nElements, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (world_rank == 0) {
        for (int i = 0; i < nFits; i++)
        {
            cout << rBuf[i] << "\n";
        }
    }
    MPI_Finalize();
    exit(0);
}
Is there something I am not understanding in malloc or MPI_Gather? I've compared my code to other samples, and can't find any differences.
The root process in a gather operation does participate in the operation, i.e. it sends data to its own receive buffer. That also means you must allocate memory for its part of the receive buffer.
Now, you could use MPI_Gatherv and specify a recvcounts[0] / a sendcount at the root of 0 to follow your example closely. But usually you would prefer to write an MPI application in such a way that the root participates equally in the operation, i.e. int nElements = nFits/world_size.
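A minimal sketch of that second option, reusing the variables from the question: every rank, including the root, fills and contributes nFits/world_size elements (this assumes nFits divides evenly by world_size).

// Sketch: all ranks, including the root, contribute an equal chunk.
// Assumes nFits is evenly divisible by world_size.
int nElements = nFits / world_size;
for (int k = 0; k < nElements; k++)
    arrCount[k] = k;

if (world_rank == 0)
    rBuf = (double*) malloc(nFits * sizeof(double));

MPI_Gather(arrCount, nElements, MPI_DOUBLE,
           rBuf, nElements, MPI_DOUBLE, 0, MPI_COMM_WORLD);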