After much Googling, I have no idea what's causing this issue. Here it is:
I have a simple call to MPI_Allgather in my code which I have double-, triple-, and quadruple-checked to be correct (the send/receive buffers are properly sized, and the send/receive sizes in the call are correct), but for 'large' numbers of processes I get either a deadlock or an MPI_ERR_TRUNCATE. The communicator being used for the Allgather is split from MPI_COMM_WORLD using MPI_Comm_split. For my current testing, rank 0 goes to one communicator, and the remaining ranks go to a second communicator. With 6 total ranks or fewer, the Allgather works just fine. If I use 7 ranks, I get an MPI_ERR_TRUNCATE; with 8 ranks, a deadlock. I have verified that the communicators were split correctly (MPI_Comm_rank and MPI_Comm_size are correct on all ranks for both communicators).
I have manually verified the size of each send and receive buffer, and the maximal number of receives. My first workaround was to swap the MPI_Allgather for a for-loop of MPI_Gather calls, one to each process. This worked for that one case, but changing the meshes given to my code (CFD grids being partitioned with METIS) brought the problem back. My current solution, which I haven't been able to break (yet), is to replace the Allgather with an Allgatherv, which I suppose is more efficient anyway since I send a different number of pieces of data from each process.
Here's the (I hope) relevant offending code in context; if I've missed something, the Allgather in question is on line 599 of this file.
// Get the number of mpiFaces on each processor (for later communication)
// 'nProcGrid' is the size of the communicator 'gridComm'
vector<int> nMpiFaces_proc(nProcGrid);
// This MPI_Allgather works just fine, every time
// int nMpiFaces is assigned on preceding lines
MPI_Allgather(&nMpiFaces,1,MPI_INT,nMpiFaces_proc.data(),1,MPI_INT,gridComm);
int maxNodesPerFace = (nDims==2) ? 2 : 4;
int maxNMpiFaces = getMax(nMpiFaces_proc);
// The matrix class is just a fancy wrapper around std::vector that
// allows for (i,j) indexing. The getSize() and getData() methods just
// call the size() and data() methods, respectively, of the underlying
// vector<int> object.
matrix<int> mpiFaceNodes_proc(nProcGrid,maxNMpiFaces*maxNodesPerFace);
// This is the MPI_Allgather which (sometimes) doesn't work.
// vector<int> mpiFaceNodes is assigned in preceding lines
MPI_Allgather(mpiFaceNodes.data(),mpiFaceNodes.size(),MPI_INT,
mpiFaceNodes_proc.getData(),maxNMpiFaces*maxNodesPerFace,
MPI_INT,gridComm);
I am currently using OpenMPI 1.6.4, g++ 4.9.2, and an AMD FX-8350 8-core processor with 16GB of RAM, running the latest updates of Elementary OS Freya 0.3 (basically Ubuntu 14.04). However, I have also had this issue on another machine using CentOS, Intel hardware, and MPICH2.
Any ideas? I have heard that it could be possible to change MPI's internal buffer size(s) to fix similar issues, but a quick try to do so (as shown in http://www.caps.ou.edu/pipermail/arpssupport/2002-May/000361.html) had no effect.
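For reference, the sort of thing I tried looked roughly like the following, on the assumption that Open MPI's shared-memory eager limit is the relevant "internal buffer size"; it made no difference:
mpirun --mca btl_sm_eager_limit 65536 -np 8 ./mpitest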
For reference, this issue is very similar to the one shown here: https://software.intel.com/en-us/forums/topic/285074, except that in my case, I have only 1 processor with 8 cores, on a single desktop computer.
UPDATE
I've managed to put together a minimalist example of this failure:
#include <iostream>
#include <vector>
#include <stdlib.h>
#include <time.h>
#include "mpi.h"

using namespace std;

int main(int argc, char* argv[])
{
  MPI_Init(&argc, &argv);

  int rank, nproc, newID, newRank, newSize;
  MPI_Comm newComm;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nproc);

  newID = rank % 2;
  MPI_Comm_split(MPI_COMM_WORLD, newID, rank, &newComm);
  MPI_Comm_rank(newComm, &newRank);
  MPI_Comm_size(newComm, &newSize);

  srand(time(NULL));

  // Get a different 'random' number for each rank on newComm
  //int nSend = rand()%10000;
  //for (int i=0; i<newRank; i++) nSend = rand()%10000;
  /*! -- Found a set of #'s which fail for nproc=8: -- */
  int badSizes[4] = {2695, 7045, 4256, 8745};
  int nSend = badSizes[newRank];

  cout << "Comm " << newID << ", rank " << newRank << ": nSend = " << nSend << endl;

  vector<int> send(nSend);
  for (int i = 0; i < nSend; i++)
    send[i] = rand();

  vector<int> nRecv(newSize);
  MPI_Allgather(&nSend, 1, MPI_INT, nRecv.data(), 1, MPI_INT, newComm);

  int maxNRecv = 0;
  for (int i = 0; i < newSize; i++)
    maxNRecv = max(maxNRecv, nRecv[i]);

  vector<int> recv(newSize * maxNRecv);

  MPI_Barrier(MPI_COMM_WORLD);
  cout << "rank " << rank << ": Allgather-ing data for communicator " << newID << endl;
  MPI_Allgather(send.data(), nSend, MPI_INT, recv.data(), maxNRecv, MPI_INT, newComm);
  cout << "rank " << rank << ": Done Allgathering-data for communicator " << newID << endl;

  MPI_Finalize();
  return 0;
}
The above code was compiled and run as:
mpicxx -std=c++11 mpiTest.cpp -o mpitest
mpirun -np 8 ./mpitest
with the following output on both my 16-core CentOS and my 8-core Ubuntu machines:
Comm 0, rank 0: nSend = 2695
Comm 1, rank 0: nSend = 2695
Comm 0, rank 1: nSend = 7045
Comm 1, rank 1: nSend = 7045
Comm 0, rank 2: nSend = 4256
Comm 1, rank 2: nSend = 4256
Comm 0, rank 3: nSend = 8745
Comm 1, rank 3: nSend = 8745
rank 5: Allgather-ing data for communicator 1
rank 6: Allgather-ing data for communicator 0
rank 7: Allgather-ing data for communicator 1
rank 0: Allgather-ing data for communicator 0
rank 1: Allgather-ing data for communicator 1
rank 2: Allgather-ing data for communicator 0
rank 3: Allgather-ing data for communicator 1
rank 4: Allgather-ing data for communicator 0
rank 5: Done Allgathering-data for communicator 1
rank 3: Done Allgathering-data for communicator 1
rank 4: Done Allgathering-data for communicator 0
rank 2: Done Allgathering-data for communicator 0
Note that only 2 of the ranks from each communicator exit the Allgather; this isn't what happens in my actual code (no ranks on the 'broken' communicator exit the Allgather), but the end result is the same - the code hangs until I kill it.
I'm guessing this has something to do with the differing number of sends on each process, but as far as I can tell from the MPI documentation and tutorials I've seen, this is supposed to be allowed, correct? Of course, the MPI_Allgatherv is a little more applicable, but for reasons of simplicity I have been using Allgather instead.
You must use MPI_Allgatherv if the input counts are not identical across all processes.
To be precise, what must match is the type signature (count, type), since technically you can get the same fundamental representation with different datatypes (e.g. N elements vs. 1 element that is a contiguous type of N elements). But if you use the same datatype argument everywhere, which is the common usage of MPI collectives, then your counts must match everywhere.
The relevant portion of the latest MPI standard (3.1) is on page 165:
The type signature associated with sendcount, sendtype, at a process
must be equal to the type signature associated with recvcount,
recvtype at any other process.
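Applied to the minimal example above, a sketch of the Allgatherv-based fix (reusing the example's nSend, nRecv, send, newSize, and newComm; the counts gathered into nRecv become the per-rank receive counts, plus displacements):
// Build displacements from the per-rank counts already gathered into nRecv
vector<int> displs(newSize, 0);
for (int i = 1; i < newSize; i++)
    displs[i] = displs[i-1] + nRecv[i-1];
int totalRecv = displs[newSize-1] + nRecv[newSize-1];

// Receive exactly nRecv[i] ints from rank i at offset displs[i]
vector<int> recv(totalRecv);
MPI_Allgatherv(send.data(), nSend, MPI_INT,
               recv.data(), nRecv.data(), displs.data(), MPI_INT, newComm);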
Related
In my code I have an arbitrary number of processes exchanging parts of their local vectors. The local vectors are vectors of pairs, and for this reason I've been using an MPI derived datatype. In principle I don't know how many elements each process sends to the others, so I also have to send the size of the buffer. In particular, each process exchanges data with the processes of rank myrank-1 and myrank+1. Process 0, instead of myrank-1, exchanges with the process of rank comm_size-1; likewise, process comm_size-1, instead of myrank+1, exchanges with the process of rank 0.
This is my code:
unsigned int size1tobesent;
size1tobesent = last.size();  // Buffer size

int slpartner = (rank + 1) % p;
int rlpartner = (rank - 1 + p) % p;

unsigned int sizereceived1;
MPI_Sendrecv(&size1tobesent, 1, MPI_UNSIGNED, slpartner, 0,
             &sizereceived1, 1, MPI_UNSIGNED, rlpartner, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

Vect first1(sizereceived1);
MPI_Sendrecv(&last[0], last.size(), mytype, slpartner, 0,
             &first1[0], sizereceived1, mytype, rlpartner, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

unsigned int size2tobesent;
size2tobesent = first.size();  // Buffer size2

unsigned int sizereceived2;
MPI_Sendrecv(&size2tobesent, 1, MPI_UNSIGNED, rlpartner, 0,
             &sizereceived2, 1, MPI_UNSIGNED, slpartner, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

Vect last1(sizereceived2);
MPI_Sendrecv(&first[0], first.size(), mytype, rlpartner, 0,
             &last1[0], sizereceived2, mytype, slpartner, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
Now when I run my code with 2 or 3 processes, all works as expected. With more than 3, the results are unpredictable. I don't know if this is due to a particular combination of the input data or if there is some theoretical error that I'm missing. Finally, consider that this code runs inside a for loop.
I am an MPI beginner. I have a large array gmat of numbers (type double, dimensions 1 x 14000000) which is precomputed and stored in a binary file. It will use approximately 100 MB in memory (14000000 x 8 bytes / 1024 / 1024). I want to write an MPI code which will do some computation on this array (for example, multiply all elements of gmat by the rank number of the process). This array gmat itself stays constant during run-time.
The code is supposed to be something like this
#include <iostream>
#include "mpi.h"

double* gmat;
long int imax;

int main(int argc, char* argv[])
{
    void performcomputation(int rank); // this function performs the computation and will be called by all processes
    imax = atoi(argv[1]); // user inputs the length of gmat
    MPI::Init();
    rank = MPI::COMM_WORLD.Get_rank();
    size = MPI::COMM_WORLD.Get_size(); // i will use -np 16 = 4 processors x 4 cores
    if rank==0 // read the gmat array using one of the processes
    {
        gmat = new double[imax];
        // read values of gmat from a file
        // next line is supposed to broadcast values of gmat to all processes which will use it
        MPI::COMM_WORLD.Bcast(&gmat, imax, MPI::DOUBLE, 1);
    }
    MPI::COMM_WORLD.Barrier();
    performcomputation(rank);
    MPI::Finalize();
    return 0;
}

void performcomputation(int rank)
{
    int i;
    for (i = 0; i < imax; i++)
        cout << "the new value is" << gmat[i]*rank << endl;
}
My question is: when I run this code using 16 processes (-np 16), is gmat the same for all of them? I mean, will the code use 16 x 100 MB of memory to store gmat for each process, or will it use only 100 MB since I have defined gmat to be global? And I don't want different processes to read gmat separately from the file, since reading so many numbers takes time. What is a better way to do this? Thanks.
First of all, please do not use the MPI C++ bindings. These were deprecated in MPI-2.2 and then deleted in MPI-3.0, and are therefore no longer part of the specification. This means that future MPI implementations are not even required to provide C++ bindings, and that if they do, they will probably diverge in what the interface looks like.
That said, your code contains a very common mistake:
if rank==0 // read the gmat array using one of the processes
{
    gmat = new double[imax];
    // read values of gmat from a file
    // next line is supposed to broadcast values of gmat to all processes which will use it
    MPI::COMM_WORLD.Bcast(&gmat, imax, MPI::DOUBLE, 1);
}
This won't work, as there are four errors here. First, gmat is only allocated at rank 0 and not in the other ranks, which is not what you want. Second, you are giving Bcast the address of the pointer gmat and not the address of the data it points to (i.e. you should not use the & operator). Third, you broadcast from rank 0 but pass 1 as the broadcast root argument. But the most important error is that MPI_BCAST is a collective communication call, and all ranks are required to call it, with the same value of the root argument, in order for it to complete successfully. The correct code (using the C bindings instead of the C++ ones) is:
gmat = new double[imax];
if (rank == 0)
{
    // read values of gmat from a file
}
MPI_Bcast(gmat, imax, MPI_DOUBLE, 0, MPI_COMM_WORLD);
//        ^^^^                    ^
//        no &                    root == 0
Each rank has its own copy of gmat. Initially all values are different (e.g. random or all zeros, depending on the memory allocator). After the broadcast, all copies become identical to the copy of gmat at rank 0. After the call to performcomputation(), each copy will be different again, since each rank multiplies the elements of gmat by a different number. The answer to your question is: the code will use 100 MiB in each rank, therefore 16 x 100 MiB in total.
MPI deals with distributed memory - processes do not share variables, no matter if they are local or global ones. The only way to share data is to use MPI calls like point-to-point communication (e.g. MPI_SEND / MPI_RECV), collective calls (e.g. MPI_BCAST) or one-sided communication (e.g. MPI_PUT / MPI_GET).
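Putting the pieces together, a minimal complete sketch of the corrected program using the C bindings (file reading elided; imax taken from the command line as in the question):
#include <iostream>
#include <cstdlib>
#include <mpi.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long int imax = atol(argv[1]);
    double* gmat = new double[imax];   // every rank allocates its own copy

    if (rank == 0) {
        // read values of gmat from a file (elided)
    }

    // all ranks call Bcast with root == 0; pass the data pointer, not &gmat
    MPI_Bcast(gmat, imax, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (long int i = 0; i < imax; i++)
        std::cout << "the new value is " << gmat[i]*rank << std::endl;

    delete[] gmat;
    MPI_Finalize();
    return 0;
}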
I have something like:
if (rank == winner) {
    ballPos[0] = rand() % 128;
    ballPos[1] = rand() % 64;
    cout << "new ball pos: " << ballPos[0] << " " << ballPos[1] << endl;
    MPI_Send(&ballPos, 2, MPI_INT, FIELD, NEW_BALL_POS_TAG, MPI_COMM_WORLD);
} else if (rank == FIELD) {
    MPI_Recv(&ballPos, 2, MPI_INT, winner, NEW_BALL_POS_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cout << "2 new ball pos: " << ballPos[0] << " " << ballPos[1] << endl;
}
But I see in console:
new ball pos: 28 59
2 new ball pos: 28 59
Why is it that the cout after the receive prints before the one before the send?
These are two different processes doing output at the same time. MPI implementations usually redirect the standard output of all processes, but it is usually buffered in order to improve performance and minimise network utilisation. The output from all processes is then sent to mpiexec (or to mpirun, or to whatever other command is used to launch the MPI job) and combined into its standard output. The order in which chunks/lines from different processes end up in the output is mostly random, so you must not expect a message from a certain rank to come up first unless some sort of process synchronisation is employed.
Also note that the MPI standard does not guarantee that it is possible for all ranks to write to the standard output. The standard provides the MPI_IO predefined attribute key that one can query on MPI_COMM_WORLD in order to obtain the rank of the process that is allowed to perform standard output. Most MPI implementations nowadays perform output redirection on all processes in the MPI job and thus return MPI_ANY_SOURCE for such attribute queries, but this is not guaranteed to always be the case.
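If you need deterministic ordering, you have to impose it yourself, e.g. by funnelling all messages to a single rank and printing them there in rank order. A sketch, assuming rank and size come from MPI_Comm_rank/MPI_Comm_size and reusing ballPos from the question:
// Funnel all output through rank 0 so lines appear in rank order
char msg[64];
snprintf(msg, sizeof(msg), "ball pos on rank %d: %d %d", rank, ballPos[0], ballPos[1]);
if (rank == 0) {
    std::cout << msg << std::endl;           // rank 0's own line first
    char buf[64];
    for (int src = 1; src < size; src++) {
        MPI_Recv(buf, sizeof(buf), MPI_CHAR, src, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        std::cout << buf << std::endl;       // then each rank in order
    }
} else {
    MPI_Send(msg, sizeof(msg), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
}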
I am just learning OpenMPI. Tried a simple MPI_Scatter example:
#include <mpi.h>
#include <iostream>

using namespace std;

int main() {
    int numProcs, rank;

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int* data;
    int num;

    data = new int[5];
    data[0] = 0;
    data[1] = 1;
    data[2] = 2;
    data[3] = 3;
    data[4] = 4;

    MPI_Scatter(data, 5, MPI_INT, &num, 5, MPI_INT, 0, MPI_COMM_WORLD);
    cout << rank << " received " << num << endl;

    MPI_Finalize();
    return 0;
}
But it didn't work as expected ...
I was expecting something like
0 received 0
1 received 1
2 received 2 ...
But what I got was
32609 received
1761637486 received
1 received
33 received
1601007716 received
What's with the weird ranks? It seems to be something to do with my scatter. Also, why are the sendcount and recvcount the same? At first I thought that since I'm scattering 5 elements to 5 processes, each would get 1, so I should be using:
MPI_Scatter(data, 5, MPI_INT, &num, 1, MPI_INT, 0, MPI_COMM_WORLD);
But this gives an error:
[JM:2861] *** An error occurred in MPI_Scatter
[JM:2861] *** on communicator MPI_COMM_WORLD
[JM:2861] *** MPI_ERR_TRUNCATE: message truncated
[JM:2861] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
I am wondering, though, why do I need to differentiate between root and child processes? It seems like in this case the source/root will also get a copy? Another thing: will other processes run scatter too? Probably not, but why? I thought all processes would run this code, since it's not inside the typical if I see in MPI programs:
if (rank == xxx) {
UPDATE
I noticed that, to make it run, the send and receive buffers must be of the same length ... and the data should be declared like:
int data[5][5] = { {0}, {5}, {10}, {3}, {4} };
Notice that each row is declared with length 5 but I only initialized 1 value per row? What is actually happening here? Is this code correct? Suppose I only want each process to receive 1 value.
sendcount is the number of elements you want to send to each process, not the count of elements in the send buffer. MPI_Scatter will simply take sendcount * [number of processes in the communicator] elements from the send buffer of the root process and scatter them to all processes in the communicator.
So to send 1 element to each of the processes in the communicator (assume there are 5 processes), set sendcount and recvcount to be 1.
MPI_Scatter(data, 1, MPI_INT, &num, 1, MPI_INT, 0, MPI_COMM_WORLD);
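Put together, a minimal complete sketch of the corrected program (only the root needs to allocate and fill the send buffer):
#include <mpi.h>
#include <iostream>
using namespace std;

int main() {
    int numProcs, rank;
    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int* data = NULL;
    if (rank == 0) {                 // only the root needs the send buffer
        data = new int[numProcs];
        for (int i = 0; i < numProcs; i++)
            data[i] = i;
    }

    int num;
    // one element per process: sendcount == recvcount == 1
    MPI_Scatter(data, 1, MPI_INT, &num, 1, MPI_INT, 0, MPI_COMM_WORLD);
    cout << rank << " received " << num << endl;

    delete[] data;                   // safe: delete of NULL is a no-op
    MPI_Finalize();
    return 0;
}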
There are restrictions on the possible datatype pairs and they are the same as for point-to-point operations. The type map of recvtype should be compatible with the type map of sendtype, i.e. they should have the same list of underlying basic datatypes. Also the receive buffer should be large enough to hold the received message (it might be larger, but not smaller). In most simple cases, the data type on both send and receive sides are the same. So sendcount - recvcount pair and sendtype - recvtype pair usually end up the same. An example where they can differ is when one uses user-defined datatype(s) on either side:
MPI_Datatype vec5int;
MPI_Type_contiguous(5, MPI_INT, &vec5int);
MPI_Type_commit(&vec5int);
MPI_Scatter(data, 5, MPI_INT, local_data, 1, vec5int, 0, MPI_COMM_WORLD);
This works since the sender constructs messages of 5 elements of type MPI_INT while each receiver interprets the message as a single instance of a 5-element integer vector.
(Note that you specify the maximum number of elements to be received in MPI_Recv and the actual amount received might be less, which can be obtained by MPI_Get_count. In contrast, you supply the expected number of elements to be received in recvcount of MPI_Scatter so error will be thrown if the message length received is not exactly the same as promised.)
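A small sketch of the point-to-point case, with hypothetical buffer names:
// Receive *up to* 100 ints, then ask how many actually arrived
MPI_Status status;
int buf[100];
MPI_Recv(buf, 100, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
int actual;
MPI_Get_count(&status, MPI_INT, &actual);  // actual <= 100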
Probably you know by now that the weird ranks printed out are caused by stack corruption, since num can only hold 1 int but 5 ints are received in MPI_Scatter.
I am wondering, though, why do I need to differentiate between root and child processes? It seems like in this case the source/root will also get a copy? Another thing: will other processes run scatter too? Probably not, but why? I thought all processes would run this code, since it's not inside the typical if I see in MPI programs?
It is necessary to differentiate between the root and the other processes in the communicator (they are not child processes of the root, since they may run on separate computers) in some operations such as Scatter and Gather, since these are collective (group) communications but with a single source/destination. The single source/destination (the odd one out) is therefore called the root. It is necessary for all processes to know the source/destination (the root process) in order to set up the sends and receives correctly.
The root process, in the case of Scatter, will also receive a piece of data (from itself), and in the case of Gather, will also include its data in the final result. There is no exception for the root process, unless "in place" operations are used. This applies to all collective communication functions.
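For illustration, a sketch of an "in place" gather, with hypothetical names (myValue, recvbuf, root):
// The root's own value is already at its slot in recvbuf, so it passes
// MPI_IN_PLACE instead of a send buffer; the other ranks send normally.
if (rank == root) {
    recvbuf[root] = myValue;
    MPI_Gather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
               recvbuf, 1, MPI_INT, root, MPI_COMM_WORLD);
} else {
    MPI_Gather(&myValue, 1, MPI_INT, NULL, 0, MPI_INT, root, MPI_COMM_WORLD);
}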
There are also root-less global communication operations like MPI_Allgather, where one does not provide a root rank. Rather, all ranks receive the data being gathered.
All processes in the communicator will run the function (try to exclude one process in the communicator and you will get a deadlock). You can imagine processes on different computers blindly running the same code. However, since each may belong to a different communicator group and have a different rank, the function will run differently. Each process knows whether it is a member of the communicator, and each knows its own rank and can compare it to the rank of the root process (if any), so they can set up the communication or take extra actions accordingly.
I have 3 function and 4 cores. I want execute each function in new thread using MPI and C++
I write this
int rank, size;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
size--;

if (rank == 0)
{
    Thread1();
}
else
{
    if (rank == 1)
    {
        Thread2();
    }
    else
    {
        Thread3();
    }
}

MPI_Finalize();
But it executes just Thread1(). How must I change the code?
Thanks!
Print the current value of the variable size to the screen (possibly without decrementing it) and you will find it is 1. That is: there is only 1 process running.
You are likely running your compiled code the wrong way. Consider using mpirun (or mpiexec, depending on your MPI implementation) to execute it, i.e.
mpirun -np 4 ./MyCompiledCode
The -np parameter specifies the number of processes to start (this way, your MPI_Comm_size will be 4 as you expect).
Currently, though, you are not using anything that is specific to C++. You could consider a C++ binding of MPI such as Boost.MPI.
I worked on the code you provided and changed it a little, producing this working MPI code (corrections marked in capital letters).
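In outline, the working version looks like this (a sketch reconstructed to match the output below; the corrections are in the capitalised comments):
#include <iostream>
#include <mpi.h>

void Thread1() { std::cout << "thread1" << std::endl; }
void Thread2() { std::cout << "thread2" << std::endl; }
void Thread3() { std::cout << "thread3" << std::endl; }

int main(int argc, char** argv)   // ARGC/ARGV ARE NEEDED BY MPI_Init
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    std::cout << "size is " << size << std::endl;   // DO NOT DECREMENT size

    if (rank == 0) {
        std::cout << "1 function started." << std::endl;
        Thread1();
        std::cout << "1 function ended." << std::endl;
    } else if (rank == 1) {
        std::cout << "2 function started." << std::endl;
        Thread2();
        std::cout << "2 function ended." << std::endl;
    } else {
        std::cout << "3 function started." << std::endl;
        Thread3();
        std::cout << "3 function ended." << std::endl;
    }

    MPI_Finalize();
    return 0;
}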
FYI:
compilation (under gcc, mpich):
$ mpicxx -c mpi1.cpp
$ mpicxx -o mpi1 mpi1.o
execution
$ mpirun -np 4 ./mpi1
output
size is 4
size is 4
size is 4
2 function started.
thread2
3 function started.
thread3
3 function ended.
2 function ended.
size is 4
1 function started.
thread1
1 function ended.
be aware that the stdout ordering is likely to be mixed up.
Are you sure you are compiling your code the right way?
Your problem is that MPI provides no way to feed console input to all processes; it goes only to the process with rank 0. The culprit is the first few lines of main:
int main(int argc, char *argv[]) {
    int oper;
    std::cout << "Enter Size:";
    std::cin >> oper; // <------- The problem is right here
    Operations* operations = new Operations(oper);
    int rank, size;
    MPI_Init(&argc, &argv);
    int tid;
    MPI_Comm_rank(MPI_COMM_WORLD, &tid);
    switch (tid)
    {
all processes but rank 0 block waiting for console input which they cannot get. You should rewrite the beginning of your main function as follows:
int main(int argc, char *argv[]) {
    int oper;
    MPI_Init(&argc, &argv);
    int tid;
    MPI_Comm_rank(MPI_COMM_WORLD, &tid);
    if (tid == 0) {
        std::cout << "Enter Size:";
        std::cin >> oper;
    }
    MPI_Bcast(&oper, 1, MPI_INT, 0, MPI_COMM_WORLD);
    Operations* operations = new Operations(oper);
    switch (tid)
    {
It works as follows: only rank 0 displays the prompt and then reads the console input into oper. The value of oper is then broadcast from rank 0 so that all other processes obtain the correct value; every process then creates its Operations object and branches to the appropriate function.