I have a C++ MPI program that runs on a Windows HPC cluster (12 nodes, 24 cores per node).
The logic of the program is really simple:
There is a pool of tasks.
At the start, the program divides the tasks equally among the MPI processes.
Each MPI process executes its tasks.
After everything is finished, an MPI reduce gathers the results to the root process.
There is one problem: each task can have a drastically different execution time, and there is no way I can tell that in advance. Distributing the tasks equally leaves a lot of processes waiting idle, which wastes a lot of compute resources and makes the total execution time longer.
I am thinking of one solution that might work.
The process is like this:
1) The task pool is divided into small parcels (say, 10 tasks per parcel).
2) Each MPI process takes one parcel at a time whenever it is idle (it has not received a parcel yet, or it has finished its previous parcel).
3) Step 2 is repeated until the task pool is exhausted.
4) An MPI reduce gathers all the results to the root process.
As far as I understand, this scheme needs a universal counter across nodes/processes (to prevent different MPI processes from executing the same parcel), and changing it needs some lock/sync mechanism. That certainly has overhead, but with proper tuning I think it can improve performance.
I am not quite familiar with MPI and have some implementation issues. I can think of two ways to implement this universal counter:
1) Using MPI I/O: write the counter to a file and increase it whenever a parcel is taken (this will certainly need a file-locking mechanism).
2) Using MPI one-sided communication / shared memory: put the counter in shared memory and increase it when a parcel is taken (this will certainly need a sync mechanism); a rough sketch of this idea is included below, before the follow-up.
Unfortunately, I am not familiar with either technique and want to explore the possibility, implementation, and possible drawbacks of the two methods above. Sample code would be greatly appreciated.
If you have other ways to solve the problem or other suggestions, that would also be great. Thanks.
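To make option 2 concrete, this is roughly the shape I have in mind: a minimal, untested sketch of a shared parcel counter using MPI one-sided communication (MPI_Win_create plus MPI_Fetch_and_op). numParcels and the task execution are placeholders.

#include <iostream>
#include <mpi.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    const int numParcels = 50;   // placeholder for the real number of parcels
    int counter = 0;             // the shared counter lives in rank 0's memory

    // Only rank 0 contributes memory to the window; the other ranks expose nothing.
    MPI_Win win;
    MPI_Win_create(rank == 0 ? &counter : NULL,
                   rank == 0 ? sizeof(int) : 0,
                   sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    while (true) {
        int myParcel = -1;
        const int one = 1;
        // Atomically read the current counter on rank 0 and add 1 to it.
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
        MPI_Fetch_and_op(&one, &myParcel, MPI_INT, 0, 0, MPI_SUM, win);
        MPI_Win_unlock(0, win);

        if (myParcel >= numParcels) break;   // pool exhausted
        std::cout << "rank " << rank << " took parcel " << myParcel << std::endl;
        // ... execute the tasks of parcel myParcel here ...
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}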
Follow-ups:
Thanks for all the useful suggestions. I have implemented a test program following the scheme of using process 0 as the task distributor.
#include <iostream>
#include <mpi.h>

using namespace std;

void doTask(int rank, int i) {
    cout << rank << " got task " << i << endl;
}

int main()
{
    int numTasks = 5000;
    int parcelSize = 100;
    int numParcels = (numTasks / parcelSize) + (numTasks % parcelSize == 0 ? 0 : 1);
    //cout << numParcels << endl;

    MPI_Init(NULL, NULL);
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    MPI_Status status;
    MPI_Request request;

    int ready = 0;
    int i = 0;
    int maxParcelNow = 0;

    if (rank == 0) {
        for (i = 0; i < numParcels; i++) {
            MPI_Recv(&ready, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            //cout << i << "Yes" << endl;
            MPI_Send(&i, 1, MPI_INT, status.MPI_SOURCE, 0, MPI_COMM_WORLD);
            //cout << i << "No" << endl;
        }
        maxParcelNow = i;
        cout << maxParcelNow << " " << numParcels << endl;
    } else {
        int counter = 0;
        while (true) {
            if (maxParcelNow == numParcels) {
                cout << "Yes exiting" << endl;
                break;
            }
            //if(maxParcelNow == numParcels - 1) break;
            ready = 1;
            MPI_Send(&ready, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            //cout << rank << "send" << endl;
            MPI_Recv(&i, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            //cout << rank << "recv" << endl;
            doTask(rank, i);
        }
    }
    MPI_Bcast(&maxParcelNow, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
It does not work, and it never stops. Any suggestions on how to make it work? Does this code reflect the idea correctly, or am I missing something? Thanks.
[Converting my comments into an answer...]
Given n processes, you can have your first process p0 dispatch tasks for the other n - 1 processes. First, it will do point-to-point communication to the other n - 1 processes so that everyone has work to do, and then it will block on a Recv. When any given process completes, say p3, it will send its result back to p0. At this point, p0 will send another message to p3 with one of two things:
1) Another task
or
2) Some kind of termination signal if there are no tasks remaining. (Using the 'tag' of the message is one easy way.)
Obviously, p0 will loop over that logic until there is no task left, in which case it will call MPI_Finalize too.
Unlike what you thought in your comments, this isn't round-robin. It first gives a job to every process, or worker, and then gives back another job whenever one completes...
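A minimal sketch of that pattern follows (not your exact program: doParcel, numParcels, and the int result are placeholders, and it assumes there are at least as many parcels as workers). Note that a worker only exits when it receives the termination tag from rank 0, which is exactly what the follow-up code above is missing:

#include <iostream>
#include <mpi.h>

const int TAG_WORK = 0;   // message carries a parcel index
const int TAG_STOP = 1;   // message tells the worker to stop

int doParcel(int rank, int parcel)
{
    std::cout << "rank " << rank << " works on parcel " << parcel << std::endl;
    return parcel;   // placeholder result
}

int main(int argc, char* argv[])
{
    const int numParcels = 50;   // placeholder
    MPI_Init(&argc, &argv);
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    if (rank == 0) {
        int nextParcel = 0;
        // 1) Give every worker its first parcel.
        for (int w = 1; w < nproc; w++) {
            MPI_Send(&nextParcel, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
            nextParcel++;
        }
        // 2) Each time a worker reports back, hand it another parcel or tell it to stop.
        int activeWorkers = nproc - 1;
        while (activeWorkers > 0) {
            int result;
            MPI_Status status;
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            if (nextParcel < numParcels) {
                MPI_Send(&nextParcel, 1, MPI_INT, status.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                nextParcel++;
            } else {
                int dummy = 0;
                MPI_Send(&dummy, 1, MPI_INT, status.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
                activeWorkers--;
            }
        }
    } else {
        while (true) {
            int parcel;
            MPI_Status status;
            MPI_Recv(&parcel, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            if (status.MPI_TAG == TAG_STOP) break;   // no tasks left
            int result = doParcel(rank, parcel);
            MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}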
Related
I am trying to send a message from one process to all other MPI processes and also receive messages from all of those processes. It is basically an all-to-all communication where every process sends a message to every other process (except itself) and receives a message from every other process.
The following example code snippet shows what I am trying to achieve. Now, the problem with MPI_Send is its behavior: for small message sizes it acts as non-blocking, but for larger messages (on my machine, BUFFER_SIZE 16400) it blocks. I am aware that this is how MPI_Send behaves. As a workaround, I replaced the code below with a blocking (send + recv) call, MPI_Sendrecv, for example: MPI_Sendrecv(intSendPack, BUFFER_SIZE, MPI_INT, processId, MPI_TAG, intReceivePack, BUFFER_SIZE, MPI_INT, processId, MPI_TAG, MPI_COMM_WORLD, MPI_STATUSES_IGNORE). I make that call for all processes of MPI_COMM_WORLD inside a loop over every rank, and this approach gives me what I am trying to achieve (all-to-all communication). However, this call takes a lot of time, which I want to cut down with some more time-efficient approach. I have tried MPI scatter and gather to perform the all-to-all communication, but one issue there is that the buffer size (16400) may differ between iterations of the MPI_all_to_all function call in the actual implementation. Here I am using MPI_TAG to differentiate the calls in different iterations, which I cannot do with the scatter and gather functions.
#define BUFFER_SIZE 16400

void MPI_all_to_all(int MPI_TAG)
{
    int size;
    int rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int* intSendPack = new int[BUFFER_SIZE]();
    int* intReceivePack = new int[BUFFER_SIZE]();

    // Send to every other rank ...
    for (int prId = 0; prId < size; prId++) {
        if (prId != rank) {
            MPI_Send(intSendPack, BUFFER_SIZE, MPI_INT, prId, MPI_TAG,
                     MPI_COMM_WORLD);
        }
    }
    // ... then receive from every other rank.
    for (int sId = 0; sId < size; sId++) {
        if (sId != rank) {
            MPI_Recv(intReceivePack, BUFFER_SIZE, MPI_INT, sId, MPI_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }

    delete[] intSendPack;
    delete[] intReceivePack;
}
I want to know if there is a way I can perform all-to-all communication using a more efficient communication model. I am not wedded to MPI_Send; if there is some other way that gives me what I am trying to achieve, I am happy with that. Any help or suggestion is much appreciated.
This is a benchmark that lets you compare the performance of collective vs. point-to-point communication in an all-to-all exchange:
#include <iostream>
#include <algorithm>
#include <mpi.h>

#define BUFFER_SIZE 16384

void point2point(int*, int*, int, int);

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank_id = 0, com_sz = 0;
    double t0 = 0.0, tf = 0.0;
    MPI_Comm_size(MPI_COMM_WORLD, &com_sz);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank_id);

    int* intSendPack = new int[BUFFER_SIZE]();
    int* result = new int[BUFFER_SIZE*com_sz]();
    std::fill(intSendPack, intSendPack + BUFFER_SIZE, rank_id);
    std::fill(result + BUFFER_SIZE*rank_id, result + BUFFER_SIZE*(rank_id+1), rank_id);

    // Send-Receive
    t0 = MPI_Wtime();
    point2point(intSendPack, result, rank_id, com_sz);
    MPI_Barrier(MPI_COMM_WORLD);
    tf = MPI_Wtime();
    if (!rank_id)
        std::cout << "Send-receive time: " << tf - t0 << std::endl;

    // Collective
    std::fill(result, result + BUFFER_SIZE*com_sz, 0);
    std::fill(result + BUFFER_SIZE*rank_id, result + BUFFER_SIZE*(rank_id+1), rank_id);
    t0 = MPI_Wtime();
    MPI_Allgather(intSendPack, BUFFER_SIZE, MPI_INT, result, BUFFER_SIZE, MPI_INT, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);
    tf = MPI_Wtime();
    if (!rank_id)
        std::cout << "Allgather time: " << tf - t0 << std::endl;

    MPI_Finalize();
    delete[] intSendPack;
    delete[] result;
    return 0;
}

// Send/receive communication
void point2point(int* send_buf, int* result, int rank_id, int com_sz)
{
    MPI_Status status;
    // Exchange and store the data
    for (int i = 0; i < com_sz; i++) {
        if (i != rank_id) {
            MPI_Sendrecv(send_buf, BUFFER_SIZE, MPI_INT, i, 0,
                         result + i*BUFFER_SIZE, BUFFER_SIZE, MPI_INT, i, 0,
                         MPI_COMM_WORLD, &status);
        }
    }
}
Here every rank contributes its own array intSendPack to the array result on all other ranks, and result should end up the same on all of them. result is flat: each rank's block is BUFFER_SIZE entries starting at rank_id*BUFFER_SIZE. Between the point-to-point and collective runs, the array is reset to its original state.
Time is measured by placing an MPI_Barrier before stopping the clock, which gives you the maximum time across all ranks.
I ran the benchmark on 1 node of NERSC Cori KNL using Slurm. I ran each case a few times just to make sure the values are consistent and I'm not looking at an outlier, but you should run it maybe 10 or so times to collect more proper statistics.
Here are some thoughts:
For a small number of processes (5) and a large buffer size (16384), collective communication is about twice as fast as point-to-point, but it becomes about 4-5 times faster when moving to a larger number of ranks (64).
In this benchmark there is not much difference between the recommended Slurm settings on that specific machine and the default settings, but in real, larger programs with more communication there is a very significant one (a job that runs in under a minute with the recommended settings can run for 20-30 minutes or more with the defaults). The point of this is: check your settings, it may make a difference.
What you were seeing with Send/Receive for larger messages was actually a deadlock. I saw it too for the message size shown in this benchmark. In case you missed those, there are two SO posts worth reading on it: a buffering explanation and a word on deadlocking.
In summary, adjust this benchmark to represent your code more closely and run it on your system, but collective communication in all-to-all or one-to-all situations should be faster because of dedicated optimizations such as superior algorithms for arranging the communication. A 2-5x speedup is considerable, since communication often contributes the most to the overall time.
I wrote a simple hello-world program in Visual C++ 2010 Express with the MPI library and can't understand why my code is not working.
MPI_Init( NULL, NULL );
MPI_Comm_size(MPI_COMM_WORLD,&size);
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
int a, b = 5;
MPI_Status st;
MPI_Send( &b, 1, MPI_INT, 0,0, MPI_COMM_WORLD );
MPI_Recv( &a, 1, MPI_INT, 0,0, MPI_COMM_WORLD, &st );
MPI_Send tells me "DEADLOCK: attempting to send a message to the local process without a prior matching receive". If I put the Recv first, the program gets stuck there (no data has been sent, and the receive is blocking).
What am I doing wrong?
My IDE is Visual C++ 2010 Express. MPI is from the HPC SDK 2008 (32-bit).
You need something like this:
assert(size >= 2);
if (rank == 0)
    MPI_Send(&b, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
if (rank == 1)
    MPI_Recv(&a, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &st);
The idea of MPI is that the whole system operates in lockstep. And sometimes you do need to be aware of which participant you are in the "world." In this case, assuming you have two members (as per my assert), you need to make one of them send and the other receive.
Note also that I changed the "dest" parameter of the send, because 0 needs to send to 1 therefore 1 needs to receive from 0.
You can later do it the other way around if you wish (if each needs to tell the other something), but in such a case you may find even more efficient ways to do it using "collective operations" where you can exchange (both send and receive) with all the peers.
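For instance, a small sketch of such an exchange using MPI_Allgather (not part of the original question; it assumes the same two-rank setup and the variable b from above):

int all_b[2];   // one slot per rank, assuming size == 2 as asserted above
// Every rank contributes its b and receives everyone's b, with no explicit
// pairing of individual sends and receives.
MPI_Allgather(&b, 1, MPI_INT, all_b, 1, MPI_INT, MPI_COMM_WORLD);
// Now all_b[0] holds rank 0's b and all_b[1] holds rank 1's b, on both ranks.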
In your example code, you're sending to and receiving from rank 0. If you are only running your MPI program with 1 process (which makes no sense, but we'll accept it for the sake of argument), you could make this work by using non-blocking calls instead of the blocking version. It would change your program to look like this:
MPI_Init( NULL, NULL );
MPI_Comm_size(MPI_COMM_WORLD,&size);
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
int a, b = 5;
MPI_Status st[2];
MPI_Request request[2];
MPI_Isend( &b, 1, MPI_INT, 0,0, MPI_COMM_WORLD, &request[0] );
MPI_Irecv( &a, 1, MPI_INT, 0,0, MPI_COMM_WORLD, &request[1] );
MPI_Waitall( 2, request, st );
That would let both the send and the receive complete at the same time. The reason your MPI implementation doesn't like your original code (and it is very nice of it to tell you such a thing) is that the call to MPI_SEND could block until the matching MPI_RECV is done, which in this case would never happen because the MPI_RECV would only get called after the MPI_SEND returned: a circular dependency.
In MPI, when you add an 'I' before an MPI call, it means "Immediate", as in, the call will return immediately and complete all the work later, when you call MPI_WAIT (or some version of it, like MPI_WAITALL in this example). So what we did here was to make the send and receive return immediately, basically just telling MPI that we intend to do a send and receive with rank 0 at some point in the future, then later (the next line), we tell MPI to go ahead and finish those calls now.
The benefit of using the immediate version of these calls is that theoretically, MPI can do some things in the background to let the send and receive calls make progress while your application is doing something else that doesn't rely on the result of that data. Then, when you finish the call to MPI_WAIT* later, the data is available and you can do whatever you need to do.
I am just learning OpenMPI. I tried a simple MPI_Scatter example:
#include <mpi.h>
#include <iostream>

using namespace std;

int main() {
    int numProcs, rank;

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int* data;
    int num;

    data = new int[5];
    data[0] = 0;
    data[1] = 1;
    data[2] = 2;
    data[3] = 3;
    data[4] = 4;

    MPI_Scatter(data, 5, MPI_INT, &num, 5, MPI_INT, 0, MPI_COMM_WORLD);
    cout << rank << " received " << num << endl;

    MPI_Finalize();
    return 0;
}
But it didn't work as expected ...
I was expecting something like
0 received 0
1 received 1
2 received 2 ...
But what I got was
32609 received
1761637486 received
1 received
33 received
1601007716 received
What's with the weird ranks? It seems to be something to do with my scatter. Also, why are the sendcount and recvcount the same? At first I thought that since I'm scattering 5 elements to 5 processors, each would get 1, so I should be using:
MPI_Scatter(data, 5, MPI_INT, &num, 1, MPI_INT, 0, MPI_COMM_WORLD);
But this gives an error:
[JM:2861] *** An error occurred in MPI_Scatter
[JM:2861] *** on communicator MPI_COMM_WORLD
[JM:2861] *** MPI_ERR_TRUNCATE: message truncated
[JM:2861] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
I am wondering, though, why do I need to differentiate between root and child processes? It seems like in this case the source/root will also get a copy? Another thing: will the other processes run scatter too? Probably not, but why? I thought all processes would run this code, since it's not inside the typical if I see in MPI programs:
if (rank == xxx) {
UPDATE
I noticed that, for it to run, the send and receive buffers must be of the same length ... and the data should be declared like:
int data[5][5] = { {0}, {5}, {10}, {3}, {4} };
Notice that each row is declared with 5 columns but I only initialized 1 value per row? What is actually happening here? Is this code correct? Suppose I only want each process to receive 1 value.
sendcount is the number of elements you want to send to each process, not the count of elements in the send buffer. MPI_Scatter will just take sendcount * [number of processes in the communicator] elements from the send buffer of the root process and scatter them to all processes in the communicator.
So to send 1 element to each of the processes in the communicator (assume there are 5 processes), set sendcount and recvcount to be 1.
MPI_Scatter(data, 1, MPI_INT, &num, 1, MPI_INT, 0, MPI_COMM_WORLD);
There are restrictions on the possible datatype pairs and they are the same as for point-to-point operations. The type map of recvtype should be compatible with the type map of sendtype, i.e. they should have the same list of underlying basic datatypes. Also the receive buffer should be large enough to hold the received message (it might be larger, but not smaller). In most simple cases, the data type on both send and receive sides are the same. So sendcount - recvcount pair and sendtype - recvtype pair usually end up the same. An example where they can differ is when one uses user-defined datatype(s) on either side:
MPI_Datatype vec5int;
MPI_Type_contiguous(5, MPI_INT, &vec5int);
MPI_Type_commit(&vec5int);
MPI_Scatter(data, 5, MPI_INT, local_data, 1, vec5int, 0, MPI_COMM_WORLD);
This works since the sender constructs messages of 5 elements of type MPI_INT while each receiver interprets the message as a single instance of a 5-element integer vector.
(Note that you specify the maximum number of elements to be received in MPI_Recv, and the actual amount received might be less, which can be obtained with MPI_Get_count. In contrast, you supply the expected number of elements to be received in the recvcount of MPI_Scatter, so an error will be thrown if the received message length is not exactly as promised.)
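As a small illustration of that difference (a sketch, not tied to the question's code):

int buf[100];
int received = 0;
MPI_Status status;
// Receive at most 100 ints; the sender may have sent fewer.
MPI_Recv(buf, 100, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
MPI_Get_count(&status, MPI_INT, &received);   // how many ints actually arrived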
Probably you know by now that the weird ranks printed out are caused by stack corruption, since num can only contain 1 int but 5 ints are received in MPI_Scatter.
I am wondering, though, why do I need to differentiate between root and child processes? It seems like in this case the source/root will also get a copy? Another thing: will the other processes run scatter too? Probably not, but why? I thought all processes would run this code, since it's not inside the typical if I see in MPI programs.
It is necessary to differentiate between the root and the other processes in the communicator (they are not child processes of the root, since they may be on separate computers) in some operations such as Scatter and Gather, because these are collective (group) communications but with a single source/destination. The single source/destination (the odd one out) is therefore called the root. It is necessary for all the processes to know the source/destination (the root process) to set up the send and receive correctly.
The root process, in case of Scatter, will also receive a piece of data (from itself), and in case of Gather, will also include its data in the final result. There is no exception for the root process, unless "in place" operations are used. This also applies to all collective communication functions.
There are also root-less global communication operations like MPI_Allgather, where one does not provide a root rank. Rather all ranks receive the data being gathered.
All processes in the communicator will run the function (try to exclude one process in the communicator and you will get a deadlock). You can imagine processes on different computers running the same code blindly. However, since each of them may belong to a different communicator group and has a different rank, the function will run differently. Each process knows whether it is a member of the communicator, and each knows its own rank and can compare it to the rank of the root process (if any), so they can set up the communication or take extra actions accordingly.
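As a small sketch of the "in place" variant mentioned above (illustrative only; root, value, and the fixed-size buffer are assumptions, not part of the question):

int value = rank;     // each rank's contribution
int recvbuf[64];      // assumes size <= 64 for this sketch
if (rank == root) {
    recvbuf[root] = value;   // the root's own contribution is already "in place"
    MPI_Gather(MPI_IN_PLACE, 1, MPI_INT, recvbuf, 1, MPI_INT, root, MPI_COMM_WORLD);
} else {
    // On non-root ranks the receive arguments are ignored.
    MPI_Gather(&value, 1, MPI_INT, NULL, 0, MPI_INT, root, MPI_COMM_WORLD);
}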
I am relatively new to MPI (with C), and am having some trouble using MPI_Bcast to send an int to all processes.
In my code, I decide which rank is the root inside a for loop, where different processes are responsible for different elements of the loop. Then I want to Bcast a result from the root to all processes, but the non-root processes do not know which rank to expect the bcast from, so they do not receive it.
The code block looks something like this:
for (iX = start; iX < end; iX++) {
    // start and end are the starting row and ending row for each process, defined earlier
    for (int iY = 1; iY < nn; iY++) {
        // do some calculations
        if (some condition) {
            int bcastroot = rank;   // initialized above
            int X = 111;            // initialized above
        } else {
            // do other calculations
        }
    }
}
MPI_Bcast(&X, 1, MPI_INT, bcastroot, comm);
// remainder of code and MPI_FINALIZE
When I execute this code, the default value of bcastroot (the value on all non-root processes) competes with the real root, so X is not broadcast correctly. I do not know the value of X, nor can I predict the root beforehand, so I cannot define it in advance.
I have tried initializing bcastroot = -1 and then setting it on the root rank, but this does not work. Is there a way I can Bcast this value without setting the root on all processes?
Thanks,
JonZor
There is no way to do an MPI_Bcast where the receivers don't know what the root is. If you know there will only be one root, you can first do an MPI_Allreduce to agree on it:
int root, maybe_root;
int i_am_root = ...;
maybe_root = (i_am_root ? rank : 0);
MPI_Allreduce(&maybe_root, &root, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
then every rank will know the same root and you can do your broadcast.
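For example (a sketch of the final step, using the X from your question):

// Every rank now agrees on the same root, so the broadcast matches everywhere.
MPI_Bcast(&X, 1, MPI_INT, root, MPI_COMM_WORLD);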
I'm writing an MPI program (Visual Studio 2k8 + MSMPI) that uses Boost::thread to spawn two threads per MPI process, and have run into a problem I'm having trouble tracking down.
When I run the program with: mpiexec -n 2 program.exe, one of the processes suddenly terminates:
job aborted:
[ranks] message
[0] terminated
[1] process exited without calling finalize
---- error analysis -----
[1] on winblows
program.exe ended prematurely and may have crashed. exit code 0xc0000005
---- error analysis -----
I have no idea why the first process is suddenly terminating, and can't figure out how to track down the reason. This happens even if I put the rank-zero process into an infinite loop at the end of all of its operations... it just suddenly dies. My main function looks like this:
int _tmain(int argc, _TCHAR* argv[])
{
    /* Initialize the MPI execution environment. */
    MPI_Init(0, NULL);

    /* Create the worker threads. */
    boost::thread masterThread(&Master);
    boost::thread slaveThread(&Slave);

    /* Wait for the local test thread to end. */
    masterThread.join();
    slaveThread.join();

    /* Shutdown. */
    MPI_Finalize();
    return 0;
}
Where the Master and Slave functions do some arbitrary work before ending. I can confirm that the master thread, at the very least, is reaching the end of its operations. The slave thread is always the one that isn't done before the execution gets aborted. Using print statements, it seems like the slave thread isn't actually hitting any errors... it's happily moving along and just gets taken out in the crash.
So, does anyone have any ideas for:
a) What could be causing this?
b) How should I go about debugging it?
Thanks so much!
Edit:
Posting minimal versions of the Master/Slave functions. Note that the goal of this program is purely for demonstration purposes... so it isn't doing anything useful. Essentially, the master thread sends a dummy payload to the slave thread of the other MPI process.
void Master()
{
    int myRank;
    int numProcs;
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);

    /* Create a message with numbers 0 through 39 as the payload, addressed
     * to this thread. */
    int *payload = new int[40];
    for (int n = 0; n < 40; n++) {
        payload[n] = n;
    }

    if (myRank == 0) {
        MPI_Send(payload, 40, MPI_INT, 1, MPI_ANY_TAG, MPI_COMM_WORLD);
    } else {
        MPI_Send(payload, 40, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD);
    }

    /* Free memory. */
    delete(payload);
}

void Slave()
{
    MPI_Status status;
    int *payload = new int[40];
    MPI_Recv(payload, 40, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

    /* Free memory. */
    delete(payload);
}
You have to use the thread-safe version of the MPI runtime.
Read up on MPI_Init_thread.
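A minimal sketch of what that looks like in the question's main (assuming both threads make MPI calls concurrently, which they do here, so MPI_THREAD_MULTIPLE is requested):

int provided = 0;
/* Ask for full multi-threaded support instead of plain MPI_Init. */
MPI_Init_thread(0, NULL, MPI_THREAD_MULTIPLE, &provided);
if (provided < MPI_THREAD_MULTIPLE) {
    /* The runtime cannot handle concurrent MPI calls from several threads;
     * bail out here instead of crashing later. */
    MPI_Abort(MPI_COMM_WORLD, 1);
}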