I am trying to use MPI_Gather to collect data from the slaves. Basically, a simulation runs on each slave (and it is not the same on every slave), and I want to collect one integer (the result of the simulation) from each of them on the master. From each integer, the master calculates a new value 'a' that is sent back to the slave so it can redo the simulation with this better parameter.
I hope this is clear; I am pretty new to MPI.
Note: some simulations will not finish at the same time.
int main
    while(true){
        if (rank==0) runMaster();
        else runSlave();
    }

runMaster()
    receive data b from all slaves (with MPI_Gather)
    calculate parameter a for each slave; aTotal=[a_1,...,a_n]
    MPI_Scatter(aTotal, to slaves)

runSlave()
    a=aTotal[rank]
    simulationRun(a){return b}
    MPI_Gather(&b, to master)
To avoid the deadlock, each slave is initialized with a random a.
I created a small test case, because I don't see how I can use MPI_Gather in my slaves:
int main (int argc, char *argv[]) {
int size;
int rank;
int a[12];
int i;
int start,end;
int b;
MPI_Init(&argc, &argv);
MPI_Status status;
MPI_Request req;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
int* bb= new int[size];
int source;
//master
if(!rank){
while(true){
b=12;
MPI_Recv(&bb[0], 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status);
source = status.MPI_SOURCE;
printf("master receive b %d from source %d \n", bb[0], source);
if (source == 1) goto finish;
}
}
//slave
if(rank){
b=13;
if (rank==1) {b=15; sleep(2);}
int source = rank;
printf("slave %d will send b %d \n", source, b);
// MPI_Gather(&b,1,MPI_INT,bb,1,MPI_INT,0,MPI_COMM_WORLD); // unworking, not called by master
MPI_Send(&b, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
}
finish:
MPI_Finalize();
return 0;
}
I am trying to send the slave data to the master with a collective command.
Is this implementation realistic?
What you propose sounds reasonable. Another approach would be to have the slaves all broadcast their results to each other in one go (MPI_Allgather); then you can implement the scoring and what-to-try-next algorithm directly in each slave. If the scoring algorithm is not too complex, the overhead of running it in every slave will be worth it in terms of speed, because the slaves will not have to communicate with the master at all, saving one communication on each iteration.
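A minimal sketch of that MPI_Allgather loop, with hypothetical simulationRun and nextParameter placeholders standing in for your simulation and scoring code:

#include <mpi.h>
#include <vector>

// Hypothetical placeholders for the real simulation and scoring code.
int simulationRun(int a) { return a * a % 97; }
int nextParameter(const std::vector<int>& allResults, int rank) { return allResults[rank] + 1; }

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int a = rank + 1;                  // start from a (random) initial parameter
    std::vector<int> allResults(size);

    for (int iter = 0; iter < 10; ++iter) {
        int b = simulationRun(a);
        // Every rank receives every other rank's result; no dedicated master needed.
        MPI_Allgather(&b, 1, MPI_INT, allResults.data(), 1, MPI_INT, MPI_COMM_WORLD);
        // Each rank runs the scoring algorithm locally and picks its next parameter.
        a = nextParameter(allResults, rank);
    }

    MPI_Finalize();
    return 0;
}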
Related
I have an array of indices, and I want each worker to do something based on these indices.
The size of the array might be larger than the total number of ranks, so my first question is whether there is another way besides master-worker load balancing here. I want a balanced system, and I also want to assign each index to a rank.
I was thinking about master-worker, and in this approach the master rank (0) gives each index to the other ranks. But when I run my code with 3 ranks and 15 indices, it hangs in the while loop when sending index 4. I was wondering if anybody can help me find the problem.
if(pCurrentID == 0) { // Master
MPI_Status status;
int nindices = 15;
int mesg[1] = {0};
int initial_id = 0;
int recv_mesg[1] = {0};
// -- send out initial ids to workers --//
while (initial_id < size - 1) {
if (initial_id < nindices) {
MPI_Send(mesg, 1, MPI_INT, initial_id + 1, 1, MPI_COMM_WORLD);
mesg[0] += 1;
++initial_id;
}
}
//-- hand out id to workers dynamically --//
while (mesg[0] != nindices) {
MPI_Probe(MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, &status);
int isource = status.MPI_SOURCE;
MPI_Recv(recv_mesg, 1, MPI_INT, isource, 1, MPI_COMM_WORLD, &status);
MPI_Send(mesg, 1, MPI_INT, isource, 1, MPI_COMM_WORLD);
mesg[0] += 1;
}
//-- hand out ending signals once done --//
for (int rank = 1; rank < size; ++rank) {
mesg[0] = -1;
MPI_Send(mesg, 1, MPI_INT, rank, 0, MPI_COMM_WORLD);
}
} else {
MPI_Status status;
int id[1] = {0};
// Get the surrounding fragment id
MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
int itag = status.MPI_TAG;
MPI_Recv(id, 1, MPI_INT, 0, itag, MPI_COMM_WORLD, &status);
int jfrag = id[0];
if (jfrag < 0) break;
// do something
MPI_Send(id, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
}
I have an array of indices, and I want each worker to do something based on these indices. The size of the array might be larger than the total number of ranks, so my first question is whether there is another way besides master-worker load balancing here. I want a balanced system, and I also want to assign each index to a rank.
No, but if the work performed per array index takes roughly the same amount of time, you can simply scatter the array among the processes.
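For instance, a minimal static-scattering sketch, assuming the number of indices divides evenly by the number of ranks (otherwise you would reach for MPI_Scatterv):

#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nindices = 15;            // assumed to divide evenly by the number of ranks
    const int per_rank = nindices / size;

    std::vector<int> indices;
    if (rank == 0) {
        indices.resize(nindices);
        for (int i = 0; i < nindices; ++i) indices[i] = i;
    }

    std::vector<int> my_indices(per_rank);
    MPI_Scatter(indices.data(), per_rank, MPI_INT,
                my_indices.data(), per_rank, MPI_INT, 0, MPI_COMM_WORLD);

    for (int idx : my_indices)
        printf("rank %d works on index %d\n", rank, idx);   // do the real work here

    MPI_Finalize();
    return 0;
}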
I was thinking about master-worker, and in this approach the master rank (0) gives each index to the other ranks. But when I run my code with 3 ranks and 15 indices, it hangs in the while loop when sending index 4. I was wondering if anybody can help me find the problem.
As already pointed out in the comments, the problem is that you are missing, on the worker side, the loop that keeps requesting work from the master.
The load balancer can be implemented as follows:
1. The master initially sends one iteration index to each worker;
2. Each worker waits for a message from the master;
3. The master then calls MPI_Recv with MPI_ANY_SOURCE and waits for a worker to request more work;
4. After a worker finishes its first iteration, it sends its rank to the master, signaling the master to send a new iteration;
5. The master reads the rank sent by the worker in step 4 and checks the array for a new index; if there is still a valid index, it sends it to that worker, otherwise it sends a special message (for instance -1) signaling the worker that there is no more work to be performed;
6. When a worker receives the special message, it stops working;
7. The master stops when all the workers have received the special message.
An example of this approach:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
int main(int argc, char *argv[]){
    MPI_Init(&argc, &argv); // Initialize the MPI environment
    int rank;
    int size;
    MPI_Status status;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int work_is_done = -1;

    if(rank == 0){
        int max_index = 10;
        int index_simulator = 0;

        // Send statically the first iterations
        for(int i = 1; i < size; i++){
            MPI_Send(&index_simulator, 1, MPI_INT, i, i, MPI_COMM_WORLD);
            index_simulator++;
        }

        int processes_finishing_work = 0;
        do{
            int process_that_wants_work = 0;
            MPI_Recv(&process_that_wants_work, 1, MPI_INT, MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, &status);
            if(index_simulator < max_index){
                MPI_Send(&index_simulator, 1, MPI_INT, process_that_wants_work, 1, MPI_COMM_WORLD);
                index_simulator++;
            }
            else{ // send the special message
                MPI_Send(&work_is_done, 1, MPI_INT, process_that_wants_work, 1, MPI_COMM_WORLD);
                processes_finishing_work++;
            }
        } while(processes_finishing_work < size - 1);
    }
    else{
        int index_to_work = 0;
        MPI_Recv(&index_to_work, 1, MPI_INT, 0, rank, MPI_COMM_WORLD, &status);
        // Work with the iteration index_to_work

        do{
            MPI_Send(&rank, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
            MPI_Recv(&index_to_work, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
            if(index_to_work != work_is_done){
                // Work with the iteration index_to_work
            }
        } while(index_to_work != work_is_done);
    }

    printf("Process {%d} -> I AM OUT\n", rank);
    MPI_Finalize();
    return 0;
}
You can improve on this approach by reducing: 1) the number of messages sent and 2) the time spent waiting for them. For the former you can try a chunking strategy (i.e., sending more than one index per MPI communication), as sketched below. For the latter you can try nonblocking MPI communication, or use two threads per process, one to receive/send the work and another to actually perform the work. The multithreaded approach would also allow the master process to work on array indices itself, but it significantly complicates the code.
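A rough sketch of the chunking idea only; the pull-based hand-shake and the "empty chunk means stop" convention are simplifications I chose for brevity, not part of the code above:

#include <mpi.h>
#include <cstdio>
#include <algorithm>

// Each request is answered with up to CHUNK indices at once, cutting the
// number of messages roughly by a factor of CHUNK. CHUNK is an arbitrary choice.
#define CHUNK 4

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Status status;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int max_index = 15;

    if (rank == 0) {
        int next = 0, finished = 0;
        while (finished < size - 1) {
            int requester;
            MPI_Recv(&requester, 1, MPI_INT, MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, &status);
            int chunk[CHUNK];
            int count = std::min(CHUNK, max_index - next);
            for (int i = 0; i < count; ++i) chunk[i] = next++;
            if (count == 0) ++finished;               // an empty chunk tells the worker to stop
            MPI_Send(chunk, count, MPI_INT, requester, 1, MPI_COMM_WORLD);
        }
    } else {
        while (true) {
            MPI_Send(&rank, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);   // ask for work
            int chunk[CHUNK], count;
            MPI_Recv(chunk, CHUNK, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
            MPI_Get_count(&status, MPI_INT, &count);
            if (count == 0) break;                    // no work left
            for (int i = 0; i < count; ++i)
                printf("rank %d works on index %d\n", rank, chunk[i]);
        }
    }

    MPI_Finalize();
    return 0;
}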
I am trying to send a message from one process to all other MPI processes and also to receive a message from all of them in that process. It is basically an all-to-all communication where every process sends a message to every other process (except itself) and receives a message from every other process.

The following code snippet shows what I am trying to achieve. Now, the problem with MPI_Send is its behavior: for small message sizes it acts as non-blocking, but for larger messages (on my machine, BUFFER_SIZE 16400) it blocks. I am aware that this is how MPI_Send behaves. As a workaround, I replaced the code below with a blocking send+receive, MPI_Sendrecv, for example: MPI_Sendrecv(intSendPack, BUFFER_SIZE, MPI_INT, processId, MPI_TAG, intReceivePack, BUFFER_SIZE, MPI_INT, processId, MPI_TAG, MPI_COMM_WORLD, MPI_STATUSES_IGNORE). I make that call for every rank of MPI_COMM_WORLD inside a loop, and this approach gives me what I am trying to achieve (all-to-all communication). However, it takes a lot of time, which I want to cut down with a more time-efficient approach. I have tried MPI scatter and gather to perform the all-to-all communication, but one issue is that in the real implementation the buffer size (16400) may differ from one call of the MPI_all_to_all function to another. I am using MPI_TAG to differentiate calls in different iterations, which I cannot do with the scatter and gather functions.
#define BUFFER_SIZE 16400
void MPI_all_to_all(int MPI_TAG)
{
int size;
int rank;
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
int* intSendPack = new int[BUFFER_SIZE]();
int* intReceivePack = new int[BUFFER_SIZE]();
for (int prId = 0; prId < size; prId++) {
if (prId != rank) {
MPI_Send(intSendPack, BUFFER_SIZE, MPI_INT, prId, MPI_TAG,
MPI_COMM_WORLD);
}
}
for (int sId = 0; sId < size; sId++) {
if (sId != rank) {
MPI_Recv(intReceivePack, BUFFER_SIZE, MPI_INT, sId, MPI_TAG,
MPI_COMM_WORLD, MPI_STATUSES_IGNORE);
}
}
}
I want to know if there is a way to perform this all-to-all communication using any efficient communication model. I am not sticking to MPI_Send; if there is some other way that achieves what I am after, I am happy with that. Any help or suggestion is much appreciated.
This is a benchmark that lets you compare the performance of collective vs. point-to-point communication in an all-to-all exchange:
#include <iostream>
#include <algorithm>
#include <mpi.h>
#define BUFFER_SIZE 16384
void point2point(int*, int*, int, int);
int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank_id = 0, com_sz = 0;
    double t0 = 0.0, tf = 0.0;
    MPI_Comm_size(MPI_COMM_WORLD, &com_sz);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank_id);

    int* intSendPack = new int[BUFFER_SIZE]();
    int* result = new int[BUFFER_SIZE*com_sz]();
    std::fill(intSendPack, intSendPack + BUFFER_SIZE, rank_id);
    std::fill(result + BUFFER_SIZE*rank_id, result + BUFFER_SIZE*(rank_id+1), rank_id);

    // Send-Receive
    t0 = MPI_Wtime();
    point2point(intSendPack, result, rank_id, com_sz);
    MPI_Barrier(MPI_COMM_WORLD);
    tf = MPI_Wtime();
    if (!rank_id)
        std::cout << "Send-receive time: " << tf - t0 << std::endl;

    // Collective
    std::fill(result, result + BUFFER_SIZE*com_sz, 0);
    std::fill(result + BUFFER_SIZE*rank_id, result + BUFFER_SIZE*(rank_id+1), rank_id);
    t0 = MPI_Wtime();
    MPI_Allgather(intSendPack, BUFFER_SIZE, MPI_INT, result, BUFFER_SIZE, MPI_INT, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);
    tf = MPI_Wtime();
    if (!rank_id)
        std::cout << "Allgather time: " << tf - t0 << std::endl;

    MPI_Finalize();
    delete[] intSendPack;
    delete[] result;
    return 0;
}

// Send/receive communication
void point2point(int* send_buf, int* result, int rank_id, int com_sz)
{
    MPI_Status status;
    // Exchange and store the data
    for (int i = 0; i < com_sz; i++){
        if (i != rank_id){
            MPI_Sendrecv(send_buf, BUFFER_SIZE, MPI_INT, i, 0,
                         result + i*BUFFER_SIZE, BUFFER_SIZE, MPI_INT, i, 0, MPI_COMM_WORLD, &status);
        }
    }
}
Here every rank contributes its own array intSendPack to the array result on all other ranks, and result should end up the same on all of them. result is flat: each rank takes BUFFER_SIZE entries starting at rank_id*BUFFER_SIZE. After the point-to-point communication, the array is reset to its initial state before the collective run.
Time is measured around each exchange, with an MPI_Barrier before the final timestamp so that you get the maximum time across all ranks.
I ran the benchmark on one node of NERSC Cori KNL using Slurm. I ran it a few times for each case just to make sure the values are consistent and I'm not looking at an outlier, but you should run it maybe 10 or so times to collect proper statistics.
Here are some thoughts:
For a small number of processes (5) and a large buffer size (16384), collective communication is about twice as fast as point-to-point, and it becomes about 4-5 times faster when moving to a larger number of ranks (64).
In this benchmark there is not much difference between the recommended Slurm settings on that specific machine and the defaults, but in real, larger programs with more communication the difference is very significant (a job that runs in less than a minute with the recommended settings can take 20-30 minutes or more with the defaults). The point is: check your settings, it may make a difference.
What you were seeing with send/receive for larger messages was actually a deadlock. I saw it too for the message size shown in this benchmark. In case you missed them, there are two SO posts worth reading on it: a buffering explanation and a word on deadlocking.
In summary, adjust this benchmark to represent your code more closely and run it on your system, but collective communication in all-to-all or one-to-all situations should be faster because of dedicated optimizations such as superior algorithms used for communication arrangement. A 2-5x speedup is considerable, since communication often contributes the most to the overall time.
I have an MPI program in which worker ranks (rank != 0) make a bunch of MPI_Send calls, and the master rank (rank == 0) receives all these messages. However, I run into a Fatal error in MPI_Recv - MPI_Recv(...) failed, Out of memory.
Here is the code that I am compiling in Visual Studio 2010.
I run the executable like so:
mpiexec -n 3 MPIHelloWorld.exe
int main(int argc, char* argv[]){
int numprocs, rank, namelen, num_threads, thread_id;
char processor_name[MPI_MAX_PROCESSOR_NAME];
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Get_processor_name(processor_name, &namelen);
if(rank == 0){
for(int k=1; k<numprocs; k++){
for(int i=0; i<1000000; i++){
double x;
MPI_Recv(&x, 1, MPI_DOUBLE, k, i, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
}
}
else{
for(int i=0; i<1000000; i++){
double x = 5;
MPI_Send(&x, 1, MPI_DOUBLE, 0, i, MPI_COMM_WORLD);
}
}
}
If I run with only 2 processes, the program does not crash. So it seems the problem is the accumulation of MPI_Send calls from a third rank (i.e. a second worker node).
If I decrease the number of iterations to 100,000, then I can run with 3 processes without crashing. However, the amount of data being sent with one million iterations is ~8 MB (8 bytes per double * 1,000,000 iterations), so I don't think the "Out of Memory" refers to physical memory like RAM.
Any insight is appreciated, thanks!
MPI_Send stores the data in a system buffer, ready to send. The size of this buffer and where it lives are implementation specific (I remember hearing that it can even be in the interconnect). In my case (Linux with MPICH) I don't get a memory error. One way to control this buffering explicitly is to use MPI_Buffer_attach together with MPI_Bsend. There may also be a way to change the system buffer size (e.g. the MP_BUFFER_MEM variable on IBM systems).
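A minimal sketch of the MPI_Buffer_attach/MPI_Bsend mechanics (run with at least two ranks; the message count and buffer sizing here are illustrative assumptions, not values tuned for your program):

#include <mpi.h>
#include <cstdlib>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int nmsg = 100;
    // Room for all pending doubles plus the per-message bookkeeping overhead.
    int bufsize = nmsg * ((int)sizeof(double) + MPI_BSEND_OVERHEAD);
    char* buf = (char*)std::malloc(bufsize);
    MPI_Buffer_attach(buf, bufsize);          // buffered sends now draw from this user buffer

    if (rank == 1) {
        for (int i = 0; i < nmsg; ++i) {
            double x = 5;
            MPI_Bsend(&x, 1, MPI_DOUBLE, 0, i, MPI_COMM_WORLD);  // returns once x is copied
        }
    } else if (rank == 0) {
        for (int i = 0; i < nmsg; ++i) {
            double x;
            MPI_Recv(&x, 1, MPI_DOUBLE, 1, i, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }

    MPI_Buffer_detach(&buf, &bufsize);        // waits for buffered sends to complete
    std::free(buf);
    MPI_Finalize();
    return 0;
}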
However, this situation of unmatched messages should probably not occur in practice. In your example above, the order of the k and i loops on the receiving side could be swapped to prevent this build-up of messages.
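For illustration, your program with the receive loops swapped (i outer, k inner), so the root drains each worker's stream in step instead of leaving one worker's entire stream unmatched while the other is being received:

#include <mpi.h>

int main(int argc, char* argv[]) {
    int numprocs, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int niter = 1000000;
    if (rank == 0) {
        for (int i = 0; i < niter; i++)
            for (int k = 1; k < numprocs; k++) {   // loop order swapped relative to the question
                double x;
                MPI_Recv(&x, 1, MPI_DOUBLE, k, i, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
    } else {
        for (int i = 0; i < niter; i++) {
            double x = 5;
            MPI_Send(&x, 1, MPI_DOUBLE, 0, i, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}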
I am learning MPI and trying to create examples of some of the functions. I've gotten several to work, but I am having issues with MPI_Gather. I had a much more complex fitting test, but I trimmed it down to the simplest possible code. I am still, however, getting the following error:
root#master:/home/sgeadmin# mpirun ./expfitTest5
Assertion failed in file src/mpid/ch3/src/ch3u_request.c at line 584: FALSE
memcpy argument memory ranges overlap, dst_=0x1187e30 src_=0x1187e40 len_=400
internal ABORT - process 0
I am running one master instance and two node instances through AWS EC2. I have all the appropriate libraries installed, as I've gotten other MPI examples to work. My program is:
int main()
{
int world_size, world_rank;
int nFits = 100;
double arrCount[100];
double *rBuf = NULL;
MPI_Init(NULL,NULL);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
assert(world_size!=1);
int nElements = nFits/(world_size-1);
if(world_rank>0){
for(int k = 0; k < nElements; k++)
{
arrCount[k] = k;
}}
MPI_Barrier(MPI_COMM_WORLD);
if(world_rank==0)
{
rBuf = (double*) malloc( nFits*sizeof(double));
}
MPI_Gather(arrCount, nElements, MPI_DOUBLE, rBuf, nElements, MPI_DOUBLE, 0, MPI_COMM_WORLD);
if(world_rank==0){
for(int i = 0; i < nFits; i++)
{
cout<<rBuf[i]<<"\n";
}}
MPI_Finalize();
exit(0);
}
Is there something I am not understanding in malloc or MPI_Gather? I've compared my code to other samples, and can't find any differences.
The root process in a gather operation does participate in the operation, i.e. it sends data to its own receive buffer. That also means you must allocate memory for its part of the receive buffer.
Now, you could use MPI_Gatherv and specify a recvcounts[0] (and a sendcount at the root) of 0 to follow your example closely. But usually you would prefer to write the MPI application so that the root participates equally in the operation, i.e. int nElements = nFits/world_size.
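One way that reworked gather could look (a sketch, assuming nFits divides evenly by world_size):

#include <mpi.h>
#include <iostream>
#include <vector>

int main() {
    int world_size, world_rank;
    const int nFits = 100;

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    const int nElements = nFits / world_size;     // every rank, root included, owns a share
    std::vector<double> arrCount(nElements);
    for (int k = 0; k < nElements; k++)
        arrCount[k] = world_rank * nElements + k;

    std::vector<double> rBuf;
    if (world_rank == 0)
        rBuf.resize(nFits);                       // receive buffer only matters at the root

    MPI_Gather(arrCount.data(), nElements, MPI_DOUBLE,
               rBuf.data(), nElements, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (world_rank == 0)
        for (int i = 0; i < nElements * world_size; i++)   // print only the gathered entries
            std::cout << rBuf[i] << "\n";

    MPI_Finalize();
    return 0;
}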
I want to send several instances of struct S to some processes. The layout of each struct could be different, that is, s.v might have a different size in each instance. When receiving, I do not know the exact MPI_Datatype to pass to MPI_Get_count, because that information is available only in the sender process. Also, assume that in general struct S has many non-primitive members, so I cannot simply pass MPI_INT when receiving. How can I acquire the MPI_Datatype to use for MPI_Get_count?
struct S
{
std::vector<int> v;
MPI_Datatype mpi_dtype;
void make_layout()
{
const int nblock = 1;
int block_count[nblock] = {v.size()};
MPI_Aint offset[nblock];
MPI_Get_address(&v[0], &offset[0]);
MPI_Datatype block_type[nblock] = {MPI_INT};
MPI_Type_struct(nblock, block_count, offset, block_type, &mpi_dtype);
MPI_Type_commit(&mpi_dtype);
}
};
int main()
{
int rank, size;
MPI_Init(NULL, NULL);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
if (rank == 0)
{
S s0;
s0.v.resize(7);
s0.make_layout();
MPI_Send(MPI_BOTTOM, 1, s0.mpi_dtype, 1, 0, MPI_COMM_WORLD);
}
else
{
S s1; // note that right now s1.mpi_dtype != s0.mpi_dtype
MPI_Status status;
int number_amount;
MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
// MPI_Get_count(&status, ???, &number_amount);
// MPI_Recv(...);
}
return 0;
}
MPI messages are generally typeless, that is, they do not carry the datatypes in them. There is no way on the receiver side to obtain detailed information about the message structure. A viable option is to send the structure as two messages. The first one will contain some kind of description, e.g., the length of each vector. Once received, that information is used by the receiver to prepare the data object and to construct the appropriate datatype. Then, the second message will transmit the actual data. Boost.MPI completely automates this process - check its source code to learn how.
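A minimal sketch of the two-message approach, using plain int buffers and arbitrary tags rather than the committed struct datatype from the question:

#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        std::vector<int> v(7, 42);
        int len = (int)v.size();
        MPI_Send(&len, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);          // 1) describe the layout
        MPI_Send(v.data(), len, MPI_INT, 1, 1, MPI_COMM_WORLD);    // 2) send the actual data
    } else if (rank == 1) {
        int len;
        MPI_Recv(&len, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::vector<int> v(len);                                   // receiver builds the object
        MPI_Recv(v.data(), len, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d ints\n", len);
    }

    MPI_Finalize();
    return 0;
}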
Another option is to pack the description with the data in a single message using MPI_Pack. On the receiver side, the message size could be probed with type MPI_PACKED and the data should be unpacked using MPI_Unpack. I don't think the added complexity of this case will offset the small performance loss due to sending two messages instead of one.
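For completeness, a sketch of the packed single-message variant (buffer sizing via MPI_Pack_size; the tag is an arbitrary choice):

#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        std::vector<int> v(7, 42);
        int len = (int)v.size();

        int size_len = 0, size_data = 0;
        MPI_Pack_size(1, MPI_INT, MPI_COMM_WORLD, &size_len);
        MPI_Pack_size(len, MPI_INT, MPI_COMM_WORLD, &size_data);

        std::vector<char> packbuf(size_len + size_data);
        int position = 0;
        // The length travels in the same message as the data.
        MPI_Pack(&len, 1, MPI_INT, packbuf.data(), (int)packbuf.size(), &position, MPI_COMM_WORLD);
        MPI_Pack(v.data(), len, MPI_INT, packbuf.data(), (int)packbuf.size(), &position, MPI_COMM_WORLD);

        MPI_Send(packbuf.data(), position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status status;
        int packed_size = 0;
        MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_PACKED, &packed_size);          // size of the packed message

        std::vector<char> packbuf(packed_size);
        MPI_Recv(packbuf.data(), packed_size, MPI_PACKED, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        int position = 0, len = 0;
        MPI_Unpack(packbuf.data(), packed_size, &position, &len, 1, MPI_INT, MPI_COMM_WORLD);
        std::vector<int> v(len);
        MPI_Unpack(packbuf.data(), packed_size, &position, v.data(), len, MPI_INT, MPI_COMM_WORLD);
        printf("rank 1 unpacked %d ints\n", len);
    }

    MPI_Finalize();
    return 0;
}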
In your case that is not necessary as your structure has a single member, which is a vector of integers. Therefore, the correct type to give to MPI_Get_count is MPI_INT.
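A sketch of that receiver side; the sender here is simplified to a plain MPI_INT send, but the MPI_Probe/MPI_Get_count part is what would replace the commented-out lines in your code:

#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        std::vector<int> v(7, 1);
        MPI_Send(v.data(), (int)v.size(), MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status status;
        int number_amount;
        MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_INT, &number_amount);   // message size counted in ints
        std::vector<int> v(number_amount);
        MPI_Recv(v.data(), number_amount, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %d ints\n", number_amount);
    }

    MPI_Finalize();
    return 0;
}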