I am implementing MPI non-blocking communication in my program. In the MPI_Isend man page, I see it says:
A nonblocking send call indicates that the system may start copying data out of the send buffer. The sender should not modify any part of the send buffer after a nonblocking send operation is called, until the send completes.
My code works like this:
// send messages
if(s > 0){
MPI_Request s_requests[s];
MPI_Status s_status[s];
for(int i = 0; i < s; ++i){
// some code to form the message to send
std::vector<double> send_info;
// non-blocking send
MPI_Isend(&send_info[0], ..., s_requests[i]);
}
MPI_Waitall(s, s_requests, s_status);
}
// recv info
if(n > 0){ // s and n will match
for(int i = 0; i < n; ++i){
MPI_Status status;
// allocate the space to recv info
std::vector<double> recv_info;
MPI_Recv(&recv_info[0], ..., status);
}
}
My question is: am I modifying the send buffers, given that they are declared inside the inner curly brackets (each send_info vector is destroyed when its loop iteration ends)? If so, is this not a safe communication mode? Although my program works fine now, I am still suspicious. Thank you for your reply.
There are two points I want to emphasize in this example.
The first one is the problem I asked about: the send buffer gets modified before MPI_Waitall. The reason is what Gilles said. One solution is to allocate a big buffer before the for loop and call MPI_Waitall after the loop finishes; another is to put MPI_Wait inside the loop. But the latter is equivalent to using MPI_Send in terms of performance, as in the sketch below.
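For illustration, here is a minimal sketch of the MPI_Wait-inside-the-loop variant; build_message, dest and tag are hypothetical stand-ins for the elided parts of my code:
for (int i = 0; i < s; ++i) {
    std::vector<double> send_info = build_message(i); // hypothetical helper
    MPI_Request req;
    MPI_Isend(send_info.data(), static_cast<int>(send_info.size()), MPI_DOUBLE,
              dest[i], tag, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE); // send_info may only be destroyed after this
}
Because each iteration blocks until its own send completes, the sends are serialized just like plain MPI_Send.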
However, I found that if you simply switch to blocking send and receive, a communication scheme like this can cause a deadlock. It is similar to the classic deadlock:
if (rank == 0) {
MPI_Send(..., 1, tag, MPI_COMM_WORLD);
MPI_Recv(..., 1, tag, MPI_COMM_WORLD, &status);
} else if (rank == 1) {
MPI_Send(..., 0, tag, MPI_COMM_WORLD);
MPI_Recv(..., 0, tag, MPI_COMM_WORLD, &status);
}
An explanation can be found here.
My program could run into a similar situation: if all the processes call MPI_Send first, it is a deadlock.
So my solution is to use a large buffer and stick to the non-blocking communication scheme.
#include <vector>
#include <unordered_map>
// send messages
if(s > 0){
MPI_Request s_requests[s];
MPI_Status s_status[s];
std::unordered_map<int, std::vector<double>> send_info;
for(int i = 0; i < s; ++i){
// some code to form the message to send
send_info[i] = std::vector<double>(); // this buffer lives in the map until after MPI_Waitall
// non-blocking send
MPI_Isend(&send_info[i][0], ..., s_requests[i]);
}
MPI_Waitall(s, s_requests, s_status);
}
// recv info
if(n > 0){ // s and n will match
for(int i = 0; i < n; ++i){
MPI_Status status;
// allocate the space to recv info
std::vector<double> recv_info;
MPI_Recv(&recv_info[0], ..., status);
}
}
Related
I am trying to send a message to all MPI processes from a process and also receive a message from all those processes in a process. It is basically an all-to-all communication where every process sends a message to every other process (except itself) and receives a message from every other process.
The following example code snippet shows what I am trying to achieve. Now, the problem with MPI_Send is its behavior: for a small message size it acts as non-blocking, but for larger messages (on my machine, BUFFER_SIZE 16400) it blocks. I am aware that this is how MPI_Send behaves.

As a workaround, I replaced the code below with a combined blocking send+recv, MPI_Sendrecv. An example call looks like MPI_Sendrecv(intSendPack, BUFFER_SIZE, MPI_INT, processId, MPI_TAG, intReceivePack, BUFFER_SIZE, MPI_INT, processId, MPI_TAG, MPI_COMM_WORLD, MPI_STATUSES_IGNORE). I make the above call for all the processes of MPI_COMM_WORLD inside a loop over every rank, and this approach gives me what I am trying to achieve (all-to-all communication). However, this call takes a lot of time, which I want to cut down with some time-efficient approach.

I have tried MPI scatter and gather to perform the all-to-all communication, but one issue is that the buffer size (16400) may differ between iterations in the actual implementation when MPI_all_to_all is called. Here I am using MPI_TAG to differentiate calls from different iterations, which I cannot do with the scatter and gather functions.
#define BUFFER_SIZE 16400
void MPI_all_to_all(int MPI_TAG)
{
int size;
int rank;
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
int* intSendPack = new int[BUFFER_SIZE]();
int* intReceivePack = new int[BUFFER_SIZE]();
for (int prId = 0; prId < size; prId++) {
if (prId != rank) {
MPI_Send(intSendPack, BUFFER_SIZE, MPI_INT, prId, MPI_TAG,
MPI_COMM_WORLD);
}
}
for (int sId = 0; sId < size; sId++) {
if (sId != rank) {
MPI_Recv(intReceivePack, BUFFER_SIZE, MPI_INT, sId, MPI_TAG,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
}
delete[] intSendPack;
delete[] intReceivePack;
}
I want to know if there is a way to perform this all-to-all communication using a more efficient communication model. I am not wedded to MPI_Send; if there is some other way that gives me what I am trying to achieve, I am happy with that. Any help or suggestion is much appreciated.
This is a benchmark that lets you compare the performance of collective vs. point-to-point communication in an all-to-all exchange:
#include <iostream>
#include <algorithm>
#include <mpi.h>
#define BUFFER_SIZE 16384
void point2point(int*, int*, int, int);
int main(int argc, char *argv[])
{
MPI_Init(&argc, &argv);
int rank_id = 0, com_sz = 0;
double t0 = 0.0, tf = 0.0;
MPI_Comm_size(MPI_COMM_WORLD, &com_sz);
MPI_Comm_rank(MPI_COMM_WORLD, &rank_id);
int* intSendPack = new int[BUFFER_SIZE]();
int* result = new int[BUFFER_SIZE*com_sz]();
std::fill(intSendPack, intSendPack + BUFFER_SIZE, rank_id);
std::fill(result + BUFFER_SIZE*rank_id, result + BUFFER_SIZE*(rank_id+1), rank_id);
// Send-Receive
t0 = MPI_Wtime();
point2point(intSendPack, result, rank_id, com_sz);
MPI_Barrier(MPI_COMM_WORLD);
tf = MPI_Wtime();
if (!rank_id)
std::cout << "Send-receive time: " << tf - t0 << std::endl;
// Collective
std::fill(result, result + BUFFER_SIZE*com_sz, 0);
std::fill(result + BUFFER_SIZE*rank_id, result + BUFFER_SIZE*(rank_id+1), rank_id);
t0 = MPI_Wtime();
MPI_Allgather(intSendPack, BUFFER_SIZE, MPI_INT, result, BUFFER_SIZE, MPI_INT, MPI_COMM_WORLD);
MPI_Barrier(MPI_COMM_WORLD);
tf = MPI_Wtime();
if (!rank_id)
std::cout << "Allgather time: " << tf - t0 << std::endl;
MPI_Finalize();
delete[] intSendPack;
delete[] result;
return 0;
}
// Send/receive communication
void point2point(int* send_buf, int* result, int rank_id, int com_sz)
{
MPI_Status status;
// Exchange and store the data
for (int i=0; i<com_sz; i++){
if (i != rank_id){
MPI_Sendrecv(send_buf, BUFFER_SIZE, MPI_INT, i, 0,
result + i*BUFFER_SIZE, BUFFER_SIZE, MPI_INT, i, 0, MPI_COMM_WORLD, &status);
}
}
}
Here every rank contributes its own array intSendPack to the array result on all other ranks, and result should end up identical on every rank. result is flat: each rank owns BUFFER_SIZE entries starting at rank_id*BUFFER_SIZE. After the point-to-point run, result is reset to its initial state before the collective run.
Time is measured by placing an MPI_Barrier before reading the clock, which gives you the maximum time across all ranks.
I ran the benchmark on 1 node of NERSC Cori KNL using Slurm. I ran each case a few times just to make sure the values are consistent and I'm not looking at an outlier, but you should run it maybe 10 or so times to collect proper statistics.
Here are some thoughts:
For a small number of processes (5) and a large buffer size (16384), collective communication is about twice as fast as point-to-point, but it becomes about 4-5 times faster when moving to a larger number of ranks (64).
In this benchmark there is not much difference between the recommended Slurm settings for that specific machine and the default settings, but in real, larger programs with more communication there is a very significant one (a job that runs for less than a minute with the recommended settings can run for 20-30 minutes or more with the defaults). The point is: check your settings, it may make a difference.
What you were seeing with Send/Receive for larger messages was actually a deadlock. I saw it too for the message size shown in this benchmark. In case you missed them, there are two SO posts worth reading on it: a buffering explanation and a word on deadlocking.
In summary, adjust this benchmark to represent your code more closely and run it on your system, but collective communication in all-to-all or one-to-all situations should be faster because of dedicated optimizations such as superior algorithms for arranging the communication. A 2-5x speedup is considerable, since communication often contributes the most to the overall time.
The scenario is like this: one process is using epoll on several sockets; all sockets are set non-blocking and edge-triggered. When an EPOLLIN event occurs on one socket, we start reading data from its fd, but the problem is that too much data keeps coming in, so in the read loop the return value of recv is always greater than 0. The application is stuck there reading data and cannot move on.
Any idea how should I deal with this?
constexpr int max_events = 10;
constexpr int buf_len = 8192;
....
epoll_event events[max_events];
char buf[buf_len];
int n;
auto fd_num = epoll_wait(...);
for(auto i = 0; i < fd_num; i++) {
if(events[i].events & EPOLLIN) {
for(;;) {
n = ::read(events[i].data.fd, buf, sizeof(buf));
if (n == -1 && errno == EAGAIN)
break;
if (n <= 0)
{
on_disconnect_(events[i].data.fd);
break;
}
else
{
on_data_(events[i].data.fd, buf, n);
}
}
}
}
When using edge-triggered mode, the data must be read in one recv call, otherwise you risk starving other sockets. This issue has been written about in numerous blogs, e.g. Epoll is fundamentally broken.
Make sure that your user-space receive buffer is at least the same size as the kernel receive socket buffer. This way you read the entire kernel buffer in one recv call.
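For reference, here is a minimal sketch of how that sizing could be done; make_recv_buffer is a hypothetical helper, not part of your code, and it simply asks the kernel via getsockopt(SO_RCVBUF):
#include <cstddef>
#include <vector>
#include <sys/socket.h>

// Returns a user-space buffer at least as large as the kernel receive buffer of fd.
std::vector<char> make_recv_buffer(int fd)
{
    int rcvbuf = 0;
    socklen_t len = sizeof(rcvbuf);
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &len) == -1 || rcvbuf <= 0)
        rcvbuf = 8192; // fall back to a conservative default
    return std::vector<char>(static_cast<std::size_t>(rcvbuf));
}
On Linux the value reported for SO_RCVBUF includes bookkeeping overhead, so a buffer of that size is at least as large as the payload the kernel can actually hold.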
Also, you can process ready sockets in a round-robin fashion, so that the control flow does not get stuck in a recv loop on one socket. That works best when the user-space receive buffer is the same size as the kernel one. E.g.:
auto n = epoll_wait(...);
for(int dry = 0; dry < n;) {
for(auto i = 0; i < n; i++) {
if(events[i].events & EPOLLIN) {
// Do only one read call for each ready socket
// before moving to the next ready socket.
auto r = recv(...);
if(-1 == r) {
if(EAGAIN == errno) {
events[i].events ^= EPOLLIN;
++dry;
}
else
; // Handle error (also clear EPOLLIN and ++dry so the outer loop can finish).
}
else if(!r){
// Process client disconnect (likewise clear EPOLLIN and ++dry).
}
else {
// Process data received so far.
}
}
}
}
This version can be further improved to avoid scanning the entire events array on each iteration.
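A minimal sketch of that improvement, reusing the question's buf and passing the on_data_/on_disconnect_ handlers in as template parameters (here named on_data/on_disconnect): collect the indices of the readable sockets once, then swap-remove an index as soon as its socket reports EAGAIN, so later passes only visit sockets that still have data.
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <cerrno>
#include <cstddef>
#include <vector>

// Drains all readable sockets round-robin without rescanning sockets that
// have already gone dry.
template <typename OnData, typename OnDisconnect>
void drain_ready_sockets(epoll_event* events, int fd_num,
                         char* buf, std::size_t buf_len,
                         OnData on_data, OnDisconnect on_disconnect)
{
    std::vector<int> ready;                       // indices of still-readable sockets
    for (int i = 0; i < fd_num; ++i)
        if (events[i].events & EPOLLIN)
            ready.push_back(i);

    while (!ready.empty()) {
        for (std::size_t k = 0; k < ready.size(); ) {
            const int fd = events[ready[k]].data.fd;
            const ssize_t n = ::recv(fd, buf, buf_len, 0);
            if (n > 0) {
                on_data(fd, buf, n);              // one read, then on to the next socket
                ++k;
            } else if (n == -1 && errno == EAGAIN) {
                ready[k] = ready.back();          // socket drained: drop its index
                ready.pop_back();
            } else {
                on_disconnect(fd);                // disconnect or error: drop it too
                ready[k] = ready.back();
                ready.pop_back();
            }
        }
    }
}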
In your original post, do {} while(n > 0); is incorrect and leads to an endless loop. I assume it is a typo.
I have a finite element code that uses blocking receives and non-blocking sends. Each element has 3 incoming faces and 3 outgoing faces. The mesh is split up among many processors, so sometimes the boundary conditions come from the element's own processor and sometimes from neighboring processors. Relevant parts of the code are:
std::vector<task>::iterator it = All_Tasks.begin();
std::vector<task>::iterator it_end = All_Tasks.end();
int task = 0;
for (; it != it_end; it++, task++)
{
for (int f = 0; f < 3; f++)
{
// Get the neighbors for each incoming face
Neighbor neighbor = subdomain.CellSets[(*it).cellset_id_loc].neighbors[incoming[f]];
// Get buffers from boundary conditions or neighbor processors
if (neighbor.processor == rank)
{
subdomain.Set_buffer_from_bc(incoming[f]);
}
else
{
// Get the flag from the corresponding send
target = GetTarget((*it).angle_id, (*it).group_id, (*it).cell_id);
if (incoming[f] == x)
{
int size = cells_y*cells_z*groups*angles*4;
MPI_Status status;
MPI_Recv(&subdomain.X_buffer[0], size, MPI_DOUBLE, neighbor.processor, target, MPI_COMM_WORLD, &status);
}
if (incoming[f] == y)
{
int size = cells_x*cells_z*groups*angles * 4;
MPI_Status status;
MPI_Recv(&subdomain.Y_buffer[0], size, MPI_DOUBLE, neighbor.processor, target, MPI_COMM_WORLD, &status);
}
if (incoming[f] == z)
{
int size = cells_x*cells_y*groups*angles * 4;
MPI_Status status;
MPI_Recv(&subdomain.Z_buffer[0], size, MPI_DOUBLE, neighbor.processor, target, MPI_COMM_WORLD, &status);
}
}
}
... computation ...
for (int f = 0; f < 3; f++)
{
// Get the outgoing neighbors for each face
Neighbor neighbor = subdomain.CellSets[(*it).cellset_id_loc].neighbors[outgoing[f]];
if (neighbor.IsOnBoundary)
{
// store the buffer into the boundary information
}
else
{
target = GetTarget((*it).angle_id, (*it).group_id, neighbor.cell_id);
if (outgoing[f] == x)
{
int size = cells_y*cells_z*groups*angles * 4;
MPI_Request request;
MPI_Isend(&subdomain.X_buffer[0], size, MPI_DOUBLE, neighbor.processor, target, MPI_COMM_WORLD, &request);
}
if (outgoing[f] == y)
{
int size = cells_x*cells_z*groups*angles * 4;
MPI_Request request;
MPI_Isend(&subdomain.Y_buffer[0], size, MPI_DOUBLE, neighbor.processor, target, MPI_COMM_WORLD, &request);
}
if (outgoing[f] == z)
{
int size = cells_x*cells_y*groups*angles * 4;
MPI_Request request;
MPI_Isend(&subdomain.Z_buffer[0], size, MPI_DOUBLE, neighbor.processor, target, MPI_COMM_WORLD, &request);
}
}
}
}
A processor can do a lot of tasks before it needs information from other processors. I need a non-blocking send so that the code can keep working, but I'm pretty sure the receives are overwriting the send buffers before they get sent.
I've tried timing this code, and it's taking 5-6 seconds for the call to MPI_Recv, even though the message it's trying to receive has been sent. My theory is that the Isend is starting, but not actually sending anything until the Recv is called. The message itself is on the order of 1 MB. I've looked at benchmarks and messages of this size should take a very small fraction of a second to send.
My question is, in this code, is the buffer that was sent being overwritten, or just the local copy? Is there a way to 'add' to a buffer when I'm sending, rather than writing to the same memory location? I want the Isend to write to a different buffer every time it's called so the information isn't being overwritten while the messages wait to be received.
** EDIT **
A related question that might fix my problem: Can MPI_Test or MPI_Wait give information about an MPI_Isend writing to a buffer, i.e. return true if the Isend has written to the buffer, but that buffer has yet to be received?
** EDIT 2 **
I have added more information about my problem.
So it looks like I just have to bite the bullet and allocate enough memory in the send buffers to accommodate all the messages, and then just send portions of the buffer when I send.
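A minimal sketch of that idea, with placeholder names (send_all_faces, n_tasks, slice_size, neighbor_rank) instead of the actual subdomain code: one large buffer holds a separate slice per task, every MPI_Isend gets its own slice and its own request, and nothing is reused or freed until MPI_Waitall has completed all of the sends.
#include <cstddef>
#include <vector>
#include <mpi.h>

void send_all_faces(int n_tasks, int slice_size, int neighbor_rank)
{
    std::vector<double> big_buffer(static_cast<std::size_t>(n_tasks) * slice_size);
    std::vector<MPI_Request> requests;

    for (int t = 0; t < n_tasks; ++t) {
        double* slice = &big_buffer[static_cast<std::size_t>(t) * slice_size];
        // ... fill `slice` with the outgoing face data for task t ...
        MPI_Request req;
        MPI_Isend(slice, slice_size, MPI_DOUBLE, neighbor_rank, /*tag=*/t,
                  MPI_COMM_WORLD, &req);
        requests.push_back(req);
    }

    // No slice is overwritten or destroyed before its send has completed.
    MPI_Waitall(static_cast<int>(requests.size()), requests.data(),
                MPI_STATUSES_IGNORE);
}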
I am programming in MPI. I want to send something to another processor and receive it there, but I don't know how many messages I will send. In fact, the number of messages sent to the other processor depends on the file I am reading during the program, so I don't know how many receives I should write on the other side. Which method and which function should I use?
You can still use sends and receives, but you would also add a new kind of message that tells the receiving process that there will be no more messages. Usually this is handled by sending with a different tag. So your program would look something like this:
if (sender) {
while (data_to_send == true) {
MPI_Send(data, size, datatype, receiving_rank, 0, MPI_COMM_WORLD);
}
for (i = 0; i < size; i++) {
MPI_Send(NULL, 0, MPI_INT, i, 1, MPI_COMM_WORLD);
}
} else {
while (1) {
MPI_Recv(data, size, datatype, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
if (status.MPI_TAG == 1) break;
/* Do processing */
}
}
There is a better way that works if you have non-blocking collectives (from MPI-3). Before you start receiving data, you post a non-blocking barrier. Then you start posting non-blocking receives. Instead of waiting only on the receives, you use a waitany on both requests, and when the barrier is done, you know there won't be any more data. On the sender side, you just keep sending data until there's no more, then do a non-blocking barrier to finish things off.
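A minimal sketch of that idea, using the same placeholder names as the snippet above. For simplicity it probes with MPI_Iprobe and uses a blocking receive plus MPI_Test on the barrier instead of a waitany, and it assumes synchronous sends (MPI_Ssend) so that a returned send guarantees the matching receive has already started; with a plain MPI_Send a buffered message could still be in flight when the barrier completes.
if (sender) {
    /* MPI_Ssend returns only once the matching receive has started, so after
       this loop every message of this sender is accounted for. */
    while (data_to_send == true) {
        MPI_Ssend(data, size, datatype, receiving_rank, 0, MPI_COMM_WORLD);
    }
    MPI_Request barrier_req;
    MPI_Ibarrier(MPI_COMM_WORLD, &barrier_req);
    MPI_Wait(&barrier_req, MPI_STATUS_IGNORE);
} else {
    /* The barrier request can only complete after every sender has finished. */
    MPI_Request barrier_req;
    MPI_Status status;
    int barrier_done = 0, msg_waiting = 0;
    MPI_Ibarrier(MPI_COMM_WORLD, &barrier_req);
    while (!barrier_done) {
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &msg_waiting, &status);
        if (msg_waiting) {
            MPI_Recv(data, size, datatype, status.MPI_SOURCE, status.MPI_TAG,
                     MPI_COMM_WORLD, &status);
            /* Do processing */
        }
        MPI_Test(&barrier_req, &barrier_done, MPI_STATUS_IGNORE);
    }
}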
I am trying to do an all-to-one communication out-of-order. Basically I have multiple floating point arrays of the same size, identified by an integer id.
Each message should look like:
<int id><float array data>
On the receiver side, it knows exactly how many arrays there are, and thus sets up the exact number of receives. Upon receiving a message, it parses the id and puts the data into the right place. The problem is that a message could be sent from any other process to the receiving process. (e.g. the producers have a work-queue structure, and process whichever id is available on the queue.)
Since MPI only guarantees in-order point-to-point delivery, I can't trivially put the integer id and the FP data in two messages; otherwise the receiver might not be able to match the id with the data. MPI doesn't allow two types of data in one send either.
I can only think of two approaches.
1) The receiver has an array of size m (source[m]), m being the number of sending nodes. The sender sends the id first, then the data. The receiver saves the id to source[i] after receiving an integer message from sender i. Upon receiving an FP array from sender i, it checks source[i], gets the id, and moves the data to the right place. This works because MPI guarantees in-order P2P communication. It requires the receiver to keep state information for each sender. To make matters worse, if a single sending process can have two ids sent before the data (e.g. multi-threaded), this mechanism won't work.
2) Treat the id and FP data as bytes and copy them into a send buffer, send them as MPI_CHAR, and have the receiver cast them back to an integer and an FP array (a sketch of the packing is below). Then I need to pay the additional cost of copying things into a byte buffer on the sender side. The total temporary buffer also grows as I grow the number of threads within an MPI process.
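A minimal sketch of that packing, with hypothetical helpers send_tagged_array/recv_tagged_array and a fixed element count n; it only illustrates the byte copying, not a tuned implementation:
#include <cstring>
#include <vector>
#include <mpi.h>

// Pack <int id><float data[n]> into one byte buffer and send it as MPI_CHAR.
void send_tagged_array(int id, const float* data, int n, int dest, MPI_Comm comm)
{
    std::vector<char> buf(sizeof(int) + n * sizeof(float));
    std::memcpy(buf.data(), &id, sizeof(int));
    std::memcpy(buf.data() + sizeof(int), data, n * sizeof(float));
    MPI_Send(buf.data(), static_cast<int>(buf.size()), MPI_CHAR, dest, 0, comm);
}

// Receive one packed message from any source and unpack it again.
void recv_tagged_array(int* id_out, float* data_out, int n, MPI_Comm comm)
{
    std::vector<char> buf(sizeof(int) + n * sizeof(float));
    MPI_Recv(buf.data(), static_cast<int>(buf.size()), MPI_CHAR, MPI_ANY_SOURCE, 0,
             comm, MPI_STATUS_IGNORE);
    std::memcpy(id_out, buf.data(), sizeof(int));
    std::memcpy(data_out, buf.data() + sizeof(int), n * sizeof(float));
}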
Neither of them is a perfect solution. I don't want to lock anything inside a process. I wonder if any of you have better suggestions.
Edit: The code will be run on a shared cluster with InfiniBand. The machines will be randomly assigned, so I don't think TCP sockets will be able to help me here. In addition, IPoIB looks expensive. I do need the full 40 Gbps speed for communication while keeping the CPU busy with computation.
You can specify MPI_ANY_SOURCE as the source rank in the receive function, then sort the messages using their tags, which is easier than creating custom messages. Here's a simplified example:
#include <stdio.h>
#include "mpi.h"
int main() {
MPI_Init(NULL,NULL);
int rank=0;
int size=1;
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
MPI_Comm_size(MPI_COMM_WORLD,&size);
// Receiver is the last node for simplicity in the arrays
if (rank == size-1) {
// Receiver has size-1 slots
float data[size-1];
MPI_Request request[size-1];
// Use tags to sort receives
for (int tag=0;tag<size-1;++tag){
printf("Receiver for id %d\n",tag);
// Non-blocking receive
MPI_Irecv(data+tag,1,MPI_FLOAT,
MPI_ANY_SOURCE,tag,MPI_COMM_WORLD,&request[tag]);
}
// Wait for all requests to complete
printf("Waiting...\n");
MPI_Waitall(size-1,request,MPI_STATUSES_IGNORE);
for (size_t i=0;i<size-1;++i){
printf("%f\n",data[i]);
}
} else {
// Producer
int id = rank;
float data = rank;
printf("Sending {%d}{%f}\n",id,data);
MPI_Send(&data,1,MPI_FLOAT,size-1,id,MPI_COMM_WORLD);
}
return MPI_Finalize();
}
As somebody already wrote, you can use MPI_ANY_SOURCE to receive from any source. To send two different kinds of data in a single send, you can use a derived datatype:
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"
#define asize 10
typedef struct data_ {
int id;
float array[asize];
} data;
int main() {
MPI_Init(NULL,NULL);
int rank = -1;
int size = -1;
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
MPI_Comm_size(MPI_COMM_WORLD,&size);
data buffer;
// Define and commit a new datatype
int blocklength [2];
MPI_Aint displacement[2];
MPI_Datatype datatypes [2];
MPI_Datatype mpi_tdata;
MPI_Aint startid,startarray;
MPI_Get_address(&(buffer.id),&startid);
MPI_Get_address(&(buffer.array[0]),&startarray);
blocklength [0] = 1;
blocklength [1] = asize;
displacement[0] = 0;
displacement[1] = startarray - startid;
datatypes [0] = MPI_INT;
datatypes [1] = MPI_FLOAT;
MPI_Type_create_struct(2,blocklength,displacement,datatypes,&mpi_tdata);
MPI_Type_commit(&mpi_tdata);
if (rank == 0) {
int count = 0;
MPI_Status status;
while (count < size-1 ) {
// Blocking receive from any source
printf("Receiving message %d\n",count);
MPI_Recv(&buffer,1,mpi_tdata,MPI_ANY_SOURCE,0,MPI_COMM_WORLD,&status);
printf("Message tag %d, first entry %g\n",buffer.id,buffer.array[0]);
// Counting the received messages
count++;
}
} else {
// Initialize buffer to be sent
buffer.id = rank;
for (int ii = 0; ii < asize; ii++) {
buffer.array[ii] = 10*rank + ii;
}
// Send buffer
MPI_Send(&buffer,1,mpi_tdata,0,0,MPI_COMM_WORLD);
}
MPI_Type_free(&mpi_tdata);
MPI_Finalize();
return 0;
}