Check if an adjacent slave process has ended in MPI - C++

In my MPI program, I want to send information to and receive information from adjacent processes. But if a process finishes and doesn't send anything, its neighbors will wait forever. How can I resolve this issue? Here is what I am trying to do:
if (rank == 0) {
    // don't do anything until all slaves are done
} else {
    while (condition) {
        // send info to rank-1 and rank+1
        // if we can receive info from rank-1, receive it, store received info locally
        // if we cannot receive info from rank-1, use locally stored info
        // do the same for process rank+1
        // MPI_Barrier(slaves); (wait for other slaves to finish this iteration)
    }
}
I am going to check the boundaries, of course: I won't check rank-1 when the rank is 1, and I won't check rank+1 when the process is the last one. But how can I achieve this? Should I wrap it in another while loop? I am confused.

I'd start by saying that MPI wasn't originally designed with your use case in mind. In general, MPI applications all start together and all end together. Not all applications fit into this model though, so don't lose hope!
There are two relatively easy ways of doing this and probably thousands of hard ones:
Use RMA to set flags on neighbors.
As has been pointed out in the comments, you can set up a tiny RMA window that exposes a single value to each neighbor. When a process is done working, it can do an MPI_Put on each neighbor to indicate that it's done and then MPI_Finalize. Before sending/receiving data to/from the neighbors, check to see if the flag is set.
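A minimal sketch of that idea follows; the window layout, the neighbor_done name, and the boundary checks are illustrative assumptions, not something taken from the question:

#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // One flag per neighbor: [0] is set by rank-1, [1] is set by rank+1.
    int neighbor_done[2] = {0, 0};
    MPI_Win win;
    MPI_Win_create(neighbor_done, 2 * sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    // ... worker loop (ranks > 0): before sending to / receiving from a
    //     neighbor, check the corresponding neighbor_done[] entry ...

    // When a worker finishes, it flags both neighbors before leaving the loop.
    int one = 1;
    MPI_Win_lock_all(0, win);
    if (rank > 1)                     // tell the left neighbor its right side is done
        MPI_Put(&one, 1, MPI_INT, rank - 1, 1, 1, MPI_INT, win);
    if (rank > 0 && rank + 1 < size)  // tell the right neighbor its left side is done
        MPI_Put(&one, 1, MPI_INT, rank + 1, 0, 1, MPI_INT, win);
    MPI_Win_unlock_all(win);

    MPI_Win_free(&win);   // collective, so every rank must eventually reach it
    MPI_Finalize();
    return 0;
}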
Use a special tag to detect shutdowns.
The tag value often gets ignored when sending and receiving messages, but this is a great time to use it. You can define two tag values in your application. The first (we'll call it DATA) just indicates that the message contains data and you can process it as normal. The second (DONE) indicates that the process is done and is leaving the application. When receiving messages, you'll have to change the tag from whatever you're using to MPI_ANY_TAG. Then, when a message arrives, check which tag it has. If it's DONE, stop communicating with that process.
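For example (DATA_TAG, DONE_TAG, and the double payload are illustrative choices, not from your code):

#include <mpi.h>

const int DATA_TAG = 0;  // ordinary data messages are sent with this tag
const int DONE_TAG = 1;  // a message with this tag means "I'm leaving"

// Receive one message from `neighbor`; returns false once that neighbor has
// announced it is done, true if real data arrived in `buf`.
bool recv_from_neighbor(int neighbor, double* buf, int count, MPI_Comm comm) {
    MPI_Status status;
    MPI_Recv(buf, count, MPI_DOUBLE, neighbor, MPI_ANY_TAG, comm, &status);
    return status.MPI_TAG != DONE_TAG;
}

// Called once when a rank leaves its work loop, before MPI_Finalize.
void announce_done(int rank, int size, MPI_Comm comm) {
    double dummy = 0.0;
    if (rank > 1)        MPI_Send(&dummy, 1, MPI_DOUBLE, rank - 1, DONE_TAG, comm);
    if (rank + 1 < size) MPI_Send(&dummy, 1, MPI_DOUBLE, rank + 1, DONE_TAG, comm);
}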
There's another problem with the pseudo-code you posted, however. If you expect to perform an MPI_Barrier at the end of every iteration, you can't have processes leaving early: when one does, the MPI_Barrier will hang, and unfortunately there's not much you can do to avoid that. That said, given the code you posted, I'm not sure the barrier is really necessary. It seems to me that the only dependency between iterations is between neighboring processes, and if that's the case, the sends and receives will accomplish all of the necessary synchronization.
If you still need a way to track when all of the ranks are done, you can have each process alert a single rank (say rank 0) when it leaves. When rank 0 detects that everyone is done, it can just exit. Or, if you want to leave after some other number of processes is done, you can have rank 0 send out a message to all other ranks with a special tag like the one above (the workers would then receive with MPI_ANY_SOURCE so that the message from rank 0 can also be matched).
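A sketch of that bookkeeping, assuming an illustrative FINISHED_TAG and a simple count on rank 0:

#include <mpi.h>

const int FINISHED_TAG = 2;  // illustrative tag for "I'm done" notices sent to rank 0

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        // Rank 0 just counts completion notices from the workers.
        for (int done = 0; done < size - 1; ++done) {
            int dummy;
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, FINISHED_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    } else {
        // ... worker loop from the question goes here ...
        int done = 1;
        MPI_Send(&done, 1, MPI_INT, 0, FINISHED_TAG, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}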

Related

How does `select` handle multiple events at the same time?

I am trying to understand the following code. If I have 50 connections to this server and I send data through one of those sockets, the select block with the inner loop will capture what I send and echo it back. But what happens if, within a very short time-frame of the first message, I send another one? So fast that the inner loop (the loop after select that iterates over all active client sockets) doesn't finish. Will that data be thrown away? Will it be what the next select will be triggered with? What happens if I send two messages before the inner loop finishes? Will I ever face the scenario where, inside the loop iterating over all the active sockets, more than one socket has "activity", i.e. can two FD_ISSET(sd, &readfds) be true within a single iteration of the loop?
Yes, multiple descriptors can be ready to read in a single iteration. The return value of select() is the number of descriptors that are ready, and it can be more than 1. As you loop through the descriptors, you should increment a counter when FD_ISSET(sd, &readfds) is true, and continue until the counter reaches this number.
But even if you only process one descriptor, nothing will be thrown away. select() is not triggered by changes, it returns whenever any of the descriptors is ready to read (or write, if you also use writefds). If a descriptor is ready to read, but you don't read from it, it will still be ready to read the next time you call select(), so it will return immediately.
However, if you only process the first descriptor you find in the loop, later descriptors could be "starved" if an earlier descriptor is always ready to read, and you never process the later ones. So it's generally best to always process all the ready descriptors.
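To make that concrete, here is a rough sketch of such a loop; client_socket[], the buffer size, and the echo logic are placeholders mirroring the kind of code the question describes, and accepting new connections on the listening socket is omitted:

#include <sys/select.h>
#include <unistd.h>

void serve_once(int listen_fd, int client_socket[], int max_clients) {
    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(listen_fd, &readfds);
    int max_fd = listen_fd;

    for (int i = 0; i < max_clients; ++i) {
        int sd = client_socket[i];
        if (sd > 0) FD_SET(sd, &readfds);
        if (sd > max_fd) max_fd = sd;
    }

    // select() returns the number of ready descriptors; it can be more than 1.
    int nready = select(max_fd + 1, &readfds, nullptr, nullptr, nullptr);
    if (nready <= 0) return;

    // Service every ready descriptor, not just the first one found.
    for (int i = 0; i < max_clients && nready > 0; ++i) {
        int sd = client_socket[i];
        if (sd > 0 && FD_ISSET(sd, &readfds)) {
            --nready;
            char buf[1024];
            ssize_t n = read(sd, buf, sizeof buf);
            if (n > 0) {
                write(sd, buf, n);   // echo the data back
            } else {
                close(sd);           // client hung up or error
                client_socket[i] = 0;
            }
        }
    }
}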
select() is a level-triggered API, which means that it answers the question "are any of these file descriptors readable/writable now?", not "have these file descriptors become readable/writable?". That should answer most of your questions:
But what happens if within a very short time-frame of the first message, I send another one? [...] Will it be what the next select will be triggered with?
It will be what the next select() will be triggered with.
What happens if I send two messages before the inner loop finishes?
That depends on how long the messages are - TCP doesn't work in terms of messages, but in terms of a stream of bytes. The server might well read both messages in a single read(). And if it doesn't, the socket will remain readable, and it will pick them up immediately on the next select().
Will I ever face the scenario where inside the loop iterating over all the active sockets I get more than 1 that has "activity" - i.e.: can two FD_ISSET(sd, &readfds) be true within a single iteration of the loop?
Yes, if two clients send data at the same time (while you are out of select()), select() will report two readable file descriptors.
To add to the already excellent answers:
The select function in this case isn't grabbing packets directly from the wire; it's going to the packet buffer, usually part of the NIC, to grab packets/frames that are available to be read. The packet buffer is normally a ring buffer: it has a fixed size, new packets come in at the "top", and when the buffer gets full, the oldest packets drop out of the "bottom".
Just as @sam-varshavchik mentioned in the comments, as long as select is implemented correctly and the packet buffer doesn't clog up during the time you are going through the select loop, you will be fine.
Here's an interesting article on how to implement a packet ring buffer for a socket.

Non-blocking data sharing through OpenMPI

I'm trying to spread data across multiple workers using OpenMPI, however, I'm doing the data division in a fairly custom way that is not amenable to MPI_Scatter or MPI_Broadcast. What I would like to do is to give each processor some work in a queue (or, some other async mechanism) such that they can do their work on the first chunk of data, take the next chunk, repeat until no more chunks.
I know of MPI_Isend; however, if I send data with MPI_Isend I can't modify it until it has finished sending, forcing me to use MPI_Wait and thus having to wait until the receiving worker has finished receiving the data anyway!
Is there a standard solution to this problem, or must I rethink my approach?
The completion of an MPI_ISEND doesn't necessarily mean that the message has been received on the remote end. It just means that the send buffer is available for reuse. The message might have been buffered internally by Open MPI, or it might actually have been received on the other end. It depends on your message size.
Another option would be to have your workers ask the master process for work when they need it instead of having it pushed to them. Then the master can work only as needed. You could do an MPI_SCATTER for the first message since everyone will be receiving some data. Then after that, have the master do an MPI_RECV(MPI_ANY_SOURCE) to get a message from one of the worker processes.
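A sketch of that pull model, skipping the initial MPI_Scatter and just having each worker send a small request message; the tag names, the double payload, and the chunk layout are illustrative:

#include <mpi.h>
#include <vector>

const int REQUEST_TAG = 10;  // worker -> master: "give me more work"
const int WORK_TAG    = 11;  // master -> worker: here is a chunk
const int STOP_TAG    = 12;  // master -> worker: nothing left, shut down

void master(int nworkers, const std::vector<std::vector<double>>& chunks) {
    std::size_t next = 0;
    int active = nworkers;
    while (active > 0) {
        MPI_Status st;
        int dummy;
        // Wait for any worker to ask for work.
        MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, REQUEST_TAG,
                 MPI_COMM_WORLD, &st);
        if (next < chunks.size()) {
            const std::vector<double>& c = chunks[next++];
            MPI_Send(c.data(), (int)c.size(), MPI_DOUBLE, st.MPI_SOURCE,
                     WORK_TAG, MPI_COMM_WORLD);
        } else {
            MPI_Send(&dummy, 0, MPI_INT, st.MPI_SOURCE, STOP_TAG, MPI_COMM_WORLD);
            --active;
        }
    }
}

void worker(int chunk_size) {
    std::vector<double> buf(chunk_size);
    while (true) {
        int dummy = 0;
        MPI_Send(&dummy, 1, MPI_INT, 0, REQUEST_TAG, MPI_COMM_WORLD);
        MPI_Status st;
        MPI_Recv(buf.data(), chunk_size, MPI_DOUBLE, 0, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &st);
        if (st.MPI_TAG == STOP_TAG) break;
        // ... process the chunk in buf ...
    }
}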

MPI: Receiving an already-received message?

I'm writing a C++ program that uses OpenMPI. It executes in "rounds", where in each round, process 0 sends chunks of data to the other processes, they do stuff to it and send results back, and when there are no more chunks to send, process 0 sends a "done" message to each other process. A "done" message is just a single-int message with tag 3. My first round executes fine. However, when I get to round two, processes 1-p "probe" and "receive" a done message before process 0 has had a chance to send anything (let alone a done message).
I've gone over my code many times now and it seems like the only place this message could be coming from is where process 0 sent it in the previous round - but each process had already received that. I'd rather not post my code since it's pretty big, but does anyone know if MPI messages can be received twice like this?
I think I may have the answer... Since the actual data in the done message doesn't matter, I didn't think to have the processes actually receive it. It turns out that in the previous round, the processes were "probing" the message and finding that the tag was 3, then breaking out of their loop. Therefore, in round two, the message was still waiting to be received, so when they called MPI_Probe, they found the same message as in the previous round.
To solve this I just put in a call to MPI_Recv. I looked at MPI_Cancel but I can't find enough information about it to see if it would be appropriate. Sorry for being misleading in my question!
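In case it helps anyone else, the fix looks roughly like this (check_done is just an illustrative helper name; the tag value 3 is the one from my question):

#include <mpi.h>

const int DONE_TAG = 3;  // the "done" tag from the question

// Probe the next message from rank 0; if it is the "done" message, actually
// receive it so it doesn't stay queued and get found again next round.
bool check_done(MPI_Comm comm) {
    MPI_Status status;
    MPI_Probe(0, MPI_ANY_TAG, comm, &status);
    if (status.MPI_TAG == DONE_TAG) {
        int dummy;
        MPI_Recv(&dummy, 1, MPI_INT, 0, DONE_TAG, comm, MPI_STATUS_IGNORE);
        return true;
    }
    return false;  // a real data message is waiting; receive it as usual
}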

Is there something wrong in my MPI algorithm?

I setup this algorithm to share data between different processors, and it has worked so far, but I'm trying to throw a much larger problem at it and I'm witnessing some very strange behavior. I'm losing pieces of data between MPI_Isend's and MPI_Recv's.
I present a snippet of the code below. It is basically comprised of three stages. First, a processor will loop over all elements in a given array. Each element represents a cell in a mesh. The processor checks if the element is being used on other processors. If yes, it does a non-blocking send to that process using the cell's unique global ID as the tag. If no, it checks the next element, and so on.
Second, the processor then loops over all elements again, this time checking if the processor needs to update the data in that cell. If yes, then the data has already been sent out by another process. The current process simply does a blocking receive, knowing who owns the data and the unique global ID for that cell.
Finally, MPI_Waitall is used for the request codes that were stored in the 'req' array during the non-blocking sends.
The issue I'm having is that this entire process completes---there is no hang in the code. But some of the data being received by some of the cells just isn't correct. I check that all data being sent is right by printing each piece of data prior to the send operation. Note that I'm sending and receiving a slice of an array. Each send will pass 31 elements. When I print the array from the process that received it, 3 out of the 31 elements are garbage. All other elements are correct. The strange thing is that it is always the same three elements that are garbage---the first, second and last element.
I want to rule out that something isn't drastically wrong in my algorithm which would explain this. Or perhaps it is related to the cluster I'm working on? As I mentioned, this worked on all other models I threw at it, using up to 31 cores. I'm getting this behavior when I try to throw 56 cores at the problem. If nothing pops out as wrong, can you suggest a means to test why certain pieces of a send are not making it to their destination?
do i = 1, num_cells
  ! skip cells with data that isn't needed by other processors
  if (.not.needed(i)) cycle
  tag = gid(i)        ! the unique ID of this cell in the entire system
  ghoster = ghosts(i) ! the processor that needs data from this cell
  call MPI_Isend(data(i,1:tot_levels), tot_levels, mpi_datatype, ghoster, tag, MPI_COMM, req(send), mpierr)
  send = send + 1
end do
sends = send - 1

do i = 1, num_cells
  ! skip cells that don't need a data update
  if (.not.needed_here(i)) cycle
  tag = gid(i)
  owner = owner(i)
  call MPI_Recv(data(i,1:tot_levels), tot_levels, mpi_datatype, owner, tag, MPI_COMM, MPI_STATUS_IGNORE, mpierr)
end do

call MPI_Waitall(sends, req, MPI_STATUSES_IGNORE, mpierr)
Is your problem that you're not receiving all of the messages? Note that just because an MPI_SEND or MPI_ISEND completes, doesn't mean that the corresponding MPI_RECV was actually posted/completed. The return of the send call only means that the buffer can be reused by the sender. That data may still be buffered internally somewhere on either the sender or the receiver.
If it's critical that you know that the message was actually received, you need to use a different variety of the send like MPI_SSEND or MPI_RSEND (or the nonblocking versions if you prefer). Note that this won't actually solve your problem. It will probably just make it easier for you to figure out which messages aren't showing up.
I figured out a way to get my code to work, but I'm not entirely sure why, so I'm going to post the solution here and maybe somebody could comment on why this is the case and possibly offer a better solution.
As I indicated in my question and as we discussed in the comments, it appeared that pieces of data were being lost between sends/receives. The concept of the buffer is a mystery to me, but I thought that maybe there wasn't enough space to hold my Isends, allowing them to get lost before they could be received. So I swapped out the MPI_Isend calls for MPI_Bsend calls. I figured out how big my buffer needs to be using MPI_Pack_size, so I know I have ample space for all the messages I send, and I attached a buffer of that size using MPI_Buffer_attach. I got rid of the MPI_Waitall, since it is no longer needed, and replaced it with a call to MPI_Buffer_detach.
The code runs without issue and arrives at identical results to the serial case. I'm able to scale the problem size up to what I tried before and it works now. So based on these results, I'd have to assume that pieces of messages were being lost due to insufficient buffer space.
I have concerns about the impact on code performance. I did a scaling study on different problem sizes. See the image below. The x-axis gives the size of the problem (5 means the problem is 5 times bigger than 1). The y-axis gives the time to finish executing the program. There are three lines shown. Running the program in serial is shown in blue. The size=1 case is extrapolated out linearly with the green line. We see that the code execution time is linearly correlated with problem size. The red line shows running the program in parallel---we use a number of processors that matches the problem size (e.g. 2 cores for size=2, 4 cores for size=4, etc.).
You can see that the parallel execution time increases very slowly with problem size, which is expected, except for the largest case. I feel that the poor performance for the largest case is being caused by an increased amount of message buffering, which was not needed in smaller cases.
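Since my real code is Fortran, here is only a C-style sketch of the buffering recipe described above; the array names and sizes are illustrative:

#include <mpi.h>
#include <cstdlib>

void send_all_buffered(double* chunks, int nchunks, int chunk_len,
                       const int* dest, const int* tag) {
    // Work out how much buffer space one packed message needs, then attach
    // enough space for all of them plus MPI's per-message overhead.
    int per_msg = 0;
    MPI_Pack_size(chunk_len, MPI_DOUBLE, MPI_COMM_WORLD, &per_msg);
    int buf_size = nchunks * (per_msg + MPI_BSEND_OVERHEAD);
    void* buf = std::malloc(buf_size);
    MPI_Buffer_attach(buf, buf_size);

    for (int i = 0; i < nchunks; ++i) {
        // MPI_Bsend copies the message into the attached buffer and returns,
        // so no request handles or MPI_Waitall are needed.
        MPI_Bsend(chunks + i * chunk_len, chunk_len, MPI_DOUBLE,
                  dest[i], tag[i], MPI_COMM_WORLD);
    }

    // Detaching blocks until every buffered message has been transmitted,
    // which is why the MPI_Waitall could be dropped.
    void* detached = nullptr;
    int detached_size = 0;
    MPI_Buffer_detach(&detached, &detached_size);
    std::free(detached);
}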

Broadcast message for all processes to exit (MPI)

[MPI, C++]
I made an application that, under a specific condition, should shut itself down in all processes.
I tried to do this from the root process, but I want to send a message to all the other processes so that they terminate as well. How can I do this?
There is no way to quit an MPI application cleanly on all processes without communication. That means that if you have a condition that occurs only on a subset of the processes in your MPI application (e.g. an error on one of the processes), the only way to unilaterally quit the application is to call MPI_Abort. This will result in all MPI processes coming to an abrupt end, no matter where in the code each rank was at that moment. Since MPI_Abort is not a collective routine, it is not possible to perform any cleanup on any of the other ranks.
If you wish to have a clean exit, you need to regularly communicate between all ranks whether everything is still working on all ranks, or if it is time to quit. For example, you could regularly call MPI_Allreduce with MPI_SUM as the operation. If your exit condition is fulfilled on a process, make it send 1 as the data, otherwise make it send 0. Now you only need to check after the MPI_Allreduce if the sum is larger than 0, and if it is, quit your application in an orderly fashion.
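As a small sketch of that pattern (the helper name is mine):

#include <mpi.h>

// Returns true as soon as any rank has set its local exit condition.
// Every rank must call this at the same point in each iteration, since
// MPI_Allreduce is collective.
bool someone_wants_to_quit(bool local_exit_condition, MPI_Comm comm) {
    int local = local_exit_condition ? 1 : 0;
    int global = 0;
    MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_SUM, comm);
    return global > 0;
}

// Typical use inside every rank's main loop:
//   if (someone_wants_to_quit(my_error_occurred, MPI_COMM_WORLD)) {
//       /* clean up, then */ break;   // and eventually call MPI_Finalize
//   }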