MPI Send and Receive modes for large number of processors - c++

I know there are a ton of questions & answers about the different modes of MPI send and receive out there, but I believe mine is different or I am simply not able to apply these answers to my problem.
Anyway, my scenario is as follows. The code is intended for high performance clusters with potentially thousands of cores, organized into a multi-dimensional grid. In my algorithm, there are two successive operations that need to be carried out, let's call them A and B, where A precedes B. A brief description is as follows:
A: Each processor has multiple buffers. It has to send each of these buffers to a certain set of processors. For each buffer to be sent, the set of receiving processors might differ. Sending is the last step of operation A.
B: Each processor receives a set of buffers from a set of processors. Operation B will then work on these buffers once it has received all of them. The result of that operation will be stored in a fixed location (neither in the send nor in the receive buffers).
The following properties are also given:
in A, every processor can compute which processors to send to, and it can also compute a corresponding tag in case that a processor receives multiple buffers from the same sending processor (which is very likely).
in B, every processor can also compute which processors it will receive from, and the corresponding tags that the messages were sent with.
Each processor has its own send buffers and receive buffers, and these are disjoint (i.e. there is no processor that uses its send buffer as a receive buffer as well, and vice versa).
A and B are executed in a loop, together with other operations before A and after B. We can ensure that the send buffers will not be used again until the next loop iteration, where they are filled with new data in A, and the receive buffers will also not be used again until the next iteration, where they are used to receive new data in operation B.
The transition between A and B should, if possible, be a synchronization point, i.e. we want to ensure that all processors enter B at the same time.
To send and receive, both A and B have to use (nested) loops of their own to send and receive the different buffers. However, we cannot make any assumption about the order of these send and receive statements, i.e. for any two buffers buf0 and buf1 we cannot guarantee that if buf0 is received by some processor before buf1, buf0 was also sent before buf1. Note that at this point, using collective operations like MPI_Bcast etc. is not an option yet due to the complexity of determining the set of receiving/sending processors.
Question: Which send and receive modes should I use? I have read a lot of different stuff about these different modes, but I cannot really wrap my head around them. The most important property is deadlock-freedom, and the next most important thing is performance. I am tending towards using MPI_Isend() in A without checking the request status, using the non-blocking MPI_Irecv() in B's loops, and then using MPI_Waitall() to ensure that all buffers were received (and as a consequence, also that all buffers have been sent and the processors are synchronized).
Is this the correct approach, or do I have to use buffered sends or something entirely different? I don't have a ton of experience in MPI and the documentation does not really help me much either.
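In code, the pattern I have in mind would look roughly like the sketch below (send_bufs/dests/tags and recv_bufs/sources/rtags are placeholders for whatever A and B compute; counts and datatypes are made up for illustration):

    #include <mpi.h>
    #include <vector>

    // Sketch only: send_bufs/dests/tags and recv_bufs/sources/rtags stand for
    // whatever A and B compute; MPI_DOUBLE is just an example datatype.
    void iteration(MPI_Comm comm,
                   std::vector<std::vector<double> >& send_bufs,
                   const std::vector<int>& dests, const std::vector<int>& tags,
                   std::vector<std::vector<double> >& recv_bufs,
                   const std::vector<int>& sources, const std::vector<int>& rtags)
    {
        std::vector<MPI_Request> reqs;

        // Operation A ends by posting all sends without blocking on them.
        for (std::size_t i = 0; i < send_bufs.size(); ++i) {
            MPI_Request r;
            MPI_Isend(send_bufs[i].data(), (int)send_bufs[i].size(), MPI_DOUBLE,
                      dests[i], tags[i], comm, &r);
            reqs.push_back(r);
        }

        // Operation B posts all receives, then waits for everything.
        for (std::size_t i = 0; i < recv_bufs.size(); ++i) {
            MPI_Request r;
            MPI_Irecv(recv_bufs[i].data(), (int)recv_bufs[i].size(), MPI_DOUBLE,
                      sources[i], rtags[i], comm, &r);
            reqs.push_back(r);
        }

        // The send requests are completed here as well, so the send buffers
        // may safely be refilled in the next loop iteration.
        MPI_Waitall((int)reqs.size(), reqs.data(), MPI_STATUSES_IGNORE);

        // ... B works on recv_bufs here ...
    }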

From how you describe your problem, I think MPI_Isend is likely to be the best (only?) option for A because it's guaranteed non-blocking, whereas MPI_Send may be non-blocking, but only if it is able to internally buffer your send buffer.
You should then be able to use an MPI_Barrier so that all processes enter B at the same time. But this might be a performance hit. If you don't insist that all processes enter B at the same time, some can begin receiving messages sooner. Furthermore, given your disjoint send & receive buffers, this should be safe.
For B, you can use MPI_Irecv or MPI_Recv. MPI_Irecv is likely to be faster, because a standard MPI_Recv might be waiting around for a slower send from another process.
Whether or not you block on the receiving end, you ought to call MPI_Waitall before finishing the loop to ensure all send/recv operations have completed successfully.
An additional point: you could leverage MPI_ANY_SOURCE with MPI_Recv to receive messages in a blocking manner and immediately operate on them, no matter what order they arrive. However, given that you specified that no operations on the data happen until all data are received, this might not be that useful.
Finally: as mentioned in these recommendations, you will get the best performance if you can restructure your code so that you can just use MPI_Ssend. In this case you avoid any buffering at all. To achieve this, you'd have to have all processes first call an MPI_Irecv, then begin sending via MPI_Ssend. It might not be as hard as you think to refactor in this way, particularly if, as you say, each process can work out independently which messages it will receive from whom.
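As a rough sketch of that restructuring (all buffer/rank/tag bookkeeping is assumed to come from your own A/B logic, and the names below are placeholders):

    #include <mpi.h>
    #include <vector>

    // Sketch of the "pre-post receives, then synchronous-send" ordering.
    // recv_bufs/sources/rtags and send_bufs/dests/tags are placeholders.
    void exchange(MPI_Comm comm,
                  std::vector<std::vector<double> >& recv_bufs,
                  const std::vector<int>& sources, const std::vector<int>& rtags,
                  std::vector<std::vector<double> >& send_bufs,
                  const std::vector<int>& dests, const std::vector<int>& tags)
    {
        std::vector<MPI_Request> recv_reqs(recv_bufs.size());

        // 1. Every process posts all of its receives first ...
        for (std::size_t i = 0; i < recv_bufs.size(); ++i)
            MPI_Irecv(recv_bufs[i].data(), (int)recv_bufs[i].size(), MPI_DOUBLE,
                      sources[i], rtags[i], comm, &recv_reqs[i]);

        // 2. ... then sends with MPI_Ssend: each send completes only once the
        //    matching receive has been posted, so no extra buffering is needed.
        for (std::size_t i = 0; i < send_bufs.size(); ++i)
            MPI_Ssend(send_bufs[i].data(), (int)send_bufs[i].size(), MPI_DOUBLE,
                      dests[i], tags[i], comm);

        // 3. Wait for all receives before B starts working on the buffers.
        MPI_Waitall((int)recv_reqs.size(), recv_reqs.data(), MPI_STATUSES_IGNORE);
    }

Because every process posts its (nonblocking) receives before entering its send loop, no MPI_Ssend can wait forever for a matching receive, so the ordering above stays deadlock-free.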

Related

How to overlap nonblocking communication

I know that nonblocking communication does not block the code; for example, MPI_Isend will send the data immediately. But when we have some all-to-all communication and need to read this data, we need to use MPI_Wait to get the data first.
For all-to-all communication, the first thing that comes to my mind is something like this (it is real code):
1- initialise MPI, get the rank and size, ...
2- create the data or read the first data from a file
3- in a for loop, send the data from every rank to all the others, by MPI_Isend or even MPI_Bcast
4- finalise MPI
When writing the for loop, we need to use MPI_Wait for both sending and receiving.
My question is how we can use overlapping with nonblocking communication.
I want to use MPI_Isend and MPI_Irecv twice in each loop iteration, to overlap some computation on the first received data while doing another send and receive, but this approach needs 4 waits. Is there any algorithm for overlapping the nonblocking communications?
I know that nonblocking communication does not block the code; for example, MPI_Isend will send the data immediately.
This is not accurate; the data is not sent immediately. Let me explain in more detail:
MPI_Irecv and MPI_Isend are nonblocking communication routines; therefore one needs to use MPI_Wait (or MPI_Test, to test for the completion of the request) to ensure that the message is completed and that the data in the send/receive buffer can again be safely manipulated.
The nonblocking here means one does not wait for the data to be read and sent; rather the data is immediately made available to be read and sent. That does not imply, however, that the data is immediately sent. If it were, there would not be a need to call MPI_Wait.
But when we have some all-to-all communication and need to read this data, we need to use MPI_Wait to get the data first.
You need to call MPI_Wait (or MPI_Test) for all communication types, 1 to 1, 1 to n, and so on, not just all-to-all. Calling MPI_Wait is not about getting the data first; rather, during the MPI_Isend the content of the buffer (e.g., the array of ints) has to be read and sent. In the meantime, one can overlap some computation with the ongoing communication, but this computation cannot change (or read) the content of the send/recv buffer. One then calls MPI_Wait to ensure that from that point onwards the data sent/received can be safely read/modified without any issues.
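In code, the usual overlap pattern looks roughly like this sketch (peer, the buffers and the overlapped computation are placeholders):

    #include <mpi.h>

    // Sketch of overlapping computation with one nonblocking send/receive pair.
    // 'peer', the buffers and the overlapped computation are placeholders.
    void overlap_example(MPI_Comm comm, int peer,
                         double* sendbuf, double* recvbuf, int count)
    {
        MPI_Request reqs[2];
        MPI_Isend(sendbuf, count, MPI_DOUBLE, peer, /*tag=*/0, comm, &reqs[0]);
        MPI_Irecv(recvbuf, count, MPI_DOUBLE, peer, /*tag=*/0, comm, &reqs[1]);

        // Overlap window: do work here that neither reads nor writes
        // sendbuf/recvbuf, e.g. compute_on_other_data();

        // Only after MPI_Waitall may sendbuf be reused and recvbuf be read.
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }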
I want to use MPI_Isend and MPI_Irecv twice in each loop iteration, to overlap some computation on the first received data while doing another send and receive, but this approach needs 4 waits. Is there any algorithm for overlapping the nonblocking communications?
Whenever possible, just use the built-in communication routines; they are likely more optimized than what one would be able to write by hand. What you have described fits the MPI routine MPI_Ialltoall perfectly:
Sends data from all to all processes in a nonblocking way
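A minimal sketch of how MPI_Ialltoall is used with overlap (the block size and buffers are made up for illustration):

    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int size;
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int block = 4;                        // elements sent to each rank
        std::vector<double> sendbuf(block * size, 1.0);
        std::vector<double> recvbuf(block * size);

        MPI_Request req;
        MPI_Ialltoall(sendbuf.data(), block, MPI_DOUBLE,
                      recvbuf.data(), block, MPI_DOUBLE,
                      MPI_COMM_WORLD, &req);

        // ... computation that does not touch sendbuf/recvbuf can run here ...

        MPI_Wait(&req, MPI_STATUS_IGNORE);          // now recvbuf is ready to use

        MPI_Finalize();
        return 0;
    }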

write_some vs write - boost asio

Why would someone want to use write_some when it may not transmit all of the data to the peer?
From the boost write_some documentation:
The write_some operation may not transmit all of the data to the peer.
Consider using the write function if you need to ensure that all data
is written before the blocking operation completes.
What is the relevance of the write_some method in boost when it has the write method? I went through the boost write_some documentation, but I still cannot figure it out.
At one extreme, write waits until all the data has been confirmed as written to the remote system. It gives the greatest certainty of successful completion at the expense of being the slowest.
At the opposite extreme, you could just queue the data for writing and return immediately. This is fast, but gives no assurance at all that the data will actually be written. If a router was down, a DNS giving incorrect addresses, etc., you could be trying to send to some machine that isn't available and (possibly) hasn't been for a long time.
write_some is kind of a halfway point between these two extremes. It doesn't return until at least some data has been written, so it assures you that the remote host you were trying to write to does currently exist (for some, possibly rather loose, definition of "currently"). It doesn't assure you that all the data will be written but may complete faster, and still gives a little bit of a "warm fuzzy" feeling that the write is likely to complete.
As to when you'd likely want to use it: the obvious scenario would be something like a large transfer over a local connection on a home computer. The likely problem here isn't with the hardware, but with the computer (or router) being mis-configured. As soon as one byte has gone through, you're fairly assured that the connection is configured correctly, and the transfer will probably complete. Since the transfer is large, you may be saving a lot of time in return for a minimal loss of assurance about successful completion.
As to when you'd want to avoid it: pretty much reverse the circumstances above. You're sending a small amount of data over (for example) an unreliable Internet connection. Since you're only sending a little data, you don't save much time by returning before all the data's sent. The connection is unreliable enough that the odds of a packet being transmitted are effectively independent of the odds for other packets--that is, sending one packet through tells you little about the likelihood of being able to send the next.
There is no reason really. But these functions are at different levels.
basic_stream_socket::write_some is an operation on a socket that pretty much wraps the OS's send operation (most send implementations do not guarantee transmission of the complete message). Normally you wrap this call in a loop until all of the data is sent.
asio::write is a high-level wrapper that will loop until all of the data is sent. It accepts a socket as an argument.
One possible reason to use write_some could be when porting existing code that is based on sockets and that already does the looping.
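A rough sketch of the difference, assuming sock is an already-connected boost::asio::ip::tcp::socket:

    #include <boost/asio.hpp>
    #include <cstddef>

    // Sketch only: 'sock' is assumed to be a connected TCP socket.
    void send_all(boost::asio::ip::tcp::socket& sock, const char* data, std::size_t len)
    {
        // Low level: write_some may send only part of the buffer, so loop manually.
        std::size_t sent = 0;
        while (sent < len)
            sent += sock.write_some(boost::asio::buffer(data + sent, len - sent));

        // High level: boost::asio::write does the same loop internally, e.g.
        // boost::asio::write(sock, boost::asio::buffer(data, len));
    }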

Can single-buffer blocking WSASend deliver partial data?

I've pretty much always used send() with sockets and now I'm moving on to the WSA functions. With send(), I have a sendall() helper that ensures all data is delivered even if a partial send occurs on the first call.
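For reference, the helper is essentially the classic send-in-a-loop pattern (sketch, not my exact code):

    #include <winsock2.h>

    // Sketch: keep calling send() until the whole buffer has gone out
    // or an error occurs.
    bool sendall(SOCKET s, const char* buf, int len)
    {
        int sent = 0;
        while (sent < len) {
            int n = send(s, buf + sent, len - sent, 0);
            if (n == SOCKET_ERROR)
                return false;          // caller can inspect WSAGetLastError()
            sent += n;
        }
        return true;
    }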
So, instead of learning the hard way or over-complicating code when I don't have to, decided to ask you:
Can a blocking WSASend() send partial data or does it send everything before it returns or fails? Or should I check the bytes sent vs. expected to send and keep at it until everything is delivered?
ANSWER: Overlapped WSASend() does not normally send partial data; if it does, it means the connection has terminated. I've never encountered that case yet.
From the WSASend docs:
If the socket is non-blocking and stream-oriented, and there is not sufficient space in the transport's buffer, WSASend will return with only part of the application's buffers having been consumed. Given the same buffer situation and a blocking socket, WSASend will block until all of the application buffer contents have been consumed.
I haven't tried this behavior though. BTW, why are you rewriting your code to use the WSA functions? Switching from the standard BSD socket API just to use the socket with basically the same blocking behavior doesn't really seem like a good idea to me. Just keep the old blocking code with send and the "retry code"; that way it's portable and bulletproof. It is not saving 1-2 comparisons that makes your IO code performant.
Switch to the specialized WSA functions only if you are trying to exploit some Windows-specific strengths, or if you want to use non-blocking sockets with WSAWaitForMultipleObjects, which is a bit better than the standard select; but even in that case you can simply go with send and recv, as I did.
In my opinion, using epoll/kqueue/iocp (or a library that abstracts these away) with sockets is the way to go. There are some very basic tasks that can be done with blocking sockets, but if you cross the line and need nonblocking sockets, then switching straight to epoll/kqueue/iocp is the way to go instead of programming against the painful select or WSAWaitForMultipleObjects based APIs. epoll/kqueue/iocp are not only better but also easier to program than the select based alternatives. Really. They are more modern APIs that were invented based on more experience. (Although they are not cross-platform, but even select has portability issues...)
The previously mentioned APIs for linux/bsd/windows are based on the same concept, but in my opinion the simplest and easiest to learn is the epoll API of linux. It is way better than a select call, and it's 100x easier to program once you get the idea. If you start with IOCP on Windows, it may seem a bit more complicated.
If you haven't yet used these APIs, then definitely give epoll a go if you are familiar with linux, and then on Windows implement the same with IOCP, which is based on a similar concept but with somewhat more complicated overlapped IO programming. With IOCP you will have a reason to use WSASend, because you cannot start overlapped IO on a socket with send, but you can do that with WSASend (or WriteFile).
EDIT: If you are going for max performance with IOCP then here are some additional hints:
Drop blocking operations. This is very important. A serious networking engine cannot afford blocking IO. It simply doesn't scale on any of the platforms. Do overlapped operations for both send and receive; overlapped IO is the big gun of Windows.
Set up a thread pool that processes the completed IO operations. Set up test clients that bomb your server with real-world-usage-like messages and parallel connection counts, and under stress tweak the buffer sizes and thread counts for your actual target hardware.
Set the SO_RCVBUF and SO_SNDBUF sizes of your sockets to zero and play around with the size of the buffers that you use to send and receive data. Setting the rcv/snd buffers of the socket handle to zero allows the TCP stack to receive/send data directly to/from your buffers, avoiding an additional copy between your userspace buffers and the socket buffers. The optimal size for these buffers is also subject to tweaking. I usually use buffer sizes of at least a few tens of KB, but sometimes, in the case of large-volume transfers, 1-2M buffer sizes are better, depending on the number of parallel busy connections. Again, tweak the values while stressing the server with test clients that behave like real-world clients. Once you have the first working version of your network engine, build a test client on top of it that can simulate many (maybe thousands of) parallel clients, depending on the real-world usage of your server.
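For example, zeroing the socket buffers is just two setsockopt calls (sketch, error handling omitted):

    #include <winsock2.h>

    // Sketch: disable the kernel-level socket buffers so overlapped sends and
    // receives go directly to/from the application's own buffers.
    void zero_socket_buffers(SOCKET s)
    {
        int zero = 0;
        setsockopt(s, SOL_SOCKET, SO_SNDBUF,
                   reinterpret_cast<const char*>(&zero), (int)sizeof(zero));
        setsockopt(s, SOL_SOCKET, SO_RCVBUF,
                   reinterpret_cast<const char*>(&zero), (int)sizeof(zero));
    }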
You will need "per connection software send buffers" inside your network engine and you may (or may not) want to control the max size of the send buffers. In case of reaching the max send buffer size you may want to block or discard messages/data depending on what you want to do, encapsulate this special buffer and provide two nice interfaces to it: one for the threads that are putting data into this buffer and another interface that is used by the IOCP sender code. This buffer is usually a very critical part of the whole thing and I usually had a lot of bugs around this part of the code so make sure to design its interface nicely to minimize the number of bugs. Depending on how your application constructs and puts messages into the queue you can play around a lot with the internal implementation (size of storage chunks, nagle-like optimizations, ...).

What exactly happens when we use mpi_send/receive functions?

What happens when we use the mpi_send/receive functions? Is this communication done by value or by the address of the variable that we want to send and receive? (For example, process 0 wants to send variable "a" to process 1. Does process 0 send the value of "a" or the address of "a"?) And what happens when we use derived data types for communication?
Quite a bit of magic happens behind the scenes.
First, there's the unexpected message queue. When the sender calls MPI_Send before the receiver has called MPI_Recv, MPI doesn't know where in the receiver's memory the message is going. Two things can happen at this point. If the message is short, it is copied to a temporary buffer at the receiver. When the receiver calls MPI_Recv it first checks if a matching message has already arrived, and if it has, copies the data to the final destination. If not, the information about the target buffer is stored in the queue so the MPI_Recv can be matched when the message arrives. It is possible to examine the unexpected queue with MPI_Probe.
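For example, a receiver can probe for a message, size its buffer from the returned status, and only then post the matching receive (sketch):

    #include <mpi.h>
    #include <vector>

    // Sketch: probe for an incoming message, size the buffer, then receive it.
    void probe_and_receive(MPI_Comm comm)
    {
        MPI_Status status;
        MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &status);

        int count;
        MPI_Get_count(&status, MPI_INT, &count);

        std::vector<int> buf(count);
        MPI_Recv(buf.data(), count, MPI_INT,
                 status.MPI_SOURCE, status.MPI_TAG, comm, MPI_STATUS_IGNORE);
    }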
If the message is longer than some threshold, copying it would take too long. Instead, the sender and the receiver do a handshake with a rendezvous protocol of some sort to make sure the target is ready to receive the message before it is sent out. This is especially important with a high-speed network like InfiniBand.
If the communicating ranks are on the same machine, usually the data transfer happens through shared memory. Because MPI ranks are independent processes, they do not have access to each other's memory. Instead, the MPI processes on the same node set up a shared memory region and use it to transfer messages. So sending a message involves copying the data twice: the sender copies it into the shared buffer, and the receiver copies it out into its own address space. There exists an exception to this. If the MPI library is configured to use a kernel module like KNEM, the message can be copied directly to the destination in the OS kernel. However, such a copy incurs a penalty of a system call. Copying through the kernel is usually only worth it for large messages. Specialized HPC operating systems like Catamount can change these rules.
Collective operations can be implemented either in terms of send/receive, or can have a completely separate optimized implementation. It is common to have implementations of several algorithms for a collective operation. The MPI library decides at runtime which one to use for best performance depending on the size of the messages and the communicator.
A good MPI implementation will try very hard to transfer a derived datatype without creating extra copies. It will figure out which regions of memory within a datatype are contiguous and copy them individually. However, in some cases MPI will fall back to using MPI_Pack behind the scenes to make the message contiguous, and then transfer and unpack it.
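For example, a strided piece of memory (such as one column of a row-major matrix) can be described with a derived datatype and sent as a single message, without manual packing (sketch):

    #include <mpi.h>

    // Sketch: send every 'stride'-th double (e.g. one column of a row-major
    // rows x stride matrix) as a single message using a derived datatype.
    void send_column(double* matrix, int rows, int stride, int dest, MPI_Comm comm)
    {
        MPI_Datatype column;
        MPI_Type_vector(rows, /*blocklength=*/1, stride, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        MPI_Send(matrix, 1, column, dest, /*tag=*/0, comm);

        MPI_Type_free(&column);
    }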
As far as the application programmer need be concerned, these operations send and receive data, not addresses of data. MPI processes do not share an address space, so an address on process 0 is meaningless to an operation on process 1 - if process 1 wants the data at an address on process 0, it has to get it from process 0. I don't think that the single-sided communications which came in with MPI-2 materially affect this situation.
What goes on under the hood, the view from the developer of the MPI libraries, might be different and will certainly be implementation dependent. For example, if you are using a well written MPI library on a shared-memory machine then yes, it might just implement message passing by sending pointers to address locations around the system. But this is a corner case, and not much seen these days.
mpi_send requires you to give the address of the memory holding the data to be sent. It will return only when it is safe for you to re-use that memory (non-blocking communications can avoid this).
Similarly, mpi_recv requires you to give the address of sufficient memory where it can copy the data to be received into. It will return only when the data have been received into that buffer.
How MPI does that, is another matter and you don't need to worry about that for writing a working MPI program (but possibly for writing an efficient one).

MPI distribution layer

I used MPI to write a distribution layer. Let's say we have n data sources and k data consumers. In my approach, each of the n MPI processes reads data, then distributes it to one (or many) of the k data consumers (other MPI processes) according to some given logic.
So it seems to be very generic, and my question is: has something like that already been done?
It seems simple, but it might be very complicated. Let's say the distribution checks which of the data consumers is ready to work (dynamic work distribution). It may distribute data according to a given algorithm based on the data. There are plenty of possibilities, and I, like everyone else, do not want to reinvent the wheel.
As far as I know, there is no generic implementation for it, other than the MPI API itself. You should use the correct functions according to the problem's constraints.
If what you're trying to build is a simple n-producers-and-k-consumers synchronized job/data queue, then of course there are already many implementations out there (just google it and you should get a few).
However, the way you present it seems very general - sometimes you want the data to only be sent to one consumer, sometimes to all of them, etc. In that case, you should figure out what you want and when, and use either point-to-point communication functions, or collective communication functions, accordingly (and of course everyone has to know what to expect - you can't have a consumer waiting for data from a single source, while the producer wishes to broadcast the data...).
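For the dynamic work distribution case you mention, the usual point-to-point pattern is for consumers to announce that they are ready and for producers to reply with data (sketch; the tags and message layout are placeholders):

    #include <mpi.h>
    #include <vector>

    // Sketch of dynamic work distribution: a consumer announces it is ready,
    // and the producer sends the next piece of data to whichever consumer asked.
    // TAG_READY, TAG_DATA and the chunk sizes are placeholders.
    const int TAG_READY = 1;
    const int TAG_DATA  = 2;

    void producer_step(MPI_Comm comm, std::vector<double>& next_chunk)
    {
        // Wait for any consumer to say it is ready, then send it the data.
        MPI_Status status;
        int dummy;
        MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_READY, comm, &status);
        MPI_Send(next_chunk.data(), (int)next_chunk.size(), MPI_DOUBLE,
                 status.MPI_SOURCE, TAG_DATA, comm);
    }

    void consumer_step(MPI_Comm comm, int producer_rank, std::vector<double>& chunk)
    {
        // Announce readiness, then receive the next piece of work
        // (chunk is assumed to be pre-sized to the expected message length).
        int dummy = 0;
        MPI_Send(&dummy, 1, MPI_INT, producer_rank, TAG_READY, comm);
        MPI_Recv(chunk.data(), (int)chunk.size(), MPI_DOUBLE,
                 producer_rank, TAG_DATA, comm, MPI_STATUS_IGNORE);
    }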
All that aside, here is one implementation that comes to mind that seems to answer all of your requirements:
Make a synchronized queue, producers pushing data in one end, consumers taking it from the other (decide on all kinds of behaviors for the queue as you need - is the queue size limited, does adding an element to a full queue block or fail, does removing an element from an empty queue block or fail, etc.).
Assuming the data contains some flag that tells the consumers if this data is for everyone or just for one of them, the consumers peek and either remove the element, or leave it there and just note that they already did it (either by keeping its id locally, or by changing a flag in the data itself).
If you don't want a single piece of collective data to block until everyone dealt with it, you can use 2 queues, one for each type of data, and the consumers would take data from one of the queues at a time (either by choosing a different queue each time, randomly choosing a queue, prioritizing one of the queues, or by some accepted order that is deductible from the data (e.g. lowest id first)).
Sorry for the long answer, and I hope this helps :)