I know that nonblocking communication does not block the code; for example, MPI_Isend will send the data immediately. But when we have some all-to-all communication and we need to read this data, we need to use MPI_Wait to get the data first.
For all-to-all communication, the first thing that comes to my mind is something like this (it is real code):
1- initialise MPI, get the rank and size...
2- create the data or read the first data from a file
3- in a for loop, send the data from each rank to all the others, by MPI_Isend or even MPI_Bcast
4- finalise MPI
When writing the for loop, we need to use MPI_Wait for both the sending and the receiving.
My question is how we can use overlapping with nonblocking communication.
I want to use MPI_Isend and MPI_Irecv twice in each loop iteration, so that I can overlap some computation on the first received data while another send and receive are in progress, but this approach needs four waits. Is there any algorithm for overlapping the nonblocking communications?
I know that nonblocking communication does not block the code; for example, MPI_Isend will send the data immediately.
This is not accurate: the data is not sent immediately. Let me explain in more detail:
MPI_Irecv and MPI_Isend are nonblocking communication routines; therefore one needs to use MPI_Wait (or MPI_Test, to test for completion of the request) to ensure that the message is completed and that the data in the send/receive buffer can again be safely manipulated.
The nonblocking here means one does not wait for the data to be read and sent; rather the data is immediately made available to be read and sent. That does not imply, however, that the data is immediately sent. If it were, there would not be a need to call MPI_Wait.
But when we have some all-to-all communication and we need to read this data, we need to use MPI_Wait to get the data first.
You need to call MPI_Wait (or MPI_Test) with all communication types (1 to 1, 1 to n, and so on), not just with all-to-all. Calling MPI_Wait is not about getting the data first; rather, during the MPI_Isend the content of the buffer (e.g., the array of ints) has to be read and sent. In the meantime, one can overlap some computation with the ongoing communication; however, this computation cannot change (or read) the content of the send/recv buffers. One then calls MPI_Wait to ensure that from that point onwards the sent/received data can be safely read or modified without any issues.
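To make the pattern concrete, here is a minimal sketch, assuming a simple pairwise exchange (the buffer names and the computation placeholder are illustrative, not from the original code):

    // Minimal sketch: overlap computation with pending nonblocking transfers.
    // send_buf and recv_buf must not be touched between the I* calls and MPI_Waitall.
    #include <mpi.h>
    #include <vector>

    void exchange(int peer, std::vector<int>& send_buf, std::vector<int>& recv_buf) {
        MPI_Request reqs[2];
        MPI_Isend(send_buf.data(), (int)send_buf.size(), MPI_INT, peer, 0,
                  MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(recv_buf.data(), (int)recv_buf.size(), MPI_INT, peer, 0,
                  MPI_COMM_WORLD, &reqs[1]);

        // ... computation that does NOT read or write send_buf/recv_buf ...

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        // From here on, send_buf may be reused and recv_buf may be read safely.
    }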
I want to use MPI_Isend and MPI_Irecv twice in each loop iteration, so that I can overlap some computation on the first received data while another send and receive are in progress, but this approach needs four waits. Is there any algorithm for overlapping the nonblocking communications?
Whenever possible, just use the built-in communication routines; they are likely more optimized than what one would be able to write by hand. What you have described fits the MPI routine MPI_Ialltoall perfectly:
Sends data from all to all processes in a nonblocking way
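As a hedged usage sketch (counts and buffer names are illustrative only):

    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // One int sent to every rank, one int received from every rank.
        std::vector<int> send_buf(size, rank), recv_buf(size);

        MPI_Request req;
        MPI_Ialltoall(send_buf.data(), 1, MPI_INT,
                      recv_buf.data(), 1, MPI_INT, MPI_COMM_WORLD, &req);

        // ... overlap computation that does not touch send_buf/recv_buf ...

        MPI_Wait(&req, MPI_STATUS_IGNORE);  // recv_buf is now safe to read
        MPI_Finalize();
        return 0;
    }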
I am sending data on a boost::beast::websocket.
I would like to send the data synchronously, so am trying to decide if I should use write or write_some.
From this SO answer (which is about asio rather than beast specifically, but I assume(!) the same rules apply?) I understand that write will block until the entire message is confirmed sent, whereas write_some may return early, and will return the number of bytes sent which may not be all the bytes which were requested be sent.
In my particular use-case I am using a single thread, and the write is done from within this thread's context (i.e. from inside a callback issued after entering io_context.run()).
Since I don't want to block the caller for some indeterminate amount of time, I want to avoid using write if there is a more elegant solution.
So if I then turn to async_write, I am uncertain what I should do if the number of bytes is less than the number of bytes I requested be sent?
How I would normally handle this with standard tcp sockets is use non-blocking mode, and when I get back EWOULDBLOCK, enqueue the data and carry on. When the socket becomes writeable again, only then complete the write (much akin to an asio async_write). Since non-blocking is not supported in beast, I'm wondering what the analogous approach is?
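To illustrate, this is roughly the plain-socket pattern I mean (pending_out and the function name are just placeholders of mine):

    // Rough sketch: try to send on a non-blocking socket, and on EWOULDBLOCK/EAGAIN
    // queue the unsent tail until the socket is reported writable again.
    #include <sys/socket.h>
    #include <cerrno>
    #include <cstddef>
    #include <deque>

    std::deque<char> pending_out;  // bytes waiting for the socket to become writable

    void send_or_enqueue(int fd, const char* data, std::size_t len) {
        ssize_t n = ::send(fd, data, len, 0);
        if (n < 0) {
            if (errno == EWOULDBLOCK || errno == EAGAIN)
                n = 0;              // nothing sent; keep the whole message queued
            else
                return;             // real error: handled elsewhere
        }
        // Stash whatever the kernel did not accept; flush it later when
        // poll()/epoll reports the socket writable.
        pending_out.insert(pending_out.end(), data + n, data + len);
    }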
Presumably I need to perform some additional write operation to ensure the rest of the bytes are sent in due course?
The beast docs say
Callers are responsible for synchronizing operations on the socket
using an implicit or explicit strand, as per the Asio documentation.
The websocket stream asynchronous interface supports one of each of
the following operations to be active at the same time:
async_read or async_read_some
async_write or async_write_some
async_ping or async_pong
async_close
Is it ok to start an async write of the remaining bytes, so long as I ensure that a new synchronous write/write_some isn't started before the outstanding async write has completed?
If I cannot start an async write to complete the send of the remaining bytes, how is one supposed to handle a synchronous write_some which doesn't completely send all bytes?
As to why I don't just use async_write always, I have additional slow processing to do after the attempt to write, such as logging etc. Since I am using a single thread, and the call to async_write happens within that thread, the write will only occur after I return control to the event loop.
So what I'd like to do is attempt to write synchronously (which will work in 90% of the cases) so the data is sent, and then perform my slow tasks which would otherwise delay the write. In the 10% of cases where a sync write doesn't complete immediately, then an alternative async_write operation should be employed - but only in the fallback situation.
Possibly related: I see that write_some has a flag fin, which should be set to true if this is the last part of the message.
I am only ever attempting to write complete messages, so should I always use true for this?
I'm aware that the ::send within a Linux TCP server can limit the sending of the payload such that ::send needs to be called multiple times until the entire payload is sent.
i.e. Payload is 1024 bytes
sent_bytes = ::send(fd, ...) where sent_bytes is only 256 bytes so this needs to be called again.
Is there any way to know exactly how many bytes can be sent before sending, i.e. whether the socket will accept the entire message, or whether the message will be fragmented and by how much?
Example Case
2 messages are sent to the same socket by different threads at the same time on the same TCP client via ::send(). In some cases where messages are large, multiple calls to ::send() are required, as not all the bytes are sent by the initial call; thus, go with the loop solution until all the bytes are sent. The loop is mutexed, so it can be seen as thread safe, and each thread has to perform its sending after the other. But my worry is that because TCP is a stream, the client will receive fragments of each message, and I was thinking that by adding framing to each message I could rebuild the message on the client side, if I knew how many bytes are sent at a time.
Although the call to ::send() is done sequentially, is there any chance that the byte stream is still mixed?
Effectively, could this happen:
Server Side
Message 1: "CiaoCiao"
Message 2: "HelloThere"
Client Side
Received Message: "CiaoHelloCiaoThere"
Although the call to ::send() is done sequentially, is there any chance that the byte stream is still mixed?
Of course. Not only is there a chance of that, it is pretty much a certainty that it will happen at one point or another. Guaranteed.
sent to the same socket by different threads
It will be necessary to handle the synchronization at this level, by employing a mutex that each thread locks before sending its message and unlocks only after the entire message has been sent.
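A minimal sketch of that mutexed, whole-message send loop (names are illustrative and error handling is reduced to a bare minimum):

    // Sketch: serialize whole messages onto one socket from multiple threads.
    // The mutex is held until the entire message has been handed to the kernel,
    // so messages from different threads can never interleave on the stream.
    #include <sys/socket.h>
    #include <cstddef>
    #include <mutex>
    #include <string>

    std::mutex send_mutex;

    bool send_whole_message(int fd, const std::string& msg) {
        std::lock_guard<std::mutex> lock(send_mutex);
        std::size_t sent = 0;
        while (sent < msg.size()) {
            ssize_t n = ::send(fd, msg.data() + sent, msg.size() - sent, 0);
            if (n <= 0)
                return false;   // error or closed connection: handle appropriately
            sent += static_cast<std::size_t>(n);
        }
        return true;
    }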
It goes without saying that this leaves open the possibility that a blocked/hung socket will result in a single thread locking this mutex for an excessive amount of time, until the socket times out and your execution thread ends up dealing with a failed send() or write(), in whatever fashion it is already doing now (you are, of course, checking the return value from send/write, and handling the error conditions appropriately).
There is no single, cookie-cutter, paint-by-numbers solution to this that works in every situation, in every program that needs to do something like this. Each eventual solution needs to be tailored to each program's unique requirements and purpose. Just one possibility would be a dedicated execution thread that handles all socket input/output, with all your other execution threads sending their messages to the socket thread instead of writing to the socket directly. This avoids having all execution threads wedged by a hung socket, at the expense of growing memory holding all the unsent data.
But that's just one possible approach. The number of possible alternative solutions has no limit. You will need to figure out which logic/algorithm-based solution will work best for your specific program. There is no operating-system/kernel-level indication that will give you any kind of guarantee as to the amount of data a send() or write() call on a socket will accept.
I know there are a ton of questions & answers about the different modes of MPI send and receive out there, but I believe mine is different or I am simply not able to apply these answers to my problem.
Anyway, my scenario is as follows. The code is intended for high-performance clusters with potentially thousands of cores, organized into a multi-dimensional grid. In my algorithm, there are two successive operations that need to be carried out, let's call them A and B, where A precedes B. A brief description is as follows:
A: Each processor has multiple buffers. It has to send each of these buffers to a certain set of processors. For each buffer to be sent, the set of receiving processors might differ. Sending is the last step of operation A.
B: Each processor receives a set of buffers from a set of processors. Operation B will then work on these buffers once it has received all of them. The result of that operation will be stored in a fixed location (neither in the send nor the receive buffers).
The following properties are also given:
in A, every processor can compute which processors to send to, and it can also compute a corresponding tag in case that a processor receives multiple buffers from the same sending processor (which is very likely).
in B, every processor can also compute which processors it will receive from, and the corresponding tags that the messages were sent with.
Each processor has its own send buffers and receive buffers, and these are disjoint (i.e. there is no processor that uses its send buffer as a receive buffer as well, and vice versa).
A and B are executed in a loop among other operations before A and after B. We can ensure that the send buffer will not be used again until the next loop iteration, where it is filled with new data in A, and the receive buffers will also not be used again until the next iteration where they are used to receive new data in operation B.
The transition between A and B should, if possible, be a synchronization point, i.e. we want to ensure that all processors enter B at the same time
To send and receive, both A and B have to use (nested) loops on their own to send and receive the different buffers. However, we cannot make any assumption about the order of these send and receive statements, i.e. for any two buffers buf0 and buf1 we cannot guarantee that if buf0 is received by some processor before buf1, that buf0 was also sent before buf1. Note that at this point, using group operations like MPI_Broadcast etc. is not an option yet due to the complexity of determining the set of receiving/sending processors.
Question: Which send and receive modes should I use? I have read a lot of different stuff about these different modes, but I cannot really wrap my head around them. The most important property is deadlock-freedom, and the next most important thing is performance. I am tending towards using MPI_Isend() in A without checking the request status, using the non-blocking MPI_Irecv() in B's loops, and then using MPI_Waitall() to ensure that all buffers have been received (and, as a consequence, that all buffers have been sent and the processors are synchronized).
Is this the correct approach, or do I have to use buffered sends or something entirely different? I don't have a ton of experience in MPI and the documentation does not really help me much either.
From how you describe your problem, I think MPI_Isend is likely to be the best (only?) option for A because it's guaranteed non-blocking, whereas MPI_Send may be non-blocking, but only if it is able to internally buffer your send buffer.
You should then be able to use an MPI_Barrier so that all processes enter B at the same time. But this might be a performance hit. If you don't insist that all processes enter B at the same time, some can begin receiving messages sooner. Furthermore, given your disjoint send & receive buffers, this should be safe.
For B, you can use MPI_Irecv or MPI_Recv. MPI_Irecv is likely to be faster, because a standard MPI_Recv might be waiting around for a slower send from another process.
Whether or not you block on the receiving end, you ought to call MPI_Waitall before finishing the loop to ensure all send/recv operations have completed successfully.
An additional point: you could leverage MPI_ANY_SOURCE with MPI_Recv to receive messages in a blocking manner and immediately operate on them, no matter what order they arrive. However, given that you specified that no operations on the data happen until all data are received, this might not be that useful.
Finally: as mentioned in these recommendations, you will get best performance if you can restructure your code so that you can just use MPI_SSend. In this case you avoid any buffering at all. To achieve this, you'd have to have all processes first call an MPI_Irecv, then begin sending via MPI_Ssend. It might not be as hard as you think to refactor in this way, particularly if, as you say, each process can work out independently which messages it will receive from whom.
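A rough sketch of that structure, assuming each process has already computed its send and receive lists (the Transfer type and all names here are illustrative):

    // Sketch: pre-post all receives for B, then use synchronous-mode sends in A,
    // and wait on the receive requests before B starts computing.
    #include <mpi.h>
    #include <cstddef>
    #include <vector>

    struct Transfer { int peer; int tag; std::vector<double> buf; };

    void exchange(std::vector<Transfer>& sends, std::vector<Transfer>& recvs) {
        std::vector<MPI_Request> rreqs(recvs.size());

        // Post every receive first, so the matching MPI_Ssend calls can complete.
        for (std::size_t i = 0; i < recvs.size(); ++i)
            MPI_Irecv(recvs[i].buf.data(), (int)recvs[i].buf.size(), MPI_DOUBLE,
                      recvs[i].peer, recvs[i].tag, MPI_COMM_WORLD, &rreqs[i]);

        // Synchronous-mode sends: each completes once the matching receive has
        // started, so no extra buffering is required.
        for (auto& s : sends)
            MPI_Ssend(s.buf.data(), (int)s.buf.size(), MPI_DOUBLE,
                      s.peer, s.tag, MPI_COMM_WORLD);

        // All data needed by B is in place once this returns.
        MPI_Waitall((int)rreqs.size(), rreqs.data(), MPI_STATUSES_IGNORE);
    }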
I'm trying to get to grips with boost asio but I'm having trouble understanding some of the behavior behind the asynchronous interface.
I have a simple setup with a client and a server.
The client calls async_write regularly with a fixed amount of data
The server polls for data regularly
What happens when the server stops polling for data?
I guess the various buffers would fill up in the server OS and it would stop sending ACKs?
Regardless of what happens it seems that the client can happily continue to send several gigabytes of data without receiving any error callback (doesn't receive any success either of course).
I assume the client OS stops accepting packets at one point since they can't be TX'ed?
Does this mean that boost::asio buffers data internally?
If it does, can I use socket.cancel() to drop packets in case I don't want to wait for delivery ? (I need to make sure ASIO forgets about my packets so I can reuse old buffers for new packets)
asio doesn't buffer internally. And you will always get signaled if you can't transfer more data to the remote.
E.g. if you use synchronous writes in asio they will block until the data could be sent (or at least copied into the kernel send buffers). If you use async writes the callback/acknowledgement will only be called once the data could be sent. If you use nonblocking writes you get EAGAIN/WOULD_BLOCK errors. If you use multiple async_write's in parallel - well - you shouldn't do that, since its behavior is undefined according to the asio docs:
This operation is implemented in terms of zero or more calls to the stream's async_write_some function, and is known as a composed operation. The program must ensure that the stream performs no other write operations (such as async_write, the stream's async_write_some function, or any other composed operations that perform writes) until this operation completes.
Guarantee in your application that you only ever perform a single async write operation at a time, and once that finishes, write the next piece of data. If you need to write data in between, you will need to buffer it inside your application.
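A rough sketch of such an application-level queue (type and member names are illustrative; it assumes a single-threaded io_context, or equivalently a strand, and that the Connection object outlives the pending writes):

    // Sketch: keep at most one async_write in flight; queue everything else.
    #include <boost/asio.hpp>
    #include <deque>
    #include <string>

    namespace net = boost::asio;

    struct Connection {
        net::ip::tcp::socket socket;
        std::deque<std::string> queue;   // messages waiting to be written

        explicit Connection(net::io_context& io) : socket(io) {}

        void send(std::string msg) {
            bool idle = queue.empty();
            queue.push_back(std::move(msg));
            if (idle)
                write_next();            // start the chain only if nothing is in flight
        }

        void write_next() {
            net::async_write(socket, net::buffer(queue.front()),
                [this](boost::system::error_code ec, std::size_t /*bytes*/) {
                    queue.pop_front();
                    if (!ec && !queue.empty())
                        write_next();    // previous write finished: send the next one
                });
        }
    };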
I know that in order to have multiple WSASend() calls in flight simultaneously, I need to provide unique WSAOVERLAPPED and WSABUF instances for each call. But this means that I have to keep track of these instances for each call, which will complicate things.
I think it would be a better idea if I create a thread that only make WSASend() calls not simultaneously but rather sequentially. This thread will wait on a queue that will hold WSASend() requests (each request will contain the socket handle and the string I want to send). When I eventually call WSASend() I will block the thread until I receive a wake up signal from the thread that waits on the completion port telling me that the WSASend() has been completed, and then I go on to fetch the next request.
If this is a good idea, then how should I implement the queue and how to make a blocking fetch call on it (instead of using polling)?
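For concreteness, I imagine the queue looking something like the sketch below (using a condition variable instead of polling), though I am not sure whether this is the right approach:

    // Minimal blocking queue sketch: fetch() sleeps on a condition variable
    // instead of polling, and wakes up as soon as a request is pushed.
    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <winsock2.h>

    struct SendRequest { SOCKET socket; std::string data; };

    class SendQueue {
    public:
        void push(SendRequest req) {
            { std::lock_guard<std::mutex> lock(mutex_); queue_.push(std::move(req)); }
            cv_.notify_one();
        }
        SendRequest fetch() {            // blocks until a request is available
            std::unique_lock<std::mutex> lock(mutex_);
            cv_.wait(lock, [this] { return !queue_.empty(); });
            SendRequest req = std::move(queue_.front());
            queue_.pop();
            return req;
        }
    private:
        std::mutex mutex_;
        std::condition_variable cv_;
        std::queue<SendRequest> queue_;
    };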
The WSABUF can be stack based as it is the responsibility of WSASend() to duplicate it before returning. The OVERLAPPED and the data buffer itself must live until the IOCP completion for the operation is extracted and processed.
I've always used an 'extended' OVERLAPPED structure which incorporates the data buffer, the overlapped structure AND the WSABUF. I then use a reference counting system to ensure that the 'per operation data' exists until nobody needs it any more (that is I take a reference before the API call initiates the operation and I release a reference when the operation is completed after removal of the completion from the IOCP - note that references aren't 100% necessary here but they make it easier to then pass the resulting data buffer off to other parts of the code).
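A stripped-down sketch of that 'extended OVERLAPPED' layout (member and function names are illustrative, and the reference counting is omitted for brevity):

    // Sketch: per-operation data laid out so the OVERLAPPED pointer handed to
    // WSASend() can be converted back to the full structure when the completion
    // is dequeued from the IOCP (e.g. via CONTAINING_RECORD).
    #include <winsock2.h>
    #include <vector>

    struct PerOperationData {
        OVERLAPPED overlapped{};     // keep first so the cast back is trivial
        WSABUF wsabuf{};             // points into 'data' below
        std::vector<char> data;      // the actual bytes being sent
    };

    bool start_send(SOCKET s, std::vector<char> bytes) {
        auto* op = new PerOperationData;
        op->data = std::move(bytes);
        op->wsabuf.buf = op->data.data();
        op->wsabuf.len = static_cast<ULONG>(op->data.size());

        int rc = WSASend(s, &op->wsabuf, 1, nullptr, 0, &op->overlapped, nullptr);
        if (rc == SOCKET_ERROR && WSAGetLastError() != WSA_IO_PENDING) {
            delete op;               // immediate failure: no completion will arrive
            return false;
        }
        return true;                 // 'op' is freed after the IOCP completion is processed
    }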
It is MOST optimal for a TCP connection to have the TCP "window size" of data in transit at any one time and to have some more data pending so that the window is always kept full and you are always sending at the maximum that the connection can take. To achieve this with overlapped I/O it's usually best to have many WSASend() calls pending. However, you don't want to have too many pending (see here) and the easiest way to achieve this is to track the number of bytes that you have pending, queue bytes for later transmission and send from your transmission queue when existing sends complete...