What is the exact behaviour of MPI_Wait following an MPI_Isend? - c++

I have read all the MPI documentation and tutorials and Stack Overflow questions that I could find relevant to this, but I still do not fully understand how MPI_Wait behaves when "completing" an MPI_Isend. Can we summarize it succinctly? Does it
A. Return when the buffer used in the corresponding Isend is usable again (I think yes, but this doesn't tell me everything I want to know).
B. Return when the corresponding Recv completes (I think not necessarily, but maybe sometimes if the message is large?)
C. Guarantee that the Recv can be completed by the receiving process at some later time after it returns
I ask because I am trying to implement a kind of non-blocking broadcast (since MPI_Ibcast is an MPI 3 thing and doesn't seem to exist in any of the implementations I actually encounter). The sequence I am currently following on each process is this:
MPI_Isend from every process to every other process
Do some other work
MPI_Wait for all the Isends to 'complete', whatever that means exactly
MPI_Recv all the messages
This seems to work fine in practice, but I don't know if it is guaranteed to work, or if I am just lucky because my messages are small (they are just one int each, so I suspect that they get rapidly shuffled off by MPI into some internal buffers or whatever). I don't know whether this would create deadlocks if the messages were bigger (I worry that in this case not all the MPI_Waits would return, because some might be deadlocked waiting for MPI_Recvs to happen on another process).
If this is not guaranteed to work in general, is it at least guaranteed to work for very small messages? Or is even this not necessarily true? I am really not sure what I can count on here.
If this is not guaranteed to work then how can a non-blocking broadcast be implemented? Perhaps there is some clever sequence of performing the Waits and Recvs that works out? Like first rank 0 Waits for rank 1 to Recv, then rank 1 Waits for rank 0 to Recv? Is some kind of switched pairing arrangement like that more correct?

Your conditions are met by the following:
MPI_Isend followed by MPI_Wait both complete:
A. The buffer used in the corresponding Isend is usable again.
If you were to use MPI_Issend and MPI_Wait:
almost B. Return when the corresponding Recv is posted.
The following is true right after MPI_Send:
C. Guarantee that a matching Recv can be completed by the receiving process at some later time after it returns.
In your suggested non-blocking broadcast, you would have to allow steps 3 and 4 to be swapped, otherwise there is a deadlock. That means there must not be a strict requirement that 4 happens after 3. Since 3 happens on the root and 4 happens on all other ranks, it may not even be an issue.
I would, however, suggest a cleaner approach instead:
MPI_Isend on the root to all processes
MPI_Irecv on all processes from the root
Do some other work (on all processes)
MPI_Waitall for all posted requests (both send/recv) on all ranks
That should be clean to implement and work just fine. Of course that is not an optimized collective... but that is another topic.
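For concreteness, here is a minimal C++ sketch of that recipe, assuming a single int payload; the function and variable names (nonblocking_bcast, value, root) are illustrative, not from the answer:

#include <mpi.h>
#include <vector>

// Sketch only: broadcast a single int from 'root' without blocking in between.
void nonblocking_bcast(int& value, int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    std::vector<MPI_Request> reqs;
    reqs.reserve(size);

    if (rank == root) {
        // 1. MPI_Isend on the root to every other process.
        for (int dest = 0; dest < size; ++dest) {
            if (dest == root) continue;
            reqs.emplace_back();
            MPI_Isend(&value, 1, MPI_INT, dest, 0, comm, &reqs.back());
        }
    } else {
        // 2. MPI_Irecv on every other process from the root.
        reqs.emplace_back();
        MPI_Irecv(&value, 1, MPI_INT, root, 0, comm, &reqs.back());
    }

    // 3. Do some other work here, as long as 'value' is not touched.

    // 4. MPI_Waitall for all posted requests (the sends on the root, the recv elsewhere).
    MPI_Waitall(static_cast<int>(reqs.size()), reqs.data(), MPI_STATUSES_IGNORE);
}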

Related

SetFileCompletionNotificationModes() disrupts my event loop, and yours?

The new Windows API SetFileCompletionNotificationModes() with the flag FILE_SKIP_COMPLETION_PORT_ON_SUCCESS is very useful for optimizing an I/O completion port loop, because you'll get fewer I/O completions for the same HANDLE.
But it also disrupts the entire I/O completion port loop, because you have to change a lot of things, so I thought it was better to open a new post about all of the things that need to change.
First of all, when you set the flag FILE_SKIP_COMPLETION_PORT_ON_SUCCESS, it means that you won't receive I/O completions anymore for that HANDLE/SOCKET until all of the bytes are read (or written), that is, until there is no more I/O to do, just like in Unix when you get EWOULDBLOCK.
When you receive ERROR_IO_PENDING again (so a new request is pending), it's just like getting EWOULDBLOCK in Unix.
That said, I encountered some difficulties adapting this behavior to my IOCP event loop, because a normal IOCP event loop simply waits until there is some OVERLAPPED packet to process; the OVERLAPPED packet is processed by calling the correct callback, which in turn decrements an atomic counter, and the loop starts waiting again until the next packet comes.
Now, if I set FILE_SKIP_COMPLETION_PORT_ON_SUCCESS, when an OVERLAPPED packet is returned to be processed, I process it by doing some I/O (with ReadFile() or WSARecv() or whatever), and it can either become pending again (if I get ERROR_IO_PENDING) or not, if my I/O API completes immediately. In the former case I just have to wait for the next pending OVERLAPPED, but in the latter case what do I have to do?
If I try to do I/O until I get ERROR_IO_PENDING, it goes into an infinite loop: it will never return ERROR_IO_PENDING (until the HANDLE/SOCKET's counterpart stops reading/writing), so other OVERLAPPEDs will wait indefinitely. Since I am testing this with a local named pipe that writes or reads forever, it does go into an infinite loop.
So I thought I would do I/O only up to a certain amount of bytes X, just like a scheduler assigns time slices, and if I get ERROR_IO_PENDING before reaching X, that's OK: the OVERLAPPED will be queued again in the IOCP event loop. But what if I don't get ERROR_IO_PENDING?
I tried putting an OVERLAPPED that hasn't finished its I/O into a list/queue for later processing, calling the I/O APIs again later (always with at most X bytes) after processing the other waiting OVERLAPPEDs, and I set the GetQueuedCompletionStatus[Ex]() timeout to 0, so basically the loop processes the listed/queued OVERLAPPEDs that haven't finished their I/O while at the same time checking immediately for new OVERLAPPEDs without going to sleep.
When the list/queue of unfinished OVERLAPPEDs becomes empty, I can set the GQCS[Ex] timeout back to INFINITE, and so on.
In theory it should work perfectly, but I have noticed a strange thing: GQCS[Ex] with the timeout set to 0 returns, again and again, the same OVERLAPPEDs that are not yet fully processed (the ones in the list/queue waiting for later processing).
Question 1: so if I got it right, is the OVERLAPPED packet removed from the system only when all of its data has been processed?
Let's say that is OK, because if I get the same OVERLAPPEDs again and again, I don't need to put them in the list/queue; I just process them like any other OVERLAPPED, and if I get ERROR_IO_PENDING, fine, otherwise I will process them again later.
But there is a flaw: when I call the callback that processes OVERLAPPED packets, I decrement the atomic counter of pending I/O operations. With FILE_SKIP_COMPLETION_PORT_ON_SUCCESS set, I don't know whether the callback was called to process a real pending operation, or just an OVERLAPPED waiting for more synchronous I/O.
Question 2: How can I get that information? Do I have to set more flags in the structure I derive from OVERLAPPED?
Basically I increment the atomic counter for pending operations before calling ReadFile() or WSARecv() or whatever, and when I see that it returned anything other than ERROR_IO_PENDING or success, I decrement it again.
With FILE_SKIP_COMPLETION_PORT_ON_SUCCESS set, I have to decrement it again also when the I/O API completes with success, because it means I won't receive a completion.
It's a waste of time incrementing and decrementing an atomic counter when your I/O API will likely complete immediately and synchronously. Can't I simply increment the atomic counter of pending operations only when I receive ERROR_IO_PENDING? I didn't do this before because I thought that if another thread that completes my pending I/O gets scheduled before the calling thread can check whether the error is ERROR_IO_PENDING and increment the atomic counter of pending operations, I'll end up with the counter messed up.
Question 3: Is this a real concern? Or can I just skip that, and increment the atomic counter only when I get ERROR_IO_PENDING? It would simplify things very much.
Only a flag, and a lot of design to rethink.
What are your thoughts?
As Remy states in the comments: your understanding of what FILE_SKIP_COMPLETION_PORT_ON_SUCCESS does is wrong. ALL it does is allow you to process the completed operation 'in line' if the call that you made (say WSARecv()) returns 0.
So, assuming you have a 'handleCompletion()' function that you would call once you retrieve the completion from the IOCP with GQCS(), you can simply call that function immediately after your successful WSARecv().
If you're using a per-operation counter to track when the final operation completes on a connection (and I do this for lifetime management of the per-connection data that I associate as a completion key) then you still do this in exactly the same way and nothing changes...
You can't increment ONLY on ERROR_IO_PENDING because then you have a race condition between the operation completing and the increment occurring. You ALWAYS have to increment before the API which could cause the decrement (in the handler) because otherwise thread scheduling can screw you up. I don't really see how skipping the increment would "simplify" things at all...
Nothing else changes. Except...
Of course you could now have recursive calls into your completion handler (depending on your design), and this was something which was not possible before. For example: you can now have a WSARecv() call complete with a return code of 0 (because there is data available), and your completion handling code can issue another WSARecv() which could also complete with a return code of 0, and then your completion handling code would be called again, possibly recursively (depending on the design of your code).
Individual busy connections can now prevent other connections from getting any processing time. If you have 3 concurrent connections and all of the peers are sending data as fast as they can, and this is faster than your server can process the data, and you have, for example, 2 I/O threads calling GQCS(), then with FILE_SKIP_COMPLETION_PORT_ON_SUCCESS you may find that two of these connections monopolise the I/O threads (all WSARecv() calls return success, which results in inline processing of all inbound data). In this situation I tend to keep a counter of "max consecutive I/O operations per connection", and once that counter reaches a configurable limit I post the next inline completion to the IOCP and let it be retrieved by a call to GQCS(), as this gives other connections a chance.
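To make that concrete, here is a hedged C++ sketch of the pattern described above, assuming FILE_SKIP_COMPLETION_PORT_ON_SUCCESS has been set on the socket; Connection, handleCompletion and kMaxConsecutiveInlineOps are illustrative names, not part of any Windows API:

#include <winsock2.h>
#include <windows.h>
#include <atomic>

// Sketch only: per-connection state with its own pending-operation counter.
struct Connection {
    SOCKET            socket;
    HANDLE            iocp;            // the completion port the socket is associated with
    std::atomic<long> pendingOps{0};   // per-connection operation counter
    char              buffer[4096];
    OVERLAPPED        ov;
};

// Your existing completion handler; it calls c.pendingOps.fetch_sub(1).
void handleCompletion(Connection& c, DWORD bytesTransferred);

const int kMaxConsecutiveInlineOps = 16;   // configurable limit from the answer above

void issueRecv(Connection& c)
{
    int consecutive = 0;

    for (;;) {
        WSABUF buf;
        buf.len = static_cast<ULONG>(sizeof(c.buffer));
        buf.buf = c.buffer;
        DWORD bytes = 0, flags = 0;

        // ALWAYS increment before the call that could trigger the decrement.
        c.pendingOps.fetch_add(1);

        ZeroMemory(&c.ov, sizeof(c.ov));
        int rc = WSARecv(c.socket, &buf, 1, &bytes, &flags, &c.ov, nullptr);

        if (rc == SOCKET_ERROR) {
            if (WSAGetLastError() == WSA_IO_PENDING)
                return;                     // completion will arrive via GQCS()
            c.pendingOps.fetch_sub(1);      // hard failure: undo the increment
            /* close / clean up the connection here */
            return;
        }

        // rc == 0 with skip-on-success set: no completion packet will be queued,
        // so handle the completion inline, just as if GQCS() had delivered it.
        if (++consecutive < kMaxConsecutiveInlineOps) {
            handleCompletion(c, bytes);     // decrements pendingOps itself
            continue;                       // issue the next read ourselves
        }

        // Give other connections a chance: hand this completion back to the
        // IOCP and let a worker thread pick it up with GQCS().
        PostQueuedCompletionStatus(c.iocp, bytes,
                                   reinterpret_cast<ULONG_PTR>(&c), &c.ov);
        return;
    }
}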

Is it possible to ignore/throw out received data in MPI_Allgather?

In MPI, is it possible to immediately throw out received data without allocating a buffer to hold it? I'm using MPI_Allgather to collect data from several processes, but at some point, one or more processes have no useful data to send.
My original solution was to let the useless process finish. However, if a task terminates without calling MPI_Allgather, the rest end up in deadlock, because MPI_Allgather blocks. In order to overcome this problem I keep all the processes around until the end, but send junk data that is never used.
The useful processes keep some of the received data, but the useless processes don't. I tried passing a null pointer for the recvbuf like so:
MPI_Allgather(&sendbuf, 1, MPI_INT, 0, 1, MPI_INT, MPI_COMM_WORLD);
but it didn't work. Is there anything I can do to avoid receiving, or at least storing, the useless data?
You could create a new group for the useful processes and gather over that group instead of MPI_COMM_WORLD.
Collectives are by definition collective, which means every process in the communicator must call them. If you need to perform a collective call over a subset of a communicator, you have to make a new communicator, as already suggested.
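As an illustration of the new-communicator approach (the function and variable names are mine, not from the question), a sketch using MPI_Comm_split might look like this:

#include <mpi.h>
#include <vector>

// Sketch only: gather an int over just the processes that actually have data.
void gather_useful(int sendval, bool has_data)
{
    // Ranks with data get color 0; the rest get MPI_UNDEFINED and end up
    // with MPI_COMM_NULL, i.e. outside the new communicator entirely.
    int color = has_data ? 0 : MPI_UNDEFINED;
    MPI_Comm useful;
    MPI_Comm_split(MPI_COMM_WORLD, color, 0, &useful);   // collective over MPI_COMM_WORLD

    if (useful != MPI_COMM_NULL) {
        int n;
        MPI_Comm_size(useful, &n);
        std::vector<int> recv(n);
        MPI_Allgather(&sendval, 1, MPI_INT, recv.data(), 1, MPI_INT, useful);
        // ... use recv ...
        MPI_Comm_free(&useful);
    }
    // Ranks without data never enter the collective and never allocate a receive buffer.
}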
You could also look at MPI_Allgatherv and see if it's a good fit for your application. With this collective you can specify how much data each process will send. However, every process in the communicator must still call MPI_Allgatherv, even those that send no data. In addition to this requirement, all processes must know how much data each process is contributing.
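If MPI_Allgatherv fits, a hedged sketch could look like the following; the preliminary MPI_Allgather of the counts is one way to satisfy the requirement that every rank knows how much each process contributes (names are illustrative):

#include <mpi.h>
#include <vector>

// Sketch only: some ranks contribute zero elements.
std::vector<int> allgather_variable(std::vector<int>& mydata, MPI_Comm comm)
{
    int nprocs;
    MPI_Comm_size(comm, &nprocs);

    int mycount = static_cast<int>(mydata.size());        // may be 0 on "useless" ranks
    std::vector<int> counts(nprocs);
    MPI_Allgather(&mycount, 1, MPI_INT, counts.data(), 1, MPI_INT, comm);

    std::vector<int> displs(nprocs, 0);
    for (int i = 1; i < nprocs; ++i)
        displs[i] = displs[i - 1] + counts[i - 1];

    std::vector<int> all(displs.back() + counts.back());
    MPI_Allgatherv(mydata.data(), mycount, MPI_INT,
                   all.data(), counts.data(), displs.data(), MPI_INT, comm);
    return all;
}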
Finally, MPI 3 will most likely include sparse collective operations which do exactly what you wish. But for now a new communicator is probably the best approach.

WSARecv, Completionport Model, how to manage Buffer and avoid overruns?

My problem: my completion-port server will receive data of unknown size from different clients; the thing is, I don't know how to avoid buffer overruns, i.e. how to avoid my (receiving) buffer being "overfilled" with data.
Now to the questions:
1) If I make a receive call via WSARecv, does the worker thread work like a callback function? I mean, does it dig up the receive call only when it has completed, or does it also dig it up while the receive is happening? Does the lpNumberOfBytes variable (from GetQueuedCompletionStatus) contain the number of bytes received so far or the total number of bytes received?
2) How do I avoid overruns? I thought of dynamically allocated buffer structures, but then again, how do I find out how big the package is going to get?
Edit: I hate to ask this, but is there any "simple" method for managing the buffer and avoiding overruns? Synchronisation sounds off-limits to me, at least right now.
'If I make a receive call via WSARecv, does the worker thread work like a callback function?'
See valdo's post. Completion data is queued to your pool of threads and one will be made ready to process it.
'I mean, does it dig up the receive call only when it has completed?' Yes - hence the name. Note that the meaning of 'completed' may vary, depending on the protocol. With TCP, it means that some streamed data bytes have been received from the peer.
'Does the lpNumberOfBytes variable (from GetQueuedCompletionStatus) contain the number of bytes received so far or the total number of bytes received?' It contains the number of bytes received and loaded into the buffer array provided in that IOCP completion only.
'How do I avoid overruns? I thought of dynamically allocated buffer structures, but then again, how do I find out how big the package is going to get?' You cannot get overruns if you provide the buffer arrays - the kernel thread/s that load the buffer/s will not exceed the passed buffer lengths. At the application level, given the streaming nature of TCP, it's up to you to decide how to process the buffer arrays into usable application-level protocol units. You must decide, using your knowledge of the services provided, on a suitable buffer management scheme.
My last IOCP server was somewhat general-purpose. I used an array of buffer pools and a pool of 'buffer-carrier' objects, allocated at startup (along with a pool of socket objects). Each buffer pool held buffers of a different size. Upon a new connection, I issued a WSARecv using one buffer from the smallest pool. If this buffer got completely filled, I used a buffer from the next largest pool for the next WSARecv, and so on.
Then there's the issue of the sequence numbers needed to prevent out-of-order buffering with multiple handler threads :(
1. A completion port is a sort of queue (with sophisticated logic concerning the priority of threads waiting to dequeue an I/O completion from it). Whenever an I/O completes (either successfully or not), it's queued into the completion port. Then it's dequeued by one of the threads that called GetQueuedCompletionStatus.
So you never dequeue an I/O "in progress". Moreover, it's processed by your worker thread asynchronously; that is, processing is delayed until your thread calls GetQueuedCompletionStatus.
2. This is actually a complex matter. Synchronization is not a trivial task in general, especially when it comes to symmetric multi-threading (where you have several threads, each of which may be doing anything).
One of the parameters you receive with a completed I/O is a pointer to an OVERLAPPED structure (the one you supplied to the function that issued the I/O, such as WSARecv). It's common practice to allocate your own structure that is based on OVERLAPPED (either inheriting from it or having it as the first member). Upon receiving a completion you may cast the dequeued OVERLAPPED to your actual data structure. There you may keep everything needed for the synchronization: sync objects, state description, etc.
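A small sketch of that practice (field and function names are illustrative):

#include <winsock2.h>
#include <windows.h>

// Sketch only: extend OVERLAPPED with your own per-operation context and
// recover it after GetQueuedCompletionStatus.
struct IoContext : OVERLAPPED {       // OVERLAPPED as the base (or first member)
    WSABUF wsabuf;
    char   data[4096];
    int    operation;                 // e.g. recv vs. send, your own enum
    // ... sync objects, sequence numbers, per-connection state, etc.
};

void workerLoop(HANDLE iocp)
{
    for (;;) {
        DWORD       bytes = 0;
        ULONG_PTR   key   = 0;
        OVERLAPPED* ov    = nullptr;

        BOOL ok = GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE);
        if (!ok && ov == nullptr)
            continue;                 // the dequeue itself failed, nothing to process

        // The pointer returned is the OVERLAPPED passed to WSARecv/WSASend,
        // so it can be cast back to the enclosing context structure.
        IoContext* ctx = static_cast<IoContext*>(ov);
        // ... if !ok the I/O failed; otherwise process 'bytes' bytes in
        //     ctx->data according to ctx->operation ...
    }
}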
Note however that it's not a trivial task to synchronize things correctly (to have a good performance and avoid deadlocks) even when you have the custom context. This demands an accurate design.

Boost.Asio: The difference between async_read and async_receive

What's the difference between async_read and async_receive?
async_receive is a function that just receives into a buffer, but may not receive the amount you asked for. (It'll be equal or less, never more.)
async_read, however, will always receive the amount you asked for, as it states:
This function is used to asynchronously read a certain number of bytes of data from a stream. The function call always returns immediately. The asynchronous operation will continue until one of the following conditions is true:
The supplied buffers are full. That is, the bytes transferred is equal to the sum of the buffer sizes.
An error occurred.
The only thing the page is a bit vague on is what async_read does if it doesn't get that many bytes, and the connection closes gracefully. (Does that count as "error"?) This can probably be determined with a quick test. (async_receive, however, would just give you what it got.)
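A minimal sketch of the difference (the socket, buffer and handlers are placeholders):

#include <boost/asio.hpp>
#include <array>
#include <cstddef>
#include <iostream>

// async_receive: completes as soon as *some* data arrives; bytes <= buf.size().
void demo_receive(boost::asio::ip::tcp::socket& sock, std::array<char, 128>& buf)
{
    sock.async_receive(boost::asio::buffer(buf),
        [](const boost::system::error_code& ec, std::size_t bytes) {
            if (!ec) std::cout << "got " << bytes << " byte(s), possibly fewer than asked for\n";
        });
}

// async_read: keeps reading until the buffer is full or an error occurs, so on
// success bytes == buf.size(). A graceful close before the buffer is full is
// reported as the error boost::asio::error::eof.
void demo_read(boost::asio::ip::tcp::socket& sock, std::array<char, 128>& buf)
{
    boost::asio::async_read(sock, boost::asio::buffer(buf),
        [&buf](const boost::system::error_code& ec, std::size_t bytes) {
            if (!ec) std::cout << "got exactly " << bytes << " == " << buf.size() << " byte(s)\n";
        });
}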
The first is a free function, the second is a member function.
Another difference is socket_base::message_flags flags parameter. See possible values, for example, in the recv(2) manual page.
Edit:
With async_receive you need to check how many bytes you got. Use it if you want to read at most N bytes, vs. exactly N bytes with async_read. Sorry, I thought that was sort of obvious from the Boost docs.

definition of wait-free (referring to parallel programming)

In Maurice Herlihy's paper "Wait-free synchronization" he defines wait-free:
"A wait-free implementation of a concurrent data object is one that guarantees that any process can complete any operation in a finite number of steps, regardless of the execution speeds of the other processes."
www.cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf
Let's take one operation op from the universe.
(1) Does the definition mean: "Every process completes a certain operation op in the same finite number n of steps."?
(2) Or does it mean: "Every process completes a certain operation op in some finite number of steps, so a process may complete op in k steps and another process in j steps, where k != j."?
Just by reading the definition I would understand meaning (2). However, this makes no sense to me, since a process executing op in k steps and another time in k + m steps meets the definition, but the m extra steps could be a waiting loop. If meaning (2) is right, can anybody explain to me why this describes wait-free?
In contrast to (2), meaning (1) would guarantee that op is executed in the same number of steps k. So there can't be any additional steps m that are necessary e.g. in a waiting loop.
Which meaning is right and why?
Thanks a lot,
sema
The answer is meaning (2). Consider that the waiting loop may potentially never terminate, if the process being waited for runs indefinitely: "regardless of the execution speeds of the other processes".
So the infinite waiting loop effectively means that a given process may not be able to complete an operation in a finite number of steps.
When an author of a theoretical paper like this writes "a finite number of steps", it means that there exists some constant k (you do not necessarily know k), so that the number of steps is smaller than k (i.e. your waiting time surely won't be infinite).
I'm not sure what 'op' means in this context, but generally, when you have a multithreaded program, threads might wait for one another to do something.
Example: a thread has a lock, and other threads wait for this lock to be freed until they can operate.
This example is not wait free, since if the thread holding the lock does not get a chance to do any ops (this is bad, since the requirement here is that other threads will continue regardless of any other thread), other threads are doomed, and will never ever make any progress.
Another example: there are several threads, each trying to CAS on the same address.
This example is wait free, because although all threads but one will fail in such an operation, there will always be progress no matter which threads are chosen to run.
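A small sketch of that CAS example, where each thread makes a single attempt and then moves on regardless of whether it won (names are illustrative):

#include <atomic>
#include <thread>
#include <vector>
#include <cstdio>

std::atomic<int> shared_result{0};   // 0 means "not yet computed"

void try_publish(int my_value)
{
    int expected = 0;
    // One CAS attempt on the shared address. If it fails, someone else already
    // published a result, and this thread simply uses that value; either way it
    // finishes in a bounded number of its own steps.
    if (shared_result.compare_exchange_strong(expected, my_value))
        std::printf("thread installed %d\n", my_value);
    else
        std::printf("thread lost the race, using %d\n", expected);
}

int main()
{
    std::vector<std::thread> threads;
    for (int i = 1; i <= 4; ++i)
        threads.emplace_back(try_publish, i * 10);
    for (auto& t : threads) t.join();
}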
It sounds like you're concerned that definition 2 would allow for an infinite wait loop, but such a loop—being infinite—would not satisfy the requirement for completion within a finite number of steps.
I take "wait-free" to mean that making progress does not require any participant to wait for another participant to finish. If such waiting was necessary, if one participant hangs or operates slowly, other participants suffer similarly.
By contrast, with a wait-free approach, each participant tries its operation and accommodates competitive interaction with other participants. For instance, each thread may try to advance some state, and if two try "at the same" time, only one should succeed, but there's no need for any participants that "failed" to retry. They merely recognize that someone else already got the job done, and they move on.
Rather than focusing on "waiting my turn to act", a wait-free approach encourages "trying to help", acknowledging that others may also be trying to help at the same time. Each participant has to know how to detect success, when to retry, and when to give up, confident that trying only failed because someone else got in there first. As long as the job gets done, it doesn't matter which thread got it done.
Wait-free essentially means that it needs no synchronization to be used in a multi-processing environment. The 'finite number of steps' refers to not having to wait on a synchronization device (e.g. a mutex) for an unknown -- and potentially infinite (deadlock) -- length of time while another process executes a critical section.