The new Windows API SetFileCompletionNotificationModes() with the flag FILE_SKIP_COMPLETION_PORT_ON_SUCCESS is very useful for optimizing an I/O completion port loop, because you'll get fewer I/O completions for the same HANDLE.
But it also disrupts the entire I/O completion port loop, because you have to change a lot of things, so I thought it was better to open a new post about all of the things that need to change.
First of all, when you set the flag FILE_SKIP_COMPLETION_PORT_ON_SUCCESS it means that you won't receive I/O completions anymore for that HANDLE/SOCKET until all of the bytes have been read (or written), that is, until there is no more I/O to do, just like getting EWOULDBLOCK on Unix.
Receiving ERROR_IO_PENDING again (so a new request is pending) is just like getting EWOULDBLOCK on Unix.
That said, I ran into some difficulties adapting this behavior to my IOCP event loop, because a normal IOCP event loop simply waits forever until there is some OVERLAPPED packet to process; the OVERLAPPED packet is processed by calling the correct callback, which in turn decrements an atomic counter, and the loop goes back to waiting until the next packet arrives.
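For illustration, this is roughly the kind of loop I mean (a minimal sketch; the OVERLAPPED-derived context, its callback and the pendingOps counter are placeholder names, not a real API):

```cpp
// Minimal sketch of a classic IOCP event loop (illustrative only).
#include <windows.h>

struct MyContext : OVERLAPPED {                       // placeholder type
    void (*callback)(MyContext* ctx, DWORD bytes);    // per-operation handler
};

void EventLoop(HANDLE iocp, volatile LONG* pendingOps)
{
    for (;;) {
        DWORD bytes = 0;
        ULONG_PTR key = 0;
        OVERLAPPED* ov = nullptr;

        // Block until some I/O completion is queued to the port.
        BOOL ok = GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE);
        if (!ok && ov == nullptr)
            continue;   // GQCS itself failed; no packet to process

        MyContext* ctx = static_cast<MyContext*>(ov);
        ctx->callback(ctx, bytes);          // process the completed operation
        InterlockedDecrement(pendingOps);   // one fewer outstanding operation
    }
}
```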
Now, if I set FILE_SKIP_COMPLETION_PORT_ON_SUCCESS, when an OVERLAPPED packet is returned to be processed, I process it by doing some I/O (with ReadFile() or WSARecv() or whatever) and it can become pending again (if I get ERROR_IO_PENDING) or not, if my I/O API completes immediately. In the former case I just have to wait for the next pending OVERLAPPED, but in the latter case what do I have to do?
If I try to do I/O until I get ERROR_IO_PENDING, it ends up in an infinite loop: it never returns ERROR_IO_PENDING (until the HANDLE/SOCKET's counterpart stops reading/writing), so other OVERLAPPEDs wait indefinitely. Since I am testing this with a local named pipe that writes or reads forever, it loops forever.
So I thought I'd do I/O up to a certain amount X of bytes, just like a scheduler assigns time slices, and if I get ERROR_IO_PENDING before reaching X, that's OK, the OVERLAPPED will be queued again in the IOCP event loop; but what if I don't get ERROR_IO_PENDING?
I tried putting an OVERLAPPED that hasn't finished its I/O into a list/queue for later processing, calling the I/O APIs again later (always with at most X bytes) after processing the other waiting OVERLAPPEDs, and I set the GetQueuedCompletionStatus[Ex]() timeout to 0. Basically the loop processes the listed/queued OVERLAPPEDs that haven't finished their I/O and at the same time immediately checks for new OVERLAPPEDs without going to sleep.
When the list/queue of unfinished OVERLAPPEDs becomes empty, I can set the GQCS[Ex] timeout to INFINITE again. And so on.
In theory it should work perfectly, but I have noticed a strange thing: GQCS[Ex] with the timeout set to 0 returns, again and again, the same OVERLAPPEDs that haven't been fully processed yet (the ones sitting in the list/queue waiting for later processing).
Question 1: so if I got it right, is the OVERLAPPED packet removed from the system only when all data has been processed?
Let's say that's OK, because if I get the same OVERLAPPEDs again and again, I don't need to put them in the list/queue; I just process them like any other OVERLAPPED, and if I get ERROR_IO_PENDING, fine, otherwise I'll process them again later.
But there is a flaw: when I call the callback that processes OVERLAPPED packets, I decrement the atomic counter of pending I/O operations. With FILE_SKIP_COMPLETION_PORT_ON_SUCCESS set, I don't know whether the callback was called to process a real pending operation, or just an OVERLAPPED waiting for more synchronous I/O.
Question 2: How can I get that information? Do I have to set more flags in the structure I derive from OVERLAPPED?
Basically I increment the atomic counter of pending operations before calling ReadFile() or WSARecv() or whatever, and when I see that the call returned anything other than ERROR_IO_PENDING or success, I decrement it again.
With FILE_SKIP_COMPLETION_PORT_ON_SUCCESS set, I also have to decrement it when the I/O API completes successfully, because that means I won't receive a completion.
It's a waste of time to increment and decrement an atomic counter when the I/O API will likely complete immediately and synchronously. Can't I simply increment the atomic counter of pending operations only when I receive ERROR_IO_PENDING? I didn't do this before because I thought that if another thread completes my pending I/O and gets scheduled before the calling thread can check for ERROR_IO_PENDING and increment the atomic counter of pending operations, the counter would get messed up.
Question 3: Is this a real concern? Or can I just skip that, and increment the atomic counter only when I get ERROR_IO_PENDING? It would simplify things very much.
Only a flag, and a lot of design to rethink.
What are your thoughts?
As Remy states in the comments: Your understanding of what FILE_SKIP_COMPLETION_PORT_ON_SUCCESS does is wrong. ALL it does is allow you to process the completed operation 'inline' if the call that you made (say WSARecv()) returns 0.
So, assuming you have a 'handleCompletion()' function that you would call once you retrieve the completion from the IOCP with GQCS(), you can simply call that function immediately after your successful WSARecv().
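For illustration, a rough sketch of what that issue-and-handle-inline path could look like (Connection, handleCompletion() and pendingOps are placeholder names, not part of any real API; error handling trimmed):

```cpp
// Sketch only: issuing a receive on a socket that has
// FILE_SKIP_COMPLETION_PORT_ON_SUCCESS set. Placeholder types and names.
#include <winsock2.h>
#include <windows.h>

struct Connection {
    SOCKET         socket;
    WSAOVERLAPPED  overlapped;
    WSABUF         wsaBuf;        // assumed to point at this connection's buffer
    volatile LONG  pendingOps;
};

void handleCompletion(Connection* conn, DWORD bytes);   // decrements pendingOps

void IssueRecv(Connection* conn)
{
    // Increment BEFORE the call that could lead to the decrement (see below),
    // otherwise a completion on another thread can race with the increment.
    InterlockedIncrement(&conn->pendingOps);

    DWORD flags = 0;
    int rc = WSARecv(conn->socket, &conn->wsaBuf, 1, nullptr, &flags,
                     &conn->overlapped, nullptr);
    if (rc == 0) {
        // Completed inline: no packet will be queued to the IOCP for this
        // operation, so handle it here exactly as the GQCS() path would.
        DWORD bytes = 0, resFlags = 0;
        if (WSAGetOverlappedResult(conn->socket, &conn->overlapped, &bytes,
                                   FALSE, &resFlags))
            handleCompletion(conn, bytes);
    } else if (WSAGetLastError() != WSA_IO_PENDING) {
        // Hard failure: the operation never started, undo the increment.
        InterlockedDecrement(&conn->pendingOps);
    }
    // WSA_IO_PENDING: the completion arrives via GetQueuedCompletionStatus().
}
```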
If you're using a per-operation counter to track when the final operation completes on a connection (and I do this for lifetime management of the per-connection data that I associate as a completion key) then you still do this in exactly the same way and nothing changes...
You can't increment ONLY on ERROR_IO_PENDING because then you have a race condition between the operation completing and the increment occurring. You ALWAYS have to increment before the API which could cause the decrement (in the handler) because otherwise thread scheduling can screw you up. I don't really see how skipping the increment would "simplify" things at all...
Nothing else changes. Except...
Of course you could now have recursive calls into your completion handler (depending on your design) and this was something which was not possible before. For example; You can now have a WSARecv() call complete with a return code of 0 (because there is data available) and your completion handling code can issue another WSARecv() which could also complete with a return code of 0 and then your completion handling code would be called again possibly recursively (depending on the design of your code).
Individual busy connections can now prevent other connections from getting any processing time. If you have 3 concurrent connections, all of the peers are sending data as fast as they can, that is faster than your server can process the data, and you have, for example, 2 I/O threads calling GQCS(), then with FILE_SKIP_COMPLETION_PORT_ON_SUCCESS you may find that two of these connections monopolise the I/O threads (all WSARecv() calls return success, which results in inline processing of all inbound data). In this situation I tend to keep a counter of "max consecutive I/O operations per connection", and once that counter reaches a configurable limit I post the next inline completion to the IOCP and let it be retrieved by a call to GQCS(), as this gives other connections a chance.
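A rough sketch of that safety valve, reusing the placeholder Connection/handleCompletion names from the sketch above (kMaxInlineOps and consecutiveInlineOps are illustrative additions):

```cpp
// Sketch: cap the number of consecutive inline completions per connection,
// then fall back to the IOCP so other connections get serviced.
const LONG kMaxInlineOps = 10;   // configurable limit; the value is arbitrary here

void OnInlineCompletion(HANDLE iocp, Connection* conn, DWORD bytes)
{
    if (++conn->consecutiveInlineOps < kMaxInlineOps) {
        handleCompletion(conn, bytes);          // keep processing inline
    } else {
        conn->consecutiveInlineOps = 0;
        // Hand this completion back to the port; a worker thread will pick it
        // up from GetQueuedCompletionStatus() like any normal completion.
        PostQueuedCompletionStatus(iocp, bytes,
                                   reinterpret_cast<ULONG_PTR>(conn),
                                   reinterpret_cast<OVERLAPPED*>(&conn->overlapped));
    }
}
```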
I have read all the MPI documentation and tutorials and Stack Overflow questions that I could find relevant to this, but I still do not fully understand how MPI_Wait behaves when "completing" an MPI_Isend. Can we summarize it succinctly? Does it
A. Return when the buffer used in the corresponding Isend is usable again (I think yes, but this doesn't tell me everything I want to know).
B. Return when the corresponding Recv completes (I think not necessarily, but maybe sometimes if the message is large?)
C. Guarantee that the Recv can be completed by the receiving process at some later time after it returns
I ask because I am trying to implement a kind of non-blocking broadcast (since MPI_Ibcast is an MPI 3 thing and doesn't seem to exist in any of the implementations I actually encounter). The sequence I am currently following on each process is this:
1. MPI_Isend from every process to every other process
2. Do some other work
3. MPI_Wait for all the Isends to 'complete', whatever that means exactly
4. MPI_Recv all the messages
This seems to work fine in practice, but I don't know if it is guaranteed to work, or if I am just lucky because my messages are small (they are just one int each, so I suspect that they get rapidly shuffled off by MPI into some internal buffers or whatever). I don't know whether this would create deadlocks if the messages were bigger (I worry that in this case not all the MPI_Waits would return, because some might be deadlocked waiting for MPI_Recvs to happen on another process).
If this is not guaranteed to work in general, is it at least guaranteed to work for very small messages? Or is even this not necessarily true? I am really not sure what I can count on here.
If this is not guaranteed to work then how can a non-blocking broadcast be implemented? Perhaps there is some clever sequence of performing the Waits and Recvs that works out? Like first rank 0 Waits for rank 1 to Recv, then rank 1 Waits for rank 0 to Recv? Is some kind of switched pairing arrangement like that more correct?
Your conditions are met by the following:
Once MPI_Isend followed by MPI_Wait have both completed:
A. The buffer used in the corresponding Isend is usable again.
If you were to use MPI_Issend and MPI_Wait:
almost B. Return when the corresponding Recv is posted.
The following is true right after MPI_Send:
C. Guarantee that a matching Recv can be completed by the receiving process at some later time after it returns.
In your suggested non-blocking broadcast, you would have to allow steps 3 and 4 to be swapped, otherwise there is a deadlock. That means there must not be a strict requirement that step 4 happens after step 3. Since step 3 happens on the root and step 4 happens on all other ranks, it may not even be an issue.
I would, however, suggest a cleaner approach instead:
1. MPI_Isend on the root to all processes
2. MPI_Irecv on **all** processes from the root
3. Do some other work (on all processes)
4. MPI_Waitall for all posted requests (both send/recv) on all ranks
That should be clean to implement and work just fine. Of course that is not an optimized collective... but that is another topic.
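For what it's worth, a minimal sketch of that pattern for a single int per rank, with error handling omitted (my_ibcast is just an illustrative name, not a real MPI call):

```cpp
// Sketch of a hand-rolled non-blocking broadcast of one int from rank 0.
#include <mpi.h>
#include <vector>

void my_ibcast(int& value, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    std::vector<MPI_Request> requests;

    if (rank == 0) {
        // 1. The root posts a non-blocking send to every other rank.
        for (int dest = 1; dest < size; ++dest) {
            MPI_Request req;
            MPI_Isend(&value, 1, MPI_INT, dest, /*tag=*/0, comm, &req);
            requests.push_back(req);
        }
    } else {
        // 2. Every other rank posts a non-blocking receive from the root.
        MPI_Request req;
        MPI_Irecv(&value, 1, MPI_INT, 0, /*tag=*/0, comm, &req);
        requests.push_back(req);
    }

    // 3. Do some other work here...

    // 4. Complete all posted requests (sends on the root, receives elsewhere).
    MPI_Waitall(static_cast<int>(requests.size()), requests.data(),
                MPI_STATUSES_IGNORE);
}
```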
I'm trying to implement a gather function that waits for N processes to continue.
struct sembuf operations[2];
operations[0].sem_num = 0;
operations[0].sem_op = -1;  // wait() or p()
operations[0].sem_flg = 0;
operations[1].sem_num = 0;
operations[1].sem_op = 0;   // wait until it becomes 0
operations[1].sem_flg = 0;
semop(this->id, operations, 2);
Initially, the value of the semaphore is N.
The problem is that it freezes even when all processes have executed the semop function. I think it is related to the fact that the operations are executed atomically (but I don't know exactly what that means). But I don't understand why it doesn't work.
Does the code subtract 1 from the semaphore and then block the process if it's not the last one, or is the code supposed to act in a different way?
It's hard to see what the code does without the whole function and algorithm.
By the looks of it, you apply two actions in a single atomic operation: subtract 1 from the semaphore and wait for 0.
There could be several reasons why all processes freeze: the semaphore is not shared between all processes, you got the number of processes wrong when initializing the semaphore, or one process leaves the barrier, later increments the semaphore, and returns to the barrier.
I suggest debugging to check that all processes actually reach the barrier, and maybe even printing something each time you perform any action on the semaphore (preferably to the same console).
As for what an atomic action is: it is a single operation, or sequence of operations, that is guaranteed not to be interrupted while being executed. This means no other process/thread will interfere with the action.
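If the combined atomic set really is the culprit: semop() applies the whole operation array atomically, so neither operation takes effect until both can succeed at once, which for N > 1 never happens here. One way around that is to split the two operations into separate semop() calls, sketched below as a one-shot barrier (no error checking, names are illustrative):

```cpp
// Sketch: split "arrive" and "wait for everyone" into two semop() calls.
// One-shot barrier; semaphore 0 of semid must be initialised to N beforehand.
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

void barrier_wait(int semid)
{
    struct sembuf arrive = { 0, -1, 0 };  // subtract 1: "I have arrived"
    struct sembuf wait0  = { 0,  0, 0 };  // block until the value reaches 0

    semop(semid, &arrive, 1);  // applied immediately while the value is > 0
    semop(semid, &wait0, 1);   // returns once the last process has arrived
}
```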
Prove or Disprove the correctness of the following semaphore.
Here are my thoughts on this.
Well, if someone uses it such that wait runs before signal, there will be a deadlock. The program will call wait, decrement count, enter the count < 0 branch, and wait at gate. Because it is waiting at gate, it cannot proceed to the signal that comes right after the wait. So in that case, this might imply that the semaphore is incorrect.
However, if we assume that two processes are running, one running wait first and the other running signal first, then if the first process runs wait and blocks at wait(gate), the other process can run signal and release the blocked process. Thus, continuing with this scheme, the algorithm would be valid and not result in a deadlock.
The given implementation follows these principles:
The binary semaphore S protects the count variable from concurrent access.
If non-negative, count reflects the number of free resources of the general semaphore. Otherwise, the absolute value of count reflects the number of threads which are waiting (p5) or ready-to-wait (between p4 and p5) on the binary semaphore gate.
Every signal() call increments count and, if its previous value was negative, signals the binary semaphore gate.
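To make the p-numbers concrete, here is a plausible reconstruction of the implementation under discussion (the original listing is not reproduced in the question, so the labels, names and exact structure are guesses; wait_b/signal_b stand for the binary-semaphore primitives):

```cpp
// Plausible reconstruction only: the original p1..p10 listing is not shown,
// so everything here is inferred from the description above.
struct BinarySemaphore;                 // primitive; details elided
void wait_b(BinarySemaphore&);          // blocks until the semaphore is free
void signal_b(BinarySemaphore&);        // frees the semaphore / wakes one waiter

struct Semaphore {
    int count;                // >= 0: free resources; < 0: -(number of waiters)
    BinarySemaphore S;        // protects count
    BinarySemaphore gate;     // where waiters block
};

void wait(Semaphore& sem) {
    wait_b(sem.S);
    --sem.count;
    if (sem.count < 0) {
        signal_b(sem.S);                // p4: "ready-to-wait" window opens here
        wait_b(sem.gate);               // p5
    } else {
        signal_b(sem.S);
    }
}

void signal(Semaphore& sem) {
    wait_b(sem.S);
    ++sem.count;
    if (sem.count <= 0)                 // previous value was negative
        signal_b(sem.gate);             // p10
    signal_b(sem.S);
}
```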
But, because of the possibility of the ready-to-wait state, the given implementation is incorrect:
Assume thread #1 calls wait() and is currently in the ready-to-wait state. Assume another thread #2 also calls wait() and is currently in the ready-to-wait state too.
Assume thread #3 calls signal() at this moment. Because count is negative (-2), the thread performs all operations including p10 (signal(gate)). Because nobody is waiting on gate at the moment, gate becomes free.
Assume another thread #4 calls signal() at this moment. Because count is still negative (-1), the thread also performs all operations including p10. But now gate is already free, so signal(gate) is a no-op here and we have missed a signal event: only one of thread #1 and thread #2 will continue after executing p5 (wait(gate)). The other thread will wait forever.
Without the possibility of the ready-to-wait state (that is, if signal(S) and wait(gate) were executed atomically), the implementation would be OK.
My problem: my completion port server will receive data of unknown size from different clients; the thing is, I don't know how to avoid buffer overruns, i.e. how to avoid my (receiving) buffer being "overfilled" with data.
Now to the questions:
1) If I make a receive call via WSARecv, does the worker thread work like a callback function? I mean, does it dig up the receive call only when it has completed, or does it also dig it up while the receive is still happening? Does the lpNumberOfBytes variable (from GetQueuedCompletionStatus) contain the number of bytes received so far or the total number of bytes received?
2) How do I avoid overruns? I thought of dynamically allocated buffer structures, but then again, how do I find out how big the packet is going to be?
Edit: I hate to ask this, but is there any "simple" method for managing the buffers and avoiding overruns? Synchronisation sounds off-limits to me, at least right now.
If I make a receive call via WSARecv, does the worker thread work like a callback function?
See @valdo's post. Completion data is queued to your pool of threads and one of them will be made ready to process it.
'I mean, does it dig up the receive call only when it has completed?' Yes, hence the name. Note that the meaning of 'completed' may vary depending on the protocol. With TCP, it means that some streamed data bytes have been received from the peer.
'Does the lpNumberOfBytes variable (from GetQueuedCompletionStatus) contain the number of bytes received so far or the total number of bytes received?' It contains only the number of bytes received and loaded into the buffer array provided for that one completed operation.
'How do I avoid overruns? I thought of dynamically allocated buffer structures, but then again, how do I find out how big the packet is going to be?' You cannot get overruns if you provide the buffer arrays: the kernel thread(s) that load the buffer(s) will not exceed the passed buffer lengths. At the application level, given the streaming nature of TCP, it's up to you to decide how to assemble the buffer contents into usable application-level protocol units. You must decide, using your knowledge of the services provided, on a suitable buffer management scheme.
My last IOCP server was somewhat general-purpose. I used an array of buffer pools and a pool of 'buffer-carrier' objects, allocated at startup (along with a pool of socket objects). Each buffer pool held buffers of a different size. Upon a new connection, I issued a WSARecv using one buffer from the smallest pool. If this buffer got completely filled, I used a buffer from the next largest pool for the next WSARecv, and so on.
Then there's the issue of the sequence numbers needed to prevent out-of-order buffering with multiple handler threads :(
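To illustrate the "application-level protocol units" point above, here is a sketch of reassembling length-prefixed messages from the received byte stream (the 4-byte length prefix is an assumed protocol, not something from the question):

```cpp
// Sketch: reassemble length-prefixed messages from a TCP byte stream.
// Assumes each message starts with a 4-byte native-endian length prefix.
#include <cstdint>
#include <cstring>
#include <vector>

class MessageAssembler {
public:
    // Feed bytes as they arrive from each completed WSARecv();
    // onMessage(ptr, len) is invoked for every complete message found.
    template <typename Fn>
    void feed(const char* data, std::size_t len, Fn onMessage)
    {
        buffer_.insert(buffer_.end(), data, data + len);
        for (;;) {
            if (buffer_.size() < 4) return;               // prefix not complete yet
            std::uint32_t msgLen;
            std::memcpy(&msgLen, buffer_.data(), 4);
            if (buffer_.size() - 4 < msgLen) return;      // body not complete yet
            onMessage(buffer_.data() + 4, msgLen);
            buffer_.erase(buffer_.begin(), buffer_.begin() + 4 + msgLen);
        }
    }
private:
    std::vector<char> buffer_;  // bytes received but not yet consumed
};
```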
1. A completion port is a sort of queue (with sophisticated logic concerning the priority of the threads waiting to dequeue an I/O completion from it). Whenever an I/O operation completes (either successfully or not), it's queued into the completion port. Then it's dequeued by one of the threads that called GetQueuedCompletionStatus.
So you never dequeue an I/O operation that is "in progress". Moreover, it's processed by your worker thread asynchronously; that is, processing is delayed until your thread calls GetQueuedCompletionStatus.
2. This is actually a complex matter. Synchronization is not a trivial task overall, especially when it comes to symmetric multi-threading (where you have several threads, each of which may be doing anything).
One of the parameters you receive with a completed I/O is a pointer to an OVERLAPPED structure (the one you supplied to the function that issued the I/O, such as WSARecv). It's common practice to allocate your own structure based on OVERLAPPED (it either inherits from OVERLAPPED or has it as its first member). Upon receiving a completion you can cast the dequeued OVERLAPPED to your actual data structure. There you can keep everything needed for synchronization: sync objects, state description, etc.
Note however that it's not a trivial task to synchronize things correctly (to get good performance and avoid deadlocks) even when you have the custom context. This demands a careful design.
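For example, a common shape for such a structure is something like this (a sketch; the field names are made up):

```cpp
// Sketch of the "extended OVERLAPPED" pattern described above.
#include <winsock2.h>
#include <windows.h>

struct PerIoData {
    WSAOVERLAPPED overlapped;   // passed to WSARecv()/WSASend()
    WSABUF        wsaBuf;       // points into the buffer below
    char          buffer[4096];
    int           operation;    // e.g. OP_READ / OP_WRITE, defined by you
    LONG          sequence;     // for ordering with multiple worker threads
};

// In a worker thread, after GetQueuedCompletionStatus() fills in 'ov':
//   PerIoData* io = CONTAINING_RECORD(ov, PerIoData, overlapped);
//   ... use io->operation, io->buffer and the byte count from GQCS ...
```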
I understand that calling boost::asio::ip::tcp::socket::async_receive (or boost::asio::ip::tcp::socket::async_send) twice in a row may result in bad behavior.
Is it OK if I call boost::asio::ip::tcp::socket::async_receive and boost::asio::ip::tcp::socket::async_send at the same time?
I am going to have 2 or more threads running boost::asio::io_service::run(), so you need to take that into account.
Thanks
This has to be OK. How else would you perform full duplex async communications on a single service? You need a receive outstanding at all times for incoming data.
The Boost docs indicate only that each of async_read and async_write must be called serially. For example, for async_read:
> The program must ensure that the stream performs no other read operations (such as async_read, the stream's async_read_some function, or any other composed operations that perform reads) until this operation completes.
The docs for socket are not specific on this point, it's true.
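For illustration, a sketch of that full-duplex arrangement, with one receive and one send outstanding at the same time (names are illustrative; with several threads calling io_service::run() you would typically also wrap the handlers in a strand):

```cpp
// Sketch: one outstanding async receive and one outstanding async send on the
// same socket. The Session must be owned by a boost::shared_ptr for
// shared_from_this() to work; buffers and framing are illustrative only.
#include <boost/asio.hpp>
#include <boost/bind.hpp>
#include <boost/enable_shared_from_this.hpp>

using boost::asio::ip::tcp;

class Session : public boost::enable_shared_from_this<Session> {
public:
    explicit Session(boost::asio::io_service& io) : socket_(io) {}
    tcp::socket& socket() { return socket_; }

    void start()
    {
        // Keep a receive outstanding at all times for incoming data...
        socket_.async_receive(boost::asio::buffer(readBuf_),
            boost::bind(&Session::onRead, shared_from_this(),
                        boost::asio::placeholders::error,
                        boost::asio::placeholders::bytes_transferred));
        // ...while a send can be outstanding at the same time.
        socket_.async_send(boost::asio::buffer(writeBuf_),
            boost::bind(&Session::onWrite, shared_from_this(),
                        boost::asio::placeholders::error,
                        boost::asio::placeholders::bytes_transferred));
    }

private:
    void onRead(const boost::system::error_code& ec, std::size_t n)
    {
        if (!ec) {
            // process readBuf_[0..n), then post the next async_receive here
        }
    }
    void onWrite(const boost::system::error_code& ec, std::size_t /*n*/)
    {
        if (!ec) {
            // post the next async_send here if more data is queued
        }
    }

    tcp::socket socket_;
    char readBuf_[4096];
    char writeBuf_[4096];
};
```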