How to use Overlapped I/O with sockets? - c++

I want to use Overlapped I/O in my server, but I am unable to find many tutorials on the subject (most of the tutorials are about Overlapped I/O with Completion Ports, and I want to use a callback function).
My server will have a maximum of 400 clients connected at one time, and it only sends and receives data at long intervals (every 30 seconds, a few kilobytes of data are exchanged between the server and the clients).
The main reason why I want to use Overlapped I/O is because select() can only handle a maximum of 64 sockets (and I have 400!).
So I will tell you how I understand Overlapped I/O and correct me if I'm wrong:
If I want to receive data from one of the clients, I use WSARecv() and supply the socket handle, and a buffer to be filled with the received data, and also I supply a callback function. When the data is received and filled in the buffer, the callback function will be called, and I can process the data.
When I want to send data, I use WSASend() and again supply the socket handle and the callback function. When the data is sent (I'm not sure whether that means placed in the underlying send buffer or actually placed on the wire), the callback will also be called, telling me that the data was sent and that I can send the next piece of data.

The one thing you appear to have misunderstood is that OVERLAPPED callbacks are actually synchronous: they never interrupt your thread at an arbitrary point.
You said:
When the data is received and filled in the buffer, the callback function will be called
Reality:
When a call is made to an alertable wait function (e.g. SleepEx or MsgWaitForMultipleObjectsEx), if data has been received and filled in the buffer, the callback function will be called
As long as you are aware of that, you should be in good shape. I agree with you that overlapped I/O with callbacks is a great approach in your scenario. Because callbacks occur on the thread performing the I/O, you don't have to worry about synchronizing access from multiple threads, the way you would need to with completion ports and work items on the thread pool.
Oh, also make sure to check for WSA_IO_PENDING, because it's possible for operations to complete synchronously if there's enough data already buffered (for receive) or enough space in the buffer (for send). In that case the call returns 0 instead of reporting WSA_IO_PENDING. The callback will still occur, but it is queued for the next alertable wait; it never runs immediately. Certain errors will also be reported synchronously. Others will come to your callback.
Also, it's guaranteed that your callback gets queued exactly once for every operation that returned 0 or reported WSA_IO_PENDING, whether that operation completes successfully, is cancelled, or fails with some other error. You can't reuse the buffer until that callback has happened.
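To make the ordering concrete, here is a small portable model of that behaviour. It is not Winsock code: ModelThread, start_operation and alertable_wait are hypothetical stand-ins for the calling thread's queue of completion routines, a WSARecv()/WSASend() call that supplies a completion routine, and SleepEx(0, TRUE). It only illustrates that a routine is queued once per accepted operation and runs at the next alertable wait, even when the call itself reported immediate completion.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one thread and its queue of completion routines.
struct ModelThread {
    std::deque<std::function<void()>> queued;

    // Models WSARecv()/WSASend() with a completion routine.
    // Returns true for "completed immediately" (the real call returns 0) and
    // false for "pending" (the real call reports WSA_IO_PENDING).
    // Simplification: the I/O is assumed to finish before the next wait, so
    // the routine is queued right away in both cases. It is never run here.
    bool start_operation(bool completes_immediately, std::function<void()> routine) {
        queued.push_back(std::move(routine));
        return completes_immediately;
    }

    // Models an alertable wait such as SleepEx(0, TRUE): runs the queued
    // routines in FIFO order on the calling thread and returns how many ran.
    std::size_t alertable_wait() {
        std::size_t ran = 0;
        while (!queued.empty()) {
            std::function<void()> routine = std::move(queued.front());
            queued.pop_front();
            routine();
            ++ran;
        }
        return ran;
    }
};

// Issues one "immediate" and one "pending" operation, then waits alertably.
// Returns the order in which things happened.
std::vector<std::string> run_scenario() {
    ModelThread thread;
    std::vector<std::string> log;

    bool a_done = thread.start_operation(true, [&log] { log.push_back("callback A"); });
    log.push_back(a_done ? "A returned 0" : "A pending");

    bool b_done = thread.start_operation(false, [&log] { log.push_back("callback B"); });
    log.push_back(b_done ? "B returned 0" : "B pending");

    log.push_back("alertable wait");
    thread.alertable_wait();
    return log;
}
```

In the model, neither callback appears in the log before the "alertable wait" entry, which is the point the answer makes about operations that complete synchronously.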

The I/O completion callback mechanism works fine; I've used it a few times with no problems. On 32-bit systems, you can put the 'this' pointer for the socket-context instance into the hEvent field of the OVERLAPPED struct and retrieve it in the callback. Not sure how to do it on 64-bit systems :(
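On the 64-bit question: as far as I know, HANDLE is a pointer-sized type on both 32-bit and 64-bit Windows, and the WSARecv() documentation says hEvent is ignored when a completion routine is supplied and may be used to pass context, so the same trick should carry over. The sketch below shows the pattern with a stand-in struct so that it compiles anywhere. FakeOverlapped, SocketContext and their members are hypothetical names, and only the hEvent field is modelled.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for WSAOVERLAPPED; only hEvent is modelled.
// In the real structure hEvent is a HANDLE, which is a pointer-sized type.
struct FakeOverlapped {
    void* hEvent = nullptr;
};

class SocketContext {
public:
    explicit SocketContext(std::string name) : name_(std::move(name)) {}

    // Stash `this` in the overlapped block before the operation is issued.
    void prepare(FakeOverlapped& ov) { ov.hEvent = this; }

    // Mirrors the shape of a Winsock completion routine
    // (error, bytes transferred, overlapped pointer, flags).
    static void completion_routine(unsigned long error, unsigned long bytes,
                                   FakeOverlapped* ov, unsigned long /*flags*/) {
        // Recover the per-socket object from the pointer-sized field.
        SocketContext* self = static_cast<SocketContext*>(ov->hEvent);
        if (error == 0) {
            self->bytes_seen_ += bytes;
        } else {
            self->failed_ = true;
        }
    }

    unsigned long bytes_seen() const { return bytes_seen_; }
    bool failed() const { return failed_; }
    const std::string& name() const { return name_; }

private:
    std::string name_;
    unsigned long bytes_seen_ = 0;
    bool failed_ = false;
};
```

Because a pointer is stored rather than a 32-bit integer, nothing in the pattern depends on the pointer width.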

Related

Under what conditions does MPI_Isend/MPI_Irecv wait for its associated completion call (MPI_Wait/MPI_Test) to start data transmission?

One of the comments in this post briefly mentions
The standard allows the implementation to postpone the actual data transmission until the wait/test call.
Is it always the case that data transmission of MPI_Isend/MPI_Irecv is postponed until the associated completion call (MPI_Wait/MPI_Test or their variants) is invoked? If not, what conditions influence this?
MPI_Wait is used to wait for a single communication to complete
MPI_Waitall is used to wait for a list of communications to complete
MPI_Test and MPI_Testall are used with non-blocking communications to check whether the communications have finished, without waiting for them to finish.
With MPI_Isend, each value being sent has to be kept in its own storage, for example as a separate element of an array.
This is because the data can be sent at any time until the matching MPI_Wait or MPI_Waitall has completed.
This means that the data mustn't be changed, overwritten, or go out of scope within this interval.
This is different from MPI_Send, where the data to be sent is buffered and/or actually sent by the time MPI_Send returns.
The same is true of MPI_Irecv, though this is more obvious there because you need the buffer in order to receive the data.
Note
In MPI_Testall the flag will be true only if all the communications are finished.

Can I call WSASend() repeatedly?

When using IOCP, if I call WSASend(), do I have to wait for the notification to arrive before making another call to it, or can I call it multiple times before receiving any notification, for example is something like this allowed:
WSASend();
// Call it again without waiting for notification of previous call
WSASend();
// Call it again without waiting for notification of previous call
WSASend();
Yes, you can make multiple I/O requests without waiting for completion notifications. Alternatively, you can WSASend() multiple buffers with one call.
Either way, or both, will work fine. The OVERLAPPED block for each call is, essentially, pointers for a linked-list of I/O requests, so they can all be queued up and executed by the kernel and comms stack/s when the I/O resource is available.
This applies to WSARecv etc. overlapped I/O too. This allows the kernel/stack to be loading buffers while user thread code is processing those notified earlier.
Note: the OVERLAPPED block, and the buffers, must be unique per call and their lifetime must extend to the completion notification. You must not let them be RAII'd or delete'd away before a user thread has handled the completion notification. It's usual for the buffer and OVERLAPPED to be members of one 'IOrequest' class, (with a 'SocketContext*' class pointer member to connect every IOrequest up with its bound socket).
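A minimal sketch of that 'IOrequest' idea follows, using a stand-in struct instead of the real OVERLAPPED so it compiles on any platform. FakeOverlapped, IoRequest, SocketContext, submit and complete are all hypothetical names. The point is the lifetime rule: ownership of the block and buffer leaves the caller when the request is issued and only comes back when the completion is handled.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <type_traits>

// Hypothetical stand-in for OVERLAPPED (the real one is an opaque Win32 struct).
struct FakeOverlapped {
    void* internal[4] = {};
};

// Hypothetical per-socket state shared by all requests on that socket.
struct SocketContext {
    std::size_t bytes_sent = 0;
};

// One of these per call: a unique overlapped block plus a unique buffer.
struct IoRequest {
    FakeOverlapped overlapped;        // first member, so the addresses coincide
    char buffer[4096] = {};
    SocketContext* socket = nullptr;  // links the request to its socket
};

static_assert(std::is_standard_layout<IoRequest>::value,
              "needed for the cast from the first member back to IoRequest");
static_assert(offsetof(IoRequest, overlapped) == 0, "overlapped must be first");

// Models issuing WSASend(): the request must now outlive this scope, so
// ownership is released until the completion notification arrives.
FakeOverlapped* submit(std::unique_ptr<IoRequest> request) {
    return &request.release()->overlapped;
}

// Models handling the completion notification: recover the IoRequest from the
// overlapped pointer and take ownership back so it is freed or reused safely.
std::unique_ptr<IoRequest> complete(FakeOverlapped* ov, std::size_t bytes) {
    std::unique_ptr<IoRequest> request(reinterpret_cast<IoRequest*>(ov));
    request->socket->bytes_sent += bytes;
    return request;
}

// Convenience helper for building a request bound to a socket.
std::unique_ptr<IoRequest> make_request(SocketContext& socket) {
    std::unique_ptr<IoRequest> request(new IoRequest);
    request->socket = &socket;
    return request;
}
```

With real Win32 types the cast is usually written with the CONTAINING_RECORD macro, which does not require the OVERLAPPED to be the first member.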
Yes you can issue multiple overlapped operations on a single socket without needing to wait for any completions to occur.
One thing that you need to be aware of with multiple outstanding WSASend() calls on a TCP socket is that you are effectively handing over resource management for the buffers used in the WSASend() calls to the peer on the other end of the socket. The reason for this is that if you send data and the peer doesn't read it as fast as you are sending then you will eventually cause TCP's flow control to kick in. This doesn't prevent you issuing more WSASend() calls and all you will notice is that the completions take longer and longer to occur. See here for more details.
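One way to keep that under control, offered here as an assumption rather than something the answer prescribes, is to cap the number of bytes that have been handed to WSASend() but not yet completed, and to hold back further sends once the cap is reached. SendBudget and its members are hypothetical names; the sketch contains only the bookkeeping and no socket calls.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: tracks bytes passed to WSASend() whose completions
// have not been handled yet, and refuses new sends beyond a fixed limit.
class SendBudget {
public:
    explicit SendBudget(std::size_t max_outstanding) : limit_(max_outstanding) {}

    // Call before issuing a send. Returns false if the send should be
    // deferred (for example, queued in the application) until completions
    // free up room.
    bool try_reserve(std::size_t bytes) {
        // outstanding_ never exceeds limit_, so the subtraction cannot wrap.
        if (bytes > limit_ - outstanding_) {
            return false;
        }
        outstanding_ += bytes;
        return true;
    }

    // Call from the completion handler for a send of `bytes` bytes.
    void release(std::size_t bytes) {
        outstanding_ -= (bytes <= outstanding_) ? bytes : outstanding_;
    }

    std::size_t outstanding() const { return outstanding_; }

private:
    std::size_t limit_;
    std::size_t outstanding_ = 0;
};
```

If the peer stops reading, completions stop arriving, try_reserve starts returning false, and the memory tied up in send buffers stays bounded by the limit.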

IO Completion Ports and OVERLAPPED management

How does Win32 manage instances of the OVERLAPPED struct in the context of these two functions:
GetQueuedCompletionStatus
PostQueuedCompletionStatus
When I call GetQueuedCompletionStatus, does Win32 free the instance of the OVERLAPPED struct, or must I do it myself?
When I send data with PostQueuedCompletionStatus, does Win32 copy it to internal structures? When must I free the memory of the sent data?
Where can I find a diagram of how OVERLAPPED data is processed between GetQueuedCompletionStatus, PostQueuedCompletionStatus and the IOCP queue?
The OVERLAPPED structure must exist from when a successful I/O operation (or manual PostQueuedCompletionStatus()) executes until the OVERLAPPED emerges from a call to GetQueuedCompletionStatus().
You are responsible for the lifetime of the structure.
You'll see from the MSDN docs that GetQueuedCompletionStatus() actually takes "a pointer to a variable that receives the address of the OVERLAPPED structure that was specified when the completed I/O operation was started.". What you actually get out of that call is a pointer to the original OVERLAPPED that you passed when you made the PostQueuedCompletionStatus() call (or initiated an overlapped I/O operation).
This is all actually very useful as the "normal" way to use the OVERLAPPED structure is to place it inside a larger structure which holds all of the 'per operation' information that you might need - so it's the ideal way to navigate directly from the limited information that you're given when you call GetQueuedCompletionStatus() to, for example, the data buffer that you used in your overlapped read call...
I find the best way to deal with OVERLAPPED structures is to a) embed them in the buffer you're using for read/write b) reference count them and c) return them to a pool for reuse when the ref count drops to 0.
I have some source code that you could download (here) which may make this a little easier to understand (it's a full IOCP server example so it's a little complex, but it works and shows how these things can be used).
You should pass the address of an OVERLAPPED * to GetQueuedCompletionStatus. This gets filled in with the value passed to PostQueuedCompletionStatus.
You should not free this data in the PostQueuedCompletionStatus context. It should be done by the context using GetQueuedCompletionStatus. (Assuming it was allocated dynamically in the first place - there is no requirement that it is a dynamically allocated structure, it could be taken out of a fixed pool, or be allocated on the stack of a function that doesn't return until it has been signalled that the operation is complete).
I'm not sure there is such a picture.

is it valid to async send data before completion handler of the previous one was invoked?

I'm sending data asynchronously to TCP socket. Is it valid to send the next data piece before the previous one was reported as sent by completion handler?
As far as I know, it's not allowed when sending is done from different threads. In my case all sends are done from the same thread.
Different modules of my client send data to the same socket. E.g. module1 sends some data and will continue when the corresponding completion handler is invoked. Before that happens, io_service invokes the deadline_timer handler of module2, which leads to another async_write call. Should I expect any problems here?
Is it valid to send the next data piece before the previous one was
reported as sent by completion handler?
No, it is not valid to interleave write operations. This is very clear in the documentation:
This operation is implemented in terms of zero or more calls to the
stream's async_write_some function, and is known as a composed
operation. The program must ensure that the stream performs no other
write operations (such as async_write, the stream's async_write_some
function, or any other composed operations that perform writes) until
this operation completes.
emphasis added by me.
As far as I know, it's not allowed when sending is done from different
threads. In my case all sends are done from the same thread.
Your problem has nothing to do with threads.
Yes, you can do that as long as the underlying memory (buffer) is not modified until the write handler is called. Calling async_write means you hand over the buffer ownership to Asio. When the write handler is called, the buffer ownership is given back to you.
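A common way to meet the documentation's requirement quoted above (no other write on the stream until the composed operation completes) is to put outgoing messages in a queue and start the next write only from the previous write's completion handler. The sketch below shows just that bookkeeping without depending on Boost. WriteQueue and the start_write callback are hypothetical; in a real program start_write would call async_write on the socket and its handler would call on_write_complete. It assumes everything runs on one thread, as in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>

class WriteQueue {
public:
    // Hypothetical stand-in for "call async_write with this data".
    using StartWrite = std::function<void(const std::string&)>;

    explicit WriteQueue(StartWrite start_write)
        : start_write_(std::move(start_write)) {}

    // Called by any module that wants to send. The data is copied into the
    // queue, so the caller's buffer can be reused immediately.
    void send(std::string data) {
        queue_.push_back(std::move(data));
        // Only start a write if none is in flight. std::deque keeps references
        // to existing elements valid across push_back, so the in-flight
        // message stays put while later ones are queued behind it.
        if (queue_.size() == 1) {
            start_write_(queue_.front());
        }
    }

    // Called from the write completion handler.
    void on_write_complete() {
        if (queue_.empty()) {
            return;
        }
        queue_.pop_front();  // the finished message is no longer needed
        if (!queue_.empty()) {
            start_write_(queue_.front());
        }
    }

    std::size_t pending() const { return queue_.size(); }

private:
    StartWrite start_write_;
    std::deque<std::string> queue_;
};
```

In the question's scenario, module2's timer handler would call send() while module1's write is still in flight; its data waits in the queue and is written only after module1's handler has run.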

What's the difference between WaitForMultipleObjects and boost::asio on multiple windows::basic_handle's?

I have a list of HANDLE's, controlled by a lot of different IO devices. What would be the (performance) difference between:
A call to WaitForMultipleObjects on all these handles
async_read on boost::windows::basic_handle's around all these handles
Is WaitForMultipleObjects O(n) time complex with n the amount of handles?
You can somehow call async_read on a windows::basic_handle right? Or is that assumption wrong?
If I call run on the same IO device in multiple threads, will the handling-calls be balanced between those threads? That would be a major benefit of using asio.
It sounds like the main benefit you would derive from asio is that it is built on top of I/O completion ports (iocp for short). So, let's start by comparing iocp with WaitForMultipleObjects(). The difference between these two approaches is essentially the same as select vs. epoll on Linux.
The main drawback of WaitForMultipleObjects that was solved by iocp is the inability to scale with many file descriptors. It is O(n), since for each event you receive you pass in the full array again, and internally WaitForMultipleObjects must scan the array to know which handles to trigger on.
However, this is rarely a problem because of the second drawback. WaitForMultipleObjects() has a limit on the max number of handles it can wait on (MAXIMUM_WAIT_OBJECTS). This limit is 64 objects (see winnt.h). There are ways around this limit by creating Event objects and tying multiple sockets to each event, and then wait on 64 events.
The third drawback is that there's actually a subtle "bug" in WaitForMultipleObjects(). It returns the index of the handle which triggered an event. This means it can only communicate a single event back to the user. This is different from select, which will return all file descriptors that triggered an event. WaitForMultipleObjects scans the handles passed in to it and returns the first handle that has its event raised.
This means that if you are waiting on 10 very active sockets, all of which have an event on them most of the time, there will be a very heavy bias toward servicing the first socket in the list passed in to WaitForMultipleObjects. This can be circumvented: every time the function returns and the event has been serviced, run it again with a timeout of 0, but this time pass in only the part of the array starting 1 past the handle that triggered. Repeat until all handles have been visited, then go back to the original call with all handles and an actual timeout.
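The rescan loop described above can be sketched as follows. To keep it portable, wait_any is a hypothetical stand-in for WaitForMultipleObjects called with bWaitAll set to FALSE and a zero timeout on the tail of the handle array; it reports the lowest signaled index at or after begin, which mimics the bias toward early handles. The real call returns an index relative to the start of the sub-array it was given, so real code would have to add the offset back.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sentinel meaning "nothing signaled" (plays the role of WAIT_TIMEOUT).
constexpr std::size_t kNone = static_cast<std::size_t>(-1);

// Hypothetical stand-in for WaitForMultipleObjects on handles[begin..end)
// with a zero timeout: returns the first signaled index, biased to the front.
std::size_t wait_any(const std::vector<bool>& signaled, std::size_t begin) {
    for (std::size_t i = begin; i < signaled.size(); ++i) {
        if (signaled[i]) {
            return i;
        }
    }
    return kNone;
}

// One fair pass: after each hit, rescan only the handles after it, so every
// signaled handle is serviced once before the scan returns to the front.
std::vector<std::size_t> service_all(const std::vector<bool>& signaled) {
    std::vector<std::size_t> serviced;
    std::size_t begin = 0;
    while (begin < signaled.size()) {
        std::size_t hit = wait_any(signaled, begin);
        if (hit == kNone) {
            break;  // nothing else is signaled in the remaining tail
        }
        serviced.push_back(hit);  // real code would handle the event here
        begin = hit + 1;          // skip past the handle just serviced
    }
    return serviced;
}
```

After a pass finishes, the caller would go back to waiting on the full array with a real timeout, as the answer describes.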
iocp solves all of these problems, and also introduces an interface for a more generic event notification, which is quite nice.
With iocp (and hence asio):
You don't repeat which handles you're interested in; you tell Windows once, and it remembers. This means it scales a lot better with many handles.
You don't have a limit on the number of handles you can wait on.
You get every event, i.e. there's no bias towards any specific handle.
I'm not sure about your assumption of using async_read on a custom handle. You might have to test that. If your handle refers to a socket, I would imagine it would work.
As for the threading question; yes. If you run() the io_service in multiple threads, events are dispatched to a free thread, and will scale with more threads. This is a feature of iocp, which even has a thread pool API.
In short: I believe asio or iocp would provide better performance than simply using WaitForMultipleObjects, but whether or not that performance will benefit you mostly depends on how many handles you have and how active they are.
Both WaitForSingleObject and WaitForMultipleObjects are widely used functions. WaitForSingleObject is used to wait on a single thread synchronization object. It returns when the object is signaled or when the time-out interval elapses. If the time-out interval is INFINITE, it waits indefinitely.
DWORD WaitForSingleObject(
    HANDLE hHandle,
    DWORD  dwMilliseconds
);
WaitForMultipleObjects is used to wait for multiple objects to be signaled. A semaphore is non-signaled when its count reaches zero. An auto-reset event or a mutex becomes non-signaled when a wait on it is satisfied. A manual-reset event is not affected by the wait functions; it stays signaled until it is explicitly reset.
DWORD WaitForMultipleObjects(
    DWORD        nCount,
    const HANDLE *lpHandles,
    BOOL         bWaitAll,
    DWORD        dwMilliseconds
);
If dwMilliseconds is zero, the function does not enter a wait state if the object is not signaled; it always returns immediately. If dwMilliseconds is INFINITE, the function will return only when the object is signaled.