IO Completion Ports and OVERLAPPED management - c++

How win32 manages instances of OVERLAPPED struct in context of two functions:
GetQueuedCompletionStatus
PostQueuedCompletionStatus
When I call GetQueuedCompletionStatus does win32 free instance of OVERLAPPED struct or I must do it by my self?
When I send data with PostQueuedCompletionStatus does win32 copy it to internal structs? When I must free memory of sent data?
Where I could find some picture with scheme of processing of OVERLAPPED data between GetQueuedCompletionStatus, PostQueuedCompletionStatus and IOCP queue?

The OVERLAPPED structure must exist from when a successful I/O operation (or manual PostQueuedCompletionStatus()) executes until the OVERLAPPED emerges from a call to GetQueuedCompletionStatus().
You are responsible for the lifetime of the structure.
You'll see from the MSDN docs that GetQueuedCompletionStatus() actually takes "a pointer to a variable that receives the address of the OVERLAPPED structure that was specified when the completed I/O operation was started.". What you actually get out of that call is a pointer to the original OVERLAPPED that you passed when you made the PostQueuedCompletionStatus() call (or initiated an overlapped I/O operation).
This is all actually very useful as the "normal" way to use the OVERLAPPED structure is to place it inside a larger structure which holds all of the 'per operation' information that you might need - so it's the ideal way to navigate directly from the limited information that you're given when you call GetQueuedCompletionStatus() to, for example, the data buffer that you used in your overlapped read call...
I find the best way to deal with OVERLAPPED structures is to a) embed them in the buffer you're using for read/write b) reference count them and c) return them to a pool for reuse when the ref count drops to 0.
I have some source code that you could download (here) which may make this a little easier to understand (it's a full IOCP server example so it's a little complex, but it works and shows how these things can be used).

You should pass a the address of a OVERLAPPED * to GetQueuedCompletionStatus. This gets filled in with the value passed to PostQueuedCompletionStatus.
You should not free this data in the PostQueuedCompletionStatus context. It should be done by the context using GetQueuedCompletionStatus. (Assuming it was allocated dynamically in the first place - there is no requirement that it is a dynamically allocated structure, it could be taken out of a fixed pool, or be allocated on the stack of a function that doesn't return until it has been signalled that the operation is complete).
I'm not sure there is such a picture.

Related

Memory strategy when using IOCP

I am new to Windows IOCP, currently I am rewriting the network part of our server using IOCP. I am trying to figure out how to handle the memory (Overlapped / WSABUF objects).I have a little struct that derives from OVERLAPPED and contains a WSABUF and a few other fields.
I tried keeping a ring buffer these objects and reusing them, but it didn't work. I thought that when the completion routine is called, one of the objects will be passed back to me and I can mark it as available again.
When I allocate them on the heap (and keep in a vector), how do I know when it's safe to delete/reuse them?
Thanks,
Michael

Why GetQueuedCompletionStatus() does not return operation type?

GetQueuedCompletionStatus() dequeue a completion notification, but it does not return what type of notification it is (e.g. Read notification, write notification).
It is my responsibility to keep track of what operations I initiate, for example when I use WSARecv(), I add a flag to the OVERLAPPED structure that indicate what operation this is (read in this case), and when I dequeue a notification, I read this flag. So does anyone knows why GetQueuedCompletionStatus() does not return the type of operation?
Why should it care? You can pass user data through the APIs that result in a completion being extracted via GetQueuedCompletionStatus() so why do you need anything else? Since you can post your own completions using PostQueuedCompletionStatus() there's an infinite number of 'operations' that you could complete, so pass it in the 'extended OVERLAPPED structure and you can pass anything...
If you COULD pass your own separate flag then it wouldn't actually remove the need to pass additional extra stuff as an extended OVERLAPPED structure as it's pretty useful to be able to pass the data buffer and other information along with the operation anyway, so one extra flag is unlikely to be worth having... My designs need more than your designs need, so lets just deal with the method that the API designers gave us...

How to use Overlapped I/O with sockets?

I want to use Overlapped I/O in my server, but I am unable to find many tutorials on the subject (most of the tutorials are about Overlapped I/O with Completion Ports, and I want to use a callback function).
My server will have a maximum of 400 clients connected at one time, and it only send and receive data at a long periods of time (each 30 seconds a few kilobytes worth of data is exchanged between the server and the clients).
The main reason why I want to use Overlapped I/O is because select() can only handle a maximum of 64 sockets (and I have 400!).
So I will tell you how I understand Overlapped I/O and correct me if I'm wrong:
If I want to receive data from one of the clients, I use WSARecv() and supply the socket handle, and a buffer to be filled with the received data, and also I supply a callback function. When the data is received and filled in the buffer, the callback function will be called, and I can process the data.
When I want to send data I use WSASend(), I also supply the socket handle and the callback function, and when the data is sent (not sure if when placed in the underlying sent buffer or actually placed on the wire), the callback will also be called telling me that data was sent, and I can send the next piece of data.
The one misunderstanding you appear to have is that OVERLAPPED callbacks are actually synchronous.
You said:
When the data is received and filled in the buffer, the callback function will be called
Reality:
When a call is made to an alertable wait function (e.g. SleepEx or MsgWaitForMultipleObjectsEx), if data has been received and filled in the buffer, the callback function will be called
As long as you are aware of that, you should be in good shape. I agree with you that overlapped I/O with callbacks is a great approach in your scenario. Because callbacks occur on the thread performing the I/O, you don't have to worry about synchronizing access from multiple threads, the way you would need to with completion ports and work items on the thread pool.
Oh, also make sure to check for WSA_IO_PENDING, because it's possible for operations to complete synchronously, if there's enough data already buffered (for receive) or enough space in the buffer (for send). In this case the callback will occur, but it is queued for the next alertable wait, it never runs immediately. Certain errors will be reported synchronously also. Others will come to your callback.
Also, it's guaranteed that your callback gets queued exactly once for every operation that returned 0 or WSA_IO_PENDING, whether that operation completes successfully, is cancelled, or with some other error. You can't reuse the buffer until that callback has happened.
The IO completion callback mechanism works fine, I've used it a few times, no problem. In 32-bit systems, you can put the 'this' for the socket-context instance into the hEvent field of the OVERLAPPED struct and retreive it in the callback. Not sure how to do it in 64-bit systems:(

What is the best way to implement an echo server with async i/o and IOCP?

As we all know, an echo server is a server that reads from a socket, and writes that very data into another socket.
Since Windows I/O Completion ports give you different ways to do things, I was wondering what is the best way (the most efficient) to implement an echo server. I'm sure to find someone who tested the ways I will describe here, and can give his/her contribute.
My classes are Stream which abstracts a socket, named pipe, or whatever, and IoRequest which abstracts both an OVERLAPPED structure and the memory buffer to do the I/O (of course, suitable for both reading and writing). In this way when I allocate an IoRequest I'm simply allocating memory for memory buffer for data + OVERLAPPED structure in one shot, so I call malloc() only once.
In addition to this, I also implement fancy and useful things in the IoRequest object, such as an atomic reference counter, and so on.
Said that, let's explore the ways to do the best echo server:
-------------------------------------------- Method A. ------------------------------------------
1) The "reader" socket completes its reading, the IOCP callback returns, and you have an IoRequest just completed with the memory buffer.
2) Let's copy the buffer just received with the "reader" IoRequest to the "writer" IoRequest. (this will involve a memcpy() or whatever).
3) Let's fire again a new reading with ReadFile() in the "reader", with the same IoRequest used for reading.
4) Let's fire a new writing with WriteFile() in the "writer".
-------------------------------------------- Method B. ------------------------------------------
1) The "reader" socket completes its reading, the IOCP callback returns, and you have an IoRequest just completed with the memory buffer.
2) Instead of copying data, pass that IoRequest to the "writer" for writing, without copying data with memcpy().
3) The "reader" now needs a new IoRequest to continue reading, allocate a new one or pass one already allocated before, maybe one just completed for writing before the new writing does happen.
So, in the first case, every Stream objects has its own IoRequest, data is copied with memcpy() or similar functions, and everything works fine.
In the second case the 2 Stream objects do pass IoRequest objects each other, without copying data, but its a little bit more complex, you have to manage the "swapping" of IoRequest objects between the 2 Stream objects, with the possible drawback to get synchronization problems (what about those completions do happen in different threads?)
My questions are:
Q1) Is avoiding copying data really worth it!?
Copying 2 buffers with memcpy() or similar, is very fast, also because the CPU cache is exploited for this very purpose.
Let's consider that with the first method, I have the possibility to echo from a "reader" socket to multiple "writer" sockets, but with the second one I can't do that, since I should create N new IoRequest objects for each N writers, since each WriteFile() needs its own OVERLAPPED structure.
Q2) I guess that when I fire a new N writings for N different sockets with WriteFile(), I have to provide N different OVERLAPPED structure AND N different buffers where to read the data.
Or, I can fire N WriteFile() calls with N different OVERLAPPED taking the data from the same buffer for the N sockets?
Is avoiding copying data really worth it!?
Depends on how much you are copying. 10 bytes, not so much. 10MB, then yes, it's worth avoiding the copying!
In this case, since you already have an object that contains the rx data and an OVERLAPPED block, it seems somewhat pointless to copy it - just reissue it to WSASend(), or whatever.
but with the second one I can't do that
You can, but you need to abstract the 'IORequest' class from a 'Buffer' class. The buffer holds the data, an atomic int reference-count and any other management info for all calls, the IOrequest the OVERLAPPED block and a pointer to the data and any other management information for each call. This information could have an atomic int reference-count for the buffer object.
The IOrequest is the class that is used for each send call. Since it contains only a pointer to the buffer, there is no need to copy the data and so it's reasonably small and O(1) to data size.
When the tx completions come in, the handler threads get the IOrequest, deref the buffer and dec the atomic int in it towards zero. The thread that manages to hit 0 knows that the buffer object is no longer needed and can delete it, (or, more likely, in a high-performance server, repool it for later reuse).
Or, I can fire N WriteFile() calls with N different OVERLAPPED taking
the data from the same buffer for the N sockets?
Yes, you can. See above.
Re. threading - sure, if your 'management data' can be reached from multiple completion-handler threads, then yes, you may want to protect it with a critical-section, but an atomic int should do for the buffer refcount.

Non-blocking TCP buffer issues

I think I'm in a problem. I have two TCP apps connected to each other which use winsock I/O completion ports to send/receive data (non-blocking sockets).
Everything works just fine until there's a data transfer burst. The sender starts sending incorrect/malformed data.
I allocate the buffers I'm sending on the stack, and if I understand correctly, that's a wrong to do, because these buffers should remain as I sent them until I get the "write complete" notification from IOCP.
Take this for example:
void some_function()
{
char cBuff[1024];
// filling cBuff with some data
WSASend(...); // sending cBuff, non-blocking mode
// filling cBuff with other data
WSASend(...); // again, sending cBuff
// ..... and so forth!
}
If I understand correctly, each of these WSASend() calls should have its own unique buffer, and that buffer can be reused only when the send completes.
Correct?
Now, what strategies can I implement in order to maintain a big sack of such buffers, how should I handle them, how can I avoid performance penalty, etc'?
And, if I am to use buffers that means I should copy the data to be sent from the source buffer to the temporary one, thus, I'd set SO_SNDBUF on each socket to zero, so the system will not re-copy what I already copied. Are you with me? Please let me know if I wasn't clear.
Take a serious look at boost::asio. Asynchronous IO is its specialty (just as the name suggests.) It's pretty mature library by now being in Boost since 1.35. Many people use it in production for very intensive networking. There's a wealth of examples in the documentation.
One thing for sure - it take working with buffers very seriously.
Edit:
Basic idea to handling bursts of input is queuing.
Create, say, three linked lists of pre-allocated buffers - one is for free buffers, one for to-be-processed (received) data, one for to-be-sent data.
Every time you need to send something - take a buffer off the free list (allocate a new one if free list is empty), fill with data, put it onto to-be-sent list.
Every time you need to receive something - take a buffer off the free list as above, give it to IO receive routine.
Periodically take buffers off to-be-sent queue, hand them off to send routine.
On send completion (inline or asynchronous) - put them back onto free list.
On receive completion - put buffer onto to-be-processed list.
Have your "business" routine take buffers off to-be-processed list.
The bursts will then fill that input queue until you are able to process them. You might want to limit the queue size to avoid blowing through all the memory.
I don't think it is a good idea to do a second send before the first send is finished.
Similarly, I don't think it is a good idea to change the buffer before the send is finished.
I would be inclined to store the data in some sort of queue. One thread can keep adding data to the queue. The second thread can work in a loop. Do a send and wait for it to finish. If there is more data do another send, else wait for more data.
You would need a critical section (or some such) to nicely share the queue between the threads and possibly an event or a semaphore for the sending thread to wait on if there is no data ready.
Now, what strategies can I implement in order to maintain a big sack of such buffers, how should I handle them, how can I avoid performance penalty, etc'?
It's difficult to know the answer without knowing more about your specific design. In general I'd avoid maintaining your own "sack of buffers" and instead use the OS's built in sack of buffers - the heap.
But in any case, what I would do in the general case is expose an interface to the callers of your code that mirror what WSASend is doing for overlapped i/o. For example, suppose you are providing an interface to send a specific struct:
struct Foo
{
int x;
int y;
};
// foo will be consumed by SendFoo, and deallocated, don't use it after this call
void SendFoo(Foo* foo);
I would require users of SendFoo allocate a Foo instance with new, and tell them that after calling SendFoo the memory is no longer "owned" by their code and they therefore shouldn't use it.
You can enforce this even further with a little trickery:
// After this operation the resultant foo ptr will no longer point to
// memory passed to SendFoo
void SendFoo(Foo*& foo);
This allows the body of SendFoo to send the address of the memory down to WSASend, but modify the passed in pointer to NULL, severing the link between the caller's code and their memory. Of course, you can't really know what the caller is doing with that address, they may have a copy elsewhere.
This interface also enforces that a single block of memory will be used with each WSASend. You are really treading into more than dangerous territory trying to share one buffer between two WSASend calls.