Memory strategy when using IOCP - C++

I am new to Windows IOCP; I am currently rewriting the network part of our server to use it. I am trying to figure out how to handle the memory for the OVERLAPPED / WSABUF objects. I have a small struct that derives from OVERLAPPED and contains a WSABUF and a few other fields.
I tried keeping these objects in a ring buffer and reusing them, but it didn't work. I assumed that when the completion routine is called, one of the objects would be passed back to me and I could mark it as available again.
When I allocate them on the heap (and keep them in a vector), how do I know when it's safe to delete/reuse them?
Thanks,
Michael

Related

What is the best way to implement an echo server with async i/o and IOCP?

As we all know, an echo server is a server that reads data from a socket and writes that same data to another socket.
Since Windows I/O completion ports give you different ways to do things, I was wondering what the best (most efficient) way to implement an echo server is. I'm sure someone has already tested the approaches I describe here and can contribute their findings.
My classes are Stream, which abstracts a socket, named pipe, or whatever, and IoRequest, which abstracts both an OVERLAPPED structure and the memory buffer used for the I/O (suitable for both reading and writing, of course). This way, when I allocate an IoRequest I allocate the data buffer and the OVERLAPPED structure in one shot, so I call malloc() only once.
In addition, I implement useful extras in the IoRequest object, such as an atomic reference counter.
That said, let's explore the ways to implement the echo server:
-------------------------------------------- Method A. ------------------------------------------
1) The "reader" socket completes its reading, the IOCP callback returns, and you have an IoRequest just completed with the memory buffer.
2) Let's copy the buffer just received with the "reader" IoRequest to the "writer" IoRequest. (this will involve a memcpy() or whatever).
3) Let's fire again a new reading with ReadFile() in the "reader", with the same IoRequest used for reading.
4) Let's fire a new writing with WriteFile() in the "writer".
-------------------------------------------- Method B. ------------------------------------------
1) The "reader" socket completes its reading, the IOCP callback returns, and you have an IoRequest just completed with the memory buffer.
2) Instead of copying data, pass that IoRequest to the "writer" for writing, without copying data with memcpy().
3) The "reader" now needs a new IoRequest to continue reading, allocate a new one or pass one already allocated before, maybe one just completed for writing before the new writing does happen.
So, in the first case, every Stream objects has its own IoRequest, data is copied with memcpy() or similar functions, and everything works fine.
In the second case the 2 Stream objects do pass IoRequest objects each other, without copying data, but its a little bit more complex, you have to manage the "swapping" of IoRequest objects between the 2 Stream objects, with the possible drawback to get synchronization problems (what about those completions do happen in different threads?)
My questions are:
Q1) Is avoiding copying data really worth it!?
Copying between two buffers with memcpy() or similar is very fast, partly because the CPU cache works in its favor.
Consider also that with the first method I can echo from one "reader" socket to multiple "writer" sockets, but with the second one I can't, since I would have to create N new IoRequest objects for N writers, because each WriteFile() needs its own OVERLAPPED structure.
Q2) I guess that when I fire N writes to N different sockets with WriteFile(), I have to provide N different OVERLAPPED structures AND N different buffers to read the data from.
Or can I fire N WriteFile() calls with N different OVERLAPPED structures that all take their data from the same buffer?
Is avoiding copying data really worth it!?
Depends on how much you are copying. 10 bytes, not so much. 10MB, then yes, it's worth avoiding the copying!
In this case, since you already have an object that contains the rx data and an OVERLAPPED block, it seems somewhat pointless to copy it - just reissue it to WSASend(), or whatever.
but with the second one I can't do that
You can, but you need to separate the 'IORequest' class from a 'Buffer' class. The Buffer holds the data, an atomic int reference count, and any other management info shared by all calls; the IORequest holds the OVERLAPPED block, a pointer to the Buffer, and any per-call management information.
The IORequest is the object used for each send call. Since it contains only a pointer to the Buffer, there is no need to copy the data, so it stays small and is O(1) with respect to data size.
When the tx completions come in, the handler threads get the IORequest, follow its pointer to the Buffer, and decrement the Buffer's atomic int towards zero. The thread that hits 0 knows that the Buffer object is no longer needed and can delete it (or, more likely, in a high-performance server, repool it for later reuse).
Or can I fire N WriteFile() calls with N different OVERLAPPED structures that all take their data from the same buffer?
Yes, you can. See above.
Re. threading - sure, if your 'management data' can be reached from multiple completion-handler threads, then yes, you may want to protect it with a critical-section, but an atomic int should do for the buffer refcount.
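To make that concrete, here is a rough sketch of the Buffer/IoRequest split described above; the class names, the 4096-byte buffer and the SendToAll() helper are illustrative assumptions, not code from any particular library. The IOCP completion handler would recover the IoRequest from the OVERLAPPED pointer and call OnSendComplete():

#include <winsock2.h>
#include <windows.h>
#include <atomic>

// Buffer: the shared payload plus an atomic reference count, one count per
// outstanding send. IoRequest: the per-call OVERLAPPED plus a pointer back
// to the Buffer it uses.
struct Buffer
{
    char             data[4096];
    DWORD            len = 0;
    std::atomic<int> refs{0};
};

struct IoRequest
{
    OVERLAPPED ov{};       // one per outstanding operation
    WSABUF     wsabuf{};   // points into the shared Buffer
    Buffer*    buffer = nullptr;
};

// Called from the completion handler (or the immediate-failure path below):
// drop one reference; the thread that reaches zero releases the Buffer.
void OnSendComplete(IoRequest* req)
{
    Buffer* buf = req->buffer;
    if (buf->refs.fetch_sub(1, std::memory_order_acq_rel) == 1)
        delete buf;        // or return it to a pool
    delete req;            // the request is per-call, so it is always released
}

// Broadcast one received Buffer to N writer sockets without copying the data.
void SendToAll(const SOCKET* writers, int n, Buffer* buf)
{
    buf->refs.store(n, std::memory_order_relaxed);   // one reference per send
    for (int i = 0; i < n; ++i)
    {
        IoRequest* req = new IoRequest;
        req->buffer     = buf;
        req->wsabuf.buf = buf->data;
        req->wsabuf.len = buf->len;

        if (WSASend(writers[i], &req->wsabuf, 1, nullptr, 0, &req->ov, nullptr) == SOCKET_ERROR
            && WSAGetLastError() != WSA_IO_PENDING)
        {
            OnSendComplete(req);   // this send never started; drop its reference
        }
    }
}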

How can I store marshaled data?

I have an RPC COM interface that passes data from a service to a client. On the client side, I need to store this data temporarily and put it into a queue so that it can be processed later on the UI thread (this needs to be done on the UI thread because that is where the view and view-model objects need to be created).
The structure is a bit complex and contains pointers to other structures and variable length strings.
Question: is there an easy way to grab the full "blob" of marshaled memory for stowing, or do I need to duplicate the same structure and repack it myself so I can process it later on the UI thread? Currently this looks like duplicating the same structs but replacing LPCWSTR with CComBSTR etc., which seems kind of dirty and wasteful to me...
Thanks
Until you need to make sense of the data, you can just treat it as a sequence of bytes. The only thing you need to know is how much data you have. You can then do something like:
std::vector<uint8_t> buf;
buf.resize(length);
memcpy(buf.data(), source, length);
and then at a later point in time, assuming your vector is still around
memcpy(dest, buf.data(), buf.size());
The vector will get freed when it goes out of scope.
The only thing that may be a bit tricky is getting the length. This might need some knowledge of the data being sent, but otherwise there is no need to unpack the data.
You're supposed to marshal the data between your background thread and your UI thread (see CoMarshalInterThreadInterfaceInStream and CoGetInterfaceAndReleaseStream for example). Passing a COM object between threads directly is illegal.
Using the above-mentioned APIs, you would generate the IStream object and queue the corresponding pointer. Then the UI thread would eventually recover the pointer, call the second API, and retrieve the marshalled copy of the object.
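As a rough sketch of how those two calls fit together (shown with IID_IUnknown for brevity; in practice you would pass the IID of the actual data interface your service exposes):

#include <objbase.h>

// Background (RPC callback) thread: wrap the interface pointer in a stream
// and queue the IStream* for the UI thread.
IStream* MarshalForUiThread(IUnknown* pObj)
{
    IStream* pStream = nullptr;
    if (FAILED(CoMarshalInterThreadInterfaceInStream(IID_IUnknown, pObj, &pStream)))
        return nullptr;
    return pStream;   // push this pointer onto your queue
}

// UI thread: pop the IStream* off the queue and turn it back into an
// interface pointer that is safe to call from this apartment.
IUnknown* UnmarshalOnUiThread(IStream* pStream)
{
    IUnknown* pObj = nullptr;
    if (FAILED(CoGetInterfaceAndReleaseStream(pStream, IID_IUnknown,
                                              reinterpret_cast<void**>(&pObj))))
        return nullptr;
    return pObj;      // Release() it when the UI thread is done with it
}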
That all assumes that the data is in a COM object to begin with. If it's a blob copied over the wire by your COM object, then proper semantics dictate that when you requested the data from the COM object, you acquired ownership of it (and responsibility to release/delete it). Then just queue a pointer to the data as-is, and have the UI thread do the release/delete (How? I'm not sure; I have to look it up. Maybe CoTaskMemFree()?)
If it's a more bizarre scenario, please clarify what that is.

Good allocator for cross thread allocation and free

I am planning to write a C++ networked application where:
I use a single thread to accept TCP connections and also to read data from them. I am planning to use epoll/select to do this. The data is written into buffers that are allocated using some arena allocator say jemalloc.
Once there is enough data from a single TCP client to form a protocol message, the data is published on a ring buffer. The ring buffer structures contain the fd for the connection and a pointer to the buffer containing the relevant data.
A worker thread processes entries from the ring buffers and sends some result data to the client. After processing each event, the worker thread frees the actual data buffer to return it to the arena allocator for reuse.
I am leaving out details on how the publisher makes data written by it visible to the worker thread.
So my question is: are there any allocators that optimize for this kind of behavior, i.e., allocating objects on one thread and freeing them on another?
I am specifically worried about having to take locks to return memory to an arena that is not the thread-affinitized arena. I am also worried about false sharing, since the producer thread and the worker thread will both write to the same region. It seems that neither jemalloc nor tcmalloc optimizes for this.
Before you go down the path of implementing a highly optimized allocator for your multi-threaded application, you should first just use the standard new and delete operators for your implementation. After you have a correct implementation of your application, you can move to address bottlenecks that are discovered through profiling it.
If you get to the stage where it is obvious that the standard new and delete allocators are a bottleneck to the application, the following is the approach I have used:
Assumption: The number of threads are fixed and are statically created.
Each thread has their own arena.
Each object taken from an arena has a reference back to the arena it came from.
Each arena has a separate garbage list for each thread.
When a thread frees an object, it goes back to the arena it came from, but it is placed in the thread-specific garbage list.
The thread that actually owns the arena treats its garbage list as the real free list.
Periodically, the thread that owns an arena performs a garbage collection pass to fold objects from the other thread garbage lists into the real free list.
The "periodical" garbage collection pass doesn't necessarily have to be time based. A subset of the garbage could be reaped on every allocation and free, for example.
The best way to deal with memory allocation and deallocation issues is to not deal with them.
You mention a ring buffer. Those are usually a fixed size. If you can come up with a fixed maximum size for your protocol messages you can allocate all the memory you will ever need at program start. When deallocating, keep the memory but reset it to a fresh state.
Now, your program may need to allocate and deallocate memory while dealing with each message but that will be done in each thread and cross-thread issues will not come into play.
This can work even if your maximum message size is too large to preallocate: allocate the amount of memory that most messages will use, and have handlers that allocate more when necessary.
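For example, a minimal pre-allocated pool along those lines might look like this; the Message layout, the 2048-byte limit and the pool depth are placeholder choices, not values from the question:

#include <cstddef>
#include <mutex>
#include <vector>

// Every Message that will ever exist is created in the constructor; Release()
// just resets a slot and puts it back on the free list.
struct Message
{
    int         fd  = -1;
    std::size_t len = 0;
    char        data[2048];
};

class MessagePool
{
public:
    explicit MessagePool(std::size_t count) : storage_(count)
    {
        for (Message& m : storage_)
            free_.push_back(&m);
    }

    Message* Acquire()                  // called by the reader thread
    {
        std::lock_guard<std::mutex> lock(lock_);
        if (free_.empty())
            return nullptr;             // pool exhausted; apply back-pressure
        Message* m = free_.back();
        free_.pop_back();
        return m;
    }

    void Release(Message* m)            // called by the worker thread
    {
        m->fd  = -1;                    // "reset it to a fresh state"
        m->len = 0;
        std::lock_guard<std::mutex> lock(lock_);
        free_.push_back(m);
    }

private:
    std::vector<Message>  storage_;     // allocated once, freed only at exit
    std::vector<Message*> free_;
    std::mutex            lock_;
};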

IO Completion Ports and OVERLAPPED management

How does Win32 manage instances of the OVERLAPPED struct in the context of these two functions:
GetQueuedCompletionStatus
PostQueuedCompletionStatus
When I call GetQueuedCompletionStatus, does Win32 free the OVERLAPPED instance, or must I do it myself?
When I send data with PostQueuedCompletionStatus, does Win32 copy it into internal structures? When must I free the memory for the sent data?
Where can I find a diagram of how OVERLAPPED data moves between GetQueuedCompletionStatus, PostQueuedCompletionStatus and the IOCP queue?
The OVERLAPPED structure must exist from the moment a successful overlapped I/O operation is started (or PostQueuedCompletionStatus() is called manually) until that OVERLAPPED emerges from a call to GetQueuedCompletionStatus().
You are responsible for the lifetime of the structure.
You'll see from the MSDN docs that GetQueuedCompletionStatus() actually takes "a pointer to a variable that receives the address of the OVERLAPPED structure that was specified when the completed I/O operation was started.". What you actually get out of that call is a pointer to the original OVERLAPPED that you passed when you made the PostQueuedCompletionStatus() call (or initiated an overlapped I/O operation).
This is all actually very useful as the "normal" way to use the OVERLAPPED structure is to place it inside a larger structure which holds all of the 'per operation' information that you might need - so it's the ideal way to navigate directly from the limited information that you're given when you call GetQueuedCompletionStatus() to, for example, the data buffer that you used in your overlapped read call...
I find the best way to deal with OVERLAPPED structures is to a) embed them in the buffer you're using for the read/write, b) reference-count them, and c) return them to a pool for reuse when the ref count drops to 0.
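For illustration, a minimal version of that "extended OVERLAPPED" pattern might look like this; the PerIoData layout is an assumption for the sketch, not the structure from the code linked below:

#include <winsock2.h>
#include <windows.h>

// PerIoData is the larger per-operation structure; because the OVERLAPPED is
// its first member, the pointer handed back by GetQueuedCompletionStatus()
// can be cast straight back to it (CONTAINING_RECORD works too).
struct PerIoData
{
    OVERLAPPED ov;                     // must stay alive until dequeued
    WSABUF     wsabuf;
    char       buffer[4096];
    enum Op { OpRead, OpWrite } op;
};

void CompletionLoop(HANDLE iocp)
{
    for (;;)
    {
        DWORD       bytes = 0;
        ULONG_PTR   key   = 0;
        OVERLAPPED* pov   = nullptr;

        if (!GetQueuedCompletionStatus(iocp, &bytes, &key, &pov, INFINITE))
        {
            if (!pov)
                continue;              // the call itself failed; nothing was dequeued
            // otherwise a failed I/O completed and pov is still ours to handle
        }

        // The very pointer we passed to WSARecv/WSASend/PostQueuedCompletionStatus.
        PerIoData* io = reinterpret_cast<PerIoData*>(pov);

        // ... handle io->op and the first 'bytes' bytes of io->buffer ...
        // Only after this point is it safe to reuse, repool or delete 'io'.
    }
}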
I have some source code that you could download (here) which may make this a little easier to understand (it's a full IOCP server example so it's a little complex, but it works and shows how these things can be used).
You should pass the address of an OVERLAPPED* to GetQueuedCompletionStatus. This gets filled in with the value passed to PostQueuedCompletionStatus.
You should not free this data in the PostQueuedCompletionStatus context. It should be done by the context using GetQueuedCompletionStatus. (Assuming it was allocated dynamically in the first place - there is no requirement that it is a dynamically allocated structure, it could be taken out of a fixed pool, or be allocated on the stack of a function that doesn't return until it has been signalled that the operation is complete).
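As a small illustration of that ownership hand-off when you post your own completions (PostedWork is a made-up example structure, not part of any API):

#include <windows.h>

// The poster allocates the structure; whoever dequeues it owns it from then on.
struct PostedWork
{
    OVERLAPPED ov{};                   // only its address travels through the port
    int        payload = 0;
};

bool PostWork(HANDLE iocp, int payload)
{
    PostedWork* work = new PostedWork;
    work->payload = payload;
    if (!PostQueuedCompletionStatus(iocp, 0, 0, &work->ov))
    {
        delete work;                   // never entered the queue, so we still own it
        return false;
    }
    return true;                       // the GetQueuedCompletionStatus side frees it
}

// Consumer side, after GetQueuedCompletionStatus() hands back &work->ov:
//     PostedWork* work = reinterpret_cast<PostedWork*>(pov);
//     ... use work->payload ...
//     delete work;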
I'm not sure there is such a picture.

Non-blocking TCP buffer issues

I think I'm in a problem. I have two TCP apps connected to each other which use winsock I/O completion ports to send/receive data (non-blocking sockets).
Everything works just fine until there's a data transfer burst. The sender starts sending incorrect/malformed data.
I allocate the buffers I'm sending on the stack, and if I understand correctly, that's the wrong thing to do, because these buffers must remain valid and unchanged until I get the "write complete" notification from IOCP.
Take this for example:
void some_function()
{
    char cBuff[1024];

    // filling cBuff with some data
    WSASend(...); // sending cBuff, non-blocking mode

    // filling cBuff with other data
    WSASend(...); // again, sending cBuff

    // ..... and so forth!
}
If I understand correctly, each of these WSASend() calls should have its own unique buffer, and that buffer can be reused only when the send completes.
Correct?
Now, what strategies can I implement in order to maintain a big sack of such buffers, how should I handle them, how can I avoid a performance penalty, etc.?
And if I am to use such buffers, that means I should copy the data to be sent from the source buffer into the temporary one; in that case I'd set SO_SNDBUF on each socket to zero so the system will not re-copy what I already copied. Are you with me? Please let me know if I wasn't clear.
Take a serious look at boost::asio. Asynchronous IO is its specialty (just as the name suggests). It's a pretty mature library by now, having been in Boost since 1.35. Many people use it in production for very intensive networking. There's a wealth of examples in the documentation.
One thing is for sure - it takes working with buffers very seriously.
Edit:
The basic idea for handling bursts of input is queuing.
Create, say, three linked lists of pre-allocated buffers - one for free buffers, one for to-be-processed (received) data, and one for to-be-sent data.
Every time you need to send something - take a buffer off the free list (allocate a new one if the free list is empty), fill it with data, and put it onto the to-be-sent list.
Every time you need to receive something - take a buffer off the free list as above and give it to the IO receive routine.
Periodically take buffers off the to-be-sent queue and hand them off to the send routine.
On send completion (inline or asynchronous) - put them back onto the free list.
On receive completion - put the buffer onto the to-be-processed list.
Have your "business" routine take buffers off the to-be-processed list.
The bursts will then fill that input queue until you are able to process them. You might want to limit the queue size to avoid blowing through all the memory.
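A condensed sketch of those three lists and the hand-offs between them; the buffer layout, the queue types and the single shared lock are simplifications of what a real IOCP server would use:

#include <cstddef>
#include <deque>
#include <mutex>

struct Buffer
{
    char        data[4096];
    std::size_t len = 0;
};

struct BufferQueues
{
    std::deque<Buffer*> freeList;       // ready for reuse
    std::deque<Buffer*> toBeSent;       // filled, waiting for the send routine
    std::deque<Buffer*> toBeProcessed;  // received, waiting for business logic
    std::mutex          lock;           // one lock per list in a real design
};

Buffer* TakeFree(BufferQueues& q)       // used before every send or receive
{
    std::lock_guard<std::mutex> guard(q.lock);
    if (q.freeList.empty())
        return new Buffer;              // grow only when the free list runs dry
    Buffer* b = q.freeList.front();
    q.freeList.pop_front();
    return b;
}

void QueueForSend(BufferQueues& q, Buffer* b)       // producer fills b, then queues it
{
    std::lock_guard<std::mutex> guard(q.lock);
    q.toBeSent.push_back(b);
}

void OnSendComplete(BufferQueues& q, Buffer* b)     // completion: back to the free list
{
    b->len = 0;
    std::lock_guard<std::mutex> guard(q.lock);
    q.freeList.push_back(b);
}

void OnReceiveComplete(BufferQueues& q, Buffer* b)  // completion: hand to business logic
{
    std::lock_guard<std::mutex> guard(q.lock);
    q.toBeProcessed.push_back(b);
}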
I don't think it is a good idea to do a second send before the first send is finished.
Similarly, I don't think it is a good idea to change the buffer before the send is finished.
I would be inclined to store the data in some sort of queue. One thread can keep adding data to the queue. The second thread can work in a loop. Do a send and wait for it to finish. If there is more data do another send, else wait for more data.
You would need a critical section (or some such) to nicely share the queue between the threads and possibly an event or a semaphore for the sending thread to wait on if there is no data ready.
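For example, a minimal version of that handover, using a standard mutex and condition variable in place of a Win32 critical section and event; the payload type and the blocking SendAndWait() call are placeholders:

#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

struct SendQueue
{
    std::queue<std::string> pending;
    std::mutex              lock;
    std::condition_variable dataReady;
    bool                    shuttingDown = false;
};

void Produce(SendQueue& q, std::string data)        // called from any thread
{
    {
        std::lock_guard<std::mutex> guard(q.lock);
        q.pending.push(std::move(data));
    }
    q.dataReady.notify_one();
}

void SenderLoop(SendQueue& q)                        // the dedicated sending thread
{
    for (;;)
    {
        std::string next;
        {
            std::unique_lock<std::mutex> guard(q.lock);
            q.dataReady.wait(guard, [&q] { return !q.pending.empty() || q.shuttingDown; });
            if (q.shuttingDown)
                return;
            next = std::move(q.pending.front());
            q.pending.pop();
        }
        // SendAndWait(next);  // placeholder: issue the send, wait until it completes
    }
}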
Now, what strategies can I implement in order to maintain a big sack of such buffers, how should I handle them, how can I avoid a performance penalty, etc.?
It's difficult to know the answer without knowing more about your specific design. In general I'd avoid maintaining your own "sack of buffers" and instead use the OS's built-in sack of buffers - the heap.
But in any case, what I would do in the general case is expose an interface to the callers of your code that mirrors what WSASend does for overlapped I/O. For example, suppose you are providing an interface to send a specific struct:
struct Foo
{
    int x;
    int y;
};

// foo will be consumed by SendFoo and deallocated; don't use it after this call
void SendFoo(Foo* foo);
I would require users of SendFoo to allocate a Foo instance with new, and tell them that after calling SendFoo the memory is no longer "owned" by their code and they therefore shouldn't use it.
You can enforce this even further with a little trickery:
// After this operation the resultant foo ptr will no longer point to
// memory passed to SendFoo
void SendFoo(Foo*& foo);
This allows the body of SendFoo to send the address of the memory down to WSASend, but set the passed-in pointer to NULL, severing the link between the caller's code and the memory. Of course, you can't really know what the caller has done with that address; they may have a copy of it elsewhere.
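A sketch of how that ownership-transferring overload might be written and used (the body shown here is only illustrative; the WSASend plumbing is indicated in comments):

// SendFoo takes over the Foo defined above and severs the caller's pointer.
void SendFoo(Foo*& foo)
{
    Foo* owned = foo;
    foo = nullptr;                     // the caller can no longer touch the memory
    // ... wrap 'owned' in a per-operation OVERLAPPED context, WSASend() it,
    // and delete it from the completion handler once the send finishes ...
}

void Caller()
{
    Foo* f = new Foo{1, 2};
    SendFoo(f);                        // after this call, f == nullptr
}

In modern C++ the same intent is usually expressed by taking a std::unique_ptr<Foo> by value, which documents the ownership transfer in the signature itself.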
This interface also enforces that a single block of memory will be used with each WSASend. You are treading into dangerous territory if you try to share one buffer between two WSASend calls.