Non-blocking TCP buffer issues

I think I have a problem. I have two TCP apps connected to each other that use Winsock I/O completion ports to send and receive data (non-blocking sockets).
Everything works fine until there's a burst of data transfer; then the sender starts sending incorrect or malformed data.
I allocate the buffers I'm sending on the stack and, if I understand correctly, that's the wrong thing to do, because these buffers should remain exactly as I sent them until I get the "write complete" notification from IOCP.
Take this for example:
void some_function()
{
    char cBuff[1024];
    // filling cBuff with some data
    WSASend(...); // sending cBuff, non-blocking mode
    // filling cBuff with other data
    WSASend(...); // again, sending cBuff
    // ..... and so forth!
}
If I understand correctly, each of these WSASend() calls should have its own unique buffer, and that buffer can be reused only when the send completes.
Correct?
Now, what strategies can I implement to maintain a big sack of such buffers, how should I handle them, how can I avoid a performance penalty, etc.?
And if I use such buffers, I have to copy the data to be sent from the source buffer to a temporary one, so I'd set SO_SNDBUF on each socket to zero so that the system does not re-copy what I have already copied. Are you with me? Please let me know if I wasn't clear.

Take a serious look at boost::asio. Asynchronous I/O is its specialty (just as the name suggests). It's a pretty mature library by now, having been in Boost since 1.35. Many people use it in production for very intensive networking. There's a wealth of examples in the documentation.
One thing is for sure: it takes working with buffers very seriously.
Edit:
The basic idea for handling bursts of input is queuing.
Create, say, three linked lists of pre-allocated buffers - one is for free buffers, one for to-be-processed (received) data, one for to-be-sent data.
Every time you need to send something - take a buffer off the free list (allocate a new one if free list is empty), fill with data, put it onto to-be-sent list.
Every time you need to receive something - take a buffer off the free list as above, give it to IO receive routine.
Periodically take buffers off to-be-sent queue, hand them off to send routine.
On send completion (inline or asynchronous) - put them back onto free list.
On receive completion - put buffer onto to-be-processed list.
Have your "business" routine take buffers off to-be-processed list.
The bursts will then fill that input queue until you are able to process them. You might want to limit the queue size to avoid blowing through all the memory.
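The free-list part of this scheme could look roughly like the sketch below. It is a minimal, single-threaded illustration rather than a definitive implementation: the names `Buffer` and `BufferPool` and the 1024-byte reservation are assumptions made for the example, and a real IOCP server would need to guard the pool with a lock because completions arrive on several threads.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical buffer type; the 1024-byte reservation mirrors cBuff in the question.
struct Buffer {
    std::vector<char> data;
    Buffer() { data.reserve(1024); }
};

// Hypothetical free list of pre-allocated buffers (not thread-safe as written).
class BufferPool {
public:
    explicit BufferPool(std::size_t preallocated) {
        for (std::size_t i = 0; i < preallocated; ++i)
            free_.push_back(std::unique_ptr<Buffer>(new Buffer()));
    }

    // Take a buffer off the free list, allocating a new one if the list is empty.
    std::unique_ptr<Buffer> acquire() {
        if (free_.empty())
            return std::unique_ptr<Buffer>(new Buffer());
        std::unique_ptr<Buffer> buf = std::move(free_.back());
        free_.pop_back();
        return buf;
    }

    // Put a buffer back once its send has completed or its data has been processed.
    void release(std::unique_ptr<Buffer> buf) {
        assert(buf && "release() expects a buffer obtained from acquire()");
        buf->data.clear();  // drops the contents, keeps the allocated capacity
        free_.push_back(std::move(buf));
    }

    std::size_t free_count() const { return free_.size(); }

private:
    std::vector<std::unique_ptr<Buffer>> free_;
};
```

The to-be-sent and to-be-processed lists would then simply be queues of the same `std::unique_ptr<Buffer>` handles, so a buffer is always owned by exactly one list or one in-flight operation.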

I don't think it is a good idea to do a second send before the first send is finished.
Similarly, I don't think it is a good idea to change the buffer before the send is finished.
I would be inclined to store the data in some sort of queue. One thread can keep adding data to the queue. The second thread can work in a loop. Do a send and wait for it to finish. If there is more data do another send, else wait for more data.
You would need a critical section (or some such) to nicely share the queue between the threads and possibly an event or a semaphore for the sending thread to wait on if there is no data ready.
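A shared queue of this kind might be sketched as follows. This is an illustration under assumptions, not the answerer's code: it uses the standard C++ `std::mutex` and `std::condition_variable` in place of a Win32 critical section and event, and the names `SendQueue`, `push`, `pop` and `close` are hypothetical.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical queue shared between a producer thread and a sending thread.
class SendQueue {
public:
    // Called by the thread that produces data; never waits on the network.
    void push(std::string data) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(!closed_ && "push() after close()");
            items_.push_back(std::move(data));
        }
        ready_.notify_one();
    }

    // Tells the sending thread that no more data will arrive.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

    // Called by the sending thread. Blocks until data is available.
    // Returns false once the queue has been closed and drained.
    bool pop(std::string& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !items_.empty(); });
        if (items_.empty())
            return false;
        out = std::move(items_.front());
        items_.pop_front();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<std::string> items_;
    bool closed_ = false;
};
```

The sending thread would loop on `pop()`, issue one send per item, and wait for that send to finish before calling `pop()` again, so each buffer stays untouched while its send is in flight.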

Now, what strategies can I implement to maintain a big sack of such buffers, how should I handle them, how can I avoid a performance penalty, etc.?
It's difficult to know the answer without knowing more about your specific design. In general I'd avoid maintaining your own "sack of buffers" and instead use the OS's built-in sack of buffers: the heap.
But in any case, what I would do in the general case is expose an interface to the callers of your code that mirrors what WSASend does for overlapped I/O. For example, suppose you are providing an interface to send a specific struct:
struct Foo
{
    int x;
    int y;
};
// foo will be consumed by SendFoo, and deallocated, don't use it after this call
void SendFoo(Foo* foo);
I would require that users of SendFoo allocate a Foo instance with new, and tell them that after calling SendFoo the memory is no longer "owned" by their code and they therefore shouldn't use it.
You can enforce this even further with a little trickery:
// After this operation the resultant foo ptr will no longer point to
// memory passed to SendFoo
void SendFoo(Foo*& foo);
This allows the body of SendFoo to send the address of the memory down to WSASend, but modify the passed in pointer to NULL, severing the link between the caller's code and their memory. Of course, you can't really know what the caller is doing with that address, they may have a copy elsewhere.
This interface also enforces that a single block of memory will be used with each WSASend. You are really treading into more than dangerous territory trying to share one buffer between two WSASend calls.
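In C++11 and later, the same "the callee now owns this memory" contract can be expressed with `std::unique_ptr` instead of the `Foo*&` trick. The sketch below is an alternative under assumptions, not the interface proposed above: `g_in_flight` is a hypothetical stand-in for wherever the real code would keep a Foo alive until its WSASend completion arrives, and no actual socket call is made.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

struct Foo
{
    int x;
    int y;
};

// Hypothetical stand-in for the set of sends still in flight. Real code would
// pass the Foo's bytes to WSASend and release the Foo in the completion handler.
std::vector<std::unique_ptr<Foo>> g_in_flight;

// Taking the unique_ptr by value forces the caller to give up ownership.
void SendFoo(std::unique_ptr<Foo> foo)
{
    assert(foo && "SendFoo expects a non-null Foo");
    g_in_flight.push_back(std::move(foo));
}
```

A caller writes `SendFoo(std::move(foo));`, after which its own `foo` is guaranteed to be null. As with the raw-pointer version, this cannot stop a caller from having stashed a raw `Foo*` somewhere else beforehand.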

Related

Memory strategy when using IOCP

I am new to Windows IOCP and am currently rewriting the network part of our server using it. I am trying to figure out how to handle the memory (OVERLAPPED / WSABUF objects). I have a little struct that derives from OVERLAPPED and contains a WSABUF and a few other fields.
I tried keeping a ring buffer of these objects and reusing them, but it didn't work. I thought that when the completion routine is called, one of the objects would be passed back to me and I could mark it as available again.
When I allocate them on the heap (and keep them in a vector), how do I know when it's safe to delete or reuse them?
Thanks,
Michael

What is the best way to implement an echo server with async i/o and IOCP?

As we all know, an echo server is a server that reads from a socket, and writes that very data into another socket.
Since Windows I/O completion ports give you different ways to do things, I was wondering what the best (most efficient) way to implement an echo server is. I'm sure someone here has tested the approaches I describe below and can contribute their experience.
My classes are Stream, which abstracts a socket, named pipe, or whatever, and IoRequest, which abstracts both an OVERLAPPED structure and the memory buffer used for the I/O (suitable, of course, for both reading and writing). This way, when I allocate an IoRequest, I allocate the data buffer and the OVERLAPPED structure in one shot, so I call malloc() only once.
In addition to this, I also implement fancy and useful things in the IoRequest object, such as an atomic reference counter, and so on.
That said, let's explore the ways to build the best echo server:
-------------------------------------------- Method A. ------------------------------------------
1) The "reader" socket completes its reading, the IOCP callback returns, and you have an IoRequest just completed with the memory buffer.
2) Let's copy the buffer just received with the "reader" IoRequest to the "writer" IoRequest. (this will involve a memcpy() or whatever).
3) Let's fire again a new reading with ReadFile() in the "reader", with the same IoRequest used for reading.
4) Let's fire a new writing with WriteFile() in the "writer".
-------------------------------------------- Method B. ------------------------------------------
1) The "reader" socket completes its reading, the IOCP callback returns, and you have an IoRequest just completed with the memory buffer.
2) Instead of copying data, pass that IoRequest to the "writer" for writing, without copying data with memcpy().
3) The "reader" now needs a new IoRequest to continue reading, allocate a new one or pass one already allocated before, maybe one just completed for writing before the new writing does happen.
So, in the first case, every Stream object has its own IoRequest, data is copied with memcpy() or a similar function, and everything works fine.
In the second case the two Stream objects pass IoRequest objects to each other without copying data, but it's a little more complex: you have to manage the "swapping" of IoRequest objects between the two Stream objects, with the possible drawback of synchronization problems (what if those completions happen in different threads?).
My questions are:
Q1) Is avoiding copying data really worth it!?
Copying 2 buffers with memcpy() or similar, is very fast, also because the CPU cache is exploited for this very purpose.
Consider that with the first method I can echo from a "reader" socket to multiple "writer" sockets, but with the second one I can't do that, since I would have to create N new IoRequest objects for the N writers, because each WriteFile() needs its own OVERLAPPED structure.
Q2) I guess that when I fire N new writes for N different sockets with WriteFile(), I have to provide N different OVERLAPPED structures AND N different buffers to read the data from.
Or, I can fire N WriteFile() calls with N different OVERLAPPED taking the data from the same buffer for the N sockets?

Is avoiding copying data really worth it!?
Depends on how much you are copying. 10 bytes, not so much. 10MB, then yes, it's worth avoiding the copying!
In this case, since you already have an object that contains the rx data and an OVERLAPPED block, it seems somewhat pointless to copy it - just reissue it to WSASend(), or whatever.
but with the second one I can't do that
You can, but you need to separate the 'IoRequest' class from a 'Buffer' class. The Buffer holds the data, an atomic int reference count and any other management info shared by all calls; the IoRequest holds the OVERLAPPED block, a pointer to the Buffer and any other management information for each call. This information could have an atomic int reference-count for the buffer object.
The IoRequest is the class that is used for each send call. Since it contains only a pointer to the buffer, there is no need to copy the data, so it's reasonably small and its size is O(1) with respect to the data size.
When the tx completions come in, the handler threads get the IoRequest, dereference the buffer pointer and decrement the atomic int in it towards zero. The thread that manages to hit 0 knows that the buffer object is no longer needed and can delete it (or, more likely in a high-performance server, repool it for later reuse).
Or, I can fire N WriteFile() calls with N different OVERLAPPED taking
the data from the same buffer for the N sockets?
Yes, you can. See above.
Re. threading - sure, if your 'management data' can be reached from multiple completion-handler threads, then yes, you may want to protect it with a critical-section, but an atomic int should do for the buffer refcount.
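The reference-counting scheme described in this answer can be sketched in portable C++ as below. It is only an illustration under assumptions: the names `SharedBuffer`, `make_requests` and `on_send_complete` are hypothetical, the OVERLAPPED block that a real per-send request would carry is left out so that the example compiles anywhere, and no socket call is made.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Data shared by all N sends, plus the count of sends still outstanding.
struct SharedBuffer {
    std::vector<char> data;
    std::atomic<int> refs{0};
};

// Per-send object. In a real IOCP server this would also hold the OVERLAPPED
// block for the call; here it holds only the pointer to the shared data.
struct IoRequest {
    SharedBuffer* buffer = nullptr;
};

// Prepare one small request per writer socket, all pointing at the same data.
std::vector<IoRequest> make_requests(SharedBuffer& buf, std::size_t writers) {
    buf.refs.store(static_cast<int>(writers));
    std::vector<IoRequest> requests(writers);
    for (IoRequest& r : requests)
        r.buffer = &buf;
    return requests;
}

// Would be called from a completion-handler thread when one send finishes.
// Returns true only for the caller that dropped the last reference; that
// caller is the one allowed to delete or repool the buffer.
bool on_send_complete(IoRequest& req) {
    assert(req.buffer != nullptr);
    SharedBuffer* buf = req.buffer;
    req.buffer = nullptr;
    return buf->refs.fetch_sub(1, std::memory_order_acq_rel) == 1;
}
```

Because `fetch_sub` returns the previous value atomically, exactly one completion sees 1 and becomes responsible for cleanup, without any lock around the counter.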

How can I store marshaled data?

I have an rpc com interface which passes data from a service to a client. On the client side, I need to store this data temporarily and put it into a queue so that it can be processed later on the ui thread (this needs to be done on the ui thread because that is where the view and view model objects need to be created).
The structure is a bit complex and contains pointers to other structures and variable length strings.
Question - is there an easy way to grab the full "blob" of marshaled memory for stowing or do I need to duplicate the same structure and repack it myself so I can later process it on the ui thread? Currently this looks like duplicating the same structs but replacing LPCWSTR with CComBSTR etc etc which seems kind of dirty and wasteful to me...
Thanks

Until you need to make sense of the data, you can just treat it as a sequence of bytes. The only thing that you need to know is just how much data you have. You can then do something like:
std::vector<uint8_t> buf;
buf.resize(length);
memcpy(&buf[0], source, length);
and then at a later point in time, assuming your vector is still around
memcpy(dest, &buf[0], buf.size());
The vector will get freed when it goes out of scope.
The only thing that can be a bit tricky may be getting the length. This might need some knowledge of the data sent but otherwise there is no need to unpack the data.

You're supposed to marshal the data between your background thread and your UI thread (see CoMarshalInterThreadInterfaceInStream and CoGetInterfaceAndReleaseStream for example). Passing a COM object between threads directly is illegal.
Using the above-mentioned APIs, you would generate the IStream object and queue the corresponding pointer. Then the UI thread would eventually recover the pointer, call the second API, and retrieve the marshaled copy of the object.
That all assumes that the data is in a COM object to begin with. If it's a blob copied over the wire by your COM object, then proper semantics dictate that when you requested the data from the COM object, you acquired ownership of it (and responsibility to release/delete it). Then just queue a pointer to the data as-is, and have the UI thread do the release/delete (How? I'm not sure; I have to look it up. Maybe CoTaskMemFree()?)
If it's a more bizarre scenario, please clarify what that is.

IO Completion Ports and OVERLAPPED management

How does Win32 manage instances of the OVERLAPPED struct in the context of these two functions:
GetQueuedCompletionStatus
PostQueuedCompletionStatus
When I call GetQueuedCompletionStatus, does Win32 free the instance of the OVERLAPPED struct, or must I do it myself?
When I send data with PostQueuedCompletionStatus, does Win32 copy it to internal structs? When must I free the memory of the sent data?
Where can I find a picture showing how OVERLAPPED data is processed between GetQueuedCompletionStatus, PostQueuedCompletionStatus and the IOCP queue?

The OVERLAPPED structure must exist from when a successful I/O operation (or manual PostQueuedCompletionStatus()) executes until the OVERLAPPED emerges from a call to GetQueuedCompletionStatus().
You are responsible for the lifetime of the structure.
You'll see from the MSDN docs that GetQueuedCompletionStatus() actually takes "a pointer to a variable that receives the address of the OVERLAPPED structure that was specified when the completed I/O operation was started.". What you actually get out of that call is a pointer to the original OVERLAPPED that you passed when you made the PostQueuedCompletionStatus() call (or initiated an overlapped I/O operation).
This is all actually very useful as the "normal" way to use the OVERLAPPED structure is to place it inside a larger structure which holds all of the 'per operation' information that you might need - so it's the ideal way to navigate directly from the limited information that you're given when you call GetQueuedCompletionStatus() to, for example, the data buffer that you used in your overlapped read call...
I find the best way to deal with OVERLAPPED structures is to a) embed them in the buffer you're using for read/write b) reference count them and c) return them to a pool for reuse when the ref count drops to 0.
I have some source code that you could download (here) which may make this a little easier to understand (it's a full IOCP server example so it's a little complex, but it works and shows how these things can be used).
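The idea of embedding the OVERLAPPED structure inside a larger per-operation structure, and navigating back from the pointer that GetQueuedCompletionStatus() returns, can be sketched as follows. To keep the example portable it uses `OverlappedStandIn`, a hypothetical placeholder for the real Win32 OVERLAPPED type; `PerIoData` and `from_overlapped` are likewise names invented for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Placeholder for the Win32 OVERLAPPED struct (assumption: only its address
// matters for this illustration, not its fields).
struct OverlappedStandIn {
    void* reserved[4];
};

// Hypothetical per-operation structure: the overlapped block comes first,
// followed by whatever the completion handler needs to find again.
struct PerIoData {
    OverlappedStandIn overlapped;
    char buffer[1024];
    std::size_t length;
};

// The cast below relies on the overlapped block being the first member of a
// standard-layout struct, so both share the same address.
static_assert(std::is_standard_layout<PerIoData>::value,
              "PerIoData must be standard-layout");
static_assert(offsetof(PerIoData, overlapped) == 0,
              "overlapped must be the first member");

// Recover the per-operation data from the pointer handed back on completion.
PerIoData* from_overlapped(OverlappedStandIn* ov) {
    assert(ov != nullptr);
    return reinterpret_cast<PerIoData*>(ov);
}
```

With the real API, the address of the `overlapped` member is what would be passed when starting the I/O, and the same address comes back from GetQueuedCompletionStatus(), so the structure must stay alive until then.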

You should pass the address of an OVERLAPPED * to GetQueuedCompletionStatus. This gets filled in with the value passed to PostQueuedCompletionStatus.
You should not free this data in the PostQueuedCompletionStatus context. It should be done by the context using GetQueuedCompletionStatus. (Assuming it was allocated dynamically in the first place - there is no requirement that it is a dynamically allocated structure, it could be taken out of a fixed pool, or be allocated on the stack of a function that doesn't return until it has been signalled that the operation is complete).
I'm not sure there is such a picture.

How should I control multithreaded access to several queues in the following situation?

I'm working on a multithreaded project in C++ that sends data to a series of network connections. Here's some pseudocode that illustrates what's going on:
class NetworkManager
{
    Thread writer; // responsible for writing data in queues to the network
    Queue[] outqueue; // holds data until the network is ready to receive it
    Network[] nets; // sockets or whatever
    Mutex[] outlock; // protects access to members of outqueue
    Mutex managerlock; // protects access to all queues
    Condition notifier; // blocks the write thread when there is no data
}
In reality it's a whole lot more complicated than that, but I've axed a lot of unnecessary details. One important detail is that the networking is rate-limited, and the ability of the program to queue data independently from sending it is a feature of the design (the program should not have to wait to process new data because it's blocking on a network write).
Here's a brief description of how the program is expected to interact with this class. Note that QueueWriteToNetwork and DoAdministrativeStuff are, in my implementation, managed by THE SAME external thread.
QueueWriteToNetwork(network, data) // responsibility of external thread
    Let i = the index of the network to send to
    Lock(outlock[i])
    outqueue[i].Add(data)
    Unlock(outlock[i])
    Signal(notifier)
DoAdministrativeStuff(network, more) // responsibility of external thread
    Lock(managerlock)
    more.Process() // might do any of the following:
        // connect or disconnect networks
        // add or remove networks from list
        // immediately write data to network, bypassing rate limiting
        // other things that I forgot
    Unlock(managerlock)
WriterThreadMain() // responsibility of internal write thread
    Lock(managerlock)
    Loop forever:
        Check for data in every queue (locking and unlocking each queue)
        If all queues have no data to write:
            Wait(notifier, managerlock)
            continue
        If outqueue[i] has data ready to write
            Lock(outlock[i])
            Send data from outqueue[i]
            outqueue[i].Pop()
            Unlock(outlock[i])
As you might be able to see, there are a few issues with this approach (for example, if a write is queued to the network with QueueWriteToNetwork as WriterThreadMain is checking if the queues are empty, the call to Signal(notifier) could potentially be dropped, and the write queue could remain waiting even though there was data ready).
I need to phrase this in such a way that the following are possible:
Adding data to a write queue does not block, or blocks for only a reasonably short time (specifically, it does not block for the duration of a network write that's in progress)
The DoAdministrativeStuff function must have the ability to ensure that the writer thread is blocked in a safe state (i.e. not accessing any queue, queue lock, or network)
I've explored the possibility of using a semaphore to track the number of items in write queues. This would solve the lost-update problem I mentioned earlier.
Finally, I'm targeting Linux (using Posix libraries to provide the types pthread_t, pthread_mutex_t, pthread_cond_t, and sem_t), and I don't care about compatibility with Windows. Also, please don't recommend Boost. Pulling any Boost header into my code makes compilation take unbearably long.
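The semaphore idea mentioned above works because the count is remembered: a post that arrives before the writer starts waiting is not lost, unlike a bare condition-variable signal. The sketch below models that behavior in standard C++ without Boost. It is an illustration under assumptions: `CountingSemaphore` is a hypothetical class built on `std::mutex` and `std::condition_variable`, standing in for the POSIX `sem_t` the question targets.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical counting semaphore tracking the number of queued items.
class CountingSemaphore {
public:
    // Would be called once per item added in QueueWriteToNetwork.
    void post() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ++count_;
        }
        cv_.notify_one();
    }

    // Would be called by the writer thread before taking one item.
    // Returns immediately if items were posted earlier.
    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return count_ > 0; });
        assert(count_ > 0);
        --count_;
    }

    // Non-blocking variant: takes one unit if available.
    bool try_wait() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (count_ == 0)
            return false;
        --count_;
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    unsigned count_ = 0;
};
```

The count is always read and changed under the same mutex the condition variable waits on, which is what closes the window in which a signal could otherwise be dropped.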
