WSARecv, completion port model: how to manage buffers and avoid overruns? - c++

My problem: my completion port server will receive data of unknown size from different clients. The thing is, I don't know how to avoid buffer overruns, i.e. how to keep my (receiving) buffer from being "overfilled" with data.
Now to the questions:
1) If I make a receive call via WSARecv, does the worker thread work like a callback function? I mean, does it dig up the receive call only when it has completed, or does it also dig it up while the receiving is still happening? Does the lpNumberOfBytes variable (from GetQueuedCompletionStatus) contain the number of bytes received so far, or the total number of bytes received?
2) How do I avoid overruns? I thought of dynamically allocated buffer structures, but then again, how do I find out how big the packet is going to get?
Edit: I hate to ask this, but is there any "simple" method for managing the buffer and avoiding overruns? Synchronisation sounds off limits to me, at least right now.

If i make a receive call via WSARecv, does the workerthread work like a callback function ?
See valdo's post. Completion data is queued to your pool of threads, and one of them will be made ready to process it.
'I mean, does it dig up the receive call only when it has completed?' Yes - hence the name. Note that the meaning of 'completed' may vary depending on the protocol. With TCP, it means that some streamed data bytes have been received from the peer.
'Does the lpNumberOfBytes (from GetQueuedCompletionStatus) variable contain the number of bytes received till now or the total number of bytes received?' It contains only the number of bytes received and loaded into the buffer array provided for that particular IOCP completion.
'How to avoid overruns, i thought of dynamically allocated buffer structures, but then again, how do i find out how big the package is going to get?' You cannot get overruns if you provide the buffer arrays - the kernel thread(s) that load the buffer(s) will not exceed the passed buffer lengths. At application level, given the streaming nature of TCP, it's up to you to decide how to assemble the buffer arrays into usable application-level protocol units. You must decide, using your knowledge of the services provided, on a suitable buffer management scheme.
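One common buffer-management scheme for a TCP stream is length-prefixed framing: append whatever bytes each completion delivered to a per-connection accumulator, then peel off complete protocol units. A minimal, portable sketch (the 4-byte length header and the class name are illustrative assumptions, not anything WSARecv mandates):

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Per-connection accumulator: append whatever each completion delivered,
// then peel off complete [4-byte length][payload] messages.
class MessageAssembler {
public:
    // Call with the bytes one receive completion delivered.
    void feed(const char* data, std::size_t len) {
        buf_.insert(buf_.end(), data, data + len);
    }
    // Returns true and fills 'msg' when a complete message is buffered.
    bool next(std::string& msg) {
        if (buf_.size() < 4) return false;      // header still in transit
        std::uint32_t n;
        std::memcpy(&n, buf_.data(), 4);        // assumes sender's byte order
        if (buf_.size() < 4 + n) return false;  // payload still in transit
        msg.assign(buf_.data() + 4, n);
        buf_.erase(buf_.begin(), buf_.begin() + 4 + n);
        return true;
    }
private:
    std::vector<char> buf_;
};
```

Each completion just calls feed() with lpNumberOfBytes worth of data, then drains next() in a loop; partial messages simply stay buffered until the next completion.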
My last IOCP server was somewhat general-purpose. I used an array of buffer pools and a pool of 'buffer-carrier' objects, allocated at startup (along with a pool of socket objects). Each buffer pool held buffers of a different size. Upon a new connection, I issued a WSARecv using one buffer from the smallest pool. If this buffer got completely filled, I used a buffer from the next largest pool for the next WSARecv, and so on.
Then there's the issue of the sequence numbers needed to prevent out-of-order buffering with multiple handler threads :(

_1. A completion port is a sort of queue (with sophisticated logic concerning the priority of the threads waiting to dequeue an I/O completion from it). Whenever an I/O completes (either successfully or not), it's queued into the completion port. Then it's dequeued by one of the threads that called GetQueuedCompletionStatus.
So you never dequeue an I/O "in progress". Moreover, it's processed by your worker thread asynchronously. That is, processing is delayed until your thread calls GetQueuedCompletionStatus.
_2. This is actually a complex matter. Synchronization is not a trivial task overall, especially when it comes to symmetric multi-threading (where you have several threads, each of which may be doing anything).
One of the parameters you receive with a completed I/O is a pointer to the OVERLAPPED structure that you supplied to the function that issued the I/O (such as WSARecv). It's a common practice to allocate your own structure based on OVERLAPPED (either inheriting from it or having it as the first member). Upon receiving a completion you may cast the dequeued OVERLAPPED to your actual data structure. There you may keep everything needed for synchronization: sync objects, state description, etc.
Note however that it's not trivial to synchronize things correctly (to get good performance and avoid deadlocks), even with such a custom context. This demands an accurate design.
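The "extended OVERLAPPED" pattern described above can be sketched as follows. OVERLAPPED_LIKE here is a self-contained stand-in for the real OVERLAPPED from <windows.h>, and the field names are illustrative:

```cpp
// Stand-in for the real OVERLAPPED from <windows.h>, so this sketch is
// self-contained; on Windows you would use the real structure.
struct OVERLAPPED_LIKE {
    void* internal[2];
    void* event;
};

enum class OpType { Recv, Send };

// Per-I/O context: the OVERLAPPED must be the FIRST member so the pointer
// handed back by GetQueuedCompletionStatus can be cast straight back.
struct PerIoContext {
    OVERLAPPED_LIKE ov;       // the pointer actually passed to WSARecv/WSASend
    OpType          op;       // what this completion means
    char            buffer[4096];
    // ...sync objects and per-connection state would live here too
};

// What the completion handler does with the dequeued lpOverlapped:
PerIoContext* from_overlapped(OVERLAPPED_LIKE* lpOverlapped) {
    return reinterpret_cast<PerIoContext*>(lpOverlapped);
}
```

Because ov is the first member, the cast recovers the whole context, so no lookup table is needed to associate a completion with its buffer and state.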

Related

Should `stream.async_read_some(null_buffers,handler)` complete immediately or not?

The context for this: I'm working on implementing a simple stream adaptor on top of an RDMA wrapper library integrating with Boost ASIO.
The issue I'm running into is that the boost::asio::async_read aggregator being called by the client code is hanging even when the stream has returned enough data to fill the buffer, if there's no more pending data in the internal receive buffer. Debugging appears to show that it's calling my stream adaptor's async_read_some method with a single buffer of size 0.
The documentation I've found seems to be conflicting on whether that operation should complete immediately or not. On the one hand, the AsyncReadStream concept specification says:
If the total size of all buffers in the sequence mb is 0, the asynchronous read operation shall complete immediately and pass 0 as the argument to the handler that specifies the number of bytes read.
On the other hand, the overview of boost::asio::null_buffers says:
A null_buffers operation doesn't return until the I/O object is "ready" to perform the operation.
(And in fact, elsewhere I'm relying on that to register handlers to be called when the rdma_cm and ibverbs completion channel FDs indicate available events.) But looking at the implementation of null_buffers, it would appear that it's just a static object containing no buffers, so that would seem to satisfy the condition of a sequence with total size of all buffers being 0.
So, I'm confused as to how my async_read_some method should handle the case of trying to read 0 bytes. As a wild guess, maybe it should be something like: on a truly empty sequence like null_buffers it should complete only when there's data available in the receive buffer, while if it has a non-empty sequence with total length of the buffers equal to 0 then it should complete immediately regardless of the state of the receive buffer?
Note null_buffers has been deprecated since Boost 1.66.0: "(Deprecated: Use the socket/descriptor wait() and async_wait() member functions.)"
As a wild guess, maybe it should be something like: on a truly empty sequence like null_buffers it should complete only when there's data available in the receive buffer, while if it has a non-empty sequence with total length of the buffers equal to 0 then it should complete immediately regardless of the state of the receive buffer?
No. null_buffers is NOT an "empty buffer" or a "zero-length buffer". It's "no buffer", signalling to the ASIO routines that there are no buffers at all (so the size cannot logically be thought of as zero).
All of the completion notes pertaining to the buffer size are irrelevant, because no buffers are present. The relevant piece of documentation is Reactor Style Operations.
You said it correctly: passing null_buffers signals that you want a reactor-style operation woven into the asio event subsystem (meaning you can asynchronously wait for a socket to become ready for reading/writing, and then do the actual I/O yourself, e.g. by passing the underlying socket handle to a third-party API that doesn't support asynchronous I/O natively).

Can I limit the memory usage of bufferevents in libevent?

Does libevent have a way to restrict the memory usage or the outstanding unwritten data with the bufferevents API? Alternatively, what happens when libevent runs out of memory?
My application reads data from a file, parses it, and sends it over the network (pseudocode below, my application is C++):
for line in file:
    host = parse_host(line)
    socket[host].send(line)
Previously, I split this up into two threads to ensure that we don't block the parsing of the file unnecessarily if one host is slow:
# parse thread
for line in file:
    host = parse_host(line)
    fixed_sized_queues[host].append(line)

# send thread
while stillParsingFile():
    sockets.poll(timeout=1s)
    for queue in fixed_sized_queues:
        if queue is not Empty and socket[queue.host] is ReadyForWrite:
            socket[queue.host].send(queue.pop())
The logic for the second thread is very clunky, and I thought it would be better to use libevents:
for line in file:
    host = parse_host(line)
    bufferevents[host].add(line)
My understanding is that appending to a bufferevent is non-blocking (i.e. it will do dynamic memory allocation and asynchronously write the data to the socket). How do I prevent the parse thread from outpacing the network and filling up all of memory by allocating space for data that can't be sent yet?
The ideas I have currently are:
Use evbuffer_add_reference to mitigate the memory usage (still doesn't provide a bound, but at least allocations should be small)
Use event_set_mem_functions and custom locking to force libevent to block and wait for memory when it hits a cap (this sounds very dangerous, as it only works if libevent only does allocations when adding to a bufferevent and nowhere else)
In the parse thread, grab a lock before adding to the evbuffer and check evbuffer_get_length(); in the send thread, signal the parse thread in a low water write callback from libevent.
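The third idea amounts to a bounded-queue backpressure scheme. A minimal sketch with standard C++ primitives in place of libevent (class and watermark names are illustrative): the parse thread blocks once the pending byte count passes a high watermark, and the send thread wakes it again once drained below the low one, mirroring a low-water write callback:

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>

// Bounded outbox: add() applies backpressure above the high watermark;
// pop() signals waiters once the backlog drains below the low watermark.
class BoundedOutbox {
public:
    BoundedOutbox(std::size_t high, std::size_t low) : high_(high), low_(low) {}

    void add(std::string line) {                       // parse thread
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return bytes_ < high_; });  // block while over cap
        bytes_ += line.size();
        q_.push_back(std::move(line));
    }

    bool pop(std::string& out) {                       // send thread
        std::lock_guard<std::mutex> lk(m_);
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop_front();
        bytes_ -= out.size();
        if (bytes_ <= low_) cv_.notify_all();          // "low-water callback"
        return true;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<std::string> q_;
    std::size_t bytes_ = 0;
    const std::size_t high_, low_;
};
```

This bounds memory to roughly the high watermark plus one in-flight line, regardless of how fast the parse thread can read the file.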

MPI_ERR_BUFFER: invalid buffer pointer

What is the most common reason for this error
MPI_ERR_BUFFER: invalid buffer pointer
which results from MPI_Bsend() and MPI_Recv() calls?
The program works fine when the number of parallel processes is small (<14), but when I increase the number of processes I get this error.
To expand on my previous comment:
Buffering in MPI can occur on various occasions. Messages can be buffered internally by the MPI library in order to hide network latency (usually only done for small messages, up to an implementation-dependent size), or buffering can be enforced by the user with either of the buffered send operations MPI_Bsend() and MPI_Ibsend(). User buffering differs from the internal kind, though:
first, messages sent by MPI_Bsend() or by MPI_Ibsend() are always buffered, which is not the case with internally buffered messages. The latter may or may not be buffered, depending on their size and the availability of internal buffer space;
second, because of the "always buffer" aspect, if no space is available in the user-attached buffer, an MPI_ERR_BUFFER error occurs.
Sent messages occupy buffer space until they are known for sure to have been received by the destination process. Since MPI does not provide any built-in mechanism to confirm the reception of a message, one has to devise another way to do it, e.g. by sending back a confirmation message from the destination process to the source one.
For that reason one has to consider all messages that were not explicitly confirmed as being in transit, and allocate enough memory in the buffer. Usually this means that the buffer should be at least as large as the total amount of data that you are willing to have in transit, plus the message envelope overhead, which is equal to number_of_sends * MPI_BSEND_OVERHEAD. This can put a lot of memory pressure on large MPI jobs. One has to keep that in mind and adjust the buffer space accordingly when the number of processes is changed.
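The sizing rule above can be written down directly. MPI_BSEND_OVERHEAD is defined in <mpi.h>; the constant below is only a stand-in so the sketch is self-contained:

```cpp
#include <cstddef>

// Stand-in for MPI_BSEND_OVERHEAD from <mpi.h>; the real value is
// implementation-dependent.
constexpr std::size_t BSEND_OVERHEAD = 96;

// Minimum attach size for a buffer that must hold up to 'num_sends'
// unconfirmed (in-transit) buffered sends of 'bytes_per_send' each:
// one envelope overhead is paid per send.
constexpr std::size_t bsend_buffer_size(std::size_t num_sends,
                                        std::size_t bytes_per_send) {
    return num_sends * (bytes_per_send + BSEND_OVERHEAD);
}
```

The computed size would then be handed to MPI_Buffer_attach(); undersizing it for the worst-case number of in-transit sends is exactly what produces MPI_ERR_BUFFER as the process count grows.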
Note that buffered send is provided merely as a convenience. It could be readily implemented as a combination of memory duplication and non-blocking send operation, e.g. buffered send frees you from writing code like:
int data[N];        /* N elements to send */
int *shadow_data;
MPI_Request req;
...
<populate data>
...
shadow_data = (int *)malloc(sizeof(data));
memcpy(shadow_data, data, sizeof(data));
MPI_Isend(shadow_data, N, MPI_INT, destination, tag, MPI_COMM_WORLD, &req);
...
<reuse data as it is not used by MPI>
...
MPI_Wait(&req, MPI_STATUS_IGNORE);
free(shadow_data);
If memory is scarce then you should resort to non-blocking sends only.

Design of concurrent processing of a dual buffer system?

I have a long-running application that basically:
read packets off network
save it somewhere
process it and output to disk
A very common use-case indeed - except both the data size and the data rate can be quite large. To avoid memory overflow and improve efficiency, I am thinking of a dual buffer design, where buffers A and B alternate: while A is holding network packets, B is processed for output. Once buffer A reaches a soft bound, A becomes due for output processing, and B is used to hold network packets.
I am not particularly experienced with concurrent/multi-threaded programming paradigms. I have read some past discussions on circular buffers that handle the multiple-producer, multiple-consumer case. I am not sure whether that is the best solution, and the dual buffer design seems simpler.
My question is: is there a design pattern I can follow to tackle the problem? or better design for that matter? If possible, please use pseudo code to help to illustrate the solution. Thanks.
I suggest that you should, instead of assuming "two" (or any fixed number of ...) buffers, simply use a queue, and therefore a "producer/consumer" relationship.
The process that is receiving packets simply adds them to a buffer of a certain size, and, either when the buffer is sufficiently full or a specified (short) time interval has elapsed, places the (non-empty) buffer onto a queue for processing by the other thread. It then allocates a new buffer for its own use.
The consuming ("other") process is woken up any time there might be a new buffer in the queue for it to process. It removes the buffer, processes it, then checks the queue again. It goes to sleep only when it finds that the queue is empty. (Take care that the consumer cannot decide to go to sleep at the precise instant the producer decides to signal it... there must be no "race condition" here.)
Consider simply allocating storage "per message" (whatever a "message" may mean to you), and putting that "message" onto the queue, so that there is no unnecessary delay in processing caused by "waiting for a buffer to fill up."
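A minimal blocking producer/consumer queue of buffers, as described above, can be sketched with standard primitives (names are illustrative). The condition-variable predicate is what rules out the go-to-sleep/signal race mentioned:

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <vector>

using Buffer = std::vector<char>;

// Minimal blocking producer/consumer queue: the network thread push()es
// filled buffers, the processing thread pop()s them.
class BufferQueue {
public:
    void push(Buffer b) {
        {
            std::lock_guard<std::mutex> lk(m_);
            q_.push_back(std::move(b));
        }
        cv_.notify_one();   // wake the consumer if it is sleeping
    }
    Buffer pop() {          // blocks while the queue is empty
        std::unique_lock<std::mutex> lk(m_);
        // The predicate re-checks under the lock, so a wakeup can never be
        // lost between "queue is empty" and "go to sleep".
        cv_.wait(lk, [&] { return !q_.empty(); });
        Buffer b = std::move(q_.front());
        q_.pop_front();
        return b;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<Buffer> q_;
};
```

The same class can double as the free-buffer pool: the consumer push()es emptied buffers back, and the producer pop()s its next buffer from there, which also bounds total memory use.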
It might be worth mentioning a technique used in real-time audio processing/recording: a single ring buffer (or FIFO, if you prefer that term) of sufficient size can be used for this case.
You will then need a read cursor and a write cursor. (Whether you actually need a lock, or can get by with volatile plus memory barriers, is a touchy subject, but the people at PortAudio suggest you do this without locks if performance is important.)
You can use one thread to read and another thread to write. The read thread should consume as much of the buffer as possible. You will be safe unless you run out of buffer space, but that risk exists for the dual-buffer solution as well. So the underlying assumption is that you can write to disk faster than the input comes in; otherwise you will need to expand on the solution.
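A minimal single-producer/single-consumer ring buffer along those lines, sketched with C++ atomics rather than volatile (safe only with exactly one writer thread and one reader thread):

```cpp
#include <atomic>
#include <cstddef>

// Lock-free single-producer/single-consumer ring buffer.
// N must be a power of two; usable capacity is N - 1 elements.
template <std::size_t N>
class SpscRing {
public:
    bool write(char c) {            // producer thread only
        std::size_t w = w_.load(std::memory_order_relaxed);
        std::size_t next = (w + 1) & (N - 1);
        if (next == r_.load(std::memory_order_acquire))
            return false;           // full: the caller must handle overflow
        buf_[w] = c;
        w_.store(next, std::memory_order_release);
        return true;
    }
    bool read(char& c) {            // consumer thread only
        std::size_t r = r_.load(std::memory_order_relaxed);
        if (r == w_.load(std::memory_order_acquire))
            return false;           // empty
        c = buf_[r];
        r_.store((r + 1) & (N - 1), std::memory_order_release);
        return true;
    }
private:
    char buf_[N];
    std::atomic<std::size_t> w_{0}, r_{0};
};
```

The acquire/release pairs on the cursors are what replace the hand-rolled memory barriers; each cursor is written by only one thread, which is why no lock is needed.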
Find a producer-consumer queue class that works. Use one to create a buffer pool to improve performance and control memory use. Use another to transfer the buffers from the network thread to the disk thread:
#define CnumBuffs 128
#define CbufSize 8182
#define CcacheLineSize 128

struct netBuf {
    char cacheLineFiller[CcacheLineSize]; // anti-false-sharing padding
    int dataLen;
    char bigBuf[CbufSize];
};

PCqueue pool;       // recycled, empty buffers
PCqueue diskQueue;  // filled buffers on their way to disk
netThread *net;
diskThread *disk;

pool = new PCqueue;
diskQueue = new PCqueue;
// make an object pool
for (i = 0; i < CnumBuffs; i++) {
    pool->push(new netBuf);
}
net = new netThread;
disk = new diskThread;
net->start();
disk->start();
..
void *netThread::run() {
    netBuf *thisBuf;
    for (;;) {
        pool->pop(&thisBuf); // blocks if the pool is empty
        thisBuf->dataLen = network.read(thisBuf->bigBuf, sizeof(thisBuf->bigBuf));
        diskQueue->push(thisBuf);
    }
}
void *diskThread::run() {
    fileStream myFile("someFolder\\fileSpec", someEnumWrite);
    netBuf *thisBuf;
    for (;;) {
        diskQueue->pop(&thisBuf); // blocks until a filled buffer is available
        myFile.write(thisBuf->bigBuf, thisBuf->dataLen);
        pool->push(thisBuf);      // recycle the buffer for the net thread
    }
}

Buffering Incomplete High Speed Reads

I am reading data ~100 bytes at 100hz from a serial port. My buffer is 1024 bytes, so often my buffer doesn't get completely used. Sometimes however, I get hiccups from the serial port and the buffer gets filled up.
My data is organized as a [header]data[checksum]. When my buffer gets filled up, sometimes a message/data is split across two reads from the serial port.
This is a simple problem, and I'm sure there are a lot of different approaches. I am ahead of schedule, so I would like to research different approaches. Could you guys name some paradigms that cover buffering high-speed data that might need to be put together from two reads? Note, the main difference I see between this problem and other buffering I've done (image acquisition, TCP/IP) is that there we are guaranteed full packets/messages. Here a "packet" may be split between reads, which we will only know once we start parsing the data.
Oh yes, note that the data buffered in from the read has to be parsed, so to keep things simple, the data should be contiguous when it reaches the parser. (Plus, I don't think that's the parser's responsibility.)
Some Ideas I Had:
Carry over unused bytes in my original buffer, then append the next read after the leftover bytes from the previous read. (For example, we read 1024 bytes and 24 bytes are left over at the end as a partial message; memcpy them to the beginning of read_buffer_, then pass beginning + 24 to the read and read in 1024 - 24 bytes.)
Create my own class that just accepts blocks of data. It has two pointers, read/write, and a large chunk of memory (1024 * 4). When you pass in data, the class updates the write pointer correctly and wraps around to the beginning of its buffer when it reaches the end. I guess that's like a ring buffer?
I was also thinking of using a std::vector<unsigned char>: dynamic memory allocation, guaranteed to be contiguous.
Thanks for the info guys!
Define some 'APU' application-protocol-unit class that will represent your '[header]data[checksum]'. Give it some 'add' function that takes a char parameter and returns a 'valid' bool. In your serial read thread, create an APU and read some data into your 1024-byte buffer. Iterate the data in the buffer, pushing it into the APU add() until either the APU add() function returns true or the iteration is complete. If the add() returns true, you have a complete APU - queue it off for handling, create another one and start add()-ing the remaining buffer bytes to it. If the iteration is complete, loop back round to read more serial data.
The add() method would use a state-machine, or other mechanism, to build up and check the incoming bytes, returning 'true' only in the case of a full sanity-checked set of data with the correct checksum. If some part of the checking fails, the APU is 'reset' and waits to detect a valid header.
The APU could maybe parse the data itself, either byte-by-byte during the add() data input, just before add() returns with 'true', or perhaps as a separate 'parse()' method called later, perhaps by some other APU-processing thread.
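A sketch of such an add() state machine follows. The framing details (a 0x7E start marker, a one-byte length, a one-byte additive checksum) are assumptions for illustration, not taken from the question:

```cpp
#include <cstdint>
#include <vector>

// APU byte-at-a-time assembler: add() returns true exactly when a full,
// checksum-verified frame has been accumulated.
class Apu {
    enum class State { Header, Length, Data, Checksum } state_ = State::Header;
    std::uint8_t len_ = 0;
    std::vector<std::uint8_t> data_;
public:
    const std::vector<std::uint8_t>& data() const { return data_; }

    bool add(std::uint8_t b) {
        switch (state_) {
        case State::Header:
            if (b == 0x7E) state_ = State::Length;   // hunt for the marker
            return false;
        case State::Length:
            len_ = b;
            data_.clear();
            state_ = (len_ ? State::Data : State::Checksum);
            return false;
        case State::Data:
            data_.push_back(b);
            if (data_.size() == len_) state_ = State::Checksum;
            return false;
        case State::Checksum: {
            std::uint8_t sum = 0;
            for (auto d : data_) sum = static_cast<std::uint8_t>(sum + d);
            state_ = State::Header;                  // reset either way
            return sum == b;                         // true only on valid frame
        }
        }
        return false;
    }
};
```

Because add() consumes one byte at a time, a frame split across two serial reads is assembled transparently: the remaining bytes of the first read leave the machine mid-state, and the next read's bytes continue from there.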
When reading from a serial port at speed, you typically need some kind of handshaking mechanism to control the flow of data. This can be hardware (e.g. RTS/CTS), software (XON/XOFF), or controlled by a higher-level protocol. If you're reading a large amount of data at speed without handshaking, your UART or serial controller must be able to read and buffer all the available data at that speed to ensure no data loss. On the 16550-compatible UARTs you see on Windows PCs, this buffer is just 14 bytes, hence the need for handshaking or a real-time OS.