Can I limit the memory usage of bufferevents in libevent? - c++

Does libevent have a way to restrict the memory usage or the outstanding unwritten data with the bufferevents API? Alternatively, what happens when libevent runs out of memory?
My application reads data from a file, parses it, and sends it over the network (pseudocode below, my application is C++):
for line in file:
    host = parse_host(line)
    socket[host].send(line)
Previously, I split this up into two threads to ensure that we don't block the parsing of the file unnecessarily if one host is slow:
for line in file:
    host = parse_host(line)
    fixed_sized_queues[host].append(line)

while stillParsingFile():
    sockets.poll(timeout=1s)
    for queue in fixed_sized_queues:
        if queue is not Empty and socket[queue.host] is ReadyForWrite:
            socket[queue.host].send(queue.pop())
The logic for the second thread is very clunky, and I thought it would be better to use libevent:
for line in file:
    host = parse_host(line)
    bufferevents[host].add(line)
My understanding is that appending to a bufferevent is non-blocking (i.e. it will do dynamic memory allocation and asynchronously write the data to the socket). How do I prevent the parse thread from outpacing the network and filling up all of memory by allocating space for data that can't be sent yet?
The ideas I have currently are:
Use evbuffer_add_reference to mitigate the memory usage (still doesn't provide a bound, but at least allocations should be small)
Use event_set_mem_functions and custom locking to force libevent to block and wait for memory when it hits a cap (this sounds very dangerous, as it only works if libevent only does allocations when adding to a bufferevent and nowhere else)
In the parse thread, grab a lock and check evbuffer_get_length() before adding to the evbuffer; in the send thread, signal the parse thread from a low-water write callback registered with libevent (rough sketch below).
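A rough, untested sketch of that third idea (the cap, the BoundedBev wrapper and the constant names are mine, not part of libevent; it assumes the bufferevent was created with BEV_OPT_THREADSAFE and that evthread_use_pthreads() was called):

#include <event2/event.h>
#include <event2/buffer.h>
#include <event2/bufferevent.h>
#include <condition_variable>
#include <mutex>

// Per-host wrapper: bufferevent plus the synchronization used to block the
// parse thread when too much unsent data is queued for this host.
struct BoundedBev {
    bufferevent            *bev;
    std::mutex              m;
    std::condition_variable cv;
};

static const size_t kMaxOutstanding = 1 << 20;              // 1 MiB cap per host
static const size_t kLowWatermark   = kMaxOutstanding / 2;

// Runs in the event-loop thread once the output buffer has drained below the
// low watermark installed with bufferevent_setwatermark().
static void on_drained(bufferevent *bev, void *ctx) {
    BoundedBev *b = static_cast<BoundedBev *>(ctx);
    std::lock_guard<std::mutex> lk(b->m);
    b->cv.notify_one();
}

void setup(BoundedBev *b) {
    bufferevent_setwatermark(b->bev, EV_WRITE, kLowWatermark, 0);
    bufferevent_setcb(b->bev, nullptr, on_drained, nullptr, b);
}

// Called from the parse thread: blocks while the queued output exceeds the cap.
void bounded_send(BoundedBev *b, const char *data, size_t len) {
    std::unique_lock<std::mutex> lk(b->m);
    b->cv.wait(lk, [b] {
        return evbuffer_get_length(bufferevent_get_output(b->bev)) < kMaxOutstanding;
    });
    bufferevent_write(b->bev, data, len);
}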

Related

C++: Which weak atomic to use for buffers that receive async. RDMA transfers?

The Derecho system (open-source C++ library for data replication, distributed coordination, Paxos -- ultra-fast) is built around asynchronous RDMA networking primitives. Senders can write to receivers without pausing, using RDMA transfers into receiver memory. Typically this is done in two steps: we transfer the data bytes in one operation, then notify the receiver by incrementing a counter or setting a flag: "message 67 is ready for you, now". Soon the receiver will notice that message 67 is ready, at which point it will access the bytes of that message.
Intended semantic: "seeing the counter updated should imply that the receiver's C++ code will see the bytes of the message." In PL terms, we need a memory fence between the update of the guard and the bytes of the message. The individual cache-lines must also be sequentially consistent: my guard will go through values like 67, 68, .... and I don't want any form of mashed up value or non-monotonic sequencing, such as could arise if C++ reads a stale cache line, or mistakenly holds a stale value in memory. Same for the message buffer itself: these bytes might overwrite old bytes and I don't want to see some kind of mashup.
This is the crux of my question: I need a weak atomic that will impose [exactly] the needed barrier, without introducing unwanted overheads. Which annotation would be appropriate? Would the weak atomic annotation be the same for the "message" as for the counter (the "guard")?
Secondary question: If I declare my buffer with the proper weak atomic, do I also need to say that it is "volatile", or will C++ realize this because the memory was declared weakly atomic?
An atomic counter, whatever its type, will not guarantee anything about memory not controlled by the CPU. Before the RDMA transfer starts, you need to ensure the CPU's caches for the RDMA region are flushed and invalidated, and then of course not read from or write to that region while the RDMA transfer is ongoing. When the RDMA device signals the transfer is done, then you can update the counter.
The thread that is waiting for the counter to be incremented must not have loads or stores issued after reading the counter reordered before that read, so the correct memory order for that load is std::memory_order_acquire. So basically you want Release-Acquire ordering, although there is nothing to "release" in the thread that updates the counter.
You don't need to make the buffers volatile; in general you should not rely on volatile for atomicity.
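A minimal single-process sketch of that Release-Acquire pairing (illustrative names only; it shows the ordering on the CPU side and deliberately ignores the device/cache caveats described above):

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstring>

alignas(64) char      msg_buf[4096];    // region the message bytes land in
std::atomic<uint64_t> msg_counter{0};   // the "guard"

// Publisher side: called after the transfer of message n is known to be
// complete. The release store orders everything visible before it ahead of
// the counter update.
void publish(uint64_t n) {
    msg_counter.store(n, std::memory_order_release);
}

// Consumer side: the acquire load pairs with the release store, so once the
// expected value is seen, the message bytes are visible to this thread too.
void consume(uint64_t expected, char *out, std::size_t len) {
    while (msg_counter.load(std::memory_order_acquire) < expected) {
        // spin or back off
    }
    std::memcpy(out, msg_buf, len);
}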

user land memory is overwritten by 8 bytes kernel land memory pointer

We have finished a complicated multi-threaded program. Simplified, it works like this:
The main thread starts and initializes a thread pool, which is used to request data from a remote server.
The main thread creates about 3000 tasks and adds them to the thread pool.
Every task is initialized with a memory buffer. A thread from the pool takes a task, fetches the data from the remote server, copies it into the task's memory buffer, and informs the main thread that the data is ready to read.
When the main thread reads data from the buffer, it checks the data's checksum for error detection.
The processes run on 2500 machines, with each machine running about 20 processes at the same time. Roughly one process hits this error per day.
At first we thought the cause must be the data returned by the remote server, so we added a CRC check to the response from the remote server. The CRC32 check proved that the data is always correct when it reaches the process.
Then we thought there must be a stray pointer in user land that overwrites the task's memory buffer, so we protected every task's data buffer with mprotect: first copy the data into the task buffer, then mark the buffer read-only, and finally check the CRC of both the response and the data buffer; both are correct at this point. But when the main thread reads data from the buffer, the data error still occurs!
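Roughly what our protection step looks like (simplified; the Task struct and field names here are illustrative, and the buffer is assumed to be page-aligned with a length that is a multiple of the page size, as mprotect requires):

#include <sys/mman.h>
#include <cstring>

struct Task {
    char   *buf;      // page-aligned (e.g. from posix_memalign or mmap)
    size_t  buf_len;  // multiple of the page size
};

// Copy the response into the task buffer, then make the pages read-only so a
// later stray user-land write should trigger SIGSEGV.
void fill_and_protect(Task *task, const char *data, size_t len) {
    std::memcpy(task->buf, data, len);
    mprotect(task->buf, task->buf_len, PROT_READ);
}

// Before the buffer can be reused for another task:
//     mprotect(task->buf, task->buf_len, PROT_READ | PROT_WRITE);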
We modified the code to abort when it encounters the data error and looked at the overwritten data: the value 0xffff881492499c88 was found. This is a kernel-land virtual memory pointer on x86_64 Linux. I examined the core dump with gdb and found that the task memory was still in read-only mode.
I'm confused about the memory protection. Why is a user-land memory buffer that is in read-only mode modified without a segmentation fault?
I'm not familiar with kernel internals. Is there a way for the kernel to modify user-land memory while ignoring the memory protection? If kernel and user-land virtual addresses are mapped to the same physical memory, could the kernel modify the memory without causing a crash? If so, how can I trace such issues?
My machines have cgroups enabled to control the processes' memory, on Linux kernel 2.6.32.
I have been working on this for more than two weeks, and any suggestion is appreciated. If there is a known bug related to this case, please let me know. Some kernel modules with private patches are loaded into the OS, so my next step is to unload the suspicious modules and try to reproduce the error; if it can no longer be reproduced, the responsible module may be identified.

WSARecv, Completionport Model, how to manage Buffer and avoid overruns?

My problem: my completion-port server will receive data of unknown size from different clients. The thing is, I don't know how to avoid buffer overruns, i.e. how to keep my receive buffer from being "overfilled" with data.
Now to the questions:
1) If I make a receive call via WSARecv, does the worker thread work like a callback function? I mean, does it dig up the receive call only when it has completed, or does it also dig it up while the receive is still happening? Does the lpNumberOfBytes variable (from GetQueuedCompletionStatus) contain the number of bytes received so far or the total number of bytes received?
2) How do I avoid overruns? I thought of dynamically allocated buffer structures, but then again, how do I find out how big the packet is going to get?
Edit: I hate to ask this, but is there any "simple" method for managing the buffers and avoiding overruns? Synchronisation sounds off limits to me, at least right now.
'If I make a receive call via WSARecv, does the worker thread work like a callback function?'
See valdo's post. Completion data is queued to your pool of threads and one thread will be made ready to process it.
'I mean, does it dig up the receive call only when it has completed?' Yes - hence the name. Note that the meaning of 'completed' may vary depending on the protocol. With TCP, it means that some streamed data bytes have been received from the peer.
'Does the lpNumberOfBytes (from GetQueuedCompletionStatus) variable contain the number of bytes received till now or the total number of bytes received?' It contains the number of bytes received and loaded into the buffer array provided in that IOCP completion only.
'How do I avoid overruns?' You cannot get overruns if you provide the buffer arrays - the kernel thread/s that load the buffer/s will not exceed the passed buffer lengths. At application level, given the streaming nature of TCP, it's up to you to decide how to process the buffer arrays into usable application-level protocol units. You must decide, using your knowledge of the services provided, on a suitable buffer management scheme.
My last IOCP server was somewhat general-purpose. I used an array of buffer pools and a pool of 'buffer-carrier' objects, allocated at startup (along with a pool of socket objects). Each buffer pool held buffers of a different size. Upon a new connection, I issued a WSARecv using one buffer from the smallest pool. If this buffer got completely filled, I used a buffer from the next largest pool for the next WSARecv, and so on.
Then there's the issue of the sequence numbers needed to prevent out-of-order buffering with multiple handler threads :(
1. A completion port is a sort of queue (with sophisticated logic concerning the priority of threads waiting to dequeue an I/O completion from it). Whenever an I/O completes (either successfully or not), it's queued into the completion port. Then it's dequeued by one of the threads that called GetQueuedCompletionStatus.
So you never dequeue an I/O that is "in progress". Moreover, it's processed by your worker thread asynchronously; that is, processing is delayed until your thread calls GetQueuedCompletionStatus.
2. This is actually a complex matter. Synchronization is not a trivial task overall, especially when it comes to symmetric multi-threading (where you have several threads, each of which may be doing anything).
One of the parameters you receive with a completed I/O is a pointer to an OVERLAPPED structure (the one you supplied to the function that issued the I/O, such as WSARecv). It's common practice to allocate your own structure that is based on OVERLAPPED (it either inherits it or has it as the first member). Upon receiving a completion you may cast the dequeued OVERLAPPED to your actual data structure. There you may have everything needed for the synchronization: sync objects, state descriptions, and so on.
Note however that it's not a trivial task to synchronize things correctly (to get good performance and avoid deadlocks) even when you have the custom context. This demands a careful design.
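A sketch of that common practice (illustrative only, error handling omitted; PerIoContext and post_recv are made-up names, not part of the Winsock API):

#include <winsock2.h>
#include <windows.h>

// Per-I/O context: the OVERLAPPED must be the first member so the pointer
// dequeued from GetQueuedCompletionStatus can be cast straight back to it.
struct PerIoContext {
    OVERLAPPED ov;            // must stay first
    WSABUF     wsaBuf;
    char       buffer[8192];
    // ... per-connection state, sequence number, sync objects, etc.
};

void post_recv(SOCKET s, PerIoContext *ctx) {
    ZeroMemory(&ctx->ov, sizeof(ctx->ov));
    ctx->wsaBuf.buf = ctx->buffer;
    ctx->wsaBuf.len = sizeof(ctx->buffer);
    DWORD flags = 0;
    WSARecv(s, &ctx->wsaBuf, 1, nullptr, &flags, &ctx->ov, nullptr);
}

void worker_loop(HANDLE iocp) {
    DWORD       bytes;
    ULONG_PTR   key;
    OVERLAPPED *ov;
    while (GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE)) {
        PerIoContext *ctx = reinterpret_cast<PerIoContext *>(ov);
        // `bytes` is only what this one completion delivered.
        // ... process ctx->buffer[0 .. bytes), then re-issue the read ...
    }
}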

Design of concurrent processing of a dual buffer system?

I have a long-running application that basically:
read packets off network
save it somewhere
process it and output to disk
A very common use-case indeed - except both the data size and data rate can be quite large. To avoid overflowing memory and improve efficiency, I am thinking of a dual-buffer design, where buffers A and B alternate: while A is holding network packets, B is processed for output. Once buffer A reaches a soft bound, A is due for output processing, and B will be used for holding network packets.
I am not particularly experienced with concurrency/multi-threaded programming paradigms. I have read some past discussions on circular buffers that handle the multiple-producer, multiple-consumer case. I am not sure if that is the best solution, and it seems the dual-buffer design is simpler.
My question is: is there a design pattern I can follow to tackle the problem, or a better design for that matter? If possible, please use pseudocode to help illustrate the solution. Thanks.
I suggest that you should, instead of assuming "two" (or any fixed number of ...) buffers, simply use a queue, and therefore a "producer/consumer" relationship.
The process that is receiving packets simply adds them to a buffer of some certain size, and, either when the buffer is sufficiently full or a specified (short...) time interval has elapsed, places the (non-empty) buffer onto a queue for processing by the other. It then allocates a new buffer for its own use.
The receiving ("other...") process is woken up any time there might be a new buffer in the queue for it to process. It removes the buffer, processes it, then checks the queue again. It goes to sleep only when it finds that the queue is empty. (Use care to be sure that the process cannot decide to go to sleep at the precise instant that the other process decides to signal it... there must be no "race condition" here.)
Consider simply allocating storage "per-message" (whatever a "message" may mean to you), and putting that "message" onto the queue, so that there is no unnecessary delay in processing caused by "waiting for a buffer to fill up."
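As a minimal sketch of such a producer/consumer hand-off (just one way to do it, using std::mutex and std::condition_variable; the class name is illustrative):

#include <condition_variable>
#include <deque>
#include <mutex>

template <typename T>
class BlockingQueue {
public:
    void push(T item) {
        {
            std::lock_guard<std::mutex> lk(m_);
            q_.push_back(std::move(item));
        }
        cv_.notify_one();   // wake the consumer; the predicate avoids lost wakeups
    }
    T pop() {               // blocks while the queue is empty
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        T item = std::move(q_.front());
        q_.pop_front();
        return item;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<T> q_;
};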
It might be worth mentioning a technique used in real-time audio processing/recording: a single ring buffer (or FIFO, if you prefer that term) of sufficient size can be used for this case.
You will then need a read cursor and a write cursor. (Whether you actually need a lock or can get by with volatile plus memory barriers is a touchy subject, but the people at PortAudio suggest you do this without locks if performance is important.)
You can use one thread to read and another thread to write. The read thread should consume as much of the buffer as possible. You will be safe unless you run out of buffer space, but that risk exists for the dual-buffer solution as well. So the underlying assumption is that you can write to disk faster than the input comes in, or you will need to expand on the solution.
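A minimal single-producer/single-consumer ring buffer along those lines (a sketch using C++11 atomics for the cursors rather than volatile; the names and the power-of-two capacity requirement are my own choices):

#include <atomic>
#include <cstddef>

// Single-producer / single-consumer ring buffer; capacity must be a power of two.
template <typename T, std::size_t N>
class SpscRing {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
public:
    bool push(const T &v) {                        // called only by the producer
        std::size_t w = write_.load(std::memory_order_relaxed);
        if (w - read_.load(std::memory_order_acquire) == N)
            return false;                          // full
        buf_[w & (N - 1)] = v;
        write_.store(w + 1, std::memory_order_release);
        return true;
    }
    bool pop(T &out) {                             // called only by the consumer
        std::size_t r = read_.load(std::memory_order_relaxed);
        if (r == write_.load(std::memory_order_acquire))
            return false;                          // empty
        out = buf_[r & (N - 1)];
        read_.store(r + 1, std::memory_order_release);
        return true;
    }
private:
    T buf_[N];
    std::atomic<std::size_t> write_{0};
    std::atomic<std::size_t> read_{0};
};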
Find a producer-consumer queue class that works. Use one to create a buffer pool to improve performance and control memory use. Use another to transfer the buffers from the network thread to the disk thread:
#define CnumBuffs 128
#define CbufSize 8192
#define CcacheLineSize 128

struct netBuf {
    char cacheLineFiller[CcacheLineSize]; // anti false-sharing padding
    int  dataLen;
    char bigBuf[CbufSize];
};

PCqueue    *pool;       // pool of free buffers
PCqueue    *diskQueue;  // filled buffers waiting for the disk thread
netThread  *netThrd;
diskThread *diskThrd;

pool      = new PCqueue();
diskQueue = new PCqueue();

// make an object pool of free buffers
for (int i = 0; i < CnumBuffs; i++) {
    pool->push(new netBuf());
}

netThrd  = new netThread();
diskThrd = new diskThread();
netThrd->start();
diskThrd->start();
..
void *netThread::run() {
    netBuf *thisBuf;
    for (;;) {
        pool->pop(&thisBuf);  // blocks if the pool is empty
        thisBuf->dataLen = network.read(thisBuf->bigBuf, sizeof(thisBuf->bigBuf));
        diskQueue->push(thisBuf);
    }
}

void *diskThread::run() {
    fileStream myFile("someFolder\\fileSpec", someEnumWrite);
    netBuf *thisBuf;
    for (;;) {
        diskQueue->pop(&thisBuf);  // blocks until a buffer is available
        myFile.write(thisBuf->bigBuf, thisBuf->dataLen);
        pool->push(thisBuf);       // return the buffer to the free pool
    }
}

Memory SPIKE - Boost ASIO ASYNC read

Wrote a Server which just reads data from a client:
Using a boost::array buffer
Started the server and system monitor shows 1MB of usage.
1.) Just do an async_read_some and do a handleRead in which I again call the asyncRead function.
void asyncRead() {
    m_socket->async_read_some(
        boost::asio::buffer(m_readBuffer, READ_BLOCK_SIZE),
        m_strand->wrap(boost::bind(&ConnectionHandler::handleRead,
                                   shared_from_this(),
                                   boost::asio::placeholders::error,
                                   boost::asio::placeholders::bytes_transferred))
    );
}
and in handleRead I verify if there are any errors or not and if there aren't any I simply issue another asyncRead().
2.) Kept sending frames ( data of size around 102 bytes ).
At end of test for 10000 Frames. Total Sent size = 102*10000
Total Read Size = 102*10000
But the memory usage in the system monitor spikes up to 7.8 MB.
Couldn't figure out the cause of this increase. The different aspects tried out are:
1.) Number of connections being made - only 1.
2.) Verified closing of connection - yes.
3.) Even stopped the io_service, but still no change.
On a second run of the client, I see the memory increasing further. What could be the cause? I am using a boost::array, which is a stack variable, and I am simply just reading. There is no other place where a buffer is being initialized.
Raja,
First of all, are you aware that async_read_some does not guarantee that you will read the entire READ_BLOCK_SIZE? If you need that guarantee, I would suggest you use async_read instead.
Now, back to the original question: your situation is quite typical. So, basically, you need a container (array) that will hold the data until it is sent, and then you need to get rid of it.
I strongly suggest switching to boost::shared_array. You can use it in the same way as boost::array, but it has a built-in reference counter, so the object will be deleted when it is not needed anymore. This should solve your memory leak.
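For illustration, a sketch of that idea using a reference-counted buffer captured by the completion handler (shown here with std::shared_ptr, which serves the same purpose as boost::shared_array; asyncRead as a free function and the 8192-byte size are just placeholders):

#include <boost/asio.hpp>
#include <memory>
#include <vector>

// The buffer is owned by a shared_ptr captured in the completion handler, so
// it stays alive exactly as long as the pending read and is released as soon
// as the handler has run.
void asyncRead(std::shared_ptr<boost::asio::ip::tcp::socket> socket) {
    auto buf = std::make_shared<std::vector<char>>(8192);
    socket->async_read_some(
        boost::asio::buffer(*buf),
        [socket, buf](const boost::system::error_code &ec, std::size_t n) {
            if (!ec) {
                // consume buf->data()[0 .. n), then issue the next read
                asyncRead(socket);
            }
        });
}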