Memory spike - Boost ASIO async read - C++

Wrote a server which just reads data from a client, using a boost::array buffer.
Started the server; the system monitor shows 1 MB of usage.
1.) I just do an async_read_some, with a handleRead in which I call asyncRead again:
void asyncRead() {
    m_socket->async_read_some(
        boost::asio::buffer(m_readBuffer, READ_BLOCK_SIZE),
        m_strand->wrap(boost::bind(&ConnectionHandler::handleRead,
                                   shared_from_this(),
                                   boost::asio::placeholders::error,
                                   boost::asio::placeholders::bytes_transferred))
    );
}
In handleRead I check whether there were any errors, and if there weren't I simply issue another asyncRead().
2.) Kept sending frames (data of around 102 bytes each).
At the end of a test of 10000 frames: total sent size = 102*10000, total read size = 102*10000.
But the memory usage in the system monitor spikes up to 7.8 MB.
Couldn't figure out the cause of this increase. The different aspects tried out are:
1.) Number of connections being made - only 1.
2.) Verified closing of the connection - yes.
3.) Even stopped the io_service, but still no change.
On a second run of the client, I see the memory increasing further. What could be the cause? I am using a boost::array, which is a stack variable, and simply reading; there is no other place where a buffer is initialized.

Raja,
First of all, are you aware that async_read_some does not guarantee that you will read the entire READ_BLOCK_SIZE? If you need that guarantee, I would suggest using async_read instead.
Now, back to the original question: your situation is quite typical. Basically, you need a container (array) that will hold the data until it is sent, and then you need to get rid of it.
I strongly suggest switching to boost::shared_array. You can use it in the same way as boost::array, but it has a built-in reference count, so the object is deleted when it is no longer needed. This should solve your memory leak.
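To make this concrete, here is a minimal sketch of the reference-counted buffer idea. READ_BLOCK_SIZE, the free functions and the handler signature are illustrative assumptions, not your original code:
#include <boost/asio.hpp>
#include <boost/bind.hpp>
#include <boost/shared_array.hpp>

enum { READ_BLOCK_SIZE = 1024 };   // illustrative

// Handler receives a copy of the shared_array, keeping the storage alive.
void handleRead(boost::shared_array<char> buf,
                const boost::system::error_code& ec,
                std::size_t bytesTransferred);

void asyncRead(boost::asio::ip::tcp::socket& socket)
{
    // The shared_array is copied into the bound handler, so the buffer stays
    // alive until the handler has run and is freed automatically afterwards.
    boost::shared_array<char> buf(new char[READ_BLOCK_SIZE]);
    socket.async_read_some(
        boost::asio::buffer(buf.get(), READ_BLOCK_SIZE),
        boost::bind(&handleRead, buf,
                    boost::asio::placeholders::error,
                    boost::asio::placeholders::bytes_transferred));
}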

Related

How to receive more than 65000 bytes in a C++ socket using recv()

I am developing a client-server application (TCP) on Linux using C++. I want to send more than 65,000 bytes at a time. In TCP, the maximum packet size is only 65,535 bytes.
How can I send all the bytes without loss?
Following is my code at server side.
// Receive the message from the client socket
if ((iByteCount = recv(GetSocketId(), buffer, MAXRECV, MSG_WAITALL)) > 0)
{
    printf("\n Received bytes %d\n", iByteCount);
    SetReceivedMessage(buffer);
    return LS_RESULT_OK;
}
If I use MSG_WAITALL it takes a long time to receive the bytes, so how can I set the flags to receive more than 1 million bytes at a time?
Edit: The MTU size is 1500 bytes, but the absolute limit on TCP packet size is 65,535.
Judging from the comments above, it seems you don't understand how recv works, or how it is supposed to be used.
You really want to call recv in a loop, until either you know that the expected amount of data has been received or until you get a "zero bytes read" result, which means the other end has closed the connection. Always, no exceptions.
If you need to do other things concurrently (likely, with a server process!) then you will probably want to check descriptor readiness with poll or epoll first. That lets you multiplex sockets as they become ready.
The reason why you want to do it that way, and never any differently, is that you don't know how the data will be packetized or how (and when) the packets will arrive. Plus, recv gives no guarantee about the amount of data read at a time. It will offer what it has in its buffers at the time you call it, no more and no less (it may block if there's nothing, but then you still have no guarantee that any particular amount of data will be returned when it resumes; it may still return e.g. 50 bytes!).
Even if you only send, say, 5,000 bytes total, it is perfectly valid behaviour for TCP to break this into 5 (or 10, or 20) packets, and for recv to return 500 (or 100, or 20, or 1) bytes at a time, every time you call it. That's just how it works.
TCP guarantees that anything you send will eventually arrive at the other end or produce an error. And, it guarantees that whatever you send arrives in order. It does not guarantee much else. Above all, it does not guarantee that any particular amount of data is ready at any given time.
You must be prepared for that, and the only way to do it is calling recv repeatedly. Otherwise you will always lose data under some circumstances.
MSG_WAITALL should in principle make it work the way you expect, but that is bad behaviour, and it is not guaranteed to work. If the socket (or some other structure in the network stack) runs against a soft or hard limit, it may not, and probably will not fulfill your request. Some limits are obscure, too. For example, the number for SO_RCVBUF must be twice as large as what you expect to receive under Linux, because of implementation details.
Correct behaviour of a server application should never depend on assumptions such as "it fits into the receive buffer". Your application needs to be prepared, in principle, to receive terabytes of data using a 1 kilobyte receive buffer, and in chunks of 1 byte at a time, if need be. A larger receive buffer will make it more efficient, but that's it... it still has to work either way.
The fact that you only see failures upwards of some "huge" limit is just luck (or rather, bad luck). The fact that it apparently "works fine" up to that limit suggests that what you do is correct, but it isn't. It's an unlucky coincidence that it works.
EDIT:
As requested in the comment below, here is what this could look like (the code is obviously untested, caveat emptor):
std::vector<char> result;
ssize_t size;
char recv_buf[250];

for (;;)
{
    if ((size = recv(fd, recv_buf, sizeof(recv_buf), 0)) > 0)
    {
        // append whatever arrived this time
        for (ssize_t i = 0; i < size; ++i)
            result.push_back(recv_buf[i]);
    }
    else if (size == 0)
    {
        // zero bytes read: the peer closed the connection
        if (result.size() < expected_size)
        {
            printf("premature close, expected %zu, only got %zu\n",
                   (size_t)expected_size, result.size());
        }
        else
        {
            do_something_with(result);
        }
        break;
    }
    else
    {
        perror("recv");
        exit(1);
    }
}
That will receive any amount of data you want (or until operator new throws bad_alloc after allocating a vector several hundred MiB in size, but that's a different story...).
If you want to handle several connections, you need to add poll or epoll or kqueue or similar functionality (or... fork); I'll leave that as an exercise for the reader.
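For what it's worth, a minimal readiness-check sketch with poll() might look like this (handle_readable() is a placeholder that would run the recv loop above for one socket; error handling is deliberately thin):
#include <poll.h>
#include <cstddef>
#include <cstdio>
#include <vector>

void handle_readable(int fd);   // placeholder: run the recv loop for this socket

void serve(std::vector<pollfd>& fds)   // one pollfd per connected socket, events = POLLIN
{
    for (;;)
    {
        if (poll(fds.data(), fds.size(), -1) < 0)   // block until something is ready
        {
            perror("poll");
            break;
        }
        for (std::size_t i = 0; i < fds.size(); ++i)
            if (fds[i].revents & POLLIN)
                handle_readable(fds[i].fd);         // only this socket has data waiting
    }
}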
It is possible that your problem is related to kernel socket buffer sizes. Try adding the following to your code:
int buffsize = 1024*1024;
setsockopt(s, SOL_SOCKET, SO_RCVBUF, &buffsize, sizeof(buffsize));
You might need to increase some sysctl variables too:
sysctl -w net.core.rmem_max=8388608
sysctl -w net.core.wmem_max=8388608
Note, however, that relying on TCP to fill your whole buffer is generally a bad idea. You should rather call recv() multiple times. The only good reason to receive more than 64K at once is improved performance. However, Linux should already have auto-tuning that progressively increases the buffer sizes as required.
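If you go this route, it may be worth reading the value back to see what the kernel actually applied (a small sketch; on Linux the value reported is double what you requested, because the kernel doubles it for bookkeeping overhead):
#include <sys/socket.h>
#include <cstdio>

void print_rcvbuf(int s)
{
    int applied = 0;
    socklen_t len = sizeof(applied);
    if (getsockopt(s, SOL_SOCKET, SO_RCVBUF, &applied, &len) == 0)
        printf("SO_RCVBUF is now %d bytes\n", applied);   // on Linux: twice the value passed to setsockopt
}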
"In TCP the max packet size is 65,535 bytes."
No it isn't. TCP is a byte-stream protocol over segments over IP packets, and the protocol has unlimited transmission sizes over any one connection. Look at all those 100MB downloads: how do you think they work?
Just send and receive the data. You'll get it.
I would suggest exploring kqueue or something similar. With event notification there is no need to loop on recv. Just call a simple read function upon an EV_READ event and make a single call to recv on the socket that triggered the event. Your function can have a buffer size of 10 bytes, or however much you want; it doesn't matter, because if you did not read the entire message the first time around you'll just get another EV_READ event on the socket and you call your read function again. When the other end closes the connection you'll get an EOF event. No need to hassle with loops that may or may not block other incoming connections.
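A rough kqueue sketch of that idea, for a single already-connected socket (BSD/macOS only; the tiny buffer and the function name are illustrative):
#include <sys/types.h>
#include <sys/event.h>
#include <sys/socket.h>
#include <unistd.h>

void kqueue_read_loop(int client_fd)
{
    int kq = kqueue();
    struct kevent change;
    EV_SET(&change, client_fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
    kevent(kq, &change, 1, NULL, 0, NULL);              // register interest in readability

    for (;;)
    {
        struct kevent event;
        if (kevent(kq, NULL, 0, &event, 1, NULL) <= 0)  // wait for one event
            break;
        char buf[10];                                   // deliberately tiny, as described above
        ssize_t got = recv((int)event.ident, buf, sizeof(buf), 0);
        if (got <= 0 || (event.flags & EV_EOF))
            break;                                      // peer closed or error
        // hand 'got' bytes to the application; if the message wasn't fully
        // read, another EV_READ event fires and we simply come back here
    }
    close(kq);
}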

Is this an appropriate use for shared_ptr?

Project: typical chat program. Server must receive text from multiple clients and fan each input out to all clients.
In the server I want each client to have a struct containing the socket fd and a std::queue. Each struct will be on a std::list.
As input is received from a client socket I want to iterate over the list of structs and put the new input into each client struct's queue. A string is new'ed because I don't want copies of the string multiplied over all the clients. But I also want to avoid the headache of having multiple pointers to the string spread around and deciding when it is time to finally delete the string.
Is this an appropriate occasion for a shared pointer? If so, is the shared_ptr's count incremented each time I push it into a queue and decremented when I pop it from the queue?
Thanks for any help.
This is a case where a pseudo-garbage collector system will work much better than reference counting.
You need only one list of strings, because you "fan every input out to all clients". Because you will add to one end and remove from the other, a deque is an appropriate data structure.
Now, each connection needs only to keep track of the index of the last string it sent. Periodically (every 1000th message received, or every 4MB received, or something like that), you find the minimum of this index across all clients, and delete strings up to that point. This periodic check is also an opportunity to detect clients which have fallen far behind (possible broken connection) and recover. Without this check, a single stuck client will cause your program to leak memory (even under the reference counting scheme).
This scheme uses several times less data than reference counting, and it also removes one of the major points of cache contention (reference counts must be written from multiple threads, so they ruin performance). If you aren't using threads, it'll still be faster.
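A bare-bones sketch of that scheme (names and the collection trigger are made up; error handling and the actual send are omitted):
#include <algorithm>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

struct Client
{
    int fd;
    std::size_t next_index;              // absolute index of the next message to send
};

struct Fanout
{
    std::deque<std::string> messages;    // one shared list of strings
    std::size_t trimmed;                 // how many old messages were dropped from the front
    std::vector<Client> clients;

    Fanout() : trimmed(0) {}

    void push(const std::string& msg) { messages.push_back(msg); }

    const std::string& message_for(const Client& c) const
    {
        return messages[c.next_index - trimmed];
    }

    // Run this periodically (e.g. every 1000th message): drop everything
    // that every client has already been sent.
    void collect()
    {
        std::size_t min_index = trimmed + messages.size();
        for (std::size_t i = 0; i < clients.size(); ++i)
            min_index = std::min(min_index, clients[i].next_index);
        while (trimmed < min_index && !messages.empty())
        {
            messages.pop_front();
            ++trimmed;
        }
    }
};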
That is an appropriate use of a shared_ptr. And yes, the use count will be incremented, because a new shared_ptr is created by each push.
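A minimal sketch of how that could look (struct and function names are invented; the send side is omitted):
#include <cstddef>
#include <list>
#include <memory>
#include <queue>
#include <string>

struct ClientSlot
{
    int fd;
    std::queue<std::shared_ptr<const std::string> > outgoing;
};

void fan_out(std::list<ClientSlot>& clients, const char* data, std::size_t len)
{
    // one allocation for the text, shared by every client queue
    std::shared_ptr<const std::string> msg =
        std::make_shared<const std::string>(data, len);         // use count == 1
    for (std::list<ClientSlot>::iterator it = clients.begin(); it != clients.end(); ++it)
        it->outgoing.push(msg);                                 // each push copies the shared_ptr
    // popping a queue entry destroys that copy; the string itself is deleted
    // only when the last copy is gone
}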

WSARecv, completion port model: how to manage buffers and avoid overruns?

My problem: my completion-port server will receive data of unknown size from different clients; the thing is, I don't know how to avoid buffer overruns, i.e. how to keep my (receiving) buffer from being "overfilled" with data.
Now to the questions:
1) If I make a receive call via WSARecv, does the worker thread work like a callback function? I mean, does it dig up the receive call only when it has completed, or does it also dig it up while the receiving is happening? Does the lpNumberOfBytes variable (from GetQueuedCompletionStatus) contain the number of bytes received so far or the total number of bytes received?
2) How do I avoid overruns? I thought of dynamically allocated buffer structures, but then again, how do I find out how big the packet is going to get?
Edit: I hate to ask this, but is there any "simple" method for managing the buffer and avoiding overruns? Synchronisation sounds off-limits to me, at least right now.
'If I make a receive call via WSARecv, does the worker thread work like a callback function?'
See valdo's post below. Completion data is queued to your pool of threads and one thread will be made ready to process it.
'I mean, does it dig up the receive call only when it has completed?' Yes - hence the name. Note that the meaning of 'completed' may vary depending on the protocol. With TCP, it means that some streamed data bytes have been received from the peer.
'Does the lpNumberOfBytes (from GetQueuedCompletionStatus) variable contain the number of bytes received so far or the total number of bytes received?' It contains the number of bytes received and loaded into the buffer array provided in that one IOCP completion only.
'How do I avoid overruns?' You cannot get overruns if you provide the buffer arrays - the kernel thread/s that load the buffer/s will not exceed the passed buffer lengths. At application level, given the streaming nature of TCP, it's up to you to decide how to process the buffer arrays into usable application-level protocol units. You must decide, using your knowledge of the services provided, on a suitable buffer management scheme.
My last IOCP server was somewhat general-purpose. I used an array of buffer pools and a pool of 'buffer-carrier' objects, allocated at startup (along with a pool of socket objects). Each buffer pool held buffers of a different size. Upon a new connection, I issued a WSARecv using one buffer from the smallest pool. If this buffer got completely filled, I used a buffer from the next largest pool for the next WSARecv, and so on.
Then there's the issue of the sequence numbers needed to prevent out-of-order buffering with multiple handler threads :(
1. A completion port is a sort of queue (with sophisticated logic concerning the priority of threads waiting to dequeue an I/O completion from it). Whenever an I/O completes (either successfully or not), it's queued onto the completion port. Then it's dequeued by one of the threads that called GetQueuedCompletionStatus.
So you never dequeue an I/O "in progress". Moreover, it's processed by your worker thread asynchronously. That is, it's deferred until your thread calls GetQueuedCompletionStatus.
2. This is actually a complex matter. Synchronization is not a trivial task overall, especially when it comes to symmetric multi-threading (where you have several threads, each of which may be doing anything).
One of the parameters you receive with a completed I/O is a pointer to an OVERLAPPED structure (the one you supplied to the function that issued the I/O, such as WSARecv). It's common practice to allocate your own structure that is based on OVERLAPPED (either inherits it or has it as the first member). Upon receiving a completion you may cast the dequeued OVERLAPPED to your actual data structure. There you may keep everything needed for synchronization: sync objects, state description, etc.
Note however that it's not a trivial task to synchronize things correctly (to get good performance and avoid deadlocks) even when you have the custom context. This demands an accurate design.
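A skeleton of the "extended OVERLAPPED" pattern described above (field names and the buffer size are illustrative):
#include <winsock2.h>
#include <windows.h>

struct PerIoContext
{
    OVERLAPPED overlapped;     // must be the first member (or use CONTAINING_RECORD)
    WSABUF     wsaBuf;         // points at 'buffer' when the WSARecv is issued
    char       buffer[4096];
    DWORD      sequenceNumber; // e.g. for re-ordering completions across worker threads
};

// In the worker thread, after GetQueuedCompletionStatus() returns an LPOVERLAPPED:
//   PerIoContext* ctx = CONTAINING_RECORD(pOverlapped, PerIoContext, overlapped);
// lpNumberOfBytes then says how many bytes that one completed WSARecv placed
// into ctx->buffer.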

Buffering Incomplete High Speed Reads

I am reading data of ~100 bytes at 100 Hz from a serial port. My buffer is 1024 bytes, so often my buffer doesn't get completely used. Sometimes, however, I get hiccups from the serial port and the buffer gets filled up.
My data is organized as a [header]data[checksum]. When my buffer gets filled up, sometimes a message/data is split across two reads from the serial port.
This is a simple problem, and I'm sure there are a lot of different approaches. I am ahead of schedule, so I would like to research different approaches. Could you name some paradigms that cover buffering high-speed data which might need to be put together from two reads? Note: the main difference I see between this problem and, say, other buffering I've done (image acquisition, TCP/IP) is that there we are guaranteed full packets/messages; here a "packet" may be split between reads, which we will only know once we start parsing the data.
Oh yes, note that the data buffered in from the read has to be parsed, so to keep things simple the data should be contiguous when it reaches the parser. (Plus, I don't think reassembly is the parser's responsibility.)
Some Ideas I Had:
Carry over unused bytes to my original buffer, then fill it with the read after the left over bytes from the previous read. (For example, we read 1024 bytes, 24 bytes are left at the end, they're a partial message, memcpy to the beginning of the read_buffer_, pass the beginning + 24 to read and read in 1024 - 24)
Create my own class that just gets blocks of data. It has two pointers, read/write and a large chunk of memory (1024 * 4). When you pass in the data, the class updates the write pointer correctly, wraps around to the beginning of its buffer when it reaches the end. I guess like a ring buffer?
I was thinking maybe using a std::vector<unsigned char>. Dynamic memory allocation, guaranteed to be contiguous.
Thanks for the info guys!
Define some 'APU' application-protocol-unit class that will represent your '[header]data[checksum]'. Give it some 'add' function that takes a char parameter and returns a 'valid' bool. In your serial read thread, create an APU and read some data into your 1024-byte buffer. Iterate the data in the buffer, pushing it into the APU add() until either the APU add() function returns true or the iteration is complete. If the add() returns true, you have a complete APU - queue it off for handling, create another one and start add()-ing the remaining buffer bytes to it. If the iteration is complete, loop back round to read more serial data.
The add() method would use a state-machine, or other mechanism, to build up and check the incoming bytes, returning 'true' only in the case of a full sanity-checked set of data with the correct checksum. If some part of the checking fails, the APU is 'reset' and waits to detect a valid header.
The APU could maybe parse the data itself, either byte-by-byte during the add() data input, just before add() returns with 'true', or perhaps as a separate 'parse()' method called later, perhaps by some other APU-processing thread.
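A very rough sketch of such an APU. The wire format assumed here (one sync byte, a one-byte length, the payload, then a one-byte additive checksum) is purely an invention to make the state machine concrete; a real implementation would mirror the actual [header]data[checksum] layout:
#include <cstdint>
#include <cstddef>
#include <vector>

class Apu
{
public:
    Apu() : state_(WaitSync), expected_(0) {}

    // Feed one byte; returns true exactly when a complete, checksum-valid
    // unit has been assembled (the payload is then available via payload()).
    bool add(uint8_t b)
    {
        switch (state_)
        {
        case WaitSync:                                  // hunt for the header byte
            if (b == 0xA5) { payload_.clear(); state_ = WaitLength; }
            return false;
        case WaitLength:                                // next byte is the payload length
            expected_ = b;
            state_ = (expected_ > 0) ? InPayload : InChecksum;
            return false;
        case InPayload:                                 // accumulate payload bytes
            payload_.push_back(b);
            if (payload_.size() == expected_) state_ = InChecksum;
            return false;
        case InChecksum:                                // final byte: verify and reset
        {
            uint8_t sum = 0;
            for (std::size_t i = 0; i < payload_.size(); ++i)
                sum = static_cast<uint8_t>(sum + payload_[i]);
            state_ = WaitSync;                          // reset whether the check passes or not
            return sum == b;
        }
        }
        return false;
    }

    const std::vector<uint8_t>& payload() const { return payload_; }

private:
    enum State { WaitSync, WaitLength, InPayload, InChecksum };
    State state_;
    std::vector<uint8_t> payload_;
    uint8_t expected_;
};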
When reading from a serial port at speed, you typically need some kind of handshaking mechanism to control the flow of data. This can be hardware (e.g. RTS/CTS), software (Xon/Xoff), or controlled by a higher level protocol. If you're reading a large amount of data at speed without handshaking, your UART or serial controller needs to be able to read and buffer all the available data at that speed to ensure no data loss. On 16550 compatible UARTs that you see on Windows PCs, this buffer is just 14 bytes, hence the need for handshaking or a real time OS.

TCP/IP Message Framing Examples

I'm trying to find concrete examples of how to break up an incoming stream of data on a TCP/IP socket and aggregate it in a buffer of some sort, so that I can find the messages in it (variable length, with header + delimiters) and extract them to reconstruct the messages for the receiving application.
Any good pointers/links/examples of an efficient way of doing this would be appreciated, as I couldn't find good examples online and I'm sure this problem has been solved efficiently by others in the past. In particular:
Efficient memory allocation for the aggregation buffer
Quickly finding the message boundaries so a message can be extracted from the buffer
Thanks
David
I've found that the simple method works pretty well.
Allocate a buffer of fixed size, double the size of your biggest message. One buffer. Keep a pointer to the end of the data in the buffer.
Allocation happens once. The next part is the message loop:
If not using blocking sockets, then poll or select here.
Read data into the buffer at the end-data pointer. Only read what will fit into the buffer.
Scan the new data for your delimiters with strchr. If you found a message:
memcpy the message into its own buffer. (Note: I do this because I was using threading and you probably should too.)
memmove the remaining buffer data to the beginning of the buffer and update the end of data pointer.
Call the processing function for the message. (Send it to the thread pool.)
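Putting those steps together for a single blocking socket, a rough sketch could look like this (newline-delimited messages are an assumption, handle_message() stands in for "send it to the thread pool", and memchr is used instead of strchr because the buffer is not NUL-terminated):
#include <cstring>
#include <string>
#include <sys/socket.h>

void handle_message(const std::string& msg);   // placeholder for the thread-pool hand-off

void message_loop(int fd)
{
    char buf[8192];      // double the size of the biggest expected message
    size_t end = 0;      // one past the last byte of valid data in buf

    for (;;)
    {
        ssize_t got = recv(fd, buf + end, sizeof(buf) - end, 0);
        if (got <= 0)
            break;                                   // error or peer closed
        end += static_cast<size_t>(got);

        // extract every complete message currently in the buffer
        char* delim;
        while ((delim = static_cast<char*>(memchr(buf, '\n', end))) != NULL)
        {
            size_t msg_len = static_cast<size_t>(delim - buf) + 1;
            handle_message(std::string(buf, msg_len));    // copy the message out
            memmove(buf, buf + msg_len, end - msg_len);   // shift the remainder down
            end -= msg_len;
        }
    }
}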
There are more complicated methods. I haven't found them worth the bother in the end but you might depending on circumstances.
You could use a circular buffer with beginning and end of data pointers. Lots of hassle keeping track and computing remaining space, etc.
You could allocate a new buffer after finding each message. You wouldn't have to copy so much data around. You do still have to move the excess data into a new message buffer after finding the delimiter.
Do not think that dumb tricks like reading one byte at a time out of the socket will improve performance. Every system call round-trip makes an 8 kB memmove look cheap.