Sudden receive-buffer buildup on CentOS for C++ application

I have a somewhat strange problem when receiving UDP data on CentOS. When my application receives data, everything is fine at first and all packets are received as expected; then all of a sudden the kernel receive buffer (net.core.rmem) starts to fill up for no apparent reason until it's full and packets are dropped. The strange part is that the buffer stays more or less empty until it suddenly starts to grow dramatically, even though the sending party sends at the same rate as before. It's as if a resource I haven't accounted for is depleted, or the operating system changes the priority of the thread dedicated to receive operations. Data is still received and read by receive(), but the buffer fills faster than it is drained.
Files are sent from the sending application to the receiving application. There is a unidirectional gateway between the sending and receiving application, so there is no possibility of congestion control (or TCP) whatsoever. The problem only arises when I send a big file (around 3 GiB or more). Everything works fine when I send multiple small files, even if the sum of their sizes is much larger than 3 GiB.
At the moment I'm unable to narrow the problem down any further, and I'm pretty stumped as to what could be wrong. Below is the information about the system's configuration that I can imagine being relevant; I've looked into memory leaks, disk usage and buffer sizes without being able to find anything specific.
Data is sent at a rate of 100 Mbit/s.
MTU is 9000 on both the sending and receiving machine.
net.core.rmem_max/net.core.rmem_default is set to 536870912 bytes (huge).
net.core.netdev_max_backlog is set to 65536 (huge).
Each UDP packet sent is 8192 bytes, excluding the UDP header.
A temporary file created through tmpfile() is used to store the data for each file.
The temporary file is closed as soon as the file is completed (hashsum is verified).
CPU usage when receiving files is consistently at 100%.
Memory usage when receiving files is consistently at 0.5%.
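For scale (a back-of-the-envelope check using only the figures above): 100 Mbit/s is roughly 12.5 MB/s, so one 8192-byte datagram arrives about every 0.65 ms, and the 536870912-byte receive buffer holds on the order of 43 seconds of traffic. It can therefore only fill up if the reading side falls behind by a sustained margin, not because of a brief hiccup.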
receive()
std::vector<uint8_t>* vector = new std::vector<uint8_t>();
while (signal == 0)
{
    ret = _serverio->Receive(*vector);
    if (ret == -1 || ret == 0)   // timeout, interrupt or empty datagram: reuse the buffer
    {
        continue;
    }
    else
    {
        Produce(vector);                      // hand the filled buffer to the consumer
        vector = new std::vector<uint8_t>();  // allocate a fresh buffer for the next datagram
    }
}
_serverio->Receive(std::vector<uint8_t>& data)
ssize_t n;
data.resize(UDPMAXMSG);
int res = m_fdwait.Wait(m_timeoutms);   // wait until the socket is readable or the timeout expires
if(res < 1) {
    data.resize(0);
    return res; // timeout or interrupt
}
n = read(m_servfd, &(data[0]), data.size());
if(n < 0) {
    if(errno == EINTR) {
        data.resize(0);
        return -1;
    }
    else {
        throw socket_error(errno, "UDPServer::Receive");
    }
}
data.resize(n);                         // shrink to the size of the datagram actually read
return n;
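For context, the implementation of m_fdwait.Wait() is not shown above. Purely as an illustration (a guess at its shape, not the actual implementation, and the class and member names here are hypothetical), it could be a thin wrapper around poll():

#include <poll.h>

// Hypothetical sketch only -- the real FdWait class is not part of the question.
// Returns >0 when the socket is readable, 0 on timeout, <0 on error (e.g. EINTR).
int FdWait::Wait(int timeoutms)
{
    struct pollfd pfd;
    pfd.fd = m_fd;           // the UDP socket being watched (assumed member)
    pfd.events = POLLIN;
    pfd.revents = 0;
    return poll(&pfd, 1, timeoutms);
}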
Produce(std::vector<uint8_t>* vector)
_producerSemaphore.aquire();    // wait for a free slot in the queue
_queue.lock();
_buffer.push_back(vector);
_queue.unlock();
_consumerSemaphore.release();   // signal the consumer that data is available
Consume()
bool aquired = false;
while (!aquired)
{
    if (_terminated)
    {
        // Consume should return NULL immediately if
        // receiver has been terminated.
        return NULL;
    }
    aquired = _consumerSemaphore.aquire_timeout(1);
}
std::vector<uint8_t>* vector = NULL;
_queue.lock();
vector = _buffer.front();
_buffer.pop_front();
_queue.unlock();
_producerSemaphore.release();
return vector;
recv_buffer.sh (for monitoring the receive buffer)
_PRE_BUFFER=0
while true ; do
    # Column 5 of /proc/net/udp is "tx_queue:rx_queue" in hex; 2710 is presumably
    # the socket's local port in hex (0x2710 = 10000). Take the rx_queue part.
    _BUFFER_VALUE=$(printf "%d" "0x$(grep 2710 /proc/net/udp \
        | awk '{print $5}' | sed 's/.*://')")
    _DELTA=$(expr $_BUFFER_VALUE - $_PRE_BUFFER)
    _PRE_BUFFER=$_BUFFER_VALUE
    echo "$_BUFFER_VALUE : $_DELTA"
    sleep 0.1
done
recv_buffer.sh output
buffer-size delta
0 0
0 0
...
10792 10792
10792 0
0 -10792
10792 10792
0 -10792
0 0
0 0
0 0 // This type of pattern goes on for 2.5 GiB
...
0 0
0 0
0 0
0 0
971280 971280 // At this point the buffer starts to fill
1823848 852568
1931768 107920
2039688 107920
2179984 140296
2287904 107920
2406616 118712
2525328 118712
2644040 118712
2741168 97128
2881464 140296
3010968 129504
3140472 129504
...
533567272 647520
536038640 2471368
536675368 636728
536880416 205048 // At this point packets are dropped
536869624 -10792
536880416 10792
536880416 0
536869624 -10792
536880416 10792
536869624 -10792
536880416 10792
536880416 0
536880416 0
536880416 0
536880416 0
536880416 0

Related

Recv function for TCP Socket programming

I am new to socket programming. I am trying to create a client application. The server is a camera which communicates using TCP. The camera sends continuous data. Using Wireshark, I can see that the camera is sending a continuous stream of packets of different sizes, but never more than 1514 bytes. But my recv function always returns 2000, which is the size of my buffer.
unsigned char buf[2000];
int bytesIn = recv(sock, (char*)buf, sizeof(buf), 0);
if (bytesIn > 0)
{
    std::cout << bytesIn << std::endl;
}
The first packet I receive is 9 bytes, which recv reports correctly, but after that it always returns 2000.
Can anyone please tell me how I can get the correct size of the actual data payload?
EDIT
int bytesIn = recv(sock, (char*)buf, sizeof(buf), 0);
if (bytesIn > 0)
{
    while (bytes != 1514)
    {
        if (count == 221184)
        {
            break;
        }
        buffer[count++] = buf[bytes++];
    }
    std::cout << count;
}
EDIT:
Here is my Wireshark capture:
My Code to handle packets
int bytesIn = recv(sock, (char*)&buf, sizeof(buf), 0);
if (bytesIn > 0)
{
    if (flag1 == true)
    {
        while ((bytes != 1460 && (buf[bytes] != 0)) && _fillFlag)
        {
            buffer[fill++] = buf[bytes++];
            if (fill == 221184)
            {
                flag1 = false;
                _fillFlag = false;
                fill = 0;
                queue.Enqueue(buffer, sizeof(buffer));
                break;
            }
        }
    }
    if ((strncmp(buf, _string2, 10) == 0))
    {
        flag1 = true;
    }
}
For each frame, the camera sends 221184 bytes, and after each frame it sends a 9-byte packet; these 9 bytes are constant, and I use them as a marker.
The 221184 bytes sent by the camera never contain a 0, so I use that condition in the while loop. This code works and shows the frames, but after a few frames it shows a completely black frame. I think the mistake is in how I receive the packets.
Size per frame: 221184 bytes (fixed)
Size per recv: 0 ~ 1514 bytes
My implementation is here:
DWORD MakeFrame(int socket)
{
    INT nFrameSize = 221184;
    INT nSizeToRecv = 221184;
    INT nRecvSize = 2000;
    INT nReceived = 0;
    INT nTotalReceived = 0;
    BYTE byCamera[2000] = { 0 };   // byCamera size = nRecvSize
    BYTE byFrame[221184] = { 0 };  // byFrame size = nFrameSize
    while (0 != nSizeToRecv)
    {
        nRecvSize = min(2000, nSizeToRecv);
        nReceived = recv(socket, (char*)byCamera, nRecvSize, 0);
        if (nReceived <= 0)
            break;                 // connection closed or error; avoid corrupting the frame
        memcpy_s(byFrame + nTotalReceived, nFrameSize, byCamera, nReceived);
        nSizeToRecv -= nReceived;
        nTotalReceived += nReceived;
    }
    // byFrame is ready to use.
    // ...
    // ...
    return WSAGetLastError();
}
TCP is not a packet-oriented but a stream-oriented transport protocol. There is no notion of packets in TCP (apart, perhaps, from the MTU). If you want to work in packets, you either have to use UDP (which is in fact packet-oriented, but by default not reliable with respect to ordering, loss and the like) or you have to implement your own packet logic on top of TCP, i.e. read from the stream and partition the data into logical packets once they have been received.
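To make that concrete, here is a minimal sketch of such packet logic for a fixed-size frame. It is POSIX-style and purely illustrative: the function name is made up, and on Winsock the buffer/length types and error reporting differ slightly.

#include <sys/types.h>
#include <sys/socket.h>
#include <cstddef>

// Keep calling recv() until exactly 'len' bytes have arrived.
// Returns true on success, false if the peer closed the connection or an error occurred.
bool RecvExactly(int sock, unsigned char* dst, std::size_t len)
{
    std::size_t total = 0;
    while (total < len)
    {
        ssize_t n = recv(sock, dst + total, len - total, 0);
        if (n <= 0)          // 0 = orderly shutdown, <0 = error
            return false;
        total += static_cast<std::size_t>(n);
    }
    return true;
}

// For the camera described above this could be used roughly as:
//   unsigned char frame[221184], trailer[9];
//   if (RecvExactly(sock, frame, sizeof(frame)) && RecvExactly(sock, trailer, sizeof(trailer)))
//       { /* one complete frame plus its 9-byte marker */ }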

Does v4l2 camera capture with mmap ring buffer make sense for tracking application

I'm working with the v4l2 API for capturing images from a raw sensor on an embedded platform. My capture routine is based on the example in [1]. The proposed method for streaming uses mmapped buffers as a ring buffer.
For initialization, buffers (4 by default) are requested using ioctl with the VIDIOC_REQBUFS identifier. Subsequently, they are queued using VIDIOC_QBUF. The entire streaming procedure is described in [2]. As soon as streaming starts, the driver fills the queued buffers with data. The timestamp of the v4l2_buffer struct indicates the time the first byte was captured, which in my case gives a time interval of approximately 8.3 ms (= 120 fps) between buffers. So far so good.
Now, what I would expect of a ring buffer is that new captures automatically overwrite older ones in a circular fashion. But this is not what happens. Only when a buffer is queued again (VIDIOC_QBUF), after it has been dequeued (VIDIOC_DQBUF) and processed (demosaic, tracking step, ...), is a new frame assigned to it. Even if I meet the timing condition (processing < 8.3 ms), dequeuing does not give me the latest captured frame but the oldest one (FIFO order), i.e. the one captured 3 × 8.3 ms before the current one. If the timing condition is not met, the time span gets even larger, as the buffers are not overwritten.
So I have several questions:
1. Does it even make sense for this tracking application to have a ring buffer, as I don't really need a history of frames? I certainly doubt it, but with the proposed mmap method most drivers require a minimum number of buffers to be requested.
2. Should a separate thread continuously DQBUF and QBUF to accomplish the buffer overwrite? How could this be done?
3. As a workaround one could probably dequeue and requeue all buffers on every capture, but this doesn't sound right (a rough sketch of that idea follows this list). Is there someone with more experience in real-time capture and streaming who can point to the "proper" way to go?
4. Also, currently I'm doing the preprocessing step (demosaicing) between DQBUF and QBUF and the tracking step afterwards. Should the tracking step also be executed before QBUF is called again?
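For what it's worth, here is a rough sketch of the drain-and-requeue idea from points 2 and 3. It is only an illustration, not taken from the Capture() routine below, and it assumes the device fd was opened with O_NONBLOCK so that VIDIOC_DQBUF fails with EAGAIN once the queue is empty; the function and variable names are made up.

#include <cerrno>
#include <cstdio>
#include <cstring>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

// Dequeue every buffer the driver has already filled, immediately requeue all but
// the newest one, and hand the newest back to the caller (who processes it and
// then queues it again). Returns false if no buffer was ready.
static bool DequeueNewest(int fd, v4l2_buffer& newest)
{
    bool haveFrame = false;
    for (;;)
    {
        v4l2_buffer buf;
        std::memset(&buf, 0, sizeof(buf));
        buf.type   = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        buf.memory = V4L2_MEMORY_MMAP;

        if (ioctl(fd, VIDIOC_DQBUF, &buf) < 0)
        {
            if (errno != EAGAIN)
                std::perror("VIDIOC_DQBUF");
            break;                            // queue drained (or a real error)
        }
        if (haveFrame)
            ioctl(fd, VIDIOC_QBUF, &newest);  // requeue the stale frame straight away
        newest = buf;
        haveFrame = true;
    }
    return haveFrame;
}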
So the main code basically performs Capture() and Track() in sequence in a while loop. The Capture routine looks as follows:
cv::Mat v4l2Camera::Capture( size_t timeout ) {
    fd_set fds;
    FD_ZERO(&fds);
    FD_SET(mFD, &fds);

    struct timeval tv;
    tv.tv_sec = 0;
    tv.tv_usec = 0;

    const bool threaded = true; //false;

    // proper register settings
    this->format2registerSetting();
    //
    if( timeout > 0 )
    {
        tv.tv_sec = timeout / 1000;
        tv.tv_usec = (timeout - (tv.tv_sec * 1000)) * 1000;
    }
    //
    const int result = select(mFD + 1, &fds, NULL, NULL, &tv);
    if( result == -1 )
    {
        //if (EINTR == errno)
        printf("v4l2 -- select() failed (errno=%i) (%s)\n", errno, strerror(errno));
        return cv::Mat();
    }
    else if( result == 0 )
    {
        if( timeout > 0 )
            printf("v4l2 -- select() timed out...\n");
        return cv::Mat(); // timeout, not necessarily an error (TRY_AGAIN)
    }

    // dequeue input buffer from V4L2
    struct v4l2_buffer buf;
    memset(&buf, 0, sizeof(v4l2_buffer));
    buf.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    buf.memory = V4L2_MEMORY_MMAP; //V4L2_MEMORY_USERPTR;

    if( xioctl(mFD, VIDIOC_DQBUF, &buf) < 0 )
    {
        printf("v4l2 -- ioctl(VIDIOC_DQBUF) failed (errno=%i) (%s)\n", errno, strerror(errno));
        return cv::Mat();
    }

    if( buf.index >= mBufferCountMMap )
    {
        printf("v4l2 -- invalid mmap buffer index (%u)\n", buf.index);
        return cv::Mat();
    }

    // emit ringbuffer entry
    printf("v4l2 -- received %ux%u video frame (index=%u)\n", mWidth, mHeight, (uint32_t)buf.index);
    void* image_ptr = mBuffersMMap[buf.index].ptr;

    // frame processing (& tracking step)
    cv::Mat demosaic_mat = demosaic(image_ptr, mSize, mDepth, 1);

    // re-queue buffer to V4L2
    if( xioctl(mFD, VIDIOC_QBUF, &buf) < 0 )
        printf("v4l2 -- ioctl(VIDIOC_QBUF) failed (errno=%i) (%s)\n", errno, strerror(errno));

    return demosaic_mat;
}
As my knowledge of video capture and streaming is limited, I appreciate any help.

Linux poll on serial transmission end

I'm implementing RS485 on an ARM development board, using a serial port and a GPIO pin for data enable.
I set data enable high before sending, and I want it to be set low after the transmission is complete.
It can be simply done by writing:
//fd = open("/dev/ttyO2", ...);
DataEnable.Set(true);
write(fd, data, datalen);
tcdrain(fd); //Wait until all data is sent
DataEnable.Set(false);
I wanted to change from blocking mode to non-blocking and use poll with the fd. But I don't see any poll event corresponding to 'transmission complete'.
How can I get notified when all data has been sent?
System: linux
Language: c++
Board: BeagleBone Black
I don't think it's possible. You'll either have to run tcdrain in another thread and have it notify the main thread, or use a timeout on poll and check whether the output has been drained.
You can use the TIOCOUTQ ioctl to get the number of bytes left in the output buffer and tune the timeout according to the baud rate. That should reduce the amount of polling you need to do to just once or twice. Something like:
enum { writing, draining, idle } write_state;

while (1) {
    int write_event, timeout = -1;
    ...
    if (write_state == writing) {
        poll_fds[poll_len].fd = write_fd;
        poll_fds[poll_len].events = POLLOUT;
        write_event = poll_len++;
    } else if (write_state == draining) {
        int outq;
        ioctl(write_fd, TIOCOUTQ, &outq);
        if (outq == 0) {
            DataEnable.Set(false);
            write_state = idle;
        } else {
            // 10 bits per byte, 1000 milliseconds in a second
            timeout = outq * 10 * 1000 / baud_rate;
            if (timeout < 1) {
                timeout = 1;
            }
        }
    }

    int r = poll(poll_fds, poll_len, timeout);
    ...
    if (write_state == writing && r > 0 && (poll_fds[write_event].revents & POLLOUT)) {
        DataEnable.Set(true); // Gets set even if already set.
        int n = write(write_fd, write_data, write_datalen);
        write_data += n;
        write_datalen -= n;
        if (write_datalen == 0) {
            write_state = draining;
        }
    }
}
Stale thread, but I have been working on RS-485 with a 16550-compatible UART under Linux and found that:
tcdrain works, but it adds a delay of 10 to 20 ms. It seems to be polled.
The value returned by TIOCOUTQ seems to count bytes in the OS buffer, but NOT bytes in the UART FIFO, so it may underestimate the delay required if transmission has already started.
I am currently using CLOCK_MONOTONIC to timestamp each send, calculating when the send should be complete, checking that time before the next send, and delaying if necessary. Ugly, but it seems to work.
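For reference, a minimal sketch of that timestamp approach (the function name, the 10-bits-per-byte figure and the omission of error handling are all illustrative; adjust for your parity and stop-bit settings):

#include <ctime>
#include <unistd.h>

// Record, right after write(), when the last byte should have left the UART.
static void WriteAndStampCompletion(int fd, const char* data, size_t len,
                                    long baud_rate, timespec& done_at)
{
    write(fd, data, len);                     // error handling omitted for brevity
    clock_gettime(CLOCK_MONOTONIC, &done_at);

    // Roughly 10 bit times per byte on the wire (start + 8 data + stop).
    long long ns = static_cast<long long>(len) * 10 * 1000000000LL / baud_rate;
    done_at.tv_sec  += ns / 1000000000;
    done_at.tv_nsec += ns % 1000000000;
    if (done_at.tv_nsec >= 1000000000)
    {
        done_at.tv_sec  += 1;
        done_at.tv_nsec -= 1000000000;
    }
    // Before the next send (or before dropping the data-enable line), compare the
    // current CLOCK_MONOTONIC time against done_at and sleep for the difference.
}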

Boost asio tcp socket available reports incorrect number of bytes

In an SSL client-server model, I use the code below to read data from the socket on either the client or the server side.
I only read data when there is data available. To know when there is data available, I check the available() method on the lowest_layer() of the asio::ssl::stream.
After I send 380 bytes from the client to the server and enter the read method on the server, I see the following.
‘s’ is the buffer I supplied.
‘n’ is the size of the buffer I supplied.
‘a1’ is the result of available() before the read and will report 458 bytes.
‘r’ is the number of bytes actually read. It will report 380, which is correct.
‘a2’ is the result of available() after the read and will report 0 bytes. This is what I expect, since my client sent 380 bytes and I have read them all.
Why does the first call to available() report too many bytes?
Types:
/**
 * Type used as SSL Socket. Handles SSL and socket functionality.
 */
typedef boost::asio::ssl::stream<boost::asio::ip::tcp::socket> SslSocket;

/**
 * A shared pointer version of the SSL Socket type.
 */
typedef boost::shared_ptr<SslSocket> ShpSslSocket;
Members:
ShpSslSocket m_shpSecureSocket;
Part of the read method:
std::size_t a1 = 0;
if ((a1 = m_shpSecureSocket->lowest_layer().available()) > 0)
{
    r += boost::asio::read(*m_shpSecureSocket,
                           boost::asio::buffer(s, n),
                           boost::asio::transfer_at_least(1));
}
std::size_t a2 = m_shpSecureSocket->lowest_layer().available();
Added info:
So I changed my read method to check more thoroughly whether there is still data available to be read from the boost::asio::ssl::stream. Not only do I need to check whether data is available at the socket level; there may also be data sitting in an OpenSSL buffer somewhere. SSL_peek does the trick. Besides checking for available data, the method also checks the TCP port status, and it keeps doing all this as long as there is no timeout.
Here is the complete read method of the boost::iostreams::device class that I created.
std::streamsize SslClientSocketDevice::read(char* s, std::streamsize n)
{
    // Request from the stream/device to receive/read bytes.
    std::streamsize r = 0;
    LIB_PROCESS::TcpState eActualState = LIB_PROCESS::TCP_NOT_EXIST;
    char chSslPeekBuf; // 1 byte peek buffer

    // Check that there is data available. If not, wait for it.
    // The check is on the lowest layer (TCP). In that layer the data is encrypted.
    // The number of encrypted bytes is most often different from the number
    // of unencrypted bytes that would be read from the secure socket.
    // Also: data may be read by OpenSSL from the socket and remain in an
    // OpenSSL buffer somewhere. We also check that.
    boost::posix_time::ptime start = BOOST_UTC_NOW;
    int nSslPeek = 0;
    std::size_t nAvailTcp = 0;
    while ((*m_shpConnected) &&
           (LIB_PROCESS::IpMonitor::CheckPortStatusEquals(GetLocalEndPoint(),
                                                          GetRemoteEndPoint(),
                                                          ms_ciAllowedStates,
                                                          eActualState)) &&
           ((nAvailTcp = m_shpSecureSocket->lowest_layer().available()) == 0) &&
           ((nSslPeek = SSL_peek(m_shpSecureSocket->native_handle(), &chSslPeekBuf, 1)) <= 0) && // May return error (<0) as well
           ((start + m_oReadTimeout) > BOOST_UTC_NOW))
    {
        boost::this_thread::sleep(boost::posix_time::millisec(10));
    }

    // Always read data when there is data available, even if the state is no longer valid.
    // Data may be reported by the TCP socket (number of encrypted bytes) or have already
    // been read by SSL and not yet returned to us.
    // The remote party can have sent data and closed the socket immediately.
    if ((nAvailTcp > 0) || (nSslPeek > 0))
    {
        r += boost::asio::read(*m_shpSecureSocket,
                               boost::asio::buffer(s, n),
                               boost::asio::transfer_at_least(1));
    }

    // Close the socket when the state is not valid.
    if ((eActualState & ms_ciAllowedStates) == 0x00)
    {
        LOG4CXX_INFO(LOG4CXX_LOGGER, "TCP socket not/no longer connected. State is: " <<
                     LIB_PROCESS::IpMonitor::TcpStateToString(eActualState));
        LOG4CXX_INFO(LOG4CXX_LOGGER, "Disconnecting socket.");
        Disconnect();
    }

    if (! (*m_shpConnected))
    {
        if (r == 0)
        {
            r = -1; // Signal stream is closed if no data was retrieved.
            ThrowExceptionStreamFFL("TCP socket not/no longer connected.");
        }
    }
    return r;
}
So maybe I know why this is. It is an SSL connection, and therefore the transferred bytes are encrypted. Encrypted data may well be of a different size, because each TLS record adds a header plus MAC and padding on top of the plaintext. I guess that answers the question of why the number of bytes available at the TCP level differs from the number of bytes that comes out of a read.

pthread scheduling problems

I have two threads in a producer-consumer pattern. The code works, but then the consumer thread gets starved, and then the producer thread gets starved.
When it is working, the program outputs:
Send Data...semValue = 1
Recv Data...semValue = 0
Send Data...semValue = 1
Recv Data...semValue = 0
Send Data...semValue = 1
Recv Data...semValue = 0
Then something changes, the threads get starved, and the program outputs:
Send Data...semValue = 1
Send Data...semValue = 2
Send Data...semValue = 3
...
Send Data...semValue = 256
Send Data...semValue = 257
Send Data...semValue = 258
Recv Data...semValue = 257
Recv Data...semValue = 256
Recv Data...semValue = 255
...
Recv Data...semValue = 0
Send Data...semValue = 1
Recv Data...semValue = 0
Send Data...semValue = 1
Recv Data...semValue = 0
I know threads are scheduled by the OS and can run at different rates and in random order. My question: when I call YieldThread (which calls pthread_yield), shouldn't the Talker give the Listener a chance to run? Why am I getting this bizarre scheduling?
A snippet of the code is below. The Thread and Semaphore classes are abstraction classes. I stripped out the queue used for passing data between the threads so I could eliminate that variable.
const int LOOP_FOREVER = 1;

class Listener : public Thread
{
public:
    Listener(Semaphore* dataReadySemaphorePtr)
        : Thread("Listener"),
          dataReadySemaphorePtr(dataReadySemaphorePtr)
    {
        //Intentionally left blank.
    }

private:
    void ThreadTask(void)
    {
        while (LOOP_FOREVER)
        {
            this->dataReadySemaphorePtr->Wait();
            printf("Recv Data...");
            YieldThread();
        }
    }

    Semaphore* dataReadySemaphorePtr;
};

class Talker : public Thread
{
public:
    Talker(Semaphore* dataReadySemaphorePtr)
        : Thread("Talker"),
          dataReadySemaphorePtr(dataReadySemaphorePtr)
    {
        //Intentionally left blank
    }

private:
    void ThreadTask(void)
    {
        while (LOOP_FOREVER)
        {
            printf("Send Data...");
            this->dataReadySemaphorePtr->Post();
            YieldThread();
        }
    }

    Semaphore* dataReadySemaphorePtr;
};

int main()
{
    Semaphore dataReadySemaphore(0);
    Listener listener(&dataReadySemaphore);
    Talker talker(&dataReadySemaphore);

    listener.StartThread();
    talker.StartThread();

    while (LOOP_FOREVER); //Wait here so threads can run
}
No. Unless you are using a lock to prevent it, even if one thread yields its quantum, there's no requirement that the other thread receives the next quantum.
In a multithreaded environment, you can never ever ever make assumptions about how processor time is going to be scheduled; if you need to enforce correct behavior, use a lock.
Believe it or not, it runs that way because it's more efficient. Every time the processor switches between threads, it performs a context switch that wastes a certain amount of time. My advice is to let it go unless you have another requirement like a maximum latency or queue size, in which case you need another semaphore for "ready for more data" in addition to your "data ready for listening" one.
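To illustrate that last suggestion, here is a minimal self-contained sketch of a bounded producer-consumer with two semaphores. It uses C++20 std::counting_semaphore and a plain std::deque instead of the question's own Semaphore/Thread classes; the capacity and names are my own choices, not the original code.

#include <cstdio>
#include <deque>
#include <mutex>
#include <semaphore>
#include <thread>

std::deque<int> dataQueue;
std::mutex queueMutex;
std::counting_semaphore<16> slotsFree(16);  // "ready for more data"
std::counting_semaphore<16> dataReady(0);   // "data ready for listening"

void Talker()
{
    for (int i = 0; i < 100; ++i)
    {
        slotsFree.acquire();                // blocks once the queue holds 16 items
        {
            std::lock_guard<std::mutex> lock(queueMutex);
            dataQueue.push_back(i);
        }
        dataReady.release();
        std::printf("Send Data...\n");
    }
}

void Listener()
{
    for (int i = 0; i < 100; ++i)
    {
        dataReady.acquire();                // blocks while the queue is empty
        int value;
        {
            std::lock_guard<std::mutex> lock(queueMutex);
            value = dataQueue.front();
            dataQueue.pop_front();
        }
        slotsFree.release();
        std::printf("Recv Data... %d\n", value);
    }
}

int main()
{
    std::thread talker(Talker);
    std::thread listener(Listener);
    talker.join();
    listener.join();
}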