UDP - lost data during microbursts - C++

The code below runs great (i.e. doesn't drop messages) 99.9% of the time. But when there's a microburst of datagrams coming in at a rate of 2-3 microseconds between datagrams, I experience data loss. The Boost notify_one() call takes 5 to 10 microseconds to complete, so that by itself is the key bottleneck under these conditions. Any suggestions on how to improve performance?
Receiver/"producer" code thread:
if (bytes_recvd > 0) {
    InQ.mut.lock();
    string t;
    t.append(data_, bytes_recvd);
    InQ.msg_queue.push(t);   // < 1 microsec
    InQ.mut.unlock();
    InQ.cond.notify_one();   // 5 - 10 microsecs
}
Consumer code thread:
//snip......
std::string s;
while (1) {
    InQ.mut.lock();
    if (!InQ.msg_queue.empty()) {
        s.clear();
        s = InQ.msg_queue.front();
        InQ.msg_queue.pop();
    }
    InQ.mut.unlock();
    if (s.length()) {
        processDatagram((char *)s.c_str(), s.length());
        s.clear();
    }
    boost::mutex::scoped_lock lock(InQ.mut);
    InQ.cond.wait(lock);
}

Just change
if (!InQ.msg_queue.empty()) {
to
while (!InQ.msg_queue.empty()) {
That way packets don't have to wake the thread to get processed; if the thread is already awake and busy, it will see the new packet before sleeping.
Ok, it's not quite that simple, because you need to release the lock between packets, but the idea will work -- before sleeping, check whether the queue is empty. A sketch of the reworked consumer follows.
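Here is a minimal sketch of that consumer, assuming the same InQ structure as in the question; the wait predicate is re-checked on every wakeup, and the lock is dropped while each datagram is processed:

std::string s;
while (1) {
    boost::mutex::scoped_lock lock(InQ.mut);
    while (InQ.msg_queue.empty())
        InQ.cond.wait(lock);               // sleep only when there is nothing to do
    while (!InQ.msg_queue.empty()) {
        s = InQ.msg_queue.front();
        InQ.msg_queue.pop();
        lock.unlock();                     // release the lock while processing
        processDatagram((char *)s.c_str(), s.length());
        lock.lock();
    }
}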

If you're losing data, try increasing your socket receive buffer size. If you're using boost::asio, look into this option: boost::asio::socket_base::receive_buffer_size. Generally for our high-throughput UDP applications we set the socket buffer size to 1MB (more in some cases).
Also, make sure that the buffers you're using in your receive calls are not too large; they should only be large enough to handle your maximum expected datagram size (which is obviously implementation-dependent).
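For example, with boost::asio the option can be set and then read back to confirm what the kernel actually granted (a sketch; socket_ stands for your already-opened UDP socket):

boost::asio::socket_base::receive_buffer_size option(1024 * 1024); // ask for 1MB
socket_.set_option(option);

boost::asio::socket_base::receive_buffer_size granted;
socket_.get_option(granted);
std::cout << "receive buffer: " << granted.value() << " bytes" << std::endl;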

Your obvious bottleneck is in the condition variable signalling.
Your main hope would be in using a lock-free queue implementation. This is probably an obvious statement to you.
The only way to really get the lock-free queue to work for you, of course, is if you have multiple cores and don't mind dedicating one to the consuming task.

Some general suggestions:
Increase socket receive buffer size.
Read all available datagrams, then pass them all on for processing.
Avoid data copying, pass pointers around.
Reduce lock scope to the absolute minimum, say, only push/pop a pointer onto/off the queue under that mutex (see the sketch after this list).
If all of the above fails you, look into lock-free data structures to pass data around.
Hope this helps.
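For instance, combining "pass pointers around" with the minimal lock scope might look like this on the producer side (a sketch using C++11 facilities, needing <queue>, <memory>, <mutex> and <condition_variable>; the queue holds smart pointers so only pointer-sized work happens under the mutex):

std::queue<std::unique_ptr<std::string> > msg_queue;
std::mutex mut;
std::condition_variable cond;

// producer: the copy happens outside the lock
std::unique_ptr<std::string> msg(new std::string(data_, bytes_recvd));
{
    std::lock_guard<std::mutex> guard(mut);
    msg_queue.push(std::move(msg));      // pointer-sized critical section
}
cond.notify_one();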


How to match processing time with reception time in C++ multithreading

I'm writing a C++ application in which I'll receive 4096 bytes of data every 0.5 seconds. This is processed, and the output will be sent to some other application. Processing each set of data takes nearly 2 seconds.
This is how exactly I'm doing it:
In my main function, I'm receiving the data and pushing it into a vector.
I've created a thread which will always process the first element and delete it immediately after processing. Below is a simulation of my application's receiving part.
#include <iostream>
#include <unistd.h>
#include <vector>
#include <mutex>
#include <pthread.h>
using namespace std;

struct Student {
    int id;
    int age;
};

vector<Student> dustBin;
pthread_mutex_t lock1;
bool isEven = true;

void *processData(void *arg) {
    Student st1;
    while (true)
    {
        if (dustBin.size())
        {
            printf("front: %d\tSize: %d\n", dustBin.front(), dustBin.size());
            st1 = dustBin.front();
            cout << "Currently Processing ID " << st1.id << endl;
            sleep(2);
            pthread_mutex_lock(&lock1);
            dustBin.erase(dustBin.begin());
            cout << "Deleted" << endl;
            pthread_mutex_unlock(&lock1);
        }
    }
    return NULL;
}

int main()
{
    pthread_t ptid;
    Student st;
    dustBin.clear();
    pthread_mutex_init(&lock1, NULL);
    pthread_create(&ptid, NULL, &processData, NULL);
    while (true)
    {
        for (int i = 0; i < 4096; i++)
        {
            st.id = i + 1;
            st.age = i + 2;
            pthread_mutex_lock(&lock1);
            dustBin.push_back(st);
            printf("Pushed: %d\n", st.id);
            pthread_mutex_unlock(&lock1);
            usleep(500000);
        }
    }
    pthread_join(ptid, NULL);
    pthread_mutex_destroy(&lock1);
}
The output of this code (an image in the original post) shows the exact sequence of the processing: only one item is processed for every 4 insertions.
Note that the reception time of data <<< processing time.
Because of this, my input buffer is growing very rapidly. And one more thing: as the main thread and the processData thread share a mutex, they depend on each other to release the lock. Because of this my incoming buffer sometimes gets blocked, leading to data misses. Please, someone, suggest how to handle this, or suggest some method to do it.
Thanks & Regards
Vamsi
Undefined behavior
When you read data, you must lock before getting the size.
Busy waiting
You should always avoid a tight loop that does nothing. Here, if dustBin is empty, you will immediately check it again, forever, which will use 100% of that core, slow down everything else, drain the laptop battery and make it hotter than it should be. It is a very bad idea to write such code!
Learn multithreading first
You should read a book or 2 on multithreading. Doing multithreading right is hard and almost impossible without taking time to learn it properly. C++ Concurrency in Action is highly recommended for standard C++ multithreading.
Condition variable
Usually you will use a condition variable or some sort of event to tell the consumer thread when data is added, so that it does not have to wake up uselessly to check if that is the case.
Since you have a typical producer/consumer, you should be able to find a lot of information on how to do it, or special containers or other constructs that will help implement your code.
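A minimal sketch of that pattern with standard C++, reusing the Student type from the question (process() is a hypothetical stand-in for the 2-second work):

#include <condition_variable>
#include <mutex>
#include <queue>

std::queue<Student> inbox;
std::mutex m;
std::condition_variable cv;

void process(const Student& st);                    // the slow work, defined elsewhere

void produce(const Student& st) {
    {
        std::lock_guard<std::mutex> lk(m);
        inbox.push(st);
    }
    cv.notify_one();                                // wake the consumer
}

void consume() {
    for (;;) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return !inbox.empty(); }); // sleeps instead of spinning
        Student st = inbox.front();
        inbox.pop();
        lk.unlock();                                // do the slow work unlocked
        process(st);
    }
}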
Output
Your printf and cout calls will have an impact on performance, and since some are inside a lock and others not, you will probably get improperly interleaved output. If you really need output, a third thread might be a better option. In any case, you want to minimize the time you hold a lock, so formatting into a temporary buffer might be a good idea too.
By the way, standard output is relatively slow, and it is perfectly possible that it is the reason why you are not able to process all the data rapidly.
Processing rate
Obviously, if you are able to produce 4096 bytes of data every 0.5 seconds but need 2 seconds to process that data, you have a serious problem.
You should really think about what you want to do in such a case before asking a question here, as without that information we are only guessing at possible solutions.
Here are some possibilities:
Slow down the producer. Obviously, this does not work if you get data in real time.
Optimize the consumer (better algorithms, better hardware, optimal parallelism…)
Skip some data
Obviously, for performance problems you should use a profiler to know where you lose your time. Once you know that, you will have a better idea of where to look to improve your code.
Taking 2 seconds to process the data is really slow, but we cannot help you further since we have no idea what your code is doing.
For example, if you add the data into a database and it is not able to follow up, you might want to batch multiple insert into a single command to reduce the overhead of communicating with the database over the network.
Another example would be if you append the data to a file: you might want to keep the file open and accumulate some data before doing each write.
Container
A vector is not a good choice if you remove items from the head one by one and its size becomes somewhat large (say, more than 100 small items), as every other item needs to be moved every time.
In addition to changing the container as suggested in a comment, another possibility would be to use 2 vectors and swap them. That way, you reduce the number of times you lock the mutex and can process many items without needing a lock (a sketch follows).
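Something like this, again with process() as the hypothetical per-item work:

std::vector<Student> incoming;   // the producer pushes here under the mutex
std::mutex m;

void consumerLoop() {
    std::vector<Student> working;
    for (;;) {
        {
            std::lock_guard<std::mutex> lk(m);
            working.swap(incoming);   // O(1): exchanges internal pointers only
        }
        for (const Student& st : working)
            process(st);              // runs with no lock held
        working.clear();
        // wait on a condition variable here rather than looping immediately
    }
}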
How to optimize
You should accumulate enough data (say 30 seconds' worth), stop accumulating, and then test your processing speed with that data. If you cannot process it in less than about half that time (15 seconds), then you clearly need to improve your processing speed one way or another. Once your consumer(s) are fast enough, you can optimize communication from the producer to the consumer(s).
You have to know whether your bottleneck is I/O, the database or something else, and whether some part can be done in parallel.
There are probably a lot of optimizations that can be done in the code you have not shown...
If you can't handle messages fast enough, you have to drop some.
Use a circular buffer of a fixed size.
Then if the provider is faster than the consumer, older entries will be overwritten.
If you cannot skip some data and you cannot process it fast enough, you are doomed.
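A sketch of such a drop-oldest buffer (the names are illustrative, not from any library):

#include <array>
#include <cstddef>
#include <mutex>

template <typename T, std::size_t N>
class OverwritingRing {
    std::array<T, N> buf;
    std::size_t head = 0, count = 0;
    std::mutex m;
public:
    void push(const T& item) {
        std::lock_guard<std::mutex> lk(m);
        buf[(head + count) % N] = item;
        if (count < N) ++count;
        else head = (head + 1) % N;   // full: the oldest entry is overwritten
    }
    bool pop(T& item) {
        std::lock_guard<std::mutex> lk(m);
        if (count == 0) return false; // empty
        item = buf[head];
        head = (head + 1) % N;
        --count;
        return true;
    }
};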
Create two const variables, NBUFFERS and NTHREADS. Make them both 8 initially if you have 16 cores and your processing is 4x too slow; play with these values later.
Create NBUFFERS data buffers, each big enough to hold 4096 samples. In practice, just create a single large buffer and compute offsets into it to divide it up.
Start NTHREADS threads. They will each continuously wait to be told which buffer to process, process it, and then wait again for another buffer.
In your main program, go into a loop, receiving data. Receive the first 4096 samples into the first buffer and notify the first thread. Receive the second 4096 samples into the second buffer and notify the second thread. Then:
buffer = (buffer + 1) % NBUFFERS
thread = (thread + 1) % NTHREADS
Rinse and repeat. As you have 8 threads and data only arrives every 0.5 seconds, each thread will only get a new buffer every 4 seconds, but needs only 2 seconds to clear the previous buffer.
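A sketch of that dispatch loop (receiveInto() and the workers' post() are hypothetical stand-ins for your receive call and per-thread handoff):

const int NBUFFERS = 8, NTHREADS = 8, NSAMPLES = 4096;

std::vector<char> bigBuffer(NBUFFERS * NSAMPLES); // one allocation, sliced by offset
int buffer = 0, thread = 0;
for (;;) {
    char* slot = &bigBuffer[buffer * NSAMPLES];
    receiveInto(slot, NSAMPLES);       // read the next 4096 samples into this slot
    workers[thread].post(slot);        // tell that worker to process this slot
    buffer = (buffer + 1) % NBUFFERS;
    thread = (thread + 1) % NTHREADS;
}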

empty std::queue pushing data to end of stale items

I am using a std::queue to buffer messages on my network (CAN bus in this case). During an interrupt I am adding the message to the "inbox". Then my main program checks every cycle if the queue is empty, and if not, handles the messages. The problem is that the queue is popped until empty (it exits from while (!inbox.empty())), and the next time I push data to it, it works as normal, BUT the old data is still hanging out at the back.
For example, first message pushes a "1" to the queue. Loop reads
1
Next message is "2". Next read is
2
1
If I were to get in TWO messages before another read, "3", "4", then next read would be
3
4
2
1
I am very confused. I am also working with an STM32F0 ARM chip and mbed online, and have no idea if this is working poorly on the hardware or what!
I was concerned about thread safety, so I added an extra buffer queue and only push to the inbox when it "unlocked". And once I ran this I have not seen any conflict occur anyway!
Pusher code:
if (bInboxUnlocked) {
    while (!inboxBuffer.empty()) {
        inbox.push(inboxBuffer.front());
        inboxBuffer.pop();
    }
    inbox.push(msg);
} else {
    inboxBuffer.push(msg);
    printf("LOCKED!");
}
Main program read code
bInboxUnlocked = 0;
while (!inbox.empty()) {
    printf("%d\r\n", inbox.front().data);
    inbox.pop();
}
bInboxUnlocked = 1;
Thoughts anyone? Am I using this wrong? Any other ways to easily accomplish what I am doing? I expect the buffers to be small enough to implement a small circular array, but with queue on hand I was hoping not to have to do that.
Based on what I can figure out from a basic Google search, your CPU is a single core CPU, essentially. If so, then there should not be any memory fencing issues to deal with, here.
If, on the other hand, you had multiple CPU cores to deal with here, it will be necessary to either cram in explicit fences, in key places, or employ C++11 classes like std::mutex, that will take care of this for you.
But going with the original use case of a single CPU, and no memory fencing issues, if you can guarantee that:
A) There's some definite upper limit on the number of messages you expect to buffer by your interrupt handling code in the queue before it gets drained, and:
B) the messages you're buffering are PODs
Then a potential alternative to std::queue worth exploring here is to roll your own simple queue, using nothing more than a static std::array, or maybe a std::vector, an int head pointer, and an int tail pointer. A Google search should find plenty of examples of implementing this simple algorithm:
The puller checks "if head != tail"; if so, it reads the message in queue[head] and increments head, where increment means head = (head + 1) % queuesize. The pusher checks whether incrementing tail (also modulo queuesize) would result in head; if so, the queue has filled up (something that shouldn't happen, according to the prerequisites of this approach). If not, it puts the message into queue[tail] and increments tail.
If all of these operations are done in the right order, the net effect will be the same as using std::queue, but:
1) Without the overhead of std::queue and the heap allocation it uses. This should be a major win on an embedded platform.
2) Since the queue is an array in contiguous memory, it takes advantage of the CPU caching behaviour of traditional CPUs.
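Under those prerequisites (single core, POD messages, a known bound on queue depth), the whole thing might look like this; CANMessage stands in for whatever POD type your interrupt handler receives:

const int QUEUESIZE = 32;          // upper bound on buffered messages
CANMessage queue[QUEUESIZE];
volatile int head = 0;             // consumer cursor, written only by the main loop
volatile int tail = 0;             // producer cursor, written only by the ISR

// called from the interrupt handler
bool push(const CANMessage& msg) {
    int next = (tail + 1) % QUEUESIZE;
    if (next == head) return false;   // full; shouldn't happen per the prerequisites
    queue[tail] = msg;
    tail = next;
    return true;
}

// called from the main program
bool pop(CANMessage& msg) {
    if (head == tail) return false;   // empty
    msg = queue[head];
    head = (head + 1) % QUEUESIZE;
    return true;
}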

boost::lockfree::spsc_queue busy wait strategy. Is there a blocking pop?

So I'm using a boost::lockfree::spsc_queue to communicate via two boost::threads running functors of two objects in my application.
All is fine except for the fact that the spsc_queue::pop() method is non-blocking. It returns true or false even if there is nothing in the queue. However, my queue always seems to return true (problem #1). I think this is because I preallocate the queue.
typedef boost::lockfree::spsc_queue<q_pl, boost::lockfree::capacity<100000> > spsc_queue;
This means that to use the queue efficiently I have to busy-wait, constantly popping the queue and using 100% CPU. I'd rather not sleep for arbitrary amounts of time. I've used other queues in Java which block until an object is made available. Can this be done with std:: or boost:: data structures?
A lock free queue, by definition, does not have blocking operations.
How would you synchronize on the data structure? There is no internal lock, for obvious reasons, because that would mean all clients need to synchronize on it, making it your grandfather's locking concurrent queue.
So indeed, you will have to devise a waiting function yourself. How you do this depends on your use case, which is probably why the library doesn't supply one (disclaimer: I haven't checked and I don't claim to know the full documentation).
So what can you do:
As you already described, you can spin in a tight loop. Obviously, you'll do this if you know that your wait condition (queue non-empty) is always going to be satisfied very quickly.
Alternatively, poll the queue at a certain frequency (doing micro-sleeps in the meantime). Choosing a good frequency is an art: for some applications 100ms is optimal; for others, a potential 100ms wait would destroy throughput. So vary and measure your performance indicators (and don't forget about power consumption if your application is going to be deployed on many cores in a datacenter :)).
Lastly, you could arrive at a hybrid solution, spinning for a fixed number of iterations, and resorting to (increasing) interval polling if nothing arrives. This would nicely support server applications where high loads occur in bursts.
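A sketch of that hybrid strategy wrapped around any non-blocking pop() (the spin count and backoff cap are knobs to measure, not recommendations):

#include <chrono>
#include <thread>

template <typename Queue, typename T>
void blocking_pop(Queue& q, T& out) {
    for (int spin = 0; spin < 1000; ++spin)       // phase 1: spin briefly
        if (q.pop(out)) return;
    auto interval = std::chrono::microseconds(10);
    for (;;) {                                    // phase 2: poll with growing sleeps
        if (q.pop(out)) return;
        std::this_thread::sleep_for(interval);
        if (interval < std::chrono::milliseconds(1))
            interval *= 2;                        // back off up to 1ms
    }
}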
Use a semaphore to cause the producers to sleep when the queue is full, and another semaphore to cause the consumers to sleep when the queue is empty.
When the queue is neither full nor empty, the sem_post and sem_wait operations are non-blocking (in newer kernels).
#include <cassert>
#include <cstddef>
#include <semaphore.h>

template<typename lock_free_container>
class blocking_lock_free
{
public:
    blocking_lock_free(size_t n) : container(n)
    {
        sem_init(&pop_semaphore, 0, 0);
        sem_init(&push_semaphore, 0, n);
    }
    ~blocking_lock_free()
    {
        sem_destroy(&pop_semaphore);
        sem_destroy(&push_semaphore);
    }
    bool push(const typename lock_free_container::value_type& v)
    {
        sem_wait(&push_semaphore);       // blocks while the queue is full
        bool ret = container.bounded_push(v);
        assert(ret);
        if (ret)
            sem_post(&pop_semaphore);
        else
            sem_post(&push_semaphore);   // shouldn't happen
        return ret;
    }
    bool pop(typename lock_free_container::value_type& v)
    {
        sem_wait(&pop_semaphore);        // blocks while the queue is empty
        bool ret = container.pop(v);
        assert(ret);
        if (ret)
            sem_post(&push_semaphore);
        else
            sem_post(&pop_semaphore);    // shouldn't happen
        return ret;
    }
private:
    lock_free_container container;
    sem_t pop_semaphore;
    sem_t push_semaphore;
};

How to receive more than 65000 bytes in C++ socket using recv()

I am developing a client-server application (TCP) in Linux using C++. I want to send more than 65,000 bytes at the same time. In TCP, the maximum packet size is 65,535 bytes only.
How can I send the entire bytes without loss?
Following is my code at server side.
//Receive the message from client socket
if ((iByteCount = recv(GetSocketId(), buffer, MAXRECV, MSG_WAITALL)) > 0)
{
    printf("\n Received bytes %d\n", iByteCount);
    SetReceivedMessage(buffer);
    return LS_RESULT_OK;
}
If I use MSG_WAITALL it takes a long time to receive the bytes, so how can I set the flag to receive more than 1 million bytes at a time?
Edit: The MTU size is 1500 bytes, but the absolute limitation on TCP packet size is 65,535.
Judging from the comments above, it seems you don't understand how recv works, or how it is supposed to be used.
You really want to call recv in a loop, until either you know that the expected amount of data has been received or until you get a "zero bytes read" result, which means the other end has closed the connection. Always, no exceptions.
If you need to do other things concurrently (likely, with a server process!) then you will probably want to check descriptor readiness with poll or epoll first. That lets you multiplex sockets as they become ready.
The reason why you want to do it that way, and never any different, is that you don't know how the data will be packeted and how (or when) packets will arrive. Plus, recv gives no guarantee about the amount of data read at a time. It will offer what it has in its buffers at the time you call it, no more and no less (it may block if there's nothing, but then you still don't have a guarantee that any particular amount of data will be returned when it resumes, it may still return e.g. 50 bytes!).
Even if you only send, say, 5,000 bytes total, it is perfectly valid behaviour for TCP to break this into 5 (or 10, or 20) packets, and for recv to return 500 (or 100, or 20, or 1) bytes at a time, every time you call it. That's just how it works.
TCP guarantees that anything you send will eventually arrive at the other end or produce an error. And, it guarantees that whatever you send arrives in order. It does not guarantee much else. Above all, it does not guarantee that any particular amount of data is ready at any given time.
You must be prepared for that, and the only way to do it is calling recv repeatedly. Otherwise you will always lose data under some circumstances.
MSG_WAITALL should in principle make it work the way you expect, but that is bad behaviour, and it is not guaranteed to work. If the socket (or some other structure in the network stack) runs against a soft or hard limit, it may not, and probably will not fulfill your request. Some limits are obscure, too. For example, the number for SO_RCVBUF must be twice as large as what you expect to receive under Linux, because of implementation details.
Correct behaviour of a server application should never depend on assumptions such as "it fits into the receive buffer". Your application needs to be prepared, in principle, to receive terabytes of data using a 1 kilobyte receive buffer, and in chunks of 1 byte at a time, if need be. A larger receive buffer will make it more efficient, but that's it... it still has to work either way.
The fact that you only see failures upwards of some "huge" limit is just luck (or rather, bad luck). The fact that it apparently "works fine" up to that limit suggests what you do is correct, but it isn't; it's an unlucky coincidence that it works.
EDIT:
As requested in the comment below, here is what this could look like (the code is obviously untested, caveat emptor; expected_size is assumed to be a size_t):
std::vector<char> result;
ssize_t size;
char recv_buf[250];

for (;;)
{
    if ((size = recv(fd, recv_buf, sizeof(recv_buf), 0)) > 0)
    {
        result.insert(result.end(), recv_buf, recv_buf + size);
    }
    else if (size == 0)
    {
        if (result.size() < expected_size)
        {
            printf("premature close, expected %zu, only got %zu\n",
                   expected_size, result.size());
        }
        else
        {
            do_something_with(result);
        }
        break;
    }
    else
    {
        perror("recv");
        exit(1);
    }
}
That will receive any amount of data you want (or until operator new throws bad_alloc after the vector has grown to several hundred MiB in size, but that's a different story...).
If you want to handle several connections, you need to add poll or epoll or kqueue or similar functionality (or... fork); I'll leave that as an exercise for the reader.
It is possible that your problem is related to kernel socket buffer sizes. Try adding the following to your code:
int buffsize = 1024*1024;
setsockopt(s, SOL_SOCKET, SO_RCVBUF, &buffsize, sizeof(buffsize));
You might need to increase some sysctl variables too:
sysctl -w net.core.rmem_max=8388608
sysctl -w net.core.wmem_max=8388608
Note however, that relying on TCP to fill your whole buffer is generally a bad idea. You should rather call recv() multiple times. The only good reason why you would want to receive more than 64K is for improved performance. However, Linux should already have auto-tuning that will progressively increase the buffer sizes as required.
"in tcp max packet size is 65,535 bytes"
No it isn't. TCP is a byte-stream protocol over segments over IP packets, and the protocol has unlimited transmission sizes over any one connection. Look at all those 100MB downloads: how do you think they work?
Just send and receive the data. You'll get it.
I would suggest exploring kqueue or something similar. With event notification there is no need to loop on recv. Just call a simple read function upon an EV_READ event, using a single call to recv on the socket that triggered the event. Your buffer can be 10 bytes or however much you want; it doesn't matter, because if you did not read the entire message the first time around, you'll just get another EV_READ event on the socket and you call your read function again. When the data is read you'll get an EOF event. No need to hassle with loops that may or may not block other incoming connections.

Design of concurrent processing of a dual buffer system?

I have a long-running application that basically:
read packets off network
save it somewhere
process it and output to disk
A very common use case indeed, except both the data size and the data rate can be quite large. To avoid overflow of memory and improve efficiency, I am thinking of a dual-buffer design, where buffers A and B alternate: while A is holding network packets, B is processed for output. Once buffer A reaches a soft bound, A is due for output processing, and B will be used for holding network packets.
I am not particularly experienced with the concurrency/multi-threaded programming paradigm. I have read some past discussions on circular buffers that handle the multiple-producer, multiple-consumer case. I am not sure if that is the best solution, and it seems the dual-buffer design is simpler.
My question is: is there a design pattern I can follow to tackle the problem, or a better design for that matter? If possible, please use pseudocode to help illustrate the solution. Thanks.
I suggest that you should, instead of assuming "two" (or any fixed number of ...) buffers, simply use a queue, and therefore a "producer/consumer" relationship.
The process that is receiving packets simply adds them to a buffer of some certain size, and, either when the buffer is sufficiently full or a specified (short...) time interval has elapsed, places the (non-empty) buffer onto a queue for processing by the other. It then allocates a new buffer for its own use.
The receiving ("other...") process is woken up any time there might be a new buffer in the queue for it to process. It removes the buffer, processes it, then checks the queue again. It goes to sleep only when it finds that the queue is empty. (Use care to be sure that the process cannot decide to go to sleep at the precise instant that the other process decides to signal it... there must be no "race condition" here.)
Consider simply allocating storage "per-message" (whatever a "message" may mean to you), and putting that "message" onto the queue, so that there is no unnecessary delay in processing caused by "waiting for a buffer to fill up."
It might be worth mentioning a technique used in real-time audio processing/recording: a single ring buffer (or FIFO, if you prefer that term) of sufficient size can be used for this case.
You will then need a read and a write cursor. (Whether you actually need a lock or can get by with volatile plus memory barriers is a touchy subject, but the people at PortAudio suggest you do this without locks if performance is important.)
You can use one thread to read and another thread to write. The read thread should consume as much of the buffer as possible. You will be safe unless you run out of buffer space, but that risk exists for the dual-buffer solution as well. So the underlying assumption is that you can write to disk faster than the input comes in, or you will need to expand on the solution.
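A sketch of that single ring with C++11 atomics standing in for the explicit barriers (strictly one producer thread and one consumer thread):

#include <atomic>
#include <cstddef>

template <typename T, std::size_t N>
class RingFifo {
    T buf[N];
    std::atomic<std::size_t> readPos{0}, writePos{0};
public:
    bool push(const T& item) {                  // producer thread only
        std::size_t w = writePos.load(std::memory_order_relaxed);
        std::size_t next = (w + 1) % N;
        if (next == readPos.load(std::memory_order_acquire))
            return false;                       // full
        buf[w] = item;
        writePos.store(next, std::memory_order_release);
        return true;
    }
    bool pop(T& item) {                         // consumer thread only
        std::size_t r = readPos.load(std::memory_order_relaxed);
        if (r == writePos.load(std::memory_order_acquire))
            return false;                       // empty
        item = buf[r];
        readPos.store((r + 1) % N, std::memory_order_release);
        return true;
    }
};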
Find a producer-consumer queue class that works. Use one to create a buffer pool to improve performance and control memory use. Use another to transfer the buffers from the network thread to the disk thread:
#define CnumBuffs 128
#define CbufSize 8192
#define CcacheLineSize 128

struct netBuf {
    char cacheLineFiller[CcacheLineSize]; // padding against false sharing
    int dataLen;
    char bigBuf[CbufSize];
};

PCqueue pool;      // buffer pool: recycled, empty buffers
PCqueue diskQueue; // filled buffers on their way to the disk thread

// make an object pool
for (int i = 0; i < CnumBuffs; i++) {
    pool.push(new netBuf);
}
// start the two threads, each running one of the loops below
..
void netThreadRun() {
    netBuf *thisBuf;
    for (;;) {
        pool.pop(&thisBuf);  // blocks if the pool is empty
        thisBuf->dataLen = network.read(thisBuf->bigBuf, sizeof(thisBuf->bigBuf));
        diskQueue.push(thisBuf);
    }
}

void diskThreadRun() {
    fileStream myFile("someFolder/fileSpec", someEnumWrite);
    netBuf *thisBuf;
    for (;;) {
        diskQueue.pop(&thisBuf);  // blocks until a filled buffer is available
        myFile.write(thisBuf->bigBuf, thisBuf->dataLen);
        pool.push(thisBuf);       // return the empty buffer to the pool
    }
}