In an LMAX Disruptor-like pattern, how do you handle a slow consumer? - C++

I have a question about what to do when the consumer is slow in an LMAX Disruptor-like ring buffer with multiple producers and a single consumer, running on x86 Linux. With an LMAX-like ring buffer pattern you are constantly overwriting data, but what happens if the consumer falls behind? For example, in a 10-slot ring buffer (slots 0-9), how do you handle the case where the consumer is still on slot 5 and the writers are ready to start writing sequence 15, which maps to the same slot 5 (15 % 10 = 5)? What is the typical way to handle this so that writers still produce data in the order it arrived and the consumer receives it in that same order? That is really my question. Below are some details about my design; it works fine, I just don't currently have a good way to handle this issue. There are multiple threads doing writes and a single thread doing reads. I can't introduce multiple reader threads without changing the existing design, which is beyond the current project scope, but I'm still interested in your thoughts even if they involve that as a solution.
Design specifics
I have a ring buffer, and the design currently has multiple producer threads and a single consumer thread. This part of the design already exists and cannot currently change. I am trying to replace the existing queueing system with a lock-free ring buffer. What I have is as follows.
The code runs on x86 Linux; there are multiple writer threads and a single reader thread. The reader and writer sequences are std::atomic<uint64_t> and start one slot apart, so the reader starts at slot 0 and the writer at slot 1. Each writer first claims a slot by doing an atomic fetch_add on the writer sequence (incrementSequence, shown below) and, after writing the slot, uses a compare-and-swap loop to advance the reader sequence so clients know the slot is available (see updateSequence).
// Claim the next slot for a writer; returns the sequence value before the increment.
inline data_type incrementSequence() {
    return m_sequence.fetch_add(1, std::memory_order_seq_cst);
}

// Publish a slot: advance the sequence from aOld to aNew once all earlier slots are published.
void updateSequence(data_type aOld, data_type aNew) {
    const data_type expected = aOld;
    while (!m_sequence.compare_exchange_weak(aOld, aNew,
                                             std::memory_order_release,
                                             std::memory_order_relaxed)) {
        if (sequence() < aNew) {
            // An earlier writer has not published yet; compare_exchange_weak overwrites
            // aOld on failure, so restore the expected value before retrying.
            aOld = expected;
            continue;
        }
        break;
    }
}

inline data_type sequence() const {
    return m_sequence.load(std::memory_order_acquire);
}
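For context, a writer's path through these helpers looks roughly like this (an illustrative sketch only, not my real code; Message, RING_SIZE, m_ring, m_writerSeq and m_readerSeq are placeholder names):

void produce(const Message& msg) {
    // Claim the next slot (incrementSequence returns the value before the add).
    const uint64_t seq = m_writerSeq.incrementSequence();
    // Fill the claimed slot.
    m_ring[seq % RING_SIZE] = msg;
    // Publish in order: advance the reader-visible sequence from seq - 1 to seq,
    // spinning inside updateSequence until all earlier writers have published.
    m_readerSeq.updateSequence(seq - 1, seq);
}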

A ring buffer (or a FIFO in general--doesn't have to be implemented as a ring buffer) is intended to smooth out bursts of traffic. Even though producers may produce the data in bursts, the consumers can deal with a steady flow of input.
If you're overflowing the FIFO, it means one of two things:
Your bursts are larger than you planned for. Fix this by increasing the FIFO size (or making its size dynamic).
Your producers are out-running your consumers. Fix this by increasing the resources devoted to consuming the data (probably more threads).
It sounds to me like you're currently hitting the second: your single consumer simply isn't fast enough to keep up with the producers. The only real choices in that case are to speed up consumption by either optimizing the single consumer, or adding more consumers.
It also sounds a bit as if your consumer may be leaving its input data in the FIFO while it processes that data, so that spot in the FIFO remains occupied until the consumer finishes processing the input. If so, you may be able to fix your problem by simply having the consumer remove the input data from the FIFO as soon as it starts processing it. This frees up that slot so the producers can continue placing input into the buffer.
One other point: making the FIFO size dynamic can be a problem in itself, because it can cover up the fact that you really have the second problem: not enough resources devoted to consuming the data.
Assuming both the producers and the consumers are thread pools, the easiest way to balance the system is often to use a fixed-size FIFO. If the producers start to get so far ahead of the consumers that the FIFO overflows, then producers start to block. This lets the consumer thread pool consume more computational resources (e.g., run on more cores) to catch back up with the producers. This does, however, depend on being able to add more consumers, not restricting the system to a single consumer.
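To illustrate the back-pressure idea in code (a minimal sketch only, using a mutex and condition variables rather than anything lock-free): a fixed-size queue on which producers block when the buffer is full and the consumer blocks when it is empty, so an over-fast producer side is automatically throttled.

#include <condition_variable>
#include <deque>
#include <mutex>

// Minimal bounded FIFO: producers block when full, the consumer blocks when empty.
template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : m_capacity(capacity) {}

    void push(T item) {                       // called by producers
        std::unique_lock<std::mutex> lock(m_mutex);
        m_notFull.wait(lock, [this] { return m_items.size() < m_capacity; });
        m_items.push_back(std::move(item));
        m_notEmpty.notify_one();
    }

    T pop() {                                 // called by the consumer
        std::unique_lock<std::mutex> lock(m_mutex);
        m_notEmpty.wait(lock, [this] { return !m_items.empty(); });
        T item = std::move(m_items.front());
        m_items.pop_front();
        m_notFull.notify_one();
        return item;
    }

private:
    const std::size_t m_capacity;
    std::deque<T> m_items;
    std::mutex m_mutex;
    std::condition_variable m_notFull;
    std::condition_variable m_notEmpty;
};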

Related

multithreaded one read one write time_t

I'm writing a multithreaded application that has 2 threads.
One of the threads receives data from a queue and aggregates it, and the other one sends the aggregated data to a server.
I want to be able to know the last time that data was received, so I use:
time_t last_data = time(NULL);
to get the correct time on each event (I don't need it to be super accurate, but I do need it to be fast), and then the other thread sends this value along with the aggregated data.
My questions are:
Do I have to synchronize this even if it is not very important that I get the most recent update?
I tested it with std::atomic<time_t> and it seems to have some performance issues; is there any other, faster way?
What would be the worst case that could happen if I don't synchronize the read/write?
Is there a faster way to get the current time than time(NULL) (it doesn't have to be super accurate)?
UPDATE
Here is an explanation of my application workflow.
Application needs:
1. Consume data from external sources using IPC (currently nanomsg).
2. Aggregate the data to bulks.
3. Send the aggregated data to the remote server every given interval (1 second).
Current implementation:
Create 2 buffers to hold the aggregated data (one for receiving and one for sending).
Create a consumer thread to consume data from IPC and fill the receiving buffer.
Create a sending thread that will send the data to the server.
On every iteration of the interval, the sending thread swaps the buffers (swapping pointers, with locking via a mutex) and sends the data to the server.
I don't want the consumer to wait on network IO, so I created this flow.
Can I use an event-driven approach here instead of this complex mechanism, without all the locking (it currently works fine, but I'm sure it can be better)?
Don't do it that way. You only need one thread. You can use select/poll/epoll: these can wait on your inputs and, at the same time, wait for your output to finish. You will be doing event-driven programming with non-blocking output. It is something worth learning. It is a bit harder at first, but it soon makes life easier, i.e. you avoid the problem you have now. The program will also be faster.
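A minimal sketch of that structure using poll() (the helper functions are placeholders for your own logic, not a real API): one loop waits for the IPC input to become readable and for the server socket to become writable, so a slow network never blocks consumption.

#include <poll.h>

// Placeholders for the application's own logic.
bool pending_output();                 // is there aggregated data waiting to be sent?
void consume_and_aggregate(int fd);    // drain readable input into the current bulk
void send_some(int fd);                // write as much queued output as the socket accepts
void maybe_rotate_bulk();              // once per interval, queue the current bulk for sending

void event_loop(int ipc_fd, int server_fd) {   // both fds assumed non-blocking
    for (;;) {
        struct pollfd fds[2];
        fds[0].fd = ipc_fd;
        fds[0].events = POLLIN;                          // wake when input arrives
        fds[1].fd = server_fd;
        fds[1].events = pending_output() ? POLLOUT : 0;  // only watch for writability when needed

        if (poll(fds, 2, 1000 /* ms; doubles as the 1-second tick */) < 0)
            break;                                       // error handling omitted in this sketch

        if (fds[0].revents & POLLIN)
            consume_and_aggregate(ipc_fd);
        if (fds[1].revents & POLLOUT)
            send_some(server_fd);

        maybe_rotate_bulk();
    }
}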
Supposing one thread executes:
last_data = time(NULL);
And the other uses last_data, but there is no synchronization event between the two, then there are no guarantees about when, or even if, the revised value of last_data will become visible to the reading thread.
However, the most serious possibility is that the write of a time_t (probably a long) isn't atomic, and another thread could read a corrupt, part-written value.
That could cause glitches in delay and time calculations that might foul downstream processing.
You might analyse your program and find that, because the two threads already interact, there is a sufficient memory fence at some point that guarantees eventual update.
NB: This is an odd situation where I'm suspecting you think something isn't synchronized but it is! The usual experience is the other way around...
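As an aside (my suggestion, not something from the question): if torn reads are the worry and strict ordering is not, a std::atomic<time_t> accessed with relaxed memory ordering avoids the corruption and, on x86, compiles to plain loads and stores, so it should cost essentially nothing:

#include <atomic>
#include <ctime>

std::atomic<time_t> last_data{0};

// Receiving thread: record the time of the last event.
void on_data_received() {
    last_data.store(time(nullptr), std::memory_order_relaxed);
}

// Sending thread: read the timestamp; it may be slightly stale but is never torn.
time_t read_last_data() {
    return last_data.load(std::memory_order_relaxed);
}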
Basically there's not really enough information to understand what problem you're having.
For example, if the reader thread is the only one reading the time, I'd expect to see code like:
Thread 1:
If data received, lock L, update time, add to queue, unlock L.
Thread 2:
Lock L; if there are items in the queue, read the queue and update the time; unlock L .. process item.
In which case the time will be synchronized already.
Please provide a minimal, complete, verifiable example...

C++: Multi-thread design when each thread is supposed to do both I/O and CPU intensive task

I have a situation where I am offloading my work to threads. The "work" comprises two portions:
First compress the given data buffer
Then write the compressed data to disk
My main thread is continuously creating many data buffers.
I was initially thinking of a thread pool design, but then there could be a possibility that all my threads in the pool are waiting on I/O.
If I create a new thread whenever I create a new dataBuffer, I see that a large number of threads get created. This then incurs context-switching overhead, although because of the context switches my CPU cycles are not wasted.
What would be a good design to manage this situation?
Let me see if I can help with this.
1. First compress the given data buffer
2. Then write the compressed data to disk
What I understand is that you have a data buffer generated, which you need to compress and store to disk.
If order matters, and the data source is not so time-sensitive that it will lose data before the next cycle, then you could take the approach below.
Thread A
Generate a data buffer
Signal to Thread B saying you have data.
Thread B
Wait for the signal from Thread A
Retrieve the data and compress.
Signal to Thread C to store it.
Thread C
Wait for the signal from Thread B
Retrieve the compressed data and store it to disk.
Another useful and highly efficient design pattern is to have a pool of threads all pulling from a single queue of tasks. Each task, upon completion, generates a new task and pushes it onto the queue.
The data generation task, upon completion, generates the compression task.
The compression task, upon completion, generates the storage task.
Now, if you want all storage tasks to happen sequentially, use a separate queue for those, and have just one dedicated thread pulling tasks from that queue.
The advantage is that this creates a very clean and general design, in which direct message passing is avoided, and instead a concurrent queue provides the reliability, and there's minimal context switching. It is highly scalable, because it will always make use of as many threads as you have in the pool. This is optimal in case you don't have any order constraints (such as "buffer #n must be written to disk before buffer #(n+1)").
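A sketch of that pattern (illustrative names only; Buffer, compress and write_to_disk stand in for the real application logic): a pool of workers pulls std::function tasks from one queue, the compression task enqueues the storage task when it finishes, and a single dedicated thread drains the storage queue so writes stay sequential.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

// Tiny task queue: threads pull std::function<void()> tasks (shutdown handling omitted).
class TaskQueue {
public:
    void push(std::function<void()> task) {
        { std::lock_guard<std::mutex> lock(m_mutex); m_tasks.push(std::move(task)); }
        m_cv.notify_one();
    }
    std::function<void()> pop() {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [this] { return !m_tasks.empty(); });
        auto task = std::move(m_tasks.front());
        m_tasks.pop();
        return task;
    }
private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::queue<std::function<void()>> m_tasks;
};

TaskQueue pool_queue;     // shared by the worker pool
TaskQueue storage_queue;  // drained by one dedicated thread, keeping disk writes sequential

// Placeholders for the application's own types and logic.
struct Buffer {};
Buffer compress(const Buffer& data);
void write_to_disk(const Buffer& data);

// The data-generation step submits a compression task; the compression task,
// upon completion, pushes the corresponding storage task.
void submit_buffer(Buffer data) {
    pool_queue.push([data] {
        Buffer compressed = compress(data);
        storage_queue.push([compressed] { write_to_disk(compressed); });
    });
}

void pool_worker()    { for (;;) pool_queue.pop()(); }     // run by each pool thread
void storage_worker() { for (;;) storage_queue.pop()(); }  // run by the single storage thread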
I think you can use 2 semaphore variables to sync the worker threads.
For example, the variables are:
compressSemaphore
storeSemaphore
And you also need 2 queues to store the data:
toCompressData
toStoreData
Your main thread creates many data buffers; it puts each data buffer into toCompressData and signals compressSemaphore.
Your compress worker thread waits on compressSemaphore and, when signalled by the main thread, gets data from toCompressData. The compress thread then puts the compressed data into toStoreData and signals storeSemaphore.
Your store worker thread waits on storeSemaphore and, when it is signalled, gets data from toStoreData.
You can set a fixed number of compress worker threads and store threads, or adjust the numbers dynamically.
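A rough sketch of that arrangement with POSIX semaphores (the names mirror the description above; DataBuffer, compress and store are placeholders, and the queues are additionally protected by mutexes since the semaphores only count items):

#include <semaphore.h>
#include <mutex>
#include <queue>

// Placeholders for the application's own data and logic.
struct DataBuffer {};
DataBuffer compress(const DataBuffer& buf);
void store(const DataBuffer& buf);

std::queue<DataBuffer> toCompressData, toStoreData;
std::mutex compressMutex, storeMutex;      // protect the two queues
sem_t compressSemaphore, storeSemaphore;   // count items available in each queue
// At startup: sem_init(&compressSemaphore, 0, 0); sem_init(&storeSemaphore, 0, 0);

void main_thread_produce(DataBuffer buf) {
    { std::lock_guard<std::mutex> lock(compressMutex); toCompressData.push(buf); }
    sem_post(&compressSemaphore);          // signal: one more buffer to compress
}

void compress_worker() {
    for (;;) {
        sem_wait(&compressSemaphore);      // sleep until a buffer is available
        DataBuffer buf;
        { std::lock_guard<std::mutex> lock(compressMutex); buf = toCompressData.front(); toCompressData.pop(); }
        DataBuffer compressed = compress(buf);
        { std::lock_guard<std::mutex> lock(storeMutex); toStoreData.push(compressed); }
        sem_post(&storeSemaphore);         // signal the store worker
    }
}

void store_worker() {
    for (;;) {
        sem_wait(&storeSemaphore);
        DataBuffer buf;
        { std::lock_guard<std::mutex> lock(storeMutex); buf = toStoreData.front(); toStoreData.pop(); }
        store(buf);
    }
}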

Multirate threads

I recently ran into a requirement for a multithreaded application whose threads run at different rates.
The questions then become, since I am still learning multithreading:
A scenario is given to put things into perspective:
Say 1st thread runs at 100 Hz "real time"
2nd runs at 10 Hz
and say that the 1st thread provides data "myData" to the 2nd thread.
How is myData going to be provided to the 2nd thread? Is the common practice to just read whatever is available from the first thread, or does there need to be some kind of decimation to reduce the rate?
Does myData need to be some kind of Singleton with a locking mechanism? myData isn't really shared; rather, it is updated by the first thread and used by the second thread.
How about the opposite case, when the data used in one thread needs to be used at a higher rate in a different thread?
How is myData going to be provided to the 2nd thread
One common method is to provide a FIFO queue -- this could be a std::deque or a linked list, or whatever -- and have the producer thread push data items onto one end of the queue while the consumer thread pops them off of the other end. Be sure to serialize all accesses to the FIFO queue (using a mutex or similar locking mechanism) to avoid race conditions.
Alternatively, instead of a queue you could have a single shared data object (essentially a queue of length one) and have your producer thread overwrite the object every time it generates new data. This could be done in cases where it's not important that the consumer thread sees every piece of data that was generated, but rather it's only important that it sees the most recent data. You'd still need to do the locking, though, to avoid the risk of the consumer thread reading from the data object at the same time the producer thread is in the middle of writing to it.
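Here is a minimal sketch of that "queue of length one" variant (SensorSample is just a placeholder for whatever myData actually is): the producer overwrites a single slot under a mutex, and the consumer copies out the most recent value whenever it wakes up.

#include <mutex>

struct SensorSample { double value; };   // placeholder for the real myData type

class LatestValue {
public:
    void set(const SensorSample& sample) {        // called by the 100 Hz producer
        std::lock_guard<std::mutex> lock(m_mutex);
        m_data = sample;
        m_hasData = true;
    }
    bool get(SensorSample& out) const {           // called by the 10 Hz consumer
        std::lock_guard<std::mutex> lock(m_mutex);
        if (!m_hasData) return false;
        out = m_data;                             // always the most recent value
        return true;
    }
private:
    mutable std::mutex m_mutex;
    SensorSample m_data{};
    bool m_hasData = false;
};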
or does there need to be some kind of decimation to reduce the rate.
There doesn't need to be any decimation -- the second thread can just read in as much data as there is available to read, whenever it wakes up.
Does the myData need to be some kind of Singleton with locking
mechanism.
Singleton isn't necessary (although it's possible to do it that way). The locking mechanism is necessary, unless you have some kind of lock-free synchronization mechanism (and if you're asking this level of question, you don't have one and you don't want to try to get one either -- keep things simple for now!)
How about the opposite case, when the data used in one thread need to
be used at higher rate in a different thread.
It's the same -- if you're using a proper inter-thread communications mechanism, the rates at which the threads wake up don't matter, because the communications mechanism will do the right thing regardless of when or how often the threads wake up.
Any multithreaded program has to cope with the possibility that one of the threads will work faster than another - by any ratio - even if they're executing on the same CPU with the same clock frequency.
Your choices include:
a producer-consumer container that lets the first thread enqueue data and the second thread "pop" it off for processing: you could let the queue grow as large as memory allows, or put some limit on its size, after which either data would be lost or the 1st thread would be forced to slow down and wait before enqueueing further values
there are libraries available (e.g. boost), or if you want to implement it yourself google some tutorials/docs on mutex and condition variables
do something conceptually similar to the above, but where the size limit is 1, so there's just the single myData variable rather than a "container"; all the synchronisation and delay choices remain the same
The Singleton pattern is orthogonal to your needs here: the two threads do need to know where the data is, but that would normally be done using e.g. a pointer argument to the function(s) run in the threads. Singleton's easily overused and best avoided unless reasons stack up high....

High throughput non-blocking server design: Alternatives to busy wait

I have been building a high-throughput server application for multimedia messaging; the language of implementation is C++. Each server can be used in stand-alone mode, or many servers can be joined together to create a DHT-based overlay network; the servers act as super-peers, as in the case of Skype.
The work is in progress. Currently the server can handle around 200,000 messages per second (256 byte messages) and has a max throughput of around 256 MB/s on my machine (Intel i3 Mobile 2 GHz, Fedora Core 18 (64-bit), 4 GB RAM) for messages of length 4096 bytes. The server has got two threads, one thread for handling all IOs (epoll-based, edge triggered) another one for processing the incoming messages. There is another thread for overlay management, but it doesn't matter in the current discussion.
The two threads in question share data using two circular buffers. Thread number 1 enqueues fresh messages for thread number 2 using one circular buffer, while thread number 2 returns the processed messages through the other circular buffer. The server is completely lock-free: I am not using any synchronization primitive whatsoever, not even atomic operations.
The circular buffers never overflow, because the messages are pooled (pre-allocated on start). In fact all vital/frequently-used data-structures are pooled to reduce memory fragmentation and to increase cache efficiency, hence we know the maximum number of messages we are ever going to create when the server starts, hence we can pre-calculate the maximum size of the buffers and then initialize the circular buffers accordingly.
Now my question: thread #1 enqueues the serialized messages one message at a time (actually pointers to message objects), while thread #2 takes messages out of the queue in chunks (of 32/64/128) and returns the processed messages in chunks through the second circular buffer. When there are no new messages, thread #2 busy-waits, keeping one of the CPU cores busy all the time. How can I improve the design further? What are the alternatives to the busy-wait strategy? I want to do this elegantly and efficiently. I have considered using semaphores, but I fear that is not the best solution, for the simple reason that I would have to call "sem_post" every time I enqueue a message in thread #1, which might considerably decrease throughput, and the second thread would have to call "sem_wait" an equal number of times to keep the semaphore from overflowing. I also fear that a semaphore implementation might be using a mutex internally.
A second good option might be the use of a signal, if I can discover an algorithm for raising the signal only when the second thread has either "emptied the queue and is in the process of calling sigwait" or is "already waiting on sigwait". In short, the signal must be raised a minimum number of times, although it won't hurt if signals are raised a few more times than needed. Yes, I did use Google Search, but none of the solutions I found on the Internet were satisfactory. Here are a few considerations:
A. The server must waste minimum CPU cycles while making system calls, and system calls must be used a minimum number of times.
B. There must be very low overhead and the algorithm must be efficient.
C. No locking what-so-ever.
I want all options to be put on the table.
Here is the link to the site where I have shared info about my server, to better understand the functionality and the purpose:
www.wanhive.com
Busy waiting is good if you need to wake up thread #2 as fast as possible. In fact it is the fastest way to notify one processor about changes made by another processor. You need to generate memory fences on both ends (a write fence on one side and a read fence on the other). But this holds true only if both of your threads execute on dedicated processors. In that case no context switching is needed, just cache-coherency traffic.
There are some improvements that can be made.
If thread #2 is generally CPU-bound and busy-waits, it can be penalized by the scheduler (at least on Windows and Linux). The OS scheduler dynamically adjusts thread priorities to improve overall system performance: it reduces the priorities of CPU-bound threads that consume large amounts of CPU time, to prevent thread starvation. You may need to manually increase the priority of thread #2 to prevent this.
If you have a multicore or multiprocessor machine, you will end up with undersubscription of processors and your application won't be able to exploit hardware concurrency. You can mitigate this by using several processing threads (copies of thread #2).
Parallelization of the processing step.
There are two options:
Your messages are totally ordered and need to be processed in the same order in which they arrived.
Messages can be reordered and processing can be done in any order.
In the first case you need N circular buffers, N processing threads, N output buffers, and one consumer thread. Thread #1 enqueues messages into those circular buffers in round-robin order.
// Thread #1 pseudocode
auto message = recv()
auto buffer_index = atomic_increment(&message_counter);
buffer_index = buffer_index % N; // N is the number of threads
// buffers is an array of cyclic buffers - Buffer* buffers[N];
Buffer* current_buffer = buffers[buffer_index];
current_buffer->enqueue(message);
Each thread consumes messages from one of the buffers and enqueues the result to its dedicated output buffer.
// Thread #i pseudocode
auto message = my_buffer->dequeue();
auto result = process(message);
my_output_buffer->enqueue(result);
Now you need to process all these messages in arrival order. You can do this with a dedicated consumer thread that dequeues the processed messages from the output circular buffers in round-robin manner.
// Consumer thread pseudocode
// out_message_counter is equal to message_counter at start
auto out_buffer_index = atomic_increment(&out_message_counter);
out_buffer_index = out_buffer_index % N;
// out_buffers is array of output buffers that is used by processing
// threads
auto out_buffer = out_buffers[out_buffer_index];
auto result = out_buffer->dequeue();
send(result); // or whatever you need to do with result
In the second case, when you don't need to preserve message order, you don't need the consumer thread or the output circular buffers. You just do whatever you need to do with the result in the processing thread.
N must equal the number of CPUs minus 3 in the first case ("minus 3" being one I/O thread + one consumer thread + one DHT thread) and the number of CPUs minus 2 in the second case ("minus 2" being one I/O thread + one DHT thread). This is because busy waiting can't be effective if you oversubscribe the processors.
Sounds like you want to coordinate a producer and a consumer connected by some shared state. In Java, at least, one way to avoid busy-waiting in such patterns is to use wait and notify. With this approach, thread #2 can go into a blocked state by calling wait if it finds that the queue is empty, instead of spinning the CPU. Once thread #1 puts something in the queue, it can call notify. A quick search for such mechanisms in C++ yields this:
wait and notify in C/C++ shared memory
You can have thread #2 go to sleep for X milliseconds when the queue is empty.
X can be determined by the length of the queues you want + some guard band.
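A sketch of that idea (1 ms here is an arbitrary choice of X; the two helpers are placeholders for the ring-buffer checks):

#include <chrono>
#include <thread>

bool queue_has_messages();   // placeholder: is the incoming circular buffer non-empty?
void process_chunk();        // placeholder: drain and process a chunk of 32/64/128 messages

void consumer_loop() {       // thread #2
    using namespace std::chrono_literals;
    for (;;) {
        if (!queue_has_messages()) {
            std::this_thread::sleep_for(1ms);   // X = 1 ms; tune to queue length + guard band
            continue;
        }
        process_chunk();
    }
}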
BTW, in user mode (ring 3) you can't use the MONITOR/MWAIT instructions, which would be ideal for your question.
Edit
You should definitely give TBB's RWlock a try (there's a free version). Sounds like what you're looking for.
Edit2
Another option is to use condition variables. They involve a mutex and a condition. Basically, you wait on the condition to become "true". The low-level pthread stuff can be found here.
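A bare-bones sketch of that with the pthread primitives (illustrative only; the message queue itself is left out): thread #2 blocks on the condition variable instead of spinning, and thread #1 signals after each enqueue.

#include <pthread.h>

// Shared state guarded by the mutex; 'pending' stands in for "the ring buffer is non-empty".
static pthread_mutex_t mtx  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int pending = 0;

// Thread #1: after enqueuing a message, wake the consumer if it is sleeping.
void notify_message_enqueued(void) {
    pthread_mutex_lock(&mtx);
    pending = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&mtx);
}

// Thread #2: block (no busy wait) until there is something to process.
void wait_for_messages(void) {
    pthread_mutex_lock(&mtx);
    while (!pending)                 // loop guards against spurious wakeups
        pthread_cond_wait(&cond, &mtx);
    pending = 0;
    pthread_mutex_unlock(&mtx);
}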

How can I improve my real-time behavior in multi-threaded app using pthreads and condition variables?

I have a multi-threaded application that is using pthreads. I have a mutex lock and condition variables. There are two threads: one thread is producing data for the second thread, a worker, which is trying to process the produced data in a real-time fashion, such that one chunk is processed as close to the elapsing of a fixed time period as possible.
This works pretty well, however, occasionally when the producer thread releases the condition upon which the worker is waiting, a delay of up to almost a whole second is seen before the worker thread gets control and executes again.
I know this because, right before the producer releases the condition upon which the worker is waiting, it does a chunk of processing for the worker if it is time to process another chunk; then, immediately upon receiving the condition, the worker thread also does a chunk of processing if it is time to process another chunk.
In this latter case, I am seeing that I am often late processing the chunk. I'd like to eliminate this lost efficiency and do what I can to keep the chunks ticking away as close as possible to the desired frequency.
Is there anything I can do to reduce the delay between the release condition from the producer and the detection that that condition is released such that the worker resumes processing? For example, would it help for the producer to call something to force itself to be context switched out?
The bottom line is that the worker has to wait each time it asks the producer to create work for it, so that the producer can muck with the worker's data structures before telling the worker it is ready to run in parallel again. This period of exclusive access by the producer is meant to be short, but during this period I am also checking for real-time work to be done by the producer on behalf of the worker while the producer has exclusive access. Somehow my hand-off back to running in parallel occasionally results in a significant delay that I would like to avoid. Please suggest how this might best be accomplished.
I could suggest the following pattern. Generally the same technique could be used, e.g. when prebuffering frames in some real-time renderers or something like that.
First, it's obvious that the approach you describe in your message would only be effective if both of your threads are loaded equally (or almost equally) all the time. If not, multi-threading would actually benefit your situation.
Now, let's think about a thread pattern that would be optimal for your problem. Assume we have a yielding thread and a processing thread. The first of them prepares chunks of data to process; the second does the processing and stores the result somewhere (where is not actually important).
The effective way to make these threads work together is a proper yielding mechanism. Your yielding thread should simply add data to some shared buffer and shouldn't care what happens to that data afterwards. The buffer could be implemented as a simple FIFO queue. This means that your yielding thread should prepare the data to process and make a PUSH call to the queue:
X = PREPARE_DATA()
BUFFER.LOCK()
BUFFER.PUSH(X)
BUFFER.UNLOCK()
Now, the processing thread. Its behaviour can be described this way (you should probably add some artificial delay, like SLEEP(X), between calls to EMPTY):
IF !EMPTY(BUFFER) PROCESS(BUFFER.TOP)
The important point here is what your processing thread should do with the processed data. The obvious approach is to make a POP call after the data is processed, but you will probably want to come up with some better idea. Anyway, in my variant it would look like:
// After data is processed
BUFFER.LOCK()
BUFFER.POP()
BUFFER.UNLOCK()
Note that the locking operations in the yielding and processing threads shouldn't really impact your performance, because they are only called once per chunk of data.
Now, the interesting part. As I wrote at the beginning, this approach is only effective if the threads behave roughly the same in terms of CPU/resource usage. There is a way to make this threading solution effective even if that condition does not hold all the time or depends on other runtime conditions.
This means creating another thread, called the controller thread. This thread would merely compare the time each thread uses to process one chunk of data and balance the thread priorities accordingly. Actually, we don't even have to compare times; the controller thread could simply work like this:
IF BUFFER.SIZE() > T
DECREASE_PRIORITY(YIELDING_THREAD)
INCREASE_PRIORITY(PROCESSING_THREAD)
Of course, you could implement better heuristics here, but the approach with the controller thread should be clear.