Waking up multiple threads to work once per condition - c++

I have a situation where one thread needs to occasionally wake up a number of worker threads and each worker thread needs to do its work (only) once and then go back to sleep to wait for the next notification. I'm using a condition_variable to wake everything up, but the problem I'm having is the "only once" part. Assume that each thread is heavy to create, so I don't want to just be creating and joining them each time.
// g++ -Wall -o threadtest -pthread threadtest.cpp
#include <iostream>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <chrono>
std::mutex condMutex;
std::condition_variable condVar;
bool dataReady = false;
void state_change_worker(int id)
{
while (1)
{
{
std::unique_lock<std::mutex> lck(condMutex);
condVar.wait(lck, [] { return dataReady; });
// Do work only once.
std::cout << "thread " << id << " working\n";
}
}
}
int main()
{
// Create some worker threads.
std::thread threads[5];
for (int i = 0; i < 5; ++i)
threads[i] = std::thread(state_change_worker, i);
while (1)
{
// Signal to the worker threads to work.
{
std::cout << "Notifying threads.\n";
std::unique_lock<std::mutex> lck(condMutex);
dataReady = true;
condVar.notify_all();
}
// It would be really great if I could wait() on all of the
// worker threads being done with their work here, but it's
// not strictly necessary.
std::cout << "Sleep for a bit.\n";
std::this_thread::sleep_for(std::chrono::milliseconds(1000));
}
}
Update: Here is an almost-but-not-quite working implementation of a squad lock. The problem is that I can't guarantee that each thread will have a chance to wake up and decrement count in waitForLeader() before one runs through again.
// g++ -Wall -o threadtest -pthread threadtest.cpp
#include <iostream>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <chrono>
class SquadLock
{
public:
void waitForLeader()
{
{
// Increment count to show that we are waiting in queue.
// Also, if we are the thread that reached the target, signal
// to the leader that everything is ready.
std::unique_lock<std::mutex> count_lock(count_mutex_);
std::unique_lock<std::mutex> target_lock(target_mutex_);
if (++count_ >= target_)
count_cond_.notify_one();
}
// Wait for leader to signal done.
std::unique_lock<std::mutex> lck(done_mutex_);
done_cond_.wait(lck, [&] { return done_; });
{
// Decrement count to show that we are no longer waiting.
// If we are the last thread set done to false.
std::unique_lock<std::mutex> lck(count_mutex_);
if (--count_ == 0)
{
done_ = false;
}
}
}
void waitForHerd()
{
std::unique_lock<std::mutex> lck(count_mutex_);
count_cond_.wait(lck, [&] { return count_ >= target_; });
}
void leaderDone()
{
std::unique_lock<std::mutex> lck(done_mutex_);
done_ = true;
done_cond_.notify_all();
}
void incrementTarget()
{
std::unique_lock<std::mutex> lck(target_mutex_);
++target_;
}
void decrementTarget()
{
std::unique_lock<std::mutex> lck(target_mutex_);
--target_;
}
void setTarget(int target)
{
std::unique_lock<std::mutex> lck(target_mutex_);
target_ = target;
}
private:
// Condition variable to indicate that the leader is done.
std::mutex done_mutex_;
std::condition_variable done_cond_;
bool done_ = false;
// Count of currently waiting tasks.
std::mutex count_mutex_;
std::condition_variable count_cond_;
int count_ = 0;
// Target number of tasks ready for the leader.
std::mutex target_mutex_;
int target_ = 0;
};
SquadLock squad_lock;
std::mutex print_mutex;
void state_change_worker(int id)
{
while (1)
{
// Wait for the leader to signal that we are ready to work.
squad_lock.waitForLeader();
{
// Adding just a bit of sleep here makes it so that every thread wakes up, but that isn't the right way.
// std::this_thread::sleep_for(std::chrono::milliseconds(100));
std::unique_lock<std::mutex> lck(print_mutex);
std::cout << "thread " << id << " working\n";
}
}
}
int main()
{
// Create some worker threads and increment target for each one
// since we want to wait until all threads are finished.
std::thread threads[5];
for (int i = 0; i < 5; ++i)
{
squad_lock.incrementTarget();
threads[i] = std::thread(state_change_worker, i);
}
while (1)
{
// Signal to the worker threads to work.
std::cout << "Starting threads.\n";
squad_lock.leaderDone();
// Wait for the worker threads to be done.
squad_lock.waitForHerd();
// Wait until next time, processing results.
std::cout << "Tasks done, waiting for next time.\n";
std::this_thread::sleep_for(std::chrono::milliseconds(1000));
}
}

Following is an excerpt from a blog I created concerning concurrent design patterns. The patterns are expressed using the Ada language, but the concepts are translatable to C++.
Summary
Many applications are constructed of groups of cooperating threads of execution. Historically this has frequently been accomplished by creating a group of cooperating processes. Those processes would cooperate by sharing data. At first, only files were used to share data. File sharing presents some interesting problems. If one process is writing to the file while another process reads from the file you will frequently encounter data corruption because the reading process may attempt to read data before the writing process has completely written the information. The solution used for this was to create file locks, so that only one process at a time could open the file. Unix introduced the concept of a Pipe, which is effectively a queue of data. One process can write to a pipe while another reads from the pipe. The operating system treats data in a pipe as a series of bytes. It does not let the reading process access a particular byte of data until the writing process has completed its operation on the data.
Various operating systems also introduced other mechanisms allowing processes to share data. Examples include message queues, sockets, and shared memory. There were also special features to help programmers control access to data, such as semaphores. When operating systems introduced the ability for a single process to operate multiple threads of execution, also known as lightweight threads, or just threads, they also had to provide corresponding locking mechanisms for shared data.
Experience shows that, while the variety of possible designs for shared data is quite large, there are a few very common design patterns that frequently emerge. Specifically, there are a few variations on a lock or semaphore, as well as a few variations on data buffering. This paper explores the locking and buffering design patterns for threads in the context of a monitor. Although monitors can be implemented in many languages, all examples in this paper are presented using Ada protected types. Ada protected types are a very thorough implementation of a monitor.
Monitors
There are several theoretical approaches to creating and controlling shared memory. One of the most flexible and robust is the monitor as first described by C.A.R. Hoare. A monitor is a data object with three different kinds of operations.
Procedures are used to change the state or values contained by the monitor. When a thread calls a monitor procedure that thread must have exclusive access to the monitor to prevent other threads from encountering corrupted or partially written data.
Entries, like procedures, are used to change the state or values contained by the monitor, but an entry also specifies a boundary condition. The entry may only be executed when the boundary condition is true. Threads that call an entry when the boundary condition is false are placed in a queue until the boundary condition becomes true. Entries are used, for example, to allow a thread to read from a shared buffer. The reading thread is not allowed to read the data until the buffer actually contains some data. The boundary condition would be that the buffer must not be empty. Entries, like procedures, must have exclusive access to the monitor's data.
Functions are used to report the state of a monitor. Since functions only report state, and do not change state, they do not need exclusive access to the monitor's data. Many threads may simultaneously access the same monitor through functions without danger of data corruption.
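Since the concepts are meant to translate to C++, here is a minimal sketch of how the three kinds of monitor operations might be mapped onto a std::shared_mutex and a std::condition_variable_any. The class name MonitorBuffer and the helper monitor_demo are hypothetical, chosen only for illustration; this is one possible mapping, not the way Ada implements protected types.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <shared_mutex>
#include <thread>

// Hypothetical monitor-style buffer; all names are illustrative.
class MonitorBuffer {
public:
    // "Procedure": changes state under exclusive access and never waits on a condition.
    void put(int value) {
        {
            std::unique_lock<std::shared_mutex> lock(mutex_);
            items_.push(value);
        }
        not_empty_.notify_one();
    }

    // "Entry": exclusive access plus a boundary condition (the buffer must not be empty).
    int take() {
        std::unique_lock<std::shared_mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !items_.empty(); });
        int value = items_.front();
        items_.pop();
        return value;
    }

    // "Function": read-only, so several threads may hold the shared lock at once.
    std::size_t size() const {
        std::shared_lock<std::shared_mutex> lock(mutex_);
        return items_.size();
    }

private:
    mutable std::shared_mutex mutex_;
    std::condition_variable_any not_empty_;
    std::queue<int> items_;
};

// Demo: the consumer blocks in take() until the producer calls put().
int monitor_demo() {
    MonitorBuffer buffer;
    int sum = 0;
    std::thread consumer([&] {
        for (int i = 0; i < 3; ++i) sum += buffer.take();
    });
    const int values[] = {1, 2, 3};
    for (int v : values) buffer.put(v);
    consumer.join();
    assert(buffer.size() == 0);
    return sum;
}
```

Calling monitor_demo() from main sums the three values handed from the producer to the consumer. The sketch needs C++17 for std::shared_mutex.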
The concept of a monitor is extremely powerful. It can also be extremely efficient. Monitors provide all the capabilities needed to design efficient and robust shared data structures for threaded systems.
Although monitors are powerful, they do have some limitations. The operations performed on a monitor should be very fast, with no chance of making a thread block. If those operations should block, the monitor will become a road block instead of a communication tool. All the threads awaiting access to the monitor will be blocked as long as the monitor operation is blocked. For this reason, some people choose not to use monitors. There are design patterns for monitors that can actually be used to work around these problems. Those design patterns are grouped together as locking patterns.
Squad Locks
A squad lock allows a special task (the squad leader) to monitor the progress of a herd or group of worker tasks. When all (or a sufficient number) of the worker tasks are done with some aspect of their work, and the leader is ready to proceed, the entire set of tasks is allowed to pass a barrier and continue with the next sequence of their activities. The purpose is to allow tasks to execute asynchronously, yet coordinate their progress through a complex set of activities.
package Barriers is
protected type Barrier(Trigger : Positive) is
entry Wait_For_Leader;
entry Wait_For_Herd;
procedure Leader_Done;
private
Done : Boolean := False;
end Barrier;
protected type Autobarrier(Trigger : Positive) is
entry Wait_For_Leader;
entry Wait_For_Herd;
private
Done : Boolean := False;
end Autobarrier;
end Barriers;
This package shows two kinds of squad lock. The Barrier protected type demonstrates a basic squad lock. The herd calls Wait_For_Leader and the leader calls Wait_For_Herd and then Leader_Done. The Autobarrier demonstrates a simpler interface. The herd calls Wait_For_Leader and the leader calls Wait_For_Herd. The Trigger parameter is used when creating an instance of either type of barrier. It sets the minimum number of herd tasks the leader must wait for before it can proceed.
package body Barriers is
protected body Barrier is
entry Wait_For_Herd when Wait_For_Leader'Count >= Trigger is
begin
null;
end Wait_For_Herd;
entry Wait_For_Leader when Done is
begin
if Wait_For_Leader'Count = 0 then
Done := False;
end if;
end Wait_For_Leader;
procedure Leader_Done is
begin
Done := True;
end Leader_Done;
end Barrier;
protected body Autobarrier is
entry Wait_For_Herd when Wait_For_Leader'Count >= Trigger is
begin
Done := True;
end Wait_For_Herd;
entry Wait_For_Leader when Done is
begin
if Wait_For_Leader'Count = 0 then
Done := False;
end if;
end Wait_For_Leader;
end Autobarrier;
end Barriers;
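A rough C++ counterpart of the Barrier type could look like the sketch below. Ada services every task already queued on an open entry before letting new callers in; C++ condition variables give no such guarantee, so this sketch swaps the Done flag for a round counter (an assumption of mine, not part of the Ada code). Each worker remembers the round it arrived in and is released as soon as the counter moves on, even if a faster worker has already come back to wait. The names SquadBarrier and squad_demo are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical C++ squad lock; a round counter replaces the Ada Done flag.
class SquadBarrier {
public:
    explicit SquadBarrier(int trigger) : trigger_(trigger) {}

    // Worker side: announce arrival, then sleep until the leader starts a new round.
    void wait_for_leader() {
        std::unique_lock<std::mutex> lock(mutex_);
        const unsigned long my_round = round_;
        if (++waiting_ >= trigger_) herd_ready_.notify_one();
        released_.wait(lock, [&] { return round_ != my_round; });
    }

    // Leader side: block until at least `trigger` workers are waiting.
    void wait_for_herd() {
        std::unique_lock<std::mutex> lock(mutex_);
        herd_ready_.wait(lock, [&] { return waiting_ >= trigger_; });
    }

    // Leader side: release every worker that is waiting in the current round.
    void leader_done() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            waiting_ = 0;
            ++round_;
        }
        released_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable herd_ready_;
    std::condition_variable released_;
    const int trigger_;
    int waiting_ = 0;
    unsigned long round_ = 0;
};

// Demo: each worker does one unit of work per release; returns the total work done.
int squad_demo(int workers, int rounds) {
    SquadBarrier barrier(workers);
    std::atomic<int> work_done{0};
    std::atomic<bool> stop{false};
    std::vector<std::thread> pool;
    for (int i = 0; i < workers; ++i) {
        pool.emplace_back([&] {
            for (;;) {
                barrier.wait_for_leader();
                if (stop) return;
                ++work_done;  // the "work", done once per release
            }
        });
    }
    for (int r = 0; r < rounds; ++r) {
        barrier.wait_for_herd();  // all workers are idle and waiting
        barrier.leader_done();    // wake each of them exactly once
    }
    barrier.wait_for_herd();      // the last round has finished
    stop = true;
    barrier.leader_done();        // release the workers so they can exit
    for (auto& t : pool) t.join();
    assert(work_done == workers * rounds);
    return work_done.load();
}
```

With this shape the leader's loop is wait_for_herd() followed by leader_done(), so waiting for the herd doubles as waiting for the previous round of work to finish.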

Related

How bad it is to lock a mutex in an infinite loop or an update function

std::queue<double> some_q;
std::mutex mu_q;
/* an update function may be an event observer */
void UpdateFunc()
{
/* some other processing */
std::lock_guard lock{ mu_q };
while (!some_q.empty())
{
const auto& val = some_q.front();
/* update different states according to val */
some_q.pop();
}
/* some other processing */
}
/* some other thread might add some values after processing some other inputs */
void AddVal(...)
{
std::lock_guard lock{ mu_q };
some_q.push(...);
}
For this case is it okay to handle the queue this way?
Or would it be better if I try to use a lock-free queue like the boost one?
How bad it is to lock a mutex in an infinite loop or an update function
It's pretty bad. Infinite loops actually make your program have undefined behavior unless it does one of the following:
terminate
make a call to a library I/O function
perform an access through a volatile glvalue
perform a synchronization operation or an atomic operation
Acquiring the mutex lock before entering the loop and just holding it does not count as performing a synchronization operation (in the loop). Also, when holding the mutex, no one can add information to the queue, so while processing the information you extract, all threads wanting to add to the queue will have to wait - and no other worker threads wanting to share the load can extract from the queue either. It's usually better to extract one task from the queue, release the lock and then work with what you got.
The common way is to use a condition_variable that lets other threads acquire the lock and then notify other threads waiting with the same condition_variable. The CPU will be pretty close to idle while waiting and wake up to do the work when needed.
Using your program as a base, it could look like this:
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
std::queue<double> some_q;
std::mutex mu_q;
std::condition_variable cv_q; // the condition variable
bool stop_q = false; // something to signal the worker thread to quit
/* an update function may be an event observer */
void UpdateFunc() {
while(true) {
double val;
{
std::unique_lock lock{mu_q};
// cv_q.wait lets others acquire the lock to work with the queue
// while it waits to be notified.
while (not stop_q && some_q.empty()) cv_q.wait(lock);
if(stop_q) break; // time to quit
val = std::move(some_q.front());
some_q.pop();
} // lock released so others can use the queue
// do time consuming work with "val" here
std::cout << "got " << val << '\n';
}
}
/* some other thread might add some values after processing some other inputs */
void AddVal(double val) {
std::lock_guard lock{mu_q};
some_q.push(val);
cv_q.notify_one(); // notify someone that there's a new value to work with
}
void StopQ() { // a function to set the queue in shutdown mode
std::lock_guard lock{mu_q};
stop_q = true;
cv_q.notify_all(); // notify all that it's time to stop
}
int main() {
auto th = std::thread(UpdateFunc);
// simulate some events coming with some time apart
std::this_thread::sleep_for(std::chrono::seconds(1));
AddVal(1.2);
std::this_thread::sleep_for(std::chrono::seconds(1));
AddVal(3.4);
std::this_thread::sleep_for(std::chrono::seconds(1));
AddVal(5.6);
std::this_thread::sleep_for(std::chrono::seconds(1));
StopQ();
th.join();
}
If you really want to process everything that is currently in the queue, then extract everything first and then release the lock, then work with what you extracted. Extracting everything from the queue is done quickly by just swapping in another std::queue. Example:
#include <atomic>
std::atomic<bool> stop_q{}; // needs to be atomic in this version
void UpdateFunc() {
while(not stop_q) {
std::queue<double> work; // this will be used to swap with some_q
{
std::unique_lock lock{mu_q};
// cv_q.wait lets others acquire the lock to work with the queue
// while it waits to be notified.
while (not stop_q && some_q.empty()) cv_q.wait(lock);
std::swap(work, some_q); // extract everything from the queue at once
} // lock released so others can use the queue
// do time consuming work here
while(not stop_q && not work.empty()) {
auto val = std::move(work.front());
work.pop();
std::cout << "got " << val << '\n';
}
}
}
You can use it like you currently are, assuming proper use of the lock across all threads. However, you may run into some frustrations about how you want to call UpdateFunc().
Are you going to be using a callback?
Are you going to be using an ISR?
Are you going to be polling?
If you use a third-party library, it often trivializes thread synchronization and queues.
For example, if you are using CMSIS RTOS (v2), it is a fairly straightforward process to get multiple threads to pass information between each other. You could have multiple producers and a single consumer.
The single consumer can wait in a forever loop where it waits to receive a message before performing its work
when timeout is set to osWaitForever the function will wait for an
infinite time until the message is retrieved (i.e. wait semantics).
// Two producers
osMessageQueuePut(X,Y,Z,timeout=0)
osMessageQueuePut(X,Y,Z,timeout=0)
// One consumer which will run only once something enters the queue
osMessageQueueGet(X,Y,Z,osWaitForever)
tldr; You are safe to proceed, but using a library will likely make your synchronization problems easier.

is there any way to wakeup multiple threads at the same time in c/c++

Well, actually, I'm not asking that the threads must "line up" to work; I just want to notify multiple threads. So I'm not looking for a barrier.
It's kind of like condition_variable::notify_all(), but I don't want the threads to wake up one by one, which may cause starvation (also a potential problem with multiple semaphore post operations). It's kind of like:
std::atomic_flag flag{ATOMIC_FLAG_INIT};
void example() {
if (!flag.test_and_set()) {
// this is the thread to do the job, and notify others
do_something();
notify_others(); // this is what I'm looking for
flag.clear();
} else {
// this is the waiting thread
wait_till_notification();
do_some_other_thing();
}
}
void runner() {
std::vector<std::thread> threads;
for (int i=0; i<10; ++i) {
threads.emplace_back([]() {
while(1) {
example();
}
});
}
// ...
}
so how can I do this in c/c++ or maybe posix API?
Sorry, I didn't make this question clear enough, so I'll add some more explanation.
It's not the thundering herd problem I'm talking about, and yes, it's the re-acquiring of the lock that bothers me. I tried shared_mutex, but there's still a problem.
Let me split the threads into 2 groups: 1 leader thread, which does the writing job, and the others as worker threads, which do the reading job.
But actually they're all equal in the program; the leader thread is the thread that first got access to the job (you can think of it as the shared buffer having underflowed for this thread). Once the job is done, the other workers just need to be notified that they have access.
If a mutex is used here, any thread would block the others.
To give an example: the main thread's job do_something() here is a read, and it blocks the main thread, thus the whole system is blocked.
Unfortunately, shared_mutex won't solve this problem:
void example() {
if (!flag.test_and_set()) {
// leader thread:
lk.lock();
do_something();
lk.unlock();
flag.clear();
} else {
// worker thread
lk.lock_shared();
do_some_other_thing();
lk.unlock_shared();
}
}
// outer loop
void looper() {
std::vector<std::thread> threads;
for (int i=0; i<10; ++i) {
threads.emplace_back([]() {
while(1) {
example();
}
});
}
}
In this code, if the leader's job is done and there is not much to do between this unlock and the next lock (remember they're in a loop), it may get the lock again and leave the worker jobs not running, which is why I called it starvation earlier.
And to explain the blocking in do_something(): I don't want this part of the job to take all my CPU time, even if the leader's job is not ready (no data has arrived to read).
And std::call_once may still not be the answer to this because, as you can see, the workers must wait until the leader's job has finished.
To summarize, this is actually a one-producer-multi-consumer problem.
But I want the consumers to be able to do the job when the product is ready for them, and any thread can be the producer or a consumer. The first thread to find that the product has run out should become the producer, so the others are automatically consumers.
But unfortunately, I'm not sure whether this idea would work.
It's kind of like condition_variable::notify_all(), but I don't want the threads to wake up one by one, which may cause starvation
In principle it's not waking up that is serialized, but re-acquiring the lock.
You can avoid that by using std::condition_variable_any with a std::shared_lock - so long as nobody ever gets an exclusive lock on the std::shared_mutex. Alternatively, you can provide your own Lockable type.
Note however that this won't magically allow you to concurrently run more threads than you have cores, or force the scheduler to start them all running in parallel. They'll just be marked as runnable and scheduled as normal - this only fixes the avoidable serialization in your own code.
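A minimal sketch of that idea, assuming C++17: the waiters block on a std::condition_variable_any while holding only a std::shared_lock, so after notify_all() they can re-acquire the lock concurrently. The function name shared_wakeup_demo and the variable names are hypothetical. In this sketch the notifier does take the exclusive lock, but only briefly and before notifying, to publish the flag safely.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

// Returns how many waiters observed the notification.
int shared_wakeup_demo(int waiters) {
    std::shared_mutex mutex;
    std::condition_variable_any cv;
    bool ready = false;  // guarded by mutex
    std::atomic<int> woken{0};

    std::vector<std::thread> threads;
    for (int i = 0; i < waiters; ++i) {
        threads.emplace_back([&] {
            // Shared ownership: woken waiters do not have to take turns
            // re-acquiring the lock, unlike with a plain std::mutex.
            std::shared_lock<std::shared_mutex> lock(mutex);
            cv.wait(lock, [&] { return ready; });
            ++woken;
        });
    }

    {
        // Exclusive ownership is held only long enough to set the flag.
        std::unique_lock<std::shared_mutex> lock(mutex);
        ready = true;
    }
    cv.notify_all();

    for (auto& t : threads) t.join();
    assert(woken == waiters);
    return woken.load();
}
```

Whether the waiters actually run at the same time still depends on the scheduler and the number of cores, as noted above.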
It sounds like you are looking for call_once
#include <mutex>
void example()
{
static std::once_flag flag;
bool i_did_once = false;
std::call_once(flag, [&i_did_once]() mutable {
i_did_once = true;
do_something();
});
if(! i_did_once)
do_some_other_thing();
}
I don't see how your problem relates to starvation. Are you perhaps thinking about the thundering herd problem? This may arise if do_some_other_thing has a mutex but in that case you have to describe your problem in more detail.

Synchronizing very fast threads

In the following example (an idealized "game") there are two threads: the main thread, which updates data, and RenderThread, which "renders" it to the screen. What I need is for those two to be synchronized. I cannot afford to run several update iterations without running a render for every single one of them.
I use a condition_variable to sync those two, so ideally the faster thread will spend some time waiting for the slower one. However, condition variables don't seem to do the job if one of the threads completes an iteration in a very small amount of time. It seems to quickly reacquire the mutex before the wait in the other thread is able to acquire it, even though notify_one is called.
#include <iostream>
#include <thread>
#include <chrono>
#include <atomic>
#include <functional>
#include <mutex>
#include <condition_variable>
using namespace std;
bool isMultiThreaded = true;
struct RenderThread
{
RenderThread()
{
end = false;
drawing = false;
readyToDraw = false;
}
void Run()
{
while (!end)
{
DoJob();
}
}
void DoJob()
{
unique_lock<mutex> lk(renderReadyMutex);
renderReady.wait(lk, [this](){ return readyToDraw; });
drawing = true;
// RENDER DATA
this_thread::sleep_for(chrono::milliseconds(15)); // simulated render time
cout << "frame " << count << ": " << frame << endl;
++count;
drawing = false;
readyToDraw = false;
lk.unlock();
renderReady.notify_one();
}
atomic<bool> end;
mutex renderReadyMutex;
condition_variable renderReady;
//mutex frame_mutex;
int frame = -10;
int count = 0;
bool readyToDraw;
bool drawing;
};
struct UpdateThread
{
UpdateThread(RenderThread& rt)
: m_rt(rt)
{}
void Run()
{
this_thread::sleep_for(chrono::milliseconds(500));
for (int i = 0; i < 20; ++i)
{
// DO GAME UPDATE
// when this is uncommented everything is fine
// this_thread::sleep_for(chrono::milliseconds(10)); // simulated update time
// PREPARE RENDER THREAD
unique_lock<mutex> lk(m_rt.renderReadyMutex);
m_rt.renderReady.wait(lk, [this](){ return !m_rt.drawing; });
m_rt.readyToDraw = true;
// SUPPLY RENDER THREAD WITH DATA TO RENDER
m_rt.frame = i;
lk.unlock();
m_rt.renderReady.notify_one();
if (!isMultiThreaded)
m_rt.DoJob();
}
m_rt.end = true;
}
RenderThread& m_rt;
};
int main()
{
auto start = chrono::high_resolution_clock::now();
RenderThread rt;
UpdateThread u(rt);
thread* rendering = nullptr;
if (isMultiThreaded)
rendering = new thread(bind(&RenderThread::Run, &rt));
u.Run();
if (rendering)
rendering->join();
auto duration = chrono::high_resolution_clock::now() - start;
cout << "Duration: " << double(chrono::duration_cast<chrono::microseconds>(duration).count())/1000 << endl;
return 0;
}
Here is the source of this small example code, and as you can see even on ideone's run the output is frame 0: 19 (this means that the render thread has completed a single iteration, while the update thread has completed all 20 of its).
If we uncomment line 75 (the commented-out sleep_for in UpdateThread::Run, i.e. simulate some time for the update loop), everything runs fine. Every update iteration has an associated render iteration.
Is there a way to really truly sync those threads, even if one of them completes an iteration in mere nanoseconds, but also without having a performance penalty if they both take some reasonable amount of milliseconds to complete?
If I understand correctly, you want the 2 threads to work alternately: the updater waits until the renderer finishes before iterating again, and the renderer waits until the updater finishes before iterating again. Part of the computation could be parallel, but the number of iterations shall be similar between both.
You need 2 locks:
one for the updating
one for the rendering
Updater:
wait (renderingLk)
update
signal(updaterLk)
Renderer:
wait (updaterLk)
render
signal(renderingLk)
EDITED:
Even if it looks simple, there are several problems to solve:
Allowing part of the calculations to be made in parallel: in the above snippet, update and render will not run in parallel but sequentially, so there is no benefit to having multiple threads. In a real solution, some of the calculation should be done before the wait, and only the copy of the new values needs to be between the wait and the signal. The same goes for rendering: all the rendering needs to be done after the signal, with only the fetching of the values between the wait and the signal.
The implementation also needs to take care of the initial state, so that no rendering is performed before the first update.
The termination of both threads, so that neither stays locked or loops infinitely after the other terminates.
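The wait/signal scheme above can be sketched with two counting semaphores. This is only an illustration: the Semaphore class is a hypothetical helper built from a mutex and a condition variable (C++20 offers std::binary_semaphore instead), the initial counts take care of the initial state, and termination is handled by simply giving both sides the same fixed number of frames.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical minimal semaphore built from a mutex and a condition variable.
class Semaphore {
public:
    explicit Semaphore(int initial) : count_(initial) {}
    void signal() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ++count_;
        }
        cv_.notify_one();
    }
    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return count_ > 0; });
        --count_;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    int count_;
};

// Runs `frames` strictly alternating update/render steps; returns the rendered frames.
std::vector<int> alternate_demo(int frames) {
    Semaphore render_done(1);  // starts at 1: the first update may run immediately
    Semaphore update_done(0);  // starts at 0: nothing to render before the first update
    int shared_frame = -1;
    std::vector<int> rendered;

    std::thread renderer([&] {
        for (int i = 0; i < frames; ++i) {
            update_done.wait();
            rendered.push_back(shared_frame);  // stands in for rendering
            render_done.signal();
        }
    });
    for (int i = 0; i < frames; ++i) {
        render_done.wait();
        shared_frame = i;  // stands in for publishing the updated state
        update_done.signal();
    }
    renderer.join();
    assert(static_cast<int>(rendered.size()) == frames);
    return rendered;
}
```

To gain anything from the second thread, the expensive part of the update would sit before render_done.wait() and the expensive part of the render after render_done.signal(), leaving only the hand-over of values between each wait and signal.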
I think a mutex (alone) is not the right tool for the job. You might want to consider using a semaphore (or something similar) instead. What you describe sounds a lot like a producer/consumer problem, i.e., one process is allowed to run once every time another process has finished a task. Therefore you might also have a look at producer/consumer patterns. For example, this series might give you some ideas:
A multi-threaded Producer Consumer with C++11
There, a std::mutex is combined with a std::condition_variable to mimic the behavior of a semaphore, an approach that appears quite reasonable. You would probably not count up and down but rather toggle a variable with "needs redraw" semantics between true and false.
For reference:
http://en.cppreference.com/w/cpp/thread/condition_variable
C++0x has no semaphores? How to synchronize threads?
This is because you use a separate drawing variable that is only set when the rendering thread reacquires the mutex after a wait, which may be too late. The problem disappears when the drawing variable is removed and the check for wait in the update thread is replaced with !m_rt.readyToDraw (which is already set by the update thread and hence not susceptible to the logical race).
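A condensed sketch of that single-flag handshake, stripped of the RenderThread and UpdateThread classes from the question: the update side waits for the flag to be false, the render side waits for it to be true, and both share one mutex and one condition variable. The names flag_handshake_demo, ready_to_draw and finished are hypothetical, and the finished flag is an addition of this sketch for clean shutdown.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Returns the frames seen by the render side, in order.
std::vector<int> flag_handshake_demo(int frames) {
    std::mutex m;
    std::condition_variable cv;
    bool ready_to_draw = false;  // guarded by m; set by the updater, cleared by the renderer
    bool finished = false;       // guarded by m; set after the last frame is published
    int frame = -1;
    std::vector<int> rendered;

    std::thread renderer([&] {
        for (;;) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return ready_to_draw || finished; });
            if (!ready_to_draw) return;  // finished and nothing pending
            rendered.push_back(frame);   // stands in for rendering
            ready_to_draw = false;
            lk.unlock();
            cv.notify_one();
        }
    });

    for (int i = 0; i < frames; ++i) {
        std::unique_lock<std::mutex> lk(m);
        // The updater cannot run ahead: it blocks until the previous frame was consumed.
        cv.wait(lk, [&] { return !ready_to_draw; });
        frame = i;
        ready_to_draw = true;
        lk.unlock();
        cv.notify_one();
    }
    {
        std::lock_guard<std::mutex> lk(m);
        finished = true;
    }
    cv.notify_one();
    renderer.join();
    assert(static_cast<int>(rendered.size()) == frames);
    return rendered;
}
```

Because the flag is set by the same thread that later waits on it, a fast updater can no longer slip through a second iteration before the renderer has woken up.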
Modified code and results
That said, since the threads do not work in parallel, I don't really get the point of having two threads. Unless you should choose to implement double (or even triple) buffering later.
A technique often used in computer graphics is to use a double-buffer. Instead of having the renderer and the producer operate on the same data in memory, each one has its own buffer. This is implemented by using two independent buffers, and switch them when needed. The producer updates one buffer, and when it is done, it switches the buffer and fills the second buffer with the next data. Now, while the producer is processing the second buffer, the renderer works with the first one and displays it.
You could use this technique by letting the renderer lock the swap operation such that the producer may have to wait until rendering is finished.
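One way to sketch double buffering in C++, under some assumptions of mine: instead of the renderer holding a lock for the whole render, a front_fresh flag holds off the swap until the front buffer has been drawn, so the producer can fill the back buffer while rendering is in progress. All names are hypothetical, and each "frame" is just a small vector of integers.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Returns the first element of every frame the renderer saw, in order.
std::vector<int> double_buffer_demo(int frames) {
    std::vector<int> buffers[2] = {std::vector<int>(4, -1), std::vector<int>(4, -1)};
    int front = 0;             // buffer the renderer reads; guarded by m
    bool front_fresh = false;  // true while front holds a frame not yet rendered
    bool finished = false;     // set once the producer has published its last frame
    std::mutex m;
    std::condition_variable cv;
    std::vector<int> rendered;

    std::thread renderer([&] {
        for (;;) {
            int idx;
            {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [&] { return front_fresh || finished; });
                if (!front_fresh) return;  // finished and nothing left to draw
                idx = front;
            }
            // Render without the lock: the producer only writes the other buffer.
            rendered.push_back(buffers[idx][0]);
            {
                std::lock_guard<std::mutex> lk(m);
                front_fresh = false;
            }
            cv.notify_one();
        }
    });

    int back = 1;  // only the producer thread uses this index
    for (int i = 0; i < frames; ++i) {
        buffers[back].assign(4, i);  // fill the back buffer while rendering may be running
        {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return !front_fresh; });  // wait until the front buffer was drawn
            front = back;  // swap the roles of the two buffers
            back = 1 - back;
            front_fresh = true;
        }
        cv.notify_one();
    }
    {
        std::lock_guard<std::mutex> lk(m);
        finished = true;
    }
    cv.notify_one();
    renderer.join();
    assert(static_cast<int>(rendered.size()) == frames);
    return rendered;
}
```

The producer can be at most one frame ahead of the renderer, which keeps the one-render-per-update requirement while still letting update and render overlap.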

C++11 lockfree single producer single consumer: how to avoid busy wait

I'm trying to implement a class that uses two threads: one for the producer and one for the consumer. The current implementation does not use locks:
#include <boost/lockfree/spsc_queue.hpp>
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>
using Queue =
boost::lockfree::spsc_queue<
int,
boost::lockfree::capacity<1024>>;
class Worker
{
public:
Worker() : working_(false), done_(false) {}
~Worker() {
done_ = true; // exit even if the work has not been completed
worker_.join();
}
void enqueue(int value) {
queue_.push(value);
if (!working_) {
working_ = true;
worker_ = std::thread([this]{ work(); });
}
}
void work() {
int value;
while (!done_ && queue_.pop(value)) {
std::cout << value << std::endl;
}
working_ = false;
}
private:
std::atomic<bool> working_;
std::atomic<bool> done_;
Queue queue_;
std::thread worker_;
};
The application needs to enqueue work items for a certain amount of time and then sleep waiting for an event. This is a minimal main that simulates the behavior:
int main()
{
Worker w;
for (int i = 0; i < 1000; ++i)
w.enqueue(i);
std::this_thread::sleep_for(std::chrono::seconds(1));
for (int i = 0; i < 1000; ++i)
w.enqueue(i);
std::this_thread::sleep_for(std::chrono::seconds(1));
}
I'm pretty sure that my implementation is bugged: what if the worker thread completes and before executing working_ = false, another enqueue comes? Is it possible to make my code thread safe without using locks?
The solution requires:
a fast enqueue
the destructor has to quit even if the queue is not empty
no busy wait, because there are long periods of time in which the worker thread is idle
no locks if possible
Edit
I did another implementation of the Worker class, based on your suggestions. Here is my second attempt:
class Worker
{
public:
Worker()
: working_(ATOMIC_FLAG_INIT), done_(false) { }
~Worker() {
// exit even if the work has not been completed
done_ = true;
if (worker_.joinable())
worker_.join();
}
bool enqueue(int value) {
bool enqueued = queue_.push(value);
if (!working_.test_and_set()) {
if (worker_.joinable())
worker_.join();
worker_ = std::thread([this]{ work(); });
}
return enqueued;
}
void work() {
int value;
while (!done_ && queue_.pop(value)) {
std::cout << value << std::endl;
}
working_.clear();
while (!done_ && queue_.pop(value)) {
std::cout << value << std::endl;
}
}
private:
std::atomic_flag working_;
std::atomic<bool> done_;
Queue queue_;
std::thread worker_;
};
I introduced the worker_.join() inside the enqueue method. This can impact performance, but only in very rare cases (when the queue gets empty and, before the thread exits, another enqueue comes). The working_ variable is now an atomic_flag that is set in enqueue and cleared in work. The additional while loop after working_.clear() is needed because if another value is pushed before the clear but after the first while loop, the value would otherwise not be processed.
Is this implementation correct?
I did some tests and the implementation seems to work.
OT: Is it better to put this as an edit, or an answer?
what if the worker thread completes and before executing working_ = false, another enqueue comes?
Then the value will be pushed to the queue but will not be processed until another value is enqueued after the flag is set. You (or your users) may decide whether that is acceptable. This can be avoided using locks, but they're against your requirements.
The code may fail if the running thread is about to finish and sets working_ = false; but hasn't stopped running before next value is enqueued. In that case your code will call operator= on the running thread which results in a call to std::terminate according to the linked documentation.
Adding worker_.join() before assigning the worker to a new thread should prevent that.
Another problem is that queue_.push may fail if the queue is full because it has a fixed size. Currently you just ignore the case and the value will not be added to the full queue. If you wait for queue to have space, you don't get fast enqueue (in the edge case). You could take the bool returned by push (which tells if it was successful) and return it from enqueue. That way the caller may decide whether it wants to wait or discard the value.
Or use non-fixed size queue. Boost has this to say about that choice:
Can be used to completely disable dynamic memory allocations during push in order to ensure lockfree behavior.
If the data structure is configured as fixed-sized, the internal nodes are stored inside an array and they are addressed
by array indexing. This limits the possible size of the queue to the number of elements that can be addressed by the index
type (usually 2**16-2), but on platforms that lack double-width compare-and-exchange instructions, this is the best way
to achieve lock-freedom.
Your worker thread needs more than 2 states.
Not running
Doing tasks
Idle shutdown
Shutdown
If you force shut down, it skips idle shutdown. If you run out of tasks, it transitions to idle shutdown. In idle shutdown, it empties the task queue, then goes into shutting down.
Shutdown is set, then you walk off the end of your worker task.
The producer first puts things on the queue. Then it checks the worker state. If Shutdown or Idle shutdown, first join it (and transition it to not running) then launch a new worker. If not running, just launch a new worker.
If the producer wants to launch a new worker, it first makes sure that we are in the not running state (otherwise, logic error). We then transition to the Doing tasks state, and then we launch the worker thread.
If the producer wants to shut down the helper task, it sets the done flag. It then checks the worker state. If it is anything besides not running, it joins it.
This can result in a worker thread that is launched for no good reason.
There are a few cases where the above can block, but there were a few before as well.
Then, we write a formal or semi-formal proof that the above cannot lose messages, because when writing lock free code you aren't done until you have a proof.
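Here is a rough sketch of those four states, under some loud assumptions: there is exactly one producer thread, a mutex-guarded std::deque stands in for the lock-free queue so the example has no dependencies, and every name (StatefulWorker, wait_idle, processed) is made up for illustration. It is not a proof, just the transitions written down:

```cpp
#include <atomic>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

enum class State { NotRunning, DoingTasks, IdleShutdown, Shutdown };

// Sketch only: single producer; the deque + mutex is a stand-in queue.
class StatefulWorker {
public:
    ~StatefulWorker() { shutdown(); }

    // Producer side: push first, then inspect the worker state.
    void enqueue(int value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push_back(value);
        }
        State s = state_.load();
        if (s == State::IdleShutdown || s == State::Shutdown) {
            thread_.join();  // may block until the old worker has exited
            s = State::NotRunning;
            state_ = s;
        }
        if (s == State::NotRunning) {
            state_ = State::DoingTasks;  // transition before launching
            thread_ = std::thread([this] { run(); });
        }
    }

    // Forced shutdown: set the done flag, then join unless not running.
    void shutdown() {
        done_ = true;
        join_worker();
    }

    // Let the worker run out of tasks and exit on its own, then join it.
    void wait_idle() { join_worker(); }

    std::vector<int> processed;  // written by the worker, read after a join

private:
    void join_worker() {
        if (state_.load() != State::NotRunning) {
            thread_.join();
            state_ = State::NotRunning;
        }
    }
    bool pop(int& value) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (queue_.empty()) return false;
        value = queue_.front();
        queue_.pop_front();
        return true;
    }
    void drain() {
        int value;
        while (!done_ && pop(value)) processed.push_back(value);
    }
    void run() {
        drain();                       // Doing tasks
        state_ = State::IdleShutdown;  // ran out of tasks
        drain();                       // catch items pushed in the meantime
        state_ = State::Shutdown;      // then walk off the end
    }

    std::mutex mutex_;
    std::deque<int> queue_;
    std::atomic<State> state_{State::NotRunning};
    std::atomic<bool> done_{false};
    std::thread thread_;
};
```

The second drain after the switch to IdleShutdown is what keeps an item from being stranded: if the producer still saw DoingTasks after its push, the worker has not yet started that final drain, so it will pick the item up.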
This is my solution to the question. I don't much like answering my own question, but I think showing actual code may help others.
#include <boost/lockfree/spsc_queue.hpp>
#include <atomic>
#include <iostream>
#include <thread>
// I used this semaphore class: https://gist.github.com/yohhoy/2156481
#include "binsem.hpp"
using Queue =
boost::lockfree::spsc_queue<
int,
boost::lockfree::capacity<1024>>;
class Worker
{
public:
// the worker thread starts in the constructor
Worker()
: working_(ATOMIC_FLAG_INIT), done_(false), semaphore_(0)
, worker_([this]{ work(); })
{ }
~Worker() {
// exit even if the work has not been completed
done_ = true;
semaphore_.signal();
worker_.join();
}
bool enqueue(int value) {
bool enqueued = queue_.push(value);
if (!working_.test_and_set())
// signal to the worker thread to wake up
semaphore_.signal();
return enqueued;
}
void work() {
int value;
// the worker thread stays alive until done_ is set
while (!done_) {
// sleep until the start signal arrives
semaphore_.wait();
while (!done_ && queue_.pop(value)) {
// perform actual work
std::cout << value << std::endl;
}
working_.clear();
while (!done_ && queue_.pop(value)) {
// perform actual work
std::cout << value << std::endl;
}
}
}
private:
std::atomic_flag working_;
std::atomic<bool> done_;
binsem semaphore_;
Queue queue_;
std::thread worker_;
};
I tried @Cameron's suggestion of not shutting down the thread and adding a semaphore instead. The semaphore is actually used only on the first enqueue and after the last work item. This is not lock-free, but only in these two cases.
I did a performance comparison between my previous version (see my edited question) and this one. There are no significant differences when there are not many starts and stops. However, enqueue is 10 times faster when it has to signal the worker thread instead of starting a new thread. This is a rare case, so it is not very important, but it is an improvement anyway.
This implementation satisfies:
lock-free in the common case (when enqueue and work are busy);
no busy waiting when nothing is enqueued for a long time;
the destructor exits as soon as possible;
correctness?? :)
Very partial answer: I think all those atomics, semaphores and states are a back-communication channel, from "the thread" to "the Worker". Why not use another queue for that? At the very least, thinking about it will help you around the problem.

C++ multithreading, simple consumer / producer threads, LIFO, notification, counter

I am new to multithreaded programming and want to implement the following functionality.
There are 2 threads, a producer and a consumer.
The consumer only processes the latest value, i.e., last in, first out (LIFO).
The producer sometimes generates new values faster than the consumer can
process them. For example, the producer may generate 2 new values in 1
millisecond, while the consumer takes approximately 5 milliseconds to process one.
If the consumer receives a new value in the middle of processing an old
value, there is no need to interrupt. In other words, the consumer will finish its current
execution first, then start an execution on the latest value.
Here is my design process; please correct me if I am wrong.
There is no need for a queue, since only the latest value is
processed by the consumer.
Are notifications sent from the producer queued automatically?
I will use a counter instead.
ConsumerThread() checks the counter at the end, to make sure the producer
hasn't generated a new value.
But what happens if the producer generates a new value just before the consumer
goes to sleep(), but after it checks the counter?
Here is some pseudo code.
boost::mutex mutex;
double x;
int counter = 0; // number of values produced since the consumer last looked
void ProducerThread()
{
{
boost::mutex::scoped_lock lock(mutex);
x = rand();
counter++;
}
notify(); // wake up consumer thread
}
void ConsumerThread()
{
counter = 0; // reset counter, only process the latest value
... do something which takes 5 milli-seconds ...
if (counter > 0)
{
... execute this function again, not too sure how to implement this ...
}
else
{
... what happen if producer generates a new value here??? ...
sleep();
}
}
Thanks.
If I understood your question correctly, for your particular application, the consumer only needs to process the latest available value provided by the producer. In other words, it's acceptable for values to get dropped because the consumer cannot keep up with the producer.
If that's the case, then I agree that you can get away without a queue and use a counter. However, the shared counter and value variables will need to be accessed atomically.
You can use boost::condition_variable to signal notifications to the consumer that a new value is ready. Here is a complete example; I'll let the comments do the explaining.
#include <boost/thread/thread.hpp>
#include <boost/thread/mutex.hpp>
#include <boost/thread/condition_variable.hpp>
#include <boost/thread/locks.hpp>
#include <boost/date_time/posix_time/posix_time_types.hpp>
#include <cstdlib>
#include <iostream>
boost::mutex mutex;
boost::condition_variable condvar;
typedef boost::unique_lock<boost::mutex> LockType;
// Variables that are shared between producer and consumer.
double value = 0;
int count = 0;
void producer()
{
while (true)
{
{
// 'value' and 'count' must both be updated together
// while holding the mutex lock
LockType lock(mutex);
value = std::rand();
++count;
// Notify the consumer that a new value is ready.
condvar.notify_one();
}
// Simulate a 2 ms production interval, exaggerated to 200 ms
boost::this_thread::sleep(boost::posix_time::milliseconds(200));
}
}
void consumer()
{
// Local copies of 'count' and 'value' variables. We want to do the
// work using local copies so that they don't get clobbered by
// the producer when it updates.
int currentCount = 0;
double currentValue = 0;
while (true)
{
{
// Acquire the mutex before accessing 'count' and 'value' variables.
LockType lock(mutex); // mutex is locked while in this scope
while (count == currentCount)
{
// Wait for producer to signal that there is a new value.
// While we are waiting, Boost releases the mutex so that
// other threads may acquire it.
condvar.wait(lock);
}
// `lock` is automatically re-acquired when we come out of
// condvar.wait(lock). So it's safe to access the 'value'
// variable at this point.
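currentValue = value; // Grab a copy of the latest value
// while we hold the lock.
currentCount = count; // Remember which update we have seen, so the
// next iteration waits for a newer one.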
}
// Now that we are out of the mutex lock scope, we work with our
// local copy of `value`. The producer can keep on clobbering the
// 'value' variable all it wants, but it won't affect us here
// because we are now using `currentValue`.
std::cout << "value = " << currentValue << "\n";
// Simulate a 5 ms processing time, exaggerated to 500 ms
boost::this_thread::sleep(boost::posix_time::milliseconds(500));
}
}
int main()
{
boost::thread c(&consumer);
boost::thread p(&producer);
c.join();
p.join();
}
ADDENDUM
I was thinking about this question recently, and realized that this solution, while it may work, is not optimal. Your producer is using all that CPU just to throw away half of the computed values.
I suggest that you reconsider your design and go with a bounded blocking queue between the producer and consumer. Such a queue should have the following characteristics:
Thread-safe
The queue has a fixed size (bounded)
If the consumer wants to pop the next item, but the queue is empty, the operation will be blocked until notified by the producer that an item is available.
The producer can check if there's room to push another item and block until the space becomes available.
With this type of queue, you can effectively throttle down the producer so that it doesn't outpace the consumer. It also ensures that the producer doesn't waste CPU resources computing values that will be thrown away.
Libraries such as TBB and PPL provide implementations of concurrent queues. If you want to attempt to roll your own using std::queue (or boost::circular_buffer) and boost::condition_variable, check out this blogger's example.
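If you do roll your own, a bounded blocking queue with the characteristics listed above might look roughly like this. It is a minimal sketch using std::mutex and std::condition_variable rather than the Boost equivalents; the names BoundedBlockingQueue and sum_through_queue are made up for illustration, and using -1 as an exit sentinel is an assumption that only works when real values are never negative:

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

template <class T>
class BoundedBlockingQueue {
public:
    explicit BoundedBlockingQueue(std::size_t capacity) : capacity_(capacity) {}

    // Blocks while the queue is full, which throttles the producer.
    void push(T item) {
        std::unique_lock<std::mutex> lock(mutex_);
        notFull_.wait(lock, [this] { return items_.size() < capacity_; });
        items_.push(std::move(item));
        notEmpty_.notify_one();
    }

    // Blocks while the queue is empty.
    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        notEmpty_.wait(lock, [this] { return !items_.empty(); });
        T item = std::move(items_.front());
        items_.pop();
        notFull_.notify_one();
        return item;
    }

private:
    std::mutex mutex_;
    std::condition_variable notFull_;
    std::condition_variable notEmpty_;
    std::queue<T> items_;
    std::size_t capacity_;
};

// Usage sketch: the producer pushes 1..n and then a -1 sentinel;
// the consumer adds up everything it receives until the sentinel.
long long sum_through_queue(int n, std::size_t capacity) {
    BoundedBlockingQueue<int> queue(capacity);
    long long total = 0;
    std::thread consumer([&] {
        for (;;) {
            int value = queue.pop();
            if (value < 0) break;  // sentinel: producer is finished
            total += value;
        }
    });
    for (int i = 1; i <= n; ++i) queue.push(i);
    queue.push(-1);
    consumer.join();
    return total;
}
```

Because both wait calls use a predicate, a notification that arrives before the other side starts waiting is not lost: the waiter checks the queue state under the mutex first.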
The short answer is that you're almost certainly wrong.
With a producer/consumer, you pretty much need a queue between the two threads. There are basically two alternatives: either your code will simply lose tasks (which usually equals not working at all) or else your producer thread will need to block for the consumer thread to be idle before it can produce an item -- which effectively translates to single threading.
For the moment, I'm going to assume that the value you get back from rand is supposed to represent the task to be executed (i.e., is the value produced by the producer and consumed by the consumer). In that case, I'd write the code something like this:
void producer() {
for (int i=0; i<100; i++)
queue.insert(random()); // queue.insert blocks if queue is full
queue.insert(-1.0); // Tell consumer to exit
}
void consumer() {
double value;
while ((value = queue.get()) != -1) // queue.get blocks if queue is empty
process(value);
}
This relegates nearly all the interlocking to the queue. The rest of the code for both threads pretty much ignores threading issues entirely.
Implementing a pipeline is actually quite tricky if you are doing it from the ground up. For example, you'd have to use a condition variable to avoid the kind of race condition you described in your question, avoid busy waiting when implementing the mechanism for "waking up" the consumer, etc. Even using a "queue" of just 1 element won't save you from some of these complexities.
It's usually much better to use specialized libraries that were developed and extensively tested specifically for this purpose. If you can live with a Visual C++-specific solution, take a look at the Parallel Patterns Library and its concept of pipelines.
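To illustrate, here is a hedged sketch of such a 1-element "queue" that keeps only the latest value. The class name LatestValueSlot is hypothetical, and it assumes T is default-constructible. Even at this size it still needs a mutex, a condition variable and a flag checked under the lock:

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <utility>

template <class T>
class LatestValueSlot {
public:
    // Overwrites any value the consumer has not taken yet.
    void publish(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            value_ = std::move(value);
            fresh_ = true;
        }
        ready_.notify_one();
    }

    // Blocks until a value newer than the last one taken is available.
    T take() {
        std::unique_lock<std::mutex> lock(mutex_);
        // The predicate is checked under the mutex, so a publish() that
        // happens just before the consumer goes to sleep is not missed.
        ready_.wait(lock, [this] { return fresh_; });
        fresh_ = false;
        return std::move(value_);
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    T value_{};
    bool fresh_ = false;
};
```

If the producer publishes several values before the consumer calls take(), only the last one is returned; the earlier ones are dropped.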