Terminating custom std::thread pool

I've been playing around with multithreaded game engine architecture and thread pools lately, and I've implemented a basic Kernel class. This class has a std::vector<std::thread>, which represents the thread pool. The following function is run by a single thread in the pool:
std::unique_lock<std::mutex> kernelstateLocker(m_KernelStateMutex);
if(m_KernelState == KernelState::KernelShutdown || m_KernelState == KernelState::KernelTerminate)
//std::cout << "Worker #" << _workerID << std::endl console log here
else if(m_KernelState == KernelState::KernelWorkAvailable)
As you can see, a thread wakes up if the KernelState variable changes. This can happen when a Task is added to the queue, or when the Kernel shuts down. The kernel shutdown condition variable gets notified by the main program thread, via m_KernelStateCond.notify_all(). However, when I added the cout calls seen in the comment, only one of up to 8 worker threads would print its name and ID, indicating that the others never terminated. Does anybody know why this is, and how I can terminate all of the threads in my pool? In case it matters, my platform is TDM-GCC-64 5.1 on Windows 10 64-bit.
As per comment request and SO rules, here is the code that calls the condition variable.
std::unique_lock<std::mutex> shutdownLocker(m_IsRunningMutex);
m_ShutdownCond.wait(shutdownLocker, [this](){ return !m_IsRunning; });
m_KernelState = KernelState::KernelShutdown;
I'm pretty sure this part of my code is working, since at least one of the worker threads actually exits. And for completeness, here is my full Kernel class:
class Kernel : public Singleton<Kernel>
void boot(unsigned int _workerCount);
void run();
void shutdown();
void addTask(std::shared_ptr<Task> _task);
friend class Singleton<Kernel>;
bool m_IsRunning;
KernelState m_KernelState;
std::vector<std::thread> m_Workers;
std::queue<std::shared_ptr<Task>> m_Tasks;
std::vector<std::shared_ptr<Task>> m_LoopTasks;
std::condition_variable m_KernelStateCond;
std::mutex m_KernelStateMutex;
void workTask(unsigned int _workerID);

I figured the problem out. It was not a problem with my thread pool implementation per se, but it had to do with the fact that there were still tasks arriving while the kernel was shutting down. So some of the worker threads never shut down, and thus, the kernel got stuck. Adding constraints to the tasks solved this.
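One way to rule out tasks arriving during shutdown is to have the pool itself refuse them once shutdown has begun. The following is a minimal sketch under that assumption, not the original Kernel class: MiniPool, submit, shutdown and runMiniPoolDemo are hypothetical names. The stop flag is changed under the same mutex the workers wait on, every worker is notified, and every worker is joined.

```cpp
#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical, simplified pool; not the original Kernel class.
class MiniPool {
public:
    explicit MiniPool(unsigned workerCount) {
        for (unsigned i = 0; i < workerCount; ++i)
            m_workers.emplace_back([this] { workLoop(); });
    }

    ~MiniPool() { shutdown(); }

    // Returns false once shutdown has started, so no task can arrive late.
    bool submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            if (m_stopping) return false;
            m_tasks.push(std::move(task));
        }
        m_cond.notify_one();
        return true;
    }

    // Safe to call more than once; wakes and joins every worker.
    void shutdown() {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_stopping = true;
        }
        m_cond.notify_all();
        for (auto& w : m_workers)
            if (w.joinable()) w.join();
    }

private:
    void workLoop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_mutex);
                m_cond.wait(lock, [this] { return m_stopping || !m_tasks.empty(); });
                if (m_tasks.empty()) return;  // stopping and nothing left to run
                task = std::move(m_tasks.front());
                m_tasks.pop();
            }
            task();  // run the task outside the lock
        }
    }

    std::mutex m_mutex;
    std::condition_variable m_cond;
    std::queue<std::function<void()>> m_tasks;
    std::vector<std::thread> m_workers;
    bool m_stopping = false;
};

// Small demonstration: every accepted task has run by the time shutdown() returns.
int runMiniPoolDemo() {
    std::atomic<int> ran{0};
    MiniPool pool(4);
    for (int i = 0; i < 100; ++i)
        pool.submit([&ran] { ++ran; });
    pool.shutdown();
    return ran.load();
}
```

In this sketch, tasks that were already queued are still drained before the workers exit; a pool that should drop pending work instead could return as soon as m_stopping is set.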


Thread Pool blocks main threads after some loops

I'm trying to learn how threading works in C++, and I found an implementation which I used as a guide
to write my own. However, after one or a few loops it blocks.
I have a thread-safe queue in which I retrieve the jobs that are assigned to the thread pool.
Each thread runs this function:
// Declarations
std::vector<std::thread> m_threads;
JobQueue m_jobs; // A queue with locks
std::mutex m_mutex;
std::condition_variable m_condition;
std::atomic_bool m_active;
std::atomic_bool m_started;
std::atomic_int m_busy;
[this, threadIndex] {
int numThread = threadIndex;
while(this->m_active) {
std::unique_ptr<Job> currJob;
bool dequeued = false;
std::unique_lock<std::mutex> lock { this->m_mutex };
this->m_condition.wait(lock, [this, numThread]() {
return (this->m_started && !this->m_jobs.empty()) || !this->m_active;
if (this->m_active) {
dequeued = this->m_jobs.dequeue(currJob);
if (dequeued) {
std::lock_guard<std::mutex> lock { this->m_mutex };
} else {
std::lock_guard<std::mutex> lock { this->m_mutex };
and the loop is basically:
while(1) {
int numJobs = rand() % 10000;
std::cout << "Will do " << numJobs << " jobs." << std::endl;
while(numJobs--) {
// some heavy calculation
std::cout << "Done!" << std::endl; // chrono removed for readability
The waitEmpty method is defined as:
std::unique_lock<std::mutex> lock { this->m_mutex };
this->m_condition.wait(lock, [this] {
return this->empty();
It is in this wait method that the code usually hangs, because the predicate inside is never evaluated again.
I've debugged it and moved the notify_one and notify_all calls from place to place, but for some reason it always blocks after some loops.
Usually, but not always, it blocks in the condition_variable.wait() call that suspends the current thread until no other thread is working and the queue is empty, but I have also seen it happen when I call condition_variable.notify_all().
Some debugging helped me notice that while I call notify_all() on the worker thread, the wait() predicate in the main thread is not evaluated again.
The expected behavior is that it does not block when it loops.
I'm using G++ 8.1.0 on Windows.
and the output is:
Will do 41 jobs.
Done! Took 0ms!
Will do 8467 jobs.
<main thread blocked>
Edit: I fixed the issue pointed by paddy's comment: now m_busy-- also happens when a job is not dequeued.
Edit 2: Running this on Linux does not block the main thread and runs as expected. (g++ (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0)
Edit 3: As mentioned in the comments, corrected deadlock to block, as it only involves one lock
Edit 4: As commented by Jérôme Richard I was able to improve it by creating a lock_guard around the m_busy--; but now the code blocks at the notify_all() that is called inside the assign method. Here is the assign method for reference:
template<class Func, class... Args>
auto assign(Func&& func, Args&&... args) -> std::future<typename std::result_of<Func(Args...)>::type> {
using jobResultType = typename std::result_of<Func(Args...)>::type;
auto task = std::make_shared<std::packaged_task<jobResultType()>>(
std::bind(std::forward<Func>(func), std::forward<Args>(args)...)
auto job = std::unique_ptr<Job>(new Job([task](){ (*task)(); }));
std::future<jobResultType> result = task->get_future();
std::cout << " - enqueued";
std::cout << " - ok!" << std::endl;
return result;
In one of the loops the last output is
- enqueued - ok!
- enqueued - ok!
- enqueued
<blocked again>
Edit 5: With the latest changes, this does not happen when building with MSBuild.
The Gist for my implementation is here: https://gist.github.com/GuiAmPm/4be7716b7f1ea62819e61ef4ad3beb02
Here's also the original article on which I based my implementation:
Any help will be appreciated.
tl;dr: use a std::lock_guard of m_mutex around m_busy-- to avoid unexpected wait condition blocking.
First of all, please note that the problem can occur with one thread in the pool and just a few jobs. This means that the problem lies between the master thread that submits the jobs and the one that executes them.
Using GDB to analyze the state of the program when the wait gets stuck, one can see that there are no jobs, m_busy is 0 and both threads are waiting for notifications.
This means that there is a concurrency issue on the condition variable between the master and the only worker, on the last job to execute.
By adding a global atomic clock to your code, one can see that (in almost all cases) the worker finishes all the jobs before the master starts waiting for the jobs to be completed and the workers to be done.
Here is one practical scenario that was observed (the steps happen in this order):
the master starts the wait call while there are jobs remaining
the worker performs m_busy++, dequeues the last job and executes it (m_busy is now 1 and the job queue is empty)
the master evaluates the predicate of the wait call
the master calls ThreadPool::empty and the result is false because m_busy is 1
the worker performs m_busy-- (m_busy is now 0)
from that moment, the master could go back to waiting on the condition (but is suspected not to do so yet)
the worker notifies the condition
the master is suspected to go back to waiting on the condition only now, and so misses this last notification (and no further notification will follow)
At this point, the master is no longer executing instructions and will wait forever
the worker waits for the condition and will wait forever too
The fact that the master is not affected by the notification is very strange.
It seems to be related to memory fencing issues. A more detailed explanation can be found here. To quote the article:
Even if you make dataReady an atomic, it must be modified under the mutex; if not the modification to the waiting thread may be published, but not correctly synchronized.
So a solution is to replace the m_busy-- instruction with the following lines:
std::lock_guard<std::mutex> lck {this->m_mutex};
this->m_busy--;
This avoids the previous scenario. On one hand, m_mutex is held while the predicate of the wait call is checked, which prevents m_busy from being modified at that specific moment; on the other hand, it ensures that the data is properly synchronized.
It should theoretically be safer to also include the m_jobs.dequeue call in the locked region, but this would strongly reduce the degree of parallelism of the workers. In practice, the useful synchronization happens when the lock is released in the worker threads.
Please note that a general workaround for such problems is to add a timeout to the waiting calls, using the wait_for function in a loop to re-check the predicate. However, this solution comes at the price of a higher latency of the waiting calls and can thus significantly slow the execution down.
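The fix can be illustrated with a reduced, single-worker sketch. This is not the code from the gist: TinyPool and runTinyPoolDemo are hypothetical names, and the job queue is a plain std::queue guarded by the same mutex. The point it shows is that the busy counter only changes while m_mutex is held, so the decrement cannot fall between the waiter's predicate check and the moment it goes to sleep.

```cpp
#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical single-worker pool, reduced to the parts that matter here.
class TinyPool {
public:
    TinyPool() : m_worker([this] { run(); }) {}

    ~TinyPool() {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_active = false;
        }
        m_condition.notify_all();
        m_worker.join();
    }

    void assign(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_jobs.push(std::move(job));
        }
        m_condition.notify_all();
    }

    // Blocks until the queue is empty and no job is running.
    void waitEmpty() {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_condition.wait(lock, [this] { return m_jobs.empty() && m_busy == 0; });
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m_mutex);
                m_condition.wait(lock, [this] { return !m_active || !m_jobs.empty(); });
                if (!m_active) return;
                job = std::move(m_jobs.front());
                m_jobs.pop();
                ++m_busy;  // changed under the mutex
            }
            job();
            {
                // The decrement is also made under the mutex, so it cannot slip
                // in between the waiter's predicate check and its going to sleep.
                std::lock_guard<std::mutex> lock(m_mutex);
                --m_busy;
            }
            m_condition.notify_all();
        }
    }

    std::mutex m_mutex;
    std::condition_variable m_condition;
    std::queue<std::function<void()>> m_jobs;
    int m_busy = 0;
    bool m_active = true;
    std::thread m_worker;  // declared last so the other members exist before it starts
};

// Demonstration: repeated rounds of assign + waitEmpty, as in the question's loop.
int runTinyPoolDemo(int rounds, int jobsPerRound) {
    std::atomic<int> done{0};
    TinyPool pool;
    for (int r = 0; r < rounds; ++r) {
        for (int j = 0; j < jobsPerRound; ++j)
            pool.assign([&done] { ++done; });
        pool.waitEmpty();
    }
    return done.load();
}
```

Because the counter is only touched under the mutex, it can be a plain int here; making it atomic as well would not remove the need for the lock.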

Simple threaded timer, sanity check please

I've made a very simple threaded timer class and given the pitfalls around MT code, I would like a sanity check please. The idea here is to start a thread then continuously loop waiting on a variable. If the wait times out, the interval was exceeded and we call the callback. If the variable was signalled, the thread should quit and we don't call the callback.
One of the things I'm not sure about is what happens in the destructor with my code, given the thread may be joinable there (just). Can I join a thread in a destructor to make sure it's finished?
Here's the class:
class TimerThreaded
TimerThreaded() {}
if (MyThread.joinable())
void Start(std::chrono::milliseconds const & interval, std::function<void(void)> const & callback)
if (MyThread.joinable())
MyThread = std::thread([=]()
for (;;)
auto locked = std::unique_lock<std::mutex>(MyMutex);
auto result = MyTerminate.wait_for(locked, interval);
if (result == std::cv_status::timeout)
void Stop()
std::thread MyThread;
std::mutex MyMutex;
std::condition_variable MyTerminate;
I suppose a better question might be to ask someone to point me towards a very simple threaded timer, if there's one already available somewhere.
Can I join a thread in a destructor to make sure it's finished?
Not only can you, but it's quite typical to do so. If the thread instance is still joinable (i.e. it was started and has been neither joined nor detached) when it's destroyed, std::terminate is called.
For some reason result is always timeout. It never seems to get signalled and so never stops. Is it correct? notify_all should unblock the wait_for?
It can only unblock if the thread happens to be waiting on the cv at the time. What you're probably doing is calling Start and then immediately Stop before the thread has started running and begun waiting (or possibly while the callback is running). In that case, the thread would never be notified.
There is another problem with your code. Blocked threads may be spuriously woken up on some implementations even when you don't explicitly call notify_X. That would cause your timer to stop randomly for no apparent reason.
I propose that you add a flag variable that indicates whether Stop has been called. This will fix both of the above problems. This is the typical way to use condition variables. I've even written the code for you:
class TimerThreaded
MyThread = std::thread([=]()
for (;;)
auto locked = std::unique_lock<std::mutex>(MyMutex);
auto result = MyTerminate.wait_for(locked, interval);
if (stop_please)
if (result == std::cv_status::timeout)
void Stop()
std::lock_guard<std::mutex> lock(MyMutex);
stop_please = true;
bool stop_please = false;
With these changes your timer should work, but do realize that "[std::condition_variable::wait_for] may block for longer than timeout_duration due to scheduling or resource contention delays", in the words of cppreference.com.
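The same stop-flag idea can also be written with the predicate overload of wait_for, which re-checks the flag after every wakeup, spurious or not. The following is only a sketch of that variant: SketchTimer, m_stop and countTicks are hypothetical names, and it assumes the callback never calls Stop itself (that would make the timer thread join itself).

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical timer class illustrating the stop-flag pattern.
class SketchTimer {
public:
    ~SketchTimer() { Stop(); }

    void Start(std::chrono::milliseconds interval, std::function<void()> callback) {
        Stop();  // make sure a previous thread is finished before replacing it
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_stop = false;
        }
        m_thread = std::thread([this, interval, callback] {
            std::unique_lock<std::mutex> lock(m_mutex);
            for (;;) {
                // Returns true if m_stop is set, false if the interval elapsed.
                if (m_cond.wait_for(lock, interval, [this] { return m_stop; }))
                    return;
                lock.unlock();
                callback();  // run the callback without holding the mutex
                lock.lock();
            }
        });
    }

    void Stop() {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_stop = true;
        }
        m_cond.notify_all();
        if (m_thread.joinable()) m_thread.join();
    }

private:
    std::thread m_thread;
    std::mutex m_mutex;
    std::condition_variable m_cond;
    bool m_stop = false;
};

// Demonstration: count how often the callback fires during runFor.
int countTicks(std::chrono::milliseconds interval, std::chrono::milliseconds runFor) {
    std::atomic<int> ticks{0};
    SketchTimer timer;
    timer.Start(interval, [&ticks] { ++ticks; });
    std::this_thread::sleep_for(runFor);
    timer.Stop();
    return ticks.load();
}
```

Since the predicate is evaluated before the thread first sleeps, a Stop that arrives right after Start is not lost: the thread sees m_stop already set and exits without waiting for the interval.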
point me towards a very simple threaded timer, if there's one already available somewhere.
I don't know of a standard C++ solution, but modern operating systems typically provide this kind of functionality, or at least pieces that can be used to build it. See timerfd_create on Linux for an example.

C++11 lockfree single producer single consumer: how to avoid busy wait

I'm trying to implement a class that uses two threads: one for the producer and one for the consumer. The current implementation does not use locks:
#include <boost/lockfree/spsc_queue.hpp>
#include <atomic>
#include <thread>
using Queue =
class Worker
Worker() : working_(false), done_(false) {}
~Worker() {
done_ = true; // exit even if the work has not been completed
void enqueue(int value) {
if (!working_) {
working_ = true;
worker_ = std::thread([this]{ work(); });
void work() {
int value;
while (!done_ && queue_.pop(value)) {
std::cout << value << std::endl;
working_ = false;
std::atomic<bool> working_;
std::atomic<bool> done_;
Queue queue_;
std::thread worker_;
The application needs to enqueue work items for a certain amount of time and then sleep waiting for an event. This is a minimal main that simulates the behavior:
int main()
Worker w;
for (int i = 0; i < 1000; ++i)
for (int i = 0; i < 1000; ++i)
I'm pretty sure that my implementation is bugged: what if the worker thread completes and before executing working_ = false, another enqueue comes? Is it possible to make my code thread safe without using locks?
The solution requires:
a fast enqueue
the destructor has to quit even if the queue is not empty
no busy wait, because there are long periods of time in which the worker thread is idle
no locks if possible
I did another implementation of the Worker class, based on your suggestions. Here is my second attempt:
class Worker
: working_(ATOMIC_FLAG_INIT), done_(false) { }
~Worker() {
// exit even if the work has not been completed
done_ = true;
if (worker_.joinable())
bool enqueue(int value) {
bool enqueued = queue_.push(value);
if (!working_.test_and_set()) {
if (worker_.joinable())
worker_ = std::thread([this]{ work(); });
return enqueued;
void work() {
int value;
while (!done_ && queue_.pop(value)) {
std::cout << value << std::endl;
while (!done_ && queue_.pop(value)) {
std::cout << value << std::endl;
std::atomic_flag working_;
std::atomic<bool> done_;
Queue queue_;
std::thread worker_;
I introduced the worker_.join() inside the enqueue method. This can impact performance, but only in very rare cases (when the queue becomes empty and another enqueue arrives before the thread exits). The working_ variable is now an atomic_flag that is set in enqueue and cleared in work. The additional while after working_.clear() is needed because if another value is pushed before the clear but after the first while, the value would otherwise not be processed.
Is this implementation correct?
I did some tests and the implementation seems to work.
OT: Is it better to put this as an edit, or an answer?
what if the worker thread completes and before executing working_ = false, another enqueue comes?
Then the value will be pushed to the queue but will not be processed until another value is enqueued after the flag is set. You (or your users) may decide whether that is acceptable. This can be avoided using locks, but they're against your requirements.
The code may fail if the running thread is about to finish and sets working_ = false; but hasn't stopped running before next value is enqueued. In that case your code will call operator= on the running thread which results in a call to std::terminate according to the linked documentation.
Adding worker_.join() before assigning the worker to a new thread should prevent that.
Another problem is that queue_.push may fail if the queue is full because it has a fixed size. Currently you just ignore the case and the value will not be added to the full queue. If you wait for queue to have space, you don't get fast enqueue (in the edge case). You could take the bool returned by push (which tells if it was successful) and return it from enqueue. That way the caller may decide whether it wants to wait or discard the value.
Or use a non-fixed-size queue. Boost has this to say about that choice:
Can be used to completely disable dynamic memory allocations during push in order to ensure lockfree behavior.
If the data structure is configured as fixed-sized, the internal nodes are stored inside an array and they are addressed
by array indexing. This limits the possible size of the queue to the number of elements that can be addressed by the index
type (usually 2**16-2), but on platforms that lack double-width compare-and-exchange instructions, this is the best way
to achieve lock-freedom.
Your worker thread needs more than 2 states.
Not running
Doing tasks
Idle shutdown
Shutdown
If you force shut down, it skips idle shutdown. If you run out of tasks, it transitions to idle shutdown. In idle shutdown, it empties the task queue, then goes into shutting down.
Shutdown is set, then you walk off the end of your worker task.
The producer first puts things on the queue. Then it checks the worker state. If Shutdown or Idle shutdown, first join it (and transition it to not running) then launch a new worker. If not running, just launch a new worker.
If the producer wants to launch a new worker, it first makes sure that we are in the not running state (otherwise, logic error). We then transition to the Doing tasks state, and then we launch the worker thread.
If the producer wants to shut down the helper task, it sets the done flag. It then checks the worker state. If it is anything besides not running, it joins it.
This can result in a worker thread that is launched for no good reason.
There are a few cases where the above can block, but there were a few before as well.
Then, we write a formal or semi-formal proof that the above cannot lose messages, because when writing lock free code you aren't done until you have a proof.
This is my solution to the question. I don't much like answering my own question, but I think showing actual code may help others.
#include <boost/lockfree/spsc_queue.hpp>
#include <atomic>
#include <thread>
// I used this semaphore class: https://gist.github.com/yohhoy/2156481
#include "binsem.hpp"
using Queue =
class Worker
// the worker thread starts in the constructor
: working_(ATOMIC_FLAG_INIT), done_(false), semaphore_(0)
, worker_([this]{ work(); })
{ }
~Worker() {
// exit even if the work has not been completed
done_ = true;
bool enqueue(int value) {
bool enqueued = queue_.push(value);
if (!working_.test_and_set())
// signal to the worker thread to wake up
return enqueued;
void work() {
int value;
// the worker thread continue to live
while (!done_) {
// wait the start signal, sleeping
while (!done_ && queue_.pop(value)) {
// perform actual work
std::cout << value << std::endl;
while (!done_ && queue_.pop(value)) {
// perform actual work
std::cout << value << std::endl;
std::atomic_flag working_;
std::atomic<bool> done_;
binsem semaphore_;
Queue queue_;
std::thread worker_;
I tried the suggestion of @Cameron: not shutting down the thread, and adding a semaphore. The semaphore is actually used only in the first enqueue and in the last work. This is not lock-free, but only in these two cases.
I did some performance comparisons between my previous version (see my edited question) and this one. There are no significant differences when there are not many starts and stops. However, enqueue is 10 times faster when it has to signal the worker thread instead of starting a new thread. This is a rare case, so it is not very important, but it is an improvement anyway.
This implementation satisfies:
lock-free in the common case (when enqueue and work are busy);
no busy wait when nothing is enqueued for a long time
the destructor exits as soon as possible
correctness?? :)
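For readers without the linked binsem.hpp, a binary semaphore of the kind used above can be sketched with a mutex and a condition variable. This is an assumption about what such a class does, not the code from that gist: BinarySemaphoreSketch, signal, wait and handOff are hypothetical names, and the real interface may differ.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical binary semaphore; the interface of the real binsem.hpp may differ.
class BinarySemaphoreSketch {
public:
    explicit BinarySemaphoreSketch(bool signaled = false) : m_signaled(signaled) {}

    // Wakes one waiter, or lets the next wait() return immediately.
    void signal() {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_signaled = true;
        }
        m_cond.notify_one();
    }

    // Sleeps (no busy wait) until signaled, then consumes the signal.
    void wait() {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cond.wait(lock, [this] { return m_signaled; });
        m_signaled = false;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cond;
    bool m_signaled;
};

// Demonstration: a consumer sleeps until the producer signals, then reads the value.
int handOff(int value) {
    BinarySemaphoreSketch ready;
    int slot = 0;
    int seen = -1;
    std::thread consumer([&] {
        ready.wait();  // blocks here until signal() is called
        seen = slot;
    });
    slot = value;      // written before the signal, so the consumer sees it
    ready.signal();
    consumer.join();
    return seen;
}
```

Because the signal is remembered in m_signaled, a signal() that happens before the matching wait() is not lost, which is the property the worker relies on when it goes to sleep.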
Very partial answer: I think all those atomics, semaphores and states are a back-communication channel, from "the thread" to "the Worker". Why not use another queue for that? At the very least, thinking about it will help you around the problem.

Static Class variable for Thread Count in C++

I am writing a thread based application in C++. The following is sample code showing how I am checking the thread count. I need to ensure that at any point in time, there are only 20 worker threads spawned from my application:
using namespace std;
class ThreadWorkerClass
static int threadCount;
ThreadWorkerClass() // a constructor has no return type
threadCount ++;
static int getThreadCount()
return threadCount;
void run()
/* The worker thread execution
* logic is to be written here */
//Reduce count by 1 as worker thread would finish here
threadCount --;
int main()
ThreadWorkerClass twObj;
//Use Boost to start Worker Thread
//Assume max 20 worker threads need to be spawned
if(ThreadWorkerClass::getThreadCount() <= 20)
boost::thread *wrkrThread = new boost::thread(
//Wait for the threads to join
//Something like (*wrkrThread).join();
return 0;
Will this design require me to take a lock on the variable threadCount? Assume that I will be running this code in a multi-processor environment.
The design is not good enough. The problem is that you exposed the constructor, so whether you like it or not, people will be able to create as many instances of your object as they want. You should do some sort of thread pooling, i.e. have a class that maintains a pool of threads and hands one out when available, something like:
class MyThreadClass {
//the method obtaining that thread is responsible for returning it
class ThreadPool {
//create 20 instances of your Threadclass
//This is a blocking function
MyThreadClass getInstance() {
//if a thread from the pool is free give it, else wait
So everything is maintained internally by the pooling class. Never give control over that class to others. You can also add query functions to the pooling class, like hasFreeThreads(), numFreeThreads(), etc.
You can also enhance this design by handing out smart pointers, so you can track how many owners still hold a thread.
Making whoever obtains a thread responsible for releasing it is sometimes dangerous, because a process may crash and never give the thread back. There are many solutions to that; the simplest is to maintain a clock on each thread, so that when the time runs out the thread is taken back by force.
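On the locking question itself: a plain static int that several threads increment and decrement is a data race, so the counter needs either a lock or an atomic type. The following sketch shows the atomic variant together with a guard object that gives the slot back automatically. SlotLimiter, SlotGuard and startWorkers are hypothetical names introduced only for illustration, and it uses std::thread rather than boost::thread.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical helper: hands out at most `limit` slots at a time.
class SlotLimiter {
public:
    explicit SlotLimiter(int limit) : m_limit(limit), m_used(0) {}

    // Atomically claims a slot; returns false if the limit is already reached.
    bool tryAcquire() {
        int current = m_used.load();
        while (current < m_limit) {
            // On failure, compare_exchange_weak reloads `current` for the retry.
            if (m_used.compare_exchange_weak(current, current + 1))
                return true;
        }
        return false;
    }

    void release() { m_used.fetch_sub(1); }
    int inUse() const { return m_used.load(); }

private:
    const int m_limit;
    std::atomic<int> m_used;
};

// RAII guard so a slot is returned even if the worker exits early.
class SlotGuard {
public:
    explicit SlotGuard(SlotLimiter& limiter) : m_limiter(limiter) {}
    ~SlotGuard() { m_limiter.release(); }
    SlotGuard(const SlotGuard&) = delete;
    SlotGuard& operator=(const SlotGuard&) = delete;

private:
    SlotLimiter& m_limiter;
};

// Demonstration: many requests, but never more than `limit` workers started.
int startWorkers(int requests, int limit) {
    SlotLimiter limiter(limit);
    std::vector<std::thread> workers;
    std::atomic<bool> go{false};
    int started = 0;
    for (int i = 0; i < requests; ++i) {
        if (!limiter.tryAcquire()) continue;  // limit reached: refuse this request
        ++started;
        workers.emplace_back([&limiter, &go] {
            SlotGuard guard(limiter);  // releases the slot when the worker ends
            while (!go.load()) std::this_thread::yield();  // hold the slot for the demo
        });
    }
    go = true;
    for (auto& w : workers) w.join();
    return started;
}
```

Checking the count and incrementing it in one compare-and-exchange step matters: a separate "read, compare with 20, then increment" sequence would let two threads both pass the check and exceed the limit.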

Boost shared_lock. Read preferred?

I was checking out the Boost library (version 1.45) for a reader/writer lock. When I ran my tests on it, it seemed like the shared_mutex was preferring my reader threads, i.e. when my writer tried to take the lock for its operation, it didn't stop any subsequent reads from occurring.
Is it possible in boost to change this behavior?
using namespace std;
using namespace boost;
mutex outLock;
shared_mutex workerAccess;
bool shouldIWork = true;
class WorkerKiller
void operator()()
upgrade_lock<shared_mutex> lock(workerAccess);
upgrade_to_unique_lock<shared_mutex> uniqueLock(lock);
cout << "Grabbed exclusive lock, killing system" << endl;
shouldIWork = false;
cout << "KILLING ALL WORK" << endl;
class Worker
void operator()()
shared_lock<shared_mutex> lock(workerAccess);
if (!shouldIWork) {
cout << "Workers are on strike. This worker refuses to work" << endl;
} else {
cout << "Worked finished her work" << endl;
int main(int argc, char* argv[])
Worker w1;
Worker w2;
Worker w3;
Worker w4;
WorkerKiller wk;
boost::thread workerThread1(w1);
boost::thread workerThread2(w2);
boost::thread workerKillerThread(wk);
boost::thread workerThread3(w3);
boost::thread workerThread4(w4);
return 0;
And here is the output every time:
Worked finished her work
Worked finished her work
Worked finished her work
Worked finished her work
Grabbed exclusive lock, killing system
My Requirement
If the writer tried to grab an exclusive lock, I'd like for all previous read operations to finish. And then all subsequent read operations to block.
I'm a little late to this question, but I believe I have some pertinent information.
The proposals of shared_mutex to the C++ committee, which the Boost libs are based on, purposefully did not specify an API to give either readers or writers priority. This is because Alexander Terekhov proposed an algorithm about a decade ago that is completely fair. It lets the operating system decide whether the next thread to get the mutex is a reader or writer, and the operating system is completely ignorant as to whether the next thread is a reader or writer.
Because of this algorithm, the need for specifying whether a reader or writer is preferred disappears. To the best of my knowledge, the boost libs are now (boost 1.52) implemented with this fair algorithm.
The Terekhov algorithm consists of having the read/write mutex consist of two gates: gate1 and gate2. Only one thread at a time can pass through each gate. The gates can be implemented with a mutex and two condition variables.
Both readers and writers attempt to pass through gate1. In order to pass through gate1 it must be true that a writer thread is not currently inside of gate1. If there is, the thread attempting to pass through gate1 blocks.
Once a reader thread passes through gate1 it has read ownership of the mutex.
When a writer thread passes through gate1 it must also pass through gate2 before obtaining write ownership of the mutex. It can not pass through gate2 until the number of readers inside of gate1 drops to zero.
This is a fair algorithm because when there are only 0 or more readers inside of gate1, it is up to the OS as to whether the next thread to get inside of gate1 is a reader or writer. A writer becomes "prioritized" only after it has passed through gate1, and is thus next in line to obtain ownership of the mutex.
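The two-gate description above can be sketched with one mutex and two condition variables. This is a simplified illustration, not the Boost or standard library implementation: TwoGateSharedMutex and runReadersAndWriters are hypothetical names, and details such as a maximum reader count, try-locks and timed locks are left out.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical, simplified two-gate reader/writer mutex.
class TwoGateSharedMutex {
public:
    // Writer: pass gate1 (no other writer inside), then gate2 (no readers left).
    void lock() {
        std::unique_lock<std::mutex> lk(m_mutex);
        m_gate1.wait(lk, [this] { return !m_writerEntered; });
        m_writerEntered = true;  // from now on, new readers block at gate1
        m_gate2.wait(lk, [this] { return m_readers == 0; });
    }

    void unlock() {
        {
            std::lock_guard<std::mutex> lk(m_mutex);
            m_writerEntered = false;
        }
        m_gate1.notify_all();  // readers and writers compete for gate1 again
    }

    // Reader: only has to pass gate1.
    void lock_shared() {
        std::unique_lock<std::mutex> lk(m_mutex);
        m_gate1.wait(lk, [this] { return !m_writerEntered; });
        ++m_readers;
    }

    void unlock_shared() {
        std::lock_guard<std::mutex> lk(m_mutex);
        --m_readers;
        // The last reader out lets a writer waiting at gate2 proceed.
        if (m_writerEntered && m_readers == 0) m_gate2.notify_one();
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_gate1;
    std::condition_variable m_gate2;
    bool m_writerEntered = false;
    unsigned m_readers = 0;
};

// Demonstration: writers keep a == b; readers must never see them differ.
// Returns the final value of a, or -1 if any reader saw an inconsistent state.
long runReadersAndWriters(int writers, int readers, int iterations) {
    TwoGateSharedMutex mtx;
    long a = 0, b = 0;  // both guarded by mtx
    std::atomic<long> mismatches{0};
    std::vector<std::thread> threads;
    for (int w = 0; w < writers; ++w)
        threads.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                mtx.lock();
                ++a;
                ++b;
                mtx.unlock();
            }
        });
    for (int r = 0; r < readers; ++r)
        threads.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                mtx.lock_shared();
                if (a != b) ++mismatches;
                mtx.unlock_shared();
            }
        });
    for (auto& t : threads) t.join();
    return mismatches == 0 ? a : -1;
}
```

Which thread wins gate1 after unlock() is left to the scheduler, which is where the fairness argument comes from: the mutex itself never ranks readers above writers or the reverse.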
I used your example compiled against an example implementation of what eventually became shared_timed_mutex in C++14 (with minor modifications to your example). The code below calls it shared_mutex which is the name it had when it was proposed.
I got the following outputs (all with the same executable):
Worked finished her work
Worked finished her work
Grabbed exclusive lock, killing system
Workers are on strike. This worker refuses to work
Workers are on strike. This worker refuses to work
And sometimes:
Worked finished her work
Grabbed exclusive lock, killing system
Workers are on strike. This worker refuses to work
Workers are on strike. This worker refuses to work
Workers are on strike. This worker refuses to work
And sometimes:
Worked finished her work
Worked finished her work
Worked finished her work
Worked finished her work
Grabbed exclusive lock, killing system
I believe it should be theoretically possible to also obtain other outputs, though I did not confirm that experimentally.
In the interest of full disclosure, here is the exact code I executed:
#include "../mutexes/shared_mutex"
#include <thread>
#include <chrono>
#include <iostream>
using namespace std;
using namespace ting;
mutex outLock;
shared_mutex workerAccess;
bool shouldIWork = true;
class WorkerKiller
void operator()()
unique_lock<shared_mutex> lock(workerAccess);
cout << "Grabbed exclusive lock, killing system" << endl;
shouldIWork = false;
cout << "KILLING ALL WORK" << endl;
class Worker
void operator()()
shared_lock<shared_mutex> lock(workerAccess);
if (!shouldIWork) {
lock_guard<mutex> _(outLock);
cout << "Workers are on strike. This worker refuses to work" << endl;
} else {
lock_guard<mutex> _(outLock);
cout << "Worked finished her work" << endl;
int main()
Worker w1;
Worker w2;
Worker w3;
Worker w4;
WorkerKiller wk;
thread workerThread1(w1);
thread workerThread2(w2);
thread workerKillerThread(wk);
thread workerThread3(w3);
thread workerThread4(w4);
return 0;
