C++: Thread pool slower than single threading?

First of all I did look at the other topics on this website and found they don't relate to my problem as those mostly deal with people using I/O operations or thread creation overheads. My problem is that my threadpool or worker-task structure implementation is (in this case) a lot slower than single threading. I'm really confused by this and not sure if it's the ThreadPool, the task itself, how I test it, the nature of threads or something out of my control.
// Sorry for the long code
#include <vector>
#include <queue>
#include <thread>
#include <mutex>
#include <future>
#include <chrono>
#include <condition_variable>
#include "task.hpp"
class ThreadPool
for (unsigned i = 0; i < std::thread::hardware_concurrency() - 1; i++)
m_workers.emplace_back(this, i);
m_running = true;
for (auto&& worker : m_workers)
m_running = false;
for (auto&& worker : m_workers)
void add_task(Task* task)
std::unique_lock<std::mutex> lock(m_in_mutex);
class Worker
Worker(ThreadPool* parent, unsigned id) : m_parent(parent), m_id(id)
void start()
m_thread = new std::thread(&Worker::work, this);
void terminate()
if (m_thread)
if (m_thread->joinable())
delete m_thread;
m_thread = nullptr;
m_parent = nullptr;
void work()
while (m_parent->m_running)
std::unique_lock<std::mutex> lock(m_parent->m_in_mutex);
m_parent->m_task_signal.wait(lock, [&]()
return !m_parent->m_in.empty() || !m_parent->m_running;
if (!m_parent->m_running) break;
Task* task = m_parent->m_in.front();
// Fixed the mutex being locked while the task is executed
ThreadPool* m_parent = nullptr;
unsigned m_id = 0;
std::thread* m_thread = nullptr;
std::vector<Worker> m_workers;
std::mutex m_in_mutex;
std::condition_variable m_task_signal;
std::queue<Task*> m_in;
bool m_running = false;
class TestTask : public Task
TestTask() {}
TestTask(unsigned number) : m_number(number) {}
inline void Set(unsigned number) { m_number = number; }
void execute() override
if (m_number <= 3)
m_is_prime = m_number > 1;
else if (m_number % 2 == 0 || m_number % 3 == 0)
m_is_prime = false;
for (unsigned i = 5; i * i <= m_number; i += 6)
if (m_number % i == 0 || m_number % (i + 2) == 0)
m_is_prime = false;
m_is_prime = true;
unsigned m_number = 0;
bool m_is_prime = false;
int main()
ThreadPool pool;
unsigned num_tasks = 1000000;
std::vector<TestTask> tasks(num_tasks);
for (auto&& task : tasks)
task.Set(randint(0, 1000000000));
auto s = std::chrono::high_resolution_clock::now();
#if MT
for (auto&& task : tasks)
for (auto&& task : tasks)
auto e = std::chrono::high_resolution_clock::now();
double seconds = std::chrono::duration_cast<std::chrono::nanoseconds>(e - s).count() / 1000000000.0;
Benchmarks with VS2013 Profiler:
10,000,000 tasks:
13 seconds of wall clock time
93.36% is spent in msvcp120.dll
3.45% is spent in Task::execute() // Not good here
0.5 seconds of wall clock time
97.31% is spent with Task::execute()

Usual disclaimer in such answers: the only way to tell for sure is to measure it with a profiler tool.
But I will try to explain your results without it. First of all, you have one mutex shared by all your threads, and a worker holds it while it executes a task, so only one thread can run a task at a time. That wipes out any gain you might get: despite the threads, your code is effectively serial. So at the very least, move task execution out of the critical section. You need to lock the mutex only to take a task out of the queue; you don't need to hold it while the task executes.
Next, your tasks are so simple that a single thread will execute them in no time. You can't measure any gain with such tasks. Create some heavier tasks that produce more interesting results (tasks closer to real-world work, not contrived ones).
Third, threads are not free: context switching, mutex contention and so on all cost time. To see real gains, as the previous two points say, each task must take longer than the overhead the threads introduce, and the code must be truly parallel rather than serialized by waiting on a shared resource.
UPD: I looked at the wrong part of the code. The task is complex enough provided you create tasks with sufficiently large numbers.
UPD2: I've played with your code and found a good prime number to show how the MT code is better. Use the following prime number: 1019048297. It will give enough computation complexity to show the difference.
But why doesn't your code produce good results? It is hard to tell without seeing the implementation of randint(), but I take it to be pretty simple: in half of the cases it returns an even number, and the other cases don't produce many large primes either. So the tasks are so cheap that context switching and the other overheads of your particular implementation, and of threads in general, consume more time than the computation itself. Using the prime number I gave you leaves the tasks no choice but to spend time computing: there is no early exit, since the number is large and actually prime. That's why the large number gives you the answer you seek: a better time for the MT code.
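To make the point about task cost concrete, here is a small sketch of the same 6k ± 1 trial division as TestTask::execute(), instrumented to count loop iterations. The names PrimeCheck and check_prime are made up for this illustration, and the iteration counter is not part of the original code.

```cpp
#include <cassert>
#include <cstdint>

// Result of one primality test plus the number of loop iterations it cost.
struct PrimeCheck {
    bool is_prime;
    unsigned iterations;
};

// Trial division by 2, 3 and then numbers of the form 6k - 1 and 6k + 1.
PrimeCheck check_prime(std::uint32_t n) {
    if (n <= 3) return {n > 1, 0};
    if (n % 2 == 0 || n % 3 == 0) return {false, 0};  // early exit, no loop at all
    unsigned iterations = 0;
    // 64-bit loop variable so that i * i cannot overflow for large n.
    for (std::uint64_t i = 5; i * i <= n; i += 6) {
        ++iterations;
        if (n % i == 0 || n % (i + 2) == 0) return {false, iterations};
    }
    return {true, iterations};
}
```

Every even number and every multiple of 3 returns with zero iterations, which is far less work than one trip through a mutex and a condition variable. Only a large prime (or a number whose smallest factor is large) runs the loop all the way to sqrt(n).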

You should not hold the mutex while the task is getting executed, otherwise other threads will not be able to get a task:
void work() {
    while (m_parent->m_running) {
        Task* currentTask = nullptr;
        std::unique_lock<std::mutex> lock(m_parent->m_in_mutex);
        m_parent->m_task_signal.wait(lock, [&]() {
            return !m_parent->m_in.empty() || !m_parent->m_running;
        });
        if (!m_parent->m_running) continue;
        currentTask = m_parent->m_in.front();
        m_parent->m_in.pop();
        lock.unlock(); //<- Release the lock so that other threads can get tasks
        currentTask->execute(); // runs without the mutex held
        currentTask = nullptr;
    }
}
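For anyone who wants to try this in isolation, here is a minimal self-contained pool built around the same idea: the mutex is held only while a task is taken from the queue. MiniPool is a hypothetical name, and it uses std::function instead of the Task class from the question, so treat it as a sketch rather than a drop-in replacement.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class MiniPool {
public:
    explicit MiniPool(unsigned workers) {
        for (unsigned i = 0; i < workers; ++i)
            threads_.emplace_back([this] { work(); });
    }

    // Lets the workers drain the remaining tasks, then joins them.
    ~MiniPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            running_ = false;
        }
        signal_.notify_all();
        for (auto& t : threads_) t.join();
    }

    void add_task(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        signal_.notify_one();
    }

private:
    void work() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                signal_.wait(lock, [this] { return !tasks_.empty() || !running_; });
                if (tasks_.empty()) return;  // only reachable once running_ is false
                task = std::move(tasks_.front());
                tasks_.pop();
            }  // lock released here, before the task runs
            task();
        }
    }

    std::mutex mutex_;
    std::condition_variable signal_;
    std::queue<std::function<void()>> tasks_;
    bool running_ = true;
    std::vector<std::thread> threads_;
};
```

Usage is `MiniPool pool(4); pool.add_task([] { /* work */ });`. Note that this only removes the serialization; with very cheap tasks the queue itself is still the bottleneck.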

For MT, how much time is spent in each phase of the "overhead": std::unique_lock, m_task_signal.wait, front, pop, unlock?
Based on your results of only 3% useful work, this means the above consumes 97%. I'd get numbers for each part of the above (e.g. add timestamps between each call).
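One way to get those numbers is to wrap the locked section in a small timing helper. The sketch below is only an illustration: PhaseTimes and timed_locked are hypothetical names, and it splits the overhead into just two phases (waiting for the lock, and time spent holding it).

```cpp
#include <cassert>
#include <chrono>
#include <mutex>

// Accumulated time per phase, in nanoseconds.
struct PhaseTimes {
    std::chrono::nanoseconds lock_wait{0};
    std::chrono::nanoseconds critical_section{0};
};

// Runs `body` with `m` locked and records how long acquiring the lock took
// and how long the lock was held.
template <typename Body>
void timed_locked(std::mutex& m, PhaseTimes& times, Body body) {
    using Clock = std::chrono::steady_clock;
    using std::chrono::duration_cast;
    using std::chrono::nanoseconds;

    const auto t0 = Clock::now();
    std::unique_lock<std::mutex> lock(m);
    const auto t1 = Clock::now();
    body();
    const auto t2 = Clock::now();
    lock.unlock();

    times.lock_wait += duration_cast<nanoseconds>(t1 - t0);
    times.critical_section += duration_cast<nanoseconds>(t2 - t1);
}
```

Give each worker its own PhaseTimes and add them up after the threads have joined, so that the measurement itself does not introduce more shared state.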
It seems to me, that the code you use to [merely] dequeue the next task pointer is quite heavy. I'd do a much simpler queue [possibly lockless] mechanism. Or, perhaps, use atomics to bump an index into the queue instead of the five step process above. For example:
while (m_parent->m_running) {
    // NOTE: this is just an example, not necessarily the real function
    int curindex = atomic_increment(&global_index);
    if (curindex >= max_index)
        break;
    Task *task = m_parent->m_in[curindex];
    task->execute();
}
Also, maybe you should pop [say] ten at a time instead of just one.
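With standard C++ the atomic index and the "several at a time" idea could look roughly like this. It assumes all tasks are known up front and can be addressed by index; run_indexed is a hypothetical helper, not part of the code in the question.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Calls fn(i) exactly once for every i in [0, count), spread over `workers`
// threads. Each fetch_add claims a whole batch of indices, so the only shared
// state touched per batch is one atomic counter.
template <typename Fn>
void run_indexed(std::size_t count, unsigned workers, std::size_t batch, Fn fn) {
    assert(batch > 0);
    std::atomic<std::size_t> next{0};
    auto loop = [&] {
        for (;;) {
            const std::size_t begin = next.fetch_add(batch);
            if (begin >= count) return;
            const std::size_t end = std::min(begin + batch, count);
            for (std::size_t i = begin; i < end; ++i) fn(i);
        }
    };
    std::vector<std::thread> threads;
    for (unsigned w = 0; w < workers; ++w) threads.emplace_back(loop);
    for (auto& t : threads) t.join();
}
```

fn is called from several threads at once, so it must be safe to call concurrently for different indices.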
You might also be memory bound and/or "task switch" bound. (e.g.) For threads that access an array, more than four threads usually saturates the memory bus. You could also have heavy contention for the lock, such that the threads get starved because one thread is monopolizing the lock [indirectly, even with the new unlock call]
Interthread locking usually involves a "serialization" operation where other cores must synchronize their out-of-order execution pipelines.
Here's a "lockless" implementation:
// assume m_id is 0,1,2,...
int curindex = m_id;
while (m_parent->m_running) {
    if (curindex >= max_index)
        break;
    Task *task = m_parent->m_in[curindex];
    task->execute();
    curindex += NUMBER_OF_WORKERS;
}
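A runnable version of the strided loop above might look like this, again assuming the full task list exists before the workers start. run_strided is a made-up name for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Worker `id` handles indices id, id + workers, id + 2 * workers, ...
// No shared counter and no lock: the split is fixed before the threads start.
template <typename Fn>
void run_strided(std::size_t count, unsigned workers, Fn fn) {
    std::vector<std::thread> threads;
    for (unsigned id = 0; id < workers; ++id) {
        threads.emplace_back([=, &fn] {
            for (std::size_t i = id; i < count; i += workers) fn(i);
        });
    }
    for (auto& t : threads) t.join();
}
```

The price of having no synchronization is that there is no load balancing: if some tasks are much more expensive than others, a worker can finish its share early and sit idle.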


C++ Multithreading, Mutex

A while ago I was looking for a way to speed up a function with multithreading. The base function finishes in around 15 seconds and I would like to reduce that, but I cannot work out how to write a correct, working multithreaded version.
Base function, before any changes:
void FirstCall()
void MainFunction1()
//Call another functions, MainFunction3-10 for example
void MainFunction2()
//Call other, different functions, in a for loop
In this case, the function needs around 15 seconds to finish.
The way I found to speed up this function was multithreading.
Let me show what it looks like right now, and what my problem with it is.
//Way 1 of multithreading
void FirstCall()
std::vector<std::thread> threads;
threads.push_back(std::thread(&MainFunction1, this));
threads.push_back(std::thread(&MainFunction2, this));
for (auto& th : threads)
if (th.joinable())
The other functions are exactly the same, so they shouldn't affect the runtime. The runtime of the function shown above is around 8-10 seconds, so it seems to work fine, but sometimes the application simply closes when this function is called.
//Way 2 of multithreading
void FirstCall()
static std::mutex s_mutex;
static std::atomic<int> thread_number = 0;
auto MainFunctions = [&](int index)
auto ThreadFunction = [&]()
std::lock_guard<std::mutex> lGuard (s_mutex);
int thread_count = std::thread::hardware_concurrency(); //8
//thread count > function count (2 functions)
std::vector<std::thread> threads;
for (int i = 0; i < 2; i++)
for (auto& th : threads)
if (th.joinable())
void SwitchMainFunctions(int index)
case 0:
case 1:
The function presented as way 2 works fine and my application does not crash anymore, but the run time is the same as the untouched function's, ~15 seconds.
I think the mutex lock forces each thread to wait until the other one finishes, so it is exactly the same as just using the original code, but I would really like to speed up the function.
I tried to speed up my function with multithreading, but the two ways I tried have different problems.
The first sometimes makes my application crash when the function is called.
The second has the same run time as the original function without multithreading.
Your second option is far more complicated than the first. Here is a simple version:
void FirstCall()
{
    std::vector<std::thread> threads;
    threads.push_back(std::thread(MainFunction1)); // `this` removed, since MainFunction1 takes no arguments
    threads.push_back(std::thread(MainFunction2));
    for (auto& th : threads)
        if (th.joinable())
            th.join();
}
In this simple scenario the main thread blocks until both threads finish and are joined. In your first option you passed this as an argument for MainFunction1 in the thread constructor. That implies FirstCall() is a member function of some class.
In that case you should add the whole class definition to your question, and at least the scope of MainFunction1/2. This will help us understand why the application simply closes.
Your second option is worse than single-threaded code, since lGuard unlocks only when the thread has finished executing all of its functions.
auto ThreadFunction = [&]()
std::lock_guard<std::mutex> lGuard (s_mutex);
//MainFunctions calls SwitchMainFunctions
//SwitchMainFunctions calls MainFunction
//when done lGuard unlocks on destruction
Another problem with the second option: why do you need a mutex at all? If you insist on mapping std::atomic thread_number to a specific function, simply pass the result of the atomic fetch_add to SwitchMainFunctions in the thread constructor.
void FirstCall()
{
    std::atomic<int> thread_number{0}; // brace-init: copy-initializing an atomic does not compile before C++17
    std::vector<std::thread> threads;
    for (int i = 0; i < 2; i++)
        threads.push_back(std::thread(SwitchMainFunctions, thread_number++));
    for (auto& th : threads)
        if (th.joinable())
            th.join();
}
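If the two functions really do share some state, the mutex can still be used; it just has to guard the short update and not the whole function body. A minimal sketch of that idea, where parallel_sum is a made-up stand-in for MainFunction1/2 (their real bodies are not shown in the question):

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <numeric>
#include <thread>
#include <vector>

// Sums `data` using two threads. Each thread does its long-running work
// without the lock and takes the mutex only to publish its partial result.
long parallel_sum(const std::vector<int>& data) {
    std::mutex m;
    long total = 0;
    auto sum_range = [&](std::size_t begin, std::size_t end) {
        // Unlocked: runs concurrently with the other thread.
        const long partial =
            std::accumulate(data.begin() + begin, data.begin() + end, 0L);
        // Locked: only the short update of shared state.
        std::lock_guard<std::mutex> guard(m);
        total += partial;
    };
    const std::size_t mid = data.size() / 2;
    std::thread first(sum_range, std::size_t{0}, mid);
    std::thread second(sum_range, mid, data.size());
    first.join();
    second.join();
    return total;
}
```

With the lock_guard at the top of the lambda instead, the second thread could not start summing until the first had finished, which is the situation in way 2.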

How to get local hour efficiently?

I'm developing a service. Currently I need to get the local hour for every request; since this involves a system call, it costs too much.
In my case, some deviation like 200ms is OK for me.
So what's the best way to maintain a variable storing local_hour, and update it every 200ms?
static int32_t GetLocalHour() {
time_t t = std::time(nullptr);
if (t == -1) { return -1; }
struct tm *time_info_ptr = localtime(&t);
return (nullptr != time_info_ptr) ? time_info_ptr->tm_hour : -1;
If you want your main thread to spend as little time as possible on getting the current hour you can start a background thread to do all the heavy lifting.
For all things time use std::chrono types.
Here is the example, which uses quite a few (very useful) multithreading building blocks from C++.
#include <chrono>
#include <future>
#include <condition_variable>
#include <mutex>
#include <atomic>
#include <iostream>
// building blocks
// std::future/std::async, to start a loop/function on a separate thread
// std::atomic, to be able to read/write a variable in a thread-safe way
// std::chrono, for all things time
// std::condition_variable, for communicating between threads. Basically a signal that only says that something has changed that might be interesting
// lambda functions : anonymous functions that are useful in this case for starting the asynchronous calls and to setup predicates (functions returning a bool)
// std::mutex : threadsafe access to a bit of code
// std::unique_lock : to automatically unlock a mutex when code goes out of scope (also needed for condition_variable)
// helper to convert time to start of day
using days_t = std::chrono::duration<int, std::ratio_multiply<std::chrono::hours::period, std::ratio<24> >::type>;
// class that has an asynchronously running loop that updates two variables (threadsafe)
// m_hours and m_seconds (m_seconds so output is a bit more interesting)
class time_keeper_t
time_keeper_t() :
m_delay{ std::chrono::milliseconds(200) }, // update loop period
m_future{ std::async(std::launch::async,[this] {update_time_loop(); }) } // start update loop
// wait until asynchronous loop has started
std::unique_lock<std::mutex> lock{ m_mtx };
// wait until the asynchronous loop has started.
// this can take a bit of time since OS needs to schedule a thread for that
m_cv.wait(lock, [this] {return m_started; });
// threadsafe stopping of the mainloop
// to avoid problems that the thread is still running but the object
// with members is deleted.
std::unique_lock<std::mutex> lock{ m_mtx };
m_stop = true;
m_cv.notify_all(); // this will wakeup the loop and stop
// future.get will wait until the loop also has finished
// this ensures no member variables will be accessed
// by the loop thread and it is safe to fully destroy this instance
// inline to avoid extra calls
inline int hours() const
return m_hours;
// inline to avoid extra calls
inline int seconds() const
return m_seconds;
void update_time()
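// use the wall clock here: steady_clock's epoch is unspecified (often boot time), so it cannot give the time of day
const auto now = std::chrono::system_clock::now();
std::chrono::system_clock::duration tp = now.time_since_epoch();
// calculate back till start of day (this is UTC; apply the local UTC offset to get the local hour)
days_t days = std::chrono::duration_cast<days_t>(tp);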
tp -= days;
// calculate hours since start of day
auto hours = std::chrono::duration_cast<std::chrono::hours>(tp);
tp -= hours;
m_hours = hours.count();
// seconds since start of last hour
auto seconds = std::chrono::duration_cast<std::chrono::seconds>(tp);
m_seconds = seconds.count() % 60;
void update_time_loop()
std::unique_lock<std::mutex> lock{ m_mtx };
// loop has started and has initialized all things time with values
m_started = true;
// stop condition for the main loop, put in a predicate lambda
auto stop_condition = [this]()
return m_stop;
while (!m_stop)
// wait until m_cv is signaled or m_delay timed out
// a condition variable allows instant response and thus
// is better than just having a sleep here.
// (imagine a delay of seconds, that would also mean stopping could
// take seconds, this is faster)
m_cv.wait_for(lock, m_delay, stop_condition);
if (!m_stop) update_time();
std::atomic<int> m_hours;
std::atomic<int> m_seconds;
std::mutex m_mtx;
std::condition_variable m_cv;
bool m_started{ false };
bool m_stop{ false };
std::chrono::steady_clock::time_point m_now;
std::chrono::steady_clock::duration m_delay;
std::future<void> m_future;
int main()
time_keeper_t time_keeper;
// the mainloop now just can ask the time_keeper for seconds
// or in your case hours. The only time needed is the time
// to return an int (atomic) instead of having to make a full
// api call to get the time.
for (std::size_t n = 0; n < 30; ++n)
std::cout << "seconds now = " << time_keeper.seconds() << "\n";
return 0;
You don't need to query the local time for every request, because the hour doesn't change every 200ms. Just update the local hour variable every hour.
The most correct solution would be to register for a timer event, like a scheduled task on Windows or a cron job on Linux, that runs at the start of every hour. Alternatively, create a timer that fires every hour and updates the variable.
Timer creation depends on the platform: for example, on Windows use SetTimer, on Linux use timer_create. Here's a very simple solution using boost::asio, which assumes that you start it exactly on the hour. You'll need some modification to allow starting at any time, for example creating a one-shot timer or sleeping until the next hour.
#include <chrono>
#include <ctime>
#include <boost/asio.hpp>
#include <boost/bind/bind.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
using namespace std::chrono_literals;
int32_t get_local_hour()
time_t t = std::time(nullptr);
if (t == -1) { return -1; }
struct tm *time_info_ptr = localtime(&t);
return (nullptr != time_info_ptr) ? time_info_ptr->tm_hour : -1;
static int32_t local_hour = get_local_hour();
bool running = true;
// Timer callback body, called every hour
void update_local_hour(const boost::system::error_code& /*e*/,
boost::asio::deadline_timer* t)
if (!running) return; // stop re-arming the timer once the flag is cleared
local_hour = get_local_hour();
t->expires_at(t->expires_at() + boost::posix_time::hours(1));
t->async_wait(boost::bind(update_local_hour,
boost::asio::placeholders::error, t));
int main()
boost::asio::io_service io;
// Timer that runs every hour and update the local_hour variable
boost::asio::deadline_timer t(io, boost::posix_time::hours(1));
t.async_wait(boost::bind(update_local_hour,
boost::asio::placeholders::error, &t));
running = true;
running = false; // stop the timer
Now just use local_hour directly instead of GetLocalHour()
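For the "start at any time" modification mentioned above, the first delay can be computed from the broken-down local time. A small sketch, where until_next_hour is a hypothetical helper that ignores DST transitions and leap seconds:

```cpp
#include <cassert>
#include <chrono>
#include <ctime>

// Time remaining until the local clock reaches the next full hour, given a
// broken-down local time such as the one returned by localtime().
std::chrono::seconds until_next_hour(const std::tm& local) {
    const int elapsed = local.tm_min * 60 + local.tm_sec;  // seconds into the current hour
    return std::chrono::seconds(3600 - elapsed);
}
```

The idea would be to arm a one-shot timer with this value, refresh local_hour when it fires, and only then switch to the fixed one-hour period.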

How to create an efficient multi-threaded task scheduler in C++?

I'd like to create a very efficient task scheduler system in C++.
The basic idea is this:
class Task {
virtual void run() = 0;
class Scheduler {
void add(Task &task, double delayToRun);
Behind Scheduler, there should be a fixed-size thread pool, which run the tasks (I don't want to create a thread for each task). delayToRun means that the task doesn't get executed immediately, but delayToRun seconds later (measuring from the point it was added into the Scheduler).
(delayToRun means an "at-least" value, of course. If the system is loaded, or if we ask the impossible from the Scheduler, it won't be able to handle our request. But it should do the best it can)
And here's my problem. How to implement delayToRun functionality efficiently? I'm trying to solve this problem with the use of mutexes and condition variables.
I see two ways:
With manager thread
Scheduler contains two queues: allTasksQueue, and tasksReadyToRunQueue. A task gets added into allTasksQueue at Scheduler::add. There is a manager thread, which waits the smallest amount of time so it can put a task from allTasksQueue to tasksReadyToRunQueue. Worker threads wait for a task available in tasksReadyToRunQueue.
If Scheduler::add adds a task in front of allTasksQueue (a task, which has a value of delayToRun so it should go before the current soonest-to-run task), then the manager task need to be woken up, so it can update the time of wait.
This method can be considered inefficient, because it needs two queues, and it needs two condvar.signals to make a task run (one for allTasksQueue->tasksReadyToRunQueue, and one for signalling a worker thread to actually run the task)
Without manager thread
There is one queue in the scheduler. A task gets added into this queue at Scheduler::add. A worker thread checks the queue. If it is empty, it waits without a time constraint. If it is not empty, it waits for the soonest task.
If there is only one condition variable that the worker threads wait on, this method can be considered inefficient: if a task is added at the front of the queue (front meaning that, with N worker threads, the task index is < N), then all the worker threads need to be woken up to update the time they are waiting for.
If there is a separate condition variable for each thread, then we can control which thread to wake up, so we don't need to wake up all threads (we only need to wake up the thread with the largest waiting time, so we need to track this value). I'm currently thinking about implementing this, but working out the exact details is complex. Are there any recommendations, thoughts or documents on this method?
Is there any better solution for this problem? I'm trying to use standard C++ features, but I'm willing to use platform dependent (my main platform is linux) tools too (like pthreads), or even linux specific tools (like futexes), if they provide a better solution.
You can avoid both having a separate "manager" thread, and having to wake up a large number of tasks when the next-to-run task changes, by using a design where a single pool thread waits for the "next to run" task (if there is one) on one condition variable, and the remaining pool threads wait indefinitely on a second condition variable.
The pool threads would execute pseudocode along these lines:
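pthread_mutex_lock(&queue_lock);
while (running) {
    if (head task is ready to run) {
        dequeue head task;
        if (task_thread == 1)
            pthread_cond_signal(&task_cv);
        else if (!queue_empty)
            pthread_cond_signal(&queue_cv);
        pthread_mutex_unlock(&queue_lock);
        run dequeued task;
        pthread_mutex_lock(&queue_lock);
    } else if (!queue_empty && task_thread == 0) {
        task_thread = 1;
        pthread_cond_timedwait(&task_cv, &queue_lock, time head task is ready to run);
        task_thread = 0;
    } else {
        pthread_cond_wait(&queue_cv, &queue_lock);
    }
}
pthread_mutex_unlock(&queue_lock);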
If you change the next task to run, then you execute the following with the queue_lock held:
if (task_thread == 1)
    pthread_cond_signal(&task_cv);
else
    pthread_cond_signal(&queue_cv);
Under this scheme, every wakeup is directed at a single thread, there's only one priority queue of tasks, and no manager thread is required.
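The same scheme can be written with standard C++ primitives instead of pthreads. The sketch below is one possible reading of it, not a tuned implementation: DelayScheduler and its member names are made up, tasks are std::function objects, and tasks that are not yet due at destruction are simply dropped.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class DelayScheduler {
public:
    using Clock = std::chrono::steady_clock;

    explicit DelayScheduler(unsigned workers) {
        for (unsigned i = 0; i < workers; ++i)
            threads_.emplace_back([this] { worker(); });
    }

    // Stops the workers; tasks that are not yet due are dropped.
    ~DelayScheduler() {
        {
            std::lock_guard<std::mutex> lock(mtx_);
            running_ = false;
        }
        timed_cv_.notify_all();
        idle_cv_.notify_all();
        for (auto& t : threads_) t.join();
    }

    void add(std::function<void()> fn, std::chrono::milliseconds delay) {
        std::lock_guard<std::mutex> lock(mtx_);
        const auto due = Clock::now() + delay;
        const bool new_head = queue_.empty() || due < queue_.top().due;
        queue_.push(Item{due, std::move(fn)});
        if (new_head) {  // the next task to run changed: wake exactly one thread
            if (timed_waiter_) timed_cv_.notify_one();
            else idle_cv_.notify_one();
        }
    }

private:
    struct Item { Clock::time_point due; std::function<void()> fn; };
    struct Later {
        bool operator()(const Item& a, const Item& b) const { return a.due > b.due; }
    };

    void worker() {
        std::unique_lock<std::mutex> lock(mtx_);
        while (running_) {
            if (!queue_.empty() && queue_.top().due <= Clock::now()) {
                std::function<void()> fn = queue_.top().fn;
                queue_.pop();
                if (timed_waiter_) timed_cv_.notify_one();
                else if (!queue_.empty()) idle_cv_.notify_one();
                lock.unlock();
                fn();  // run the task without holding the lock
                lock.lock();
            } else if (!queue_.empty() && !timed_waiter_) {
                timed_waiter_ = true;  // this thread waits for the head task
                const auto due = queue_.top().due;  // copy: the queue may change while waiting
                timed_cv_.wait_until(lock, due);
                timed_waiter_ = false;
            } else {
                idle_cv_.wait(lock);  // everyone else waits without a timeout
            }
        }
    }

    std::mutex mtx_;
    std::condition_variable timed_cv_;  // task_cv in the pseudocode
    std::condition_variable idle_cv_;   // queue_cv in the pseudocode
    std::priority_queue<Item, std::vector<Item>, Later> queue_;
    bool timed_waiter_ = false;         // task_thread in the pseudocode
    bool running_ = true;
    std::vector<std::thread> threads_;
};
```

Usage would be along the lines of `DelayScheduler s(4); s.add([] { /* work */ }, std::chrono::milliseconds(500));`. Spurious wakeups are harmless here because every branch re-checks the queue in the loop.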
Your specification is a bit too strong:
delayToRun means that the task doesn't get executed immediately, but delayToRun seconds later
You forgot to add "at least":
The task doesn't get executed now, but at least delayToRun seconds later
The point is that if ten thousand tasks are all scheduled with a 0.1 delayToRun, they surely won't practically be able to run at the same time.
With such correction, you just maintain some queue (or agenda) of (scheduled-start-time, closure to run), you keep that queue sorted, and you start N (some fixed number) of threads which atomically pop the first element of the agenda and run it.
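A minimal sketch of such an agenda, using std::priority_queue as a min-heap on the start time. Agenda and pop_due are hypothetical names, and the class is deliberately not thread-safe: the worker threads would call it with a mutex held.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

class Agenda {
public:
    using Clock = std::chrono::steady_clock;
    using Closure = std::function<void()>;

    void add(Clock::time_point start, Closure fn) {
        heap_.push(Entry{start, std::move(fn)});
    }

    // If the earliest entry is due at `now`, copies its closure into `out`,
    // removes the entry and returns true. Otherwise returns false.
    bool pop_due(Clock::time_point now, Closure& out) {
        if (heap_.empty() || heap_.top().start > now) return false;
        out = heap_.top().fn;
        heap_.pop();
        return true;
    }

private:
    struct Entry { Clock::time_point start; Closure fn; };
    struct Later {  // reversed comparison turns the max-heap into a min-heap
        bool operator()(const Entry& a, const Entry& b) const { return a.start > b.start; }
    };
    std::priority_queue<Entry, std::vector<Entry>, Later> heap_;
};
```

Insertion and removal are both O(log n), and the entry that has to run next is always at the top.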
then all the worker threads need to be woken up to update the time which they are waiting for.
No, some worker threads would be woken up.
Read about condition variables and broadcast.
You might also use POSIX timers, see timer_create(2), or the Linux-specific timer file descriptors, see timerfd_create(2).
You probably would avoid running blocking system calls in your threads, and have some central thread managing them using some event loop (see poll(2)...); otherwise, if you have a hundred tasks running sleep(100) and one task scheduled to run in half a second it won't run before a hundred seconds.
You may want to read about continuation-passing style programming (it -CPS- is highly relevant). Read the paper about Continuation Passing C by Juliusz Chroboczek.
Look also into Qt threads.
You could also consider coding in Go (with its Goroutines).
This is a sample implementation for the interface you provided that comes closest to your 'With manager thread' description.
It uses a single thread (timer_thread) to manage a queue (allTasksQueue) that is sorted based on the actual time when a task must be started (std::chrono::time_point).
The 'queue' is a std::priority_queue (which keeps its time_point key elements sorted).
timer_thread is normally suspended until the next task is started or when a new task is added.
When a task is about to be run, it is placed in tasksReadyToRunQueue; one of the worker threads is signaled, wakes up, removes it from the queue and starts processing the task.
Note that the thread pool has a compile-time upper limit for the number of threads (40). If you schedule more tasks than can be dispatched to workers, new tasks will block until threads are available again.
You said this approach is not efficient, but overall, it seems reasonably efficient to me. It's all event driven and you are not wasting CPU cycles by unnecessary spinning.
Of course, it's just an example, optimizations are possible (note: std::multimap has been replaced with std::priority_queue).
The implementation is C++11 compliant
#include <iostream>
#include <chrono>
#include <queue>
#include <unistd.h>
#include <vector>
#include <thread>
#include <condition_variable>
#include <mutex>
#include <memory>
class Task {
virtual void run() = 0;
virtual ~Task() { }
class Scheduler {
void add(Task &task, double delayToRun);
using timepoint = std::chrono::time_point<std::chrono::steady_clock>;
struct key {
timepoint tp;
Task *taskp;
struct TScomp {
bool operator()(const key &a, const key &b) const
return a.tp > b.tp;
const int ThreadPoolSize = 40;
std::vector<std::thread> ThreadPool;
std::vector<Task *> tasksReadyToRunQueue;
std::priority_queue<key, std::vector<key>, TScomp> allTasksQueue;
std::thread TimerThr;
std::mutex TimerMtx, WorkerMtx;
std::condition_variable TimerCV, WorkerCV;
bool WorkerIsRunning = true;
bool TimerIsRunning = true;
void worker_thread();
void timer_thread();
for (int i = 0; i <ThreadPoolSize; ++i)
ThreadPool.push_back(std::thread(&Scheduler::worker_thread, this));
TimerThr = std::thread(&Scheduler::timer_thread, this);
std::lock_guard<std::mutex> lck{TimerMtx};
TimerIsRunning = false;
std::lock_guard<std::mutex> lck{WorkerMtx};
WorkerIsRunning = false;
for (auto &t : ThreadPool)
void Scheduler::add(Task &task, double delayToRun)
auto now = std::chrono::steady_clock::now();
long delay_ms = delayToRun * 1000;
std::chrono::milliseconds duration (delay_ms);
timepoint tp = now + duration;
if (now >= tp)
* This is a short-cut
* When time is due, the task is directly dispatched to the workers
std::lock_guard<std::mutex> lck{WorkerMtx};
} else
std::lock_guard<std::mutex> lck{TimerMtx};
allTasksQueue.push({tp, &task});
void Scheduler::worker_thread()
for (;;)
std::unique_lock<std::mutex> lck{WorkerMtx};
WorkerCV.wait(lck, [this] { return tasksReadyToRunQueue.size() != 0 ||
!WorkerIsRunning; } );
if (!WorkerIsRunning)
Task *p = tasksReadyToRunQueue.back();
delete p; // delete Task
void Scheduler::timer_thread()
for (;;)
std::unique_lock<std::mutex> lck{TimerMtx};
if (!TimerIsRunning)
auto duration = std::chrono::nanoseconds(1000000000);
if (allTasksQueue.size() != 0)
auto now = std::chrono::steady_clock::now();
auto head = allTasksQueue.top();
Task *p = head.taskp;
duration = head.tp - now;
if (now >= head.tp)
* A Task is due, pass to worker threads
std::unique_lock<std::mutex> ulck{WorkerMtx};
TimerCV.wait_for(lck, duration);
* End sample implementation
class DemoTask : public Task {
int n;
DemoTask(int n=0) : n{n} { }
void run() override
std::cout << "Start task " << n << std::endl;
std::cout << " Stop task " << n << std::endl;
int main()
Scheduler sched;
Task *t0 = new DemoTask{0};
Task *t1 = new DemoTask{1};
Task *t2 = new DemoTask{2};
Task *t3 = new DemoTask{3};
Task *t4 = new DemoTask{4};
Task *t5 = new DemoTask{5};
sched.add(*t0, 7.313);
sched.add(*t1, 2.213);
sched.add(*t2, 0.713);
sched.add(*t3, 1.243);
sched.add(*t4, 0.913);
sched.add(*t5, 3.313);
It means that you want to run all tasks continuously in some order.
You can keep the tasks in a stack (or even a linked list) sorted by delay. When a new task arrives, insert it at the position determined by its delay time (compute that position and insert the new task efficiently).
Run all tasks starting from the head of the task stack (or list).
Core code for C++11:
#include <thread>
#include <queue>
#include <chrono>
#include <mutex>
#include <atomic>
using namespace std::chrono;
using namespace std;
class Task {
virtual void run() = 0;
template<typename T, typename = typename enable_if<std::is_base_of<Task, T>::value>::type>
class SchedulerItem {
T task;
time_point<steady_clock> startTime;
int delay;
SchedulerItem(T t, time_point<steady_clock> s, int d) : task(t), startTime(s), delay(d){}
template<typename T, typename = typename enable_if<std::is_base_of<Task, T>::value>::type>
class Scheduler {
queue<SchedulerItem<T>> pool;
mutex mtx;
atomic<bool> running;
Scheduler() : running(false){}
void add(T task, double delayMsToRun) {
lock_guard<mutex> lock(mtx);
pool.push(SchedulerItem<T>(task, steady_clock::now(), delayMsToRun)); // steady_clock matches the type of startTime
if (running == false) runNext();
void runNext(void) {
running = true;
auto th = [this]() {
auto item = pool.front();
auto remaining = (item.startTime + milliseconds(item.delay)) - steady_clock::now();
if(remaining.count() > 0) this_thread::sleep_for(remaining);
if(pool.size() > 0)
running = false;
thread t(th);
Test code:
class MyTask : Task {
virtual void run() override {
printf("mytask \n");
int main()
Scheduler<MyTask> s;
s.add(MyTask(), 0);
s.add(MyTask(), 2000);
s.add(MyTask(), 2500);
s.add(MyTask(), 6000);

Multithreaded not efficient: Debugging False Sharing?

I have the following code, which starts multiple threads (a thread pool) at the very beginning (startWorkers()). Subsequently, at some point I have a container full of myWorkObject instances, which I want to process using multiple worker threads simultaneously. The myWorkObject instances are completely isolated from one another in terms of memory usage. For now, let's assume myWorkObject has a method doWorkIntenseStuffHere() which takes some CPU time to compute.
When benchmarking the following code, I noticed that it does not scale well with the number of threads, and the overhead of initializing/synchronizing the worker threads exceeds the benefit of multithreading unless there are 3-4 threads active. I've looked into this issue and read about the false-sharing problem, and I assume my code suffers from it. However, I'd like to debug/profile my code to see whether there is some kind of starvation or false sharing going on. How can I do this? Please feel free to criticize anything about my code, as I'm still learning a lot about memory/CPU and multithreading in particular.
#include <boost/thread.hpp>
class MultiThreadedFitnessProcessingStrategy
MultiThreadedFitnessProcessingStrategy(unsigned int numWorkerThreads):
_startBarrier(numWorkerThreads + 1),
_endBarrier(numWorkerThreads + 1),
assert(_numWorkerThreads > 0);
virtual ~MultiThreadedFitnessProcessingStrategy()
void startWorkers()
_shutdown = false;
_started = true;
for(unsigned int i = 0; i < _numWorkerThreads;i++)
boost::thread* workerThread = new boost::thread(
boost::bind(&MultiThreadedFitnessProcessingStrategy::workerTask, this,i)
_threadQueue.push_back(new std::queue<myWorkObject::ptr>());
void stopWorkers()
_shutdown = true;
for(unsigned int i = 0; i < _numWorkerThreads;i++)
void workerTask(unsigned int id)
//Wait until all worker threads have started.
//Wait for any input to become available.
bool queueEmpty = false;
std::queue<SomeClass::ptr >* myThreadq(_threadQueue[id]);
SomeClass::ptr myWorkObject;
//Make sure queue is not empty,
//Caution: this is necessary if start barrier was triggered without queue input (e.g., shutdown) , which can happen.
//Do not try to be smart and refactor this without knowing what you are doing!
queueEmpty = myThreadq->empty();
chromosome = myThreadq->front();
//Wait until all worker threads have synchronized.
void doWork(const myWorkObject::chromosome_container &refcontainer)
unsigned int j = 0;
for(myWorkObject::chromosome_container::const_iterator it = refcontainer.begin();
it != refcontainer.end();++it)
//Start Signal!
//Wait for workers to be complete
unsigned int getNumWorkerThreads() const
return _numWorkerThreads;
bool isStarted() const
return _started;
boost::barrier _startBarrier;
boost::barrier _endBarrier;
bool _started;
bool _shutdown;
unsigned int _numWorkerThreads;
std::vector<boost::thread*> _workerThreads;
std::vector< std::queue<myWorkObject::ptr >* > _threadQueue;
Sampling-based profiling can give you a pretty good idea of whether you're experiencing false sharing. Here's a previous thread that describes a few ways to approach the issue. I don't think that thread mentioned Linux's perf utility. It's a quick, easy and free way to count cache misses, which might tell you what you need to know (am I experiencing a significant number of cache misses that correlates with how often I access a particular variable?).
If you do find that your threading scheme is causing a lot of cache-line contention, you could try declaring your myWorkObject instances, or the data inside them that you're actually concerned about, with __attribute__((aligned(64))) (alignment to 64-byte cache lines; standard C++11 spells this alignas(64)).
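As an illustration of the alignment fix, here is a sketch with one counter per worker, each padded to its own cache line. The 64-byte line size is an assumption (common on x86-64), and PaddedCounter, kWorkers and count_in_parallel are hypothetical names unrelated to the code in the question.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Assumed cache-line size; check the target CPU before relying on it.
constexpr std::size_t kCacheLine = 64;
constexpr unsigned kWorkers = 4;

// alignas pads every counter to a full line, so two workers never write to
// the same cache line.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<long> value{0};
};

static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per cache line");

// Each worker increments only its own counter; the results are summed after
// all threads have joined.
long count_in_parallel(long per_worker) {
    PaddedCounter counters[kWorkers];
    std::vector<std::thread> threads;
    for (unsigned w = 0; w < kWorkers; ++w) {
        threads.emplace_back([&counters, w, per_worker] {
            for (long i = 0; i < per_worker; ++i)
                counters[w].value.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& t : threads) t.join();
    long total = 0;
    for (const auto& c : counters) total += c.value.load();
    return total;
}
```

Removing the alignas puts all four counters on the same line; timing both variants with a large per_worker is a cheap experiment to see whether false sharing matters on a given machine.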
If you're on Linux, Valgrind includes a module called cachegrind that simulates cache effects. Please take a look at

c++ multithreading and affinity

I'm writing a simple thread pool for my application, which I test on a dual-core processor. It usually works well, but I noticed that when other processes use more than 50% of the processor, my application almost halts. This made me curious, so I decided to reproduce the situation and wrote an auxiliary application that simply runs an infinite loop (without multithreading), taking 50% of the processor. While the auxiliary program is running, the multithreaded application almost halts, as before (processing speed falls from 300-400 tasks per second to 5-10 tasks per second). But when I changed the process affinity of my multithreaded program to use only one core (the auxiliary program can still use both), it started working, of course using at most the 50% of the processor that was left. When I disabled multithreading in my application (still processing the same tasks, but without the thread pool), it worked like a charm, without any slowdown from the auxiliary program, which was still running (and that's how two applications should behave when running on two cores). But when I enable multithreading, the problem comes back.
I've made special code for testing this particular ThreadPool:
#ifndef THREADPOOL_H_
#define THREADPOOL_H_
typedef double FloatingPoint;
#include <queue>
#include <vector>
#include <mutex>
#include <atomic>
#include <condition_variable>
#include <thread>
using namespace std;
struct ThreadTask
int size;
ThreadTask(int s)
size = s;
class ThreadPool
queue<ThreadTask*> tasks;
vector<std::thread> threads;
std::condition_variable task_ready;
std::mutex variable_mutex;
std::mutex max_mutex;
std::atomic<FloatingPoint> max;
std::atomic<int> sleeping;
std::atomic<bool> running;
int threads_count;
ThreadTask * getTask();
void runWorker();
void processTask(ThreadTask*);
bool isQueueEmpty();
bool isTaskAvailable();
void threadMethod();
void createThreads();
void waitForThreadsToSleep();
virtual ~ThreadPool();
void addTask(int);
void start();
FloatingPoint getValue();
void reset();
void clearTasks();
#endif /* THREADPOOL_H_ */
and the .cpp:
#include "stdafx.h"
#include <climits>
#include <float.h>
#include "ThreadPool.h"
ThreadPool::ThreadPool(int t)
running = true;
threads_count = t;
max = -DBL_MAX; //FLT_MIN is the smallest positive float, not the lowest value
sleeping = 0;
if(threads_count < 2) //a single worker thread makes no sense
threads_count = (int)thread::hardware_concurrency(); //default value
if(threads_count == 0) //in case it fails ('If this value is not computable or well defined, the function returns 0')
threads_count = 2;
printf("%d worker threads\n", threads_count);
running = false;
reset(); //it will make sure that all worker threads are sleeping on condition variable
task_ready.notify_all(); //let them finish in natural way
for (auto& th : threads)
void ThreadPool::start()
FloatingPoint ThreadPool::getValue()
return max;
void ThreadPool::createThreads()
for(int i = 0; i < threads_count; ++i)
threads.push_back(std::thread(&ThreadPool::threadMethod, this));
void ThreadPool::threadMethod()
void ThreadPool::runWorker()
ThreadTask * task = getTask();
void ThreadPool::processTask(ThreadTask * task)
if(task == NULL)
//do something to simulate processing
vector<int> v;
for(int i = 0; i < task->size; ++i)
delete task;
void ThreadPool::addTask(int s)
ThreadTask * task = new ThreadTask(s);
std::lock_guard<std::mutex> lock(variable_mutex);
ThreadTask * ThreadPool::getTask()
std::unique_lock<std::mutex> lck(variable_mutex);
if(tasks.empty()) //in case of ThreadPool being deleted (destructor calls notify_all), or spurious notifications
return NULL; //return to main loop and repeat it
ThreadTask * task = tasks.front();
return task;
bool ThreadPool::isQueueEmpty()
std::lock_guard<std::mutex> lock(variable_mutex);
return tasks.empty();
bool ThreadPool::isTaskAvailable()
return !isQueueEmpty();
void ThreadPool::waitForThreadsToSleep()
std::this_thread::yield(); //wait for all tasks to be taken
while(true) //wait for all threads to finish their last tasks
if(sleeping == threads_count)
void ThreadPool::clearTasks()
std::unique_lock<std::mutex> lock(variable_mutex);
while(!tasks.empty()) tasks.pop();
void ThreadPool::reset() //don't call this when var_mutex is already locked by this thread!
max = -DBL_MAX; //FLT_MIN is the smallest positive float, not the lowest value
How it's tested:
ThreadPool tp(2);
int iterations = 1000;
int task_size = 1000;
for(int j = 0; j < iterations; ++j)
printf("\r%d left", iterations - j);
for(int i = 0; i < 1000; ++i)
return 0;
I've built this code with MinGW with gcc 4.8.1 (from here) and Visual Studio 2012 (VC11) on Win7 64, both in the debug configuration.
The two programs built with these compilers behave totally differently.
a) The program built with MinGW runs much faster than the one built with VS when it can take the whole processor (the system shows almost 100% CPU usage by this process, so I don't think MinGW is secretly setting affinity to one core). But when I run the auxiliary program (using 50% of the CPU), it slows down greatly (by a factor of several dozen). CPU usage in this case is about 50%-50% for the main program and the auxiliary one.
b) The program built with VS 2012, when using the whole CPU, is even slower than a) with the slowdown (when I set task_size = 1, their speeds were similar). But when the auxiliary program is running, the main program even takes most of the CPU (usage is about 66% main - 33% aux) and the resulting slowdown is barely noticeable.
When set to use only one core, both programs speed up noticeably (about 1.5 - 2 times), and the MinGW one stops being vulnerable to competition.
Well, now I don't know what to do. My program behaves differently when built by two different toolsets. Is this a flaw in my code (which I suppose is the case), or something to do with the compilers having problems with C++11?
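One thing that may be worth checking: waitForThreadsToSleep() in the code above appears to poll the sleeping counter in a loop with yield(). A polling waiter keeps a core busy while the workers are trying to run, which could interact badly with another CPU-bound process. That is an assumption, not a confirmed diagnosis. Below is a minimal sketch of a blocking alternative; IdleTracker, task_added, task_finished, wait_idle and run_demo are hypothetical names, not part of the original pool.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: counts outstanding tasks so that a waiter can
// block on a condition variable instead of polling with yield().
class IdleTracker {
public:
    // Call when a task is queued.
    void task_added() {
        std::lock_guard<std::mutex> lock(m_);
        ++pending_;
    }
    // Call when a worker has finished processing a task.
    void task_finished() {
        std::lock_guard<std::mutex> lock(m_);
        if (--pending_ == 0) all_done_.notify_all();
    }
    // Blocks, without spinning, until no tasks are outstanding.
    void wait_idle() {
        std::unique_lock<std::mutex> lock(m_);
        all_done_.wait(lock, [this] { return pending_ == 0; });
    }
    // Current number of outstanding tasks.
    int pending() {
        std::lock_guard<std::mutex> lock(m_);
        return pending_;
    }
private:
    std::mutex m_;
    std::condition_variable all_done_;
    int pending_ = 0;
};

// Demo: register n tasks, let n threads finish one each, and block
// until all are done. Returns the number of tasks still pending.
int run_demo(int n) {
    IdleTracker tracker;
    for (int i = 0; i < n; ++i) tracker.task_added();
    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i)
        workers.emplace_back([&tracker] { tracker.task_finished(); });
    tracker.wait_idle();
    int left = tracker.pending();
    for (auto& w : workers) w.join();
    return left;
}
```

In the pool itself, addTask would call task_added and the worker would call task_finished after processTask returns, so the thread that waits for completion sleeps until it is notified.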