Implementing a simple, generic thread pool in C++11

I want to create a thread pool for experimental purposes (and for the fun factor). It should be able to process a wide variety of tasks (so I can possibly use it in later projects).
In my thread pool class I'm going to need some sort of task queue. Since the Standard Library has provided std::packaged_task since C++11, my queue will look like std::deque<std::packaged_task<?()> > task_queue, so the client can push std::packaged_tasks into the queue via some sort of public interface function (and then one of the threads in the pool will be notified with a condition variable to execute it, etc.).
My question is related to the template argument of the std::packaged_task<?()>s in the deque.
The function signature ?() should be able to deal with any type/number of parameters, because the client can do something like:
std::packaged_task<int()> t(std::bind(factorial, 342));
So I don't have to deal with the type/number of parameters.
But what should the return value be? (hence the question mark)
If I make my whole thread pool class a template class, one instance
of it will only be able to deal with tasks with a specific signature
(like std::packaged_task<int()>).
I want one thread pool object to be able to deal with any kind of task.
If I go with std::packaged_task<void()> and the function invoked
returns an integer, or anything at all, then that's undefined behaviour.

So the hard part is that packaged_task<R()> is move-only, otherwise you could just toss it into a std::function<void()>, and run those in your threads.
There are a few ways around this.
First, ridiculously, use a packaged_task<void()> to store a packaged_task<R()>. I'd advise against this, but it does work. ;) (what is the signature of operator() on packaged_task<R()>? What is the required signature for the objects you pass to packaged_task<void()>?)
Second, wrap your packaged_task<R()> in a shared_ptr, capture that in a lambda with signature void(), store that in a std::function<void()>, and done. This has overhead costs, but probably less than the first solution.
Finally, write your own move-only function wrapper. For the signature void() it is short:
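struct task {
  template<class F,
    class dF=std::decay_t<F>,  // C++14; in C++11 use typename std::decay<F>::type
    class=decltype( std::declval<dF&>()() )
  >
  task( F&& f ):
    ptr(
      new dF(std::forward<F>(f)),
      [](void* ptr){ delete static_cast<dF*>(ptr); }
    ),
    invoke( [](void* ptr){ (*static_cast<dF*>(ptr))(); } )  // type-erased call
  {}
  void operator()()const{
    invoke( ptr.get() );
  }
  task(task&&)=default;
  task& operator=(task&&)=default;
  explicit operator bool()const{return static_cast<bool>(ptr);}
private:
  std::unique_ptr<void, void(*)(void*)> ptr;
  void(*invoke)(void*) = nullptr;
};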
and simple. The above can store packaged_task<R()> for any type R, and invoke them later.
This has relatively minimal overhead -- it should be cheaper than std::function, at least the implementations I've seen -- except it does not do SBO (small buffer optimization) where it stores small function objects internally instead of on the heap.
You can improve the unique_ptr<> ptr container with a small buffer optimization if you want.
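The second approach above (a shared_ptr captured in a void() lambda) can be sketched as follows. This is a minimal illustration, not code from the answer: erase_task and demo_shared_ptr_tasks are hypothetical names, and the queue is drained on the calling thread where a real pool would use its workers.

```cpp
#include <functional>
#include <future>
#include <memory>
#include <queue>
#include <utility>

// Hypothetical helper: wraps a move-only packaged_task in a shared_ptr so the
// resulting lambda is copyable and therefore fits into std::function<void()>.
template <class R>
std::function<void()> erase_task(std::packaged_task<R()> task) {
    auto shared = std::make_shared<std::packaged_task<R()>>(std::move(task));
    return [shared] { (*shared)(); };
}

// Queues two tasks with different return types, runs them, reads the futures.
inline int demo_shared_ptr_tasks() {
    std::queue<std::function<void()>> jobs;

    std::packaged_task<int()> a([] { return 40; });
    std::packaged_task<double()> b([] { return 2.0; });
    std::future<int> fa = a.get_future();
    std::future<double> fb = b.get_future();

    jobs.push(erase_task(std::move(a)));
    jobs.push(erase_task(std::move(b)));

    while (!jobs.empty()) {  // a worker thread would do this part
        jobs.front()();
        jobs.pop();
    }
    return fa.get() + static_cast<int>(fb.get());
}
```

The cost is one heap allocation for the shared_ptr plus whatever std::function allocates, which is the overhead the answer mentions.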

I happen to have an implementation which does exactly that. My way of doing things is to wrap the std::packaged_task objects in a struct which abstracts away the return type. The method which submits a task into the thread pool returns a future on the result.
This kind of works, but due to the memory allocations required for each task it is not suitable for tasks which are very short and very frequent (I tried to use it to parallelize chunks of a fluid simulation and the overhead was way too high, in the order of several milliseconds for 324 tasks).
The key part is this structure:
struct abstract_packaged_task
{
    template <typename R>
    abstract_packaged_task(std::packaged_task<R> &&task):
        m_task((void*)(new std::packaged_task<R>(std::move(task)))),
        m_call_exec([](abstract_packaged_task *instance)mutable{
            (*(std::packaged_task<R>*)(instance->m_task))(); // run the stored task
        }),
        m_call_delete([](abstract_packaged_task *instance)mutable{
            delete (std::packaged_task<R>*)(instance->m_task);
        })
    {
    }
    abstract_packaged_task(abstract_packaged_task &&other);
    void operator()();
    void *m_task;
    std::function<void(abstract_packaged_task*)> m_call_exec;
    std::function<void(abstract_packaged_task*)> m_call_delete;
};
As you can see, it hides away the type dependencies by using lambdas with std::function and a void*. If you know the maximum size of all possibly occurring std::packaged_task objects (I have not checked whether the size depends on R at all), you could try to optimize this further by removing the memory allocation.
The submission method into the thread pool then does this:
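template <typename R>
std::future<R> submit_task(std::packaged_task<R()> &&task)
{
    assert(m_workers.size() > 0);
    std::future<R> result = task.get_future();
    {
        std::unique_lock<std::mutex> lock(m_queue_mutex);
        m_task_queue.emplace_back(std::move(task)); // wrapped as abstract_packaged_task
    }
    m_queue_wakeup.notify_one(); // wake one worker
    return result;
}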
where m_task_queue is an std::deque of abstract_packaged_task structs. m_queue_wakeup is a std::condition_variable to wake a worker thread up to pick up the task. The worker threads implementation is as simple as:
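void ThreadPool::worker_impl() {
    std::unique_lock<std::mutex> lock(m_queue_mutex, std::defer_lock);
    while (!m_terminated) {
        lock.lock();
        while (m_task_queue.empty()) {
            m_queue_wakeup.wait(lock); // assumed: sleep until submit_task notifies
            if (m_terminated) { return; }
        }
        abstract_packaged_task task(std::move(m_task_queue.front()));
        m_task_queue.pop_front();
        lock.unlock(); // run the task without holding the queue lock
        task();
    }
}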
You can take a look at the full source code and the corresponding header on my GitHub.
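For comparison, here is a compact pool along the same lines. It is only a sketch and not the implementation linked above: SimplePool and its members are hypothetical names, and tasks are type-erased through a shared_ptr and std::function<void()> rather than through the void* wrapper.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

class SimplePool {
public:
    explicit SimplePool(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            workers_.emplace_back([this] { worker(); });
    }
    ~SimplePool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            terminated_ = true;
        }
        wakeup_.notify_all();
        for (auto& t : workers_) t.join();  // workers drain the queue first
    }
    SimplePool(const SimplePool&) = delete;
    SimplePool& operator=(const SimplePool&) = delete;

    // Accepts any callable taking no arguments; returns a future for its result.
    template <class F>
    auto submit(F f) -> std::future<decltype(f())> {
        using R = decltype(f());
        auto task = std::make_shared<std::packaged_task<R()>>(std::move(f));
        std::future<R> result = task->get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.emplace_back([task] { (*task)(); });
        }
        wakeup_.notify_one();
        return result;
    }

private:
    void worker() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wakeup_.wait(lock, [this] { return terminated_ || !queue_.empty(); });
                if (queue_.empty()) return;  // only reached once terminated
                job = std::move(queue_.front());
                queue_.pop_front();
            }
            job();  // run outside the lock
        }
    }

    std::mutex mutex_;
    std::condition_variable wakeup_;
    std::deque<std::function<void()>> queue_;
    bool terminated_ = false;
    std::vector<std::thread> workers_;  // declared last so it starts last
};
```

A call such as pool.submit([] { return 6 * 7; }) then yields a std::future<int>. Every submit still allocates, so the caveat about very short, very frequent tasks applies to this sketch as well.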


c++ thread pool: alternative to std::function for passing functions/lambdas to threads?

I have a thread pool that I use to execute many tiny jobs (millions of jobs, dozens/hundreds of milliseconds each). The jobs are passed in the form of either:
std::bind(&fn, arg1, arg2, arg3...)
[&](){fn(arg1, arg2, arg3...);}
with the thread pool taking them like this:
std::queue<std::function<void(void)>> queue;
void addJob(std::function<void(void)> fn) {
    queue.push(std::move(fn)); // body assumed; only the signature was shown
}
Pretty standard stuff... except that I've noticed a bottleneck: if jobs execute fast enough (in less than a millisecond), the conversion from lambda/binder to std::function in the addJob function actually takes longer than the execution of the jobs themselves. From what I've read, std::function is notoriously slow, so the bottleneck isn't necessarily unexpected.
Is there a faster way of doing this type of thing? I've looked into drop-in std::function replacements but they either weren't compatible with my compiler or weren't faster. I've also looked into "fast delegates" by Don Clugston but they don't seem to allow the passing of arguments along with functions (maybe I don't understand them correctly?).
I'm compiling with VS2015u3, and the functions passed to the jobs are all static, with their arguments being either ints/floats or pointers to other objects.
Have a separate queue for each of the task types - you probably don't have tens of thousands of task types. Each of these can be e.g. a static member of your tasks. Then addJob() is actually the ctor of Task and it's perfectly type-safe.
Then define a compile-time list of your task types and visit it via template metaprogramming (for_each). It'll be way faster as you don't need any virtual call fnptr / std::function<> to achieve this.
This will only work if your tuple code sees all the Task classes (so you can't e.g. add a new descendant of Task to an already running executable by loading the image from disc - hope that's a non-issue).
template<typename D> // CRTP on D
class Task {
public:
    // you might want to static_assert at some point that D is in TaskTypeList
    Task() : it_(tasks_.end()) {} // call enqueue() in descendant
    ~Task() {
        // add your favorite lock here
        if (queued()) {
            tasks_.erase(it_);
        }
    }
    bool queued() const { return it_ != tasks_.end(); }
    static size_t ExecNext() {
        if (!tasks_.empty()) {
            // add your favorite lock here
            D* task = tasks_.front();
            tasks_.pop_front();
            task->it_ = tasks_.end(); // mark as no longer queued
            // release lock
            (*task)();
        }
        return tasks_.size();
    }
protected:
    void enqueue()
    {
        // add your favorite lock here
        it_ = tasks_.insert(tasks_.end(), static_cast<D*>(this));
    }
private:
    typename std::list<D*>::iterator it_;
    static std::list<D*> tasks_; // you can have one per thread, too - then you don't need locking, but tasks are assigned to threads statically
};
template<typename D> std::list<D*> Task<D>::tasks_;
struct MyTask : Task<MyTask> {
    MyTask() { enqueue(); } // call enqueue only when the class is ready
    void operator()() { /* add task here */ }
    // ...
};
struct MyTask2; // etc.
template<typename...>
struct list_ {};
using TaskTypeList = list_<MyTask, MyTask2>;
void thread_process(list_<>) {}
template<typename TaskType, typename... TaskTypes>
void thread_process(list_<TaskType, TaskTypes...>)
{
    TaskType::ExecNext();
    thread_process(list_<TaskTypes...>());
}
void thread_process(void*)
{
    for (;;) {
        thread_process(TaskTypeList());
    }
}
There's a lot to tune on this code: different threads should start from different parts of the queue (or one would use a ring, or several queues and either static/dynamic assignment to threads), you'd send it to sleep when there are absolutely no tasks, one could have an enum for the tasks, etc.
Note that this can't be used with arbitrary lambdas: you need to list task types. You need to 'communicate' the lambda type out of the function where you declare it (e.g. by returning a std::make_pair of the return value and a list_ of the lambda's type), and sometimes that's not easy. However, you can always convert a lambda to a functor, which is straightforward, just ugly.
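To illustrate the last point, here is what converting a lambda to a functor looks like. AddJob and its members are made up for this example; the only point is that the functor has a name that can be listed, while the lambda's type does not.

```cpp
// As a lambda, the job's type is anonymous and cannot appear in a type list:
//     auto job = [a, b, out] { *out = a + b; };

// The equivalent hand-written functor. Each capture becomes a data member
// and the lambda body becomes operator().
struct AddJob {
    int a;
    int b;
    int* out;  // where the result is stored

    void operator()() const { *out = a + b; }
};
```

AddJob can then be used wherever a named task type is required, for example AddJob job{40, 2, &result}; job();.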

Synchronizing method calls on shared object from multiple threads

I am thinking about how to implement a class that will contain private data that will eventually be modified by multiple threads through method calls. For synchronization (using the Windows API), I am planning on using a CRITICAL_SECTION object since all the threads will be spawned from the same process.
Given the following design, I have a few questions.
template <typename T> class Shareable
{
private:
    const LPCRITICAL_SECTION sync; //Can be read and used by multiple threads
    T *data;
public:
    Shareable(LPCRITICAL_SECTION cs, unsigned elems) : sync{cs}, data{new T[elems]} { }
    ~Shareable() { delete[] data; }
    void sharedModify(unsigned index, T &datum) //<-- Can this be validly called
                                                //by multiple threads with synchronization being implicit?
    {
        EnterCriticalSection(sync);
        /* The critical section of code involving reads & writes to 'data' */
        LeaveCriticalSection(sync);
    }
};
// Somewhere else ...
DWORD WINAPI ThreadProc(LPVOID lpParameter)
{
    Shareable<ActualType> *ptr = static_cast<Shareable<ActualType>*>(lpParameter);
    ActualType copyable = /* initialization */;
    ptr->sharedModify(validIndex, copyable); //<-- OK, synchronized?
    return 0;
}
The way I see it, the API calls will be conducted in the context of the current thread. That is, I assume this is the same as if I had acquired the critical section object from the pointer and called the API from within ThreadProc(). However, I am worried that if the object is created and placed in the main/initial thread, there will be something funky about the API calls.
When sharedModify() is called on the same object concurrently, from multiple threads, will the synchronization be implicit, in the way I described it above?
Should I instead get a pointer to the critical section object and use that instead?
Is there some other synchronization mechanism that is better suited to this scenario?
When sharedModify() is called on the same object concurrently, from multiple threads, will the synchronization be implicit, in the way I described it above?
It's not implicit, it's explicit. There's only one CRITICAL_SECTION and only one thread can hold it at a time.
Should I instead get a pointer to the critical section object and use that instead?
No. There's no reason to use a pointer here.
Is there some other synchronization mechanism that is better suited to this scenario?
It's hard to say without seeing more code, but this is definitely the "default" solution. It's like a singly-linked list -- you learn it first, it always works, but it's not always the best choice.
When sharedModify() is called on the same object concurrently, from multiple threads, will the synchronization be implicit, in the way I described it above?
Implicit from the caller's perspective, yes.
Should I instead get a pointer to the critical section object and use that instead?
No. In fact, I would suggest giving the Shareable object ownership of its own critical section instead of accepting one from the outside (and embracing RAII concepts to write safer code), e.g.:
template <typename T>
class Shareable
{
private:
    CRITICAL_SECTION sync;
    std::vector<T> data;
    struct SyncLocker
    {
        CRITICAL_SECTION &sync;
        SyncLocker(CRITICAL_SECTION &cs) : sync(cs) { EnterCriticalSection(&sync); }
        ~SyncLocker() { LeaveCriticalSection(&sync); }
    };
public:
    Shareable(unsigned elems) : data(elems) { InitializeCriticalSection(&sync); }
    Shareable(const Shareable&) = delete;
    Shareable(Shareable&&) = delete;
    ~Shareable()
    {
        { SyncLocker lock(sync); data.clear(); } // wait for any in-flight sharedModify()
        DeleteCriticalSection(&sync);
    }
    void sharedModify(unsigned index, const T &datum)
    {
        SyncLocker lock(sync);
        data[index] = datum;
    }
    Shareable& operator=(const Shareable&) = delete;
    Shareable& operator=(Shareable&&) = delete;
};
Is there some other synchronization mechanism that is better suited to this scenario?
That depends. Will multiple threads be accessing the same index at the same time? If not, then there is not really a need for the critical section at all. One thread can safely access one index while another thread accesses a different index.
If multiple threads need to access the same index at the same time, a critical section might still not be the best choice. Locking the entire array might be a big bottleneck if you only need to lock portions of the array at a time. Things like the Interlocked API, or Slim Read/Write locks, might make more sense. It really depends on your thread designs and what you are actually trying to protect.
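As a rough illustration of the finer-grained alternative, the sketch below makes each element atomic instead of locking the whole array. It uses std::atomic as a portable stand-in for the Interlocked API (an assumption made so that the example does not depend on windows.h), and demo_atomic_slots is a hypothetical name.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Four threads increment the same element 1000 times each. No lock is taken:
// every update is a single atomic read-modify-write, much like
// InterlockedIncrement would be on Windows.
inline int demo_atomic_slots() {
    std::vector<std::atomic<int>> data(4);
    for (auto& slot : data) slot.store(0);

    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t) {
        threads.emplace_back([&data] {
            for (int i = 0; i < 1000; ++i)
                data[1].fetch_add(1);  // all threads hit the same index
        });
    }
    for (auto& th : threads) th.join();
    return data[1].load();
}
```

This only works when each update fits into one atomic operation; anything that must touch several elements consistently still needs a lock of some kind.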

C++ Multi-threading giving a templated std::bind to another thread

I'm trying to hand a std::bind result to another, already existing thread that is currently waiting on a condition_variable. I really want to keep this other thread alive rather than create another one.
But I don't know how to pass this std::bind to the other thread, since everything is decided at compile time.
I know that the Boost thread pool manages this; I wonder how, and I'd like to do it without Boost.
Here is some pseudo-code
class Exec
{
    template<typename Func, typename... Args>
    auto call(Func func, Args... args)
    {
        sendWork(std::bind(func, this->someMemberClass, args...)); // Async
        return getResults(); // Waiting til get results
    }
    void waitThread()
    {
        //Thread waiting
        // Will do the std::bind sent at sendWork
    }
};
Has someone any idea?
Thank you for your time!
As mentioned in the comments, the usual way to pass a generic function to another thread is to wrap it in a std::function<void()>. That rules out a return type, but you can still bind any number and type of arguments. To return results, you'll have to think about callbacks.
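A minimal sketch of that idea, with hypothetical names (Worker, call, send, loop): one long-lived thread waits on a condition_variable and runs whatever std::function<void()> it is handed. The result travels back through a std::promise captured by the job, which plays the role of the callback. The sketch assumes a non-void return type.

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

class Worker {
public:
    Worker() : thread_([this] { loop(); }) {}
    ~Worker() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        cv_.notify_one();
        thread_.join();
    }

    // Binds func and args, runs them on the worker thread, waits for the result.
    template <typename Func, typename... Args>
    auto call(Func func, Args... args) -> decltype(func(args...)) {
        using R = decltype(func(args...));
        auto promise = std::make_shared<std::promise<R>>();
        std::future<R> result = promise->get_future();
        auto bound = std::bind(func, args...);
        send([promise, bound]() mutable { promise->set_value(bound()); });
        return result.get();  // blocks until the worker has run the job
    }

private:
    void send(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
        }
        cv_.notify_one();
    }
    void loop() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return done_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // only reached once done_ is set
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    bool done_ = false;
    std::thread thread_;  // declared last so the other members exist first
};
```

The type erasure happens in send(): by then the bound call has the uniform signature void(), so the worker never needs to know Func or Args.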

A thread-safe implementation of a generic container of type pair<unsigned int, boost::any> using shared_ptrs

I have created a generic message queue for use in a multi-threaded application. Specifically, single producer, multi-consumer. Main code below.
1) I wanted to know if I should pass a shared_ptr allocated with new into the enqueue method by value, or is it better to have the queue wrapper allocate the memory itself and just pass in a genericMsg object by const reference?
2) Should I have my dequeue method return a shared_ptr, have a shared_ptr passed in as a parameter by reference (current strategy), or just have it directly return a genericMsg object?
3) Will I need signal/wait in enqueue/dequeue or will the read/write locks suffice?
4) Do I even need to use shared_ptrs? Or will this depend solely on the implementation I use? I like that the shared_ptrs will free memory once all references are no longer using the object. I can easily port this to regular pointers if that's recommended, though.
5) I'm storing a pair here because I'd like to discriminate what type of message I'm dealing with else w/o having to do an any_cast. Every message type has a unique ID that refers to a specific struct. Is there a better way of doing this?
Generic Message Type:
template<typename Message_T>
class genericMsg
{
public:
    genericMsg()
    {
        id = 0;
        size = 0;
    }
    genericMsg (const unsigned int &_id, const unsigned int &_size, const Message_T &_data)
    {
        id = _id;
        size = _size;
        data = _data;
    }
    unsigned int id;
    unsigned int size;
    Message_T data; //All structs stored here contain only POD types
};
Enqueue Methods:
// ----------------------------------------------------------------
// -- Thread safe function that adds a new genericMsg object to the
// -- back of the Queue.
// -----------------------------------------------------------------
template<class Message_T>
inline void enqueue(boost::shared_ptr< genericMsg<Message_T> > data)
{
    WriteLock w_lock(myLock);
    this->qData.push_back(std::make_pair(data->id, data));
}
// ----------------------------------------------------------------
// -- Thread safe function that adds a new genericMsg object to the
// -- back of the Queue.
// -----------------------------------------------------------------
template<class Message_T>
inline void enqueue(const genericMsg<Message_T> &data_in)
{
    WriteLock w_lock(myLock);
    boost::shared_ptr< genericMsg<Message_T> > data(
        new genericMsg<Message_T>(data_in.id, data_in.size, data_in.data));
    this->qData.push_back(std::make_pair(data_in.id, data));
}
Dequeue Method:
// ----------------------------------------------------------------
// -- Thread safe function that grabs a genericMsg object from the
// -- front of the Queue.
// -----------------------------------------------------------------
template<class Message_T>
void dequeue(boost::shared_ptr< genericMsg<Message_T> > &msg)
{
    ReadLock r_lock(myLock);
    msg = boost::any_cast< boost::shared_ptr< genericMsg<Message_T> > >(qData.front().second);
    qData.pop_front();
}
Get message ID:
inline unsigned int getMessageID()
{
    ReadLock r_lock(myLock);
    unsigned int tempID = qData.front().first;
    return tempID;
}
Data Types:
std::deque < std::pair< unsigned int, boost::any> > qData;
I have improved upon my design. I now have a genericMessage base class that I directly subclass from in order to derive the unique messages.
Generic Message Base Class:
class genericMessage
{
public:
    virtual ~genericMessage() {}
    unsigned int getID() {return id;}
    unsigned int getSize() {return size;}
protected:
    unsigned int id;
    unsigned int size;
};
Producer Snippet:
boost::shared_ptr<genericMessage> tmp (new derived_msg1(MSG1_ID));
Consumer Snippet:
boost::shared_ptr<genericMessage> tmp = theQueue.dequeue();
if(tmp->getID() == MSG1_ID)
{
    boost::shared_ptr<derived_msg1> tObj = boost::dynamic_pointer_cast<derived_msg1>(tmp);
}
New Queue:
std::deque< boost::shared_ptr<genericMessage> > qData;
New Enqueue:
void mq_class::enqueue(const boost::shared_ptr<genericMessage> &data_in) {
    boost::unique_lock<boost::mutex> lock(mut);
    qData.push_back(data_in);
}
New Dequeue:
boost::shared_ptr<genericMessage> mq_class::dequeue()
{
    boost::shared_ptr<genericMessage> ptr;
    boost::unique_lock<boost::mutex> lock(mut);
    ptr = qData.front();
    qData.pop_front();
    return ptr;
}
Now, my question is am I doing dequeue correctly? Is there another way of doing it? Should I pass in a shared_ptr as a reference in this case to achieve what I want?
Edit (I added answers for parts 1, 2, and 4).
1) You should have a factory method that creates new genericMsgs and returns a std::unique_ptr. There is absolutely no good reason to pass genericMsg in by const reference and then have the queue wrap it in a smart pointer: Once you've passed by reference you have lost track of ownership, so if you do that the queue is going to have to construct (by copy) the entire genericMsg to wrap.
2) I can't think of any circumstance under which it would be safe to take a reference to a shared_ptr or unique_ptr or auto_ptr. shared_ptrs and unique_ptrs are for tracking ownership and once you've taken a reference to them (or the address of them) you have no idea how many references or pointers are still out there expecting the shared_ptr/unique_ptr object to contain a valid naked pointer.
unique_ptr is always preferred to a naked pointer, and is preferred to a shared_ptr in cases where you only have a single piece of code (validly) pointing to an object at a time.
See also: Bad practice to return unique_ptr for raw pointer like ownership semantics? (The answer there explains why it is good practice, not bad.)
3) Yes, you need to use a std::condition_variable in your dequeue function. You need to test whether qData is empty or not before calling qData.front() or qData.pop_front(). If qData is empty you need to wait on a condition variable. When enqueue inserts an item it should signal the condition variable to wake up anyone who may have been waiting.
Your use of reader/writer locks is completely incorrect. Don't use reader/writer locks. Use std::mutex. A reader lock can only be used on a method that is completely const. You are modifying qData in dequeue, so a reader lock will lead to data races there. (Reader/writer locks are only applicable when you have stupid code that is both const and holds locks for an extended period of time. You are only keeping the lock for the time it takes to insert or remove from the queue, so even if you were const, the added overhead of reader/writer locks would be a net loss.)
An example of implementing a (bounded) buffer using mutexes and condition_variables can be found at: Is this a correct way to implement a bounded buffer in C++.
4) unique_ptr is always preferred to naked pointers, and usually preferred to shared_ptr. (The main exception where shared_ptr might be better is for graph-like data structures.) In cases like yours, where you are reading something in on one side, creating a new object with a factory, moving the ownership to the queue and then moving ownership out of the queue to the consumer, it sounds like you should be using unique_ptr.
5) You are reinventing tagged unions. Virtual functions were added to c++ specifically so you wouldn't need to do this. You should subclass your messages from a class that has a virtual function called do_it() (or better yet, operator()() or something like that). Then instead of tagging each struct, make each struct a subclass of your message class. When you dequeue each struct (or ptr to struct) just call do_it() on it. Strong static typing, no casts. See C++ std condition variable covering a lot of share variables for an example.
Also: if you are going to stick with the tagged unions: you can't have separate calls to get the id and the data item. Consider: If thread A calls to get the id, then thread B calls to get the id, then thread B retrieves the data item, now what happens when thread A calls to retrieve a data item? It gets a data item, but not with the type that it expected. You need to retrieve the id and the data item under the same critical section.
First of all, it's better to use third-party concurrent containers than to implement them yourself, unless it's for educational purposes.
Your messages don't look like they have costly constructors/destructors, so you can store them by value and forget about all your other questions. Use move semantics (if available) as an optimization.
If your profiler says "by value" is a bad idea in your particular case:
I suppose your producer creates messages, puts them into your queue and loses any interest in them. In this case you don't need shared_ptr because you don't have shared ownership. You can use unique_ptr or even a raw pointer. These are implementation details, and it's better to hide them inside the queue.
From a performance point of view, it's better to implement a lock-free queue. "Locks vs. signals" depends completely on your application. For example, if you use a thread pool and some kind of scheduler, it's better to let your clients do something useful while the queue is full/empty. In simpler cases a reader/writer lock is just fine.
When I want thread safety, I usually use const objects and only modify state while copying or constructing. That way you don't need any locking mechanism. In a threaded system, this is usually more efficient than using mutexes on a single instance.
In your case only the deque would need a lock.

Lightweight wrapper - is this a common problem and if yes, what is its name?

I have to use a library that makes database calls which are not thread-safe. Also I occasionally have to load larger amounts of data in a background thread.
It is hard to say which library functions actually access the DB, so I think the safest approach for me is to protect every library call with a lock.
Let's say I have a library object:
dbLib::SomeObject someObject;
Right now I can do something like this:
dbLib::ErrorCode errorCode = 0;
std::list<dbLib::Item> items;
{
    DbLock dbLock;
    errorCode = someObject.someFunction(&items);
} // dbLock goes out of scope
I would like to simplify that to something like this (or even simpler):
dbLib::ErrorCode errorCode =
protectedCall(someObject, &dbLib::SomeObject::someFunction(&items));
The main advantage of this would be that I won't have to duplicate the interface of dbLib::SomeObject in order to protect each call with a lock.
I'm pretty sure that this is a common pattern/idiom but I don't know its name or what keywords to search for. (I think it's more an idiom than a pattern.)
Where do I have to look for more information?
You could make protectedCall a template function that takes a functor without arguments (meaning you'd bind the arguments at the call-site), and then creates a scoped lock, calls the functor, and returns its value. For example something like:
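template <typename Ret>
Ret protectedCall(boost::function<Ret ()> func)
{
    DbLock lock;
    return func();
}
You'd then call it like this (Ret has to be given explicitly, since it cannot be deduced from the bind expression):
dbLib::ErrorCode errorCode = protectedCall<dbLib::ErrorCode>(boost::bind(&dbLib::SomeObject::someFunction, &someObject, &items));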
EDIT. In case you're using C++0x, you can use std::function and std::bind instead of the boost equivalents.
In C++0x, you can implement some form of decorators:
template <typename F>
auto protect(F&& f) -> decltype(f())
{
    DbLock lock;
    return f();
}
dbLib::ErrorCode errorCode = protect([&]()
{
    return someObject.someFunction(&items);
});
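Here is a self-contained version of that decorator. DbLock and the library are not shown in the question, so this sketch substitutes stand-ins: a DbLock built on one global std::mutex and a FakeDb class. Both are assumptions made purely for illustration.

```cpp
#include <mutex>
#include <utility>
#include <vector>

// Stand-in for the question's DbLock: a scoped lock on one global mutex.
inline std::mutex& db_mutex() {
    static std::mutex m;
    return m;
}
struct DbLock {
    DbLock() : guard(db_mutex()) {}
    std::lock_guard<std::mutex> guard;
};

// Runs any nullary callable while holding the lock and forwards its result.
template <typename F>
auto protect(F&& f) -> decltype(std::forward<F>(f)()) {
    DbLock lock;
    return std::forward<F>(f)();
}

// Stand-in for a library object whose calls are not thread-safe.
struct FakeDb {
    int someFunction(std::vector<int>* items) {
        items->push_back(42);
        return 0;  // "no error"
    }
};
```

A call then reads int errorCode = protect([&] { return db.someFunction(&items); });. The lock is released when protect returns, so anything the call hands back must be safe to use without the lock.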
From your description this would seem a job for Decorator Pattern.
However, especially in the case of resources, I wouldn't recommend using it.
The reason is that in general these functions tend to scale badly, require higher level (less finegrained) locking for consistency, or return references to internal structures that require the lock to stay locked until all information is read.
Think, e.g. about a DB function that calls a stored procedure that returns a BLOB (stream) or a ref cursor: the streams should not be read outside of the lock.
What to do?
I recommend instead to use the Facade Pattern. Instead of composing your operations directly in terms of DB calls, implement a facade that uses the DB layer; This layer could then manage the locking at exactly the required level (and optimize where needed: you could have the facade be implemented as a thread-local Singleton, and use separate resources, obviating the need for locks, e.g.)
The simplest (and still straightforward) solution might be to write a function which returns a proxy for the object. The proxy does the locking and overloads -> to allow calling the object. Here is an example:
#include <cstdio>
template<class T>
class call_proxy
{
    T &item;
public:
    call_proxy(T &t) : item(t) { puts("LOCK"); }
    T *operator -> () { return &item; }
    ~call_proxy() { puts("UNLOCK"); }
};
template<class T>
call_proxy<T> protect(T &t)
{
    return call_proxy<T>(t);
}
Here's how to use it:
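class Intf {
public:
    void function() { /* ... */ }
};
int main() {
    Intf a;
    protect(a)->function();
}
The output should be LOCK, then whatever function() prints, then UNLOCK.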
If you want the lock to happen before the evaluation of the arguments, then you can use this macro:
#define PCALL(X,APPL) (protect(X), (X).APPL)
This evaluates X twice though.
This article by Andrei Alexandrescu shows how to create this kind of thin wrapper and combine it with the dreaded volatile keyword for thread safety.
Mutex locking is a similar problem. I asked for help here: Need some feedback on how to make a class "thread-safe"
The solution I came up with was a wrapper class that prevents access to the protected object. Access can be obtained via an "accessor" class. The accessor will lock the mutex in its constructor and unlock it on destruction. See the "ThreadSafe" and "Locker" classes in Threading.h for more details.