Multi threading independent tasks - c++

I have N tasks, which are independent (i.e., they write to different memory addresses) but don't take exactly the same time to complete (from 2 to 10 seconds, say). I have P threads.
I can divide my N tasks into P threads, and launch my threads. Ultimately, at the end, there will be a single thread remaining to complete the last few tasks, which is not optimal.
I can also launch P threads with 1 task each, WaitForMultipleObjects, and relaunch P threads etc. (that's what I currently do, as the overhead of creating threads is small compared to the task). However, this does not solve the problem either, there will still be P-1 threads waiting for the last one at some point.
Is there a way to launch threads, and as soon as the thread has finished its task, go on to the next available task until all tasks are completed ?
Thanks !

Yes, it's called thread pooling. It's a very common practice.
http://en.wikipedia.org/wiki/Thread_pool_pattern
Basically, you create a queue of tasks (function pointers with their arguments), and push the tasks there. You have a fixed number of threads running (P, in your notation), each of which runs the following loop (schematic code):
while (bRunning) {
    task = m_pQueue.pop();
    if (task) {
        executeTask(task);
    }
    else {
        // you can sleep a bit here if you want
    }
}
There are more elegant ways to implement it (avoiding sleeps, etc.) but this is the gist of it.
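When all N tasks are known up front, as in the question, a shared atomic index can stand in for the queue. The sketch below is a simplification of the loop above under that assumption, not a general thread pool; `run_all` and its parameters are hypothetical names.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Runs every task exactly once on `thread_count` worker threads.
// Each worker claims the next unclaimed index, so a thread that finishes
// early immediately picks up more work instead of idling.
void run_all(const std::vector<std::function<void()>>& tasks,
             std::size_t thread_count) {
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;
    workers.reserve(thread_count);
    for (std::size_t t = 0; t < thread_count; ++t) {
        workers.emplace_back([&tasks, &next] {
            for (;;) {
                const std::size_t i = next.fetch_add(1);
                if (i >= tasks.size()) return;  // nothing left to claim
                tasks[i]();
            }
        });
    }
    for (std::thread& w : workers) w.join();  // wait for all tasks to finish
}
```

No sleeping or polling is needed here because the amount of work is fixed; a thread simply exits when the index runs past the end.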


Where can we use std::barrier over std::latch?

I recently heard new c++ standard features which are:
std::latch
std::barrier
I cannot figure out in which situations each of them is applicable and useful over the other.
If someone could give an example of how to use each one of them wisely, it would be really helpful.
Very short answer
They're really aimed at quite different goals:
Barriers are useful when you have a bunch of threads and you want to synchronise across all of them at once, for example to do something that operates on all of their data at once.
Latches are useful if you have a bunch of work items and you want to know when they've all been handled, and aren't necessarily interested in which thread(s) handled them.
Much longer answer
Barriers and latches are often used when you have a pool of worker threads that do some processing and a queue of work items that is shared between them. It's not the only situation where they're used, but it is a very common one and does help illustrate the differences. Here's some example code that would set up some threads like this:
const size_t worker_count = 7; // or whatever
std::vector<std::thread> workers;
std::vector<Proc> procs(worker_count);
Queue<std::function<void(Proc&)>> queue;
for (size_t i = 0; i < worker_count; ++i) {
    workers.push_back(std::thread(
        [p = &procs[i], &queue]() {
            while (auto fn = queue.pop_back()) {
                fn(*p);
            }
        }
    ));
}
There are two types that I have assumed exist in that example:
Proc: a type specific to your application that contains data and logic necessary to process work items. A reference to one is passed to each callback function that's run in the thread pool.
Queue: a thread-safe blocking queue. There is nothing like this in the C++ standard library (somewhat surprisingly) but there are a lot of open-source libraries containing them e.g. Folly MPMCQueue or moodycamel::ConcurrentQueue, or you can build a less fancy one yourself with std::mutex, std::condition_variable and std::deque (there are many examples of how to do this if you Google for them).
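As a rough illustration of the "less fancy" option, here is a minimal sketch of such a queue. It keeps the `push_back` / `pop_back` names used in the example above; the `close()` member and the convention of returning a default-constructed value after shutdown are assumptions of this sketch, chosen so that a loop like `while (auto fn = queue.pop_back())` terminates.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <utility>

// Sketch of a thread-safe blocking queue. T must be default-constructible:
// a default-constructed T is returned once the queue is closed and drained
// (for std::function that is an empty function, which tests as false).
template <typename T>
class Queue {
public:
    void push_back(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push_back(std::move(item));
        }
        ready_.notify_one();  // wake one blocked consumer, if any
    }

    // Blocks until an item is available or close() has been called.
    // Items come out in FIFO order despite the method name.
    T pop_back() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !items_.empty() || closed_; });
        if (items_.empty()) return T();  // closed and drained
        T item = std::move(items_.front());
        items_.pop_front();
        return item;
    }

    // Hypothetical shutdown hook: wakes every blocked consumer.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<T> items_;
    bool closed_ = false;
};
```

The predicate passed to `wait()` guards against spurious wakeups, and notifying after releasing the lock avoids waking a consumer that would immediately block on the mutex.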
Latch
A latch is often used to wait until some work items you push onto the queue have all finished, typically so you can inspect the result.
std::vector<WorkItem> work = get_work();
std::latch latch(work.size());
for (WorkItem& work_item : work) {
    queue.push_back([&work_item, &latch](Proc& proc) {
        proc.do_work(work_item);
        latch.count_down();
    });
}
latch.wait();
// Inspect the completed work
How this works:
The threads will - eventually - pop the work items off of the queue, possibly with multiple threads in the pool handling different work items at the same time.
As each work item is finished, latch.count_down() is called, effectively decrementing an internal counter that started at work.size().
When all work items have finished, that counter reaches zero, at which point latch.wait() returns and the producer thread knows that the work items have all been processed.
Notes:
The latch count is the number of work items that will be processed, not the number of worker threads.
The count_down() method could be called zero times, one time, or multiple times on each thread, and that number could be different for different threads. For example, even if you push 7 messages onto 7 threads, it might be that all 7 items are processed on the same thread (rather than one on each thread) and that's fine.
Other unrelated work items could be interleaved with these ones (e.g. because they were pushed onto the queue by other producer threads) and again that's fine.
In principle, it's possible that latch.wait() won't be called until after all of the worker threads have already finished processing all of the work items. (This is the sort of odd condition you need to look out for when writing threaded code.) But that's OK, it's not a race condition: latch.wait() will just immediately return in that case.
An alternative to using a latch is that there's another queue, in addition to the one shown here, that contains the result of the work items. The thread pool callback pushes results on to that queue while the producer thread pops results off of it. Basically, it goes in the opposite direction to the queue in this code. That's a perfectly valid strategy too, in fact if anything it's more common, but there are other situations where the latch is more useful.
Barrier
A barrier is often used to make all threads wait simultaneously so that the data associated with all of the threads can be operated on simultaneously.
typedef std::function<void()> Fn;
Fn completionFn = [&procs]() {
    // Do something with the whole vector of Proc objects
};
auto barrier = std::make_shared<std::barrier<Fn>>(worker_count, completionFn);
auto workerFn = [barrier](Proc&) {
    barrier->arrive_and_wait(); // std::barrier's blocking arrival call in C++20
};
for (size_t i = 0; i < worker_count; ++i) {
    queue.push_back(workerFn);
}
How this works:
All of the worker threads will pop one of these workerFn items off of the queue and call barrier->arrive_and_wait() (the C++20 std::barrier member that arrives at the barrier and blocks).
Once all of them are waiting, one of them will call completionFn() while the others continue to wait.
Once that function completes they will all return from arrive_and_wait() and be free to pop other, unrelated, work items from the queue.
Notes:
Here the barrier count is the number of worker threads.
It is guaranteed that each thread will pop precisely one workerFn off of the queue and handle it. Once a thread has popped one off of the queue, it will wait in barrier->arrive_and_wait() until all the other copies of workerFn have been popped off by other threads, so there is no chance of it popping another one off.
I used a shared pointer to the barrier so that it will be destroyed automatically once all the work items are done. This wasn't an issue with the latch because there we could just make it a local variable in the producer thread function, because it waits until the worker threads have used the latch (it calls latch.wait()). Here the producer thread doesn't wait for the barrier so we need to manage the memory in a different way.
If you did want the original producer thread to wait until the barrier has been finished, that's fine, it can call arrive_and_wait() too, but you will obviously need to pass worker_count + 1 to the barrier's constructor. (And then you wouldn't need to use a shared pointer for the barrier.)
If other work items are being pushed onto the queue at the same time, that's fine too, although it will potentially waste time as some threads will just be sitting there waiting for the barrier to be acquired while other threads are distracted by other work before they acquire the barrier.
!!! DANGER !!!
The last bullet point about other work being pushed onto the queue being "fine" is only the case if that other work doesn't also use a barrier! If you have two different producer threads putting work items with a barrier on to the same queue and those items are interleaved, then some threads will wait on one barrier and others on the other one, and neither will ever reach the required wait count - DEADLOCK. One way to avoid this is to only ever use barriers like this from a single thread, or even to only ever use one barrier in your whole program (this sounds extreme but is actually quite a common strategy, as barriers are often used for one-time initialisation on startup). Another option, if the thread queue you're using supports it, is to atomically push all work items for the barrier onto the queue at once so they're never interleaved with any other work items. (This won't work with the moodycamel queue, which supports pushing multiple items at once but doesn't guarantee that they won't be interleaved with items pushed on by other threads.)
Barrier without completion function
At the point when you asked this question, the proposed experimental API didn't support completion functions. Even the current API at least allows not using them, so I thought I should show an example of how barriers can be used like that too.
auto barrier = std::make_shared<std::barrier<>>(worker_count);
auto workerMainFn = [&procs, barrier](Proc&) {
    barrier->arrive_and_wait();
    // Do something with the whole vector of Proc objects
    barrier->arrive_and_wait();
};
auto workerOtherFn = [barrier](Proc&) {
    barrier->arrive_and_wait(); // Wait for work to start
    barrier->arrive_and_wait(); // Wait for work to finish
};
queue.push_back(std::move(workerMainFn));
for (size_t i = 0; i < worker_count - 1; ++i) {
    queue.push_back(workerOtherFn);
}
How this works:
The key idea is to wait for the barrier twice in each thread, and do the work in between. The first waits have the same purpose as the previous example: they ensure any earlier work items in the queue are finished before starting this work. The second waits ensure that any later items in the queue don't start until this work has finished.
Notes:
The notes are mostly the same as the previous barrier example, but here are some differences:
One difference is that, because the barrier is not tied to the specific completion function, it's more likely that you can share it between multiple uses, like we did in the latch example, avoiding the use of a shared pointer.
This example makes it look like using a barrier without a completion function is much more fiddly, but that's just because this situation isn't well suited to them. Sometimes, all you need is to reach the barrier. For example, whereas we initialised a queue before the threads started, maybe you have a queue for each thread but initialised in the threads' run functions. In that case, maybe the barrier just signifies that the queues have been initialised and are ready for other threads to pass messages to each other. In that case, you can use a barrier with no completion function without needing to wait on it twice like this.
You could actually use a latch for this, calling count_down() and then wait() in place of the barrier's arrive_and_wait(). But using a barrier makes more sense, both because calling the combined function is a little simpler and because using a barrier communicates your intention better to future readers of the code.
In any case, the "DANGER" warning from before still applies.

Recommended pattern for a queue accessed by multiple threads...what should the worker thread do?

I have a queue of objects that is being added to by a thread A. Thread B is removing objects from the queue and processing them. There may be many threads A and many threads B.
I am using a mutex when the queue is being "push"ed to, and also when it is "front"ed and "pop"ped from, as shown in the pseudo-code below:
Thread A calls this to add to the queue:
void Add(object)
{
    mutex->lock();
    queue.push(object);
    mutex->unlock();
}
Thread B processes the queue as follows:
object GetNextTargetToWorkOn()
{
    object next = NULL;
    mutex->lock();
    if (! queue.empty())
    {
        next = queue.front();
        queue.pop();
    }
    mutex->unlock();
    return(next);
}
void DoTheWork(int param)
{
    while(true)
    {
        object target;
        while( (target = GetNextTargetToWorkOn()) == NULL)
            boost::this_thread::sleep_for(boost::chrono::milliseconds(100)); // sleep a very short time
        // do something with the target
    }
}
What bothers me is the while---get object---sleep-if-no-object paradigm. While there are objects to process it is fine. But while the thread is waiting for work there are two problems:
a) The while loop is whirling, consuming resources
b) the sleep means wasted time if a new object comes in to be processed
Is there a better pattern to achieve the same thing?
You're using spin-waiting; a better design is to use a monitor. Read more about the details on Wikipedia.
And a cross-platform solution using std::condition_variable with a good example can be found here.
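A minimal sketch of that monitor approach, keeping the question's `Add` and `GetNextTargetToWorkOn` names, might look like this. The class name `WorkQueue`, the `Stop()` member and the bool-plus-out-parameter return convention are assumptions of the sketch, not part of the original code.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

// Monitor-style replacement for the Add / GetNextTargetToWorkOn pair:
// consumers block inside the condition variable instead of polling.
template <typename T>
class WorkQueue {
public:
    void Add(T object) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(object));
        }
        not_empty_.notify_one();  // wake one waiting consumer, if any
    }

    // Blocks until an object is available, stores it in `out` and returns
    // true. Returns false once Stop() was called and the queue is empty.
    bool GetNextTargetToWorkOn(T& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return stopping_ || !queue_.empty(); });
        if (queue_.empty()) return false;
        out = std::move(queue_.front());
        queue_.pop();
        return true;
    }

    // Hypothetical shutdown hook so that worker loops can terminate.
    void Stop() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        not_empty_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable not_empty_;
    std::queue<T> queue_;
    bool stopping_ = false;
};
```

With this in place, the worker loop becomes `while (q.GetNextTargetToWorkOn(target)) { /* process target */ }`: a waiting thread uses no CPU, and it wakes as soon as a producer calls `Add`.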
a) The while loop is whirling, consuming resources
b) the sleep means wasted time if a new object comes in to be processed
It has been my experience that the sleep you used actually 'fixes' both of these issues.
a) The consuming of resources is a small amount of ram, and remarkably small fraction of available cpu cycles.
b) Sleep is not a wasted time on the OS's I've worked on.
c) Sleep can affect 'reaction time' (aka latency), but has seldom been an issue (outside of interrupts.)
The time spent in sleep is likely to be several orders of magnitude longer than the time spent in this simple loop. i.e. It is not significant.
IMHO - this is an ok implementation of the 'good neighbor' policy of relinquishing the processor as soon as possible.
On my desktop, AMD64 Dual Core, Ubuntu 15.04, a semaphore enforced context switch takes ~13 us.
100 ms ==> 100,000 us .. that is 4 orders of magnitude difference, i.e. VERY insignificant.
In the 5 OS's (Linux, vxWorks, OSE, and several other embedded system OS's) I have worked on, sleep (or their equivalent) is the correct way to relinquish the processor, so that it is not blocked from running another thread while the one thread is in sleep.
Note: It is feasible that some OS's sleep might not relinquish the processor. So, you should always confirm. I've not found one. Oh, but I admit I have not looked / worked much on Windows.

Many detached boost threads segfault

I'm creating boost threads inside a function with
while(trueNonceQueue.empty() && block.nNonce < std::numeric_limits<uint64_t>::max()){
    if ( block.nNonce % 100000 == 0 )
    {
        cout << block.nNonce << endl;
    }
    boost::thread t(CheckNonce, block);
    t.detach();
    block.nNonce++;
}
uint64 trueNonce;
while (trueNonceQueue.pop(trueNonce))
    block.nNonce = trueNonce;
trueNonceQueue was created with boost::lockfree::queue<uint64> trueNonceQueue(128); in the global scope.
This is the function being threaded
void CheckNonce(CBlock block){
    if(block.CheckBlockSilently()){
        while (!trueNonceQueue.push(block.nNonce))
            ;
    }
}
I noticed that after it crashed, my swap had grown marginally, which never happens unless I have leaked memory through poor technique like this; otherwise, my memory usage usually stays below 2 gigs. I'm running cinnamon on ubuntu desktop with chrome and a few other small programs open. I was not using the computer at the time this was running.
The segfault occurred after the 949900000th iteration. How can this be corrected?
CheckNonce execution time
I added the same modulus to CheckNonce to see if there was any lag. So far, there is none.
I will update if the detached threads start to lag behind the spawning while.
You should use a Thread Pool instead. This means spawning just enough threads to get work done without undue contention (for example you might spawn something like N-2 threads on an N-core machine, but perhaps more if some work may block on I/O).
There is not exactly a thread pool in Boost, but there are the parts you need to build one. See here for some ideas: boost::threadpool::pool vs. boost::thread_group
Or you can use a more ready-made solution like this (though it is a bit dated and perhaps unmaintained, not sure): http://threadpool.sourceforge.net/
Then the idea is to spawn the N threads, and then in your loop for each task, just "post" the task to the thread pool, where the next available worker thread will pick it up.
By doing this, you will avoid many problems, such as running out of thread stack space and inefficient resource contention (look up the "thundering herd problem"), and you will be able to easily tune the aggressiveness with which you use multiple cores on any system.
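As a rough sketch of the same idea without a pool library, a fixed number of long-lived threads can each walk a strided share of the nonce range. Everything here is hypothetical: `find_nonce` is an illustrative name, `is_valid` stands in for a check like `CheckBlockSilently()` and must be safe to call from several threads at once, and `limit` is assumed to be far enough below the maximum of `uint64_t` that the stride cannot overflow.

```cpp
#include <atomic>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Searches [0, limit) for a nonce accepted by `is_valid`, using a fixed
// number of threads instead of one detached thread per nonce.
// Returns `limit` if no nonce in the range is accepted. If several nonces
// are valid, any one of them may be returned.
std::uint64_t find_nonce(const std::function<bool(std::uint64_t)>& is_valid,
                         std::uint64_t limit, unsigned thread_count) {
    std::atomic<std::uint64_t> found{limit};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < thread_count; ++t) {
        workers.emplace_back([&, t] {
            // Worker t checks nonces t, t + thread_count, t + 2 * thread_count, ...
            for (std::uint64_t n = t; n < limit; n += thread_count) {
                if (found.load(std::memory_order_relaxed) != limit) return;
                if (is_valid(n)) {
                    found.store(n);
                    return;
                }
            }
        });
    }
    for (std::thread& w : workers) w.join();
    return found.load();
}
```

The number of live threads stays at `thread_count` for the whole search, so memory use no longer grows with the number of nonces tried.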

Multithreading using threadpool

I'm currently using the boost threadpool with the number of threads equal to the number of cores. I have scheduled, say 10 tasks using the pool's schedule function. For example,
suppose I have the function
void my_fun(std::vector<double>* my_vec){
// Do something here
}
The argument 'my_vec' here is just used to do some temporary calculations. The main reason I am passing it to the function is that I would like to reuse this vector when I call the function again.
Currently, I have the following
// Create a vector of 10 vectors called my_vecs
// Create threadpool
boost::threadpool::pool tp(num_threads);
// Schedule tasks
for (int m = 0; m < 10; m++){
    tp.schedule(boost::bind(my_fun, &my_vecs.at(m))); // my_fun takes a pointer
}
This is my problem: I would like to replace the vector of 10 vectors with only 2 vectors. If I want to schedule 10 tasks and I have 2 cores, a maximum of 2 threads (tasks) will be running at any time. So I only want to use two vectors (one assigned to each thread) and use it to carry out my 10 tasks. How can I do this?
I hope this is clear. Thank You!
Probably boost::thread_specific_ptr is what you need. Below is how you may use it in your function:
#include <boost/thread/tss.hpp>
boost::thread_specific_ptr<std::vector<double> > tls_vec;
void my_fun()
{
    std::vector<double>* my_vec = tls_vec.get();
    if( !my_vec ) {
        my_vec = new std::vector<double>();
        tls_vec.reset(my_vec);
    }
    // Do something here with my_vec
}
It will reuse vector instances between tasks scheduled to the same thread. There might be more than 2 instances if there are more threads in the pool, but due to preemption mentioned in other answers you really need an instance per running thread, not per core.
You should not need to delete vector instances stored in thread_specific_ptr; those will be automatically destroyed when corresponding threads finish.
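If a C++11 compiler is available, the standard `thread_local` keyword gives the same per-thread reuse without Boost. This is a swapped-in technique rather than the `thread_specific_ptr` approach above, and `scratch_buffer` and the body of `my_fun` are hypothetical stand-ins.

```cpp
#include <cstddef>
#include <vector>

// Each thread that calls this function gets its own scratch vector, created
// on first use and destroyed automatically when that thread exits.
std::vector<double>& scratch_buffer() {
    thread_local std::vector<double> buffer;
    return buffer;
}

// Stand-in for my_fun: reuses the calling thread's buffer across calls.
double my_fun(std::size_t n) {
    std::vector<double>& my_vec = scratch_buffer();
    my_vec.assign(n, 1.5);  // temporary calculations; allocated capacity is reused
    double sum = 0.0;
    for (double v : my_vec) sum += v;
    return sum;
}
```

As with the Boost version, the number of vectors equals the number of threads that ever ran a task, not the number of tasks.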
I wouldn't limit the number of threads to the number of cores. Remember that multi-threaded programming has been going on long before we had multi-core processors. This is because the threads will likely block for some resource and the next thread can jump in and use the CPU.
Java has a FixedThreadPool.
it looks like Boost might have something similar
http://deltavsoft.com/w/RcfUserGuide/1.2/rcf_user_guide/Multithreading.html
Basically a fixed thread pool spawns a fixed number of threads and then you can queue tasks in the manager queue.
While it's true that only two threads can be scheduled at the same time, on many threading systems the threads get time-sliced, so a thread gets pre-empted during the execution of its task. Hence a third (fourth, ...) thread will get a chance to work while the processing of the first and second are still incomplete.
I don't know about this particular threading implementation, but my guess is that it will allow (or run in environments supporting) pre-emptive scheduling. My way of thinking for threads is to try to keep it really simple: let each thread have its own resources.

Overhead due to use of Events

I have a custom thread pool class, that creates some threads that each wait on their own event (signal). When a new job is added to the thread pool, it wakes the first free thread so that it executes the job.
The problem is the following : I have around 1000 loops of each around 10'000 iterations to do. These loops must be executed sequentially, but I have 4 CPUs available. What I try to do is to split the 10'000 iteration loops into 4 2'500 iterations loops, ie one per thread. But I have to wait for the 4 small loops to finish before going to the next "big" iteration. This means that I can't bundle the jobs.
My problem is that using the thread pool and 4 threads is much slower than doing the jobs sequentially (having one loop executed by a separate thread is much slower than executing it directly in the main thread sequentially).
I'm on Windows, so I create events with CreateEvent() and then wait on one of them using WaitForMultipleObjects(2, handles, false, INFINITE) until the main thread calls SetEvent().
It appears that this whole event thing (along with the synchronization between the threads using critical sections) is pretty expensive !
My question is : is it normal that using events takes "a lot of" time ? If so, is there another mechanism that I could use and that would be less time-expensive ?
Here is some code to illustrate (some relevant parts copied from my thread pool class) :
// thread function
unsigned __stdcall ThreadPool::threadFunction(void* params) {
    // some housekeeping
    HANDLE signals[2];
    signals[0] = waitSignal;
    signals[1] = endSignal;
    do {
        // wait for one of the signals
        waitResult = WaitForMultipleObjects(2, signals, false, INFINITE);
        // try to get the next job parameters;
        if (tp->getNextJob(threadId, data)) {
            // execute job
            void* output = jobFunc(data.params);
            // tell thread pool that we're done and collect output
            tp->collectOutput(data.ID, output);
        }
        tp->threadDone(threadId);
    }
    while (waitResult - WAIT_OBJECT_0 == 0);
    // if we reach this point, endSignal was sent, so we are done !
    return 0;
}
// create all threads
for (int i = 0; i < nbThreads; ++i) {
    threadData data;
    unsigned int threadId = 0;
    char eventName[20];
    sprintf_s(eventName, 20, "WaitSignal_%d", i);
    data.handle = (HANDLE) _beginthreadex(NULL, 0, ThreadPool::threadFunction,
                                          this, CREATE_SUSPENDED, &threadId);
    data.threadId = threadId;
    data.busy = false;
    data.waitSignal = CreateEvent(NULL, true, false, eventName);
    this->threads[threadId] = data;
    // start thread
    ResumeThread(data.handle);
}
// add job
void ThreadPool::addJob(int jobId, void* params) {
    // housekeeping
    EnterCriticalSection(&(this->mutex));
    // first, insert parameters in the list
    this->jobs.push_back(job);
    // then, find the first free thread and wake it
    for (it = this->threads.begin(); it != this->threads.end(); ++it) {
        thread = (threadData) it->second;
        if (!thread.busy) {
            this->threads[thread.threadId].busy = true;
            ++(this->nbActiveThreads);
            // wake thread such that it gets the next params and runs them
            SetEvent(thread.waitSignal);
            break;
        }
    }
    LeaveCriticalSection(&(this->mutex));
}
This looks to me like a producer-consumer pattern, which can be implemented with two semaphores, one guarding against queue overflow, the other against an empty queue.
You can find some details here.
Yes, WaitForMultipleObjects is pretty expensive. If your jobs are small, the synchronization overhead will start to overwhelm the cost of actually doing the job, as you're seeing.
One way to fix this is bundle multiple jobs into one: if you get a "small" job (however you evaluate such things), store it someplace until you have enough small jobs together to make one reasonably-sized job. Then send all of them to a worker thread for processing.
Alternately, instead of using signaling you could use a multiple-reader single-writer queue to store your jobs. In this model, each worker thread tries to grab jobs off the queue. When it finds one, it does the job; if it doesn't, it sleeps for a short period, then wakes up and tries again. This will lower your per-task overhead, but your threads will take up CPU even when there's no work to be done. It all depends on the exact nature of the problem.
Watch out, you are still asking for a next job after the endSignal is emitted.
for( ;; ) {
    // wait for one of the signals
    waitResult = WaitForMultipleObjects(2, signals, false, INFINITE);
    if( waitResult - WAIT_OBJECT_0 != 0 )
        return 0; // endSignal (or an error): stop without fetching another job
    //....
}
Since you say that it is much slower in parallel than sequential execution, I assume that your processing time for your internal 2500 loop iterations is tiny (in the few microseconds range). Then there is not much you can do except review your algorithm to split it into larger chunks of processing; OpenMP won't help and no other synchronization technique will help either, because they all fundamentally rely on events (spin loops do not qualify).
On the other hand, if your processing time of the 2500 loop iterations is larger than 100 microseconds (on current PCs), you might be running into limitations of the hardware. If your processing uses a lot of memory bandwidth, splitting it to four processors will not give you more bandwidth, it will actually give you less because of collisions. You could also be running into problems of cache cycling where each of your top 1000 iterations will flush and reload the cache of the 4 cores. Then there is no one solution, and depending on your target hardware, there may be none.
If you are just parallelizing loops and using VS 2008, I'd suggest looking at OpenMP. If you're using Visual Studio 2010 beta 1, I'd suggest looking at the parallel pattern library, particularly the "parallel for" / "parallel for each" APIs or the "task group" class, because these will likely do what you're attempting to do, only with less code.
Regarding your question about performance, here it really depends. You'll need to look at how much work you're scheduling during your iterations and what the costs are. WaitForMultipleObjects can be quite expensive if you hit it a lot and your work is small which is why I suggest using an implementation already built. You also need to ensure that you aren't running in debug mode, under a debugger and that the tasks themselves aren't blocking on a lock, I/O or memory allocation, and you aren't hitting false sharing. Each of these has the potential to destroy scalability.
I'd suggest looking at this under a profiler like xperf, the F1 profiler in Visual Studio 2010 beta 1 (it has 2 new concurrency modes which help see contention), or Intel's VTune.
You could also share the code that you're running in the tasks, so folks could get a better idea of what you're doing, because the answer I always get with performance issues is first "it depends" and second, "have you profiled it."
Good Luck
-Rick
It shouldn't be that expensive, but if your job takes hardly any time at all, then the overhead of the threads and sync objects will become significant. Thread pools like this work much better for longer-processing jobs or for those that use a lot of IO instead of CPU resources. If you are CPU-bound when processing a job, ensure you only have 1 thread per CPU.
There may be other issues, how does getNextJob get its data to process? If there's a large amount of data copying, then you've increased your overhead significantly again.
I would optimise it by letting each thread keep pulling jobs off the queue until the queue is empty. That way, you can pass a hundred jobs to the thread pool and the sync objects will be used just the once to kick off the thread. I'd also store the jobs in a queue and pass a pointer, reference or iterator to them to the thread instead of copying the data.
The context switching between threads can be expensive too. It is interesting in some cases to develop a framework you can use to process your jobs sequentially with one thread or with multiple threads. This way you can have the best of the two worlds.
By the way, what is your question exactly ? I will be able to answer more precisely with a more precise question :)
EDIT:
The events part can consume more than your processing in some cases, but should not be that expensive unless your processing itself is really fast. In that case, switching between threads is expensive too, hence the first part of my answer about doing things sequentially ...
You should look for inter-threads synchronisation bottlenecks. You can trace threads waiting times to begin with ...
EDIT: After more hints ...
If I guess correctly, your problem is to efficiently use all your computer's cores/processors to parallelize some processing that is essentially sequential.
Say you have 4 cores and 10000 loop iterations to compute, as in your example (in a comment). You said that you need to wait for the 4 threads to end before going on, so you can simplify your synchronisation process. You just need to give your four threads the nth, (n+1)th, (n+2)th and (n+3)th loops, wait for the four threads to complete, then go on. You should use a rendezvous or barrier (a synchronization mechanism that waits for n threads to complete). Boost has such a mechanism. You can look at the Windows implementation for efficiency. Your thread pool is not really suited to the task. The search for an available thread in a critical section is what is killing your CPU time, not the event part.
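To make the barrier suggestion concrete, here is a portable C++11 sketch with a hand-rolled reusable barrier rather than the Boost or Windows one. All names (`Barrier`, `run_steps`) and the toy per-element computation are hypothetical; the point is that the threads are created once and meet at the barrier after every outer step, so no job is dispatched per step.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Minimal reusable barrier. The generation counter lets the same object be
// reused for every outer step.
class Barrier {
public:
    explicit Barrier(std::size_t count) : threshold_(count) {}

    void arrive_and_wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        const std::size_t gen = generation_;
        if (++waiting_ == threshold_) {
            waiting_ = 0;
            ++generation_;  // last arrival releases everyone in this phase
            cv_.notify_all();
        } else {
            cv_.wait(lock, [&] { return gen != generation_; });
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::size_t threshold_;
    std::size_t waiting_ = 0;
    std::size_t generation_ = 0;
};

// Runs `outer` sequential steps. In each step the `inner` elements are split
// into `thread_count` contiguous chunks; every step reads the previous
// step's output, so steps must not overlap.
std::vector<long> run_steps(std::size_t outer, std::size_t inner,
                            std::size_t thread_count) {
    std::vector<long> buf[2] = {std::vector<long>(inner),
                                std::vector<long>(inner)};
    for (std::size_t i = 0; i < inner; ++i) buf[0][i] = static_cast<long>(i);
    Barrier barrier(thread_count);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < thread_count; ++t) {
        const std::size_t begin = inner * t / thread_count;
        const std::size_t end = inner * (t + 1) / thread_count;
        workers.emplace_back([&, begin, end] {
            for (std::size_t step = 0; step < outer; ++step) {
                const std::vector<long>& src = buf[step % 2];
                std::vector<long>& dst = buf[(step + 1) % 2];
                for (std::size_t i = begin; i < end; ++i)
                    dst[i] = src[(i + 1) % inner] + 1;  // toy computation
                barrier.arrive_and_wait();  // nobody starts the next step early
            }
        });
    }
    for (std::thread& w : workers) w.join();
    return buf[outer % 2];  // buffer written by the final step
}
```

For the numbers in the question this would be called as `run_steps(1000, 10000, 4)`. Whether it beats the sequential version still depends on how much work each chunk contains relative to the cost of one barrier crossing.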
It appears that this whole event thing (along with the synchronization between the threads using critical sections) is pretty expensive !
"Expensive" is a relative term. Are jets expensive? Are cars? or bicycles... shoes...?
In this case, the question is: are events "expensive" relative to the time taken for JobFunction to execute? It would help to publish some absolute figures: How long does the process take when "unthreaded"? Is it months, or a few femtoseconds?
What happens to the time as you increase the threadpool size? Try a pool size of 1, then 2 then 4, etc.
Also, as you've had some issues with threadpools here in the past, I'd suggest some debug output to count the number of times that your threadfunction is actually invoked... does it match what you expect?
Picking a figure out of the air (without knowing anything about your target system, and assuming you're not doing anything 'huge' in code you haven't shown), I'd expect the "event overhead" of each "job" to be measured in microseconds. Maybe a hundred or so. If the time taken to perform the algorithm in JobFunction is not significantly MORE than this time, then your threads are likely to cost you time rather than save it.
As mentioned previously, the amount of overhead added by threading depends on the relative amount of time taken to do the "jobs" that you defined. So it is important to find a balance in the size of the work chunks that minimizes the number of pieces but does not leave processors idle waiting for the last group of computations to complete.
Your coding approach has increased the amount of overhead work by actively looking for an idle thread to supply with new work. The operating system is already keeping track of that and doing it a lot more efficiently. Also, your function ThreadPool::addJob() may find that all of the threads are in use and be unable to delegate the work. But it does not provide any return code related to that issue. If you are not checking for this condition in some way and are not noticing errors in the results, it means that there are idle processors always. I would suggest reorganizing the code so that addJob() does what it is named -- adds a job ONLY (without finding or even caring who does the job) while each worker thread actively gets new work when it is done with its existing work.