thread_specific_ptr multithread confusion - c++

// code snippet 1
static boost::thread_specific_ptr<StreamX> StreamThreadSpecificPtr;
void thread_proc() {
StreamX * stream = NULL;
stream = StreamThreadSpecificPtr.get();
if (NULL == stream) {
stream = new StreamX();
printf("%p\n", stream);
int run() {
boost::thread_group threads;
for(int i = 0; i < 5; i ++) {
// the result is
0x50d560 -- SAME POINTER
0x50d560 -- SAME POINTER
// code snippet 2
static boost::thread_specific_ptr<StreamX> StreamThreadSpecificPtr(NULL); // DIFF from code snippet 1
void thread_proc() {
StreamX * stream = NULL;
stream = StreamThreadSpecificPtr.get();
if (NULL == stream) {
stream = new StreamX();
printf("%p\n", stream);
int run() {
boost::thread_group threads;
for(int i = 0; i < 5; i ++) {
// the result is
In code snippet 1, the two pointers are the same, which is not what I expected.
In code snippet 2, where StreamThreadSpecificPtr is initialized with NULL, everything seems fine.
Could you please help me figure out why? Thanks a lot.

The joy is that your threads are actually terminating asynchronously, destructing the StreamX instances.
Using a detector:
struct StreamX {
    StreamX() { puts(__FUNCTION__); }
    ~StreamX() { puts(__FUNCTION__); }
};
I get the following output:
real 0m0.002s
user 0m0.000s
sys 0m0.004s
It makes sense for subsequent allocations to reuse the same heap addresses, since there isn't much fragmentation involved. In other words, you can't just compare pointers to see whether they alias the same object in a concurrent application.
The difference with the second example is only spurious. There are many factors that can - and will - influence the result. E.g. adding a tiny delay at the end of each thread will remove all opportunity for threads to terminate before other instances have been instantiated.
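To make the point about address reuse concrete, here is a minimal sketch. It is not the original program: it swaps boost::thread_specific_ptr for C++11 thread_local to stay short, and distinct_live_addresses is a made-up helper name. Every thread keeps its StreamX alive until all threads have recorded their address, so the allocator has no chance to recycle one.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

struct StreamX { int dummy = 0; };

// Hypothetical helper: starts n threads, each with its own thread-local
// StreamX, and returns how many distinct addresses were observed while
// all of the objects were alive at the same time.
std::size_t distinct_live_addresses(int n) {
    std::atomic<int> arrived{0};
    std::mutex m;
    std::set<const void*> seen;
    std::vector<std::thread> threads;
    for (int i = 0; i < n; ++i) {
        threads.emplace_back([&] {
            thread_local StreamX stream;        // one instance per thread
            {
                std::lock_guard<std::mutex> lock(m);
                seen.insert(&stream);
            }
            ++arrived;
            // Do not exit (and destroy `stream`) before every thread has
            // recorded its address.
            while (arrived.load() < n) std::this_thread::yield();
        });
    }
    for (auto& t : threads) t.join();
    return seen.size();
}
```

Calling distinct_live_addresses(5) should report 5 distinct addresses. Remove the wait loop and the count may drop, because a finished thread's storage can be handed to the next one.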
See it Live On Coliru


QNX pthread_mutex_lock causing deadlock error ( 45 = EDEADLK )

I am implementing an asynchronous log writing mechanism for my project's multithreaded application. Below is the partial code of the part where the error occurs.
void CTraceFileWriterThread::run()
bool fShoudIRun = shouldThreadsRun(); // Some global function which decided if operations need to stop. Not really relevant here. Assume "true" value.
std::string nextMessage = fetchNext();
if( !nextMessage.empty() )
fShoudIRun = shouldThreadsRun();
//This is the consumer. This is in my thread with lower priority
std::string CTraceFileWriterThread::fetchNext()
// When there are a lot of logs, I mean A LOT, I believe the
// control stays in this function for a long time and another
// thread calling the "add" function is not able to acquire the lock
// since it's held here.
std::string message;
if( !writeQueue.empty() )
writeQueueMutex.lock(); // Obj of our wrapper around pthread_mutex_lock
message = writeQueue.front();
writeQueue.pop(); // std::queue
writeQueueMutex.unLock() ;
return message;
// This is the producer and is called from multiple threads.
void CTraceFileWriterThread::add( std::string outputString ) {
if ( !outputString.empty() )
// crashes here while trying to acquire the lock when there are lots of
// logs in prod systems.
const size_t writeQueueSize = writeQueue.size();
if ( writeQueueSize == maximumWriteQueueCapacity )
outputString.append ("\n queue full, discarding traces, traces are incomplete" );
if ( writeQueueSize <= maximumWriteQueueCapacity )
bool wasEmpty = writeQueue.empty();
writeQueue.push(outputString);; // will be waiting in a function which calls "fetchNext"
int wrapperMutex::lock() {
//#[ operation lock()
int iRetval;
int iRetry = 10;
tRfcErrno = pthread_mutex_lock (&tMutex);
if ( (tRfcErrno == EINTR) || (tRfcErrno == EAGAIN) )
iRetval = RFC_ERROR;
else if (tRfcErrno != EOK)
iRetval = RFC_ERROR;
iRetry = 0;
iRetval = RFC_OK;
iRetry = 0;
} while (iRetry > 0);
return iRetval;
I generated the core dump and analysed it with GDB and here are some findings
Program terminated with signal 11, Segmentation fault.
"Errno=45" at the add function where I am trying to acquire the lock. The wrapper we have around pthread_mutex_lock tries to acquire the lock about 10 times before it gives up.
The code works fine when there are fewer logs. Also, we do not have C++11 or later and hence are restricted to the QNX mutex. Any help is appreciated, as I have been looking at this issue for over a month with little progress. Please ask if any more info is required.
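For reference, one well-defined way to get EDEADLK out of pthread_mutex_lock is to lock an error-checking mutex a second time from the thread that already owns it. The sketch below shows only that POSIX behaviour. It is not a diagnosis of the code above, relock_errorcheck_mutex is a made-up name, and the numeric value of EDEADLK differs between platforms (45 is the QNX value quoted in the question). Note also that pthread_mutex_lock reports errors through its return value, not through errno.

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>

// Hypothetical helper: locks an error-checking mutex twice from the same
// thread and returns the result of the second pthread_mutex_lock call.
int relock_errorcheck_mutex() {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);

    pthread_mutex_t mutex;
    pthread_mutex_init(&mutex, &attr);
    pthread_mutexattr_destroy(&attr);

    int first = pthread_mutex_lock(&mutex);   // succeeds
    int second = pthread_mutex_lock(&mutex);  // same thread already owns it
    (void)first;

    pthread_mutex_unlock(&mutex);
    pthread_mutex_destroy(&mutex);
    return second;
}
```

If the wrapper's lock() can be reached twice on one thread without an unLock() in between (for example through an early return), this is the error it would see.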

No need for mutex, race conditions are not always bad, are they?

I'm getting this crazy idea that mutex synchronization can be omitted in some cases when most of us would typically want and would use mutex synchronization.
Ok suppose you have this case:
Buffer *buffer = new Buffer(); // Initialized by main thread;
// The call to buffer's `accumulateSomeData` method is thread-safe
// and is heavily executed by many workers from different threads simultaneously.
buffer->accumulateSomeData(data); // While the code inside is equivalent to vector->push_back()
// All lines of code below are executed by a totally separate timer
// thread that executes once per second until the program is finished.
auto bufferPrev = buffer; // A temporary pointer to previous instance
// Switch buffers, put old one offline
buffer = new Buffer();
// As of this line of code all the threads will switch to new instance
// of buffer. Which yields that calls to `accumulateSomeData`
// are executed over new buffer instance. Which also means that old
// instance is kinda taken offline and can be safely operated from a
// timer thread.
bufferPrev->flushToDisk(); // Ok, so we can safely flush
delete bufferPrev;
It's obvious that during buffer = new Buffer(); there can still be uncompleted operations adding data to the previous instance. But since disk operations are slow, we get a natural kind of barrier.
So how do you estimate the risk of running such code without mutex synchronization?
It's so hard these days to ask a question in SO without getting mugged by couple of angry guys for no reason.
Here is my code, correct in every respect:
#include <cassert>
#include "leveldb/db.h"
#include "leveldb/filter_policy.h"
#include <iostream>
#include <boost/asio.hpp>
#include <boost/chrono.hpp>
#include <boost/thread.hpp>
#include <boost/filesystem.hpp>
#include <boost/lockfree/stack.hpp>
#include <boost/lockfree/queue.hpp>
#include <boost/uuid/uuid.hpp> // uuid class
#include <boost/uuid/uuid_io.hpp> // streaming operators etc.
#include <boost/uuid/uuid_generators.hpp> // generators
#include <CommonCrypto/CommonDigest.h>
using namespace std;
using namespace boost::filesystem;
using boost::mutex;
using boost::thread;
enum FileSystemItemType : char {
Unknown = 1,
File = 0,
Directory = 4,
FileLink = 2,
DirectoryLink = 6
// Structure packing optimizations are used in the code below
class FileSystemScanner {
leveldb::DB *database;
boost::asio::thread_pool pool;
leveldb::WriteBatch *batch;
std::atomic<int> queue_size;
std::atomic<int> workers_online;
std::atomic<int> entries_processed;
std::atomic<int> directories_processed;
std::atomic<uintmax_t> filesystem_usage;
boost::lockfree::stack<boost::filesystem::path*, boost::lockfree::fixed_sized<false>> directories_pending;
void work() {
boost::filesystem::path *item;
if (directories_pending.pop(item) && item != NULL)
try {
boost::filesystem::directory_iterator completed;
boost::filesystem::directory_iterator iterator(*item);
while (iterator != completed)
bool isFailed = false, isSymLink, isDirectory;
boost::filesystem::path path = iterator->path();
try {
isSymLink = boost::filesystem::is_symlink(path);
isDirectory = boost::filesystem::is_directory(path);
} catch (const boost::filesystem::filesystem_error& e) {
isFailed = true;
isSymLink = false;
isDirectory = false;
if (!isFailed)
if (!isSymLink) {
if (isDirectory) {
directories_pending.push(new boost::filesystem::path(path));
boost::asio::post(this->pool, [this]() { this->work(); });
} else {
filesystem_usage += boost::filesystem::file_size(iterator->path());
int result = ++entries_processed;
if (result % 10000 == 0) {
cout << entries_processed.load() << ", " << directories_processed.load() << ", " << queue_size.load() << ", " << workers_online.load() << endl;
delete item;
} catch (boost::filesystem::filesystem_error &e) {
FileSystemScanner(int threads, leveldb::DB* database):
pool(threads), queue_size(), workers_online(), entries_processed(), directories_processed(), directories_pending(0), database(database)
void scan(string path) {
directories_pending.push(new boost::filesystem::path(path));
boost::asio::post(this->pool, [this]() { this->work(); });
void join() {
int main(int argc, char* argv[])
leveldb::Options opts;
opts.create_if_missing = true;
opts.compression = leveldb::CompressionType::kSnappyCompression;
opts.filter_policy = leveldb::NewBloomFilterPolicy(10);
leveldb::DB* db;
leveldb::DB::Open(opts, "/temporary/projx", &db);
FileSystemScanner scanner(std::thread::hardware_concurrency(), db);
return 0;
My question is: Can I omit synchronization for batch which I'm not using yet? Since it's thread-safe and it should be enough to just switch buffers before actually committing any results to disk?
You have a serious misunderstanding. You think that when you have a race condition, there are some specific list of things that can happen. This is not true. A race condition can cause any kind of failure, including crashes. So absolutely, definitely not. You absolutely cannot do this.
That said, even with this misunderstanding, this is still a disaster.
buffer = new Buffer();
Suppose this is implemented by first allocating memory, then setting buffer to point to that memory, and then calling the constructor. Other threads may then operate on the unconstructed buffer. Boom.
Now, you can fix this. But it's just one of the many ways I can imagine this screwing up. And it can screw up in ways that we're not clever enough to imagine. So, for all that is holy, do not even think of doing this ever again.
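For comparison, here is a minimal sketch of the same buffer switch done with a mutex. The names SwappableBuffer, take and run_swap_demo are made up for illustration, and a std::vector<int> stands in for the question's Buffer. Workers append under the lock; the timer thread swaps the filled vector out under the same lock and can then flush it at leisure, since nobody else can reach it any more.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the question's Buffer plus its switch logic.
class SwappableBuffer {
public:
    void accumulate(int value) {
        std::lock_guard<std::mutex> lock(mutex_);
        current_.push_back(value);
    }
    // Hands the filled data to the caller and leaves an empty buffer behind.
    std::vector<int> take() {
        std::vector<int> old;
        std::lock_guard<std::mutex> lock(mutex_);
        old.swap(current_);
        return old;
    }
private:
    std::mutex mutex_;
    std::vector<int> current_;
};

// Runs `workers` producer threads and plays the timer role once while they
// are still running; returns the total number of items taken out.
std::size_t run_swap_demo(int workers, int per_worker) {
    SwappableBuffer buffer;
    std::vector<std::thread> threads;
    for (int w = 0; w < workers; ++w)
        threads.emplace_back([&buffer, per_worker] {
            for (int i = 0; i < per_worker; ++i) buffer.accumulate(i);
        });
    std::size_t flushed = buffer.take().size();   // mid-run switch
    for (auto& t : threads) t.join();
    flushed += buffer.take().size();              // final switch
    return flushed;
}
```

The lock is held only for a push_back or a swap, so the slow disk flush never blocks the workers, and no item is lost or written to a buffer that is being flushed.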

NET-SNMP and multithreading

I am writing a C++ SNMP server using the NET-SNMP library. I read the documentation and still have one question. Can multiple threads share a single SNMP session and use it in procedures like snmp_sess_synch_response() simultaneously, or must I init and open a new session in each thread?
Well, when I call snmp_sess_synch_response() from two different threads using the same opaque session pointer simultaneously, one of three errors always occurs. The first is a memory access violation, the second is an endless WaitForSingleObject() in both threads, and the third is a heap allocation error.
I suppose I can treat this as an answer: sharing a single session between multiple threads is unsafe, because using it in procedures like snmp_sess_synch_response() simultaneously causes errors.
P.S. Here is the code that produces the errors described above:
void* _opaqueSession;
boost::mutex _sessionMtx;
std::shared_ptr<netsnmp_pdu> ReadObjectValue(Oid& objectID)
netsnmp_pdu* requestPdu = snmp_pdu_create(SNMP_MSG_GET);
netsnmp_pdu* response = 0;
snmp_add_null_var(requestPdu, objectID.GetObjId(), objectID.GetLen());
void* opaqueSessionCopy;
//Locks the _opaqueSession, wherever it appears
boost::mutex::scoped_lock lock(_sessionMtx);
opaqueSessionCopy = _opaqueSession;
//Errors here!
snmp_sess_synch_response(opaqueSessionCopy, requestPdu, &response);
std::shared_ptr<netsnmp_pdu> result(response);
return result;
void ExecuteThread1()
Oid sysName(".");
void ExecuteThread2()
Oid sysServices(".");
int main()
std::string community = "public";
std::string ipAddress = "";
snmp_session session;
session.timeout = 500000;
session.retries = 0;
session.version = SNMP_VERSION_2c;
session.remote_port = 161;
session.peername = (char*)ipAddress.c_str();
session.community = (u_char*)community.c_str();
session.community_len = community.size();
_opaqueSession = snmp_sess_open(&session);
boost::thread thread1 = boost::thread(&ExecuteThread1);
boost::thread thread2 = boost::thread(&ExecuteThread2);
return 0;
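Note that in the code above the mutex protects only the copy of the pointer, not the call that uses the session. If the session really has to be shared, the lock needs to cover the whole request. Below is a minimal sketch of that idea. FakeSession, guarded_request and max_concurrent_callers are hypothetical names and none of this is NET-SNMP API; the fake session just counts how many callers are inside it at once.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a resource that tolerates one caller at a time.
struct FakeSession {
    std::atomic<int> callers{0};
    std::atomic<int> max_callers{0};

    void blocking_request() {
        int now = ++callers;
        int seen = max_callers.load();
        // Record the highest number of simultaneous callers observed.
        while (now > seen && !max_callers.compare_exchange_weak(seen, now)) {}
        std::this_thread::yield();   // widen the window for overlap
        --callers;
    }
};

std::mutex g_session_mutex;

void guarded_request(FakeSession& session) {
    // The lock covers the whole call, not just a pointer copy.
    std::lock_guard<std::mutex> lock(g_session_mutex);
    session.blocking_request();
}

int max_concurrent_callers(int threads, int calls_per_thread) {
    FakeSession session;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&session, calls_per_thread] {
            for (int i = 0; i < calls_per_thread; ++i) guarded_request(session);
        });
    for (auto& th : pool) th.join();
    return session.max_callers.load();
}
```

The price is that requests are serialized, so a slow or timed-out request delays every other thread. Opening one session per thread avoids that.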

Does the use of an anonymous pipe introduce a memory barrier for interthread communication?

For example, say I allocate a struct with new and write the pointer into the write end of an anonymous pipe.
If I read the pointer from the corresponding read end, am I guaranteed to see the 'correct' contents on the struct?
Also of interest is whether the results of socketpair() on unix & self connecting over tcp loopback on windows have the same guarantees.
The context is a server design which centralizes event dispatch with select/epoll
For example, say I allocate a struct with new and write the pointer into the write end of an anonymous pipe.
If I read the pointer from the corresponding read end, am I guaranteed to see the 'correct' contents on the struct?
No. There is no guarantee that the writing CPU will have flushed the write out of its cache and made it visible to the other CPU that might do the read.
Also of interest is whether the results of socketpair() on unix & self connecting over tcp loopback on windows have the same guarantees.
In practice, calling write(), which is a system call, will end up locking one or more data structures in the kernel, which should take care of the reordering issue. For example, POSIX requires subsequent reads to see data written before their call, which implies a lock (or some kind of acquire/release) by itself.
As for whether that's part of the formal spec of the calls, probably it's not.
A pointer is just a memory address, so provided you are in the same process, the pointer will be valid in the receiving thread and will point to the same struct. If you are in different processes, at best you will immediately get a memory error; at worst you will read (or write) random memory, which is essentially Undefined Behaviour.
Will you read the correct content? Neither better nor worse than if your pointer was in a static variable shared by both threads: you still have to do some synchronization if you want consistency.
Will the kind of channel used to transfer the address matter (static memory shared by threads, anonymous pipes, socket pairs, TCP loopback, etc.)? No: all those channels transfer bytes, so if you pass a memory address, you will get your memory address. What is left to you then is synchronization, because here you are just sharing a memory address.
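The "bytes in, bytes out" part can be sketched in a few lines. This example only shows that the pointer value survives the trip through the pipe; it runs in a single thread and says nothing about memory ordering between threads. Message and round_trip_through_pipe are made-up names.

```cpp
#include <cassert>
#include <unistd.h>

struct Message { int value; };

// Hypothetical helper: writes the pointer itself (not the struct) into an
// anonymous pipe and reads it back. Returns the pointer that came out of
// the read end, or nullptr if any system call failed.
Message* round_trip_through_pipe(Message* in) {
    int fds[2];
    if (pipe(fds) != 0) return nullptr;

    Message* out = nullptr;
    bool ok =
        write(fds[1], &in, sizeof in) == static_cast<ssize_t>(sizeof in) &&
        read(fds[0], &out, sizeof out) == static_cast<ssize_t>(sizeof out);

    close(fds[0]);
    close(fds[1]);
    return ok ? out : nullptr;
}
```

The pipe carries sizeof(Message*) bytes and nothing else; the struct itself stays where it was allocated.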
If you do not use any other synchronization, anything can happen (did I already mention Undefined Behaviour?):
the reading thread can access the memory before the writing thread has written it, getting stale data
the compiler or CPU can keep values cached, so the reading thread can keep using old values, again getting stale data (declaring the struct members volatile is not enough for this in C++; use atomics or a lock)
the reading thread can read partially written data, meaning incoherent data
Interesting question with, so far, only one correct answer from Cornstalks.
Within the same (multi-threaded) process there are no guarantees since pointer and data follow different paths to reach their destination.
Implicit acquire/release guarantees do not apply since the struct data cannot piggyback on the pointer through the cache and formally you are dealing with a data race.
However, looking at how the pointer and the struct data itself reach the second thread (through the pipe and memory cache respectively), there is a real chance that this mechanism is not going to cause any harm.
Sending the pointer to a peer thread takes 3 system calls (write() in the sending thread, select() and read() in the receiving thread), which is (relatively) expensive, and by the time the pointer value is available in the receiving thread, the struct data has probably arrived long before.
Note that this is just an observation, the mechanism is still incorrect.
I believe your case can be reduced to this two-thread model:
int data = 0;
std::atomic<int*> atomicPtr{nullptr};
void thread1()
{
    data = 42;
    atomicPtr.store(&data, std::memory_order_release);
}
void thread2()
{
    int* ptr = nullptr;
    // spin until thread1 has published the pointer
    while (!(ptr = atomicPtr.load(std::memory_order_consume)))
        ;
    assert(*ptr == 42);
}
Since you have 2 processes, you can't use one atomic variable across them, but since you listed Windows, you can omit atomicPtr.load(std::memory_order_consume) from the consuming part because, AFAIK, all the architectures Windows runs on guarantee this load to be correct without any barrier on the loading side. In fact, I think there are not many architectures out there where that instruction would not be a no-op (I have heard only of DEC Alpha).
I agree with Serge Ballesta's answer. Within the same process, it's feasible to send and receive object address via anonymous pipe.
Since the write system call is guaranteed to be atomic when the message size is at most PIPE_BUF (normally 4096 bytes), multiple producer threads will not corrupt each other's object addresses (8 bytes for 64-bit applications).
Talk is cheap, so here is the demo code for Linux (defensive code and error handlers are omitted for simplicity). Just copy & paste it, then compile & run the test.
#include <unistd.h>
#include <string.h>
#include <pthread.h>
#include <string>
#include <list>
template<class T> class MPSCQ { // pipe based Multi Producer Single Consumer Queue
int producerPush(const T* t);
T* consumerPoll(double timeout = 1.0);
void _consumeFd();
int _selectFdConsumer(double timeout);
T* _popFront();
int _fdProducer;
int _fdConsumer;
char* _consumerBuf;
std::string* _partial;
std::list<T*>* _list;
static const int _PTR_SIZE;
static const int _CONSUMER_BUF_SIZE;
template<class T> const int MPSCQ<T>::_PTR_SIZE = sizeof(void*);
template<class T> const int MPSCQ<T>::_CONSUMER_BUF_SIZE = 1024;
template<class T> MPSCQ<T>::MPSCQ() :
_fdConsumer(-1) {
_consumerBuf = new char[_CONSUMER_BUF_SIZE];
_partial = new std::string; // for holding partial pointer address
_list = new std::list<T*>; // unconsumed T* cache
int fd_[2];
int r = pipe(fd_);
_fdConsumer = fd_[0];
_fdProducer = fd_[1];
template<class T> MPSCQ<T>::~MPSCQ() { /* omitted */ }
template<class T> int MPSCQ<T>::producerPush(const T* t) {
return t == NULL ? 0 : write(_fdProducer, &t, _PTR_SIZE);
template<class T> T* MPSCQ<T>::consumerPoll(double timeout) {
T* t = _popFront();
if (t != NULL) {
return t;
if (_selectFdConsumer(timeout) <= 0) { // timeout or error
return NULL;
return _popFront();
template<class T> void MPSCQ<T>::_consumeFd() {
memcpy(_consumerBuf, _partial->data(), _partial->length());
ssize_t r = read(_fdConsumer, _consumerBuf, _CONSUMER_BUF_SIZE - _partial->length());
if (r <= 0) { // EOF or error, error handler omitted
const char* p = _consumerBuf;
int remaining_len_ = _partial->length() + r;
T* t;
while (remaining_len_ >= _PTR_SIZE) {
memcpy(&t, p, _PTR_SIZE);
remaining_len_ -= _PTR_SIZE;
p += _PTR_SIZE;
*_partial = std::string(p, remaining_len_);
template<class T> int MPSCQ<T>::_selectFdConsumer(double timeout) {
int r;
int nfds_ = _fdConsumer + 1;
fd_set readfds_;
struct timeval timeout_;
int64_t usec_ = timeout * 1000000.0;
while (true) {
timeout_.tv_sec = usec_ / 1000000;
timeout_.tv_usec = usec_ % 1000000;
FD_SET(_fdConsumer, &readfds_);
r = select(nfds_, &readfds_, NULL, NULL, &timeout_);
if (r < 0 && errno == EINTR) {
return r;
template<class T> T* MPSCQ<T>::_popFront() {
if (!_list->empty()) {
T* t = _list->front();
return t;
} else {
return NULL;
// = = = = = test code below = = = = =
#define _LOOP_CNT 5000000
#define _ONE_MILLION 1000000
struct TestMsg { // all public
int _threadId;
int _msgId;
int64_t _val;
TestMsg(int thread_id, int msg_id, int64_t val) :
_val(val) { };
static MPSCQ<TestMsg> _QUEUE;
static int64_t _SUM = 0;
void* functor_producer(void* arg) {
int my_thr_id_ = pthread_self();
TestMsg* msg_;
for (int i = 0; i <= _LOOP_CNT; ++ i) {
if (i == _LOOP_CNT) {
msg_ = new TestMsg(my_thr_id_, i, -1);
} else {
msg_ = new TestMsg(my_thr_id_, i, i + 1);
return NULL;
void* functor_consumer(void* arg) {
int msg_cnt_ = 0;
int stop_cnt_ = 0;
TestMsg* msg_;
while (true) {
if ((msg_ = _QUEUE.consumerPoll()) == NULL) {
int64_t val_ = msg_->_val;
delete msg_;
if (val_ <= 0) {
if ((++ stop_cnt_) >= _PRODUCER_THREAD_NUM) {
printf("All done, _SUM=%ld\n", _SUM);
} else {
_SUM += val_;
if ((++ msg_cnt_) % _ONE_MILLION == 0) {
printf("msg_cnt_=%d, _SUM=%ld\n", msg_cnt_, _SUM);
return NULL;
int main(int argc, char* const* argv) {
pthread_t consumer_;
pthread_create(&consumer_, NULL, functor_consumer, NULL);
pthread_t producers_[_PRODUCER_THREAD_NUM];
for (int i = 0; i < _PRODUCER_THREAD_NUM; ++ i) {
pthread_create(&producers_[i], NULL, functor_producer, NULL);
for (int i = 0; i < _PRODUCER_THREAD_NUM; ++ i) {
pthread_join(producers_[i], NULL);
pthread_join(consumer_, NULL);
return 0;
And here is test result ( 2 * sum(1..5000000) == (1 + 5000000) * 5000000 == 25000005000000 ):
$ g++ -o pipe_ipc_demo -lpthread
$ ./pipe_ipc_demo ## output may vary except for the final _SUM
msg_cnt_=1000000, _SUM=251244261289
msg_cnt_=2000000, _SUM=1000708879236
msg_cnt_=3000000, _SUM=2250159002500
msg_cnt_=4000000, _SUM=4000785160225
msg_cnt_=5000000, _SUM=6251640644676
msg_cnt_=6000000, _SUM=9003167062500
msg_cnt_=7000000, _SUM=12252615629881
msg_cnt_=8000000, _SUM=16002380952516
msg_cnt_=9000000, _SUM=20252025092401
msg_cnt_=10000000, _SUM=25000005000000
All done, _SUM=25000005000000
The technique shown here is used in our production applications. One typical usage is a consumer thread acting as a log writer, so that worker threads can write log messages almost asynchronously. "Almost" because writer threads may sometimes block in write() when the pipe is full, which is a reliable congestion control feature provided by the OS.

C++: Thread pool slower than single threading?

First of all I did look at the other topics on this website and found they don't relate to my problem as those mostly deal with people using I/O operations or thread creation overheads. My problem is that my threadpool or worker-task structure implementation is (in this case) a lot slower than single threading. I'm really confused by this and not sure if it's the ThreadPool, the task itself, how I test it, the nature of threads or something out of my control.
// Sorry for the long code
#include <vector>
#include <queue>
#include <thread>
#include <mutex>
#include <future>
#include "task.hpp"
class ThreadPool
for (unsigned i = 0; i < std::thread::hardware_concurrency() - 1; i++)
m_workers.emplace_back(this, i);
m_running = true;
for (auto&& worker : m_workers)
m_running = false;
for (auto&& worker : m_workers)
void add_task(Task* task)
std::unique_lock<std::mutex> lock(m_in_mutex);
class Worker
Worker(ThreadPool* parent, unsigned id) : m_parent(parent), m_id(id)
void start()
m_thread = new std::thread(&Worker::work, this);
void terminate()
if (m_thread)
if (m_thread->joinable())
delete m_thread;
m_thread = nullptr;
m_parent = nullptr;
void work()
while (m_parent->m_running)
std::unique_lock<std::mutex> lock(m_parent->m_in_mutex);
m_parent->m_task_signal.wait(lock, [&]()
return !m_parent->m_in.empty() || !m_parent->m_running;
if (!m_parent->m_running) break;
Task* task = m_parent->m_in.front();
// Fixed the mutex being locked while the task is executed
ThreadPool* m_parent = nullptr;
unsigned m_id = 0;
std::thread* m_thread = nullptr;
std::vector<Worker> m_workers;
std::mutex m_in_mutex;
std::condition_variable m_task_signal;
std::queue<Task*> m_in;
bool m_running = false;
class TestTask : public Task
TestTask() {}
TestTask(unsigned number) : m_number(number) {}
inline void Set(unsigned number) { m_number = number; }
void execute() override
if (m_number <= 3)
m_is_prime = m_number > 1;
else if (m_number % 2 == 0 || m_number % 3 == 0)
m_is_prime = false;
for (unsigned i = 5; i * i <= m_number; i += 6)
if (m_number % i == 0 || m_number % (i + 2) == 0)
m_is_prime = false;
m_is_prime = true;
unsigned m_number = 0;
bool m_is_prime = false;
int main()
ThreadPool pool;
unsigned num_tasks = 1000000;
std::vector<TestTask> tasks(num_tasks);
for (auto&& task : tasks)
task.Set(randint(0, 1000000000));
auto s = std::chrono::high_resolution_clock::now();
#if MT
for (auto&& task : tasks)
for (auto&& task : tasks)
auto e = std::chrono::high_resolution_clock::now();
double seconds = std::chrono::duration_cast<std::chrono::nanoseconds>(e - s).count() / 1000000000.0;
Benchmarks with VS2013 Profiler:
10,000,000 tasks:
MT:
13 seconds of wall clock time
93.36% is spent in msvcp120.dll
3.45% is spent in Task::execute() // Not good here
ST:
0.5 seconds of wall clock time
97.31% is spent with Task::execute()
Usual disclaimer in such answers: the only way to tell for sure is to measure it with a profiler tool.
But I will try to explain your results without it. First of all, you have one mutex across all your threads, so only one thread at a time can execute a task. That kills any gains you might have had: in spite of your threads, your code is perfectly serial. So at the very least move task execution out of the mutex. You need to lock the mutex only to get a task out of the queue; you don't need to hold it while the task executes.
Next, your tasks are so simple that single thread will execute them in no time. You just can't measure any gains with such tasks. Create some heavy tasks which could produce some more interesting results(some tasks which are closer to the real world, not such contrived).
And the 3rd point: threads are not without their cost — context switching, mutex contention etc. To have real gains, as the previous 2 points say, you need to have tasks which take more time than the overheads threads introduce and the code should be truly parallel instead of waiting on some resource making it serial.
UPD: I looked at the wrong part of the code. The task is complex enough provided you create tasks with sufficiently large numbers.
UPD2: I've played with your code and found a good prime number to show how the MT code is better. Use the following prime number: 1019048297. It will give enough computation complexity to show the difference.
But why doesn't your code produce good results? It is hard to tell without seeing the implementation of randint(), but I take it that it is pretty simple: in half of the cases it returns even numbers, and the other cases don't produce many big prime numbers either. So the tasks are so simple that context switching and other overhead around your particular implementation, and threads in general, consume more time than the computation itself. Using the prime number I gave you leaves the tasks no choice but to spend time computing: there is no easy answer, since the number is big and actually prime. That's why the big number will give you the answer you seek, a better time for the MT code.
You should not hold the mutex while the task is getting executed, otherwise other threads will not be able to get a task:
void work() {
    while (m_parent->m_running) {
        Task* currentTask = nullptr;
        std::unique_lock<std::mutex> lock(m_parent->m_in_mutex);
        m_parent->m_task_signal.wait(lock, [&]() {
            return !m_parent->m_in.empty() || !m_parent->m_running;
        });
        if (!m_parent->m_running) continue;
        currentTask = m_parent->m_in.front();
        m_parent->m_in.pop();
        lock.unlock(); //<- Release the lock so that other threads can get tasks
        currentTask->execute();
        currentTask = nullptr;
    }
}
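Here is a self-contained sketch of the same idea that can be compiled on its own. It is a simplified stand-in, not the asker's ThreadPool: MiniPool and count_primes_with_pool are made-up names, tasks are std::function objects instead of Task pointers, and is_prime follows the primality test from the question. The point is the scope of the lock in work(): it is held only while a task is taken off the queue.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

bool is_prime(unsigned n) {
    if (n <= 3) return n > 1;
    if (n % 2 == 0 || n % 3 == 0) return false;
    for (unsigned i = 5; i * i <= n; i += 6)
        if (n % i == 0 || n % (i + 2) == 0) return false;
    return true;
}

class MiniPool {
public:
    explicit MiniPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] { work(); });
    }
    // Lets the workers drain the queue, then joins them.
    ~MiniPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        signal_.notify_all();
        for (auto& w : workers_) w.join();
    }
    void add_task(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        signal_.notify_one();
    }
private:
    void work() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                signal_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;    // stopping and nothing left
                task = std::move(tasks_.front());
                tasks_.pop();
            }                                  // lock released here
            task();                            // runs without the mutex
        }
    }
    std::mutex mutex_;
    std::condition_variable signal_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};

int count_primes_with_pool(const std::vector<unsigned>& numbers, unsigned threads) {
    std::atomic<int> primes{0};
    {
        MiniPool pool(threads);
        for (unsigned n : numbers)
            pool.add_task([&primes, n] { if (is_prime(n)) ++primes; });
    }   // the destructor waits for all queued tasks
    return primes.load();
}
```

This only fixes the serialization problem. Whether it beats the single-threaded loop still depends on how much work each task does compared to the cost of a queue operation.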
For MT, how much time is spent in each phase of the "overhead": std::unique_lock, m_task_signal.wait, front, pop, unlock?
Based on your results of only 3% useful work, this means the above consumes 97%. I'd get numbers for each part of the above (e.g. add timestamps between each call).
It seems to me, that the code you use to [merely] dequeue the next task pointer is quite heavy. I'd do a much simpler queue [possibly lockless] mechanism. Or, perhaps, use atomics to bump an index into the queue instead of the five step process above. For example:
while (m_parent->m_running) {
    // NOTE: this is just an example, not necessarily the real function
    int curindex = atomic_increment(&global_index);
    if (curindex >= max_index)
        break;
    Task *task = m_parent->m_in[curindex];
    task->execute();
}
Also, maybe you should pop [say] ten at a time instead of just one.
You might also be memory bound and/or "task switch" bound. (e.g.) For threads that access an array, more than four threads usually saturates the memory bus. You could also have heavy contention for the lock, such that the threads get starved because one thread is monopolizing the lock [indirectly, even with the new unlock call]
Interthread locking usually involves a "serialization" operation where other cores must synchronize their out-of-order execution pipelines.
Here's a "lockless" implementation:
// assume m_id is 0,1,2,...
int curindex = m_id;
while (m_parent->m_running) {
    if (curindex >= max_index)
        break;
    Task *task = m_parent->m_in[curindex];
    task->execute();
    curindex += NUMBER_OF_WORKERS;
}
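The atomic-index variant from earlier can be written with std::atomic as follows. This is a sketch under the assumption that all tasks are known up front and stored in an indexable container (a std::vector here, since std::queue has no operator[]); process_with_atomic_index is a made-up name and doubling an int stands in for task->execute().

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Each worker claims the next unprocessed index with fetch_add, so no two
// workers ever touch the same element. Returns how many items were processed.
std::size_t process_with_atomic_index(std::vector<int>& items, unsigned workers) {
    std::atomic<std::size_t> next{0};
    std::atomic<std::size_t> processed{0};
    std::vector<std::thread> threads;
    for (unsigned w = 0; w < workers; ++w)
        threads.emplace_back([&] {
            for (;;) {
                std::size_t i = next.fetch_add(1);
                if (i >= items.size()) break;
                items[i] *= 2;      // stand-in for task->execute()
                ++processed;
            }
        });
    for (auto& t : threads) t.join();
    return processed.load();
}
```

Compared to the strided version, this balances the load when some tasks take much longer than others, at the cost of one atomic operation per task.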