C++11 Dynamic Threadpool


Recently, I've been trying to find a library for threading concurrent tasks. Ideally, a simple interface that calls a function on a thread. There are n threads at any time; some complete faster than others, and they arrive at different times.
First I tried Rx, which is great in C++. I've also looked into Blocks and TBB, but they are platform dependent. For my prototype, I need to remain platform independent, as we don't know what it will be running on yet, and that can change when decisions are made.
C++11 has a number of facilities for threading and concurrency, and I found a number of examples like this one for thread pools.
https://github.com/bilash/threadpool
Similar projects use the same lambda expressions with std::thread and std::mutex.
This looks perfect for what I need, but there are some issues. The pools are started with a fixed number of threads, and tasks are queued until a thread is free.
How can I add new threads?
How can I remove expired threads? (With join()?)
Obviously, this is much easier for a known number of threads, as they can be initialised in the ctor and then joined in the dtor.
Any tips or pointers here from someone with experience with C++ concurrency?

Start with the number of hardware threads the system supports (hardware_concurrency() may return 0 if this cannot be determined):
int Num_Threads = thread::hardware_concurrency();
For an efficient thread pool implementation, once the threads have been created according to Num_Threads, it's better not to create new ones or destroy old ones (by joining). There is a performance penalty, which might even make your application slower than the serial version.
Each C++11 thread should run a function with an infinite loop, constantly waiting for new tasks to grab and run.
Here is how to attach such function to the thread pool:
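int Num_Threads = thread::hardware_concurrency();
vector<thread> Pool;
for(int ii = 0; ii < Num_Threads; ii++)
{
    // Infinite_loop_function is a member of The_Pool, so pass `this` as well
    // (assuming this loop runs inside a The_Pool member, e.g. its constructor)
    Pool.push_back(thread(&The_Pool::Infinite_loop_function, this));
}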
The Infinite_loop_function
This is a "while(true)" loop waiting for the task queue
void The_Pool::Infinite_loop_function()
{
    while(true)
    {
        function<void()> Job;
        {
            unique_lock<mutex> lock(Queue_Mutex);
            // capture `this`: Queue is a member of The_Pool
            condition.wait(lock, [this]{ return !Queue.empty(); });
            Job = Queue.front();
            Queue.pop();
        }
        Job(); // run the job outside the lock
    }
}
Make a function to add a job to your Queue:
void The_Pool::Add_Job(function<void()> New_Job)
{
    {
        unique_lock<mutex> lock(Queue_Mutex);
        Queue.push(New_Job);
    }
    condition.notify_one();
}
Bind an arbitrary function to your Queue
Pool_Obj.Add_Job(std::bind(&Some_Class::Some_Method, &Some_object));
Once you integrate these ingredients, you have your own dynamic threading pool. These threads always run, waiting for jobs to do.
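The fragments above leave out the member declarations, a way for the loop to end, and the joins, so as written the pool's threads can never be joined. Below is a minimal self-contained sketch of the same design, assuming a `stop_` flag checked in the wait predicate (an addition, not part of the answer) so that the destructor can shut the workers down; `ThreadPool`, `addJob` and the member names are illustrative.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Illustrative fixed-size pool; stop_ is an assumed addition so the
// destructor can end the otherwise infinite worker loops.
class ThreadPool {
public:
    explicit ThreadPool(unsigned n = std::thread::hardware_concurrency()) {
        if (n == 0) n = 1;  // hardware_concurrency() may return 0 if unknown
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back(&ThreadPool::workerLoop, this);
    }

    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_all();
        for (std::thread& t : workers_) t.join();
    }

    void addJob(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(!stop_ && "addJob() called during shutdown");
            jobs_.push(std::move(job));
        }
        cv_.notify_one();
    }

private:
    void workerLoop() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stop_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stop_ is set and nothing is left
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // run outside the lock so other workers can dequeue
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    bool stop_ = false;
    std::vector<std::thread> workers_;
};
```

Usage: `ThreadPool pool; pool.addJob([] { /* work */ });`. Jobs still queued when the pool is destroyed are run before the workers exit; to drop them instead, the early-return test would become `if (stop_) return;`.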

This should be simple to use: https://pocoproject.org/docs/Poco.ThreadPool.html
A thread pool always keeps a number of threads running, ready to
accept work. Creating and starting threads can impose a significant
runtime overhead to an application. A thread pool helps to improve the
performance of an application by reducing the number of threads that
have to be created (and destroyed again). Threads in a thread pool are
re-used once they become available again. The thread pool always keeps
a minimum number of threads running. If the demand for threads
increases, additional threads are created. Once the demand for threads
sinks again, no-longer used threads are stopped and removed from the
pool.
ThreadPool(
int minCapacity = 2,
int maxCapacity = 16,
int idleTime = 60,
int stackSize = 0
);
This is a very nice library, and easy to use, unlike Boost :(
https://github.com/pocoproject/poco
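For comparison, here is a sketch of the same min/max/idle-time policy using only the standard library. It is not Poco's implementation, and `DynamicPool`, `submit` and `threadCount` are hypothetical names. A new thread is started only when a job arrives while no worker is idle and the cap has not been reached; an idle worker above the minimum exits after `idleTime`. Because a thread cannot join itself, an exiting worker moves its own `std::thread` handle into a `retired_` list, which `submit()` and the destructor join.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <iterator>
#include <list>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical min/max/idle-time pool (standard library only, not Poco).
class DynamicPool {
public:
    DynamicPool(std::size_t minThreads, std::size_t maxThreads,
                std::chrono::milliseconds idleTime)
        : min_(minThreads), max_(maxThreads), idle_(idleTime) {
        assert(maxThreads >= 1 && minThreads <= maxThreads);
        std::lock_guard<std::mutex> lock(m_);
        for (std::size_t i = 0; i < min_; ++i) spawnLocked();
    }

    ~DynamicPool() {
        std::unique_lock<std::mutex> lock(m_);
        stop_ = true;
        workCv_.notify_all();
        // Workers drain the queue, then retire themselves.
        doneCv_.wait(lock, [this] { return workers_.empty(); });
        std::vector<std::thread> retired = std::move(retired_);
        lock.unlock();
        for (std::thread& t : retired) t.join();
    }

    void submit(std::function<void()> job) {
        std::vector<std::thread> retired;
        {
            std::lock_guard<std::mutex> lock(m_);
            jobs_.push_back(std::move(job));
            // Grow only when no worker is idle and we are below the cap.
            if (idleCount_ == 0 && workers_.size() < max_) spawnLocked();
            else workCv_.notify_one();
            retired.swap(retired_);
        }
        for (std::thread& t : retired) t.join();  // reap expired workers
    }

    std::size_t threadCount() {
        std::lock_guard<std::mutex> lock(m_);
        return workers_.size();
    }

private:
    using Slot = std::list<std::thread>::iterator;

    // Caller holds m_, so the new thread cannot get past the lock at the top
    // of workerLoop before its handle has been stored in *self.
    void spawnLocked() {
        workers_.emplace_back();
        Slot self = std::prev(workers_.end());
        *self = std::thread([this, self] { workerLoop(self); });
    }

    void workerLoop(Slot self) {
        std::unique_lock<std::mutex> lock(m_);
        for (;;) {
            ++idleCount_;
            bool woke = workCv_.wait_for(lock, idle_,
                [this] { return stop_ || !jobs_.empty(); });
            --idleCount_;
            if (!jobs_.empty()) {
                std::function<void()> job = std::move(jobs_.front());
                jobs_.pop_front();
                lock.unlock();
                job();
                lock.lock();
            } else if (stop_ || (!woke && workers_.size() > min_)) {
                break;  // shutting down, or idle too long above the minimum
            }
        }
        retired_.push_back(std::move(*self));  // a thread cannot join itself
        workers_.erase(self);
        doneCv_.notify_all();
    }

    std::mutex m_;
    std::condition_variable workCv_, doneCv_;
    std::deque<std::function<void()>> jobs_;
    std::list<std::thread> workers_;
    std::vector<std::thread> retired_;
    std::size_t min_, max_, idleCount_ = 0;
    std::chrono::milliseconds idle_;
    bool stop_ = false;
};
```

Each thread is still created and joined at full cost; the policy only avoids paying that cost for every job, which is the trade-off the Poco description above makes.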

Related

How many std::future objects can exist in a system simultaneously?

I wanted to hash a stream of input messages using multiple threads, so I was trying to implement std::vector<std::future<HashData>> futures; but I am not sure how many future objects can exist in a system at the same time.
std::vector<std::future<HashData>> futures;
std::vector<std::string> messages;
for (int i = 0; i < messages.size(); i++)
{
std::promise<HashData> promiseHashData;
std::future<HashData> futureHashData = promiseHashData.get_future();
futures.emplace_back(std::move(futureHashData));
std::async(std::launch::async, [&]() {PerformHash(std::move(promiseHashData), messages[i]);});
}
std::vector<HashData> vectorOfHashData;
// wait for all async tasks to complete
for (auto& futureObj : futures)
{
vectorOfHashData.push_back(futureObj.get());
}
Is there any limit on the creation of future objects in a system (similar to how a system may reach thread saturation if existing threads are not destroyed while new ones keep being created)? I will be calling the PerformHash() method asynchronously for a large number of messages.
I have been exploring concurrency in C++ recently and wanted to improve the performance of the hashing task. This idea came to mind, but I am not sure whether it will work. I wanted to know if I am missing something here.

The problem isn't going to be "how many futures can a vector hold"; futures (on most systems) are just a shared pointer to a block of memory with some cheap concurrency primitives in it.
The problem is you are creating a thread per future then blocking forward progress until the thread is finished. If you fix that problem, then your code is using dangling references.
std::vector<std::future<HashData>> futures;
std::vector<std::string> messages;
for (int i = 0; i < messages.size(); i++)
{
std::promise<HashData> promiseHashData;
std::future<HashData> futureHashData = promiseHashData.get_future();
futures.emplace_back(std::move(futureHashData));
// this captures a promiseHashData by reference
// It also creates a thread, then blocks until the
// thread finishes.
std::async(std::launch::async, [&]() {PerformHash(std::move(promiseHashData), messages[i]);});
}
So a few points:
Unless the hash data is worth consuming in small pieces, a future<vector<HashData>> is going to be more efficient.
If you want a vector<future>, you'll also want a vector<promise>. Then create a bounded number of threads (or get them from a pool you write) and fulfill those promises.
Creating an unbounded number of futures, then creating an unbounded number of threads to service those futures, is a bad plan.
Finally, std::async is funny in that it returns a std::future itself. When that future is destroyed, it blocks on the completion of the thread it creates. This is atypical behavior, but it prevents losing track of a thread of execution.
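A sketch of the vector<promise> approach with a bounded number of threads. `HashData` and `PerformHash` are placeholders for the question's types (here a hash is just `std::hash<std::string>`'s result, and `PerformHash` takes only the message), and `HashAll` is a hypothetical helper. Each worker fulfils every n-th promise, so no thread is created per message and no `std::async` future is discarded.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <string>
#include <thread>
#include <vector>

// Placeholders for the question's types: a "hash" is std::hash's result.
using HashData = std::size_t;

HashData PerformHash(const std::string& message) {
    return std::hash<std::string>{}(message);
}

// Hypothetical helper: numThreads workers fulfil one promise per message.
std::vector<HashData> HashAll(const std::vector<std::string>& messages,
                              unsigned numThreads) {
    assert(numThreads > 0);
    std::vector<std::promise<HashData>> promises(messages.size());
    std::vector<std::future<HashData>> futures;
    futures.reserve(messages.size());
    for (std::promise<HashData>& p : promises) futures.push_back(p.get_future());

    // Worker t handles messages t, t + numThreads, t + 2 * numThreads, ...
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t) {
        workers.emplace_back([&messages, &promises, t, numThreads] {
            for (std::size_t i = t; i < messages.size(); i += numThreads)
                promises[i].set_value(PerformHash(messages[i]));
        });
    }

    // Each get() blocks only until that particular message has been hashed.
    std::vector<HashData> results;
    results.reserve(messages.size());
    for (std::future<HashData>& f : futures) results.push_back(f.get());
    for (std::thread& w : workers) w.join();
    return results;
}
```

The loop over `futures` could consume each hash as soon as it is ready instead of copying it into `results`; if the results are only ever used all together, returning the vector directly, as the first point suggests, makes the per-message promises unnecessary. Since `hardware_concurrency()` may return 0, pass something like `std::max(1u, std::thread::hardware_concurrency())` as the thread count.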

C++ condition variables vs new threads for vectorization

I have a block of code that goes through a loop. A section of the code operates on a vector of data, and I would like to vectorize this operation. The idea is to split the processing of the array across multiple threads that each work on a subsection of the array. I have to decide between two possibilities. The first one is to create the threads each time this section is encountered and rejoin them with the main thread at the end:
for(....)
{
//serial stuff
//create threads
for(i = 0; i < num_threads; ++i)
{
threads_vect.push_back(std::thread(f, sub_array[i]));
}
//join them
for(auto& t : threads_vect)
{
t.join();
}
//serial stuff
}
This is similar to what is done with OpenMP, but since the problem is simple, I'd like to use std::threads instead of OpenMP (unless there are good reasons against this).
The second option is to create the threads beforehand to avoid the overhead of creation and destruction, and use condition variables for synchronization (I omitted a lot of stuff for the synchronization. It is just the general idea):
std::condition_variable cv_threads;
std::condition_variable cv_main;
//create threads, they will go to sleep on cv_threads
for(....)
{
//serial stuff
//wake up threads
cv_threads.notify_all();
//sleep until the last thread finishes, that will notify.
main_thread_lock.lock();
cv_main.wait(main_lock);
//serial stuff
}
To allow for parallelism, the threads will have to unlock the thread_lock as soon as they wake up at the beginning of the computation, then acquire it again to go to sleep, and synchronize among themselves so that the last one notifies the main thread.
My question is which of these solutions is usually preferred in a context like this, and whether the avoided overhead of thread creation and destruction is usually worth the added complexity (or worth it at all, given that the added synchronization also takes time).
Obviously this also depends on how long the computation takes for each thread, but this could vary a lot, since the data vector could also be very short (down to about two elements per thread, which would lead to a computation time of about 15 milliseconds).

The biggest disadvantage of creating new threads is that thread creation and shutdown is usually quite expensive. Just think of all the things an operating system has to do to get a thread off the ground, compared to what it takes to notify a condition variable.
Note that synchronization is always required, also with thread creation. C++11 std::thread, for instance, introduces a synchronizes-with relationship with the creating thread: completion of the constructor synchronizes with the start of the new thread's function. So you can safely assume that thread creation will always be significantly more expensive than condition variable signalling, regardless of your implementation.
A framework like OpenMP will typically attempt to amortize these costs in some way. For instance, an OpenMP implementation is not required to shut down the worker threads after every loop and many implementations will not do this.
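A sketch of the second option from the question with the omitted synchronization filled in: the threads are created once, `run()` bumps a generation counter and wakes them, and the last worker to finish wakes the caller. `WorkerTeam`, `run` and the generation counter are hypothetical names and choices; this is roughly the kind of amortization an OpenMP runtime typically does for you.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical persistent team: threads are created once and woken through
// a condition variable for every parallel section.
class WorkerTeam {
public:
    explicit WorkerTeam(std::size_t n) {
        for (std::size_t id = 0; id < n; ++id)
            threads_.emplace_back([this, id] { loop(id); });
    }

    ~WorkerTeam() {
        {
            std::lock_guard<std::mutex> lock(m_);
            stop_ = true;
        }
        start_.notify_all();
        for (std::thread& t : threads_) t.join();
    }

    // Runs task(id) once on every worker and returns when all are done.
    void run(std::function<void(std::size_t)> task) {
        assert(task);
        std::unique_lock<std::mutex> lock(m_);
        task_ = std::move(task);
        pending_ = threads_.size();
        ++generation_;
        start_.notify_all();
        done_.wait(lock, [this] { return pending_ == 0; });
    }

private:
    void loop(std::size_t id) {
        std::size_t seen = 0;  // last generation this worker has run
        std::unique_lock<std::mutex> lock(m_);
        for (;;) {
            start_.wait(lock, [&] { return stop_ || generation_ != seen; });
            if (stop_) return;
            seen = generation_;
            lock.unlock();
            task_(id);  // all workers run their slice in parallel
            lock.lock();
            if (--pending_ == 0) done_.notify_one();  // last one wakes run()
        }
    }

    std::mutex m_;
    std::condition_variable start_, done_;
    std::function<void(std::size_t)> task_;
    std::size_t pending_ = 0, generation_ = 0;
    bool stop_ = false;
    std::vector<std::thread> threads_;
};
```

In the question's loop, a team of `num_threads` workers created before the loop would let `team.run([&](std::size_t id) { f(sub_array[id]); });` replace the create/join pair, assuming `sub_array` has one entry per thread.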

dynamic thread creation using std::thread in c++

How can we create threads dynamically using std::thread? I am reading raw strings from a queue and have to perform some processing on each of them. The queue holds thousands of such messages, so I want to create a thread for each message to improve performance. I can create the threads using the code below:
unsigned int n = std::thread::hardware_concurrency();
std::thread myThreads[n];
while(true)
{
for (int i=0; i<n; i++){
myThreads[i] = std::thread(&ControlQueue::processSomeStuff,this,msg_struct);
}
//for joining
for (int i=0; i<n; i++){
myThreads[i].join();
}
}
The problem is that if I use the above code, it will only create que.size() threads, but the queue will keep receiving new messages.
So is there any way to create a thread for each new message dynamically, like a server socket that creates a new socket to process each client request?

This looks like a good case for a thread pool: a set of threads is pre-created and ready to execute the jobs.
When a new message is received then the app passes it to the thread pool for further processing and immediately starts waiting for the next message.
A thread pool implementation I have done some time ago may be found here https://codereview.stackexchange.com/questions/36018/thread-pool-on-c11 on the Codereview site.

You should really consider @Mankarse's and @Jerry YY Rain's comments.
But if you want to go on with your approach, I'd consider creating a receiver thread whose sole purpose is to observe the queue in a loop (i.e. block when there are no messages); when a new message arrives, it creates a new worker thread and passes the message to it.
This may also simplify your synchronization: as you will have only one reader, you don't have to worry about whether two threads have read the same message.
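A sketch of the receiver-thread design described above, using only the standard library. `MessageQueue` and `receiverLoop` are hypothetical names, and `close()` is an assumed way to tell the receiver to stop. The receiver is the only reader, so no two workers can get the same message; it still starts one thread per message, which is the cost the thread-pool answers warn about.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical message queue: producers push(), the single receiver pops.
class MessageQueue {
public:
    void push(std::string msg) {
        {
            std::lock_guard<std::mutex> lock(m_);
            assert(!closed_ && "push() after close()");
            q_.push(std::move(msg));
        }
        cv_.notify_one();
    }

    void close() {
        {
            std::lock_guard<std::mutex> lock(m_);
            closed_ = true;
        }
        cv_.notify_all();
    }

    // Blocks until a message arrives; returns false once closed and empty.
    bool pop(std::string& out) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return closed_ || !q_.empty(); });
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop();
        return true;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::string> q_;
    bool closed_ = false;
};

// Hypothetical receiver: one new worker thread per message, joined at the end.
template <typename Handler>
void receiverLoop(MessageQueue& queue, Handler handle) {
    std::vector<std::thread> workers;
    std::string msg;
    while (queue.pop(msg))
        workers.emplace_back(handle, std::move(msg));  // thread gets copies
    for (std::thread& w : workers) w.join();
}
```

Here the worker handles are only joined once the queue is closed, so in a long-running server the vector keeps growing; a real version would have to join finished workers periodically or detach them.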

I would suggest using a thread pool. This way you have a fixed number of threads that process your requests in parallel. The number of threads in the pool could correspond to the number of cores on your machine.
There is no standard thread pool in C++, but you can easily write your own (queue, thread group, mutex, condition_variable), or benefit from boost::asio::io_service.
See also this for a reference implementation.
Creating a new thread for each request might be very expensive. Moreover, you might not get better performance at all, because of context-switch overhead (when the number of requests is large).

Replacing many std::async calls by a threadpool

I have a program that calls std::async many, many times. The tasks that are executed are reasonably short (a few hundred milliseconds each). I figure that there is significant overhead for thread creation, and I wonder whether I can avoid it somehow. The code that enumerates the jobs runs much faster than the processing of the jobs, so I already have a sort of pooling in place. It goes like this. I create an array of 'job slots':
template <typename T>
struct job {
    std::future<void> fut;
    std::vector<T*> *result;
    bool inUse;
};
Before the parallel code starts, I initialize the array of job slots, creating the result vectors only once. Then, every time the job enumeration code has enumerated a job, it looks for a job slot that is not in use. If there is a free slot, it starts a new job (with std::async) and moves the future into the slot. The job runs and fills the result vector. If there is no free slot, the code checks whether any of the futures in the slots is ready. If so, it processes the result vector and then reuses that slot. If not, it waits a few milliseconds.
This code runs very nicely and scales exactly to the number of processors available. I learned that each call to std::async creates a new thread, and indeed, I can see the process IDs scrolling through. I want to remove this overhead by creating the threads once and for all at the beginning. How should I proceed?
I have found this threadpool implementation
https://code.google.com/p/cppthreadpool/downloads/list
but it states that a task should take one or two seconds for this to be efficient. I don't need any fancy scheduling, priorities, etc. I just want to remove overhead for repeated construction and destruction of threads.

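I ran a test program that creates tasks using std::async and found that many tasks were run by the same thread! In fact, I saw 2 threads run 25 async tasks. So it looks like my standard library implementation does some thread pooling already. The standard does not require this, though, and with the default launch policy an implementation may also defer a task and run it in the thread that calls wait(); pass std::launch::async to rule that out.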
std::vector<std::future<void>> futures;
for (int i = 0; i < 25; ++i)
{
    auto fut = std::async([]
    {
        std::cout << std::this_thread::get_id() << std::endl;
    });
    futures.push_back(std::move(fut));
}
std::for_each(futures.begin(), futures.end(), [](std::future<void> & fut)
{
    fut.wait();
});
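If you need a pool regardless of what your standard library does with `std::async`, the job slots can hold futures from a pool whose `submit()` returns a `std::future`, just as `std::async` does. This is a sketch with `TaskPool` and `submit` as hypothetical names: each callable is wrapped in a `std::packaged_task`, whose future becomes ready when a pool thread has run it. Unlike a future returned by `std::async`, such a future does not block in its destructor.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical pool with an async-like submit(): the threads are created
// once, and each task's result comes back through a std::future.
class TaskPool {
public:
    explicit TaskPool(unsigned n) {
        assert(n > 0);
        for (unsigned i = 0; i < n; ++i)
            threads_.emplace_back([this] { work(); });
    }

    ~TaskPool() {
        {
            std::lock_guard<std::mutex> lock(m_);
            stop_ = true;
        }
        cv_.notify_all();
        for (std::thread& t : threads_) t.join();
    }

    template <typename F>
    auto submit(F f) -> std::future<decltype(f())> {
        using R = decltype(f());
        // packaged_task is move-only; shared_ptr lets std::function copy it.
        auto task = std::make_shared<std::packaged_task<R()>>(std::move(f));
        std::future<R> result = task->get_future();
        {
            std::lock_guard<std::mutex> lock(m_);
            jobs_.push([task] { (*task)(); });
        }
        cv_.notify_one();
        return result;
    }

private:
    void work() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return stop_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stop_ is set and nothing is left
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    bool stop_ = false;
    std::vector<std::thread> threads_;
};
```

With `TaskPool pool(std::thread::hardware_concurrency());` created before the enumeration starts (guarding against a 0 result), `slot.fut = pool.submit([...] { /* fill result */ });` would replace the `std::async` call, and the ready-check on the slots stays as it is.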

pthread_join - multiple threads waiting

Using POSIX threads & C++, I have an "Insert operation" which can only be done safely one at a time.
If I have multiple threads waiting to insert using pthread_join, each spawning a new thread when the current one finishes, will they all receive the "thread complete" signal at once and spawn multiple inserts? Or is it safe to assume that the thread that receives the "thread complete" signal first will spawn a new thread, blocking the others from creating new threads?
/* --- GLOBAL --- */
pthread_t insertThread;
/* --- DIFFERENT THREADS --- */
// Wait for Current insert to finish
pthread_join(insertThread, NULL);
// Done start a new one
pthread_create(&insertThread, NULL, Insert, Data);
Thank you for the replies
The program is basically a huge hash table which takes requests from clients through Sockets.
Each new client connection spawns a new thread from which it can then perform multiple operations, specifically lookups or inserts. Lookups can be conducted in parallel, but inserts need to be "re-combined" into a single thread. You could say that lookup operations could be done without spawning a new thread for the client; however, they can take a while, causing the server to lock up and drop new requests. The design tries to minimize system calls and thread creation as much as possible.
But now that I know it's not safe the way I first thought, I should be able to cobble something together.
Thanks

From opengroup.org on pthread_join:
The results of multiple simultaneous calls to pthread_join() specifying the same target thread are undefined.
So, you really should not have several threads joining your previous insertThread.
First, as you use C++, I recommend boost.thread. It resembles the POSIX model of threads and also works on Windows. And it helps you with C++, e.g. by making function objects easier to use.
Second, why do you want to start a new thread for inserting an element when you always have to wait for the previous one to finish before you start the next one? That does not seem like a classical use of multiple threads.
Although... One classical solution to this would be to have one worker-thread getting jobs from an event-queue, and other threads posting the operation onto the event-queue.
If you really just want to keep it more or less the way you have it now, you'd have to do this:
Create a condition variable, like insert_finished.
All the threads that want to do an insert wait on the condition variable.
As soon as one thread is done with its insertion, it signals the condition variable.
As the condition variable requires a mutex, you can just notify all waiting threads: they all want to start inserting, but as only one thread can acquire the mutex at a time, the threads will do their inserts sequentially.
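A sketch of the scheme above. A condition variable needs a predicate to wait for, so this version uses an `inserting_` flag (an assumption, as are the names `SerializedInserter` and `insertFromThreads`): waiters sleep until the flag is clear, and the thread that finishes clears it and notifies all of them; every waiter wakes and re-checks the flag, and exactly one gets to insert next.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical SerializedInserter: inserting_ records whether an insert
// is running; other would-be inserters sleep until it clears.
class SerializedInserter {
public:
    template <typename InsertFn>
    void insert(InsertFn doInsert) {
        std::unique_lock<std::mutex> lock(m_);
        insert_finished_.wait(lock, [this] { return !inserting_; });
        inserting_ = true;
        lock.unlock();
        doInsert();  // at most one thread is here at any time
        lock.lock();
        inserting_ = false;
        lock.unlock();
        insert_finished_.notify_all();  // waiters re-check; one proceeds
    }

private:
    std::mutex m_;
    std::condition_variable insert_finished_;
    bool inserting_ = false;
};

// Hypothetical client code: numThreads threads each perform perThread
// inserts into a shared vector that has no locking of its own.
std::vector<int> insertFromThreads(int numThreads, int perThread) {
    assert(numThreads > 0 && perThread >= 0);
    SerializedInserter inserter;
    std::vector<int> table;
    std::vector<std::thread> clients;
    for (int t = 0; t < numThreads; ++t) {
        clients.emplace_back([&inserter, &table, t, perThread] {
            for (int i = 0; i < perThread; ++i)
                inserter.insert([&table, t, perThread, i] {
                    table.push_back(t * perThread + i);
                });
        });
    }
    for (std::thread& c : clients) c.join();
    return table;
}
```

Holding one `std::mutex` across the insert (for example with `std::lock_guard`) gives the same one-at-a-time behaviour with less code, and, like the flag version, it does not need a thread per insert.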
But you should take care that your synchronization is not implemented in too ad-hoc a way. As this is called insert, I suspect you want to manipulate a data structure, so you probably want to implement a thread-safe data structure first, instead of sharing the synchronization between data-structure accesses and all clients. I also suspect that there will be more operations than just insert, which will need proper synchronization...

According to the Single Unix Specification: "The results of multiple simultaneous calls to pthread_join() specifying the same target thread are undefined."
The "normal way" of achieving a single thread to get the task would be to set up a condition variable (don't forget the related mutex): idle threads wait in pthread_cond_wait() (or pthread_cond_timedwait()), and when the thread doing the work has finished, it wakes up one of the idle ones with pthread_cond_signal().

Yes, as most people recommended, the best way seems to be a worker thread reading from a queue. Some code snippets below:
pthread_t insertThread;
pthread_mutex_t insertMutex = PTHREAD_MUTEX_INITIALIZER; // guards the three variables below
pthread_cond_t insertConditionNew = PTHREAD_COND_INITIALIZER;
pthread_cond_t insertConditionDone = PTHREAD_COND_INITIALIZER;
bool insertPending = false;      // a client has queued words
unsigned long passesStarted = 0; // worker passes begun
unsigned long passesDone = 0;    // worker passes finished
//Thread for new incoming connection
void * newBatchInsert(void *)
{
    for(each Word) // pseudocode
    {
        //Push it into the queue for its length
        pthread_mutex_lock(&lexicon[newPendingWord->length - 1]->insertQueueMutex);
        lexicon[newPendingWord->length - 1]->insertQueue.push(newPendingWord);
        pthread_mutex_unlock(&lexicon[newPendingWord->length - 1]->insertQueueMutex);
    }
    //Signal the worker thread, then wait for a pass that began after our push
    pthread_mutex_lock(&insertMutex);
    insertPending = true;
    unsigned long myPass = passesStarted + 1;
    pthread_cond_signal(&insertConditionNew);
    while (passesDone < myPass) // pthread_cond_wait needs the mutex held and a loop
        pthread_cond_wait(&insertConditionDone, &insertMutex);
    pthread_mutex_unlock(&insertMutex);
    return NULL;
}
//Worker thread
void * insertWorker(void *)
{
    while(1)
    {
        pthread_mutex_lock(&insertMutex);
        while (!insertPending) // guards against spurious and missed wakeups
            pthread_cond_wait(&insertConditionNew, &insertMutex);
        insertPending = false;
        ++passesStarted;
        pthread_mutex_unlock(&insertMutex);
        for (int ii = 0; ii < maxWordLength; ++ii)
        {
            while (true)
            {
                //Take the next word while holding the queue's mutex
                pthread_mutex_lock(&lexicon[ii]->insertQueueMutex);
                if (lexicon[ii]->insertQueue.empty())
                {
                    pthread_mutex_unlock(&lexicon[ii]->insertQueueMutex);
                    break;
                }
                queueNode * newPendingWord = lexicon[ii]->insertQueue.front();
                lexicon[ii]->insertQueue.pop();
                pthread_mutex_unlock(&lexicon[ii]->insertQueueMutex);
                lexicon[ii]->insert(newPendingWord->word);
            }
        }
        //Send signal that it's done
        pthread_mutex_lock(&insertMutex);
        ++passesDone;
        pthread_cond_broadcast(&insertConditionDone);
        pthread_mutex_unlock(&insertMutex);
    }
    return NULL;
}
int main(int argc, char * const argv[])
{
    pthread_create(&insertThread, NULL, &insertWorker, NULL);
    lexiconServer = new server(serverPort, (void *) newBatchInsert);
    return 0;
}

The others have already pointed out that this has undefined behaviour. I'd just add that the simplest way to accomplish your task (allowing only one thread to execute part of the code) is to use a simple mutex: you need the threads executing that code to be MUTually EXclusive, and that's where the mutex got its name :-)
If you need the code to run in a specific thread (like Java AWT), then you need condition variables. However, you should think twice about whether this solution actually pays off. Imagine how many context switches you need if you call your "Insert operation" 10000 times per second.

As you just mentioned that you're using a hash table with several lookups in parallel with insertions, I'd recommend checking whether you can use a concurrent hash table.
As the exact lookup results are non-deterministic when you're inserting elements simultaneously, such a concurrent hash map may be exactly what you need. I have not used concurrent hash tables in C++, though, but as they are available in Java, you will surely find a library doing this in C++.

The only library I found that supports inserts without locking out new lookups is Sunrise DD (and I'm not sure whether it supports concurrent inserts).
However, switching from Google's sparse hash map to it more than doubles the memory usage. Lookups should happen fairly infrequently, so rather than trying to write my own library that combines the advantages of both, I would rather just lock the table, suspending lookups while changes are made safely.
Thanks again

It seems to me that you want to serialise inserts to the hashtable.
For this you want a lock - not spawning new threads.
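Since lookups may run in parallel but inserts must be exclusive, a reader/writer lock fits this case. A sketch assuming C++17 `std::shared_mutex`, with `std::unordered_map` standing in for the asker's hash table and `Lexicon`, `insert` and `lookup` as hypothetical names: lookups take a shared lock, and an insert takes an exclusive one, which briefly suspends lookups as the asker planned.

```cpp
#include <cassert>
#include <mutex>
#include <optional>
#include <shared_mutex>
#include <string>
#include <unordered_map>

// Hypothetical Lexicon (C++17): std::unordered_map stands in for the real
// table; many concurrent lookups, one insert at a time.
class Lexicon {
public:
    void insert(const std::string& word, int value) {
        std::unique_lock<std::shared_mutex> lock(mutex_);  // exclusive
        table_[word] = value;
    }

    std::optional<int> lookup(const std::string& word) const {
        std::shared_lock<std::shared_mutex> lock(mutex_);  // shared
        auto it = table_.find(word);
        if (it == table_.end()) return std::nullopt;
        return it->second;
    }

private:
    mutable std::shared_mutex mutex_;
    std::unordered_map<std::string, int> table_;
};
```

No extra threads are involved: each client thread calls `insert` or `lookup` directly, and the lock decides who waits.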

From your description that looks very inefficient as you are re-creating the insert thread every time you want to insert something. The cost of creating the thread is not 0.
A more common solution to this problem is to spawn an insert thread that waits on a queue (i.e. sits in a loop, sleeping while the queue is empty). Other threads then add work items to the queue. The insert thread picks items off the queue in the order they were added (or by priority, if you want) and performs the appropriate action.
All you have to do is make sure that additions to the queue are protected, so that only one thread at a time can modify the actual queue, and that the insert thread does not busy-wait but instead sleeps when nothing is in the queue (see condition variables).

Ideally, you don't want multiple thread pools in a single process, even if they perform different operations. The reusability of a thread is an important architectural decision, which leads to pthread_join being called only from a main thread if you use C.
Of course, for a C++ thread pool (a.k.a. ThreadFactory), the idea is to keep the thread primitives abstract, so it can handle any function/operation type passed to it.
A typical example would be a web server, which has connection pools and thread pools that service connections and then process them further, but all are derived from a common thread pool process.
SUMMARY: AVOID PTHREAD_JOIN ANYWHERE OTHER THAN A MAIN THREAD.
