Replacing many std::async calls with a threadpool - c++

I have a program that calls std::async many, many times. The tasks that are executed are reasonably short (a few hundred milliseconds each). I figure there is significant overhead for thread creation and I wonder if I can avoid this somehow. The code that enumerates the jobs runs much faster than the processing of the jobs, so I already have a sort of pooling in place. It goes like this. I create an array of 'job slots':
template <typename T>
struct job {
    std::future<void> fut;
    std::vector<T*>* result;
    bool inUse;
};
Before the parallel code starts, I initialize the array of job slots, creating the result vectors only once. Then, every time the job enumeration code has enumerated a job, it looks for a job slot that is not in use. If there is a free slot, it starts a new job (with std::async) and moves the future into the slot. The job runs and fills the result vector. If there is no free slot, the code checks whether any of the futures in the slots is ready. If so, it processes that slot's result vector and then reuses the slot. If not, it waits a few milliseconds.

This code runs very nicely and scales exactly to the number of processors available. I learned that each call to std::async creates a new thread, and indeed I can see the thread IDs scrolling through. I want to remove this overhead by creating the threads once and for all in the beginning. How to proceed?
I have found this threadpool implementation
https://code.google.com/p/cppthreadpool/downloads/list
but it states that a task should take one or two seconds for this to be efficient. I don't need any fancy scheduling, priorities, etc. I just want to remove the overhead of repeatedly constructing and destroying threads.
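For concreteness, here is roughly what the slot-scanning loop described above looks like. This is only a sketch: Item, enumerateNextJob(), runJob() and processResult() are placeholder names, not from my real code.

// Sketch of the job-slot scheme; placeholder names throughout.
std::vector<job<Item>> slots(std::thread::hardware_concurrency());

while (Item* next = enumerateNextJob()) {
    for (bool placed = false; !placed; ) {
        for (auto& slot : slots) {
            // skip slots whose job is still running
            if (slot.inUse &&
                slot.fut.wait_for(std::chrono::seconds(0)) != std::future_status::ready)
                continue;
            if (slot.inUse)
                processResult(*slot.result);   // reuse the slot and its vector
            slot.inUse = true;
            slot.fut = std::async(std::launch::async, runJob, next, slot.result);
            placed = true;
            break;
        }
        if (!placed)  // all slots busy: wait a few milliseconds
            std::this_thread::sleep_for(std::chrono::milliseconds(2));
    }
}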

I ran a test program that creates tasks using std::async and found that many tasks were run by the same thread! In fact, I saw 2 threads run all 25 async tasks. So it looks like the standard library does some thread pooling already (note that whether this happens is implementation-defined; see the note after the code).
std::vector<std::future<void>> futures;
for (int i = 0; i < 25; ++i)
{
    auto fut = std::async([]
    {
        std::cout << std::this_thread::get_id() << std::endl;
    });
    futures.push_back(std::move(fut));
}
std::for_each(futures.begin(), futures.end(), [](std::future<void>& fut)
{
    fut.wait();
});
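One caution about interpreting this test: without an explicit launch policy, std::async may choose std::launch::deferred, in which case the task runs lazily on the thread that calls wait()/get(), which can also produce identical thread IDs. (Some implementations, such as MSVC's, do run std::launch::async tasks on a thread pool.) Forcing one real thread per task makes the test unambiguous:

auto fut = std::async(std::launch::async, []
{
    std::cout << std::this_thread::get_id() << std::endl;
});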

Related

How many std::future objects can exist in a system simultaneously?

I wanted to hash a stream of input messages across multiple threads, so I was trying to use std::vector<std::future<HashData>> futures; but I am not sure how many future objects can exist in a system simultaneously.
std::vector<std::future<HashData>> futures;
std::vector<std::string> messages;
for (int i = 0; i < messages.size(); i++)
{
    std::promise<HashData> promiseHashData;
    std::future<HashData> futureHashData = promiseHashData.get_future();
    futures.emplace_back(std::move(futureHashData));
    std::async(std::launch::async, [&]() {PerformHash(std::move(promiseHashData), messages[i]);});
}

std::vector<HashData> vectorOfHashData;
// wait for all async tasks to complete
for (auto& futureObj : futures)
{
    vectorOfHashData.push_back(futureObj.get());
}
Is there any limit on the creation of future objects in a system (similar to how a system may reach thread saturation if existing threads are never destroyed and new ones keep being created)? I will be calling the PerformHash() method asynchronously for a large number of messages.

I have been exploring concurrency in C++ recently and wanted to improve the performance of the hashing task. This thought came to my mind, but I am not sure whether it will work. I wanted to know if I am missing something here.
The problem isn't going to be "how many futures can a vector hold"; futures (on most systems) are just a shared pointer to a block of memory with some cheap concurrency primitives in it.
The problem is that you are creating a thread per future, then blocking forward progress until that thread is finished (the temporary future returned by std::async blocks in its destructor). And even if you fix that problem, your code uses dangling references.
std::vector<std::future<HashData>> futures;
std::vector<std::string> messages;
for (int i = 0; i < messages.size(); i++)
{
    std::promise<HashData> promiseHashData;
    std::future<HashData> futureHashData = promiseHashData.get_future();
    futures.emplace_back(std::move(futureHashData));
    // this captures promiseHashData by reference.
    // It also creates a thread, then blocks until the
    // thread finishes.
    std::async(std::launch::async, [&]() {PerformHash(std::move(promiseHashData), messages[i]);});
}
So a few points:
Unless the hash data is worth consuming in small pieces, a future<vector<HashData>> is going to be more efficient.
If you want a vector<future>, you'll also want a vector<promise>. Then create a bounded number of threads (or get them from a pool you write) and fulfill those promises; see the sketch after these points.
Creating an unbounded number of futures, then creating an unbounded number of threads to service those futures, is a bad plan.
Finally, std::async is funny in that it returns a std::future itself. When that future is destroyed, it blocks on the completion of the thread it creates. This is atypical behavior, but it prevents losing track of a thread of execution.
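As a sketch of the bounded approach from the second point above. Assumptions: HashData is defined elsewhere, and PerformHashValue is a hypothetical value-returning variant of the asker's PerformHash; neither name is from the original code.

#include <atomic>
#include <cstddef>
#include <future>
#include <string>
#include <thread>
#include <vector>

// A bounded number of worker threads fulfilling one promise per message.
// PerformHashValue(const std::string&) -> HashData is a placeholder.
std::vector<HashData> HashAll(const std::vector<std::string>& messages)
{
    unsigned num_workers = std::thread::hardware_concurrency();
    if (num_workers == 0) num_workers = 1;

    std::vector<std::promise<HashData>> promises(messages.size());
    std::vector<std::future<HashData>> futures;
    futures.reserve(messages.size());
    for (auto& p : promises)
        futures.push_back(p.get_future());

    std::atomic<std::size_t> next{0}; // shared work index
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < num_workers; ++w)
        workers.emplace_back([&] {
            // each worker claims the next unprocessed message
            for (std::size_t i = next++; i < messages.size(); i = next++)
                promises[i].set_value(PerformHashValue(messages[i]));
        });

    std::vector<HashData> results;
    for (auto& f : futures)   // promises are fulfilled as the workers run
        results.push_back(f.get());
    for (auto& t : workers)
        t.join();
    return results;
}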

C++ condition variables vs new threads for vectorization

I have a block of code that goes through a loop. A section of the code operates on a vector of data, and I would like to vectorize this operation. The idea is to split the processing of the array across multiple threads that each work on a subsection of the array. I have to decide between two possibilities. The first one is to create the threads each time this section is encountered and rejoin them with the main thread at the end:
for(....)
{
    // serial stuff

    // create threads
    for(i = 0; i < num_threads; ++i)
    {
        threads_vect.push_back(std::thread(f, sub_array[i]));
    }
    // join them, then drop the handles so the vector can be reused
    for(auto& t : threads_vect)
    {
        t.join();
    }
    threads_vect.clear();

    // serial stuff
}
This is similar to what is done with OpenMP, but since the problem is simple I'd like to use std::thread instead of OpenMP (unless there are good reasons against this).
The second option is to create the threads beforehand to avoid the overhead of creation and destruction, and use condition variables for synchronization (I omitted a lot of stuff for the synchronization. It is just the general idea):
std::condition_variable cv_threads;
std::condition_variable cv_main;
// create threads; they will sleep on cv_threads
for(....)
{
    // serial stuff

    // wake up threads
    cv_threads.notify_all();
    // sleep until the last thread finishes, which will notify
    std::unique_lock<std::mutex> main_lock(main_mutex);
    cv_main.wait(main_lock);

    // serial stuff
}
To allow for parallelism, the threads will have to unlock the lock as soon as they wake up at the beginning of the computation, then acquire it again to go back to sleep, and synchronize among themselves to notify the main thread.
My question is which of these solutions is usually preferred in a context like this, and whether the avoided overhead of thread creation and destruction is worth the added complexity (or worth anything at all, given that the added synchronization also takes time).
Obviously this also depends on how long the computation is for each thread, but this could vary a lot, since the data vector could also be very short (down to about two elements per thread, which would lead to a computation time of about 15 milliseconds).
The biggest disadvantage of creating new threads is that thread creation and shutdown is usually quite expensive. Just think of all the things an operating system has to do to get a thread off the ground, compared to what it takes to notify a condition variable.
Note that synchronization is always required, also with thread creation. The C++11 std::thread, for instance, introduces a synchronizes-with relationship with the creating thread upon construction. So you can safely assume that thread creation will always be significantly more expensive than condition variable signalling, regardless of your implementation.
A framework like OpenMP will typically attempt to amortize these costs in some way. For instance, an OpenMP implementation is not required to shut down the worker threads after every loop and many implementations will not do this.
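For illustration, here is one possible way to synchronize the persistent-thread variant from the question. This is a sketch only: process, sub_array and num_threads are placeholders, and a generation counter is one scheme among several.

std::mutex m;
std::condition_variable cv_workers;
std::condition_variable cv_main;
int generation = 0; // bumped by the main thread to start a round
int remaining = 0;  // workers still busy in the current round

void worker_loop(int id)
{
    int seen = 0;
    while (true) {
        {
            std::unique_lock<std::mutex> lock(m);
            cv_workers.wait(lock, [&]{ return generation != seen; });
            seen = generation;
        }
        process(sub_array[id]); // the actual work, done without the lock
        std::unique_lock<std::mutex> lock(m);
        if (--remaining == 0)
            cv_main.notify_one(); // last worker of the round wakes main
    }
}

// inside the main loop, one parallel round:
{
    std::unique_lock<std::mutex> lock(m);
    remaining = num_threads;
    ++generation;
}
cv_workers.notify_all();
{
    std::unique_lock<std::mutex> lock(m);
    cv_main.wait(lock, [&]{ return remaining == 0; });
}

The predicates on both waits guard against spurious and missed wakeups, which is the part the question's pseudo-code omitted.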

Handling multiple async future results à la carte

I'm going to start by saying I have minimal experience with the C++ STL and parallel processing. Still doing my research...
My application has a queue that tends to get large. I use asynchronous futures to handle these tasks à la carte (for lack of a better term). The maximum number of tasks created is based on the number of cores available to the machine.
I store the futures in a class member vector to prevent a task being bound to the scope of the method from which it is called. Except now I have the problem of dealing with the results after a task is completed. Here is a sample of my code to provide context to my question:
if ( ALI::WorkingTasks < CPU_HW_CONCURRENCY ) {
    std::string Task = TaskQueue.front();
    TaskQueue.pop();
    ALI::WorkingTasks++;
    ALI::AsyncTasks.push_back(std::async(std::launch::async, &ALI::ProcessCodecUI, this, Task));
}
The method that is called from std::async
bool ALI::ProcessCodecUI(std::string UIPath)
{
    // long inefficient process here
    ALI::WorkingTasks--;
    // notify condition_variable to create more tasks here
    return true; // missing in the original; a bool function must return a value
}
In my class definitions, this is how ALI::AsyncTasks is defined.
private:
    std::vector<std::future<bool>> AsyncTasks;
This is my initial implementation to get the application working at the very minimum - it works. I've done some reading on threadpools and have poked at the idea of creating my own implementation of an "a la carte" threadpool.
So my question is: How do I handle the results of the ALI::AsyncTasks? Every example I have seen deals with the future directly in the method that calls it. In my scenario, the vector keeps building up and the futures never get destroyed even after the tasks are completed - this creates a memory leak. I don't have any way to self-destroy a future after ProcessCodecUI() is completed.
If I am not clear, please let me know and I will revise.
Thank you
You should not use a future if you don't have a specific point in the program where you want to block while waiting for the asynchronous operation to complete.
You could achieve what you are trying to with minimal changes by using std::thread instead of std::async then detaching the thread immediately.
if ( ALI::WorkingTasks < CPU_HW_CONCURRENCY ) {
    std::string Task = TaskQueue.front();
    TaskQueue.pop();
    ALI::WorkingTasks++;
    std::thread Thread(&ALI::ProcessCodecUI, this, Task);
    Thread.detach();
}
Then, there is nothing left to clean up when the thread terminates. Though it would be more efficient to spawn a thread per core that pulls from a common queue. One caveat: since ALI::WorkingTasks is incremented on the spawning thread and decremented inside the workers, it should be a std::atomic<int> (or protected by a mutex).

C++11 Dynamic Threadpool

Recently, I've been trying to find a library for threading concurrent tasks. Ideally, a simple interface that calls a function on a thread. There are n threads at any time, some complete faster than others and arrive at different times.
First I tried Rx, which is great in C++. I've also looked into Blocks and TBB, but they are platform dependent. For my prototype, I need to remain platform independent, as we don't know what it will be running on yet and that can change when decisions are made.
C++11 has a number of things for threading and concurrency, and I found a number of examples like this one for thread pools.
https://github.com/bilash/threadpool
Similar projects use the same lambda expressions with std::thread and std::mutex.
This looks perfect for what I need, but there are some issues. The pools are started with a defined number of threads and tasks are queued until a thread is free.
How can I add new threads?
Remove expired threads? (.join()??)
Obviously, this is much easier for a known number of threads, as they can be initialised in the ctor and joined in the dtor.
Any tips or pointers here from someone with experience with C++ concurrency?
Start with the maximum number of threads a system can support:
int Num_Threads = thread::hardware_concurrency();
For an efficient threadpool implementation, once threads are created according to Num_Threads, it's better not to create new ones or destroy old ones (by joining). There will be a performance penalty, and it might even make your application slower than the serial version.
Each C++11 thread should be running in its function with an infinite loop, constantly waiting for new tasks to grab and run.
Here is how to attach such a function to the thread pool:
int Num_Threads = thread::hardware_concurrency();
vector<thread> Pool;
for(int ii = 0; ii < Num_Threads; ii++)
{
    Pool.push_back(thread(&The_Pool::Infinite_loop_function, this));
}
The Infinite_loop_function
This is a "while(true)" loop waiting for the task queue
void The_Pool::Infinite_loop_function()
{
    while(true)
    {
        function<void()> Job; // local copy of the next task
        {
            unique_lock<mutex> lock(Queue_Mutex);
            condition.wait(lock, [this]{ return !Queue.empty(); });
            Job = Queue.front();
            Queue.pop();
        }
        Job(); // run outside the lock; function<void()> type
    }
}
Make a function to add a job to your Queue
void The_Pool::Add_Job(function<void()> New_Job)
{
    {
        unique_lock<mutex> lock(Queue_Mutex);
        Queue.push(New_Job);
    }
    condition.notify_one();
}
Bind an arbitrary function to your Queue
Pool_Obj.Add_Job(std::bind(&Some_Class::Some_Method, &Some_object));
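Or, equivalently, with a lambda:
Pool_Obj.Add_Job([&]{ Some_object.Some_Method(); }); // Some_object must outlive the job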
Once you integrate these ingredients, you have your own dynamic threading pool. These threads always run, waiting for jobs to do.
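One caveat worth adding (my extension, not part of the answer above): with while(true) the workers never exit, so the pool can never be joined. A possible shutdown method, assuming a bool terminate member (initially false) guarded by Queue_Mutex, with the wait predicate changed to [this]{ return terminate || !Queue.empty(); } and the loop returning when terminate is set:

void The_Pool::Shutdown()
{
    {
        unique_lock<mutex> lock(Queue_Mutex);
        terminate = true; // assumed member: bool terminate = false;
    }
    condition.notify_all(); // wake every worker so it can observe the flag
    for (thread& t : Pool)
        t.join();
}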
This should be simple to use: https://pocoproject.org/docs/Poco.ThreadPool.html
A thread pool always keeps a number of threads running, ready to accept work. Creating and starting a thread can impose a significant runtime overhead to an application. A thread pool helps to improve the performance of an application by reducing the number of threads that have to be created (and destroyed again). Threads in a thread pool are re-used once they become available again. The thread pool always keeps a minimum number of threads running. If the demand for threads increases, additional threads are created. Once the demand for threads sinks again, no-longer used threads are stopped and removed from the pool.
ThreadPool(
    int minCapacity = 2,
    int maxCapacity = 16,
    int idleTime = 60,
    int stackSize = 0
);
This is a very nice library and easy to use, not like Boost :(
https://github.com/pocoproject/poco
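For reference, a minimal usage sketch (the class name and the work done in run() are placeholders):

#include <Poco/Runnable.h>
#include <Poco/ThreadPool.h>

class MyTask: public Poco::Runnable
{
public:
    void run() override
    {
        // the actual work, executed on a pooled thread
    }
};

int main()
{
    MyTask task;
    Poco::ThreadPool::defaultPool().start(task); // borrows a pooled thread
    Poco::ThreadPool::defaultPool().joinAll();   // wait for all pooled work
    return 0;
}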

Waiting for multiple futures?

I'd like to run tasks (worker threads) of the same type, but not more than a certain number of tasks at a time. When a task finishes, its result is an input for a new task which, then, can be started.
Is there any good way to implement this with async/future paradigm in C++11?
At first glance, it looks straightforward: you just spawn multiple tasks with:
std::future<T> result = std::async(...);
and then run result.get() to get the async result of a task.
However, the problem here is that the future objects have to be stored in some sort of queue and waited on one by one. It is possible to iterate over the future objects over and over, checking if any of them are ready, but that is not desirable due to the unnecessary CPU load.
Is it possible somehow to wait for any future from a given set to be ready and get its result?
The only option I can think of so far is an old-school approach without any async/future. Specifically, spawning multiple worker threads and having each thread, at the end, push its result into a mutex-protected queue, notifying the waiting thread via a condition variable that the queue has been updated with more results.
Is there any other better solution with async/future possible?
Thread support in C++11 was just a first pass, and while std::future rocks, it does not support multiple waiting as yet.
You can fake it relatively inefficiently, however. You end up creating a helper thread for each std::future (ouch, very expensive), then gathering their "this future is ready" notifications into a synchronized many-producer single-consumer message queue, then setting up a consumer task that dispatches the fact that a given std::future is ready.
The std::future in this system doesn't add much functionality, and having tasks that directly state that they are ready and stick their results into the above queue would be more efficient. If you go this route, you could write wrappers that match the pattern of std::async or std::thread and return a std::future-like object that represents a queue message. This basically involves reimplementing a chunk of the concurrency library.
If you want to stay with std::future, you could create shared_futures, and have each dependent task depend on the set of shared_futures: i.e., do it without a central scheduler. This doesn't permit things like abort/shutdown messages, which I consider essential for a robust multi-threaded task system.
Finally, you can wait for C++2x, or whenever the concurrency TS is folded into the standard, to solve the problem for you.
You could create all the futures of "generation 1", and give all those futures to your generation 2 tasks, which will then wait for their input themselves.
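A small sketch of that idea (produce, consumeA and consumeB are placeholder functions):

// generation-1 result, shareable by any number of dependents
std::shared_future<int> gen1 =
    std::async(std::launch::async, []{ return produce(); }).share();

// generation-2 tasks each hold a copy of the shared_future and block on
// gen1.get() themselves; no central scheduler is involved
auto gen2a = std::async(std::launch::async, [gen1]{ return consumeA(gen1.get()); });
auto gen2b = std::async(std::launch::async, [gen1]{ return consumeB(gen1.get()); });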
Facebook's Folly has collectAny/collectN/collectAll on futures. I haven't tried it yet, but it looks promising.
Given that the "Waiting for multiple futures?" title attracts folks with questions like "is there a wait-all for a list of futures?": you can do that adequately by keeping track of the number of pending tasks:
std::atomic<unsigned> pending{0}; // shared between threads, so it must be atomic
for (size_t i = 0; i < N; ++i) {
    ++pending;
    auto callPause =
        [&pending, i, &each, &done, &results]()->unsigned {
            unsigned ret = each();
            results[i] = ret;
            if (!--pending)
                // called in whatever thread happens to finish last
                done(results);
            return ret;
        };
    futures[i] = std::async(std::launch::async, callPause); // run the wrapper, not each
}
full example
It might also be possible to use std::experimental::when_all from the Concurrency TS, which accepts either an iterator range or a variadic list of futures (C++ has no "spread operator"; parameter-pack expansion plays that role).