In this example I have 3 tasks that should run in parallel. This works really well. But now the number of tasks is variable. How can I handle that? I tried something like a vector of lambda expressions, but that didn't work. Thanks for helping!
#include <iostream>
#include <future>
#include <vector>
#include "myclass.h"
#define N 3
int main()
{
// Create instances
std::vector<myclass*> instances;
for(int i = 0; i < N; i++){
myclass *mc = new myclass();
instances.push_back(mc);
}
auto f0 = std::async(std::launch::async, [&](){
bool finish = false;
while(!finish){
if(instances[0]->do() != 0) finish = true;
}
});
auto f1 = std::async(std::launch::async, [&](){
bool finish = false;
while(!finish){
if(instances[1]->do() != 0) finish = true;
}
});
auto f2 = std::async(std::launch::async, [&](){
bool finish = false;
while(!finish){
if(instances[2]->do() != 0) finish = true;
}
});
f0.wait(); f1.wait(); f2.wait();
// delete instances
for(int i = 0; i < N; i++){
delete instances[i];
}
return 0;
}
You may consider using std::function. A demo program:
#include <functional>
#include <future>
#include <iostream>
#include <string>
#include <vector>
int main(int argc, char* argv[]) {
std::vector<std::function<void()>> vec;
for (size_t i = 0; i < 10; ++i) {
vec.emplace_back(
[i = i] { std::cout << "hi" + std::to_string(i) << std::endl; });
}
std::vector<std::future<void>> futs;
for (const auto& f : vec) {
futs.emplace_back(std::async(std::launch::async, f));
}
for (auto& f : futs) f.wait();
return 0;
}
Generally, std::async is not well suited to large programs; consider using a thread-pool-based framework such as:
tbb
folly
boost-asio
poco
Use a loop and create a new lambda with a different binding context on every iteration:
std::vector<std::future<void>> tasks;
for (auto& instance : instances) {
tasks.push_back(
std::async(std::launch::async, [&, instance]() {
bool finish = false;
while (!finish) {
if (instance->do() != 0) finish = true;
}
})
);
}
// wait for all tasks to finish
for (auto& task : tasks) task.wait();
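To make the loop approach concrete, here is a minimal self-contained sketch, assuming C++14. Worker and run_all are hypothetical stand-ins (they are not the question's myclass.h): step() plays the role of the question's member function and returns non-zero once the work is done. The lambda copies the raw pointer rather than capturing the loop variable by reference.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <memory>
#include <vector>

// Hypothetical stand-in for the question's class: step() returns
// non-zero once it has been called `limit` times.
class Worker {
public:
    explicit Worker(int limit) : limit_(limit) { assert(limit > 0); }
    int step() { return ++calls_ >= limit_ ? 1 : 0; }
    int calls() const { return calls_; }
private:
    int limit_;
    int calls_ = 0;
};

// Creates n workers, runs each one to completion in its own task, and
// returns how many step() calls each worker needed.
std::vector<int> run_all(std::size_t n) {
    std::vector<std::unique_ptr<Worker>> instances;
    for (std::size_t i = 0; i < n; ++i)
        instances.push_back(std::make_unique<Worker>(static_cast<int>(i) + 1));

    std::vector<std::future<void>> tasks;
    for (auto& instance : instances) {
        Worker* w = instance.get();  // copy the pointer into the lambda
        tasks.push_back(std::async(std::launch::async, [w] {
            while (w->step() == 0) {}
        }));
    }
    // get() blocks until the task is done and rethrows any exception.
    for (auto& task : tasks) task.get();

    std::vector<int> calls;
    for (const auto& instance : instances) calls.push_back(instance->calls());
    return calls;
}
```

Calling run_all(n) with any n starts n tasks; owning the instances through std::unique_ptr also removes the need for the manual delete loop.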
Related
Is it possible, under any scenario, for "Main end" to be displayed before every result.get() call in the code snippet below has returned?
Or will "Main end" always be the last thing to appear?
#include <iostream>
#include <vector>
#include <future>
#include <chrono>
#include <thread>
using namespace std::chrono;
std::vector<std::future<int>> doParallelProcessing()
{
std::vector<std::future<int>> v;
for (int i = 0; i < 10; i++)
{
auto ret = std::async(std::launch::async, [i]() { // capture i by value: a reference would dangle once the loop ends
std::this_thread::sleep_for(seconds(i + 5));
return 5;
});
v.push_back(std::move(ret));
}
return v;
}
int main() {
std::vector<std::future<int>> results;
results = doParallelProcessing();
for (std::future<int>& result : results)
{
result.get();
}
std::cout << "Main end\n";
return 0;
}
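As far as the standard library's guarantees go, std::future::get() blocks until the associated task has produced its result, so code placed after a loop that calls get() on every future runs only once all tasks are done. A minimal sketch that checks this, assuming a hypothetical helper finished_after_all_gets (not part of the question's code), with the tasks deliberately finishing out of order:

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <future>
#include <thread>
#include <vector>

// Launches n tasks, calls get() on each future, and returns how many
// tasks had finished by the time the last get() returned.
int finished_after_all_gets(int n) {
    assert(n >= 0);
    std::atomic<int> finished{0};
    std::vector<std::future<int>> results;
    for (int i = 0; i < n; ++i) {
        // i and n are captured by value; finished outlives every get().
        results.push_back(std::async(std::launch::async, [i, n, &finished] {
            // Later tasks sleep less, so they complete out of order.
            std::this_thread::sleep_for(std::chrono::milliseconds(5 * (n - i)));
            ++finished;
            return 5;
        }));
    }
    for (auto& result : results) result.get();  // blocks until that task is done
    return finished.load();  // this is the point where "Main end" would print
}
```

The returned count always equals n, which corresponds to "Main end" being printed last.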
As an early-stage C++/threading coder, I am having a hard time with race conditions in one of my test functions and would truly appreciate some feedback.
My parent() function takes a rather large vector of images (cv::Mat from OpenCV) as input, and the task is to apply an operator to each one separately (e.g. dilation). I wrote a loop that creates threads using a worker() function and passes each thread a subset of my input vector.
The result from each thread is to be stored in that input subset vector. My problem is that I cannot retrieve it from within parent().
As an alternative, I passed the entire vector to worker() with start and end indices for each thread, but then I ran into serious race conditions, and the code took more time than the serial approach.
Please see my code below.
std::vector<cv::Mat> worker(std::vector<cv::Mat>& ctn);
std::vector<cv::Mat> worker(std::vector<cv::Mat>& ctn) {
int erosion_type = cv::MORPH_RECT;
int erosion_size = 5;
cv::Mat element = cv::getStructuringElement( erosion_type,
cv::Size( 2*erosion_size + 1, 2*erosion_size+1 ),
cv::Point( erosion_size, erosion_size ) );
this_mutex.lock();
for(uint it=0; it<ctn.size(); ++it) {
cv::erode(ctn[it], ctn[it], element);
}
this_mutex.unlock();
return ctn;
}
void parent(std::vector<cv::Mat>& imageSet) {
auto start = std::chrono::steady_clock::now();
const auto processor_count = std::thread::hardware_concurrency();
std::vector<std::thread> threads;
const int grainsize = imageSet.size() / processor_count;
uint work_iter = 0;
std::vector<cv::Mat> target; // holds the output vector
// create the threads
for(uint it=0; it<processor_count-1; ++it) {
std::vector<cv::Mat> subvec(imageSet.begin() + work_iter, imageSet.begin() + work_iter + grainsize);
threads.emplace_back([&,it]() {
std::vector<cv::Mat> tmp = worker(subvec);
target.insert(target.end(), tmp.begin(), tmp.end());
});
work_iter += grainsize;
}
// create the last thread for the remainder of the vector elements
std::vector<cv::Mat> subvec(imageSet.begin() + work_iter, imageSet.end());
int it = processor_count-1;
threads.emplace_back([&,it]() {
std::vector<cv::Mat> tmp = worker(subvec);
target.insert(target.end(), tmp.begin(), tmp.end());
});
// join the threads
for(int i=0; i<threads.size(); ++i) {
threads[i].join();
}
auto end = std::chrono::steady_clock::now();
std::chrono::duration<double> elapsed_seconds = end-start;
std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";
// try to reconstruct the output
imageSet.clear();
for(int i=0; i<target.size(); ++i) {
imageSet.push_back(target[i]);
}
}
In this code, the statement target.insert(target.end(), tmp.begin(), tmp.end()) is meant to append each thread's result to the target vector, but it does not execute in time, so I end up with an empty target.
Any ideas how to get target to collect all the tmp vectors?
Were you thinking of something like this?
This processes the items individually, but you can chunk the input however you want and return a vector from the lambda instead.
Note: This is in C++11, since that is what you tagged. If you have access to 17, this becomes a whole lot simpler.
#include <vector>
#include <algorithm>
#include <numeric>
#include <future>
#include <iostream>
int main()
{
std::vector<int> input{0,1,2,3,4,5,6,7,8,9,10};
for(const auto& item : input)
{
std::cout << item << " ";
}
std::cout << std::endl;
std::vector<std::future<int>> threads{};
for(const auto& item : input)
{
threads.push_back(std::async(std::launch::async, [&item]{
return item * 100;
}));
}
std::vector<int> output{};
for(auto& thread : threads)
{
output.push_back(thread.get());
}
for(const auto& item : output)
{
std::cout << item << " ";
}
return 0;
}
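The chunked variant mentioned above could look roughly like this. It is a sketch under assumptions, not the answerer's code: parallel_map is a hypothetical name, C++14 is assumed for lambda return type deduction, and each task returns its own vector so that no mutex or shared output container is needed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Applies op to every element, splitting the input into at most `chunks`
// contiguous pieces that are processed concurrently.
template <typename T, typename Op>
std::vector<T> parallel_map(const std::vector<T>& input, std::size_t chunks, Op op) {
    assert(chunks > 0);
    const std::size_t n = input.size();
    const std::size_t grain = (n + chunks - 1) / chunks;  // ceil(n / chunks)

    std::vector<std::future<std::vector<T>>> parts;
    for (std::size_t first = 0; first < n; first += grain) {
        const std::size_t last = std::min(first + grain, n);
        parts.push_back(std::async(std::launch::async, [&input, first, last, op] {
            std::vector<T> out;
            out.reserve(last - first);
            for (std::size_t i = first; i < last; ++i) out.push_back(op(input[i]));
            return out;  // each task owns its output, so no lock is needed
        }));
    }

    std::vector<T> result;
    result.reserve(n);
    for (auto& part : parts) {  // get() in submission order preserves input order
        std::vector<T> chunk = part.get();
        result.insert(result.end(), chunk.begin(), chunk.end());
    }
    return result;
}
```

For the image case, op would wrap the cv::erode call and T would be cv::Mat; that part is left out here to keep the sketch free of OpenCV.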
One result (res) for each thread.
#include <iostream>
#include <thread>
#include <vector>
#include <algorithm>
#include <cassert>
void threadFunction (std::vector<int> &images, int start, int end, std::vector<int>& res);
int main()
{
std::vector<int> images (100000);
auto processor_count = std::thread::hardware_concurrency();
auto step = images.size() / processor_count;
auto startFrom = 0;
// one result vector (res) for each thread (t).
std::vector<std::thread>t;
std::vector<std::vector<int>>res (processor_count);
// Start the threads
for (auto i = 0; i < processor_count; ++i)
{
auto th = std::thread(threadFunction, std::ref(images), startFrom, startFrom+step, std::ref(res[i]));
t.push_back(std::move(th));
startFrom += step;
}
// Join
std::for_each(begin(t), end(t), [](std::thread &t)
{
assert(t.joinable());
t.join();
});
// Results here. Each thread puts the results in res[i];
return 0;
}
void threadFunction (std::vector<int> &images, int start, int end, std::vector<int>& res)
{
for (int i = start; i < end; ++i) // half-open range [start, end): no overlap between chunks, no read past the end
res.push_back(images[i]);
}
I tried to implement a C++ thread pool based on some notes made by others. The code looks like this:
#include <vector>
#include <queue>
#include <functional>
#include <future>
#include <atomic>
#include <condition_variable>
#include <thread>
#include <mutex>
#include <memory>
#include <glog/logging.h>
#include <iostream>
#include <chrono>
using std::cout;
using std::endl;
class ThreadPool {
public:
ThreadPool(const ThreadPool&) = delete;
ThreadPool(ThreadPool&&) = delete;
ThreadPool& operator=(const ThreadPool&) = delete;
ThreadPool& operator=(ThreadPool&&) = delete;
ThreadPool(uint32_t capacity=std::thread::hardware_concurrency(),
uint32_t n_threads=std::thread::hardware_concurrency()
): capacity(capacity), n_threads(n_threads) {
init(capacity, n_threads);
}
~ThreadPool() noexcept {
shutdown();
}
void init(uint32_t capacity, uint32_t n_threads) {
CHECK_GT(capacity, 0) << "task queue capacity should be greater than 0";
CHECK_GT(n_threads, 0) << "thread pool capacity should be greater than 0";
for (int i{0}; i < n_threads; ++i) {
pool.emplace_back(std::thread([this] {
std::function<void(void)> task;
while (!this->stop) {
{
std::unique_lock<std::mutex> lock(this->q_mutex);
task_q_empty.wait(lock, [&] {return this->stop | !task_q.empty();});
if (this->stop) break;
task = this->task_q.front();
this->task_q.pop();
task_q_full.notify_one();
}
// auto id = std::this_thread::get_id();
// std::cout << "thread id is: " << id << std::endl;
task();
}
}));
}
}
void shutdown() {
stop = true;
task_q_empty.notify_all();
task_q_full.notify_all();
for (auto& thread : pool) {
if (thread.joinable()) {
thread.join();
}
}
}
template<typename F, typename...Args>
auto submit(F&& f, Args&&... args) -> std::future<decltype(f(args...))> {
using res_type = decltype(f(args...));
std::function<res_type(void)> func = std::bind(std::forward<F>(f), std::forward<Args>(args)...);
auto task_ptr = std::make_shared<std::packaged_task<res_type()>>(func);
{
std::unique_lock<std::mutex> lock(q_mutex);
task_q_full.wait(lock, [&] {return this->stop | task_q.size() <= capacity;});
CHECK (this->stop == false) << "should not add task to stopped queue\n";
task_q.emplace([task_ptr]{(*task_ptr)();});
}
task_q_empty.notify_one();
return task_ptr->get_future();
}
private:
std::vector<std::thread> pool;
std::queue<std::function<void(void)>> task_q;
std::condition_variable task_q_full;
std::condition_variable task_q_empty;
std::atomic<bool> stop{false};
std::mutex q_mutex;
uint32_t capacity;
uint32_t n_threads;
};
int add(int a, int b) {return a + b;}
int main() {
auto t1 = std::chrono::steady_clock::now();
int n_threads = 1;
ThreadPool tp;
tp.init(n_threads, 1024);
std::vector<std::future<int>> res;
for (int i{0}; i < 1000000; ++i) {
res.push_back(tp.submit(add, i, i+1));
}
auto t2 = std::chrono::steady_clock::now();
for (auto &el : res) {
el.get();
// cout << el.get() << endl;
}
tp.shutdown();
cout << "processing: "
<< std::chrono::duration<double, std::milli>(t2 - t1).count()
<< endl;
return 0;
}
The problem is that with n_threads=1 the program takes the same amount of time as with n_threads=4. Since my machine has 72 cores (according to htop), I expected the 4-thread setting to be faster than the 1-thread setting. What is wrong with this implementation of the thread pool?
I found a few issues:
1) Use logical OR (||) instead of bitwise OR (|) in both condition-variable waits:
Replace this - `task_q_empty.wait(lock, [&] {return this->stop | !task_q.empty();});`
By - `task_q_empty.wait(lock, [&] {return this->stop || !task_q.empty();});`
2) Use notify_all() in place of notify_one() in init() and submit().
3) Two condition_variables are unnecessary here; use only task_q_empty.
4) Your use case is not ideal. The cost of switching between threads may outweigh the cost of adding two integers, so more threads can even lead to a longer execution time. Test with an optimized build. Try a scenario like this to simulate a longer-running task:
int add(int a, int b) { std::this_thread::sleep_for(std::chrono::milliseconds(200)); return a + b; }
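Two more things in the question's main() look relevant. The constructor already calls init(), and the later tp.init(n_threads, 1024) passes its arguments in the opposite order to init(capacity, n_threads), so it appears to start 1024 additional threads regardless of n_threads. Also, t2 is taken before the get() loop, so the printed time covers only the submission of tasks. For comparison, here is a simplified sketch of a pool with a single condition variable. SimplePool is a hypothetical name, the queue is unbounded (unlike the question's capacity limit), and the workers are started only in the constructor. In this sketch notify_one() is used on submit because each new job needs only one worker.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class SimplePool {
public:
    explicit SimplePool(std::size_t n_threads) {
        assert(n_threads > 0);
        for (std::size_t i = 0; i < n_threads; ++i)
            workers_.emplace_back([this] { run(); });
    }
    SimplePool(const SimplePool&) = delete;
    SimplePool& operator=(const SimplePool&) = delete;

    // Lets the workers drain the queue, then joins them.
    ~SimplePool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        ready_.notify_all();
        for (auto& w : workers_) w.join();
    }

    // Queues f(args...) and returns a future for its result.
    template <typename F, typename... Args>
    auto submit(F&& f, Args&&... args) -> std::future<decltype(f(args...))> {
        using R = decltype(f(args...));
        auto task = std::make_shared<std::packaged_task<R()>>(
            std::bind(std::forward<F>(f), std::forward<Args>(args)...));
        std::future<R> result = task->get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.emplace([task] { (*task)(); });
        }
        ready_.notify_one();
        return result;
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                ready_.wait(lock, [this] { return stop_ || !queue_.empty(); });
                if (queue_.empty()) return;  // stop requested and nothing left
                job = std::move(queue_.front());
                queue_.pop();
            }
            job();  // run outside the lock so other workers can proceed
        }
    }

    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::function<void()>> queue_;
    bool stop_ = false;  // guarded by mutex_
    std::vector<std::thread> workers_;
};
```

To compare thread counts fairly, the timing would need to include the loop that calls get() on the returned futures.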
Let's say we have a function odd, which is a bool(int) function. I'd like to execute this function in parallel, but with different parameters (different numbers).
bool odd(int i) { return (((i&1)==1)?true:false); }
Here's the code I'm trying to use (which works but has a wart).
std::size_t num = 256;
std::vector<bool> results(num);
std::vector<std::function<bool(int)>> funcs(num);
std::vector<std::packaged_task<bool(int)>> tasks(num);
std::vector<std::future<bool>> futures(num);
std::vector<std::thread> threads(num);
for (std::size_t i = 0; i < num; i++) {
results[i] = false;
funcs[i] = std::bind(odd, static_cast<int>(i));
tasks[i] = std::packaged_task<bool(int)>(funcs[i]);
futures[i] = tasks[i].get_future();
threads[i] = std::thread(std::move(tasks[i]),0); // args ignored
}
for (std::size_t i = 0; i < num; i++) {
results[i] = futures[i].get();
threads[i].join();
}
for (std::size_t i = 0; i < num; i++) {
printf("odd(%zu)=%s\n", i, (results[i]?"true":"false")); // %zu is the format specifier for std::size_t
}
I'd like to get rid of the arguments to the thread creation, as they are dependent on the argument types of the function bool(int). I'd like to make a function template of this code and be able to make a massive parallel function executor.
template <typename _returnType, typename ..._argTypes>
void exec_and_collect(std::vector<_returnType>& results,
std::vector<std::function<_returnType(_argTypes...)>> funcs) {
std::size_t numTasks = (funcs.size() > results.size() ? results.size() : funcs.size());
std::vector<std::packaged_task<_returnType(_argTypes...)>> tasks(numTasks);
std::vector<std::future<_returnType>> futures(numTasks);
std::vector<std::thread> threads(numTasks);
for (std::size_t h = 0; h < numTasks; h++) {
tasks[h] = std::packaged_task<_returnType(_argTypes...)>(funcs[h]);
futures[h] = tasks[h].get_future();
threads[h] = std::thread(std::move(tasks[h]), 0); // zero is a wart
}
// threads are now running, collect results
for (std::size_t h = 0; h < numTasks; h++) {
results[h] = futures[h].get();
threads[h].join();
}
}
Then called like this:
std::size_t num = 8;
std::vector<bool> results(num);
std::vector<std::function<bool(int)>> funcs(num);
for (std::size_t i = 0; i < num; i++) {
funcs[i] = std::bind(odd, static_cast<int>(i));
}
exec_and_collect<bool,int>(results, funcs);
I'd like to remove the zero in the std::thread(std::move(task), 0); line, since it's completely ignored by the thread. If I remove it completely, the compiler can't find the arguments to pass to the thread creation and compilation fails.
You could simply stop micromanaging in the generic code. Just accept any task of the form returntype() and let the caller handle the binding of arguments:
#include <thread>
#include <future>
#include <iostream>
#include <vector>
#include <functional>
#include <algorithm>
bool odd(int i) { return (((i&1)==1)?true:false); }
template <typename _returnType>
void exec_and_collect(std::vector<_returnType>& results,
std::vector<std::function<_returnType()>> funcs
) {
std::size_t numTasks = std::min(funcs.size(), results.size());
std::vector<std::packaged_task<_returnType()>> tasks(numTasks);
std::vector<std::future<_returnType>> futures(numTasks);
std::vector<std::thread> threads(numTasks);
for (std::size_t h = 0; h < numTasks; h++) {
tasks[h] = std::packaged_task<_returnType()>(funcs[h]);
futures[h] = tasks[h].get_future();
threads[h] = std::thread(std::move(tasks[h]));
}
// threads are now running, collect results
for (std::size_t h = 0; h < numTasks; h++) {
results[h] = futures[h].get();
threads[h].join();
}
}
int main() {
std::size_t num = 8;
std::vector<bool> results(num);
std::vector<std::function<bool()>> funcs(num);
for (std::size_t i = 0; i < num; i++) {
funcs[i] = std::bind(odd, static_cast<int>(i));
}
exec_and_collect<bool>(results, funcs);
}
Note that this is a quick job; I still see quite a few things here that are overly specific.
In particular, all the temporary collections are just dead weight (you move each tasks[h] out of the vector before even getting to the next task, so why keep a vector of dead bits?)
There's no scheduling at all; you just create new threads willy-nilly. That's not going to scale (also, you want pluggable pooling models; see the Executor specifications and Boost Async's implementation of these)
UPDATE
A somewhat more cleaned up version that demonstrates what unneeded dependencies can be shed:
no temporary vectors of packaged tasks/threads
no assumption/requirement to have std::function<> wrapped tasks (this removes dynamic allocations and virtual dispatch internally in the implementation)
no requirement that the results must be in a vector (in fact, you can collect them anywhere you want using a custom output iterator)
move-awareness (this is arguably a "complicated" part of the code: since there is no std::move_transform, we go the extra mile using std::make_move_iterator)
#include <thread>
#include <future>
#include <iostream>
#include <vector>
#include <algorithm>
#include <iterator>
#include <boost/range.hpp>
bool odd(int i) { return (((i&1)==1)?true:false); }
template <typename Range, typename OutIt>
void exec_and_collect(OutIt results, Range&& tasks) {
using namespace std;
using T = typename boost::range_value<Range>::type;
using R = decltype(declval<T>()());
auto tb = std::make_move_iterator(boost::begin(tasks)),
te = std::make_move_iterator(boost::end(tasks));
vector<future<R>> futures;
transform(
tb, te,
back_inserter(futures), [](auto&& t) {
std::packaged_task<R()> task(std::forward<decltype(t)>(t));
auto future = task.get_future();
thread(std::move(task)).detach();
return future;
});
// threads are now running, collect results
transform(begin(futures), end(futures), results, [](auto& fut) { return fut.get(); });
}
#include <boost/range/irange.hpp>
#include <boost/range/adaptors.hpp>
using namespace boost::adaptors;
int main() {
std::vector<bool> results;
exec_and_collect(
std::back_inserter(results),
boost::irange(0, 8) | transformed([](int i) { return [i] { return odd(i); }; })
);
std::copy(results.begin(), results.end(), std::ostream_iterator<bool>(std::cout << std::boolalpha, "; "));
}
Output
false; true; false; true; false; true; false; true;
Note that you could indeed write
exec_and_collect(
std::ostream_iterator<bool>(std::cout << std::boolalpha, "; "),
boost::irange(0, 8) | transformed([](int i) { return [i] { return odd(i); }; })
);
and do without any results container :)
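If Boost is not available, a similar shape can be had with the standard library alone. This is a sketch under assumptions rather than a drop-in replacement: exec_and_collect_std and make_odd_tasks are hypothetical names, tasks are copied rather than moved, and std::async with std::launch::async still typically starts one thread per task, so the scheduling concern above remains.

```cpp
#include <functional>
#include <future>
#include <iterator>
#include <utility>
#include <vector>

// Runs every nullary task in [first, last) asynchronously and writes the
// results to out in the original task order.
template <typename InIt, typename OutIt>
OutIt exec_and_collect_std(InIt first, InIt last, OutIt out) {
    using Task = typename std::iterator_traits<InIt>::value_type;
    using R = decltype(std::declval<Task&>()());
    std::vector<std::future<R>> futures;
    for (; first != last; ++first)
        futures.push_back(std::async(std::launch::async, *first));  // copies the task
    for (auto& fut : futures) *out++ = fut.get();
    return out;
}

// Example input: n tasks, the i-th reporting whether i is odd.
std::vector<std::function<bool()>> make_odd_tasks(int n) {
    std::vector<std::function<bool()>> tasks;
    for (int i = 0; i < n; ++i) tasks.push_back([i] { return (i & 1) == 1; });
    return tasks;
}
```

The output iterator can again be a std::back_inserter or a std::ostream_iterator, as in the Boost version.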
I have a loop over a set in which I have to perform an expensive calculation. I want to do this in parallel using the future class. As far as I understand, async either starts the thread immediately or defers it and starts it only when I call get() or wait(). So when some tasks have not been started and I try to get their results, I block the main thread and get sequential processing. Is there a way to start the remaining deferred tasks so that everything is calculated in parallel and get() does not block?
// do the calculations
std::vector<std::future<myClass>> futureList; // myClass: placeholder for the result type of fct
for (auto elem : container)
{
futureList.push_back(std::async(fct, elem));
}
// start remaining processes
// use the results
for (auto& elem : futureList) // futures are move-only, so iterate by reference
{
processResult(elem.get());
}
Thanks for your help.
You might use:
std::async(std::launch::async, fct, elem)
Sample:
#include <iostream>
#include <future>
#include <chrono>
#include <cstdlib>
#include <thread>
#include <vector>
#include <stdexcept>
bool work() {
std::this_thread::sleep_for(std::chrono::milliseconds(1000));
if( ! (std::rand() % 2)) throw std::runtime_error("Exception");
return true;
}
int main() {
const unsigned Elements = 10;
typedef std::vector<std::future<bool>> future_container;
future_container futures;
for(unsigned i = 0; i < Elements; ++i)
{
futures.push_back(std::async(std::launch::async, work));
std::this_thread::sleep_for(std::chrono::milliseconds(10));
}
while( ! futures.empty()) {
future_container::iterator f = futures.begin();
while(f != futures.end())
{
if(f->wait_for(std::chrono::milliseconds(100)) == std::future_status::timeout) ++f;
else {
// Note: exceptions thrown while the task was running
// are rethrown here by get().
// (See 30.6.6 Class template future)
try {
std::cout << f->get() << '\n';
}
catch(const std::exception& e) {
std::cout << e.what() << '\n';
}
f = futures.erase(f);
}
}
}
return 0;
}
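To check which behaviour a given future has, wait_for() with a zero timeout can be used: it returns std::future_status::deferred for a task that has not been started and will only run inside get() or wait(). A minimal sketch, where is_deferred and answer are hypothetical names introduced for illustration:

```cpp
#include <cassert>
#include <chrono>
#include <future>

// Reports whether a future refers to a deferred task, i.e. one that has
// not been started and will only run inside get() or wait().
template <typename T>
bool is_deferred(const std::future<T>& f) {
    assert(f.valid());
    return f.wait_for(std::chrono::seconds(0)) == std::future_status::deferred;
}

// Trivial task used for the demonstration.
int answer() { return 42; }
```

With std::launch::deferred the check reports true, with std::launch::async it reports false; with the default policy the implementation is free to choose either, which is why passing std::launch::async explicitly is the reliable way to get parallel execution.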
You may do something like this (http://coliru.stacked-crooked.com/a/005c7d2345ad791c):
Create this function:
void processResult_async(std::future<myClass>& f) { processResult(f.get()); }
And then
// use the results
std::vector<std::future<void>> results;
for (auto& elem : futureList)
{
results.push_back(std::async(std::launch::async, processResult_async, std::ref(elem)));
}