Start remaining futures without blocking - c++

I have a loop over a set where I have to perform an expensive calculation. I want to do this in parallel using the future class. As far as I understand, async either starts the thread or defers it and starts it only when I call get() or wait(). So, when I have threads that have not started and I try to get the result, I block the main thread and get sequential processing. Is there a way to start the remaining deferred processes, so that everything is calculated in parallel and get() will not block?
// do the calculations
std::vector<std::future<myClass>> futureList;
for (auto elem : container)
{
futureList.push_back(std::async(fct, elem));
}
// start remaining processes
// use the results
for (auto& elem : futureList) // futures are move-only, so iterate by reference
{
processResult(elem.get());
}
Thanks for your help.

You might use:
std::async(std::launch::async, fct, elem)
Sample:
#include <iostream>
#include <future>
#include <chrono>
#include <thread>
#include <vector>
#include <stdexcept>
#include <cstdlib>
bool work() {
std::this_thread::sleep_for(std::chrono::milliseconds(1000));
if( ! (std::rand() % 2)) throw std::runtime_error("Exception");
return true;
}
int main() {
const unsigned Elements = 10;
typedef std::vector<std::future<bool>> future_container;
future_container futures;
for(unsigned i = 0; i < Elements; ++i)
{
futures.push_back(std::async(std::launch::async, work));
std::this_thread::sleep_for(std::chrono::milliseconds(10));
}
while( ! futures.empty()) {
future_container::iterator f = futures.begin();
while(f != futures.end())
{
if(f->wait_for(std::chrono::milliseconds(100)) == std::future_status::timeout) ++f;
else {
// Note: exceptions thrown by the task are rethrown here by get().
// (See 30.6.6 Class template future)
try {
std::cout << f->get() << '\n';
}
catch(const std::exception& e) {
std::cout << e.what() << '\n';
}
f = futures.erase(f);
}
}
}
return 0;
}
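To check whether a future is still deferred without blocking on it, one option is wait_for() with a zero timeout, which reports std::future_status::deferred for a task that has not been started. The following is only a minimal sketch: is_deferred, square and sum_of_squares are hypothetical names, and square stands in for the expensive calculation.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <vector>

// Returns true if the task behind `f` was launched with the deferred policy
// and has not run yet. A zero-timeout wait_for() does not block and does not
// run the deferred function.
template <typename T>
bool is_deferred(const std::future<T>& f) {
    return f.wait_for(std::chrono::seconds(0)) == std::future_status::deferred;
}

// Hypothetical stand-in for the expensive calculation.
int square(int x) { return x * x; }

// Starts one task per value in [0, n) with std::launch::async, so every task
// begins running before the first get() is called, then sums the results.
int sum_of_squares(int n) {
    assert(n >= 0);
    std::vector<std::future<int>> futures;
    for (int i = 0; i < n; ++i) {
        futures.push_back(std::async(std::launch::async, square, i));
    }
    int total = 0;
    for (auto& f : futures) total += f.get();
    return total;
}
```

With std::launch::async the check is never true, so the get() loop only waits for work that is already in progress.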

You could do something like this (http://coliru.stacked-crooked.com/a/005c7d2345ad791c):
Create this function:
void processResult_async(std::future<myClass>& f) { processResult(f.get()); }
And then
// use the results
std::vector<std::future<void>> results;
for (auto& elem : futureList)
{
results.push_back(std::async(std::launch::async, processResult_async, std::ref(elem)));
}

Related

C++ Variable number of async tasks

In this example I have 3 tasks which should run in parallel. This works really well. But now the number of tasks is variable. How can I handle that? I tried something like a vector of lambda expressions, but that didn't work. Thanks for helping!
#include <iostream>
#include <future>
#include "myclass.h"
#define N 3
int main()
{
// Create instances
std::vector<myclass*> instances;
for(int i = 0; i < N; i++){
myclass *mc = new myclass();
instances.push_back(mc);
}
auto f0 = std::async(std::launch::async, [&](){
bool finish = false;
while(!finish){
if(instances[0]->do() != 0) finish = true;
}
});
auto f1 = std::async(std::launch::async, [&](){
bool finish = false;
while(!finish){
if(instances[1]->do() != 0) finish = true;
}
});
auto f2 = std::async(std::launch::async, [&](){
bool finish = false;
while(!finish){
if(instances[2]->do() != 0) finish = true;
}
});
f0.wait(); f1.wait(); f2.wait();
// delete instances
for(int i = 0; i < N; i++){
delete instances[i];
}
return 0;
}
You may consider using std::function. A demo program:
#include <functional>
#include <future>
#include <iostream>
#include <string>
#include <vector>
int main(int argc, char* argv[]) {
std::vector<std::function<void()>> vec;
for (size_t i = 0; i < 10; ++i) {
vec.emplace_back(
[i = i] { std::cout << "hi" + std::to_string(i) << std::endl; });
}
std::vector<std::future<void>> futs;
for (const auto& f : vec) {
futs.emplace_back(std::async(std::launch::async, f));
}
for (auto& f : futs) f.wait();
return 0;
}
Generally std::async is not well suited to large programs; consider using a thread-pool-based framework such as:
tbb
folly
boost-asio
poco
Use a loop and create a new lambda with different binding context every iteration:
std::vector<std::future<void>> tasks;
for (auto& instance : instances) {
tasks.push_back(
std::async(std::launch::async, [&, instance]() {
bool finish = false;
while (!finish) {
if (instance->do() != 0) finish = true;
}
})
);
}
// wait for all tasks to finish
for (auto& task : tasks) task.wait();

Concurrent program compiled with clang runs fine, but hangs with gcc

I wrote a class to share a limited number of resources (for instance network interfaces) between a larger number of threads. The resources are pooled and, if not in use, they are borrowed out to the requesting thread, which otherwise waits on a condition_variable.
Nothing really exotic: apart from the fancy scoped_lock, which requires C++17, it should be good old C++11.
Both gcc 10.2 and clang 11 compile the test main fine, but while the latter produces an executable that does pretty much what is expected, the former hangs without consuming CPU (deadlock?).
With the help of https://godbolt.org/ I tried older versions of gcc and also icc (passing options -O3 -std=c++17 -pthread), all reproducing the bad result, while even there clang confirms the proper behavior.
I wonder whether I made a mistake or whether the code triggers some compiler misbehavior and, if so, how to work around it.
#include <iostream>
#include <vector>
#include <stdexcept>
#include <mutex>
#include <condition_variable>
template <typename T>
class Pool {
///////////////////////////
class Borrowed {
friend class Pool<T>;
Pool<T>& pool;
const size_t id;
T * val;
public:
Borrowed(Pool & p, size_t i, T& v): pool(p), id(i), val(&v) {}
~Borrowed() { release(); }
T& get() const {
if (!val) throw std::runtime_error("Borrowed::get() this resource was collected back by the pool");
return *val;
}
void release() { pool.collect(*this); }
};
///////////////////////////
struct Resource {
T val;
bool available = true;
Resource(T v): val(std::move(v)) {}
};
///////////////////////////
std::vector<Resource> vres;
size_t hint = 0;
std::condition_variable cv;
std::mutex mtx;
size_t available_cnt;
public:
Pool(std::initializer_list<T> l): available_cnt(l.size()) {
vres.reserve(l.size());
for (T t: l) {
vres.emplace_back(std::move(t));
}
std::cout << "Pool has size " << vres.size() << std::endl;
}
~Pool() {
for ( auto & res: vres ) {
if ( ! res.available ) {
std::cerr << "WARNING Pool::~Pool resources are still in use\n";
}
}
}
Borrowed borrow() {
std::unique_lock<std::mutex> lk(mtx);
cv.wait(lk, [&](){return available_cnt > 0;});
if ( vres[hint].available ) {
// quick path, if hint points to an available resource
std::cout << "hint good" << std::endl;
vres[hint].available = false;
--available_cnt;
Borrowed b(*this, hint, vres[hint].val);
if ( hint + 1 < vres.size() ) ++hint;
return b; // <--- gcc seems to hang here
} else {
// full scan to find the available resource
std::cout << "hint bad" << std::endl;
for ( hint = 0; hint < vres.size(); ++hint ) {
if ( vres[hint].available ) {
vres[hint].available = false;
--available_cnt;
return Borrowed(*this, hint, vres[hint].val);
}
}
}
throw std::runtime_error("Pool::borrow() no resource is available - internal logic error");
}
void collect(Borrowed & b) {
if ( &(b.pool) != this )
throw std::runtime_error("Pool::collect() trying to collect resource owned by another pool!");
if ( b.val ) {
b.val = nullptr;
{
std::scoped_lock<std::mutex> lk(mtx);
hint = b.id;
vres[hint].available = true;
++available_cnt;
}
cv.notify_one();
}
}
};
///////////////////////////////////////////////////////////////////
#include <thread>
#include <chrono>
int main() {
Pool<std::string> pool{"hello","world"};
std::vector<std::thread> vt;
for (int i = 10; i > 0; --i) {
vt.emplace_back( [&pool, i]()
{
auto res = pool.borrow();
std::this_thread::sleep_for(std::chrono::milliseconds(i*300));
std::cout << res.get() << std::endl;
}
);
}
for (auto & t: vt) t.join();
return 0;
}
You're running into undefined behavior because you effectively re-lock a mutex that the same thread already holds. With MSVC I got a helpful call stack that made this visible. Here is a fixed example that works for me (see the changes in the borrow() method; it could be redesigned further, since locking inside a destructor is questionable):
#include <iostream>
#include <vector>
#include <stdexcept>
#include <mutex>
#include <condition_variable>
template <typename T>
class Pool {
///////////////////////////
class Borrowed {
friend class Pool<T>;
Pool<T>& pool;
const size_t id;
T * val;
public:
Borrowed(Pool & p, size_t i, T& v) : pool(p), id(i), val(&v) {}
~Borrowed() { release(); }
T& get() const {
if (!val) throw std::runtime_error("Borrowed::get() this resource was collected back by the pool");
return *val;
}
void release() { pool.collect(*this); }
};
///////////////////////////
struct Resource {
T val;
bool available = true;
Resource(T v) : val(std::move(v)) {}
};
///////////////////////////
std::vector<Resource> vres;
size_t hint = 0;
std::condition_variable cv;
std::mutex mtx;
size_t available_cnt;
public:
Pool(std::initializer_list<T> l) : available_cnt(l.size()) {
vres.reserve(l.size());
for (T t : l) {
vres.emplace_back(std::move(t));
}
std::cout << "Pool has size " << vres.size() << std::endl;
}
~Pool() {
for (auto & res : vres) {
if (!res.available) {
std::cerr << "WARNING Pool::~Pool resources are still in use\n";
}
}
}
Borrowed borrow() {
std::unique_lock<std::mutex> lk(mtx);
while (available_cnt == 0) cv.wait(lk);
if (vres[hint].available) {
// quick path, if hint points to an available resource
std::cout << "hint good" << std::endl;
vres[hint].available = false;
--available_cnt;
Borrowed b(*this, hint, vres[hint].val);
if (hint + 1 < vres.size()) ++hint;
lk.unlock();
return b; // <--- gcc seems to hang here
}
else {
// full scan to find the available resource
std::cout << "hint bad" << std::endl;
for (hint = 0; hint < vres.size(); ++hint) {
if (vres[hint].available) {
vres[hint].available = false;
--available_cnt;
lk.unlock();
return Borrowed(*this, hint, vres[hint].val);
}
}
}
throw std::runtime_error("Pool::borrow() no resource is available - internal logic error");
}
void collect(Borrowed & b) {
if (&(b.pool) != this)
throw std::runtime_error("Pool::collect() trying to collect resource owned by another pool!");
if (b.val) {
b.val = nullptr;
{
std::scoped_lock<std::mutex> lk(mtx);
hint = b.id;
vres[hint].available = true;
++available_cnt;
cv.notify_one();
}
}
}
};
///////////////////////////////////////////////////////////////////
#include <thread>
#include <chrono>
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
int main()
{
try
{
Pool<std::string> pool{ "hello","world" };
std::vector<std::thread> vt;
for (int i = 10; i > 0; --i) {
vt.emplace_back([&pool, i]()
{
auto res = pool.borrow();
std::this_thread::sleep_for(std::chrono::milliseconds(i * 300));
std::cout << res.get() << std::endl;
}
);
}
for (auto & t : vt) t.join();
return 0;
}
catch(const std::exception& e)
{
std::cout << "exception occurred: " << e.what();
}
return 0;
}
A locking destructor coupled with a missed NRVO caused the issue (credits to Secundi for pointing this out in the comments).
If the compiler skips NRVO, returning from the few lines below (inside the if branch) copies b into the return value and then calls the destructor of the local b. That destructor tries to acquire the mutex before the unique_lock has released it, resulting in a deadlock.
Borrowed b(*this, hint, vres[hint].val);
if ( hint + 1 < vres.size() ) ++hint;
return b; // <--- gcc seems to hang here
It is crucial here to avoid destroying b. Even if manually releasing the unique_lock before returning avoids the deadlock, the destructor of b will still mark the pooled resource as available while it is being borrowed out, which makes the code wrong.
One possible fix is to replace the lines above with:
const auto tmp = hint;
if ( hint + 1 < vres.size() ) ++hint;
return Borrowed(*this, tmp, vres[tmp].val);
Another possibility (which does not exclude the former) is to delete the (evil) copy ctor of Borrowed and only provide a move ctor:
Borrowed(const Borrowed &) = delete;
Borrowed(Borrowed && b): pool(b.pool), id(b.id), val(b.val) { b.val = nullptr; }
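To illustrate why the move-only variant is safe whether or not NRVO kicks in, here is a minimal sketch that is independent of the Pool class. Counter, Handle and acquire are hypothetical names used only for this illustration; Counter stands in for the pool and merely counts outstanding handles.

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-in for the pool: it only counts outstanding handles.
struct Counter {
    int in_use = 0;
};

// Move-only handle. The destructor gives the resource back at most once,
// because a moved-from handle has owner == nullptr and does nothing.
class Handle {
    Counter* owner;
public:
    explicit Handle(Counter& c) : owner(&c) { ++owner->in_use; }
    Handle(const Handle&) = delete;
    Handle& operator=(const Handle&) = delete;
    Handle(Handle&& other) noexcept
        : owner(std::exchange(other.owner, nullptr)) {}
    ~Handle() {
        if (owner) {
            assert(owner->in_use > 0);
            --owner->in_use;
        }
    }
    bool valid() const { return owner != nullptr; }
};

// Returns a named local. With NRVO there is a single object; without it the
// local is moved into the return value and its destructor is a no-op.
Handle acquire(Counter& c) {
    Handle h(c);
    return h;
}
```

Either way the count goes up exactly once per acquire() and comes back down only when the final owner is destroyed.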

How to wait until all threads from the pool ends their work?

I am trying to implement a simple thread pool using the Boost library.
Here is the code:
//boost::asio::io_service ioService;
//boost::thread_group pool;
//boost::asio::io_service::work* worker;
ThreadPool::ThreadPool(int poolSize /*= boost::thread::hardware_concurrency()*/)
{
if (poolSize >= 1 && poolSize <= boost::thread::hardware_concurrency())
threadAmount = poolSize;
else
threadAmount = 1;
worker = NULL;
}
ThreadPool::~ThreadPool()
{
if (worker != NULL && !ioService.stopped())
{
_shutdown();
delete worker;
worker = NULL;
}
}
void ThreadPool::start()
{
if (worker != NULL)
{
return;
}
worker = new boost::asio::io_service::work(ioService);
for (int i = 0; i < threadAmount; ++i)
{
pool.create_thread(boost::bind(&boost::asio::io_service::run, &ioService));
}
}
template<class F, class...Args>
void ThreadPool::execute(F f, Args&&... args)
{
ioService.post(boost::bind(f, std::forward<Args>(args)...));
}
void ThreadPool::shutdown()
{
pool.interrupt_all();
_shutdown();
}
void ThreadPool::join_all()
{
// wait for all threads before continuing
// in other words: a barrier for all threads once they have finished all jobs,
// so that they can be re-used in the future.
}
void ThreadPool::_shutdown()
{
ioService.reset();
ioService.stop();
}
In my program I assign some tasks to the thread pool and continue with the main thread. At some point I need to wait for all threads to finish all tasks before I can proceed with the calculations. Is there any way to do this?
Thanks a lot.
As others have pointed out, the main culprit is the work instance.
I'd simplify the interface considerably: there's really no reason to split shutdown into shutdown, _shutdown and join_all, with some more logic in the destructor as well. That just makes it hard to know which responsibility lives where.
The interface should be a Pit Of Success - easy to use right, hard to use wrong.
At the same time it makes it much easier to implement it correctly.
Here's a first stab:
#include <boost/asio.hpp>
#include <boost/thread.hpp>
#include <algorithm>
#include <functional>
#include <memory>
namespace ba = boost::asio;
struct ThreadPool {
ThreadPool(unsigned poolSize = boost::thread::hardware_concurrency());
~ThreadPool();
void start();
template <typename F, typename... Args>
void execute(F f, Args&&... args) {
ioService.post(std::bind(f, std::forward<Args>(args)...));
}
private:
unsigned threadAmount;
ba::io_service ioService;
boost::thread_group pool;
std::unique_ptr<ba::io_service::work> work;
void shutdown();
};
ThreadPool::ThreadPool(
unsigned poolSize /*= boost::thread::hardware_concurrency()*/) {
threadAmount = std::min(boost::thread::hardware_concurrency(), poolSize);
threadAmount = std::max(1u, threadAmount);
}
ThreadPool::~ThreadPool() {
shutdown();
}
void ThreadPool::start() {
if (!work) {
work = std::make_unique<ba::io_service::work>(ioService);
for (unsigned i = 0; i < threadAmount; ++i) {
pool.create_thread(
boost::bind(&ba::io_service::run, &ioService));
}
}
}
void ThreadPool::shutdown() {
work.reset();
pool.interrupt_all();
ioService.stop();
pool.join_all();
ioService.reset();
}
#include <iostream>
using namespace std::chrono_literals;
int main() {
auto now = std::chrono::high_resolution_clock::now;
auto s = now();
{
ThreadPool p(10);
p.start();
p.execute([] { std::this_thread::sleep_for(1s); });
p.execute([] { std::this_thread::sleep_for(600ms); });
p.execute([] { std::this_thread::sleep_for(400ms); });
p.execute([] { std::this_thread::sleep_for(200ms); });
p.execute([] { std::this_thread::sleep_for(10ms); });
}
std::cout << "Total elapsed: " << (now() - s) / 1.0s << "s\n";
}
On most multi-core systems this will print something like what I get on mine:
Total elapsed: 1.00064s
It looks like you had an error in calculating threadAmount where you'd take 1 if poolSize was more than hardware_concurrency.
To be honest, why have the bind in the implementation? It really doesn't add a lot. You can leave it up to the callers, who can choose whether to use bind and, if so, whether it's boost::bind, std::bind or some other way of composing callables:
template <typename F>
void execute(F f) { ioService.post(f); }
You're missing exception handling around io_service::run calls (see Should the exception thrown by boost::asio::io_service::run() be caught?).
If you're using a recent Boost version, you can use the newer io_context and thread_pool interfaces, which greatly simplifies things:
#include <boost/asio.hpp>
#include <algorithm>
#include <thread>
struct ThreadPool {
ThreadPool(unsigned poolSize)
: pool(std::clamp(poolSize, 1u, std::thread::hardware_concurrency()))
{ }
template <typename F>
void execute(F f) { post(pool, f); }
private:
boost::asio::thread_pool pool;
};
This still has 99% of the functionality¹, but in 10 LoC.
In fact, the class has become a trivial wrapper, so we could just write:
#include <boost/asio.hpp>
#include <iostream>
using namespace std::chrono_literals;
using C = std::chrono::high_resolution_clock;
static void sleep_for(C::duration d) { std::this_thread::sleep_for(d); }
int main() {
auto s = C::now();
{
boost::asio::thread_pool pool;
post(pool, [] { sleep_for(1s); });
post(pool, [] { sleep_for(600ms); });
// still can bind if you want
post(pool, std::bind(sleep_for, 400ms));
post(pool, std::bind(sleep_for, 200ms));
post(pool, std::bind(sleep_for, 10ms));
//pool.join(); // implicit in destructor
}
std::cout << "Total elapsed: " << (C::now() - s) / 1.0s << "s\n";
}
The main difference is the default pool size: it is 2 * hardware concurrency (and it is also calculated more safely, because not all platforms have a reliable hardware_concurrency(); it could be zero, for example).
¹ It doesn't currently exercise interruption points
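Since hardware_concurrency() may return zero, a wrapper that clamps a caller-supplied size probably wants to guard the upper bound too. A minimal sketch using only the standard library (C++17 for std::clamp) follows; pool_size is a hypothetical helper, not part of Boost.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Clamps a requested thread count to [1, detected], treating a detected
// value of 0 ("unknown") as 1 so the clamp bounds stay valid.
unsigned pool_size(unsigned requested,
                   unsigned detected = std::thread::hardware_concurrency()) {
    const unsigned upper = std::max(1u, detected);
    assert(1u <= upper);  // std::clamp requires lo <= hi
    return std::clamp(requested, 1u, upper);
}
```

The second parameter exists only to make the behavior easy to exercise; in normal use it would be left at its default.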

Why cannot my c++ thread pool accelerate my program?

I tried to implement a C++ thread pool according to some notes made by others. The code looks like this:
#include <vector>
#include <queue>
#include <functional>
#include <future>
#include <atomic>
#include <condition_variable>
#include <thread>
#include <mutex>
#include <memory>
#include <glog/logging.h>
#include <iostream>
#include <chrono>
using std::cout;
using std::endl;
class ThreadPool {
public:
ThreadPool(const ThreadPool&) = delete;
ThreadPool(ThreadPool&&) = delete;
ThreadPool& operator=(const ThreadPool&) = delete;
ThreadPool& operator=(ThreadPool&&) = delete;
ThreadPool(uint32_t capacity=std::thread::hardware_concurrency(),
uint32_t n_threads=std::thread::hardware_concurrency()
): capacity(capacity), n_threads(n_threads) {
init(capacity, n_threads);
}
~ThreadPool() noexcept {
shutdown();
}
void init(uint32_t capacity, uint32_t n_threads) {
CHECK_GT(capacity, 0) << "task queue capacity should be greater than 0";
CHECK_GT(n_threads, 0) << "thread pool capacity should be greater than 0";
for (int i{0}; i < n_threads; ++i) {
pool.emplace_back(std::thread([this] {
std::function<void(void)> task;
while (!this->stop) {
{
std::unique_lock<std::mutex> lock(this->q_mutex);
task_q_empty.wait(lock, [&] {return this->stop | !task_q.empty();});
if (this->stop) break;
task = this->task_q.front();
this->task_q.pop();
task_q_full.notify_one();
}
// auto id = std::this_thread::get_id();
// std::cout << "thread id is: " << id << std::endl;
task();
}
}));
}
}
void shutdown() {
stop = true;
task_q_empty.notify_all();
task_q_full.notify_all();
for (auto& thread : pool) {
if (thread.joinable()) {
thread.join();
}
}
}
template<typename F, typename...Args>
auto submit(F&& f, Args&&... args) -> std::future<decltype(f(args...))> {
using res_type = decltype(f(args...));
std::function<res_type(void)> func = std::bind(std::forward<F>(f), std::forward<Args>(args)...);
auto task_ptr = std::make_shared<std::packaged_task<res_type()>>(func);
{
std::unique_lock<std::mutex> lock(q_mutex);
task_q_full.wait(lock, [&] {return this->stop | task_q.size() <= capacity;});
CHECK (this->stop == false) << "should not add task to stopped queue\n";
task_q.emplace([task_ptr]{(*task_ptr)();});
}
task_q_empty.notify_one();
return task_ptr->get_future();
}
private:
std::vector<std::thread> pool;
std::queue<std::function<void(void)>> task_q;
std::condition_variable task_q_full;
std::condition_variable task_q_empty;
std::atomic<bool> stop{false};
std::mutex q_mutex;
uint32_t capacity;
uint32_t n_threads;
};
int add(int a, int b) {return a + b;}
int main() {
auto t1 = std::chrono::steady_clock::now();
int n_threads = 1;
ThreadPool tp;
tp.init(n_threads, 1024);
std::vector<std::future<int>> res;
for (int i{0}; i < 1000000; ++i) {
res.push_back(tp.submit(add, i, i+1));
}
auto t2 = std::chrono::steady_clock::now();
for (auto &el : res) {
el.get();
// cout << el.get() << endl;
}
tp.shutdown();
cout << "processing: "
<< std::chrono::duration<double, std::milli>(t2 - t1).count()
<< endl;
return 0;
}
The problem is that when I set n_threads=1, the program takes the same length of time as when I set n_threads=4. Since my CPU has 72 cores (according to htop), I believed the 4-thread setting would be faster than the 1-thread setting. What is the problem with this implementation of the thread pool?
I found a few issues:
1) Use the logical OR (||) instead of the bitwise OR (|) in both condition-variable waits:
Replace this - `task_q_empty.wait(lock, [&] {return this->stop | !task_q.empty();});`
By - `task_q_empty.wait(lock, [&] {return this->stop || !task_q.empty();});`
2) Use notify_all() in place of notify_one() in init() and submit().
3) Two condition_variables are unnecessary here; use only task_q_empty.
4) Your use case is not ideal. The cost of switching between threads may outweigh the cost of adding two integers, so more threads can even make execution take longer. Test with optimizations enabled, and try a scenario like this to simulate a longer-running task:
int add(int a, int b) { std::this_thread::sleep_for(std::chrono::milliseconds(200)); return a + b; }
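Another way to make each task worth its scheduling cost is to submit a few large chunks instead of a million tiny additions. The sketch below uses std::async rather than the ThreadPool from the question, purely to stay self-contained; sum_range and chunked_sum are hypothetical helpers.

```cpp
#include <algorithm>
#include <cassert>
#include <future>
#include <vector>

int add(int a, int b) { return a + b; }

// Sums add(i, i + 1) for i in [begin, end) on the calling thread.
long long sum_range(int begin, int end) {
    long long total = 0;
    for (int i = begin; i < end; ++i) total += add(i, i + 1);
    return total;
}

// Splits [0, n) into at most `chunks` contiguous ranges and runs one
// asynchronous task per range, so per-task overhead is paid a handful of
// times instead of n times.
long long chunked_sum(int n, int chunks) {
    assert(n >= 0 && chunks > 0);
    const int step = std::max(1, (n + chunks - 1) / chunks);
    std::vector<std::future<long long>> parts;
    for (int begin = 0; begin < n; begin += step) {
        const int end = std::min(n, begin + step);
        parts.push_back(std::async(std::launch::async, sum_range, begin, end));
    }
    long long total = 0;
    for (auto& p : parts) total += p.get();
    return total;
}
```

The same idea should carry over to a pool: submit one task per chunk and combine the partial results after the futures are ready.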

Proper usage of std::atomic, pre increment value as function param

I have code where I need a unique id (a packet id for some protocol), so I used std::atomic<int>. After reading the documentation I was confused, because it states that the pre-increment is done this way: fetch_add(1)+1
I understand that the value inside fetch_add is incremented atomically, but I get the pre-increment value and then add 1 outside the atomic operation, which I would guess is not atomic.
Can I use some_func(++atomic_value)?
I wrote some simple code to check whether it works. It does, but I don't understand why.
#include <iostream>
#include <chrono>
#include <atomic>
#include <thread>
#include <vector>
#include <random>
#include <mutex>
#include <algorithm>
std::atomic<int> Index = 0;
//int Index = 0; // non atomic Index. It will generate duplicities
std::vector<int> Numbers;
std::mutex Mutex;
std::default_random_engine Generator;
std::uniform_int_distribution<int> Distribution(5, 10);
void func(int Value)
{
std::lock_guard<std::mutex> Guard(Mutex);
Numbers.push_back(Value);
}
void ThreadProc()
{
std::this_thread::sleep_for(std::chrono::milliseconds(Distribution(Generator))); // portable replacement for the Win32 Sleep()
func(++Index); // is this proper usage of std::atomic?
}
int main()
{
const int ThreadCount = 1000;
std::vector<std::thread> Threads;
for ( int i = 0; i < ThreadCount; i++ )
{
Threads.push_back(std::thread(ThreadProc));
}
for_each(Threads.begin(), Threads.end(), [](std::thread& t) { t.join(); });
std::sort(Numbers.begin(), Numbers.end());
auto End = std::unique(Numbers.begin(), Numbers.end());
if ( Numbers.end() == End )
{
std::cout << "No duplicites found." << std::endl;
}
else
{
std::cout << "Duplicites found ! - " << Numbers.end() - End << std::endl;
for_each(End, Numbers.end(), [](int n) { std::cout << n << ", "; });
}
return 0;
}
Off-topic question: when I define Index as non-atomic, I get duplicates, but only near the end of the range. The numbers are always 900+. Why is that?
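Regarding the main question about fetch_add(1)+1: the atomic part is the read-modify-write inside fetch_add, which hands each caller a distinct old value. The +1 is then applied to that thread's private copy, so it cannot race with anything. The following minimal sketch shows this; collect_ids is a hypothetical helper, not part of the code above.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Draws `per_thread` ids in each of `threads` threads and returns all of
// them sorted. Each thread writes only to its own vector, so the atomic
// counter is the only shared state.
std::vector<int> collect_ids(int threads, int per_thread) {
    assert(threads > 0 && per_thread >= 0);
    std::atomic<int> next{0};
    std::vector<std::vector<int>> local(threads);
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&next, &local, t, per_thread] {
            for (int k = 0; k < per_thread; ++k) {
                // Same value that ++next would return: fetch_add yields a
                // unique old value, and the +1 acts on a local copy.
                const int id = next.fetch_add(1) + 1;
                local[t].push_back(id);
            }
        });
    }
    for (auto& w : workers) w.join();
    std::vector<int> all;
    for (const auto& v : local) all.insert(all.end(), v.begin(), v.end());
    std::sort(all.begin(), all.end());
    return all;
}
```

So some_func(++atomic_value) receives a value that no other thread will get from the same counter.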