Boost thread pool join tasks without closing the pool [duplicate] - c++

Consider the functions
#include <iostream>
#include <boost/bind.hpp>
#include <boost/asio.hpp>
void foo(const uint64_t begin, uint64_t *result)
{
uint64_t prev[] = {begin, 0};
for (uint64_t i = 0; i < 1000000000; ++i)
{
const auto tmp = (prev[0] + prev[1]) % 1000;
prev[1] = prev[0];
prev[0] = tmp;
}
*result = prev[0];
}
void batch(boost::asio::thread_pool &pool, const uint64_t a[])
{
uint64_t r[] = {0, 0};
boost::asio::post(pool, boost::bind(foo, a[0], &r[0]));
boost::asio::post(pool, boost::bind(foo, a[1], &r[1]));
pool.join();
std::cerr << "foo(" << a[0] << "): " << r[0] << " foo(" << a[1] << "): " << r[1] << std::endl;
}
where foo is a simple "pure" function that performs a calculation on begin and writes the result through the pointer result.
This function gets called with different inputs from batch, so dispatching each call to another CPU core might be beneficial.
Now assume the batch function gets called several tens of thousands of times. A thread pool shared between all the sequential batch calls would therefore be nice.
Trying this (for the sake of simplicity with only 3 batch calls)
int main(int argn, char **)
{
boost::asio::thread_pool pool(2);
const uint64_t a[] = {2, 4};
batch(pool, a);
const uint64_t b[] = {3, 5};
batch(pool, b);
const uint64_t c[] = {7, 9};
batch(pool, c);
}
leads to the result
foo(2): 2 foo(4): 4
foo(3): 0 foo(5): 0
foo(7): 0 foo(9): 0
All three lines appear at the same time, even though each computation of foo takes ~3 s.
I assume that only the first join really waits for the pool to complete all jobs.
The other batches have invalid results (the untouched zero-initialized values).
What is the best practice here to reuse the thread pool?

The best practice is not to reuse the pool (what would be the use of pooling, if you keep creating new pools?).
If you want to be sure you "time" the batches together, I'd suggest using when_all on futures:
Live On Coliru
#define BOOST_THREAD_PROVIDES_FUTURE_WHEN_ALL_WHEN_ANY
#include <iostream>
#include <boost/bind.hpp>
#include <boost/asio.hpp>
#include <boost/thread.hpp>
uint64_t foo(uint64_t begin) {
uint64_t prev[] = {begin, 0};
for (uint64_t i = 0; i < 1000000000; ++i) {
const auto tmp = (prev[0] + prev[1]) % 1000;
prev[1] = prev[0];
prev[0] = tmp;
}
return prev[0];
}
void batch(boost::asio::thread_pool &pool, const uint64_t a[2])
{
using T = boost::packaged_task<uint64_t>;
T tasks[] {
T(boost::bind(foo, a[0])),
T(boost::bind(foo, a[1])),
};
auto all = boost::when_all(
tasks[0].get_future(),
tasks[1].get_future());
for (auto& t : tasks)
post(pool, std::move(t));
auto [r0, r1] = all.get();
std::cerr << "foo(" << a[0] << "): " << r0.get() << " foo(" << a[1] << "): " << r1.get() << std::endl;
}
int main() {
boost::asio::thread_pool pool(2);
const uint64_t a[] = {2, 4};
batch(pool, a);
const uint64_t b[] = {3, 5};
batch(pool, b);
const uint64_t c[] = {7, 9};
batch(pool, c);
}
Prints
foo(2): 2 foo(4): 4
foo(3): 503 foo(5): 505
foo(7): 507 foo(9): 509
I would consider:
- generalizing
- message queuing
Generalized
Make it somewhat more flexible by not hardcoding batch sizes. After all, the pool size is already fixed; we don't need to "make sure batches fit" or anything:
Live On Coliru
#define BOOST_THREAD_PROVIDES_FUTURE_WHEN_ALL_WHEN_ANY
#include <iostream>
#include <boost/bind.hpp>
#include <boost/asio.hpp>
#include <boost/thread.hpp>
#include <boost/thread/future.hpp>
struct Result { uint64_t begin, result; };
Result foo(uint64_t begin) {
uint64_t prev[] = {begin, 0};
for (uint64_t i = 0; i < 1000000000; ++i) {
const auto tmp = (prev[0] + prev[1]) % 1000;
prev[1] = prev[0];
prev[0] = tmp;
}
return { begin, prev[0] };
}
void batch(boost::asio::thread_pool &pool, std::vector<uint64_t> const a)
{
using T = boost::packaged_task<Result>;
std::vector<T> tasks;
tasks.reserve(a.size());
for(auto begin : a)
tasks.emplace_back(boost::bind(foo, begin));
std::vector<boost::unique_future<T::result_type> > futures;
for (auto& t : tasks) {
futures.push_back(t.get_future());
post(pool, std::move(t));
}
for (auto& fut : boost::when_all(futures.begin(), futures.end()).get()) {
auto r = fut.get();
std::cerr << "foo(" << r.begin << "): " << r.result << " ";
}
std::cout << std::endl;
}
int main() {
boost::asio::thread_pool pool(2);
batch(pool, {2});
batch(pool, {4, 3, 5});
batch(pool, {7, 9});
}
Prints
foo(2): 2
foo(4): 4 foo(3): 503 foo(5): 505
foo(7): 507 foo(9): 509
Generalized2: Variadics Simplify
Contrary to popular belief (and honestly, what usually happens), this time we can leverage variadics to get rid of all the intermediate vectors (every single one of them):
Live On Coliru
template <typename... T>
void batch(boost::asio::thread_pool &pool, T... a)
{
auto launch = [&pool](uint64_t begin) {
boost::packaged_task<Result> pt(boost::bind(foo, begin));
auto fut = pt.get_future();
post(pool, std::move(pt));
return fut;
};
for (auto& r : {launch(a).get()...}) {
std::cerr << "foo(" << r.begin << "): " << r.result << " ";
}
std::cout << std::endl;
}
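The call site then passes the begins directly, e.g. (a hypothetical main mirroring the earlier ones):
int main() {
    boost::asio::thread_pool pool(2);
    batch(pool, 2);
    batch(pool, 4, 3, 5);
    batch(pool, 7, 9);
}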
If you insist on outputting the results in time, you can still add when_all into the mix (requiring a bit more heroics to unpack the tuple):
Live On Coliru
template <typename...T>
void batch(boost::asio::thread_pool &pool, T... a)
{
auto launch = [&pool](uint64_t begin) {
boost::packaged_task<Result> pt(boost::bind(foo, begin));
auto fut = pt.get_future();
post(pool, std::move(pt));
return fut;
};
std::apply([](auto&&... rfut) {
Result results[] {rfut.get()...};
for (auto& r : results) {
std::cerr << "foo(" << r.begin << "): " << r.result << " ";
}
}, boost::when_all(launch(a)...).get());
std::cout << std::endl;
}
Both still print the same result
Message Queuing
This is very natural to Boost Asio, and sort of skips most of the complexity. If you also want to report per batched group, you'd have to coordinate:
Live On Coliru
#include <iostream>
#include <boost/asio.hpp>
#include <memory>
struct Result { uint64_t begin, result; };
Result foo(uint64_t begin) {
uint64_t prev[] = {begin, 0};
for (uint64_t i = 0; i < 1000000000; ++i) {
const auto tmp = (prev[0] + prev[1]) % 1000;
prev[1] = prev[0];
prev[0] = tmp;
}
return { begin, prev[0] };
}
using Group = std::shared_ptr<size_t>;
void batch(boost::asio::thread_pool &pool, std::vector<uint64_t> begins) {
auto group = std::make_shared<std::vector<Result> >(begins.size());
for (size_t i=0; i < begins.size(); ++i) {
post(pool, [i,begin=begins.at(i),group] {
(*group)[i] = foo(begin);
if (group.unique()) { // this task was the last one in the group
    for (auto& r : *group) {
        std::cout << "foo(" << r.begin << "): " << r.result << " ";
    }
    std::cout << std::endl;
}
});
}
}
int main() {
boost::asio::thread_pool pool(2);
batch(pool, {2});
batch(pool, {4, 3, 5});
batch(pool, {7, 9});
pool.join();
}
Note that this has concurrent access to group, which is safe because each task writes to a distinct element. Also note that shared_ptr::unique() is deprecated since C++17 (and removed in C++20); see the sketch after the output below for an alternative.
Prints:
foo(2): 2
foo(4): 4 foo(3): 503 foo(5): 505
foo(7): 507 foo(9): 509
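Since shared_ptr::unique() is deprecated, the same "last task reports" idea can be expressed with an explicit atomic counter. A minimal sketch (BatchState is a made-up helper; Result and foo are the ones above, and <atomic> must be included):
struct BatchState {
    std::vector<Result> results;
    std::atomic<size_t> remaining;
    explicit BatchState(size_t n) : results(n), remaining(n) {}
};

void batch(boost::asio::thread_pool &pool, std::vector<uint64_t> begins) {
    auto state = std::make_shared<BatchState>(begins.size());
    for (size_t i = 0; i < begins.size(); ++i) {
        post(pool, [i, begin = begins.at(i), state] {
            state->results[i] = foo(begin); // each task writes a distinct slot
            if (state->remaining.fetch_sub(1) == 1) { // this task was the last one
                for (auto& r : state->results)
                    std::cout << "foo(" << r.begin << "): " << r.result << " ";
                std::cout << std::endl;
            }
        });
    }
}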

I just ran into this advanced executor example, which is hidden from the documentation:
I realized just now that Asio comes with a fork_executor example which does exactly this: you can "group" tasks and join the executor (which represents that group) instead of the pool. I've missed this for the longest time since none of the executor examples are listed in the HTML documentation. – sehe
So without further ado, here's that sample applied to your question:
Live On Coliru
#define BOOST_BIND_NO_PLACEHOLDERS
#include <boost/asio/thread_pool.hpp>
#include <boost/asio/ts/executor.hpp>
#include <condition_variable>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
// A fixed-size thread pool used to implement fork/join semantics. Functions
// are scheduled using a simple FIFO queue. Implementing work stealing, or
// using a queue based on atomic operations, are left as tasks for the reader.
class fork_join_pool : public boost::asio::execution_context {
public:
// The constructor starts a thread pool with the specified number of
// threads. Note that the thread_count is not a fixed limit on the pool's
// concurrency. Additional threads may temporarily be added to the pool if
// they join a fork_executor.
explicit fork_join_pool(std::size_t thread_count = std::thread::hardware_concurrency()*2)
: use_count_(1), threads_(thread_count)
{
try {
// Ask each thread in the pool to dequeue and execute functions
// until it is time to shut down, i.e. the use count is zero.
for (thread_count_ = 0; thread_count_ < thread_count; ++thread_count_) {
boost::asio::dispatch(threads_, [&] {
std::unique_lock<std::mutex> lock(mutex_);
while (use_count_ > 0)
if (!execute_next(lock))
condition_.wait(lock);
});
}
} catch (...) {
stop_threads();
threads_.join();
throw;
}
}
// The destructor waits for the pool to finish executing functions.
~fork_join_pool() {
stop_threads();
threads_.join();
}
private:
friend class fork_executor;
// The base for all functions that are queued in the pool.
struct function_base {
std::shared_ptr<std::size_t> work_count_;
void (*execute_)(std::shared_ptr<function_base>& p);
};
// Execute the next function from the queue, if any. Returns true if a
// function was executed, and false if the queue was empty.
bool execute_next(std::unique_lock<std::mutex>& lock) {
if (queue_.empty())
return false;
auto p(queue_.front());
queue_.pop();
lock.unlock();
execute(lock, p);
return true;
}
// Execute a function and decrement the outstanding work.
void execute(std::unique_lock<std::mutex>& lock,
std::shared_ptr<function_base>& p) {
std::shared_ptr<std::size_t> work_count(std::move(p->work_count_));
try {
p->execute_(p);
lock.lock();
do_work_finished(work_count);
} catch (...) {
lock.lock();
do_work_finished(work_count);
throw;
}
}
// Increment outstanding work.
void
do_work_started(const std::shared_ptr<std::size_t>& work_count) noexcept {
if (++(*work_count) == 1)
++use_count_;
}
// Decrement outstanding work. Notify waiting threads if we run out.
void
do_work_finished(const std::shared_ptr<std::size_t>& work_count) noexcept {
if (--(*work_count) == 0) {
--use_count_;
condition_.notify_all();
}
}
// Dispatch a function, executing it immediately if the queue is already
// loaded. Otherwise adds the function to the queue and wakes a thread.
void do_dispatch(std::shared_ptr<function_base> p,
const std::shared_ptr<std::size_t>& work_count) {
std::unique_lock<std::mutex> lock(mutex_);
if (queue_.size() > thread_count_ * 16) {
do_work_started(work_count);
lock.unlock();
execute(lock, p);
} else {
queue_.push(p);
do_work_started(work_count);
condition_.notify_one();
}
}
// Add a function to the queue and wake a thread.
void do_post(std::shared_ptr<function_base> p,
const std::shared_ptr<std::size_t>& work_count) {
std::lock_guard<std::mutex> lock(mutex_);
queue_.push(p);
do_work_started(work_count);
condition_.notify_one();
}
// Ask all threads to shut down.
void stop_threads() {
std::lock_guard<std::mutex> lock(mutex_);
--use_count_;
condition_.notify_all();
}
std::mutex mutex_;
std::condition_variable condition_;
std::queue<std::shared_ptr<function_base>> queue_;
std::size_t use_count_;
std::size_t thread_count_;
boost::asio::thread_pool threads_;
};
// A class that satisfies the Executor requirements. Every function or piece of
// work associated with a fork_executor is part of a single, joinable group.
class fork_executor {
public:
fork_executor(fork_join_pool& ctx)
: context_(ctx), work_count_(std::make_shared<std::size_t>(0)) {}
fork_join_pool& context() const noexcept { return context_; }
void on_work_started() const noexcept {
std::lock_guard<std::mutex> lock(context_.mutex_);
context_.do_work_started(work_count_);
}
void on_work_finished() const noexcept {
std::lock_guard<std::mutex> lock(context_.mutex_);
context_.do_work_finished(work_count_);
}
template <class Func, class Alloc>
void dispatch(Func&& f, const Alloc& a) const {
auto p(std::allocate_shared<exFun<Func>>(
typename std::allocator_traits<Alloc>::template rebind_alloc<char>(a),
std::move(f), work_count_));
context_.do_dispatch(p, work_count_);
}
template <class Func, class Alloc> void post(Func f, const Alloc& a) const {
auto p(std::allocate_shared<exFun<Func>>(
typename std::allocator_traits<Alloc>::template rebind_alloc<char>(a),
std::move(f), work_count_));
context_.do_post(p, work_count_);
}
template <class Func, class Alloc>
void defer(Func&& f, const Alloc& a) const {
post(std::forward<Func>(f), a);
}
friend bool operator==(const fork_executor& a, const fork_executor& b) noexcept {
return a.work_count_ == b.work_count_;
}
friend bool operator!=(const fork_executor& a, const fork_executor& b) noexcept {
return a.work_count_ != b.work_count_;
}
// Block until all work associated with the executor is complete. While it
// is waiting, the thread may be borrowed to execute functions from the
// queue.
void join() const {
std::unique_lock<std::mutex> lock(context_.mutex_);
while (*work_count_ > 0)
if (!context_.execute_next(lock))
context_.condition_.wait(lock);
}
private:
template <class Func> struct exFun : fork_join_pool::function_base {
explicit exFun(Func f, const std::shared_ptr<std::size_t>& w)
: function_(std::move(f)) {
work_count_ = w;
execute_ = [](std::shared_ptr<fork_join_pool::function_base>& p) {
Func tmp(std::move(static_cast<exFun*>(p.get())->function_));
p.reset();
tmp();
};
}
Func function_;
};
fork_join_pool& context_;
std::shared_ptr<std::size_t> work_count_;
};
// Helper class to automatically join a fork_executor when exiting a scope.
class join_guard {
public:
explicit join_guard(const fork_executor& ex) : ex_(ex) {}
join_guard(const join_guard&) = delete;
join_guard(join_guard&&) = delete;
~join_guard() { ex_.join(); }
private:
fork_executor ex_;
};
//------------------------------------------------------------------------------
#include <algorithm>
#include <iostream>
#include <random>
#include <vector>
#include <boost/bind.hpp>
static void foo(const uint64_t begin, uint64_t *result)
{
uint64_t prev[] = {begin, 0};
for (uint64_t i = 0; i < 1000000000; ++i) {
const auto tmp = (prev[0] + prev[1]) % 1000;
prev[1] = prev[0];
prev[0] = tmp;
}
*result = prev[0];
}
void batch(fork_join_pool &pool, const uint64_t (&a)[2])
{
uint64_t r[] = {0, 0};
{
fork_executor fork(pool);
join_guard join(fork);
boost::asio::post(fork, boost::bind(foo, a[0], &r[0]));
boost::asio::post(fork, boost::bind(foo, a[1], &r[1]));
// fork.join(); // or let join_guard destructor run
}
std::cerr << "foo(" << a[0] << "): " << r[0] << " foo(" << a[1] << "): " << r[1] << std::endl;
}
int main() {
fork_join_pool pool;
batch(pool, {2, 4});
batch(pool, {3, 5});
batch(pool, {7, 9});
}
Prints:
foo(2): 2 foo(4): 4
foo(3): 503 foo(5): 505
foo(7): 507 foo(9): 509
Things to note:
- executors can overlap/nest: you can use several joinable fork_executors on a single fork_join_pool, and each joins only the distinct group of tasks posted through it (see the sketch below)
- you can get a feel for this by looking at the library example (which does a recursive divide-and-conquer merge sort)
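A minimal sketch of two overlapping groups on one pool, assuming the fork_join_pool and fork_executor classes above:
fork_join_pool pool;
fork_executor group_a(pool);
fork_executor group_b(pool);

boost::asio::post(group_a, [] { /* group A work */ });
boost::asio::post(group_b, [] { /* group B work */ });
boost::asio::post(group_a, [] { /* more group A work */ });

group_a.join(); // waits only for the two group A tasks
group_b.join(); // waits only for the group B task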

I had a similar problem and ended up using latches. In this case the code would be (I also switched from bind to lambdas; note that boost::latch needs #include <boost/thread/latch.hpp> and linking against Boost.Thread):
void batch(boost::asio::thread_pool &pool, const uint64_t a[])
{
uint64_t r[] = {0, 0};
boost::latch latch(2);
boost::asio::post(pool, [&](){ foo(a[0], &r[0]); latch.count_down();});
boost::asio::post(pool, [&](){ foo(a[1], &r[1]); latch.count_down();});
latch.wait();
std::cerr << "foo(" << a[0] << "): " << r[0] << " foo(" << a[1] << "): " << r[1] << std::endl;
}
https://godbolt.org/z/oceP6jjs7
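For what it's worth, since C++20 the standard library ships std::latch with the same shape, so the Boost.Thread dependency can be dropped; a minimal sketch assuming a C++20 compiler:
#include <latch>

void batch(boost::asio::thread_pool &pool, const uint64_t a[])
{
    uint64_t r[] = {0, 0};
    std::latch latch(2);
    boost::asio::post(pool, [&]{ foo(a[0], &r[0]); latch.count_down(); });
    boost::asio::post(pool, [&]{ foo(a[1], &r[1]); latch.count_down(); });
    latch.wait(); // blocks until both tasks have counted down
    std::cerr << "foo(" << a[0] << "): " << r[0] << " foo(" << a[1] << "): " << r[1] << std::endl;
}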

Related

Determining function time using a wrapper

I'm looking for a generic way of measuring a function's timing like here, but for C++.
My main goal is to avoid having cluttered code like this piece everywhere:
auto t1 = std::chrono::high_resolution_clock::now();
function(arg1, arg2);
auto t2 = std::chrono::high_resolution_clock::now();
auto tDur = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1);
But rather have a nice wrapper around the function.
What I got so far is:
timing.hpp:
#pragma once
#include <chrono>
#include <functional>
template <typename Tret, typename Tin1, typename Tin2> unsigned int getDuration(std::function<Tret(Tin1, Tin2)> function, Tin1 arg1, Tin2 arg2, Tret& retValue)
{
auto t1 = std::chrono::high_resolution_clock::now();
retValue = function(arg1, arg2);
auto t2 = std::chrono::high_resolution_clock::now();
auto tDur = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1);
return tDur.count();
}
main.cpp:
#include "timing.hpp"
#include "matrix.hpp"
constexpr int G_MATRIXSIZE = 2000;
int main(int argc, char** argv)
{
CMatrix<double> myMatrix(G_MATRIXSIZE);
bool ret;
// this call is quite ugly
std::function<bool(int, std::vector<double>)> fillRow = std::bind(&CMatrix<double>::fillRow, &myMatrix, 0, fillVec);
auto duration = getDuration(fillRow, 5, fillVec, ret );
std::cout << "duration(ms): " << duration << std::endl;
}
in case somebody wants to test the code, matrix.hpp:
#pragma once
#include <iostream>
#include <string>
#include <sstream>
#include <vector>
template<typename T> class CMatrix {
public:
// ctor
CMatrix(int size) :
m_size(size)
{
m_matrixData = new std::vector<std::vector<T>>;
createUnityMatrix();
}
// dtor
~CMatrix()
{
std::cout << "Destructor of CMatrix called" << std::endl;
delete m_matrixData;
}
// print to std::out
void printMatrix()
{
std::ostringstream oss;
for (int i = 0; i < m_size; i++)
{
for (int j = 0; j < m_size; j++)
{
oss << m_matrixData->at(i).at(j) << ";";
}
oss << "\n";
}
std::cout << oss.str() << std::endl;
}
bool fillRow(int index, std::vector<T> row)
{
// checks
if (!indexValid(index))
{
return false;
}
if (row.size() != m_size)
{
return false;
}
// data replacement
for (int j = 0; j < m_size; j++)
{
m_matrixData->at(index).at(j) = row.at(j);
}
return true;
}
bool fillColumn(int index, std::vector<T> column)
{
// checks
if (!indexValid(index))
{
return false;
}
if (column.size() != m_size)
{
return false;
}
// data replacement
for (int j = 0; j < m_size; j++)
{
m_matrixData->at(index).at(j) = column.at(j);
}
return true;
}
private:
// variables
std::vector<std::vector<T>>* m_matrixData;
int m_size;
bool indexValid(int index)
{
if (index + 1 > m_size)
{
return false;
}
return true;
}
// functions
void createUnityMatrix()
{
for (int i = 0; i < m_size; i++)
{
std::vector<T> _vector;
for (int j = 0; j < m_size; j++)
{
if (i == j)
{
_vector.push_back(1);
}
else
{
_vector.push_back(0);
}
}
m_matrixData->push_back(_vector);
}
}
};
The thing is, this code is still quite ugly due to the std::function usage. Is there a better and/or simpler option?
(Also, I'm sure I messed something up with the std::bind; I think I need to use std::placeholders, since I want to set the arguments later on.)
// edit, correct use of placeholder in main:
std::function<bool(int, std::vector<double>)> fillRow = std::bind(&CMatrix<double>::fillRow, &myMatrix, std::placeholders::_1, std::placeholders::_2);
auto duration = getDuration(fillRow, 18, fillVec, ret );
You can utilize RAII to implement a timer that records the execution time of a code block and a template function that wraps the function you would like to execute with the timer.
#include<string>
#include<chrono>
#include <unistd.h>
struct Timer
{
std::string fn, title;
std::chrono::time_point<std::chrono::steady_clock> start;
Timer(std::string fn, std::string title)
: fn(std::move(fn)), title(std::move(title)), start(std::chrono::steady_clock::now())
{
}
~Timer()
{
const auto elapsed =
std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
printf("%s: function=%s; elapsed=%f ms\n", title.c_str(), fn.c_str(), elapsed / 1000.0);
}
};
#ifndef ENABLE_BENCHMARK // define ENABLE_BENCHMARK (e.g. -DENABLE_BENCHMARK) to activate the timer
static constexpr inline void dummy_fn() { }
#define START_BENCHMARK_TIMER(...) dummy_fn()
#else
#define START_BENCHMARK_TIMER(title) Timer timer(__FUNCTION__, title)
#endif
template<typename F, typename ...Args>
auto time_fn(F&& fn, Args&&... args) {
START_BENCHMARK_TIMER("wrapped fn");
return fn(std::forward<Args>(args)...);
}
int foo(int i) {
usleep(70000);
return i;
}
int main()
{
printf("%d\n", time_fn(foo, 3));
}
stdout:
wrapped fn: function=time_fn; elapsed=71.785000 ms
3
General Idea:
time_fn is a simple template function that calls START_BENCHMARK_TIMER and calls fn with the provided arguments
START_BENCHMARK_TIMER then creates a Timer object, which records the current time in start. Do note that __FUNCTION__ expands to the name of the enclosing function (here time_fn, as the output shows).
When the provided fn returns or throws an exception, the Timer object from (1) is destroyed and its destructor runs. The destructor calculates the difference between the current time and the recorded start time and prints it to stdout.
Note:
Even though declaring start and end in time_fn instead of using the RAII timer would work, the RAII timer cleanly handles the situation where fn throws an exception.
If you are on c++11, you will need to change time_fn declaration to typename std::result_of<F &&(Args &&...)>::type time_fn(F&& fn, Args&&... args).
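That C++11 declaration would look like this (a sketch of the signature change only; the body is unchanged):
// C++11 has no return-type deduction for ordinary functions,
// so the result type is spelled out via std::result_of.
template<typename F, typename ...Args>
typename std::result_of<F&&(Args&&...)>::type time_fn(F&& fn, Args&&... args) {
    START_BENCHMARK_TIMER("wrapped fn");
    return fn(std::forward<Args>(args)...);
}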
Edit: Updated the response to include a wrapper function approach.

Overwriting first element of a vector changes contents of last element of vector

When writing my code, I noticed that running it returns incorrect results. It turns out something is changing the vector of handles for my coroutines, and I narrowed it down to one line of code where I overwrite an existing element of the handle vector with a new element.
Doing that also changes the content of the last element of the vector (more specifically, the bool from the myTask header), but not the elements in between.
Does anyone know what causes this? Any help is appreciated.
Full implementation code:
#include <concepts>
#include <coroutine>
#include <exception>
#include <iostream>
#include <myTask.h>
#include <vector>
myTask<int> getVectorInt(std::vector<int>& array, int key, bool interleave)
{
std::cout << "started lookup of key: " << key << std::endl;
int result = array.at(key);
if (interleave == true)
{
std::cout << "about to suspend task with key: " << key << std::endl;
co_await std::suspend_always{};
std::cout << "resumed task with key: " << key << std::endl;
}
co_return result;
}
void interleavedExecution(std::vector<int>& lookup, std::vector<int>& keys, std::vector<int>& results)
{
// group size = number of concurrent instruction streams
int groupsize = 3;
// initialization of handle vector
std::vector<std::coroutine_handle<myTask<int>::promise_type>> handles;
// initialization of promise vector
std::vector<myTask<int>::promise_type> promises;
// creating/initializing first handles
for (int i = 0; i < groupsize; ++i)
{
handles.push_back(getVectorInt(lookup, keys.at(i), true));
}
int notDone = groupsize;
int i = groupsize;
// interleaved execution starts here
while (notDone > 0)
{
for (int handleIndex = 0; handleIndex < handles.size(); ++handleIndex)
{
if (!handles.at(handleIndex).promise().isDone())
{
handles.at(handleIndex).resume();
handles.at(handleIndex).promise().boolIsDone = true;
}
else
{
// pushing value back directly into results
results.push_back(handles.at(handleIndex).promise().value_);
if (i < keys.size())
{
// bug here, changes the last boolIsDone also to false (or messes with the last vector element)
handles.at(handleIndex) = getVectorInt(lookup, keys.at(i), true);
handles.at(handleIndex).promise().boolIsDone = false;
++i;
}
else { --notDone; }
}
}
}
}
template <typename T>
void outputVector(std::vector<T> toOutput)
{
std::cout << "Results: ";
for (int i = 0; i < toOutput.size(); ++i)
{
std::cout << toOutput.at(i) << ' ';
}
}
int main()
{
std::vector<int> lookup = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100};
std::vector<int> keys = {4, 2, 0, 6, 9, 0};
std::vector<int> results;
// correct output: 50, 30, 10, 70, 100, 10
// given output: 50, 30, 70, 10, 100, 10
interleavedExecution(lookup, keys, results);
outputVector(results);
}
myTask header carrying a bool:
#include <concepts>
#include <coroutine>
#include <exception>
#include <iostream>
template <typename T>
struct myTask {
struct promise_type {
unsigned value_;
~promise_type() {
//std::cout << "promise_type destroyed" << std::endl;
}
myTask<T> get_return_object() {
return myTask<T> {
.h_ = std::coroutine_handle<promise_type>::from_promise(*this)
};
}
std::suspend_never initial_suspend() { return {}; }
std::suspend_never final_suspend() { return {}; }
void unhandled_exception() { std::terminate(); }
std::suspend_always return_value(unsigned value) {
value_ = value;
return {};
}
bool boolIsDone = false;
auto isDone() { return boolIsDone; }
};
std::coroutine_handle<promise_type> h_;
operator std::coroutine_handle<promise_type>() const {
//std::cout << "called handle" << std::endl;
return h_; }
};
It turned out that changing the return type of final_suspend() from std::suspend_never to std::suspend_always fixed the issue. With std::suspend_never the coroutine frame is destroyed as soon as the coroutine runs to completion, so the handles stored in the vector dangle and any later access through them is undefined behavior.
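A sketch of the change (note that final_suspend is also required to be noexcept, which the header above omits):
// Suspending at the final point keeps the frame alive after co_return,
// so the stored handles stay valid; the owner is now responsible for
// freeing each frame, e.g. handles.at(handleIndex).destroy() once the
// result has been consumed.
std::suspend_always final_suspend() noexcept { return {}; }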

Concurrent program compiled with clang runs fine, but hangs with gcc

I wrote a class to share a limited number of resources (for instance network interfaces) between a larger number of threads. The resources are pooled and, if not in use, they are borrowed out to the requesting thread, which otherwise waits on a condition_variable.
Nothing really exotic: apart from the fancy scoped_lock, which requires C++17, it should be good old C++11.
Both gcc 10.2 and clang 11 compile the test main fine, but while the latter produces an executable that does pretty much what is expected, the former hangs without consuming CPU (a deadlock?).
With the help of https://godbolt.org/ I tried older versions of gcc and also icc (passing options -O3 -std=c++17 -pthread), all reproducing the bad result, while even there clang confirms the proper behavior.
I wonder if I made a mistake or if the code triggers some compiler misbehavior and in case how to work around that.
#include <iostream>
#include <vector>
#include <stdexcept>
#include <mutex>
#include <condition_variable>
template <typename T>
class Pool {
///////////////////////////
class Borrowed {
friend class Pool<T>;
Pool<T>& pool;
const size_t id;
T * val;
public:
Borrowed(Pool & p, size_t i, T& v): pool(p), id(i), val(&v) {}
~Borrowed() { release(); }
T& get() const {
if (!val) throw std::runtime_error("Borrowed::get() this resource was collected back by the pool");
return *val;
}
void release() { pool.collect(*this); }
};
///////////////////////////
struct Resource {
T val;
bool available = true;
Resource(T v): val(std::move(v)) {}
};
///////////////////////////
std::vector<Resource> vres;
size_t hint = 0;
std::condition_variable cv;
std::mutex mtx;
size_t available_cnt;
public:
Pool(std::initializer_list<T> l): available_cnt(l.size()) {
vres.reserve(l.size());
for (T t: l) {
vres.emplace_back(std::move(t));
}
std::cout << "Pool has size " << vres.size() << std::endl;
}
~Pool() {
for ( auto & res: vres ) {
if ( ! res.available ) {
std::cerr << "WARNING Pool::~Pool resources are still in use\n";
}
}
}
Borrowed borrow() {
std::unique_lock<std::mutex> lk(mtx);
cv.wait(lk, [&](){return available_cnt > 0;});
if ( vres[hint].available ) {
// quick path, if hint points to an available resource
std::cout << "hint good" << std::endl;
vres[hint].available = false;
--available_cnt;
Borrowed b(*this, hint, vres[hint].val);
if ( hint + 1 < vres.size() ) ++hint;
return b; // <--- gcc seems to hang here
} else {
// full scan to find the available resource
std::cout << "hint bad" << std::endl;
for ( hint = 0; hint < vres.size(); ++hint ) {
if ( vres[hint].available ) {
vres[hint].available = false;
--available_cnt;
return Borrowed(*this, hint, vres[hint].val);
}
}
}
throw std::runtime_error("Pool::borrow() no resource is available - internal logic error");
}
void collect(Borrowed & b) {
if ( &(b.pool) != this )
throw std::runtime_error("Pool::collect() trying to collect resource owned by another pool!");
if ( b.val ) {
b.val = nullptr;
{
std::scoped_lock<std::mutex> lk(mtx);
hint = b.id;
vres[hint].available = true;
++available_cnt;
}
cv.notify_one();
}
}
};
///////////////////////////////////////////////////////////////////
#include <thread>
#include <chrono>
int main() {
Pool<std::string> pool{"hello","world"};
std::vector<std::thread> vt;
for (int i = 10; i > 0; --i) {
vt.emplace_back( [&pool, i]()
{
auto res = pool.borrow();
std::this_thread::sleep_for(std::chrono::milliseconds(i*300));
std::cout << res.get() << std::endl;
}
);
}
for (auto & t: vt) t.join();
return 0;
}
You're running into undefined behavior, since you effectively relock an already acquired lock. With MSVC I obtained a helpful call stack that pinpointed this. Here is a fixed, working example (see the changes within the borrow() method; it might be redesigned further, since locking inside a destructor is questionable):
#include <iostream>
#include <vector>
#include <stdexcept>
#include <mutex>
#include <condition_variable>
template <typename T>
class Pool {
///////////////////////////
class Borrowed {
friend class Pool<T>;
Pool<T>& pool;
const size_t id;
T * val;
public:
Borrowed(Pool & p, size_t i, T& v) : pool(p), id(i), val(&v) {}
~Borrowed() { release(); }
T& get() const {
if (!val) throw std::runtime_error("Borrowed::get() this resource was collected back by the pool");
return *val;
}
void release() { pool.collect(*this); }
};
///////////////////////////
struct Resource {
T val;
bool available = true;
Resource(T v) : val(std::move(v)) {}
};
///////////////////////////
std::vector<Resource> vres;
size_t hint = 0;
std::condition_variable cv;
std::mutex mtx;
size_t available_cnt;
public:
Pool(std::initializer_list<T> l) : available_cnt(l.size()) {
vres.reserve(l.size());
for (T t : l) {
vres.emplace_back(std::move(t));
}
std::cout << "Pool has size " << vres.size() << std::endl;
}
~Pool() {
for (auto & res : vres) {
if (!res.available) {
std::cerr << "WARNING Pool::~Pool resources are still in use\n";
}
}
}
Borrowed borrow() {
std::unique_lock<std::mutex> lk(mtx);
while (available_cnt == 0) cv.wait(lk);
if (vres[hint].available) {
// quick path, if hint points to an available resource
std::cout << "hint good" << std::endl;
vres[hint].available = false;
--available_cnt;
Borrowed b(*this, hint, vres[hint].val);
if (hint + 1 < vres.size()) ++hint;
lk.unlock();
return b; // <--- gcc seems to hang here
}
else {
// full scan to find the available resource
std::cout << "hint bad" << std::endl;
for (hint = 0; hint < vres.size(); ++hint) {
if (vres[hint].available) {
vres[hint].available = false;
--available_cnt;
lk.unlock();
return Borrowed(*this, hint, vres[hint].val);
}
}
}
throw std::runtime_error("Pool::borrow() no resource is available - internal logic error");
}
void collect(Borrowed & b) {
if (&(b.pool) != this)
throw std::runtime_error("Pool::collect() trying to collect resource owned by another pool!");
if (b.val) {
b.val = nullptr;
{
std::scoped_lock<std::mutex> lk(mtx);
hint = b.id;
vres[hint].available = true;
++available_cnt;
cv.notify_one();
}
}
}
};
///////////////////////////////////////////////////////////////////
#include <thread>
#include <chrono>
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
int main()
{
try
{
Pool<std::string> pool{ "hello","world" };
std::vector<std::thread> vt;
for (int i = 10; i > 0; --i) {
vt.emplace_back([&pool, i]()
{
auto res = pool.borrow();
std::this_thread::sleep_for(std::chrono::milliseconds(i * 300));
std::cout << res.get() << std::endl;
}
);
}
for (auto & t : vt) t.join();
return 0;
}
catch(const std::exception& e)
{
std::cout << "exception occurred: " << e.what();
}
return 0;
}
A locking destructor coupled with missed NRVO caused the issue (credit to Secundi for pointing this out in the comments).
If the compiler skips NRVO, the lines below the if copy b into the return value and then destroy b. The destructor tries to acquire the mutex while the unique_lock in borrow() still holds it, resulting in a deadlock.
Borrowed b(*this, hint, vres[hint].val);
if ( hint + 1 < vres.size() ) ++hint;
return b; // <--- gcc seems to hang here
It is of crucial importance here to avoid destroying b. In fact, even though manually releasing the unique_lock before returning avoids the deadlock, the destructor of b would mark the pooled resource as available while it is just being borrowed out, making the code wrong.
A possible fix consists in replacing the lines above with:
const auto tmp = hint;
if ( hint + 1 < vres.size() ) ++hint;
return Borrowed(*this, tmp, vres[tmp].val);
Another possibility (which does not exclude the former) is to delete the (evil) copy ctor of Borrowed and only provide a move ctor; the moved-from object is then left with a null val, so its destructor leaves the pool untouched:
Borrowed(const Borrowed &) = delete;
Borrowed(Borrowed && b): pool(b.pool), id(b.id), val(b.val) { b.val = nullptr; }

Why cannot my c++ thread pool accelerate my program?

I tried to implement a C++ thread pool following notes made by others; the code is like this:
#include <vector>
#include <queue>
#include <functional>
#include <future>
#include <atomic>
#include <condition_variable>
#include <thread>
#include <mutex>
#include <memory>
#include <glog/logging.h>
#include <iostream>
#include <chrono>
using std::cout;
using std::endl;
class ThreadPool {
public:
ThreadPool(const ThreadPool&) = delete;
ThreadPool(ThreadPool&&) = delete;
ThreadPool& operator=(const ThreadPool&) = delete;
ThreadPool& operator=(ThreadPool&&) = delete;
ThreadPool(uint32_t capacity=std::thread::hardware_concurrency(),
uint32_t n_threads=std::thread::hardware_concurrency()
): capacity(capacity), n_threads(n_threads) {
init(capacity, n_threads);
}
~ThreadPool() noexcept {
shutdown();
}
void init(uint32_t capacity, uint32_t n_threads) {
CHECK_GT(capacity, 0) << "task queue capacity should be greater than 0";
CHECK_GT(n_threads, 0) << "thread pool capacity should be greater than 0";
for (int i{0}; i < n_threads; ++i) {
pool.emplace_back(std::thread([this] {
std::function<void(void)> task;
while (!this->stop) {
{
std::unique_lock<std::mutex> lock(this->q_mutex);
task_q_empty.wait(lock, [&] {return this->stop | !task_q.empty();});
if (this->stop) break;
task = this->task_q.front();
this->task_q.pop();
task_q_full.notify_one();
}
// auto id = std::this_thread::get_id();
// std::cout << "thread id is: " << id << std::endl;
task();
}
}));
}
}
void shutdown() {
stop = true;
task_q_empty.notify_all();
task_q_full.notify_all();
for (auto& thread : pool) {
if (thread.joinable()) {
thread.join();
}
}
}
template<typename F, typename...Args>
auto submit(F&& f, Args&&... args) -> std::future<decltype(f(args...))> {
using res_type = decltype(f(args...));
std::function<res_type(void)> func = std::bind(std::forward<F>(f), std::forward<Args>(args)...);
auto task_ptr = std::make_shared<std::packaged_task<res_type()>>(func);
{
std::unique_lock<std::mutex> lock(q_mutex);
task_q_full.wait(lock, [&] {return this->stop | task_q.size() <= capacity;});
CHECK (this->stop == false) << "should not add task to stopped queue\n";
task_q.emplace([task_ptr]{(*task_ptr)();});
}
task_q_empty.notify_one();
return task_ptr->get_future();
}
private:
std::vector<std::thread> pool;
std::queue<std::function<void(void)>> task_q;
std::condition_variable task_q_full;
std::condition_variable task_q_empty;
std::atomic<bool> stop{false};
std::mutex q_mutex;
uint32_t capacity;
uint32_t n_threads;
};
int add(int a, int b) {return a + b;}
int main() {
auto t1 = std::chrono::steady_clock::now();
int n_threads = 1;
ThreadPool tp;
tp.init(n_threads, 1024);
std::vector<std::future<int>> res;
for (int i{0}; i < 1000000; ++i) {
res.push_back(tp.submit(add, i, i+1));
}
auto t2 = std::chrono::steady_clock::now();
for (auto &el : res) {
el.get();
// cout << el.get() << endl;
}
tp.shutdown();
cout << "processing: "
<< std::chrono::duration<double, std::milli>(t2 - t1).count()
<< endl;
return 0;
}
The problem is that when I set n_threads=1, the program takes the same length of time as with n_threads=4. Since my CPU has 72 cores (according to htop), I expected the 4-thread setting to be faster than the 1-thread setting. What is the problem with this implementation of the thread pool?
I found a few issues:
1) Use logical OR instead of the bitwise operator in both condition-variable waits:
Replace this: `task_q_empty.wait(lock, [&] {return this->stop | !task_q.empty();});`
by: `task_q_empty.wait(lock, [&] {return this->stop || !task_q.empty();});`
2) Use notify_all() in place of notify_one() in init() and submit().
3) Two condition_variables are unnecessary here; use only task_q_empty.
4) Your use case is not ideal: the cost of switching threads may outweigh adding two integers, so more threads can mean longer execution time. Test in optimized mode, and try a scenario like this to simulate a longer task:
int add(int a, int b) { this_thread::sleep_for(chrono::milliseconds(200)); return a + b; }
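Putting points 1) through 3) together, the worker loop might look like this (a sketch against the members declared above, keeping only task_q_empty):
while (true) {
    std::function<void(void)> task;
    {
        std::unique_lock<std::mutex> lock(this->q_mutex);
        // logical OR: wake on shutdown or when work is available
        task_q_empty.wait(lock, [&] { return this->stop || !task_q.empty(); });
        if (this->stop) break;
        task = std::move(this->task_q.front());
        this->task_q.pop();
    }
    task();
}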

Thread unable to join for_each parallel c++

I wrote sample code to run parallel instances of for_each.
I am unable to join the threads in the code below. I am fairly new to concurrent programming, so I'm not sure whether I have done everything right.
template <typename Iterator, typename F>
class for_each_block
{
public :
void operator()(Iterator start, Iterator end, F f) {
cout << this_thread::get_id << endl;
this_thread::sleep_for(chrono::seconds(5));
for_each(start, end, [&](auto& x) { f(x); });
}
};
typedef unsigned const long int ucli;
template <typename Iterator, typename F>
void for_each_par(Iterator first, Iterator last, F f)
{
ucli size = distance(first, last);
if (!size)
return;
ucli min_per_thread = 4;
ucli max_threads = (size + min_per_thread - 1) / min_per_thread;
ucli hardware_threads = thread::hardware_concurrency();
ucli no_of_threads = min(max_threads, hardware_threads != 0 ? hardware_threads : 4);
ucli block_size = size / no_of_threads;
vector<thread> vf(no_of_threads);
Iterator block_start = first;
for (int i = 0; i < (no_of_threads - 1); i++)
{
Iterator end = first;
advance(end, block_size);
vf.push_back(std::move(thread(for_each_block<Iterator, F>(),first,end,f)));
first = end;
}
vf.push_back(std::move(thread(for_each_block<Iterator, F>(), first, last, f)));
cout << endl;
cout << vf.size() << endl;
for(auto& x: vf)
{
if (x.joinable())
x.join();
else
cout << "threads not joinable " << endl;
}
this_thread::sleep_for(chrono::seconds(100));
}
int main()
{
vector<int> v1 = { 1,8,12,5,4,9,20,30,40,50,10,21,34,33 };
for_each_par(v1.begin(), v1.end(), print_type<int>);
return 0;
}
In the above code I am getting "threads not joinable". I have also tried with async futures and still get the same. Am I missing something here?
Any help is greatly appreciated. Thank you in advance.
vector<thread> vf(no_of_threads);
This creates a vector with no_of_threads default-initialized threads. Since they're default initialized, none of them will be joinable. You probably meant to do:
vector<thread> vf;
vf.reserve(no_of_threads);
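A tiny standalone illustration of the difference (hypothetical snippet):
#include <iostream>
#include <thread>
#include <vector>

int main() {
    std::vector<std::thread> a(3); // three default-constructed threads
    std::cout << std::boolalpha << a[0].joinable() << '\n'; // false
    std::vector<std::thread> b;
    b.reserve(3); // capacity for three, but size() == 0
    std::cout << b.size() << '\n'; // 0
}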
P.S.: std::move on a temporary is redundant :); consider changing this:
vf.push_back(std::move(thread(for_each_block<Iterator, F>(), first, last, f)));
to this:
vf.emplace_back(for_each_block<Iterator, F>(), first, last, f);
This may or may not be interesting. I had a go at refactoring the code to use what I think is a more idiomatic approach. I'm not saying that your approach is wrong, but since you're learning thread management I thought you may be interested in what else is possible.
Feel free to flame/question as appropriate. Comments inline:
#include <vector>
#include <chrono>
#include <thread>
#include <mutex>
#include <iomanip>
#include <future>
using namespace std;
//
// provide a means of serialising writing to a stream.
//
struct locker
{
locker() : _lock(mutex()) {}
static std::mutex& mutex() { static std::mutex m; return m; }
std::unique_lock<std::mutex> _lock;
};
std::ostream& operator<<(std::ostream& os, const locker& l) {
return os;
}
//
// fill in the missing work function
//
template<class T>
void print_type(const T& t) {
std::cout << locker() << hex << std::this_thread::get_id() << " : " << dec << t << std::endl;
}
// put this in your personable library.
// the standards committee really should have given us ranges by now...
template<class I1, class I2>
struct range_impl
{
range_impl(I1 i1, I2 i2) : _begin(i1), _end(i2) {};
auto begin() const { return _begin; }
auto end() const { return _end; }
I1 _begin;
I2 _end;
};
// distinct types because sometimes dissimilar iterators are comparable
template<class I1, class I2>
auto range(I1 i1, I2 i2) {
return range_impl<I1, I2>(i1, i2);
}
//
// lets make a helper function so we can auto-deduce template args
//
template<class Iterator, typename F>
auto make_for_each_block(Iterator start, Iterator end, F&& f)
{
// a lambda gives all the advantages of a function object with none
// of the boilerplate.
return [start, end, f = std::move(f)] {
cout << locker() << this_thread::get_id() << endl;
this_thread::sleep_for(chrono::seconds(1));
// let's keep loops simple. for_each is a bit old-skool.
for (auto& x : range(start, end)) {
f(x);
}
};
}
template <typename Iterator, typename F>
void for_each_par(Iterator first, Iterator last, F f)
{
if(auto size = distance(first, last))
{
std::size_t min_per_thread = 4;
std::size_t max_threads = (size + min_per_thread - 1) / min_per_thread;
std::size_t hardware_threads = thread::hardware_concurrency();
auto no_of_threads = min(max_threads, hardware_threads != 0 ? hardware_threads : 4);
auto block_size = size / no_of_threads;
// futures give us two benefits:
// 1. they automatically transmit exceptions
// 2. no need for if(joinable) join. get is sufficient
//
vector<future<void>> vf;
vf.reserve(no_of_threads - 1);
for (auto count = no_of_threads ; --count ; )
{
//
// I was thinking of refactoring this into std::generate_n but actually
// it was less readable.
//
auto end = std::next(first, block_size);
vf.push_back(async(launch::async, make_for_each_block(first, end, f)));
first = end;
}
cout << locker() << endl << "threads: " << vf.size() << " (+ main thread)" << endl;
//
// why spawn a thread for the remaining block? we may as well use this thread
//
/* auto partial_sum = */ make_for_each_block(first, last, f)();
// join the threads
// note that if the blocks returned a partial aggregate, we could combine them
// here by using the values in the futures.
for (auto& f : vf) f.get();
}
}
int main()
{
vector<int> v1 = { 1,8,12,5,4,9,20,30,40,50,10,21,34,33 };
for_each_par(v1.begin(), v1.end(), print_type<int>);
return 0;
}
sample output:
0x700000081000
0x700000104000
threads: 3 (+ main thread)
0x700000187000
0x100086000
0x700000081000 : 1
0x700000104000 : 5
0x700000187000 : 20
0x100086000 : 50
0x700000081000 : 8
0x700000104000 : 4
0x700000187000 : 30
0x100086000 : 10
0x700000081000 : 12
0x700000104000 : 9
0x700000187000 : 40
0x100086000 : 21
0x100086000 : 34
0x100086000 : 33
Program ended with exit code: 0
please explain std::move here: [start, end, f = std::move(f)] {...};
This is a welcome language feature that became available in C++14. f = std::move(f) inside the capture block is equivalent to decltype(f) new_f = std::move(f), except that the new variable is called f and not new_f. It allows us to std::move objects into lambdas rather than copy them.
For most function objects it won't matter, but some can be large, and this gives the compiler the opportunity to use a move rather than a copy where available.
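A minimal illustration (hypothetical snippet, not from the original answer):
#include <utility>
#include <vector>

int main() {
    std::vector<int> big(1000000, 1);
    // init-capture (C++14): big is moved, not copied, into the closure
    auto consume = [v = std::move(big)] { return v.size(); };
    return consume() == 1000000 ? 0 : 1;
}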