I wrote some sample code to run parallel instances of for_each.
I am unable to join the threads in the code below. I am fairly new to concurrent programming, so I'm not sure whether I have done everything right.
template <typename Iterator, typename F>
class for_each_block
{
public:
    void operator()(Iterator start, Iterator end, F f) {
        cout << this_thread::get_id << endl;
        this_thread::sleep_for(chrono::seconds(5));
        for_each(start, end, [&](auto& x) { f(x); });
    }
};
typedef unsigned const long int ucli;

template <typename Iterator, typename F>
void for_each_par(Iterator first, Iterator last, F f)
{
    ucli size = distance(first, last);
    if (!size)
        return;

    ucli min_per_thread = 4;
    ucli max_threads = (size + min_per_thread - 1) / min_per_thread;
    ucli hardware_threads = thread::hardware_concurrency();
    ucli no_of_threads = min(max_threads, hardware_threads != 0 ? hardware_threads : 4);
    ucli block_size = size / no_of_threads;

    vector<thread> vf(no_of_threads);

    Iterator block_start = first;
    for (int i = 0; i < (no_of_threads - 1); i++)
    {
        Iterator end = first;
        advance(end, block_size);
        vf.push_back(std::move(thread(for_each_block<Iterator, F>(), first, end, f)));
        first = end;
    }
    vf.push_back(std::move(thread(for_each_block<Iterator, F>(), first, last, f)));

    cout << endl;
    cout << vf.size() << endl;

    for (auto& x : vf)
    {
        if (x.joinable())
            x.join();
        else
            cout << "threads not joinable " << endl;
    }

    this_thread::sleep_for(chrono::seconds(100));
}

int main()
{
    vector<int> v1 = { 1,8,12,5,4,9,20,30,40,50,10,21,34,33 };
    for_each_par(v1.begin(), v1.end(), print_type<int>);
    return 0;
}
In the above code I am getting "threads not joinable". I have also tried it with async/futures and I get the same result. Am I missing something here?
Any help is greatly appreciated. Thank you in advance.
vector<thread> vf(no_of_threads);
This creates a vector with no_of_threads default-initialized threads. Since they're default initialized, none of them will be joinable. You probably meant to do:
vector<thread> vf;
vf.reserve(no_of_threads);
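A quick way to convince yourself of this (a minimal sketch, purely for illustration):

#include <cassert>
#include <thread>

int main()
{
    std::thread t;          // default-constructed: represents no thread of execution
    assert(!t.joinable());  // so calling join() on it would throw std::system_error
}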
P.S.: std::move on a temporary is redundant :); consider changing this:
vf.push_back(std::move(thread(for_each_block<Iterator, F>(), first, last, f)));
to this:
vf.emplace_back(for_each_block<Iterator, F>(), first, last, f);
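Putting both points together, the thread-spawning part of for_each_par from the question could look like this (a sketch, keeping the rest of the logic unchanged):

vector<thread> vf;
vf.reserve(no_of_threads);  // capacity only; no threads are constructed yet

Iterator block_start = first;
for (ucli i = 0; i < (no_of_threads - 1); i++)
{
    Iterator end = first;
    advance(end, block_size);
    vf.emplace_back(for_each_block<Iterator, F>(), first, end, f);  // construct the thread in place
    first = end;
}
vf.emplace_back(for_each_block<Iterator, F>(), first, last, f);

Now every element of vf is a real, joinable thread, and the join loop behaves as expected.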
This may or may not be interesting. I had a go at refactoring the code to use what I think is a more idiomatic approach. I'm not saying that your approach is wrong, but since you're learning thread management I thought you may be interested in what else is possible.
Feel free to flame/question as appropriate. Comments inline:
#include <algorithm>
#include <chrono>
#include <future>
#include <iomanip>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>
using namespace std;
//
// provide a means of serialising writing to a stream.
//
struct locker
{
    locker() : _lock(mutex()) {}
    static std::mutex& mutex() { static std::mutex m; return m; }
    std::unique_lock<std::mutex> _lock;
};

std::ostream& operator<<(std::ostream& os, const locker& l) {
    return os;
}

//
// fill in the missing work function
//
template<class T>
void print_type(const T& t) {
    std::cout << locker() << hex << std::this_thread::get_id() << " : " << dec << t << std::endl;
}

// put this in your personal library.
// the standards committee really should have given us ranges by now...
template<class I1, class I2>
struct range_impl
{
    range_impl(I1 i1, I2 i2) : _begin(i1), _end(i2) {};
    auto begin() const { return _begin; }
    auto end() const { return _end; }
    I1 _begin;
    I2 _end;
};

// distinct types because sometimes dissimilar iterators are comparable
template<class I1, class I2>
auto range(I1 i1, I2 i2) {
    return range_impl<I1, I2>(i1, i2);
}

//
// let's make a helper function so we can auto-deduce template args
//
template<class Iterator, typename F>
auto make_for_each_block(Iterator start, Iterator end, F&& f)
{
    // a lambda gives all the advantages of a function object with none
    // of the boilerplate.
    return [start, end, f = std::move(f)] {
        cout << locker() << this_thread::get_id() << endl;
        this_thread::sleep_for(chrono::seconds(1));
        // let's keep loops simple. for_each is a bit old-skool.
        for (auto& x : range(start, end)) {
            f(x);
        }
    };
}

template <typename Iterator, typename F>
void for_each_par(Iterator first, Iterator last, F f)
{
    if (auto size = distance(first, last))
    {
        std::size_t min_per_thread = 4;
        std::size_t max_threads = (size + min_per_thread - 1) / min_per_thread;
        std::size_t hardware_threads = thread::hardware_concurrency();
        auto no_of_threads = min(max_threads, hardware_threads != 0 ? hardware_threads : 4);
        auto block_size = size / no_of_threads;

        // futures give us two benefits:
        // 1. they automatically transmit exceptions
        // 2. no need for if(joinable) join. get is sufficient
        //
        vector<future<void>> vf;
        vf.reserve(no_of_threads - 1);
        for (auto count = no_of_threads; --count; )
        {
            //
            // I was thinking of refactoring this into std::generate_n but actually
            // it was less readable.
            //
            auto end = std::next(first, block_size);
            vf.push_back(async(launch::async, make_for_each_block(first, end, f)));
            first = end;
        }
        cout << locker() << endl << "threads: " << vf.size() << " (+ main thread)" << endl;

        //
        // why spawn a thread for the remaining block? we may as well use this thread
        //
        /* auto partial_sum = */ make_for_each_block(first, last, f)();

        // join the threads
        // note that if the blocks returned a partial aggregate, we could combine them
        // here by using the values in the futures.
        for (auto& f : vf) f.get();
    }
}

int main()
{
    vector<int> v1 = { 1,8,12,5,4,9,20,30,40,50,10,21,34,33 };
    for_each_par(v1.begin(), v1.end(), print_type<int>);
    return 0;
}
sample output:
0x700000081000
0x700000104000
threads: 3 (+ main thread)
0x700000187000
0x100086000
0x700000081000 : 1
0x700000104000 : 5
0x700000187000 : 20
0x100086000 : 50
0x700000081000 : 8
0x700000104000 : 4
0x700000187000 : 30
0x100086000 : 10
0x700000081000 : 12
0x700000104000 : 9
0x700000187000 : 40
0x100086000 : 21
0x100086000 : 34
0x100086000 : 33
Program ended with exit code: 0
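As the comment in for_each_par hints, if each block returned a partial aggregate we could combine the futures' values at the join point. A minimal sketch of that idea (my own addition, not part of the original answer; assumes the usual headers such as <future>, <numeric>, <iterator> and <algorithm>), reusing the same block-splitting logic for a parallel sum:

template <typename Iterator, typename T>
T accumulate_par(Iterator first, Iterator last, T init)
{
    auto size = std::distance(first, last);
    if (!size) return init;

    std::size_t hw = std::thread::hardware_concurrency();
    std::size_t no_of_threads = std::min<std::size_t>(hw ? hw : 4, (size + 3) / 4);
    auto block_size = size / no_of_threads;

    std::vector<std::future<T>> vf;
    vf.reserve(no_of_threads - 1);
    for (auto count = no_of_threads; --count; ) {
        auto end = std::next(first, block_size);
        vf.push_back(std::async(std::launch::async,
            [first, end] { return std::accumulate(first, end, T()); }));
        first = end;
    }

    // this thread handles the last block, then folds in the partial sums
    T result = std::accumulate(first, last, init);
    for (auto& f : vf) result += f.get();
    return result;
}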
please explain std::move here: [start, end, f = std::move(f)] {...};
This is a welcome language feature that was made available in C++14. f = std::move(f) inside the capture block is equivalent to: decltype(f) new_f = std::move(f), except that the new variable is called f and not new_f. It allows us to std::move objects into lambdas rather than copy them.
For most function objects it won't matter - but some can be large, and this gives the compiler the opportunity to use a move rather than a copy if one is available.
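For example (a small illustration of the difference, my own sketch):

#include <string>
#include <utility>

int main()
{
    std::string big(1'000'000, 'x');

    auto by_copy = [big] { return big.size(); };               // copies one million chars
    auto by_move = [s = std::move(big)] { return s.size(); };  // steals the buffer instead
    (void)by_copy; (void)by_move;
}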
#include<iostream>
#include<vector>
#include<cstdlib>
#include<thread>
#include<array>
#include<iterator>
#include<algorithm>
#include<functional>
#include<numeric> //accumulate

int sum_of_digits(int num, int sum = 0) {
    //calculate sum of digits recursively till the sum is single digit
    if (num == 0) {
        if (sum / 10 == 0)
            return sum;
        return sum_of_digits(sum);
    }
    return sum_of_digits(num / 10, sum + num % 10);
}

template<typename T, typename Iterator>
int sum_of_digits(Iterator begin, Iterator end) {
    //not for array temp workout
    T copy(std::distance(begin, end));
    std::copy(begin, end, copy.begin());
    std::for_each(copy.begin(), copy.end(), [](int& i) { i = sum_of_digits(i); });
    return sum_of_digits(std::accumulate(begin, end, 0));
}

template<typename T, typename Iterator>
void sum_of_digits(Iterator begin, Iterator end, int& sum) {
    sum = sum_of_digits<T, Iterator>(begin, end);
}

template<typename T>
int series_computation(const T& container) {
    return sum_of_digits<T, typename T::const_iterator>(container.begin(), container.end());
}

#define MIN_THREAD 4
#define DEBUGG

template<typename T>
int parallel_computation(const T& container, const int min_el) {
    if (container.size() < min_el) {
#ifdef DEBUGG
        std::cout << "no multithreading" << std::endl;
#endif
        return series_computation<T>(container);
    }
    const unsigned int total_el = container.size();
    const unsigned int assumed_thread = total_el / min_el;
    const unsigned hardwarethread = std::thread::hardware_concurrency();
    const unsigned int thread_count = std::min<unsigned int>(hardwarethread == 0 ? MIN_THREAD : hardwarethread, assumed_thread) - 1; //one thread is main thread
    const unsigned int el_per_thread = total_el / thread_count;
#ifdef DEBUGG
    std::cout << "thread count: " << thread_count << " element per thread: " << el_per_thread << std::endl;
#endif
    std::vector<std::thread> threads;
    threads.reserve(thread_count);

    using result_type = std::vector<int>;
    result_type results(thread_count);
    results.reserve(thread_count);

    auto it_start = container.begin();
    for (int i = 0; i < thread_count; i++) {
        auto it_end = it_start;
        std::advance(it_end, el_per_thread);
        threads.push_back(std::thread([&]() { sum_of_digits<T, typename T::const_iterator>(it_start, it_end, std::ref(results[i])); }));
        //threads.push_back( std::thread{ sum_of_digits<T,typename T::const_iterator>, it_start, it_end, std::ref(results[i]) });
        it_start = it_end;
        std::cout << "iterator " << i << std::endl;
    }
    results[thread_count - 1] = sum_of_digits<T, typename T::const_iterator>(it_start, container.end());
    std::for_each(threads.begin(), threads.end(), std::mem_fn(&std::thread::join));
    return series_computation<T>(results);
}

#define SIZE 1000

int main() {
    std::vector<int> array(SIZE);
    for (auto& curr : array) {
        curr = std::rand();
    }
    for (auto& curr : array) {
        //std::cout << curr << std::endl;
    }
    int series_val = series_computation<std::vector<int>>(array);
    std::cout << "series val: " << series_val << std::endl;
    int parallel_val = parallel_computation<std::vector<int>>(array, 25);
    std::cout << "parallel val: " << parallel_val << std::endl;
    return 0;
}
I am trying to calculate the (recursive) sum of digits of a randomly generated vector using std::thread, but in the results vector only the result of the last element (i.e. the main thread's) is stored, and the child threads do not update the referenced results[i].
What is causing this behaviour?
Also, inside the for loop of parallel_computation, this code works:
threads.push_back(std::thread([&]() {sum_of_digits<T, typename T::const_iterator>(it_start, it_end, std::ref(results[i])); }));
but this does not, and the error is: Error: '<function-style-cast>': cannot convert from 'initializer list' to 'std::thread'.
What's wrong with the one below?
threads.push_back( std::thread{ sum_of_digits<T,typename T::const_iterator>, it_start, it_end, std::ref(results[i]) });
Your code has a data race here:
threads.push_back(std::thread([&]() {sum_of_digits<T, typename T::const_iterator>(it_start, it_end, std::ref(results[i])); }));
it_start = it_end;
The lambda captures everything by reference: it uses it_start, and then, without any synchronization, you immediately modify it in the main thread. Capture it_start, it_end and i by copy instead. Also, std::ref is pointless there: you are not passing a reference to the std::thread constructor, but to a normal direct function call. As I mentioned in the comments, the iterator template argument can then be deduced from the call; only the container type T has to stay explicit, because it is used solely inside the function body:

threads.push_back(std::thread([it_start, it_end, i, &results]() {
    sum_of_digits<T>(it_start, it_end, results[i]); }));
it_start = it_end;
threads.push_back( std::thread{ sum_of_digits<T,typename T::const_iterator>, it_start, it_end, std::ref(results[i]) });
fails, because there are two function templates of which sum_of_digits<T,typename T::const_iterator> could be a specialization. It is impossible to pass overload sets to functions and since there is no way to deduce which one you mean, it will fail.
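If you did want to hand the function itself to std::thread, you would have to pick one specialization out of the overload set yourself, e.g. with a cast (a sketch of one way to do it; the lambda above is usually the cleaner option):

// select the three-argument overload explicitly, then pass the resulting
// function pointer to std::thread (here std::ref really is needed)
using fn_t = void (*)(typename T::const_iterator, typename T::const_iterator, int&);
threads.push_back(std::thread{
    static_cast<fn_t>(&sum_of_digits<T, typename T::const_iterator>),
    it_start, it_end, std::ref(results[i]) });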
There are some smaller issues in the code, e.g. the pointless reserve calls after the vectors have already been resized/constructed, or
std::for_each(threads.begin(), threads.end(), std::mem_fn(&std::thread::join));
which is not guaranteed to work, because taking the address of a member function of a standard library class has unspecified behavior. Instead just use a loop:

for (auto& t : threads) {
    t.join();
}
and possibly some others.
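For reference, the spawning loop with the fixes above applied might look like this (a sketch under the same structure as the question's code):

auto it_start = container.begin();
for (unsigned int i = 0; i < thread_count; i++) {
    auto it_end = it_start;
    std::advance(it_end, el_per_thread);
    // capture the iterators and the index by value, so the main thread's
    // subsequent updates cannot race with the worker
    threads.push_back(std::thread([it_start, it_end, i, &results]() {
        sum_of_digits<T>(it_start, it_end, results[i]);
    }));
    it_start = it_end;
}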
Consider the functions
#include <iostream>
#include <boost/bind.hpp>
#include <boost/asio.hpp>

void foo(const uint64_t begin, uint64_t *result)
{
    uint64_t prev[] = {begin, 0};
    for (uint64_t i = 0; i < 1000000000; ++i)
    {
        const auto tmp = (prev[0] + prev[1]) % 1000;
        prev[1] = prev[0];
        prev[0] = tmp;
    }
    *result = prev[0];
}

void batch(boost::asio::thread_pool &pool, const uint64_t a[])
{
    uint64_t r[] = {0, 0};

    boost::asio::post(pool, boost::bind(foo, a[0], &r[0]));
    boost::asio::post(pool, boost::bind(foo, a[1], &r[1]));

    pool.join();
    std::cerr << "foo(" << a[0] << "): " << r[0] << " foo(" << a[1] << "): " << r[1] << std::endl;
}
where foo is a simple "pure" function that performs a calculation on begin and writes the result to the pointer *result.
This function gets called with different inputs from batch. Here dispatching each call to another CPU core might be beneficial.
Now assume the batch function gets called several tens of thousands of times. Therefore a thread pool would be nice, shared between all the sequential batch calls.
Trying this with (for the sake of simplicity only 3 calls)
int main(int argn, char **)
{
    boost::asio::thread_pool pool(2);

    const uint64_t a[] = {2, 4};
    batch(pool, a);

    const uint64_t b[] = {3, 5};
    batch(pool, b);

    const uint64_t c[] = {7, 9};
    batch(pool, c);
}
leads to the result
foo(2): 2 foo(4): 4
foo(3): 0 foo(5): 0
foo(7): 0 foo(9): 0
All three lines appear at the same time, even though each computation of foo takes ~3 s.
I assume that only the first join really waits for the pool to complete all jobs.
The other batches have invalid results (they keep their initial values).
What is the best practice for reusing the thread pool here?
The best practice is not to reuse the pool (what would be the use of pooling, if you keep creating new pools?).
If you want to be sure you "time" the batches together, I'd suggest using when_all on futures:
Live On Coliru
#define BOOST_THREAD_PROVIDES_FUTURE_WHEN_ALL_WHEN_ANY
#include <iostream>
#include <boost/bind.hpp>
#include <boost/asio.hpp>
#include <boost/thread.hpp>
uint64_t foo(uint64_t begin) {
    uint64_t prev[] = {begin, 0};
    for (uint64_t i = 0; i < 1000000000; ++i) {
        const auto tmp = (prev[0] + prev[1]) % 1000;
        prev[1] = prev[0];
        prev[0] = tmp;
    }
    return prev[0];
}

void batch(boost::asio::thread_pool &pool, const uint64_t a[2])
{
    using T = boost::packaged_task<uint64_t>;

    T tasks[] {
        T(boost::bind(foo, a[0])),
        T(boost::bind(foo, a[1])),
    };

    auto all = boost::when_all(
        tasks[0].get_future(),
        tasks[1].get_future());

    for (auto& t : tasks)
        post(pool, std::move(t));

    auto [r0, r1] = all.get();
    std::cerr << "foo(" << a[0] << "): " << r0.get() << " foo(" << a[1] << "): " << r1.get() << std::endl;
}

int main() {
    boost::asio::thread_pool pool(2);

    const uint64_t a[] = {2, 4};
    batch(pool, a);

    const uint64_t b[] = {3, 5};
    batch(pool, b);

    const uint64_t c[] = {7, 9};
    batch(pool, c);
}
Prints
foo(2): 2 foo(4): 4
foo(3): 503 foo(5): 505
foo(7): 507 foo(9): 509
I would consider:
- generalizing
- message queuing
Generalized
Make it somewhat more flexible by not hardcoding batch sizes. After all, the pool size is already fixed, we don't need to "make sure batches fit" or something:
Live On Coliru
#define BOOST_THREAD_PROVIDES_FUTURE_WHEN_ALL_WHEN_ANY
#include <iostream>
#include <boost/bind.hpp>
#include <boost/asio.hpp>
#include <boost/thread.hpp>
#include <boost/thread/future.hpp>
struct Result { uint64_t begin, result; };

Result foo(uint64_t begin) {
    uint64_t prev[] = {begin, 0};
    for (uint64_t i = 0; i < 1000000000; ++i) {
        const auto tmp = (prev[0] + prev[1]) % 1000;
        prev[1] = prev[0];
        prev[0] = tmp;
    }
    return { begin, prev[0] };
}

void batch(boost::asio::thread_pool &pool, std::vector<uint64_t> const a)
{
    using T = boost::packaged_task<Result>;

    std::vector<T> tasks;
    tasks.reserve(a.size());

    for (auto begin : a)
        tasks.emplace_back(boost::bind(foo, begin));

    std::vector<boost::unique_future<T::result_type> > futures;
    for (auto& t : tasks) {
        futures.push_back(t.get_future());
        post(pool, std::move(t));
    }

    for (auto& fut : boost::when_all(futures.begin(), futures.end()).get()) {
        auto r = fut.get();
        std::cerr << "foo(" << r.begin << "): " << r.result << " ";
    }
    std::cout << std::endl;
}

int main() {
    boost::asio::thread_pool pool(2);
    batch(pool, {2});
    batch(pool, {4, 3, 5});
    batch(pool, {7, 9});
}
Prints
foo(2): 2
foo(4): 4 foo(3): 503 foo(5): 505
foo(7): 507 foo(9): 509
Generalized2: Variadics Simplify
Contrary to popular belief (and honestly, what usually happens), this time we can leverage variadics to get rid of all the intermediate vectors (every single one of them):
Live On Coliru
template <typename... T>
void batch(boost::asio::thread_pool &pool, T... a)
{
    auto launch = [&pool](uint64_t begin) {
        boost::packaged_task<Result> pt(boost::bind(foo, begin));
        auto fut = pt.get_future();
        post(pool, std::move(pt));
        return fut;
    };

    for (auto& r : {launch(a).get()...}) {
        std::cerr << "foo(" << r.begin << "): " << r.result << " ";
    }
    std::cout << std::endl;
}
If you insist on outputting the results in time, you can still add when_all into the mix (requiring a bit more heroics to unpack the tuple):
Live On Coliru
template <typename... T>
void batch(boost::asio::thread_pool &pool, T... a)
{
    auto launch = [&pool](uint64_t begin) {
        boost::packaged_task<Result> pt(boost::bind(foo, begin));
        auto fut = pt.get_future();
        post(pool, std::move(pt));
        return fut;
    };

    std::apply([](auto&&... rfut) {
        Result results[] {rfut.get()...};
        for (auto& r : results) {
            std::cerr << "foo(" << r.begin << "): " << r.result << " ";
        }
    }, boost::when_all(launch(a)...).get());

    std::cout << std::endl;
}
Both still print the same result
Message Queuing
This is very natural to boost, and sort of skips most complexity. If you also want to report per batched group, you'd have to coordinate:
Live On Coliru
#include <iostream>
#include <boost/asio.hpp>
#include <memory>
struct Result { uint64_t begin, result; };

Result foo(uint64_t begin) {
    uint64_t prev[] = {begin, 0};
    for (uint64_t i = 0; i < 1000000000; ++i) {
        const auto tmp = (prev[0] + prev[1]) % 1000;
        prev[1] = prev[0];
        prev[0] = tmp;
    }
    return { begin, prev[0] };
}

using Group = std::shared_ptr<size_t>;

void batch(boost::asio::thread_pool &pool, std::vector<uint64_t> begins) {
    auto group = std::make_shared<std::vector<Result> >(begins.size());

    for (size_t i = 0; i < begins.size(); ++i) {
        post(pool, [i, begin = begins.at(i), group] {
            (*group)[i] = foo(begin);

            if (group.unique()) {
                for (auto& r : *group) {
                    std::cout << "foo(" << r.begin << "): " << r.result << " ";
                }
                std::cout << std::endl;
            }
        });
    }
}

int main() {
    boost::asio::thread_pool pool(2);
    batch(pool, {2});
    batch(pool, {4, 3, 5});
    batch(pool, {7, 9});
    pool.join();
}
Note that this has concurrent access to group, which is safe because each task writes only to its own element.
Prints:
foo(2): 2
foo(4): 4 foo(3): 503 foo(5): 505
foo(7): 507 foo(9): 509
I just ran into this advanced executor example which is hidden from the documentation:
I realized just now that Asio comes with a fork_executor example which does exactly this: you can "group" tasks and join the executor (which represents that group) instead of the pool. I've missed this for the longest time since none of the executor examples are listed in the HTML documentation. – sehe
So without further ado, here's that sample applied to your question:
Live On Coliru
#define BOOST_BIND_NO_PLACEHOLDERS
#include <boost/asio/thread_pool.hpp>
#include <boost/asio/ts/executor.hpp>
#include <condition_variable>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
// A fixed-size thread pool used to implement fork/join semantics. Functions
// are scheduled using a simple FIFO queue. Implementing work stealing, or
// using a queue based on atomic operations, are left as tasks for the reader.
class fork_join_pool : public boost::asio::execution_context {
  public:
    // The constructor starts a thread pool with the specified number of
    // threads. Note that the thread_count is not a fixed limit on the pool's
    // concurrency. Additional threads may temporarily be added to the pool if
    // they join a fork_executor.
    explicit fork_join_pool(std::size_t thread_count = std::thread::hardware_concurrency() * 2)
        : use_count_(1), threads_(thread_count)
    {
        try {
            // Ask each thread in the pool to dequeue and execute functions
            // until it is time to shut down, i.e. the use count is zero.
            for (thread_count_ = 0; thread_count_ < thread_count; ++thread_count_) {
                boost::asio::dispatch(threads_, [&] {
                    std::unique_lock<std::mutex> lock(mutex_);
                    while (use_count_ > 0)
                        if (!execute_next(lock))
                            condition_.wait(lock);
                });
            }
        } catch (...) {
            stop_threads();
            threads_.join();
            throw;
        }
    }

    // The destructor waits for the pool to finish executing functions.
    ~fork_join_pool() {
        stop_threads();
        threads_.join();
    }

  private:
    friend class fork_executor;

    // The base for all functions that are queued in the pool.
    struct function_base {
        std::shared_ptr<std::size_t> work_count_;
        void (*execute_)(std::shared_ptr<function_base>& p);
    };

    // Execute the next function from the queue, if any. Returns true if a
    // function was executed, and false if the queue was empty.
    bool execute_next(std::unique_lock<std::mutex>& lock) {
        if (queue_.empty())
            return false;
        auto p(queue_.front());
        queue_.pop();
        lock.unlock();
        execute(lock, p);
        return true;
    }

    // Execute a function and decrement the outstanding work.
    void execute(std::unique_lock<std::mutex>& lock,
                 std::shared_ptr<function_base>& p) {
        std::shared_ptr<std::size_t> work_count(std::move(p->work_count_));
        try {
            p->execute_(p);
            lock.lock();
            do_work_finished(work_count);
        } catch (...) {
            lock.lock();
            do_work_finished(work_count);
            throw;
        }
    }

    // Increment outstanding work.
    void
    do_work_started(const std::shared_ptr<std::size_t>& work_count) noexcept {
        if (++(*work_count) == 1)
            ++use_count_;
    }

    // Decrement outstanding work. Notify waiting threads if we run out.
    void
    do_work_finished(const std::shared_ptr<std::size_t>& work_count) noexcept {
        if (--(*work_count) == 0) {
            --use_count_;
            condition_.notify_all();
        }
    }

    // Dispatch a function, executing it immediately if the queue is already
    // loaded. Otherwise adds the function to the queue and wakes a thread.
    void do_dispatch(std::shared_ptr<function_base> p,
                     const std::shared_ptr<std::size_t>& work_count) {
        std::unique_lock<std::mutex> lock(mutex_);
        if (queue_.size() > thread_count_ * 16) {
            do_work_started(work_count);
            lock.unlock();
            execute(lock, p);
        } else {
            queue_.push(p);
            do_work_started(work_count);
            condition_.notify_one();
        }
    }

    // Add a function to the queue and wake a thread.
    void do_post(std::shared_ptr<function_base> p,
                 const std::shared_ptr<std::size_t>& work_count) {
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push(p);
        do_work_started(work_count);
        condition_.notify_one();
    }

    // Ask all threads to shut down.
    void stop_threads() {
        std::lock_guard<std::mutex> lock(mutex_);
        --use_count_;
        condition_.notify_all();
    }

    std::mutex mutex_;
    std::condition_variable condition_;
    std::queue<std::shared_ptr<function_base>> queue_;
    std::size_t use_count_;
    std::size_t thread_count_;
    boost::asio::thread_pool threads_;
};
// A class that satisfies the Executor requirements. Every function or piece of
// work associated with a fork_executor is part of a single, joinable group.
class fork_executor {
  public:
    fork_executor(fork_join_pool& ctx)
        : context_(ctx), work_count_(std::make_shared<std::size_t>(0)) {}

    fork_join_pool& context() const noexcept { return context_; }

    void on_work_started() const noexcept {
        std::lock_guard<std::mutex> lock(context_.mutex_);
        context_.do_work_started(work_count_);
    }

    void on_work_finished() const noexcept {
        std::lock_guard<std::mutex> lock(context_.mutex_);
        context_.do_work_finished(work_count_);
    }

    template <class Func, class Alloc>
    void dispatch(Func&& f, const Alloc& a) const {
        auto p(std::allocate_shared<exFun<Func>>(
            typename std::allocator_traits<Alloc>::template rebind_alloc<char>(a),
            std::move(f), work_count_));
        context_.do_dispatch(p, work_count_);
    }

    template <class Func, class Alloc> void post(Func f, const Alloc& a) const {
        auto p(std::allocate_shared<exFun<Func>>(
            typename std::allocator_traits<Alloc>::template rebind_alloc<char>(a),
            std::move(f), work_count_));
        context_.do_post(p, work_count_);
    }

    template <class Func, class Alloc>
    void defer(Func&& f, const Alloc& a) const {
        post(std::forward<Func>(f), a);
    }

    friend bool operator==(const fork_executor& a, const fork_executor& b) noexcept {
        return a.work_count_ == b.work_count_;
    }

    friend bool operator!=(const fork_executor& a, const fork_executor& b) noexcept {
        return a.work_count_ != b.work_count_;
    }

    // Block until all work associated with the executor is complete. While it
    // is waiting, the thread may be borrowed to execute functions from the
    // queue.
    void join() const {
        std::unique_lock<std::mutex> lock(context_.mutex_);
        while (*work_count_ > 0)
            if (!context_.execute_next(lock))
                context_.condition_.wait(lock);
    }

  private:
    template <class Func> struct exFun : fork_join_pool::function_base {
        explicit exFun(Func f, const std::shared_ptr<std::size_t>& w)
            : function_(std::move(f)) {
            work_count_ = w;
            execute_ = [](std::shared_ptr<fork_join_pool::function_base>& p) {
                Func tmp(std::move(static_cast<exFun*>(p.get())->function_));
                p.reset();
                tmp();
            };
        }

        Func function_;
    };

    fork_join_pool& context_;
    std::shared_ptr<std::size_t> work_count_;
};
// Helper class to automatically join a fork_executor when exiting a scope.
class join_guard {
  public:
    explicit join_guard(const fork_executor& ex) : ex_(ex) {}
    join_guard(const join_guard&) = delete;
    join_guard(join_guard&&) = delete;
    ~join_guard() { ex_.join(); }

  private:
    fork_executor ex_;
};
//------------------------------------------------------------------------------
#include <algorithm>
#include <iostream>
#include <random>
#include <vector>
#include <boost/bind.hpp>
static void foo(const uint64_t begin, uint64_t *result)
{
    uint64_t prev[] = {begin, 0};
    for (uint64_t i = 0; i < 1000000000; ++i) {
        const auto tmp = (prev[0] + prev[1]) % 1000;
        prev[1] = prev[0];
        prev[0] = tmp;
    }
    *result = prev[0];
}

void batch(fork_join_pool &pool, const uint64_t (&a)[2])
{
    uint64_t r[] = {0, 0};
    {
        fork_executor fork(pool);
        join_guard join(fork);

        boost::asio::post(fork, boost::bind(foo, a[0], &r[0]));
        boost::asio::post(fork, boost::bind(foo, a[1], &r[1]));
        // fork.join(); // or let join_guard destructor run
    }
    std::cerr << "foo(" << a[0] << "): " << r[0] << " foo(" << a[1] << "): " << r[1] << std::endl;
}

int main() {
    fork_join_pool pool;
    batch(pool, {2, 4});
    batch(pool, {3, 5});
    batch(pool, {7, 9});
}
Prints:
foo(2): 2 foo(4): 4
foo(3): 503 foo(5): 505
foo(7): 507 foo(9): 509
Things to note:
- executors can overlap/nest: you can use several joinable fork_executors on a single fork_join_pool and they will join the distinct groups of tasks for each executor (see the sketch below)
- you can get a sense of that easily by looking at the library example (which does a recursive divide-and-conquer merge sort)
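To illustrate the first point, here is a minimal sketch (my own addition, building on the classes above) that runs two independent task groups on one pool and joins them separately:

int main() {
    fork_join_pool pool;

    fork_executor group_a(pool);
    fork_executor group_b(pool);

    boost::asio::post(group_a, [] { /* work for group a */ });
    boost::asio::post(group_b, [] { /* work for group b */ });
    boost::asio::post(group_a, [] { /* more work for group a */ });

    group_a.join(); // waits only for group a's two tasks
    group_b.join(); // waits only for group b's task
}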
I had a similar problem and ended up using latches. In this case the code would be (I also switched from bind to lambdas):
void batch(boost::asio::thread_pool &pool, const uint64_t a[])
{
    uint64_t r[] = {0, 0};

    boost::latch latch(2);
    boost::asio::post(pool, [&]() { foo(a[0], &r[0]); latch.count_down(); });
    boost::asio::post(pool, [&]() { foo(a[1], &r[1]); latch.count_down(); });
    latch.wait();

    std::cerr << "foo(" << a[0] << "): " << r[0] << " foo(" << a[1] << "): " << r[1] << std::endl;
}
https://godbolt.org/z/oceP6jjs7
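For what it's worth, boost::latch lives in <boost/thread/latch.hpp>, and if C++20 is available the same shape works with the standard std::latch instead (a sketch, my own adaptation of the code above):

#include <latch>

void batch(boost::asio::thread_pool &pool, const uint64_t a[])
{
    uint64_t r[] = {0, 0};

    std::latch latch(2);
    boost::asio::post(pool, [&] { foo(a[0], &r[0]); latch.count_down(); });
    boost::asio::post(pool, [&] { foo(a[1], &r[1]); latch.count_down(); });
    latch.wait(); // blocks until both tasks have counted down

    std::cerr << "foo(" << a[0] << "): " << r[0] << " foo(" << a[1] << "): " << r[1] << std::endl;
}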
I have classes that collect index values from different constant STL vectors. The problem is that even if these vectors are different in content and purpose, their indexes are all of type std::size_t, so one might erroneously use the index stored for one vector to access the elements of another vector. Can the code be changed so that a compile-time error is produced when an index is not used with the correct vector?
A code example:
#include <iostream>
#include <string>
#include <vector>
struct Named
{
    std::string name;
};

struct Cat : Named { };
struct Dog : Named { };

struct Range
{
    std::size_t start;
    std::size_t end;
};

struct AnimalHouse
{
    std::vector< Cat > cats;
    std::vector< Dog > dogs;
};

int main( )
{
    AnimalHouse house;
    Range cat_with_name_starting_with_a;
    Range dogs_with_name_starting_with_b;
    // ...some initialization code here...

    for( auto i = cat_with_name_starting_with_a.start;
         i < cat_with_name_starting_with_a.end;
         ++i )
    {
        std::cout << house.cats[ i ].name << std::endl;
    }

    for( auto i = dogs_with_name_starting_with_b.start;
         i < dogs_with_name_starting_with_b.end;
         ++i )
    {
        // bad copy paste but no compilation error
        std::cout << house.cats[ i ].name << std::endl;
    }

    return 0;
}
Disclaimer: please do not focus too much on the example itself, I know it is dumb, it is just to get the idea.
Here is an attempt following up on my comment.
There is of course a lot of room to change the details of how this would work depending on the use case; this way seemed reasonable to me.
#include <iostream>
#include <iterator>
#include <vector>

template <typename T>
struct Range {
    Range(T& vec, std::size_t start, std::size_t end) :
        m_vector(vec),
        m_start(start),
        m_end(end),
        m_size(end - start + 1) {}

    auto begin() {
        auto it = m_vector.begin();
        std::advance(it, m_start);
        return it;
    }

    auto end() {
        auto it = m_vector.begin();
        std::advance(it, m_end + 1);
        return it;
    }

    std::size_t size() {
        return m_size;
    }

    void update(std::size_t start, std::size_t end) {
        m_start = start;
        m_end = end;
        m_size = end - start + 1;
    }

    Range copy(T& other_vec) {
        return Range(other_vec, m_start, m_end);
    }

    typename T::reference operator[](std::size_t index) {
        return m_vector[m_start + index];
    }

private:
    T& m_vector;
    std::size_t m_start, m_end, m_size;
};

// This can be used if c++17 is not supported, to avoid
// having to specify template parameters
template <typename T>
Range<T> make_range(T& t, std::size_t start, std::size_t end) {
    return Range<T>(t, start, end);
}
int main() {
    std::vector<int> v1 {1, 2, 3, 4, 5};
    std::vector<double> v2 {0.5, 1., 1.5, 2., 2.5};

    Range more_then_2(v1, 1, 4); // Only works in c++17 or later
    auto more_then_1 = make_range(v2, 2, 4);

    for (auto v : more_then_2)
        std::cout << v << ' ';
    std::cout << std::endl;

    for (auto v : more_then_1)
        std::cout << v << ' ';
    std::cout << std::endl;

    more_then_2.update(2, 4);

    for (auto v : more_then_2)
        std::cout << v << ' ';
    std::cout << std::endl;

    auto v3 = v1;
    auto more_then_2_copy = more_then_2.copy(v3);
    for (unsigned i = 0; i < more_then_2_copy.size(); ++i)
        std::cout << more_then_2_copy[i] << ' ';

    return 0;
}
I have an std::vector of std::function<void()> like this:
std::map<Event, std::vector<std::function<void()>>> observers_;
calling each function like this:
for (const auto& obs : observers_.at(event)) obs();
I want to turn this into a parallel for loop. Since I am using C++14 and don't have access to the C++17 parallel execution policies (std::execution::par), I found a little library that allows me to create a ThreadPool.
How do I turn for (const auto& obs : observers_.at(event)) obs(); into a version that calls each function in observers_ in parallel? I can't seem to get the syntax correct. I tried the following, but it doesn't work.
std::vector<std::function<void()>> vec = observers_.at(event);
ThreadPool::ParallelFor(0, vec.size(), [&](int i)
{
    vec.at(i);
});
The example program that uses the library below:
#include <iostream>
#include <mutex>
#include "ThreadPool.hpp"
////////////////////////////////////////////////////////////////////////////////
int main()
{
    std::mutex critical;
    ThreadPool::ParallelFor(0, 16, [&] (int i)
    {
        std::lock_guard<std::mutex> lock(critical);
        std::cout << i << std::endl;
    });
    return 0;
}
The ThreadPool library.
#ifndef THREADPOOL_HPP_INCLUDED
#define THREADPOOL_HPP_INCLUDED
////////////////////////////////////////////////////////////////////////////////
#include <thread>
#include <vector>
#include <cmath>
////////////////////////////////////////////////////////////////////////////////
class ThreadPool {
public:
    template<typename Index, typename Callable>
    static void ParallelFor(Index start, Index end, Callable func) {
        // Estimate number of threads in the pool
        const static unsigned nb_threads_hint = std::thread::hardware_concurrency();
        const static unsigned nb_threads = (nb_threads_hint == 0u ? 8u : nb_threads_hint);

        // Size of a slice for the range functions
        Index n = end - start + 1;
        Index slice = (Index) std::round(n / static_cast<double> (nb_threads));
        slice = std::max(slice, Index(1));

        // [Helper] Inner loop
        auto launchRange = [&func] (int k1, int k2) {
            for (Index k = k1; k < k2; k++) {
                func(k);
            }
        };

        // Create pool and launch jobs
        std::vector<std::thread> pool;
        pool.reserve(nb_threads);
        Index i1 = start;
        Index i2 = std::min(start + slice, end);
        for (unsigned i = 0; i + 1 < nb_threads && i1 < end; ++i) {
            pool.emplace_back(launchRange, i1, i2);
            i1 = i2;
            i2 = std::min(i2 + slice, end);
        }
        if (i1 < end) {
            pool.emplace_back(launchRange, i1, end);
        }

        // Wait for jobs to finish
        for (std::thread &t : pool) {
            if (t.joinable()) {
                t.join();
            }
        }
    }

    // Serial version for easy comparison
    template<typename Index, typename Callable>
    static void SequentialFor(Index start, Index end, Callable func) {
        for (Index i = start; i < end; i++) {
            func(i);
        }
    }
};
#endif // THREADPOOL_HPP_INCLUDED
It seems that you should simply change:
vec.at(i); // Only returns a reference to the element at index i
into:
vec.at(i)(); // The second () calls the function
--- OR ---
vec[i](); // Same
Hint: What does this do?
vec.at(i);
What do you want it to do?
Unrelatedly, you're using at() when you mean [].
This works:
ThreadPool::ParallelFor(0, (int)vec.size(), [&] (int i)
{
    vec[i]();
});
Inspired by Anthony Williams' "C++ Concurrency in Action" I took a closer look at his parallel version of std::accumulate. I copied its code from the book and added some output for debugging purposes, and this is what I ended up with:
#include <algorithm>
#include <chrono>
#include <functional>
#include <future>
#include <iostream>
#include <numeric>
#include <random>
#include <thread>
#include <vector>
template <typename Iterator, typename T>
struct accumulate_block
{
    T operator()(Iterator first, Iterator last)
    {
        return std::accumulate(first, last, T());
    }
};

template <typename Iterator, typename T>
T parallel_accumulate(Iterator first, Iterator last, T init)
{
    const unsigned long length = std::distance(first, last);
    if (!length) return init;

    const unsigned long min_per_thread = 25;
    const unsigned long max_threads = (length) / min_per_thread;
    const unsigned long hardware_conc = std::thread::hardware_concurrency();
    const unsigned long num_threads = std::min(hardware_conc != 0 ? hardware_conc : 2, max_threads);
    const unsigned long block_size = length / num_threads;

    std::vector<std::future<T>> futures(num_threads - 1);
    std::vector<std::thread> threads(num_threads - 1);

    Iterator block_start = first;
    for (unsigned long i = 0; i < (num_threads - 1); ++i)
    {
        Iterator block_end = block_start;
        std::advance(block_end, block_size);
        std::packaged_task<T(Iterator, Iterator)> task{accumulate_block<Iterator, T>()};
        futures[i] = task.get_future();
        threads[i] = std::thread(std::move(task), block_start, block_end);
        block_start = block_end;
    }
    T last_result = accumulate_block<Iterator, T>()(block_start, last);

    for (auto& t : threads) t.join();

    T result = init;
    for (unsigned long i = 0; i < (num_threads - 1); ++i) {
        result += futures[i].get();
    }
    result += last_result;
    return result;
}

template <typename TimeT = std::chrono::microseconds>
struct measure
{
    template <typename F, typename... Args>
    static typename TimeT::rep execution(F func, Args&&... args)
    {
        using namespace std::chrono;
        auto start = system_clock::now();
        func(std::forward<Args>(args)...);
        auto duration = duration_cast<TimeT>(system_clock::now() - start);
        return duration.count();
    }
};

template <typename T>
T parallel(const std::vector<T>& v)
{
    return parallel_accumulate(v.begin(), v.end(), 0);
}

template <typename T>
T stdaccumulate(const std::vector<T>& v)
{
    return std::accumulate(v.begin(), v.end(), 0);
}

int main()
{
    constexpr unsigned int COUNT = 200000000;
    std::vector<int> v(COUNT);

    // optional randomising vector contents - std::accumulate also gives 0us
    // but custom parallel accumulate gives longer times with randomised input
    std::mt19937 mersenne_engine;
    std::uniform_int_distribution<int> dist(1, 100);
    auto gen = std::bind(dist, mersenne_engine);
    std::generate(v.begin(), v.end(), gen);

    std::fill(v.begin(), v.end(), 1);

    auto v2 = v; // copy to work on the same data

    std::cout << "starting ... " << '\n';
    std::cout << "std::accumulate : \t" << measure<>::execution(stdaccumulate<int>, v) << "us" << '\n';
    std::cout << "parallel: \t" << measure<>::execution(parallel<int>, v2) << "us" << '\n';
}
What is most interesting here is that I almost always get a time of 0 from std::accumulate.
Exemplar output:
starting ...
std::accumulate : 0us
parallel:
inside1 54us
inside2 81830us
inside3 89082us
89770us
What is the problem here?
http://cpp.sh/6jbt
As is often the case with micro-benchmarking, you need to make sure that your code is actually doing something observable. You're doing an accumulate, but you're not storing the result anywhere or doing anything with it. So did any of the work really need to be done? The compiler simply snipped out all that logic in the normal case. That's why you get 0.
Just change your code to actually ensure that work needs to be done. For example:
int s, s2;
std::cout << "starting ... " << '\n';
std::cout << "std::accumulate : \t"
          << measure<>::execution([&]{ s = std::accumulate(v.begin(), v.end(), 0); })
          << "us\n";
std::cout << "parallel: \t"
          << measure<>::execution([&]{ s2 = parallel_accumulate(v2.begin(), v2.end(), 0); })
          << "us\n";
std::cout << s << ',' << s2 << std::endl;
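Another common way to keep the optimizer honest, if you don't want to print the values, is to write the result to a volatile sink (a small sketch of the same idea; printing, as above, works just as well):

// the volatile store is an observable side effect, so the accumulate
// cannot be optimized away even though 'sink' is never printed
volatile int sink;
std::cout << "std::accumulate : \t"
          << measure<>::execution([&]{ sink = std::accumulate(v.begin(), v.end(), 0); })
          << "us\n";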