I want to split jobs among multiple std::thread workers and continue once they are all done.
To do so, I implemented a thread pool class mainly based on this SO answer.
I noticed, however, that my benchmarks can get stuck, running forever, without any errors thrown.
I wrote a minimal reproducing code, enclosed at the end.
Based on terminal output, the issue seems to occur when the jobs are being queued.
I checked videos (1, 2), documentation (3) and blog posts (4).
I tried replacing the type of the locks, using atomics.
I could not find the underlying cause.
Here is the snippet to replicate the issue.
The program repeatedly counts the odd elements in the test vector.
#include <atomic>
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>
class Pool {
public:
const int worker_count;
bool to_terminate = false;
std::atomic<int> unfinished_tasks = 0;
std::mutex mutex;
std::condition_variable condition;
std::vector<std::thread> threads;
std::queue<std::function<void()>> jobs;
void thread_loop()
{
while (true) {
std::function<void()> job;
{
std::unique_lock<std::mutex> lock(mutex);
condition.wait(lock, [&] { return (!jobs.empty()) || to_terminate; });
if (to_terminate)
return;
job = jobs.front();
jobs.pop();
}
job();
unfinished_tasks -= 1;
}
}
public:
Pool(int size) : worker_count(size)
{
if (size < 0)
throw std::invalid_argument("Worker count needs to be a positive integer");
for (int i = 0; i < worker_count; ++i)
threads.push_back(std::thread(&Pool::thread_loop, this));
};
~Pool()
{
{
std::unique_lock lock(mutex);
to_terminate = true;
}
condition.notify_all();
for (auto &thread : threads)
thread.join();
threads.clear();
};
void queue_job(const std::function<void()> &job)
{
{
std::unique_lock<std::mutex> lock(mutex);
jobs.push(job);
unfinished_tasks += 1;
// std::cout << unfinished_tasks;
}
condition.notify_one();
}
void wait()
{
while (unfinished_tasks) {
; // spinlock
};
}
};
int main()
{
constexpr int worker_count = 8;
constexpr int vector_size = 1 << 10;
Pool pool = Pool(worker_count);
std::vector<int> test_vector;
test_vector.reserve(vector_size);
for (int i = 0; i < vector_size; ++i)
test_vector.push_back(i);
std::vector<int> worker_odd_counts(worker_count, 0);
std::function<void(int)> worker_task = [&](int thread_id) {
int chunk_size = vector_size / (worker_count) + 1;
int my_start = thread_id * chunk_size;
int my_end = std::min(my_start + chunk_size, vector_size);
int local_odd_count = 0;
for (int ii = my_start; ii < my_end; ++ii)
if (test_vector[ii] % 2 != 0)
++local_odd_count;
worker_odd_counts[thread_id] = local_odd_count;
};
for (int iteration = 0;; ++iteration) {
std::cout << "Jobs.." << std::flush;
for (int i = 0; i < worker_count; ++i)
pool.queue_job([&worker_task, i] { worker_task(i); });
std::cout << "..queued. " << std::flush;
pool.wait();
int odd_count = 0;
for (auto elem : worker_odd_counts)
odd_count += elem;
std::cout << "Iter:" << iteration << ". Odd:" << odd_count << '\n';
}
}
Here is the terminal output of one specific run:
[...]
Jobs....queued. Iter:2994. Odd:512
Jobs....queued. Iter:2995. Odd:512
Jobs..
Edit:
The error occurres using GCC 12.2.0 x86_64-w64-mingw32 on Windows 10 with AMD Ryzen 4750U CPU. I do not get past 15k iterations .
Using Visual Studio Community 2022, I got past 1.5M iterations (and stopped it myself). Thanks #IgorTandetnik for pointing out the latter.
Mingw doesn’t natively support multithreading on Windows. They supporting threads in their C++ standard library over the POSIX API, and winpthreads compatibility layer which implements that API on top of the Windows OS threads.
I think your error is not in the C++ code, but in the computer setup. Do the following.
Use the compiler from x86_64-12.2.0-release-posix-seh-ucrt-rt_v10-rev2.7z archive, there.
Don’t forget the binary built that way depends on a bunch of DLL files provided by the compiler: libgcc_s_seh-1.dll, libwinpthread-1.dll and libstdc++-6.dll. You must use exactly the same version of these DLL which were shipped with mingw. If you have some other versions of these DLLs anywhere in your %PATH%, expect all kinds of fails.
Couple general notes.
Linux-first C++ compilers like gcc have issues on Windows. A path of least resistance is using Visual C++ instead. If you want your software to build on other platforms as well, consider cmake to abstract away the compiler.
Windows already includes a thread pool implementation, since Vista. The API is easy to use, you only need 4 functions: CreateThreadpoolWork, SubmitThreadpoolWork, WaitForThreadpoolWorkCallbacks, and CloseThreadpoolWork. Example.
The first thing you should do is split the queue from the thread pool. They are both tricky enough, writing both of them comingled in one class is asking for trouble.
This also allows you to unit test the queue without the pool.
template<class Payload>
class MutexQueue {
public:
std::optional<Payload> wait_and_pop();
void push(Payload);
void terminate_queue();
bool queue_is_terminated() const;
private:
mutable std::mutex m;
std::condition_variable cv;
std::deque<Payload> q;
bool terminated = false;
std::unique_lock<std::mutex> lock() const {
return std::unique_lock<std::mutex>(m);
}
};
this is a bit easier to write than the thread pool.
void push(Payload p) {
{
auto l = lock();
if (terminate) return;
q.push_back(std::move(p));
}
cv.notify_one();
}
void terminate_queue() {
{
auto l = lock(); // YOU CANNOT SKIP THIS LOCK, even if terminate is atomic
terminate = true;
q.clear();
}
cv.notify_all();
}
bool queue_is_terminated() const {
auto l = lock(); // if you make terminate atomic, you CAN skip this lock
return terminate;
}
std::optional<Payload> wait_and_pop() {
auto l = lock();
cv.wait(l, [&]{ return terminate || !q.empty(); }
if (terminate) return std::nullopt;
auto r = std::move(q.front());
q.pop_front();
return std::move(r);
}
there we go.
Now our thread pool is simpler.
struct ThreadPool {
explicit ThreadPool(std::size_t n) {
create_threads(n);
}
std::future<void> push_task(std::function<void()> f) {
std::packaged_task<void()> p = std::move(f);
auto r = p.get_future();
q.push( std::move(p) );
return r;
}
void terminate_pool() {
q.terminate_queue();
terminate_threads();
}
~ThreadPool() {
terminate_pool();
}
private:
MutexQueue<std::packaged_task<void()>> q;
std::vector<std::thread> threads;
void terminate_threads() {
for(auto& thread:threads)
thread.join();
threads.clear();
}
static void thread_task( MutexQueue<std::packaged_task<void()>>* pq ) {
if (!pq) return;
while (auto task = pq->wait_and_pop()) {
(*task)();
}
}
void create_threads(std::size_t n) {
for (std::size_t i = 0; i < n; ++i) {
threads.push_back( std::thread( thread_task, &q ) );
}
}
I cannot spot an error in your code. But with the above, you can test a split of the queue from the pool.
The queue will work with pthreads or other primitives.
Related
I have successfully implemented the thread pool from an answer on Stack Overflow, which helped me in speeding up my program. It uses a single std::queue to distribute jobs (std::function<void()>) among multiple workers (std::threads).
I wanted to improve on this. As I only need to run a limited set of functions, I planned to ditch the queue and to use variables instead. In other words, the n-th worker would do the n-th job from the std::vector<std::function<void()>>. Unfortunately, my test app crashes with Segmentation fault (core dumped) and I could not realize my mistake so far.
Here is my ~minimal reproducible code, with the job of counting the odd elements in a vector. (Idea taken from Scott Meyers: Cpu Caches and Why You Care.)
#include <algorithm>
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <stdexcept> // std::invalid_argument
#include <thread>
#include <vector>
// Thread pool with a std::function for each worker.
class Pool {
public:
enum class Status {
idle,
working,
terminate
};
const int worker_count;
std::vector<Status> statuses;
std::vector<std::mutex> mutexes;
std::vector<std::condition_variable> conditions;
std::vector<std::thread> threads;
std::vector<std::function<void()>> jobs;
void thread_loop(int thread_id)
{
std::puts("Thread started");
auto &my_status = statuses[thread_id];
auto &my_mutex = mutexes[thread_id];
auto &my_condition = conditions[thread_id];
auto &my_job = jobs[thread_id];
while (true) {
std::unique_lock<std::mutex> lock(my_mutex);
my_condition.wait(lock, [this, &my_status] { return my_status != Status::idle; });
if (my_status == Status::terminate)
return;
my_job();
my_status = Status::idle;
lock.unlock();
my_condition.notify_one(); // Tell the main thread we are done
}
}
public:
Pool(int size) : worker_count(size), statuses(size, Status::idle), mutexes(size), conditions(size), threads(), jobs(size)
{
if (size < 0)
throw std::invalid_argument("Worker count needs to be a positive integer");
};
~Pool()
{
for (int i = 0; i < worker_count; ++i) {
std::unique_lock lock(mutexes[i]);
statuses[i] = Status::terminate;
lock.unlock(); // Unlock before notifying
conditions[i].notify_one();
}
for (auto &thread : threads)
thread.join();
threads.clear();
};
void start_threads()
{
threads.resize(worker_count);
jobs.resize(worker_count);
for (int i = 0; i < worker_count; ++i) {
statuses[i] = Status::idle;
jobs[i] = []() { std::puts("I am running"); };
threads[i] = std::thread(&Pool::thread_loop, this, i);
}
}
void set_and_start_job(const std::function<void(int)> &job)
{
for (int i = 0; i < worker_count; ++i) {
std::unique_lock lock(mutexes[i]);
jobs[i] = [&job, i]() { job(i); };
statuses[i] = Status::working;
lock.unlock();
conditions[i].notify_one();
}
}
void wait()
{
for (int i = 0; i < worker_count; ++i) {
auto &my_status = statuses[i];
std::unique_lock lock(mutexes[i]);
conditions[i].wait(lock, [this, &my_status] { return my_status != Status::working; });
}
}
};
int main()
{
constexpr int worker_count = 1;
constexpr int vector_size = 1 << 10;
std::vector<int> test_vector;
test_vector.reserve(vector_size);
for (int i = 0; i < vector_size; ++i)
test_vector.push_back(i);
std::vector<int> worker_odd_counts(worker_count, 0);
const auto worker_task = [&](int thread_id) {
int chunk_size = vector_size / (worker_count) + 1;
int my_start = thread_id * chunk_size;
int my_end = std::min(my_start + chunk_size, vector_size);
int local_odd_count = 0;
for (int ii = my_start; ii < my_end; ++ii)
if (test_vector[ii] % 2 != 0)
++local_odd_count;
worker_odd_counts[thread_id] = local_odd_count;
};
Pool pool = Pool(worker_count);
pool.start_threads();
pool.set_and_start_job(worker_task);
pool.wait();
int odd_count = 0;
for (auto elem : worker_odd_counts)
odd_count += elem;
std::cout << odd_count << '\n';
}
TL;DR version:
The simplest fix is to change
jobs[i] = [&job, i]() { job(i); };
to
jobs[i] = [job, i]() { job(i); };
This captures job by value and makes a copy. The copy won't go out of scope before the lambda does and the lambda will outlive the thread.
The Long version:
The problem is at
jobs[i] = [&job, i]() { job(i); };
in set_and_start_job. The object backing job goes out of scope before the threads get started, but how can this be if
pool.set_and_start_job(worker_task);
and worker_task won't go out of scope until after the the threads are joined?
Turns out that's because set_and_start_job requires a const std::function<void(int)> & and worker_task isn't a std::function, merely implicitly convertible to a std::function. This conversion makes a temporary variable with a lifespan bound to set_and_start_job's job parameter. When set_and_start_job exits, job goes out of scope and the temporary is destroyed.
The simple fix is above, but we can also make the conversion right at the source to that `std::function is passed all the way through the system and will go out of scope after the threads are joined.
const std::function<void(int)> worker_task = [&](int thread_id) { ... };
There may be some small resource saving in end-to-end std::function and capturing a reference, but my experiences with references and threads haven't been the best, so I'd prefer the copy to reduce the possibility that I've missed some subtlety or someone in the future will make a change that adds some.
In the function Pool::set_and_start_job, when setting the job, removing the & from the job capture seems to have resolved the issue:
jobs[i] = [job, i]() { job(i); };
However, I just had the suspicion and does not know the underlying cause.
I want to calculate number of even numbers among all pairwise sums till 100000. And I want to do it using threadpools. Previously I did it in a static way, i.e., I allocated work to all the threads in the beginning itself. I was able to achieve linear speedup in that case. But the bottleneck is that the threads which started early, finished early (because there were less pairs to compute). So instead of that I want to allocate work to the threads dynamically, i.e., I will initially assign some work to the threads and as soon as they complete the work, they come back to take more work from the queue. Below is my threadpool code,
main.cpp :
#include <iostream>
#include <random>
#include<chrono>
#include<iomanip>
#include<future>
#include<vector>
#include "../include/ThreadPool.h"
std::random_device rd;
std::mt19937 mt(rd());
std::uniform_int_distribution<int> dist(-10, 10);
auto rnd = std::bind(dist, mt);
int thread_work;
long long pairwise(const int start) {
long long sum = 0;
long long counter = 0;
for(int i = start+1; i <= start+thread_work; i++)
{
for(int j = i-1; j >= 0; j--)
{
sum = i + j;
if(sum%2 == 0)
counter++;
}
}
//std::cout<<counter<<std::endl;
return counter;
}
int main(int argc, char *argv[])
{
// Create pool with x threads
int x;
std::cout<<"Enter num of threads : ";
std::cin>>x;
std::cout<<"Enter thread_work : ";
std::cin>>thread_work;
ThreadPool pool(x);
// Initialize pool
pool.init();
int N = 100000;
long long res = 0;
auto start = std::chrono::high_resolution_clock::now();
for(int i = 0; i < N; i = i + thread_work)
{
std::future<long long int> fut = pool.submit(pairwise,i);
res += fut.get();
}
std::cout<<"total is "<<res<<std::endl;
pool.shutdown();
auto end = std::chrono::high_resolution_clock::now();
double time_taken = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
time_taken *= 1e-9;
std::cout << "Time taken by program is : " << std::fixed << time_taken << std::setprecision(9)<<" secs"<<std::endl;
return 0;
}
my SafeQueue.h :
#pragma once
#include <mutex>
#include <queue>
// Thread safe implementation of a Queue using an std::queue
template <typename T>
class SafeQueue {
private:
std::queue<T> m_queue;
std::mutex m_mutex;
public:
SafeQueue() {
}
SafeQueue(SafeQueue& other) {
//TODO:
}
~SafeQueue() {
}
bool empty() {
std::unique_lock<std::mutex> lock(m_mutex);
return m_queue.empty();
}
int size() {
std::unique_lock<std::mutex> lock(m_mutex);
return m_queue.size();
}
void enqueue(T& t) {
std::unique_lock<std::mutex> lock(m_mutex);
m_queue.push(t);
}
bool dequeue(T& t) {
std::unique_lock<std::mutex> lock(m_mutex);
if (m_queue.empty()) {
return false;
}
t = std::move(m_queue.front());
m_queue.pop();
return true;
}
};
and my ThreadPool.h :
#pragma once
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>
#include "SafeQueue.h"
class ThreadPool {
private:
class ThreadWorker {
private:
int m_id;
ThreadPool * m_pool;
public:
ThreadWorker(ThreadPool * pool, const int id)
: m_pool(pool), m_id(id) {
}
void operator()() {
std::function<void()> func;
bool dequeued;
while (!m_pool->m_shutdown) {
{
std::unique_lock<std::mutex> lock(m_pool->m_conditional_mutex);
if (m_pool->m_queue.empty()) {
m_pool->m_conditional_lock.wait(lock);
}
dequeued = m_pool->m_queue.dequeue(func);
}
if (dequeued) {
func();
}
}
}
};
bool m_shutdown;
SafeQueue<std::function<void()>> m_queue;
std::vector<std::thread> m_threads;
std::mutex m_conditional_mutex;
std::condition_variable m_conditional_lock;
public:
ThreadPool(const int n_threads)
: m_threads(std::vector<std::thread>(n_threads)), m_shutdown(false) {
}
ThreadPool(const ThreadPool &) = delete;
ThreadPool(ThreadPool &&) = delete;
ThreadPool & operator=(const ThreadPool &) = delete;
ThreadPool & operator=(ThreadPool &&) = delete;
// Inits thread pool
void init() {
for (int i = 0; i < m_threads.size(); ++i) {
m_threads[i] = std::thread(ThreadWorker(this, i));
}
}
// Waits until threads finish their current task and shutdowns the pool
void shutdown() {
m_shutdown = true;
m_conditional_lock.notify_all();
for (int i = 0; i < m_threads.size(); ++i) {
if(m_threads[i].joinable()) {
m_threads[i].join();
}
}
}
// Submit a function to be executed asynchronously by the pool
template<typename F, typename...Args>
auto submit(F&& f, Args&&... args) -> std::future<decltype(f(args...))> {
// Create a function with bounded parameters ready to execute
std::function<decltype(f(args...))()> func = std::bind(std::forward<F>(f), std::forward<Args>(args)...);
// Encapsulate it into a shared ptr in order to be able to copy construct / assign
auto task_ptr = std::make_shared<std::packaged_task<decltype(f(args...))()>>(func);
// Wrap packaged task into void function
std::function<void()> wrapper_func = [task_ptr]() {
(*task_ptr)();
};
// Enqueue generic wrapper function
m_queue.enqueue(wrapper_func);
// Wake up one thread if its waiting
m_conditional_lock.notify_one();
// Return future from promise
return task_ptr->get_future();
}
};
I am developing a project in which I have to model (arbitrary) computations that happen in pipeline.
The pipeline is made of stages, each stage takes the input from the previous stage (except the first, who directly receives tasks from the pipeline object), makes a computation and sends the result to the next stage. Each stage is implemented with a separate thread of execution.
The pipeline should have a basic load balancing capability: if (after a while) it recognizes that the sum of the execution times of two consecutive stages is smaller than the execution time of the slowest stage, it "collapses" those two stages, that is it makes both of them run sequentially, using a single thread.
There are three classes in the project: classes Pipeline and Stage are obvious, while class TSOHeap (Thread-Safe Ordered heap) is the buffer used in input by each stage. It has a maximum size and the capability to give highest priority to special messages indicating that a Stage has to be collapsed.
My question is: why if I compile without optimizations the code runs smoothly (or at least does not block), while if I compile with optimizations ( -O2, -O3 ) the program blocks? If I run the program with the debugger it blocks few times; if I run the program "normally" from terminal it blocks almost always.
The strange thing is that a thread blocks on a line in which there is a simple print. Before I added that print (for debugging purpose), the program blocked on the previous line, which is the guard of a while loop.
I guess the problem is related to synchronization among threads, but I don't know how to discover the faulty part. The only constant is that the program blocks after the method collapse_next_stage() has been invoked, that is after a thread has been stopped.
Any suggestion would be appreciated, even general procedures to discover bugs like these.
I report the code to run an example:
Class "TSOHeap.hpp":
#include <mutex>
#include <queue>
#include <vector>
#include <atomic>
#include <climits>
using namespace std;
template<typename T>
struct Comparator{
bool operator()(pair<T,int> p1, pair<T,int> p2){
return p1.second > p2.second;
}
};
//Thread-Safe Ordered Heap
template<typename T>
struct TSOHeap
{
TSOHeap(int _max=10):size{0},max{_max}{};
~TSOHeap(){}
void push(T* item, int id){
while(size==max);
{
lock_guard<mutex> lock(heap_mutex);
heap.push(pair<T*,int>(item, id));
size++;
}
}
pair<T*,int> pop(){
while(size==0);
{
lock_guard<mutex> lock(heap_mutex);
pair<T*,int> p = heap.top();
heap.pop();
size--;
return p;
}
}
priority_queue<pair<T*,int>, vector<pair<T*,int>>,Comparator<T*>> heap;
atomic<int> size;
int max;
mutex heap_mutex;
};
Class "Stage.hpp":
#include "TSOHeap.hpp"
#include <iostream>
#include <thread>
#include <vector>
#include <chrono>
#include <mutex>
using namespace std;;
struct IStage{
virtual void run() = 0;
virtual void wait_end() = 0;
virtual void stage_func() = 0;
virtual double get_exec_time() = 0;
virtual void reset_exec_time()=0;
virtual void add_next(IStage&)=0;
virtual IStage* get_next() = 0;
virtual void* get_input_ptr() = 0;
virtual void set_input(void*) = 0;
virtual void collapse() = 0;
virtual bool is_collapsed() = 0;
virtual void collapse_next_stage() = 0;
virtual int num_collapsed() = 0;
~IStage(){};
};
template <typename Tin, typename Tf, typename Tout>
struct Stage : IStage{
Stage(Tf function, int ind):fun{function}, input_ptr{new(TSOHeap<Tin>)},_end{false},
next{nullptr}, collapsed{0}, i{ind}, exec_time{0.0},count{0},collapsing{false},c{0}{};
~Stage(){delete input_ptr;}
void stage_func(){
Tin * input = input_ptr->pop().first;
if (input!=nullptr){
auto start = chrono::system_clock::now();
Tout out = fun(*input);
auto end = chrono::system_clock::now();
chrono::duration<double> diff = end-start;
set_exec_time(diff.count());
if (next!=nullptr)
next->set_input(new Tout(out));
}
else
_end = true;
}
void run_thread(){
while(!_end){
cout << "t " << i << ", r " << ++c << endl; // BLOCKS HERE
while(collapsing); //waiting that next stage finishes the remaining tasks
stage_func();
if(collapsed==1 && !_end)
next->stage_func();
}
if(collapsed!=-1){
IStage * nptr = next;
if(nptr!=nullptr && nptr->is_collapsed())
nptr = nptr->get_next();
if(nptr!=nullptr)
nptr->set_input(nullptr);
}
else{
while((input_ptr->size)>0)
stage_func();
}
}
void run()
{
thread _t(&Stage::run_thread, this);
t = move(_t);
return;
}
void wait_end()
{
t.join();
}
void set_input(void * iptr)
{
input_ptr->push(static_cast<Tin*>(iptr), ++count);
}
void* get_input_ptr()
{
return input_ptr;
}
void add_next(IStage &n)
{
next = &n;
output_ptr = static_cast<TSOHeap<Tout>*>(n.get_input_ptr());
}
void collapse()
{
collapsed=-1;
input_ptr->push(nullptr, INT_MIN);
// First condition is to avoid deadlock, in case this thread finished the execution in the meanwhile
while(!_end && (input_ptr->size) > 0);
}
bool is_collapsed()
{
return collapsed==-1;
}
void collapse_next_stage()
{
collapsing = true;
next->collapse();
collapsed++;
collapsing = false;
cout << "Stage # " << i << " has collapsed the successive Stage" << endl;
}
IStage* get_next()
{
return next;
}
double get_exec_time()
{
return exec_time;
}
void reset_exec_time()
{
set_exec_time(0.0);
}
void set_exec_time(double value)
{
lock_guard<mutex> lock(et_mutex);
exec_time = value;
}
int num_collapsed()
{
return collapsed;
}
Tf fun;
TSOHeap<Tin> * input_ptr;
bool _end;
IStage * next;
int collapsed;
int const i;
double exec_time;
int count;
mutex et_mutex;
bool collapsing;
int c;
TSOHeap<Tout> * output_ptr;
thread t;
};
Class "Pipe.hpp":
#include "Stage.hpp"
#include <list>
#include <thread>
#include <algorithm>
using namespace std;;
template <typename Tin, typename Tout>
struct Pipe{
Pipe(list<IStage*>li, int n_samples=10):slowest{-1},end{false},num_samples{n_samples}
{
for(auto& s:li)
add_node(s);
}
void add_node(IStage* sptr)
{
if(!nodes.empty())
nodes.back()->add_next(*sptr);
nodes.push_back(sptr);
}
void set_input(void * in_ptr)
{
nodes.front()->set_input(in_ptr);
}
int num_nodes()
{
return nodes.size();
}
void run()
{
for(auto &x: nodes)
x->run();
}
void run(list<Tin>&& input)
{
thread t(&Pipe::run_manager, this, ref(input));
while(!end)
monitor_times();
t.join();
}
void run_manager(list<Tin>& input)
{
run();
for(auto& x:input)
set_input(&x);
set_input(nullptr);
end=true;
for(auto& s : nodes)
s->wait_end();
}
void monitor_times()
{ // initialization phase
vector<int> count;
vector<double> avg;
vector<priority_queue<pair<double,int>, vector<pair<double,int>>,Comparator<double>>> measures;
for(auto& x : nodes){
count.push_back(0);
avg.push_back(0);
measures.push_back(priority_queue<pair<double,int>,
vector<pair<double,int>>,Comparator<double>>());
}
while(!end){
// monitoring phase
for(int i=0; i<nodes.size(); i++){
if(nodes[i]->get_exec_time()!=0){
pair<double,int> measure = pair<double,int>(nodes[i]->get_exec_time(),++count[i]);
nodes[i]->reset_exec_time();
measures[i].push(measure);
if(count[i]<=num_samples){
avg[i] = (avg[i]*(count[i]-1) + measure.first) / count[i];
}
else
{
double old = measures[i].top().first;
// the ordering of the heap guarantees that I drop the oldest measure
measures[i].pop();
avg[i] = (avg[i] * num_samples - old + measure.first) / num_samples;
}
}
}
// updating phase
if(is_steady_state(count)){
int slowest = get_slowest_stage(avg);
for(int i=0; i<nodes.size()-1; i++){
if(avg[i]+avg[i+1]<avg[slowest]){
if(nodes[i]->num_collapsed()==0 && nodes[i+1]->num_collapsed()==0){
nodes[i]->collapse_next_stage();
break;
}
}
}
}
}
}
bool is_steady_state(vector<int>& count){
for(auto& c: count){
if(c < num_samples) return false;
}
return true;
}
int get_slowest_stage(vector<double>& avg){
double max = 0.0;
int index = -1;
for(int i=0; i<avg.size(); i++){
if(avg[i]>max){
max=avg[i];
index = i;
}
}
return index;
}
int slowest;
bool end;
int num_samples;
vector<IStage*> nodes;
};
Class "main.cpp":
#include<iostream>
#include<functional>
#include <chrono>
#include<cmath>
#include "Pipe.hpp"
using namespace std;;
auto f = [](int x){
int c = 0;
for(int i=0; i<300; i++)
c=sin(i);
return x;
};
auto fast = [] (int x) {return x;};
auto fast_init = [](int x){
if(x < 5)
return x;
int c=0;
for(int i=0; i<300; i++)
c=sin(i);
return x;
};
auto print = [] (int x) {
cout << "Result: " << x << " " << endl;
return x;
};
int main(int argc, char* argv[])
{
auto print_usage_msg = [&](){
cout << "Usage: " << argv[0] << " <func_type> \n" <<
"<func_type> = \n"
" 0 to have 2 consecutive stages running the fast function\n"
" 1 to have 2 consecutive stages running the fast function "
"but after a short time reaching steady state " << endl;
};
if(argc!=2){
print_usage_msg();
return 1;
}
int fun_code = atoi(argv[1]);
if (fun_code!=0 && fun_code!=1){
print_usage_msg();
return 1;
}
Stage<int,function<int(int)>,int> s1{f,1};
Stage<int,function<int(int)>,int> s2{f,2};
Stage<int,function<int(int)>,int> s3{f,3};
Stage<int,function<int(int)>,int> s4{f,4};
Stage<int,function<int(int)>,int> s5{f,5};
Stage<int,function<int(int)>,int> s6{f,6};
Stage<int,function<int(int)>,int> s7{f,7};
Stage<int,function<int(int)>,int> sp{print,8};
if(fun_code==0){
s2.fun = fast;
s3.fun = fast;
}
else{
s2.fun = fast_init;
s3.fun = fast_init;
}
Pipe<int,int> p ({&s1, &s2, &s3, &s4, &s5, &s6, &s7, &sp});
cout << "Pipe length: " << p.num_nodes() << endl;
list<int> li {};
for(int i=0; i<100; i++)
li.push_back(i);
p.run(move(li));
return 0;
}
Compile with:
g++ main.cpp -std=c++11 -pthread -O3 -o gpipe -g
Run with :
./gpipe 1
Thanks for any help!
Imagine the following code for a single-threaded program:
void func()
{
bool a = true;
while(a)
{
// busy wait...
}
}
Will this function ever return? Obviously not. If you were a compiler, how would you write optimized code for this?
1: NOP
2: GOTO 1
This is exactly what you're doing with this bit of code. Twice.
while(!_end){ // here #1
cout << "t " << i << ", r " << ++c << endl;
while(collapsing) // here #2
; // for the love of God, move your semicolon here or use braces
stage_func();
if(collapsed==1 && !_end)
next->stage_func();
}
Your compiler has absolutely no obligation to realize that you're doing multi-threading programming. (It's your job to tell it)
The compiler needs to know not to perform optimizations on _end and collapsed. DO NOT USE volatile. Why? volatile will keep the compiler from optimizing a variable, but... heh heh... the CPU can also potentially optimize away your writes to _end and collapsed from different threads (by keeping them in its cache and not writing to main memory). Compilers and CPU's will also re-order your instructions, which can cause similar problems.
Memory fences (aka memory barriers) can be used to instruct the CPU to do things like push out pending writes or re-update its cached value for reading. They also give guidelines for command re-ordering. AFAIK the std::atomic_thread_fence will prevent compiler reordering but I've read conflicting things about this...
By far the simplest, most-pragmatic, and easiest-to-prove-correct thing to do is just to switch all your inter-thread communicating variables to std::atomic<> types, which incorporate memory barriers. So
std::atomic<bool> _end;
std::atomic<int> collapsed;
As a general rule, any data that is shared between threads should be protected by a mutex OR be an std::atomic<> if race conditions are not an issue (as you are doing with the simple signaling). You can break this rule if you really know what you're doing and really know the architecture, compiler, and standard implementation really well, but that's a tall order even for an expert.
By the way, a mutex's lock and unlock operation both incorporate a memory barrier, in case you were worried about that. So when you get a pointer from the TSOHeap, that's fine (assuming your TSOHeap implementation is correct...I didn't look at it).
You have race conditions in TSOHeap when using size. While size is atomic, it is a part of larger state that is not atomic, so that changes in size are not synchronized with changes to the rest of the state.
Make size non-atomic and access it only when the mutex is locked. Add condition variables to notify threads waiting in push and pop.
Alternatively, remove size entirely. Example:
template<typename T>
struct TSOHeap
{
TSOHeap(size_t _max=10): max{_max}{}
void push(T* item, int id){
unique_lock<mutex> lock(heap_mutex);
while(heap.size() == max)
cnd_pop.wait(lock);
heap.push(pair<T*,int>(item, id));
cnd_push.notify_one();
}
pair<T*,int> pop() {
pair<T*,int> result = {};
{
unique_lock<mutex> lock(heap_mutex);
while(heap.empty())
cnd_push.wait(lock);
bool notify = heap.size() == max;
result = heap.top();
heap.pop();
if(notify)
cnd_pop.notify_one();
}
return result;
}
mutex heap_mutex;
condition_variable cnd_push, cnd_pop;
priority_queue<pair<T*,int>, vector<pair<T*,int>>,Comparator<T*>> heap;
size_t const max;
};
First, let me introduce you to my problem.
My code looks like this:
#include <iostream>
#include <thread>
#include <condition_variable>
std::mutex mtx;
std::mutex cvMtx;
std::mutex mtx2;
bool ready{false};
std::condition_variable cv;
int threadsFinishedCurrentLevel{0};
void tfunc() {
for(int i = 0; i < 5; i++) {
//do something
for (int j = 0; j < 10000; j++) {
std::cout << j << std::endl;
}
//this is i-th level
mtx2.lock();
threadsFinishedCurrentLevel++;
if (threadsFinishedCurrentLevel == 2) {
//this is last thread in current level
threadsFinishedCurrentLevel = 0;
cvMtx.unlock();
}
mtx2.unlock();
{
//wait for notify
unique_lock<mutex> lck(mtx);
while (!ready) cv_.wait(lck);
}
}
}
int main() {
cvMtx.lock(); //init
std::thread t1(tfunc);
std::thread t2(tfunc);
for (int i = 0; i < 5; i++) {
cvMtx.lock();
{
unique_lock<mutex> lck(mtx);
ready = true;
cv.notify_all();
}
}
t1.join();
t2.join();
return 0;
}
I have 2 threads. My computation consists of levels(for this example, lets say we have 5 levels). On the same level, computation can be divided to threads. Each thread then calculates part of a problem. When i want to step to the next(higher) level, lower level must be first done. So my idea is something like this. When last thread on the current level is done, it unlocks main thread, so it can notify all of the threads to continue to next level. But this notify has to be called more then once. Because there are plenty of these levels. Can this condition_variable be restarted or something? Or do I need for each level one condition_variable? So for example, when i have 1000 levels, i need to allocate dynamically 1000x condition_variable?
Is it just me or you are trying to block the main thread with a mutex (which is your way of trying to notify it when all threads are done?), I mean that's not the task of a mutex. That's where the condition variable should be used.
// New condition_variable, to nofity main thread when child is done with level
std::condition_variable cv2;
// When a child is done, it will update this counter
int counter = 0; // This is already protected by cvMtx, otherwise it could be atomic.
// This is to sync cout
std::mutex cout_mutex;
void tfunc()
{
for (int i = 0; i < 5; i++)
{
{
std::lock_guard<std::mutex> l(cout_mutex);
std::cout << "Level " << i + 1 << " " << std::this_thread::get_id() << std::endl;
}
{
std::lock_guard<std::mutex> l(cvMtx);
counter++; // update counter &
}
cv2.notify_all(); // notify main thread we are done.
{
//wait for notify
unique_lock<mutex> lck(mtx);
cv.wait(lck);
// Note that I've removed the "ready" flag here
// That's because u would need multiple ready flags to make that work
}
}
}
int main()
{
std::thread t1(tfunc);
std::thread t2(tfunc);
for (int i = 0; i < 5; i++)
{
{
unique_lock<mutex> lck(cvMtx);
// Wait takes a predicate which u can take advantage of
cv2.wait(lck, [] { return (counter == 2); });
counter = 0;
// This thread will get notified multiple times
// But it only will wake up when counter matches 2
// Which equals to how many threads we've created.
}
// Sleeping a bit to know the code is working
std::this_thread::sleep_for(std::chrono::milliseconds(1000));
// Wake up all threds and continue to next level.
unique_lock<mutex> lck(mtx);
cv.notify_all();
}
t1.join();
t2.join();
return 0;
}
The synchronization can be done with a single counter, threads increment the counter under lock and check for the counter to reach a multiple of the number of concurrent threads. This greatly simplifies the logic. I've made this change and also grouped the shared variables into a class, and provided member functions to access them. To avoid false sharing I've ensured that variables that are read-only are separate from those that are read-write by the threads, and also separated read-write variables by usage. The use of global variables is discouraged, see C++ Core Guidelines for this and other good advice.
The simplified code follows, you can see it live in ideone. Note: it looks like there isn't true concurrency in ideone, you'll have to run this on a multi-core environment to actually test hardware concurrency.
//http://stackoverflow.com/questions/35318942/stdcondition-variable-calling-notify-all-more-than-once
#include <iostream>
#include <functional>
#include <thread>
#include <mutex>
#include <vector>
#include <condition_variable>
static constexpr size_t CACHE_LINE_SIZE = 64;
static constexpr size_t NTHREADS = 2;
static constexpr size_t NLEVELS = 5;
static constexpr size_t NITERATIONS = 100;
class Synchronize
{
alignas(CACHE_LINE_SIZE) // read/write while threads are busy working
std::mutex mtx_std_cout;
alignas(CACHE_LINE_SIZE) // read/write while threads are synchronizing at level
std::mutex cvMtx;
std::condition_variable cv;
size_t threadsFinished{0};
alignas(CACHE_LINE_SIZE) // read-only parameters
const size_t n_threads;
const size_t n_levels;
public: // class Synchronize owns unique resources:
// - must be explicitly constructed
// - disallow default ctor,
// - disallow copy/move ctor and
// - disallow copy/move assignment
Synchronize( Synchronize const& ) = delete;
Synchronize & operator=( Synchronize const& ) = delete;
explicit Synchronize( size_t nthreads, size_t nlevels )
: n_threads{nthreads}, n_levels{nlevels}
{}
size_t nlevels() const { return n_levels; }
std::mutex & std_cout_mutex() { return mtx_std_cout; }
void level_done_wait_all( size_t level )
{
std::unique_lock<std::mutex> lk(cvMtx);
threadsFinished++;
cv.wait(lk, [&]{return threadsFinished >= n_threads * (level+1);});
cv.notify_all();
}
};
void tfunc( Synchronize & sync )
{
for(size_t i = 0; i < sync.nlevels(); i++)
{
//do something
for (size_t j = 0; j < NITERATIONS; j++) {
std::unique_lock<std::mutex> lck(sync.std_cout_mutex());
if (j == 0) std::cout << '\n';
std::cout << ' ' << i << ',' << j;
}
sync.level_done_wait_all(i);
}
}
int main() {
Synchronize sync{ NTHREADS, NLEVELS };
std::vector<std::thread*> threads(NTHREADS,nullptr);
for(auto&t:threads) t = new std::thread(tfunc,std::ref(sync));
for(auto t:threads) {
t->join();
delete t;
}
std::cout << std::endl;
return 0;
}
I have the following program (made up example!):
#include<thread>
#include<mutex>
#include<iostream>
class MultiClass {
public:
void Run() {
std::thread t1(&MultiClass::Calc, this);
std::thread t2(&MultiClass::Calc, this);
std::thread t3(&MultiClass::Calc, this);
t1.join();
t2.join();
t3.join();
}
private:
void Calc() {
for (int i = 0; i < 10; ++i) {
std::cout << i << std::endl;
}
}
};
int main() {
MultiClass m;
m.Run();
return 0;
}
What I need is to sync the loop iterations the following way and I cant come up with a solution (I've been fiddling for about an hour now using mutexes but cant find THE combination):
t1 and t2 shall do one loop iteration, then t3 shall do one iteration, then again t1 and t2 shall do one, then t3 shall do one.
So you see, I need t1 and t2 to do things simultaneously and after one iteration, t3 shall do one iteration on its own.
Can you point your finger on how I would be able to achieve that? Like I said, ive been trying this with mutexes and cant come up with a solution.
If you really want to do this by hand with the given thread structure, you could use something like this*:
class SyncObj {
mutex mux;
condition_variable cv;
bool completed[2]{ false,false };
public:
void signalCompetionT1T2(int id) {
lock_guard<mutex> ul(mux);
completed[id] = true;
cv.notify_all();
}
void signalCompetionT3() {
lock_guard<mutex> ul(mux);
completed[0] = false;
completed[1] = false;
cv.notify_all();
}
void waitForCompetionT1T2() {
unique_lock<mutex> ul(mux);
cv.wait(ul, [&]() {return completed[0] && completed[1]; });
}
void waitForCompetionT3(int id) {
unique_lock<mutex> ul(mux);
cv.wait(ul, [&]() {return !completed[id]; });
}
};
class MultiClass {
public:
void Run() {
std::thread t1(&MultiClass::Calc1, this);
std::thread t2(&MultiClass::Calc2, this);
std::thread t3(&MultiClass::Calc3, this);
t1.join();
t2.join();
t3.join();
}
private:
SyncObj obj;
void Calc1() {
for (int i = 0; i < 10; ++i) {
obj.waitForCompetionT3(0);
std::cout << "T1:" << i << std::endl;
obj.signalCompetionT1T2(0);
}
}
void Calc2() {
for (int i = 0; i < 10; ++i) {
obj.waitForCompetionT3(1);
std::cout << "T2:" << i << std::endl;
obj.signalCompetionT1T2(1);
}
}
void Calc3() {
for (int i = 0; i < 10; ++i) {
obj.waitForCompetionT1T2();
std::cout << "T3:" << i << std::endl;
obj.signalCompetionT3();
}
}
};
However, this is only a reasonable approach, if each iteration is computational expensive, such that you can ignore the synchronization overhead. If that is not the case you should probably better have a look at a proper parallel programming library like intel's tbb or microsofts ppl.
*)NOTE: This code is untested and unoptimized. I just wrote it to show what the general structure could look like
Use two condition variables, here is a sketch..
thread 1 & 2 wait on condition variable segment_1:
std::condition_variable segment_1;
thread 3 waits on condition variable segment_2;
std::condition_variable segment_2;
threads 1 & 2 should wait() on segment_1, and thread 3 should wait() on segment_2. To kick off threads 1 & 2, call notify_all() on segment_1, and once they complete, call notify_one() on segment_2 to kick off thread 3. You may want to use some controlling thread to control the sequence unless you can chain (i.e. once 1 & 2 complete, the last one to complete calls notify for thread 3 and so on..)
This is not perfect (see lost wakeups)