std::condition_variable calling notify_all more than once - c++

First, let me introduce you to my problem.
My code looks like this:
#include <iostream>
#include <thread>
#include <mutex>
#include <condition_variable>
std::mutex mtx;
std::mutex cvMtx;
std::mutex mtx2;
bool ready{false};
std::condition_variable cv;
int threadsFinishedCurrentLevel{0};
void tfunc() {
for(int i = 0; i < 5; i++) {
//do something
for (int j = 0; j < 10000; j++) {
std::cout << j << std::endl;
}
//this is i-th level
mtx2.lock();
threadsFinishedCurrentLevel++;
if (threadsFinishedCurrentLevel == 2) {
//this is last thread in current level
threadsFinishedCurrentLevel = 0;
cvMtx.unlock();
}
mtx2.unlock();
{
//wait for notify
std::unique_lock<std::mutex> lck(mtx);
while (!ready) cv.wait(lck);
}
}
}
int main() {
cvMtx.lock(); //init
std::thread t1(tfunc);
std::thread t2(tfunc);
for (int i = 0; i < 5; i++) {
cvMtx.lock();
{
std::unique_lock<std::mutex> lck(mtx);
ready = true;
cv.notify_all();
}
}
t1.join();
t2.join();
return 0;
}
I have 2 threads. My computation consists of levels (for this example, let's say there are 5). Within a level, the computation can be divided among threads, and each thread calculates part of the problem. Before I step to the next (higher) level, the lower level must be finished. So my idea is this: when the last thread on the current level is done, it unlocks the main thread, which then notifies all of the threads to continue to the next level. But this notify has to be called more than once, because there are many levels. Can this condition_variable be reset somehow? Or do I need one condition_variable per level? For example, with 1000 levels, do I need to dynamically allocate 1000 condition_variables?

It looks like you are trying to block the main thread with a mutex as a way of notifying it when all threads are done. That is not the job of a mutex; it is what a condition variable is for.
// New condition_variable, to notify the main thread when a child is done with a level
std::condition_variable cv2;
// When a child is done, it will update this counter
int counter = 0; // This is already protected by cvMtx, otherwise it could be atomic.
// This is to sync cout
std::mutex cout_mutex;
void tfunc()
{
for (int i = 0; i < 5; i++)
{
{
std::lock_guard<std::mutex> l(cout_mutex);
std::cout << "Level " << i + 1 << " " << std::this_thread::get_id() << std::endl;
}
{
std::lock_guard<std::mutex> l(cvMtx);
counter++; // update counter &
}
cv2.notify_all(); // notify main thread we are done.
{
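//wait for notify
std::unique_lock<std::mutex> lck(mtx);
cv.wait(lck);
// Note that I've removed the "ready" flag here
// That's because you would need multiple ready flags to make that work
// Caveat: without a predicate, wait() can return on a spurious wakeup
// and can miss a notify that is sent before this thread starts waiting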
}
}
}
int main()
{
std::thread t1(tfunc);
std::thread t2(tfunc);
for (int i = 0; i < 5; i++)
{
{
std::unique_lock<std::mutex> lck(cvMtx);
// wait() takes a predicate which you can take advantage of
cv2.wait(lck, [] { return (counter == 2); });
counter = 0;
// This thread will get notified multiple times
// But it only will wake up when counter matches 2
// Which equals to how many threads we've created.
}
// Sleeping a bit to know the code is working
std::this_thread::sleep_for(std::chrono::milliseconds(1000));
// Wake up all threads and continue to next level.
std::unique_lock<std::mutex> lck(mtx);
cv.notify_all();
}
t1.join();
t2.join();
return 0;
}
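The remark above about needing "multiple ready flags" can also be handled with a single counter that changes once per level, so one condition_variable serves any number of levels. The following is a minimal sketch of that idea, not the answerer's code: `LevelBarrier`, `arrive_and_wait` and `run_levels` are hypothetical names introduced here for illustration.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper (not from the answers): a reusable barrier built from
// one mutex, one condition_variable and a generation counter.
class LevelBarrier {
    std::mutex m;
    std::condition_variable cv;
    const std::size_t n_threads;
    std::size_t arrived = 0;     // threads that have finished the current level
    std::size_t generation = 0;  // incremented once per completed level
public:
    explicit LevelBarrier(std::size_t n) : n_threads(n) {}

    void arrive_and_wait() {
        std::unique_lock<std::mutex> lk(m);
        const std::size_t my_generation = generation;
        if (++arrived == n_threads) {
            // Last thread of this level: reset the count and release everyone.
            arrived = 0;
            ++generation;
            cv.notify_all();
        } else {
            // The predicate survives spurious wakeups and early notifies.
            cv.wait(lk, [&] { return generation != my_generation; });
        }
    }
};

// Runs n_threads workers through n_levels levels and returns how often a
// worker started a level before the previous one was complete (expected: 0).
std::size_t run_levels(std::size_t n_threads, std::size_t n_levels) {
    LevelBarrier barrier(n_threads);
    std::mutex m;
    std::vector<std::size_t> finished(n_levels, 0);
    std::size_t violations = 0;
    auto worker = [&] {
        for (std::size_t level = 0; level < n_levels; ++level) {
            {
                std::lock_guard<std::mutex> lk(m);
                if (level > 0 && finished[level - 1] != n_threads) ++violations;
                ++finished[level];  // stands in for the real per-level work
            }
            barrier.arrive_and_wait();
        }
    };
    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < n_threads; ++t) threads.emplace_back(worker);
    for (auto& t : threads) t.join();
    return violations;
}
```

Calling `run_levels(2, 5)` mirrors the two threads and five levels of the question. No main thread is involved in the hand-off here, which is a design choice of this sketch rather than something the answer prescribes.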

The synchronization can be done with a single counter: threads increment the counter under a lock and wait for it to reach a multiple of the number of concurrent threads. This greatly simplifies the logic. I've made this change, grouped the shared variables into a class, and provided member functions to access them. To avoid false sharing, I've ensured that read-only variables are kept separate from those the threads read and write, and I've also separated the read-write variables by usage. The use of global variables is discouraged; see the C++ Core Guidelines for this and other good advice.
The simplified code follows; you can see it live on ideone. Note: ideone does not appear to offer true concurrency, so you'll have to run this in a multi-core environment to actually test hardware concurrency.
//http://stackoverflow.com/questions/35318942/stdcondition-variable-calling-notify-all-more-than-once
#include <iostream>
#include <functional>
#include <thread>
#include <mutex>
#include <vector>
#include <condition_variable>
static constexpr size_t CACHE_LINE_SIZE = 64;
static constexpr size_t NTHREADS = 2;
static constexpr size_t NLEVELS = 5;
static constexpr size_t NITERATIONS = 100;
class Synchronize
{
alignas(CACHE_LINE_SIZE) // read/write while threads are busy working
std::mutex mtx_std_cout;
alignas(CACHE_LINE_SIZE) // read/write while threads are synchronizing at level
std::mutex cvMtx;
std::condition_variable cv;
size_t threadsFinished{0};
alignas(CACHE_LINE_SIZE) // read-only parameters
const size_t n_threads;
const size_t n_levels;
public: // class Synchronize owns unique resources:
// - must be explicitly constructed
// - disallow default ctor,
// - disallow copy/move ctor and
// - disallow copy/move assignment
Synchronize( Synchronize const& ) = delete;
Synchronize & operator=( Synchronize const& ) = delete;
explicit Synchronize( size_t nthreads, size_t nlevels )
: n_threads{nthreads}, n_levels{nlevels}
{}
size_t nlevels() const { return n_levels; }
std::mutex & std_cout_mutex() { return mtx_std_cout; }
void level_done_wait_all( size_t level )
{
std::unique_lock<std::mutex> lk(cvMtx);
threadsFinished++;
cv.wait(lk, [&]{return threadsFinished >= n_threads * (level+1);});
cv.notify_all();
}
};
void tfunc( Synchronize & sync )
{
for(size_t i = 0; i < sync.nlevels(); i++)
{
//do something
for (size_t j = 0; j < NITERATIONS; j++) {
std::unique_lock<std::mutex> lck(sync.std_cout_mutex());
if (j == 0) std::cout << '\n';
std::cout << ' ' << i << ',' << j;
}
sync.level_done_wait_all(i);
}
}
int main() {
Synchronize sync{ NTHREADS, NLEVELS };
std::vector<std::thread*> threads(NTHREADS,nullptr);
for(auto&t:threads) t = new std::thread(tfunc,std::ref(sync));
for(auto t:threads) {
t->join();
delete t;
}
std::cout << std::endl;
return 0;
}

Related

Thread pool with job queue gets stuck

I want to split jobs among multiple std::thread workers and continue once they are all done.
To do so, I implemented a thread pool class mainly based on this SO answer.
I noticed, however, that my benchmarks can get stuck, running forever, without any errors thrown.
I wrote a minimal reproducing code, enclosed at the end.
Based on terminal output, the issue seems to occur when the jobs are being queued.
I checked videos (1, 2), documentation (3) and blog posts (4).
I tried replacing the type of the locks, using atomics.
I could not find the underlying cause.
Here is the snippet to replicate the issue.
The program repeatedly counts the odd elements in the test vector.
#include <algorithm> // std::min
#include <atomic>
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <stdexcept> // std::invalid_argument
#include <thread>
#include <vector>
class Pool {
public:
const int worker_count;
bool to_terminate = false;
std::atomic<int> unfinished_tasks = 0;
std::mutex mutex;
std::condition_variable condition;
std::vector<std::thread> threads;
std::queue<std::function<void()>> jobs;
void thread_loop()
{
while (true) {
std::function<void()> job;
{
std::unique_lock<std::mutex> lock(mutex);
condition.wait(lock, [&] { return (!jobs.empty()) || to_terminate; });
if (to_terminate)
return;
job = jobs.front();
jobs.pop();
}
job();
unfinished_tasks -= 1;
}
}
public:
Pool(int size) : worker_count(size)
{
if (size < 0)
throw std::invalid_argument("Worker count needs to be a positive integer");
for (int i = 0; i < worker_count; ++i)
threads.push_back(std::thread(&Pool::thread_loop, this));
};
~Pool()
{
{
std::unique_lock lock(mutex);
to_terminate = true;
}
condition.notify_all();
for (auto &thread : threads)
thread.join();
threads.clear();
};
void queue_job(const std::function<void()> &job)
{
{
std::unique_lock<std::mutex> lock(mutex);
jobs.push(job);
unfinished_tasks += 1;
// std::cout << unfinished_tasks;
}
condition.notify_one();
}
void wait()
{
while (unfinished_tasks) {
; // spinlock
};
}
};
int main()
{
constexpr int worker_count = 8;
constexpr int vector_size = 1 << 10;
Pool pool = Pool(worker_count);
std::vector<int> test_vector;
test_vector.reserve(vector_size);
for (int i = 0; i < vector_size; ++i)
test_vector.push_back(i);
std::vector<int> worker_odd_counts(worker_count, 0);
std::function<void(int)> worker_task = [&](int thread_id) {
int chunk_size = vector_size / (worker_count) + 1;
int my_start = thread_id * chunk_size;
int my_end = std::min(my_start + chunk_size, vector_size);
int local_odd_count = 0;
for (int ii = my_start; ii < my_end; ++ii)
if (test_vector[ii] % 2 != 0)
++local_odd_count;
worker_odd_counts[thread_id] = local_odd_count;
};
for (int iteration = 0;; ++iteration) {
std::cout << "Jobs.." << std::flush;
for (int i = 0; i < worker_count; ++i)
pool.queue_job([&worker_task, i] { worker_task(i); });
std::cout << "..queued. " << std::flush;
pool.wait();
int odd_count = 0;
for (auto elem : worker_odd_counts)
odd_count += elem;
std::cout << "Iter:" << iteration << ". Odd:" << odd_count << '\n';
}
}
Here is the terminal output of one specific run:
[...]
Jobs....queued. Iter:2994. Odd:512
Jobs....queued. Iter:2995. Odd:512
Jobs..
Edit:
The error occurs using GCC 12.2.0 x86_64-w64-mingw32 on Windows 10 with an AMD Ryzen 4750U CPU. I do not get past 15k iterations.
Using Visual Studio Community 2022, I got past 1.5M iterations (and stopped it myself). Thanks @IgorTandetnik for pointing out the latter.
MinGW doesn't natively support multithreading on Windows. It supports threads in its C++ standard library through the POSIX API, using the winpthreads compatibility layer, which implements that API on top of Windows OS threads.
I think your error is not in the C++ code, but in the computer setup. Do the following.
Use the compiler from the x86_64-12.2.0-release-posix-seh-ucrt-rt_v10-rev2.7z archive.
Don't forget that a binary built that way depends on several DLL files provided by the compiler: libgcc_s_seh-1.dll, libwinpthread-1.dll and libstdc++-6.dll. You must use exactly the same versions of these DLLs that were shipped with MinGW. If you have other versions of these DLLs anywhere in your %PATH%, expect all kinds of failures.
A couple of general notes.
Linux-first C++ compilers like gcc have issues on Windows. A path of least resistance is using Visual C++ instead. If you want your software to build on other platforms as well, consider cmake to abstract away the compiler.
Windows already includes a thread pool implementation, since Vista. The API is easy to use; you only need 4 functions: CreateThreadpoolWork, SubmitThreadpoolWork, WaitForThreadpoolWorkCallbacks, and CloseThreadpoolWork.
The first thing you should do is split the queue from the thread pool. They are both tricky enough; writing both of them commingled in one class is asking for trouble.
This also allows you to unit test the queue without the pool.
template<class Payload>
class MutexQueue {
public:
std::optional<Payload> wait_and_pop();
void push(Payload);
void terminate_queue();
bool queue_is_terminated() const;
private:
mutable std::mutex m;
std::condition_variable cv;
std::deque<Payload> q;
bool terminated = false;
std::unique_lock<std::mutex> lock() const {
return std::unique_lock<std::mutex>(m);
}
};
This is a bit easier to write than the thread pool.
void push(Payload p) {
{
auto l = lock();
if (terminated) return; // pushes after termination are dropped
q.push_back(std::move(p));
}
cv.notify_one();
}
void terminate_queue() {
{
auto l = lock(); // YOU CANNOT SKIP THIS LOCK, even if terminated is atomic
terminated = true;
q.clear();
}
cv.notify_all();
}
bool queue_is_terminated() const {
auto l = lock(); // if you make terminated atomic, you CAN skip this lock
return terminated;
}
std::optional<Payload> wait_and_pop() {
auto l = lock();
cv.wait(l, [&]{ return terminated || !q.empty(); });
if (terminated) return std::nullopt;
auto r = std::move(q.front());
q.pop_front();
return std::move(r);
}
there we go.
Now our thread pool is simpler.
struct ThreadPool {
explicit ThreadPool(std::size_t n) {
create_threads(n);
}
std::future<void> push_task(std::function<void()> f) {
std::packaged_task<void()> p(std::move(f)); // the constructor is explicit, so use direct-initialization
auto r = p.get_future();
q.push( std::move(p) );
return r;
}
void terminate_pool() {
q.terminate_queue();
terminate_threads();
}
~ThreadPool() {
terminate_pool();
}
private:
MutexQueue<std::packaged_task<void()>> q;
std::vector<std::thread> threads;
void terminate_threads() {
for(auto& thread:threads)
thread.join();
threads.clear();
}
static void thread_task( MutexQueue<std::packaged_task<void()>>* pq ) {
if (!pq) return;
while (auto task = pq->wait_and_pop()) {
(*task)();
}
}
void create_threads(std::size_t n) {
for (std::size_t i = 0; i < n; ++i) {
threads.push_back( std::thread( thread_task, &q ) );
}
}
};
I cannot spot an error in your code. But with the above, you can test a split of the queue from the pool.
The queue will work with pthreads or other primitives.
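To show how the split could be used for the original benchmark, here is a compact sketch that assembles the two pieces into one translation unit and waits on the returned futures instead of spinning on an atomic counter. It is an illustration under assumptions, not a drop-in fix for the hang: the worker loop is written as a lambda rather than a static member, and `count_odds` with its fixed pool size of 4 is a hypothetical driver added here.

```cpp
#include <algorithm>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <future>
#include <mutex>
#include <optional>
#include <thread>
#include <vector>

template <class Payload>
class MutexQueue {
public:
    void push(Payload p) {
        {
            std::lock_guard<std::mutex> l(m);
            if (terminated) return;
            q.push_back(std::move(p));
        }
        cv.notify_one();
    }
    void terminate_queue() {
        {
            std::lock_guard<std::mutex> l(m);  // flag is changed under the lock
            terminated = true;
            q.clear();
        }
        cv.notify_all();
    }
    std::optional<Payload> wait_and_pop() {
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [&] { return terminated || !q.empty(); });
        if (terminated) return std::nullopt;
        std::optional<Payload> r(std::move(q.front()));
        q.pop_front();
        return r;
    }
private:
    std::mutex m;
    std::condition_variable cv;
    std::deque<Payload> q;
    bool terminated = false;
};

class ThreadPool {
public:
    explicit ThreadPool(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            threads.emplace_back([this] {
                while (auto task = q.wait_and_pop()) (*task)();
            });
    }
    std::future<void> push_task(std::function<void()> f) {
        std::packaged_task<void()> p(std::move(f));
        auto r = p.get_future();
        q.push(std::move(p));
        return r;
    }
    ~ThreadPool() {
        q.terminate_queue();
        for (auto& t : threads) t.join();
    }
private:
    MutexQueue<std::packaged_task<void()>> q;
    std::vector<std::thread> threads;
};

// Hypothetical driver: counts the odd numbers in [0, size) using `chunks` jobs.
int count_odds(int size, int chunks) {
    ThreadPool pool(4);  // pool size chosen arbitrarily for the sketch
    std::vector<int> partial(chunks, 0);
    std::vector<std::future<void>> pending;
    const int chunk = size / chunks + 1;
    for (int c = 0; c < chunks; ++c)
        pending.push_back(pool.push_task([&partial, c, chunk, size] {
            const int end = std::min((c + 1) * chunk, size);
            for (int i = c * chunk; i < end; ++i)
                if (i % 2 != 0) ++partial[c];
        }));
    for (auto& f : pending) f.get();  // blocks until that job has finished
    int total = 0;
    for (int v : partial) total += v;
    return total;
}
```

With the question's parameters, `count_odds(1 << 10, 8)` would be the equivalent of one benchmark iteration. Waiting on futures lets the caller sleep until each job completes, which is one reason to prefer it over a busy loop.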

Thread pool with individual std::function jobs per worker crashes with segmentation fault

I have successfully implemented the thread pool from an answer on Stack Overflow, which helped me in speeding up my program. It uses a single std::queue to distribute jobs (std::function<void()>) among multiple workers (std::threads).
I wanted to improve on this. As I only need to run a limited set of functions, I planned to ditch the queue and use variables instead. In other words, the n-th worker would do the n-th job from the std::vector<std::function<void()>>. Unfortunately, my test app crashes with Segmentation fault (core dumped) and I have not been able to find my mistake so far.
Here is my ~minimal reproducible code, with the job of counting the odd elements in a vector. (Idea taken from Scott Meyers: Cpu Caches and Why You Care.)
#include <algorithm>
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <stdexcept> // std::invalid_argument
#include <thread>
#include <vector>
// Thread pool with a std::function for each worker.
class Pool {
public:
enum class Status {
idle,
working,
terminate
};
const int worker_count;
std::vector<Status> statuses;
std::vector<std::mutex> mutexes;
std::vector<std::condition_variable> conditions;
std::vector<std::thread> threads;
std::vector<std::function<void()>> jobs;
void thread_loop(int thread_id)
{
std::puts("Thread started");
auto &my_status = statuses[thread_id];
auto &my_mutex = mutexes[thread_id];
auto &my_condition = conditions[thread_id];
auto &my_job = jobs[thread_id];
while (true) {
std::unique_lock<std::mutex> lock(my_mutex);
my_condition.wait(lock, [this, &my_status] { return my_status != Status::idle; });
if (my_status == Status::terminate)
return;
my_job();
my_status = Status::idle;
lock.unlock();
my_condition.notify_one(); // Tell the main thread we are done
}
}
public:
Pool(int size) : worker_count(size), statuses(size, Status::idle), mutexes(size), conditions(size), threads(), jobs(size)
{
if (size < 0)
throw std::invalid_argument("Worker count needs to be a positive integer");
};
~Pool()
{
for (int i = 0; i < worker_count; ++i) {
std::unique_lock lock(mutexes[i]);
statuses[i] = Status::terminate;
lock.unlock(); // Unlock before notifying
conditions[i].notify_one();
}
for (auto &thread : threads)
thread.join();
threads.clear();
};
void start_threads()
{
threads.resize(worker_count);
jobs.resize(worker_count);
for (int i = 0; i < worker_count; ++i) {
statuses[i] = Status::idle;
jobs[i] = []() { std::puts("I am running"); };
threads[i] = std::thread(&Pool::thread_loop, this, i);
}
}
void set_and_start_job(const std::function<void(int)> &job)
{
for (int i = 0; i < worker_count; ++i) {
std::unique_lock lock(mutexes[i]);
jobs[i] = [&job, i]() { job(i); };
statuses[i] = Status::working;
lock.unlock();
conditions[i].notify_one();
}
}
void wait()
{
for (int i = 0; i < worker_count; ++i) {
auto &my_status = statuses[i];
std::unique_lock lock(mutexes[i]);
conditions[i].wait(lock, [this, &my_status] { return my_status != Status::working; });
}
}
};
int main()
{
constexpr int worker_count = 1;
constexpr int vector_size = 1 << 10;
std::vector<int> test_vector;
test_vector.reserve(vector_size);
for (int i = 0; i < vector_size; ++i)
test_vector.push_back(i);
std::vector<int> worker_odd_counts(worker_count, 0);
const auto worker_task = [&](int thread_id) {
int chunk_size = vector_size / (worker_count) + 1;
int my_start = thread_id * chunk_size;
int my_end = std::min(my_start + chunk_size, vector_size);
int local_odd_count = 0;
for (int ii = my_start; ii < my_end; ++ii)
if (test_vector[ii] % 2 != 0)
++local_odd_count;
worker_odd_counts[thread_id] = local_odd_count;
};
Pool pool = Pool(worker_count);
pool.start_threads();
pool.set_and_start_job(worker_task);
pool.wait();
int odd_count = 0;
for (auto elem : worker_odd_counts)
odd_count += elem;
std::cout << odd_count << '\n';
}
TL;DR version:
The simplest fix is to change
jobs[i] = [&job, i]() { job(i); };
to
jobs[i] = [job, i]() { job(i); };
This captures job by value and makes a copy. The copy won't go out of scope before the lambda does and the lambda will outlive the thread.
The Long version:
The problem is at
jobs[i] = [&job, i]() { job(i); };
in set_and_start_job. The object backing job goes out of scope before the threads get started, but how can this be if
pool.set_and_start_job(worker_task);
and worker_task won't go out of scope until after the threads are joined?
Turns out that's because set_and_start_job requires a const std::function<void(int)> & and worker_task isn't a std::function, merely implicitly convertible to a std::function. This conversion makes a temporary variable with a lifespan bound to set_and_start_job's job parameter. When set_and_start_job exits, job goes out of scope and the temporary is destroyed.
The simple fix is above, but we can also make the conversion right at the source, so that a std::function is passed all the way through the system and goes out of scope only after the threads are joined.
const std::function<void(int)> worker_task = [&](int thread_id) { ... };
There may be some small resource saving in end-to-end std::function and capturing a reference, but my experiences with references and threads haven't been the best, so I'd prefer the copy to reduce the possibility that I've missed some subtlety or someone in the future will make a change that adds some.
In the function Pool::set_and_start_job, when setting the job, removing the & from the job capture seems to have resolved the issue:
jobs[i] = [job, i]() { job(i); };
However, this was only a suspicion; I do not know the underlying cause.
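The capture-by-value fix can be seen in isolation in the short sketch below. `make_jobs` and `sum_of_ids` are hypothetical names used only for this illustration; the point is that the jobs remain valid after the temporary std::function created for the call has been destroyed, because each job holds its own copy.

```cpp
#include <functional>
#include <vector>

// Hypothetical helper: builds one job per worker index.
std::vector<std::function<void()>> make_jobs(const std::function<void(int)>& job, int n) {
    std::vector<std::function<void()>> jobs;
    for (int i = 0; i < n; ++i)
        jobs.push_back([job, i] { job(i); });  // copies job, so no dangling reference
    return jobs;
}

int sum_of_ids(int n) {
    int sum = 0;
    // The lambda is converted to a temporary std::function for the call;
    // that temporary is destroyed at the end of this statement.
    auto jobs = make_jobs([&sum](int id) { sum += id; }, n);
    // The jobs are still safe to run: each one owns a copy of the std::function.
    for (auto& j : jobs) j();
    return sum;
}
```

With `[&job, i]` in place of `[job, i]`, the same calls would read a destroyed temporary, which is the undefined behavior described above.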

Thread Pool join hangs when oversubscribing threads

I'm having an issue with a thread hanging when joining my threads in a thread pool I have created. The issue only occurs if I loop over the thread pool execution a large number of times.
I have a thread pool class like the following;
#include <queue>
#include <mutex>
#include <condition_variable>
#include <functional>
#include <atomic>
#include <vector>
#include <thread>
#include <iostream>
class ThreadPool
{
public:
ThreadPool()
{
m_shutdown.store(false, std::memory_order_relaxed);
createThreads(1);
}
ThreadPool(std::size_t numThreads)
{
m_shutdown.store(false, std::memory_order_relaxed);
createThreads(numThreads);
}
void add_job(std::function<void()> new_job)
{
{
std::scoped_lock<std::mutex> lock(m_jobMutex);
m_jobQueue.push(new_job);
}
m_notifier.notify_one();
}
void waitFinished()
{
{
std::unique_lock<std::mutex> lock(m_jobMutex);
m_finished.wait(lock, [this] {return m_jobQueue.empty(); }); //&& busy == 0
}
m_shutdown.store(true, std::memory_order_relaxed);
m_notifier.notify_all();
for (std::thread& th : m_threads)
{
th.join();
}
m_threads.clear();
}
private:
using Job = std::function<void()>;
std::vector<std::thread> m_threads;
std::queue<Job> m_jobQueue;
std::condition_variable m_notifier;
std::condition_variable m_finished;
std::mutex m_jobMutex;
std::atomic<bool> m_shutdown;
void createThreads(std::size_t numThreads)
{
// Set up threads
m_threads.reserve(numThreads);
for (int i = 0; i != numThreads; ++i)
{
m_threads.emplace_back(std::thread([this]()
{
// Infinite loop to consume tasks from queue and execute
while (true)
{
Job job;
{
std::unique_lock<std::mutex> lock(m_jobMutex);
m_notifier.wait(lock, [this] {return !m_jobQueue.empty() || m_shutdown.load(std::memory_order_relaxed); });
if (m_shutdown.load(std::memory_order_relaxed) || m_jobQueue.empty())
{
break;
}
job = std::move(m_jobQueue.front());
m_jobQueue.pop();
}
job();
m_finished.notify_one();
}
}));
}
}
};
I run this in a simple manner, like the following;
void threader (int x) {
std::cout<<"In threaded function: "<<x<<std::endl;
}
int main()
{
//outer loop
for (auto i = 0; i < 10000; i++) {
//Thread pool
int num_threads = std::thread::hardware_concurrency();
ThreadPool test_pool(num_threads);
// Assign work
for (int j = 0; j < 48; j++) {
test_pool.add_job(std::bind(threader, j));
}
test_pool.waitFinished();
std::cout<<"Thread Pool Done"<<std::endl;
}
}
After a number of outer loop iterations, the join in waitFinished hangs for a thread. The error only seems to occur once the number of outer loop iterations is large enough. I have investigated this and can see that the threader function gets called 48 times, so it looks like all jobs complete. It seems to be the joining of threads in the waitFinished function of the ThreadPool that is causing the hang.
Is there something obvious I'm doing wrong ?
Many thanks!
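One possible cause, offered as an assumption rather than a confirmed diagnosis: m_shutdown is stored without holding m_jobMutex. A worker can evaluate the wait predicate as false and then, before it actually blocks, miss the store and the notify_all, after which it sleeps forever and join never returns. The sketch below reduces the pool to its shutdown handshake and sets the flag under the mutex; `Workers` and `start_and_stop` are hypothetical names for this illustration.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical reduced pool: only the start-up and shutdown handshake.
class Workers {
    std::mutex m;
    std::condition_variable cv;
    bool shutdown = false;  // guarded by m, so it does not need to be atomic
    std::vector<std::thread> threads;
public:
    explicit Workers(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            threads.emplace_back([this] {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [this] { return shutdown; });
            });
    }
    void stop() {
        {
            // Holding the mutex means no worker can sit between
            // "predicate was false" and "blocked in wait" during the change.
            std::lock_guard<std::mutex> lock(m);
            shutdown = true;
        }
        cv.notify_all();
        for (auto& t : threads) t.join();
        threads.clear();
    }
    ~Workers() {
        if (!threads.empty()) stop();
    }
};

// Creates and stops a pool `rounds` times; returns the number of completed rounds.
int start_and_stop(int rounds, std::size_t n_threads) {
    int completed = 0;
    for (int r = 0; r < rounds; ++r) {
        Workers w(n_threads);
        w.stop();
        ++completed;
    }
    return completed;
}
```

Applied to the posted class, the equivalent change would be to take m_jobMutex around the m_shutdown store in waitFinished. Whether that is the only problem in the original code has not been verified here.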

Synchronize three Threads in C++

I have the following program (made up example!):
#include<thread>
#include<mutex>
#include<iostream>
class MultiClass {
public:
void Run() {
std::thread t1(&MultiClass::Calc, this);
std::thread t2(&MultiClass::Calc, this);
std::thread t3(&MultiClass::Calc, this);
t1.join();
t2.join();
t3.join();
}
private:
void Calc() {
for (int i = 0; i < 10; ++i) {
std::cout << i << std::endl;
}
}
};
int main() {
MultiClass m;
m.Run();
return 0;
}
What I need is to sync the loop iterations in the following way, and I can't come up with a solution (I've been fiddling for about an hour now using mutexes but can't find the right combination):
t1 and t2 shall do one loop iteration, then t3 shall do one iteration, then again t1 and t2 shall do one, then t3 shall do one.
So you see, I need t1 and t2 to do things simultaneously and, after one iteration, t3 shall do one iteration on its own.
Can you point me to how I could achieve that? Like I said, I've been trying this with mutexes and can't come up with a solution.
If you really want to do this by hand with the given thread structure, you could use something like this*:
class SyncObj {
std::mutex mux;
std::condition_variable cv; // requires #include <condition_variable>
bool completed[2]{ false,false };
public:
void signalCompetionT1T2(int id) {
std::lock_guard<std::mutex> ul(mux);
completed[id] = true;
cv.notify_all();
}
void signalCompetionT3() {
std::lock_guard<std::mutex> ul(mux);
completed[0] = false;
completed[1] = false;
cv.notify_all();
}
void waitForCompetionT1T2() {
std::unique_lock<std::mutex> ul(mux);
cv.wait(ul, [&]() {return completed[0] && completed[1]; });
}
void waitForCompetionT3(int id) {
std::unique_lock<std::mutex> ul(mux);
cv.wait(ul, [&]() {return !completed[id]; });
}
};
class MultiClass {
public:
void Run() {
std::thread t1(&MultiClass::Calc1, this);
std::thread t2(&MultiClass::Calc2, this);
std::thread t3(&MultiClass::Calc3, this);
t1.join();
t2.join();
t3.join();
}
private:
SyncObj obj;
void Calc1() {
for (int i = 0; i < 10; ++i) {
obj.waitForCompetionT3(0);
std::cout << "T1:" << i << std::endl;
obj.signalCompetionT1T2(0);
}
}
void Calc2() {
for (int i = 0; i < 10; ++i) {
obj.waitForCompetionT3(1);
std::cout << "T2:" << i << std::endl;
obj.signalCompetionT1T2(1);
}
}
void Calc3() {
for (int i = 0; i < 10; ++i) {
obj.waitForCompetionT1T2();
std::cout << "T3:" << i << std::endl;
obj.signalCompetionT3();
}
}
};
However, this is only a reasonable approach if each iteration is computationally expensive, so that the synchronization overhead can be ignored. If that is not the case, you should probably have a look at a proper parallel programming library such as Intel's TBB or Microsoft's PPL.
*)NOTE: This code is untested and unoptimized. I just wrote it to show what the general structure could look like
Use two condition variables. Here is a sketch.
Threads 1 and 2 wait on condition variable segment_1:
std::condition_variable segment_1;
Thread 3 waits on condition variable segment_2:
std::condition_variable segment_2;
Threads 1 and 2 should wait() on segment_1, and thread 3 should wait() on segment_2. To kick off threads 1 and 2, call notify_all() on segment_1, and once they complete, call notify_one() on segment_2 to kick off thread 3. You may want a controlling thread to drive the sequence, unless you can chain the steps (i.e. once 1 and 2 complete, the last one to complete calls notify for thread 3, and so on).
This is not perfect (see lost wakeups).
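The two-condition-variable idea can be made robust against lost wakeups by pairing each wait with a predicate over shared state. The following is a minimal sketch of the chained variant, in which the last of threads 1 and 2 to finish wakes thread 3. The state variables `round`, `done` and `t3_turn` and the function `run_rounds` are assumptions added for illustration; the original sketch names only the two condition variables.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Returns the thread ids (1, 2 or 3) in the order their iterations completed.
std::vector<int> run_rounds(int rounds) {
    std::mutex m;
    std::condition_variable segment_1;  // threads 1 and 2 wait here
    std::condition_variable segment_2;  // thread 3 waits here
    int round = 0;         // round that threads 1 and 2 are allowed to run
    int done = 0;          // how many of threads 1 and 2 finished this round
    bool t3_turn = false;  // true while it is thread 3's turn
    std::vector<int> log;

    auto pair_worker = [&](int id) {
        for (int i = 0; i < rounds; ++i) {
            {
                std::unique_lock<std::mutex> lk(m);
                segment_1.wait(lk, [&] { return round == i; });
            }
            // The real work of this iteration would run here, outside the lock.
            std::lock_guard<std::mutex> lk(m);
            log.push_back(id);
            if (++done == 2) {  // last of the pair hands over to thread 3
                done = 0;
                t3_turn = true;
                segment_2.notify_one();
            }
        }
    };
    auto single_worker = [&] {
        for (int i = 0; i < rounds; ++i) {
            {
                std::unique_lock<std::mutex> lk(m);
                segment_2.wait(lk, [&] { return t3_turn; });
            }
            // Thread 3's work for this iteration would run here.
            std::lock_guard<std::mutex> lk(m);
            log.push_back(3);
            t3_turn = false;
            ++round;  // opens the next round for threads 1 and 2
            segment_1.notify_all();
        }
    };

    std::thread t1(pair_worker, 1), t2(pair_worker, 2), t3(single_worker);
    t1.join();
    t2.join();
    t3.join();
    return log;
}
```

Because every wait checks a predicate, a notification sent before the waiter arrives is not lost: the waiter sees the updated state and does not block.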

whats the use of shared mutex?

Consider following example -
#include <boost/thread.hpp>
#include <iostream>
#include <vector>
#include <cstdlib>
#include <ctime>
void wait(int seconds)
{
boost::this_thread::sleep(boost::posix_time::seconds(seconds));
}
boost::shared_mutex mutex;
std::vector<int> random_numbers;
void fill()
{
std::srand(static_cast<unsigned int>(std::time(0)));
for (int i = 0; i < 3; ++i)
{
boost::unique_lock<boost::shared_mutex> lock(mutex);
random_numbers.push_back(std::rand());
lock.unlock();
wait(1);
}
}
void print()
{
for (int i = 0; i < 3; ++i)
{
wait(1);
boost::shared_lock<boost::shared_mutex> lock(mutex);
std::cout << random_numbers.back() << std::endl;
}
}
int sum = 0;
void count()
{
for (int i = 0; i < 3; ++i)
{
wait(1);
boost::shared_lock<boost::shared_mutex> lock(mutex);
sum += random_numbers.back();
}
}
int main()
{
boost::thread t1(fill);
boost::thread t2(print);
boost::thread t3(count);
t1.join();
t2.join();
t3.join();
std::cout << "Summe: " << sum << std::endl;
}
In the given example, both print() and count() access random_numbers read-only. While the print() function writes the last number of random_numbers to the standard output stream, the count() function adds it to the variable sum. Since neither function modifies random_numbers, both can access it at the same time using a non-exclusive lock of type boost::shared_lock.
My question is: as the resource is read-only, why is the shared mutex needed in the first place in the count and print functions? Can't we manage without it?
As the resource is read only [...]
No, it is not: the fill() function writes to it through the following line:
random_numbers.push_back(std::rand()); // write to random_numbers
So the shared mutex really is necessary to synchronize your access to the vector.
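For comparison, the same reader/writer pattern can be sketched with the standard library's std::shared_mutex (C++17) in place of Boost. This is an illustrative port without the sleeps, not the original example; `fill_and_sum` and its parameters are hypothetical names.

```cpp
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

// Hypothetical port: one writer appends 0..n_values-1 while readers peek at back().
long long fill_and_sum(int n_values, int n_readers) {
    std::shared_mutex mutex;
    std::vector<int> numbers;
    std::vector<int> last_seen(n_readers, -1);  // one slot per reader

    std::thread writer([&] {
        for (int i = 0; i < n_values; ++i) {
            // Exclusive lock: push_back may reallocate the vector's storage,
            // so no reader may touch it at the same time.
            std::unique_lock<std::shared_mutex> lock(mutex);
            numbers.push_back(i);
        }
    });

    std::vector<std::thread> readers;
    for (int r = 0; r < n_readers; ++r)
        readers.emplace_back([&, r] {
            for (int i = 0; i < n_values; ++i) {
                // Shared lock: readers may overlap each other, never the writer.
                std::shared_lock<std::shared_mutex> lock(mutex);
                if (!numbers.empty()) last_seen[r] = numbers.back();
            }
        });

    writer.join();
    for (auto& t : readers) t.join();

    long long sum = 0;  // all threads are joined, so no lock is needed here
    for (int v : numbers) sum += v;
    return sum;
}
```

Dropping the shared_lock in the readers would let back() run while push_back reallocates, which is a data race even though the readers themselves never modify the vector.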