Reusable barrier: simple (alternating) implementation - C++

#include <condition_variable>
#include <cstdint>
#include <mutex>

// (these live inside main(); shown here without the enclosing function)
std::mutex mutex;
std::condition_variable cv;
uint8_t size = 2;
uint8_t count = size;
int8_t direction = -1; // alternates between -1 and 1 each generation
const auto sync = [&size, &count, &mutex, &cv, &direction]()
{
    std::unique_lock<std::mutex> lock(mutex);
    const auto current_direction = direction;
    if (--count == 0)
    {
        count = size;
        direction *= -1; // flip generation
        cv.notify_all();
    }
    else
    {
        cv.wait(lock,
                [&direction, &current_direction]()
                { return direction != current_direction; });
    }
};
As provided in the first unaccepted answer to the linked "reusable barrier" question, a 'generation' must be stored inside the barrier object to prevent the next generation from manipulating the wake-up condition of the current generation for a given set of threads. What I do not like about that first unaccepted answer is the ever-growing generation counter. I believe we only ever need to distinguish between two generations at most: a thread that has satisfied the wait condition may immediately start another barrier synchronization call, as the second unaccepted solution suggests. That second solution, however, was somewhat complex, and I believe the snippet above would already be enough (it currently lives inside main, but it could be abstracted into a struct, e.g. the sketch below). Am I correct in my 'belief' that a barrier can only be used simultaneously by 2 generations at most?
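For concreteness, an untested sketch of what the struct form might look like (a bool replaces the signed flip, but the logic is the same):

#include <condition_variable>
#include <cstdint>
#include <mutex>

struct Barrier {
    explicit Barrier(std::uint32_t n) : size(n), count(n) {}

    void sync() {
        std::unique_lock<std::mutex> lock(mutex);
        const bool current_direction = direction;
        if (--count == 0) {
            count = size;
            direction = !direction; // flip the generation parity
            cv.notify_all();
        } else {
            cv.wait(lock, [&] { return direction != current_direction; });
        }
    }

private:
    std::mutex mutex;
    std::condition_variable cv;
    std::uint32_t size;
    std::uint32_t count;
    bool direction = false; // which of the two generations is currently draining
};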

Related

double buffer for consumer and producer problem

So I am trying to implement a double buffer for a typical producer and consumer problem.
1. get_items() produces 10 items at a time.
2. The producer pushes those 10 items onto the write queue. Assume for now that there is only one producer.
3. Consumers consume one item at a time from the read queue. There are many consumers.
So I am sharing my code below. The idea is simple: consume from readq until it is empty, then swap the queue pointers, so that readq points to the (filled) writeq and writeq points to the emptied queue, which starts to fill again. That way the producer and consumers can work independently without halting each other; it trades space for time.
However, my code does not work with multiple consumers. I start 10 consumer threads, and the program always gets stuck at .join().
So my code is definitely buggy, but even examining it carefully I could not find the bug. It seems to get stuck after lk1.unlock(), so it is not stuck in a while loop or anything obvious.
#include <chrono>
#include <condition_variable>
#include <future>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>
using namespace std;
using namespace std::chrono_literals;

mutex m1;
mutex m2; // using 2 mutexes, so when the producer is locked, consumers can still run
condition_variable put;
condition_variable fetch;
queue<int> q1;
queue<int> q2;
queue<int>* readq = &q1;
queue<int>* writeq = &q2;
bool flag{ true };

vector<int> get_items() {
    vector<int> res;
    for (int i = 0; i < 10; i++) {
        res.push_back(i);
    }
    return res;
}

void producer_mul() {
    unique_lock<mutex> lk2(m2);
    put.wait(lk2, [&]() { return flag == false; }); // producer waits for consumer signal
    vector<int> items = get_items();
    for (auto it : items) {
        writeq->push(it);
    }
    flag = true; // signal queue is filled
    fetch.notify_one();
    lk2.unlock();
}

int get_one_item_mul() {
    unique_lock<mutex> lk1(m1);
    int res;
    if (!(*readq).empty()) {
        res = (*readq).front(); (*readq).pop();
        if ((*writeq).empty() && flag == true) { // if writeq is empty
            flag = false;
            put.notify_one();
        }
    }
    else {
        readq = writeq; // swap queue pointer
        while ((*readq).empty()) { // not yet written
            if (flag) {
                flag = false;
                put.notify_one(); // start filling process
            }
            //if (readq->empty()) { // updated due to race. readq now points to writeq, so if the producer finished, readq is not empty and flag == true.
            fetch.wait(lk1, [&]() { return flag == true; });
            //}
        }
        if (flag) {
            writeq = writeq == &q1 ? &q2 : &q1; // swap writeq to the alternative queue and fill it again
            flag = false;
            //put.notify_one(); // fill that queue again if needed. In my case, 10 items are produced and consumed, so the 2nd round is unused; plus the code is not working even in this simple case, so commented out for now.
        }
        res = readq->front(); readq->pop();
    }
    lk1.unlock();
    this_thread::sleep_for(10ms);
    return res;
}

int main()
{
    std::vector<std::thread> threads;
    std::packaged_task<void(void)> job1(producer_mul);
    vector<std::future<int>> res;
    for (int i = 0; i < 10; i++) {
        std::packaged_task<int(void)> job2(get_one_item_mul);
        res.push_back(job2.get_future());
        threads.push_back(std::thread(std::move(job2)));
    }
    threads.push_back(std::thread(std::move(job1)));
    for (auto& t : threads) {
        t.join();
    }
    for (auto& a : res) {
        cout << a.get() << endl;
    }
    return 0;
}
I added some comments, but the idea and code are pretty simple and self-explanatory.
I am trying to figure out where the problem in my code is. Does it work for multiple consumers? Furthermore, would it work with multiple producers? I do not see a problem there, since the locking is not fine-grained: both producer and consumer hold their lock from beginning to end.
Looking forward to discussion; any help is appreciated.
Update
I fixed the race condition mentioned in one of the answers.
The program is still not working.
Your program contains data races, and therefore exhibits undefined behavior. I see at least two:
producer_mul accesses and modifies flag while holding m2 mutex but not m1. get_one_item_mul accesses and modifies flag while holding m1 mutex but not m2. So flag is not in fact protected against concurrent access.
Similarly, producer_mul accesses writeq pointer while holding m2 mutex but not m1. get_one_item_mul modifies writeq while holding m1 mutex but not m2.
There's also a data race on the queues themselves. Initially, both queues are empty and producer_mul is blocked waiting on flag. Then the following sequence occurs (P for the producer thread, C for a consumer thread):

C: readq = writeq;                 // both pointers now refer to the same queue
C: flag = false; put.notify_one(); // this wakes up the producer
P: writeq->push(it);
C: if (readq->empty())
The last two lines happen concurrently, with no protection against concurrent access. One thread modifies an std::queue instance while the other accesses that same instance. This is a data race.
There's a data race at the heart of the design. Imagine there is just one producer P and two consumers C1 and C2. Initially, P waits on put until flag == false. C1 grabs m1; C2 is blocked on m1.
C1 sets readq = writeq, then unblocks P, then calls fetch.wait(lk1, [&]() {return flag == true; });. This unlocks m1, allowing C2 to proceed. So now P is busy writing to writeq while C2 is busy reading from readq - which is one and the same queue.
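For illustration, here is a minimal, untested sketch of one possible repair (certainly not the only one): a single mutex guards flag, both queues and both queue pointers, so no shared object is ever touched without holding it. The _fixed names are hypothetical stand-ins for your functions:

#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

std::mutex m; // one mutex for ALL shared state
std::condition_variable put, fetch;
std::queue<int> q1, q2;
std::queue<int>* readq = &q1;
std::queue<int>* writeq = &q2;
bool filled = false; // writeq has been filled by the producer

void producer_mul_fixed() {
    std::unique_lock<std::mutex> lk(m);
    put.wait(lk, [] { return !filled; }); // wait until a refill is requested
    for (int i = 0; i < 10; i++) writeq->push(i);
    filled = true;
    fetch.notify_all();
}

int get_one_item_fixed() {
    std::unique_lock<std::mutex> lk(m);
    while (readq->empty()) {
        if (filled) {
            std::swap(readq, writeq); // adopt the freshly filled buffer
            filled = false;           // the (now empty) writeq may be refilled
            put.notify_one();         // would wake a looping producer
            fetch.notify_all();       // other consumers can re-check readq
        } else {
            fetch.wait(lk, [] { return filled || !readq->empty(); });
        }
    }
    int res = readq->front();
    readq->pop();
    return res;
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 10; i++)
        threads.emplace_back([] { get_one_item_fixed(); });
    threads.emplace_back(producer_mul_fixed);
    for (auto& t : threads) t.join();
}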

How to let different threads fill an array together?

Suppose I have some tasks (Monte Carlo simulations) that I want to run in parallel. I want to complete a given number of tasks, but the tasks take different amounts of time, so it is not easy to divide the work evenly over the threads. Also: I need the results of all simulations in a single vector (or array) at the end.
So I came up with the approach below:
int Max{1000000};
// SimResult is some struct with a well-defined default value.
std::vector<SimResult> vec(/*length*/ Max); // initialized with default values of SimResult
int LastAdded{0};

void fill(int RandSeed)
{
    Simulator sim{RandSeed};
    while (LastAdded < Max)
    {
        // Do some work to bring sim to the desired state.
        // The duration of this work is subject to randomness.
        vec[LastAdded++] = sim.GetResult(); // produces a SimResult
    }
}
int main()
{
    // launch a bunch of std::async tasks that start filling vec
    auto fut1 = std::async(fill, 1);
    auto fut2 = std::async(fill, 2);
    // maybe some more tasks
    fut1.get();
    fut2.get();
    // do something with the results in vec
}
The above code will give race conditions, I guess. I am looking for a performant approach to avoid that. Requirements: avoid race conditions (fill the entire array, no skips); the final result should immediately be in the array; performant.
Reading about the various approaches, atomic seems a good candidate, but I am not sure which settings will be most performant in my case. And I am not even sure whether atomic will cut it; maybe a mutex guarding LastAdded is needed? For concreteness, the kind of atomic version I have in mind (untested) is sketched below.
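Something like this (needs #include <atomic>; otherwise the same globals as above):

std::atomic<int> LastAdded{0};

void fill(int RandSeed)
{
    Simulator sim{RandSeed};
    while (true)
    {
        int i = LastAdded.fetch_add(1); // atomically claim slot i
        if (i >= Max)
            break; // all slots claimed; the small overshoot is harmless
        vec[i] = sim.GetResult(); // distinct elements: no data race
    }
}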
One thing I would say is that you need to be very careful with the standard library random number functions. If your Simulator class creates an instance of a generator, you should not run Monte Carlo simulations in parallel using the same object, because you'll likely get repeated patterns of random numbers between the runs, which will give you inaccurate results.
The best practice in this area would be to create N Simulator objects with the same properties and give each one a different random seed. Then you could farm these objects out over multiple threads using OpenMP, which is a common parallel programming model for scientific software development.
std::vector<SimResult> generateResults(size_t N_runs, double seed)
{
    std::vector<SimResult> results(N_runs);
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(N_runs); i++)
    {
        auto sim = Simulator(seed + i);
        results[i] = sim.GetResult();
    }
    return results;
}
Edit: With OpenMP, you can choose different scheduling models, which allow you to, for example, dynamically split work between threads. You can do this with:
#pragma omp parallel for schedule(dynamic, 16)
which would give each thread chunks of 16 items to work on at a time.
Since you already know how many elements you are going to work with and never change the size of the vector, the easiest solution is to let each thread work on its own part of the vector; see the Simple Version below.
Update
To accommodate vastly varying calculation times, you should keep your current code but avoid race conditions via a std::lock_guard. You will need a std::mutex that is the same for all threads, for example a global variable, or you can pass a reference to the mutex to each thread.
void fill(int RandSeed, std::mutex& nextItemMutex)
{
    Simulator sim{RandSeed};
    size_t workingIndex;
    while (true)
    {
        {
            // enter critical section
            std::lock_guard<std::mutex> nextItemLock(nextItemMutex);
            // acquire next item
            if (LastAdded < Max)
            {
                workingIndex = LastAdded;
                LastAdded++;
            }
            else
            {
                break;
            }
            // lock is released when nextItemLock goes out of scope
        }
        // Do some work to bring sim to the desired state.
        // The duration of this work is subject to randomness.
        vec[workingIndex] = sim.GetResult(); // produces a SimResult
    }
}
The problem with this is that synchronisation is quite expensive. But it's probably not that expensive in comparison to the simulation you run, so it shouldn't be too bad.
Version 2:
To reduce the amount of synchronisation required, you can acquire blocks of items to work on instead of single items:

void fill(int RandSeed, std::mutex& nextItemMutex, size_t blockSize)
{
    Simulator sim{RandSeed};
    size_t workingIndex;
    while (true)
    {
        {
            std::lock_guard<std::mutex> nextItemLock(nextItemMutex);
            if (LastAdded < Max)
            {
                workingIndex = LastAdded;
                LastAdded += blockSize;
            }
            else
            {
                break;
            }
        }
        for (size_t i = workingIndex; i < workingIndex + blockSize && i < Max; i++)
            vec[i] = sim.GetResult(); // produces a SimResult
    }
}
Simple Version

void fill(int RandSeed, size_t partitionStart, size_t partitionEnd)
{
    Simulator sim{RandSeed};
    for (size_t i = partitionStart; i < partitionEnd; i++)
    {
        // Do some work to bring sim to the desired state.
        // The duration of this work is subject to randomness.
        vec[i] = sim.GetResult(); // produces a SimResult
    }
}

int main()
{
    // launch a bunch of std::async tasks, each with its own partition
    auto fut1 = std::async(fill, 1, 0, Max / 2);
    auto fut2 = std::async(fill, 2, Max / 2, Max);
    // ...
}

How to use std::condition_variable in a loop

I'm trying to implement an algorithm using threads that must be synchronized at certain moments. More or less, the sequence for each thread should be:
1. Try to find a solution with the current settings.
2. Synchronize the solution with the other threads.
3. If any of the threads found a solution, end the work.
4. (empty - to be in line with the example below)
5. Modify the parameters of the algorithm and jump to 1.
Here is a toy example with the algorithm replaced by random number generation - all threads should end if at least one of them finds 0.
#include <condition_variable>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

const int numOfThreads = 8;

std::condition_variable cv1, cv2;
std::mutex m1, m2;
int lockCnt1 = 0;
int lockCnt2 = 0;
int solutionCnt = 0;

void workerThread()
{
    while (true) {
        // 1. do some important work
        int r = rand() % 1000;
        // 2. synchronize and get results from all threads
        {
            std::unique_lock<std::mutex> l1(m1);
            ++lockCnt1;
            if (r == 0) ++solutionCnt; // gather solutions
            if (lockCnt1 == numOfThreads) {
                // last thread ends here
                lockCnt2 = 0;
                cv1.notify_all();
            }
            else {
                cv1.wait(l1, [&] { return lockCnt1 == numOfThreads; });
            }
        }
        // 3. if a solution was found then quit all threads
        if (solutionCnt > 0) return;
        // 4. if not, then set lockCnt1 to 0 to make section 2. work again
        {
            std::unique_lock<std::mutex> l2(m2);
            ++lockCnt2;
            if (lockCnt2 == numOfThreads) {
                // last thread ends here
                lockCnt1 = 0;
                cv2.notify_all();
            }
            else {
                cv2.wait(l2, [&] { return lockCnt2 == numOfThreads; });
            }
        }
        // 5. Set up new algorithm parameters and repeat.
    }
}

int main()
{
    srand(time(NULL));
    std::vector<std::thread> v;
    for (int i = 0; i < numOfThreads; ++i) v.emplace_back(std::thread(workerThread));
    for (int i = 0; i < numOfThreads; ++i) v[i].join();
    return 0;
}
The questions I have are about sections 2. and 4. of the code above.
A) In section 2 all threads are synchronized and the solutions (if found) are gathered, all via the lockCnt1 variable. Compared to a single use of a condition_variable, I found it hard to reset lockCnt1 to zero safely so that the section can be reused the next time around. Because of that I introduced section 4. Is there a better way to do this (without introducing section 4)?
B) It seems that most examples show condition_variable in a 'producer-consumer' scenario. Is there a better way to synchronize all threads in the case where all of them are 'producers'?
Edit: Just to be clear, I didn't want to describe the algorithm details, since they are not important here - in any case, it is necessary to have all solution(s) or none from a given loop iteration, and mixing them is not allowed. The described sequence of execution must be followed, and the question is how to achieve such synchronization between threads.
A) You could simply not reset lockCnt1 to 0 and keep incrementing it instead. The condition lockCnt1 == numOfThreads then changes to lockCnt1 % numOfThreads == 0, and you can drop block #4. You could also use a barrier (std::barrier in C++20, formerly std::experimental::barrier) to make the threads meet; a sketch follows below.
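For example, an untested sketch of the worker loop rewritten with C++20's std::barrier (assumes a C++20 toolchain; two barrier phases per iteration ensure every thread reads the same verdict before anyone could modify solutionCnt again):

#include <atomic>
#include <barrier>
#include <cstdlib>

const int numOfThreads = 8;
std::atomic<int> solutionCnt{0};
std::barrier sync_point(numOfThreads);

void workerThread()
{
    while (true) {
        int r = rand() % 1000;          // 1. do some important work
        if (r == 0) ++solutionCnt;      // 2. publish any solution found
        sync_point.arrive_and_wait();   //    everyone has published
        bool found = solutionCnt > 0;   // 3. the same verdict for all threads
        sync_point.arrive_and_wait();   // 4. everyone has read the verdict
        if (found) return;
        // 5. set up new algorithm parameters and repeat
    }
}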
B) I would suggest using std::atomic for solutionCnt; then you can drop all the other counters, the mutex and the condition variable. Just atomically increase it by one in the thread that found a solution and then return. In all other threads, check after every iteration whether the value is greater than zero; if it is, return. The advantage is that the threads do not have to meet regularly, but can search at their own pace - see the sketch below.
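A minimal sketch of that idea (untested; names match the toy example):

#include <atomic>
#include <cstdlib>

std::atomic<int> solutionCnt{0};

void workerThread()
{
    while (solutionCnt.load() == 0) {
        int r = rand() % 1000;  // 1. do some important work
        if (r == 0) {
            ++solutionCnt;      // found a solution; the other threads will notice
            return;
        }
        // 5. set up new algorithm parameters and repeat
    }
}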
Out of curiosity, I tried to solve your problem using std::async. For every attempt to find a solution, we call async. Once all parallel attempts have finished, we process feedback, adjust parameters, and repeat. An important difference with your implementation is that feedback is processed in the calling (main) thread. If processing feedback takes too long — or if we don't want to block the main thread at all — then the code in main() can be adjusted to also call std::async.
The code is supposed to be quite efficient, provided that the implementation of async uses a thread pool (e.g. Microsoft's implementation does).
#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <ctime>
#include <future>
#include <iostream>
#include <vector>

const int numOfThreads = 8;

struct Parameters{};

struct Feedback {
    int result;
};

Feedback doTheWork(const Parameters &){
    // do the work and provide result and feedback for future runs
    return Feedback{rand() % 1000};
}

bool isSolution(const Feedback &f){
    return f.result == 0;
}

// Runs doTheWork in parallel. The number of parallel tasks equals the size of the params vector.
std::vector<Feedback> findSolutions(const std::vector<Parameters> &params){
    // 1. Run async tasks to find solutions. Normally threads are not created each time but re-used from a pool.
    std::vector<std::future<Feedback>> futures;
    for (auto &p: params){
        futures.push_back(std::async(std::launch::async,
                                     [&p](){ return doTheWork(p); }));
    }

    // 2. Synchronize: wait for all tasks
    std::vector<Feedback> feedback(futures.size());
    for (auto nofRunning = futures.size(), iFuture = size_t{0}; nofRunning > 0; ){
        // Check if the task has finished (the future is invalid if we already handled it during an earlier iteration)
        auto &future = futures[iFuture];
        if (future.valid() && future.wait_for(std::chrono::milliseconds(1)) != std::future_status::timeout){
            // Collect feedback for the next attempt
            // Alternatively, we could already check whether a solution has been found and cancel the other tasks [if our algorithm supports cancellation]
            feedback[iFuture] = future.get();
            --nofRunning;
        }
        if (++iFuture == futures.size())
            iFuture = 0;
    }
    return feedback;
}

int main()
{
    srand(time(NULL));
    std::vector<Parameters> params(numOfThreads);
    // 0. Set initial parameter values here
    // If we don't want to block the main thread while the algorithm is running, we can use std::async here too
    while (true){
        auto feedbackVector = findSolutions(params);
        auto itSolution = std::find_if(std::begin(feedbackVector), std::end(feedbackVector), isSolution);
        // 3. If any of the threads has found a solution, we stop
        if (itSolution != feedbackVector.end())
            break;
        // 5. Use the feedback to re-configure the parameters for the next iteration
    }
    return 0;
}

C++ - Multithreading takes longer with more threads

I'm making a parallel password cracker for an assignment. When I launch more than one thread, cracking takes longer the more threads I add. What is the problem here?
Secondly, what resource-sharing techniques can I use for optimal performance? I'm required to use either mutexes, atomic operations or barriers, while also using semaphores, condition variables or channels. Mutexes seem to slow my program down quite drastically.
Here is an example of my code for context:
std::mutex mtx;
std::condition_variable cv;
bool ready = false;

void run()
{
    std::unique_lock<std::mutex> lck(mtx);
    ready = true;
    cv.notify_all();
}

void crack(/* params */)
{
    std::lock_guard<std::mutex> lk(mtx);
    // ...do cracking stuff
}

int main()
{
    // ....
    std::thread* t = new std::thread[uiThreadCount];
    for (int i = 0; i < uiThreadCount; i++)
    {
        t[i] = std::thread(crack, params);
    }
    run();
    for (int i = 0; i < uiThreadCount; i++)
    {
        t[i].join();
    }
}
When writing multi-threaded code, it's generally a good idea to share as few resources as possible, so you can avoid having to synchronize using a mutex or an atomic.
There are a lot of different ways to do password cracking, so I'll give a slightly simpler example. Let's say you have a hash function, and a hash, and you're trying to guess what input produces the hash (this is basically how a password would get cracked).
We can write the cracker like this. It takes the hash function and the password hash, checks a range of values, and invokes the callback function if it finds a match.
auto cracker = [](auto passwdHash, auto hashFunc, auto min, auto max, auto callback) {
    for (auto i = min; i < max; i++) {
        auto output = hashFunc(i);
        if (output == passwdHash) {
            callback(i);
        }
    }
};
Now, we can write a parallel version. This version only has to synchronize when it finds a match, which is pretty rare.
auto parallel_cracker = [](auto passwdHash, auto hashFunc, auto min, auto max, int num_threads) {
    // Get a vector of threads
    std::vector<std::thread> threads;
    threads.reserve(num_threads);

    // Make a vector of all the matches it discovered
    using input_t = decltype(min);
    std::vector<input_t> matches;
    std::mutex match_lock;

    // Whenever a match is found, this function gets called
    auto callback = [&](input_t match) {
        std::unique_lock<std::mutex> _lock(match_lock);
        std::cout << "Found match: " << match << '\n';
        matches.push_back(match);
    };

    for (int i = 0; i < num_threads; i++) {
        auto sub_min = min + ((max - min) * i) / num_threads;
        auto sub_max = min + ((max - min) * (i + 1)) / num_threads;
        threads.push_back(std::thread(cracker, passwdHash, hashFunc, sub_min, sub_max, callback));
    }

    // Join all the threads
    for (auto& thread : threads) {
        thread.join();
    }
    return matches;
};
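For illustration, here is a hypothetical way to call it, with a made-up toy hash in place of a real password hash (assumes the two lambdas above are in scope):

#include <cstdint>
#include <iostream>

int main() {
    // toy "hash": multiply by a constant, reduce modulo a large prime
    auto toyHash = [](std::uint64_t x) { return x * 2654435761ull % 1000000007ull; };
    auto target = toyHash(123456); // pretend we only know the hash
    auto matches = parallel_cracker(target, toyHash, std::uint64_t{0},
                                    std::uint64_t{1000000}, 8);
    std::cout << "matches found: " << matches.size() << '\n';
}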
Yes, and it is not surprising given the way it's written: by taking a mutex at the beginning of your thread function (crack), you effectively make the threads run sequentially.
I understand you want to achieve a "synchronous start" of the threads (hence the condition variable cv), but you don't use it properly - without a call to one of its wait methods, the call cv.notify_all() is useless: it does not do what you intended, and your threads simply run sequentially.
Using wait() from the std::condition_variable in your crack() function is imperative: it releases the mtx (which you just grabbed with the lock guard lk) and blocks the thread until cv.notify_all() is called. After that call, each woken thread re-acquires mtx in turn, so if you really want parallel execution you then need to unlock mtx before doing the actual work.
Here is how your crack thread should look:

void crack()
{
    std::unique_lock<std::mutex> lk(mtx);
    cv.wait(lk, [] { return ready; }); // the predicate guards against spurious wakeups
                                       // and against a notify that fires before the wait
    lk.unlock();
    // ...do cracking stuff
}
By the way, the ready flag in run() is not redundant: used as the wait predicate above, it protects against spurious wakeups and against notify_all() firing before a thread has reached wait(), in which case that thread would otherwise block forever.
"I'm required to use either mutexes, atomic operations or barriers while also using semaphores, condition variables or channels"
- Different tools and techniques are good for different things; as posed, this part of the question is too general to answer.

Implementing semaphore by using mutex operations and primitives

Some time ago I had an interview and was asked to implement a semaphore using only mutex operations and primitives (an int was allowed to be considered atomic). I came up with the solution below. The interviewer did not like the busy-wait part -- while (count >= size) {} -- and asked me to implement blocking instead, using more primitive types and mutexes. I did not manage to come up with an improved solution. Any ideas how it could be done?
#include <atomic>
#include <mutex>
using namespace std;

struct Semaphore {
    int size;
    atomic<int> count;
    mutex updateMutex;

    Semaphore(int n) : size(n) { count.store(0); }

    void acquire() {
        while (1) {
            while (count >= size) {} // busy-wait
            updateMutex.lock();
            if (count >= size) {
                updateMutex.unlock();
                continue;
            }
            ++count;
            updateMutex.unlock();
            break;
        }
    }

    void release() {
        updateMutex.lock();
        if (count > 0) {
            --count;
        } // else log err
        updateMutex.unlock();
    }
};
I'd wager this is not possible to implement without a busy-loop using mutexes only.
If you don't busy-loop, you have to block somewhere. The only blocking primitive you've got is a mutex, so you have to block on some mutex when the semaphore counter is zero. But you can be woken up only by the single owner of that mutex, whereas you should be woken up whenever an arbitrary thread returns a count to the semaphore.
Now, if you are allowed condition variables, it's an entirely different story - see the sketch below.
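For reference, the classic condition-variable formulation would be a sketch along these lines (note it counts remaining permits downward, while the question's code counts acquired permits upward):

#include <condition_variable>
#include <mutex>

struct Semaphore {
    explicit Semaphore(int n) : count(n) {}

    void acquire() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return count > 0; }); // block while no permits left
        --count;
    }

    void release() {
        std::lock_guard<std::mutex> lk(m);
        ++count;
        cv.notify_one(); // wake one waiter, if any
    }

private:
    int count;
    std::mutex m;
    std::condition_variable cv;
};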
As @chill pointed out, the solution I wrote down here will not work, since locks have unique ownership. I guess in the end you will have to revert to busy-waiting (if you are not allowed to use condition variables). I leave it here so that people who have the same idea can see that it DOES NOT WORK ;)
struct Semaphore {
    int size;
    atomic<int> count;
    mutex protection;
    mutex wait;

    Semaphore(int n) : size(n) { count.store(0); }

    void acquire() {
        protection.lock();
        --count;
        if (count < -1) {
            protection.unlock();
            wait.lock();
        }
        protection.unlock();
    }

    void release() {
        protection.lock();
        ++count;
        if (count > 0) {
            wait.unlock();
        }
        protection.unlock();
    }
};
That's true, because technically there are some parts of your code that have no need to exist.
1. You used an atomic datatype, atomic<int> count, which costs a few extra cycles per operation and is useless here: every increment and decrement happens while updateMutex is locked, so no other thread can change count during the locked region. The mutex also gives you all the visibility guarantees a plain int needs, so neither atomic nor volatile is required.
2. You wrote while (count >= size) {}, which is also useless, because you check count again after taking the lock, and that locked check is the one that matters ("remember, a spinlock is a while(1)" when the mutex is taken by another thread).
Finally, let me rewrite your code in a leaner way.
struct Semaphore {
    int size;
    int count; // protected by updateMutex, so plain int is enough
    mutex updateMutex;

    Semaphore(int n) : size(n), count(0) {}

    void acquire() {
        while (1) {
            updateMutex.lock();
            if (count >= size) {
                updateMutex.unlock();
                continue;
            }
            ++count;
            updateMutex.unlock();
            break;
        }
    }

    void release() {
        updateMutex.lock();
        if (count > 0) {
            --count;
        } // else log err
        updateMutex.unlock();
    }
};
EDIT - use a second mutex for queuing instead of queuing threads explicitly
Since a mutex already has proper thread support, it can be used to queue the threads (instead of doing it explicitly, as I first tried to do). Note, however, that a standard mutex may only be unlocked by the thread that owns it; this solution unlocks from a different thread, so it needs a mutex-like primitive without that restriction (effectively a binary semaphore), otherwise it doesn't work.
I found the solution in Anthony Howe's pdf, which I came across while searching. Two more solutions are given there as well. I changed the names to make more sense for this example.
more or less pseudo code:
Semaphore {
    int n;
    mutex m_count; // unlocked initially
    mutex m_queue; // locked initially
};

void wait() {
    m_count.lock();
    n = n - 1;
    if (n < 0) {
        m_count.unlock();
        m_queue.lock();   // wait
    }
    m_count.unlock();     // unlock signal's lock
}

void signal() {
    m_count.lock();
    n = n + 1;
    if (n <= 0) {
        m_queue.unlock(); // leave m_count locked
    }
    else {
        m_count.unlock();
    }
}
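As an aside (not from Anthony Howe's pdf): since C++20, std::binary_semaphore is a lock-like primitive that any thread may release, so the pseudo code above can be rendered as legal C++. Building a semaphore out of binary semaphores is admittedly circular, but it shows the baton-passing pattern concretely:

#include <semaphore>

struct Semaphore {
    explicit Semaphore(int initial) : n(initial) {}

    void wait() {
        m_count.acquire();
        n = n - 1;
        if (n < 0) {
            m_count.release();
            m_queue.acquire(); // block until signal() releases us
        }
        m_count.release();     // in the blocked path, this releases signal()'s hold
    }

    void signal() {
        m_count.acquire();
        n = n + 1;
        if (n <= 0) {
            m_queue.release(); // leave m_count held; the woken waiter releases it
        } else {
            m_count.release();
        }
    }

private:
    int n; // only ever touched while holding m_count
    std::binary_semaphore m_count{1}; // "unlocked" initially
    std::binary_semaphore m_queue{0}; // "locked" initially
};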
lemme try this

# number of threads/workers
w = 10
# maximum concurrency
cr = 5

r_mutex = mutex()
w_mutex = [mutex() for x in range(w)]
# assuming a mutex can be locked and unlocked by anyone
# (essentially we need a binary semaphore)

def acquire(id):
    global cr
    r_mutex.lock()
    cr -= 1
    # r_mutex.unlock()
    # if exceeding maximum concurrency
    if cr < 0:
        # lock twice so we are woken up by someone else later
        w_mutex[id].lock()
        r_mutex.unlock()
        w_mutex[id].lock()
        w_mutex[id].unlock()
        return
    r_mutex.unlock()

def release(id):
    global cr
    r_mutex.lock()
    cr += 1
    # someone must be waiting if cr <= 0
    if cr <= 0:
        # maybe you can do this in a random order
        for m in w_mutex:
            if m.is_locked():
                m.unlock()
                break
    r_mutex.unlock()