Implementing a semaphore using mutex operations and primitives - C++

Some time ago I had an interview and was asked to implement a
semaphore using mutex operations and primitives only
(the interviewer allowed int to be considered atomic). I came up with the solution below.
He did not like the busy-wait part -- while (count >= size) {} -- and asked me to implement the blocking
with more primitive types and mutexes instead. I did not manage to come up with an improved solution.
Any ideas how it could be done?
#include <atomic>
#include <mutex>
using namespace std;

struct Semaphore {
    int size;
    atomic<int> count;
    mutex updateMutex;

    Semaphore(int n) : size(n) { count.store(0); }

    void acquire() {
        while (1) {
            while (count >= size) {}   // the busy-wait the interviewer disliked
            updateMutex.lock();
            if (count >= size) {
                updateMutex.unlock();
                continue;
            }
            ++count;
            updateMutex.unlock();
            break;
        }
    }

    void release() {
        updateMutex.lock();
        if (count > 0) {
            --count;
        } // else log err
        updateMutex.unlock();
    }
};

I'd wager this is not possible to implement without a busy-loop using mutexes only.
If you're not busy-looping, you have to block somewhere. The only blocking primitive you've got is
a mutex, so you have to block on some mutex when the semaphore counter is zero. But you can be woken up only by the single owner of that mutex, whereas you need to be woken up whenever an arbitrary thread returns a count to the semaphore.
Now, if you are allowed condition variables, it's an entirely different story.
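For illustration, here is a minimal sketch of that "different story" (my code, not part of the original answer): the same interface, but blocking on a condition variable instead of spinning.

#include <condition_variable>
#include <mutex>

struct CvSemaphore {
    int size;
    int count = 0;
    std::mutex m;
    std::condition_variable cv;

    explicit CvSemaphore(int n) : size(n) {}

    void acquire() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return count < size; }); // block, don't spin
        ++count;
    }

    void release() {
        {
            std::lock_guard<std::mutex> lock(m);
            if (count > 0) --count; // else log err
        }
        cv.notify_one(); // wake one waiter now that a slot is free
    }
};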

As @chill pointed out, the solution I wrote down here will not work, since locks have unique ownership. I guess in the end you will have to revert to busy-waiting (if you are not allowed to use condition variables). I leave it here so that people who have the same idea can see that it DOES NOT WORK ;)
#include <atomic>
#include <mutex>
using namespace std;

struct Semaphore {
    int size;
    atomic<int> count;
    mutex protection;
    mutex wait;

    Semaphore(int n) : size(n) { count.store(0); }

    void acquire() {
        protection.lock();
        --count;
        if (count < -1) {
            protection.unlock();
            wait.lock();
        }
        protection.unlock();
    }

    void release() {
        protection.lock();
        ++count;
        if (count > 0) {
            wait.unlock();
        }
        protection.unlock();
    }
};

That's true, because technically there are some parts of your code that have no need to exist.
1. You used an atomic type (atomic<int> count), which costs a few extra cycles per access, and it is useless as long as every increment and decrement happens under updateMutex.lock() -- no other thread can change the count during the locked section.
2. You put in while (count >= size) {}, which is also useless, because you check count again after acquiring the lock, and that second check is the one that matters. Remember, a contended mutex lock is itself effectively a "while(1)" spin/wait until the owning thread releases it.
Besides, if you decide to use a plain int count, with some compilers' optimizations your code might never re-read the count value -- remember, your semaphore is supposed to be used by different threads -- so you need to make it volatile to avoid this problem.
At last, let me rewrite your code in a more performant way.
#include <mutex>
using namespace std;

struct Semaphore {
    int size;
    volatile int count;
    mutex updateMutex;

    Semaphore(int n) : size(n), count(0) {}

    void acquire() {
        while (1) {
            updateMutex.lock();
            if (count >= size) {
                updateMutex.unlock();
                continue;
            }
            ++count;
            updateMutex.unlock();
            break;
        }
    }

    void release() {
        updateMutex.lock();
        if (count > 0) {
            --count;
        } // else log err
        updateMutex.unlock();
    }
};
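For completeness, a minimal usage sketch of the Semaphore above (my example, not part of the answer): eight threads contending for two slots.

#include <thread>
#include <vector>

int main() {
    Semaphore sem(2); // at most two threads inside the section at once
    std::vector<std::thread> threads;
    for (int i = 0; i < 8; ++i) {
        threads.emplace_back([&sem] {
            sem.acquire();
            // ... section limited to two concurrent threads ...
            sem.release();
        });
    }
    for (auto &t : threads) t.join();
    return 0;
}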

EDIT - use a second mutex for queuing instead of queuing the threads explicitly
Since a mutex already has proper thread support, it can be used to queue the threads (instead of doing that explicitly, as I first tried to do). But note: if the mutex is restricted so that only its owner may unlock it (as std::mutex is), then this solution does not work.
I found the solution in Anthony Howe's pdf, which I came across while searching. Two more solutions are given there as well. I changed the names to make more sense for this example.
more or less pseudo code:

Semaphore {
    int n;
    mutex m_count; // unlocked initially
    mutex m_queue; // locked initially
};

void wait() {
    m_count.lock();
    n = n - 1;
    if (n < 0) {
        m_count.unlock();
        m_queue.lock();  // wait
        // when we wake up here, signal() has left m_count locked for us
    }
    m_count.unlock();    // unlock signal's lock
}

void signal() {
    m_count.lock();
    n = n + 1;
    if (n <= 0) {
        m_queue.unlock(); // leave m_count locked; the woken waiter unlocks it
    }
    else {
        m_count.unlock();
    }
}

lemme try this

# number of threads/workers
w = 10
# maximum concurrency
cr = 5
r_mutex = mutex()
w_mutex = [mutex() for x in range(w)]
# assuming mutex can be locked and unlocked by anyone
# (essentially we need a binary semaphore)

def acquire(id):
    global cr
    r_mutex.lock()
    cr -= 1
    # if exceeding maximum concurrency
    if cr < 0:
        # lock twice to be woken up by someone
        w_mutex[id].lock()
        r_mutex.unlock()
        w_mutex[id].lock()
        w_mutex[id].unlock()
        return
    r_mutex.unlock()

def release(id):
    global cr
    r_mutex.lock()
    cr += 1
    # someone must be waiting if cr <= 0 after the increment
    if cr <= 0:
        # maybe you can do this in a random order
        for m in w_mutex:
            if m.is_locked():
                m.unlock()
                break
    r_mutex.unlock()

How to use std::condition_variable in a loop

I'm trying to implement an algorithm using threads that must be synchronized at certain moments. More or less, the sequence for each thread should be:
1. Try to find a solution with current settings.
2. Synchronize solution with other threads.
3. If any of the threads found solution end work.
4. (empty - to be inline with example below)
5. Modify parameters for algorithm and jump to 1.
Here is a toy example with the algorithm changed to just random number generation - all threads should end if at least one of them finds 0.
#include <iostream>
#include <condition_variable>
#include <cstdlib>
#include <ctime>
#include <mutex>
#include <thread>
#include <vector>

const int numOfThreads = 8;

std::condition_variable cv1, cv2;
std::mutex m1, m2;
int lockCnt1 = 0;
int lockCnt2 = 0;
int solutionCnt = 0;

void workerThread()
{
    while(true) {
        // 1. do some important work
        int r = rand() % 1000;

        // 2. synchronize and get results from all threads
        {
            std::unique_lock<std::mutex> l1(m1);
            ++lockCnt1;
            if (r == 0) ++solutionCnt; // gather solutions
            if (lockCnt1 == numOfThreads) {
                // last thread ends here
                lockCnt2 = 0;
                cv1.notify_all();
            }
            else {
                cv1.wait(l1, [&] { return lockCnt1 == numOfThreads; });
            }
        }

        // 3. if solution found then quit all threads
        if (solutionCnt > 0) return;

        // 4. if not, then set lockCnt1 to 0 to have section 2. working again
        {
            std::unique_lock<std::mutex> l2(m2);
            ++lockCnt2;
            if (lockCnt2 == numOfThreads) {
                // last thread ends here
                lockCnt1 = 0;
                cv2.notify_all();
            }
            else {
                cv2.wait(l2, [&] { return lockCnt2 == numOfThreads; });
            }
        }

        // 5. Setup new algorithm parameters and repeat.
    }
}

int main()
{
    srand(time(NULL));
    std::vector<std::thread> v;
    for (int i = 0; i < numOfThreads; ++i) v.emplace_back(std::thread(workerThread));
    for (int i = 0; i < numOfThreads; ++i) v[i].join();
    return 0;
}
The questions I have are about sections 2. and 4. of the code above.
A) In section 2 there is synchronization of all threads and gathering of solutions (if found), all done using the lockCnt1 variable. Compared to a single use of a condition_variable, I found it hard to reset lockCnt1 to zero safely so that section 2 can be reused the next time around. That is why I introduced section 4. Is there a better way to do this (without introducing section 4)?
B) It seems that most examples show condition_variable in a 'producer-consumer' scenario. Is there a better way to synchronize all threads in the case where all of them are 'producers'?
Edit: Just to be clear, I didn't want to describe the algorithm details, since they are not important here - in any case it is necessary to have all solution(s) or none from a given loop execution, and mixing them is not allowed. The described sequence of execution must be followed, and the question is how to achieve such synchronization between threads.
A) You could simply not reset lockCnt1 to 0 and keep incrementing it instead. The condition lockCnt1 == numOfThreads then changes to lockCnt1 % numOfThreads == 0. You can then drop block #4. In the future you could also use std::experimental::barrier to get the threads to meet.
B) I would suggest using std::atomic for solutionCnt; then you can drop all other counters, the mutex, and the condition variable. Just atomically increase it by one in the thread that found a solution and then return. In all threads, after every iteration, check whether the value is greater than zero; if it is, return. The advantage is that the threads do not have to meet regularly but can each try to solve it at their own pace.
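A minimal sketch of suggestion B (my rendering, not the asker's code): the only shared state left is one atomic counter.

#include <atomic>
#include <cstdlib>

std::atomic<int> solutionCnt{0};

void workerThread()
{
    while (solutionCnt.load() == 0) {  // quit as soon as anyone has succeeded
        int r = rand() % 1000;         // 1. do some important work
        if (r == 0) {
            solutionCnt.fetch_add(1);  // record the solution
            return;
        }
        // 5. adjust this thread's parameters locally and try again,
        //    at its own pace - no barrier needed
    }
}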
Out of curiosity, I tried to solve your problem using std::async. For every attempt to find a solution, we call async. Once all parallel attempts have finished, we process feedback, adjust parameters, and repeat. An important difference with your implementation is that feedback is processed in the calling (main) thread. If processing feedback takes too long — or if we don't want to block the main thread at all — then the code in main() can be adjusted to also call std::async.
The code is supposed to be quite efficient, provided that the implementation of async uses a thread pool (e.g. Microsoft's implementation does).
#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <ctime>
#include <future>
#include <iostream>
#include <vector>

const int numOfThreads = 8;

struct Parameters{};

struct Feedback {
    int result;
};

Feedback doTheWork(const Parameters &){
    // do the work and provide result and feedback for future runs
    return Feedback{rand() % 1000};
}

bool isSolution(const Feedback &f){
    return f.result == 0;
}

// Runs doTheWork in parallel. Number of parallel tasks is same as size of params vector
std::vector<Feedback> findSolutions(const std::vector<Parameters> &params){
    // 1. Run async tasks to find solutions. Normally threads are not created each time but re-used from a pool
    std::vector<std::future<Feedback>> futures;
    for (auto &p: params){
        futures.push_back(std::async(std::launch::async,
                                     [&p](){ return doTheWork(p); }));
    }

    // 2. Synchronize: wait for all tasks
    std::vector<Feedback> feedback(futures.size());
    for (auto nofRunning = futures.size(), iFuture = size_t{0}; nofRunning > 0; ){
        // Check if the task has finished (future is invalid if we already handled it during an earlier iteration)
        auto &future = futures[iFuture];
        if (future.valid() && future.wait_for(std::chrono::milliseconds(1)) != std::future_status::timeout){
            // Collect feedback for next attempt
            // Alternatively, we could already check if a solution has been found and cancel the other tasks [if our algorithm supports cancellation]
            feedback[iFuture] = std::move(future.get());
            --nofRunning;
        }
        if (++iFuture == futures.size())
            iFuture = 0;
    }
    return feedback;
}

int main()
{
    srand(time(NULL));
    std::vector<Parameters> params(numOfThreads);
    // 0. Set initial parameter values here
    // If we don't want to block the main thread while the algorithm is running, we can use std::async here too
    while (true){
        auto feedbackVector = findSolutions(params);
        auto itSolution = std::find_if(std::begin(feedbackVector), std::end(feedbackVector), isSolution);
        // 3. If any of the threads has found a solution, we stop
        if (itSolution != feedbackVector.end())
            break;
        // 5. Use feedback to re-configure parameters for next iteration
    }
    return 0;
}

C++ - Multithreading takes longer with more threads

I'm making a parallel password cracker for an assignment. When I launch more than one thread, cracking takes longer the more threads I add. What is the problem here?
Secondly, what resource-sharing techniques can I use for optimal performance? I'm required to use either mutexes, atomic operations, or barriers, while also using semaphores, condition variables, or channels. Mutexes seem to slow my program down quite drastically.
Here is an example of my code for context:
std::mutex mtx;
std::condition_variable cv;

void run()
{
    std::unique_lock<std::mutex> lck(mtx);
    ready = true;
    cv.notify_all();
}

crack()
{
    std::lock_guard<std::mutex> lk(mtx);
    ...do cracking stuff
}

main()
{
    ....
    std::thread *t = new std::thread[uiThreadCount];
    for(int i = 0; i < uiThreadCount; i++)
    {
        t[i] = std::thread(crack, params);
    }
    run();
    for(int i = 0; i < uiThreadCount; i++)
    {
        t[i].join();
    }
}
When writing multi-threaded code, it's generally a good idea to share as few resources as possible, so you can avoid having to synchronize using a mutex or an atomic.
There are a lot of different ways to do password cracking, so I'll give a slightly simpler example. Let's say you have a hash function, and a hash, and you're trying to guess what input produces the hash (this is basically how a password would get cracked).
We can write the cracker like this. It'll take the hash function and the password hash, check a range of values, and invoke the callback function if it finds a match.
auto cracker = [](auto passwdHash, auto hashFunc, auto min, auto max, auto callback) {
    for(auto i = min; i < max; i++) {
        auto output = hashFunc(i);
        if(output == passwdHash) {
            callback(i);
        }
    }
};
Now, we can write a parallel version. This version only has to synchronize when it finds a match, which is pretty rare.
auto parallel_cracker = [](auto passwdHash, auto hashFunc, auto min, auto max, int num_threads) {
    // Get a vector of threads
    std::vector<std::thread> threads;
    threads.reserve(num_threads);

    // Make a vector of all the matches it discovered
    using input_t = decltype(min);
    std::vector<input_t> matches;
    std::mutex match_lock;

    // Whenever a match is found, this function gets called
    auto callback = [&](input_t match) {
        std::unique_lock<std::mutex> _lock(match_lock);
        std::cout << "Found match: " << match << '\n';
        matches.push_back(match);
    };

    for(int i = 0; i < num_threads; i++) {
        auto sub_min = min + ((max - min) * i) / num_threads;
        auto sub_max = min + ((max - min) * (i + 1)) / num_threads;
        threads.push_back(std::thread(cracker, passwdHash, hashFunc, sub_min, sub_max, callback));
    }

    // Join all the threads
    for(auto& thread : threads) {
        thread.join();
    }
    return matches;
};
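A toy invocation (my example, with a deliberately weak stand-in "hash"; it assumes the two lambdas above live at namespace scope and that <cstdint>, <iostream>, <mutex>, <thread>, and <vector> are included):

int main() {
    // Weak multiplicative "hash", purely for demonstration.
    auto toyHash = [](std::uint64_t x) { return (x * 2654435761u) % 1000003u; };

    std::uint64_t secret = 123456;
    auto passwdHash = toyHash(secret);

    // Search [0, 1000000) with 4 threads; collisions may yield extra candidates.
    auto matches = parallel_cracker(passwdHash, toyHash,
                                    std::uint64_t{0}, std::uint64_t{1000000}, 4);
    std::cout << "candidates found: " << matches.size() << '\n';
    return 0;
}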
Yes, and that's not surprising given the way it's written: by taking a mutex at the very beginning of your thread function (crack), you effectively make the threads run sequentially.
I understand you want to achieve a "synchronous start" of the threads (hence the condition variable cv), but you don't use it properly: without a call to one of its wait methods, cv.notify_all() is useless. It does not do what you intended; instead, your threads simply run sequentially.
Using wait() on the std::condition_variable in your crack() function is imperative: it releases the mtx (which you just grabbed with the lock guard lk) and blocks the thread's execution until cv.notify_all(). After that call, your other threads (except the first one, whichever that turns out to be) will still be serialized on mtx, so if you really want parallel execution you then need to unlock the mtx.
Here is how your crack thread should look:

crack()
{
    std::unique_lock<std::mutex> lk(mtx);
    cv.wait(lk);
    lk.unlock();
    ...do cracking stuff
}
Btw, you don't need the ready flag in your run() function as written - it is set but never read.
As for "I'm required to use either mutexes, atomic operations or barriers while also using semaphores, condition variables or channels" - different tools and techniques are good for different things; that part of the question is too general.
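One caveat to the above (my addition, not in the original answer): if run() manages to call notify_all() before a worker has reached cv.wait(lk), that notification is lost and the worker blocks forever. The standard remedy is to wait on a predicate, which also gives the otherwise-unread ready flag a job to do:

bool ready = false; // shared; written in run() under mtx

crack()
{
    std::unique_lock<std::mutex> lk(mtx);
    cv.wait(lk, [] { return ready; }); // re-checks the flag, so a notify
                                       // that happened early is not missed
    lk.unlock();
    ...do cracking stuff
}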

reusable barrier simple (alternating) implementation

#include <condition_variable>
#include <cstdint>
#include <mutex>

// ... inside main(), as the question notes below:
std::mutex mutex;
std::condition_variable cv;
uint8_t size = 2;
uint8_t count = size;
int8_t direction = -1; // signed: must hold -1 and flip sign cleanly

const auto sync = [&size, &count, &mutex, &cv, &direction]() //.
{
    {
        std::unique_lock<std::mutex> lock(mutex);
        auto current_direction = direction;
        if (--count == 0)
        {
            count = size;
            direction *= -1;
            cv.notify_all();
        }
        else
        {
            cv.wait(lock,
                    [&direction, &current_direction]() //.
                    { return direction != current_direction; });
        }
    }
};
As noted in the first unaccepted answer to reusable barrier, a 'generation' must be stored inside a barrier object to prevent the next generation from manipulating the wake-up 'condition' of the current generation for a given set of threads. What I do not like about that first unaccepted answer is its ever-growing generation counter. I believe we only need to differentiate between at most two generations: a thread that has satisfied the wait condition may immediately start another barrier synchronization call, as the second unaccepted solution suggests. That second solution, however, is somewhat complex, and I believe the snippet above is sufficient (currently implemented locally inside main, but it could be abstracted into a struct). Am I correct in my 'belief' that a barrier can be in use by at most 2 generations simultaneously?
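For reference, here is the same alternating scheme abstracted into a struct, as suggested above (my direct transliteration of the snippet, not a new algorithm):

#include <condition_variable>
#include <cstdint>
#include <mutex>

struct AlternatingBarrier {
    std::mutex mutex;
    std::condition_variable cv;
    uint8_t size;
    uint8_t count;
    int8_t direction = -1; // flips each generation

    explicit AlternatingBarrier(uint8_t n) : size(n), count(n) {}

    void sync() {
        std::unique_lock<std::mutex> lock(mutex);
        auto current_direction = direction;
        if (--count == 0) {
            count = size;
            direction *= -1; // start the next generation
            cv.notify_all();
        } else {
            cv.wait(lock, [this, current_direction] {
                return direction != current_direction;
            });
        }
    }
};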

How to increment the semaphore value in C++, to solve dining philosophers

I'm trying to solve the dining philosophers problem by creating a doorman that only allows 4 philosophers to dine at once. I planned on using semaphores for this, but there is limited material about them on the web, and I can't figure out how to increment the value of the semaphore once it has been signaled.
#define INITIAL_COUNT 1
#define MAX_COUNT 4

main()

    philo.doorSemaphore = CreateSemaphore(
        NULL,          // default security attributes
        INITIAL_COUNT, // initial count
        MAX_COUNT,     // maximum count
        NULL);

    while (philo.not_dead == true)
    {
        int num_philosophers = 5;
        for (int i = 0; i < 5; i++)
        {
            philo.mythread[i] = thread(philosophersFunction, i); // init 5 threads calling philofunction each loop
            philo.mythread[i].join();                            // join thread to current thread each loop
        }
        sleep_for(milliseconds(500));
        system("cls");
    }

waiting()

void Philosophers::waiting(int current)
{
    dWaitResult = WaitForSingleObject(doorSemaphore, 0L);
    //waitResult = WaitForSingleObject(semaphores, 0L);
    switch (dWaitResult)
    {
    case WAIT_OBJECT_0:
        p[current] = hungry;
        ReleaseSemaphore(doorSemaphore, 1, NULL);
        break;
    case WAIT_TIMEOUT:
        hunger[current]++;
        counter[current]++;
    case WAIT_FAILED:
        break;
        CloseHandle(doorSemaphore);
    }
}
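As an aside (my note, not part of the answer below): in portable C++20 the "increment" you are looking for is spelled release() on std::counting_semaphore, and for a doorman admitting 4 diners the initial count should be 4, not 1 as in INITIAL_COUNT above. A minimal sketch:

#include <semaphore>

// Doorman: at most 4 philosophers may be seated at once.
std::counting_semaphore<4> door(4);

void philosopher()
{
    door.acquire(); // decrements the count; blocks while it is zero
    // ... pick up forks, eat ...
    door.release(); // increments the count, admitting the next philosopher
}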
Dining Philosophers Rebooted is a thorough treatment of this classic problem using modern C++ with std::thread and std::mutex. Full source code is available at the link.
This code works by representing each fork as a std::mutex. Then the trick is how to lock two mutexes simultaneously without causing deadlock. C++11/14 comes with a function expressly for this purpose:
template <class L1, class L2, class... L3>
void lock(L1&, L2&, L3&...);
The above paper explores several possible implementations of std::lock for the 2-mutex and 3-mutex cases, and identifies one algorithm that is never worse than any other (and is often much better).
The optimal implementation (according to this paper) is in fact the algorithm used by libc++.
Here is the Philosopher::eat() function for the "2-D" case in the paper:
void
Philosopher::eat()
{
    using Lock = std::unique_lock<std::mutex>;
    Lock first;
    Lock second;
    if (flip_coin())
    {
        first = Lock(left_fork_, std::defer_lock);
        second = Lock(right_fork_, std::defer_lock);
    }
    else
    {
        first = Lock(right_fork_, std::defer_lock);
        second = Lock(left_fork_, std::defer_lock);
    }
    auto d = get_eat_duration();
    ::lock(first, second);
    auto end = std::chrono::steady_clock::now() + d;
    while (std::chrono::steady_clock::now() < end)
        ;
    eat_time_ += d;
}
For demonstration purposes only, the Philosopher randomly selects which fork to hold in the left and right hands. This randomness isn't required to solve the problem. The function could have been simplified to the following and still be correct:
void
Philosopher::eat()
{
    using Lock = std::unique_lock<std::mutex>;
    Lock first { left_fork_, std::defer_lock};
    Lock second{right_fork_, std::defer_lock};
    auto d = get_eat_duration();
    ::lock(first, second);
    auto end = std::chrono::steady_clock::now() + d;
    while (std::chrono::steady_clock::now() < end)
        ;
    eat_time_ += d;
}
In real code the call to ::lock should be std::lock, but this code is trying out several implementations of std::lock without invasively changing the std::lib.
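A side note (mine, not from the paper): since C++17 the same deadlock-avoidance algorithm is packaged into std::scoped_lock, which locks in its constructor and unlocks on scope exit, so the simplified eat() above could also be written as:

void
Philosopher::eat()
{
    auto d = get_eat_duration();
    std::scoped_lock forks(left_fork_, right_fork_); // locks both, deadlock-free
    auto end = std::chrono::steady_clock::now() + d;
    while (std::chrono::steady_clock::now() < end)
        ;
    eat_time_ += d;
}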

C++ boost thread: having a worker thread pause and unpause based on mutexes/conditions using a concurrent queue

I am fairly new to multi-threaded programming, so please forgive my possibly imprecise question. Here is my problem:
I have a function that processes data and generates lots of objects of the same type. This is done by iterating in several nested loops, so it would be practical to just run all the iterations, save the objects in some container, and then have the interfacing code work on that container for the next steps. However, I would have to create millions of these objects, which would blow up the memory usage. These constraints are mainly due to external factors I cannot control.
Generating only a certain amount of data at a time would be ideal, but breaking out of the loops and restarting later at the same point is impractical. My idea was to do the processing in a separate thread that is paused after n iterations and resumed once those n objects have been completely processed, then does the next n iterations, and so on until all iterations are done. It is important to wait until the thread has finished its n iterations, so the two threads would not really run in parallel.
This is where my problems begin: how do I do the mutex locking properly here? My approaches produce boost::lock_errors. Here is some code that shows what I want to do:
boost::recursive_mutex bla;
boost::condition_variable_any v1;
boost::condition_variable_any v2;
boost::recursive_mutex::scoped_lock lock(bla);
int got_processed = 0;
const int n = 10;

void ProcessNIterations() {
    got_processed = 0;
    // have some mutex or whatever unlocked here so that the worker thread can
    // start or resume.
    // my idea: have some sort of mutex lock that unlocks here and a condition
    // variable v1 that is notified while the thread is waiting for that.
    lock.unlock();
    v1.notify_one();

    // while the thread is working to do the iterations this function should wait
    // because there is no use to proceed until the n iterations are done
    // my idea: have another condition variable v2 that we wait for here and lock
    // afterwards so the thread is blocked/paused
    while (got_processed < n) {
        v2.wait(lock);
    }
}

void WorkerThread() {
    int counter = 0;
    // wait for something to start
    // my idea: acquire a mutex lock here that was locked elsewhere before and
    // wait for ProcessNIterations() to unlock it so this can start
    boost::recursive_mutex::scoped_lock internal_lock(bla);
    for (;;) {
        for (;;) {
            // here do the iterations
            counter++;
            std::cout << "iteration #" << counter << std::endl;
            got_processed++;
            if (counter >= n) {
                // we've done n iterations; pause here
                // my idea: unlock the mutex, notify v2
                internal_lock.unlock();
                v2.notify_one();
                while (got_processed > 0) {
                    // when ProcessNIterations() is called again, resume here
                    // my idea: wait for v1 reacquiring the mutex again
                    v1.wait(internal_lock);
                }
                counter = 0;
            }
        }
    }
}

int main(int argc, char *argv[]) {
    boost::thread mythread(WorkerThread);
    ProcessNIterations();
    ProcessNIterations();
    while (true) {}
}
The above code fails after doing 10 iterations in the line v2.wait(lock); with the following message:
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::lock_error> >'
what(): boost::lock_error
How do I do this properly? If this is the way to go, how do I avoid lock_errors?
EDIT: I solved it using a concurrent queue, as discussed here. The queue also has a maximum size, after which a push will simply wait until at least one element has been popped. Therefore, the producer worker can simply keep filling the queue, and the rest of the code can pop entries as convenient. No mutex locking needs to be done outside the queue. Here is the queue:
#include <queue>
#include <boost/thread.hpp>

template<typename Data>
class concurrent_queue
{
private:
    std::queue<Data> the_queue;
    mutable boost::mutex the_mutex;
    boost::condition_variable the_condition_variable;
    boost::condition_variable the_condition_variable_popped;
    int max_size_;

public:
    concurrent_queue(int max_size=-1) : max_size_(max_size) {}

    void push(const Data& data) {
        boost::mutex::scoped_lock lock(the_mutex);
        while (max_size_ > 0 && the_queue.size() >= static_cast<std::size_t>(max_size_)) {
            the_condition_variable_popped.wait(lock);
        }
        the_queue.push(data);
        lock.unlock();
        the_condition_variable.notify_one();
    }

    bool empty() const {
        boost::mutex::scoped_lock lock(the_mutex);
        return the_queue.empty();
    }

    bool wait_and_pop(Data& popped_value) {
        boost::mutex::scoped_lock lock(the_mutex);
        bool locked = true;
        if (the_queue.empty()) {
            locked = the_condition_variable.timed_wait(lock, boost::posix_time::seconds(1));
        }
        if (locked && !the_queue.empty()) {
            popped_value = the_queue.front();
            the_queue.pop();
            the_condition_variable_popped.notify_one();
            return true;
        } else {
            return false;
        }
    }

    int size() {
        boost::mutex::scoped_lock lock(the_mutex);
        return the_queue.size();
    }
};
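For illustration, a minimal usage sketch (my own, not part of the original edit): one producer fills the bounded queue while the main thread pops at its own pace.

#include <boost/thread.hpp>
#include <iostream>

int main() {
    concurrent_queue<int> queue(10); // push() blocks once 10 items are waiting

    boost::thread producer([&queue] {
        for (int i = 0; i < 100; ++i)
            queue.push(i); // waits automatically whenever the queue is full
    });

    int value = 0;
    int received = 0;
    while (received < 100) {
        if (queue.wait_and_pop(value)) { // returns false after a 1 s timeout
            ++received;
            // ... process value ...
        }
    }
    producer.join();
    std::cout << "received " << received << " items" << std::endl;
    return 0;
}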
This could be implemented using condition variables. Once you've performed N iterations, you call wait() on the condition variable; when the objects have been processed in the other thread, it calls signal() on the condition variable to unblock the thread that is blocked on it.
You probably want some sort of finite-capacity queue, list, or stack in conjunction with a condition variable. When the queue is full, the producer thread waits on the condition variable, and any time a consumer thread removes an element from the queue, it signals the condition variable. That allows the producer to wake up and refill the queue. If you really want to process N elements at a time, then have the workers signal only when there is capacity in the queue for N elements, rather than every time they pull an item out of the queue.