Atomic thread counter - c++

I'm experimenting with the C++11 atomic primitives to implement an atomic "thread counter" of sorts. Basically, I have a single critical section of code. Within this code block, any thread is free to READ from memory. However, sometimes, I want to do a reset or clear operation, which resets all shared memory to a default initialized value.
This seems like a great opportunity to use a read-write lock. C++11 doesn't include read-write mutexes out of the box, but maybe something simpler will do. I thought this problem would be a great opportunity to become more familiar with C++11 atomic primitives.
So I thought through this problem for a while, and it seems to me that all I have to do is:
Whenever a thread enters the critical section, increment an atomic counter variable.
Whenever a thread leaves the critical section, decrement the atomic counter variable.
If a thread wishes to reset all variables to default values, it must atomically wait for the counter to be 0, then atomically set it to some special "clearing flag" value, perform the clear, then reset the counter to 0.
Of course, threads wishing to increment and decrement the counter must also check for the clearing flag.
So, the algorithm I just described can be implemented with three functions. The first function, increment_thread_counter(), must ALWAYS be called before entering the critical section. The second function, decrement_thread_counter(), must ALWAYS be called right before leaving the critical section. Finally, the function clear() can be called from outside the critical section iff the thread counter == 0.
This is what I came up with:
Given:
A thread counter variable, std::atomic<std::size_t> thread_counter
A constant clearing_flag set to std::numeric_limits<std::size_t>::max()
...
void increment_thread_counter()
{
std::size_t expected = 0;
while (!std::atomic_compare_exchange_strong(&thread_counter, &expected, 1))
{
if (expected != clearing_flag)
{
thread_counter.fetch_add(1);
break;
}
expected = 0;
}
}
void decrement_thread_counter()
{
thread_counter.fetch_sub(1);
}
void clear()
{
std::size_t expected = 0;
while (!thread_counter.compare_exchange_strong(expected, clearing_flag)) expected = 0;
/* PERFORM WRITES WHICH WRITE TO ALL SHARED VARIABLES */
thread_counter.store(0);
}
As far as I can reason, this should be thread-safe. Note that the decrement_thread_counter function shouldn't require ANY synchronization logic, because it is a given that increment() is always called before decrement(). So, when we get to decrement(), thread_counter can never equal 0 or clearing_flag.
Regardless, since THREADING IS HARD™, and I'm not an expert at lockless algorithms, I'm not entirely sure this algorithm is race-condition free.
Question: Is this code thread safe? Are any race conditions possible here?

You have a race condition; bad things happen if another thread changes the counter between increment_thread_counter()'s test for clearing_flag and the fetch_add.
I think this classic CAS loop should work better:
void increment_thread_counter()
{
std::size_t expected = 0;
std::size_t updated;
do {
if (expected == clearing_flag) { // don't want to succeed while clearing,
expected = 0; //take a chance that clearing completes before CMPEXC
}
updated = expected + 1;
// if (updated == clearing_flag) TOO MANY READERS!
} while (!std::atomic_compare_exchange_weak(&thread_counter, &expected, updated));
}
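To see the pieces working together, here is a minimal self-contained sketch of the counter with the CAS-loop increment. The names enter_read, leave_read and shared_value are hypothetical (they are not in the question), and the yield calls are an assumption made only to keep the spin loops polite; default sequentially consistent ordering is used throughout.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <limits>
#include <thread>

// Sentinel stored in the counter while a clear is in progress.
constexpr std::size_t clearing_flag = std::numeric_limits<std::size_t>::max();

std::atomic<std::size_t> thread_counter{0};
int shared_value = 0; // hypothetical stand-in for the shared state

// Register one reader. The check for clearing_flag and the increment are
// a single atomic step, so a clear cannot slip in between them.
void enter_read() {
    std::size_t expected = thread_counter.load();
    for (;;) {
        if (expected == clearing_flag) {
            // A clear is running: back off and re-read the counter.
            std::this_thread::yield();
            expected = thread_counter.load();
            continue;
        }
        // On failure, compare_exchange_weak reloads 'expected' for us.
        if (thread_counter.compare_exchange_weak(expected, expected + 1)) {
            return;
        }
    }
}

// Unregister one reader. Must be paired with a prior enter_read().
void leave_read() {
    std::size_t previous = thread_counter.fetch_sub(1);
    assert(previous != 0 && previous != clearing_flag);
    (void)previous;
}

// Wait until no reader is registered, claim the counter, reset the state.
void clear() {
    std::size_t expected = 0;
    while (!thread_counter.compare_exchange_weak(expected, clearing_flag)) {
        expected = 0;
        std::this_thread::yield();
    }
    shared_value = 0;        // exclusive access: no reader is registered
    thread_counter.store(0); // reopen the section for readers
}
```

A reader wraps its access as enter_read(); ... leave_read();. Note that a steady stream of readers can keep the counter above zero indefinitely, so clear() may starve in this sketch.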

Related

Is there a way to test a self-written C++ semaphore?

I read "The Little Book of Semaphores" by Allen B. Downey and realized that I have to implement a semaphore in C++ first, as semaphores become a part of the standard library only in C++20.
I used the definition from the book:
A semaphore is like an integer, with three differences:
When you create the semaphore, you can initialize its value to any integer, but after that the only operations you are allowed to perform are increment (increase by one) and decrement (decrease by one). You cannot read the current value of the semaphore.
When a thread decrements the semaphore, if the result is negative, the thread blocks itself and cannot continue until another thread increments the semaphore.
When a thread increments the semaphore, if there are other threads waiting, one of the waiting threads gets unblocked.
I also used the answers to the question C++0x has no semaphores? How to synchronize threads?
My implementation is a bit different from those at the link, as I unlock my mutex before notifying a thread on signalling, and the book's definition is also a bit different.
So I implemented the semaphore, and now I've realized that I don't know how to test it properly, beyond the simplest case such as sequencing two calls across two threads. Is there a way to test the implementation, for example by running 100 threads and checking that there is no deadlock? What test should I write to check the implementation? Or, if the only way to check is to read through the code attentively, could you please check it?
My implementation:
// semaphore.h
#include <condition_variable>
#include <mutex>
namespace Semaphore
{
class CountingSemaphore final
{
public:
explicit CountingSemaphore(int initialValue = 0);
void signal();
void wait();
private:
std::mutex _mutex{};
std::condition_variable _conditionVar{};
int _value;
};
} // namespace Semaphore
// semaphore.cpp
#include "semaphore.h"
namespace Semaphore
{
CountingSemaphore::CountingSemaphore(const int initialValue) : _value(initialValue) {}
void CountingSemaphore::signal()
{
std::unique_lock<std::mutex> lock{_mutex};
++_value;
if (0 == _value)
{
lock.unlock();
_conditionVar.notify_one();
}
}
void CountingSemaphore::wait()
{
std::unique_lock<std::mutex> lock{_mutex};
--_value;
if (0 > _value)
{
_conditionVar.wait(lock, [this]() { return (0 <= this->_value); });
}
}
} // namespace Semaphore
This code is broken. Your current state machine uses negative numbers to indicate a number of waiters, and non-negative indicates remaining capacity (with 0 being no availability, but no waiters either).
Problem is, you only notify waiters when the count becomes zero. So if you had a semaphore with initial value 1 (basically a logical mutex), and five threads try to grab it, one gets it, and four others wait, at which point value is -4. When the thread that grabbed it finishes and signals, value rises to -3, but that's not 0, so notify_one is not called.
In addition, you have some redundant code. The predicate-based form of std::condition_variable's wait is equivalent to:
while (!predicate()) {
wait(lock);
}
so your if check is redundant (wait will check the same information before it actually waits, even once, anyway).
You could also condense the code a bit by having the increment and decrement on the same line you test them (it's not necessary since you're using mutexes to protect the whole block, not relying on atomics, but I like writing threaded code in a way that would be easier to port to atomics later, mostly to keep in the habit when I write actual atomics code). Fixing the bug and condensing the code gets this end result:
void CountingSemaphore::signal()
{
std::unique_lock<std::mutex> lock{_mutex};
// If we were negative, we had at least one waiter, so notify one of them.
// Personally, I'd find if (value++ < 0) clearer as meaning "if value *was*
// less than 0, and also increment it afterwards by side-effect", but I stuck
// to something closer to your design to avoid confusion.
if (0 >= ++_value)
{
lock.unlock();
_conditionVar.notify_one();
}
}
void CountingSemaphore::wait()
{
std::unique_lock<std::mutex> lock{_mutex};
--_value;
_conditionVar.wait(lock, [this]() { return 0 <= this->_value; });
}
The alternative approach would be adopting a design that doesn't drop the count below 0 (so 0 means "has waiters" and otherwise the range of values is just 0 to initialValue). This is safer in theoretical circumstances (you can't trigger wraparound by having 2 ** (8 * sizeof(int) - 1) waiters). You could minimize that risk by making value a ssize_t (so 64 bit systems would be exponentially less likely to hit the bug), or by changing the design to stop at 0:
// value's declaration in header and argument passed to constructor may be changed
// to an unsigned type if you like
void CountingSemaphore::signal()
{
// Just for fun, a refactoring to use lock_guard rather than unique_lock
// since the logical unlock is unconditionally after the value read/modify
int oldvalue; // Type should match that of value
{
std::lock_guard<std::mutex> lock{_mutex};
oldvalue = _value++;
}
if (0 == oldvalue) _conditionVar.notify_one();
}
void CountingSemaphore::wait()
{
std::unique_lock<std::mutex> lock{_mutex};
// Waits only if current count is 0, returns with lock held at which point
// it's guaranteed greater than 0
_conditionVar.wait(lock, [this]() { return 0 != this->_value; });
--_value; // We only decrement when it's positive, just before we return to caller
}
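For the testing part of the question, one possible approach is a stress test that records how many threads are inside the guarded region at once and checks that the peak never exceeds the initial count. The sketch below is an assumption-laden illustration rather than a definitive implementation: it uses the non-negative design, max_concurrency is a hypothetical helper, and it notifies on every signal, because notifying only on the 0 to 1 transition can miss a wakeup when two signals arrive before the first woken waiter has run.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

class CountingSemaphore final {
public:
    explicit CountingSemaphore(int initialValue = 0) : _value(initialValue) {
        assert(initialValue >= 0); // this design never goes negative
    }
    void signal() {
        {
            std::lock_guard<std::mutex> lock{_mutex};
            ++_value;
        }
        _conditionVar.notify_one(); // harmless if nobody is waiting
    }
    void wait() {
        std::unique_lock<std::mutex> lock{_mutex};
        _conditionVar.wait(lock, [this] { return _value > 0; });
        --_value;
    }
private:
    std::mutex _mutex;
    std::condition_variable _conditionVar;
    int _value;
};

// Hypothetical stress helper: 'threads' workers each perform 'rounds'
// wait/signal pairs. Returns the largest number of workers that were
// ever between wait() and signal() at the same moment.
int max_concurrency(int permits, int threads, int rounds) {
    CountingSemaphore sem{permits};
    std::atomic<int> inside{0};
    std::atomic<int> peak{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < rounds; ++i) {
                sem.wait();
                int now = ++inside;
                int seen = peak.load();
                // Raise 'peak' to 'now' unless another thread already did.
                while (now > seen && !peak.compare_exchange_weak(seen, now)) {}
                --inside;
                sem.signal();
            }
        });
    }
    for (auto& th : pool) th.join(); // a deadlock would hang here
    return peak.load();
}
```

A test like this cannot prove correctness, but a hang in join() or a peak above the permit count is a reliable sign of a bug. Running it under a thread sanitizer adds further coverage.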
This second design has a minor flaw, in that it calls notify_one when signaled with no available resources, but no available resources might mean "has waiters" or it might mean "all resources consumed, but no one waiting". This isn't actually a problem logically speaking though; calling notify_one with no waiters is legal and does nothing, though it may be slightly less efficient. Your original design may be preferable on that basis (it doesn't notify_one unless there were definitely waiters).

Concurrency Model C++

Suppose you are given the following code:
class FooBar {
public void foo() {
for (int i = 0; i < n; i++) {
print("foo");
}
}
public void bar() {
for (int i = 0; i < n; i++) {
print("bar");
}
}
}
The same instance of FooBar will be passed to two different threads. Thread A will call foo() while thread B will call bar(). Modify the given program to output "foobar" n times.
For the following problem on leetcode we have to write two functions
void foo(function<void()> printFoo);
void bar(function<void()> printBar);
where printFoo and printBar are function objects that print "foo" and "bar" respectively. The functions foo and bar are called in a multithreaded environment, and there is no ordering guarantee on how foo and bar are called.
My solution was
class FooBar {
private:
int n;
mutex m1;
condition_variable cv;
condition_variable cv2;
bool flag;
public:
FooBar(int n) {
this->n = n;
flag=false;
}
void foo(function<void()> printFoo) {
for (int i = 0; i < n; i++) {
unique_lock<mutex> lck(m1);
cv.wait(lck,[&]{return !flag;});
printFoo();
flag=true;
lck.unlock();
cv2.notify_one();
}
}
void bar(function<void()> printBar) {
for (int i = 0; i < n; i++) {
unique_lock<mutex> lck(m1);
cv2.wait(lck,[&]{return flag;});
printBar();
flag=false;
lck.unlock();
cv.notify_one();
// printBar() outputs "bar". Do not change or remove this line.
}
}
};
Let us assume, at time t = 0 bar was called and then at time t = 10 foo was called, foo goes through the critical section protected by the mutex m1.
My questions are:
Does the C++ memory model because of the fencing property guarantee that when the bar function resumes from waiting on cv2 the value of flag will be set to true?
Am I right in assuming that locks shared among threads enforce a before-and-after relationship, in the manner of Leslie Lamport's clock system? The compiler and C++ guarantee that everything before the end of a critical section (here, the release of the lock) will be observed by any thread that re-enters the lock, so common locks, atomics, and semaphores can be visualised as enforcing before-and-after behavior by establishing time in a multithreaded environment.
Can we solve this problem using just one condition variable?
Is there a way to do this without using locks and just atomics. What performance improvements do atomics give over locks?
What happens if I do cv.notify_one() and correspondingly cv2.notify_one() within the critical region? Is there a chance of a missed interrupt?
Original Problem
https://leetcode.com/problems/print-foobar-alternately/.
Leslie Lamport's Paper
https://lamport.azurewebsites.net/pubs/time-clocks.pdf
Does the C++ memory model because of the fencing property guarantee that when the bar function resumes from waiting on cv2 the value of flag will be set to true?
By itself, a condition variable is prone to spurious wake-ups. A cv.wait(lck) call without a predicate clause can return for all kinds of reasons. That's why it's always important to check the predicate condition in a while loop around the wait call. You should never assume that when wait(lck) returns, the thing you were waiting for has actually happened. But with the clause you added within the wait, cv2.wait(lck,[&]{return flag;});, this check is taken care of for you. So yes, when wait(lck, predicate) returns, flag will be true.
Can we solve this problem using just one condition variable?
Absolutely. Just get rid of cv2 and have both threads wait (and notify) on the first cv.
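A minimal sketch of that single-condition-variable version follows. It assumes exactly two threads, as in the problem statement, so notify_one always reaches the only other possible waiter; with more threads sharing the variable, notify_all would be the safer choice. The helper run_foobar and the member names are hypothetical.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <string>
#include <thread>

class FooBar {
public:
    explicit FooBar(int n) : n_(n) { assert(n >= 0); }

    void foo(std::function<void()> printFoo) {
        for (int i = 0; i < n_; ++i) {
            std::unique_lock<std::mutex> lck(m_);
            cv_.wait(lck, [this] { return !bar_turn_; });
            printFoo();
            bar_turn_ = true;
            lck.unlock();
            cv_.notify_one(); // same cv for both directions
        }
    }

    void bar(std::function<void()> printBar) {
        for (int i = 0; i < n_; ++i) {
            std::unique_lock<std::mutex> lck(m_);
            cv_.wait(lck, [this] { return bar_turn_; });
            printBar();
            bar_turn_ = false;
            lck.unlock();
            cv_.notify_one();
        }
    }

private:
    int n_;
    std::mutex m_;
    std::condition_variable cv_;
    bool bar_turn_ = false; // false: foo's turn, true: bar's turn
};

// Hypothetical driver: starts bar first on purpose and collects the output.
std::string run_foobar(int n) {
    FooBar fb(n);
    std::string out; // only touched while the mutex is held
    std::thread b([&] { fb.bar([&] { out += "bar"; }); });
    std::thread a([&] { fb.foo([&] { out += "foo"; }); });
    a.join();
    b.join();
    return out;
}
```

The predicate is what makes sharing one variable safe: a thread woken at the wrong time simply sees that it is not its turn and goes back to waiting.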
Is there a way to do this without using locks and just atomics. What performance improvements do atomics give over locks?
Atomics are great when you can get away with polling on one thread instead of waiting. Imagine a UI thread that wants to show you the current speed of your car, and it polls the speed variable on every frame refresh. Another thread, the "engine thread", sets that atomic<int> speed variable with every rotation of the tire. That's where atomics shine: when you already have a polling loop in place. On x86, atomics are mostly implemented with the LOCK opcode prefix (i.e. the CPU handles the concurrency correctly).
As for an implementation with just atomics... well, it's late for me. The easy solution: both threads just sleep and poll on an atomic integer that increments with each thread's turn. Each thread just waits for the value to be "last+2" and polls every few milliseconds. Not efficient, but it would work.
It's a bit late in the evening for me to think about how to do this with a single mutex or a pair of mutexes.
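The polling idea can be sketched roughly as follows. This is an illustration under assumptions, not a tuned solution: FooBarAtomic and run_foobar_atomic are hypothetical names, and the loop yields instead of sleeping for a few milliseconds, which keeps the example fast but burns CPU while waiting.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <string>
#include <thread>

class FooBarAtomic {
public:
    explicit FooBarAtomic(int n) : n_(n) { assert(n >= 0); }

    void foo(std::function<void()> printFoo) {
        for (int i = 0; i < n_; ++i) {
            // foo owns the even turns: 0, 2, 4, ...
            while (turn_.load(std::memory_order_acquire) != 2 * i) {
                std::this_thread::yield();
            }
            printFoo();
            turn_.store(2 * i + 1, std::memory_order_release);
        }
    }

    void bar(std::function<void()> printBar) {
        for (int i = 0; i < n_; ++i) {
            // bar owns the odd turns: 1, 3, 5, ...
            while (turn_.load(std::memory_order_acquire) != 2 * i + 1) {
                std::this_thread::yield();
            }
            printBar();
            turn_.store(2 * i + 2, std::memory_order_release);
        }
    }

private:
    int n_;
    std::atomic<int> turn_{0};
};

// Hypothetical driver that collects the output in a string.
std::string run_foobar_atomic(int n) {
    FooBarAtomic fb(n);
    std::string out; // ordered by the release/acquire pairs on turn_
    std::thread b([&] { fb.bar([&] { out += "bar"; }); });
    std::thread a([&] { fb.foo([&] { out += "foo"; }); });
    a.join();
    b.join();
    return out;
}
```

Each thread waits for its own next turn number, which is the previous one plus two, and the release store paired with the acquire load makes the other thread's output visible before the next print.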
What happens if I do cv.notify_one() and correspondingly cv2.notify_one() within the critical region? Is there a chance of a missed interrupt?
No, you're fine, as long as all your threads are holding a lock and checking their predicate condition before entering the wait call. You can do the notify call inside or outside of the critical region. I always recommend doing notify_all over notify_one, but that might even be unnecessary.

When would getters and setters with mutex be thread safe?

Consider the following class:
class testThreads
{
private:
int var; // variable to be modified
std::mutex mtx; // mutex
public:
void set_var(int arg) // setter
{
std::lock_guard<std::mutex> lk(mtx);
var = arg;
}
int get_var() // getter
{
std::lock_guard<std::mutex> lk(mtx);
return var;
}
void hundred_adder()
{
for(int i = 0; i < 100; i++)
{
int got = get_var();
set_var(got + 1);
sleep(0.1);
}
}
};
When I create two threads in main(), each with a thread function of hundred_adder modifying the same variable var, the end result of the var is always different i.e. not 200 but some other number.
Conceptually speaking, why is this use of mutex with getter and setter functions not thread-safe? Do the lock-guards fail to prevent the race-condition to var? And what would be an alternative solution?
Thread a: get 0
Thread b: get 0
Thread a: set 1
Thread b: set 1
Lo and behold, var is 1 even though it should've been 2.
It should be obvious that you need to lock the whole operation:
for(int i = 0; i < 100; i++){
std::lock_guard<std::mutex> lk(mtx);
var += 1;
}
Alternatively, you could make the variable atomic (even a relaxed one could do in your case).
int got = get_var();
set_var(got + 1);
Your get_var() and set_var() themselves are thread safe. But this combined sequence of get_var() followed by set_var() is not. There is no mutex that protects this entire sequence.
You have multiple concurrent threads executing this. You have multiple threads calling get_var(). After the first one finishes it and unlocks the mutex, another thread can lock the mutex immediately and obtain the same value for got that the first thread did. There's absolutely nothing that prevents multiple threads from locking and obtaining the same got, concurrently.
Then both threads will call set_var(), updating the mutex-protected int to the same value.
That's just one possibility that can happen here. You could easily have multiple threads acquiring the mutex sequentially and thus incrementing var by several values, only to be followed by some other, stalled thread, that called get_var() several seconds ago, and only now getting around to calling set_var(), thus resetting var to a much smaller value.
The code shown is thread-safe in the sense that it will never set or get a partial value of the variable.
But your usage of the methods does not guarantee that the value will change correctly: reads and writes from multiple threads can interleave. Both threads read the value (11), both increment it (to 12), and then both set it to the same value (12). You counted twice but effectively incremented only once.
Option to fix:
provide "safe increment" operation
provide equivalent of InterlockedCompareExchange to make sure value you are updating correspond to original one and retry as necessary
wrap calling code into separate mutex or use other synchronization mechanism to prevent operations to intermix.
Why don't you just use std::atomic for the shared data (var in this case)? That would be safer and more efficient.
This is an absolute classic.
One thread obtains the value of var, releases the mutex and another obtains the same value before the first thread has chance to update it.
Consequently the process risks losing increments.
There are three obvious solutions:
void testThreads::inc_var(){
std::lock_guard<std::mutex> lk(mtx);
++var;
}
That's safe because the mutex is held until the variable is updated.
Next up:
bool testThreads::compare_and_inc_var(int val){
std::lock_guard<std::mutex> lk(mtx);
if(var!=val) return false;
++var;
return true;
}
Then write code like:
int val;
do{
val=get_var();
}while(!compare_and_inc_var(val));
This works because the loop repeats until it confirms it's updating the value it read. This could result in live-lock though in this case it has to be transient because a thread can only fail to make progress because another does.
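Finally, replace int var with std::atomic<int> var and use ++var, var.fetch_add(1), or var.compare_exchange_strong(val, val+1) (in a retry loop) to update it.
NB: Notice compare_exchange_strong(var, var+1) is invalid: the expected argument must be a separate non-atomic int, not the atomic itself.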
++ is guaranteed to be atomic on std::atomic<> types but despite 'looking' like a single operation in general no such guarantee exists for int.
std::atomic<> also provides appropriate memory barriers (and ways to hint what kind of barrier is needed) to ensure proper inter-thread communication.
std::atomic<> should be a lock-free (often wait-free) implementation where available. Check your documentation and the is_lock_free() member function.
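As a rough sketch of the std::atomic option, the following shows both update styles side by side. The function names and the run_adders helper are hypothetical; relaxed ordering is assumed to be enough here because only the final count matters and join() provides the synchronization before it is read.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Increment 'var' 100 times with a single atomic read-modify-write each.
void hundred_adder_fetch_add(std::atomic<int>& var) {
    for (int i = 0; i < 100; ++i) {
        var.fetch_add(1, std::memory_order_relaxed);
    }
}

// Same effect with an explicit compare-and-swap retry loop.
void hundred_adder_cas(std::atomic<int>& var) {
    for (int i = 0; i < 100; ++i) {
        int seen = var.load(std::memory_order_relaxed);
        // On failure 'seen' is refreshed with the current value; retry.
        while (!var.compare_exchange_weak(seen, seen + 1,
                                          std::memory_order_relaxed)) {
        }
    }
}

// Hypothetical driver: half the threads use fetch_add, half use CAS.
int run_adders(int threads) {
    assert(threads > 0);
    std::atomic<int> var{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        if (t % 2 == 0) {
            pool.emplace_back(hundred_adder_fetch_add, std::ref(var));
        } else {
            pool.emplace_back(hundred_adder_cas, std::ref(var));
        }
    }
    for (auto& th : pool) th.join();
    return var.load();
}
```

Unlike the get_var()/set_var() pair, each increment here is one indivisible operation, so no update can be lost between the read and the write.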

Using a mutex to block execution from outside the critical section

I'm not sure I got the terminology right but here goes - I have this function that is used by multiple threads to write data (using pseudo code in comments to illustrate what I want)
//these are initiated in the constructor
int* data;
std::atomic<size_t> size;
void write(int value) {
//wait here while "read_lock"
//set "write_lock" to "write_lock" + 1
auto slot = size.fetch_add(1, std::memory_order_acquire);
data[slot] = value;
//set "write_lock" to "write_lock" - 1
}
the order of the writes is not important, all I need here is for each write to go to a unique slot
Every once in a while though, I need one thread to read the data using this function
int* read() {
//set "read_lock" to true
//wait here while "write_lock"
int* ret = data;
data = new int[capacity];
size = 0;
//set "read_lock" to false
return ret;
}
so it basically swaps out the buffer and returns the old one (I've removed capacity logic to make the snippets shorter)
In theory this should lead to 2 operating scenarios:
1 - just a bunch of threads writing into the container
2 - when some thread executes the read function, all new writers will have to wait, the reader will wait until all existing writes are finished, it will then do the read logic and scenario 1 can continue.
The question part is that I don't know what kind of a barrier to use for the locks -
A spinlock would be wasteful since there are many containers like this and they all need cpu cycles
I don't know how to apply std::mutex since I only want the write function to be in a critical section if the read function is triggered. Wrapping the whole write function in a mutex would cause unnecessary slowdown for operating scenario 1.
So what would be the optimal solution here?
If you have C++14 capability then you can use a std::shared_timed_mutex to separate out readers and writers. In this scenario it seems you need to give your writer threads shared access (allowing other writer threads at the same time) and your reader threads unique access (kicking all other threads out).
So something like this may be what you need:
class MyClass
{
public:
using mutex_type = std::shared_timed_mutex;
using shared_lock = std::shared_lock<mutex_type>;
using unique_lock = std::unique_lock<mutex_type>;
private:
mutable mutex_type mtx;
public:
// All updater threads can operate at the same time
auto lock_for_updates() const
{
return shared_lock(mtx);
}
// Reader threads need to kick all the updater threads out
auto lock_for_reading() const
{
return unique_lock(mtx);
}
};
// many threads can call this
void do_writing_work(std::shared_ptr<MyClass> sptr)
{
auto lock = sptr->lock_for_updates();
// update the data here
}
// access the data from one thread only
void do_reading_work(std::shared_ptr<MyClass> sptr)
{
auto lock = sptr->lock_for_reading();
// read the data here
}
The shared_locks allow other threads to gain a shared_lock at the same time but prevent a unique_lock gaining simultaneous access. When a reader thread tries to gain a unique_lock all shared_locks will be vacated before the unique_lock gets exclusive control.
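Applied to the buffer from the question, the same idea might look like the sketch below. The class SwapBuffer, the Drained struct, and the bounds check are assumptions added for illustration; the question elided its capacity logic, so this version simply rejects writes once the buffer is full.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <utility>

// Result of a read(): the old buffer and how many slots were filled.
struct Drained {
    std::unique_ptr<int[]> data;
    std::size_t count;
};

class SwapBuffer {
public:
    explicit SwapBuffer(std::size_t capacity)
        : data_(new int[capacity]), capacity_(capacity) {
        assert(capacity > 0);
    }

    // Many threads may call this at once: the lock is shared, and
    // fetch_add hands each caller a unique slot.
    bool write(int value) {
        std::shared_lock<std::shared_timed_mutex> lock(mtx_);
        std::size_t slot = size_.fetch_add(1, std::memory_order_relaxed);
        if (slot >= capacity_) return false; // full until the next read()
        data_[slot] = value;
        return true;
    }

    // Exclusive: waits for in-flight writers, then swaps the buffer out.
    Drained read() {
        // Allocate before locking so writers are not held up by new[].
        std::unique_ptr<int[]> fresh(new int[capacity_]);
        std::unique_lock<std::shared_timed_mutex> lock(mtx_);
        std::swap(data_, fresh); // 'fresh' now holds the old buffer
        std::size_t count = std::min(size_.exchange(0), capacity_);
        return Drained{std::move(fresh), count};
    }

private:
    std::shared_timed_mutex mtx_;
    std::unique_ptr<int[]> data_;
    std::atomic<std::size_t> size_{0};
    std::size_t capacity_;
};
```

Writers still pay for a shared lock on every call, so whether this beats a hand-rolled scheme depends on the platform's shared mutex implementation and is worth measuring.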
You can also do this with regular mutexes and condition variables rather than shared. Supposedly shared_mutex has higher overhead, so I'm not sure which will be faster. With Gallik's solution you'd presumably be paying to lock the shared mutex on every write call; I got the impression from your post that write gets called way more than read so maybe this is undesirable.
int* data; // initialized somewhere
std::atomic<size_t> size = 0;
std::atomic<bool> reading = false;
std::atomic<int> num_writers = 0;
std::mutex entering;
std::mutex leaving;
std::condition_variable cv;
void write(int x) {
++num_writers;
if (reading) {
--num_writers;
if (num_writers == 0)
{
std::lock_guard l(leaving);
cv.notify_one();
}
{ std::lock_guard l(entering); }
++num_writers;
}
auto slot = size.fetch_add(1, std::memory_order_acquire);
data[slot] = x;
--num_writers;
if (reading && num_writers == 0)
{
std::lock_guard l(leaving);
cv.notify_one();
}
}
int* read() {
int* other_data = new int[capacity];
{
std::unique_lock enter_lock(entering);
reading = true;
std::unique_lock leave_lock(leaving);
cv.wait(leave_lock, [] () { return num_writers == 0; });
swap(data, other_data);
size = 0;
reading = false;
}
return other_data;
}
It's a bit complicated and took me some time to work out, but I think this should serve the purpose pretty well.
In the common case where only writing is happening, reading is always false. So you do the usual work, and pay for two additional atomic increments and two untaken branches. The common path does not need to lock any mutexes, unlike the solution involving a shared mutex, which is supposedly expensive: http://permalink.gmane.org/gmane.comp.lib.boost.devel/211180.
Now, suppose read is called. The expensive, slow heap allocation happens first; meanwhile, writing continues uninterrupted. Next, the entering lock is acquired, which has no immediate effect. Now, reading is set to true. Immediately, any new calls to write enter the first branch, and eventually hit the entering lock, which they are unable to acquire (as it's already taken), and those threads then get put to sleep.
Meanwhile, the read thread is now waiting on the condition that the number of writers is 0. If we're lucky, this could actually go through right away. If however there are threads in write in either of the two locations between incrementing and decrementing num_writers, then it will not. Each time a write thread decrements num_writers, it checks if it has reduced that number to zero, and when it does it will signal the condition variable. Because num_writers is atomic which prevents various reordering shenanigans, it is guaranteed that the last thread will see num_writers == 0; it could also be notified more than once but this is ok and cannot result in bad behavior.
Once that condition variable has been signalled, that shows that all writers are either trapped in the first branch or are done modifying the array. So the read thread can now safely swap the data, and then unlock everything, and then return what it needs to.
As mentioned before, in typical operation there are no locks, just increments and untaken branches. Even when a read does occur, the read thread will have one lock and one condition variable wait, whereas a typical write thread will have about one lock/unlock of a mutex and that's all (one, or a small number of write threads, will also perform a condition variable notification).

Can I switch the test and modification part in wait/signal semaphore?

The classic non-busy-waiting versions of the wait() and signal() semaphore operations are implemented as below. In this version, value can be negative.
//primitive
wait(semaphore* S)
{
S->value--;
if (S->value < 0)
{
add this process to S->list;
block();
}
}
//primitive
signal(semaphore* S)
{
S->value++;
if (S->value <= 0)
{
remove a process P from S->list;
wakeup(P);
}
}
Question: Is the following version also correct? Here I test first and modify the value. It's great if you can show me a scenario where it doesn't work.
//primitive wait().
//If (S->value > 0), the whole function is atomic
//otherwise, only the if(){} section is atomic
wait(semaphore* S)
{
if (S->value <= 0)
{
add this process to S->list;
block();
}
// here I decrement the value after the previous test and possible blocking
S->value--;
}
//similar to wait()
signal(semaphore* S)
{
if (S->list is not empty)
{
remove a process P from S->list;
wakeup(P);
}
// here I increment the value after the previous test and possible waking up
S->value++;
}
Edit:
My motivation is to figure out whether I can use this latter version to achieve mutual exclusion, and no deadlock, no starvation.
Your modified version introduces a race condition:
Thread A: if(S->Value <= 0) // Value = 1
Thread B: if(S->Value <= 0) // Value = 1
Thread A: S->Value--; // Value = 0
Thread B: S->Value--; // Value = -1
Both threads have acquired a count=1 semaphore. Oops. Note that there's another problem even if they're non-preemptible (see below), but for completeness, here's a discussion on atomicity and how real locking protocols work.
When working with protocols like this, it's very important to nail down exactly what atomic primitives you are using. Atomic primitives are such that they seem to execute instantaneously, without being interleaved with any other operations. You cannot just take a big function and call it atomic; you have to make it atomic somehow, using other atomic primitives.
Most CPUs offer a primitive called 'atomic compare and exchange'. I'll abbreviate it cmpxchg from here on. The semantics are like so:
bool cmpxchg(long *ptr, long old, long new) {
if (*ptr == old) {
*ptr = new;
return true;
} else {
return false;
}
}
cmpxchg is not implemented with this code. It is in the CPU hardware, but behaves a bit like this, only atomically.
Now, let's add to this some additional helpful functions (built out of other primitives):
add_waitqueue(waitqueue) - Sets our process state to sleeping and adds us to a wait queue, but continues executing (ATOMIC)
schedule() - Switch threads. If we're in a sleeping state, we don't run again until awakened (BLOCKING)
remove_waitqueue(waitqueue) - removes our process from a wait queue, then sets our state to awakened if it isn't already (ATOMIC)
memory_barrier() - ensures that any reads/writes logically before this point actually are performed before this point, avoiding nasty memory ordering issues (we'll assume all other atomic primitives come with a free memory barrier, although this isn't always true) (CPU/COMPILER PRIMITIVE)
Here's how a typical semaphore acquisition routine will look. It's a bit more complex than your example, because I've explicitly nailed down what atomic operations I'm using:
void sem_down(sem *pSem)
{
while (1) {
long spec_count = pSem->count;
read_memory_barrier(); // make sure spec_count doesn't start changing on us! pSem->count may keep changing though
if (spec_count > 0)
{
if (cmpxchg(&pSem->count, spec_count, spec_count - 1)) // ATOMIC
return; // got the semaphore without blocking
else
continue; // count is stale, try again
} else { // semaphore count is zero
add_waitqueue(pSem->wqueue); // ATOMIC
// recheck the semaphore count, now that we're in the waitqueue - it may have changed
if (pSem->count == 0) schedule(); // NOT ATOMIC
remove_waitqueue(pSem->wqueue); // ATOMIC
// loop around again to try to acquire the semaphore
}
}
}
You'll note that the actual test for a non-zero pSem->count, in a real-world semaphore_down function, is accomplished by cmpxchg. You can't trust any other read; the value can change an instant after you read the value. We simply can't separate the value check and the value modification.
The spec_count here is speculative. This is important. I'm essentially making a guess at what the count will be. It's a pretty good guess, but it's a guess. cmpxchg will fail if my guess is wrong, at which point the routine has to loop and try again. If I guess 0, then I will either be woken up (as it ceases to be zero while I'm on the waitqueue), or I will notice it's not zero anymore in the schedule test.
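In portable C++, the speculative read followed by cmpxchg maps onto std::atomic's compare_exchange. The sketch below covers only the non-blocking half of the routine; try_down and up are hypothetical names, and the wait queue and schedule() parts are deliberately left out because they need OS support.

```cpp
#include <atomic>
#include <cassert>

// Try to take one unit from the semaphore count without blocking.
// Returns true on success, false if the count was zero (or less).
bool try_down(std::atomic<long>& count) {
    long spec_count = count.load(std::memory_order_relaxed); // a guess
    while (spec_count > 0) {
        // Succeeds only if count still equals our guess; otherwise
        // spec_count is refreshed with the current value and we retry.
        if (count.compare_exchange_weak(spec_count, spec_count - 1,
                                        std::memory_order_acquire,
                                        std::memory_order_relaxed)) {
            return true;
        }
    }
    return false; // caller would enqueue itself and block here
}

// Release one unit. A real semaphore would also wake a waiter.
void up(std::atomic<long>& count) {
    long previous = count.fetch_add(1, std::memory_order_release);
    assert(previous >= 0); // this sketch never lets the count go negative
    (void)previous;
}
```

The check and the decrement happen in one atomic step inside compare_exchange_weak, which is exactly the property the test-then-modify version in the question lacks.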
You should also note that there is no possible way to make a function that contains a blocking operation atomic. It's nonsensical. Atomic functions, by definition, appear to execute instantaneously, not interleaved with anything else whatsoever. But a blocking function, by definition, waits for something else to happen. This is inconsistent. Likewise, no atomic operation can be 'split up' across a blocking operation, which it is in your example.
Now, you could do away with a lot of this complexity by declaring the function non-preemptable. By using locks or other methods, you simply ensure only one thread is ever running (not including blocking of course) in the semaphore code at a time. But a problem still remains then. Start with a value of 0, where C has taken the semaphore down twice, then:
Thread A: if (S->Value <= 0) // Value = 0
Thread A: Block....
Thread B: if (S->Value <= 0) // Value = 0
Thread B: Block....
Thread C: S->Value++ // value = 1
Thread C: Wakeup(A)
(Thread C calls signal() again)
Thread C: S->Value++ // value = 2
Thread C: Wakeup(B)
(Thread C calls wait())
Thread C: if (S->Value <= 0) // Value = 2
Thread C: S->Value-- // Value = 1
// A and B have been woken
Thread A: S->Value-- // Value = 0
Thread B: S->Value-- // Value = -1
You could probably fix this with a loop to recheck S->value - again, assuming you are on a single-processor machine and your semaphore code is non-preemptible. Unfortunately, these assumptions are false on all desktop OSes :)
For more discussion on how real locking protocols work, you might be interested in the paper "Fuss, Futexes and Furwocks: Fast Userlevel Locking in Linux"