Shared vector Synchronization with an atomic boolean - c++

I am having a shared vector which gets accessed by two threads.
A function from thread A pushs into the vector and a function from thread B swaps the vector completely for processing.
MovetoVec(PInfo* pInfo)
{
while(1)
{
if(GetSwitch())
{
swapBucket->push_back(pInfo);
toggles = true;
break;
}
else if(pInfo->tryMove == 5)
{
delete pInfo;
break;
}
pInfo->tryMove++;
Sleep(25);
}
}
The thread A tries to get atomic boolean toggles to true and pushes into vector.(the above MoveToVec function will be called by many number of threads). The function GetSwitch is defined as
GetSwitch()
{
if(toggles)
{
toggles = false;
return TRUE;
}
else
return FALSE;
}
toggles here is atomic_bool.And the another function from thread B that swaps the vector is
GetClassObj(vector<ProfiledInfo*>* toSwaps)
{
if(GetSwitch())
{
toSwaps->swap(*swapBucket);
toggles = true;
}
}
If GetSwitch returns false then the threadB does nothing. Here i dint use any locking. It works in most of the cases. But some time one of the pInfo objects in swapBucket is NULL. I got to know it is because of poor synchronization.
I followed this type of GetSwitch() logic just to neglect the overhead caused by locking. Should i drop this out and go back to mutex or critical section stuffs?

Your GetSwitch implementation is wrong. It is possible for multiple threads to acquire the switch simultaneously.
An example of such a scenario with just two threads:
Thread 1 | Thread 2
--------------------------|--------------------------
if (toggles) |
| if (toggles)
toggles = false; |
| toggles = false;
The if-test and assignment are not an atomic operation and therefore cannot be used to synchronize threads on their own.
If you want to use an atomic boolean as a means of synchronization, you need to compare and exchange the value in one atomic operation. Luckily, C++ provides such an operation called std::compare_exchange, which is available in a weak and strong flavor (the weak one may spuriously fail but is cheaper when called in a loop).
Using this operation, your GetSwitch method would become:
bool GetSwitch()
{
bool expected = true; // The value we expect 'toggles' to have
bool desired = false; // The value we want 'toggles' to get
// Check if 'toggles' is as expected, and if it is, update it to the desired value
bool result = toggles.compare_exchange_strong(&expected, desired);
// The result of the compare_exchange is true if the value was updated and false if it was not
return result;
}
This will ensure that comparing and updating the value happens atomically.
Note that the C++ standard does not guarantee an atomic boolean to be lock-free. In your case, you could also use std::atomic_flag which is guaranteed to be lock-free by the standard! Carefully read the example though, it works a tad bit different than atomic variables.
Writing lock-free code, as you are attempting to do, is quite complex and error-prone.
My advice would be to write the code with locks first and ensure it is 100% correct. Mutexes are actually surprisingly fast, so performance should be okay in most cases. A good read on lock performance: http://preshing.com/20111118/locks-arent-slow-lock-contention-is
Only once you have profiled your code, and convinced yourself that the locks are impacting performance, you should attempt to write the code lock-free. Then profile again because lock-free code is not necessarily faster.

Related

Why can mutex be used in different threads?

Using (writing) same variable in multiple threads simultaneously causes undefined behavior and crashes.
Why using mutex, despite on fact that they are also variables, not causes undefined behavior?
If mutex somehow can be used simultaneously, why not make all variables work simultaneously without locking?
All my research is pressing Show definition on mutex::lock in Visual Studio, where I get at the end _Mtx_lock function without realization, and then I found it’s realization (Windows), though it has some functions also without realization:
int _Mtx_lock(_Mtx_t mtx)
{ /* lock mutex */
return (mtx_do_lock(mtx, 0));
}
static int mtx_do_lock(_Mtx_t mtx, const xtime *target)
{ /* lock mutex */
if ((mtx->type & ~_Mtx_recursive) == _Mtx_plain)
{ /* set the lock */
if (mtx->thread_id != static_cast<long>(GetCurrentThreadId()))
{ /* not current thread, do lock */
mtx->_get_cs()->lock();
mtx->thread_id = static_cast<long>(GetCurrentThreadId());
}
++mtx->count;
return (_Thrd_success);
}
else
{ /* handle timed or recursive mutex */
int res = WAIT_TIMEOUT;
if (target == 0)
{ /* no target --> plain wait (i.e. infinite timeout) */
if (mtx->thread_id != static_cast<long>(GetCurrentThreadId()))
mtx->_get_cs()->lock();
res = WAIT_OBJECT_0;
}
else if (target->sec < 0 || target->sec == 0 && target->nsec <= 0)
{ /* target time <= 0 --> plain trylock or timed wait for */
/* time that has passed; try to lock with 0 timeout */
if (mtx->thread_id != static_cast<long>(GetCurrentThreadId()))
{ /* not this thread, lock it */
if (mtx->_get_cs()->try_lock())
res = WAIT_OBJECT_0;
else
res = WAIT_TIMEOUT;
}
else
res = WAIT_OBJECT_0;
}
else
{ /* check timeout */
xtime now;
xtime_get(&now, TIME_UTC);
while (now.sec < target->sec
|| now.sec == target->sec && now.nsec < target->nsec)
{ /* time has not expired */
if (mtx->thread_id == static_cast<long>(GetCurrentThreadId())
|| mtx->_get_cs()->try_lock_for(
_Xtime_diff_to_millis2(target, &now)))
{ /* stop waiting */
res = WAIT_OBJECT_0;
break;
}
else
res = WAIT_TIMEOUT;
xtime_get(&now, TIME_UTC);
}
}
if (res != WAIT_OBJECT_0 && res != WAIT_ABANDONED)
;
else if (1 < ++mtx->count)
{ /* check count */
if ((mtx->type & _Mtx_recursive) != _Mtx_recursive)
{ /* not recursive, fixup count */
--mtx->count;
res = WAIT_TIMEOUT;
}
}
else
mtx->thread_id = static_cast<long>(GetCurrentThreadId());
switch (res)
{
case WAIT_OBJECT_0:
case WAIT_ABANDONED:
return (_Thrd_success);
case WAIT_TIMEOUT:
if (target == 0 || (target->sec == 0 && target->nsec == 0))
return (_Thrd_busy);
else
return (_Thrd_timedout);
default:
return (_Thrd_error);
}
}
}
So, according to this code, and the atomic_ keywords I think mutex can be written the next way:
atomic_bool state = false;
void lock()
{
if(!state)
state = true;
else
while(state){}
}
void unlock()
{
state = false;
}
bool try_lock()
{
if(!state)
state = true;
else
return false;
return true;
}
As you have found, std::mutex is thread-safe because it uses atomic operations. It can be reproduced with std::atomic_bool. Using atomic variables from multiple thread is not undefined behavior, because that is the purpose of those variables.
From C++ standard (emphasis mine):
The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.
Atomic variables are implemented using atomic operations of the CPU. This is not implemented for non-atomic variables, because those operations take longer time to execute and would be useless if the variables are only used in one thread.
Your example is not thread-safe:
void lock()
{
if(!state)
state = true;
else
while(state){}
}
If two threads are checking if(!state) simultaneously, it is possible that both enter the if section, and both threads believe they have the ownership:
Thread 1 Thread 2
if (!state)
if (!state)
state=true;
state=true;
You must use an atomic exchange function to ensure that the another thread cannot come in between checking the value and changing it.
void lock()
{
bool expected;
do {
expected = false;
} while (!state.compare_exchange_weak(expected, true));
}
You can also add a counter and give time for other threads to execute if the wait takes a long time:
void lock()
{
bool expected;
size_t counter = 0;
do {
expected = false;
if (counter > 100) {
Sleep(10);
}
else if (counter > 20) {
Sleep(5);
}
else if (counter > 3) {
Sleep(1);
}
counter++;
} while (!state.compare_exchange_weak(expected, true));
}
Using (writing) same variable in multiple threads simultaneously causes undefined behavior and crashes. Why using mutex, despite on fact that they are also variables, not causes undefined behavior?
It is only undefined behaviour for regular variables, and only if there is no synchronisation. std::mutex is defined to be thread safe. It's entire point is to provide synchronisation to other objects.
From [intro.races]:
The library defines a number of atomic operations ([atomics]) and operations on mutexes ([thread]) that are specially identified as synchronization operations. These operations play a special role in making assignments in one thread visible to another.
...
Note: For example, a call that acquires a mutex will perform an acquire operation on the locations comprising the mutex. Correspondingly, a call that releases the same mutex will perform a release operation on those same locations.
Certain library calls synchronize with other library calls performed by another thread.
The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, except for the special case for signal handlers described below. Any such data race results in undefined behavior.
(Emphasis added)
Why mutex can be used in different threads?
How could it possibly be useful if it couldn't be used in different threads? Synchronizing multiple threads with a shared mutex is the only reason for it to exist at all.
Using (writing) same variable in multiple threads simultaneously
It's a bad idea to only worry about things happening "simultaneously". The problem is generally things happening with undetermined ordering, ie, unpredictably.
There are lots of multi-threading bugs that seem impossible if you believe things have to be simultaneous to go wrong.
causes undefined behavior and crashes.
Undefined Behaviour is not required to cause a crash. If it had to crash, crashing would be behaviour which was ... defined. There are endless questions on here from people who don't understand this, asking why their "undefined behaviour test" didn't crash.
Why using mutex, despite on fact that they are also variables, not causes undefined behavior?
Because of the way they're used. Mutexes are not simply assigned to or read from, like simple variables, but are manipulated with specialized code designed specifically to do this correctly.
You can almost write your own mutex - a spinlock, anyway - just by using std::atomic<int> and a lot of care.
The difference is that a mutex also interacts with your operating system scheduler, and that interface is not portably exposed as part of the language standard. The std::mutex class bundles the OS-specific part up in a class with correct semantics, so you can write portable C++ instead of being limited to, say, POSIX-compatible C++ or Windows-compatible C++.
In your exploration of the VS std::mutex implementation, you ignored the mtx->_get_cs()->lock() part: this is using the Windows Critical Section to interact with the scheduler.
Your implementation is an attempt at a spinlock, which is fine so long as you know the lock is never held for long (or if you don't mind sacrificing a core for each blocked thread). By contrast, the mutex allows a waiting thread to be de-scheduled until the lock is released. This is the part handled by the Critical Section.
You also ignored all the timeout and recursive locking code - which is fine, but your implementation isn't really attempting to do the same thing as the original.
A mutex is specifically made to synchronize code on different threads that work on the same resource. It is designed not have issues when used on multi-threaded. Be aware that when using different mutexes together, you can still deadlocks when taking them in different order on different threads. C++'s std::unique_lock solves this.
A variable is meant to use on either single threads or on synchronized threads because that's how they can be accessed in the fastest way. This has to do with computer architecture (registers, cache, operations in several steps).
To work with variables on non-synchronized threads, you can work with std::atomic variables. That can be faster than synchronizing, but the access is slower and more cumbersome than for normal variables. Some more complex situations with several variables can only be handled in a synchronous way.

When would getters and setters with mutex be thread safe?

Consider the following class:
class testThreads
{
private:
int var; // variable to be modified
std::mutex mtx; // mutex
public:
void set_var(int arg) // setter
{
std::lock_guard<std::mutex> lk(mtx);
var = arg;
}
int get_var() // getter
{
std::lock_guard<std::mutex> lk(mtx);
return var;
}
void hundred_adder()
{
for(int i = 0; i < 100; i++)
{
int got = get_var();
set_var(got + 1);
sleep(0.1);
}
}
};
When I create two threads in main(), each with a thread function of hundred_adder modifying the same variable var, the end result of the var is always different i.e. not 200 but some other number.
Conceptually speaking, why is this use of mutex with getter and setter functions not thread-safe? Do the lock-guards fail to prevent the race-condition to var? And what would be an alternative solution?
Thread a: get 0
Thread b: get 0
Thread a: set 1
Thread b: set 1
Lo and behold, var is 1 even though it should've been 2.
It should be obvious that you need to lock the whole operation:
for(int i = 0; i < 100; i++){
std::lock_guard<std::mutex> lk(mtx);
var += 1;
}
Alternatively, you could make the variable atomic (even a relaxed one could do in your case).
int got = get_var();
set_var(got + 1);
Your get_var() and set_var() themselves are thread safe. But this combined sequence of get_var() followed by set_var() is not. There is no mutex that protects this entire sequence.
You have multiple concurrent threads executing this. You have multiple threads calling get_var(). After the first one finishes it and unlocks the mutex, another thread can lock the mutex immediately and obtain the same value for got that the first thread did. There's absolutely nothing that prevents multiple threads from locking and obtaining the same got, concurrently.
Then both threads will call set_var(), updating the mutex-protected int to the same value.
That's just one possibility that can happen here. You could easily have multiple threads acquiring the mutex sequentially and thus incrementing var by several values, only to be followed by some other, stalled thread, that called get_var() several seconds ago, and only now getting around to calling set_var(), thus resetting var to a much smaller value.
The code show in thread-safe in a sense that it will never set or get partial value of the variable.
But your usage of the methods does not guarantee that value will correctly change: reading and writing from multiple threads can collide with each other. Both threads read the value (11), both increment it (to 12) and than both set to the same (12) - now you counted 2 but effectively incremented only once.
Option to fix:
provide "safe increment" operation
provide equivalent of InterlockedCompareExchange to make sure value you are updating correspond to original one and retry as necessary
wrap calling code into separate mutex or use other synchronization mechanism to prevent operations to intermix.
Why don't you just use std::atomic for the shared data (var in this case)? That will be more safe efficient.
This is an absolute classic.
One thread obtains the value of var, releases the mutex and another obtains the same value before the first thread has chance to update it.
Consequently the process risks losing increments.
There are three obvious solutions:
void testThreads::inc_var(){
std::lock_guard<std::mutex> lk(mtx);
++var;
}
That's safe because the mutex is held until the variable is updated.
Next up:
bool testThreads::compare_and_inc_var(int val){
std::lock_guard<std::mutex> lk(mtx);
if(var!=val) return false;
++var;
return true;
}
Then write code like:
int val;
do{
val=get_var();
}while(!compare_and_inc_var(val));
This works because the loop repeats until it confirms it's updating the value it read. This could result in live-lock though in this case it has to be transient because a thread can only fail to make progress because another does.
Finally replace int var with std::atomic<int> var and either use ++var or var.compare_exchange(val,val+1) or var.fetch_add(1); to update it.
NB: Notice compare_exchange(var,var+1) is invalid...
++ is guaranteed to be atomic on std::atomic<> types but despite 'looking' like a single operation in general no such guarantee exists for int.
std::atomic<> also provides appropriate memory barriers (and ways to hint what kind of barrier is needed) to ensure proper inter-thread communication.
std::atomic<> should be a wait-free, lock-free implementation where available. Check your documentation and the flag is_lock_free().

Effective placement of lock_guard - from item 16 from Effective Modern C++

In item 16: "Make const member functions thread safe" there is a code as follows:
class Widget {
public:
int magicValue() const
{
std::lock_guard<std::mutex> guard(m); // lock m
if (cacheValid) return cachedValue;
else {
auto val1 = expensiveComputation1();
auto val2 = expensiveComputation2();
cachedValue = val1 + val2;
cacheValid = true;
return cachedValue;
}
} // unlock m
private:
mutable std::mutex m;
mutable int cachedValue; // no longer atomic
mutable bool cacheValid{ false }; // no longer atomic
};
I wonder why std::lock_guard should be executed always on each magicValue() call, wouldnt following work as expected?:
class Widget {
public:
int magicValue() const
{
if (cacheValid) return cachedValue;
else {
std::lock_guard<std::mutex> guard(m); // lock m
if (cacheValid) return cachedValue;
auto val1 = expensiveComputation1();
auto val2 = expensiveComputation2();
cachedValue = val1 + val2;
cacheValid = true;
return cachedValue;
}
} // unlock m
private:
mutable std::atomic<bool> cacheValid{false};
mutable std::mutex m;
mutable int cachedValue; // no longer atomic
};
This way fewer mutex locks would be required, making code more efficient. I assume here that atomica are always faster than mutexes.
[edit]
For completness I measured efficiency of both apraches, and the second looks like is only 6% faster.: http://coliru.stacked-crooked.com/a/e8ce9c3cfd3a4019
Your second code snippet shows a perfectly valid implementation of the Double Checked Locking Pattern (DCLP) and is (probably) more efficient that Meyers' solution since it avoids locking the mutex unnecessarily after cachedValue is set.
It is guaranteed that the expensive computations are not performed more than once.
Also, it is important that the cacheValid flag is atomic because it creates a happens-before relationship between writing-to and reading-from cachedValue.
In other words, it synchronizes cachedValue (which is accessed outside of the mutex) with other threads calling magicValue().
Had cacheValid been a regular 'bool', you would have had a data race on both cacheValid and cachedValue (causing undefined behavior per the C++11 standard).
Using default sequential consistent memory ordering on the cacheValid memory operations is fine, since it implies acquire/release semantics.
In theory, you could optimize by using weaker memory orderings on the atomic loads and store:
int Widget::magicValue() const
{
if (cacheValid.load(std::memory_order_acquire)) return cachedValue;
else {
std::lock_guard<std::mutex> guard(m); // lock m
if (cacheValid.load(std::memory_order_relaxed)) return cachedValue;
auto val1 = expensiveComputation1();
auto val2 = expensiveComputation2();
cachedValue = val1 + val2;
cacheValid.store(true, std::memory_order_release);
return cachedValue;
}
}
Note that this is only a minor optimization since reading an atomic is a regular load on many platforms (making it as efficient as reading from a non-atomic).
As pointed out by Nir Friedman, this only works one way; you cannot invalidate cacheValid and restart calculations. But that was not part of Meyers' example.
I actually think that your snippet is correct in isolation, but it relies on an assumption that is usually not true in a real world example: it assumes that the cacheValid goes from false to true, but can never make the reverse progression, that is become invalidated.
In the old code, the mutex protects all reads and writes on cachedValue. In your new code, there's actually a read access of cachedValue outside the mutex. That means that it is possible for one thread to read this value, while another thread is writing it. The catch is that the reading outside the mutex will only occur if cacheValid is true. But if cacheValid is true, no writing will occur; cacheValid can only become true after all writing is complete (note that this is enforced, because the assignment operator on cacheValid will use the strictest memory ordering guarantee, so it cannot be reordered with the earlier instructions in the block).
But suppose some other piece of code is written, that can invalidate the cache: Widget::invalidateCache(). This piece of code does nothing but set cacheValid to false again. In the old code, if you called invalidateCache and magicValue repeatedly from different threads, the latter function might recalculate the value or not at any given point. But even if your complex calculations are returning different values each time they are called (because they use global state, say), you will always get either the old or new value, and nothing else. But now consider the following execution order in your code:
Thread 1 calls magicValue, and checks the value of cacheValid. It's true. It gets interrupted before it can continue.
Thread 2 calls invalidateCache, and then immediately calls magicValue. magicValue sees that the cache is invalid, acquires the mutex, and starts computing, and begins writing to cacheValid.
Thread 1 interrupts, reading a partially written cacheValid.
I actually don't think this example works on most modern computers, because int will typically be 32 bits, and typically 32 bit writes and reads will be atomic. So it's not really possible to intersperse or "tear" the value of cachedValue. But on different architectures, or if you use a type other than integer (anything over 64 bits, for example), writing or reading is not guaranteed to be atomic. So you can get, as a return for magicValue, something which is neither the old value nor the new value but some weird bitwise hybrid that is not even a valid object.
So, good on you for finding this. I guess that in trying to boil down the example for simplicity, the author forgot that it was no longer necessary to be strict about putting the mutex on the outside.

Atomic thread counter

I'm experimenting with the C++11 atomic primitives to implement an atomic "thread counter" of sorts. Basically, I have a single critical section of code. Within this code block, any thread is free to READ from memory. However, sometimes, I want to do a reset or clear operation, which resets all shared memory to a default initialized value.
This seems like a great opportunity to use a read-write lock. C++11 doesn't include read-write mutexes out of the box, but maybe something simpler will do. I thought this problem would be a great opportunity to become more familiar with C++11 atomic primitives.
So I thought through this problem for a while, and it seems to me that all I have to do is :
Whenever a thread enters the critical section, increment an
atomic counter variable
Whenever a thread leaves the critical section, decrement the
atomic counter variable
If a thread wishes to reset all
variables to default values, it must atomically wait for the counter
to be 0, then atomically set it to some special "clearing flag"
value, perform the clear, then reset the counter to 0.
Of course,
threads wishing to increment and decrement the counter must also check for the
clearing flag.
So, the algorithm I just described can be implemented with three functions. The first function, increment_thread_counter() must ALWAYS be called before entering the critical section. The second function, decrement_thread_counter(), must ALWAYS be called right before leaving the critical section. Finally, the function clear() can be called from outside the critical section only iff the thread counter == 0.
This is what I came up with:
Given:
A thread counter variable, std::atomic<std::size_t> thread_counter
A constant clearing_flag set to std::numeric_limits<std::size_t>::max()
...
void increment_thread_counter()
{
std::size_t expected = 0;
while (!std::atomic_compare_exchange_strong(&thread_counter, &expected, 1))
{
if (expected != clearing_flag)
{
thread_counter.fetch_add(1);
break;
}
expected = 0;
}
}
void decrement_thread_counter()
{
thread_counter.fetch_sub(1);
}
void clear()
{
std::size_t expected = 0;
while (!thread_counter.compare_exchange_strong(expected, clearing_flag)) expected = 0;
/* PERFORM WRITES WHICH WRITE TO ALL SHARED VARIABLES */
thread_counter.store(0);
}
As far as I can reason, this should be thread-safe. Note that the decrement_thread_counter function shouldn't require ANY synchronization logic, because it is a given that increment() is always called before decrement(). So, when we get to decrement(), thread_counter can never equal 0 or clearing_flag.
Regardless, since THREADING IS HARD™, and I'm not an expert at lockless algorithms, I'm not entirely sure this algorithm is race-condition free.
Question: Is this code thread safe? Are any race conditions possible here?
You have a race condition; bad things happen if another thread changes the counter between increment_thread_counter()'s test for clearing_flag and the fetch_add.
I think this classic CAS loop should work better:
void increment_thread_counter()
{
std::size_t expected = 0;
std::size_t updated;
do {
if (expected == clearing_flag) { // don't want to succeed while clearing,
expected = 0; //take a chance that clearing completes before CMPEXC
}
updated = expected + 1;
// if (updated == clearing_flag) TOO MANY READERS!
} while (!std::atomic_compare_exchange_weak(&thread_counter, &expected, updated));
}

Can I switch the test and modification part in wait/signal semaphore?

The classic none-busy-waiting version of wait() and signal() semaphore are implemented as below. In this verson, value can be negative.
//primitive
wait(semaphore* S)
{
S->value--;
if (S->value < 0)
{
add this process to S->list;
block();
}
}
//primitive
signal(semaphore* S)
{
S->value++;
if (S->value <= 0)
{
remove a process P from S->list;
wakeup(P);
}
}
Question: Is the following version also correct? Here I test first and modify the value. It's great if you can show me a scenario where it doesn't work.
//primitive wait().
//If (S->value > 0), the whole function is atomic
//otherise, only if(){} section is atomic
wait(semaphore* S)
{
if (S->value <= 0)
{
add this process to S->list;
block();
}
// here I decrement the value after the previous test and possible blocking
S->value--;
}
//similar to wait()
signal(semaphore* S)
{
if (S->list is not empty)
{
remove a process P from S->list;
wakeup(P);
}
// here I increment the value after the previous test and possible waking up
S->value++;
}
Edit:
My motivation is to figure out whether I can use this latter version to achieve mutual exclusion, and no deadlock, no starvation.
Your modified version introduces a race condition:
Thread A: if(S->Value < 0) // Value = 1
Thread B: if(S->Value < 0) // Value = 1
Thread A: S->Value--; // Value = 0
Thread B: S->Value--; // Value = -1
Both threads have acquired a count=1 semaphore. Oops. Note that there's another problem even if they're non-preemptible (see below), but for completeness, here's a discussion on atomicity and how real locking protocols work.
When working with protocols like this, it's very important to nail down exactly what atomic primitives you are using. Atomic primitives are such that they seem to execute instantaneously, without being interleaved with any other operations. You cannot just take a big function and call it atomic; you have to make it atomic somehow, using other atomic primitives.
Most CPUs offer a primitive called 'atomic compare and exchange'. I'll abbreviate it cmpxchg from here on. The semantics are like so:
bool cmpxchg(long *ptr, long old, long new) {
if (*ptr == old) {
*ptr = new;
return true;
} else {
return false;
}
}
cmpxchg is not implemented with this code. It is in the CPU hardware, but behaves a bit like this, only atomically.
Now, let's add to this some additional helpful functions (built out of other primitives):
add_waitqueue(waitqueue) - Sets our process state to sleeping and adds us to a wait queue, but continues executing (ATOMIC)
schedule() - Switch threads. If we're in a sleeping state, we don't run again until awakened (BLOCKING)
remove_waitqueue(waitqueue) - removes our process from a wait queue, then sets our state to awakened if it isn't already (ATOMIC)
memory_barrier() - ensures that any reads/writes logically before this point actually are performed before this point, avoiding nasty memory ordering issues (we'll assume all other atomic primitives come with a free memory barrier, although this isn't always true) (CPU/COMPILER PRIMITIVE)
Here's how a typical semaphore acquisition routine will look. It's a bit more complex than your example, because I've explicitly nailed down what atomic operations I'm using:
void sem_down(sem *pSem)
{
while (1) {
long spec_count = pSem->count;
read_memory_barrier(); // make sure spec_count doesn't start changing on us! pSem->count may keep changing though
if (spec_count > 0)
{
if (cmpxchg(&pSem->count, spec_count, spec_count - 1)) // ATOMIC
return; // got the semaphore without blocking
else
continue; // count is stale, try again
} else { // semaphore count is zero
add_waitqueue(pSem->wqueue); // ATOMIC
// recheck the semaphore count, now that we're in the waitqueue - it may have changed
if (pSem->count == 0) schedule(); // NOT ATOMIC
remove_waitqueue(pSem->wqueue); // ATOMIC
// loop around again to try to acquire the semaphore
}
}
}
You'll note that the actual test for a non-zero pSem->count, in a real-world semaphore_down function, is accomplished by cmpxchg. You can't trust any other read; the value can change an instant after you read the value. We simply can't separate the value check and the value modification.
The spec_count here is speculative. This is important. I'm essentially making a guess at what the count will be. It's a pretty good guess, but it's a guess. cmpxchg will fail if my guess is wrong, at which point the routine has to loop and try again. If I guess 0, then I will either be woken up (as it ceases to be zero while I'm on the waitqueue), or I will notice it's not zero anymore in the schedule test.
You should also note that there is no possible way to make a function that contains a blocking operation atomic. It's nonsensical. Atomic functions, by definition, appear to execute instantaneously, not interleaved with anything else whatsoever. But a blocking function, by definition, waits for something else to happen. This is inconsistent. Likewise, no atomic operation can be 'split up' across a blocking operation, which it is in your example.
Now, you could do away with a lot of this complexity by declaring the function non-preemptable. By using locks or other methods, you simply ensure only one thread is ever running (not including blocking of course) in the semaphore code at a time. But a problem still remains then. Start with a value of 0, where C has taken the semaphore down twice, then:
Thread A: if (S->Value < 0) // Value = 0
Thread A: Block....
Thread B: if (S->Value < 0) // Value = 0
Thread B: Block....
Thread C: S->Value++ // value = 1
Thread C: Wakeup(A)
(Thread C calls signal() again)
Thread C: S->Value++ // value = 2
Thread C: Wakeup(B)
(Thread C calls wait())
Thread C: if (S->Value <= 0) // Value = 2
Thread C: S->Value-- // Value = 1
// A and B have been woken
Thread A: S->Value-- // Value = 0
Thread B: S->Value-- // Value = -1
You could probably fix this with a loop to recheck S->value - again, assuming you are on a single processor machine and your semaphore code is preemptable. Unfortunately, these assumptions are false on all desktop OSes :)
For more discussion on how real locking protocols work, you might be interested in the paper "Fuss, Futexes and Furwocks: Fast Userlevel Locking in Linux"