R-bounded waiting for the Peterson Lock - concurrency

In The Art of Multiprocessor Programming (revised 1st ed.), Chapter 2, Exercise 9 reads as follows (paraphrased):
Define r-bounded waiting for a mutex algorithm to mean that DAj ➝ DBk ⇒ CSAj ➝ CSB(k+r). Is there a way to define a doorway for the Peterson algorithm such that it provides r-bounded waiting?
The book uses ➝ to define a partial order on the precedence of events, where X ➝ Y means event X started and completed before event Y started. DA is the "doorway" event for a thread A, i.e. the event of requesting entry to the critical section. CSA is the critical-section event for thread A.
For any event XA, XAi is the i-th execution of event X on thread A.
Now, getting to the question: it seems to me that the Peterson algorithm is completely fair (0-bounded waiting). Further, I think that r-bounded waiting implies k-bounded waiting for all k > r. But then the question does not make sense, since Peterson would satisfy r-bounded waiting for every r.
Is the question asking for a "simplification" of the Peterson algorithm, since it is requesting a relaxation of constraints?
This is self-study, not homework.
The code of the Peterson lock algorithm, taken from the book:
1  class Peterson implements Lock {
2    // thread-local index, 0 or 1
3    private volatile boolean[] flag = new boolean[2];
4    private volatile int victim;
5    public void lock() {
6      int i = ThreadID.get();
7      int j = 1 - i;
8      flag[i] = true; // I’m interested
9      victim = i; // you go first
10     while (flag[j] && victim == i) {}; // wait
11   }
12   public void unlock() {
13     int i = ThreadID.get();
14     flag[i] = false; // I’m not interested
15   }
16 }

You are right: the Peterson algorithm for two threads is fair (a.k.a. first-come-first-served).
Let's (quite naturally) define the doorway section to be lines 6-9 in the code, and the waiting section to be line 10. Let's assume D0j ➝ D1k and both threads are in their corresponding waiting sections. In this case flag[0] == true, flag[1] == true, and victim == 1; therefore thread 0 may exit its waiting section while thread 1 may not. So thread 0 goes first, i.e. CS0j ➝ CS1k, and the Peterson lock has 0-bounded waiting, i.e. it is fair.
However, I think the question does make sense. It's an exercise, the first one for the section, so not very hard - but still useful to check whether the concepts are understood. The book does not state that the Peterson lock is fair; instead, it asks (perhaps in a somewhat convoluted way) to prove it as an exercise.


Is there a way to test a self-written C++ semaphore?

I read the "Little Book Of Semaphores" by Allen B. Downey and realized that I have to implement a semaphore in C++ first, as they only appear as part of the standard library in C++20.
I used the definition from the book:
A semaphore is like an integer, with three differences:
1. When you create the semaphore, you can initialize its value to any integer, but after that the only operations you are allowed to perform are increment (increase by one) and decrement (decrease by one). You cannot read the current value of the semaphore.
2. When a thread decrements the semaphore, if the result is negative, the thread blocks itself and cannot continue until another thread increments the semaphore.
3. When a thread increments the semaphore, if there are other threads waiting, one of the waiting threads gets unblocked.
I also used the answers to the question C++0x has no semaphores? How to synchronize threads?
My implementation is a bit different from those at the link, as I unlock my mutex before notifying a thread on signalling, and the definition in the book is also a bit different.
So I implemented the semaphore, and now I realize that I don't know how to properly test it, beyond the simplest cases such as serializing two calls from two threads. Is there any way to test the implementation, say with 100 threads or so, and verify there is no deadlock? What tests should I write to check the implementation? Or, if the only way to check is to read through the code attentively, could you maybe please check it?
My implementation:
// semaphore.h
#include <condition_variable>
#include <mutex>

namespace Semaphore
{
class CountingSemaphore final
{
public:
    explicit CountingSemaphore(int initialValue = 0);
    void signal();
    void wait();

private:
    std::mutex _mutex{};
    std::condition_variable _conditionVar{};
    int _value;
};
} // namespace Semaphore

// semaphore.cpp
#include "semaphore.h"

namespace Semaphore
{
CountingSemaphore::CountingSemaphore(const int initialValue) : _value(initialValue) {}

void CountingSemaphore::signal()
{
    std::unique_lock<std::mutex> lock{_mutex};
    ++_value;
    if (0 == _value)
    {
        lock.unlock();
        _conditionVar.notify_one();
    }
}

void CountingSemaphore::wait()
{
    std::unique_lock<std::mutex> lock{_mutex};
    --_value;
    if (0 > _value)
    {
        _conditionVar.wait(lock, [this]() { return (0 <= this->_value); });
    }
}
} // namespace Semaphore
This code is broken. Your current state machine uses negative numbers to indicate the number of waiters, while non-negative values indicate the remaining capacity (0 meaning no availability, but no waiters either).
Problem is, you only notify waiters when the count becomes zero. So if you had a semaphore with initial value 1 (basically a logical mutex), and five threads try to grab it, one gets it, and four others wait, at which point value is -4. When the thread that grabbed it finishes and signals, value rises to -3, but that's not 0, so notify_one is not called.
In addition, you have some redundant code. The predicate-based form of std::condition_variable's wait is equivalent to:
while (!predicate()) {
    wait(lock);
}
so your if check is redundant (wait checks the same condition before it ever actually waits, even once, anyway).
You could also condense the code a bit by doing the increment and decrement on the same line where you test them. It's not necessary, since you're using mutexes to protect the whole block rather than relying on atomics, but I like writing threaded code in a way that would be easy to port to atomics later, mostly to stay in the habit for when I write actual atomics code. Fixing the bug and condensing the code gives this end result:
void CountingSemaphore::signal()
{
    std::unique_lock<std::mutex> lock{_mutex};
    // If the count was negative, we had at least one waiter, so notify one of them.
    // Personally, I'd find if (_value++ < 0) clearer, as meaning "if _value *was*
    // less than 0, and also increment it afterwards by side effect", but I stuck
    // to something closer to your design to avoid confusion.
    if (0 >= ++_value)
    {
        lock.unlock();
        _conditionVar.notify_one();
    }
}

void CountingSemaphore::wait()
{
    std::unique_lock<std::mutex> lock{_mutex};
    --_value;
    _conditionVar.wait(lock, [this]() { return 0 <= this->_value; });
}
The alternative approach would be adopting a design that doesn't drop the count below 0, so 0 means "no resources available" and otherwise the range of values is just 0 to initialValue. This is safer in theoretical circumstances: the original design can trigger wraparound if you somehow accumulate 2 ** (8 * sizeof(int) - 1) waiters. You could minimize that risk by making value a ssize_t (so 64-bit systems would be exponentially less likely to hit the bug), or eliminate it by changing the design to stop at 0:
// _value's declaration in the header, and the argument passed to the constructor,
// may be changed to an unsigned type if you like
void CountingSemaphore::signal()
{
    // Just for fun, a refactoring to use lock_guard rather than unique_lock,
    // since the logical unlock is unconditionally after the value read/modify
    int oldvalue; // Type should match that of _value
    {
        std::lock_guard<std::mutex> lock{_mutex};
        oldvalue = _value++;
    }
    if (0 == oldvalue) _conditionVar.notify_one();
}

void CountingSemaphore::wait()
{
    std::unique_lock<std::mutex> lock{_mutex};
    // Waits only if the current count is 0; returns with the lock held, at which
    // point the count is guaranteed greater than 0
    _conditionVar.wait(lock, [this]() { return 0 != this->_value; });
    --_value; // We only decrement when it's positive, just before returning to the caller
}
This second design has a minor flaw: it calls notify_one whenever it is signalled with no available resources, but "no available resources" might mean "has waiters" or it might mean "all resources consumed, but no one waiting". This isn't actually a problem logically speaking; calling notify_one with no waiters is legal and does nothing, though it may be slightly less efficient. Your original design may be preferable on that basis (it doesn't call notify_one unless there were definitely waiters).
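As for how to test: a stress test can't prove the semaphore correct (races may simply fail to manifest), but it will usually catch bugs like the lost wakeup above. A minimal sketch, assuming the fixed implementation, C++11 threads, and arbitrary thread/iteration counts (build with assertions enabled so the assert is active):
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>
#include "semaphore.h"

int main()
{
    Semaphore::CountingSemaphore sem{1}; // capacity 1, i.e. a logical mutex
    std::atomic<int> inside{0};          // how many threads are "inside" right now

    std::vector<std::thread> threads;
    for (int t = 0; t < 100; ++t) {
        threads.emplace_back([&] {
            for (int i = 0; i < 10000; ++i) {
                sem.wait();
                int occupants = inside.fetch_add(1) + 1;
                assert(occupants == 1);  // more than 1 means mutual exclusion failed
                inside.fetch_sub(1);
                sem.signal();
            }
        });
    }
    for (auto& th : threads)
        th.join(); // a hang here indicates a lost wakeup
}
With the original signal() this test should hang quickly (waiters pile up, _value stays negative, and notify_one is never called); with the fix it should run to completion. Running such a test under a race detector like ThreadSanitizer makes it considerably more powerful.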

Synchronize n Threads with only using Semaphore and/or mutex in C++

We're studying for our test next week, and have been given an exercise from our teacher, and we just don't see the solution:
How to synchronize n threads, so that all n threads wait at a specific location and only continue with their "work" together when all n threads have reached that location?
We're allowed to use mutex and semaphore constructs. The solution should be easy, but we just can't find the answer.
Here's a big hint. You need 2 semaphores, both with N flags. You can solve this with an extra thread. The key is that you can call down() on a semaphore multiple times: if you call down() on a semaphore 8 times, you need all 8 up()s before you can continue.
// an additional thread (not one of the N)
void trigger(Semaphore* workersCollect, Semaphore* workersRelease, int n)
{
    while (true)
    {
        for (int i = 0; i < n; ++i)
            workersCollect->down();
        for (int i = 0; i < n; ++i)
            workersRelease->up();
    }
}
// Prototype for the "checkpoint" function (exercise for the reader)
void await(Semaphore* workersCollect, Semaphore* workersRelease);
You can also solve it without the extra thread, by using more complicated state checking.
This design has a drawback. If a worker finishes its work extremely quickly, it can grab more than one task (while another thread ends up not running at all). This is fine if you have a threadpool kind of design, but bad if, say, each thread is supposed to work on its own distinct section of a dataset.
To fix that, you need a semaphore per thread. Something akin to
Semaphore workerRelease[N];
but being careful to avoid false sharing. (You don't want more than 1 semaphore on a cache line.)
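For instance, a minimal sketch of the padding, assuming some Semaphore type and a 64-byte cache line (typical on x86, but target-dependent):
struct alignas(64) PaddedSemaphore {
    Semaphore sem; // each semaphore gets its own cache line, so signalling one
                   // worker doesn't invalidate the line holding another's semaphore
};

PaddedSemaphore workerRelease[N];
The alignas on the struct makes each array element start on its own cache line, so releasing one worker never touches the line another worker is spinning or sleeping on.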

How to make thread synchronization without using mutex, semaphore, spinlock and futex?

This is an interview question (the interview is already over).
How can you implement thread synchronization without using a mutex, semaphore, spinlock, or futex?
Given 5 threads, how do you make 4 of them wait at the same point for a signal from the remaining thread?
That is, when threads 1-4 reach a certain point in their thread functions, they stop and wait for a signal from thread 5; they must not proceed until thread 5 sends that signal.
My idea:
Use a global bool variable as a flag. Each of threads 1-4 waits at the synchronization point and sets its own flag variable to true; once thread 5 finds that all the threads' flag variables are true, it sets the global flag to true, releasing them.
It is a busy-wait.
Any better ideas?
Thanks
the pseudocode:
bool globalflag = false;
bool a[10] = {false};

int main()
{
    for (int i = 0; i < 10; i++)
        pthread_create(threadfunc, i);
    while (1)
    {
        bool b = true;
        for (int i = 0; i < 10; i++)
        {
            b = a[i] && b;
        }
        if (b) break;
    }
    globalflag = true; // release the waiting threads
}

void threadfunc(int i)
{
    a[i] = true;
    while (!globalflag);
}
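Note that plain bools shared like this are a data race in C++ (the writes may never become visible to the spinning readers). A minimal sketch of the same busy-wait idea using std::atomic, which makes it well-defined (C++11, same structure as the pseudocode above):
#include <atomic>
#include <thread>
#include <vector>

std::atomic<bool> globalflag{false};
std::atomic<bool> a[10]; // static storage, zero-initialized, i.e. all false

void threadfunc(int i)
{
    a[i] = true;            // announce arrival at the sync point
    while (!globalflag) { } // spin until released
}

int main()
{
    std::vector<std::thread> ts;
    for (int i = 0; i < 10; ++i)
        ts.emplace_back(threadfunc, i);
    bool all = false;
    while (!all) {          // busy-wait until every thread has arrived
        all = true;
        for (int i = 0; i < 10; ++i)
            all = all && a[i];
    }
    globalflag = true;      // release everyone
    for (auto& t : ts)
        t.join();
}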
Start with an empty linked list of waiting threads. The head should be set to 0.
Use CAS (compare and swap) to insert a thread at the head of the list of waiters. If the head == -1, do not insert or wait. You can safely use CAS to insert items at the head of a linked list if you do it right.
After being inserted, the waiting thread should wait on SIGUSR1. Use sigwait() to do this.
When ready, the signaling thread uses CAS to set the head of the wait list to -1. This prevents any more threads from adding themselves to the wait list. Then the signaling thread iterates over the threads in the wait list and calls pthread_kill(thread, SIGUSR1) to wake up each waiting thread.
If SIGUSR1 is sent before a call to sigwait, sigwait will return immediately. Thus, there will not be a race between adding a thread to the wait list and calling sigwait.
EDIT:
Why is CAS faster than a mutex? Layman's answer (I'm a layman): it's faster for some things in some situations, because it has lower overhead when there is NO race. So if you can reduce your concurrent problem down to needing to change 8-16-32-64-128 bits of contiguous memory, and a race is not going to happen very often, CAS wins. CAS is basically a slightly more fancy/expensive mov instruction right where you were going to do a regular mov anyway (on x86 it's a lock cmpxchg).
A mutex, on the other hand, is a whole bunch of extra stuff that gets other cache lines dirty and uses more memory barriers, etc. (although CAS itself acts as a memory barrier on x86, x64, etc.). Then of course you have to unlock the mutex, which is probably about the same amount of extra stuff.
Here is how you add an item to a linked list using CAS:
while (1)
{
    pOldHead = pHead;                 // snapshot of the world; start of the race
    pItem->pNext = pOldHead;
    if (CAS(&pHead, pOldHead, pItem)) // end of the race, if pHead is still pOldHead
        break;                        // success
}
So how often do you think your code is going to have multiple threads at that CAS line at the exact same time? In reality... not very often. We did tests that just looped adding millions of items with multiple threads at the same time, and a collision happened way less than 1% of the time. In a real program, it might never happen.
Obviously if there is a race you have to go back and do that loop again, but in the case of a linked list, what does that cost you?
The downside is that you can't do very complex things to that linked list if you are going to use this method to add items at the head. Try implementing a doubly linked list; what a pain.
EDIT:
In the code above I use a macro CAS. If you are using Linux, CAS = a macro using __sync_bool_compare_and_swap (see the gcc atomic builtins). If you are using Windows, CAS = a macro using something like InterlockedCompareExchange. Here is what the inline functions on Windows might look like:
inline bool CAS(volatile WORD* p, const WORD nOld, const WORD nNew) {
    return InterlockedCompareExchange16((short*)p, nNew, nOld) == nOld;
}

inline bool CAS(volatile DWORD* p, const DWORD nOld, const DWORD nNew) {
    return InterlockedCompareExchange((long*)p, nNew, nOld) == nOld;
}

inline bool CAS(volatile QWORD* p, const QWORD nOld, const QWORD nNew) {
    return InterlockedCompareExchange64((LONGLONG*)p, nNew, nOld) == nOld;
}

inline bool CAS(void* volatile* p, const void* pOld, const void* pNew) {
    return InterlockedCompareExchangePointer(p, (PVOID)pNew, (PVOID)pOld) == pOld;
}
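For comparison, a sketch of what the Linux/GCC side might look like, using the legacy __sync builtin mentioned above (newer code would use the __atomic builtins or std::atomic):
template <typename T>
inline bool CAS(volatile T* p, const T nOld, const T nNew) {
    // returns true if *p equaled nOld and was atomically replaced with nNew
    return __sync_bool_compare_and_swap(p, nOld, nNew);
}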
1. Choose a signal to use, say SIGUSR1.
2. Use pthread_sigmask to block SIGUSR1.
3. Create the threads (they inherit the signal mask, hence step 2 must be done first!).
4. Threads 1-4 call sigwait, blocking until SIGUSR1 is received.
5. Thread 5 calls kill() or pthread_kill 4 times with SIGUSR1. Since POSIX specifies that signals will be delivered to a thread which is not blocking the signal, it will be delivered to one of the threads waiting in sigwait(). There is thus no need to keep track of which threads have already received the signal and which haven't, with the associated synchronization.
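A sketch of that recipe (POSIX; error checking omitted, and pthread_join added only to keep the example self-contained):
#include <pthread.h>
#include <signal.h>

static sigset_t set;

static void* worker(void* arg)
{
    int sig;
    sigwait(&set, &sig); // blocks until SIGUSR1 is delivered to this thread
    // ... continue with the real work ...
    return 0;
}

int main(void)
{
    pthread_t tid[4];
    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    pthread_sigmask(SIG_BLOCK, &set, 0); // step 2: block *before* creating threads
    for (int i = 0; i < 4; ++i)          // step 3: threads inherit the mask
        pthread_create(&tid[i], 0, worker, 0);
    for (int i = 0; i < 4; ++i)          // step 5: thread 5's role, one signal per waiter
        pthread_kill(tid[i], SIGUSR1);
    for (int i = 0; i < 4; ++i)
        pthread_join(tid[i], 0);
    return 0;
}
If a pthread_kill arrives before its target reaches sigwait, the signal simply stays pending (it is blocked), and sigwait returns immediately when called, as noted above.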
You can do this using SSE3's MONITOR and MWAIT instructions, available via the _mm_mwait and _mm_monitor intrinsics, Intel has an article on it here.
(there is also a patent for using memory-monitor-wait for lock contention here that may be of interest).
I think you are looking for Peterson's algorithm or Dekker's algorithm.
They synchronize threads using only shared memory.

Simple custom-made mutex failing

Can you spot the error in the code? ticket ends up going below 0, causing long stalls.
struct SContext {
    volatile unsigned long* mutex;
    volatile long* ticket;
    volatile bool* done;
};

static unsigned int MyThreadFunc(SContext* ctxt) {
    // -- keep going until we signal for thread to close
    while(*ctxt->done == false) {
        while(*ctxt->ticket) { // while we have tickets waiting
            unsigned int lockedaquired = 0;
            do {
                if(*ctxt->mutex == 0) { // only try if someone doesn't have mutex locked
                    // -- if the compare and swap doesn't work then the function returns
                    // -- the value it expects
                    lockedaquired = InterlockedCompareExchange(ctxt->mutex, 1, 0);
                }
            } while(lockedaquired != 0); // loop while we didn't acquire lock
            // -- enter critical section
            // -- grab a ticket
            if(*ctxt->ticket > 0);
                (*ctxt->ticket)--;
            // -- exit critical section
            *ctxt->mutex = 0; // release lock
        }
    }
    return 0;
}
The calling function, waiting for the threads to finish:
for(unsigned int loops = 0; loops < eLoopCount; ++loops) {
    *ctxt.ticket = eNumThreads; // let the threads start!
    // -- wait for threads to finish
    while(*ctxt.ticket != 0)
        ;
}
done = true;
EDIT:
The answer to this question is simple, and unfortunately, after I spent the time trimming down the example to post a simplified version, I found the answer immediately after posting the question. Sigh...
I initialize lockedaquired to 0. Then, as an optimization to avoid taking up bus bandwidth, I don't do the CAS if the mutex is already taken.
Unfortunately, in the case where the lock is taken, the while loop lets the second thread through!
Sorry for the extra question. I thought I didn't understand Windows' low-level synchronization primitives, but really I just had a simple mistake.
I see another race in your code: One thread can cause *ctxt.ticket to hit 0, allowing the parent loop to go back and re-set *ctxt.ticket = eNumThreads without holding *ctxt.mutex. Some other thread may already now hold the mutex (in fact, it probably does) and operate on *ctxt.ticket. For your simplified example this only prevents "batches" from being cleanly separated, but if you had more complex initialization (as in more complex than a single word write) at the top of the loops loop you could see strange behavior.
I posted a bug where I thought it was a legitimate multithreaded problem, but really it was just bad logic. I solved the bug as soon as I posted it. Here are the problem line and the answer:
unsigned int lockedaquired = 0;
I initialized lockedaquired to 0, and then added an if statement to skip the expensive CAS operation. This optimization let execution fall out of the while loop and into the critical section without holding the lock. Changing the code to
unsigned int lockedaquired = 1;
fixes the problem. There is another hidden problem in the code that I found as well (I really shouldn't code late at night anymore). Anyone notice the semicolon after the if statement in the critical section? Sigh...
if(*ctxt->ticket > 0);
    (*ctxt->ticket)--;
That should be
if(*ctxt->ticket > 0)
    (*ctxt->ticket)--;
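Putting both fixes together, the inner loop would look something like this (same names as the original; still sample code, not a production mutex):
while (*ctxt->ticket) { // while we have tickets waiting
    unsigned int lockedaquired = 1; // "not acquired" until the CAS says otherwise
    do {
        if (*ctxt->mutex == 0) // only attempt the CAS when the mutex looks free
            lockedaquired = InterlockedCompareExchange(ctxt->mutex, 1, 0);
    } while (lockedaquired != 0); // 0 means the CAS saw 0 and took the lock

    if (*ctxt->ticket > 0) // no stray semicolon this time
        (*ctxt->ticket)--;

    *ctxt->mutex = 0; // release lock
}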
Also, Ben Jackson pointed out that a thread will probably still be inside the critical section when we reset the ticket to eNumThreads. While this is perfectly fine in this sample code, if you apply the pattern to a problem that needs to do more operations, it might not be safe, because the threads aren't running in lockstep; keep that in mind if you apply this to your own code.
A final note: if anyone does decide to use this code for their own mutex implementation, please remember that the main driver thread is spinning idle. If you are doing a large operation in the critical section that takes a good deal of time and your ticket count is high, consider yielding your thread to let other software make use of the CPU while it's waiting. Also, consider using a spin lock if the critical section is large.
Thank you

Can I switch the test and modification part in wait/signal semaphore?

The classic non-busy-waiting version of the wait() and signal() semaphore operations is implemented as below. In this version, value can be negative.
// primitive
wait(semaphore* S)
{
    S->value--;
    if (S->value < 0)
    {
        add this process to S->list;
        block();
    }
}

// primitive
signal(semaphore* S)
{
    S->value++;
    if (S->value <= 0)
    {
        remove a process P from S->list;
        wakeup(P);
    }
}
Question: is the following version also correct? Here I test first and modify the value afterwards. It would be great if you could show me a scenario where it doesn't work.
// primitive wait().
// If (S->value > 0), the whole function is atomic;
// otherwise, only the if(){} section is atomic
wait(semaphore* S)
{
    if (S->value <= 0)
    {
        add this process to S->list;
        block();
    }
    // here I decrement the value after the previous test and possible blocking
    S->value--;
}

// similar to wait()
signal(semaphore* S)
{
    if (S->list is not empty)
    {
        remove a process P from S->list;
        wakeup(P);
    }
    // here I increment the value after the previous test and possible waking up
    S->value++;
}
Edit:
My motivation is to figure out whether I can use this latter version to achieve mutual exclusion with no deadlock and no starvation.
Your modified version introduces a race condition:
Thread A: if (S->value <= 0) // value == 1, so don't block
Thread B: if (S->value <= 0) // value == 1, so don't block
Thread A: S->value--;        // value == 0
Thread B: S->value--;        // value == -1
Both threads have acquired a count=1 semaphore. Oops. Note that there's another problem even if they're non-preemptible (see below), but for completeness, here's a discussion on atomicity and how real locking protocols work.
When working with protocols like this, it's very important to nail down exactly what atomic primitives you are using. Atomic primitives are such that they seem to execute instantaneously, without being interleaved with any other operations. You cannot just take a big function and call it atomic; you have to make it atomic somehow, using other atomic primitives.
Most CPUs offer a primitive called 'atomic compare and exchange'. I'll abbreviate it cmpxchg from here on. The semantics are like so:
bool cmpxchg(long *ptr, long old, long new) {
    if (*ptr == old) {
        *ptr = new;
        return true;
    } else {
        return false;
    }
}
cmpxchg is not implemented with this code. It is in the CPU hardware, but behaves a bit like this, only atomically.
Now, let's add to this some additional helpful functions (built out of other primitives):
add_waitqueue(waitqueue) - Sets our process state to sleeping and adds us to a wait queue, but continues executing (ATOMIC)
schedule() - Switch threads. If we're in a sleeping state, we don't run again until awakened (BLOCKING)
remove_waitqueue(waitqueue) - removes our process from a wait queue, then sets our state to awakened if it isn't already (ATOMIC)
memory_barrier() - ensures that any reads/writes logically before this point actually are performed before this point, avoiding nasty memory ordering issues (we'll assume all other atomic primitives come with a free memory barrier, although this isn't always true) (CPU/COMPILER PRIMITIVE)
Here's how a typical semaphore acquisition routine will look. It's a bit more complex than your example, because I've explicitly nailed down what atomic operations I'm using:
void sem_down(sem *pSem)
{
    while (1) {
        long spec_count = pSem->count;
        read_memory_barrier(); // make sure spec_count doesn't start changing on us! pSem->count may keep changing though
        if (spec_count > 0)
        {
            if (cmpxchg(&pSem->count, spec_count, spec_count - 1)) // ATOMIC
                return; // got the semaphore without blocking
            else
                continue; // count is stale, try again
        } else { // semaphore count is zero
            add_waitqueue(pSem->wqueue); // ATOMIC
            // recheck the semaphore count, now that we're in the waitqueue - it may have changed
            if (pSem->count == 0) schedule(); // NOT ATOMIC
            remove_waitqueue(pSem->wqueue); // ATOMIC
            // loop around again to try to acquire the semaphore
        }
    }
}
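For symmetry, a sketch of what the matching sem_up could look like with the same primitives; note that wake_waitqueue here is a hypothetical atomic helper (wake one sleeper, if any), not one of the primitives defined above:
void sem_up(sem *pSem)
{
    while (1) {
        long spec_count = pSem->count; // speculative read, as in sem_down
        if (cmpxchg(&pSem->count, spec_count, spec_count + 1)) // ATOMIC
            break; // new count is published
        // the count was stale; try again
    }
    wake_waitqueue(pSem->wqueue); // HYPOTHETICAL: wake one sleeper, if any
}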
You'll note that the actual test for a non-zero pSem->count, in a real-world semaphore_down function, is accomplished by cmpxchg. You can't trust any other read; the value can change an instant after you read it. We simply can't separate the value check from the value modification.
The spec_count here is speculative. This is important. I'm essentially making a guess at what the count will be. It's a pretty good guess, but it's a guess. cmpxchg will fail if my guess is wrong, at which point the routine has to loop and try again. If I guess 0, then I will either be woken up (as it ceases to be zero while I'm on the waitqueue), or I will notice it's not zero anymore in the schedule test.
You should also note that there is no possible way to make a function that contains a blocking operation atomic. It's nonsensical. Atomic functions, by definition, appear to execute instantaneously, not interleaved with anything else whatsoever. But a blocking function, by definition, waits for something else to happen. This is inconsistent. Likewise, no atomic operation can be 'split up' across a blocking operation, which it is in your example.
Now, you could do away with a lot of this complexity by declaring the function non-preemptable. By using locks or other methods, you simply ensure only one thread is ever running (not including blocking of course) in the semaphore code at a time. But a problem still remains then. Start with a value of 0, where C has taken the semaphore down twice, then:
Thread A: if (S->value <= 0) // value == 0, so block
Thread A: block...
Thread B: if (S->value <= 0) // value == 0, so block
Thread B: block...
Thread C: S->value++         // value == 1
Thread C: wakeup(A)
(Thread C calls signal() again)
Thread C: S->value++         // value == 2
Thread C: wakeup(B)
(Thread C calls wait())
Thread C: if (S->value <= 0) // value == 2, so don't block
Thread C: S->value--         // value == 1
// A and B have been woken
Thread A: S->value--         // value == 0
Thread B: S->value--         // value == -1
You could probably fix this with a loop to recheck S->value - but again, this assumes you are on a single-processor machine and your semaphore code is non-preemptable. Unfortunately, these assumptions are false on all desktop OSes :)
For more discussion on how real locking protocols work, you might be interested in the paper "Fuss, Futexes and Furwocks: Fast Userlevel Locking in Linux"