Linux 3.0: futex-lock deadlock bug? - c++

// SubFetch(x,y) = atomically x-=y and return x (__sync_sub_and_fetch)
// AddFetch(x,y) = atomically x+=y and return x (__sync_add_and_fetch)
// CompareWait(x, y) = futex(&x, FUTEX_WAIT, y) wait on x if x == y
// Wake(x, y) = futex(&x, FUTEX_WAKE, y) wake up y waiters
struct Lock
{
Lock() : x(1) {}
void lock()
{
while (true)
{
if (SubFetch(x, 1) == 0)
return;
x = -1;
CompareWait(x, -1);
}
}
void unlock()
{
if (AddFetch(x, 1) == 1)
return;
x = 1;
Wake(x, 1);
}
private:
int x;
};
Linux 3.0 provides a system call called futex, upon which many concurrency utilities are based including recent pthread_mutex implementations. Whenever you write code you should always consider whether using an existing implementation or writing it yourself is the better choice for your project.
Above is an implementation of a Lock (mutex, 1 permit counting semaphore) based upon futex and the semantics description in man futex(7).
It appears to contain a deadlock bug: after multiple threads have locked and unlocked it a few thousand times, they can end up in a state where x == -1 and every thread is stuck in CompareWait, yet no one is holding the lock.
Can anyone see where the bug is?
Update: I'm a little surprised that futex(7)/semantics is so broken. I completely rewrote Lock as follows... is this correct now?
// CompareAssign(x,y,z) atomically: if (x == y) {x = z; ret true; } else ret false;
struct Lock
{
Lock() : x(0) {}
void lock()
{
while (!CompareAssign(x, 0, 1))
if (x == 2 || CompareAssign(x, 1, 2))
CompareWait(x, 2);
}
void unlock()
{
if (SubFetch(x, 1) == 0)
return;
x = 0;
Wake(x, 1);
}
private:
int x;
};
The idea here is that x has the following three states:
0: unlocked
1: locked & no waiters
2: locked & waiters

The problem is that you explicitly assign -1 to x if the SubFetch fails to acquire the lock. This races with the unlock.
Thread 1 acquires the lock. x==0.
Thread 2 tries to acquire the lock. The SubFetch sets x to -1, and then thread 2 is suspended.
Thread 1 releases the lock. The AddFetch sets x to 0, so the code then explicitly sets x to 1 and calls Wake.
Thread 2 wakes up and sets x to -1, and then calls CompareWait.
Thread 2 is now stuck waiting, with x set to -1, but there is no one around to wake it, as thread 1 has already released the lock.
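The interleaving above can be replayed deterministically. The following is only a single-threaded sketch: plain integers stand in for the atomics, and Replay and replay_lost_wakeup are hypothetical names introduced for illustration. It shows that the stale "x = -1" store leaves x at -1 while neither thread holds the lock.

```cpp
#include <cassert>

// Outcome of replaying the race. All names here are hypothetical.
struct Replay {
    int x = 1;                  // 1 = free, as set by Lock::Lock()
    bool t1_holds = false;
    bool t2_holds = false;
    bool t2_would_block = false;
};

inline Replay replay_lost_wakeup() {
    Replay r;
    // Thread 1: lock(). SubFetch(x, 1) returns 0, so it owns the lock.
    r.t1_holds = (--r.x == 0);
    assert(r.t1_holds);
    // Thread 2: lock(). SubFetch(x, 1) returns -1, so it fails to acquire,
    // and is preempted just before executing "x = -1".
    r.t2_holds = (--r.x == 0);
    // Thread 1: unlock(). AddFetch(x, 1) returns 0, not 1, so it stores 1
    // and calls Wake(x, 1). Nobody is waiting yet, so the wake is lost.
    if (++r.x != 1) {
        r.x = 1;
    }
    r.t1_holds = false;
    // Thread 2 resumes: its stale store overwrites the "free" value.
    r.x = -1;
    // CompareWait(x, -1) sleeps because x still equals -1.
    r.t2_would_block = (r.x == -1);
    return r;
}
```

Calling replay_lost_wakeup() yields x == -1 with both "holds" flags false, which is the stuck state reported in the question.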

The proper implementation of a futex-based Mutex is described in Ulrich Drepper's paper "Futexes are tricky"
http://people.redhat.com/drepper/futex.pdf
It includes not only the code but also a very detailed explanation of why it is correct. The code from the paper:
class mutex
{
public:
mutex () : val (0) { }
void lock () {
int c;
if ((c = cmpxchg (val, 0, 1)) != 0)
do {
if (c == 2 || cmpxchg (val, 1, 2) != 0)
futex_wait (&val, 2);
} while ((c = cmpxchg (val, 0, 2)) != 0);
}
void unlock () {
//NOTE: atomic_dec returns the value BEFORE the operation, unlike your SubFetch !
if (atomic_dec (val) != 1) {
val = 0;
futex_wake (&val, 1);
}
}
private:
int val;
};
Comparing the code in the paper with your code, I spot a difference.
You have
if (x == 2 || CompareAssign(x, 1, 2))
using the futex's value directly whereas Drepper uses the return value from the previous CompareAssign(). That difference will probably affect performance only.
Your unlock code is different, too, but seems to be semantically equivalent.
In any case I would strongly advise you to follow Drepper's code to the letter. That paper has stood the test of time and received a lot of peer review. You gain nothing from rolling your own.
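The listing from the paper relies on four helpers (cmpxchg, atomic_dec, futex_wait, futex_wake) that are not defined in the excerpt. A minimal sketch of them for Linux with GCC or Clang might look as follows. The signatures are assumptions chosen to fit the calls in the listing, not taken from the paper, and the code uses the compiler's __atomic builtins plus the raw futex system call.

```cpp
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// If v == expected, store desired. Always returns the value that was seen.
inline int cmpxchg(int& v, int expected, int desired) {
    __atomic_compare_exchange_n(&v, &expected, desired, /*weak=*/false,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return expected;  // unchanged on success, updated to the old value on failure
}

// Atomically decrements v and returns the value BEFORE the decrement.
inline int atomic_dec(int& v) {
    return __atomic_fetch_sub(&v, 1, __ATOMIC_SEQ_CST);
}

// Sleeps while *addr == expected. Returns -1 at once (errno EAGAIN)
// if the value already differs.
inline long futex_wait(int* addr, int expected) {
    return syscall(SYS_futex, addr, FUTEX_WAIT, expected, nullptr, nullptr, 0);
}

// Wakes up to count waiters and returns how many were woken.
inline long futex_wake(int* addr, int count) {
    return syscall(SYS_futex, addr, FUTEX_WAKE, count, nullptr, nullptr, 0);
}
```

With these definitions, an uncontended lock()/unlock() pair in the mutex class above never enters the kernel: cmpxchg moves val from 0 to 1, and atomic_dec returns 1, so futex_wake is skipped.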

How about this scenario with three threads, A, B , and C.
The initial state of this scenario has:
thread A holding the lock
thread B not contending for the lock just yet
thread C in CompareWait()
x == -1 from when C failed to acquire the lock
A                 B                 C
==============    ==============    ==============
AddFetch()
(so x == 0)
                  SubFetch()
                  (so x == -1)
x = 1
                  x = -1
Wake()
At this point x == -1 and nobody holds the lock, so whichever of B or C is unblocked will not get a result of 0 from SubFetch().

Related

Wait until a variable becomes zero

I'm writing a multithreaded program that can execute some tasks in separate threads.
Some operations must be waited for before my program exits. I've written a simple guard for such "important" operations:
class CPendingOperationGuard final
{
public:
CPendingOperationGuard()
{
InterlockedIncrementAcquire( &m_ullCounter );
}
~CPendingOperationGuard()
{
InterlockedDecrementAcquire( &m_ullCounter );
}
static bool WaitForAll( DWORD dwTimeOut )
{
// Here is a topic of my question
// Return false on timeout
// Return true if wait was successful
}
private:
static volatile ULONGLONG m_ullCounter;
};
Usage is simple:
void ImportantTask()
{
CPendingOperationGuard guard;
// Do work
}
// ...
void StopExecution()
{
if(!CPendingOperationGuard::WaitForAll( 30000 )) {
// Handle error
}
}
The question is: how can I efficiently wait until m_ullCounter becomes zero, or until a timeout elapses?
I have two ideas:
The first is to launch this function in a separate thread and call WaitForSingleObject( hThread, dwTimeout ):
DWORD WINAPI WaitWorker( LPVOID )
{
while(InterlockedCompareExchangeRelease( &m_ullCounter, 0, 0 ))
;
}
But it will "eat" almost 100% of CPU time, so it is a bad idea.
The second idea is to allow other threads to run:
DWORD WINAPI WaitWorker( LPVOID )
{
while(InterlockedCompareExchangeRelease( &m_ullCounter, 0, 0 ))
Sleep( 0 );
}
But it will switch the execution context into kernel mode and back, which is too expensive for my task. A bad idea too.
The question is:
How can I wait, with almost zero overhead, until my variable becomes zero? Maybe without a separate thread... The main requirement is that the wait can be stopped by a timeout.
Maybe someone can suggest a completely different approach to my task, which is to wait for all registered operations (as in the WinAPI thread pool, whose API has, for instance, WaitForThreadpoolWaitCallbacks to wait for ALL registered tasks).
PS: it is not possible to rewrite my code with the ThreadPool API :(
Have a look at the WaitOnAddress() and WakeByAddressSingle()/WakeByAddressAll() functions introduced in Windows 8.
For example:
class CPendingOperationGuard final
{
public:
CPendingOperationGuard()
{
InterlockedIncrementAcquire(&m_ullCounter);
WakeByAddressAll(&m_ullCounter);
}
~CPendingOperationGuard()
{
InterlockedDecrementAcquire(&m_ullCounter);
WakeByAddressAll(&m_ullCounter);
}
static bool WaitForAll( DWORD dwTimeOut )
{
ULONGLONG Captured, Now, Deadline = GetTickCount64() + dwTimeOut;
DWORD TimeRemaining;
do
{
Captured = InterlockedExchangeAdd64((LONG64 volatile *)&m_ullCounter, 0);
if (Captured == 0) return true;
Now = GetTickCount64();
if (Now >= Deadline) return false;
TimeRemaining = static_cast<DWORD>(Deadline - Now);
}
while (WaitOnAddress(&m_ullCounter, &Captured, sizeof(ULONGLONG), TimeRemaining));
return false;
}
private:
static volatile ULONGLONG m_ullCounter;
};
Raymond Chen wrote a series of blog articles about these functions:
WaitOnAddress lets you create a synchronization object out of any data variable, even a byte
Implementing a critical section in terms of WaitOnAddress
Spurious wakes, race conditions, and bogus FIFO claims: A peek behind the curtain of WaitOnAddress
Extending our critical section based on WaitOnAddress to support timeouts
Comparing WaitOnAddress with futexes (futexi? futexen?)
Creating a semaphore from WaitOnAddress
Creating a semaphore with a maximum count from WaitOnAddress
Creating a manual-reset event from WaitOnAddress
Creating an automatic-reset event from WaitOnAddress
A helper template function to wait for WaitOnAddress in a loop
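WaitOnAddress requires Windows 8 or later. Where it is unavailable, the same wait-until-zero-with-timeout behaviour can be sketched in standard C++ with a mutex and a condition variable. This is a swapped-in technique, not the method of the answer above: PendingOperationGuard, mtx(), cv() and counter() are hypothetical names, and every constructor and destructor call takes a lock, so it does not meet the "almost-zero-overhead" goal as strictly as WaitOnAddress does.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical portable counterpart of CPendingOperationGuard.
class PendingOperationGuard {
public:
    PendingOperationGuard() {
        std::lock_guard<std::mutex> lock(mtx());
        ++counter();
    }
    ~PendingOperationGuard() {
        bool last;
        {
            std::lock_guard<std::mutex> lock(mtx());
            assert(counter() > 0);
            last = (--counter() == 0);
        }
        if (last) cv().notify_all();  // wake waiters only when the count hits zero
    }
    PendingOperationGuard(const PendingOperationGuard&) = delete;
    PendingOperationGuard& operator=(const PendingOperationGuard&) = delete;

    // Returns true if the counter reached zero before the timeout expired.
    static bool WaitForAll(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mtx());
        return cv().wait_for(lock, timeout, [] { return counter() == 0; });
    }

private:
    // Function-local statics avoid separate out-of-class definitions.
    static std::mutex& mtx() { static std::mutex m; return m; }
    static std::condition_variable& cv() { static std::condition_variable c; return c; }
    static unsigned long long& counter() { static unsigned long long n = 0; return n; }
};
```

The waiter sleeps inside wait_for and is woken only by the destructor that drops the count to zero, so there is no polling loop. Like the original guard, this does not stop new operations from starting after WaitForAll returns.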
For this task you need something like run-down protection instead of CPendingOperationGuard.
Before beginning an operation, call ExAcquireRundownProtection, and begin executing the operation only if it returns TRUE. At the end you must call ExReleaseRundownProtection.
So the pattern must be:
if (ExAcquireRundownProtection(&RunRef)) {
do_operation();
ExReleaseRundownProtection(&RunRef);
}
When you want to stop this process and wait until all active calls to do_operation() have finished, you call ExWaitForRundownProtectionRelease (instead of WaitWorker).
After ExWaitForRundownProtectionRelease is called, the ExAcquireRundownProtection routine will return FALSE (so no new operations will start after this). ExWaitForRundownProtectionRelease does not return until every previously acquired run-down protection has been released with ExReleaseRundownProtection (that is, until all current operations, if any, complete). When all outstanding accesses are completed, ExWaitForRundownProtectionRelease returns.
Unfortunately this API is implemented by the system only in kernel mode, and there is no analog in user mode. However, it is not hard to implement the idea yourself.
This is my example:
enum RundownState {
v_complete = 0, v_init = 0x80000000
};
template<typename T>
class RundownProtection
{
LONG _Value;
public:
_NODISCARD BOOL IsRundownBegin()
{
return 0 <= _Value;
}
_NODISCARD BOOL AcquireRP()
{
LONG Value, NewValue;
if (0 > (Value = _Value))
{
do
{
NewValue = InterlockedCompareExchangeNoFence(&_Value, Value + 1, Value);
if (NewValue == Value) return TRUE;
} while (0 > (Value = NewValue));
}
return FALSE;
}
void ReleaseRP()
{
if (InterlockedDecrement(&_Value) == v_complete)
{
static_cast<T*>(this)->RundownCompleted();
}
}
void Rundown_l()
{
InterlockedBitTestAndResetNoFence(&_Value, 31);
}
void Rundown()
{
if (AcquireRP())
{
Rundown_l();
ReleaseRP();
}
}
RundownProtection(RundownState Value = v_init) : _Value(Value)
{
}
void Init()
{
_Value = v_init;
}
};
///////////////////////////////////////////////////////////////
class OperationGuard : public RundownProtection<OperationGuard>
{
friend RundownProtection<OperationGuard>;
HANDLE _hEvent;
void RundownCompleted()
{
SetEvent(_hEvent);
}
public:
OperationGuard() : _hEvent(0) {}
~OperationGuard()
{
if (_hEvent)
{
CloseHandle(_hEvent);
}
}
ULONG WaitComplete(ULONG dwMilliseconds = INFINITE)
{
return WaitForSingleObject(_hEvent, dwMilliseconds);
}
ULONG Init()
{
return (_hEvent = CreateEvent(0, 0, 0, 0)) ? NOERROR : GetLastError();
}
} g_guard;
//////////////////////////////////////////////
ULONG CALLBACK PendingOperationThread(void*)
{
while (g_guard.AcquireRP())
{
Sleep(1000);// do operation
g_guard.ReleaseRP();
}
return 0;
}
void demo()
{
if (g_guard.Init() == NOERROR)
{
if (HANDLE hThread = CreateThread(0, 0, PendingOperationThread, 0, 0, 0))
{
CloseHandle(hThread);
}
MessageBoxW(0, 0, L"UI Thread", MB_ICONINFORMATION|MB_OK);
g_guard.Rundown();
g_guard.WaitComplete();
}
}
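The Interlocked* calls above are Windows-specific. The same counter-plus-flag scheme can be sketched with std::atomic; this is only an illustration under assumptions, with RundownRef, try_acquire, release and begin_rundown as hypothetical names. As in the code above, bit 31 set means "new operations allowed" and bits 0..30 count operations in flight.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical portable sketch of the run-down reference above.
class RundownRef {
    static constexpr std::uint32_t kActive = 0x80000000u;  // bit 31: not yet run down
    std::atomic<std::uint32_t> value_{kActive};            // bits 0..30: in-flight count

public:
    // Registers one operation. Returns false once rundown has begun.
    bool try_acquire() {
        std::uint32_t v = value_.load(std::memory_order_relaxed);
        while (v & kActive) {
            // Flag check and increment happen in one atomic step.
            if (value_.compare_exchange_weak(v, v + 1,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed))
                return true;
        }
        return false;
    }

    // Unregisters one operation. Returns true if this was the last
    // reference after rundown began (the point at which to signal a waiter).
    bool release() {
        std::uint32_t prev = value_.fetch_sub(1, std::memory_order_acq_rel);
        assert((prev & ~kActive) != 0);  // must pair with a successful try_acquire
        return prev == 1;                // flag clear and count dropping to zero
    }

    // Clears the flag so later try_acquire calls fail. Returns true if no
    // operation was in flight, i.e. the rundown completed immediately.
    bool begin_rundown() {
        if (!try_acquire()) return false;  // rundown was already started
        value_.fetch_and(~kActive, std::memory_order_acq_rel);
        return release();
    }
};
```

Whichever call receives true from release() or begin_rundown() is the one that would set the event in the Windows version; the waiting side is left out of this sketch.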
Why simply waiting until m_ullCounter becomes zero is not enough
If we read 0 from m_ullCounter, this means only that no operation is active at that moment. A pending operation can still begin after we have checked that m_ullCounter == 0. We can use a special flag (say bool g_bQuit) and set it; an operation checks this flag before beginning and does not begin if it is true. But this is still not enough.
Naive code:
//worker thread
if (!g_bQuit) // (1)
{
// MessageBoxW(0, 0, L"simulate delay", MB_ICONWARNING);
InterlockedIncrement(&g_ullCounter); // (4)
// do operation
InterlockedDecrement(&g_ullCounter); // (5)
}
// here we wait for all operation done
g_bQuit = true; // (2)
// wait on g_ullCounter == 0, how - not important
while (g_ullCounter) continue; // (3)
1. The pending operation checks the g_bQuit flag (1); it is still false, so the operation begins.
2. The worker thread is swapped out (the MessageBox simulates this).
3. We set g_bQuit = true; // (2)
4. We check/wait for g_ullCounter == 0; it is 0, so we exit (3).
5. The worker thread wakes up (returns from MessageBox) and increments g_ullCounter (4).
The problem here is that the operation may use resources that we have already begun destroying after seeing g_ullCounter == 0.
This happens because checking the quit flag (g_bQuit) and then incrementing the counter is not atomic: there can be a gap between the two.
For a correct solution we need atomic access to the flag and the counter together, and this is what run-down protection does. The flag and the counter share a single LONG variable (32 bits) because we can access it atomically: 31 bits are used for the counter and 1 bit for the quit flag. The Windows solution uses bit 0 for the flag (1 means quit) and bits [1..31] for the counter. I use bits [0..30] for the counter and bit 31 for the flag (0 means quit). look for

How do monitors guarantee mutual exclusion?

The producer/consumer problem in concurrency: a producer produces things and appends them to a buffer. A consumer takes things from the buffer. The consumer doesn't want to take things from an empty buffer and the producer doesn't want to append things to a full buffer.
William Stallings' "Operating Systems" gives the following example of a monitor used to solve the producer/consumer problem:
// Monitor
append(char x) {
if (count == N) cwait(notfull)
buffer[nextin] = x
nextin = (nextin + 1) % N
count++
csignal(notempty)
}
take(char x) {
if (count == 0) cwait(notempty)
x = buffer[nextout]
nextout = (nextout + 1) % N
count--
csignal(notfull)
}
// Application using the monitor
producer() {
while (true) {
produce(x)
append(x)
}
}
consumer() {
while (true) {
take(x)
consume(x)
}
}
The book claims "only one process may be in the monitor at a time" [p.227]
How is this property enforced?
I can see how this would work with 1 consumer and 1 producer, but I fail to see how this protects - for example - 2 producers from simultaneously writing to a buffer.
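One common way to realise a monitor, offered here as a sketch and not as the book's implementation, is to give it a single hidden lock that every monitor procedure acquires on entry and releases on exit (and temporarily releases inside cwait). Written out explicitly in C++, with BoundedBuffer and monitor_lock_ as hypothetical names, two producers calling append() at the same time are serialized by that lock. Note that std::condition_variable can wake spuriously, so the waits use while where the book's pseudocode uses if.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Hypothetical explicit-lock rendering of the monitor above.
class BoundedBuffer {
    static constexpr std::size_t N = 4;  // capacity chosen for illustration
    std::mutex monitor_lock_;            // the monitor's implicit lock, made explicit
    std::condition_variable notfull_;
    std::condition_variable notempty_;
    char buffer_[N];
    std::size_t nextin_ = 0;
    std::size_t nextout_ = 0;
    std::size_t count_ = 0;

public:
    void append(char x) {
        // Entering the monitor: only one thread gets past this line at a time.
        std::unique_lock<std::mutex> in_monitor(monitor_lock_);
        // wait() releases the lock while sleeping and reacquires it on wake-up.
        while (count_ == N) notfull_.wait(in_monitor);
        buffer_[nextin_] = x;
        nextin_ = (nextin_ + 1) % N;
        ++count_;
        notempty_.notify_one();
    }   // leaving the monitor releases the lock

    char take() {
        std::unique_lock<std::mutex> in_monitor(monitor_lock_);
        while (count_ == 0) notempty_.wait(in_monitor);
        char x = buffer_[nextout_];
        nextout_ = (nextout_ + 1) % N;
        --count_;
        notfull_.notify_one();
        return x;
    }
};
```

In a language with built-in monitors the compiler or runtime inserts the equivalent of in_monitor automatically, which is why the pseudocode shows no lock.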

Is there a way to synchronize this without locks? [closed]

Say I have 3 functions that can be called by an upper layer:
Start - Will only be called if we haven't been started yet, or Stop was previously called
Stop - Will only be called after a successful call to Start
Process - Can be called at any time (simultaneously on different threads); if started, will call into lower layer
In Stop, it must wait for all Process calls to finish calling into the lower layer, and prevent any further calls. With a locking mechanism, I can come up with the following pseudo code:
Start() {
ResetEvent(&StopCompleteEvent);
IsStarted = true;
RefCount = 0;
}
Stop() {
AcquireLock();
IsStarted = false;
WaitForCompletionEvent = (RefCount != 0);
ReleaseLock();
if (WaitForCompletionEvent)
WaitForEvent(&StopCompleteEvent);
ASSERT(RefCount == 0);
}
Process() {
AcquireLock();
AddedRef = IsStarted;
if (AddedRef)
RefCount++;
ReleaseLock();
if (!AddedRef) return;
ProcessLowerLayer();
AcquireLock();
FireCompletionEvent = (--RefCount == 0);
ReleaseLock();
if (FireCompletionEvent)
SetEvent(&StopCompleteEvent);
}
Is there a way to achieve the same behavior without a locking mechanism? Perhaps with some fancy usage of InterlockedCompareExchange and InterlockedIncrement/InterlockedDecrement?
The reason I ask is that this is in the data path of a network driver and I would really prefer not to have any locks.
I believe it is possible to avoid the use of explicit locks and any unnecessary blocking or kernel calls.
Note that this is pseudo-code only, for illustrative purposes; it hasn't seen a compiler. And while I believe the threading logic is sound, please verify its correctness for yourself, or get an expert to validate it; lock-free programming is hard.
#define STOPPING 0x20000000
#define STOPPED 0x40000000
volatile LONG s = STOPPED;
// state and count
// bit 30 set -> stopped
// bit 29 set -> stopping
// bits 0 through 28 -> thread count
Start()
{
KeClearEvent(&StopCompleteEvent);
LONG n = InterlockedExchange(&s, 0); // sets s to 0
if ((n & STOPPED) == 0)
bluescreen("Invalid call to Start()");
}
Stop()
{
LONG n = InterlockedCompareExchange(&s, STOPPED, 0);
if (n == 0)
{
// No calls to Process() were running so we could jump directly to stopped.
// Mission accomplished!
return;
}
n = InterlockedOr(&s, STOPPING);
if ((n & STOPPED) != 0)
bluescreen("Stop called when already stopped");
if ((n & STOPPING) != 0)
bluescreen("Stop called when already stopping");
n = InterlockedCompareExchange(&s, STOPPED, STOPPING);
if (n == STOPPING)
{
// The last call to Process() exited before we set the STOPPING flag.
// Mission accomplished!
return;
}
// Now that STOPPING mode is set, and we know at least one call to Process
// is running, all we need do is wait for the event to be signaled.
KeWaitForSingleObject(...);
// The event is only ever signaled after a thread has successfully
// changed the state to STOPPED. Mission accomplished!
return;
}
Process()
{
LONG n = InterlockedCompareExchange(&s, STOPPED, STOPPING);
if (n == STOPPING)
{
// We've just stopped; let the call to Stop() complete.
KeSetEvent(&StopCompleteEvent);
return;
}
if ((n & STOPPED) != 0 || (n & STOPPING) != 0)
{
// Checking here avoids changing the state unnecessarily when
// we already know we can't enter the lower layer.
// It also ensures that the transition from STOPPING to STOPPED can't
// be delayed even if there are lots of threads making new calls to Process().
return;
}
n = InterlockedIncrement(&s);
if ((n & STOPPED) != 0)
{
// Turns out we've just stopped, so the call to Process() must be aborted.
// Explicitly set the state back to STOPPED, rather than decrementing it,
// in case Start() has been called. At least one thread will succeed.
InterlockedCompareExchange(&s, STOPPED, n);
return;
}
if ((n & STOPPING) == 0)
{
ProcessLowerLayer();
}
n = InterlockedDecrement(&s);
if ((n & STOPPED) != 0 || n == (STOPPED - 1))
bluescreen("Stopped during call to Process, shouldn't be possible!");
if (n != STOPPING)
return;
// Stop() has been called, and it looks like we're the last
// running call to Process() in which case we need to change the
// status to STOPPED and signal the call to Stop() to exit.
// However, another thread might have beaten us to it, so we must
// check again. The event MUST only be set once per call to Stop().
n = InterlockedCompareExchange(&s, STOPPED, STOPPING);
if (n == STOPPING)
{
// We've just stopped; let the call to Stop() complete.
KeSetEvent(&StopCompleteEvent);
}
return;
}

Looking for critique of my reader/writer implementation [closed]

I implemented the readers/writers problem in C++11. I'd like to know what's wrong with it, because these kinds of things are difficult to predict on my own.
Shared database:
Readers can access database when no writers
Writers can access database when no readers or writers
Only one thread manipulates state variables at a time
The example has 3 readers and 1 writer, but it can also be used with 2 or more writers.
Code:
class ReadersWriters {
private:
int AR; // number of active readers
int WR; // number of waiting readers
int AW; // number of active writers
int WW; // number of waiting writers
mutex lock;
mutex m;
condition_variable okToRead;
condition_variable okToWrite;
int data_base_variable;
public:
ReadersWriters() : AR(0), WR(0), AW(0), WW(0), data_base_variable(0) {}
void read_lock() {
unique_lock<mutex> l(lock);
WR++; // no writers exist
// is it safe to read?
okToRead.wait(l, [this](){ return WW == 0; });
okToRead.wait(l, [this](){ return AW == 0; });
WR--; // no longer waiting
AR++; // now we are active
}
void read_unlock() {
unique_lock<mutex> l(lock);
AR--; // no longer active
if (AR == 0 && WW > 0) { // no other active readers
okToWrite.notify_one(); // wake up one writer
}
}
void write_lock() {
unique_lock<mutex> l(lock);
WW++; // no active user exist
// is it safe to write?
okToWrite.wait(l, [this](){ return AR == 0; });
okToWrite.wait(l, [this](){ return AW == 0; });
WW--; // no longer waiting
AW++; // now we are active
}
void write_unlock() {
unique_lock<mutex> l(lock);
AW--; // no longer active
if (WW > 0) { // give priority to writers
okToWrite.notify_one(); // wake up one writer
}
else if (WR > 0) { // otherwise, wake readers
okToRead.notify_all(); // wake all readers
}
}
void data_base_thread_write(unsigned int thread_id) {
for (int i = 0; i < 10; i++) {
write_lock();
data_base_variable++;
m.lock();
cout << "data_base_thread: " << thread_id << "...write: " << data_base_variable << endl;
m.unlock();
write_unlock();
std::this_thread::sleep_for(std::chrono::milliseconds(10));
}
}
void data_base_thread_read(unsigned int thread_id) {
for (int i = 0; i < 10; i++) {
read_lock();
m.lock();
cout << "data_base_thread: " << thread_id << "...read: " << data_base_variable << endl;
m.unlock();
read_unlock();
std::this_thread::sleep_for(std::chrono::milliseconds(10));
}
}
};
int main() {
// your code goes here
ReadersWriters rw;
thread w1(&ReadersWriters::data_base_thread_write, &rw, 0);
thread r1(&ReadersWriters::data_base_thread_read, &rw, 1);
thread r2(&ReadersWriters::data_base_thread_read, &rw, 2);
thread r3(&ReadersWriters::data_base_thread_read, &rw, 3);
w1.join();
r1.join();
r2.join();
r3.join();
cout << "\nThreads successfully completed..." << endl;
return 0;
}
Feedback:
1. It is missing all necessary #includes.
2. It presumes a using namespace std, which is bad style in declarations, as that pollutes all of your clients with namespace std.
3. The release of your locks is not exception safe:
write_lock();
data_base_variable++;
m.lock();
cout << "data_base_thread: " << thread_id << "...write: " << data_base_variable << endl;
m.unlock(); // leaked if an exception is thrown after m.lock()
write_unlock(); // leaked if an exception is thrown after write_lock()
4. The m.lock() wrapping of cout in data_base_thread_write is really unnecessary since write_lock() should already be providing exclusive access. However I understand that this is just a demo.
5. I think I see a bug in the read/write logic:
step 1 2 3 4 5 6
WR 0 1 1 1 0 0
AR 0 0 0 0 1 1
WW 0 0 1 1 1 0
AW 1 1 1 0 0 1
In step 1, thread 1 has the write lock.
In step 2, thread 2 attempts to acquire a read lock, increments WR, and blocks on the second okToRead, waiting for AW == 0.
In step 3, thread 3 attempts to acquire a write lock, increments WW, and blocks on the second okToWrite, waiting for AW == 0.
In step 4, thread 1 releases the write lock by decrementing AW to 0, and signals okToWrite.
In step 5, thread 2, despite not being signaled, is awoken spuriously, notes that AW == 0, and grabs the read lock by setting WR to 0 and AR to 1.
In step 6, thread 3 receives the signal, notes that AW == 0, and grabs the write lock by setting WW to 0 and AW to 1.
After step 6, thread 2 owns the read lock and thread 3 owns the write lock simultaneously.
6. The class ReadersWriters has two functions:
It implements a read/write mutex.
It implements tasks for threads to execute.
A better design would take advantage of the mutex/lock framework established in C++11:
Create a ReaderWriter mutex with members:
// unique ownership
void lock(); // write_lock
void unlock(); // write_unlock
// shared ownership
void lock_shared(); // read_lock
void unlock_shared(); // read_unlock
The first two names, lock and unlock are purposefully the same names as those used by the C++11 mutex types. Just doing this much allows you to do things like:
std::lock_guard<ReaderWriter> lk1(mut);
// ...
std::unique_lock<ReaderWriter> lk2(mut);
// ...
std::condition_variable_any cv;
cv.wait(lk2); // wait using the write lock
And if you add:
bool try_lock();
Then you can also:
std::lock(lk2, <any other std or non-std locks>); // lock multiple locks
The lock_shared and unlock_shared names are chosen because of the std::shared_lock<T> type currently in the C++1y (we hope y is 4) working draft. It is documented in N3659.
And then you can say things like:
std::shared_lock<ReaderWriter> lk3(mut); // read_lock
std::condition_variable_any cv;
cv.wait(lk3); // wait using the read lock
I.e., by just creating a stand-alone ReaderWriter mutex type, with very carefully chosen names for the member functions, you get interoperability with the std-defined locks, condition_variable_any, and locking algorithms.
See N2406 for a more in-depth rationale of this framework.
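A minimal sketch of such a stand-alone ReaderWriter type follows. It is one possible implementation under assumptions, not the design from N2406 or N3659: it gives writers preference, uses a single condition variable, and adds a try_lock_shared member that the feedback above does not mention. Each wait uses one combined predicate, which avoids the two-consecutive-waits problem described in item 5.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical writer-preferring reader/writer mutex with std-compatible names.
class ReaderWriter {
    std::mutex m_;
    std::condition_variable cv_;
    int active_readers_ = 0;
    int waiting_writers_ = 0;
    bool writer_active_ = false;

public:
    // Unique (write) ownership.
    void lock() {
        std::unique_lock<std::mutex> l(m_);
        ++waiting_writers_;
        // One predicate covers every condition, checked atomically under m_.
        cv_.wait(l, [this] { return !writer_active_ && active_readers_ == 0; });
        --waiting_writers_;
        writer_active_ = true;
    }
    bool try_lock() {
        std::lock_guard<std::mutex> l(m_);
        if (writer_active_ || active_readers_ != 0) return false;
        writer_active_ = true;
        return true;
    }
    void unlock() {
        {
            std::lock_guard<std::mutex> l(m_);
            writer_active_ = false;
        }
        cv_.notify_all();  // readers and writers re-check their predicates
    }

    // Shared (read) ownership.
    void lock_shared() {
        std::unique_lock<std::mutex> l(m_);
        cv_.wait(l, [this] { return !writer_active_ && waiting_writers_ == 0; });
        ++active_readers_;
    }
    bool try_lock_shared() {
        std::lock_guard<std::mutex> l(m_);
        if (writer_active_ || waiting_writers_ != 0) return false;
        ++active_readers_;
        return true;
    }
    void unlock_shared() {
        bool last;
        {
            std::lock_guard<std::mutex> l(m_);
            last = (--active_readers_ == 0);
        }
        if (last) cv_.notify_all();
    }
};
```

Because the member names match the standard requirements, std::lock_guard<ReaderWriter> and std::unique_lock<ReaderWriter> work with it unchanged.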

Win32 Read/Write Lock Using Only Critical Sections

I have to implement a read/write lock in C++ using the Win32 api as part of a project at work. All of the existing solutions use kernel objects (semaphores and mutexes) that require a context switch during execution. This is far too slow for my application.
I would like to implement one using only critical sections, if possible. The lock does not have to be process-safe, only thread-safe. Any ideas on how to go about this?
If you can target Vista or later, you should use the built-in SRWLocks. They are lightweight like critical sections and stay entirely in user mode when there is no contention.
Joe Duffy's blog has some recent entries on implementing different types of non-blocking reader/writer locks. These locks do spin, so they would not be appropriate if you intend to do a lot of work while holding the lock. The code is C#, but should be straightforward to port to native.
You can implement a reader/writer lock using critical sections and events - you just need to keep enough state to only signal the event when necessary to avoid an unnecessary kernel mode call.
I don't think this can be done without using at least one kernel-level object (Mutex or Semaphore), because you need the help of the kernel to make the calling process block until the lock is available.
Critical sections do provide blocking, but the API is too limited. e.g. you cannot grab a CS, discover that a read lock is available but not a write lock, and wait for the other process to finish reading (because if the other process has the critical section it will block other readers which is wrong, and if it doesn't then your process will not block but spin, burning CPU cycles.)
However what you can do is use a spin lock and fall back to a mutex whenever there is contention. The critical section is itself implemented this way. I would take an existing critical section implementation and replace the PID field with separate reader & writer counts.
Old question, but this is something that should work. It doesn't spin on contention. Readers incur limited extra cost if they have little or no contention, because SetEvent is called lazily (look at the edit history for a more heavyweight version that doesn't have this optimization).
#include <windows.h>
#include <assert.h>
typedef struct _RW_LOCK {
CRITICAL_SECTION countsLock;
CRITICAL_SECTION writerLock;
HANDLE noReaders;
int readerCount;
BOOL waitingWriter;
} RW_LOCK, *PRW_LOCK;
void rwlock_init(PRW_LOCK rwlock)
{
InitializeCriticalSection(&rwlock->writerLock);
InitializeCriticalSection(&rwlock->countsLock);
/*
* Could use a semaphore as well. There can only be one waiter ever,
* so I'm showing an auto-reset event here.
*/
rwlock->noReaders = CreateEvent (NULL, FALSE, FALSE, NULL);
rwlock->readerCount = 0;
rwlock->waitingWriter = FALSE;
}
void rwlock_rdlock(PRW_LOCK rwlock)
{
/*
* We need to lock the writerLock too, otherwise a writer could
* do the whole of rwlock_wrlock after the readerCount changed
* from 0 to 1, but before the event was reset.
*/
EnterCriticalSection(&rwlock->writerLock);
EnterCriticalSection(&rwlock->countsLock);
++rwlock->readerCount;
LeaveCriticalSection(&rwlock->countsLock);
LeaveCriticalSection(&rwlock->writerLock);
}
void rwlock_wrlock(PRW_LOCK rwlock)
{
EnterCriticalSection(&rwlock->writerLock);
/*
* readerCount cannot become non-zero within the writerLock CS,
* but it can become zero...
*/
if (rwlock->readerCount > 0) {
EnterCriticalSection(&rwlock->countsLock);
/* ... so test it again. */
if (rwlock->readerCount > 0) {
rwlock->waitingWriter = TRUE;
LeaveCriticalSection(&rwlock->countsLock);
WaitForSingleObject(rwlock->noReaders, INFINITE);
} else {
/* How lucky, no need to wait. */
LeaveCriticalSection(&rwlock->countsLock);
}
}
/* writerLock remains locked. */
}
void rwlock_rdunlock(PRW_LOCK rwlock)
{
EnterCriticalSection(&rwlock->countsLock);
assert (rwlock->readerCount > 0);
if (--rwlock->readerCount == 0) {
if (rwlock->waitingWriter) {
/*
* Clear waitingWriter here to avoid taking countsLock
* again in wrlock.
*/
rwlock->waitingWriter = FALSE;
SetEvent(rwlock->noReaders);
}
}
LeaveCriticalSection(&rwlock->countsLock);
}
void rwlock_wrunlock(PRW_LOCK rwlock)
{
LeaveCriticalSection(&rwlock->writerLock);
}
You could decrease the cost for readers by using a single CRITICAL_SECTION:
countsLock is replaced with writerLock in rdlock and rdunlock
rwlock->waitingWriter = FALSE is removed in rdunlock
wrlock's body is changed to
EnterCriticalSection(&rwlock->writerLock);
rwlock->waitingWriter = TRUE;
while (rwlock->readerCount > 0) {
LeaveCriticalSection(&rwlock->writerLock);
WaitForSingleObject(rwlock->noReaders, INFINITE);
EnterCriticalSection(&rwlock->writerLock);
}
rwlock->waitingWriter = FALSE;
/* writerLock remains locked. */
However this loses in fairness, so I prefer the above solution.
Take a look at the book "Concurrent Programming on Windows" which has lots of different reference examples for reader/writer locks.
Check out the spin_rw_mutex from Intel's Thread Building Blocks ...
spin_rw_mutex is strictly in user-land
and employs spin-wait for blocking
This is an old question but perhaps someone will find this useful. We developed a high-performance, open-source RWLock for Windows that automatically uses Vista+ SRWLock Michael mentioned if available, or otherwise falls back to a userspace implementation.
As an added bonus, there are four different "flavors" of it (though you can stick to the basic, which is also the fastest), each providing more synchronization options. It starts with the basic RWLock() which is non-reentrant, limited to single-process synchronization, and no swapping of read/write locks to a full-fledged cross-process IPC RWLock with re-entrance support and read/write de-elevation.
As mentioned, they dynamically swap out to the Vista+ slim read-write locks for best performance when possible, but you don't have to worry about that at all as it'll fall back to a fully-compatible implementation on Windows XP and its ilk.
If you already know of a solution that only uses mutexes, you should be able to modify it to use critical sections instead.
We rolled our own using two critical sections and some counters. It suits our needs - we have a very low writer count, writers get precedence over readers, etc. I'm not at liberty to publish ours but can say that it is possible without mutexes and semaphores.
Here is the smallest solution that I could come up with:
http://www.baboonz.org/rwlock.php
And pasted verbatim:
/** A simple Reader/Writer Lock.
This RWL has no events - we rely solely on spinlocks and sleep() to yield control to other threads.
I don't know what the exact penalty is for using sleep vs events, but at least when there is no contention, we are basically
as fast as a critical section. This code is written for Windows, but it should be trivial to find the appropriate
equivalents on another OS.
**/
class TinyReaderWriterLock
{
public:
volatile uint32 Main;
static const uint32 WriteDesireBit = 0x80000000;
void Noop( uint32 tick )
{
if ( ((tick + 1) & 0xfff) == 0 ) // Sleep after 4k cycles. Crude, but usually better than spinning indefinitely.
Sleep(0);
}
TinyReaderWriterLock() { Main = 0; }
~TinyReaderWriterLock() { ASSERT( Main == 0 ); }
void EnterRead()
{
for ( uint32 tick = 0 ;; tick++ )
{
uint32 oldVal = Main;
if ( (oldVal & WriteDesireBit) == 0 )
{
if ( InterlockedCompareExchange( (LONG*) &Main, oldVal + 1, oldVal ) == oldVal )
break;
}
Noop(tick);
}
}
void EnterWrite()
{
for ( uint32 tick = 0 ;; tick++ )
{
if ( (tick & 0xfff) == 0 ) // Set the write-desire bit every 4k cycles (including cycle 0).
_InterlockedOr( (LONG*) &Main, WriteDesireBit );
uint32 oldVal = Main;
if ( oldVal == WriteDesireBit )
{
if ( InterlockedCompareExchange( (LONG*) &Main, -1, WriteDesireBit ) == WriteDesireBit )
break;
}
Noop(tick);
}
}
void LeaveRead()
{
ASSERT( Main != -1 );
InterlockedDecrement( (LONG*) &Main );
}
void LeaveWrite()
{
ASSERT( Main == -1 );
InterlockedIncrement( (LONG*) &Main );
}
};
I wrote the following code using only critical sections.
class ReadWriteLock {
volatile LONG writelockcount;
volatile LONG readlockcount;
CRITICAL_SECTION cs;
public:
ReadWriteLock() {
InitializeCriticalSection(&cs);
writelockcount = 0;
readlockcount = 0;
}
~ReadWriteLock() {
DeleteCriticalSection(&cs);
}
void AcquireReaderLock() {
retry:
while (writelockcount) {
Sleep(0);
}
EnterCriticalSection(&cs);
if (!writelockcount) {
readlockcount++;
}
else {
LeaveCriticalSection(&cs);
goto retry;
}
LeaveCriticalSection(&cs);
}
void ReleaseReaderLock() {
EnterCriticalSection(&cs);
readlockcount--;
LeaveCriticalSection(&cs);
}
void AcquireWriterLock() {
retry:
while (writelockcount||readlockcount) {
Sleep(0);
}
EnterCriticalSection(&cs);
if (!writelockcount&&!readlockcount) {
writelockcount++;
}
else {
LeaveCriticalSection(&cs);
goto retry;
}
LeaveCriticalSection(&cs);
}
void ReleaseWriterLock() {
EnterCriticalSection(&cs);
writelockcount--;
LeaveCriticalSection(&cs);
}
};
To perform a spin-wait, comment out the lines with Sleep(0).
See my implementation here:
https://github.com/coolsoftware/LockLib
VRWLock is a C++ class that implements single writer - multiple readers logic.
See also the test project TestLock.sln.
UPD. Below is the simple code for reader and writer:
LONG gCounter = 0;
// reader
for (;;) //loop
{
LONG n = InterlockedIncrement(&gCounter);
// n = value of gCounter after increment
if (n <= MAX_READERS) break; // writer does not write anything - we can read
InterlockedDecrement(&gCounter);
}
// read data here
InterlockedDecrement(&gCounter); // release reader
// writer
for (;;) //loop
{
LONG n = InterlockedCompareExchange(&gCounter, (MAX_READERS+1), 0);
// n = value of gCounter before attempt to replace it by MAX_READERS+1 in InterlockedCompareExchange
// if gCounter was 0 - no readers/writers and in gCounter will be MAX_READERS+1
// if gCounter was not 0 - gCounter stays unchanged
if (n == 0) break;
}
// write data here
InterlockedExchangeAdd(&gCounter, -(MAX_READERS+1)); // release writer
The VRWLock class supports a spin count and a thread-specific reference count, which allows releasing locks held by terminated threads.
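For readers outside Windows, the counter encoding in the snippet above can be sketched with std::atomic. This is only an illustration and is not the VRWLock class: CounterRwSpinLock and kMaxReaders are hypothetical names, and the bound of 2^20 readers is an assumption. A value in 1..kMaxReaders means that many readers; kMaxReaders + 1 (or more) means a writer holds the lock.

```cpp
#include <atomic>
#include <thread>

// Hypothetical spin-based reader/writer lock using the single-counter encoding.
class CounterRwSpinLock {
    static constexpr long kMaxReaders = 1L << 20;  // assumed upper bound on readers
    std::atomic<long> counter_{0};

public:
    bool try_lock_shared() {
        long n = counter_.fetch_add(1, std::memory_order_acquire) + 1;
        if (n <= kMaxReaders) return true;  // no writer present
        counter_.fetch_sub(1, std::memory_order_relaxed);  // undo and report failure
        return false;
    }
    void lock_shared() {
        while (!try_lock_shared()) std::this_thread::yield();
    }
    void unlock_shared() {
        counter_.fetch_sub(1, std::memory_order_release);
    }

    bool try_lock() {
        long expected = 0;  // succeeds only with no readers and no writer
        return counter_.compare_exchange_strong(expected, kMaxReaders + 1,
                                                std::memory_order_acquire,
                                                std::memory_order_relaxed);
    }
    void lock() {
        while (!try_lock()) std::this_thread::yield();
    }
    void unlock() {
        counter_.fetch_sub(kMaxReaders + 1, std::memory_order_release);
    }
};
```

As with the original loops, a steady stream of readers can starve a writer, because the writer only succeeds when it observes the counter at exactly 0.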