How do monitors guarantee mutual exclusion?


The producer/consumer problem in concurrency: a producer produces things and appends them to a buffer. A consumer takes things from the buffer. The consumer doesn't want to take things from an empty buffer and the producer doesn't want to append things to a full buffer.
William Stallings' "Operating Systems" gives the following example of a monitor used to solve the producer/consumer problem:
// Monitor
append(char x) {
if (count == N) cwait(notfull)
buffer[nextin] = x
nextin = (nextin + 1) % N
count++
csignal(notempty)
}
take(char x) {
if (count == 0) cwait(notempty)
x = buffer[nextout]
nextout = (nextout + 1) % N
count--
csignal(notfull)
}
// Application using the monitor
producer() {
while (true) {
produce(x)
append(x)
}
}
consumer() {
while (true) {
take(x)
consume(x)
}
}
The book claims "only one process may be in the monitor at a time" [p.227]
How is this property enforced?
I can see how this would work with 1 consumer and 1 producer, but I fail to see how this protects - for example - 2 producers from simultaneously writing to a buffer.
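The listing leaves the locking implicit. In a language with monitor support, every monitor procedure is wrapped in one hidden lock that is acquired on entry and released on exit, and cwait releases that lock while the caller is blocked. That is what would keep two producers out of append() at the same time. A minimal C++ sketch of the idea, assuming std::mutex stands in for the hidden monitor lock and std::condition_variable for cwait/csignal (the class name BoundedBuffer and the capacity N = 4 are illustrative, not from the book):

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Hypothetical monitor-style bounded buffer; names are illustrative.
class BoundedBuffer {
public:
    static constexpr std::size_t N = 4;

    void append(char x) {
        std::unique_lock<std::mutex> lock(monitor_);          // enter the monitor
        notfull_.wait(lock, [this] { return count_ < N; });   // cwait(notfull)
        buffer_[nextin_] = x;
        nextin_ = (nextin_ + 1) % N;
        ++count_;
        notempty_.notify_one();                               // csignal(notempty)
    }                                                         // lock released: leave the monitor

    char take() {
        std::unique_lock<std::mutex> lock(monitor_);          // enter the monitor
        notempty_.wait(lock, [this] { return count_ > 0; });  // cwait(notempty)
        char x = buffer_[nextout_];
        nextout_ = (nextout_ + 1) % N;
        --count_;
        notfull_.notify_one();                                // csignal(notfull)
        return x;
    }

private:
    std::mutex monitor_;                     // the lock a monitor would add implicitly
    std::condition_variable notfull_, notempty_;
    char buffer_[N] = {};
    std::size_t nextin_ = 0, nextout_ = 0, count_ = 0;
};
```

The sketch re-checks the condition in a loop (the predicate form of wait) rather than with a single if, because std::condition_variable allows spurious wakeups.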

Related

How one thread make something instead of waiting on condition variable

I searched for a long time before asking this question and couldn't find a solution to my problem.
I have five threads (workers). The workers mine gold, transport it to the avant-poste, and unload it there.
My problem is that while a worker is mining gold, the user can input b to check whether there is enough gold and, if so, build a barrack.
While a worker is mining gold there is a 2-second sleep, which is why I use pthread_cond_timedwait().
I have global variables that store the number of barracks, the gold on the map, and the gold in the avant-poste.
Here is the pseudocode.
void makeBarrack(size_t data) {
timespec waitTime = { 2, 0 };
pthread_mutex_lock(&check_mutex);
while (wantBarrack) {
pthread_cond_timedwait(&condp, &gold_mutex, &waitTime);
}
std::cout << "Worker" << data << "is making barrack" << std::endl;
wantBarrack = false;
pthread_mutex_lock(&unload_mutex);
avantPost -= 100;
pthread_mutex_unlock(&unload_mutex);
barracks++;
pthread_mutex_unlock(&check_mutex);
}
void *work(void *data, char input) {
size_t thread_num = (size_t) data;
pthread_mutex_lock(&gold_mutex);
timespec waitTime = { 2, 0 };
if ((input == 'B' || input == 'b') && avantPost >= 100) {
wantBarrack = true;
input = 0;
} else if ((input == 'B' || input == 'b') && avantPoste < 100) {
std::cout << "There is " << avantPoste << " gold" << std::endl;
}
while (wantBarrack) {
pthread_cond_timedwait(&condp, &gold_mutex, &waitTime);
}
makeBarrack(data);
}
I am trying to build something like producer/consumer, but in my task a thread needs to do something (mine gold) instead of waiting for other threads to mine.
My other question: do I need to use the same mutex in these two functions?
P.S.
I am a novice in multithreading, so feel free to edit my question if something is wrong.

The problem was that I had learnt I could use a cv in a simple if. The main reason to use a cv is that we can block our thread without blocking other threads (it unlocks the mutex while waiting on the cv). We then just need to signal that the condition is met, which unblocks (releases) the thread so it can run the function we want. I am using pthread_cond_timedwait() because it lets me block my thread for as long as I want.
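The behaviour described here (the mutex is released while waiting and re-acquired before returning) can be sketched with the standard C++ equivalents. This is only an illustration under assumptions: barrack_cv, waitForBarrackRequest and requestBarrack are made-up names, and a single mutex guards the flag, which also bears on the "same mutex" question, since a condition variable should always be used with the mutex that protects its condition.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

std::mutex gold_mutex;
std::condition_variable barrack_cv;   // hypothetical name
bool wantBarrack = false;             // shared flag, only touched under gold_mutex

// Waits up to `timeout` for a barrack request.
// Returns true if one arrived, false if the wait timed out.
bool waitForBarrackRequest(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(gold_mutex);
    // wait_for unlocks gold_mutex while blocked and re-locks it before returning;
    // the predicate is re-checked after every wakeup, including spurious ones.
    return barrack_cv.wait_for(lock, timeout, [] { return wantBarrack; });
}

// Called by the thread that handles user input.
void requestBarrack() {
    {
        std::lock_guard<std::mutex> lock(gold_mutex);
        wantBarrack = true;
    }
    barrack_cv.notify_one();
}
```

Note that pthread_cond_timedwait() takes an absolute deadline rather than a duration, so with the POSIX API the 2 seconds would have to be added to the current time first; wait_for takes a relative timeout directly.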

C++: Thread pool slower than single threading?

First of all I did look at the other topics on this website and found they don't relate to my problem as those mostly deal with people using I/O operations or thread creation overheads. My problem is that my threadpool or worker-task structure implementation is (in this case) a lot slower than single threading. I'm really confused by this and not sure if it's the ThreadPool, the task itself, how I test it, the nature of threads or something out of my control.
// Sorry for the long code
#include <vector>
#include <queue>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <chrono>
#include <future>
#include "task.hpp"
class ThreadPool
{
public:
ThreadPool()
{
for (unsigned i = 0; i < std::thread::hardware_concurrency() - 1; i++)
m_workers.emplace_back(this, i);
m_running = true;
for (auto&& worker : m_workers)
worker.start();
}
~ThreadPool()
{
m_running = false;
m_task_signal.notify_all();
for (auto&& worker : m_workers)
worker.terminate();
}
void add_task(Task* task)
{
{
std::unique_lock<std::mutex> lock(m_in_mutex);
m_in.push(task);
}
m_task_signal.notify_one();
}
private:
class Worker
{
public:
Worker(ThreadPool* parent, unsigned id) : m_parent(parent), m_id(id)
{}
~Worker()
{
terminate();
}
void start()
{
m_thread = new std::thread(&Worker::work, this);
}
void terminate()
{
if (m_thread)
{
if (m_thread->joinable())
{
m_thread->join();
delete m_thread;
m_thread = nullptr;
m_parent = nullptr;
}
}
}
private:
void work()
{
while (m_parent->m_running)
{
std::unique_lock<std::mutex> lock(m_parent->m_in_mutex);
m_parent->m_task_signal.wait(lock, [&]()
{
return !m_parent->m_in.empty() || !m_parent->m_running;
});
if (!m_parent->m_running) break;
Task* task = m_parent->m_in.front();
m_parent->m_in.pop();
// Fixed the mutex being locked while the task is executed
lock.unlock();
task->execute();
}
}
private:
ThreadPool* m_parent = nullptr;
unsigned m_id = 0;
std::thread* m_thread = nullptr;
};
private:
std::vector<Worker> m_workers;
std::mutex m_in_mutex;
std::condition_variable m_task_signal;
std::queue<Task*> m_in;
bool m_running = false;
};
class TestTask : public Task
{
public:
TestTask() {}
TestTask(unsigned number) : m_number(number) {}
inline void Set(unsigned number) { m_number = number; }
void execute() override
{
if (m_number <= 3)
{
m_is_prime = m_number > 1;
return;
}
else if (m_number % 2 == 0 || m_number % 3 == 0)
{
m_is_prime = false;
return;
}
else
{
for (unsigned i = 5; i * i <= m_number; i += 6)
{
if (m_number % i == 0 || m_number % (i + 2) == 0)
{
m_is_prime = false;
return;
}
}
m_is_prime = true;
return;
}
}
public:
unsigned m_number = 0;
bool m_is_prime = false;
};
int main()
{
ThreadPool pool;
unsigned num_tasks = 1000000;
std::vector<TestTask> tasks(num_tasks);
for (auto&& task : tasks)
task.Set(randint(0, 1000000000));
auto s = std::chrono::high_resolution_clock::now();
#if MT
for (auto&& task : tasks)
pool.add_task(&task);
#else
for (auto&& task : tasks)
task.execute();
#endif
auto e = std::chrono::high_resolution_clock::now();
double seconds = std::chrono::duration_cast<std::chrono::nanoseconds>(e - s).count() / 1000000000.0;
}
Benchmarks with VS2013 Profiler:
10,000,000 tasks:
MT:
13 seconds of wall clock time
93.36% is spent in msvcp120.dll
3.45% is spent in Task::execute() // Not good here
ST:
0.5 seconds of wall clock time
97.31% is spent with Task::execute()

Usual disclaimer in such answers: the only way to tell for sure is to measure it with a profiler tool.
But I will try to explain your results without it. First of all, you have one mutex shared by all your threads, so only one thread at a time can execute a task. That kills any gains you might have had: despite the threads, your code is effectively serial. So at the very least, move task execution outside the mutex. You need to lock the mutex only to take a task out of the queue; you don't need to hold it while the task executes.
Next, your tasks are so simple that a single thread will execute them in no time. You just can't measure any gains with such tasks. Create some heavier tasks that could produce more interesting results (tasks closer to the real world, not so contrived).
And the third point: threads are not without cost (context switching, mutex contention, etc.). To see real gains, as the previous two points say, you need tasks that take more time than the overhead threads introduce, and the code should be truly parallel instead of waiting on some resource that makes it serial.
UPD: I looked at the wrong part of the code. The task is complex enough provided you create tasks with sufficiently large numbers.
UPD2: I've played with your code and found a good prime number to show how the MT code is better. Use the following prime number: 1019048297. It will give enough computation complexity to show the difference.
But why doesn't your code produce good results? It is hard to tell without seeing the implementation of randint(), but I take it to be pretty simple: in half of the cases it returns even numbers, and the other cases don't produce many big primes either. So the tasks are so simple that context switching and other overhead, both from your particular implementation and from threads in general, consume more time than the computation itself. Using the prime number I gave you leaves the tasks no choice but to spend time computing; there is no early exit, since the number is big and actually prime. That's why the big number will give you the answer you seek: better time for the MT code.

You should not hold the mutex while the task is getting executed, otherwise other threads will not be able to get a task:
void work() {
while (m_parent->m_running) {
Task* currentTask = nullptr;
std::unique_lock<std::mutex> lock(m_parent->m_in_mutex);
m_parent->m_task_signal.wait(lock, [&]() {
return !m_parent->m_in.empty() || !m_parent->m_running;
});
if (!m_parent->m_running) continue;
currentTask = m_parent->m_in.front();
m_parent->m_in.pop();
lock.unlock(); //<- Release the lock so that other threads can get tasks
currentTask->execute();
currentTask = nullptr;
}
}

For MT, how much time is spent in each phase of the "overhead": std::unique_lock, m_task_signal.wait, front, pop, unlock?
Based on your results of only 3% useful work, this means the above consumes 97%. I'd get numbers for each part of the above (e.g. add timestamps between each call).
It seems to me, that the code you use to [merely] dequeue the next task pointer is quite heavy. I'd do a much simpler queue [possibly lockless] mechanism. Or, perhaps, use atomics to bump an index into the queue instead of the five step process above. For example:
void
work()
{
while (m_parent->m_running) {
// NOTE: this is just an example, not necessarily the real function
int curindex = atomic_increment(&global_index);
if (curindex >= max_index)
break;
Task *task = m_parent->m_in[curindex];
task->execute();
}
}
Also, maybe you should pop [say] ten at a time instead of just one.
You might also be memory bound and/or "task switch" bound. (e.g.) For threads that access an array, more than four threads usually saturates the memory bus. You could also have heavy contention for the lock, such that the threads get starved because one thread is monopolizing the lock [indirectly, even with the new unlock call]
Interthread locking usually involves a "serialization" operation where other cores must synchronize their out-of-order execution pipelines.
Here's a "lockless" implementation:
void
work()
{
// assume m_id is 0,1,2,...
int curindex = m_id;
while (m_parent->m_running) {
if (curindex >= max_index)
break;
Task *task = m_parent->m_in[curindex];
task->execute();
curindex += NUMBER_OF_WORKERS;
}
}
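The atomic_increment() call in the first snippet is a placeholder. A minimal sketch of the same index-claiming idea with std::atomic, assuming all tasks are known before the workers start (parallel_for_each_index is a made-up helper, not part of the question's ThreadPool):

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: runs fn(i) once for every i in [0, count),
// spread over num_threads worker threads.
template <typename Fn>
void parallel_for_each_index(std::size_t count, unsigned num_threads, Fn fn) {
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (;;) {
                // fetch_add returns the previous value, so each index
                // is handed to exactly one worker.
                std::size_t i = next.fetch_add(1, std::memory_order_relaxed);
                if (i >= count) break;
                fn(i);
            }
        });
    }
    for (auto& w : workers) w.join();   // join makes all results visible to the caller
}
```

With the question's data this would be called roughly as parallel_for_each_index(tasks.size(), n, [&](std::size_t i) { tasks[i].execute(); }); there is no mutex or condition variable on the hot path, only one atomic increment per task.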

Is there a way to synchronize this without locks? [closed]

Say I have 3 functions that can be called by an upper layer:
Start - Will only be called if we haven't been started yet, or Stop was previously called
Stop - Will only be called after a successful call to Start
Process - Can be called at any time (simultaneously on different threads); if started, will call into lower layer
In Stop, it must wait for all Process calls to finish calling into the lower layer, and prevent any further calls. With a locking mechanism, I can come up with the following pseudo code:
Start() {
ResetEvent(&StopCompleteEvent);
IsStarted = true;
RefCount = 0;
}
Stop() {
AcquireLock();
IsStarted = false;
WaitForCompletionEvent = (RefCount != 0);
ReleaseLock();
if (WaitForCompletionEvent)
WaitForEvent(&StopCompleteEvent);
ASSERT(RefCount == 0);
}
Process() {
AcquireLock();
AddedRef = IsStarted;
if (AddedRef)
RefCount++;
ReleaseLock();
if (!AddedRef) return;
ProcessLowerLayer();
AcquireLock();
FireCompletionEvent = (--RefCount == 0);
ReleaseLock();
if (FireCompletionEvent)
SetEvent(&StopCompleteEvent);
}
Is there a way to achieve the same behavior without a locking mechanism? Perhaps with some fancy usage of InterlockedCompareExchange and InterlockedIncrement/InterlockedDecrement?
The reason I ask is that this is in the data path of a network driver and I would really prefer not to have any locks.

I believe it is possible to avoid the use of explicit locks and any unnecessary blocking or kernel calls.
Note that this is pseudo-code only, for illustrative purposes; it hasn't seen a compiler. And while I believe the threading logic is sound, please verify its correctness for yourself, or get an expert to validate it; lock-free programming is hard.
#define STOPPING 0x20000000
#define STOPPED 0x40000000
volatile LONG s = STOPPED;
// state and count
// bit 30 set -> stopped
// bit 29 set -> stopping
// bits 0 through 28 -> thread count
Start()
{
KeClearEvent(&StopCompleteEvent);
LONG n = InterlockedExchange(&s, 0); // sets s to 0
if ((n & STOPPED) == 0)
bluescreen("Invalid call to Start()");
}
Stop()
{
LONG n = InterlockedCompareExchange(&s, STOPPED, 0);
if (n == 0)
{
// No calls to Process() were running so we could jump directly to stopped.
// Mission accomplished!
return;
}
n = InterlockedOr(&s, STOPPING);
if ((n & STOPPED) != 0)
bluescreen("Stop called when already stopped");
if ((n & STOPPING) != 0)
bluescreen("Stop called when already stopping");
n = InterlockedCompareExchange(&s, STOPPED, STOPPING);
if (n == STOPPING)
{
// The last call to Process() exited before we set the STOPPING flag.
// Mission accomplished!
return;
}
// Now that STOPPING mode is set, and we know at least one call to Process
// is running, all we need do is wait for the event to be signaled.
KeWaitForSingleObject(...);
// The event is only ever signaled after a thread has successfully
// changed the state to STOPPED. Mission accomplished!
return;
}
Process()
{
LONG n = InterlockedCompareExchange(&s, STOPPED, STOPPING);
if (n == STOPPING)
{
// We've just stopped; let the call to Stop() complete.
KeSetEvent(&StopCompleteEvent);
return;
}
if ((n & STOPPED) != 0 || (n & STOPPING) != 0)
{
// Checking here avoids changing the state unnecessarily when
// we already know we can't enter the lower layer.
// It also ensures that the transition from STOPPING to STOPPED can't
// be delayed even if there are lots of threads making new calls to Process().
return;
}
n = InterlockedIncrement(&s);
if ((n & STOPPED) != 0)
{
// Turns out we've just stopped, so the call to Process() must be aborted.
// Explicitly set the state back to STOPPED, rather than decrementing it,
// in case Start() has been called. At least one thread will succeed.
InterlockedCompareExchange(&s, STOPPED, n);
return;
}
if ((n & STOPPING) == 0)
{
ProcessLowerLayer();
}
n = InterlockedDecrement(&s);
if ((n & STOPPED) != 0 || n == (STOPPED - 1))
bluescreen("Stopped during call to Process, shouldn't be possible!");
if (n != STOPPING)
return;
// Stop() has been called, and it looks like we're the last
// running call to Process() in which case we need to change the
// status to STOPPED and signal the call to Stop() to exit.
// However, another thread might have beaten us to it, so we must
// check again. The event MUST only be set once per call to Stop().
n = InterlockedCompareExchange(&s, STOPPED, STOPPING);
if (n == STOPPING)
{
// We've just stopped; let the call to Stop() complete.
KeSetEvent(&StopCompleteEvent);
}
return;
}
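For readers who want to experiment outside a driver, here is a simplified, portable sketch of the same "flags plus count in one word" idea using std::atomic. It is not a port of the code above: it enters with a compare-exchange loop instead of an unconditional increment, and it leaves the event signalling to the caller. StopGate and its method names are made up for illustration, and like the code above it should be verified independently before being relied on.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sketch: bit 30 = stopped, bit 29 = stopping, low bits = callers in flight.
class StopGate {
public:
    static constexpr std::uint32_t kStopping = 0x20000000u;
    static constexpr std::uint32_t kStopped  = 0x40000000u;

    void start() { state_.store(0); }

    // Process(): returns true if the caller may enter the lower layer.
    bool try_enter() {
        std::uint32_t s = state_.load();
        while ((s & (kStopping | kStopped)) == 0) {
            // On failure, s is reloaded with the current value and re-checked.
            if (state_.compare_exchange_weak(s, s + 1)) return true;
        }
        return false;
    }

    // Process(): call after the lower layer returns. Returns true for exactly
    // one caller per stop, the last one out; that caller must signal the event.
    bool leave() {
        std::uint32_t prev = state_.fetch_sub(1);
        if (prev != (kStopping | 1u)) return false;
        std::uint32_t expected = kStopping;
        return state_.compare_exchange_strong(expected, kStopped);
    }

    // Stop(): returns true if callers were in flight and Stop must wait for
    // the event; returns false if the gate went straight to stopped.
    bool request_stop() {
        std::uint32_t prev = state_.fetch_or(kStopping);
        if (prev != 0) return true;
        state_.store(kStopped);   // no callers, and none can enter while stopping
        return false;
    }

private:
    std::atomic<std::uint32_t> state_{kStopped};
};
```

Because try_enter() never increments once a flag is set, the count can only fall after request_stop(), which is what lets exactly one leave() call observe the transition to "stopping with zero callers".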

Linux 3.0: futex-lock deadlock bug?

// SubFetch(x,y) = atomically x-=y and return x (__sync_sub_and_fetch)
// AddFetch(x,y) = atomically x+=y and return x (__sync_add_and_fetch)
// CompareWait(x, y) = futex(&x, FUTEX_WAIT, y) wait on x if x == y
// Wake(x, y) = futex(&x, FUTEX_WAKE, y) wake up y waiters
struct Lock
{
Lock() : x(1) {}
void lock()
{
while (true)
{
if (SubFetch(x, 1) == 0)
return;
x = -1;
CompareWait(x, -1);
}
}
void unlock()
{
if (AddFetch(x, 1) == 1)
return;
x = 1;
Wake(x, 1);
}
private:
int x;
};
Linux 3.0 provides a system call called futex, upon which many concurrency utilities are based including recent pthread_mutex implementations. Whenever you write code you should always consider whether using an existing implementation or writing it yourself is the better choice for your project.
Above is an implementation of a Lock (mutex, 1 permit counting semaphore) based upon futex and the semantics description in man futex(7)
It appears to contain a deadlock bug: after multiple threads have locked and unlocked it a few thousand times, they can get into a state where x == -1 and all the threads are stuck in CompareWait, even though no one is holding the lock.
Can anyone see where the bug is?
Update: I'm a little surprised that futex(7)/semantics is so broken. I completely rewrote Lock as follows... is this correct now?
// CompareAssign(x,y,z) atomically: if (x == y) {x = z; ret true; } else ret false;
struct Lock
{
Lock() : x(0) {}
void lock()
{
while (!CompareAssign(x, 0, 1))
if (x == 2 || CompareAssign(x, 1, 2))
CompareWait(x, 2);
}
void unlock()
{
if (SubFetch(x, 1) == 0)
return;
x = 0;
Wake(x, 1);
}
private:
int x;
};
The idea here is that x has the following three states:
0: unlocked
1: locked & no waiters
2: locked & waiters

The problem is that you explicitly assign -1 to x if the SubFetch fails to acquire the lock. This races with the unlock.
Thread 1 acquires the lock. x==0.
Thread 2 tries to acquire the lock. The SubFetch sets x to -1, and then thread 2 is suspended.
Thread 1 releases the lock. The AddFetch sets x to 0, so the code then explicitly sets x to 1 and calls Wake.
Thread 2 wakes up and sets x to -1, and then calls CompareWait.
Thread 2 is now stuck waiting, with x set to -1, but there is no one around to wake it, as thread 1 has already released the lock.
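The steps above can be replayed on a plain int to check the arithmetic. This is a single-threaded trace, not a reproduction with real threads or futex calls; replay_lost_wakeup is a made-up name.

```cpp
// Replays the interleaving described above on a plain int.
// Returns the final value of x, with which thread 2 is left waiting.
int replay_lost_wakeup() {
    int x = 1;   // initial state: unlocked
    x -= 1;      // T1 lock():   SubFetch -> 0, lock acquired
    x -= 1;      // T2 lock():   SubFetch -> -1, failed; T2 is suspended here
    x += 1;      // T1 unlock(): AddFetch -> 0, not 1, so take the slow path
    x = 1;       // T1 unlock(): x = 1
                 // T1 unlock(): Wake(x, 1) finds nobody waiting yet
    x = -1;      // T2 resumes:  x = -1
                 // T2:          CompareWait(x, -1) blocks, and no Wake will follow
    return x;
}
```

The lock is free, yet x says "contended" and the only wake-up has already been spent.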

The proper implementation of a futex-based Mutex is described in Ulrich Drepper's paper "Futexes are tricky"
http://people.redhat.com/drepper/futex.pdf
It includes not only the code but also a very detailed explanation of why it is correct. The code from the paper:
class mutex
{
public:
mutex () : val (0) { }
void lock () {
int c;
if ((c = cmpxchg (val, 0, 1)) != 0)
do {
if (c == 2 || cmpxchg (val, 1, 2) != 0)
futex_wait (&val, 2);
} while ((c = cmpxchg (val, 0, 2)) != 0);
}
void unlock () {
//NOTE: atomic_dec returns the value BEFORE the operation, unlike your SubFetch !
if (atomic_dec (val) != 1) {
val = 0;
futex_wake (&val, 1);
}
}
private:
int val;
};
Comparing the code in the paper with your code, I spot a difference.
You have
if (x == 2 || CompareAssign(x, 1, 2))
using the futex's value directly whereas Drepper uses the return value from the previous CompareAssign(). That difference will probably affect performance only.
Your unlock code is different, too, but seems to be semantically equivalent.
In any case I would strongly advise you to follow Drepper's code to the letter. That paper has stood the test of time and received a lot of peer review. You gain nothing from rolling your own.

How about this scenario with three threads, A, B , and C.
The initial state of this scenario has:
thread A holding the lock
thread B not contending for the lock just yet
thread C in CompareWait()
x == -1 from when C failed to acquire the lock
A               B                 C
==============  ================  ===============
AddFetch()
(so x == 0)
                SubFetch()
                (so x == -1)
x = 1
                x = -1
Wake()
At this point whether B or C are unblocked, they will not get a result of 0 when they SubFetch().

Win32 Read/Write Lock Using Only Critical Sections

I have to implement a read/write lock in C++ using the Win32 api as part of a project at work. All of the existing solutions use kernel objects (semaphores and mutexes) that require a context switch during execution. This is far too slow for my application.
I would like to implement one using only critical sections, if possible. The lock does not have to be process-safe, only thread-safe. Any ideas on how to go about this?

If you can target Vista or later, you should use the built-in SRWLocks. They are lightweight like critical sections and run entirely in user mode when there is no contention.
Joe Duffy's blog has some recent entries on implementing different types of non-blocking reader/writer locks. These locks do spin, so they would not be appropriate if you intend to do a lot of work while holding the lock. The code is C#, but should be straightforward to port to native.
You can implement a reader/writer lock using critical sections and events - you just need to keep enough state to only signal the event when necessary to avoid an unnecessary kernel mode call.

I don't think this can be done without using at least one kernel-level object (Mutex or Semaphore), because you need the help of the kernel to make the calling process block until the lock is available.
Critical sections do provide blocking, but the API is too limited. e.g. you cannot grab a CS, discover that a read lock is available but not a write lock, and wait for the other process to finish reading (because if the other process has the critical section it will block other readers which is wrong, and if it doesn't then your process will not block but spin, burning CPU cycles.)
However what you can do is use a spin lock and fall back to a mutex whenever there is contention. The critical section is itself implemented this way. I would take an existing critical section implementation and replace the PID field with separate reader & writer counts.

Old question, but this is something that should work. It doesn't spin on contention. Readers incur limited extra cost if they have little or no contention, because SetEvent is called lazily (look at the edit history for a more heavyweight version that doesn't have this optimization).
#include <windows.h>
#include <assert.h>
typedef struct _RW_LOCK {
CRITICAL_SECTION countsLock;
CRITICAL_SECTION writerLock;
HANDLE noReaders;
int readerCount;
BOOL waitingWriter;
} RW_LOCK, *PRW_LOCK;
void rwlock_init(PRW_LOCK rwlock)
{
InitializeCriticalSection(&rwlock->writerLock);
InitializeCriticalSection(&rwlock->countsLock);
/*
* Could use a semaphore as well. There can only be one waiter ever,
* so I'm showing an auto-reset event here.
*/
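rwlock->noReaders = CreateEvent (NULL, FALSE, FALSE, NULL);
/* Start with no readers and no waiting writer. */
rwlock->readerCount = 0;
rwlock->waitingWriter = FALSE;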
}
void rwlock_rdlock(PRW_LOCK rwlock)
{
/*
* We need to lock the writerLock too, otherwise a writer could
* do the whole of rwlock_wrlock after the readerCount changed
* from 0 to 1, but before the event was reset.
*/
EnterCriticalSection(&rwlock->writerLock);
EnterCriticalSection(&rwlock->countsLock);
++rwlock->readerCount;
LeaveCriticalSection(&rwlock->countsLock);
LeaveCriticalSection(&rwlock->writerLock);
}
void rwlock_wrlock(PRW_LOCK rwlock)
{
EnterCriticalSection(&rwlock->writerLock);
/*
* readerCount cannot become non-zero within the writerLock CS,
* but it can become zero...
*/
if (rwlock->readerCount > 0) {
EnterCriticalSection(&rwlock->countsLock);
/* ... so test it again. */
if (rwlock->readerCount > 0) {
rwlock->waitingWriter = TRUE;
LeaveCriticalSection(&rwlock->countsLock);
WaitForSingleObject(rwlock->noReaders, INFINITE);
} else {
/* How lucky, no need to wait. */
LeaveCriticalSection(&rwlock->countsLock);
}
}
/* writerLock remains locked. */
}
void rwlock_rdunlock(PRW_LOCK rwlock)
{
EnterCriticalSection(&rwlock->countsLock);
assert (rwlock->readerCount > 0);
if (--rwlock->readerCount == 0) {
if (rwlock->waitingWriter) {
/*
* Clear waitingWriter here to avoid taking countsLock
* again in wrlock.
*/
rwlock->waitingWriter = FALSE;
SetEvent(rwlock->noReaders);
}
}
LeaveCriticalSection(&rwlock->countsLock);
}
void rwlock_wrunlock(PRW_LOCK rwlock)
{
LeaveCriticalSection(&rwlock->writerLock);
}
You could decrease the cost for readers by using a single CRITICAL_SECTION:
countsLock is replaced with writerLock in rdlock and rdunlock
rwlock->waitingWriter = FALSE is removed in rdunlock
wrlock's body is changed to
EnterCriticalSection(&rwlock->writerLock);
rwlock->waitingWriter = TRUE;
while (rwlock->readerCount > 0) {
LeaveCriticalSection(&rwlock->writerLock);
WaitForSingleObject(rwlock->noReaders, INFINITE);
EnterCriticalSection(&rwlock->writerLock);
}
rwlock->waitingWriter = FALSE;
/* writerLock remains locked. */
However this loses in fairness, so I prefer the above solution.

Take a look at the book "Concurrent Programming on Windows" which has lots of different reference examples for reader/writer locks.

Check out the spin_rw_mutex from Intel's Threading Building Blocks. spin_rw_mutex stays strictly in user land and spin-waits when it has to block.

This is an old question but perhaps someone will find this useful. We developed a high-performance, open-source RWLock for Windows that automatically uses Vista+ SRWLock Michael mentioned if available, or otherwise falls back to a userspace implementation.
As an added bonus, there are four different "flavors" of it (though you can stick to the basic, which is also the fastest), each providing more synchronization options. It starts with the basic RWLock() which is non-reentrant, limited to single-process synchronization, and no swapping of read/write locks to a full-fledged cross-process IPC RWLock with re-entrance support and read/write de-elevation.
As mentioned, they dynamically swap out to the Vista+ slim read-write locks for best performance when possible, but you don't have to worry about that at all as it'll fall back to a fully-compatible implementation on Windows XP and its ilk.

If you already know of a solution that only uses mutexes, you should be able to modify it to use critical sections instead.
We rolled our own using two critical sections and some counters. It suits our needs - we have a very low writer count, writers get precedence over readers, etc. I'm not at liberty to publish ours but can say that it is possible without mutexes and semaphores.

Here is the smallest solution that I could come up with:
http://www.baboonz.org/rwlock.php
And pasted verbatim:
/** A simple Reader/Writer Lock.
This RWL has no events - we rely solely on spinlocks and sleep() to yield control to other threads.
I don't know what the exact penalty is for using sleep vs events, but at least when there is no contention, we are basically
as fast as a critical section. This code is written for Windows, but it should be trivial to find the appropriate
equivalents on another OS.
**/
class TinyReaderWriterLock
{
public:
volatile uint32 Main;
static const uint32 WriteDesireBit = 0x80000000;
void Noop( uint32 tick )
{
if ( ((tick + 1) & 0xfff) == 0 ) // Sleep after 4k cycles. Crude, but usually better than spinning indefinitely.
Sleep(0);
}
TinyReaderWriterLock() { Main = 0; }
~TinyReaderWriterLock() { ASSERT( Main == 0 ); }
void EnterRead()
{
for ( uint32 tick = 0 ;; tick++ )
{
uint32 oldVal = Main;
if ( (oldVal & WriteDesireBit) == 0 )
{
if ( InterlockedCompareExchange( (LONG*) &Main, oldVal + 1, oldVal ) == oldVal )
break;
}
Noop(tick);
}
}
void EnterWrite()
{
for ( uint32 tick = 0 ;; tick++ )
{
if ( (tick & 0xfff) == 0 ) // Set the write-desire bit every 4k cycles (including cycle 0).
_InterlockedOr( (LONG*) &Main, WriteDesireBit );
uint32 oldVal = Main;
if ( oldVal == WriteDesireBit )
{
if ( InterlockedCompareExchange( (LONG*) &Main, -1, WriteDesireBit ) == WriteDesireBit )
break;
}
Noop(tick);
}
}
void LeaveRead()
{
ASSERT( Main != -1 );
InterlockedDecrement( (LONG*) &Main );
}
void LeaveWrite()
{
ASSERT( Main == -1 );
InterlockedIncrement( (LONG*) &Main );
}
};

I wrote the following code using only critical sections.
class ReadWriteLock {
volatile LONG writelockcount;
volatile LONG readlockcount;
CRITICAL_SECTION cs;
public:
ReadWriteLock() {
InitializeCriticalSection(&cs);
writelockcount = 0;
readlockcount = 0;
}
~ReadWriteLock() {
DeleteCriticalSection(&cs);
}
void AcquireReaderLock() {
retry:
while (writelockcount) {
Sleep(0);
}
EnterCriticalSection(&cs);
if (!writelockcount) {
readlockcount++;
}
else {
LeaveCriticalSection(&cs);
goto retry;
}
LeaveCriticalSection(&cs);
}
void ReleaseReaderLock() {
EnterCriticalSection(&cs);
readlockcount--;
LeaveCriticalSection(&cs);
}
void AcquireWriterLock() {
retry:
while (writelockcount||readlockcount) {
Sleep(0);
}
EnterCriticalSection(&cs);
if (!writelockcount&&!readlockcount) {
writelockcount++;
}
else {
LeaveCriticalSection(&cs);
goto retry;
}
LeaveCriticalSection(&cs);
}
void ReleaseWriterLock() {
EnterCriticalSection(&cs);
writelockcount--;
LeaveCriticalSection(&cs);
}
};
To perform a spin-wait, comment out the lines with Sleep(0).

See my implementation here:
https://github.com/coolsoftware/LockLib
VRWLock is a C++ class that implements single-writer, multiple-reader logic.
See also the test project TestLock.sln.
UPD. Below is the simple code for reader and writer:
LONG gCounter = 0;
// reader
for (;;) //loop
{
LONG n = InterlockedIncrement(&gCounter);
// n = value of gCounter after increment
if (n <= MAX_READERS) break; // writer does not write anything - we can read
InterlockedDecrement(&gCounter);
}
// read data here
InterlockedDecrement(&gCounter); // release reader
// writer
for (;;) //loop
{
LONG n = InterlockedCompareExchange(&gCounter, (MAX_READERS+1), 0);
// n = value of gCounter before attempt to replace it by MAX_READERS+1 in InterlockedCompareExchange
// if gCounter was 0 - no readers/writers and in gCounter will be MAX_READERS+1
// if gCounter was not 0 - gCounter stays unchanged
if (n == 0) break;
}
// write data here
InterlockedExchangeAdd(&gCounter, -(MAX_READERS+1)); // release writer
The VRWLock class supports a spin count and a thread-specific reference count, which allows releasing locks held by terminated threads.
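The counter scheme in the snippet above is not tied to the Interlocked API. A portable sketch with std::atomic, written as non-blocking try functions so the caller decides whether to spin, yield or sleep between attempts (the function names and the value of kMaxReaders are assumptions, not taken from LockLib):

```cpp
#include <atomic>

constexpr long kMaxReaders = 1L << 20;   // assumed upper bound on concurrent readers
std::atomic<long> gCounter{0};           // 0 = free, 1..kMaxReaders = readers, above = writer

// Reader: returns true if the read lock was taken.
bool try_read_lock() {
    long n = gCounter.fetch_add(1) + 1;      // value after the increment
    if (n <= kMaxReaders) return true;       // no writer present
    gCounter.fetch_sub(1);                   // a writer holds the lock: undo
    return false;
}

void read_unlock() { gCounter.fetch_sub(1); }

// Writer: succeeds only if there are no readers and no writer.
bool try_write_lock() {
    long expected = 0;
    return gCounter.compare_exchange_strong(expected, kMaxReaders + 1);
}

void write_unlock() { gCounter.fetch_sub(kMaxReaders + 1); }
```

As in the original, a steady stream of readers can starve a writer, because the writer only gets in when the counter is exactly 0.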
