std::condition_variable memory writes visibility - c++

Does std::condition_variable::notify_one() or std::condition_variable::notify_all() guarantee that non-atomic memory writes in the current thread prior to the call will be visible in notified threads?
Other threads do:
{
std::unique_lock lock(mutex);
cv.wait(lock, []() { return values[threadIndex] != 0; });
// May a thread here see a zero value and therefore start to wait again?
}
Main thread does:
fillData(values); // All values are zero and all threads wait() before calling this.
cv.notify_all(); // Do need some memory fence or lock before this
// to ensure that new non-zero values will be visible
// in other threads immediately after waking up?
Doesn't notify_all() store some atomic value therefore enforcing memory ordering? I did not clarified it.
UPD: according to Superlokkus' answer and an answer here: we have to acquire a lock to ensure memory writes visibility in other threads (memory propagation), otherwise threads in my case may read zero values.
Also I missed this quote here about condition_variable, which specifically answers my question. Even an atomic variable has to be modified under a lock in a case when the modification must become visible immediately.
Even if the shared variable is atomic, it must be modified under the
mutex in order to correctly publish the modification to the waiting
thread.

I guess you are mixing up memory ordering of so called atomic values and the mechanisms of classic lock based synchronization.
When you have a datum which is shared between threads, lets say an int for example, one thread can not simply read it while the other thread might be write to it meanwhile. Otherwise we would have a data race.
To get around this for long time we used classic lock based synchronization:
The threads share at least a mutex and the int. To read or to write any thread has to hold the lock first, meaning they wait on the mutex. Mutexes are build so that they are fine that this can happen concurrently. If a thread wins gettting the mutex it can change or read the int and then should unlock it, so others can read/write too. Using a conditional variable like you used is just to make the pattern "readers wait for a change of a value by a writer" more efficient, they get woken up by the cv instead of periodically waiting on the lock, reading, and unlocking, which would be called busy waiting.
So because you hold the lock in any after waiting on the mutex or in you case, correctly (mutex is still needed) waiting on the conditional variable, you can change the int. And readers will read the new value after the writer was able to wrote it, never the old. UPDATE: However one thing if have to add, which might also be the cause of confusion: Conditional variables are subject for so called spurious wakeups. Meaning even though you write did not have notified any thread, a read thread might still wake up, with the mutex locked. So you have to check if you writer actually waked you up, which is usually done by the writer by changing another datum just to notify this, or if its suitable by using the same datum you already wanted to share. The lambda parameter overload of std::condition_variable::wait was just made to make the checking and going back to sleep code looking a bit prettier. Based on your question now I don't know if you want to use you values for this job.
However at snippet for the "main" thread is incorrect or incomplete:
You are not synchronizing on the mutex in order to change values.
You have to hold the lock for that, but notifying can be done without the lock.
std::unique_lock lock(mutex);
fillData(values);
lock.unlock();
cv.notify_all();
But these mutex based patters have some drawbacks and are slow, only one thread at a time can do something. This is were so called atomics, like std::atomic<int> came into play. They can be written and read at the same time without an mutex by multiple threads concurrently. Memory ordering is only a thing to consider there and an optimization for cases where you uses several of them in a meaningful way or you don't need the "after the write, I never see the old value" guarantee. However with it's default memory ordering memory_order_seq_cst you would also be fine.

Related

What does std::mutex prevent threads from modifying?

What part of memory gets locked by mutex when .lock() or .try_lock(), is it just the function or is it the whole program that gets locked?
Nothing is locked except the mutex. Everything else continues running (until it tries to lock an already locked mutex that is). The mutex is only there so that two threads cannot run the code between a mutex lock and a mutex unlock at the same time.
A mutex doesn't really lock anything, except for itself. You can think of a mutex as being a gate where you can only unlock it from the inside. When the gate is locked, any thread that tries to lock the mutex will sit there at the gate and wait for the current thread that is behind the gate to unlock it and let them in. When they gate is not locked then when you call lock you can just go in, close and lock the gate, and now no threads can get past the gate until you unlock it and let them in.
A mutex doesn't lock anything. You just use a mutex to communicate to other parts of your code that they should consider whatever you decide needs to be protected from access by several threads at the same time to be off-limits for now.
You could think of a mutex as something like a boolean okToModify. Whenever you want to edit something, you check if okToModify is true. If it is, you set it to false (preventing any other threads from modifying it), change it, then set okToModify back to true to tell the other threads you're done and give them a chance to modify:
// WARNING! This code doesn't actually work as a lock!
// it is just an example of the concept.
struct LockedInt {
bool okToModify; // This would be your mutex instead of a bool.
int integer;
};
struct LockedInt myLockedInt = { true, 0 };
...
while (myLockedInt.okToModify == false)
; // wait doing nothing until whoever is modifying the int is done.
myLockedInt.okToModify = false; // Prevent other threads from getting out of while loop above.
myLockedInt.integer += 1;
myLockedInt.okToModify = true; // Now other threads get out of the while loop if they were waiting and can modify.
The while loop and okToModify = false above is basically what locking a mutex does, and okToModify = true is what unlocking a mutex does.
Now, why do we need mutexes and don't use booleans? Because a thread could be running at the same time as those three lines above. The code for locking a mutex actually guarantees that the waiting for okToModify to become true and setting okToModify = false happen in one go, and therefore no other thread can get "in between the lines", for example by using a special machine-code instruction called "compare-and-exchange".
So do not use booleans instead of mutexes, but you can think of a mutex as a special, thread-safe boolean.
m.lock() doesn't really lock anything. What it does is, it waits to take ownership of the mutex. A mutex always either is owned by exactly one thread or else it is available. m.lock() waits until the mutex becomes available, and then it takes ownership of it in the name of the calling thread.
m.unlock releases the mutex (i.e., it relinquishes ownership), causing the mutex to once again become available.
Mutexes also perform another very important function. In modern C++, when some thread T performs a sequence of assignments of various values to various memory locations, the system makes no guarantees about when other threads U, V, and W will see those assignments, whether the other threads will see the assignments happen in the same order in which thread T performed them, or even, whether the other threads will ever see the assignments.
There are some quite complicated rules governing things that a programmer can do to ensure that different threads see a consistent view of shared memory objects (Google "C++ memory model"), but here's one simple rule:
Whatever thread T did before it releases some mutex M is guaranteed to be visible to any other thread U after thread U subsequently locks the same mutex M.

Need to mutex-protect (atomic) assignment sought by condition variable?

I understand how to use condition variables (crummy name for this construct, IMO, as the cv object neither is a variable nor indicates a condition). So I have a pair of threads, canonically set up with Boost.Thread as:
bool awake = false;
boost::mutex sync;
boost::condition_variable cv;
void thread1()
{
boost::unique_lock<boost::mutex> lock1(sync);
while (!awake)
cv.wait(lock1);
lock1.unlock(); // this line actually not canonical, but why not?
// proceed...
}
void thread2()
{
//...
boost::unique_lock<boost::mutex> lock2;
awake = true;
lock2.unlock();
cv.notify_all();
}
My question is: does thread2 really need to be protecting the assignment to awake? It seems to me the notify_all() call should be sufficient. If the data being manipulated and checked against were more than a simple "ok to proceed" flag, I see the value in the mutex, but here it seems like overkill.
A secondary question is that asked in the code fragment: Why doesn't the Boost documentation show the lock in thread1 being unlocked before the "process data" step?
EDIT: Maybe my question is really: Is there a cleaner construct than a CV to implement this kind of wait?
does thread2 really need to be protecting the assignment to awake?
Yes. Modifying an object from one thread and accessing it from another without synchronisation gives undefined behaviour. Even if it's just a bool.
For example, on some multiprocessor systems the write might only affect local memory; without an explicit synchronisation operation, other threads might never see the change.
Why doesn't the Boost documentation show the lock in thread1 being unlocked before the "process data" step?
If you unlocked the mutex before clearing the flag, then you might miss another signal.
Is there a cleaner construct than a CV to implement this kind of wait?
In Boost and the standard C++ library, no; a condition variable is flexible enough to handle arbitrary shared state and not particularly over-complicated for this simple case, so there's no particular need for anything simpler.
More generally, you could use a semaphore or a pipe to send a simple signal between threads.
Formally, you definitely need the lock in both threads: if any thread
modifies an object, and more than one thread accesses it, then all
accesses must be synchronized.
In practice, you'll probably get away with it without the lock; it's
almost certain that notify_all will issue the necessary fence or
membar instructions to ensure that the memory is properly synchronized.
But why take the risk?
As to the absense of the unlock, that's the whole point of the scoped
locking pattern: the unlock is in the destructor of the object, so
that the mutex will be unlocked even if an exception passes through.

Cheapest way to wake up multiple waiting threads without blocking

I use boost::thread to manage threads. In my program i have pool of threads (workers) that are activated sometimes to do some job simultaneously.
Now i use boost::condition_variable: and all threads are waiting inside boost::condition_variable::wait() call on their own conditional_variableS objects.
Can i AVOID using mutexes in classic scheme, when i work with conditional_variables? I want to wake up threads, but don't need to pass some data to them, so don't need a mutex to be locked/unlocked during awakening process, why should i spend CPU on this (but yes, i should remember about spurious wakeups)?
The boost::condition_variable::wait() call trying to REACQUIRE the locking object when CV received the notification. But i don't need this exact facility.
What is cheapest way to awake several threads from another thread?
If you don't reacquire the locking object, how can the threads know that they are done waiting? What will tell them that? Returning from the block tells them nothing because the blocking object is stateless. It doesn't have an "unlocked" or "not blocking" state for it to return in.
You have to pass some data to them, otherwise how will they know that before they had to wait and now they don't? A condition variable is completely stateless, so any state that you need must be maintained and passed by you.
One common pattern is to use a mutex, condition variable, and a state integer. To block, do this:
Acquire the mutex.
Copy the value of the state integer.
Block on the condition variable, releasing the mutex.
If the state integer is the same as it was when you coped it, go to step 3.
Release the mutex.
To unblock all threads, do this:
Acquire the mutex.
Increment the state integer.
Broadcast the condition variable.
Release the mutex.
Notice how step 4 of the locking algorithm tests whether the thread is done waiting? Notice how this code tracks whether or not there has been an unblock since the thread decided to block? You have to do that because condition variables don't do it themselves. (And that's why you need to reacquire the locking object.)
If you try to remove the state integer, your code will behave unpredictably. Sometimes you will block too long due to missed wakeups and sometimes you won't block long enough due to spurious wakeups. Only a state integer (or similar predicate) protected by the mutex tells the threads when to wait and when to stop waiting.
Also, I haven't seen how your code uses this, but it almost always folds into logic you're already using. Why did the threads block anyway? Is it because there's no work for them to do? And when they wakeup, are they going to figure out what to do? Well, finding out that there's no work for them to do and finding out what work they do need to do will require some lock since it's shared state, right? So there almost always is already a lock you're holding when you decide to block and need to reacquire when you're done waiting.
For controlling threads doing parallel jobs, there is a nice primitive called a barrier.
A barrier is initialized with some positive integer value N representing how many threads it holds. A barrier has only a single operation: wait. When N threads call wait, the barrier releases all of them. Additionally, one of the threads is given a special return value indicating that it is the "serial thread"; that thread will be the one to do some special job, like integrating the results of the computation from the other threads.
The limitation is that a given barrier has to know the exact number of threads. It's really suitable for parallel processing type situations.
POSIX added barriers in 2003. A web search indicates that Boost has them, too.
http://www.boost.org/doc/libs/1_33_1/doc/html/barrier.html
Generally speaking, you can't.
Assuming the algorithm looks something like this:
ConditionVariable cv;
void WorkerThread()
{
for (;;)
{
cv.wait();
DoWork();
}
}
void MainThread()
{
for (;;)
{
ScheduleWork();
cv.notify_all();
}
}
NOTE: I intentionally omitted any reference to mutexes in this pseudo-code. For the purposes of this example, we'll suppose ConditionVariable does not require a mutex.
The first time through MainTnread(), work is queued and then it notifies WorkerThread() that it should execute its work. At this point two things can happen:
WorkerThread() completes DoWork() before MainThread() can complete ScheduleWork().
MainThread() completes ScheduleWork() before WorkerThread() can complete DoWork().
In case #1, WorkerThread() comes back around to sleep on the CV, and is awoken by the next cv.notify() and all is well.
In case #2, MainThread() comes back around and notifies... nobody and continues on. Meanwhile WorkerThread() eventually comes back around in its loop and waits on the CV but it is now one or more iterations behind MainThread().
This is known as a "lost wakeup". It is similar to the notorious "spurious wakeup" in that the two threads now have different ideas about how many notify()s have taken place. If you are expecting the two threads to maintain synchrony (and usually you are), you need some sort of shared synchronization primitive to control it. This is where the mutex comes in. It helps avoid lost wakeups which, arguably, are a more serious problem than the spurious variety. Either way, the effects can be serious.
UPDATE: For further rationale behind this design, see this comment by one of the original POSIX authors: https://groups.google.com/d/msg/comp.programming.threads/cpJxTPu3acc/Hw3sbptsY4sJ
Spurious wakeups are two things:
Write your program carefully, and make sure it works even if you
missed something.
Support efficient SMP implementations
There may be rare cases where an "absolutely, paranoiacally correct"
implementation of condition wakeup, given simultaneous wait and
signal/broadcast on different processors, would require additional
synchronization that would slow down ALL condition variable operations
while providing no benefit in 99.99999% of all calls. Is it worth the
overhead? No way!
But, really, that's an excuse because we wanted to force people to
write safe code. (Yes, that's the truth.)
boost::condition_variable::notify_*(lock) does NOT require that the caller hold the lock on the mutex. THis is a nice improvement over the Java model in that it decouples the notification of threads with the holding of the lock.
Strictly speaking, this means the following pointless code SHOULD DO what you are asking:
lock_guard lock(mutex);
// Do something
cv.wait(lock);
// Do something else
unique_lock otherLock(mutex);
//do something
otherLock.unlock();
cv.notify_one();
I do not believe you need to call otherLock.lock() first.

Modelling boost::Lockable with semaphore rather than mutex (previously titled: Unlocking a mutex from a different thread)

I'm using the C++ boost::thread library, which in my case means I'm using pthreads. Officially, a mutex must be unlocked from the same thread which locks it, and I want the effect of being able to lock in one thread and then unlock in another. There are many ways to accomplish this. One possibility would be to write a new mutex class which allows this behavior.
For example:
class inter_thread_mutex{
bool locked;
boost::mutex mx;
boost::condition_variable cv;
public:
void lock(){
boost::unique_lock<boost::mutex> lck(mx);
while(locked) cv.wait(lck);
locked=true;
}
void unlock(){
{
boost::lock_guard<boost::mutex> lck(mx);
if(!locked) error();
locked=false;
}
cv.notify_one();
}
// bool try_lock(); void error(); etc.
}
I should point out that the above code doesn't guarantee FIFO access, since if one thread calls lock() while another calls unlock(), this first thread may acquire the lock ahead of other threads which are waiting. (Come to think of it, the boost::thread documentation doesn't appear to make any explicit scheduling guarantees for either mutexes or condition variables). But let's just ignore that (and any other bugs) for now.
My question is, if I decide to go this route, would I be able to use such a mutex as a model for the boost Lockable concept. For example, would anything go wrong if I use a boost::unique_lock< inter_thread_mutex > for RAII-style access, and then pass this lock to boost::condition_variable_any.wait(), etc.
On one hand I don't see why not. On the other hand, "I don't see why not" is usually a very bad way of determining whether something will work.
The reason I ask is that if it turns out that I have to write wrapper classes for RAII locks and condition variables and whatever else, then I'd rather just find some other way to achieve the same effect.
EDIT:
The kind of behavior I want is basically as follows. I have an object, and it needs to be locked whenever it is modified. I want to lock the object from one thread, and do some work on it. Then I want to keep the object locked while I tell another worker thread to complete the work. So the first thread can go on and do something else while the worker thread finishes up. When the worker thread gets done, it unlocks the mutex.
And I want the transition to be seemless so nobody else can get the mutex lock in between when thread 1 starts the work and thread 2 completes it.
Something like inter_thread_mutex seems like it would work, and it would also allow the program to interact with it as if it were an ordinary mutex. So it seems like a clean solution. If there's a better solution, I'd be happy to hear that also.
EDIT AGAIN:
The reason I need locks to begin with is that there are multiple master threads, and the locks are there to prevent them from accessing shared objects concurrently in invalid ways.
So the code already uses loop-level lock-free sequencing of operations at the master thread level. Also, in the original implementation, there were no worker threads, and the mutexes were ordinary kosher mutexes.
The inter_thread_thingy came up as an optimization, primarily to improve response time. In many cases, it was sufficient to guarantee that the "first part" of operation A, occurs before the "first part" of operation B. As a dumb example, say I punch object 1 and give it a black eye. Then I tell object 1 to change it's internal structure to reflect all the tissue damage. I don't want to wait around for the tissue damage before I move on to punch object 2. However, I do want the tissue damage to occur as part of the same operation; for example, in the interim, I don't want any other thread to reconfigure the object in such a way that would make tissue damage an invalid operation. (yes, this example is imperfect in many ways, and no I'm not working on a game)
So we made the change to a model where ownership of an object can be passed to a worker thread to complete an operation, and it actually works quite nicely; each master thread is able to get a lot more operations done because it doesn't need to wait for them all to complete. And, since the event sequencing at the master thread level is still loop-based, it is easy to write high-level master-thread operations, as they can be based on the assumption that an operation is complete (more precisely, the critical "first part" upon which the sequencing logic depends is complete) when the corresponding function call returns.
Finally, I thought it would be nice to use inter_thread mutex/semaphore thingies using RAII with boost locks to encapsulate the necessary synchronization that is required to make the whole thing work.
man pthread_unlock (this is on OS X, similar wording on Linux) has the answer:
NAME
pthread_mutex_unlock -- unlock a mutex
SYNOPSIS
#include <pthread.h>
int
pthread_mutex_unlock(pthread_mutex_t *mutex);
DESCRIPTION
If the current thread holds the lock on mutex, then the
pthread_mutex_unlock() function unlocks mutex.
Calling pthread_mutex_unlock() with a mutex that the
calling thread does not hold will result in
undefined behavior.
...
My counter-question would be - what kind of synchronization problem are you trying to solve with this? Most probably there is an easier solution.
Neither pthreads nor boost::thread (built on top of it) guarantee any order in which a contended mutex is acquired by competing threads.
Sorry, but I don't understand. what will be the state of your mutex in line [1] in the following code if another thread can unlock it?
inter_thread_mutex m;
{
m.lock();
// [1]
m.unlock();
}
This has no sens.
There's a few ways to approach this. Both of the ones I'm going to suggest are going to involve adding an additional piece of information to the object, rather adding a mechanism to unlock a thread from a thread other than the one that owns it.
1) you can add some information to indicate the object's state:
enum modification_state { consistent, // ready to be examined or to start being modified
phase1_complete, // ready for the second thread to finish the work
};
// first worker thread
lock();
do_init_work(object);
object.mod_state = phase1_complete;
unlock();
signal();
do_other_stuff();
// second worker thread
lock()
while( object.mod_state != phase1_complete )
wait()
do_final_work(obj)
object.mod_state = consistent;
unlock()
signal()
// some other thread that needs to read the data
lock()
while( object.mod_state != consistent )
wait();
read_data(obj)
unlock()
Works just fine with condition variables, because obviously you're not writing your own lock.
2) If you have a specific thread in mind, you can give the object an owner.
// first worker
lock();
while( obj.owner != this_thread() ) wait();
do_initial_work(obj);
obj.owner = second_thread_id;
unlock()
signal()
...
This is pretty much the same solution as my first solution, but more flexible in the adding/removing of phases, and less flexible in the adding/removing of threads.
To be honest, I'm not sure how inter thread mutex would help you here. You'd still need a semaphore or condition variable to signal the passing of the work to the second thread.
Small modification to what you already have: how about storing the id of the thread which you want to take the lock, in your inter_thread_whatever? Then unlock it, and send a message to that thread, saying "I want you execute whatever routine it is that tries to take this lock".
Then the condition in lock becomes while(locked || (desired_locker != thisthread && desired_locker != 0)). Technically you've "released the lock" in the first thread, and "taken it again" in the second thread, but there's no way that any other thread can grab it in between, so it's as if you've transferred it directly from one to the other.
There's a potential problem, that if a thread exits or is killed, while it's the desired locker of your lock, then that thread deadlocks. But you were already talking about the first thread waiting for a message from the second thread to say that it has successfully acquired the lock, so presumably you already have a plan in mind for what happens if that message is never received. To that plan, add "reset the desired_locker field on the inter_thread_whatever".
This is all very hairy, though, I'm not convinced that what I've proposed is correct. Is there a way that the "master" thread (the one that's directing all these helpers) can just make sure that it doesn't order any more operations to be performed on whatever is protected by this lock, until the first op is completed (or fails and some RAII thing notifies you)? You don't need locks as such, if you can deal with it at the level of the message loop.
I don't think it is a good idea to say that your inter_thread_mutex (binary_semaphore) can be seen as a model of Lockable. The main issue is that the main feature of your inter_thread_mutex defeats the Locakble concept. If inter_thread_mutex was a model of lockable you will expect in In [1] that the inter_thread_mutex m is locked.
// thread T1
inter_thread_mutex m;
{
unique_lock<inter_thread_mutex> lk(m);
// [1]
}
But as an other thread T2 can do m.unlock() while T1 is in [1], the guaranty is broken.
Binary semaphores can be used as Lockables as far as each thread tries to lock before unlocking. But the main goal of your class is exactly the contrary.
This is one of the reason semaphores in Boost.Interprocess don't use lock/unlock to name the functions, but wait/notify. Curiously these are the same names used by conditions :)
A mutex is a mechanism for describing mutually exclusive blocks of code. It does not make sense for these blocks of code to cross thread boundaries. Trying to use such a concept in such an counter intuitive way can only lead to problems down the line.
It sounds very much like you're looking for a different multi-threading concept, but without more detail it's hard to know what.

How to implement a recursive MRSW lock?

I need a fully-recursive multiple-reader/single-writer lock (shared mutex) for my project - I don't agree with the notion that if you have complete const-correctness you shouldn't need them (there was some discussion about that on the boost mailing list), in my case the lock should protect a completely transparent cache which would be mutable in any case.
As for the semantics of recursive MRSW locks, I think the only ones that make sense are that acquiring a exclusive lock in addition to a shared one temporarily releases the shared one, to be reacquired when the exclusive one is released.
Has the somewhat strange effect that unlocking can wait but I can live with that - writing rarely happens anyway and recursive locking usually only happens through recursive code paths, in which case the caller has to be prepared that the call might wait in any case. To avoid it one can still simply upgrade the lock instead of using recursive locking.
Acquiring a shared lock on top of an exclusive one should obviously just increases the lock count.
So the question becomes - how should I implement it? The usual approach with a critical section and two semaphores doesn't work here because - as far as I can see - the woken up thread has to handshake, by inserting it's thread id into the lock's owner map.
I suppose it would be doable with two condition variables and a couple of mutexes but the sheer amount of synchronization primitives that would end up using sounds like a bit too much overhead for my taste.
An idea which just sprang into my mind is to utilize TLS to remember the type of lock I'm holding (and possibly the local lock counts). Have to think it through - but I'll still post the question for now.
Target platform is Win32 but that shouldn't really matter. Note that I'm specifically targeting Win2k so anything related to the new MRSW lock primitive in Windows 7 is not relevant for me. :-)
Okay, I solved it.
It can be done with just 2 semaphores, a critical section and almost no more locking than for a regular non-recursive MRSW lock (there is obviously some more CPU-time spent inside the lock because that multimap must be managed) - but it's tricky. The structure I came up with looks like this:
// Protects everything that follows, except mWriterThreadId and mRecursiveUpgrade
CRITICAL_SECTION mLock;
// Semaphore to wait on for a read lock
HANDLE mSemaReader;
// Semaphore to wait on for a write lock
HANDLE mSemaWriter;
// Number of threads waiting for a write lock.
int mWriterWaiting;
// Number of times the writer entered the write lock.
int mWriterActive;
// Number of threads inside a read lock. Note that this does not include
// recursive read locks.
int mReaderActiveThreads;
// Whether or not the current writer obtained the lock by a recursive
// upgrade. Note that this member might be set outside the critical
// section, so it should only be read from by the writer during his
// unlock.
bool mRecursiveUpgrade;
// This member contains the current thread id once for each
// (recursive) read lock held by the current thread in addition to an
// undefined number of other thread ids which may or may not hold a
// read lock, even inside the critical section (!).
std::multiset<unsigned long> mReaderActive;
// If there is no writer this member contains 0.
// If the current thread is the writer this member contains his
// thread-id.
// Otherwise it can contain either of them, even inside the
// critical section (!).
// Also note that it might be set outside the critical section.
unsigned long mWriterThreadId;
Now, the basic idea is this:
Full update of mWriterWaiting and mWriterActive for an unlock is performed by the unlocking thread.
For mWriterThreadId and mReaderActive this is not possible, as the waiting thread needs to insert itself when it was released.
So the rule is, that you may never access those two members except to check whether you are holding a read lock or are the current writer - specifically it may not be used to checker whether or not there are any readers / writers - for that you have to use the (somewhat redundant but necessary for this reason) mReaderActiveThreads and mWriterActive.
I'm currently running some test code (which has been going on deadlock- and crash-free for 30 minutes or so) - when I'm sure that it's stable and I've cleaned up the code somewhat I'll put it on some pastebin and add a link in a comment here (just in case someone else ever needs this).
Well, I did some thinking. Starting from the simple "two semaphores and a critical section" one adds a writer lock count and a owning writer TID to the structure.
Unlock still set most of the new status in the critsec. Readers still normally increase the lock count - recursive locking simply adds a non-existing reader to the counter.
During writers lock() I compare the owning TID, and if the writer already own it the write lock counter is increased.
Setting the new writer TID can't be done by the unlock() - it doesn't know which one will be wakened, but if writers reset it back to zero in their unlock() it's not a problem - the current thread id won't ever be zero and setting it is an atomic operation.
All sounds simple enough - one nasty problem left: A recursive reader-reader lock while a writer is waiting will deadlock. And I don't know how to solve that short of doing a reader-biased lock... somehow I need to know whether or not I already own a reader lock.
Using TLS doesn't sound too great after I realized that the number if available slots might be rather limited...
As far as I understand, you need to provide your writer exclusive access to the data, while readers can operate simultaneously (if this is not what you want, please clarify your question).
I think you need to implement a sort of "inverse semaphore", i.e. a semaphore that will block a thread when positive, and signal all waiting threads when zero. If you do this, you can use two such semaphores for your program. The operation of your threads could then be the following:
Reader:
(1) wait on sem A
(2) increase sem B
(3) read operation
(4) decrease sem B
Writer:
(1) increase sem A
(2) wait on sem B
(3) write operation
(4) decrease sem A
In this way the writer will perform the write operation as soon as all pending readers have finished reading. As soon as your writer finishes, readers can resume their operation without blocking each other.
I am not familiar with Windows mutex/semaphore facilities but I can think of a way to implement such semaphores using the POSIX threads API (combining a mutex, a counter and a conditional variable).