I'm a little confused as to the proper use of critical sections in multithreaded applications. In my application there are several objects (some circular buffers and a serial port object) that are shared among threads. Should access of these objects always be placed within critical sections, or only at certain times? I suspect only at certain times because when I attempted to wrap each use with an EnterCriticalSection / LeaveCriticalSection I ran into what seemed to be a deadlock condition. Any insight you may have would be appreciated. Thanks.
If you share a resource across threads, and some of those threads read while others write, then it must be protected always.
It's hard to give any more advice without knowing more about your code, but here are some general points to keep in mind.
1) Critical sections protect resources, not processes.
2) Enter/leave critical sections in the same order across all threads. If thread A enters Foo, then enters Bar, then thread B must enter Foo and Bar in the same order. If you don't, you could create a race.
3) Entering and leaving must be done in opposite order. Example, since you entered Foo then entered Bar, you must leave Bar before leaving Foo. If you don't do this, you could create a deadlock.
4) Keep locks for the shortest time period reasonably possible. If you're done with Foo before you start using Bar, release Foo before grabbing Bar. But you still have to keep the ordering rules in mind from above. In every thread that uses both Foo and Bar, you must acquire and release in the same order:
Enter Foo
Use Foo
Leave Foo
Enter Bar
Use Bar
Leave Bar
5) If you only read 99.9% of the time and write 0.1% of the time, don't try to be clever. You still have to enter the crit sec even when you're only reading. This is because you don't want a write to start when your'e in the middle of a read.
6) Keep the critical sections granular. Each critical section should protect one resource, not multiple resources. If you make the critical sections too "big", you could serialize your application or create a very mysterious set of deadlocks or races.
Use a C++ wrapper around the critical section which supports RAII:
{
CriticalSectionLock lock ( mutex_ );
Do stuff...
}
The constructor for the lock acquires the mutex and the destructor releases the mutex even if an exception is thrown.
Try not to gain more more than one lock at a time, and try to avoid calling functions outside of your class while holding locks; this helps avoid gaining locks in different places, so you tend to get fewer possibilities for deadlocks.
If you must gain more than one lock at the same time, sort the locks by their address and gain them in order. That way multiple processes gain the same locks in the same order without co-ordination.
With an IO port, consider whether you need to lock output at the same time as input - often you have a case where something tries to write, then expects to read, or visa-versa. If you have two locks, then you can get a deadlock if one thread writes then reads, and the other reads then writes. Often having one thread which does the IO and a queue of requests solves that, but that's a bit more complicated than just wrapping calls up with locks, and without much more detail I can't recommend it.
Related
I have a code that required to lock multiple mutexes.
void AttackAoeRequest(Player* attacker, int range)
{
std::lock_guard<std::mutex> lk_attacker(attacker->mtx);
if (attacker->isInVehicle)
{
return;
}
//there are a lot of code that need to check before the loop, and these code need to access attacker properties.
//s_map is the global map class that contains all player in the map.
for (Player* defender : s_map.GetAllPlayers())
{
if (attacker == defender) continue;
std::lock_guard<std::mutex> lk_defender(defender->mtx);
if (GetDistance(attacker->position, defender->position) <= 5)
{
printf("%d attack %d damage : %d\n", attacker->id, defender->id
, attacker->attackUpgrade - defender->defenseUpgrade);
}
}
}
There is a deadlock occured when the attacker is the defender as the same time.
e.g.
//playerA and playerB are in the global map class.
std::thread threadA = std::thread(AttackAoeRequest, &playerA, 5);
std::thread threadB = std::thread(AttackAoeRequest, &playerB, 5);
UPDATE
Actually the threadA, threadB illustrate which situation the cause the deadlock.
AttackAoeRequest is calling from a multithread networking.
The networking is going to handle messages from client and call AttackAoeRequest. There are might be a situation that clientA(playerA) and clientB(playerB) attack each others.
As the code described. There is a situation the player might be the attacker and defender in the same times, and this cause the deadlock.
I had searched about std::lock to lock multiple mutexes in same time, but in this case the mutex aren't lock in the same time.
Presumably who is "attacker" and who is "defender" is very fluid, and so you are getting opposite locking order issues.
One defense against deadlocks is to write the code so that it avoids holding multiple locks at the same time. Or, going the other way, make the locking more coarse-grained so that a single lock covers all the objects.
If you have to lock an attacker and defender, you could have the code do it always in the same order. For instance, by address. The object with the lower address in memory is locked first, then the higher one. Acquire both locks this way, and then execute the all the code that has to work with both of them.
You could have some scoped lock for this which takes two objects. Make a template class supporting lock_double_guard<std::mutex> dbl_lk(attacker->mtx, defender->mtx); which puts the two objects in sorted order, and locks them in that order.
In C and C++, pointer to distinct objects may not be compared other than for exact equality, but being able to do ptrObj1 < ptrObj2 is a common extension. If that makes you nervous, you could just assign an unsigned integer serial number to each object which is incremented whenever a new object is made. The object with a lower serial number is locked first.
There is no universal answer to your question. You will have to evaluate what makes most sense in your design and possibly redesign your code. Here are a few avenues to explore:
Avoid locking in the first place. Use atomics and lock-free techniques to work with player structures. This is not always easy or even possible to do, but may provide good performance.
Make locking more coarse grained. For example, don't lock individual players, instead lock all players with a single lock. This, obviously, limits parallelism, but this may not be an issue in your code at a large scale.
Avoid locking multiple players at the same time. For example, complete all you need to do with attacker in AttackAoeRequest, release lk_attacker and then proceed to iterate over defenders. Copy/cache the necessary data from attacker if you have to to avoid having to access attacker during iteration. Your design should allow that some of the cached data will become stale during iteration, if another thread modifies attacker while you're iterating.
Introduce asynchronicity or retries. For example, try locking the defender opportunistically, using try_lock. If it fails, postpone processing that player and go on with the rest. After you've completed the iteration, release all locks and retry the whole operation on the leftover defenders a bit later. Hopefully, by that time other threads will have completed their work with the defenders and released their locks. You may need to redo some work on the attacker on the second retry, or reuse the previously cached data.
Separate players processing to different threads. Or, more generally make sure that a given player is never accessed by multiple threads concurrently. Use message passing between threads to implement interaction between players. The message passing mechanism does not need to lock any players, and in fact, locking the players should not be necessary at all. This will also introduce some asynchronicity in the sense that the effects of AttackAoeRequest may be applied to defenders with a delay - when the corresponding thread processes damage notifications from the attacker.
I'm sure there are other ideas as well.
I'm writing a game engine (for fun), and have a lot of threads running concurrently. I have a class which holds an instance of another class as a private variable, which in turn holds and instance of a different class as a private variable. My question is, which one of these classes should I strive to make thread safe?
Do I make all of them thread safe, and have each of them protect their data with a mutex, do I make just one of them thread safe, and assume that anybody using my code must understand that if you are using underlying classes they aren't inherently thread safe.
Example:
class A {
private:
B b;
}
class B {
private:
C c;
}
class C {
// data
}
I understand I need every class's data to avoid being corrupted via a data race, however I would like to avoid throwing a ton of mutexes on every single method of every class. I'm not sure what the proper convention is.
You almost certainly don't want to try to make every class thread-safe, since doing so would end up being very inefficient (with lots of unnecessary locking and unlocking of mutexes for no benefit) and also prone to deadlocks (the more mutexes you have to lock at once, the more likely you are to have different threads locking sequences of mutexes in a different order, which is the entry condition for a deadlock and therefore your program freezing up on you).
What you want to do instead if figure out which data structures needs to be accessed by which thread(s). When designing your data structures, you want to try to design them in such a way that the amount of data shared between threads is as minimal as possible -- if you can reduce it to zero, then you don't need to do any serialization at all! (you probably won't manage that, but if you do a CSP/message-passing design you can get pretty close, in that the only mutexes you ever need to lock are the ones protecting your message-passing queues)
Keep in mind also that your mutexes are there not just to "protect the data" but also to allow a thread to make a series of changes appear to be atom from the viewpoint of the other threads that might access that data. That is, if your thread #1 needs to make changes to objects A, B, and C, and all three of those objects each have their own mutex, which thread #1 locks before modifying the object and then unlocks afterwards, you can still have a race condition, because thread #2 might "see" the update half-completed (i.e. thread #2 might examine the objects after you've updated A but before you've updated B and C). Therefore you usually need to push your mutexes up to a level where they cover all the objects you might need to change in one go -- in the ABC example case, that means you might want to have a single mutex that is used to serialize access to A, B, and C.
One way to approach it would be to start with just a single global mutex for your entire program -- anytime any thread needs to read or write any data structure that is accessible to other threads, that is the mutex it locks (and unlocks afterwards). That design probably won't be very efficient (since threads might spend a lot of time waiting for the mutex), but it will definitely not suffer from deadlock problems. Then once you have that working, you could look to see if that single mutex is actually a noticeable performance bottleneck for you -- if not, you're done, ship your program :) OTOH if it is a bottleneck, you can then analyze which of your data structures are logically independent from each other, and split your global mutex into two mutexes -- one to serialize access to subset A of the data structures, and another one to serialize access to subset B. (Note that the subsets don't need to be equal size -- subset B might contain just one particular data structure that is critical to performance) Repeat as necessary until either you're happy with performance, or your program starts to get too complicated or buggy (in which case you might want to dial the mutex-granularity back again a bit in order to regain your sanity).
I am new to multi-threading programming, and confused about how Mutex works. In the Boost::Thread manual, it states:
Mutexes guarantee that only one thread can lock a given mutex. If a code section is surrounded by a mutex locking and unlocking, it's guaranteed that only a thread at a time executes that section of code. When that thread unlocks the mutex, other threads can enter to that code region:
My understanding is that Mutex is used to protect a section of code from being executed by multiple threads at the same time, NOT protect the memory address of a variable. It's hard for me to grasp the concept, what happen if I have 2 different functions trying to write to the same memory address.
Is there something like this in Boost library:
lock a memory address of a variable, e.g., double x, lock (x); So
that other threads with a different function can not write to x.
do something with x, e.g., x = x + rand();
unlock (x)
Thanks.
The mutex itself only ensures that only one thread of execution can lock the mutex at any given time. It's up to you to ensure that modification of the associated variable happens only while the mutex is locked.
C++ does give you a way to do that a little more easily than in something like C. In C, it's pretty much up to you to write the code correctly, ensuring that anywhere you modify the variable, you first lock the mutex (and, of course, unlock it when you're done).
In C++, it's pretty easy to encapsulate it all into a class with some operator overloading:
class protected_int {
int value; // this is the value we're going to share between threads
mutex m;
public:
operator int() { return value; } // we'll assume no lock needed to read
protected_int &operator=(int new_value) {
lock(m);
value = new_value;
unlock(m);
return *this;
}
};
Obviously I'm simplifying that a lot (to the point that it's probably useless as it stands), but hopefully you get the idea, which is that most of the code just treats the protected_int object as if it were a normal variable.
When you do that, however, the mutex is automatically locked every time you assign a value to it, and unlocked immediately thereafter. Of course, that's pretty much the simplest possible case -- in many cases, you need to do something like lock the mutex, modify two (or more) variables in unison, then unlock. Regardless of the complexity, however, the idea remains that you centralize all the code that does the modification in one place, so you don't have to worry about locking the mutex in the rest of the code. Where you do have two or more variables together like that, you generally will have to lock the mutex to read, not just to write -- otherwise you can easily get an incorrect value where one of the variables has been modified but the other hasn't.
No, there is nothing in boost(or elsewhere) that will lock memory like that.
You have to protect the code that access the memory you want protected.
what happen if I have 2 different functions trying to write to the same
memory address.
Assuming you mean 2 functions executing in different threads, both functions should lock the same mutex, so only one of the threads can write to the variable at a given time.
Any other code that accesses (either reads or writes) the same variable will also have to lock the same mutex, failure to do so will result in indeterministic behavior.
It is possible to do non-blocking atomic operations on certain types using Boost.Atomic. These operations are non-blocking and generally much faster than a mutex. For example, to add something atomically you can do:
boost::atomic<int> n = 10;
n.fetch_add(5, boost:memory_order_acq_rel);
This code atomically adds 5 to n.
In order to protect a memory address shared by multiple threads in two different functions, both functions have to use the same mutex ... otherwise you will run into a scenario where threads in either function can indiscriminately access the same "protected" memory region.
So boost::mutex works just fine for the scenario you describe, but you just have to make sure that for a given resource you're protecting, all paths to that resource lock the exact same instance of the boost::mutex object.
I think the detail you're missing is that a "code section" is an arbitrary section of code. It can be two functions, half a function, a single line, or whatever.
So the portions of your 2 different functions that hold the same mutex when they access the shared data, are "a code section surrounded by a mutex locking and unlocking" so therefore "it's guaranteed that only a thread at a time executes that section of code".
Also, this is explaining one property of mutexes. It is not claiming this is the only property they have.
Your understanding is correct with respect to mutexes. They protect the section of code between the locking and unlocking.
As per what happens when two threads write to the same location of memory, they are serialized. One thread writes its value, the other thread writes to it. The problem with this is that you don't know which thread will write first (or last), so the code is not deterministic.
Finally, to protect a variable itself, you can find a near concept in atomic variables. Atomic variables are variables that are protected by either the compiler or the hardware, and can be modified atomically. That is, the three phases you comment (read, modify, write) happen atomically. Take a look at Boost atomic_count.
I have a thread pool with some threads (e.g. as many as number of cores) that work on many objects, say thousands of objects. Normally I would give each object a mutex to protect access to its internals, lock it when I'm doing work, then release it. When two threads would try to access the same object, one of the threads has to wait.
Now I want to save some resources and be scalable, as there may be thousands of objects, and still only a hand full of threads. I'm thinking about a class design where the thread has some sort of mutex or lock object, and assigns the lock to the object when the object should be accessed. This would save resources, as I only have as much lock objects as I have threads.
Now comes the programming part, where I want to transfer this design into code, but don't know quite where to start. I'm programming in C++ and want to use Boost classes where possible, but self written classes that handle these special requirements are ok. How would I implement this?
My first idea was to have a boost::mutex object per thread, and each object has a boost::shared_ptr that initially is unset (or NULL). Now when I want to access the object, I lock it by creating a scoped_lock object and assign it to the shared_ptr. When the shared_ptr is already set, I wait on the present lock. This idea sounds like a heap full of race conditions, so I sort of abandoned it. Is there another way to accomplish this design? A completely different way?
Edit:
The above description is a bit abstract, so let me add a specific example. Imagine a virtual world with many objects (think > 100.000). Users moving in the world could move through the world and modify objects (e.g. shoot arrows at monsters). When only using one thread, I'm good with a work queue where modifications to objects are queued. I want a more scalable design, though. If 128 core processors are available, I want to use all 128, so use that number of threads, each with work queues. One solution would be to use spatial separation, e.g. use a lock for an area. This could reduce number of locks used, but I'm more interested if there's a design which saves as much locks as possible.
You could use a mutex pool instead of allocating one mutex per resource or one mutex per thread. As mutexes are requested, first check the object in question. If it already has a mutex tagged to it, block on that mutex. If not, assign a mutex to that object and signal it, taking the mutex out of the pool. Once the mutex is unsignaled, clear the slot and return the mutex to the pool.
Without knowing it, what you were looking for is Software Transactional Memory (STM).
STM systems manage with the needed locks internally to ensure the ACI properties (Atomic,Consistent,Isolated). This is a research activity. You can find a lot of STM libraries; in particular I'm working on Boost.STM (The library is not yet for beta test, and the documentation is not really up to date, but you can play with). There are also some compilers that are introducing TM in (as Intel, IBM, and SUN compilers). You can get the draft specification from here
The idea is to identify the critical regions as follows
transaction {
// transactional block
}
and let the STM system to manage with the needed locks as far as it ensures the ACI properties.
The Boost.STM approach let you write things like
int inc_and_ret(stm::object<int>& i) {
BOOST_STM_TRANSACTION {
return ++i;
} BOOST_STM_END_TRANSACTION
}
You can see the couple BOOST_STM_TRANSACTION/BOOST_STM_END_TRANSACTION as a way to determine a scoped implicit lock.
The cost of this pseudo transparency is of 4 meta-data bytes for each stm::object.
Even if this is far from your initial design I really think is what was behind your goal and initial design.
I doubt there's any clean way to accomplish your design. The problem that assigning the mutex to the object looks like it'll modify the contents of the object -- so you need a mutex to protect the object from several threads trying to assign mutexes to it at once, so to keep your first mutex assignment safe, you'd need another mutex to protect the first one.
Personally, I think what you're trying to cure probably isn't a problem in the first place. Before I spent much time on trying to fix it, I'd do a bit of testing to see what (if anything) you lose by simply including a Mutex in each object and being done with it. I doubt you'll need to go any further than that.
If you need to do more than that I'd think of having a thread-safe pool of objects, and anytime a thread wants to operate on an object, it has to obtain ownership from that pool. The call to obtain ownership would release any object currently owned by the requesting thread (to avoid deadlocks), and then give it ownership of the requested object (blocking if the object is currently owned by another thread). The object pool manager would probably operate in a thread by itself, automatically serializing all access to the pool management, so the pool management code could avoid having to lock access to the variables telling it who currently owns what object and such.
Personally, here's what I would do. You have a number of objects, all probably have a key of some sort, say names. So take the following list of people's names:
Bill Clinton
Bill Cosby
John Doe
Abraham Lincoln
Jon Stewart
So now you would create a number of lists: one per letter of the alphabet, say. Bill and Bill would go in one list, John, Jon Abraham all by themselves.
Each list would be assigned to a specific thread - access would have to go through that thread (you would have to marshall operations to an object onto that thread - a great use of functors). Then you only have two places to lock:
thread() {
loop {
scoped_lock lock(list.mutex);
list.objectAccess();
}
}
list_add() {
scoped_lock lock(list.mutex);
list.add(..);
}
Keep the locks to a minimum, and if you're still doing a lot of locking, you can optimise the number of iterations you perform on the objects in your lists from 1 to 5, to minimize the amount of time spent acquiring locks. If your data set grows or is keyed by number, you can do any amount of segregating data to keep the locking to a minimum.
It sounds to me like you need a work queue. If the lock on the work queue became a bottle neck you could switch it around so that each thread had its own work queue then some sort of scheduler would give the incoming object to the thread with the least amount of work to do. The next level up from that is work stealing where threads that have run out of work look at the work queues of other threads.(See Intel's thread building blocks library.)
If I follow you correctly ....
struct table_entry {
void * pObject; // substitute with your object
sem_t sem; // init to empty
int nPenders; // init to zero
};
struct table_entry * table;
object_lock (void * pObject) {
goto label; // yes it is an evil goto
do {
pEntry->nPenders++;
unlock (mutex);
sem_wait (sem);
label:
lock (mutex);
found = search (table, pObject, &pEntry);
} while (found);
add_object_to_table (table, pObject);
unlock (mutex);
}
object_unlock (void * pObject) {
lock (mutex);
pEntry = remove (table, pObject); // assuming it is in the table
if (nPenders != 0) {
nPenders--;
sem_post (pEntry->sem);
}
unlock (mutex);
}
The above should work, but it does have some potential drawbacks such as ...
A possible bottleneck in the search.
Thread starvation. There is no guarantee that any given thread will get out of the do-while loop in object_lock().
However, depending upon your setup, these potential draw-backs might not matter.
Hope this helps.
We here have an interest in a similar model. A solution we have considered is to have a global (or shared) lock but used in the following manner:
A flag that can be atomically set on the object. If you set the flag you then own the object.
You perform your action then reset the variable and signal (broadcast) a condition variable.
If the acquire failed you wait on the condition variable. When it is broadcast you check its state to see if it is available.
It does appear though that we need to lock the mutex each time we change the value of this variable. So there is a lot of locking and unlocking but you do not need to keep the lock for any long period.
With a "shared" lock you have one lock applying to multiple items. You would use some kind of "hash" function to determine which mutex/condition variable applies to this particular entry.
Answer the following question under the #JohnDibling's post.
did you implement this solution ? I've a similar problem and I would like to know how you solved to release the mutex back to the pool. I mean, how do you know, when you release the mutex, that it can be safely put back in queue if you do not know if another thread is holding it ?
by #LeonardoBernardini
I'm currently trying to solve the same kind of problem. My approach is create your own mutex struct (call it counterMutex) with a counter field and the real resource mutex field. So every time you try to lock the counterMutex, first you increment the counter then lock the underlying mutex. When you're done with it, you decrement the coutner and unlock the mutex, after that check the counter to see if it's zero which means no other thread is trying to acquire the lock . If so put the counterMutex back to the pool. Is there a race condition when manipulating the counter? you may ask. The answer is NO. Remember you have a global mutex to ensure that only one thread can access the coutnerMutex at one time.
I need a fully-recursive multiple-reader/single-writer lock (shared mutex) for my project - I don't agree with the notion that if you have complete const-correctness you shouldn't need them (there was some discussion about that on the boost mailing list), in my case the lock should protect a completely transparent cache which would be mutable in any case.
As for the semantics of recursive MRSW locks, I think the only ones that make sense are that acquiring a exclusive lock in addition to a shared one temporarily releases the shared one, to be reacquired when the exclusive one is released.
Has the somewhat strange effect that unlocking can wait but I can live with that - writing rarely happens anyway and recursive locking usually only happens through recursive code paths, in which case the caller has to be prepared that the call might wait in any case. To avoid it one can still simply upgrade the lock instead of using recursive locking.
Acquiring a shared lock on top of an exclusive one should obviously just increases the lock count.
So the question becomes - how should I implement it? The usual approach with a critical section and two semaphores doesn't work here because - as far as I can see - the woken up thread has to handshake, by inserting it's thread id into the lock's owner map.
I suppose it would be doable with two condition variables and a couple of mutexes but the sheer amount of synchronization primitives that would end up using sounds like a bit too much overhead for my taste.
An idea which just sprang into my mind is to utilize TLS to remember the type of lock I'm holding (and possibly the local lock counts). Have to think it through - but I'll still post the question for now.
Target platform is Win32 but that shouldn't really matter. Note that I'm specifically targeting Win2k so anything related to the new MRSW lock primitive in Windows 7 is not relevant for me. :-)
Okay, I solved it.
It can be done with just 2 semaphores, a critical section and almost no more locking than for a regular non-recursive MRSW lock (there is obviously some more CPU-time spent inside the lock because that multimap must be managed) - but it's tricky. The structure I came up with looks like this:
// Protects everything that follows, except mWriterThreadId and mRecursiveUpgrade
CRITICAL_SECTION mLock;
// Semaphore to wait on for a read lock
HANDLE mSemaReader;
// Semaphore to wait on for a write lock
HANDLE mSemaWriter;
// Number of threads waiting for a write lock.
int mWriterWaiting;
// Number of times the writer entered the write lock.
int mWriterActive;
// Number of threads inside a read lock. Note that this does not include
// recursive read locks.
int mReaderActiveThreads;
// Whether or not the current writer obtained the lock by a recursive
// upgrade. Note that this member might be set outside the critical
// section, so it should only be read from by the writer during his
// unlock.
bool mRecursiveUpgrade;
// This member contains the current thread id once for each
// (recursive) read lock held by the current thread in addition to an
// undefined number of other thread ids which may or may not hold a
// read lock, even inside the critical section (!).
std::multiset<unsigned long> mReaderActive;
// If there is no writer this member contains 0.
// If the current thread is the writer this member contains his
// thread-id.
// Otherwise it can contain either of them, even inside the
// critical section (!).
// Also note that it might be set outside the critical section.
unsigned long mWriterThreadId;
Now, the basic idea is this:
Full update of mWriterWaiting and mWriterActive for an unlock is performed by the unlocking thread.
For mWriterThreadId and mReaderActive this is not possible, as the waiting thread needs to insert itself when it was released.
So the rule is, that you may never access those two members except to check whether you are holding a read lock or are the current writer - specifically it may not be used to checker whether or not there are any readers / writers - for that you have to use the (somewhat redundant but necessary for this reason) mReaderActiveThreads and mWriterActive.
I'm currently running some test code (which has been going on deadlock- and crash-free for 30 minutes or so) - when I'm sure that it's stable and I've cleaned up the code somewhat I'll put it on some pastebin and add a link in a comment here (just in case someone else ever needs this).
Well, I did some thinking. Starting from the simple "two semaphores and a critical section" one adds a writer lock count and a owning writer TID to the structure.
Unlock still set most of the new status in the critsec. Readers still normally increase the lock count - recursive locking simply adds a non-existing reader to the counter.
During writers lock() I compare the owning TID, and if the writer already own it the write lock counter is increased.
Setting the new writer TID can't be done by the unlock() - it doesn't know which one will be wakened, but if writers reset it back to zero in their unlock() it's not a problem - the current thread id won't ever be zero and setting it is an atomic operation.
All sounds simple enough - one nasty problem left: A recursive reader-reader lock while a writer is waiting will deadlock. And I don't know how to solve that short of doing a reader-biased lock... somehow I need to know whether or not I already own a reader lock.
Using TLS doesn't sound too great after I realized that the number if available slots might be rather limited...
As far as I understand, you need to provide your writer exclusive access to the data, while readers can operate simultaneously (if this is not what you want, please clarify your question).
I think you need to implement a sort of "inverse semaphore", i.e. a semaphore that will block a thread when positive, and signal all waiting threads when zero. If you do this, you can use two such semaphores for your program. The operation of your threads could then be the following:
Reader:
(1) wait on sem A
(2) increase sem B
(3) read operation
(4) decrease sem B
Writer:
(1) increase sem A
(2) wait on sem B
(3) write operation
(4) decrease sem A
In this way the writer will perform the write operation as soon as all pending readers have finished reading. As soon as your writer finishes, readers can resume their operation without blocking each other.
I am not familiar with Windows mutex/semaphore facilities but I can think of a way to implement such semaphores using the POSIX threads API (combining a mutex, a counter and a conditional variable).