Is double-checked locking safe in C++ for unidirectional data transfer?

I have inherited an application whose performance I'm trying to improve. It currently uses mutexes (std::lock_guard<std::mutex>) to transfer data from one thread to another. One thread is a low-frequency (slow) one which simply modifies the data to be used by the other.
The other thread (which we'll call fast) has rather stringent performance requirements (it needs to do the maximum number of cycles per second possible) and we believe this is being impacted by the use of the mutexes.
Basically, the current logic is:
slow thread:            fast thread:
  occasionally:           very often:
    claim mutex             claim mutex
    change data             use data
    release mutex           release mutex
In order to get the fast thread running at maximum throughput, I'd like to experiment with reducing the number of mutex locks it has to take.
I suspect a variation of the double-checked locking pattern may be of use here. I know it has serious issues with bidirectional data flow (or singleton creation) but the areas of responsibility in my case are a little more limited in terms of which thread performs which operations (and when) on the shared data.
Basically, the slow thread sets up the data and never reads or writes to it again unless a new change comes in. The fast thread uses and changes the data but never expects to pass any information back to the other thread. In other words, ownership essentially flows one way.
I wanted to see if anyone could pick any holes in the strategy I'm thinking of.
The new idea is to have two sets of data, one current and one pending. There is no need for a queue in my case as incoming data overwrites previous data.
The pending data will only ever be written to by the slow thread under the control of the mutex and it will have an atomic flag to indicate that it has written and relinquished control (for now).
The fast thread will continue to use current data (without the mutex) until such time as the atomic flag is set. Since it is responsible for transferring pending to current, it can ensure the current data is always consistent.
At the point where the flag is set, the fast thread will lock the mutex, transfer pending to current, clear the flag, unlock the mutex and carry on.
So, basically, the fast thread runs at full speed and only does mutex locks when it knows the pending data needs to be transferred.
Getting into more concrete details, the class will have the following data members:
std::atomic_bool m_newDataReady;
std::mutex m_protectData;
MyStruct m_pendingData;
MyStruct m_currentData;
The method for receiving new data in the slow thread would be:
void NewData(const MyStruct &newData) {
    std::lock_guard<std::mutex> guard(m_protectData);
    m_newDataReady = false;
    Transfer(newData, m_pendingData); // copy the incoming data into pending
    m_newDataReady = true;
}
Clearing the flag prevents the fast thread from even trying to check for new data until the immediate transfer operation is complete.
The fast thread is a little trickier, using the flag to keep mutex locks to a minimum:
while (true) {
    if (m_newDataReady) {
        std::lock_guard<std::mutex> guard(m_protectData);
        if (m_newDataReady) {
            Transfer(m_pendingData, m_currentData); // copy pending into current
            m_newDataReady = false;
        }
    }
    Use(m_currentData);
}
Now it appears to me that the use of this method in the fast thread could improve performance quite a bit:
There is only one place where the atomic flag is read outside the control of the mutex, and the fact that it's atomic means its state should be consistent there.
Even if it's not consistent, the second check inside the mutex-locked area should provide a safety valve (it's rechecked when we know it's consistent).
The transfer of data is only ever performed under the control of the mutex so that should always be consistent.
The outer check in the fast thread means that unnecessary mutex locks are avoided - they'll only be taken if the flag is true (or "half-true", a possibly inconsistent state).
The inner if takes care of the "half-true" possibility that, between checking the flag and locking the mutex, the flag has been cleared.
I can't see any holes in this strategy but, given I'm only just getting into atomics/threading in the standard-C++ world, it may be I'm missing something.
Are there any clear problems in using this method?

Related

How to solve deadlock in multiple mutexes

I have code that needs to lock multiple mutexes.
void AttackAoeRequest(Player* attacker, int range)
{
    std::lock_guard<std::mutex> lk_attacker(attacker->mtx);
    if (attacker->isInVehicle)
    {
        return;
    }
    // There is a lot of code that needs to run before the loop,
    // and it needs to access the attacker's properties.
    // s_map is the global map class that contains all players in the map.
    for (Player* defender : s_map.GetAllPlayers())
    {
        if (attacker == defender) continue;
        std::lock_guard<std::mutex> lk_defender(defender->mtx);
        if (GetDistance(attacker->position, defender->position) <= range)
        {
            printf("%d attack %d damage : %d\n", attacker->id, defender->id,
                   attacker->attackUpgrade - defender->defenseUpgrade);
        }
    }
}
A deadlock occurs when a player is both an attacker and a defender at the same time.
e.g.
//playerA and playerB are in the global map class.
std::thread threadA = std::thread(AttackAoeRequest, &playerA, 5);
std::thread threadB = std::thread(AttackAoeRequest, &playerB, 5);
UPDATE
Actually, threadA and threadB above illustrate the situation that causes the deadlock.
AttackAoeRequest is called from multithreaded networking code.
The networking code handles messages from clients and calls AttackAoeRequest. There might be a situation where clientA (playerA) and clientB (playerB) attack each other.
As the code shows, a player might be the attacker and a defender at the same time, and this causes the deadlock.
I have read about std::lock for locking multiple mutexes at the same time, but in this case the mutexes aren't locked at the same time.
Presumably who is "attacker" and who is "defender" is very fluid, and so you are getting opposite locking order issues.
One defense against deadlocks is to write the code so that it avoids holding multiple locks at the same time. Or, going the other way, make the locking more coarse-grained so that a single lock covers all the objects.
If you have to lock an attacker and a defender, you could have the code do it always in the same order - for instance, by address. The object with the lower address in memory is locked first, then the higher one. Acquire both locks this way, and then execute all the code that has to work with both of them.
You could have some scoped lock for this which takes two objects. Make a template class supporting lock_double_guard<std::mutex> dbl_lk(attacker->mtx, defender->mtx); which puts the two objects in sorted order, and locks them in that order.
In C and C++, pointers to distinct objects may not be compared other than for exact equality, but being able to do ptrObj1 < ptrObj2 is a common extension. If that makes you nervous, you could just assign an unsigned integer serial number to each object, incremented whenever a new object is made. The object with the lower serial number is locked first.
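Here is a minimal sketch of such a guard (the class name follows the suggestion above; std::less is used because it guarantees a total order over pointers even where raw < does not, and the two mutexes are assumed to be distinct objects):

#include <functional>
#include <mutex>

template <typename Mutex>
class lock_double_guard {
public:
    lock_double_guard(Mutex& a, Mutex& b)
        : first_(std::less<Mutex*>()(&a, &b) ? a : b),
          second_(std::less<Mutex*>()(&a, &b) ? b : a)
    {
        first_.lock();   // always lock the lower-address mutex first...
        second_.lock();  // ...so opposite attacker/defender pairs cannot deadlock
    }
    ~lock_double_guard()
    {
        second_.unlock();
        first_.unlock();
    }
    lock_double_guard(const lock_double_guard&) = delete;
    lock_double_guard& operator=(const lock_double_guard&) = delete;
private:
    Mutex& first_;
    Mutex& second_;
};

// Usage, as proposed above:
//   lock_double_guard<std::mutex> dbl_lk(attacker->mtx, defender->mtx);

Note that since C++17, std::scoped_lock lk(attacker->mtx, defender->mtx); acquires both mutexes with built-in deadlock avoidance, so the hand-rolled ordering is only needed on older toolchains.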
There is no universal answer to your question. You will have to evaluate what makes most sense in your design and possibly redesign your code. Here are a few avenues to explore:
Avoid locking in the first place. Use atomics and lock-free techniques to work with player structures. This is not always easy or even possible to do, but may provide good performance.
Make locking more coarse grained. For example, don't lock individual players, instead lock all players with a single lock. This, obviously, limits parallelism, but this may not be an issue in your code at a large scale.
Avoid locking multiple players at the same time. For example, complete all you need to do with the attacker in AttackAoeRequest, release lk_attacker and then proceed to iterate over the defenders. Copy/cache the necessary data from the attacker if you have to, to avoid accessing the attacker during iteration. Your design should allow for some of the cached data becoming stale during iteration if another thread modifies the attacker while you're iterating.
Introduce asynchronicity or retries. For example, try locking the defender opportunistically, using try_lock. If it fails, postpone processing that player and go on with the rest. After you've completed the iteration, release all locks and retry the whole operation on the leftover defenders a bit later (see the sketch after this list). Hopefully, by that time other threads will have completed their work with the defenders and released their locks. You may need to redo some work on the attacker on the second retry, or reuse the previously cached data.
Separate players processing to different threads. Or, more generally make sure that a given player is never accessed by multiple threads concurrently. Use message passing between threads to implement interaction between players. The message passing mechanism does not need to lock any players, and in fact, locking the players should not be necessary at all. This will also introduce some asynchronicity in the sense that the effects of AttackAoeRequest may be applied to defenders with a delay - when the corresponding thread processes damage notifications from the attacker.
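A minimal sketch of that opportunistic try_lock pass (the trimmed-down Player and the ApplyAttack() helper are hypothetical stand-ins for the types in the question):

#include <mutex>
#include <vector>

struct Player { std::mutex mtx; /* position, id, upgrades, ... */ };

void ApplyAttack(Player* attacker, Player* defender, int range); // damage logic

// Returns the defenders whose locks were busy, for a later retry.
std::vector<Player*> AttackPass(Player* attacker, int range,
                                const std::vector<Player*>& defenders)
{
    std::vector<Player*> leftover;
    for (Player* defender : defenders) {
        if (attacker == defender) continue;
        std::unique_lock<std::mutex> lk(defender->mtx, std::try_to_lock);
        if (!lk.owns_lock()) {            // busy: postpone this defender
            leftover.push_back(defender);
            continue;
        }
        ApplyAttack(attacker, defender, range);
    }
    return leftover;  // retry these later, after releasing our own locks
}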
I'm sure there are other ideas as well.

Cheapest way to wake up multiple waiting threads without blocking

I use boost::thread to manage threads. In my program I have a pool of threads (workers) that are sometimes activated to do some job simultaneously.
Now I use boost::condition_variable, and all threads wait inside a boost::condition_variable::wait() call on their own condition_variable objects.
Can I AVOID using mutexes in the classic scheme when I work with condition variables? I want to wake up the threads but don't need to pass any data to them, so I don't need a mutex to be locked/unlocked during the wakeup process. Why should I spend CPU on this (though yes, I should remember about spurious wakeups)?
The boost::condition_variable::wait() call tries to REACQUIRE the locking object when the CV receives the notification, but I don't need that exact facility.
What is the cheapest way to wake several threads from another thread?
If you don't reacquire the locking object, how can the threads know that they are done waiting? What will tell them that? Returning from the wait tells them nothing, because the condition variable is stateless. It doesn't have an "unlocked" or "not blocking" state for it to return in.
You have to pass some data to them, otherwise how will they know that before they had to wait and now they don't? A condition variable is completely stateless, so any state that you need must be maintained and passed by you.
One common pattern is to use a mutex, condition variable, and a state integer. To block, do this:
1. Acquire the mutex.
2. Copy the value of the state integer.
3. Block on the condition variable, releasing the mutex.
4. If the state integer is the same as it was when you copied it, go to step 3.
5. Release the mutex.
To unblock all threads, do this:
1. Acquire the mutex.
2. Increment the state integer.
3. Broadcast the condition variable.
4. Release the mutex.
Notice how step 4 of the locking algorithm tests whether the thread is done waiting? Notice how this code tracks whether or not there has been an unblock since the thread decided to block? You have to do that because condition variables don't do it themselves. (And that's why you need to reacquire the locking object.)
If you try to remove the state integer, your code will behave unpredictably. Sometimes you will block too long due to missed wakeups and sometimes you won't block long enough due to spurious wakeups. Only a state integer (or similar predicate) protected by the mutex tells the threads when to wait and when to stop waiting.
Also, I haven't seen how your code uses this, but it almost always folds into logic you're already using. Why did the threads block anyway? Is it because there's no work for them to do? And when they wakeup, are they going to figure out what to do? Well, finding out that there's no work for them to do and finding out what work they do need to do will require some lock since it's shared state, right? So there almost always is already a lock you're holding when you decide to block and need to reacquire when you're done waiting.
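As a concrete illustration, here is a minimal C++11 sketch of the mutex + condition variable + state integer pattern described above (the class name Gate is invented for this example; the comments reference the numbered steps):

#include <condition_variable>
#include <mutex>

class Gate {
public:
    void wait() {
        std::unique_lock<std::mutex> lk(m_);  // step 1: acquire the mutex
        unsigned seen = state_;               // step 2: copy the state integer
        while (state_ == seen)                // step 4: unchanged? keep waiting
            cv_.wait(lk);                     // step 3: block, releasing the mutex
    }                                         // step 5: mutex released on scope exit

    void release_all() {
        std::lock_guard<std::mutex> lk(m_);   // step 1: acquire the mutex
        ++state_;                             // step 2: increment the state integer
        cv_.notify_all();                     // step 3: broadcast
    }                                         // step 4: mutex released on scope exit

private:
    std::mutex m_;
    std::condition_variable cv_;
    unsigned state_ = 0;
};

Note how a spurious wakeup simply loops back to waiting, while a missed wakeup is impossible because the state comparison happens under the mutex.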
For controlling threads doing parallel jobs, there is a nice primitive called a barrier.
A barrier is initialized with some positive integer value N representing how many threads it holds. A barrier has only a single operation: wait. When N threads call wait, the barrier releases all of them. Additionally, one of the threads is given a special return value indicating that it is the "serial thread"; that thread will be the one to do some special job, like integrating the results of the computation from the other threads.
The limitation is that a given barrier has to know the exact number of threads. It's really suitable for parallel processing type situations.
POSIX added barriers in 2003. A web search indicates that Boost has them, too.
http://www.boost.org/doc/libs/1_33_1/doc/html/barrier.html
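For illustration, a minimal sketch with Boost's barrier (do_work() and integrate_results() are hypothetical stand-ins for the per-thread computation and the serial step):

#include <boost/thread/barrier.hpp>
#include <boost/thread/thread.hpp>
#include <cstdio>

const int N = 4;          // the barrier must know the exact thread count
boost::barrier b(N);

void do_work()           { /* per-thread computation goes here */ }
void integrate_results() { std::puts("serial thread integrates results"); }

void worker()
{
    do_work();
    if (b.wait())         // wait() returns true in exactly one thread
        integrate_results();
}

int main()
{
    boost::thread_group g;
    for (int i = 0; i < N; ++i)
        g.create_thread(&worker);
    g.join_all();
}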
Generally speaking, you can't.
Assuming the algorithm looks something like this:
ConditionVariable cv;

void WorkerThread()
{
    for (;;)
    {
        cv.wait();
        DoWork();
    }
}

void MainThread()
{
    for (;;)
    {
        ScheduleWork();
        cv.notify_all();
    }
}
NOTE: I intentionally omitted any reference to mutexes in this pseudo-code. For the purposes of this example, we'll suppose ConditionVariable does not require a mutex.
The first time through MainThread(), work is queued and then it notifies WorkerThread() that it should execute its work. At this point two things can happen:
WorkerThread() completes DoWork() before MainThread() can complete ScheduleWork().
MainThread() completes ScheduleWork() before WorkerThread() can complete DoWork().
In case #1, WorkerThread() comes back around to sleep on the CV, and is awoken by the next cv.notify() and all is well.
In case #2, MainThread() comes back around and notifies... nobody and continues on. Meanwhile WorkerThread() eventually comes back around in its loop and waits on the CV but it is now one or more iterations behind MainThread().
This is known as a "lost wakeup". It is similar to the notorious "spurious wakeup" in that the two threads now have different ideas about how many notify()s have taken place. If you are expecting the two threads to maintain synchrony (and usually you are), you need some sort of shared synchronization primitive to control it. This is where the mutex comes in. It helps avoid lost wakeups which, arguably, are a more serious problem than the spurious variety. Either way, the effects can be serious.
UPDATE: For further rationale behind this design, see this comment by one of the original POSIX authors: https://groups.google.com/d/msg/comp.programming.threads/cpJxTPu3acc/Hw3sbptsY4sJ
Spurious wakeups are two things:
Write your program carefully, and make sure it works even if you missed something.
Support efficient SMP implementations.
There may be rare cases where an "absolutely, paranoiacally correct" implementation of condition wakeup, given simultaneous wait and signal/broadcast on different processors, would require additional synchronization that would slow down ALL condition variable operations while providing no benefit in 99.99999% of all calls. Is it worth the overhead? No way!
But, really, that's an excuse because we wanted to force people to write safe code. (Yes, that's the truth.)
boost::condition_variable::notify_*() does NOT require that the caller hold the lock on the mutex. This is a nice improvement over the Java model in that it decouples the notification of threads from the holding of the lock.
Strictly speaking, this means the following pointless code SHOULD DO what you are asking:
unique_lock lock(mutex);
// Do something
cv.wait(lock); // note: wait() requires a unique_lock, not a lock_guard
// Do something else

unique_lock otherLock(mutex);
// do something
otherLock.unlock();
cv.notify_one();
I do not believe you need to call otherLock.lock() first.

Thread-safe linked list with fine grained locks

In a program I have a class M:
class M {
    /*
      very big immutable fields
    */
    int status;
};
And I need a linked-list of objects of type M.
Three types of threads are accessing the list:
Producers: Produce and append objects to the end of the list. All of the newly produced objects have the status=NEW. (Operation time = O(1))
Consumers: Consume objects at the beginning of the list. An object can be consumed by a consumer if it has status=CONSUMER_ID. Each of the consumers keeps the first item in the linked-list that it can consume, so consumption is (amortized?) O(1) (see note below).
Destructor: Deletes consumed objects when there is a notification that says the object has been consumed correctly (Operation time = O(1)).
Modifier: Changes the status of the objects based on a state diagram. The final status of any object is the id of a consumer (Operation time = O(1) per object).
The number of consumers is less than 10. The number of Producers may be as big as a couple of hundreds. There is one modifier.
note: The modifier may modify already consumed objects, and thus the stored items of consumers may move back and forth. I did not find any better solution for this problem (although the comparison between objects is O(1), the operation is no longer amortized O(1)).
The performance is very important. Therefore, I want to use atomic operations or fine-grained locks (one per object) to avoid unnecessary blocking.
My questions are:
Atomic operations are preferred because they are lighter. I guess I must use locks for updating the pointers in the destructor thread only, and I can use atomic operations for handling contention between the other threads. Please let me know if I am missing something and there is a reason that I cannot use atomic operations on the status field.
I think I cannot use the STL list because it does not support fine-grained locks. But would you recommend using Boost::Intrusive lists (instead of writing my own)? Here it is mentioned that intrusive data structures are harder to make thread-safe. Is this true for fine-grained locks?
The producers, consumers and destructor would be called asynchronously based on some events (I am planning to use Boost::asio). But I don't know how to run the modifier to minimize its contention with the other threads. The options are:
Asynchronously from producers.
Asynchronously from consumers.
Using its own timer.
Any such call would operate on the list only if some conditions hold. My own intuition is that there is no difference between how I call the modifier. Am I missing something?
My system is Linux/GCC and I am using boost 1.47 in case it matters.
Similar question: Thread-safe deletion of a linked list node, using the fine-grained approach
The performance is very important. Therefore, I want to use atomic operations or fine-grained locks (one per object) to avoid unnecessary blocking.
This will make performance worse by increasing the probability that threads that contend (access the same data) will run at the same time on different cores. If the locks are too fine, threads may contend (ping-pong data between their caches) and run in slow lock step without ever blocking on a lock, causing terrible performance.
You want to use coarse enough locks that threads that contend over the same data block each other as soon as possible. That will force the scheduler to schedule non-contending threads, eliminating the cache ping-ponging that destroys performance.
You have a common misconception that blocking is bad. In fact, contention is bad, because it slows cores down to bus speeds. Blocking ends contention. Blocking is good because it de-schedules contending threads, allowing non-contending threads (that can run concurrently at full speed) to be scheduled.
If you're already planning to use Boost Asio, then good news! You can stop writing your custom asynchronous producer-consumer queue right now.
The Boost Asio io_service class is an asynchronous queue, so you can easily use it to pass objects from producers to consumers. Use the io_service::post() method to enqueue a bound function object for asynchronous callback by another thread.
boost::asio::io_service io_service_;

void consume(M* m); // forward declaration

void produce()
{
    M* m = new M;
    io_service_.post(boost::bind(&consume, m));
}

void consume(M* m)
{
    delete m;
}
Have your producer threads call produce(), then have your consumer threads call io_service_.run(), and consume() will be called back on your consumer threads. Instant producer-consumer!
Plus, you can enqueue all kinds of other heterogeneous events into the io_service_ to be handled by your consumer threads if you like, such as network reads and waiting for signals. Boost Asio is more than just a network library-- it's also an easy way to express a proactor, reactor, producer-consumer, thread-pool, or any other kind of threading architecture.
EDIT
Oh, and one more tip. Don't make separate pools of dedicated producer threads and dedicated consumer threads. Just make one thread for each core available on your machine (4 core machine => 4 threads). Then have all those threads call io_service_.run(). Use the io_service_ to asynchronously read stuff to produce, from files or the network or whatever, then use the io_service_ again to asynchronously consume whatever was produced.
That's the most performant threading architecture. One thread per core.
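A sketch of that setup (io_service_ is the global from the earlier snippet; io_service::work keeps run() from returning while no work is queued yet):

#include <boost/asio.hpp>
#include <boost/thread.hpp>

boost::asio::io_service io_service_;  // as in the snippet above

int main()
{
    boost::asio::io_service::work work(io_service_);     // keep run() alive
    boost::thread_group pool;
    unsigned n = boost::thread::hardware_concurrency();  // one thread per core
    for (unsigned i = 0; i < n; ++i)
        pool.create_thread([] { io_service_.run(); });
    // post produce/consume work and async I/O to io_service_ from anywhere
    pool.join_all();
}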
As @David Schwartz fairly noted, blocking is not always slow, and spinning (in user-space multithreaded applications) can be quite dangerous.
Moreover, the Linux pthread library has a "smart" implementation of pthread_mutex. It's designed to be "lightweight": when a thread tries to lock an already acquired mutex, it spins for a while, making several attempts to get the lock before it blocks. The number of attempts is not big enough to harm your system or even break real-time requirements (if any). An additional Linux-specific feature is the so-called fast userspace mutex (futex), which reduces the number of syscalls: the mutex_lock syscall is made only when a thread really needs to block on a mutex (locking an uncontended mutex does not make a syscall at all).
Actually, in most cases you don't need to reinvent the wheel or introduce very specific locking techniques. If you have to, then either something is wrong with the design or you're dealing with a highly concurrent environment (at first sight, 10 consumers don't seem like that, and all this looks like over-engineering).
If I were you, I'd prefer to use a condition variable + a mutex protecting the list.
Another thing I'd do is go over the design again. Why use one global list when each consumer needs to search it to find out whether it contains an item with the consumer's ID (and if so, remove/dequeue it)? Maybe it's better to make a separate list for each consumer; in that case you can probably get rid of the status field.
Is read access more frequent than write access? If so, it would be better to use R/W locks or RCU.
If I weren't satisfied with the pthread primitives and the futex stuff (and before going further I would have proved by testing that the locking primitives, not the number of consumers or the algorithm I chose, are the bottleneck), then I'd try to think about a complicated algorithm with reference counting, a separate GC thread, and a restriction that all updates be atomic.
I would advise a slightly different approach to the problem:
Producers: Enqueue objects at the end of a shared queue (SQ) and wake up the modifier via a semaphore.
producer()
{
    while (true)
    {
        o = get_object_from_somewhere ()
        atomic_enqueue (SQ.queue, o)
        signal (SQ.sem)
    }
}
Consumers: Dequeue objects from the front of a per-consumer queue (CQ[i]).
consumer()
{
    while (true)
    {
        wait (CQ[self].sem)
        o = atomic_dequeue (CQ[self].queue)
        process (o)
        destroy (o)
    }
}
Destructor: There is no destructor thread; after a consumer is done with an object, the consumer destroys it.
Modifier: The modifier dequeues objects from the shared queue, processes them, and enqueues them to the private queue of the appropriate consumer.
modifier()
{
    while (true)
    {
        wait (SQ.sem)
        o = atomic_dequeue (SQ.queue)
        FSM (o)
        atomic_enqueue (CQ [o.status].queue, o)
        signal (CQ [o.status].sem)
    }
}
A note on the various atomic_xxx functions in the pseudo code: this does not necessarily mean using atomic instructions like CAS, CAS2, LL/SC, etc. It can be done using atomics, spinlocks or plain mutexes. I would advise implementing it in the most straightforward way (e.g. mutexes) and optimizing it later if it proves to be a performance issue.
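Following that advice, the atomic_enqueue/atomic_dequeue calls and the semaphore can be collapsed into one small mutex-based class (a C++11 sketch; BlockingQueue is a name invented here, with the condition variable playing the role of the semaphore):

#include <condition_variable>
#include <deque>
#include <mutex>

template <typename T>
class BlockingQueue {
public:
    void enqueue(T* o) {                      // atomic_enqueue + signal(sem)
        { std::lock_guard<std::mutex> lk(m_);
          q_.push_back(o); }
        cv_.notify_one();
    }
    T* dequeue() {                            // wait(sem) + atomic_dequeue
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        T* o = q_.front();
        q_.pop_front();
        return o;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<T*> q_;
};

// SQ would then be a BlockingQueue<M>, and CQ an array of them, one per consumer.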

How to synchronize access to many objects

I have a thread pool with some threads (e.g. as many as number of cores) that work on many objects, say thousands of objects. Normally I would give each object a mutex to protect access to its internals, lock it when I'm doing work, then release it. When two threads would try to access the same object, one of the threads has to wait.
Now I want to save some resources and be scalable, as there may be thousands of objects and still only a handful of threads. I'm thinking about a class design where the thread has some sort of mutex or lock object, and assigns the lock to the object when the object should be accessed. This would save resources, as I only have as many lock objects as I have threads.
Now comes the programming part, where I want to transfer this design into code, but don't know quite where to start. I'm programming in C++ and want to use Boost classes where possible, but self written classes that handle these special requirements are ok. How would I implement this?
My first idea was to have a boost::mutex object per thread, and each object has a boost::shared_ptr that initially is unset (or NULL). Now when I want to access the object, I lock it by creating a scoped_lock object and assign it to the shared_ptr. When the shared_ptr is already set, I wait on the present lock. This idea sounds like a heap full of race conditions, so I sort of abandoned it. Is there another way to accomplish this design? A completely different way?
Edit:
The above description is a bit abstract, so let me add a specific example. Imagine a virtual world with many objects (think > 100,000). Users could move through the world and modify objects (e.g. shoot arrows at monsters). When only using one thread, I'm fine with a work queue where modifications to objects are queued. I want a more scalable design, though. If 128-core processors are available, I want to use all 128, so I'd use that number of threads, each with a work queue. One solution would be to use spatial separation, e.g. one lock per area. This could reduce the number of locks used, but I'm more interested in whether there's a design which saves as many locks as possible.
You could use a mutex pool instead of allocating one mutex per resource or one mutex per thread. As mutexes are requested, first check the object in question. If it already has a mutex tagged to it, block on that mutex. If not, assign a mutex to that object and signal it, taking the mutex out of the pool. Once the mutex is unsignaled, clear the slot and return the mutex to the pool.
Without knowing it, what you were looking for is Software Transactional Memory (STM).
STM systems manage the needed locks internally to ensure the ACI properties (Atomic, Consistent, Isolated). This is an active research area. You can find a lot of STM libraries; in particular I'm working on Boost.STM (the library is not yet ready for beta test, and the documentation is not really up to date, but you can play with it). There are also some compilers that are introducing TM (such as the Intel, IBM, and Sun compilers). You can get the draft specification from here
The idea is to identify the critical regions as follows
transaction {
// transactional block
}
and let the STM system manage the needed locks, as long as it ensures the ACI properties.
The Boost.STM approach let you write things like
int inc_and_ret(stm::object<int>& i) {
    BOOST_STM_TRANSACTION {
        return ++i;
    } BOOST_STM_END_TRANSACTION
}
You can see the BOOST_STM_TRANSACTION/BOOST_STM_END_TRANSACTION pair as a way to delimit a scoped implicit lock.
The cost of this pseudo-transparency is 4 metadata bytes for each stm::object.
Even if this is far from your initial design, I really think it is what was behind your goal and initial design.
I doubt there's any clean way to accomplish your design. The problem is that assigning the mutex to the object modifies the contents of the object - so you need a mutex to protect the object from several threads trying to assign mutexes to it at once, and to keep your first mutex assignment safe, you'd need another mutex to protect the first one.
Personally, I think what you're trying to cure probably isn't a problem in the first place. Before I spent much time on trying to fix it, I'd do a bit of testing to see what (if anything) you lose by simply including a Mutex in each object and being done with it. I doubt you'll need to go any further than that.
If you need to do more than that I'd think of having a thread-safe pool of objects, and anytime a thread wants to operate on an object, it has to obtain ownership from that pool. The call to obtain ownership would release any object currently owned by the requesting thread (to avoid deadlocks), and then give it ownership of the requested object (blocking if the object is currently owned by another thread). The object pool manager would probably operate in a thread by itself, automatically serializing all access to the pool management, so the pool management code could avoid having to lock access to the variables telling it who currently owns what object and such.
Personally, here's what I would do. You have a number of objects, all probably have a key of some sort, say names. So take the following list of people's names:
Bill Clinton
Bill Cosby
John Doe
Abraham Lincoln
Jon Stewart
So now you would create a number of lists: one per letter of the alphabet, say. The two Bills would go in one list, John and Jon in another, and Abraham in a list by himself.
Each list would be assigned to a specific thread - access would have to go through that thread (you would have to marshal operations on an object onto that thread - a great use of functors). Then you only have two places to lock:
thread() {
    loop {
        scoped_lock lock(list.mutex);
        list.objectAccess();
    }
}

list_add() {
    scoped_lock lock(list.mutex);
    list.add(..);
}
Keep the locks to a minimum, and if you're still doing a lot of locking, you can increase the number of operations you perform on a list's objects per lock acquisition (say, from 1 to 5) to minimize the time spent acquiring locks. If your data set grows or is keyed by number, you can do any amount of segregating data to keep the locking to a minimum.
It sounds to me like you need a work queue. If the lock on the work queue becomes a bottleneck, you could switch it around so that each thread has its own work queue, and some sort of scheduler gives the incoming object to the thread with the least amount of work to do. The next level up from that is work stealing, where threads that have run out of work look at the work queues of other threads. (See Intel's Threading Building Blocks library.)
If I follow you correctly ....
struct table_entry {
    void * pObject;   // substitute with your object
    sem_t  sem;       // init to empty
    int    nPenders;  // init to zero
};

struct table_entry * table;

object_lock (void * pObject) {
    goto label;                       // yes, it is an evil goto
    do {
        pEntry->nPenders++;
        unlock (mutex);
        sem_wait (&pEntry->sem);
label:
        lock (mutex);
        found = search (table, pObject, &pEntry);
    } while (found);
    add_object_to_table (table, pObject);
    unlock (mutex);
}

object_unlock (void * pObject) {
    lock (mutex);
    pEntry = remove (table, pObject); // assuming it is in the table
    if (pEntry->nPenders != 0) {
        pEntry->nPenders--;
        sem_post (&pEntry->sem);
    }
    unlock (mutex);
}
The above should work, but it does have some potential drawbacks such as ...
A possible bottleneck in the search.
Thread starvation. There is no guarantee that any given thread will get out of the do-while loop in object_lock().
However, depending upon your setup, these potential draw-backs might not matter.
Hope this helps.
We have an interest in a similar model. A solution we have considered is to have a global (or shared) lock, but used in the following manner:
A flag that can be atomically set on the object. If you set the flag you then own the object.
You perform your action then reset the variable and signal (broadcast) a condition variable.
If the acquire fails, you wait on the condition variable; when it is broadcast, you check the flag again to see if the object is available.
It does appear, though, that we need to lock the mutex each time we change the value of this flag. So there is a lot of locking and unlocking, but you do not need to hold the lock for any long period.
With a "shared" lock you have one lock applying to multiple items. You would use some kind of "hash" function to determine which mutex/condition variable applies to this particular entry.
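A minimal C++11 sketch of that scheme (names invented here; one gate/freed pair could be shared by many objects, selected by the hash function just mentioned):

#include <atomic>
#include <condition_variable>
#include <mutex>

struct Object {
    std::atomic<bool> busy{false};  // the atomically settable flag
    // ... object data ...
};

std::mutex gate;                // the shared lock
std::condition_variable freed;  // broadcast whenever an object is released

void acquire(Object& o)
{
    std::unique_lock<std::mutex> lk(gate);
    // exchange() returns the previous value: false means we won the flag
    freed.wait(lk, [&] { return !o.busy.exchange(true); });
}

void release(Object& o)
{
    { std::lock_guard<std::mutex> lk(gate);  // per the note above, the flag
      o.busy.store(false); }                 // is changed under the mutex
    freed.notify_all();                      // wake waiters so they re-check
}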
Answering the following question, asked under @JohnDibling's post by @LeonardoBernardini:
did you implement this solution? I have a similar problem and I would like to know how you solved releasing the mutex back to the pool. I mean, how do you know, when you release the mutex, that it can safely be put back in the queue if you do not know whether another thread is holding it?
I'm currently trying to solve the same kind of problem. My approach is to create your own mutex struct (call it counterMutex) with a counter field and the real resource mutex field. So every time you try to lock the counterMutex, first you increment the counter, then lock the underlying mutex. When you're done with it, you decrement the counter and unlock the mutex; after that, check the counter to see if it's zero, which means no other thread is trying to acquire the lock. If so, put the counterMutex back in the pool. Is there a race condition when manipulating the counter, you may ask? The answer is NO: remember you have a global mutex to ensure that only one thread can access the counterMutex at any one time.
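A sketch of that counterMutex idea (names taken from the comment above; poolMutex is the global mutex it mentions):

#include <mutex>

std::mutex poolMutex;  // the global mutex guarding all counters and the pool

struct counterMutex {
    int counter = 0;   // threads holding or waiting for 'real'
    std::mutex real;   // the actual resource mutex
};

void lock(counterMutex& cm)
{
    { std::lock_guard<std::mutex> g(poolMutex);
      ++cm.counter; }  // register interest before blocking
    cm.real.lock();
}

void unlock(counterMutex& cm)
{
    cm.real.unlock();
    std::lock_guard<std::mutex> g(poolMutex);
    if (--cm.counter == 0) {
        // no thread holds or waits for 'real': safe to return it to the pool
    }
}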

How to implement a recursive MRSW lock?

I need a fully recursive multiple-reader/single-writer lock (shared mutex) for my project. I don't agree with the notion that you shouldn't need one if you have complete const-correctness (there was some discussion about that on the boost mailing list); in my case the lock should protect a completely transparent cache, which would be mutable in any case.
As for the semantics of recursive MRSW locks, I think the only ones that make sense are that acquiring an exclusive lock in addition to a shared one temporarily releases the shared one, to be reacquired when the exclusive one is released.
This has the somewhat strange effect that unlocking can wait, but I can live with that - writing rarely happens anyway, and recursive locking usually only happens through recursive code paths, in which case the caller has to be prepared that the call might wait in any case. To avoid it, one can still simply upgrade the lock instead of using recursive locking.
Acquiring a shared lock on top of an exclusive one should obviously just increase the lock count.
So the question becomes - how should I implement it? The usual approach with a critical section and two semaphores doesn't work here because - as far as I can see - the woken-up thread has to handshake, by inserting its thread id into the lock's owner map.
I suppose it would be doable with two condition variables and a couple of mutexes but the sheer amount of synchronization primitives that would end up using sounds like a bit too much overhead for my taste.
An idea which just sprang into my mind is to utilize TLS to remember the type of lock I'm holding (and possibly the local lock counts). Have to think it through - but I'll still post the question for now.
Target platform is Win32 but that shouldn't really matter. Note that I'm specifically targeting Win2k so anything related to the new MRSW lock primitive in Windows 7 is not relevant for me. :-)
Okay, I solved it.
It can be done with just 2 semaphores, a critical section and almost no more locking than for a regular non-recursive MRSW lock (there is obviously some more CPU time spent inside the lock because that multiset must be managed) - but it's tricky. The structure I came up with looks like this:
// Protects everything that follows, except mWriterThreadId and mRecursiveUpgrade
CRITICAL_SECTION mLock;
// Semaphore to wait on for a read lock
HANDLE mSemaReader;
// Semaphore to wait on for a write lock
HANDLE mSemaWriter;
// Number of threads waiting for a write lock.
int mWriterWaiting;
// Number of times the writer entered the write lock.
int mWriterActive;
// Number of threads inside a read lock. Note that this does not include
// recursive read locks.
int mReaderActiveThreads;
// Whether or not the current writer obtained the lock by a recursive
// upgrade. Note that this member might be set outside the critical
// section, so it should only be read from by the writer during his
// unlock.
bool mRecursiveUpgrade;
// This member contains the current thread id once for each
// (recursive) read lock held by the current thread in addition to an
// undefined number of other thread ids which may or may not hold a
// read lock, even inside the critical section (!).
std::multiset<unsigned long> mReaderActive;
// If there is no writer this member contains 0.
// If the current thread is the writer this member contains his
// thread-id.
// Otherwise it can contain either of them, even inside the
// critical section (!).
// Also note that it might be set outside the critical section.
unsigned long mWriterThreadId;
Now, the basic idea is this:
Full update of mWriterWaiting and mWriterActive for an unlock is performed by the unlocking thread.
For mWriterThreadId and mReaderActive this is not possible, as the waiting thread needs to insert itself when it was released.
So the rule is that you may never access those two members except to check whether you are holding a read lock or are the current writer - specifically, they may not be used to check whether there are any readers/writers - for that you have to use the (somewhat redundant, but necessary for this reason) mReaderActiveThreads and mWriterActive.
I'm currently running some test code (which has been going on deadlock- and crash-free for 30 minutes or so) - when I'm sure that it's stable and I've cleaned up the code somewhat I'll put it on some pastebin and add a link in a comment here (just in case someone else ever needs this).
Well, I did some thinking. Starting from the simple "two semaphores and a critical section" one adds a writer lock count and an owning writer TID to the structure.
Unlock still sets most of the new status in the critsec. Readers still normally increase the lock count - recursive locking simply adds a non-existing reader to the counter.
During the writer's lock() I compare the owning TID, and if the writer already owns it, the write lock counter is incremented.
Setting the new writer TID can't be done by the unlock() - it doesn't know which one will be woken - but if writers reset it back to zero in their unlock() it's not a problem: the current thread id won't ever be zero, and setting it is an atomic operation.
All sounds simple enough, but one nasty problem is left: a recursive reader lock taken while a writer is waiting will deadlock. And I don't know how to solve that short of making the lock reader-biased... somehow I need to know whether or not I already own a reader lock.
Using TLS doesn't sound too great after I realized that the number of available slots might be rather limited...
As far as I understand, you need to provide your writer exclusive access to the data, while readers can operate simultaneously (if this is not what you want, please clarify your question).
I think you need to implement a sort of "inverse semaphore", i.e. a semaphore that will block a thread when positive, and signal all waiting threads when zero. If you do this, you can use two such semaphores for your program. The operation of your threads could then be the following:
Reader:
(1) wait on sem A
(2) increase sem B
(3) read operation
(4) decrease sem B
Writer:
(1) increase sem A
(2) wait on sem B
(3) write operation
(4) decrease sem A
In this way the writer will perform the write operation as soon as all pending readers have finished reading. As soon as your writer finishes, readers can resume their operation without blocking each other.
I am not familiar with Windows mutex/semaphore facilities, but I can think of a way to implement such semaphores using the POSIX threads API (combining a mutex, a counter, and a condition variable).
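For illustration, here is a portable C++11 sketch of such an "inverse semaphore" (raw POSIX calls would work the same way; the class name is invented here):

#include <condition_variable>
#include <mutex>

class inverse_semaphore {
public:
    void increase() {
        std::lock_guard<std::mutex> lk(m_);
        ++count_;
    }
    void decrease() {
        std::lock_guard<std::mutex> lk(m_);
        if (--count_ == 0)
            cv_.notify_all();  // reached zero: release all waiting threads
    }
    void wait() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return count_ == 0; });  // block while positive
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    int count_ = 0;
};

// Reader: A.wait(); B.increase(); read; B.decrease();
// Writer: A.increase(); B.wait(); write; A.decrease();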