C++: Recursive locks - are there any drawbacks?

The background: I have a few threads that should access shared data. One of the threads might lock a Mutex, and within the mutual exclusion block, some functions (of the same thread) might call the very same lock again.
-I don't want to create many Mutexes
-I don't want to give up locking (obviously)
-I'd rather not change the design as it's quite a big change
void funcB()
{
lock(MA);
...
unlock(MA);
}
void funcA()
{
lock(MA);
...
funcB();
...
unlock(MA);
}
It seems the only way to go is - use a recursive lock. Are there any drawbacks to using this feature?
Of course, if you think of any other way to solve this case, please share

are there any drawbacks?
Slight performance penalty - measure if you care.
any other way to solve
You could give funcB a bool should_lock = true argument, or lots of variations on the theme, e.g. have one overload that locks a mutex then calls another overload that expects a reference to an already locked mutex (perhaps use an assert to check it's locked in debug builds): then funcA can call the latter.
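The overload variant could be sketched roughly like this. It assumes std::mutex and std::unique_lock rather than the lock()/unlock() calls from the question, and shared_counter is only a stand-in for the real shared data:

```cpp
#include <cassert>
#include <mutex>

// Illustrative shared state guarded by MA (shared_counter is hypothetical).
std::mutex MA;
int shared_counter = 0;

// Overload for callers that already hold MA. Passing the lock object
// documents the precondition, and the assert checks it in debug builds.
void funcB(const std::unique_lock<std::mutex>& held)
{
    assert(held.owns_lock() && held.mutex() == &MA);
    ++shared_counter;
}

// Overload for callers that do not hold MA: lock, then delegate.
void funcB()
{
    std::unique_lock<std::mutex> lk(MA);
    funcB(lk);
}

void funcA()
{
    std::unique_lock<std::mutex> lk(MA);
    ++shared_counter;
    funcB(lk);   // reuses the lock already held, no recursive mutex needed
}
```

Code outside the critical section keeps calling funcB() as before; only callers that already hold the lock use the other overload.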

Related

C++ atomics: how to allow only a single thread to access a function?

I'd like to write a function that is accessible only by a single thread at a time. I don't need busy waits, a brutal 'rejection' is enough if another thread is already running it. This is what I have come up with so far:
std::atomic<bool> m_busy(false);
bool func()
{
if (m_busy.exchange(true) == true)
return false;
// ... do stuff ...
m_busy.exchange(false);
return true;
}
Is the logic for the atomic exchange correct?
Is it correct to mark the two atomic operations as std::memory_order_acq_rel? As far as I understand a relaxed ordering (std::memory_order_relaxed) wouldn't be enough to prevent reordering.

Your atomic swap implementation might work. But trying to do thread-safe programming without a lock is almost always fraught with issues and is often harder to maintain.
Unless you need the performance improvement, std::mutex with the try_lock() method is all you need, e.g.:
std::mutex mtx;
bool func()
{
// making use of std::unique_lock so if the code throws an
// exception, the std::mutex will still get unlocked correctly...
std::unique_lock<std::mutex> lck(mtx, std::try_to_lock);
bool gotLock = lck.owns_lock();
if (gotLock)
{
// do stuff
}
return gotLock;
}

Your code looks correct to me, as long as you leave the critical section by falling out, not returning or throwing an exception.
You can unlock with a release store; an RMW (like exchange) is unnecessary. The initial exchange only needs acquire. (But does need to be an atomic RMW like exchange or compare_exchange_strong)
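A minimal sketch of that ordering, assuming the flag is a namespace-scope std::atomic<bool>. The names busy_flag and BusyReset are illustrative, and the small RAII helper is an addition that also covers the early-return and exception cases mentioned above:

```cpp
#include <atomic>
#include <cassert>

std::atomic<bool> busy_flag{false};   // illustrative name

// Hypothetical RAII helper: clears the flag on every exit path.
struct BusyReset {
    ~BusyReset()
    {
        assert(busy_flag.load(std::memory_order_relaxed)); // we still own it
        busy_flag.store(false, std::memory_order_release); // plain release store
    }
};

bool func()
{
    // Atomic RMW with acquire ordering: only one thread can observe 'false'.
    if (busy_flag.exchange(true, std::memory_order_acquire))
        return false;      // another thread is inside, reject
    BusyReset reset;       // the release store happens in its destructor
    // ... do stuff ...
    return true;
}
```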
Note that ISO C++ says that taking a std::mutex is an "acquire" operation, and releasing it is a "release" operation, because that's the minimum necessary for keeping the critical section contained between the taking and the releasing.
Your algo is exactly like a spinlock, but without retry if the lock's already taken (i.e. just a try_lock). All the reasoning about necessary memory-order for locking applies here, too. What you've implemented is logically equivalent to the try_lock / unlock in @selbie's answer, and very likely performance-equivalent, too. If you never use mtx.lock() or whatever, you're never actually blocking, i.e. waiting for another thread to do something, so your code is still potentially lock-free in the progress-guarantee sense.
Rolling your own with an atomic<bool> is probably good; using std::mutex here gains you nothing; you want it to be doing only this for try-lock and unlock. That's certainly possible (with some extra function-call overhead), but some implementations might do something more. You're not using any of the functionality beyond that. The one nice thing std::mutex gives you is the comfort of knowing that it safely and correctly implements try_lock and unlock. But if you understand locking and acquire / release, it's easy to get that right yourself.
The usual performance reason to not roll your own locking is that mutex will be tuned for the OS and typical hardware, with stuff like exponential backoff, x86 pause instructions while spinning a few times, then fallback to a system call. And efficient wakeup via system calls like Linux futex. All of this is only beneficial to the blocking behaviour. .try_lock leaves that all unused, and if you never have any thread sleeping then unlock never has any other threads to notify.
There is one advantage to using std::mutex: you can use RAII without having to roll your own wrapper class. std::unique_lock with the std::try_to_lock policy will do this. This will make your function exception-safe, making sure to always unlock before exiting, if it got the lock.

Multiple mutex locking strategies and why libraries don't use address comparison

There is a widely known way of locking multiple locks, which relies on choosing a fixed linear ordering and acquiring locks according to this ordering.
That was proposed, for example, in the answer to "Acquire a lock on two mutexes and avoid deadlock". In particular, the solution based on address comparison seems quite elegant and obvious.
When I tried to check how it is actually implemented, I found, to my surprise, that this solution is not widely used.
To quote the Kernel Docs - Unreliable Guide To Locking:
Textbooks will tell you that if you always lock in the same order, you
will never get this kind of deadlock. Practice will tell you that this
approach doesn't scale: when I create a new lock, I don't understand
enough of the kernel to figure out where in the 5000 lock hierarchy it
will fit.
PThreads doesn't seem to have such a mechanism built in at all.
Boost.Thread came up with a completely different solution: lock() for multiple (2 to 5) mutexes is based on trying to lock as many mutexes as possible at the moment.
This is the fragment of the Boost.Thread source code (Boost 1.48.0, boost/thread/locks.hpp:1291):
template<typename MutexType1,typename MutexType2,typename MutexType3>
void lock(MutexType1& m1,MutexType2& m2,MutexType3& m3)
{
unsigned const lock_count=3;
unsigned lock_first=0;
for(;;)
{
switch(lock_first)
{
case 0:
lock_first=detail::lock_helper(m1,m2,m3);
if(!lock_first)
return;
break;
case 1:
lock_first=detail::lock_helper(m2,m3,m1);
if(!lock_first)
return;
lock_first=(lock_first+1)%lock_count;
break;
case 2:
lock_first=detail::lock_helper(m3,m1,m2);
if(!lock_first)
return;
lock_first=(lock_first+2)%lock_count;
break;
}
}
}
where lock_helper returns 0 on success and otherwise the number of mutexes that weren't successfully locked.
Why is this solution better than comparing addresses or any other kind of IDs? I don't see any problems with pointer comparison that this kind of "blind" locking would avoid.
Are there any other ideas on how to solve this problem on a library level?

From the bounty text:
I'm not even sure if I can prove correctness of the presented Boost solution, which seems more tricky than the one with linear order.
The Boost solution cannot deadlock because it never waits while already holding a lock. All locks but the first are acquired with try_lock. If any try_lock call fails to acquire its lock, all previously acquired locks are released. Also, in the Boost implementation the next attempt starts from the lock that could not be acquired the previous time, and first waits until it is available; it's a smart design decision.
As a general rule, it's always better to avoid blocking calls while holding a lock. Therefore, the solution with try-lock, if possible, is preferred (in my opinion). As a particular consequence, in the case of lock ordering, the system as a whole might get stuck. Imagine the very last lock (e.g. the one with the biggest address) was acquired by a thread which was then blocked. Now imagine some other thread needs the last lock and another lock, and due to ordering it will first get the other one and will wait on the last lock. The same can happen with all other locks, and the whole system makes no progress until the last lock is released. Of course it's an extreme and rather unlikely case, but it illustrates the inherent problem with lock ordering: the higher a lock's number, the more indirect impact the lock has when acquired.
The shortcoming of the try-lock-based solution is that it can cause livelock, and in extreme cases the whole system might also get stuck for at least some time. Therefore it is important to have some back-off scheme that makes the pauses between locking attempts longer over time, and perhaps randomized.
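For two mutexes, the try-and-back-off idea can be sketched as follows. This is a simplified illustration, not the Boost code: the function name lock_both is made up, and a plain yield stands in for a real back-off scheme:

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Two-mutex sketch of try-and-back-off locking (hypothetical helper).
template <typename M1, typename M2>
void lock_both(M1& a, M2& b)
{
    // Locking the same mutex twice would never succeed.
    assert(static_cast<void*>(&a) != static_cast<void*>(&b));
    for (;;) {
        {
            std::unique_lock<M1> first(a);   // blocking wait, nothing else is held
            if (b.try_lock()) {              // never blocks while holding 'a'
                first.release();             // keep 'a' locked for the caller
                return;
            }
        }                                    // try_lock failed: 'a' is unlocked here
        {
            std::unique_lock<M2> first(b);   // now wait on the contended mutex
            if (a.try_lock()) {
                first.release();
                return;
            }
        }
        std::this_thread::yield();           // crude back-off between rounds
    }
}
```

In real code, std::lock(a, b) or C++17 std::scoped_lock already provide deadlock-free locking of several mutexes, so a hand-written loop like this is mainly useful for understanding the technique.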

Sometimes, lock A needs to be acquired before lock B does. Lock B might have either a lower or a higher address, so you can't use address comparison in this case.
Example: When you have a tree data structure, and threads try to read and update nodes, you can protect the tree using a reader-writer lock per node. This only works if your threads always acquire locks top-down, root-to-leaf. The address of the locks does not matter in this case.
You can only use address comparison if it does not matter at all which lock gets acquired first. If this is the case, address comparison is a good solution. But if this is not the case you can't do it.
I guess the Linux kernel requires certain subsystems to be locked before others are. This cannot be done using address comparison.

The "address comparison" and similar approaches, although used quite often, are special cases. They work fine if you have
a lock-free mechanism to get
two (or more) "items" of the same kind or hierarchy level
any stable ordering schema between those items
For example: You have a mechanism to get two "accounts" from a list. Assume that the access to the list is lock-free. Now you have pointers to both items and want to lock them. Since they are "siblings" you have to choose which one to lock first. Here the approach using addresses (or any other stable ordering schema like "account id") is OK.
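A minimal sketch of the account case, assuming each account carries its own std::mutex. The Account type and transfer function are hypothetical:

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <utility>

struct Account {          // hypothetical type for illustration
    std::mutex m;
    long balance = 0;
};

// Locks both accounts in address order, so every thread agrees on the order.
// std::less yields a total order on pointers even for unrelated objects.
void transfer(Account& from, Account& to, long amount)
{
    assert(&from != &to);   // locking the same std::mutex twice is undefined
    Account* first  = &from;
    Account* second = &to;
    if (std::less<Account*>{}(second, first))
        std::swap(first, second);
    std::lock_guard<std::mutex> l1(first->m);
    std::lock_guard<std::mutex> l2(second->m);
    from.balance -= amount;
    to.balance += amount;
}
```

Any other stable key, such as an account id, would work in place of the address.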
But the linked Linux text talks about "lock hierarchies". This means locking not between "siblings" (of the same kind) but between "parent" and "children" which might be from different types. This may happen in actual tree structures as well in other scenarios.
Contrived example: To load a program you must
lock the file inode,
lock the process table
lock the destination memory
These three locks are neither "siblings" nor in a clear hierarchy. The locks are also not taken directly one after the other: each subsystem takes its locks at will. If you consider all use cases where those three (and more) subsystems interact, you see that there is no clear, stable ordering you can think of.
The Boost library is in the same situation: It strives to provide generic solutions. So they cannot assume the points from above and must fall back to a more complicated strategy.

One scenario when address compare will fail is if you use the proxy pattern.
You can delegate the locks to the same object and the addresses will be different.
Consider the following example
#include <iostream>

template<typename MutexType>
class MutexHelper
{
public:
    MutexHelper(MutexType &m) : _m(m) {}
    void lock()
    {
        std::cout << "locking ";
        _m.lock();   // forward to the wrapped mutex
    }
    void unlock()
    {
        std::cout << "unlocking ";
        _m.unlock();
    }
private:
    MutexType &_m;
};
if the function
template<typename MutexType1,typename MutexType2,typename MutexType3>
void lock(MutexType1& m1,MutexType2& m2,MutexType3& m3);
will actually use address comparison, the following code can produce a deadlock:
Mutex m1;
Mutex m2;
thread1:
MutexHelper hm1(m1);
MutexHelper hm2(m2);
lock(hm1, hm2);
thread2:
MutexHelper hm2(m2);
MutexHelper hm1(m1);
lock(hm1, hm2);
EDIT:
This is an interesting thread that sheds some light on the boost::lock implementation:
thread-best-practice-to-lock-multiple-mutexes
Address compare does not work for inter-process shared mutexes (named synchronization objects).

How to detect circular calls?

I've been looking for causes for deadlocks and strategies/tools to avoid and detect them.
Another potential cause for deadlocks is to have blocking functions calling other blocking functions in a circular way, so that eventually a call never returns.
Sometimes this is hard to discover, especially in very large projects.
So, are there any tools/libraries/techniques that automate the detection of circular calls in a program?
EDIT:
I code mostly in C and C++ so, if possible, give any information about the topic that is applicable to those languages.
Nevertheless, it seems this topic is scarcely covered on SO, so answers for other languages are OK too, although those may deserve a topic of their own if someone finds it relevant.
Thanks.

Circular (or recursive) calls that try to acquire the same non-reentrant lock are one of the easiest to debug blocking scenarios: locking is deterministic, and can be easily checked. When the application locks, fire up the debugger and look at the stack trace to understand what locks are held and why.
As to general solutions for the problem of locking... you can look into libraries that provide mutex ordering and detect when you are trying to lock a mutex out of order. This type of solution might be complex to implement correctly, but once in place it ensures that you cannot enter a deadlock condition, as it forces all processes to obtain the locks in the same order (i.e. if process A holds lock La and tries to acquire lock Lb, for which the ordering is correct, then it can either succeed or block, but whichever process is holding lock Lb cannot try to lock La, as the ordering constraint would not be met).
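One common way to get such ordering checks within a single process is a mutex wrapper that carries a fixed level and remembers, per thread, the level of the last mutex taken. The sketch below is an illustration of that idea, not the API of any particular library, and the class name hierarchical_mutex is an assumption:

```cpp
#include <cassert>
#include <climits>
#include <mutex>
#include <stdexcept>

// Hypothetical wrapper: a thread may only lock a mutex whose level is
// strictly lower than the level of the last mutex it locked.
class hierarchical_mutex {
public:
    explicit hierarchical_mutex(unsigned long level) : level_(level) {}

    void lock()
    {
        if (level_ >= current_level_)
            throw std::logic_error("mutex hierarchy violated");
        m_.lock();
        previous_level_ = current_level_;   // remembered for unlock()
        current_level_ = level_;
    }

    void unlock()
    {
        assert(current_level_ == level_);   // unlock in reverse order of locking
        current_level_ = previous_level_;
        m_.unlock();
    }

private:
    std::mutex m_;
    const unsigned long level_;
    unsigned long previous_level_ = 0;
    static thread_local unsigned long current_level_;
};

// A thread that holds nothing may lock a mutex of any level.
thread_local unsigned long hierarchical_mutex::current_level_ = ULONG_MAX;
```

An out-of-order attempt then fails immediately with an exception on every run, instead of deadlocking only under unlucky timing.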

If you are on Linux, there are two Valgrind tools for detecting deadlocks and race conditions: Helgrind and DRD. They complement each other, and it's worth checking for thread errors with both of them.

On Linux you can use Valgrind to detect deadlocks: use --tool=helgrind.

Best way to detect deadlocks (IMO) is to make a test program that calls all the functions in a random order in like 30 different threads 10000s of times.
If you get a deadlock you can use VS2010 "Parallel Stacks" window. Debug->Windows->Parallel Stacks
This window will show you all the stacks, so you can find the methods that are deadlocking.
A simple strategy I use to write thread-safe objects:
A thread safe object should be safe when its public methods are called, so you don't get deadlocks when it is used.
So, the idea is to just lock all the public methods that access the object's data.
Besides that, you need to ensure that within the class's code you never call a public method. If you need to use one of the public methods, then make that method private, and wrap the private method with a public method that locks and then calls it.
If you want better lock granularity you could just create objects for each part that has its own lock, and lock it like I suggested. Then use encapsulation to combine those classes to the one class.
Example:
class Blah {
MyData data;
Lock lock;
public:
DataItem GetData(int index)
{
ReadLock read(lock);
return LocalGetData(index);
}
DataItem FindData(string key)
{
ReadLock read(lock);
DataItem item;
//find the item, can use LocalGetData() to get the item without deadlocking
return item;
}
void PutData(DataItem item)
{
WriteLock write(lock);
//put item in database
}
private:
DataItem LocalGetData(int index)
{
return data[index];
}
};

You could find a tool that builds a call graph, and check the graph for cycles.
Otherwise, there are a number of strategies for detecting deadlocks or other circularities, but they all depend on having some sort of supporting infrastructure in place.
There are deadlock avoidance strategies, having to do with assigning lock priorities and ordering the locks according to priority. These require code changes and enforcing the standards, though.

Adding locks to the class by composition

I'm writing a thread-safe class in C++. All of its public methods use locks (non-recursive spin locks). Private methods are lock-free. So, everything should be OK: the user calls a public method, it locks the object and then does the work through private methods. But I got a deadlock when a public method called another public method. I've read that recursive mutexes are bad, because it's difficult to debug them. So I use C's stdio way: the public method Foo() only locks the object and calls Foo_nolock() to do the whole work. But I don't like these _nolock() methods. I think they duplicate my code.
So I've got an idea: I'll make a lock-free class BarNoLock, and a thread-safe class Bar that has only one member: an instance of BarNoLock. And all of Bar's methods will only lock this member and call its methods.
Is it a good idea or maybe there are some better patterns/practices? Thanks.
Update: I know about pimpl and bridge. I ask about multi-threading patterns, not OOP.

I'm not sure why recursive mutexes would be considered bad, see this question for a discussion of them.
Recursive Lock (Mutex) vs Non-Recursive Lock (Mutex)
But I don't think that's necessarily your problem because Win32 critical sections support multiple entries from the same thread without blocking. From the doc:
When a thread owns a critical section, it can make additional calls to EnterCriticalSection or TryEnterCriticalSection without blocking its execution. This prevents a thread from deadlocking itself while waiting for a critical section that it already owns. To release its ownership, the thread must call LeaveCriticalSection one time for each time that it entered the critical section. There is no guarantee about the order in which waiting threads will acquire ownership of the critical section
So maybe you were doing something else wrong when you were deadlocking yourself? Having to work around not deadlocking yourself on the same mutex from the same thread with weird function call semantics is not something you should have to do.

Looks like you have reinvented the Bridge Pattern. Sounds perfectly in order.
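The composition described in the question might look roughly like this. The sketch uses std::mutex instead of the asker's spin lock, and the add/add_twice/size methods are made up for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Unsynchronized implementation: methods may call each other freely.
class BarNoLock {
public:
    void add(int v) { items_.push_back(v); }
    void add_twice(int v) { add(v); add(v); }   // reuses add(), no locking involved
    std::size_t size() const { return items_.size(); }
private:
    std::vector<int> items_;
};

// Thread-safe facade: every public method locks once, then delegates.
class Bar {
public:
    void add(int v)          { std::lock_guard<std::mutex> g(m_); impl_.add(v); }
    void add_twice(int v)    { std::lock_guard<std::mutex> g(m_); impl_.add_twice(v); }
    std::size_t size() const { std::lock_guard<std::mutex> g(m_); return impl_.size(); }
private:
    mutable std::mutex m_;   // mutable so const methods can lock
    BarNoLock impl_;
};
```

Because Bar never calls its own public methods, each call takes the lock exactly once and a non-recursive lock is sufficient.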

When to use recursive mutex?

I understand recursive mutex allows mutex to be locked more than once without getting to a deadlock and should be unlocked the same number of times. But in what specific situations do you need to use a recursive mutex? I'm looking for design/code-level situations.

For example, when you have a function that calls itself recursively and you want synchronized access to it:
void foo() {
... mutex_acquire();
... foo();
... mutex_release();
}
without a recursive mutex you would have to create an "entry point" function first, and this becomes cumbersome when you have a set of functions that are mutually recursive. Without recursive mutex:
void foo_entry() {
mutex_acquire(); foo(); mutex_release(); }
void foo() { ... foo(); ... }

Recursive and non-recursive mutexes have different use cases. No mutex type can easily replace the other. Non-recursive mutexes have less overhead, and recursive mutexes have in some situations useful or even needed semantics and in other situations dangerous or even broken semantics. In most cases, someone can replace any strategy using recursive mutexes with a different safer and more efficient strategy based on the usage of non-recursive mutexes.
If you just want to exclude other threads from using your mutex protected resource, then you could use any mutex type, but might want to use the non-recursive mutex because of its smaller overhead.
If you want to call functions recursively, which lock the same mutex, then they either
have to use one recursive mutex, or
have to unlock and lock the same non-recursive mutex again and again (beware of concurrent threads!) (assuming this is semantically sound, it could still be a performance issue), or
have to somehow annotate which mutexes they already locked (simulating recursive ownership/mutexes).
If you want to lock several mutex-protected objects from a set of such objects, where the sets could have been built by merging, you can choose
to use per object exactly one mutex, allowing more threads to work in parallel, or
to use per object one reference to any possibly shared recursive mutex, to lower the probability of failing to lock all mutexes together, or
to use per object one comparable reference to any possibly shared non-recursive mutex, circumventing the intent to lock multiple times.
If you want to release a lock in a different thread than it has been locked, then you have to use non-recursive locks (or recursive locks which explicitly allow this instead of throwing exceptions).
If you want to use synchronization variables, then you need to be able to explicitly unlock the mutex while waiting on any synchronization variable, so that the resource is allowed to be used in other threads. That is only sanely possible with non-recursive mutexes, because recursive mutexes could already have been locked by the caller of the current function.
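In standard C++ this shows up directly in the API: std::condition_variable only accepts a std::unique_lock<std::mutex>, and wait() releases that mutex while the thread sleeps. A small sketch, with IntQueue as a hypothetical example class:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>

// Hypothetical queue protected by one non-recursive mutex.
class IntQueue {
public:
    void push(int v)
    {
        {
            std::lock_guard<std::mutex> g(m_);
            q_.push(v);
        }
        cv_.notify_one();   // wake one waiting consumer, if any
    }

    int pop()   // blocks until an item is available
    {
        std::unique_lock<std::mutex> lk(m_);
        // wait() unlocks m_ while sleeping and re-locks it before returning.
        cv_.wait(lk, [this] { return !q_.empty(); });
        assert(!q_.empty());
        int v = q_.front();
        q_.pop();
        return v;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<int> q_;
};
```

With a recursive mutex, std::condition_variable_any could be used instead, but a wait would release only one level of ownership, so other threads would stay locked out if a caller further up the stack also held the mutex.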

I encountered the need for a recursive mutex today, and I think it's maybe the simplest example among the posted answers so far:
This is a class that exposes two API functions, Process(...) and reset().
public void Process(...)
{
acquire_mutex(mMutex);
// Heavy processing
...
reset();
...
release_mutex(mMutex);
}
public void reset()
{
acquire_mutex(mMutex);
// Reset
...
release_mutex(mMutex);
}
Both functions must not run concurrently because they modify internals of the class, so I wanted to use a mutex.
Problem is, Process() calls reset() internally, and it would create a deadlock because mMutex is already acquired.
Locking them with a recursive lock instead fixes the problem.
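In standard C++ the same structure could be written with std::recursive_mutex. The class name Worker and the total_ member are placeholders for the real processing state:

```cpp
#include <cassert>
#include <mutex>

class Worker {   // hypothetical class for illustration
public:
    void Process(int input)
    {
        std::lock_guard<std::recursive_mutex> g(mMutex);
        reset();             // same thread locks mMutex a second time: allowed
        total_ += input;
    }

    void reset()
    {
        std::lock_guard<std::recursive_mutex> g(mMutex);
        total_ = 0;
    }

    int total()
    {
        std::lock_guard<std::recursive_mutex> g(mMutex);
        return total_;
    }

private:
    std::recursive_mutex mMutex;
    int total_ = 0;
};
```

With a plain std::mutex in place of std::recursive_mutex, the nested lock in reset() would be undefined behavior and would typically deadlock.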

If you want to see an example of code that uses recursive mutexes, look at the sources for "Electric Fence" for Linux/Unix. 'Twas one of the common Unix tools for finding "bounds checking" read/write overruns and underruns as well as using memory that has been freed, before Valgrind came along.
Just compile and link electric fence with sources (option -g with gcc/g++), and then link it with your software with the link option -lefence, and start stepping through the calls to malloc/free. http://elinux.org/Electric_Fence

It would certainly be a problem if a thread blocked trying to acquire (again) a mutex it already owned...
Is there a reason to not permit a mutex to be acquired multiple times by the same thread?

In general, like everyone here said, it's more about design. A recursive mutex is normally used in recursive functions.
What others fail to tell you here is that there's actually almost no cost overhead in recursive mutexes.
In general, a simple mutex is a 32 bits key with bits 0-30 containing owner's thread id and bit 31 a flag saying if the mutex has waiters or not. It has a lock method which is a CAS atomic race to claim the mutex with a syscall in case of failure. The details are not important here. It looks like this:
class mutex {
public:
void lock();
void unlock();
protected:
uint32_t key{}; //bits 0-30: thread_handle, bit 31: hasWaiters_flag
};
a recursive_mutex is normally implemented as:
class recursive_mutex : public mutex {
public:
void lock() {
uint32_t handle = current_thread_native_handle(); //obtained from TLS memory in most OS
if ((key & 0x7FFFFFFF) == handle) { // Can only be true if this thread already owns the mutex.
uses++; // we own the mutex, just increase uses.
} else {
mutex::lock(); // we don't own the mutex, try to obtain it.
uses = 1;
}
}
void unlock() {
// asserts for debug, we should own the mutex and uses > 0
--uses;
if (uses == 0) {
mutex::unlock();
}
}
private:
uint32_t uses{}; // no need to be atomic, can only be modified in exclusion and only interesting read is on exclusion.
};
As you can see, it's an entirely user-space construct. (The base mutex is not, though: it MAY fall into a syscall if it fails to obtain the key in an atomic compare-and-swap on lock, and it will do a syscall on unlock if the hasWaiters_flag is set.)
For a base mutex implementation: https://github.com/switchbrew/libnx/blob/master/nx/source/kernel/mutex.c

If you want to be able to call public methods from different threads inside other public methods of a class and many of these public methods change the state of the object, you should use a recursive mutex. In fact, I make it a habit of using by default a recursive mutex unless there is a good reason (e.g. special performance considerations) not to use it.
It leads to better interfaces, because you don't have to split your implementation among non-locked and locked parts and you are free to use your public methods with peace of mind inside all methods as well.
It leads also in my experience to interfaces that are easier to get right in terms of locking.

It seems no one has mentioned it before, but code using recursive_mutex is way easier to debug, since its internal structure contains the identifier of the thread holding it.
