std::scoped_lock and mutex ordering

std::scoped_lock and mutex ordering - c++

I'm trying to determine if std::scoped_lock tries to establish an ordering or mutex id to acquire locks in a prescribed order. Is not clear to me that it does it from a somewhat brief looking at a browsable implementation I found googling around.
In case it is not doing that, What would be the closest to standard implementation of acquiring an ordered set of locks?
Typically the cleanest way to avoid deadlocks is to always acquire a group of locks in the same order (and yes, sure, always release them all before trying to acquire new locks again, but perhaps 2PL is a little beyond the scope of what std::scoped_lock should aim to do)

The order for std::lock isn't defined until run-time, and it is not fixed. It is discovered experimentally by the algorithm for each individual call to std::lock. The second call to std::lock could lock the mutexes in a different order than the first, even though both calls might use the same list of mutexes in the same order at the call site.
Here is a detailed performance analysis of several possible implementations of std::lock: http://howardhinnant.github.io/dining_philosophers.html
Using a fixed ordering of the mutexes is one of the algorithms that is performance-compared in the above link. It is not the best performing algorithm for the experiments conducted.
The libstdc++ implementation the OP points to is a high quality implementation of what the analysis labels "Smart & Polite"

scoped_lock's constructor is stated to call std::lock on its mutexes, so its behavior is governed by this function. And std::lock is specifically defined to avoid deadlocks on what it locks. The order it locks the mutexes in is undefined, but it won't result in a deadlock.

Related

C++ What are the differences between std::lock and std::unique_lock?

I came into a situation where I need to lock a resource (a std::queue) between two processing threads.
The first thread needs to push data to std::queue, while the second thread is going to pop that data out of the queue and process it.
I need to make sure both threads will not compete for my std::queue.
As this is my first time using C++ locks, I came into different approaches: std::lock and std::unique_lock, but I don´t know which one to choose...
What is the difference between std::lock and std::unique_lock and how they should be used.
Thanks for helping.

std::lock is an algorithm that locks a collection of lockable objects all at once in a specific way that avoids deadlocks.
std::unique_lock is a class template that wraps a mutex and can be used as a scoped lock guard, similar to std::lock_guard, but more powerful than the latter (it is itself lockable, can be unlocked early and can be moved around).
You probably want neither of those, but instead just use the good old std::lock_guard.

What is lock-free multithreaded programming?

I have seen people/articles/SO posts who say they have designed their own "lock-free" container for multithreaded usage. Assuming they haven't used a performance-hitting modulus trick (i.e. each thread can only insert based upon some modulo) how can data structures be multi-threaded but also lock-free???
This question is intended towards C and C++.

The key in lock-free programming is to use hardware-intrinsic atomic operations.
As a matter of fact, even locks themselves must use those atomic operations!
But the difference between locked and lock-free programming is that a lock-free program can never be stalled entirely by any single thread. By contrast, if in a locking program one thread acquires a lock and then gets suspended indefinitely, the entire program is blocked and cannot make progress. By contrast, a lock-free program can make progress even if individual threads are suspended indefinitely.
Here's a simple example: A concurrent counter increment. We present two versions which are both "thread-safe", i.e. which can be called multiple times concurrently. First the locked version:
int counter = 0;
std::mutex counter_mutex;
void increment_with_lock()
{
std::lock_guard<std::mutex> _(counter_mutex);
++counter;
}
Now the lock-free version:
std::atomic<int> counter(0);
void increment_lockfree()
{
++counter;
}
Now imagine hundreds of thread all call the increment_* function concurrently. In the locked version, no thread can make progress until the lock-holding thread unlocks the mutex. By contrast, in the lock-free version, all threads can make progress. If a thread is held up, it just won't do its share of the work, but everyone else gets to get on with their work.
It is worth noting that in general lock-free programming trades throughput and mean latency throughput for predictable latency. That is, a lock-free program will usually get less done than a corresponding locking program if there is not too much contention (since atomic operations are slow and affect a lot of the rest of the system), but it guarantees to never produce unpredictably large latencies.

For locks, the idea is that you acquire a lock and then do your work knowing that nobody else can interfere, then release the lock.
For "lock-free", the idea is that you do your work somewhere else and then attempt to atomically commit this work to "visible state", and retry if you fail.
The problems with "lock-free" are that:
it's hard to design a lock-free algorithm for something that isn't trivial. This is because there's only so many ways to do the "atomically commit" part (often relying on an atomic "compare and swap" that replaces a pointer with a different pointer).
if there's contention, it performs worse than locks because you're repeatedly doing work that gets discarded/retried
it's virtually impossible to design a lock-free algorithm that is both correct and "fair". This means that (under contention) some tasks can be lucky (and repeatedly commit their work and make progress) and some can be very unlucky (and repeatedly fail and retry).
The combination of these things mean that it's only good for relatively simple things under low contention.
Researchers have designed things like lock-free linked lists (and FIFO/FILO queues) and some lock-free trees. I don't think there's anything more complex than those. For how these things work, because it's hard it's complicated. The most sane approach would be to determine what type of data structure you're interested in, then search the web for relevant research into lock-free algorithms for that data structure.
Also note that there is something called "block free", which is like lock-free except that you know you can always commit the work and never need to retry. It's even harder to design a block-free algorithm, but contention doesn't matter so the other 2 problems with lock-free disappear. Note: the "concurrent counter" example in Kerrek SB's answer is not lock free at all, but is actually block free.

The idea of "lock free" is not really not having any lock, the idea is to minimize the number of locks and/or critical sections, by using some techniques that allow us not to use locks for most operations.
It can be achieved using optimistic design or transactional memory, where you do not lock the data for all operations, but only on some certain points (when doing the transaction in transactional memory, or when you need to roll-back in optimistic design).
Other alternatives are based on atomic implementations of some commands, such as CAS (Compare And Swap), that even allows us to solve the consensus problem given an implementation of it. By doing swap on references (and no thread is working on the common data), the CAS mechanism allows us to easily implement a lock-free optimistic design (swapping to the new data if and only if no one have changed it already, and this is done atomically).
However, to implement the underlying mechanism to one of these - some locking will most likely be used, but the amount of time the data will be locked is (supposed) to be kept to minimum, if these techniques are used correctly.

The new C and C++ standards (C11 and C++11) introduced threads, and thread shared atomic data types and operations. An atomic operation gives guarantees for operations that run into a race between two threads. Once a thread returns from such an operation, it can be sure that the operation has gone through in its entirety.
Typical processor support for such atomic operations exists on modern processors for compare and swap (CAS) or atomic increments.
Additionally to being atomic, data type can have the "lock-free" property. This should perhaps have been coined "stateless", since this property implies that an operation on such a type will never leave the object in an intermediate state, even when it is interrupted by an interrupt handler or a read of another thread falls in the middle of an update.
Several atomic types may (or may not) be lock-free, there are macros to test for that property. There is always one type that is guaranteed to be lock free, namely atomic_flag.

C++11 : Atomic variable : lock_free property : What does it mean?

I wanted to understand what does one mean by lock_free property of atomic variables in c++11. I did googled out and saw the other relevant questions on this forum but still having partial understanding. Appreciate if someone can explain it end-to-end and in simple way.

It's probably easiest to start by talking about what would happen if it was not lock-free.
The most obvious way to handle most atomic tasks is by locking. For example, to ensure that only one thread writes to a variable at a time, you can protect it with a mutex. Any code that's going to write to the variable needs obtain the mutex before doing the write (and release it afterwards). Only one thread can own the mutex at a time, so as long as all the threads follow the protocol, no more than one can write at any given time.
If you're not careful, however, this can be open to deadlock. For example, let's assume you need to write to two different variables (each protected by a mutex) as an atomic operation -- i.e., you need to ensure that when you write to one, you also write to the other). In such a case, if you aren't careful, you can cause a deadlock. For example, let's call the two mutexes A and B. Thread 1 obtains mutex A, then tries to get mutex B. At the same time, thread 2 obtains mutex B, and then tries to get mutex A. Since each is holding one mutex, neither can obtain both, and neither can progress toward its goal.
There are various strategies to avoid them (e.g., all threads always try to obtain the mutexes in the same order, or upon failure to obtain a mutex within a reasonable period of time, each thread releases the mutex it holds, waits some random amount of time, and then tries again).
With lock-free programming, however, we (obviously enough) don't use locks. This means that a deadlock like above simply cannot happen. When done properly, you can guarantee that all threads continuously progress toward their goal. Contrary to popular belief, it does not mean the code will necessarily run any faster than well written code using locks. It does, however, mean that deadlocks like the above (and some other types of problems like livelocks and some types of race conditions) are eliminated.
Now, as to exactly how you do that: the answer is short and simple: it varies -- widely. In a lot of cases, you're looking at specific hardware support for doing certain specific operations atomically. Your code either uses those directly, or extends them to give higher level operations that are still atomic and lock free. It's even possible (though only rarely practical) to implement lock-free atomic operations without hardware support (but given its impracticality, I'll pass on trying to go into more detail about it, at least for now).

Jerry already mentioned common correctness problems with locks, i.e. they're hard to understand and program correctly.
Another danger with locks is that you lose determinism regarding your execution time: if a thread that has acquired a lock gets delayed (e.g. descheduled by the operating system, or "swapped out"), then it is possible that the entire program is delayed because it is waiting for the lock. By contrast, a lock-free algorithm is always guaranteed to make some progress, even if any number of threads are held up somewhere else.
By and large, lock-free programming is often slower (sometimes significantly so) than locked programming using non-atomic operations, because atomic operations cause a significant hit on caching and pipelining; however, it offers determinism and upper bounds on latency (at least overall latency of your process; as #J99 observed, individual threads may still be starved as long as enough other threads are making progress). Your program may get a lot slower, but it never locks up entirely and always makes some progress.
The very nature of hardware architectures allows for certain small operations to be intrinsically atomic. In fact, this is a very necessity of any hardware that supports multitasking and multithreading. At the very heart of any synchronisation primitive, such as a mutex, you need some sort of atomic instruction that guarantees correct locking behaviour.
So, with that in mind, we now know that it is possible for certain types like booleans and machine-sized integers to be loaded, stored and exchanged atomically. Thus when we wrap such a type into an std::atomic template, we can expect that the resulting data type will indeed offer load, store and exchange operations that do not use locks. By contrast, your library implementation is always allowed to implement an atomic Foo as an ordinary Foo guarded by a lock.
To test whether an atomic object is lock-free, you can use the is_lock_free member function. Additionally, there are ATOMIC_*_LOCK_FREE macros that tell you whether atomic primitive types potentially have a lock-free instantiation. If you are writing concurrent algorithms that you want to be lock-free, you should thus include an assertion that your atomic objects are indeed lock-free, or a static assertion on the macro to have value 2 (meaning that every object of the corresponding type is always lock-free).

Lock-free is one of the non-blocking techniques. For an algorithm, it involves a global progress property: whenever a thread of the program is active, it can make a forward step in its action, for itself or eventually for the other.
Lock-free algorithms are supposed to have a better behavior under heavy contentions where threads acting on a shared resources may spent a lot of time waiting for their next active time slice. They are also almost mandatory in context where you can't lock, like interrupt handlers.
Implementations of lock-free algorithms almost always rely on Compare-and-Swap (some may used things like ll/sc) and strategy where visible modification can be simplified to one value (mostly pointer) change, making it a linearization point, and looping over this modification if the value has change. Most of the time, these algorithms try to complete jobs of other threads when possible. A good example is the lock-free queue of Micheal&Scott (http://www.cs.rochester.edu/research/synchronization/pseudocode/queues.html).
For lower-level instructions like Compare-and-Swap, it means that the implementation (probably the micro-code of the corresponding instruction) is wait-free (see http://www.diku.dk/OLD/undervisning/2005f/dat-os/skrifter/lockfree.pdf)
For the sake of completeness, wait-free algorithm enforces progression for each threads: each operations are guaranteed to terminate in a finite number of steps.

Standard way to make STL objects threadsafe?

I need several STL containers, threadsafe.
Basically I was thinking I just need 2 methods added to each of the STL container objects,
.lock()
.unlock()
I could also break it into
.lockForReading()
.unlockForReading()
.lockForWriting()
.unlockForWriting()
The way that would work is any number of locks for parallel reading are acceptable, but if there's a lock for writing then reading AND writing are blocked.
An attempt to lock for writing waits until the lockForReading semaphore drops to 0.
Is there a standard way to do this?
Is how I'm planning on doing this wrong or shortsighted?

This is really kind of bad. External code will not recognize or understand your threading semantics, and the ease of availability of aliases to objects in the containers makes them poor thread-safe interfaces.
Thread-safety occurs at design time. You can't solve thread safety by throwing locks at the problem. You solve thread safety by not having two threads writing to the same data at the same time- in the general case, of course. However, it is not the responsibility of a specific object to handle thread safety, except direct threading synchronization primitives.
You can have concurrent containers, designed to allow concurrent use. However, their interfaces are vastly different to what's offered by the Standard containers. Less aliases to objects in the container, for example, and each individual operation is encapsulated.

The standard way to do this is acquire the lock in a constructor, and release it in the destructor. This is more commonly know as Resource Acquisition Is Initialization, or RAII. I strongly suggest you use this methodology rather than
.lock()
.unlock()
Which is not exception safe. You can easily forget to unlock the mutex prior to throwing, resulting in a deadlock the next time a lock is attempted.
There are several synchronization types in the Boost.Thread library that will be useful to you, notably boost::mutex::scoped_lock. Rather than add lock() and unlock() methods to whatever container you wish to access from multiple threads, I suggest you use a boost:mutex or equivalent and instantiate a boost::mutex::scoped_lock whenever accessing the container.

Is there a standard way to do this?
No, and there's a reason for that.
Is how I'm planning on doing this
wrong or shortsighted?
It's not necessarily wrong to want to synchronize access to a single container object, but the interface of the container class is very often the wrong place to put the synchronization (like DeadMG says: object aliases, etc.).
Personally I think both TBB and stuff like concurrent_vector may either be overkill or still the wrong tools for a "simple" synchronization problem.
I find that ofttimes just adding a (private) Lock object (to the class holding the container) and wrapping up the 2 or 3 access patterns to the one container object will suffice and will be much easier to grasp and maintain for others down the road.

Sam: You don't want a .lock() method because something could go awry that prevents calling the .unlock() method at the end of the block, but if .unlock() is called as a consequence of object destruction of a stack allocated variable then any kind of early return from the function that calls .lock() will be guaranteed to free the lock.
DeadMG:
Intel's Threading Building Blocks (open source) may be what you're looking for.
There's also Microsoft's concurrent_vector and concurrent_queue, which already comes with Visual Studio 2010.

Threads and simple Dead lock cure

When dealing with threads (specifically in C++) using mutex locks and semaphores is there a simple rule of thumb to avoid Dead Locks and have nice clean Synchronization?

A good simple rule of thumb is to always obtain your locks in a consistent predictable order from everywhere in your application. For example, if your resources have names, always lock them in alphabetical order. If they have numeric ids, always lock from lowest to highest. The exact order or criteria is arbitrary. The key is to be consistent. That way you'll never have a deadlock situation. eg.
Thread 1 locks resource A
Thread 2 locks resource B
Thread 1 waits to obtain a lock on B
Thread 2 waits to obtain a lock on A
Deadlock
The above can never happen if you follow the rule of thumb outlined above. For a more detailed discussion, see the Wikipedia entry on the Dining Philosophers problem.

If at all possible, design your code so that you never have to lock more then a single mutex/semaphore at a time.
If that's not possible, make sure to always lock multiple mutex/semaphores in the same order. So if one part of the code locks mutex A and then takes semaphore B, make sure that no other part of the code takes semaphore B and then locks mutex A.

Try to avoid acquiring one lock and trying to acquire another. This can result into circular dependency and cause for deadlock.
If it is un-avoidable then at least the order of acquire locks should be predictable.
Use RAII ( to make sure lock is release properly in case of exception as well)

There is no simple deadlock cure.
Acquire locks in agreed order: If all calls acquire A->B->C then no deadlock can occur. Deadlocks can occur only if the locking order differs between the two threads (one acquires A->B the second B->A).
In practice is hard to choose an order between arbitrary objects in memory. On a simple trivial project is possible, but on large projects with many individual contributors is very hard. A partial solution is to create hierarchies, by ranking the locks. All locks in module A have rank 1, all locks in module B have rank 2. One can acquire a lock of rank 2 when helding locks of rank 1, but not vice-versa. Of course you need a framework around the locking primitives that tracks and validates the ranking.

One way to ensure the ordering that other folks have talked about is to acquire locks in an order defined by their memory address. If at any point, you try to acquire a lock that should have been earlier in the sequence, you release all the locks and start over.
With a little work, it's possible to do this nearly automatically with some wrapper classes around the system primitives.

There's no practical cure. Specifically, there's no way to simply test code for being synchronizationally correct, or to have your programmers obey the rules of the gentleman with the green V.
There's no way to properly test the multithreaded code, because the program logic may depend on timing of locks acquisition, and therefore, be different from execution to execution, somehow invalidating the concept of QA.
I would say
prefer using threads only as a performance optimization for multi-core machines
only optimize performance when you are sure you need this performance
you may use threads to simplify program logic, but only when you are absolutely sure what you are doing. Be extra careful and all locks are confined to a very small piece of code. Do not let any newbies near such code.
never use threads in a mission-critical system, such as flying an aircraft or operating dangerous machinery
in all cases, threads are seldom cost-effective, due to higher debug and QA costs
If you determined to do threads or maintaining existing codebase:
confine all locks to small and simple pieces of code, which operate on primitives
avoid function calls or getting the program flow away to where the fact of being executed under lock is not immediately visible. This function will change by future authors, widening your lock span without your control.
get locks inside objects to reduce locking scope, wrap non-thread-safe 3rd-party objects with your own thread-safe interfaces.
never send synchronous notifications (callbacks) when executing under lock
use only RAII locks, to reduce the cognitive load when thinking "how else can we exit from here", as in exceptions, etc.
A few words on how to avoid multi-threading.
A single-threaded design usually involves some heart-beat function provided by program components, and called in a loop (called heartbeat cycle) which, when called, gives a chance to all components to do the next piece of work and to surrender control back again. What algorithmists like to think of as "loops" inside the components, will turn into state machines, to identify what is the next thing that should be done when called. State is best maintained as member data of respective objects.

There are plenty of simple "deadlock cures". But none that are easy to apply and work universally.
The simplest of all, of course, is "never have more than one thread".
Assuming you have a multithreaded application though, there are still a number of solutions:
You can try to minimize shared state and synchronization. Two threads that just run in parallel and never interact can never deadlock. Deadlocks only occur when multiple threads try to access the same resource. Why do they do that? Can that be avoided? Can the resource be restructured or divided so that for example, one thread can write to it, and other threads are asynchronously passed the data they need?
Perhaps the resource can be copied, giving each thread its own private copy to work with?
And as already mentioned by every other answer, if and when you try to acquire locks, do so in a global consistent order. To simplify this, you should try to ensure that all the locks a thread is going to need are acquired as a single operation. If a thread needs to acquire locks A, B and C, it should not make three lock() calls at different times and from different places. You'll get confused, and you won't be able to keep track of which locks are held by the thread, and which ones it has yet to acquire, and then you'll mess up the order. If you can acquire all the lock you need once, then you can factor it out into a separate function call which acquires N locks, and does so in the correct order to avoid deadlocks.
Then there are the more ambitious approaches: Techniques like CSP make threading extremely simple and easy to prove correct, even with thousands of concurrent threads. But it requires you to structure your program very differently from what you're used to.
Transactional Memory is another promising option, and one that may be easier to integrate into conventional programs. But production-quality implementations are still very rare.

Read Deadlock: the Problem and a Solution.
"The common advice for avoiding deadlock is to always lock the two mutexes in the same order: if you always lock mutex A before mutex B, then you'll never deadlock. Sometimes this is straightforward, as the mutexes are serving different purposes, but other times it is not so simple, such as when the mutexes are each protecting a separate instance of the same class".

If you want to attack the possibility of a deadlock you must attack one of the 4 crucial conditions for the existence of a deadlock.
The 4 conditions for a deadlock are:
1. Mutual Exclusion - only one thread can enter the critical section at a time.
2. Hold and Wait - a thread doesn't release the resources he acquired as long as he didn't finish his job even if other resources are un available.
3. No preemption - A thread doesn't have a priority over other threads.
4. Resource Cycle - There has to be a cycle chain of threads that waits for resources from other threads.
The easiest condition to attack is the resource cycle by making sure that no cycles are possible.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js