When using a shared_mutex, there is exclusive access and shared access. Exclusive access allows only one thread to access the resource while the others are blocked until the thread holding the lock releases it. Shared access is when multiple threads are allowed to access the resource, but under a "read lock". What is a "read lock"? I don't understand the meaning of a read lock. Can someone give code examples of a "read lock"?
My guess of a read lock: I thought that a read lock only allows threads to run the code and not modify anything. Apparently I was wrong when I tried some code and it didn't work as I thought.
Again, can someone help me understand what a read lock is.
Thanks.
Your guess is very close to being right but there's a slight subtlety.
The read/write distinction shows intent of the locker, it does not limit what the locker actually does. If a read-locker decides to modify the protected resource, it can, and hilarity will ensue. However, that's no different to a thread modifying the protected resource with no lock at all. The programmer must follow the rules, or pay the price.
As to the distinction, a read lock is for when you only want to read the shared resource. A billion different threads can do this safely because the resource is not changing while they do it.
A write lock, on the other hand, means you want to change the shared resource, so there should be no readers (or other writers) with an active lock.
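To make this concrete, here is a minimal sketch using C++17's std::shared_mutex (the names are illustrative):

#include <mutex>
#include <shared_mutex>
#include <string>

std::shared_mutex mtx;   // protects `resource`
std::string resource;

std::string read_resource() {
    std::shared_lock<std::shared_mutex> lock(mtx); // "read lock": any number of threads may hold it
    return resource;   // read only; the compiler won't stop you writing here, but doing so breaks the contract
}

void write_resource(const std::string& value) {
    std::unique_lock<std::shared_mutex> lock(mtx); // "write lock": exclusive
    resource = value;  // safe: no readers or other writers hold the mutex
}

Many threads can be inside read_resource() at once; a thread inside write_resource() excludes everyone else.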
It's a way of optimising locks based on usage. For example, consider the following lock queue, with the earliest entries on the right, and no queue jumping (readers jumping in front of writers if the writer is currently blocked) allowed:
(queue here) -> RR W RRRR -> (action here)
The "resource controller" can allow all four R lockers at the queue head access at the same time.
The W locker will then have to wait until all of those locks are released before gaining access.
The final two R lockers must wait until the W locker releases but can then both get access concurrently.
In other words, the possible lock holders at any given time are (mutually exclusive):
any number of readers.
one writer.
As an aside, there are other possible optimisation strategies you can use as well.
For example (as alluded to earlier), you may want a reader to be able to jump the queue if it looks like the writer would have to wait a while anyway (such as there being a large number of current readers).
That's fraught with danger since you could starve a write locker, but it can be done provided there are limits on the number of queue jumpers.
I have done it in the past. From memory (a long ago memory of an embedded OS built in BCPL for the 6809 CPU, which shows more of my age than I probably want to), the writer included with its lock request the maximum number of queue jumpers it was happy with (defaulting to zero), and the queueing of future readers took that into account.
A write-lock only allows one lock to be acquired at a time.
A read-lock only allows other read-locks to be acquired at the same time, but prevents write-locks.
If you try to acquire a write-lock, you will be forced to wait until every other lock (read or write) has been released. Once you have the write-lock, no one else can get a lock of either kind.
If you try to acquire a read-lock, you will have to wait for any write-lock to be released. If there is no write-lock, you will obtain the read-lock, regardless of how many other threads have a read-lock.
Whether or not you modify data is entirely up to you to police. If you have a read-lock, you should not be modifying shared data.
A read lock is just another way of saying that a thread of execution has acquired shared ownership of the mutex. It says nothing about what that thread can actually do.
It is up to the programmer not to update whatever shared resource is being protected by the mutex under a shared lock. To do that, acquire an exclusive lock.
Related
I have a scenario, where I have a shared data model between several threads. Some threads are going to write to that data model cyclically and other threads are reading from that data model cyclically. But it is guaranteed that writer threads are only writing and reader threads are only reading.
Now the scenario is that reading data shall have higher priority than writing data, due to real-time constraints on the reader side. So it is not acceptable that e.g. a writer locks the data for too long. But a lock with a guaranteed locking time would be acceptable (e.g. it would be acceptable for the reader to wait at most 1 ms until the data is synchronized and available).
So I'm wondering how this is achievable, because the "traditional" locking mechanisms (e.g. std::lock) wouldn't give those real-time guarantees.
Normally in such a scenario you use a reader-writer-lock. This allows either a read by all readers in parallel or a write by a single writer.
But that does nothing to stop a writer from holding the lock for minutes if it so desires. Forcing the writer out of the lock is probably also not a good idea, as the object is probably in some inconsistent state mid-change.
There is another synchronization method called read-copy-update that might help. This allows writers to modify an element without being blocked by readers. The drawback is that, for some time, some readers may still be reading the old data while others read the new data.
It also might be problematic with multiple writers if they try to change the same member. The slower writer might have computed all the needed updates only to notice that some other thread changed the object. It then has to start over, wasting all the time it already spent.
Note: copying the element can be done in constant time, certainly under 1ms. So you can guarantee readers are never blocked for long. By releasing the write lock first you guarantee readers to read between any 2 writes, assuming the RW lock is designed with the same principle.
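A minimal read-copy-update-style sketch, assuming C++20's std::atomic<std::shared_ptr> (pre-C++20, the std::atomic_load/std::atomic_store overloads for shared_ptr serve the same role; names are illustrative):

#include <atomic>
#include <memory>

struct Data { int value = 0; /* ... */ };

std::atomic<std::shared_ptr<const Data>> current{std::make_shared<const Data>()};

std::shared_ptr<const Data> reader() {
    return current.load();   // never blocks; may observe old or new data for a while
}

void writer(int v) {
    auto copy = std::make_shared<Data>(*current.load()); // read and copy...
    copy->value = v;                                     // ...update the copy...
    current.store(std::move(copy));                      // ...publish; old snapshots stay alive
    // With multiple writers you'd need a CAS loop or a writer mutex here,
    // or a slow writer can clobber a faster one, as described above.
}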
So I would suggest another solution I call write-intent-locking:
You start with a RW lock but add another lock to handle write intent. Any writer can acquire the write-intent lock at any time, but only one of them; it's exclusive. Once a writer holds the write-intent lock, it copies the element and starts modifying the copy. It can take as long as it wants to do that, as it's not blocking any readers. It does block other writers, though.
When all the modifications are done the writer acquires the write lock and then quickly copies, moves or replaces the element with the prepared copy. It then releases the write and write-intent lock, unblocking both the readers and writers that want to access the same element.
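In C++17 terms, the write-intent scheme might be sketched like this (assuming the element is copyable; names are illustrative):

#include <mutex>
#include <shared_mutex>

struct Element { /* ... */ };

Element element;          // the shared element
std::shared_mutex rw;     // the RW lock
std::mutex write_intent;  // the write-intent lock, exclusive among writers

Element read_element() {
    std::shared_lock lock(rw);  // readers only ever wait for the short swap below
    return element;
}

void update_element() {
    std::lock_guard intent(write_intent); // one writer prepares at a time
    Element copy;
    {
        std::shared_lock lock(rw);
        copy = element;                   // snapshot under a read lock
    }
    // ... modify `copy` at leisure: readers are not blocked here ...
    {
        std::unique_lock lock(rw);        // brief exclusive window
        element = std::move(copy);        // swap in the prepared copy
    }
}                                         // write-intent lock released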
The way I would approach this is to have two identical copies of the dataset; call them copy A and copy B.
Readers always read from copy B, being careful to lock a reader/writer lock in read-only mode before accessing it.
When a writer-thread wants to update the dataset, it locks copy A (using a regular mutex) and updates it. The writer-thread can take as long as it likes to do this, because no readers are using copy A.
When the writer-thread is done updating copy A, it locks the reader/writer lock (in exclusive/writer-lock mode) and swaps dataset A with dataset B. (This swap should be done by exchanging pointers, and is therefore O(1) fast).
The writer-thread then unlocks the reader/writer-lock (so that any waiting reader-threads can now access the updated data-set), and then updates the other data-set the same way it updated the first data-set. This can also take as long as the writer-thread likes, since no reader-threads are waiting on this dataset anymore.
Finally the writer-thread unlocks the regular mutex, and we're done.
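Roughly, as a sketch in C++17 terms (names are illustrative):

#include <mutex>
#include <shared_mutex>
#include <utility>

struct Dataset { /* ... */ };

Dataset a, b;
Dataset* writeCopy = &a;  // "copy A": only the writer touches it
Dataset* readCopy  = &b;  // "copy B": what readers see
std::shared_mutex rw;     // the reader/writer lock guarding readCopy
std::mutex writerMtx;     // the regular mutex serializing writer-threads

Dataset read_data() {
    std::shared_lock lock(rw);
    return *readCopy;
}

void update_data() {
    std::lock_guard wlock(writerMtx);
    // 1) update *writeCopy at leisure: no reader is using it
    {
        std::unique_lock lock(rw);
        std::swap(writeCopy, readCopy);  // O(1) pointer exchange
    }
    // 2) apply the same updates to the other copy (the old read copy),
    //    again without blocking readers
}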
Well, you've got readers, and you've got writers, and you need a lock, so.... how about a readers/writer lock?
The reason I mention that up-front is because (a) you might not be aware of it, but more importantly (b) there's no standard RW lock in C++ (EDIT: my mistake, one was added in C++14), so your thinking about this is perhaps being done in the context of std::mutex. Once you've decided to go with a RW lock, you can benefit from other people's thinking about those locks.
In particular, there's a number of different options for prioritizing threads contending over RW locks. With one option, a thread acquiring a write lock waits until all current reader threads drop the lock, but readers who start waiting after the writer don't get the lock until the writer's done with it.
With that strategy, as long as the writer thread releases and reacquires the lock after each transaction, and as long as the writer completes each transaction within your 1 ms target, readers don't starve.
And if your writer can't promise that, then there is zero alternative but to redesign the writer: either doing more processing before acquiring the lock, or splitting a transaction into multiple pieces where it's safe to drop the lock between each.
If, on the other hand, your writer's transactions take much less than 1 ms, then you might consider skipping the release/reacquire between each one if less than 1 ms has elapsed (purely to reduce the processing overhead of doing so).... but I wouldn't advise it. Adding complexity and special cases and (shudder) wall clock time to your implementation is rarely the most practical way to maximize performance, and rapidly increases the risk of bugs. A simple multithreading system is a reliable multithreading system.
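For illustration, the release/reacquire pattern might look like this with C++17's std::shared_mutex (Transaction and apply() are hypothetical placeholders):

#include <shared_mutex>
#include <vector>

struct Transaction { /* ... */ };  // hypothetical unit of writer work
void apply(const Transaction&);    // hypothetical; must fit the 1 ms budget

std::shared_mutex rw;

void writer(const std::vector<Transaction>& txns) {
    for (const auto& txn : txns) {
        std::unique_lock lock(rw);  // waits for current readers to drain
        apply(txn);
    }                               // lock released between transactions,
}                                   // so queued readers are not starved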
If the model allows writing to be interrupted, then it also allows buffering. Use a FIFO queue and start reading only when there are 50 elements already written. Use (smart) pointers to swap data in the FIFO queue; swapping an 8-byte pointer takes nanoseconds. Since there is buffering, writing will be on a different element than the one readers are working with, so there won't be lock contention as long as the producer can keep pace with the consumers.
Why doesn't the reader produce its own consumer data? If you can have n producers and n consumers, each consumer can produce its own data too, without any producer. But this will have different multithread scaling. Maybe your algorithm is not applicable here but if it is, it would be more like independent multi-processing instead of multi-threading.
Can the writer's work be converted to multiple smaller jobs? Progress within the writer can be reported to an atomic counter. When a reader has a waiting budget, it checks the atomic value, and if progress looks slow, it can use the same atomic value to instantly push it to 100%; the writer sees this and early-quits the lock.
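One way that atomic-progress idea might look as a sketch (how the work is split into small jobs is up to the writer):

#include <atomic>

std::atomic<int> progress{0};  // writer's progress in percent, 0..100

// Raise `progress` monotonically, so the writer's smaller values can
// never overwrite a reader's forced 100.
void raise_progress(int pct) {
    int cur = progress.load();
    while (cur < pct && !progress.compare_exchange_weak(cur, pct)) { /* retry */ }
}

bool writer_should_quit() {         // the writer checks between small jobs
    return progress.load() >= 100;  // a reader pushed us to 100%: early-quit the lock
}

void impatient_reader() {           // a reader whose waiting budget ran out
    raise_progress(100);
}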
Posix provides a mechanism for a mutex to be marked as "robust", allowing multi-process systems to recover gracefully from the crash of a process holding a mutex.
pthread_mutexattr_setrobust(&mutexattr, PTHREAD_MUTEX_ROBUST);
http://man7.org/linux/man-pages/man3/pthread_mutexattr_setrobust.3.html
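For reference, recovery with a robust mutex looks roughly like this (a sketch; error handling omitted):

#include <errno.h>
#include <pthread.h>

pthread_mutex_t mtx;  // typically placed in shared memory for multi-process use

void init_mutex() {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    pthread_mutex_init(&mtx, &attr);
    pthread_mutexattr_destroy(&attr);
}

void locked_work() {
    int rc = pthread_mutex_lock(&mtx);
    if (rc == EOWNERDEAD) {
        // The previous owner died while holding the lock: repair the
        // protected state, then mark the mutex consistent again.
        pthread_mutex_consistent(&mtx);
    }
    /* ... critical section ... */
    pthread_mutex_unlock(&mtx);
}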
However, there doesn't seem to be an equivalent for rwlock (reader-writer locks).
How can a process gracefully recover from a process crashing while holding a rwlock?
Implementing a robust rwlock is actually quite difficult due to the "concurrent readers" property. A rwlock with bounded storage but an unbounded number of concurrent readers fundamentally cannot track who its readers are, so if knowledge of the current readers is to be kept (in order to decrement the read-lock count when a reader dies), it must be the reader tasks themselves, not the rwlock, that are aware of their ownership of it. I don't see any obvious way it can be built on top of robust mutexes, or on top of the underlying mechanisms (like robust_list on Linux) typically used to implement robust mutexes.
If you really need robust rwlock semantics, you're probably better off having some sort of protocol with a dedicated coordinator process that's assumed not to die, that tracks death of clients via closure of a pipe/socket to them and is able to tell via shared memory contents whether the process that died held a read lock. Note that this still involves implementing your own sort of rwlock.
How can a process gracefully recover from a process crashing while holding a rwlock?
POSIX does not define a robustness option for its rwlocks. If a process dies while holding one locked, the lock is not recoverable -- you cannot even pthread_rwlock_destroy it. You also cannot expect any processes blocked trying to acquire the lock to unblock until their timeout, if any, expires. Threads blocked trying unconditionally and without a timeout to acquire the lock cannot be unblocked even by delivering a signal, because POSIX specifies that if their wait is interrupted by a signal, they will resume blocking after the signal handler finishes.
Therefore, in the case that one of several cooperating processes dies while holding a shared rwlock locked, whether for reading or writing, the best possible result is probably to arrange for the other cooperating processes to shut down as cleanly as possible. You should be able to arrange for something like that via a separate monitoring process that sends SIGTERMs to the others when a locking failure occurs. There would need to be much more to it than that, though, probably including some kind of extra discipline or wrapping around acquiring the rwlock.
Honestly, if you want robustness then you're probably better off rolling your own read/write lock using, for example, POSIX robust mutexes and POSIX condition variables, as #R.. described in comments.
Consider also that robustness of the lock itself is only half the picture. If a thread dies while holding the write lock, then you have the additional issue of ensuring the integrity of the data protected by the lock before continuing.
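For orientation, here is the non-robust skeleton of a rwlock built from one mutex and one condition variable; a robust variant would make the mutex robust and process-shared, and would additionally need per-reader ownership records in shared memory so a dead reader's count can be repaired:

#include <pthread.h>

struct my_rwlock {
    pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
    int state = 0;  // > 0: that many readers hold it; -1: a writer holds it
};

void read_lock(my_rwlock* l) {
    pthread_mutex_lock(&l->m);
    while (l->state < 0) pthread_cond_wait(&l->cv, &l->m);
    ++l->state;
    pthread_mutex_unlock(&l->m);
}

void read_unlock(my_rwlock* l) {
    pthread_mutex_lock(&l->m);
    if (--l->state == 0) pthread_cond_broadcast(&l->cv);
    pthread_mutex_unlock(&l->m);
}

void write_lock(my_rwlock* l) {
    pthread_mutex_lock(&l->m);
    while (l->state != 0) pthread_cond_wait(&l->cv, &l->m);
    l->state = -1;
    pthread_mutex_unlock(&l->m);
}

void write_unlock(my_rwlock* l) {
    pthread_mutex_lock(&l->m);
    l->state = 0;
    pthread_cond_broadcast(&l->cv);
    pthread_mutex_unlock(&l->m);
}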
I found out that Windows implemented a slim reader-writer-lock (see https://msdn.microsoft.com/en-us/library/windows/desktop/aa904937%28v=vs.85%29.aspx ). Unfortunately (for me) this rw-lock is neither fifo nor is it fair (in any sense).
Is there a possibility to make the windows rw-lock with some workaround fair or fifo?
If not, in which scenarios would you use the windows slim rw-lock?
It is unlikely you can change the slim lock itself to be fair, especially since the documentation doesn't indicate any method of doing so, and most locks today are unfair for performance reasons.
That said, it is fairly straightforward to roll your own approximately-FIFO lock, using Windows events and a 64-bit control word that you manipulate with compare-and-swap, that is still very slim. Here's an outline:
The state of the lock is reflected in the control word, which is manipulated atomically to transition between states; this allows threads to enter the lock (if allowed) with a single atomic operation and no kernel switch (that's the performance part of "slim"). The reset events are used to notify waiting threads when threads need to block, and can be allocated on demand (that's the low memory footprint of slim).
The lock control word has the following states:
(1) Free - no readers or writers, and no waiters. Any thread can acquire the lock for reading or writing by atomically transitioning the lock into state (2) or (4).
(2) N readers in the lock. There are N readers in the lock at the moment. New readers can immediately acquire the lock by adding 1 to the count - use a field of 30 bits or so within the control word to represent this count. Writers must block (perhaps after spinning). When readers leave the lock, they decrement the count, which may transition to state (1) when the last reader leaves (although they don't need to do anything special in a (2) -> (1) transition).
(3) State (2) + waiting writers + 0 or more waiting readers. In this state, there are 1 or more readers still in the lock, but at least one waiting writer. The writers should wait on a manual-reset event, which is designed, although not guaranteed, to be FIFO. There is a field in the control word to indicate how many writers are waiting. In this state, new readers that want to enter the lock cannot; they set a reader-waiting bit instead and block on the reader-waiting event. New writers increment the waiting-writer count and block on the writer-waiting event. When the last reader leaves (setting the reader-count field to 0), it signals the writer-waiting event, releasing the longest-waiting writer to enter the lock.
(4) Writer in the lock. When a writer is in the lock, all readers queue up and wait on the reader-waiting event. All incoming writers increment the waiting-writer count and queue up as usual on the writer-waiting event. There may even be some waiting readers when the writer acquires the lock because of state (3) above, and these are treated identically. When the writer leaves the lock, it checks for waiting writers and readers and either unblocks a writer or all readers, depending on policy, discussed below.
All the state transitions discussed above are done atomically using compare-and-swap. The typical pattern is that any of the lock() or unlock() calls look at the control word, determine what state they are in and what transition needs to happen (following the rules above), calculate the new control word in a temporary then attempt to swap in the new control word with compare-and-swap. Sometimes that attempt fails because another thread concurrently modified the control word (e.g., another reader entered the lock, incrementing the reader count). No problem, just start over from "determine state..." and try again. Such races are rare in practice since the state word calculation is very short, and that's just how things work with CAS-based complex locks.
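As a sketch, the CAS retry loop for one such transition (a reader trying to enter) might look like this; writer_active() and writers_waiting() are hypothetical decoders of the packed control word:

#include <atomic>
#include <cstdint>

std::atomic<uint64_t> control{0};  // packed: reader count, waiting-writer count, flag bits

bool writer_active(uint64_t w);    // hypothetical field decoders
bool writers_waiting(uint64_t w);

// Returns false if the reader must block on the reader-waiting event instead.
bool try_enter_reader() {
    uint64_t cur = control.load(std::memory_order_relaxed);
    for (;;) {
        if (writer_active(cur) || writers_waiting(cur))
            return false;              // per the rules above, new readers must wait
        uint64_t next = cur + 1;       // reader count lives in the low bits
        if (control.compare_exchange_weak(cur, next,
                std::memory_order_acquire, std::memory_order_relaxed))
            return true;               // transition committed
        // CAS failed: `cur` was reloaded with the fresh word; recompute and retry
    }
}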
This lock design is "slim" in almost every sense. Performance-wise, it is near the top of what you can get for a general-purpose design. In particular, the common fast paths of (a) a reader entering the lock with 0 or more readers already in the lock, (b) a reader leaving the lock with 0 or more readers still in the lock, and (c) a writer entering/leaving an uncontended lock are all about as fast as possible in the usual case: a single atomic operation. Furthermore, the reader entry and exit paths are "lock free" in the sense that incoming readers do not temporarily take a mutex internal to the rwlock, manipulate state, and then unlock it while entering/leaving the lock. That approach is slow and subject to issues whenever a reader thread performs a context switch at the critical moment while it holds the internal lock. Such approaches do not scale to heavier reader activity with a short rwlock critical section: even though multiple readers can, in theory, enter the critical section, they all bottleneck on entering and leaving the internal lock (which happens twice for every enter/exit operation) and performance is worse than a normal mutex!
It is also lightweight in that it only needs a couple of Windows Event objects, and these objects can be allocated on demand - they are only needed when contention occurs and a state transition that requires blocking is about to occur. That's how CRITICAL_SECTION objects work.
The lock above is fair in the sense that readers won't starve writers, and writers are served in FIFO order. How writers interact with waiting readers is up to your policy for whom to unblock when the lock becomes free after a writer unlocks and there are both waiting readers and writers. One simple policy is to unblock all waiting readers.
In this design, writers will alternate in FIFO order with FIFO batches of readers. Writers are FIFO relative to other writers, and reader batches are FIFO relative to other reader batches, but the relationship between writers and readers isn't exactly FIFO: because all incoming readers are added to the same reader-waiting set, in the case that there are already several waiting writers, arriving readers all go into the next "batch" to be released, which actually puts them ahead of writers that are already waiting. That's quite reasonable though: readers all go at once, so adding more readers to the batch doesn't necessarily cost much and probably increases efficiency, and if you served every thread in strict FIFO order, the lock would reduce in behavior to a simple mutex under contention.
Another possible design is to always unblock writers if any are waiting. This favors writers at the expense of readers and does mean that a never-ending stream of writers could block out readers indefinitely. This approach makes sense where you know your writes are latency sensitive and you either aren't worried about reader starvation, or you know it can't occur due to the design of your application (e.g., because there is only one possible writer at a time).
Beyond that, there are a lot of other policies possible, such as favoring writers up until readers have been waiting for a certain period, or limiting reader batch sizes, or whatever. They are mostly possible to implement efficiently since the bookkeeping is generally limited to the slow paths where threads will block anyway.
I've glossed over some implementation details and gotchas here (in particular, the need to be careful when making the transitions that involve blocking to avoid "missed wakeup" issues) - but this definitely works. I've written such a lock before the slim rwlock existed to fill the need for a fast high-performance rwlock, and it performs very well. Other tradeoffs are possible too, e.g., for designs in which reads are expected to dominate, contention can be reduced by splitting up the control word across cache lines, at the cost of more expensive write operations.
One final note - this lock is a bit fatter, in memory use, than the Windows one in the case that it is contended, because it allocates one or two Windows Events per lock, while the slim lock avoids this. The slim lock likely does it by directly supporting the slim-lock behavior in the kernel, so the control word can directly be used as part of the kernel-side waitlist. You can't reproduce that exactly, but you can still remove the per-lock overhead in another way: use thread-local storage to allocate your two events per thread rather than per lock. Since a thread can only be waiting on one lock at a time, you only need this structure once per thread. That brings it into line with the slim lock in memory use (unless you have very few locks and a ton of threads).
this rw-lock is neither fifo nor is it fair
I wouldn't expect anything to do with threading to be "fair" or "fifo" unless it said it was explicitly. In this case, I would expect writing locks to take priority, as it might never be able to obtain a lock if there are a lot of reading threads, but then I also wouldn't assume that's the case, and I've not read the MS documentation for a better understanding.
That said, it sounds like your issue is that you have a lot of contention on the lock, caused by write threads; because otherwise your readers would always be able to share the lock. You should probably examine why your writing thread(s) are trying to hold the lock for so long; buffering up writes, for example, will help mitigate this.
A coworker had an issue recently that boiled down to what we believe was the following sequence of events in a C++ application with two threads:
Thread A holds a mutex.
While thread A is holding the mutex, thread B attempts to lock it. Since it is held, thread B is suspended.
Thread A finishes the work that it was holding the mutex for, thus releasing the mutex.
Very shortly thereafter, thread A needs to touch a resource that is protected by the mutex, so it locks it again.
It appears that thread A is given the mutex again; thread B is still waiting, even though it "asked" for the lock first.
Does this sequence of events fit with the semantics of, say, C++11's std::mutex and/or pthreads? I can honestly say I've never thought about this aspect of mutexes before.
Are there any fairness guarantees to prevent starvation of other threads for too long, or any way to get such guarantees?
Known problem. C++ mutexes are a thin layer on top of OS-provided mutexes, and OS-provided mutexes are often not fair. They do not care about FIFO order.
The other side of the same coin is that threads are usually not pre-empted until they run out of their time slice. As a result, thread A in this scenario was likely to continue to be executed, and got the mutex right away because of that.
The guarantee of a std::mutex is to enable exclusive access to shared resources. Its sole purpose is to eliminate the race condition when multiple threads attempt to access shared resources.
The implementer of a mutex may choose to favor the current thread acquiring a mutex (over another thread) for performance reasons. Allowing the current thread to acquire the mutex and make forward progress without requiring a context switch is often a preferred implementation choice supported by profiling/measurements.
Alternatively, the mutex could be constructed to prefer another (blocked) thread for acquisition (perhaps chosen according to FIFO order). This likely requires a thread context switch (on the same or another processor core), increasing latency/overhead. NOTE: FIFO mutexes can behave in surprising ways, e.g. thread priorities must be considered in FIFO support, so acquisition won't be strictly FIFO unless all competing threads have the same priority.
Adding a FIFO requirement to a mutex's definition constrains implementers to provide suboptimal performance in nominal workloads. (see above)
Protecting a queue of callable objects (std::function) with a mutex would enable sequenced execution. Multiple threads can acquire the mutex, enqueue a callable object, and release the mutex. The callable objects can be executed by a single thread (or a pool of threads if synchrony is not required).
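A sketch of that queue-of-callables idea (names are illustrative):

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

std::mutex m;
std::condition_variable cv;
std::queue<std::function<void()>> tasks;

void enqueue(std::function<void()> f) {  // called from any number of threads
    {
        std::lock_guard<std::mutex> lock(m);
        tasks.push(std::move(f));
    }
    cv.notify_one();
}

void worker() {                          // a single executor gives FIFO execution
    for (;;) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return !tasks.empty(); });
        auto task = std::move(tasks.front());
        tasks.pop();
        lock.unlock();
        task();                          // run outside the lock
    }
}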
•Thread A finishes the work that it was holding the mutex for, thus releasing the mutex.
•Very shortly thereafter, thread A needs to touch a resource that is protected by the mutex, so it locks it again.
In the real world, when the program is running, there is no guarantee provided by any threading library or the OS. Here "shortly thereafter" may mean a lot to the OS and the hardware. If you say 2 minutes, then thread B would definitely get it. If you say 200 ms or less, there is no promise of A or B getting it.
Number of cores, load on different processors/cores/threading units, contention, thread switching, kernel/user switches, pre-emption, priorities, deadlock detection schemes, etc. will make a lot of difference. Just by seeing a green signal from afar, you cannot guarantee that it will still be green when you reach it.
If you want thread B to get the resource, you may use an IPC mechanism to instruct thread B to acquire the resource.
You are inadvertently suggesting that threads should synchronise access to the synchronisation primitive. Mutexes are, as the name suggests, about Mutual Exclusion. They are not designed for control flow. If you want to signal a thread to run from another thread you need to use a synchronisation primitive designed for control flow i.e. a signal.
You can use a fair mutex to solve your task, i.e. a mutex that will guarantee the FIFO order of your operations. Unfortunately, the C++ standard library doesn't have a fair mutex.
Thankfully, there are open-source implementations, for example yamc (a header-only library).
The logic here is very simple - the thread is not preempted based on mutexes, because that would require a cost incurred for each mutex operation, which is definitely not what you want. The cost of grabbing a mutex is high enough without forcing the scheduler to look for other threads to run.
If you want to fix this you can always yield the current thread. You can use std::this_thread::yield() - http://en.cppreference.com/w/cpp/thread/yield - and that might offer the chance to thread B to take over the mutex. But before you do that, allow me to tell you that this is a very fragile way of doing things, and offers no guarantee. You could, alternatively, investigate the issue deeper:
Why is it a problem that the B thread is not started when A releases the resource? Your code should not depend on such logic.
Consider using alternative thread synchronization objects like barriers (boost::barrier or http://linux.die.net/man/3/pthread_barrier_wait ) instead, if you really need this sort of logic.
Investigate if you really need to release the mutex from A at that point - I find the practice of locking and releasing a mutex in quick succession more than once a code smell; it usually hurts performance terribly. See if you can group extraction of data into immutable structures which you can play around with.
Ambitious, but try to work without mutexes - use instead lock-free structures and a more functional approach, including using a lot of immutable structures. I often found quite a performance gain from updating my code to not use mutexes (and still work correctly from the mt point of view)
How do you know this:
While thread A is holding the mutex, thread B attempts to lock it.
Since it is held, thread B is suspended.
How do you know thread B is suspended? How do you know that it hasn't just finished the line of code before trying to grab the lock, but not yet grabbed it:
Thread B:
x = 17; // is the thread here?
// or here? ('between' lines of code)
mtx.lock(); // or suspended in here?
// how can you tell?
You can't tell. At least not in theory.
Thus the order of acquiring the lock is, to the abstract machine (ie the language), not definable.
I am having a hard time figuring out when to use which of these things (cv, mutex and lock), for example the C++0x interfaces.
Can anyone please explain or point to a resource?
Thanks in advance.
On the page you refer to, "mutex" is the actual low-level synchronizing primitive. You can take a mutex and then release it, and only one thread can take it at any single time (hence it is a synchronizing primitive). A recursive mutex is one which can be taken by the same thread multiple times, and then it needs to be released as many times by the same thread before others can take it.
A "lock" here is just a C++ wrapper class that takes a mutex in its constructor and releases it at the destructor. It is useful for establishing synchronizing for C++ scopes.
A condition variable is a more advanced / high-level form of synchronizing primitive which combines a lock with a "signaling" mechanism. It is used when threads need to wait for a resource to become available. A thread can "wait" on a CV and then the resource producer can "signal" the variable, in which case the threads who wait for the CV get notified and can continue execution. A mutex is combined with CV to avoid the race condition where a thread starts to wait on a CV at the same time another thread wants to signal it; then it is not controllable whether the signal is delivered or gets lost.
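A minimal wait/signal sketch of that combination (names are illustrative):

#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;
bool ready = false;  // "the resource is available"

void waiting_thread() {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [] { return ready; });  // atomically releases m while blocked, re-locks on wakeup
    // ... use the resource; m is held again here ...
}

void producer_thread() {
    {
        std::lock_guard<std::mutex> lock(m);
        ready = true;                     // change the condition under the mutex
    }
    cv.notify_one();                      // the predicate ensures the signal isn't lost
}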
I'm not too familiar w/ C++0x so take this answer w/ a grain of salt.
re: Mutex vs. locks: From the documentation you posted, it looks like a mutex is an object representing an OS mutex, whereas a lock is an object that holds a mutex to facilitate the RAII pattern.
Condition variables are a handy mechanism to associate a blocking/signaling mechanism (signal+wait) with a mutual exclusion mechanism, yet keep them decoupled in the OS so that you as system programmer can choose the association between condvar and mutex. (useful for dealing with multiple sets of concurrently-accessed objects) Rob Krten has some good explanations on condvars in one of the online chapters of his book on QNX.
As far as general references: This book (not out yet) looks interesting.
This question has been answered. I just add this, which may help to decide WHEN to use these synchronization primitives.
Simply put, the mutex is used to guarantee mutually exclusive access to a shared resource in the critical section of multiple threads. "Lock" is a more general term, but a binary mutex can be used as a lock. In modern C++ we use lock_guard and similar objects to utilize RAII to simplify and make safe the usage of a mutex. The condition variable is another primitive, often combined with a mutex to make something known as a monitor.
I am having a hard time figuring out when to use which of these things
(cv, mutex and lock). Can anyone please explain or point to a
resource?
Use a mutex to guarantee mutually exclusive access to something. It's the default solution for a broad range of concurrency problems. Use lock_guard if you have a scope in C++ that you want to guard with a mutex. The mutex is handled by the lock_guard. You just create a lock_guard in the scope and initialize it with a mutex, and then C++ does the rest for you. The mutex is released when the scope is removed from the stack, for any reason, including throwing an exception or returning from a function. It's the idea behind RAII, and the lock_guard is another resource handler.
There are some concurrency issues that are not easily solvable by only using a mutex, or where a simple solution can lead to complexity or inefficiency. For example, the producer-consumer problem is one of them. If we want to implement a consumer thread reading items from a buffer shared with a producer, we should protect the buffer with a mutex; but without a condition variable, we would have to lock the mutex, check the buffer and read an item if it's not empty, unlock it and wait for some time period, lock it again, and go on. It's a waste of time if the buffer is often empty (busy waiting), and there will also be lots of locking, unlocking, and sleeping.
The solution we need for the producer-consumer problem must be simpler and more efficient. A monitor (a mutex + a condition variable) helps us here. We still need a mutex to guarantee mutually exclusive access, but a condition variable lets us sleep while waiting for a certain condition. The condition here is the producer adding an item to the buffer. The producer thread notifies the consumer thread that there is an item in the buffer, and the consumer wakes up and gets the item. Simply put, the producer locks the mutex, puts something in the buffer, and notifies the consumer. The consumer locks the mutex, sleeps while waiting for a condition, wakes up when there is something in the buffer, and gets the item from the buffer. It's a simpler and more efficient solution.
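In code, that monitor is roughly the following (a minimal sketch with a buffer of ints):

#include <condition_variable>
#include <mutex>
#include <queue>

std::mutex m;
std::condition_variable cv;
std::queue<int> buffer;  // the shared buffer

void producer(int item) {
    {
        std::lock_guard<std::mutex> lock(m);
        buffer.push(item);  // put something in the buffer
    }
    cv.notify_one();        // notify the consumer
}

int consumer() {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [] { return !buffer.empty(); });  // sleep; no busy waiting
    int item = buffer.front();
    buffer.pop();
    return item;
}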
The next time you face a concurrency problem, think this way: if you need mutually exclusive access to something, use a mutex. Use lock_guard if you want to be safer and simpler. If the problem has a clue of waiting for a condition that must happen in another thread, you MIGHT need a condition variable.
As a general rule of thumb, first analyze your problem and try to find a famous concurrency problem similar to yours (for example, see the classic problems of synchronization section on this page). Read about the solutions proposed for those well-known problems and pick the best one. You may need some customization.