Can std::atomic be used sometimes instead of std::mutex in C++? - c++

I suppose that std::atomic sometimes can replace usages of std::mutex. But is it always safe to use atomic instead of mutex? Example code:
std::atomic_flag f, ready; // shared

// ..... Thread 1 (and others) ....
while (true) {
    // ... Do some stuff in the beginning ...
    while (f.test_and_set()); // spin, acquire system lock
    if (ready.test()) {
        UseSystem(); // .... use our system for 50-200 nanoseconds ....
    }
    f.clear(); // release lock
    // ... Do some stuff at the end ...
}

// ...... Thread 2 .....
while (true) {
    // ... Do some stuff in the beginning ...
    InitSystem();
    ready.test_and_set(); // signify system ready
    // .... sleep for 10-30 milli-seconds ....
    while (f.test_and_set()); // acquire system lock
    ready.clear(); // signify system shutdown
    f.clear(); // release lock
    DeInitSystem(); // finalize/destroy system
    // ... Do some stuff at the end ...
}
Here I use std::atomic_flag to protect use of my system (some complex library). But is this code safe? The assumption is that when ready is false the system is not available and I can't use it, and when it is true the system is available and I can use it. For simplicity, suppose the code above doesn't throw exceptions.
Of course I can use std::mutex to protect reads/modifications of my system. But right now I need very high performance code in Thread-1, which should use atomics rather than mutexes as often as possible (Thread-2 can be slow and use mutexes if needed).
In Thread-1 the system-usage code (inside the while loop) runs very often, each iteration taking around 50-200 nanoseconds, so adding mutexes there would be too heavy. But Thread-2 iterations are quite long: as you can see, in each iteration of its while loop, once the system is ready it sleeps for 10-30 milliseconds, so using mutexes only in Thread-2 is quite alright.
Thread-1 is an example of one thread; in my real project there are several threads running the same (or very similar) code as Thread-1.
I'm concerned about memory-operation ordering: it could happen that the system is not yet in a fully consistent state (not yet fully inited) when ready becomes true as seen by Thread-1. It may also happen that ready becomes false in Thread-1 too late, after the system has already performed some destroying (deinit) operations. Also, as you can see, the system can be inited/destroyed many times in the loop of Thread-2 and used many times in Thread-1 whenever it is ready.
Can my task be solved somehow without std::mutex and other heavy stuff in Thread-1? Only using std::atomic (or std::atomic_flag). Thread-2 can use heavy synchronization stuff if needed, mutexes etc.
Basically Thread-2 should somehow propagate the whole inited state of the system to all cores and other threads before ready becomes true, and Thread-2 should also propagate ready equal to false before any single small operation of system destruction (deinit) is done. By propagating state I mean that all of the system's inited data should be written consistently to global memory and the caches of other cores, so that other threads see a fully consistent system whenever ready is true.
It is even allowed to make small (milliseconds) pause after system init and before ready is set to true if it improves situation and guarantees. And also it is allowed to make pause after ready is set to false and before starting system destruction (deinit). Also doing some expensive CPU operations in Thread-2 is also alright if there exist some operations like "propagate all Thread-2 writes to global memory and caches to all other CPU cores and threads".
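For what it's worth, the "propagate everything written before ready becomes true" requirement described above is exactly what release/acquire ordering expresses. Below is a minimal illustrative sketch (hypothetical names, not the real project code; it deliberately ignores the shutdown/deinit race, which still needs the f lock or an equivalent): every write made before the release store is visible to a thread that observes true via the acquire load.

#include <atomic>

struct System { int state = 0; };        // stands in for the complex library

System g_system;                         // hypothetical shared object
std::atomic<bool> g_ready{false};

// Thread 2: all writes to g_system made before the release store are visible
// to any thread that later sees g_ready == true via an acquire load.
void init_and_publish() {
    g_system.state = 42;                 // InitSystem()
    g_ready.store(true, std::memory_order_release);
}

// Thread 1: the acquire load pairs with the release store above.
void use_if_ready() {
    if (g_ready.load(std::memory_order_acquire)) {
        (void)g_system.state;            // UseSystem(): init writes are visible here
    }
}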
Update: As a solution for my question above, right now in my project I decided to use the following code with std::atomic_flag to replace std::mutex:
std::atomic_flag f = ATOMIC_FLAG_INIT; // shared

// .... Later in all threads ....
while (f.test_and_set(std::memory_order_acquire)) // try acquiring
    std::this_thread::yield();
shared_value += 5; // Any code, it is lock-protected.
f.clear(std::memory_order_release); // release
The solution above takes 9 nanoseconds per iteration on average (measured over 2^25 operations) in a single thread (release build) on my Windows 10 64-bit 2 GHz 2-core laptop, while using std::unique_lock<std::mutex> lock(mux); for the same protection takes 100-120 nanoseconds on the same PC. If threads should spin instead of sleeping while waiting, then instead of std::this_thread::yield(); in the code above I just use a semicolon ;. Full online example of usage and time measurements.

I'll ignore your code for the sake of the answer; the answer, generally, is yes.
A lock does the following things:
allows only one thread to acquire it at any given time
when the lock is acquired, a read barrier is placed
right before the lock is released, a write barrier is placed
The combination of the 3 points above makes the critical section thread safe. Only one thread can touch the shared memory, all prior changes are observed by the locking thread because of the read barrier, and all of its changes are made visible to other locking threads because of the write barrier.
Can you use atomics to achieve it? Yes. Real-life locks (provided, for example, by Win32/POSIX) ARE implemented either directly with atomics and lock-free programming, or on top of lower-level locks that themselves use atomics and lock-free programming.
Now, realistically speaking, should you use a self-written lock instead of the standard locks? Absolutely not.
Many concurrency tutorials perpetuate the notion that spin-locks are "more efficient" than regular locks. I can't stress enough how foolish that is. A user-mode spinlock IS NEVER more efficient than a lock that the OS provides. The reason is simple: OS locks are wired into the OS scheduler. So if a thread tries to take a lock and fails, the OS knows to freeze this thread and not reschedule it to run until the lock has been released.
With user-mode spinlocks, this doesn't happen. The OS can't know that the relevant thread is trying to acquire the lock in a tight loop. Yielding is just a patch and not a solution - we want to spin for a short time, then go to sleep until the lock is released. With user-mode spin locks, we might waste the entire thread quantum trying to lock the spinlock and yielding.
I will say, for the sake of honesty, that recent C++ standards do give us the ability to sleep on an atomic waiting for it to change its value. So we can, in a very lame way, implement our own "real" locks that try to spin for a while and then sleep until the lock is released. However, implementing a correct and efficient lock when you're not a concurrency expert is pretty much impossible.
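To make that concrete, here is a minimal sketch (illustration only, not production code and not what std::mutex does) of such a "spin a little, then sleep on the atomic" lock built on C++20 std::atomic_flag::wait/notify_one; the spin count is an arbitrary tuning value:

#include <atomic>

class spin_then_wait_lock {
    std::atomic_flag flag;   // clear (unlocked) by default since C++20
public:
    void lock() {
        for (int i = 0; i < 64; ++i)                        // brief user-space spin
            if (!flag.test_and_set(std::memory_order_acquire))
                return;
        while (flag.test_and_set(std::memory_order_acquire))
            flag.wait(true);                                // sleep until it is cleared
    }
    void unlock() {
        flag.clear(std::memory_order_release);
        flag.notify_one();                                  // wake one sleeping waiter
    }
};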
My own philosophical opinion is that in 2021, developers should rarely deal with those very low-level concurrency topics. Leave those things to the kernel guys.
Use some high level concurrency library and focus on the product you want to develop rather than micro-optimizing your code. This is concurrency, where correctness >>> efficiency.
A related rant by Linus Torvalds


C++20 mutex with atomic wait [closed]

From C++20, std::atomic has wait and notify operations. With is_always_lock_free we can ensure that the implementation is lock free. With these bricks, building a lock-free mutex is not so difficult. In trivial cases locking would be a compare-exchange operation, or a wait if the mutex is locked. The big question here is whether it is worth it or not. If I can create such an implementation, most probably the STL version is much better and faster. However, I still remember how surprised I was when I saw how QMutex outperformed std::mutex (QMutex vs std::mutex) back in 2016. So what do you think: should I experiment with such an implementation, or is the current implementation of std::mutex mature enough to be optimized far beyond these tricks?
UPDATE
My wording wasn't the best: I mean that the implementation could be lock free on the happy path (locking from the not-locked state). Of course we should be blocked and re-scheduled if we need to wait to acquire the lock. Most probably atomic::wait is not implemented by a simple spinlock on most platforms (let's ignore the corner cases now), so it achieves essentially the same thing mutex::lock does. So if I implement such a class it would do exactly what std::mutex does (again, on most of the popular platforms). It means that the STL could use the same tricks in its mutex implementations on the platforms that support them. Like this spinlock, but I would use atomic wait instead of spinning. Should I trust my STL implementation to have done so?
A lock-free mutex is a contradiction.
You can build a lock out of lock-free building-blocks, and in fact that's the normal thing to do whether it's hand-written in asm or with std::atomic.
But the overall locking algorithm is by definition not lock-free. (https://en.wikipedia.org/wiki/Non-blocking_algorithm). The entire point is to stop other threads from making forward progress while one thread is in the critical section, even if it unfortunately sleeps while it's there.
I mean that the implementation could be lock free on the happy path (locking from not locked state)
std::mutex::lock() is that way too: it doesn't block if it doesn't have to! It might need to make a system call like futex(FUTEX_WAIT_PRIVATE) if there's a thread waiting for the lock. But so would an implementation that used std::atomic wait/notify.
Perhaps you haven't understood what "lock-free" means: it never blocks, regardless of what other threads are doing. That is all. It doesn't mean "faster in the simple/easy case". For a complex algorithm (e.g. a queue), it's often faster to just give up and block if the circular buffer is full, rather than adding overhead to the simple case to allow other threads to "assist" or cancel a stuck partial operation. (Lock-free Progress Guarantees in a circular buffer queue)
There is no inherent advantage to rolling your own std::mutex out of std::atomic. The compiler-generated machine code has to do approximately the same things either way, and I'd expect the fast path to be about the same. The only difference would be in the choice of what to do in the already-locked case. (But it may be tricky to design a way to avoid calling notify when there were no waiters, something that the actual std::mutex manages in the glibc / pthreads implementation on Linux.)
(I'm assuming the overhead of the library function call is negligible compared to the cost of an atomic RMW to take the mutex. Having that inline into your code is a pretty minor advantage.)
A mutex implementation can be tuned for certain use-cases, in terms of how long it spin-waits before sleeping (using an OS-assisted mechanism like futex to enable other threads to wake it when releasing the lock), and in exponential backoff for the spin-wait portion.
If std::mutex doesn't perform well for your application on the hardware you care about, it's worth considering an alternative. Although IDK exactly how you'd go about measuring whether it worked well or not. Perhaps you could figure out when and how often it was deciding to sleep.
And yes, you could consider rolling your own with std::atomic now that there's a portable mechanism to hopefully expose a way to fall-back to OS-assisted sleep/wake mechanisms like futex. You'd still want to manually use system-specific things like x86 _mm_pause() inside a spin-wait loop, though, since I don't think C++ has anything equivalent to Rust's std::hint::spin_loop() that an implementation can use to expose things like the x86 pause instruction, intended for use in the body of a spin-loop. (See Locks around memory manipulation via inline assembly re: such considerations, and spinning read-only instead of spamming atomic RMW attempts. And for a look at the necessary parts of a spinlock in x86 assembly language, which are the same whether you get a C++ compiler to generate that machine code for you or not.)
See also https://rigtorp.se/spinlock/ re: implementing a mutex in C++ with std::atomic.
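For reference, here is a rough sketch in the spirit of the rigtorp article (illustrative, not a drop-in std::mutex replacement): a test-and-test-and-set spinlock that spins read-only between lock attempts so the waiting core isn't hammering the cache line with atomic RMWs, using the x86-only _mm_pause() intrinsic in the wait loop.

#include <atomic>
#include <immintrin.h>   // _mm_pause (x86 only)

class ttas_spinlock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        for (;;) {
            if (!locked.exchange(true, std::memory_order_acquire))
                return;                                   // got the lock
            while (locked.load(std::memory_order_relaxed))
                _mm_pause();                              // read-only wait, CPU spin hint
        }
    }
    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};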
Linux / libstdc++ behaviour of notify/wait
I tested what system calls std::wait makes when it has to wait for a long time, on Arch Linux (glibc 2.33).
std::mutex lock no-contention fast path, and unlock with no waiters: zero system calls, purely user-space atomic operations. Notably it is able to detect that there are no waiters when unlocking, so it doesn't make a FUTEX_WAKE system call (which otherwise would cost maybe a hundred times more than taking and releasing an uncontended mutex that was still hot in this core's L1d cache).
std::mutex lock() on already locked: only a futex(0x55be9461b0c0, FUTEX_WAIT_PRIVATE, 2, NULL) system call. Possibly some spinning in user-space before that; I didn't single-step into it with GDB, but if so probably with a pause instruction.
std::mutex unlock() with a waiter: uses futex(0x55ef4af060c0, FUTEX_WAKE_PRIVATE, 1) = 1. (After an atomic RMW, IIRC; not sure why it doesn't just use a release store.)
std::notify_one: always a futex(address, FUTEX_WAKE, 1) even if there are no waiters, so it's up to you to avoid it if there are no waiters when you unlock a lock.
std::wait: spinning a few times in user-space, including 4x sched_yield() calls before a futex(addr, FUTEX_WAIT, old_val, NULL).
Note the use of FUTEX_WAIT instead of FUTEX_WAIT_PRIVATE by the wait/notify functions: these should work across processes on shared memory. The futex(2) man page says the _PRIVATE versions (only for threads of a single process) allow some additional optimizations.
I don't know about other systems, although I've heard some kinds of locks on Windows / MSVC have to suck (always a syscall even on the fast path) for backwards-compat with some ABI choices, or something. Like perhaps std::lock_guard is slow on MSVC, but std::unique_lock isn't?
Test code:
#include <atomic>
#include <thread>
#include <unistd.h> // for quick & dirty sleep and usleep. TODO: port to non-POSIX.
#include <mutex>

#if 1 // test std::atomic
//std::atomic_unsigned_lock_free gvar;
//std::atomic_uint_fast_wait_t gvar;
std::atomic<unsigned> gvar;

void waiter(){
    volatile unsigned sink;
    while (1) {
        sink = gvar;
        gvar.wait(sink); // block until it's not the old value.
        // on Linux/glibc, 4x sched_yield(), then futex(0x562c3c4881c0, FUTEX_WAIT, 46, NULL) or whatever the current counter value is
    }
}

void notifier(){
    while(1){
        sleep(1);
        gvar.store(gvar.load(std::memory_order_relaxed)+1, std::memory_order_relaxed);
        gvar.notify_one();
    }
}
#else
std::mutex mut;

void waiter(){
    unsigned sink = 0;
    while (1) {
        mut.lock(); // Linux/glibc2.33 - just a futex system call, maybe some spinning in user-space first. But no sched_yield
        mut.unlock();
        sink++;
        usleep(sink); // give the other thread plenty of time to take the lock, so we don't get it twice.
    }
}

void notifier(){
    while(1){
        mut.lock();
        sleep(1);
        mut.unlock();
    }
}
#endif

int main(){
    std::thread t (waiter); // comment this to test the no-contention case
    notifier();
    // loops forever: kill it with control-C
}
Compile with g++ -Og -std=gnu++20 notifywait.cpp -pthread, run with strace -f ./a.out to see the system calls. (A couple or a few per second, since I used a nice long sleep.)
If there's any spin-waiting in user-space, it's negligible compared to the 1 second sleep interval, so it uses about a millisecond of CPU time including startup, to run for 19 iterations. (perf stat ./a.out)
Usually your time would be better spent trying to reduce the amount of locking involved, or the amount of contention, rather than trying to optimize the locks themselves. Locking is an extremely important thing, and lots of engineering has gone into tuning it for most use-cases.
If you're rolling your own lock, you probably want to get your hands dirty with system-specific stuff, because it's all a matter of tuning choices. Different systems are unlikely to have made the same tuning choices for std::mutex and wait as Linux/glibc. Unless std::wait's retry strategy on the only system you care about happens to be perfect for your use-case.
It doesn't make sense to roll your own mutex without first investigating exactly what std::mutex does on your system, e.g. single-step the asm for the already-locked case to see what retries it makes. Then you'll have a better idea whether you can do any better.
You may want to distinguish between a mutex (which is generally a sleeping lock that interacts with the scheduler) and a spinlock (which does not put the current thread to sleep, and makes sense only when a thread on a different CPU is likely to be holding the lock).
Using C++20 atomics, you can definitely implement a spinlock, but this won't be directly comparable to std::mutex, which puts the current thread to sleep. Mutexes and spinlocks are useful in different situations. When successful, the spinlock is probably going to be faster--after all the mutex implementation likely contains a spinlock. It's also the only kind of lock you can acquire in an interrupt handler (though this is less relevant to user-level code). But if you hold the spinlock for a long time and there is contention, you will waste a huge amount of CPU time.

Why is there no std:: equivalent to pthread_spinlock_t like there is for pthread_mutex_t & std::mutex?

I've used pthreads a fair bit for concurrent programs, mainly utilising spinlocks, mutexes, and condition variables.
I started looking into multithreading using std::thread and using std::mutex, and I noticed that there doesn't seem to be an equivalent to spinlock in pthreads.
Anyone know why this is?
there doesn't seem to be an equivalent to spinlock in pthreads.
Spinlocks are often considered the wrong tool in user-space because there is no way to disable thread preemption while the spinlock is held (unlike in the kernel). A thread can acquire a spinlock and then get preempted, causing all other threads trying to acquire the spinlock to spin unnecessarily (and if those threads are of higher priority, that may cause a deadlock; threads waiting for I/O may get a priority boost on wake-up). This reasoning also applies to all lockless data structures, unless the data structure is truly wait-free (there aren't many practically useful ones, apart from boost::spsc_queue).
In kernel, a thread that has locked a spinlock cannot be preempted or interrupted before it releases the spinlock. And that is why spinlocks are appropriate there (when RCU cannot be used).
On Linux, one can prevent preemption (not sure if completely, but there have been recent kernel changes towards such a desirable effect) by using isolated CPU cores and FIFO real-time threads pinned to those isolated cores. But that requires a deliberate kernel/machine configuration and an application designed to take advantage of that configuration. Nevertheless, people do use such a setup for business-critical applications, along with lockless (but not wait-free) data structures in user-space.
On Linux, there is adaptive mutex PTHREAD_MUTEX_ADAPTIVE_NP, which spins for a limited number of iterations before blocking in the kernel (similar to InitializeCriticalSectionAndSpinCount). However, that mutex cannot be used through std::mutex interface because there is no option to customise non-portable pthread_mutexattr_t before initialising pthread_mutex_t.
One can enable neither process-sharing, robustness, error-checking nor priority-inversion prevention through the std::mutex interface. In practice, people write their own wrappers for pthread_mutex_t which allow setting the desirable mutex attributes, along with a corresponding wrapper for condition variables. Standard locks like std::unique_lock and std::lock_guard can be reused.
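As an illustration of that kind of wrapper (a sketch, not any particular library's code): a BasicLockable class that initialises a pthread_mutex_t with a non-portable attribute, here glibc's PTHREAD_MUTEX_ADAPTIVE_NP mentioned above (a glibc extension, may require _GNU_SOURCE), and can then be used with std::lock_guard / std::unique_lock. Error handling is omitted.

#include <pthread.h>
#include <mutex>

class adaptive_mutex {
    pthread_mutex_t m;
public:
    adaptive_mutex() {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP); // glibc extension: spin a bit before blocking
        pthread_mutex_init(&m, &attr);
        pthread_mutexattr_destroy(&attr);
    }
    ~adaptive_mutex() { pthread_mutex_destroy(&m); }
    adaptive_mutex(const adaptive_mutex&) = delete;
    adaptive_mutex& operator=(const adaptive_mutex&) = delete;

    void lock()     { pthread_mutex_lock(&m); }
    void unlock()   { pthread_mutex_unlock(&m); }
    bool try_lock() { return pthread_mutex_trylock(&m) == 0; }
    pthread_mutex_t* native_handle() { return &m; }
};

// Usage: adaptive_mutex m;  std::lock_guard<adaptive_mutex> guard(m);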
IMO, there could be provisions in the std:: APIs to set desirable mutex and condition-variable properties, for example a protected constructor for derived classes that would initialise the native_handle, but there aren't any. native_handle looks like a good hook for platform-specific tweaks; however, there would have to be a constructor allowing a derived class to initialise it appropriately. Once the mutex or condition variable has been constructed, native_handle is pretty much useless, unless the idea was only to pass it to (C language) APIs that expect a pointer or reference to an initialised pthread_mutex_t.
There is another example of Boost/C++ standard not accepting semaphores on the basis that they are too much of a rope to hang oneself, and that mutex (a binary semaphore, essentially) and condition variable are more fundamental and more flexible synchronisation primitives, out of which a semaphore can be built.
From the point of view of the C++ standard those are probably right decisions because educating users to use spinlocks and semaphores correctly with all the nuances is a difficult task. Whereas advanced users can whip out a wrapper for pthread_spinlock_t with little effort.
You are right, there's no spin lock implementation in the std namespace. A spin lock is a great concept, but in user space it generally performs poorly: the OS doesn't know your process wants to spin, and you can often get worse results than with a mutex. Note that on several platforms optimistic spinning is implemented, so a mutex can do a really good job. In addition, tuning the time to "pause" between loop iterations is neither trivial nor portable, and fine tuning is required. TL;DR: don't use a spinlock in user space unless you are really, really sure about what you are doing.
C++ Thread discussion
Article explaining how to write a spin lock with benchmark
Reply by Linus Torvalds about the above article explaining why it's a bad idea
Spin locks have two advantages:
They require much less storage than a std::mutex, because they do not need a queue of threads waiting for the lock. On my system, sizeof(pthread_spinlock_t) is 4, while sizeof(std::mutex) is 40.
They are much more performant than std::mutex, if the protected code region is small and the contention level is low to moderate.
On the downside, a poorly implemented spin lock can hog the CPU. For example, a tight loop of compare-and-set assembler instructions will spam the cache system with loads and loads of unnecessary writes. But that's what we have libraries for: they implement best practice and avoid common pitfalls. That most user implementations of spin locks are poor is not a reason to keep spin locks out of the library. Rather, it is a reason to put them there, to stop users from trying it themselves.
There is a second problem that arises from the scheduler: if thread A acquires the lock and then gets preempted by the scheduler before it finishes executing the critical section, another thread B could spin "forever" (or at least for many milliseconds, before thread A gets scheduled again) on that lock.
Unfortunately, there is no way for userland code to tell the kernel "please don't preempt me in this critical code section". But if we know that, under normal circumstances, the critical code section executes within 10 ns, we could at least tell thread B: "preempt yourself voluntarily, if you have been spinning for over 30 ns". This is not guaranteed to return control directly back to thread A. But it will stop the waste of CPU cycles that otherwise would take place. And in most scenarios, where threads A and B run in the same process at the same priority, the scheduler will usually schedule thread A before thread B if B called std::this_thread::yield().
So I am thinking about a template spin lock class that takes a single unsigned integer as a parameter, which is the number of memory reads in the critical section. This parameter is then used in the library to calculate the appropriate number of spins before a yield() is performed. With a zero count, yield() would never be called.
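A minimal sketch of that idea (illustrative only; for simplicity the template parameter here is taken directly as "iterations to spin before yielding" rather than derived from a memory-read count):

#include <atomic>
#include <thread>

template <unsigned SpinCount>
class bounded_spinlock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        unsigned spins = 0;
        while (flag.test_and_set(std::memory_order_acquire)) {
            if (SpinCount != 0 && ++spins >= SpinCount) {   // with SpinCount == 0, never yield
                std::this_thread::yield();                  // let the lock holder run
                spins = 0;
            }
        }
    }
    void unlock() { flag.clear(std::memory_order_release); }
};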

InterviewQ: How do you code a mutex?

I'm unfortunately out of a job and have been interviewing around lately. I faced this same question twice now, and was lost both times I was asked this question.
"How do you code a mutex"?
Conceptually I understand a mutex locks a certain part of code so multiple threads can not enter the critical section at the same time, eliminating data races. The first time I was asked to conceptually describe how I would code it, the second time I was asked to code it. I've been googling and haven't found any answers... can anyone help?
Thanks.
There's lots of ways to implement a mutex lock, but it typically starts with the basic premise that the CPU architecture offers some concept of atomic add and atomic subtract. That is, an addition operation can be done on an integer variable in memory (and return the result) without being corrupted by another thread attempting to access the same memory location. Or at the very least, "atomic increment" and "atomic decrement".
On modern Intel chips, for example, there's an instruction called XADD. When combined with the LOCK prefix it executes atomically and invalidates cached values across other cores. gcc implements a wrapper for this instruction called __sync_add_and_fetch. Win32 implements a similar function called InterlockedIncrement. Both are just calling LOCK XADD under the hood. Other CPU architectures should offer something similar.
So the most basic mutex lock could be implemented something like this. This is often called a "spin" lock. And this cheap version offers no ability to recursively enter the lock.
// A correct, but poorly performant mutex implementation.
// LOCK_XADD stands for an atomic add-and-fetch, e.g. gcc's __sync_add_and_fetch
// or the Win32 Interlocked functions mentioned above.
#include <unistd.h> // sleep()
#define LOCK_XADD(ptr, val) __sync_add_and_fetch((ptr), (val))

void EnterLock(int* lock)
{
    while (true)
    {
        int result = LOCK_XADD(lock, 1); // increment the value in lock and return the result atomically
        if (result == 1)
        {
            // The value in lock was successfully incremented from 0 to 1 by this
            // thread. It means this thread "acquired" the lock.
            return;
        }
        LOCK_XADD(lock, -1); // we didn't get the lock - decrement it atomically back to what it was
        sleep(0);            // give the thread quantum back before trying again
    }
}

void LeaveLock(int* lock)
{
    LOCK_XADD(lock, -1); // release the lock. Assumes we successfully acquired it with EnterLock above
}
The above suffers from the poor performance of "spinning" and doesn't guarantee any fairness. A higher priority thread could continue to win the EnterLock battle over a lower priority thread. And the programmer could make a mistake and call LeaveLock from a thread that did not previously call EnterLock. You could expand the above to operate on a data structure that not only includes the lock integer, but also has record keeping for the owner thread id and a recursion count.
The second concept for implementing a mutex is that the operating system can offer a wait and notify service such that a thread doesn't have to spin until the owner thread has released the lock. The thread or process waiting on the lock can register itself with the OS to be put to sleep until the owner releases it. In OS terms, this is called a semaphore. Additionally, an OS-level semaphore can also be used to implement locks across different processes, and for cases where the CPU doesn't offer an atomic add. And it can be used to guarantee fairness between multiple threads trying to acquire the lock.
Most implementations will try spinning for multiple attempts before falling back to making a system call.
I wouldn't say that this is a stupid question; it can be asked at any level of abstraction appropriate for the position. At a high level you just say you'd use the standard library, or any threading library. If you apply for a position as a compiler developer, you need to understand how it actually works and what is needed for the implementation.
To implement a mutex, you need a locking mechanism; that is, you need a resource that can be marked as taken across all threads. This is not trivial. You need to remember that two cores share memory, but each has its own caches, and this piece of information must be guaranteed to be current. So you do need hardware support to ensure atomicity.
If you look at the implementation of clang (libc++), they offload (at least in one case) the implementation to pthreads; see the typedefs in the threading support header:
#if defined(_LIBCPP_HAS_THREAD_API_PTHREAD)
# include <pthread.h>
# include <sched.h>
#elif defined(_LIBCPP_HAS_THREAD_API_WIN32)
#include <Windows.h>
#include <process.h>
#include <fibersapi.h>
#endif
And if you dig through the pthreads repo, you can find asm implementations of the interlocking operations. They rely on the lock asm prefix, which makes the operations atomic, i.e. no other thread can execute them at the same time. This eliminates race conditions and guarantees coherency.
Based on this, you can build a lock, which you can use for a mutex implementation.

Do mutexes guarantee ordering of acquisition? Unlocking thread takes it again while others are still waiting

A coworker had an issue recently that boiled down to what we believe was the following sequence of events in a C++ application with two threads:
Thread A holds a mutex.
While thread A is holding the mutex, thread B attempts to lock it. Since it is held, thread B is suspended.
Thread A finishes the work that it was holding the mutex for, thus releasing the mutex.
Very shortly thereafter, thread A needs to touch a resource that is protected by the mutex, so it locks it again.
It appears that thread A is given the mutex again; thread B is still waiting, even though it "asked" for the lock first.
Does this sequence of events fit with the semantics of, say, C++11's std::mutex and/or pthreads? I can honestly say I've never thought about this aspect of mutexes before.
Are there any fairness guarantees to prevent starvation of other threads for too long, or any way to get such guarantees?
Known problem. C++ mutexes are a thin layer on top of OS-provided mutexes, and OS-provided mutexes are often not fair. They do not care about FIFO order.
The other side of the same coin is that threads are usually not pre-empted until they run out of their time slice. As a result, thread A in this scenario was likely to continue to be executed, and got the mutex right away because of that.
The guarantee of a std::mutex is to enable exclusive access to shared resources. Its sole purpose is to eliminate the race condition when multiple threads attempt to access shared resources.
The implementer of a mutex may choose to favor the current thread acquiring a mutex (over another thread) for performance reasons. Allowing the current thread to acquire the mutex and make forward progress without requiring a context switch is often a preferred implementation choice supported by profiling/measurements.
Alternatively, the mutex could be constructed to prefer another (blocked) thread for acquisition (perhaps chosen according to FIFO order). This likely requires a thread context switch (on the same or another processor core), increasing latency/overhead. NOTE: FIFO mutexes can behave in surprising ways. E.g. thread priorities must be considered in FIFO support, so acquisition won't be strictly FIFO unless all competing threads have the same priority.
Adding a FIFO requirement to a mutex's definition constrains implementers to provide suboptimal performance in nominal workloads. (see above)
Protecting a queue of callable objects (std::function) with a mutex would enable sequenced execution. Multiple threads can acquire the mutex, enqueue a callable object, and release the mutex. The callable objects can be executed by a single thread (or a pool of threads if synchrony is not required).
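A minimal sketch of that queue-of-callables idea (illustrative only, not a specific library): producers enqueue work under the mutex and a single worker thread drains and runs it, so execution follows the enqueue order.

#include <functional>
#include <mutex>
#include <queue>

class serial_executor {
    std::mutex m;
    std::queue<std::function<void()>> tasks;
public:
    void post(std::function<void()> fn) {
        std::lock_guard<std::mutex> lock(m);
        tasks.push(std::move(fn));           // the mutex is held only to enqueue
    }
    // Called by one worker thread; runs queued tasks in order.
    void run_pending() {
        for (;;) {
            std::function<void()> fn;
            {
                std::lock_guard<std::mutex> lock(m);
                if (tasks.empty()) return;
                fn = std::move(tasks.front());
                tasks.pop();
            }
            fn();                            // executed outside the lock, one at a time
        }
    }
};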
• Thread A finishes the work that it was holding the mutex for, thus releasing the mutex.
• Very shortly thereafter, thread A needs to touch a resource that is protected by the mutex, so it locks it again.
In the real world, while the program is running, there is no such guarantee provided by any threading library or the OS. Here "shortly thereafter" may mean a lot to the OS and the hardware. If you say 2 minutes, then thread B would definitely get it. If you say 200 ms or less, there is no promise of A or B getting it.
The number of cores, the load on different processors/cores/threading units, contention, thread switching, kernel/user switches, pre-emption, priorities, deadlock detection schemes et al. will make a lot of difference. Just by seeing a green signal from far away you cannot guarantee that it will still be green when you reach it.
If thread B must get the resource, you may use an IPC mechanism to instruct thread B to take the resource.
You are inadvertently suggesting that threads should synchronise access to the synchronisation primitive. Mutexes are, as the name suggests, about Mutual Exclusion. They are not designed for control flow. If you want to signal a thread to run from another thread you need to use a synchronisation primitive designed for control flow i.e. a signal.
You can use a fair mutex to solve your task, i.e. a mutex that will guarantee the FIFO order of your operations. Unfortunately, C++ standard library doesn't have a fair mutex.
Thankfully, there are open-source implementations, for example yamc (a header-only library).
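For illustration of how FIFO fairness can be achieved (this is not yamc's API, just a minimal ticket-lock sketch): each locker takes a ticket and waits for its number to be served, so acquisition order matches request order.

#include <atomic>
#include <thread>

class ticket_mutex {
    std::atomic<unsigned> next_ticket{0};
    std::atomic<unsigned> now_serving{0};
public:
    void lock() {
        const unsigned my = next_ticket.fetch_add(1, std::memory_order_relaxed); // take a ticket
        while (now_serving.load(std::memory_order_acquire) != my)
            std::this_thread::yield();                                           // wait for our turn
    }
    void unlock() {
        now_serving.fetch_add(1, std::memory_order_release);                     // serve the next ticket
    }
};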
The logic here is very simple - the thread is not preempted based on mutexes, because that would require a cost incurred for each mutex operation, which is definitely not what you want. The cost of grabbing a mutex is high enough without forcing the scheduler to look for other threads to run.
If you want to fix this you can always yield the current thread. You can use std::this_thread::yield() - http://en.cppreference.com/w/cpp/thread/yield - and that might offer the chance to thread B to take over the mutex. But before you do that, allow me to tell you that this is a very fragile way of doing things, and offers no guarantee. You could, alternatively, investigate the issue deeper:
Why is it a problem that the B thread is not started when A releases the resource? Your code should not depend on such logic.
Consider using alternative thread synchronization objects like barriers (boost::barrier or http://linux.die.net/man/3/pthread_barrier_wait ) instead, if you really need this sort of logic.
Investigate if you really need to release the mutex from A at that point - I find the practice of rapidly locking and releasing a mutex more than once a code smell; it usually hurts performance terribly. See if you can group extraction of data into immutable structures which you can play around with.
Ambitious, but try to work without mutexes - use instead lock-free structures and a more functional approach, including using a lot of immutable structures. I often found quite a performance gain from updating my code to not use mutexes (and still work correctly from the mt point of view)
How do you know this:
While thread A is holding the mutex, thread B attempts to lock it.
Since it is held, thread B is suspended.
How do you know thread B is suspended? How do you know that it hasn't just finished the line of code before trying to grab the lock, but has not yet grabbed the lock:
Thread B:
x = 17;     // is the thread here?
            // or here? ('between' lines of code)
mtx.lock(); // or suspended in here?
            // how can you tell?
You can't tell. At least not in theory.
Thus the order of acquiring the lock is, to the abstract machine (ie the language), not definable.

Overhead of pthread mutexes?

I'm trying to make a C++ API (for Linux and Solaris) thread-safe, so that its functions can be called from different threads without breaking internal data structures. In my current approach I'm using pthread mutexes to protect all accesses to member variables. This means that a simple getter function now locks and unlocks a mutex, and I'm worried about the overhead of this, especially as the API will mostly be used in single-threaded apps where any mutex locking seems like pure overhead.
So, I'd like to ask:
do you have any experience with performance of single-threaded apps that use locking versus those that don't?
how expensive are these lock/unlock calls, compared to eg. a simple "return this->isActive" access for a bool member variable?
do you know better ways to protect such variable accesses?
All modern thread implementations can handle an uncontended mutex lock entirely in user space (with just a couple of machine instructions) - only when there is contention, the library has to call into the kernel.
Another point to consider is that if an application doesn't explicitly link to the pthread library (because it's a single-threaded application), it will only get dummy pthread functions (which don't do any locking at all) - only if the application is multi-threaded (and links to the pthread library), the full pthread functions will be used.
And finally, as others have already pointed out, there is no point in protecting a getter method for something like isActive with a mutex - once the caller gets a chance to look at the return value, the value might already have been changed (as the mutex is only locked inside the getter method).
"A mutex requires an OS context switch. That is fairly expensive. "
This is not true on Linux, where mutexes are implemented using something called futexes. Acquiring an uncontended (i.e., not already locked) mutex is, as cmeerw points out, a matter of a few simple instructions, and is typically in the area of 25 nanoseconds with current hardware.
For more info:
Futex
Numbers everybody should know
This is a bit off-topic but you seem to be new to threading - for one thing, only lock where threads can overlap. Then, try to minimize those places. Also, instead of trying to lock every method, think of what the thread is doing (overall) with an object and make that a single call, and lock that. Try to get your locks as high up as possible (this again increases efficiency and may /help/ to avoid deadlocking). But locks don't 'compose', you have to mentally at least cross-organize your code by where the threads are and overlap.
I did a similar library and didn't have any trouble with lock performance. (I can't tell you exactly how they're implemented, so I can't say conclusively that it's not a big deal.)
I'd go for getting it right first (i.e. use locks) then worry about performance. I don't know of a better way; that's what mutexes were built for.
An alternative for single thread clients would be to use the preprocessor to build a non-locked vs locked version of your library. E.g.:
#ifdef BUILD_SINGLE_THREAD
inline void lock () {}
inline void unlock () {}
#else
inline void lock () { doSomethingReal(); }
inline void unlock () { doSomethingElseReal(); }
#endif
Of course, that adds an additional build to maintain, as you'd distribute both single and multithread versions.
I can tell you from Windows that a mutex is a kernel object and as such incurs a (relatively) significant locking overhead. To get a better performing lock, when all you need is one that works between the threads of a single process, use a critical section. It would not work across processes, just the threads in a single process.
However, Linux is quite a different beast for multi-process locking. I know that a mutex there is implemented using atomic CPU instructions and applies only within a process - so it would have the same performance as a Win32 critical section - i.e. be very fast.
Of course, the fastest locking is not to have any at all, or to use them as little as possible (but if your lib is to be used in a heavily threaded environment, you will want to lock for as short a time as possible: lock, do something, unlock, do something else, then lock again is better than holding the lock across the whole task - the cost of locking isn't in the time taken to lock, but the time a thread sits around twiddling its thumbs waiting for another thread to release a lock it wants!)
A mutex requires an OS context switch. That is fairly expensive. The CPU can still do it hundreds of thousands of times per second without too much trouble, but it is a lot more expensive than not having the mutex there. Putting it on every variable access is probably overkill.
It also probably is not what you want. This kind of brute-force locking tends to lead to deadlocks.
do you know better ways to protect such variable accesses?
Design your application so that as little data as possible is shared. Some sections of code should be synchronized, probably with a mutex, but only those that are actually necessary. And typically not individual variable accesses, but tasks containing groups of variable accesses that must be performed atomically. (perhaps you need to set your is_active flag along with some other modifications. Does it make sense to set that flag and make no further changes to the object?)
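A small sketch of that advice (hypothetical names): lock around the whole logical state transition instead of around each individual member access, so other threads never see the flag changed without the accompanying modifications.

#include <mutex>
#include <string>

class Session {
    mutable std::mutex m;
    bool is_active = true;
    std::string last_error;
public:
    // One lock for the whole transition: is_active and last_error change together.
    void deactivate(const std::string& reason) {
        std::lock_guard<std::mutex> lock(m);
        is_active = false;
        last_error = reason;
    }
    bool active() const {
        std::lock_guard<std::mutex> lock(m);
        return is_active;   // note: may already be stale by the time the caller uses it
    }
};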
I was curious about the expense of using pthread_mutex_lock/unlock.
I had a scenario where I needed to either copy anywhere from 1500 to 65K bytes without using a mutex, or use a mutex and do a single write of a pointer to the data needed.
I wrote a short loop to test each:
gettimeofday(&starttime, NULL);
// COPY DATA
gettimeofday(&endtime, NULL);
timersub(&endtime, &starttime, &timediff);
// print out timediff data

or

gettimeofday(&starttime, NULL);
pthread_mutex_lock(&mutex);
gettimeofday(&endtime, NULL);
pthread_mutex_unlock(&mutex);
timersub(&endtime, &starttime, &timediff);
// print out timediff data
If I was copying less than 4000 or so bytes, then the straight copy operation took less time. If however I was copying more than 4000 bytes, then it was less costly to do the mutex lock/unlock.
The timing on the mutex lock/unlock ran between 3 and 5 usec, including the time for the gettimeofday call for the current time, which took about 2 usec.
For member variable access, you should use read/write locks, which have slightly less overhead and allow multiple concurrent reads without blocking.
In many cases you can use atomic builtins, if your compiler provides them (if you are using gcc or icc, __sync_fetch*() and the like), but they are notoriously hard to handle correctly.
If you can guarantee the access being atomic (for example, on x86 a dword read or write is always atomic if it is aligned, but a read-modify-write is not), you can often avoid locks altogether and use volatile instead, but this is non-portable and requires knowledge of the hardware.
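A sketch of the read/write-lock suggestion from above, using pthread_rwlock_t to match the question's pthread context (std::shared_mutex is the C++17 equivalent); names are illustrative and error handling is omitted. Readers may hold the lock concurrently; writers get exclusive access.

#include <pthread.h>

class Widget {
    mutable pthread_rwlock_t rwlock;
    bool is_active = false;
public:
    Widget()  { pthread_rwlock_init(&rwlock, nullptr); }
    ~Widget() { pthread_rwlock_destroy(&rwlock); }
    Widget(const Widget&) = delete;
    Widget& operator=(const Widget&) = delete;

    bool active() const {
        pthread_rwlock_rdlock(&rwlock);   // shared: many concurrent readers allowed
        bool v = is_active;
        pthread_rwlock_unlock(&rwlock);
        return v;
    }
    void set_active(bool v) {
        pthread_rwlock_wrlock(&rwlock);   // exclusive: single writer
        is_active = v;
        pthread_rwlock_unlock(&rwlock);
    }
};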
Well a suboptimal but simple approach is to place macros around your mutex locks and unlocks. Then have a compiler / makefile option to enable / disable threading.
Ex.
#ifdef THREAD_ENABLED
#define pthread_mutex_lock(x) ... //actual mutex call
#endif
#ifndef THREAD_ENABLED
#define pthread_mutex_lock(x) ... //do nothing
#endif
Then when compiling do a gcc -DTHREAD_ENABLED to enable threading.
Again I would NOT use this method in any large project. But only if you want something fairly simple.