C++20 mutex with atomic wait [closed]

C++20 mutex with atomic wait [closed] - c++

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 8 months ago.
Improve this question
From C++20 std::atomics have wait and notify operations. With is_always_lock_free we can ensure that the implementation is lock free. With these bricks building a lock-free mutex is not so difficult. In trivial cases locking would be a compare exchange operation, or a wait if the mutex is locked. The big question here is if it is worth it or not. If I can create such an implementation most probably the STL version is much better and faster. However I still remember how surprised I was when I saw how QMutex outperformed std::mutex QMutex vs std::mutex in 2016. So what do you think, should I experiment with such an implementation or the current implementation of std::mutex is matured enough to be optimized far beyond these tricks?
UPDATE
My wording wasn't the best, I mean that the implementation could be lock free on the happy path (locking from not locked state). Of course we should be blocked and re-scheduled if we need to wait to acquire the lock. Most probably atomic::wait is not implemented by a simple spinlock on most of the platforms (let's ignore the corner cases now), so basically it achieves the very same thing mutex::lock does. So basically if I implement such a class it would do exactly the same std::mutex does (again on most of the popular platforms). It means that STL could use the same tricks in their mutex implementations on the platforms that support these tricks. Like this spinlock, but I would use atomic wait instead of spinning. Should I trust my STL implementation that they did so?

A lock-free mutex is a contradiction.
You can build a lock out of lock-free building-blocks, and in fact that's the normal thing to do whether it's hand-written in asm or with std::atomic.
But the overall locking algorithm is by definition not lock-free. (https://en.wikipedia.org/wiki/Non-blocking_algorithm). The entire point is to stop other threads from making forward progress while one thread is in the critical section, even if it unfortunately sleeps while it's there.
I mean that the implementation could be lock free on the happy path (locking from not locked state)
std::mutex::lock() is that way too: it doesn't block if it doesn't have to! It might need to make a system call like futex(FUTEX_WAIT_PRIVATE) if there's a thread waiting for the lock. But so does an implementation that used std::notify.
Perhaps you haven't understood what "lock-free" means: it never blocks, regardless of what other threads are doing. That is all. It doesn't mean "faster in the simple/easy case". For a complex algorithm (e.g. a queue), it's often faster to just give up and block if the circular buffer is full, rather than adding overhead to the simple case to allow other threads to "assist" or cancel a stuck partial operation. (Lock-free Progress Guarantees in a circular buffer queue)
There is no inherent advantage to rolling your own std::mutex out of std::atomic. The compiler-generated machine code has to do approximately the same things either way, and I'd expect the fast path to be about the same. The only difference would be in choice of what to do in the already-locked case. (But maybe tricky to design a way to avoid calling notify iff there were no waiters, something that actual std::mutex manages in the glibc / pthreads implementation Linux.)
(I'm assuming the overhead of the library function call is negligible compared to the cost of an atomic RMW to take the mutex. Having that inline into your code is a pretty minor advantage.)
A mutex implementation can be tuned for certain use-cases, in terms of how long it spin-waits before sleeping (using an OS-assisted mechanism like futex to enable other threads to wake it when releasing the lock), and in exponential backoff for the spin-wait portion.
If std::mutex doesn't perform well for your application on the hardware you care about, it's worth considering an alternative. Although IDK exactly how you'd go about measuring whether it worked well or not. Perhaps if you could figure out that it was deciding to sleep
And yes, you could consider rolling your own with std::atomic now that there's a portable mechanism to hopefully expose a way to fall-back to OS-assisted sleep/wake mechanisms like futex. You'd still want to manually use system-specific things like x86 _mm_pause() inside a spin-wait loop, though, since I don't think C++ has anything equivalent to Rust's std::hint::spin_loop() that an implementation can use to expose things like the x86 pause instruction, intended for use in the body of a spin-loop. (See Locks around memory manipulation via inline assembly re: such considerations, and spinning read-only instead of spamming atomic RMW attempts. And for a look at the necessary parts of a spinlock in x86 assembly language, which are the same whether you get a C++ compiler to generate that machine code for you or not.)
See also https://rigtorp.se/spinlock/ re: implementing a mutex in C++ with std::atomic.
Linux / libstdc++ behaviour of notify/wait
I tested what system calls std::wait makes when it has to wait for a long time, on Arch Linux (glibc 2.33).
std::mutex lock no-contention fast path, and unlock with no waiters: zero system calls, purely user-space atomic operations. Notably is able to detect that there are no waiters when unlocking, so it doesn't make a FUTEX_WAKE system call (which otherwise would maybe a hundred times more than taking and releasing an uncontended mutex that was still hot in this cores L1d cache.)
std::mutex lock() on already locked: only a futex(0x55be9461b0c0, FUTEX_WAIT_PRIVATE, 2, NULL) system call. Possibly some spinning in user-space before that; I didn't single-step into it with GDB, but if so probably with a pause instruction.
std::mutex unlock() with a waiter: uses futex(0x55ef4af060c0, FUTEX_WAKE_PRIVATE, 1) = 1. (After an atomic RMW, IIRC; not sure why it doesn't just use a release store.)
std::notify_one: always a futex(address, FUTEX_WAKE, 1) even if there are no waiters, so it's up to you to avoid it if there are no waiters when you unlock a lock.
std::wait: spinning a few times in user-space, including 4x sched_yield() calls before a futex(addr, FUTEX_WAIT, old_val, NULL).
Note the use of FUTEX_WAIT instead of FUTEX_WAIT_PRIVATE by the wait/notify functions: these should work across processes on shared memory. The futex(2) man page says the _PRIVATE versions (only for threads of a single process) allow some additional optimizations.
I don't know about other systems, although I've heard some kinds of locks on Windows / MSVC have to suck (always a syscall even on the fast path) for backwards-compat with some ABI choices, or something. Like perhaps std::lock_guard is slow on MSVC, but std::unique_lock isn't?
Test code:
#include <atomic>
#include <thread>
#include <unistd.h> // for quick & dirty sleep and usleep. TODO: port to non-POSIX.
#include <mutex>
#if 1 // test std::atomic
//std::atomic_unsigned_lock_free gvar;
//std::atomic_uint_fast_wait_t gvar;
std::atomic<unsigned> gvar;
void waiter(){
volatile unsigned sink;
while (1) {
sink = gvar;
gvar.wait(sink); // block until it's not the old value.
// on Linux/glibc, 4x sched_yield(), then futex(0x562c3c4881c0, FUTEX_WAIT, 46, NULL ) or whatever the current counter value is
}
}
void notifier(){
while(1){
sleep(1);
gvar.store(gvar.load(std::memory_order_relaxed)+1, std::memory_order_relaxed);
gvar.notify_one();
}
}
#else
std::mutex mut;
void waiter(){
unsigned sink = 0;
while (1) {
mut.lock(); // Linux/glibc2.33 - just a futex system call, maybe some spinning in user-space first. But no sched_yield
mut.unlock();
sink++;
usleep(sink); // give the other thread plenty of time to take the lock, so we don't get it twice.
}
}
void notifier(){
while(1){
mut.lock();
sleep(1);
mut.unlock();
}
}
#endif
int main(){
std::thread t (waiter); // comment this to test the no-contention case
notifier();
// loops forever: kill it with control-C
}
Compile with g++ -Og -std=gnu++20 notifywait.cpp -pthread, run with strace -f ./a.out to see the system calls. (A couple or a few per second, since I used a nice long sleep.)
If there's any spin-waiting in user-space, it's negligible compared to the 1 second sleep interval, so it uses about a millisecond of CPU time including startup, to run for 19 iterations. (perf stat ./a.out)
Usually your time would be better spent trying to reduce the amount of locking involved, or the amount of contention, rather than trying to optimize the locks themselves. Locking is an extremely important thing, and lots of engineering has gone into tuning it for most use-cases.
If you're rolling your own lock, you probably want to get your hands dirty with system-specific stuff, because it's all a matter of tuning choices. Different systems are unlikely to have made the same tuning choices for std::mutex and wait as Linux/glibc. Unless std::wait's retry strategy on the only system you care about happens to be perfect for your use-case.
It doesn't make sense to roll your own mutex without first investigating exactly what std::mutex does on your system, e.g. single-step the asm for the already-locked case to see what retries it makes. Then you'll have a better idea whether you can do any better.

You may want to distinguish between a mutex (which is generally a sleeping lock that interacts with the scheduler) and a spinlock (which does not put the current thread to sleep, and makes sense only when a thread on a different CPU is likely to be holding the lock).
Using C++20 atomics, you can definitely implement a spinlock, but this won't be directly comparable to std::mutex, which puts the current thread to sleep. Mutexes and spinlocks are useful in different situations. When successful, the spinlock is probably going to be faster--after all the mutex implementation likely contains a spinlock. It's also the only kind of lock you can acquire in an interrupt handler (though this is less relevant to user-level code). But if you hold the spinlock for a long time and there is contention, you will waste a huge amount of CPU time.

Related

Can std::atomic be used sometimes instead of std::mutex in C++?

I suppose that std::atomic sometimes can replace usages of std::mutex. But is it always safe to use atomic instead of mutex? Example code:
std::atomic_flag f, ready; // shared
// ..... Thread 1 (and others) ....
while (true) {
// ... Do some stuff in the beginning ...
while (f.test_and_set()); // spin, acquire system lock
if (ready.test()) {
UseSystem(); // .... use our system for 50-200 nanoseconds ....
}
f.clear(); // release lock
// ... Do some stuff at the end ...
}
// ...... Thread 2 .....
while (true) {
// ... Do some stuff in the beginning ...
InitSystem();
ready.test_and_set(); // signify system ready
// .... sleep for 10-30 milli-seconds ....
while (f.test_and_set()); // acquire system lock
ready.clear(); // signify system shutdown
f.clear(); // release lock
DeInitSystem(); // finalize/destroy system
// ... Do some stuff at the end ...
}
Here I use std::atomic_flag to protect use of my system (some complex library). But is it safe code? Here I suppose that if ready is false then system is not available and I can't use it and if it is true then it is available and I can use it. For simplicity suppose that code above doesn't throw exceptions.
Of cause I can use std::mutex to protect read/modify of my system. But right now I need very high performance code in Thread-1 that should use atomics very often instead of mutexes (Thread-2 can be slow and use mutexes if needed).
In Thread-1 system-usage code (inside while loop) is run very often, each iteration around 50-200 nano-seconds. So using extra mutexes will be to heavy. But Thread-2 iterations are quite large, as you can see in each iteration of while loop when system is ready it sleeps for 10-30 milli-seconds, so using mutexes only in Thread-2 is quite alright.
Thread-1 is example of one thread, there are several threads running same (or very similar) code as Thread-1 in my real project.
I'm concerned about memory operations ordering, meaning that it can probably happen somtimes that system is not yet in fully consistent state (not yet inited fully) when ready becomes true in Thread-1. Also it may happen that ready becomes false in Thread-1 too late, when system already made some destroying (deinit) operations. Also as you can see system can be inited/destroyed many times in a loop of Thread-2 and used many times in Thread-1 whenever it is ready.
Can my task be solved somehow without std::mutex and other heavy stuff in Thread-1? Only using std::atomic (or std::atomic_flag). Thread-2 can use heavy synchronization stuff if needed, mutexes etc.
Basically Thread-2 should somehow propagate whole inited state of system to all cores and other threads before ready becomes true and also Thread-2 should propagate ready equal to false before any single small operation of system destruction (deinit) is done. By propagating state I mean that all system's inited data should be 100% written consistently to global memory and caches of other core, so that other threads see fully consistent system whenever ready is true.
It is even allowed to make small (milliseconds) pause after system init and before ready is set to true if it improves situation and guarantees. And also it is allowed to make pause after ready is set to false and before starting system destruction (deinit). Also doing some expensive CPU operations in Thread-2 is also alright if there exist some operations like "propagate all Thread-2 writes to global memory and caches to all other CPU cores and threads".
Update: As a solution for my question above right now in my project I decided to use next code with std::atomic_flag to replace std::mutex:
std::atomic_flag f = ATOMIC_FLAG_INIT; // shared
// .... Later in all threads ....
while (f.test_and_set(std::memory_order_acquire)) // try acquiring
std::this_thread::yield();
shared_value += 5; // Any code, it is lock-protected.
f.clear(std::memory_order_release); // release
This solution above runs 9 nanoseconds on average (measured 2^25 operations) in single thread (release compiled) on my Windows 10 64-bit 2Ghz 2-core laptop. While using std::unique_lock<std::mutex> lock(mux); for same protection purpose takes 100-120 nanoseconds on same Windows PC. If it is needed for threads to spinlock instead of sleeping while waiting then instead of std::this_thread::yield(); in code above I just use semicolon ;. Full online example of usage and time measurements.

I'll ignore your code for the sake of the answer, the answer generally is yes.
A lock does the following things :
allows only one thread to acquire it at any given time
when the lock is acquired, a read barrier is placed
right before the lock is released, a write barrier is placed
The combination of the 3 points above makes the critical section thread safe. only one thread can touch the shared memory, all changes are observed by the locking thread because of the read barrier, and all the changes are to be visible to other locking threads, because of the write barrier.
Can you use atomics to achieve it? Yes, And real life locks (provided for example, by Win32/Posix) ARE implemented by either using atomics and lock free programming, either by using locks that use atomics and lock free programing.
Now, realistically speaking, should you use a self-written lock instead of the standard locks? Absolutely not.
Many concurrency tutorials preserve the notion that spin-locks are "more efficient" than regular locks. I can't stress enough how foolish it is. A user-mode spinlock IS NEVER more efficient than a lock that the OS provides. The reason is simple, that OS locks are wired to the OS scheduler. So if a lock tries to lock a lock and fails - the OS knows to freeze this thread and not reschedule it to run until the lock has been released.
With user-mode spinlocks, this doesn't happen. The OS can't know that the relevant thread tries to acquire to the lock in a tight loop. Yielding is just a patch and not a solution - we want to spin for a short time, then go to sleep until the lock is released. With user mode spin locks, we might waste the entire thread quantum trying to lock the spinlock and yielding.
I will say, for the sake of honesty, that recent C++ standards do give us the ability to sleep on an atomic waiting for it to change its value. So we can, in a very lame way, implement our own "real" locks that try to spin for a while and then sleep until the lock is released. However, implementing a correct and efficient lock when you're not a concurrency expert is pretty much impossible.
My own philosophical opinion that in 2021, developers should rarely deal with those very low-level concurrency topics. Leave those things to the kernel guys.
Use some high level concurrency library and focus on the product you want to develop rather than micro-optimizing your code. This is concurrency, where correctness >>> efficiency.
A related rant by Linus Torvalds

Why is there no std:: equivalent to pthread_spinlock_t like there is for pthread_mutex_t & std::mutex?

I've used pthreads a fair bit for concurrent programs, mainly utilising spinlocks, mutexes, and condition variables.
I started looking into multithreading using std::thread and using std::mutex, and I noticed that there doesn't seem to be an equivalent to spinlock in pthreads.
Anyone know why this is?

there doesn't seem to be an equivalent to spinlock in pthreads.
Spinlocks are often considered a wrong tool in user-space because there is no way to disable thread preemption while the spinlock is held (unlike in kernel). So that a thread can acquire a spinlock and then get preempted, causing all other threads trying to acquire the spinlock to spin unnecessarily (and if those threads are of higher priority that may cause a deadlock (threads waiting for I/O may get a priority boost on wake up)). This reasoning also applies to all lockless data structures, unless the data structure is truly wait-free (there aren't many practically useful ones, apart from boost::spsc_queue).
In kernel, a thread that has locked a spinlock cannot be preempted or interrupted before it releases the spinlock. And that is why spinlocks are appropriate there (when RCU cannot be used).
On Linux, one can prevent preemption (not sure if completely, but there has been recent kernel changes towards such a desirable effect) by using isolated CPU cores and FIFO real-time threads pinned to those isolated cores. But that requires a deliberate kernel/machine configuration and an application designed to take advantage of that configuration. Nevertheless, people do use such a setup for business-critical applications along with lockless (but not wait-free) data structures in user-space.
On Linux, there is adaptive mutex PTHREAD_MUTEX_ADAPTIVE_NP, which spins for a limited number of iterations before blocking in the kernel (similar to InitializeCriticalSectionAndSpinCount). However, that mutex cannot be used through std::mutex interface because there is no option to customise non-portable pthread_mutexattr_t before initialising pthread_mutex_t.
One can neither enable process-sharing, robostness, error-checking or priority-inversion prevention through std::mutex interface. In practice, people write their own wrappers of pthread_mutex_t which allows to set desirable mutex attributes; along with a corresponding wrapper for condition variables. Standard locks like std::unique_lock and std::lock_guard can be reused.
IMO, there could be provisions to set desirable mutex and condition variable properties in std:: APIs, like providing a protected constructor for derived classes that would initialize that native_handle, but there aren't any. That native_handle looks like a good idea to do platform specific stuff, however, there must be a constructor for the derived class to be able to initialize it appropriately. After the mutex or condition variable is initialized that native_handle is pretty much useless. Unless the idea was only to be able to pass that native_handle to (C language) APIs that expect a pointer or reference to an initialized pthread_mutex_t.
There is another example of Boost/C++ standard not accepting semaphores on the basis that they are too much of a rope to hang oneself, and that mutex (a binary semaphore, essentially) and condition variable are more fundamental and more flexible synchronisation primitives, out of which a semaphore can be built.
From the point of view of the C++ standard those are probably right decisions because educating users to use spinlocks and semaphores correctly with all the nuances is a difficult task. Whereas advanced users can whip out a wrapper for pthread_spinlock_t with little effort.

You are right there's no spin lock implementation in the std namespace. A spin lock is a great concept but in user space is generally quite poor. OS doesn't know your process wants to spin and usually you can have worse results than using a mutex. To be noted that on several platforms there's the optimistic spinning implemented so a mutex can do a really good job. In addition adjusting the time to "pause" between each loop iteration can be not trivial and portable and a fine tuning is required. TL;DR don't use a spinlock in user space unless you are really really sure about what you are doing.
C++ Thread discussion
Article explaining how to write a spin lock with benchmark
Reply by Linus Torvalds about the above article explaining why it's a bad idea

Spin locks have two advantages:
They require much fewer storage as a std::mutex, because they do not need a queue of threads waiting for the lock. On my system, sizeof(pthread_spinlock_t) is 4, while sizeof(std::mutex) is 40.
They are much more performant than std::mutex, if the protected code region is small and the contention level is low to moderate.
On the downside, a poorly implemented spin lock can hog the CPU. For example, a tight loop with a compare-and-set assembler instructions will spam the cache system with loads and loads of unnecessary writes. But that's what we have libraries for, that they implement best practice and avoid common pitfalls. That most user implementations of spin locks are poor, is not a reason to not put spin locks into the library. Rather, it is a reason to put it there, to stop users from trying it themselves.
There is a second problem, that arises from the scheduler: If thread A acquires the lock and then gets preempted by the scheduler before it finishes executing the critical section, another thread B could spin "forever" (or at least for many milliseconds, before thread A gets scheduled again) on that lock.
Unfortunately, there is no way, how userland code can tell the kernel "please don't preempt me in this critical code section". But if we know, that under normal circumstances, the critical code section executes within 10 ns, we could at least tell thread B: "preempt yourself voluntarily, if you have been spinning for over 30 ns". This is not guaranteed to return control directly back to thread A. But it will stop the waste of CPU cycles, that otherwise would take place. And in most scenarios, where thread A and B run in the same process at the same priority, the scheduler will usually schedule thread A before thread B, if B called std::this_thread::yield().
So, I am thinking about a template spin lock class, that takes a single unsigned integer as a parameter, which is the number of memory reads in the critical section. This parameter is then used in the library to calculate the appropriate number of spins, before a yield() is performed. With a zero count, yield() would never be called.

InterviewQ: How do you code a mutex?

I'm unfortunately out of a job and have been interviewing around lately. I faced this same question twice now, and was lost both times I was asked this question.
"How do you code a mutex"?
Conceptually I understand a mutex locks a certain part of code so multiple threads can not enter the critical section at the same time, eliminating data races. The first time I was asked to conceptually describe how I would code it, the second time I was asked to code it. I've been googling and haven't found any answers... can anyone help?
Thanks.

There's lots of ways to implement a mutex lock, but it typically starts with the basic premise that the cpu architecture offers some concept of atomic add and atomic subtract. That is, an addition operation can be done to an integer variable in memory (and return the result) without being corrupted by another thread attempting to access same memory location. Or at the very least, "atomic increment" and "atomic decrement".
On modern Intel chips, for example, there's an instruction called XADD. When combined with the LOCK prefix it executes atomically and invalidates cached values across other cores. gcc implements a wrapper for this instruction called __sync_add_and_fetch. Win32 implements a similar function called InterlockedIncrement. Both are just calling LOCK XADD under the hood. Other CPU architectures should offer something similar.
So the most basic mutex lock could be implemented something like this. This is often called a "spin" lock. And this cheap version offers no ability to recursively enter the lock.
// A correct, but poorly performant mutex implementation
void EnterLock(int* lock)
{
while (true)
{
int result = LOCK_XADD(lock,1); // increment the value in lock and return the result atomically
if (result == 1)
{
// if the value in lock was successfully incremented
// from 0 to 1 by this thread. It means this thread "acquired" the lock
return;
}
LOCK XADD(lock,-1); // we didn't get the lock - decrement it atmoically back to what it was
sleep(0); // give the thread quantum back before trying again
}
}
void LeaveLock(int* lock)
{
LOCK XADD(lock,-1); // release the lock. Assumes we successfully acquired it correctly with EnterLock above
}
The above suffers from poor performance of "spinning" and doesn't guarantee any fairness. A higher priority thread could continue to win the EnterLock battle over a lower priority thread. And the programmer could make a mistake and call LeaveLock with with a thread that did not previously call EnterLock. You could expand the above to operate on a data structure that not only includes the lock integer, but also has record keeping for the owner thread id and a recursion count.
The second concept for implementing a mutex is that the operating system can offer a wait and notify service such that a thread doesn't have to spin until the owner thread has released it. The thread or process waiting on lock can register itself with the OS to be put to sleep until the owner thread has released it. In OS terms, this is called a semaphore. Additionally, the OS level semaphore can also be used to implement locks across different processes and for the cases where the CPU doesn't offer an atomic add. And can be used to guaranteed fairness between multiple threads trying to acquire the lock.
Most implementations will try spinning for multiple attempts before falling back to making a system call.

I wouldn't say that this is a stupid question. On any level of abstraction for the position. On the high level you just say, you use standard library, or any threading library. If you apply for a position as the compiler developer you need to understand how it acutally works and what is needed for the implementation.
To implement a mutex, you need a locking mechanism, that is you need to have a resource that can be marked as taken across all threads. This is not trivial. You need to remember that two cores share memory, but they have caches. This piece of information must be guaranteed to be actual. So you do need support for hardware to ensure atomicity.
If you take at implementation of clang, they offload (at least in once case) implementation to pthreads, typedefs in threading support:
#if defined(_LIBCPP_HAS_THREAD_API_PTHREAD)
# include <pthread.h>
# include <sched.h>
#elif defined(_LIBCPP_HAS_THREAD_API_WIN32)
#include <Windows.h>
#include <process.h>
#include <fibersapi.h>
#endif
And if you dig through pthreads repo, you can find asm implementations of the interlocking operations. They rely on the lock asm keyword which make the operations atomic, i.e. no other thread can execute them at the same time. This eliminates racing conditions, and guarantees coherency.
Based on this, you can build a lock, which you can use for a mutex implementation.

Why is std::mutex so slow on OSX?

I have the following benchmark: https://gist.github.com/leifwalsh/10010580
Essentially it spins up k threads and then each thread does about 16 million / k lock/increment/unlock cycles, using a spinlock and a std::mutex. On OSX, the std::mutex is devastatingly slower than the spinlock when contended, whereas on Linux it's competitive or a bit faster.
OSX:
spinlock 1: 334ms
spinlock 2: 3537ms
spinlock 3: 4815ms
spinlock 4: 5653ms
std::mutex 1: 813ms
std::mutex 2: 38464ms
std::mutex 3: 44254ms
std::mutex 4: 47418ms
Linux:
spinlock 1: 305ms
spinlock 2: 1590ms
spinlock 3: 1820ms
spinlock 4: 2300ms
std::mutex 1: 377ms
std::mutex 2: 1124ms
std::mutex 3: 1739ms
std::mutex 4: 2668ms
The processors are different, but not that different (OSX is Intel(R) Core(TM) i7-2677M CPU # 1.80GHz, Linux is Intel(R) Core(TM) i5-2500K CPU # 3.30GHz), this seems like a library or kernel problem. Anyone know the source of the slowness?
To clarify my question, I understand that "there are different mutex implementations that optimize for different things and this isn't a problem, it's expected". This question is: what are the actual differences in implementation that cause this? Or, if it's a hardware issue (maybe the cache is just a lot slower on the macbook), that's acceptable too.

You're just measuring the library's choice of trading off throughput for fairness. The benchmark is heavily artificial and penalizes any attempt to provide any fairness at all.
The implementation can do two things. It can let the same thread get the mutex twice in a row, or it can change which thread gets the mutex. This benchmark heavily penalizes a change in threads because the context switch takes time and because ping-ponging the mutex and val from cache to cache takes time.
Most likely, this is just showing the different trade-offs that implementations have to make. It heavily rewards implementations that prefer to give the mutex back to the thread that last held it. The benchmark even rewards implementations that waste CPU to do that! It even rewards implementations that waste CPU to avoid context switches, even when there's other useful work the CPU could do! It also doesn't penalize the implementation for inter-core traffic which can slow down other unrelated threads.
Also, people who implement mutexes generally presume that performance in the uncontended case is more important than performance in the contended case. There are numerous tradeoffs you can make between these cases, such as presuming that there might be a thread waiting or specifically checking if there is. The benchmark tests only (or at least, almost only) the case that is typically traded off in favor of the case presumed more common.
Bluntly, this is a senseless benchmark that is incapable of identifying a problem.
The specific explanation is almost certainly that the Linux implementation is a spinlock/futex hybrid while the OSX implementation is conventional, equivalent to locking a kernel object. The spinlock portion of the Linux implementation favors allowing the same thread that just released the mutex to lock it again, which your benchmark heavily rewards.

You need to use the same STL implementation on both systems. This could either be an issue in libc++ or in pthread_mutex_*().
What the other posters are saying about mutex locks being conventional on OS X is a complete lie, however. Yes, Mach locksets and semaphores require system calls for every operation. But unless you explicitly use lockset or semaphore Mach APIs, then these ARE NOT USED in your application.
OS X's libpthread uses __psynch_* BSD system calls, which remotely match Linux futexes. In the uncontended case, libpthread makes no system call to acquire the mutex. Only an instruction such as cmpxchg is used.
Source: libpthread source code and my own knowledge (I'm the developer of Darling).

David Schwartz is basically correct, except for the throughput/adaptiveness comment. It is actually much faster on Linux because it uses a futex & the overhead of a contended call is much smaller. What this means is that in the uncontended case, it simply does a function call, atomic operation & returns. If most of your locks are uncontended (which is usually the typical behaviour you'll see in many real-world programs), acquiring a lock is basically free. Even in the contended case, it's basically a function call, syscall + atomic operation + adding 1 thread to a list (the syscall being the expensive part of the operation). If the mutex is released during the syscall, then the function returns right away without enqueuing onto a wait-list.
On OSX, there is no futex. Acquiring a mutex requires always talking to the kernel. Moreover, OSX is a micro-kernel hybrid. That means to talk to the kernel, you need to send it a message. This means you do data marshalling, syscall, copy the data to a separate buffer. Then at some point the kernel comes by, unmarshalls the data & acquires the lock & sends you back a message. Thus in the uncontended case, it's much heavier-weight. In the contended case, it depends on how long you're blocked waiting for a lock: the longer you wait, the cheaper your lock operation becomes when amortized across total runtime.
On OSX, there is a much faster mechanism called dispatch queues, but it requires re-thinking how your program works. In addition to using lock-free synchronization (i.e. uncontended cases never jump to the kernel), they also do thread pooling & job scheduling. Additionally, they provide asynchronous dispatch which lets you schedule a job without needing to wait for a lock.

Overhead of pthread mutexes?

I'm trying to make a C++ API (for Linux and Solaris) thread-safe, so that its functions can be called from different threads without breaking internal data structures. In my current approach I'm using pthread mutexes to protect all accesses to member variables. This means that a simple getter function now locks and unlocks a mutex, and I'm worried about the overhead of this, especially as the API will mostly be used in single-threaded apps where any mutex locking seems like pure overhead.
So, I'd like to ask:
do you have any experience with performance of single-threaded apps that use locking versus those that don't?
how expensive are these lock/unlock calls, compared to eg. a simple "return this->isActive" access for a bool member variable?
do you know better ways to protect such variable accesses?

All modern thread implementations can handle an uncontended mutex lock entirely in user space (with just a couple of machine instructions) - only when there is contention, the library has to call into the kernel.
Another point to consider is that if an application doesn't explicitly link to the pthread library (because it's a single-threaded application), it will only get dummy pthread functions (which don't do any locking at all) - only if the application is multi-threaded (and links to the pthread library), the full pthread functions will be used.
And finally, as others have already pointed out, there is no point in protecting a getter method for something like isActive with a mutex - once the caller gets a chance to look at the return value, the value might already have been changed (as the mutex is only locked inside the getter method).

"A mutex requires an OS context switch. That is fairly expensive. "
This is not true on Linux, where mutexes are implemented using something called futex'es. Acquiring an uncontested (i.e., not already locked) mutex is, as cmeerw points out, a matter of a few simple instructions, and is typically in the area of 25 nanoseconds w/current hardware.
For more info:
Futex
Numbers everybody should know

This is a bit off-topic but you seem to be new to threading - for one thing, only lock where threads can overlap. Then, try to minimize those places. Also, instead of trying to lock every method, think of what the thread is doing (overall) with an object and make that a single call, and lock that. Try to get your locks as high up as possible (this again increases efficiency and may /help/ to avoid deadlocking). But locks don't 'compose', you have to mentally at least cross-organize your code by where the threads are and overlap.

I did a similar library and didn't have any trouble with lock performance. (I can't tell you exactly how they're implemented, so I can't say conclusively that it's not a big deal.)
I'd go for getting it right first (i.e. use locks) then worry about performance. I don't know of a better way; that's what mutexes were built for.
An alternative for single thread clients would be to use the preprocessor to build a non-locked vs locked version of your library. E.g.:
#ifdef BUILD_SINGLE_THREAD
inline void lock () {}
inline void unlock () {}
#else
inline void lock () { doSomethingReal(); }
inline void unlock () { doSomethingElseReal(); }
#endif
Of course, that adds an additional build to maintain, as you'd distribute both single and multithread versions.

I can tell you from Windows, that a mutex is a kernel object and as such incurs a (relatively) significant locking overhead. To get a better performing lock, when all you need is one that works in threads, is to use a critical section. This would not work across processes, just the threads in a single process.
However.. linux is quite a different beast to multi-process locking. I know that a mutex is implemented using the atomic CPU instructions and only apply to a process - so they would have the same performance as a win32 critical section - ie be very fast.
Of course, the fastest locking is not to have any at all, or to use them as little as possible (but if your lib is to be used in a heavily threaded environment, you will want to lock for as short a time as possible: lock, do something, unlock, do something else, then lock again is better than holding the lock across the whole task - the cost of locking isn't in the time taken to lock, but the time a thread sits around twiddling its thumbs waiting for another thread to release a lock it wants!)

A mutex requires an OS context switch. That is fairly expensive. The CPU can still do it hundreds of thousands of times per second without too much trouble, but it is a lot more expensive than not having the mutex there. Putting it on every variable access is probably overkill.
It also probably is not what you want. This kind of brute-force locking tends to lead to deadlocks.
do you know better ways to protect such variable accesses?
Design your application so that as little data as possible is shared. Some sections of code should be synchronized, probably with a mutex, but only those that are actually necessary. And typically not individual variable accesses, but tasks containing groups of variable accesses that must be performed atomically. (perhaps you need to set your is_active flag along with some other modifications. Does it make sense to set that flag and make no further changes to the object?)

I was curious about the expense of using a pthred_mutex_lock/unlock.
I had a scenario where I needed to either copy anywhere from 1500-65K bytes without using
a mutex or to use a mutex and do a single write of a pointer to the data needed.
I wrote a short loop to test each
gettimeofday(&starttime, NULL)
COPY DATA
gettimeofday(&endtime, NULL)
timersub(&endtime, &starttime, &timediff)
print out timediff data
or
ettimeofday(&starttime, NULL)
pthread_mutex_lock(&mutex);
gettimeofday(&endtime, NULL)
pthread_mutex_unlock(&mutex);
timersub(&endtime, &starttime, &timediff)
print out timediff data
If I was copying less than 4000 or so bytes, then the straight copy operation took less time. If however I was copying more than 4000 bytes, then it was less costly to do the mutex lock/unlock.
The timing on the mutex lock/unlock ran between 3 and 5 usec long including the time for
the gettimeofday for the currentTime which took about 2 usec

For member variable access, you should use read/write locks, which have slightly less overhead and allow multiple concurrent reads without blocking.
In many cases you can use atomic builtins, if your compiler provides them (if you are using gcc or icc __sync_fetch*() and the like), but they are notouriously hard to handle correctly.
If you can guarantee the access being atomic (for example on x86 an dword read or write is always atomic, if it is aligned, but not a read-modify-write), you can often avoid locks at all and use volatile instead, but this is non portable and requires knowledge of the hardware.

Well a suboptimal but simple approach is to place macros around your mutex locks and unlocks. Then have a compiler / makefile option to enable / disable threading.
Ex.
#ifdef THREAD_ENABLED
#define pthread_mutex_lock(x) ... //actual mutex call
#endif
#ifndef THREAD_ENABLED
#define pthread_mutex_lock(x) ... //do nothing
#endif
Then when compiling do a gcc -DTHREAD_ENABLED to enable threading.
Again I would NOT use this method in any large project. But only if you want something fairly simple.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js