Why is std::mutex so slow on OSX? - c++

I have the following benchmark: https://gist.github.com/leifwalsh/10010580
Essentially it spins up k threads and then each thread does about 16 million / k lock/increment/unlock cycles, using a spinlock and a std::mutex. On OSX, the std::mutex is devastatingly slower than the spinlock when contended, whereas on Linux it's competitive or a bit faster.
OSX:
spinlock 1: 334ms
spinlock 2: 3537ms
spinlock 3: 4815ms
spinlock 4: 5653ms
std::mutex 1: 813ms
std::mutex 2: 38464ms
std::mutex 3: 44254ms
std::mutex 4: 47418ms
Linux:
spinlock 1: 305ms
spinlock 2: 1590ms
spinlock 3: 1820ms
spinlock 4: 2300ms
std::mutex 1: 377ms
std::mutex 2: 1124ms
std::mutex 3: 1739ms
std::mutex 4: 2668ms
The processors are different, but not that different (OSX is Intel(R) Core(TM) i7-2677M CPU # 1.80GHz, Linux is Intel(R) Core(TM) i5-2500K CPU # 3.30GHz), this seems like a library or kernel problem. Anyone know the source of the slowness?
To clarify my question, I understand that "there are different mutex implementations that optimize for different things and this isn't a problem, it's expected". This question is: what are the actual differences in implementation that cause this? Or, if it's a hardware issue (maybe the cache is just a lot slower on the macbook), that's acceptable too.

You're just measuring the library's choice of trading off throughput for fairness. The benchmark is heavily artificial and penalizes any attempt to provide any fairness at all.
The implementation can do two things. It can let the same thread get the mutex twice in a row, or it can change which thread gets the mutex. This benchmark heavily penalizes a change in threads because the context switch takes time and because ping-ponging the mutex and val from cache to cache takes time.
Most likely, this is just showing the different trade-offs that implementations have to make. It heavily rewards implementations that prefer to give the mutex back to the thread that last held it. The benchmark even rewards implementations that waste CPU to do that! It even rewards implementations that waste CPU to avoid context switches, even when there's other useful work the CPU could do! It also doesn't penalize the implementation for inter-core traffic which can slow down other unrelated threads.
Also, people who implement mutexes generally presume that performance in the uncontended case is more important than performance in the contended case. There are numerous tradeoffs you can make between these cases, such as presuming that there might be a thread waiting or specifically checking if there is. The benchmark tests only (or at least, almost only) the case that is typically traded off in favor of the case presumed more common.
Bluntly, this is a senseless benchmark that is incapable of identifying a problem.
The specific explanation is almost certainly that the Linux implementation is a spinlock/futex hybrid while the OSX implementation is conventional, equivalent to locking a kernel object. The spinlock portion of the Linux implementation favors allowing the same thread that just released the mutex to lock it again, which your benchmark heavily rewards.

You need to use the same STL implementation on both systems. This could either be an issue in libc++ or in pthread_mutex_*().
What the other posters are saying about mutex locks being conventional on OS X is a complete lie, however. Yes, Mach locksets and semaphores require system calls for every operation. But unless you explicitly use lockset or semaphore Mach APIs, then these ARE NOT USED in your application.
OS X's libpthread uses __psynch_* BSD system calls, which remotely match Linux futexes. In the uncontended case, libpthread makes no system call to acquire the mutex. Only an instruction such as cmpxchg is used.
Source: libpthread source code and my own knowledge (I'm the developer of Darling).

David Schwartz is basically correct, except for the throughput/adaptiveness comment. It is actually much faster on Linux because it uses a futex & the overhead of a contended call is much smaller. What this means is that in the uncontended case, it simply does a function call, atomic operation & returns. If most of your locks are uncontended (which is usually the typical behaviour you'll see in many real-world programs), acquiring a lock is basically free. Even in the contended case, it's basically a function call, syscall + atomic operation + adding 1 thread to a list (the syscall being the expensive part of the operation). If the mutex is released during the syscall, then the function returns right away without enqueuing onto a wait-list.
On OSX, there is no futex. Acquiring a mutex requires always talking to the kernel. Moreover, OSX is a micro-kernel hybrid. That means to talk to the kernel, you need to send it a message. This means you do data marshalling, syscall, copy the data to a separate buffer. Then at some point the kernel comes by, unmarshalls the data & acquires the lock & sends you back a message. Thus in the uncontended case, it's much heavier-weight. In the contended case, it depends on how long you're blocked waiting for a lock: the longer you wait, the cheaper your lock operation becomes when amortized across total runtime.
On OSX, there is a much faster mechanism called dispatch queues, but it requires re-thinking how your program works. In addition to using lock-free synchronization (i.e. uncontended cases never jump to the kernel), they also do thread pooling & job scheduling. Additionally, they provide asynchronous dispatch which lets you schedule a job without needing to wait for a lock.

Related

C++20 mutex with atomic wait [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 8 months ago.
Improve this question
From C++20 std::atomics have wait and notify operations. With is_always_lock_free we can ensure that the implementation is lock free. With these bricks building a lock-free mutex is not so difficult. In trivial cases locking would be a compare exchange operation, or a wait if the mutex is locked. The big question here is if it is worth it or not. If I can create such an implementation most probably the STL version is much better and faster. However I still remember how surprised I was when I saw how QMutex outperformed std::mutex QMutex vs std::mutex in 2016. So what do you think, should I experiment with such an implementation or the current implementation of std::mutex is matured enough to be optimized far beyond these tricks?
UPDATE
My wording wasn't the best, I mean that the implementation could be lock free on the happy path (locking from not locked state). Of course we should be blocked and re-scheduled if we need to wait to acquire the lock. Most probably atomic::wait is not implemented by a simple spinlock on most of the platforms (let's ignore the corner cases now), so basically it achieves the very same thing mutex::lock does. So basically if I implement such a class it would do exactly the same std::mutex does (again on most of the popular platforms). It means that STL could use the same tricks in their mutex implementations on the platforms that support these tricks. Like this spinlock, but I would use atomic wait instead of spinning. Should I trust my STL implementation that they did so?
A lock-free mutex is a contradiction.
You can build a lock out of lock-free building-blocks, and in fact that's the normal thing to do whether it's hand-written in asm or with std::atomic.
But the overall locking algorithm is by definition not lock-free. (https://en.wikipedia.org/wiki/Non-blocking_algorithm). The entire point is to stop other threads from making forward progress while one thread is in the critical section, even if it unfortunately sleeps while it's there.
I mean that the implementation could be lock free on the happy path (locking from not locked state)
std::mutex::lock() is that way too: it doesn't block if it doesn't have to! It might need to make a system call like futex(FUTEX_WAIT_PRIVATE) if there's a thread waiting for the lock. But so does an implementation that used std::notify.
Perhaps you haven't understood what "lock-free" means: it never blocks, regardless of what other threads are doing. That is all. It doesn't mean "faster in the simple/easy case". For a complex algorithm (e.g. a queue), it's often faster to just give up and block if the circular buffer is full, rather than adding overhead to the simple case to allow other threads to "assist" or cancel a stuck partial operation. (Lock-free Progress Guarantees in a circular buffer queue)
There is no inherent advantage to rolling your own std::mutex out of std::atomic. The compiler-generated machine code has to do approximately the same things either way, and I'd expect the fast path to be about the same. The only difference would be in choice of what to do in the already-locked case. (But maybe tricky to design a way to avoid calling notify iff there were no waiters, something that actual std::mutex manages in the glibc / pthreads implementation Linux.)
(I'm assuming the overhead of the library function call is negligible compared to the cost of an atomic RMW to take the mutex. Having that inline into your code is a pretty minor advantage.)
A mutex implementation can be tuned for certain use-cases, in terms of how long it spin-waits before sleeping (using an OS-assisted mechanism like futex to enable other threads to wake it when releasing the lock), and in exponential backoff for the spin-wait portion.
If std::mutex doesn't perform well for your application on the hardware you care about, it's worth considering an alternative. Although IDK exactly how you'd go about measuring whether it worked well or not. Perhaps if you could figure out that it was deciding to sleep
And yes, you could consider rolling your own with std::atomic now that there's a portable mechanism to hopefully expose a way to fall-back to OS-assisted sleep/wake mechanisms like futex. You'd still want to manually use system-specific things like x86 _mm_pause() inside a spin-wait loop, though, since I don't think C++ has anything equivalent to Rust's std::hint::spin_loop() that an implementation can use to expose things like the x86 pause instruction, intended for use in the body of a spin-loop. (See Locks around memory manipulation via inline assembly re: such considerations, and spinning read-only instead of spamming atomic RMW attempts. And for a look at the necessary parts of a spinlock in x86 assembly language, which are the same whether you get a C++ compiler to generate that machine code for you or not.)
See also https://rigtorp.se/spinlock/ re: implementing a mutex in C++ with std::atomic.
Linux / libstdc++ behaviour of notify/wait
I tested what system calls std::wait makes when it has to wait for a long time, on Arch Linux (glibc 2.33).
std::mutex lock no-contention fast path, and unlock with no waiters: zero system calls, purely user-space atomic operations. Notably is able to detect that there are no waiters when unlocking, so it doesn't make a FUTEX_WAKE system call (which otherwise would maybe a hundred times more than taking and releasing an uncontended mutex that was still hot in this cores L1d cache.)
std::mutex lock() on already locked: only a futex(0x55be9461b0c0, FUTEX_WAIT_PRIVATE, 2, NULL) system call. Possibly some spinning in user-space before that; I didn't single-step into it with GDB, but if so probably with a pause instruction.
std::mutex unlock() with a waiter: uses futex(0x55ef4af060c0, FUTEX_WAKE_PRIVATE, 1) = 1. (After an atomic RMW, IIRC; not sure why it doesn't just use a release store.)
std::notify_one: always a futex(address, FUTEX_WAKE, 1) even if there are no waiters, so it's up to you to avoid it if there are no waiters when you unlock a lock.
std::wait: spinning a few times in user-space, including 4x sched_yield() calls before a futex(addr, FUTEX_WAIT, old_val, NULL).
Note the use of FUTEX_WAIT instead of FUTEX_WAIT_PRIVATE by the wait/notify functions: these should work across processes on shared memory. The futex(2) man page says the _PRIVATE versions (only for threads of a single process) allow some additional optimizations.
I don't know about other systems, although I've heard some kinds of locks on Windows / MSVC have to suck (always a syscall even on the fast path) for backwards-compat with some ABI choices, or something. Like perhaps std::lock_guard is slow on MSVC, but std::unique_lock isn't?
Test code:
#include <atomic>
#include <thread>
#include <unistd.h> // for quick & dirty sleep and usleep. TODO: port to non-POSIX.
#include <mutex>
#if 1 // test std::atomic
//std::atomic_unsigned_lock_free gvar;
//std::atomic_uint_fast_wait_t gvar;
std::atomic<unsigned> gvar;
void waiter(){
volatile unsigned sink;
while (1) {
sink = gvar;
gvar.wait(sink); // block until it's not the old value.
// on Linux/glibc, 4x sched_yield(), then futex(0x562c3c4881c0, FUTEX_WAIT, 46, NULL ) or whatever the current counter value is
}
}
void notifier(){
while(1){
sleep(1);
gvar.store(gvar.load(std::memory_order_relaxed)+1, std::memory_order_relaxed);
gvar.notify_one();
}
}
#else
std::mutex mut;
void waiter(){
unsigned sink = 0;
while (1) {
mut.lock(); // Linux/glibc2.33 - just a futex system call, maybe some spinning in user-space first. But no sched_yield
mut.unlock();
sink++;
usleep(sink); // give the other thread plenty of time to take the lock, so we don't get it twice.
}
}
void notifier(){
while(1){
mut.lock();
sleep(1);
mut.unlock();
}
}
#endif
int main(){
std::thread t (waiter); // comment this to test the no-contention case
notifier();
// loops forever: kill it with control-C
}
Compile with g++ -Og -std=gnu++20 notifywait.cpp -pthread, run with strace -f ./a.out to see the system calls. (A couple or a few per second, since I used a nice long sleep.)
If there's any spin-waiting in user-space, it's negligible compared to the 1 second sleep interval, so it uses about a millisecond of CPU time including startup, to run for 19 iterations. (perf stat ./a.out)
Usually your time would be better spent trying to reduce the amount of locking involved, or the amount of contention, rather than trying to optimize the locks themselves. Locking is an extremely important thing, and lots of engineering has gone into tuning it for most use-cases.
If you're rolling your own lock, you probably want to get your hands dirty with system-specific stuff, because it's all a matter of tuning choices. Different systems are unlikely to have made the same tuning choices for std::mutex and wait as Linux/glibc. Unless std::wait's retry strategy on the only system you care about happens to be perfect for your use-case.
It doesn't make sense to roll your own mutex without first investigating exactly what std::mutex does on your system, e.g. single-step the asm for the already-locked case to see what retries it makes. Then you'll have a better idea whether you can do any better.
You may want to distinguish between a mutex (which is generally a sleeping lock that interacts with the scheduler) and a spinlock (which does not put the current thread to sleep, and makes sense only when a thread on a different CPU is likely to be holding the lock).
Using C++20 atomics, you can definitely implement a spinlock, but this won't be directly comparable to std::mutex, which puts the current thread to sleep. Mutexes and spinlocks are useful in different situations. When successful, the spinlock is probably going to be faster--after all the mutex implementation likely contains a spinlock. It's also the only kind of lock you can acquire in an interrupt handler (though this is less relevant to user-level code). But if you hold the spinlock for a long time and there is contention, you will waste a huge amount of CPU time.

Implementing spin-lock without XCHG?

C++ spin-lock can be easily implemented using std::atomic_flag, it can be coded roughly (without special features) like following:
std::atomic_flag f = ATOMIC_FLAG_INIT;
while (f.test_and_set(std::memory_order_acquire)); // Acquire lock
// Here do some lock-protected work .....
f.clear(std::memory_order_release); // Release lock
One can see online assembly, it shows that acquiring is implemented atomically through XCHG instruction.
Also as one can see on uops.info (screen here) that XCHG may take up to 30 CPU cycles on quite popular Skylake. This is quite slow.
Overall speed of spin lock can be measure through such program.
Is it possible to implement spin locking without XCHG? Main concern is about speed, not about just using another instruction.
What is the fastest possible spin lock? Is it possible to make it 10 cycles instead of 30? And 5 cycles? Maybe some probabilistic spin-lock that runs fast on average?
It should be implemented in a strict way, meaning that in 100% cases it correctly protects piece of code and data. If it is probabilistic then it should run probable time but yet protects 100% correctly after each run.
Main purpose for such spin lock for me is to protect very tiny operations inside multiple threads, that run a dozen or two of cycles, hence 30 cycles delay is too much overhead. Of course one can say that I can use atomics or other lock-free techniques to implement all operations. But this techniques is not possible for all cases, and also will take to much work to strictly implement in huge code base for many classes and methods. Hence something generic like regular spin lock is also needed.
Is it possible to implement spin locking without XCHG?
Yes. For 80x86, you can lock bts or lock cmpxchg or lock xadd or ...
What is the fastest possible spin lock?
Possible interpretations of "fast" include:
a) Fast in the uncontended case. In this case it's not going to matter very much what you do because most of the possible operations (exchanging, adding, testing...) are cheap and the real costs are cache coherency (getting the cache line containing the lock into the "exclusive" state in the current CPU's cache, possibly including fetching it from RAM or other CPUs' caches) and serialization.
b) Fast in the contended case. In this case you really need a "test without lock; then test & set with lock" approach. The main problem with a simple spinloop (for the contended case) is that when multiple CPUs are spinning the cache line will be rapidly bouncing from one CPU's cache to the next and consuming a huge amount of inter-connect bandwidth for nothing. To prevent this, you'd have a loop that tests the lock state without modifying it so that the cache line can remain in all CPUs caches as "shared" at the same time while those CPUs are spinning.
But note that testing read-only to start with can hurt the un-contended case, resulting in more coherency traffic: first a share-request for the cache line which will only get you MESI S state if another core had recently unlocked, and then an RFO (Read For Ownership) when you do try to take the lock. So best practice is probably to start with an RMW, and if that fails then spin read-only with pause until you see the lock available, unless profiling your code on the system you care about shows a different choice is better.
c) Fast to exit the spinloop (after contention) when the lock is acquired. In this case CPU can speculatively execute many iterations of the loop, and when the lock becomes acquired all the CPU has to drain those "speculatively execute many iterations of the loop" which costs a little time. To prevent that you want a pause instruction to prevent many iterations of the loop/s from being speculatively executed.
d) Fast for other CPUs that don't touch the lock. For some cases (hyper-threading) the core's resources are shared between logical processors; and when one logical process is spinning it consumes resources that the other logical processor could've used to get useful work done (especially for the "spinlock speculatively executes many iterations of the loop" situation). To minimize this you need a pause in the spinloop/s (so that the spinning logical processor doesn't consume as much of the core's resources and the other logical processor in the core can get more useful work done).
e) Minimum "worst case time to acquire". With a simple lock, under contention, some CPUs or threads can be lucky and always get the lock while other CPUs/threads are very unlucky and take ages to get the lock; and the "worst case time to acquire" is theoretically infinite (a CPU can spin forever). To fix that you need a fair lock - something to ensure that only the thread that has been waiting/spinning for the longest amount of time is able to acquire the lock when it is released. Note that it's possible to design a fair lock such that each thread spins on a different cache line; which is a different way to solve the "cache line bouncing between CPUs" problem I mentioned in "b) Fast in the contended case".
f) Minimum "worst case until lock released". This has to involve the length of the worst critical section; but in some situations may also include the cost of any number IRQs, the cost of any number of task switches and the time the code isn't using any CPU. It's entirely possible to have a situation where a thread acquires the lock then the scheduler does a thread switch; then many CPUs all spin (wasting a huge amount of time) on a lock that can not be released (because the lock holder is the only one that can release the lock and it isn't even using any CPU). The way to fix/improve this is to disable the scheduler and IRQs; which is fine in kernel code, but "likely impossible for security reasons" in normal user-space code. This is also the reason why spinlocks should probably never be used in user-space (and why user-space should probably use a mutex where the thread is put in a "blocked waiting for lock" state and not given CPU time by the scheduler until/unless the thread actually can acquire the lock).
Note that making it fast for one possible interpretation of "fast" can make it slower/worse for other interpretations of "fast". For example; the speed of the uncontended case is made worse by everything else.
Example Spinlock
This example is untested, and written in (NASM syntax) assembly.
;Input
; ebx = address of lock
;Initial optimism in the hope the lock isn't contended
spinlock_acquire:
lock bts dword [ebx],0 ;Set the lowest bit and get its previous value in carry flag
;Did we actually acquire it, i.e. was it previously 0 = unlocked?
jnc .acquired ; Yes, done!
;Waiting (without modifying) to avoid "cache line bouncing"
.spin:
pause ;Reduce resource consumption
; and avoid memory order mis-speculation when the lock becomes available.
test dword [ebx],1 ;Has the lock been released?
jne .spin ; no, wait until it was released
;Try to acquire again
lock bts dword [ebx],0 ;Set the lowest bit and get its previous value in carry flag
;Did we actually acquire it?
jc .spin ; No, go back to waiting
.acquired:
Spin-unlock can be just mov dword [ebx], 0, not lock btr, because you know you own the lock and that has release semantics on x86. You could read it first to catch double-unlock bugs.
Notes:
a) lock bts is a little slower than other possibilities; but it doesn't interfere with or depend on the other 31 bits (or 63 bits) of the lock, which means that those other bits can be used for detecting programming mistakes (e.g. store 31 bits of "thread ID that currently holds lock" in them when the lock is acquired and check them when the lock is released to auto-detect "Wrong thread releasing lock" and "Lock being released when it was never acquired" bugs) and/or used for gathering performance information (e.g. set bit 1 when there's contention so that other code can scan periodically to determine which locks are rarely contended and which locks are heavily contended). Bugs with the use of locks are often extremely insidious and hard to find (unpredictable and unreproducible "Heisenbugs" that disappear as soon as you try to find them); so I have a preference for "slower with automatic bug detection".
b) This is not a fair lock, which means its not well suited to situations where contention is likely.
c) For memory; there's a compromise between memory consumption/cache misses, and false sharing. For rarely contended locks I like to put the lock in the same cache line/s as the data the lock protects, so that the acquiring the lock means that the data the lock holder wants is already in the cache (and no subsequent cache miss occurs). For heavily contended locks this causes false sharing and should be avoided by reserving the whole cache line for the lock and nothing else (e.g. by adding 60 bytes of unused padding after the 4 bytes used by the actual lock, like in C++ alignas(64) struct { std::atomic<int> lock; }; ). Of course a spinlock like this shouldn't be used for heavily contended locks so its reasonable to assume that minimizing memory consumption (and not having any padding, and not caring about false sharing) makes sense.
Main purpose for such spin lock for me is to protect very tiny operations inside multiple threads, that run a dozen or two of cycles, hence 30 cycles delay is too much overhead
In that case I'd suggest trying to replace locks with atomics, block-free algorithms, and lock-free algorithms. A trivial example is tracking statistics, where you might want to do lock inc dword [number_of_chickens] instead of acquiring a lock to increase "number_of_chickens".
Beyond that it's hard to say much - for one extreme, the program could be spending most of its time doing work without needing locks and the cost of locking may have almost no impact on overall performance (even though acquire/release is more expensive than the tiny critical sections); and for the other extreme the program could be spending most of its time acquiring and releasing locks. In other words, the cost of acquiring/releasing locks is somewhere between "irrelevant" and "major design flaw (using far too many locks and needing to redesign the entire program)".

Why mutex (std::mutex) is heavy?

Quite often on this site, other forums I read phrases like "mutex is heavy, better use something else". But I can't really find explanation why it's heavy? Also, if we are talking about standard C++11 before C++20, we basically have only std::mutex, used with locks or condition_variable, to make something thread-safe, I expected something from std be quite efficient, especially if it's the only tool(before C++20) to make some task, thread-safety in this case.
So why mutexes and particularly std::mutex is heavy? And what we as C++ developers should use instead? Something from boost?
Mutexes are considered "heavy" because they are often believed to result in a syscall, i.e. a round-trip to the kernel. A trip to the kernel takes on the order of 1,000+ CPU cycles due to context switching between privileged and unprivileged code.
In many OSes these days mutexes are optimized to not go to the kernel until a contention occurs. For example, in Linux it's implemented using a futex ("fast userspace mutex"), in Windows - SRW lock. However, once there is a contention, there will be a trip to the kernel. And once a thread needs to wait, it will be "put to sleep" by the OS and there will be a significant delay between the moment the lock is released and the time the thread is scheduled to be executed again.
If you need synchronization, sometimes looping on a simple atomic can be sufficient. If contentions are rare and short, then you can achieve better performance with a "spin lock", i.e. looping until certain condition is met. Even if you loop 10000 times, it can be faster than a single syscall.
In practice, however, a mutex will provide adequate balance between performance and convenience. So I wouldn't worry about it unless you are counting nanoseconds (as in HFT or real-time applications).
std::mutex was designed to be a light-weight portable wrapper around the operating system's native mutex facility. If your goal was to invoke those facilities, mutex introduces only a very negligible overhead over calling the OS-native API directly.
However, depending on what your use case is, using an OS facility might not be the optimal solution. For instance, to protect data from concurrent access, you could also write your own lock from lower-level primitives like std::atomic. This will however be a different kind of lock algorithm. In particular, std::mutex will put a waiting thread to sleep if the mutex can't be obtained right away, which is something you cannot do without talking to the OS. In some cases though, such a simpler locking algorithm is sufficient to get the job done. A popular example here are cases where lock contention is expected to occur only in rare cases.
That being said, such thoughts get you quite deep into expert-level concurrency programming. Unless you have specific concerns that require worrying about micro-optimizations like rolling your own locking, std::mutex is the way to go and its overhead is well withing reasonable bounds for what it's doing.
All kinds of synchronization is "heavy", and lock based is heavier than atomic.
https://github.com/markwaterman/MutexShootout
This person did a comparison between various mutex implementations. A raw windows SWR lock was the fastest option, but the most recent std mutex they compared it to was the MSVC 2017 one.
I believe that std::shared_mutex is a windows SRW lock under the hood.
Do you need every last tiny percent of performance? Then you should be profiling and swapping out mutexes. If not, std::mutex is within 10s of percents of best options, and will probably continue to be iterated on and supported.
Atomic integer operations are generally cheaper than a mutex lock, but the rules are more complex. In addition, atomic operations cause non-local slowdowns in your code, as it causes cache lines to be cleared to avoid someone else having the wrong value.
In my experience, until you get to extreme situations, you can do algorithm changes to get far more than 10%s of performance changes. And when you really, really need performance, you will probably be stripping out mutexes as much as possible anyhow; even the fastest mutex isn't fast enough for really high performance situations.
Optimization is fungible; you can spend your development effort making code faster when you identify a bottleneck. Don't write code that is prematurely pessimized; but 10s of percent hit from using std mutex over the alternative locks is not usually large enough to be that problem.
Mutexes are expensive in the same sense that copying is expensive. Meaning if you can obviate the need for a copy, that is better than having to copy. But if you need a copy, there is no way around that. The same goes for std::mutex. Not because std::mutex is inefficient, but because mutexes are inherently expensive.

Why is there no std:: equivalent to pthread_spinlock_t like there is for pthread_mutex_t & std::mutex?

I've used pthreads a fair bit for concurrent programs, mainly utilising spinlocks, mutexes, and condition variables.
I started looking into multithreading using std::thread and using std::mutex, and I noticed that there doesn't seem to be an equivalent to spinlock in pthreads.
Anyone know why this is?
there doesn't seem to be an equivalent to spinlock in pthreads.
Spinlocks are often considered a wrong tool in user-space because there is no way to disable thread preemption while the spinlock is held (unlike in kernel). So that a thread can acquire a spinlock and then get preempted, causing all other threads trying to acquire the spinlock to spin unnecessarily (and if those threads are of higher priority that may cause a deadlock (threads waiting for I/O may get a priority boost on wake up)). This reasoning also applies to all lockless data structures, unless the data structure is truly wait-free (there aren't many practically useful ones, apart from boost::spsc_queue).
In kernel, a thread that has locked a spinlock cannot be preempted or interrupted before it releases the spinlock. And that is why spinlocks are appropriate there (when RCU cannot be used).
On Linux, one can prevent preemption (not sure if completely, but there has been recent kernel changes towards such a desirable effect) by using isolated CPU cores and FIFO real-time threads pinned to those isolated cores. But that requires a deliberate kernel/machine configuration and an application designed to take advantage of that configuration. Nevertheless, people do use such a setup for business-critical applications along with lockless (but not wait-free) data structures in user-space.
On Linux, there is adaptive mutex PTHREAD_MUTEX_ADAPTIVE_NP, which spins for a limited number of iterations before blocking in the kernel (similar to InitializeCriticalSectionAndSpinCount). However, that mutex cannot be used through std::mutex interface because there is no option to customise non-portable pthread_mutexattr_t before initialising pthread_mutex_t.
One can neither enable process-sharing, robostness, error-checking or priority-inversion prevention through std::mutex interface. In practice, people write their own wrappers of pthread_mutex_t which allows to set desirable mutex attributes; along with a corresponding wrapper for condition variables. Standard locks like std::unique_lock and std::lock_guard can be reused.
IMO, there could be provisions to set desirable mutex and condition variable properties in std:: APIs, like providing a protected constructor for derived classes that would initialize that native_handle, but there aren't any. That native_handle looks like a good idea to do platform specific stuff, however, there must be a constructor for the derived class to be able to initialize it appropriately. After the mutex or condition variable is initialized that native_handle is pretty much useless. Unless the idea was only to be able to pass that native_handle to (C language) APIs that expect a pointer or reference to an initialized pthread_mutex_t.
There is another example of Boost/C++ standard not accepting semaphores on the basis that they are too much of a rope to hang oneself, and that mutex (a binary semaphore, essentially) and condition variable are more fundamental and more flexible synchronisation primitives, out of which a semaphore can be built.
From the point of view of the C++ standard those are probably right decisions because educating users to use spinlocks and semaphores correctly with all the nuances is a difficult task. Whereas advanced users can whip out a wrapper for pthread_spinlock_t with little effort.
You are right there's no spin lock implementation in the std namespace. A spin lock is a great concept but in user space is generally quite poor. OS doesn't know your process wants to spin and usually you can have worse results than using a mutex. To be noted that on several platforms there's the optimistic spinning implemented so a mutex can do a really good job. In addition adjusting the time to "pause" between each loop iteration can be not trivial and portable and a fine tuning is required. TL;DR don't use a spinlock in user space unless you are really really sure about what you are doing.
C++ Thread discussion
Article explaining how to write a spin lock with benchmark
Reply by Linus Torvalds about the above article explaining why it's a bad idea
Spin locks have two advantages:
They require much fewer storage as a std::mutex, because they do not need a queue of threads waiting for the lock. On my system, sizeof(pthread_spinlock_t) is 4, while sizeof(std::mutex) is 40.
They are much more performant than std::mutex, if the protected code region is small and the contention level is low to moderate.
On the downside, a poorly implemented spin lock can hog the CPU. For example, a tight loop with a compare-and-set assembler instructions will spam the cache system with loads and loads of unnecessary writes. But that's what we have libraries for, that they implement best practice and avoid common pitfalls. That most user implementations of spin locks are poor, is not a reason to not put spin locks into the library. Rather, it is a reason to put it there, to stop users from trying it themselves.
There is a second problem, that arises from the scheduler: If thread A acquires the lock and then gets preempted by the scheduler before it finishes executing the critical section, another thread B could spin "forever" (or at least for many milliseconds, before thread A gets scheduled again) on that lock.
Unfortunately, there is no way, how userland code can tell the kernel "please don't preempt me in this critical code section". But if we know, that under normal circumstances, the critical code section executes within 10 ns, we could at least tell thread B: "preempt yourself voluntarily, if you have been spinning for over 30 ns". This is not guaranteed to return control directly back to thread A. But it will stop the waste of CPU cycles, that otherwise would take place. And in most scenarios, where thread A and B run in the same process at the same priority, the scheduler will usually schedule thread A before thread B, if B called std::this_thread::yield().
So, I am thinking about a template spin lock class, that takes a single unsigned integer as a parameter, which is the number of memory reads in the critical section. This parameter is then used in the library to calculate the appropriate number of spins, before a yield() is performed. With a zero count, yield() would never be called.

Spinlock vs std::mutex::try_lock

What are the benefits of using a specifically designed spinlock (e.g. http://anki3d.org/spinlock) vs. code like this:
std::mutex m;
while (!m.try_lock()) {}
# do work
m.unlock();
On typical hardware, there are massive benefits:
Your naive "fake spinlock" may saturate internal CPU buses while the CPU spins, starving other physical cores including the physical core that holds the lock.
If the CPU supports hyper-threading or something similar, your naive "fake spinlock" may consume excessive execution resources on the physical core, starving another thread sharing that physical core.
Your naive "fake spinlock" probably does extraneous write operations that result in bad cache behavior. When you perform a read-modify-write operation on an x86/x86_64 CPU (like the compare/exchange that try_lock probably does), it always writes even if the value isn't changed. This write causes the cache line to be invalidated on other cores, requiring them to re-share it when another core accesses that line. This is awful if threads on other cores contend for the same lock at the same time.
Your naive "fake spinlock" interacts badly with branch prediction. When you finally do get the lock, you take the mother of all mispredicted branches right at the point where you are locking out other threads and need to execute as quickly as possible. This is like a runner being all pumped up and ready to run at the starting line but then when he hears the starting pistol, he stops to catch his breath.
Basically, that code does everything wrong that it is possible for a spinlock to do wrong. Absolutely nothing is done efficiently. Writing good synchronization primitives requires deep hardware expertise.
The main benefit of using a spinlock is that it is extremely cheap to acquire and release if the all-important precondition is true: There is little or no congestion on the lock.
If you know with sufficient certitude that there will be no contention, a spinlock will greatly outperform a naive implementation of a mutex which will go through library code doing validations that you don't necessarily need, and do a syscall. This means doing a context switch (consuming several hundreds of cycles), and abandoning the thread's time slice and causing your thread to be rescheduled. This may take an indefinite time -- even if the lock would be available almost immediately afterwards, you can still have to wait several dozen milliseconds before your thread runs again in unfavorable conditions.
If, however, the precondition of no contention does not hold, a spinlock will usually be vastly inferior as it makes no progress, but it still consumes CPU resources as if it was performing work. When blocking on a mutex, your thread does not consume CPU resources, so these can be used for a different thread to do work, or the CPU may throttle down, saving power. That's not possible with a spinlock, which is doing "active work" until it succeeds (or fails).
In the worst case, if the number of waiters is greater than the number of CPU cores, spinlocks may cause huge, dysproportionate performance impacts because the threads that are active and running are waiting on a condition that can never happen while they are running (since releasing the lock requires a different thread to run!).
On the other hand, one should expect every modern no-suck implementation of std::mutex to already include a tiny spinlock before falling back to doing a syscall. But... while it is a reasonable assumption, this is not guaranteed.
Another non-technical reason for using spinlocks in favor of a std::mutex may be license terms. License terms are a poor rationale for a design decision, but they may nevertheless be very real.
For example, the present GCC implementation is based exclusively on pthreads, which implies that "anything MinGW" using anything from the standard threads library necessarily links with winpthreads (lacking alternatives). That means you are subject to the winpthreads license, which implies you must reproduce their copyright message. For some people, that's a dealbreaker.