Spinlock vs std::mutex::try_lock - c++

What are the benefits of using a specifically designed spinlock (e.g. http://anki3d.org/spinlock) vs. code like this:
std::mutex m;
while (!m.try_lock()) {}
// do work
m.unlock();

On typical hardware, there are massive benefits:
Your naive "fake spinlock" may saturate internal CPU buses while the CPU spins, starving other physical cores, including the one running the thread that holds the lock.
If the CPU supports hyper-threading or something similar, your naive "fake spinlock" may consume excessive execution resources on the physical core, starving another thread sharing that physical core.
Your naive "fake spinlock" probably does extraneous write operations that result in bad cache behavior. When you perform a read-modify-write operation on an x86/x86_64 CPU (like the compare/exchange that try_lock probably does), it always writes even if the value isn't changed. This write causes the cache line to be invalidated on other cores, requiring them to re-share it when another core accesses that line. This is awful if threads on other cores contend for the same lock at the same time.
Your naive "fake spinlock" interacts badly with branch prediction. When you finally do get the lock, you take the mother of all mispredicted branches right at the point where you are locking out other threads and need to execute as quickly as possible. This is like a runner being all pumped up and ready to run at the starting line but then when he hears the starting pistol, he stops to catch his breath.
Basically, that code does everything wrong that it is possible for a spinlock to do wrong. Absolutely nothing is done efficiently. Writing good synchronization primitives requires deep hardware expertise.
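For contrast, here is a minimal sketch of a spinlock that avoids those problems, assuming x86/x86_64 and C++11 atomics (the class name and layout are illustrative, not the implementation behind the link above): it waits on plain reads (test-and-test-and-set) so waiters generate no write traffic, and it uses a pause instruction to release execution resources to a sibling hyper-thread.
#include <atomic>
#include <immintrin.h>   // _mm_pause
class Spinlock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        for (;;) {
            // Only attempt the write-causing exchange when the lock looks free.
            if (!locked.exchange(true, std::memory_order_acquire))
                return;
            // Wait with plain reads: no cache-line invalidation while spinning.
            while (locked.load(std::memory_order_relaxed))
                _mm_pause();   // hint that we are spinning; frees shared core resources
        }
    }
    void unlock() { locked.store(false, std::memory_order_release); }
};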

The main benefit of using a spinlock is that it is extremely cheap to acquire and release if the all-important precondition is true: There is little or no congestion on the lock.
If you know with sufficient certitude that there will be no contention, a spinlock will greatly outperform a naive implementation of a mutex, which will go through library code doing validations that you don't necessarily need, and do a syscall. That means a context switch (costing several hundred cycles), abandoning the thread's time slice, and forcing your thread to be rescheduled. This may take an indefinite time: even if the lock would be available almost immediately afterwards, under unfavorable conditions you may still have to wait several dozen milliseconds before your thread runs again.
If, however, the precondition of no contention does not hold, a spinlock will usually be vastly inferior as it makes no progress, but it still consumes CPU resources as if it was performing work. When blocking on a mutex, your thread does not consume CPU resources, so these can be used for a different thread to do work, or the CPU may throttle down, saving power. That's not possible with a spinlock, which is doing "active work" until it succeeds (or fails).
In the worst case, if the number of waiters is greater than the number of CPU cores, spinlocks may cause huge, disproportionate performance impacts, because the threads that are active and running are waiting on a condition that can never occur while they are running (since releasing the lock requires a different thread to run!).
On the other hand, one should expect every modern no-suck implementation of std::mutex to already include a tiny spinlock before falling back to doing a syscall. But... while it is a reasonable assumption, this is not guaranteed.
Another non-technical reason for using spinlocks in favor of a std::mutex may be license terms. License terms are a poor rationale for a design decision, but they may nevertheless be very real.
For example, the present GCC implementation is based exclusively on pthreads, which implies that "anything MinGW" using anything from the standard threads library necessarily links with winpthreads (lacking alternatives). That means you are subject to the winpthreads license, which implies you must reproduce their copyright message. For some people, that's a dealbreaker.

Related

In what circumstances are lock-free data structures faster than lock-based ones?

I'm currently reading the book C++ Concurrency in Action by Anthony Williams, and it contains several lock-free data structure implementations. In the introduction to the chapter on lock-free data structures, Anthony writes:
This brings us to another downside of lock-free and wait-free code: although it can increase the potential for concurrency of operations on a data structure and reduce the time an individual thread spends waiting, it may well decrease overall performance.
And indeed, I tested all the lock-free stack implementations described in the book against the lock-based implementation from one of the previous chapters, and the performance of the lock-free code is consistently lower than that of the lock-based stack.
In what circumstances are lock-free data structures faster and to be preferred?
One benefit of lock-free structures is that they do not require a context switch. However, in modern systems, uncontended locks are also context-switch free. To benefit (performance-wise) from a lock-free algorithm, several conditions have to be met:
Contention has to be high
There should be enough CPU cores so that a spinning thread can run uninterrupted (ideally, pinned to its own core)
I did a performance study years ago. When the number of threads is small, lock-free data structures and lock-based data structures are comparable. But as the number of threads rises, at some point lock-based data structures exhibit a sharp performance drop, while lock-free data structures scale up to thousands of threads.
It depends on the probability of a collision.
If a collision is very likely, then a mutex is the optimal solution.
For example: 2 threads are constantly pushing data to the end of a container.
With lock-freedom, only one thread will succeed; the other will need to retry. In this scenario blocking and waiting would be better.
But if you have a large container and the two threads access it in different areas, it's very likely that there will be no collision.
For example: one thread modifies the first element of a container and the other thread the last element.
In this case, the probability of a retry is very small, hence lock-freedom would be better here.
Other problems with lock-freedom are spin loops (heavy memory/bus usage), the overall cost of atomic variables, and certain constraints on the variables.
For example, if you have the constraint x == y which needs to be true, you cannot use separate atomic variables for x and y, because you cannot change both variables at once, while a lock would preserve the constraint.
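As a minimal sketch of that constraint (illustrative names, assuming the invariant is simply x == y): both writes happen under one lock, so no reader that also takes the lock can ever observe the invariant broken, something two independent atomics cannot guarantee.
#include <mutex>
struct Pair {
    int x = 0, y = 0;
    std::mutex m;
    void set(int v) {
        std::lock_guard<std::mutex> lock(m);
        x = v;   // both writes happen under the lock,
        y = v;   // so no locking reader ever sees x != y
    }
    bool invariant_holds() {
        std::lock_guard<std::mutex> lock(m);
        return x == y;   // always true; two separate atomics could not guarantee this
    }
};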
The only way to know which is better is to profile each. The result will change drastically from use case to use case, from one CPU to another, from one architecture to another, from one year to another. What might be best today might not be best tomorrow. So always measure and keep measuring.
That said, let me give you some of my private thoughts on this:
First: if there is no contention, it shouldn't matter what you do. The no-collision case should always be fast; if it's not, you need a different implementation tuned to the no-contention case. One implementation might use fewer or faster machine instructions than the other and win, but the difference should be minimal. Test, but I expect near-identical results.
Next, let's look at cases with (high) contention:
Again, you might need an implementation tuned to the contended case. One lock mechanism isn't like another, and the same goes for lock-free methods.
threads <= cores
It's reasonable to assume all threads will be running and doing work. There might be small pauses where a thread gets interrupted, but that's really the exception. This obviously only holds if you have just one application doing this; the threads of all CPU-heavy applications count towards this scenario.
Now, with a lock, one thread will get the lock and work. Other threads can wait for the lock to become available and react the instant it becomes free. They can busy-loop or, for longer durations, sleep; it doesn't matter much. The lock limits you to one thread doing work, and you get that with barely any CPU cycles wasted when the lock changes hands.
On the other hand, lock-free data structures all fall into some try-and-repeat loop: they do work, at some crucial point try to commit that work, and if there was contention they fail and try again, often repeating a lot of expensive operations. The more contention there is, the more wasted work there is. Worse, all that traffic on the caches and memory slows down the thread that actually manages to commit work in the end. So not only are you not getting ahead faster, you are slowing down progress.
I would always go with locks for any workload whose cost exceeds the difference between the lock instruction and the CAS (or similar) instruction a lock-free implementation needs. It really doesn't take much work to get past that point, leaving only trivial cases for the lock-free approach. The built-in atomic types are such a case: CPUs often have opcodes to perform those atomic operations lock-free in hardware, in a single, (relatively) fast instruction. In the end the lock uses such an instruction itself and can never beat such a trivial case.
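For illustration, here is a sketch of the try-and-repeat loop described above, in the shape of a classic Treiber-style lock-free stack push (simplified; pop and memory reclamation are omitted):
#include <atomic>
#include <utility>
template <typename T>
class LockFreeStack {
    struct Node { T value; Node* next; };
    std::atomic<Node*> head{nullptr};
public:
    void push(T value) {
        Node* node = new Node{std::move(value), head.load(std::memory_order_relaxed)};
        // Commit point: if another thread changed head since we read it, the CAS
        // fails, node->next is refreshed with the current head, and we try again.
        while (!head.compare_exchange_weak(node->next, node,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {
        }
    }
};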
threads >> cores
If you have many more threads than cores, only a fraction of them can run at any one time, and it is likely that a thread holding a lock will be descheduled. All other threads needing the lock then also have to go to sleep until the lock-holding thread wakes up again. This is probably the worst case for locking data structures: nobody gets to do any work.
Now, there are lock implementations (with help from the operating system) where a thread trying to acquire a lock will cause the lock-holding thread to run until it releases the lock. In such systems the waste is reduced to the context switches between the threads.
There is also a problem with locks called the thundering herd problem. If you have 100 threads waiting on a lock and the lock gets freed, then depending on your lock implementation and OS support, 100 threads will wake up. One will get the lock and 99 will waste time trying to acquire the lock, fail and go back to sleep. You really don't want a lock implementation suffering from thundering herds.
Lock-free data structures begin to shine here. If one thread is descheduled, the other threads continue their work and succeed in committing their results. The descheduled thread will wake up again at some point, fail to commit its work, and retry. The waste is limited to the work that one descheduled thread did.
cores < threads < 2 * cores
There is a grey zone when the number of threads is near the number of cores. The chance that the lock-holding thread is running remains high, but this is a very chaotic region, and results for which method is better are rather random there. My conclusion: if you don't have tons of threads, try really hard to stay at or below the core count.
Some more thoughts:
Sometimes the work, once started, needs to be done in a specific order. If one thread is descheduled you can't just skip it. You see this in some data structures where the code detects a conflict and one thread actually finishes the work a different thread started before it can commit its own results. Now this is really great if the other thread was descheduled. But if it's actually running, it's just wasteful to do the work twice. So data structures with this scheme really aim at the threads >> cores scenario above.
With the amount of mobile computing done today it becomes more and more important to consider the power usage of your code. There are many ways you can optimize your code to change power usage, but really the only way for your code to use less power is to sleep. Something you hear more and more is "race to sleep": if you can make your code run faster so it can sleep earlier, then you save power. But the emphasis here is on sleeping earlier, or maybe I should say sleeping more. If you have 2 threads running 75% of the time each, they might solve your problem in 75 seconds. But if you can solve the same problem with 2 threads running 50% of the time each, alternating with a lock, then they take 100 seconds. The first solution also uses 150% CPU power, for a shorter time, true, but 75 * 150% = 112.5 > 100 * 100%. Power-wise the slower solution wins. Locks let you sleep, while lock-free trades power for speed.
Keep that in mind and balance your need for speed with the need to recharge your phone or laptop.
The mutex design will very rarely, if ever, outperform the lockless one.
So the follow up question is why would anyone ever use a mutex rather than a lockless design?
The problem is that lockless designs can be hard to do and require a significant amount of design work to be reliable, and they can be even harder to debug; a mutex is quite trivial in comparison. For this reason, people generally prefer to use mutexes first, and then migrate to lock-free later once contention has been proven to be a bottleneck.
I think one thing missing from these answers is the locking period. If your locking period is very short, i.e. after acquiring the lock you perform only a very short task (like incrementing a variable), then using a lock-based data structure brings in unnecessary context switching, CPU scheduling, etc. In this case lock-free is a good option, as the thread would be spinning for only a very short time.
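As a sketch of that "very short locking period" case (illustrative names): the atomic increment is a single read-modify-write that keeps the thread on the CPU for a handful of cycles, whereas the mutex version may block and be rescheduled under contention.
#include <atomic>
#include <mutex>
std::atomic<long> hits{0};          // lock-free: one atomic read-modify-write
long hits_locked = 0;
std::mutex hits_mutex;
void record_hit_atomic() { hits.fetch_add(1, std::memory_order_relaxed); }
void record_hit_locked() {
    std::lock_guard<std::mutex> lock(hits_mutex);   // may block and context-switch under contention
    ++hits_locked;
}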

Why is std::mutex so slow on OSX?

I have the following benchmark: https://gist.github.com/leifwalsh/10010580
Essentially it spins up k threads and then each thread does about 16 million / k lock/increment/unlock cycles, using a spinlock and a std::mutex. On OSX, the std::mutex is devastatingly slower than the spinlock when contended, whereas on Linux it's competitive or a bit faster.
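Roughly, each thread runs a loop of the following shape (a paraphrase of the description above, not the exact code in the gist):
#include <functional>
#include <mutex>
#include <thread>
#include <vector>
void worker(std::mutex& m, long& val, long iterations) {
    for (long i = 0; i < iterations; ++i) {
        m.lock();     // or spinlock.lock() for the spinlock runs
        ++val;
        m.unlock();
    }
}
int main() {
    const int k = 4;                   // number of threads
    const long total = 16'000'000;     // total lock/increment/unlock cycles
    std::mutex m;
    long val = 0;
    std::vector<std::thread> threads;
    for (int i = 0; i < k; ++i)
        threads.emplace_back(worker, std::ref(m), std::ref(val), total / k);
    for (auto& t : threads) t.join();
}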
OSX:
spinlock 1: 334ms
spinlock 2: 3537ms
spinlock 3: 4815ms
spinlock 4: 5653ms
std::mutex 1: 813ms
std::mutex 2: 38464ms
std::mutex 3: 44254ms
std::mutex 4: 47418ms
Linux:
spinlock 1: 305ms
spinlock 2: 1590ms
spinlock 3: 1820ms
spinlock 4: 2300ms
std::mutex 1: 377ms
std::mutex 2: 1124ms
std::mutex 3: 1739ms
std::mutex 4: 2668ms
The processors are different, but not that different (OSX is Intel(R) Core(TM) i7-2677M CPU @ 1.80GHz, Linux is Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz), this seems like a library or kernel problem. Anyone know the source of the slowness?
To clarify my question, I understand that "there are different mutex implementations that optimize for different things and this isn't a problem, it's expected". My question is: what are the actual differences in implementation that cause this? Or, if it's a hardware issue (maybe the cache is just a lot slower on the macbook), that's acceptable too.
You're just measuring the library's choice of trading off throughput for fairness. The benchmark is heavily artificial and penalizes any attempt to provide any fairness at all.
The implementation can do two things. It can let the same thread get the mutex twice in a row, or it can change which thread gets the mutex. This benchmark heavily penalizes a change in threads because the context switch takes time and because ping-ponging the mutex and val from cache to cache takes time.
Most likely, this is just showing the different trade-offs that implementations have to make. It heavily rewards implementations that prefer to give the mutex back to the thread that last held it. The benchmark even rewards implementations that waste CPU to do that! It even rewards implementations that waste CPU to avoid context switches, even when there's other useful work the CPU could do! It also doesn't penalize the implementation for inter-core traffic which can slow down other unrelated threads.
Also, people who implement mutexes generally presume that performance in the uncontended case is more important than performance in the contended case. There are numerous tradeoffs you can make between these cases, such as presuming that there might be a thread waiting or specifically checking if there is. The benchmark tests only (or at least, almost only) the case that is typically traded off in favor of the case presumed more common.
Bluntly, this is a senseless benchmark that is incapable of identifying a problem.
The specific explanation is almost certainly that the Linux implementation is a spinlock/futex hybrid while the OSX implementation is conventional, equivalent to locking a kernel object. The spinlock portion of the Linux implementation favors allowing the same thread that just released the mutex to lock it again, which your benchmark heavily rewards.
You need to use the same STL implementation on both systems. This could either be an issue in libc++ or in pthread_mutex_*().
What the other posters are saying about mutex locks being conventional on OS X is a complete lie, however. Yes, Mach locksets and semaphores require system calls for every operation. But unless you explicitly use the lockset or semaphore Mach APIs, these ARE NOT USED in your application.
OS X's libpthread uses __psynch_* BSD system calls, which roughly correspond to Linux futexes. In the uncontended case, libpthread makes no system call to acquire the mutex. Only an instruction such as cmpxchg is used.
Source: libpthread source code and my own knowledge (I'm the developer of Darling).
David Schwartz is basically correct, except for the throughput/adaptiveness comment. It is actually much faster on Linux because it uses a futex and the overhead of a contended call is much smaller. What this means is that in the uncontended case, it simply does a function call and an atomic operation, and returns. If most of your locks are uncontended (which is the typical behaviour you'll see in many real-world programs), acquiring a lock is basically free. Even in the contended case, it's basically a function call, syscall + atomic operation + adding one thread to a list (the syscall being the expensive part of the operation). If the mutex is released during the syscall, then the function returns right away without enqueuing onto a wait-list.
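For illustration, a minimal sketch of that futex fast-path idea (Linux-specific, patterned after the scheme in Ulrich Drepper's "Futexes Are Tricky"; this is not the glibc implementation). State 0 means unlocked, 1 locked with no waiters, 2 locked with possible waiters:
#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
class FutexMutex {
    std::atomic<int> state{0};
    static long futex(std::atomic<int>* addr, int op, int val) {
        // Assumes std::atomic<int> is layout-compatible with int (true on Linux/glibc).
        return syscall(SYS_futex, reinterpret_cast<int*>(addr), op, val, nullptr, nullptr, 0);
    }
public:
    void lock() {
        int c = 0;
        // Uncontended fast path: one atomic compare-exchange, no syscall.
        if (state.compare_exchange_strong(c, 1, std::memory_order_acquire))
            return;
        // Contended path: mark "locked with waiters" and sleep in the kernel.
        if (c != 2)
            c = state.exchange(2, std::memory_order_acquire);
        while (c != 0) {
            futex(&state, FUTEX_WAIT, 2);   // returns immediately if state is no longer 2
            c = state.exchange(2, std::memory_order_acquire);
        }
    }
    void unlock() {
        // Uncontended fast path: no waiters were recorded, so no syscall is needed.
        if (state.exchange(0, std::memory_order_release) == 2)
            futex(&state, FUTEX_WAKE, 1);
    }
};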
On OSX, there is no futex. Acquiring a mutex always requires talking to the kernel. Moreover, OSX is a micro-kernel hybrid. That means that to talk to the kernel, you need to send it a message. This means you do data marshalling, a syscall, and copying the data to a separate buffer. Then at some point the kernel comes by, unmarshalls the data, acquires the lock and sends you back a message. Thus in the uncontended case, it's much heavier-weight. In the contended case, it depends on how long you're blocked waiting for the lock: the longer you wait, the cheaper your lock operation becomes when amortized across total runtime.
On OSX, there is a much faster mechanism called dispatch queues, but it requires re-thinking how your program works. In addition to using lock-free synchronization (i.e. uncontended cases never jump to the kernel), dispatch queues also do thread pooling and job scheduling. Additionally, they provide asynchronous dispatch, which lets you schedule a job without needing to wait for a lock.

Is synchronizing with `std::mutex` slower than with `std::atomic(memory_order_seq_cst)`?

The main reason for using atomics over mutexes is that mutexes are expensive, but with the default memory order for atomics being memory_order_seq_cst, isn't this just as expensive?
Question: Can a concurrent program using locks be as fast as a concurrent lock-free program?
If so, it may not be worth the effort unless I want to use memory_order_acq_rel for atomics.
Edit:
I may be missing something, but lock-based can't be faster than lock-free, because each lock also has to be a full memory barrier. With lock-free, it's possible to use techniques that are less restrictive than full memory barriers.
So back to my question: is lock-free any faster than lock-based in the new C++11 standard with the default memory order?
Is "lock-free >= lock-based when measured in performance" true? Let's assume 2 hardware threads.
Edit 2:
My question is not about progress guarantees, and maybe I'm using "lock-free" out of context.
Basically when you have 2 threads with shared memory, and the only guarantee you need is that if one thread is writing then the other thread can't read or write, my assumption is that a simple atomic compare_and_swap operation would be much faster than locking a mutex.
Because if one thread never even touches the shared memory, you will end up locking and unlocking over and over for no reason, but with atomic operations you only spend a few CPU cycles each time.
In regards to the comments, a spin-lock vs a mutex-lock is very different when there is very little contention.
Lock-free programming is about progress guarantees: from strongest to weakest, those are wait-free, lock-free, obstruction-free, and blocking.
A guarantee comes at a price: the more guarantees you want, the more you pay. Generally, a blocking algorithm or data structure (with a mutex, say) has the greatest liberties, and thus is potentially the fastest. A wait-free algorithm at the other extreme must use atomic operations at every step, which may be much slower.
Obtaining a lock is actually rather cheap, so you should never worry about that without a deep understanding of the subject. Moreover, blocking algorithms with mutexes are much easier to read, write and reason about. By contrast, even the simplest lock-free data structures are the result of long, focused research, each of them worth one or more PhDs.
In a nutshell, lock-free or wait-free algorithms trade mean latency and throughput for a better worst-case latency. Everything is slower, but nothing is ever very slow. This is a very special characteristic that is only useful in very specific situations (like real-time systems).
A lock tends to require more operations than a simple atomic operation does. In the simplest cases, memory_order_seq_cst will be about twice as fast as locking, because locking tends to require, at minimum, two atomic operations in its implementation (one to lock, one to unlock). In many cases it takes even more than that. However, once you start leveraging the memory orders, it can be much faster because you are willing to accept less synchronization.
Also, you'll often see "locking algorithms are always as fast as lock-free algorithms." This is somewhat true. The basic idea is that if the fastest algorithm happens to be lock-free, then the fastest algorithm without the lock-free guarantee is ALSO the same algorithm! However, if the fastest algorithm requires locks, then those demanding lock-free guarantees have to go find a slower algorithm.
In general, you will see lock-free algorithms in a few low-level algorithms, where the performance of leveraging specialized opcodes helps. In almost all other code, locking gives more than satisfactory performance, and is much easier to read.
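As a small sketch of "leveraging the memory orders" mentioned above: the same flag publication with the default seq_cst store and with a release store, the latter typically being cheaper because it promises less global ordering (on x86 a seq_cst store needs a full fence, a release store does not).
#include <atomic>
std::atomic<bool> ready{false};
int payload = 0;
void publish_seq_cst() { payload = 42; ready.store(true); }                              // memory_order_seq_cst by default
void publish_release() { payload = 42; ready.store(true, std::memory_order_release); }   // weaker, usually cheaper
int consume() {
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    return payload;   // guaranteed to read 42 with either publisher
}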
Question: Can a concurrent program using locks be as fast as a concurrent lock-free program?
It can be faster: a lock-free algorithm must keep the global state consistent at all times and do its calculations without knowing whether they will be productive, since the state may have changed by the time the calculation is done, making it irrelevant and wasting the CPU cycles spent on it.
The lock-free strategy defers serialization to the end of the process, when the calculation is done. In a pathological case many threads can each do the work, yet only one attempt will be productive and the others will retry.
Lock-free can lead to starvation of some threads, whatever their priority, and there is no way to avoid that. (Although it is unlikely for a thread to keep retrying for very long unless contention is extreme.)
On the other hand, lock-based algorithms ("serialized calculation followed by a series of side effects") will not start until they know they cannot be prevented by other actors from operating on that specific locked resource (the guarantee is provided by the mutex). Note that they might still be prevented from finishing by the need to access another resource: taking multiple locks can lead to deadlock in a badly designed program.
Note that this deadlock issue isn't in the scope of lock-free code, which usually can't even act on multiple entities: it can't do an atomic commit based on two unrelated objects (1).
So the impossibility of deadlock in lock-free code is also a sign of its weakness: not being able to deadlock reflects a limit of the tool. A system that can only hold one lock at a time wouldn't be able to deadlock either.
The scope of lock-free algorithms is minuscule compared to the scope of lock-based algorithms. For a lot of problems, lock-free doesn't even make sense.
A lock-based algorithm is polite: the threads have to wait in line before doing what they need to do. That is maximally efficient in terms of computation steps by each thread. But having to queue threads in a wait list is inefficient: they often can't use the end of their time slice, like someone trying to do serious work while being interrupted by the phone all the time; his concentration is gone and he can never reach maximum efficiency because his working time is cut into small pieces.
(1) You would at least need a double CAS for that, that is, an operation atomic on two arbitrary addresses (not a double-word CAS, which is just a CAS on more bits and can trivially be implemented up to the natural CPU memory-access arbitration unit, the cache line).

What is a scalable lock?

What is a scalable lock? And how is it different from a non-scalable lock? I first saw this term in the context of the TBB rw-lock, and couldn't decide which to use.
In addition, is there any rw-lock that prioritizes readers over writers?
There is no formal definition of the term "scalable lock" or "non-scalable lock". What it's meant to imply is that some locking algorithms, techniques or implementations perform reasonably well even when there is a lot of contention for the lock, and some do not.
Sometimes the problem is algorithmic. A naive implementation of priority inheritance, for example, may require O(n) work to release a lock, where n is the number of waiting threads. That implies O(n^2) work for every waiting thread to be serviced.
Sometimes the problem is to do with hardware. Simple spinlocks (e.g. implementations for which the lock cache line is shared and acquirers don't back off) don't scale on SMP hardware with a single bus interconnect, because writing to a cache line requires that the CPU acquires the cache line, and the CPU interconnect is a single point of contention. If there are n CPUs trying to acquire the same lock at the same time, you may end up with O(n) bus traffic to acquire the lock. Again, this means O(n^2) time for all n CPUs to be satisfied.
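For contrast, here is a sketch of one technique that scales better than a naive shared-cache-line spinlock: a ticket lock with back-off proportional to the waiter's distance in the queue (illustrative only; queue locks such as MCS scale better still).
#include <atomic>
#include <thread>
class TicketLock {
    std::atomic<unsigned> next_ticket{0};
    std::atomic<unsigned> now_serving{0};
public:
    void lock() {
        const unsigned ticket = next_ticket.fetch_add(1, std::memory_order_relaxed);
        for (;;) {
            const unsigned serving = now_serving.load(std::memory_order_acquire);
            if (serving == ticket)
                return;
            // Back off proportionally to our place in line, so waiters don't all
            // hammer the same cache line at once.
            for (unsigned i = 0; i < (ticket - serving); ++i)
                std::this_thread::yield();
        }
    }
    void unlock() {
        // Only the holder writes now_serving, so a plain load/store pair is fine.
        now_serving.store(now_serving.load(std::memory_order_relaxed) + 1,
                          std::memory_order_release);
    }
};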
In general, you should avoid non-scalable locks unless two conditions are met:
Contention is light.
The critical section is very short.
You really have to know that the two conditions are met. A critical section may be short in terms of lines of code, but not be short in wall time. If in doubt, use a scalable lock, and later fix any which have been measured to cause performance problems.
As for your last question, I'm not aware of an off-the-shelf read-write lock which favours readers. Actually, most APIs don't specify policy, including pthreads (annoyingly).
My first comment is that you probably don't want it. If you have high contention, favouring one over the other kills throughput, and if you don't have high contention, it won't make a difference. About the only reason I can think of not to use a rw lock with a completely fair policy is if you have thread priorities that must be respected, so you want the highest-priority thread to be preferred.
But if you must, you could always roll your own. All you need is a couple of flags (one for "readers can go now" and one for "a writer can go now"), condition variables protecting the flags, a single mutex protecting the condition variables, and a couple of counters indicating how many readers and writers are waiting. That should be all you need; implementing this should be quite instructive.
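Here is a sketch of that roll-your-own reader-preferring lock (a simplified variant of the recipe above, tracking active readers and an active writer rather than the exact flags described; all names are made up):
#include <condition_variable>
#include <mutex>
class ReaderPreferringRWLock {
    std::mutex m;
    std::condition_variable readers_cv;
    std::condition_variable writer_cv;
    int active_readers = 0;
    bool writer_active = false;
public:
    void lock_shared() {
        std::unique_lock<std::mutex> lk(m);
        // Readers only wait for an active writer, never for waiting writers,
        // which is what makes this reader-preferring (and lets readers starve writers).
        readers_cv.wait(lk, [&] { return !writer_active; });
        ++active_readers;
    }
    void unlock_shared() {
        std::unique_lock<std::mutex> lk(m);
        if (--active_readers == 0)
            writer_cv.notify_one();
    }
    void lock() {
        std::unique_lock<std::mutex> lk(m);
        writer_cv.wait(lk, [&] { return !writer_active && active_readers == 0; });
        writer_active = true;
    }
    void unlock() {
        std::unique_lock<std::mutex> lk(m);
        writer_active = false;
        readers_cv.notify_all();   // wake all waiting readers first
        writer_cv.notify_one();    // and one writer, in case no readers are waiting
    }
};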