Implementing spin-lock without XCHG?


A C++ spin lock can easily be implemented using std::atomic_flag. Roughly (without special features), it can be coded as follows:
std::atomic_flag f = ATOMIC_FLAG_INIT;
while (f.test_and_set(std::memory_order_acquire)); // Acquire lock
// Here do some lock-protected work .....
f.clear(std::memory_order_release); // Release lock
The generated assembly (viewable online) shows that acquiring is implemented atomically through the XCHG instruction.
According to uops.info, XCHG may take up to 30 CPU cycles on the quite popular Skylake. This is quite slow.
The overall speed of a spin lock can be measured with a benchmark program.
Is it possible to implement spin locking without XCHG? Main concern is about speed, not about just using another instruction.
What is the fastest possible spin lock? Is it possible to make it 10 cycles instead of 30? And 5 cycles? Maybe some probabilistic spin-lock that runs fast on average?
It should be implemented in a strict way, meaning that in 100% of cases it correctly protects the piece of code and data. If it is probabilistic, then its running time may vary, but it must still protect 100% correctly on every run.
Main purpose for such spin lock for me is to protect very tiny operations inside multiple threads, that run a dozen or two of cycles, hence 30 cycles delay is too much overhead. Of course one can say that I can use atomics or other lock-free techniques to implement all operations. But these techniques are not possible in all cases, and they would also take too much work to implement strictly in a huge code base for many classes and methods. Hence something generic like a regular spin lock is also needed.

Is it possible to implement spin locking without XCHG?
Yes. For 80x86, you can lock bts or lock cmpxchg or lock xadd or ...
What is the fastest possible spin lock?
Possible interpretations of "fast" include:
a) Fast in the uncontended case. In this case it's not going to matter very much what you do because most of the possible operations (exchanging, adding, testing...) are cheap and the real costs are cache coherency (getting the cache line containing the lock into the "exclusive" state in the current CPU's cache, possibly including fetching it from RAM or other CPUs' caches) and serialization.
b) Fast in the contended case. In this case you really need a "test without lock; then test & set with lock" approach. The main problem with a simple spinloop (for the contended case) is that when multiple CPUs are spinning the cache line will be rapidly bouncing from one CPU's cache to the next and consuming a huge amount of inter-connect bandwidth for nothing. To prevent this, you'd have a loop that tests the lock state without modifying it so that the cache line can remain in all CPUs caches as "shared" at the same time while those CPUs are spinning.
But note that testing read-only to start with can hurt the un-contended case, resulting in more coherency traffic: first a share-request for the cache line which will only get you MESI S state if another core had recently unlocked, and then an RFO (Read For Ownership) when you do try to take the lock. So best practice is probably to start with an RMW, and if that fails then spin read-only with pause until you see the lock available, unless profiling your code on the system you care about shows a different choice is better.
c) Fast to exit the spinloop (after contention) when the lock is acquired. In this case the CPU can speculatively execute many iterations of the loop, and when the lock finally becomes available the CPU has to discard all of those speculatively executed iterations, which costs a little time. To prevent that, you want a pause instruction in the loop so that many iterations are not speculatively executed.
d) Fast for other CPUs that don't touch the lock. For some cases (hyper-threading) the core's resources are shared between logical processors; and when one logical processor is spinning it consumes resources that the other logical processor could've used to get useful work done (especially for the "spinlock speculatively executes many iterations of the loop" situation). To minimize this you need a pause in the spinloop/s (so that the spinning logical processor doesn't consume as much of the core's resources and the other logical processor in the core can get more useful work done).
e) Minimum "worst case time to acquire". With a simple lock, under contention, some CPUs or threads can be lucky and always get the lock while other CPUs/threads are very unlucky and take ages to get the lock; and the "worst case time to acquire" is theoretically infinite (a CPU can spin forever). To fix that you need a fair lock - something to ensure that only the thread that has been waiting/spinning for the longest amount of time is able to acquire the lock when it is released. Note that it's possible to design a fair lock such that each thread spins on a different cache line; which is a different way to solve the "cache line bouncing between CPUs" problem I mentioned in "b) Fast in the contended case".
f) Minimum "worst case until lock released". This has to involve the length of the worst critical section; but in some situations it may also include the cost of any number of IRQs, the cost of any number of task switches, and the time the code isn't using any CPU. It's entirely possible to have a situation where a thread acquires the lock and then the scheduler does a thread switch; then many CPUs all spin (wasting a huge amount of time) on a lock that cannot be released (because the lock holder is the only one that can release the lock and it isn't even using any CPU). The way to fix/improve this is to disable the scheduler and IRQs; which is fine in kernel code, but "likely impossible for security reasons" in normal user-space code. This is also the reason why spinlocks should probably never be used in user-space (and why user-space should probably use a mutex where the thread is put in a "blocked waiting for lock" state and not given CPU time by the scheduler until/unless the thread actually can acquire the lock).
Note that making it fast for one possible interpretation of "fast" can make it slower/worse for other interpretations of "fast". For example; the speed of the uncontended case is made worse by everything else.
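As a rough C++ illustration of points b) to d), here is a minimal sketch of a "try the RMW first, then spin read-only with pause" lock. The names TtasSpinLock, cpu_relax and run_ttas_demo are made up for this example, and the std::this_thread::yield() fallback on non-x86 targets is only a stand-in for pause. It is untested on real workloads, so treat it as a starting point for profiling rather than a finished lock.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
inline void cpu_relax() { _mm_pause(); }                // x86 PAUSE instruction
#else
inline void cpu_relax() { std::this_thread::yield(); }  // assumed stand-in elsewhere
#endif

// Hypothetical name; not from the original answer.
class TtasSpinLock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        for (;;) {
            // Optimistic atomic RMW first: cheap when the lock is uncontended.
            if (!locked_.exchange(true, std::memory_order_acquire)) return;
            // Contended: wait with plain loads so the cache line can stay shared.
            while (locked_.load(std::memory_order_relaxed)) cpu_relax();
        }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};

// Demo helper (hypothetical): several threads increment a plain counter.
long run_ttas_demo(int threads, int iterations) {
    TtasSpinLock lock;
    long counter = 0;  // protected by `lock`
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    assert(counter == static_cast<long>(threads) * iterations);
    return counter;
}
```

Calling run_ttas_demo(4, 20000) from main should return 80000 if the lock works. Whether this beats a plain test_and_set loop on a given machine is something only measurement can tell.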
Example Spinlock
This example is untested, and written in (NASM syntax) assembly.
;Input
; ebx = address of lock
;Initial optimism in the hope the lock isn't contended
spinlock_acquire:
lock bts dword [ebx],0 ;Set the lowest bit and get its previous value in carry flag
;Did we actually acquire it, i.e. was it previously 0 = unlocked?
jnc .acquired ; Yes, done!
;Waiting (without modifying) to avoid "cache line bouncing"
.spin:
pause ;Reduce resource consumption
; and avoid memory order mis-speculation when the lock becomes available.
test dword [ebx],1 ;Has the lock been released?
jne .spin ; no, wait until it was released
;Try to acquire again
lock bts dword [ebx],0 ;Set the lowest bit and get its previous value in carry flag
;Did we actually acquire it?
jc .spin ; No, go back to waiting
.acquired:
Spin-unlock can be just mov dword [ebx], 0, not lock btr, because you know you own the lock and that has release semantics on x86. You could read it first to catch double-unlock bugs.
Notes:
a) lock bts is a little slower than other possibilities; but it doesn't interfere with or depend on the other 31 bits (or 63 bits) of the lock, which means that those other bits can be used for detecting programming mistakes (e.g. store 31 bits of "thread ID that currently holds lock" in them when the lock is acquired and check them when the lock is released to auto-detect "Wrong thread releasing lock" and "Lock being released when it was never acquired" bugs) and/or used for gathering performance information (e.g. set bit 1 when there's contention so that other code can scan periodically to determine which locks are rarely contended and which locks are heavily contended). Bugs with the use of locks are often extremely insidious and hard to find (unpredictable and unreproducible "Heisenbugs" that disappear as soon as you try to find them); so I have a preference for "slower with automatic bug detection".
b) This is not a fair lock, which means it's not well suited to situations where contention is likely.
c) For memory; there's a compromise between memory consumption/cache misses, and false sharing. For rarely contended locks I like to put the lock in the same cache line/s as the data the lock protects, so that acquiring the lock means that the data the lock holder wants is already in the cache (and no subsequent cache miss occurs). For heavily contended locks this causes false sharing and should be avoided by reserving the whole cache line for the lock and nothing else (e.g. by adding 60 bytes of unused padding after the 4 bytes used by the actual lock, like in C++ struct alignas(64) PaddedLock { std::atomic<int> lock; }; ). Of course a spinlock like this shouldn't be used for heavily contended locks so it's reasonable to assume that minimizing memory consumption (and not having any padding, and not caring about false sharing) makes sense.
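For comparison with note b), here is a minimal sketch of one common fair design, a ticket lock, padded to its own cache line as in note c). The answer does not name a specific fair lock, so the choice of a ticket lock, the names TicketLock and run_ticket_demo, and the 64-byte line size are all assumptions of this example. Note that in this simple version every waiter still spins on the same cache line, so it only addresses fairness.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical name; 64 is an assumed cache-line size.
struct alignas(64) TicketLock {
    std::atomic<unsigned> next_ticket{0};  // next ticket to hand out
    std::atomic<unsigned> now_serving{0};  // ticket currently allowed in

    void lock() {
        // Take a ticket; tickets are served strictly in order (FIFO).
        const unsigned my = next_ticket.fetch_add(1, std::memory_order_relaxed);
        while (now_serving.load(std::memory_order_acquire) != my)
            std::this_thread::yield();  // stand-in for pause, keeps the sketch portable
    }
    void unlock() {
        // Only the lock holder writes now_serving, so load + store is enough.
        const unsigned cur = now_serving.load(std::memory_order_relaxed);
        now_serving.store(cur + 1, std::memory_order_release);
    }
};

// Demo helper (hypothetical): several threads increment a plain counter.
long run_ticket_demo(int threads, int iterations) {
    TicketLock lock;
    long counter = 0;  // protected by `lock`
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    assert(counter == static_cast<long>(threads) * iterations);
    return counter;
}
```

The price of fairness is that a descheduled waiter holds up everyone behind it, which ties back to interpretation f) above.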
Main purpose for such spin lock for me is to protect very tiny operations inside multiple threads, that run a dozen or two of cycles, hence 30 cycles delay is too much overhead
In that case I'd suggest trying to replace locks with atomics, block-free algorithms, and lock-free algorithms. A trivial example is tracking statistics, where you might want to do lock inc dword [number_of_chickens] instead of acquiring a lock to increase "number_of_chickens".
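In C++ the same statistics counter could look roughly like this. It is a minimal sketch: count_chicken and run_chicken_demo are made-up names, and relaxed ordering is assumed to be enough because the counter is only a statistic and does not publish other data. On x86 compilers typically turn the fetch_add into a single lock-prefixed instruction.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

std::atomic<unsigned> number_of_chickens{0};

// One atomic increment instead of lock / increment / unlock.
inline void count_chicken() {
    number_of_chickens.fetch_add(1, std::memory_order_relaxed);
}

// Demo helper (hypothetical): every thread counts `per_thread` chickens.
unsigned run_chicken_demo(int threads, int per_thread) {
    number_of_chickens.store(0);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([per_thread] {
            for (int i = 0; i < per_thread; ++i) count_chicken();
        });
    }
    for (auto& th : pool) th.join();
    return number_of_chickens.load();
}
```

No increments are lost, so run_chicken_demo(4, 10000) returns 40000.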
Beyond that it's hard to say much - for one extreme, the program could be spending most of its time doing work without needing locks and the cost of locking may have almost no impact on overall performance (even though acquire/release is more expensive than the tiny critical sections); and for the other extreme the program could be spending most of its time acquiring and releasing locks. In other words, the cost of acquiring/releasing locks is somewhere between "irrelevant" and "major design flaw (using far too many locks and needing to redesign the entire program)".

Related

What's a good alternative to PAUSE for use in the implementation of a spinlock?

I am working on making a fiber-based job system for my latest project which will depend on the use of spinlocks for proper functionality. I had intended to use the PAUSE instruction as that seems to be the gold-standard for the waiting portion of your average modern spinlock. However, on doing some research into implementing my own fibers, I came across the fact that the cycle duration of PAUSE on recent machines has increased to an adverse extent.
I found this out from here, where it says, quoting the Intel Optimization Manual, "The latency of PAUSE instruction in prior generation microarchitecture is about 10 cycles, whereas on Skylake microarchitecture it has been extended to as many as 140 cycles," and
"As the PAUSE latency has been increased significantly, workloads that are sensitive to PAUSE latency will suffer some performance loss."
As a result, I'd like to find an alternative to the PAUSE instruction for use in my own spinlocks. I've read that in the past, PAUSE has been preferred due to it somehow saving on energy usage which I'm guessing is due to the other often quoted factoid that using PAUSE somehow signals to the processor that it's in the midst of a spinlock. I'm also guessing that this is on the other end of the spectrum power-wise to doing some dummy calculation for the desired number of cycles.
Given this, is there a best-case solution that comes close to PAUSE's apparent energy efficiency while having the flexibility and low-cycle count as a repeat 'throwout' calculation?

I'm guessing is due to the other often quoted factoid that using PAUSE somehow signals to the processor that it's in the midst of a spinlock.
Yes, pause lets the CPU avoid memory-order mis-speculation when leaving a read-only spin-wait loop, which is how you should spin to avoid creating contention for the thread trying to unlock that location. (Don't spam xchg). See also:
How does x86 pause instruction work in spinlock *and* can it be used in other scenarios?
Does cmpxchg write destination cache line on failure? If not, is it better than xchg for spinlock? - spin read-only (with pause) if the first atomic RMW fails to take the lock. You do want to start with an RMW attempt so you don't get a Shared copy of the cache line and then have to wait for another off-core request; if the first access is an RMW like xchg or lock cmpxchg, the first request the core makes will be an RFO.
Locks around memory manipulation via inline assembly - a minimal x86 asm spinlock using pause (but no fallback to OS-assisted sleep/wake, so inappropriate if the locking thread might ever sleep while holding the lock.)
If you have to wait for another core to change something in memory, it doesn't help much to check much more frequently than the inter-core latency, especially if you'll stall right when you finally do see the change you've been waiting for.
Dealing with pause being slow (newer Intel) or fast (AMD and old Intel)
If you have code that uses multiple pause instructions between checking the shared value, you should change the code to do fewer.
See also Why have AMD-CPUs such a silly PAUSE-timing with Brendan's suggestion of checking rdtsc in spin loops to better adapt to unknown delays from pause on different CPUs.
Basically try to make your workload not as sensitive to pause latency. That may also mean trying to avoid waiting for data from other threads as much, or having other useful work to do until a lock becomes available.
Alternatives, IDK if this is lower wakeup latency than spinning with pause
On CPUs new enough to have the WAITPKG extension (Tremont or Alder Lake / Sapphire Rapids), umonitor / umwait could be viable to wait in user-space for a memory location to change, like spinning but the CPU wakes itself up when it sees cache coherency traffic about a change, or something like that. Although that may be slower than pause if it has to enter a sleep state.
(You can ask umwait to only go into C0.1, not C0.2, with bit 0 of the register you specify, and EDX:EAX as a TSC deadline. Intel says a C0.2 sleep state improves performance of the other hyperthread, so presumably that means switching back to single-core-active mode, de-partitioning the store-buffer, ROB, etc., and having to wait for re-partitioning before this core can wake up. But C0.1 state doesn't do that.)
Even in the worst case, pause is only about 140 core clock cycles. That's still much faster than a Linux system call on a modern x86-64, especially with Spectre / Meltdown mitigation. (Thousands to tens of thousands of clock cycles, up from a couple hundred just for syscall + sysret, let alone calling schedule() and maybe running something else.)
So if you're aiming to minimize wakeup latency at the expense of wasting CPU time spinning longer, nanosleep is not an option. It might be good for other use-cases though, as a fallback after spinning on pause a couple times.
Or use futex to sleep on a value changing or on a notify from another process. (It doesn't guarantee that the kernel will use monitor / mwait to sleep until a change, it will let other tasks run. So you do need to have the unlocking thread make a futex system call if any waiters had gone to sleep with futex. But you can still make a lightweight mutex that avoids any system calls in the locker and unlocker if there isn't contention and no threads go to sleep.)
But at that point, with futex you're probably reproducing what glibc's pthread_mutex does.
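To make the "spin a little, then back off" idea concrete, here is a minimal portable sketch. It does not use futex: std::this_thread::yield() stands in for the real sleep/wake fallback, and SpinThenYieldLock, cpu_relax, kSpinLimit and run_hybrid_demo are hypothetical names with an arbitrary spin limit of 64.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
inline void cpu_relax() { _mm_pause(); }                // x86 PAUSE instruction
#else
inline void cpu_relax() { std::this_thread::yield(); }  // assumed stand-in elsewhere
#endif

// Hypothetical name; yield() is a stand-in for a futex-style sleep.
class SpinThenYieldLock {
    std::atomic<bool> locked_{false};
    static constexpr int kSpinLimit = 64;  // arbitrary tuning value
public:
    void lock() {
        int spins = 0;
        // Try the atomic RMW first; on failure wait read-only before retrying.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            while (locked_.load(std::memory_order_relaxed)) {
                if (spins < kSpinLimit) { ++spins; cpu_relax(); }  // short busy-wait
                else std::this_thread::yield();                    // give up the CPU
            }
        }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};

// Demo helper (hypothetical); lock()/unlock() make the class usable with lock_guard.
long run_hybrid_demo(int threads, int iterations) {
    SpinThenYieldLock lock;
    long counter = 0;  // protected by `lock`
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                std::lock_guard<SpinThenYieldLock> guard(lock);
                ++counter;
            }
        });
    }
    for (auto& th : pool) th.join();
    assert(counter == static_cast<long>(threads) * iterations);
    return counter;
}
```

Unlike a futex-based mutex, a yielding waiter still gets scheduled repeatedly, so this only softens the cost of a lock holder being descheduled.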

I can only refer you to this talk by Fedor Pikus. In this he claims that (on linux) nanosleep is generally the fastest on most systems. The real answer though, is to benchmark different implementations, and see whichever is fastest!

In what circumstances lock free data structures are faster than lock based ones?

I'm currently reading the book C++ Concurrency in Action by Anthony Williams, which contains several lock-free data structure implementations. In the introduction to the chapter about lock-free data structures, Anthony writes:
This brings us to another downside of lock-free and wait-free code: although it can increase the potential for concurrency of operations on a data structure and reduce the time an individual thread spends waiting, it may well decrease overall performance.
And indeed, I tested all the lock-free stack implementations described in the book against the lock-based implementation from one of the previous chapters, and the performance of the lock-free code was always lower than that of the lock-based stack.
In what circumstances are lock-free data structures the better choice, so that they should be preferred?

One benefit of lock-free structures is that they do not require a context switch. However, in modern systems, uncontended locks are also context-switch free. To benefit (performance-wise) from a lock-free algorithm, several conditions have to be met:
Contention has to be high
There should be enough CPU cores so that spinning thread can run uninterrupted (ideally, should be pinned to its own core)

I did a performance study years ago. When the number of threads is small, lock-free data structures and lock-based data structures are comparable. But as the number of threads rises, at some point lock-based data structures exhibit a sharp performance drop, while lock-free data structures scale up to thousands of threads.

It depends on the probability of a collision.
If a collision is very likely, then a mutex is the optimal solution.
For example: 2 threads are constantly pushing data to the end of a container.
With lock-freedom only 1 thread will succeed. The other will need to retry. In this scenario blocking and waiting would be better.
But if you have a large container and the 2 threads access the container in different areas, it's very likely that there will be no collision.
For example: one thread modifies the first element of a container and the other thread the last element.
In this case, the probability of a retry is very small, hence lock-freedom would be better here.
Other problems with lock-freedom are spin-locks (heavy memory usage), the overall performance of atomic variables, and some constraints on variables.
For example, if you have the constraint x == y which needs to be true, you cannot use atomic variables for x and y, because you cannot change both variables at once, while a lock() would satisfy the constraint.
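The x == y case can be sketched as follows. LinkedPair and run_pair_demo are hypothetical names for this example. Both variables change under one mutex, so a reader that takes the same mutex can never observe them out of step.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical class: the invariant x_ == y_ holds whenever the mutex is free.
class LinkedPair {
    mutable std::mutex m_;
    int x_ = 0;
    int y_ = 0;
public:
    void set_both(int v) {
        std::lock_guard<std::mutex> guard(m_);
        x_ = v;  // the invariant is briefly broken here...
        y_ = v;  // ...and restored before the lock is released
    }
    bool invariant_holds() const {
        std::lock_guard<std::mutex> guard(m_);
        return x_ == y_;
    }
};

// Demo helper (hypothetical): one writer thread, the caller acts as reader.
bool run_pair_demo(int updates) {
    LinkedPair pair;
    bool always_consistent = true;
    std::thread writer([&] {
        for (int i = 1; i <= updates; ++i) pair.set_both(i);
    });
    for (int i = 0; i < updates; ++i) {
        if (!pair.invariant_holds()) always_consistent = false;
    }
    writer.join();
    return always_consistent && pair.invariant_holds();
}
```

With two separate std::atomic<int> variables instead, a reader could load x after the writer's first store and y before its second, and see the constraint violated.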

The only way to know which is better is to profile each. The result will change drastically from use case to use case, from one cpu to another, from one arch to another, from one year to another. What might be best today might not be best tomorrow. So always measure and keep measuring.
That said let me give you some of my private thoughts on this:
First: If there is no contention, it shouldn't matter what you do. The no-collision case should always be fast. If it's not, then you need a different implementation tuned to the no-contention case. One implementation might use fewer or faster machine instructions than the other and win, but the difference should be minimal. Test, but I expect near-identical results.
Next, let's look at cases with (high) contention:
Again, you might need an implementation tuned to the contended case. One lock mechanism isn't like another, and the same goes for lock-free methods.
threads <= cores
It's reasonable to assume all threads will be running and do work. There might be small pauses where a thread gets interrupted but that's really the exception. This obviously will only hold true if you only have one application doing this. The threads of all cpu heavy applications add up for this scenario.
Now with a lock one thread will get the lock and work. Other threads can wait for the lock to become available and react the instant the lock becomes free. They can busy loop or for longer durations sleep, doesn't matter much. The lock limits you to 1 thread doing work and you get that with barely any cpu cycles wasted when switching locks.
On the other hand lock free data structures all fall into some try&repeat loop. They will do work and at some crucial point they will try to commit that work and if there was contention they will fail and try again. Often repeating a lot of expensive operations. The more contention there is the more wasted work there is. Worse, all that access on the caches and memory will slow down the thread that actually manages to commit work in the end. So you are not only not getting ahead faster, you are slowing down progress.
I would always go with locks with any workload that takes more cpu cycles than the lock instruction vs. the CAS (or similar) instruction a lock free implementation needs. It really doesn't take much work there leaving only trivial cases for the lock-free approach. The builtin atomic types are such a case and often CPUs have opcodes to do those atomic operations lock-free in hardware in a single instruction that is (relatively) fast. In the end the lock will use such an instruction itself and can never be faster than such a trivial case.
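The "try & repeat" pattern described above looks roughly like this in C++. It is a minimal sketch with made-up names (add_with_cas, RetryStats, run_cas_demo), and the "work" is just an addition. The retry count includes spurious failures of compare_exchange_weak, so it is an upper bound on real conflicts.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

struct RetryStats {
    long total;    // final value of the shared variable
    long retries;  // failed commit attempts across all threads
};

// Compute a new value from a snapshot, then try to commit it with CAS.
// Returns how many times the commit had to be retried.
inline long add_with_cas(std::atomic<long>& shared, long delta) {
    long retries = 0;
    long seen = shared.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak reloads `seen` with the current value,
    // so the next attempt recomputes seen + delta from fresh data.
    while (!shared.compare_exchange_weak(seen, seen + delta,
                                         std::memory_order_acq_rel,
                                         std::memory_order_relaxed)) {
        ++retries;  // work based on the stale snapshot is thrown away
    }
    return retries;
}

// Demo helper (hypothetical): every thread adds 1 `iterations` times.
RetryStats run_cas_demo(int threads, int iterations) {
    std::atomic<long> shared{0};
    std::atomic<long> retries{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            long local = 0;
            for (int i = 0; i < iterations; ++i) local += add_with_cas(shared, 1);
            retries.fetch_add(local, std::memory_order_relaxed);
        });
    }
    for (auto& th : pool) th.join();
    RetryStats stats{shared.load(), retries.load()};
    assert(stats.total == static_cast<long>(threads) * iterations);
    return stats;
}
```

The total is always exact; the retries field varies from run to run and grows with contention, which is the wasted work the answer refers to. With a more expensive computation inside the loop, each retry throws away correspondingly more.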
threads >> cores
If you have many more threads than cores, then only a fraction of them can run at any one time. It is likely that a sleeping thread will be holding a lock. All other threads needing the lock will then also have to go to sleep until the lock-holding thread wakes up again. This is probably the worst case for locking data structures. Nobody gets to do any work.
Now there are implementations of locks (with help from the operating system) where one thread trying to acquire a lock will cause the lock-holding thread to take over until it releases the lock. In such systems the waste is reduced to context switching between the threads.
There is also a problem with locks called the thundering herd problem. If you have 100 threads waiting on a lock and the lock gets freed, then depending on your lock implementation and OS support, 100 threads will wake up. One will get the lock and 99 will waste time trying to acquire the lock, fail and go back to sleep. You really don't want a lock implementation suffering from thundering herds.
Lock-free data structures begin to shine here. If one thread is descheduled, then the other threads will continue their work and succeed in committing their results. The descheduled thread will wake up again at some point, fail to commit its work, and retry. The waste is the work that one descheduled thread did.
cores < threads < 2 * cores
There is a grey zone when the number of threads is near the number of cores. The chance that the blocking thread is running remains high. But this is a very chaotic region. Which method performs better is rather random there. My conclusion: If you don't have tons of threads, then try really hard to stay <= core count.
Some more thoughts:
Sometimes the work, once started, needs to be done in a specific order. If one thread is descheduled you can't just skip it. You see this in some data structures where the code will detect a conflict and one thread actually finishes the work a different thread started before it can commit its own results. Now this is really great if the other thread was descheduled. But if it's actually running, it's just wasteful to do the work twice. So data structures with this scheme really aim at the second scenario above (threads >> cores).
With the amount of mobile computing done today it becomes more and more important to consider the power usage of your code. There are many ways you can optimize your code to change power usage. But really the only way for your code to use less power is to sleep. Something you hear more and more is "race to sleep". If you can make your code run faster so it can sleep earlier then you save power. But the emphasis here is on sleep earlier, or maybe I should say sleep more. If you have 2 threads running 75% of the time they might solve your problem in 75 seconds. But if you can solve the same problem with 2 threads running 50% of the time, alternating with a lock, then they take 100 seconds. But the first also uses 150% cpu power. For a shorter time, true, but 75 * 150% = 112.5 > 100 * 100%. Power wise the slower solution wins. Locks let you sleep while lock free trades power for speed.
Keep that in mind and balance your need for speed with the need to recharge your phone or laptop.

The mutex design will very rarely, if ever, outperform the lockless one.
So the follow-up question is why would anyone ever use a mutex rather than a lockless design?
The problem is that lockless designs can be hard to do and require a significant amount of design work to be reliable, and debugging them can be even harder, while a mutex is quite trivial in comparison. For this reason, people generally prefer to use mutexes first, and then migrate to lock-free later once contention has been proven to be a bottleneck.

I think one thing missing in these answers is the locking period. If your locking period is very short, i.e. after acquiring the lock you perform a task for a very short period (like incrementing a variable), then using a lock-based data structure would bring in unnecessary context switching, CPU scheduling, etc. In this case, lock-free is a good option as the thread would be spinning for a very short time.

Spinlock vs std::mutex::try_lock

What are the benefits of using a specifically designed spinlock (e.g. http://anki3d.org/spinlock) vs. code like this:
std::mutex m;
while (!m.try_lock()) {}
// do work
m.unlock();

On typical hardware, there are massive benefits:
Your naive "fake spinlock" may saturate internal CPU buses while the CPU spins, starving other physical cores including the physical core that holds the lock.
If the CPU supports hyper-threading or something similar, your naive "fake spinlock" may consume excessive execution resources on the physical core, starving another thread sharing that physical core.
Your naive "fake spinlock" probably does extraneous write operations that result in bad cache behavior. When you perform a read-modify-write operation on an x86/x86_64 CPU (like the compare/exchange that try_lock probably does), it always writes even if the value isn't changed. This write causes the cache line to be invalidated on other cores, requiring them to re-share it when another core accesses that line. This is awful if threads on other cores contend for the same lock at the same time.
Your naive "fake spinlock" interacts badly with branch prediction. When you finally do get the lock, you take the mother of all mispredicted branches right at the point where you are locking out other threads and need to execute as quickly as possible. This is like a runner being all pumped up and ready to run at the starting line but then when he hears the starting pistol, he stops to catch his breath.
Basically, that code does everything wrong that it is possible for a spinlock to do wrong. Absolutely nothing is done efficiently. Writing good synchronization primitives requires deep hardware expertise.

The main benefit of using a spinlock is that it is extremely cheap to acquire and release if the all-important precondition is true: There is little or no congestion on the lock.
If you know with sufficient certitude that there will be no contention, a spinlock will greatly outperform a naive implementation of a mutex which will go through library code doing validations that you don't necessarily need, and do a syscall. This means doing a context switch (consuming several hundreds of cycles), and abandoning the thread's time slice and causing your thread to be rescheduled. This may take an indefinite time -- even if the lock would be available almost immediately afterwards, you can still have to wait several dozen milliseconds before your thread runs again in unfavorable conditions.
If, however, the precondition of no contention does not hold, a spinlock will usually be vastly inferior as it makes no progress, but it still consumes CPU resources as if it was performing work. When blocking on a mutex, your thread does not consume CPU resources, so these can be used for a different thread to do work, or the CPU may throttle down, saving power. That's not possible with a spinlock, which is doing "active work" until it succeeds (or fails).
In the worst case, if the number of waiters is greater than the number of CPU cores, spinlocks may cause huge, disproportionate performance impacts because the threads that are active and running are waiting on a condition that can never happen while they are running (since releasing the lock requires a different thread to run!).
On the other hand, one should expect every modern no-suck implementation of std::mutex to already include a tiny spinlock before falling back to doing a syscall. But... while it is a reasonable assumption, this is not guaranteed.
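If you cannot rely on your std::mutex spinning internally, the "tiny spin before falling back" idea can be approximated on top of the standard interface. This is only a sketch under assumptions: lock_with_short_spin and run_short_spin_demo are hypothetical names, the attempt count of 100 is arbitrary, and each try_lock is itself an atomic RMW, so the cache-behavior concerns from the previous answer still apply during the bounded spin.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: try a bounded number of times, then block in lock().
inline void lock_with_short_spin(std::mutex& m, int max_attempts = 100) {
    for (int i = 0; i < max_attempts; ++i) {
        if (m.try_lock()) return;   // got it without blocking
        std::this_thread::yield();  // be polite between attempts
    }
    m.lock();  // give up spinning and let the mutex put the thread to sleep
}

// Demo helper (hypothetical): several threads increment a plain counter.
long run_short_spin_demo(int threads, int iterations) {
    std::mutex m;
    long counter = 0;  // protected by `m`
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock_with_short_spin(m);
                // adopt_lock: the mutex is already held, the guard only unlocks it.
                std::lock_guard<std::mutex> guard(m, std::adopt_lock);
                ++counter;
            }
        });
    }
    for (auto& th : pool) th.join();
    assert(counter == static_cast<long>(threads) * iterations);
    return counter;
}
```

Unlike the unbounded while (!m.try_lock()) {} loop from the question, this one stops burning CPU after a fixed number of attempts.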
Another non-technical reason for using spinlocks in favor of a std::mutex may be license terms. License terms are a poor rationale for a design decision, but they may nevertheless be very real.
For example, the present GCC implementation is based exclusively on pthreads, which implies that "anything MinGW" using anything from the standard threads library necessarily links with winpthreads (lacking alternatives). That means you are subject to the winpthreads license, which implies you must reproduce their copyright message. For some people, that's a dealbreaker.

Is synchronizing with `std::mutex` slower than with `std::atomic(memory_order_seq_cst)`?

The main reason for using atomics over mutexes is that mutexes are expensive. But with the default memory order for atomics being memory_order_seq_cst, isn't this just as expensive?
Question: Can a concurrent program using locks be as fast as a concurrent lock-free program?
If so, it may not be worth the effort unless I want to use memory_order_acq_rel for atomics.
Edit:
I may be missing something, but lock-based can't be faster than lock-free because each lock will have to be a full memory barrier too. But with lock-free, it's possible to use techniques that are less restrictive than memory barriers.
So back to my question: is lock-free any faster than lock-based in the new C++11 standard with the default memory order?
Is "lock-free >= lock-based when measured in performance" true? Let's assume 2 hardware threads.
Edit 2:
My question is not about progress guarantees, and maybe I'm using "lock-free" out of context.
Basically when you have 2 threads with shared memory, and the only guarantee you need is that if one thread is writing then the other thread can't read or write, my assumption is that a simple atomic compare_and_swap operation would be much faster than locking a mutex.
Because if one thread never even touches the shared memory, you will end up locking and unlocking over and over for no reason but with atomic operations you only use 1 CPU cycle each time.
In regards to the comments, a spin-lock vs a mutex-lock is very different when there is very little contention.

Lockfree programming is about progress guarantees: From strongest to weakest, those are wait-free, lock-free, obstruction-free, and blocking.
A guarantee is expensive and comes at a price. The more guarantees you want, the more you pay. Generally, a blocking algorithm or datastructure (with a mutex, say) has the greatest liberties, and thus is potentially the fastest. A wait-free algorithm on the other extreme must use atomic operations at every step, which may be much slower.
Obtaining a lock is actually rather cheap, so you should never worry about that without a deep understanding of the subject. Moreover, blocking algorithms with mutexes are much easier to read, write and reason about. By contrast, even the simplest lock-free data structures are the result of long, focused research, each of them worth one or more PhDs.
In a nutshell, lock- or wait-free algorithms trade worst latency for mean latency and throughput. Everything is slower, but nothing is ever very slow. This is a very special characteristic that is only useful in very specific situations (like real-time systems).

A lock tends to require more operations than a simple atomic operation does. In the simplest cases, a single memory_order_seq_cst atomic operation will be about twice as fast as locking, because locking tends to require, at minimum, two atomic operations in its implementation (one to lock, one to unlock). In many cases, it takes even more than that. However, once you start leveraging the weaker memory orders, atomics can be much faster still, because you are willing to accept less synchronization.
Also, you'll often see "locking algorithms are always as fast as lock-free algorithms." This is somewhat true. The basic idea is that if the fastest algorithm happens to be lock-free, then the fastest algorithm without the lock-free guarantee is ALSO the same algorithm! However, if the fastest algorithm requires locks, then those demanding lock-free guarantees have to go find a slower algorithm.
In general, you will see lock-free algorithms in a few low-level algorithms, where the performance of leveraging specialized opcodes helps. In almost all other code, locking offers more than satisfactory performance and is much easier to read.
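As a rough illustration of the "two atomic operations versus one" point, here is a minimal sketch that increments a shared counter both ways. The helper names run_workers, count_with_mutex and count_with_atomic are made up for this example, and memory_order_relaxed is only acceptable here because nothing but the final total is read, and only after the threads have been joined.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: start `threads` workers, each calling `body` `iters` times.
template <class F>
void run_workers(int threads, int iters, F body) {
    assert(threads >= 0 && iters >= 0);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([=] {
            for (int i = 0; i < iters; ++i) body();
        });
    for (auto& th : pool) th.join();
}

// Lock-based: every increment pays for a lock and an unlock,
// each of which needs at least one atomic operation internally.
long count_with_mutex(int threads, int iters) {
    std::mutex m;
    long counter = 0;
    run_workers(threads, iters, [&] {
        std::lock_guard<std::mutex> guard(m);
        ++counter;
    });
    return counter;
}

// Atomic: every increment is a single read-modify-write.
long count_with_atomic(int threads, int iters) {
    std::atomic<long> counter{0};
    run_workers(threads, iters, [&] {
        counter.fetch_add(1, std::memory_order_relaxed);
    });
    return counter.load();  // the joins above already synchronize
}
```

Both functions return threads * iters; timing them on your own hardware is the only reliable way to see how large the difference really is.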

Question: Can a concurrent program using locks be as fast as a concurrent lock-free program?
It can be faster: a lock-free algorithm must keep the global state consistent at all times, and it does its calculations without knowing whether they will be productive. The state might have changed by the time the calculation is done, making the result irrelevant and the CPU cycles wasted.
The lock-free strategy makes the serialization happen at the end of the process, when the calculation is done. In a pathological case many threads each make an effort, only one effort is productive, and the others have to retry.
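The "calculate first, serialize at the end" pattern described above is typically written as a compare-and-swap retry loop. A minimal sketch, where update_lock_free and the 2*x+1 calculation are arbitrary stand-ins chosen for illustration:

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: atomically replaces `value` with value * 2 + 1.
// Returns how many attempts were thrown away before one succeeded.
int update_lock_free(std::atomic<int>& value) {
    int wasted = 0;
    int seen = value.load(std::memory_order_relaxed);
    for (;;) {
        // Speculative work: done without knowing whether it will be usable.
        int proposed = seen * 2 + 1;
        // Serialization happens here, after the calculation.
        if (value.compare_exchange_weak(seen, proposed,
                                        std::memory_order_acq_rel,
                                        std::memory_order_relaxed))
            return wasted;
        // Another thread changed `value` (or the weak CAS failed spuriously);
        // `seen` now holds the current value and the work must be redone.
        ++wasted;
    }
}
```

For example, starting from std::atomic<int> v{3}, one call leaves v at 7. Under heavy contention the returned count grows, which is exactly the lost work the answer is talking about.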
Lock free can lead to starvation of some threads, whatever their priority is, and there is no way to avoid that. (Although it's unlikely for a thread to starve retrying for very long unless there is crazy contention.)
On the other hand, "serialized calculation and series of side effect based" (aka lock-based) algorithms will not start before they know that no other actor will prevent them from operating on that specific locked resource (the guarantee is provided by the use of a mutex). Note that they might be prevented from finishing by the need to access another resource if multiple locks are taken, leading to a possible deadlock in a badly designed program.
Note that this deadlock issue isn't in the scope of lock-free code, which can't even act on multiple entities: it usually can't do an atomic commit based on two unrelated objects(1).
So the lack of any chance of deadlock for lock-free code is a sign of weakness of lock-free code: not being able to deadlock is a limit of your tool. A system that can only hold one lock at a time also wouldn't be able to deadlock.
The scope of lock-free algorithms is minuscule compared to the scope of lock-based algorithms. For a lot of problems, lock-free doesn't even make sense.
A lock-based algorithm is polite: the threads have to wait in line before doing what they need to do, which is maximally efficient in terms of computation steps by each thread. But it's inefficient to have to queue threads in a wait list: they often can't use the end of their time slice. It is like someone trying to do serious work while being interrupted by the phone all the time: his concentration is gone and he can never reach maximum efficiency because his work time is cut into small pieces.
(1) You would at least need to be able to do a double CAS for that, that is, an operation atomic on two arbitrary addresses (not a double-word CAS, which is just a CAS on more bits, and which can trivially be implemented up to the natural CPU memory access arbitration unit that is the cache line).
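To make the "acting on two unrelated objects" point concrete, here is a minimal lock-based sketch. Account and transfer are hypothetical names for illustration; std::lock is the standard facility that acquires several mutexes without deadlocking regardless of the order in which callers pass them.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical type: each object carries its own mutex.
struct Account {
    std::mutex m;
    long balance = 0;
};

// Updates two independent objects as one atomic step from the point of
// view of any other thread that also takes the locks.
void transfer(Account& from, Account& to, long amount) {
    if (&from == &to) return;  // locking the same mutex twice would be an error
    std::lock(from.m, to.m);   // takes both locks, avoiding deadlock
    std::lock_guard<std::mutex> g1(from.m, std::adopt_lock);
    std::lock_guard<std::mutex> g2(to.m, std::adopt_lock);
    from.balance -= amount;
    to.balance += amount;
}
```

A naive version that locks from.m and then to.m can deadlock when two threads transfer in opposite directions, which is the "badly designed program" case mentioned above. Doing the same update with plain single-word CAS would need the two-address operation described in the footnote.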

What is a scalable lock?

What is a scalable lock? And how is it different from a non-scalable lock? I first saw this term in the context of the TBB rw-lock, and couldn't decide which to use.
In addition, is there any rw-lock that prioritizes readers over writers?

There is no formal definition of the term "scalable lock" or "non-scalable lock". What it's meant to imply is that some locking algorithms, techniques or implementations perform reasonably well even when there is a lot of contention for the lock, and some do not.
Sometimes the problem is algorithmic. A naive implementation of priority inheritance, for example, may require O(n) work to release a lock, where n is the number of waiting threads. That implies O(n^2) work for every waiting thread to be serviced.
Sometimes the problem is to do with hardware. Simple spinlocks (e.g. implementations for which the lock cache line is shared and acquirers don't back off) don't scale on SMP hardware with a single bus interconnect, because writing to a cache line requires that the CPU acquires the cache line, and the CPU interconnect is a single point of contention. If there are n CPUs trying to acquire the same lock at the same time, you may end up with O(n) bus traffic to acquire the lock. Again, this means O(n^2) time for all n CPUs to be satisfied.
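For illustration, here is a minimal sketch of a spinlock that at least avoids the two problems named in the parenthesis: waiters spin on a plain read, so the cache line can stay shared, and they back off between checks. BackoffSpinLock, hammer and the backoff cap of 64 are assumptions made up for this example, not a tuned implementation; queue-based designs such as ticket or MCS locks are the more usual answer when contention is heavy.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

class BackoffSpinLock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        int pauses = 1;
        for (;;) {
            // The exchange is a write: it needs the cache line exclusively.
            if (!locked_.exchange(true, std::memory_order_acquire)) return;
            // While the lock is held, only read, and wait longer each round.
            while (locked_.load(std::memory_order_relaxed)) {
                for (int i = 0; i < pauses; ++i) std::this_thread::yield();
                if (pauses < 64) pauses *= 2;  // arbitrary cap
            }
        }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};

// Hypothetical stress helper: several threads increment one counter.
long hammer(int threads, int iters) {
    BackoffSpinLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;  // very short critical section
                lock.unlock();
            }
        });
    for (auto& th : pool) th.join();
    return counter;
}
```

hammer(4, 10000) returns 40000 whatever the interleaving; what changes with the number of contending CPUs is how long it takes.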
In general, you should avoid non-scalable locks unless two conditions are met:
Contention is light.
The critical section is very short.
You really have to know that the two conditions are met. A critical section may be short in terms of lines of code, but not be short in wall time. If in doubt, use a scalable lock, and later fix any which have been measured to cause performance problems.
As for your last question, I'm not aware of an off-the-shelf read-write lock which favours readers. Actually, most APIs don't specify policy, including pthreads (annoyingly).
My first comment is that you probably don't want it. If you have high contention, favouring one over the other kills throughput, and if you don't have high contention, it won't make a difference. About the only reason I can think not to use a rw lock with a completely fair policy is if you have thread priorities which must be respected, so you want the highest priority thread to be preferred.
But if you must, you could always roll your own. All you need is a couple of flags (one for "readers can go now" and one for "a writer can go now"), condition variables protecting the flags, a single mutex protecting the condition variables, and a couple of counters indicating how many readers and writers are waiting. That should be all you need; implementing this should be quite instructive.
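A minimal sketch of such a roll-your-own lock, under a few assumptions: it uses one mutex, two condition variables and a few counters rather than the exact set of flags listed above, the names ReaderPreferringLock and run_demo are invented for the example, and writers can starve for as long as readers keep arriving (which is what "prioritizes readers" means here).

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

class ReaderPreferringLock {
    std::mutex m_;
    std::condition_variable readers_cv_, writers_cv_;
    int active_readers_ = 0;
    int waiting_readers_ = 0;
    bool writer_active_ = false;
public:
    void lock_shared() {
        std::unique_lock<std::mutex> lk(m_);
        ++waiting_readers_;
        // Readers only wait for an active writer, never for queued writers.
        readers_cv_.wait(lk, [&] { return !writer_active_; });
        --waiting_readers_;
        ++active_readers_;
    }
    void unlock_shared() {
        std::unique_lock<std::mutex> lk(m_);
        if (--active_readers_ == 0 && waiting_readers_ == 0)
            writers_cv_.notify_one();
    }
    void lock() {
        std::unique_lock<std::mutex> lk(m_);
        // Writers yield to every active or waiting reader.
        writers_cv_.wait(lk, [&] {
            return !writer_active_ && active_readers_ == 0 &&
                   waiting_readers_ == 0;
        });
        writer_active_ = true;
    }
    void unlock() {
        std::unique_lock<std::mutex> lk(m_);
        writer_active_ = false;
        if (waiting_readers_ > 0) readers_cv_.notify_all();
        else writers_cv_.notify_one();
    }
};

// Hypothetical smoke test: writers increment, readers just look.
long run_demo(int writers, int readers, int iters) {
    ReaderPreferringLock rw;
    long value = 0;
    std::vector<std::thread> pool;
    for (int w = 0; w < writers; ++w)
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) { rw.lock(); ++value; rw.unlock(); }
        });
    for (int r = 0; r < readers; ++r)
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                rw.lock_shared();
                assert(value >= 0);  // read under the shared lock
                rw.unlock_shared();
            }
        });
    for (auto& th : pool) th.join();
    return value;
}
```

run_demo(2, 2, 2000) returns 4000. Adding priorities or fairness on top of this is where the exercise becomes instructive.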
