How do warps work with atomic operations? - c++

The threads in a warp run physically in parallel, so if one of them (call it thread X) starts an atomic operation, what will the others do? Wait? Does that mean all threads will wait while thread X is pushed to the atomic queue, gets access (the mutex), does some work with the memory protected by that mutex, and releases the mutex afterwards?
Is there any way to give the other threads some work, like reading some memory, so the atomic operation can hide its latency? I mean, 15 idle threads is... not great, I guess. Are atomics really slow? How can I accelerate them? Is there any pattern for working with them?
Does an atomic operation on shared memory lock a single bank or the whole memory?
For example (without mutexes), suppose there is __shared__ float smem[256];
Thread1 runs atomicAdd(smem, 1);
Thread2 runs atomicAdd(smem + 1, 1);
Those threads work with different banks, but in the same shared memory. Do they run in parallel, or will they be queued? Is there any difference in this example if Thread1 and Thread2 are from separate warps or the same one?

I count something like 10 questions, which makes this quite difficult to answer. It's suggested that you ask one question per post.
Generally speaking, all threads in a warp are executing the same instruction stream. So there are two cases we can consider:
Without conditionals (e.g. if...then...else): in this case, all threads are executing the same instruction, which happens to be an atomic instruction. Then all 32 threads will execute the atomic, although not necessarily on the same location. All of these atomics will get processed by the SM, and to some extent will serialize (they will completely serialize if they are updating the same location).
With conditionals: for example, suppose we had if (!threadIdx.x) atomicAdd(data, 1); Then thread 0 would execute the atomic, and the others wouldn't. It might seem like we could get the others to do something else, but the lockstep warp execution doesn't allow this. Warp execution is serialized such that all threads taking the if (true) path will execute together, and all threads taking the if (false) path will execute together, but the true and false paths will be serialized. So again, we can't really have different threads in a warp executing different instructions simultaneously.
The net of it is, within a warp, we can't have one thread do an atomic while others do something else simultaneously.
A number of your other questions seem to expect that memory transactions are completed at the end of the instruction cycle in which they originate. This is not the case. With global and with shared memory, we must take special steps in the code to ensure that previous write transactions are visible to other threads (which could be argued as evidence that the transaction completed). One typical way to do this is to use barrier instructions, such as __syncthreads() or __threadfence(). But without those barrier instructions, threads are not "waiting" for writes to complete. A read (or rather, an operation dependent on a read) can stall a thread. A write generally cannot stall a thread.
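For instance (a minimal sketch, with kernel and variable names of my own choosing), making one thread's shared-memory write visible to the rest of its block requires such a barrier:

__global__ void show_visibility(int *out)
{
    __shared__ int flag;
    if (threadIdx.x == 0)
        flag = 42;           // written by a single thread
    __syncthreads();         // barrier: the write is now visible block-wide
    out[threadIdx.x] = flag; // every thread in the block reliably reads 42
}

Without the __syncthreads(), threads other than thread 0 could read flag before the write lands.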
Now let's look at your questions:
so if one of them starts an atomic operation, what will the others do? Wait?
No, they don't wait. The atomic operation gets dispatched to a functional unit on the SM that handles atomics, and all threads continue, together, in lockstep. Since an atomic generally implies a read, yes, the read can stall the warp. But the threads do not wait until the atomic operation is completed (i.e., the write). However, a subsequent read of this location could stall the warp, again, waiting for the atomic (write) to complete. In the case of a global atomic, which is guaranteed to update global memory, it will invalidate the L1 in the originating SM (if enabled) and the L2, if they contain that location as an entry.
Is there any way to give the other threads some work, like reading some memory, so the atomic operation can hide its latency?
Not really, for the reasons I stated at the beginning.
Are atomics really slow? How can I accelerate them? Is there any pattern for working with them?
Yes, atomics can make a program run much more slowly if they dominate the activity (such as naive reductions or naive histogramming). Generally speaking, the way to accelerate atomic operations is to not use them, or to use them sparingly, in a way that doesn't dominate program activity. For example, a naive reduction would use an atomic to add every element to the global sum. A smart parallel reduction will use no atomics at all for the work done in the threadblock. At the end of the threadblock reduction, a single atomic might be used to update the threadblock partial sum into the global sum. This means that I can do a fast parallel reduction of an arbitrarily large number of elements with perhaps on the order of 32 atomic adds, or fewer. This sparing use of atomics will basically not be noticeable in the overall program execution, except that it enables the parallel reduction to be done in a single kernel call rather than two.
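A sketch of that pattern (the kernel name and the fixed 256-thread block size are my own choices; illustrative, not the canonical implementation):

__global__ void reduceSum(const float *in, float *globalSum, int n)
{
    __shared__ float smem[256];  // launch with 256 threads per block

    float local = 0.0f;
    // Grid-stride loop: each thread privately accumulates many elements,
    // using no atomics at all.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        local += in[i];
    smem[threadIdx.x] = local;
    __syncthreads();

    // Shared-memory tree reduction within the threadblock (power-of-2 size).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            smem[threadIdx.x] += smem[threadIdx.x + s];
        __syncthreads();
    }

    // A single atomic per threadblock folds the partial sum into the total.
    if (threadIdx.x == 0)
        atomicAdd(globalSum, smem[0]);
}

Launched over the whole input, this performs only one atomic add per threadblock, which is what makes the cost of the atomics negligible overall.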
Shared memory: Do they run in parallel or will they be queued?
They will be queued. The reason for this is that there are a limited number of functional units that can process atomic operations on shared memory, not enough to service all the requests from a warp in a single cycle.
I've avoided trying to answer questions that relate to the throughput of atomic operations, because this data is not well specified in the documentation AFAIK. It may be that if you issue enough simultaneous or nearly-simultaneous atomic operations, some warps will stall on the atomic instruction, due to the queues that feed the atomic functional units being full. I don't know this to be true and I can't answer questions about it.

Related

Mutex is defying the very idea of threads: parallel processing [duplicate]

When I have a block of code like this:
#include <iostream>
#include <mutex>
#include <thread>

using namespace std;

mutex mtx;

void hello() {
    mtx.lock();
    for (int i = 0; i < 10; i++) {
        cout << "hello";
    }
    mtx.unlock();
}

void hi() {
    mtx.lock();
    for (int i = 0; i < 10; i++) {
        cout << "hi";
    }
    mtx.unlock();
}

int main() {
    thread x(hello);
    thread y(hi);
    x.join();
    y.join();
}
What is the difference between just calling `hello()` and `hi()`? (Like so)
...
int main() {
    hello();
    hi();
}
Are threads more efficient? The purpose of threads is to run at the same time, right?
Can someone explain why we use mutexes within thread functions? Thank you!
The purpose of threads is to run at the same time, right?
Yes, threads are used to perform multiple tasks in parallel, especially on different CPUs.
Can someone explain why we use mutexes within thread functions?
To serialize multiple threads with each other, such as when they are accessing a shared resource that is not safe to access concurrently and needs to be protected.
Are threads more efficient?
No. But see final note (below).
On a single core, threads are much, much less efficient (than function/method calls).
As one example, on my Ubuntu 15.10 (64-bit), using g++ v5.2.1:
a) a context switch (from one thread to the other) enforced by use of std::mutex takes about 12,000 nanoseconds;
b) invoking two simple methods, for instance std::mutex lock() and unlock(), takes < 50 nanoseconds. More than two orders of magnitude! So context switch vs. function call is no contest.
The purpose of threads is to run at the same time, right?
Yes ... but this cannot happen on a single-core processor.
And on a multi-core system, context switch time can still dominate.
For example, my Ubuntu system is dual core. The context-switch measurement I reported above uses a chain of 10 threads, where each thread simply waits for its input semaphore to be unlock()'d. When a thread's input semaphore is unlocked, the thread gets to run ... but the brief thread activity is simply 1) increment a count and check a flag, 2) unlock() the next thread, and 3) lock() its own input mutex, i.e. wait again for the previous task's signal. In that test, the thread known as main starts the thread sequencing with an unlock() of one of the threads, and stops it with a flag that all threads can see.
During this measurement activity (about 3 seconds), the Linux system monitor shows both cores are involved, and reports both cores at about 60% utilization. I expected both cores at 100% ... I don't know why they are not.
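A much-simplified, two-thread version of that kind of measurement might look like this (my sketch, using C++20 std::binary_semaphore in place of the chain of mutex-based semaphores described above; absolute numbers will be rough):

#include <chrono>
#include <iostream>
#include <semaphore>
#include <thread>

int main() {
    constexpr int kRounds = 100000;
    std::binary_semaphore ping(0), pong(0);

    std::thread worker([&] {
        for (int i = 0; i < kRounds; i++) {
            ping.acquire();  // wait for main's signal
            pong.release();  // signal main back
        }
    });

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < kRounds; i++) {
        ping.release();      // wake the worker ...
        pong.acquire();      // ... and wait for its reply
    }
    auto stop = std::chrono::steady_clock::now();
    worker.join();

    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    // Each round is two handoffs (main -> worker -> main).
    std::cout << ns / (2.0 * kRounds) << " ns per handoff\n";
}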
Can someone explain why we use mutexes within thread functions? Thank you!
I suppose the most conventional use of a std::mutex is to serialize access to a memory structure (perhaps shared-access storage or a shared data structure). If your application has data accessible by multiple threads, each write access must be serialized to prevent race conditions from corrupting the data. Sometimes both read and write access need to be serialized. (See the dining philosophers problem.)
In your code, as an example (although I do not know what system you are using), it is possible that std::cout (a shared structure) will 'interleave' text. That is, a thread context switch might happen in the middle of printing a "hello", or even a 'hi'. This behaviour is usually undesired, but might be acceptable.
A number of years ago, I worked with vxWorks and my team learned to use mutexes on access to std::cout to eliminate that interleaving. Such behavior can be distracting, and generally, customers do not like it. (Ultimately, for that app, we did away with the use of the std trio-io (cout, cerr, cin).)
Devices of various kinds also might not function properly if you allow more than one thread to attempt operations on them 'simultaneously'. For example, I have written software for a device that required 50 us or more to complete its reaction to my software's 'poke' before any additional action to the device should be applied. The device simply ignored my code's actions without the wait.
You should also know that there are techniques that do not involve semaphores, but instead use a thread and an IPC to provide serialized (i.e. protected) resource access.
From wikipedia, "In concurrent programming, a monitor is a synchronization construct that allows threads to have both mutual exclusion and the ability to wait (block) for a certain condition to become true."
When the OS provides a suitable IPC, I prefer to use a Hoare monitor. In my interpretation, the monitor is simply a thread that accepts commands over the IPC and is the only thread to access the shared structure or device. When only one thread accesses a structure, NO mutex is needed. All other threads must send a message (via IPC) to request (or perhaps command) another structure change. The monitor thread handles one request at a time, sequentially, out of the IPC.
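A minimal sketch of that monitor idea in C++ (the names and the queue standing in for the OS-provided IPC are my own; shutdown handling is omitted):

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// The state only the monitor thread touches; no mutex protects it,
// because no other thread ever accesses it directly.
static int shared_counter = 0;

// A tiny command queue standing in for the OS IPC.
static std::queue<std::function<void()>> commands;
static std::mutex q_mtx;  // protects the queue itself, not the shared state
static std::condition_variable q_cv;

void monitor_thread() {
    for (;;) {
        std::function<void()> cmd;
        {
            std::unique_lock<std::mutex> lk(q_mtx);
            q_cv.wait(lk, [] { return !commands.empty(); });
            cmd = std::move(commands.front());
            commands.pop();
        }
        cmd();  // runs in the monitor thread, one request at a time
    }
}

// Any other thread requests a change by sending a message.
void request_increment() {
    std::lock_guard<std::mutex> lk(q_mtx);
    commands.push([] { ++shared_counter; });
    q_cv.notify_one();
}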
Definition: collision
In the context of 'thread context switch' and 'mutex semaphores', a 'collision' occurs when a thread must block and wait for access to a resource, because that resource is already 'in use' (i.e. 'occupied'). This is a forced context switch. See also the term 'critical section'.
When the shared resource is NOT currently in use, there is no collision. The lock() and unlock() cost almost nothing (by comparison to a context switch).
When there is a collision, the context switch slows things down by a 'bunch'. But this 'bunch' might still be acceptable ... consider when the 'bunch' is small compared to the duration of the activity inside the critical section.
Final note ... With this new idea of 'collision':
a) Multiple threads can be far less efficient in the face of many collisions.
As an unexpected example: the function 'new' accesses a thread-shared resource we can call 'dynamic memory'. In one experience, each thread generated thousands of new's at start-up. One thread could complete that effort in 0.5 seconds. Four threads, started quickly back-to-back, took 40 seconds to complete the four start-ups. Context switches!
b) Multiple threads can be more efficient when you have multiple cores and no or few collisions. Essentially, if the threads seldom interact, they can run (mostly) simultaneously.
Thread efficiency can be anywhere between a and b when you have multiple cores and collisions.
For instance, my RAM-based 'log' mechanisms seem to work well - one mutex access per log entry. Generally, I intentionally used minimal logging. When debugging a 'discovered' challenge, I added additional logging (maybe later removed) to determine what was going wrong. Generally, the debugger is better than a general logging technique. But sometimes adding several log entries worked well.
Threads have at least two advantages over purely serial code.
Convenience in separating logically independent sequences of instructions. This is true even on a single core machine. This gives you logical concurrency without necessarily parallelism.
Having multiple threads allows either the operating system or a user-level threading library to multiplex multiple logical threads over a smaller number of CPU cores, without the application developer having to worry about other threads and processes.
Taking advantage of multiple cores / processors. Threads allow you to scale your execution to the number of CPU cores you have, enabling parallelism.
Your example is a little contrived because the entire thread's execution is locked. Normally, threads perform many actions independently and only take a mutex when accessing a shared resource.
More specifically, under your scenario you would not gain any performance. However, if your entire thread was not under a mutex, then you could potentially gain efficiency. I say potentially because there are overheads to running multiple threads which may offset any efficiency gain you obtain.
Threads theoretically run simultaneously, which means that threads could write to the same memory block at the same time. For example, if you have a global var int i;, and two threads try to write different values at the same time, which value remains in i?
A mutex forces synchronized access to memory; inside a mutex block (between mutex.lock and mutex.unlock) you are guaranteed exclusive memory access and avoid memory corruption.
When you call mtx.lock(), JUST ONE THREAD KEEPS RUNNING, and any other thread calling the same mtx.lock() stops, waiting for the mtx.unlock call.
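To make the race concrete, here is a sketch (names of my own choosing): with the lock_guard in place, the final count is always 200000; remove it and the result becomes unpredictable, because the two threads' read-modify-write sequences interleave:

#include <iostream>
#include <mutex>
#include <thread>

int i = 0;  // shared global, as in the question
std::mutex mtx;

void add_many() {
    for (int n = 0; n < 100000; n++) {
        std::lock_guard<std::mutex> lk(mtx);  // remove this line to see the race
        ++i;
    }
}

int main() {
    std::thread a(add_many), b(add_many);
    a.join();
    b.join();
    std::cout << i << "\n";  // always 200000 with the mutex held
}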

Implementing spin-lock without XCHG?

A C++ spin-lock can easily be implemented using std::atomic_flag; it can be coded roughly (without special features) as follows:
#include <atomic>

std::atomic_flag f = ATOMIC_FLAG_INIT;

while (f.test_and_set(std::memory_order_acquire)) ; // Acquire lock
// Here do some lock-protected work .....
f.clear(std::memory_order_release); // Release lock
One can see the generated assembly online; it shows that acquiring is implemented atomically through the XCHG instruction.
Also, as one can see on uops.info (screenshot here), XCHG may take up to 30 CPU cycles on the quite popular Skylake. This is quite slow.
The overall speed of a spin lock can be measured through such a program.
Is it possible to implement spin locking without XCHG? The main concern is speed, not just using another instruction.
What is the fastest possible spin lock? Is it possible to make it 10 cycles instead of 30? Or 5 cycles? Maybe some probabilistic spin-lock that runs fast on average?
It should be implemented in a strict way, meaning that in 100% of cases it correctly protects a piece of code and data. If it is probabilistic, then it should run in probabilistic time but still protect 100% correctly after each run.
The main purpose of such a spin lock for me is to protect very tiny operations inside multiple threads, operations that run a dozen or two cycles, hence a 30-cycle delay is too much overhead. Of course, one can say that I could use atomics or other lock-free techniques to implement all the operations. But these techniques are not possible in all cases, and would also take too much work to implement strictly in a huge code base with many classes and methods. Hence something generic like a regular spin lock is also needed.
Is it possible to implement spin locking without XCHG?
Yes. For 80x86, you can lock bts or lock cmpxchg or lock xadd or ...
What is the fastest possible spin lock?
Possible interpretations of "fast" include:
a) Fast in the uncontended case. In this case it's not going to matter very much what you do because most of the possible operations (exchanging, adding, testing...) are cheap and the real costs are cache coherency (getting the cache line containing the lock into the "exclusive" state in the current CPU's cache, possibly including fetching it from RAM or other CPUs' caches) and serialization.
b) Fast in the contended case. In this case you really need a "test without lock; then test & set with lock" approach. The main problem with a simple spinloop (for the contended case) is that when multiple CPUs are spinning, the cache line will be rapidly bouncing from one CPU's cache to the next, consuming a huge amount of interconnect bandwidth for nothing. To prevent this, you'd have a loop that tests the lock state without modifying it, so that the cache line can remain in all CPUs' caches as "shared" at the same time while those CPUs are spinning.
But note that testing read-only to start with can hurt the un-contended case, resulting in more coherency traffic: first a share-request for the cache line which will only get you MESI S state if another core had recently unlocked, and then an RFO (Read For Ownership) when you do try to take the lock. So best practice is probably to start with an RMW, and if that fails then spin read-only with pause until you see the lock available, unless profiling your code on the system you care about shows a different choice is better.
c) Fast to exit the spinloop (after contention) when the lock is acquired. In this case the CPU can speculatively execute many iterations of the loop, and when the lock becomes acquired the CPU has to drain all of those speculatively executed iterations, which costs a little time. To prevent that you want a pause instruction to stop many iterations of the loop/s from being speculatively executed.
d) Fast for other CPUs that don't touch the lock. For some cases (hyper-threading) the core's resources are shared between logical processors; and when one logical processor is spinning it consumes resources that the other logical processor could've used to get useful work done (especially for the "spinlock speculatively executes many iterations of the loop" situation). To minimize this you need a pause in the spinloop/s (so that the spinning logical processor doesn't consume as much of the core's resources and the other logical processor in the core can get more useful work done).
e) Minimum "worst case time to acquire". With a simple lock, under contention, some CPUs or threads can be lucky and always get the lock while other CPUs/threads are very unlucky and take ages to get the lock; the "worst case time to acquire" is theoretically infinite (a CPU can spin forever). To fix that you need a fair lock - something to ensure that only the thread that has been waiting/spinning for the longest amount of time is able to acquire the lock when it is released. Note that it's possible to design a fair lock such that each thread spins on a different cache line, which is a different way to solve the "cache line bouncing between CPUs" problem I mentioned in "b) Fast in the contended case". (A ticket-lock sketch follows the note below.)
f) Minimum "worst case until lock released". This has to involve the length of the worst critical section; but in some situations may also include the cost of any number of IRQs, the cost of any number of task switches, and the time the code isn't using any CPU. It's entirely possible to have a situation where a thread acquires the lock and then the scheduler does a thread switch; then many CPUs all spin (wasting a huge amount of time) on a lock that cannot be released (because the lock holder is the only one that can release the lock and it isn't even using any CPU). The way to fix/improve this is to disable the scheduler and IRQs; which is fine in kernel code, but "likely impossible for security reasons" in normal user-space code. This is also the reason why spinlocks should probably never be used in user-space (and why user-space should probably use a mutex where the thread is put in a "blocked waiting for lock" state and not given CPU time by the scheduler until/unless the thread actually can acquire the lock).
Note that making it fast for one possible interpretation of "fast" can make it slower/worse for other interpretations of "fast". For example; the speed of the uncontended case is made worse by everything else.
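As an aside, the usual way to get the fairness described in (e) is a ticket lock. A rough C++ sketch (my own illustration, not part of the answer):

#include <atomic>

// Each acquirer takes a ticket; tickets are served strictly in order,
// so the longest-waiting thread always gets the lock next.
struct TicketLock {
    std::atomic<unsigned> next{0};     // next ticket to hand out
    std::atomic<unsigned> serving{0};  // ticket currently allowed in

    void lock() {
        unsigned my = next.fetch_add(1, std::memory_order_relaxed);
        while (serving.load(std::memory_order_acquire) != my)
            ;  // spin (a real version would add a pause here)
    }
    void unlock() {
        serving.fetch_add(1, std::memory_order_release);
    }
};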
Example Spinlock
This example is untested, and written in (NASM syntax) assembly.
;Input
; ebx = address of lock

;Initial optimism in the hope the lock isn't contended
spinlock_acquire:
    lock bts dword [ebx],0  ;Set the lowest bit and get its previous value in the carry flag
    jnc .acquired           ;Was it previously 0 = unlocked? Yes: we acquired it, done!

;Waiting (without modifying) to avoid "cache line bouncing"
.spin:
    pause                   ;Reduce resource consumption and avoid memory order
                            ; mis-speculation when the lock becomes available
    test dword [ebx],1      ;Has the lock been released?
    jne .spin               ; No, keep waiting

;Try to acquire again
    lock bts dword [ebx],0  ;Set the lowest bit and get its previous value in the carry flag
    jc .spin                ;Did we actually acquire it? No: go back to waiting

.acquired:
Spin-unlock can be just mov dword [ebx], 0, not lock btr, because you know you own the lock, and a plain store has release semantics on x86. You could read it first to catch double-unlock bugs.
Notes:
a) lock bts is a little slower than other possibilities; but it doesn't interfere with or depend on the other 31 bits (or 63 bits) of the lock, which means that those other bits can be used for detecting programming mistakes (e.g. store 31 bits of "thread ID that currently holds lock" in them when the lock is acquired and check them when the lock is released to auto-detect "Wrong thread releasing lock" and "Lock being released when it was never acquired" bugs) and/or used for gathering performance information (e.g. set bit 1 when there's contention so that other code can scan periodically to determine which locks are rarely contended and which locks are heavily contended). Bugs with the use of locks are often extremely insidious and hard to find (unpredictable and unreproducible "Heisenbugs" that disappear as soon as you try to find them); so I have a preference for "slower with automatic bug detection".
b) This is not a fair lock, which means it's not well suited to situations where contention is likely.
c) For memory, there's a compromise between memory consumption/cache misses and false sharing. For rarely contended locks I like to put the lock in the same cache line/s as the data the lock protects, so that acquiring the lock means that the data the lock holder wants is already in the cache (and no subsequent cache miss occurs). For heavily contended locks this causes false sharing and should be avoided by reserving the whole cache line for the lock and nothing else (e.g. by adding 60 bytes of unused padding after the 4 bytes used by the actual lock, as in C++ alignas(64) struct { std::atomic<int> lock; };). Of course a spinlock like this shouldn't be used for heavily contended locks, so it's reasonable to assume that minimizing memory consumption (and not having any padding, and not caring about false sharing) makes sense.
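For comparison, here is the same optimistic-RMW, then spin-read-only shape in portable C++ (my translation using std::atomic and the x86 _mm_pause intrinsic, not code from the answer):

#include <atomic>
#include <immintrin.h>  // _mm_pause (x86)

struct SpinLock {
    std::atomic<int> state{0};  // 0 = free, 1 = held

    void lock() {
        // Initial optimism: attempt the RMW immediately.
        while (state.exchange(1, std::memory_order_acquire) != 0) {
            // Contended: spin read-only so the cache line can stay shared.
            while (state.load(std::memory_order_relaxed) != 0)
                _mm_pause();
        }
    }
    void unlock() {
        state.store(0, std::memory_order_release);  // plain store suffices
    }
};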
The main purpose of such a spin lock for me is to protect very tiny operations inside multiple threads, operations that run a dozen or two cycles, hence a 30-cycle delay is too much overhead
In that case I'd suggest trying to replace locks with atomics, block-free algorithms, and lock-free algorithms. A trivial example is tracking statistics, where you might want to do lock inc dword [number_of_chickens] instead of acquiring a lock to increase "number_of_chickens".
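In C++ the statistics case is just an atomic increment (a sketch; the variable name follows the answer's example):

#include <atomic>

std::atomic<long> number_of_chickens{0};

void count_chicken() {
    // On x86 this compiles to a single lock-prefixed add/inc;
    // no separate lock object is needed at all.
    number_of_chickens.fetch_add(1, std::memory_order_relaxed);
}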
Beyond that it's hard to say much - for one extreme, the program could be spending most of its time doing work without needing locks and the cost of locking may have almost no impact on overall performance (even though acquire/release is more expensive than the tiny critical sections); and for the other extreme the program could be spending most of its time acquiring and releasing locks. In other words, the cost of acquiring/releasing locks is somewhere between "irrelevant" and "major design flaw (using far too many locks and needing to redesign the entire program)".

In what circumstances are lock-free data structures faster than lock-based ones?

I'm currently reading the book C++ Concurrency in Action by Anthony Williams, and it contains several lock-free data structure implementations. At the start of the chapter on lock-free data structures, Anthony writes:
This brings us to another downside of lock-free and wait-free code: although it can increase the potential for concurrency of operations on a data structure and reduce the time an individual thread spends waiting, it may well decrease overall performance.
And indeed I tested all the lock-free stack implementations described in the book against a lock-based implementation from one of the previous chapters. It seems the performance of the lock-free code is always lower than that of the lock-based stack.
In what circumstances are lock-free data structures more optimal, and when should they be preferred?
One benefit of lock-free structures is that they do not require a context switch. However, in modern systems, uncontended locks are also context-switch-free. To benefit (performance-wise) from a lock-free algorithm, several conditions have to be met:
Contention has to be high
There should be enough CPU cores so that the spinning thread can run uninterrupted (ideally, it should be pinned to its own core)
I did a performance study years ago. When the number of threads is small, lock-free data structures and lock-based data structures are comparable. But as the number of threads rises, at some point lock-based data structures exhibit a sharp performance drop, while lock-free data structures scale up to thousands of threads.
It depends on the probability of a collision.
If a collision is very likely, then a mutex is the optimal solution.
For example: two threads are constantly pushing data to the end of a container.
With lock-freedom only one thread will succeed. The other will need to retry. In this scenario the blocking and waiting would be better.
But if you have a large container and the two threads access the container in different areas, it's very likely that there will be no collision.
For example: one thread modifies the first element of a container and the other thread the last element.
In this case, the probability of a retry is very small, hence lock-freedom would be better here.
Other problems with lock-freedom are spin-locks (heavy memory usage), the overall performance of atomic variables, and some constraints on variables.
For example, if you have the constraint x == y, which needs to be true, you cannot use atomic variables for x and y, because you cannot change both variables at once, while a lock() would satisfy the constraint.
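The retry-on-collision pattern described above looks like this in a typical lock-free stack push (a compare-and-swap sketch of my own, not the book's code):

#include <atomic>

struct Node {
    int value;
    Node *next;
};

std::atomic<Node *> head{nullptr};

void push(int v) {
    Node *n = new Node{v, head.load(std::memory_order_relaxed)};
    // If another thread pushed in the meantime, the CAS fails, n->next is
    // refreshed with the current head, and we pay the cost of a retry.
    while (!head.compare_exchange_weak(n->next, n,
                                       std::memory_order_release,
                                       std::memory_order_relaxed)) {
        // retry until no other thread wins the race
    }
}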
The only way to know which is better is to profile each. The result will change drastically from use case to use case, from one cpu to another, from one arch to another, from one year to another. What might be best today might not be best tomorrow. So always measure and keep measuring.
That said let me give you some of my private thoughts on this:
First: If there is no contention, it shouldn't matter what you do. The no-collision case should always be fast. If it's not, then you need a different implementation tuned to the no-contention case. One implementation might use fewer or faster machine instructions than the other and win, but the difference should be minimal. Test, but I expect near-identical results.
Next let's look at cases with (high) contention:
Again, you might need an implementation tuned to the contention case. One lock mechanism isn't like another, and the same goes for lock-free methods.
threads <= cores
It's reasonable to assume all threads will be running and doing work. There might be small pauses where a thread gets interrupted, but that's really the exception. This obviously only holds true if yours is the only application doing this; the threads of all cpu-heavy applications add up in this scenario.
Now with a lock, one thread will get the lock and work. Other threads can wait for the lock to become available and react the instant the lock becomes free. They can busy-loop or sleep for longer durations; it doesn't matter much. The lock limits you to one thread doing work, and you get that with barely any cpu cycles wasted when the lock changes hands.
On the other hand, lock-free data structures all fall into some try-and-repeat loop. They will do work, and at some crucial point they will try to commit that work; if there was contention, they will fail and try again, often repeating a lot of expensive operations. The more contention there is, the more wasted work there is. Worse, all that access to the caches and memory will slow down the thread that actually manages to commit work in the end. So not only are you not getting ahead faster, you are slowing down progress.
I would always go with locks for any workload that takes more cpu cycles than the difference between the lock instruction and the CAS (or similar) instruction a lock-free implementation needs. It really doesn't take much work to get there, leaving only trivial cases for the lock-free approach. The built-in atomic types are such a case, and often CPUs have opcodes to perform those atomic operations lock-free in hardware in a single instruction that is (relatively) fast. In the end, a lock will use such an instruction itself and can never be faster than such a trivial case.
threads >> cores
If you have many more threads than cores, then only a fraction of them can run at any one time. It is likely that a sleeping thread holds a lock. All other threads needing the lock will then also have to go to sleep until the lock-holding thread wakes up again. This is probably the worst case for locking data structures: nobody gets to do any work.
Now there are lock implementations (with help from the operating system) where a thread trying to acquire a lock causes the lock-holding thread to take over until it releases the lock. In such systems the waste is reduced to the context switching between threads.
There is also a problem with locks called the thundering herd problem. If you have 100 threads waiting on a lock and the lock gets freed, then, depending on your lock implementation and OS support, 100 threads will wake up. One will get the lock, and 99 will waste time trying to acquire the lock, fail, and go back to sleep. You really don't want a lock implementation that suffers from thundering herds.
Lock-free data structures begin to shine here. If one thread is descheduled, the other threads will continue their work and succeed in committing their results. The descheduled thread will wake up again at some point, fail to commit its work, and retry. The waste is limited to the work the one descheduled thread did.
cores < threads < 2 * cores
There is a grey zone when the number of threads is near the number of cores. The chance that the blocking thread is running remains high, but this is a very chaotic region, and which method is better is rather random there. My conclusion: if you don't have tons of threads, try really hard to stay <= core count.
Some more thoughts:
Sometimes the work, once started, needs to be done in a specific order. If one thread is descheduled, you can't just skip it. You see this in some data structures where the code will detect a conflict and one thread actually finishes the work a different thread started before it can commit its own results. Now this is really great if the other thread was descheduled. But if it's actually running, it's just wasteful to do the work twice. So data structures with this scheme really aim at scenario 2 above.
With the amount of mobile computing done today, it becomes more and more important to consider the power usage of your code. There are many ways you can optimize your code to change power usage, but really the only way for your code to use less power is to sleep. Something you hear more and more is "race to sleep". If you can make your code run faster so it can sleep earlier, then you save power. But the emphasis here is on sleeping earlier, or maybe I should say sleeping more. If you have 2 threads running 75% of the time, they might solve your problem in 75 seconds. But if you can solve the same problem with 2 threads running 50% of the time, alternating with a lock, then they take 100 seconds. The first also uses 150% cpu power, for a shorter time, true, but 75 * 150% = 112.5 > 100 * 100% = 100. Power-wise, the slower solution wins. Locks let you sleep, while lock-free trades power for speed.
Keep that in mind and balance your need for speed with the need to recharge your phone or laptop.
The mutex design will very rarely, if ever, outperform the lockless one.
So the follow-up question is: why would anyone ever use a mutex rather than a lockless design?
The problem is that lockless designs can be hard to do and require a significant amount of design work to be reliable, while a mutex is quite trivial in comparison; lockless code can also be much harder to debug. For this reason, people generally prefer to use mutexes first, and then migrate to lock-free later, once contention has been proven to be a bottleneck.
I think one thing missing in these answers is the locking period. If your locking period is very short, i.e. if after acquiring the lock you perform a task for a very short time (like incrementing a variable), then using a lock-based data structure would bring in unnecessary context switching, cpu scheduling, etc. In this case, lock-free is a good option, as a thread would only be spinning for a very short time.

Pros and Cons of Busy Waiting on Modern Processors

I'm using busy waiting to synchronize access to critical regions, like this:
while (p1_flag != T_ID);

/* begin: critical section */
for (int i = 0; i < N; i++) {
    ...
}
/* end: critical section */

p1_flag++;
p1_flag is a global volatile variable that is updated by another concurrent thread. As a matter of fact, I have two critical sections inside a loop, and I have two threads (both executing the same loop) that alternate execution of these critical regions. For instance, the critical regions are named A and B.
Thread 1   Thread 2
A
B          A
A          B
B          A
A          B
B          A
           B
The parallel code executes faster than the serial one, but not as much as I expected. Profiling the parallel program using VTune Amplifier, I noticed that a large amount of time is spent in the synchronization directives, that is, the while(...) and the flag update. I'm not sure why I'm seeing such large overhead on these "instructions", since region A is exactly the same as region B. My best guess is that this is due to cache coherence latency: I'm using an Intel i7 Ivy Bridge machine, and this micro-architecture resolves cache coherence at the L3. VTune also says that the while (...) instruction is consuming all the front-end bandwidth, but why?
To make the question(s) clear: why are the while(...) and flag-update instructions taking so much execution time? Why does the while(...) instruction saturate the front-end bandwidth?
The overhead you're paying may very well be due to passing the sync variable back and forth between the core caches.
Cache coherency dictates that when you modify a cache line (p1_flag++) you need to have ownership of it. This means the write would invalidate any copy existing in other cores, waiting for the other core to write back any changes it made to a shared cache level. The line would then be provided to the requesting core in M state and the modification performed.
However, the other core would by then be constantly reading this line, and that read would snoop the first core, asking if it has a copy of that line. Since the first core is holding an M copy of that line, it would get written back to the shared cache and the core would lose ownership.
Now this depends on the actual implementation in HW, but if the line was snooped before the change was actually made, the first core would have to attempt to get ownership of it again. In some cases I'd imagine this might take several attempts.
If you're set on using busy waiting, you should at least use a pause inside it: the _mm_pause intrinsic, or just __asm("pause"). This would both give the other thread a chance to get the lock and release you from waiting, and reduce the CPU effort spent busy waiting (an out-of-order CPU would fill all its pipelines with parallel instances of this busy wait, consuming lots of power; a pause would serialize it so only a single iteration can run at any given time, consuming much less power with the same effect).
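Applied to the loop in the question, that might look like this (a fragment, assuming the same p1_flag and T_ID; _mm_pause comes from the x86 intrinsics header):

#include <immintrin.h>  /* _mm_pause */

while (p1_flag != T_ID)
    _mm_pause();  /* throttle the spin: cheaper, and it frees pipeline resources */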
A busy-wait is almost never a good idea in multithreaded applications.
When you busy-wait, thread scheduling algorithms will have no way of knowing that your loop is waiting on another thread, so they must allocate time as if your thread is doing useful work. And it does take processor time to check that variable over, and over, and over, and over, and over, and over...until it is finally "unlocked" by the other thread. In the meantime, your other thread will be preempted by your busy-waiting thread again and again, for no purpose at all.
This is an even worse problem if the scheduler is a priority-based one, and the busy-waiting thread is at a higher priority. In this situation, the lower-priority thread will NEVER preempt the higher-priority thread, thus you have a deadlock situation.
You should ALWAYS use semaphores or mutex objects or messaging to synchronize threads. I've never seen a situation where a busy-wait was the right solution.
When you use a semaphore or mutex, then the scheduler knows never to schedule that thread until the semaphore or mutex is released. Thus your thread will never be taking time away from threads that do real work.
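For the question's spin-on-flag pattern, a condition-variable version might look roughly like this (a sketch of my own; the flag must now be protected by the mutex instead of being volatile):

#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;
int p1_flag = 0;  // the question's flag, now protected by m

template <class Body>
void run_critical(int T_ID, Body body) {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [&] { return p1_flag == T_ID; });  // sleep instead of spinning
    body();       // the critical section, run while holding the turn
    ++p1_flag;    // pass the turn to the other thread
    lk.unlock();
    cv.notify_all();
}

Here the waiting thread consumes no CPU time until the scheduler wakes it, which is exactly the advantage over the busy-wait described above.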

What is a scalable lock?

What is a scalable lock? And how is it different from a non-scalable lock? I first saw this term in the context of the TBB rw-lock, and couldn't decide which to use.
In addition, is there any rw-lock that prioritizes readers over writers?
There is no formal definition of the term "scalable lock" or "non-scalable lock". What it's meant to imply is that some locking algorithms, techniques or implementations perform reasonably well even when there is a lot of contention for the lock, and some do not.
Sometimes the problem is algorithmic. A naive implementation of priority inheritance, for example, may require O(n) work to release a lock, where n is the number of waiting threads. That implies O(n^2) total work for all waiting threads to be serviced.
Sometimes the problem is to do with hardware. Simple spinlocks (e.g. implementations for which the lock cache line is shared and acquirers don't back off) don't scale on SMP hardware with a single bus interconnect, because writing to a cache line requires that the CPU acquires the cache line, and the CPU interconnect is a single point of contention. If there are n CPUs trying to acquire the same lock at the same time, you may end up with O(n) bus traffic to acquire the lock. Again, this means O(n^2) time for all n CPUs to be satisfied.
In general, you should avoid non-scalable locks unless two conditions are met:
Contention is light.
The critical section is very short.
You really have to know that the two conditions are met. A critical section may be short in terms of lines of code, but not be short in wall time. If in doubt, use a scalable lock, and later fix any which have been measured to cause performance problems.
As for your last question, I'm not aware of an off-the-shelf read-write lock which favours readers. Actually, most APIs don't specify policy, including pthreads (annoyingly).
My first comment is that you probably don't want it. If you have high contention, favouring one over the other kills throughput, and if you don't have high contention, it won't make a difference. About the only reason I can think of not to use an rw-lock with a completely fair policy is if you have thread priorities which must be respected, so you want the highest-priority thread to be preferred.
But if you must, you could always roll your own. All you need is a couple of flags (one for "readers can go now" and one for "a writer can go now"), condition variables protecting the flags, a single mutex protecting the condition variables, and a couple of counters indicating how many readers and writers are waiting. That should be all you need; implementing this should be quite instructive.
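A reader-preferring lock along those lines might look like this (an untested sketch of my own; I've folded the flags into a bool plus a pair of counters, with one mutex and one condition variable):

#include <condition_variable>
#include <mutex>

// Readers are favoured: a writer may enter only when no readers are active
// or waiting, so a steady stream of readers will starve writers.
class ReaderPriorityRWLock {
    std::mutex m;
    std::condition_variable cv;
    int active_readers = 0;
    int waiting_readers = 0;
    bool writer_active = false;

public:
    void lock_read() {
        std::unique_lock<std::mutex> lk(m);
        ++waiting_readers;
        cv.wait(lk, [&] { return !writer_active; });
        --waiting_readers;
        ++active_readers;
    }
    void unlock_read() {
        std::lock_guard<std::mutex> lk(m);
        if (--active_readers == 0)
            cv.notify_all();  // a writer may now be able to proceed
    }
    void lock_write() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] {
            return !writer_active && active_readers == 0 && waiting_readers == 0;
        });
        writer_active = true;
    }
    void unlock_write() {
        std::lock_guard<std::mutex> lk(m);
        writer_active = false;
        cv.notify_all();  // wake readers and any waiting writer
    }
};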