The definition of lock-free - c++

There are three different types of "lock-free" algorithms. The definitions given in Concurrency in Action are:
Obstruction-Free: If all other threads are paused, then any given thread will complete its operation in a bounded number of steps.
Lock-Free: If multiple threads are operating on a data structure, then after a bounded number of steps one of them will complete its operation.
Wait-Free: Every thread operating on a data structure will complete its operation in a bounded number of steps, even if other threads are also operating on the data structure.
Herb Sutter says in his talk Lock-Free Programming:
Informally, "lock-free" ≈ "doesn't use mutexes" == any of these.
I do not see why lock-based algorithms can't fall into the lock-free definition given above. Here is a simple lock-based program:
#include <iostream>
#include <mutex>
#include <thread>

std::mutex x_mut;

void print(int& x) {
    std::lock_guard<std::mutex> lk(x_mut);
    std::cout << x;
}

void add(int& x, int y) {
    std::lock_guard<std::mutex> lk(x_mut);
    x += y;
}

int main() {
    int i = 3;
    std::thread thread1{print, std::ref(i)};
    std::thread thread2(add, std::ref(i), 4);
    thread1.join();
    thread2.join();
}
If both of these threads are operating, then after a bounded number of steps, one of them must complete. Why does my program not satisfy the definition of "Lock-free"?

I would be careful about saying "bounded" without defining by what.
The canonical lock-free primitives, CAS loops, do not give any bound: under heavy contention a thread can be repeatedly unlucky and wait forever, and that is allowed. The defining property of lock-free algorithms is that there is always system-wide progress: at any point in time, some thread will make progress.
The stronger guarantee, some progress for each thread at each point in time, is called wait-free.
In other words, lock-free guarantees that a misbehaving thread cannot fatally impact all other threads; wait-free guarantees that it cannot fatally impact any thread.
If both of these threads are operating, then after a bounded number of steps, one of them must complete. Why does my program not satisfy the definition of "Lock-free"?
Because an (unfair) scheduler must be taken into account.* If a thread holding the lock is put to sleep, no other thread will be able to make any progress, so the algorithm is not lock-free, and there is certainly no bound. That won't happen with lock-free programming: resources are always available, and unfortunate scheduling of one thread can only make other threads' operations complete faster, not slower.
* In particular, where the suspension time of any thread is not limited in frequency or duration. If it were, any algorithm would be wait-free (with some huge constant) by definition.
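To make the contrast concrete, here is a minimal sketch (my own illustration, not from either source) of the canonical lock-free building block, a CAS loop on a shared counter. No thread ever holds a lock, so a thread suspended mid-operation cannot prevent the others from completing theirs:

#include <atomic>

std::atomic<int> counter{0};

// Lock-free increment: if the compare_exchange fails, it is only because some
// other thread's CAS succeeded, i.e. the system as a whole made progress.
// An individual thread may retry arbitrarily often under contention (so this
// is not wait-free), but a suspended thread can never block the others.
void increment() {
    int expected = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(expected, expected + 1,
                                          std::memory_order_relaxed)) {
        // expected now holds the freshly observed value; just retry
    }
}

(A real counter would normally just use fetch_add, which is typically wait-free in hardware; the CAS-loop form is shown because it is the shape that general lock-free read-modify-write updates take.)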

Your quote from Concurrency in Action is taken out of context.
In fact, what the book actually says is:
7.1 Definitions and consequences
Algorithms and data structures that use mutexes, condition variables, and futures to synchronize the data are called blocking data structures and algorithms.
Data structures and algorithms that don’t use blocking library functions are said to be nonblocking. Not all these data structures are lock-free ...
Then it proceeds to further break down nonblocking algorithms into Obstruction-Free, Lock-Free and Wait-Free.
So a Lock-Free algorithm is
a nonblocking algorithm (it does not use locks like a mutex) and
it is able to make progress in at least one thread in a bounded number of steps.
So both Herb Sutter and Anthony Williams are correct.

Lock-free producer consumer [duplicate]

Anecdotally, I've found that a lot of programmers mistakenly believe that "lock-free" simply means "concurrent programming without mutexes". Usually, there's also a correlated misunderstanding that the purpose of writing lock-free code is for better concurrent performance. Of course, the correct definition of lock-free is actually about progress guarantees. A lock-free algorithm guarantees that at least one thread is able to make forward progress regardless of what any other threads are doing.
This means a lock-free algorithm can never have code where one thread is depending on another thread in order to proceed. E.g., lock-free code can not have a situation where Thread A sets a flag, and then Thread B keeps looping while waiting for Thread A to unset the flag. Code like that is basically implementing a lock (or what I would call a mutex in disguise).
However, other cases are more subtle and there are some cases where I honestly can't really tell if an algorithm qualifies as lock-free or not, because the notion of "making progress" sometimes appears subjective to me.
One such case is in the (well-regarded, afaik) concurrency library, liblfds. I was studying the implementation of a multi-producer/multi-consumer bounded queue in liblfds - the implementation is very straightforward, but I cannot really tell if it should qualify as lock-free.
The relevant algorithm is in lfds711_queue_bmm_enqueue.c. Liblfds uses custom atomics and memory barriers, but the algorithm is simple enough for me to describe in a paragraph or so.
The queue itself is a bounded contiguous array (ringbuffer). There is a shared read_index and write_index. Each slot in the queue contains a field for user-data, and a sequence_number value, which is basically like an epoch counter. (This avoids ABA issues).
The PUSH algorithm is as follows:
1. Atomically LOAD the write_index.
2. Attempt to reserve the slot in the queue at write_index % queue_size using a CompareAndSwap loop that attempts to set write_index to write_index + 1.
3. If the CompareAndSwap is successful, copy the user data into the reserved slot.
4. Finally, update the sequence_number on the slot by setting it equal to write_index + 1.
The actual source code uses custom atomics and memory barriers, so for further clarity about this algorithm I've briefly translated it into (untested) standard C++ atomics for better readability, as follows:
bool mcmp_queue::enqueue(void* data)
{
    int write_index = m_write_index.load(std::memory_order_relaxed);
    slot* s = nullptr;

    for (;;)
    {
        s = &m_slots[write_index % m_num_slots];
        int sequence_number = s->sequence_number.load(std::memory_order_acquire);
        int difference = sequence_number - write_index;

        if (difference == 0)
        {
            // The slot is free for this position; try to reserve it by
            // bumping the shared write index.
            if (m_write_index.compare_exchange_weak(
                    write_index,
                    write_index + 1,
                    std::memory_order_acq_rel))
            {
                break;
            }
            // CAS failure reloaded write_index; retry with the new value.
        }
        else if (difference < 0)
        {
            return false; // queue is full
        }
        else
        {
            // difference > 0: write_index is stale; reload it and retry.
            write_index = m_write_index.load(std::memory_order_relaxed);
        }
    }

    // Copy the user data and publish the slot by updating its sequence number.
    s->user_data = data;
    s->sequence_number.store(write_index + 1, std::memory_order_release);
    return true;
}
Now, a thread that wants to POP an element from the slot at read_index will not be able to do so until it observes that the slot's sequence_number is equal to read_index + 1.
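For concreteness, the matching POP might look roughly like this. This is my own untested sketch in the same simplified style as the enqueue translation above, not liblfds source; m_read_index and the final slot-recycling store are assumptions based on the prose description:

bool mcmp_queue::dequeue(void*& data)
{
    for (;;)
    {
        int read_index = m_read_index.load(std::memory_order_relaxed);
        slot& s = m_slots[read_index % m_num_slots];
        int sequence_number = s.sequence_number.load(std::memory_order_acquire);
        int difference = sequence_number - (read_index + 1);

        if (difference == 0)
        {
            // The slot has been published by its producer; try to claim it.
            if (m_read_index.compare_exchange_weak(
                    read_index, read_index + 1, std::memory_order_acq_rel))
            {
                data = s.user_data;
                // Mark the slot free for the producer that wraps around to it.
                s.sequence_number.store(read_index + m_num_slots,
                                        std::memory_order_release);
                return true;
            }
        }
        else if (difference < 0)
        {
            // The slot at read_index has not been published yet, so the queue
            // appears empty -- even if later slots already contain data.
            return false;
        }
        // difference > 0: read_index was stale; loop and reload it.
    }
}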
Okay, so there are no mutexes here, and the algorithm likely performs well (it's only a single CAS for PUSH and POP), but is this lock-free? The reason it's unclear to me is because the definition of "making progress" seems murky when there is the possibility that a PUSH or POP can always just fail if the queue is observed to be full or empty.
But what's questionable to me is that the PUSH algorithm essentially reserves a slot, meaning that the slot can never be POP'd until the push thread gets around to updating the sequence number. This means that a POP thread that wants to pop a value depends on the PUSH thread having completed the operation. Otherwise, the POP thread will always return false because it thinks the queue is EMPTY. It seems debatable to me whether this actually falls within the definition of "making progress".
Generally, truly lock-free algorithms involve a phase where a pre-empted thread actually tries to ASSIST the other thread in completing an operation. So, in order to be truly lock-free, I would think that a POP thread that observes an in-progress PUSH would actually need to try and complete the PUSH, and then only after that, perform the original POP operation. If the POP thread simply returns that the queue is EMPTY when a PUSH is in progress, the POP thread is basically blocked until the PUSH thread completes the operation. If the PUSH thread dies, or goes to sleep for 1,000 years, or otherwise gets scheduled into oblivion, the POP thread can do nothing except continuously report that the queue is EMPTY.
So does this fit the definition of lock-free? From one perspective, you can argue that the POP thread can always make progress, because it can always report that the queue is EMPTY (which is at least some form of progress, I guess.) But to me, this isn't really making progress, since the only reason the queue is observed as empty is because we are blocked by a concurrent PUSH operation.
So, my question is: is this algorithm truly lock-free? Or is the index reservation system basically a mutex in disguise?
This queue data structure is not strictly lock-free by what I consider the most reasonable definition. That definition is something like:
A structure is lock-free if and only if any thread can be indefinitely suspended at any point while still leaving the structure usable by the remaining threads.
Of course this implies a suitable definition of usable, but for most structures this is fairly simple: the structure should continue to obey its contracts and allow elements to be inserted and removed as expected.
In this case a thread that has succeeded in incrementing m_write_index, but hasn't yet written s.sequence_number, leaves the container in what will soon be an unusable state. If such a thread is killed, the container will eventually report both "full" and "empty" to push and pop respectively, violating the contract of a fixed-size queue.
There is a hidden mutex here (the combination of m_write_index and the associated s.sequence_number) - but it basically works like a per-element mutex. So the failure only becomes apparent to writers once you've looped around and a new writer tries to get the mutex, but in fact all subsequent writers have effectively failed to insert their element into the queue since no reader will ever see it.
Now this doesn't mean this is a bad implementation of a concurrent queue. For some uses it may behave mostly as if it was lock free. For example, this structure may have most of the useful performance properties of a truly lock-free structure, but at the same time it lacks some of the useful correctness properties. Basically the term lock-free usually implies a whole bunch of properties, only a subset of which will usually be important for any particular use. Let's look at them one by one and see how this structure does. We'll broadly categorize them into performance and functional categories.
Performance
Uncontended Performance
The uncontended or "best case" performance is important for many structures. While you need a concurrent structure for correctness, you'll usually still try to design your application so that contention is kept to a minimum, so the uncontended cost is often important. Some lock-free structures help here, by reducing the number of expensive atomic operations in the uncontended fast-path, or avoiding a syscall.
This queue implementation does a reasonable job here: there is only a single "definitely expensive" operation, the compare_exchange_weak, plus a couple of possibly expensive operations (the memory_order_acquire load and memory_order_release store)[1], and little other overhead.
This compares to something like std::mutex which would imply something like one atomic operation for lock and another for unlock, and in practice on Linux the pthread calls have non-negligible overhead as well.
So I expect this queue to perform reasonably well in the uncontended fast-path.
Contended Performance
One advantage of lock-free structures is that they often allow better scaling when a structure is heavily contended. This isn't necessarily an inherent advantage: some lock-based structures with multiple locks or read-write locks may exhibit scaling that matches or exceeds some lock-free approaches, but it is usually the case that lock-free structures exhibit better scaling than a simple one-lock-to-rule-them-all alternative.
This queue performs reasonably in this respect. The m_write_index variable is atomically updated by all writers and will be a point of contention, but the behavior should be reasonable as long as the underlying hardware CAS implementation is reasonable.
Note that a queue is generally a fairly poor concurrent structure since inserts and removals all happen at the same places (the head and the tail), so contention is inherent in the definition of the structure. Compare this to a concurrent map, where different elements have no particular ordered relationship: such a structure can offer efficient contention-free simultaneous mutation if different elements are being accessed.
Context-switch Immunity
One performance advantage of lock-free structures that is related to the core definition above (and also to the functional guarantees) is that a context switch of a thread which is mutating the structure doesn't delay all the other mutators. In a heavily loaded system (especially when runnable threads >> available cores), a thread may be switched out for hundreds of milliseconds or seconds. During this time, any concurrent mutators will block and incur additional scheduling costs (or they will spin, which may also produce poor behavior). Even though such "unlucky scheduling" may be rare, when it does occur the entire system may incur a serious latency spike.
Lock-free structures avoid this since there is no "critical region" where a thread can be context switched out and subsequently block forward progress by other threads.
This structure offers partial protection in this area — the specifics of which depend on the queue size and application behavior. Even if a thread is switched out in the critical region between the m_write_index update and the sequence number write, other threads can continue to push elements to the queue as long as they don't wrap all the way around to the in-progress element from the stalled thread. Threads can also pop elements, but only up to the in-progress element.
While the push behavior may not be a problem for high-capacity queues, the pop behavior can be a problem: if the queue has a high throughput compared to the average time a thread is context switched out, and the average fullness, the queue will quickly appear empty to all consumer threads, even if there are many elements added beyond the in-progress element. This isn't affected by the queue capacity, but simply the application behavior. It means that the consumer side may completely stall when this occurs. In this respect, the queue doesn't look very lock-free at all!
Functional Aspects
Async Thread Termination
One advantage of lock-free structures is that they are safe for use by threads that may be asynchronously canceled or may otherwise terminate exceptionally in the critical region. Cancelling a thread at any point leaves the structure in a consistent state.
This is not the case for this queue, as described above.
Queue Access from Interrupt or Signal
A related advantage is that lock-free structures can usually be examined or mutated from an interrupt or signal. This is useful in many cases where an interrupt or signal shares a structure with regular process threads.
This queue mostly supports this use case. Even if the signal or interrupt occurs when another thread is in the critical region, the asynchronous code can still push an element onto the queue (which will only be seen later by consuming threads) and can still pop an element off of the queue.
The behavior isn't as complete as a true lock-free structure: imagine a signal handler with a way to tell the remaining application threads (other than the interrupted one) to quiesce and which then drains all the remaining elements of the queue. With a true lock-free structure, this would allow the signal handler to fully drain all the elements, but this queue might fail to do that in the case where a thread was interrupted or switched out in the critical region.
[1] In particular, on x86, this will only use an atomic operation for the CAS, as the memory model is strong enough to avoid the need for atomics or fencing for the other operations. Recent ARM can do acquire and release fairly efficiently as well.
I am the author of liblfds.
The OP is correct in his description of this queue.
It is the single data structure in the library which is not lock-free.
This is described in the documentation for the queue:
http://www.liblfds.org/mediawiki/index.php?title=r7.1.1:Queue_%28bounded,_many_producer,_many_consumer%29#Lock-free_Specific_Behaviour
"It must be understood though that this is not actually a lock-free data structure."
This queue is an implementation of an idea from Dmitry Vyukov (1024cores.net), and I only realised it was not lock-free while I was making the test code work.
By then it was working, so I included it.
I have given some thought to removing it, since it is not lock-free.
Most of the time people say lock-free when they really mean lockless. Lockless means a data structure or algorithm that does not use locks, but with no guarantee of forward progress. Also check this question. So the queue in liblfds is lockless but, as BeeOnRope mentioned, not lock-free.
A thread that calls POP before the next update in sequence is complete is NOT "effectively blocked" if the POP call returns FALSE immediately. The thread can go off and do something else. I'd say that this queue qualifies as lock-free.
However, I wouldn't say that it qualifies as a "queue" -- at least not the kind of queue that you could publish as a queue in a library or something -- because it doesn't guarantee a lot of the behaviors that you can normally expect from a queue. In particular, you can PUSH an element and then try, and FAIL, to POP it, because some other thread is busy pushing an earlier item.
Even so, this queue could still be useful in some lock-free solutions for various problems.
For many applications, however, I would worry about the possibility for consumer threads to be starved while a producer thread is pre-empted. Maybe liblfds does something about that?
"Lock-free" is a property of the algorithm, which implements some functionality. The property doesn't correlate with a way, how given functionality is used by a program.
When talk about mcmp_queue::enqueue function, which returns FALSE if underlying queue is full, its implementation (given in the question post) is lock-free.
However, implementing mcmp_queue::dequeue in lock-free manner would be difficult. E.g., this pattern is obviously not-lock free, as it spins on the variable changed by other thread:
while(s.sequence_number.load(std::memory_order_acquire) == read_index);
data = s.user_data;
...
return data;
I did formal verification on this same code using Spin a couple years ago for a course in concurrency testing and it is definitely not lock-free.
Just because there is no explicit "locking" doesn't mean it's lock-free. When it comes to reasoning about progress conditions, think of it from an individual thread's perspective:
Blocking/locking: if another thread gets descheduled and this can block my progress, then it is blocking.
Lock-free/non-blocking: if I am able to eventually make progress in the absence of contention from other threads, then it is at most lock-free.
If no other thread can block my progress indefinitely, then it is wait-free.

What makes the boost::shared_mutex so slow

I used google benchmark to run following 3 tests and the result surprised me since the RW lock is ~4x slower than simple mutex in release mode. (and ~10x slower than simple mutex in debug mode)
void raw_access() {
    (void) (gp->a + gp->b);
}

void mutex_access() {
    std::lock_guard<std::mutex> guard(g_mutex);
    (void) (gp->a + gp->b);
}

void rw_mutex_access() {
    boost::shared_lock<boost::shared_mutex> l(g_rw_mutex);
    (void) (gp->a + gp->b);
}
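For reference, a minimal sketch of how these functions might be registered with Google Benchmark is shown below; the definitions of gp, g_mutex and g_rw_mutex are assumptions, since the question doesn't show them:

#include <benchmark/benchmark.h>
#include <boost/thread/shared_mutex.hpp>
#include <mutex>

// Assumed definitions -- not shown in the question:
struct Data { int a = 1; int b = 2; };
Data* gp = new Data;
std::mutex g_mutex;
boost::shared_mutex g_rw_mutex;

// raw_access(), mutex_access() and rw_mutex_access() defined as above.

static void BM_RawAccess(benchmark::State& state) {
    for (auto _ : state) raw_access();
}
BENCHMARK(BM_RawAccess);

static void BM_MutexAccess(benchmark::State& state) {
    for (auto _ : state) mutex_access();
}
BENCHMARK(BM_MutexAccess);

static void BM_RWMutexAccess(benchmark::State& state) {
    for (auto _ : state) rw_mutex_access();
}
BENCHMARK(BM_RWMutexAccess);

BENCHMARK_MAIN();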
the result is:
2019-06-26 08:30:45
Running ./perf
Run on (4 X 2500 MHz CPU s)
CPU Caches:
L1 Data 32K (x2)
L1 Instruction 32K (x2)
L2 Unified 262K (x2)
L3 Unified 4194K (x1)
Load Average: 5.35, 3.22, 2.57
-----------------------------------------------------------
Benchmark                 Time             CPU   Iterations
-----------------------------------------------------------
BM_RawAccess           1.01 ns         1.01 ns    681922241
BM_MutexAccess         18.2 ns         18.2 ns     38479510
BM_RWMutexAccess       92.8 ns         92.8 ns      7561437
I didn't find enough information via Google, so I'm hoping for some help here.
Thanks
I don't know the particulars of how the standard library/boost/etc. implementations differ, although it seems like the standard library version is faster (congrats, whoever wrote it).
So instead I'll try to explain the speed differences between various mutex types on a theoretical level, which will explain why a shared mutex (should) be slower.
Atomic Spin Lock
More-so as an academic exercise, consider the simplest thread-safety "mutex-like" implementation: a simple atomic spin lock.
In essence, this is nothing more than an std::atomic<bool> or an std::atomic_flag. It is initialized to false. To "lock" the mutex, you simply do an atomic compare-and-exchange operation in a loop until you get a false value (i.e. the previous value was false prior to atomically setting it to true).
std::atomic_flag flag = ATOMIC_FLAG_INIT;
// lock it by looping until we observe a false value
while (flag.test_and_set()) ;
// do stuff under "mutex" lock
// unlock by setting it back to false state
flag.clear();
However, due to the nature of this construct, it's what we call an "unfair" mutex, because the order in which threads acquire the lock is not necessarily the order in which they began their attempts to lock it. That is, under high contention, it's possible for a thread to try to lock and simply never succeed because other threads are luckier. It's very timing-sensitive. Imagine musical chairs.
Because of this, although it functions like a mutex, it's not what we consider a "mutex".
Mutex
A mutex can be thought of as building on top of an atomic spin lock (although it's not typically implemented as such, since they typically are implemented with support of the operating system and/or hardware).
In essence, a mutex is a step above atomic spin locks because it has a queue of waiting threads. This allows it to be "fair" because the order of lock acquisition is (more or less) the same as the order of locking attempts.
If you've noticed, if you run sizeof(std::mutex) it might be a bit larger than you might expect. On my platform it's 40 bytes. That extra space is used to hold state information, notably including some way of accessing a lock queue for each individual mutex.
When you try to lock a mutex, it performs some low-level thread-safety operation to have thread-safe access to the mutex's status information (e.g. atomic spin lock), checks the state of the mutex, adds your thread to the lock queue, and (typically) puts your thread to sleep while you wait so you don't burn precious CPU time. The low-level thread-safety operation (e.g. the atomic spin lock) is atomically released at the same time the thread goes to sleep (this is typically where OS or hardware support is necessary to be efficient).
Unlocking is performed by performing a low-level thread-safe operation (e.g. atomic spin lock), popping the next waiting thread from the queue, and waking it up. The thread that has been awoken now "owns" the lock. Rinse wash and repeat.
Shared Mutex
A shared mutex takes this concept a step further. It can be owned by a single thread for read/write permissions (like a normal mutex), or by multiple threads for read-only permissions (semantically, anyway - it's up to the programmer to ensure it's safe).
Thus, in addition to the unique ownership queue (like a normal mutex) it also has a shared ownership state. The shared ownership state could be simply a count of the number of threads that currently have shared ownership. If you inspect sizeof(std::shared_mutex) you'll find it's typically even larger than std::mutex. On my system, for instance, it's 56 bytes.
So when you go to lock a shared mutex, it has to do everything a normal mutex does, but additionally has to verify some other things. For instance, if you're trying to lock uniquely, it must verify that there are no shared owners. And when you're trying to take a shared lock, it must verify that there are no unique owners.
Because we typically want mutexes to be "fair", once a unique locker is in the queue, future sharing lock attempts must queue instead of acquiring the lock, even though it might currently be under sharing (i.e. non-unique) lock by several threads. This is to ensure shared owners don't "bully" a thread that wants unique ownership.
But this also goes the other way: the queuing logic must ensure a shared locker is never placed into an empty queue during shared ownership (because it should immediately succeed and be another shared owner).
Additionally, if there's a unique locker, followed by a shared locker, followed by a unique locker, it must (roughly) guarantee that acquisition order. So each entry in the lock queue also needs a flag denoting its purpose (i.e. shared vs. unique).
And then we think of the wake-up logic. When you unlock a shared mutex, the logic differs depending on the current ownership type of the mutex. If the unlocking thread has unique ownership or is the last shared owner it might have to wake up some thread(s) from the queue. It will either wake up all threads at the front of the queue who are requesting shared ownership, or a single thread at the front of the queue requesting unique ownership.
As you can imagine, all of this extra logic for who's locking for what reasons, and how it changes depending not only on the current owners but also on the contents of the queue, makes this potentially quite a bit slower. The hope is that you read significantly more frequently than you write, and thus you can have many sharing owners running concurrently, which mitigates the performance hit of coordinating all of that.
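As a small illustration of the two ownership modes (my own sketch using the standard C++17 types; the question itself uses the Boost equivalents):

#include <mutex>
#include <shared_mutex>

std::shared_mutex rw_mutex;
int shared_value = 0;

int read_value() {
    // Any number of readers may hold the shared lock concurrently.
    std::shared_lock<std::shared_mutex> lk(rw_mutex);
    return shared_value;
}

void write_value(int v) {
    // A writer needs exclusive ownership and must wait for all readers
    // (and any other writer) to release the mutex first.
    std::unique_lock<std::shared_mutex> lk(rw_mutex);
    shared_value = v;
}

If reads vastly outnumber writes, the extra bookkeeping is amortized across many concurrent readers; if not, a plain std::mutex is often the faster choice.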

Using Boost.Lockfree queue is slower than using mutexes

Until now I was using std::queue in my project. I measured the average time which a specific operation on this queue requires.
The times were measured on 2 machines: My local Ubuntu VM and a remote server.
Using std::queue, the average was almost the same on both machines: ~750 microseconds.
Then I "upgraded" the std::queue to boost::lockfree::spsc_queue, so I could get rid of the mutexes protecting the queue. On my local VM I could see a huge performance gain, the average is now on 200 microseconds. On the remote machine however, the average went up to 800 microseconds, which is slower than it was before.
First I thought this might be because the remote machine might not support the lock-free implementation:
From the Boost.Lockfree page:
Not all hardware supports the same set of atomic instructions. If it is not available in hardware, it can be emulated in software using guards. However this has the obvious drawback of losing the lock-free property.
To find out if these instructions are supported, boost::lockfree::queue has a method called bool is_lock_free(void) const;.
However, boost::lockfree::spsc_queue does not have a function like this, which, for me, implies that it does not rely on the hardware and that it is always lock-free - on any machine.
What could be the reason for the performance loss?
Example code (Producer/Consumer)
// c++11 compiler and boost library required
#include <iostream>
#include <cstdlib>
#include <chrono>
#include <future>
#include <thread>
/* Using blocking queue:
 * #include <mutex>
 * #include <queue>
 */
#include <boost/lockfree/spsc_queue.hpp>

boost::lockfree::spsc_queue<int, boost::lockfree::capacity<1024>> queue;
/* Using blocking queue:
 * std::queue<int> queue;
 * std::mutex mutex;
 */

int main()
{
    // The queue is a namespace-scope object, so the lambdas use it directly
    // (globals cannot and need not be captured).
    auto producer = std::async(std::launch::async, []()
    {
        // Producing data in a random interval
        while (true)
        {
            /* Using the blocking queue, the mutex must be locked here.
             * mutex.lock();
             */

            // Push random int (0-9999)
            queue.push(std::rand() % 10000);

            /* Using the blocking queue, the mutex must be unlocked here.
             * mutex.unlock();
             */

            // Sleep for random duration (0-999 microseconds)
            std::this_thread::sleep_for(std::chrono::microseconds(std::rand() % 1000));
        }
    });

    auto consumer = std::async(std::launch::async, []()
    {
        // Example operation on the queue.
        // Checks if 1234 was generated by the producer, returns if found.
        while (true)
        {
            /* Using the blocking queue, the mutex must be locked here.
             * mutex.lock();
             */

            int value;
            while (queue.pop(value))
            {
                if (value == 1234)
                    return;
            }

            /* Using the blocking queue, the mutex must be unlocked here.
             * mutex.unlock();
             */

            // Sleep for 100 microseconds
            std::this_thread::sleep_for(std::chrono::microseconds(100));
        }
    });

    consumer.get();
    std::cout << "1234 was generated!" << std::endl;
    return 0;
}
Lock free algorithms generally perform more poorly than lock-based algorithms. That's a key reason they're not used nearly as frequently.
The problem with lock free algorithms is that they maximize contention by allowing contending threads to continue to contend. Locks avoid contention by de-scheduling contending threads. Lock free algorithms, to a first approximation, should only be used when it's not possible to de-schedule contending threads. That only rarely applies to application-level code.
Let me give you a very extreme hypothetical. Imagine four threads are running on a typical, modern dual-core CPU. Threads A1 and A2 are manipulating collection A. Threads B1 and B2 are manipulating collection B.
First, let's imagine the collection uses locks. That will mean that if threads A1 and A2 (or B1 and B2) try to run at the same time, one of them will get blocked by the lock. So, very quickly, one A thread and one B thread will be running. These threads will run very quickly and will not contend. Any time threads try to contend, the conflicting thread will get de-scheduled. Yay.
Now, imagine the collection uses no locks. Now, threads A1 and A2 can run at the same time. This will cause constant contention. Cache lines for the collection will ping-pong between the two cores. Inter-core buses may get saturated. Performance will be awful.
Again, this is highly exaggerated. But you get the idea. You want to avoid contention, not suffer through as much of it as possible.
However, now run this thought experiment again where A1 and A2 are the only threads on the entire system. Now, the lock free collection is probably better (though you may find that it's better just to have one thread in that case!).
Almost every programmer goes through a phase where they think that locks are bad and avoiding locks makes code go faster. Eventually, they realize that it's contention that makes things slow and locks, used correctly, minimize contention.
I cannot say that the boost lockfree queue is slower in all possible cases. In my experience, push(const T& item) tries to make a copy. If you are constructing temporary objects and pushing them onto the queue, then you are hit by a performance drag. I think the library just needs an overloaded push(T&& item) to make movable objects more efficient. Before the addition of such a function, you might have to use pointers, plain types, or the smart pointers offered since C++11. This is a rather limited aspect of the queue, and I use the lockfree queue only very rarely.
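For instance, a common workaround (a sketch of my own, with a hypothetical Message type) is to pass pointers through the queue so that only the pointer is copied:

#include <boost/lockfree/queue.hpp>

struct Message { int id; /* larger payload elided */ };  // hypothetical type

// Only the pointer is copied through the queue, never the payload.
boost::lockfree::queue<Message*> msg_queue(1024);

void produce(int id) {
    Message* m = new Message{id};
    while (!msg_queue.push(m)) {
        // queue full: retry, back off, or drop, depending on the design
    }
}

bool consume() {
    Message* m = nullptr;
    if (!msg_queue.pop(m))
        return false;          // nothing available right now
    // ... use *m ...
    delete m;                  // ownership was transferred through the queue
    return true;
}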

Mutex granularity

I have a question regarding threads. It is known that basically when we call mutex(lock), the thread keeps on executing that part of the code uninterrupted by other threads until it meets mutex(unlock). (At least that's what they say in the book.) So my question is whether it is actually possible to have several scoped WriteLocks which do not interfere with each other. For example something like this:
If I have a buffer with N elements, with no new elements coming but with high-frequency updates (like changing the value of the Kth element), is it possible to set a different lock on each element, so that threads only stall and wait if 2 or more of them are actually trying to update the same element?
To answer your question about N mutexes: yes, that is indeed possible. What resources are protected by a mutex depends entirely on you as the user of that mutex.
This leads to the first (statement) part of your question. A mutex by itself does not guarantee that a thread will work uninterrupted. All it guarantees is MUTual EXclusion - if thread B attempts to lock a mutex which thread A has locked, thread B will block (execute no code) until thread A unlocks the mutex.
This means mutexes can be used to guarantee that a thread executes a block of code uninterrupted; but this works only if all threads follow the same mutex-locking protocol around that block of code. Which means it is your responsibility to assign semantics (or meaning) to each individual mutex, and correctly adhere to those semantics in your code.
If you decide for the semantics to be "I have an array a of N data elements and an array m of N mutexes, and accessing a[i] can only be done when m[i] is locked," then that's how it will work.
The need to consistently stick to the same protocol is why you should generally encapsulate the mutex and the code/data protected by it in a class in some way or another, so that outside code doesn't need to know the details of the protocol. It just knows "call this member function, and the synchronisation will happen automagically." This "automagic" is the class correctly implementing the protocol.
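A minimal sketch of such an encapsulation, assuming a fixed-size buffer and made-up names, might look like this:

#include <array>
#include <mutex>

// One mutex per element: two threads only contend when they touch
// the same index k.
template <typename T, std::size_t N>
class locked_buffer {
public:
    void update(std::size_t k, const T& value) {
        std::lock_guard<std::mutex> lk(m_[k]);
        a_[k] = value;
    }

    T read(std::size_t k) {
        std::lock_guard<std::mutex> lk(m_[k]);
        return a_[k];
    }

private:
    std::array<T, N> a_{};         // the protected data
    std::array<std::mutex, N> m_;  // protocol: a_[i] is only touched with m_[i] held
};

Callers never see the mutexes; they only call update() and read(), and the class enforces the locking protocol.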
A crucial consideration when deciding between a mutex per array and a mutex per element is whether there are operations - like tracking the number of "in-use" array elements, the "active" element, or moving a pointer-to-array to a larger buffer - that can only be done safely by one thread while all the others are blocked.
A lesser but sometimes important consideration is the amount of extra memory more mutexes use.
If you genuinely need to do this kind of update as quickly as possible in a highly contested multi-threaded program, you may also want to learn about lock-free atomic types and their compare-and-swap / exchange operations, but I'd recommend against considering that unless profiling the existing locking is significant in your overall program performance.
A mutex does not stop other threads from running completely, it only stops other threads from locking the same mutex. I.e. while one thread is keeping the mutex locked, the operating system continues to do context switches letting other threads run also, but if any other thread is trying to lock the same mutex its execution will be halted until the mutex is unlocked.
So yes, you can indeed have several different mutexes and lock/unlock them independently. Just beware of deadlocks: if one thread can lock more than one mutex at a time, you can run into a situation where thread 1 has locked mutex A and is trying to lock mutex B, but blocks because thread 2 already has mutex B locked and is trying to lock mutex A.
It's not completely clear what your use case is:
the threads get a buffer assigned on which they have to work, or
the threads have some results and request a specific buffer to update.
In the first variant you need some assignment logic that assigns a buffer to a thread.
This logic has to be executed atomically, so the best option is to protect it with a mutex.
In the second variant it may be best to have a vector of mutexes, one for each buffer element.
In both cases the buffer itself does not need protection, because it (or better, each field of it) is only accessed from one thread at a time.
You may also want to read up on 'semaphores'. These contain a counter that allows managing resources of which there is a limited amount, but more than one. Mutexes are the special case of semaphores with n = 1.
You can have a mutex per entry; a C++11 mutex can easily be turned into an adaptive spinlock, so you can achieve a good CPU/latency tradeoff.
Or, if you need very low latency and have enough CPUs, you can use an atomic "busy" flag per entry and spin in a tight compare-exchange loop on contention.
From experience, though, the best performance and scalability are achieved when concurrent writes are serialized via a command queue (or a queue of smaller immutable buffers to be concatenated at the destination) and a single thread processes the queue.

Is a ticket lock bounded wait-free? (under certain assumptions)

I am talking about a ticket lock that might look as follows (in pseudo-C syntax):
unsigned int ticket_counter = 0, lock_counter = 0;

void lock() {
    unsigned int my_ticket = fetch_and_increment(ticket_counter);
    while (my_ticket != lock_counter) {}
}

void unlock() {
    atomic_increment(lock_counter);
}
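For reference, a C++11 rendering of this pseudo-code might look like the following sketch; the memory orderings shown are one reasonable choice, not the only one:

#include <atomic>

struct ticket_lock {
    std::atomic<unsigned int> ticket_counter{0};
    std::atomic<unsigned int> lock_counter{0};

    void lock() {
        // Take a ticket; at most p - 1 threads can be ahead of us.
        unsigned int my_ticket = ticket_counter.fetch_add(1, std::memory_order_relaxed);
        while (lock_counter.load(std::memory_order_acquire) != my_ticket) {
            // spin until every earlier ticket holder has called unlock()
        }
    }

    void unlock() {
        lock_counter.fetch_add(1, std::memory_order_release);
    }
};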
Let's assume such a ticket lock synchronizes access to a critical section S that is wait-free, i.e. executing the critical section takes exactly c cycles/instructions. Assuming, that there are at most p threads in the system, is the synchronization of S using the ticket lock bounded wait-free?
In my opinion, it is, since the ticket lock is fair and thus the upper bound for waiting is O(p * c).
Do I make a mistake? I am a little bit confused. I always thought that locking implies not being (bounded) wait-free, because of the following statement:
"It is impossible to construct a wait-free implementation of a queue, stack, priority queue, set, or list from a set of atomic registers." (Corollary 5.4.1 in Art of Multiprocessor Programming, Herlihy and Shavit)
However, if the ticket lock (and maybe any other fair locking mechanism) is bounded wait-free under the mentioned assumptions, it might enable the construction of bounded wait-free implementations of queues, stacks, etc. (That's the question I am actually faced with.)
Recall the definition of bounded wait-free in "Art of Multiprocessor Programming", p.59 by Herlihy and Shavit:
"A method is wait-free if it guarantees that every call finishes its execution in a finite number of steps. It is bounded wait-free if there is a bound on the number of steps a method call can take."
Well, I believe you are correct, with some caveats.
Namely, the bounded wait-free property holds only if the critical section S is non-preemptive, which I suppose you can guarantee only for kernel space code (by disabling interrupts in the critical section). Otherwise the OS might decide to switch to another thread while one thread is in the critical section, and then the wait time is unbounded, no?
Also, for kernel code, I suppose p is not the number of software threads but rather the number of hardware threads (or cores, for CPU's that don't support several threads per CPU core). Because at most p software threads will be runnable at a time, and as S is non-preemptive you have no sleeping threads waiting on the lock.