Is a ticket lock bounded wait-free? (under certain assumptions)

Is a ticket lock bounded wait-free? (under certain assumptions) - concurrency

I am talking about a ticket lock that might look as follows (in pseudo-C syntax):
unsigned int ticket_counter = 0, lock_counter = 0;
void lock() {
unsigned int my_ticket = fetch_and_increment(ticket_counter);
while (my_ticket != lock_counter) {}
}
void unlock() {
atomic_increment(lock_counter);
}
Let's assume such a ticket lock synchronizes access to a critical section S that is wait-free, i.e. executing the critical section takes exactly c cycles/instructions. Assuming, that there are at most p threads in the system, is the synchronization of S using the ticket lock bounded wait-free?
In my opinion, it is, since the ticket lock is fair and thus the upper bound for waiting is O(p * c).
Do I make a mistake? I am little bit confused. I always thought locking implies not to be (bounded) wait-free because of the following statement:
"It is impossible to construct a wait-free implementation of a queue, stack, priority queue, set, or list from a set of atomic register." (Corollary 5.4.1 in Art of Multiprocessor Programming, Herlihy and Shavit)
However, if the ticket lock (and maybe any other fair locking mechanism) is bounded wait-free under the mentioned assumptions, it (might) enables the construction of bounded wait-free implementations of queue, stack, etc. (That's the question I am actually faced with.)
Recall the definition of bounded wait-free in "Art of Multiprocessor Programming", p.59 by Herlihy and Shavit:
"A method is wait-free if it guarantees that every call finishes its execution in a finite number of steps. It is bounded wait-free if there is a bound on the number of steps a method call can take."

Well, I believe you are correct, with some caveats.
Namely, the bounded wait-free property holds only if the critical section S is non-preemptive, which I suppose you can guarantee only for kernel space code (by disabling interrupts in the critical section). Otherwise the OS might decide to switch to another thread while one thread is in the critical section, and then the wait time is unbounded, no?
Also, for kernel code, I suppose p is not the number of software threads but rather the number of hardware threads (or cores, for CPU's that don't support several threads per CPU core). Because at most p software threads will be runnable at a time, and as S is non-preemptive you have no sleeping threads waiting on the lock.

Related

Lock-free producer consumer [duplicate]

Anecdotally, I've found that a lot of programmers mistakenly believe that "lock-free" simply means "concurrent programming without mutexes". Usually, there's also a correlated misunderstanding that the purpose of writing lock-free code is for better concurrent performance. Of course, the correct definition of lock-free is actually about progress guarantees. A lock-free algorithm guarantees that at least one thread is able to make forward progress regardless of what any other threads are doing.
This means a lock-free algorithm can never have code where one thread is depending on another thread in order to proceed. E.g., lock-free code can not have a situation where Thread A sets a flag, and then Thread B keeps looping while waiting for Thread A to unset the flag. Code like that is basically implementing a lock (or what I would call a mutex in disguise).
However, other cases are more subtle and there are some cases where I honestly can't really tell if an algorithm qualifies as lock-free or not, because the notion of "making progress" sometimes appears subjective to me.
One such case is in the (well-regarded, afaik) concurrency library, liblfds. I was studying the implementation of a multi-producer/multi-consumer bounded queue in liblfds - the implementation is very straightforward, but I cannot really tell if it should qualify as lock-free.
The relevant algorithm is in lfds711_queue_bmm_enqueue.c. Liblfds uses custom atomics and memory barriers, but the algorithm is simple enough for me to describe in a paragraph or so.
The queue itself is a bounded contiguous array (ringbuffer). There is a shared read_index and write_index. Each slot in the queue contains a field for user-data, and a sequence_number value, which is basically like an epoch counter. (This avoids ABA issues).
The PUSH algorithm is as follows:
Atomically LOAD the write_index
Attempt to reserve a slot in the queue at write_index % queue_size using a CompareAndSwap loop that attempts to set write_index to write_index + 1.
If the CompareAndSwap is successful, copy the user data into the
reserved slot.
Finally, update the sequence_index on the
slot by making it equal to write_index + 1.
The actual source code uses custom atomics and memory barriers, so for further clarity about this algorithm I've briefly translated it into (untested) standard C++ atomics for better readability, as follows:
bool mcmp_queue::enqueue(void* data)
{
int write_index = m_write_index.load(std::memory_order_relaxed);
for (;;)
{
slot& s = m_slots[write_index % m_num_slots];
int sequence_number = s.sequence_number.load(std::memory_order_acquire);
int difference = sequence_number - write_index;
if (difference == 0)
{
if (m_write_index.compare_exchange_weak(
write_index,
write_index + 1,
std::memory_order_acq_rel
))
{
break;
}
}
if (difference < 0) return false; // queue is full
}
// Copy user-data and update sequence number
//
s.user_data = data;
s.sequence_number.store(write_index + 1, std::memory_order_release);
return true;
}
Now, a thread that wants to POP an element from the slot at read_index will not be able to do so until it observes that the slot's sequence_number is equal to read_index + 1.
Okay, so there are no mutexes here, and the algorithm likely performs well (it's only a single CAS for PUSH and POP), but is this lock-free? The reason it's unclear to me is because the definition of "making progress" seems murky when there is the possibility that a PUSH or POP can always just fail if the queue is observed to be full or empty.
But what's questionable to me is that the PUSH algorithm essentially reserves a slot, meaning that the slot can never be POP'd until the push thread gets around to updating the sequence number. This means that a POP thread that wants to pop a value depends on the PUSH thread having completed the operation. Otherwise, the POP thread will always return false because it thinks the queue is EMPTY. It seems debatable to me whether this actually falls within the definition of "making progress".
Generally, truly lock-free algorithms involve a phase where a pre-empted thread actually tries to ASSIST the other thread in completing an operation. So, in order to be truly lock-free, I would think that a POP thread that observes an in-progress PUSH would actually need to try and complete the PUSH, and then only after that, perform the original POP operation. If the POP thread simply returns that the queue is EMPTY when a PUSH is in progress, the POP thread is basically blocked until the PUSH thread completes the operation. If the PUSH thread dies, or goes to sleep for 1,000 years, or otherwise gets scheduled into oblivion, the POP thread can do nothing except continuously report that the queue is EMPTY.
So does this fit the defintion of lock-free? From one perspective, you can argue that the POP thread can always make progress, because it can always report that the queue is EMPTY (which is at least some form of progress I guess.) But to me, this isn't really making progress, since the only reason the queue is observed as empty is because we are blocked by a concurrent PUSH operation.
So, my question is: is this algorithm truly lock-free? Or is the index reservation system basically a mutex in disguise?

This queue data structure is not strictly lock-free by what I consider the most reasonable definition. That definition is something like:
A structure is lock-free if only if any thread can be indefinitely
suspended at any point while still leaving the structure usable by the
remaining threads.
Of course this implies a suitable definition of usable, but for most structures this is fairly simple: the structure should continue to obey its contracts and allow elements to be inserted and removed as expected.
In this case a thread that has succeeded in incrementing m_write_increment, but hasn't yet written s.sequence_number leaves the container in what will soon be an unusable state. If such a thread is killed, the container will eventually report both "full" and "empty" to push and pop respectively, violating the contract of a fixed size queue.
There is a hidden mutex here (the combination of m_write_index and the associated s.sequence_number) - but it basically works like a per-element mutex. So the failure only becomes apparent to writers once you've looped around and a new writer tries to get the mutex, but in fact all subsequent writers have effectively failed to insert their element into the queue since no reader will ever see it.
Now this doesn't mean this is a bad implementation of a concurrent queue. For some uses it may behave mostly as if it was lock free. For example, this structure may have most of the useful performance properties of a truly lock-free structure, but at the same time it lacks some of the useful correctness properties. Basically the term lock-free usually implies a whole bunch of properties, only a subset of which will usually be important for any particular use. Let's look at them one by one and see how this structure does. We'll broadly categorize them into performance and functional categories.
Performance
Uncontended Performance
The uncontended or "best case" performance is important for many structures. While you need a concurrent structure for correctness, you'll usually still try to design your application so that contention is kept to a minimum, so the uncontended cost is often important. Some lock-free structures help here, by reducing the number of expensive atomic operations in the uncontended fast-path, or avoiding a syscall.
This queue implementation does a reasonable job here: there is only a single "definitely expensive" operation: the compare_exchange_weak, and a couple of possibly expensive operations (the memory_order_acquire load and memory_order_release store)1, and little other overhead.
This compares to something like std::mutex which would imply something like one atomic operation for lock and another for unlock, and in practice on Linux the pthread calls have non-negligible overhead as well.
So I expect this queue to perform reasonably well in the uncontended fast-path.
Contended Performance
One advantage of lock-free structures is that they often allow better scaling when a structure is heavily contended. This isn't necessarily an inherent advantage: some lock-based structures with multiple locks or read-write locks may exhibit scaling that matches or exceeds some lock-free approaches, but it is usually that case that lock-free structures exhibit better scaling that a simple one-lock-to-rule-them-all alternative.
This queue performs reasonably in this respect. The m_write_index variable is atomically updated by all readers and will be a point of contention, but the behavior should be reasonable as long as the underlying hardware CAS implementation is reasonable.
Note that a queue is generally a fairly poor concurrent structure since inserts and removals all happen at the same places (the head and the tail), so contention is inherent in the definition of the structure. Compare this to a concurrent map, where different elements have no particular ordered relationship: such a structure can offer efficient contention-free simultaneous mutation if different elements are being accessed.
Context-switch Immunity
One performance advantage of lock-free structures that is related to the core definition above (and also to the functional guarantees) is that a context switch of a thread which is mutating the structure doesn't delay all the other mutators. In a heavily loaded system (especially when runnable threads >> available cores), a thread may be switched out for hundreds of milliseconds or seconds. During this time, any concurrent mutators will block and incur additional scheduling costs (or they will spin which may also produce poor behavior). Even though such "unluckly scheduling" may be rare, when it does occur the entire system may incur a serious latency spike.
Lock-free structures avoid this since there is no "critical region" where a thread can be context switched out and subsequently block forward progress by other threads.
This structure offers partial protection in this area — the specifics of which depend on the queue size and application behavior. Even if a thread is switched out in the critical region between the m_write_index update and the sequence number write, other threads can continue to push elements to the queue as long as they don't wrap all the way around to the in-progress element from the stalled thread. Threads can also pop elements, but only up to the in-progress element.
While the push behavior may not be a problem for high-capacity queues, the pop behavior can be a problem: if the queue has a high throughput compared to the average time a thread is context switched out, and the average fullness, the queue will quickly appear empty to all consumer threads, even if there are many elements added beyond the in-progress element. This isn't affected by the queue capacity, but simply the application behavior. It means that the consumer side may completely stall when this occurs. In this respect, the queue doesn't look very lock-free at all!
Functional Aspects
Async Thread Termination
On advantage of lock-free structures it they are safe for use by threads that may be asynchronously canceled or may otherwise terminate exceptionally in the critical region. Cancelling a thread at any point leaves the structure is a consistent state.
This is not the case for this queue, as described above.
Queue Access from Interrupt or Signal
A related advantage is that lock-free structures can usually be examined or mutated from an interrupt or signal. This is useful in many cases where an interrupt or signal shares a structure with regular process threads.
This queue mostly supports this use case. Even if the signal or interrupt occurs when another thread is in the critical region, the asynchronous code can still push an element onto the queue (which will only be seen later by consuming threads) and can still pop an element off of the queue.
The behavior isn't as complete as a true lock-free structure: imagine a signal handler with a way to tell the remaining application threads (other than the interrupted one) to quiesce and which then drains all the remaining elements of the queue. With a true lock-free structure, this would allow the signal handler to full drain all the elements, but this queue might fail to do that in the case a thread was interrupted or switched out in the critical region.
1 In particular, on x86, this will only use an atomic operation for the CAS as the memory model is strong enough to avoid the need for atomics or fencing for the other operations. Recent ARM can do acquire and release fairly efficiently as well.

I am the author of liblfds.
The OP is correct in his description of this queue.
It is the single data structure in the library which is not lock-free.
This is described in the documentation for the queue;
http://www.liblfds.org/mediawiki/index.php?title=r7.1.1:Queue_%28bounded,_many_producer,_many_consumer%29#Lock-free_Specific_Behaviour
"It must be understood though that this is not actually a lock-free data structure."
This queue is an implementation of an idea from Dmitry Vyukov (1024cores.net) and I only realised it was not lock-free while I was making the test code work.
By then it was working, so I included it.
I do have some thought to remove it, since it is not lock-free.

Most of the time people use lock-free when they really mean lockless. lockless means a data-structure or algorithm that does not use locks, but there is no guarantee for forward progress. Also check this question. So the queue in liblfds is lockless, but as BeeOnRope mentioned is not lock-free.

A thread that calls POP before the next update in sequence is complete is NOT "effectively blocked" if the POP call returns FALSE immediately. The thread can go off and do something else. I'd say that this queue qualifies as lock-free.
However, I wouldn't say that it qualifies as a "queue" -- at least not the kind of queue that you could publish as a queue in a library or something -- because it doesn't guarantee a lot of the behaviors that you can normally expect from a queue. In particular, you can PUSH and element and then try and FAIL to POP it, because some other thread is busy pushing an earlier item.
Even so, this queue could still be useful in some lock-free solutions for various problems.
For many applications, however, I would worry about the possibility for consumer threads to be starved while a producer thread is pre-empted. Maybe liblfds does something about that?

"Lock-free" is a property of the algorithm, which implements some functionality. The property doesn't correlate with a way, how given functionality is used by a program.
When talk about mcmp_queue::enqueue function, which returns FALSE if underlying queue is full, its implementation (given in the question post) is lock-free.
However, implementing mcmp_queue::dequeue in lock-free manner would be difficult. E.g., this pattern is obviously not-lock free, as it spins on the variable changed by other thread:
while(s.sequence_number.load(std::memory_order_acquire) == read_index);
data = s.user_data;
...
return data;

I did formal verification on this same code using Spin a couple years ago for a course in concurrency testing and it is definitely not lock-free.
Just because there is no explicit "locking", doesn't mean it's lock-free. When it comes to reasoning about progress conditions, think of it from an individual thread's perspective:
Blocking/locking: if another thread gets descheduled and this can block my progress, then it is blocking.
Lock-free/non-blocking: if I am able to eventually make progress in the absence of contention from other threads, then it is at most lock-free.
If no other thread can block my progress indefinitely, then it is wait-free.

The definition of lock-free

There are three different types of "lock-free" algorithms. The definitions given in Concurrency in Action are:
Obstruction-Free: If all other threads are paused, then any given
thread will complete its operation in a bounded number of steps.
Lock-Free: If multiple threads are operating on a data structure, then
after a bounded number of steps one of them will complete its
operation.
Wait-Free: Every thread operating on a data structure
will complete its operation in a bounded number of steps, even if
other threads are also operating on the data structure.
Herb Sutter says in his talk Lock-Free Programming:
Informally, "lock-free" ≈ "doesn't use mutexes" == any of these.
I do not see why lock-based algorithms can't fall into the lock-free definition given above. Here is a simple lock-based program:
#include <iostream>
#include <mutex>
#include <thread>
std::mutex x_mut;
void print(int& x) {
std::lock_guard<std::mutex> lk(x_mut);
std::cout << x;
}
void add(int& x, int y) {
std::lock_guard<std::mutex> lk(x_mut);
x += y;
}
int main() {
int i = 3;
std::thread thread1{print, std::ref(i)};
std::thread thread2(add, std::ref(i), 4);
thread1.join();
thread2.join();
}
If both of these threads are operating, then after a bounded number of steps, one of them must complete. Why does my program not satisfy the definition of "Lock-free"?

I would be careful about saying "bounded" without defining by what.
The canonical lock-free primitives - CAS loops do not give any bounds, if there is heavy contention, a thread can be repeatedly unlucky and wait forever, that is allowed. The defining property of lock-free algorithms is there is always system-wide progress. At any point of time, some thread will make progress.
Stronger guarantee of some progress for each thread at each point of time is called wait-free.
In other words, lock-free guarantees that a misbehaving thread cannot fatally impact all other threads, wait-free cannot fatally impact any thread.
If both of these threads are operating, then after a bounded number of steps, one of them must complete. Why does my program not satisfy the definition of "Lock-free"?
Because an (unfair) scheduler must be taken into account.* If a thread holding the lock is put to sleep, no other thread will be able to make any progress -> not-lock-free and there is certainly no bound. That won't happen with lock-free programming, resources are always available and unfortunate scheduling of one thread can make other threads' operations complete only faster, not slower.
* In particular, where the suspension time for any thread is not limited in frequency or duration. If it was, any algorithm would be wait-free (with some huge constant) by definition.

Your quote from Concurrency in Action is taken out of context.
In fact, what the book actually says is:
7.1 Definitions and consequences
Algorithms and data structures that use mutexes, condition variables,
and futures to synchronize the data are called blocking data
structures and algorithms.
Data structures and algorithms that don’t use blocking library
functions are said to be nonblocking. Not all these data structures
are lock-free ...
Then it proceeds to further break down nonblocking algorithms into Obstruction-Free, Lock-Free and Wait-Free.
So a Lock-Free algorithm is
a nonblocking algorithm (it does not use locks like a mutex) and
it is able to make progress in at least one thread in a bounded number of steps.
So both Herb Sutter and Anthony Williams are correct.

What makes the boost::shared_mutex so slow

I used google benchmark to run following 3 tests and the result surprised me since the RW lock is ~4x slower than simple mutex in release mode. (and ~10x slower than simple mutex in debug mode)
void raw_access() {
(void) (gp->a + gp->b);
}
void mutex_access() {
std::lock_guard<std::mutex> guard(g_mutex);
(void) (gp->a + gp->b);
}
void rw_mutex_access() {
boost::shared_lock<boost::shared_mutex> l(g_rw_mutex);
(void) (gp->a + gp->b);
}
the result is:
2019-06-26 08:30:45
Running ./perf
Run on (4 X 2500 MHz CPU s)
CPU Caches:
L1 Data 32K (x2)
L1 Instruction 32K (x2)
L2 Unified 262K (x2)
L3 Unified 4194K (x1)
Load Average: 5.35, 3.22, 2.57
-----------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------
BM_RawAccess 1.01 ns 1.01 ns 681922241
BM_MutexAccess 18.2 ns 18.2 ns 38479510
BM_RWMutexAccess 92.8 ns 92.8 ns 7561437
I didn't get the enough information via google, so hope some help here.
Thanks

I don't know the particulars of how the standard library/boost/etc. implementations differ, although it seems like the standard library version is faster (congrats, whoever wrote it).
So instead I'll try to explain the speed differences between various mutex types on a theoretical level, which will explain why a shared mutex (should) be slower.
Atomic Spin Lock
More-so as an academic exercise, consider the simplest thread-safety "mutex-like" implementation: a simple atomic spin lock.
In essence, this is nothing more than an std::atomic<bool> or an std::atomic_flag. It is initialized to false. To "lock" the mutex, you simply do an atomic compare-and-exchange operation in a loop until you get a false value (i.e. the previous value was false prior to atomically setting it to true).
std::atomic_flag flag = ATOMIC_FLAG_INIT;
// lock it by looping until we observe a false value
while (flag.test_and_set()) ;
// do stuff under "mutex" lock
// unlock by setting it back to false state
flag.clear();
However, due to the nature of this construct, it's what we call an "unfair" mutex because the order of threads that acquire the lock is not necessarily the order in which they began their attempts to lock it. That is, under high contention, it's possible a thread tries to lock and simply never succeed because other threads are luckier. It's very timing-sensitive. Imagine musical chairs.
Because of this, although it functions like a mutex, it's not what we consider a "mutex".
Mutex
A mutex can be thought of as building on top of an atomic spin lock (although it's not typically implemented as such, since they typically are implemented with support of the operating system and/or hardware).
In essence, a mutex is a step above atomic spin locks because it has a queue of waiting threads. This allows it to be "fair" because the order of lock acquisition is (more or less) the same as the order of locking attempts.
If you've noticed, if you run sizeof(std::mutex) it might be a bit larger than you might expect. On my platform it's 40 bytes. That extra space is used to hold state information, notably including some way of accessing a lock queue for each individual mutex.
When you try to lock a mutex, it performs some low-level thread-safety operation to have thread-safe access to the mutex's status information (e.g. atomic spin lock), checks the state of the mutex, adds your thread to the lock queue, and (typically) puts your thread to sleep while you wait so you don't burn precious CPU time. The low-level thread-safety operation (e.g. the atomic spin lock) is atomically released at the same time the thread goes to sleep (this is typically where OS or hardware support is necessary to be efficient).
Unlocking is performed by performing a low-level thread-safe operation (e.g. atomic spin lock), popping the next waiting thread from the queue, and waking it up. The thread that has been awoken now "owns" the lock. Rinse wash and repeat.
Shared Mutex
A shared mutex takes this concept a step further. It can be owned by a single thread for read/write permissions (like a normal mutex), or by multiple threads for read-only permissions (semantically, anyway - it's up to the programmer to ensure it's safe).
Thus, in addition to the unique ownership queue (like a normal mutex) it also has a shared ownership state. The shared ownership state could be simply a count of the number of threads that currently have shared ownership. If you inspect sizeof(std::shared_mutex) you'll find it's typically even larger than std::mutex. On my system, for instance, it's 56 bytes.
So when you go to lock a shared mutex, it has to do everything a normal mutex does, but additionally has to verify some other stuff. For instance, if you're trying to lock uniquely it must verify that there are no shared owners. And when you're trying to lock sharingly it must verify that there are no unique owners.
Because we typically want mutexes to be "fair", once a unique locker is in the queue, future sharing lock attempts must queue instead of acquiring the lock, even though it might currently be under sharing (i.e. non-unique) lock by several threads. This is to ensure shared owners don't "bully" a thread that wants unique ownership.
But this also goes the other way: the queuing logic must ensure a shared locker is never placed into an empty queue during shared ownership (because it should immediately succeed and be another shared owner).
Additionally, if there's a unique locker, followed by a shared locker, followed by a unique locker, it must (roughly) guarantee that acquisition order. So each entry in the lock queue also needs a flag denoting its purpose (i.e. shared vs. unique).
And then we think of the wake-up logic. When you unlock a shared mutex, the logic differs depending on the current ownership type of the mutex. If the unlocking thread has unique ownership or is the last shared owner it might have to wake up some thread(s) from the queue. It will either wake up all threads at the front of the queue who are requesting shared ownership, or a single thread at the front of the queue requesting unique ownership.
As you can imagine, all of this extra logic for who's locking for what reasons and how it changes depending not only on the current owners but also on the contents of the queue makes this potentially quite a bit slower. The hope is that you read significantly more frequent than you write, and thus you can have many sharing owners running concurrently, which mitigates the performance hit of coordinating all of that.

Mutex granularity

I have a question regarding threads. It is known that basically when we call for mutex(lock) that means that thread keeps on executing the part of code uninterrupted by other threads until it meets mutex(unlock). (At least that's what they say in the book) So my question is if it is actually possible to have several scoped WriteLocks which do not interfere with each other. For example something like this:
If I have a buffer with N elements without any new elements coming, however with high frequency updates (like change value of Kth element) is it possible to set a different lock on each element so that the only time threads would stall and wait is if actually 2 or more threads are trying to update the same element?

To answer your question about N mutexes: yes, that is indeed possible. What resources are protected by a mutex depends entirely on you as the user of that mutex.
This leads to the first (statement) part of your question. A mutex by itself does not guarantee that a thread will work uninterrupted. All it guarantees is MUTual EXclusion - if thread B attempts to lock a mutex which thread A has locked, thread B will block (execute no code) until thread A unlocks the mutex.
This means mutexes can be used to guarantee that a thread executes a block of code uninterrupted; but this works only if all threads follow the same mutex-locking protocol around that block of code. Which means it is your responsibility to assign semantics (or meaning) to each individual mutex, and correctly adhere to those semantics in your code.
If you decide for the semantics to be "I have an array a of N data elements and an array m of N mutexes, and accessing a[i] can only be done when m[i] is locked," then that's how it will work.
The need to consistently stick to the same protocol is why you should generally encapsulate the mutex and the code/data protected by it in a class in some way or another, so that outside code doesn't need to know the details of the protocol. It just knows "call this member function, and the synchronisation will happen automagically." This "automagic" will be the class correcrtly implementing the protocol.

A crucial consideration when deciding between a mutex per array and a mutex per element is whether there are operations - like tracking the number of "in-use" array elements, the "active" element, or moving a pointer-to-array to a larger buffer - that can only be done safely by one thread while all the others are blocked.
A lesser but sometimes important consideration is the amount of extra memory more mutexes use.
If you genuinely need to do this kind of update as quickly as possible in a highly contested multi-threaded program, you may also want to learn about lock-free atomic types and their compare-and-swap / exchange operations, but I'd recommend against considering that unless profiling the existing locking is significant in your overall program performance.

A mutex does not stop other threads from running completely, it only stops other threads from locking the same mutex. I.e. while one thread is keeping the mutex locked, the operating system continues to do context switches letting other threads run also, but if any other thread is trying to lock the same mutex its execution will be halted until the mutex is unlocked.
So yes, you can indeed have several different mutexes and lock/unlock them independently. Just beware of deadlocks, i.e. if one thread can lock more than one mutex at a time you can run into a situation where thread 1 has locked mutex A and is trying to lock mutex B but blocks because thread 2 already has mutex B locked and it is trying to lock mutex A..

Its not completely clear that your use case is:
the threads gets a buffer assigned on that they have to work
the threads have some results and request a special buffer to update.
On the first variant you need some assignment logic that assigns a buffer to a thread.
This logic has to be exectued in an atomic way. so the best is to use a mutex to protect the assignment logic.
On the other variant it may be the best to have a vector of mutexes, one for each buffer element.
In Both cases the buffer does not need a protection because it (or better each field of it) is only accessed from one thread at a time.
You also may inform yourself about 'semaphores'. These contain a counter that allows to manage ressources that have a limited amount but more than one. Mutexes are a special case of semaphores with n=1.

You can have mutex per entry, C++11 mutex can be easily converted into an adaptive-spinlock, so you can achieve good CPU/Latency tradeoff.
Or, if you need very low latency yet have enough CPUs you can use an atomic "busy" flag per entry and spin in a tight compare-exchange loop on contention.
From experience, though, the best performance and scalability are achieved when concurrent writes are serialized via a command queue (or a queue of smaller immutable buffers to be concatenated at destination) and a single thread processing the queue.

difference between lock, memory barrier, semaphore

This article: http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf (page 12)
seems to make a difference between a lock and a memory barrier
I would like to know what the difference is between a lock, memory barrier, and a semaphore?
(While other questions might mention the difference between a lock and a synchronisation object, I found none about the difference between a lock and a memory barrier)

A memory barrier (also known as a fence) is a hardware operation, which
ensures the ordering of different reads and writes to the globally
visible store. On a typical modern processor, memory accesses are
pipelined, and may occur out of order. A memory barrier ensures that
this doesn't happen. A full memory barrier will ensure that all loads
and stores which precede it occur before any load or store which follows
it. (Many processors have support partial barriers; e.g. on a Sparc, a
membar #StoreStore ensures that all stores which occur before it will
be visible to all other processes before any store which occurs after
it.)
That's all a memory barrier does. It doesn't block the thread, or
anything.
Mutexes and semaphores are higher level primatives, implemented in the
operating system. A thread which requests a mutex lock will block, and
have its execution suspended by the OS, until that mutex is free. The
kernel code in the OS will contain memory barrier instructions in order
to implement a mutex, but it does much more; a memory barrier
instruction will suspend the hardware execution (all threads) until the
necessary conditions have been met—a microsecond or so at the
most, and the entire processor stops for this time. When you try to
lock a mutex, and another thread already has it, the OS will suspend
your thread (and only your thread—the processor continues to
execute other threads) until whoever holds the mutex frees it, which
could be seconds, minutes or even days. (Of course, if it's more than a
few hundred milliseconds, it's probably a bug.)
Finally, there's not really much difference between semaphores and
mutexes; a mutex can be considered a semaphore with a count of one.

A memory barrier is a method to order memory access. Compilers and CPU's can change this order to optimize, but in multithreaded environments, this can be an issue. The main difference with the others is that threads are not stopped by this.
A lock or mutex makes sure that code can only be accessed by 1 thread. Within this section, you can view the environment as singlethreaded, so memory barriers should not be needed.
a semaphore is basically a counter that can be increased (v()) or decreased (p()). If the counter is 0, then p() halts the thread until the counter is no longer 0. This is a way to synchronize threads, but I would prefer using mutexes or condition variables (controversial, but that's my opinion). When the initial counter is 1, then the semaphore is called a binary semaphore and it is similar to a lock.
A big difference between locks and semaphores is that the thread owns the lock, so no other thread should try to unlock, while this is not the case for semaphores.

Simples explanation for now.
Lock
Is an atomic test for whether this piece of code can proceed
lock (myObject)
{
// Stuff to do when I acquire the lock
}
This is normally a single CPU instruction that tests and sets a variable as a single atomic operation. More here, http://en.wikipedia.org/wiki/Test-and-set#Hardware_implementation_of_test-and-set_2
Memory Barrier
Is a hint to the processor that it can not execute these instructions out of order. Without it, the instructions could be executed out of order, like in double checked locking, the two null checks could be executed before the lock has.
Thread.MemoryBarrier();
public static Singleton Instance()
{
if (_singletonInstance == null)
{
lock(myObject)
{
if (_singletonInstance == null)
{
_singletonInstance = new Singleton();
}
}
}
}
This are also a set of CPU instructions that implement memory barriers to explicitly tell a CPU it can not execute things out of order.
Semaphores
Are similar to locks except they normally are for more than one thread. I.e. if you can handle 10 concurrent disk reads for example, you'd use a semaphore. Depending on the processor this is either its own instruction, or a test and set instruction with load more work around interrupts (for e.g. on ARM).

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js