Lock a mutex from mutex array with atomic index - c++

I'm trying to write a buffer class that pushes data into a set of buffers, checks whether the current buffer is full, and swaps buffers if necessary. Another thread can fetch a full buffer for file output.
I've successfully implemented the buffer but I wanted to add a ForceSwapBuffer method that would force an incomplete buffer to be swapped and return the data from the incomplete buffer. In order to do this I check if the read and write buffer are the same (there is no use in trying to force swap a buffer to write to a file while there are still other full buffers that could be written).
I want this method to be able to run side by side with the GetBuffer method (not really necessary but I wanted to try it and stumbled upon this problem).
The GetBuffer would block and when ForceSwapBuffer is finished it would still block until the new buffer is completely full, because in the ForceSwapBuffer I change the atomic _read_buffer_index. I wonder if this will always work? Will the blocking lock of GetBuffer detect the change of the atomic read_buffer_index and change the mutex it is trying to lock or would it check at the start of the lock what mutex it has to lock and keep trying to lock the same mutex even when the index changes?
/* selection of member data */
unsigned int _size, _count;
std::atomic<unsigned int> _write_buffer_index, _read_buffer_index;
unsigned int _index;
std::unique_ptr< std::unique_ptr<T[]>[] > _buffers;
std::unique_ptr< std::mutex[] > _mutexes;
std::recursive_mutex _force_swap_buffer;
/* selection of implementation of member functions */
template<typename T> // included to show the use of the recursive_mutex
void Buffer<T>::Push(T *data, unsigned int length) {
    std::lock_guard<std::recursive_mutex> lock(_force_swap_buffer);
    if (_index + length <= _size) {
        memcpy(&_buffers[_write_buffer_index][_index], data, length*sizeof(T));
        _index += length;
    } else {
        memcpy(&_buffers[_write_buffer_index][_index], data, (_size - _index)*sizeof(T));
        unsigned int t_index = _index;
        Push(&data[_size - t_index], length - (_size - t_index));
    }
}
template<typename T>
std::unique_ptr<T[]> Buffer<T>::GetBuffer() {
    std::lock_guard<std::mutex> lock(_mutexes[_read_buffer_index]); // where the magic should happen
    std::unique_ptr<T[]> result(new T[_size]);
    memcpy(result.get(), _buffers[_read_buffer_index].get(), _size*sizeof(T));
    _read_buffer_index = (_read_buffer_index + 1) % _count;
    return std::move(result);
}
template<typename T>
std::unique_ptr<T[]> Buffer<T>::ForceSwapBuffer() {
    std::lock_guard<std::recursive_mutex> lock(_force_swap_buffer); // lock that forbids pushing and force swapping at the same time
    if (_write_buffer_index != _read_buffer_index)
        return nullptr;
    std::unique_ptr<T[]> result(new T[_index]);
    memcpy(result.get(), _buffers[_read_buffer_index].get(), _index*sizeof(T));
    unsigned int next = (_write_buffer_index + 1) % _count;
    _read_buffer_index = next; // changing the read_index while the other thread is blocked; the new mutex is already locked so the other thread should remain blocked
    _write_buffer_index = next;
    _index = 0;
    return result;
}

There are some problems with your code. First, be careful when modifying atomic variables. Only a small set of operations is really atomic (see http://en.cppreference.com/w/cpp/atomic/atomic), and combinations of atomic operations are not atomic. Consider:
_read_buffer_index = (_read_buffer_index + 1) % _count;
What happens here is that you have an atomic read of the variable, an increment, a modulo operation, and an atomic store. However, the whole statement itself is not atomic! If _count is a power of 2, you can just use the ++-operator and apply the modulo whenever you read the index, because unsigned wrap-around then stays consistent with % _count. If it is not, you have to read _read_buffer_index into a temporary variable, perform the above calculations, and then use a compare_exchange function to store the new value if the variable was not changed in the meantime. Obviously the latter has to be done in a loop until it succeeds. You also have to worry about the possibility that one thread increments the variable _count times between the read and compare_exchange of a second thread, in which case the second thread erroneously thinks the variable was not changed.
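The read/compute/compare_exchange loop described above could look roughly like the following. This is a minimal sketch, not the asker's code: the function name advance_modulo is made up for illustration, and the sketch does nothing about the wrap-around caveat mentioned at the end of the previous paragraph.

```cpp
#include <atomic>
#include <cassert>

// Atomically replaces index with (index + 1) % count and returns the value
// that was replaced. The name advance_modulo is hypothetical.
unsigned int advance_modulo(std::atomic<unsigned int>& index, unsigned int count) {
    unsigned int old_value = index.load(std::memory_order_relaxed);
    unsigned int new_value;
    do {
        new_value = (old_value + 1) % count;
        // If another thread changed index in the meantime, compare_exchange_weak
        // fails, reloads old_value with the current value, and we retry.
    } while (!index.compare_exchange_weak(old_value, new_value,
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed));
    return old_value;
}
```

The caller gets back the index it "claimed", so it never has to read the atomic a second time to find out which buffer it is responsible for.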
The second problem is cache-line bouncing. If you have multiple mutexes on the same cache line, then if two or more threads try to access them simultaneously, the performance will be very bad. What the size of a cache-line is depends on your platform.
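One common way to keep neighbouring mutexes off the same cache line is to pad each one out to a full line. The sketch below assumes a 64-byte cache line, which is typical for current x86-64 parts but is only an assumption; the names kCacheLineSize, PaddedMutex and make_padded_mutexes are invented for this example. It relies on C++17 for over-aligned allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>

// Assumed cache-line size; check your platform's documentation.
constexpr std::size_t kCacheLineSize = 64;

// Each element is aligned to, and therefore occupies a whole multiple of,
// one cache line, so two mutexes in an array never share a line.
struct alignas(kCacheLineSize) PaddedMutex {
    std::mutex mutex;
};

static_assert(sizeof(PaddedMutex) % kCacheLineSize == 0,
              "PaddedMutex must fill whole cache lines");

// Replacement for std::unique_ptr<std::mutex[]> in the question's class.
std::unique_ptr<PaddedMutex[]> make_padded_mutexes(std::size_t count) {
    return std::make_unique<PaddedMutex[]>(count);
}
```

Locking then becomes std::lock_guard<std::mutex> lock(mutexes[i].mutex); instead of locking mutexes[i] directly.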
The main problem is that while ForceSwapBuffer() and Push() both lock the _force_swap_buffer mutex, GetBuffer() does not. GetBuffer() however does change _read_buffer_index. So in ForceSwapBuffer():
std::lock_guard<std::recursive_mutex> lock(_force_swap_buffer);
if (_write_buffer_index != _read_buffer_index)
return nullptr;
// another thread can call GetBuffer() here and change _read_buffer_index
// rest of the code here
The assumption that _write_buffer_index == _read_buffer_index after the if-statement is actually invalid.


Multithreaded C++: force read from memory, bypassing cache

I'm working on a personal hobby-time game engine and I'm working on a multithreaded batch executor. I was originally using a concurrent lockless queue and std::function all over the place to facilitate communication between the master and slave threads, but decided to scrap it in favor of a lighter-weight way of doing things that give me tight control over memory allocation: function pointers and memory pools.
Anyway, I've run into a problem:
The function pointer, no matter what I try, is only getting read correctly by one thread while the others read a null pointer and thus fail an assert.
I'm fairly certain this is a problem with caching. I have confirmed that all threads have the same address for the pointer. I've tried declaring it as volatile, intptr_t, std::atomic, and tried all sorts of casting-fu and the threads all just seem to ignore it and continue reading their cached copies.
I've modeled the master and slave in a model checker to make sure the concurrency is good, and there is no livelock or deadlock (provided that the shared variables all synchronize correctly)
void Executor::operator() (int me) {
    while (true) {
        printf("Slave %d waiting.\n", me);
        std::unique_lock<std::mutex> lock(batch.ready_m);
        while(!batch.running) batch.ready.wait(lock);
        printf("Slave %d running.\n", me);
        BatchFunc func = batch.func;
        assert(func != nullptr);
        int index;
        if (batch.store_values) {
            while ((index = batch.item.fetch_add(1)) < batch.n_items) {
                void* data = reinterpret_cast<void*>(batch.data_buffer + index * batch.item_size);
                func(batch.share_data, data);
            }
        }
        else {
            while ((index = batch.item.fetch_add(1)) < batch.n_items) {
                void** data = reinterpret_cast<void**>(batch.data_buffer + index * batch.item_size);
                func(batch.share_data, *data);
            }
        }
        // at least one thread finished, so make sure we won't loop back around
        batch.running = false;
        if (running_threads.fetch_sub(1) == 1) { // I am the last one
            batch.done = true; // therefore all threads are done
        }
    }
}
void Executor::run_batch() {
    if (batch.func == nullptr || batch.n_items == 0) return;
    batch.running = true;
    batch.done = false;
    printf("Master waiting.\n");
    std::unique_lock<std::mutex> lock(batch.complete_m);
    while (!batch.done) batch.complete.wait(lock);
    printf("Master ready.\n");
    batch.func = nullptr;
    batch.n_items = 0;
}
batch.func is being set by another function
template<typename SharedT, typename ItemT>
void set_batch_job(void(*func)(const SharedT*, ItemT*), const SharedT& share_data, bool byValue = true) {
    static_assert(sizeof(SharedT) <= SHARED_DATA_MAXSIZE, "Shared data too large");
    static_assert(std::is_pod<SharedT>::value, "Shared data type must be POD");
    assert(std::is_pod<ItemT>::value || !byValue);
    batch.func = reinterpret_cast<volatile BatchFunc>(func);
    memcpy(batch.share_data, (void*) &share_data, sizeof(SharedT));
    batch.store_values = byValue;
    if (byValue) {
        batch.item_size = sizeof(ItemT);
    }
    else { // store pointers instead of values
        batch.item_size = sizeof(ItemT*);
    }
    batch.n_items = 0;
}
and here is the struct (and typedef) that it's dealing with
typedef void(*BatchFunc)(const void*, void*);
struct JobBatch {
volatile BatchFunc func;
void* const share_data = operator new(SHARED_DATA_MAXSIZE);
intptr_t const data_buffer = reinterpret_cast<intptr_t>(operator new (EXEC_DATA_BUFFER_SIZE));
volatile size_t item_size;
std::atomic<int> item; // Index into the data array
volatile int n_items = 0;
std::condition_variable complete; // slave -> master signal
std::condition_variable ready; // master -> slave signal
std::mutex complete_m;
std::mutex ready_m;
bool store_values = false;
volatile bool running = false; // there is work to do in the batch
volatile bool done = false; // there is no work left to do
} batch;
How do I make sure that all the necessary reads and writes to batch.func get synchronized properly between threads?
Just in case it matters: I'm using Visual Studio and compiling an x64 Debug Windows executable. Intel i5, Windows 10, 8GB RAM.
So I did a little reading on the C++ memory model and I managed to hack together a solution using atomic_thread_fence. Everything is probably super broken because I'm crazy and shouldn't roll my own system here, but hey, it's fun to learn!
Basically, whenever you're done writing things that you want other threads to see, you need to call atomic_thread_fence(std::memory_order_release)
On the receiving thread(s), you call atomic_thread_fence(std::memory_order_acquire) before reading shared data.
In my case, release should be done immediately before waiting on a condition variable and acquire should be done immediately before using data written by other threads.
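This ensures that the writes on one thread are seen by the others, provided the receiving thread also observes some atomic write (for example a flag) that the sending thread made after its release fence; fences by themselves only order the operations around them.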
I'm no expert, so this is probably not the right way to tackle the problem and will likely be faced with certain doom. For instance, I still have a deadlock/livelock problem to sort out.
tl;dr: it's not exactly a cache thing: threads may not have their data totally in sync with each other unless you enforce that with atomic memory fences.
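For comparison, a more conventional alternative to hand-placed fences is to write the shared fields while holding the same mutex the waiting threads use with their condition variable; the mutex's lock and unlock then provide the acquire/release ordering. The sketch below is not the code from the question: JobSlot, publish_job, wait_for_job and noop_job are made-up names, and it only covers the master-to-slave handoff of the function pointer.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

using BatchFunc = void (*)(const void*, void*);

// Hypothetical, cut-down stand-in for the question's JobBatch.
struct JobSlot {
    std::mutex m;
    std::condition_variable ready;
    BatchFunc func = nullptr;  // guarded by m
    bool running = false;      // guarded by m
};

// Master side: publish the job under the mutex, then wake the workers.
void publish_job(JobSlot& slot, BatchFunc f) {
    {
        std::lock_guard<std::mutex> lock(slot.m);
        slot.func = f;
        slot.running = true;
    }
    slot.ready.notify_all();
}

// Worker side: wait under the same mutex, so the write to func is visible
// by the time the predicate is seen to be true.
BatchFunc wait_for_job(JobSlot& slot) {
    std::unique_lock<std::mutex> lock(slot.m);
    slot.ready.wait(lock, [&slot] { return slot.running; });
    return slot.func;
}

// Trivial job used only to exercise the sketch.
void noop_job(const void*, void*) {}
```

With this shape neither volatile nor explicit fences are needed for func and running, because every access to them happens with the mutex held.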

Using a mutex to block execution from outside the critical section

I'm not sure I got the terminology right but here goes - I have this function that is used by multiple threads to write data (using pseudo code in comments to illustrate what I want)
//these are initiated in the constructor
int* data;
std::atomic<size_t> size;
void write(int value) {
    //wait here while "read_lock"
    //set "write_lock" to "write_lock" + 1
    auto slot = size.fetch_add(1, std::memory_order_acquire);
    data[slot] = value;
    //set "write_lock" to "write_lock" - 1
}
the order of the writes is not important, all I need here is for each write to go to a unique slot
Every once in a while though, I need one thread to read the data using this function
int* read() {
    //set "read_lock" to true
    //wait here while "write_lock"
    int* ret = data;
    data = new int[capacity];
    size = 0;
    //set "read_lock" to false
    return ret;
}
so it basically swaps out the buffer and returns the old one (I've removed capacity logic to make the snippets shorter)
In theory this should lead to 2 operating scenarios:
1 - just a bunch of threads writing into the container
2 - when some thread executes the read function, all new writers will have to wait, the reader will wait until all existing writes are finished, it will then do the read logic and scenario 1 can continue.
The question part is that I don't know what kind of a barrier to use for the locks -
A spinlock would be wasteful since there are many containers like this and they all need cpu cycles
I don't know how to apply std::mutex since I only want the write function to be in a critical section if the read function is triggered. Wrapping the whole write function in a mutex would cause unnecessary slowdown for operating scenario 1.
So what would be the optimal solution here?
If you have C++14 capability then you can use a std::shared_timed_mutex to separate out readers and writers. In this scenario it seems you need to give your writer threads shared access (allowing other writer threads at the same time) and your reader threads unique access (kicking all other threads out).
So something like this may be what you need:
class MyClass
{
    using mutex_type = std::shared_timed_mutex;
    using shared_lock = std::shared_lock<mutex_type>;
    using unique_lock = std::unique_lock<mutex_type>;
    mutable mutex_type mtx;
public:
    // All updater threads can operate at the same time
    auto lock_for_updates() const
    {
        return shared_lock(mtx);
    }
    // Reader threads need to kick all the updater threads out
    auto lock_for_reading() const
    {
        return unique_lock(mtx);
    }
};
// many threads can call this
void do_writing_work(std::shared_ptr<MyClass> sptr)
{
    auto lock = sptr->lock_for_updates();
    // update the data here
}
// access the data from one thread only
void do_reading_work(std::shared_ptr<MyClass> sptr)
{
    auto lock = sptr->lock_for_reading();
    // read the data here
}
The shared_locks allow other threads to gain a shared_lock at the same time but prevent a unique_lock gaining simultaneous access. When a reader thread tries to gain a unique_lock all shared_locks will be vacated before the unique_lock gets exclusive control.
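Applied to the write/read pair from the question, the same idea might look like the sketch below. It is an illustration under assumptions rather than a drop-in solution: the class name SwapBuffer, the fixed capacity, the bool return from write and the pair returned from read are all choices made for this example.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <utility>

class SwapBuffer {
public:
    explicit SwapBuffer(std::size_t capacity)
        : capacity_(capacity), data_(new int[capacity]), size_(0) {}

    // Writers take the lock in shared mode, so they do not block each other.
    // Returns false if the current buffer is already full.
    bool write(int value) {
        std::shared_lock<std::shared_timed_mutex> lock(mtx_);
        std::size_t slot = size_.fetch_add(1, std::memory_order_relaxed);
        if (slot >= capacity_) return false;
        data_[slot] = value;
        return true;
    }

    // The reader takes the lock exclusively: new writers wait, and the reader
    // itself waits until writers already inside write() have left.
    // Returns the old buffer and the number of valid elements in it.
    std::pair<std::unique_ptr<int[]>, std::size_t> read() {
        std::unique_ptr<int[]> fresh(new int[capacity_]);  // allocate before locking
        std::unique_lock<std::shared_timed_mutex> lock(mtx_);
        std::size_t count = std::min(size_.exchange(0), capacity_);
        data_.swap(fresh);  // fresh now owns the old, filled buffer
        return {std::move(fresh), count};
    }

private:
    std::size_t capacity_;
    std::unique_ptr<int[]> data_;
    std::atomic<std::size_t> size_;
    mutable std::shared_timed_mutex mtx_;
};
```

The relaxed fetch_add is enough here because releasing the shared lock and later acquiring the unique lock already orders each writer's store before the reader's swap.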
You can also do this with regular mutexes and condition variables rather than shared. Supposedly shared_mutex has higher overhead, so I'm not sure which will be faster. With Gallik's solution you'd presumably be paying to lock the shared mutex on every write call; I got the impression from your post that write gets called way more than read so maybe this is undesirable.
int* data; // initialized somewhere
std::atomic<size_t> size = 0;
std::atomic<bool> reading = false;
std::atomic<int> num_writers = 0;
std::mutex entering;
std::mutex leaving;
std::condition_variable cv;
void write(int x) {
    if (reading) {
        if (num_writers == 0) {
            std::lock_guard l(leaving);
            cv.notify_one();
        }
        { std::lock_guard l(entering); }
    }
    ++num_writers;
    auto slot = size.fetch_add(1, std::memory_order_acquire);
    data[slot] = x;
    --num_writers;
    if (reading && num_writers == 0) {
        std::lock_guard l(leaving);
        cv.notify_one();
    }
}
int* read() {
    int* other_data = new int[capacity];
    std::unique_lock enter_lock(entering);
    reading = true;
    std::unique_lock leave_lock(leaving);
    cv.wait(leave_lock, [] () { return num_writers == 0; });
    swap(data, other_data);
    size = 0;
    reading = false;
    return other_data;
}
It's a bit complicated and took me some time to work out, but I think this should serve the purpose pretty well.
In the common case where only writing is happening, reading is always false. So you do the usual, and pay for two additional atomic increments and two untaken branches. The common path therefore does not need to lock any mutexes, unlike the solution involving a shared mutex, which is supposedly expensive: http://permalink.gmane.org/gmane.comp.lib.boost.devel/211180.
Now, suppose read is called. The expensive, slow heap allocation happens first; meanwhile writing continues uninterrupted. Next, the entering lock is acquired, which has no immediate effect. Now, reading is set to true. Immediately, any new calls to write enter the first branch, and eventually hit the entering lock, which they are unable to acquire (as it's already taken), and those threads then get put to sleep.
Meanwhile, the read thread is now waiting on the condition that the number of writers is 0. If we're lucky, this could actually go through right away. If however there are threads in write in either of the two locations between incrementing and decrementing num_writers, then it will not. Each time a write thread decrements num_writers, it checks if it has reduced that number to zero, and when it does it will signal the condition variable. Because num_writers is atomic which prevents various reordering shenanigans, it is guaranteed that the last thread will see num_writers == 0; it could also be notified more than once but this is ok and cannot result in bad behavior.
Once that condition variable has been signalled, that shows that all writers are either trapped in the first branch or are done modifying the array. So the read thread can now safely swap the data, and then unlock everything, and then return what it needs to.
As mentioned before, in typical operation there are no locks, just increments and untaken branches. Even when a read does occur, the read thread will have one lock and one condition variable wait, whereas a typical write thread will have about one lock/unlock of a mutex and that's all (one, or a small number of write threads, will also perform a condition variable notification).

Mutex Safety with Interrupts (Embedded Firmware)

Edit: @Mike pointed out that my try_lock function in the code below is unsafe and that accessor creation can produce a race condition as well. The suggestions (from everyone) have convinced me that I'm going down the wrong path.
Original Question
The requirements for locking on an embedded microcontroller are different enough from multithreading that I haven't been able to convert multithreading examples to my embedded applications. Typically I don't have an OS or threads of any kind, just main and whatever interrupt functions are called by the hardware periodically.
It's pretty common that I need to fill up a buffer from an interrupt, but process it in main. I've created the IrqMutex class below to try to safely implement this. Each person trying to access the buffer is assigned a unique id through IrqMutexAccessor, then they each can try_lock() and unlock(). The idea of a blocking lock() function doesn't work from interrupts because unless you allow the interrupt to complete, no other code can execute so the unlock() code never runs. I do however use a blocking lock from the main() code occasionally.
However, I know that the double-check lock doesn't work without C++11 memory barriers (which aren't available on many embedded platforms). Honestly despite reading quite a bit about it, I don't really understand how/why the memory access reordering can cause a problem. I think that the use of volatile sig_atomic_t (possibly combined with the use of unique IDs) makes this different from the double-check lock. But I'm hoping someone can: confirm that the following code is correct, explain why it isn't safe, or offer a better way to accomplish this.
class IrqMutex {
    friend class IrqMutexAccessor;
    std::sig_atomic_t accessorIdEnum;
    volatile std::sig_atomic_t owner;
    std::sig_atomic_t nextAccessor(void) { return ++accessorIdEnum; }
    bool have_lock(std::sig_atomic_t accessorId) {
        return (owner == accessorId);
    }
    bool try_lock(std::sig_atomic_t accessorId) {
        // Only try to get a lock, while it isn't already owned.
        while (owner == SIG_ATOMIC_MIN) {
            // <-- If an interrupt occurs here, both attempts can get a lock at the same time.
            // Try to take ownership of this Mutex.
            owner = accessorId; // SET
            // Double check that we are the owner.
            if (owner == accessorId) return true;
            // Someone else must have taken ownership between CHECK and SET.
            // If they released it after CHECK, we'll loop back and try again.
            // Otherwise someone else has a lock and we have failed.
        }
        // This shouldn't happen unless they called try_lock on something they already owned.
        if (owner == accessorId) return true;
        // If someone else owns it, we failed.
        return false;
    }
    bool unlock(std::sig_atomic_t accessorId) {
        // Double check that the owner called this function (not strictly required)
        if (owner == accessorId) {
            owner = SIG_ATOMIC_MIN;
            return true;
        }
        // We still return true if the mutex was unlocked anyway.
        return (owner == SIG_ATOMIC_MIN);
    }
public:
    IrqMutex(void) : accessorIdEnum(SIG_ATOMIC_MIN), owner(SIG_ATOMIC_MIN) {}
};
// This class is used to manage our unique accessorId.
class IrqMutexAccessor {
    friend class IrqMutex;
    IrqMutex& mutex;
    const std::sig_atomic_t accessorId;
public:
    IrqMutexAccessor(IrqMutex& m) : mutex(m), accessorId(m.nextAccessor()) {}
    bool have_lock(void) { return mutex.have_lock(accessorId); }
    bool try_lock(void) { return mutex.try_lock(accessorId); }
    bool unlock(void) { return mutex.unlock(accessorId); }
};
Because there is one processor, and no threading the mutex serves what I think is a subtly different purpose than normal. There are two main use cases I run into repeatedly.
The interrupt is a Producer and takes ownership of a free buffer and loads it with a packet of data. The interrupt/Producer may keep its ownership lock for a long time spanning multiple interrupt calls. The main function is the Consumer and takes ownership of a full buffer when it is ready to process it. The race condition rarely happens, but if the interrupt/Producer finishes with a packet and needs a new buffer, but they are all full it will try to take the oldest buffer (this is a dropped packet event). If the main/Consumer started to read and process that oldest buffer at exactly the same time they would trample all over each other.
The interrupt is just a quick change or increment of something (like a counter). However, if we want to reset the counter or jump to some new value with a call from the main() code we don't want to try to write to the counter as it is changing. Here main actually does a blocking loop to obtain a lock; however, I think it's almost impossible to have to actually wait here for more than two attempts. Once it has a lock, any calls to the counter interrupt will be skipped, but that's generally not a big deal for something like a counter. Then I update the counter value and unlock it so it can start incrementing again.
I realize these two samples are dumbed down a bit, but some version of these patterns occurs in many of the peripherals in every project I work on and I'd like one piece of reusable code that can safely handle this across various embedded platforms. I included the C tag, because all of this is directly convertible to C code, and on some embedded compilers that's all that is available. So I'm trying to find a general method that is guaranteed to work in both C and C++.
struct ExampleCounter {
volatile long long int value;
IrqMutex mutex;
} exampleCounter;
struct ExampleBuffer {
volatile char data[256];
volatile size_t index;
IrqMutex mutex; // One mutex per buffer.
} exampleBuffers[2];
const volatile char * const REGISTER;
// This accessor shouldn't be created in an interrupt or a race condition can occur.
static IrqMutexAccessor myMutex(exampleCounter.mutex);
void __irqQuickFunction(void) {
    // Obtain a lock, add the data then unlock all within one function call.
    if (myMutex.try_lock()) {
        exampleCounter.value++;
        myMutex.unlock();
    } else {
        // If we failed to obtain a lock, we skipped this update this one time.
    }
}
// These accessors shouldn't be created in an interrupt or a race condition can occur.
static IrqMutexAccessor myMutexes[2] = {
    IrqMutexAccessor(exampleBuffers[0].mutex),
    IrqMutexAccessor(exampleBuffers[1].mutex)
};
void __irqLongFunction(void) {
    static size_t bufferIndex = 0;
    // Check if we have a lock.
    if (!myMutexes[bufferIndex].have_lock() and !myMutexes[bufferIndex].try_lock()) {
        // If we can't get a lock try the other buffer
        bufferIndex = (bufferIndex + 1) % 2;
        // One buffer should always be available so the next line should always be successful.
        if (!myMutexes[bufferIndex].try_lock()) return;
    }
    // ... at this point we know we have a lock ...
    // Get data from the hardware and modify the buffer here.
    const char c = *REGISTER;
    exampleBuffers[bufferIndex].data[exampleBuffers[bufferIndex].index++] = c;
    // We may keep the lock for multiple function calls until the end of packet.
    static const char END_PACKET_SIGNAL = '\0';
    if (c == END_PACKET_SIGNAL) {
        // Unlock this buffer so it can be read from main.
        myMutexes[bufferIndex].unlock();
        // Switch to the other buffer for next time.
        bufferIndex = (bufferIndex + 1) % 2;
    }
}
int main(void) {
    while (true) {
        // Mutex for counter
        static IrqMutexAccessor myCounterMutex(exampleCounter.mutex);
        // Change counter value
        // Skip any updates that occur while we are updating the counter.
        while(!myCounterMutex.try_lock()) {
            // Wait for the interrupt to release its lock.
        }
        // Set the counter to a new value.
        exampleCounter.value = 500;
        // Updates will start again as soon as we unlock it.
        myCounterMutex.unlock();
        // Mutexes for __irqLongFunction.
        static IrqMutexAccessor myBufferMutexes[2] = {
            IrqMutexAccessor(exampleBuffers[0].mutex),
            IrqMutexAccessor(exampleBuffers[1].mutex)
        };
        // Process buffers from __irqLongFunction.
        for (size_t i = 0; i < 2; i++) {
            // Obtain a lock so we can read the data.
            if (!myBufferMutexes[i].try_lock()) continue;
            // Check that the buffer isn't empty.
            if (exampleBuffers[i].index == 0) {
                myBufferMutexes[i].unlock(); // Don't forget to unlock.
                continue;
            }
            // ... read and do something with the data here ...
            exampleBuffers[i].index = 0;
            myBufferMutexes[i].unlock();
        }
    }
}
Also note that I used volatile on any variable that is read-by or written-by the interrupt routine (unless the variable was only accessed from the interrupt like the static bufferIndex value in __irqLongFunction). I've read that mutexes remove some of the need for volatile in multithreaded code, but I don't think that applies here. Did I use the right amount of volatile? I used it on: ExampleBuffer[].data[256], ExampleBuffer[].index, and ExampleCounter.value.
I apologize for the long answer, but perhaps it is fitting for a long question.
To answer your first question, I would say that your implementation of IrqMutex is not safe. Let me try to explain where I see problems.
Function nextAccessor
std::sig_atomic_t nextAccessor(void) { return ++accessorIdEnum; }
This function has a race condition, because the increment operator is not atomic, even though it operates on a sig_atomic_t value (that type only makes individual reads and writes atomic, not a read-modify-write). The increment involves 3 operations: reading the current value of accessorIdEnum, incrementing it, and writing the result back. If two IrqMutexAccessors are created at the same time, it's possible that they both get the same ID.
Function try_lock
The try_lock function also has a race condition. One thread (eg main), could go into the while loop, and then before taking ownership, another thread (eg an interrupt) can also go into the while loop and take ownership of the lock (returning true). Then the first thread can continue, moving onto owner = accessorId, and thus "also" take the lock. So two threads (or your main thread and an interrupt) can try_lock on an unowned mutex at the same time and both return true.
Disabling interrupts by RAII
We can achieve some level of simplicity and encapsulation by using RAII for interrupt disabling, for example the following class:
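class InterruptLock {
public:
    InterruptLock() {
        prevInterruptState = currentInterruptState();
        // disable interrupts here (platform-specific)
    }
    ~InterruptLock() { /* restore prevInterruptState here (platform-specific) */ }
private:
    int prevInterruptState; // Whatever type this should be for the platform
    InterruptLock(const InterruptLock&); // Not copy-constructible
};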
And I would recommend disabling interrupts to get the atomicity you need within the mutex implementation itself. For example something like:
bool try_lock(std::sig_atomic_t accessorId) {
    InterruptLock lock;
    if (owner == SIG_ATOMIC_MIN) {
        owner = accessorId;
        return true;
    }
    return false;
}
bool unlock(std::sig_atomic_t accessorId) {
    InterruptLock lock;
    if (owner == accessorId) {
        owner = SIG_ATOMIC_MIN;
        return true;
    }
    return false;
}
Depending on your platform, this might look different, but you get the idea.
As you said, this provides a platform to abstract away from the disabling and enabling interrupts in general code, and encapsulates it to this one class.
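To make the idea concrete on a desktop compiler, here is a sketch that can be compiled and stepped through on a host machine. Everything platform-related is an assumption: g_interruptsEnabled, currentInterruptState and setInterruptState are stand-ins for whatever intrinsics or registers your microcontroller provides, and the accessor IDs are plain ints with 0 meaning "unowned" instead of SIG_ATOMIC_MIN.

```cpp
#include <cassert>

// Hypothetical stand-ins for platform intrinsics. On real hardware these would
// read and write the CPU's global interrupt-enable state.
static bool g_interruptsEnabled = true;
inline bool currentInterruptState() { return g_interruptsEnabled; }
inline void setInterruptState(bool enabled) { g_interruptsEnabled = enabled; }

// RAII guard: interrupts are disabled for the lifetime of the object and the
// previous state is restored afterwards, so guards can be nested safely.
class InterruptLock {
public:
    InterruptLock() : prevInterruptState(currentInterruptState()) {
        setInterruptState(false);
    }
    ~InterruptLock() { setInterruptState(prevInterruptState); }
    InterruptLock(const InterruptLock&) = delete;
    InterruptLock& operator=(const InterruptLock&) = delete;
private:
    bool prevInterruptState;
};

// Check-and-set happens with interrupts disabled, so an interrupt cannot
// sneak in between the comparison and the assignment.
class IrqMutex {
public:
    bool try_lock(int accessorId) {
        InterruptLock lock;
        if (owner != kUnowned) return false;
        owner = accessorId;
        return true;
    }
    bool unlock(int accessorId) {
        InterruptLock lock;
        if (owner != accessorId) return false;
        owner = kUnowned;
        return true;
    }
private:
    static constexpr int kUnowned = 0;
    volatile int owner = kUnowned;
};
```

Restoring the previous state in the destructor, rather than unconditionally re-enabling, is what keeps the guard correct when it is used from code that already runs with interrupts off.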
Mutexes and Interrupts
Having said how I would consider implementing the mutex class, I would not actually use a mutex class for your use-cases. As you pointed out, mutexes don't really play well with interrupts, because an interrupt can't "block" on trying to acquire a mutex. For this reason, for code that directly exchanges data with an interrupt, I would instead strongly consider just directly disabling interrupts (for a very short time while the main "thread" touches the data).
So your counter might simply look like this:
volatile long long int exampleCounter;
void __irqQuickFunction(void) {
    exampleCounter++;
}
// Change counter value
{
    InterruptLock lock;
    exampleCounter = 500;
}
In my mind, this is easier to read, easier to reason about, and won't "slip" when there's contention (ie miss timer beats).
Regarding the buffer use-case, I would strongly recommend against holding a lock for multiple interrupt cycles. A lock/mutex should be held for just the slightest moment required to "touch" a piece of memory - just long enough to read or write it. Get in, get out.
So this is how the buffering example might look:
struct ExampleBuffer {
char data[256];
} exampleBuffers[2];
ExampleBuffer* volatile bufferAwaitingConsumption = nullptr;
ExampleBuffer* volatile freeBuffer = &exampleBuffers[1];
const volatile char * const REGISTER;
void __irqLongFunction(void) {
    static const char END_PACKET_SIGNAL = '\0';
    static size_t index = 0;
    static ExampleBuffer* receiveBuffer = &exampleBuffers[0];
    // Get data from the hardware and modify the buffer here.
    const char c = *REGISTER;
    receiveBuffer->data[index++] = c;
    // End of packet?
    if (c == END_PACKET_SIGNAL) {
        // Make the packet available to the consumer
        bufferAwaitingConsumption = receiveBuffer;
        // Move on to the next buffer
        receiveBuffer = freeBuffer;
        freeBuffer = nullptr;
        index = 0;
    }
}
int main(void) {
    while (true) {
        // Fetch packet from shared variable
        ExampleBuffer* packet;
        {
            InterruptLock lock;
            packet = bufferAwaitingConsumption;
            bufferAwaitingConsumption = nullptr;
        }
        if (packet) {
            // ... read and do something with the data here ...
            // Once we're done with the buffer, we need to release it back to the producer
            InterruptLock lock;
            freeBuffer = packet;
        }
    }
}
This code is arguably easier to reason about, since there are only two memory locations shared between the interrupt and the main loop: one to pass packets from the interrupt to the main loop, and one to pass empty buffers back to the interrupt. We also only touch those variables under "lock", and only for the minimum time needed to "move" the value. (for simplicity I've skipped over the buffer overflow logic when the main loop takes too long to free the buffer).
It's true that in this case one may not even need the locks, since we're just reading and writing simple values, but the cost of disabling the interrupts is not much, and the risk of making mistakes otherwise is not worth it in my opinion.
As pointed out in the comments, the above solution was meant to only tackle the multithreading problem, and omitted overflow checking. Here is a more complete solution which should be robust under overflow conditions:
const size_t BUFFER_COUNT = 2;
struct ExampleBuffer {
char data[256];
ExampleBuffer* next;
} exampleBuffers[BUFFER_COUNT];
volatile size_t overflowCount = 0;
class BufferList {
public:
    BufferList() : first(nullptr), last(nullptr) { }
    // Atomic enqueue
    void enqueue(ExampleBuffer* buffer) {
        InterruptLock lock;
        buffer->next = nullptr; // the new tail has no successor
        if (last)
            last->next = buffer;
        else {
            first = buffer;
        }
        last = buffer;
    }
    // Atomic dequeue (or returns null)
    ExampleBuffer* dequeueOrNull() {
        InterruptLock lock;
        ExampleBuffer* result = first;
        if (first) {
            first = first->next;
            if (!first)
                last = nullptr;
        }
        return result;
    }
private:
    ExampleBuffer* first;
    ExampleBuffer* last;
} freeBuffers, buffersAwaitingConsumption;
const volatile char * const REGISTER;
void __irqLongFunction(void) {
    static const char END_PACKET_SIGNAL = '\0';
    static size_t index = 0;
    static ExampleBuffer* receiveBuffer = &exampleBuffers[0];
    // Recovery from overflow?
    if (!receiveBuffer) {
        // Try get another free buffer
        receiveBuffer = freeBuffers.dequeueOrNull();
        // Still no buffer?
        if (!receiveBuffer) {
            overflowCount++;
            return;
        }
    }
    // Get data from the hardware and modify the buffer here.
    const char c = *REGISTER;
    if (index < sizeof(receiveBuffer->data))
        receiveBuffer->data[index++] = c;
    // End of packet, or out of space?
    if (c == END_PACKET_SIGNAL || index >= sizeof(receiveBuffer->data)) {
        // Make the packet available to the consumer
        buffersAwaitingConsumption.enqueue(receiveBuffer);
        // Move on to the next free buffer
        receiveBuffer = freeBuffers.dequeueOrNull();
        index = 0;
    }
}
size_t getAndResetOverflowCount() {
    InterruptLock lock;
    size_t result = overflowCount;
    overflowCount = 0;
    return result;
}
int main(void) {
    // exampleBuffers[0] starts out as the interrupt's receive buffer; all others are free at the start
    for (size_t i = 1; i < BUFFER_COUNT; i++)
        freeBuffers.enqueue(&exampleBuffers[i]);
    while (true) {
        // Fetch packet from shared variable
        ExampleBuffer* packet = buffersAwaitingConsumption.dequeueOrNull();
        if (packet) {
            // ... read and do something with the data here ...
            // Once we're done with the buffer, we need to release it back to the producer
            freeBuffers.enqueue(packet);
        }
        size_t overflowBytes = getAndResetOverflowCount();
        if (overflowBytes) {
            // ...
        }
    }
}
The key changes:
If the interrupt runs out of free buffers, it will recover
If the interrupt receives data while it doesn't have a receive buffer, it will communicate that to the main thread via getAndResetOverflowCount
If you keep getting buffer overflows, you can simply increase the buffer count
I've encapsulated the multithreaded access into a queue class implemented as a linked list (BufferList), which supports atomic dequeue and enqueue. The previous example also used queues, but of length 0-1 (either an item is enqueued or it isn't), and so the implementation of the queue was just a single variable. In the case of running out of free buffers, the receive queue could have 2 items, so I upgraded it to a proper queue rather than adding more shared variables.
If the interrupt is the producer and mainline code is the consumer, surely it's as simple as disabling the interrupt for the duration of the consume operation?
That's how I used to do it in my embedded microcontroller days.
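As a rough illustration of the "disable the interrupt while consuming" idea, here is a minimal sketch. The names `disableInterrupts`, `restoreInterrupts`, `ScopedInterruptDisable`, `sharedData` and `consume` are all hypothetical, and the two hooks merely toggle a flag so the code can be compiled on a desktop; on a real target they would wrap the platform's interrupt mask instructions and the shared variables would normally be `volatile`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical platform hooks (assumption): on real hardware these would
// mask and unmask the relevant interrupt. Here they only toggle a flag.
static bool g_interruptsEnabled = true;

inline bool disableInterrupts() {          // returns the previous state
    bool was = g_interruptsEnabled;
    g_interruptsEnabled = false;
    return was;
}

inline void restoreInterrupts(bool was) { g_interruptsEnabled = was; }

// RAII guard: interrupts stay masked for the lifetime of the object.
class ScopedInterruptDisable {
public:
    ScopedInterruptDisable() : wasEnabled_(disableInterrupts()) {}
    ~ScopedInterruptDisable() { restoreInterrupts(wasEnabled_); }
    ScopedInterruptDisable(const ScopedInterruptDisable&) = delete;
    ScopedInterruptDisable& operator=(const ScopedInterruptDisable&) = delete;
private:
    bool wasEnabled_;
};

// Data shared between the (simulated) interrupt handler and mainline code.
static char sharedData[16];
static std::size_t sharedLength = 0;

// Mainline consumer: copy the data out while the producer cannot run.
std::size_t consume(char* out, std::size_t outSize) {
    ScopedInterruptDisable guard;
    assert(!g_interruptsEnabled);          // the producer is locked out here
    std::size_t n = sharedLength < outSize ? sharedLength : outSize;
    for (std::size_t i = 0; i < n; ++i)
        out[i] = sharedData[i];
    sharedLength = 0;                      // mark the buffer as consumed
    return n;
}
```

The trade-off is that the interrupt stays masked for the whole copy, so the critical section should be kept short if the hardware cannot hold incoming data for that long.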

Lazy loaded data in multithreaded environment

I have a struct like this:
struct Chunk
Chunk* mParent;
Chunk* mSubLevels;
Int16 mDepth;
Int16 mIndex;
Reference<ValueType> mFirstItem;
Reference<ValueType> mLastItem;
mSubLevels = nullptr;
mFirstItem = nullptr;
mLastItem = nullptr;
~Chunk() {}
mSubLevels in a chunk is null until first access. On first access to mSubLevels I create an array of chunks for mSubLevels and fill in the other members. Because multiple threads work with chunks, I do this under a mutex, so creation of new chunks is protected. After this process there are no further writes to these chunks and they are read-only data, so threads access them without any mutex.
To be precise, I have several methods. One of them checks mSubLevels on first access and, if it is null, creates the required data under a mutex. The other methods are read-only and don't change the structure, so I don't use any mutex in them. (There isn't any acquire/release ordering between the thread that creates chunks and the threads that read them.)
Now can I use regular data types, or must I use atomic types?
Edit 2:
For creating data i use double checked locking:
(This is a function that will create new chunks)
Chunk* lTargetChunk = ...;
if (!std::atomic_load(lTargetChunk->mSubLevels, std::memory_order_relaxed))
std::lock_guard lGaurd(mMutex);
if (!std::atomic_load(lTargetChunk->mSubLevels, std::memory_order_relaxed))
Chunk* lChunks = new Chunk[mLevelSizes[l]];
for (UINT32 i = 0; i < mLevelSizes[l]; ++i)
Chunk* lCurrentChunk = &lChunks[i];
lCurrentChunk->mParent = lTargetChunk;
lCurrentChunk->mDepth = lDepth - 1;
lCurrentChunk->mIndex = i;
std::atomic_store(lCurrentChunk->mSubLevels, (Chunk*)bcNULL, std::memory_order_relaxed);
bcAtomicOperation::bcAtomicStore(lTargetChunk->mSubLevels, lChunks, std::memory_order_release);
For a moment, imagine that I don't use atomic ops for mSubLevels.
I have some other methods that only read these chunks, without any mutex:
bcInline Chunk* _getSuccessorChunk(const Chunk* pChunk)
// If pChunk->mSubLevels isn't null do this operation.
const Chunk* lChunk = &pChunk->mSubLevels[0];
Chunk* lNextChunk;
if (lChunk->mIndex != mLevelSizes[lChunk->mDepth] - 1)
lNextChunk = lChunk + 1;
return lNextChunk;
else ...
As you can see, I access mSubLevels, mIndex and some other members. This function doesn't use any mutex, so if the writer thread doesn't flush its cache to main memory, a thread running this function won't see the changes. If I lock mMutex in this function, I think the problem will be solved (the writer thread and reader threads will be synchronized via the atomic operations inside the mutex). Now suppose I use an atomic op for mSubLevels in the first function (as I have written) and use 'acquire' to load it in the second function:
bcInline Chunk* _getSuccessorChunk(const Chunk* pChunk)
// If pChunk->mSubLevels isn't null do this operation.
const Chunk* lChunk = &std::atomic_load(pChunk->mSubLevels, std::memory_order_acquire)[0];
Chunk* lNextChunk;
if (lChunk->mIndex != mLevelSizes[lChunk->mDepth] - 1)
lNextChunk = lChunk + 1;
return lNextChunk;
else ...
Reader threads will then see the changes from the writer thread and no cache coherence problem will happen. Is this statement true?
Your problem goes farther than just cache coherence. It's about correctness. What you're doing is a case of double checked locking.
It is problematic insofar as one thread may see mSubLevels being null and allocate a new object. While this is happening, another thread may concurrently access mSubLevels, see that it is null, and allocate an object as well. What now? Which one is the "correct" object to be assigned to the pointer? Will you just leak one object, or what do you do with the other one? How do you detect this condition at all?
To solve this issue, you must either lock (i.e. use a mutex) before checking the value, or you must do some kind of atomic operation that lets you distinguish a null object from a still-invalid object that is being created and from a valid object (such as an atomic compare-exchange with (Chunk*)1, which would basically be something like a micro-spinlock, except you're not spinning).
So in short, yes, you must at least use atomic ops for this, or even a mutex. Using "normal" data types won't work.
For everything else where you only have readers and no writers, you can use regular types, it will work just fine.
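The compare-exchange idea described above could look roughly like the sketch below. It is only an illustration under assumptions: `Chunk` is reduced to two members, mSubLevels becomes a `std::atomic<Chunk*>` named `subLevels`, and `buildingMarker` and `tryGetSubLevels` are made-up names. A thread that loses the race gets nullptr back and has to retry later.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Simplified stand-in for the Chunk in the question (assumption).
struct Chunk {
    std::atomic<Chunk*> subLevels{nullptr};
    int index = 0;
};

// Sentinel meaning "another thread is building the array right now".
// The address 1 is never dereferenced; it is only compared against.
inline Chunk* buildingMarker() {
    return reinterpret_cast<Chunk*>(std::uintptr_t{1});
}

// Returns the sub-level array if it is ready, or nullptr if another thread
// is still building it (the caller may retry or do other work meanwhile).
Chunk* tryGetSubLevels(Chunk& parent, int count) {
    assert(count > 0);
    Chunk* current = parent.subLevels.load(std::memory_order_acquire);
    if (current == nullptr) {
        // Try to claim the right to build: nullptr -> marker.
        if (parent.subLevels.compare_exchange_strong(
                current, buildingMarker(),
                std::memory_order_acquire, std::memory_order_acquire)) {
            Chunk* fresh = new Chunk[count];
            for (int i = 0; i < count; ++i)
                fresh[i].index = i;
            // Release: the writes above become visible to any thread that
            // later observes this pointer with an acquire load.
            parent.subLevels.store(fresh, std::memory_order_release);
            return fresh;
        }
        // The exchange failed, so 'current' now holds what another thread
        // stored: either the marker or the finished array.
    }
    return current == buildingMarker() ? nullptr : current;
}
```

Only the thread whose compare-exchange succeeds allocates, so nothing is leaked, and the release store paired with the acquire loads is what makes the filled-in members visible to readers.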
There are two issues you need to overcome here:
You cannot afford reading without the array being created, obviously
For efficiency reasons, you probably do not want to create the array multiple times
I would suggest simply using a reader-writer mutex
The basic idea is:
lock in reader mode
check if the data is ready
if not ready, upgrade the lock to writer mode
check if the data is ready (it might have been prepared by another writer) and if not prepare it
release the lock in writer mode (keep the lock in reader mode)
do things with the data
release the lock in reader mode
There are some issues with this design (specifically the contention that occurs during initialization), but it has the advantage of being dead simple.
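The steps above can be sketched with C++17's `std::shared_mutex`. One caveat: `std::shared_mutex` has no upgrade or downgrade operation, so this sketch releases the shared lock, takes the exclusive lock, re-checks, and then reads under the exclusive lock on the initialising path. The class `LazyLevels`, its members and the squared-index payload are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <shared_mutex>
#include <vector>

class LazyLevels {
public:
    static constexpr std::size_t kSize = 8;

    // Returns element i, building the array on first use.
    int get(std::size_t i) {
        assert(i < kSize);
        {
            // Fast path: many readers may hold the shared lock at once.
            std::shared_lock<std::shared_mutex> readLock(mutex_);
            if (ready_)
                return data_[i];
        }   // shared lock released here (no in-place upgrade available)

        // Slow path: exclusive lock, then check again, because another
        // thread may have prepared the data in the meantime.
        std::unique_lock<std::shared_mutex> writeLock(mutex_);
        if (!ready_) {
            data_.resize(kSize);
            for (std::size_t k = 0; k < kSize; ++k)
                data_[k] = static_cast<int>(k * k);
            ++buildCount_;
            ready_ = true;
        }
        return data_[i];
    }

    // How many times the array was built (expected to stay at 1).
    int buildCount() {
        std::shared_lock<std::shared_mutex> readLock(mutex_);
        return buildCount_;
    }

private:
    std::shared_mutex mutex_;
    bool ready_ = false;
    int buildCount_ = 0;
    std::vector<int> data_;
};
```

Every read still takes the shared lock, which is the price of the simplicity; the second check under the exclusive lock is what prevents the array from being created twice.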

Does a multiple producer single consumer lock-free queue exist for c++? [closed]

The more I read the more confused I become... I would have thought it trivial to find a formally correct MPSC queue implemented in C++.
Every time I find another stab at it, further research seems to suggest there are issues such as ABA or other subtle race conditions.
Many talk of the need for garbage collection. This is something I want to avoid.
Is there an accepted correct open-source implementation out there?
You may want to check disruptor; it's available in C++ here: http://lmax-exchange.github.io/disruptor/
You can also find an explanation of how it works here on Stack Overflow. Basically, it's a circular buffer with no locking, optimized for passing FIFO messages between threads in fixed-size slots.
Here are two implementations which I found useful: "Lock-free Multi-producer Multi-consumer Queue on Ring Buffer" (NatSys Lab. Blog) and "Yet another implementation of a lock-free circular array queue" (CodeProject).
NOTE: the code below is incorrect, I leave it only as an example how tricky these things can be.
If you don't like the complexity of the Google version, here is something similar from me. It's much simpler, but I leave it as an exercise to the reader to make it work (it's part of a larger project and not portable at the moment). The whole idea is to maintain a circular buffer for data and a small set of counters to identify slots for writing/written and reading/read. Since each counter is in its own cache line, and (normally) each is only atomically updated once in the life of a message, they can all be read without any synchronisation. There is one potential contention point between writing threads in post_done; it's required for the FIFO guarantee. The counters (head_, wrtn_, rdng_, tail_) were selected to ensure correctness and FIFO, so dropping FIFO would also require a change of counters (and that might be difficult to do without sacrificing correctness). It is possible to slightly improve performance for scenarios with one consumer, but I would not bother: you would have to undo it if other use cases with multiple readers are found.
On my machine the latency looks as follows (percentile on the left, mean within this percentile on the right, unit is microseconds, measured by rdtsc):
total=1000000 samples, avg=0.24us
50%=0.214us, avg=0.093us
90%=0.23us, avg=0.151us
99%=0.322us, avg=0.159us
99.9%=15.566us, avg=0.173us
These results are for a single polling consumer, i.e. a worker thread calling wheel.read() in a tight loop and checking whether the slot is empty (scroll to the bottom for an example). Waiting consumers (much lower CPU utilization) would wait on an event (one of the acquire... functions); this adds about 1-2us to the average latency due to the context switch.
Since there is very little contention on read, consumers scale very well with the number of worker threads, e.g. for 3 threads on my machine:
total=1500000 samples, avg=0.07us
50%=0us, avg=0us
90%=0.155us, avg=0.016us
99%=0.361us, avg=0.038us
99.9%=8.723us, avg=0.044us
Patches will be welcome :)
// Copyright (c) 2011-2012, Bronislaw (Bronek) Kozicki
// Distributed under the Boost Software License, Version 1.0. (See accompanying
// file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
#pragma once
#include <core/api.hxx>
#include <core/wheel/exception.hxx>
#include <boost/noncopyable.hpp>
#include <boost/type_traits.hpp>
#include <boost/lexical_cast.hpp>
#include <typeinfo>
namespace core { namespace wheel
struct bad_size : core::exception
template<typename T> explicit bad_size(const T&, size_t m)
: core::exception(std::string("Slot capacity exceeded, sizeof(")
+ typeid(T).name()
+ ") = "
+ boost::lexical_cast<std::string>(sizeof(T))
+ ", capacity = "
+ boost::lexical_cast<std::string>(m)
// inspired by Disruptor
template <typename Header>
class wheel : boost::noncopyable
struct slot_detail
// slot write: (memory barrier in wheel) > post_done > (memory barrier in wheel)
// slot read: (memory barrier in wheel) > read_done > (memory barrier in wheel)
// done writing or reading, must update wrtn_ or tail_ in wheel, as appropriate
template <bool Writing>
void done(wheel* w)
if (Writing)
// cache line for sequence number and header
long long sequence;
Header header;
// there is no such thing as data type with variable size, but we need it to avoid thrashing
// cache - so we invent one. The memory is reserved in runtime and we simply go beyond last element.
// This is well into UB territory! Using template parameter for this is not good, since it
// results in this small implementation detail leaking to all possible user interfaces.
char data[8];
// use this as a storage space for slot_detail, to guarantee 64 byte alignment
struct slot_block { long long padding[8]; };
// wrap slot data to outside world
template <bool Writable>
class slot
template<typename> friend class wheel;
slot& operator=(const slot&); // moveable but non-assignable
// may only be constructed by wheel
slot(slot_detail* impl, wheel<Header>* w, size_t c)
: slot_(impl) , wheel_(w) , capacity_(c)
slot(slot&& s)
: slot_(s.slot_) , wheel_(s.wheel_) , capacity_(s.capacity_)
s.slot_ = NULL;
if (slot_)
// slot accessors - use Header to store information on what type is actually stored in data
bool empty() const { return !slot_; }
long long sequence() const { return slot_->sequence; }
Header& header() { return slot_->header; }
char* data() { return slot_->data; }
template <typename T> T& cast()
static_assert(boost::is_pod<T>::value, "Data type must be POD");
if (sizeof(T) > capacity_)
throw bad_size(T(), capacity_);
if (empty())
throw no_data();
return *((T*) data());
slot_detail* slot_;
wheel<Header>* wheel_;
const size_t capacity_;
// dynamic size of slot, with extra capacity, expressed in 64 byte blocks
static size_t sizeof_slot(size_t s)
size_t m = sizeof(slot_detail);
// add capacity less 8 bytes already within sizeof(slot_detail)
m += max(8, s) - 8;
// round up to 64 bytes, i.e. alignment of slot_detail
size_t r = m & ~(unsigned int)63;
if (r < m)
r += 64;
r /= 64;
return r;
// calculate actual slot capacity back from number of 64 byte blocks
static size_t slot_capacity(size_t s)
return s*64 - sizeof(slot_detail) + 8;
// round up to power of 2
static size_t round_size(size_t s)
// enforce minimum size
if (s <= min_size)
return min_size;
// find rounded value
size_t r = 1;
while (s)
s >>= 1;
r <<= 1;
return r;
slot_detail& at(long long sequence)
// find index from sequence number and return slot at found index of the wheel
return *((slot_detail*) &wheel_[(sequence & (size_ - 1)) * blocks_]);
wheel(size_t capacity, size_t size)
: head_(0) , wrtn_(0) , rdng_(0) , tail_(0) , event_()
, blocks_(sizeof_slot(capacity)) , capacity_(slot_capacity(blocks_)) , size_(round_size(size))
static_assert(boost::is_pod<Header>::value, "Header type must be POD");
static_assert(sizeof(slot_block) == 64, "This was unexpected");
wheel_ = new slot_block[size_ * blocks_];
// all slots must be initialised to 0
memset(wheel_, 0, size_ * 64 * blocks_);
active_ = 1;
delete[] wheel_;
// all accessors needed
size_t capacity() const { return capacity_; } // capacity of a single slot
size_t size() const { return size_; } // number of slots available
size_t queue() const { return (size_t)head_ - (size_t)tail_; }
bool active() const { return active_ == 1; }
// enough to call it just once, to fine tune slot capacity
template <typename T>
void check() const
static_assert(boost::is_pod<T>::value, "Data type must be POD");
if (sizeof(T) > capacity_)
throw bad_size(T(), capacity_);
// stop the wheel - safe to execute many times
size_t stop()
InterlockedExchange(&active_, 0);
// must wait for current read to complete
while (rdng_ != tail_)
return size_t(head_ - tail_);
// return first available slot for write
slot<true> post()
if (!active_)
throw stopped();
// the only memory barrier on head seq. number we need, if not overflowing
long long h = InterlockedIncrement64(&head_);
while(h - (long long) size_ > tail_)
if (InterlockedDecrement64(&head_) == h - 1)
throw overflowing();
// protection against case of race condition when we are overflowing
// and two or more threads try to post and two or more messages are read,
// all at the same time. If this happens we must re-try, otherwise we
// could have skipped a sequence number - causing infinite wait in post_done
h = InterlockedIncrement64(&head_);
slot_detail& r = at(h);
r.sequence = h;
// wrap in writeable slot
return slot<true>(&r, this, capacity_);
// return first available slot for write, nothrow variant
slot<true> post(std::nothrow_t)
if (!active_)
return slot<true>(NULL, this, capacity_);
// the only memory barrier on head seq. number we need, if not overflowing
long long h = InterlockedIncrement64(&head_);
while(h - (long long) size_ > tail_)
if (InterlockedDecrement64(&head_) == h - 1)
return slot<true>(NULL, this, capacity_);
// must retry if race condition described above
h = InterlockedIncrement64(&head_);
slot_detail& r = at(h);
r.sequence = h;
// wrap in writeable slot
return slot<true>(&r, this, capacity_);
// read first available slot for read
slot<false> read()
slot_detail* r = NULL;
// compare rdng_ and wrtn_ early to avoid unnecessary memory barrier
if (active_ && rdng_ < wrtn_)
// the only memory barrier on reading seq. number we need
const long long h = InterlockedIncrement64(&rdng_);
// check if this slot has been written, step back if not
if (h > wrtn_)
r = &at(h);
// wrap in readable slot
return slot<false>(r , this, capacity_);
// waiting for new post, to be used by non-polling clients
void acquire()
bool try_acquire()
return event_.try_acquire();
bool try_acquire(unsigned long timeout)
return event_.try_acquire(timeout);
void release()
void post_done(long long sequence)
const long long t = sequence - 1;
// the only memory barrier on written seq. number we need
while(InterlockedCompareExchange64(&wrtn_, sequence, t) != t)
// this is outside of critical path for polling clients
void read_done()
// the only memory barrier on tail seq. number we need
// each in its own cache line
// head_ - wrtn_ = no. of messages being written at this moment
// rdng_ - tail_ = no. of messages being read at the moment
// head_ - tail_ = no. of messages to read (including those being written and read)
// wrtn_ - rdng_ = no. of messages to read (excluding those being written or read)
__declspec(align(64)) volatile long long head_; // currently writing or written
__declspec(align(64)) volatile long long wrtn_; // written
__declspec(align(64)) volatile long long rdng_; // currently reading or read
__declspec(align(64)) volatile long long tail_; // read
__declspec(align(64)) volatile long active_; // flag switched to 0 when stopped
api::event event_; // set when new message is posted
const size_t blocks_; // number of 64-byte blocks in a single slot_detail
const size_t capacity_; // capacity of data() section per single slot. Initialisation depends on blocks_
const size_t size_; // number of slots available, always power of 2
slot_block* wheel_;
Here is what polling consumer worker thread may look like:
while (wheel.active())
core::wheel::wheel<int>::slot<false> slot = wheel.read();
if (!slot.empty())
Data& d = slot.cast<Data>();
// do work
// uncomment below for waiting consumer, saving CPU cycles
// else
// wheel.try_acquire(10);
Edit: added consumer example.
The most suitable implementation depends on the desired properties of the queue. Should it be unbounded, or is a bounded one fine? Should it be linearizable, or would less strict requirements do? How strong a FIFO guarantee do you need? Are you willing to pay the cost of reversing the list in the consumer (there exists a very simple implementation where the consumer grabs the tail of a singly-linked list, thus getting at once all items put by producers up to that moment)? Should it guarantee that no thread is ever blocked, or is a tiny chance of some thread being blocked OK? And so on.
Some useful links:
Is multiple-producer, single-consumer possible in a lockfree setting?
Hope that helps.
Below is the technique I used for my Cooperative Multi-tasking / Multi-threading library (MACE) http://bytemaster.github.com/mace/. It has the benefit of being lock-free except for when the queue is empty.
struct task {
boost::function<void()> func;
task* next;
boost::mutex task_ready_mutex;
boost::condition_variable task_ready;
boost::atomic<task*> task_in_queue;
// this can be called from any thread
void thread::post_task( task* t ) {
// atomically post the task to the queue.
task* stale_head = task_in_queue.load(boost::memory_order_relaxed);
do { t->next = stale_head;
} while( !task_in_queue.compare_exchange_weak( stale_head, t, boost::memory_order_release ) );
// Because only one thread can post the 'first task', only that thread will attempt
// to acquire the lock and therefore there should be no contention on this lock except
// when *this thread is about to block on a wait condition.
if( !stale_head ) {
boost::unique_lock<boost::mutex> lock(task_ready_mutex);
// this is the consumer thread.
void process_tasks() {
while( !done ) {
// this will atomically pop everything that has been posted so far.
pending = task_in_queue.exchange(0,boost::memory_order_consume);
// pending is a linked list in 'reverse post order', so process them
// from tail to head if you want to maintain order.
if( !pending ) { // lock scope
boost::unique_lock<boost::mutex> lock(task_ready_mutex);
// check one last time while holding the lock before blocking.
if( !task_in_queue ) task_ready.wait( lock );
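For readers without Boost, the lock-free part of this technique can be sketched with `std::atomic` alone. This is not the MACE code: `Node`, `MpscStack`, `push` and `popAllInOrder` are illustrative names, the blocking wait on an empty queue is left out, and the sketch uses acquire ordering where the code above uses consume.

```cpp
#include <atomic>
#include <cassert>

// Intrusive list node; the payload is an int purely for illustration.
struct Node {
    int value;
    Node* next;
};

class MpscStack {
public:
    // Safe to call from any number of producer threads.
    void push(Node* n) {
        assert(n != nullptr);
        Node* old = head_.load(std::memory_order_relaxed);
        do {
            n->next = old;   // link in front of the currently observed head
        } while (!head_.compare_exchange_weak(
                     old, n,
                     std::memory_order_release,
                     std::memory_order_relaxed));
    }

    // Single consumer: detach everything posted so far in one atomic step,
    // then reverse the list so it comes out in posting (FIFO) order.
    Node* popAllInOrder() {
        Node* list = head_.exchange(nullptr, std::memory_order_acquire);
        Node* reversed = nullptr;
        while (list) {
            Node* next = list->next;
            list->next = reversed;
            reversed = list;
            list = next;
        }
        return reversed;
    }

private:
    std::atomic<Node*> head_{nullptr};
};
```

Because the consumer always takes the whole list with a single exchange and never pops individual nodes with a compare-exchange, it does not run into the usual ABA problem of lock-free stacks; the cost is the list reversal on the consumer side.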
I'm guessing no such thing exists - and if it does, it either isn't portable or isn't open source.
Conceptually, you are trying to control two pointers simultaneously: the tail pointer and the tail->next pointer. That can't generally be done with just lock-free primitives.