Lazy-loaded data in a multithreaded environment - C++

I have a struct like this:
struct Chunk
{
public:
    Chunk* mParent;
    Chunk* mSubLevels;
    Int16 mDepth;
    Int16 mIndex;
    Reference<ValueType> mFirstItem;
    Reference<ValueType> mLastItem;

public:
    Chunk()
    {
        mSubLevels = nullptr;
        mFirstItem = nullptr;
        mLastItem = nullptr;
    }

    ~Chunk() {}
};
mSubLevels is null until the first access. On the first access to mSubLevels I create an array of chunks for it and fill in the other members. Because multiple threads work with the chunks, the creation of new chunks is protected by a mutex. After this process there are no further writes to these chunks; they are read-only data, so threads access them without any mutex.
More precisely, I have several methods. In one of them, on the first access to mSubLevels, I check the pointer and, if it is null, create the required data under a mutex. The other methods are read-only and do not change the structure, so I don't use any mutex in them. (There is no acquire/release ordering between the thread that creates the chunks and the threads that read them.)
Now, can I use regular data types, or must I use atomic types?
Edit 2:
For creating the data I use double-checked locking (this is the function that creates new chunks):
Chunk* lTargetChunk = ...;
if (!std::atomic_load_explicit(&lTargetChunk->mSubLevels, std::memory_order_relaxed))
{
    std::lock_guard<std::mutex> lGuard(mMutex);
    if (!std::atomic_load_explicit(&lTargetChunk->mSubLevels, std::memory_order_relaxed))
    {
        Chunk* lChunks = new Chunk[mLevelSizes[l]];
        for (UINT32 i = 0; i < mLevelSizes[l]; ++i)
        {
            Chunk* lCurrentChunk = &lChunks[i];
            lCurrentChunk->mParent = lTargetChunk;
            lCurrentChunk->mDepth = lDepth - 1;
            lCurrentChunk->mIndex = i;
            std::atomic_store_explicit(&lCurrentChunk->mSubLevels, (Chunk*)bcNULL, std::memory_order_relaxed);
        }
        bcAtomicOperation::bcAtomicStore(lTargetChunk->mSubLevels, lChunks, std::memory_order_release);
    }
}
For a moment, imagine that I don't use atomic operations for mSubLevels.
I have some other methods that only read these chunks, without any mutex:
bcInline Chunk* _getSuccessorChunk(const Chunk* pChunk)
{
    // If pChunk->mSubLevels isn't null, do this operation.
    Chunk* lChunk = &pChunk->mSubLevels[0];
    Chunk* lNextChunk;
    if (lChunk->mIndex != mLevelSizes[lChunk->mDepth] - 1)
    {
        lNextChunk = lChunk + 1;
        return lNextChunk;
    }
    else ...
As you can see, I access mSubLevels, mIndex and some other members. In this function I don't use any mutex, so if the writer thread has not made its writes visible, a thread that runs this function may not see the changes. If I used mMutex in this function, I think the problem would be solved (the writer thread and the reader threads would be synchronized via the atomic operations inside the mutex). Now, suppose I use atomic operations for mSubLevels in the first function (as written above) and an 'acquire' load in the second function:
bcInline Chunk* _getSuccessorChunk(const Chunk* pChunk)
{
    // If pChunk->mSubLevels isn't null, do this operation.
    Chunk* lChunk = &std::atomic_load_explicit(&pChunk->mSubLevels, std::memory_order_acquire)[0];
    Chunk* lNextChunk;
    if (lChunk->mIndex != mLevelSizes[lChunk->mDepth] - 1)
    {
        lNextChunk = lChunk + 1;
        return lNextChunk;
    }
    else ...
Reader threads will then see the changes from the writer thread and no cache-coherence problem will occur. Is this statement true?

Your problem goes further than just cache coherence; it's about correctness. What you're doing is a case of double-checked locking.
It is problematic insofar as one thread may see mSubLevels being null and allocate a new object. While this is happening, another thread may concurrently access mSubLevels, see that it is null, and allocate an object as well. What now? Which one is the "correct" object to be assigned to the pointer? Will you just leak one object, or what do you do with the other one? How do you detect this condition at all?
To solve this issue, you must either lock (i.e. use a mutex) before checking the value, or you must do some kind of atomic operation that lets you distinguish a null object from a still-invalid being-created object and a valid object (such as an atomic compare-exchange with (Chunk*)1, which would basically be a micro-spinlock, except you're not spinning).
So in a word: yes, you must at least use atomic operations for this, or even a mutex. Using "normal" data types won't work.
For everything else where you only have readers and no writers, you can use regular types, it will work just fine.
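For illustration, here is a minimal sketch of that double-checked locking pattern using a std::atomic pointer plus a mutex. The names gInitMutex, gSubLevels and getSubLevels are illustrative, not taken from the question:
#include <atomic>
#include <cstddef>
#include <mutex>

struct Chunk { /* members as in the question */ };

std::mutex gInitMutex;                      // guards creation only (illustrative name)
std::atomic<Chunk*> gSubLevels{ nullptr };  // the lazily created array (illustrative name)

Chunk* getSubLevels(std::size_t count)
{
    // Fast path: acquire-load, so a reader that sees a non-null pointer
    // also sees the fully constructed array it points to.
    Chunk* p = gSubLevels.load(std::memory_order_acquire);
    if (p)
        return p;

    std::lock_guard<std::mutex> guard(gInitMutex);
    p = gSubLevels.load(std::memory_order_relaxed);  // re-check under the lock
    if (!p)
    {
        p = new Chunk[count];
        // Release-store publishes the constructed array to acquire-loaders.
        gSubLevels.store(p, std::memory_order_release);
    }
    return p;
}
Once the acquire-load has returned a non-null pointer, readers can use the array with plain, non-atomic accesses, exactly as in the read-only methods described above.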

There are two issues you need to overcome here:
You obviously cannot afford to read before the array has been created
For efficiency reasons, you probably do not want to create the array multiple times
I would suggest simply using a reader-writer mutex; a sketch follows the steps below.
The basic idea is:
lock in reader mode
check if the data is ready
if not ready, upgrade the lock to writer mode
check if the data is ready (it might have been prepared by another writer) and if not prepare it
release the lock in writer mode (keep the lock in reader mode)
do things with the data
release the lock in reader mode
There are some issues with this design (specifically the contention that occurs during initialization), but it has the advantage of being dead simple.
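A minimal sketch of those steps with std::shared_timed_mutex (C++14). Since the standard shared mutex cannot be upgraded in place, the "upgrade" is modelled by releasing the shared lock, taking the exclusive lock, and re-checking; the names gMutex, gData and useData are illustrative:
#include <mutex>
#include <shared_mutex>
#include <vector>

std::shared_timed_mutex gMutex;     // illustrative names, not from the question
std::vector<int>* gData = nullptr;  // lazily created, read-only once published

void useData()
{
    std::shared_lock<std::shared_timed_mutex> readLock(gMutex);
    if (!gData)
    {
        // "Upgrade": the standard shared mutex cannot be upgraded in place,
        // so drop the shared lock, take the exclusive lock, and re-check.
        readLock.unlock();
        {
            std::unique_lock<std::shared_timed_mutex> writeLock(gMutex);
            if (!gData)                          // another writer may have created it already
                gData = new std::vector<int>(100);
        }
        readLock.lock();                         // back to reader mode
    }
    // ... read *gData while holding the shared lock ...
}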

When can memory_order_acquire or memory_order_release be safely removed from compare_exchange?

I refer to the code in Lewis Baker's coroutine tutorial:
https://lewissbaker.github.io/2017/11/17/understanding-operator-co-await
bool async_manual_reset_event::awaiter::await_suspend(
    std::experimental::coroutine_handle<> awaitingCoroutine) noexcept
{
    // Special m_state value that indicates the event is in the 'set' state.
    const void* const setState = &m_event;

    // Remember the handle of the awaiting coroutine.
    m_awaitingCoroutine = awaitingCoroutine;

    // Try to atomically push this awaiter onto the front of the list.
    void* oldValue = m_event.m_state.load(std::memory_order_acquire);
    do
    {
        // Resume immediately if already in 'set' state.
        if (oldValue == setState) return false;

        // Update linked list to point at current head.
        m_next = static_cast<awaiter*>(oldValue);

        // Finally, try to swap the old list head, inserting this awaiter
        // as the new list head.
    } while (!m_event.m_state.compare_exchange_weak(
        oldValue,
        this,
        std::memory_order_release,
        std::memory_order_acquire));

    // Successfully enqueued. Remain suspended.
    return true;
}
where m_state is just a std::atomic<void *>.
bool async_manual_reset_event::is_set() const noexcept
{
    return m_state.load(std::memory_order_acquire) == this;
}

void async_manual_reset_event::reset() noexcept
{
    void* oldValue = this;
    m_state.compare_exchange_strong(oldValue, nullptr, std::memory_order_acquire);
}

void async_manual_reset_event::set() noexcept
{
    // Needs to be 'release' so that subsequent 'co_await' has
    // visibility of our prior writes.
    // Needs to be 'acquire' so that we have visibility of prior
    // writes by awaiting coroutines.
    void* oldValue = m_state.exchange(this, std::memory_order_acq_rel);
    if (oldValue != this)
    {
        // Wasn't already in 'set' state.
        // Treat old value as head of a linked-list of waiters
        // which we have now acquired and need to resume.
        auto* waiters = static_cast<awaiter*>(oldValue);
        while (waiters != nullptr)
        {
            // Read m_next before resuming the coroutine as resuming
            // the coroutine will likely destroy the awaiter object.
            auto* next = waiters->m_next;
            waiters->m_awaitingCoroutine.resume();
            waiters = next;
        }
    }
}
Note in m_state.exchange of the set() method, the comment above shows clearly why the call to exchange requires both acquire and release.
I wonder why, in the m_state.compare_exchange_weak of the await_suspend() method, the success ordering is std::memory_order_release rather than memory_order_acq_rel (i.e. the acquire part is dropped).
The author (Lewis) did explain that we need release in the compare_exchange_weak so that a later set() sees the writes made before the compare_exchange_weak. But why don't we require the compare_exchange_weak calls in other threads to see the writes of the current compare_exchange_weak?
Is it because of the release sequence? I.e., in a release chain (a release write first, all the middle operations being "read-acquire then write-release" operations, and the final operation being an acquire read), you don't need acquire on the operations in the middle?
In the following code, I tried to implement a shared lock,
struct lock {
    uint64_t exclusive : 1;
    uint64_t id : 48;
    uint64_t shared_count : 15;
};

std::atomic<lock> lock_ { {0, 0, 0} };
bool try_lock_shared() noexcept {
    lock currentlock = lock_.load(std::memory_order_acquire);
    if (currentlock.exclusive == 1) {
        return false;
    }
    lock newlock;
    do {
        newlock = currentlock;
        newlock.shared_count++;
    } while (!lock_.compare_exchange_weak(currentlock, newlock, std::memory_order_acq_rel)
             && currentlock.exclusive == 0);
    return currentlock.exclusive == 0;
}

bool try_lock() noexcept {
    uint64_t id = utils::get_thread_id();
    lock currentlock = lock_.load(std::memory_order_acquire);
    if (currentlock.exclusive == 1) {
        assert(currentlock.id != id);
        return false;
    }
    bool result = false;
    lock newlock { 1, id, 0 };
    do {
        newlock.shared_count = currentlock.shared_count;
    } while (!(result = lock_.compare_exchange_weak(currentlock, newlock, std::memory_order_acq_rel))
             && currentlock.exclusive == 0);
    return result;
}
I used lock_.compare_exchange_weak(currentlock, newlock, std::memory_order_acq_rel) everywhere, can I safely replace them to compare_exchange_weak(currentlock, newlock, std::memory_order_release, std::memory_order_acquire) ?
I have also seen examples where memory_order_release is removed from compare_exchange_strong (see the compare_exchange_strong in the reset() function of Lewis's code), where you only need std::memory_order_acquire (but not release). I haven't really seen memory_order_release removed from the weak version, nor memory_order_acquire removed from the strong one.
This made me wonder whether there is a deeper rule here that I don't understand.
Thanks.
memory_order_acquire only makes sense for operations that read a value, and memory_order_release only makes sense for operations that write a value. Since a read-modify-write operation both reads and writes, it is possible to combine these memory orders, but it is not always necessary.
The m_event.m_state.compare_exchange_weak uses memory_order_release to write the new value, because it tries to replace a value that has previously been read using memory_order_acquire:
// load initial value using memory_order_acquire
void* oldValue = m_event.m_state.load(std::memory_order_acquire);
do {
    ...
} while (!m_event.m_state.compare_exchange_weak(oldValue, this,
             std::memory_order_release,
             // in case of failure, load new value using memory_order_acquire
             std::memory_order_acquire));
IMHO in this case it is not even necessary to use memory_order_acquire at all, since oldValue is never de-referenced, but only stored as the next pointer, i.e., it would be perfectly fine to replace these two memory_order_acquire with memory_order_relaxed.
In async_manual_reset_event::set() the situation is different:
void* oldValue = m_state.exchange(this, std::memory_order_acq_rel);
if (oldValue != this)
{
    auto* waiters = static_cast<awaiter*>(oldValue);
    while (waiters != nullptr)
    {
        // we are de-referencing the pointer read from m_state!
        auto* next = waiters->m_next;
        waiters->m_awaitingCoroutine.resume();
        waiters = next;
    }
}
Since we are de-referencing the pointer we read from m_state, we have to ensure that these reads happen after the writes to these waiter objects. This is ensured via the synchronizes-with relation on m_state. The waiter is added via the previously discussed compare_exchange using memory_order_release. The acquire part of the exchange synchronizes with the release-compare_exchange (and in fact all prior release-compare_exchanges that are part of the release sequence), thus providing the necessary happens-before relation.
To be honest, I am not sure why this exchange would need the release part. I think the author might have wanted to be on "the safe side", since several other operations are also stronger than necessary (I already mentioned that await_suspend does not need memory_order_acquire, and the same goes for is_set and reset).
For your lock implementation it is very simple: when you acquire the lock (try_lock_shared/try_lock), use memory_order_acquire for the compare-exchange operation only. Releasing the lock has to use memory_order_release.
The argument is also quite simple: you have to ensure that when you have acquired the lock, any changes previously made to the data protected by the lock are visible to the current owner. That is, you have to ensure that these changes happened before the operations you are about to perform after acquiring the lock. This is achieved by establishing a synchronize-with relation between the try_lock (acquire-CAS) and the previous unlock (release-store).
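To illustrate that argument, here is a minimal sketch of an exclusive spinlock-style lock (illustrative only, not the questioner's shared lock): the CAS that takes the lock uses acquire, and the store that releases it uses release.
#include <atomic>

class SpinLock
{
    std::atomic<bool> locked_{ false };
public:
    bool try_lock() noexcept
    {
        bool expected = false;
        // acquire: synchronizes-with the release-store in unlock(), so all writes
        // made by the previous owner are visible once we own the lock
        return locked_.compare_exchange_strong(expected, true,
                                               std::memory_order_acquire,
                                               std::memory_order_relaxed);
    }

    void unlock() noexcept
    {
        // release: publishes the writes made while holding the lock
        locked_.store(false, std::memory_order_release);
    }
};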
When trying to argue about the correctness of an implementation based on the semantics of the C++ memory model I usually do this as follows:
identify the necessary happens-before relations (like for your lock)
make sure that these happens-before relations are established correctly on all code paths
And I always annotate the atomic operations to document how these relations are established (i.e., which other operations are involved). For example:
// (1) - this acquire-load synchronizes-with the release-CAS (11)
auto n = head.load(std::memory_order_acquire);
// (8) - this acquire-load synchronizes-with the release-CAS (11)
h.acquire(head, std::memory_order_acquire);
// (11) - this release-CAS synchronizes-with the acquire-load (1, 8)
if (head.compare_exchange_weak(expected, next, std::memory_order_release, std::memory_order_relaxed))
(see https://github.com/mpoeter/xenium/blob/master/xenium/michael_scott_queue.hpp for the full code)
For more details about the C++ memory model I can recommend this paper which I have co-authored: Memory Models for C/C++ Programmers

C++: __sync_synchronize() still needed with std::atomic?

I've been running into an infrequent but recurring race condition.
The program has two threads and uses std::atomic. I'll simplify the critical parts of the code to look like:
std::atomic<uint64_t> b; // flag, initialized to 0
uint64_t data[100]; // shared data, initialized to 0
thread 1 (publishing):
// set various shared variables here, for example
data[5] = 10;
uint64_t a = b.exchange(1); // signal to thread 2 that data is ready
thread 2 (receiving):
if (b.load() != 0) { // signal that data is ready
    // read various shared variables here, for example:
    uint64_t x = data[5];
    // race condition sometimes (x sometimes not consistent)
}
The odd thing is that when I add __sync_synchronize() to each thread, then the race condition goes away. I've seen this happen on two different servers.
i.e. when I change the code to look like the following, then the problem goes away:
thread 1 (publishing):
// set various shared variables here, for example
data[5] = 10;
__sync_synchronize();
uint64_t a = b.exchange(1); // signal to thread 2 that data is ready
thread 2 (receiving):
if (b.load() != 0) { // signal that data is ready
    __sync_synchronize();
    // read various shared variables here, for example:
    uint64_t x = data[5];
}
Why is __sync_synchronize() necessary? It seems redundant, as I thought both exchange and load ensured the correct sequential ordering.
The architecture is x86_64, Linux, g++ 4.6.2.
Whilst it is impossible to say from your simplified code what actually goes on in your application, the fact that __sync_synchronize helps, and the fact that this function is a memory barrier, tells me that you are writing things in one thread that the other thread is reading, in a way that isn't atomic.
An example:
thread_1:
object *p = new object;
p->x = 1;
b.exchange(p); /* give pointer p to other thread */
thread_2:
object *p = b.load();
if (p->x == 1) do_stuff();
else error("Huh?");
This may very well trigger the error-path in thread2, because the write to p->x has not actually been completed when thread 2 reads the new pointer value p.
Adding a memory barrier, in this case in the thread_1 code, should fix this. Note that for THIS case, a memory barrier in thread_2 will not do anything - it may alter the timing and appear to fix the problem, but it won't be the right thing. You may still need memory barriers on both sides if you are reading and writing memory that is shared between two threads.
I understand that this may not be precisely what your code is doing, but the concept is the same - __sync_synchronize ensures that memory reads and memory writes have completed for ALL of the instructions before that call [which isn't a real function call; it inlines a single instruction that waits for any pending memory operations to complete].
Noteworthy is that operations on std::atomic ONLY affect the actual data stored in the atomic object. Not reads/writes of other data.
Sometimes you also need a "compiler barrier" to avoid the compiler moving stuff from one side of an operation to another:
std::atomic<bool> flag(false);
value = 42;
flag.store(true);
....
another thread:
while(!flag.load());
print(value);
Now, there is a chance that the compiler generates the first form as:
flag.store(true);
value = 42;
Now, that wouldn't be good, would it? std::atomic is guaranteed to be a "compiler barrier", but in other cases, the compiler may well shuffle stuff around in a similar way.
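For reference, the release/acquire pairing that makes such a hand-off well-defined in C++11 looks like the following. This is a sketch modelled on the question's snippet, not the asker's actual code:
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> b{ 0 };   // flag, as in the question
uint64_t data[100] = {};        // shared, non-atomic data

void thread_1()                 // publishing
{
    data[5] = 10;                                            // plain write
    uint64_t a = b.exchange(1, std::memory_order_release);   // publish: the write above cannot sink below this
    (void)a;
}

void thread_2()                 // receiving
{
    if (b.load(std::memory_order_acquire) != 0)              // reads below cannot hoist above this
    {
        uint64_t x = data[5];                                 // guaranteed to observe 10
        (void)x;
    }
}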

Lock a mutex from mutex array with atomic index

I'm trying to write a buffer which can push data into one of several internal buffers, check whether the current buffer is full, and swap buffers if necessary. Another thread can get a buffer for file output.
I've successfully implemented the buffer, but I wanted to add a ForceSwapBuffer method that would force an incomplete buffer to be swapped and return the data from that incomplete buffer. In order to do this I check whether the read and write buffer are the same (there is no use in trying to force-swap a buffer to write to a file while there are still other full buffers that could be written).
I want this method to be able to run side by side with the GetBuffer method (not really necessary, but I wanted to try it and stumbled upon this problem).
GetBuffer would block, and when ForceSwapBuffer finishes it would still block until the new buffer is completely full, because in ForceSwapBuffer I change the atomic _read_buffer_index. I wonder whether this will always work. Will the blocking lock in GetBuffer detect the change of the atomic _read_buffer_index and switch to the new mutex, or does it determine once, at the start of the lock, which mutex it has to lock, and keep trying to lock that same mutex even when the index changes?
/* selection of member data */
unsigned int _size, _count;
std::atomic<unsigned int> _write_buffer_index, _read_buffer_index;
unsigned int _index;
std::unique_ptr< std::unique_ptr<T[]>[] > _buffers;
std::unique_ptr< std::mutex[] > _mutexes;
std::recursive_mutex _force_swap_buffer;

/* selection of implementation of member functions */
template<typename T> // included to show the use of the recursive_mutex
void Buffer<T>::Push(T *data, unsigned int length) {
    std::lock_guard<std::recursive_mutex> lock(_force_swap_buffer);
    if (_index + length <= _size) {
        memcpy(&_buffers[_write_buffer_index][_index], data, length * sizeof(T));
        _index += length;
    } else {
        memcpy(&_buffers[_write_buffer_index][_index], data, (_size - _index) * sizeof(T));
        unsigned int t_index = _index;
        SwapBuffer();
        Push(&data[_size - t_index], length - (_size - t_index));
    }
}

template<typename T>
std::unique_ptr<T[]> Buffer<T>::GetBuffer() {
    std::lock_guard<std::mutex> lock(_mutexes[_read_buffer_index]); // where the magic should happen
    std::unique_ptr<T[]> result(new T[_size]);
    memcpy(result.get(), _buffers[_read_buffer_index].get(), _size * sizeof(T));
    _read_buffer_index = (_read_buffer_index + 1) % _count;
    return std::move(result);
}

template<typename T>
std::unique_ptr<T[]> Buffer<T>::ForceSwapBuffer() {
    std::lock_guard<std::recursive_mutex> lock(_force_swap_buffer); // forbids pushing and force-swapping at the same time
    if (_write_buffer_index != _read_buffer_index)
        return nullptr;
    std::unique_ptr<T[]> result(new T[_index]);
    memcpy(result.get(), _buffers[_read_buffer_index].get(), _index * sizeof(T));
    unsigned int next = (_write_buffer_index + 1) % _count;
    _mutexes[next].lock();
    _read_buffer_index = next; // changing the read index while the other thread is blocked; the new mutex is already locked so the other thread should remain blocked
    _mutexes[_write_buffer_index].unlock();
    _write_buffer_index = next;
    _index = 0;
    return result;
}
There are some problems with your code. First, be careful when modifying atomic variables. Only a small set of operations is really atomic (see http://en.cppreference.com/w/cpp/atomic/atomic), and combinations of atomic operations are not atomic. Consider:
_read_buffer_index = (_read_buffer_index + 1) % _count;
What happens here is that you have an atomic read of the variable, an increment, a modulo operation, and an atomic store; the statement as a whole, however, is not atomic. If _count is a power of 2, you can just use the ++ operator. If it is not, you have to read _read_buffer_index into a temporary variable, perform the above calculations, and then use a compare_exchange function to store the new value if the variable was not changed in the meantime. Obviously the latter has to be done in a loop until it succeeds. You also have to worry about the possibility that one thread increments the variable _count times between the read and the compare_exchange of a second thread, in which case the second thread erroneously thinks the variable was not changed.
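A sketch of that compare-exchange loop; the names read_buffer_index and buffer_count are illustrative stand-ins for the member variables:
#include <atomic>

std::atomic<unsigned int> read_buffer_index{ 0 };  // stand-in for _read_buffer_index
const unsigned int buffer_count = 3;               // stand-in for _count (not a power of 2)

void advance_read_index()
{
    unsigned int expected = read_buffer_index.load();
    unsigned int desired;
    do
    {
        desired = (expected + 1) % buffer_count;
        // On failure, compare_exchange_weak reloads the current value into
        // 'expected', so the next iteration recomputes 'desired' from it.
    } while (!read_buffer_index.compare_exchange_weak(expected, desired));
}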
The second problem is cache-line bouncing. If you have multiple mutexes on the same cache line and two or more threads try to access them simultaneously, performance will be very bad. The size of a cache line depends on your platform.
The main problem is that while ForceSwapBuffer() and Push() both lock the _force_swap_buffer mutex, GetBuffer() does not. GetBuffer() however does change _read_buffer_index. So in ForceSwapBuffer():
std::lock_guard<std::recursive_mutex> lock(_force_swap_buffer);
if (_write_buffer_index != _read_buffer_index)
    return nullptr;
// another thread can call GetBuffer() here and change _read_buffer_index
// rest of the code here
The assumption that _write_buffer_index == _read_buffer_index after the if-statement is actually invalid.

Using a mutex to block execution from outside the critical section

I'm not sure I got the terminology right, but here goes. I have this function that is used by multiple threads to write data (pseudocode in the comments illustrates what I want):
// these are initialized in the constructor
int* data;
std::atomic<size_t> size;

void write(int value) {
    // wait here while "read_lock"
    // set "write_lock" to "write_lock" + 1
    auto slot = size.fetch_add(1, std::memory_order_acquire);
    data[slot] = value;
    // set "write_lock" to "write_lock" - 1
}
The order of the writes is not important; all I need here is for each write to go to a unique slot.
Every once in a while, though, I need one thread to read the data using this function:
int* read() {
    // set "read_lock" to true
    // wait here while "write_lock"
    int* ret = data;
    data = new int[capacity];
    size = 0;
    // set "read_lock" to false
    return ret;
}
So it basically swaps out the buffer and returns the old one (I've removed the capacity logic to keep the snippets short).
In theory this should lead to two operating scenarios:
1 - just a bunch of threads writing into the container
2 - when some thread executes the read function, all new writers have to wait; the reader waits until all existing writes are finished, then does the read logic, and scenario 1 can continue
The question is that I don't know what kind of barrier to use for the locks:
A spinlock would be wasteful, since there are many containers like this and they all need CPU cycles
I don't know how to apply std::mutex, since I only want the write function to be in a critical section if the read function is triggered. Wrapping the whole write function in a mutex would cause unnecessary slowdown for operating scenario 1.
So what would be the optimal solution here?
If you have C++14 capability then you can use a std::shared_timed_mutex to separate out readers and writers. In this scenario it seems you need to give your writer threads shared access (allowing other writer threads at the same time) and your reader threads unique access (kicking all other threads out).
So something like this may be what you need:
class MyClass
{
public:
    using mutex_type  = std::shared_timed_mutex;
    using shared_lock = std::shared_lock<mutex_type>;
    using unique_lock = std::unique_lock<mutex_type>;

private:
    mutable mutex_type mtx;

public:
    // All updater threads can operate at the same time
    auto lock_for_updates() const
    {
        return shared_lock(mtx);
    }

    // Reader threads need to kick all the updater threads out
    auto lock_for_reading() const
    {
        return unique_lock(mtx);
    }
};

// many threads can call this
void do_writing_work(std::shared_ptr<MyClass> sptr)
{
    auto lock = sptr->lock_for_updates();
    // update the data here
}

// access the data from one thread only
void do_reading_work(std::shared_ptr<MyClass> sptr)
{
    auto lock = sptr->lock_for_reading();
    // read the data here
}
The shared_locks allow other threads to gain a shared_lock at the same time but prevent a unique_lock gaining simultaneous access. When a reader thread tries to gain a unique_lock all shared_locks will be vacated before the unique_lock gets exclusive control.
You can also do this with regular mutexes and condition variables rather than shared. Supposedly shared_mutex has higher overhead, so I'm not sure which will be faster. With Gallik's solution you'd presumably be paying to lock the shared mutex on every write call; I got the impression from your post that write gets called way more than read so maybe this is undesirable.
int* data; // initialized somewhere
std::atomic<size_t> size = 0;
std::atomic<bool> reading = false;
std::atomic<int> num_writers = 0;
std::mutex entering;
std::mutex leaving;
std::condition_variable cv;

void write(int x) {
    ++num_writers;
    if (reading) {
        --num_writers;
        if (num_writers == 0)
        {
            std::lock_guard l(leaving);
            cv.notify_one();
        }
        { std::lock_guard l(entering); }
        ++num_writers;
    }
    auto slot = size.fetch_add(1, std::memory_order_acquire);
    data[slot] = x;
    --num_writers;
    if (reading && num_writers == 0)
    {
        std::lock_guard l(leaving);
        cv.notify_one();
    }
}

int* read() {
    int* other_data = new int[capacity];
    {
        std::unique_lock enter_lock(entering);
        reading = true;
        std::unique_lock leave_lock(leaving);
        cv.wait(leave_lock, [] () { return num_writers == 0; });
        swap(data, other_data);
        size = 0;
        reading = false;
    }
    return other_data;
}
It's a bit complicated and took me some time to work out, but I think this should serve the purpose pretty well.
In the common case where only writing is happening, reading is always false. So you do the usual, and pay for two additional atomic increments and two untaken branches. So the common path does not need to lock any mutexes, unlike the solution involving a shared mutex, this is supposedly expensive: http://permalink.gmane.org/gmane.comp.lib.boost.devel/211180.
Now, suppose read is called. The expensive, slow heap allocation happens first, meanwhile writing continues uninterrupted. Next, the entering lock is acquired, which has no immediate effect. Now, reading is set to true. Immediately, any new calls to write enter the first branch and eventually hit the entering lock, which they are unable to acquire (as it's already taken), and those threads are then put to sleep.
Meanwhile, the read thread is now waiting on the condition that the number of writers is 0. If we're lucky, this could actually go through right away. If however there are threads in write in either of the two locations between incrementing and decrementing num_writers, then it will not. Each time a write thread decrements num_writers, it checks if it has reduced that number to zero, and when it does it will signal the condition variable. Because num_writers is atomic which prevents various reordering shenanigans, it is guaranteed that the last thread will see num_writers == 0; it could also be notified more than once but this is ok and cannot result in bad behavior.
Once that condition variable has been signalled, that shows that all writers are either trapped in the first branch or are done modifying the array. So the read thread can now safely swap the data, and then unlock everything, and then return what it needs to.
As mentioned before, in typical operation there are no locks, just increments and untaken branches. Even when a read does occur, the read thread will have one lock and one condition variable wait, whereas a typical write thread will have about one lock/unlock of a mutex and that's all (one, or a small number of write threads, will also perform a condition variable notification).

SPSC lock free queue without atomics

I have below an SPSC queue for my logger.
It is certainly not a general-purpose SPSC lock-free queue.
However, given a bunch of assumptions about how it will be used, the target architecture, etc., and a number of acceptable trade-offs, which I go into in detail below, my question is basically: is it safe / does it work?
It will only be used on x86_64 architecture, so writes to uint16_t will be atomic.
Only the producer updates the tail.
Only the consumer updates the head.
If the producer reads an old value of head, it will look like there is less space in the queue than there really is, which is an acceptable limitation in the context in which it is used.
If the consumer reads an old value of tail, it will look like there is less data waiting in the queue than there really is, again an acceptable limitation.
The limitations above are acceptable because:
the consumer may not get the latest tail immediately, but eventually the latest tail will arrive, and queued data will be logged.
the producer may not get the latest head immediately, so the queue will look more full than it really is. In our load testing we have found that, given the amount we log, the size of the queue, and the speed at which the logger drains the queue, this limitation has no effect - there is always space in the queue.
A final point: the use of volatile is necessary to prevent the variable that each thread only reads from being optimised away.
My questions:
Is this logic correct?
Is the queue thread safe?
Is volatile sufficient?
Is volatile necessary?
My queue:
class LogBuffer
{
public:
    bool is_empty() const { return head_ == tail_; }
    bool is_full() const { return uint16_t(tail_ + 1) == head_; }

    LogLine& head() { return log_buffer_[head_]; }
    LogLine& tail() { return log_buffer_[tail_]; }

    void advance_head() { ++head_; }
    void advance_tail() { ++tail_; }

private:
    volatile uint16_t tail_ = 0;      // write position
    LogLine log_buffer_[0xffff + 1];  // relies on the uint16_t overflowing
    volatile uint16_t head_ = 0;      // read position
};
Is this logic correct?
Yes.
Is the queue thread safe?
No.
Is volatile sufficient? Is volatile necessary?
No to both. volatile is not a magic keyword that makes a variable thread-safe. You still need to use atomic variables or memory barriers for the indexes to ensure memory ordering is correct when you produce or consume an item.
To be more specific, after you produce or consume an item for your queue you need to issue a memory barrier to guarantee that other threads will see the changes. Many atomic libraries will do this for you when you update an atomic variable.
As an aside, use "was_empty" instead of "is_empty" to be clear about what it does. The result of this call is one instance in time which may have changed by the time you act on its value.
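Putting both suggestions together, a sketch of the same ring buffer with std::atomic indexes and acquire/release ordering might look like this. The LogLine payload is reduced to an int to keep the example self-contained; member and method names otherwise follow the question:
#include <atomic>
#include <cstdint>

class LogBuffer
{
public:
    bool was_empty() const { return head_.load(std::memory_order_acquire) ==
                                    tail_.load(std::memory_order_acquire); }
    bool was_full() const  { return uint16_t(tail_.load(std::memory_order_acquire) + 1) ==
                                    head_.load(std::memory_order_acquire); }

    // producer side
    int& tail() { return log_buffer_[tail_.load(std::memory_order_relaxed)]; }
    void advance_tail()
    {
        // release: the write into the slot happens-before the consumer's acquire-load of tail_
        tail_.store(uint16_t(tail_.load(std::memory_order_relaxed) + 1),
                    std::memory_order_release);
    }

    // consumer side
    int& head() { return log_buffer_[head_.load(std::memory_order_relaxed)]; }
    void advance_head()
    {
        // release: the read of the slot happens-before the producer's acquire-load of head_
        head_.store(uint16_t(head_.load(std::memory_order_relaxed) + 1),
                    std::memory_order_release);
    }

private:
    std::atomic<uint16_t> tail_{ 0 };   // write position, only modified by the producer
    int log_buffer_[0xffff + 1] = {};   // relies on the uint16_t wrapping around
    std::atomic<uint16_t> head_{ 0 };   // read position, only modified by the consumer
};
Because each index is modified by exactly one thread, the relaxed loads of a thread's own index are safe; the acquire/release pair on the other thread's index is what provides the ordering that volatile alone cannot.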