A readers/writer lock... without having a lock for the readers? - c++

I get the feeling this may be a very general and common situation for which a well-known no-lock solution exists.
In a nutshell, I'm hoping there's an approach like a readers/writer lock, but one that doesn't require the readers to acquire a lock and thus can have better average performance.
Instead there'd be some atomic operations (128-bit CAS) for a reader, and a mutex for a writer. I'd have two copies of the data structure: a read-only one for the normally-successful queries, and an identical copy to be updated under mutex protection. Once the data has been inserted into the writable copy, we make it the new readable copy. The old readable copy then receives the same insertion in turn, once all the pending readers have finished reading it: the writer spins until the number of remaining readers hits zero, then modifies that copy as well, and finally releases the mutex.
Or something like that.
Anything along these lines exist?

If your data fits in a 64-bit value, most systems can cheaply read/write that atomically, so just use std::atomic<my_struct>.
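For example, a minimal sketch of that (Pair is a made-up 8-byte struct; in C++17, std::atomic&lt;Pair&gt;::is_always_lock_free lets you verify the no-lock claim at compile time):

#include <atomic>
#include <cstdint>

struct Pair { std::uint32_t a, b; };   // 8 bytes, trivially copyable

static std::atomic<Pair> shared{Pair{0, 0}};
static_assert(sizeof(Pair) == 8, "fits in one 64-bit atomic");

// Reader: one atomic load gives a consistent view of both fields.
Pair snapshot() { return shared.load(std::memory_order_acquire); }

// Writer: publishes both fields at once; no reader can see a half-update.
void publish(std::uint32_t a, std::uint32_t b) {
    shared.store(Pair{a, b}, std::memory_order_release);
}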
For smallish and/or infrequently-written data, there are a couple of ways to make readers truly read-only on the shared data, not having to do any atomic RMW operations on a shared counter or anything. This allows read-side scaling to many threads without readers contending with each other (unlike a 128-bit atomic read on x86 using lock cmpxchg16b (see note 1), or taking an RWlock).
Ideally just an extra level of indirection via an atomic<T*> pointer (RCU), or just an extra load + compare-and-branch (SeqLock); no atomic RMWs, and no memory barriers stronger than acq/rel, anywhere in the read side.
This can be appropriate for data that's read very frequently by many threads, e.g. a timestamp updated by a timer interrupt but read all over the place. Or a config setting that typically never changes.
If your data is larger and/or changes more frequently, one of the strategies suggested in other answers (where a reader still takes an RWlock on something, or atomically increments a counter) will be more appropriate. This won't scale perfectly, because each reader still needs to get exclusive ownership of the shared cache line containing the lock or counter so it can modify it, but there's no such thing as a free lunch.
Note 1: Update: x86 vendors finally decided to guarantee that 128-bit SSE/AVX loads / stores are atomic on CPUs with AVX, so if you're lucky std::atomic<16-byte-struct> has cheap loads when running on a CPU with AVX enabled. e.g. not Pentium/Celeron before Ice Lake. GCC for a while has been indirecting to a libgcc atomic_load_16 function for 16-byte operations, so runtime dispatching for it can pick a lock cmpxchg16b strategy on CPUs that support it. Now it has a much better option to choose from on some CPUs.
RCU
It sounds like you're half-way to inventing RCU (Read Copy Update) where you update a pointer to the new version.
But remember a lock-free reader might stall after loading the pointer, so you have a deallocation problem. This is the hard part of RCU. In a kernel it can be solved by having sync points where you know that there are no readers older than some time t, and thus can free old versions. There are some user-space implementations. https://en.wikipedia.org/wiki/Read-copy-update and https://lwn.net/Articles/262464/.
For RCU, the less frequent the changes, the larger a data structure you can justify copying. e.g. even a moderate-sized tree could be doable if it's only ever changed interactively by an admin, while readers are running on dozens of cores all checking something in parallel. e.g. kernel config settings are one thing where RCU is great in Linux.
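A minimal sketch of the publish side of RCU (Config and its fields are made-up names; reclamation is deliberately left out, which is exactly the hard part a grace-period mechanism like liburcu handles):

#include <atomic>

struct Config { int timeout_ms; int retries; };   // hypothetical payload

static std::atomic<const Config*> g_config{new Config{1000, 3}};

// Reader: a single acquire load plus a dependent dereference; no RMW and
// no stores to shared lines, so readers scale to many cores.
int current_timeout() {
    return g_config.load(std::memory_order_acquire)->timeout_ms;
}

// Single writer (or writers serialized by a mutex): copy, modify, publish.
void update_timeout(int ms) {
    const Config* old = g_config.load(std::memory_order_relaxed);
    Config* fresh = new Config(*old);
    fresh->timeout_ms = ms;
    g_config.store(fresh, std::memory_order_release);
    // delete old;  // NOT safe yet: a reader may still be dereferencing it.
    // Deferring this free until no reader can hold 'old' is the RCU problem.
}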
SeqLock
If your data is small (e.g. a 64-bit timestamp on a 32-bit machine), another good option is a SeqLock. Readers check a sequence counter before/after a non-atomic copy of the data into a private buffer. If the sequence counters match, we know there wasn't tearing. (Writers mutually exclude each other with a separate mutex.) See Implementing 64 bit atomic counter with 32 bit atomics / how to implement a seqlock lock using c++11 atomic library.
It's a bit of a hack in C++ to write something that can compile efficiently to a non-atomic copy that might have tearing, because inevitably that's data-race UB. (Unless you use std::atomic<long> with mo_relaxed for each chunk separately, but then you're stopping the compiler from using movdqu or something to copy 16 bytes at once.)
A SeqLock makes the reader copy the whole thing (or ideally just load it into registers) every read so it's only ever appropriate for a small struct or 128-bit integer or something. But for less than 64 bytes of data it can be quite good, better than having readers use lock cmpxchg16b for a 128-bit datum if you have many readers and infrequent writes.
It's not lock-free, though: a writer that sleeps while modifying the SeqLock could get readers stuck retrying indefinitely. For a small SeqLock the window is small, and obviously you want to have all the data ready before you do the first sequence-counter update to minimize the chance for an interrupt pausing the writer in mid update.
The best case is when there's only 1 writer so it doesn't have to do any locking; it knows nothing else will be modifying the sequence counter.
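A minimal sketch of the whole pattern (single writer assumed; the non-atomic copy of the payload is the technically-UB-but-works-in-practice part described above):

#include <atomic>

template <class T>
class SeqLock {
    std::atomic<unsigned> seq_{0};
    T data_{};   // plain payload; tearing is detected via seq_, not prevented
public:
    // Single writer only: an odd count means a write is in progress.
    void write(const T& v) {
        unsigned s = seq_.load(std::memory_order_relaxed);
        seq_.store(s + 1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_release); // keep the odd store ahead of the payload stores
        data_ = v;                                           // racy copy; readers retry
        seq_.store(s + 2, std::memory_order_release);        // payload stores stay ahead of this
    }
    T read() const {
        for (;;) {
            unsigned s0 = seq_.load(std::memory_order_acquire);
            T copy = data_;                                   // may tear; validated below
            std::atomic_thread_fence(std::memory_order_acquire); // payload loads stay ahead of the re-check
            unsigned s1 = seq_.load(std::memory_order_relaxed);
            if (s0 == s1 && (s0 & 1) == 0)
                return copy;                                  // consistent, untorn snapshot
        }
    }
};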

What you're describing is very similar to double instance locking and left-right concurrency control.
In terms of progress guarantees, the difference between the two is that the former is lock-free for readers while the latter is wait-free. Both are blocking for writers.

It turns out the two-structure solution I was thinking of has similarities to http://concurrencyfreaks.blogspot.com/2013/12/left-right-concurrency-control.html
Here's the specific data structure and pseudocode I had in mind.
We have two copies of some arbitrary data structure called MyMap allocated, and two pointers out of a group of three point to these two. Initially, one is pointed to by achReadOnly[0].pmap and the other by pmapMutable.
A quick note on achReadOnly: it has a normal state and two temporary states. The normal state will be (WLOG for cell 0/1):
achReadOnly = { { pointer to one data structure, number of current readers },
{ nullptr, 0 } }
pmapMutable = pointer to the other data structure
When we've finished mutating "the other," we store it in the unused slot of the array, as it is the next-generation read-only and it's fine for readers to start accessing it.
achReadOnly = { { pointer to one data structure, number of old readers },
{ pointer to the other data structure, number of new readers } }
pmapMutable = pointer to the other data structure
The writer then clears the pointer to "the one", the previous-generation read-only, forcing readers to go to the next-generation one. We move that to pmapMutable.
achReadOnly = { { nullptr, number of old readers },
{ pointer to the other data structure, number of new readers } }
pmapMutable = pointer to the one data structure
The writer then spins for the number of old readers to hit one (itself), at which point that copy can receive the same update. The 1 is overwritten with 0 to clean up in preparation to move forward, though in fact it could be left dirty, as it won't be referred to before being overwritten.
struct CountedHandle {
    MyMap* pmap;
    int    iReaders;
};

// Data Structure:
atomic<CountedHandle> achReadOnly[2];
MyMap*                pmapMutable;
mutex_t               muxMutable;
data Read( key ) {
    int iWhich = 0;
    CountedHandle chNow, chUpdate;

    // Spin if necessary to update the reader counter on a pmap, and/or
    // to find a pmap (as the pointer will be overwritten with nullptr once
    // a writer has finished updating the mutable copy and made it the next-
    // generation read-only in the other slot of achReadOnly[]).
    // CAS( location, expected, desired )
    do {
        chNow = achReadOnly[ iWhich ];
        if ( !chNow.pmap ) {
            iWhich = 1 - iWhich;
            continue;
        }
        chUpdate = chNow;
        chUpdate.iReaders++;
    } while ( CAS( achReadOnly[ iWhich ], chNow, chUpdate ) fails );

    // Now we've found a map, AND registered ourselves as a reader of it atomically.
    // Importantly, it is impossible that any reader has this pointer but isn't
    // represented in that count.

    if ( data = chNow.pmap->Find( key ) ) {
        // Deregister ourselves as a reader.
        do {
            chNow = achReadOnly[ iWhich ];
            chUpdate = chNow;
            chUpdate.iReaders--;
        } while ( CAS( achReadOnly[ iWhich ], chNow, chUpdate ) fails );
        return data;
    }

    // OK, we have to add it to the structure.
    lock muxMutable;
    figure out data for this key
    pmapMutable->Add( key, data );
    // It's now the next-generation read-only. Put it where readers can find it.
    achReadOnly[ 1 - iWhich ].pmap = pmapMutable;
    // Prev-generation read-only is our mutable now, though we can't change it
    // until the readers are gone.
    pmapMutable = achReadOnly[ iWhich ].pmap;
    // Force readers to look for the next-generation read-only.
    achReadOnly[ iWhich ].pmap = nullptr;
    // Spin until all readers finish with the previous-generation read-only.
    // Remember we added ourselves as a reader, so wait for 1, not 0.
    while ( achReadOnly[ iWhich ].iReaders > 1 )
        ;
    // Remove our reader count.
    achReadOnly[ iWhich ].iReaders = 0;
    // No more readers for the previous-generation read-only, so we can now write to it.
    pmapMutable->Add( key, data );
    unlock muxMutable;
    return data;
}

Solution that has come to me:
Every thread has a thread_local copy of the data structure, and this can be queried at will without locks. Any time you find your data, great, you're done.
If you do NOT find your data, then you acquire a mutex for the master copy.
This will have potentially many new insertions in it from other threads (possibly including the data you need!). Check to see if it has your data and if not insert it.
Finally, copy all the recent updates--including the entry for the data you need--to your own thread_local copy. Release mutex and done.
Readers can read all day long, in parallel, even while updates are happening, without locks. A lock is only needed when writing (or sometimes when catching up). This general approach would work for a wide range of underlying data structures. QED
Having many thread_local indexes sounds memory-inefficient if you have lots of threads using this structure.
However, the data found by the index, if it's read-only, need only have one copy, referred to by many indices. (Luckily, that is my case.)
Also, many threads might not be randomly accessing the full range of entries; maybe some only need a few entries and will very quickly reach a final state where their local copy of the structure can find all the data needed, before it grows much. And yet many other threads may not refer to this at all. (Luckily, that is my case.)
Finally, to "copy all the recent updates" it'd help if all new data added to the structure were, say, pushed onto the end of a vector so given that say you have 4000 entries in your local copy, the master copy has 4020, you can with a few machine cycles locate the 20 objects you need to add. (Luckily, that is my case.)

Related

std::atomic - behaviour of relaxed ordering

Can the following call to print result in outputting stale/unintended values?
std::mutex g;
std::atomic<int> seq;
int g_s = 0;
int i = 0, j = 0, k = 0; // ignore fact that these could easily be made atomic

// Thread 1
void do_work() // seldom called
{
    std::lock_guard<std::mutex> lock{g};
    i++;
    j++;
    k++;
    seq.fetch_add(1, std::memory_order_relaxed);
}

// Thread 2
void consume_work() // spinning
{
    const auto s = g_s;
    // avoid overhead of constantly acquiring lock
    g_s = seq.load(std::memory_order_relaxed);
    if (s != g_s)
    {
        // no lock guard
        print(i, j, k);
    }
}
TL:DR: this is super broken; use a Seq Lock instead. Or RCU if your data structure is bigger.
Yes, you have data-race UB, and in practice stale values are likely; so are inconsistent values (from different increments). ISO C++ has nothing to say about what will happen, so it depends on how it happens to compile for some real machine, and interrupts / context switches in the reader that happen in the middle of reading some of these multiple vars. e.g. if the reader sleeps for any reason between reading i and j, you could miss many updates, or at least get a j that doesn't match your i.
Relaxed seq with writer+reader using lock_guard
I'm assuming the writer would look the same, so the atomic RMW increment is inside the critical section.
I'm picturing the reader checking seq like it is now, and only taking a lock after that, inside the block that runs print.
Even if you did use lock_guard to make sure the reader got a consistent snapshot of all three variables (something you couldn't get from making each of them separately atomic), I'm not sure relaxed would be sufficient in theory. It might be in practice on most real implementations for real machines (where compilers have to assume there might be a reader that synchronizes a certain way, even if there isn't in practice). I'd use at least release/acquire for seq, if I was going to take a lock in the reader.
Taking a mutex is an acquire operation, same as a std::memory_order_acquire load on the mutex object. A relaxed increment inside a critical section can't become visible to other threads until after the writer has taken the lock.
But in the reader, with if( xyz != seq.load(relaxed) ) { take_lock; ... }, the load is not guaranteed to "happen before" taking the lock. In practice on many ISAs it will, especially x86 where all atomic RMWs are full memory barriers. But in ISO C++, and maybe some real implementations, it's possible for the relaxed load to reorder into the reader's critical section. Of course, ISO C++ doesn't define things in terms of "reordering", only in terms of syncing with and values loads are allowed to see.
(This reordering may not be fully plausible; it would mean the read side would have to actually take the lock based on branch prediction / speculation on the load result. Maybe with lock elision like x86 did with transactional memory, except without x86's strong memory ordering?)
Anyway, it's pretty hairy to reason about, and release/acquire ops are quite cheap on most CPUs. If you expected it to be expensive, and for the check to often be false, you could check again with an acquire load, or put an acquire fence inside the if so it doesn't happen on the no-new-work path.
Use a Seq Lock
Your problem is better solved by using your sequence counter as part of a Seq Lock, so neither reader nor writer needs a mutex. (Summary: increment before writing, then touch the payload, then increment again. In the reader, read i, j, and k into local temporaries, then check the sequence number again to make sure it's the same, and an even number; with appropriate memory barriers.
See the wikipedia article and/or the link below for actual details, but the real change from what you have now is that the sequence number has to increment by 2. If you can't handle that, use a separate counter for the actual lock, with seq as part of the payload.)
If you don't want to use a mutex in the reader, using one in the writer only helps in terms of implementation-detail side-effects, like making sure stores to memory actually happen, not keeping i in a register across calls if do_work inlines into some caller.
BTW, updating seq doesn't need to be an atomic RMW if there's only one writer. You can do a relaxed load and separately store an incremented temporary (with release semantics).
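For example (a sketch; safe only because no other thread ever modifies seq):

// Pure load + pure store instead of a lock-prefixed RMW like fetch_add.
const int s = seq.load(std::memory_order_relaxed);
seq.store(s + 1, std::memory_order_release);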
A Seq Lock is good for cheap reads and occasional writes that make the reader retry. Implementing 64 bit atomic counter with 32 bit atomics shows appropriate fencing.
It relies on non-atomic reads that may have a data race, but not using the result if your sequence counter detects tearing. C++ doesn't define the behaviour in that case, but it works in practice on real implementations. (C++ is mostly keeping its options open in case of hardware race detection, which normal CPUs don't do.)
If you have multiple writers, you'd still use a normal lock to give mutual exclusion between them, or use the sequence counter as a spinlock, with a writer acquiring it by making the count odd. Otherwise you just need the sequence counter.
Your global g_s is just to track the latest sequence number the reader has seen? Storing it next to the data defeats some of the purpose/benefit, since it means the reader is writing to the same cache line as the writer, assuming that variables declared near each other all end up together. Consider making it static inside the function, or separating it from the shared variables with other data or with padding, like alignas(64) or alignas(128). (That wouldn't guarantee that a compiler doesn't put it right before the other vars, though; a struct would let you control the layout of all of them. With enough alignment, you can make sure they're not in the same aligned pair of cache lines.)
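For example, a sketch assuming 64-byte cache lines:

#include <atomic>

// Writer-owned line: the counter and payload can share it, since the
// writer touches both anyway.
struct alignas(64) Shared {
    std::atomic<int> seq{0};
    int i = 0, j = 0, k = 0;
};
static Shared shared;

// Reader-private bookkeeping on its own line, so the reader's stores to it
// never invalidate the writer's line (and vice versa).
alignas(64) static int g_s = 0;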
Even ignoring the staleness, this causes a data race and UB.
Thread 2 can read i, j, k while thread 1 is modifying them; you don't synchronize the access to those variables. If thread 2 doesn't respect the mutex g, there's no point in locking it in thread 1.
Yes, it can.
First of all, the lock guard does not have any effect on your code. A lock has to be used by at least two threads to have any effect.
Thread 2 can read at any moment. It can read an incremented i and un-incremented j and k. In theory, it can even read a weird partial value obtained by reading in between updates to the various bytes that compose i - for example, incrementing from 0xFF to 0x100 could result in reading 0x1FF or 0x0 - but not on x86, where these updates happen to be atomic.

Do I need writer/reader synchronization if I don't care if reader will read stale data?

Say I have a reader thread. The reader has a vector of bools. The size of the vector doesn't change and is always known. The reader reads some data from another source, computes an index from the data, and checks whether vector[index] == true. If true, the reader sends the data further. If not, it drops the data.
Say I also have a writer thread. The writer sets vector[index] to true or false.
Do I really need a mutex for the vector if I don't mind that some extra data chunks get sent or some chunks get lost? Is it absolutely safe to use a vector in this way?
Reading and writing the same value, however small, from multiple threads without synchronization, is a data race, a form of undefined behavior.
Even if the hardware guarantees cache coherency (as in x86), the C++ memory model is defined such that in the absence of synchronization each thread is assumed to be executing in isolation. Then according to the as-if rule the compiler is allowed to optimize away and reorder memory accesses any way it sees fit, so the behavior of a program with a data race becomes unpredictable. The reader thread may never "see" any updated value, for example. Or the writer may not write anything to memory until the thread is finished, or write in a different order. The behavior may change between compiler versions, optimization levels, etc.
Note that synchronization doesn't mean a mutex, an atomic will do too (a vector of atomics is somewhat complicated, but is possible too, though my feeling is that a userspace mutex would be more efficient).
Bonus note: don't forget about false sharing when accessing the same vector from multiple threads.
As rustyx already indicated, atomics could do the trick.
If you just care about reading the value at some point in the future, and about not suffering a data race (i.e. a write and a read with no happens-before relation between them), then it is sufficient to set the flags using memory_order_release and read the flags using memory_order_acquire.
On x86, which uses the TSO memory model, all regular stores are release stores and all regular loads are acquire loads, so at the hardware level there is no price to pay. Only the compiler will be prevented from doing certain reorderings.
The expensive store on x86 is memory_order_seq_cst. In that case, the store is put in the store buffer and the CPU stops executing loads until the store buffer has been drained. With memory_order_release, the store is placed in the store buffer and the CPU can continue with the next instruction (even loads), so the CPU is not stalled.
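For instance, a sketch of the flag vector with release/acquire (note that std::vector&lt;bool&gt; itself won't work here: its packed bits share bytes, so writes to two "elements" can race; a vector of atomics must be sized once up front, since atomics aren't copyable):

#include <atomic>
#include <cstddef>
#include <vector>

// Fixed-size; elements are value-initialized to false.
static std::vector<std::atomic<bool>> flags(1024);

// Writer thread
void set_flag(std::size_t i, bool v) {
    flags[i].store(v, std::memory_order_release);
}

// Reader thread: may see a slightly stale value, but never a torn one,
// and there's no data race / UB.
bool check_flag(std::size_t i) {
    return flags[i].load(std::memory_order_acquire);
}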

Could this publish / check-for-update class for a single writer + reader use memory_order_relaxed or acquire/release for efficiency?

Introduction
I have a small class which makes use of std::atomic for a lock-free operation. Since this class is called very frequently, it's affecting performance and I'm having trouble.
Class description
The class is similar to a LIFO, but once the pop() function is called, it only returns the last written element of its ring buffer (and only if there are new elements since the last pop()).
A single thread is calling push(), and another single thread is calling pop().
Source I've read
Since this is using too much of my computer's time, I decided to study the std::atomic class and its memory_order a bit further. I've read a lot of the memory_order posts available on StackOverflow and other sources and books, but I'm not able to get a clear idea of the different modes. Specifically, I'm struggling with the acquire and release modes: I fail to see why they are different from memory_order_seq_cst.
What I think each memory order does, in my own words, from my own research
memory_order_relaxed: In the same thread, the atomic operations are instant, but other threads may fail to see the latest values instantly; they will need some time until they are updated. The code can be reordered freely by the compiler or OS.
memory_order_acquire / release: Used by atomic::load. It prevents the lines of code before it from being reordered (the compiler/OS may reorder everything after this line all it wants), and reads the latest value that was stored on this atomic using memory_order_release or memory_order_seq_cst in this thread or another thread. memory_order_release also prevents code after it from being reordered. So, in an acquire/release, all the code between both can be shuffled by the OS. I'm not sure whether that's between the same thread, or different threads.
memory_order_seq_cst: Easiest to use because it's like the natural writing we are used to with variables, instantly refreshing the values of other threads' load functions.
The LockFreeEx class
template<typename T>
class LockFreeEx
{
public:
    void push(const T& element)
    {
        const int wPos = m_position.load(std::memory_order_seq_cst);
        const int nextPos = getNextPos(wPos);
        m_buffer[nextPos] = element;
        m_position.store(nextPos, std::memory_order_seq_cst);
    }

    const bool pop(T& returnedElement)
    {
        const int wPos = m_position.exchange(-1, std::memory_order_seq_cst);
        if (wPos != -1)
        {
            returnedElement = m_buffer[wPos];
            return true;
        }
        else
        {
            return false;
        }
    }

private:
    static constexpr int maxElements = 8;
    static constexpr int getNextPos(int pos) noexcept {return (++pos == maxElements)? 0 : pos;}
    std::array<T, maxElements> m_buffer;
    std::atomic<int> m_position {-1};
};
How I expect it could be improved
So, my first idea was using memory_order_relaxed in all atomic operations, since the pop() thread is in a loop looking for available updates every 10-15 ms, and it's allowed to fail on the first pop() calls and realize later that there is a new update. It's only a bunch of milliseconds.
Another option would be using release/acquire - but I'm not sure about them. Using release in all store() and acquire in all load() functions.
Unfortunately, all the memory orders I described seem to work, and I'm not sure when they will fail, if they are supposed to fail.
Final
Please, could you tell me if you see any problem with using relaxed memory order here? Or should I use release/acquire (maybe a further explanation of these could help me)? Why?
I think relaxed is the best fit for this class, in all its store() and load() calls. But I'm not sure!
Thanks for reading.
EDIT: EXTRA EXPLANATION:
Since I see everyone is asking about the 'char', I've changed it to int; problem solved! But that isn't the problem I want to solve.
The class, as I stated before, is something like a LIFO, but where only the last element pushed matters, if there is any.
I have a big struct T (copyable and assignable) that I must share between two threads in a lock-free way. So the only way I know to do it is using a circular buffer that writes the last known value for T, and an atomic that knows the index of the last value written. When there isn't any, the index is -1.
Notice that my pop thread must know when there is a "new T" available; that's why pop() returns a bool.
Thanks again to everyone trying to assist me with memory orders! :)
AFTER READING SOLUTIONS:
template<typename T>
class LockFreeEx
{
public:
    LockFreeEx() {}
    LockFreeEx(const T& initValue): m_data(initValue) {}

    // WRITE THREAD - CAN BE SLOW, WILL BE CALLED EVERY 500-800ms
    void publish(const T& element)
    {
        // I used acquire instead of relaxed to make sure wPos is always the
        // latest m_writePos value, and nextPos calculates the right one
        const int wPos = m_writePos.load(std::memory_order_acquire);
        const int nextPos = (wPos + 1) % bufferMaxSize;
        m_buffer[nextPos] = element;
        m_writePos.store(nextPos, std::memory_order_release);
    }

    // READ THREAD - NEEDS TO BE VERY FAST - CALLED ONCE AT THE BEGINNING OF THE LOOP, EVERY 2ms
    inline void update()
    {
        // should I change to relaxed? It doesn't matter if I get the new value or
        // the old one, since I will call this function again very soon, and again...
        const int writeIndex = m_writePos.load(std::memory_order_acquire);

        // Updating only in case there is something new... T may be a heavy struct
        if (m_readPos != writeIndex)
        {
            m_readPos = writeIndex;
            m_data = m_buffer[m_readPos];
        }
    }

    // NEEDS TO BE LIGHTNING FAST, CALLED MULTIPLE TIMES IN THE READ THREAD
    inline const T& get() const noexcept {return m_data;}

private:
    // Buffer
    static constexpr int bufferMaxSize = 4;
    std::array<T, bufferMaxSize> m_buffer;
    std::atomic<int> m_writePos {0};
    int m_readPos = 0;

    // Data
    T m_data;
};
Memory order is not about when you see some particular change to an atomic object but rather about what this change can guarantee about the surrounding code. Relaxed atomics guarantee nothing except the change to the atomic object itself: the change will be atomic. But you can't use relaxed atomics in any synchronization context.
And you have some code which requires synchronization. You want to pop something that was pushed, not try to pop something that has not been pushed yet. So if you use a relaxed operation then there is no guarantee that your pop will see this push code:
m_buffer[nextPos] = element;
m_position.store(nextPos, std::memory_order_relaxed);
as it is written. It could just as well see it this way:
m_position.store(nextPos, std::memory_order_relaxed);
m_buffer[nextPos] = element;
So you might try to get an element from the buffer which is not there yet. Hence, you have to use some synchronization and at least use acquire/release memory order.
And to your actual code. I think the order can be as follows:
const int wPos = m_position.load(std::memory_order_relaxed);
...
m_position.store(nextPos, std::memory_order_release);
...
const int wPos = m_position.exchange(-1, std::memory_order_acquire);
Your writer only needs release, not seq-cst, but relaxed is too weak. You can't publish a value for m_position until after the non-atomic assignment to the corresponding m_buffer[] entry. You need release ordering to make sure the m_position store is visible to other threads only after all earlier memory operations. (Including the non-atomic assignment). https://preshing.com/20120913/acquire-and-release-semantics/
This has to "synchronize-with" an acquire or seq_cst load in the reader. Or at least mo_consume in the reader.
In theory you also need wpos = m_position to be at least acquire (or consume in the reader), not relaxed, because C++11's memory model is weak enough for things like value-prediction which can let the compiler speculatively use a value for wPos before the load actually takes a value from coherent cache.
(In practice on real CPUs, a crazy compiler could do this with test/branch to introduce a control dependency, allowing branch prediction + speculative execution to break the data dependency for a likely value of wPos.)
But normal compilers don't do that. On CPUs other than DEC Alpha, the data dependency in the source code of wPos = m_position, and then using m_buffer[wPos], will create a data dependency in the asm, like mo_consume is supposed to take advantage of. Real ISAs other than Alpha guarantee dependency-ordering for dependent loads. (And even on Alpha, using a relaxed atomic exchange might be enough to close the tiny window that exists on the few real Alpha CPUs that allow this reordering.)
When compiling for x86, there's no downside at all to using mo_acquire; it doesn't cost any extra barriers. There can be on other ISAs, like 32-bit ARM where acquire costs a barrier, so "cheating" with a relaxed load could be a win that's still safe in practice. Current compilers always strengthen mo_consume to mo_acquire so we unfortunately can't take advantage of it.
You already have a real-world race condition, even using seq_cst.
initial state: m_position = 0
reader "claims" slot 0 by exchanging in a m_position = -1 and reads part of m_buffer[0];
reader sleeps for some reason (e.g. timer interrupt deschedules it), or simply races with a writer.
writer reads wPos = m_position as -1, and calculates nextPos = 0.
It overwrites the partially-read m_buffer[0]
reader wakes up and finishes reading, getting a torn T &element. Data race UB in the C++ abstract machine, and tearing in practice.
Adding a 2nd check of m_position after the read (like a SeqLock) can't detect this in every case because the writer doesn't update m_position until after writing the buffer element.
Even though your real use-case has long gaps between reads and writes, this defect can bite you with just one read and one write happening at almost the same time.
I for sure know that the read side cannot wait for anything and cannot be stopped (it's audio), and it's popped every 5-10ms, while the write side is user input, which is much slower; a fast one could do a push once every 500ms.
A millisecond is ages on a modern CPU. Inter-thread latency is often something like 60 ns, so fractions of a microsecond, e.g. from a quad-core Intel x86. As long as you don't sleep on a mutex, it's not a problem to spin-retry once or twice before giving up.
Code review:
The class similar to a LIFO, but once the pop() function is called, it only return the last written element of its ring-buffer (only if there are new elements since last pop()).
This isn't a real queue or stack: push and pop aren't great names. "publish" and "read" or "get" might be better and make it more obvious what this is for.
I'd include comments in the code to describe the fact that this is safe for a single writer, multiple readers. (The non-atomic increment of m_position in push makes it clearly unsafe for multiple writers.)
Even so, it's kinda weird even with 1 writer + 1 reader running at the same time. If a read starts while a write is in progress, it will get the "old" value instead of spin-waiting for a fraction of a microsecond to get the new value. Then next time it reads there will already be a new value waiting; the one it just missed seeing last time. So e.g. m_position can update in this order: 2, -1, 3.
That might or might not be desirable, depending on whether "stale" data has any value, and on the acceptability of the reader blocking if the writer sleeps mid-write. Or, even without the writer sleeping, on the acceptability of spin-waiting.
The standard pattern for rarely written smallish data with multiple read-only readers is a SeqLock. e.g. for publishing a 128-bit current timestamp on a CPU that can't atomically read or write a 128-bit value. See Implementing 64 bit atomic counter with 32 bit atomics
Possible design changes
To make this safe, we could let the writer run free, always wrapping around its circular buffer, and have the reader keep track of the last element it looked at.
If there's only one reader, this should be a simple non-atomic variable. If it's an instance variable, at least put it on the other side of m_buffer[] from the write-position.
// Possible failure mode: writer wraps around between reads, leaving the same m_position
// single-reader
const bool read(T &elem)
{
    // FIXME: big hack to get this in a separate cache line from the instance vars.
    // Maybe instead use alignas(64) int m_lastread as a class member, and/or
    // put it on the other side of m_buffer from m_position.
    static int lastread = -1;

    int wPos = m_position.load(std::memory_order_acquire);  // or cheat with relaxed to get asm that's like "consume"
    if (lastread == wPos)
        return false;

    elem = m_buffer[wPos];
    lastread = wPos;
    return true;
}
You want lastread in a separate cache line from the stuff the writer writes. Otherwise the reader's updates of lastread will be slower because of false sharing with the writer's writes, and vice versa.
This lets the reader(s) be truly read-only wrt. the cache lines written by the writer. It will still take MESI traffic to request read access to lines that are in Modified state after the writer writes them, though. But the writer can still read m_position with no cache miss, so it can get its stores into the store buffer right away. It only has to wait for an RFO to get exclusive ownership of the cache line(s) before it can commit the element and the updated m_position from its store buffer to L1d cache.
TODO: let m_position increment without manual wrapping, so we have a write sequence number that takes a very long time to wrap around, avoiding false-negative early out from lastread == wPos.
Use wPos & (maxElements-1) as the index. And static_assert((maxElements & (maxElements-1)) == 0, "maxElements must be a power of 2");
Then the only danger is undetected tearing in a tiny time-window if the writer has wrapped all the way around and is writing the element being read. For frequent reads and infrequent writes, and a buffer that's not too small, this should never happen. Checking the m_position again after a read (like a SeqLock, similar to below) narrows the race window to only writes that are still in progress.
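A sketch of that TODO (assuming m_position becomes std::atomic&lt;unsigned&gt; so wraparound is well-defined; single writer as before):

static_assert((maxElements & (maxElements - 1)) == 0,
              "maxElements must be a power of 2");

// m_position is now a free-running write sequence number; index with a mask.
void publish(const T& elem)
{
    const unsigned seq = m_position.load(std::memory_order_relaxed) + 1;
    m_buffer[seq & (maxElements - 1)] = elem;
    m_position.store(seq, std::memory_order_release);
}

The reader's lastread == wPos check then compares full sequence numbers, so a writer that laps the buffer between reads no longer produces a false "nothing new".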
If there are multiple readers, another good option might be a claimed flag in each m_buffer entry. So you'd define
template<typename T>
class WaitFreePublish
{
private:
    struct {
        alignas(32) T elem;           // at most 2 elements per cache line
        std::atomic<int8_t> claimed;  // writer sets this to 0, readers try to CAS it to 1.
                                      // could be bool if we don't end up needing 3 states for anything.
                                      // set to "1" in the constructor? or invert and call it "unclaimed"
    } m_buffer[maxElements];

    std::atomic<int> m_position {-1};
};
If T has padding at the end, it's a shame we can't take advantage of that for the claimed flag :/
This avoids the possible failure mode of comparing positions: if the writer wraps around between reads, the worst we get is tearing. And we could detect such tearing by having the writer clear the claimed flag first, before writing the rest of the element.
With no other threads writing m_position, we can definitely use a relaxed load without worry. We could even cache the write-position somewhere else, but the reader hopefully isn't invalidating the cache-line containing m_position very often. And apparently in your use-case, writer performance/latency probably isn't a big deal.
So the writer + reader could look like this, with SeqLock-style tearing detection using the known update-order for claimed flag, element, and m_position.
/// claimed flag per array element supports concurrent readers
// thread-safety: single-writer only
// update claimed flag first, then element, then m_position.
void publish(const T& elem)
{
    const int wPos = m_position.load(std::memory_order_relaxed);
    const int nextPos = getNextPos(wPos);

    m_buffer[nextPos].claimed.store(0, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);  // make sure the `0` is visible *before* the non-atomic element modification
    m_buffer[nextPos].elem = elem;

    m_position.store(nextPos, std::memory_order_release);
}
// thread-safety: multiple readers are ok. First one to claim an entry gets it
// check claimed flag before/after to detect overwrite, like a SeqLock
const bool read(T &elem)
{
    int rPos = m_position.load(std::memory_order_acquire);

    int8_t claimed = m_buffer[rPos].claimed.load(std::memory_order_relaxed);
    if (claimed != 0)
        return false;      // read-only early-out

    claimed = 0;
    if (!m_buffer[rPos].claimed.compare_exchange_strong(
            claimed, 1, std::memory_order_acquire, std::memory_order_relaxed))
        return false;      // strong CAS failed: another thread claimed it

    elem = m_buffer[rPos].elem;

    // final check that the writer didn't step on this buffer during the read, like a SeqLock
    std::atomic_thread_fence(std::memory_order_acquire);  // LoadLoad barrier

    // We expect it to still be claimed=1 like we set with the CAS.
    // Otherwise we raced with a writer and elem may be torn.
    // optionally retry once or twice in this case because we know there's a new value waiting to be read.
    return m_buffer[rPos].claimed.load(std::memory_order_relaxed) == 1;

    // Note that elem can be updated even if we return false, if there was tearing.
    // Use a temporary if that's not ok.
}
Using claimed = m_buffer[rPos].claimed.exchange(1) and checking for claimed==0 would be another option, vs. CAS-strong. Maybe slightly more efficient on x86. On LL/SC machines I guess CAS might be able to bail out without doing a write at all if it finds a mismatch with expected, in which case the read-only check is pointless.
I used .claimed.compare_exchange_strong(claimed, 1) with success ordering = acquire to make sure that read of claimed happens-before reading .elem.
The "failure" memory ordering can be relaxed: If we see it already claimed by another thread, we give up and don't look at any shared data.
The memory-ordering of the store part of compare_exchange_strong can be relaxed, so we just need mo_acquire, not acq_rel. Readers don't do any other stores to the shared data, and I don't think the ordering of the store matters wrt. to the loads. CAS is an atomic RMW. Only one thread's CAS can succeed on a given buffer element because they're all trying to set it from 0 to 1. That's how atomic RMWs work, regardless of being relaxed or seq_cst or anything in between.
It doesn't need to be seq_cst: we don't need to flush the store buffer or whatever to make sure the store is visible before this thread reads .elem. Just being an atomic RMW is enough to stop multiple threads from actually thinking they succeed. Release would just make sure it can't move earlier, ahead of the relaxed read-only check. That wouldn't be a correctness problem. Hopefully no x86 compilers would do that at compile time. (At runtime on x86, RMW atomic operations are always seq_cst.)
I think being an RMW makes it impossible for it to "step on" a write from a writer (after wrapping around). But this might be real-CPU implementation detail, not ISO C++. In the global modification order for any given .claimed, I think the RMW stays together, and the "acquire" ordering does keep it ahead of the read of the .elem. A release store that wasn't part of a RMW would be a potential problem though: a writer could wrap around and put claimed=0 in a new entry, then the reader's store could eventually commit and set it to 1, when actually no reader has ever read that element.
If we're very sure the reader doesn't need to detect writer wrap-around of the circular buffer, leave out the std::atomic_thread_fence in the writer and reader. (The claimed and the non-atomic element store will still be ordered by the release-store to m_position). The reader can be simplified to leave out the 2nd check and always return true if it gets past the CAS.
Notice that m_buffer[nextPos].claimed.store(0, std::memory_order_release); would not be sufficient to stop later non-atomic stores from appearing before it: release-stores are a one-way barrier, unlike release fences. A release-fence is like a 2-way StoreStore barrier. (Free on x86, cheap on other ISAs.)
This SeqLock-style tearing detection doesn't technically avoid UB in the C++ abstract machine, unfortunately. There's no good / safe way to express this pattern in ISO C++, and it's known to be safe in asm on real hardware. Nothing actually uses the torn value (assuming read()'s caller ignores its elem value if it returns false).
Making elem a std::atomic<T> would defeat the entire purpose: that would use a spinlock to get atomicity, so it might as well use one directly.
Using volatile T elem would break buffer[i].elem = elem because unlike C, C++ doesn't allow copying a volatile struct to/from a regular struct. (volatile struct = struct not possible, why?). This is highly annoying for a SeqLock type of pattern where you'd like the compiler to emit efficient code to copy the whole object representation, optionally using SIMD vectors. You won't get that if you write a constructor or assignment operator that takes a volatile &T argument and does individual members. So clearly volatile is the wrong tool, and that only leaves compiler memory barriers to make sure the non-atomic object is fully read or fully written before the barrier. std::atomic_thread_fence is I think actually safe for that, like asm("" ::: "memory") in GNU C. It works in practice on current compilers.

memory range sharing in threads : ensure data is not stuck in cache

When sending the address of a memory location from one thread to another, how can I ensure that the data is not stuck in the CPU cache, and that the second thread actually reads the correct value? (I'm using a socketpair() to send the pointer from one thread to another.)
And a related question: how does the C++ compiler, along with the thread primitives, figure out which memory addresses need to be handled specially for synchronization?
struct Test { int fld; };

void thread_1() {
    Test* ptr1 = new Test;
    ptr1->fld = 100;
    ::write(write_fd, &ptr1, sizeof(ptr1));
}

void thread_2() {
    Test* ptr2;
    ::read(read_fd, &ptr2, sizeof(ptr2));
    // WHAT MAGIC IS REQUIRED TO ENSURE THIS ?
    assert(ptr2->fld == 100);
}
If you want to pass the value between threads in the same process, I would use std::atomic<int> as the type of the field, with related setter and getter functions. Obviously, passing a pointer from one process to another doesn't work at all, unless it points to an area of memory that is guaranteed to have the same address in both processes - shared memory, for example - but then you shouldn't need sockets...
Compilers do not, in general, know how to deal with caches, except for atomic types (technically, atomics are usually dealt with using separate instructions, rather than cache-flushing and cache-invalidation, and the processor hardware handles the relevant "talk to other processors about the cache content").
The OS (subject to bugs, of course) does that sort of thing when passing between processes - or within a process. But for passing pointers, you can't rely on that: the newly received pointer value is correct, but the content the pointer is pointing at isn't cache-managed.
On some processors, you can use a memory barrier to ensure the correct ordering of memory content between threads. This forces the processor to "perform all memory operations before this point". However, in the case of system calls like read and write, the OS should take care of that for you, and ensure that the memory has been properly written before read starts to read the memory it wants to store in the socket buffer, and write will have a memory barrier after it has stored your data (in this case the value of the pointer, but memory barriers affect all reads and/or writes that precede that point).
If you were to implement your own primitives for passing data, and the processors did not have cache coherency (most modern processors do), you would also need to add a cache flush on the writing side and a cache invalidate on the reading side. This is architecture-dependent; there is no support for it in standard C or C++. (On some processors only OS functionality [kernel mode] can flush or invalidate cache content; on others it can be done in user-mode code. The granularity of such operations also varies: it may be necessary to flush or invalidate the entire cache system, or individual lines of 32, 64 or 128 bytes can be flushed at a time.)
In C++, you don't need to care about implementation details like caches. The only thing you need to do is make sure there is a C++ happens-before relation.
As Mats Petersson's answer shows, std::atomic is one way to achieve that. All accesses to an atomic variable are ordered, although the order might not be statically determined (i.e. if you have two threads trying to write to the same atomic variable, you can't predict which write happens last).
Another mechanism to enforce synchronization is std::mutex. Threads can attempt to lock a mutex, but only one thread can have a mutex locked at a time. The other threads will block. The compiler will make certain that when one thread unlocks a mutex and the next thread locks the mutex, writes by the first thread can be read by the second thread. If this requires flushing the cache, the compiler will arrange that.
Yet another mechanism is std::atomic_thread_fence. This is useful if you have multiple objects shared between threads (all in the same direction). Instead of making them all atomic, you can make one of them atomic and "attach" a fence to that atomic variable. You then write to the atomic variable last, and read from it first. Obviously this is best encapsulated in a class.
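A sketch of that fence pattern, with one-way publication from a producer thread to a consumer (names made up):

#include <atomic>

static int a, b, c;                    // plain shared data
static std::atomic<bool> ready{false}; // the one atomic the fences "attach" to

// Producer: write the data, then the atomic flag last.
void produce() {
    a = 1; b = 2; c = 3;
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);
}

// Consumer: read the atomic first. If the load sees true, the release
// fence / acquire fence pair gives a happens-before relation, so reading
// a, b, c afterwards is safe.
bool consume(int& out) {
    if (!ready.load(std::memory_order_relaxed))
        return false;
    std::atomic_thread_fence(std::memory_order_acquire);
    out = a + b + c;
    return true;
}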

Thread-safe Settings

I'm writing some settings classes that can be accessed from everywhere in my multithreaded application. I will read these settings very often (so read access should be fast), but they are not written very often.
For primitive datatypes it looks like boost::atomic offers what I need, so I came up with something like this:
class UInt16Setting
{
private:
    boost::atomic<uint16_t> _Value;

public:
    uint16_t getValue() const { return _Value.load(boost::memory_order_relaxed); }
    void setValue(uint16_t value) { _Value.store(value, boost::memory_order_relaxed); }
};
Question 1: I'm not sure about the memory ordering. I think in my application I don't really care about memory ordering (do I?). I just want to make sure that getValue() always returns a non-corrupted value (either the old or the new one). So are my memory ordering settings correct?
Question 2: Is this approach using boost::atomic recommended for this kind of synchronization? Or are there other constructs that offer better read performance?
I will also need some more complex setting types in my application, like std::string or for example a list of boost::asio::ip::tcp::endpoints. I consider all these setting values as immutable. So once I set the value using setValue(), the value itself (the std::string or the list of endpoints itself) does not change anymore. So again I just want to make sure that I get either the old value or the new value, but not some corrupted state.
Question 3: Does this approach work with boost::atomic<std::string>? If not, what are alternatives?
Question 4: How about more complex setting types like the list of endpoints? Would you recommend something like boost::atomic<boost::shared_ptr<std::vector<boost::asio::ip::tcp::endpoint>>>? If not, what would be better?
Q1: Correct, if you don't try to read any shared non-atomic variables after reading the atomic. Memory barriers only synchronize access to non-atomic variables that may happen between atomic operations.
Q2 I don't know (but see below)
Q3: Should work (if it compiles). However,
atomic<string>
possibly isn't lock-free.
Q4: Should work but, again, the implementation possibly isn't lock-free (implementing a lock-free shared_ptr is a challenging and patent-mined field).
So probably a readers-writer lock (as Damon suggests in the comments) may be simpler and even more effective, if your config includes data larger than one machine word (for which CPUs' native atomics usually work).
[EDIT] However,
atomic<shared_ptr<TheWholeStructContainingAll>>
may make some sense even if it isn't lock-free: this approach minimizes the collision probability for readers that need more than one coherent value, though the writer has to make a new copy of the whole "parameter sheet" every time it changes something.
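A sketch of that approach with the C++11 shared_ptr atomic-access free functions (Settings and its fields are made up; these functions are deprecated in C++20 in favor of std::atomic&lt;std::shared_ptr&lt;T&gt;&gt;):

#include <cstdint>
#include <memory>
#include <string>
#include <vector>

struct Settings {                      // the whole "parameter sheet"
    std::uint16_t port = 0;
    std::vector<std::string> endpoints;
};

static std::shared_ptr<const Settings> g_settings = std::make_shared<Settings>();

// Readers: one call yields a consistent snapshot of *all* settings.
std::shared_ptr<const Settings> get_settings() {
    return std::atomic_load(&g_settings);
}

// Single writer assumed: copy the whole sheet, modify, publish atomically.
// (Concurrent writers would need a mutex or a compare-exchange loop.)
void set_port(std::uint16_t p) {
    auto next = std::make_shared<Settings>(*std::atomic_load(&g_settings));
    next->port = p;
    std::atomic_store(&g_settings, std::shared_ptr<const Settings>(std::move(next)));
}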
For question 1, the answer is "depends, but probably not". If you really only care that a single value isn't garbled, then yes, this is fine, and you don't care about memory order either.
Usually, though, this is a false premise.
For questions 2, 3, and 4 yes, this will work, but it will likely use locking for complex objects such as string (internally, for every access, without you knowing). Only rather small objects which are roughly the size of one or two pointers can normally be accessed/changed atomically in a lockfree manner. This depends on your platform, too.
It's a big difference whether one successfully updates one or two values atomically. Say you have the values left and right which delimit the left and right boundaries of where a task will do some processing in an array. Assume they are 50 and 100, respectively, and you change them to 101 and 150, each atomically. So the other thread picks up the change from 50 to 101 and starts doing its calculation, sees that 101 > 100, finishes, and writes the result to a file. After that, you change the output file's name, again, atomically.
Everything was atomic (and thus, more expensive than normal), but none of it was useful. The result is still wrong, and was written to the wrong file, too.
This may not be a problem in your particular case, but usually it is (and, your requirements may change in the future). Usually you really want the complete set of changes being atomic.
That said, if you have either many or complex (or, both many and complex) updates like this to do, you might want to use one big (reader-writer) lock for the whole config in the first place anyway, since that is more efficient than acquiring and releasing 20 or 30 locks or doing 50 or 100 atomic operations. Do however note that in any case, locking will severely impact performance.
As pointed out in the comments above, I would preferably make a deep copy of the configuration in the one thread that modifies it, and schedule updates of the reference (shared pointer) used by consumers as normal tasks. That copy-modify-publish approach is a bit similar to how MVCC databases work, too (these, too, have the problem that locking kills their performance).
Modifying a copy guarantees that threads only ever read shared state that isn't being mutated, so no synchronization is necessary either for readers or for the single writer. Reading and writing are fast.
Swapping the configuration set happens only at well-defined points in times when the set is guaranteed to be in a complete, consistent state and threads are guaranteed not to do something else, so no ugly surprises of any kind can happen.
A typical task-driven application would look somewhat like this (in C++-like pseudocode):
// consumer/worker thread(s)
for(;;)
{
    task = queue.pop();
    switch(task.code)
    {
    case EXIT:
        return;
    case SET_CONFIG:
        my_conf = task.data;
        break;
    default:
        task.func(task.data, &my_conf); // can read without sync
    }
}

// thread that interacts with user (also producer)
for(;;)
{
    input = get_input();

    if(input.action == QUIT)
    {
        queue.push(task(EXIT, 0, 0));
        for(auto& th : threads)
            th.join();
        return 0;
    }
    else if(input.action == CHANGE_SETTINGS)
    {
        new_config = new config(config); // copy, readonly operation, no sync
        // assume we have operator[] overloaded
        new_config[...] = ...;           // I own this exclusively, no sync
        task t(SET_CONFIG, 0, shared_ptr<...>(input.data));
        queue.push(t);
    }
    else if(input.action() == ADD_TASK)
    {
        task t(RUN, input.func, input.data);
        queue.push(t);
    }
    ...
}
For anything more substantial than a pointer, use a mutex. The tbb (open-source) library supports the concept of reader-writer mutexes, which allow multiple simultaneous readers; see the documentation.