std::atomic - behaviour of relaxed ordering - c++

Can the following call to print result in outputting stale/unintended values?
std::mutex g;
std::atomic<int> seq;
int g_s = 0;
int i = 0, j = 0, k = 0; // ignore the fact that these could easily be made atomic

// Thread 1
void do_work() // seldom called
{
    // avoid over
    std::lock_guard<std::mutex> lock{g};
    i++;
    j++;
    k++;
    seq.fetch_add(1, std::memory_order_relaxed);
}

// Thread 2
void consume_work() // spinning
{
    const auto s = g_s;
    // avoid overhead of constantly acquiring lock
    g_s = seq.load(std::memory_order_relaxed);
    if (s != g_s)
    {
        // no lock guard
        print(i, j, k);
    }
}

TL;DR: this is super broken; use a Seq Lock instead, or RCU if your data structure is bigger.
Yes, you have data-race UB, and in practice stale values are likely; so are inconsistent values (from different increments). ISO C++ has nothing to say about what will happen, so it depends on how it happens to compile for some real machine, and interrupts / context switches in the reader that happen in the middle of reading some of these multiple vars. e.g. if the reader sleeps for any reason between reading i and j, you could miss many updates, or at least get a j that doesn't match your i.
Relaxed seq with writer+reader using lock_guard
I'm assuming the writer would look the same, so the atomic RMW increment is inside the critical section.
I'm picturing the reader checking seq like it is now, and only taking a lock after that, inside the block that runs print.
Even if you did use lock_guard to make sure the reader got a consistent snapshot of all three variables (something you couldn't get from making each of them separately atomic), I'm not sure relaxed would be sufficient in theory. It might be in practice on most real implementations for real machines (where compilers have to assume there might be a reader that synchronizes a certain way, even if there isn't in practice). I'd use at least release/acquire for seq, if I was going to take a lock in the reader.
Taking a mutex is an acquire operation, same as a std::memory_order_acquire load on the mutex object. A relaxed increment inside a critical section can't become visible to other threads until after the writer has taken the lock.
But in the reader, with if( xyz != seq.load(relaxed) ) { take_lock; ... }, the load is not guaranteed to "happen before" taking the lock. In practice on many ISAs it will, especially x86 where all atomic RMWs are full memory barriers. But in ISO C++, and maybe on some real implementations, it's possible for the relaxed load to reorder into the reader's critical section. Of course, ISO C++ doesn't define things in terms of "reordering", only in terms of synchronizes-with relationships and which values loads are allowed to see.
(This reordering may not be fully plausible; it would mean the read side would have to actually take the lock based on branch prediction / speculation on the load result. Maybe with lock elision like x86 did with transactional memory, except without x86's strong memory ordering?)
Anyway, it's pretty hairy to reason about, and release/acquire ops are quite cheap on most CPUs. If you expected them to be expensive, and for the check to often be false, you could check again with an acquire load, or put an acquire fence inside the if so it doesn't happen on the no-new-work path.
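For example, a sketch of that double-check shape, reusing the globals from the question's code (the fence placement is the point; details are illustrative):

// Reader: relaxed fast path; pay for ordering only when there's new work.
void consume_work()
{
    if (g_s == seq.load(std::memory_order_relaxed))
        return;                                          // common case: no barrier at all
    std::atomic_thread_fence(std::memory_order_acquire); // upgrade the relaxed load
    std::lock_guard<std::mutex> lock{g};                 // lock acquisition is itself an acquire
    g_s = seq.load(std::memory_order_relaxed);           // re-read under the lock
    print(i, j, k);                                      // consistent snapshot under the lock
}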
Use a Seq Lock
Your problem is better solved by using your sequence counter as part of a Seq Lock, so neither reader nor writer needs a mutex. (Summary: increment before writing, then touch the payload, then increment again. In the reader, read i, j, and k into local temporaries, then check the sequence number again to make sure it's the same, and an even number, all with appropriate memory barriers.)
See the Wikipedia article and/or the link below for the actual details, but the real change from what you have now is that the sequence number has to increment by 2 per update. If you can't handle that, use a separate counter for the actual lock, with seq as part of the payload.
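As a concrete shape, here's a minimal sketch of that scheme (names invented; see the Wikipedia article and the linked answer below for fully worked-out, tested versions):

struct SeqLocked {
    std::atomic<unsigned> seq{0};  // even = stable, odd = write in progress
    int i = 0, j = 0, k = 0;       // payload; plain ints are the usual in-practice
                                   // shortcut (strictly a data race, as discussed below)
};

// single writer
void publish(SeqLocked& d, int a, int b, int c)
{
    unsigned s = d.seq.load(std::memory_order_relaxed);
    d.seq.store(s + 1, std::memory_order_relaxed);        // now odd
    std::atomic_thread_fence(std::memory_order_release);  // odd store stays before payload stores
    d.i = a; d.j = b; d.k = c;
    d.seq.store(s + 2, std::memory_order_release);        // even again: publish
}

// readers retry instead of blocking
void snapshot(const SeqLocked& d, int& a, int& b, int& c)
{
    unsigned s0, s1;
    do {
        s0 = d.seq.load(std::memory_order_acquire);
        a = d.i; b = d.j; c = d.k;
        std::atomic_thread_fence(std::memory_order_acquire); // payload loads stay before re-check
        s1 = d.seq.load(std::memory_order_relaxed);
    } while (s0 != s1 || (s0 & 1));   // torn or mid-write: try again
}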
If you don't want to use a mutex in the reader, using one in the writer only helps in terms of implementation-detail side effects, like making sure stores to memory actually happen and that i isn't kept in a register across calls if do_work inlines into some caller.
BTW, updating seq doesn't need to be an atomic RMW if there's only one writer. You can do a relaxed load and separately store an incremented temporary (with release semantics).
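i.e. something like this in the single-writer case (a sketch):

unsigned s = seq.load(std::memory_order_relaxed);  // no other writer to race with
seq.store(s + 1, std::memory_order_release);       // avoids the cost of a full atomic RMW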
A Seq Lock is good for cheap reads and occasional writes that make the reader retry. Implementing 64 bit atomic counter with 32 bit atomics shows appropriate fencing.
It relies on non-atomic reads that may have a data race, but not using the result if your sequence counter detects tearing. C++ doesn't define the behaviour in that case, but it works in practice on real implementations. (C++ is mostly keeping its options open in case of hardware race detection, which normal CPUs don't do.)
If you have multiple writers, you'd still use a normal lock to give mutual exclusion between them, or you could use the sequence counter itself as a spinlock: a writer acquires it by making the count odd. Otherwise you just need the sequence counter.
Your global g_s is just to track the latest sequence number the reader has seen? Storing it next to the data defeats some of the purpose/benefit, since it means the reader is writing to the same cache line the writer writes, assuming that variables declared near each other all end up together. Consider making it static inside the function, or separating it with other data or with padding, like alignas(64) or alignas(128). (That wouldn't guarantee that a compiler doesn't put it right before the other vars, though; a struct would let you control the layout of all of them. With enough alignment, you can make sure they're not in the same aligned pair of cache lines.)
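For instance, a sketch of the struct approach (member names invented; 128 bytes covers an aligned pair of typical 64-byte x86 lines, but the right numbers are per-CPU, not portable constants):

struct Shared {
    alignas(128) std::atomic<unsigned> seq{0};  // writer-owned line(s), with the payload
    int i = 0, j = 0, k = 0;
    alignas(128) unsigned reader_seen = 0;      // reader's bookkeeping, padded well away
};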

Even ignoring the staleness, this causes a data race and UB.
Thread 2 can read i, j, k while thread 1 is modifying them; you don't synchronize the access to those variables. If thread 2 doesn't respect the mutex g, there's no point in thread 1 locking it.

Yes, it can.
First of all, the lock_guard does not have any effect on your code: a lock has to be used by at least two threads to do anything.
Thread 2 can read at any moment. It can read an incremented i and unincremented j and k. In theory, it can even read a weird partial value obtained by reading in between the updates of the various bytes that compose i - for example, incrementing from 0xFF to 0x100 could result in reading 0x1FF or 0x0 - though not on x86, where these updates happen to be atomic.

Related

C++: std::memory_order in std::atomic_flag::test_and_set to do some work only once by a set of threads

Could you please help me to understand what std::memory_order should be used in std::atomic_flag::test_and_set to do some work only once by a set of threads, and why? The work should be done by whatever thread gets to it first, and all other threads should just check as quickly as possible that someone is already doing the work, and continue working on other tasks.
In my tests of the example below, any memory order works, but I think that it is just a coincidence. I suspect that Release-Acquire ordering is what I need, but, in my case, only one memory_order can be used in both threads (it is not the case that one thread can use memory_order_release and the other can use memory_order_acquire since I do not know which thread will arrive to doing the work first).
#include <atomic>
#include <iostream>
#include <thread>

std::atomic_flag done = ATOMIC_FLAG_INIT;

const std::memory_order order = std::memory_order_seq_cst;
//const std::memory_order order = std::memory_order_acquire;
//const std::memory_order order = std::memory_order_relaxed;

void do_some_work_that_needs_to_be_done_only_once(void)
{ std::cout << "Hello, my friend\n"; }

void run(void)
{
    if(not done.test_and_set(order))
        do_some_work_that_needs_to_be_done_only_once();
}

int main(void)
{
    std::thread a(run);
    std::thread b(run);
    a.join();
    b.join();
    // expected result:
    // * only one thread said hello
    // * all threads spent as little time as possible to check if any
    //   other thread said hello yet
    return 0;
}
Thank you very much for your help!
Following up on some things in the comments:
As has been discussed, there is a well-defined modification order M for done on any given run of the program. Every thread does one store to done, which means one entry in M. And by the nature of atomic read-modify-writes, the value returned by each thread's test_and_set is the value that immediately precedes its own store in the order M. That's promised in C++20 atomics.order p10, which is the critical clause for understanding atomic RMW in the C++ memory model.
Now there are a finite number of threads, each corresponding to one entry in M, which is a total order. Necessarily there is one such entry that precedes all the others. Call it m1. The test_and_set whose store is entry m1 in M must return the preceding value in M. That can only be the value 0 which initialized done. So the thread corresponding to m1 will see test_and_set return 0. Every other thread will see it return 1, because each of their modifications m2, ..., mN follows (in M) another modification, which must have been a test_and_set storing the value 1.
We may not be bothering to observe all of the total order M, but this program does determine which of its entries is first on this particular run. It's the unique one whose test_and_set returns 0. A thread that sees its test_and_set return 1 won't know whether it came 2nd or 8th or 96th in that order, but it does know that it wasn't first, and that's all that matters here.
Another way to think about it: suppose it were possible for two threads (tA, tB) both to load the value 0. Well, each one makes an entry in the modification order; call them mA and mB. M is a total order so one has to go before the other. And bearing in mind the all-important [atomics.order p10], you will quickly find there is no legal way for you to fill out the rest of M.
All of this is promised by the standard without any reference to memory ordering, so it works even with std::memory_order_relaxed. The only effect of relaxed memory ordering is that we can't say much about how our load/store will become visible with respect to operations on other variables. That's irrelevant to the program at hand; it doesn't even have any other variables.
In the actual implementation, this means that an atomic RMW really has to exclusively own the variable for the duration of the operation. We must ensure that no other thread does a store to that variable, nor the load half of a read-modify-write, during that period. In a MESI-like coherent cache, this is done by temporarily locking the cache line in the E state; if the system makes it possible for us to lose that lock (like an LL/SC architecture), abort and start again.
As to your comment about "a thread reading false from its own cache/buffer": the implementation mustn't allow that in an atomic RMW, not even with relaxed ordering. When you do an atomic RMW, you must read it while you hold the lock, and use that value in the RMW operation. You can't use some old value that happens to be in a buffer somewhere. Likewise, you have to complete the write while you still hold the lock; you can't stash it in a buffer and let it complete later.
relaxed is fine if you just need to determine the winner of the race to set the flag (see footnote 1), so one thread can start on the work and later threads can just continue on.
If the run_once work produces data that other threads need to be able to read, you'll need a release store after that, to let potential readers know that the work is finished, not just started. If it was instead just something like printing or writing to a file, and other threads don't care when that finishes, then yeah, you have no ordering requirements between threads beyond the modification order of done, which exists even with relaxed. An atomic RMW like test_and_set lets you determine which thread's modification was first.
BTW, you should check read-only before even trying to test-and-set, unless run() is only called very infrequently, like once per thread at startup. For something like a static int foo = non_constant; local var, compilers use a guard variable that's loaded (with an acquire load) to see if init is already complete. If it's not, they branch to code that uses an atomic RMW to modify the guard variable, with one thread winning and the rest effectively waiting on a mutex for that thread to finish the init.
You might want something like that if you have data that all threads should read. Or just use a static int foo = something_to_run_once(), or some type other than int, if you actually have some data to init.
Or perhaps use C++11 std::call_once to solve this problem for you.
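A sketch combining those ideas (atomic_flag::test needs C++20; note that with std::call_once, latecomers block until the work finishes, which may or may not be what you want here):

#include <atomic>
#include <mutex>   // std::call_once / std::once_flag

std::atomic_flag done = ATOMIC_FLAG_INIT;

void run()
{
    // Read-only fast path; only the first few arrivals pay for the RMW.
    if (!done.test(std::memory_order_acquire)
        && !done.test_and_set(std::memory_order_acquire))
    {
        do_some_work_that_needs_to_be_done_only_once();
    }
}

// Or let the library do it:
std::once_flag flag;
void run_once()
{
    std::call_once(flag, do_some_work_that_needs_to_be_done_only_once);
}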
On normal systems, atomic_flag has no advantage over an atomic_bool. done.exchange(true) on a bool is equivalent to test_and_set of a flag. But atomic_bool is more flexible in terms of the operations it supports, like a plain read that isn't part of an RMW test-and-set.
C++20 does add a test() method for atomic_flag. ISO C++ guarantees that atomic_flag is lock-free, but in practice so is std::atomic<bool> on all real-world systems.
Footnote 1: why relaxed guarantees a single winner
The memory_order parameter only governs ordering wrt. operations on other variables by the same thread.
Does calling test_and_set by a thread force somehow synchronization of the flag with values written by other threads?
It's not a pure write, it's an atomic read-modify-write, so the result of the one that went first is guaranteed to be visible to the one that happens to be second. That's the whole point of test-and-set as a primitive building block for mutual exclusion.
If two TAS operations could both load the original value (false), and then both store true, they wouldn't be atomic. They'd have overlapped with each other.
Two atomic RMWs on the same atomic object must happen in some order, the modification order of that object. (Because they're not read-only: an RMW includes a modification, but it also includes a read, so you can see what the value was immediately before the new value; that read is tied to the modification order, unlike a plain read.)
Every atomic object separately has a modification order that all threads can agree on; this is guaranteed by ISO C++. (With orders weaker than seq_cst, ordering between different objects can differ from source order, and it's not even guaranteed that all threads agree on which store happened first: the IRIW problem.)
Being an atomic RMW guarantees that exactly one test_and_set will return false in thread A or B. Same for fetch_add with multiple threads incrementing a counter: the increments have to happen in some order (i.e. serialized with each other), and whatever that order is becomes the modification-order of that atomic object.
Atomic RMWs have to work this way to not lose counts. i.e. to actually be atomic.
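A sketch of the difference (assume many threads run these concurrently; names invented):

std::atomic<int> c{0};

void lossy()      // separate load + store: two threads can read the same old
{                 // value and both store old+1, losing an increment
    c.store(c.load(std::memory_order_relaxed) + 1, std::memory_order_relaxed);
}

void lossless()   // one atomic RMW: increments serialize in c's modification order
{
    c.fetch_add(1, std::memory_order_relaxed);
}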

memory model: how does load-acquire semantics actually work?

From a very nice paper and article about memory reordering:
Q1: I understand that cache coherence, the store buffer, and the invalidation queue are the root cause of memory reordering?
Store-release is quite understandable: it has to wait for all loads and stores to complete before setting the flag to true.
About load-acquire: a typical use of an atomic load is waiting for a flag. Suppose we have 2 threads:
int x = 0;
std::atomic<bool> ready_flag = false;

// thread-1
if(ready_flag.load(std::memory_order_relaxed))
{
    // (1)
    // load x here
}
// (2)
// load x here

// thread-2
x = 100;
ready_flag.store(true, std::memory_order_release);
EDIT: in thread-1 it should be a while loop, but I copied the logic from the article above. So, assume the memory reordering happens just in time.
Q2: Because (1) and (2) depend on the if condition, the CPU has to wait for ready_flag. Does that mean write-release is enough? How can memory reordering happen in this context?
Q3: Obviously we have load-acquire, so I guess memory reordering is possible; then where should we place the fence, at (1) or (2)?
Accessing an atomic variable is not a mutex operation; it merely accesses the stored value atomically, so that nothing can interrupt the access and no data race can occur on that value (it can also issue barriers with regard to other accesses, which is what the memory orders provide). But that's it; it doesn't wait for any particular value to appear in the atomic variable.
As such, your if statement will read whatever value happens to be there at the time. If you want to guard access to x until the other statement has written to it and signaled the atomic, you must:
Not allow any code to read from x until the atomic flag has returned the value true. Simply testing the value once won't do that; you must loop over repeated accesses until it is true. Any other attempt to read from x results in a data race and is therefore undefined behavior.
Whenever you access the flag, you must do so in a way that tells the system that values written by the thread setting that flag should be visible to subsequent operations that see the set value. That requires a proper memory order, one which must be at least memory_order_acquire.
To be technical, the read from the flag itself doesn't have to do the acquire. You could perform an acquire operation after having read the proper value from the flag. But you need to have an acquire-equivalent operation happen before reading x.
The writing statement must set the flag using a releasing memory order that must be at least as powerful as memory_order_release.
Because (1) and (2) depends on if condition, CPU have to wait for ready_flag
There are 2 showstopper flaws in that reasoning:
Branch prediction + speculative execution is a real thing in real CPUs. Control dependencies behave differently from data dependencies. Speculative execution breaks control dependencies.
In most (but not all) real CPUs, data dependencies do work like C++ memory_order_consume. A typical use-case is loading a pointer and then dereferencing it. That's still not safe in C++'s very weak memory model, but will happen to compile to asm that works for most ISAs other than DEC Alpha. Alpha can (in practice on some hardware) even manage to violate causality and load a stale value when dereferencing a just-loaded pointer, even if the stores were correctly ordered.
Compilers can break control and even data dependencies. C++ source logic doesn't always translate directly to asm. In this case a compiler could emit asm that works like this:
tmp = load(x);        // compile-time reordering: the load of x hoisted before the relaxed load
if (load(ready_flag))
    actually use tmp;
It's data-race UB in C++ to read x while it might still be being written, but for most specific ISAs there's no problem with that. You just have to avoid actually using any load results that might be bogus.
This might not be a useful optimization for most ISAs, but nothing rules it out. Hiding load latency on in-order pipelines by doing the load earlier might actually be useful sometimes (if the variable isn't being written by another thread, and the compiler might guess that's the case because there's no acquire load).
By far your best bet is to use ready_flag.load(mo_acquire).
A separate problem is that you have commented out code that reads x after the if(), which will run even if the load didn't see the data ready. As #Nicol explained in an answer, this means data-race UB is possible because you might be reading x while the producer is writing it.
Perhaps you wanted to write a spin-wait loop like while(!ready_flag){ _mm_pause(); }? Generally be careful of wasting huge amounts of CPU time spinning; if it might be a long time, use a library-supported thing like maybe a condition variable that gives you efficient fallback to OS-supported sleep/wakeup (e.g. Linux futex) after spinning for a short time.
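For instance, a sketch of such a spin-then-yield loop (reusing x and ready_flag from above; _mm_pause is x86-specific, and the backoff policy is arbitrary):

#include <immintrin.h>  // _mm_pause (x86)
#include <thread>       // std::this_thread::yield

int wait_until_ready()
{
    int spins = 0;
    while (!ready_flag.load(std::memory_order_acquire)) {
        _mm_pause();                    // be polite to a sibling hyperthread
        if (++spins > 4096)
            std::this_thread::yield();  // crude fallback; C++20 ready_flag.wait(false)
    }                                   // or a condition variable scales better
    return x;  // safe: the acquire load that saw true synchronizes with the release store
}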
If you did want a manual barrier separate from the load, it would be:
if (ready_flag.load(mo_relaxed))
{
    atomic_thread_fence(mo_acquire);
    int tmp = x; // now this is safe
}
// atomic_thread_fence(mo_acquire); // still wouldn't make it safe to read x
// because this code runs even when ready_flag == false
Using if(ready_flag.load(mo_acquire)) would lead to an unconditional fence before branching on the ready_flag load, when compiling for any ISA where an acquire load isn't available as a single instruction. (On x86 all loads are acquire; on AArch64, ldar does an acquire load. 32-bit ARM needs load + dmb ish.)
The C++ standard doesn't specify the code generated by any particular construct; only correct combinations of thread communication tools produce a guaranteed result.
You don't get guarantees from the CPU in C++, because C++ is not a kind of (macro) assembly, not even a "high level assembly", at least not unless all objects are volatile.
Atomic objects are communication tools for exchanging data between threads. For correct visibility of memory operations, the correct use is a store with (at least) release ordering followed by a load with (at least) acquire ordering; an RMW can replace the store or the load (with release or acquire semantics, respectively); or either operation can be relaxed and combined with a separate fence.
In all cases:
the thread "publishing" the "done" flag must use a memory ordering at least release (that is: release, release+acquire or sequential consistency),
and the "subscribing" thread, the one acting on the flag must use at least acquire (that is: acquire, release+acquire or sequential consistency).
In practice with separately compiled code other modes might work, depending on the CPU.

relaxed ordering and inter thread visibility

I learnt from "relaxed ordering as a signal" that a store to an atomic variable should be visible to other threads within "a reasonable amount of time".
That said, I am pretty sure it should happen in a very short time (some nanoseconds?).
However, I don't want to rely on "within a reasonable amount of time".
So, here is some code :
std::atomic_bool canBegin{false};

void functionThatWillBeLaunchedInThreadA() {
    if(canBegin.load(std::memory_order_relaxed))
        produceData();
}

void functionThatWillBeLaunchedInThreadB() {
    canBegin.store(true, std::memory_order_relaxed);
}
Thread A and B are within a kind of ThreadPool, so there is no creation of thread or whatsoever in this problem.
I don't need to protect any data, so acquire / consume / release ordering on the atomic store/load is not needed here (I think?).
We know for sure that functionThatWillBeLaunchedInThreadA will be launched AFTER the end of functionThatWillBeLaunchedInThreadB.
However, in such code, we don't have any guarantee that the store will be visible in thread A, so thread A can read a stale value (false).
Here are some solutions I've thought about.
Solution 1: use volatile
Just declare volatile std::atomic_bool canBegin{false}; Here, volatility guarantees that we will not see a stale value.
Solution 2: use a mutex or spinlock
Here the idea is to protect the canBegin access via a mutex/spinlock, which guarantees via release/acquire ordering that we will not see a stale value.
I don't need canGo to be an atomic either.
Solution 3: not sure at all, but a memory fence?
Maybe this code will not work, so tell me :).
bool canGo{false}; // not an atomic value now

// in thread A
std::atomic_thread_fence(std::memory_order_acquire);
if(canGo) produceData();

// in thread B
canGo = true;
std::atomic_thread_fence(std::memory_order_release);
On cppreference, for this case, it is written that:
all non-atomic and relaxed atomic stores that are sequenced-before FB
in thread B will happen-before all non-atomic and relaxed atomic loads
from the same locations made in thread A after FA
Which solution would you use and why?
There's nothing you can do to make a store visible to other threads any sooner. See If I don't use fences, how long could it take a core to see another core's writes? - barriers don't speed up visibility to other cores, they just make this core wait until that's happened.
The store part of an RMW is not different from a pure store for this, either.
(Certainly on x86; not totally sure about other ISAs, where a relaxed LL/SC might possibly get special treatment from the store buffer, possibly being more likely to commit before other stores if this core can get exclusive ownership of the cache line. But I think it still would have to retire from out-of-order execution so the core knows it's not speculative.)
Anthony's answer that was linked in comment is misleading; as I commented there:
If the RMW runs before the other thread's store commits to cache, it doesn't see the value, just like if it was a pure load. Does that mean "stale"? No, it just means that the store hasn't happened yet.
The only reason RMWs need a guarantee about "latest" value is that they're inherently serializing operations on that memory location. This is what you need if you want 100 unsynchronized fetch_add operations to not step on each other and be equivalent to += 100, but otherwise best-effort / latest-available value is fine, and that's what you get from a normal atomic load.
If you require instant visibility of results (a nanosecond or so), that's only possible within a single thread, like x = y; x += z;
Also note, the C / C++ standard requirement (actually just a note) to make stores visible in a reasonable amount of time is in addition to the requirements on ordering of operations. It doesn't mean seq_cst store visibility can be delayed until after later loads. All seq_cst operations happen in some interleaving of program order across all threads.
On real-world C++ implementations, the visibility time is entirely up to hardware inter-core latency. But the C++ standard is abstract, and could in theory be implemented on a CPU that required manual flushing to make stores visible to other threads. Then it would be up to the compiler to not be lazy and defer that for "too long".
volatile atomic<T> is useless; compilers already don't optimize atomic<T>, so every atomic access done by the abstract machine will already happen in the asm. (Why don't compilers merge redundant std::atomic writes?). That's all that volatile does, so volatile atomic<T> compiles to the same asm as atomic<T> for anything you can do with the atomic.
Defining "stale" is a problem because separate threads running on separate cores can't see each other's actions instantly. It takes tens of nanoseconds on modern hardware to see a store from another thread.
But you can't read "stale" values from cache; that's impossible because real CPUs have coherent caches. (That's why volatile int could be used to roll your own atomics before C++11, but is no longer useful.) You may need an ordering stronger than relaxed to get the semantics you want as far as one value being older than another (i.e. "reordering", not "stale"). But for a single value, if you don't see a store, that means your load executed before the other core took exclusive ownership of the cache line in order to commit its store. i.e. that the store hasn't truly happened yet.
In the formal ISO C++ rules, there are guarantees about what value you're allowed to see which effectively give you the guarantees you'd expect from cache coherency for a single object, like that after a reader sees a store, further loads in this thread won't see some older store and then eventually back to the newest store. (https://eel.is/c++draft/intro.multithread#intro.races-19).
(Note that for 2 writers + 2 readers with non-seq_cst operations, it's possible for the readers to disagree about the order in which the stores happened. This is called IRIW reordering, but most hardware can't do it; only some PowerPC. Will two atomic writes to different locations in different threads always be seen in the same order by other threads? - so it's not always quite as simple as "the store hasn't happened yet": it can be visible to some threads before others. But it's still true that you can't speed up visibility, only for example slow down the readers so none of them see it via the "early" mechanism, i.e. with hwsync for the PowerPC loads to drain the store buffer first.)
We know for sure that functionThatWillBeLaunchedInThreadA will be launched AFTER the end of functionThatWillBeLaunchedInThreadB.
First of all, if this is the case then it's likely that your task queue mechanism takes care of the necessary synchronization already.
On to the answer...
By far the simplest thing to do is acquire/release ordering. All the solutions you gave are worse.
std::atomic_bool canBegin{false};

void functionThatWillBeLaunchedInThreadA() {
    if(canBegin.load(std::memory_order_acquire))
        produceData();
}

void functionThatWillBeLaunchedInThreadB() {
    canBegin.store(true, std::memory_order_release);
}
By the way, shouldn't this be a while loop?
void functionThatWillBeLaunchedInThreadA() {
    while (!canBegin.load(std::memory_order_acquire))
    { }
    produceData();
}
I don't need to protect any data, so acquire / consume / release
ordering on atomic store/load are not needed here (I think?)
In this case, the ordering is required to keep the compiler/CPU/memory subsystem from ordering the store of true to canBegin before the previous reads/writes have completed. And it should actually stall the CPU until it can be guaranteed that every write that comes before it in program order will propagate before the store to canBegin. On the load side, it prevents memory from being read/written before canBegin is read as true.
However, in such a code, we don't have any guarantee that the store
will be visible in the thread A, so the thread A can read a stale
value (false).
You said yourself:
a store on an atomic variable should be visible to other thread in a
"within a reasonnable amount of time".
Even with relaxed memory order, a write is guaranteed to eventually reach the other cores and all cores will eventually agree on any given variable's store history, so there are no stale values. There are only values that haven't propagated yet. What's "relaxed" about it is the store order in relation to other variables. Thus, memory_order_relaxed solves the stale read problem (but doesn't address the ordering required as discussed above).
Don't use volatile. It doesn't provide all the guarantees required of atomics in the C++ memory model, so using it would be undefined behavior. See https://en.cppreference.com/w/cpp/atomic/memory_order#Relaxed_ordering at the bottom to read about it.
You could use a mutex or spinlock, but a mutex operation is much more expensive than a lock-free std::atomic acquire-load/release-store. A spinlock will do at least one atomic read-modify-write operation...and possibly many. A mutex is definitely overkill. But both have the benefit of simplicity in the C++ source. Most people know how to use locks so it's easier to demonstrate correctness.
A memory fence will also work, but your fences are in the wrong spots (it's counter-intuitive), and the inter-thread communication variable should be std::atomic. (Careful when playing these games: it's easy to get undefined behavior.) Relaxed ordering is OK thanks to the fences.
std::atomic<bool> canGo{false}; // MUST be atomic

// in thread A
if(canGo.load(std::memory_order_relaxed))
{
    std::atomic_thread_fence(std::memory_order_acquire);
    produceData();
}

// in thread B
std::atomic_thread_fence(std::memory_order_release);
canGo.store(true, std::memory_order_relaxed);
The memory fences are actually more strict than acquire/release ordering on the std::atomic load/store, so this gains nothing and could be more expensive.
It seems like you really want to avoid overhead with your signaling mechanism. This is exactly what the std::atomic acquire/release semantics were invented for! You are worrying too much about stale values. Yes, an atomic RMW will give you the "latest" value, but RMWs are also very expensive operations themselves. I want to give you an idea of how fast acquire/release is. It's most likely that you're targeting x86. x86 has total store order and word-sized loads/stores are atomic, so an acquire load compiles to just a regular load, and a release store compiles to a regular store. So it turns out that almost everything in this long post will probably compile to exactly the same code anyway.

Atomic operation propagation/visibility (atomic load vs atomic RMW load)

Context 
I am writing a thread-safe protothread/coroutine library in C++, and I am using atomics to make task switching lock-free. I want it to be as performant as possible. I have a general understanding of atomics and lock-free programming, but I do not have enough expertise to optimise my code. I did a lot of research, but it was hard to find answers to my specific problem: what is the propagation delay/visibility for different atomic operations under different memory orders?
Current assumptions 
I read that changes to memory are propagated to other threads in such a way that they might become visible:
in different orders to different observers,
with some delay.
I am unsure as to whether this delayed visibility and inconsistent propagation applies only to non-atomic reads, or to atomic reads as well, potentially depending on what memory order is used. As I am developing on an x86 machine, I have no way of testing the behaviour on weakly ordered systems.
Do all atomic reads always read the latest values, regardless of the type of operation and the memory order used? 
I am pretty sure that all read-modify-write (RMW) operations always read the most recent value written by any thread, regardless of the memory order used. The same seems to be true for sequentially consistent operations, but only if all other modifications to a variable are also sequentially consistent. Both are said to be slow, which is not good for my task. If not all atomic reads get the most recent value, then I will have to use RMW operations just for reading an atomic variable's latest value, or use atomic reads in a while loop, to my current understanding.
Does the propagation of writes (ignoring side effects) depend on the memory order and the atomic operation used? 
(This question only matters if the answer to the previous question is that not all atomic reads always read the most recent value. Please read carefully, I do not ask about the visibility and propagation of side-effects here. I am merely concerned with the value of the atomic variable itself.) This would imply that depending on what operation is used to modify an atomic variable, it would be guaranteed that any following atomic read receives the most recent value of the variable. So I would have to choose between an operation guaranteed to always read the latest value, or use relaxed atomic reads, in tandem with this special write operation that guarantees instant visibility of the modification to other atomic operations.
Is atomic lock-free?
First of all, let's get rid of the elephant in the room: using atomic in your code doesn't guarantee a lock-free implementation. atomic is only an enabler for a lock-free implementation. is_lock_free() will tell you if it's really lock-free for the C++ implementation and the underlying types that you are using.
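For example (a sketch; the answers are implementation- and type-dependent):

#include <atomic>
#include <iostream>

struct Big { char bytes[64]; };  // arbitrary trivially copyable payload

int main() {
    std::atomic<int> small;
    std::atomic<Big> big;
    std::cout << small.is_lock_free() << '\n';  // almost always 1
    std::cout << big.is_lock_free() << '\n';    // typically 0: the library uses a hidden lock
    std::cout << std::atomic<int>::is_always_lock_free << '\n';  // C++17, compile-time answer
}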
What's the latest value?
The term "latest" is very ambiguous in the world of multithreading. Because what is the "latest" for one thread that might be put asleep by the OS, might no longer be what is the latest for another thread that is active.
std::atomic only guarantees is a protection against racing conditions, by ensuring that R, M and RMW performed on one atomic in one thread are performed atomically, without any interruption, and that all other threads see either the value before or the value after, but never what's in-between. So atomic synchronize threads by creating an order between concurrent operations on the same atomic object.
You need to see every thread as a parallel universe with its own time and that is unaware of the time in the parallel universes. And like in quantum physics, the only thing that you can know in one thread about another thread is what you can observe (i.e. a "happened before" relation between the universes).
This means that you should not conceive of multithreaded time as if there were an absolute "latest" across all the threads. You need to conceive of time as relative to the other threads. This is why atomics don't create an absolute latest, but only ensure a sequential ordering of the successive states that an atomic will have.
Propagation
The propagation doesn't depend on the memory order nor on the atomic operation performed. memory_order is about sequencing constraints on non-atomic variables around atomic operations, which act like fences. The best explanation of how this works is certainly Herb Sutter's presentation, which is definitely worth the hour and a half if you're working on multithreading optimisation.
Although it is possible that a particular C++ implementation could implement some atomic operation in a way that influences propagation, you cannot rely on any such observation you might make, since there would be no guarantee that propagation works in the same fashion in the next release of the compiler, or on another compiler on another CPU architecture.
But does propagation matter?
When designing lock-free algorithms, it is tempting to read atomic variables to get the latest status. But whereas such a read-only access is atomic, the action immediately after it is not. So the following instructions might assume a state which is already obsolete (for example, because the thread was put to sleep immediately after the atomic read).
Take if(my_atomic_variable<10) and suppose that you read 9. Suppose you're in the best possible world and 9 is the absolute latest value set by all the concurrent threads. Comparing the value with <10 is not atomic, so by the time the comparison succeeds and the if branches, my_atomic_variable might already have a new value of 10. And this kind of problem can occur regardless of how fast the propagation is, even if the read were guaranteed to always get the latest value. And I haven't even mentioned the ABA problem yet.
The only benefit of the read is to avoid a data race and UB. But if you want to synchronize decisions/actions across threads, you need to use an RMW, such as compare-and-swap (e.g. atomic_compare_exchange_strong), so that the ordering of atomic operations results in a predictable outcome.
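For the <10 example, the usual fix is a compare-exchange retry loop, so the test and the update form one atomic step (a sketch; the function name and orderings are illustrative):

std::atomic<int> my_atomic_variable{0};

bool increment_if_below_10()
{
    int cur = my_atomic_variable.load(std::memory_order_relaxed);
    while (cur < 10) {
        // On failure, compare_exchange_weak reloads the current value into cur.
        if (my_atomic_variable.compare_exchange_weak(
                cur, cur + 1,
                std::memory_order_acq_rel, std::memory_order_relaxed))
            return true;   // we observed cur < 10 and bumped it, atomically
    }
    return false;          // the value reached 10 before we could act
}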
After some discussion, here are my findings: First, let's define what an atomic variable's latest value means: In wall-clock time, the very latest write to an atomic variable, so, from an external observer's point of view. If there are multiple simultaneous last writes (i.e., on multiple cores during the same cycle), then it doesn't really matter which one of them is chosen.
Atomic loads of any memory order have no guarantee that the latest value is read. This means that writes have to propagate before you can access them. This propagation may be out of order with respect to the order in which they were executed, as well as differing in order with respect to different observers.
std::atomic_int counter = 0;

void thread()
{
    // Imagine no race between read and write.
    int value = counter.load(std::memory_order_relaxed);
    counter.store(value+1, std::memory_order_relaxed);
}

for(int i = 0; i < 1000; i++)
    std::async(thread);
In this example, according to my understanding of the specs, even if no read-write executions were to interfere, there could still be multiple executions of thread that read the same values, so that in the end, counter would not be 1000. This is because when using normal reads, although threads are guaranteed to read modifications of the same variable in the correct order (they will not read a new value and then, on a later read, an older value), they are not guaranteed to read the globally latest value written to a variable.
This creates the relativity effect (as in Einstein's physics) that every thread has its own "truth", and this is exactly why we need to use sequential consistency (or acquire/release) to restore causality: If we simply use relaxed loads, then we can even have broken causality and apparent time loops, which can happen because of instruction reordering in combination with out-of-order propagation. Memory ordering will ensure that those separate realities perceived by separate threads are at least causally consistent.
Atomic read-modify-write (RMW) operations (such as exchange, compare_exchange, fetch_add,…) are guaranteed to operate on the latest value as defined above. This means that propagation of writes is forced, and results in one universal view on the memory (if all reads you make are from atomic variables using RMW operations), independent of threads. So, if you use atomic.compare_exchange_strong(value,value, std::memory_order_relaxed) or atomic.fetch_or(0, std::memory_order_relaxed), then you are guaranteed to perceive one global order of modification that encompasses all atomic variables. Note that this does not guarantee you any ordering or causality of non-RMW reads.
std::atomic_int counter = 0;

void thread()
{
    // Imagine no race between read and write.
    int value = counter.fetch_or(0, std::memory_order_relaxed);
    counter.store(value+1, std::memory_order_relaxed);
}

for(int i = 0; i < 1000; i++)
    std::async(thread);
In this example (again, under the assumption that none of the thread() executions interfere with each other), it seems to me that the spec forbids value to contain anything but the globally latest written value. So, counter would always be 1000 in the end.
Now, when to use which kind of read? 
If you only need causality within each thread (there might still be different views on what happened in which order, but at least every single reader has a causally consistent view on the world), then atomic loads and acquire/release or sequential consistency suffice.
But if you also need fresh reads (so that you must never read values other than the globally (across all threads) latest value), then you should use RMW operations for reading. Those alone do not create causality for non-atomic and non-RMW reads, but all RMW reads across all threads share the exact same view on the world, which is always up to date.
So, to conclude: Use atomic loads if different world views are allowed, but if you need an objective reality, use RMWs to load.
Multithreading is a surprising area.
First, an atomic read is not ordered after a write. That is, reading a value does not mean that it was written before the read; such a read may even see (indirectly, via another thread) the result of some subsequent atomic write by the same thread.
Sequential consistency is clearly about visibility and propagation. When a thread performs a sequentially consistent atomic write, it makes all its previous writes visible to other threads (propagation). In that case, a (sequentially consistent) read is ordered in relation to the write.
Generally the most performant operations are "relaxed" atomic operations, but they provide minimal ordering guarantees. In principle you can even get causality paradoxes... :-)

C++ atomic load ordering efficiency

I have a memory variable that is updated in thread A and read in other threads. The reader only cares if the value is non-zero. I am guaranteed that once the value is incremented, it never goes back to zero.
Does it make sense to optimize as below?
In other words, on the reader side, I don't need a "fence" once my condition is satisfied.
std::atomic<int> counter;

// writer:
void increment()
{
    counter.store(counter + 1, std::memory_order_release);
}

// reader:
bool iszero()
{
    if (counter.load(std::memory_order_relaxed) > 0) return false;
    // memory fence only if condition not yet reached
    return (counter.load(std::memory_order_acquire) == 0);
}
First, if you've not actually tried using the default (sequentially consistent) atomics, measured the performance of your app, profiled it, and observed them causing a performance problem, I'd suggest turning back now.
However, if you really do need to start reasoning about relaxed atomics...
That is not guaranteed to do what you expect, although it will almost certainly work on x86.
I'm guessing that you're using this to guard the publication of some other non-atomic data.
In that case, you need the guarantee that if you read a non-zero value in the reader thread, then various other side-effects to non-atomic memory locations (i.e. initializing the data you're publishing) that you made in the writer thread prior to the store will be visible to the reader thread.
Reading non-zero with std::memory_order_relaxed does not synchronize with the std::memory_order_release store, so your code above does not have this guarantee.
To get the behaviour I've described, you need to use std::memory_order_acquire. If you're on x86, then acquire doesn't produce any memory fence instructions, so the only way it will differ in performance from memory_order_relaxed is via preventing some compiler optimizations.
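Under those assumptions, a version with the needed ordering might look like this (a sketch that keeps the cheap relaxed fast path but upgrades it with a fence before any published data is read):

std::atomic<int> counter{0};

bool iszero()
{
    if (counter.load(std::memory_order_relaxed) > 0) {
        std::atomic_thread_fence(std::memory_order_acquire);  // now reads of the
        return false;                                         // published data are safe
    }
    return counter.load(std::memory_order_acquire) == 0;
}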