Memory barriers vs. interlocked operations - concurrency

I am trying to improve my understanding of memory barriers. Suppose we have a weak memory model and we adapt Dekker's algorithm. Is it possible to make it work correctly under the weak memory model by adding memory barriers?
I think the answer is a surprising no. The reason (if I am correct) is that although a memory barrier can be used to ensure that a read is not moved past another, it cannot ensure that a read does not see stale data (such as that in a cache). Thus it could see some time in the past when the critical section was unlocked (per the CPU's cache) but at the current time other processors might see it as locked. If my understanding is correct, one must use interlocked operations such as those commonly called test-and-set or compare-and-swap to ensure synchronized agreement of a value at a memory location among multiple processors.
Thus, can we rightly expect that no weak-memory-model system would provide only memory barriers? To be useful, the system must also supply operations like test-and-set or compare-and-swap.
I realize that popular processors, including x86, provide memory models much stronger than a weak memory model. Please focus the discussion on weak memory models.
(If Dekker's algorithm is a poor choice, choose another mutual exclusion algorithm where memory barriers can successfully achieve correct synchronization, if possible.)

You are right that a memory barrier cannot ensure that a read sees up-to-date values. What it does is enforce an ordering between operations, both within a single thread and between threads.
For example, if thread A does a series of stores and then executes a release barrier before a final store to a flag location, and thread B reads from the flag location and then executes an acquire barrier before reading the other values, then those variables will have the values stored by thread A:
// initially x=y=z=flag=0
// thread A
x=1;
y=2;
z=3;
release_barrier();
flag=1;
// thread B
while(flag==0) ; // loop until flag is 1
acquire_barrier();
assert(x==1); // asserts will not fire
assert(y==2);
assert(z==3);
Of course, you need to ensure that the load and store to flag are atomic (which simple load and store instructions are on most common CPUs, provided the variables are suitably aligned). Without the while loop on thread B, thread B may well read a stale value (0) for flag, and thus you cannot guarantee any of the values read for the other variables.
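As a minimal sketch, the same pattern can be written with C++11 atomics and stand-alone fences (same variable names as above; this is an illustration, not the original answer's code):
#include <atomic>

int x = 0, y = 0, z = 0;
std::atomic<int> flag{0};

void thread_a()
{
    x = 1;
    y = 2;
    z = 3;
    std::atomic_thread_fence(std::memory_order_release); // release barrier
    flag.store(1, std::memory_order_relaxed);
}

void thread_b()
{
    while (flag.load(std::memory_order_relaxed) == 0) {} // wait for flag
    std::atomic_thread_fence(std::memory_order_acquire); // acquire barrier
    // x == 1, y == 2, z == 3 are guaranteed to be visible here
}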
Fences can thus be used to enforce synchronization in Dekker's algorithm.
Here's an example implementation in C++ (using C++0x atomic variables):
std::atomic<bool> flag0(false),flag1(false);
std::atomic<int> turn(0);

void p0()
{
    flag0.store(true,std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);

    while (flag1.load(std::memory_order_relaxed))
    {
        if (turn.load(std::memory_order_relaxed) != 0)
        {
            flag0.store(false,std::memory_order_relaxed);
            while (turn.load(std::memory_order_relaxed) != 0)
            {
            }
            flag0.store(true,std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
        }
    }
    std::atomic_thread_fence(std::memory_order_acquire);

    // critical section

    turn.store(1,std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    flag0.store(false,std::memory_order_relaxed);
}

void p1()
{
    flag1.store(true,std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);

    while (flag0.load(std::memory_order_relaxed))
    {
        if (turn.load(std::memory_order_relaxed) != 1)
        {
            flag1.store(false,std::memory_order_relaxed);
            while (turn.load(std::memory_order_relaxed) != 1)
            {
            }
            flag1.store(true,std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
        }
    }
    std::atomic_thread_fence(std::memory_order_acquire);

    // critical section

    turn.store(0,std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    flag1.store(false,std::memory_order_relaxed);
}
For a full analysis see my blog entry at http://www.justsoftwaresolutions.co.uk/threading/implementing_dekkers_algorithm_with_fences.html

Say you put in a load and store barrier after every statement, and in addition you ensured that the compiler didn't reorder your stores. Wouldn't this, on any reasonable architecture, provide strict consistency? Dekker's works on sequentially consistent architectures. Sequential consistency is a weaker condition than strict consistency.
http://www.cs.nmsu.edu/~pfeiffer/classes/573/notes/consistency.html
Even on a CPU that has a weak consistency model, you'd still expect cache coherence. I think that where things get derailed is the behavior of store buffers and speculated reads, and what operations are available to flush stored writes and invalidate speculated reads. If there isn't a load fence that can invalidate speculated reads, or there isn't a write fence that flushes the store buffer, then in addition to not being able to implement Dekker's, you won't be able to implement a mutex!
So here's my claim. If you have a write barrier available, and a read barrier, and the cache is coherent between CPUs, then you can trivially make all code sequentially consistent by flushing writes (store fence) after every instruction, and flushing speculations (read fence) before every instruction. So I claim that you don't need atomics for what you're talking about, and that you can do what you need with Dekker's alone. Sure, you wouldn't want to.
BTW, Corensic, the company I work for, writes cool tools for debugging concurrency issues. Check out http://www.corensic.com.

Some barriers (such as the PowerPC isync, and a .acq load on IA64) also have an effect on subsequent loads. I.e., if a load was satisfied before the isync due to prefetching, it must be discarded. When used appropriately, perhaps that's enough to make Dekker's algorithm work on a weak memory model.
You've also got cache-invalidation logic working for you. If you know that your load is current due to something like an isync, and that the cached version of the data is invalidated if another CPU touches it, is that enough?
Interesting questions aside, Dekker's algorithm is for all practical purposes dumb. You are going to want to use atomic hardware interfaces and memory barriers for any real application, so focusing on how to fix up Dekker's with atomics just doesn't seem worthwhile to me ;)

Related

atomic exchange with memory_order_acquire and memory_order_release

I have a situation where I would like to prepare some data in one thread:
// My boolean flag
std::atomic<bool> is_data_ready = false;
Thread 1 (producer thread):
PrepareData();
if (!is_data_ready.exchange(true, std::memory_order_release)) {
    NotifyConsumerThread();
}
else {
    return;
}
In consumer thread,
Thread 2:
if (is_data_ready.exchange(false, std::memory_order_acquire)) {
    ProcessData();
}
Does it make sense to use acquire/release order (instead of acq_rel order) for exchange? I am not sure if I understand it correctly: does std::memory_order_release in exchange mean the store is a release store? If so, what is the memory order for the load?
An atomic RMW has a load part and a store part. memory_order_release gives the store side release semantics, while leaving the load side relaxed. The reverse for exchange(val, acquire). With exchange(val, acq_rel) or seq_cst, the load would be an acquire load, the store would be a release store.
(compare_exchange_weak/_strong can have one memory order for the pure-load case where the compare failed, and a separate memory order for the RMW case where it succeeds. This distinction is meaningful on some ISAs, but not on ones like x86 where it's just a single instruction that effectively always stores, even in the false case.)
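As a hedged illustration of that two-memory-order form (reusing is_data_ready from the question):
bool expected = false;
if (is_data_ready.compare_exchange_strong(expected, true,
        std::memory_order_acq_rel,   // ordering used if the RMW succeeds
        std::memory_order_acquire))  // ordering used if the compare fails (pure load)
{
    // we changed false -> true; the store side had release semantics
}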
And of course atomicity of the exchange (or any other RMW) is guaranteed regardless of anything else; no stores or RMWs to this object by other cores can come between the load and store parts of the exchange. Notice that I didn't mention pure loads, or operations on other objects. See later in this answer and also For purposes of ordering, is atomic read-modify-write one operation or two?
Yes, this looks sensible, although simplistic and maybe racy in allowing more stuff to be published after the first batch is consumed (or has started to be consumed); see footnote 1. But for the purposes of understanding how atomic RMWs work, and the ordering of their load and store sides, we can ignore that.
exchange(true, release) "publishes" some shared data stored by PrepareData(), and checks the old value to see if the worker thread needs to get notified.
And in the reader, is_data_ready.exchange(false, acquire) is a load that syncs with the release-store if there was one, creating a happens-before relationship that makes it safe to read that data without data-race UB. And tied to that (as part of the atomic RMW), it lets other threads see that it has gone past the point of checking for new work, so it needs another notify if there is any.
Yes, exchange(value, release) means the store part of the RMW has release ordering wrt. other operations in the same thread. The load part is relaxed, but the load/store pair still form an atomic RMW. So the load can't take a value until this core has exclusive ownership of the cache line.
Or in C++ terms, it sees the "latest value" in the modification order of is_data_ready; if some other thread was also storing to is_data_ready, that store will happen either before the load (before the whole exchange), or after the store (after the whole exchange).
Note that a pure load in another core coming after the load part of this exchange is indistinguishable from coming before, so only operations that involve a store are part of the modification order of an object. (That modification order is guaranteed to exist such that all threads can agree on it, even when you're using relaxed loads/stores.)
But the load part of another atomic RMW will have to come before the load part of the exchange, otherwise that other RMW would have this exchange happening between its load and its store. That would violate the atomicity guarantee of the other RMW, so that can't happen. Atomic RMWs on the same object effectively serialize across threads. That's why a million fetch_add(1, mo_relaxed) operations on an atomic counter will increment it by 1 million, regardless of what order they end up running in. (See also C++: std::memory_order in std::atomic_flag::test_and_set to do some work only once by a set of threads re: why atomic RMWs have to work this way.)
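A small self-contained sketch of that serialization guarantee (my example, not from the question):
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

int main()
{
    std::atomic<int> counter{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < 10; ++t)
        threads.emplace_back([&] {
            for (int i = 0; i < 100000; ++i)
                counter.fetch_add(1, std::memory_order_relaxed); // RMWs on one object serialize
        });
    for (auto& th : threads)
        th.join();
    assert(counter.load() == 1000000); // never fires, regardless of interleaving
}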
C++ is specified in terms of syncs-with and whether a happens-before guarantee exists that allows your other loads to see other stores by other threads. But humans often like to think in terms of local reordering (within execution of one thread) of operations that access shared memory (via coherent cache).
In terms of a memory-reordering model, the store part of an exchange(val, release) can reorder with later operations, other than later release or seq_cst operations. (Note that unlocking a mutex counts as a release operation.) But it cannot reorder with any earlier operations. This is what acquire and release semantics are all about, as Jeff Preshing explains: https://preshing.com/20120913/acquire-and-release-semantics/.
Wherever the store ends up, the load is at some point before it. Right before it in the modification order of is_data_ready, but operations on other objects by this thread (especially in other cache lines) may be able to happen in between the load and store parts of an atomic exchange.
In practice, some CPU architectures don't make that possible. Notably, x86 atomic RMW operations are always full barriers, waiting for all earlier loads and stores to complete before the exchange and not starting any later loads or stores until after it. So not even StoreLoad reordering of the store part of an exchange with later loads is possible on x86.
But on AArch64 you can observe StoreLoad reordering of the store part of a seq_cst exchange with a later relaxed load. But only the store part, not the load part; being seq_cst means the load part of the exchange has acquire semantics and thus happens before any later loads. See For purposes of ordering, is atomic read-modify-write one operation or two?
Footnote 1: is this a usable producer/consumer sync algorithm?
With a single boolean flag (not a queue with a read-index / write-index), IDK how a producer would know when it can overwrite the shared variables that the consumer will look at. If it (or another producer thread) did that right away after seeing is_data_ready == false, you'd race with the reader that's just started reading.
If you can solve that problem, this does appear to avoid the possibility of the consumer missing an update and going to sleep, as long as it handles the case where a second writer adds more data and sends a notify before the consumer finishes ProcessData. (The writers only know that the consumer has started, not when it finishes.) I guess this example isn't showing the notification mechanism, which might itself create synchronization.
If two producers run PrepareData() at overlapping times, the first one to finish will send a notification, not both. Unless the consumer's exchange resets is_data_ready between the two producers' exchanges, in which case it will get a second notification. (So that sounds pretty hard to deal with in the consumer, and in whatever data structure PrepareData() manages, unless it's something like a lock-free queue itself, in which case just check the queue for work instead of this mechanism. But again, this is still a usable example to talk about how exchange works.)
If a consumer is frequently checking and finding no work needing doing, that's also extra contention that could have been avoided if it checked read-only until it saw true and only then exchanged it to false (with an acquire exchange). But since you're worrying about notifications, I assume it's not a spin-wait loop, instead sleeping if there isn't work to do.

memory model, how load acquire semantic actually works?

From a very nice paper and article about memory reordering.
Q1: Am I right that cache coherence, the store buffer, and the invalidation queue are the root causes of memory reordering?
Store-release is quite understandable: all earlier loads and stores have to complete before the flag is set to true.
As for load-acquire, a typical use of an atomic load is waiting for a flag. Suppose we have 2 threads:
int x = 0;
std::atomic<bool> ready_flag = false;

// thread-1
if (ready_flag.load(std::memory_order_relaxed))
{
    // (1)
    // load x here
}
// (2)
// load x here

// thread-2
x = 100;
ready_flag.store(true, std::memory_order_release);
EDIT: in thread-1, it should be a while loop, but I copied the logic from the article above. So assume the memory reordering occurs just in time.
Q2: Because (1) and (2) depend on the if condition, the CPU has to wait for ready_flag. Does that mean write-release is enough? How can memory reordering happen in this context?
Q3: Obviously we have load-acquire, so I guess memory reordering is possible. Where then should we place the fence, at (1) or (2)?
Accessing an atomic variable is not a mutex operation; it merely accesses the stored value atomically, with no chance for any CPU operation to interrupt the access, so no data races can occur with regard to accessing that value (it can also issue barriers with regard to other accesses, which is what the memory orders provide). But that's it; it doesn't wait for any particular value to appear in the atomic variable.
As such, your if statement will read whatever value happens to be there at the time. If you want to guard access to x until the other statement has written to it and signaled the atomic, you must:
Not allow any code to read from x until the atomic flag has returned the value true. Simply testing the value once won't do that; you must loop over repeated accesses until it is true. Any other attempt to read from x results in a data race and is therefore undefined behavior.
Whenever you access the flag, you must do so in a way that tells the system that values written by the thread setting that flag should be visible to subsequent operations that see the set value. That requires a proper memory order, one which must be at least memory_order_acquire.
To be technical, the read from the flag itself doesn't have to do the acquire. You could perform an acquire operation after having read the proper value from the flag. But you need to have an acquire-equivalent operation happen before reading x.
The writing statement must set the flag using a releasing memory order that must be at least as powerful as memory_order_release.
Because (1) and (2) depend on the if condition, the CPU has to wait for ready_flag
There are 2 showstopper flaws in that reasoning:
Branch prediction + speculative execution is a real thing in real CPUs. Control dependencies behave differently from data dependencies. Speculative execution breaks control dependencies.
In most (but not all) real CPUs, data dependencies do work like C++ memory_order_consume. A typical use-case is loading a pointer and then dereferencing it. That's still not safe in C++'s very weak memory model, but will happen to compile to asm that works for most ISAs other than DEC Alpha. Alpha can (in practice on some hardware) even manage to violate causality and load a stale value when dereferencing a just-loaded pointer, even if the stores were correctly ordered.
Compilers can break control and even data dependencies. C++ source logic doesn't always translate directly to asm. In this case a compiler could emit asm that works like this:
tmp = load(x);          // compile-time reordering before the relaxed load
if (load(ready_flag))
    actually use tmp;
It's data-race UB in C++ to read x while it might still be being written, but for most specific ISAs there's no problem with that. You just have to avoid actually using any load results that might be bogus.
This might not be a useful optimization for most ISAs, but nothing rules it out. Hiding load latency on in-order pipelines by doing the load earlier might actually be useful sometimes (if the variable wasn't being written by another thread, and the compiler might guess that wasn't happening because there's no acquire load).
By far your best bet is to use ready_flag.load(mo_acquire).
A separate problem is that you have commented out code that reads x after the if(), which will run even if the load didn't see the data ready. As #Nicol explained in an answer, this means data-race UB is possible because you might be reading x while the producer is writing it.
Perhaps you wanted to write a spin-wait loop like while(!ready_flag){ _mm_pause(); }? Generally be careful of wasting huge amounts of CPU time spinning; if it might be a long time, use a library-supported thing like maybe a condition variable that gives you efficient fallback to OS-supported sleep/wakeup (e.g. Linux futex) after spinning for a short time.
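For example, a minimal spin-wait sketch for x86 (assuming <immintrin.h> for _mm_pause; other ISAs would use their own pause/yield hint):
#include <atomic>
#include <immintrin.h>

std::atomic<bool> ready_flag{false};
int x;

void consumer()
{
    while (!ready_flag.load(std::memory_order_acquire))
        _mm_pause(); // reduce contention / power while spinning
    int tmp = x;     // safe: the acquire load saw true, so the store to x is visible
    (void)tmp;
}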
If you did want a manual barrier separate from the load, it would be
if (ready_flag.load(mo_relaxed))
{
    atomic_thread_fence(mo_acquire);
    int tmp = x; // now this is safe
}
// atomic_thread_fence(mo_acquire); // still wouldn't make it safe to read x
// because this code runs even after ready_flag == false
Using if(ready_flag.load(mo_acquire)) would lead to an unconditional fence before branching on the ready_flag load, when compiling for any ISA where an acquire load isn't available as a single instruction. (On x86 all loads are acquire; on AArch64 ldar does an acquire load. 32-bit ARM needs load + dmb ish.)
The C++ standard doesn't specify the code generated by any particular construct; only correct combinations of thread-communication tools produce a guaranteed result.
You don't get guarantees from the CPU in C++, because C++ is not a kind of (macro) assembly, not even a "high-level assembly", at least not unless every object is volatile.
Atomic objects are communication tools to exchange data between threads. The correct use, for correct visibility of memory operations, is either a store operation with (at least) release followed by a load with acquire; the same with an RMW in between; either the store (resp. the load) replaced by an RMW with (at least) release (resp. acquire); or any variant with a relaxed operation and a separate fence.
In all cases:
the thread "publishing" the "done" flag must use a memory ordering at least release (that is: release, release+acquire or sequential consistency),
and the "subscribing" thread, the one acting on the flag must use at least acquire (that is: acquire, release+acquire or sequential consistency).
In practice with separately compiled code other modes might work, depending on the CPU.

relaxed ordering and inter thread visibility

I learnt from relaxed ordering as a signal that a store on an atomic variable should be visible to other threads within "a reasonable amount of time".
That said, I am pretty sure it should happen in a very short time (a few nanoseconds?).
However, I don't want to rely on "within a reasonable amount of time".
So, here is some code:
std::atomic_bool canBegin{false};

void functionThatWillBeLaunchedInThreadA() {
    if (canBegin.load(std::memory_order_relaxed))
        produceData();
}

void functionThatWillBeLaunchedInThreadB() {
    canBegin.store(true, std::memory_order_relaxed);
}
Threads A and B are within a kind of thread pool, so there is no thread creation involved in this problem.
I don't need to protect any data, so acquire/consume/release ordering on the atomic store/load is not needed here (I think?).
We know for sure that the functionThatWillBeLaunchedInThreadA function will be launched AFTER the end of functionThatWillBeLaunchedInThreadB.
However, in such code, we don't have any guarantee that the store will be visible in thread A, so thread A can read a stale value (false).
Here are some solutions I have thought about.
Solution 1 : Use volatility
Just declare volatile std::atomic_bool canBegin{false}; here the volatility guarantees that we will not see a stale value.
Solution 2 : Use mutex or spinlock
Here the idea is to protect access to canBegin via a mutex or spinlock, whose release/acquire ordering guarantees that we will not see a stale value (see the sketch below).
I don't need canGo to be an atomic either.
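For example, something like this minimal sketch (my illustration; canBegin becomes a plain bool guarded by the mutex):
#include <mutex>

std::mutex m;
bool canBegin = false; // plain bool, protected by m

void functionThatWillBeLaunchedInThreadA() {
    bool go;
    {
        std::lock_guard<std::mutex> lk(m); // lock/unlock provide acquire/release ordering
        go = canBegin;
    }
    if (go)
        produceData();
}

void functionThatWillBeLaunchedInThreadB() {
    std::lock_guard<std::mutex> lk(m);
    canBegin = true;
}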
Solution 3 : not sure at all, but memory fence?
Maybe this code will not work, so tell me :).
bool canGo{false}; // not an atomic value now
// in thread A
std::atomic_thread_fence(std::memory_order_acquire);
if(canGo) produceData();
// in thread B
canGo = true;
std::atomic_thread_fence(std::memory_order_release);
On cppreference, for this case, it is written that:
all non-atomic and relaxed atomic stores that are sequenced-before FB
in thread B will happen-before all non-atomic and relaxed atomic loads
from the same locations made in thread A after FA
Which solution would you use and why?
There's nothing you can do to make a store visible to other threads any sooner. See If I don't use fences, how long could it take a core to see another core's writes? - barriers don't speed up visibility to other cores, they just make this core wait until that's happened.
The store part of an RMW is not different from a pure store for this, either.
(Certainly on x86; not totally sure about other ISAs, where a relaxed LL/SC might possibly get special treatment from the store buffer, possibly being more likely to commit before other stores if this core can get exclusive ownership of the cache line. But I think it still would have to retire from out-of-order execution so the core knows it's not speculative.)
Anthony's answer that was linked in a comment is misleading; as I commented there:
If the RMW runs before the other thread's store commits to cache, it doesn't see the value, just like if it was a pure load. Does that mean "stale"? No, it just means that the store hasn't happened yet.
The only reason RMWs need a guarantee about "latest" value is that they're inherently serializing operations on that memory location. This is what you need if you want 100 unsynchronized fetch_add operations to not step on each other and be equivalent to += 100, but otherwise best-effort / latest-available value is fine, and that's what you get from a normal atomic load.
If you require instant visibility of results (a nanosecond or so), that's only possible within a single thread, like x = y; x += z;
Also note, the C / C++ standard requirement (actually just a note) to make stores visible in a reasonable amount of time is in addition to the requirements on ordering of operations. It doesn't mean seq_cst store visibility can be delayed until after later loads. All seq_cst operations happen in some interleaving of program order across all threads.
On real-world C++ implementations, the visibility time is entirely up to hardware inter-core latency. But the C++ standard is abstract, and could in theory be implemented on a CPU that required manual flushing to make stores visible to other threads. Then it would be up to the compiler to not be lazy and defer that for "too long".
volatile atomic<T> is useless; compilers already don't optimize atomic<T>, so every atomic access done by the abstract machine will already happen in the asm. (Why don't compilers merge redundant std::atomic writes?). That's all that volatile does, so volatile atomic<T> compiles to the same asm as atomic<T> for anything you can do with the atomic.
Defining "stale" is a problem because separate threads running on separate cores can't see each other's actions instantly. It takes tens of nanoseconds on modern hardware to see a store from another thread.
But you can't read "stale" values from cache; that's impossible because real CPUs have coherent caches. (That's why volatile int could be used to roll your own atomics before C++11, but is no longer useful.) You may need an ordering stronger than relaxed to get the semantics you want as far as one value being older than another (i.e. "reordering", not "stale"). But for a single value, if you don't see a store, that means your load executed before the other core took exclusive ownership of the cache line in order to commit its store. i.e. that the store hasn't truly happened yet.
In the formal ISO C++ rules, there are guarantees about what value you're allowed to see which effectively give you the guarantees you'd expect from cache coherency for a single object, like that after a reader sees a store, further loads in this thread won't see some older store and then eventually back to the newest store. (https://eel.is/c++draft/intro.multithread#intro.races-19).
(Note for 2 writers + 2 readers with non-seq_cst operations, it's possible for the readers to disagree about the order in which the stores happened. This is called IRIW reordering, but most hardware can't do it; only some PowerPC. Will two atomic writes to different locations in different threads always be seen in the same order by other threads? - so it's not always quite as simple as "the store hasn't happened yet"; it can be visible to some threads before others. But it's still true that you can't speed up visibility, only for example slow down the readers so none of them see it via the "early" mechanism, i.e. with hwsync for the PowerPC loads to drain the store buffer first.)
We know for sure that the functionThatWillBeLaunchedInThreadA function
will be launched AFTER the end of
functionThatWillBeLaunchedInThreadB.
First of all, if this is the case then it's likely that your task queue mechanism takes care of the necessary synchronization already.
On to the answer...
By far the simplest thing to do is acquire/release ordering. All the solutions you gave are worse.
std::atomic_bool canBegin{false};

void functionThatWillBeLaunchedInThreadA() {
    if (canBegin.load(std::memory_order_acquire))
        produceData();
}

void functionThatWillBeLaunchedInThreadB() {
    canBegin.store(true, std::memory_order_release);
}
By the way, shouldn't this be a while loop?
void functionThatWillBeLaunchedInThreadA() {
    while (!canBegin.load(std::memory_order_acquire))
    { }
    produceData();
}
I don't need to protect any data, so acquire / consume / release
ordering on atomic store/load are not needed here (I think?)
In this case, the ordering is required to keep the compiler/CPU/memory subsystem from ordering the store of true to canBegin before the previous reads/writes have completed. And it should actually stall the CPU until it can be guaranteed that every write that comes before in program order will propagate before the store to canBegin. On the load side it prevents memory from being read/written before canBegin is read as true.
However, in such code, we don't have any guarantee that the store
will be visible in thread A, so thread A can read a stale
value (false).
You said yourself:
a store on an atomic variable should be visible to other threads
within "a reasonable amount of time".
Even with relaxed memory order, a write is guaranteed to eventually reach the other cores and all cores will eventually agree on any given variable's store history, so there are no stale values. There are only values that haven't propagated yet. What's "relaxed" about it is the store order in relation to other variables. Thus, memory_order_relaxed solves the stale read problem (but doesn't address the ordering required as discussed above).
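To illustrate what relaxed does not give you (a sketch of mine, not the OP's code): with two variables, the stores can be observed out of order, even though each variable's own history is coherent:
#include <atomic>

std::atomic<int> a{0}, b{0};

void writer()
{
    a.store(1, std::memory_order_relaxed);
    b.store(1, std::memory_order_relaxed);
}

void reader()
{
    if (b.load(std::memory_order_relaxed) == 1)
    {
        // May still see a == 0: relaxed gives no ordering between a and b.
        // With release on the b store and acquire on the b load, a == 1 would be guaranteed.
        int av = a.load(std::memory_order_relaxed);
        (void)av;
    }
}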
Don't use volatile. It doesn't provide all the guarantees required of atomics in the C++ memory model, so using it would be undefined behavior. See https://en.cppreference.com/w/cpp/atomic/memory_order#Relaxed_ordering at the bottom to read about it.
You could use a mutex or spinlock, but a mutex operation is much more expensive than a lock-free std::atomic acquire-load/release-store. A spinlock will do at least one atomic read-modify-write operation...and possibly many. A mutex is definitely overkill. But both have the benefit of simplicity in the C++ source. Most people know how to use locks so it's easier to demonstrate correctness.
A memory fence will also work but your fences are in the wrong spot (it's counter-intuitive) and the inter-thread communication variable should be std::atomic. (Careful when playing these games...! It's easy to get undefined behavior) Relaxed ordering is ok thanks to the fences.
std::atomic<bool> canGo{false}; // MUST be atomic

// in thread A
if (canGo.load(std::memory_order_relaxed))
{
    std::atomic_thread_fence(std::memory_order_acquire);
    produceData();
}

// in thread B
std::atomic_thread_fence(std::memory_order_release);
canGo.store(true, std::memory_order_relaxed);
The memory fences are actually more strict than acquire/release ordering on the std::atomic load/store, so this gains nothing and could be more expensive.
It seems like you really want to avoid overhead with your signaling mechanism. This is exactly what the std::atomic acquire/release semantics were invented for! You are worrying too much about stale values. Yes, an atomic RMW will give you the "latest" value, but RMWs are also very expensive operations themselves. I want to give you an idea of how fast acquire/release is. It's most likely that you're targeting x86. x86 has total store order and word-sized loads/stores are atomic, so a load-acquire compiles to just a regular load and a release store compiles to a regular store. So it turns out that almost everything in this long post will probably compile to exactly the same code anyway.

How do I make memory stores in one thread "promptly" visible in other threads?

Suppose I wanted to copy the contents of a device register into a variable that would be read by multiple threads. Is there a good general way of doing this? Here are two possible methods:
#include <atomic>
volatile int * const Device_reg_ptr = reinterpret_cast<int *>(0x666);
// This variable is read by multiple threads.
std::atomic<int> device_reg_copy;
// ...
// Method 1
const_cast<volatile std::atomic<int> &>(device_reg_copy)
.store(*Device_reg_ptr, std::memory_order_relaxed);
// Method 2
device_reg_copy.store(*Device_reg_ptr, std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_release);
More generally, in the face of possible whole program optimization, how does one correctly control the latency of memory writes in one thread being visible in other threads?
EDIT: In your answer, please consider the following scenario:
The code is running on a CPU in an embedded system.
A single application is running on the CPU.
The application has far fewer threads than the CPU has processor cores.
Each core has a massive number of registers.
The application is small enough that whole program optimization is successfully used when building its executable.
How do we make sure that a store in one thread does not remain invisible to other threads indefinitely?
If you would like to update the value of device_reg_copy in atomic fashion, then device_reg_copy.store(*Device_reg_ptr, std::memory_order_relaxed); suffices.
There is no need to apply volatile to atomic variables.
std::memory_order_relaxed store is supposed to incur the least amount of synchronization overhead. On x86 it is just a plain mov instruction.
However, if you would like to update it in such a way, that the effects of any preceding stores become visible to other threads along with the new value of device_reg_copy, then use std::memory_order_release store, i.e. device_reg_copy.store(*Device_reg_ptr, std::memory_order_release);. The readers need to load device_reg_copy as std::memory_order_acquire in this case. Again, on x86 std::memory_order_release store is a plain mov.
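The reader side would then look like this (a minimal sketch):
// reader thread: pairs with the release store above
int value = device_reg_copy.load(std::memory_order_acquire);
// any stores sequenced before the writer's release store are now visible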
Whereas if you use the most expensive std::memory_order_seq_cst store, it does insert the memory barrier for you on x86.
This is why they say that the x86 memory model is a bit too strong for C++11: a plain mov instruction is std::memory_order_release on stores and std::memory_order_acquire on loads. There is no relaxed store or load on x86.
I cannot recommend the CPU Cache Flushing Fallacy article highly enough.
The C++ standard is rather vague about making atomic stores visible to other threads:
29.3.12
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
That is as detailed as it gets, there is no definition of 'reasonable', and it does not have to be immediately.
Using a stand-alone fence to force a certain memory ordering is not necessary, since you can specify those orderings on the atomic operations themselves; but the question is, what is your expectation with regard to using a memory fence?
Fences are designed to enforce ordering on memory operations (between threads), but they do not guarantee visibility in a timely manner.
You can store a value to an atomic variable with the strongest memory ordering (i.e. seq_cst), but even when another thread executes load() at a later time than the store(), you might still get an old value from the cache and yet (surprisingly) it does not violate the happens-before relationship.
Using a stronger fence might make a difference wrt. timing and visibility, but there are no guarantees.
If prompt visibility is important, I would consider using a Read-Modify-Write (RMW) operation to load the value.
These are atomic operations that read and modify atomically (i.e. in a single call), and have the additional property that they are guaranteed to operate on the latest value.
But since they have to reach a little further than the local cache, these calls also tend to be more expensive to execute.
As pointed out by Maxim Egorushkin, whether or not you can use weaker memory orderings than the default (seq_cst) depends on whether other memory operations need to be synchronized (made visible) between threads.
That is not clear from your question, but it is generally considered safe to use the default (sequential consistency).
If you are on an unusually weak platform, if performance is problematic, and if you need data synchronization between threads, you could consider using acquire/release semantics:
// thread 1
device_reg_copy.store(*Device_reg_ptr, std::memory_order_release);
// thread 2
device_reg_copy.fetch_add(0, std::memory_order_acquire);
If thread 2 sees the value written by thread 1, it is guaranteed that memory operations prior to the store in thread 1 are visible after the load in thread 2.
Acquire/Release operations form a pair and they synchronize based on a run-time relationship between the store and load. In other words, if thread 2 does not see the value stored by thread 1, there are no ordering guarantees.
If the atomic variable has no dependencies on any other data, you can use std::memory_order_relaxed; the store order (modification order) of a single atomic variable is always guaranteed.
As mentioned by others, there is no need for volatile when it comes to inter-thread communication with std::atomic.

Memory model ordering and visibility?

I tried looking for details on this, I even read the standard on mutexes and atomics... but still I couldn't understand the C++11 memory model visibility guarantees.
From what I understand, the very important feature of a mutex BESIDES mutual exclusion is ensuring visibility. I.e., it is not enough that only one thread at a time is increasing the counter; it is important that the thread increments the counter value that was stored by the thread that last used the mutex (I really don't know why people don't mention this more when discussing mutexes; maybe I had bad teachers :)).
So from what I can tell, atomics don't enforce immediate visibility:
(from the person who maintains boost::thread and has implemented the C++11 thread and mutex library):
A fence with memory_order_seq_cst does not enforce immediate
visibility to other threads (and neither does an MFENCE instruction).
The C++0x memory ordering constraints are just that --- ordering
constraints. memory_order_seq_cst operations form a total order, but
there are no restrictions on what that order is, except that it must
be agreed on by all threads, and it must not violate other ordering
constraints. In particular, threads may continue to see "stale" values
for some time, provided they see values in an order consistent with
the constraints.
And I'm OK with that. But the problem is that I have trouble understanding which C++11 constructs regarding atomics are "global" and which only ensure consistency on atomic variables.
In particular, I have trouble understanding which (if any) of the following memory orderings guarantee that there will be a memory fence before and after loads and stores:
http://www.stdthread.co.uk/doc/headers/atomic/memory_order.html
From what I can tell, std::memory_order_seq_cst inserts a memory barrier while the others only enforce ordering of the operations on a certain memory location.
So can somebody clear this up? I presume a lot of people are going to be making horrible bugs using std::atomic, especially if they don't use the default (std::memory_order_seq_cst) memory ordering.
2. If I'm right, does that mean the second line is redundant in this code:
atomicVar.store(42);
std::atomic_thread_fence(std::memory_order_seq_cst);
3. Do std::atomic_thread_fences have the same requirements as mutexes, in the sense that to ensure sequential consistency on non-atomic variables one must issue std::atomic_thread_fence(std::memory_order_seq_cst);
before loads and
std::atomic_thread_fence(std::memory_order_seq_cst);
after stores?
4. Is
{
    regularSum += atomicVar.load();
    regularVar1++;
    regularVar2++;
}
//...
{
    regularVar1++;
    regularVar2++;
    atomicVar.store(74656);
}
equivalent to
std::mutex mtx;
{
    std::unique_lock<std::mutex> ul(mtx);
    regularSum += nowRegularVar;
    regularVar1++;
    regularVar2++;
}
//..
{
    std::unique_lock<std::mutex> ul(mtx);
    regularVar1++;
    regularVar2++;
    nowRegularVar = 74656;
}
I think not, but I would like to be sure.
EDIT:
5.
Can assert fire?
Only two threads exist.
std::atomic<int*> p{nullptr};
first thread writes
{
    int* nonatomic_p = (int*)malloc(16*1024*sizeof(int));
    for (int i = 0; i < 16*1024; ++i)
        nonatomic_p[i] = 42;
    p = nonatomic_p;
}
second thread reads
{
    while (p == nullptr)
    {
    }
    assert(p.load()[1234] == 42); // 1234 - random idx in array
}
If you like to deal with fences, then a.load(memory_order_acquire) is equivalent to a.load(memory_order_relaxed) followed by atomic_thread_fence(memory_order_acquire). Similarly, a.store(x,memory_order_release) is equivalent to a call to atomic_thread_fence(memory_order_release) before a call to a.store(x,memory_order_relaxed). memory_order_consume is a special case of memory_order_acquire, for dependent data only. memory_order_seq_cst is special, and forms a total order across all memory_order_seq_cst operations. Mixed with the others it is the same as an acquire for a load, and a release for a store. memory_order_acq_rel is for read-modify-write operations, and is equivalent to an acquire on the read part and a release on the write part of the RMW.
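A sketch of those correspondences in code (my illustration; the stand-alone-fence forms are at least as strong as the plain operations):
std::atomic<int> a{0};

int load_acquire()    { return a.load(std::memory_order_acquire); }
int load_with_fence() // relaxed load followed by an acquire fence
{
    int v = a.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);
    return v;
}

void store_release(int x)    { a.store(x, std::memory_order_release); }
void store_with_fence(int x) // release fence before a relaxed store
{
    std::atomic_thread_fence(std::memory_order_release);
    a.store(x, std::memory_order_relaxed);
}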
The use of ordering constraints on atomic operations may or may not result in actual fence instructions, depending on the hardware architecture. In some cases the compiler will generate better code if you put the ordering constraint on the atomic operation rather than using a separate fence.
On x86, loads are always acquire, and stores are always release. memory_order_seq_cst requires stronger ordering with either an MFENCE instruction or a LOCK prefixed instruction (there is an implementation choice here as to whether to make the store have the stronger ordering or the load). Consequently, standalone acquire and release fences are no-ops, but atomic_thread_fence(memory_order_seq_cst) is not (again requiring an MFENCE or LOCKed instruction).
An important effect of the ordering constraints is that they order other operations.
std::atomic<bool> ready(false);
int i = 0;

void thread_1()
{
    i = 42;
    ready.store(true, memory_order_release);
}

void thread_2()
{
    while (!ready.load(memory_order_acquire))
        std::this_thread::yield();
    assert(i == 42);
}
thread_2 spins until it reads true from ready. Since the store to ready in thread_1 is a release and the load is an acquire, the store synchronizes-with the load; therefore the store to i happens-before the load from i in the assert, and the assert will not fire.
2) The second line in
atomicVar.store(42);
std::atomic_thread_fence(std::memory_order_seq_cst);
is indeed potentially redundant, because the store to atomicVar uses memory_order_seq_cst by default. However, if there are other non-memory_order_seq_cst atomic operations on this thread then the fence may have consequences. For example, it would act as a release fence for a subsequent a.store(x,memory_order_relaxed).
3) Fences and atomic operations do not work like mutexes. You can use them to build mutexes, but they do not work like them. You do not have to ever use atomic_thread_fence(memory_order_seq_cst). There is no requirement that any atomic operations are memory_order_seq_cst, and ordering on non-atomic variables can be achieved without, as in the example above.
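For instance, here is a minimal spinlock built from std::atomic_flag (a sketch of "using them to build mutexes", not code from the answer):
#include <atomic>

class spinlock
{
    std::atomic_flag locked = ATOMIC_FLAG_INIT;
public:
    void lock()
    {
        while (locked.test_and_set(std::memory_order_acquire))
        {} // spin until we atomically change the flag from clear to set
    }
    void unlock()
    {
        locked.clear(std::memory_order_release); // publish the critical section's writes
    }
};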
4) No, these are not equivalent. Without the mutex lock, your snippet is a data race and undefined behaviour.
5) No your assert cannot fire. With the default memory ordering of memory_order_seq_cst, the store and load from the atomic pointer p work like the store and load in my example above, and the stores to the array elements are guaranteed to happen-before the reads.
From what I can tell, std::memory_order_seq_cst inserts a memory barrier while the others only enforce ordering of the operations on a certain memory location.
It really depends on what you're doing, and on what platform you're working with. The strong memory-ordering model on a platform like x86 creates a different set of requirements for memory fence operations compared to weaker ordering models on platforms like IA64, PowerPC, ARM, etc. What the default std::memory_order_seq_cst parameter ensures is that, depending on the platform, the proper memory fence instructions will be used. Per the x86 memory model, all loads have load-acquire semantics and all stores have store-release semantics, so explicitly setting std::memory_order_release or std::memory_order_acquire on x86 is basically a no-op: the memory model already ensures those types of operations are consistent across threads, and there are no assembly instructions implementing those partial memory barriers. Requiring a full memory barrier in those situations would be an unnecessary performance impediment; on x86 a full barrier is only needed for read-modify-write operations and (as noted in the other answer) for std::memory_order_seq_cst stores.
On other platforms with weaker memory-consistency models, though, that is not the case, and so using std::memory_order_seq_cst will employ the proper memory fence operations without the user having to explicitly specify whether they would like a load-acquire, store-release, or full memory fence operation. These platforms have specific machine instructions for enforcing such memory-consistency contracts, and the std::memory_order_seq_cst setting will work out the proper case. If the user would like to specifically call for one of these operations, they can, through the explicit std::memory_order enum types, but it is not necessary ... the compiler will work out the correct settings.
I presume a lot of people are going to be making horrible bugs using std::atomic, especially if they don't use the default (std::memory_order_seq_cst) memory ordering
Yes, if they don't know what they're doing, and don't understand which types of memory-barrier semantics are called for in certain operations, then there will be a lot of mistakes made if they attempt to explicitly state the type of memory barrier and it's the incorrect one, especially on platforms that will not help their misunderstanding of memory ordering because they are weaker in nature.
Finally, keep in mind with your situation #4 concerning a mutex that there are two different things that need to happen here:
The compiler must not be allowed to reorder operations across the mutex and critical section (especially in the case of an optimizing compiler)
There must be the requisite memory fences created (depending on the platform) that maintain a state where all stores are completed before the critical section and reading of the mutex variable, and all stores are completed before exiting the critical section.
Since by default, atomic stores and loads are implemented with std::memory_order_seq_cst, using atomics would also implement the proper mechanisms to satisfy conditions #1 and #2. That being said, in your first example with atomics, the load would enforce acquire semantics for the block, while the store would enforce release semantics. It would not, though, enforce any particular ordering inside the "critical section" between these two operations. In your second example, you have two different sections with locks, each lock having acquire semantics. Since at some point you would have to release the locks, which would have release semantics, then no, the two code blocks would not be equivalent. In the first example, you've created one big "critical section" between the load and store (assuming this is all happening on the same thread). In the second example you have two different critical sections.
P.S. I've found the following PDF particularly instructive, and you may find it too:
http://www.nwcpp.org/Downloads/2008/Memory_Fences.pdf