Should load-acquire see store-release immediately?

Suppose we have one simple variable (std::atomic<int> var) and two threads, T1 and T2, with the following code for T1:
...
var.store(2, mem_order);
...
and for T2
...
var.load(mem_order)
...
Also let's assume that T2 (load) executes 123 ns later in time (later in the modification order, in terms of the C++ standard) than T1 (store).
My understanding of this situation is as follows (for different memory orders):
memory_order_seq_cst - T2 load is obliged to load 2. So effectively it has to load the latest value (just as is the case with RMW operations).
memory_order_acquire/memory_order_release/memory_order_relaxed - T2 is not obliged to load 2 but can load any older value, with the only restriction that the value must not be older than the latest one already loaded by that thread. So, for example, var.load returns 0.
Am I right with my understanding?
UPDATE1:
If I'm wrong with the reasoning, please provide the text from the C++ standard which proves it. Not just theoretical reasoning about how some architecture might work.

Am I right with my understanding?
No. You misunderstand memory orders.
let's assume that T2(load) executes 123ns later than T1(store)...
In that case, T2 will see what T1 does with any memory order (moreover, this property applies to reads/writes of any memory region; see e.g. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4431.pdf, 1.10, p.15). The key word in your phrase is later: it means that someone else forces an ordering of these operations.
Memory orders are used for another scenario:
Let some operation OP1 come in thread T1 before the store operation and OP2 come after it; let OP3 come in thread T2 before the load operation and OP4 come after it.
//T1:                       //T2:
OP1                         OP3
var.store(2, mem_order)     var.load(mem_order)
OP2                         OP4
Assume that some order between var.store() and var.load() can be observed by the threads. What can one guarantee about the cross-thread order of the other operations?
If var.store uses memory_order_release, var.load uses memory_order_acquire, and var.store is ordered before var.load (that is, the load returns 2), then the effect of OP1 is ordered before OP4.
E.g., if OP1 writes some variable var1 and OP4 reads that variable, then one can be assured that OP4 will read what OP1 wrote. This is the most common case.
If both var.store and var.load use memory_order_seq_cst and var.store is ordered after var.load (that is, the load returns 0, the value of the variable before the store), then the effect of OP2 is ordered after OP3.
This memory order is required by some tricky synchronization schemes.
If either var.store or var.load uses memory_order_relaxed, then whatever the order of var.store and var.load, one can guarantee no ordering of the cross-thread operations.
This memory order is used when someone else ensures the order of operations. E.g., if the creation of thread T2 comes after var.store in T1, then OP3 and OP4 are ordered after OP1.
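The release/acquire case above can be sketched in C++ roughly as follows. This is a minimal illustration, not code from the question: `payload` stands in for var1, and the function name `message_passing_demo` is made up for the example. T2 spins until its acquire load actually returns 2, because only then is the store ordered before the load.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical demo: returns the value of `payload` as observed by T2.
int message_passing_demo() {
    int payload = 0;              // ordinary, non-atomic data (plays the role of var1)
    std::atomic<int> var{0};
    int seen = -1;

    std::thread t1([&] {
        payload = 42;                             // OP1
        var.store(2, std::memory_order_release);  // release store
    });
    std::thread t2([&] {
        // Wait until the load returns the stored value; an acquire load that
        // reads 2 synchronizes with the release store in T1.
        while (var.load(std::memory_order_acquire) != 2) {
            std::this_thread::yield();
        }
        assert(payload == 42);                    // OP4: must observe OP1
        seen = payload;
    });

    t1.join();
    t2.join();
    return seen;
}
```

Calling `message_passing_demo()` is expected to return 42 on every run. If either operation were weakened to memory_order_relaxed, the read of `payload` would be a data race.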
UPDATE: 123 ns later implies that *someone else* forces the ordering, because a computer's processor has no notion of universal time, and no operation has a precise moment at which it is executed. To measure the time between two operations you have to:
Observe an ordering between the end of the first operation and the start of the time-counting operation on some CPU.
Observe an ordering between the start and the end of the time-counting operation.
Observe an ordering between the end of the time-counting operation and the start of the second operation.
Transitively, these steps establish an ordering between the first operation and the second one.

Having found no arguments that prove my understanding wrong, I deem it correct. My proof is as follows:
memory_order_seq_cst - T2 load is obliged to load 2.
That's correct because all operations using memory_order_seq_cst should form a single total order over all the memory operations on the atomic variable.
Excerpt from the standard:
[29.9/3] There shall be a single total order S on all memory_order_seq_cst
operations, consistent with the “happens before” order and
modification orders for all affected locations, such that each
memory_order_seq_cst operation B that loads a value from an atomic
object M observes one of the following values <...>
The next point of my question:
memory_order_acquire/memory_order_release/memory_order_relaxed - T2 is
not obliged to load 2 but can load any older value <...>
I didn't find any evidence indicating that a load executed later in the modification order must see the latest value. The only points I found for store/load operations with any memory order other than memory_order_seq_cst are these:
[29.3/12] Implementations should make atomic stores visible to atomic
loads within a reasonable amount of time.
and
[1.10/28] An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation
will become visible to all other threads in a finite period of time.
So the only guarantee we have is that the written value will become visible within some time. That's a pretty reasonable guarantee, but it doesn't imply immediate visibility of the previous store. And that proves my second point.
Given all that, my initial understanding was correct.

123 ns later doesn't force T2 to see the results of T1. If the physical program counter (transistors, etc.) running T2 is more than 40 meters away from the physical program counter running T1 (a large multi-core supercomputer, etc.), then the speed of light will not allow the state written by T1 to have propagated that far yet. The effect is similar if the physical memory used for the loads/stores is remote, by some distance, from both thread processors.

Related

Do I understand the semantics of std::memory_order correctly?

cppreference.com states about memory_order::seq_cst:
A load operation with this memory order performs an acquire operation, a store performs a release operation, and read-modify-write performs both an acquire operation and a release operation, plus a single total order exists in which all threads observe all modifications in the same order.
[ Q1 ]: Does this mean that the order goes straight down through every operation of all (others + this) atomic_vars with memory_order::seq_cst?
[ Q2 ]: And release , acquire and rel_acq are not included in "single total order" ?
I understood that seq_cst is equivalent to the other three with write, read and write_read operation, but I'm confused about whether seq_cst can order other atomic_vars too, not only the same var.

cppreference is only a summary of the C++ standard, and sometimes its text is less precise. The actual standard draft makes it clear. The final C++20 working draft N4861 states in [atomics.order], par. 4 (p. 1525):
There is a single total order S on all memory_order::seq_cst operations, including fences, that satisfies the following constraints [...]
This clearly says all seq_cst operations, not just all operations on a particular object.
And notes 6 and 7 further down emphasize that the order does not apply to weaker memory orders:
6 [Note: We do not require that S be consistent with “happens before” (6.9.2.1). This allows more efficient
implementation of memory_order::acquire and memory_order::release on some machine architectures.
It can produce surprising results when these are mixed with memory_order::seq_cst accesses. — end note]
7 [Note: memory_order::seq_cst ensures sequential consistency only for a program that is free of data races
and uses exclusively memory_order::seq_cst atomic operations. Any use of weaker ordering will invalidate
this guarantee unless extreme care is used. In many cases, memory_order::seq_cst atomic operation

I find this part incomplete:
A load operation with this memory order performs an acquire operation,
a store performs a release operation, and read-modify-write performs
both an acquire operation and a release operation, plus a single total
order exists in which all threads observe all modifications in the
same order.
If those things (release stores, acquire loads, and a total store order) were actually sufficient to give sequential consistency, that would imply that release and acquire operations on their own are more strongly ordered than they actually are.
Let's have a look at the following counterexample:
CPU1:
a = 1 // release store
int r1 = b // acquire load
Then based on the above definition for SC (and the known properties sequential consistency must have to fit the name), I would presume that the store of a and the load of b can't be reordered:
we have a release-store and an acquire-load
we (can) have a total order over all loads/stores
So we have satisfied the above definition for sequential consistency.
But a release store followed by an acquire load from a different address can be reordered. The canonical example is Dekker's algorithm. Therefore the above definition of SC is broken, because it omits that the memory order needs to preserve program order. Apart from a compiler messing things up, the typical cause of this violation is the store buffer found in most modern CPUs, which can cause an older store to be reordered with a newer load from a different address.
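The store-then-load pattern at the heart of Dekker's algorithm can be written as a small litmus test. The sketch below is only an illustration with made-up names (`both_read_zero`, `count_both_zero`). With memory_order_seq_cst on every operation, the C++ memory model forbids the outcome where both threads read 0. With release stores and acquire loads that outcome is allowed, and it can typically be observed on hardware with store buffers, although no particular run is guaranteed to show it.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs one round of the store-buffering litmus test and reports whether
// both threads read 0, i.e. each load took effect before the other store.
bool both_read_zero(std::memory_order store_order, std::memory_order load_order) {
    std::atomic<int> a{0};
    std::atomic<int> b{0};
    int r1 = -1;
    int r2 = -1;

    std::thread cpu1([&] {
        a.store(1, store_order);
        r1 = b.load(load_order);
    });
    std::thread cpu2([&] {
        b.store(1, store_order);
        r2 = a.load(load_order);
    });
    cpu1.join();
    cpu2.join();
    return r1 == 0 && r2 == 0;
}

// Counts how often the "both read 0" outcome shows up over several rounds.
int count_both_zero(int iterations,
                    std::memory_order store_order,
                    std::memory_order load_order) {
    assert(iterations >= 0);
    int hits = 0;
    for (int i = 0; i < iterations; ++i) {
        if (both_read_zero(store_order, load_order)) {
            ++hits;
        }
    }
    return hits;
}
```

With `std::memory_order_seq_cst` for both arguments the count must be 0. Passing `std::memory_order_release` and `std::memory_order_acquire` instead makes a nonzero count legal.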
The single total order is a different concern than CPU local instruction reordering as you can get with e.g. store buffers. It effectively means that there is some moment where an operation takes effect in the memory order and nobody should be able to disagree with that. The standard litmus test for this is the independent reads of independent writes (IRIW):
CPU1:
A=1
CPU2:
B=1
CPU3:
r1=A
[LoadLoad]
r2=B
CPU4:
r3=B
[LoadLoad]
r4=A
So could it be that CPU3 and CPU4 see the stores to different addresses in different orders? If the answer is yes, then no total order over the loads/stores exists.
Another cause of not having a total order over the loads/stores is store-to-load forwarding (STLF).
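For comparison, here is a rough C++ rendering of the IRIW test using default (seq_cst) atomics. The function names are invented for this sketch. Because all four threads use seq_cst operations, the single total order S rules out the outcome in which the two readers disagree about the order of the two stores. With acquire loads and release stores that outcome would be permitted by the standard.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// One round of IRIW. Returns true if CPU3 saw A before B while CPU4 saw
// B before A, i.e. the readers disagree about the order of the stores.
bool readers_disagree() {
    std::atomic<int> A{0};
    std::atomic<int> B{0};
    int r1 = -1, r2 = -1, r3 = -1, r4 = -1;

    std::thread cpu1([&] { A.store(1); });             // seq_cst by default
    std::thread cpu2([&] { B.store(1); });
    std::thread cpu3([&] { r1 = A.load(); r2 = B.load(); });
    std::thread cpu4([&] { r3 = B.load(); r4 = A.load(); });

    cpu1.join();
    cpu2.join();
    cpu3.join();
    cpu4.join();
    return r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0;
}

// Counts disagreements over several rounds; must stay 0 with seq_cst.
int count_disagreements(int iterations) {
    assert(iterations >= 0);
    int hits = 0;
    for (int i = 0; i < iterations; ++i) {
        if (readers_disagree()) {
            ++hits;
        }
    }
    return hits;
}
```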
CPU1:
A=1
r1=A
r2=B
CPU2:
B=1
r3=B
r4=A
Is it possible that r1=1, r2=0, r3=1 and r4=0?
On x86 this is possible due to store-to-load forwarding. If CPU1 does a store of A followed by a load of A, then the CPU must look in the store buffer for the value of A. This causes the store of A not to be atomic: the local CPU can see the store early, and the consequence is that no total order over the loads/stores exists.
So instead of a total order over all loads/stores, we are left with a total order over the stores only, and this is how the x86 memory model gets its name (Total Store Order).

Relaxed Atomics and Memory Coherence in the Absence of Synchronisation

I've written a basic graph scheduler that synchronises task execution in a wait-free manner. Since the graph topology is immutable, I figured I'll make all atomic operations relaxed. However, as I learned more about the CPU hardware, I started to become concerned about the behaviour of my data structure on platforms with weaker memory models (I've only tested my code on x86). Here's the scenario that bothers me:
Thread 1 (T1) and Thread 2 (T2) concurrently update (non-atomically) memory locations X and Y respectively, and then proceed to execute other unrelated tasks.
Thread 3 (T3) picks up a dependent task after T1 and T2 are done, loads X and Y, and sums them up. There are no acquire/release synchronisations, thread joins, or locks being invoked, and T3's task is guaranteed to be scheduled after T1's and T2's are done.
Assuming that T1, T2, and T3 are scheduled (by the OS) on different CPU cores, my question is: In the absence of any memory fences or lock-like instructions, is T3 guaranteed to see the latest values of X and Y? Another way of asking this is: If you don't insert a fence, how long after a store can you perform a load, or are there no guarantees regarding that?
My concern is that there are no guarantees that the cores which executed T1 and T2 have flushed their store buffers by the time T3's core attempts to load that information. I tend to think of data races as data corruptions that happen due to a load and a store (or store and store) happening at the same time. However, I've come to realise that I'm not quite sure what at the same time really means given the distributed nature of CPUs at the micro scale. According to CppRef:
A program that has two conflicting evaluations has a data race unless:
both evaluations execute on the same thread or in the same signal handler, or
both conflicting evaluations are atomic operations (see std::atomic), or
one of the conflicting evaluations happens-before another (see std::memory_order)
This seems to imply that anyone using my graph scheduler would experience a data race (assuming they don't protect against it themselves) even if I can guarantee that T3 doesn't execute until T1 and T2 are done. I've yet to observe a data race in my tests but I'm not naive enough to think that tests alone suffice to prove this.

how long after a store can you perform a load
ISO C++ makes zero guarantees about timing. It's almost always a bad idea to rely on timing / distance for correctness.
In this case all you need is acquire/release synchronization somewhere in the scheduler itself, e.g. T1 and T2 declaring themselves done using a release-store, and the scheduler checking for that with an acquire load.
Otherwise what does it even mean that T3 executes after T1 and T2? If the scheduler could see an "I'm done" store way early, it could start T3 before T1 or T2 has finished all of its stores.
If you make sure that everything in T3 happens after T1 and T2 (using acquire loads that "synchronize-with" a release store from each of T1 and T2), you don't even need to use atomics in T1 and T2, just in the scheduler machinery.
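A minimal sketch of that idea, under assumptions that are not in the question: the scheduler tracks unfinished predecessors in a hypothetical atomic counter `pending`, each predecessor decrements it with a release RMW, and the dependent task starts only after an acquire load reads 0. X and Y themselves stay plain, non-atomic variables.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical scheduler fragment: T3 sums X and Y once T1 and T2 are done.
int run_dependent_task() {
    int x = 0;                     // written non-atomically by T1
    int y = 0;                     // written non-atomically by T2
    std::atomic<int> pending{2};   // number of unfinished predecessor tasks
    int sum = -1;

    std::thread t1([&] {
        x = 10;
        pending.fetch_sub(1, std::memory_order_release);  // "I'm done"
    });
    std::thread t2([&] {
        y = 32;
        pending.fetch_sub(1, std::memory_order_release);  // "I'm done"
    });
    std::thread t3([&] {
        // Reading 0 synchronizes with both decrements: the later RMW
        // continues the release sequence headed by the earlier one.
        while (pending.load(std::memory_order_acquire) != 0) {
            std::this_thread::yield();
        }
        sum = x + y;               // no data race: both writes happen-before
    });

    t1.join();
    t2.join();
    t3.join();
    assert(pending.load() == 0);
    return sum;
}
```

Here `run_dependent_task()` returns 42. A real wait-free scheduler would enqueue T3 from whichever predecessor drops the counter to zero rather than spin, but the ordering argument is the same.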
Acquire load and release store are relatively cheap compared to seq_cst. On real HW, seq_cst has to flush the store buffer after a store, release doesn't. x86 does acq_rel for free.
(And yes, testing on x86 doesn't prove anything; the hardware memory model is basically acq_rel so compile-time reordering picks some legal order, and then that order runs with acq_rel.)
I'm not sure if starting a new thread guarantees that everything in that thread "happens after" that point in this thread. If so, then this is formally safe.
If not, then in theory IRIW reordering is something to worry about. (All threads using seq_cst loads have to agree about the global order of seq_cst stores, but not with weaker memory orders. In practice PowerPC is the hardware that can do this in real life, AFAIK, and only for short-ish windows; see "Will two atomic writes to different locations in different threads always be seen in the same order by other threads?".) Any std::thread constructor would involve a system call and be long enough in practice, as well as involving barriers anyway, whether ISO C++ formally guarantees that or not.
If you're not starting a new thread, but instead storing a flag for a worker to see, then acq/rel is again enough; happens-before is transitive so A -> B and B -> C implies A -> C.

LLVM - Atomic Ordering Unordered

I'm working on a library that heavily relies on bitwise operations, where the most important operate on shared memory. I was also having a look at LLVM's atomic ordering documentation and noticed unordered, which seems to be even weaker than C/C++'s relaxed memory order. I have several questions about it:
What are the differences between unordered and relaxed?
Say I have an atomic bool, is it safe to mutate it via unordered load/store?
Say I have an atomic bitmask, is it safe to mutate it via unordered load/store?
Is it safe to mutate it via unordered fetch_and/or/xor?
Is it safe to mutate it via unordered swap?
Is it safe to mutate it via unordered compare_and_swap?

The short answer is: whatever your problem is, unordered is screamingly unlikely to be the solution !
The longer answer is...
...the LLVM Language Reference Manual says:
unordered
The set of values that can be read is governed by the happens-before partial order. A value cannot be read unless some operation wrote it. This is intended to provide a guarantee strong enough to model Java’s non-volatile shared variables. This ordering cannot be specified for read-modify-write operations; it is not strong enough to make them atomic in any interesting way.
The "A value cannot be read unless some operation wrote it." is fun ! What this means is that “speculative” writes are not allowed. Say you have if (y == 99) x = 0 ; else x = y+1 ;: an optimizer could turn that into x = y+1 ; if (y == 99) x = 0 ; where the first write of x is the “speculative” one. (I'm not saying that's a sensible or common optimization. The point is that transformations which are perfectly OK from the perspective of a single thread, are not OK for atomics.) The C/C++ standards have the same restriction: no “out-of-thin-air” values are allowed.
Elsewhere the LLVM documentation describes unordered as loads/stores which complete without interruption from any other store -- so a load which reads two halves (say) of a value would not qualify if the two halves could be the result of two separate writes !
It seems monotonic is the local name for C/C++ memory_order_relaxed, and is described:
monotonic
In addition to the guarantees of unordered, there is a single total order for modifications by monotonic operations on each address. All modification orders must be compatible with the happens-before order.
Unlike unordered, with monotonic all threads will see writes to a given address in the same order. This means that if thread 'a' writes '1' to a given location, and afterwards thread 'b' writes '2', then after that threads 'c' and 'd' must both read '2'. (Hence the name.)
There is no guarantee that the modification orders can be combined to a global total order for the whole program (and this often will not be possible).
This is the relaxed bit.
The read in an atomic read-modify-write operation (cmpxchg and atomicrmw) reads the value in the modification order immediately before the value it writes.
Same as in C/C++: a read-modify-write cannot be interrupted by another write.
If one atomic read happens before another atomic read of the same address, the later read must see the same value or a later value in the address’s modification order. This disallows reordering of monotonic (or stronger) operations on the same address.
It's monotonic, guys.
If an address is written monotonic-ally by one thread, and other threads monotonic-ally read that address repeatedly, the other threads must eventually see the write.
So... there may be some latency between a write and reads, but that latency is finite (but may, I assume, not be the same for all threads). The "eventually" is interesting. The C11 standard says "Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.", which is similar but with a more positive spin :-)
This corresponds to the C++0x/C1x memory_order_relaxed.
So there you go.
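The read-read coherence rule quoted above (a later read of the same address never sees an older value) can be illustrated in C++ with memory_order_relaxed, which the documentation equates with monotonic. This is only a sketch with an invented function name. It also relies on the "eventually see the write" property for the reader loop to terminate.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// A writer stores 1..last_value with relaxed ordering while a reader polls.
// Returns true if the reader never observed the value going backwards.
bool reader_never_goes_backwards(int last_value) {
    assert(last_value > 0);
    std::atomic<int> x{0};
    bool monotonic = true;

    std::thread writer([&] {
        for (int v = 1; v <= last_value; ++v) {
            x.store(v, std::memory_order_relaxed);
        }
    });
    std::thread reader([&] {
        int previous = 0;
        while (previous != last_value) {
            int current = x.load(std::memory_order_relaxed);
            if (current < previous) {
                monotonic = false;   // would violate coherence
            }
            previous = current;
        }
    });

    writer.join();
    reader.join();
    return monotonic;
}
```

The reader may skip many intermediate values, since relaxed ordering makes no promise about how quickly stores become visible, but it must never see a smaller value after a larger one.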
You asked:
Say I have an atomic bool, is it safe to mutate it via unordered load/store?
Say I have an atomic bitmask, is it safe to mutate it via unordered load/store?
It rather depends on what you mean by safe :-( With any atomic load of 'x' followed later by an atomic store of 'x', you have no idea how many other stores to 'x' there have been since the load. But, the added joy of unordered is that it does not guarantee that all threads will see all writes to 'x' in the same order !
Your other questions are moot, because you cannot have unordered read-modify-write operations.
As a practical matter, your x86_64 guarantees that all writes are visible to all threads in the same order -- so all atomic operations are at the very least monotonic/memory_order_relaxed -- no LOCK prefixes and no xFENCE instructions are required, plain MOV to memory will do the trick. In fact, better than that, plain MOV to/from memory gives memory_order_release/memory_order_acquire.
FWIW: you mention bit-wise operations. I guess it's obvious that reading/writing a few bits is going to involve reading/writing some number of other bits as a side-effect. Which adds to the fun.
My guess is that at a minimum you will need to be doing read-modify-write operations. Again, on x86_64 this boils down to an instruction with a LOCK prefix, which is going to cost 10's of clocks -- how many 10's depends on the CPU and degree of contention. Now, a mutex lock and unlock will each involve a LOCKed instruction, so it's generally worth replacing a mutex_lock/...do some reading and writing.../mutex_unlock sequence by atomic read-modify-write(s) really only if it's just one read-modify-write.
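As a sketch of what such a read-modify-write looks like from C++ (the asker's library is not shown, so the names here are hypothetical): each thread sets its own bit in a shared mask with fetch_or. The weakest ordering available for an RMW is memory_order_relaxed, which corresponds to monotonic in the LLVM terminology above.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Each of `thread_count` threads sets one distinct bit in a shared bitmask.
std::uint32_t set_bits_concurrently(unsigned thread_count) {
    assert(thread_count <= 32);
    std::atomic<std::uint32_t> mask{0};
    std::vector<std::thread> workers;

    for (unsigned bit = 0; bit < thread_count; ++bit) {
        workers.emplace_back([&mask, bit] {
            // Atomic RMW: no concurrent update to other bits can be lost.
            mask.fetch_or(std::uint32_t{1} << bit, std::memory_order_relaxed);
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return mask.load(std::memory_order_relaxed);
}
```

With 8 threads the result is 0xFF. Doing the same with a separate load, an OR, and a store could lose updates, because another thread may modify the mask between the load and the store.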
The disadvantage of a mutex is, of course, that a thread can be "swapped out" while it holds the mutex :-(
A spin-lock (on x86_64) requires a LOCK to acquire but not to release... but the effect of being "swapped out" while holding a spin-lock is even worse :-(

What is guaranteed with C++ std::atomic at the programmer level?

I have listened to and read several articles, talks and Stack Overflow questions about std::atomic, and I would like to be sure that I have understood it well, because I am still a bit confused about the visibility of cache-line writes given possible delays in MESI (or derived) cache coherency protocols, store buffers, invalidate queues, and so on.
I read that x86 has a stronger memory model, and that if a cache invalidation is delayed x86 can revert started operations. But I am now interested only in what I should assume as a C++ programmer, independently of the platform.
[T1: thread1 T2: thread2 V1: shared atomic variable]
I understand that std::atomic guarantees that,
(1) No data races occur on a variable (thanks to exclusive access to the cache line).
(2) Depending which memory_order we use, it guarantees (with barriers) that sequential consistency happens (before a barrier, after a barrier or both).
(3) After an atomic write(V1) on T1, an atomic RMW(V1) on T2 will be coherent (its cache line will have been updated with the written value on T1).
But as a cache coherency primer mentions,
The implication of all these things is that, by default, loads can fetch stale data (if a corresponding invalidation request was sitting in the invalidation queue)
So, is the following correct?
(4) std::atomic does NOT guarantee that T2 won't read a 'stale' value on an atomic read(V) after an atomic write(V) on T1.
Questions if (4) is right: if the atomic write on T1 invalidates the cache line regardless of the delay, why does T2 wait for the invalidation to take effect when it performs an atomic RMW operation, but not when it performs an atomic read?
Questions if (4) is wrong: when can a thread read a 'stale' value in a way that is visible in the execution, then?
I appreciate your answers a lot
Update 1
So it seems I was wrong on (3) then. Imagine the following interleave, for an initial V1=0:
T1: W(1)
T2: R(0) M(++) W(1)
Even though T2's RMW is guaranteed to happen entirely after W(1) in this case, it can still read a 'stale' value (I was wrong). According to this, atomic doesn't guarantee full cache coherency, only sequential consistency.
Update 2
(5) Now imagine this example (x = y = 0 and are atomic):
T1: x = 1;
T2: y = 1;
T3: if (x==1 && y==0) print("msg");
According to what we've discussed, seeing "msg" displayed on screen wouldn't give us any information beyond the fact that T2 was executed after T1. So either of the following executions might have happened:
T1 < T3 < T2
T1 < T2 < T3 (where T3 sees x = 1 but not y = 1 yet)
is that right?
(6) If a thread can always read 'stale' values, what would happen if we took the typical "publish" scenario but instead of signaling that some data is ready, we do just the opposite (delete the data)?
T1: delete gameObjectPtr; is_enabled.store(false, std::memory_order_release);
T2: while (is_enabled.load(std::memory_order_acquire)) gameObjectPtr->doSomething();
where T2 would still be using a deleted pointer until it sees that is_enabled is false.
(7) Also, the fact that threads may read 'stale' values means that a mutex cannot be implemented with just one lock-free atomic right? It would require a synch mechanism between threads. Would it require a lockable atomic?

(1) Yes, there are no data races.
(2) Yes, with appropriate memory_order values you can guarantee sequential consistency.
(3) An atomic read-modify-write will always occur entirely before or entirely after an atomic write to the same variable.
(4) Yes, T2 can read a stale value from a variable after an atomic write on T1.
Atomic read-modify-write operations are specified in a way to guarantee their atomicity. If another thread could write to the value after the initial read and before the write of an RMW operation, then that operation would not be atomic.
Threads can always read stale values, except when happens-before guarantees relative ordering.
If a RMW operation reads a "stale" value, then it guarantees that the write it generates will be visible before any writes from other threads that would overwrite the value it read.
Update for example
If T1 writes x=1 and T2 does x++, with x initially 0, the choices from the point of view of the storage of x are:
T1's write is first, so T1 writes x=1, then T2 reads x==1, increments that to 2 and writes back x=2 as a single atomic operation.
T1's write is second. T2 reads x==0, increments it to 1, and writes back x=1 as a single operation, then T1 writes x=1.
However, provided there are no other points of synchronization between these two threads, the threads can proceed with the operations not flushed to memory.
Thus T1 can issue x=1, then proceed with other things, even though T2 will still read x==0 (and thus write x=1).
If there are any other points of synchronization then it will become apparent which thread modified x first, because those synchronization points will force an order.
This is most apparent if you have a conditional on the value read from a RMW operation.
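The two possible outcomes can be checked with a small experiment. This is a sketch with invented names. It uses memory_order_relaxed to stress that the atomicity of the RMW does not depend on the memory order chosen.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct Outcome {
    int seen_by_rmw;   // value T2's increment read
    int final_value;   // value of x after both threads finished
};

// T1 stores 1 while T2 atomically increments; x starts at 0.
Outcome race_store_against_increment() {
    std::atomic<int> x{0};
    int seen = -1;

    std::thread t1([&] { x.store(1, std::memory_order_relaxed); });
    std::thread t2([&] { seen = x.fetch_add(1, std::memory_order_relaxed); });
    t1.join();
    t2.join();

    assert(seen == 0 || seen == 1);
    return Outcome{seen, x.load()};
}

// Only two histories are possible:
//   store first:     the RMW reads 1 and writes 2, so the final value is 2
//   increment first: the RMW reads 0 and writes 1, then the store writes 1
bool outcome_is_consistent(const Outcome& o) {
    return (o.seen_by_rmw == 1 && o.final_value == 2) ||
           (o.seen_by_rmw == 0 && o.final_value == 1);
}
```

Whichever way the race goes, `outcome_is_consistent(race_store_against_increment())` holds. What can never happen is the store landing between the read and the write of the increment.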
Update 2
(5) If you use memory_order_seq_cst (the default) for all atomic operations you don't need to worry about this sort of thing. From the point of view of the program, if you see "msg" then T1 ran, then T3, then T2.
If you use other memory orderings (especially memory_order_relaxed) then you may see other scenarios in your code.
(6) In this case, you have a bug. Suppose the is_enabled flag is true when T2 enters its while loop, so it decides to run the body. T1 now deletes the data, and T2 then dereferences the pointer, which is a dangling pointer, and undefined behaviour ensues. The atomics don't help or hinder in any way beyond preventing the data race on the flag.
(7) You can implement a mutex with a single atomic variable.
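As a minimal sketch of that last point (a spin lock rather than a production-quality mutex, with class and function names chosen for this example), a single std::atomic<bool> is enough. The exchange is an RMW, so exactly one thread can flip the flag from false to true, no matter how "stale" other threads' plain loads might be.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical spin lock built on one atomic flag.
class SpinLock {
public:
    void lock() {
        // exchange returns the previous value; true means someone else holds it.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    void unlock() {
        locked_.store(false, std::memory_order_release);
    }

private:
    std::atomic<bool> locked_{false};
};

// Increments a plain (non-atomic) counter from several threads under the lock.
long count_with_spinlock(int thread_count, int increments_per_thread) {
    assert(thread_count > 0 && increments_per_thread >= 0);
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> workers;

    for (int t = 0; t < thread_count; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < increments_per_thread; ++i) {
                lock.lock();
                ++counter;   // protected by the lock
                lock.unlock();
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return counter;
}
```

The acquire on lock and the release on unlock are what make the previous holder's writes to `counter` visible to the next holder, so `count_with_spinlock(4, 10000)` yields 40000.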

Regarding (3) - it depends on the memory order used. If both the store and the RMW operation use std::memory_order_seq_cst, then both operations are ordered in some way - i.e., either the store happens before the RMW, or the other way round. If the store is ordered before the RMW, then it is guaranteed that the RMW operation "sees" the value that was stored. If the store is ordered after the RMW, it would overwrite the value written by the RMW operation.
If you use more relaxed memory orders, the modifications will still be ordered in some way (the modification order of the variable), but you have no guarantees on whether the RMW "sees" the value from the store operation - even if the RMW operation is ordered after the write in the variable's modification order.
In case you want to read yet another article I can refer you to Memory Models for C/C++ Programmers.

What do each memory_order mean?

I read a chapter and I didn't like it much. I'm still unclear what the differences are between the memory orders. This is my current speculation, which I arrived at after reading the much simpler http://en.cppreference.com/w/cpp/atomic/memory_order
The below is wrong so don't try to learn from it
memory_order_relaxed: Does not sync but is not ignored when order is done from another mode in a different atomic var
memory_order_consume: Syncs reading this atomic variable; however, it doesn't sync relaxed vars written before this. However, if the thread uses var X when modifying Y (and releases it), other threads consuming Y will see X released as well? I don't know if this means this thread pushes out changes of X (and obviously Y).
memory_order_acquire: Syncs reading this atomic variable AND makes sure relaxed vars written before this are synced as well. (does this mean all atomic variables on all threads are synced?)
memory_order_release: Pushes the atomic store to other threads (but only if they read the var with consume/acquire)
memory_order_acq_rel: For read/write ops. Does an acquire so you don't modify an old value and releases the changes.
memory_order_seq_cst: The same thing as acquire release except it forces the updates to be seen in other threads (if a is stored with relaxed on another thread and I store b with seq_cst, will a 3rd thread reading a with relaxed see that change along with b and any other atomic variable?).
I think I understood, but correct me if I am wrong. I couldn't find anything that explains it in easy-to-read English.

The GCC Wiki gives a very thorough and easy to understand explanation with code examples.
(excerpt edited, and emphasis added)
IMPORTANT:
Upon re-reading the below quote copied from the GCC Wiki in the process of adding my own wording to the answer, I noticed that the quote is actually wrong. They got acquire and consume exactly the wrong way around. A release-consume operation only provides an ordering guarantee on dependent data whereas a release-acquire operation provides that guarantee regardless of data being dependent on the atomic value or not.
The first model is "sequentially consistent". This is the default mode used when none is specified, and it is the most restrictive. It can also be explicitly specified via memory_order_seq_cst. It provides the same restrictions and limitation to moving loads around that sequential programmers are inherently familiar with, except it applies across threads.
[...]
From a practical point of view, this amounts to all atomic operations acting as optimization barriers. It's OK to re-order things between atomic operations, but not across the operation. Thread local stuff is also unaffected since there is no visibility to other threads. [...] This mode also provides consistency across all threads.
The opposite approach is memory_order_relaxed. This model allows for much less synchronization by removing the happens-before restrictions. These types of atomic operations can also have various optimizations performed on them, such as dead store removal and commoning. [...] Without any happens-before edges, no thread can count on a specific ordering from another thread.
The relaxed mode is most commonly used when the programmer simply wants a variable to be atomic in nature rather than using it to synchronize threads for other shared memory data.
The third mode (memory_order_acquire / memory_order_release) is a hybrid between the other two. The acquire/release mode is similar to the sequentially consistent mode, except it only applies a happens-before relationship to dependent variables. This allows for a relaxing of the synchronization required between independent reads of independent writes.
memory_order_consume is a further subtle refinement in the release/acquire memory model that relaxes the requirements slightly by removing the happens before ordering on non-dependent shared variables as well.
[...]
The real difference boils down to how much state the hardware has to flush in order to synchronize. Since a consume operation may therefore execute faster, someone who knows what they are doing can use it for performance critical applications.
Here follows my own attempt at a more mundane explanation:
Another way to look at the problem is from the point of view of reordering reads and writes, both atomic and ordinary:
All atomic operations are guaranteed to be atomic within themselves (the combination of two atomic operations is not atomic as a whole!), and all threads observe the operations on any one atomic variable in a single total order. That means operations on the same atomic variable are never reordered relative to each other, but other memory operations (including, for the weaker orders, atomic operations on other variables) might very well be. Compilers (and CPUs) routinely do such reordering as an optimization.
It also means the compiler must use whatever instructions are necessary to guarantee that an atomic operation executing at any time will see the results of each and every other atomic operation, possibly on another processor core (but not necessarily other operations), that were executed before.
Now, a relaxed is just that, the bare minimum. It does nothing in addition and provides no other guarantees. It is the cheapest possible operation. For non-read-modify-write operations on strongly ordered processor architectures (e.g. x86/amd64) this boils down to a plain normal, ordinary move.
The sequentially consistent operation is the exact opposite, it enforces strict ordering not only for atomic operations, but also for other memory operations that happen before or after. Neither one can cross the barrier imposed by the atomic operation. Practically, this means lost optimization opportunities, and possibly fence instructions may have to be inserted. This is the most expensive model.
A release operation prevents ordinary loads and stores from being reordered after the atomic operation, whereas an acquire operation prevents ordinary loads and stores from being reordered before the atomic operation. Everything else can still be moved around.
The combination of preventing stores being moved after, and loads being moved before the respective atomic operation makes sure that whatever the acquiring thread gets to see is consistent, with only a small amount of optimization opportunity lost.
One may think of that as something like a non-existent lock that is being released (by the writer) and acquired (by the reader). Except... there is no lock.
In practice, release/acquire usually means the compiler need not use any particularly expensive special instructions, but it cannot freely reorder loads and stores to its liking, so some (small) optimization opportunities may be lost.
Finally, consume is the same operation as acquire, only with the exception that the ordering guarantees only apply to dependent data. Dependent data would e.g. be data that is pointed-to by an atomically modified pointer.
Arguably, that may provide a couple of optimization opportunities that are not present with acquire operations (since less data is subject to restrictions); however, this comes at the expense of more complex and more error-prone code, and the non-trivial task of getting dependency chains correct.
It is currently discouraged to use consume ordering while the specification is being revised.
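To make "dependent data" concrete, here is a minimal sketch of publishing an object through an atomic pointer. The names Message, g_slot, publish, receive and run_pointer_publication are made up for this illustration. The reader uses acquire rather than consume, in line with the advice above; the fields read through the loaded pointer are exactly the dependent data that consume would cover.

```cpp
#include <atomic>
#include <string>
#include <thread>

struct Message {                 // hypothetical payload type
    std::string text;
    int id;
};

std::atomic<Message*> g_slot{nullptr};  // hypothetical shared slot

// Writer: build the object first, then publish the pointer with release.
void publish() {
    Message* m = new Message{"hello", 7};
    g_slot.store(m, std::memory_order_release);
}

// Reader: wait for the pointer, then read through it.
// m->id is "dependent data": it is reached via the loaded pointer.
// acquire orders these reads (and more than consume would require).
int receive() {
    Message* m;
    while ((m = g_slot.load(std::memory_order_acquire)) == nullptr) {
        std::this_thread::yield();
    }
    int id = m->id;
    delete m;
    return id;
}

int run_pointer_publication() {
    g_slot.store(nullptr, std::memory_order_relaxed);
    int result = 0;
    std::thread reader([&] { result = receive(); });
    std::thread writer(publish);
    writer.join();
    reader.join();   // join makes `result` safe to read here
    return result;
}
```

Under these assumptions run_pointer_publication() returns 7: the reader can only get a non-null pointer from the release store, so the fields written before that store are visible to it.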

This is a quite complex subject. Try to read http://en.cppreference.com/w/cpp/atomic/memory_order several times, try to read other resources, etc.
Here's a simplified description:
The compiler and CPU can reorder memory accesses. That is, they can happen in a different order than what's specified in the code. That's fine most of the time; the problem arises when different threads try to communicate and may see memory accesses in an order that breaks the invariants of the code.
Usually you can use locks for synchronization. The problem is that they're slow. Atomic operations are much faster, because the synchronization happens at CPU level (i.e. CPU ensures that no other thread, even on another CPU, modifies some variable, etc.).
So, the one single problem we're facing is reordering of memory accesses. The memory_order enum specifies what types of reorderings the compiler must forbid.
relaxed - no constraints.
consume - no loads that are dependent on the newly loaded value can be reordered wrt. the atomic load. I.e. if they are after the atomic load in the source code, they will happen after the atomic load too.
acquire - no loads can be reordered wrt. the atomic load. I.e. if they are after the atomic load in the source code, they will happen after the atomic load too.
release - no stores can be reordered wrt. the atomic store. I.e. if they are before the atomic store in the source code, they will happen before the atomic store too.
acq_rel - acquire and release combined.
seq_cst - it is more difficult to understand why this ordering is required. Basically, all other orderings only ensure that specific disallowed reorderings don't happen only for the threads that consume/release the same atomic variable. Memory accesses can still propagate to other threads in any order. This ordering ensures that this doesn't happen (thus sequential consistency). For a case where this is needed see the example at the end of the linked page.

I want to provide a more precise explanation, closer to the standard.
Things to ignore:
memory_order_consume - apparently no major compiler implements it, and they silently replace it with a stronger memory_order_acquire. Even the standard itself says to avoid it.
A big part of the cppreference article on memory orders deals with 'consume', so dropping it simplifies things a lot.
It also lets you ignore related features like [[carries_dependency]] and std::kill_dependency.
Data races: Writing to a non-atomic variable from one thread, and simultaneously reading/writing to it from a different thread is called a data race, and causes undefined behavior.
memory_order_relaxed is the weakest and supposedly the fastest memory order.
Any reads/writes to atomics can't cause data races (and subsequent UB). relaxed provides just this minimal guarantee, for a single variable. It doesn't provide any guarantees for other variables (atomic or not).
All threads agree on the order of operations on every particular atomic variable. But that's the case only for individual variables. If other variables (atomic or not) are involved, threads might disagree on how exactly the operations on different variables are interleaved.
It's as if relaxed operations propagate between threads with slight unpredictable delays.
This means that you can't use relaxed atomic operations to judge when it's safe to access other non-atomic memory (can't synchronize access to it).
By "threads agree on the order" I mean that:
Each thread will access each separate variable in the exact order you tell it to. E.g. a.store(1, relaxed); a.store(2, relaxed); will write 1, then 2, never in the opposite order. But accesses to different variables in the same thread can still be reordered relative to each other.
If a thread A writes to a variable several times, then thread B reads several times, it will get the values in the same order (but of course it can read some values several times, or skip some, if you don't synchronize the threads in other ways).
No other guarantees are given.
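The per-variable ordering guarantee described above can be checked with a small experiment. This is only a sketch; the function name reader_sees_monotonic_values and the default iteration count are arbitrary choices for the illustration. A writer stores increasing values with relaxed, and a reader checks that the values it loads never go backwards (it may well skip or repeat values).

```cpp
#include <atomic>
#include <thread>

// Returns true if the reader never observed the value going backwards.
bool reader_sees_monotonic_values(int last = 100000) {
    std::atomic<int> a{0};
    bool monotonic = true;

    std::thread writer([&] {
        for (int i = 1; i <= last; ++i)
            a.store(i, std::memory_order_relaxed);  // 1, 2, 3, ... in this order
    });

    std::thread reader([&] {
        int prev = 0;
        while (prev != last) {
            int cur = a.load(std::memory_order_relaxed);
            if (cur < prev) monotonic = false;  // would contradict the agreed order
            prev = cur;
        }
    });

    writer.join();
    reader.join();   // after join, `monotonic` is safe to read
    return monotonic;
}
```

The function is expected to return true even though every access is relaxed, because only a single atomic variable is involved.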
Example uses: Anything that doesn't try to use an atomic variable to synchronize access to non-atomic data: various counters (that exist for informational purposes only), or 'stop flags' to signal other threads to stop. Another example: operations on shared_ptrs that increment the reference count internally use relaxed.
Fences: atomic_thread_fence(relaxed); does nothing.
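As an illustration of the "informational counter" use case for relaxed mentioned above, here is a minimal sketch. The function name count_events and its parameters are invented for the example. No other data is published through the counter, so relaxed is enough.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Each worker bumps the shared counter `per_thread` times.
long count_events(int threads = 4, int per_thread = 100000) {
    std::atomic<long> events{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < per_thread; ++i)
                events.fetch_add(1, std::memory_order_relaxed);  // atomic, but no ordering
        });
    }
    for (auto& w : workers) w.join();
    // join() already synchronizes with the workers, so a relaxed load suffices.
    return events.load(std::memory_order_relaxed);
}
```

No increments are lost, since each fetch_add is still an atomic read-modify-write; relaxed only gives up ordering with respect to other memory.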
memory_order_release, memory_order_acquire do everything relaxed does, and more (so it's supposedly slower or equivalent).
Only stores (writes) can use release. Only loads (reads) can use acquire. Read-modify-write operations such as fetch_add can be both (memory_order_acq_rel), but they don't have to.
Those let you synchronize threads:
Let's say thread 1 reads/writes to some memory M (any non-atomic or atomic variables, doesn't matter).
Then thread 1 performs a release store to a variable A. Then it stops touching that memory.
If thread 2 then performs an acquire load of the same variable A, this load is said to synchronize with the corresponding store in thread 1.
Now thread 2 can safely read/write to that memory M.
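The steps above can be sketched as follows. The names g_data (playing the role of the memory M), g_ready (the atomic variable A), thread1, thread2 and run_message_passing are assumptions made for the example.

```cpp
#include <atomic>
#include <thread>

int g_data = 0;                    // the memory "M": plain, non-atomic
std::atomic<bool> g_ready{false};  // the atomic variable "A"

void thread1() {
    g_data = 42;                                     // touch M
    g_ready.store(true, std::memory_order_release);  // release store to A
    // from here on, thread 1 no longer touches g_data
}

int thread2() {
    while (!g_ready.load(std::memory_order_acquire)) // acquire load of A
        std::this_thread::yield();
    return g_data;  // synchronized with thread1: no data race
}

int run_message_passing() {
    g_data = 0;
    g_ready.store(false, std::memory_order_relaxed);
    int seen = -1;
    std::thread t2([&] { seen = thread2(); });
    std::thread t1(thread1);
    t1.join();
    t2.join();
    return seen;
}
```

With release/acquire on g_ready, run_message_passing() returns 42. If both operations were relaxed instead, the read of g_data would be a data race.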
You only synchronize with the latest writer, not preceding writers.
You can chain synchronizations across multiple threads.
There's a special rule that synchronization propagates across any number of read-modify-write operations regardless of their memory order. E.g. if thread 1 does a.store(1, release);, then thread 2 does a.fetch_add(2, relaxed);, then thread 3 does a.load(acquire), then thread 1 successfully synchronizes with thread 3, even though there's a relaxed operation in the middle.
In the above rule, a release operation X, and any subsequent read-modify-write operations on the same variable (stopping at the next non-read-modify-write operation) are called a release sequence headed by X. (So if an acquire reads from any operation in a release sequence, it synchronizes with the head of the sequence.)
If read-modify-write operations are involved, nothing stops you from synchronizing with more than one operation. In the example above, if fetch_add was using acquire or acq_rel, it too would synchronize with thread 1, and conversely, if it used release or acq_rel, thread 3 would synchronize with thread 2 in addition to thread 1.
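The three-thread release sequence example above can be written out roughly like this. It is a sketch: the function name run_release_sequence and the spin loops are additions for the illustration, needed so that the operations happen in the order the text describes.

```cpp
#include <atomic>
#include <thread>

int run_release_sequence() {
    int data = 0;            // non-atomic payload written by thread 1
    std::atomic<int> a{0};
    int seen = -1;

    std::thread t1([&] {
        data = 42;
        a.store(1, std::memory_order_release);          // head of the release sequence
    });
    std::thread t2([&] {
        while (a.load(std::memory_order_relaxed) != 1)  // wait until thread 1 has stored
            std::this_thread::yield();
        a.fetch_add(2, std::memory_order_relaxed);      // relaxed RMW continues the sequence (a == 3)
    });
    std::thread t3([&] {
        while (a.load(std::memory_order_acquire) != 3)  // reads the value written by the RMW
            std::this_thread::yield();
        seen = data;  // thread 3 synchronized with thread 1, so this read is safe
    });

    t1.join();
    t2.join();
    t3.join();
    return seen;
}
```

The acquire load in thread 3 reads the value produced by the relaxed fetch_add, which belongs to the release sequence headed by thread 1's store, so the function returns 42.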
Example use: shared_ptr decrements its reference counter using something like fetch_sub(1, acq_rel).
Here's why: imagine that thread 1 reads/writes to *ptr, then destroys its copy of ptr, decrementing the ref count. Then thread 2 destroys the last remaining pointer, also decrementing the ref count, and then runs the destructor.
Since the destructor in thread 2 is going to access the memory previously accessed by thread 1, the acq_rel synchronization in fetch_sub is necessary. Otherwise you'd have a data race and UB.
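A rough sketch of such a reference count follows. This is not how any particular standard library implements shared_ptr; Counted, add_ref, release_ref, g_destroyed and run_refcount_demo are hypothetical names used only to show where relaxed and acq_rel go.

```cpp
#include <atomic>
#include <thread>

struct Counted {                  // hypothetical intrusive ref-counted object
    std::atomic<int> refs{1};
    int payload = 0;
};

std::atomic<int> g_destroyed{0};  // counts destructions, for the demo only

void add_ref(Counted* p) {
    // Taking another reference does not publish anything: relaxed is enough.
    p->refs.fetch_add(1, std::memory_order_relaxed);
}

void release_ref(Counted* p) {
    // release: makes this thread's earlier accesses to *p visible;
    // acquire: lets the thread that drops the last reference see them.
    if (p->refs.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        g_destroyed.fetch_add(1, std::memory_order_relaxed);
        delete p;
    }
}

int run_refcount_demo() {
    g_destroyed.store(0);
    Counted* p = new Counted;     // refs == 1
    add_ref(p);                   // refs == 2, one reference per thread
    std::thread a([p] { p->payload = 1; release_ref(p); });
    std::thread b([p] { release_ref(p); });
    a.join();
    b.join();
    return g_destroyed.load();
}
```

Whichever thread drops the last reference deletes the object, and run_refcount_demo() returns 1 because the object is destroyed exactly once.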
Fences: Using atomic_thread_fence, you can essentially turn relaxed atomic operations into release/acquire operations. A single fence can apply to more than one operation, and/or can be performed conditionally.
If you do a relaxed read (or with any other order) from one or more variables, then do atomic_thread_fence(acquire) in the same thread, then all those reads count as acquire operations.
Conversely, if you do atomic_thread_fence(release), followed by any number of (possibly relaxed) writes, those writes count as release operations.
An acq_rel fence combines the effect of acquire and release fences.
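Here is a minimal sketch of the fence rules above: the atomic accesses themselves are relaxed, and the fences supply the release/acquire semantics. The function name run_fence_message_passing is made up for the example.

```cpp
#include <atomic>
#include <thread>

int run_fence_message_passing() {
    int data = 0;                   // non-atomic payload
    std::atomic<bool> ready{false};
    int seen = -1;

    std::thread producer([&] {
        data = 42;
        std::atomic_thread_fence(std::memory_order_release); // makes the store below act as a release
        ready.store(true, std::memory_order_relaxed);
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_relaxed))
            std::this_thread::yield();
        std::atomic_thread_fence(std::memory_order_acquire); // makes the load above act as an acquire
        seen = data;                // safe: the fences synchronize the two threads
    });

    producer.join();
    consumer.join();
    return seen;
}
```

The function returns 42. Note the placement: the release fence comes before the relaxed store, and the acquire fence comes after the relaxed load.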
Similarity with other standard library features:
Several standard library features also cause a similar synchronizes with relationship. E.g. locking a mutex synchronizes with the latest unlock, as if locking was an acquire operation, and unlocking was a release operation.
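To illustrate the "lock is an acquire, unlock is a release" analogy, here is a toy spinlock. It is a sketch for illustration only (SpinLock and count_with_spinlock are hypothetical names); real code should normally use std::mutex.

```cpp
#include <atomic>
#include <thread>
#include <vector>

class SpinLock {                        // hypothetical minimal lock
    std::atomic<bool> locked_{false};
public:
    void lock() {
        // acquire: accesses inside the critical section cannot move above this
        while (locked_.exchange(true, std::memory_order_acquire))
            std::this_thread::yield();
    }
    void unlock() {
        // release: accesses inside the critical section cannot move below this
        locked_.store(false, std::memory_order_release);
    }
};

long count_with_spinlock(int threads = 4, int per_thread = 10000) {
    SpinLock lock;
    long total = 0;                     // plain variable protected by the lock
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < per_thread; ++i) {
                lock.lock();
                ++total;                // safe: each lock() synchronizes with the previous unlock()
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) w.join();
    return total;
}
```

Each successful lock() synchronizes with the latest unlock(), so the non-atomic increments do not race and the total comes out as threads * per_thread.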
memory_order_seq_cst does everything acquire/release do, and more. This is supposedly the slowest order, but also the safest.
seq_cst reads count as acquire operations. seq_cst writes count as release operations. seq_cst read-modify-write operations count as both.
seq_cst operations can synchronize with each other, and with acquire/release operations. Beware of special effects of mixing them (see below).
seq_cst is the default order, e.g. given atomic_int x;, x = 1; does x.store(1, seq_cst);.
seq_cst has an extra property compared to acquire/release: all threads agree on the order in which all seq_cst operations happen. This is unlike weaker orders, where threads agree only on the order of operations on each individual atomic variable, but not on how the operations are interleaved - see relaxed order above.
The presence of this global operation order seems to only affect which values you can get from seq_cst loads, it doesn't in any way affect non-atomic variables and atomic operations with weaker orders (unless seq_cst fences are involved, see below), and by itself doesn't prevent any extra data race UB compared to acq/rel operations.
Among other things, this order respects the synchronizes with relationship described for acquire/release above, unless (and this is weird) that synchronization comes from mixing a seq-cst operation with an acquire/release operation (release syncing with seq-cst, or seq-cst syncing with acquire). Such a mix essentially demotes the affected seq-cst operation to an acquire/release (it may retain some of the seq-cst properties, but you'd better not count on it).
Example use:
atomic_bool x = true;
atomic_bool y = true;
// Thread 1:
x.store(false, seq_cst);
if (y.load(seq_cst)) {...}
// Thread 2:
y.store(false, seq_cst);
if (x.load(seq_cst)) {...}
Let's say you want at most one thread to be able to enter the if body. seq_cst allows you to do it. Acquire/release or weaker orders wouldn't be enough here.
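A runnable version of the example above might look like the following sketch. The helper names threads_that_entered and never_both_enter and the round count are arbitrary. It repeats the experiment and checks that both threads never enter the if body in the same round.

```cpp
#include <atomic>
#include <thread>

// Runs the two-thread handshake once and returns how many threads entered the `if`.
int threads_that_entered() {
    std::atomic<bool> x{true}, y{true};
    bool t1_in = false, t2_in = false;

    std::thread t1([&] {
        x.store(false, std::memory_order_seq_cst);
        if (y.load(std::memory_order_seq_cst)) t1_in = true;
    });
    std::thread t2([&] {
        y.store(false, std::memory_order_seq_cst);
        if (x.load(std::memory_order_seq_cst)) t2_in = true;
    });

    t1.join();
    t2.join();
    return int(t1_in) + int(t2_in);  // 0 or 1 with seq_cst, never 2
}

// Repeats the experiment; returns false if both threads ever entered together.
bool never_both_enter(int rounds = 2000) {
    for (int i = 0; i < rounds; ++i)
        if (threads_that_entered() == 2) return false;
    return true;
}
```

With seq_cst this returns true on every run. A passing run with weaker orders would prove nothing, since the bad outcome would merely be allowed, not required.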
Fences: atomic_thread_fence(seq_cst); does everything an acq_rel fence does, and more.
Like you would expect, they bring some seq-cst properties to atomic operations done with weaker orders.
All threads agree on the order of seq_cst fences, relative to one another and to any seq_cst operations (i.e. seq_cst fences participate in the global order of seq_cst operations, which was described above).
They essentially prevent atomic operations from being reordered across themselves.
E.g. we can transform the above example to:
atomic_bool x = true;
atomic_bool y = true;
// Thread 1:
x.store(false, relaxed);
atomic_thread_fence(seq_cst);
if (y.load(relaxed)) {...}
// Thread 2:
y.store(false, relaxed);
atomic_thread_fence(seq_cst);
if (x.load(relaxed)) {...}
Both threads can't enter if at the same time, because that would require reordering a load across the fence to be before the store.
But formally, the standard doesn't describe them in terms of reordering. Instead, it just explains how the seq_cst fences are placed in the global order of seq_cst operations. Let's say:
Thread 1 performs operation A on atomic variable X using seq_cst order, OR a weaker order preceded by a seq_cst fence.
Then:
Thread 2 performs operation B on the same atomic variable X using seq_cst order, OR a weaker order followed by a seq_cst fence.
(Here A and B are any operations, except they can't both be reads, since then it's impossible to determine which one was first.)
Then the first seq_cst operation/fence is ordered before the second seq_cst operation/fence.
Then, if you imagine a scenario (e.g. in the example above, both threads entering the if) that imposes contradicting requirements on the order, then this scenario is impossible.
E.g. in the example above, if the first thread enters the if, then the first fence must be ordered before the second one. And vice versa. This means that both threads entering the if would lead to a contradiction, and hence is not allowed.
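The fence-based variant can be sketched in the same runnable form. The helper names entered_with_fences and fences_exclude_both are made up; all loads and stores are relaxed, and only the fences are seq_cst.

```cpp
#include <atomic>
#include <thread>

// One round of the fence-based handshake; returns how many threads entered the `if`.
int entered_with_fences() {
    std::atomic<bool> x{true}, y{true};
    bool t1_in = false, t2_in = false;

    std::thread t1([&] {
        x.store(false, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // store cannot be ordered after the load
        if (y.load(std::memory_order_relaxed)) t1_in = true;
    });
    std::thread t2([&] {
        y.store(false, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        if (x.load(std::memory_order_relaxed)) t2_in = true;
    });

    t1.join();
    t2.join();
    return int(t1_in) + int(t2_in);
}

// Repeats the experiment; returns false if both threads ever entered together.
bool fences_exclude_both(int rounds = 2000) {
    for (int i = 0; i < rounds; ++i)
        if (entered_with_fences() == 2) return false;
    return true;
}
```

Whichever fence comes first in the global seq_cst order, the other thread's load after its own fence must observe the first thread's store, so at most one thread enters per round.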
Interoperation between different orders
Summarizing the above:
             | relaxed write | release write      | seq-cst write
relaxed load | -             | -                  | -
acquire load | -             | synchronizes with  | synchronizes with*
seq-cst load | -             | synchronizes with* | synchronizes with
* = The participating seq-cst operation gets a messed up seq-cst order, effectively being demoted to an acquire/release operation. This is explained above.
Does using a stronger memory order make data transfer between threads faster?
No, it seems not.
Sequential consistency for data-race-free programs
The standard explains that if your program only uses seq_cst accesses (and mutexes), and has no data races (which cause UB), then you don't need to think about all the fancy operation reorderings. The program will behave as if only one thread executed at a time, with the threads being unpredictably interleaved.
