is std::atomic::fetch_add a serializing operation on x86-64? - c++

Considering the following code:
std::atomic<int> counter;
/* otherStuff 1 */
counter.fetch_add(1, std::memory_order_relaxed);
/* otherStuff 2 */
Is there an instruction in x86-64 (say, on architectures less than 5 years old) that would allow otherStuff 1 and otherStuff 2 to be re-ordered across the fetch_add, or is it always going to be serializing?
EDIT:
It looks like this is summarized by "is lock add a memory barrier on x86 ?" and it seems it is not, though I am not sure where to find a reference for that.

First let's look at what the compiler is allowed to do when using std::memory_order_relaxed.
If there are no dependencies between otherStuff 1/2 and the atomic operation, it can certainly reorder the statements. For example:
g = 3;
a.fetch_add(1, memory_order_relaxed);
g += 12;
clang++ generates the following assembly:
lock addl $0x1,0x2009f5(%rip) # 0x601040 <a>
movl $0xf,0x2009e7(%rip) # 0x60103c <g>
Here clang took the liberty to reorder g = 3 with the atomic fetch_add operation, which is a legitimate transformation.
When using std::memory_order_seq_cst, the compiler output becomes:
movl $0x3,0x2009f2(%rip) # 0x60103c <g>
lock addl $0x1,0x2009eb(%rip) # 0x601040 <a>
addl $0xc,0x2009e0(%rip) # 0x60103c <g>
Reordering of statements does not take place because the compiler is not allowed to do that.
A sequentially consistent read-modify-write (RMW) operation is both a release and an acquire operation, and as such no (visible) reordering of statements is allowed at either the compiler or the CPU level.
Your question is whether, on X86-64, std::atomic::fetch_add using relaxed ordering is a serializing operation.
The answer is: yes, if you do not take into account compiler reordering.
On the X86 architecture, an RMW operation always flushes the store buffer and therefore is effectively a serializing and sequentially consistent operation.
You can say that, on an X86 CPU, each RMW operation:
is a release operation for memory operations that precede it and is an acquire operation for memory operations that follow it.
becomes visible in a single total order observed by all threads.

The target architecture
On the X86 architecture, an RMW operation always flushes the store
buffer and therefore is effectively a serializing and sequentially
consistent operation.
I wish people would stop saying that.
That statement doesn't even make sense: there is no such thing as a "sequentially consistent operation", because sequential consistency isn't a property of any single operation. A sequentially consistent execution is one whose end result could have been produced by some interleaving of the operations of the different threads.
What can be said about these RMW operations:
all operations before the RMW have to be globally visible before either the R or W of the RMW is visible
and no operations after the RMW are visible before the RMW is visible.
That is, the part before, the RMW itself, and the part after run sequentially. In other words, there is a full fence before and after the RMW.
Whether that results in a sequential execution for the complete program depends on the nature of all globally visible operations of the program.
Visibility vs. execution ordering
That's in terms of visibility. I have no idea whether these processors try to speculatively execute code after the RMW, subject to the correctness requirement that operations are rolled back if they conflict with a side effect of a parallel execution (these details tend to differ between vendors and generations within a given family, unless clearly specified).
The answer to your question could differ depending on whether:
you need to guarantee the correctness of the set of side effects (as in a sequential consistency requirement),
or to guarantee that benchmarks are reliable,
or to guarantee something about comparative timing independently of the CPU version: that is, something about the results of comparing the timing of different executions (on a given CPU).
High level languages vs. CPU features
The question title is "is std::atomic::fetch_add a serializing operation on x86-64?" of the general form:
"does OP provide guarantees P on ARCH"
where
OP is a high level operation in a high level language
P is the desired property
ARCH is a specific CPU or compiler target
As a rule, the canonical answer is: the question doesn't make sense, OP being high level and target independent. There is a low level/high level mismatch here.
The compiler is bound by the language standard (or rather its most reasonable interpretation), by documented extension, by history... not by the standard for the target architecture, unless the feature is a low level, transparent feature of the high level language.
The canonical way to get low level semantics in C/C++ is to use volatile objects and volatile operations.
Here you must use a volatile std::atomic<int> to even be able to ask a meaningful question about architectural guarantees.
Present code generation
The meaningful variant of your question would use this code:
volatile std::atomic<int> counter;
/* otherStuff 1 */
counter.fetch_add(1, std::memory_order_relaxed);
That statement will generate an atomic RMW operation which in that case "is a serializing operation" on the CPU: all operations performed before it, in the assembly code, are complete before the RMW starts; all operations following the RMW wait until the RMW completes before starting (in terms of visibility).
And then you would need to learn about the unpleasantness of volatile semantics: volatile applies only to the volatile operations themselves, so you would still not get general guarantees about other operations.
There is no guarantee that high level C++ operations that come before the volatile RMW operation in the source are sequenced before it in the assembly code. You would need a "compiler barrier" to do that. These barriers are not portable. (And not needed here, as it's a silly approach anyway.)
But then if you want that guarantee, you can just use:
a release operation: to ensure that previous globally visible operations are complete
an acquire operation: to ensure that following globally visible operations do not start before it
an RMW operation on an object that is visible to multiple threads.
So why not make your RMW operation acq_rel? Then it wouldn't even need to be volatile.
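For example, a minimal sketch of that approach (otherStuff 1/2 stand in for the surrounding code from the question):
#include <atomic>

std::atomic<int> counter;

void increment() {
    // otherStuff 1 ...
    // acq_rel: the RMW is a release for everything above it and an
    // acquire for everything below it, so neither side may be moved
    // across it by the compiler; on x86 the lock-prefixed instruction
    // already acts as a full barrier at the CPU level.
    counter.fetch_add(1, std::memory_order_acq_rel);
    // otherStuff 2 ...
}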
Possible RMW variants in a processor family
Is there an instruction in x86-64 (say less than 5 years old
architectures) that would
Potential variants of the instruction set are another sub-question. Vendors can introduce new instructions, and ways to test for their availability at runtime; and compilers can even generate code to detect their availability.
Any RMW feature that follows the existing tradition (1) of strong ordering of ordinary memory operations in that family would have to respect these traditions:
Total Store Order: all store operations are ordered, implicitly fenced; in other words, there is a store buffer strictly for non speculative store operations in each core, that is not reordered and not shared between cores;
each store is a release operation (for previous normal memory operations);
loads that are speculatively started are completed in order, and at completion are validated: any early load for a location that was then clobbered in the cache is cancelled and the computation is restarted with the recent value;
loads are acquire operations.
Then any new (but traditional) RMW operation must be both an acquire operation and a release operation.
(Examples of potential imaginary RMW operations to be added in the future are xmult and xdiv.)
But that's futurology, and adding less strongly ordered instructions in the future wouldn't violate any security invariants, except potentially against timing-based, side-channel, Spectre-like attacks, which we don't know how to model and reason about in general anyway.
The problem with these questions, even about the present, is that a proof of absence would be required, and for that we would need to know about each variant for a CPU family. That is not always doable, and also, unnecessary if you use the proper ordering in the high level code, and useless if you don't.
(1) Traditions for guarantees of memory operations are guidelines in the CPU design, not guarantees about any future operation: by definition, operations that don't yet exist have no guarantees about their semantics, besides the guarantees of memory integrity, that is, the guarantee that no future operation will break the privilege and security guarantees previously established (no unprivileged instruction created in the future can access an unmapped memory address...).

When using std::memory_order_relaxed the only guarantee is that the operation is atomic. Anything around the operation can be reordered at will by either the compiler or the CPU.
From https://en.cppreference.com/w/cpp/atomic/memory_order:
Relaxed operation: there are no synchronization or ordering
constraints imposed on other reads or writes, only this operation's
atomicity is guaranteed (see Relaxed ordering below)

Related

Does C++11 sequential consistency memory order forbid store buffer litmus test?

Consider the store buffer litmus test with SC atomics:
// Initial
std::atomic<int> x(0), y(0);
// Thread 1
x.store(1);
auto r1 = y.load();
// Thread 2
y.store(1);
auto r2 = x.load();
Can this program end with both r1 and r2 being zero?
I can't see how this result is forbidden by the description about memory_order_seq_cst in cppreference:
A load operation with this memory order performs an acquire operation, a store performs a release operation, and read-modify-write performs both an acquire operation and a release operation, plus a single total order exists in which all threads observe all modifications in the same order
It seems to me that memory_order_seq_cst is just acquire-release plus a global store order. And I don't think the global store order comes into play in this specific litmus test.
That cppreference summary of SC is too weak, and indeed isn't strong enough to forbid this reordering.
What it says looks to me only as strong as x86-TSO (acq_rel plus no IRIW reordering, i.e. a total store order that all reader threads can agree on).
ISO C++ actually guarantees that there's a total order of all SC operations including loads (and also SC fences) that's consistent with program order. (That's basically the standard definition of sequential consistency in computer science; C++ programs that use only seq_cst atomic operations and are data-race-free for their non-atomic accesses execute sequentially consistently, i.e. "recover sequential consistency" despite full optimization being allowed for the non-atomic accesses.) Sequential consistency must forbid any reordering between any two SC operations in the same thread, even StoreLoad reordering.
This means an expensive full barrier (including StoreLoad) after every seq_cst store, or for example AArch64 STLR / LDAR can't StoreLoad reorder with each other, but are otherwise only release and acquire wrt. reordering with other operations. (So cache-hit SC stores can be quite a lot cheaper on AArch64 than x86, if you don't do any SC load or RMW operations in the same thread right afterwards.)
See https://eel.is/c++draft/atomics.order#4 That makes it clear that SC operations aren't reordered wrt. each other. The current draft standard says:
31.4 [atomics.order]
There is a single total order S on all memory_­order​::​seq_­cst operations, including fences, that satisfies the following constraints. First, if A and B are memory_­order​::​seq_­cst operations and A strongly happens before B, then A precedes B in S.
Second, for every pair of atomic operations A and B on an object M, where A is coherence-ordered before B, the following four conditions are required to be satisfied by S:
(4.1) if A and B are both memory_­order​::​seq_­cst operations, then A precedes B in S; and
(4.2 .. 4.4) - basically the same thing for sc fences wrt. operations.
Sequenced before implies strongly happens before, so the opening paragraph guarantees that S is consistent with program order.
4.1 is about ops that are coherence-ordered before/after each other, i.e. a load that happens to see the value from a store. That ties inter-thread visibility into the total order S, making it match program order. The combination of those two requirements forces a compiler to use full barriers (including StoreLoad) to recover sequential consistency from whatever weaker hardware model it's targeting.
(In the original, all of 4. is one paragraph. I split it to emphasize that there are two separate things here, one for strongly-happens-before and the list of ops/barriers for coherence-ordered-before.)
These guarantees, plus syncs-with / happens-before, are enough to recover sequential consistency for the whole program, if it's data-race free (that would be UB), and if you don't use any weaker memory orders.
These rules do still hold if the program involves weaker orders, but for example an SC fence between two relaxed operations isn't as strong as two SC loads. For example on PowerPC that wouldn't rule out IRIW reordering the way using only SC operations does; IIRC PowerPC needs barriers before SC loads, as well as after.
So having some SC operations isn't necessarily enough to recover sequential consistency everywhere; that's rather the point of using weaker operations, but it can be a bit surprising that other ops can reorder wrt. SC ops. SC ops aren't SC fences. See also this Q&A for an example with the same "store buffer" litmus test: weakening one store from seq_cst to release allows reordering.
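As a sketch of that weakened variant (the same litmus test as above, with thread 1's store downgraded to release):
std::atomic<int> x(0), y(0);
int r1, r2;

// Thread 1
void t1() {
    x.store(1, std::memory_order_release);   // weakened from seq_cst
    r1 = y.load(std::memory_order_seq_cst);
}
// Thread 2
void t2() {
    y.store(1, std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_seq_cst);
}
// r1 == 0 && r2 == 0 is now an allowed outcome: thread 1's store and load
// are no longer both SC, so nothing forbids the store from becoming visible
// after the load in that thread.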

Atomic operations, std::atomic<> and ordering of writes

GCC compiles this:
#include <atomic>
std::atomic<int> a;
int b(0);
void func()
{
b = 2;
a = 1;
}
to this:
func():
mov DWORD PTR b[rip], 2
mov DWORD PTR a[rip], 1
mfence
ret
So, to clarify things for me:
Is any other thread reading ‘a’ as 1 guaranteed to read ‘b’ as 2.
Why does the MFENCE happen after the write to ‘a’ not before.
Is the write to ‘a’ guaranteed to be an atomic (in the narrow, non C++ sense) operation anyway, and does that apply for all intel processors? I assume so from this output code.
Also, clang (v3.5.1 -O3) does this:
mov dword ptr [rip + b], 2
mov eax, 1
xchg dword ptr [rip + a], eax
ret
Which appears more straightforward to my little mind, but why the different approach, what’s the advantage of each?
I put your example on the Godbolt compiler explorer, and added some functions to read, increment, or combine (a+=b) two atomic variables. I also used a.store(1, memory_order_release); instead of a = 1; to avoid getting more ordering than needed, so it's just a simple store on x86.
See below for (hopefully correct) explanations. update: I had "release" semantics confused with just a StoreStore barrier. I think I fixed all the mistakes, but may have left some.
The easy question first:
Is the write to ‘a’ guaranteed to be an atomic?
Yes, any thread reading a will get either the old or the new value, not some half-written value. This happens for free on x86 and most other architectures with any aligned type that fits in a register. (e.g. not int64_t on 32bit.) Thus, on many systems, this happens to be true for b as well, the way most compilers would generate code.
There are some types of stores that may not be atomic on an x86, including unaligned stores that cross a cache line boundary. But std::atomic of course guarantees whatever alignment is necessary.
Read-modify-write operations are where this gets interesting. 1000 evaluations of a+=3 done in multiple threads at once will always produce a += 3000. You'd potentially get fewer if a wasn't atomic.
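A minimal demonstration of that claim (a hypothetical test harness, not code from the question):
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

std::atomic<int> a(0);

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back([] {
            for (int i = 0; i < 250; ++i)
                a += 3;            // 4 * 250 = 1000 atomic RMWs in total
        });
    for (auto &th : threads)
        th.join();
    assert(a == 3000);             // always holds; with a plain int it could be less
}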
Fun fact: signed atomic types guarantee two's complement wraparound, unlike normal signed types. C and C++ still cling to the idea of leaving signed integer overflow undefined in other cases. Some CPUs don't have arithmetic right shift, so leaving right-shift of negative numbers undefined makes some sense, but otherwise it just feels like a ridiculous hoop to jump through now that all CPUs use 2's complement and 8bit bytes. </rant>
Is any other thread reading ‘a’ as 1 guaranteed to read ‘b’ as 2.
Yes, because of the guarantees provided by std::atomic.
Now we're getting into the memory model of the language, and the hardware it runs on.
C11 and C++11 have a very weak memory ordering model, which means the compiler is allowed to reorder memory operations unless you tell it not to. (source: Jeff Preshing's Weak vs. Strong Memory Models). Even if x86 is your target machine, you have to stop the compiler from re-ordering stores at compile time. (e.g. normally you'd want the compiler to hoist a = 1 out of a loop that also writes to b.)
Using C++11 atomic types gives you full sequential-consistency ordering of operations on them with respect to the rest of the program, by default. This means they're a lot more than just atomic. See below for relaxing the ordering to just what's needed, which avoids expensive fence operations.
Why does the MFENCE happen after the write to ‘a’ not before.
StoreStore fences are a no-op with x86's strong memory model, so the compiler just has to put the store to b before the store to a to implement the source code ordering.
Full sequential consistency also requires that the store be globally ordered / globally visible before any later loads in program order.
x86 can re-order stores after later loads. In practice, what happens is that out-of-order execution sees an independent load in the instruction stream, and executes it ahead of a store that was still waiting on the data to be ready. Anyway, sequential-consistency forbids this, so gcc uses MFENCE, which is a full barrier, including StoreLoad (the only kind x86 doesn't have for free. (LFENCE/SFENCE are only useful for weakly-ordered operations like movnt.))
Another way to put this is the way the C++ docs use: sequential consistency guarantees that all threads see all changes in the same order. The MFENCE after every atomic store guarantees that this thread sees stores from other threads. Otherwise, our loads would see our stores before other threads' loads saw our stores. A StoreLoad barrier (MFENCE) delays our loads until after the stores that need to happen first.
The ARM32 asm for b=2; a=1; is:
# get pointers and constants into registers
str r1, [r3] # store b=2
dmb sy # Data Memory Barrier: full memory barrier to order the stores.
# I think just a StoreStore barrier here (dmb st) would be sufficient, but gcc doesn't do that. Maybe later versions have that optimization, or maybe I'm wrong.
str r2, [r3, #4] # store a=1 (a is 4 bytes after b)
dmb sy # full memory barrier to order this store wrt. all following loads and stores.
I don't know ARM asm, but what I've figured out so far is that normally it's op dest, src1 [,src2], but loads and stores always have the register operand first and the memory operand 2nd. This is really weird if you're used to x86, where a memory operand can be the source or dest for most non-vector instructions. Loading immediate constants also takes a lot of instructions, because the fixed instruction length only leaves room for 16b of payload for movw (move word) / movt (move top).
Release / Acquire
The release and acquire naming for one-way memory barriers comes from locks:
One thread modifies a shared data structure, then releases a lock. The unlock has to be globally visible after all the loads/stores to data it's protecting. (StoreStore + LoadStore)
Another thread acquires the lock (read, or RMW with a release-store), and must do all loads/stores to the shared data structure after the acquire becomes globally visible. (LoadLoad + LoadStore)
Note that std::atomic uses these names even for standalone fences, which are slightly different from load-acquire or store-release operations. (See atomic_thread_fence, below).
Release/Acquire semantics are stronger than what producer-consumer requires. That just requires one-way StoreStore (producer) and one-way LoadLoad (consumer), without LoadStore ordering.
A shared hash table protected by a readers/writers lock (for example) requires an acquire-load / release-store atomic read-modify-write operation to acquire the lock. x86 lock xadd is a full barrier (including StoreLoad), but ARM64 has a load-acquire/store-release version of load-linked/store-conditional for doing atomic read-modify-writes. As I understand it, this avoids the need for a StoreLoad barrier even for locking.
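As an illustration, here is a toy spinlock built on an acquire RMW and a release store (a sketch, not a production lock):
#include <atomic>

std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void lock() {
    // RMW with acquire ordering: accesses to the protected data can't be
    // reordered to before we own the lock.
    while (lock_flag.test_and_set(std::memory_order_acquire)) {
        // spin
    }
}

void unlock() {
    // Release store: accesses made while holding the lock become visible
    // before the lock is seen as free.
    lock_flag.clear(std::memory_order_release);
}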
Using weaker but still sufficient ordering
Writes to std::atomic types are ordered with respect to every other memory access in source code (both loads and stores), by default. You can control what ordering is imposed with std::memory_order.
In your case, you only need your producer to make sure stores become globally visible in the correct order, i.e. a StoreStore barrier before the store to a. store(memory_order_release) includes this and more. std::atomic_thread_fence(memory_order_release) is just a 1-way StoreStore barrier for all stores. x86 does StoreStore for free, so all the compiler has to do is put the stores in source order.
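In code, that producer/consumer pairing looks something like this (a sketch using the a and b from the question):
#include <atomic>

int b;                      // non-atomic payload
std::atomic<int> a(0);      // flag

void producer() {
    b = 2;                                      // write the payload first
    a.store(1, std::memory_order_release);      // then publish the flag
}

void consumer() {
    while (a.load(std::memory_order_acquire) != 1) {
        // wait for the flag; the acquire load pairs with the release store
    }
    int value = b;                              // guaranteed to read 2
    (void)value;
}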
Release instead of seq_cst will be a big performance win, esp. on architectures like x86 where release is cheap/free. This is even more true if the no-contention case is common.
Reading atomic variables also imposes full sequential consistency of the load with respect to all other loads and stores. On x86, this is free. LoadLoad and LoadStore barriers are no-ops and implicit in every memory op. You can make your code more efficient on weakly-ordered ISAs by using a.load(std::memory_order_acquire).
Note that the std::atomic standalone fence functions confusingly reuse the "acquire" and "release" names for StoreStore and LoadLoad fences that order all stores (or all loads) in at least the desired direction. In practice, they will usually emit HW instructions that are 2-way StoreStore or LoadLoad barriers. This doc is the proposal for what became the current standard. You can see how memory_order_release maps to a #LoadStore | #StoreStore on SPARC RMO, which I assume was included partly because it has all the barrier types separately. (hmm, the cppref web page only mentions ordering stores, not the LoadStore component. It's not the C++ standard, though, so maybe the full standard says more.)
memory_order_consume isn't strong enough for this use-case. This post talks about your case of using a flag to indicate that other data is ready, and talks about memory_order_consume.
consume would be enough if your flag was a pointer to b, or even a pointer to a struct or array. However, no compiler knows how to do the dependency tracking to make sure it puts things in the proper order in the asm, so current implementations always treat consume as acquire. This is too bad, because every architecture except DEC Alpha (and C++11's software model) provides this ordering for free. According to Linus Torvalds, only a few Alpha hardware implementations actually could have this kind of reordering, so the expensive barrier instructions needed all over the place were pure downside for most Alphas.
The producer still needs to use release semantics (a StoreStore barrier), to make sure the new payload is visible when the pointer is updated.
It's not a bad idea to write code using consume, if you're sure you understand the implications and don't depend on anything that consume doesn't guarantee. In the future, once compilers are smarter, your code will compile without barrier instructions even on ARM/PPC. The actual data movement still has to happen between caches on different CPUs, but on weak memory model machines, you can avoid waiting for any unrelated writes to be visible (e.g. scratch buffers in the producer).
Just keep in mind that you can't actually test memory_order_consume code experimentally, because current compilers are giving you stronger ordering than the code requests.
It's really hard to test any of this experimentally anyway, because it's timing-sensitive. Also, unless the compiler re-orders operations (because you failed to tell it not to), producer-consumer threads will never have a problem on x86. You'd need to test on an ARM or PowerPC or something to even try to look for ordering problems happening in practice.
references:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67458: I reported the gcc bug I found with b=2; a.store(1, MO_release); b=3; producing a=1;b=3 on x86, rather than b=3; a=1;
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67461: I also reported the fact that ARM gcc uses two dmb sy in a row for a=1; a=1;, and x86 gcc could maybe do with fewer mfence operations. I'm not sure if an mfence between each store is needed to protect a signal handler from making wrong assumptions, or if it's just a missing optimization.
The Purpose of memory_order_consume in C++11 (already linked above) covers exactly this case of using a flag to pass a non-atomic payload between threads.
What StoreLoad barriers (x86 mfence) are for: a working sample program that demonstrates the need: http://preshing.com/20120515/memory-reordering-caught-in-the-act/
Data-dependency barriers (only Alpha needs explicit barriers of this type, but C++ potentially needs them to prevent the compiler doing speculative loads): http://www.mjmwired.net/kernel/Documentation/memory-barriers.txt#360
Control-dependency barriers: http://www.mjmwired.net/kernel/Documentation/memory-barriers.txt#592
Doug Lea says x86 only needs LFENCE for data that was written with "streaming" writes like movntdqa or movnti. (NT = non-temporal). Besides bypassing the cache, x86 NT loads/stores have weakly-ordered semantics.
http://preshing.com/20120913/acquire-and-release-semantics/
http://preshing.com/20120612/an-introduction-to-lock-free-programming/ (pointers to books and other stuff he recommends).
Interesting thread on realworldtech about whether barriers everywhere or strong memory models are better, including the point that data-dependency is nearly free in HW, so it's dumb to skip it and put a large burden on software. (The thing Alpha (and C++) doesn't have, but everything else does). Go back a few posts from that to see Linus Torvalds' amusing insults, before he got around to explaining more detailed / technical reasons for his arguments.

What do each memory_order mean?

I read a chapter and I didn't like it much. I'm still unclear what the differences is between each memory order. This is my current speculation which I understood after reading the much more simple http://en.cppreference.com/w/cpp/atomic/memory_order
The below is wrong so don't try to learn from it
memory_order_relaxed: Does not sync but is not ignored when order is done from another mode in a different atomic var
memory_order_consume: Syncs reading this atomic variable, however it doesn't sync relaxed vars written before this. However if the thread uses var X when modifying Y (and releases it). Other threads consuming Y will see X released as well? I don't know if this means this thread pushes out changes of x (and obviously y)
memory_order_acquire: Syncs reading this atomic variable AND makes sure relaxed vars written before this are synced as well. (does this mean all atomic variables on all threads are synced?)
memory_order_release: Pushes the atomic store to other threads (but only if they read the var with consume/acquire)
memory_order_acq_rel: For read/write ops. Does an acquire so you don't modify an old value and releases the changes.
memory_order_seq_cst: The same thing as acquire release except it forces the updates to be seen in other threads (if a store with relaxed on another thread. I store b with seq_cst. A 3rd thread reading a with relax will see changes along with b and any other atomic variable?).
I think I understood but correct me if I am wrong. I couldn't find anything that explains it in easy to read English.
The GCC Wiki gives a very thorough and easy to understand explanation with code examples.
(excerpt edited, and emphasis added)
IMPORTANT:
Upon re-reading the below quote copied from the GCC Wiki in the process of adding my own wording to the answer, I noticed that the quote is actually wrong. They got acquire and consume exactly the wrong way around. A release-consume operation only provides an ordering guarantee on dependent data whereas a release-acquire operation provides that guarantee regardless of data being dependent on the atomic value or not.
The first model is "sequentially consistent". This is the default mode used when none is specified, and it is the most restrictive. It can also be explicitly specified via memory_order_seq_cst. It provides the same restrictions and limitation to moving loads around that sequential programmers are inherently familiar with, except it applies across threads.
[...]
From a practical point of view, this amounts to all atomic operations acting as optimization barriers. It's OK to re-order things between atomic operations, but not across the operation. Thread local stuff is also unaffected since there is no visibility to other threads. [...] This mode also provides consistency across all threads.
The opposite approach is memory_order_relaxed. This model allows for much less synchronization by removing the happens-before restrictions. These types of atomic operations can also have various optimizations performed on them, such as dead store removal and commoning. [...] Without any happens-before edges, no thread can count on a specific ordering from another thread.
The relaxed mode is most commonly used when the programmer simply wants a variable to be atomic in nature rather than using it to synchronize threads for other shared memory data.
The third mode (memory_order_acquire / memory_order_release) is a hybrid between the other two. The acquire/release mode is similar to the sequentially consistent mode, except it only applies a happens-before relationship to dependent variables. This allows for a relaxing of the synchronization required between independent reads of independent writes.
memory_order_consume is a further subtle refinement in the release/acquire memory model that relaxes the requirements slightly by removing the happens before ordering on non-dependent shared variables as well.
[...]
The real difference boils down to how much state the hardware has to flush in order to synchronize. Since a consume operation may therefore execute faster, someone who knows what they are doing can use it for performance critical applications.
Here follows my own attempt at a more mundane explanation:
A different approach to look at it is to look at the problem from the point of view of reordering reads and writes, both atomic and ordinary:
All atomic operations are guaranteed to be atomic within themselves (the combination of two atomic operations is not atomic as a whole!) and to be visible in the total order in which they appear on the timeline of the execution stream. That means no atomic operation can, under any circumstances, be reordered, but other memory operations might very well be. Compilers (and CPUs) routinely do such reordering as an optimization.
It also means the compiler must use whatever instructions are necessary to guarantee that an atomic operation executing at any time will see the results of each and every other atomic operation, possibly on another processor core (but not necessarily other operations), that were executed before.
Now, a relaxed is just that, the bare minimum. It does nothing in addition and provides no other guarantees. It is the cheapest possible operation. For non-read-modify-write operations on strongly ordered processor architectures (e.g. x86/amd64) this boils down to a plain normal, ordinary move.
The sequentially consistent operation is the exact opposite, it enforces strict ordering not only for atomic operations, but also for other memory operations that happen before or after. Neither one can cross the barrier imposed by the atomic operation. Practically, this means lost optimization opportunities, and possibly fence instructions may have to be inserted. This is the most expensive model.
A release operation prevents ordinary loads and stores from being reordered after the atomic operation, whereas an acquire operation prevents ordinary loads and stores from being reordered before the atomic operation. Everything else can still be moved around.
The combination of preventing stores being moved after, and loads being moved before the respective atomic operation makes sure that whatever the acquiring thread gets to see is consistent, with only a small amount of optimization opportunity lost.
One may think of that as something like a non-existent lock that is being released (by the writer) and acquired (by the reader). Except... there is no lock.
In practice, release/acquire usually means the compiler need not use any particularly expensive special instructions, but it cannot freely reorder loads and stores to its liking, which may miss out on some (small) optimization opportunities.
Finally, consume is the same operation as acquire, only with the exception that the ordering guarantees only apply to dependent data. Dependent data would e.g. be data that is pointed-to by an atomically modified pointer.
Arguably, that may provide for a couple of optimization opportunities that are not present with acquire operations (since fewer data is subject to restrictions), however this happens at the expense of more complex and more error-prone code, and the non-trivial task of getting dependency chains correct.
It is currently discouraged to use consume ordering while the specification is being revised.
This is a quite complex subject. Try to read http://en.cppreference.com/w/cpp/atomic/memory_order several times, try to read other resources, etc.
Here's a simplified description:
The compiler and CPU can reorder memory accesses. That is, they can happen in a different order than what's specified in the code. That's fine most of the time; the problem arises when different threads try to communicate and may see an order of memory accesses that breaks the invariants of the code.
Usually you can use locks for synchronization. The problem is that they're slow. Atomic operations are much faster, because the synchronization happens at CPU level (i.e. CPU ensures that no other thread, even on another CPU, modifies some variable, etc.).
So, the one single problem we're facing is reordering of memory accesses. The memory_order enum specifies what types of reorderings the compiler must forbid.
relaxed - no constraints.
consume - no loads that are dependent on the newly loaded value can be reordered wrt. the atomic load. I.e. if they are after the atomic load in the source code, they will happen after the atomic load too.
acquire - no loads can be reordered wrt. the atomic load. I.e. if they are after the atomic load in the source code, they will happen after the atomic load too.
release - no stores can be reordered wrt. the atomic store. I.e. if they are before the atomic store in the source code, they will happen before the atomic store too.
acq_rel - acquire and release combined.
seq_cst - it is more difficult to understand why this ordering is required. Basically, all other orderings only ensure that the specific disallowed reorderings don't happen for the threads that consume/release the same atomic variable. Memory accesses can still propagate to other threads in any order. This ordering ensures that this doesn't happen (thus sequential consistency). For a case where this is needed see the example at the end of the linked page.
I want to provide a more precise explanation, closer to the standard.
Things to ignore:
memory_order_consume - apparently no major compiler implements it, and they silently replace it with a stronger memory_order_acquire. Even the standard itself says to avoid it.
A big part of the cppreference article on memory orders deals with 'consume', so dropping it simplifies things a lot.
It also lets you ignore related features like [[carries_dependency]] and std::kill_dependency.
Data races: Writing to a non-atomic variable from one thread, and simultaneously reading/writing to it from a different thread is called a data race, and causes undefined behavior.
memory_order_relaxed is the weakest and supposedly the fastest memory order.
Any reads/writes to atomics can't cause data races (and subsequent UB). relaxed provides just this minimal guarantee, for a single variable. It doesn't provide any guarantees for other variables (atomic or not).
All threads agree on the order of operations on every particular atomic variable. But it's the case only for individual variables. If other variables (atomic or not) are involved, threads might disagree on how exactly the operations on different variables are interleaved.
It's as if relaxed operations propagate between threads with slight unpredictable delays.
This means that you can't use relaxed atomic operations to judge when it's safe to access other non-atomic memory (can't synchronize access to it).
By "threads agree on the order" I mean that:
Each thread will access each separate variable in the exact order you tell it to. E.g. a.store(1, relaxed); a.store(2, relaxed); will write 1, then 2, never in the opposite order. But accesses to different variables in the same thread can still be reordered relative to each other.
If a thread A writes to a variable several times, then thread B reads several times, it will get the values in the same order (but of course it can read some values several times, or skip some, if you don't synchronize the threads in other ways).
No other guarantees are given.
Example uses: Anything that doesn't try to use an atomic variable to synchronize access to non-atomic data: various counters (that exist for informational purposes only), or 'stop flags' to signal other threads to stop. Another example: operations on shared_ptrs that increment the reference count internally use relaxed.
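A sketch of the "stop flag" use (names are illustrative):
#include <atomic>
#include <thread>

std::atomic<bool> stop_requested(false);

void worker() {
    while (!stop_requested.load(std::memory_order_relaxed)) {
        // do some work; the flag doesn't guard any other shared data,
        // so relaxed is sufficient
    }
}

int main() {
    std::thread t(worker);
    // ... later, from the main thread:
    stop_requested.store(true, std::memory_order_relaxed);
    t.join();
}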
Fences: atomic_thread_fence(relaxed); does nothing.
memory_order_release, memory_order_acquire do everything relaxed does, and more (so it's supposedly slower or equivalent).
Only stores (writes) can use release. Only loads (reads) can use acquire. Read-modify-write operations such as fetch_add can be both (memory_order_acq_rel), but they don't have to.
Those let you synchronize threads:
Let's say thread 1 reads/writes to some memory M (any non-atomic or atomic variables, doesn't matter).
Then thread 1 performs a release store to a variable A. Then it stops
touching that memory.
If thread 2 then performs an acquire load of the same variable A, this load is said to synchronize with the corresponding store in thread 1.
Now thread 2 can safely read/write to that memory M.
You only synchronize with the latest writer, not preceding writers.
You can chain synchronizations across multiple threads.
There's a special rule that synchronization propagates across any number of read-modify-write operations regardless of their memory order. E.g. if thread 1 does a.store(1, release);, then thread 2 does a.fetch_add(2, relaxed);, then thread 3 does a.load(acquire), then thread 1 successfully synchronizes with thread 3, even though there's a relaxed operation in the middle.
In the above rule, a release operation X, and any subsequent read-modify-write operations on the same variable (stopping at the next non-read-modify-write operation), are called a release sequence headed by X. (So if an acquire reads from any operation in a release sequence, it synchronizes with the head of the sequence.)
If read-modify-write operations are involved, nothing stops you from synchronizing with more than one operation. In the example above, if fetch_add was using acquire or acq_rel, it too would synchronize with thread 1, and conversely, if it used release or acq_rel, thread 3 would synchronize with thread 2 in addition to thread 1.
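The release-sequence example above, written out as code (a sketch; the payload variable is made up for illustration):
#include <atomic>

std::atomic<int> a(0);
int payload = 0;        // non-atomic data published by thread 1

void thread1() {
    payload = 42;
    a.store(1, std::memory_order_release);      // heads the release sequence
}

void thread2() {
    a.fetch_add(2, std::memory_order_relaxed);  // RMW, stays in the release sequence
}

void thread3() {
    if (a.load(std::memory_order_acquire) == 3) {
        // Reading 3 means we read the value written by the fetch_add, which is
        // part of the release sequence headed by thread 1's store, so this load
        // synchronizes with thread 1 and the payload is visible:
        int v = payload;                        // guaranteed to be 42
        (void)v;
    }
}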
Example use: shared_ptr decrements its reference counter using something like fetch_sub(1, acq_rel).
Here's why: imagine that thread 1 reads/writes to *ptr, then destroys its copy of ptr, decrementing the ref count. Then thread 2 destroys the last remaining pointer, also decrementing the ref count, and then runs the destructor.
Since the destructor in thread 2 is going to access the memory previously accessed by thread 1, the acq_rel synchronization in fetch_sub is necessary. Otherwise you'd have a data race and UB.
Fences: Using atomic_thread_fence, you can essentially turn relaxed atomic operations into release/acquire operations. A single fence can apply to more than one operation, and/or can be performed conditionally.
If you do a relaxed read (or with any other order) from one or more variables, then do atomic_thread_fence(acquire) in the same thread, then all those reads count as acquire operations.
Conversely, if you do atomic_thread_fence(release), followed by any number of (possibly relaxed) writes, those writes count as release operations.
An acq_rel fence combines the effect of acquire and release fences.
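For example (a sketch of the fence pattern just described; flag and data are illustrative names):
#include <atomic>

int data = 0;                   // non-atomic payload
std::atomic<bool> flag(false);

void writer() {
    data = 42;
    std::atomic_thread_fence(std::memory_order_release);  // upgrades the relaxed store below
    flag.store(true, std::memory_order_relaxed);
}

void reader() {
    if (flag.load(std::memory_order_relaxed)) {
        std::atomic_thread_fence(std::memory_order_acquire);  // upgrades the relaxed load above
        int v = data;                                          // safe: reads 42
        (void)v;
    }
}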
Similarity with other standard library features:
Several standard library features also cause a similar synchronizes with relationship. E.g. locking a mutex synchronizes with the latest unlock, as if locking was an acquire operation, and unlocking was a release operation.
memory_order_seq_cst does everything acquire/release do, and more. This is supposedly the slowest order, but also the safest.
seq_cst reads count as acquire operations. seq_cst writes count as release operations. seq_cst read-modify-write operations count as both.
seq_cst operations can synchronize with each other, and with acquire/release operations. Beware of special effects of mixing them (see below).
seq_cst is the default order, e.g. given atomic_int x;, x = 1; does x.store(1, seq_cst);.
seq_cst has an extra property compared to acquire/release: all threads agree on the order in which all seq_cst operations happen. This is unlike weaker orders, where threads agree only on the order of operations on each individual atomic variable, but not on how the operations are interleaved - see relaxed order above.
The presence of this global operation order seems to only affect which values you can get from seq_cst loads, it doesn't in any way affect non-atomic variables and atomic operations with weaker orders (unless seq_cst fences are involved, see below), and by itself doesn't prevent any extra data race UB compared to acq/rel operations.
Among other things, this order respects the synchronizes with relationship described for acquire/release above, unless (and this is weird) that synchronization comes from mixing a seq-cst operation with an acquire/release operation (release syncing with seq-cst, or seq-cst syncing with acquire). Such a mix essentially demotes the affected seq-cst operation to an acquire/release (it may retain some of the seq-cst properties, but you'd better not count on it).
Example use:
atomic_bool x = true;
atomic_bool y = true;
// Thread 1:
x.store(false, seq_cst);
if (y.load(seq_cst)) {...}
// Thread 2:
y.store(false, seq_cst);
if (x.load(seq_cst)) {...}
Let's say you want only one thread to be able to enter the if body. seq_cst allows you to do it. Acquire/release or weaker orders wouldn't be enough here.
Fences: atomic_thread_fence(seq_cst); does everything an acq_rel fence does, and more.
Like you would expect, they bring some seq-cst properties to atomic operations done with weaker orders.
All threads agree on the order of seq_cst fences, relative to one another and to any seq_cst operations (i.e. seq_cst fences participate in the global order of seq_cst operations, which was described above).
They essentially prevent atomic operations from being reordered across themselves.
E.g. we can transform the above example to:
atomic_bool x = true;
atomic_bool y = true;
// Thread 1:
x.store(false, relaxed);
atomic_thread_fence(seq_cst);
if (y.load(relaxed)) {...}
// Thread 2:
y.store(false, relaxed);
atomic_thread_fence(seq_cst);
if (x.load(relaxed)) {...}
Both threads can't enter the if at the same time, because that would require reordering a load across the fence to be before the store.
But formally, the standard doesn't describe them in terms of reordering. Instead, it just explains how the seq_cst fences are placed in the global order of seq_cst operations. Let's say:
Thread 1 performs operation A on atomic variable X using seq_cst order, OR a weaker order preceded by a seq_cst fence.
Then:
Thread 2 performs operation B on the same atomic variable X using seq_cst order, OR a weaker order followed by a seq_cst fence.
(Here A and B are any operations, except they can't both be reads, since then it's impossible to determine which one was first.)
Then the first seq_cst operation/fence is ordered before the second seq_cst operation/fence.
Then, if you imagine a scenario (e.g. in the example above, both threads entering the if) that imposes contradictory requirements on the order, then this scenario is impossible.
E.g. in the example above, if the first thread enters the if, then the first fence must be ordered before the second one. And vice versa. This means that both threads entering the if would lead to a contradiction, and hence is not allowed.
Interoperation between different orders
Summarizing the above:
              | relaxed write | release write      | seq-cst write
relaxed load  | -             | -                  | -
acquire load  | -             | synchronizes with  | synchronizes with*
seq-cst load  | -             | synchronizes with* | synchronizes with
* = The participating seq-cst operation gets a messed up seq-cst order, effectively being demoted to an acquire/release operation. This is explained above.
Does using a stronger memory order make data transfer between threads faster?
No, it seems not.
Sequential consistency for data-race-free programs
The standard explains that if your program only uses seq_cst accesses (and mutexes), and has no data races (which cause UB), then you don't need to think about all the fancy operation reorderings. The program will behave as if only one thread executed at a time, with the threads being unpredictably interleaved.

Concurrency: Atomic and volatile in C++11 memory model

A global variable is shared across 2 concurrently running threads on 2 different cores. The threads write to and read from the variable. For an atomic variable, can one thread read a stale value? Each core might have a copy of the shared variable in its cache, and when one thread writes to its copy in its cache, the other thread on a different core might read a stale value from its own cache. Or does the compiler enforce strong memory ordering to read the latest value from the other cache? The C++11 standard library has std::atomic support. How is this different from the volatile keyword? How will volatile and atomic types behave differently in the above scenario?
Firstly, volatile does not imply atomic access. It is designed for things like memory mapped I/O and signal handling. volatile is completely unnecessary when used with std::atomic, and unless your platform documents otherwise, volatile has no bearing on atomic access or memory ordering between threads.
If you have a global variable which is shared between threads, such as:
std::atomic<int> ai;
then the visibility and ordering constraints depend on the memory ordering parameter you use for operations, and the synchronization effects of locks, threads and accesses to other atomic variables.
In the absence of any additional synchronization, if one thread writes a value to ai then there is nothing that guarantees that another thread will see the value in any given time period. The standard specifies that it should be visible "in a reasonable period of time", but any given access may return a stale value.
The default memory ordering of std::memory_order_seq_cst provides a single global total order for all std::memory_order_seq_cst operations across all variables. This doesn't mean that you can't get stale values, but it does mean that the value you do get determines and is determined by where in this total order your operation lies.
If you have 2 shared variables x and y, initially zero, and have one thread write 1 to x and another write 2 to y, then a third thread that reads both may see either (0,0), (1,0), (0,2) or (1,2) since there is no ordering constraint between the operations, and thus the operations may appear in any order in the global order.
If both writes are from the same thread, which does x=1 before y=2 and the reading thread reads y before x then (0,2) is no longer a valid option, since the read of y==2 implies that the earlier write to x is visible. The other 3 pairings (0,0), (1,0) and (1,2) are still possible, depending how the 2 reads interleave with the 2 writes.
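In code, that same-thread scenario looks roughly like this (a sketch; the stores and loads use the default seq_cst order as in the surrounding discussion):
std::atomic<int> x(0), y(0);

void writer() {
    x.store(1);          // seq_cst by default
    y.store(2);
}

void reader() {
    int ry = y.load();   // read y first ...
    int rx = x.load();   // ... then x
    // Possible (rx, ry) results: (0,0), (1,0), (1,2).
    // (0,2) is impossible: seeing y == 2 implies the earlier store to x is visible.
}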
If you use other memory orderings such as std::memory_order_relaxed or std::memory_order_acquire then the constraints are relaxed even further, and the single global ordering no longer applies. Threads don't even necessarily have to agree on the ordering of two stores to separate variables if there is no additional synchronization.
The only way to guarantee you have the "latest" value is to use a read-modify-write operation such as exchange(), compare_exchange_strong() or fetch_add(). Read-modify-write operations have an additional constraint that they always operate on the "latest" value, so a sequence of ai.fetch_add(1) operations by a series of threads will return a sequence of values with no duplicates or gaps. In the absence of additional constraints, there's still no guarantee which threads will see which values though. In particular, it is important to note that the use of an RMW operation does not force changes from other threads to become visible any quicker, it just means that if the changes are not seen by the RMW then all threads must agree that they are later in the modification order of that atomic variable than the RMW operation. Stores from different threads can still be delayed by arbitrary amounts of time, depending on when the CPU actually issues the store to memory (rather than just its own store buffer), physically how far apart the CPUs executing the threads are (in the case of a multi-processor system), and the details of the cache coherency protocol.
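For example (a sketch of that "no duplicates or gaps" property):
std::atomic<int> ai(0);

int take_ticket() {
    // Every caller, on any thread, gets a distinct value; taken together the
    // returned values are 0, 1, 2, ... with no duplicates and no gaps,
    // although which thread gets which value is not determined.
    return ai.fetch_add(1);
}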
Working with atomic operations is a complex topic. I suggest you read a lot of background material, and examine published code before writing production code with atomics. In most cases it is easier to write code that uses locks, and not noticeably less efficient.
volatile and the atomic operations have a different background, and
were introduced with a different intent.
volatile dates from way back, and is principally designed to prevent
compiler optimizations when accessing memory mapped IO. Modern
compilers tend to do no more than suppress optimizations for volatile,
although on some machines, this isn't sufficient for even memory mapped
IO. Except for the special case of signal handlers, and setjmp and
longjmp sequences (where the C standard, and in the case
of signals, the Posix standard, gives additional guarantees), it must be
considered useless on a modern machine, where without special additional
instructions (fences or memory barriers), the hardware may reorder or
even suppress certain accesses. Since you shouldn't be using setjmp
et al. in C++, this more or less leaves signal handlers, and in a
multithreaded environment, at least under Unix, there are better
solutions for those as well. And possibly memory mapped IO, if you're
working on kernel code and can ensure that the compiler generates
whatever is needed for the platform in question. (According to the
standard, volatile access is observable behavior, which the compiler
must respect. But the compiler gets to define what is meant by
“access”, and most seem to define it as “a load or
store machine instruction was executed”. Which, on a modern
processor, doesn't even mean that there is necessarily a read or write
cycle on the bus, much less that it's in the order you expect.)
Given this situation, the C++ standard added atomic access, which does
provide a certain number of guarantees across threads; in particular,
the code generated around an atomic access will contain the necessary
additional instructions to prevent the hardware from reordering the
accesses, and to ensure that the accesses propagate down to the global
memory shared between cores on a multicore machine. (At one point in
the standardization effort, Microsoft proposed adding these semantics to
volatile, and I think some of their C++ compilers do. After
discussion of the issues in the committee, however, the general
consensus—including the Microsoft representative—was that it
was better to leave volatile with its original meaning, and to define
the atomic types.) Or just use the system level primitives, like
mutexes, which execute whatever instructions are needed in their code.
(They have to. You can't implement a mutex without some guarantees
concerning the order of memory accesses.)
Here's a basic synopsis of what the 2 things are:
1) Volatile keyword:
Tells the compiler that this value could alter at any moment and therefore it should not EVER cache it in a register. Look up the old "register" keyword in C. "Volatile" is basically the "-" operator to "register"'s "+". Modern compilers now do the optimization that "register" used to explicitly request by default, so nowadays you only see 'volatile'. Using the volatile qualifier will guarantee that your processing never uses a stale value, but nothing more.
2) Atomic:
Atomic operations modify data in a single clock tick, so that it is impossible for ANY other thread to access the data in the middle of such an update. They're usually limited to whatever single-clock assembly instructions the hardware supports; things like ++,--, and swapping 2 pointers. Note that this says nothing about the ORDER the different threads will RUN the atomic instructions, only that they will never run in parallel. That's why you have all those additional options for forcing an ordering.
Volatile and Atomic serve different purposes.
Volatile :
Informs the compiler to avoid optimization. This keyword is used for variables that may change unexpectedly. So it can be used to represent hardware status registers, variables modified by an ISR, or variables shared in a multi-threaded application.
Atomic :
It is also used in the case of multi-threaded applications. However, it ensures that there is no lock/stall when used in a multi-threaded application. Atomic operations are free of races and indivisible. A few of the key usage scenarios are checking whether a lock is free or held, or atomically adding to a value and returning the result, in a multi-threaded application.