Atomic operations, std::atomic<> and ordering of writes - c++

GCC compiles this:
#include <atomic>
std::atomic<int> a;
int b(0);
void func()
{
b = 2;
a = 1;
}
to this:
func():
mov DWORD PTR b[rip], 2
mov DWORD PTR a[rip], 1
mfence
ret
So, to clarify things for me:
Is any other thread reading ‘a’ as 1 guaranteed to read ‘b’ as 2?
Why does the MFENCE happen after the write to ‘a’, not before?
Is the write to ‘a’ guaranteed to be an atomic (in the narrow, non-C++ sense) operation anyway, and does that apply to all Intel processors? I assume so from this output code.
Also, clang (v3.5.1 -O3) does this:
mov dword ptr [rip + b], 2
mov eax, 1
xchg dword ptr [rip + a], eax
ret
Which appears more straightforward to my little mind, but why the different approach, what’s the advantage of each?

I put your example on the Godbolt compiler explorer, and added some functions to read, increment, or combine (a+=b) two atomic variables. I also used a.store(1, memory_order_release); instead of a = 1; to avoid getting more ordering than needed, so it's just a simple store on x86.
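The extra functions were along these lines (a sketch; the exact code I put on Godbolt may have differed, and here both variables are atomic, unlike the question where b is a plain int):
#include <atomic>
std::atomic<int> a, b;

void writer() {
    b = 2;                                   // seq_cst store
    a.store(1, std::memory_order_release);   // release store: just a plain mov on x86
}
int  reader()    { return a; }               // seq_cst load: a plain mov on x86
void increment() { a++; }                    // atomic RMW: lock add on x86
void combine()   { a += b; }                 // seq_cst load of b, then atomic RMW of a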
See below for (hopefully correct) explanations. update: I had "release" semantics confused with just a StoreStore barrier. I think I fixed all the mistakes, but may have left some.
The easy question first:
Is the write to ‘a’ guaranteed to be atomic?
Yes, any thread reading a will get either the old or the new value, not some half-written value. This happens for free on x86 and most other architectures with any aligned type that fits in a register. (e.g. not int64_t on 32bit.) Thus, on many systems, this happens to be true for b as well, the way most compilers would generate code.
There are some types of stores that may not be atomic on an x86, including unaligned stores that cross a cache line boundary. But std::atomic of course guarantees whatever alignment is necessary.
Read-modify-write operations are where this gets interesting. 1000 evaluations of a+=3 done in multiple threads at once will always produce a += 3000. You'd potentially get fewer if a wasn't atomic.
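A minimal sketch of that claim (thread count and loop bounds chosen arbitrarily):
#include <atomic>
#include <thread>

std::atomic<int> a{0};

int main() {
    auto work = [] { for (int i = 0; i < 500; ++i) a += 3; };  // 500 atomic RMWs per thread
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    // 1000 evaluations of a += 3 in total: a == 3000 is guaranteed.
    // With a plain int, some increments could be lost, giving a smaller result.
}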
Fun fact: signed atomic types guarantee two's complement wraparound, unlike normal signed types. C and C++ still cling to the idea of leaving signed integer overflow undefined in other cases. Some CPUs don't have arithmetic right shift, so leaving right-shift of negative numbers undefined makes some sense, but otherwise it just feels like a ridiculous hoop to jump through now that all CPUs use 2's complement and 8bit bytes. </rant>
Is any other thread reading ‘a’ as 1 guaranteed to read ‘b’ as 2?
Yes, because of the guarantees provided by std::atomic.
Now we're getting into the memory model of the language, and the hardware it runs on.
C11 and C++11 have a very weak memory ordering model, which means the compiler is allowed to reorder memory operations unless you tell it not to. (source: Jeff Preshing's Weak vs. Strong Memory Models). Even if x86 is your target machine, you have to stop the compiler from re-ordering stores at compile time. (e.g. normally you'd want the compiler to hoist a = 1 out of a loop that also writes to b.)
Using C++11 atomic types gives you full sequential-consistency ordering of operations on them with respect to the rest of the program, by default. This means they're a lot more than just atomic. See below for relaxing the ordering to just what's needed, which avoids expensive fence operations.
Why does the MFENCE happen after the write to ‘a’, not before?
StoreStore fences are a no-op with x86's strong memory model, so the compiler just has to put the store to b before the store to a to implement the source code ordering.
Full sequential consistency also requires that the store be globally ordered / globally visible before any later loads in program order.
x86 can re-order stores after loads. In practice, what happens is that out-of-order execution sees an independent load in the instruction stream, and executes it ahead of a store that was still waiting on the data to be ready. Anyway, sequential consistency forbids this, so gcc uses MFENCE, which is a full barrier, including StoreLoad, the only kind of reordering x86 doesn't already forbid. (LFENCE/SFENCE are only useful for weakly-ordered operations like movnt.)
Another way to put this is the way the C++ docs put it: sequential consistency guarantees that all threads see all changes in the same order. The MFENCE after every atomic store guarantees that this thread's later loads wait until the store is visible to other threads. Otherwise, our loads could see our own stores before other threads' loads saw them. A StoreLoad barrier (MFENCE) delays our loads until after the stores that need to happen first.
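This is the classic StoreLoad litmus test (the same shape as the Preshing demo linked in the references below); a sketch:
#include <atomic>

std::atomic<int> x{0}, y{0};
int r1, r2;

void thread1() { x.store(1); r1 = y.load(); }   // seq_cst store, then seq_cst load
void thread2() { y.store(1); r2 = x.load(); }
// Run concurrently: with seq_cst (mov + mfence for the stores), r1 == 0 && r2 == 0 is impossible.
// With plain release stores and no StoreLoad barrier, both loads can execute before either
// store leaves its store buffer, so r1 == r2 == 0 can happen even on x86.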
The ARM32 asm for b=2; a=1; is:
# get pointers and constants into registers
str r1, [r3] # store b=2
dmb sy # Data Memory Barrier: full memory barrier to order the stores.
# I think just a StoreStore barrier here (dmb st) would be sufficient, but gcc doesn't do that. Maybe later versions have that optimization, or maybe I'm wrong.
str r2, [r3, #4] # store a=1 (a is 4 bytes after b)
dmb sy # full memory barrier to order this store wrt. all following loads and stores.
I don't know ARM asm, but what I've figured out so far is that normally it's op dest, src1 [,src2], but loads and stores always have the register operand first and the memory operand 2nd. This is really weird if you're used to x86, where a memory operand can be the source or dest for most non-vector instructions. Loading immediate constants also takes a lot of instructions, because the fixed instruction length only leaves room for 16b of payload for movw (move word) / movt (move top).
Release / Acquire
The release and acquire naming for one-way memory barriers comes from locks:
One thread modifies a shared data structure, then releases a lock. The unlock has to be globally visible after all the loads/stores to data it's protecting. (StoreStore + LoadStore)
Another thread acquires the lock (with a read, or an RMW with an acquire load), and must do all loads/stores to the shared data structure after the acquire becomes globally visible. (LoadLoad + LoadStore)
Note that std::atomic uses these names even for standalone fences, which are slightly different from load-acquire or store-release operations. (See atomic_thread_fence, below.)
Release/Acquire semantics are stronger than what producer-consumer requires. That just requires one-way StoreStore (producer) and one-way LoadLoad (consumer), without LoadStore ordering.
A shared hash table protected by a readers/writers lock (for example) requires an acquire-load / release-store atomic read-modify-write operation to acquire the lock. x86 lock xadd is a full barrier (including StoreLoad), but ARM64 has load-acquire/store-release version of load-linked/store-conditional for doing atomic read-modify-writes. As I understand it, this avoids the need for a StoreLoad barrier even for locking.
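A minimal spinlock sketch (std::atomic_flag, not a readers/writers lock) showing where the two orderings go:
#include <atomic>

std::atomic_flag locked = ATOMIC_FLAG_INIT;

void lock()   { while (locked.test_and_set(std::memory_order_acquire)) {} }  // later accesses stay after this
void unlock() { locked.clear(std::memory_order_release); }                   // earlier accesses stay before this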
Using weaker but still sufficient ordering
Writes to std::atomic types are ordered with respect to every other memory access in source code (both loads and stores), by default. You can control what ordering is imposed with std::memory_order.
In your case, you only need your producer to make sure stores become globally visible in the correct order, i.e. a StoreStore barrier before the store to a. store(memory_order_release) includes this and more. std::atomic_thread_fence(memory_order_release) is just a 1-way StoreStore barrier for all stores. x86 does StoreStore for free, so all the compiler has to do is put the stores in source order.
Release instead of seq_cst will be a big performance win, esp. on architectures like x86 where release is cheap/free. This is even more true if the no-contention case is common.
Reading atomic variables also imposes full sequential consistency of the load with respect to all other loads and stores. On x86, this is free. LoadLoad and LoadStore barriers are no-ops and implicit in every memory op. You can make your code more efficient on weakly-ordered ISAs by using a.load(std::memory_order_acquire).
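Putting the release store and acquire load together for the flag-plus-payload pattern from the question, a minimal sketch:
#include <atomic>

int b;                          // non-atomic payload, as in the question
std::atomic<int> a{0};          // the flag

void producer() {
    b = 2;                                      // plain store
    a.store(1, std::memory_order_release);      // b is guaranteed visible before a == 1 is
}

void consumer() {
    while (a.load(std::memory_order_acquire) != 1) {}   // spin until the flag is set
    // Reading b here is guaranteed to give 2 (and is not a data race).
}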
Note that the std::atomic standalone fence functions confusingly reuse the "acquire" and "release" names for StoreStore and LoadLoad fences that order all stores (or all loads) in at least the desired direction. In practice, they will usually emit HW instructions that are 2-way StoreStore or LoadLoad barriers. This doc is the proposal for what became the current standard. You can see how memory_order_release maps to a #LoadStore | #StoreStore on SPARC RMO, which I assume was included partly because it has all the barrier types separately. (hmm, the cppref web page only mentions ordering stores, not the LoadStore component. It's not the C++ standard, though, so maybe the full standard says more.)
memory_order_consume isn't strong enough for this use-case. This post talks about your case of using a flag to indicate that other data is ready, and talks about memory_order_consume.
consume would be enough if your flag was a pointer to b, or even a pointer to a struct or array. However, no compiler knows how to do the dependency tracking to make sure it puts things in the proper order in the asm, so current implementations always treat consume as acquire. This is too bad, because every architecture except DEC Alpha (and C++11's software model) provides this ordering for free. According to Linus Torvalds, only a few Alpha hardware implementations actually could have this kind of reordering, so the expensive barrier instructions needed all over the place were pure downside for most Alphas.
The producer still needs to use release semantics (a StoreStore barrier), to make sure the new payload is visible when the pointer is updated.
It's not a bad idea to write code using consume, if you're sure you understand the implications and don't depend on anything that consume doesn't guarantee. In the future, once compilers are smarter, your code will compile without barrier instructions even on ARM/PPC. The actual data movement still has to happen between caches on different CPUs, but on weak memory model machines, you can avoid waiting for any unrelated writes to be visible (e.g. scratch buffers in the producer).
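For reference, the pointer-publishing shape where consume is (in principle) sufficient looks roughly like this sketch; as noted, today the consume load compiles the same as acquire:
#include <atomic>

struct Payload { int x, y; };
std::atomic<Payload*> ptr{nullptr};

void producer() {
    Payload* p = new Payload{1, 2};            // build the payload
    ptr.store(p, std::memory_order_release);   // publish the pointer
}

void consumer() {
    Payload* p = ptr.load(std::memory_order_consume);   // compiled as acquire today
    if (p) {
        int sum = p->x + p->y;   // data-dependent loads: ordered for free everywhere but Alpha
        (void)sum;
    }
}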
Just keep in mind that you can't actually test memory_order_consume code experimentally, because current compilers are giving you stronger ordering than the code requests.
It's really hard to test any of this experimentally anyway, because it's timing-sensitive. Also, unless the compiler re-orders operations (because you failed to tell it not to), producer-consumer threads will never have a problem on x86. You'd need to test on an ARM or PowerPC or something to even try to look for ordering problems happening in practice.
references:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67458: I reported the gcc bug I found with b=2; a.store(1, MO_release); b=3; producing a=1;b=3 on x86, rather than b=3; a=1;
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67461: I also reported the fact that ARM gcc uses two dmb sy in a row for a=1; a=1;, and x86 gcc could maybe do with fewer mfence operations. I'm not sure if an mfence between each store is needed to protect a signal handler from making wrong assumptions, or if it's just a missing optimization.
The Purpose of memory_order_consume in C++11 (already linked above) covers exactly this case of using a flag to pass a non-atomic payload between threads.
What StoreLoad barriers (x86 mfence) are for: a working sample program that demonstrates the need: http://preshing.com/20120515/memory-reordering-caught-in-the-act/
Data-dependency barriers (only Alpha needs explicit barriers of this type, but C++ potentially needs them to prevent the compiler doing speculative loads): http://www.mjmwired.net/kernel/Documentation/memory-barriers.txt#360
Control-dependency barriers: http://www.mjmwired.net/kernel/Documentation/memory-barriers.txt#592
Doug Lea says x86 only needs LFENCE for data that was written with "streaming" writes like movntdqa or movnti. (NT = non-temporal). Besides bypassing the cache, x86 NT loads/stores have weakly-ordered semantics.
http://preshing.com/20120913/acquire-and-release-semantics/
http://preshing.com/20120612/an-introduction-to-lock-free-programming/ (pointers to books and other stuff he recommends).
Interesting thread on realworldtech about whether barriers everywhere or strong memory models are better, including the point that data-dependency is nearly free in HW, so it's dumb to skip it and put a large burden on software. (The thing Alpha (and C++) doesn't have, but everything else does). Go back a few posts from that to see Linus Torvalds' amusing insults, before he got around to explaining more detailed / technical reasons for his arguments.

Related

Any operation/fence available weaker than release but still offering synchronize-with semantic?

std::memory_order_release and std::memory_order_acquire operations provide the synchronize-with semantic.
In addition to that, std::memory_order_release guarantees that all loads and stores can't be reordered past the release operation.
Questions:
Is there anything in C++20/23 that provides the same synchronized-with semantic but isn't as strong as std::memory_order_release such that loads can be reordered past the release operation? In a hope that the out-of-order code is more optimized (by compiler or by CPU).
Let's say there is no such thing in C++20/23; is there any non-standard way to do so (e.g. some inline asm) for x86 on Linux?
ISO C++ only has three orderings that apply to stores: relaxed, release and seq_cst. Relaxed is clearly too weak, and seq_cst is strictly stronger than release. So, no.
The property that neither loads nor stores may be reordered past a release store is necessary to provide the synchronize-with semantics that you want, and can't be weakened in any way I can think of without breaking them. The point of synchronize-with is that a release store can be used as the end of a critical section. Operations within that critical section, both loads and stores, have to stay there.
Consider the following code:
#include <atomic>
#include <iostream>

std::atomic<bool> go{false};
int crit = 17;

void thr1() {
    int tmp = crit;
    go.store(true, std::memory_order_release);
    std::cout << tmp << std::endl;
}

void thr2() {
    while (!go.load(std::memory_order_acquire)) {
        // delay
    }
    crit = 42;
}
This program is free of data races and must output 17. This is because the release store in thr1 synchronizes with the final acquire load in thr2, the one that returns true (thus taking its value from the store). This implies that the load of crit in thr1 happens-before the store in thr2, so they don't race, and the load does not observe the store.
If we replaced the release store in thr1 with your hypothetical half-release store, such that the load of crit could be reordered after go.store(true, half_release), then that load might take place any amount of time later. It could in particular happen concurrently with, or even after, the store of crit in thr2. So it could read 42, or garbage, or anything else could happen. This should not be possible if go.store(true, half_release) really did synchronize with go.load(acquire).
ISO C++
In ISO C++, no, release is the minimum for the writer side of doing some (possibly non-atomic) stores and then storing a data_ready flag. Or for locking / mutual exclusion, to keep loads before a release store and stores after an acquire load (no LoadStore reordering). Or anything else happens-before gives you. (C++'s model works in terms of guarantees on what a load can or must see, not in terms of local reordering of loads and stores from a coherent cache. I'm talking about how they're mapped into asm for normal ISAs.) acq_rel RMWs or seq_cst stores or RMWs also work, but are stronger than release.
Asm with weaker guarantees that might be sufficient for some cases
In asm for some platform, perhaps there might be something weaker you could do, but it wouldn't be fully happens-before. I don't think there are any requirements on release which are superfluous to happens-before and normal acq/rel synchronization. (https://preshing.com/20120913/acquire-and-release-semantics/).
Some common use cases for acq/rel sync only need StoreStore ordering on the writer side and LoadLoad on the reader side (e.g. producer/consumer with one-way communication, non-atomic stores and a data_ready flag). Without the LoadStore ordering requirement, I could imagine either the writer or the reader being cheaper on some platforms.
Perhaps PowerPC or RISC-V? I checked what compilers do on Godbolt for a.load(acquire) and a.store(1, release).
# clang(trunk) for RISC-V -O3
load(std::atomic<int>&): # acquire
lw a0, 0(a0) # apparently RISC-V just has barriers, not acquire *operations*
fence r, rw # but the barriers do let you block only what is necessary
ret
store(std::atomic<int>&): # release
fence rw, w
li a1, 1
sw a1, 0(a0)
ret
If fence r and/or fence w exist and are ever cheaper than fence r,rw or fence rw,w, then yes, RISC-V can do something slightly cheaper than acq/rel. Unless I'm missing something, that would still be strong enough if you just want loads after an acquire load to see stores from before a release store, but don't care about LoadStore ordering: other loads staying before a release store, and other stores staying after an acquire load.
CPUs naturally want to load early and store late to hide latencies, so it's usually not much of a burden to actually block LoadStore reordering on top of blocking LoadLoad or StoreStore. At least that's true for an ISA as long as it's possible to get the ordering you need without having to use a much stronger barrier. (i.e. when the only option that meets the minimum requirement is far beyond it, like 32-bit ARMv7 where you'd need a dmb ish full barrier that also blocked StoreLoad.)
https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/ - as Preshing notes, LoadStore reordering is usually only useful on cache-miss loads. ("Instruction*" reordering isn't the best way to think of it, though; the important part is the ordering of access to cache. Stores don't access cache until they come out the far end of the store buffer; executing a store just writes data and address into the store buffer. Loads do access cache when they execute.)
How does memory reordering help processors and compilers?
release is free on x86; other ISAs are more interesting.
memory_order_release is basically free on x86, only needing to block compile-time reordering. (See C++ How is release-and-acquire achieved on x86 only using MOV? - The x86 memory model is program order plus a store-buffer with store forwarding).
x86 is a silly choice to ask about; something like PowerPC where there are multiple different choices of light-weight barrier would be more interesting. Turns out it only needs one barrier each for acquire and release, but seq_cst needs multiple different barriers before and after.
PowerPC asm looks like this for load(acquire) and store(1,release) -
load(std::atomic<int>&):
lwz %r3,0(%r3)
cmpw %cr0,%r3,%r3 #; I think for a data dependency on the load
bne- %cr0,$+4 #; never-taken, if I'm reading this right?
isync #; instruction sync, blocking the front-end until older instructions retire?
blr
store(std::atomic<int>&):
li %r9,1
lwsync # light-weight sync = LoadLoad + StoreStore + LoadStore. (But not blocking StoreLoad)
stw %r9,0(%r3)
blr
I don't know if isync is always cheaper than lwsync which I'd think would also work there; I'd have thought stalling the front-end might be worse than imposing some ordering on loads and stores.
I suspect the reason for the compare-and-branch instead of just isync (documentation) is that a load can retire from the back-end ("complete") once it's known to be non-faulting, before the data actually arrives.
(x86 doesn't do this, but weakly-ordered ISAs do; it's how you get LoadStore reordering on CPUs like ARM, with in-order or out-of-order exec. Retirement goes in program order, but stores can't commit to L1d cache until after they retire. x86 requiring loads to produce a value before they can retire is one way to guarantee LoadStore ordering. How is load->store reordering possible with in-order commit?)
So on PowerPC, the compare into condition-register 0 (%cr0) has a data dependency on the load, so it can't execute until the data arrives, and thus the load can't complete. I don't know why there's also an always-false branch on it. I think the $+4 branch destination is the isync instruction, in case that matters. I wonder if the branch could be omitted if you only need LoadLoad, not LoadStore? Unlikely.
IDK if ARMv7 can maybe block just LoadLoad or StoreStore. If so, that would be a big win over dmb ish, which compilers use because they also need to block LoadStore.
Loads cheaper than acquire: memory_order_consume
This is the useful hardware feature that ISO C++ doesn't currently expose (because std::memory_order_consume is defined in a way that's too hard for compilers to implement correctly in every corner case, without introducing more barriers. Thus it's deprecated, and compilers handle it the same as acquire).
Dependency ordering (on all CPUs except DEC Alpha) makes it safe to load a pointer and deref it without any barriers or special load instructions, and still see the pointed-to data if the writer used a release store.
If you want to do something cheaper than ISO C++ acq/rel, the load side is where the savings are on ISAs like POWER and ARMv7. (Not x86; full acquire is free). To a much lesser extent on ARMv8 I think, as ldapr should be cheapish.
See C++11: the difference between memory_order_relaxed and memory_order_consume for more, including a talk from Paul McKenney about how Linux uses plain loads (effectively relaxed) to make the read side of RCU very very cheap, with no barriers, as long as they're careful to not write code where the compiler can optimize away the data dependency into just a control dependency or nothing.
Also related:
Memory order consume usage in C11
C++11: the difference between memory_order_relaxed and memory_order_consume
When should you not use [[carries_dependency]]?
Are memory orderings: consume, acq_rel and seq_cst ever needed on Intel x86?
[[carries_dependency]] what it means and how to implement
What does memory_order_consume really do?

Does compiler need to care about other threads during optimizations?

This is a spin-off from a discussion about C# thread safety guarantees.
I had the following presupposition:
in absence of thread-aware primitives (mutexes, std::atomic* etc., let's exclude volatile as well for simplicity) a valid C++ compiler may do any kinds of transformations, including introducing reads from the memory (or e. g. writes if it wants to), if the semantics of the code in the current thread (that is, output and [excluded in this question] volatile accesses) remain the same from the current thread's point of view, that is, disregarding existence of other threads. The fact that introducing reads/writes may change other thread's behavior (e. g. because the other threads read the data without proper synchronization or performing other kinds of UB) can be totally ignored by a standard-conform compiler.
Is this presupposition correct or not? I would expect this to follow from the as-if rule. (I believe it is, but other people seem to disagree with me.) If possible, please include the appropriate normative references.
Yes, C++ defines data race UB as potentially-concurrent access to non-atomic objects when not all the accesses are reads. Another recent Q&A quotes the standard, including:
[intro.races]/2 - Two expression evaluations conflict if one of them modifies a memory location ... and the other one reads or modifies the same memory location.
[intro.races]/21 ... The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, ...
Any such data race results in undefined behavior.
That gives the compiler freedom to optimize code in ways that preserve the behaviour of the thread executing a function, but not what other threads (or a debugger) might see if they go looking at things they're not supposed to. (i.e. data race UB means that the order of reading/writing non-atomic variables is not part of the observable behaviour an optimizer has to preserve.)
introducing reads/writes may change other thread's behavior
The as-if rule allows you to invent reads, but no you can't invent writes to objects this thread didn't already write. That's why if(a[i] > 10) a[i] = 10; is different from a[i] = a[i]>10 ? 10 : a[i].
It's legal for two different threads to write a[1] and a[2] at the same time, and one thread loading a[0..3] and then storing back some modified and some unmodified elements could step on the store by the thread that wrote a[2].
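A sketch of that difference, with hypothetical function names:
int a[4];

// Writes a[i] only when the condition holds: safe even if other threads own other elements,
// and (per the as-if rule) a compiler may not turn it into an unconditional store.
void clamp_one(int i) {
    if (a[i] > 10) a[i] = 10;
}

// The branchless form stores a[i] unconditionally, even when the value is unchanged.
// A compiler that invented such a store while vectorizing over a[0..3] could overwrite another
// thread's concurrent store to a[2] with the stale value it loaded earlier.
void clamp_one_branchless(int i) {
    a[i] = a[i] > 10 ? 10 : a[i];
}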
Crash with icc: can the compiler invent writes where none existed in the abstract machine? is a detailed look at a compiler bug where ICC did that when auto-vectorizing with SIMD blends. Including links to Herb Sutter's atomic weapons talk where he discusses the fact that compilers must not invent writes.
By contrast, AVX-512 masking and AVX vmaskmovps etc, like ARM SVE and RISC-V vector extensions I think, do have proper masking with fault suppression to actually not store at all to some SIMD elements, without branching.
When using a mask register with AVX-512 load and stores, is a fault raised for invalid accesses to masked out elements? AVX-512 masking does indeed do fault-suppression for read-only or unmapped pages that masked-off elements extend into.
AVX-512 and Branching - auto-vectorizing with stores inside an if() vs. branchless.
It's legal to invent atomic RMWs (except without the Modify part), e.g. an 8-byte lock cmpxchg [rcx], rdx if you want to modify some of the bytes in that region. But in practice that's more costly than just storing modified bytes individually so compilers don't do that.
Of course a function that does unconditionally write a[2] can write it multiple times, and with different temporary values before eventually updating it to the final value. (Probably only a Deathstation 9000 would invent different-valued temporary contents, like turning a[2] = 3 into a[2] = 2; a[2]++;)
For more about what compilers can legally do, see Who's afraid of a big bad optimizing compiler? on LWN. The context for that article is Linux kernel development, where they rely on GCC to go beyond the ISO C standard and actually behave in sane ways that make it possible to roll their own atomics with volatile int* and inline asm. It explains many of the practical dangers of reading or writing a non-atomic shared variable.

What is the difference between load/store relaxed atomic and normal variable?

As I see from a test-case: https://godbolt.org/z/K477q1
The generated assembly for a relaxed atomic load/store is the same as for a normal variable: ldr and str.
So, is there any difference between relaxed atomic and normal variable?
The difference is that a normal load/store is not guaranteed to be tear-free, whereas a relaxed atomic read/write is. Also, the atomic guarantees that the compiler doesn't rearrange or optimise-out memory accesses in a similar fashion to what volatile guarantees.
(Pre-C++11, volatile was an essential part of rolling your own atomics. But now it's obsolete for that purpose. It does still work in practice but is never recommended: When to use volatile with multi threading? - essentially never.)
On most platforms it just happens that the architecture provides a tear-free load/store by default (for aligned int and long) so it works out the same in asm if loads and stores don't get optimized away. See Why is integer assignment on a naturally aligned variable atomic on x86? for example. In C++ it's up to you to express how the memory should be accessed in your source code instead of relying on architecture-specific features to make the code work as intended.
If you were hand-writing in asm, your source code would already nail down when values were kept in registers vs. loaded / stored to (shared) memory. In C++, telling the compiler when it can/can't keep values private is part of why std::atomic<T> exists.
If you read one article on this topic, take a look at the Preshing one here:
https://preshing.com/20130618/atomic-vs-non-atomic-operations/
Also try this presentation from CppCon 2017:
https://www.youtube.com/watch?v=ZQFzMfHIxng
Links for further reading:
Read a non-atomic variable, atomically?
https://en.cppreference.com/w/cpp/atomic/memory_order#Relaxed_ordering
Causing non-atomics to tear
https://lwn.net/Articles/793895/
What is the (slight) difference on the relaxing atomic rules? which includes a link to a Herb Sutter "atomic weapons" article which is also linked here:
https://herbsutter.com/2013/02/11/atomic-weapons-the-c-memory-model-and-modern-hardware/
Also see Peter Cordes' linked article: https://electronics.stackexchange.com/q/387181
And a related one about the Linux kernel: https://lwn.net/Articles/793253/
No tearing is only part of what you get with std::atomic<T> - you also avoid data race undefined behaviour.
atomic<T> constrains the optimizer to not assume the value is unchanged between accesses in the same thread.
atomic<T> also makes sure the object is sufficiently aligned: e.g. some C++ implementations for 32-bit ISAs have alignof(int64_t) = 4 but alignof(atomic<int64_t>) = 8 to enable lock-free 64-bit operations. (e.g. gcc for 32-bit x86 GNU/Linux). In that case, usually a special instruction is needed that the compiler might not use otherwise, e.g. ARMv8 32-bit ldp load-pair, or x86 SSE2 movq xmm before bouncing to integer regs.
In asm for most ISAs, pure-load and pure-store of naturally-aligned int and long are atomic for free, so atomic<T> with memory_order_relaxed can compile to the same asm as plain variables; atomicity (no tearing) doesn't require any special asm. For example: Why is integer assignment on a naturally aligned variable atomic on x86? Depending on surrounding code, the compiler might not manage to optimize out any accesses to non-atomic objects, in which case code-gen will be the same between plain T and atomic<T> with mo_relaxed.
The reverse is not true: It's not at all safe to write C++ as if you were writing in asm. In C++, multiple threads accessing the same object at the same time is data-race undefined behaviour, unless all the accesses are reads.
Thus C++ compilers are allowed to assume that no other threads are changing a variable in a loop, per the "as-if" optimization rule. If bool done is not atomic, a loop like while(!done) { } will compile into if(!done) infinite_loop;, hoisting the load out of the loop. See Multithreading program stuck in optimized mode but runs normally in -O0 for a detailed example with compiler asm output. (Compiling with optimization disabled is very similar to making every object volatile: memory in sync with the abstract machine between C++ statements for consistent debugging.)
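A sketch of that exit-flag case with the atomic fix:
#include <atomic>

std::atomic<bool> done{false};   // with a plain bool, the loop below can become if(!done) infinite_loop;

void waiter() {
    while (!done.load(std::memory_order_relaxed)) {}   // the load is redone every iteration
}

void finisher() {
    done.store(true, std::memory_order_relaxed);        // relaxed is enough just to let the loop exit
}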
Also obviously RMW operations like += or var.fetch_add(1, mo_seq_cst) are atomic and do have to compile to different asm than non-atomic +=. Can num++ be atomic for 'int num'?
The constraints on the optimizer placed by atomic operations are similar to what volatile does. In practice volatile is a way to roll your own mo_relaxed atomic<T>, but without any easy way to get ordering wrt. other operations. It's de-facto supported on some compilers, like GCC, because it's used by the Linux kernel. However, atomic<T> is guaranteed to work by the ISO C++ standard; When to use volatile with multi threading? - there's almost never a reason to roll your own, just use atomic<T> with mo_relaxed.
Also related: Why don't compilers merge redundant std::atomic writes? / Can and does the compiler optimize out two atomic loads? - compilers currently don't optimize atomics at all, so atomic<T> is currently equivalent to volatile atomic<T>, pending further standards work to provide ways for programmers to control when / what optimization would be ok.
Very good question actually, and I asked the same question when I started learning concurrency.
I'll answer as simple as possible, even though the answer is a bit more complicated.
Reading and writing the same non-atomic variable from different threads is undefined behavior - one thread is not guaranteed to read the value that the other thread wrote.
Using an atomic variable solves the problem - by using atomics, all threads are guaranteed to read the latest written value of the atomic itself, even if the memory order is relaxed.
In fact, atomics are always thread safe, regardless of the memory order!
The memory order is not for the atomics -> it's for non atomic data.
Here is the thing - if you use locks, you don't have to think about those low-level things. memory orders are used in lock-free environments where we need to synchronize non atomic data.
Here is the beautiful thing about lock-free algorithms: we use atomic operations that are always thread safe, but we "piggyback" memory orders onto those operations to synchronize the non-atomic data used in those algorithms.
For example, a lock-free linked list. Usually, a lock-free linked-list node looks something like this:
template <class T>
struct Node {
    std::atomic<Node*> next_node{nullptr};
    T non_atomic_data;
};
Now, let's say I push a new node into the list. next_node is always thread safe, another thread will always see the latest atomic value.
But who guarantees that other threads see the correct value of non_atomic_data?
No-one.
Here is a perfect example of the usage of memory orders - we "piggyback" atomic stores and loads to next_node by also adding memory orders that synchronize the value of non_atomic_data.
So when we store a new node to the list, we use memory_order_release to "push" the non-atomic data to main memory. When we read the new node by reading next_node, we use memory_order_acquire and then we "pull" the non-atomic data from main memory.
This way we assure that both next_node and non_atomic_data are always synchronized across threads.
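A sketch of that piggybacking for pushing onto the front of such a list (Treiber-stack style, using the Node defined above and assuming <atomic> is included):
void push_front(std::atomic<Node<int>*>& head, int value) {
    auto* n = new Node<int>;
    n->non_atomic_data = value;                         // plain store, ordered by the release below
    Node<int>* old = head.load(std::memory_order_relaxed);
    do {
        n->next_node.store(old, std::memory_order_relaxed);
    } while (!head.compare_exchange_weak(old, n,
                 std::memory_order_release,             // success: publishes the node and its payload
                 std::memory_order_relaxed));           // failure: old was reloaded, just retry
}
// Readers that follow head / next_node with memory_order_acquire loads are guaranteed
// to see non_atomic_data of every node they reach.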
memory_order_relaxed doesn't synchronize any non-atomic data, it synchronizes only itself - the atomic variable. When this is used, developers can assume that the atomic variable doesn't reference any non-atomic data published by the same thread that wrote the atomic variable. In other words, that atomic variable isn't, for example, an index of a non-atomic array, or a pointer to non atomic data, or an iterator to some non-thread safe collection. (It would be fine to use relaxed atomic stores and loads for an index into a constant lookup table, or one that's synchronized separately. You only need acq/rel synchronization if the pointed-to or indexed data was written by the same thread.)
This is faster (at least on some architectures) than using stronger memory orders but can be used in fewer cases.
Great, but even this is not the full answer. I said memory orders are not used for atomics. I was half-lying.
With relaxed memory order, atomics are still thread safe, but they have a downside - they can be reordered. Look at the following snippet:
a.store(1, std::memory_order_relaxed);
b.store(2, std::memory_order_relaxed);
In reality, a.store can happen after b.store. The CPU does this all the time; it's called out-of-order execution, and it's one of the optimization techniques CPUs use to speed up execution. a and b are still thread-safe, even though the thread-safe stores might happen in reverse order.
Now, what happens if there is a meaning for the order? Many lock-free algorithms depend on the order of atomic operations for their correctness.
Memory orders are also used to prevent reordering. This is why memory orders are so complicated, because they do 2 things at the same time.
memory_order_acquire tells the compiler and CPU not to move operations that come after it in the code to before it.
Similarly, memory_order_release tells the compiler and CPU not to move operations that come before it in the code to after it.
memory_order_relaxed tells the compiler/CPU that the atomic operation can be reordered where possible, in a similar way to how non-atomic operations are reordered whenever possible.

is std::atomic::fetch_add a serializing operation on x86-64?

Considering the following code:
std::atomic<int> counter;
/* otherStuff 1 */
counter.fetch_add(1, std::memory_order_relaxed);
/* otherStuff 2 */
Is there an instruction in x86-64 (say less than 5 years old architectures) that would allow otherStuff 1 and 2 be re-ordered across the fetch_add or is it going to be always serializing ?
EDIT:
It looks like this is summarized by "is lock add a memory barrier on x86 ?" and it seems it is not, though I am not sure where to find a reference for that.
First let's look at what the compiler is allowed to do when using std::memory_order_relaxed.
If there are no dependencies between otherStuff 1/2 and the atomic operation, it can certainly reorder the statements. For example:
g = 3;
a.fetch_add(1, memory_order_relaxed);
g += 12;
clang++ generates the following assembly:
lock addl $0x1,0x2009f5(%rip) # 0x601040 <a>
movl $0xf,0x2009e7(%rip) # 0x60103c <g>
Here clang took the liberty to reorder g = 3 with the atomic fetch_add operation, which is a legitimate transformation.
When using std::memory_order_seq_cst, the compiler output becomes:
movl $0x3,0x2009f2(%rip) # 0x60103c <g>
lock addl $0x1,0x2009eb(%rip) # 0x601040 <a>
addl $0xc,0x2009e0(%rip) # 0x60103c <g>
Reordering of statements does not take place because the compiler is not allowed to do that.
Sequential consistent ordering on a read-modify-write (RMW) operation, is both a release and an acquire operation and as such, no (visible) reordering of statements is allowed on both compiler and CPU level.
Your question is whether, on x86-64, std::atomic::fetch_add using relaxed ordering is a serializing operation.
The answer is: yes, if you do not take into account compiler reordering.
On the X86 architecture, an RMW operation always flushes the store buffer and therefore is effectively a serializing and sequentially consistent operation.
You can say that, on an X86 CPU, each RMW operation:
is a release operation for memory operations that precede it and is an acquire operation for memory operations that follow it.
becomes visible in a single total order observed by all threads.
The target architecture
On the X86 architecture, an RMW operation always flushes the store buffer and therefore is effectively a serializing and sequentially consistent operation.
I wish people would stop saying that.
That statement doesn't even make sense, as there is no such thing as a "sequentially consistent operation": "sequential consistency" isn't a property of any single operation. A sequentially consistent execution is one whose end result is consistent with some interleaving of the operations.
What can be said about these RMW operations:
all operations before the RMW have to be globally visible before either the R or W of the RMW is visible
and no operations after the RMW are visible before the RMW is visible.
That is, the part before, the RMW itself, and the part after run sequentially. In other words, there is a full fence before and after the RMW.
Whether that results in a sequential execution for the complete program depends on the nature of all globally visible operations of the program.
Visibility vs. execution ordering
That's in terms of visibility. I have no idea whether these processors try to speculatively execute code after the RMW, subject to the correctness requirement that operations are rolled back if there is a conflict with a side effect on a parallel execution (these details tend to differ between vendors and generations within a given family, unless clearly specified).
The answer to your question could be different depending on whether
you need to guarantee correctness of the set of side effect (as in sequential consistency requirement),
or guarantee that benchmarks are reliable,
or guarantee that comparative timing is CPU-version independent: that is, guarantee something about the results of comparing the timing of different executions (for a given CPU).
High level languages vs. CPU features
The question title is "is std::atomic::fetch_add a serializing operation on x86-64?" of the general form:
"does OP provide guarantees P on ARCH"
where
OP is a high level operation in a high level language
P is the desired property
ARCH is a specific CPU or compiler target
As a rule, the canonical answer is: the question doesn't make sense, OP being high level and target independent. There is a low level/high level mismatch here.
The compiler is bound by the language standard (or rather its most reasonable interpretation), by documented extension, by history... not by the standard for the target architecture, unless the feature is a low level, transparent feature of the high level language.
The canonical way to get low level semantic in C/C++ is to use volatile objects and volatile operations.
Here you must use a volatile std::atomic<int> to even be able to ask a meaningful question about architectural guarantees.
Present code generation
The meaningful variant of your question would use this code:
volatile std::atomic<int> counter;
/* otherStuff 1 */
counter.fetch_add(1, std::memory_order_relaxed);
That statement will generate an atomic RMW operation which in that case "is a serializing operation" on the CPU: all operations performed before it, in assembly code, are complete before the RMW starts; all operations following the RMW wait until the RMW completes to start (in terms of visibility).
And then you would need to learn about the unpleasantness of the volatile semantics: volatile applies only to the volatile operations themselves, so you would still not get general guarantees about other operations.
There is no guarantee that high level C++ operations before the volatile RMW operation are sequenced before it in the assembly code. You would need a "compiler barrier" to do that. These barriers are not portable. (And not needed here, as it's a silly approach anyway.)
But then if you want that guarantee, you can just use:
a release operation: to ensure that previous globally visible operations are complete
an acquire operation: to ensure that following globally visible operations do not start before
RMW operation on an object that is visible by multiple threads.
So why not make your RMW operation acq_rel? Then it wouldn't even need to be volatile.
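That is, something like this sketch expresses the guarantee in the source language instead of asking the target architecture for it:
#include <atomic>

std::atomic<int> counter{0};

void hit() {
    // acq_rel orders this RMW after earlier loads/stores and before later ones, on any
    // architecture, without volatile and without relying on x86's lock-prefix behaviour.
    counter.fetch_add(1, std::memory_order_acq_rel);
}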
Possible RMW variants in a processor family
Is there an instruction in x86-64 (say less than 5 years old architectures) that would
Potential variants of the instruction set is another sub-question. Vendors can introduce new instructions, and ways to test for their availability at runtime; and compilers can even generate code to detect their availability.
Any RMW feature that would follow the existing tradition (1) of strong ordering of usual memory operations in that family would have to respect the traditions:
Total Store Order: all store operations are ordered, implicitly fenced; in other words, there is a store buffer strictly for non-speculative store operations in each core, that is not reordered and not shared between cores;
each store is a release operation (for previous normal memory operations);
loads that are speculatively started are completed in order, and at completion are validated: any early load for a location that was then clobbered in the cache is cancelled and the computation is restarted with the recent value;
loads are acquire operations.
Then any new (but traditional) RMW operation must be both an acquire operation and a release operation.
(Examples for potential imaginary RMW operation to be added in the future are xmult and xdiv.)
But that's futurology, and adding less-ordered instructions in the future wouldn't violate any security invariants, except potentially against timing-based, side-channel Spectre-like attacks, which we don't know how to model and reason about in general anyway.
The problem with these questions, even about the present, is that a proof of absence would be required, and for that we would need to know about each variant for a CPU family. That is not always doable, and also, unnecessary if you use the proper ordering in the high level code, and useless if you don't.
(1) Traditions for guarantees of memory operations are guidelines in the CPU design, not guarantees about any future operation: by definition, operations that don't yet exist have no guarantee about their semantics, besides the guarantees of memory integrity, that is, the guarantee that no future operation will break the privileges and security guarantees previously established (no unprivileged instruction created in the future can access an unmapped memory address...).
When using std::memory_order_relaxed the only guarantee is that the operation is atomic. Anything around the operation can be reordered at will by either the compiler or the CPU.
From https://en.cppreference.com/w/cpp/atomic/memory_order:
Relaxed operation: there are no synchronization or ordering constraints imposed on other reads or writes, only this operation's atomicity is guaranteed (see Relaxed ordering below)

Concurrency: Atomic and volatile in C++11 memory model

A global variable is shared across 2 concurrently running threads on 2 different cores. The threads write to and read from the variable. For an atomic variable, can one thread read a stale value? Each core might have a copy of the shared variable in its cache, and when one thread writes to its copy in a cache, the other thread on a different core might read a stale value from its own cache. Or does the compiler enforce strong memory ordering to read the latest value from the other cache? The C++11 standard library has std::atomic support. How is this different from the volatile keyword? How will volatile and atomic types behave differently in the above scenario?
Firstly, volatile does not imply atomic access. It is designed for things like memory mapped I/O and signal handling. volatile is completely unnecessary when used with std::atomic, and unless your platform documents otherwise, volatile has no bearing on atomic access or memory ordering between threads.
If you have a global variable which is shared between threads, such as:
std::atomic<int> ai;
then the visibility and ordering constraints depend on the memory ordering parameter you use for operations, and the synchronization effects of locks, threads and accesses to other atomic variables.
In the absence of any additional synchronization, if one thread writes a value to ai then there is nothing that guarantees that another thread will see the value in any given time period. The standard specifies that it should be visible "in a reasonable period of time", but any given access may return a stale value.
The default memory ordering of std::memory_order_seq_cst provides a single global total order for all std::memory_order_seq_cst operations across all variables. This doesn't mean that you can't get stale values, but it does mean that the value you do get determines and is determined by where in this total order your operation lies.
If you have 2 shared variables x and y, initially zero, and have one thread write 1 to x and another write 2 to y, then a third thread that reads both may see either (0,0), (1,0), (0,2) or (1,2) since there is no ordering constraint between the operations, and thus the operations may appear in any order in the global order.
If both writes are from the same thread, which does x=1 before y=2 and the reading thread reads y before x then (0,2) is no longer a valid option, since the read of y==2 implies that the earlier write to x is visible. The other 3 pairings (0,0), (1,0) and (1,2) are still possible, depending how the 2 reads interleave with the 2 writes.
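A sketch of that single-writer case:
#include <atomic>

std::atomic<int> x{0}, y{0};

void writer() {        // one thread does both stores, in this order
    x.store(1);        // seq_cst by default
    y.store(2);
}

void reader() {        // another thread reads y first, then x
    int ry = y.load();
    int rx = x.load();
    // Possible (rx, ry): (0,0), (1,0), (1,2). (0,2) is impossible, because seeing
    // y == 2 implies the earlier store to x is visible too.
}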
If you use other memory orderings such as std::memory_order_relaxed or std::memory_order_acquire then the constraints are relaxed even further, and the single global ordering no longer applies. Threads don't even necessarily have to agree on the ordering of two stores to separate variables if there is no additional synchronization.
The only way to guarantee you have the "latest" value is to use a read-modify-write operation such as exchange(), compare_exchange_strong() or fetch_add(). Read-modify-write operations have an additional constraint that they always operate on the "latest" value, so a sequence of ai.fetch_add(1) operations by a series of threads will return a sequence of values with no duplicates or gaps. In the absence of additional constraints, there's still no guarantee which threads will see which values though. In particular, it is important to note that the use of an RMW operation does not force changes from other threads to become visible any quicker, it just means that if the changes are not seen by the RMW then all threads must agree that they are later in the modification order of that atomic variable than the RMW operation. Stores from different threads can still be delayed by arbitrary amounts of time, depending on when the CPU actually issues the store to memory (rather than just its own store buffer), physically how far apart the CPUs executing the threads are (in the case of a multi-processor system), and the details of the cache coherency protocol.
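For example, a ticket counter built on fetch_add (a sketch) relies on exactly that property:
#include <atomic>

std::atomic<int> ai{0};

int next_ticket() {
    // Each concurrent caller gets a distinct value; the sequence handed out across all
    // threads has no duplicates and no gaps, though which thread gets which is unspecified.
    return ai.fetch_add(1);
}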
Working with atomic operations is a complex topic. I suggest you read a lot of background material, and examine published code before writing production code with atomics. In most cases it is easier to write code that uses locks, and not noticeably less efficient.
volatile and the atomic operations have a different background, and were introduced with a different intent.
volatile dates from way back, and is principally designed to prevent compiler optimizations when accessing memory mapped IO. Modern compilers tend to do no more than suppress optimizations for volatile, although on some machines, this isn't sufficient for even memory mapped IO. Except for the special case of signal handlers, and setjmp/longjmp sequences (where the C standard, and in the case of signals, the Posix standard, gives additional guarantees), it must be considered useless on a modern machine, where without special additional instructions (fences or memory barriers), the hardware may reorder or even suppress certain accesses. Since you shouldn't be using setjmp et al. in C++, this more or less leaves signal handlers, and in a multithreaded environment, at least under Unix, there are better solutions for those as well. And possibly memory mapped IO, if you're working on kernel code and can ensure that the compiler generates whatever is needed for the platform in question. (According to the standard, volatile access is observable behavior, which the compiler must respect. But the compiler gets to define what is meant by “access”, and most seem to define it as “a load or store machine instruction was executed”. Which, on a modern processor, doesn't even mean that there is necessarily a read or write cycle on the bus, much less that it's in the order you expect.)
Given this situation, the C++ standard added atomic access, which does provide a certain number of guarantees across threads; in particular, the code generated around an atomic access will contain the necessary additional instructions to prevent the hardware from reordering the accesses, and to ensure that the accesses propagate down to the global memory shared between cores on a multicore machine. (At one point in the standardization effort, Microsoft proposed adding these semantics to volatile, and I think some of their C++ compilers do. After discussion of the issues in the committee, however, the general consensus, including the Microsoft representative, was that it was better to leave volatile with its original meaning, and to define the atomic types.) Or just use the system level primitives, like mutexes, which execute whatever instructions are needed in their code. (They have to. You can't implement a mutex without some guarantees concerning the order of memory accesses.)
Here's a basic synopsis of what the 2 things are:
1) Volatile keyword:
Tells the compiler that this value could alter at any moment and therefore it should not EVER cache it in a register. Look up the old "register" keyword in C. "Volatile" is basically the "-" operator to "register"'s "+". Modern compilers now do the optimization that "register" used to explicitly request by default, so you only see 'volatile' anymore. Using the volatile qualifier guarantees that every access really re-reads or re-writes memory instead of reusing a value cached in a register, but nothing more.
2) Atomic:
Atomic operations modify data in one indivisible step, so that it is impossible for ANY other thread to observe the data in the middle of such an update. They're usually limited to whatever single-instruction operations the hardware supports: things like ++, --, and swapping 2 pointers. Note that this says nothing about the ORDER in which the different threads will RUN the atomic instructions, only that they will never run in parallel. That's why you have all those additional options for forcing an ordering.
Volatile and Atomic serve different purposes.
Volatile :
Informs the compiler to avoid optimization. This keyword is used for variables that may change unexpectedly. So, it can be used to represent hardware status registers, variables modified in an ISR, or variables shared in a multi-threaded application.
Atomic :
It is also used in multi-threaded applications, but it ensures there is no lock or stall when used there. Atomic operations are free of races and indivisible. A few key usage scenarios are checking whether a lock is free or in use, or atomically adding to a value and returning the result, in a multi-threaded application.