Using an atomic read-modify-write operation in a release sequence - c++

Say, I create an object of type Foo in thread #1 and want to be able to access it in thread #3.
I can try something like:
std::atomic<int> sync{10};
Foo *fp;
// thread 1: modifies sync: 10 -> 11
fp = new Foo;
sync.store(11, std::memory_order_release);
// thread 2a: modifies sync: 11 -> 12
while (sync.load(std::memory_order_relaxed) != 11);
sync.store(12, std::memory_order_relaxed);
// thread 3
while (sync.load(std::memory_order_acquire) != 12);
fp->do_something();
The store/release in thread #1 orders Foo with the update to 11
thread #2a non-atomically increments the value of sync to 12
the synchronizes-with relationship between thread #1 and #3 is only established when #3 loads 11
The scenario is broken because thread #3 spins until it loads 12, which may arrive out of order (wrt 11) and Foo is not ordered with 12 (due to the relaxed operations in thread #2a).
This is somewhat counter-intuitive since the modification order of sync is 10 → 11 → 12
The standard says (§ 1.10.1-6):
an atomic store-release synchronizes with a load-acquire that takes its value from the store (29.3). [ Note: Except in the specified cases, reading a later value does not necessarily ensure visibility as described below. Such a requirement would sometimes interfere with efficient implementation. —end note ]
It also says in (§ 1.10.1-5):
A release sequence headed by a release operation A on an atomic object M is a maximal contiguous subsequence of side effects in the modification order of M, where the first operation is A, and every subsequent operation
- is performed by the same thread that performed A, or
- is an atomic read-modify-write operation.
Now, thread #2a is modified to use an atomic read-modify-write operation:
// thread 2b: modifies sync: 11 -> 12
int val;
while ((val = 11) && !sync.compare_exchange_weak(val, 12, std::memory_order_relaxed));
If this release sequence is correct, Foo is synchronized with thread #3 when it loads either 11 or 12.
My questions about the use of an atomic read-modify-write are:
Does the scenario with thread #2b constitute a correct release sequence?
And if so:
What are the specific properties of a read-modify-write operation that ensure this scenario is correct?

Does the scenario with thread #2b constitute a correct release sequence?
Yes, per your quote from the standard.
What are the specific properties of a read-modify-write operation that ensure this scenario is correct?
Well, the somewhat circular answer is that the only important specific property is that "The C++ standard defines it so".
As a practical matter, one may ask why the standard defines it like this. I don't think you'll find that the answer has a deep theoretical basis: I think the committee could have also defined it such that the RMW doesn't participate in the release sequence, or (perhaps with more difficulty) have defined so that both the RMW and the separate mo_relaxed load and store participate in the release sequence, without compromising the "soundness" of the model.
They already give a performance-related reason as to why they didn't choose the latter approach:
Such a requirement would sometimes interfere with efficient implementation.
In particular, on any hardware platform that allowed load-store reordering, it would imply that even mo_relaxed loads and/or stores might require barriers! Such platforms exist today. Even on more strongly ordered platforms, it may inhibit compiler optimizations.
So why didn't they then take the other "consistent" approach of not requiring mo_relaxed RMW operations to participate in the release sequence? Probably because existing hardware implementations of RMW operations provide such guarantees, and the nature of RMW operations makes it likely that this will remain true in the future. In particular, as Peter points out in the comments above, RMW operations, even with mo_relaxed, are conceptually and practically1 stronger than separate loads and stores: they would be quite useless if they didn't have a consistent total order.
Once you accept that is how hardware works, it makes sense from a performance angle to align the standard: if you didn't, you'd have people using more restrictive orderings such as mo_acq_rel just to get the release sequence guarantees, but on real hardware that has weakly ordered CAS, this doesn't come for free.
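To make the pattern concrete, here is a minimal runnable sketch of the fixed scenario (my own harness; the Foo member and the thread structure are assumptions, not the question's exact code). Thread #2 uses a relaxed CAS, an atomic RMW, so it stays inside the release sequence headed by thread #1's store of 11, and thread #3's acquire load of 12 synchronizes with thread #1:
#include <atomic>
#include <cassert>
#include <thread>

struct Foo { int x = 42; };            // stand-in for the question's Foo

std::atomic<int> sync{10};
Foo *fp = nullptr;

int main() {
    std::thread t1([] {                                     // thread 1
        fp = new Foo;
        sync.store(11, std::memory_order_release);          // heads the release sequence
    });
    std::thread t2([] {                                     // thread 2b
        int expected = 11;
        while (!sync.compare_exchange_weak(expected, 12, std::memory_order_relaxed))
            expected = 11;                                  // RMW: part of the release sequence
    });
    std::thread t3([] {                                     // thread 3
        while (sync.load(std::memory_order_acquire) != 12) {}
        assert(fp->x == 42);   // OK: the acquire load read from the release sequence
    });
    t1.join(); t2.join(); t3.join();
    delete fp;
}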
1 The "practically" part means that even the weakest forms of RMW instructions are usually relatively "expensive" operations taking a dozen cycles or more on modern hardware, while mo_relaxed loads and stores generally just compile to plain loads and stores in the target ISA.

Related

Does C++11 sequential consistency memory order forbid store buffer litmus test?

Consider the store buffer litmus test with SC atomics:
// Initial
std::atomic<int> x(0), y(0);
// Thread 1:
x.store(1);
auto r1 = y.load();
// Thread 2:
y.store(1);
auto r2 = x.load();
Can this program end with both r1 and r2 being zero?
I can't see how this result is forbidden by the description about memory_order_seq_cst in cppreference:
A load operation with this memory order performs an acquire operation, a store performs a release operation, and read-modify-write performs both an acquire operation and a release operation, plus a single total order exists in which all threads observe all modifications in the same order
It seems to me that memory_order_seq_cst is just acquire-release plus a global store order. And I don't think the global store order comes into play in this specific litmus test.
That cppreference summary of SC is too weak, and indeed isn't strong enough to forbid this reordering.
What it says looks to me only as strong as x86-TSO (acq_rel plus no IRIW reordering, i.e. a total store order that all reader threads can agree on).
ISO C++ actually guarantees that there's a total order of all SC operations including loads (and also SC fences) that's consistent with program order. (That's basically the standard definition of sequential consistency in computer science; C++ programs that use only seq_cst atomic operations and are data-race-free for their non-atomic accesses execute sequentially consistently, i.e. "recover sequential consistency" despite full optimization being allowed for the non-atomic accesses.) Sequential consistency must forbid any reordering between any two SC operations in the same thread, even StoreLoad reordering.
This means an expensive full barrier (including StoreLoad) after every seq_cst store, or for example AArch64 STLR / LDAR can't StoreLoad reorder with each other, but are otherwise only release and acquire wrt. reordering with other operations. (So cache-hit SC stores can be quite a lot cheaper on AArch64 than x86, if you don't do any SC load or RMW operations in the same thread right afterwards.)
See https://eel.is/c++draft/atomics.order#4, which makes it clear that SC operations aren't reordered wrt. each other. The current draft standard says:
31.4 [atomics.order]
There is a single total order S on all memory_­order​::​seq_­cst operations, including fences, that satisfies the following constraints. First, if A and B are memory_­order​::​seq_­cst operations and A strongly happens before B, then A precedes B in S.
Second, for every pair of atomic operations A and B on an object M, where A is coherence-ordered before B, the following four conditions are required to be satisfied by S:
(4.1) if A and B are both memory_­order​::​seq_­cst operations, then A precedes B in S; and
(4.2 .. 4.4) - basically the same thing for sc fences wrt. operations.
Sequenced before implies strongly happens before, so the opening paragraph guarantees that S is consistent with program order.
4.1 is about ops that are coherence-ordered before/after each other, e.g. a load that happens to see the value from a store. That ties inter-thread visibility into the total order S, making it match program order. The combination of those two requirements forces a compiler to use full barriers (including StoreLoad) to recover sequential consistency from whatever weaker hardware model it's targeting.
(In the original, all of 4. is one paragraph. I split it to emphasize that there are two separate things here, one for strongly-happens-before and the list of ops/barriers for coherence-ordered-before.)
These guarantees, plus syncs-with / happens-before, are enough to recover sequential consistency for the whole program, if it's data-race free (that would be UB), and if you don't use any weaker memory orders.
These rules do still hold if the program involves weaker orders, but for example an SC fence between two relaxed operations isn't as strong as two SC loads. For example on PowerPC that wouldn't rule out IRIW reordering the way using only SC operations does; IIRC PowerPC needs barriers before SC loads, as well as after.
So having some SC operations isn't necessarily enough to recover sequential consistency everywhere; that's rather the point of using weaker operations, but it can be a bit surprising that other ops can reorder wrt. SC ops. SC ops aren't SC fences. See also this Q&A for an example with the same "store buffer" litmus test: weakening one store from seq_cst to release allows reordering.
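For completeness, here is a small runnable harness for the litmus test (my own sketch, not from the question). With all four accesses seq_cst, the single total order S forbids the r1 == 0 && r2 == 0 outcome, so the assert should never fire; weakening the stores to release (or the loads to acquire) would allow it on weakly ordered hardware:
#include <atomic>
#include <cassert>
#include <thread>

int main() {
    for (int i = 0; i < 100000; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;

        std::thread t1([&] {
            x.store(1, std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_seq_cst);
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_seq_cst);
        });
        t1.join();
        t2.join();

        // forbidden by the single total order S: at least one load
        // must observe the other thread's store
        assert(!(r1 == 0 && r2 == 0));
    }
}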

If a RMW operation changes nothing, can it be optimized away, for all memory orders?

In the C/C++ memory model, can a compiler just combine and then remove redundant/NOP atomic modification operations, such as:
x++,
x--;
or even simply
x+=0; // return value is ignored
For an atomic scalar x?
Does that hold for sequential consistency or just weaker memory orders?
(Note: For weaker memory orders that still do something; for relaxed, there is no real question here. EDIT AGAIN: No actually there is a serious question in that special case. See my own answer. Not even relaxed is cleared for removal.)
EDIT:
The question is not about code generation for a particular access: if I wanted to see two lock add generated on Intel for the first example, I would have made x volatile.
The question is whether these C/C++ instructions have any impact whatsoever: can the compiler just filter out and remove these null operations (that are not relaxed-order operations), as a sort of source-to-source transformation? (Or abstract-tree-to-abstract-tree transformation, perhaps in the compiler "front end".)
EDIT 2:
Summary of the hypotheses:
not all operations are relaxed
nothing is volatile
atomic objects are really potentially accessible by multiple functions and threads (no automatic atomic whose address isn't shared)
Optional hypothesis:
If you want, you may assume that the address of the atomic is not taken, that all accesses are by name, and that all accesses have a property:
That no access of that variable, anywhere, has a relaxed load/store element: all load operations should have acquire and all stores should have release (so all RMW should be at least acq_rel).
Or, that for those accesses that are relaxed, the access code doesn't read the value for a purpose other than changing it: a relaxed RMW does not keep the value for later use (and does not test the value to decide what to do next). In other words, there is no data or control dependency on the value of the atomic object unless the load has an acquire.
Or that all accesses of the atomic are sequentially consistent.
That is I'm especially curious about these (I believe quite common) use cases.
Note: an access is not considered "completely relaxed" even if it's done with a relaxed memory order, when the code makes sure observers have the same memory visibility, so this is considered valid for (1) and (2):
atomic_thread_fence(std::memory_order_release);
x.store(1,std::memory_order_relaxed);
as the memory visibility is at least as good as with just x.store(1,std::memory_order_release);
This is considered valid for (1) and (2):
int v = x.load(std::memory_order_relaxed);
atomic_thread_fence(std::memory_order_acquire);
for the same reason.
This is stupidly, trivially valid for (2) (i is just an int)
i=x.load(std::memory_order_relaxed),i=0; // useless
as no information from a relaxed operation was kept.
This is valid for (2):
(void)x.fetch_add(1, std::memory_order_relaxed);
This is not valid for (2):
if (x.load(std::memory_order_relaxed))
f();
else
g();
as a consequential decision was based on a relaxed load, neither is
i += x.fetch_add(1, std::memory_order_release);
Note: (2) covers one of the most common uses of an atomic, the thread safe reference counter. (CORRECTION: It isn't clear that all thread safe counters technically fit the description as acquire can be done only on 0 post decrement, and then a decision was taken based on counter>0 without an acquire; a decision to not do something but still...)
No, definitely not entirely. It's at least a memory barrier within the thread for stronger memory orders.
For mo_relaxed atomics, yes I think it could in theory be optimized away completely, as if it wasn't there in the source. It's equivalent for a thread to simply not be part of a release-sequence it might have been part of.
If you used the result of the fetch_add(0, mo_relaxed), then I think collapsing them together and just doing a load instead of an RMW of 0 might not be exactly equivalent. Barriers in this thread surrounding the relaxed RMW still have an effect on all operations, including ordering the relaxed operation wrt. non-atomic operations. With a load+store tied together as an atomic RMW, things that order stores could order an atomic RMW when they wouldn't have ordered a pure load.
But I don't think any C++ ordering is like that: mo_release stores order earlier loads and stores, and atomic_thread_fence(mo_release) is like an asm StoreStore + LoadStore barrier. (Preshing on fences). So yes, given that any C++-imposed ordering would also apply to a relaxed load equally to a relaxed RMW, I think int tmp = shared.fetch_add(0, mo_relaxed) could be optimized to just a load.
(In practice compilers don't optimize atomics at all, basically treating them all like volatile atomic, even for mo_relaxed. Why don't compilers merge redundant std::atomic writes? and http://wg21.link/n4455 + http://wg21.link/p0062. It's too hard / no mechanism exists to let compilers know when not to.)
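As a hedged illustration of the transformation under discussion (not something current compilers actually perform, per the links above), the claim is that the two functions below could be treated as equivalent under the as-if rule:
#include <atomic>

std::atomic<int> shared{0};

int read_via_rmw() {
    // an RMW that adds 0 with relaxed ordering...
    return shared.fetch_add(0, std::memory_order_relaxed);
}

int read_via_plain_load() {
    // ...which, per the reasoning above, a compiler could in principle
    // collapse into a plain relaxed load under the as-if rule
    return shared.load(std::memory_order_relaxed);
}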
But yes, the ISO C++ standard on paper makes no guarantee that other threads can actually observe any given intermediate state.
Thought experiment: Consider a C++ implementation on a single-core cooperative multi-tasking system. It implements std::thread by inserting yield calls where needed to avoid deadlocks, but not between every instruction. Nothing in the standard requires a yield between num++ and num-- to let other threads observe that state.
The as-if rule basically allows a compiler to pick a legal/possible ordering and decide at compile-time that it's what happens every time.
In practice this can create fairness problems: an unlock/re-lock never actually gives other threads the chance to take the lock if --/++ are combined together into just a memory barrier with no modification of the atomic object! This among other things is why compilers don't optimize.
Any stronger ordering for one or both of the operations can begin or be part of a release-sequence that synchronizes-with a reader. A reader that does an acquire load of a release store/RMW Synchronizes-With this thread, and must see all previous effects of this thread as having already happened.
IDK how the reader would know that it was seeing this thread's release-store instead of some previous value, so a real example is probably hard to cook up. At least we could create one without possible UB, e.g. by reading the value of another relaxed atomic variable so we avoid data-race UB if we didn't see this value.
Consider the sequence:
// broken code where optimization could fix it
memcpy(buf, stuff, sizeof(buf));
done.store(1, mo_relaxed); // relaxed: can reorder with memcpy
done.fetch_add(-1, mo_relaxed);
done.fetch_add(+1, mo_release); // release-store publishes the result
This could optimize to just done.store(1, mo_release); which correctly publishes a 1 to the other thread without the risk of the 1 being visible too soon, before the updated buf values.
But it could also optimize just the cancelling pair of RMWs into a fence after the relaxed store, which would still be broken. (And not the optimization's fault.)
// still broken
memcpy(buf, stuff, sizeof(buf));
done.store(1, mo_relaxed); // relaxed: can reorder with memcpy
atomic_thread_fence(mo_release);
I haven't thought of an example where safe code becomes broken by a plausible optimization of this sort. Of course just removing the pair entirely even when they're seq_cst wouldn't always be safe.
A seq_cst increment and decrement does still create a sort of memory barrier. If they weren't optimized away, it would be impossible for earlier stores to reorder with later loads. To preserve this, compiling for x86 would probably still need to emit mfence.
Of course the obvious thing would be a lock add [x], 0 which does actually do a dummy RMW of the shared object that we did x++/x-- on. But I think the memory barrier alone, not coupled to an access to that actual object or cache line is sufficient.
And of course it has to act as a compile-time memory barrier, blocking compile-time reordering of non-atomic and atomic accesses across it.
For acq_rel or weaker fetch_add(0) or cancelling sequence, the run-time memory barrier might happen for free on x86, only needing to restrict compile-time ordering.
See also a section of my answer on Can num++ be atomic for 'int num'?, and in comments on Richard Hodges' answer there. (But note that some of that discussion is confused by arguments about when there are modifications to other objects between the ++ and --. Of course all ordering of this thread's operations implied by the atomics must be preserved.)
As I said, this is all hypothetical and real compilers aren't going to optimize atomics until the dust settles on N4455 / P0062.
The C++ memory model provides four coherence requirements for all atomic accesses to the same atomic object. These requirements apply regardless of the memory order. As stated in a non-normative note:
The four preceding coherence requirements effectively disallow compiler reordering of atomic operations to a single object, even if both operations are relaxed loads.
Emphasis added.
Given that both operations are happening to the same atomic variable, and the first definitely happens before the second (due to being sequenced before it), there can be no reordering of these operations. Again, even if relaxed operations are used.
If this pair of operations were removed by a compiler, that would guarantee that no other threads would ever see the incremented value. So the question now becomes whether the standard would require some other thread to be able to see the incremented value.
It does not. Without some way for something to guarantee-ably "happen after" the increment and "happen before" the decrement, there is no guarantee that any operation on any other thread will certainly see the incremented value.
This leaves one question: does the second operation always undo the first? That is, does the decrement undo the increment? That depends on the scalar type in question. ++ and -- are only defined for the pointer and integer specializations of atomic. So we only need to consider those.
For pointers, the decrement undoes the increment. The reason being that the only way incrementing+decrementing a pointer would not result in the same pointer to the same object is if incrementing the pointer was itself UB. That is, if the pointer is invalid, NULL, or is the past-the-end pointer to an object/array. But compilers don't have to consider UB cases since... they're undefined behavior. In all of the cases where incrementing is valid, pointer decrementing must also be valid (or UB, perhaps due to someone freeing the memory or otherwise making the pointer invalid, but again, the compiler doesn't have to care).
For unsigned integers, the decrement always undoes the increment, since wraparound behavior is well-defined for unsigned integers.
That leaves signed integers. C++ usually makes signed integer over/underflow into UB. Fortunately, that's not the case for atomic math; the standard explicitly says:
For signed integer types, arithmetic is defined to use two's complement representation. There are no undefined results.
Wraparound for two's complement atomics is well-defined, which means an increment followed by a decrement always recovers the original value.
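For instance (my own example), even at the boundary the pair cancels out, because atomic arithmetic is defined to wrap:
#include <atomic>
#include <cassert>
#include <climits>

int main() {
    std::atomic<int> x{INT_MAX};
    x++;                     // wraps to INT_MIN: well-defined for atomics, not UB
    x--;                     // wraps back to INT_MAX
    assert(x.load() == INT_MAX);
}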
So there does not appear to be anything in the standard which would prevent compilers from removing such operations. Again, regardless of the memory ordering.
Now, if you use non-relaxed memory ordering, the implementation cannot completely remove all traces of the atomics. The actual memory barriers behind those orderings still have to be emitted. But the barriers can be emitted without emitting the actual atomic operations.
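A sketch of what that could look like in principle (my own example; as noted elsewhere in this thread, real compilers currently don't perform this transformation):
#include <atomic>

std::atomic<int> x{0};

void original() {
    x.fetch_add(1, std::memory_order_seq_cst);
    x.fetch_sub(1, std::memory_order_seq_cst);
}

void hypothetical_transformation() {
    // the modification of x is gone (no other thread was guaranteed to observe
    // the intermediate value), but the full barrier it implied remains
    std::atomic_thread_fence(std::memory_order_seq_cst);
}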
In the C/C++ memory model, can a compiler just combine and then remove
redundant/NOP atomic modification operations,
No, the removal part is not allowed, at least not in the specific way the question suggests it would be allowed. The intent here is to describe valid source-to-source transformations (abstract tree to abstract tree, or rather transformations of a higher-level description of the source code that encodes all the relevant semantic elements that might be needed for later phases of compilation).
The hypothesis is that code generation can be done on the transformed program, without ever checking with the original one. So only safe transformations that cannot break any code are allowed.
(Note: For weaker memory orders that still do something; for relaxed,
there is no real question here.)
No. Even that is wrong: even for relaxed operations, unconditional removal isn't a valid transformation (although in most practical cases it certainly is valid; but "mostly correct" is still wrong, and "true in >99% of practical cases" has nothing to do with the question):
Before the introduction of standard threads, a stuck program was one in an infinite loop performing no externally visible side effects: no input, no output, no volatile operation, and in practice no system call. A program that will never perform something visible is stuck, and its behavior is not defined; that allows the compiler to assume pure algorithms terminate: loops containing only invisible computations must exit somehow (that includes exiting with an exception).
With threads, that definition is obviously not usable: a loop in one thread isn't the whole program. A stuck program is really one in which no thread can ever do something useful, and forbidding that would be sound.
But the very problematic standard definition of stuck doesn't describe a program execution but a single thread: a thread is stuck if it will perform no side effect that could potentially have an effect on observable side effects, that is:
nothing observable, obviously (no I/O)
no action that might interact with another thread
The standard definition of 2. is extremely broad and simplistic: all operations on an inter-thread communication device count (any atomic operation, any action on any mutex). Full text of the requirement (the last bullet is the relevant part):
[intro.progress]
The implementation may assume that any thread will eventually do one
of the following:
terminate,
make a call to a library I/O function,
perform an access through a volatile glvalue, or
perform a synchronization operation or an atomic operation.
[ Note: This is intended to allow compiler transformations such as removal of
empty loops, even when termination cannot be proven. — end note ]
That definition does not even specify:
an inter thread communication (from one thread to another)
a shared state (visible by multiple threads)
a modification of some state
an object that is not thread private
That means that all these silly operations count:
for fences:
performing an acquire fence (even when followed by no atomic operation) in a thread that has at least once done an atomic store can synchronize with another fence or atomic operation
for mutexes:
locking a locally recently created, patently useless function private mutex;
locking a mutex to just unlock it doing nothing with the mutex locked;
for atomics:
reading an atomic variable declared as const qualified (not a const reference to a non const atomic);
reading an atomic, ignoring the value, even with relaxed memory ordering;
setting a non-const-qualified atomic to its own immutable value (setting a variable to zero when nothing in the whole program sets it to a non-zero value), even with relaxed ordering;
doing operations on a local atomic variable not accessible by other threads;
for thread operations:
creating a thread (that might do nothing) and joining it seems to create a (NOP) synchronization operation.
It means that no early, local transformation of program code that removes even the most silly and useless inter-thread primitive, leaving no trace of the transformation for later compiler phases, is absolutely, unconditionally valid according to the standard, as it might remove the last potentially useful (but actually useless) operation in a loop (a loop doesn't have to be spelled for or while; it's any looping construct, e.g. a backward goto).
This however doesn't apply if other operations on inter-thread primitives are left in the loop, or obviously if I/O is done.
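Here is a small example of the consequence (my own code): a loop whose only "activity" is a useless relaxed atomic load still counts as making progress under [intro.progress], so an early pass that silently deleted that load could change the program's meaning.
#include <atomic>

std::atomic<int> dummy{0};   // a "useless" atomic: nothing else ever touches it

void spin_forever() {
    for (;;) {
        // this relaxed load communicates with nobody, yet it is an atomic
        // operation, so [intro.progress] does not allow the implementation
        // to assume this loop eventually terminates
        (void)dummy.load(std::memory_order_relaxed);
    }
    // if an early source-to-source pass deleted the load as "useless" and left
    // no trace, a later phase would see an empty loop with no side effects and
    // could legally assume it exits (or remove it), changing the behavior
}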
This looks like a defect.
A meaningful requirement should be based:
not only on using thread primitives,
not be about any thread in isolation (as you can't see whether a thread is contributing to anything, but at least a requirement to meaningfully interact with another thread, rather than use a private atomic or mutex, would be better than the current requirement),
based on doing something useful (program observables) and inter-thread interactions that contribute to something being done.
I'm not proposing a replacement right now, as the rest of the thread specification isn't even clear to me.

is std::atomic::fetch_add a serializing operation on x86-64?

Considering the following code:
std::atomic<int> counter;
/* otherStuff 1 */
counter.fetch_add(1, std::memory_order_relaxed);
/* otherStuff 2 */
Is there an instruction in x86-64 (say less than 5 years old architectures) that would allow otherStuff 1 and 2 be re-ordered across the fetch_add or is it going to be always serializing ?
EDIT:
It looks like this is summarized by "is lock add a memory barrier on x86 ?" and it seems it is not, though I am not sure where to find a reference for that.
First let's look at what the compiler is allowed to do when using std::memory_order_relaxed.
If there are no dependencies between otherStuff 1/2 and the atomic operation, it can certainly reorder the statements. For example:
g = 3;
a.fetch_add(1, memory_order_relaxed);
g += 12;
clang++ generates the following assembly:
lock addl $0x1,0x2009f5(%rip) # 0x601040 <a>
movl $0xf,0x2009e7(%rip) # 0x60103c <g>
Here clang took the liberty to reorder g = 3 with the atomic fetch_add operation, which is a legitimate transformation.
When using std::memory_order_seq_cst, the compiler output becomes:
movl $0x3,0x2009f2(%rip) # 0x60103c <g>
lock addl $0x1,0x2009eb(%rip) # 0x601040 <a>
addl $0xc,0x2009e0(%rip) # 0x60103c <g>
Reordering of statements does not take place because the compiler is not allowed to do that.
Sequential consistent ordering on a read-modify-write (RMW) operation, is both a release and an acquire operation and as such, no (visible) reordering of statements is allowed on both compiler and CPU level.
Your question is whether, on X86-64, std::atomic::fetch_add, using relaxed ordering, is a serializing operation.
The answer is: yes, if you do not take into account compiler reordering.
On the X86 architecture, an RMW operation always flushes the store buffer and therefore is effectively a serializing and sequentially consistent operation.
You can say that, on an X86 CPU, each RMW operation:
is a release operation for memory operations that precede it and is an acquire operation for memory operations that follow it.
becomes visible in a single total order observed by all threads.
The target architecture
On the X86 architecture, an RMW operation always flushes the store
buffer and therefore is effectively a serializing and sequentially
consistent operation.
I wish people would stop saying that.
That statement doesn't even make sense, as there is no such thing as a "sequentially consistent operation": "sequential consistency" isn't a property of any single operation. A sequentially consistent execution is one whose end result could have been produced by some interleaving of the operations of the different threads.
What can be said about these RMW operations:
all operations before the RMW have to be globally visible before either the R or the W of the RMW is visible,
and no operation after the RMW is visible before the RMW is visible.
That is, the part before, the RMW, and the part after run sequentially. In other words, there is a full fence before and after the RMW.
Whether that results in a sequential execution for the complete program depends on the nature of all globally visible operations of the program.
Visibility vs. execution ordering
That's in terms of visibility. I have no idea whether these processors try to speculatively execute code after the RMW, subject to the correctness requirement that operations are rolled back if there is a conflict with a side effect of a parallel execution (these details tend to differ between vendors and generations within a given family, unless clearly specified).
The answer to your question could be different whether
you need to guarantee correctness of the set of side effects (as in a sequential consistency requirement),
or guarantee that benchmarks are reliable,
or guarantee that comparative timing is independent of the CPU version: that is, guarantee something about the results of comparing the timing of different executions (for a given CPU).
High level languages vs. CPU features
The question title is "is std::atomic::fetch_add a serializing operation on x86-64?" of the general form:
"does OP provide guarantees P on ARCH"
where
OP is a high level operation in a high level language
P is the desired property
ARCH is a specific CPU or compiler target
As a rule, the canonical answer is: the question doesn't make sense, OP being high level and target independent. There is a low level/high level mismatch here.
The compiler is bound by the language standard (or rather its most reasonable interpretation), by documented extension, by history... not by the standard for the target architecture, unless the feature is a low level, transparent feature of the high level language.
The canonical way to get low level semantic in C/C++ is to use volatile objects and volatile operations.
Here you must use a volatile std::atomic<int> to even be able to ask a meaningful question about architectural guarantees.
Present code generation
The meaningful variant of your question would use this code:
volatile std::atomic<int> counter;
/* otherStuff 1 */
counter.fetch_add(1, std::memory_order_relaxed);
That statement will generate an atomic RMW operation which in that case "is a serializing operation" on the CPU: all operations performed before it, in assembly code, are complete before the RMW starts; all operations following the RMW wait until the RMW completes before starting (in terms of visibility).
And then you would need to learn about the unpleasantness of volatile semantics: volatile applies only to the volatile operations themselves, so you would still not get general guarantees about other operations.
There is no guarantee that high level C++ operations before the volatile RMW operations are sequenced before in the assembly code. You would need a "compiler barrier" to do that. These barriers are not portable. (And not needed here as it's a silly approach anyway.)
But then if you want that guarantee, you can just use:
a release operation: to ensure that previous globally visible operations are complete
an acquire operation: to ensure that following globally visible operations do not start before
RMW operation on an object that is visible by multiple threads.
So why not make your RMW operation acq_rel? Then it wouldn't even need to be volatile.
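For instance, a sketch of the portable version (my own code): acq_rel on the RMW itself gives the ordering guarantees at the C++ level, on any target, without resorting to volatile.
#include <atomic>

std::atomic<int> counter{0};

void increment_with_ordering() {
    /* otherStuff 1 */
    // acq_rel: the release part keeps otherStuff 1 from being reordered after
    // the RMW, and the acquire part keeps otherStuff 2 from being reordered
    // before it, as observed by threads that synchronize on counter
    counter.fetch_add(1, std::memory_order_acq_rel);
    /* otherStuff 2 */
}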
Possible RMW variants in a processor family
Is there an instruction in x86-64 (say less than 5 years old
architectures) that would
Potential variants of the instruction set are another sub-question. Vendors can introduce new instructions, and ways to test for their availability at runtime; and compilers can even generate code to detect their availability.
Any RMW feature that would follow the existing tradition (1) of strong ordering of usual memory operation in that family would have to respect the traditions:
Total Store Order: all store operations are ordered, implicitly fenced; in other words, there is a store buffer strictly for non speculative store operations in each core, that is not reordered and not shared between cores;
each store is a release operation (for previous normal memory operations);
loads that are speculatively started are completed in order, and at completion are validated: any early load for a location that was then clobbered in the cache is cancelled and the computation is restarted with the recent value;
loads are acquire operations.
Then any new (but traditional) RMW operation must be both an acquire operation and a release operation.
(Examples for potential imaginary RMW operation to be added in the future are xmult and xdiv.)
But that's futurology and adding less ordered instruction in the future wouldn't violate any security invariants, except potentially against timing based, side channel Spectre-like attacks, that we don't know how to model and reason about in general anyway.
The problem with these questions, even about the present, is that a proof of absence would be required, and for that we would need to know about each variant for a CPU family. That is not always doable, and also, unnecessary if you use the proper ordering in the high level code, and useless if you don't.
(1) Traditions for guarantees of memory operations are guidelines in the CPU design, not guarantees about any future operation: by definition, operations that don't yet exist have no guarantee about their semantics, besides the guarantees of memory integrity, that is, the guarantee that no future operation will break the privileges and security guarantees previously established (no unprivileged instruction created in the future can access an unmapped memory address...).
When using std::memory_order_relaxed the only guarantee is that the operation is atomic. Anything around the operation can be reordered at will by either the compiler or the CPU.
From https://en.cppreference.com/w/cpp/atomic/memory_order:
Relaxed operation: there are no synchronization or ordering
constraints imposed on other reads or writes, only this operation's
atomicity is guaranteed (see Relaxed ordering below)

What do each memory_order mean?

I read a chapter and I didn't like it much. I'm still unclear what the differences is between each memory order. This is my current speculation which I understood after reading the much more simple http://en.cppreference.com/w/cpp/atomic/memory_order
The below is wrong so don't try to learn from it
memory_order_relaxed: Does not sync but is not ignored when order is done from another mode in a different atomic var
memory_order_consume: Syncs reading this atomic variable, however it doesn't sync relaxed vars written before this. However, if the thread uses var X when modifying Y (and releases it), will other threads consuming Y see X released as well? I don't know if this means this thread pushes out changes of x (and obviously y)
memory_order_acquire: Syncs reading this atomic variable AND makes sure relaxed vars written before this are synced as well. (does this mean all atomic variables on all threads are synced?)
memory_order_release: Pushes the atomic store to other threads (but only if they read the var with consume/acquire)
memory_order_acq_rel: For read/write ops. Does an acquire so you don't modify an old value and releases the changes.
memory_order_seq_cst: The same thing as acquire release except it forces the updates to be seen in other threads (if a store with relaxed on another thread. I store b with seq_cst. A 3rd thread reading a with relax will see changes along with b and any other atomic variable?).
I think I understood, but correct me if I am wrong. I couldn't find anything that explains it in easy-to-read English.
The GCC Wiki gives a very thorough and easy to understand explanation with code examples.
(excerpt edited, and emphasis added)
IMPORTANT:
Upon re-reading the below quote copied from the GCC Wiki in the process of adding my own wording to the answer, I noticed that the quote is actually wrong. They got acquire and consume exactly the wrong way around. A release-consume operation only provides an ordering guarantee on dependent data whereas a release-acquire operation provides that guarantee regardless of data being dependent on the atomic value or not.
The first model is "sequentially consistent". This is the default mode used when none is specified, and it is the most restrictive. It can also be explicitly specified via memory_order_seq_cst. It provides the same restrictions and limitation to moving loads around that sequential programmers are inherently familiar with, except it applies across threads.
[...]
From a practical point of view, this amounts to all atomic operations acting as optimization barriers. It's OK to re-order things between atomic operations, but not across the operation. Thread local stuff is also unaffected since there is no visibility to other threads. [...] This mode also provides consistency across all threads.
The opposite approach is memory_order_relaxed. This model allows for much less synchronization by removing the happens-before restrictions. These types of atomic operations can also have various optimizations performed on them, such as dead store removal and commoning. [...] Without any happens-before edges, no thread can count on a specific ordering from another thread.
The relaxed mode is most commonly used when the programmer simply wants a variable to be atomic in nature rather than using it to synchronize threads for other shared memory data.
The third mode (memory_order_acquire / memory_order_release) is a hybrid between the other two. The acquire/release mode is similar to the sequentially consistent mode, except it only applies a happens-before relationship to dependent variables. This allows for a relaxing of the synchronization required between independent reads of independent writes.
memory_order_consume is a further subtle refinement in the release/acquire memory model that relaxes the requirements slightly by removing the happens before ordering on non-dependent shared variables as well.
[...]
The real difference boils down to how much state the hardware has to flush in order to synchronize. Since a consume operation may therefore execute faster, someone who knows what they are doing can use it for performance critical applications.
Here follows my own attempt at a more mundane explanation:
A different approach to look at it is to look at the problem from the point of view of reordering reads and writes, both atomic and ordinary:
All atomic operations are guaranteed to be atomic within themselves (the combination of two atomic operations is not atomic as a whole!) and to be visible in the total order in which they appear on the timeline of the execution stream. That means no atomic operation can, under any circumstances, be reordered, but other memory operations might very well be. Compilers (and CPUs) routinely do such reordering as an optimization.
It also means the compiler must use whatever instructions are necessary to guarantee that an atomic operation executing at any time will see the results of each and every other atomic operation, possibly on another processor core (but not necessarily other operations), that were executed before.
Now, a relaxed is just that, the bare minimum. It does nothing in addition and provides no other guarantees. It is the cheapest possible operation. For non-read-modify-write operations on strongly ordered processor architectures (e.g. x86/amd64) this boils down to a plain normal, ordinary move.
The sequentially consistent operation is the exact opposite, it enforces strict ordering not only for atomic operations, but also for other memory operations that happen before or after. Neither one can cross the barrier imposed by the atomic operation. Practically, this means lost optimization opportunities, and possibly fence instructions may have to be inserted. This is the most expensive model.
A release operation prevents ordinary loads and stores from being reordered after the atomic operation, whereas an acquire operation prevents ordinary loads and stores from being reordered before the atomic operation. Everything else can still be moved around.
The combination of preventing stores being moved after, and loads being moved before the respective atomic operation makes sure that whatever the acquiring thread gets to see is consistent, with only a small amount of optimization opportunity lost.
One may think of that as something like a non-existent lock that is being released (by the writer) and acquired (by the reader). Except... there is no lock.
In practice, release/acquire usually means the compiler need not use any particularly expensive special instructions, but it cannot freely reorder loads and stores to its liking, which may miss out on some (small) optimization opportunities.
Finally, consume is the same operation as acquire, only with the exception that the ordering guarantees only apply to dependent data. Dependent data would e.g. be data that is pointed-to by an atomically modified pointer.
Arguably, that may provide for a couple of optimization opportunities that are not present with acquire operations (since less data is subject to restrictions), however this happens at the expense of more complex and more error-prone code, and the non-trivial task of getting dependency chains correct.
It is currently discouraged to use consume ordering while the specification is being revised.
This is a quite complex subject. Try to read http://en.cppreference.com/w/cpp/atomic/memory_order several times, try to read other resources, etc.
Here's a simplified description:
The compiler and CPU can reorder memory accesses. That is, they can happen in a different order than what's specified in the code. That's fine most of the time; the problem arises when different threads try to communicate and may see an order of memory accesses that breaks the invariants of the code.
Usually you can use locks for synchronization. The problem is that they're slow. Atomic operations are much faster, because the synchronization happens at CPU level (i.e. CPU ensures that no other thread, even on another CPU, modifies some variable, etc.).
So, the one single problem we're facing is reordering of memory accesses. The memory_order enum specifies what types of reorderings compiler must forbid.
relaxed - no constraints.
consume - no loads that are dependent on the newly loaded value can be reordered wrt. the atomic load. I.e. if they are after the atomic load in the source code, they will happen after the atomic load too.
acquire - no loads can be reordered wrt. the atomic load. I.e. if they are after the atomic load in the source code, they will happen after the atomic load too.
release - no stores can be reordered wrt. the atomic store. I.e. if they are before the atomic store in the source code, they will happen before the atomic store too.
acq_rel - acquire and release combined.
seq_cst - it is more difficult to understand why this ordering is required. Basically, all other orderings only ensure that specific disallowed reorderings don't happen only for the threads that consume/release the same atomic variable. Memory accesses can still propagate to other threads in any order. This ordering ensures that this doesn't happen (thus sequential consistency). For a case where this is needed see the example at the end of the linked page.
I want to provide a more precise explanation, closer to the standard.
Things to ignore:
memory_order_consume - apparently no major compiler implements it, and they silently replace it with a stronger memory_order_acquire. Even the standard itself says to avoid it.
A big part of the cppreference article on memory orders deals with 'consume', so dropping it simplifies things a lot.
It also lets you ignore related features like [[carries_dependency]] and std::kill_dependency.
Data races: Writing to a non-atomic variable from one thread, and simultaneously reading/writing to it from a different thread is called a data race, and causes undefined behavior.
memory_order_relaxed is the weakest and supposedly the fastest memory order.
Any reads/writes to atomics can't cause data races (and subsequent UB). relaxed provides just this minimal guarantee, for a single variable. It doesn't provide any guarantees for other variables (atomic or not).
All threads agree on the order of operations on every particular atomic variable. But this holds only for individual variables. If other variables (atomic or not) are involved, threads might disagree on how exactly the operations on different variables are interleaved.
It's as if relaxed operations propagate between threads with slight unpredictable delays.
This means that you can't use relaxed atomic operations to judge when it's safe to access other non-atomic memory (can't synchronize access to it).
By "threads agree on the order" I mean that:
Each thread will access each separate variable in the exact order you tell it to. E.g. a.store(1, relaxed); a.store(2, relaxed); will write 1, then 2, never in the opposite order. But accesses to different variables in the same thread can still be reordered relative to each other.
If thread A writes to a variable several times, then thread B reads it several times, it will get the values in the same order (but of course it can read some values several times, or skip some, if you don't synchronize the threads in other ways).
No other guarantees are given.
Example uses: Anything that doesn't try to use an atomic variable to synchronize access to non-atomic data: various counters (that exist for informational purposes only), or 'stop flags' to signal other threads to stop. Another example: operations on shared_ptrs that increment the reference count internally use relaxed.
Fences: atomic_thread_fence(relaxed); does nothing.
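A small sketch of the "stop flag" use mentioned above (the names are mine): the flag only signals that the worker should stop; it isn't used to publish any other data, so relaxed is sufficient.
#include <atomic>
#include <thread>

std::atomic<bool> stop{false};     // example name

void worker() {
    while (!stop.load(std::memory_order_relaxed)) {
        // ... do work that doesn't depend on memory published by the stopper ...
    }
}

int main() {
    std::thread t(worker);
    stop.store(true, std::memory_order_relaxed);   // just a signal, publishes no other data
    t.join();
}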
memory_order_release, memory_order_acquire do everything relaxed does, and more (so it's supposedly slower or equivalent).
Only stores (writes) can use release. Only loads (reads) can use acquire. Read-modify-write operations such as fetch_add can be both (memory_order_acq_rel), but they don't have to.
Those let you synchronize threads:
Let's say thread 1 reads/writes to some memory M (any non-atomic or atomic variables, doesn't matter).
Then thread 1 performs a release store to a variable A. Then it stops
touching that memory.
If thread 2 then performs an acquire load of the same variable A, this load is said to synchronize with the corresponding store in thread 1.
Now thread 2 can safely read/write to that memory M.
You only synchronize with the latest writer, not preceding writers.
You can chain synchronizations across multiple threads.
There's a special rule that synchronization propagates across any number of read-modify-write operations regardless of their memory order. E.g. if thread 1 does a.store(1, release);, then thread 2 does a.fetch_add(2, relaxed);, then thread 3 does a.load(acquire), then thread 1 successfully synchronizes with thread 3, even though there's a relaxed operation in the middle.
In the above rule, a release operation X, and any subsequent read-modify-write operations on the same variable (stopping at the next non-read-modify-write operation), are called a release sequence headed by X. (So if an acquire reads from any operation in a release sequence, it synchronizes with the head of the sequence.)
If read-modify-write operations are involved, nothing stops you from synchronizing with more than one operation. In the example above, if fetch_add were using acquire or acq_rel, it too would synchronize with thread 1, and conversely, if it used release or acq_rel, thread 3 would synchronize with thread 2 in addition to thread 1.
Example use: shared_ptr decrements its reference counter using something like fetch_sub(1, acq_rel).
Here's why: imagine that thread 1 reads/writes to *ptr, then destroys its copy of ptr, decrementing the ref count. Then thread 2 destroys the last remaining pointer, also decrementing the ref count, and then runs the destructor.
Since the destructor in thread 2 is going to access the memory previously accessed by thread 1, the acq_rel synchronization in fetch_sub is necessary. Otherwise you'd have a data race and UB.
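A minimal sketch of that pattern (a hand-rolled stand-in, not shared_ptr's actual implementation; the names are mine):
#include <atomic>

struct ControlBlock {
    std::atomic<int> refs{1};
    int payload = 0;               // stands in for the managed object
};

void release_ref(ControlBlock *cb) {
    // acq_rel: the release part publishes this thread's last writes to the payload;
    // the acquire part (effective on the thread that drops the count to zero)
    // makes every other owner's prior writes visible before the delete below
    if (cb->refs.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        delete cb;                 // safe: we synchronized with every other owner
    }
}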
Fences: Using atomic_thread_fence, you can essentially turn relaxed atomic operations into release/acquire operations. A single fence can apply to more than one operation, and/or can be performed conditionally.
If you do a relaxed read (or with any other order) from one or more variables, then do atomic_thread_fence(acquire) in the same thread, then all those reads count as acquire operations.
Conversely, if you do atomic_thread_fence(release), followed by any number of (possibly relaxed) writes, those writes count as release operations.
An acq_rel fence combines the effect of acquire and release fences.
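A sketch of the fence idiom described above (the names are mine): a release fence before a relaxed store pairs with a relaxed load followed by an acquire fence, giving the same synchronizes-with effect as a release store / acquire load pair.
#include <atomic>
#include <cassert>

int data = 0;
std::atomic<bool> ready{false};

void producer() {
    data = 42;
    std::atomic_thread_fence(std::memory_order_release);   // "upgrades" the relaxed store below
    ready.store(true, std::memory_order_relaxed);
}

void consumer() {
    while (!ready.load(std::memory_order_relaxed)) {}
    std::atomic_thread_fence(std::memory_order_acquire);   // "upgrades" the relaxed loads above
    assert(data == 42);   // no data race: the fences synchronize with each other
}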
Similarity with other standard library features:
Several standard library features also cause a similar synchronizes with relationship. E.g. locking a mutex synchronizes with the latest unlock, as if locking was an acquire operation, and unlocking was a release operation.
memory_order_seq_cst does everything acquire/release do, and more. This is supposedly the slowest order, but also the safest.
seq_cst reads count as acquire operations. seq_cst writes count as release operations. seq_cst read-modify-write operations count as both.
seq_cst operations can synchronize with each other, and with acquire/release operations. Beware of special effects of mixing them (see below).
seq_cst is the default order, e.g. given atomic_int x;, x = 1; does x.store(1, seq_cst);.
seq_cst has an extra property compared to acquire/release: all threads agree on the order in which all seq_cst operations happen. This is unlike weaker orders, where threads agree only on the order of operations on each individual atomic variable, but not on how the operations are interleaved - see relaxed order above.
The presence of this global operation order seems to only affect which values you can get from seq_cst loads, it doesn't in any way affect non-atomic variables and atomic operations with weaker orders (unless seq_cst fences are involved, see below), and by itself doesn't prevent any extra data race UB compared to acq/rel operations.
Among other things, this order respects the synchronizes-with relationship described for acquire/release above, unless (and this is weird) that synchronization comes from mixing a seq-cst operation with an acquire/release operation (a release syncing with a seq-cst load, or a seq-cst store syncing with an acquire). Such a mix essentially demotes the affected seq-cst operation to an acquire/release (it may retain some of the seq-cst properties, but you'd better not count on it).
Example use:
atomic_bool x = true;
atomic_bool y = true;
// Thread 1:
x.store(false, seq_cst);
if (y.load(seq_cst)) {...}
// Thread 2:
y.store(false, seq_cst);
if (x.load(seq_cst)) {...}
Let's say you want to make sure that both threads can't enter the if body at the same time. seq_cst allows you to do it. Acquire/release or weaker orders wouldn't be enough here.
Fences: atomic_thread_fence(seq_cst); does everything an acq_rel fence does, and more.
Like you would expect, they bring some seq-cst properties to atomic operations done with weaker orders.
All threads agree on the order of seq_cst fences, relative to one another and to any seq_cst operations (i.e. seq_cst fences participate in the global order of seq_cst operations, which was described above).
They essentially prevent atomic operations from being reordered across themselves.
E.g. we can transform the above example to:
atomic_bool x = true;
atomic_bool y = true;
// Thread 1:
x.store(false, relaxed);
atomic_thread_fence(seq_cst);
if (y.load(relaxed)) {...}
// Thread 2:
y.store(false, relaxed);
atomic_thread_fence(seq_cst);
if (x.load(relaxed)) {...}
Both threads can't enter the if at the same time, because that would require reordering a load across the fence to before the store.
But formally, the standard doesn't describe them in terms of reordering. Instead, it just explains how the seq_cst fences are placed in the global order of seq_cst operations. Let's say:
Thread 1 performs operation A on atomic variable X using seq_cst order, OR a weaker order preceded by a seq_cst fence.
Then:
Thread 2 performs operation B on the same atomic variable X using seq_cst order, OR a weaker order followed by a seq_cst fence.
(Here A and B are any operations, except they can't both be reads, since then it's impossible to determine which one was first.)
Then the first seq_cst operation/fence is ordered before the second seq_cst operation/fence.
Then, if you imagine a scenario (e.g. in the example above, both threads entering the if) that imposes contradictory requirements on the order, that scenario is impossible.
E.g. in the example above, if the first thread enters the if, then the first fence must be ordered before the second one, and vice versa. This means that both threads entering the if would lead to a contradiction, and hence is not allowed.
Interoperation between different orders
Summarizing the above:
                 relaxed write        release write         seq-cst write
relaxed load     -                    -                     -
acquire load     -                    synchronizes with     synchronizes with*
seq-cst load     -                    synchronizes with*    synchronizes with
* = The participating seq-cst operation gets a messed up seq-cst order, effectively being demoted to an acquire/release operation. This is explained above.
Does using a stronger memory order make data transfer between threads faster?
No, it seems not.
Sequental consistency for data-race-free programs
The standard explains that if your program only uses seq_cst accesses (and mutexes), and has no data races (which cause UB), then you don't need to think about all the fancy operation reorderings. The program will behave as if only one thread executed at a time, with the threads being unpredictably interleaved.

Memory model ordering and visibility?

I tried looking for details on this, I even read the standard on mutexes and atomics... but still I couldn't understand the C++11 memory model visibility guarantees.
From what I understand the very important feature of mutexes BESIDES mutual exclusion is ensuring visibility. I.e. it is not enough that only one thread at a time is increasing the counter; it is important that the thread increases the counter that was stored by the thread that last used the mutex (I really don't know why people don't mention this more when discussing mutexes, maybe I had bad teachers :)).
So from what I can tell atomics don't enforce immediate visibility:
(from the person that maintains boost::thread and has implemented c++11 thread and mutex library):
A fence with memory_order_seq_cst does not enforce immediate
visibility to other threads (and neither does an MFENCE instruction).
The C++0x memory ordering constraints are just that --- ordering
constraints. memory_order_seq_cst operations form a total order, but
there are no restrictions on what that order is, except that it must
be agreed on by all threads, and it must not violate other ordering
constraints. In particular, threads may continue to see "stale" values
for some time, provided they see values in an order consistent with
the constraints.
And I'm OK with that. But the problem is that I have trouble understanding what C++11 constructs regarding atomic are "global" and which only ensure consistency on atomic variables.
In particular, I don't understand which (if any) of the following memory orderings guarantee that there will be a memory fence before and after loads and stores:
http://www.stdthread.co.uk/doc/headers/atomic/memory_order.html
From what I can tell std::memory_order_seq_cst inserts a memory barrier while the others only enforce ordering of the operations on a certain memory location.
So can somebody clear this up? I presume a lot of people are going to be making horrible bugs using std::atomic, especially if they don't use the default (std::memory_order_seq_cst memory ordering).
2. if I'm right does that mean that second line is redundand in this code:
atomicVar.store(42);
std::atomic_thread_fence(std::memory_order_seq_cst);
3. Do std::atomic_thread_fences have the same requirements as mutexes, in the sense that to ensure sequential consistency on non-atomic vars one must do
std::atomic_thread_fence(std::memory_order_seq_cst);
before loads and
std::atomic_thread_fence(std::memory_order_seq_cst);
after stores?
4. Is
{
regularSum+=atomicVar.load();
regularVar1++;
regularVar2++;
}
//...
{
regularVar1++;
regularVar2++;
atomicVar.store(74656);
}
equivalent to
std::mutex mtx;
{
std::unique_lock<std::mutex> ul(mtx);
regularSum+=nowRegularVar;
regularVar1++;
regularVar2++;
}
//..
{
std::unique_lock<std::mutex> ul(mtx);
regularVar1++;
regularVar2++;
nowRegularVar=(74656);
}
I think not, but I would like to be sure.
EDIT:
5.
Can the assert fire?
Only two threads exist.
std::atomic<int*> p{nullptr};
int *nonatomic_p;
first thread writes
{
    nonatomic_p=(int*) malloc(16*1024*sizeof(int));
    for(int i=0;i<16*1024;++i)
        nonatomic_p[i]=42;
    p=nonatomic_p;
}
second thread reads
{
    while (p==nullptr)
    {
    }
    assert(p.load()[1234]==42);//1234 - a random idx in the array
}
If you like to deal with fences, then:
- a.load(memory_order_acquire) is equivalent to a.load(memory_order_relaxed) followed by atomic_thread_fence(memory_order_acquire).
- a.store(x,memory_order_release) is equivalent to a call to atomic_thread_fence(memory_order_release) before a call to a.store(x,memory_order_relaxed).
- memory_order_consume is a special case of memory_order_acquire, for dependent data only.
- memory_order_seq_cst is special, and forms a total order across all memory_order_seq_cst operations. Mixed with the others, it is the same as an acquire for a load, and a release for a store.
- memory_order_acq_rel is for read-modify-write operations, and is equivalent to an acquire on the read part and a release on the write part of the RMW.
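A sketch of the two spellings described above, with illustrative names. Strictly speaking the fence forms are at least as strong as the per-operation forms, so code that is correct with the per-operation forms stays correct with the fence forms.
#include <atomic>

std::atomic<int> a{0};
int payload = 0;

// Publication written with a store/release operation...
void publish_with_operation()
{
    payload = 42;
    a.store(1, std::memory_order_release);
}

// ...and the same publication written as a release fence followed by a relaxed store.
void publish_with_fence()
{
    payload = 42;
    std::atomic_thread_fence(std::memory_order_release);
    a.store(1, std::memory_order_relaxed);
}

// Consumption with a load/acquire operation...
int consume_with_operation()
{
    if (a.load(std::memory_order_acquire) == 1)
        return payload;   // guaranteed to see 42
    return -1;
}

// ...and with a relaxed load followed by an acquire fence.
int consume_with_fence()
{
    if (a.load(std::memory_order_relaxed) == 1)
    {
        std::atomic_thread_fence(std::memory_order_acquire);
        return payload;   // guaranteed to see 42
    }
    return -1;
}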
The use of ordering constraints on atomic operations may or may not result in actual fence instructions, depending on the hardware architecture. In some cases the compiler will generate better code if you put the ordering constraint on the atomic operation rather than using a separate fence.
On x86, loads are always acquire, and stores are always release. memory_order_seq_cst requires stronger ordering with either an MFENCE instruction or a LOCK prefixed instruction (there is an implementation choice here as to whether to make the store have the stronger ordering or the load). Consequently, standalone acquire and release fences are no-ops, but atomic_thread_fence(memory_order_seq_cst) is not (again requiring an MFENCE or LOCKed instruction).
An important effect of the ordering constraints is that they order other operations.
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> ready(false);
int i=0;

void thread_1()
{
    i=42;
    ready.store(true,std::memory_order_release);
}

void thread_2()
{
    while(!ready.load(std::memory_order_acquire)) std::this_thread::yield();
    assert(i==42);
}
thread_2 spins until it reads true from ready. Since the store to ready in thread_1 is a release and the load is an acquire, the store synchronizes-with the load. Therefore the store to i happens-before the load from i in the assert, and the assert will not fire.
2) The second line in
atomicVar.store(42);
std::atomic_thread_fence(std::memory_order_seq_cst);
is indeed potentially redundant, because the store to atomicVar uses memory_order_seq_cst by default. However, if there are other non-memory_order_seq_cst atomic operations on this thread then the fence may have consequences. For example, it would act as a release fence for a subsequent a.store(x,memory_order_relaxed).
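For instance, a sketch of that situation (variable names are illustrative): because of the fence, the relaxed store to a publishes the earlier write to data, so an acquire load that sees a == 1 is guaranteed to see data == 1.
#include <atomic>
#include <cassert>

std::atomic<int> atomicVar{0};
std::atomic<int> a{0};
int data = 0;

void producer()
{
    data = 1;
    atomicVar.store(42);                                  // seq_cst store, as in the question
    std::atomic_thread_fence(std::memory_order_seq_cst);  // not redundant here:
    a.store(1, std::memory_order_relaxed);                // the fence makes this relaxed store
                                                          // act as a release for `data`
}

void consumer()
{
    if (a.load(std::memory_order_acquire) == 1)
        assert(data == 1);   // cannot fire: the fence synchronizes with this acquire load
}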
3) Fences and atomic operations do not work like mutexes. You can use them to build mutexes, but they do not work like them. You do not have to ever use atomic_thread_fence(memory_order_seq_cst). There is no requirement that any atomic operations are memory_order_seq_cst, and ordering on non-atomic variables can be achieved without, as in the example above.
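As an illustration of "you can use them to build mutexes", here is a minimal spinlock sketch built from std::atomic<bool> (an assumption for this example, not code from the question): the acquire on lock and the release on unlock give it the ordering properties expected of a mutex.
#include <atomic>

class spinlock
{
    std::atomic<bool> locked{false};
public:
    void lock()
    {
        // exchange returns the previous value; spin until we flip it from false to true
        while (locked.exchange(true, std::memory_order_acquire))
            ;   // busy-wait
    }
    void unlock()
    {
        locked.store(false, std::memory_order_release);
    }
};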
4) No, these are not equivalent. Your snippet without the mutex lock has unsynchronized accesses to the regular variables, so it is a data race and undefined behaviour.
5) No, your assert cannot fire. With the default memory ordering of memory_order_seq_cst, the store to and load from the atomic pointer p work like the store and load in my example above, and the stores to the array elements are guaranteed to happen-before the reads.
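For completeness, a self-contained sketch of the snippet from question 5 under that default seq_cst ordering (the thread driver and function names are filled in here and are not from the question):
#include <atomic>
#include <cassert>
#include <cstdlib>
#include <thread>

std::atomic<int*> p{nullptr};

void writer()
{
    int* nonatomic_p = (int*)std::malloc(16 * 1024 * sizeof(int));
    for (int i = 0; i < 16 * 1024; ++i)
        nonatomic_p[i] = 42;
    p.store(nonatomic_p);            // defaults to memory_order_seq_cst
}

void reader()
{
    while (p.load() == nullptr)      // defaults to memory_order_seq_cst
        ;                            // spin until the pointer is published
    assert(p.load()[1234] == 42);    // cannot fire: the array writes happen-before this read
}

int main()
{
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
    std::free(p.load());
}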
From what I can tell, std::memory_order_seq_cst inserts a memory barrier, while the others only enforce ordering of the operations on a certain memory location.
It really depends on what you're doing and on what platform you're working with. The strong memory ordering model of a platform like x86 creates a different set of requirements for memory fence instructions than the weaker ordering models of platforms like IA64, PowerPC, ARM, etc. What the default std::memory_order_seq_cst parameter ensures is that, whatever the platform, the proper memory fence instructions will be used.
On a platform like x86, all loads already have load-acquire semantics and all stores already have store-release semantics, so specifying std::memory_order_acquire on a load or std::memory_order_release on a store adds no extra instructions; the hardware memory model already provides that ordering. A full memory barrier is only needed for a seq_cst store or a read-modify-write operation (as noted in the previous answer, whether the extra cost lands on the store or the load is an implementation choice, via MFENCE or a LOCK-prefixed instruction). Requiring a full barrier everywhere else would be an unnecessary performance impediment.
On other platforms with weaker memory consistency models though, that would not be the case, and therefore using std::memory_order_seq_cst would employ the proper memory fence operations without the user having to explicitly specify whether they would like a load-acquire, store-release, or full memory fence operation. These platforms have specific machine instructions for enforcing such memory consistency contracts, and the std::memory_order_seq_cst setting would work out the proper case. If the user would like to specifically call for one of these operations they can through the explicit std::memory_order enum types, but it would not be necessary ... the compiler would work out the correct settings.
I presume a lot of people are going to be making horrible bugs using std::atomic, especially if they don't use the default (std::memory_order_seq_cst) memory ordering
Yes, if they don't know what they're doing and don't understand which memory-barrier semantics are called for in certain operations, then there will be a lot of mistakes made when they attempt to explicitly state the type of memory barrier and pick the incorrect one, especially on platforms that will not help with their misunderstanding of memory ordering because they are weakly ordered.
Finally, keep in mind with your situation #4 concerning a mutex that there are two different things that need to happen here:
The compiler must not be allowed to reorder operations across the mutex and critical section (especially in the case of an optimizing compiler)
There must be the requisite memory fences (depending on the platform) so that all stores are completed before the critical section is entered and the mutex variable is read, and all stores are completed before the critical section is exited.
Since atomic stores and loads default to std::memory_order_seq_cst, using atomics would also provide the proper mechanisms to satisfy conditions #1 and #2. That being said, in your first example with atomics, the load would enforce acquire semantics for the block, while the store would enforce release semantics. It would not, however, enforce any particular ordering inside the "critical section" between these two operations. In your second example, you have two different sections with locks, each lock acquisition having acquire semantics. Since at some point you have to release the locks, which has release semantics, no, the two code blocks are not equivalent. In the first example, you've created one big "critical section" between the load and the store (assuming this is all happening on the same thread). In the second example you have two different critical sections.
P.S. I've found the following PDF particularly instructive, and you may too:
http://www.nwcpp.org/Downloads/2008/Memory_Fences.pdf