Compare and Swap: synchronizing via different data sizes - C++

Using the GCC builtin C atomic primitives, we can perform an atomic CAS operation using __atomic_compare_exchange.
Unlike C++11's std::atomic type, the GCC C atomic primitives operate on regular non-atomic integral types, including 128-bit integers on platforms where cmpxchg16b is supported. (A future version of the C++ standard may support similar functionality with the std::atomic_view class template.)
This makes me question:
What happens if an atomic CAS operation on a larger data size observes a change which happened by an atomic operation on the same memory location, but using a smaller data size?
For example, suppose we have:
struct uint128_type {
uint64_t x;
uint64_t y;
} __attribute__ ((aligned (16)));
And suppose we have a shared variable of type uint128_type, like:
uint128_type Foo;
Now, suppose Thread A does:
uint128_type expected = { 0, 0 };
uint128_type desired = { 100, 100 };
bool result = __atomic_compare_exchange(
&Foo,
&expected,
&desired,
false, /* weak */
__ATOMIC_SEQ_CST,
__ATOMIC_SEQ_CST
);
And Thread B does:
uint64_t expected = 0;
uint64_t desired = 500;
bool result = __atomic_compare_exchange(
&Foo.x,
&expected,
&desired,
false, /* weak */
__ATOMIC_SEQ_CST,
__ATOMIC_SEQ_CST
);
What happens if Thread A's 16-byte CAS happens before Thread B's 8-byte CAS (or vice versa)? Does the CAS fail as normal? Is this even defined behavior? Is this likely to "do the right thing" on typical architectures like x86_64 that support 16-byte CAS?
Edit: to be clear, since it seems to be causing confusion, I'm not asking whether the above behavior is defined by the C++ standard. Obviously, all the __atomic_* functions are GCC extensions. (However, future C++ standards may have to define this sort of thing if std::atomic_view becomes standardized.) I am asking more generally about the semantics of atomic operations on typical modern hardware. As an example, if x86_64 code has two threads perform atomic operations on the same memory address, but one thread uses CMPXCHG8B and the other uses CMPXCHG16B, so that one does an atomic CAS on a quadword while the other does an atomic CAS on a double quadword, how are the semantics of these operations defined? More specifically, would a CMPXCHG16B fail because it observes that the data has mutated from the expected value due to a previous CMPXCHG8B?
In other words, can two different CAS operations using two different data sizes (but the same starting memory address) safely be used to synchronize between threads?

One or the other happens first, and each operation proceeds according to its own semantics.
On x86 CPUs, both operations will require a lock on the same cache line held throughout the entire operation. So whichever one gets that lock first will not see any effects of the second operation and whichever one gets that lock second will see all the effects of the first operation. The semantics of both operations will be fully respected.
Other hardware may achieve this result some other way, but if it doesn't achieve this result, it's broken unless it specifies that it has a limitation.

The atomic data will eventually be located somewhere in memory, and all accesses to it (or to the respective caches, when the operations are atomic) will be serialized. Since a CAS operation is supposed to be atomic, it will be performed as a whole or not at all.
That being said, one of the operations will succeed and the other will fail; the order is non-deterministic.
From x86 Instruction Set Reference:
This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically. To simplify the interface to the processor’s bus, the destination operand receives a write cycle without regard to the result of the comparison. The destination operand is written back if the comparison fails; otherwise, the source operand is written into the destination. (The processor never produces a locked read without also producing a locked write.)
Clearly, both threads will attempt a locked write after a locked read (when used with the LOCK prefix, that is), which means only one of them will succeed in performing the CAS; the other will read the already-changed value.

Hardware is usually very conservative when checking for conflicts between potentially conflicting atomic operations. It may even happen that two CAS operations to two completely different, non-overlapping address ranges may be detected to conflict with each other.

The definition of atomic is unlikely to change; emphasis mine:
In concurrent programming, an operation (or set of operations) is atomic,
linearizable, indivisible or uninterruptible if it appears to
the rest of the system to occur instantaneously. Atomicity is a
guarantee of isolation from concurrent processes.
Your question asks...
What happens if an atomic CAS operation on a larger data size observes
a change which happened by an atomic operation on the same memory
location, but using a smaller data size?
By definition, no two overlapping regions of memory modified using atomic operations can mutate concurrently, i.e. the two operations must happen linearly or they're not atomic.

Related

LLVM - Atomic Ordering Unordered

I'm working on a library that heavily relies on bitwise operations, where the most important ones operate on shared memory. I was also having a look at LLVM's atomic-ordering documentation and noticed unordered, which seems to be even weaker than C/C++'s relaxed memory order. I have several questions about it:
What are the differences between unordered and relaxed?
Say I have an atomic bool, is it safe to mutate it via unordered load/store?
Say I have an atomic bitmask, is it safe to mutate it via unordered load/store?
Is it safe to mutate it via unordered fetch_and/or/xor?
Is it safe to mutate it via unordered swap?
Is it safe to mutate it via unordered compare_and_swap?
The short answer is: whatever your problem is, unordered is screamingly unlikely to be the solution!
The longer answer is...
...the LLVM Language Reference Manual says:
unordered
The set of values that can be read is governed by the happens-before partial order. A value cannot be read unless some operation wrote it. This is intended to provide a guarantee strong enough to model Java’s non-volatile shared variables. This ordering cannot be specified for read-modify-write operations; it is not strong enough to make them atomic in any interesting way.
The "A value cannot be read unless some operation wrote it." is fun! What this means is that “speculative” writes are not allowed. Say you have if (y == 99) x = 0; else x = y+1;: an optimizer could turn that into x = y+1; if (y == 99) x = 0;, where the first write of x is the “speculative” one. (I'm not saying that's a sensible or common optimization. The point is that transformations which are perfectly OK from the perspective of a single thread are not OK for atomics.) The C/C++ standards have the same restriction: no “out-of-thin-air” values are allowed.
Elsewhere the LLVM documentation describes unordered as loads/stores which complete without interruption from any other store -- so a load which reads two halves (say) of a value would not qualify if the two halves could be the result of two separate writes!
It seems monotonic is the local name for C/C++ memory_order_relaxed, and is described:
monotonic
In addition to the guarantees of unordered, there is a single total order for modifications by monotonic operations on each address. All modification orders must be compatible with the happens-before order.
Unlike unordered, with monotonic all threads will see writes to a given address in the same order. This means that if thread 'a' writes '1' to a given location, and afterwards thread 'b' writes '2', then after that threads 'c' and 'd' must both read '2'. (Hence the name.)
There is no guarantee that the modification orders can be combined to a global total order for the whole program (and this often will not be possible).
This is the relaxed bit.
The read in an atomic read-modify-write operation (cmpxchg and atomicrmw) reads the value in the modification order immediately before the value it writes.
Same like C/C++: read-modify-write cannot be interrupted by another write.
If one atomic read happens before another atomic read of the same address, the later read must see the same value or a later value in the address’s modification order. This disallows reordering of monotonic (or stronger) operations on the same address.
It's monotonic, guys.
If an address is written monotonic-ally by one thread, and other threads monotonic-ally read that address repeatedly, the other threads must eventually see the write.
So... there may be some latency between a write and reads, but that latency is finite (but may, I assume, not be the same for all threads). The "eventually" is interesting. The C11 standard says "Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.", which is similar but with a more positive spin :-)
This corresponds to the C++0x/C1x memory_order_relaxed.
So there you go.
You asked:
Say I have an atomic bool, is it safe to mutate it via unordered load/store?
Say I have an atomic bitmask, is it safe to mutate it via unordered load/store?
It rather depends on what you mean by safe :-( With any atomic load of 'x' followed later by an atomic store of 'x', you have no idea how many other stores to 'x' there have been since the load. But the added joy of unordered is that it does not guarantee that all threads will see all writes to 'x' in the same order!
Your other questions are moot, because you cannot have unordered read-modify-write operations.
As a practical matter, x86_64 guarantees that all writes become visible to all threads in the same order, so all atomic operations are at the very least monotonic/memory_order_relaxed: no LOCK prefixes and no xFENCE instructions are required; a plain MOV to memory will do the trick. In fact, better than that: plain MOV to/from memory gives memory_order_release/memory_order_acquire.
FWIW: you mention bit-wise operations. I guess it's obvious that reading/writing a few bits is going to involve reading/writing some number of other bits as a side-effect. Which adds to the fun.
My guess is that at a minimum you will need to be doing read-modify-write operations. Again, on x86_64 this boils down to an instruction with a LOCK prefix, which is going to cost tens of clocks -- how many tens depends on the CPU and the degree of contention. Now, a mutex lock and unlock will each involve a LOCKed instruction, so it's generally worth replacing a mutex_lock / ...do some reading and writing... / mutex_unlock sequence with atomic read-modify-writes only if the whole thing reduces to a single read-modify-write.
The disadvantage of a mutex is, of course, that a thread can be "swapped out" while it holds the mutex :-(
A spin-lock (on x86_64) requires a LOCK to acquire but not to release... but the effect of being "swapped out" while holding a spin-lock is even worse :-(

If an RMW operation changes nothing, can it be optimized away, for all memory orders?

In the C/C++ memory model, can a compiler just combine and then remove redundant/NOP atomic modification operations, such as:
x++,
x--;
or even simply
x+=0; // return value is ignored
For an atomic scalar x?
Does that hold for sequential consistency or just weaker memory orders?
(Note: this is about weaker memory orders that still do something; for relaxed, there seemed to be no real question here. EDIT AGAIN: actually there is a serious question in that special case; see my own answer. Not even relaxed is cleared for removal.)
EDIT:
The question is not about code generation for a particular access: if I wanted to see two lock add generated on Intel for the first example, I would have made x volatile.
The question is whether these C/C++ instructions have any impact whatsoever: can the compiler just filter out and remove these null operations (that are not relaxed-order operations), as a sort of source-to-source transformation (or abstract-tree to abstract-tree transformation, perhaps in the compiler "front end")?
EDIT 2:
Summary of the hypotheses:
not all operations are relaxed
nothing is volatile
atomic objects are really potentially accessible by multiple functions and threads (no automatic atomic whose address isn't shared)
Optional hypothesis:
If you want, you may assume that the address of the atomic is not taken, that all accesses are by name, and that every access has one of these properties:
That no access of that variable, anywhere, has a relaxed load/store element: all load operations should have acquire and all stores should have release (so all RMW should be at least acq_rel).
Or, that for those accesses that are relaxed, the access code doesn't read the value for a purpose other than changing it: a relaxed RMW does not conserve the value further (and does not test the value to decide what to do next). In other words, no data or control dependency on the value of the atomic object unless the load has an acquire.
Or that all accesses of the atomic are sequentially consistent.
That is I'm especially curious about these (I believe quite common) use cases.
Note: an access is not considered "completely relaxed" even if it's done with a relaxed memory order, when the code makes sure observers have the same memory visibility, so this is considered valid for (1) and (2):
atomic_thread_fence(std::memory_order_release);
x.store(1,std::memory_order_relaxed);
as the memory visibility is at least as good as with just x.store(1,std::memory_order_release);
This is considered valid for (1) and (2):
int v = x.load(std::memory_order_relaxed);
atomic_thread_fence(std::memory_order_acquire);
for the same reason.
This is stupidly, trivially valid for (2) (i is just an int)
i=x.load(std::memory_order_relaxed),i=0; // useless
as no information from a relaxed operation was kept.
This is valid for (2):
(void)x.fetch_add(1, std::memory_order_relaxed);
This is not valid for (2):
if (x.load(std::memory_order_relaxed))
f();
else
g();
as a consequential decision was based on a relaxed load; neither is
i += x.fetch_add(1, std::memory_order_release);
Note: (2) covers one of the most common uses of an atomic, the thread-safe reference counter. (CORRECTION: it isn't clear that all thread-safe counters technically fit the description, as the acquire can be done only on the 0-post-decrement path, and then a decision was taken based on counter>0 without an acquire; a decision to not do something, but still...)
No, definitely not entirely. It's at least a memory barrier within the thread for stronger memory orders.
For mo_relaxed atomics, yes I think it could in theory be optimized away completely, as if it wasn't there in the source. It's equivalent for a thread to simply not be part of a release-sequence it might have been part of.
If you used the result of the fetch_add(0, mo_relaxed), then I think collapsing them together and just doing a load instead of an RMW of 0 might not be exactly equivalent. Barriers in this thread surrounding the relaxed RMW still have an effect on all operations, including ordering the relaxed operation wrt. non-atomic operations. With a load+store tied together as an atomic RMW, things that order stores could order an atomic RMW when they wouldn't have ordered a pure load.
But I don't think any C++ ordering is like that: mo_release stores order earlier loads and stores, and atomic_thread_fence(mo_release) is like an asm StoreStore + LoadStore barrier. (Preshing on fences). So yes, given that any C++-imposed ordering would also apply to a relaxed load equally to a relaxed RMW, I think int tmp = shared.fetch_add(0, mo_relaxed) could be optimized to just a load.
(In practice compilers don't optimize atomics at all, basically treating them all like volatile atomic, even for mo_relaxed. Why don't compilers merge redundant std::atomic writes? and http://wg21.link/n4455 + http://wg21.link/p0062. It's too hard / no mechanism exists to let compilers know when not to.)
But yes, the ISO C++ standard on paper makes no guarantee that other threads can actually observe any given intermediate state.
Thought experiment: Consider a C++ implementation on a single-core cooperative multi-tasking system. It implements std::thread by inserting yield calls where needed to avoid deadlocks, but not between every instruction. Nothing in the standard requires a yield between num++ and num-- to let other threads observe that state.
The as-if rule basically allows a compiler to pick a legal/possible ordering and decide at compile-time that it's what happens every time.
In practice this can create fairness problems if an unlock/re-lock never actually gives other threads the chance to take a lock if --/++ are combined together into just a memory barrier with no modification of the atomic object! This among other things is why compilers don't optimize.
Any stronger ordering for one or both of the operations can begin or be part of a release-sequence that synchronizes-with a reader. A reader that does an acquire load of a release store/RMW Synchronizes-With this thread, and must see all previous effects of this thread as having already happened.
IDK how the reader would know that it was seeing this thread's release-store instead of some previous value, so a real example is probably hard to cook up. At least we could create one without possible UB, e.g. by reading the value of another relaxed atomic variable so we avoid data-race UB if we didn't see this value.
Consider the sequence:
// broken code where optimization could fix it
memcpy(buf, stuff, sizeof(buf));
done.store(1, mo_relaxed); // relaxed: can reorder with memcpy
done.fetch_add(-1, mo_relaxed);
done.fetch_add(+1, mo_release); // release-store publishes the result
This could optimize to just done.store(1, mo_release); which correctly publishes a 1 to the other thread without the risk of the 1 being visible too soon, before the updated buf values.
But it could also optimize just the cancelling pair of RMWs into a fence after the relaxed store, which would still be broken. (And not the optimization's fault.)
// still broken
memcpy(buf, stuff, sizeof(buf));
done.store(1, mo_relaxed); // relaxed: can reorder with memcpy
atomic_thread_fence(mo_release);
I haven't thought of an example where safe code becomes broken by a plausible optimization of this sort. Of course just removing the pair entirely even when they're seq_cst wouldn't always be safe.
A seq_cst increment and decrement does still create a sort of memory barrier. If they weren't optimized away, it would be impossible for earlier stores to reorder with later loads. To preserve this, compiling for x86 would probably still need to emit mfence.
Of course the obvious thing would be a lock add [x], 0 which does actually do a dummy RMW of the shared object that we did x++/x-- on. But I think the memory barrier alone, not coupled to an access to that actual object or cache line is sufficient.
And of course it has to act as a compile-time memory barrier, blocking compile-time reordering of non-atomic and atomic accesses across it.
For acq_rel or weaker fetch_add(0) or cancelling sequence, the run-time memory barrier might happen for free on x86, only needing to restrict compile-time ordering.
See also a section of my answer on Can num++ be atomic for 'int num'?, and in comments on Richard Hodges' answer there. (But note that some of that discussion is confused by arguments about when there are modifications to other objects between the ++ and --. Of course all ordering of this thread's operations implied by the atomics must be preserved.)
As I said, this is all hypothetical and real compilers aren't going to optimize atomics until the dust settles on N4455 / P0062.
The C++ memory model provides four coherence requirements for all atomic accesses to the same atomic object. These requirements apply regardless of the memory order. As stated in a non-normative note:
The four preceding coherence requirements effectively disallow compiler reordering of atomic operations to a single object, even if both operations are relaxed loads.
Emphasis added.
Given that both operations are happening to the same atomic variable, and the first definitely happens before the second (due to being sequenced before it), there can be no reordering of these operations. Again, even if relaxed operations are used.
If this pair of operations were removed by a compiler, that would guarantee that no other threads would ever see the incremented value. So the question now becomes whether the standard would require some other thread to be able to see the incremented value.
It does not. Without some way for something to guarantee-ably "happen after" the increment and "happen before" the decrement, there is no guarantee that any operation on any other thread will certainly see the incremented value.
This leaves one question: does the second operation always undo the first? That is, does the decrement undo the increment? That depends on the scalar type in question. ++ and -- are only defined for the pointer and integer specializations of atomic. So we only need to consider those.
For pointers, the decrement undoes the increment. The reason being that the only way incrementing+decrementing a pointer would not result in the same pointer to the same object is if incrementing the pointer was itself UB. That is, if the pointer is invalid, NULL, or is the past-the-end pointer to an object/array. But compilers don't have to consider UB cases since... they're undefined behavior. In all of the cases where incrementing is valid, pointer decrementing must also be valid (or UB, perhaps due to someone freeing the memory or otherwise making the pointer invalid, but again, the compiler doesn't have to care).
For unsigned integers, the decrement always undoes the increment, since wraparound behavior is well-defined for unsigned integers.
That leaves signed integers. C++ usually makes signed integer over/underflow into UB. Fortunately, that's not the case for atomic math; the standard explicitly says:
For signed integer types, arithmetic is defined to use two's complement representation. There are no undefined results.
Wraparound behavior for two's complement atomics works. That means increment/decrement always results in recovering the same value.
So there does not appear to be anything in the standard which would prevent compilers from removing such operations. Again, regardless of the memory ordering.
Now, if you use non-relaxed memory ordering, the implementation cannot completely remove all traces of the atomics. The actual memory barriers behind those orderings still have to be emitted. But the barriers can be emitted without emitting the actual atomic operations.
In the C/C++ memory model, can a compiler just combine and then remove
redundant/NOP atomic modification operations,
No, the removal part is not allowed, at least not in the specific way the question suggests: the intent here is to describe valid source-to-source transformations (abstract tree to abstract tree, or rather transformations of a higher-level description of the source code that encodes all the semantic elements that might be needed by later phases of compilation).
The hypothesis is that code generation can be done on the transformed program, without ever checking with the original one. So only safe transformations that cannot break any code are allowed.
(Note: For weaker memory orders that still do something; for relaxed,
there is no real question here.)
No. Even that is wrong: even for relaxed operations, unconditional removal isn't a valid transformation (although in most practical cases it's certainly valid; but "mostly correct" is still wrong, and "true in >99% of practical cases" has nothing to do with the question):
Before the introduction of standard threads, a stuck program was one in an infinite loop performing no externally visible side effects: no input, no output, no volatile operations, and in practice no system calls. A program that will never perform anything visible is stuck, and its behavior is not defined, which allows the compiler to assume pure algorithms terminate: loops containing only invisible computations must exit somehow (and that includes exiting with an exception).
With threads, that definition is obviously not usable: a loop in one thread isn't the whole program; a stuck program is really one in which no thread can make progress on anything useful, and forbidding that would be sound.
But the very problematic standard definition of "stuck" doesn't describe a program execution but a single thread: a thread is stuck if it will perform no side effect that could potentially have an effect on observable behavior, that is:
nothing observable, obviously (no I/O);
no action that might interact with another thread.
The standard definition of 2. is extremely broad and simplistic: all operations on an inter-thread communication device count, i.e. any atomic operation and any action on any mutex. Full text of the requirement (relevant part in boldface):
[intro.progress]
The implementation may assume that any thread will eventually do one
of the following:
terminate,
make a call to a library I/O function,
perform an access through a volatile glvalue, or
perform a synchronization operation or an atomic operation.
[ Note: This is intended to allow compiler transformations such as removal of
empty loops, even when termination cannot be proven. — end note ]
That definition does not even specify:
an inter thread communication (from one thread to another)
a shared state (visible by multiple threads)
a modification of some state
an object that is not thread private
That means that all these silly operations count:
for fences:
performing an acquire fence (even when followed by no atomic operation) in a thread that has at least once done an atomic store can synchronize with another fence or atomic operation
for mutexes:
locking a recently created, patently useless, function-private mutex;
locking a mutex just to unlock it, doing nothing while the mutex is held;
for atomics:
reading an atomic variable declared as const qualified (not a const reference to a non const atomic);
reading an atomic, ignoring the value, even with relaxed memory ordering;
setting a non-const-qualified atomic to its own immutable value (setting a variable to zero when nothing in the whole program sets it to a non-zero value), even with relaxed ordering;
doing operations on a local atomic variable not accessible by other threads;
for thread operations:
creating a thread (that might do nothing) and joining it, which seems to create a (no-op) synchronization operation.
It means that no early, local transformation of program code that leaves no trace of the transformation for later compiler phases and that removes even the most silly and useless inter-thread primitive is unconditionally valid according to the standard, as it might remove the last potentially useful (but actually useless) operation in a loop (and a loop doesn't have to be spelled for or while; it's any looping construct, e.g. a backward goto).
This however doesn't apply if other operations on inter-thread primitives are left in the loop, or obviously if I/O is done.
This looks like a defect.
A meaningful requirement should be based:
not only on using thread primitives,
not be about any thread in isolation (you can't see whether a thread is contributing to anything, but at least a requirement to meaningfully interact with another thread, rather than just use a private atomic or mutex, would be better than the current requirement),
based on doing something useful (program observables) and inter-thread interactions that contribute to something being done.
I'm not proposing a replacement right now, as the rest of the thread specification isn't even clear to me.

Atomic operation propagation/visibility (atomic load vs atomic RMW load)

Context 
I am writing a thread-safe protothread/coroutine library in C++, and I am using atomics to make task switching lock-free. I want it to be as performant as possible. I have a general understanding of atomics and lock-free programming, but I do not have enough expertise to optimise my code. I did a lot of research, but it was hard to find answers to my specific problem: what is the propagation delay/visibility for different atomic operations under different memory orders?
Current assumptions 
I read that changes to memory are propagated to other threads in such a way that they might become visible:
in different orders to different observers,
with some delay.
I am unsure as to whether this delayed visibility and inconsistent propagation applies only to non-atomic reads, or to atomic reads as well, potentially depending on what memory order is used. As I am developing on an x86 machine, I have no way of testing the behaviour on weakly ordered systems.
Do all atomic reads always read the latest values, regardless of the type of operation and the memory order used? 
I am pretty sure that all read-modify-write (RMW) operations always read the most recent value written by any thread, regardless of the memory order used. The same seems to be true for sequentially consistent operations, but only if all other modifications to a variable are also sequentially consistent. Both are said to be slow, which is not good for my task. If not all atomic reads get the most recent value, then I will have to use RMW operations just for reading an atomic variable's latest value, or use atomic reads in a while loop, to my current understanding.
Does the propagation of writes (ignoring side effects) depend on the memory order and the atomic operation used? 
(This question only matters if the answer to the previous question is that not all atomic reads always read the most recent value. Please read carefully, I do not ask about the visibility and propagation of side-effects here. I am merely concerned with the value of the atomic variable itself.) This would imply that depending on what operation is used to modify an atomic variable, it would be guaranteed that any following atomic read receives the most recent value of the variable. So I would have to choose between an operation guaranteed to always read the latest value, or use relaxed atomic reads, in tandem with this special write operation that guarantees instant visibility of the modification to other atomic operations.
Is atomic lock-free?
First of all, let's get rid of the elephant in the room: using atomic in your code doesn't guarantee a lock-free implementation. atomic is only an enabler for a lock-free implementation. is_lock_free() will tell you if it's really lock-free for the C++ implementation and the underlying types that you are using.
What's the latest value?
The term "latest" is very ambiguous in the world of multithreading, because what is the "latest" for one thread that might be put to sleep by the OS might no longer be the latest for another thread that is active.
The only thing std::atomic guarantees is protection against race conditions, by ensuring that reads, writes and RMW operations performed on one atomic in one thread are performed atomically, without any interruption, and that all other threads see either the value before or the value after, but never what's in between. So atomics synchronize threads by creating an order between concurrent operations on the same atomic object.
You need to see every thread as a parallel universe with its own time and that is unaware of the time in the parallel universes. And like in quantum physics, the only thing that you can know in one thread about another thread is what you can observe (i.e. a "happened before" relation between the universes).
This means that you should not conceive multithreaded time as if there would be an absolute "latest" across all the threads. You need to conceive time as relative to the other threads. This is why atomics don't create an absolute latest, but only ensure a sequential ordering of the successive states that an atomic will have.
Propagation
Propagation doesn't depend on the memory order or on the atomic operation performed. memory_order is about sequencing constraints on non-atomic variables around atomic operations, which act like fences. The best explanation of how this works is probably Herb Sutter's presentation, definitely worth the hour and a half if you're working on multithreading optimisation.
Although a particular C++ implementation could implement some atomic operation in a way that influences propagation, you cannot rely on any such behaviour you might observe: there is no guarantee that propagation works the same way in the next release of the compiler, or with another compiler on another CPU architecture.
But does propagation matter?
When designing lock-free algorithms, it is tempting to read atomic variables to get the latest status. But while such a read-only access is atomic, the action immediately after it is not. So the following instructions might assume a state which is already obsolete (for example because the thread is sent to sleep immediately after the atomic read).
Take if(my_atomic_variable<10) and suppose that you read 9. Suppose you're in the best possible world and 9 is the absolutely latest value set by all the concurrent threads. Comparing the value with <10 is not atomic, so by the time the comparison succeeds and the if branches, my_atomic_variable might already have a new value of 10. And this kind of problem can occur regardless of how fast the propagation is, even if the read were guaranteed to always get the latest value. And I haven't even mentioned the ABA problem yet.
The only benefit of the plain read is to avoid a data race and UB. If you want to synchronize decisions/actions across threads, you need an RMW such as compare-and-swap (e.g. atomic_compare_exchange_strong) so that the ordering of atomic operations results in a predictable outcome.
After some discussion, here are my findings: First, let's define what an atomic variable's latest value means: In wall-clock time, the very latest write to an atomic variable, so, from an external observer's point of view. If there are multiple simultaneous last writes (i.e., on multiple cores during the same cycle), then it doesn't really matter which one of them is chosen.
Atomic loads of any memory order have no guarantee that the latest value is read. This means that writes have to propagate before you can access them. This propagation may be out of order with respect to the order in which they were executed, as well as differing in order with respect to different observers.
std::atomic<int> counter{0};

void thread()
{
    // Imagine no race between read and write.
    int value = counter.load(std::memory_order_relaxed);
    counter.store(value + 1, std::memory_order_relaxed);
}

for (int i = 0; i < 1000; i++)
    std::async(thread);
In this example, according to my understanding of the spec, even if no read-write executions were to interfere, there could still be multiple executions of thread that read the same values, so that in the end, counter would not be 1000. This is because when using plain reads, although threads are guaranteed to observe modifications to the same variable in order (they will not read a new value and then, on a later read, an older value), they are not guaranteed to read the globally latest value written to the variable.
This creates the relativity effect (as in Einstein's physics) that every thread has its own "truth", and this is exactly why we need to use sequential consistency (or acquire/release) to restore causality: If we simply use relaxed loads, then we can even have broken causality and apparent time loops, which can happen because of instruction reordering in combination with out-of-order propagation. Memory ordering will ensure that those separate realities perceived by separate threads are at least causally consistent.
Atomic read-modify-write (RMW) operations (such as exchange, compare_exchange, fetch_add,…) are guaranteed to operate on the latest value as defined above. This means that propagation of writes is forced, and results in one universal view on the memory (if all reads you make are from atomic variables using RMW operations), independent of threads. So, if you use atomic.compare_exchange_strong(value,value, std::memory_order_relaxed) or atomic.fetch_or(0, std::memory_order_relaxed), then you are guaranteed to perceive one global order of modification that encompasses all atomic variables. Note that this does not guarantee you any ordering or causality of non-RMW reads.
std::atomic<int> counter{0};

void thread()
{
    // Imagine no race between read and write.
    int value = counter.fetch_or(0, std::memory_order_relaxed);
    counter.store(value + 1, std::memory_order_relaxed);
}

for (int i = 0; i < 1000; i++)
    std::async(thread);
In this example (again, under the assumption that none of the thread() executions interfere with each other), it seems to me that the spec forbids value to contain anything but the globally latest written value. So, counter would always be 1000 in the end.
Now, when to use which kind of read? 
If you only need causality within each thread (there might still be different views on what happened in which order, but at least every single reader has a causally consistent view on the world), then atomic loads and acquire/release or sequential consistency suffice.
But if you also need fresh reads (so that you must never read values other than the globally (across all threads) latest value), then you should use RMW operations for reading. Those alone do not create causality for non-atomic and non-RMW reads, but all RMW reads across all threads share the exact same view on the world, which is always up to date.
So, to conclude: Use atomic loads if different world views are allowed, but if you need an objective reality, use RMWs to load.
Multithreading is a surprising area.
First, an atomic read is not ordered after a write; i.e., reading a value does not mean that it was written before the read. Sometimes such a read may even see the (indirect, via another thread) result of some subsequent atomic write by the same thread.
Sequential consistency is clearly about visibility and propagation. When a thread performs a sequentially consistent atomic write, it makes all its previous writes visible to other threads (propagation). In that case a (sequentially consistent) read is ordered in relation to the write.
Generally the most performant operations are "relaxed" atomic operations, but they provide only minimal ordering guarantees. In principle you can even get causality paradoxes... :-)

Would a volatile variable be enough in this case? [duplicate]

A global variable is shared across 2 concurrently running threads on 2 different cores. The threads write to and read from the variable. Can one thread read a stale value of the atomic variable? Each core might have a copy of the shared variable in its cache, and when one thread writes to its copy, the other thread on a different core might read a stale value from its own cache. Or does the compiler enforce strong memory ordering so the latest value is read from the other cache? The C++11 standard library has std::atomic support. How is this different from the volatile keyword? How will volatile and atomic types behave differently in the above scenario?
Firstly, volatile does not imply atomic access. It is designed for things like memory mapped I/O and signal handling. volatile is completely unnecessary when used with std::atomic, and unless your platform documents otherwise, volatile has no bearing on atomic access or memory ordering between threads.
If you have a global variable which is shared between threads, such as:
std::atomic<int> ai;
then the visibility and ordering constraints depend on the memory ordering parameter you use for operations, and the synchronization effects of locks, threads and accesses to other atomic variables.
In the absence of any additional synchronization, if one thread writes a value to ai then there is nothing that guarantees that another thread will see the value in any given time period. The standard specifies that it should be visible "in a reasonable period of time", but any given access may return a stale value.
The default memory ordering of std::memory_order_seq_cst provides a single global total order for all std::memory_order_seq_cst operations across all variables. This doesn't mean that you can't get stale values, but it does mean that the value you do get determines and is determined by where in this total order your operation lies.
If you have 2 shared variables x and y, initially zero, and have one thread write 1 to x and another write 2 to y, then a third thread that reads both may see either (0,0), (1,0), (0,2) or (1,2) since there is no ordering constraint between the operations, and thus the operations may appear in any order in the global order.
If both writes are from the same thread, which does x=1 before y=2 and the reading thread reads y before x then (0,2) is no longer a valid option, since the read of y==2 implies that the earlier write to x is visible. The other 3 pairings (0,0), (1,0) and (1,2) are still possible, depending how the 2 reads interleave with the 2 writes.
If you use other memory orderings such as std::memory_order_relaxed or std::memory_order_acquire then the constraints are relaxed even further, and the single global ordering no longer applies. Threads don't even necessarily have to agree on the ordering of two stores to separate variables if there is no additional synchronization.
The only way to guarantee you have the "latest" value is to use a read-modify-write operation such as exchange(), compare_exchange_strong() or fetch_add(). Read-modify-write operations have an additional constraint that they always operate on the "latest" value, so a sequence of ai.fetch_add(1) operations by a series of threads will return a sequence of values with no duplicates or gaps. In the absence of additional constraints, there's still no guarantee which threads will see which values though. In particular, it is important to note that the use of an RMW operation does not force changes from other threads to become visible any quicker, it just means that if the changes are not seen by the RMW then all threads must agree that they are later in the modification order of that atomic variable than the RMW operation. Stores from different threads can still be delayed by arbitrary amounts of time, depending on when the CPU actually issues the store to memory (rather than just its own store buffer), physically how far apart the CPUs executing the threads are (in the case of a multi-processor system), and the details of the cache coherency protocol.
Working with atomic operations is a complex topic. I suggest you read a lot of background material, and examine published code before writing production code with atomics. In most cases it is easier to write code that uses locks, and not noticeably less efficient.
volatile and the atomic operations have a different background, and
were introduced with a different intent.
volatile dates from way back, and is principally designed to prevent
compiler optimizations when accessing memory mapped IO. Modern
compilers tend to do no more than suppress optimizations for volatile,
although on some machines, this isn't sufficient for even memory mapped
IO. Except for the special case of signal handlers, and setjmp
and longjmp sequences (where the C standard, and in the case
of signals, the Posix standard, gives additional guarantees), it must be
considered useless on a modern machine, where without special additional
instructions (fences or memory barriers), the hardware may reorder or
even suppress certain accesses. Since you shouldn't be using setjmp
et al. in C++, this more or less leaves signal handlers, and in a
multithreaded environment, at least under Unix, there are better
solutions for those as well. And possibly memory mapped IO, if you're
working on kernel code and can ensure that the compiler generates
whatever is needed for the platform in question. (According to the
standard, volatile access is observable behavior, which the compiler
must respect. But the compiler gets to define what is meant by
“access”, and most seem to define it as “a load or
store machine instruction was executed”. Which, on a modern
processor, doesn't even mean that there is necessarily a read or write
cycle on the bus, much less that it's in the order you expect.)
Given this situation, the C++ standard added atomic access, which does
provide a certain number of guarantees across threads; in particular,
the code generated around an atomic access will contain the necessary
additional instructions to prevent the hardware from reordering the
accesses, and to ensure that the accesses propagate down to the global
memory shared between cores on a multicore machine. (At one point in
the standardization effort, Microsoft proposed adding these semantics to
volatile, and I think some of their C++ compilers do. After
discussion of the issues in the committee, however, the general
consensus—including the Microsoft representative—was that it
was better to leave volatile with its original meaning, and to define
the atomic types.) Or just use the system level primitives, like
mutexes, which execute whatever instructions are needed in their code.
(They have to. You can't implement a mutex without some guarantees
concerning the order of memory accesses.)
Here's a basic synopsis of what the 2 things are:
1) Volatile keyword:
Tells the compiler that this value could change at any moment and therefore it must never cache it in a register. Look up the old "register" keyword in C: "volatile" is basically the "-" operator to "register"'s "+". Modern compilers now do by default the optimization that "register" used to request explicitly, so you only see volatile anymore. Using the volatile qualifier guarantees that every access goes to memory rather than a cached register value, but nothing more: it does not make accesses atomic, and it does not guarantee that you see the latest value written by another core.
2) Atomic:
Atomic operations modify data indivisibly, so that it is impossible for any other thread to observe the data in the middle of such an update. They are usually limited to whatever single-instruction read-modify-write operations the hardware supports: things like ++, --, and compare-and-swap. Note that this says nothing about the ORDER in which the different threads will RUN the atomic instructions, only that no two of them will ever overlap. That's why you have all those additional memory-order options for forcing an ordering.
volatile and atomic serve different purposes.
volatile:
Informs the compiler not to optimize accesses away. It is used for variables that may change unexpectedly, so it can represent hardware status registers or variables modified inside an ISR. It is not a substitute for atomics in a multithreaded application.
atomic:
Used in multithreaded applications. Atomic operations are free of data races and indivisible (whether they are also lock-free depends on the type and platform; check is_lock_free()). Typical usage scenarios are checking whether a lock is free or taken, or atomically adding to a value and returning the result.
