Confused about atomics: memory_order_relaxed - C++

I am studying this site: https://gcc.gnu.org/wiki/Atomic/GCCMM/AtomicSync, which is very helpful to understand the topic about atomic class.
But this example about relaxed mode is hard to understand:
/* Thread 1: */
y.store(20, memory_order_relaxed);
x.store(10, memory_order_relaxed);

/* Thread 2: */
if (x.load(memory_order_relaxed) == 10)
{
    assert(y.load(memory_order_relaxed) == 20); /* assert A */
    y.store(10, memory_order_relaxed);
}

/* Thread 3: */
if (y.load(memory_order_relaxed) == 10)
    assert(x.load(memory_order_relaxed) == 10); /* assert B */
To me, assert B should never fail: for thread 2 to store 10 into y, it must first have seen x == 10, so when thread 3 sees y == 10, x must already be 10.
But the website says either assert in this example can actually FAIL.

To me, assert B should never fail: for thread 2 to store 10 into y, it must first have seen x == 10, so when thread 3 sees y == 10, x must already be 10.
In effect, your argument is that since in thread 2 the store of 10 into x occurred before the store of 10 into y, in thread 3 the same must be the case.
However, since you are only using relaxed memory operations, there is nothing in the code that requires two different threads to agree on the ordering between modifications of different variables. So indeed thread 2 may see the store of 10 into x before the store of 10 into y while thread 3 sees those two operations in the opposite order.
In order to ensure that assert B succeeds, you would, in effect, need to ensure that when thread 3 sees the value 10 of y, it also sees any other side effects performed by the thread that stored 10 into y before the time of the store. That is, you need the store of 10 into y to synchronize with the load of 10 from y. This can be done by having the store perform a release and the load perform an acquire:
// thread 2
y.store (10, memory_order_release);
// thread 3
if (y.load (memory_order_acquire) == 10)
A release operation synchronizes with an acquire operation that reads the value stored. Now because the store in thread 2 synchronizes with the load in thread 3, anything that happens after the load in thread 3 will see the side effects of anything that happens before the store in thread 2. Hence the assertion will succeed.
Of course, we also need to make sure assertion A succeeds, by making the x.store in thread 1 use release and the x.load in thread 2 use acquire.

I find it much easier to understand atomics with some knowledge of the hardware and compiler behavior that might be behind them, so here's some background. Note that these concepts are in no way stated in the C++ language itself; they are just some of the possible reasons why things are the way they are.
Compiler reordering
Compilers, especially when optimizing, will reorder and transform the program as long as its observable effects are the same for a single-threaded program. This is constrained by the use of atomics, which tell the compiler (among other things) that the variable might change at any moment and that its value might be read elsewhere.
Formally, atomics ensure one thing: there will be no data races. That is, accessing the variable will not make your computer explode.
CPU reordering
The CPU might reorder instructions as it is executing them, which means the instructions can get reordered at the hardware level, independent of how you wrote the program.
Caching
Finally there are the effects of caches: faster memories that contain a partial copy of global memory. Caches are not always in sync, meaning they don't always agree on what is "correct". Different threads may not be using the same cache, and because of this they may not agree on what values variables have.
Back to the problem
What the above amounts to is pretty much what C++ says about the matter: unless explicitly stated otherwise, the ordering of side effects of each instruction is totally and completely unspecified. It might not even be the same viewed from different threads.
Formally, the guarantee of an ordering between side effects is called a happens-before relation. Unless one side effect happens-before another, no ordering between them is guaranteed. Loosely, we just call this synchronization.
Now, what is memory_order_relaxed? It tells the compiler to stop meddling, but it makes no promises about how the CPU and cache (and possibly other things) behave. Therefore, one possible explanation of the "impossible" assert failure is:
Thread 1 stores 20 to y and then 10 to x to its cache.
Thread 2 reads the new values and stores 10 to y to its cache.
Thread 3 didn't read the values from thread 1, but reads those of thread 2, and then the assert fails.
This might be completely different from what happens in reality, the point is anything can happen.
To ensure a happens-before relation between the multiple reads and writes, see Brian's answer.
Another construct that provides happens-before relations is std::mutex, which is why they are free from such insanities.

The answer to your question is the C++ standard.
The section [intro.races] is surprisingly clear (which is not the rule for this kind of normative text: consistent formalism often hurts readability).
I have read many books and tutorials that treat the subject of memory ordering, but they just confused me.
Finally I have read the C++ standard, the section [intro.multithread] is the clearest I have found. Taking the time to read it carefully (twice) may save you some time!
The answer to your question is in [intro.races]/4:
All modifications to a particular atomic object M occur in some particular total order, called the modification
order of M. [ Note: There is a separate order for each atomic object. There is no requirement that these can
be combined into a single total order for all objects. In general this will be impossible since different threads
may observe modifications to different objects in inconsistent orders. — end note ]
You were expecting a single total order on all atomic operations. There is such an order, but only for atomic operations that are memory_order_seq_cst as explained in [atomics.order]/3:
There shall be a single total order S on all memory_order_seq_cst operations, consistent with the “happens
before” order and modification orders for all affected locations [...]

Related

Do I understand the semantics of std::memory_order correctly?

cppreference.com states about memory_order::seq_cst:
A load operation with this memory order performs an acquire operation, a store performs a release operation, and read-modify-write performs both an acquire operation and a release operation, plus a single total order exists in which all threads observe all modifications in the same order.
[ Q1 ]: Does this mean that the single total order runs through every memory_order::seq_cst operation on all atomic variables (this one and others)?
[ Q2 ]: And are release, acquire and acq_rel operations not included in the "single total order"?
I understand that seq_cst is equivalent to acquire for loads, release for stores, and acquire-release for read-modify-write operations, but I'm confused about whether seq_cst can order operations on other atomic variables too, not only on the same variable.
cppreference is only a summary of the C++ standard, and sometimes its text is less precise. The actual standard draft makes it clear: the final C++20 working draft N4861 states in [atomics.order], par. 4:
There is a single total order S on all memory_order::seq_cst operations, including fences, that satisfies the following constraints [...]
This clearly says all seq_cst operations, not just all operations on a particular object.
And notes 6 and 7 further down emphasize that the order does not apply to weaker memory orders:
6 [Note: We do not require that S be consistent with “happens before” (6.9.2.1). This allows more efficient
implementation of memory_order::acquire and memory_order::release on some machine architectures.
It can produce surprising results when these are mixed with memory_order::seq_cst accesses. — end note]
7 [Note: memory_order::seq_cst ensures sequential consistency only for a program that is free of data races
and uses exclusively memory_order::seq_cst atomic operations. Any use of weaker ordering will invalidate
this guarantee unless extreme care is used. In many cases, memory_order::seq_cst atomic operations [...] — end note]
I find this part incomplete:
A load operation with this memory order performs an acquire operation,
a store performs a release operation, and read-modify-write performs
both an acquire operation and a release operation, plus a single total
order exists in which all threads observe all modifications in the
same order.
If those things (release stores, acquire loads, and a total store order) were actually sufficient to give sequential consistency, that would imply that release and acquire operations on their own are more strongly ordered than they actually are.
Let's have a look at the following counter example:
CPU1:
a = 1;      // release store
int r1 = b; // acquire load
Then based on the above definition for SC (and the known properties sequential consistency must have to fit the name), I would presume that the store of a and the load of b can't be reordered:
we have a release-store and an acquire-load
we (can) have a total order over all loads/stores
So we have satisfied the above definition for sequential consistency.
But a release-store followed by an acquire-load to a different address can be reordered; the canonical example is Dekker's algorithm. Therefore the above definition of SC is broken, because it is missing the requirement that the memory order preserve program order. Apart from a compiler messing things up, the typical cause of this violation is the store buffer, which most modern CPUs have, and which can cause an older store to be reordered with a newer load to a different address.
The single total order is a different concern than CPU local instruction reordering as you can get with e.g. store buffers. It effectively means that there is some moment where an operation takes effect in the memory order and nobody should be able to disagree with that. The standard litmus test for this is the independent reads of independent writes (IRIW):
CPU1:
A=1
CPU2:
B=1
CPU3:
r1=A
[LoadLoad]
r2=B
CPU4:
r3=B
[LoadLoad]
r4=A
So could it be that CPU3 and CPU4 see the stores to different addresses in different orders? If the answer is yes, then no total order over the load/stores exist.
Another cause of not having a total order over the loads/stores is store-to-load forwarding (STLF).
CPU1:
A=1
r1=A
r2=B
CPU2:
B=1
r3=B
r4=A
Is it possible that r1=1, r2=0, r3=1 and r4=0?
On x86 this is possible due to store-to-load forwarding. If CPU1 does a store of A followed by a load of A, then the CPU must look in the store buffer for the value of A. This causes the store of A not to be atomic: the local CPU can see the store early, and the consequence is that no total order over the loads/stores exists.
So instead of a total order over all loads/stores, the guarantee is reduced to a total order over the stores, and this is how the x86 memory model gets its name (TSO, Total Store Order).

Is the concept of release-sequence useful in practice?

C++ atomic semantics only guarantee visibility (through the happens-before relation) of memory operations performed by the last thread that did a release write (simple or read-modify-write) operation.
Consider
int x, y;
atomic<int> a;
Thread 1:
x = 1;
a.store(1,memory_order_release);
Thread 2:
y = 2;
if (a.load(memory_order_relaxed) == 1)
a.store(2,memory_order_release);
Then the observation of a == 2 implies visibility of thread 2 operations (y == 2) but not thread 1 (one cannot even read x).
As far as I know, real implementations of multithreading use concepts of fences (and sometimes release stores) but not happens-before or release-sequence, which are high-level C++ concepts; I fail to see what real hardware details these concepts map to.
How can a real implementation not guarantee visibility of thread 1 memory operations when the value of 2 in a is globally visible?
In other words, is there any good in the release-sequence definition? Why wouldn't the release-sequence extend to every subsequent modification in the modification order?
Consider in particular silly-thread 3:
if (a.load(memory_order_relaxed) == 2)
a.store(2,memory_order_relaxed);
Can silly-thread 3 ever suppress any visibility guarantee on any real hardware? In other words, if value 2 is globally visible, how would making it again globally visible break any ordering?
Is my mental model of real multiprocessing incorrect? Can a value be partially visible, on some CPU but not on another?
(Of course I assume a non-crazy semantics for relaxed writes, as writes that go back in time make the language semantics of C++ absolutely nonsensical, unlike safe languages like Java that always have bounded semantics. No real implementation can have a crazy, non-causal relaxed semantics.)
Let's first answer your question:
Why wouldn't the release-sequence extend to every subsequent modification in the modification order?
Because if so, we would lose some potential optimization. For example, consider the thread:
x = 1; // #1
a.store(1,memory_order_relaxed); // #2
Under current rules, the compiler is able to reorder #1 and #2. However, after the extension of release-sequence, the compiler is not allowed to reorder the two lines because another thread like your thread 2 may introduce a release sequence headed by #2 and tailed by a release operation, thus it is possible that some read-acquire operation in another thread would be synchronized with #2.
You give a specific example, and claim that all implementations would produce a specific outcome while the language rules do not guarantee this outcome. This is not a problem because the language rules are intended to handle all cases, not only your specific example. Of course the language rules may be improved so that it can guarantee the expected outcome for your specific example, but that is not a trivial work. At least, as we have argued above, simply extending the definition for release-sequence is not an accepted solution.

Acquire/release semantics with 4 threads

I am currently reading C++ Concurrency in Action by Anthony Williams. One of his listings shows this code, and he states that the assertion that z != 0 can fire.
#include <atomic>
#include <thread>
#include <assert.h>

std::atomic<bool> x, y;
std::atomic<int> z;

void write_x()
{
    x.store(true, std::memory_order_release);
}

void write_y()
{
    y.store(true, std::memory_order_release);
}

void read_x_then_y()
{
    while (!x.load(std::memory_order_acquire));
    if (y.load(std::memory_order_acquire))
        ++z;
}

void read_y_then_x()
{
    while (!y.load(std::memory_order_acquire));
    if (x.load(std::memory_order_acquire))
        ++z;
}

int main()
{
    x = false;
    y = false;
    z = 0;
    std::thread a(write_x);
    std::thread b(write_y);
    std::thread c(read_x_then_y);
    std::thread d(read_y_then_x);
    a.join();
    b.join();
    c.join();
    d.join();
    assert(z.load() != 0);
}
So the different execution paths, that I can think of is this:
1)
Thread a (x is now true)
Thread c (fails to increment z)
Thread b (y is now true)
Thread d (increments z) assertion cannot fire
2)
Thread b (y is now true)
Thread d (fails to increment z)
Thread a (x is now true)
Thread c (increments z) assertion cannot fire
3)
Thread a (x is true)
Thread b (y is true)
Thread c (z is incremented) assertion cannot fire
Thread d (z is incremented)
Could someone explain to me how this assertion can fire?
He shows this little graphic:
Shouldn't the store to y also sync with the load in read_x_then_y, and the store to x sync with the load in read_y_then_x? I'm very confused.
EDIT:
Thank you for your responses, I understand how atomics work and how to use Acquire/Release. I just don't get this specific example. I was trying to figure out IF the assertion fires, then what did each thread do? And why does the assertion never fire if we use sequential consistency.
The way, I am reasoning about this is that if thread a (write_x) stores to x then all the work it has done so far is synced with any other thread that reads x with acquire ordering. Once read_x_then_y sees this, it breaks out of the loop and reads y. Now, 2 things could happen. In one option, the write_y has written to y, meaning this release will sync with the if statement (load) meaning z is incremented and assertion cannot fire. The other option is if write_y hasn't run yet, meaning the if condition fails and z isn't incremented, In this scenario, only x is true and y is still false. Once write_y runs, the read_y_then_x breaks out of its loop, however both x and y are true and z is incremented and the assertion does not fire. I can't think of any 'run' or memory ordering where z is never incremented. Can someone explain where my reasoning is flawed?
Also, I know The loop read will always be before the if statement read because the acquire prevents this reordering.
You are thinking in terms of sequential consistency, the strongest (and default) memory order. If this memory order is used, all accesses to atomic variables constitute a total order, and the assertion indeed cannot be triggered.
However, in this program, a weaker memory order is used (release stores and acquire loads). This means, by definition that you cannot assume a total order of operations. In particular, you cannot assume that changes become visible to other threads in the same order. (Only a total order on each individual variable is guaranteed for any atomic memory order, including memory_order_relaxed.)
The stores to x and y occur on different threads, with no synchronization between them. The loads of x and y occur on different threads, with no synchronization between them. This means it is entirely allowed that thread c sees x && ! y and thread d sees y && ! x. (I'm just abbreviating the acquire-loads here, don't take this syntax to mean sequentially consistent loads.)
Bottom line: Once you use a weaker memory order than sequentially consistent, you can kiss your notion of a global state of all atomics, that is consistent between all threads, goodbye. Which is exactly why so many people recommend sticking with sequential consistency unless you need the performance (BTW, remember to measure if it's even faster!) and are certain of what you are doing. Also, get a second opinion.
Now, whether you will get burned by this, is a different question. The standard simply allows a scenario where the assertion fails, based on the abstract machine that is used to describe the standard requirements. However, your compiler and/or CPU may not exploit this allowance for one reason or another. So it is possible that for a given compiler and CPU, you may never see that the assertion is triggered, in practice. Keep in mind that a compiler or CPU may always use a stricter memory order than the one you asked for, because this can never introduce violations of the minimum requirements from the standard. It may only cost you some performance – but that is not covered by the standard anyway.
UPDATE in response to comment: The standard defines no hard upper limit on how long it takes for one thread to see changes to an atomic by another thread. There is a recommendation to implementers that values should become visible eventually.
There are sequencing guarantees, but the ones pertinent to your example do not prevent the assertion from firing. The basic acquire-release guarantee is that if:
Thread e performs a release-store to an atomic variable x
Thread f performs an acquire-load from the same atomic variable
Then if the value read by f is the one that was stored by e, the store in e synchronizes-with the load in f. This means that any (atomic and non-atomic) store in e that was, in this thread, sequenced before the given store to x, is visible to any operation in f that is, in this thread, sequenced after the given load. [Note that there are no guarantees given regarding threads other than these two!]
So, there is no guarantee that f will read the value stored by e, as opposed to e.g. some older value of x. If it doesn't read the updated value, then also the load does not synchronize with the store, and there are no sequencing guarantees for any of the dependent operations mentioned above.
I liken atomics with a weaker memory order than sequentially consistent to the theory of relativity, where there is no global notion of simultaneity.
PS: That said, an atomic load cannot just read an arbitrary older value. For example, if one thread performs periodic increments (e.g. with release order) of an atomic<unsigned> variable, initialized to 0, and another thread periodically loads from this variable (e.g. with acquire order), then, except for eventual wrapping, the values seen by the latter thread must be monotonically increasing. But this follows from the given sequencing rules: Once the latter thread reads a 5, anything that happened before the increment from 4 to 5 is in the relative past of anything that follows the read of 5. In fact, a decrease other than wrapping is not even allowed for memory_order_relaxed, but this memory order does not make any promises to the relative sequencing (if any) of accesses to other variables.
The release-acquire synchronization has (at least) this guarantee: side-effects before a release on a memory location are visible after an acquire on this memory location.
There is no such guarantee if the memory location is not the same. More importantly, there's no total (think global) ordering guarantee.
Looking at the example, thread A makes thread C come out of its loop, and thread B makes thread D come out of its loop.
However, the way a release may "publish" to an acquire (or the way an acquire may "observe" a release) on the same memory location doesn't require total ordering. It's possible for thread C to observe A's release and thread D to observe B's release, and only somewhere in the future for C to observe B's release and for D to observe A's release.
The example has 4 threads because that's the minimal example in which you can force such non-intuitive behavior. If any of the atomic operations were done in the same thread, there would be an ordering you couldn't violate.
For instance, if write_x and write_y happened on the same thread, it would require that whatever thread observed a change in y would have to observe a change in x.
Similarly, if read_x_then_y and read_y_then_x happened on the same thread, you would observe changes to both x and y, at least in read_y_then_x.
Having write_x and read_x_then_y in the same thread would be pointless for the exercise, as it would become obvious it's not synchronizing correctly, as would be having write_x and read_y_then_x, which would always read the latest x.
EDIT:
The way, I am reasoning about this is that if thread a (write_x) stores to x then all the work it has done so far is synced with any other thread that reads x with acquire ordering.
(...) I can't think of any 'run' or memory ordering where z is never incremented. Can someone explain where my reasoning is flawed?
Also, I know The loop read will always be before the if statement read because the acquire prevents this reordering.
That's sequentially consistent order, which imposes a total order. That is, it imposes that write_x and write_y both be visible to all threads one after the other; either x then y or y then x, but the same order for all threads.
With release-acquire, there is no total order. The effects of a release are only guaranteed to be visible to a corresponding acquire on the same memory location. With release-acquire, the effects of write_x are guaranteed to be visible to whoever notices x has changed.
This noticing something changed is very important. If you don't notice a change, you're not synchronizing. As such, thread C is not synchronizing on y and thread D is not synchronizing on x.
Essentially, it's way easier to think of release-acquire as a change notification system that only works if you synchronize properly. If you don't synchronize, you may or may not observe side-effects.
Strong memory model hardware architectures with cache coherence even in NUMA, or languages/frameworks that synchronize in terms of total order, make it difficult to think in these terms, because it's practically impossible to observe this effect.
Let's walk through the parallel code:
void write_x()
{
x.store(true,std::memory_order_release);
}
void write_y()
{
y.store(true,std::memory_order_release);
}
There is nothing before these instructions (they are at the start of parallelism; everything that happened before them also happened-before the other threads), so they are not meaningfully releasing: they are effectively relaxed operations.
Let's walk through the parallel code again, noting that these two previous operations are not effective releases:
void read_x_then_y()
{
while(!x.load(std::memory_order_acquire)); // acquire what state?
if(y.load(std::memory_order_acquire))
++z;
}
void read_y_then_x()
{
while(!y.load(std::memory_order_acquire));
if(x.load(std::memory_order_acquire))
++z;
}
Note that all the loads refer to variables for which nothing is effectively released ever, so nothing is effectively acquired here: we re-acquire the visibility over the previous operations in main that are visible already.
So you see that all operations are effectively relaxed: they provide no visibility (over what was already visible). It's like doing an acquire fence just after an acquire fence, it's redundant. Nothing new is implied that wasn't already implied.
So now that everything is relaxed, all bets are off.
Another way to view this is to notice that an atomic load is not an RMW operation that leaves the value unchanged: an RMW can be a release, but a load cannot.
Just like all atomic stores are part of the modification order of an atomic variable even if the variable is effectively a constant (that is, a non-const variable whose value is always the same), an atomic RMW operation is somewhere in the modification order of an atomic variable, even if there was no change of value (and there cannot be a change of value when the code always compares and copies the exact same bit pattern).
In the modification order you can have release semantic (even if there was no modification).
If you protect a variable with a mutex you get release semantic (even if you just read the variable).
If you make all your loads (at least in functions that do more than once operation) release-modification-loads with:
either a mutex protecting the atomic object (then drop the atomic as it's now redundant!)
or a RMW with acq_rel order,
the previous proof that all operations are effectively relaxed doesn't work anymore, and some atomic operation in at least one of the read_A_then_B functions will have to be ordered before some operation in the other, as they operate on the same objects. If they are in the modification order of a variable and you use acq_rel, then you have a happens-before relation between one of them (obviously, which one happens before which is non-deterministic).
Either way, execution is now sequential, as all operations are effectively acquires and releases, that is, they operate as acquires and releases (even those that are effectively relaxed!).
If we change the two if statements to while statements, it will make the code correct and z will be guaranteed to be equal to 2.
void read_x_then_y()
{
while(!x.load(std::memory_order_acquire));
while(!y.load(std::memory_order_acquire));
++z;
}
void read_y_then_x()
{
while(!y.load(std::memory_order_acquire));
while(!x.load(std::memory_order_acquire));
++z;
}

Using an atomic read-modify-write operation in a release sequence

Say, I create an object of type Foo in thread #1 and want to be able to access it in thread #3.
I can try something like:
std::atomic<int> sync{10};
Foo *fp;
// thread 1: modifies sync: 10 -> 11
fp = new Foo;
sync.store(11, std::memory_order_release);
// thread 2a: modifies sync: 11 -> 12
while (sync.load(std::memory_order_relaxed) != 11);
sync.store(12, std::memory_order_relaxed);
// thread 3
while (sync.load(std::memory_order_acquire) != 12);
fp->do_something();
The store/release in thread #1 orders Foo with the update to 11
thread #2a advances the value of sync to 12 using separate relaxed load and store operations (not an atomic read-modify-write)
the synchronizes-with relationship between thread #1 and #3 is only established when #3 loads 11
The scenario is broken because thread #3 spins until it loads 12, which may arrive out of order (wrt 11) and Foo is not ordered with 12 (due to the relaxed operations in thread #2a).
This is somewhat counter-intuitive since the modification order of sync is 10 → 11 → 12
The standard says (§ 1.10.1-6):
an atomic store-release synchronizes with a load-acquire that takes its value from the store (29.3). [ Note: Except in the specified cases, reading a later value does not necessarily ensure visibility as described below. Such a requirement would sometimes interfere with efficient implementation. —end note ]
It also says in (§ 1.10.1-5):
A release sequence headed by a release operation A on an atomic object M is a maximal contiguous subsequence of side effects in the modification order of M, where the first operation is A, and every subsequent operation
- is performed by the same thread that performed A, or
- is an atomic read-modify-write operation.
Now, thread #2a is modified to use an atomic read-modify-write operation:
// thread 2b: modifies sync: 11 -> 12
int val;
while ((val = 11) && !sync.compare_exchange_weak(val, 12, std::memory_order_relaxed));
If this release sequence is correct, Foo is synchronized with thread #3 when it loads either 11 or 12.
My questions about the use of an atomic read-modify-write are:
Does the scenario with thread #2b constitute a correct release sequence ?
And if so:
What are the specific properties of a read-modify-write operation that ensure this scenario is correct ?
Does the scenario with thread #2b constitute a correct release sequence ?
Yes, per your quote from the standard.
What are the specific properties of a read-modify-write operation that ensure this scenario is correct?
Well, the somewhat circular answer is that the only important specific property is that "The C++ standard defines it so".
As a practical matter, one may ask why the standard defines it like this. I don't think you'll find that the answer has a deep theoretical basis: I think the committee could have also defined it such that the RMW doesn't participate in the release sequence, or (perhaps with more difficulty) have defined so that both the RMW and the separate mo_relaxed load and store participate in the release sequence, without compromising the "soundness" of the model.
They already give a performance-related reason as to why they didn't choose the latter approach:
Such a requirement would sometimes interfere with efficient implementation.
In particular, on any hardware platform that allowed load-store reordering, it would imply that even mo_relaxed loads and/or stores might require barriers! Such platforms exist today. Even on more strongly ordered platforms, it may inhibit compiler optimizations.
So why didn't they then take the other "consistent" approach of not requiring a mo_relaxed RMW to participate in the release sequence? Probably because existing hardware implementations of RMW operations provide such guarantees, and the nature of RMW operations makes it likely that this will remain true in the future. In particular, as Peter points out in the comments above, RMW operations, even with mo_relaxed, are conceptually and practically1 stronger than separate loads and stores: they would be quite useless if they didn't have a consistent total order.
Once you accept that is how hardware works, it makes sense from a performance angle to align the standard: if you didn't, you'd have people using more restrictive orderings such as mo_acq_rel just to get the release sequence guarantees, but on real hardware that has weakly ordered CAS, this doesn't come for free.
1 The "practically" part means that even the weakest forms of RMW instructions are usually relatively "expensive" operations taking a dozen cycles or more on modern hardware, while mo_relaxed loads and stores generally just compile to plain loads and stores in the target ISA.

C++ memory_order_consume, kill_dependency, dependency-ordered-before, synchronizes-with

I am reading C++ Concurrency in Action by Anthony Williams. Currently I am at the point where he describes memory_order_consume.
After that block there is:
Now that I’ve covered the basics of the memory orderings, it’s time to look at the
more complex parts
It scares me a little bit, because I don't fully understand several things:
How dependency-ordered-before differs from synchronizes-with? They both create happens-before relationship. What is exact difference?
I am confused about following example:
int global_data[] = { … };
std::atomic<int> index;

void f()
{
    int i = index.load(std::memory_order_consume);
    do_something_with(global_data[std::kill_dependency(i)]);
}
What exactly does kill_dependency do? Which dependency does it kill? Between which entities? And how can the compiler exploit that knowledge?
Can all occurrences of memory_order_consume be safely replaced with memory_order_acquire? I.e. is it stricter in all senses?
At Listing 5.9, can I safely replace
std::atomic<int> data[5]; // all accesses are relaxed
with
int data[5];
? I.e. can acquire and release be used to synchronize access to non-atomic data?
He describes relaxed, acquire and release with some examples involving men in cubicles. Are there similar simple descriptions of seq_cst and consume?
As to the next to last question, the answer takes a little more explanation. There are three things that can go wrong when multiple threads access the same data:
the system might switch threads in the middle of a read or write, producing a result that's half one value and half another;
the compiler might move code around, on the assumption that no other thread is looking at the data involved;
the processor may be keeping a value in its local cache, without updating main memory after changing the value or re-reading it after another thread changed the value in main memory.
Memory order addresses only number 3. The atomic functions address 1 and 2, and, depending on the memory order argument, maybe 3 as well. So memory_order_relaxed means "don't bother with number 3"; the code still handles 1 and 2. In that case, you'd use acquire and release to ensure proper memory ordering.
How dependency-ordered-before differs from synchronizes-with?
From 1.10/10: "[ Note: The relation “is dependency-ordered before” is analogous to “synchronizes with”, but uses release/consume in place of release/acquire. — end note ]".
What does kill_dependency exactly do?
Some compilers do data-dependency analysis. That is, they trace changes to values in variables in order to better figure out what has to be synchronized. kill_dependency tells such compilers not to trace any further because there's something going on in the code that the compiler wouldn't understand.
Can all occurrences of memory_order_consume be safely replaced with
memory_order_acquire? I.e. is it stricter in all senses?
I think so, but I'm not certain.
memory_order_consume requires that the atomic operation happens-before only those non-atomic operations that are data dependent on it. A data dependency is any dependency where you cannot evaluate an expression without using that data. For example, in x->y, there is no way to evaluate x->y without first evaluating x.
kill_dependency is a unique function. All other functions have a data dependency on their arguments; kill_dependency explicitly does not. It shows up when you know that the data itself is already synchronized, but the expression you need to get to the data may not be. In your example, do_something_with is allowed to assume any cached value of global_data[i] is safe to use, but i itself must actually be the correct atomic value.
memory_order_acquire is strictly stronger if all changes to the data are properly released with a matching memory_order_release.