Acquire/Release Visibility of Latest Operation - c++

There is a lot of subtlety in this topic and so much information to sift through. I couldn't find an existing question/answer that specifically addressed this question, so here goes.
If I have an atomic variable M of type std::atomic_int, where
Thread 1 performs M.store(1, memory_order_release)
Later, Thread 2 performs M.store(2, memory_order_release)
Even later, Thread 3 performs M.load(memory_order_acquire)
Is there any legitimate scenario in which Thread 3 could read the value 1 instead of 2?
My assumption is that it is impossible, because of write-write coherence and happens-before properties. But having spent an hour going over the C++ standard as well as cppreference, I still can't form a concise and definitive answer to this question.
I'd love to get an answer here with credible references. Thanks in advance.

The concept of "later" is not one which the C++ standard defines or uses, so literally speaking, this is not a question that can be answered with reference to the standard.
If we instead use the concept of happens before as defined in [intro.races p10], then your assumption is correct: Thread 3 cannot read the value 1. (I am using C++20 final draft N4860 here.)
M has a modification order [intro.races p4] which is a total order. By write-write coherence [intro.races p15], if #1 happens before #2, then the side effect storing the value 1 precedes the store of the value 2 in the modification order of M.
Then by write-read coherence [intro.races p17], if #2 happens before #3, then #3 must take its value from #2, or from some side effect which follows #2 in the modification order of M. But assuming that the program contains no other stores to M, there are no other side effects that follow #2 in the modification order, so #3 must in fact take its value from #2, which is to say that #3 must load the value 2. In particular #3 cannot take its value from #1, since #1 precedes #2 in the modification order of M and not the other way around. (The modification order is defined as a total order, which means it cannot contain cycles, so it is not possible for #1 and #2 to both precede each other.)
Note that the acquire and release orderings you have attached to operations #1, #2, #3 are totally irrelevant in this analysis, and everything would be the same if they were relaxed. Where memory ordering would matter is in the operations (not shown in your example) that enforce the assumption that #1 happens before #2 happens before #3.
For example, suppose that in Thread 1, operation #1 was sequenced before (i.e. precedes in program order) a store to some other atomic object Z of a particular value (let's say true). And likewise, that Thread 2 did not perform store #2 until after loading from Z and observing that the value loaded was true. Then the store to Z in Thread 1 must be release (or seq_cst), and the load of Z in Thread 2 must be acquire (or seq_cst). That way, you get the release store to Z to synchronize with the acquire load [atomics.order p2], and by chasing through [intro.races p9-10], you conclude that #1 happens before #2. But again, no particular ordering would be needed on the accesses to M.
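To make this concrete, here is a minimal sketch of the arrangement just described, using the flag Z from the example above (the function names are mine; nothing else is assumed beyond what the answer states):

#include <atomic>

std::atomic_int M{0};
std::atomic<bool> Z{false}; // the extra flag described above

void thread1() {
    M.store(1, std::memory_order_relaxed);    // #1: the ordering on M itself is irrelevant
    Z.store(true, std::memory_order_release); // release store to Z
}

void thread2() {
    while (!Z.load(std::memory_order_acquire)) {} // acquire load: synchronizes with thread1's release store
    // By [intro.races p9-10], #1 now happens before everything below
    M.store(2, std::memory_order_relaxed);    // #2
}

A symmetric handoff through a second flag, released by Thread 2 and acquired by Thread 3, would be needed to make #2 happen before #3.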
In fact, if you have happens-before ordering, then everything works fine even if M is not atomic at all, but is just an ordinary int, and #1, #2, #3 are just non-atomic ordinary writes and reads to M. Then the happens-before ensures that these accesses to M do not cause a data race in the sense of [intro.races p21], so the behavior of the program is well-defined. We can then refer to [intro.races p13], the "not visibly reordered" rule. It is true that #2 happens before #3, and there is no other side effect X on M such that #2 happens before X and X happens before #3. (In particular, this cannot be true of X = #1, because #1 happens before #2 and therefore not the other way around; by [intro.races p10] the "happens before" relation must not contain a cycle.) Thus, #3 must determine the value stored by #2, namely the value 2.
Either way, though, the devil is in the details. If you actually write this in a program, the important part of the analysis is to verify that whatever your program does to ensure that #2 is "later than" #1 actually imposes a happens-before ordering. Naive or intuitive notions of "later" do not necessarily suffice. For instance, checking the system time around accesses #1, #2, #3 is not good enough. Nor is something like the above example where the extra flag Z is only accessed with relaxed ordering in one or both places, or worse, is not atomic at all.
If you cannot guarantee, by means of additional operations, that #1 happens before #2 happens before #3, then it is definitely not guaranteed that #3 loads the value 2, not even if you soup up all of #1, #2, #3 to be seq_cst.

As almost everyone noticed in the comments, there are no guarantees if "later" refers to something external to the memory model, like elapsed wall-clock time.
But if "later" actually means the inverse of "happens before", then the question largely answers itself, since "happens before" (for this case) boils down to "synchronizes-with". And "synchronizes-with" is defined as:
If an atomic store in thread A is a release operation, an atomic load in thread B from the same variable is an acquire operation, and the load in thread B reads a value written by the store in thread A, then the store in thread A synchronizes-with the load in thread B.
In other words, for #2 to happen before #3, we would need the following code in thread 3 (or its equivalent with another atomic):
while (M.load(memory_order_acquire) != 2);
// At this point, #2 happens before #3
Then, of course, in the absence of any other stores to this atomic, it will have the value 2. But there is no guarantee that thread 3 will not read the value 1 before reading the value 2 inside the while loop, since at that moment it has not yet "synchronized-with" thread 2.
And, as you already suggested yourself, if #1 happens before #2, then due to write-write coherence we will never see the value 1 after having read the value 2.

Related

Memory ordering in sub-expressions of expressions carrying a dependency

Consider the following code:
#include <atomic>
#include <cassert>

int data = 0;
std::atomic<int> ready = 0;

void writer_thread() {
    data = 25;
    ready.store(1, std::memory_order::release);
}

void reader_thread() {
    int ready_value = 0;
    while (!ready_value) {
        ready_value = ready.load(std::memory_order::consume);
    }
    assert(data == 25);               // #1
    assert(data + ready_value == 26); // #2
}
The assertion marked #1 may or may not pass, because data does not carry a dependency from ready, which means reading it is a data race.
Assertion #2 is more complicated. It would seem that the entire expression data + ready_value == 26 carries a dependency from ready, as explained in [intro.races#7] - an expression carries a dependency from its operands, and this relation is transitive.
But even if we establish that this is true, what does it imply about the memory ordering within this expression? Consider the following paragraph from cppreference:
If an atomic store in thread A is tagged memory_order_release and an atomic load in thread B from the same variable that read the stored value is tagged memory_order_consume, all memory writes (non-atomic and relaxed atomic) that happened-before the atomic store from the point of view of thread A, become visible side-effects within those operations in thread B into which the load operation carries dependency, that is, once the atomic load is completed, those operators and functions in thread B that use the value obtained from the load are guaranteed to see what thread A wrote to memory.
I am not sure how to interpret the emphasized part (the clause about "those operations in thread B into which the load operation carries dependency"). Does the guarantee about the visibility of the side-effects apply to every sub-expression as well?
Let's take the addition expression data + ready_value, which, as a whole, carries a dependency from ready. While evaluating the data sub-expression of this expression, are the side-effects visible or not?
If they are, then I believe the assertion #2 does not contain a race condition.
But if they are not, then the evaluation of the sub-expression data might produce a 0, and the assertion contains a race condition.
Assertion #2 is more complicated.
No, it isn't:
The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, except for the special case for signal handlers described below. Any such data race results in undefined behavior.
Accessing data is an "action". Both reader_thread and writer_thread attempt to access data. Pursuant to [intro.races]/2:
Two expression evaluations conflict if one of them modifies a memory location ([intro.memory]) and the other one reads or modifies the same memory location.
writer_thread assigns to data. reader_thread reads from data. The two actions therefore conflict.
Both accesses are "potentially concurrent", as they are accessed from different threads.
Neither access is atomic. Neither access is one of the signal handling exceptions.
That leaves "happens before". Well, consume only matters for the atomic object being accessed and the expressions and side-effects that generated that value. Consume only gives us the "dependency-ordered before" relation. And this only applies to the written atomic object and any side-effects used to write that atomic value, through the "carries dependency" relationship as you cited.
ready_value might carry the dependency of ready, but data is not part of that dependency chain.
As such, writer_thread's write to data does not "happen before" reader_thread's read of data. Therefore, accessing data is undefined behavior.
Nowhere is there a statement about accessing ready or ready_value.
As for whether the visibility guarantee applies to every sub-expression in the dependency chain: always. That's just how consume works.
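The usual fix, sketched below under the assumption that consume can simply be replaced (implementations in practice promote consume to acquire anyway), is an acquire load, whose synchronizes-with relation covers all prior writes rather than only the dependency chain:

void reader_thread() {
    int ready_value = 0;
    while (!ready_value) {
        // acquire: once this load reads the value written by the release store,
        // every write sequenced before that store (including data = 25) is visible
        ready_value = ready.load(std::memory_order::acquire);
    }
    assert(data == 25);               // #1: now guaranteed to pass
    assert(data + ready_value == 26); // #2: likewise
}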

Is the concept of release-sequence useful in practice?

C++ atomic semantics only guarantee visibility (through the happens-before relation) of memory operations performed by the last thread that did a release write (plain or read-modify-write) operation.
Consider
int x, y;
atomic<int> a;
Thread 1:
x = 1;
a.store(1, memory_order_release);
Thread 2:
y = 2;
if (a.load(memory_order_relaxed) == 1)
    a.store(2, memory_order_release);
Then the observation of a == 2 implies visibility of thread 2's operations (y == 2) but not thread 1's (one cannot even safely read x).
As far as I know, real implementations of multithreading are built on fences (and sometimes release stores), not on happens-before or release-sequence, which are high-level C++ concepts; I fail to see what real hardware details these concepts map to.
How can a real implementation not guarantee visibility of thread 1 memory operations when the value of 2 in a is globally visible?
In other words, is there any good in the release-sequence definition? Why wouldn't the release-sequence extend to every subsequent modification in the modification order?
Consider in particular silly-thread 3:
if (a.load(memory_order_relaxed) == 2)
    a.store(2, memory_order_relaxed);
Can silly-thread 3 ever suppress any visibility guarantee on any real hardware? In other words, if value 2 is globally visible, how would making it again globally visible break any ordering?
Is my mental model of real multiprocessing incorrect? Can a value be partially visible, on some CPU but not on another?
(Of course I assume non-crazy semantics for relaxed writes, as writes that go back in time make the language semantics of C++ absolutely nonsensical, unlike safe languages like Java that always have bounded semantics. No real implementation can have crazy, non-causal relaxed semantics.)
Let's first answer your question:
Why wouldn't the release-sequence extend to every subsequent modification in the modification order?
Because if so, we would lose some potential optimization. For example, consider the thread:
x = 1; // #1
a.store(1,memory_order_relaxed); // #2
Under the current rules, the compiler is able to reorder #1 and #2. However, if release sequences were extended in that way, the compiler would not be allowed to reorder the two lines, because another thread like your thread 2 could create a release sequence headed by #2 and tailed by a release operation, so some acquire load in yet another thread might synchronize with #2.
You give a specific example, and claim that all implementations would produce a specific outcome while the language rules do not guarantee this outcome. This is not a problem, because the language rules are intended to handle all cases, not only your specific example. Of course the language rules could be improved so that they guarantee the expected outcome for your specific example, but that is not trivial work. At least, as argued above, simply extending the definition of release sequences is not an acceptable solution.
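For completeness, here is a sketch of how thread 2 could stay inside the release sequence by using a read-modify-write (a later question below covers exactly this); the retry-free CAS is my simplification of the original load-then-store:

// Thread 2, reworked: a successful compare-exchange is an RMW, so it
// continues the release sequence headed by thread 1's release store
y = 2;
int expected = 1;
a.compare_exchange_strong(expected, 2, std::memory_order_release);

// Thread 3: an acquire load that reads 2 takes its value from the release
// sequence headed by thread 1's store, so it synchronizes with thread 1
// (making x == 1 visible) and, because the RMW is itself a release
// operation, with thread 2 as well (making y == 2 visible)
if (a.load(std::memory_order_acquire) == 2) {
    assert(x == 1 && y == 2);
}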

Acquire/release semantics with 4 threads

I am currently reading C++ Concurrency in Action by Anthony Williams. One of his listings shows this code, and he states that the assertion that z != 0 can fire.
#include <atomic>
#include <thread>
#include <assert.h>

std::atomic<bool> x, y;
std::atomic<int> z;

void write_x()
{
    x.store(true, std::memory_order_release);
}

void write_y()
{
    y.store(true, std::memory_order_release);
}

void read_x_then_y()
{
    while (!x.load(std::memory_order_acquire));
    if (y.load(std::memory_order_acquire))
        ++z;
}

void read_y_then_x()
{
    while (!y.load(std::memory_order_acquire));
    if (x.load(std::memory_order_acquire))
        ++z;
}

int main()
{
    x = false;
    y = false;
    z = 0;
    std::thread a(write_x);
    std::thread b(write_y);
    std::thread c(read_x_then_y);
    std::thread d(read_y_then_x);
    a.join();
    b.join();
    c.join();
    d.join();
    assert(z.load() != 0);
}
So the different execution paths that I can think of are these:
1)
Thread a (x is now true)
Thread c (fails to increment z)
Thread b (y is now true)
Thread d (increments z) assertion cannot fire
2)
Thread b (y is now true)
Thread d (fails to increment z)
Thread a (x is now true)
Thread c (increments z) assertion cannot fire
3)
Thread a (x is true)
Thread b (y is true)
Thread c (z is incremented)
Thread d (z is incremented) assertion cannot fire
Could someone explain to me how this assertion can fire?
He also shows a little graphic (not reproduced here) illustrating the happens-before relationships.
Shouldn't the store to y also sync with the load in read_x_then_y, and the store to x sync with the load in read_y_then_x? I'm very confused.
EDIT:
Thank you for your responses. I understand how atomics work and how to use acquire/release; I just don't get this specific example. I was trying to figure out: if the assertion fires, then what did each thread do? And why does the assertion never fire if we use sequential consistency?
The way I am reasoning about this is that if thread a (write_x) stores to x, then all the work it has done so far is synced with any other thread that reads x with acquire ordering. Once read_x_then_y sees this, it breaks out of the loop and reads y. Now, two things could happen. In one option, write_y has written to y, meaning this release will sync with the if statement (load), so z is incremented and the assertion cannot fire. The other option is that write_y hasn't run yet, meaning the if condition fails and z isn't incremented; in this scenario, only x is true and y is still false. Once write_y runs, read_y_then_x breaks out of its loop, but by then both x and y are true, so z is incremented and the assertion does not fire. I can't think of any 'run' or memory ordering where z is never incremented. Can someone explain where my reasoning is flawed?
Also, I know the loop read will always be before the if-statement read, because the acquire prevents this reordering.
You are thinking in terms of sequential consistency, the strongest (and default) memory order. If this memory order is used, all accesses to atomic variables constitute a total order, and the assertion indeed cannot be triggered.
However, in this program, a weaker memory order is used (release stores and acquire loads). This means, by definition, that you cannot assume a total order of operations. In particular, you cannot assume that changes become visible to other threads in the same order. (Only a total order on each individual variable is guaranteed for any atomic memory order, including memory_order_relaxed.)
The stores to x and y occur on different threads, with no synchronization between them. The loads of x and y occur on different threads, with no synchronization between them. This means it is entirely allowed that thread c sees x && ! y and thread d sees y && ! x. (I'm just abbreviating the acquire-loads here, don't take this syntax to mean sequentially consistent loads.)
Bottom line: Once you use a weaker memory order than sequentially consistent, you can kiss your notion of a global state of all atomics, that is consistent between all threads, goodbye. Which is exactly why so many people recommend sticking with sequential consistency unless you need the performance (BTW, remember to measure if it's even faster!) and are certain of what you are doing. Also, get a second opinion.
Now, whether you will get burned by this is a different question. The standard simply allows a scenario where the assertion fails, based on the abstract machine that is used to describe the standard requirements. However, your compiler and/or CPU may not exploit this allowance for one reason or another, so it is possible that for a given compiler and CPU, you may never see the assertion fire in practice. Keep in mind that a compiler or CPU may always use a stricter memory order than the one you asked for, because this can never introduce violations of the minimum requirements from the standard. It may only cost you some performance – but that is not covered by the standard anyway.
UPDATE in response to comment: The standard defines no hard upper limit on how long it takes for one thread to see changes to an atomic by another thread. There is a recommendation to implementers that values should become visible eventually.
There are sequencing guarantees, but the ones pertinent to your example do not prevent the assertion from firing. The basic acquire-release guarantee is that if:
- Thread e performs a release-store to an atomic variable x, and
- Thread f performs an acquire-load from the same atomic variable,
Then if the value read by f is the one that was stored by e, the store in e synchronizes-with the load in f. This means that any (atomic and non-atomic) store in e that was, in this thread, sequenced before the given store to x, is visible to any operation in f that is, in this thread, sequenced after the given load. [Note that there are no guarantees given regarding threads other than these two!]
So, there is no guarantee that f will read the value stored by e, as opposed to e.g. some older value of x. If it doesn't read the updated value, then also the load does not synchronize with the store, and there are no sequencing guarantees for any of the dependent operations mentioned above.
I liken atomics with memory orders weaker than sequentially consistent to the Theory of Relativity, where there is no global notion of simultaneity.
PS: That said, an atomic load cannot just read an arbitrary older value. For example, if one thread performs periodic increments (e.g. with release order) of an atomic<unsigned> variable, initialized to 0, and another thread periodically loads from this variable (e.g. with acquire order), then, except for eventual wrapping, the values seen by the latter thread must be monotonically increasing. But this follows from the given sequencing rules: Once the latter thread reads a 5, anything that happened before the increment from 4 to 5 is in the relative past of anything that follows the read of 5. In fact, a decrease other than wrapping is not even allowed for memory_order_relaxed, but this memory order does not make any promises to the relative sequencing (if any) of accesses to other variables.
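A minimal sketch of that counter scenario (the names and the loop bound are mine):

#include <atomic>
#include <cassert>

std::atomic<unsigned> counter{0};

void incrementer() {
    for (int i = 0; i < 1000; ++i)
        counter.fetch_add(1, std::memory_order_release); // periodic release increments
}

void observer() {
    unsigned last = 0;
    for (int i = 0; i < 1000; ++i) {
        unsigned now = counter.load(std::memory_order_acquire);
        assert(now >= last); // coherence: successive loads can never observe
                             // the counter going backwards (barring wrap-around)
        last = now;
    }
}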
The release-acquire synchronization has (at least) this guarantee: side-effects before a release on a memory location are visible after an acquire on this memory location.
There is no such guarantee if the memory location is not the same. More importantly, there's no total (think global) ordering guarantee.
Looking at the example, thread A makes thread C come out of its loop, and thread B makes thread D come out of its loop.
However, the way a release may "publish" to an acquire (or the way an acquire may "observe" a release) on the same memory location doesn't require total ordering. It's possible for thread C to observe A's release and thread D to observe B's release, and only somewhere in the future for C to observe B's release and for D to observe A's release.
The example has 4 threads because that's the minimal example with which you can force such non-intuitive behavior. If any of the atomic operations were done in the same thread, there would be an ordering you couldn't violate.
For instance, if write_x and write_y happened on the same thread, it would require that whatever thread observed a change in y would have to observe a change in x.
Similarly, if read_x_then_y and read_y_then_x happened on the same thread, you would observe changes to both x and y, at least in read_y_then_x.
Having write_x and read_x_then_y in the same thread would be pointless for the exercise, as it would become obvious it's not synchronizing correctly, as would having write_x and read_y_then_x, which would always read the latest x.
EDIT:
The way, I am reasoning about this is that if thread a (write_x) stores to x then all the work it has done so far is synced with any other thread that reads x with acquire ordering.
(...) I can't think of any 'run' or memory ordering where z is never incremented. Can someone explain where my reasoning is flawed?
Also, I know The loop read will always be before the if statement read because the acquire prevents this reordering.
That's sequentially consistent order, which imposes a total order. That is, it imposes that write_x and write_y both be visible to all threads one after the other; either x then y or y then x, but the same order for all threads.
With release-acquire, there is no total order. The effects of a release are only guaranteed to be visible to a corresponding acquire on the same memory location. With release-acquire, the effects of write_x are guaranteed to be visible to whoever notices x has changed.
This noticing something changed is very important. If you don't notice a change, you're not synchronizing. As such, thread C is not synchronizing on y and thread D is not synchronizing on x.
Essentially, it's way easier to think of release-acquire as a change notification system that only works if you synchronize properly. If you don't synchronize, you may or may not observe side-effects.
Hardware architectures with strong memory models and cache coherence (even across NUMA), and languages/frameworks that synchronize in terms of a total order, make it difficult to think in these terms, because there it's practically impossible to observe this effect.
Let's walk through the parallel code:
void write_x()
{
    x.store(true, std::memory_order_release);
}

void write_y()
{
    y.store(true, std::memory_order_release);
}
There is nothing before these instructions (they are at the start of parallelism; everything that happened before them also happened before everything in the other threads), so they are not meaningfully releasing: they are effectively relaxed operations.
Let's walk through the parallel code again, noting that these two previous operations are not effective releases:
void read_x_then_y()
{
    while (!x.load(std::memory_order_acquire)); // acquire what state?
    if (y.load(std::memory_order_acquire))
        ++z;
}

void read_y_then_x()
{
    while (!y.load(std::memory_order_acquire));
    if (x.load(std::memory_order_acquire))
        ++z;
}
Note that all the loads refer to variables for which nothing is ever effectively released, so nothing is effectively acquired here: we re-acquire visibility of the previous operations in main, which are visible already.
So you see that all operations are effectively relaxed: they provide no visibility (over what was already visible). It's like doing an acquire fence just after an acquire fence; it's redundant. Nothing new is implied that wasn't already implied.
So now that everything is relaxed, all bets are off.
Another way to view this is to notice that an atomic load is not the same as an RMW operation that leaves the value unchanged, since an RMW can be a release operation while a load cannot.
Just as all atomic stores are part of the modification order of an atomic variable, even if the variable is effectively a constant (that is, a non-const variable whose value is always the same), an atomic RMW operation has a place in the modification order of an atomic variable, even if there was no change of value (and there cannot be a change of value when the code always compares and copies the exact same bit pattern).
In the modification order you can have release semantic (even if there was no modification).
If you protect a variable with a mutex you get release semantic (even if you just read the variable).
If you make all your loads (at least in functions that do more than one operation) release-modification-loads, with:
- either a mutex protecting the atomic object (then drop the atomic, as it's now redundant!)
- or an RMW with acq_rel ordering,
then the previous proof that all operations are effectively relaxed no longer works: some atomic operation in at least one of the read_A_then_B functions will have to be ordered before some operation in the other, as they operate on the same objects. If they are in the modification order of a variable and you use acq_rel, you get a happens-before relation between two of these operations (which one happens before which is, obviously, non-deterministic).
Either way, execution is now sequential, as all operations are effectively acquire and release, that is, operative acquire and release (even those that are effectively relaxed!).
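As a sketch of what this proposal amounts to in code (my illustration, not code from the book), here is the acq_rel RMW variant, using a compare-exchange as a "load" of an atomic<bool>, since atomic<bool> has no fetch_-style operations:

// A "load" implemented as a read-modify-write: when it succeeds, it stores
// back the value it read, so it takes a place in the modification order of v
// and can carry acq_rel semantics; on failure it is just an acquire load.
bool rmw_load(std::atomic<bool>& v) {
    bool expected = true;
    return v.compare_exchange_strong(expected, true,
                                     std::memory_order_acq_rel,
                                     std::memory_order_acquire);
}

void read_x_then_y()
{
    while (!rmw_load(x));
    if (rmw_load(y))
        ++z;
}

read_y_then_x would be reworked the same way.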
If we change the two if statements to while statements, the code becomes correct and z is guaranteed to be equal to 2.
void read_x_then_y()
{
    while (!x.load(std::memory_order_acquire));
    while (!y.load(std::memory_order_acquire));
    ++z;
}

void read_y_then_x()
{
    while (!y.load(std::memory_order_acquire));
    while (!x.load(std::memory_order_acquire));
    ++z;
}

confused about atomic class: memory_order_relaxed

I am studying this site: https://gcc.gnu.org/wiki/Atomic/GCCMM/AtomicSync, which is very helpful for understanding atomics.
But this example about relaxed mode is hard to understand:
/* Thread 1: */
y.store(20, memory_order_relaxed);
x.store(10, memory_order_relaxed);

/* Thread 2: */
if (x.load(memory_order_relaxed) == 10)
{
    assert(y.load(memory_order_relaxed) == 20); /* assert A */
    y.store(10, memory_order_relaxed);
}

/* Thread 3: */
if (y.load(memory_order_relaxed) == 10)
    assert(x.load(memory_order_relaxed) == 10); /* assert B */
To me, assert B should never fail, since x must be 10 and y must be 10, because thread 2 has conditioned on this.
But the website says either assert in this example can actually FAIL.
To me, assert B should never fail, since x must be 10 and y must be 10, because thread 2 has conditioned on this.
In effect, your argument is that since in thread 2 the store of 10 into x occurred before the store of 10 into y, in thread 3 the same must be the case.
However, since you are only using relaxed memory operations, there is nothing in the code that requires two different threads to agree on the ordering between modifications of different variables. So indeed thread 2 may see the store of 10 into x before the store of 10 into y while thread 3 sees those two operations in the opposite order.
In order to ensure that assert B succeeds, you would, in effect, need to ensure that when thread 3 sees the value 10 of y, it also sees any other side effects performed by the thread that stored 10 into y before the time of the store. That is, you need the store of 10 into y to synchronize with the load of 10 from y. This can be done by having the store perform a release and the load perform an acquire:
// thread 2
y.store (10, memory_order_release);
// thread 3
if (y.load (memory_order_acquire) == 10)
A release operation synchronizes with an acquire operation that reads the value stored. Now because the store in thread 2 synchronizes with the load in thread 3, anything that happens after the load in thread 3 will see the side effects of anything that happens before the store in thread 2. Hence the assertion will succeed.
Of course, we also need to make sure assertion A succeeds, by making the x.store in thread 1 use release and the x.load in thread 2 use acquire.
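Putting both fixes together, a sketch of the example with the two release/acquire pairs in place:

/* Thread 1 */
y.store(20, memory_order_relaxed);
x.store(10, memory_order_release);   // makes y == 20 visible to an acquirer of x

/* Thread 2 */
if (x.load(memory_order_acquire) == 10)   // synchronizes with thread 1
{
    assert(y.load(memory_order_relaxed) == 20); /* assert A: guaranteed */
    y.store(10, memory_order_release);   // makes x == 10 visible to an acquirer of y
}

/* Thread 3 */
if (y.load(memory_order_acquire) == 10)   // synchronizes with thread 2
    assert(x.load(memory_order_relaxed) == 10); /* assert B: guaranteed */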
I find it much easier to understand atomics with some knowledge of what might be causing these effects, so here's some background. Know that these concepts are in no way stated in the C++ language itself; they are just some of the possible reasons why things are the way they are.
Compiler reordering
Compilers, when optimizing, will often choose to refactor the program as long as its effects are the same for a single-threaded program. This is circumvented with the use of atomics, which tell the compiler (among other things) that the variable might change at any moment and that its value might be read elsewhere.
Formally, atomics ensure one thing: there will be no data races. That is, accessing the variable will not make your computer explode.
CPU reordering
The CPU might reorder instructions as it executes them, meaning instructions can be reordered at the hardware level, independent of how you wrote the program.
Caching
Finally, there are the effects of caches: faster memories that each hold a partial copy of global memory. Caches are not always in sync, meaning they don't always agree on what is "correct". Different threads may not be using the same cache, and because of this, they may not agree on what values variables have.
Back to the problem
What the above amounts to is pretty much what C++ says about the matter: unless explicitly stated otherwise, the ordering of the side effects of each instruction is totally and completely unspecified. It might not even be the same viewed from different threads.
Formally, the guarantee of an ordering of side effects is called a happens-before relation; without it, there is no ordering guarantee at all. Loosely, we just call it synchronization.
Now, what is memory_order_relaxed? It tells the compiler to stop meddling, but places no constraints on how the CPU and cache (and possibly other things) behave. Therefore, one possible explanation of the "impossible" assert is:
Thread 1 stores 20 to y and then 10 to x to its cache.
Thread 2 reads the new values and stores 10 to y to its cache.
Thread 3 doesn't read the values from thread 1, but does read those of thread 2, and then the assert fails.
This might be completely different from what happens in reality, the point is anything can happen.
To ensure a happens-before relation between the multiple reads and writes, see Brian's answer.
Another construct that provides happens-before relations is std::mutex, which is why they are free from such insanities.
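For example, here is a minimal mutex sketch (all names are mine): unlocking a mutex synchronizes with the next lock of the same mutex, so the plain int accesses below are ordered without any atomics:

#include <cassert>
#include <mutex>

std::mutex m;
int x = 0, y = 0; // plain, non-atomic variables

void writer() {
    std::lock_guard<std::mutex> lock(m);
    y = 20;
    x = 10;
} // unlock acts as a release

void reader() {
    std::lock_guard<std::mutex> lock(m); // lock acquires from the previous unlock
    if (x == 10)
        assert(y == 20); // guaranteed: everything written under the lock is visible
}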
The answer to your question is in the C++ standard itself.
The section [intro.races] is surprisingly clear (which is not typical of normative text, where formal consistency often hurts readability).
I have read many books and tutorials that treat the subject of memory ordering, and they just confused me.
Finally I read the C++ standard; the section [intro.multithread] is the clearest treatment I have found. Taking the time to read it carefully (twice) may save you some time!
The answer to your question is in [intro.races]/4:
All modifications to a particular atomic object M occur in some particular total order, called the modification order of M. [ Note: There is a separate order for each atomic object. There is no requirement that these can be combined into a single total order for all objects. In general this will be impossible since different threads may observe modifications to different objects in inconsistent orders. — end note ]
You were expecting a single total order on all atomic operations. There is such an order, but only for atomic operations that are memory_order_seq_cst as explained in [atomics.order]/3:
There shall be a single total order S on all memory_order_seq_cst operations, consistent with the "happens before" order and modification orders for all affected locations [...]
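Under that rule, a sketch of the alternative fix for this example: tagging every access memory_order_seq_cst places all of them in the single total order S, at the cost of stronger fences, and then both asserts are guaranteed:

/* Thread 1 */
y.store(20, memory_order_seq_cst);
x.store(10, memory_order_seq_cst);

/* Thread 2 */
if (x.load(memory_order_seq_cst) == 10)
{
    assert(y.load(memory_order_seq_cst) == 20); /* assert A: guaranteed */
    y.store(10, memory_order_seq_cst);
}

/* Thread 3 */
if (y.load(memory_order_seq_cst) == 10)
    assert(x.load(memory_order_seq_cst) == 10); /* assert B: guaranteed */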

Using an atomic read-modify-write operation in a release sequence

Say, I create an object of type Foo in thread #1 and want to be able to access it in thread #3.
I can try something like:
std::atomic<int> sync{10};
Foo *fp;
// thread 1: modifies sync: 10 -> 11
fp = new Foo;
sync.store(11, std::memory_order_release);
// thread 2a: modifies sync: 11 -> 12
while (sync.load(std::memory_order_relaxed) != 11);
sync.store(12, std::memory_order_relaxed);
// thread 3
while (sync.load(std::memory_order_acquire) != 12);
fp->do_something();
- The store/release in thread #1 orders Foo with the update to 11.
- Thread #2a non-atomically increments the value of sync to 12.
- The synchronizes-with relationship between threads #1 and #3 is only established when #3 loads 11.
The scenario is broken because thread #3 spins until it loads 12, which may arrive out of order (w.r.t. 11), and Foo is not ordered with 12 (due to the relaxed operations in thread #2a).
This is somewhat counter-intuitive, since the modification order of sync is 10 → 11 → 12.
The standard says (§ 1.10.1-6):
an atomic store-release synchronizes with a load-acquire that takes its value from the store (29.3). [ Note: Except in the specified cases, reading a later value does not necessarily ensure visibility as described below. Such a requirement would sometimes interfere with efficient implementation. —end note ]
It also says in (§ 1.10.1-5):
A release sequence headed by a release operation A on an atomic object M is a maximal contiguous subsequence of side effects in the modification order of M, where the first operation is A, and every subsequent operation
- is performed by the same thread that performed A, or
- is an atomic read-modify-write operation.
Now, thread #2a is modified to use an atomic read-modify-write operation:
// thread 2b: modifies sync: 11 -> 12
int val;
while ((val = 11) && !sync.compare_exchange_weak(val, 12, std::memory_order_relaxed)); // (val = 11) re-arms the expected value each iteration, since a failed CAS overwrites val
If this release sequence is correct, Foo is synchronized with thread #3 when it loads either 11 or 12.
My questions about the use of an atomic read-modify-write are:
Does the scenario with thread #2b constitute a correct release sequence ?
And if so:
What are the specific properties of a read-modify-write operation that ensure this scenario is correct ?
Does the scenario with thread #2b constitute a correct release sequence ?
Yes, per your quote from the standard.
What are the specific properties of a read-modify-write operation that ensure this scenario is correct?
Well, the somewhat circular answer is that the only important specific property is that "The C++ standard defines it so".
As a practical matter, one may ask why the standard defines it like this. I don't think you'll find that the answer has a deep theoretical basis: I think the committee could have also defined it such that the RMW doesn't participate in the release sequence, or (perhaps with more difficulty) have defined so that both the RMW and the separate mo_relaxed load and store participate in the release sequence, without compromising the "soundness" of the model.
They already give a performance-related reason as to why they didn't choose the latter approach:
Such a requirement would sometimes interfere with efficient implementation.
In particular, on any hardware platform that allowed load-store reordering, it would imply that even mo_relaxed loads and/or stores might require barriers! Such platforms exist today. Even on more strongly ordered platforms, it may inhibit compiler optimizations.
So why didn't they then take the other "consistent" approach of not requiring relaxed RMWs to participate in the release sequence? Probably because existing hardware implementations of RMW operations provide such guarantees, and the nature of RMW operations makes it likely that this will remain true in the future. In particular, as Peter points out in the comments above, RMW operations, even with mo_relaxed, are conceptually and practically¹ stronger than separate loads and stores: they would be quite useless if they didn't have a consistent total order.
Once you accept that this is how hardware works, it makes sense from a performance angle to align the standard: if you didn't, you'd have people using more restrictive orderings such as mo_acq_rel just to get the release-sequence guarantees, but on real hardware that has weakly ordered CAS, this doesn't come for free.
¹ The "practically" part means that even the weakest forms of RMW instructions are usually relatively "expensive" operations taking a dozen cycles or more on modern hardware, while mo_relaxed loads and stores generally just compile to plain loads and stores in the target ISA.