What is the effect of the change to the definition of release sequences in the C++20 memory model?

Consider this program:
-- Initially --
std::atomic<int> x{0};
int y{0};
-- Thread 1 --
y = 1; // A
x.store(1, std::memory_order_release); // B
x.store(3, std::memory_order_relaxed); // C
-- Thread 2 --
if (x.load(std::memory_order_acquire) == 3) // D
print(y); // E
Under the C++11 memory model, if the program prints anything then it prints 1.
In the C++20 memory model, release sequences were changed to exclude writes performed by the same thread. How does that affect this program? Could it now have a data-race and print either 0 or 1?
Notes
This code appears in P0982R1: Weaken Release Sequences which I believe is the paper that resulted in the changes to the definition of release sequences in C++20. In that particular example, there is a third thread making a store to x which disrupts the release sequence in a way that is counter-intuitive. That motivates the need to weaken the release sequence definition.
From reading the paper my understanding is that with the C++20 changes, C will no longer form part of the release sequence headed by B, because C is not a Read-Modify-Write operation. Therefore C does not synchronize with D. Thus there is no happens-before relation between A and E.
Since B and C are stores to the same atomic variable and all threads must agree on the modification order of that variable, does the C++20 memory model allow us to infer anything about whether A happens-before E?

Your understanding is correct; the program has a data race. The store of 3 does not form part of any release sequence, so D is not synchronized with any release store. There is thus no way to establish any happens-before relationship between any two operations from the two different threads, and in particular, no happens-before between A and E.
I think the only thing you can infer from the load of 3 is that D definitely does not happen before C; if it did, then D would be obliged to load a value that was strictly earlier in the modification order of x [read-write coherence, intro.races p17]. That means in particular that E does not happen before A.
The modification order would come into play if you were to load from x again in Thread 2 somewhere after D. Then you would be guaranteed to load the value 3 again. That follows from read-read coherence [intro.races p16]. Your second load is not allowed to observe anything that preceded 3 in the modification order, so it cannot load the values 0 or 1. This would apply even if all the loads and stores in both threads were relaxed.
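To make that concrete, here is a sketch of the program with a hypothetical second load added to Thread 2 (E stays commented out, since reading y there would still be the data race discussed above). The assert can never fire on a conforming implementation, even if every operation were relaxed:
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0};
int y{0};

int main() {
    std::thread t1([] {
        y = 1;                                  // A
        x.store(1, std::memory_order_release);  // B
        x.store(3, std::memory_order_relaxed);  // C
    });
    std::thread t2([] {
        if (x.load(std::memory_order_acquire) == 3) {  // D
            // print(y);  // E: a data race, as explained above
            int again = x.load(std::memory_order_relaxed);
            // Read-read coherence: a later load in this thread cannot
            // observe 0 or 1, which precede 3 in the modification order of x.
            assert(again == 3);
        }
    });
    t1.join();
    t2.join();
}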

Related

How to understand the changes to sequentially-consistent ordering in C++20?

P0668R5 made some changes to the sequentially-consistent ordering. The following example, from the proposal (also from cppreference), describes the motivation for the modification.
// Thread 1:
x.store(1, std::memory_order_seq_cst); // A
y.store(1, std::memory_order_release); // B
// Thread 2:
r1 = y.fetch_add(1, std::memory_order_seq_cst); // C
r2 = y.load(std::memory_order_relaxed); // D
// Thread 3:
y.store(3, std::memory_order_seq_cst); // E
r3 = x.load(std::memory_order_seq_cst); // F
where the initial values of x and y are 0.
According to the proposal, r1 is observed to be 1, r2 is 3, and r3 is 0. But this is not allowed by the pre-modified standard.
The indicated outcome here is disallowed by the current standard: All memory_order_seq_cst accesses must occur in a single total order, which is constrained to have F before A (since it doesn't observe the store), which must be before C (since it happens before it), which must be before E (since the fetch_add does not observe the store, which is forced to be last in modification order by the load in Thread 2). But this is disallowed since the standard requires the happens before ordering to be consistent with the sequential consistency ordering, and E, the last element of the sc order, happens before F, the first one.
To solve this problem, C++20 changed the meaning of strongly happens-before (the old meaning was renamed to simply happens-before). According to the modified rule, although A happens-before C, A does not strongly happens-before C, so A does not need to precede C in the single total order.
I'm wondering about the result of the modification. According to cppreference, the single total order of memory_order_seq_cst is C-E-F-A (I don't know why). But according to the happens-before rule, A still happens-before C, so the side effects of A should be visible to C. Does this mean that A precedes C in the modification order seen by thread 2? If so, does this mean that the single total order seen by all threads is not consistent? Can someone explain the above example in detail?
Note: A and C operate on different objects, so it's meaningless to say "the side effects of A should be visible to C". If you mean the side effect of B is visible to C, then yes, and it does not conflict with the single total order C-E-F-A of memory_order_seq_cst operations:
a memory_order_seq_cst load gets its value either from the last memory_order_seq_cst modification, or from some non-memory_order_seq_cst modification that does not happen-before preceding memory_order_seq_cst modifications.
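For reference, here is a self-contained version of the example (the main() harness and the final print are additions for illustration; a typical run will not produce the interesting outcome, which requires specific hardware reordering):
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2, r3;

int main() {
    std::thread t1([] {
        x.store(1, std::memory_order_seq_cst);  // A
        y.store(1, std::memory_order_release);  // B
    });
    std::thread t2([] {
        r1 = y.fetch_add(1, std::memory_order_seq_cst);  // C
        r2 = y.load(std::memory_order_relaxed);          // D
    });
    std::thread t3([] {
        y.store(3, std::memory_order_seq_cst);   // E
        r3 = x.load(std::memory_order_seq_cst);  // F
    });
    t1.join();
    t2.join();
    t3.join();
    // r1==1, r2==3, r3==0 is allowed under C++20 (single total order C-E-F-A)
    // but was forbidden by the pre-C++20 wording, as discussed above.
    std::printf("r1=%d r2=%d r3=%d\n", r1, r2, r3);
}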

Synchronization with "versioning" in c++

Please consider the following synchronization problem:
initially:
version = 0 // atomic variable
data = 0 // normal variable (there could be many)
Thread A:
version++
data = 3
Thread B:
d = data
v = version
assert(d != 3 || v == 1)
Basically, if thread B sees data = 3 then it must also see version++.
What's the weakest memory order and synchronization we must impose so that the assertion in thread B is always satisfied?
If I understand C++ memory_order correctly, the release-acquire ordering won't do because that guarantees that operations BEFORE version++, in thread A, will be seen by the operations AFTER v = version, in thread B.
Acquire and release fences also work in the same directions, but are more general.
As I said, I need the other direction: B sees data = 3 implies B sees version = 1.
I'm using this "versioning approach" to avoid locks as much as possible in a data structure I'm designing. When I see something has changed, I step back, read the new version and try again.
I'm trying to keep my code as portable as possible, but I'm targeting x86-64 CPUs.
You might be looking for a SeqLock, as long as your data doesn't include pointers. (If it does, then you might need something more like RCU to protect readers that might load a pointer, stall / sleep for a while, then deref that pointer much later.)
You can use the SeqLock sequence counter as the version number. (version = tmp_counter >> 1 since you need two increments per write of the payload to let readers detect tearing when reading the non-atomic data. And to make sure they see the data that goes with this sequence number. Make sure you don't read the atomic counter a 3rd time; use the local tmp that you read it into to verify match before/after copying data.)
Readers will have to retry if they happen to attempt a read while data is being modified. But data is non-atomic, so thread B seeing data == 3 can never be part of what creates synchronization; it can only be something you see as a result of synchronizing with a version number from the writer.
See:
Implementing 64 bit atomic counter with 32 bit atomics - my attempt at a SeqLock in C++, with lots of comments. It's a bit of a hack because ISO C++'s data-race UB rules are overly strict: a SeqLock relies on detecting possible tearing and not using torn data, rather than avoiding concurrent access entirely. That's fine on a machine without hardware race detection (like all real CPUs), so nothing faults, but C++ still calls it UB, even with volatile (although that puts it more into implementation-defined territory). In practice it's fine.
GCC reordering up across load with `memory_order_seq_cst`. Is this allowed? - A GCC bug fixed in 8.1 that could break a seqlock implementation.
If you have multiple writers, you can use the sequence-counter itself as a spinlock for mutual exclusion between writers. e.g. using an atomic_fetch_or or CAS to attempt to set the low bit to make the counter odd. (tmp = seq.fetch_or(1, std::memory_order_acq_rel);, hopefully compiling to x86 lock bts). If it previously didn't have the low bit set, this writer won the race, but if it did then you have to try again.
But with only a single writer, you don't need to RMW the atomic sequence counter, just store new values (ordered with writes to the payload), so you can either keep a local copy of it, or just do a relaxed load of it, and store tmp+1 and tmp+2.
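Here is a minimal single-writer SeqLock sketch along those lines (names are illustrative, and the payload is held in relaxed atomic words to sidestep the ISO C++ data-race caveats mentioned above). The version number is seq >> 1:
#include <atomic>
#include <cstdint>

struct SeqLocked {
    std::atomic<uint32_t> seq{0};      // even = stable, odd = write in progress
    std::atomic<uint32_t> a{0}, b{0};  // the payload

    void write(uint32_t x, uint32_t y) {  // single writer only
        uint32_t s = seq.load(std::memory_order_relaxed);
        seq.store(s + 1, std::memory_order_relaxed);  // counter becomes odd
        std::atomic_thread_fence(std::memory_order_release);  // odd counter visible before payload
        a.store(x, std::memory_order_relaxed);
        b.store(y, std::memory_order_relaxed);
        seq.store(s + 2, std::memory_order_release);  // even again: publish
    }

    bool try_read(uint32_t& x, uint32_t& y) {
        uint32_t s0 = seq.load(std::memory_order_acquire);
        if (s0 & 1) return false;  // write in progress
        x = a.load(std::memory_order_relaxed);
        y = b.load(std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_acquire);  // payload reads before re-check
        return s0 == seq.load(std::memory_order_relaxed);  // false if torn: retry
    }
};
Readers loop on try_read until it returns true, reading the counter only twice per attempt as described above. With multiple writers, the counter doubles as the spinlock via fetch_or.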

What formally guarantees that non-atomic variables can't see out-of-thin-air values and create a data race like atomic relaxed theoretically can?

This is a question about the formal guarantees of the C++ standard.
The standard points out that the rules for std::memory_order_relaxed atomic variables allow "out of thin air" / "out of the blue" values to appear.
But for non-atomic variables, can this example have UB? Is r1 == r2 == 42 possible in the C++ abstract machine? Neither variable is 42 initially, so you'd expect neither if body to execute, meaning no writes to the shared variables.
// Global state
int x = 0, y = 0;
// Thread 1:
r1 = x;
if (r1 == 42) y = r1;
// Thread 2:
r2 = y;
if (r2 == 42) x = 42;
The above example is adapted from the standard, which explicitly says such behavior is allowed by the specification for atomic objects:
[Note: The requirements do allow r1 == r2 == 42 in the following example, with x and y initially zero:
// Thread 1:
r1 = x.load(memory_order_relaxed);
if (r1 == 42) y.store(r1, memory_order_relaxed);
// Thread 2:
r2 = y.load(memory_order_relaxed);
if (r2 == 42) x.store(42, memory_order_relaxed);
However, implementations should not allow such behavior. – end note]
What part of the so called "memory model" protects non atomic objects from these interactions caused by reads seeing out-of-thin-air values?
When a race condition would exist with different values for x and y, what guarantees that read of a shared variable (normal, non atomic) cannot see such values?
Can not-executed if bodies create self-fulfilling conditions that lead to a data-race?
The text of your question seems to be missing the point of the example and out-of-thin-air values. Your example does not contain data-race UB. (It might if x or y were set to 42 before those threads ran, in which case all bets are off and the other answers citing data-race UB apply.)
There is no protection against real data races, only against out-of-thin-air values.
I think you're really asking how to reconcile that mo_relaxed example with sane and well-defined behaviour for non-atomic variables. That's what this answer covers.
The note is pointing out a hole in the atomic mo_relaxed formalism, not warning you of a real possible effect on some implementations.
This gap does not (I think) apply to non-atomic objects, only to mo_relaxed.
The note itself says "However, implementations should not allow such behavior." Apparently the standards committee couldn't find a way to formalize that requirement, so for now it's just a note, but it is not intended to be optional.
It's clear that even though this isn't strictly normative, the C++ standard intends to disallow out-of-thin-air values for relaxed atomics (and in general, I assume). Later standards discussion, e.g. 2018's p0668r5: Revising the C++ memory model (which doesn't "fix" this; it's an unrelated change), includes juicy side-notes like:
We still do not have an acceptable way to make our informal (since C++14) prohibition of out-of-thin-air results precise. The primary practical effect of that is that formal verification of C++ programs using relaxed atomics remains unfeasible. The above paper suggests a solution similar to http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3710.html . We continue to ignore the problem here ...
So yes, the normative parts of the standard are apparently weaker for relaxed atomics than they are for non-atomics. This seems to be an unfortunate side effect of how they define the rules.
AFAIK no implementations can produce out-of-thin-air values in real life.
Later versions of the standard phrase the informal recommendation more clearly, e.g. in the current draft: https://timsong-cpp.github.io/cppwp/atomics.order#8
Implementations should ensure that no “out-of-thin-air” values are computed that circularly depend on their own computation.
...
[ Note: The recommendation [of 8.] similarly disallows r1 == r2 == 42 in the following example, with x and y again initially zero:
// Thread 1:
r1 = x.load(memory_order::relaxed);
if (r1 == 42) y.store(42, memory_order::relaxed);
// Thread 2:
r2 = y.load(memory_order::relaxed);
if (r2 == 42) x.store(42, memory_order::relaxed);
— end note ]
(This rest of the answer was written before I was sure that the standard intended to disallow this for mo_relaxed, too.)
I'm pretty sure the C++ abstract machine does not allow r1 == r2 == 42.
Every possible ordering of operations in the C++ abstract machine leads to r1 == r2 == 0 without UB, even without synchronization. Therefore the program has no UB, and any non-zero result would violate the "as-if" rule.
Formally, ISO C++ allows an implementation to implement functions / programs in any way that gives the same result as the C++ abstract machine would. For multi-threaded code, an implementation can pick one possible abstract-machine ordering and decide that's the ordering that always happens. (e.g. when reordering relaxed atomic stores when compiling to asm for a strongly-ordered ISA. The standard as written even allows coalescing atomic stores but compilers choose not to). But the result of the program always has to be something the abstract machine could have produced. (Only the Atomics chapter introduces the possibility of one thread observing the actions of another thread without mutexes. Otherwise that's not possible without data-race UB).
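As a concrete illustration of that parenthetical about coalescing (a transformation the standard permits but that compilers currently decline to perform; the function is made up for illustration):
#include <atomic>

std::atomic<int> x{0};

void two_stores() {
    x.store(1, std::memory_order_relaxed);
    x.store(2, std::memory_order_relaxed);
    // A conforming implementation could coalesce these into a single
    // x.store(2): an execution in which no other thread happens to
    // observe the intermediate 1 is a valid execution of the abstract
    // machine, so the result is something it could have produced.
}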
I think the other answers didn't look carefully enough at this. (And neither did I when it was first posted). Code that doesn't execute doesn't cause UB (including data-race UB), and compilers aren't allowed to invent writes to objects. (Except in code paths that already unconditionally write them, like y = (x==42) ? 42 : y; which would obviously create data-race UB.)
For any non-atomic object, if we don't actually write it then other threads might also be reading it, regardless of code inside not-executed if blocks. The standard allows this and doesn't allow a variable to suddenly read as a different value when the abstract machine hasn't written it. (And for objects we don't even read, like neighbouring array elements, another thread might even be writing them.)
Therefore we can't do anything that would let another thread temporarily see a different value for the object, or step on its write. Inventing writes to non-atomic objects is basically always a compiler bug; this is well known and universally agreed upon because it can break code that doesn't contain UB (and has done so in practice in a few cases of compiler bugs that created it; e.g. IA-64 GCC I think had such a bug at one point that broke the Linux kernel). IIRC, Herb Sutter mentioned such bugs in part 1 or 2 of his talk "atomic<> Weapons: The C++ Memory Model and Modern Hardware", saying that it was already usually considered a compiler bug before C++11, but C++11 codified that and made it easier to be sure.
Or another recent example with ICC for x86:
Crash with icc: can the compiler invent writes where none existed in the abstract machine?
In the C++ abstract machine, there's no way for execution to reach either y = r1; or x = r2;, regardless of sequencing or simultaneity of the loads for the branch conditions. x and y both read as 0 and neither thread ever writes them.
No synchronization is required to avoid UB because no order of abstract-machine operations leads to a data-race. The ISO C++ standard doesn't have anything to say about speculative execution or what happens when mis-speculation reaches code. That's because speculation is a feature of real implementations, not of the abstract machine. It's up to implementations (HW vendors and compiler writers) to ensure the "as-if" rule is respected.
It's legal in C++ to write code like if (global_id == mine) shared_var = 123; and have all threads execute it, as long as at most one thread actually runs the shared_var = 123; statement. (And as long as synchronization exists to avoid a data race on non-atomic int global_id.) If things like this broke down, it would be chaos. For example, you could apparently draw wrong conclusions, like the ones in "reordering atomic operations in C++".
Observing that a non-write didn't happen isn't data-race UB.
It's also not UB to run if(i<SIZE) return arr[i]; because the array access only happens if i is in bounds.
I think the "out of the blue" value-invention note only applies to relaxed-atomics, apparently as a special caveat for them in the Atomics chapter. (And even then, AFAIK it can't actually happen on any real C++ implementations, certainly not mainstream ones. At this point implementations don't have to take any special measures to make sure it can't happen for non-atomic variables.)
I'm not aware of any similar language outside the atomics chapter of the standard that allows an implementation to allow values to appear out of the blue like this.
I don't see any sane way to argue that the C++ abstract machine causes UB at any point when executing this. Seeing r1 == r2 == 42 would imply that an unsynchronized read+write had happened, which is data-race UB. If that could happen, could an implementation invent UB because of speculative execution (or some other reason)? The answer has to be "no" for the C++ standard to be usable at all.
For relaxed atomics, inventing the 42 out of nowhere wouldn't imply that UB had happened; perhaps that's why the standard says it's allowed by the rules? As far as I know, nothing outside the Atomics chapter of the standard allows it.
A hypothetical asm / hardware mechanism that could cause this
(Nobody wants this, hopefully everyone agrees that it would be a bad idea to build hardware like this. It seems unlikely that coupling speculation across logical cores would ever be worth the downside of having to roll back all cores when one detects a mispredict or other mis-speculation.)
For 42 to be possible, thread 1 has to see thread 2's speculative store, and the store from thread 1 has to be seen by thread 2's load. (Confirming the branch speculation as good, allowing this path of execution to become the real path that was actually taken.)
i.e. speculation across threads: Possible on current HW if they ran on the same core with only a lightweight context switch, e.g. coroutines or green threads.
But on current HW, memory reordering between threads is impossible in that case. Out-of-order execution of code on the same core gives the illusion of everything happening in program order. To get memory reordering between threads, they need to be running on different cores.
So we'd need a design that coupled together speculation between two logical cores. Nobody does that because it means more state needs to be rolled back if a mispredict is detected. But it is hypothetically possible. For example, an OoO SMT core that allows store-forwarding between its logical cores even before the stores have retired from the out-of-order core (i.e. become non-speculative).
PowerPC allows store-forwarding between logical cores for retired stores, meaning that threads can disagree about the global order of stores. But waiting until they "graduate" (i.e. retire) and become non-speculative means it doesn't tie together speculation on separate logical cores. So when one is recovering from a branch miss, the others can keep the back-end busy. If they all had to rollback on a mispredict on any logical core, that would defeat a significant part of the benefit of SMT.
I thought for a while I'd found an ordering that led to this on a single core of a real weakly-ordered CPU (with user-space context switching between the threads), but the final-step store can't forward to the first-step load, because that's program order and OoO exec preserves it.
T2: r2 = y; stalls (e.g. cache miss)
T2: branch prediction predicts that r2 == 42 will be true, so x = 42 should run.
T2: x = 42 runs. (Still speculative; r2 = y hasn't obtained a value yet, so the r2 == 42 compare/branch is still waiting to confirm that speculation.)
a context switch to Thread 1 happens without rolling back the CPU to retirement state or otherwise waiting for speculation to be confirmed as good or detected as mis-speculation.
This part won't happen on real C++ implementations unless they use an M:N thread model, not the more common 1:1 C++ thread to OS thread. Real CPUs don't rename the privilege level: they don't take interrupts or otherwise enter the kernel with speculative instructions in flight that might need to rollback and redo entering kernel mode from a different architectural state.
T1: r1 = x; takes its value from the speculative x = 42 store
T1: r1 == 42 is found to be true. (Branch speculation happens here, too, not actually waiting for store-forwarding to complete. But along this path of execution, where the x = 42 did happen, this branch condition will execute and confirm the prediction).
T1: y = 42 runs.
this was all on the same CPU core so this y=42 store is after the r2=y load in program-order; it can't give that load a 42 to let the r2==42 speculation be confirmed. So this possible ordering doesn't demonstrate this in action after all. This is why threads have to be running on separate cores with inter-thread speculation for effects like this to be possible.
Note that x = 42 doesn't have a data dependency on r2 so value-prediction isn't required to make this happen. And the y=r1 is inside an if(r1 == 42) anyway so the compiler can optimize to y=42 if it wants, breaking the data dependency in the other thread and making things symmetric.
Note that the argument about green threads or other context switches on a single core isn't actually relevant: we need separate cores for the memory reordering.
I commented earlier that I thought this might involve value-prediction. The ISO C++ standard's memory model is certainly weak enough to allow the kinds of crazy "reordering" that value-prediction can create, but it's not necessary for this reordering. y = r1 can be optimized to y = 42, and the original code includes x = 42 anyway, so there's no data dependency of that store on the r2 = y load. Speculative stores of 42 are easily possible without value prediction. (The problem is getting the other thread to see them!)
Speculating because of branch prediction instead of value prediction has the same effect here. And in both cases the loads need to eventually see 42 to confirm the speculation as correct.
Value-prediction doesn't even help make this reordering more plausible. We still need inter-thread speculation and memory reordering for the two speculative stores to confirm each other and bootstrap themselves into existence.
ISO C++ chooses to allow this for relaxed atomics, but AFAICT it disallows it for non-atomic variables. I'm not sure I see exactly what in the standard allows the relaxed-atomic case beyond the note saying it's not explicitly disallowed. If there was any other code that did anything with x or y then maybe, but I think my argument applies to the relaxed-atomic case as well: no path through the source in the C++ abstract machine can produce it.
As I said, it's not possible in practice AFAIK on any real hardware (in asm), or in C++ on any real C++ implementation. It's more of an interesting thought-experiment into crazy consequences of very weak ordering rules, like C++'s relaxed-atomic. (Those ordering rules don't disallow it, but I think the as-if rule and the rest of the standard does, unless there's some provision that allows relaxed atomics to read a value that was never actually written by any thread.)
If there is such a rule, it would only be for relaxed atomics, not for non-atomic variables. Data-race UB is pretty much all the standard needs to say about non-atomic vars and memory ordering, and there is no data race here.
When a race condition potentially exists, what guarantees that a read of a shared variable (normal, non atomic) cannot see a write
There is no such guarantee.
When a race condition exists, the behaviour of the program is undefined:
[intro.races]
Two actions are potentially concurrent if
- they are performed by different threads, or
- they are unsequenced, at least one is performed by a signal handler, and they are not both performed by the same signal handler invocation.
The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, except for the special case for signal handlers described below. Any such data race results in undefined behavior. ...
The special case is not very relevant to the question, but I'll include it for completeness:
Two accesses to the same object of type volatile std::sig_­atomic_­t do not result in a data race if both occur in the same thread, even if one or more occurs in a signal handler. ...
What part of the so called "memory model" protects non atomic objects from these interactions caused by reads that see the interaction?
None. In fact, you get the opposite, and the standard explicitly calls this out as undefined behavior. In [intro.races]/21 we have
The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, except for the special case for signal handlers described below. Any such data race results in undefined behavior.
which covers your second example.
The rule is that if you have shared data in multiple threads, and at least one of those threads writes to that shared data, then you need synchronization. Without that you have a data race and undefined behavior. Do note that volatile is not a valid synchronization mechanism. You need atomics/mutexes/condition variables to protect shared access.
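A minimal sketch of that rule with one writer and one reader sharing a non-atomic variable (names are illustrative):
#include <mutex>
#include <thread>

int shared_data = 0;  // non-atomic, so every access must be protected
std::mutex m;

int main() {
    std::thread writer([] {
        std::lock_guard<std::mutex> lk(m);
        shared_data = 42;
    });
    std::thread reader([] {
        std::lock_guard<std::mutex> lk(m);
        int v = shared_data;  // no data race: accesses are mutually excluded
        (void)v;
    });
    writer.join();
    reader.join();
}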
Note: The specific examples I give here are apparently not accurate. I've assumed the optimizer can be somewhat more aggressive than it's apparently allowed to be. There is some excellent discussion about this in the comments. I'm going to have to investigate this further, but wanted to leave this note here as a warning.
Other people have given you answers quoting the appropriate parts of the standard that flat out state that the guarantee you think exists, doesn't. It appears that you're interpreting a part of the standard that says a certain weird behavior is permitted for atomic objects if you use memory_order_relaxed as meaning that this behavior is not permitted for non-atomic objects. This is a leap of inference that is explicitly addressed by other parts of the standard that declare the behavior undefined for non-atomic objects.
In practical terms, here is an order of events that might happen in thread 1 that would be perfectly reasonable, but result in the behavior you think is barred even if the hardware guaranteed that all memory access was completely serialized between CPUs. Keep in mind that the standard has to not only take into account the behavior of the hardware, but the behavior of optimizers, which often aggressively re-order and re-write code.
Thread 1 could be re-written by an optimizer to look this way:
old_y = y; // old_y is a hidden variable (perhaps a register) created by the optimizer
y = 42;
if (x != 42) y = old_y;
There might be perfectly reasonable reasons for an optimizer to do this. For example, it may decide that it's far more likely than not for 42 to be written into y, and for dependency reasons, the pipeline might work a lot better if the store into y occurs sooner rather than later.
The rule is that the apparent result must look as if the code you wrote is what was executed. But there is no requirement that the code you write bears any resemblance at all to what the CPU is actually told to do.
The atomic variables impose constraints on the ability of the compiler to re-write code, as well as instructing the compiler to issue special CPU instructions that impose constraints on the ability of the CPU to re-order memory accesses. The constraints imposed even by memory_order_relaxed are much stronger than what is ordinarily allowed for non-atomic variables. The compiler would generally be allowed to completely get rid of any reference to x and y at all if they weren't atomic.
Additionally, if they are atomic, the compiler must ensure that other CPUs see the entire variable as either with the new value or the old value. For example, if the variable is a 32-bit entity that crosses a cache line boundary and a modification involves changing bits on both sides of the cache line boundary, one CPU may see a value of the variable that is never written because it only sees an update to the bits on one side of the cache line boundary. But this is not allowed for atomic variables modified with memory_order_relaxed.
That is why data races are labeled as undefined behavior by the standard. The space of the possible things that could happen is probably a lot wilder than your imagination could account for, and certainly wider than any standard could reasonably encompass.
(Stackoverflow complains about too many comments I put above, so I gathered them into an answer with some modifications.)
The excerpt you cite from C++ standard working draft N3337 was wrong.
[Note: The requirements do allow r1 == r2 == 42 in the following example, with x and y initially zero:
// Thread 1:
r1 = x.load(memory_order_relaxed);
if (r1 == 42)
    y.store(r1, memory_order_relaxed);
// Thread 2:
r2 = y.load(memory_order_relaxed);
if (r2 == 42)
    x.store(42, memory_order_relaxed);
A programming language should never allow this "r1 == r2 == 42" to happen.
This has nothing to do with the memory model. It is required by causality, which is the basic logical methodology and a foundation of any programming language design. It is the fundamental contract between human and computer. Any memory model should abide by it; otherwise it is a bug.
The causality here is reflected by the intra-thread dependences between operations within a thread, such as data dependence (e.g., a read after a write to the same location) and control dependence (e.g., an operation in a branch). They cannot be violated by any language specification. Any compiler/processor design should respect these dependences in its committed result (i.e., the externally visible or program-visible result).
The memory model is mainly about memory operation ordering among multiple processors, and it should never violate intra-thread dependence, although a weak model may allow the causality happening in one processor to be violated (or unseen) in another processor.
In your code snippet, both threads have (intra-thread) data dependence (load->check) and control dependence (check->store) that ensure their respective executions (within a thread) are ordered. That means, we can check the later op's output to determine if the earlier op has executed.
Then we can use simple logic to deduce that, if both r1 and r2 are 42, there must be a dependence cycle, which is impossible, unless you remove one condition check, which essentially breaks the dependence cycle. This has nothing to do with memory model, but intra-thread data dependence.
Causality (or more accurately, intra-thread dependence here) is defined in the C++ standard, though not so explicitly in early drafts, because dependence is more of a micro-architecture and compiler term. In a language spec it is usually defined via operational semantics. For example, the control dependence formed by an if statement is defined in the same version of the draft you cited as "If the condition yields true the first substatement is executed." That defines the sequential execution order.
That said, the compiler and processor can schedule one or more operations of the if-branch to be executed before the if-condition is resolved. But no matter how the compiler and processor schedule the operations, the result of the if-branch cannot be committed (i.e., become visible to the program) before the if-condition is resolved. One should distinguish between semantics requirement and implementation details. One is language spec, the other is how the compiler and processor implement the language spec.
Actually the current C++ standard draft has corrected this bug in https://timsong-cpp.github.io/cppwp/atomics.order#9 with a slight change.
[ Note: The recommendation similarly disallows r1 == r2 == 42 in the following example, with x and y again initially zero:
// Thread 1:
r1 = x.load(memory_order_relaxed);
if (r1 == 42)
    y.store(42, memory_order_relaxed);
// Thread 2:
r2 = y.load(memory_order_relaxed);
if (r2 == 42)
    x.store(42, memory_order_relaxed);

Will two relaxed writes to the same location in different threads always be seen in the same order by other threads?

On the x86 architecture, stores to the same memory location have a total order, e.g., see this video. What are the guarantees in the C++11 memory model?
More precisely, in
-- Initially --
std::atomic<int> x{0};
-- Thread 1 --
x.store(1, std::memory_order_release);
-- Thread 2 --
x.store(2, std::memory_order_release);
-- Thread 3 --
int r1 = x.load(std::memory_order_acquire);
int r2 = x.load(std::memory_order_acquire);
-- Thread 4 --
int r3 = x.load(std::memory_order_acquire);
int r4 = x.load(std::memory_order_acquire);
would the outcome r1==1, r2==2, r3==2, r4==1 be allowed (on some architecture other than x86)? What if I were to replace all memory_order's by std::memory_order_relaxed?
No, such an outcome is not allowed. §1.10 [intro.multithread]/p8, 18 (quoting N3936/C++14; the same text is found in paragraphs 6 and 16 for N3337/C++11):
8 All modifications to a particular atomic object M occur in some particular total order, called the modification order of M.
18 If a value computation A of an atomic object M happens before a value computation B of M, and A takes its value from a side effect X on M, then the value computed by B shall either be the value stored by X or the value stored by a side effect Y on M, where Y follows X in the modification order of M. [ Note: This requirement is known as read-read coherence. —end note ]
In your code there are two side effects, and by p8 they occur in some particular total order. In Thread 3, the value computation to calculate the value to be stored in r1 happens before that of r2, so given r1 == 1 and r2 == 2 we know that the store performed by Thread 1 precedes the store performed by Thread 2 in the modification order of x. That being the case, Thread 4 cannot observe r3 == 2, r4 == 1 without running afoul of p18. This is regardless of the memory_order used.
There is a note in p21 (p19 in N3337) that is relevant:
[ Note: The four preceding coherence requirements effectively disallow compiler reordering of atomic operations to a single object, even if both operations are relaxed loads. This effectively makes the cache coherence guarantee provided by most hardware available to C++ atomic operations. —end note ]
Per C++11 [intro.multithread]/6: "All modifications to a particular atomic object M occur in some particular total order, called the modification order of M." Consequently, reads of an atomic object by a particular thread will never see "older" values than those the thread has already observed. Note that there is no mention of memory orderings here, so this property holds true for all of them - seq_cst through relaxed.
In the example given in the OP, the modification order of x can be either (0,1,2) or (0,2,1). A thread that has observed a given value in that modification order cannot later observe an earlier value. The outcome r1==1, r2==2 implies that the modification order of x is (0,1,2), but r3==2, r4==1 implies it is (0,2,1), a contradiction. So that outcome is not possible on an implementation that conforms to C++11.
Given that the C++11 rules definitely disallow this, here's a more qualitative / intuitive way to understand it:
If there are no further stores to x, eventually all readers will agree on its value. (i.e. one of the two stores came 2nd).
If it were possible for different threads to disagree about the order, then either they'd permanently / long-term disagree about the value, or one thread could see the value change a 3rd extra time (a phantom store).
Fortunately C++11 doesn't allow either of those possibilities.
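To make the argument concrete, here is a runnable version of the question's program with every operation relaxed (the assert encodes the forbidden outcome, so it can never fire on a conforming implementation):
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0};
int r1, r2, r3, r4;

int main() {
    std::thread t1([] { x.store(1, std::memory_order_relaxed); });
    std::thread t2([] { x.store(2, std::memory_order_relaxed); });
    std::thread t3([] {
        r1 = x.load(std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);
    });
    std::thread t4([] {
        r3 = x.load(std::memory_order_relaxed);
        r4 = x.load(std::memory_order_relaxed);
    });
    t1.join(); t2.join(); t3.join(); t4.join();
    // r1==1 && r2==2 would imply modification order (0,1,2), while
    // r3==2 && r4==1 would imply (0,2,1): a contradiction.
    assert(!(r1 == 1 && r2 == 2 && r3 == 2 && r4 == 1));
}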

C++11 atomic: why does this code work?

Let's take this struct:
struct entry {
    atomic<bool> valid;
    atomic_flag writing;
    char payload[128];
};
Two threads A and B concurrently access this struct this way (let e be an instance of entry):
if (e.valid) {
    // do something with e.payload...
} else {
    while (e.writing.test_and_set(std::memory_order_acquire));
    if (!e.valid) {
        // write e.payload one byte at a time
        // (the payload written by A may be different from the payload written by B)
        e.valid = true;
        e.writing.clear(std::memory_order_release);
    }
}
I guess that this code is correct and does not present issues, but I want to understand why it works.
Quoting the C++ standard (29.3.13):
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
Now, bearing this in mind, imagine that both thread A and B enter the else block. Is this interleaving possible?
Both A and B enter the else branch, because valid is false
A sets the writing flag
B starts to spin lock on the writing flag
A reads the valid flag (which is false) and enters the if block
A writes the payload
A writes true on the valid flag; obviously, if A reads valid again, it would read true
A clears the writing flag
B sets the writing flag
B reads a stale value of the valid flag (false) and enters the if block
B writes its payload
B writes true on the valid flag
B clears the writing flag
I hope this is not possible but when it comes to actually answer the question "why it is not possible?", I'm not sure of the answer. Here is my idea.
Quoting from the standard again (29.3.12):
Atomic read-modify-write operations shall always read the last value (in the modification order) written before the write associated with the read-modify-write operation.
atomic_flag::test_and_set() is an atomic read-modify-write operation, as stated in 29.7.5.
Since atomic_flag::test_and_set() always reads a "fresh value" and I'm calling it with the std::memory_order_acquire memory ordering, I cannot read a stale value of the valid flag, because I must see all the side effects caused by A before its atomic_flag::clear() call (which uses std::memory_order_release).
Am I correct?
Clarification. My whole reasoning (wrong or correct) relies on 29.3.12. For what I understood so far, if we ignore the atomic_flag, reading stale data from valid is possible even if it's atomic. atomic doesn't seem to mean "always immediately visible" to every thread. The maximum guarantee you can ask for is a consistent order in the values you read, but you can still read stale data before getting the fresh one. Fortunately, atomic_flag::test_and_set() and every exchange operation have this crucial feature: they always read fresh data. So, only if you acquire/release on the writing flag (not only on valid), then you get the expected behavior. Do you see my point (correct or not)?
EDIT: my original question included the following few lines, which gained too much attention compared to the core of the question. I leave them for consistency with the answers that have already been given, but please ignore them if you are reading the question right now.
Is there any point in valid being an atomic<bool> and not a plain bool? Moreover, if it should be an atomic<bool>, what is its 'minimum' memory ordering constraint that will not present issues?
Inside the else branch, valid should be protected by the acquire/release semantics imposed by the operations on writing. However, this does not obviate the need to make valid atomic:
You forgot to include the first line (if (e.valid)) in your analysis. If valid was a plain bool instead of atomic<bool>, this access would be completely unprotected. Therefore you could have the situation where a change of valid becomes visible to other threads before the payload is completely written/visible. That means a thread B could evaluate e.valid to true and enter the do something with e.payload branch while the payload isn't completely written yet.
Other than that, your analysis seems somewhat reasonable but not entirely correct to me. The thing to remember with memory ordering is that acquire and release semantics pair up: everything written before a release operation can safely be read after an acquire operation on the same variable reads the modified value. With that in mind, the release semantics on writing.clear(...) ensure that the write to valid must be visible when the loop on writing.test_and_set(...) exits, since the latter reads the change to writing (the write done in writing.clear(...)) with acquire semantics and doesn't exit before that change is visible.
Regarding §29.3.12: it is relevant to the correctness of your code, but unrelated to reading a stale valid flag. You can't set the flag before the clear, so acquire-release semantics will ensure correctness there. §29.3.12 protects you from the following scenario:
Both A and B enter the else branch, because valid is false
A sets the writing flag
B sees a stale value for writing and also sets it
Both A and B read the valid flag (which is false), enter the if block and write the payload creating a race condition
Edit: for the minimal ordering constraints, acquire for the loads and release for the stores should do the job; however, depending on your target hardware, you might as well stay with sequential consistency. For the difference between those semantics, look here.
Section 29.3.12 has nothing to do with why this code is correct or incorrect. The section you want (in the draft version of the standard available online) is Section 1.10: "Multi-threaded executions and data races." Section 1.10 defines a happens-before relation on atomic operations, and on non-atomic operations with respect to atomic operations.
Section 1.10 says that if there are two non-atomic operations where you can not determine the happens-before relationship then you have a data-race. It further declares (Paragraph 21) that any program with a data-race has undefined behavior.
If e.valid is not atomic then you have a data race between the first line of code and the line e.valid=true. So all of your reasoning about the behavior in the else clause is incorrect (the program has no defined behavior so there is nothing to reason about.)
On the other hand, if all of your accesses to e.valid were protected by atomic operations on e.writing (as if the else clause was your whole program) then your reasoning would be correct. Event 9 in your list could not happen. But the reason is not Section 29.3.12, it is again Section 1.10, which says that your non-atomic operations will appear to be sequentially consistent if there are no data races.
The pattern you are using is called double-checked locking. Before C++11 it was impossible to implement double-checked locking portably. In C++11 you can make double-checked locking work correctly and portably. The way you do it is by declaring valid to be atomic.
If valid is not atomic then the initial read of e.valid on the first line conflicts with the assignment to e.valid.
There is no guarantee both threads have already done that read before one of them gets the spinlock, i.e. steps 1 and 6 are not ordered.
The store to e.valid needs to be a release and the load in the condition needs to be an acquire. Otherwise, the compiler/processor are free to order setting e.valid above writing the payload.
There is an opensource tool, CDSChecker, for verifying code like this against the C/C++11 memory model.
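Putting the answers together, here is a sketch of the pattern with the minimal explicit orders suggested above (acquire loads, release stores). It is illustrative rather than the question's exact code; in particular, the clear() is moved outside the inner if so a thread that loses the race still releases the spinlock:
#include <atomic>

struct entry {
    std::atomic<bool> valid{false};
    std::atomic_flag writing;  // clear on construction since C++20;
                               // initialize with ATOMIC_FLAG_INIT in C++11
    char payload[128];
};

void access(entry& e) {
    if (e.valid.load(std::memory_order_acquire)) {
        // Read e.payload: this acquire pairs with the release store of
        // valid below, so the fully written payload is visible here.
    } else {
        while (e.writing.test_and_set(std::memory_order_acquire))
            ;  // spin until we own the lock
        if (!e.valid.load(std::memory_order_acquire)) {  // re-check under the lock
            // ... write e.payload one byte at a time ...
            e.valid.store(true, std::memory_order_release);
        }
        e.writing.clear(std::memory_order_release);  // release the lock
    }
}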