Transitivity of release-acquire - C++

Just when I thought I got some grip around atomics, I see another article. This is an excerpt from GCC wiki, under Overall Summary:
-Thread 1-        -Thread 2-                   -Thread 3-
y.store (20);     if (x.load() == 10) {        if (y.load() == 10)
x.store (10);       assert (y.load() == 20)      assert (x.load() == 10)
                    y.store (10)
                  }
Release/acquire mode only requires the two threads involved to be synchronized. This means that synchronized values are not commutative to other threads. The assert in thread 2 must still be true since thread 1 and 2 synchronize with x.load(). Thread 3 is not involved in this synchronization, so when thread 2 and 3 synchronize with y.load(), thread 3's assert can fail. There has been no synchronization between threads 1 and 3, so no value can be assumed for 'x' there.
The article is saying that the assert in thread 2 won't fail, but that in 3 might.
I find that surprising. Here's my chain of reasoning that the thread 3 assert won't fail—perhaps someone can tell me where I'm wrong.
Thread 3 observes y == 10 only if thread 2 wrote 10.
Thread 2 writes 10 only if it saw x == 10.
Thread 2 (or any thread) sees x == 10 only if thread 1 wrote 10. There are no further updates to x from any thread.
Since thread 2 observed x == 10, thread 3, having synchronized with thread 2, should observe x == 10 as well.
Release/acquire mode only requires the two threads involved to be synchronized.
Can someone point to a source for this 2-party-only requirement, please? My understanding (granted, perhaps wrong) is that the producer has no knowledge of with whom it's synchronizing. I.e., thread 1 can't say, "my updates are only for thread 2". Likewise, thread 2 can't say, "give me the updates from thread 1". Instead, a release of x = 10 by thread 1 is for anyone to observe, if they so choose.
Thus, with x = 10 being the last update to x (by thread 1), any acquire anywhere in the system that happens-after it (ensured by transitive synchronization) is guaranteed to observe that write, isn't it?
This means that synchronized values are not commutative to other threads.
Regardless of whether it's true, the author perhaps meant transitive, not commutative, right?
Lastly, if I'm wrong above, I'm curious to know what synchronization operation(s) would guarantee that thread 3's assert won't fail.

Seems like you've found a mistake in the GCC wiki.
The assert in T3 shouldn't fail in C++.
Here are the explanations with the relevant quotes from the C++20 standard:
x.store (10) in T1 happens before assert (x.load() == 10) in T3, because:
statements within every thread are ordered with sequenced before
9 Every value computation and side effect associated with a full-expression is sequenced before every value computation and side effect associated with the next full-expression to be evaluated.48
x.store (10) synchronizes with if (x.load() == 10) and y.store (10) synchronizes with if (y.load() == 10)
2 An atomic operation A that performs a release operation on an atomic object M synchronizes with an atomic operation B that performs an acquire operation on M and takes its value from any side effect in the release sequence headed by A.
as a result x.store (10) inter-thread happens before assert (x.load() == 10)
9 An evaluation A inter-thread happens before an evaluation B if
(9.1)   — A synchronizes with B, or
(9.2)   — A is dependency-ordered before B, or
(9.3)   — for some evaluation X
(9.3.1)    — A synchronizes with X and X is sequenced before B, or
(9.3.2)    — A is sequenced before X and X inter-thread happens before B, or
(9.3.3)    — A inter-thread happens before X and X inter-thread happens before B.
this also means x.store (10) happens before assert (x.load() == 10)
10 An evaluation A happens before an evaluation B (or, equivalently, B happens after A) if:
(10.1)   — A is sequenced before B, or
(10.2)   — A inter-thread happens before B.
the above means that x.load() in assert (x.load() == 10) must return 10 written by x.store (10).
(We assume here that x was published correctly and therefore the initial value of x comes before x.store (10) in the modification order of x).
18 If a side effect X on an atomic object M happens before a value computation B of M, then the evaluation B shall take its value from X or from a side effect Y that follows X in the modification order of M.
[Note 18: This requirement is known as write-read coherence. — end note]

Data race guarded by if (false)... what does the standard say?

Consider the following situation
// Global
int x = 0; // not atomic

// Thread 1
x = 1;

// Thread 2
if (false)
    x = 2;
Does this constitute a data race according to the standard?
[intro.races] says:
Two expression evaluations conflict if one of them modifies a memory location (4.4) and the other one reads or modifies the same memory location.
The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, except for the special case for signal handlers described below. Any such data race results in undefined behavior.
Is it safe from a language-lawyer perspective, because the program can never be allowed to perform the "expression evaluation" x = 2;?
From a technical standpoint, what if some weird, stupid compiler decided to perform a speculative execution of this write, rolling it back after checking the actual condition?
What inspired this question is the fact that (at least in Standard 11), the following program was allowed to have its result depend entirely on reordering/speculative execution:
// Thread 1:
r1 = y.load(std::memory_order_relaxed);
if (r1 == 42) x.store(r1, std::memory_order_relaxed);
// Thread 2:
r2 = x.load(std::memory_order_relaxed);
if (r2 == 42) y.store(42, std::memory_order_relaxed);
// This is allowed to result in r1==r2==42 in C++11
(compare https://en.cppreference.com/w/cpp/atomic/memory_order)
The key term is "expression evaluation". Take the very simple example:
int a = 0;
for (int i = 0; i != 10; ++i)
    ++a;
There's one expression ++a, but 10 evaluations. These are all ordered: the 5th evaluation happens-before the 6th evaluation. And the evaluations of ++a are interleaved with the evaluations of i!=10.
So, in
int a = 0;
for (int i = 0; i != 0; ++i)
    ++a;
there are 0 evaluations. And by a trivial rewrite, that gets us
int a = 0;
if (false)
    ++a;
Now, if there are 10 evaluations of ++a, we need to worry for all 10 evaluations if they race with another thread (in more complex cases, the answer might vary - say if you start a thread when a==5). But if there are no evaluations at all of ++a, then there's clearly no racing evaluation.
Does this constitute a data race according to the standard?
No, data races are concerned with access to storage locations in expressions which are actually evaluated as your quote states. In if (false) x = 2; the expression x = 2; is never evaluated. Hence it doesn't matter at all to determining the presence of data races.
Is it safe from a language-lawyer perspective, because the program can never be allowed to perform the "expression evaluation" x = 2;?
Yes.
From a technical standpoint, what if some weird, stupid compiler decided to perform a speculative execution of this write, rolling it back after checking the actual condition?
It is not allowed to do that if it could affect the observable behavior of the program. Otherwise it may do that, but it is impossible to observe the difference.
What inspired this question is the fact that (at least in Standard 11), the following program was allowed to have its result depend entirely on reordering/speculative execution:
That's a completely different situation. This program also doesn't have any data races, since the only variables that are accessed in both threads are atomics, which can never have data races. It merely has potentially multiple valid results, meaning a race condition. A data race would always imply undefined behavior, not merely unspecified behavior.
Also the out-of-thin-air issue appears only as a result of the circular dependence of the accesses between multiple atomics. In your initial example there is only one variable, non-atomic and without any such circular dependence.

what happens after a thread terminates [duplicate]

std::thread::join is said to 'synchronize with' the joined thread; however, synchronization doesn't tell anything about the visibility of side effects, it merely governs the order of visibility. I.e., in the following example:
#include <thread>

int g_i = 0;

int main()
{
    auto fn = [&] { g_i = 1; };
    std::thread t1(fn);
    t1.join();
    return g_i;
}
Do we have any guarantee in the C++ standard that this program will always return 1?
[thread.thread.member]:
void join();
Effects: Blocks until the thread represented by *this has completed.
Synchronization: The completion of the thread represented by *this synchronizes with the corresponding successful join() return.
Since the completion of the thread execution synchronizes with the return from thread::join, the completion of the thread inter-thread happens before the return:
An evaluation A inter-thread happens before an evaluation B if
— A synchronizes with B
and thus happens before it:
An evaluation A happens before an evaluation B (or, equivalently, B happens after A) if:
— A inter-thread happens before B
Due to (inter-thread) happens before transitivity (let me skip copy-pasting the whole definition of inter-thread happens before to show this), everything that happened before the completion of the thread, including the write of the value 1 into g_i, happens before the return from thread::join. The return from thread::join, in turn, happens before the read of the value of g_i in return g_i; simply because the invocation of thread::join is sequenced before return g_i;. Again, using transitivity, we establish that the write of 1 to g_i in the non-main thread happens before the read of g_i in return g_i; in the main thread.
The write of 1 into g_i is a visible side effect with respect to the read of g_i in return g_i;:
A visible side effect A on a scalar object or bit-field M with respect to a value computation B of M satisfies the conditions:
— A happens before B and
— there is no other side effect X to M such that A happens before X and X happens before B.
The value of a non-atomic scalar object or bit-field M, as determined by evaluation B, shall be the value stored by the visible side effect A.
The last sentence guarantees that the value read from g_i in return g_i; will be 1.
t1.join() will not return until thread execution is completed, so from your example g_i is guaranteed to be 1

Which value does atomic read operation with memory_order_seq_cst read in this situation?

I've read the chapters about memory ordering in the C++11 standard and am confused by a rule. According to the C++11 standard (ISO/IEC JTC1 SC22 WG21 N3690), 29.3/3, it is said that:
There shall be a single total order S on all memory_order_seq_cst operations, consistent with the “happens before” order and modification orders for all affected locations, such that each memory_order_seq_cst operation B that loads a value from an atomic object M observes one of the following values:
— the result of the last modification A of M that precedes B in S, if it exists, or
— if A exists, the result of some modification of M in the visible sequence of side effects with respect to B that is not memory_order_seq_cst and that does not happen before A, or
— if A does not exist, the result of some modification of M in the visible sequence of side effects with respect to B that is not memory_order_seq_cst.
So, consider the following situation:
There are 4 atomic operations A, B, C, D. From the code:
All of them are operations on the same atomic variable
A and B are write operations with any order (may be relaxed)
C is write operation with memory_order_seq_cst
D is read operation with memory_order_seq_cst
A is the last write operation that happens-before D
A, B, and C have no happens-before relationship with one another.
D, B, and C have no happens-before relationship with one another.
Consider an execution where the following orderings happen to occur:
C appears before D in the single total order for memory_order_seq_cst operations
The modification order on this variable appears like A->B->C
Here is the possible code
#include <atomic>
#include <iostream>
#include <thread>

using namespace std;

atomic_bool go(false);
atomic_int var(0);

void thread1()
{
    while (!go) {}
    var.store(1, memory_order_relaxed); // A
    this_thread::yield();
    cout << var.load(memory_order_seq_cst) << endl; // D
}

void thread2()
{
    while (!go) {}
    var.store(2, memory_order_seq_cst); // C
}

void thread3()
{
    while (!go) {}
    var.store(3, memory_order_relaxed); // B
}

int main() {
    thread t1(thread1);
    thread t2(thread2);
    thread t3(thread3);
    go = true;
    t1.join();
    t2.join();
    t3.join();
}
Is it possible that read operation D will read the value written by operation B, given an A,B,C modification-order for var?
If not possible, what rules exclude this possibility?
If possible, this means that memory_order_seq_cst could read the value "written before" the last memory_order_seq_cst write. Is this a "bug" in C++ standard, or designed intentionally when not everything is seq_cst?
In this case it is possible that D reads either from A, B or C.
Consider a graph with four nodes, A, B, C and D, and the following edges:
sc (sequentially consistent (total) ordering): C --sc--> D
sb (sequenced before / happens before): A --sb--> D
mo (modification order): A --mo--> B --mo--> C
rf (reads from): ? --rf--> D
There are two reasons why an rf edge in the graph would be inconsistent with the C++ memory model: causality, and the fact that you can't read from hidden visible side effects.
If you ignore sc edges for a moment, then -- having just a single atomic variable, the only causal restriction on the graph is to not have loops involving rf edges and (directed) sb edges (this is a result from my research). In this case no such loop can even exist since you only have a single rf edge-- so, there is no reason whatsoever that you can't read from any of the three writes.
However, you specify both the exact modification order (not that this matters IMHO; you should only be interested in the possible outcomes of the program) and one sc edge. And we still have to investigate whether those are compatible with each of the three possible rf edges with regard to reading from a hidden visible side effect.
Note that a given rf edge introduces a synchronization if its write node is release and the read node is acquire; sc is release/acquire, so the latter is true and the former only when reading from node C. However that synchronization means never more than (in modification order) everything before the write must happen before everything after the read; and there is nothing after the read, so the whole synchronization doesn't matter.
Moreover, the dictated modification order (A --mo--> B --mo--> C) is not causally inconsistent with the dictated total sc ordering (C --sc--> D) because D is a read and not part of the modification order subgraph. The only thing that is not allowed (because of causality) are directed loops involving sc and mo edges.
Now, as an experiment, assume we make node A also sc. Then we need to put A in the total ordering, thus either A --sc--> C --sc--> D, C --sc--> A --sc--> D or C --sc--> D --sc--> A, but we have A --mo--> C, so the latter two are disallowed (would cause a (causal) loop) and the only possible ordering is: A --sc--> C --sc--> D. Now it no longer is possible to read from A because that would cause the following subgraph:
A --sc--> C
 \       /
  rf   sc
   \   /
    v v
     D
and the write in C will always overwrite the value that was written by A before it is read by D (aka, A is a hidden visible side effect for D).
If A is not sc (as is the case in the original problem), then this rf is only disallowed (because of hidden vse) when
A --hb--> C
 \       /
  rf   sc
   \   /
    v v
     D
where 'hb' stands for Happens-Before (for the same reason; then A is a hidden visible side effect for D, as C will always overwrite the value written by A before D reads it).
In the original problem there is no happens-before between thread 1 and 2 however, because such synchronization would require another rf edge between the two threads (or a fence or anything that would cause the extra synchronization).
Finally, yes this is intended behavior and not a bug in the standard.
Edit
To quote the standard that you quoted:
— the result of the last modification A of M that precedes B in S, if it exists, or
The A here is your C, and the B here is your D. The A mentioned here DOES exist, namely node C (C --sc--> D). So this line says that it is possible to read the value written by node C.
— if A exists, the result of some modification of M in the visible sequence of side effects with respect to B that is not memory_order_seq_cst and that does not happen before A, or
Again, the A here is your C and it exists. Then the "result of some modification of M (var) in the visible sequence of side effects with respect to B (your D) that is not memory_order_seq_cst" is your A. And as we have established, your A does not happen before your C (their A). So, this says that it is possible to read the value written from your A.
— if A does not exist, the result of some modification of M in the visible sequence of side effects with respect to B that is not memory_order_seq_cst.
This is not relevant here and would only apply when there wasn't a write in the total ordering S of M (var) that came before B (your D).
Even with A->B->C as the "modification order", B is a possibility because it "is not memory_order_seq_cst and that does not happen before [C]".
The standard definition for visible sequence of side effects (1.10.14) supports this (emphasis mine):
The visible sequence of side effects on an atomic object M, with respect to a value computation B of M, is a maximal contiguous sub-sequence of side effects in the modification order of M, where the first side effect is visible with respect to B, and for every side effect, it is not the case that B happens before it. The value of an atomic object M, as determined by evaluation B, shall be the value stored by some operation in the visible sequence of M with respect to B.
So even with an explicit modification order, your load can yield A, B, or C.

Difference between memory_order_consume and memory_order_acquire

I have a question regarding a GCC-Wiki article. Under the headline "Overall Summary" the following code example is given:
Thread 1:
y.store (20);
x.store (10);
Thread 2:
if (x.load() == 10) {
    assert (y.load() == 20)
    y.store (10)
}
It is said that, if all stores are release and all loads are acquire, the assert in thread 2 cannot fail. This is clear to me (because the store to x in thread 1 synchronizes with the load from x in thread 2).
But now comes the part that I don't understand. It is also said that, if all stores are release and all loads are consume, the results are the same. Wouldn't it be possible that the load from y is hoisted before the load from x (because there is no dependency between these variables)? That would mean that the assert in thread 2 actually can fail.
The C11 Standard's ruling is as follows.
5.1.2.4 Multi-threaded executions and data races
An evaluation A is dependency-ordered before 16) an evaluation B if:
— A performs a release operation on an atomic object M, and, in another thread, B performs a consume operation on M and reads a value written by any side effect in the release sequence headed by A, or
— for some evaluation X, A is dependency-ordered before X and X carries a dependency to B.
An evaluation A inter-thread happens before an evaluation B if A synchronizes with B, A is dependency-ordered before B, or, for some evaluation X:
— A synchronizes with X and X is sequenced before B,
— A is sequenced before X and X inter-thread happens before B, or
— A inter-thread happens before X and X inter-thread happens before B.
NOTE 7 The ‘‘inter-thread happens before’’ relation describes arbitrary concatenations of ‘‘sequenced before’’, ‘‘synchronizes with’’, and ‘‘dependency-ordered before’’ relationships, with two exceptions. The first exception is that a concatenation is not permitted to end with ‘‘dependency-ordered before’’ followed by ‘‘sequenced before’’. The reason for this limitation is that a consume operation participating in a ‘‘dependency-ordered before’’ relationship provides ordering only with respect to operations to which this consume operation actually carries a dependency. The reason that this limitation applies only to the end of such a concatenation is that any subsequent release operation will provide the required ordering for a prior consume operation. The second exception is that a concatenation is not permitted to consist entirely of ‘‘sequenced before’’. The reasons for this limitation are (1) to permit ‘‘inter-thread happens before’’ to be transitively closed and (2) the ‘‘happens before’’ relation, defined below, provides for relationships consisting entirely of ‘‘sequenced before’’.
An evaluation A happens before an evaluation B if A is sequenced before B or A inter-thread happens before B.
A visible side effect A on an object M with respect to a value computation B of M satisfies the conditions:
— A happens before B, and
— there is no other side effect X to M such that A happens before X and X happens before B.
The value of a non-atomic scalar object M, as determined by evaluation B, shall be the value stored by the visible side effect A.
(emphasis added)
In the commentary below, I'll abbreviate as follows:
Dependency-ordered before: DOB
Inter-thread happens before: ITHB
Happens before: HB
Sequenced before: SeqB
Let us review how this applies. We have 4 relevant memory operations, which we will name Evaluations A, B, C and D:
Thread 1:
y.store (20); // Release; Evaluation A
x.store (10); // Release; Evaluation B
Thread 2:
if (x.load() == 10) { // Consume; Evaluation C
    assert (y.load() == 20) // Consume; Evaluation D
    y.store (10)
}
To prove the assert never trips, we in effect seek to prove that A is always a visible side-effect at D. In accordance with 5.1.2.4 (15), we have:
A SeqB B DOB C SeqB D
which is a concatenation ending in DOB followed by SeqB. (17) explicitly rules this out as an ITHB concatenation, despite what (16) says.
We know that since A and D are not in the same thread of execution, A is not SeqB D; Hence neither of the two conditions in (18) for HB is satisfied, and A does not HB D.
It follows then that A is not visible to D, since one of the conditions of (19) is not met. The assert may fail.
How this could play out, then, is described here, in the C++ standard's memory model discussion and here, Section 4.2 Control Dependencies:
(Some time ahead) Thread 2's branch predictor guesses that the if will be taken.
Thread 2 approaches the predicted-taken branch and begins speculative fetching.
Thread 2 out-of-order and speculatively loads 0xGUNK from y (Evaluation D). (Maybe it was not yet evicted from cache?).
Thread 1 stores 20 into y (Evaluation A)
Thread 1 stores 10 into x (Evaluation B)
Thread 2 loads 10 from x (Evaluation C)
Thread 2 confirms the if is taken.
Thread 2's speculative load of y == 0xGUNK is committed.
Thread 2 fails assert.
The reason why it is permitted for Evaluation D to be reordered before C is because a consume does not forbid it. This is unlike an acquire-load, which prevents any load/store after it in program order from being reordered before it. Again, 5.1.2.4(15) states, a consume operation participating in a ‘‘dependency-ordered before’’ relationship provides ordering only with respect to operations to which this consume operation actually carries a dependency, and there most definitely is not a dependency between the two loads.
CppMem verification
CppMem is a tool that helps explore shared data access scenarios under the C11 and C++11 memory models.
For the following code that approximates the scenario in the question:
int main() {
  atomic_int x, y;
  y.store(30, mo_seq_cst);
  {{{ { y.store(20, mo_release);
        x.store(10, mo_release); }
  ||| { r3 = x.load(mo_consume).readsvalue(10);
        r4 = y.load(mo_consume); }
  }}};
  return 0;
}
The tool reports two consistent, race-free scenarios:
one in which y=20 is successfully read, and
one in which the "stale" initialization value y=30 is read.
By contrast, when mo_acquire is used for the loads, CppMem reports only one consistent, race-free scenario: the correct one, in which y=20 is read.
Both acquire and consume establish a transitive "visibility" order on atomic stores, unless the stores have been issued with memory_order_relaxed. If a thread reads an atomic object x with one of these modes, it can be sure that it sees all modifications of all atomic objects y that were done before the write to x.
The difference between "acquire" and "consume" is in the visibility of writes to some other variable z, say. For acquire, all writes that happened before the release store, atomic or not, are visible. For consume, only writes to objects that the loaded value carries a dependency to are guaranteed to be visible.
thread 1                               thread 2
z = 5
store(&x, 3, release)
                                       load(&x, acquire)
                                       z == 5   // we know that z is written

thread 1                               thread 2
z = 5
store(&x, 3, release)
                                       load(&x, consume)
                                       z == ?   // we may not have the last value of z