#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    std::vector<int> foo;
    std::atomic<int> bar{0};
    std::mutex mx;
    auto job = [&] {
        int asdf = bar.load(); // (unused)
        // std::lock_guard lg(mx);
        foo.emplace_back(1);   // data race: unsynchronized modification of foo
        bar.store(foo.size());
    };
    std::thread t1(job);
    std::thread t2(job);
    t1.join();
    t2.join();
}
This obviously is not guaranteed to work, but works with a mutex. But how can that be explained in terms of the formal definitions of the standard?
Consider this excerpt from cppreference:
If an atomic store in thread A is tagged memory_order_release and an
atomic load in thread B from the same variable is tagged
memory_order_acquire [as is the case with default atomics], all memory writes (non-atomic and relaxed
atomic) that happened-before the atomic store from the point of view
of thread A, become visible side-effects in thread B. That is, once
the atomic load is completed, thread B is guaranteed to see everything
thread A wrote to memory.
Atomic loads and stores (with the default or with the specific acquire and release memory order specified) have the mentioned acquire-release semantics. (So do a mutex's lock and unlock.)
An interpretation of that wording could be that when Thread 2's load operation syncs with Thread 1's store operation, it is guaranteed to observe all (even non-atomic) writes that happened-before the store, such as the vector modification, making this well-defined. But pretty much everyone would agree that this can lead to a segmentation fault, and surely would if the job function ran its three lines in a loop.
What standard wording explains the obvious difference in capability between the two tools, given that this wording seems to imply that atomics would synchronize in this way?
I know when to use mutexes and atomics, and I know that the example doesn't work because no synchronization actually happens. My question is how the definition is to be interpreted so it doesn't contradict the way it works in reality.
The quoted passage means that when B loads the value that A stored, then by observing that the store happened, it can also be assured that everything that A did before the store has also happened and is visible.
But this doesn't tell you anything if the store has not in fact happened yet!
The actual C++ standard says this more explicitly. (Always remember that cppreference, while a valuable resource which often quotes from or paraphrases the standard, is not the standard itself and is not authoritative.) From N4861, the final C++20 draft, we have in atomics.order p2:
An atomic operation A that performs a release operation on an atomic object M synchronizes with an atomic
operation B that performs an acquire operation on M and takes its value from any side effect in the release
sequence headed by A.
I would agree that if the load in your thread B returned 1, it could safely conclude that the other thread had finished its store and therefore had exited the critical section, and therefore B could safely use foo. In this case the load in B has synchronized with the store in A, since the value of the load (namely 1) came from the store (which is part of its own release sequence).
But it is entirely possible that both loads return 0, if both threads do their loads before either one does its store. The value 0 didn't come from either store, so the loads don't synchronize with the stores in that case. Your code doesn't even look at the value that was loaded, so both threads may enter the critical section together in that case.
The following code would be a safe, though inefficient, way to use an atomic to protect a critical section. It ensures that A will execute the critical section first, and B will wait until A has finished before proceeding. (Obviously if both threads wait for the other then you have a deadlock.)
#include <atomic>
#include <thread>
#include <vector>

int main() {
    std::vector<int> foo;
    std::atomic<int> bar{0};
    auto jobA = [&] {
        foo.emplace_back(1);
        bar.store(foo.size()); // release (the default seq_cst includes it)
    };
    auto jobB = [&] {
        while (bar.load() == 0) /* spin */ ;
        foo.emplace_back(1);   // safe: the load synchronized with A's store
    };
    std::thread t1(jobA);
    std::thread t2(jobB);
    t1.join();
    t2.join();
}
Setting aside the elephant in the room (none of the C++ containers is thread safe without employing locking of some sort, so forget about using emplace_back without locking) and focusing on the question of why atomic objects alone are not sufficient:
You need more than atomic objects. You also need sequencing.
All that an atomic object gives you is that when an object changes state, any other thread will either see its old value or its new value, and it will never see any "partially old/partially new", or "intermediate" value.
But it makes no guarantee whatsoever as to when other execution threads will "see" the atomic object's new value. At some point they (hopefully) will see the atomic object instantly flip to its new value. When? Eventually. That's all you get from atomics.
One execution thread may very well set an atomic object to a new value, but other execution threads will still have the old value cached, in some form or fashion, and will continue to see the atomic object's old value, and won't "see" the atomic object's new value until some intermediate time passes (if ever).
Sequencing rules specify when objects' new values become visible to other execution threads. The simplest way to get both atomicity and easy-to-reason-about sequencing, in one fell swoop, is to use mutexes and condition variables, which handle all the hard details for you. You can still use atomics and, with careful logic, acquire/release fences to implement proper sequencing. But it's very easy to get wrong, and the worst part is that you won't know it's wrong until your code starts going off the rails due to improper sequencing, and it will be nearly impossible to accurately reproduce the faulty behavior for debugging purposes.
But for nearly all common, routine, garden-variety tasks, mutexes and condition variables are the simplest solution to proper inter-thread sequencing, as the sketch below illustrates.
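To make that concrete, here is a minimal sketch (my own illustration, not part of the original answer) of a producer thread handing a vector off to a consumer thread with a mutex and a condition variable; the names data, ready, and cv are arbitrary:

#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    std::vector<int> data;
    bool ready = false; // protected by mx
    std::mutex mx;
    std::condition_variable cv;

    std::thread producer([&] {
        std::lock_guard<std::mutex> lg(mx); // lock/unlock give acquire/release
        data.push_back(1);
        ready = true;
        cv.notify_one();
    });

    std::thread consumer([&] {
        std::unique_lock<std::mutex> ul(mx);
        cv.wait(ul, [&] { return ready; }); // blocks until the producer signals
        data.push_back(2);                  // safe: mutex held, producer is done
    });

    producer.join();
    consumer.join();
}

The condition variable provides the sequencing ("wait until the producer has finished") that the atomics in the question never establish.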
The idea is that when Thread 2's load operation syncs with the store operation of Thread1, it is guaranteed to observe all (even non-atomic) writes that happened-before the store, such as the vector-modification
Yes, all writes done by foo.emplace_back(1); are guaranteed to be visible once bar.store(foo.size()); is observed. But what guarantees that foo.emplace_back(1); in thread 1 sees a fully consistent state produced by foo.emplace_back(1); executed in thread 2, and vice versa? Both read and modify the internal state of the std::vector, and there is no memory barrier before the code reaches the atomic store. And even if every variable were read and modified atomically, the std::vector state consists of multiple variables (at least size, capacity, and a pointer to the data). Changes to all of them must be synchronized together, and a memory barrier is not enough for that.
To explain a little more, let's create a simplified example:
int a = 0;
int b = 0;
std::atomic<int> at{0}; // initialized; a plain std::atomic<int> at; is uninitialized before C++20
// thread 1
int foo = at.load();
a = 1;
b = 2;
at.store(foo);
// thread 2
int foo = at.load();
int tmp1 = a;
int tmp2 = b;
at.store(tmp2);
Now you have two problems:
1. There is no guarantee that when tmp2 is 2, tmp1 is 1, as you read a and b before the atomic operation.
2. There is no guarantee that when at.store(tmp2) is executed, either a == b == 0 or a == 1 and b == 2; it could be that a == 1 but still b == 0.
Is that clear?
But:
// thread 1
mutex.lock();
a = 1;
b = 2;
mutex.unlock();
// thread 2
mutex.lock();
int tmp1 = a;
int tmp2 = b;
mutex.unlock();
You either get tmp1 == 0 and tmp2 == 0, or tmp1 == 1 and tmp2 == 2. Do you see the difference?
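For completeness, here is a self-contained version of those two snippets; the declarations and the thread scaffolding are my additions for illustration:

#include <cassert>
#include <mutex>
#include <thread>

int a = 0, b = 0;
std::mutex mtx;

int main() {
    std::thread t1([] {
        std::lock_guard<std::mutex> lg(mtx);
        a = 1;
        b = 2;
    });
    std::thread t2([] {
        std::lock_guard<std::mutex> lg(mtx);
        int tmp1 = a;
        int tmp2 = b;
        // The critical sections are mutually exclusive, so only the
        // "both old" or "both new" snapshot is possible.
        assert((tmp1 == 0 && tmp2 == 0) || (tmp1 == 1 && tmp2 == 2));
    });
    t1.join();
    t2.join();
}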
I was running a bunch of algorithms through Relacy to verify their correctness and I stumbled onto something I didn't really understand. Here's a simplified version of it:
#include <thread>
#include <atomic>
#include <iostream>
#include <cassert>

struct RMW_Ordering
{
    std::atomic<bool> flag {false};
    std::atomic<unsigned> done {0}, counter {0};
    unsigned race_cancel {0}, race_success {0}, sum {0};

    void thread1() // fail
    {
        race_cancel = 1; // data produced
        if (counter.fetch_add(1, std::memory_order_release) == 1 &&
            !flag.exchange(true, std::memory_order_relaxed))
        {
            counter.store(0, std::memory_order_relaxed);
            done.store(1, std::memory_order_relaxed);
        }
    }

    void thread2() // success
    {
        race_success = 1; // data produced
        if (counter.fetch_add(1, std::memory_order_release) == 1 &&
            !flag.exchange(true, std::memory_order_relaxed))
        {
            done.store(2, std::memory_order_relaxed);
        }
    }

    void thread3()
    {
        while (!done.load(std::memory_order_relaxed)); // livelock test
        counter.exchange(0, std::memory_order_acquire);
        sum = race_cancel + race_success;
    }
};

int main()
{
    for (unsigned i = 0; i < 1000; ++i)
    {
        RMW_Ordering test;
        std::thread t1([&]() { test.thread1(); });
        std::thread t2([&]() { test.thread2(); });
        std::thread t3([&]() { test.thread3(); });
        t1.join();
        t2.join();
        t3.join();
        assert(test.counter == 0);
    }
    std::cout << "Done!" << std::endl;
}
Two threads race to enter a protected region and the last one modifies done, releasing a third thread from an infinite loop. The example is a bit contrived but the original code needs to claim this region through the flag to signal "done".
Initially, the fetch_add had acq_rel ordering because I was concerned the exchange might get reordered before it, potentially causing one thread to claim the flag, attempt the fetch_add check first, and prevent the other thread (which gets past the increment check) from successfully modifying the schedule. While testing with Relacy, I figured I'd see whether the livelock I expected would take place if I switched from acq_rel to release, and to my surprise, it didn't. I then used relaxed for everything, and again, no livelock.
I tried to find any rules regarding this in the C++ standard but only managed to dig up these:
1.10.7 In addition, there are relaxed atomic operations, which are not synchronization operations, and atomic read-modify-write operations,
which have special characteristics.
29.3.11 Atomic read-modify-write operations shall always read the last value (in the modification order) written before the write associated
with the read-modify-write operation.
Can I always rely on RMW operations not being reordered - even if they affect different memory locations - and is there anything in the standard that guarantees this behaviour?
EDIT:
I came up with a simpler setup that should illustrate my question a little better. Here's the CppMem script for it:
int main()
{
    atomic_int x = 0; atomic_int y = 0;
    {{{
        {
            if (cas_strong_explicit(&x, 0, 1, relaxed, relaxed))
            {
                cas_strong_explicit(&y, 0, 1, relaxed, relaxed);
            }
        }
        |||
        {
            if (cas_strong_explicit(&x, 0, 2, relaxed, relaxed))
            {
                cas_strong_explicit(&y, 0, 2, relaxed, relaxed);
            }
        }
        |||
        {
            // Is it possible for x and y to read 2 and 1, or 1 and 2?
            x.load(relaxed).readsvalue(2);
            y.load(relaxed).readsvalue(1);
        }
    }}}
    return 0;
}
I don't think the tool is sophisticated enough to evaluate this scenario, though it does seem to indicate that it's possible. Here's the almost equivalent Relacy setup:
#include "relacy/relacy_std.hpp"

struct rmw_experiment : rl::test_suite<rmw_experiment, 3>
{
    rl::atomic<unsigned> x, y;

    void before()
    {
        x($) = y($) = 0;
    }

    void thread(unsigned tid)
    {
        if (tid == 0)
        {
            unsigned exp1 = 0;
            if (x($).compare_exchange_strong(exp1, 1, rl::mo_relaxed))
            {
                unsigned exp2 = 0;
                y($).compare_exchange_strong(exp2, 1, rl::mo_relaxed);
            }
        }
        else if (tid == 1)
        {
            unsigned exp1 = 0;
            if (x($).compare_exchange_strong(exp1, 2, rl::mo_relaxed))
            {
                unsigned exp2 = 0;
                y($).compare_exchange_strong(exp2, 2, rl::mo_relaxed);
            }
        }
        else
        {
            while (!(x($).load(rl::mo_relaxed) && y($).load(rl::mo_relaxed)));
            RL_ASSERT(x($) == y($));
        }
    }
};

int main()
{
    rl::simulate<rmw_experiment>();
}
The assertion is never violated, so 1 and 2 (or the reverse) is not possible according to Relacy.
I haven't fully grokked your code yet, but the bolded question has a straightforward answer:
Can I always rely on RMW operations not being reordered - even if they affect different memory locations
No, you can't. Compile-time reordering of two relaxed RMWs in the same thread is very much allowed. (I think runtime reordering of two RMWs is probably impossible in practice on most CPUs. ISO C++ doesn't distinguish compile-time vs. run-time for this.)
But note that an atomic RMW includes both a load and a store, and both parts have to stay together. So any kind of RMW can't move earlier past an acquire operation, or later past a release operation.
Also, of course the RMW itself being a release and/or acquire operation can stop reordering in one or the other direction.
Of course, the C++ memory model isn't formally defined in terms of local reordering of access to cache-coherent shared memory, only in terms of synchronizing with another thread and creating a happens-before / after relationship. But if you ignore IRIW reordering (2 reader threads not agreeing on the order of two writer threads doing independent stores to different variables) it's pretty much 2 different ways to model the same thing.
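As a concrete (hypothetical) sketch of the compile-time-reordering point: in the following fragment, nothing forbids a compiler from performing the two relaxed RMWs in the opposite order, because no other thread can distinguish the two orders through any happens-before relationship:

#include <atomic>

std::atomic<int> x{0}, y{0};

void f() {
    // Two relaxed RMWs on different objects: no acquire/release semantics
    // tie them together, so a compiler may legally emit them in either
    // order (and ISO C++ does not forbid run-time reordering either).
    x.fetch_add(1, std::memory_order_relaxed);
    y.fetch_add(1, std::memory_order_relaxed);
}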
In your first example it is guaranteed that the flag.exchange is always executed after the counter.fetch_add, because the && short-circuits: if the first expression evaluates to false, the second expression is never executed. The C++ standard guarantees this, so the compiler cannot reorder the two expressions (regardless of which memory order they use).
As Peter Cordes already explained, the C++ standard says nothing about if or when instructions can be reordered with respect to atomic operations. In general, most compiler optimizations rely on the as-if rule:
The semantic descriptions in this International Standard define a parameterized nondeterministic abstract machine. This International Standard places no requirement on the structure of conforming implementations. In particular, they need not copy or emulate the structure of the abstract machine. Rather, conforming implementations are required to emulate (only) the observable behavior of the abstract machine [..].
This provision is sometimes called the “as-if” rule, because an implementation is free to disregard any requirement of this International Standard as long as the result is as if the requirement had been obeyed, as far as can be determined from the
observable behavior of the program. For instance, an actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no side effects affecting the observable behavior of the program are produced.
The key aspect here is the "observable behavior". Suppose you have two relaxed atomic loads A and B on two different atomic objects, where A is sequenced before B.
std::atomic<int> x, y;
x.load(std::memory_order_relaxed); // A
y.load(std::memory_order_relaxed); // B
A sequenced-before relation is part of the definition of the happens-before relation, so one might assume that the two operations cannot be reordered. However, since the two operations are relaxed, there is no guarantee about the "observable behavior": even with the original order, the x.load (A) could return a newer result than the y.load (B), so the compiler is free to reorder them, since the final program cannot tell the difference (i.e., the observable behavior is equivalent). If it were not equivalent, you would have a race condition! ;-)
To prevent such reorderings you have to rely on the (inter-thread) happens-before relation. If the x.load (A) used memory_order_acquire, then the compiler would have to assume that this operation synchronizes-with some release operation, thus establishing an (inter-thread) happens-before relation. Suppose some other thread performs two atomic updates:
y.store(42, std::memory_order_relaxed); // C
x.store(1, std::memory_order_release); // D
If the acquire-load A sees the value stored by the store-release D, then the two operations synchronize with each other, thereby establishing a happens-before relation. Since the y.store is sequenced before the x.store, and the x.load is sequenced before the y.load, the transitivity of the happens-before relation guarantees that y.store happens-before y.load. Reordering the two loads or the two stores would destroy this guarantee and therefore also change the observable behavior. Thus, the compiler cannot perform such reorders.
In general, arguing about possible reorderings is the wrong approach. In a first step you should always identify your required happens-before relations (e.g., the y.store has to happen before the y.load). The next step is then to ensure that these happens-before relations are correctly established in all cases. At least that is how I approach correctness arguments for my implementations of lock-free algorithms.
Regarding Relacy: Relacy only simulates the memory model, but it relies on the order of operations as generated by the compiler. So even if a compiler could reorder two instructions, but chooses not to, you will not be able to identify this with Relacy.
The standard says that a relaxed atomic operation is not a synchronization operation. But what is atomic about an operation whose result is not seen by other threads?
The example here wouldn't give the expected result then, right?
What I understand by synchronization is that the result of an operation with such a trait would be visible to all threads.
Maybe I don't understand what synchronization means.
Where's the hole in my logic?
The compiler and the CPU are allowed to reorder memory accesses. It's the as-if rule and it assumes a single-threaded process.
In multithreaded programs, the memory order parameter specifies how memory accesses are to be ordered around an atomic operation. This is the synchronization aspect (the "acquire-release semantics") of an atomic operation that is separate from the atomicity aspect itself:
int x = 1;
std::atomic<int> y = 1;

// Thread 1
x++;
y.fetch_add(1, std::memory_order_release);

// Thread 2
while (y.load(std::memory_order_acquire) == 1)
{ /* wait */ }
std::cout << x << std::endl; // x is 2 now
Whereas with a relaxed memory order we only get atomicity, but not ordering:
int x = 1;
std::atomic<int> y = 1;

// Thread 1
x++;
y.fetch_add(1, std::memory_order_relaxed);

// Thread 2
while (y.load(std::memory_order_relaxed) == 1)
{ /* wait */ }
std::cout << x << std::endl; // x can be 1 or 2, we don't know
Indeed as Herb Sutter explains in his excellent atomic<> weapons talk, memory_order_relaxed makes a multithreaded program very difficult to reason about and should be used in very specific cases only, when there is no dependency between the atomic operation and any other operation before or after it in any thread (very rarely the case).
Yes, the standard is correct: relaxed atomics are not synchronization operations, as only the atomicity of the operation is guaranteed.
For example,
int k = 5;

void foo() {
    k = 10;
}

int baz() {
    return k;
}
In the presence of multiple threads, the behavior is undefined, as it exposes a race condition. In practice, on some architectures a caller of baz could see neither 10 nor 5, but some other, indeterminate value. This is often called a torn or dirty read.
If a relaxed atomic load and store were used instead, baz would be guaranteed to return either 5 or 10, as there would be no data race.
It is worth noting that, for practical purposes, Intel chips and their very strong memory model make a relaxed atomic a no-op (meaning there is no extra cost for it being atomic) on this common architecture, as loads and stores are atomic at the hardware level.
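To illustrate, here is the foo/baz example rewritten with a relaxed atomic; this is my own adaptation of the snippet above:

#include <atomic>

std::atomic<int> k{5};

void foo() {
    k.store(10, std::memory_order_relaxed);
}

int baz() {
    // No data race: the load is atomic, so it returns either 5 or 10,
    // never a torn value. On x86 this typically compiles to a plain mov.
    return k.load(std::memory_order_relaxed);
}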
Suppose we have
std::atomic<int> x = 0;
// thread 1
foo();
x.store(1, std::memory_order_relaxed);
// thread 2
assert(x.load(std::memory_order_relaxed) == 1);
bar();
There is, first of all, no guarantee that thread 2 will observe the value 1 (that is, the assert may fire). But even if thread 2 does observe the value 1, while thread 2 is executing bar(), it might not observe side effects generated by foo() in thread 1. And if foo() and bar() access the same non-atomic variables, a data race may occur.
Now suppose we change the example to:
std::atomic<int> x = 0;
// thread 1
foo();
x.store(1, std::memory_order_release);
// thread 2
assert(x.load(std::memory_order_acquire) == 1);
bar();
There is still no guarantee that thread 2 observes the value 1; after all, it could happen that the load occurs before the store. However, in this case, if thread 2 observes the value 1, then the store in thread 1 synchronizes with the load in thread 2. What this means is that everything that's sequenced before the store in thread 1 happens before everything that's sequenced after the load in thread 2. Therefore, bar() will see all the side effects produced by foo() and if they both access the same non-atomic variables, no data race will occur.
So, as you can see, the synchronization properties of operations on x tell you nothing about what happens to x. Instead, synchronization imposes ordering on surrounding operations in the two threads. (Therefore, in the linked example, the result is always 5, and does not depend on the memory ordering; the synchronization properties of the fetch-add operations don't affect the effect of the fetch-add operations themselves.)
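To illustrate that last parenthetical (the linked example isn't reproduced here, so this is my own sketch of the same idea): even with relaxed ordering, the atomicity of fetch_add guarantees the final total, because each RMW reads the latest value in the counter's modification order and no increment can be lost:

#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

int main() {
    std::atomic<int> counter{0};
    std::vector<std::thread> threads;
    for (int i = 0; i < 5; ++i)
        threads.emplace_back([&] {
            // Relaxed is enough for the count itself: the RMW is atomic and
            // the modification order of `counter` is consistent.
            counter.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& t : threads) t.join();
    assert(counter.load() == 5); // always holds; join() provides the synchronization
}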
I have a question regarding the order of operations in the following code:
std::atomic<int> x;
std::atomic<int> y;
int r1;
int r2;

void thread1() {
    y.exchange(1, std::memory_order_acq_rel);
    r1 = x.load(std::memory_order_relaxed);
}

void thread2() {
    x.exchange(1, std::memory_order_acq_rel);
    r2 = y.load(std::memory_order_relaxed);
}
Given the description of std::memory_order_acquire on the cppreference page (https://en.cppreference.com/w/cpp/atomic/memory_order), that
A load operation with this memory order performs the acquire operation on the affected memory location: no reads or writes in the current thread can be reordered before this load.
it seems obvious that there can never be an outcome that r1 == 0 && r2 == 0 after running thread1 and thread2 concurrently.
However, I cannot find any wording in the C++ standard (looking at the C++14 draft right now), which establishes guarantees that two relaxed loads cannot be reordered with acquire-release exchanges. What am I missing?
EDIT: As has been suggested in the comments, it is actually possible to get both r1 and r2 equal to zero. I've updated the program to use load-acquire as follows:
std::atomic<int> x;
std::atomic<int> y;
int r1;
int r2;

void thread1() {
    y.exchange(1, std::memory_order_acq_rel);
    r1 = x.load(std::memory_order_acquire);
}

void thread2() {
    x.exchange(1, std::memory_order_acq_rel);
    r2 = y.load(std::memory_order_acquire);
}
Now is it possible to get both r1 and r2 equal to 0 after concurrently executing thread1 and thread2? If not, which C++ rules prevent this?
The standard does not define the C++ memory model in terms of how operations are ordered around atomic operations with a specific ordering parameter.
Instead, for the acquire/release ordering model, it defines formal relationships such as "synchronizes-with" and "happens-before" that specify how data is synchronized between threads.
N4762, §29.4.2 - [atomics.order]
An atomic operation A that performs a release operation on an atomic object M synchronizes with an atomic operation B that performs an acquire operation on M
and takes its value from any side effect in the release sequence headed by A.
In §6.8.2.1-9, the standard also states that if a store A synchronizes with a load B, anything sequenced before A inter-thread "happens-before" anything sequenced after B.
No "synchronizes-with" (and hence inter-thread happens-before) relationship is established in your second example (the first is even weaker) because the runtime relationships (that check the return values from the loads) are missing.
But even if you did check the return value, it would not be helpful, since the exchange operations do not actually 'release' anything (i.e., no memory operations are sequenced before those operations).
Neither do the atomic load operations 'acquire' anything, since no operations are sequenced after the loads.
Therefore, according to the standard, each of the four possible outcomes for the loads in both examples (including 0 0) is valid.
In fact, the guarantees given by the standard are no stronger than memory_order_relaxed on all operations.
If you want to exclude the 0 0 result in your code, all 4 operations must use std::memory_order_seq_cst. That guarantees a single total order of the involved operations.
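For reference, a sketch of that seq_cst variant (same declarations as in the question):

#include <atomic>

std::atomic<int> x{0}, y{0};
int r1, r2;

void thread1() {
    y.exchange(1, std::memory_order_seq_cst);
    r1 = x.load(std::memory_order_seq_cst);
}

void thread2() {
    x.exchange(1, std::memory_order_seq_cst);
    r2 = y.load(std::memory_order_seq_cst);
}
// In the single total order S over all seq_cst operations, one of the two
// exchanges comes first. The other thread's load of that same variable is
// later in S, so it must observe the exchanged value 1; hence at least one
// of r1, r2 is 1, and the 0 0 outcome is impossible.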
You already have an answer to the language-lawyer part of this. But I want to answer the related question of how to understand why this can be possible in asm on a possible CPU architecture that uses LL/SC for RMW atomics.
It doesn't make sense for C++11 to forbid this reordering: it would require a store-load barrier in this case where some CPU architectures could avoid one.
It might actually be possible with real compilers on PowerPC, given the way they map C++11 memory-orders to asm instructions.
On PowerPC64, a function with an acq_rel exchange and an acquire load (using pointer args instead of static variables) compiles as follows with gcc6.3 -O3 -mregnames. This is from a C11 version because I wanted to look at clang output for MIPS and SPARC, and Godbolt's clang setup works for C11 <atomic.h> but fails for C++11 <atomic> when you use -target sparc64.
#include <stdatomic.h> // This is C11, not C++11, for Godbolt reasons
long foo(_Atomic long *a, _Atomic int *b) {
atomic_exchange_explicit(b, 1, memory_order_acq_rel);
//++*a;
return atomic_load_explicit(a, memory_order_acquire);
}
(source + asm on Godbolt for MIPS32R6, SPARC64, ARM 32, and PowerPC64.)
foo:
    lwsync            # with seq_cst exchange this is full sync, not just lwsync
                      # gone if we use exchange with mo_acquire or relaxed
                      # so this barrier is providing release-store ordering
    li   %r9,1
.L2:
    lwarx  %r10,0,%r4 # load-linked from 0(%r4)
    stwcx. %r9,0,%r4  # store-conditional 0(%r4)
    bne  %cr0,.L2     # retry if SC failed
    isync             # missing if we use exchange(1, mo_release) or relaxed
    ld   %r3,0(%r3)   # 64-bit load double-word of *a
    cmpw %cr7,%r3,%r3
    bne- %cr7,$+4     # skip over the isync if something about the load? PowerPC is weird
    isync             # make the *a load a load-acquire
    blr
isync is not a store-load barrier; it only requires the preceding instructions to complete locally (retire from the out-of-order part of the core). It doesn't wait for the store buffer to be flushed so other threads can see the earlier stores.
Thus the SC (stwcx.) store that's part of the exchange can sit in the store buffer and become globally visible after the pure acquire-load that follows it. In fact, another Q&A already asked this, and the answer is that we think this reordering is possible. Does `isync` prevent Store-Load reordering on CPU PowerPC?
If the pure load is seq_cst, PowerPC64 gcc puts a sync before the ld. Making the exchange seq_cst does not prevent the reordering. Remember that C++11 only guarantees a single total order for SC operations, so the exchange and the load both need to be SC for C++11 to guarantee it.
So PowerPC has a bit of an unusual mapping from C++11 to asm for atomics. Most systems put the heavier barriers on stores, allowing seq-cst loads to be cheaper or only have a barrier on one side. I'm not sure if this was required for PowerPC's famously-weak memory ordering, or if another choice was possible.
https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html shows some possible implementations on various architectures. It mentions multiple alternatives for ARM.
On AArch64, we get this for the question's original C++ version of thread1:
thread1():
    adrp  x0, .LANCHOR0
    mov   w1, 1
    add   x0, x0, :lo12:.LANCHOR0
.L2:
    ldaxr w2, [x0]     # load-linked with acquire semantics
    stlxr w3, w1, [x0] # store-conditional with sc-release semantics
    cbnz  w3, .L2      # retry until exchange succeeds
    add   x1, x0, 8    # the compiler noticed the variables were next to each other
    ldar  w1, [x1]     # load-acquire
    str   w1, [x0, 12] # r1 = load result
    ret
The reordering can't happen there because AArch64 acquire loads interact with release stores to give sequential consistency, not just plain acq/rel. Release stores can't reorder with later acquire loads.
(They can reorder with later plain loads, on paper and probably in some real hardware. AArch64 seq_cst can be cheaper than on other ISAs, if you avoid acquire loads right after release stores.
But unfortunately it makes acq/rel worse than x86. This is fixed with ARMv8.3-A LDAPR, a load that's just acquire not sequential-acquire. It allows earlier stores, even STLR, to reorder with it. So you get just acq_rel, allowing StoreLoad reordering but not other reordering. (It's also an optional feature in ARMv8.2-A).)
On a machine that also or instead had plain-release LL/SC atomics, it's easy to see that an acq_rel doesn't stop later loads to different cache lines from becoming globally visible after the LL but before the SC of the exchange.
If exchange is implemented with a single transaction like on x86, so the load and store are adjacent in the global order of memory operations, then certainly no later operations can be reordered with an acq_rel exchange and it's basically equivalent to seq_cst.
But LL/SC doesn't have to be a true atomic transaction to give RMW atomicity for that location.
In fact, a single asm swap instruction could have relaxed or acq_rel semantics. SPARC64 needs membar instructions around its swap instruction, so unlike x86's xchg it's not seq-cst on its own. (SPARC has really nice / human-readable instruction mnemonics, especially compared to PowerPC. Basically anything is more readable than PowerPC.)
Thus it doesn't make sense for C++11 to require that it did: it would hurt an implementation on a CPU that didn't otherwise need a store-load barrier.
In release-acquire ordering, to create a synchronization point between two threads we need some atomic object M that is the same in both operations:
An atomic operation A that performs a release operation on an
atomic object M synchronizes with an atomic operation B
that performs an acquire operation on M and takes its value from any
side effect in the release sequence headed by A.
or in more detail:
If an atomic store in thread A is tagged memory_order_release
and an atomic load in thread B from the same variable is tagged
memory_order_acquire, all memory writes (non-atomic and relaxed
atomic) that happened-before the atomic store from the point of view
of thread A, become visible side-effects in thread B. That
is, once the atomic load is completed, thread B is guaranteed to
see everything thread A wrote to memory.
The synchronization is established only between the threads releasing
and acquiring the same atomic variable.
N = u | if (M.load(acquire) == v) :[B]
[A]: M.store(v, release) | assert(N == u)
Here the synchronization point is on M: the store-release and the load-acquire (which takes its value from the store-release!). As a result, the store N = u in thread A (sequenced before the store-release on M) is visible in thread B (N == u) after the load-acquire on the same M.
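The same pattern in compilable C++ (my own sketch; 42 stands in for u, 1 for v):

#include <atomic>
#include <cassert>
#include <thread>

int N = 0;             // non-atomic payload
std::atomic<int> M{0};

int main() {
    std::thread a([] {
        N = 42;                                // the "N = u" store
        M.store(1, std::memory_order_release); // [A]
    });
    std::thread b([] {
        if (M.load(std::memory_order_acquire) == 1) // [B] takes its value from [A]
            assert(N == 42); // guaranteed: [A] synchronizes-with [B]
    });
    a.join();
    b.join();
}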
If we take the example:
atomic<int> x, y;
int r1, r2;

void thread_A() {
    y.exchange(1, memory_order_acq_rel);
    r1 = x.load(memory_order_acquire);
}

void thread_B() {
    x.exchange(1, memory_order_acq_rel);
    r2 = y.load(memory_order_acquire);
}
What can we select as the common atomic object M? Say x? Then x.load(memory_order_acquire) would be a synchronization point with x.exchange(1, memory_order_acq_rel) (memory_order_acq_rel includes memory_order_release, which is stronger, and exchange includes a store), if x.load takes its value from x.exchange. The effect would be that loads after the acquire are synchronized with stores before the release; but in this code nothing comes after the acquire, and nothing comes before the exchange.
A correct solution (see the almost identical question) can be the following:
atomic<int> x, y;
int r1, r2;

void thread_A()
{
    x.exchange(1, memory_order_acq_rel);      // [Ax]
    r1 = y.exchange(1, memory_order_acq_rel); // [Ay]
}

void thread_B()
{
    y.exchange(1, memory_order_acq_rel);      // [By]
    r2 = x.exchange(1, memory_order_acq_rel); // [Bx]
}
Assume that r1 == 0.
All modifications to any particular atomic variable occur in a total
order that is specific to this one atomic variable.
We have two modifications of y: [Ay] and [By]. Because r1 == 0, [Ay] comes before [By] in the total modification order of y. From this, [By] reads the value stored by [Ay]. So we have the following:
A writes to x: [Ax].
A then performs the store-release [Ay] to y (acq_rel includes release, and exchange includes a store).
B performs a load-acquire from y ([By] takes the value stored by [Ay]).
Once the atomic load-acquire (on y) is completed, thread B is guaranteed to see everything thread A wrote to memory before the store-release (on y). So it sees the side effect of [Ax], and r2 == 1.
Another possible solution uses atomic_thread_fence:
atomic<int> x, y;
int r1, r2;

void thread_A()
{
    x.store(1, memory_order_relaxed);           // [A1]
    atomic_thread_fence(memory_order_acq_rel);  // [A2]
    r1 = y.exchange(1, memory_order_relaxed);   // [A3]
}

void thread_B()
{
    y.store(1, memory_order_relaxed);           // [B1]
    atomic_thread_fence(memory_order_acq_rel);  // [B2]
    r2 = x.exchange(1, memory_order_relaxed);   // [B3]
}
Again, because all modifications of the atomic variable y occur in a total order, [A3] is either before [B1] or vice versa.
If [B1] is before [A3], then [A3] reads the value stored by [B1], so r1 == 1.
If [A3] is before [B1], then [B1] reads the value stored by [A3],
and from fence-fence synchronization:
A release fence [A2] in thread A synchronizes-with an acquire fence [B2] in thread B, if:
There exists an atomic object y,
There exists an atomic write [A3] (with any memory order) that modifies y in thread A,
[A2] is sequenced-before [A3] in thread A,
There exists an atomic read [B1] (with any memory order) in thread B,
[B1] reads the value written by [A3],
[B1] is sequenced-before [B2] in thread B.
In this case, all stores ([A1]) that are sequenced-before [A2] in thread A will happen-before all loads ([B3]) from the same locations (x) made in thread B after [B2]
So [A1] (the store of 1 to x) happens before, and is visible to, [B3] (the load from x whose result is saved to r2). So 1 is loaded from x and r2 == 1:
[A1]: x = 1 | if (y.load(relaxed) == 1) :[B1]
[A2]: ### release ### | ### acquire ### :[B2]
[A3]: y.store(1, relaxed) | assert(x == 1) :[B3]
As language-lawyer reasoning is hard to follow, I thought I'd add how a programmer who understands atomics would reason about the second snippet in your question:
Since this is symmetrical code, it is enough to look at just one side.
Since the question is about the value of r1 (r2), we start with looking at
r1 = x.load(std::memory_order_acquire);
Depending on what the value of r1 is, we can say something about the visibility of other values. However, since the value of r1 isn't tested, the acquire is irrelevant.
In either case, the value of r1 can be any value that was ever written to it (in the past or future *)). Therefore it can be zero. Nevertheless, we can assume it to BE zero because we're interested in whether or not the outcome of the whole program can be 0 0, which is a sort of testing the value of r1.
Hence, assuming we read zero, THEN we can say that if that zero was written by another thread with memory_order_release, then every other write to memory done by that thread before the store-release will also be visible to this thread. However, the zero we read is the initialization value of x, and initialization values are non-atomic (let alone a 'release'), and certainly there wasn't anything "ordered" in front of them in terms of writing that value to memory; so there is nothing we can say about the visibility of other memory locations. In other words, again, the 'acquire' is irrelevant.
So, we can get r1 = 0, and the fact that we used acquire is irrelevant. The same reasoning then holds for r2. So the result can be r1 = r2 = 0.
In fact, if you assume the value of r1 is 1 after the load-acquire, and that that 1 was written by thread2 with a store-release (which MUST be the case, since that is the only place where a value of 1 is ever written to x), then all we know is that everything written to memory by thread2 before that store-release will also be visible to thread1 (provided thread1 thus read x == 1!). But thread2 doesn't write ANYTHING before writing to x, so again the whole release-acquire relationship is irrelevant, even in the case of loading a value of 1.
*) However, it is possible with further reasoning to show that certain value can never occur because of inconsistency with the memory model - but that doesn't happen here.
In the original version, it is possible to see r1 == 0 && r2 == 0 because there is no requirement that the stores propagate to the other thread before it reads them. This is not a re-ordering of either thread's operations, but e.g. a read of a stale cache.
Thread 1's cache | Thread 2's cache
x == 0; | x == 0;
y == 0; | y == 0;
y.exchange(1, std::memory_order_acq_rel); // Thread 1
x.exchange(1, std::memory_order_acq_rel); // Thread 2
The release on Thread 1 is ignored by Thread 2, and vice-versa. In the abstract machine, the values of x and y on the two threads are not required to be consistent:
Thread 1's cache | Thread 2's cache
x == 0; // stale | x == 1;
y == 1; | y == 0; // stale
r1 = x.load(std::memory_order_relaxed); // Thread 1
r2 = y.load(std::memory_order_relaxed); // Thread 2
You need more threads to get "violations of causality" with acquire / release pairs, as the normal ordering rules, combined with the "becomes visible side effect in" rules force at least one of the loads to see 1.
Without loss of generality, let's assume that Thread 1 executes first.
Thread 1's cache | Thread 2's cache
x == 0; | x == 0;
y == 0; | y == 0;
y.exchange(1, std::memory_order_acq_rel); // Thread 1
Thread 1's cache | Thread 2's cache
x == 0; | x == 0;
y == 1; | y == 1; // sync
The release on Thread 1 forms a pair with the acquire on Thread 2, and the abstract machine describes a consistent y on both threads
r1 = x.load(std::memory_order_relaxed); // Thread 1
x.exchange(1, std::memory_order_acq_rel); // Thread 2
r2 = y.load(std::memory_order_relaxed); // Thread 2
Let me try to explain it in other words.
Imagine that each thread is running on a different CPU core simultaneously: thread1 runs on core A, and thread2 runs on core B.
Core B cannot know the REAL execution order on core A. The meaning of a memory order is just the execution result that core A has to show to core B.
std::atomic<int> x, y;
int r1, r2, var1, var2;

void thread1() { // Core A
    var1 = 99;                                // (0)
    y.exchange(1, std::memory_order_acq_rel); // (1)
    r1 = x.load(std::memory_order_acquire);   // (2)
}

void thread2() { // Core B
    var2 = 999;                               // (2.5)
    x.exchange(1, std::memory_order_acq_rel); // (3)
    r2 = y.load(std::memory_order_acquire);   // (4)
}
For example, (4) is just a REQUEST for (1) (the code that tags variable y with memory_order_release).
And (4) in core B asks A for a specific order: (0) -> (1) -> (4).
Different REQUESTs may see different sequences in the other thread.
(If we now had a core C with some atomic variable interacting with core A, core C might see a different result than core B.)
OK, now here is a detailed explanation, step by step (for the code above):
We start in core B: (2.5)
(2.5) var2 = 999;
(3) acq: look for a variable 'x' tagged with 'memory_order_release': nothing found. The order in core A could be [(0),(1),(2)] or [(0),(2),(1)]; both are legal, so there is no constraint preventing us (B) from reordering (3) and (4).
(3) rel: look for a variable 'x' tagged with 'memory_order_acquire': found (2), so present an ordered list to core A: [var2 = 999, x.exchange(1)].
(4) look for a variable y tagged with 'memory_order_release': found it at (1). So now, standing in core B, we can see the source code that core A displayed to us: 'var1 = 99 must come before y.exchange(1)'.
The idea is: we can see source code in which var1 = 99 comes before y.exchange(1) because we made a REQUEST to the other core, and core A responded with that result. (The REQUEST is y.load(std::memory_order_acquire).) If some other core also wanted to observe core A's source code, it might not reach that conclusion.
We can never know the real execution order of (0), (1), (2).
The order within A itself ensures the right result (it behaves as if single-threaded).
The request from B has no effect on the real execution order in A.
The same applies to B's (2.5), (3), (4).
That is, each core really performs its operations but doesn't tell the other cores, so the 'local caches in the other cores' might be stale.
So there is a chance of (0, 0) with the code in question.
Cppreference gives the following example about memory_order_relaxed:
Atomic operations tagged memory_order_relaxed are not synchronization
operations, they do not order memory. They only guarantee atomicity
and modification order consistency.
Then explains that, with x and y initially zero, this example code
// Thread 1:
r1 = y.load(memory_order_relaxed); // A
x.store(r1, memory_order_relaxed); // B
// Thread 2:
r2 = x.load(memory_order_relaxed); // C
y.store(42, memory_order_relaxed); // D
is allowed to produce r1 == r2 == 42 because:
1. Although A is sequenced-before B within thread 1 and C is sequenced-before D in thread 2,
2. nothing prevents D from appearing before A in the modification order of y, and B from appearing before C in the modification order of x.
Now my question is: if A and B can't be reordered within thread 1 and, similarly, C and D within thread 2 (since each of those is sequenced-before within its thread), aren't points 1 and 2 in contradiction? In other words, with no reordering (as point 1 seems to require), how is the scenario in point 2, visualized below, even possible?
T1 ........... T2
.............. D(y)
A(y)
B(x)
.............. C(x)
Because in this case C would not be sequenced-before D within thread 2, as point 1 demands.
with no reordering (as point 1 seems to require)
Point 1 does not mean "no reordering". It means sequencing of events within a thread of execution. The compiler will issue the CPU instruction for A before B and the CPU instruction for C before D (although even that may be subverted by the as-if rule), but the CPU has no obligation to execute them in that order, caches/write buffers/invalidation queues have no obligation to propagate them in that order, and memory has no obligation to be uniform.
(individual architectures may offer those guarantees though)
Your interpretation of the text is wrong. Let's break this down:
Atomic operations tagged memory_order_relaxed are not synchronization operations, they do not order memory
This means that these operations make no guarantees regarding the order of events. As explained prior to that statement in the original text, multithreaded processors are allowed to reorder operations within a single thread. This can affect the write, the read, or both. Additionally, the compiler is allowed to do the same thing at compile time (mostly for optimization purposes). To see how this relates to the example, suppose we don't use atomic types at all, but instead use primitive types that are atomic by design (an 8-bit value...). Let's rewrite the example:
// Somewhere...
uint8_t y, x;
// Thread 1:
uint8_t r1 = y; // A
x = r1; // B
// Thread 2:
uint8_t r2 = x; // C
y = 42; // D
Considering both the compiler, and the CPU are allowed to reorder operations in each thread, it's easy to see how x == y == 42 is possible.
The next part of the statement is:
They only guarantee atomicity and modification order consistency.
This means the only guarantee is that each operation is atomic, that is, it is impossible for an operation to be observed "midway through". What this means is that if x is an atomic<SomeComplexType>, it's impossible for one thread to observe x as having a value in between states.
It should already be clear where that can be useful, but let's examine a specific example (for demonstration purposes only; this is not how you'd want to code):
class SomeComplexType {
public:
    int size;
    int *values;
};
// std::atomic<SomeComplexType> x; // assumed shared declaration (not shown in the original)

// Thread 1:
SomeComplexType r = x.load(memory_order_relaxed);
if (r.size > 3)
    r.values[2] = 123;

// Thread 2:
SomeComplexType a, b;
a.size = 10; a.values = new int[10];
b.size = 0;  b.values = NULL;
x.store(a, memory_order_relaxed);
x.store(b, memory_order_relaxed);
What the atomic type does for us is guarantee that r in thread 1 is not an object in between states; specifically, that its size and values members are in sync.
According to the STR analogy from this post: C++11 introduced a standardized memory model. What does it mean? And how is it going to affect C++ programming?, I've created a visualization of what can happen here (as I understand it) as follows:
Thread 1 first sees y=42, then it performs r1=y, and after it x=r1. Thread 2 first sees x=r1 being already 42, then it performs r2=x, and after it y=42.
Lines represent "views" of memory by individual threads. These lines/views cannot cross for a particular thread. But, with relaxed atomics, lines/views of one thread can cross these of other threads.
EDIT:
I guess this is the same as with the following program:
atomic<int> x{0}, y{0};
// thread 1:
x.store(1, memory_order_relaxed);
cout << x.load(memory_order_relaxed) << y.load(memory_order_relaxed);
// thread 2:
y.store(1, memory_order_relaxed);
cout << x.load(memory_order_relaxed) << y.load(memory_order_relaxed);
which can produce 01 and 10 on the output (such an output could not happen with SC atomic operations).
Looking exclusively at the C++ memory model (not talking about compiler or hardware reordering), the only execution that leads to r1=r2=42 is:
Here I replaced r1 with a and r2 with b.
As usual, sb stands for sequenced-before and is simply the inter-thread ordering (the order in which the instructions appear in the source code). The rf are Read-From edges and mean that the Read/load on one end reads the value written/Stored on the other end.
The loop, involving both sb and rf edges, as highlighted in green, is necessary for the outcome: y is written in one thread, which is read in the other thread into a and from there written to x, which is read in the former thread again into b (which is sequenced-before the write to y).
There are two reasons why a constructed graph like this would not be possible: causality and because a rf reads a hidden side effect. In this case the latter is impossible because we only write once to each variable, so clearly one write can not be hidden (overwritten) by another write.
In order to answer the causality question, we follow this rule: a loop is disallowed (impossible) when it involves a single memory location and the direction of the sb edges is the same everywhere in the loop (the direction of the rf edges is not relevant in that case); or, when the loop involves more than one variable, all edges (sb AND rf) are in the same direction and AT MOST one of the variables has one or more rf edges between different threads that are not release/acquire.
In this case the loop exists, two variables are involved (one rf edge for x and one rf edge for y), all edges are in the same direction, but TWO variables have a relaxed/relaxed rf edge (namely x and y). Therefore there is no causality violation, and this is an execution that is consistent with the C++ memory model.