Why is atomic bool needed to avoid data race?

Why is atomic bool needed to avoid data race? - c++

I was looking at list 5.13 in C++ Concurrency in Action by Antony Williams:, and I am confused by the comments "the store to and load from y still have to be atomic; otherwise, there would be a data race on y". That implies that if y is a normal (non-atomic) bool then the assert may fire, but why?
#include <atomic>
#include <thread>
#include <assert.h>
bool x=false;
std::atomic<bool> y;
std::atomic<int> z;
void write_x_then_y()
{
x=true;
std::atomic_thread_fence(std::memory_order_release);
y.store(true,std::memory_order_relaxed);
}
void read_y_then_x()
{
while(!y.load(std::memory_order_relaxed));
std::atomic_thread_fence(std::memory_order_acquire);
if(x) ++z;
}
int main()
{
x=false;
y=false;
z=0;
std::thread a(write_x_then_y);
std::thread b(read_y_then_x);
a.join();
b.join();
assert(z.load()!=0);
}
Now let's change y to a normal bool, and I want to understand why the assert can fire.
#include <atomic>
#include <thread>
#include <assert.h>
bool x=false;
bool y=false;
std::atomic<int> z;
void write_x_then_y()
{
x=true;
std::atomic_thread_fence(std::memory_order_release);
y=true;
}
void read_y_then_x()
{
while(!y);
std::atomic_thread_fence(std::memory_order_acquire);
if(x) ++z;
}
int main()
{
x=false;
y=false;
z=0;
std::thread a(write_x_then_y);
std::thread b(read_y_then_x);
a.join();
b.join();
assert(z.load()!=0);
}
I understand that a data race happens on non-atomic global variables, but in this example if the while loop in read_y_then_x exits, my understanding is that y must either already be set to true, or in the process of being set to true (because it is a non-atomic operation) in the write_x_then_y thread. Since atomic_thread_fence in the write_x_then_y thread makes sure no code written above that can be reordered after, I think the x=true operation must have been finished. In addition, the std::memory_order_release and std::memory_order_acquire tags in two threads make sure that the updated value of x has been synchronized-with the read_y_then_x thread when reading x, so I feel the assert still holds... What am I missing?

Accessing a non-atomic object in two threads unsynchronized with one of the accesses being a write access is always a data race and causes undefined behavior. This is how the term "data race" is formally defined in the C++ language and what it prescribes as its consequences. It is not merely a race condition which informally refers to multiple possible outcomes being allowed due to unspecified ordering of certain thread accesses.
The write in y=true; happens while the loop while(!y); is still reading y, which makes it a data race if y is non-atomic. The program would have undefined behavior, which doesn't just mean that the assert might fire. It means that the program may do anything, e.g. crash or freeze up.
The compiler is allowed to optimize under the assumption that this never happens and thereby optimize the code in such a way that your intended behavior is not preserved since it relies on the access causing the data race.
Furthermore, an infinite loop which doesn't eventually perform any atomic/synchronizing/volatile/IO operation also has undefined behavior. So while(!y); has undefined behavior if y is not an atomic and initially false and the compiler can assume that this line is unreachable under those conditions.
The compiler could for example remove the loop from the function for that reason, as actually does happen with current compilers, see comments under the question.
And I am also aware that especially Clang does perform optimization based on that and sometimes even goes so far as to completely drop all contents (including the ret instruction at the end!) from an emitted function with such an infinite loop, if it could not ever be called without undefined behavior. However here, because y might be true when the function is called, in which case there is no undefined behavior for that, this doesn't happen.
All of this is on the language level. It doesn't matter what would happen on the hardware level if the program was compiled in a most literal translation. These would be additional concerns, e.g. potential tearing of write access and potential cache incoherency between threads, but both of these are unlikely to be a problem on common platforms for a bool. Another problem might be though that the threads could keep a copy of the variable in a register, potentially never producing a store that the other thread could observe, which is allowed for a non-atomic non-volatile object.

If you write this:
bool y=false;
...
while(!y);
then the compiler can assume y will not change by itself. The body of the while is empty so either y is true at the start and you have an endless loop or y is false at the start and the while ends.
The compiler can optimize this into:
if (!y) while(true);
But c++ also says that there must always be progress, an infinite loop is UB, so the compiler may do whatever it likes when it sees a while(true);, including removing it. gcc and clang will actually do that as Jerome pointed out here: https://godbolt.org/z/ocrxnee8T
So what the std::atomic<bool> y; does is the modern form of marking y as volatile. The compiler can no longer assume that repeated reads of y give the same result and can no longer optiomize away the while(!y); loop.
Depending on the architecture it will also insert necessary memory barriers so changes to the variable become observable to other threads, which is more than volatile would have done.

Related

Is it a data race?

volatile global x = 0;
reader() {
while (x == 0) {}
print ("World\n");
}
writer() {
print ("Hello, ")
x = 1;
}
thread (reader);
thread (writer);
https://en.wikipedia.org/wiki/Race_condition#:~:text=Data%20race%5Bedit,only%20atomic%20operations.
From wikipedia,
The precise definition of data race is specific to the formal
concurrency model being used, but typically it refers to a situation
where a memory operation in one thread could potentially attempt to
access a memory location at the same time that a memory operation in
another thread is writing to that memory location, in a context where
this is dangerous.
There are at least one thread that writes to x. (writer)
There are at least one thread that reads to x. (reader)
There is not any synchronization mechanism for accessing x. (Both of two threads access x without any locks.)
Therefore, I think the code above is data race. (Obviously not a race condition)
Am i right?
Then what is the meaning of data race when a code is data race, but it generates the expected output? (We will see "Hello, World\n", assuming processor guarantees that a store to an address becomes visible for all load instructions issued after the store instruction)
----------- added working cpp code ------------
#include <iostream>
#include <thread>
volatile int x = 0;
void reader() {
while (x == 0 ) {}
std::cout << "World" << std::endl;
}
void writer() {
std::cout << "Hello, ";
x = 1;
}
int main() {
std::thread t1(reader);
std::thread t2(writer);
t2.join();
t1.join();
return 0;
}

Yes, this is a data race and UB.
[intro.races]/2
Two expression evaluations conflict if one of them modifies a memory location ... and the other one reads or modifies the same memory location.
[intro.races]/21
Two actions are potentially concurrent if:
— they are performed by different threads, ...
...
The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, ...
Any such data race results in undefined behavior.
For two things in different threads to "happen before" one another, a synchronization mechanism must be involved, such as non-relaxed atomics, mutexes, and so on.

Yes, data race and consequently undefined behavior in C++. Undefined behavior means that you have no guarantee how the program will behave. Seeing the "expected" output is one possible output, but you are not guaranteed that it will happen.
Here x is non-atomic and is read by thread t1 and written by thread t2 without any synchronization and therefore they cause a data race.
volatile has no impact on whether or not an access is a data race. Only using an atomic (e.g. std::atomic<int>) can remove the data race.
That said, on many common platforms writing to a int will be atomic on the hardware level, the compiler will not optimize away volatile accesses and will probably also not reorder volatile accesses with IO and therefore it will probably happen to work on these platforms. The language doesn't make this guarantee though.

Why does this cppreference excerpt seem to wrongly suggest that atomics can protect critical sections?

int main() {
std::vector<int> foo;
std::atomic<int> bar{0};
std::mutex mx;
auto job = [&] {
int asdf = bar.load();
// std::lock_guard lg(mx);
foo.emplace_back(1);
bar.store(foo.size());
};
std::thread t1(job);
std::thread t2(job);
t1.join();
t2.join();
}
This obviously is not guaranteed to work, but works with a mutex. But how can that be explained in terms of the formal definitions of the standard?
Consider this excerpt from cppreference:
If an atomic store in thread A is tagged memory_order_release and an
atomic load in thread B from the same variable is tagged
memory_order_acquire [as is the case with default atomics], all memory writes (non-atomic and relaxed
atomic) that happened-before the atomic store from the point of view
of thread A, become visible side-effects in thread B. That is, once
the atomic load is completed, thread B is guaranteed to see everything
thread A wrote to memory.
Atomic loads and stores (with the default or with the specific acquire and release memory order specified) have the mentioned acquire-release semantics. (So does a mutex's lock and unlock.)
An interpretation of that wording could be that when Thread 2's load operation syncs with the store operation of Thread1, it is guaranteed to observe all (even non-atomic) writes that happened-before the store, such as the vector-modification, making this well-defined. But pretty much everyone would agree that this can lead to a segmentation fault and would surely do so if the job function ran its three lines in a loop.
What standard wording explains the obvious difference in capability between the two tools, given that this wording seems to imply that atomic would synchronize in a way.
I know when to use mutexes and atomics, and I know that the example doesn't work because no synchronization actually happens. My question is how the definition is to be interpreted so it doesn't contradict the way it works in reality.

The quoted passage means that when B loads the value that A stored, then by observing that the store happened, it can also be assured that everything that B did before the store has also happened and is visible.
But this doesn't tell you anything if the store has not in fact happened yet!
The actual C++ standard says this more explicitly. (Always remember that cppreference, while a valuable resource which often quotes from or paraphrases the standard, is not the standard itself and is not authoritative.) From N4861, the final C++20 draft, we have in atomics.order p2:
An atomic operation A that performs a release operation on an atomic object M synchronizes with an atomic
operation B that performs an acquire operation on M and takes its value from any side effect in the release
sequence headed by A.
I would agree that if the load in your thread B returned 1, it could safely conclude that the other thread had finished its store and therefore had exited the critical section, and therefore B could safely use foo. In this case the load in B has synchronized with the store in A, since the value of the load (namely 1) came from the store (which is part of its own release sequence).
But it is entirely possible that both loads return 0, if both threads do their loads before either one does its store. The value 0 didn't come from either store, so the loads don't synchronize with the stores in that case. Your code doesn't even look at the value that was loaded, so both threads may enter the critical section together in that case.
The following code would be a safe, though inefficient, way to use an atomic to protect a critical section. It ensures that A will execute the critical section first, and B will wait until A has finished before proceeding. (Obviously if both threads wait for the other then you have a deadlock.)
int main() {
std::vector<int> foo;
std::atomic<int> bar{0};
std::mutex mx;
auto jobA = [&] {
foo.emplace_back(1);
bar.store(foo.size());
};
auto jobB = [&] {
while (bar.load() == 0) /* spin */ ;
foo.emplace_back(1);
};
std::thread t1(jobA);
std::thread t2(jobB);
t1.join();
t2.join();
}

Setting aside the elephant in the room that none of the C++ containers are thread safe without employing locking of some sort (so forget about using emplace_back without implementing locking), and focusing on the question of why atomic objects alone are not sufficient:
You need more than atomic objects. You also need sequencing.
All that an atomic object gives you is that when an object changes state, any other thread will either see its old value or its new value, and it will never see any "partially old/partially new", or "intermediate" value.
But it makes no guarantee whatsoever as to when other execution threads will "see" the atomic object's new value. At some point they (hopefully) will, see the atomic object's instantly flip to its new value. When? Eventually. That's all that you get from atomics.
One execution thread may very well set an atomic object to a new value, but other execution threads will still have the old value cached, in some form or fashion, and will continue to see the atomic object's old value, and won't "see" the atomic object's new value until some intermediate time passes (if ever).
Sequencing are rules that specify when objects' new values are visible in other execution threads. The simplest way to get both atomicity and easy to deal with sequencing, in one fell swoop, is to use mutexes and condition variables which handle all the hard details for you. You can still use atomics and with a careful logic use lock/release fence instructions to implement proper sequencing. But it's very easy to get it wrong, and the worst of it you won't know that it's wrong until your code starts going off the rails due to improper sequencing and it'll be nearly impossible to accurately reproduce the faulty behavior for debugging purposes.
But for nearly all common, routine, garden-variety tasks mutexes and condition variables is the most simplest solution to proper inter-thread sequencing.

The idea is that when Thread 2's load operation syncs with the store operation of Thread1, it is guaranteed to observe all (even non-atomic) writes that happened-before the store, such as the vector-modification
Yes all writes that done by foo.emplace_back(1); would be guaranteed when bar.store(foo.size()); is executed. But who guaranteed you that foo.emplace_back(1); from thread 1 would see any/all non partial consistent state from foo.emplace_back(1); executed in thread 2 and vice versa? They both read and modify internal state of std::vector and there is no memory barrier before code reaches atomic store. And even if all variables would be read/modified atomically std::vector state consists of multiple variables - size, capacity, pointer to the data at least. Changes to all of them must be synchronized as well and memory barrier is not enough for that.
To explain little more let's create simplified example:
int a = 0;
int b = 0;
std::atomic<int> at;
// thread 1
int foo = at.load();
a = 1;
b = 2;
at.store(foo);
// thread 2
int foo = at.load();
int tmp1 = a;
int tmp2 = b;
at.store(tmp2);
Now you have 2 problems:
There is no guarantee that when tmp2 value is 2
tmp1 value would be 1
as you read a and b before atomic operation.
There is no guarantee that when at.store(b)
is executed that either a == b == 0 or a == 1 and b == 2,
it could be a == 1 but still b == 0.
Is that clear?
But:
// thread 1
mutex.lock();
a = 1;
b = 2;
mutex.unlock();
// thread 2
mutex.lock();
int tmp1 = a;
int tmp2 = b;
mutex.unlock();
You either get tmp1 == 0 and tmp2 == 0 or tmp1 == 1 and tmp2 == 2, do you see the difference?

Independent Read-Modify-Write Ordering

I was running a bunch of algorithms through Relacy to verify their correctness and I stumbled onto something I didn't really understand. Here's a simplified version of it:
#include <thread>
#include <atomic>
#include <iostream>
#include <cassert>
struct RMW_Ordering
{
std::atomic<bool> flag {false};
std::atomic<unsigned> done {0}, counter {0};
unsigned race_cancel {0}, race_success {0}, sum {0};
void thread1() // fail
{
race_cancel = 1; // data produced
if (counter.fetch_add(1, std::memory_order_release) == 1 &&
!flag.exchange(true, std::memory_order_relaxed))
{
counter.store(0, std::memory_order_relaxed);
done.store(1, std::memory_order_relaxed);
}
}
void thread2() // success
{
race_success = 1; // data produced
if (counter.fetch_add(1, std::memory_order_release) == 1 &&
!flag.exchange(true, std::memory_order_relaxed))
{
done.store(2, std::memory_order_relaxed);
}
}
void thread3()
{
while (!done.load(std::memory_order_relaxed)); // livelock test
counter.exchange(0, std::memory_order_acquire);
sum = race_cancel + race_success;
}
};
int main()
{
for (unsigned i = 0; i < 1000; ++i)
{
RMW_Ordering test;
std::thread t1([&]() { test.thread1(); });
std::thread t2([&]() { test.thread2(); });
std::thread t3([&]() { test.thread3(); });
t1.join();
t2.join();
t3.join();
assert(test.counter == 0);
}
std::cout << "Done!" << std::endl;
}
Two threads race to enter a protected region and the last one modifies done, releasing a third thread from an infinite loop. The example is a bit contrived but the original code needs to claim this region through the flag to signal "done".
Initially, the fetch_add had acq_rel ordering because I was concerned the exchange might get reordered before it, potentially causing one thread to claim the flag, attempt the fetch_add check first, and prevent the other thread (which gets past the increment check) from successfully modifying the schedule. While testing with Relacy, I figured I'd see whether the livelock I expected to happen will take place if I switched from acq_rel to release, and to my surprise, it didn't. I then used relaxed for everything, and again, no livelock.
I tried to find any rules regarding this in the C++ standard but only managed to dig up these:
1.10.7 In addition, there are relaxed atomic operations, which are not synchronization operations, and atomic read-modify-write operations,
which have special characteristics.
29.3.11 Atomic read-modify-write operations shall always read the last value (in the modification order) written before the write associated
with the read-modify-write operation.
Can I always rely on RMW operations not being reordered - even if they affect different memory locations - and is there anything in the standard that guarantees this behaviour?
EDIT:
I came up with a simpler setup that should illustrate my question a little better. Here's the CppMem script for it:
int main()
{
atomic_int x = 0; atomic_int y = 0;
{{{
{
if (cas_strong_explicit(&x, 0, 1, relaxed, relaxed))
{
cas_strong_explicit(&y, 0, 1, relaxed, relaxed);
}
}
|||
{
if (cas_strong_explicit(&x, 0, 2, relaxed, relaxed))
{
cas_strong_explicit(&y, 0, 2, relaxed, relaxed);
}
}
|||
{
// Is it possible for x and y to read 2 and 1, or 1 and 2?
x.load(relaxed).readsvalue(2);
y.load(relaxed).readsvalue(1);
}
}}}
return 0;
}
I don't think the tool is sophisticated enough to evaluate this scenario, though it does seem to indicate that it's possible. Here's the almost equivalent Relacy setup:
#include "relacy/relacy_std.hpp"
struct rmw_experiment : rl::test_suite<rmw_experiment, 3>
{
rl::atomic<unsigned> x, y;
void before()
{
x($) = y($) = 0;
}
void thread(unsigned tid)
{
if (tid == 0)
{
unsigned exp1 = 0;
if (x($).compare_exchange_strong(exp1, 1, rl::mo_relaxed))
{
unsigned exp2 = 0;
y($).compare_exchange_strong(exp2, 1, rl::mo_relaxed);
}
}
else if (tid == 1)
{
unsigned exp1 = 0;
if (x($).compare_exchange_strong(exp1, 2, rl::mo_relaxed))
{
unsigned exp2 = 0;
y($).compare_exchange_strong(exp2, 2, rl::mo_relaxed);
}
}
else
{
while (!(x($).load(rl::mo_relaxed) && y($).load(rl::mo_relaxed)));
RL_ASSERT(x($) == y($));
}
}
};
int main()
{
rl::simulate<rmw_experiment>();
}
The assertion is never violated, so 1 and 2 (or the reverse) is not possible according to Relacy.

I haven't fully grokked your code yet, but the bolded question has a straightforward answer:
Can I always rely on RMW operations not being reordered - even if they affect different memory locations
No, you can't. Compile-time reordering of two relaxed RMWs in the same thread is very much allowed. (I think runtime reordering of two RMWs is probably impossible in practice on most CPUs. ISO C++ doesn't distinguish compile-time vs. run-time for this.)
But note that an atomic RMW includes both a load and a store, and both parts have to stay together. So any kind of RMW can't move earlier past an acquire operation, or later past a release operation.
Also, of course the RMW itself being a release and/or acquire operation can stop reordering in one or the other direction.
Of course, the C++ memory model isn't formally defined in terms of local reordering of access to cache-coherent shared memory, only in terms of synchronizing with another thread and creating a happens-before / after relationship. But if you ignore IRIW reordering (2 reader threads not agreeing on the order of two writer threads doing independent stores to different variables) it's pretty much 2 different ways to model the same thing.

In your first example it is guaranteed that the flag.exchange is always executed after the counter.fetch_add, because the && short circuits - i.e., if the first expression resolves to false, the second expression is never executed. The C++ standard guarantees this, so the compiler cannot reorder the two expressions (regardless which memory order they use).
As Peter Cordes already explained, the C++ standard says nothing about if or when instructions can be reordered with respect to atomic operations. In general, most compiler optimizations rely on the as-if:
The semantic descriptions in this International Standard deﬁne a parameterized nondeterministic abstract machine. This International Standard places no requirement on the structure of conforming implementations. In particular, they need not copy or emulate the structure of the abstract machine. Rather, conforming implementations are required to emulate (only) the observable behavior of the abstract machine [..].
This provision is sometimes called the “as-if” rule, because an implementation is free to disregard any requirement of this International Standard as long as the result is as if the requirement had been obeyed, as far as can be determined from the
observable behavior of the program. For instance, an actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no side eﬀects aﬀecting the observable behavior of the program are produced.
The key aspect here is the "observable behavior". Suppose you have two relaxed atomic loads A and B on two different atomic objects, where A is sequenced before B.
std::atomic<int> x, y;
x.load(std::memory_order_relaxed); // A
y.load(std::memory_order_relaxed); // B
A sequence-before relation is part of the definition of the happens-before relation, so one might assume that the two operations cannot be reordered. However, since the two operations are relaxed, there is no guarantee about the "observable behavior", i.e., even with the original order, the x.load (A) could return a newer result than the y.load (B), so the compiler would be free to reorder them, since the final program would not be able to tell the difference (i.e., the observable behavior is equivalent). If it would not be equivalent, then you would have a race condition! ;-)
To prevent such reorderings you have to rely on the (inter-thread) happens-before relation. If the x.load (A) would use memory_order_acquire, then the compiler would have to assume that this operation synchronizes-with some release operation, thus establishing a (inter-thread) happens-before relation. Suppose some other thread performs two atomic updates:
y.store(42, std::memory_order_relaxed); // C
x.store(1, std::memory_order_release); // D
If the acquire-load A sees the value store by the store-release D, then the two operations synchronize with each other, thereby establishing a happens-before relation. Since y.store is sequenced before x.store, and x.load is sequenced before, the transitivity of the happens-before relation guarantees that y.store happens-before y.load. Reordering the two loads or the two stores would destroy this guarantee and therefore also change the observable behavior. Thus, the compiler cannot perform such reorders.
In general, arguing about possible reorderings is the wrong approach. In a first step you should always identify your required happens-before relations (e.g., the y.store has to happen before the y.load) . The next step is then to ensure that these happens-before relations are correctly established in all cases. At least that is how I approach correctness arguments for my implementations of lock-free algorithms.
Regarding Relacy: Relacy only simulates the memory model, but it relies on the order of operations as generated by the compiler. So even if a compiler could reorder two instructions, but chooses not to, you will not be able to identify this with Relacy.

c++ multithread atomic load/store

When I read the 5th chapter of the book CplusplusConcurrencyInAction, the example code as follows, multithread load/store some atomic values concurrently with the momery_order_relaxed.Three array save the value of x、y and z respectively at each round.
#include <thread>
#include <atomic>
#include <iostream>

std::atomic<int> x(0),y(0),z(0); // 1
std::atomic<bool> go(false); // 2

unsigned const loop_count=10;

struct read_values
{
int x,y,z;
};

read_values values1[loop_count];
read_values values2[loop_count];
read_values values3[loop_count];
read_values values4[loop_count];
read_values values5[loop_count];

void increment(std::atomic<int>* var_to_inc,read_values* values)
{
while(!go)
std::this_thread::yield();
for(unsigned i=0;i<loop_count;++i)
{
values[i].x=x.load(std::memory_order_relaxed);
values[i].y=y.load(std::memory_order_relaxed);
values[i].z=z.load(std::memory_order_relaxed);
var_to_inc->store(i+1,std::memory_order_relaxed); // 4
std::this_thread::yield();
}
}

void read_vals(read_values* values)
{
while(!go)
std::this_thread::yield();
for(unsigned i=0;i<loop_count;++i)
{
values[i].x=x.load(std::memory_order_relaxed);
values[i].y=y.load(std::memory_order_relaxed);
values[i].z=z.load(std::memory_order_relaxed);
std::this_thread::yield();
}
}

void print(read_values* v)
{
for(unsigned i=0;i<loop_count;++i)
{
if(i)
std::cout<<",";
std::cout<<"("<<v[i].x<<","<<v[i].y<<","<<v[i].z<<")";
}
std::cout<<std::endl;
}

int main()
{
std::thread t1(increment,&x,values1);
std::thread t2(increment,&y,values2);
std::thread t3(increment,&z,values3);
std::thread t4(read_vals,values4);
std::thread t5(read_vals,values5);

go=true;

t5.join();
t4.join();
t3.join();
t2.join();
t1.join();

print(values1);
print(values2);
print(values3);
print(values4);
print(values5);
}
one of the valid output mentioned in this chapter:
(0,0,0),(1,0,0),(2,0,0),(3,0,0),(4,0,0),(5,7,0),(6,7,8),(7,9,8),(8,9,8),(9,9,10)
(0,0,0),(0,1,0),(0,2,0),(1,3,5),(8,4,5),(8,5,5),(8,6,6),(8,7,9),(10,8,9),(10,9,10)
(0,0,0),(0,0,1),(0,0,2),(0,0,3),(0,0,4),(0,0,5),(0,0,6),(0,0,7),(0,0,8),(0,0,9)
(1,3,0),(2,3,0),(2,4,1),(3,6,4),(3,9,5),(5,10,6),(5,10,8),(5,10,10),(9,10,10),(10,10,10)
(0,0,0),(0,0,0),(0,0,0),(6,3,7),(6,5,7),(7,7,7),(7,8,7),(8,8,7),(8,8,9),(8,8,9)
The 3rd output of values1 is (2,0,0),at this point it reads x=2,and y=z=0.It means when y=0,the x is already equals to 2, Why the 3rd output of the values2 it reads x=0 and y=2,which means x is the old value because x、y、z is increasing, so when y=2 that x is at least 2.
And I test the code in my PC,I can't reproduce the result like that.

The reason is that reading via x.load(std::memory_order_relaxed) guarantees only that you never see x decrease within the same thread (in this example code). (It also guarantees that a thread writing to x will read that same value again in the next iteration.)
In general, different threads can read different values from the same variable at the same time. That is, there need not be a consistent "global state" that all threads agree on. The example output is supposed to demonstrate that: The first thread might still see y = 0 when it already wrote x = 4, while the second thread might still see x = 0 when it already writes y = 2. The standard allows this because real hardware may work that way: Consider the case when the threads are on different CPU cores, each with its own private L1 cache.
However, it is not possible that the second thread sees x = 5 and then later sees x = 2 - the atomic object always guarantees that there is a consistent global modification order (that is, all writes to the variable are observed to happen in the same order by all the threads).
But when using std::memory_order_relaxed there are no guarantees about when a thread finally does "see" those writes*, or how the observations of different threads relate to each other. You need stronger memory ordering to get those guarantees.
*In fact, a valid output would be all threads reading only 0 all the time, except the writer threads reading what they wrote the previous iteration to their "own" variable (and 0 for the others). On hardware that never flushed caches unless prompted, this might actually happen, and it would be fully compliant with the C++ standard!
And I test the code in my PC,I can't reproduce the result like that.
The "example output" shown is highly artificial. The C++ standard allows for this output to happen. This means you can write efficient and correct multithreaded code even on hardware with no inbuilt guarantees on cache coherency (see above). But common hardware today (x86 in particular) brings a lot of guarantees that actually make certain behavior impossible to observe (including the output in the question).
Also, note that x, y and z are extremely likely to be adjacent (depends on the compiler), meaning they will likely all land on the same cache line. This will lead to massive performance degradation (look up "false sharing"). But since memory can only be transferred between cores at cache line granularity, this (together with the x86 coherency guarantees) makes it essentially impossible that an x86 CPU (which you most likely performed your tests with) reads outdated values of any of the variables. Allocating these values more than 1-2 cache lines apart will likely lead to more interesting/chaotic results.

Optimization and multithreading in B.Stroustrup's new book

Please refer to section 41.2.2 Instruction Reordering of "TCPL" 4th edition by B.Stroustrup, which I transcribe below:
To gain performance, compilers, optimizers, and hardware reorder
instructions. Consider:
// thread 1:
int x;
bool x_init;
void init()
{
x = initialize(); // no use of x_init in initialize()
x_init = true;
// ...
}
For this piece of code there is no stated reason to assign to x before
assigning to x_init. The optimizer (or the hardware instruction
scheduler) may decide to speed up the program by executing x_init =
true first. We probably meant for x_init to indicate whether x had
been initialized by initializer() or not. However, we did not say
that, so the hardware, the compiler, and the optimizer do not know
that.
Add another thread to the program:
// thread 2:
extern int x;
extern bool x_init;
void f2()
{
int y;
while (!x_init) // if necessary, wait for initialization to complete
this_thread::sleep_for(milliseconds{10});
y = x;
// ...
}
Now we have a problem: thread 2 may never wait and thus will assign an
uninitialized x to y. Even if thread 1 did not set x_init and x in
‘‘the wrong order,’’ we still may have a problem. In thread 2, there
are no assignments to x_init, so an optimizer may decide to lift the
evaluation of !x_init out of the loop, so that thread 2 either never
sleeps or sleeps forever.
Does the Standard allow the reordering in thread 1? (some quote from the Standard would be forthcoming) Why would that speed up the program?
Both answers in this discussion on SO seem to indicate that no such optimization occurs when there are global variables in the code, as x_init above.
What does the author mean by "to lift the evaluation of !x_init out of the loop"? Is this something like this?
if( !x_init ) while(true) this_thread::sleep_for(milliseconds{10});
y = x;

This is not so much a issue of the C++ compiler/standard, but that of modern CPUs. Have a look here. The compiler isn't going to emit memory barrier instructions between the assignments of x and x_init unless you tell it to.
For what it is worth, prior to C++11, the standard had no notion of multi threading in it's abstract machine model. Things are a bit nicer these days.

The C++11 standard does not "allow" or "prevent" reordering. It specifies some way to force a specific "barrier" that, it turns, prevent the compiler to move instructions before/after them. A compiler, in this example might reorder the assignment because it might be more efficient on a CPU with multiple computing unit (ALU/Hyperthreading/etc...) even with a single core. Typically, if your CPU has 2 ALU that can work in parallel, there is no reason the compiler would not try to feed them with as much work as it can.
I'm not speaking of out-of-order reordering of the CPU instructions that's done internally in Intel CPU (for example), but compile time ordering to ensure all the computing resources are busy doing some work.
I think it depends on the compiler compilation flags. Typically unless you tell it to, the compiler must assume that another compilation unit (say B.cpp, which is not visible at compile time) can have a "extern bool x_init", and can change it at any time. Then, the re-ordering optimization would break with the expected behavior (B can define the initialize() function). This example is trivial and unlikely to break. The linked SO answer are not related to this "optimization", but simply that, in their case, the compiler can not make the assumption that the global array is not modified externally, and as such can not make the optimization. This is not like your example.
Yes. It's a very common optimization trick, instead of:
// test is a bool
for (int i = 0; i < 345; i++) {
if (test) do_something();
}
The compiler might do:
if (test) for(int i = 0; i < 345; i++) { do_something(); }
And save 344 useless tests.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js