C++ multithread atomic load/store
When I read Chapter 5 of the book C++ Concurrency in Action, I found the example code below, in which several threads load/store some atomic variables concurrently with memory_order_relaxed. Each of the five arrays records the values of x, y, and z seen at each iteration.
#include <thread>
#include <atomic>
#include <iostream>

std::atomic<int> x(0),y(0),z(0); // 1
std::atomic<bool> go(false); // 2

unsigned const loop_count=10;

struct read_values
{
    int x,y,z;
};

read_values values1[loop_count];
read_values values2[loop_count];
read_values values3[loop_count];
read_values values4[loop_count];
read_values values5[loop_count];

void increment(std::atomic<int>* var_to_inc,read_values* values)
{
    while(!go) // spin until main() signals the start
        std::this_thread::yield();
    for(unsigned i=0;i<loop_count;++i)
    {
        values[i].x=x.load(std::memory_order_relaxed);
        values[i].y=y.load(std::memory_order_relaxed);
        values[i].z=z.load(std::memory_order_relaxed);
        var_to_inc->store(i+1,std::memory_order_relaxed); // 4
        std::this_thread::yield();
    }
}

void read_vals(read_values* values)
{
    while(!go) // spin until main() signals the start
        std::this_thread::yield();
    for(unsigned i=0;i<loop_count;++i)
    {
        values[i].x=x.load(std::memory_order_relaxed);
        values[i].y=y.load(std::memory_order_relaxed);
        values[i].z=z.load(std::memory_order_relaxed);
        std::this_thread::yield();
    }
}

void print(read_values* v)
{
    for(unsigned i=0;i<loop_count;++i)
    {
        if(i)
            std::cout<<",";
        std::cout<<"("<<v[i].x<<","<<v[i].y<<","<<v[i].z<<")";
    }
    std::cout<<std::endl;
}

int main()
{
    std::thread t1(increment,&x,values1);
    std::thread t2(increment,&y,values2);
    std::thread t3(increment,&z,values3);
    std::thread t4(read_vals,values4);
    std::thread t5(read_vals,values5);
    go=true; // release all five threads
    t5.join();
    t4.join();
    t3.join();
    t2.join();
    t1.join();
    print(values1);
    print(values2);
    print(values3);
    print(values4);
    print(values5);
}
One of the valid outputs mentioned in this chapter:
(0,0,0),(1,0,0),(2,0,0),(3,0,0),(4,0,0),(5,7,0),(6,7,8),(7,9,8),(8,9,8),(9,9,10)
(0,0,0),(0,1,0),(0,2,0),(1,3,5),(8,4,5),(8,5,5),(8,6,6),(8,7,9),(10,8,9),(10,9,10)
(0,0,0),(0,0,1),(0,0,2),(0,0,3),(0,0,4),(0,0,5),(0,0,6),(0,0,7),(0,0,8),(0,0,9)
(1,3,0),(2,3,0),(2,4,1),(3,6,4),(3,9,5),(5,10,6),(5,10,8),(5,10,10),(9,10,10),(10,10,10)
(0,0,0),(0,0,0),(0,0,0),(6,3,7),(6,5,7),(7,7,7),(7,8,7),(8,8,7),(8,8,9),(8,8,9)
The 3rd entry of values1 is (2,0,0): at that point the thread read x=2 and y=z=0, meaning that while y was still 0, x was already 2. But the 3rd entry of values2 reads x=0 and y=2, which seems to mean that x is a stale value: since x, y, and z only ever increase, and values1 shows x was already 2 while y was still 0, by the time y=2 the value of x should be at least 2. Why can this happen?
Also, when I test the code on my PC, I can't reproduce a result like that.
The reason is that reading via x.load(std::memory_order_relaxed) guarantees only that you never see x decrease within the same thread (in this example code). (It also guarantees that a thread writing to x will read that same value again in the next iteration.)
In general, different threads can read different values from the same variable at the same time. That is, there need not be a consistent "global state" that all threads agree on. The example output is supposed to demonstrate exactly that: the first thread might still see y = 0 when it has already written x = 4, while the second thread might still see x = 0 when it has already written y = 2. The standard allows this because real hardware may work that way: consider the case where the threads are on different CPU cores, each with its own private L1 cache.
However, it is not possible that the second thread sees x = 5 and then later sees x = 2 - the atomic object always guarantees that there is a consistent global modification order (that is, all writes to the variable are observed to happen in the same order by all the threads).
But when using std::memory_order_relaxed there are no guarantees about when a thread finally does "see" those writes*, or how the observations of different threads relate to each other. You need stronger memory ordering to get those guarantees.
*In fact, a valid output would be all threads reading only 0 all the time, except the writer threads reading what they wrote the previous iteration to their "own" variable (and 0 for the others). On hardware that never flushed caches unless prompted, this might actually happen, and it would be fully compliant with the C++ standard!
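By contrast, here is a minimal message-passing sketch (not from the book; the names producer, consumer, data, and ready are mine) showing the kind of guarantee you get once you move up from relaxed to a release/acquire pair:

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> data(0);
std::atomic<bool> ready(false);

void producer()
{
    data.store(42, std::memory_order_relaxed);
    ready.store(true, std::memory_order_release); // publish the data
}

void consumer()
{
    while (!ready.load(std::memory_order_acquire)) {} // wait for the signal
    // The acquire load synchronized-with the release store, so the
    // store to data is guaranteed to be visible here.
    assert(data.load(std::memory_order_relaxed) == 42);
}

int main()
{
    std::thread a(producer), b(consumer);
    a.join();
    b.join();
}

If ready were stored and loaded with memory_order_relaxed instead, the assert could legitimately fire.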
"When I test the code on my PC, I can't reproduce a result like that."
The "example output" shown is highly artificial. The C++ standard allows for this output to happen. This means you can write efficient and correct multithreaded code even on hardware with no inbuilt guarantees on cache coherency (see above). But common hardware today (x86 in particular) brings a lot of guarantees that actually make certain behavior impossible to observe (including the output in the question).
Also, note that x, y, and z are extremely likely to be adjacent in memory (this depends on the compiler), meaning they will likely all land on the same cache line. This leads to massive performance degradation (look up "false sharing"). And since memory can only be transferred between cores at cache-line granularity, this (together with the x86 coherency guarantees) makes it essentially impossible for an x86 CPU (which you most likely ran your tests on) to read outdated values of any of the variables. Allocating these variables more than one or two cache lines apart would likely lead to more interesting/chaotic results.
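As a rough illustration, here is one way the variables could be forced onto separate cache lines. This is a sketch assuming 64-byte cache lines (std::hardware_destructive_interference_size is the portable C++17 spelling of that constant); the wrapper name padded_atomic_int is mine:

#include <atomic>

// Each instance is aligned to (and therefore occupies) its own
// 64-byte cache line, so the three counters no longer share one.
struct alignas(64) padded_atomic_int
{
    std::atomic<int> v{0};
};

padded_atomic_int x, y, z;

Using x.v, y.v, and z.v in the test above would remove the false sharing, and on a weakly-ordered architecture might make stale reads easier to observe, though still not on x86.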
Related
Why is an atomic bool needed to avoid a data race?
I was looking at listing 5.13 in C++ Concurrency in Action by Anthony Williams, and I am confused by the comment "the store to and load from y still have to be atomic; otherwise, there would be a data race on y". That implies that if y is a normal (non-atomic) bool then the assert may fire, but why?

#include <atomic>
#include <thread>
#include <assert.h>

bool x=false;
std::atomic<bool> y;
std::atomic<int> z;

void write_x_then_y()
{
    x=true;
    std::atomic_thread_fence(std::memory_order_release);
    y.store(true,std::memory_order_relaxed);
}

void read_y_then_x()
{
    while(!y.load(std::memory_order_relaxed));
    std::atomic_thread_fence(std::memory_order_acquire);
    if(x)
        ++z;
}

int main()
{
    x=false;
    y=false;
    z=0;
    std::thread a(write_x_then_y);
    std::thread b(read_y_then_x);
    a.join();
    b.join();
    assert(z.load()!=0);
}

Now let's change y to a normal bool; I want to understand why the assert can fire.

#include <atomic>
#include <thread>
#include <assert.h>

bool x=false;
bool y=false;
std::atomic<int> z;

void write_x_then_y()
{
    x=true;
    std::atomic_thread_fence(std::memory_order_release);
    y=true;
}

void read_y_then_x()
{
    while(!y);
    std::atomic_thread_fence(std::memory_order_acquire);
    if(x)
        ++z;
}

int main()
{
    x=false;
    y=false;
    z=0;
    std::thread a(write_x_then_y);
    std::thread b(read_y_then_x);
    a.join();
    b.join();
    assert(z.load()!=0);
}

I understand that a data race happens on non-atomic global variables, but in this example, if the while loop in read_y_then_x exits, my understanding is that y must either already be set to true, or be in the process of being set to true (because the write is non-atomic) in the write_x_then_y thread. Since the atomic_thread_fence in write_x_then_y makes sure no code above it can be reordered after it, I think the x=true operation must have finished. In addition, the std::memory_order_release and std::memory_order_acquire fences in the two threads make sure that the updated value of x has been synchronized with the read_y_then_x thread by the time x is read, so I feel the assert should still hold... What am I missing?
Accessing a non-atomic object in two threads unsynchronized, with one of the accesses being a write access, is always a data race and causes undefined behavior. This is how the term "data race" is formally defined in the C++ language and what the language prescribes as its consequences. It is not merely a race condition, which informally refers to multiple possible outcomes being allowed due to unspecified ordering of certain thread accesses.

The write in y=true; happens while the loop while(!y); is still reading y, which makes it a data race if y is non-atomic. The program would have undefined behavior, which doesn't just mean that the assert might fire. It means that the program may do anything, e.g. crash or freeze up. The compiler is allowed to optimize under the assumption that this never happens, and may thereby transform the code in such a way that your intended behavior is not preserved, since that behavior relies on the access causing the data race.

Furthermore, an infinite loop which doesn't eventually perform any atomic/synchronizing/volatile/IO operation also has undefined behavior. So while(!y); has undefined behavior if y is not atomic and initially false, and the compiler can assume that this line is unreachable under those conditions. The compiler could, for example, remove the loop from the function for that reason, as actually does happen with current compilers; see the comments under the question. I am also aware that Clang in particular performs optimizations based on this, sometimes going so far as to completely drop all contents (including the ret instruction at the end!) from an emitted function containing such an infinite loop, if the function could never be called without undefined behavior. That doesn't happen here, however, because y might be true when the function is called, in which case there is no undefined behavior.

All of this is on the language level. It doesn't matter what would happen on the hardware level if the program were compiled as a most literal translation. Those would be additional concerns, e.g. potential tearing of the write access and potential cache incoherency between threads, but both of these are unlikely to be a problem for a bool on common platforms. Another problem might be, though, that the threads could keep a copy of the variable in a register, potentially never producing a store that the other thread could observe, which is allowed for a non-atomic, non-volatile object.
If you write this:

bool y=false;
...
while(!y);

then the compiler can assume y will not change by itself. The body of the while is empty, so either y is true at the start and you have an endless loop, or y is false at the start and the while ends. The compiler can optimize this into:

if (!y) while(true);

But C++ also says that there must always be forward progress; such an infinite loop is UB, so the compiler may do whatever it likes when it sees while(true);, including removing it. gcc and clang will actually do that, as Jerome pointed out here: https://godbolt.org/z/ocrxnee8T

So std::atomic<bool> y; is the modern form of marking y as volatile. The compiler can no longer assume that repeated reads of y give the same result and can no longer optimize away the while(!y); loop. Depending on the architecture, it will also insert the necessary memory barriers so that changes to the variable become observable to other threads, which is more than volatile would have done.
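For comparison, here is a minimal sketch of the waiting loop once y is atomic. Each iteration now performs a real atomic load, which counts as observable progress, so the loop is no longer UB and cannot be deleted:

#include <atomic>

std::atomic<bool> y{false};

void wait_for_y()
{
    // The compiler must re-load y every iteration and may not
    // transform this into if(!y) while(true);
    while (!y.load(std::memory_order_relaxed))
        ;
}

(Relaxed is enough to keep the loop alive; whether you also need acquire ordering depends on what data the flag is supposed to publish, as in the fence example above.)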
Is it a data race?
volatile global x = 0;

reader()
{
    while (x == 0) {}
    print("World\n");
}

writer()
{
    print("Hello, ");
    x = 1;
}

thread(reader);
thread(writer);

From Wikipedia (https://en.wikipedia.org/wiki/Race_condition#:~:text=Data%20race%5Bedit,only%20atomic%20operations.): "The precise definition of data race is specific to the formal concurrency model being used, but typically it refers to a situation where a memory operation in one thread could potentially attempt to access a memory location at the same time that a memory operation in another thread is writing to that memory location, in a context where this is dangerous."

There is at least one thread that writes to x (the writer). There is at least one thread that reads x (the reader). There is no synchronization mechanism for accessing x (both threads access it without any locks). Therefore, I think the code above has a data race (though obviously not a race condition). Am I right?

And then, what is the meaning of "data race" when code has a data race but generates the expected output? (We will see "Hello, World\n", assuming the processor guarantees that a store to an address becomes visible to all load instructions issued after the store instruction.)

Here is working C++ code:

#include <iostream>
#include <thread>

volatile int x = 0;

void reader()
{
    while (x == 0) {}
    std::cout << "World" << std::endl;
}

void writer()
{
    std::cout << "Hello, ";
    x = 1;
}

int main()
{
    std::thread t1(reader);
    std::thread t2(writer);
    t2.join();
    t1.join();
    return 0;
}
Yes, this is a data race and UB.

[intro.races]/2: "Two expression evaluations conflict if one of them modifies a memory location ... and the other one reads or modifies the same memory location."

[intro.races]/21: "Two actions are potentially concurrent if: — they are performed by different threads, ... The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, ... Any such data race results in undefined behavior."

For two things in different threads to "happen before" one another, a synchronization mechanism must be involved, such as non-relaxed atomics, mutexes, and so on.
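To make the missing happens-before concrete, here is a hedged sketch of the questioner's program synchronized with a mutex, one of the mechanisms just mentioned; every access to x now happens under the lock:

#include <iostream>
#include <mutex>
#include <thread>

int x = 0; // a plain int is fine once all access is synchronized
std::mutex m;

void reader()
{
    for (;;) {
        std::lock_guard<std::mutex> lk(m); // unlock/lock establishes happens-before
        if (x != 0)
            break;
    }
    std::cout << "World" << std::endl;
}

void writer()
{
    std::cout << "Hello, ";
    std::lock_guard<std::mutex> lk(m);
    x = 1;
}

int main()
{
    std::thread t1(reader);
    std::thread t2(writer);
    t2.join();
    t1.join();
}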
Yes, it is a data race and consequently undefined behavior in C++. Undefined behavior means that you have no guarantee how the program will behave. Seeing the "expected" output is one possible outcome, but you are not guaranteed that it will happen.

Here x is non-atomic, is read by thread t1, and is written by thread t2 without any synchronization, and therefore they cause a data race. volatile has no impact on whether or not an access is a data race. Only using an atomic (e.g. std::atomic<int>) can remove the data race.

That said, on many common platforms writing to an int will be atomic at the hardware level, the compiler will not optimize away volatile accesses, and it will probably also not reorder volatile accesses with IO; therefore the code will probably happen to work on those platforms. The language doesn't make this guarantee, though.
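A sketch of the atomic fix this answer points to; the explicit release/acquire orders are one possible choice, and the default seq_cst would be correct too:

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> x{0}; // atomic access removes the data race

void reader()
{
    while (x.load(std::memory_order_acquire) == 0) {}
    std::cout << "World" << std::endl;
}

void writer()
{
    std::cout << "Hello, ";
    x.store(1, std::memory_order_release);
}

int main()
{
    std::thread t1(reader);
    std::thread t2(writer);
    t2.join();
    t1.join();
}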
Are relaxed atomic stores reordered among themselves before the release? (similarly for loads after the acquire)
I read on en.cppreference.com that relaxed operations on atomics "[...] only guarantee atomicity and modification order consistency." So I was asking myself whether such a "modification order" holds when you are working on the same atomic variable or on different ones.

In my code I have an atomic tree, which a low-priority, event-based message thread fills in to mark which node should be updated, storing some data in the red "1" atomic (see picture) using memory_order_relaxed. It then continues by writing to the node's parent, using fetch_or to record which child atomic has been updated. Each atomic holds up to 64 bits, so I set bit 1 in the red operation "2". It continues successively up to the root atomic, which is also flagged using fetch_or, but this time with memory_order_release.

Then a fast, real-time, unblockable thread loads the control atomic (with memory_order_acquire) and reads which bits are enabled in it. It then recursively updates the child atomics with memory_order_relaxed, and that is how I sync my data with each cycle of the high-priority thread.

Since this thread is the one doing the updating, it is fine for child atomics to be stored before their parent. The problem is when it stores a parent (setting the bit for the children to update) before I fill in the child information. In other words, as the title says: are the relaxed stores reordered among themselves before the release one? (I don't mind non-atomic variables being reordered.)

Pseudo-code; suppose x, y, z and control are atomic and have initial value 0:

Event thread:
    z = 1;       // relaxed
    y = 1;       // relaxed
    x = 1;       // relaxed
    control = 0; // release

Real-time thread (loop):
    load control; // acquire
    load x;       // relaxed
    load y;       // relaxed
    load z;       // relaxed

I wonder if, in the real-time thread, x <= y <= z would always hold. To check that, I wrote this small program:

#define _ENABLE_ATOMIC_ALIGNMENT_FIX 1
#include <atomic>
#include <iostream>
#include <thread>
#include <assert.h>
#include <array>

using namespace std;

constexpr int numTries = 10000;
constexpr int arraySize = 10000;

array<atomic<int>, arraySize> tat;
atomic<int> tsync {0};

void writeArray()
{
    // Stores atomics in reverse order
    for (int j=0; j!=numTries; ++j) {
        for (int i=arraySize-1; i>=0; --i) {
            tat[i].store(j, memory_order_relaxed);
        }
        tsync.store(0, memory_order_release);
    }
}

void readArray()
{
    // Loads atomics in normal order
    for (int j=0; j!=numTries; ++j) {
        bool readFail = false;
        tsync.load(memory_order_acquire);
        int minValue = 0;
        for (int i=0; i!=arraySize; ++i) {
            int newValue = tat[i].load(memory_order_relaxed);
            // If it fails, stop the execution
            if (newValue < minValue) {
                readFail = true;
                cout << "fail " << endl;
                break;
            }
            minValue = newValue;
        }
        if (readFail) break;
    }
}

int main()
{
    for (int i=0; i!=arraySize; ++i) {
        tat[i].store(0);
    }
    thread b(readArray);
    thread a(writeArray);
    a.join();
    b.join();
}

How it works: there is an array of atomics. One thread stores to them with relaxed ordering, in reverse order, and finishes by storing to a control atomic with release ordering. The other thread loads that control atomic with acquire ordering, then loads the rest of the array's values with relaxed ordering. Since the parents mustn't be updated before the children, newValue should always be equal to or greater than minValue.

I've executed this program on my computer several times, in both debug and release builds, and it never triggers the failure. I'm using a normal x64 Intel i7 processor.
So, is it safe to assume that relaxed stores to multiple atomics do keep the "modification order", at least when they are synchronized with a control atomic and acquire/release?
Sadly, you will learn very little about what the Standard supports by experimenting on x86_64, because x86_64 is so well-behaved. In particular, unless you specify _seq_cst:

- all reads are effectively _acquire
- all writes are effectively _release, unless they cross a cache-line boundary
- all read-modify-writes are effectively seq_cst

except that the compiler is (also) allowed to reorder _relaxed operations.

You mention using a _relaxed fetch_or... and, if I understand correctly, you may be disappointed to learn that it is no less expensive than seq_cst: it requires a LOCK-prefixed instruction, carrying the full overhead of that.

But yes, _relaxed atomic operations are indistinguishable from ordinary operations as far as ordering is concerned. So yes, they may be reordered with respect to other _relaxed atomic operations as well as non-atomic ones, by the compiler and/or the machine. (Though, as noted, on x86_64, not by the machine.)

And yes, where a release operation in thread X synchronizes-with an acquire operation in thread Y, all writes in thread X which are sequenced-before the release will have happened-before the acquire in thread Y. So the release operation is a signal that all writes which precede it in X are "complete", and when the acquire operation sees that signal, Y knows it has synchronized and can read what was written by X (up to the release).

Now, the key thing to understand here is that simply doing a store _release is not enough: the value which is stored must be an unambiguous signal to the load _acquire that the store has happened. For otherwise, how can the load tell?

Generally, a _release/_acquire pair like this is used to synchronize access to some collection of data. Once that data is "ready", a store _release signals it. Any loads _acquire which see the signal know that the data is "ready" and can read it. Of course, any writes to the data which come after the store _release may (depending on timing) also be seen by the loads _acquire. What I am trying to say is that another signal may be required if there are to be further changes to the data.

Your little test program:

- initialises tsync to 0
- in the writer: after all the tat[i].store(j, memory_order_relaxed), does tsync.store(0, memory_order_release), so the value of tsync never changes!
- in the reader: does tsync.load(memory_order_acquire) before doing tat[i].load(memory_order_relaxed), and ignores the value read from tsync

I am here to tell you that these _release/_acquire pairs are not synchronizing; all these stores/loads may as well be _relaxed. (I think your test will "pass" as long as the writer manages to stay ahead of the reader, because on x86_64 all writes become visible in instruction order, as do all reads.)

For this to be a test of _release/_acquire semantics, I suggest:

- initialise tsync to 0 and tat[] to all zeros
- in the writer: run j = 1..numTries, and after all the tat[i].store(j, memory_order_relaxed), write tsync.store(j, memory_order_release); this signals that the pass is complete and that all of tat[] is now at least j
- in the reader: do j = tsync.load(memory_order_acquire); a pass across tat[] should then find j <= tat[i].load(memory_order_relaxed), and j == numTries signals that the writer has finished

The signal sent by the writer is that it has just completed writing j and will continue with j+1, unless j == numTries. But this does not guarantee the order in which the tat[] themselves are written.
If what you wanted was for the writer to stop after each pass and wait for the reader to see it and signal the same, then you need another signal, and you need the threads to wait for their respective "you may proceed" signals.
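Here is a sketch of the revised test described above, keeping the question's names and sizes; the writer now stores the pass number j itself as the release signal, so the reader can tell what it has synchronized with:

#include <array>
#include <atomic>
#include <iostream>
#include <thread>

constexpr int numTries  = 10000;
constexpr int arraySize = 10000;

std::array<std::atomic<int>, arraySize> tat;
std::atomic<int> tsync{0};

void writeArray()
{
    for (int j = 1; j <= numTries; ++j) {
        for (int i = arraySize - 1; i >= 0; --i)
            tat[i].store(j, std::memory_order_relaxed);
        // Signal that pass j is complete: every tat[i] is now at least j.
        tsync.store(j, std::memory_order_release);
    }
}

void readArray()
{
    int j = 0;
    while (j != numTries) {
        j = tsync.load(std::memory_order_acquire);
        for (int i = 0; i != arraySize; ++i) {
            // The acquire load synchronizes-with the release store of j,
            // so every element must already hold at least j.
            if (tat[i].load(std::memory_order_relaxed) < j) {
                std::cout << "fail" << std::endl;
                return;
            }
        }
    }
}

int main()
{
    for (auto& a : tat)
        a.store(0);
    std::thread r(readArray);
    std::thread w(writeArray);
    w.join();
    r.join();
}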
The quote about relaxed giving "modification order consistency" only means that all threads can agree on a modification order for that one object, i.e. that such an order exists. A later release-store that synchronizes with an acquire-load in another thread will guarantee that it's visible. https://preshing.com/20120913/acquire-and-release-semantics/ has a nice diagram.

Any time you're storing a pointer that other threads could load and dereference, use at least mo_release if any of the pointed-to data has also been recently modified and it's necessary that readers see those updates. (This includes anything indirectly reachable, like levels of your tree.)

On any kind of tree / linked-list / pointer-based data structure, pretty much the only time you could use relaxed would be for newly-allocated nodes that haven't been "published" to the other threads yet. (Ideally you can just pass args to constructors so they can be initialized without even trying to be atomic at all; the constructor for std::atomic<T>() is not itself atomic. So you must use a release store when publishing a pointer to a newly-constructed atomic object.)

On x86 / x86-64, mo_release has no extra cost; plain asm stores already have ordering as strong as release, so the compiler only needs to block compile-time reordering to implement var.store(val, mo_release). It's also pretty cheap on AArch64, especially if you don't do any acquire loads soon after.

This also means you can't test for relaxed being unsafe using x86 hardware; the compiler will pick one order for the relaxed stores at compile time, nailing them down into release operations in whatever order it picked. (And x86 atomic-RMW operations are always full barriers, effectively seq_cst. Making them weaker in the source only allows compile-time reordering. Some non-x86 ISAs can have cheaper RMWs as well as loads or stores for weaker orders, though; even acq_rel is slightly cheaper on PowerPC.)
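As an illustration of that publish pattern, here is a single-producer sketch (the names Node, head, push, and peek are mine, not from the question's tree code); the node is fully constructed before the release store makes it reachable:

#include <atomic>

struct Node
{
    int payload;
    Node* next;
    Node(int p, Node* n) : payload(p), next(n) {}
};

std::atomic<Node*> head{nullptr};

void push(int value) // single producer only; multiple pushers would need CAS
{
    // Initialize the node before publishing it; no atomicity is needed
    // yet, because no other thread can see the node.
    Node* n = new Node(value, head.load(std::memory_order_relaxed));
    // The release store publishes the pointer and everything written
    // to *n before it.
    head.store(n, std::memory_order_release);
}

Node* peek()
{
    // Pairs with the release store: if we see n, we also see its
    // payload and next fields.
    return head.load(std::memory_order_acquire);
}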
Multithreading and OOO execution.
#include <iostream>
#include <thread>

int main()
{
    int f = 0, x = 0;

    std::thread *t = new std::thread([&f, &x](){
        while (f == 0);
        std::cout << x << std::endl;
    });

    std::thread *t2 = new std::thread([&f, &x](){
        x = 42;
        f = 1;
    });

    t->join();
    t2->join();
    return 0;
}

From what I know, it is theoretically possible to get stdout equal to 0, contrary to our intuition (we expect 42 as the result), because the CPU can execute instructions out of order. In fact, it is possible to execute the program in this order (assuming we have more than one core in our CPU): thread #2 on the second core executes f = 1 first (because of the OoO mechanism), and then thread #1 on the first core runs its program, while (f == 0); std::cout << x << std::endl;, so the output is 0.

I tried to get such an output, but I always get 42. I ran the program 1,000,000 times and the result was always the same: 42. (I know it is not safe; there is a data race.)

My questions are:

Am I right, or am I wrong? Why?
If I am right, is it possible to force the output to be 0?
How can I make this code safe? I know about mutexes/semaphores, and I could protect f with a mutex, but I have heard something about memory fences; please tell me more.
"The CPU can execute instructions out of order, and in fact it is possible to execute the program in that order."

Out-of-order execution is different from the reordering of when loads/stores become globally visible. OoO execution preserves the illusion of your program running in order. Memory reordering is possible without OoO execution; even an in-order pipelined core will want to buffer its stores. See parts of this answer, for example.

"If I am right, is it possible to force the output to be 0?"

Not on x86, which only does StoreLoad reordering, not StoreStore reordering. If the compiler reorders the stores to x and f at compile time, then you will sometimes see x==0 after seeing f==1; otherwise you will never see that.

A short sleep after spawning thread1, before spawning thread2, would also make sure thread1 was spinning on x before you modify it. Then you don't need thread2, and can actually do the stores from the main thread.

Have a look at Jeff Preshing's "Memory Reordering Caught in the Act" for a real program that does observe run-time memory reordering on x86, about once per ~6k iterations on a Nehalem. On a weakly-ordered architecture, you could maybe see StoreStore reordering at run time with something like your test program, but you'd likely have to arrange for the variables to be in different cache lines! And you'd need to test in a loop, not just once per program invocation.

"How can I make this code safe?"

Use C++11 std::atomic to get acquire/release semantics on your accesses to f:

std::atomic<uint32_t> f; // flag to indicate when x is ready
uint32_t x;
...
// don't use new when a local with automatic storage works fine
std::thread t1 = std::thread([&f, &x](){
    while (f.load(std::memory_order_acquire) == 0);
    std::cout << x << std::endl;
});
// or sleep a few ms, and do t2's work in the main thread
std::thread t2 = std::thread([&f, &x](){
    x = 42;
    f.store(1, std::memory_order_release);
});

The default memory ordering for something like f = 1 is mo_seq_cst, which requires an MFENCE on x86, or an equivalently expensive barrier on other architectures. On x86, the weaker memory orderings just prevent compile-time reordering and don't require any barrier instructions.

std::atomic also prevents the compiler from hoisting the load of f out of the while loop in thread1, as Baum's comment describes. (An atomic has semantics like volatile in that the stored value is assumed to be able to change asynchronously. Since data races are undefined behaviour, the compiler can normally hoist loads out of loops, unless alias analysis fails to prove that stores through pointers inside the loop can't modify the value.)
About spin locks
I have some questions about this Boost spinlock code:

class spinlock
{
public:
    spinlock() : v_(0) {}

    bool try_lock()
    {
        long r = InterlockedExchange(&v_, 1);
        _ReadWriteBarrier(); // 1. What does this mean?
        return r == 0;
    }

    void lock()
    {
        for (unsigned k = 0; !try_lock(); ++k)
            yield(k);
    }

    void unlock()
    {
        _ReadWriteBarrier();
        *const_cast<long volatile*>(&v_) = 0; // 2. Why not use InterlockedExchange(&v_, 0)?
    }

private:
    long v_;
};
_ReadWriteBarrier() is a memory barrier, in this case for both reads and writes. Strictly speaking, it is an MSVC compiler intrinsic rather than a processor instruction: it forbids the compiler from moving any load or store across it, while the LOCK-prefixed InterlockedExchange already provides the hardware-level ordering. In this particular case, it makes sure that the InterlockedExchange(&v_, 1) has completed before we continue.

As for not using InterlockedExchange(&v_, 0) in unlock: it would be less efficient, since it takes more interaction with the other cores in the machine to ensure that all other processor cores have "let go" of the value. That makes no sense here, because (in correctly working code) we only unlock if we actually hold the lock, so no other processor will have a different value cached than what we're writing over anyway; a volatile write to the memory is just as good.
The barriers are there to ensure memory synchronization; without them, different threads may see modifications of memory in different orders. And the InterlockedExchange isn't necessary in the second case because we're not interested in the previous value. The role of InterlockedExchange is doubtlessly to set the value and return the previous value. (And why v_ would be long, when it can only take values 0 and 1, is beyond me.)
There are three issues with atomic access to variables. First, ensuring that there is no thread switch in the middle of reading or writing a value; if this happens it's called "tearing"; the second thread can see a partly written value, which will usually be nonsensical. Second, ensuring that all processors see the change that is being made with a write, or that the processor reading a value sees any previous changes to that value; this is called "cache coherency". Third, ensuring that the compiler doesn't move code across the read or write; this is called "code motion". InterlockedExchange does the first two; although the MSDN documentation is rather muddled, _ReadWriteBarrier does the third, and possibly the second.
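Tying the three issues together, here is a hedged modern-C++ sketch of the same spinlock, where std::atomic provides all three guarantees (no tearing, visibility, and no code motion) without platform intrinsics:

#include <atomic>
#include <thread>

class spinlock
{
public:
    bool try_lock()
    {
        // acquire ordering plays the role of the _ReadWriteBarrier
        // after InterlockedExchange: later accesses can't move above it.
        return !flag_.exchange(true, std::memory_order_acquire);
    }

    void lock()
    {
        while (!try_lock())
            std::this_thread::yield();
    }

    void unlock()
    {
        // release ordering plays the role of the barrier before the
        // plain volatile store: earlier accesses can't move below it.
        flag_.store(false, std::memory_order_release);
    }

private:
    std::atomic<bool> flag_{false};
};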