Suppose there are no compiler reorderings.
int32_t value;
int32_t flag = 0;
// thread 1
void UpdateValue(int32_t x) {
value = x;
flag = 1;
}
// thread 2
void DoSomething() {
while (flag == 0);
do_something(value);
}
According to https://en.cppreference.com/w/cpp/language/memory_model, the evaluations flag = 1 and flag == 0 conflict.
And:
flag is not an atomic variable
there is no signal handler involved
flag = 1 does not happen before flag == 0
So is there a data race?
But in this sample code, every read/write is atomic at the hardware level (4-byte aligned).
I can't find any undefined behavior and I'm confused...
The data race here is UB, and you can expect any behavior, including the one you expect. The order in which different threads read/write that location may help explain why it is UB:
std::memory_order specifies how memory accesses, including regular, non-atomic memory accesses, are to be ordered around an atomic operation. Absent any constraints on a multi-core system, when multiple threads simultaneously read and write to several variables, one thread can observe the values change in an order different from the order another thread wrote them. Indeed, the apparent order of changes can even differ among multiple reader threads. Some similar effects can occur even on uniprocessor systems due to compiler transformations allowed by the memory model.
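For example, here is a minimal sketch of one way to remove the data race from the code in the question, making flag atomic and using release/acquire (do_something is assumed to be declared elsewhere, as in the question):
#include <atomic>
#include <cstdint>
void do_something(int32_t); // assumed, as in the question
int32_t value;
std::atomic<int32_t> flag{0};
// thread 1
void UpdateValue(int32_t x) {
    value = x;                                // plain write, sequenced before the store below
    flag.store(1, std::memory_order_release); // publishes `value`
}
// thread 2
void DoSomething() {
    while (flag.load(std::memory_order_acquire) == 0) {}
    do_something(value); // guaranteed to see x: the release store synchronizes-with the acquire load
}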
Related
volatile global x = 0;
reader() {
while (x == 0) {}
print ("World\n");
}
writer() {
print ("Hello, ")
x = 1;
}
thread (reader);
thread (writer);
https://en.wikipedia.org/wiki/Race_condition#:~:text=Data%20race%5Bedit,only%20atomic%20operations.
From Wikipedia:
The precise definition of data race is specific to the formal concurrency model being used, but typically it refers to a situation where a memory operation in one thread could potentially attempt to access a memory location at the same time that a memory operation in another thread is writing to that memory location, in a context where this is dangerous.
There is at least one thread that writes to x (writer).
There is at least one thread that reads x (reader).
There is no synchronization mechanism for accessing x (both threads access x without any locks).
Therefore, I think the code above has a data race (though obviously not a race condition).
Am I right?
Then what does "data race" mean when code has one but still generates the expected output? (We will see "Hello, World\n", assuming the processor guarantees that a store to an address becomes visible to all load instructions issued after the store instruction.)
----------- added working cpp code ------------
#include <iostream>
#include <thread>
volatile int x = 0;
void reader() {
while (x == 0) {}
std::cout << "World" << std::endl;
}
void writer() {
std::cout << "Hello, ";
x = 1;
}
int main() {
std::thread t1(reader);
std::thread t2(writer);
t2.join();
t1.join();
return 0;
}
Yes, this is a data race and UB.
[intro.races]/2
Two expression evaluations conflict if one of them modifies a memory location ... and the other one reads or modifies the same memory location.
[intro.races]/21
Two actions are potentially concurrent if:
— they are performed by different threads, ...
...
The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, ...
Any such data race results in undefined behavior.
For two things in different threads to "happen before" one another, a synchronization mechanism must be involved, such as non-relaxed atomics, mutexes, and so on.
Yes, data race and consequently undefined behavior in C++. Undefined behavior means that you have no guarantee how the program will behave. Seeing the "expected" output is one possible output, but you are not guaranteed that it will happen.
Here x is non-atomic and is read by thread t1 and written by thread t2 without any synchronization and therefore they cause a data race.
volatile has no impact on whether or not an access is a data race. Only using an atomic (e.g. std::atomic<int>) can remove the data race.
That said, on many common platforms writing to an int will be atomic at the hardware level, the compiler will not optimize away volatile accesses and will probably also not reorder volatile accesses with I/O, so it will probably happen to work on those platforms. The language doesn't make this guarantee, though.
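For completeness, a minimal sketch of a race-free version using std::atomic, keeping the structure of the program above:
#include <atomic>
#include <iostream>
#include <thread>
std::atomic<int> x{0}; // atomic instead of volatile: no data race
void reader() {
    while (x.load(std::memory_order_acquire) == 0) {}
    std::cout << "World" << std::endl;
}
void writer() {
    std::cout << "Hello, ";
    x.store(1, std::memory_order_release);
}
int main() {
    std::thread t1(reader);
    std::thread t2(writer);
    t2.join();
    t1.join();
    return 0;
}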
I read in the en.cppreference.com documentation that relaxed operations on atomics "[...] only guarantee atomicity and modification order consistency."
So I was asking myself whether such 'modification order' holds when you are working on the same atomic variable or on different ones.
In my code I have a tree of atomics, where a low-priority, event-based message thread marks which node should be updated by storing some data in the red '1' atomic (see picture), using memory_order_relaxed. Then it continues by writing into its parent with fetch_or to record which child atomic has been updated. Each atomic supports up to 64 bits, so I set bit 1 in the red operation '2'. It continues this way up to the root atomic, which is also flagged with fetch_or, but this time using memory_order_release.
Then a fast, real-time, unblockable thread loads the control atomic (with memory_order_acquire) and reads which bits are set. Then it recursively updates the child atomics with memory_order_relaxed. And that is how I sync my data with each cycle of the high-priority thread.
Since this thread is the one doing the updating, it is fine if child atomics are stored before their parent. The problem is if it stores a parent (filling the bit of the child to update) before I fill the child information.
In other words, as the title says: can the relaxed stores be reordered among themselves before the release one? I don't mind if non-atomic variables are reordered. Pseudo-code; suppose [x, y, z, control] are atomic with initial value 0:
Event thread:
z = 1; // relaxed
y = 1; // relaxed
x = 1; // relaxed
control = 0; // release
Real time thread (loop):
load control; // acquire
load x; // relaxed
load y; // relaxed
load z; // relaxed
I wonder whether, in the real-time thread, this would always hold: x <= y <= z. To check that, I wrote this small program:
#define _ENABLE_ATOMIC_ALIGNMENT_FIX 1
#include <atomic>
#include <iostream>
#include <thread>
#include <assert.h>
#include <array>
using namespace std;
constexpr int numTries = 10000;
constexpr int arraySize = 10000;
array<atomic<int>, arraySize> tat;
atomic<int> tsync {0};
void writeArray()
{
// Stores atomics in reverse order
for (int j=0; j!=numTries; ++j)
{
for (int i=arraySize-1; i>=0; --i)
{
tat[i].store(j, memory_order_relaxed);
}
tsync.store(0, memory_order_release);
}
}
void readArray()
{
// Loads atomics in normal order
for (int j=0; j!=numTries; ++j)
{
bool readFail = false;
tsync.load(memory_order_acquire);
int minValue = 0;
for (int i=0; i!=arraySize; ++i)
{
int newValue = tat[i].load(memory_order_relaxed);
// If it fails, it stops the execution
if (newValue < minValue)
{
readFail = true;
cout << "fail " << endl;
break;
}
minValue = newValue;
}
if (readFail) break;
}
}
int main()
{
for (int i=0; i!=arraySize; ++i)
{
tat[i].store(0);
}
thread b(readArray);
thread a(writeArray);
a.join();
b.join();
}
How it works: There is an array of atomics. One thread stores to them with relaxed ordering in reverse order and finishes by storing to a control atomic with release ordering.
The other thread loads that control atomic with acquire ordering, then loads the rest of the array's values with relaxed ordering. Since the parents mustn't be updated before the children, newValue should always be equal to or greater than minValue.
I've executed this program on my computer several times, in debug and release, and it never triggers the failure. I'm using a normal x64 Intel i7 processor.
So, is it safe to assume that relaxed stores to multiple atomics do keep the 'modification order', at least when they are synchronized with a control atomic and acquire/release?
Sadly, you will learn very little about what the Standard supports by experimenting on x86_64, because the x86_64 is so well-behaved. In particular, unless you specify _seq_cst:
all reads are effectively _acquire
all writes are effectively _release
unless they cross a cache-line boundary. And:
all read-modify-writes are effectively seq_cst
Except that the compiler is (also) allowed to re-order _relaxed operations.
You mention using _relaxed fetch_or... and if I understand correctly, you may be disappointed to learn that it is no less expensive than seq_cst: it requires a LOCK-prefixed instruction, carrying the full overhead of that.
But, yes, _relaxed atomic operations are indistinguishable from ordinary operations as far as ordering is concerned. So yes, they may be reordered with respect to other _relaxed atomic operations as well as non-atomic ones -- by the compiler and/or the machine. [Though, as noted, on x86_64, not by the machine.]
And, yes where a release operation in thread X synchronizes-with an acquire operation in thread Y, all writes in thread X which are sequenced-before the release will have happened-before the acquire in thread Y. So the release operation is a signal that all writes which precede it in X are "complete", and when the acquire operation sees that signal Y knows it has synchronized and can read what was written by X (up to the release).
Now, the key thing to understand here is that simply doing a store _release is not enough: the value which is stored must be an unambiguous signal to the load _acquire that the store has happened. For otherwise, how can the load tell?
Generally a _release/_acquire pair like this are used to synchronize access to some collection of data. Once that data is "ready", a store _release signals that. Any load _acquire which sees the signal (or all loads _acquire which see the signal) know that the data is "ready" and they can read it. Of course, any writes to the data which come after the store _release may (depending on timing) also be seen by the load(s) _acquire. What I am trying to say here, is that another signal may be required if there are to be further changes to the data.
Your little test program:
initialises tsync to 0
in the writer: after all the tat[i].store(j, memory_order_relaxed), does tsync.store(0, memory_order_release)
so the value of tsync does not change!
in the reader: does tsync.load(memory_order_acquire) before doing tat[i].load(memory_order_relaxed)
and ignores the value read from tsync
I am here to tell you that the _release/_acquire pairs are not synchronizing -- all these stores/loads may as well be _relaxed. [I think your test will "pass" if the writer manages to stay ahead of the reader, because on x86-64 all writes are done in instruction order, as are all reads.]
For this to be a test of _release/_acquire semantics, I suggest:
initialises tsync to 0 and tat[] to all zero.
in the writer: run j = 1..numTries
after all the tat[i].store(j, memory_order_relaxed), write tsync.store(j, memory_order_release)
this signals that the pass is complete, and that all tat[] is now j.
in the reader: do j = tsync.load(memory_order_acquire)
a pass across tat[] should find j <= tat[i].load(memory_order_relaxed)
and after the pass, j == numTries signals that the writer has finished.
where the signal sent by the writer is that it has just completed writing j, and will continue with j+1, unless j == numTries. But this does not guarantee the order in which tat[] are written. A sketch of this revised test follows below.
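Here is a hedged sketch of the suggested revision, reusing the declarations (tat, tsync, arraySize, numTries and the includes) from the original program:
void writeArray()
{
    for (int j = 1; j <= numTries; ++j)
    {
        for (int i = arraySize - 1; i >= 0; --i)
        {
            tat[i].store(j, memory_order_relaxed);
        }
        tsync.store(j, memory_order_release); // signal: pass j is complete
    }
}
void readArray()
{
    int j = 0;
    while (j != numTries)
    {
        j = tsync.load(memory_order_acquire); // synchronizes-with the writer's release of j
        for (int i = 0; i != arraySize; ++i)
        {
            // the happens-before established above guarantees this:
            assert(tat[i].load(memory_order_relaxed) >= j);
        }
    }
}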
If what you wanted was for the writer to stop after each pass, and wait for the reader to see it and signal same -- then you need another signal and you need the threads to wait for their respective "you may proceed" signal.
The quote about relaxed giving modification order consistency only means that all threads can agree on a modification order for that one object, i.e., an order exists. A later release-store that synchronizes with an acquire-load in another thread will guarantee that it's visible. https://preshing.com/20120913/acquire-and-release-semantics/ has a nice diagram.
Any time you're storing a pointer that other threads could load and deref, use at least mo_release if any of the pointed-to data has also been recently modified, if it's necessary that readers also see those updates. (This includes anything indirectly reachable, like levels of your tree.)
On any kind of tree / linked-list / pointer-based data structure, pretty much the only time you could use relaxed would be in newly-allocated nodes that haven't been "published" to the other threads yet. (Ideally you can just pass args to constructors so they can be initialized without even trying to be atomic at all; the constructor for std::atomic<T>() is not itself atomic. So you must use a release store when publishing a pointer to a newly-constructed atomic object.)
On x86 / x86-64, mo_release has no extra cost; plain asm stores already have ordering as strong as release so the compiler only needs to block compile time reordering to implement var.store(val, mo_release); It's also pretty cheap on AArch64, especially if you don't do any acquire loads soon after.
It also means you can't test for relaxed being unsafe using x86 hardware; the compiler will pick one order for the relaxed stores at compile time, nailing them down into release operations in whatever order it picked. (And x86 atomic-RMW operations are always full barriers, effectively seq_cst. Making them weaker in the source only allows compile-time reordering. Some non-x86 ISAs can have cheaper RMWs as well as load or store for weaker orders, though, even acq_rel being slightly cheaper on PowerPC.)
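To illustrate the publishing rule above, here is a minimal sketch (the Node type and function names are hypothetical) of releasing a pointer to a freshly built node:
#include <atomic>
struct Node {            // hypothetical node type
    int payload;
    Node* child = nullptr;
};
std::atomic<Node*> head{nullptr};
void publish(int v) {
    Node* n = new Node;  // still private to this thread
    n->payload = v;      // plain writes are fine before publication
    // Release store: any thread that acquire-loads `head` and sees `n`
    // also sees the writes above.
    head.store(n, std::memory_order_release);
}
Node* consume() {
    return head.load(std::memory_order_acquire); // pairs with the release store
}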
I was running a bunch of algorithms through Relacy to verify their correctness and I stumbled onto something I didn't really understand. Here's a simplified version of it:
#include <thread>
#include <atomic>
#include <iostream>
#include <cassert>
struct RMW_Ordering
{
std::atomic<bool> flag {false};
std::atomic<unsigned> done {0}, counter {0};
unsigned race_cancel {0}, race_success {0}, sum {0};
void thread1() // fail
{
race_cancel = 1; // data produced
if (counter.fetch_add(1, std::memory_order_release) == 1 &&
!flag.exchange(true, std::memory_order_relaxed))
{
counter.store(0, std::memory_order_relaxed);
done.store(1, std::memory_order_relaxed);
}
}
void thread2() // success
{
race_success = 1; // data produced
if (counter.fetch_add(1, std::memory_order_release) == 1 &&
!flag.exchange(true, std::memory_order_relaxed))
{
done.store(2, std::memory_order_relaxed);
}
}
void thread3()
{
while (!done.load(std::memory_order_relaxed)); // livelock test
counter.exchange(0, std::memory_order_acquire);
sum = race_cancel + race_success;
}
};
int main()
{
for (unsigned i = 0; i < 1000; ++i)
{
RMW_Ordering test;
std::thread t1([&]() { test.thread1(); });
std::thread t2([&]() { test.thread2(); });
std::thread t3([&]() { test.thread3(); });
t1.join();
t2.join();
t3.join();
assert(test.counter == 0);
}
std::cout << "Done!" << std::endl;
}
Two threads race to enter a protected region and the last one modifies done, releasing a third thread from an infinite loop. The example is a bit contrived but the original code needs to claim this region through the flag to signal "done".
Initially, the fetch_add had acq_rel ordering because I was concerned the exchange might get reordered before it, potentially causing one thread to claim the flag, attempt the fetch_add check first, and prevent the other thread (which gets past the increment check) from successfully modifying the schedule. While testing with Relacy, I figured I'd see whether the livelock I expected would take place if I switched from acq_rel to release, and to my surprise, it didn't. I then used relaxed for everything, and again, no livelock.
I tried to find any rules regarding this in the C++ standard but only managed to dig up these:
1.10.7 In addition, there are relaxed atomic operations, which are not synchronization operations, and atomic read-modify-write operations, which have special characteristics.
29.3.11 Atomic read-modify-write operations shall always read the last value (in the modification order) written before the write associated with the read-modify-write operation.
Can I always rely on RMW operations not being reordered - even if they affect different memory locations - and is there anything in the standard that guarantees this behaviour?
EDIT:
I came up with a simpler setup that should illustrate my question a little better. Here's the CppMem script for it:
int main()
{
atomic_int x = 0; atomic_int y = 0;
{{{
{
if (cas_strong_explicit(&x, 0, 1, relaxed, relaxed))
{
cas_strong_explicit(&y, 0, 1, relaxed, relaxed);
}
}
|||
{
if (cas_strong_explicit(&x, 0, 2, relaxed, relaxed))
{
cas_strong_explicit(&y, 0, 2, relaxed, relaxed);
}
}
|||
{
// Is it possible for x and y to read 2 and 1, or 1 and 2?
x.load(relaxed).readsvalue(2);
y.load(relaxed).readsvalue(1);
}
}}}
return 0;
}
I don't think the tool is sophisticated enough to evaluate this scenario, though it does seem to indicate that it's possible. Here's the almost equivalent Relacy setup:
#include "relacy/relacy_std.hpp"
struct rmw_experiment : rl::test_suite<rmw_experiment, 3>
{
rl::atomic<unsigned> x, y;
void before()
{
x($) = y($) = 0;
}
void thread(unsigned tid)
{
if (tid == 0)
{
unsigned exp1 = 0;
if (x($).compare_exchange_strong(exp1, 1, rl::mo_relaxed))
{
unsigned exp2 = 0;
y($).compare_exchange_strong(exp2, 1, rl::mo_relaxed);
}
}
else if (tid == 1)
{
unsigned exp1 = 0;
if (x($).compare_exchange_strong(exp1, 2, rl::mo_relaxed))
{
unsigned exp2 = 0;
y($).compare_exchange_strong(exp2, 2, rl::mo_relaxed);
}
}
else
{
while (!(x($).load(rl::mo_relaxed) && y($).load(rl::mo_relaxed)));
RL_ASSERT(x($) == y($));
}
}
};
int main()
{
rl::simulate<rmw_experiment>();
}
The assertion is never violated, so 1 and 2 (or the reverse) is not possible according to Relacy.
I haven't fully grokked your code yet, but the bolded question has a straightforward answer:
Can I always rely on RMW operations not being reordered - even if they affect different memory locations
No, you can't. Compile-time reordering of two relaxed RMWs in the same thread is very much allowed. (I think runtime reordering of two RMWs is probably impossible in practice on most CPUs. ISO C++ doesn't distinguish compile-time vs. run-time for this.)
But note that an atomic RMW includes both a load and a store, and both parts have to stay together. So any kind of RMW can't move earlier past an acquire operation, or later past a release operation.
Also, of course the RMW itself being a release and/or acquire operation can stop reordering in one or the other direction.
Of course, the C++ memory model isn't formally defined in terms of local reordering of access to cache-coherent shared memory, only in terms of synchronizing with another thread and creating a happens-before / after relationship. But if you ignore IRIW reordering (2 reader threads not agreeing on the order of two writer threads doing independent stores to different variables) it's pretty much 2 different ways to model the same thing.
In your first example it is guaranteed that the flag.exchange is always executed after the counter.fetch_add, because the && short-circuits -- i.e., if the first expression resolves to false, the second expression is never evaluated. The C++ standard guarantees this, so the compiler cannot reorder the two expressions (regardless of which memory order they use).
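A tiny sketch of that sequencing guarantee, independent of any memory orders:
#include <iostream>
bool a() { std::cout << "a "; return false; }
bool b() { std::cout << "b "; return true; }
int main() {
    bool r = a() && b(); // prints only "a ": b() is never evaluated,
                         // because && short-circuits and sequences its operands
    std::cout << std::boolalpha << r << '\n'; // false
}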
As Peter Cordes already explained, the C++ standard says nothing about if or when instructions can be reordered with respect to atomic operations. In general, most compiler optimizations rely on the as-if rule:
The semantic descriptions in this International Standard define a parameterized nondeterministic abstract machine. This International Standard places no requirement on the structure of conforming implementations. In particular, they need not copy or emulate the structure of the abstract machine. Rather, conforming implementations are required to emulate (only) the observable behavior of the abstract machine [..].
This provision is sometimes called the “as-if” rule, because an implementation is free to disregard any requirement of this International Standard as long as the result is as if the requirement had been obeyed, as far as can be determined from the observable behavior of the program. For instance, an actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no side effects affecting the observable behavior of the program are produced.
The key aspect here is the "observable behavior". Suppose you have two relaxed atomic loads A and B on two different atomic objects, where A is sequenced before B.
std::atomic<int> x, y;
x.load(std::memory_order_relaxed); // A
y.load(std::memory_order_relaxed); // B
A sequenced-before relation is part of the definition of the happens-before relation, so one might assume that the two operations cannot be reordered. However, since the two operations are relaxed, there is no guarantee about the "observable behavior": even with the original order, the x.load (A) could return a newer result than the y.load (B), so the compiler would be free to reorder them, since the final program would not be able to tell the difference (i.e., the observable behavior is equivalent). If it were not equivalent, then you would have a race condition! ;-)
To prevent such reorderings you have to rely on the (inter-thread) happens-before relation. If the x.load (A) would use memory_order_acquire, then the compiler would have to assume that this operation synchronizes-with some release operation, thus establishing a (inter-thread) happens-before relation. Suppose some other thread performs two atomic updates:
y.store(42, std::memory_order_relaxed); // C
x.store(1, std::memory_order_release); // D
If the acquire-load A sees the value stored by the store-release D, then the two operations synchronize with each other, thereby establishing a happens-before relation. Since y.store is sequenced before x.store, and x.load is sequenced before y.load, the transitivity of the happens-before relation guarantees that y.store happens-before y.load. Reordering the two loads or the two stores would destroy this guarantee and therefore also change the observable behavior. Thus, the compiler cannot perform such reorderings.
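Putting the two snippets together, here is a minimal complete sketch of this message-passing pattern:
#include <atomic>
#include <cassert>
#include <thread>
std::atomic<int> x{0}, y{0};
void producer() {
    y.store(42, std::memory_order_relaxed); // C
    x.store(1, std::memory_order_release);  // D
}
void consumer() {
    while (x.load(std::memory_order_acquire) != 1) {} // A: eventually sees D
    assert(y.load(std::memory_order_relaxed) == 42);  // B: guaranteed, since C happens-before B
}
int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}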
In general, arguing about possible reorderings is the wrong approach. In a first step you should always identify your required happens-before relations (e.g., the y.store has to happen before the y.load). The next step is then to ensure that these happens-before relations are correctly established in all cases. At least that is how I approach correctness arguments for my implementations of lock-free algorithms.
Regarding Relacy: Relacy only simulates the memory model, but it relies on the order of operations as generated by the compiler. So even if a compiler could reorder two instructions, but chooses not to, you will not be able to identify this with Relacy.
I found an example of a race condition that I was able to reproduce under g++ on Linux. What I don't understand is how the order of operations matters in this example.
int va = 0;
void fa() {
for (int i = 0; i < 10000; ++i)
++va;
}
void fb() {
for (int i = 0; i < 10000; ++i)
--va;
}
int main() {
std::thread a(fa);
std::thread b(fb);
a.join();
b.join();
std::cout << va;
}
I can understand that the order matters if I had used va = va + 1;, because then the RHS va could have changed before being assigned back to the LHS va. Can someone clarify?
The standard says (quoting the latest draft):
[intro.races]
Two expression evaluations conflict if one of them modifies a memory location ([intro.memory]) and the other one reads or modifies the same memory location.
The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, except for the special case for signal handlers described below.
Any such data race results in undefined behavior.
Your example program has a data race, and the behaviour of the program is undefined.
What I don't understand is how the order of operations matter in this example.
The order of operations matters because the operations are not atomic, and they read and modify the same memory location.
I can understand that the order matters if I had used va = va + 1;, because then the RHS va could have changed before being assigned back to the LHS va
The same applies to the increment operator. The abstract machine will:
Read a value from memory
Increment the value
Write a value back to memory
There are multiple steps there that can interleave with operations in the other thread.
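To make this concrete, here is a hedged sketch of ++va decomposed into those abstract-machine steps, with a lost-update interleaving shown in the comments:
#include <iostream>
int va = 0;
void increment_decomposed() { // what ++va does, conceptually
    int tmp = va;   // 1. read va from memory
    tmp = tmp + 1;  // 2. increment in a register
    va = tmp;       // 3. write the result back
}
int main() {
    increment_decomposed();
    std::cout << va << '\n'; // 1
    // With two threads the steps can interleave, e.g.:
    //   T1 reads 0, T2 reads 0, T1 writes 1, T2 writes -1
    // and T1's increment is lost -- one of many possible outcomes of the race.
}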
Even if there was a single operation per thread, there would be no guarantee of well defined behaviour unless those operations are atomic.
Note, outside of the scope of C++: a CPU might have a single instruction for incrementing an integer in memory. For example, x86 has such an instruction. It can be invoked both atomically and non-atomically. It would be wasteful for the compiler to use the atomic instruction unless you explicitly use atomic operations in C++.
First of all, this is undefined behaviour since the two threads' reads and writes of the same non-atomic variable va are potentially concurrent and neither happens before the other.
With that being said, if you want to understand what your computer is actually doing when this program is run, it may help to assume that ++va is the same as va = va + 1. In fact, the standard says they are identical, and the compiler will likely compile them identically. Since your program contains UB, the compiler is not required to do anything sensible like using an atomic increment instruction. If you wanted an atomic increment instruction, you should have made va atomic. Similarly, --va is the same as va = va - 1. So in practice, various results are possible.
The important idea here is that when C++ is compiled it is "translated" to assembly language. The translation of ++va or --va will result in assembly code that moves the value of va into a register, then stores the result of adding 1 to that register back to va in a separate instruction. In this way, it is exactly the same as va = va + 1;. It also means that the operation va++ is not necessarily atomic.
See here for an explanation of what the Assembly code for these instructions will look like.
To make the operations atomic, the variable needs a synchronization mechanism. You can get this by declaring an atomic variable (which will handle synchronization between threads for you):
std::atomic<int> va;
Reference: https://en.cppreference.com/w/cpp/atomic/atomic
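A minimal sketch of the program rewritten with std::atomic, which removes the data race (relaxed ordering is enough here because only the final value is inspected after the joins):
#include <atomic>
#include <iostream>
#include <thread>
std::atomic<int> va{0};
void fa() {
    for (int i = 0; i < 10000; ++i)
        va.fetch_add(1, std::memory_order_relaxed); // atomic ++va
}
void fb() {
    for (int i = 0; i < 10000; ++i)
        va.fetch_sub(1, std::memory_order_relaxed); // atomic --va
}
int main() {
    std::thread a(fa);
    std::thread b(fb);
    a.join();
    b.join();
    std::cout << va << '\n'; // now always prints 0
}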
I have some questions about this Boost spinlock code:
class spinlock
{
public:
spinlock()
: v_(0)
{
}
bool try_lock()
{
long r = InterlockedExchange(&v_, 1);
_ReadWriteBarrier(); // 1. What does this mean?
return r == 0;
}
void lock()
{
for (unsigned k = 0; !try_lock(); ++k)
{
yield(k);
}
}
void unlock()
{
_ReadWriteBarrier();
*const_cast<long volatile*>(&v_) = 0;
// 2. Why don't we need to use InterlockedExchange(&v_, 0)?
}
private:
long v_;
};
_ReadWriteBarrier() is a "memory barrier" (in this case for both reads and writes): a directive ensuring that any instructions resulting in memory operations have completed (load and store operations -- or, on x86 processors for example, any operation which has a memory operand on either side). In this particular case, it makes sure that the InterlockedExchange(&v_, 1) has completed before we continue.
Because an InterlockedExchange would be less efficient: it takes more interaction with the other cores in the machine to ensure that all other processor cores have 'let go' of the value. That makes no sense here, since (in correctly working code) we only unlock if we actually hold the lock, so no other processor will have a different value cached than what we're writing over anyway. A volatile write to the memory will be just as good.
The barriers are there to ensure memory synchronization; without them, different threads may see modifications of memory in different orders.
And the InterlockedExchange isn't necessary in the second case because we're not interested in the previous value. The role of InterlockedExchange is doubtlessly to set the value and return the previous value. (And why v_ would be long, when it can only take the values 0 and 1, is beyond me.)
There are three issues with atomic access to variables. First, ensuring that there is no thread switch in the middle of reading or writing a value; if this happens it's called "tearing"; the second thread can see a partly written value, which will usually be nonsensical. Second, ensuring that all processors see the change that is being made with a write, or that the processor reading a value sees any previous changes to that value; this is called "cache coherency". Third, ensuring that the compiler doesn't move code across the read or write; this is called "code motion". InterlockedExchange does the first two; although the MSDN documentation is rather muddled, _ReadWriteBarrier does the third, and possibly the second.
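For comparison, here is a hedged sketch of the same spinlock written with portable C++11 atomics, where acquire/release orderings take the place of the intrinsics:
#include <atomic>
class spinlock
{
public:
    bool try_lock()
    {
        // test_and_set plays the role of InterlockedExchange(&v_, 1);
        // acquire ordering replaces the _ReadWriteBarrier after it
        return !v_.test_and_set(std::memory_order_acquire);
    }
    void lock()
    {
        for (unsigned k = 0; !try_lock(); ++k)
        {
            // yield(k); -- back off here, as in the original
        }
    }
    void unlock()
    {
        // a release store, like the volatile write preceded by _ReadWriteBarrier
        v_.clear(std::memory_order_release);
    }
private:
    std::atomic_flag v_ = ATOMIC_FLAG_INIT;
};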