How do fences actually work in C++?

I've been struggling to understand how fences actually force code to synchronize.
For instance, say I have this code:
#include <atomic>
#include <cassert>
#include <thread>

bool x = false;
std::atomic<bool> y;
std::atomic<int> z;

void write_x_then_y()
{
    x = true;
    std::atomic_thread_fence(std::memory_order_release);
    y.store(true, std::memory_order_relaxed);
}

void read_y_then_x()
{
    while (!y.load(std::memory_order_relaxed))
        ;
    std::atomic_thread_fence(std::memory_order_acquire);
    if (x)
        ++z;
}

int main()
{
    x = false;
    y = false;
    z = 0;
    std::thread a(write_x_then_y);
    std::thread b(read_y_then_x);
    a.join();
    b.join();
    assert(z.load() != 0);
}
Because the release fence is followed by an atomic store, and the acquire fence is preceded by an atomic load, everything synchronizes as it's supposed to and the assert won't fire.
But if y were not an atomic variable, like this:
bool x;
bool y;
std::atomic<int> z;

void write_x_then_y()
{
    x = true;
    std::atomic_thread_fence(std::memory_order_release);
    y = true;
}

void read_y_then_x()
{
    while (!y)
        ;
    std::atomic_thread_fence(std::memory_order_acquire);
    if (x)
        ++z;
}
then, I hear, there might be a data race. But why is that?
Why must release fences be followed by an atomic store, and acquire fences be preceded by an atomic load, in order for the code to synchronize properly?
I would also appreciate it if anyone could provide an execution scenario in which a data race causes the assert to fire.

At the machine level, no real data race is a problem for your second snippet. The snippet would be OK ... if the compiler literally generated machine code from what is written.
But the compiler is free to generate any machine code which is equivalent to the original in the case of a single-threaded program.
E.g., the compiler can notice that the variable y doesn't change within the while (!y) loop, so it can load the variable into a register once and use only that register in subsequent iterations. So, if y is initially false, you get an infinite loop.
Another possible optimization is removing the while (!y) loop entirely, as it contains no accesses to volatile or atomic variables and no synchronization actions. (The C++ Standard says that any correct program must eventually perform one of the actions listed above, so the compiler may rely on that fact when optimizing the program.)
And so on.
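To illustrate the first optimization, the compiler may effectively treat the reader as if it had been written like this (a sketch of one legal single-threaded transformation using the question's globals, not actual compiler output):

void read_y_then_x_as_optimized()
{
    bool tmp = y;    // y loaded into a register once...
    while (!tmp)     // ...and never reloaded, so if y was initially
        ;            // false, this loop spins forever
    std::atomic_thread_fence(std::memory_order_acquire);
    if (x)
        ++z;
}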
More generally, the C++ Standard specifies that concurrent access to a non-atomic variable leads to undefined behavior, which effectively means "all warranties are void". That is why you should use an atomic y variable.
On the other hand, the variable x doesn't need to be atomic, because accesses to it are made non-concurrent by the memory fences.
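Concretely, the fix is what your first snippet already does; a minimal annotated sketch:

bool x = false;              // plain bool is fine: the fences make accesses to x non-concurrent
std::atomic<bool> y{false};  // must be atomic: this is what the fences synchronize through
std::atomic<int> z{0};

void write_x_then_y()
{
    x = true;                                  // non-atomic write, ordered by the fence below
    std::atomic_thread_fence(std::memory_order_release);
    y.store(true, std::memory_order_relaxed);  // atomic store after the release fence
}

void read_y_then_x()
{
    while (!y.load(std::memory_order_relaxed)) // atomic load before the acquire fence
        ;
    std::atomic_thread_fence(std::memory_order_acquire);
    if (x)                                     // ordered after the write to x: no data race
        ++z;
}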

Related

Why does only "std::memory_order_seq_cst" guarantee the result [duplicate]

#include <thread>
#include <atomic>
#include <cassert>

std::atomic<bool> x = {false};
std::atomic<bool> y = {false};
std::atomic<int> z = {0};

void write_x()
{
    x.store(true, std::memory_order_release);
}

void write_y()
{
    y.store(true, std::memory_order_release);
}

void read_x_then_y()
{
    while (!x.load(std::memory_order_acquire))
        ;
    if (y.load(std::memory_order_acquire)) {
        ++z;
    }
}

void read_y_then_x()
{
    while (!y.load(std::memory_order_acquire))
        ;
    if (x.load(std::memory_order_acquire)) {
        ++z;
    }
}

int main()
{
    std::thread a(write_x);
    std::thread b(write_y);
    std::thread c(read_x_then_y);
    std::thread d(read_y_then_x);
    a.join(); b.join(); c.join(); d.join();
    assert(z.load() != 0);
}
If I replace seq_cst with acquire/release in cppreference's last example, can assert(z.load() != 0) fail?
seq_cst can prevent StoreLoad reordering, but this code has no store-then-load pattern to protect.
Acquire can prevent LoadLoad reordering.
Release can prevent StoreStore reordering.
Yes, it is possible that z.load() == 0 in your code if you use acquire/release order as you've done. There is no happens-before relationship between the independent writes to x and y. It's no coincidence that cppreference used that example specifically to illustrate a case where acquire/release isn't sufficient.
This is sometimes called IRIW (independent reads of independent writes), and it tends to be glossed over in some hardware ordering models. In particular, a memory model defined only in terms of the possible load-load, load-store, store-store, etc., reorderings doesn't really say anything either way about IRIW. In the x86 memory model, IRIW reordering is disallowed because of a clause which specifies that stores have a total order and all processors view stores in that same order.
I'm not aware of any CPU in common use that admits IRIW reordering when the barriers and/or instructions needed for acquire and release are used, but I wouldn't be surprised if some do.
Yes, the assert can fire.
The principal property that acquire/release does not guarantee is a single total order of modifications. It only guarantees that the (here non-existent) previous actions of a and b are observed by c and d if they see true from the loads.
A (slightly contrived) example of this is a multi-CPU (physical socket) system that isn't fully cache-coherent. Die 1 has core A running thread a and core C running thread c. Die 2 has core B running thread b and core D running thread d. The interconnect between the two sockets has a long latency compared to a memory operation that hits on-die cache.
a and b run at the same wall-clock time. C is on-die with A, so it can see the store to x immediately, but the interconnect delays its observation of the store to y, so it sees the old value. Similarly, D is on-die with B, so it sees the store to y but misses the store to x.
Whereas if you have sequential consistency, some coordination is required to enforce a total order, such as "C and D are blocked while the interconnect syncs the caches".

atomic ops embedded, std::atomic understanding

Say I have a shared variable x which is accessed within an ISR and a main loop:
int x;
int y;

void isr() {
    x++;
}

void loop() {
    y = x;
}
In the past, I would wrap the read in loop() with an atomic operation using assembly, i.e. y = atomic_get(x). The ISR increment would not change.
Reading about C++ atomics, it seems that I can make x an atomic type, and y = x would then be handled atomically under the hood.
The x++ in the ISR, when using atomics, would also want to do 'something' atomically, as it doesn't know it's running in a context that can't be interrupted. How does that work?
Is it guaranteed that the x++ will always happen when the ISR is called, and that y will always show a complete value of x (whether it's before or after a pending x++ is not relevant)?
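For what it's worth, a minimal sketch of the std::atomic version being asked about might look like the following (memory_order_relaxed is an assumption here; whether it suffices depends on what other data the ISR and the loop share):

#include <atomic>

std::atomic<int> x{0};
int y;

void isr() {
    // Atomic read-modify-write: the increment cannot be torn or lost,
    // even if the main loop is reading x at the same moment.
    x.fetch_add(1, std::memory_order_relaxed);
}

void loop() {
    // Atomic load: y always receives a complete value of x, taken
    // either before or after a pending x++, never a torn one.
    y = x.load(std::memory_order_relaxed);
}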

C++ memory_order_acquire/release questions

I recently learned about the six C++ memory orders. I find memory_order_acquire and memory_order_release very confusing; here is an example from cppreference:
#include <thread>
#include <atomic>
#include <cassert>

std::atomic<bool> x = {false};
std::atomic<bool> y = {false};
std::atomic<int> z = {0};

void write_x() { x.store(true, std::memory_order_seq_cst); }
void write_y() { y.store(true, std::memory_order_seq_cst); }

void read_x_then_y() {
    while (!x.load(std::memory_order_seq_cst))
        ;
    if (y.load(std::memory_order_seq_cst))
        ++z;
}

void read_y_then_x() {
    while (!y.load(std::memory_order_seq_cst))
        ;
    if (x.load(std::memory_order_seq_cst))
        ++z;
}

int main() {
    std::thread a(write_x);
    std::thread b(write_y);
    std::thread c(read_x_then_y);
    std::thread d(read_y_then_x);
    a.join(); b.join(); c.join(); d.join();
    assert(z.load() != 0); // will never happen
}
The cppreference page says:
"This example demonstrates a situation where sequential ordering is necessary. Any other ordering may trigger the assert because it would be possible for the threads c and d to observe changes to the atomics x and y in opposite order."
So my question is: why can memory_order_acquire and memory_order_release not be used here? And what semantics do memory_order_acquire and memory_order_release provide?
some references:
https://en.cppreference.com/w/cpp/atomic/memory_order
https://gcc.gnu.org/wiki/Atomic/GCCMM/AtomicSync
Sequential consistency provides a single total order of all sequentially consistent operations. So if you have a sequentially consistent store in thread A, and a sequentially consistent load in thread B, and the store is ordered before the load (in said single total order), then B observes the value stored by A. So basically sequential consistency guarantees that the store is "immediately visible" to other threads. A release store does not provide this guarantee.
As Peter Cordes pointed out correctly, the term "immediately visible" is rather imprecise. The "visibility" stems from the fact that all seq-cst operations are totally ordered, and all threads observe that order. Since the store and the load are totally ordered, the value of a store becomes visible before a subsequent load (in the single total order) is executed.
There exists no such total order between acquire/release operations in different threads, so there is no visibility guarantee. The operations are only ordered once an acquire-operation observes the value from a release-operation, but there is no guarantee when the value of the release-operation becomes visible to the thread performing the acquire-operation.
Let's consider what would happen if we were to use acquire/release in this example:
void write_x() { x.store(true, std::memory_order_release); }
void write_y() { y.store(true, std::memory_order_release); }

void read_x_then_y() {
    while (!x.load(std::memory_order_acquire))
        ;
    if (y.load(std::memory_order_acquire))
        ++z;
}

void read_y_then_x() {
    while (!y.load(std::memory_order_acquire))
        ;
    if (x.load(std::memory_order_acquire))
        ++z;
}

int main() {
    std::thread a(write_x);
    std::thread b(write_y);
    std::thread c(read_x_then_y);
    std::thread d(read_y_then_x);
    a.join(); b.join(); c.join(); d.join();
    assert(z.load() != 0); // can actually happen!!
}
Since we have no guarantee about visibility, it could happen that thread c observes x == true and y == false, while at the same time thread d could observe y == true and x == false. So neither thread would increment z and the assertion would fire.
For more details about the C++ memory model I can recommend this paper which I have co-authored: Memory Models for C/C++ Programmers
You can use acquire/release when passing information from one thread to another - this is the most common situation, and it needs no sequential-consistency requirements.
In this example there are four threads. Two perform the writes, while the third roughly tests whether x was ready before y and the fourth tests whether y was ready before x. Theoretically, one thread may observe that x was modified before y while another sees that y was modified before x. I'm not entirely sure how likely that is; it is an uncommon use case.
Edit: you can visualize the example like this: assume each thread runs on a different PC and they communicate via a network, with each pair of PCs having a different ping to each other. Then it is easy to construct a scenario where it is unclear which event occurred first, x or y, as each PC may see the two events in a different order.
I am not sure on which architectures this effect can occur, but there are complex ones where two different processors are conjoined. Surely communication between the processors is slower than between the cores of each processor.
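For reference, the common one-way hand-off mentioned above looks roughly like this (a sketch with hypothetical names):

#include <atomic>

int payload = 0;                // plain data being handed from one thread to another
std::atomic<bool> ready{false};

void producer()
{
    payload = 42;                                  // plain write
    ready.store(true, std::memory_order_release);  // publish: orders the write above it
}

void consumer()
{
    while (!ready.load(std::memory_order_acquire)) // pairs with the release store
        ;
    // The acquire load synchronizes with the release store,
    // so this read of payload is race-free and sees 42.
    int v = payload;
    (void)v;
}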

Instruction reordering with lock

Will the compiler reorder instructions which are guarded with a mutex? I am using a boolean variable to decide whether one thread has updated some struct. If the compiler reorders the instructions, it might happen that the boolean variable is set before all the fields of the struct are updated.
struct A {
    int x;
    int y;
    // Many other variables; it is a big struct
};

std::mutex m_;
bool updated;
A* first_thread_copy_;

// first_thread_copy_ has been allocated
m_.lock();
first_thread_copy_->x = 1;
first_thread_copy_->y = 2;
// Other variables of the struct are updated here
updated = true;
m_.unlock();
And in the other thread I just check if the struct has been updated and swap the pointers.
while (!stop_running) {
    if (updated) {
        m_.lock();
        updated = false;
        A* tmp = second_thread_copy_;
        second_thread_copy_ = first_thread_copy_;
        first_thread_copy_ = tmp;
        m_.unlock();
    }
    // ...
}
My goal is to keep the second thread as fast as possible. If it sees that update has not happened, it continues and uses the old value to do the remaining stuff.
One solution to avoid reordering would be to use memory barriers, but I'm trying to avoid them within the mutex block.
You can safely assume that instructions are not reordered across the lock/unlock: updated = true will not happen after the unlock or before the lock. Both are barriers and prevent such reordering.
You cannot assume that the updates inside the lock happen without reordering: it is possible that the update to updated takes place before the updates to x or y. If all your accesses are also under the lock, that should not be a problem.
With that in mind, please note that it is not only the compiler that might reorder instructions. The CPU also might execute instructions out of order.
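If the cheap unlocked check on updated is what you want to keep, one option (a sketch with hypothetical helper names, not the only fix) is to make the flag itself atomic, so that reading it outside the lock is no longer a data race:

#include <atomic>
#include <mutex>

std::mutex m_;
std::atomic<bool> updated{false};

void writer_side()
{
    std::lock_guard<std::mutex> lock(m_);
    // ... update the struct fields under the lock ...
    updated.store(true, std::memory_order_release);  // publish after the updates
}

void reader_side()
{
    if (updated.load(std::memory_order_acquire)) {   // cheap, race-free unlocked check
        std::lock_guard<std::mutex> lock(m_);
        updated.store(false, std::memory_order_relaxed);
        // ... swap the pointers under the lock ...
    }
}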
That's what locks guarantee, so you don't have to use updated. On a side note, you should lock your data rather than your code, and every time you access the struct you should lock it first.
struct A {
    int x;
    int y;
};

A a;
std::mutex m_; // Share this lock amongst threads

m_.lock();
a.x = 1;
a.y = 2;
m_.unlock();
On the second thread you can do:
while (!stop) {
    if (m_.try_lock()) {
        A* tmp = second_thread_copy_;
        second_thread_copy_ = first_thread_copy_;
        first_thread_copy_ = tmp;
        m_.unlock();
    }
}
EDIT: since you are overwriting the struct, having the mutex inside it doesn't make sense.

Understanding c++11 memory fences

I'm trying to understand memory fences in C++11. I know there are better ways to do this, atomic variables and so on, but I wondered if this usage was correct. I realize that this program doesn't do anything useful; I just want to make sure that the fence functions do what I think they do.
Basically, that the release fence ensures that any changes made in this thread before the fence are visible to other threads after the fence, and that in the second thread the acquire fence ensures that any changes to the variables are visible immediately after it?
Is my understanding correct? Or have I missed the point entirely?
#include <iostream>
#include <atomic>
#include <thread>

int a;

void func1()
{
    for (int i = 0; i < 1000000; ++i)
    {
        a = i;
        // Ensure that changes to a to this point are visible to other threads
        std::atomic_thread_fence(std::memory_order_release);
    }
}

void func2()
{
    for (int i = 0; i < 1000000; ++i)
    {
        // Ensure that this thread's view of a is up to date
        std::atomic_thread_fence(std::memory_order_acquire);
        std::cout << a;
    }
}

int main()
{
    std::thread t1(func1);
    std::thread t2(func2);
    t1.join(); t2.join();
}
Your usage does not actually ensure the things you mention in your comments. That is, your usage of fences does not ensure that your assignments to a are visible to other threads or that the value you read from a is 'up to date.' This is because, although you seem to have the basic idea of where fences should be used, your code does not actually meet the exact requirements for those fences to "synchronize".
Here's a different example that I think demonstrates correct usage better.
#include <iostream>
#include <atomic>
#include <thread>

std::atomic<bool> flag(false);
int a;

void func1()
{
    a = 100;
    std::atomic_thread_fence(std::memory_order_release);
    flag.store(true, std::memory_order_relaxed);
}

void func2()
{
    while (!flag.load(std::memory_order_relaxed))
        ;
    std::atomic_thread_fence(std::memory_order_acquire);
    std::cout << a << '\n'; // guaranteed to print 100
}

int main()
{
    std::thread t1(func1);
    std::thread t2(func2);
    t1.join(); t2.join();
}
The load and store on the atomic flag do not synchronize, because they both use the relaxed memory ordering. Without the fences this code would be a data race: we're performing conflicting operations on a non-atomic object in different threads, and without the fences and the synchronization they provide there would be no happens-before relationship between the conflicting operations on a.
However with the fences we do get synchronization, because we've guaranteed that thread 2 will read the flag written by thread 1 (because we loop until we see that value), and since the atomic write happened after the release fence and the atomic read happens before the acquire fence, the fences synchronize. (See §29.8/2 for the specific requirements.)
This synchronization means anything that happens-before the release fence happens-before anything that happens-after the acquire fence. Therefore the non-atomic write to a happens-before the non-atomic read of a.
Things get trickier when you're writing a variable in a loop, because you might establish a happens-before relation for some particular iteration, but not other iterations, causing a data race.
std::atomic<int> f(0);
int a;

void func1()
{
    for (int i = 0; i < 1000000; ++i) {
        a = i;
        std::atomic_thread_fence(std::memory_order_release);
        f.store(i, std::memory_order_relaxed);
    }
}

void func2()
{
    int prev_value = 0;
    while (prev_value < 999999) {   // 999999 is the last value stored to f
        while (true) {
            int new_value = f.load(std::memory_order_relaxed);
            if (prev_value < new_value) {
                prev_value = new_value;
                break;
            }
        }
        std::atomic_thread_fence(std::memory_order_acquire);
        std::cout << a << '\n';
    }
}
This code still causes the fences to synchronize, but it does not eliminate data races. For example, if f.load() happens to return 10 then we know that a = 1, a = 2, ... a = 10 have all happened-before that particular cout << a, but we don't know that cout << a happens-before a = 11. Those are conflicting operations on different threads with no happens-before relation; a data race.
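If race-freedom on a is all that is needed, one option (a sketch; it keeps the fences, which then still guarantee the printed value is at least as new as the value observed through f) is to make a itself atomic and access it with relaxed operations:

#include <atomic>
#include <iostream>

std::atomic<int> f(0);
std::atomic<int> a(0);  // atomic now, so the concurrent read below is never a data race

void func1()
{
    for (int i = 0; i < 1000000; ++i) {
        a.store(i, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_release);
        f.store(i, std::memory_order_relaxed);
    }
}

void func2()
{
    int prev_value = 0;
    while (prev_value < 999999) {
        while (true) {
            int new_value = f.load(std::memory_order_relaxed);
            if (prev_value < new_value) {
                prev_value = new_value;
                break;
            }
        }
        std::atomic_thread_fence(std::memory_order_acquire);
        // No data race, and with the fences this prints at least prev_value.
        std::cout << a.load(std::memory_order_relaxed) << '\n';
    }
}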
Your usage is correct, but insufficient to guarantee anything useful.
For example, the compiler is free to internally implement a = i; like this if it wants to:
while (a != i)
{
    ++a;
    std::atomic_thread_fence(std::memory_order_release);
}
So the other thread may see any values at all.
Of course, the compiler would never implement a simple assignment like that. However, there are cases where similarly perplexing behavior is actually an optimization, so it's a very bad idea to rely on ordinary code being implemented internally in any particular way. This is why we have things like atomic operations, and why fences only produce guaranteed results when used with such operations.