I recently learned about the six C++ memory orders, and I am confused about memory_order_acquire and memory_order_release. Here is an example from cppreference:
#include <thread>
#include <atomic>
#include <cassert>
std::atomic<bool> x = {false};
std::atomic<bool> y = {false};
std::atomic<int> z = {0};
void write_x() { x.store(true, std::memory_order_seq_cst); }
void write_y() { y.store(true, std::memory_order_seq_cst); }
void read_x_then_y() {
    while (!x.load(std::memory_order_seq_cst));
    if (y.load(std::memory_order_seq_cst))
        ++z;
}
void read_y_then_x() {
    while (!y.load(std::memory_order_seq_cst));
    if (x.load(std::memory_order_seq_cst))
        ++z;
}
int main() {
    std::thread a(write_x);
    std::thread b(write_y);
    std::thread c(read_x_then_y);
    std::thread d(read_y_then_x);
    a.join(); b.join(); c.join(); d.join();
    assert(z.load() != 0); // the assertion will never fire
}
The cppreference page says:
This example demonstrates a situation where sequential ordering is necessary.
Any other ordering may trigger the assert because it would be possible
for the threads c and d to observe changes to the atomics x and y in
opposite order.
So my question is: why can't memory_order_acquire and memory_order_release be used here? And what semantics do memory_order_acquire and memory_order_release provide?
Sequential consistency provides a single total order of all sequentially consistent operations. So if you have a sequentially consistent store in thread A, and a sequentially consistent load in thread B, and the store is ordered before the load (in said single total order), then B observes the value stored by A. So basically sequential consistency guarantees that the store is "immediately visible" to other threads. A release store does not provide this guarantee.
As Peter Cordes pointed out correctly, the term "immediately visible" is rather imprecise. The "visibility" stems from the fact that all seq-cst operations are totally ordered, and all threads observe that order. Since the store and the load are totally ordered, the value of a store becomes visible before a subsequent load (in the single total order) is executed.
No such total order exists between acquire/release operations in different threads, so there is no visibility guarantee. The operations are only ordered once an acquire operation observes the value written by a release operation, but there is no guarantee about when the value of the release operation becomes visible to the thread performing the acquire operation.
Let's consider what would happen if we were to use acquire/release in this example:
void write_x() { x.store(true, std::memory_order_release); }
void write_y() { y.store(true, std::memory_order_release); }
void read_x_then_y() {
    while (!x.load(std::memory_order_acquire));
    if (y.load(std::memory_order_acquire))
        ++z;
}
void read_y_then_x() {
    while (!y.load(std::memory_order_acquire));
    if (x.load(std::memory_order_acquire))
        ++z;
}
int main() {
    std::thread a(write_x);
    std::thread b(write_y);
    std::thread c(read_x_then_y);
    std::thread d(read_y_then_x);
    a.join(); b.join(); c.join(); d.join();
    assert(z.load() != 0); // the assertion can actually fire!
}
Since we have no guarantee about visibility, it could happen that thread c observes x == true and y == false, while at the same time thread d could observe y == true and x == false. So neither thread would increment z and the assertion would fire.
For more details about the C++ memory model I can recommend this paper which I have co-authored:

You can use acquire/release when passing information from one thread to another, which is the most common situation. Sequential consistency is not required for that.
This example involves four threads. Two threads each perform a write, while a third roughly tests whether x was ready before y and a fourth tests whether y was ready before x. In theory, one thread may observe that x was modified before y while another sees that y was modified before x. I am not entirely sure how likely that is. This is an uncommon use case.
Edit: you can visualize the example. Assume that each thread runs on a different PC and that the PCs communicate over a network. Each pair of PCs has a different ping time. It is then easy to construct a case where it is unclear which event occurred first, x or y, because each PC sees the events in a different order.
I am not sure on which architectures this effect may occur, but there are complex ones where two different processors are joined together. Communication between the processors is surely slower than communication between the cores of one processor.
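To make the "passing information from one thread to another" case concrete, here is a minimal sketch. The helper name `pass_message` and its variables are made up for illustration; the point is only that a plain write before the release store is visible after the acquire load that sees the flag.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: hands one int from a producer thread to a consumer
// thread and returns what the consumer read.
int pass_message(int value) {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // flag that publishes the payload
    int received = -1;

    std::thread producer([&] {
        payload = value;                               // 1. write the data
        ready.store(true, std::memory_order_release);  // 2. publish it
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) // 3. wait for the flag
            std::this_thread::yield();
        received = payload;                            // 4. sees the producer's write
    });

    producer.join();
    consumer.join();
    return received;
}
```

Calling `pass_message(42)` should return 42. Only one writer and one flag are involved, so no global order between independent stores is needed, which is why acquire/release is enough here.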


Why does only "std::memory_order_seq_cst" guarantee the result [duplicate]

#include <thread>
#include <atomic>
#include <cassert>
std::atomic<bool> x = {false};
std::atomic<bool> y = {false};
std::atomic<int> z = {0};
void write_x()
{
    x.store(true, std::memory_order_release);
}
void write_y()
{
    y.store(true, std::memory_order_release);
}
void read_x_then_y()
{
    while (!x.load(std::memory_order_acquire));
    if (y.load(std::memory_order_acquire)) {
        ++z;
    }
}
void read_y_then_x()
{
    while (!y.load(std::memory_order_acquire));
    if (x.load(std::memory_order_acquire)) {
        ++z;
    }
}
int main()
{
    std::thread a(write_x);
    std::thread b(write_y);
    std::thread c(read_x_then_y);
    std::thread d(read_y_then_x);
    a.join(); b.join(); c.join(); d.join();
    assert(z.load() != 0);
}
If I replace seq_cst with acquire/release in cppreference's last example,
can assert(z.load() != 0) fail?
My understanding is that seq_cst prevents StoreLoad reordering, but this code has no StoreLoad sequence.
Acquire prevents LoadLoad reordering.
Release prevents StoreStore reordering.
Yes, it is possible that z.load() == 0 in your code if you use acquire/release order as you've done. There is no happens-before relationship between the independent writes to x and y. It's not a coincidence cppreference used that example specifically to illustrate a case where acquire/release isn't sufficient.
This is sometimes called IRIW (independent reads of independent writes), and it tended to be glossed over in some hardware ordering models. In particular, a memory model defined only in terms of the possible load-load, load-store, store-store, etc. reorderings doesn't really say anything either way about IRIW. In the x86 memory model, IRIW reordering is disallowed because of a clause stating that stores have a total order and that all processors view stores in this same order.
I'm not aware if any CPUs in common use admit the IRIW reordering when the barriers and/or instructions needed for acquire and release are used, but I wouldn't be surprised if some do.
Yes, the assert can fire.
The principal property that is not guaranteed by acquire / release is a single total order of modifications. It only guarantees that (the non-existent) previous actions of a and b are observed by c and d if they see true from the loads.
A (slightly contrived) example of this is a multi-CPU (multiple physical sockets) system that isn't fully cache-coherent. Die 1 has core A running thread a and core C running thread c. Die 2 has core B running thread b and core D running thread d. The interconnect between the two sockets has a long latency compared to a memory operation that hits the on-die cache.
a and b run at the same wall-clock time. C is on-die with A, so it can see the store to x immediately, but the interconnect delays its observation of the store to y, so it sees the old value. Similarly, D is on-die with B, so it sees the store to y but misses the store to x.
Whereas if you have sequential consistency, some co-ordination is required to enforce a total order, such as "C and D are blocked while the interconnect syncs the caches".
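For contrast, here is a rough sketch of the same four-thread test using seq_cst throughout, wrapped in a hypothetical helper (`run_iriw_seq_cst`, a name made up for this sketch) so it can be run repeatedly. Because all seq_cst operations take part in a single total order, at least one reader must see both stores, so the returned count should never be 0.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: runs the four-thread test once with seq_cst
// everywhere and returns the final value of z.
int run_iriw_seq_cst() {
    std::atomic<bool> x{false}, y{false};
    std::atomic<int> z{0};

    std::thread a([&] { x.store(true, std::memory_order_seq_cst); });
    std::thread b([&] { y.store(true, std::memory_order_seq_cst); });
    std::thread c([&] {
        while (!x.load(std::memory_order_seq_cst)) std::this_thread::yield();
        if (y.load(std::memory_order_seq_cst)) ++z;
    });
    std::thread d([&] {
        while (!y.load(std::memory_order_seq_cst)) std::this_thread::yield();
        if (x.load(std::memory_order_seq_cst)) ++z;
    });
    a.join(); b.join(); c.join(); d.join();
    return z.load();
}

// Repeats the experiment; returns true if z was nonzero in every run.
bool iriw_seq_cst_never_zero(int runs) {
    for (int i = 0; i < runs; ++i)
        if (run_iriw_seq_cst() == 0) return false;
    return true;
}
```

Swapping the orders for release/acquire turns this into the version discussed above, where a result of 0 is permitted by the standard even if a given machine never produces it.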

c++ memory order (acquire and release) with two writers, and a sleep between two reads in one thread, no reordering visible?

I tested this code many times and its assertion never failed,
but can the assertion fail sometimes?
#include <iostream>
#include <thread>
#include <atomic>
#include <chrono>
#include <cassert>
std::atomic<bool> x{false};
std::atomic<int> y{0};
void write_x() {
    x.store(true, std::memory_order_release);
}
void write_y() {
    y.store(1, std::memory_order_relaxed);
}
void read_x_then_y() {
    while (!x.load(std::memory_order_acquire));
    std::this_thread::sleep_for(std::chrono::seconds(1)); // allow the write_y thread to store 1 into y before reading it
    assert(y.load(std::memory_order_relaxed) == 1); // could this assert fail?
}
int main() {
    std::thread t1{read_x_then_y};
    std::thread t2{write_x};
    std::thread t3{write_y};
    t1.join(); t2.join(); t3.join(); // join before the thread objects are destroyed
}
There is no memory operation before the release store in write_x,
so the acquire load in read_x_then_y is not guaranteed to see the store performed in write_y.
Is that right?
In theory, yes, the assertion can fail. In practice, with a 1-second sleep, it won't unless your system is extremely heavily overloaded, for example thrashing its swap space.
Spin-wait on x is pointless; there's no synchronization between independent stores by independent writers. If they were both seq_cst then all readers on all cores would be required to agree upon their order, even on a PowerPC system where IRIW reordering is possible (and can happen for acq_rel or weaker). But even seq_cst wouldn't provide any guarantee on which order that was, just that all readers in the same run of the program would agree.
Acquire and release create ordering wrt. other operations in the same thread. https://preshing.com/20120913/acquire-and-release-semantics/
After sleeping for 1 second the stores from both writer threads will be visible unless something is stopping the writer threads from running for that long.
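If the goal is for the reader to be guaranteed to see y == 1 rather than relying on the sleep, the two stores need to be linked by synchronization. One possible sketch (the helper `read_y_after_chain` is hypothetical, and this restructures the program rather than reproducing it): the thread that stores x first acquires y, so the store to y happens-before the store to x, which in turn happens-before the reader's load of y.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: returns the value of y that the reader observes.
int read_y_after_chain() {
    std::atomic<bool> x{false};
    std::atomic<int> y{0};
    int seen = -1;

    std::thread write_y([&] { y.store(1, std::memory_order_release); });
    std::thread write_x([&] {
        // Wait until the store to y is visible here, then publish x.
        while (y.load(std::memory_order_acquire) != 1)
            std::this_thread::yield();
        x.store(true, std::memory_order_release);
    });
    std::thread reader([&] {
        while (!x.load(std::memory_order_acquire))
            std::this_thread::yield();
        // The store to y happens-before the store to x,
        // which happens-before this load, so it must read 1.
        seen = y.load(std::memory_order_relaxed);
    });

    write_y.join(); write_x.join(); reader.join();
    return seen;
}
```

Here the ordering comes from the chain of release/acquire pairs, not from timing, so no sleep is needed.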

Acquire-release memory order between multiple threads

#include <atomic>
#include <cassert>
#include <thread>
std::atomic<int> done{0};
int a = 10;
void f1() {
    a = 20;
    done.store(1, std::memory_order_release);
}
void f2() {
    if (done.load(std::memory_order_acquire) == 1) {
        assert(a == 20);
    }
}
void f3() {
    if (done.load(std::memory_order_acquire) == 1) {
        assert(a == 20);
    }
}
int main() {
    std::thread t1(f1);
    std::thread t2(f2);
    std::thread t3(f3);
    t1.join(); t2.join(); t3.join();
}
The question is: if threads 2 and 3 both see done == 1, will
the assertion a == 20 hold in both threads?
I know acquire-release works for a pair of threads. But does
it work with multiple threads as well?
Yes. The release-acquire relationship holds separately for all pairs of threads (that access the same atomic location!) and guarantees that all writes (that appear in program order) before the release are visible to all reads (that appear in program order) after whichever corresponding acquire.
A release is like publishing a newspaper (one with no defined periodicity), and an acquire is like buying the latest edition right now, then discovering what it says (not caring which edition it is, or which day it is). (You rarely need to do versioning on these shared atomics, although it can sometimes be needed.)
Any number of people can buy the newspaper. What matters is that what's printed was true when it was printed, and is still true if they are invariant truths, like the creation of a monument (invariable by hypothesis).
So a release operation publishes what is true at the time of publication, and proper design guarantees that these facts cannot have changed when you can "buy" (acquire) the publication. Any number of threads can see these facts.
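As a rough illustration of "any number of threads can see these facts", the sketch below starts several readers against one writer. The helper `count_readers_seeing_20` is hypothetical, and unlike the question's code each reader spins until it sees done == 1, so that every reader actually reaches the check.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: one writer publishes a = 20, `readers` threads wait
// for done == 1 and then read a. Returns how many readers saw a == 20.
int count_readers_seeing_20(int readers) {
    int a = 10;
    std::atomic<int> done{0};
    std::atomic<int> ok{0};

    std::vector<std::thread> pool;
    for (int i = 0; i < readers; ++i) {
        pool.emplace_back([&] {
            while (done.load(std::memory_order_acquire) != 1)
                std::this_thread::yield();
            if (a == 20) ++ok;  // each acquire pairs with the same release
        });
    }
    std::thread writer([&] {
        a = 20;
        done.store(1, std::memory_order_release);
    });

    writer.join();
    for (auto& t : pool) t.join();
    return ok.load();
}
```

With four readers, `count_readers_seeing_20(4)` should return 4: the single release store synchronizes with every acquire load that reads the value it wrote.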

How do fences actually work in c++

I've been struggling to understand how fences actually force code to synchronize.
For instance, say I have this code:
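#include <atomic>
#include <cassert>
#include <thread>
bool x = false;
std::atomic<bool> y;
std::atomic<int> z;
void write_x_then_y()
{
    x = true;
    std::atomic_thread_fence(std::memory_order_release);
    y.store(true, std::memory_order_relaxed);
}
void read_y_then_x()
{
    while (!y.load(std::memory_order_relaxed));
    std::atomic_thread_fence(std::memory_order_acquire);
    if (x)
        ++z;
}
int main()
{
    x = false;
    y = false;
    z = 0;
    std::thread a(write_x_then_y);
    std::thread b(read_y_then_x);
    a.join(); b.join();
    assert(z.load() != 0);
}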
Because the release fence is followed by an atomic store operation, and the acquire fence is preceded by an atomic load, everything synchronizes as it's supposed to and the assert won't fire.
But if y were not an atomic variable, like this:
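bool x;
bool y;
std::atomic<int> z;
void write_x_then_y()
{
    x = true;
    std::atomic_thread_fence(std::memory_order_release);
    y = true;
}
void read_y_then_x()
{
    while (!y);
    std::atomic_thread_fence(std::memory_order_acquire);
    if (x)
        ++z;
}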
then, I hear, there might be a data race. But why is that?
Why must release fences be followed by an atomic store, and acquire fences be preceded by an atomic load in order for the code to synchronize properly?
I would also appreciate it if anyone could provide an execution scenario in which a data race causes the assert to fire
A race at the hardware level is not the real problem in your second snippet. The snippet would be fine if the compiler literally translated the code as written into machine code.
But the compiler is free to generate any machine code that is equivalent to the original in a single-threaded program.
E.g., the compiler can note that the y variable doesn't change within the while(!y) loop, so it can load this variable into a register once and use only that register in subsequent iterations. So, if y is initially false, you will get an infinite loop.
Another possible optimization is simply removing the while(!y) loop, as it doesn't contain accesses to volatile or atomic variables and doesn't use synchronization actions. (The C++ Standard says that any correct program should eventually perform one of the actions specified above, so the compiler may rely on that fact when optimizing the program.)
And so on.
More generally, the C++ Standard specifies that unsynchronized concurrent accesses to a non-atomic variable, at least one of which is a write, lead to Undefined Behavior, which is like "warranty void". That is why you should use an atomic y variable.
On the other hand, variable x doesn't need to be atomic, as accesses to it are not concurrent because of the memory fences.

Understanding c++11 memory fences

I'm trying to understand memory fences in c++11, I know there are better ways to do this, atomic variables and so on, but wondered if this usage was correct. I realize that this program doesn't do anything useful, I just wanted to make sure that the usage of the fence functions did what I thought they did.
Basically, I thought that the release fence ensures that any changes made in this thread before the fence are visible to other threads after the fence, and that in the second thread the acquire fence ensures that any changes to the variables are visible in that thread immediately after the fence.
Is my understanding correct? Or have I missed the point entirely?
#include <iostream>
#include <atomic>
#include <thread>
int a;
void func1()
{
    for(int i = 0; i < 1000000; ++i)
    {
        a = i;
        // Ensure that changes to a to this point are visible to other threads
        std::atomic_thread_fence(std::memory_order_release);
    }
}
void func2()
{
    for(int i = 0; i < 1000000; ++i)
    {
        // Ensure that this thread's view of a is up to date
        std::atomic_thread_fence(std::memory_order_acquire);
        std::cout << a;
    }
}
int main()
{
    std::thread t1 (func1);
    std::thread t2 (func2);
    t1.join(); t2.join();
}
Your usage does not actually ensure the things you mention in your comments. That is, your usage of fences does not ensure that your assignments to a are visible to other threads or that the value you read from a is 'up to date.' This is because, although you seem to have the basic idea of where fences should be used, your code does not actually meet the exact requirements for those fences to "synchronize".
Here's a different example that I think demonstrates correct usage better.
#include <iostream>
#include <atomic>
#include <thread>
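std::atomic<bool> flag(false);
int a;
void func1()
{
    a = 100;
    std::atomic_thread_fence(std::memory_order_release);
    flag.store(true, std::memory_order_relaxed);
}
void func2()
{
    while (!flag.load(std::memory_order_relaxed))
        ;
    std::atomic_thread_fence(std::memory_order_acquire);
    std::cout << a << '\n'; // guaranteed to print 100
}
int main()
{
    std::thread t1 (func1);
    std::thread t2 (func2);
    t1.join(); t2.join();
}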
The load and store on the atomic flag do not synchronize by themselves, because they both use the relaxed memory ordering. Without the fences this code would be a data race, because we're performing conflicting operations on a non-atomic object in different threads, and without the fences and the synchronization they provide there would be no happens-before relationship between the conflicting operations on a.
However, with the fences we do get synchronization, because we've guaranteed that thread 2 will read the flag written by thread 1 (we loop until we see that value), and since the atomic write is sequenced after the release fence and the atomic read is sequenced before the acquire fence, the fences synchronize. (See § 29.8/2 for the specific requirements.)
This synchronization means anything that happens-before the release fence happens-before anything that happens-after the acquire fence. Therefore the non-atomic write to a happens-before the non-atomic read of a.
Things get trickier when you're writing a variable in a loop, because you might establish a happens-before relation for some particular iteration, but not other iterations, causing a data race.
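std::atomic<int> f(0);
int a;
void func1()
{
    for (int i = 0; i < 1000000; ++i) {
        a = i;
        std::atomic_thread_fence(std::memory_order_release);
        f.store(i, std::memory_order_relaxed);
    }
}
void func2()
{
    int prev_val = 0;
    while (prev_val < 999999) { // 999999 is the last value func1 stores
        while (true) {
            int new_val = f.load(std::memory_order_relaxed);
            if (prev_val < new_val) {
                prev_val = new_val;
                break;
            }
        }
        std::atomic_thread_fence(std::memory_order_acquire);
        std::cout << a << '\n'; // still races with func1's next write to a
    }
}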
This code still causes the fences to synchronize but does not eliminate data races. For example if f.load() happens to return 10 then we know that a=1,a=2, ... a=10 have all happened-before that particular cout<<a, but we don't know that cout<<a happens-before a=11. Those are conflicting operations on different threads with no happens-before relation; a data race.
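One way to remove that remaining race, sketched here under the assumption that a lock-step hand-off is acceptable, is to have the reader acknowledge each value so the writer does not overwrite a until the reader is done with it. The helper `transfer_with_ack` and the `ack` variable are made up for this sketch and are not part of the code above.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: sends 1..n through the plain int `a`, one value at a
// time, and returns the sum of the values the reader saw.
long long transfer_with_ack(int n) {
    int a = 0;
    std::atomic<int> f{0};    // last value published by the writer
    std::atomic<int> ack{0};  // last value consumed by the reader
    long long sum = 0;

    std::thread writer([&] {
        for (int i = 1; i <= n; ++i) {
            a = i;
            std::atomic_thread_fence(std::memory_order_release);
            f.store(i, std::memory_order_relaxed);
            // Do not touch `a` again until the reader has finished with it.
            while (ack.load(std::memory_order_relaxed) != i)
                std::this_thread::yield();
            std::atomic_thread_fence(std::memory_order_acquire);
        }
    });
    std::thread reader([&] {
        for (int i = 1; i <= n; ++i) {
            while (f.load(std::memory_order_relaxed) != i)
                std::this_thread::yield();
            std::atomic_thread_fence(std::memory_order_acquire);
            sum += a;  // the write a = i happens-before this read
            std::atomic_thread_fence(std::memory_order_release);
            ack.store(i, std::memory_order_relaxed);
        }
    });

    writer.join();
    reader.join();
    return sum;
}
```

Now every read of a happens-before the next write to a, at the cost of the two threads running in lock step. For n = 1000 the function should return 500500.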
Your usage is correct, but insufficient to guarantee anything useful.
For example, the compiler is free to internally implement a = i; like this if it wants to:
while(a != i) ++a; // a passes through intermediate values on the way to i
So the other thread may see any values at all.
Of course, the compiler would never implement a simple assignment like that. However, there are cases where similarly perplexing behavior is actually an optimization, so it's a very bad idea to rely on ordinary code being implemented internally in any particular way. This is why we have things like atomic operations, and why fences only produce guaranteed results when used with such operations.
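To illustrate what an atomic buys you here, below is a small sketch in which a is a std::atomic<int> written with relaxed stores. The helper name `atomic_values_are_sane` is hypothetical. Even with relaxed ordering, the reader can only observe values that were actually stored, and successive loads from the same thread never go backwards in the variable's modification order.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: a writer stores 0..n-1 into an atomic while a reader
// polls it. Returns true if the reader only saw values in [0, n) and never
// saw the value decrease. Assumes n >= 1.
bool atomic_values_are_sane(int n) {
    std::atomic<int> a{0};
    bool sane = true;

    std::thread writer([&] {
        for (int i = 0; i < n; ++i)
            a.store(i, std::memory_order_relaxed);
    });
    std::thread reader([&] {
        int prev = 0;
        while (prev < n - 1) {  // stop once the final value has been seen
            int cur = a.load(std::memory_order_relaxed);
            if (cur < prev || cur >= n) sane = false;
            if (cur > prev) prev = cur;
        }
    });

    writer.join();
    reader.join();
    return sane;
}
```

This says nothing about other variables; it only shows that accesses to the atomic itself are well defined, which is exactly what a plain int does not give you.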