Independent Read-Modify-Write Ordering

Independent Read-Modify-Write Ordering - c++

I was running a bunch of algorithms through Relacy to verify their correctness and I stumbled onto something I didn't really understand. Here's a simplified version of it:
#include <thread>
#include <atomic>
#include <iostream>
#include <cassert>
struct RMW_Ordering
{
std::atomic<bool> flag {false};
std::atomic<unsigned> done {0}, counter {0};
unsigned race_cancel {0}, race_success {0}, sum {0};
void thread1() // fail
{
race_cancel = 1; // data produced
if (counter.fetch_add(1, std::memory_order_release) == 1 &&
!flag.exchange(true, std::memory_order_relaxed))
{
counter.store(0, std::memory_order_relaxed);
done.store(1, std::memory_order_relaxed);
}
}
void thread2() // success
{
race_success = 1; // data produced
if (counter.fetch_add(1, std::memory_order_release) == 1 &&
!flag.exchange(true, std::memory_order_relaxed))
{
done.store(2, std::memory_order_relaxed);
}
}
void thread3()
{
while (!done.load(std::memory_order_relaxed)); // livelock test
counter.exchange(0, std::memory_order_acquire);
sum = race_cancel + race_success;
}
};
int main()
{
for (unsigned i = 0; i < 1000; ++i)
{
RMW_Ordering test;
std::thread t1([&]() { test.thread1(); });
std::thread t2([&]() { test.thread2(); });
std::thread t3([&]() { test.thread3(); });
t1.join();
t2.join();
t3.join();
assert(test.counter == 0);
}
std::cout << "Done!" << std::endl;
}
Two threads race to enter a protected region and the last one modifies done, releasing a third thread from an infinite loop. The example is a bit contrived but the original code needs to claim this region through the flag to signal "done".
Initially, the fetch_add had acq_rel ordering because I was concerned the exchange might get reordered before it, potentially causing one thread to claim the flag, attempt the fetch_add check first, and prevent the other thread (which gets past the increment check) from successfully modifying the schedule. While testing with Relacy, I figured I'd see whether the livelock I expected to happen will take place if I switched from acq_rel to release, and to my surprise, it didn't. I then used relaxed for everything, and again, no livelock.
I tried to find any rules regarding this in the C++ standard but only managed to dig up these:
1.10.7 In addition, there are relaxed atomic operations, which are not synchronization operations, and atomic read-modify-write operations,
which have special characteristics.
29.3.11 Atomic read-modify-write operations shall always read the last value (in the modification order) written before the write associated
with the read-modify-write operation.
Can I always rely on RMW operations not being reordered - even if they affect different memory locations - and is there anything in the standard that guarantees this behaviour?
EDIT:
I came up with a simpler setup that should illustrate my question a little better. Here's the CppMem script for it:
int main()
{
atomic_int x = 0; atomic_int y = 0;
{{{
{
if (cas_strong_explicit(&x, 0, 1, relaxed, relaxed))
{
cas_strong_explicit(&y, 0, 1, relaxed, relaxed);
}
}
|||
{
if (cas_strong_explicit(&x, 0, 2, relaxed, relaxed))
{
cas_strong_explicit(&y, 0, 2, relaxed, relaxed);
}
}
|||
{
// Is it possible for x and y to read 2 and 1, or 1 and 2?
x.load(relaxed).readsvalue(2);
y.load(relaxed).readsvalue(1);
}
}}}
return 0;
}
I don't think the tool is sophisticated enough to evaluate this scenario, though it does seem to indicate that it's possible. Here's the almost equivalent Relacy setup:
#include "relacy/relacy_std.hpp"
struct rmw_experiment : rl::test_suite<rmw_experiment, 3>
{
rl::atomic<unsigned> x, y;
void before()
{
x($) = y($) = 0;
}
void thread(unsigned tid)
{
if (tid == 0)
{
unsigned exp1 = 0;
if (x($).compare_exchange_strong(exp1, 1, rl::mo_relaxed))
{
unsigned exp2 = 0;
y($).compare_exchange_strong(exp2, 1, rl::mo_relaxed);
}
}
else if (tid == 1)
{
unsigned exp1 = 0;
if (x($).compare_exchange_strong(exp1, 2, rl::mo_relaxed))
{
unsigned exp2 = 0;
y($).compare_exchange_strong(exp2, 2, rl::mo_relaxed);
}
}
else
{
while (!(x($).load(rl::mo_relaxed) && y($).load(rl::mo_relaxed)));
RL_ASSERT(x($) == y($));
}
}
};
int main()
{
rl::simulate<rmw_experiment>();
}
The assertion is never violated, so 1 and 2 (or the reverse) is not possible according to Relacy.

I haven't fully grokked your code yet, but the bolded question has a straightforward answer:
Can I always rely on RMW operations not being reordered - even if they affect different memory locations
No, you can't. Compile-time reordering of two relaxed RMWs in the same thread is very much allowed. (I think runtime reordering of two RMWs is probably impossible in practice on most CPUs. ISO C++ doesn't distinguish compile-time vs. run-time for this.)
But note that an atomic RMW includes both a load and a store, and both parts have to stay together. So any kind of RMW can't move earlier past an acquire operation, or later past a release operation.
Also, of course the RMW itself being a release and/or acquire operation can stop reordering in one or the other direction.
Of course, the C++ memory model isn't formally defined in terms of local reordering of access to cache-coherent shared memory, only in terms of synchronizing with another thread and creating a happens-before / after relationship. But if you ignore IRIW reordering (2 reader threads not agreeing on the order of two writer threads doing independent stores to different variables) it's pretty much 2 different ways to model the same thing.

In your first example it is guaranteed that the flag.exchange is always executed after the counter.fetch_add, because the && short circuits - i.e., if the first expression resolves to false, the second expression is never executed. The C++ standard guarantees this, so the compiler cannot reorder the two expressions (regardless which memory order they use).
As Peter Cordes already explained, the C++ standard says nothing about if or when instructions can be reordered with respect to atomic operations. In general, most compiler optimizations rely on the as-if:
The semantic descriptions in this International Standard deﬁne a parameterized nondeterministic abstract machine. This International Standard places no requirement on the structure of conforming implementations. In particular, they need not copy or emulate the structure of the abstract machine. Rather, conforming implementations are required to emulate (only) the observable behavior of the abstract machine [..].
This provision is sometimes called the “as-if” rule, because an implementation is free to disregard any requirement of this International Standard as long as the result is as if the requirement had been obeyed, as far as can be determined from the
observable behavior of the program. For instance, an actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no side eﬀects aﬀecting the observable behavior of the program are produced.
The key aspect here is the "observable behavior". Suppose you have two relaxed atomic loads A and B on two different atomic objects, where A is sequenced before B.
std::atomic<int> x, y;
x.load(std::memory_order_relaxed); // A
y.load(std::memory_order_relaxed); // B
A sequence-before relation is part of the definition of the happens-before relation, so one might assume that the two operations cannot be reordered. However, since the two operations are relaxed, there is no guarantee about the "observable behavior", i.e., even with the original order, the x.load (A) could return a newer result than the y.load (B), so the compiler would be free to reorder them, since the final program would not be able to tell the difference (i.e., the observable behavior is equivalent). If it would not be equivalent, then you would have a race condition! ;-)
To prevent such reorderings you have to rely on the (inter-thread) happens-before relation. If the x.load (A) would use memory_order_acquire, then the compiler would have to assume that this operation synchronizes-with some release operation, thus establishing a (inter-thread) happens-before relation. Suppose some other thread performs two atomic updates:
y.store(42, std::memory_order_relaxed); // C
x.store(1, std::memory_order_release); // D
If the acquire-load A sees the value store by the store-release D, then the two operations synchronize with each other, thereby establishing a happens-before relation. Since y.store is sequenced before x.store, and x.load is sequenced before, the transitivity of the happens-before relation guarantees that y.store happens-before y.load. Reordering the two loads or the two stores would destroy this guarantee and therefore also change the observable behavior. Thus, the compiler cannot perform such reorders.
In general, arguing about possible reorderings is the wrong approach. In a first step you should always identify your required happens-before relations (e.g., the y.store has to happen before the y.load) . The next step is then to ensure that these happens-before relations are correctly established in all cases. At least that is how I approach correctness arguments for my implementations of lock-free algorithms.
Regarding Relacy: Relacy only simulates the memory model, but it relies on the order of operations as generated by the compiler. So even if a compiler could reorder two instructions, but chooses not to, you will not be able to identify this with Relacy.

Related

Acqrel memory order with 3 threads

Lately the more I read about memory order in C++, the more confusing it gets. Hope you can help me clarify this (for purely theoretic purposes). Suppose I have the following code:
std::atomic<int> val = { 0 };
std::atomic<bool> f1 = { false };
std::atomic<bool> f2 = { false };
void thread_1() {
f1.store(true, std::memory_order_relaxed);
int v = 0;
while (!val.compare_exchange_weak(v, v | 1,
std::memory_order_release));
}
void thread_2() {
f2.store(true, std::memory_order_relaxed);
int v = 0;
while (!val.compare_exchange_weak(v, v | 2,
std::memory_order_release));
}
void thread_3() {
auto v = val.load(std::memory_order_acquire);
if (v & 1) assert(f1.load(std::memory_order_relaxed));
if (v & 2) assert(f2.load(std::memory_order_relaxed));
}
The question is: can any of the assertions be false? On one hand, cppreference claims, std::memory_order_release forbids the reordering of both stores after exchanges in threads 1-2 and std::memory_order_acquire in thread 3 forbids both reads to be reordered before the first load. Thus, if thread 3 saw the first or the second bit set that means that the store to the corresponding boolean already happened and it has to be true.
On the other hand, thread 3 synchronizes with whoever released the value it has acquired from val. Can it happen so (in theory if not in practice) that thread 3 "acquired" the exchange "1 -> 3" by thread 2 (and therefore f2 load returns true), but not the "0 -> 1" by thread 1 (thus the first assertion fires)? This possibility makes no sense to me considering the "reordering" understanding, yet I can't find any confirmation that this cannot happen anywhere.

Neither assertion can ever fail, thanks to ISO C++'s "release sequence" rules. This is the formalism that provides the guarantee you assumed must exist in your last paragraph.
The only stores to val are release-stores with the appropriate bits set, done after the corresponding store to f1 or f2. So if thread_3 sees a value with 1 bit set, it has definitely synchronized-with the writer that set the corresponding variable.
And crucially, they're each part of an RMW, and thus form a release-sequence that lets the acquire load in thread_3 synchronize-with both CAS writes, if it happens to see val == 3.
(Even a relaxed RMW can be part of a release-sequence, although in that case there wouldn't be a happens-before guarantee for stuff before the relaxed RMW, only for other release operations by this or other threads on this atomic variable. If thread_2 had used mo_relaxed, the assert on f2 could fail, but it still couldn't break things so the assert on f1 could ever fail. See also What does "release sequence" mean? and https://en.cppreference.com/w/cpp/atomic/memory_order)
If it helps, I think those CAS loops are fully equivalent to val.fetch_or(1, release). Definitely that's how a compiler would implement fetch_or on a machine with CAS but not an atomic OR primitive. IIRC, in the ISO C++ model, CAS failure is only a load, not an RMW. Not that it matters; a relaxed no-op RMW would still propagate a release-sequence.
(Fun fact: x86 asm lock cmpxchg is always a real RMW, even on failure, at least on paper. But it's also a full barrier, so basically irrelevant to any reasoning about weakly-ordered RMWs.)

Why does this cppreference excerpt seem to wrongly suggest that atomics can protect critical sections?

int main() {
std::vector<int> foo;
std::atomic<int> bar{0};
std::mutex mx;
auto job = [&] {
int asdf = bar.load();
// std::lock_guard lg(mx);
foo.emplace_back(1);
bar.store(foo.size());
};
std::thread t1(job);
std::thread t2(job);
t1.join();
t2.join();
}
This obviously is not guaranteed to work, but works with a mutex. But how can that be explained in terms of the formal definitions of the standard?
Consider this excerpt from cppreference:
If an atomic store in thread A is tagged memory_order_release and an
atomic load in thread B from the same variable is tagged
memory_order_acquire [as is the case with default atomics], all memory writes (non-atomic and relaxed
atomic) that happened-before the atomic store from the point of view
of thread A, become visible side-effects in thread B. That is, once
the atomic load is completed, thread B is guaranteed to see everything
thread A wrote to memory.
Atomic loads and stores (with the default or with the specific acquire and release memory order specified) have the mentioned acquire-release semantics. (So does a mutex's lock and unlock.)
An interpretation of that wording could be that when Thread 2's load operation syncs with the store operation of Thread1, it is guaranteed to observe all (even non-atomic) writes that happened-before the store, such as the vector-modification, making this well-defined. But pretty much everyone would agree that this can lead to a segmentation fault and would surely do so if the job function ran its three lines in a loop.
What standard wording explains the obvious difference in capability between the two tools, given that this wording seems to imply that atomic would synchronize in a way.
I know when to use mutexes and atomics, and I know that the example doesn't work because no synchronization actually happens. My question is how the definition is to be interpreted so it doesn't contradict the way it works in reality.

The quoted passage means that when B loads the value that A stored, then by observing that the store happened, it can also be assured that everything that B did before the store has also happened and is visible.
But this doesn't tell you anything if the store has not in fact happened yet!
The actual C++ standard says this more explicitly. (Always remember that cppreference, while a valuable resource which often quotes from or paraphrases the standard, is not the standard itself and is not authoritative.) From N4861, the final C++20 draft, we have in atomics.order p2:
An atomic operation A that performs a release operation on an atomic object M synchronizes with an atomic
operation B that performs an acquire operation on M and takes its value from any side effect in the release
sequence headed by A.
I would agree that if the load in your thread B returned 1, it could safely conclude that the other thread had finished its store and therefore had exited the critical section, and therefore B could safely use foo. In this case the load in B has synchronized with the store in A, since the value of the load (namely 1) came from the store (which is part of its own release sequence).
But it is entirely possible that both loads return 0, if both threads do their loads before either one does its store. The value 0 didn't come from either store, so the loads don't synchronize with the stores in that case. Your code doesn't even look at the value that was loaded, so both threads may enter the critical section together in that case.
The following code would be a safe, though inefficient, way to use an atomic to protect a critical section. It ensures that A will execute the critical section first, and B will wait until A has finished before proceeding. (Obviously if both threads wait for the other then you have a deadlock.)
int main() {
std::vector<int> foo;
std::atomic<int> bar{0};
std::mutex mx;
auto jobA = [&] {
foo.emplace_back(1);
bar.store(foo.size());
};
auto jobB = [&] {
while (bar.load() == 0) /* spin */ ;
foo.emplace_back(1);
};
std::thread t1(jobA);
std::thread t2(jobB);
t1.join();
t2.join();
}

Setting aside the elephant in the room that none of the C++ containers are thread safe without employing locking of some sort (so forget about using emplace_back without implementing locking), and focusing on the question of why atomic objects alone are not sufficient:
You need more than atomic objects. You also need sequencing.
All that an atomic object gives you is that when an object changes state, any other thread will either see its old value or its new value, and it will never see any "partially old/partially new", or "intermediate" value.
But it makes no guarantee whatsoever as to when other execution threads will "see" the atomic object's new value. At some point they (hopefully) will, see the atomic object's instantly flip to its new value. When? Eventually. That's all that you get from atomics.
One execution thread may very well set an atomic object to a new value, but other execution threads will still have the old value cached, in some form or fashion, and will continue to see the atomic object's old value, and won't "see" the atomic object's new value until some intermediate time passes (if ever).
Sequencing are rules that specify when objects' new values are visible in other execution threads. The simplest way to get both atomicity and easy to deal with sequencing, in one fell swoop, is to use mutexes and condition variables which handle all the hard details for you. You can still use atomics and with a careful logic use lock/release fence instructions to implement proper sequencing. But it's very easy to get it wrong, and the worst of it you won't know that it's wrong until your code starts going off the rails due to improper sequencing and it'll be nearly impossible to accurately reproduce the faulty behavior for debugging purposes.
But for nearly all common, routine, garden-variety tasks mutexes and condition variables is the most simplest solution to proper inter-thread sequencing.

The idea is that when Thread 2's load operation syncs with the store operation of Thread1, it is guaranteed to observe all (even non-atomic) writes that happened-before the store, such as the vector-modification
Yes all writes that done by foo.emplace_back(1); would be guaranteed when bar.store(foo.size()); is executed. But who guaranteed you that foo.emplace_back(1); from thread 1 would see any/all non partial consistent state from foo.emplace_back(1); executed in thread 2 and vice versa? They both read and modify internal state of std::vector and there is no memory barrier before code reaches atomic store. And even if all variables would be read/modified atomically std::vector state consists of multiple variables - size, capacity, pointer to the data at least. Changes to all of them must be synchronized as well and memory barrier is not enough for that.
To explain little more let's create simplified example:
int a = 0;
int b = 0;
std::atomic<int> at;
// thread 1
int foo = at.load();
a = 1;
b = 2;
at.store(foo);
// thread 2
int foo = at.load();
int tmp1 = a;
int tmp2 = b;
at.store(tmp2);
Now you have 2 problems:
There is no guarantee that when tmp2 value is 2
tmp1 value would be 1
as you read a and b before atomic operation.
There is no guarantee that when at.store(b)
is executed that either a == b == 0 or a == 1 and b == 2,
it could be a == 1 but still b == 0.
Is that clear?
But:
// thread 1
mutex.lock();
a = 1;
b = 2;
mutex.unlock();
// thread 2
mutex.lock();
int tmp1 = a;
int tmp2 = b;
mutex.unlock();
You either get tmp1 == 0 and tmp2 == 0 or tmp1 == 1 and tmp2 == 2, do you see the difference?

Are relaxed atomic store reordered themselves before the release? (similar with load /acquire)

I read in the en.cppreference.com specifications relaxed operations on atomics:
"[...]only guarantee atomicity and modification order
consistency."
So, I was asking myself if such 'modification order' would work when you are working on the same atomic variable or different ones.
In my code I have an atomic tree, where a low priority, event based message thread fills which node should be updated storing some data on red '1' atomic (see picture), using memory_order_relaxed. Then it continues writing in its parent using fetch_or to know which child atomic has been updated. Each atomic supports up to 64 bits, so I fill the bit 1 in red operation '2'. It continues successively until the root atomic which is also flagged using the fetch_or but using this time memory_order_release.
Then a fast, real time, unblockable, thread loads the control atomic (with memory_order_acquire) and reads which bits have it enabled. Then it updates recursively the childs atomics with memory_order_relaxed. And that is how I sync my data with each cycle of the high priority thread.
Since this thread is updating, it is fine child atomics are being stored before its parent. The problem is when it stores a parent (filling the bit of the children to update) before I fill the child information.
In other words, as the tittle says, are the relaxed stores reordered between them before the release one? I don't mind non-atomic variables are reordered. Pseudo-code, suppose [x, y, z, control] are atomic and with initial values 0:
Event thread:
z = 1; // relaxed
y = 1; // relaxed
x = 1; // relaxed;
control = 0; // release
Real time thread (loop):
load control; // acquire
load x; // relaxed
load y; // relaxed
load z; // relaxed
I wonder if in the real time thread this would be true always: x <= y <=z. To check that I wrote this small program:
#define _ENABLE_ATOMIC_ALIGNMENT_FIX 1
#include <atomic>
#include <iostream>
#include <thread>
#include <assert.h>
#include <array>
using namespace std;
constexpr int numTries = 10000;
constexpr int arraySize = 10000;
array<atomic<int>, arraySize> tat;
atomic<int> tsync {0};
void writeArray()
{
// Stores atomics in reverse order
for (int j=0; j!=numTries; ++j)
{
for (int i=arraySize-1; i>=0; --i)
{
tat[i].store(j, memory_order_relaxed);
}
tsync.store(0, memory_order_release);
}
}
void readArray()
{
// Loads atomics in normal order
for (int j=0; j!=numTries; ++j)
{
bool readFail = false;
tsync.load(memory_order_acquire);
int minValue = 0;
for (int i=0; i!=arraySize; ++i)
{
int newValue = tat[i].load(memory_order_relaxed);
// If it fails, it stops the execution
if (newValue < minValue)
{
readFail = true;
cout << "fail " << endl;
break;
}
minValue = newValue;
}
if (readFail) break;
}
}
int main()
{
for (int i=0; i!=arraySize; ++i)
{
tat[i].store(0);
}
thread b(readArray);
thread a(writeArray);
a.join();
b.join();
}
How it works: There is an array of atomic. One thread stores with relaxed ordering in reverse order and ends storing a control atomic with release ordering.
The other thread loads with acquire ordering that control atomic, then it loads with relaxed that atomic the rest of values of the array. Since the parents mustn't be updates before the children, the newValue should always be equal or greater than the oldValue.
I've executed this program on my computer several times, debug and release, and it doesn't trig the fail. I'm using a normal x64 Intel i7 processor.
So, is it safe to suppose that relaxed stores to multiple atomics do keep the 'modification order' at least when they are being sync with a control atomic and acquire/release?

Sadly, you will learn very little about what the Standard supports by experiment with x86_64, because the x86_64 is so well-behaved. In particular, unless you specify _seq_cst:
all reads are effectively _acquire
all writes are effectively _release
unless they cross a cache-line boundary. And:
all read-modify-write are effectively seq_cst
Except that the compiler is (also) allowed to re-order _relaxed operations.
You mention using _relaxed fetch_or... and if I understand correctly, you may be disappointed to learn that is no less expensive than seq_cst, and requires a LOCK prefixed instruction, carrying the full overhead of that.
But, yes _relaxed atomic operations are indistinguishable from ordinary operations as far as ordering is concerned. So yes, they may be reordered wrt to other _relaxed atomic operations as well as not-atomic ones -- by the compiler and/or the machine. [Though, as noted, on x86_64, not by the machine.]
And, yes where a release operation in thread X synchronizes-with an acquire operation in thread Y, all writes in thread X which are sequenced-before the release will have happened-before the acquire in thread Y. So the release operation is a signal that all writes which precede it in X are "complete", and when the acquire operation sees that signal Y knows it has synchronized and can read what was written by X (up to the release).
Now, the key thing to understand here is that simply doing a store _release is not enough, the value which is stored must be an unambiguous signal to the load _acquire that the store has happened. For otherwise, how can the load tell ?
Generally a _release/_acquire pair like this are used to synchronize access to some collection of data. Once that data is "ready", a store _release signals that. Any load _acquire which sees the signal (or all loads _acquire which see the signal) know that the data is "ready" and they can read it. Of course, any writes to the data which come after the store _release may (depending on timing) also be seen by the load(s) _acquire. What I am trying to say here, is that another signal may be required if there are to be further changes to the data.
Your little test program:
initialises tsync to 0
in the writer: after all the tat[i].store(j, memory_order_relaxed), does tsync.store(0, memory_order_release)
so the value of tsync does not change !
in the reader: does tsync.load(memory_order_acquire) before doing tat[i].load(memory_order_relaxed)
and ignores the value read from tsync
I am here to tell you that the _release/_acquire pairs are not synchronizing -- all these stores/load may as well be _relaxed. [I think your test will "pass" if the the writer manages to stay ahead of the reader. Because on x86-64 all writes are done in instruction order, as are all reads.]
For this to be a test of _release/_acquire semantics, I suggest:
initialises tsync to 0 and tat[] to all zero.
in the writer: run j = 1..numTries
after all the tat[i].store(j, memory_order_relaxed), write tsync.store(j, memory_order_release)
this signals that the pass is complete, and that all tat[] is now j.
in the reader: do j = tsync.load(memory_order_acquire)
a pass across tat[] should find j <= tat[i].load(memory_order_relaxed)
and after the pass, j == numTries signals that the writer has finished.
where the signal sent by the writer is that it has just completed writing j, and will continue with j+1, unless j == numTries. But this does not guarantee the order in which tat[] are written.
If what you wanted was for the writer to stop after each pass, and wait for the reader to see it and signal same -- then you need another signal and you need the threads to wait for their respective "you may proceed" signal.

The quote about relaxed giving modification order consistency. only means that all threads can agree on a modification order for that one object. i.e. an order exists. A later release-store that synchronizes with an acquire-load in another thread will guarantee that it's visible. https://preshing.com/20120913/acquire-and-release-semantics/ has a nice diagram.
Any time you're storing a pointer that other threads could load and deref, use at least mo_release if any of the pointed-to data has also been recently modified, if it's necessary that readers also see those updates. (This includes anything indirectly reachable, like levels of your tree.)
On any kind of tree / linked-list / pointer-based data structure, pretty much the only time you could use relaxed would be in newly-allocated nodes that haven't been "published" to the other threads yet. (Ideally you can just pass args to constructors so they can be initialized without even trying to be atomic at all; the constructor for std::atomic<T>() is not itself atomic. So you must use a release store when publishing a pointer to a newly-constructed atomic object.)
On x86 / x86-64, mo_release has no extra cost; plain asm stores already have ordering as strong as release so the compiler only needs to block compile time reordering to implement var.store(val, mo_release); It's also pretty cheap on AArch64, especially if you don't do any acquire loads soon after.
It also means you can't test for relaxed being unsafe using x86 hardware; the compiler will pick one order for the relaxed stores at compile time, nailing them down into release operations in whatever order it picked. (And x86 atomic-RMW operations are always full barriers, effectively seq_cst. Making them weaker in the source only allows compile-time reordering. Some non-x86 ISAs can have cheaper RMWs as well as load or store for weaker orders, though, even acq_rel being slightly cheaper on PowerPC.)

How can an atomic operation be not a synchronization operation?

The standard says that a relaxed atomic operation is not a synchronization operation. But what's atomic about an operation result of which is not seen by other threads.
The example here wouldn't give the expected result then, right?
What I understand by synchronization is that the result of an operation with such trait would be visible by all threads.
Maybe I don't understand what synchronization means.
Where's the hole in my logic?

The compiler and the CPU are allowed to reorder memory accesses. It's the as-if rule and it assumes a single-threaded process.
In multithreaded programs, the memory order parameter specifies how memory accesses are to be ordered around an atomic operation. This is the synchronization aspect (the "acquire-release semantics") of an atomic operation that is separate from the atomicity aspect itself:
int x = 1;
std::atomic<int> y = 1;
// Thread 1
x++;
y.fetch_add(1, std::memory_order_release);
// Thread 2
while ((y.load(std::memory_order_acquire) == 1)
{ /* wait */ }
std::cout << x << std::endl; // x is 2 now
Whereas with a relaxed memory order we only get atomicity, but not ordering:
int x = 1;
std::atomic<int> y = 1;
// Thread 1
x++;
y.fetch_add(1, std::memory_order_relaxed);
// Thread 2
while ((y.load(std::memory_order_relaxed) == 1)
{ /* wait */ }
std::cout << x << std::endl; // x can be 1 or 2, we don't know
Indeed as Herb Sutter explains in his excellent atomic<> weapons talk, memory_order_relaxed makes a multithreaded program very difficult to reason about and should be used in very specific cases only, when there is no dependency between the atomic operation and any other operation before or after it in any thread (very rarely the case).

Yes, standard is correct. Relaxed atomics are not synchronization operation, as only atomicity of operation is guaranteed.
For example,
int k = 5;
void foo() {
k = 10;
}
int baz() {
return k;
}
In presence of multiple threads, the behavior is undefined as it exposes race condition. In practice on some architectures it could happen that a caller of baz would see nor 10, no 5, but some other, indeterminate value. It is often called torn or dirty read.
If a relaxed atomic load and store was used instead baz would be guaranteed to return either 5 or 10, as there would be no data race.
It is worth noting that for practical purposes, Intel chips and their very strong memory model make relaxed atomic a noop (meaning there is no extra cost for it being atomic) on this common architecture, as loads and stores are atomic on hardware level.

Suppose we have
std::atomic<int> x = 0;
// thread 1
foo();
x.store(1, std::memory_order_relaxed);
// thread 2
assert(x.load(std::memory_order_relaxed) == 1);
bar();
There is, first of all, no guarantee that thread 2 will observe the value 1 (that is, the assert may fire). But even if thread 2 does observe the value 1, while thread 2 is executing bar(), it might not observe side effects generated by foo() in thread 1. And if foo() and bar() access the same non-atomic variables, a data race may occur.
Now suppose we change the example to:
std::atomic<int> x = 0;
// thread 1
foo();
x.store(1, std::memory_order_release);
// thread 2
assert(x.load(std::memory_order_acquire) == 1);
bar();
There is still no guarantee that thread 2 observes the value 1; after all, it could happen that the load occurs before the store. However, in this case, if thread 2 observes the value 1, then the store in thread 1 synchronizes with the load in thread 2. What this means is that everything that's sequenced before the store in thread 1 happens before everything that's sequenced after the load in thread 2. Therefore, bar() will see all the side effects produced by foo() and if they both access the same non-atomic variables, no data race will occur.
So, as you can see, the synchronization properties of operations on x tell you nothing about what happens to x. Instead, synchronization imposes ordering on surrounding operations in the two threads. (Therefore, in the linked example, the result is always 5, and does not depend on the memory ordering; the synchronization properties of the fetch-add operations don't affect the effect of the fetch-add operations themselves.)

Re-ordering Atomic Reads

I am working on a multithreded algorithm which reads two shared atomic variables:
std::atomic<int> a(10);
std::atomic<int> b(20);
void func(int key) {
int b_local = b;
int a_local = a;
/* Some Operations on a & b*/
}
The invariant of the algorithm is that b should be read before reading a.
The question is, can compiler(say GCC) re-order the instructions so that a is read before b? Using explicit memory fences would achieve this but what I want to understand is, can two atomic loads be re-ordered.
Further, after going through Acquire/Release semantics from Herb Sutter's talk(http://herbsutter.com/2013/02/11/atomic-weapons-the-c-memory-model-and-modern-hardware/), I understand that a sequentially consistent system ensures an ordering between acquire(like load) and release(like store). How about ordering between two acquires(like two loads)?
Edit: Adding more info about the code:
Consider two threads T1 & T2 executing:
T1 : reads value of b, sleeps
T2 : changes value of a, returns
T1 : wakes up and reads the new value of a(new value)
Now, consider this scenario with re-ordering:
int a_local =a;
int b_local = b;
T1 : reads value of a, sleeps
T2 : changes value of a, returns
T1 : Doesn't know any thing about change in value of a.
The question is "Can a compiler like GCC re-order two atomic loads`

Description of memory_order_acquire:
no memory accesses in the current thread can be reordered before this load.
As default memory order when loading b is memory_order_seq_cst, which is the strongest one, reading from a cannot be reordered before reading from b.
Even weaker memory orders, as in the code below, provide same garantee:
int b_local = b.load(std::memory_order_acquire);
int a_local = a.load(std::memory_order_relaxed);

Yes, they can be reordered since one order is not different from the other and you put no constraints to force any particular order. There is only one relationship between these lines of codes: int b_local = b; is sequenced before int a_local = a; but since you have only one thread in your code and 2 lines are independent it is completely irrelevant which line is completed first for the 3rd line of code(whatever that line might be) and, hence compiler might reorder it without a doubt.
So, if you need to force some particular order you need:
2+ threads of execution
Establish a happens before relationship between 2 operations in these threads.

Here's what __atomic_base is doing when you call assignment:
operator __pointer_type() const noexcept
{ return load(); }
_GLIBCXX_ALWAYS_INLINE __pointer_type
load(memory_order __m = memory_order_seq_cst) const noexcept
{
memory_order __b = __m & __memory_order_mask;
__glibcxx_assert(__b != memory_order_release);
__glibcxx_assert(__b != memory_order_acq_rel);
return __atomic_load_n(&_M_p, __m);
}
As per the GCC docs on builtins like __atomic_load_n:
https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
"An atomic operation can both constrain code motion and be mapped to hardware instructions for synchronization between threads (e.g., a fence). To which extent this happens is controlled by the memory orders, which are listed here in approximately ascending order of strength. The description of each memory order is only meant to roughly illustrate the effects and is not a specification; see the C++11 memory model for precise semantics.
__ATOMIC_RELAXED
Implies no inter-thread ordering constraints.
__ATOMIC_CONSUME
This is currently implemented using the stronger __ATOMIC_ACQUIRE memory order because of a deficiency in C++11's semantics for memory_order_consume.
__ATOMIC_ACQUIRE
Creates an inter-thread happens-before constraint from the release (or stronger) semantic store to this acquire load. Can prevent hoisting of code to before the operation.
__ATOMIC_RELEASE
Creates an inter-thread happens-before constraint to acquire (or stronger) semantic loads that read from this release store. Can prevent sinking of code to after the operation.
__ATOMIC_ACQ_REL
Combines the effects of both __ATOMIC_ACQUIRE and __ATOMIC_RELEASE.
__ATOMIC_SEQ_CST
Enforces total ordering with all other __ATOMIC_SEQ_CST operations. "
So, if I'm reading this right, it does "constrain code motion", which I read to mean prevent reordering. But I could be misinterpreting the docs.

Yes, I think it can do reorder in addition to several optimizations.
Please check the following resources:
Atomic vs. Non-Atomic Operations
In case you still concern about this issue, try to use mutexes which for sure will prevent memory reordering.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js