Race condition when incrementing and decrementing global variable in C++

I found an example of a race condition that I was able to reproduce under g++ on Linux. What I don't understand is how the order of operations matters in this example.
#include <iostream>
#include <thread>

int va = 0;

void fa() {
    for (int i = 0; i < 10000; ++i)
        ++va;
}

void fb() {
    for (int i = 0; i < 10000; ++i)
        --va;
}

int main() {
    std::thread a(fa);
    std::thread b(fb);
    a.join();
    b.join();
    std::cout << va;
}
I can understand that the order matters if I had used va = va + 1;, because then the RHS va could have changed before being assigned back to the LHS va. Can someone clarify?

The standard says (quoting the latest draft):
[intro.races]
Two expression evaluations conflict if one of them modifies a memory location ([intro.memory]) and the other one reads or modifies the same memory location.
The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, except for the special case for signal handlers described below.
Any such data race results in undefined behavior.
Your example program has a data race, and the behaviour of the program is undefined.
What I don't understand is how the order of operations matters in this example.
The order of operations matters because the operations are not atomic, and they read and modify the same memory location.
I can understand that the order matters if I had used va = va + 1; because then the RHS va could have changed before being assigned back to the LHS va
The same applies to the increment operator. The abstract machine will:
Read a value from memory
Increment the value
Write a value back to memory
There are multiple steps there that can interleave with operations in the other thread.
Even if there were a single operation per thread, there would be no guarantee of well-defined behaviour unless those operations were atomic.
Note, outside the scope of C++: a CPU might have a single instruction for incrementing an integer in memory. x86, for example, has such an instruction, and it can be executed both atomically and non-atomically. It would be wasteful for the compiler to use the atomic form unless you explicitly use atomic operations in C++.
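To make the decomposition concrete, here is a hand-expanded sketch (not the compiler's actual output) of what ++va amounts to on the abstract machine. If thread b runs its own three steps between thread a's step 1 and step 3, both threads read 0, thread a writes 1, thread b then writes -1, and the increment is lost:
#include <iostream>

int va = 0;

int main() {
    // "++va" written out as the three abstract-machine steps:
    int tmp = va;   // 1) read va from memory
    tmp = tmp + 1;  // 2) increment the value in a register
    va = tmp;       // 3) write the value back to memory
    std::cout << va << '\n';
}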

First of all, this is undefined behaviour since the two threads' reads and writes of the same non-atomic variable va are potentially concurrent and neither happens before the other.
With that being said, if you want to understand what your computer is actually doing when this program is run, it may help to assume that ++va is the same as va = va + 1. In fact, the standard says they are identical, and the compiler will likely compile them identically. Since your program contains UB, the compiler is not required to do anything sensible like using an atomic increment instruction. If you wanted an atomic increment instruction, you should have made va atomic. Similarly, --va is the same as va = va - 1. So in practice, various results are possible.

The important idea here is that when C++ is compiled, it is "translated" to assembly language. The translation of ++va or --va results in assembly code that moves the value of va into a register, then stores the result of adding 1 to that register back to va in a separate instruction. In this way it is exactly the same as va = va + 1;. It also means that the operation ++va is not necessarily atomic.
See here for an explanation of what the assembly code for these instructions looks like.
To make the operations atomic, declare the variable as an atomic type, which handles synchronization between threads for you:
std::atomic<int> va{0};
Reference: https://en.cppreference.com/w/cpp/atomic/atomic
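Putting it together, a fixed version of the original program might look like this. This is a sketch: with std::atomic, each increment and decrement is an atomic read-modify-write, so there is no data race and the program always prints 0.
#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> va{0};

void fa() {
    for (int i = 0; i < 10000; ++i)
        ++va;  // atomic read-modify-write: no data race
}

void fb() {
    for (int i = 0; i < 10000; ++i)
        --va;
}

int main() {
    std::thread a(fa);
    std::thread b(fb);
    a.join();
    b.join();
    std::cout << va;  // always prints 0
}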

Related

Is it a data race?

volatile global x = 0;

reader() {
    while (x == 0) {}
    print("World\n");
}

writer() {
    print("Hello, ");
    x = 1;
}

thread(reader);
thread(writer);
https://en.wikipedia.org/wiki/Race_condition#:~:text=Data%20race%5Bedit,only%20atomic%20operations.
From wikipedia,
The precise definition of data race is specific to the formal
concurrency model being used, but typically it refers to a situation
where a memory operation in one thread could potentially attempt to
access a memory location at the same time that a memory operation in
another thread is writing to that memory location, in a context where
this is dangerous.
There is at least one thread that writes x (writer).
There is at least one thread that reads x (reader).
There is no synchronization mechanism for accessing x (both threads access x without any locks).
Therefore, I think the code above is a data race (though obviously not a race condition).
Am I right?
Then what is the meaning of a data race when code contains one but still generates the expected output? (We will see "Hello, World\n", assuming the processor guarantees that a store to an address becomes visible to all load instructions issued after the store instruction.)
----------- added working cpp code ------------
#include <iostream>
#include <thread>

volatile int x = 0;

void reader() {
    while (x == 0) {}
    std::cout << "World" << std::endl;
}

void writer() {
    std::cout << "Hello, ";
    x = 1;
}

int main() {
    std::thread t1(reader);
    std::thread t2(writer);
    t2.join();
    t1.join();
    return 0;
}
Yes, this is a data race and UB.
[intro.races]/2
Two expression evaluations conflict if one of them modifies a memory location ... and the other one reads or modifies the same memory location.
[intro.races]/21
Two actions are potentially concurrent if:
— they are performed by different threads, ...
...
The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, ...
Any such data race results in undefined behavior.
For two things in different threads to "happen before" one another, a synchronization mechanism must be involved, such as non-relaxed atomics, mutexes, and so on.
Yes, data race and consequently undefined behavior in C++. Undefined behavior means that you have no guarantee how the program will behave. Seeing the "expected" output is one possible output, but you are not guaranteed that it will happen.
Here x is non-atomic and is read by thread t1 and written by thread t2 without any synchronization and therefore they cause a data race.
volatile has no impact on whether or not an access is a data race. Only using an atomic (e.g. std::atomic<int>) can remove the data race.
That said, on many common platforms writing to an int will be atomic at the hardware level, the compiler will not optimize away volatile accesses, and it will probably also not reorder volatile accesses with IO, so the code will probably happen to work on those platforms. The language doesn't make this guarantee, though.
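For illustration, a data-race-free version of the example could replace volatile with std::atomic. This is a sketch: the explicit acquire/release orders shown are sufficient here, and the default seq_cst ordering would work too.
#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> x{0};

void reader() {
    // spin until the writer's store becomes visible
    while (x.load(std::memory_order_acquire) == 0) {}
    std::cout << "World" << std::endl;
}

void writer() {
    std::cout << "Hello, ";
    x.store(1, std::memory_order_release);  // publishes everything written before it
}

int main() {
    std::thread t1(reader);
    std::thread t2(writer);
    t2.join();
    t1.join();
}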

Why does this cppreference excerpt seem to wrongly suggest that atomics can protect critical sections?

#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    std::vector<int> foo;
    std::atomic<int> bar{0};
    std::mutex mx;

    auto job = [&] {
        int asdf = bar.load();
        // std::lock_guard lg(mx);
        foo.emplace_back(1);
        bar.store(foo.size());
    };

    std::thread t1(job);
    std::thread t2(job);
    t1.join();
    t2.join();
}
This obviously is not guaranteed to work, but works with a mutex. But how can that be explained in terms of the formal definitions of the standard?
Consider this excerpt from cppreference:
If an atomic store in thread A is tagged memory_order_release and an
atomic load in thread B from the same variable is tagged
memory_order_acquire [as is the case with default atomics], all memory writes (non-atomic and relaxed
atomic) that happened-before the atomic store from the point of view
of thread A, become visible side-effects in thread B. That is, once
the atomic load is completed, thread B is guaranteed to see everything
thread A wrote to memory.
Atomic loads and stores (with the default or with the specific acquire and release memory order specified) have the mentioned acquire-release semantics. (So does a mutex's lock and unlock.)
An interpretation of that wording could be that when Thread 2's load operation syncs with the store operation of Thread 1, it is guaranteed to observe all (even non-atomic) writes that happened-before the store, such as the vector modification, making this well-defined. But pretty much everyone would agree that this can lead to a segmentation fault, and it surely would if the job function ran its three lines in a loop.
What standard wording explains the obvious difference in capability between the two tools, given that this wording seems to imply that atomics would synchronize in this way?
I know when to use mutexes and atomics, and I know that the example doesn't work because no synchronization actually happens. My question is how the definition is to be interpreted so it doesn't contradict the way it works in reality.
The quoted passage means that when B loads the value that A stored, then by observing that the store happened, B can also be assured that everything A did before the store has also happened and is visible.
But this doesn't tell you anything if the store has not in fact happened yet!
The actual C++ standard says this more explicitly. (Always remember that cppreference, while a valuable resource which often quotes from or paraphrases the standard, is not the standard itself and is not authoritative.) From N4861, the final C++20 draft, we have in atomics.order p2:
An atomic operation A that performs a release operation on an atomic object M synchronizes with an atomic
operation B that performs an acquire operation on M and takes its value from any side effect in the release
sequence headed by A.
I would agree that if the load in your thread B returned 1, it could safely conclude that the other thread had finished its store and therefore had exited the critical section, and therefore B could safely use foo. In this case the load in B has synchronized with the store in A, since the value of the load (namely 1) came from the store (which is part of its own release sequence).
But it is entirely possible that both loads return 0, if both threads do their loads before either one does its store. The value 0 didn't come from either store, so the loads don't synchronize with the stores in that case. Your code doesn't even look at the value that was loaded, so both threads may enter the critical section together in that case.
The following code would be a safe, though inefficient, way to use an atomic to protect a critical section. It ensures that A will execute the critical section first, and B will wait until A has finished before proceeding. (Obviously if both threads wait for the other then you have a deadlock.)
#include <atomic>
#include <thread>
#include <vector>

int main() {
    std::vector<int> foo;
    std::atomic<int> bar{0};

    auto jobA = [&] {
        foo.emplace_back(1);
        bar.store(foo.size());
    };

    auto jobB = [&] {
        while (bar.load() == 0) /* spin */ ;
        foo.emplace_back(1);
    };

    std::thread t1(jobA);
    std::thread t2(jobB);
    t1.join();
    t2.join();
}
Setting aside the elephant in the room that none of the C++ containers are thread safe without employing locking of some sort (so forget about using emplace_back without implementing locking), and focusing on the question of why atomic objects alone are not sufficient:
You need more than atomic objects. You also need sequencing.
All that an atomic object gives you is that when an object changes state, any other thread will either see its old value or its new value, and it will never see any "partially old/partially new", or "intermediate" value.
But it makes no guarantee whatsoever as to when other execution threads will "see" the atomic object's new value. At some point they (hopefully) will see the atomic object instantly flip to its new value. When? Eventually. That's all you get from atomics.
One execution thread may very well set an atomic object to a new value, but other execution threads may still have the old value cached, in some form or fashion, and will continue to see the old value, not "seeing" the new value until some indeterminate time later (if ever).
Sequencing rules specify when objects' new values become visible in other execution threads. The simplest way to get both atomicity and easy-to-reason-about sequencing in one fell swoop is to use mutexes and condition variables, which handle all the hard details for you. You can still use atomics and, with careful logic, use acquire/release fences to implement proper sequencing. But it is very easy to get wrong, and the worst of it is that you won't know it's wrong until your code starts going off the rails due to improper sequencing, and it will be nearly impossible to accurately reproduce the faulty behavior for debugging purposes.
But for nearly all common, routine, garden-variety tasks, mutexes and condition variables are the simplest route to proper inter-thread sequencing.
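As a sketch of that garden-variety approach (the names here are illustrative, not taken from the question): one thread writes the shared data under the mutex and signals, the other waits on a condition variable until the data is ready.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex mx;
std::condition_variable cv;
int data = 0;
bool ready = false;

void producer() {
    {
        std::lock_guard<std::mutex> lg(mx);
        data = 42;     // write the shared data under the lock
        ready = true;  // then set the flag
    }
    cv.notify_one();   // wake the consumer
}

void consumer() {
    std::unique_lock<std::mutex> lk(mx);
    cv.wait(lk, [] { return ready; });  // sleeps until ready == true
    std::cout << data << '\n';          // guaranteed to see 42
}

int main() {
    std::thread c(consumer);
    std::thread p(producer);
    p.join();
    c.join();
}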
The idea is that when Thread 2's load operation syncs with the store operation of Thread1, it is guaranteed to observe all (even non-atomic) writes that happened-before the store, such as the vector-modification
Yes, all writes done by foo.emplace_back(1); are guaranteed to be visible once a load observes the value written by bar.store(foo.size());. But who guarantees that foo.emplace_back(1); in thread 1 will see a consistent, non-partial state from foo.emplace_back(1); executed in thread 2, and vice versa? Both read and modify the internal state of the std::vector, and there is no memory barrier before the code reaches the atomic store. And even if all variables were read and modified atomically, a std::vector's state consists of multiple variables: at least a size, a capacity, and a pointer to the data. Changes to all of them must be synchronized as well, and a memory barrier alone is not enough for that.
To explain a little more, let's create a simplified example:
int a = 0;
int b = 0;
std::atomic<int> at;
// thread 1
int foo = at.load();
a = 1;
b = 2;
at.store(foo);
// thread 2
int foo = at.load();
int tmp1 = a;
int tmp2 = b;
at.store(tmp2);
Now you have two problems:
There is no guarantee that when tmp2 is 2, tmp1 will be 1, since you read a and b before the atomic operation.
There is no guarantee that when at.store(tmp2) is executed, either a == 0 and b == 0, or a == 1 and b == 2; it could be that a == 1 but still b == 0.
Is that clear?
But:
// thread 1
mutex.lock();
a = 1;
b = 2;
mutex.unlock();
// thread 2
mutex.lock();
int tmp1 = a;
int tmp2 = b;
mutex.unlock();
You either get tmp1 == 0 and tmp2 == 0 or tmp1 == 1 and tmp2 == 2, do you see the difference?
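To connect this back to acquire/release: the atomic version can be made to synchronize, but only with a single writer and only if the reader acts on the value it loaded. A minimal sketch:
#include <atomic>
#include <iostream>
#include <thread>

int a = 0;
int b = 0;
std::atomic<int> at{0};

void writer() {  // the only writer
    a = 1;
    b = 2;
    at.store(1, std::memory_order_release);  // publish the writes to a and b
}

void reader() {
    if (at.load(std::memory_order_acquire) == 1) {
        // the acquire load saw the release store, so a == 1 and b == 2 are visible
        std::cout << a << ' ' << b << '\n';  // prints "1 2"
    }
}

int main() {
    std::thread t1(writer);
    std::thread t2(reader);
    t1.join();
    t2.join();
}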

Difference between Interlocked, InterlockedAcquire, and InterlockedRelease if single thread reordering is impossible

In all likelihood, a lockless implementation is already overkill for the purposes of my application, but I wanted to look into memory barriers and lockless-ness anyways in case I ever actually need to use these concepts in the future.
From what I can tell:
an "InterlockedAcquire" function performs an atomic operation while preventing the compiler from moving code statements after the InterlockedAcquire to before the InterlockedAcquire.
an "InterlockedRelease" function performs an atomic operation while preventing the compiler from moving code statements before the InterlockedRelease to after the InterlockedRelease.
a vanilla "Interlocked" function performs an atomic operation while preventing the compiler from moving code statements in either direction across the Interlocked call.
My question is, if a function is structured such that the compiler can't reorder any of the code anyway because doing so would affect single-threaded behavior, is there a difference between any of the variants of an Interlocked function, or are they all effectively the same? Is the only difference between them how they interact with code reordering?
For a more concrete example, here's my current application - the produce() function as part of what will eventually be a multiple producer, single consumer queue built using a circular buffer:
template <typename T>
class Queue {
private:
    long headIndex;
    long tailIndex;
    T* array[MAXQUEUESIZE];
public:
    Queue() {
        headIndex = 0;
        tailIndex = 0;
        memset(array, 0, MAXQUEUESIZE * sizeof(void*));
    }
    ~Queue() {
    }
    bool produce(T value) {
        //1) prevents concurrent calls to produce() from causing corruption:
        long indexRetVal;
        long reservedIndex;
        do {
            reservedIndex = tailIndex;
            indexRetVal = InterlockedCompareExchange64(&tailIndex, (reservedIndex + 1) % MAXQUEUESIZE, reservedIndex);
        } while (indexRetVal != reservedIndex);
        //2) allocates the node.
        T* newValPtr = (T*) malloc(sizeof(T));
        if (newValPtr == nullptr) {
            OutputDebugString("Queue: malloc returned null");
            return false;
        }
        *newValPtr = value;
        //3) prevents a concurrent call to consume from causing corruption by atomically replacing the old pointer:
        T* valPtrRetVal = InterlockedCompareExchangePointer(array + reservedIndex, newValPtr, nullptr);
        //if the previous value wasn't null, then our circular buffer overflowed:
        if (valPtrRetVal != nullptr) {
            OutputDebugString("Queue: circular buffer overflowed");
            free(newValPtr); //as pointed out by RbMm
            return false;
        }
        //otherwise, everything worked fine
        return true;
    }
};
As I understand it, 3) will occur after 1) and 2) regardless of what I do anyways, but I should change 1) to an InterlockedRelease because I don't care whether it occurs before or after 2) and I should let the compiler decide.
My question is, if a function is structured such that the compiler can't reorder any of the code anyway because doing so would affect single-threaded behavior, is there a difference between any of the variants of an Interlocked function, or are they all effectively the same? Is the only difference between them how they interact with code reordering?
You may be confusing C++ statements with instructions. Your question isn't CPU specific, so you have to pretend you have no idea what the CPU instructions look like.
Consider this code:
if (a == 2)
{
b = 5;
}
Now, here's an example of a re-ordering of this code that doesn't affect a single thread:
int c = b;
b = 5;
if (a != 2)
b = c;
This performs the same operations but in a different order. It has no effect on single-threaded code. But, of course, if another thread was accessing b, it could see a value of 5 from this code even if a was never 2.
Thus it could also see a value of 5 from the original code even if a is never 2!
Why? Because the two bits of code behave the same from the point of view of a single thread. And unless you use operations with guaranteed threading semantics, that's all the compiler, CPU, caches, and other platform components need to preserve.
So your belief that reordering any of the code would affect single-threaded behavior is most likely incorrect. There are lots of ways to reorder and optimize code without affecting single-threaded behavior.
There is a document on MSDN that explains the difference: Acquire and Release Semantics.
For the sample:
a++;
b++;
c++;
If we use acquire semantics to increment a, other processors would always see the increment of a before the increments of b and c;
If we use release semantics to increment c, other processors would always see the increments of a and b before the increment of c;
The plain InterlockedXxx routines have both acquire and release semantics by default.
More specifically, for four values:
a++;
b++;
c++;
d++;
If we use acquire semantics to increment b, other processors would always see the increment of b before the increments of c and d;
The order may be a->b->c,d or b->a,c,d.
If we use release semantics to increment c, other processors would always see the increments of a and b before the increment of c;
The order may be a,b->c->d or a,b,d->c.
To quote from this answer by @antiduh:
Acquire says "only worry about stuff after me". Release says "only
worry about stuff before me". Combining those both is a full memory
barrier.
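In C++ terms, the three Windows variants correspond roughly to the memory orders you can put on an atomic read-modify-write. The mapping below is approximate, not an official equivalence:
#include <atomic>

std::atomic<long> v{0};

int main() {
    v.fetch_add(1, std::memory_order_acquire);  // ~ InterlockedIncrementAcquire
    v.fetch_add(1, std::memory_order_release);  // ~ InterlockedIncrementRelease
    v.fetch_add(1, std::memory_order_seq_cst);  // ~ InterlockedIncrement (full barrier)
}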
All three versions prevent the compiler from moving code across the function call, but the compiler is not the only place that reordering takes place.
Modern CPUs have out-of-order execution and even speculative execution. Acquire and release semantics cause the compiler to emit instructions with flags or prefixes that control reordering within the CPU.

Are relaxed atomic stores reordered among themselves before the release store? (and similarly for relaxed loads after the acquire load)

I read in the en.cppreference.com documentation that relaxed operations on atomics
"[...] only guarantee atomicity and modification order consistency."
So I was asking myself whether such 'modification order' holds when you are working on the same atomic variable or across different ones.
In my code I have an atomic tree, where a low-priority, event-based message thread marks which node should be updated, storing some data in the red '1' atomic (see picture), using memory_order_relaxed. Then it continues by writing to the node's parent, using fetch_or to record which child atomic has been updated. Each atomic holds up to 64 bits, so I set bit 1 in red operation '2'. It continues successively up to the root atomic, which is also flagged with fetch_or, but this time using memory_order_release.
Then a fast, real-time, non-blocking thread loads the control atomic (with memory_order_acquire) and reads which bits are set. It then recursively updates the child atomics with memory_order_relaxed. And that is how I sync my data with each cycle of the high-priority thread.
Since this thread does the updating, it is fine for child atomics to be stored before their parent. The problem is when it stores a parent (with the bit marking which child to update) before I have filled in the child's information.
In other words, as the title says: are the relaxed stores reordered among themselves before the release one? I don't mind non-atomic variables being reordered. Pseudo-code, where [x, y, z, control] are atomic with initial values 0:
Event thread:
z = 1; // relaxed
y = 1; // relaxed
x = 1; // relaxed;
control = 0; // release
Real time thread (loop):
load control; // acquire
load x; // relaxed
load y; // relaxed
load z; // relaxed
I wonder if, in the real-time thread, x <= y <= z would always hold. To check that, I wrote this small program:
#define _ENABLE_ATOMIC_ALIGNMENT_FIX 1

#include <atomic>
#include <iostream>
#include <thread>
#include <assert.h>
#include <array>

using namespace std;

constexpr int numTries = 10000;
constexpr int arraySize = 10000;

array<atomic<int>, arraySize> tat;
atomic<int> tsync{0};

void writeArray()
{
    // Stores atomics in reverse order
    for (int j = 0; j != numTries; ++j)
    {
        for (int i = arraySize - 1; i >= 0; --i)
        {
            tat[i].store(j, memory_order_relaxed);
        }
        tsync.store(0, memory_order_release);
    }
}

void readArray()
{
    // Loads atomics in normal order
    for (int j = 0; j != numTries; ++j)
    {
        bool readFail = false;
        tsync.load(memory_order_acquire);
        int minValue = 0;
        for (int i = 0; i != arraySize; ++i)
        {
            int newValue = tat[i].load(memory_order_relaxed);
            // If it fails, it stops the execution
            if (newValue < minValue)
            {
                readFail = true;
                cout << "fail " << endl;
                break;
            }
            minValue = newValue;
        }
        if (readFail) break;
    }
}

int main()
{
    for (int i = 0; i != arraySize; ++i)
    {
        tat[i].store(0);
    }
    thread b(readArray);
    thread a(writeArray);
    a.join();
    b.join();
}
How it works: there is an array of atomics. One thread stores to them with relaxed ordering in reverse order, and finishes by storing to a control atomic with release ordering.
The other thread loads that control atomic with acquire ordering, then loads the rest of the values of the array with relaxed ordering. Since the parents mustn't be updated before the children, newValue should always be greater than or equal to the previous value.
I've executed this program on my computer several times, in debug and release, and it doesn't trigger the failure. I'm using a normal x64 Intel i7 processor.
So, is it safe to suppose that relaxed stores to multiple atomics keep the 'modification order', at least when they are synchronized with a control atomic and acquire/release?
Sadly, you will learn very little about what the Standard supports by experimenting on x86_64, because x86_64 is so well-behaved. In particular, unless you specify _seq_cst:
all reads are effectively _acquire
all writes are effectively _release
unless they cross a cache-line boundary. And:
all read-modify-write are effectively seq_cst
Except that the compiler is (also) allowed to re-order _relaxed operations.
You mention using a _relaxed fetch_or... and if I understand correctly, you may be disappointed to learn that it is no less expensive than seq_cst: it requires a LOCK-prefixed instruction, carrying the full overhead of that.
But, yes, _relaxed atomic operations are indistinguishable from ordinary operations as far as ordering is concerned. So yes, they may be reordered with respect to other _relaxed atomic operations as well as non-atomic ones, by the compiler and/or the machine. [Though, as noted, on x86_64, not by the machine.]
And, yes where a release operation in thread X synchronizes-with an acquire operation in thread Y, all writes in thread X which are sequenced-before the release will have happened-before the acquire in thread Y. So the release operation is a signal that all writes which precede it in X are "complete", and when the acquire operation sees that signal Y knows it has synchronized and can read what was written by X (up to the release).
Now, the key thing to understand here is that simply doing a store _release is not enough; the value which is stored must be an unambiguous signal to the load _acquire that the store has happened. For otherwise, how can the load tell?
Generally a _release/_acquire pair like this are used to synchronize access to some collection of data. Once that data is "ready", a store _release signals that. Any load _acquire which sees the signal (or all loads _acquire which see the signal) know that the data is "ready" and they can read it. Of course, any writes to the data which come after the store _release may (depending on timing) also be seen by the load(s) _acquire. What I am trying to say here, is that another signal may be required if there are to be further changes to the data.
Your little test program:
initialises tsync to 0
in the writer: after all the tat[i].store(j, memory_order_relaxed), does tsync.store(0, memory_order_release)
so the value of tsync never changes!
in the reader: does tsync.load(memory_order_acquire) before doing the tat[i].load(memory_order_relaxed)
and ignores the value read from tsync
I am here to tell you that these _release/_acquire pairs are not synchronizing; all these stores/loads may as well be _relaxed. [I think your test will "pass" if the writer manages to stay ahead of the reader, because on x86_64 all writes are performed in instruction order, as are all reads.]
For this to be a test of _release/_acquire semantics, I suggest:
initialise tsync to 0 and tat[] to all zeros.
in the writer: run j = 1..numTries
after all the tat[i].store(j, memory_order_relaxed), write tsync.store(j, memory_order_release)
this signals that the pass is complete, and that all of tat[] is now j.
in the reader: do j = tsync.load(memory_order_acquire)
a pass across tat[] should then find j <= tat[i].load(memory_order_relaxed)
and after the pass, j == numTries signals that the writer has finished.
Here the signal sent by the writer is that it has just completed writing j, and will continue with j+1, unless j == numTries. But this does not guarantee the order in which the tat[] are written.
If what you wanted was for the writer to stop after each pass, and wait for the reader to see it and signal same -- then you need another signal and you need the threads to wait for their respective "you may proceed" signal.
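A sketch of the suggested test, keeping the structure of the original program but publishing the pass number j through tsync:
#include <array>
#include <atomic>
#include <iostream>
#include <thread>

constexpr int numTries = 10000;
constexpr int arraySize = 10000;

std::array<std::atomic<int>, arraySize> tat;
std::atomic<int> tsync{0};

void writeArray() {
    for (int j = 1; j <= numTries; ++j) {
        for (int i = arraySize - 1; i >= 0; --i)
            tat[i].store(j, std::memory_order_relaxed);
        tsync.store(j, std::memory_order_release);  // signal: pass j is complete
    }
}

void readArray() {
    int j = 0;
    while (j != numTries) {
        j = tsync.load(std::memory_order_acquire);  // which pass has completed?
        for (int i = 0; i != arraySize; ++i) {
            // pass j completed before the signal, so every element is at least j
            if (tat[i].load(std::memory_order_relaxed) < j)
                std::cout << "fail" << std::endl;
        }
    }
}

int main() {
    std::thread b(readArray);
    std::thread a(writeArray);
    a.join();
    b.join();
}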
The quote about relaxed giving "modification order consistency" only means that all threads can agree on a modification order for that one object, i.e. that an order exists. A later release store that synchronizes with an acquire load in another thread will guarantee that it's visible. https://preshing.com/20120913/acquire-and-release-semantics/ has a nice diagram.
Any time you're storing a pointer that other threads could load and deref, use at least mo_release if any of the pointed-to data has also been recently modified, if it's necessary that readers also see those updates. (This includes anything indirectly reachable, like levels of your tree.)
On any kind of tree / linked-list / pointer-based data structure, pretty much the only time you could use relaxed would be in newly-allocated nodes that haven't been "published" to the other threads yet. (Ideally you can just pass args to constructors so they can be initialized without even trying to be atomic at all; the constructor for std::atomic<T>() is not itself atomic. So you must use a release store when publishing a pointer to a newly-constructed atomic object.)
On x86 / x86-64, mo_release has no extra cost; plain asm stores already have ordering as strong as release so the compiler only needs to block compile time reordering to implement var.store(val, mo_release); It's also pretty cheap on AArch64, especially if you don't do any acquire loads soon after.
It also means you can't test for relaxed being unsafe using x86 hardware; the compiler will pick one order for the relaxed stores at compile time, nailing them down into release operations in whatever order it picked. (And x86 atomic-RMW operations are always full barriers, effectively seq_cst. Making them weaker in the source only allows compile-time reordering. Some non-x86 ISAs can have cheaper RMWs as well as load or store for weaker orders, though, even acq_rel being slightly cheaper on PowerPC.)
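A minimal sketch of that publication pattern (the Node type and the names here are illustrative):
#include <atomic>

struct Node {
    int payload;
    explicit Node(int p) : payload(p) {}  // fully initialized before publication
};

std::atomic<Node*> head{nullptr};

void publish() {
    Node* n = new Node(42);                    // plain writes: node not yet visible
    head.store(n, std::memory_order_release);  // the payload write happens-before this store
}

int try_consume() {
    Node* n = head.load(std::memory_order_acquire);  // pairs with the release store
    return n ? n->payload : -1;  // if non-null, the payload is guaranteed visible
}

int main() {
    publish();
    return try_consume() == 42 ? 0 : 1;
}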

Re-ordering Atomic Reads

I am working on a multithreaded algorithm which reads two shared atomic variables:
#include <atomic>

std::atomic<int> a(10);
std::atomic<int> b(20);

void func(int key) {
    int b_local = b;
    int a_local = a;
    /* Some operations on a & b */
}
The invariant of the algorithm is that b should be read before a.
The question is: can the compiler (say GCC) re-order the instructions so that a is read before b? Using explicit memory fences would enforce the ordering, but what I want to understand is whether two atomic loads can be re-ordered.
Further, after going through the acquire/release semantics in Herb Sutter's talk (http://herbsutter.com/2013/02/11/atomic-weapons-the-c-memory-model-and-modern-hardware/), I understand that a sequentially consistent system ensures an ordering between acquire (like load) and release (like store) operations. What about the ordering between two acquires (like two loads)?
Edit: Adding more info about the code:
Consider two threads T1 & T2 executing:
T1 : reads value of b, sleeps
T2 : changes value of a, returns
T1 : wakes up and reads the new value of a
Now, consider this scenario with the reads re-ordered:
int a_local = a;
int b_local = b;
T1 : reads value of a, sleeps
T2 : changes value of a, returns
T1 : wakes up, but doesn't know anything about the change in the value of a
The question is: can a compiler like GCC re-order two atomic loads?
Description of memory_order_acquire:
no memory accesses in the current thread can be reordered before this load.
Since the default memory order when loading b is memory_order_seq_cst, which is the strongest one, reading from a cannot be reordered before reading from b.
Even weaker memory orders, as in the code below, provide the same guarantee:
int b_local = b.load(std::memory_order_acquire);
int a_local = a.load(std::memory_order_relaxed);
Yes, they can be reordered, since one order is not observably different from the other and you put no constraints in place to force any particular order. There is only one relationship between these lines of code: int b_local = b; is sequenced before int a_local = a;. But since you have only one thread in your code and the two lines are independent, it is completely irrelevant which line completes first as far as any third line of code is concerned, and hence the compiler may reorder them without a doubt.
So, if you need to force some particular order you need:
2+ threads of execution
Establish a happens before relationship between 2 operations in these threads.
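A minimal sketch of such a happens-before relationship, applied to the question's two loads: if the acquire load of b sees the value written by the release store, the subsequent load of a is guaranteed to see the write that preceded that store.
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> a{10};
std::atomic<int> b{20};

void writer() {
    a.store(11, std::memory_order_relaxed);  // write a first...
    b.store(21, std::memory_order_release);  // ...then publish through b
}

void reader() {
    int b_local = b.load(std::memory_order_acquire);  // later loads can't be hoisted above this
    int a_local = a.load(std::memory_order_relaxed);
    if (b_local == 21)
        assert(a_local == 11);  // guaranteed: the acquire synchronized with the release
}

int main() {
    std::thread t1(writer);
    std::thread t2(reader);
    t1.join();
    t2.join();
}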
Here's what __atomic_base does when you read the atomic, as in int b_local = b; (the implicit conversion operator calls load()):
operator __pointer_type() const noexcept
{ return load(); }

_GLIBCXX_ALWAYS_INLINE __pointer_type
load(memory_order __m = memory_order_seq_cst) const noexcept
{
    memory_order __b = __m & __memory_order_mask;
    __glibcxx_assert(__b != memory_order_release);
    __glibcxx_assert(__b != memory_order_acq_rel);
    return __atomic_load_n(&_M_p, __m);
}
As per the GCC docs on builtins like __atomic_load_n:
https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
"An atomic operation can both constrain code motion and be mapped to hardware instructions for synchronization between threads (e.g., a fence). To which extent this happens is controlled by the memory orders, which are listed here in approximately ascending order of strength. The description of each memory order is only meant to roughly illustrate the effects and is not a specification; see the C++11 memory model for precise semantics.
__ATOMIC_RELAXED
Implies no inter-thread ordering constraints.
__ATOMIC_CONSUME
This is currently implemented using the stronger __ATOMIC_ACQUIRE memory order because of a deficiency in C++11's semantics for memory_order_consume.
__ATOMIC_ACQUIRE
Creates an inter-thread happens-before constraint from the release (or stronger) semantic store to this acquire load. Can prevent hoisting of code to before the operation.
__ATOMIC_RELEASE
Creates an inter-thread happens-before constraint to acquire (or stronger) semantic loads that read from this release store. Can prevent sinking of code to after the operation.
__ATOMIC_ACQ_REL
Combines the effects of both __ATOMIC_ACQUIRE and __ATOMIC_RELEASE.
__ATOMIC_SEQ_CST
Enforces total ordering with all other __ATOMIC_SEQ_CST operations. "
So, if I'm reading this right, it does "constrain code motion", which I read to mean prevent reordering. But I could be misinterpreting the docs.
Yes, I think it can reorder them, in addition to applying several other optimizations.
Please check the following resources:
Atomic vs. Non-Atomic Operations
If you are still concerned about this issue, try using mutexes, which will certainly prevent memory reordering.