Memory ordering in sub-expressions of expressions carrying a dependency - c++

Consider the following code:
int data = 0;
std::atomic<int> ready = 0;
void writer_thread() {
data = 25;
ready.store(1, std::memory_order::release);
}
void reader_thread() {
int ready_value = 0;
while(!ready_value) {
ready_value = ready.load(std::memory_order::consume);
}
assert(data == 25); // #1
assert(data + ready_value == 26); // #2
}
The assertion marked #1 may or may not pass, because data does not carry a dependency from ready, which causes a race condition.
Assertion #2 is more complicated. It would seem that the entire expression data + ready_value == 26 carries a dependency from ready, as explained in [intro.races#7] - an expression carries a dependency from its operands, and this relation is transitive.
But even if we establish that this is true, what does it imply about the memory ordering within this expression? Consider the following paragraph from cppreference:
If an atomic store in thread A is tagged memory_order_release and an atomic load in thread B from the same variable that read the stored value is tagged memory_order_consume, all memory writes (non-atomic and relaxed atomic) that happened-before the atomic store from the point of view of thread A, become visible side-effects within those operations in thread B into which the load operation carries dependency, that is, once the atomic load is completed, those operators and functions in thread B that use the value obtained from the load are guaranteed to see what thread A wrote to memory.
I am not sure how to interpret the emphasized sentences. Does the guarantee about the visibility of the side-effects apply to every sub-expression as well?
Let's take the addition expression data + ready_value, which, as a whole, carries a dependency from ready. While evaluating the data sub-expression of this expression, are the side-effects visible or not?
If they are, then I believe the assertion #2 does not contain a race condition.
But if they are not, then the evaluation of the sub-expression data might produce a 0, and the assertion contains a race condition.

Assertion #2 is more complicated.
No, it isn't:
The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, except for the special case for signal handlers described below. Any such data race results in undefined behavior.
Accessing data is an "action". Both reader_thread and writer_thread attempt to access data. Pursuant to [intro.races]/2:
Two expression evaluations conflict if one of them modifies a memory location ([intro.memory]) and the other one reads or modifies the same memory location.
writer_thread assigns to data. reader_thread reads from data. The two actions therefore potentially conflict.
Both accesses are "potentially concurrent", as they are accessed from different threads.
Neither access is atomic. Neither access is one of the signal handling exceptions.
That leaves "happens before". Well, consume only matters for the atomic object being accessed and the expressions and side-effects that generated that value. Consume only gives us the "dependency-ordered before" relation. And this only applies to the written atomic object and any side-effects used to write that atomic value, through the "carries dependency" relationship as you cited.
ready_value might carry the dependency of ready, but data is not part of that dependency chain.
As such, writer_thread's write to data does not "happen before" reader_thread's read of data. Therefore, accessing its data is undefined behavior.
Nowhere is there a statement about accessing ready or ready_value.
Always. That's just how consume works.

Related

C++: std::memory_order in std::atomic_flag::test_and_set to do some work only once by a set of threads

Could you please help me to understand what std::memory_order should be used in std::atomic_flag::test_and_set to do some work only once by a set of threads and why? The work should be done by whatever thread gets to it first, and all other threads should just check as quickly as possible that someone is already going the work and continue working on other tasks.
In my tests of the example below, any memory order works, but I think that it is just a coincidence. I suspect that Release-Acquire ordering is what I need, but, in my case, only one memory_order can be used in both threads (it is not the case that one thread can use memory_order_release and the other can use memory_order_acquire since I do not know which thread will arrive to doing the work first).
#include <atomic>
#include <iostream>
#include <thread>
std::atomic_flag done = ATOMIC_FLAG_INIT;
const std::memory_order order = std::memory_order_seq_cst;
//const std::memory_order order = std::memory_order_acquire;
//const std::memory_order order = std::memory_order_relaxed;
void do_some_work_that_needs_to_be_done_only_once(void)
{ std::cout<<"Hello, my friend\n"; }
void run(void)
{
if(not done.test_and_set(order))
do_some_work_that_needs_to_be_done_only_once();
}
int main(void)
{
std::thread a(run);
std::thread b(run);
a.join();
b.join();
// expected result:
// * only one thread said hello
// * all threads spent as little time as possible to check if any
// other thread said hello yet
return 0;
}
Thank you very much for your help!
Following up on some things in the comments:
As has been discussed, there is a well-defined modification order M for done on any given run of the program. Every thread does one store to done, which means one entry in M. And by the nature of atomic read-modify-writes, the value returned by each thread's test_and_set is the value that immediately precedes its own store in the order M. That's promised in C++20 atomics.order p10, which is the critical clause for understanding atomic RMW in the C++ memory model.
Now there are a finite number of threads, each corresponding to one entry in M, which is a total order. Necessarily there is one such entry that precedes all the others. Call it m1. The test_and_set whose store is entry m1 in M must return the preceding value in M. That can only be the value 0 which initialized done. So the thread corresponding to m1 will see test_and_set return 0. Every other thread will see it return 1, because each of their modifications m2, ..., mN follows (in M) another modification, which must have been a test_and_set storing the value 1.
We may not be bothering to observe all of the total order M, but this program does determine which of its entries is first on this particular run. It's the unique one whose test_and_set returns 0. A thread that sees its test_and_set return 1 won't know whether it came 2nd or 8th or 96th in that order, but it does know that it wasn't first, and that's all that matters here.
Another way to think about it: suppose it were possible for two threads (tA, tB) both to load the value 0. Well, each one makes an entry in the modification order; call them mA and mB. M is a total order so one has to go before the other. And bearing in mind the all-important [atomics.order p10], you will quickly find there is no legal way for you to fill out the rest of M.
All of this is promised by the standard without any reference to memory ordering, so it works even with std::memory_order_relaxed. The only effect of relaxed memory ordering is that we can't say much about how our load/store will become visible with respect to operations on other variables. That's irrelevant to the program at hand; it doesn't even have any other variables.
In the actual implementation, this means that an atomic RMW really has to exclusively own the variable for the duration of the operation. We must ensure that no other thread does a store to that variable, nor the load half of a read-modify-write, during that period. In a MESI-like coherent cache, this is done by temporarily locking the cache line in the E state; if the system makes it possible for us to lose that lock (like an LL/SC architecture), abort and start again.
As to your comment about "a thread reading false from its own cache/buffer": the implementation mustn't allow that in an atomic RMW, not even with relaxed ordering. When you do an atomic RMW, you must read it while you hold the lock, and use that value in the RMW operation. You can't use some old value that happens to be in a buffer somewhere. Likewise, you have to complete the write while you still hold the lock; you can't stash it in a buffer and let it complete later.
relaxed is fine if you just need to determine the winner of the race to set the flag1, so one thread can start on the work and later threads can just continue on.
If the run_once work produces data that other threads need to be able to read, you'll need a release store after that, to let potential readers know that the work is finished, not just started. If it was instead just something like printing or writing to a file, and other threads don't care when that finishes, then yeah you have no ordering requirements between threads beyond the modification order of done which exists even with relaxed. An atomic RMW like test_and_set lets you determines which thread's modification was first.
BTW, you should check read-only before even trying to test-and-set; unless run() is only called very infrequently, like once per thread startup. For something like a static int foo = non_constant; local var, compilers use a guard variable that's loaded (with an acquire load) to see if init is already complete. If it's not, branch to code that uses an atomic RMW to modify the guard variable, with one thread winning, the rest effectively waiting on a mutex for that thread to init.
You might want something like that if you have data that all threads should read. Or just use a static int foo = something_to_run_once(), or some type other than int, if you actually have some data to init.
Or perhaps use C++11 std::call_once to solve this problem for you.
On normal systems, atomic_flag has no advantage over and atomic_bool. done.exchange(true) on a bool is equivalent to test_and_set of a flag. But atomic_bool is more flexible in terms of the operations it supports, like plain read that isn't part of an RMW test-and-set.
C++20 does add a test() method for atomic_flag. ISO C++ guarantees that atomic_flag is lock-free, but in practice so is std::atomic<bool> on all real-world systems.
Footnote 1: why relaxed guarantees a single winner
The memory_order parameter only governs ordering wrt. operations on other variables by the same thread.
Does calling test_and_set by a thread force somehow synchronization of the flag with values written by other threads?
It's not a pure write, it's an atomic read-modify-write, so the result of the one that went first is guaranteed to be visible to the one that happens to be second. That's the whole point of test-and-set as a primitive building block for mutual exclusion.
If two TAS operations could both load the original value (false), and then both store true, they would be atomic. They'd have overlapped with each other.
Two atomic RMWs on the same atomic object must happen in some order, the modification-order of that object. (Because they're not read-only: an RMW includes a modification. But also includes a read so you can see what the value was immediately before the new value; that read is tied to the modification order, unlike a plain read).
Every atomic object separately has a modification-order that all threads can agree on; this is guaranteed by ISO C++. (With orders less than seq_cst, ordering between objects can be different from source order, and not guaranteed that all threads even agree which store happened first, the IRIW problem.)
Being an atomic RMW guarantees that exactly one test_and_set will return false in thread A or B. Same for fetch_add with multiple threads incrementing a counter: the increments have to happen in some order (i.e. serialized with each other), and whatever that order is becomes the modification-order of that atomic object.
Atomic RMWs have to work this way to not lose counts. i.e. to actually be atomic.

Synchronizing against relaxed atomics

I have an allocator that does relaxed atomics to track the number of bytes currently allocated. They're just adds and subtracts so I don't need any synchronization between threads other than ensuring the modifications are atomic.
However, I occasionally want to check the number of allocated bytes (e.g. when shutting down the program) and I want to ensure any pending writes are committed. I assume I need a full memory barrier in this case to prevent any previous writes from being moved after the barrier and to prevent the next read from being moved before the barrier.
The question is: what is the proper way to ensure the relaxed atomic writes are committed before reading? Is my current code correct? (Assume functions and types map to std library constructs as expected.)
void* Allocator::Alloc(size_t bytes, size_t alignment)
{
void* p = AlignedAlloc(bytes, alignment);
AtomicFetchAdd(&allocatedBytes, AlignedMsize(p), MemoryOrder::Relaxed);
return p;
}
void Allocator::Free(void* p)
{
AtomicFetchSub(&allocatedBytes, AlignedMsize(p), MemoryOrder::Relaxed);
AlignedFree(p);
}
size_t Allocator::GetAllocatedBytes()
{
AtomicThreadFence(MemoryOrder::AcqRel);
return AtomicLoad(&allocatedBytes, MemoryOrder::Relaxed);
}
And some type definitions for context
enum struct MemoryOrder
{
Relaxed = 0,
Consume = 1,
Acquire = 2,
Release = 3,
AcqRel = 4,
SeqCst = 5,
};
struct Allocator
{
void* Alloc (size_t bytes, size_t alignment);
void Free (void* p);
size_t GetAllocatedBytes();
Atomic<size_t> allocatedBytes = { 0 };
};
I don't want to simply default to sequential consistency as I'm trying to understand memory ordering better.
The part that's really tripping me up is that in the standard under [atomics.fences] all the points talk about an acquire fence/atomic op synchronizing with a release fence/atomic op. It's entirely opaque to me whether an acquire fence/atomic op will synchronize with a relaxed atomic op on another thread. If an AcqRel fence function literally maps to an mfence instruction, it seems that the above code will be fine. However, I'm having a hard time convincing myself the standard guarantees this. Namely,
4 An atomic operation A that is a release operation on an atomic
object M synchronizes with an acquire fence B if there exists some
atomic operation X on M such that X is sequenced before B and reads
the value written by A or a value written by any side effect in the
release sequence headed by A.
This seems to make it clear that the fence will not synchronize with the relaxed atomic writes. On the other hand, a full fence is both a release and an acquire fence, so it should synchronize with itself, right?
2 A release fence A synchronizes with an acquire fence B if there
exist atomic operations X and Y, both operating on some atomic object
M, such that A is sequenced before X, X modifies M, Y is sequenced
before B, and Y reads the value written by X or a value written by any
side effect in the hypothetical release sequence X would head if it
were a release operation.
The scenario described is
Unsequenced writes
A release fence
X atomic write
Y atomic read
B acquire fence
Unsequenced reads (unsequenced writes will be visible here)
However, in my case I don't have the atomic write + atomic read as a signal between the threads and the release fence happens with the acquire fence on thread B. So what's actually happening is
Unsequenced writes
A release fence
B acquire fence
Unsequenced reads
Clearly if the fence executes before an unsequenced write begins it's a race and all bets are off. But it seems to me that if the fence executes after an unsequenced write begins but before it is committed it will be forced to finish before the unsequenced reads. This is exactly that I want, but I can't glean whether this is guaranteed by the standard.
Let's say you spawn Thread A, which calls Allocator::Alloc(), then immediately spawn Thread B, which calls Allocator::GetAllocatedBytes(). Those two Allocator calls are now running concurrently. You don't know which one will actually happen first, because there's no ordering between them. Your only guarantee is that either Thread B will see the value of allocatedBytes before Thread A modifies it, or it will see the value of allocatedBytes after Thread A modifies it. You won't know which value Thread B saw until after GetAllocatedBytes() returns. (At least Thread B won't see a totally garbage value for allocatedBytes, because there's no data race on it thanks to your use of relaxed atomics.)
You seem to be concerned about the case where Thread A got as far as AtomicFetchAdd(), but for some reason, the change is not visible when Thread B calls AtomicLoad(). But so what? That's no different from the outcome where GetAllocatedBytes() runs entirely before AtomicFetchAdd(). And that's a totally valid outcome. Remember, either Thread B sees the modified value, or it doesn't.
Even if you change all the atomic operations/fences to MemoryOrder::SeqCst, it won't make any difference. In the scenario I described, Thread B can still either see the modified value or the unmodified value of allocatedBytes, because the two Allocator calls run concurrently.
As long as you insist on calling GetAllocatedBytes() while other threads are still calling Alloc() and Free(), that's really the most you can expect. If you want to get a more "accurate" value, just don't allow any concurrent calls to Alloc()/Free() while GetAllocatedBytes() is running! For example, if the program is shutting down, just join all the other threads before calling GetAllocatedBytes(). That'll give you an accurate number of allocated bytes at shutdown. The C++ standard even guarantees it, because the completion of a thread synchronizes with the call to join().
This will not work properly, acq_rel memory order is designed specifically for CAS and FAA memory operations that "simulanously" read and write atomic data. In Your case You want to enforce memory synchronization before load. To do this You need to change memory order of You fetchAndAdd and fetchAndSub to acq_rel and Your load to acquire. This may seem much, but on x86 it has very little cost (some compiler optimizations) as it does not generate any new instructions into the code. As of how acquire-release synchronization works I recomend this article: http://preshing.com/20120913/acquire-and-release-semantics/
I removed info about sequential ordering as it should be used for all the operations to work properly and would be an overkill.
From my understanding of C++ atomics relaxed memory order makes sense when used in combination with other atomic operations using memory fences. For example in some situations atomic a may be stored in relaxed manner, as atomic b is written with release memory order and so on.
If your question is what is the proper way to ensure the relaxed atomic writes are committed before reading this same atomic object? Nothing, this is ensured by the language, [intro.multithread]:
All modifications to a particular atomic object M occur in some particular total order, called the modification order of M.
All threads see the same modification order. For exemple imagine that 2 allocation happens in 2 different threads and then you read the counter in a third thread.
In the first thread, the atomic is incremented by 1 bytes, and the relaxed read/modify (AtomicFetchAdd) expression return 0: the counter made this transition: 0->1.
In the second thread, the atomic is incremented by 2 bytes, and the relaxed read/modify expression return 1: the counter make this transition: 1->3. There is no way the read/modify expression return 0. This thread cannot see a transition 0->2 because the other thread has performed the transition 0->1.
Then in a third thread you perform a relaxed load. The only possible values that may be loaded are 0,1 or 3. It is not possible to load 2. The modification order of the atomic is 0 -> 1 -> 3. And the observer thread will also see this modification order.

Acquire/release semantics with 4 threads

I am currently reading C++ Concurrency in Action by Anthony Williams. One of his listing shows this code, and he states that the assertion that z != 0 can fire.
#include <atomic>
#include <thread>
#include <assert.h>
std::atomic<bool> x,y;
std::atomic<int> z;
void write_x()
{
x.store(true,std::memory_order_release);
}
void write_y()
{
y.store(true,std::memory_order_release);
}
void read_x_then_y()
{
while(!x.load(std::memory_order_acquire));
if(y.load(std::memory_order_acquire))
++z;
}
void read_y_then_x()
{
while(!y.load(std::memory_order_acquire));
if(x.load(std::memory_order_acquire))
++z;
}
int main()
{
x=false;
y=false;
z=0;
std::thread a(write_x);
std::thread b(write_y);
std::thread c(read_x_then_y);
std::thread d(read_y_then_x);
a.join();
b.join();
c.join();
d.join();
assert(z.load()!=0);
}
So the different execution paths, that I can think of is this:
1)
Thread a (x is now true)
Thread c (fails to increment z)
Thread b (y is now true)
Thread d (increments z) assertion cannot fire
2)
Thread b (y is now true)
Thread d (fails to increment z)
Thread a (x is now true)
Thread c (increments z) assertion cannot fire
3)
Thread a (x is true)
Thread b (y is true)
Thread c (z is incremented) assertion cannot fire
Thread d (z is incremented)
Could someone explain to me how this assertion can fire?
He shows this little graphic:
Shouldn't the store to y also sync with the load in read_x_then_y, and the store to x sync with the load in read_y_then_x? I'm very confused.
EDIT:
Thank you for your responses, I understand how atomics work and how to use Acquire/Release. I just don't get this specific example. I was trying to figure out IF the assertion fires, then what did each thread do? And why does the assertion never fire if we use sequential consistency.
The way, I am reasoning about this is that if thread a (write_x) stores to x then all the work it has done so far is synced with any other thread that reads x with acquire ordering. Once read_x_then_y sees this, it breaks out of the loop and reads y. Now, 2 things could happen. In one option, the write_y has written to y, meaning this release will sync with the if statement (load) meaning z is incremented and assertion cannot fire. The other option is if write_y hasn't run yet, meaning the if condition fails and z isn't incremented, In this scenario, only x is true and y is still false. Once write_y runs, the read_y_then_x breaks out of its loop, however both x and y are true and z is incremented and the assertion does not fire. I can't think of any 'run' or memory ordering where z is never incremented. Can someone explain where my reasoning is flawed?
Also, I know The loop read will always be before the if statement read because the acquire prevents this reordering.
You are thinking in terms of sequential consistency, the strongest (and default) memory order. If this memory order is used, all accesses to atomic variables constitute a total order, and the assertion indeed cannot be triggered.
However, in this program, a weaker memory order is used (release stores and acquire loads). This means, by definition that you cannot assume a total order of operations. In particular, you cannot assume that changes become visible to other threads in the same order. (Only a total order on each individual variable is guaranteed for any atomic memory order, including memory_order_relaxed.)
The stores to x and y occur on different threads, with no synchronization between them. The loads of x and y occur on different threads, with no synchronization between them. This means it is entirely allowed that thread c sees x && ! y and thread d sees y && ! x. (I'm just abbreviating the acquire-loads here, don't take this syntax to mean sequentially consistent loads.)
Bottom line: Once you use a weaker memory order than sequentially consistent, you can kiss your notion of a global state of all atomics, that is consistent between all threads, goodbye. Which is exactly why so many people recommend sticking with sequential consistency unless you need the performance (BTW, remember to measure if it's even faster!) and are certain of what you are doing. Also, get a second opinion.
Now, whether you will get burned by this, is a different question. The standard simply allows a scenario where the assertion fails, based on the abstract machine that is used to describe the standard requirements. However, your compiler and/or CPU may not exploit this allowance for one reason or another. So it is possible that for a given compiler and CPU, you may never see that the assertion is triggered, in practice. Keep in mind that a compiler or CPU may always use a stricter memory order than the one you asked for, because this can never introduce violations of the minimum requirements from the standard. It may only cost you some performance – but that is not covered by the standard anyway.
UPDATE in response to comment: The standard defines no hard upper limit on how long it takes for one thread to see changes to an atomic by another thread. There is a recommendation to implementers that values should become visible eventually.
There are sequencing guarantees, but the ones pertinent to your example do not prevent the assertion from firing. The basic acquire-release guarantee is that if:
Thread e performs a release-store to an atomic variable x
Thread f performs an acquire-load from the same atomic variable
Then if the value read by f is the one that was stored by e, the store in e synchronizes-with the load in f. This means that any (atomic and non-atomic) store in e that was, in this thread, sequenced before the given store to x, is visible to any operation in f that is, in this thread, sequenced after the given load. [Note that there are no guarantees given regarding threads other than these two!]
So, there is no guarantee that f will read the value stored by e, as opposed to e.g. some older value of x. If it doesn't read the updated value, then also the load does not synchronize with the store, and there are no sequencing guarantees for any of the dependent operations mentioned above.
I liken atomics with lesser memory order than sequentially consistent to the Theory of Relativity, where there is no global notion of simultaneousness.
PS: That said, an atomic load cannot just read an arbitrary older value. For example, if one thread performs periodic increments (e.g. with release order) of an atomic<unsigned> variable, initialized to 0, and another thread periodically loads from this variable (e.g. with acquire order), then, except for eventual wrapping, the values seen by the latter thread must be monotonically increasing. But this follows from the given sequencing rules: Once the latter thread reads a 5, anything that happened before the increment from 4 to 5 is in the relative past of anything that follows the read of 5. In fact, a decrease other than wrapping is not even allowed for memory_order_relaxed, but this memory order does not make any promises to the relative sequencing (if any) of accesses to other variables.
The release-acquire synchronization has (at least) this guarantee: side-effects before a release on a memory location are visible after an acquire on this memory location.
There is no such guarantee if the memory location is not the same. More importantly, there's no total (think global) ordering guarantee.
Looking at the example, thread A makes thread C come out of its loop, and thread B makes thread D come out of its loop.
However, the way a release may "publish" to an acquire (or the way an acquire may "observe" a release) on the same memory location doesn't require total ordering. It's possible for thread C to observe A's release and thread D to observe B's release, and only somewhere in the future for C to observe B's release and for D to observe A's release.
The example has 4 threads because that's the minimum example you can force such non-intuitive behavior. If any of the atomic operations were done in the same thread, there would be an ordering you couldn't violate.
For instance, if write_x and write_y happened on the same thread, it would require that whatever thread observed a change in y would have to observe a change in x.
Similarly, if read_x_then_y and read_y_then_x happened on the same thread, you would observe both changed in x and y at least in read_y_then_x.
Having write_x and read_x_then_y in the same thread would be pointless for the exercise, as it would become obvious it's not synchronizing correctly, as would be having write_x and read_y_then_x, which would always read the latest x.
EDIT:
The way, I am reasoning about this is that if thread a (write_x) stores to x then all the work it has done so far is synced with any other thread that reads x with acquire ordering.
(...) I can't think of any 'run' or memory ordering where z is never incremented. Can someone explain where my reasoning is flawed?
Also, I know The loop read will always be before the if statement read because the acquire prevents this reordering.
That's sequentially consistent order, which imposes a total order. That is, it imposes that write_x and write_y both be visible to all threads one after the other; either x then y or y then x, but the same order for all threads.
With release-acquire, there is no total order. The effects of a release are only guaranteed to be visible to a corresponding acquire on the same memory location. With release-acquire, the effects of write_x are guaranteed to be visible to whoever notices x has changed.
This noticing something changed is very important. If you don't notice a change, you're not synchronizing. As such, thread C is not synchronizing on y and thread D is not synchronizing on x.
Essentially, it's way easier to think of release-acquire as a change notification system that only works if you synchronize properly. If you don't synchronize, you may or may not observe side-effects.
Strong memory model hardware architectures with cache coherence even in NUMA, or languages/frameworks that synchronize in terms of total order, make it difficult to think in these terms, because it's practically impossible to observe this effect.
Let's walk through the parallel code:
void write_x()
{
x.store(true,std::memory_order_release);
}
void write_y()
{
y.store(true,std::memory_order_release);
}
There is nothing before these instructions (they are at start of parallelism, everything that happened before also happened before other threads) so they are not meaningfully releasing: they are effectively relaxed operations.
Let's walk through the parallel code again, nothing that these two previous operations are not effective releases:
void read_x_then_y()
{
while(!x.load(std::memory_order_acquire)); // acquire what state?
if(y.load(std::memory_order_acquire))
++z;
}
void read_y_then_x()
{
while(!y.load(std::memory_order_acquire));
if(x.load(std::memory_order_acquire))
++z;
}
Note that all the loads refer to variables for which nothing is effectively released ever, so nothing is effectively acquired here: we re-acquire the visibility over the previous operations in main that are visible already.
So you see that all operations are effectively relaxed: they provide no visibility (over what was already visible). It's like doing an acquire fence just after an acquire fence, it's redundant. Nothing new is implied that wasn't already implied.
So now that everything is relaxed, all bets are off.
Another way to view that is to notice that an atomic load is not a RMW operations that leaves the value unchanged, as a RMW can be release and a load cannot.
Just like all atomic stores are part of the modification order of an atomic variable even if the variable is an effective a constant (that is a non const variable whose value is always the same), an atomic RMW operation is somewhere in the modification order of an atomic variable, even if there was no change of value (and there cannot be a change of value because the code always compares and copies the exact same bit pattern).
In the modification order you can have release semantic (even if there was no modification).
If you protect a variable with a mutex you get release semantic (even if you just read the variable).
If you make all your loads (at least in functions that do more than once operation) release-modification-loads with:
either a mutex protecting the atomic object (then drop the atomic as it's now redundant!)
or a RMW with acq_rel order,
the previous proof that all operations are effectively relaxed doesn't work anymore and some atomic operation in at least one of the read_A_then_B functions will have to be ordered before some operation in the other, as they operate on the same objects. If they are in the modification order of a variable and you use acq_rel, then you have an happen before relation between one of these (obviously which one happens before which one is non deterministic).
Either way execution is now sequential, as all operations are effectively acquire and release, that is as operative acquire and release (even those that are effectively relaxed!).
If we change two if statements to while statements, it will make the code correct and z will be guaranteed to be equal to 2.
void read_x_then_y()
{
while(!x.load(std::memory_order_acquire));
while(!y.load(std::memory_order_acquire));
++z;
}
void read_y_then_x()
{
while(!y.load(std::memory_order_acquire));
while(!x.load(std::memory_order_acquire));
++z;
}

why there is no data-race in the following case?

A data race occurs when two threads access the same variable concurrently and at least one of the accesses is a write.
https://isocpp.org/wiki/faq/cpp11-language-concurrency
// start with x==0 and y==0
if (x) y = 1; // Thread 1
if (y) x = 1; // Thread 2
Is there a problem here? More precisely, is there a data race? (No
there isn’t).
Why does the original article claim that there is no data race here?
Neither thread will be writing since neither variable is non-zero before the conditionals.
Data races are not static properties of your code. They are properties of the actual state of the program at execution time. So while that program could be in a state where the code would produce a data race, that's not the question.
The question is, given the state of the system, will the code cause a data race? And since the program is in a state such that neither thread will write to either variable, then the code will not cause a data race.
Data races aren't about what your code might do. It's about what they will do. Just as a function that takes a pointer isn't undefined behavior just because it uses the pointer without checking for NULL. It is only UB if someone passes a pointer that really is NULL.
Because x and y are both zero, the abstract machine defined by the C++ standard can't write to either memory location, so the only way this could be a problem is if the implementation decided to write to the memory location anyway. For example, if it transformed
if (x) y = 1;
into
y = 1;
if (!x) y = 0;
This is a potentially valid rewrite under the as-if rule since the observed behavior by any one thread is the same (C++14 1.9 [intro.execution])
The semantic descriptions in this International Standard define a parameterized nondeterministic abstract
machine. This International Standard places no requirement on the structure of conforming implementations.
In particular, they need not copy or emulate the structure of the abstract machine. Rather, conforming
implementations are required to emulate (only) the observable behavior of the abstract machine as explained
below.
This would in fact have been a valid rewrite prior to C++11, but since C++11, threads of execution are considered. Because of this, the implementation is not allowed to make changes that would have different observed behavior across threads as long as no data race occurs in the abstract machine.
There's a special note in the C++14 standard that applies here (C++14 1.10 [into.multithread] paragraph 22)
[ Note: Compiler transformations that introduce assignments to a potentially shared memory location that
would not be modified by the abstract machine are generally precluded by this standard, since such an
assignment might overwrite another assignment by a different thread in cases in which an abstract machine
execution would not have encountered a data race.
...
Because of this, the rewrite isn't valid. The implementation has to preserve the observed behavior that x and y are not modified, even across threads. Therefore, there is no data race.
I found this article written by Hans-J. Boehm illuminating:
http://www.hpl.hp.com/techreports/2009/HPL-2009-259html.html#races
We say that two ordinary memory operations conflict if they access the
same memory location (for example, variable or array element), and at
least one of them writes to the location.
We say that a program allows a data race on a particular set of inputs
if there is a sequentially consistent execution, that is an
interleaving of operations of the individual threads, in which two
conflicting operations can be executed "simultaneously". For our
purposes, two such operations can be executed "simultaneously", if
they occur next to each other in the interleaving, and correspond to
different threads.
And the article goes on to our point:
Our definition of data race is rather stringent: There must be an
actual way to execute the original, untransformed program such that
conflicting operations occur in parallel. This imposes a burden on
compilers not to "break" programs by introducing harmful data races.
As stated in the article, which report the same example (and others):
There's no sequentially consistent execution of this program in which Thread 1 assigns to y, since x and y never become nonzero. Indeed you never satisfy the condition, so no one is writing to the variable that the other thread might be reading.
To understand the difference with the case where a data race exists, try to think about the following example in the article:
y = ((x != 0)? 1 : y) # Thread 1
y = 2; # Thread 2
In this last case it's clear that can happen that y is assigned (write) to by Thread 1 while Thread 2 executes y = 2; (y is written to by Thread 1 no matter what). A data race can happen.
If x is not set, setting y to 1 doesn't happen and vice versa. So, things here are indeed happening sequentially.

How can memory_order_relaxed work for incrementing atomic reference counts in smart pointers?

Consider the following code snippet taken from Herb Sutter's talk on atomics:
The smart_ptr class contains a pimpl object called control_block_ptr containing the reference count refs.
// Thread A:
// smart_ptr copy ctor
smart_ptr(const smart_ptr& other) {
...
control_block_ptr = other->control_block_ptr;
control_block_ptr->refs.fetch_add(1, memory_order_relaxed);
...
}
// Thread D:
// smart_ptr destructor
~smart_ptr() {
if (control_block_ptr->refs.fetch_sub(1, memory_order_acq_rel) == 1) {
delete control_block_ptr;
}
}
Herb Sutter says the increment of refs in Thread A can use memory_order_relaxed because "nobody does anything based on the action". Now as I understand memory_order_relaxed, if refs equals N at some point and two threads A and B execute the following code:
control_block_ptr->refs.fetch_add(1, memory_order_relaxed);
then it may happen that both threads see the value of refs to be N and both write N+1 back to it. That will clearly not work and memory_order_acq_rel should be used just as with the destructor. Where am I going wrong?
EDIT1: Consider the following code.
atomic_int refs = N; // at time t0.
// [Thread 1]
refs.fetch_add(1, memory_order_relaxed); // at time t1.
// [Thread 2]
n = refs.load(memory_order_relaxed); // starting at time t2 > t1
refs.fetch_add(1, memory_order_relaxed);
n = refs.load(memory_order_relaxed);
What is the value of refs observed by Thread 2 before the call to fetch_add? Could it be either N or N+1? What is the value of refs observed by Thread 2 after the call to fetch_add? Must it be at least N+2?
[Talk URL: C++ & Beyond 2012 - http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-2-of-2 (# 1:20:00)]
Boost.Atomic library that emulates std::atomic provides similar reference counting example and explanation, and it may help your understanding.
Increasing the reference counter can always be done with memory_order_relaxed: New references to an object can only be formed from an existing reference, and passing an existing reference from one thread to another must already provide any required synchronization.
It is important to enforce any possible access to the object in one thread (through an existing reference) to happen before deleting the object in a different thread. This is achieved by a "release" operation after dropping a reference (any access to the object through this reference must obviously happened before), and an "acquire" operation before deleting the object.
It would be possible to use memory_order_acq_rel for the fetch_sub operation, but this results in unneeded "acquire" operations when the reference counter does not yet reach zero and may impose a performance penalty.
From C++ reference on std::memory_order:
memory_order_relaxed: Relaxed operation: there are no synchronization
or ordering constraints imposed on other reads or writes, only this
operation's atomicity is guaranteed
There is also an example below on that page.
So basically, std::atomic::fetch_add() is still atomic, even when with std::memory_order_relaxed, therefore concurrent refs.fetch_add(1, std::memory_order_relaxed) from 2 different threads will always increment refs by 2. The point of the memory order is how other non-atomic or std::memory_order_relaxed atomic operations can be reordered around the current atomic operation with memory order specified.
As this is rather confusing (at least to me) I'm going to partially address one point:
(...) then it may happen that both threads see the value of refs to be N and both write N+1 back to it (...)
According to #AnthonyWilliams in this answer, the above sentence seems to be wrong as:
The only way to guarantee you have the "latest" value is to use a read-modify-write operation such as exchange(), compare_exchange_strong() or fetch_add(). Read-modify-write operations have an additional constraint that they always operate on the "latest" value, so a sequence of ai.fetch_add(1) operations by a series of threads will return a sequence of values with no duplicates or gaps. In the absence of additional constraints, there's still no guarantee which threads will see which values though.
So, given the authority argument, I'd say it's impossible that both threads see the value going from N to N+1.