Synchronizing against relaxed atomics - c++

I have an allocator that does relaxed atomics to track the number of bytes currently allocated. They're just adds and subtracts so I don't need any synchronization between threads other than ensuring the modifications are atomic.
However, I occasionally want to check the number of allocated bytes (e.g. when shutting down the program) and I want to ensure any pending writes are committed. I assume I need a full memory barrier in this case to prevent any previous writes from being moved after the barrier and to prevent the next read from being moved before the barrier.
The question is: what is the proper way to ensure the relaxed atomic writes are committed before reading? Is my current code correct? (Assume functions and types map to std library constructs as expected.)
void* Allocator::Alloc(size_t bytes, size_t alignment)
{
    void* p = AlignedAlloc(bytes, alignment);
    AtomicFetchAdd(&allocatedBytes, AlignedMsize(p), MemoryOrder::Relaxed);
    return p;
}
void Allocator::Free(void* p)
{
    AtomicFetchSub(&allocatedBytes, AlignedMsize(p), MemoryOrder::Relaxed);
    AlignedFree(p);
}
size_t Allocator::GetAllocatedBytes()
{
    AtomicThreadFence(MemoryOrder::AcqRel);
    return AtomicLoad(&allocatedBytes, MemoryOrder::Relaxed);
}
And some type definitions for context
enum struct MemoryOrder
{
    Relaxed = 0,
    Consume = 1,
    Acquire = 2,
    Release = 3,
    AcqRel = 4,
    SeqCst = 5,
};
struct Allocator
{
    void* Alloc(size_t bytes, size_t alignment);
    void Free(void* p);
    size_t GetAllocatedBytes();
    Atomic<size_t> allocatedBytes = { 0 };
};
I don't want to simply default to sequential consistency as I'm trying to understand memory ordering better.
The part that's really tripping me up is that in the standard under [atomics.fences] all the points talk about an acquire fence/atomic op synchronizing with a release fence/atomic op. It's entirely opaque to me whether an acquire fence/atomic op will synchronize with a relaxed atomic op on another thread. If an AcqRel fence function literally maps to an mfence instruction, it seems that the above code will be fine. However, I'm having a hard time convincing myself the standard guarantees this. Namely,
4 An atomic operation A that is a release operation on an atomic
object M synchronizes with an acquire fence B if there exists some
atomic operation X on M such that X is sequenced before B and reads
the value written by A or a value written by any side effect in the
release sequence headed by A.
This seems to make it clear that the fence will not synchronize with the relaxed atomic writes. On the other hand, a full fence is both a release and an acquire fence, so it should synchronize with itself, right?
2 A release fence A synchronizes with an acquire fence B if there
exist atomic operations X and Y, both operating on some atomic object
M, such that A is sequenced before X, X modifies M, Y is sequenced
before B, and Y reads the value written by X or a value written by any
side effect in the hypothetical release sequence X would head if it
were a release operation.
The scenario described is
Unsequenced writes
A release fence
X atomic write
Y atomic read
B acquire fence
Unsequenced reads (unsequenced writes will be visible here)
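For illustration, the fence-to-fence scenario in the list above can be sketched as follows. This is a minimal sketch, not code from the allocator: the names fence_handoff, payload, ready and seen are hypothetical, and the labels A, X, Y, B in the comments refer to the paragraph quoted from [atomics.fences].

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: hands a plain int from one thread to another using
// only relaxed atomic accesses plus a release fence and an acquire fence.
int fence_handoff()
{
    int payload = 0;              // plain, non-atomic data
    std::atomic<int> ready{0};    // the atomic object M
    int seen = -1;

    std::thread writer([&] {
        payload = 42;                                         // ordinary write
        std::atomic_thread_fence(std::memory_order_release);  // A: release fence
        ready.store(1, std::memory_order_relaxed);            // X: relaxed atomic write
    });
    std::thread reader([&] {
        while (ready.load(std::memory_order_relaxed) == 0)    // Y: relaxed read that sees X
            std::this_thread::yield();
        std::atomic_thread_fence(std::memory_order_acquire);  // B: acquire fence
        seen = payload;                                       // ordered after the write of 42
    });
    writer.join();
    reader.join();
    return seen;
}
```

The synchronization only exists because Y actually reads the value written by X; the two fences on their own, without that atomic write/read pair, establish nothing.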
However, in my case I don't have the atomic write + atomic read as a signal between the threads and the release fence happens with the acquire fence on thread B. So what's actually happening is
Unsequenced writes
A release fence
B acquire fence
Unsequenced reads
Clearly, if the fence executes before an unsequenced write begins, it's a race and all bets are off. But it seems to me that if the fence executes after an unsequenced write begins but before it is committed, the write will be forced to finish before the unsequenced reads. This is exactly what I want, but I can't glean whether this is guaranteed by the standard.

Let's say you spawn Thread A, which calls Allocator::Alloc(), then immediately spawn Thread B, which calls Allocator::GetAllocatedBytes(). Those two Allocator calls are now running concurrently. You don't know which one will actually happen first, because there's no ordering between them. Your only guarantee is that either Thread B will see the value of allocatedBytes before Thread A modifies it, or it will see the value of allocatedBytes after Thread A modifies it. You won't know which value Thread B saw until after GetAllocatedBytes() returns. (At least Thread B won't see a totally garbage value for allocatedBytes, because there's no data race on it thanks to your use of relaxed atomics.)
You seem to be concerned about the case where Thread A got as far as AtomicFetchAdd(), but for some reason, the change is not visible when Thread B calls AtomicLoad(). But so what? That's no different from the outcome where GetAllocatedBytes() runs entirely before AtomicFetchAdd(). And that's a totally valid outcome. Remember, either Thread B sees the modified value, or it doesn't.
Even if you change all the atomic operations/fences to MemoryOrder::SeqCst, it won't make any difference. In the scenario I described, Thread B can still either see the modified value or the unmodified value of allocatedBytes, because the two Allocator calls run concurrently.
As long as you insist on calling GetAllocatedBytes() while other threads are still calling Alloc() and Free(), that's really the most you can expect. If you want to get a more "accurate" value, just don't allow any concurrent calls to Alloc()/Free() while GetAllocatedBytes() is running! For example, if the program is shutting down, just join all the other threads before calling GetAllocatedBytes(). That'll give you an accurate number of allocated bytes at shutdown. The C++ standard even guarantees it, because the completion of a thread synchronizes with the call to join().
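A minimal sketch of the "join first, then read" approach described above. The function count_after_join and its parameters are hypothetical stand-ins for the real allocator; each worker "allocates" a fixed number of bytes and "frees" every other allocation, using only relaxed operations.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the allocator's byte counter.
std::size_t count_after_join(int threads, int allocs_per_thread, std::size_t bytes)
{
    std::atomic<std::size_t> allocated{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < allocs_per_thread; ++i) {
                allocated.fetch_add(bytes, std::memory_order_relaxed);      // "Alloc"
                if (i % 2 == 0)
                    allocated.fetch_sub(bytes, std::memory_order_relaxed);  // "Free" every other one
            }
        });
    }
    // Thread completion synchronizes with join(), so after this loop every
    // modification made by the workers happens-before the load below.
    for (auto& th : pool)
        th.join();
    return allocated.load(std::memory_order_relaxed);
}
```

With 4 threads, 1000 allocations each and 16 bytes per allocation, half of the allocations remain live, so the exact result after the joins is 4 * 500 * 16 = 32000. No fence is needed in the reader.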

This will not work properly. The acq_rel memory order is designed specifically for CAS and FAA operations that "simultaneously" read and write atomic data. In your case you want to enforce memory synchronization before the load. To do this you need to change the memory order of your fetchAndAdd and fetchAndSub to acq_rel and your load to acquire. This may seem like a lot, but on x86 it has very little cost (some compiler optimizations are inhibited), as it does not add any new instructions to the code. For how acquire-release synchronization works, I recommend this article: http://preshing.com/20120913/acquire-and-release-semantics/
I removed info about sequential ordering as it should be used for all the operations to work properly and would be an overkill.
From my understanding of C++ atomics relaxed memory order makes sense when used in combination with other atomic operations using memory fences. For example in some situations atomic a may be stored in relaxed manner, as atomic b is written with release memory order and so on.

If your question is "what is the proper way to ensure the relaxed atomic writes are committed before reading this same atomic object?", the answer is: nothing is needed. This is ensured by the language, [intro.multithread]:
All modifications to a particular atomic object M occur in some particular total order, called the modification order of M.
All threads see the same modification order. For example, imagine that two allocations happen in two different threads and then you read the counter in a third thread.
In the first thread, the atomic is incremented by 1 byte, and the relaxed read-modify-write expression (AtomicFetchAdd) returns 0: the counter made the transition 0->1.
In the second thread, the atomic is incremented by 2 bytes, and the relaxed read-modify-write expression returns 1: the counter made the transition 1->3. There is no way this expression can return 0: this thread cannot see a transition 0->2, because the other thread has already performed the transition 0->1.
Then in a third thread you perform a relaxed load. The only values that may be loaded are 0, 1 or 3. It is not possible to load 2. The modification order of the atomic is 0 -> 1 -> 3, and the observer thread will also see this modification order.
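A small sketch of the same idea, assuming every thread adds 1 instead of the 1 and 2 bytes used above. The function name observed_previous_values is hypothetical. Each relaxed fetch_add returns the value that immediately precedes its own write in the modification order, so no two threads can receive the same value.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: returns, in sorted order, the values that each
// thread's relaxed fetch_add reported as the previous counter value.
std::vector<std::size_t> observed_previous_values(int threads)
{
    std::atomic<std::size_t> counter{0};
    std::vector<std::size_t> previous(threads);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            // Each thread writes only its own slot, so there is no data race.
            previous[t] = counter.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& th : pool)
        th.join();
    std::sort(previous.begin(), previous.end());
    return previous;
}
```

For N threads the sorted result is always 0, 1, ..., N-1, whichever thread ran first.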

Related

C++: std::memory_order in std::atomic_flag::test_and_set to do some work only once by a set of threads

Could you please help me understand which std::memory_order should be used in std::atomic_flag::test_and_set to do some work only once by a set of threads, and why? The work should be done by whichever thread gets to it first, and all other threads should just check as quickly as possible that someone is already doing the work and continue working on other tasks.
In my tests of the example below, any memory order works, but I think that it is just a coincidence. I suspect that Release-Acquire ordering is what I need, but, in my case, only one memory_order can be used in both threads (it is not the case that one thread can use memory_order_release and the other can use memory_order_acquire since I do not know which thread will arrive to doing the work first).
#include <atomic>
#include <iostream>
#include <thread>
std::atomic_flag done = ATOMIC_FLAG_INIT;
const std::memory_order order = std::memory_order_seq_cst;
//const std::memory_order order = std::memory_order_acquire;
//const std::memory_order order = std::memory_order_relaxed;
void do_some_work_that_needs_to_be_done_only_once(void)
{ std::cout<<"Hello, my friend\n"; }
void run(void)
{
    if(not done.test_and_set(order))
        do_some_work_that_needs_to_be_done_only_once();
}
int main(void)
{
    std::thread a(run);
    std::thread b(run);
    a.join();
    b.join();
    // expected result:
    // * only one thread said hello
    // * all threads spent as little time as possible to check if any
    //   other thread said hello yet
    return 0;
}
Thank you very much for your help!
Following up on some things in the comments:
As has been discussed, there is a well-defined modification order M for done on any given run of the program. Every thread does one store to done, which means one entry in M. And by the nature of atomic read-modify-writes, the value returned by each thread's test_and_set is the value that immediately precedes its own store in the order M. That's promised in C++20 atomics.order p10, which is the critical clause for understanding atomic RMW in the C++ memory model.
Now there are a finite number of threads, each corresponding to one entry in M, which is a total order. Necessarily there is one such entry that precedes all the others. Call it m1. The test_and_set whose store is entry m1 in M must return the preceding value in M. That can only be the value 0 which initialized done. So the thread corresponding to m1 will see test_and_set return 0. Every other thread will see it return 1, because each of their modifications m2, ..., mN follows (in M) another modification, which must have been a test_and_set storing the value 1.
We may not be bothering to observe all of the total order M, but this program does determine which of its entries is first on this particular run. It's the unique one whose test_and_set returns 0. A thread that sees its test_and_set return 1 won't know whether it came 2nd or 8th or 96th in that order, but it does know that it wasn't first, and that's all that matters here.
Another way to think about it: suppose it were possible for two threads (tA, tB) both to load the value 0. Well, each one makes an entry in the modification order; call them mA and mB. M is a total order so one has to go before the other. And bearing in mind the all-important [atomics.order p10], you will quickly find there is no legal way for you to fill out the rest of M.
All of this is promised by the standard without any reference to memory ordering, so it works even with std::memory_order_relaxed. The only effect of relaxed memory ordering is that we can't say much about how our load/store will become visible with respect to operations on other variables. That's irrelevant to the program at hand; it doesn't even have any other variables.
In the actual implementation, this means that an atomic RMW really has to exclusively own the variable for the duration of the operation. We must ensure that no other thread does a store to that variable, nor the load half of a read-modify-write, during that period. In a MESI-like coherent cache, this is done by temporarily locking the cache line in the E state; if the system makes it possible for us to lose that lock (like an LL/SC architecture), abort and start again.
As to your comment about "a thread reading false from its own cache/buffer": the implementation mustn't allow that in an atomic RMW, not even with relaxed ordering. When you do an atomic RMW, you must read it while you hold the lock, and use that value in the RMW operation. You can't use some old value that happens to be in a buffer somewhere. Likewise, you have to complete the write while you still hold the lock; you can't stash it in a buffer and let it complete later.
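To make the argument concrete, here is a minimal sketch with more than two threads. The function count_winners and the winners counter are hypothetical additions used only to observe the result; the flag itself is used exactly as in the question, but with relaxed ordering.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: counts how many threads saw test_and_set return false.
int count_winners(int threads)
{
    std::atomic_flag done = ATOMIC_FLAG_INIT;
    std::atomic<int> winners{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            // Only the thread whose store comes first in the modification
            // order of done can read the initial (clear) value.
            if (!done.test_and_set(std::memory_order_relaxed))
                winners.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& th : pool)
        th.join();
    return winners.load(std::memory_order_relaxed);
}
```

The count is 1 on every run, regardless of the number of threads or how they are scheduled.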
relaxed is fine if you just need to determine the winner of the race to set the flag (see footnote 1), so one thread can start on the work and later threads can just continue on.
If the run_once work produces data that other threads need to be able to read, you'll need a release store after that, to let potential readers know that the work is finished, not just started. If it was instead just something like printing or writing to a file, and other threads don't care when that finishes, then yeah, you have no ordering requirements between threads beyond the modification order of done, which exists even with relaxed. An atomic RMW like test_and_set lets you determine which thread's modification was first.
BTW, you should check read-only before even trying to test-and-set; unless run() is only called very infrequently, like once per thread startup. For something like a static int foo = non_constant; local var, compilers use a guard variable that's loaded (with an acquire load) to see if init is already complete. If it's not, branch to code that uses an atomic RMW to modify the guard variable, with one thread winning, the rest effectively waiting on a mutex for that thread to init.
You might want something like that if you have data that all threads should read. Or just use a static int foo = something_to_run_once(), or some type other than int, if you actually have some data to init.
Or perhaps use C++11 std::call_once to solve this problem for you.
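A minimal sketch of the std::call_once alternative. The function run_with_call_once and the runs counter are hypothetical; the counter is a plain int because it is only ever touched inside the once-only callable and read after all threads have been joined.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: returns how many times the once-only work ran.
int run_with_call_once(int threads)
{
    std::once_flag once;
    int runs = 0;  // modified only inside the call_once callable
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            std::call_once(once, [&] { ++runs; });
        });
    }
    for (auto& th : pool)
        th.join();
    return runs;
}
```

Unlike the bare test_and_set version, std::call_once makes the other callers wait until the work has finished, so it also covers the case where the work initializes data that every thread needs to read afterwards.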
On normal systems, atomic_flag has no advantage over an atomic_bool. done.exchange(true) on a bool is equivalent to test_and_set of a flag. But atomic_bool is more flexible in terms of the operations it supports, like a plain read that isn't part of an RMW test-and-set.
C++20 does add a test() method for atomic_flag. ISO C++ guarantees that atomic_flag is lock-free, but in practice so is std::atomic<bool> on all real-world systems.
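A sketch of the "check read-only first" pattern with std::atomic<bool>, assuming the once-only work publishes no data that other threads need (otherwise the ordering would have to be stronger than relaxed, as discussed above). The names count_winners_with_precheck and winners are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: each thread attempts the once-only work many times.
int count_winners_with_precheck(int threads, int calls_per_thread)
{
    std::atomic<bool> done{false};
    std::atomic<int> winners{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < calls_per_thread; ++i) {
                // Cheap path: a plain load does not need exclusive
                // ownership of the cache line.
                if (done.load(std::memory_order_relaxed))
                    continue;
                // Slow path: the RMW decides the winner.
                if (!done.exchange(true, std::memory_order_relaxed))
                    winners.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : pool)
        th.join();
    return winners.load(std::memory_order_relaxed);
}
```

The load is only an optimization; correctness still comes from the exchange, which can return false for at most one caller.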
Footnote 1: why relaxed guarantees a single winner
The memory_order parameter only governs ordering wrt. operations on other variables by the same thread.
Does calling test_and_set by a thread force somehow synchronization of the flag with values written by other threads?
It's not a pure write, it's an atomic read-modify-write, so the result of the one that went first is guaranteed to be visible to the one that happens to be second. That's the whole point of test-and-set as a primitive building block for mutual exclusion.
If two TAS operations could both load the original value (false), and then both store true, they would not be atomic. They'd have overlapped with each other.
Two atomic RMWs on the same atomic object must happen in some order, the modification-order of that object. (Because they're not read-only: an RMW includes a modification. But also includes a read so you can see what the value was immediately before the new value; that read is tied to the modification order, unlike a plain read).
Every atomic object separately has a modification-order that all threads can agree on; this is guaranteed by ISO C++. (With orders less than seq_cst, ordering between objects can be different from source order, and not guaranteed that all threads even agree which store happened first, the IRIW problem.)
Being an atomic RMW guarantees that exactly one test_and_set will return false in thread A or B. Same for fetch_add with multiple threads incrementing a counter: the increments have to happen in some order (i.e. serialized with each other), and whatever that order is becomes the modification-order of that atomic object.
Atomic RMWs have to work this way to not lose counts. i.e. to actually be atomic.

What is guaranteed with C++ std::atomic at the programmer level?

I have listened to and read several articles, talks and Stack Overflow questions about std::atomic, and I would like to be sure that I have understood it well, because I am still a bit confused about the visibility of cache line writes given possible delays in MESI (or derived) cache coherency protocols, store buffers, invalidate queues, and so on.
I read x86 has a stronger memory model, and that if a cache invalidation is delayed x86 can revert started operations. But I am now interested only on what I should assume as a C++ programmer, independently of the platform.
[T1: thread1 T2: thread2 V1: shared atomic variable]
I understand that std::atomic guarantees that,
(1) No data races occur on a variable (thanks to exclusive access to the cache line).
(2) Depending which memory_order we use, it guarantees (with barriers) that sequential consistency happens (before a barrier, after a barrier or both).
(3) After an atomic write(V1) on T1, an atomic RMW(V1) on T2 will be coherent (its cache line will have been updated with the written value on T1).
But as cache coherency primer mention,
The implication of all these things is that, by default, loads can fetch stale data (if a corresponding invalidation request was sitting in the invalidation queue)
So, is the following correct?
(4) std::atomic does NOT guarantee that T2 won't read a 'stale' value on an atomic read(V) after an atomic write(V) on T1.
Question if (4) is right: if the atomic write on T1 invalidates the cache line no matter the delay, why does T2 wait for the invalidation to take effect when it does an atomic RMW operation, but not when it does an atomic read?
Questions if (4) is wrong: when can a thread read a 'stale' value and "it's visible" in the execution, then?
I appreciate your answers a lot
Update 1
So it seems I was wrong on (3) then. Imagine the following interleave, for an initial V1=0:
T1: W(1)
T2: R(0) M(++) W(1)
Even though T2's RMW is guaranteed to happen entirely after W(1) in this case, it can still read a 'stale' value (I was wrong). According to this, atomic doesn't guarantee full cache coherency, only sequential consistency.
Update 2
(5) Now imagine this example (x = y = 0 and are atomic):
T1: x = 1;
T2: y = 1;
T3: if (x==1 && y==0) print("msg");
According to what we've discussed, seeing "msg" displayed on screen wouldn't give us any information beyond that T2 was executed after T1. So either of the following executions might have happened:
T1 < T3 < T2
T1 < T2 < T3 (where T3 sees x = 1 but not y = 1 yet)
is that right?
(6) If a thread can always read 'stale' values, what would happen if we took the typical "publish" scenario but instead of signaling that some data is ready, we do just the opposite (delete the data)?
T1: delete gameObjectPtr; is_enabled.store(false, std::memory_order_release);
T2: while (is_enabled.load(std::memory_order_acquire)) gameObjectPtr->doSomething();
where T2 would still be using a deleted ptr until sees that is_enabled is false.
(7) Also, the fact that threads may read 'stale' values means that a mutex cannot be implemented with just one lock-free atomic right? It would require a synch mechanism between threads. Would it require a lockable atomic?
(1) Yes, there are no data races.
(2) Yes, with appropriate memory_order values you can guarantee sequential consistency.
(3) An atomic read-modify-write will always occur entirely before or entirely after an atomic write to the same variable.
(4) Yes, T2 can read a stale value from a variable after an atomic write on T1.
Atomic read-modify-write operations are specified in a way to guarantee their atomicity. If another thread could write to the value after the initial read and before the write of an RMW operation, then that operation would not be atomic.
Threads can always read stale values, except when happens-before guarantees relative ordering.
If a RMW operation reads a "stale" value, then it guarantees that the write it generates will be visible before any writes from other threads that would overwrite the value it read.
Update for example
If T1 writes x=1 and T2 does x++, with x initially 0, the choices from the point of view of the storage of x are:
T1's write is first, so T1 writes x=1, then T2 reads x==1, increments that to 2 and writes back x=2 as a single atomic operation.
T1's write is second. T2 reads x==0, increments it to 1, and writes back x=1 as a single operation, then T1 writes x=1.
However, provided there are no other points of synchronization between these two threads, the threads can proceed with the operations not flushed to memory.
Thus T1 can issue x=1, then proceed with other things, even though T2 will still read x==0 (and thus write x=1).
If there are any other points of synchronization then it will become apparent which thread modified x first, because those synchronization points will force an order.
This is most apparent if you have a conditional on the value read from a RMW operation.
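The two possible outcomes can be checked with a small sketch. The function store_vs_increment is hypothetical; it returns the value read by T2's RMW together with the final value of x.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Hypothetical helper: T1 stores 1 while T2 atomically increments.
// Returns {value read by the RMW, final value of x}.
std::pair<int, int> store_vs_increment()
{
    std::atomic<int> x{0};
    int old_value = -1;
    std::thread t1([&] { x.store(1, std::memory_order_relaxed); });
    std::thread t2([&] { old_value = x.fetch_add(1, std::memory_order_relaxed); });
    t1.join();
    t2.join();
    return {old_value, x.load(std::memory_order_relaxed)};
}
```

Only {1, 2} (the store is first in the modification order) and {0, 1} (the increment is first) are possible. The value returned by the RMW tells you which order actually occurred.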
Update 2
(5) If you use memory_order_seq_cst (the default) for all atomic operations you don't need to worry about this sort of thing. From the point of view of the program, if you see "msg" then T1 ran, then T3, then T2.
If you use other memory orderings (especially memory_order_relaxed) then you may see other scenarios in your code.
(6) In this case, you have a bug. Suppose the is_enabled flag is true when T2 enters its while loop, so it decides to run the body. T1 now deletes the data, and T2 then dereferences the pointer, which is a dangling pointer, and undefined behaviour ensues. The atomics don't help or hinder in any way beyond preventing the data race on the flag.
(7) You can implement a mutex with a single atomic variable.
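As an illustration of the last point, here is a minimal spinlock sketch built on one std::atomic<bool>. The SpinLock class and guarded_increments are hypothetical names, and a production mutex would block instead of spinning; this only shows that a single atomic is sufficient for mutual exclusion.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical minimal spinlock: one atomic flag, no other shared state.
class SpinLock {
    std::atomic<bool> locked_{false};
public:
    void lock()
    {
        // The RMW returns the previous value; only the thread that sees
        // false has taken the lock.
        while (locked_.exchange(true, std::memory_order_acquire))
            std::this_thread::yield();
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};

// Increments a plain (non-atomic) counter under the spinlock.
long guarded_increments(int threads, int per_thread)
{
    SpinLock lock;
    long total = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < per_thread; ++i) {
                lock.lock();
                ++total;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool)
        th.join();
    return total;
}
```

A "stale" read is harmless here: a thread that reads an old value with a plain load would simply retry, and the exchange itself always reads the latest value in the modification order.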
Regarding (3) - it depends on the memory order used. If both the store and the RMW operation use std::memory_order_seq_cst, then both operations are ordered in some way - i.e., either the store happens before the RMW, or the other way round. If the store is ordered before the RMW, then it is guaranteed that the RMW operation "sees" the value that was stored. If the store is ordered after the RMW, it would overwrite the value written by the RMW operation.
If you use more relaxed memory orders, the modifications will still be ordered in some way (the modification order of the variable), but you have no guarantees on whether the RMW "sees" the value from the store operation - even if the RMW operation is ordered after the write in the variable's modification order.
In case you want to read yet another article I can refer you to Memory Models for C/C++ Programmers.

Acquire/release semantics with 4 threads

I am currently reading C++ Concurrency in Action by Anthony Williams. One of his listings shows this code, and he states that the assertion z != 0 can fire.
#include <atomic>
#include <thread>
#include <assert.h>
std::atomic<bool> x,y;
std::atomic<int> z;
void write_x()
{
    x.store(true,std::memory_order_release);
}
void write_y()
{
    y.store(true,std::memory_order_release);
}
void read_x_then_y()
{
    while(!x.load(std::memory_order_acquire));
    if(y.load(std::memory_order_acquire))
        ++z;
}
void read_y_then_x()
{
    while(!y.load(std::memory_order_acquire));
    if(x.load(std::memory_order_acquire))
        ++z;
}
int main()
{
    x=false;
    y=false;
    z=0;
    std::thread a(write_x);
    std::thread b(write_y);
    std::thread c(read_x_then_y);
    std::thread d(read_y_then_x);
    a.join();
    b.join();
    c.join();
    d.join();
    assert(z.load()!=0);
}
So the different execution paths, that I can think of is this:
1)
Thread a (x is now true)
Thread c (fails to increment z)
Thread b (y is now true)
Thread d (increments z) assertion cannot fire
2)
Thread b (y is now true)
Thread d (fails to increment z)
Thread a (x is now true)
Thread c (increments z) assertion cannot fire
3)
Thread a (x is true)
Thread b (y is true)
Thread c (z is incremented) assertion cannot fire
Thread d (z is incremented)
Could someone explain to me how this assertion can fire?
He shows this little graphic:
Shouldn't the store to y also sync with the load in read_x_then_y, and the store to x sync with the load in read_y_then_x? I'm very confused.
EDIT:
Thank you for your responses, I understand how atomics work and how to use Acquire/Release. I just don't get this specific example. I was trying to figure out IF the assertion fires, then what did each thread do? And why does the assertion never fire if we use sequential consistency.
The way, I am reasoning about this is that if thread a (write_x) stores to x then all the work it has done so far is synced with any other thread that reads x with acquire ordering. Once read_x_then_y sees this, it breaks out of the loop and reads y. Now, 2 things could happen. In one option, the write_y has written to y, meaning this release will sync with the if statement (load) meaning z is incremented and assertion cannot fire. The other option is if write_y hasn't run yet, meaning the if condition fails and z isn't incremented, In this scenario, only x is true and y is still false. Once write_y runs, the read_y_then_x breaks out of its loop, however both x and y are true and z is incremented and the assertion does not fire. I can't think of any 'run' or memory ordering where z is never incremented. Can someone explain where my reasoning is flawed?
Also, I know The loop read will always be before the if statement read because the acquire prevents this reordering.
You are thinking in terms of sequential consistency, the strongest (and default) memory order. If this memory order is used, all accesses to atomic variables constitute a total order, and the assertion indeed cannot be triggered.
However, in this program, a weaker memory order is used (release stores and acquire loads). This means, by definition that you cannot assume a total order of operations. In particular, you cannot assume that changes become visible to other threads in the same order. (Only a total order on each individual variable is guaranteed for any atomic memory order, including memory_order_relaxed.)
The stores to x and y occur on different threads, with no synchronization between them. The loads of x and y occur on different threads, with no synchronization between them. This means it is entirely allowed that thread c sees x && ! y and thread d sees y && ! x. (I'm just abbreviating the acquire-loads here, don't take this syntax to mean sequentially consistent loads.)
Bottom line: Once you use a weaker memory order than sequentially consistent, you can kiss your notion of a global state of all atomics, that is consistent between all threads, goodbye. Which is exactly why so many people recommend sticking with sequential consistency unless you need the performance (BTW, remember to measure if it's even faster!) and are certain of what you are doing. Also, get a second opinion.
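For comparison, here is a sketch of the same four-thread program with every operation using std::memory_order_seq_cst. The function run_seq_cst_variant is a hypothetical wrapper around the listing from the question; with a single total order over all the stores and loads, at least one reader must see both flags set.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical wrapper: the four-thread example with seq_cst everywhere.
int run_seq_cst_variant()
{
    std::atomic<bool> x{false}, y{false};
    std::atomic<int> z{0};

    std::thread a([&] { x.store(true, std::memory_order_seq_cst); });
    std::thread b([&] { y.store(true, std::memory_order_seq_cst); });
    std::thread c([&] {
        while (!x.load(std::memory_order_seq_cst))
            std::this_thread::yield();
        if (y.load(std::memory_order_seq_cst))
            ++z;
    });
    std::thread d([&] {
        while (!y.load(std::memory_order_seq_cst))
            std::this_thread::yield();
        if (x.load(std::memory_order_seq_cst))
            ++z;
    });
    a.join();
    b.join();
    c.join();
    d.join();
    return z.load();
}
```

The result is 1 or 2 but never 0: whichever of the two stores comes second in the total order, the reader that waits on that store is guaranteed to see the earlier one as well.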
Now, whether you will get burned by this, is a different question. The standard simply allows a scenario where the assertion fails, based on the abstract machine that is used to describe the standard requirements. However, your compiler and/or CPU may not exploit this allowance for one reason or another. So it is possible that for a given compiler and CPU, you may never see that the assertion is triggered, in practice. Keep in mind that a compiler or CPU may always use a stricter memory order than the one you asked for, because this can never introduce violations of the minimum requirements from the standard. It may only cost you some performance – but that is not covered by the standard anyway.
UPDATE in response to comment: The standard defines no hard upper limit on how long it takes for one thread to see changes to an atomic by another thread. There is a recommendation to implementers that values should become visible eventually.
There are sequencing guarantees, but the ones pertinent to your example do not prevent the assertion from firing. The basic acquire-release guarantee is that if:
Thread e performs a release-store to an atomic variable x
Thread f performs an acquire-load from the same atomic variable
Then if the value read by f is the one that was stored by e, the store in e synchronizes-with the load in f. This means that any (atomic and non-atomic) store in e that was, in this thread, sequenced before the given store to x, is visible to any operation in f that is, in this thread, sequenced after the given load. [Note that there are no guarantees given regarding threads other than these two!]
So, there is no guarantee that f will read the value stored by e, as opposed to e.g. some older value of x. If it doesn't read the updated value, then also the load does not synchronize with the store, and there are no sequencing guarantees for any of the dependent operations mentioned above.
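A minimal sketch of the basic guarantee between the two threads e and f. The names publish_and_read, data, ready and seen are hypothetical. Because f loops until it actually reads the value stored by e, the synchronizes-with relation is established and the plain write to data is visible.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: thread e publishes plain data with a release store,
// thread f picks it up after an acquire load that reads the stored value.
int publish_and_read()
{
    int data = 0;                   // non-atomic payload
    std::atomic<bool> ready{false};
    int seen = -1;

    std::thread e([&] {
        data = 42;                                      // sequenced before the store
        ready.store(true, std::memory_order_release);
    });
    std::thread f([&] {
        while (!ready.load(std::memory_order_acquire))  // wait until e's value is read
            std::this_thread::yield();
        seen = data;                                    // sequenced after the load
    });
    e.join();
    f.join();
    return seen;
}
```

If f performed a single load instead of looping, it might read false, and in that case it would have no right to touch data at all.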
I liken atomics with lesser memory order than sequentially consistent to the Theory of Relativity, where there is no global notion of simultaneousness.
PS: That said, an atomic load cannot just read an arbitrary older value. For example, if one thread performs periodic increments (e.g. with release order) of an atomic<unsigned> variable, initialized to 0, and another thread periodically loads from this variable (e.g. with acquire order), then, except for eventual wrapping, the values seen by the latter thread must be monotonically increasing. But this follows from the given sequencing rules: Once the latter thread reads a 5, anything that happened before the increment from 4 to 5 is in the relative past of anything that follows the read of 5. In fact, a decrease other than wrapping is not even allowed for memory_order_relaxed, but this memory order does not make any promises to the relative sequencing (if any) of accesses to other variables.
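The monotonicity claim can be sketched as follows, using relaxed ordering for both the increments and the loads. The function reads_never_decrease is hypothetical, and the sketch assumes the number of increments is small enough that the counter does not wrap.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: one thread increments, another checks that the
// values it loads never go backwards. Assumes no wrap-around.
bool reads_never_decrease(unsigned increments)
{
    std::atomic<unsigned> counter{0};
    bool monotonic = true;

    std::thread writer([&] {
        for (unsigned i = 0; i < increments; ++i)
            counter.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread reader([&] {
        unsigned last = 0;
        while (last < increments) {
            unsigned now = counter.load(std::memory_order_relaxed);
            if (now < last)
                monotonic = false;  // would violate coherence
            last = now;
        }
    });
    writer.join();
    reader.join();
    return monotonic;
}
```

The reader may skip values, and it may see the same value many times, but it can never see an older value after a newer one.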
The release-acquire synchronization has (at least) this guarantee: side-effects before a release on a memory location are visible after an acquire on this memory location.
There is no such guarantee if the memory location is not the same. More importantly, there's no total (think global) ordering guarantee.
Looking at the example, thread A makes thread C come out of its loop, and thread B makes thread D come out of its loop.
However, the way a release may "publish" to an acquire (or the way an acquire may "observe" a release) on the same memory location doesn't require total ordering. It's possible for thread C to observe A's release and thread D to observe B's release, and only somewhere in the future for C to observe B's release and for D to observe A's release.
The example has 4 threads because that's the minimum with which you can force such non-intuitive behavior. If any of the atomic operations were done in the same thread, there would be an ordering you couldn't violate.
For instance, if write_x and write_y happened on the same thread, it would require that whatever thread observed a change in y would have to observe a change in x.
Similarly, if read_x_then_y and read_y_then_x happened on the same thread, you would observe both changes, in x and in y, at least in read_y_then_x.
Having write_x and read_x_then_y in the same thread would be pointless for the exercise, as it would become obvious it's not synchronizing correctly, as would be having write_x and read_y_then_x, which would always read the latest x.
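A sketch of the first of these variations, where both stores happen on the same thread. The function x_seen_after_y is hypothetical; the memory orders are the same release/acquire pair as in the book's listing.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: one thread writes x then y; a reader that observes
// y == true must also observe x == true.
bool x_seen_after_y()
{
    std::atomic<bool> x{false}, y{false};
    bool x_was_set = false;

    std::thread writer([&] {
        x.store(true, std::memory_order_release);
        y.store(true, std::memory_order_release);  // x's store is sequenced before this
    });
    std::thread reader([&] {
        while (!y.load(std::memory_order_acquire))  // synchronizes with the store to y
            std::this_thread::yield();
        x_was_set = x.load(std::memory_order_acquire);
    });
    writer.join();
    reader.join();
    return x_was_set;
}
```

Here the store to x is ordered before the release store to y within one thread, so observing y necessarily brings x along. In the four-thread example no such chain connects the two stores.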
EDIT:
The way, I am reasoning about this is that if thread a (write_x) stores to x then all the work it has done so far is synced with any other thread that reads x with acquire ordering.
(...) I can't think of any 'run' or memory ordering where z is never incremented. Can someone explain where my reasoning is flawed?
Also, I know The loop read will always be before the if statement read because the acquire prevents this reordering.
That's sequentially consistent order, which imposes a total order. That is, it imposes that write_x and write_y both be visible to all threads one after the other; either x then y or y then x, but the same order for all threads.
With release-acquire, there is no total order. The effects of a release are only guaranteed to be visible to a corresponding acquire on the same memory location. With release-acquire, the effects of write_x are guaranteed to be visible to whoever notices x has changed.
This noticing something changed is very important. If you don't notice a change, you're not synchronizing. As such, thread C is not synchronizing on y and thread D is not synchronizing on x.
Essentially, it's way easier to think of release-acquire as a change notification system that only works if you synchronize properly. If you don't synchronize, you may or may not observe side-effects.
Strong memory model hardware architectures with cache coherence even in NUMA, or languages/frameworks that synchronize in terms of total order, make it difficult to think in these terms, because it's practically impossible to observe this effect.
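For contrast, here is a sketch of the same four-thread example with every operation switched to memory_order_seq_cst. The function name and the use of local variables and lambdas are my own choices for illustration; the thread roles A to D match the description above. With a single total order over all the operations, at least one of the two readers must see both flags set, so z cannot end up 0.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical harness: one run of the four-thread example, all seq_cst.
int run_four_threads_seq_cst()
{
    std::atomic<bool> x{false}, y{false};
    std::atomic<int> z{0};

    std::thread a([&] { x.store(true, std::memory_order_seq_cst); });   // write_x
    std::thread b([&] { y.store(true, std::memory_order_seq_cst); });   // write_y
    std::thread c([&] {                                                 // read_x_then_y
        while (!x.load(std::memory_order_seq_cst)) {}
        if (y.load(std::memory_order_seq_cst)) ++z;
    });
    std::thread d([&] {                                                 // read_y_then_x
        while (!y.load(std::memory_order_seq_cst)) {}
        if (x.load(std::memory_order_seq_cst)) ++z;
    });

    a.join(); b.join(); c.join(); d.join();
    int result = z.load();
    assert(result != 0);   // ruled out by the single total order of seq_cst operations
    return result;         // 1 or 2
}
```

With release/acquire instead, the standard permits a result of 0, although you are unlikely to observe it on strongly ordered hardware such as x86.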
Let's walk through the parallel code:
void write_x()
{
x.store(true,std::memory_order_release);
}
void write_y()
{
y.store(true,std::memory_order_release);
}
There is nothing before these instructions (they are at start of parallelism, everything that happened before also happened before other threads) so they are not meaningfully releasing: they are effectively relaxed operations.
Let's walk through the parallel code again, noting that these two previous operations are not effective releases:
void read_x_then_y()
{
while(!x.load(std::memory_order_acquire)); // acquire what state?
if(y.load(std::memory_order_acquire))
++z;
}
void read_y_then_x()
{
while(!y.load(std::memory_order_acquire));
if(x.load(std::memory_order_acquire))
++z;
}
Note that all the loads refer to variables for which nothing is effectively released ever, so nothing is effectively acquired here: we re-acquire the visibility over the previous operations in main that are visible already.
So you see that all operations are effectively relaxed: they provide no visibility (over what was already visible). It's like doing an acquire fence just after an acquire fence, it's redundant. Nothing new is implied that wasn't already implied.
So now that everything is relaxed, all bets are off.
Another way to view that is to notice that an atomic load is not an RMW operation that leaves the value unchanged, as an RMW can be release and a load cannot.
Just like all atomic stores are part of the modification order of an atomic variable even if the variable is effectively a constant (that is, a non-const variable whose value is always the same), an atomic RMW operation is somewhere in the modification order of an atomic variable, even if there was no change of value (and there cannot be a change of value because the code always compares and copies the exact same bit pattern).
In the modification order you can have release semantic (even if there was no modification).
If you protect a variable with a mutex you get release semantic (even if you just read the variable).
If you make all your loads (at least in functions that do more than one operation) release-modification-loads with:
either a mutex protecting the atomic object (then drop the atomic as it's now redundant!)
or a RMW with acq_rel order,
the previous proof that all operations are effectively relaxed doesn't work anymore and some atomic operation in at least one of the read_A_then_B functions will have to be ordered before some operation in the other, as they operate on the same objects. If they are in the modification order of a variable and you use acq_rel, then you have a happens-before relation between one of these (obviously, which one happens before which one is non-deterministic).
Either way execution is now sequential, as all operations are effectively acquire and release, that is as operative acquire and release (even those that are effectively relaxed!).
If we change two if statements to while statements, it will make the code correct and z will be guaranteed to be equal to 2.
void read_x_then_y()
{
while(!x.load(std::memory_order_acquire));
while(!y.load(std::memory_order_acquire));
++z;
}
void read_y_then_x()
{
while(!y.load(std::memory_order_acquire));
while(!x.load(std::memory_order_acquire));
++z;
}

Understanding atomic variables and operations

I read about boost's and std's (c++11) atomic type and operations over and over again and still I'm not sure I understand it right (and at some cases I don't understand it at all). So, I have a few questions about it.
My sources I use for learning:
Boost documentation: http://www.boost.org/doc/libs/1_53_0/doc/html/atomic.html
http://www.developerfusion.com/article/138018/memory-ordering-for-atomic-operations-in-c0x/
Consider following snippet:
atomic<bool> x,y;
void write_x_then_y()
{
x.store(true, memory_order_relaxed);
y.store(true, memory_order_release);
}
#1: Is it equivalent to this next one?
atomic<bool> x,y;
void write_x_then_y()
{
x.store(true, memory_order_relaxed);
atomic_thread_fence(memory_order_release); // *1
y.store(true, memory_order_relaxed); // *2
}
#2: Is following statement true?
Line *1 assures, that when operations done under this line (for example *2) are visible (for other thread using acquire), code above *1 will be visible too (with new values).
The next snippet extends the ones above:
void read_y_then_x()
{
if(y.load(memory_order_acquire))
{
assert(x.load(memory_order_relaxed));
}
}
#3: Is it equivalent to this next one?
void read_y_then_x()
{
atomic_thread_fence(memory_order_acquire); // *3
if(y.load(memory_order_relaxed)) // *4
{
assert(x.load(memory_order_relaxed)); // *5
}
}
#4: Are following statements true?
Line *3 assures that if some operations under release order (in other thread, like *2) is visible, every operation above the release order (for example *1) will be visible as well.
That means that assert at *5 will never fail (with false as default values).
But this does not assure that even if physically (in the processor) *2 happens before *3, it will be visible to the snippet above (running in a different thread) - function read_y_then_x() can still read old values. The only thing which is assured is that if y is true, x will also be true.
#5: Incrementing (operation of adding 1) to an atomic integer can be memory_order_relaxed and no data are lost. Only problem is order and time of visibility of result.
According to boost, the following snippet is a working reference counter:
#include <boost/intrusive_ptr.hpp>
#include <boost/atomic.hpp>
class X {
public:
typedef boost::intrusive_ptr<X> pointer;
X() : refcount_(0) {}
private:
mutable boost::atomic<int> refcount_;
friend void intrusive_ptr_add_ref(const X * x)
{
x->refcount_.fetch_add(1, boost::memory_order_relaxed);
}
friend void intrusive_ptr_release(const X * x)
{
if (x->refcount_.fetch_sub(1, boost::memory_order_release) == 1) {
boost::atomic_thread_fence(boost::memory_order_acquire);
delete x;
}
}
};
#6: Why is memory_order_release used for the decrement? How does it work (in this context)? If what I wrote earlier is true, what makes the returned value the most recent, especially when we use acquire AFTER reading and not before/during?
#7: Why is there an acquire fence after the reference counter reaches zero? We just read that the counter is zero and there is no other atomic variable used (the pointer itself is not marked/used as such).
1: No. A release fence synchronizes with all acquire operations and fences. If there was a third atomic<bool> z which was being manipulated in a third thread, the fence would synchronize with that third thread as well, which is unnecessary. That being said, they will act the same on x86, but that is because x86 has very strong synchronization. The architectures used on 1000 core systems tend to be weaker.
2: Yes, this is correct. A fence ensures that if you see anything that follows, you also see everything that preceded.
3: In general they are different, but realistically they will be the same. The compiler is allowed to reorder two relaxed operations on different variables, but may not introduce spurious operations. If the compiler has any way of being confident that it is going to need to read x, it may do so before reading y. In your particular case, this is very difficult for the compiler, but there are many similar cases where such reordering is fair game.
4: All of those are true. The atomic operations guarantee consistency. They do not always guarantee that things happen in an order you wanted, they just prevent pathological orders that ruin your algorithm.
5: Correct. Relaxed operations are truly atomic. They just don't synchronize any additional memory.
6: For any given atomic object M, C++ guarantees that there is an "official" order for operations on M. You don't get to see the "latest" value for M so much as C++ and the processor guarantee that all threads will see a consistent series of values for M. If two threads increment the refcount, then decrement it, there is no guarantee which one will decrement it to 0, but there is a guarantee that one of them will see that it decremented it to 0. There is no way for both of them to see that they decremented 2->1 and 2->1, but somehow the refcount combined them to 0. One thread will always see 2->1 and the other will see 1->0.
Remember, memory order is more about synchronizing the memory around the atomic. The atomic gets handled properly no matter what memory order you use.
7: This one is trickier. The short version for 7 is that the decrement uses release order because some thread is going to have to run the destructor for x, and we want to make sure it sees all operations on x made on all threads. Using release order on the decrement satisfies this need because you can prove that it works. Whoever is responsible for deleting x acquires all changes before doing so (using a fence to make sure atomics in the deleter don't drift upward). In all cases where threads release their own references, it is obvious that all threads will have a release-order decrement before the deleter gets called. In cases where one thread increments the refcount and another decrements it, you can prove that the only valid way to do so is if the threads synchronize with each other, so that the destructor sees the result of both threads. Failure to synchronize would create a race condition no matter what, so the user is obliged to get it right.
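Points 6 and 7 can be sketched with std::atomic instead of boost. Counted, add_ref, release_ref and sum_seen_by_last_owner are hypothetical names, not part of the boost example; the memory orders are the ones from the question. Each worker writes plain data, then drops its reference; whichever thread performs the 1->0 decrement is guaranteed to see both writes.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for class X in the question.
struct Counted {
    std::atomic<int> refcount{0};
    int slot[2] = {0, 0};   // plain, non-atomic data written by the owners
};

void add_ref(Counted* p)
{
    p->refcount.fetch_add(1, std::memory_order_relaxed);
}

// Returns true if the caller dropped the last reference.
bool release_ref(Counted* p)
{
    // release: publish this thread's earlier writes to the object
    if (p->refcount.fetch_sub(1, std::memory_order_release) == 1) {
        // acquire: pull in what the other owners published before their decrements
        std::atomic_thread_fence(std::memory_order_acquire);
        return true;
    }
    return false;
}

int sum_seen_by_last_owner()
{
    Counted* obj = new Counted;
    add_ref(obj);
    add_ref(obj);                        // one reference per worker
    std::atomic<int> observed{-1};

    auto worker = [&observed, obj](int i) {
        obj->slot[i] = 21;               // ordinary write while holding a reference
        if (release_ref(obj)) {          // exactly one worker sees 1 -> 0
            observed.store(obj->slot[0] + obj->slot[1]);
            delete obj;
        }
    };
    std::thread t1(worker, 0);
    std::thread t2(worker, 1);
    t1.join();
    t2.join();
    assert(observed.load() != -1);       // some worker was the last owner
    return observed.load();
}
```

If the decrement were relaxed, the read of the other thread's slot in the last owner would be a data race.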
After pondering over #1 I have been convinced they are not equivalent by this argument, §29.8.3 in [atomics.fences]:
A release fence A synchronizes with an atomic operation B that performs an acquire operation on an atomic
object M if there exists an atomic operation X such that A is sequenced before X, X modifies M, and B
reads the value written by X or a value written by any side effect in the hypothetical release sequence X
would head if it were a release operation.
This paragraph says that a release fence can synchronize only with an acquire operation. But a release operation can, in addition, synchronize with a consume operation.
Your void read_y_then_x() with the acquire fence has the fence in the wrong place. It should be placed between the two atomic loads. An acquire fence essentially makes all the loads above the fence act somewhat like acquire loads, with the exception that the happens-before relation isn't established until you execute the fence.
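A sketch of the fence-based pair with the acquire fence between the two loads. The wrapper function and the spin loop (instead of the question's if) are my additions so the result is deterministic; the markers *1, *2, *4, *5 refer to the lines in the question.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical harness: returns what the reader saw in x after observing y.
bool fence_pair_publishes_x()
{
    std::atomic<bool> x{false}, y{false};
    bool saw_x = false;   // written only by the reader, read after join()

    std::thread writer([&] {
        x.store(true, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_release);    // *1
        y.store(true, std::memory_order_relaxed);                // *2
    });

    std::thread reader([&] {
        while (!y.load(std::memory_order_relaxed)) {}            // *4, spin until y is seen
        std::atomic_thread_fence(std::memory_order_acquire);    // fence AFTER the load of y
        saw_x = x.load(std::memory_order_relaxed);               // *5
    });

    writer.join();
    reader.join();
    assert(saw_x);   // the two fences synchronize once the reader has seen y == true
    return saw_x;
}
```

With the acquire fence placed before the load of y, as in the question, there is no such guarantee.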

What do each memory_order mean?

I read a chapter and I didn't like it much. I'm still unclear what the differences are between each memory order. This is my current speculation, which I arrived at after reading the much simpler http://en.cppreference.com/w/cpp/atomic/memory_order
The below is wrong so don't try to learn from it
memory_order_relaxed: Does not sync but is not ignored when order is done from another mode in a different atomic var
memory_order_consume: Syncs reading this atomic variable however It doesnt sync relaxed vars written before this. However if the thread uses var X when modifying Y (and releases it). Other threads consuming Y will see X released as well? I don't know if this means this thread pushes out changes of x (and obviously y)
memory_order_acquire: Syncs reading this atomic variable AND makes sure relaxed vars written before this are synced as well. (does this mean all atomic variables on all threads are synced?)
memory_order_release: Pushes the atomic store to other threads (but only if they read the var with consume/acquire)
memory_order_acq_rel: For read/write ops. Does an acquire so you don't modify an old value and releases the changes.
memory_order_seq_cst: The same thing as acquire release except it forces the updates to be seen in other threads (if a store with relaxed on another thread. I store b with seq_cst. A 3rd thread reading a with relax will see changes along with b and any other atomic variable?).
I think I understood, but correct me if I am wrong. I couldn't find anything that explains it in easy-to-read English.
The GCC Wiki gives a very thorough and easy to understand explanation with code examples.
(excerpt edited, and emphasis added)
IMPORTANT:
Upon re-reading the below quote copied from the GCC Wiki in the process of adding my own wording to the answer, I noticed that the quote is actually wrong. They got acquire and consume exactly the wrong way around. A release-consume operation only provides an ordering guarantee on dependent data whereas a release-acquire operation provides that guarantee regardless of data being dependent on the atomic value or not.
The first model is "sequentially consistent". This is the default mode used when none is specified, and it is the most restrictive. It can also be explicitly specified via memory_order_seq_cst. It provides the same restrictions and limitation to moving loads around that sequential programmers are inherently familiar with, except it applies across threads.
[...]
From a practical point of view, this amounts to all atomic operations acting as optimization barriers. It's OK to re-order things between atomic operations, but not across the operation. Thread local stuff is also unaffected since there is no visibility to other threads. [...] This mode also provides consistency across all threads.
The opposite approach is memory_order_relaxed. This model allows for much less synchronization by removing the happens-before restrictions. These types of atomic operations can also have various optimizations performed on them, such as dead store removal and commoning. [...] Without any happens-before edges, no thread can count on a specific ordering from another thread.
The relaxed mode is most commonly used when the programmer simply wants a variable to be atomic in nature rather than using it to synchronize threads for other shared memory data.
The third mode (memory_order_acquire / memory_order_release) is a hybrid between the other two. The acquire/release mode is similar to the sequentially consistent mode, except it only applies a happens-before relationship to dependent variables. This allows for a relaxing of the synchronization required between independent reads of independent writes.
memory_order_consume is a further subtle refinement in the release/acquire memory model that relaxes the requirements slightly by removing the happens before ordering on non-dependent shared variables as well.
[...]
The real difference boils down to how much state the hardware has to flush in order to synchronize. Since a consume operation may therefore execute faster, someone who knows what they are doing can use it for performance critical applications.
Here follows my own attempt at a more mundane explanation:
A different approach to look at it is to look at the problem from the point of view of reordering reads and writes, both atomic and ordinary:
All atomic operations are guaranteed to be atomic within themselves (the combination of two atomic operations is not atomic as a whole!), and all threads agree on a single order of modifications for each individual atomic object. That means operations on the same atomic object can never appear reordered relative to each other, but other memory operations (including, below sequential consistency, atomic operations on different objects) might very well be. Compilers (and CPUs) routinely do such reordering as an optimization.
It also means the compiler must use whatever instructions are necessary to guarantee that an atomic operation never observes a value of its object that is older than one already observed for that object, even if the newer value was written on another processor core (this says nothing about other operations).
Now, a relaxed is just that, the bare minimum. It does nothing in addition and provides no other guarantees. It is the cheapest possible operation. For non-read-modify-write operations on strongly ordered processor architectures (e.g. x86/amd64) this boils down to a plain normal, ordinary move.
The sequentially consistent operation is the exact opposite, it enforces strict ordering not only for atomic operations, but also for other memory operations that happen before or after. Neither one can cross the barrier imposed by the atomic operation. Practically, this means lost optimization opportunities, and possibly fence instructions may have to be inserted. This is the most expensive model.
A release operation prevents ordinary loads and stores from being reordered after the atomic operation, whereas an acquire operation prevents ordinary loads and stores from being reordered before the atomic operation. Everything else can still be moved around.
The combination of preventing stores being moved after, and loads being moved before the respective atomic operation makes sure that whatever the acquiring thread gets to see is consistent, with only a small amount of optimization opportunity lost.
One may think of that as something like a non-existent lock that is being released (by the writer) and acquired (by the reader). Except... there is no lock.
In practice, release/acquire usually means the compiler need not use any particularly expensive special instructions, but it cannot freely reorder loads and stores to its liking, which may miss out on some (small) optimization opportunities.
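The "lock that isn't a lock" picture can be sketched as a one-shot hand-off of ordinary data. The names pass_message, payload and ready are invented for this illustration.

```cpp
#include <atomic>
#include <cassert>
#include <string>
#include <thread>

// Hypothetical helper: hands a plain std::string from one thread to another.
std::string pass_message()
{
    std::string payload;               // ordinary, non-atomic data
    std::atomic<bool> ready{false};
    std::string received;

    std::thread producer([&] {
        payload = "hello";                                  // ordinary store
        ready.store(true, std::memory_order_release);       // payload write cannot move below this
    });

    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {}   // later reads cannot move above this
        received = payload;                                 // safe: ordered after the producer's write
    });

    producer.join();
    consumer.join();
    assert(received == payload);
    return received;
}
```

If both operations on ready were relaxed, the consumer's read of payload would be a data race.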
Finally, consume is the same operation as acquire, only with the exception that the ordering guarantees only apply to dependent data. Dependent data would e.g. be data that is pointed-to by an atomically modified pointer.
Arguably, that may provide a couple of optimization opportunities that are not present with acquire operations (since less data is subject to restrictions); however, this happens at the expense of more complex and more error-prone code, and the non-trivial task of getting dependency chains correct.
It is currently discouraged to use consume ordering while the specification is being revised.
This is a quite complex subject. Try to read http://en.cppreference.com/w/cpp/atomic/memory_order several times, try to read other resources, etc.
Here's a simplified description:
The compiler and CPU can reorder memory accesses. That is, they can happen in a different order than what's specified in the code. That's fine most of the time; the problem arises when different threads try to communicate and may see an order of memory accesses that breaks the invariants of the code.
Usually you can use locks for synchronization. The problem is that they're slow. Atomic operations are much faster, because the synchronization happens at CPU level (i.e. CPU ensures that no other thread, even on another CPU, modifies some variable, etc.).
So, the one single problem we're facing is reordering of memory accesses. The memory_order enum specifies what types of reorderings compiler must forbid.
relaxed - no constraints.
consume - no loads that are dependent on the newly loaded value can be reordered wrt. the atomic load. I.e. if they are after the atomic load in the source code, they will happen after the atomic load too.
acquire - no loads or stores can be reordered before the atomic load. I.e. if they are after the atomic load in the source code, they will happen after the atomic load too.
release - no loads or stores can be reordered after the atomic store. I.e. if they are before the atomic store in the source code, they will happen before the atomic store too.
acq_rel - acquire and release combined.
seq_cst - it is more difficult to understand why this ordering is required. Basically, all other orderings only ensure that specific disallowed reorderings don't happen only for the threads that consume/release the same atomic variable. Memory accesses can still propagate to other threads in any order. This ordering ensures that this doesn't happen (thus sequential consistency). For a case where this is needed see the example at the end of the linked page.
I want to provide a more precise explanation, closer to the standard.
Things to ignore:
memory_order_consume - apparently no major compiler implements it, and they silently replace it with a stronger memory_order_acquire. Even the standard itself says to avoid it.
A big part of the cppreference article on memory orders deals with 'consume', so dropping it simplifies things a lot.
It also lets you ignore related features like [[carries_dependency]] and std::kill_dependency.
Data races: Writing to a non-atomic variable from one thread, and simultaneously reading/writing to it from a different thread is called a data race, and causes undefined behavior.
memory_order_relaxed is the weakest and supposedly the fastest memory order.
Any reads/writes to atomics can't cause data races (and subsequent UB). relaxed provides just this minimal guarantee, for a single variable. It doesn't provide any guarantees for other variables (atomic or not).
All threads agree on the order of operations on every particular atomic variable. But this is the case only for individual variables. If other variables (atomic or not) are involved, threads might disagree on how exactly the operations on different variables are interleaved.
It's as if relaxed operations propagate between threads with slight unpredictable delays.
This means that you can't use relaxed atomic operations to judge when it's safe to access other non-atomic memory (can't synchronize access to it).
By "threads agree on the order" I mean that:
Each thread will access each separate variable in the exact order you tell it to. E.g. a.store(1, relaxed); a.store(2, relaxed); will write 1, then 2, never in the opposite order. But accesses to different variables in the same thread can still be reordered relative to each other.
If a thread A writes to a variable several times, then thread B reads several times, it will get the values in the same order (but of course it can read some values several times, or skip some, if you don't synchronize the threads in other ways).
No other guarantees are given.
Example uses: Anything that doesn't try to use an atomic variable to synchronize access to non-atomic data: various counters (that exist for informational purposes only), or 'stop flags' to signal other threads to stop. Another example: operations on shared_ptrs that increment the reference count internally use relaxed.
Fences: atomic_thread_fence(relaxed); does nothing.
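A minimal sketch of the "informational counter" use case; the function name and parameters are hypothetical. Every increment is relaxed, yet none is lost, and the final read is reliable because join() already orders it after all the increments.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: several threads bump one counter with relaxed RMWs.
std::size_t count_with_relaxed(unsigned threads, std::size_t per_thread)
{
    std::atomic<std::size_t> counter{0};
    std::vector<std::thread> pool;

    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back([&counter, per_thread] {
            for (std::size_t i = 0; i < per_thread; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);   // atomic, but orders nothing else
        });

    for (auto& th : pool)
        th.join();   // thread completion synchronizes with join() returning

    std::size_t total = counter.load(std::memory_order_relaxed);
    assert(total == threads * per_thread);
    return total;
}
```

While the workers are still running, a relaxed load only gives some value from the counter's modification order, which is fine for statistics but not for deciding when other memory is safe to touch.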
memory_order_release, memory_order_acquire do everything relaxed does, and more (so it's supposedly slower or equivalent).
Only stores (writes) can use release. Only loads (reads) can use acquire. Read-modify-write operations such as fetch_add can be both (memory_order_acq_rel), but they don't have to.
Those let you synchronize threads:
Let's say thread 1 reads/writes to some memory M (any non-atomic or atomic variables, doesn't matter).
Then thread 1 performs a release store to a variable A. Then it stops touching that memory.
If thread 2 then performs an acquire load of the same variable A, this load is said to synchronize with the corresponding store in thread 1.
Now thread 2 can safely read/write to that memory M.
You only synchronize with the latest writer, not preceding writers.
You can chain synchronizations across multiple threads.
There's a special rule that synchronization propagates across any number of read-modify-write operations regardless of their memory order. E.g. if thread 1 does a.store(1, release);, then thread 2 does a.fetch_add(2, relaxed);, then thread 3 does a.load(acquire), then thread 1 successfully synchronizes with thread 3, even though there's a relaxed operation in the middle.
In the above rule, a release operation X, and any subsequent read-modify-write operations on the same variable (stopping at the next non-read-modify-write operation) are called a release sequence headed by X. (So if an acquire reads from any operation in a release sequence, it synchronizes with the head of the sequence.)
If read-modify-write operations are involved, nothing stops you from synchronizing with more than one operation. In the example above, if fetch_add was using acquire or acq_rel, it too would synchronize with thread 1, and conversely, if it used release or acq_rel, thread 3 would synchronize with 2 in addition to 1.
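The release-sequence rule can be sketched with the three threads described above. The helper name, the non-atomic data variable and the spin loops are assumptions added so the outcome is deterministic.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: thread 3 synchronizes with thread 1 through a relaxed RMW.
int read_through_release_sequence()
{
    int data = 0;                 // non-atomic
    std::atomic<int> a{0};
    int seen = -1;                // written only by thread 3, read after join()

    std::thread t1([&] {
        data = 42;
        a.store(1, std::memory_order_release);              // head of the release sequence
    });

    std::thread t2([&] {
        while (a.load(std::memory_order_relaxed) != 1) {}   // wait so the RMW follows the store
        a.fetch_add(2, std::memory_order_relaxed);          // relaxed RMW continues the sequence
    });

    std::thread t3([&] {
        while (a.load(std::memory_order_acquire) != 3) {}   // reads the value written by the RMW
        seen = data;                                        // ordered after thread 1's write
    });

    t1.join(); t2.join(); t3.join();
    assert(a.load() == 3);
    return seen;
}
```

If thread 2 used a plain store of 3 instead of fetch_add, the release sequence would be broken and the read of data would race.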
Example use: shared_ptr decrements its reference counter using something like fetch_sub(1, acq_rel).
Here's why: imagine that thread 1 reads/writes to *ptr, then destroys its copy of ptr, decrementing the ref count. Then thread 2 destroys the last remaining pointer, also decrementing the ref count, and then runs the destructor.
Since the destructor in thread 2 is going to access the memory previously accessed by thread 1, the acq_rel synchronization in fetch_sub is necessary. Otherwise you'd have a data race and UB.
Fences: Using atomic_thread_fence, you can essentially turn relaxed atomic operations into release/acquire operations. A single fence can apply to more than one operation, and/or can be performed conditionally.
If you do a relaxed read (or with any other order) from one or more variables, then do atomic_thread_fence(acquire) in the same thread, then all those reads count as acquire operations.
Conversely, if you do atomic_thread_fence(release), followed by any number of (possibly relaxed) writes, those writes count as release operations.
An acq_rel fence combines the effect of acquire and release fences.
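As a sketch of a single fence covering several relaxed operations: one release fence precedes two relaxed stores, and one acquire fence follows relaxed polling of both flags. All names here are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: one fence on each side covers two relaxed flags.
int one_fence_two_flags()
{
    int data = 0;                 // non-atomic
    std::atomic<bool> flag_a{false}, flag_b{false};
    int seen = -1;

    std::thread writer([&] {
        data = 7;
        std::atomic_thread_fence(std::memory_order_release);   // applies to both stores below
        flag_a.store(true, std::memory_order_relaxed);
        flag_b.store(true, std::memory_order_relaxed);
    });

    std::thread reader([&] {
        // Poll with relaxed loads; no ordering is paid for while spinning.
        while (!flag_a.load(std::memory_order_relaxed) &&
               !flag_b.load(std::memory_order_relaxed)) {}
        std::atomic_thread_fence(std::memory_order_acquire);   // executed once, after a flag was seen
        seen = data;
    });

    writer.join();
    reader.join();
    assert(seen == 7);
    return seen;
}
```

Whichever flag the reader happens to observe first, the fence pair orders the write of data before the read.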
Similarity with other standard library features:
Several standard library features also cause a similar synchronizes with relationship. E.g. locking a mutex synchronizes with the latest unlock, as if locking was an acquire operation, and unlocking was a release operation.
memory_order_seq_cst does everything acquire/release do, and more. This is supposedly the slowest order, but also the safest.
seq_cst reads count as acquire operations. seq_cst writes count as release operations. seq_cst read-modify-write operations count as both.
seq_cst operations can synchronize with each other, and with acquire/release operations. Beware of special effects of mixing them (see below).
seq_cst is the default order, e.g. given atomic_int x;, x = 1; does x.store(1, seq_cst);.
seq_cst has an extra property compared to acquire/release: all threads agree on the order in which all seq_cst operations happen. This is unlike weaker orders, where threads agree only on the order of operations on each individual atomic variable, but not on how the operations are interleaved - see relaxed order above.
The presence of this global operation order seems to only affect which values you can get from seq_cst loads, it doesn't in any way affect non-atomic variables and atomic operations with weaker orders (unless seq_cst fences are involved, see below), and by itself doesn't prevent any extra data race UB compared to acq/rel operations.
Among other things, this order respects the synchronizes with relationship described for acquire/release above, unless (and this is weird) that synchronization comes from mixing a seq-cst operation with an acquire/release operation (release syncing with seq-cst, or seq-cst syncing with acquire). Such a mix essentially demotes the affected seq-cst operation to an acquire/release (it maybe retains some of the seq-cst properties, but you'd better not count on it).
Example use:
atomic_bool x = true;
atomic_bool y = true;
// Thread 1:
x.store(false, seq_cst);
if (y.load(seq_cst)) {...}
// Thread 2:
y.store(false, seq_cst);
if (x.load(seq_cst)) {...}
Let's say you want at most one thread to be able to enter the if body. seq_cst allows you to do it. Acquire/release or weaker orders wouldn't be enough here.
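A runnable sketch of this example, with a hypothetical counter standing in for the if bodies. Note that "only one thread" means at most one: if both stores land before both loads, neither thread enters.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical harness: returns how many threads entered their if body (0 or 1).
int entered_with_seq_cst_ops()
{
    std::atomic<bool> x{true}, y{true};
    std::atomic<int> entered{0};

    std::thread t1([&] {
        x.store(false, std::memory_order_seq_cst);
        if (y.load(std::memory_order_seq_cst))
            entered.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread t2([&] {
        y.store(false, std::memory_order_seq_cst);
        if (x.load(std::memory_order_seq_cst))
            entered.fetch_add(1, std::memory_order_relaxed);
    });

    t1.join();
    t2.join();
    assert(entered.load() <= 1);   // both entering would contradict the single total order
    return entered.load();
}
```

With release stores and acquire loads the standard would allow a result of 2, since nothing then prevents each load from being ordered before the other thread's store.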
Fences: atomic_thread_fence(seq_cst); does everything an acq_rel fence does, and more.
Like you would expect, they bring some seq-cst properties to atomic operations done with weaker orders.
All threads agree on the order of seq_cst fences, relative to one another and to any seq_cst operations (i.e. seq_cst fences participate in the global order of seq_cst operations, which was described above).
They essentially prevent atomic operations from being reordered across themselves.
E.g. we can transform the above example to:
atomic_bool x = true;
atomic_bool y = true;
// Thread 1:
x.store(false, relaxed);
atomic_thread_fence(seq_cst);
if (y.load(relaxed)) {...}
// Thread 2:
y.store(false, relaxed);
atomic_thread_fence(seq_cst);
if (x.load(relaxed)) {...}
Both threads can't enter if at the same time, because that would require reordering a load across the fence to be before the store.
But formally, the standard doesn't describe them in terms of reordering. Instead, it just explains how the seq_cst fences are placed in the global order of seq_cst operations. Let's say:
Thread 1 performs operation A on atomic variable X using seq_cst order, OR a weaker order preceded by a seq_cst fence.
Then:
Thread 2 performs operation B on the same atomic variable X using seq_cst order, OR a weaker order followed by a seq_cst fence.
(Here A and B are any operations, except they can't both be reads, since then it's impossible to determine which one was first.)
Then the first seq_cst operation/fence is ordered before the second seq_cst operation/fence.
Then, if you imagine a scenario (e.g. in the example above, both threads entering the if) that imposes contradicting requirements on the order, then this scenario is impossible.
E.g. in the example above, if the first thread enters the if, then the first fence must be ordered before the second one. And vice versa. This means that both threads entering the if would lead to a contradiction, and is hence not allowed.
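The fence-based variant can be wrapped in the same kind of harness. The function name and the entered counter are hypothetical; the operations and fences follow the example with relaxed accesses and seq_cst fences.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical harness: relaxed accesses separated by seq_cst fences.
int entered_with_seq_cst_fences()
{
    std::atomic<bool> x{true}, y{true};
    std::atomic<int> entered{0};

    std::thread t1([&] {
        x.store(false, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);   // first fence
        if (y.load(std::memory_order_relaxed))
            entered.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread t2([&] {
        y.store(false, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);   // second fence
        if (x.load(std::memory_order_relaxed))
            entered.fetch_add(1, std::memory_order_relaxed);
    });

    t1.join();
    t2.join();
    // If t1 entered, its fence precedes t2's fence in the seq_cst order, and vice versa,
    // so both entering would require each fence to come before the other.
    assert(entered.load() <= 1);
    return entered.load();
}
```

Replacing the fences with acq_rel fences would remove the guarantee, because acq_rel fences do not take part in the single total order.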
Interoperation between different orders
Summarizing the above:
             | relaxed write | release write      | seq-cst write
relaxed load | -             | -                  | -
acquire load | -             | synchronizes with  | synchronizes with*
seq-cst load | -             | synchronizes with* | synchronizes with
* = The participating seq-cst operation gets a messed up seq-cst order, effectively being demoted to an acquire/release operation. This is explained above.
Does using a stronger memory order make data transfer between threads faster?
No, it seems not.
Sequential consistency for data-race-free programs
The standard explains that if your program only uses seq_cst accesses (and mutexes), and has no data races (which cause UB), then you don't need to think about all the fancy operation reorderings. The program will behave as if only one thread executed at a time, with the threads being unpredictably interleaved.