Acquire/release semantics with 4 threads - c++

I am currently reading C++ Concurrency in Action by Anthony Williams. One of his listing shows this code, and he states that the assertion that z != 0 can fire.
#include <atomic>
#include <thread>
#include <assert.h>
std::atomic<bool> x,y;
std::atomic<int> z;
void write_x()
{
x.store(true,std::memory_order_release);
}
void write_y()
{
y.store(true,std::memory_order_release);
}
void read_x_then_y()
{
while(!x.load(std::memory_order_acquire));
if(y.load(std::memory_order_acquire))
++z;
}
void read_y_then_x()
{
while(!y.load(std::memory_order_acquire));
if(x.load(std::memory_order_acquire))
++z;
}
int main()
{
x=false;
y=false;
z=0;
std::thread a(write_x);
std::thread b(write_y);
std::thread c(read_x_then_y);
std::thread d(read_y_then_x);
a.join();
b.join();
c.join();
d.join();
assert(z.load()!=0);
}
So the different execution paths, that I can think of is this:
1)
Thread a (x is now true)
Thread c (fails to increment z)
Thread b (y is now true)
Thread d (increments z) assertion cannot fire
2)
Thread b (y is now true)
Thread d (fails to increment z)
Thread a (x is now true)
Thread c (increments z) assertion cannot fire
3)
Thread a (x is true)
Thread b (y is true)
Thread c (z is incremented) assertion cannot fire
Thread d (z is incremented)
Could someone explain to me how this assertion can fire?
He shows this little graphic:
Shouldn't the store to y also sync with the load in read_x_then_y, and the store to x sync with the load in read_y_then_x? I'm very confused.
EDIT:
Thank you for your responses, I understand how atomics work and how to use Acquire/Release. I just don't get this specific example. I was trying to figure out IF the assertion fires, then what did each thread do? And why does the assertion never fire if we use sequential consistency.
The way, I am reasoning about this is that if thread a (write_x) stores to x then all the work it has done so far is synced with any other thread that reads x with acquire ordering. Once read_x_then_y sees this, it breaks out of the loop and reads y. Now, 2 things could happen. In one option, the write_y has written to y, meaning this release will sync with the if statement (load) meaning z is incremented and assertion cannot fire. The other option is if write_y hasn't run yet, meaning the if condition fails and z isn't incremented, In this scenario, only x is true and y is still false. Once write_y runs, the read_y_then_x breaks out of its loop, however both x and y are true and z is incremented and the assertion does not fire. I can't think of any 'run' or memory ordering where z is never incremented. Can someone explain where my reasoning is flawed?
Also, I know The loop read will always be before the if statement read because the acquire prevents this reordering.

You are thinking in terms of sequential consistency, the strongest (and default) memory order. If this memory order is used, all accesses to atomic variables constitute a total order, and the assertion indeed cannot be triggered.
However, in this program, a weaker memory order is used (release stores and acquire loads). This means, by definition that you cannot assume a total order of operations. In particular, you cannot assume that changes become visible to other threads in the same order. (Only a total order on each individual variable is guaranteed for any atomic memory order, including memory_order_relaxed.)
The stores to x and y occur on different threads, with no synchronization between them. The loads of x and y occur on different threads, with no synchronization between them. This means it is entirely allowed that thread c sees x && ! y and thread d sees y && ! x. (I'm just abbreviating the acquire-loads here, don't take this syntax to mean sequentially consistent loads.)
Bottom line: Once you use a weaker memory order than sequentially consistent, you can kiss your notion of a global state of all atomics, that is consistent between all threads, goodbye. Which is exactly why so many people recommend sticking with sequential consistency unless you need the performance (BTW, remember to measure if it's even faster!) and are certain of what you are doing. Also, get a second opinion.
Now, whether you will get burned by this, is a different question. The standard simply allows a scenario where the assertion fails, based on the abstract machine that is used to describe the standard requirements. However, your compiler and/or CPU may not exploit this allowance for one reason or another. So it is possible that for a given compiler and CPU, you may never see that the assertion is triggered, in practice. Keep in mind that a compiler or CPU may always use a stricter memory order than the one you asked for, because this can never introduce violations of the minimum requirements from the standard. It may only cost you some performance – but that is not covered by the standard anyway.
UPDATE in response to comment: The standard defines no hard upper limit on how long it takes for one thread to see changes to an atomic by another thread. There is a recommendation to implementers that values should become visible eventually.
There are sequencing guarantees, but the ones pertinent to your example do not prevent the assertion from firing. The basic acquire-release guarantee is that if:
Thread e performs a release-store to an atomic variable x
Thread f performs an acquire-load from the same atomic variable
Then if the value read by f is the one that was stored by e, the store in e synchronizes-with the load in f. This means that any (atomic and non-atomic) store in e that was, in this thread, sequenced before the given store to x, is visible to any operation in f that is, in this thread, sequenced after the given load. [Note that there are no guarantees given regarding threads other than these two!]
So, there is no guarantee that f will read the value stored by e, as opposed to e.g. some older value of x. If it doesn't read the updated value, then also the load does not synchronize with the store, and there are no sequencing guarantees for any of the dependent operations mentioned above.
I liken atomics with lesser memory order than sequentially consistent to the Theory of Relativity, where there is no global notion of simultaneousness.
PS: That said, an atomic load cannot just read an arbitrary older value. For example, if one thread performs periodic increments (e.g. with release order) of an atomic<unsigned> variable, initialized to 0, and another thread periodically loads from this variable (e.g. with acquire order), then, except for eventual wrapping, the values seen by the latter thread must be monotonically increasing. But this follows from the given sequencing rules: Once the latter thread reads a 5, anything that happened before the increment from 4 to 5 is in the relative past of anything that follows the read of 5. In fact, a decrease other than wrapping is not even allowed for memory_order_relaxed, but this memory order does not make any promises to the relative sequencing (if any) of accesses to other variables.

The release-acquire synchronization has (at least) this guarantee: side-effects before a release on a memory location are visible after an acquire on this memory location.
There is no such guarantee if the memory location is not the same. More importantly, there's no total (think global) ordering guarantee.
Looking at the example, thread A makes thread C come out of its loop, and thread B makes thread D come out of its loop.
However, the way a release may "publish" to an acquire (or the way an acquire may "observe" a release) on the same memory location doesn't require total ordering. It's possible for thread C to observe A's release and thread D to observe B's release, and only somewhere in the future for C to observe B's release and for D to observe A's release.
The example has 4 threads because that's the minimum example you can force such non-intuitive behavior. If any of the atomic operations were done in the same thread, there would be an ordering you couldn't violate.
For instance, if write_x and write_y happened on the same thread, it would require that whatever thread observed a change in y would have to observe a change in x.
Similarly, if read_x_then_y and read_y_then_x happened on the same thread, you would observe both changed in x and y at least in read_y_then_x.
Having write_x and read_x_then_y in the same thread would be pointless for the exercise, as it would become obvious it's not synchronizing correctly, as would be having write_x and read_y_then_x, which would always read the latest x.
EDIT:
The way, I am reasoning about this is that if thread a (write_x) stores to x then all the work it has done so far is synced with any other thread that reads x with acquire ordering.
(...) I can't think of any 'run' or memory ordering where z is never incremented. Can someone explain where my reasoning is flawed?
Also, I know The loop read will always be before the if statement read because the acquire prevents this reordering.
That's sequentially consistent order, which imposes a total order. That is, it imposes that write_x and write_y both be visible to all threads one after the other; either x then y or y then x, but the same order for all threads.
With release-acquire, there is no total order. The effects of a release are only guaranteed to be visible to a corresponding acquire on the same memory location. With release-acquire, the effects of write_x are guaranteed to be visible to whoever notices x has changed.
This noticing something changed is very important. If you don't notice a change, you're not synchronizing. As such, thread C is not synchronizing on y and thread D is not synchronizing on x.
Essentially, it's way easier to think of release-acquire as a change notification system that only works if you synchronize properly. If you don't synchronize, you may or may not observe side-effects.
Strong memory model hardware architectures with cache coherence even in NUMA, or languages/frameworks that synchronize in terms of total order, make it difficult to think in these terms, because it's practically impossible to observe this effect.

Let's walk through the parallel code:
void write_x()
{
x.store(true,std::memory_order_release);
}
void write_y()
{
y.store(true,std::memory_order_release);
}
There is nothing before these instructions (they are at start of parallelism, everything that happened before also happened before other threads) so they are not meaningfully releasing: they are effectively relaxed operations.
Let's walk through the parallel code again, nothing that these two previous operations are not effective releases:
void read_x_then_y()
{
while(!x.load(std::memory_order_acquire)); // acquire what state?
if(y.load(std::memory_order_acquire))
++z;
}
void read_y_then_x()
{
while(!y.load(std::memory_order_acquire));
if(x.load(std::memory_order_acquire))
++z;
}
Note that all the loads refer to variables for which nothing is effectively released ever, so nothing is effectively acquired here: we re-acquire the visibility over the previous operations in main that are visible already.
So you see that all operations are effectively relaxed: they provide no visibility (over what was already visible). It's like doing an acquire fence just after an acquire fence, it's redundant. Nothing new is implied that wasn't already implied.
So now that everything is relaxed, all bets are off.
Another way to view that is to notice that an atomic load is not a RMW operations that leaves the value unchanged, as a RMW can be release and a load cannot.
Just like all atomic stores are part of the modification order of an atomic variable even if the variable is an effective a constant (that is a non const variable whose value is always the same), an atomic RMW operation is somewhere in the modification order of an atomic variable, even if there was no change of value (and there cannot be a change of value because the code always compares and copies the exact same bit pattern).
In the modification order you can have release semantic (even if there was no modification).
If you protect a variable with a mutex you get release semantic (even if you just read the variable).
If you make all your loads (at least in functions that do more than once operation) release-modification-loads with:
either a mutex protecting the atomic object (then drop the atomic as it's now redundant!)
or a RMW with acq_rel order,
the previous proof that all operations are effectively relaxed doesn't work anymore and some atomic operation in at least one of the read_A_then_B functions will have to be ordered before some operation in the other, as they operate on the same objects. If they are in the modification order of a variable and you use acq_rel, then you have an happen before relation between one of these (obviously which one happens before which one is non deterministic).
Either way execution is now sequential, as all operations are effectively acquire and release, that is as operative acquire and release (even those that are effectively relaxed!).

If we change two if statements to while statements, it will make the code correct and z will be guaranteed to be equal to 2.
void read_x_then_y()
{
while(!x.load(std::memory_order_acquire));
while(!y.load(std::memory_order_acquire));
++z;
}
void read_y_then_x()
{
while(!y.load(std::memory_order_acquire));
while(!x.load(std::memory_order_acquire));
++z;
}

Related

C++: std::memory_order in std::atomic_flag::test_and_set to do some work only once by a set of threads

Could you please help me to understand what std::memory_order should be used in std::atomic_flag::test_and_set to do some work only once by a set of threads and why? The work should be done by whatever thread gets to it first, and all other threads should just check as quickly as possible that someone is already going the work and continue working on other tasks.
In my tests of the example below, any memory order works, but I think that it is just a coincidence. I suspect that Release-Acquire ordering is what I need, but, in my case, only one memory_order can be used in both threads (it is not the case that one thread can use memory_order_release and the other can use memory_order_acquire since I do not know which thread will arrive to doing the work first).
#include <atomic>
#include <iostream>
#include <thread>
std::atomic_flag done = ATOMIC_FLAG_INIT;
const std::memory_order order = std::memory_order_seq_cst;
//const std::memory_order order = std::memory_order_acquire;
//const std::memory_order order = std::memory_order_relaxed;
void do_some_work_that_needs_to_be_done_only_once(void)
{ std::cout<<"Hello, my friend\n"; }
void run(void)
{
if(not done.test_and_set(order))
do_some_work_that_needs_to_be_done_only_once();
}
int main(void)
{
std::thread a(run);
std::thread b(run);
a.join();
b.join();
// expected result:
// * only one thread said hello
// * all threads spent as little time as possible to check if any
// other thread said hello yet
return 0;
}
Thank you very much for your help!
Following up on some things in the comments:
As has been discussed, there is a well-defined modification order M for done on any given run of the program. Every thread does one store to done, which means one entry in M. And by the nature of atomic read-modify-writes, the value returned by each thread's test_and_set is the value that immediately precedes its own store in the order M. That's promised in C++20 atomics.order p10, which is the critical clause for understanding atomic RMW in the C++ memory model.
Now there are a finite number of threads, each corresponding to one entry in M, which is a total order. Necessarily there is one such entry that precedes all the others. Call it m1. The test_and_set whose store is entry m1 in M must return the preceding value in M. That can only be the value 0 which initialized done. So the thread corresponding to m1 will see test_and_set return 0. Every other thread will see it return 1, because each of their modifications m2, ..., mN follows (in M) another modification, which must have been a test_and_set storing the value 1.
We may not be bothering to observe all of the total order M, but this program does determine which of its entries is first on this particular run. It's the unique one whose test_and_set returns 0. A thread that sees its test_and_set return 1 won't know whether it came 2nd or 8th or 96th in that order, but it does know that it wasn't first, and that's all that matters here.
Another way to think about it: suppose it were possible for two threads (tA, tB) both to load the value 0. Well, each one makes an entry in the modification order; call them mA and mB. M is a total order so one has to go before the other. And bearing in mind the all-important [atomics.order p10], you will quickly find there is no legal way for you to fill out the rest of M.
All of this is promised by the standard without any reference to memory ordering, so it works even with std::memory_order_relaxed. The only effect of relaxed memory ordering is that we can't say much about how our load/store will become visible with respect to operations on other variables. That's irrelevant to the program at hand; it doesn't even have any other variables.
In the actual implementation, this means that an atomic RMW really has to exclusively own the variable for the duration of the operation. We must ensure that no other thread does a store to that variable, nor the load half of a read-modify-write, during that period. In a MESI-like coherent cache, this is done by temporarily locking the cache line in the E state; if the system makes it possible for us to lose that lock (like an LL/SC architecture), abort and start again.
As to your comment about "a thread reading false from its own cache/buffer": the implementation mustn't allow that in an atomic RMW, not even with relaxed ordering. When you do an atomic RMW, you must read it while you hold the lock, and use that value in the RMW operation. You can't use some old value that happens to be in a buffer somewhere. Likewise, you have to complete the write while you still hold the lock; you can't stash it in a buffer and let it complete later.
relaxed is fine if you just need to determine the winner of the race to set the flag1, so one thread can start on the work and later threads can just continue on.
If the run_once work produces data that other threads need to be able to read, you'll need a release store after that, to let potential readers know that the work is finished, not just started. If it was instead just something like printing or writing to a file, and other threads don't care when that finishes, then yeah you have no ordering requirements between threads beyond the modification order of done which exists even with relaxed. An atomic RMW like test_and_set lets you determines which thread's modification was first.
BTW, you should check read-only before even trying to test-and-set; unless run() is only called very infrequently, like once per thread startup. For something like a static int foo = non_constant; local var, compilers use a guard variable that's loaded (with an acquire load) to see if init is already complete. If it's not, branch to code that uses an atomic RMW to modify the guard variable, with one thread winning, the rest effectively waiting on a mutex for that thread to init.
You might want something like that if you have data that all threads should read. Or just use a static int foo = something_to_run_once(), or some type other than int, if you actually have some data to init.
Or perhaps use C++11 std::call_once to solve this problem for you.
On normal systems, atomic_flag has no advantage over and atomic_bool. done.exchange(true) on a bool is equivalent to test_and_set of a flag. But atomic_bool is more flexible in terms of the operations it supports, like plain read that isn't part of an RMW test-and-set.
C++20 does add a test() method for atomic_flag. ISO C++ guarantees that atomic_flag is lock-free, but in practice so is std::atomic<bool> on all real-world systems.
Footnote 1: why relaxed guarantees a single winner
The memory_order parameter only governs ordering wrt. operations on other variables by the same thread.
Does calling test_and_set by a thread force somehow synchronization of the flag with values written by other threads?
It's not a pure write, it's an atomic read-modify-write, so the result of the one that went first is guaranteed to be visible to the one that happens to be second. That's the whole point of test-and-set as a primitive building block for mutual exclusion.
If two TAS operations could both load the original value (false), and then both store true, they would be atomic. They'd have overlapped with each other.
Two atomic RMWs on the same atomic object must happen in some order, the modification-order of that object. (Because they're not read-only: an RMW includes a modification. But also includes a read so you can see what the value was immediately before the new value; that read is tied to the modification order, unlike a plain read).
Every atomic object separately has a modification-order that all threads can agree on; this is guaranteed by ISO C++. (With orders less than seq_cst, ordering between objects can be different from source order, and not guaranteed that all threads even agree which store happened first, the IRIW problem.)
Being an atomic RMW guarantees that exactly one test_and_set will return false in thread A or B. Same for fetch_add with multiple threads incrementing a counter: the increments have to happen in some order (i.e. serialized with each other), and whatever that order is becomes the modification-order of that atomic object.
Atomic RMWs have to work this way to not lose counts. i.e. to actually be atomic.

relaxed ordering and inter thread visibility

I learnt from relaxed ordering as a signal that a store on an atomic variable should be visible to other thread in a "within a reasonnable amount of time".
That say, I am pretty sure it should happen in a very short time (some nano second ?).
However, I don't want to rely on "within a reasonnable amount of time".
So, here is some code :
std::atomic_bool canBegin{false};
void functionThatWillBeLaunchedInThreadA() {
if(canBegin.load(std::memory_order_relaxed))
produceData();
}
void functionThatWillBeLaunchedInThreadB() {
canBegin.store(true, std::memory_order_relaxed);
}
Thread A and B are within a kind of ThreadPool, so there is no creation of thread or whatsoever in this problem.
I don't need to protect any data, so acquire / consume / release ordering on atomic store/load are not needed here (I think?).
We know for sure that the functionThatWillBeLaunchedInThreadAfunction will be launched AFTER the end of the functionThatWillBeLaunchedInThreadB.
However, in such a code, we don't have any guarantee that the store will be visible in the thread A, so the thread A can read a stale value (false).
Here are some solution I think about.
Solution 1 : Use volatility
Just declare volatile std::atomic_bool canBegin{false}; Here the volatileness guarantee us that we will not see stale value.
Solution 2 : Use mutex or spinlock
Here the idea is to protect the canBegin access via a mutex / spinlock that guarantee via a release/acquire ordering that we will not see a stale value.
I don't need canGo to be an atomic either.
Solution 3 : not sure at all, but memory fence?
Maybe this code will not work, so, tell me :).
bool canGo{false}; // not an atomic value now
// in thread A
std::atomic_thread_fence(std::memory_order_acquire);
if(canGo) produceData();
// in thread B
canGo = true;
std::atomic_thread_fence(std::memory_order_release);
On cpp reference, for this case, it is write that :
all non-atomic and relaxed atomic stores that are sequenced-before FB
in thread B will happen-before all non-atomic and relaxed atomic loads
from the same locations made in thread A after FA
Which solution would you use and why?
There's nothing you can do to make a store visible to other threads any sooner. See If I don't use fences, how long could it take a core to see another core's writes? - barriers don't speed up visibility to other cores, they just make this core wait until that's happened.
The store part of an RMW is not different from a pure store for this, either.
(Certainly on x86; not totally sure about other ISAs, where a relaxed LL/SC might possibly get special treatment from the store buffer, possibly being more likely to commit before other stores if this core can get exclusive ownership of the cache line. But I think it still would have to retire from out-of-order execution so the core knows it's not speculative.)
Anthony's answer that was linked in comment is misleading; as I commented there:
If the RMW runs before the other thread's store commits to cache, it doesn't see the value, just like if it was a pure load. Does that mean "stale"? No, it just means that the store hasn't happened yet.
The only reason RMWs need a guarantee about "latest" value is that they're inherently serializing operations on that memory location. This is what you need if you want 100 unsynchronized fetch_add operations to not step on each other and be equivalent to += 100, but otherwise best-effort / latest-available value is fine, and that's what you get from a normal atomic load.
If you require instant visibility of results (a nanosecond or so), that's only possible within a single thread, like x = y; x += z;
Also note, the C / C++ standard requirement (actually just a note) to make stores visible in a reasonable amount of time is in addition to the requirements on ordering of operations. It doesn't mean seq_cst store visibility can be delayed until after later loads. All seq_cst operations happen in some interleaving of program order across all threads.
On real-world C++ implementations, the visibility time is entirely up to hardware inter-core latency. But the C++ standard is abstract, and could in theory be implemented on a CPU that required manual flushing to make stores visible to other threads. Then it would be up to the compiler to not be lazy and defer that for "too long".
volatile atomic<T> is useless; compilers already don't optimize atomic<T>, so every atomic access done by the abstract machine will already happen in the asm. (Why don't compilers merge redundant std::atomic writes?). That's all that volatile does, so volatile atomic<T> compiles to the same asm as atomic<T> for anything you can with the atomic.
Defining "stale" is a problem because separate threads running on separate cores can't see each other's actions instantly. It takes tens of nanoseconds on modern hardware to see a store from another thread.
But you can't read "stale" values from cache; that's impossible because real CPUs have coherent caches. (That's why volatile int could be used to roll your own atomics before C++11, but is no longer useful.) You may need an ordering stronger than relaxed to get the semantics you want as far as one value being older than another (i.e. "reordering", not "stale"). But for a single value, if you don't see a store, that means your load executed before the other core took exclusive ownership of the cache line in order to commit its store. i.e. that the store hasn't truly happened yet.
In the formal ISO C++ rules, there are guarantees about what value you're allowed to see which effectively give you the guarantees you'd expect from cache coherency for a single object, like that after a reader sees a store, further loads in this thread won't see some older store and then eventually back to the newest store. (https://eel.is/c++draft/intro.multithread#intro.races-19).
(Note for 2 writers + 2 readers with non-seq_cst operations, it's possible for the readers to disagree about the order in which the stores happened. This is called IRIW reordering, but most hardware can't do it; only some PowerPC. Will two atomic writes to different locations in different threads always be seen in the same order by other threads? - so it's not always quite as simple as "the store hasn't happened yet", it be visible to some threads before others. But it's still true that you can't speed up visibility, only for example slow down the readers so none of them see it via the "early" mechanism, i.e. with hwsync for the PowerPC loads to drain the store buffer first.)
We know for sure that the functionThatWillBeLaunchedInThreadAfunction
will be launched AFTER the end of the
functionThatWillBeLaunchedInThreadB.
First of all, if this is the case then it's likely that your task queue mechanism takes care of the necessary synchronization already.
On to the answer...
By far the simplest thing to do is acquire/release ordering. All the solutions you gave are worse.
std::atomic_bool canBegin{false};
void functionThatWillBeLaunchedInThreadA() {
if(canBegin.load(std::memory_order_acquire))
produceData();
}
void functionThatWillBeLaunchedInThreadB() {
canBegin.store(true, std::memory_order_release);
}
By the way, shouldn't this be a while loop?
void functionThatWillBeLaunchedInThreadA() {
while (!canBegin.load(std::memory_order_acquire))
{ }
produceData();
}
I don't need to protect any data, so acquire / consume / release
ordering on atomic store/load are not needed here (I think?)
In this case, the ordering is required to keep the compiler/CPU/memory subsystem from ordering the canBegin store true before the previous reads/writes have completed. And it should actually stall the CPU until it can be guaranteed that every write that comes before in program order will propagate before the store to canBegin. On the load side it prevents memory from being read/written before canBegin is read as true.
However, in such a code, we don't have any guarantee that the store
will be visible in the thread A, so the thread A can read a stale
value (false).
You said yourself:
a store on an atomic variable should be visible to other thread in a
"within a reasonnable amount of time".
Even with relaxed memory order, a write is guaranteed to eventually reach the other cores and all cores will eventually agree on any given variable's store history, so there are no stale values. There are only values that haven't propagated yet. What's "relaxed" about it is the store order in relation to other variables. Thus, memory_order_relaxed solves the stale read problem (but doesn't address the ordering required as discussed above).
Don't use volatile. It doesn't provide all the guarantees required of atomics in the C++ memory model, so using it would be undefined behavior. See https://en.cppreference.com/w/cpp/atomic/memory_order#Relaxed_ordering at the bottom to read about it.
You could use a mutex or spinlock, but a mutex operation is much more expensive than a lock-free std::atomic acquire-load/release-store. A spinlock will do at least one atomic read-modify-write operation...and possibly many. A mutex is definitely overkill. But both have the benefit of simplicity in the C++ source. Most people know how to use locks so it's easier to demonstrate correctness.
A memory fence will also work but your fences are in the wrong spot (it's counter-intuitive) and the inter-thread communication variable should be std::atomic. (Careful when playing these games...! It's easy to get undefined behavior) Relaxed ordering is ok thanks to the fences.
std::atomic<bool> canGo{false}; // MUST be atomic
// in thread A
if(canGo.load(std::memory_order_relaxed))
{
std::atomic_thread_fence(std::memory_order_acquire);
produceData();
}
// in thread B
std::atomic_thread_fence(std::memory_order_release);
canGo.store(true, memory_order_relaxed);
The memory fences are actually more strict than acquire/release ordering on the std::atomicload/store so this gains nothing and could be more expensive.
It seems like you really want to avoid overhead with your signaling mechanism. This is exactly what the std::atomic acquire/release semantics were invented for! You are worrying too much about stale values. Yes, an atomic RMW will give you the "latest" value, but they're also very expensive operations themselves. I want to give you an idea of how fast acquire/release is. It's most likely that you're targeting x86. x86 has total store order and word-sized loads/stores are atomic, so an load acquire compiles to just a regular load and and a release store compiles to a regular store. So it turns out that almost everything in this long post will probably compile to exactly the same code anyway.

Synchronizing against relaxed atomics

I have an allocator that does relaxed atomics to track the number of bytes currently allocated. They're just adds and subtracts so I don't need any synchronization between threads other than ensuring the modifications are atomic.
However, I occasionally want to check the number of allocated bytes (e.g. when shutting down the program) and I want to ensure any pending writes are committed. I assume I need a full memory barrier in this case to prevent any previous writes from being moved after the barrier and to prevent the next read from being moved before the barrier.
The question is: what is the proper way to ensure the relaxed atomic writes are committed before reading? Is my current code correct? (Assume functions and types map to std library constructs as expected.)
void* Allocator::Alloc(size_t bytes, size_t alignment)
{
void* p = AlignedAlloc(bytes, alignment);
AtomicFetchAdd(&allocatedBytes, AlignedMsize(p), MemoryOrder::Relaxed);
return p;
}
void Allocator::Free(void* p)
{
AtomicFetchSub(&allocatedBytes, AlignedMsize(p), MemoryOrder::Relaxed);
AlignedFree(p);
}
size_t Allocator::GetAllocatedBytes()
{
AtomicThreadFence(MemoryOrder::AcqRel);
return AtomicLoad(&allocatedBytes, MemoryOrder::Relaxed);
}
And some type definitions for context
enum struct MemoryOrder
{
Relaxed = 0,
Consume = 1,
Acquire = 2,
Release = 3,
AcqRel = 4,
SeqCst = 5,
};
struct Allocator
{
void* Alloc (size_t bytes, size_t alignment);
void Free (void* p);
size_t GetAllocatedBytes();
Atomic<size_t> allocatedBytes = { 0 };
};
I don't want to simply default to sequential consistency as I'm trying to understand memory ordering better.
The part that's really tripping me up is that in the standard under [atomics.fences] all the points talk about an acquire fence/atomic op synchronizing with a release fence/atomic op. It's entirely opaque to me whether an acquire fence/atomic op will synchronize with a relaxed atomic op on another thread. If an AcqRel fence function literally maps to an mfence instruction, it seems that the above code will be fine. However, I'm having a hard time convincing myself the standard guarantees this. Namely,
4 An atomic operation A that is a release operation on an atomic
object M synchronizes with an acquire fence B if there exists some
atomic operation X on M such that X is sequenced before B and reads
the value written by A or a value written by any side effect in the
release sequence headed by A.
This seems to make it clear that the fence will not synchronize with the relaxed atomic writes. On the other hand, a full fence is both a release and an acquire fence, so it should synchronize with itself, right?
2 A release fence A synchronizes with an acquire fence B if there
exist atomic operations X and Y, both operating on some atomic object
M, such that A is sequenced before X, X modifies M, Y is sequenced
before B, and Y reads the value written by X or a value written by any
side effect in the hypothetical release sequence X would head if it
were a release operation.
The scenario described is
Unsequenced writes
A release fence
X atomic write
Y atomic read
B acquire fence
Unsequenced reads (unsequenced writes will be visible here)
However, in my case I don't have the atomic write + atomic read as a signal between the threads and the release fence happens with the acquire fence on thread B. So what's actually happening is
Unsequenced writes
A release fence
B acquire fence
Unsequenced reads
Clearly if the fence executes before an unsequenced write begins it's a race and all bets are off. But it seems to me that if the fence executes after an unsequenced write begins but before it is committed it will be forced to finish before the unsequenced reads. This is exactly that I want, but I can't glean whether this is guaranteed by the standard.
Let's say you spawn Thread A, which calls Allocator::Alloc(), then immediately spawn Thread B, which calls Allocator::GetAllocatedBytes(). Those two Allocator calls are now running concurrently. You don't know which one will actually happen first, because there's no ordering between them. Your only guarantee is that either Thread B will see the value of allocatedBytes before Thread A modifies it, or it will see the value of allocatedBytes after Thread A modifies it. You won't know which value Thread B saw until after GetAllocatedBytes() returns. (At least Thread B won't see a totally garbage value for allocatedBytes, because there's no data race on it thanks to your use of relaxed atomics.)
You seem to be concerned about the case where Thread A got as far as AtomicFetchAdd(), but for some reason, the change is not visible when Thread B calls AtomicLoad(). But so what? That's no different from the outcome where GetAllocatedBytes() runs entirely before AtomicFetchAdd(). And that's a totally valid outcome. Remember, either Thread B sees the modified value, or it doesn't.
Even if you change all the atomic operations/fences to MemoryOrder::SeqCst, it won't make any difference. In the scenario I described, Thread B can still either see the modified value or the unmodified value of allocatedBytes, because the two Allocator calls run concurrently.
As long as you insist on calling GetAllocatedBytes() while other threads are still calling Alloc() and Free(), that's really the most you can expect. If you want to get a more "accurate" value, just don't allow any concurrent calls to Alloc()/Free() while GetAllocatedBytes() is running! For example, if the program is shutting down, just join all the other threads before calling GetAllocatedBytes(). That'll give you an accurate number of allocated bytes at shutdown. The C++ standard even guarantees it, because the completion of a thread synchronizes with the call to join().
This will not work properly, acq_rel memory order is designed specifically for CAS and FAA memory operations that "simulanously" read and write atomic data. In Your case You want to enforce memory synchronization before load. To do this You need to change memory order of You fetchAndAdd and fetchAndSub to acq_rel and Your load to acquire. This may seem much, but on x86 it has very little cost (some compiler optimizations) as it does not generate any new instructions into the code. As of how acquire-release synchronization works I recomend this article: http://preshing.com/20120913/acquire-and-release-semantics/
I removed info about sequential ordering as it should be used for all the operations to work properly and would be an overkill.
From my understanding of C++ atomics relaxed memory order makes sense when used in combination with other atomic operations using memory fences. For example in some situations atomic a may be stored in relaxed manner, as atomic b is written with release memory order and so on.
If your question is what is the proper way to ensure the relaxed atomic writes are committed before reading this same atomic object? Nothing, this is ensured by the language, [intro.multithread]:
All modifications to a particular atomic object M occur in some particular total order, called the modification order of M.
All threads see the same modification order. For exemple imagine that 2 allocation happens in 2 different threads and then you read the counter in a third thread.
In the first thread, the atomic is incremented by 1 bytes, and the relaxed read/modify (AtomicFetchAdd) expression return 0: the counter made this transition: 0->1.
In the second thread, the atomic is incremented by 2 bytes, and the relaxed read/modify expression return 1: the counter make this transition: 1->3. There is no way the read/modify expression return 0. This thread cannot see a transition 0->2 because the other thread has performed the transition 0->1.
Then in a third thread you perform a relaxed load. The only possible values that may be loaded are 0,1 or 3. It is not possible to load 2. The modification order of the atomic is 0 -> 1 -> 3. And the observer thread will also see this modification order.

confused about atomic class: memory_order_relaxed

I am studying this site: https://gcc.gnu.org/wiki/Atomic/GCCMM/AtomicSync, which is very helpful to understand the topic about atomic class.
But this example about relaxed mode is hard to understand:
/*Thread 1:*/
y.store (20, memory_order_relaxed)
x.store (10, memory_order_relaxed)
/*Thread 2*/
if (x.load (memory_order_relaxed) == 10)
{
assert (y.load(memory_order_relaxed) == 20) /* assert A */
y.store (10, memory_order_relaxed)
}
/*Thread 3*/
if (y.load (memory_order_relaxed) == 10)
assert (x.load(memory_order_relaxed) == 10) /* assert B */
To me assert B should never fail, since x must be 10 and y=10 because of thread 2 has conditioned on this.
But the website says either assert in this example can actually FAIL.
To me assert B should never fail, since x must be 10 and y=10 because of thread 2 has conditioned on this.
In effect, your argument is that since in thread 2 the store of 10 into x occurred before the store of 10 into y, in thread 3 the same must be the case.
However, since you are only using relaxed memory operations, there is nothing in the code that requires two different threads to agree on the ordering between modifications of different variables. So indeed thread 2 may see the store of 10 into x before the store of 10 into y while thread 3 sees those two operations in the opposite order.
In order to ensure that assert B succeeds, you would, in effect, need to ensure that when thread 3 sees the value 10 of y, it also sees any other side effects performed by the thread that stored 10 into y before the time of the store. That is, you need the store of 10 into y to synchronize with the load of 10 from y. This can be done by having the store perform a release and the load perform an acquire:
// thread 2
y.store (10, memory_order_release);
// thread 3
if (y.load (memory_order_acquire) == 10)
A release operation synchronizes with an acquire operation that reads the value stored. Now because the store in thread 2 synchronizes with the load in thread 3, anything that happens after the load in thread 3 will see the side effects of anything that happens before the store in thread 2. Hence the assertion will succeed.
Of course, we also need to make sure assertion A succeeds, by making the x.store in thread 1 use release and the x.load in thread 2 use acquire.
I find it much easier to understand atomics with some knowledge of what might be causing them, so here's some background knowledge. Know that these concepts are in no way stated in the C++ language itself, but is some of the possible reasons why things are the way they are.
Compiler reordering
Compilers, often when optimizing, will choose to refactor the program as long as its effects are the same on a single threaded program. This is circumvented with the use of atomics, which will tell the compiler (among other things) that the variable might change at any moment, and that its value might be read elsewhere.
Formally, atomics ensures one thing: there will be no data races. That is, accessing the variable will not make your computer explode.
CPU reordering
CPU might reorder instructions as it is executing them, which means the instructions might get reordered on the hardware level, independent of how you wrote the program.
Caching
Finally there are effects of caches, which are faster memory that sorta contains a partial copy of the global memory. Caches are not always in sync, meaning they don't always agree on what is "correct". Different threads may not be using the same cache, and due to this, they may not agree on what values variables have.
Back to the problem
What the above surmounts to is pretty much what C++ says about the matter: unless explicitly stated otherwise, the ordering of side effects of each instruction is totally and completely unspecified. It might not even be the same viewed from different threads.
Formally, the guarantee of an ordering of side effects is called a happens-before relation. Unless a side effect happens-before another, it is not. Loosely, we just say call it synchronization.
Now, what is memory_order_relaxed? It is telling the compiler to stop meddling, but don't worry about how the CPU and cache (and possibly other things) behave. Therefore, one possibility of why you see the "impossible" assert might be
Thread 1 stores 20 to y and then 10 to x to its cache.
Thread 2 reads the new values and stores 10 to y to its cache.
Thread 3 didn't read the values from thread 1, but reads those of thread 2, and then the assert fails.
This might be completely different from what happens in reality, the point is anything can happen.
To ensure a happens-before relation between the multiple reads and writes, see Brian's answer.
Another construct that provides happens-before relations is std::mutex, which is why they are free from such insanities.
The answer to your question is the C++ standard.
The section [intro.races] is surprisingly very clear (which is not the rule of normative text kind: formalism consistency oftenly hurts readability).
I have read many books and tuto which treats the subject of memory order, but it just confused me.
Finally I have read the C++ standard, the section [intro.multithread] is the clearest I have found. Taking the time to read it carefully (twice) may save you some time!
The answer to your question is in [intro.races]/4:
All modifications to a particular atomic object M occur in some particular total order, called the modification
order of M. [ Note: There is a separate order for each atomic object. There is no requirement that these can
be combined into a single total order for all objects. In general this will be impossible since different threads
may observe modifications to different objects in inconsistent orders. — end note ]
You were expecting a single total order on all atomic operations. There is such an order, but only for atomic operations that are memory_order_seq_cst as explained in [atomics.order]/3:
There shall be a single total order S on all memory_order_seq_cst operations, consistent with the “happens
before” order and modification orders for all affected locations [...]

Understanding atomic variables and operations

I read about boost's and std's (c++11) atomic type and operations over and over again and still I'm not sure I understand it right (and at some cases I don't understand it at all). So, I have a few questions about it.
My sources I use for learning:
Boost documentation: http://www.boost.org/doc/libs/1_53_0/doc/html/atomic.html
http://www.developerfusion.com/article/138018/memory-ordering-for-atomic-operations-in-c0x/
Consider following snippet:
atomic<bool> x,y;
void write_x_then_y()
{
x.store(true, memory_order_relaxed);
y.store(true, memory_order_release);
}
#1: Is it equivalent to this next one?
atomic<bool> x,y;
void write_x_then_y()
{
x.store(true, memory_order_relaxed);
atomic_thread_fence(memory_order_release); // *1
y.store(true, memory_order_relaxed); // *2
}
#2: Is following statement true?
Line *1 assures, that when operations done under this line (for example *2) are visible (for other thread using acquire), code above *1 will be visible too (with new values).
Next snipped extends the ones above:
void read_y_then_x()
{
if(y.load(memory_order_acquire))
{
assert(x.load(memory_order_relaxed));
}
}
#3: Is it equivalent to this next one?
void read_y_then_x()
{
atomic_thread_fence(memory_order_acquire); // *3
if(y.load(memory_order_relaxed)) // *4
{
assert(x.load(memory_order_relaxed)); // *5
}
}
#4: Are following statements true?
Line *3 assures that if some operations under release order (in other thread, like *2) is visible, every operation above the release order (for example *1) will be visible as well.
That means that assert at *5 will never fail (with false as default values).
But this does not assure that even if physically (in processor) *2 happens before before *3, it will be visible by snipped above (running in different thread) - function read_y_then_x() still can read old values. Only thing which is assured is, that if y is true, x will be also true.
#5: Incrementing (operation of adding 1) to an atomic integer can be memory_order_relaxed and no data are lost. Only problem is order and time of visibility of result.
According boost, following snipped is working reference counter:
#include <boost/intrusive_ptr.hpp>
#include <boost/atomic.hpp>
class X {
public:
typedef boost::intrusive_ptr<X> pointer;
X() : refcount_(0) {}
private:
mutable boost::atomic<int> refcount_;
friend void intrusive_ptr_add_ref(const X * x)
{
x->refcount_.fetch_add(1, boost::memory_order_relaxed);
}
friend void intrusive_ptr_release(const X * x)
{
if (x->refcount_.fetch_sub(1, boost::memory_order_release) == 1) {
boost::atomic_thread_fence(boost::memory_order_acquire);
delete x;
}
}
};
#6 Why is for decrementing used memory_order_release? How it works (in the context)? If what I wrote earlier is true, what makes returned value the most recent, especially when we use acquire AFTER reading and not before/during?
#7 Why there is acquire order after reference counter reach zero? We just read that the counter is zero and there is no other atomic variable used (pointer itself is not marked/used as such).
1: No. A release fence synchronizes with all acquire operations and fences. If there was a third atomic<bool> z which was being manipulated in a third thread, the fence would synchronize with that third thread as well, which is unnecessary. That being said, they will act the same on x86, but that is because x86 has very strong synchronization. The architectures used on 1000 core systems tend to be weaker.
2: Yes, this is correct. A fence ensures that if you see anything that follows, you also see everything that preceded.
3: In general they are different, but realistically they will be the same. The compiler is allowed to reorder two relaxed operations on different variables, but may not introduce spurious operations. If the compiler has any way of being confident that it is going to need to read x, it may do so before reading y. In your particular case, this is very difficult for the compiler, but there are many similar cases where such reordering is fair game.
4: All of those are true. The atomic operations guarantee consistency. They do not always guarantee that things happen in an order you wanted, they just prevent pathological orders that ruin your algorithm.
5: Correct. Relaxed operations are truly atomic. They just don't synchronize any additional memory
6: For any given atomic object M, C++ guarantees that there is an "official" order for operations on M. You don't get to see the "latest" value for M so much as C++ and the processor guarantee that all threads will see a consistent series of values for M. If two threads increment the refcount, then decrement it, there is no guarentee which one will decrement it to 0, but there is a guarentee that one of them will see that it decremented it to 0. There is no way for both of them to see that they decremented 2->1 and 2->1, but somehow the refcount combined them to 0. One thread will always see 2->1 and the other will see 1->0.
Remember, memory order is more about synchronizing the memory around the atomic. The atomic gets handled properly no matter what memory order you use.
7: This one is trickier. The short version for 7 is that decrement is release order because some thread is going to have to run the destructor for x, and we want to make sure it sees all operations on x made on all threads. Using release order on the destructor satisfies this need because you can prove that it works. Whoever is responsible for deleting x acquires all changes before doing so (using a fence to make sure atomics in the deleter don't drift upward). In all cases where threads release their own references, it is obvious that all threads will have a release-order decrement before the deleter gets called. In cases where one thread increments the refcount and another decrements it, you can prove that the only valid way to do so is if the threads synchronize with eachother, so that the destructor sees the result of both threads. Failure to synchronize would create a race case no matter what, so the user is obliged to get it right.
1
After pondering over #1 i have been convinced they are not equivalent by this argument §29.8.3 in [atomics.fences]:
A release fence A synchronizes with an atomic operation B that performs an acquire operation on an atomic
object M if there exists an atomic operation X such that A is sequenced before X, X modifies M, and B
reads the value written by X or a value written by any side effect in the hypothetical release sequence X
would head if it were a release operation.
This paragraph says that a release fence can be synchronized only with an aquire operation. But release operation can be in addition syncronized with consume operation.
Your void read_y_then_x() with the acquire fence has the fence in the wrong place. It should be placed between the two atomic loads. An acquire fence essentially makes all the load above the fence act somewhat like acquire loads, with the exception the happens before isn't established until you executed the fence.