Can an operation reorder to before memory_order_release? - C++

Can y.store be reordered to before x.store? The standard says that any atomic operation that happens before a memory_order_release cannot be reordered to after it, but it doesn't state that an operation that happens after the memory_order_release cannot be reordered to before it.
If that can happen, is the Listing 5.12 example from the C++ Concurrency in Action book wrong?
std::atomic<bool> x,y;
std::atomic<int> z;
void write_x_then_y()
{
x.store(true,std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_release);
y.store(true,std::memory_order_relaxed);
}

The standard does not define memory fences in terms of how operations are ordered around them.
It defines how a release fence (or operation) synchronizes with an acquire fence (or operation) when the right conditions are met.
In your example, if an acquire operation observes the value stored by y, it is guaranteed that it also observes the value stored by x.
Under those conditions, the store to x is ordered before the store to y.
Other than that, it is hard to speculate how fences impose ordering since a lot can happen under the as-if rule.
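To make that concrete, here is a minimal reader to pair with write_x_then_y above (my own sketch, using the same x and y; the acquire load is what establishes the synchronization):
void read_y_then_x()
{
    if(y.load(std::memory_order_acquire)) // observes the store to y
    {
        // The release fence in the writer synchronizes with this acquire
        // load, so the store to x is guaranteed to be visible here.
        assert(x.load(std::memory_order_relaxed));
    }
}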

Related

When is a memory_order_seq_cst fence useful?

C++ supports atomic thread fences, i.e. fences guaranteeing properties for threads that use std::atomic<> operations, via the function atomic_thread_fence. It takes a memory order parameter to adjust the "strength" of the fence.
I understand that fences are useful when not all atomic operations are done with a "strong" order:
when not all atomic reads (1) in a thread are acquire operations, you may find a use for an acquire fence;
when not all atomic modifications (1) in a thread are release operations, you may find a use for a release fence.
(1) that includes RMW operations
So the usefulness of all these (acquire, release and acq_rel fences) is obvious: they allow threads that use atomic operations weaker than acq/rel (respectively) to synchronize properly.
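For a concrete instance (a sketch of my own, assuming another thread publishes some data and then release-stores both flags), a single acquire fence can stand in for making every load an acquire:
std::atomic<bool> ready1{false}, ready2{false};

// Instead of making both loads acquire operations:
//   while(!ready1.load(std::memory_order_acquire));
//   while(!ready2.load(std::memory_order_acquire));
// one acquire fence can cover both relaxed loads:
while(!ready1.load(std::memory_order_relaxed));
while(!ready2.load(std::memory_order_relaxed));
std::atomic_thread_fence(std::memory_order_acquire);
// Everything the writer did before its release store/fence is visible now.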
But I don't understand where memory_order_seq_cst could be specifically needed as a fence:
What's the implication of using weaker than memory_order_seq_cst atomic operations and a memory_order_seq_cst fence?
What would specifically be guaranteed (in terms of possible ordering of atomic operations) by a memory_order_seq_cst fence that wouldn't be guaranteed by memory_order_acq_rel?
No, a seq-cst fence is not only both a release and an acquire fence; it also provides some additional properties (see Working Draft, Standard for Programming Language C++, 32.4.4-32.4.8). A seq-cst fence is also part of the single total order S of all sequentially consistent operations, enforcing the following observations:
For an atomic operation B that reads the value of an atomic object M, if there is a memory_order_seq_cst fence X sequenced before B, then B observes either the last memory_order_seq_cst modification of M preceding X in the total order S or a later modification of M in its modification order.
For atomic operations A and B on an atomic object M, where A modifies M and B takes its value, if there is a memory_order_seq_cst fence X such that A is sequenced before X and B follows X in S, then B observes either the effects of A or a later modification of M in its modification order.
For atomic operations A and B on an atomic object M, where A modifies M and B takes its value, if there are memory_order_seq_cst fences X and Y such that A is sequenced before X, Y is sequenced before B, and X precedes Y in S, then B observes either the effects of A or a later modification of M in its modification order.
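A classic way to see what this buys you over acq_rel (my own sketch, not part of the original answer) is the store-buffering pattern:
std::atomic<int> x{0}, y{0};
int r1, r2;

void thread1()
{
    x.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r1 = y.load(std::memory_order_relaxed);
}

void thread2()
{
    y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_relaxed);
}
After both threads have finished, r1 == 0 && r2 == 0 is impossible: whichever fence is earlier in the total order S, the third rule above forces the other thread's load to observe the store. With acq_rel fences (which are not part of S) that outcome remains allowed.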
As a real-world example, I am using seq-cst fences in my hazard pointer implementation: https://github.com/mpoeter/xenium/blob/master/xenium/reclamation/impl/hazard_pointer.hpp
The thread acquiring a safe reference to some object uses a seq-cst fence after storing the hazard pointer, but before re-reading the pointer to the object. The thread trying to reclaim objects uses a seq-cst fence before gathering the active hazard pointers from all threads. Based on the rules above, this ensures that either the thread trying to reclaim the object sees that some other thread holds a HP for this object (i.e., the object is still in use), or the reload in the thread trying to acquire the safe reference returns a different pointer, indicating to that thread that the object has been removed and it has to retry.
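A stripped-down sketch of that protocol (names and structure are my own simplification, not the actual xenium code):
struct Node { /* payload */ };

std::atomic<Node*> shared{nullptr};      // pointer being protected
std::atomic<Node*> hazard_slot{nullptr}; // this thread's published hazard pointer

// Reader: publish the hazard pointer, fence, then re-read the pointer.
Node* acquire_safe_reference()
{
    Node* p = shared.load(std::memory_order_relaxed);
    for (;;) {
        hazard_slot.store(p, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst); // pairs with the reclaimer's fence
        Node* q = shared.load(std::memory_order_relaxed);
        if (q == p)
            return p; // the HP was visible before the object could be scanned for reclamation
        p = q;        // the object was removed; retry with the new pointer
    }
}

// Reclaimer: called after the node has been unlinked from 'shared'.
bool try_reclaim(Node* p)
{
    std::atomic_thread_fence(std::memory_order_seq_cst); // pairs with the reader's fence
    if (hazard_slot.load(std::memory_order_relaxed) == p)
        return false; // some thread still holds a HP for p
    delete p;
    return true;
}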

Is atomic_thread_fence(memory_order_release) different from using memory_order_acq_rel?

cppreference.com provides this note about std::atomic_thread_fence (emphasis mine):
atomic_thread_fence imposes stronger synchronization constraints than an atomic store operation with the same std::memory_order.
While an atomic store-release operation prevents all preceding writes from moving past the store-release, an atomic_thread_fence with memory_order_release ordering prevents all preceding writes from moving past all subsequent stores.
I understand this note to mean that std::atomic_thread_fence(std::memory_order_release) is not unidirectional, like a store-release. It's a bidirectional fence, preventing stores on either side of the fence from reordering past a store on the other side of the fence.
If I understand that correctly, this fence seems to make the same guarantees that atomic_thread_fence(memory_order_acq_rel) does. It is an "upward" fence, and a "downward" fence.
Is there a functional difference between std::atomic_thread_fence(std::memory_order_release) and std::atomic_thread_fence(std::memory_order_acq_rel)? Or is the difference merely aesthetic, to document the purpose of the code?
A standalone fence imposes stronger ordering than an atomic operation with the same ordering constraint, but this does not change the direction in which ordering is enforced.
Both an atomic release operation and a standalone release fence are uni-directional,
but the atomic operation orders memory operations with respect to itself, whereas the fence imposes ordering with respect to other stores.
For example, an atomic operation with release semantics:
std::atomic<int> sync{0};
// memory operations A
sync.store(1, std::memory_order_release);
// store B
This guarantees that no memory operation that is part of A (loads & stores) can be (visibly) reordered with the atomic store itself.
But it is uni-directional and no ordering rules apply to memory operations that are sequenced after the atomic operation; therefore, store B can still be reordered with any of the memory operations in A.
A standalone release fence changes this behavior:
// memory operations A
std::atomic_thread_fence(std::memory_order_release);
// load X
sync.store(1, std::memory_order_relaxed);
// stores B
This guarantees that no memory operation in A can be (visibly) reordered with any of the stores that are sequenced after the release fence.
Here, the stores in B can no longer be reordered with any of the memory operations in A, and as such, the release fence is stronger than the atomic release operation.
But it is also uni-directional, since the load from X can still be reordered with any memory operation in A.
The difference is subtle and usually an atomic release operation is preferred over a standalone release fence.
The rules for a standalone acquire fence are similar, except that it enforces ordering in the opposite direction and operates on loads:
// loads B
sync.load(std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_acquire);
// memory operations A
No memory operation in A can be reordered with any load that is sequenced before the standalone acquire fence.
A standalone fence with std::memory_order_acq_rel ordering combines the logic for both acquire and release fences.
// memory operations A
// load A
std::atomic_thread_fence(std::memory_order_acq_rel);
// store B
// memory operations B
But this can get incredibly tricky once you realize that a store in A can still be reordered with a load in B.
Acq/rel fences should probably be avoided in favor of regular atomic operations, or even better, mutexes.
cppreference.com made some mistakes in the paragraph you quoted. I have highlighted them in the following:
atomic_thread_fence imposes stronger synchronization constraints than an atomic store operation with the same std::memory_order. While an atomic store-release operation prevents all preceding writes (should be: memory operations, i.e. including reads and writes) from moving past the store-release (the complete phrase should be: the store-release operation itself), an atomic_thread_fence with memory_order_release ordering prevents all preceding writes (should be: memory operations, i.e. including reads and writes) from moving past all subsequent stores.
To paraphrase it:
The release operation actually places fewer memory ordering constraints on neighboring operations than the release fence. A release operation only needs to prevent preceding memory operations from being reordered past itself, but a release fence must prevent preceding memory operations from being reordered past all subsequent writes. Because of this difference, a release operation can never take the place of a release fence.
This is quoted from here.
This is my interpretation of the intent of the following text. That interpretation is correct in terms of the memory model, but it is still a bad explanation because it is incomplete.
While an atomic store-release operation prevents all preceding writes
from moving past the store-release, an atomic_thread_fence with
memory_order_release ordering prevents all preceding writes from
moving past all subsequent stores.
The use of "store" vs. "writes" is intentional:
"store", here, means a store on an std::atomic<> object (not just a call to std::atomic<>::store, also assignment which is equivalent to .store(value) or a RMW atomic operation);
"write", here, means any memory write, either normal (non atomic) or atomic.
It's a bidirectional fence, preventing stores on either side of the
fence from reordering past a store on the other side of the fence.
No, you missed an essential distinction, because it was only implied, and expressed in an unclear, too-subtle way - not good for a teaching text!
It says that a release fence is not symmetric: previous memory side effects, called "writes", are bound by subsequent atomic store operations.
Even with that clarification it's incomplete, and so it's a bad explanation: it strongly suggests that release fences exist just to make sure that writes (and writes only) are finished. That is not the case.
A release operation is what I call an "I'm done there" signal. It signals that all previous memory operations are done, finished, visible. It's important to understand that not only modifications (which can be detected by looking at memory state) are ordered; everything touching memory needs to be.
Many write-ups about thread primitives are defective in that way.

C++ memory model: do seq_cst loads synchronize with seq_cst stores?

In the C++ memory model, there is a total order on all loads and stores of all sequentially consistent operations. I'm wondering how this interacts with operations that have other memory orderings that are sequenced before/after sequentially consistent loads.
For example, consider two threads:
std::atomic<int> a(0);
std::atomic<int> b(0);
std::atomic<int> c(0);
//////////////
// Thread T1
//////////////
// Signal that we've started running.
a.store(1, std::memory_order_relaxed);
// If T2's store to b occurs before our load below in the total
// order on sequentially consistent operations, set flag c.
if (b.load(std::memory_order_seq_cst) == 1) {
c.store(1, std::memory_order_relaxed);
}
//////////////
// Thread T2
//////////////
// Blindly write to b.
b.store(1, std::memory_order_seq_cst);
// Has T1 set c? If so, then we know our store to b occurred before T1's load
// in the total order on sequentially consistent operations.
if (c.load(std::memory_order_relaxed) == 1) {
// But is this guaranteed to be visible yet?
assert(a.load(std::memory_order_relaxed) == 1);
}
Is it guaranteed that the assertion in T2 cannot fire?
I'm looking for detailed citations of the standard here. In particular, I think this would require showing that the load from b in T1 synchronizes with the store to b in T2, in order to establish that the store to a inter-thread happens before the load from a; but as far as I can tell, the standard says that memory_order_seq_cst stores synchronize with loads, but not the other way around.
Do seq_cst loads synchronize with seq_cst stores?
They do if all the necessary requirements are met; in your example code, the assert can fire.
§29.3.3
There shall be a single total order S on all memory_order_seq_cst operations
This total order applies to the seq_cst operations themselves. In isolation, a store(seq_cst) has release semantics, whereas a load(seq_cst) has acquire semantics.
§29.3.1-2 [atomics.order]
memory_order_release, memory_order_acq_rel, and memory_order_seq_cst:
a store operation performs a release operation on the affected memory location.
...
§29.3.1-4 [atomics.order]
memory_order_acquire, memory_order_acq_rel, and memory_order_seq_cst:
a load operation performs an acquire operation on the affected memory location.
Therefore, atomic operations with non-seq_cst ordering (or non-atomic operations) are ordered with respect to seq_cst operations per the acquire/release ordering rules:
a store(seq_cst) operation cannot be reordered with any memory operation that is sequenced before it (i.e. comes earlier in program order).
a load(seq_cst) operation cannot be reordered with any memory operation that is sequenced after it.
In your example, although c.store(relaxed) in T1 is ordered (inter-thread) after b.load(seq_cst) (the load is an acquire operation),
c.load(relaxed) in T2 is unordered with respect to b.store(seq_cst) (a release operation does not prevent later operations from being reordered before it).
You can also look at the operations on a. Since those are not ordered with respect to anything, a.load(relaxed) can return 0, causing the assert to fire.
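For completeness (my own addition, not part of the original answer): if the intent is to make the assert safe, upgrading the c operations to release/acquire establishes the missing synchronizes-with edge:
// Thread T1
a.store(1, std::memory_order_relaxed);
if (b.load(std::memory_order_seq_cst) == 1) {
    c.store(1, std::memory_order_release); // release: publishes the store to a
}

// Thread T2
b.store(1, std::memory_order_seq_cst);
if (c.load(std::memory_order_acquire) == 1) { // acquire: synchronizes with T1's release store
    assert(a.load(std::memory_order_relaxed) == 1); // can no longer fire
}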

memory_order_relaxed and Atomic RMW operations

The C++ Standard says that RMW (read-modify-write) operations on atomics operate on the latest value of the atomic variable. Consequently, using memory_order_relaxed with these operations won't compromise the RMW operation itself when it is executed concurrently from multiple threads.
I am assuming that this behavior is possible only if there is some memory barrier or fence in place for RMW operations, even when the memory order specified is "relaxed". Please correct me if my understanding is wrong, and explain how these operations work on the latest value if no such memory barrier is used. If my understanding is correct, can I further assume that using acquire-release or seq-cst memory order should incur no additional performance hit for RMW operations on, say, a weakly ordered architecture like ARM or Alpha? Thanks in advance.
This is an unfortunately common misconception about the atomic memory orders. See, those do not (entirely) apply to the actual atomic operation. They apply mainly to other operations around them.
For example:
//accessible from anywhere
std::atomic<bool> flag{false};
int value = 0;
//code in thread 1:
value = 1;
flag.store(true, <order_write>);
//code in thread 2:
bool true_val = true;
while(!flag.compare_exchange_weak(true_val, false, <order_read>))
    true_val = true; // compare_exchange_weak overwrites true_val on failure
int my_val = value;
So, what is this doing? Thread 2 is waiting on thread 1 to signal that value has been updated, then thread 2 reads value.
<order_write> and <order_read> do not govern how the specific atomic variable itself is seen. They govern how other values that were set before/after that atomic operation are seen.
In order for this code to work, <order_write> must use a memory order that is at least as strong as memory_order_release. And <order_read> must use a memory order that is at least as strong as memory_order_acquire.
These memory orders affect how value is transferred (or more specifically, the stuff set before the atomic write).
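Filling in the weakest orders that make the example correct (my instantiation of the placeholders above):
// code in thread 1:
value = 1;
flag.store(true, std::memory_order_release);

// code in thread 2:
bool true_val = true;
while(!flag.compare_exchange_weak(true_val, false, std::memory_order_acquire))
    true_val = true;
int my_val = value; // guaranteed to read 1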
wouldn't the condition that "operate on the latest value" require something like a memory barrier?
It is unlikely that most architectures implement the actual atomic modification using a global memory barrier. It takes the non-relaxed memory orders to do that: they impose a general memory barrier on the writers and readers.
Atomic operations, if they need a memory barrier to work at all, will typically use a local memory barrier. That is, a barrier specific to the address of the atomic variable.
So it is reasonable to assume that non-relaxed memory orders will hurt performance more than a relaxed memory order. That's not a guarantee of course, but it's a pretty good first-order approximation.
Is it possible for atomic implementations to use a full global memory barrier on any atomic operation? Yes. But if an implementation resorts to that for fundamental atomic types, then the architecture probably has no other choice. So if your algorithm requires atomic operations, you don't really have any other choice.
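To underline that relaxed RMW operations are still fully atomic, here is a small self-contained demo (my own sketch, not from the original answer):
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

std::atomic<int> counter{0};

int main()
{
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
        threads.emplace_back([] {
            for (int j = 0; j < 100000; ++j)
                counter.fetch_add(1, std::memory_order_relaxed); // relaxed, but still an atomic RMW
        });
    for (auto& t : threads)
        t.join();
    // No increment is ever lost: the final value is always 400000.
    // What relaxed gives up is only the ordering of *other* memory
    // around each increment, not the atomicity of the increment itself.
    assert(counter.load(std::memory_order_relaxed) == 400000);
}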

Understanding atomic variables and operations

I have read about Boost's and std's (C++11) atomic types and operations over and over again, and still I'm not sure I understand them right (in some cases I don't understand them at all). So, I have a few questions about it.
My sources I use for learning:
Boost documentation: http://www.boost.org/doc/libs/1_53_0/doc/html/atomic.html
http://www.developerfusion.com/article/138018/memory-ordering-for-atomic-operations-in-c0x/
Consider following snippet:
atomic<bool> x,y;
void write_x_then_y()
{
x.store(true, memory_order_relaxed);
y.store(true, memory_order_release);
}
#1: Is it equivalent to this next one?
atomic<bool> x,y;
void write_x_then_y()
{
x.store(true, memory_order_relaxed);
atomic_thread_fence(memory_order_release); // *1
y.store(true, memory_order_relaxed); // *2
}
#2: Is the following statement true?
Line *1 ensures that when operations below this line (for example *2) are visible (to another thread using acquire), the code above *1 will be visible too (with the new values).
The next snippet extends the ones above:
void read_y_then_x()
{
if(y.load(memory_order_acquire))
{
assert(x.load(memory_order_relaxed));
}
}
#3: Is it equivalent to this next one?
void read_y_then_x()
{
atomic_thread_fence(memory_order_acquire); // *3
if(y.load(memory_order_relaxed)) // *4
{
assert(x.load(memory_order_relaxed)); // *5
}
}
#4: Are the following statements true?
Line *3 ensures that if some operation after a release (in another thread, like *2) is visible, every operation above that release (for example *1) will be visible as well.
That means that the assert at *5 will never fail (with false as the default values).
But this does not ensure that even if *2 physically (in the processor) happens before *3, it will be visible to the snippet above (running in a different thread) - the function read_y_then_x() can still read old values. The only thing that is ensured is that if y is true, x will also be true.
#5: Incrementing an atomic integer (adding 1) can be done with memory_order_relaxed and no data is lost. The only problem is the order and time of visibility of the result.
According to Boost, the following snippet is a working reference counter:
#include <boost/intrusive_ptr.hpp>
#include <boost/atomic.hpp>
class X {
public:
typedef boost::intrusive_ptr<X> pointer;
X() : refcount_(0) {}
private:
mutable boost::atomic<int> refcount_;
friend void intrusive_ptr_add_ref(const X * x)
{
x->refcount_.fetch_add(1, boost::memory_order_relaxed);
}
friend void intrusive_ptr_release(const X * x)
{
if (x->refcount_.fetch_sub(1, boost::memory_order_release) == 1) {
boost::atomic_thread_fence(boost::memory_order_acquire);
delete x;
}
}
};
#6: Why is memory_order_release used for decrementing? How does it work (in this context)? If what I wrote earlier is true, what makes the returned value the most recent, especially when we use acquire AFTER reading and not before/during?
#7: Why is there an acquire fence after the reference counter reaches zero? We just read that the counter is zero and no other atomic variable is used (the pointer itself is not marked/used as such).
1: No. A release fence synchronizes with all acquire operations and fences. If there were a third atomic<bool> z being manipulated in a third thread, the fence would synchronize with that third thread as well, which is unnecessary. That being said, they will act the same on x86, but that is because x86 has very strong synchronization. The architectures used on 1000-core systems tend to be weaker.
2: Yes, this is correct. A fence ensures that if you see anything that follows, you also see everything that preceded.
3: In general they are different, but realistically they will be the same. The compiler is allowed to reorder two relaxed operations on different variables, but may not introduce spurious operations. If the compiler has any way of being confident that it is going to need to read x, it may do so before reading y. In your particular case, this is very difficult for the compiler, but there are many similar cases where such reordering is fair game.
4: All of those are true. The atomic operations guarantee consistency. They do not always guarantee that things happen in an order you wanted, they just prevent pathological orders that ruin your algorithm.
5: Correct. Relaxed operations are truly atomic. They just don't synchronize any additional memory.
6: For any given atomic object M, C++ guarantees that there is an "official" order for operations on M. You don't get to see the "latest" value for M so much as C++ and the processor guarantee that all threads will see a consistent series of values for M. If two threads increment the refcount, then decrement it, there is no guarantee which one will decrement it to 0, but there is a guarantee that one of them will see that it decremented it to 0. There is no way for both of them to see that they decremented 2->1 and 2->1, but somehow the refcount combined them to 0. One thread will always see 2->1 and the other will see 1->0.
Remember, memory order is more about synchronizing the memory around the atomic. The atomic gets handled properly no matter what memory order you use.
7: This one is trickier. The short version is that the decrement uses release order because some thread is going to have to run the destructor for x, and we want to make sure it sees all operations on x made on all threads. Using release order on the decrement satisfies this need because you can prove that it works. Whoever is responsible for deleting x acquires all changes before doing so (using a fence to make sure atomics in the deleter don't drift upward). In all cases where threads release their own references, it is obvious that all threads will have a release-order decrement before the deleter gets called. In cases where one thread increments the refcount and another decrements it, you can prove that the only valid way to do so is if the threads synchronize with each other, so that the destructor sees the result of both threads. Failure to synchronize would create a race no matter what, so the user is obliged to get it right.
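As an aside (my addition; the Boost.Atomic documentation discusses this trade-off), the release decrement plus acquire fence in the snippet above can be replaced by a single acq_rel decrement, which is simpler but may be slower on some architectures because every decrement then pays for the acquire ordering even when the count does not reach zero:
friend void intrusive_ptr_release(const X * x)
{
    // acq_rel on the RMW itself replaces release + standalone acquire fence
    if (x->refcount_.fetch_sub(1, boost::memory_order_acq_rel) == 1) {
        delete x;
    }
}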
After pondering over #1, I have been convinced that they are not equivalent, by this argument from §29.8.3 in [atomics.fences]:
A release fence A synchronizes with an atomic operation B that performs an acquire operation on an atomic
object M if there exists an atomic operation X such that A is sequenced before X, X modifies M, and B
reads the value written by X or a value written by any side effect in the hypothetical release sequence X
would head if it were a release operation.
This paragraph says that a release fence can synchronize only with an acquire operation. But a release operation can in addition synchronize with a consume operation.
Your void read_y_then_x() with the acquire fence has the fence in the wrong place. It should be placed between the two atomic loads. An acquire fence essentially makes all the loads above the fence act somewhat like acquire loads, with the exception that the happens-before isn't established until you execute the fence.
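A corrected version of the snippet (my sketch of the placement described above) would be:
void read_y_then_x()
{
    if(y.load(memory_order_relaxed)) // *4
    {
        atomic_thread_fence(memory_order_acquire); // *3: now between the two loads
        assert(x.load(memory_order_relaxed)); // *5
    }
}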