When is a memory_order_seq_cst fence useful? - c++

C++ supports atomic thread fences, that is, fences guaranteeing properties for threads that use std::atomic<> operations, via the function std::atomic_thread_fence. It takes a memory-order parameter that adjusts the "strength" of the fence.
I understand that fences are useful when not all atomic operations are done with a "strong" order:
when not all atomic reads (1) in a thread are acquire operations, you may find a use for an acquire fence;
when not all atomic modifications (1) in a thread are release operations, you may find a use for a release fence.
(1) that includes RMW operations
So the usefulness of all these (acquire, release and acq_rel fences) is obvious: they allow threads that use atomic operations weaker than acq/rel (respectively) to synchronize properly.
But I don't understand where memory_order_seq_cst could be specifically needed as a fence:
What's the implication of using atomic operations weaker than memory_order_seq_cst together with a memory_order_seq_cst fence?
What would specifically be guaranteed (in terms of possible orderings of atomic operations) by a memory_order_seq_cst fence that wouldn't be guaranteed by a memory_order_acq_rel fence?

No, a seq-cst fence is not only both a release fence and an acquire fence; it also provides additional properties (see Working Draft, Standard for Programming Language C++, 32.4.4-32.4.8). A seq-cst fence is also part of the single total order S of all sequentially consistent operations, enforcing the following observations:
For an atomic operation B that reads the value of an atomic object M, if there is a memory_order_seq_cst fence X sequenced before B, then B observes either the last memory_order_seq_cst modification of M preceding X in the total order S or a later modification of M in its modification order.
For atomic operations A and B on an atomic object M, where A modifies M and B takes its value, if there is a memory_order_seq_cst fence X such that A is sequenced before X and B follows X in S, then B observes either the effects of A or a later modification of M in its modification order.
For atomic operations A and B on an atomic object M, where A modifies M and B takes its value, if there are memory_order_seq_cst fences X and Y such that A is sequenced before X, Y is sequenced before B, and X precedes Y in S, then B observes either the effects of A or a later modification of M in its modification order.
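To make the difference from acq_rel concrete, here is the classic store-buffering test as a minimal sketch (the names x, y, r1, r2 are mine):

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = -1, r2 = -1;

void t1() {
    x.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst); // fence X, part of S
    r1 = y.load(std::memory_order_relaxed);
}

void t2() {
    y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst); // fence Y, part of S
    r2 = x.load(std::memory_order_relaxed);
}

int main() {
    std::thread a(t1), b(t2);
    a.join();
    b.join();
    // One fence precedes the other in S; by the third rule above, the
    // thread whose fence comes later must observe the other thread's store.
    assert(r1 == 1 || r2 == 1);
}

With memory_order_acq_rel fences instead, the fences take no part in any total order, and r1 == r2 == 0 becomes a permitted outcome.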
For example, I am using seq-cst fences in my hazard pointer implementation: https://github.com/mpoeter/xenium/blob/master/xenium/reclamation/impl/hazard_pointer.hpp
The thread acquiring a safe reference to some object uses a seq-cst fence after storing the hazard pointer, but before re-reading the pointer to the object. The thread trying to reclaim objects uses a seq-cst fence before gathering the active hazard pointers from all threads. Based on the rules above, this ensures that either the reclaiming thread sees that some other thread holds a HP for the object (i.e., the object is still in use), or the reload by the thread trying to acquire the safe reference returns a different pointer, indicating to that thread that the object has been removed and that it has to retry.
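Condensed to its essence, the pattern looks like this (a sketch with a single hazard-pointer slot and hypothetical names, not the actual xenium code):

#include <atomic>

struct Node { int value; };

std::atomic<Node*> head{nullptr};
std::atomic<Node*> hazard_ptr{nullptr}; // one slot, for brevity

// Thread acquiring a safe reference:
Node* acquire() {
    Node* p = head.load(std::memory_order_relaxed);
    for (;;) {
        hazard_ptr.store(p, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst); // fence X
        Node* q = head.load(std::memory_order_relaxed);      // re-read
        if (q == p)
            return p; // the HP was published before any scan that could miss it
        p = q;        // the pointer changed: the object may have been removed; retry
    }
}

// Thread reclaiming a removed node:
void reclaim(Node* removed) {
    // `removed` has already been unlinked from head at this point.
    std::atomic_thread_fence(std::memory_order_seq_cst); // fence Y
    if (hazard_ptr.load(std::memory_order_relaxed) != removed)
        delete removed; // no thread holds a hazard pointer to it
    // else: defer reclamation and retry later
}

If fence X precedes fence Y in S, the reclaimer's scan must see the published hazard pointer; if Y precedes X, the re-read must see the unlinked head. Either way, that is exactly the guarantee described above.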

Related

Synchronizing against relaxed atomics

I have an allocator that does relaxed atomics to track the number of bytes currently allocated. They're just adds and subtracts so I don't need any synchronization between threads other than ensuring the modifications are atomic.
However, I occasionally want to check the number of allocated bytes (e.g. when shutting down the program) and I want to ensure any pending writes are committed. I assume I need a full memory barrier in this case to prevent any previous writes from being moved after the barrier and to prevent the next read from being moved before the barrier.
The question is: what is the proper way to ensure the relaxed atomic writes are committed before reading? Is my current code correct? (Assume functions and types map to std library constructs as expected.)
void* Allocator::Alloc(size_t bytes, size_t alignment)
{
    void* p = AlignedAlloc(bytes, alignment);
    AtomicFetchAdd(&allocatedBytes, AlignedMsize(p), MemoryOrder::Relaxed);
    return p;
}

void Allocator::Free(void* p)
{
    AtomicFetchSub(&allocatedBytes, AlignedMsize(p), MemoryOrder::Relaxed);
    AlignedFree(p);
}

size_t Allocator::GetAllocatedBytes()
{
    AtomicThreadFence(MemoryOrder::AcqRel);
    return AtomicLoad(&allocatedBytes, MemoryOrder::Relaxed);
}
And some type definitions for context
enum struct MemoryOrder
{
    Relaxed = 0,
    Consume = 1,
    Acquire = 2,
    Release = 3,
    AcqRel  = 4,
    SeqCst  = 5,
};

struct Allocator
{
    void*  Alloc(size_t bytes, size_t alignment);
    void   Free(void* p);
    size_t GetAllocatedBytes();

    Atomic<size_t> allocatedBytes = { 0 };
};
I don't want to simply default to sequential consistency as I'm trying to understand memory ordering better.
The part that's really tripping me up is that in the standard under [atomics.fences] all the points talk about an acquire fence/atomic op synchronizing with a release fence/atomic op. It's entirely opaque to me whether an acquire fence/atomic op will synchronize with a relaxed atomic op on another thread. If an AcqRel fence function literally maps to an mfence instruction, it seems that the above code will be fine. However, I'm having a hard time convincing myself the standard guarantees this. Namely,
4. An atomic operation A that is a release operation on an atomic object M synchronizes with an acquire fence B if there exists some atomic operation X on M such that X is sequenced before B and reads the value written by A or a value written by any side effect in the release sequence headed by A.
This seems to make it clear that the fence will not synchronize with the relaxed atomic writes. On the other hand, a full fence is both a release and an acquire fence, so it should synchronize with itself, right?
2. A release fence A synchronizes with an acquire fence B if there exist atomic operations X and Y, both operating on some atomic object M, such that A is sequenced before X, X modifies M, Y is sequenced before B, and Y reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation.
The scenario described is
Unsequenced writes
A release fence
X atomic write
Y atomic read
B acquire fence
Unsequenced reads (unsequenced writes will be visible here)
However, in my case I don't have the atomic write + atomic read as a signal between the threads and the release fence happens with the acquire fence on thread B. So what's actually happening is
Unsequenced writes
A release fence
B acquire fence
Unsequenced reads
Clearly if the fence executes before an unsequenced write begins, it's a race and all bets are off. But it seems to me that if the fence executes after an unsequenced write begins but before it is committed, the write will be forced to finish before the unsequenced reads. This is exactly what I want, but I can't glean whether it is guaranteed by the standard.
Let's say you spawn Thread A, which calls Allocator::Alloc(), then immediately spawn Thread B, which calls Allocator::GetAllocatedBytes(). Those two Allocator calls are now running concurrently. You don't know which one will actually happen first, because there's no ordering between them. Your only guarantee is that either Thread B will see the value of allocatedBytes before Thread A modifies it, or it will see the value of allocatedBytes after Thread A modifies it. You won't know which value Thread B saw until after GetAllocatedBytes() returns. (At least Thread B won't see a totally garbage value for allocatedBytes, because there's no data race on it thanks to your use of relaxed atomics.)
You seem to be concerned about the case where Thread A got as far as AtomicFetchAdd(), but for some reason, the change is not visible when Thread B calls AtomicLoad(). But so what? That's no different from the outcome where GetAllocatedBytes() runs entirely before AtomicFetchAdd(). And that's a totally valid outcome. Remember, either Thread B sees the modified value, or it doesn't.
Even if you change all the atomic operations/fences to MemoryOrder::SeqCst, it won't make any difference. In the scenario I described, Thread B can still either see the modified value or the unmodified value of allocatedBytes, because the two Allocator calls run concurrently.
As long as you insist on calling GetAllocatedBytes() while other threads are still calling Alloc() and Free(), that's really the most you can expect. If you want to get a more "accurate" value, just don't allow any concurrent calls to Alloc()/Free() while GetAllocatedBytes() is running! For example, if the program is shutting down, just join all the other threads before calling GetAllocatedBytes(). That'll give you an accurate number of allocated bytes at shutdown. The C++ standard even guarantees it, because the completion of a thread synchronizes with the call to join().
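For example (a sketch in plain std:: terms; Shutdown and workers are hypothetical names, Allocator is the type from the question):

#include <cstdio>
#include <thread>
#include <vector>

void Shutdown(std::vector<std::thread>& workers, Allocator& allocator) {
    for (std::thread& t : workers)
        t.join(); // each thread's completion synchronizes with its join()
    // Every Alloc()/Free() in the joined threads now happens-before this
    // call, so the relaxed load inside returns an exact count; no fence needed.
    std::printf("allocated bytes at shutdown: %zu\n",
                allocator.GetAllocatedBytes());
}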
This will not work properly: acq_rel memory order is designed specifically for CAS and FAA operations that "simultaneously" read and write atomic data. In your case you want to enforce memory synchronization before the load. To do this you need to change the memory order of your fetch-and-add and fetch-and-sub to acq_rel and of your load to acquire. This may seem like a lot, but on x86 it has very little cost (it merely inhibits some compiler optimizations), as it does not generate any new instructions in the code. For how acquire-release synchronization works, I recommend this article: http://preshing.com/20120913/acquire-and-release-semantics/
I removed the info about sequential ordering, as it would have to be used for all the operations to work properly and would be overkill.
From my understanding of C++ atomics, relaxed memory order makes sense when used in combination with other atomic operations that use memory fences. For example, in some situations atomic a may be stored in a relaxed manner while atomic b is written with release memory order, and so on.
If your question is "what is the proper way to ensure that the relaxed atomic writes are committed before reading this same atomic object?", then nothing is needed; this is ensured by the language, [intro.multithread]:
All modifications to a particular atomic object M occur in some particular total order, called the modification order of M.
All threads see the same modification order. For example, imagine that two allocations happen in two different threads and you then read the counter in a third thread.
In the first thread, the atomic is incremented by 1 byte, and the relaxed read-modify-write expression (AtomicFetchAdd) returns 0: the counter makes the transition 0->1.
In the second thread, the atomic is incremented by 2 bytes, and the relaxed read-modify-write expression returns 1: the counter makes the transition 1->3. There is no way this expression could return 0: this thread cannot see a transition 0->2, because the other thread has already performed the transition 0->1.
Then in a third thread you perform a relaxed load. The only values that may be loaded are 0, 1, or 3; it is not possible to load 2. The modification order of the atomic is 0 -> 1 -> 3, and the observer thread will see that same modification order.
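That scenario, as a runnable sketch in plain std:: atomics (thread and variable names are mine):

#include <atomic>
#include <cstddef>
#include <cstdio>
#include <thread>

std::atomic<size_t> counter{0};

int main() {
    size_t r1 = 0, r2 = 0;
    std::thread t1([&] { r1 = counter.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { r2 = counter.fetch_add(2, std::memory_order_relaxed); });
    size_t seen = counter.load(std::memory_order_relaxed); // observer
    t1.join();
    t2.join();
    // If r2 == 1, the modification order was 0 -> 1 -> 3, so `seen` is
    // 0, 1, or 3 -- never 2. All threads agree on that single order.
    std::printf("r1=%zu r2=%zu seen=%zu\n", r1, r2, seen);
}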

Is atomic_thread_fence(memory_order_release) different from using memory_order_acq_rel?

cppreference.com provides this note about std::atomic_thread_fence (emphasis mine):
atomic_thread_fence imposes stronger synchronization constraints than an atomic store operation with the same std::memory_order.
While an atomic store-release operation prevents all preceding writes from moving past the store-release, an atomic_thread_fence with memory_order_release ordering prevents all preceding writes from moving past all subsequent stores.
I understand this note to mean that std::atomic_thread_fence(std::memory_order_release) is not unidirectional, like a store-release. It's a bidirectional fence, preventing stores on either side of the fence from reordering past a store on the other side of the fence.
If I understand that correctly, this fence seems to make the same guarantees that atomic_thread_fence(memory_order_acq_rel) does. It is an "upward" fence, and a "downward" fence.
Is there a functional difference between std::atomic_thread_fence(std::memory_order_release) and std::atomic_thread_fence(std::memory_order_acq_rel)? Or is the difference merely aesthetic, to document the purpose of the code?
A standalone fence imposes stronger ordering than an atomic operation with the same ordering constraint, but this does not change the direction in which ordering is enforced.
Both an atomic release operation and a standalone release fence are uni-directional,
but the atomic operation imposes ordering with respect to itself, whereas the fence imposes ordering with respect to other stores.
For example, an atomic operation with release semantics:
std::atomic<int> sync{0};
// memory operations A
sync.store(1, std::memory_order_release);
// store B
This guarantees that no memory operation part of A (loads & stores) can be (visibly) reordered with the atomic store itself.
But it is uni-directional and no ordering rules apply to memory operations that are sequenced after the atomic operation; therefore, store B can still be reordered with any of the memory operations in A.
A standalone release fence changes this behavior:
// memory operations A
std::atomic_thread_fence(std::memory_order_release);
// load X
sync.store(1, std::memory_order_relaxed);
// stores B
This guarantees that no memory operation in A can be (visibly) reordered with any of the stores that are sequenced after the release fence.
Here, the stores B can no longer be reordered with any of the memory operations in A, and as such, the release fence is stronger than the atomic release operation.
But it is also uni-directional, since the load from X can still be reordered with any memory operation in A.
The difference is subtle and usually an atomic release operation is preferred over a standalone release fence.
The rules for a standalone acquire fence are similar, except that it enforces ordering in the opposite direction and operates on loads:
// loads B
sync.load(std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_acquire);
// memory operations A
No memory operation in A can be reordered with any load that is sequenced before the standalone acquire fence.
A standalone fence with std::memory_order_acq_rel ordering combines the logic for both acquire and release fences.
// memory operations A
// load A
std::atomic_thread_fence(std::memory_order_acq_rel);
// store B
//memory operations B
But this can get incredibly tricky once you realize that a store in A can still be reordered with a load in B.
Acq/rel fences should probably be avoided in favor of regular atomic operations, or even better, mutexes.
cppreference.com made some mistakes in the paragraph you quoted. I have highlighted them in the following:
atomic_thread_fence imposes stronger synchronization constraints than an atomic store operation with the same std::memory_order. While an atomic store-release operation prevents all preceding writes (should be: memory operations, i.e. including reads and writes) from moving past the store-release (the complete sentence should be: past the store-release operation itself), an atomic_thread_fence with memory_order_release ordering prevents all preceding writes (should be: memory operations, i.e. including reads and writes) from moving past all subsequent stores.
To paraphrase it:
The release operation actually places fewer memory ordering constraints on neighboring operations than the release fence. A release operation only needs to prevent preceding memory operations from being reordered past itself, but a release fence must prevent preceding memory operations from being reordered past all subsequent writes. Because of this difference, a release operation can never take the place of a release fence.
This is quoted from here.
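A consequence worth spelling out, as a sketch (data1, data2, flag1, flag2 are names of my choosing): a single release fence orders the preceding writes before every subsequent store, something a lone store-release, which only orders writes before itself, cannot provide for multiple flags.

#include <atomic>

int data1, data2;
std::atomic<int> flag1{0}, flag2{0};

void publisher() {
    data1 = 1;
    data2 = 2;
    std::atomic_thread_fence(std::memory_order_release);
    // The fence orders the writes to data1/data2 before *both* stores below;
    // an acquire load that reads 1 from either flag sees both writes.
    flag1.store(1, std::memory_order_relaxed);
    flag2.store(1, std::memory_order_relaxed);
}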
This is my interpretation of the intent of the following text. That interpretation is correct in terms of the memory model, but the text is still a bad explanation, as it is incomplete.
While an atomic store-release operation prevents all preceding writes from moving past the store-release, an atomic_thread_fence with memory_order_release ordering prevents all preceding writes from moving past all subsequent stores.
The use of "store" vs. "writes" is intentional:
"store", here, means a store on an std::atomic<> object (not just a call to std::atomic<>::store, also assignment which is equivalent to .store(value) or a RMW atomic operation);
"write", here, means any memory write, either normal (non atomic) or atomic.
It's a bidirectional fence, preventing stores on either side of the fence from reordering past a store on the other side of the fence.
No, you missed an essential distinction, because it was only implied, and expressed in an unclear, too-subtle way - not good for a teaching text!
It says that a release fence is not symmetric: previous memory side effects, called "writes" here, are bound by following atomic store operations.
Even with that clarification, it's incomplete, and so it's a bad explanation: it strongly suggests that release fences exist just to make sure that writes (and writes only) are finished. That is not the case.
A release operation is what I call an "I'm done there" signal. It signals that all previous memory operations are done, finished, visible. It's important to understand that not only modifications (which can be detected by looking at memory state) are ordered: everything touching memory needs to be.
Many write-ups about threading primitives are defective in that way.

C++ memory model: do seq_cst loads synchronize with seq_cst stores?

In the C++ memory model, there is a single total order on all sequentially consistent loads and stores. I'm wondering how this interacts with operations that have other memory orderings and are sequenced before/after sequentially consistent loads.
For example, consider two threads:
std::atomic<int> a(0);
std::atomic<int> b(0);
std::atomic<int> c(0);

//////////////
// Thread T1
//////////////

// Signal that we've started running.
a.store(1, std::memory_order_relaxed);
// If T2's store to b occurs before our load below in the total
// order on sequentially consistent operations, set flag c.
if (b.load(std::memory_order_seq_cst) == 1) {
    c.store(1, std::memory_order_relaxed);
}

//////////////
// Thread T2
//////////////

// Blindly write to b.
b.store(1, std::memory_order_seq_cst);
// Has T1 set c? If so, then we know our store to b occurred before T1's load
// in the total order on sequentially consistent operations.
if (c.load(std::memory_order_relaxed) == 1) {
    // But is this guaranteed to be visible yet?
    assert(a.load(std::memory_order_relaxed) == 1);
}
Is it guaranteed that the assertion in T2 cannot fire?
I'm looking for detailed citations of the standard here. In particular, I think this would require showing that the load from b in T1 synchronizes with the store to b in T2, in order to establish that the store to a inter-thread happens before the load from a; but as far as I can tell, the standard says that memory_order_seq_cst stores synchronize with loads, but not the other way around.
Do seq_cst loads synchronize with seq_cst stores?
They do, if all the necessary requirements are met; in your example code, those requirements are not all met, and the assert can fire.
§29.3.3
There shall be a single total order S on all memory_order_seq_cst operations
This total order applies to the seq_cst operations themselves. In isolation, a store(seq_cst) has release semantics, whereas a load(seq_cst) has acquire semantics.
§29.3.1-2 [atomics.order]
memory_order_release, memory_order_acq_rel, and memory_order_seq_cst:
a store operation performs a release operation on the affected memory location.
.....
§29.3.1-4 [atomics.order]
memory_order_acquire, memory_order_acq_rel, and memory_order_seq_cst:
a load operation performs an acquire operation on the affected memory location.
Therefore, atomic operations with non-seq_cst ordering (or non-atomic operations) are ordered with respect to seq_cst operations per the acquire/release ordering rules:
a store(seq_cst) operation cannot be reordered with any memory operation that is sequenced before it (i.e., comes earlier in program order);
a load(seq_cst) operation cannot be reordered with any memory operation that is sequenced after it.
In your example, although c.store(relaxed) in T1 is ordered (inter-thread) after b.load(seq_cst) (the load is an acquire operation),
c.load(relaxed) in T2 is unordered with respect to b.store(seq_cst) (the store is a release operation, but that does not prevent this reordering).
You can also look at the operations on a. Since those are not ordered with respect to anything, a.load(relaxed) can return 0, causing the assert to fire.
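To make the assert safe, the handoff through c has to carry the synchronization itself; a sketch with the same variables, upgrading c to release/acquire:

// Thread T1
a.store(1, std::memory_order_relaxed);
if (b.load(std::memory_order_seq_cst) == 1) {
    c.store(1, std::memory_order_release); // releases the store to a
}

// Thread T2
b.store(1, std::memory_order_seq_cst);
if (c.load(std::memory_order_acquire) == 1) { // synchronizes with T1's store to c
    // a.store(1) now happens-before this load, so the assert cannot fire.
    assert(a.load(std::memory_order_relaxed) == 1);
}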

In C++, is there any effective difference between a acquire/release atomic access and a relaxed access combined with a fence?

Specifically, is there any effective difference between:
i = a.load(memory_order_acquire);
or
a.store(5, memory_order_release);
and
atomic_thread_fence(memory_order_acquire);
i = a.load(memory_order_relaxed);
or
a.store(5, memory_order_relaxed);
atomic_thread_fence(memory_order_release);
respectively?
Do non-relaxed atomic accesses provide signal fences as well as thread fences?
In your code, for both the load and the store, the order between the fence and the atomic operation should be reversed; then it becomes similar to the standalone acquire/release operations, although there are differences.
Acquire and release operations on atomic variables act as one-way barriers, but in opposite directions.
That is, a store/release operation prevents memory operations that precede it (in the program source) from being reordered after it,
while a load/acquire operation prevents memory operations that follow it from being reordered before it.
// thread 1
// shared memory operations A
a.store(5, std::memory_order_release);
x = 42; // regular int
// thread 2
while (a.load(std::memory_order_acquire) != 5);
// shared memory operations B
Memory operations A cannot move down below the store/release, while memory operations B cannot move up above the load/acquire.
As soon as thread 2 reads 5, memory operations A are visible to B and the synchronization is complete.
Because the barrier is one-way, the write to x may be reordered into, or even before, memory operations A; but since it is not part of the acquire/release relationship, x cannot be reliably accessed by thread 2.
Replacing the atomic operations with standalone thread fences and relaxed operations is similar:
// thread 1
// shared memory operations A
std::atomic_thread_fence(memory_order_release);
a.store(5, std::memory_order_relaxed);
// thread 2
while (a.load(std::memory_order_relaxed) != 5);
std::atomic_thread_fence(memory_order_acquire);
// shared memory operations B
This achieves the same result, but an important difference is that neither fence acts as a one-way barrier;
if they did, the atomic store to a could be reordered before the release fence, and the atomic load from a could be reordered after the acquire fence, either of which would break the synchronization relationship.
In general:
A standalone release fence prevents preceding operations from being reordered with (atomic) stores that follow it.
A standalone acquire fence prevents following operations from being reordered with (atomic) loads that precede it.
The standard allows Acquire/Release fences to be mixed with Acquire/Release operations.
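For instance, a release fence paired with an acquire operation (a minimal sketch; payload and ready are my names):

#include <atomic>
#include <cassert>

int payload;
std::atomic<int> ready{0};

void producer() {
    payload = 42;
    std::atomic_thread_fence(std::memory_order_release); // release fence...
    ready.store(1, std::memory_order_relaxed);
}

void consumer() {
    if (ready.load(std::memory_order_acquire) == 1) {    // ...acquire operation
        // The release fence synchronizes with this acquire load, so the
        // write to payload is visible here.
        assert(payload == 42);
    }
}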
Do non-relaxed atomic accesses provide signal fences as well as thread fences?
It is not fully clear to me what you are asking here, because thread fences are normally used with relaxed atomic operations.
But std::atomic_signal_fence is similar to std::atomic_thread_fence, except that it only establishes ordering between a thread and a signal handler executed in that same thread, and
therefore the compiler does not generate CPU instructions for inter-thread synchronization.
It basically acts as a compiler-only barrier.
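For illustration, the typical signal-handler pattern, where the fences only need to constrain the compiler (a sketch; the handler and variable names are hypothetical):

#include <atomic>
#include <csignal>

int data; // touched only by this thread and its own signal handler
std::atomic<int> flag{0};

void handler(int) {
    if (flag.load(std::memory_order_relaxed) == 1) {
        std::atomic_signal_fence(std::memory_order_acquire);
        // data == 42 is guaranteed here: handler and thread run on the
        // same thread, so a compiler-only barrier is sufficient and no
        // CPU fence instruction is emitted.
    }
}

int main() {
    std::signal(SIGINT, handler);
    data = 42;
    std::atomic_signal_fence(std::memory_order_release);
    flag.store(1, std::memory_order_relaxed);
}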
You need
atomic_thread_fence(memory_order_release);
a.store(5, memory_order_relaxed);
and
i = a.load(memory_order_relaxed);
atomic_thread_fence(memory_order_acquire);
to replace
a.store(5, memory_order_release);
and
i = a.load(memory_order_acquire);
Non-relaxed atomic accesses do provide signal fences as well as thread fences.

can operation reorder to before memory_order_release?

Can y.store be reordered to before x.store? The standard says that any atomic operation that happens before a memory_order_release operation cannot be reordered to after it, but it doesn't state that any operation that happens after the memory_order_release cannot be reordered to before it.
If that can happen, is the Listing 5.12 example from the C++ Concurrency in Action book wrong?
std::atomic<bool> x, y;
std::atomic<int> z;

void write_x_then_y()
{
    x.store(true, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    y.store(true, std::memory_order_relaxed);
}
The standard does not define memory fences in terms of how operations are ordered around them.
It defines how a release fence (or operation) synchronizes with an acquire fence (or operation) when the right conditions are met.
In your example, if an acquire operation observes the value stored by y, it is guaranteed that it also observes the value stored by x.
Under those conditions, the store to x is ordered before the store to y.
Other than that, it is hard to speculate how fences impose ordering since a lot can happen under the as-if rule.
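For reference, the matching reader in that listing pairs an acquire fence with a relaxed load of y (paraphrased from the book; x, y, z as declared above):

void read_y_then_x()
{
    while (!y.load(std::memory_order_relaxed))
        ; // spin until the store to y becomes visible
    std::atomic_thread_fence(std::memory_order_acquire);
    if (x.load(std::memory_order_relaxed))
        ++z; // guaranteed to execute: the release and acquire fences synchronize
}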