My understanding of std::memory_order_acquire and std::memory_order_release is as follows:
Acquire means that no memory accesses which appear after the acquire fence can be reordered to before the fence.
Release means that no memory accesses which appear before the release fence can be reordered to after the fence.
What I don't understand is why with the C++11 atomics library in particular, the acquire fence is associated with load operations, while the release fence is associated with store operations.
To clarify, the C++11 <atomic> library enables you to specify memory fences in two ways: either you can specify a fence as an extra argument to an atomic operation, like:
x.load(std::memory_order_acquire);
Or you can use std::memory_order_relaxed and specify the fence separately, like:
x.load(std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_acquire);
What I don't understand is, given the above definitions of acquire and release, why does C++11 specifically associate acquire with load, and release with store? Yes, I've seen many of the examples that show how you can use an acquire/load with a release/store to synchronize between threads, but in general it seems that the idea of acquire fences (prevent memory reordering after statement) and release fences (prevent memory reordering before statement) is orthogonal to the idea of loads and stores.
So, why, for example, won't the compiler let me say:
x.store(10, std::memory_order_acquire);
I realize I can accomplish the above by using memory_order_relaxed, and then a separate atomic_thread_fence(memory_order_acquire) statement, but again, why can't I use store directly with memory_order_acquire?
A possible use case for this might be if I want to ensure that some store, say x = 10, happens before some other statement executes that might affect other threads.
Say I write some data, and then I write an indication that the data is now ready. It's imperative that no thread that sees the indication that the data is ready fails to see the write of the data itself. So prior writes cannot move past that write.
Say I read that some data is ready. It's imperative that any reads I issue after seeing that take place after the read that saw the data was ready. So subsequent reads cannot move to before that read.
So when you do a synchronized write, you typically need to make sure that all writes you did before that are visible to anyone who sees the synchronized write. And when you do a synchronized read, it's typically imperative that any reads you do after that take place after the synchronized read.
Or, to put it another way, an acquire is typically reading that you can take or access the resource, and subsequent reads and writes must not be moved before it. A release is typically writing that you are done with the resource, and preceding writes must not be moved to after it.
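As a concrete sketch of that pattern (the names payload and ready are illustrative): the writer publishes data with a release store, and any reader that observes the flag via an acquire load is guaranteed to also see the data.

#include <atomic>

int payload = 0;                     // plain, non-atomic data
std::atomic<bool> ready{false};

void writer() {
    payload = 42;                                  // write the data first...
    ready.store(true, std::memory_order_release);  // ...then publish: earlier
                                                   // writes can't move past this
}

void reader() {
    if (ready.load(std::memory_order_acquire)) {   // acquire: later accesses
        int v = payload;                           // can't move before this load;
        (void)v;                                   // v is guaranteed to be 42
    }
}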
(Partial answer correcting a mistake in the early part of the question. David Schwartz's answer already nicely covers the main question you're asking. Jeff Preshing's article on acquire / release is also good reading for another take on it.)
The definitions you gave for acquire / release are wrong for fences; they only apply to acquire operations and release operations, like x.store(mo_release), not std::atomic_thread_fence(mo_release).
Acquire means that no memory accesses which appear after the acquire fence can be reordered to before the fence. [wrong, would be correct for acquire operation]
Release means that no memory accesses which appear before the release fence can be reordered to after the fence. [wrong, would be correct for release operation]
They're insufficient for fences, which is why ISO C++ has stronger ordering rules for acquire fences (blocking LoadStore / LoadLoad reordering) and release fences (LoadStore / StoreStore).
Of course ISO C++ doesn't define "reordering"; that would imply there is some global coherent state that you're accessing. ISO C++ instead defines ordering in terms of the happens-before relation and which values a load is allowed to observe.
Jeff Preshing's articles are relevant here:
Acquire and Release Semantics (acquire / release operations such as loads, stores, and RMWs)
Acquire and Release Fences Don't Work the Way You'd Expect explains why those one-way barrier definitions are incorrect and insufficient for fences, unlike for operations. (Because it would let the fence reorder all the way to one end of your program and leave all the operations unordered wrt. each other, because it's not tied to an operation itself.)
A possible use case for this might be if I want to ensure that some store, say x = 10, happens before some other statement executes that might affect other threads.
If that "other statement" is a load from an atomic shared variable, you actually need std::memory_order_seq_cst to avoid StoreLoad reordering. acquire / release / acq_rel won't block that.
If you mean make sure the atomic store is visible before some other atomic store, the normal way is to make the 2nd atomic store use mo_release.
If the 2nd store isn't atomic, it's unlikely any reader could safely sync with anything in a way that it could observe the value without data-race UB.
(Although you do run into a use case for a release fence when hacking up a SeqLock that uses plain non-atomic objects for the payload, to allow a compiler to optimize. But that's an implementation-specific behaviour that depends on knowing how std::atomic stuff compiles for real CPUs. See Implementing 64 bit atomic counter with 32 bit atomics for example.)
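As a minimal sketch of that fence use case (names illustrative, relying only on the standard's fence-synchronization rules, not on implementation details): a release fence followed by a relaxed store gives the store release semantics with respect to everything before the fence.

#include <atomic>

int payload = 0;
std::atomic<bool> flag{false};

void publish() {
    payload = 42;
    // The release fence orders the payload write before the relaxed store,
    // and synchronizes-with an acquire operation that reads true from flag.
    std::atomic_thread_fence(std::memory_order_release);
    flag.store(true, std::memory_order_relaxed);
}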
A std::memory_order_acquire fence only ensures that loads and stores after the fence are not reordered before loads that appear before the fence; thus an acquire fence cannot ensure that a store is visible to other threads before later loads execute. This is why memory_order_acquire is not supported for store operations; you may need memory_order_seq_cst to achieve acquire-like ordering for a store.
As an alternative, you may say
x.store(10, std::memory_order_relaxed);
x.load(std::memory_order_acquire); // this introduces a data dependency
to ensure that later loads are not reordered before the store. Again, a fence does not work here.
Besides, a memory order attached to an atomic operation can be cheaper than a memory fence, because it only constrains ordering relative to that atomic instruction, not all instructions before and after the fence.
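To make that cost difference concrete, compare the two spellings (assuming x is a std::atomic<int> as in the question; both publish with release semantics, but the fence form must order all earlier accesses against all later stores):

// ordering attached to the operation (often a single instruction,
// e.g. stlr on AArch64):
x.store(10, std::memory_order_release);

// separate fence plus relaxed store (the fence must act as a barrier
// for all surrounding accesses, e.g. dmb ish on AArch64):
std::atomic_thread_fence(std::memory_order_release);
x.store(10, std::memory_order_relaxed);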
See also formal description and explanation for details.
I have basically two questions that are closely related and they are both based on this SO question:
Thread synchronization problem with c++ std::atomic variables
As cppreference.com explains:
For memory_order_acquire:
A load operation with this memory order performs the acquire operation on the affected memory location: no reads or writes in the current thread can be reordered before this load. All writes in other threads that release the same atomic variable are visible in the current thread.
For memory_order_release:
A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store. All writes in the current thread are visible in other threads that acquire the same atomic variable.
Why do people say that memory_order_seq_cst MUST be used in order for that example to work properly? What's the purpose of memory_order_acquire if it doesn't work as the official documentation says?
The documentation clearly says: All writes in other threads that release the same atomic variable are visible in the current thread.
Why should that example from the SO question never print "bad\n"? It just doesn't make any sense to me.
I did my homework by reading all the available documentation, SO questions/answers, googling, etc... But I'm still not able to understand some things.
Your linked question has two atomic variables, while your "cppreference" quote specifically mentions the "same atomic variable". That's why the reference text doesn't cover the linked question.
Quoting further from cppreference: memory_order_seq_cst : "...a single total order exists in which all threads observe all modifications in the same order".
So that does cover modifications to two atomic variables.
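Assuming the linked question follows the classic store-buffering pattern, here is a hedged sketch of why seq_cst matters: with only release stores and acquire loads, both threads could read false (and print "bad"), because nothing prevents each thread's store from becoming visible after its own load of the other variable.

#include <atomic>
#include <cstdio>

std::atomic<bool> x{false}, y{false};

void thread1() {
    x.store(true, std::memory_order_seq_cst);
    if (!y.load(std::memory_order_seq_cst))
        std::puts("t1 saw y == false");
}

void thread2() {
    y.store(true, std::memory_order_seq_cst);
    if (!x.load(std::memory_order_seq_cst))
        std::puts("t2 saw x == false");
}

// seq_cst puts all four operations into one total order, so at most one
// thread can see the other's flag still false. With release/acquire,
// StoreLoad reordering is allowed and *both* could.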
Essentially, the design problem with memory_order_release is that it's a data equivalent of GOTO, which we know is a problem since Dijkstra. And memory_order_acquire is the equivalent of a COMEFROM, which is usually reserved for April Fools. I'm not yet convinced that they're good additions to C++.
For example, is calling std::mutex::lock() required by the Standard to provide a sequentially consistent fence, an acquire fence, or neither?
cppreference.com doesn't seem to address this topic. Is it addressed in any reference documentation that's easier to use than the Standard or working papers?
I'm not sure about an easier source, but here's a quote from a note in the standard:
[...] a call that acquires a mutex will perform an acquire operation on the locations comprising the mutex. Correspondingly, a call that releases the same mutex will perform a release operation on those same locations. Informally, performing a release operation on A forces prior side effects on other memory locations to become visible to other threads that later perform a consume or an acquire operation on A.
I think that answers the question about memory fences reasonably well (and although it's "only" a note, not a normative part of the standard, I'd say it's as reliable a description of the standard as any other site could hope to provide).
std::atomic and std::mutex operations never require full 2-way fences. That does happen in practice on some ISAs as an implementation detail, notably x86, but not AArch64.
Even std::atomic<T> atomic RMWs with the default memory_order_seq_cst aren't as strong as full 2-way fences, I think. On real ISAs where SC RMWs can be done without being much stronger than required (specifically AArch64), I'm not sure they stop relaxed operations on opposite sides from reordering with each other. (Happening between the load and store parts of the atomic RMW).
As Jerry Coffin says, taking a std::mutex is only an acquire operation in the ISO C++ standard, not an acquire fence. It's not like std::atomic_thread_fence(std::memory_order_acquire), it's only required to be as strong as foo.exchange(std::memory_order_acquire).
The lack of mention of requiring a 2-way fence makes it clear that one isn't required or guaranteed by the standard. An acquire operation like taking a mutex allows 1-way reordering with itself, so relaxed operations before/after it can potentially reorder with each other. (That's why fences and operations are different things.)
Being any stronger than that is an implementation detail. For example, on x86 any atomic RMW operation is a full barrier: it waits for the store buffer to drain, and for all earlier loads to complete, before RMWing the cache line. So it's like a std::atomic_thread_fence(seq_cst) tied to the foo.exchange(); in fact a dummy lock add byte [rsp], 0 is how most compilers implement that C++ fence, because unfortunately mfence is slower on most CPUs.
Taking a mutex always requires an atomic RMW, but some machines can do that in ways that allow limited reordering with surrounding operations. E.g. AArch64 can use ldaxr (sequential-acquire load-linked) / stxr (plain store-conditional, not stlxr with release semantics) to implement .exchange(acquire) or .compare_exchange_weak(acquire). See an example compiling to asm for AArch64 on Godbolt, and also "atomic exchange with memory_order_acquire and memory_order_release" and "For purposes of ordering, is atomic read-modify-write one operation or two?"
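To visualize that one-way reordering in source terms (a hypothetical sketch; whether any reordering actually happens is up to the compiler and CPU):

#include <atomic>
#include <mutex>

std::mutex m;
std::atomic<int> before{0};
int shared_data = 0;

void f() {
    before.store(1, std::memory_order_relaxed); // may sink below m.lock(),
                                                // into the critical section
    m.lock();      // acquire operation: the access below can't move above it
    shared_data = 42;
    m.unlock();    // release operation: the access above can't move below it
}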
In general, load-acquire/store-release synchronization is one of the most common forms of memory-ordering based synchronization in the C++11 memory model. It's basically how a mutex provides memory ordering. The "critical section" between a load-acquire and a store-release is always synchronized among different observer threads, in the sense that all observer threads will agree on what happens after the acquire and before the release.
Generally, this is achieved with a read-modify-write instruction, like compare-exchange, along with an acquire barrier, when entering the critical section, and another read-modify-write instruction with a release barrier when exiting the critical section.
But there are some situations where you might have a similar critical section[1] between a load-acquire and a release-store, except only one thread actually modifies the synchronization variable. Other threads may read the synchronization variable, but only one thread actually modifies it. In this case, when entering the critical section, you don't need a read-modify-write instruction. You would just need a simple store, since you are not racing with other threads that are attempting to modify the synchronization flag. (This may seem odd, but note that many lock-free memory reclamation deferral patterns, like user-space RCU or epoch based reclamation, use thread-local synchronization variables that are written to only by one thread, but read by many threads, so this isn't too weird of a situation.)
So, when entering the critical section, you could just do something like:
sync_var.store(true, ...);
.... critical section ....
sync_var.store(false, std::memory_order_release);
There is no race, because, again, there is no need for a read-modify-write when only one thread needs to set/unset the critical section variable. Other threads can simply read the critical section variable with a load-acquire.
The problem is, when you're entering the critical section, you need an acquire operation or fence. But you don't need to do a LOAD, only a STORE. So what is a good way to produce acquire ordering when you only really need a STORE? I see only two real options that fall within the C++ memory model. Either:
Use an exchange instead of a store, so you can do sync_var.exchange(true, std::memory_order_acquire). The downside here is that exchange is a more heavy-weight read-modify-write operation, when all you really need is a simple store.
Insert a "dummy" load-acquire, like:
(void)sync_var.load(std::memory_order_acquire);
sync_var.store(true, std::memory_order_relaxed);
The "dummy" load-acquire seems better. Presumably, the compiler can't optimize away the unused load, because it's an atomic instruction that has the side-effect of producing a "synchronizes-with" relationship with a release operation on sync_var. But it also seems very hacky, and the intention is unclear without comments explaining what's going on.
So what is the best way to produce acquire semantics when all we need to do is a simple store?
[1] I use the term "critical section" loosely. I don't necessarily mean a section that is always accessed via mutual exclusion. Rather, I just mean any section where memory ordering is synchronized via acquire-release semantics. This could refer to a mutex, or it could just mean something like RCU, where the critical section can be accessed concurrently by multiple readers.
The flaw in your logic is the assumption that an atomic RMW is not required because the data in the critical section is modified by a single thread while all other threads only have read access.
This is not true; there still needs to be a well-defined order between reading and writing.
You don't want data to be modified while another thread is still reading it. Therefore, each thread needs to inform other threads when it has finished accessing the data.
By only using an atomic store to enter the critical section, the 'synchronizes-with' relationship cannot be established.
Acquire/release synchronization is based on a runtime relationship where the acquirer knows that synchronization is complete only after observing a particular value returned by the atomic load.
This can never be achieved by a single atomic store, since the one modifying thread can change the atomic variable sync_var at any time, and as such it has no way of knowing whether another thread is still reading the data.
The option with a 'dummy' load/acquire is also invalid, because it fails to inform other threads that this thread wants exclusive access. You attempt to solve that by using a single (relaxed) store, but the load and the store are separate operations that can be interleaved with other threads (i.e. multiple threads could enter the critical area simultaneously).
An atomic RMW must be used by each thread to load a particular value and, at the same time, update the variable to inform all other threads that it now has exclusive access (regardless of whether that is for reading or writing).
// sync_var is a std::atomic<bool>
void lock()
{
    // spin until we are the thread that flips it from false to true
    while (sync_var.exchange(true, std::memory_order_acquire))
        ;
}

void unlock()
{
    sync_var.store(false, std::memory_order_release);
}
Optimizations are possible where multiple threads have read access at the same time (e.g. std::shared_mutex).
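For example, a reader-writer version could be sketched with C++17's std::shared_mutex:

#include <shared_mutex>

std::shared_mutex rw;
int value = 0;

int read_value() {
    std::shared_lock<std::shared_mutex> lk(rw);  // many concurrent readers
    return value;
}

void write_value(int v) {
    std::unique_lock<std::shared_mutex> lk(rw);  // one exclusive writer
    value = v;
}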
I am writing multi-threaded code. I am not sure whether I need a read-write lock mechanism. Could you please go through the use case and tell me whether I have to use a read-write lock or whether a normal mutex will do.
Use case:
1) A class has two variables. These are accessed by every thread before doing an operation.
2) When something goes wrong, these variables are updated to reflect the error scenario.
Thus threads reading these variables can make different decisions (including aborting).
Here, in the second point, I need to update the data. And in the first point, every thread will use the data. So my question is: do I have to use a write lock while updating the data and a read lock while reading it? (Note: both variables are in memory, just a boolean flag and a string.)
I am confused because both of my vars are in memory, so does the OS take care of integrity? I mean, I can live with 1 or 2 threads missing the updated value while some thread is writing the data under the mutex.
Please tell me if I am right or wrong. Also, please tell me whether I have to use a read-write lock or whether a normal mutex would do.
Update: I am sorry that I did not give the platform and compiler name. I am on RHEL 5.0 and using gcc 4.6. My platform is x86_64. But I do not want my code to be OS specific, because we are going to port the code to Solaris 10 shortly.
First off, ignore those other answerers talking about volatile. Volatile is almost useless for multithreaded programming, and any false sense of safety given by it is just that - false.
Now, whether you need a lock depends on what you're doing with these variables. You will need a memory barrier at least (locks imply one).
So let's give an example:
One flag is an error flag. If zero, you continue, otherwise, you abort.
Another flag is a diagnostic code flag. It gives the precise reason for the error.
In this case, one option would be to do the following:
Read the error flag without a lock, but with read memory barriers after the read.
When an error occurs, take a lock, set the diagnostic code and error flags, then release the lock. If the diagnostic code is already set, release the lock immediately.
The memory barriers are needed, as otherwise the compiler (or CPU!) may choose to cache the same result for every read.
Of course, if the semantics of your two variables are different, the answer may vary. You'll need to be more specific.
Note that the exact mechanism for specifying locks and memory barriers depends on the compiler. C++0x provides a portable mechanism, but few compilers fully implement the C++0x standard yet. Please specify your compiler and OS for a more detailed answer.
As for your output data, you will almost certainly need a lock there. Try to avoid taking these locks too often though, as too much lock contention will kill your performance.
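In modern C++ (the question predates widely available C++11 compilers), the scheme described above could be sketched like this; the names error_flag and diagnostic are illustrative:

#include <atomic>
#include <mutex>
#include <string>

std::atomic<bool> error_flag{false};
std::mutex m;
std::string diagnostic;            // protected by m

// Hot path: lock-free check. The acquire load pairs with the release
// store below, so a thread that sees the flag also sees the diagnostic.
bool should_abort() {
    return error_flag.load(std::memory_order_acquire);
}

// Error path: take the lock, set the diagnostic first, then the flag.
void report_error(const std::string& why) {
    std::lock_guard<std::mutex> lk(m);
    if (!error_flag.load(std::memory_order_relaxed)) {  // already set? done
        diagnostic = why;
        error_flag.store(true, std::memory_order_release);
    }
}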
If they are atomic variables (C1x stdatomic.h or C++0x atomic), then you don't need read/write locks. With earlier C/C++ standards, there's no portable way of using multiple threads at all, so you need to look into how the implementation you are using does things. In most implementations, data types that can be accessed with a single machine instruction are atomic.
Note that just having an atomic variable is not enough -- you probably also need to declare it as volatile to guarantee that the compiler does not do things that will cause you to miss updates from other threads.
Thus threads reading these variables can take different decisions (including abort)
So each thread needs to ensure that it reads the updated data. Also, since the variables are shared, you need to take care of race conditions as well.
So in short - you need to use read/write locks when reading and writing to these shared variables.
See if you can use volatile variables - that should save you from using locks when you read the values (writes should still be done under locks, however). This is applicable only because you said that -
I mean I can live with 1 or 2 threads missing the updated value when some thread is writing the data in mutex
I heard that there is something known as an "atomic action" which is faster than using a mutex with a critical section.
Does somebody know what it is, and how do I use it?
An atomic operation is an operation where the CPU reads and writes memory during the same bus access, which prevents other CPUs or system devices from modifying the memory simultaneously. E.g. a "test and set" operation, which could do "read memory at location X; if it is 0, set it to 1; and return an indicator telling whether the value was set" without any chance of simultaneous access.
Wikipedia's http://en.wikipedia.org/wiki/Linearizability describes atomic operations.
If you're on Windows, have a look at e.g. InterlockedCompareExchange or InterlockedIncrement, which are wrappers for atomic operations.
EDIT: Sample usage
A spinlock could be implemented using a test-and-set atomic operation:
while (test_and_set(&x) == 1) ;
This will keep looping until the current thread is the one that sets x to 1. If all other threads treat x in the same way, its effect is the same as a mutex.
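In portable C++11, the same spinlock idea is spelled with std::atomic_flag, whose test_and_set is guaranteed lock-free:

#include <atomic>

std::atomic_flag locked = ATOMIC_FLAG_INIT;

void lock()   { while (locked.test_and_set(std::memory_order_acquire)) {} }
void unlock() { locked.clear(std::memory_order_release); }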
An atomic action only refers to the fact that an action will be performed atomically, uninterrupted by co-running threads/processes.
What you are probably looking for are atomic built-ins in compilers. For example GCC provides this set: http://gcc.gnu.org/onlinedocs/gcc-4.5.1/gcc/Atomic-Builtins.html
These are usually implemented very efficiently using CPU support.
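For instance, an atomic increment using the __sync builtins linked above (these also act as full memory barriers; newer code would use the __atomic builtins or std::atomic):

int counter = 0;

void bump() {
    __sync_fetch_and_add(&counter, 1);  // atomic read-modify-write increment
}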
It is a tradeoff. As other posters have stated, an atomic operation is "try to grab this flag and return true on success". It is fast, but there are downsides.
A proper mutex blocks the threads that need to get into the critical section. With only atomic operations, the waiting threads have to loop until they get the flag - this wastes CPU cycles. Another difference is fairness: mutexes typically give waiting threads fair access, often by queueing them in FIFO order, whereas with spinlocks there's a risk of resource starvation.
So bare atomic operations are faster, but only when there are not too many threads trying to grab the critical section.
Since a GCC-specific answer was given, here's the VC++-specific link -- all of the intrinsics listed that begin with _Interlocked are relevant: http://msdn.microsoft.com/en-us/library/hd9bdb82.aspx. Also note that there are more intrinsics available for x64 than for x86: http://msdn.microsoft.com/en-us/library/azcs88h2.aspx.
The critical section implementation (certainly on Windows) uses an atomic variable to detect whether the critical section (CS) is held by another thread, and enters a kernel-side synchronization primitive only if a real collision occurs. So if you need to protect a small piece of code and the collision probability is small enough, a critical section is a good solution.
However, if the protected code does nothing except increment/decrement or test and modify a single variable, then it's the right case for an atomic operation.
It is also possible to protect more complicated code with atomics (google "lock-free structures" and "transactional memory" for more info).
That is interesting but very complicated stuff, and it is not recommended if some simple solution (like a critical section) also works.
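As a baseline example of that simple case, a shared counter is a textbook fit for an atomic RMW rather than a critical section (shown here with std::atomic, a modern restatement since the question predates it):

#include <atomic>

std::atomic<long> hits{0};

void on_hit() {
    hits.fetch_add(1, std::memory_order_relaxed);  // one atomic instruction:
                                                   // no lock, no kernel call
}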