Concurrency: Atomic and volatile in C++11 memory model - c++

A global variable is shared across 2 concurrently running threads on 2 different cores. The threads writes to and read from the variables. For the atomic variable can one thread read a stale value? Each core might have a value of the shared variable in its cache and when one threads writes to its copy in a cache the other thread on a different core might read stale value from its own cache. Or the compiler does strong memory ordering to read the latest value from the other cache? The c++11 standard library has std::atomic support. How this is different from the volatile keyword? How volatile and atomic types will behave differently in the above scenario?

Firstly, volatile does not imply atomic access. It is designed for things like memory mapped I/O and signal handling. volatile is completely unnecessary when used with std::atomic, and unless your platform documents otherwise, volatile has no bearing on atomic access or memory ordering between threads.
If you have a global variable which is shared between threads, such as:
std::atomic<int> ai;
then the visibility and ordering constraints depend on the memory ordering parameter you use for operations, and the synchronization effects of locks, threads and accesses to other atomic variables.
In the absence of any additional synchronization, if one thread writes a value to ai then there is nothing that guarantees that another thread will see the value in any given time period. The standard specifies that it should be visible "in a reasonable period of time", but any given access may return a stale value.
The default memory ordering of std::memory_order_seq_cst provides a single global total order for all std::memory_order_seq_cst operations across all variables. This doesn't mean that you can't get stale values, but it does mean that the value you do get determines and is determined by where in this total order your operation lies.
If you have 2 shared variables x and y, initially zero, and have one thread write 1 to x and another write 2 to y, then a third thread that reads both may see either (0,0), (1,0), (0,2) or (1,2) since there is no ordering constraint between the operations, and thus the operations may appear in any order in the global order.
If both writes are from the same thread, which does x=1 before y=2 and the reading thread reads y before x then (0,2) is no longer a valid option, since the read of y==2 implies that the earlier write to x is visible. The other 3 pairings (0,0), (1,0) and (1,2) are still possible, depending how the 2 reads interleave with the 2 writes.
If you use other memory orderings such as std::memory_order_relaxed or std::memory_order_acquire then the constraints are relaxed even further, and the single global ordering no longer applies. Threads don't even necessarily have to agree on the ordering of two stores to separate variables if there is no additional synchronization.
The only way to guarantee you have the "latest" value is to use a read-modify-write operation such as exchange(), compare_exchange_strong() or fetch_add(). Read-modify-write operations have an additional constraint that they always operate on the "latest" value, so a sequence of ai.fetch_add(1) operations by a series of threads will return a sequence of values with no duplicates or gaps. In the absence of additional constraints, there's still no guarantee which threads will see which values though. In particular, it is important to note that the use of an RMW operation does not force changes from other threads to become visible any quicker, it just means that if the changes are not seen by the RMW then all threads must agree that they are later in the modification order of that atomic variable than the RMW operation. Stores from different threads can still be delayed by arbitrary amounts of time, depending on when the CPU actually issues the store to memory (rather than just its own store buffer), physically how far apart the CPUs executing the threads are (in the case of a multi-processor system), and the details of the cache coherency protocol.
Working with atomic operations is a complex topic. I suggest you read a lot of background material, and examine published code before writing production code with atomics. In most cases it is easier to write code that uses locks, and not noticeably less efficient.

volatile and the atomic operations have a different background, and
were introduced with a different intent.
volatile dates from way back, and is principally designed to prevent
compiler optimizations when accessing memory mapped IO. Modern
compilers tend to do no more than suppress optimizations for volatile,
although on some machines, this isn't sufficient for even memory mapped
IO. Except for the special case of signal handlers, and setjmp,
longjmp and getjmp sequences (where the C standard, and in the case
of signals, the Posix standard, gives additional guarantees), it must be
considered useless on a modern machine, where without special additional
instructions (fences or memory barriers), the hardware may reorder or
even suppress certain accesses. Since you shouldn't be using setjmp
et al. in C++, this more or less leaves signal handlers, and in a
multithreaded environment, at least under Unix, there are better
solutions for those as well. And possibly memory mapped IO, if you're
working on kernal code and can ensure that the compiler generates
whatever is needed for the platform in question. (According to the
standard, volatile access is observable behavior, which the compiler
must respect. But the compiler gets to define what is meant by
“access”, and most seem to define it as “a load or
store machine instruction was executed”. Which, on a modern
processor, doesn't even mean that there is necessarily a read or write
cycle on the bus, much less that it's in the order you expect.)
Given this situation, the C++ standard added atomic access, which does
provide a certain number of guarantees across threads; in particular,
the code generated around an atomic access will contain the necessary
additional instructions to prevent the hardware from reordering the
accesses, and to ensure that the accesses propagate down to the global
memory shared between cores on a multicore machine. (At one point in
the standardization effort, Microsoft proposed adding these semantics to
volatile, and I think some of their C++ compilers do. After
discussion of the issues in the committee, however, the general
consensus—including the Microsoft representative—was that it
was better to leave volatile with its orginal meaning, and to define
the atomic types.) Or just use the system level primitives, like
mutexes, which execute whatever instructions are needed in their code.
(They have to. You can't implement a mutex without some guarantees
concerning the order of memory accesses.)

Here's a basic synopsis of what the 2 things are:
1) Volatile keyword:
Tells the compiler that this value could alter at any moment and therefore it should not EVER cache it in a register. Look up the old "register" keyword in C. "Volatile" is basically the "-" operator to "register"'s "+". Modern compilers now do the optimization that "register" used to explicitly request by default, so you only see 'volatile' anymore. Using the volatile qualifier will guarantee that your processing never uses a stale value, but nothing more.
2) Atomic:
Atomic operations modify data in a single clock tick, so that it is impossible for ANY other thread to access the data in the middle of such an update. They're usually limited to whatever single-clock assembly instructions the hardware supports; things like ++,--, and swapping 2 pointers. Note that this says nothing about the ORDER the different threads will RUN the atomic instructions, only that they will never run in parallel. That's why you have all those additional options for forcing an ordering.

Volatile and Atomic serve different purposes.
Volatile :
Informs the compiler to avoid optimization. This keyword is used for variables that shall change unexpectedly. So, it can be used to represent the Hardware status registers, variables of ISR, Variables shared in a multi-threaded application.
Atomic :
It is also used in case of multi-threaded application. However, this ensures that there is no lock/stall while using in a multi-threaded application. Atomic operations are free of races and indivisble. Few of the key scenario of usage is to check whether a lock is free or used, atomically add to the value and return the added value etc. in multi-threaded application.

Related

Atomic operation propagation/visibility (atomic load vs atomic RMW load)

Context 
I am writing a thread-safe protothread/coroutine library in C++, and I am using atomics to make task switching lock-free. I want it to be as performant as possible. I have a general understanding of atomics and lock-free programming, but I do not have enough expertise to optimise my code. I did a lot of researching, but it was hard to find answers to my specific problem: What is the propagation delay/visiblity for different atomic operations under different memory orders?
Current assumptions 
I read that changes to memory are propagated from other threads, in such a way that they might become visible:
in different orders to different observers,
with some delay.
I am unsure as to whether this delayed visibility and inconsistent propagation applies only to non-atomic reads, or to atomic reads as well, potentially depending on what memory order is used. As I am developing on an x86 machine, I have no way of testing the behaviour on weakly ordered systems.
Do all atomic reads always read the latest values, regardless of the type of operation and the memory order used? 
I am pretty sure that all read-modify-write (RMW) operations always read the most recent value written by any thread, regardless of the memory order used. The same seems to be true for sequentially consistent operations, but only if all other modifications to a variable are also sequentially consistent. Both are said to be slow, which is not good for my task. If not all atomic reads get the most recent value, then I will have to use RMW operations just for reading an atomic variable's latest value, or use atomic reads in a while loop, to my current understanding.
Does the propagation of writes (ignoring side effects) depend on the memory order and the atomic operation used? 
(This question only matters if the answer to the previous question is that not all atomic reads always read the most recent value. Please read carefully, I do not ask about the visibility and propagation of side-effects here. I am merely concerned with the value of the atomic variable itself.) This would imply that depending on what operation is used to modify an atomic variable, it would be guaranteed that any following atomic read receives the most recent value of the variable. So I would have to choose between an operation guaranteed to always read the latest value, or use relaxed atomic reads, in tandem with this special write operation that guarantees instant visibility of the modification to other atomic operations.
Is atomic lock-free ?
First of all, let's get rid of the elephant in the room: using atomic in your code doesn't guarantee a lock-free implementation. atomic is only an enabler for a lock-free implementation. is_lock_free() will tell you if it's really lock-free for the C++ implementation and the underlying types that you are using.
What's the latest value ?
The term "latest" is very ambiguous in the world of multithreading. Because what is the "latest" for one thread that might be put asleep by the OS, might no longer be what is the latest for another thread that is active.
std::atomic only guarantees is a protection against racing conditions, by ensuring that R, M and RMW performed on one atomic in one thread are performed atomically, without any interruption, and that all other threads see either the value before or the value after, but never what's in-between. So atomic synchronize threads by creating an order between concurrent operations on the same atomic object.
You need to see every thread as a parallel universe with its own time and that is unaware of the time in the parallel universes. And like in quantum physics, the only thing that you can know in one thread about another thread is what you can observe (i.e. a "happened before" relation between the universes).
This means that you should not conceive multithreaded time as if there would be an absolute "latest" across all the threads. You need to conceive time as relative to the other threads. This is why atomics don't create an absolute latest, but only ensure a sequential ordering of the successive states that an atomic will have.
Propagation
The propagation doesn't depend on the memory order nor the atomic operation performed. memory_order is about sequential constraints on non-atomic variables around atomic operations that are seen like fences. The best explanation of how this works is certainly Herb Sutters presentation, that is definitively worth its hour and half if you're working on multithreading optimisation.
Although it is possible that a particular C++ implementation could implement some atomic operation in a way that influences propagation, you cannot rely on any such observation that you would do, since there would be no guarantee that propagation works in the same fashion in the next release of the compiler or on another compiler on another CPU architecture.
But does propagation matter ?
When designing lock-free algorithms, it is tempting to read atomic variables to get the latest status. But whereas such a read-only access is atomic, the action immediately after is not. So the following instructions might assume a state which is already obsolete (for example because the thread is send asleep immediately after the atomic read).
Take if(my_atomic_variable<10) and suppose that you read 9. Suppose you're in the best possible world and 9 would be the absolutely latest value set by all the concurrent threads. Comparing its value with <10 is not atomic, so that when the comparison succeeds and if branches, my_atomic_variable might already have a new value of 10. And this kind of problems might occur regardless of how fast the propagation is, and even if the read would be guaranteed to always get the latest value. And I didn't even mention the ABA problem yet.
The only benefit of the read is to avoid a data race and UB. But if you want to synchronize decisions/actions across threads, you need to use an RMW, such compare-and-swap (e.g. atomic_compare_exchange_strong) so that the ordering of atomic operations result in a predictable outcome.
After some discussion, here are my findings: First, let's define what an atomic variable's latest value means: In wall-clock time, the very latest write to an atomic variable, so, from an external observer's point of view. If there are multiple simultaneous last writes (i.e., on multiple cores during the same cycle), then it doesn't really matter which one of them is chosen.
Atomic loads of any memory order have no guarantee that the latest value is read. This means that writes have to propagate before you can access them. This propagation may be out of order with respect to the order in which they were executed, as well as differing in order with respect to different observers.
std::atomic_int counter = 0;
void thread()
{
// Imagine no race between read and write.
int value = counter.load(std::memory_order_relaxed);
counter.store(value+1, std::memory_order_relaxed);
}
for(int i = 0; i < 1000; i++)
std::async(thread);
In this example, according to my understanding of the specs, even if no read-write executions were to interfere, there could still be multiple executions of thread that read the same values, so that in the end, counter would not be 1000. This is because when using normal reads, although threads are guaranteed to read modifications to the same variable in the correct order (they will not read a new value and on the next value read an older value), they are not guaranteed to read the globally latest written value to a variable.
This creates the relativity effect (as in Einstein's physics) that every thread has its own "truth", and this is exactly why we need to use sequential consistency (or acquire/release) to restore causality: If we simply use relaxed loads, then we can even have broken causality and apparent time loops, which can happen because of instruction reordering in combination with out-of-order propagation. Memory ordering will ensure that those separate realities perceived by separate threads are at least causally consistent.
Atomic read-modify-write (RMW) operations (such as exchange, compare_exchange, fetch_add,…) are guaranteed to operate on the latest value as defined above. This means that propagation of writes is forced, and results in one universal view on the memory (if all reads you make are from atomic variables using RMW operations), independent of threads. So, if you use atomic.compare_exchange_strong(value,value, std::memory_order_relaxed) or atomic.fetch_or(0, std::memory_order_relaxed), then you are guaranteed to perceive one global order of modification that encompasses all atomic variables. Note that this does not guarantee you any ordering or causality of non-RMW reads.
std::atomic_int counter = 0;
void thread()
{
// Imagine no race between read and write.
int value = counter.fetch_or(0, std::memory_order_relaxed);
counter.store(value+1, std::memory_order_relaxed);
}
for(int i = 0; i < 1000; i++)
std::async(thread);
In this example (again, under the assumption that none of the thread() executions interfere with each other), it seems to me that the spec forbids value to contain anything but the globally latest written value. So, counter would always be 1000 in the end.
Now, when to use which kind of read? 
If you only need causality within each thread (there might still be different views on what happened in which order, but at least every single reader has a causally consistent view on the world), then atomic loads and acquire/release or sequential consistency suffice.
But if you also need fresh reads (so that you must never read values other than the globally (across all threads) latest value), then you should use RMW operations for reading. Those alone do not create causality for non-atomic and non-RMW reads, but all RMW reads across all threads share the exact same view on the world, which is always up to date.
So, to conclude: Use atomic loads if different world views are allowed, but if you need an objective reality, use RMWs to load.
Multithreading is surprising area.
First, an atomic read is not ordered after a write. I e reading a value does not mean that it were written before. Sometimes such read may ever see (indirect, by other thread) result of some subsequent atomic write by the same thread.
Sequential consistency are clearly about visibility and propagation. When a thread writes an atomic "sequentially consistent" it makes all its previous writes to be visible to other threads (propagation). In such case a (sequentially consistent) read is ordered in relation to a write.
Generally the most performant operations are "relaxed" atomic operations, but they provide minimum guarranties on ordering. In principle there is ever some causality paradoxes... :-)

How do I make memory stores in one thread "promptly" visible in other threads?

Suppose I wanted to copy the contents of a device register into a variable that would be read by multiple threads. Is there a good general way of doing this? Here are examples of two possible methods of doing this:
#include <atomic>
volatile int * const Device_reg_ptr = reinterpret_cast<int *>(0x666);
// This variable is read by multiple threads.
std::atomic<int> device_reg_copy;
// ...
// Method 1
const_cast<volatile std::atomic<int> &>(device_reg_copy)
.store(*Device_reg_ptr, std::memory_order_relaxed);
// Method 2
device_reg_copy.store(*Device_reg_ptr, std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_release);
More generally, in the face of possible whole program optimization, how does one correctly control the latency of memory writes in one thread being visible in other threads?
EDIT: In your answer, please consider the following scenario:
The code is running on a CPU in an embedded system.
A single application is running on the CPU.
The application has far fewer threads than the CPU has processor cores.
Each core has a massive number of registers.
The application is small enough that whole program optimization is successfully used when building its executable.
How do we make sure that a store in one thread does not remain invisible to other threads indefinitely?
If you would like to update the value of device_reg_copy in atomic fashion, then device_reg_copy.store(*Device_reg_ptr, std::memory_order_relaxed); suffices.
There is no need to apply volatile to atomic variables, it is unnecessary.
std::memory_order_relaxed store is supposed to incur the least amount of synchronization overhead. On x86 it is just a plain mov instruction.
However, if you would like to update it in such a way, that the effects of any preceding stores become visible to other threads along with the new value of device_reg_copy, then use std::memory_order_release store, i.e. device_reg_copy.store(*Device_reg_ptr, std::memory_order_release);. The readers need to load device_reg_copy as std::memory_order_acquire in this case. Again, on x86 std::memory_order_release store is a plain mov.
Whereas if you use the most expensive std::memory_order_seq_cst store, it does insert the memory barrier for you on x86.
This is why they say that x86 memory model is a bit too strong for C++11: plain mov instruction is std::memory_order_release on stores and std::memory_order_acquire on loads. There is no relaxed store or load on x86.
I cannot recommend enough CPU Cache Flushing Fallacy article.
The C++ standard is rather vague about making atomic stores visible to other threads..
29.3.12
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
That is as detailed as it gets, there is no definition of 'reasonable', and it does not have to be immediately.
Using a stand-alone fence to force a certain memory ordering is not necessary since you can specify those on atomic operations, but the question is,
what is your expectation with regards to using a memory fence..
Fences are designed to enforce ordering on memory operations (between threads), but they do not guarantee visibility in a timely manner.
You can store a value to an atomic variable with the strongest memory ordering (ie. seq_cst), but even when another thread executes load() at a later time than the store(),
you might still get an old value from the cache and yet (surprisingly) it does not violate the happens-before relationship.
Using a stronger fence might make a difference wrt. timing and visibility, but there are no guarantees.
If prompt visibility is important, I would consider using a Read-Modify-Write (RMW) operation to load the value.
These are atomic operations that read and modify atomically (ie. in a single call), and have the additional property that they are guaranteed to operate on the latest value.
But since they have to reach a little further than the local cache, these calls also tend to be more expensive to execute.
As pointed out by Maxim Egorushkin, whether or not you can use weaker memory orderings than the default (seq_cst) depends on whether other memory operations need to be synchronized (made visible) between threads.
That is not clear from your question, but it is generally considered safe to use the default (sequential consistency).
If you are on an unusually weak platform, if performance is problematic, and if you need data synchronization between threads, you could consider using acquire/release semantics:
// thread 1
device_reg_copy.store(*Device_reg_ptr, std::memory_order_release);
// thread 2
device_reg_copy.fetch_add(0, std::memory_order_acquire);
If thread 2 sees the value written by thread 1, it is guaranteed that memory operations prior to the store in thread 1 are visible after the load in thread 2.
Acquire/Release operations form a pair and they synchronize based on a run-time relationship between the store and load. In other words, if thread 2 does not see the value stored by thread 1,
there are no ordering guarantees.
If the atomic variable has no dependencies on any other data, you can use std::memory_order_relaxed; store ordering is always guaranteed for a single atomic variable.
As mentioned by others, there is no need for volatile when it comes to inter-thread communication with std::atomic.

Would a volatile variable be enough in this case? [duplicate]

A global variable is shared across 2 concurrently running threads on 2 different cores. The threads writes to and read from the variables. For the atomic variable can one thread read a stale value? Each core might have a value of the shared variable in its cache and when one threads writes to its copy in a cache the other thread on a different core might read stale value from its own cache. Or the compiler does strong memory ordering to read the latest value from the other cache? The c++11 standard library has std::atomic support. How this is different from the volatile keyword? How volatile and atomic types will behave differently in the above scenario?
Firstly, volatile does not imply atomic access. It is designed for things like memory mapped I/O and signal handling. volatile is completely unnecessary when used with std::atomic, and unless your platform documents otherwise, volatile has no bearing on atomic access or memory ordering between threads.
If you have a global variable which is shared between threads, such as:
std::atomic<int> ai;
then the visibility and ordering constraints depend on the memory ordering parameter you use for operations, and the synchronization effects of locks, threads and accesses to other atomic variables.
In the absence of any additional synchronization, if one thread writes a value to ai then there is nothing that guarantees that another thread will see the value in any given time period. The standard specifies that it should be visible "in a reasonable period of time", but any given access may return a stale value.
The default memory ordering of std::memory_order_seq_cst provides a single global total order for all std::memory_order_seq_cst operations across all variables. This doesn't mean that you can't get stale values, but it does mean that the value you do get determines and is determined by where in this total order your operation lies.
If you have 2 shared variables x and y, initially zero, and have one thread write 1 to x and another write 2 to y, then a third thread that reads both may see either (0,0), (1,0), (0,2) or (1,2) since there is no ordering constraint between the operations, and thus the operations may appear in any order in the global order.
If both writes are from the same thread, which does x=1 before y=2 and the reading thread reads y before x then (0,2) is no longer a valid option, since the read of y==2 implies that the earlier write to x is visible. The other 3 pairings (0,0), (1,0) and (1,2) are still possible, depending how the 2 reads interleave with the 2 writes.
If you use other memory orderings such as std::memory_order_relaxed or std::memory_order_acquire then the constraints are relaxed even further, and the single global ordering no longer applies. Threads don't even necessarily have to agree on the ordering of two stores to separate variables if there is no additional synchronization.
The only way to guarantee you have the "latest" value is to use a read-modify-write operation such as exchange(), compare_exchange_strong() or fetch_add(). Read-modify-write operations have an additional constraint that they always operate on the "latest" value, so a sequence of ai.fetch_add(1) operations by a series of threads will return a sequence of values with no duplicates or gaps. In the absence of additional constraints, there's still no guarantee which threads will see which values though. In particular, it is important to note that the use of an RMW operation does not force changes from other threads to become visible any quicker, it just means that if the changes are not seen by the RMW then all threads must agree that they are later in the modification order of that atomic variable than the RMW operation. Stores from different threads can still be delayed by arbitrary amounts of time, depending on when the CPU actually issues the store to memory (rather than just its own store buffer), physically how far apart the CPUs executing the threads are (in the case of a multi-processor system), and the details of the cache coherency protocol.
Working with atomic operations is a complex topic. I suggest you read a lot of background material, and examine published code before writing production code with atomics. In most cases it is easier to write code that uses locks, and not noticeably less efficient.
volatile and the atomic operations have a different background, and
were introduced with a different intent.
volatile dates from way back, and is principally designed to prevent
compiler optimizations when accessing memory mapped IO. Modern
compilers tend to do no more than suppress optimizations for volatile,
although on some machines, this isn't sufficient for even memory mapped
IO. Except for the special case of signal handlers, and setjmp,
longjmp and getjmp sequences (where the C standard, and in the case
of signals, the Posix standard, gives additional guarantees), it must be
considered useless on a modern machine, where without special additional
instructions (fences or memory barriers), the hardware may reorder or
even suppress certain accesses. Since you shouldn't be using setjmp
et al. in C++, this more or less leaves signal handlers, and in a
multithreaded environment, at least under Unix, there are better
solutions for those as well. And possibly memory mapped IO, if you're
working on kernal code and can ensure that the compiler generates
whatever is needed for the platform in question. (According to the
standard, volatile access is observable behavior, which the compiler
must respect. But the compiler gets to define what is meant by
“access”, and most seem to define it as “a load or
store machine instruction was executed”. Which, on a modern
processor, doesn't even mean that there is necessarily a read or write
cycle on the bus, much less that it's in the order you expect.)
Given this situation, the C++ standard added atomic access, which does
provide a certain number of guarantees across threads; in particular,
the code generated around an atomic access will contain the necessary
additional instructions to prevent the hardware from reordering the
accesses, and to ensure that the accesses propagate down to the global
memory shared between cores on a multicore machine. (At one point in
the standardization effort, Microsoft proposed adding these semantics to
volatile, and I think some of their C++ compilers do. After
discussion of the issues in the committee, however, the general
consensus—including the Microsoft representative—was that it
was better to leave volatile with its orginal meaning, and to define
the atomic types.) Or just use the system level primitives, like
mutexes, which execute whatever instructions are needed in their code.
(They have to. You can't implement a mutex without some guarantees
concerning the order of memory accesses.)
Here's a basic synopsis of what the 2 things are:
1) Volatile keyword:
Tells the compiler that this value could alter at any moment and therefore it should not EVER cache it in a register. Look up the old "register" keyword in C. "Volatile" is basically the "-" operator to "register"'s "+". Modern compilers now do the optimization that "register" used to explicitly request by default, so you only see 'volatile' anymore. Using the volatile qualifier will guarantee that your processing never uses a stale value, but nothing more.
2) Atomic:
Atomic operations modify data in a single clock tick, so that it is impossible for ANY other thread to access the data in the middle of such an update. They're usually limited to whatever single-clock assembly instructions the hardware supports; things like ++,--, and swapping 2 pointers. Note that this says nothing about the ORDER the different threads will RUN the atomic instructions, only that they will never run in parallel. That's why you have all those additional options for forcing an ordering.
Volatile and Atomic serve different purposes.
Volatile :
Informs the compiler to avoid optimization. This keyword is used for variables that shall change unexpectedly. So, it can be used to represent the Hardware status registers, variables of ISR, Variables shared in a multi-threaded application.
Atomic :
It is also used in case of multi-threaded application. However, this ensures that there is no lock/stall while using in a multi-threaded application. Atomic operations are free of races and indivisble. Few of the key scenario of usage is to check whether a lock is free or used, atomically add to the value and return the added value etc. in multi-threaded application.

atomic<T>.load() with std::memory_order_release

When writing C++11 code that uses the newly introduced thread-synchronization primitives to make use of the relaxed memory ordering, you usually see either
std::atomic<int> vv;
int i = vv.load(std::memory_order_acquire);
or
vv.store(42, std::memory_order_release);
It is clear to me why this makes sense.
My questions are: Do the combinations vv.store(42, std::memory_order_acquire) and vv.load(std::memory_order_release) also make sense? In which situation could one use them? What are the semantics of these combinations?
That's simply not allowed. The C++ (11) standard has requirements on what memory order constraints you can put on load/store operations.
For load (§29.6.5):
Requires: The order argument shall not be memory_order_release nor memory_order_acq_rel.
For store:
Requires: The order argument shall not be memory_order_consume, memory_order_acquire, nor memory_order_acq_rel.
The C/C++/LLVM memory model is sufficient for synchronization
strategies that ensure data is ready to be accessed before accessing
it. While that covers most common synchronization primitives, useful
properties can be obtained by building consistent models on weaker
guarantees.
The biggest example is the seqlock.
It relies on "speculatively" reading data that may not be in a
consistent state. Because reads are allowed to race with writes,
readers don't block writers -- a property which is used in the Linux
kernel to allow the system clock to be updated even if a user process
is repeatedly reading it. Another strength of the seqlock is that on
modern SMP arches it scales perfectly with the number of readers:
because the readers don't need to take any locks, they only need
shared access to the cache lines.
The ideal implementation of a seqlock would use something like a
"release load" in the reader, which is not available in any major
programming language. The kernel works around this with a full read
fence,
which scales well across architectures, but doesn't achieve optimal
performance.
These combinations do not make any sense, and they are not allowed either.
An acquire operation synchronizes previous non-atomic writes or side effects with a release operation so that when the acquire (load) is realized, all other stores (effects) that happened before the release (store) are also visible (for threads that acquire the same atomic that was released).
Now, if you could do (and would do) an acquire store and a release load, what should it do? What store should the acquire operation synchronize with? Itself?
Do the combinations vv.store(42, std::memory_order_acquire) and
vv.load(std::memory_order_release) also make sense?
Technically, they are formally disallowed but it isn't important to know that, except to write C++ code.
They simply cannot be defined in the model and it's important that you know and understand, even if you don't write code.
Note that disallowing these values is an important design choice: if you write your_own::atomic<> class, you can chose to allow these values and define them as equivalent to relaxed operations.
It's important to understand the design space; you must not have too much respect for the all C++ thread primitive design choices, some of which are purely arbitrary.
In which situation could one use them? What are the semantics of
these combinations?
None, as you must understand the fundamental notion that a read is not a write (it took me a while to get it). You can only claim to understand non linear execution when you get that idea.
In a non threaded program that doesn't have async signals, all steps are sequential and it doesn't matter that reads aren't writes: all reads of an object could just rewrite the value if you arrange for sequence points to be respected and if you allow writing to a constant its own value (which is OK in practice as long as memory is R/W).
So the distinction between reads and writes, at such level, isn't that important. You could manage to define a semantic based only operations that are both reads and writes of a memory location such that writing to a constant is allowed and reading an invalid value that isn't used is OK.
I don't recommend it of course, as it's pretty ugly to blurry the distinction between reads and writes.
But for multithreading you really don't want to have writes to data you only read: no only it would create data races (which you could arbitrary declare to be unimportant when the old value is written back), it would also not map to the CPU worldview as a write changes the state of the cache line of a shared object. The fact that a read isn't a write is essential for efficiency in multithread programs, much more than for single threaded ones.
At the abstract level, a store operation on an atomic is a modification so it's part of its modification order, and a load is not: a load only points to a position in the modification order (a load can see an atomically stored value, or the initial value, the value established at construction, before all atomic modifications).
Modifications are ordered with each others and loads are not, only with respect to modifications. (You can view loads as happening at exactly the same time.)
Acquire and release operations are about making an history (a past) and communicating it: a release operation on an object makes your past the past of the atomic object and the acquire operation makes that past your past.
A modification that isn't an atomic RMW cannot see previous value; on the other hand, an algorithm that includes a load and then a store (on one or two atomics) sees some previous value but isn't in general guaranteed to see the value left by the modification just before it in the modification order, so an acquire load X followed by a release store Y transitively releases the history and makes a past (of another thread at some point by another release operation that was seen by X) part of the past associated with the atomic variable by Y (in addition to the rest of our past).
A RMW is semantically different from acquire then release because there is never "space" in the history between the release and the acquire. It means that programs using only acq+rel RMW operations are always sequentially consistent, as they get the full past of all threads they interact with.
So if you want a acq+rel load or store, just do a RMW read or RMW write operation:
acq+rel load is RMW writing back the same value
acq+rel store is RMW dismissing the original value
You can write you own (strong) atomic class that does that for (strong) loads and (strong) stores: it would be logically defined as your class would make all operations, even loads, part of the operation history of the (strong) atomic object. So a (strong) load could be observed by a (strong) store, as they are both (atomic) modifications and reads of the underlying normal atomic object.
Note the set of acq_rel operations on such "strong atomic" objects would have strictly stronger guarantees than the intended guarantees of the set of seq_cst operations on normal atomics, for programs using relaxed atomic operations: the intent of the designers of seq_cst is that using seq_cst do not make programs using mixed atomic operations sequentially consistent in general.

What do each memory_order mean?

I read a chapter and I didn't like it much. I'm still unclear what the differences is between each memory order. This is my current speculation which I understood after reading the much more simple http://en.cppreference.com/w/cpp/atomic/memory_order
The below is wrong so don't try to learn from it
memory_order_relaxed: Does not sync but is not ignored when order is done from another mode in a different atomic var
memory_order_consume: Syncs reading this atomic variable however It doesnt sync relaxed vars written before this. However if the thread uses var X when modifying Y (and releases it). Other threads consuming Y will see X released as well? I don't know if this means this thread pushes out changes of x (and obviously y)
memory_order_acquire: Syncs reading this atomic variable AND makes sure relaxed vars written before this are synced as well. (does this mean all atomic variables on all threads are synced?)
memory_order_release: Pushes the atomic store to other threads (but only if they read the var with consume/acquire)
memory_order_acq_rel: For read/write ops. Does an acquire so you don't modify an old value and releases the changes.
memory_order_seq_cst: The same thing as acquire release except it forces the updates to be seen in other threads (if a store with relaxed on another thread. I store b with seq_cst. A 3rd thread reading a with relax will see changes along with b and any other atomic variable?).
I think I understood but correct me if i am wrong. I couldn't find anything that explains it in easy to read english.
The GCC Wiki gives a very thorough and easy to understand explanation with code examples.
(excerpt edited, and emphasis added)
IMPORTANT:
Upon re-reading the below quote copied from the GCC Wiki in the process of adding my own wording to the answer, I noticed that the quote is actually wrong. They got acquire and consume exactly the wrong way around. A release-consume operation only provides an ordering guarantee on dependent data whereas a release-acquire operation provides that guarantee regardless of data being dependent on the atomic value or not.
The first model is "sequentially consistent". This is the default mode used when none is specified, and it is the most restrictive. It can also be explicitly specified via memory_order_seq_cst. It provides the same restrictions and limitation to moving loads around that sequential programmers are inherently familiar with, except it applies across threads.
[...]
From a practical point of view, this amounts to all atomic operations acting as optimization barriers. It's OK to re-order things between atomic operations, but not across the operation. Thread local stuff is also unaffected since there is no visibility to other threads. [...] This mode also provides consistency across all threads.
The opposite approach is memory_order_relaxed. This model allows for much less synchronization by removing the happens-before restrictions. These types of atomic operations can also have various optimizations performed on them, such as dead store removal and commoning. [...] Without any happens-before edges, no thread can count on a specific ordering from another thread.
The relaxed mode is most commonly used when the programmer simply wants a variable to be atomic in nature rather than using it to synchronize threads for other shared memory data.
The third mode (memory_order_acquire / memory_order_release) is a hybrid between the other two. The acquire/release mode is similar to the sequentially consistent mode, except it only applies a happens-before relationship to dependent variables. This allows for a relaxing of the synchronization required between independent reads of independent writes.
memory_order_consume is a further subtle refinement in the release/acquire memory model that relaxes the requirements slightly by removing the happens before ordering on non-dependent shared variables as well.
[...]
The real difference boils down to how much state the hardware has to flush in order to synchronize. Since a consume operation may therefore execute faster, someone who knows what they are doing can use it for performance critical applications.
Here follows my own attempt at a more mundane explanation:
A different approach to look at it is to look at the problem from the point of view of reordering reads and writes, both atomic and ordinary:
All atomic operations are guaranteed to be atomic within themselves (the combination of two atomic operations is not atomic as a whole!) and to be visible in the total order in which they appear on the timeline of the execution stream. That means no atomic operation can, under any circumstances, be reordered, but other memory operations might very well be. Compilers (and CPUs) routinely do such reordering as an optimization.
It also means the compiler must use whatever instructions are necessary to guarantee that an atomic operation executing at any time will see the results of each and every other atomic operation, possibly on another processor core (but not necessarily other operations), that were executed before.
Now, a relaxed is just that, the bare minimum. It does nothing in addition and provides no other guarantees. It is the cheapest possible operation. For non-read-modify-write operations on strongly ordered processor architectures (e.g. x86/amd64) this boils down to a plain normal, ordinary move.
The sequentially consistent operation is the exact opposite, it enforces strict ordering not only for atomic operations, but also for other memory operations that happen before or after. Neither one can cross the barrier imposed by the atomic operation. Practically, this means lost optimization opportunities, and possibly fence instructions may have to be inserted. This is the most expensive model.
A release operation prevents ordinary loads and stores from being reordered after the atomic operation, whereas an acquire operation prevents ordinary loads and stores from being reordered before the atomic operation. Everything else can still be moved around.
The combination of preventing stores being moved after, and loads being moved before the respective atomic operation makes sure that whatever the acquiring thread gets to see is consistent, with only a small amount of optimization opportunity lost.
One may think of that as something like a non-existent lock that is being released (by the writer) and acquired (by the reader). Except... there is no lock.
In practice, release/acquire usually means the compiler needs not use any particularly expensive special instructions, but it cannot freely reorder loads and stores to its liking, which may miss out some (small) optimization opportuntities.
Finally, consume is the same operation as acquire, only with the exception that the ordering guarantees only apply to dependent data. Dependent data would e.g. be data that is pointed-to by an atomically modified pointer.
Arguably, that may provide for a couple of optimization opportunities that are not present with acquire operations (since fewer data is subject to restrictions), however this happens at the expense of more complex and more error-prone code, and the non-trivial task of getting dependency chains correct.
It is currently discouraged to use consume ordering while the specification is being revised.
This is a quite complex subject. Try to read http://en.cppreference.com/w/cpp/atomic/memory_order several times, try to read other resources, etc.
Here's a simplified description:
The compiler and CPU can reorder memory accesses. That is, they can happen in different order than what's specified in the code. That's fine most of the time, the problem arises when different thread try to communicate and may see such order of memory accesses that breaks the invariants of the code.
Usually you can use locks for synchronization. The problem is that they're slow. Atomic operations are much faster, because the synchronization happens at CPU level (i.e. CPU ensures that no other thread, even on another CPU, modifies some variable, etc.).
So, the one single problem we're facing is reordering of memory accesses. The memory_order enum specifies what types of reorderings compiler must forbid.
relaxed - no constraints.
consume - no loads that are dependent on the newly loaded value can be reordered wrt. the atomic load. I.e. if they are after the atomic load in the source code, they will happen after the atomic load too.
acquire - no loads can be reordered wrt. the atomic load. I.e. if they are after the atomic load in the source code, they will happen after the atomic load too.
release - no stores can be reordered wrt. the atomic store. I.e. if they are before the atomic store in the source code, they will happen before the atomic store too.
acq_rel - acquire and release combined.
seq_cst - it is more difficult to understand why this ordering is required. Basically, all other orderings only ensure that specific disallowed reorderings don't happen only for the threads that consume/release the same atomic variable. Memory accesses can still propagate to other threads in any order. This ordering ensures that this doesn't happen (thus sequential consistency). For a case where this is needed see the example at the end of the linked page.
I want to provide a more precise explanation, closer to the standard.
Things to ignore:
memory_order_consume - apparently no major compiler implements it, and they silently replace it with a stronger memory_order_acquire. Even the standard itself says to avoid it.
A big part of the cppreference article on memory orders deals with 'consume', so dropping it simplifies things a lot.
It also lets you ignore related features like [[carries_dependency]] and std::kill_dependency.
Data races: Writing to a non-atomic variable from one thread, and simultaneously reading/writing to it from a different thread is called a data race, and causes undefined behavior.
memory_order_relaxed is the weakest and supposedly the fastest memory order.
Any reads/writes to atomics can't cause data races (and subsequent UB). relaxed provides just this minimal guarantee, for a single variable. It doesn't provide any guarantees for other variables (atomic or not).
All threads agree on the order of operations on every particular atomic variable. But it's the case only for invididual variables. If other variables (atomic or not) are involved, threads might disagree on how exactly the operations on different variables are interleaved.
It's as if relaxed operations propagate between threads with slight unpredictable delays.
This means that you can't use relaxed atomic operations to judge when it's safe to access other non-atomic memory (can't synchronize access to it).
By "threads agree on the order" I mean that:
Each thread will access each separate variable in the exact order you tell it to. E.g. a.store(1, relaxed); a.store(2, relaxed); will write 1, then 2, never in the opposite order. But accesses to different variables in the same thread can still be reordered relative to each other.
If a thread A writes to a variable several times, then thread B reads seveal times, it will get the values in the same order (but of course it can read some values several times, or skip some, if you don't synchronize the threads in other ways).
No other guarantees are given.
Example uses: Anything that doesn't try to use an atomic variable to synchronize access to non-atomic data: various counters (that exist for informational purposes only), or 'stop flags' to signal other threads to stop. Another example: operations on shared_ptrs that increment the reference count internally use relaxed.
Fences: atomic_thread_fence(relaxed); does nothing.
memory_order_release, memory_order_acquire do everything relaxed does, and more (so it's supposedly slower or equivalent).
Only stores (writes) can use release. Only loads (reads) can use acquire. Read-modify-write operations such as fetch_add can be both (memory_order_acq_rel), but they don't have to.
Those let you synchronize threads:
Let's say thread 1 reads/writes to some memory M (any non-atomic or atomic variables, doesn't matter).
Then thread 1 performs a release store to a variable A. Then it stops
touching that memory.
If thread 2 then performs an acquire load of the same variable A, this load is said to synchronize with the corresponding store in thread 1.
Now thread 2 can safely read/write to that memory M.
You only synchronize with the latest writer, not preceding writers.
You can chain synchronizations across multiple threads.
There's a special rule that synchronization propagates across any number of read-modify-write operations regardless of their memory order. E.g. if thread 1 does a.store(1, release);, then thread 2 does a.fetch_add(2, relaxed);, then thread 3 does a.load(acquire), then thread 1 successfully synchronizes with thread 3, even though there's a relaxed operation in the middle.
In the above rule, a release operation X, and any subsequent read-modify-write operations on the same variable X (stopping at the next non-read-modify-write operation) are called a release sequence headed by X. (So if an acquire reads from any operation in a release sequence, it synchronizes with the head of the sequence.)
If read-modify-write operations are involved, nothing stops you from synchronizing with more than one operation. In the example above, if fetch_add was using acquire or acq_rel, it too would synchronize with thread 1, and conversely, if it used release or acq_rel, the thread 3 would synchonize with 2 in addition to 1.
Example use: shared_ptr decrements its reference counter using something like fetch_sub(1, acq_rel).
Here's why: imagine that thread 1 reads/writes to *ptr, then destroys its copy of ptr, decrementing the ref count. Then thread 2 destroys the last remaining pointer, also decrementing the ref count, and then runs the destructor.
Since the destructor in thread 2 is going to access the memory previously accessed by thread 1, the acq_rel synchronization in fetch_sub is necessary. Otherwise you'd have a data race and UB.
Fences: Using atomic_thread_fence, you can essentially turn relaxed atomic operations into release/acquire operations. A single fence can apply to more than one operation, and/or can be performed conditionally.
If you do a relaxed read (or with any other order) from one or more variables, then do atomic_thread_fence(acquire) in the same thread, then all those reads count as acquire operations.
Conversely, if you do atomic_thread_fence(release), followed by any number of (possibly relaxed) writes, those writes count as release operations.
An acq_rel fence combines the effect of acquire and release fences.
Similarity with other standard library features:
Several standard library features also cause a similar synchronizes with relationship. E.g. locking a mutex synchronizes with the latest unlock, as if locking was an acquire operation, and unlocking was a release operation.
memory_order_seq_cst does everything acquire/release do, and more. This is supposedly the slowest order, but also the safest.
seq_cst reads count as acquire operations. seq_cst writes count as release operations. seq_cst read-modify-write operations count as both.
seq_cst operations can synchronize with each other, and with acquire/release operations. Beware of special effects of mixing them (see below).
seq_cst is the default order, e.g. given atomic_int x;, x = 1; does x.store(1, seq_cst);.
seq_cst has an extra property compared to acquire/release: all threads agree on the order in which all seq_cst operations happen. This is unlike weaker orders, where threads agree only on the order of operations on each individual atomic variable, but not on how the operations are interleaved - see relaxed order above.
The presence of this global operation order seems to only affect which values you can get from seq_cst loads, it doesn't in any way affect non-atomic variables and atomic operations with weaker orders (unless seq_cst fences are involved, see below), and by itself doesn't prevent any extra data race UB compared to acq/rel operations.
Among other things, this order respects the synchronizes with relationship described for acquire/release above, unless (and this is weird) that synchronization comes from mixing a seq-cst operation with an acquire/release operation (release syncing with seq-cst, or seq-cst synching with acquire). Such mix essentially demotes the affected seq-cst operation to an acquire/release (it maybe retains some of the seq-cst properties, but you better not count on it).
Example use:
atomic_bool x = true;
atomic_bool y = true;
// Thread 1:
x.store(false, seq_cst);
if (y.load(seq_cst)) {...}
// Thread 2:
y.store(false, seq_cst);
if (x.load(seq_cst)) {...}
Lets say you want only one thread to be able to enter the if body. seq_cst allows you to do it. Acquire/release or weaker orders wouldn't be enough here.
Fences: atomic_thread_fence(seq_cst); does everything an acq_rel fence does, and more.
Like you would expect, they bring some seq-cst properties to atomic operations done with weaker orders.
All threads agree on the order of seq_cst fences, relative to one another and to any seq_cst operations (i.e. seq_cst fences participate in the global order of seq_cst operations, which was described above).
They essentially prevent atomic operations from being reordered across themselves.
E.g. we can transform the above example to:
atomic_bool x = true;
atomic_bool y = true;
// Thread 1:
x.store(false, relaxed);
atomic_thread_fence(seq_cst);
if (y.load(relaxed)) {...}
// Thread 2:
y.store(false, relaxed);
atomic_thread_fence(seq_cst);
if (x.load(relaxed)) {...}
Both threads can't enter if at the same time, because that would require reordering a load across the fence to be before the store.
But formally, the standard doesn't describe them in terms of reordering. Instead, it just explains how the seq_cst fences are placed in the global order of seq_cst operations. Let's say:
Thread 1 performs operation A on atomic variable X using using seq_cst order, OR a weaker order preceeded by a seq_cst fence.
Then:
Thread 2 performs operation B the same atomic variable X using seq_cst order, OR a weaker order followed by a seq_cst fence.
(Here A and B are any operations, except they can't both be reads, since then it's impossible to determine which one was first.)
Then the first seq_cst operation/fence is ordered before the second seq_cst operation/fence.
Then, if you imagine an scenario (e.g. in the example above, both threads entering the if) that imposes a contradicting requirements on the order, then this scenario is impossible.
E.g. in the example above, if the first thread enters the if, then the first fence must be ordered before the second one. And vice versa. This means that both threads entering the if would lead to a contradition, and hence not allowed.
Interoperation between different orders
Summarizing the above:
relaxed write
release write
seq-cst write
relaxed load
-
-
-
acquire load
-
synchronizes with
synchronizes with*
seq-cst load
-
synchronizes with*
synchronizes with
* = The participating seq-cst operation gets a messed up seq-cst order, effectively being demoted to an acquire/release operation. This is explained above.
Does using a stronger memory order makes data transfer between threads faster?
No, it seems not.
Sequental consistency for data-race-free programs
The standard explains that if your program only uses seq_cst accesses (and mutexes), and has no data races (which cause UB), then you don't need to think about all the fancy operation reorderings. The program will behave as if only one thread executed at a time, with the threads being unpredictably interleaved.