compare-and-swap atomic operation vs Load-link/store-conditional operation - c++

Under an x86 processor I am not sure of the difference between compare-and-swap atomic operation and Load-link/store-conditional operation. Is the latter safer than the former? Is it the case that the first is better than the second?

There are three common styles of atomic primitive: Compare-Exchange, Load-Linked/Store-Conditional, and Compare-And-Swap.
A CompareExchange operation will atomically read a memory location and, if it matches a compare value, store a specified new value. If the value that was read does not match the compare value, no store takes place. In any case, the operation will report the original value read.
A Compare-And-Swap operation is similar to CompareExchange, except that it does not report what value was read--merely whether whatever value was read matched the compare value. Note that a CompareExchange may be used to implement Compare-And-Swap by having it report whether the value read from memory matched the specified compare value.
The LL/SC combination allows a store operation to be conditioned upon whether some outside influence might have affected the target since its value was loaded. In particular, it guarantees that if the store succeeds, the location has not been written at all by outside code. Even if outside code wrote a new value and then re-wrote the original value, that would be guaranteed to cause the conditional store to fail. Conceptually, this might make LL/SC seem more powerful than other methods, since it wouldn't have the "ABA" problem. Unfortunately, LL/SC semantics allow for stores to spontaneously fail, and the probability of spontaneous failure may go up rapidly as the complexity of the code between the load and the store increases. While using LL/SC to implement something like an atomic increment directly would be more efficient than using it to implement a compare-and-swap and then building the atomic increment on top of that compare-and-swap, in situations where one would need to do much between the load and the store, one should generally use LL/SC to implement a compare-and-swap, and then use that compare-and-swap method in a load-modify-CompareAndSwap loop.
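To make that load-modify-CompareAndSwap loop concrete, here is a minimal C++ sketch built on std::atomic's compare-exchange (the name atomic_increment is purely illustrative; in real code you would simply call fetch_add):

#include <atomic>

// Build an atomic increment out of a load-modify-CompareAndSwap loop:
// load the current value, compute the new one, and retry until no other
// thread has changed the value in between.
int atomic_increment(std::atomic<int>& target)
{
    int observed = target.load(std::memory_order_relaxed);
    while (!target.compare_exchange_weak(observed, observed + 1,
                                         std::memory_order_relaxed))
    {
        // On failure, 'observed' has been refreshed with the current value.
    }
    return observed + 1;
}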
Of the three primitives, Compare-And-Swap is the least powerful, but it can be implemented in terms of either of the other two. CompareAndSwap can do a pretty good job of emulating CompareExchange, though there are some corner cases where such emulation might live-lock. Neither CompareExchange nor Compare-And-Swap can offer guarantees quite as strong as LL/SC, though the limited amount of code one can reliably place within an LL/SC loop limits the usefulness of its guarantees.

x86 does not provide LL/SC instructions. Check out wikipedia for platforms that do. Also see this SO question.

Related

Atomic operation propagation/visibility (atomic load vs atomic RMW load)

Context 
I am writing a thread-safe protothread/coroutine library in C++, and I am using atomics to make task switching lock-free. I want it to be as performant as possible. I have a general understanding of atomics and lock-free programming, but I do not have enough expertise to optimise my code. I did a lot of research, but it was hard to find answers to my specific problem: What is the propagation delay/visibility for different atomic operations under different memory orders?
Current assumptions 
I read that changes to memory made by one thread are propagated to other threads in such a way that they might become visible:
in different orders to different observers,
with some delay.
I am unsure as to whether this delayed visibility and inconsistent propagation applies only to non-atomic reads, or to atomic reads as well, potentially depending on what memory order is used. As I am developing on an x86 machine, I have no way of testing the behaviour on weakly ordered systems.
Do all atomic reads always read the latest values, regardless of the type of operation and the memory order used? 
I am pretty sure that all read-modify-write (RMW) operations always read the most recent value written by any thread, regardless of the memory order used. The same seems to be true for sequentially consistent operations, but only if all other modifications to a variable are also sequentially consistent. Both are said to be slow, which is not good for my task. If not all atomic reads get the most recent value, then I will have to use RMW operations just for reading an atomic variable's latest value, or use atomic reads in a while loop, to my current understanding.
Does the propagation of writes (ignoring side effects) depend on the memory order and the atomic operation used? 
(This question only matters if the answer to the previous question is that not all atomic reads always read the most recent value. Please read carefully, I do not ask about the visibility and propagation of side-effects here. I am merely concerned with the value of the atomic variable itself.) This would imply that depending on what operation is used to modify an atomic variable, it would be guaranteed that any following atomic read receives the most recent value of the variable. So I would have to choose between an operation guaranteed to always read the latest value, or use relaxed atomic reads, in tandem with this special write operation that guarantees instant visibility of the modification to other atomic operations.
Is atomic lock-free?
First of all, let's get rid of the elephant in the room: using atomic in your code doesn't guarantee a lock-free implementation. atomic is only an enabler for a lock-free implementation. is_lock_free() will tell you if it's really lock-free for the C++ implementation and the underlying types that you are using.
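For example, a quick check along these lines (a sketch; the exact results depend on your compiler and target) typically shows that a large trivially copyable type falls back to a lock-based implementation:

#include <atomic>
#include <cstdio>

struct Big { char bytes[64]; };   // large but trivially copyable

int main()
{
    std::atomic<int> small_value{0};
    std::atomic<Big> big_value{Big{}};
    std::printf("atomic<int> lock-free: %d\n", (int)small_value.is_lock_free());
    std::printf("atomic<Big> lock-free: %d\n", (int)big_value.is_lock_free()); // typically 0
}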
What's the latest value?
The term "latest" is very ambiguous in the world of multithreading. Because what is the "latest" for one thread that might be put asleep by the OS, might no longer be what is the latest for another thread that is active.
std::atomic only guarantees is a protection against racing conditions, by ensuring that R, M and RMW performed on one atomic in one thread are performed atomically, without any interruption, and that all other threads see either the value before or the value after, but never what's in-between. So atomic synchronize threads by creating an order between concurrent operations on the same atomic object.
You need to see every thread as a parallel universe with its own time and that is unaware of the time in the parallel universes. And like in quantum physics, the only thing that you can know in one thread about another thread is what you can observe (i.e. a "happened before" relation between the universes).
This means that you should not conceive multithreaded time as if there would be an absolute "latest" across all the threads. You need to conceive time as relative to the other threads. This is why atomics don't create an absolute latest, but only ensure a sequential ordering of the successive states that an atomic will have.
Propagation
The propagation doesn't depend on the memory order nor on the atomic operation performed. memory_order is about ordering constraints on the non-atomic operations around atomic operations, which are seen as fences. The best explanation of how this works is certainly Herb Sutter's presentation, which is definitely worth its hour and a half if you're working on multithreading optimisation.
Although it is possible that a particular C++ implementation could implement some atomic operation in a way that influences propagation, you cannot rely on any such observation that you make, since there would be no guarantee that propagation works in the same fashion in the next release of the compiler or on another compiler on another CPU architecture.
But does propagation matter?
When designing lock-free algorithms, it is tempting to read atomic variables to get the latest status. But whereas such a read-only access is atomic, the action immediately after it is not. So the following instructions might assume a state which is already obsolete (for example because the thread is sent to sleep immediately after the atomic read).
Take if(my_atomic_variable<10) and suppose that you read 9. Suppose you're in the best possible world and 9 would be the absolutely latest value set by all the concurrent threads. Comparing its value with <10 is not atomic, so that by the time the comparison succeeds and the if branches, my_atomic_variable might already have a new value of 10. And this kind of problem might occur regardless of how fast the propagation is, and even if the read were guaranteed to always get the latest value. And I didn't even mention the ABA problem yet.
The only benefit of the read is to avoid a data race and UB. But if you want to synchronize decisions/actions across threads, you need to use an RMW, such as compare-and-swap (e.g. atomic_compare_exchange_strong), so that the ordering of atomic operations results in a predictable outcome.
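For instance, if the intent of the my_atomic_variable<10 test above is to claim one of at most ten slots, the test and the update have to be folded into a single RMW. A minimal sketch (try_claim_slot is just an illustrative name):

#include <atomic>

std::atomic<int> my_atomic_variable{0};

// Atomically "check < 10 and increment", so no other thread can slip in
// between the comparison and the action.
bool try_claim_slot()
{
    int current = my_atomic_variable.load(std::memory_order_relaxed);
    while (current < 10)
    {
        if (my_atomic_variable.compare_exchange_weak(current, current + 1,
                                                     std::memory_order_relaxed))
            return true;   // we won the race for this slot
        // compare_exchange_weak refreshed 'current'; re-test the condition.
    }
    return false;          // already 10 or more
}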
After some discussion, here are my findings: First, let's define what an atomic variable's latest value means: In wall-clock time, the very latest write to an atomic variable, so, from an external observer's point of view. If there are multiple simultaneous last writes (i.e., on multiple cores during the same cycle), then it doesn't really matter which one of them is chosen.
Atomic loads of any memory order have no guarantee that the latest value is read. This means that writes have to propagate before you can access them. This propagation may be out of order with respect to the order in which they were executed, as well as differing in order with respect to different observers.
std::atomic_int counter{0};
void thread()
{
    // Imagine no race between read and write.
    int value = counter.load(std::memory_order_relaxed);
    counter.store(value + 1, std::memory_order_relaxed);
}
for (int i = 0; i < 1000; i++)
    std::async(thread);
In this example, according to my understanding of the specs, even if no read-write executions were to interfere, there could still be multiple executions of thread that read the same values, so that in the end, counter would not be 1000. This is because when using normal reads, although threads are guaranteed to read modifications to the same variable in the correct order (they will not read a new value and then, on a later read, see an older value), they are not guaranteed to read the globally latest value written to a variable.
This creates the relativity effect (as in Einstein's physics) that every thread has its own "truth", and this is exactly why we need to use sequential consistency (or acquire/release) to restore causality: If we simply use relaxed loads, then we can even have broken causality and apparent time loops, which can happen because of instruction reordering in combination with out-of-order propagation. Memory ordering will ensure that those separate realities perceived by separate threads are at least causally consistent.
Atomic read-modify-write (RMW) operations (such as exchange, compare_exchange, fetch_add,…) are guaranteed to operate on the latest value as defined above. This means that propagation of writes is forced, and results in one universal view on the memory (if all reads you make are from atomic variables using RMW operations), independent of threads. So, if you use atomic.compare_exchange_strong(value,value, std::memory_order_relaxed) or atomic.fetch_or(0, std::memory_order_relaxed), then you are guaranteed to perceive one global order of modification that encompasses all atomic variables. Note that this does not guarantee you any ordering or causality of non-RMW reads.
std::atomic_int counter{0};
void thread()
{
    // Imagine no race between read and write.
    int value = counter.fetch_or(0, std::memory_order_relaxed);
    counter.store(value + 1, std::memory_order_relaxed);
}
for (int i = 0; i < 1000; i++)
    std::async(thread);
In this example (again, under the assumption that none of the thread() executions interfere with each other), it seems to me that the spec forbids value to contain anything but the globally latest written value. So, counter would always be 1000 in the end.
Now, when to use which kind of read? 
If you only need causality within each thread (there might still be different views on what happened in which order, but at least every single reader has a causally consistent view on the world), then atomic loads and acquire/release or sequential consistency suffice.
But if you also need fresh reads (so that you must never read values other than the globally (across all threads) latest value), then you should use RMW operations for reading. Those alone do not create causality for non-atomic and non-RMW reads, but all RMW reads across all threads share the exact same view on the world, which is always up to date.
So, to conclude: Use atomic loads if different world views are allowed, but if you need an objective reality, use RMWs to load.
Multithreading is a surprising area.
First, an atomic read is not ordered after a write. That is, reading a value does not mean that it was written before the read. Sometimes such a read may even see (indirectly, via another thread) the result of some subsequent atomic write by the same thread.
Sequential consistency is clearly about visibility and propagation. When a thread performs a sequentially consistent atomic write, it makes all its previous writes visible to other threads (propagation). In such a case a (sequentially consistent) read is ordered in relation to a write.
Generally the most performant operations are "relaxed" atomic operations, but they provide minimal guarantees on ordering. In principle there can even be some causality paradoxes... :-)

In C11/C++11, possible to mix atomic/non-atomic ops on the same memory?

Is it possible to perform atomic and non-atomic ops on the same memory location?
I ask not because I actually want to do this, but because I'm trying to understand the C11/C++11 memory model. They define a "data race" like so:
The execution of a program contains a data race if it contains two
conflicting actions in different threads, at least one of which is not
atomic, and neither happens before the other. Any such data race
results in undefined behavior.
-- C11 §5.1.2.4 p25, C++11 § 1.10 p21
Its the "at least one of which is not atomic" part that is troubling me. If it weren't possible to mix atomic and non-atomic ops, it would just say "on an object which is not atomic."
I can't see any straightforward way of performing non-atomic operations on atomic variables. std::atomic<T> in C++ doesn't define any operations with non-atomic semantics. In C, all direct reads/writes of an atomic variable appear to be translated into atomic operations.
I suppose memcpy() and other direct memory operations might be a way of performing a non-atomic read/write on an atomic variable, i.e. memcpy(&atomicvar, othermem, sizeof(atomicvar))? But is this even defined behavior? In C++, std::atomic is not copyable, so would it be defined behavior to memcpy() it in C or C++?
Initialization of an atomic variable (whether through a constructor or atomic_init()) is defined to not be atomic. But this is a one-time operation: you're not allowed to initialize an atomic variable a second time. Placement new or an explicit destructor call would also not be atomic. But in all of these cases, it doesn't seem like it would be defined behavior anyway to have a concurrent atomic operation that might be operating on an uninitialized value.
Performing atomic operations on non-atomic variables seems totally impossible: neither C nor C++ define any atomic functions that can operate on non-atomic variables.
So what is the story here? Is it really about memcpy(), or initialization/destruction, or something else?
I think you're overlooking another case, the reverse order. Consider an initialized int whose storage is reused to create an std::atomic_int. All atomic operations happen after its ctor finishes, and therefore on initialized memory. But any concurrent, non-atomic access to the now-overwritten int has to be barred as well.
(I'm assuming here that the storage lifetime is sufficient and plays no role)
I'm not entirely sure because I think that the second access to int would be invalid anyway as the type of the accessing expression int doesn't match the object's type at the time (std::atomic<int>). However, "the object's type at the time" assumes a single linear time progression which doesn't hold in a multi-threaded environment. C++11 in general has that solved by making such assumptions about "the global state" Undefined Behavior per se, and the rule from the question appears to fit in that framework.
So perhaps rephrasing: if a single memory location contains an atomic object as well as a non-atomic object, and if the destruction of the earliest created (older) object is not sequenced-before the creation of the other (newer) object, then access to the older object conflicts with access to the newer object unless the former is scheduled-before the latter.
disclaimer: I am not a parallelism guru.
Is it possible to mix atomic/non-atomic ops on the same memory, and if
so, how?
You can write it in the code and compile it, but it will probably yield undefined behaviour.
When talking about atomics, it is important to understand what kind of problems they solve.
As you might know, what we briefly call "memory" is a multi-layered set of entities capable of holding data.
First we have the RAM, then the cache lines, then the registers.
On single-core processors we don't have any synchronization problem. On multi-core processors we have all of them: every core has its own set of registers and cache lines.
This causes a few problems.
The first of them is memory reordering: the CPU may decide at runtime to shuffle some read/write instructions to make the code run faster. This may yield some strange results that are completely invisible in the high-level code that produced this set of instructions. The most classic example of this phenomenon is the "two threads, two integers" example:
int i=0;
int j=0;
thread a -> i=1, then print j
thread b -> j=1 then print i;
logically, the result "00" cannot be. either a ends first, the result may be "01", either b ends first, the result may be "10". if both of them ends in the same time, the result may be "11". yet, if you build small program which imitates this situtation and run it in a loop, very quicly you will see the result "00"
Another problem is memory invisibility. Like I mentioned before, a variable's value may be cached in one of the cache lines, or be stored in one of the registers. When the CPU updates a variable's value, it may delay writing the new value back to the RAM. It may keep the value in the cache/register because it was told (by compiler optimizations) that the value will be updated again soon, so in order to make the program faster it updates the value again and only then writes it back to the RAM. This may cause undefined behaviour if another CPU (and consequently another thread or process) depends on the new value.
For example, look at this pseudo code:
bool b = true;
while (b) -> print 'a'
new thread -> sleep 4 seconds -> b=false;
The character 'a' may be printed infinitely, because b may be cached and the cached copy never updated.
There are many more problems when dealing with parallelism.
Atomics solve these kinds of issues by (in a nutshell) telling the compiler/CPU how to read and write data to/from the RAM correctly without doing unwanted reordering (read about memory orders). A memory order may force the CPU to write its values back to the RAM, or to read the values from the RAM even if they are cached.
So, although you can mix non-atomic actions with atomic ones, you are only doing part of the job.
For example, let's go back to the second example:
atomic bool b = true;
while (reload b) print 'a'
new thread -> b = (non-atomically) false.
So although one thread re-reads the value of b from the RAM again and again, the other thread may never write false back to the RAM.
So although you can mix these kinds of operations in the code, it will yield undefined behavior.
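For completeness, a sketch of the flag example with both sides done atomically, which is the version that does the whole job (the relaxed store is intended to become visible to the reading loop in a finite amount of time):

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

std::atomic<bool> b{true};

int main()
{
    std::thread t([] {
        std::this_thread::sleep_for(std::chrono::seconds(4));
        b.store(false, std::memory_order_relaxed);   // atomic write
    });
    while (b.load(std::memory_order_relaxed))        // atomic re-read
        std::printf("a");
    t.join();
}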
I'm interested in this topic since I have code in which sometimes I need to access a range of addresses serially, and at other times to access the same addresses in parallel with some way of managing contention.
So not exactly the situation posed by the original question which (I think) implies concurrent, or nearly so, atomic and non atomic operations in parallel code, but close.
I have managed by some devious casting to persuade my C11 compiler to allow me to access an integer and, much more usefully, a pointer both atomically and non-atomically ("directly"), having established that both types are officially lock-free on my x86_64 system. My, possibly simplistic, interpretation of that is that the sizes of the atomic and non-atomic types are the same and that the hardware can update such types in a single operation.
I definitely would not attempt to mix both types of access to an address in a parallel context; I think that would be doomed to fail. However I have been successful in using "direct" syntax operations in serial code and "atomic" syntax in parallel code, giving me the best of both worlds of the fastest possible access (and much simpler syntax) in serial, and safely managed contention when in parallel.
So you can do it so long as you don't try to mix both methods in parallel code and you stick to using lock-free types, which probably means up to the size of a pointer.

Understanding std::atomic::compare_exchange_weak() in C++11

bool compare_exchange_weak (T& expected, T val, ..);
compare_exchange_weak() is one of the compare-exchange primitives provided in C++11. It's weak in the sense that it can return false even if the value of the object is equal to expected. This is due to spurious failure on some platforms where a sequence of instructions (instead of one as on x86) is used to implement it. On such platforms, a context switch, reloading of the same address (or cache line) by another thread, etc. can make the primitive fail. It's spurious because it's not the value of the object (not being equal to expected) that fails the operation; instead, it's a kind of timing issue.
But what puzzles me is what's said in C++11 Standard (ISO/IEC 14882),
29.6.5
..
A consequence of spurious failure is that nearly all uses of weak
compare-and-exchange will be in a loop.
Why does it have to be in a loop in nearly all uses? Does that mean we shall loop when it fails because of spurious failures? If that's the case, why do we bother using compare_exchange_weak() and write the loop ourselves? We can just use compare_exchange_strong(), which I think should get rid of spurious failures for us. What are the common use cases of compare_exchange_weak()?
Another question related. In his book "C++ Concurrency In Action" Anthony says,
//Because compare_exchange_weak() can fail spuriously, it must typically
//be used in a loop:
bool expected=false;
extern atomic<bool> b; // set somewhere else
while(!b.compare_exchange_weak(expected,true) && !expected);
//In this case, you keep looping as long as expected is still false,
//indicating that the compare_exchange_weak() call failed spuriously.
Why is !expected there in the loop condition? Is it there to prevent all threads from starving and making no progress for some time?
One last question
On platforms where no single hardware CAS instruction exists, both the weak and strong versions are implemented using LL/SC (like ARM, PowerPC, etc.). So is there any difference between the following two loops? Why, if any? (To me, they should have similar performance.)
// use LL/SC (or CAS on x86) and ignore/loop on spurious failures
while (!compare_exchange_weak(..))
{ .. }
// use LL/SC (or CAS on x86) and ignore/loop on spurious failures
while (!compare_exchange_strong(..))
{ .. }
I came up with this last question because you all mention that there may be a performance difference inside a loop. It's also mentioned by the C++11 Standard (ISO/IEC 14882):
When a compare-and-exchange is in a loop, the weak version will yield
better performance on some platforms.
But as analyzed above, two versions in a loop should give the same/similar performance. What's the thing I miss?
Why doing exchange in a loop?
Usually, you want your work to be done before you move on, thus, you put compare_exchange_weak into a loop so that it tries to exchange until it succeeds (i.e., returns true).
Note that also compare_exchange_strong is often used in a loop. It does not fail due to spurious failure, but it does fail due to concurrent writes.
Why to use weak instead of strong?
Quite easy: Spurious failure does not happen often, so it is no big performance hit. In contrast, tolerating such a failure allows for a much more efficient implementation of the weak version (in comparison to strong) on some platforms: strong must always check for spurious failure and mask it. This is expensive.
Thus, weak is used because it is a lot faster than strong on some platforms.
When should you use weak and when strong?
The reference gives hints on when to use weak and when to use strong:
When a compare-and-exchange is in a loop, the weak version will yield
better performance on some platforms. When a weak compare-and-exchange
would require a loop and a strong one would not, the strong one is
preferable.
So the answer seems to be quite simple to remember: If you would have to introduce a loop only because of spurious failure, don't do it; use strong. If you have a loop anyway, then use weak.
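In code, the rule of thumb looks roughly like this (a sketch; add_one and claim_once are illustrative names):

#include <atomic>

std::atomic<int>  counter{0};
std::atomic<bool> flag{false};

// A loop is needed anyway (concurrent writers force retries): use weak.
void add_one()
{
    int old_value = counter.load();
    while (!counter.compare_exchange_weak(old_value, old_value + 1))
    {
        // old_value was refreshed; just try again.
    }
}

// No loop wanted: a single attempt that must not fail spuriously uses strong.
bool claim_once()
{
    bool expected = false;
    return flag.compare_exchange_strong(expected, true);
}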
Why is !expected in the example
It depends on the situation and its desired semantics, but usually it is not needed for correctness. Omitting it would yield very similar semantics. Only in a case where another thread might reset the value to false could the semantics become slightly different (yet I cannot find a meaningful example where you would want that). See Tony D.'s comment for a detailed explanation.
It is simply a fast track for when another thread writes true: then we abort instead of trying to write true again.
About your last question
But as analyzed above, two versions in a loop should give the same/similar performance.
What's the thing I miss?
From Wikipedia:
Real implementations of LL/SC do not always succeed if there are no
concurrent updates to the memory location in question. Any exceptional
events between the two operations, such as a context switch, another
load-link, or even (on many platforms) another load or store
operation, will cause the store-conditional to spuriously fail. Older
implementations will fail if there are any updates broadcast over the
memory bus.
So, LL/SC will fail spuriously on a context switch, for example. Now, the strong version would bring its "own small loop" to detect that spurious failure and mask it by trying again. Note that this own loop is also more complicated than a usual CAS loop, since it must distinguish between spurious failure (and mask it) and failure due to concurrent access (which results in a return with value false). The weak version does not have such a loop of its own.
Since you provide an explicit loop in both examples, it is simply not necessary to have the small loop for the strong version. Consequently, in the example with the strong version, the check for failure is done twice; once by compare_exchange_strong (which is more complicated since it must distinguish spurious failure and concurrent access) and once by your loop. This expensive check is unnecessary and the reason why weak will be faster here.
Also note that your argument (LL/SC) is just one possibility to implement this. There are more platforms that have even different instruction sets. In addition (and more importantly), note that std::atomic must support all operations for all possible data types, so even if you declare a ten million byte struct, you can use compare_exchange on this. Even when on a CPU that does have CAS, you cannot CAS ten million bytes, so the compiler will generate other instructions (probably lock acquire, followed by a non-atomic compare and swap, followed by a lock release). Now, think of how many things can happen while swapping ten million bytes. So while a spurious error may be very rare for 8 byte exchanges, it might be more common in this case.
So in a nutshell, C++ gives you two semantics, a "best effort" one (weak) and a "I will do it for sure, no matter how many bad things might happen in between" one (strong). How these are implemented on various data types and platforms is a totally different topic. Don't tie your mental model to the implementation on your specific platform; the standard library is designed to work with more architectures than you might be aware of. The only general conclusion we can draw is that guaranteeing success is usually more difficult (and thus may require additional work) than just trying and leaving room for possible failure.
I'm trying to answer this myself, after going through various online resources (e.g., this one and this one), the C++11 Standard, as well as the answers given here.
The related questions are merged (e.g., "why !expected ?" is merged with "why put compare_exchange_weak() in a loop ?") and answers are given accordingly.
Why does compare_exchange_weak() have to be in a loop in nearly all uses?
Typical Pattern A
You need to achieve an atomic update based on the value in the atomic variable. A failure indicates that the variable was not updated with our desired value, and we want to retry. Note that we don't really care whether it fails due to a concurrent write or a spurious failure. But we do care that it is us who makes this change.
expected = current.load();
do {
    desired = function(expected);
} while (!current.compare_exchange_weak(expected, desired));
A real-world example is several threads adding an element to a singly linked list concurrently. Each thread first loads the head pointer, allocates a new node and points that new node at the current head. Finally, it tries to install the new node as the head.
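A minimal sketch of that linked-list push (Node and push are illustrative names; this is not a complete lock-free list, and in particular popping safely needs more machinery):

#include <atomic>

struct Node { int value; Node* next; };

std::atomic<Node*> head{nullptr};

// Each thread loads the current head, points the new node at it, and tries
// to publish the node as the new head; on failure, 'node->next' is refreshed
// with the freshly observed head and the link is retried.
void push(int value)
{
    Node* node = new Node{value, head.load(std::memory_order_relaxed)};
    while (!head.compare_exchange_weak(node->next, node,
                                       std::memory_order_release,
                                       std::memory_order_relaxed))
    {
        // node->next now holds the current head; try again.
    }
}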
Another example is to implement a mutex using std::atomic<bool>. At most one thread can enter the critical section at a time, depending on which thread first sets current to true and exits the loop.
Typical Pattern B
This is actually the pattern mentioned in Anthony's book. In contrast to pattern A, you want the atomic variable to be updated once, but you don't care who does it. As long as it's not updated, you try again. This is typically used with boolean variables. E.g., you need to implement a trigger for a state machine to move on. It doesn't matter which thread pulls the trigger.
expected = false;
// !expected: if expected is set to true by another thread, it's done!
// Otherwise, it fails spuriously and we should try again.
while (!current.compare_exchange_weak(expected, true) && !expected);
Note that we generally cannot use this pattern to implement a mutex. Otherwise, multiple threads may be inside the critical section at the same time.
That said, it should be rare to use compare_exchange_weak() outside a loop. On the contrary, there are cases where the strong version is used. E.g.,
bool criticalSection_tryEnter(std::atomic<bool>& lock)
{
    bool flag = false;
    return lock.compare_exchange_strong(flag, true);
}
compare_exchange_weak is not appropriate here, because when it returns false due to a spurious failure, it's likely that no one occupies the critical section yet.
Starving Thread?
One point worth mentioning is: what happens if spurious failures continue to happen, thus starving the thread? Theoretically it could happen on platforms where compare_exchange_XXX() is implemented as a sequence of instructions (e.g., LL/SC). Frequent access to the same cache line between the LL and the SC will produce continuous spurious failures. A more realistic example is a dumb schedule where all concurrent threads are interleaved in the following way.
Time
| thread 1 (LL)
| thread 2 (LL)
| thread 1 (compare, SC), fails spuriously due to thread 2's LL
| thread 1 (LL)
| thread 2 (compare, SC), fails spuriously due to thread 1's LL
| thread 2 (LL)
v ..
Can it happen?
It won't happen forever, fortunately, thanks to what C++11 requires:
Implementations should ensure that weak compare-and-exchange
operations do not consistently return false unless either the atomic
object has value different from expected or there are concurrent
modifications to the atomic object.
Why do we bother using compare_exchange_weak() and write the loop ourselves? We can just use compare_exchange_strong().
It depends.
Case 1: When both need to be used inside a loop. C++11 says:
When a compare-and-exchange is in a loop, the weak version will yield
better performance on some platforms.
On x86 (at least currently; maybe it'll resort to a scheme similar to LL/SC one day for performance when more cores are introduced), the weak and strong versions are essentially the same because they both boil down to the single instruction cmpxchg. On some other platforms where compare_exchange_XXX() isn't implemented atomically (here meaning no single hardware primitive exists), the weak version inside the loop may win the battle because the strong one has to handle the spurious failures and retry accordingly.
But,
rarely, we may prefer compare_exchange_strong() over compare_exchange_weak() even in a loop, e.g., when there is a lot to do between the point where the atomic variable is loaded and the point where the calculated new value is exchanged (see function() above). If the atomic variable itself doesn't change frequently, we don't need to repeat the costly calculation for every spurious failure. Instead, we may hope that compare_exchange_strong() "absorbs" such failures and we only repeat the calculation when it fails due to a real value change.
Case 2: When only compare_exchange_weak() needs to be used inside a loop. C++11 also says:
When a weak compare-and-exchange would require a loop and a strong one
would not, the strong one is preferable.
This is typically the case when you loop just to eliminate spurious failures from the weak version. You retry until the exchange either succeeds or fails because of a concurrent write.
expected = false;
// !expected: if it fails spuriously, we should try again.
while (!current.compare_exchange_weak(expected, true) && !expected);
At best, it's reinventing the wheel and performs the same as compare_exchange_strong(). Worse, this approach fails to take full advantage of machines that provide a non-spurious compare-and-exchange in hardware.
Last, if you loop for other things (e.g., see "Typical Pattern A" above), then there is a good chance that compare_exchange_strong() shall also be put in a loop, which brings us back to the previous case.
Why does it have to be in a loop in nearly all uses?
Because if you don't loop and it fails spuriously your program hasn't done anything useful - you didn't update the atomic object and you don't know what its current value is (Correction: see comment below from Cameron). If the call doesn't do anything useful what's the point of doing it?
Does that mean we shall loop when it fails because of spurious failures?
Yes.
If that's the case, why do we bother using compare_exchange_weak() and write the loop ourselves? We can just use compare_exchange_strong() which I think should get rid of spurious failures for us. What are the common use cases of compare_exchange_weak()?
On some architectures compare_exchange_weak is more efficient, and spurious failures should be fairly uncommon, so it might be possible to write more efficient algorithms using the weak form and a loop.
In general it is probably better to use the strong version instead if your algorithm doesn't need to loop, as you don't need to worry about spurious failures. If it needs to loop anyway even for the strong version (and many algorithms do need to loop anyway), then using the weak form might be more efficient on some platforms.
Why is !expected there in the loop condition?
The value could have got set to true by another thread, so you don't want to keep looping trying to set it.
Edit:
But as analyzed above, two versions in a loop should give the same/similar performance. What's the thing I miss?
Surely it's obvious that on platforms where spurious failure is possible the implementation of compare_exchange_strong has to be more complicated, to check for spurious failure and retry.
The weak form just returns on spurious failure, it doesn't retry.
Alright, so I need a function which performs atomic left-shifting. My processor doesn't have a native operation for this, and the standard library doesn't have a function for it, so it looks like I'm writing my own. Here goes:
void atomicLeftShift(std::atomic<int>* var, int shiftBy)
{
    int oldVal = var->load();
    int newVal;
    do {
        newVal = oldVal << shiftBy;
    } while (!var->compare_exchange_weak(oldVal, newVal));
}
Now, there are two reasons that loop might be executed more than once.
Someone else changed the variable while I was doing my left shift. The results of my computation should not be applied to the atomic variable, because it would effectively erase that someone else's write.
My CPU burped and the weak CAS spuriously failed.
I honestly don't care which one. Left shifting is fast enough that I may as well just do it again, even if the failure was spurious.
What's less fast, though, is the extra code that strong CAS needs to wrap around weak CAS in order to be strong. That code doesn't do much when the weak CAS succeeds... but when it fails, strong CAS needs to do some detective work to determine whether it was Case 1 or Case 2. That detective work takes the form of a second loop, effectively inside my own loop. Two nested loops. Imagine your algorithms teacher glaring at you right now.
And as I previously mentioned, I don't care about the result of that detective work! Either way I'm going to be redoing the CAS. So using strong CAS gains me precisely nothing, and loses me a small but measurable amount of efficiency.
In other words, weak CAS is used to implement atomic update operations. Strong CAS is used when you care about the result of CAS.
I think most of the answers above address "spurious failure" as some kind of problem, a performance vs. correctness tradeoff.
It can be seen like this: the weak version is faster most of the time, but in case of spurious failure it becomes slower, while the strong version has no possibility of spurious failure but is almost always slower.
For me, the main difference is how these two versions handle the ABA problem:
The weak version will succeed only if no one has touched the cache line between the load and the store, so it will 100% detect the ABA problem.
The strong version will fail only if the comparison fails, so it will not detect the ABA problem without extra measures.
So, in theory, if you use the weak version on a weakly ordered architecture, you don't need an ABA detection mechanism and the implementation will be much simpler, giving better performance.
But on x86 (a strongly ordered architecture), the weak version and the strong version are the same, and they both suffer from the ABA problem.
So if you write a completely cross-platform algorithm, you need to address the ABA problem anyway, so there is no performance benefit from using the weak version, but there is a performance penalty for handling spurious failures.
In conclusion - for portability and performance reasons, the strong version is always a better-or-equal option.
Weak version can only be a better option if it lets you skip ABA countermeasures completely or your algorithm doesn't care about ABA.

atomic<T>.load() with std::memory_order_release

When writing C++11 code that uses the newly introduced thread-synchronization primitives to make use of the relaxed memory ordering, you usually see either
std::atomic<int> vv;
int i = vv.load(std::memory_order_acquire);
or
vv.store(42, std::memory_order_release);
It is clear to me why this makes sense.
My questions are: Do the combinations vv.store(42, std::memory_order_acquire) and vv.load(std::memory_order_release) also make sense? In which situation could one use them? What are the semantics of these combinations?
That's simply not allowed. The C++ (11) standard has requirements on what memory order constraints you can put on load/store operations.
For load (§29.6.5):
Requires: The order argument shall not be memory_order_release nor memory_order_acq_rel.
For store:
Requires: The order argument shall not be memory_order_consume, memory_order_acquire, nor memory_order_acq_rel.
The C/C++/LLVM memory model is sufficient for synchronization
strategies that ensure data is ready to be accessed before accessing
it. While that covers most common synchronization primitives, useful
properties can be obtained by building consistent models on weaker
guarantees.
The biggest example is the seqlock.
It relies on "speculatively" reading data that may not be in a
consistent state. Because reads are allowed to race with writes,
readers don't block writers -- a property which is used in the Linux
kernel to allow the system clock to be updated even if a user process
is repeatedly reading it. Another strength of the seqlock is that on
modern SMP arches it scales perfectly with the number of readers:
because the readers don't need to take any locks, they only need
shared access to the cache lines.
The ideal implementation of a seqlock would use something like a
"release load" in the reader, which is not available in any major
programming language. The kernel works around this with a full read
fence,
which scales well across architectures, but doesn't achieve optimal
performance.
These combinations do not make any sense, and they are not allowed either.
An acquire operation synchronizes previous non-atomic writes or side effects with a release operation so that when the acquire (load) is realized, all other stores (effects) that happened before the release (store) are also visible (for threads that acquire the same atomic that was released).
Now, if you could do (and would do) an acquire store and a release load, what should it do? What store should the acquire operation synchronize with? Itself?
Do the combinations vv.store(42, std::memory_order_acquire) and
vv.load(std::memory_order_release) also make sense?
Technically, they are formally disallowed but it isn't important to know that, except to write C++ code.
They simply cannot be defined in the model and it's important that you know and understand, even if you don't write code.
Note that disallowing these values is an important design choice: if you write your own your_own::atomic<> class, you can choose to allow these values and define them as equivalent to relaxed operations.
It's important to understand the design space; you must not have too much respect for all the C++ thread primitive design choices, some of which are purely arbitrary.
In which situation could one use them? What are the semantics of
these combinations?
None, as you must understand the fundamental notion that a read is not a write (it took me a while to get it). You can only claim to understand non-linear execution when you get that idea.
In a non threaded program that doesn't have async signals, all steps are sequential and it doesn't matter that reads aren't writes: all reads of an object could just rewrite the value if you arrange for sequence points to be respected and if you allow writing to a constant its own value (which is OK in practice as long as memory is R/W).
So the distinction between reads and writes, at such a level, isn't that important. You could manage to define a semantics based only on operations that are both reads and writes of a memory location, such that writing to a constant is allowed and reading an invalid value that isn't used is OK.
I don't recommend it of course, as it's pretty ugly to blur the distinction between reads and writes.
But for multithreading you really don't want to have writes to data you only read: not only would it create data races (which you could arbitrarily declare to be unimportant when the old value is written back), it would also not map to the CPU's worldview, as a write changes the state of the cache line of a shared object. The fact that a read isn't a write is essential for efficiency in multithreaded programs, much more than for single-threaded ones.
At the abstract level, a store operation on an atomic is a modification so it's part of its modification order, and a load is not: a load only points to a position in the modification order (a load can see an atomically stored value, or the initial value, the value established at construction, before all atomic modifications).
Modifications are ordered with each others and loads are not, only with respect to modifications. (You can view loads as happening at exactly the same time.)
Acquire and release operations are about making a history (a past) and communicating it: a release operation on an object makes your past the past of the atomic object, and the acquire operation makes that past your past.
A modification that isn't an atomic RMW cannot see the previous value; on the other hand, an algorithm that includes a load and then a store (on one or two atomics) sees some previous value but isn't in general guaranteed to see the value left by the modification just before it in the modification order. So an acquire load X followed by a release store Y transitively releases the history and makes a past (of another thread at some point, established by another release operation that was seen by X) part of the past associated with the atomic variable by Y (in addition to the rest of our past).
An RMW is semantically different from an acquire followed by a release because there is never "space" in the history between the release and the acquire. It means that programs using only acq+rel RMW operations are always sequentially consistent, as they get the full past of all threads they interact with.
So if you want an acq+rel load or store, just do an RMW read or RMW write operation (sketched in code below):
acq+rel load is RMW writing back the same value
acq+rel store is RMW dismissing the original value
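In std::atomic terms, those two operations can be sketched as follows (v, acq_rel_load and acq_rel_store are illustrative names):

#include <atomic>

std::atomic<int> v{0};

// "acq+rel load": an RMW that writes back the value it just read.
int acq_rel_load()
{
    return v.fetch_add(0, std::memory_order_acq_rel);
}

// "acq+rel store": an RMW that discards the value it replaces.
void acq_rel_store(int x)
{
    v.exchange(x, std::memory_order_acq_rel);
}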
You can write your own (strong) atomic class that does that for (strong) loads and (strong) stores: it would be logically defined as your class making all operations, even loads, part of the operation history of the (strong) atomic object. So a (strong) load could be observed by a (strong) store, as they are both (atomic) modifications and reads of the underlying normal atomic object.
Note the set of acq_rel operations on such "strong atomic" objects would have strictly stronger guarantees than the intended guarantees of the set of seq_cst operations on normal atomics, for programs using relaxed atomic operations: the intent of the designers of seq_cst is that using seq_cst do not make programs using mixed atomic operations sequentially consistent in general.

C++ memory_order_consume, kill_dependency, dependency-ordered-before, synchronizes-with

I am reading C++ Concurrency in Action by Anthony Williams. Currently I am at the point where he describes memory_order_consume.
After that block there is:
Now that I’ve covered the basics of the memory orderings, it’s time to look at the
more complex parts
It scares me a little bit, because I don't fully understand several things:
How does dependency-ordered-before differ from synchronizes-with? They both create a happens-before relationship. What is the exact difference?
I am confused about the following example:
int global_data[]={ … };
std::atomic<int> index;
void f()
{
int i=index.load(std::memory_order_consume);
do_something_with(global_data[std::kill_dependency(i)]);
}
What does kill_dependency exactly do? Which dependency does it kill? Between which entities? And how can the compiler exploit that knowledge?
Can all occurrences of memory_order_consume be safely replaced with memory_order_acquire? I.e. is it stricter in all senses?
At Listing 5.9, can I safely replace
std::atomic<int> data[5]; // all accesses are relaxed
with
int data[5]
? I.e. can acquire and release be used to synchronize access to non-atomic data?
He describes relaxed, acquire and release with some examples of men in cubicles. Are there some similar simple descriptions of seq_cst and consume?
As to the next to last question, the answer takes a little more explanation. There are three things that can go wrong when multiple threads access the same data:
the system might switch threads in the middle of a read or write, producing a result that's half one value and half another.
the compiler might move code around, on the assumption that there is no other thread looking at the data that's involved.
the processor may be keeping a value in its local cache, without updating main memory after changing the value or re-reading it after another thread changed the value in main memory.
Memory order addresses only number 3. The atomic functions address 1 and 2, and, depending on the memory order argument, maybe 3 as well. So memory_order_relaxed means "don't bother with number 3"; the code still handles 1 and 2. When you do care about number 3, you'd use acquire and release to ensure proper memory ordering.
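The classic acquire/release pairing looks like this (a sketch with illustrative names):

#include <atomic>

int payload = 0;                      // plain, non-atomic data
std::atomic<bool> ready{false};

void producer()
{
    payload = 42;                                  // prepare the data
    ready.store(true, std::memory_order_release);  // then publish it
}

void consumer()
{
    while (!ready.load(std::memory_order_acquire))
        ;                                          // wait for the release store
    // The acquire load synchronizes with the release store, so this read
    // of 'payload' is data-race-free and sees 42.
    int x = payload;
    (void)x;
}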
How does dependency-ordered-before differ from synchronizes-with?
From 1.10/10: "[ Note: The relation “is dependency-ordered before” is analogous to “synchronizes with”, but uses release/consume in place of release/acquire. — end note ]".
What does kill_dependency exactly do?
Some compilers do data-dependency analysis. That is, they trace changes to values in variables in order to better figure out what has to be synchronized. kill_dependency tells such compilers not to trace any further because there's something going on in the code that the compiler wouldn't understand.
Can all occurrences of memory_order_consume be safely replaced with
memory_order_acquire? I.e. is it stricter in all senses?
I think so, but I'm not certain.
memory_order_consume requires that the atomic operation happens-before all non-atomic operations that are data dependent on it. A data dependency is any dependency where you cannot evaluate an expression without using that data. For example, in x->y, there is no way to evaluate x->y without first evaluating x.
kill_dependency is a unique function. All other functions have a data dependency on their arguments; kill_dependency explicitly does not. It shows up when you know that the data itself is already synchronized, but the expression you need to get to the data may not be synchronized. In your example, do_something_with is allowed to assume any cached value of global_data[i] is safe to use, but i itself must actually be the correct atomic value.
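The canonical use of memory_order_consume is pointer publication, where the reader's accesses are data-dependent on the loaded pointer (a sketch with illustrative names; note that most current compilers implement consume by promoting it to acquire):

#include <atomic>

struct Payload { int x; };

std::atomic<Payload*> ptr{nullptr};

void producer()
{
    Payload* p = new Payload{42};
    ptr.store(p, std::memory_order_release);   // publish the fully built object
}

void consumer()
{
    Payload* p;
    while (!(p = ptr.load(std::memory_order_consume)))
        ;                                      // spin until published
    // p->x carries a data dependency on the consume load, so this read is
    // ordered after the producer's initialization of *p.
    int value = p->x;   // 42
    (void)value;
}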
memory_order_acquire is strictly stronger if all changes to the data are properly released with a matching memory_order_release.