Can/should non-lock-free atomics be implemented with a SeqLock? - c++

In both the MSVC STL and LLVM libc++ implementations, std::atomic for objects too large to be lock-free is implemented using a spin lock.
libc++ (Github):
_LIBCPP_INLINE_VISIBILITY void __lock() const volatile {
    while(1 == __cxx_atomic_exchange(&__a_lock, _LIBCPP_ATOMIC_FLAG_TYPE(true), memory_order_acquire))
        /*spin*/;
}
_LIBCPP_INLINE_VISIBILITY void __lock() const {
    while(1 == __cxx_atomic_exchange(&__a_lock, _LIBCPP_ATOMIC_FLAG_TYPE(true), memory_order_acquire))
        /*spin*/;
}
MSVC (Github) (recently discussed in this Q&A):
inline void _Atomic_lock_acquire(long& _Spinlock) noexcept {
#if defined(_M_IX86) || (defined(_M_X64) && !defined(_M_ARM64EC))
    // Algorithm from Intel(R) 64 and IA-32 Architectures Optimization Reference Manual, May 2020
    // Example 2-4. Contended Locks with Increasing Back-off Example - Improved Version, page 2-22
    // The code in mentioned manual is covered by the 0BSD license.
    int _Current_backoff = 1;
    const int _Max_backoff = 64;
    while (_InterlockedExchange(&_Spinlock, 1) != 0) {
        while (__iso_volatile_load32(&reinterpret_cast<int&>(_Spinlock)) != 0) {
            for (int _Count_down = _Current_backoff; _Count_down != 0; --_Count_down) {
                _mm_pause();
            }
            _Current_backoff = _Current_backoff < _Max_backoff ? _Current_backoff << 1 : _Max_backoff;
        }
    }
#elif
    /* ... */
#endif
}
While thinking of a better possible implementation, I wonder whether it is feasible to replace this with a SeqLock? The advantage would be cheap reads, if reads don't contend with writes.
Another thing I'm questioning is whether a SeqLock can be improved to use an OS wait. It appears to me that if a reader observes an odd count, it can wait using the mechanism underlying atomic wait (Linux futex / Windows WaitOnAddress), thus avoiding the starvation problem of a spinlock.
To me it looks possible. Though the C++ memory model doesn't currently cover a SeqLock, the types in std::atomic must be trivially copyable, so memcpy reads/writes in the SeqLock will work, and the races are dealt with as long as sufficient barriers are used to get a volatile-equivalent without defeating optimizations too badly. This will be part of a specific C++ implementation's header files, so it doesn't have to be portable.
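A minimal sketch of the read-side wait idea, assuming C++20 std::atomic::wait / notify_all map to futex / WaitOnAddress underneath; the function name and counter are illustrative, not from any real implementation:
#include <atomic>

// Block (instead of spinning) while a writer holds the sequence number odd.
unsigned wait_for_even(std::atomic<unsigned>& seq)
{
    unsigned s = seq.load(std::memory_order_acquire);
    while (s & 1) {            // writer in progress
        seq.wait(s);           // sleep until the count changes (futex / WaitOnAddress)
        s = seq.load(std::memory_order_acquire);
    }
    return s;                  // even: safe to start copying the payload
}
// The writer would call seq.notify_all() after its final (even) store.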
Existing SO Q&As about implementing a SeqLock in C++ (perhaps using other std::atomic operations):
Implementing 64 bit atomic counter with 32 bit atomics
how to implement a seqlock lock using c++11 atomic library

Yes, you can use a SeqLock as a readers/writers lock if you provide mutual exclusion between writers. You'd still get read-side scalability, while writes and RMWs would stay about the same.
It's not a bad idea, although it has potential fairness problems for readers if you have very frequent writes. Maybe not a good idea for a mainstream standard library, at least not without some testing with some different workloads / use-cases on a range of hardware, since working great on some machines but faceplanting on others is not what you want for standard library stuff. (Code that wants great performance for its special case often unfortunately has to use an implementation that's tuned for it, not the standard one.)
Mutual exclusion is possible with a separate spinlock, or just using the low bit of the sequence number. In fact I've seen other descriptions of a SeqLock that assumed you'd be using it with multiple writers, and didn't even mention the single-writer case that allows pure-load and pure-store for the sequence number to avoid the cost of an atomic RMW.
How to use the sequence number as a spinlock
A writer or RMWer attempts to atomically CAS the sequence number from even to the incremented (odd) value, if it wasn't already odd. If the sequence number is already odd, writers just spin until they see an even value.
This would mean writers have to start by reading the sequence number before trying to write, which can cause extra coherency traffic (MESI Share request, then RFO). On a machine that actually had a fetch_or in hardware, you could use that to atomically make the count odd and see if you won the race to take it from even to odd.
On x86-64, you can use lock bts to set the low bit and find out what the old low bit was, then load the whole sequence number if it was previously even (because you won the race, no other writer is going to be modifying it). So you can do a release-store of that plus 1 to "unlock" instead of needing a lock add.
Making other writers faster at reclaiming the lock may actually be a bad thing, though: you want to give a window for readers to complete. Maybe just use multiple pause instructions (or equivalent on non-x86) in write-side spin loops, more than in read-side spins. If contention is low, readers probably had time to see it before writers got to it, otherwise writers will frequently see it locked and go into the slower spin loop. Maybe with faster-increasing backoff for writers, too.
An LL/SC machine could (in asm at least) test-and-increment just as easily as CAS or TAS. I don't know how to write pure C++ that would compile to just that. fetch_or could compile efficiently for LL/SC, but still to a store even if it was already odd. (If you have to LL separately from SC, you might as well make the most of it and not store at all if it will be useless, and hope that the hardware is designed to make the best of things.)
(It's critical to not unconditionally increment; you must not unlock another writer's ownership of the lock. But an atomic-RMW that leaves the value unchanged is always ok for correctness, if not performance.)
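Here's a rough sketch of what that could look like, with the low bit of the sequence number as the writer lock. This is a sketch only: the names are made up, the pause/backoff details discussed above are omitted, a fetch_or(1) (lock bts) could replace the CAS as described, and the fences follow the usual SeqLock recipe.
#include <atomic>
#include <cstring>

template <class T>
class seqlock_atomic_mw    // multi-writer sketch; illustrative, not a real implementation
{
    std::atomic<unsigned> seq_{0};   // even = unlocked, odd = a writer owns it
    T data_{};
public:
    void store(const T& v) {
        unsigned s = seq_.load(std::memory_order_relaxed);
        for (;;) {
            while (s & 1)                                    // another writer owns it: spin
                s = seq_.load(std::memory_order_relaxed);    // (insert pause/backoff here)
            // Try to take the count from even to odd.
            if (seq_.compare_exchange_weak(s, s + 1, std::memory_order_relaxed))
                break;                                       // we won; s is the old even value
        }
        // Order the "odd" store above before the data stores below.
        std::atomic_thread_fence(std::memory_order_release);
        std::memcpy(&data_, &v, sizeof(T));                  // the protected write
        seq_.store(s + 2, std::memory_order_release);        // even again: publish and unlock
    }

    T load() const {
        for (;;) {
            unsigned s0 = seq_.load(std::memory_order_acquire);
            if (s0 & 1) continue;                            // write in progress: retry
            T tmp;
            std::memcpy(&tmp, &data_, sizeof(T));            // may tear; detected below
            std::atomic_thread_fence(std::memory_order_acquire);
            if (seq_.load(std::memory_order_relaxed) == s0)
                return tmp;
        }
    }
};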
It may not be a good idea by default because of bad results with heavy write activity making it potentially hard for a reader to get a successful read done. As Wikipedia points out:
The reader never blocks, but it may have to retry if a write is in progress; this speeds up the readers in the case where the data was not modified, since they do not have to acquire the lock as they would with a traditional read–write lock. Also, writers do not wait for readers, whereas with traditional read–write locks they do, leading to potential resource starvation in a situation where there are a number of readers (because the writer must wait for there to be no readers). Because of these two factors, seqlocks are more efficient than traditional read–write locks for the situation where there are many readers and few writers. The drawback is that if there is too much write activity or the reader is too slow, they might livelock (and the readers may starve).
The "too slow reader" problem is unlikely, just a small memcpy. Code shouldn't expect good results from std::atomic<T> for very large T; the general assumption is that you'd only bother with std::atomic for a T that can be lock-free on some implementations. (Usually not including transactional memory since mainstream implementations don't do that.)
But the "too much write" problem could still be real: SeqLock is best for read-mostly data. Readers may have a bad time with a heavy write mix, retrying even more than with a simple spinlock or a readers-writers lock.
It would be nice if there were a way to make this an option for an implementation, like an optional template parameter such as std::atomic<T, true>, or a #pragma, or a #define before including <atomic>. Or a command-line option.
An optional template param affects every use of the type, but might be slightly less clunky than a separate class name like gnu::atomic_seqlock<T>. An optional template param would still make these types be std::atomic, so they'd e.g. still match specializations of other things written for std::atomic. But it might break other things, IDK.
Might be fun to hack something up to experiment with.

Related

Does this envelope implementation correctly use C++11 atomics?

I have written a simple 'envelope' class to make sure I understand the C++11 atomic semantics correctly. I have a header and a payload, where the writer clears the header, fills in the payload, then fills the header with an increasing integer. The idea is that a reader then can read the header, memcpy out the payload, read the header again, and if the header is the same the reader can then assume they successfully copied the payload. It's OK that the reader may miss some updates, but it's not OK for them to get a torn update (where there is a mix of bytes from different updates). There is only ever a single reader and a single writer.
The writer uses release memory order and the reader uses acquire memory order.
Is there any risk of the memcpy being reordered with the atomic store/load calls? Or can the loads be reordered with each other? This never aborts for me but maybe I'm lucky.
#include <iostream>
#include <atomic>
#include <thread>
#include <cstring>
#include <cstdint>
#include <cstdlib>
struct envelope {
    alignas(64) uint64_t writer_sequence_number = 1;
    std::atomic<uint64_t> sequence_number;
    char payload[5000];

    void start_writing()
    {
        sequence_number.store(0, std::memory_order::memory_order_release);
    }

    void publish()
    {
        sequence_number.store(++writer_sequence_number, std::memory_order::memory_order_release);
    }

    bool try_copy(char* copy)
    {
        auto before = sequence_number.load(std::memory_order::memory_order_acquire);
        if(!before) {
            return false;
        }
        ::memcpy(copy, payload, 5000);
        auto after = sequence_number.load(std::memory_order::memory_order_acquire);
        return before == after;
    }
};

envelope g_envelope;

void reader_thread()
{
    char local_copy[5000];
    unsigned messages_received = 0;
    while(true) {
        if(g_envelope.try_copy(local_copy)) {
            for(int i = 0; i < 5000; ++i) {
                // if there is no tearing we should only see the same letter over and over
                if(local_copy[i] != local_copy[0]) {
                    abort();
                }
            }
            if(messages_received++ % 64 == 0) {
                std::cout << "successfully received=" << messages_received << std::endl;
            }
        }
    }
}

void writer_thread()
{
    const char alphabet[] = {"ABCDEFGHIJKLMNOPQRSTUVWXYZ"};
    unsigned i = 0;
    while(true) {
        char to_write = alphabet[i % (sizeof(alphabet)-1)];
        g_envelope.start_writing();
        ::memset(g_envelope.payload, to_write, 5000);
        g_envelope.publish();
        ++i;
    }
}

int main(int argc, char** argv)
{
    std::thread writer(&writer_thread);
    std::thread reader(&reader_thread);
    writer.join();
    reader.join();
    return 0;
}
This is called a seqlock; it has a data race simply because of the conflicting calls to memset and memcpy. There have been proposals to provide a memcpy-like facility to make this sort of code correct; the most recent is not likely to appear before C++26 (even if approved).
This is called a seqlock. It's a known pattern, and it works well for publish occasionally, read often. If you republish too often (especially for a buffer as large as 5000 bytes), you risk too many retries by the readers as they keep detecting possible tearing. It's commonly used to e.g. publish a 64-bit or 128-bit timestamp from a timer interrupt handler to all cores, where the fact that the writer doesn't have to acquire a lock is great, and so is the fact that readers are read-only and have negligible overhead in the fast-path.
Acq and Rel are one-way barriers.
You need atomic_thread_fence(mo_acquire) before the 2nd load of the sequence number in the reader to make sure it doesn't happen earlier, before the memcpy finishes. And same for atomic_thread_fence(mo_release) in the writer, after the first store before writing the data. Note that acquire / release fences are 2-way barriers, and do affect non-atomic variables1. (Despite misconceptions to the contrary, fences really are 2-way barriers, unlike acquire or release operations. Jeff Preshing explains and debunks the confusion)
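Applied to the envelope above, that would look something like the following sketch of the fence placement (not tested; the re-check load can then be relaxed because the fence provides the ordering, and publish() is unchanged):
void start_writing()
{
    sequence_number.store(0, std::memory_order_relaxed);   // mark "write in progress"
    // Keep the payload writes that follow (in writer_thread) after the store above.
    std::atomic_thread_fence(std::memory_order_release);
}

bool try_copy(char* copy)
{
    auto before = sequence_number.load(std::memory_order_acquire);
    if(!before) {
        return false;
    }
    ::memcpy(copy, payload, 5000);
    // Make sure the memcpy's loads complete before the re-check below.
    std::atomic_thread_fence(std::memory_order_acquire);
    auto after = sequence_number.load(std::memory_order_relaxed);
    return before == after;
}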
See also Implementing 64 bit atomic counter with 32 bit atomics for my attempt at a templated SeqLock class. I required the template class T to provide an assignment operator to copy itself around, but using memcpy might be better. I was using volatile for extra safety against the C++ UB we include. That works easily for uint64_t but is a huge pain in C++ for anything wider, unlike in C where you can get the compiler to efficiently emit code to load from a volatile struct into a non-volatile temporary.
You're going to have C++ data-race UB either way (because C++ makes best efficiency impossible without UB: the whole point of a SeqLock is to let tearing potentially happen on data[], but detect that and never actually look at the torn data). You could avoid UB by copying your data as an array of atomic<unsigned long> or something, but current compilers aren't smart enough to use SIMD for that so the access to the shared data would be inefficient. (And HW vendors fail to document Per-element atomicity of vector load/store and gather/scatter?, even though we all know that current CPUs do give that and future CPUs almost certainly will too.)
A memory barrier is probably sufficient, but it would be nice to do something to "launder" the value to make extra sure the compiler doesn't put another reload of the non-atomic data after the 2nd load. Like What is the purpose of glibc's atomic_forced_read function?. But as I said, I think atomic_thread_fence() is sufficient. At least in practice with compilers like GCC, which treat thread_fence like asm("":::"memory") that tells the compiler all values in memory might have changed.
Footnote 1: Maxim points out that atomic_thread_fence may be sort of a hack because ISO C++ specifies things only in terms of barriers and release-sequences synchronizing with loads that see the value stored.
But it's well known how fences and acq/rel loads/stores map to asm for any given target platform. It's implausible that a compiler will do enough whole-program inter-thread analysis to prove that it can break your code.
There might be an argument to be made in terms of the language used in the C++ standard about establishing happens-before relationships between the store of tmp+1 and at least some hypothetical reader. In practice that's enough to stop a compiler from breaking the writer: it can't know what code will be reading the data it's writing so it has to respect barriers. And probably the language in the standard is strong enough that a reader that sees an odd sequence number (and avoids reading data[]) can avoid data-race UB, so there would be a valid happens-before relationship between an atomic store that has to stay ahead of some non-atomic stores. So I'm not convinced that there's any room for a malicious all-seeing compiler to not respect atomic_thread_fence() there, let alone any real compiler.
In any case, you definitely do not want _mm_lfence() on x86. You want a compiler barrier against compile-time reordering, but you definitely do not want the main effect of lfence: blocking out-of-order execution. See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths and Are loads and stores the only instructions that get reordered?
i.e. you just want GNU C asm("":::"memory"), aka atomic_signal_fence(mo_seq_cst). It's also equivalent to atomic_thread_fence(mo_acq_rel) on x86, which only has to block compile-time reordering to control runtime ordering, because the only runtime reordering x86's strong memory model allows is StoreLoad (except for NT stores). x86's memory model is seq_cst + a store buffer with store forwarding (which weakens seq_cst to acq/rel, and occasionally has other funky effects, especially for loads that partially overlap a store).
For more about _mm_lfence() and so on vs. the asm instructions, see When should I use _mm_sfence _mm_lfence and _mm_mfence.
Other tweaks
Your sequence number is unnecessarily wide, and 64-bit atomics are less efficient on some 32-bit platforms, and very inefficient on a few. A 32-bit sequence number won't wrap in any reasonable thread-sleep time. (e.g. a 4GHz CPU will take about a whole second to do 2^32 stores at 1 store per clock, and that's with zero contention for writes to the cache line. And no cycles spent executing stores of the actual data. And practical use-cases don't have the writer in a tight loop publishing new values constantly: that could lead to something similar to livelock with readers constantly retrying and making no progress.)
unsigned long is never (AFAIK) too wide to handle efficiently, except on CPUs narrower than 32-bit. So atomic<long> or atomic<unsigned long> would use a 64-bit counter on CPUs where that's fine, but definitely avoid the risk of using a 64-bit atomic in 32-bit code. And long is required to be at least 32 bits wide.
Also, you don't need two copies of the write sequence number. Just have the writer do an atomic load into a tmp var, then separate atomic stores of tmp+1 and tmp+2.
(You're correct in wanting to avoid sequence_number++; it would be a bad idea to ask the compiler to do two atomic RMWs). The only advantage of a separate non-atomic var for the writer's private seq number is if this can inline into a write loop and keep it in a register so the writer never reloads the value.
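i.e. something along these lines (a sketch only; sequence_number is assumed to be a 32-bit std::atomic as suggested above, the payload write is the memset from the question, and this uses an odd count instead of 0 to mean a write is in progress):
void write_payload(char to_write)
{
    auto tmp = sequence_number.load(std::memory_order_relaxed);   // writer-private read
    sequence_number.store(tmp + 1, std::memory_order_relaxed);    // odd: write in progress
    std::atomic_thread_fence(std::memory_order_release);          // order it before the data stores
    ::memset(payload, to_write, sizeof(payload));
    sequence_number.store(tmp + 2, std::memory_order_release);    // even again: publish
}
A reader would then retry when it sees an odd count or a mismatched count, instead of checking for zero.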

relaxed ordering and inter thread visibility

I learnt that with relaxed ordering, a store on an atomic variable should be visible to other threads within "a reasonable amount of time".
That said, I am pretty sure it should happen in a very short time (a few nanoseconds?).
However, I don't want to rely on "within a reasonable amount of time".
So, here is some code :
std::atomic_bool canBegin{false};

void functionThatWillBeLaunchedInThreadA() {
    if(canBegin.load(std::memory_order_relaxed))
        produceData();
}

void functionThatWillBeLaunchedInThreadB() {
    canBegin.store(true, std::memory_order_relaxed);
}
Thread A and B are within a kind of ThreadPool, so there is no creation of thread or whatsoever in this problem.
I don't need to protect any data, so acquire / consume / release ordering on atomic store/load are not needed here (I think?).
We know for sure that the functionThatWillBeLaunchedInThreadA function will be launched AFTER the end of the functionThatWillBeLaunchedInThreadB.
However, in such a code, we don't have any guarantee that the store will be visible in the thread A, so the thread A can read a stale value (false).
Here are some solutions I thought about.
Solution 1 : Use volatile
Just declare volatile std::atomic_bool canBegin{false}; Here the volatile qualifier guarantees that we will not see a stale value.
Solution 2 : Use mutex or spinlock
Here the idea is to protect the canBegin access via a mutex / spinlock that guarantees, via release/acquire ordering, that we will not see a stale value.
I don't need canGo to be an atomic either.
Solution 3 : not sure at all, but memory fence?
Maybe this code will not work, so, tell me :).
bool canGo{false}; // not an atomic value now
// in thread A
std::atomic_thread_fence(std::memory_order_acquire);
if(canGo) produceData();
// in thread B
canGo = true;
std::atomic_thread_fence(std::memory_order_release);
On cppreference, for this case, it is written that:
all non-atomic and relaxed atomic stores that are sequenced-before FB
in thread B will happen-before all non-atomic and relaxed atomic loads
from the same locations made in thread A after FA
Which solution would you use and why?
There's nothing you can do to make a store visible to other threads any sooner. See If I don't use fences, how long could it take a core to see another core's writes? - barriers don't speed up visibility to other cores, they just make this core wait until that's happened.
The store part of an RMW is not different from a pure store for this, either.
(Certainly on x86; not totally sure about other ISAs, where a relaxed LL/SC might possibly get special treatment from the store buffer, possibly being more likely to commit before other stores if this core can get exclusive ownership of the cache line. But I think it still would have to retire from out-of-order execution so the core knows it's not speculative.)
Anthony's answer that was linked in comment is misleading; as I commented there:
If the RMW runs before the other thread's store commits to cache, it doesn't see the value, just like if it was a pure load. Does that mean "stale"? No, it just means that the store hasn't happened yet.
The only reason RMWs need a guarantee about "latest" value is that they're inherently serializing operations on that memory location. This is what you need if you want 100 unsynchronized fetch_add operations to not step on each other and be equivalent to += 100, but otherwise best-effort / latest-available value is fine, and that's what you get from a normal atomic load.
If you require instant visibility of results (a nanosecond or so), that's only possible within a single thread, like x = y; x += z;
Also note, the C / C++ standard requirement (actually just a note) to make stores visible in a reasonable amount of time is in addition to the requirements on ordering of operations. It doesn't mean seq_cst store visibility can be delayed until after later loads. All seq_cst operations happen in some interleaving of program order across all threads.
On real-world C++ implementations, the visibility time is entirely up to hardware inter-core latency. But the C++ standard is abstract, and could in theory be implemented on a CPU that required manual flushing to make stores visible to other threads. Then it would be up to the compiler to not be lazy and defer that for "too long".
volatile atomic<T> is useless; compilers already don't optimize atomic<T>, so every atomic access done by the abstract machine will already happen in the asm. (Why don't compilers merge redundant std::atomic writes?). That's all that volatile does, so volatile atomic<T> compiles to the same asm as atomic<T> for anything you can do with the atomic.
Defining "stale" is a problem because separate threads running on separate cores can't see each other's actions instantly. It takes tens of nanoseconds on modern hardware to see a store from another thread.
But you can't read "stale" values from cache; that's impossible because real CPUs have coherent caches. (That's why volatile int could be used to roll your own atomics before C++11, but is no longer useful.) You may need an ordering stronger than relaxed to get the semantics you want as far as one value being older than another (i.e. "reordering", not "stale"). But for a single value, if you don't see a store, that means your load executed before the other core took exclusive ownership of the cache line in order to commit its store. i.e. that the store hasn't truly happened yet.
In the formal ISO C++ rules, there are guarantees about what value you're allowed to see which effectively give you the guarantees you'd expect from cache coherency for a single object, like that after a reader sees a store, further loads in this thread won't see some older store and then eventually back to the newest store. (https://eel.is/c++draft/intro.multithread#intro.races-19).
(Note for 2 writers + 2 readers with non-seq_cst operations, it's possible for the readers to disagree about the order in which the stores happened. This is called IRIW reordering, but most hardware can't do it; only some PowerPC. Will two atomic writes to different locations in different threads always be seen in the same order by other threads? - so it's not always quite as simple as "the store hasn't happened yet"; it can be visible to some threads before others. But it's still true that you can't speed up visibility, only for example slow down the readers so none of them see it via the "early" mechanism, i.e. with hwsync for the PowerPC loads to drain the store buffer first.)
We know for sure that the functionThatWillBeLaunchedInThreadA function
will be launched AFTER the end of the
functionThatWillBeLaunchedInThreadB.
First of all, if this is the case then it's likely that your task queue mechanism takes care of the necessary synchronization already.
On to the answer...
By far the simplest thing to do is acquire/release ordering. All the solutions you gave are worse.
std::atomic_bool canBegin{false};

void functionThatWillBeLaunchedInThreadA() {
    if(canBegin.load(std::memory_order_acquire))
        produceData();
}

void functionThatWillBeLaunchedInThreadB() {
    canBegin.store(true, std::memory_order_release);
}
By the way, shouldn't this be a while loop?
void functionThatWillBeLaunchedInThreadA() {
    while (!canBegin.load(std::memory_order_acquire))
    { }
    produceData();
}
I don't need to protect any data, so acquire / consume / release
ordering on atomic store/load are not needed here (I think?)
In this case, the ordering is required to keep the compiler/CPU/memory subsystem from ordering the canBegin store true before the previous reads/writes have completed. And it should actually stall the CPU until it can be guaranteed that every write that comes before in program order will propagate before the store to canBegin. On the load side it prevents memory from being read/written before canBegin is read as true.
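For example (a minimal sketch of what the ordering buys you; sharedPrep is made up to stand in for whatever thread B wrote before setting the flag):
#include <atomic>
#include <cassert>

int sharedPrep = 0;                                      // plain, non-atomic data
std::atomic_bool canBegin{false};

void threadB() {                                         // like functionThatWillBeLaunchedInThreadB
    sharedPrep = 42;                                     // earlier write...
    canBegin.store(true, std::memory_order_release);     // ...can't be reordered after this store
}

void threadA() {                                         // like functionThatWillBeLaunchedInThreadA
    if (canBegin.load(std::memory_order_acquire)) {      // if this load sees true,
        assert(sharedPrep == 42);                        // this read can't move earlier and must see 42
    }
}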
However, in such a code, we don't have any guarantee that the store
will be visible in the thread A, so the thread A can read a stale
value (false).
You said yourself:
a store on an atomic variable should be visible to other threads within
"a reasonable amount of time".
Even with relaxed memory order, a write is guaranteed to eventually reach the other cores and all cores will eventually agree on any given variable's store history, so there are no stale values. There are only values that haven't propagated yet. What's "relaxed" about it is the store order in relation to other variables. Thus, memory_order_relaxed solves the stale read problem (but doesn't address the ordering required as discussed above).
Don't use volatile. It doesn't provide all the guarantees required of atomics in the C++ memory model, so using it would be undefined behavior. See https://en.cppreference.com/w/cpp/atomic/memory_order#Relaxed_ordering at the bottom to read about it.
You could use a mutex or spinlock, but a mutex operation is much more expensive than a lock-free std::atomic acquire-load/release-store. A spinlock will do at least one atomic read-modify-write operation...and possibly many. A mutex is definitely overkill. But both have the benefit of simplicity in the C++ source. Most people know how to use locks so it's easier to demonstrate correctness.
A memory fence will also work but your fences are in the wrong spot (it's counter-intuitive) and the inter-thread communication variable should be std::atomic. (Careful when playing these games...! It's easy to get undefined behavior) Relaxed ordering is ok thanks to the fences.
std::atomic<bool> canGo{false}; // MUST be atomic

// in thread A
if(canGo.load(std::memory_order_relaxed))
{
    std::atomic_thread_fence(std::memory_order_acquire);
    produceData();
}

// in thread B
std::atomic_thread_fence(std::memory_order_release);
canGo.store(true, std::memory_order_relaxed);
The memory fences are actually more strict than acquire/release ordering on the std::atomic load/store, so this gains nothing and could be more expensive.
It seems like you really want to avoid overhead with your signaling mechanism. This is exactly what the std::atomic acquire/release semantics were invented for! You are worrying too much about stale values. Yes, an atomic RMW will give you the "latest" value, but they're also very expensive operations themselves. I want to give you an idea of how fast acquire/release is. It's most likely that you're targeting x86. x86 has total store order and word-sized loads/stores are atomic, so an acquire load compiles to just a regular load and a release store compiles to a regular store. So it turns out that almost everything in this long post will probably compile to exactly the same code anyway.

Understanding std::atomic::compare_exchange_weak() in C++11

bool compare_exchange_weak (T& expected, T val, ..);
compare_exchange_weak() is one of the compare-exchange primitives provided in C++11. It's weak in the sense that it can return false even if the value of the object is equal to expected. This is due to spurious failure on some platforms where a sequence of instructions (instead of one, as on x86) is used to implement it. On such platforms, a context switch, a reload of the same address (or cache line) by another thread, etc. can fail the primitive. It's spurious because it's not the value of the object (being unequal to expected) that fails the operation; instead, it's a kind of timing issue.
But what puzzles me is what's said in C++11 Standard (ISO/IEC 14882),
29.6.5
..
A consequence of spurious failure is that nearly all uses of weak
compare-and-exchange will be in a loop.
Why does it have to be in a loop in nearly all uses? Does that mean we should loop when it fails because of spurious failures? If that's the case, why do we bother to use compare_exchange_weak() and write the loop ourselves? We can just use compare_exchange_strong(), which I think should get rid of spurious failures for us. What are the common use cases of compare_exchange_weak()?
Another question related. In his book "C++ Concurrency In Action" Anthony says,
//Because compare_exchange_weak() can fail spuriously, it must typically
//be used in a loop:
bool expected=false;
extern atomic<bool> b; // set somewhere else
while(!b.compare_exchange_weak(expected,true) && !expected);
//In this case, you keep looping as long as expected is still false,
//indicating that the compare_exchange_weak() call failed spuriously.
Why is !expected there in the loop condition? Is it there to prevent all threads from starving and making no progress for some time?
One last question
On platforms where no single hardware CAS instruction exists, both the weak and strong versions are implemented using LL/SC (like ARM, PowerPC, etc.). So is there any difference between the following two loops? Why, if any? (To me, they should have similar performance.)
// use LL/SC (or CAS on x86) and ignore/loop on spurious failures
while (!compare_exchange_weak(..))
{ .. }
// use LL/SC (or CAS on x86) and ignore/loop on spurious failures
while (!compare_exchange_strong(..))
{ .. }
I came up with this last question because you guys all mention that there may be a performance difference inside a loop. It's also mentioned by the C++11 Standard (ISO/IEC 14882):
When a compare-and-exchange is in a loop, the weak version will yield
better performance on some platforms.
But as analyzed above, two versions in a loop should give the same/similar performance. What's the thing I miss?
Why doing exchange in a loop?
Usually, you want your work to be done before you move on, thus, you put compare_exchange_weak into a loop so that it tries to exchange until it succeeds (i.e., returns true).
Note that also compare_exchange_strong is often used in a loop. It does not fail due to spurious failure, but it does fail due to concurrent writes.
Why to use weak instead of strong?
Quite easy: spurious failure does not happen often, so it is no big performance hit. In contrast, tolerating such a failure allows for a much more efficient implementation of the weak version (in comparison to strong) on some platforms: strong must always check for spurious failure and mask it. This is expensive.
Thus, weak is used because it is a lot faster than strong on some platforms.
When should you use weak and when strong?
The reference gives hints about when to use weak and when to use strong:
When a compare-and-exchange is in a loop, the weak version will yield
better performance on some platforms. When a weak compare-and-exchange
would require a loop and a strong one would not, the strong one is
preferable.
So the answer seems to be quite simple to remember: If you would have to introduce a loop only because of spurious failure, don't do it; use strong. If you have a loop anyway, then use weak.
Why is !expected in the example
It depends on the situation and its desired semantics, but usually it is not needed for correctness. Omitting it would yield very similar semantics. Only in a case where another thread might reset the value to false could the semantics become slightly different (yet I cannot find a meaningful example where you would want that). See Tony D.'s comment for a detailed explanation.
It is simply a fast path for when another thread writes true: then we abort instead of trying to write true again.
About your last question
But as analyzed above, two versions in a loop should give the same/similar performance.
What's the thing I miss?
From Wikipedia:
Real implementations of LL/SC do not always succeed if there are no
concurrent updates to the memory location in question. Any exceptional
events between the two operations, such as a context switch, another
load-link, or even (on many platforms) another load or store
operation, will cause the store-conditional to spuriously fail. Older
implementations will fail if there are any updates broadcast over the
memory bus.
So, LL/SC will fail spuriously on a context switch, for example. Now, the strong version brings its "own small loop" to detect that spurious failure and mask it by trying again. Note that this loop is also more complicated than a usual CAS loop, since it must distinguish between spurious failure (and mask it) and failure due to concurrent access (which results in a return with value false). The weak version does not have such a loop of its own.
Since you provide an explicit loop in both examples, it is simply not necessary to have the small loop for the strong version. Consequently, in the example with the strong version, the check for failure is done twice; once by compare_exchange_strong (which is more complicated since it must distinguish spurious failure from concurrent access) and once by your loop. This expensive check is unnecessary and is the reason why weak will be faster here.
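As a purely conceptual sketch (not how any real implementation is written, and using == where a real one would compare the object representation), the strong version behaves roughly like a weak CAS wrapped in its own retry loop:
#include <atomic>

template <class T>
bool strong_from_weak(std::atomic<T>& obj, T& expected, T desired)
{
    const T wanted = expected;
    for (;;) {
        T observed = wanted;
        if (obj.compare_exchange_weak(observed, desired))
            return true;                     // exchanged
        if (!(observed == wanted)) {         // genuine mismatch: report what we saw
            expected = observed;
            return false;
        }
        // observed == wanted: the failure was spurious, so mask it and retry
    }
}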
Also note that your argument (LL/SC) is just one possibility to implement this. There are more platforms that have even different instruction sets. In addition (and more importantly), note that std::atomic must support all operations for all possible data types, so even if you declare a ten million byte struct, you can use compare_exchange on this. Even when on a CPU that does have CAS, you cannot CAS ten million bytes, so the compiler will generate other instructions (probably lock acquire, followed by a non-atomic compare and swap, followed by a lock release). Now, think of how many things can happen while swapping ten million bytes. So while a spurious error may be very rare for 8 byte exchanges, it might be more common in this case.
So in a nutshell, C++ gives you two semantics, a "best effort" one (weak) and a "I will do it for sure, no matter how many bad things might happen inbetween" one (strong). How these are implemented on various data types and platforms is a totally different topic. Don't tie your mental model to the implementation on your specific platform; the standard library is designed to work with more architectures than you might be aware of. The only general conclusion we can draw is that guaranteeing success is usually more difficult (and thus may require additional work) than just trying and leaving room for possible failure.
I'm trying to answer this myself, after going through various online resources (e.g., this one and this one), the C++11 Standard, as well as the answers given here.
The related questions are merged (e.g., "why !expected ?" is merged with "why put compare_exchange_weak() in a loop ?") and answers are given accordingly.
Why does compare_exchange_weak() have to be in a loop in nearly all uses?
Typical Pattern A
You need to achieve an atomic update based on the value in the atomic variable. A failure indicates that the variable was not updated with our desired value and we want to retry. Note that we don't really care whether it fails due to a concurrent write or a spurious failure. But we do care that it is us who make this change.
expected = current.load();
do desired = function(expected);
while (!current.compare_exchange_weak(expected, desired));
A real-world example is several threads adding an element to a singly linked list concurrently. Each thread first loads the head pointer, allocates a new node and sets the new node's next pointer to the current head. Finally, it tries to CAS the head to point at the new node.
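A sketch of that linked-list example (a Treiber-style stack push; the names are illustrative):
#include <atomic>

struct Node {
    int value;
    Node* next;
};

std::atomic<Node*> head{nullptr};

void push(int v)
{
    Node* n = new Node{v, head.load(std::memory_order_relaxed)};
    // Retry until this thread is the one that swings head from n->next to n.
    // On failure (concurrent push or spurious failure), the current head is
    // written back into n->next by compare_exchange_weak, so we just try again.
    while (!head.compare_exchange_weak(n->next, n,
                                       std::memory_order_release,
                                       std::memory_order_relaxed))
        ;
}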
Another example is implementing a mutex using std::atomic<bool>. At most one thread can enter the critical section at a time, depending on which thread first sets current to true and exits the loop.
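A sketch of that mutex example (a simple test-and-set style lock; illustrative only, with no backoff):
#include <atomic>

class spin_mutex {
    std::atomic<bool> locked{false};
public:
    void lock() {
        bool expected = false;
        // Keep retrying until this thread is the one that flips false -> true.
        while (!locked.compare_exchange_weak(expected, true, std::memory_order_acquire,
                                             std::memory_order_relaxed))
            expected = false;   // the failed CAS wrote the current value (true) into expected
    }
    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};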
Typical Pattern B
This is actually the pattern mentioned in Anthony's book. In contrast to pattern A, you want the atomic variable to be updated once, but you don't care who does it. As long as it's not updated, you try again. This is typically used with boolean variables. E.g., you need to implement a trigger for a state machine to move on. It doesn't matter which thread pulls the trigger.
expected = false;
// !expected: if expected is set to true by another thread, it's done!
// Otherwise, it fails spuriously and we should try again.
while (!current.compare_exchange_weak(expected, true) && !expected);
Note that we generally cannot use this pattern to implement a mutex. Otherwise, multiple threads may be inside the critical section at the same time.
That said, it should be rare to use compare_exchange_weak() outside a loop. In contrast, there are cases where the strong version is used outside one. E.g.,
bool criticalSection_tryEnter(std::atomic<bool>& lock)
{
    bool flag = false;
    return lock.compare_exchange_strong(flag, true);
}
compare_exchange_weak is not appropriate here, because if it returned false due to a spurious failure, tryEnter would report failure even though it's likely that no one occupies the critical section yet.
Starving Thread?
One point worth mentioning: what happens if spurious failures continue to happen, thus starving the thread? Theoretically it could happen on platforms where compare_exchange_XXX() is implemented as a sequence of instructions (e.g., LL/SC). Frequent access to the same cache line between the LL and the SC will produce continuous spurious failures. A more realistic example is dumb scheduling where all concurrent threads are interleaved in the following way.
Time
| thread 1 (LL)
| thread 2 (LL)
| thread 1 (compare, SC), fails spuriously due to thread 2's LL
| thread 1 (LL)
| thread 2 (compare, SC), fails spuriously due to thread 1's LL
| thread 2 (LL)
v ..
Can it happen?
It won't happen forever, fortunately, thanks to what C++11 requires:
Implementations should ensure that weak compare-and-exchange
operations do not consistently return false unless either the atomic
object has value different from expected or there are concurrent
modifications to the atomic object.
Why do we bother to use compare_exchange_weak() and write the loop ourselves? We can just use compare_exchange_strong().
It depends.
Case 1: When both need to be used inside a loop. C++11 says:
When a compare-and-exchange is in a loop, the weak version will yield
better performance on some platforms.
On x86 (at least currently; maybe it'll resort to a scheme similar to LL/SC one day for performance when more cores are introduced), the weak and strong versions are essentially the same because they both boil down to the single instruction cmpxchg. On some other platforms where compare_exchange_XXX() isn't implemented with a single hardware primitive, the weak version inside the loop may win the battle because the strong one has to handle the spurious failures and retry accordingly.
But,
rarely, we may prefer compare_exchange_strong() over compare_exchange_weak() even in a loop. E.g., when there is a lot of work to do between loading the atomic variable and exchanging in a calculated new value (see function() above). If the atomic variable itself doesn't change frequently, we don't need to repeat the costly calculation for every spurious failure. Instead, we may hope that compare_exchange_strong() "absorbs" such failures and we only repeat the calculation when it fails due to a real value change.
Case 2: When only compare_exchange_weak() would need to be used inside a loop. C++11 also says:
When a weak compare-and-exchange would require a loop and a strong one
would not, the strong one is preferable.
This is typically the case when you loop just to eliminate spurious failures from the weak version. You retry until exchange is either successful or failed because of concurrent write.
expected = false;
// !expected: if it fails spuriously, we should try again.
while (!current.compare_exchange_weak(expected, true) && !expected);
At best, it's reinventing the wheel and performs the same as compare_exchange_strong(). Worse, this approach fails to take full advantage of machines that provide a non-spurious compare-and-exchange in hardware.
Last, if you loop for other reasons (e.g., see "Typical Pattern A" above), then there is a good chance that compare_exchange_strong() would also be put in a loop, which brings us back to the previous case.
Why does it have to be in a loop in nearly all uses ?
Because if you don't loop and it fails spuriously your program hasn't done anything useful - you didn't update the atomic object and you don't know what its current value is (Correction: see comment below from Cameron). If the call doesn't do anything useful what's the point of doing it?
Does that mean we shall loop when it fails because of spurious failures?
Yes.
If that's the case, why do we bother to use compare_exchange_weak() and write the loop ourselves? We can just use compare_exchange_strong(), which I think should get rid of spurious failures for us. What are the common use cases of compare_exchange_weak()?
On some architectures compare_exchange_weak is more efficient, and spurious failures should be fairly uncommon, so it might be possible to write more efficient algorithms using the weak form and a loop.
In general it is probably better to use the strong version instead if your algorithm doesn't need to loop, as you don't need to worry about spurious failures. If it needs to loop anyway even for the strong version (and many algorithms do need to loop anyway), then using the weak form might be more efficient on some platforms.
Why is !expected there in the loop condition?
The value could have got set to true by another thread, so you don't want to keep looping trying to set it.
Edit:
But as analyzed above, two versions in a loop should give the same/similar performance. What's the thing I miss?
Surely it's obvious that on platforms where spurious failure is possible the implementation of compare_exchange_strong has to be more complicated, to check for spurious failure and retry.
The weak form just returns on spurious failure, it doesn't retry.
Alright, so I need a function which performs atomic left-shifting. My processor doesn't have a native operation for this, and the standard library doesn't have a function for it, so it looks like I'm writing my own. Here goes:
void atomicLeftShift(std::atomic<int>* var, int shiftBy)
{
    int oldVal = var->load();              // read the current value once
    int newVal;
    do {
        newVal = oldVal << shiftBy;        // compute based on the value we last saw
        // on failure, compare_exchange_weak reloads the current value into oldVal
    } while(!var->compare_exchange_weak(oldVal, newVal));
}
Now, there are two reasons that loop might be executed more than once.
Someone else changed the variable while I was doing my left shift. The results of my computation should not be applied to the atomic variable, because it would effectively erase that someone else's write.
My CPU burped and the weak CAS spuriously failed.
I honestly don't care which one. Left shifting is fast enough that I may as well just do it again, even if the failure was spurious.
What's less fast, though, is the extra code that strong CAS needs to wrap around weak CAS in order to be strong. That code doesn't do much when the weak CAS succeeds... but when it fails, strong CAS needs to do some detective work to determine whether it was Case 1 or Case 2. That detective work takes the form of a second loop, effectively inside my own loop. Two nested loops. Imagine your algorithms teacher glaring at you right now.
And as I previously mentioned, I don't care about the result of that detective work! Either way I'm going to be redoing the CAS. So using strong CAS gains me precisely nothing, and loses me a small but measurable amount of efficiency.
In other words, weak CAS is used to implement atomic update operations. Strong CAS is used when you care about the result of CAS.
I think most of the answers above address "spurious failure" as some kind of problem, a performance vs. correctness tradeoff.
It can be seen as: the weak version is faster most of the time, but in case of spurious failure it becomes slower. And the strong version has no possibility of spurious failure, but it is almost always slower.
For me, the main difference is how these two versions handle the ABA problem:
the weak version will succeed only if no one has touched the cache line between the load and the store, so it will detect the ABA problem 100% of the time.
the strong version will fail only if the comparison fails, so it will not detect the ABA problem without extra measures.
So, in theory, if you use the weak version on a weakly-ordered (i.e. LL/SC) architecture, you don't need an ABA detection mechanism and the implementation will be much simpler, giving better performance.
But, on x86 (a strongly-ordered architecture), the weak and strong versions are the same, and they both suffer from the ABA problem.
So if you write a completely cross-platform algorithm, you need to address the ABA problem anyway, so there is no performance benefit from using the weak version, but there is a performance penalty for handling spurious failures.
In conclusion - for portability and performance reasons, the strong version is always a better-or-equal option.
The weak version can only be a better option if it lets you skip ABA countermeasures completely or your algorithm doesn't care about ABA.

InterlockedExchange vs InterlockedCompareExchange spin locks

I've written a basic spin lock (see below) using InterlockedExchange. However I've seen a lot of implementations use InterlockedCompareExchange instead. Is mine incorrect in some unforeseen way, and if not, what are the pros and cons of each way (if indeed there are any)?
PS I know the sleep is expensive and I'd want to have an attempt count before I call it.
class SpinLock
{
public:
    SpinLock() : m_lock( 0 ) {}
    ~SpinLock(){}

    void Lock()
    {
        while( InterlockedExchange( &m_lock, 1 ) == 1 )
        {
            Sleep( 0 );
        }
    }

    void Unlock()
    {
        InterlockedExchange( &m_lock, 0 );
    }

private:
    volatile unsigned int m_lock;
};
First of all, InterlockedExchange takes a LONG. Please repeat after me: a LONG isn't the same as an int. This may seem like a small thing but it can cause you grief.
Now, to elaborate a little on what Mats Petersson said:
Your spinlock will have horrible performance since the InterlockedExchange loop in Lock will modify the m_lock variable unconditionally, causing a lot of work to be done by the processors behind the scenes to maintain cache coherency.
To make matters worse, by not ensuring that your m_lock variable is on a cache line by itself, the above effect is amplified and could affect other data, unlucky enough to share the cache line with the instance of your spinlock.
These are just two moderately subtle issues with this code. There are others. The simple fact is that locks aren't easy to get right, and you shouldn't be implementing custom locking primitives. Please don't reinvent the wheel. Use the facilities provided to you by the operating system. It's unlikely they themselves are a bottleneck.
If you do find you have a performance issue (that is, you have profiling data that suggests a performance bottleneck) first focus on algorithmic changes and on improving parallelization and reducing lock contention. If the problem persists then and only then look elsewhere.
There is very little difference between CMPXCHG and XCHG (which are the x86 instructions that you'd get from the two intrinsic functions you mention).
I think the main difference is that in an SMP system with a lot of contention on the lock, you don't get a bunch of writes when the value is already "locked" - which means that the other processors don't have to read back a value that is already there in the cache.
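One common way to get that benefit with the code above is a "test-and-test-and-set" loop: only attempt the atomic exchange when a plain read says the lock looks free. This is a sketch only, assuming m_lock is the volatile LONG that the other answer recommends:
void Lock()
{
    for ( ;; )
    {
        if ( InterlockedExchange( &m_lock, 1 ) == 0 )
            return;                        // acquired the lock
        // Spin with plain reads so we don't keep stealing the cache line from the owner.
        while ( m_lock != 0 )
        {
            YieldProcessor();              // pause hint; fall back to Sleep(0) after many iterations
        }
    }
}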
In a debug build, you'd also want to ensure that Unlock() is only called from the current owner of the lock!

C++ Thread Safe Integer

I have currently created a C++ class for a thread-safe integer which simply stores an integer privately and has public get and set functions which use a boost::mutex to ensure that only one change at a time can be applied to the integer.
Is this the most efficient way to do it? I have been informed that mutexes are quite resource-intensive. The class is used a lot, very rapidly, so it could well be a bottleneck...
Googling "C++ Thread Safe Integer" returns unclear views and opinions on the thread safety of integer operations on different architectures.
Some say that a 32-bit int on a 32-bit arch is safe, but 64 on 32 isn't due to 'alignment'. Others say it is compiler/OS specific (which I don't doubt).
I am using Ubuntu 9.10 on 32 bit machines, some have dual cores and so threads may be executed simultaneously on different cores in some cases and I am using GCC 4.4's g++ compiler.
Thanks in advance...
Please Note: The answer I have marked as 'correct' was most suitable for my problem - however there are some excellent points made in the other answers and they are all worth reading!
There is the C++0x atomic library, and there is also a Boost.Atomic library under development; both use lock-free techniques.
It's not compiler and OS specific, it's architecture specific. The compiler and OS come into it because they're the tools you work through, but they're not the ones setting the real rules. This is why the C++ standard won't touch the issue.
I have never in my life heard of a 64-bit integer write, which can be split into two 32-bit writes, being interrupted halfway through. (Yes, that's an invitation to others to post counterexamples.) Specifically, I have never heard of a CPU's load/store unit allowing a misaligned write to be interrupted; an interrupting source has to wait for the whole misaligned access to complete.
To have an interruptible load/store unit, its state would have to be saved to the stack... and the load/store unit is what saves the rest of the CPU's state to the stack. This would be hugely complicated, and bug prone, if the load/store unit were interruptible... and all that you would gain is one cycle less latency in responding to interrupts, which, at best, is measured in tens of cycles. Totally not worth it.
Back in 1997, a coworker and I wrote a C++ Queue template which was used in a multiprocessing system. (Each processor had its own OS running, and its own local memory, so these queues were only needed for memory shared between processors.) We worked out a way to make the queue change state with a single integer write, and treated this write as an atomic operation. Also, we required that each end of the queue (i.e. the read or write index) be owned by one and only one processor. Thirteen years later, the code is still running fine, and we even have a version that handles multiple readers.
Still, if you want to treat a 64-bit integer write as atomic, align the field to a 64-bit boundary. Why worry?
EDIT: For the case you mention in your comment, I'd need more information to be sure, so let me give an example of something that could be implemented without specialized synchronization code.
Suppose you have N writers and one reader. You want the writers to be able to signal events to the reader. The events themselves have no data; you just want an event count, really.
Declare a structure for the shared memory, shared between all writers and the reader:
#include <stdint.h>
struct FlagTable
{ uint32_t flag[NWriters];
};
(Make this a class or template or whatever as you see fit.)
Each writer needs to be told its index and given a pointer to this table:
class Writer
{public:
    Writer(FlagTable* flags_, size_t index_): flags(flags_), index(index_) {}

    void SignalEvent(uint32_t eventCount = 1);
private:
    FlagTable* flags;
    size_t index;
};
When the writer wants to signal an event (or several), it updates its flag:
void Writer::SignalEvent(uint32_t eventCount)
{   // Effectively atomic: only one writer modifies this value, and
    // the state changes when the incremented value is written out.
    flags->flag[index] += eventCount;
}
The reader keeps a local copy of all the flag values it has seen:
class Reader
{public:
    Reader(FlagTable* flags_): flags(flags_)
    {   for(size_t i = 0; i < NWriters; ++i)
            seenFlags[i] = flags->flag[i];
    }

    bool AnyEvents(void);
    uint32_t CountEvents(int writerIndex);
private:
    FlagTable* flags;
    uint32_t seenFlags[NWriters];
};
To find out if any events have happened, it just looks for changed values:
bool Reader::AnyEvents(void)
{   for(size_t i = 0; i < NWriters; ++i)
        if(seenFlags[i] != flags->flag[i])
            return true;
    return false;
}
If something happened, we can check each source and get the event count:
uint32_t Reader::CountEvents(int writerIndex)
{   // Only read a flag once per function call. If you read it twice,
    // it may change between reads and then funny stuff happens.
    uint32_t newFlag = flags->flag[writerIndex];
    // Our local copy, though, we can mess with all we want since there
    // is only one reader.
    uint32_t oldFlag = seenFlags[writerIndex];
    // Next line atomically changes Reader state, marking the events as counted.
    seenFlags[writerIndex] = newFlag;
    return newFlag - oldFlag;
}
Now the big gotcha in all this? It's nonblocking, which is to say that you can't make the Reader sleep until a Writer writes something. The Reader has to choose between sitting in a spin-loop waiting for AnyEvents() to return true, which minimizes latency, or it can sleep a bit each time through, which saves CPU but could let a lot of events build up. So it's better than nothing, but it's not the solution to everything.
Using actual synchronization primitives, one would only need to wrap this code with a mutex and condition variable to make it properly blocking: the Reader would sleep until there was something to do. Since you used atomic operations with the flags, you could actually keep the amount of time the mutex is locked to a minimum: the Writer would only need to lock the mutex long enough to send the condition, and not set the flag, and the reader only needs to wait for the condition before calling AnyEvents() (basically, it's like the sleep-loop case above, but with a wait-for-condition instead of a sleep call).
C++ has no real atomic integer implementation, neither do most common libraries.
Consider that even if such an implementation existed, it would have to rely on some sort of mutex, because you cannot guarantee atomic operations across all architectures.
As you're using GCC, and depending on what operations you want to perform on the integer, you might get away with GCC's atomic builtins.
These might be a bit faster than mutexes, but in some cases still a lot slower than "normal" operations.
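For example, a counter along these lines (a sketch using the __sync builtins available in GCC 4.4; the class and its names are just for illustration):
class AtomicCounter
{
public:
    AtomicCounter() : m_value(0) {}
    int increment() { return __sync_add_and_fetch(&m_value, 1); }   // atomic ++, full barrier
    int decrement() { return __sync_sub_and_fetch(&m_value, 1); }   // atomic --, full barrier
    int get()       { return __sync_fetch_and_add(&m_value, 0); }   // read with a full barrier
private:
    volatile int m_value;
};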
For full, general purpose synchronization, as others have already mentioned, the traditional synchronization tools are pretty much required. However, for certain special cases it is possible to take advantage of hardware optimizations. Specifically, most modern CPUs support atomic increment & decrement on integers. The GLib library has pretty good cross-platform support for this. Essentially, the library wraps CPU & compiler specific assembly code for these operations and defaults to mutex protection where they're not available. It's certainly not very general-purpose but if you're only interested in maintaining a counter, this might be sufficient.
You can also have a look at the atomic ops section of Intel's Threading Building Blocks, or the atomic_ops project.