The cost of atomic counters and spinlocks on x86(_64) - c++

Preface
I recently came across some synchronization problems, which led me to spinlocks and atomic counters. I then searched a bit more for how these work, and found std::memory_order and memory barriers (mfence, lfence and sfence).
So now it seems that I should use acquire/release for the spinlocks and relaxed for the counters.
Some reference
x86 MFENCE - Memory Fence
x86 LOCK - Assert LOCK# Signal
Question
What is the machine code (edit: see below) for those three operations (lock = test_and_set, unlock = clear, increment = operator++ = fetch_add) with default (seq_cst) memory order and with acquire/release/relaxed (in that order for those three operations). What is the difference (which memory barriers where) and the cost (how many CPU cycles)?
Purpose
I was just wondering how bad my old code (which does not specify a memory order, so seq_cst is used) really is, and whether I should create some class atomic_counter derived from std::atomic but using relaxed memory ordering (as well as a good spinlock with acquire/release instead of mutexes in some places, or use something from the Boost library; I have avoided Boost so far).
My Knowledge
So far I understand that a spinlock protects more than itself (some shared resource/memory as well), so there must be something that makes the memory view coherent for multiple threads/cores (that would be acquire/release and memory fences). An atomic counter just lives for itself and only needs the atomic increment (no other memory is involved, and I do not really care about the value when I read it; it is informative and can be a few cycles old, no problem). There is a LOCK prefix, and some instructions like xchg have it implicitly. Here my knowledge ends: I don't know how the caches and buses really work and what is behind them (but I know that modern CPUs can reorder instructions, execute them in parallel, and use memory caches and some synchronization). Thank you for any explanation.
P.S.: I have an old 32-bit PC now and can only see lock addl and a simple xchg, nothing else. All versions look the same (except unlock); memory_order makes no difference on my old PC (except for unlock, where release uses mov instead of xchg). Will that be true for a 64-bit PC? (edit: see below) Do I have to care about memory order? (answer: no, not much; release on unlock saves a few cycles, that's all.)
The Code:
#include <atomic>
using namespace std;
atomic_flag spinlock;
atomic<int> counter;
void inc1() {
counter++;
}
void inc2() {
counter.fetch_add(1, memory_order_relaxed);
}
void lock1() {
while(spinlock.test_and_set()) ;
}
void lock2() {
while(spinlock.test_and_set(memory_order_acquire)) ;
}
void unlock1() {
spinlock.clear();
}
void unlock2() {
spinlock.clear(memory_order_release);
}
int main() {
inc1();
inc2();
lock1();
unlock1();
lock2();
unlock2();
}
g++ -std=c++11 -O1 -S (32bit Cygwin, shortened output)
__Z4inc1v:
__Z4inc2v:
lock addl $1, _counter ; both seq_cst and relaxed
ret
__Z5lock1v:
__Z5lock2v:
movl $1, %edx
L5:
movl %edx, %eax
xchgb _spinlock, %al ; both seq_cst and acquire
testb %al, %al
jne L5
rep ret
__Z7unlock1v:
movl $0, %eax
xchgb _spinlock, %al ; seq_cst
ret
__Z7unlock2v:
movb $0, _spinlock ; release
ret
UPDATE for x86_64bit: (see mfence in unlock1)
_Z4inc1v:
_Z4inc2v:
lock addl $1, counter(%rip) ; both seq_cst and relaxed
ret
_Z5lock1v:
_Z5lock2v:
movl $1, %edx
.L5:
movl %edx, %eax
xchgb spinlock(%rip), %al ; both seq_cst and acquire
testb %al, %al
jne .L5
ret
_Z7unlock1v:
movb $0, spinlock(%rip)
mfence ; seq_cst
ret
_Z7unlock2v:
movb $0, spinlock(%rip) ; release
ret

x86 has a mostly strong memory model: all the usual stores/loads have release/acquire semantics implicitly. The only exception is SSE non-temporal stores, which require sfence to be ordered as usual. All read-modify-write (RMW) instructions with the LOCK prefix imply a full memory barrier, i.e. seq_cst.
Thus on x86, we have
test_and_set can be coded with lock bts (for bit-wise operations), lock cmpxchg, or lock xchg (or just xchg which implies the lock). Other spin-lock implementations can use instructions like lock inc (or dec) if they need e.g. fairness. It is not possible to implement try_lock with release/acquire fence (at least you'd need standalone memory barrier mfence anyway).
clear is coded with lock and (for bit-wise) or lock xchg, though, more efficient implementations would use plain write (mov) instead of locked instruction.
fetch_add is coded with lock add.
Removing the lock prefix will not guarantee atomicity for RMW operations thus such operations cannot be interpreted strictly as having memory_order_relaxed in C++ view. However in practice, you might want to access atomic variable via faster non-atomic operation when it is safe (in constructor, under lock).
In our experience, it does not really matter exactly which atomic RMW operation is performed: they all take almost the same number of cycles to execute (and an mfence costs about 0.5x of a locked operation). You can estimate the performance of synchronization algorithms by counting the number of atomic operations (and mfences) and the number of memory indirections (cache misses).

I recommend: x86-TSO: A Rigorous and Usable Programmer's Model for x86 Multiprocessors.
Your x86 and x86_64 are indeed pretty "well behaved". In particular, they do not re-order write operations (and any speculative writes are discarded while they are in the cpu/core's write-queue), and they do not re-order read operations. However, they will start read operations as early as they can, which means that reads and writes can be re-ordered. (A read of something sitting in the write-queue reads the queued value, so reads/writes of the same location are not re-ordered.) So:
read-modify-write operations require LOCKs which makes them, implicitly, memory_order_seq_cst.
So for these operations you gain nothing by weakening the memory ordering (on the x86/x86_64). The general advice is to "keep it simple" and stick with memory_order_seq_cst, which happily is not costing anything extra for the x86 and x86_64.
For anything newer than a Pentium, if the cpu/core already has "exclusive" access to the affected memory, the LOCK does not affect other cpus/cores, and may be a relatively simple operation.
memory_order_acquire/_release do not require an mfence or any other overhead.
So, for atomic load/store, if acquire/release is sufficient, then for the x86/x86_64 those operations are "tax free".
memory_order_seq_cst does require mfence...
...which is worth understanding.
(NB: we are here talking about what the processor does with the instructions generated by the compiler. The compiler's re-ordering of operations is a very similar issue, but not addressed here.)
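To make the "tax free" acquire/release case concrete, here is a minimal message-passing sketch. The names payload, ready and run_message_passing are hypothetical and not from the question; the comments about plain mov instructions describe what x86 compilers typically emit, not something the C++ standard promises.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names, for illustration only.
static int payload = 0;                 // plain, non-atomic data
static std::atomic<bool> ready{false};  // publication flag

// Returns the value the consumer read after it observed ready == true.
int run_message_passing() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;

    std::thread consumer([&] {
        // Acquire load: typically a plain mov on x86.
        while (!ready.load(std::memory_order_acquire)) {
        }
        seen = payload;  // ordered after the acquire load, so it reads 42
    });
    std::thread producer([] {
        payload = 42;
        // Release store: typically a plain mov on x86, no mfence needed.
        ready.store(true, std::memory_order_release);
    });

    producer.join();
    consumer.join();
    return seen;
}
```

The release store keeps the write to payload before the flag, and the acquire load keeps the read of payload after it; that pairing is all this pattern needs.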
An mfence stalls the cpu/core until all pending writes are cleared out of the write-queue. In particular, any read operations which follow the mfence will not start until the write-queue is empty. Consider two threads:
initial state: wa = wb = 0
thread 'A'                       thread 'B'
wa = 1 ; (mov [wa] ← 1)          wb = 1 ; (mov [wb] ← 1)
a = wb ; (mov ebx ← [wb])        b = wa ; (mov ebx ← [wa])
Left to their own devices, the x86/x86_64 can produce any of (a = 1, b = 1), (a = 0, b = 1), (a = 1, b = 0) and (a = 0, b = 0). The last is invalid if you expect memory_order_seq_cst -- since you cannot get that by any interleaving of the operations. The reason this can happen is that the writes of wa and wb are queued in the respective cpu's/core's queue, and the reads of wa and wb can both be scheduled and can both complete before either write. To achieve memory_order_seq_cst you need an mfence:
thread 'A'                       thread 'B'
wa = 1 ; (mov [wa] ← 1)          wb = 1 ; (mov [wb] ← 1)
mfence                           mfence
a = wb ; (mov ebx ← [wb])        b = wa ; (mov ebx ← [wa])
Since there is no synchronization between the threads, the result may be anything except (a = 0, b = 0). Interestingly, the mfence is for the benefit of the thread itself, because it prevents the read operation starting before the write completes. The only thing that other threads care about is the order in which writes occur, and the x86/x86_64 does not re-order those in any case.
So, to implement memory_order_seq_cst atomic_load() and atomic_store(), it is necessary to insert an mfence after one or more stores and before a load. Where these operations are implemented as library functions, the common convention is to add the mfence to all stores, leaving the load "naked". (The logic being that loads are more common than stores, and it seems better to add the overhead to the store.)
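The two-thread example can be written in C++ roughly as follows. This is only a sketch: count_forbidden_outcomes is a made-up name, and the test merely repeats the experiment a number of times. With seq_cst stores and loads the C++ memory model forbids (a = 0, b = 0); with release/acquire or relaxed it would be allowed.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs the two-thread test `rounds` times with seq_cst stores and loads and
// returns how often the forbidden outcome (a == 0 && b == 0) was observed.
int count_forbidden_outcomes(int rounds) {
    int forbidden = 0;
    for (int i = 0; i < rounds; ++i) {
        std::atomic<int> wa{0}, wb{0};
        int a = -1, b = -1;
        std::thread ta([&] {
            wa.store(1, std::memory_order_seq_cst);  // store + full barrier
            a = wb.load(std::memory_order_seq_cst);  // "naked" load
        });
        std::thread tb([&] {
            wb.store(1, std::memory_order_seq_cst);
            b = wa.load(std::memory_order_seq_cst);
        });
        ta.join();
        tb.join();
        if (a == 0 && b == 0) ++forbidden;
    }
    return forbidden;
}
```

Replacing both stores with memory_order_release (and the loads with memory_order_acquire) removes the barrier, and the (0, 0) outcome becomes legal again.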
For spin-locks, at least, your question seems to boil down to whether a spin-unlock operation requires an mfence, or not, and what difference it makes.
The C11 atomic_flag_clear() is, implicitly, memory_order_seq_cst, for which an mfence is required. The C11 atomic_flag_test_and_set() is not only a read-modify-write operation but is also implicitly memory_order_seq_cst, and LOCK does that.
C11 does not offer a spin-lock in the threads.h library. But you can use an atomic_flag, though for your x86/x86_64 you have the PAUSE instruction issue to deal with. The question is, do you need memory_order_seq_cst for this, in particular for the unlock? I think the answer is no, and that the trick is to do: atomic_flag_test_and_set_explicit(xxx, memory_order_acquire) and atomic_flag_clear_explicit(xxx, memory_order_release).
FWIW, the glibc pthread_spin_unlock() does not have an mfence. Nor does the gcc __sync_lock_release() (which is explicitly a "release" operation). But the gcc __atomic_clear() is aligned with the C11 atomic_flag_clear_explicit(), and takes a memory order parameter.
What difference does the mfence make to the unlock ? Clearly it's very disruptive to the pipe-line, and since it's not necessary, there's not much to be gained working out the exact scale of its impact, which will depend on the circumstances.
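In C++ terms, the acquire/release spin-lock suggested above might look like the sketch below. SpinLock and run_locked_increments are hypothetical names; the lock is deliberately minimal (no PAUSE, no fairness, no backoff).

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical minimal spin-lock: acquire on lock, release on unlock.
class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;

public:
    void lock() {
        // A locked RMW (xchg) on x86 whatever order is requested; acquire is
        // enough to keep the critical section from moving above the lock.
        while (flag.test_and_set(std::memory_order_acquire)) {
        }
    }
    void unlock() {
        // Typically a plain mov on x86; with the default seq_cst this would
        // cost an mfence or an xchg instead.
        flag.clear(std::memory_order_release);
    }
};

// Every thread increments a plain (non-atomic) counter under the lock.
long run_locked_increments(int threads, int per_thread) {
    SpinLock lock;
    long total = 0;  // protected by `lock`
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < per_thread; ++i) {
                lock.lock();
                ++total;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return total;
}
```

Because the counter is only touched inside the critical section, no increment is lost even though it is an ordinary long.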

A spinlock does not use mfence; mfence only enforces serialization/flushing of the current core's data. The fence itself is not related in any way to atomic operations.
For a spinlock you need some kind of atomic action to exchange data with a memory location. There are many different implementations, targeted at different requirements: for example, does it work in kernel or user space? Is it a fair lock?
A very simple and dumb spinlock for x86 looks like this (my kernel use this):
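#include <stdint.h>
// Intel-syntax operands: build with -masm=intel (GCC/Clang).
typedef volatile uint32_t _SPINLOCK __attribute__ ((aligned(16)));
static inline void _SPIN_LOCK(_SPINLOCK* lock) {
__asm__ __volatile__ (
"cli\n"                 // kernel only: disable interrupts on this core
"lock bts %0, 0\n"      // atomically test and set bit 0
"jnc 1f\n"              // CF = 0: the bit was clear, the lock is ours
"0:\n"
"pause\n"               // spin-wait hint
"test %0, 1\n"          // read-only check of the lock bit
"jnz 0b\n"              // still taken: keep spinning without a locked instruction
"lock bts %0, 0\n"      // looks free: try to take it
"jc 0b\n"               // lost the race: spin again
"1:\n"
: "+m"(*lock)           // the lock word itself, read and written
:
: "cc", "memory"        // flags are modified; also a compiler barrier
);
}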
The logic is simple:
Test and set a bit; if it was zero, the lock was not taken and we got it.
If the bit was not zero, the lock is taken by someone else; pause is a hint recommended by the CPU manufacturers so that the core doesn't burn cycles in a tight loop.
Loop until you get the lock.
Note 1. You may also implement a spinlock with intrinsics and extensions; it should be fairly similar.
Note 2. A spinlock is not judged by cycles. A sane implementation should be quite fast; for instance, with the above implementation you should grab the lock on the first try in a well-designed usage. If not, fix the algorithm or split the lock to prevent/reduce lock contention.
Note 3. You should also consider other things like fairness.
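Regarding Note 1, here is a user-space sketch of the same test / pause / test-and-set logic written with std::atomic instead of inline asm. TtasLock, CPU_RELAX and run_ttas_demo are hypothetical names, the kernel-only cli is deliberately left out, and _mm_pause() is assumed to be available through <immintrin.h> on x86 compilers (on other targets the macro expands to nothing).

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define CPU_RELAX() _mm_pause()
#else
#define CPU_RELAX() ((void)0)
#endif

// Hypothetical test-and-test-and-set lock; not fair.
class TtasLock {
    std::atomic<bool> taken{false};

public:
    void lock() {
        for (;;) {
            // Atomic exchange, the counterpart of `lock bts`.
            if (!taken.exchange(true, std::memory_order_acquire)) return;
            // Spin on a plain read so waiters do not issue locked
            // instructions while the lock is held.
            while (taken.load(std::memory_order_relaxed)) CPU_RELAX();
        }
    }
    bool try_lock() {
        return !taken.exchange(true, std::memory_order_acquire);
    }
    void unlock() { taken.store(false, std::memory_order_release); }
};

// Threads increment a plain counter under the lock via std::lock_guard.
long run_ttas_demo(int threads, int per_thread) {
    TtasLock lock;
    long total = 0;  // protected by `lock`
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < per_thread; ++i) {
                std::lock_guard<TtasLock> guard(lock);
                ++total;
            }
        });
    }
    for (auto& th : pool) th.join();
    return total;
}
```

Since the class provides lock(), try_lock() and unlock(), it works with std::lock_guard and std::unique_lock like any other lockable type.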

Re: "and the cost (how many CPU cycles)?"
On x86 at least, instructions that perform memory synchronization (atomic ops, fences) have a highly variable latency in CPU cycles. They wait for the processor's store buffer to be drained, and this varies dramatically depending on the store buffer contents.
E.g., if an atomic op comes straight after a memcpy() that pushes multiple cache lines out to main memory, the delay may be in the hundreds of nanoseconds. The same atomic op, but after a series of register-only arithmetic instructions, may take only a few clock cycles.

Related

Atomically clearing lowest non-zero bit of an unsigned integer

Question:
I'm looking for the best way to clear the lowest non-zero bit of an unsigned atomic like std::atomic_uint64_t in a thread-safe fashion without using an extra mutex or the like. In addition, I also need to know which bit got cleared.
Example:
Let's say the current value stored is 0b0110: I want to know that the lowest non-zero bit is bit 1 (0-indexed) and set the variable to 0b0100.
The best version I came up with is this:
#include <atomic>
#include <cstdint>
inline uint64_t with_lowest_non_zero_cleared(std::uint64_t v){
return (v - 1) & v;
}
inline uint64_t only_keep_lowest_non_zero(std::uint64_t v){
return v & ~with_lowest_non_zero_cleared(v);
}
uint64_t pop_lowest_non_zero(std::atomic_uint64_t& foo)
{
auto expected = foo.load(std::memory_order_relaxed);
while(!foo.compare_exchange_weak(expected,with_lowest_non_zero_cleared(expected), std::memory_order_seq_cst, std::memory_order_relaxed))
{;}
return only_keep_lowest_non_zero(expected);
}
is there a better solution?
Notes:
It doesn't have to be the lowest non-zero bit - I'd also be happy with a solution that clears the highest bit or sets the highest/lowest "zero bit" if that makes a difference
I'd very much prefer a standard C++ (17) version, but would accept an answer that uses intrinsics for clang and msvc
It should be portable (compiling with clang or msvc for x64 and AArch64), but I'm most interested in the performance on recent intel and AMD x64 architectures when compiled with clang.
EDIT: The update has to be atomic and global progress must be guaranteed, but just as with solution above, it doesn't have to be a wait free algorithm (of course I'd be very happy if you can show me one).
There is no direct hardware support for atomic clear-lowest-bit on x86. BMI1 blsr is only available in memory-source form, not memory-destination form; lock blsr [shared_var] does not exist. (Compilers know how to optimize (var-1) & (var) into blsr for local vars when you compile with -march=haswell or otherwise enable code-gen that assumes BMI1 support.) So even if you can assume BMI1 support (see Footnote 1), it doesn't let you do anything fundamentally different.
The best you can do on x86 is the lock cmpxchg retry loop you propose in the question. This is a better option than finding the right bit in an old version of the variable and then using lock btr, especially if it would be a correctness problem to clear the wrong bit if a lower bit was set between the bit-scan and the lock btr. And you'd still need a retry loop in case another thread already cleared the bit you wanted.
CAS retry loops are not bad in practice: retry is quite rare
(Unless your program has very high contention for the shared variable, which would be problematic for performance even with lock add where there's no trying, just hardware arbitration for access to cache lines. If that's the case, you should probably redesign your lockless algorithm and/or consider some kind of OS-assisted sleep/wake instead of having a lot of cores spending a lot of CPU time hammering on the same cache line. Lockless is great in the low-contention case.)
The window for the CPU to lose the cache line between the load to get expected and running lock cmpxchg with the result of a couple instructions on that value is tiny. Usually it will succeed the first time through, because the cache line will still be present in L1d cache when the cmpxchg runs. When the cache line arrives, it will (hopefully) already be in MESI Exclusive state, if the CPU saw far enough ahead to do an RFO for it.
You can instrument your cmpxchg retry loops to see how much contention you actually get in your real program. (e.g. by incrementing a local inside the loop and using an atomic relaxed += into a shared counter once you succeed, or with thread-local counters.)
Remember that your real code (hopefully) does plenty of work between atomic operations on this bitmask, so it's very different from a microbenchmark where all the threads spend all their time hammering on that cache line.
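One way to do that instrumentation, as a rough sketch: count retries in a local and publish them with a single relaxed add. The names cas_retries, pop_lowest_counted and run_pop_demo are made up for this example, and the early exit for an empty mask is an addition that is not in the question's code.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical statistics counter shared by all threads.
static std::atomic<std::uint64_t> cas_retries{0};

// Clears the lowest set bit of `bits` and returns that bit isolated,
// or 0 if no bit was set.
std::uint64_t pop_lowest_counted(std::atomic<std::uint64_t>& bits) {
    std::uint64_t expected = bits.load(std::memory_order_relaxed);
    std::uint64_t local_retries = 0;
    while (expected != 0 &&
           !bits.compare_exchange_weak(expected, expected & (expected - 1),
                                       std::memory_order_seq_cst,
                                       std::memory_order_relaxed)) {
        ++local_retries;  // cheap: stays in a register
    }
    if (local_retries != 0)
        cas_retries.fetch_add(local_retries, std::memory_order_relaxed);
    return expected & (0 - expected);  // isolate the lowest set bit
}

// Several threads drain a full 64-bit mask; returns true if every bit was
// handed out exactly once.
bool run_pop_demo(int threads) {
    std::atomic<std::uint64_t> bits{~std::uint64_t{0}};
    std::vector<std::uint64_t> got(threads, 0);
    std::vector<int> counts(threads, 0);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            while (std::uint64_t bit = pop_lowest_counted(bits)) {
                got[t] |= bit;
                ++counts[t];
            }
        });
    }
    for (auto& th : pool) th.join();
    std::uint64_t all = 0;
    int popped = 0;
    for (int t = 0; t < threads; ++t) {
        all |= got[t];
        popped += counts[t];
    }
    return all == ~std::uint64_t{0} && popped == 64;
}
```

Reading cas_retries after a realistic run, relative to the number of successful pops, gives a first idea of how contended the variable really is.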
EDIT: The update has to be atomic and global progress must be guaranteed, but just as with solution above, it doesn't have to be a wait free algorithm (of course I'd be very happy if you can show me one).
A CAS retry loop (even when compiled on a LL/SC machine, see below) is lock-free in the technical sense: you only have to retry if some other thread made progress.
CAS leaves the cache line unmodified if it fails. On x86 it dirties the cache line (MESI M state), but x86 cmpxchg doesn't detect ABA, it only compares, so one other thread that loaded the same expected will succeed. On LL/SC machines, hopefully an SC failure on one core doesn't cause spurious SC failures on other cores, otherwise it could livelock. That's something you can assume CPU architects thought of.
It's not wait-free because a single thread can in theory have to retry an unbounded number of times.
Your code compiles with gcc8.1 -O3 -march=haswell to this asm (from the Godbolt compiler explorer)
# gcc's code-gen for x86 with BMI1 looks optimal to me. No wasted instructions!
# presumably you'll get something similar when inlining.
pop_lowest_non_zero(std::atomic<unsigned long>&):
mov rax, QWORD PTR [rdi]
.L2:
blsr rdx, rax # Bit Lowest Set Reset
lock cmpxchg QWORD PTR [rdi], rdx
jne .L2 # fall through on success: cmpxchg sets ZF on equal like regular cmp
blsi rax, rax # Bit Lowest Set Isolate
ret
Without BMI1, blsr and blsi become two instructions each. This is close to irrelevant for performance given the cost of a locked instruction.
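For reference, the two bit tricks involved, written out as plain C++ with the question's 0b0110 example in mind. The function names are hypothetical, and lowest_index uses a simple portable loop rather than a bit-scan intrinsic.

```cpp
#include <cassert>
#include <cstdint>

// x & (x - 1): the operation BMI1 `blsr` performs (clear lowest set bit).
constexpr std::uint64_t clear_lowest(std::uint64_t x) { return x & (x - 1); }

// x & -x: the operation BMI1 `blsi` performs (isolate lowest set bit).
// Written as 0 - x to avoid unary-minus-on-unsigned warnings.
constexpr std::uint64_t isolate_lowest(std::uint64_t x) { return x & (0 - x); }

// 0-based index of the lowest set bit. Precondition: x != 0.
inline int lowest_index(std::uint64_t x) {
    assert(x != 0);
    int index = 0;
    for (std::uint64_t bit = isolate_lowest(x); bit != 1; bit >>= 1) ++index;
    return index;
}
```

For the value 0x6 (0b0110), clear_lowest gives 0x4, isolate_lowest gives 0x2, and lowest_index gives 1.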
clang and MSVC unfortunately are slightly more clunky, with a taken branch on the no-contention fast path. (And clang bloats the function by peeling the first iteration. IDK if this helps with branch prediction or something, or if it's purely a missed optimization. We can get clang to generate the fast path with no taken branches using an unlikely() macro. How do the likely/unlikely macros in the Linux kernel work and what is their benefit?).
Footnote 1:
Unless you're building binaries for a known set of machines, you can't assume BMI1 is available anyway. So the compiler will need to do something like lea rdx, [rax-1] / and rdx, rax instead of blsr rdx, rax. This makes a trivial difference for this function; the majority of the cost is the atomic operation even in the uncontended case, even for the hot-in-L1d cache no contention case, looking at the out-of-order execution throughput impact on surrounding code. (e.g. 10 uops for lock cmpxchg on Skylake (http://agner.org/optimize/) vs. saving 1 uop with blsr instead of 2 other instructions. Assuming the front-end is the bottleneck, rather than memory or something else. Being a full memory barrier has an impact on loads/stores in surrounding code, too, but fortunately not on out-of-order execution of independent ALU instructions.)
Non-x86: LL/SC machines
Most non-x86 machines do all their atomic operations with load-linked / store-conditional retry loops. It's a bit unfortunate that C++11 doesn't allow you to create custom LL/SC operations (e.g. with (x-1) & x inside an LL/SC instead of just the add that you'd get from using fetch_add), but CAS machines (like x86) can't give you the ABA detection that LL/SC provides. So it's not clear how you'd design a C++ class that could compile efficiently on x86 but also directly to a LL/SC retry loop on ARM and other LL/SC ISAs. (See this discussion.)
So you just have to write code using compare_exchange_weak if C++ doesn't provide the operation you want as a member function (the way it provides fetch_or or fetch_and).
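A generic form of such a loop could look like the sketch below. fetch_modify is a hypothetical helper, not a standard library function; it applies an arbitrary pure function inside a compare_exchange_weak retry loop and returns the old value, like the fetch_xxx members do.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper: atomically replaces obj with f(old), returns old.
// f may run several times, so it must be free of side effects.
template <class T, class F>
T fetch_modify(std::atomic<T>& obj, F f,
               std::memory_order order = std::memory_order_seq_cst) {
    T expected = obj.load(std::memory_order_relaxed);
    while (!obj.compare_exchange_weak(expected, f(expected), order,
                                      std::memory_order_relaxed)) {
        // On failure `expected` now holds the current value; f is
        // re-evaluated with it on the next iteration.
    }
    return expected;
}

// Usage: clear the lowest set bit of 0x16 (0b10110), expecting 0x14.
bool demo_fetch_modify() {
    std::atomic<std::uint64_t> v{0x16};
    std::uint64_t old =
        fetch_modify(v, [](std::uint64_t x) { return x & (x - 1); });
    return old == 0x16 && v.load() == 0x14;
}
```

On x86 this compiles to a lock cmpxchg loop; on LL/SC machines the compiler still emits an emulated CAS around the computation rather than a single LL/SC loop.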
What you get in practice from current compilers is a compare_exchange_weak implemented with LL/SC, and your arithmetic operation separate from that. The asm actually does the relaxed load before the exclusive-load-acquire (ldaxr), instead of just basing the computation on the ldaxr result. And there's extra branching to check that expected from the last attempt matches the new load result before attempting the store.
The LL/SC window is maybe shorter than with 2 dependent ALU instructions between the load and store, in case that matters. The CPU has the desired value ready ahead of time, not dependent on the load result. (It depends on the previous load result.) Clang's code-gen puts the sub/and inside the loop, but dependent on the previous iteration's load, so with out of order execution they can still be ready ahead of time.
But if that was really the most efficient way to do things, compilers should be using it for fetch_add as well. They don't because it probably isn't. LL/SC retries are rare, just like CAS retries on x86. I don't know if it could make a difference for very-high-contention situations. Maybe, but compilers don't slow down the fast path to optimize for that when compiling fetch_add.
I renamed your functions and re-formatted the while() for readability, because it was already too long for one line without wrapping it with unlikely().
This version compiles to maybe slightly nicer asm than your original, with clang. I also fixed your function names so it actually compiles at all; there's a mismatch in your question. I picked totally different names that are similar to x86's BMI instruction names because they succinctly describe the operation.
#include <atomic>
#include <cstdint>
#ifdef __GNUC__
#define unlikely(expr) __builtin_expect(!!(expr), 0)
#define likely(expr) __builtin_expect(!!(expr), 1)
#else
#define unlikely(expr) (expr)
#define likely(expr) (expr)
#endif
inline uint64_t clear_lowest_set(std::uint64_t v){
return (v - 1) & v;
}
inline uint64_t isolate_lowest_set(std::uint64_t v){
//return v & ~clear_lowest_set(v);
return (-v) & v;
// MSVC optimizes this better for ARM when used separately.
// The other way (in terms of clear_lowest_set) does still CSE to 2 instructions
// when the clear_lowest_set result is already available
}
uint64_t pop_lowest_non_zero(std::atomic_uint64_t& foo)
{
auto expected = foo.load(std::memory_order_relaxed);
while(unlikely(!foo.compare_exchange_weak(
expected, clear_lowest_set(expected),
std::memory_order_seq_cst, std::memory_order_relaxed)))
{}
return isolate_lowest_set(expected);
}
Clang -O3 for AArch64 (-target aarch64-linux-android -stdlib=libc++ -mcpu=cortex-a57 on Godbolt) makes some weird code. This is from clang6.0, but there is weirdness with older versions, too, creating a 0/1 integer in a register and then testing it instead of just jumping to the right place in the first place.
pop_lowest_non_zero(std::__1::atomic<unsigned long long>&): // #pop_lowest_non_zero(std::__1::atomic<unsigned long long>&)
ldr x9, [x0] # the relaxed load
ldaxr x8, [x0] # the CAS load (a=acquire, x=exclusive: pairs with a later stxr)
cmp x8, x9 # compare part of the CAS
b.ne .LBB0_3
sub x10, x9, #1
and x10, x10, x9 # clear_lowest( relaxed load result)
stlxr w11, x10, [x0] # the CAS store (sequential-release)
cbnz w11, .LBB0_4 # if(SC failed) goto retry loop
# else fall through and eventually reach the epilogue
# this looks insane. w10 = 0|1, then branch if bit[0] != 0. Always taken, right?
orr w10, wzr, #0x1
tbnz w10, #0, .LBB0_5 # test-bit not-zero will always jump to the epilogue
b .LBB0_6 # never reached
.LBB0_3:
clrex # clear the ldaxr/stlxr transaction
.LBB0_4:
# This block is pointless; just should b to .LBB0_6
mov w10, wzr
tbz w10, #0, .LBB0_6 # go to the retry loop, below the ret (not shown here)
.LBB0_5: # isolate_lowest_set and return
neg x8, x9
and x0, x9, x8
ret
.LBB0_6:
# the retry loop. Only reached if the compare or SC failed.
...
All this crazy branching might not be a real performance problem, but it's obvious this could be a lot more efficient (even without teaching clang how to use LL/SC directly instead of emulated CAS). Reported as clang/LLVM bug 38173
It seems MSVC doesn't know that AArch64's release-stores already have sequentially-consistent semantics wrt. ldar / ldaxr loads (not just regular release like x86): it's still using a dmb ish instruction (full memory barrier) after stlxr. It has fewer wasted instructions, but its x86 bias is showing or something: compare_exchange_weak compiles like compare_exchange_strong with a probably-useless retry loop that will try just the CAS again with the old expected/desired, on LL/SC failure. That will usually be because another thread modified the line, so expected will mismatch. (Godbolt doesn't have MSVC for AArch64 in any older versions, so perhaps support is brand new. That might explain the dmb)
## MSVC Pre 2018 -Ox
|pop_lowest_non_zero| PROC
ldr x10,[x0] # x10 = expected = foo.load(relaxed)
|$LL2#pop_lowest| # L2 # top of the while() loop
sub x8,x10,#1
and x11,x8,x10 # clear_lowest(relaxed load result)
|$LN76#pop_lowest| # LN76
ldaxr x8,[x0]
cmp x8,x10 # the compare part of CAS
bne |$LN77#pop_lowest|
stlxr w9,x11,[x0]
cbnz w9,|$LN76#pop_lowest| # retry just the CAS on LL/SC fail; this looks like compare_exchange_strong
# fall through on LL/SC success
|$LN77#pop_lowest| # LN77
dmb ish # full memory barrier: slow
cmp x8,x10 # compare again, because apparently MSVC wants to share the `dmb` instruction between the CAS-fail and CAS-success paths.
beq |$LN75#pop_lowest| # if(expected matches) goto epilogue
mov x10,x8 # else update expected
b |$LL2#pop_lowest| # and goto the top of the while loop
|$LN75#pop_lowest| # LN75 # function epilogue
neg x8,x10
and x0,x8,x10
ret
gcc6.3 for AArch64 makes weird code, too, storing expected to the stack. (Godbolt doesn't have newer AArch64 gcc).
Hand-written AArch64 version that sucks less
I'm very unimpressed with AArch64 compilers for this! IDK why they don't generate something clean and efficient like this, with no taken branches on the fast path, and only a bit of out-of-line code to set up for jumping back into the CAS to retry.
pop_lowest_non_zero ## hand written based on clang
# clang could emit this even without turning CAS into LL/SC directly
ldr x9, [x0] # x9 = expected = foo.load(relaxed)
.Lcas_retry:
ldaxr x8, [x0] # x8 = the CAS load (a=acquire, x=exclusive: pairs with a later stxr)
cmp x8, x9 # compare part of the CAS
b.ne .Lcas_fail
sub x10, x9, #1
and x10, x10, x9 # clear_lowest (relaxed load result)
stlxr w11, x10, [x0] # the CAS store (sequential-release)
cbnz w11, .Lllsc_fail
# LL/SC success: isolate_lowest_set and return
.Lepilogue:
neg x8, x9
and x0, x9, x8
ret
.Lcas_fail:
// clrex # We're about to start another ldaxr making clrex unnecessary
.Lllsc_fail:
mov x9, x8 # update expected
b .Lcas_retry # instead of duplicating the loop, jump back to the existing one with x9 = expected
Compare with .fetch_add:
Clang does make nice code for fetch_add():
mov x8, x0
.LBB1_1:
ldxr x0, [x8] # relaxed exclusive load: I used mo_release
add x9, x0, #1
stlxr w10, x9, [x8]
cbnz w10, .LBB1_1 # retry if LL/SC failed
ret
Instead of add #1, we'd like to have sub x9, x0, #1 / and x9, x9, x0 (x0 being the loaded value in this loop), so we just get a LL/SC retry loop. This saves code-size, but it won't be significantly faster. Probably not even measurably faster in most cases, especially as part of a whole program where it's not used an insane amount.
Alternatives: Do you even need exactly this bitmap operation? Can you break it up to reduce contention?
Can you use an atomic counter instead of a bitmap, and map it to a bitmap when needed? Operations that want a bitmap can map the counter to a bitmap with uint64_t(~0ULL) << (64-counter) (for non-zero counter only). Or maybe tmp=1ULL << counter; tmp ^= tmp-1; i.e. x86 xor eax,eax / bts rax, rdi (rax = 1 set bit at position 0..63) / blsmsk rax, rax (rax = all bits set up to that position). Hmm, that still needs a special case for mask=0, because there are 65 possible states for a contiguous bitmask: highest (or lowest) bit at one of 64 positions, or no bits set at all.
Or if there's some pattern to the bitmap, x86 BMI2 pdep can scatter contiguous bits into that pattern. Get N contiguous set bits with (1ULL << counter) - 1, again for counter < 64.
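The counter-to-mask mappings, with the special cases handled explicitly, could be sketched like this. low_mask and high_mask are hypothetical names; both accept a counter in the range 0..64 and avoid the undefined shift by 64.

```cpp
#include <cassert>
#include <cstdint>

// The n lowest bits set, for n in [0, 64].
constexpr std::uint64_t low_mask(unsigned n) {
    return n >= 64 ? ~std::uint64_t{0} : (std::uint64_t{1} << n) - 1;
}

// The n highest bits set, for n in [0, 64].
constexpr std::uint64_t high_mask(unsigned n) {
    return n == 0 ? 0
                  : (n >= 64 ? ~std::uint64_t{0}
                             : ~std::uint64_t{0} << (64 - n));
}
```

The extra comparisons are the price of covering all 65 states of a contiguous mask with one function.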
Or maybe it has to be a bitmask, but doesn't need to be one single bitmask?
No idea what your use-case is, but this kind of idea might be useful:
Do you need a single atomic bitmap that all threads have to contend for? Perhaps you could break it into multiple chunks, each in a separate cache line. (But that makes it impossible to atomically snapshot the entire map.) Still, if each thread has a preferred chunk, and always tries that before moving on to look for a slot in another chunk, you might reduce contention in the average case.
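A rough sketch of that chunked layout, under the assumptions of 4 chunks and a 64-byte cache line (both arbitrary choices for illustration). ShardedBitmap, Chunk and try_pop are hypothetical names.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// One 64-bit mask per (assumed) 64-byte cache line.
struct alignas(64) Chunk {
    std::atomic<std::uint64_t> bits{0};
};

class ShardedBitmap {
    static constexpr std::size_t kChunks = 4;  // arbitrary for this sketch
    std::array<Chunk, kChunks> chunks_;

public:
    // Marks every slot as available. Not meant to race with try_pop.
    void fill() {
        for (auto& c : chunks_)
            c.bits.store(~std::uint64_t{0}, std::memory_order_relaxed);
    }

    // Tries the preferred chunk first, then the others in order.
    // Returns a global slot index (chunk * 64 + bit), or -1 if all are empty.
    int try_pop(std::size_t preferred) {
        for (std::size_t k = 0; k < kChunks; ++k) {
            std::size_t c = (preferred + k) % kChunks;
            std::atomic<std::uint64_t>& bits = chunks_[c].bits;
            std::uint64_t expected = bits.load(std::memory_order_relaxed);
            while (expected != 0) {
                if (bits.compare_exchange_weak(expected,
                                               expected & (expected - 1),
                                               std::memory_order_seq_cst,
                                               std::memory_order_relaxed)) {
                    int idx = 0;  // position of the bit we just cleared
                    while (((expected >> idx) & 1) == 0) ++idx;
                    return static_cast<int>(c * 64) + idx;
                }
            }
        }
        return -1;
    }
};

// Single-threaded usage check: drain all 256 slots starting at chunk 2.
bool run_sharded_demo() {
    ShardedBitmap map;
    map.fill();
    int first = map.try_pop(2);  // lowest bit of chunk 2
    int count = 1;
    while (map.try_pop(2) >= 0) ++count;
    return first == 128 && count == 256 && map.try_pop(0) == -1;
}
```

Threads with different preferred chunks only touch the same cache line once their own chunk runs dry, which is where the reduction in contention would come from.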
In asm (or with horrible union hacks in C++), you could try to reduce cmpxchg failures without reducing contention by finding the right byte of the 64-bit integer to update, and then only lock cmpxchg on it. That's not actually helpful for this case because two threads that see the same integer will both try to clear the same bit. But it could reduce retries between this and something that sets other bits in the qword. Of course this only works if it's not a correctness problem for lock cmpxchg to succeed when other bytes of the qword have changed.
So basically the question is, do I do better to loop on compare_exchange_weak until I find / get what I want or should I just use a mutex? OK, at least I understand the question now so let's explore that a little bit.
So, everyone out there is probably screaming which is faster, which is faster? Well, I personally couldn't care much, unless it proves to be a problem in practise, but if you care then you should benchmark it. You can get a first order approximation over at the most excellent Wandbox.
But there's another, more subtle issue: locking. The first solution is lockless, but there's a busy loop. The second takes a lock, which can have side-effects.
The busy loop is probably pretty harmless. It only soaks up one core and it is unlikely to run for long but one would want to profile the code when running under real-world conditions if one suspected otherwise.
The lock, on the other hand, might not be so harmless because it can cause priority inversion, where a lower-priority thread holding the lock effectively blocks a higher-priority one. This can be an issue in any application which runs cooperating threads at different priorities, but it is particularly so for me because I write realtime audio code. It bit me on the Mac, that's how I know all this.
So, I hope that tells you some at least of what you wanted to know. With your rep, I should not be trying to tell you how to 'write the codez'.
Reference: https://en.m.wikipedia.org/wiki/Priority_inversion

Can num++ be atomic for 'int num'?

In general, for int num, num++ (or ++num), as a read-modify-write operation, is not atomic. But I often see compilers, for example GCC, generate the following code for it (try here):
void f()
{
int num = 0;
num++;
}
f():
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 0
add DWORD PTR [rbp-4], 1
nop
pop rbp
ret
Since line 5 of the assembly, which corresponds to num++, is one instruction, can we conclude that num++ is atomic in this case?
And if so, does it mean that so-generated num++ can be used in concurrent (multi-threaded) scenarios without any danger of data races (i.e. we don't need to make it, for example, std::atomic<int> and impose the associated costs, since it's atomic anyway)?
UPDATE
Notice that this question is not whether increment is atomic (it's not, and that was and is the opening line of the question). It's whether it can be in particular scenarios, i.e. whether the one-instruction nature can in certain cases be exploited to avoid the overhead of the lock prefix. And, as the accepted answer (in its section about uniprocessor machines), this answer, the conversation in its comments, and others explain, it can (although not with C or C++).
This is absolutely what C++ defines as a Data Race that causes Undefined Behaviour, even if one compiler happened to produce code that did what you hoped on some target machine. You need to use std::atomic for reliable results, but you can use it with memory_order_relaxed if you don't care about reordering. See below for some example code and asm output using fetch_add.
But first, the assembly language part of the question:
Since num++ is one instruction (add dword [num], 1), can we conclude that num++ is atomic in this case?
Memory-destination instructions (other than pure stores) are read-modify-write operations that happen in multiple internal steps. No architectural register is modified, but the CPU has to hold the data internally while it sends it through its ALU. The actual register file is only a small part of the data storage inside even the simplest CPU, with latches holding outputs of one stage as inputs for another stage, etc., etc.
Memory operations from other CPUs can become globally visible between the load and store. I.e. two threads running add dword [num], 1 in a loop would step on each other's stores. (See @Margaret's answer for a nice diagram). After 40k increments from each of two threads, the counter might have only gone up by ~60k (not 80k) on real multi-core x86 hardware.
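As a minimal sketch of the fix for the lost-update scenario above, assuming a hosted C++11 toolchain with thread support: replacing the plain int with std::atomic<int> and incrementing with fetch_add makes every increment count, even with memory_order_relaxed. The function name count_with_relaxed_atomic is made up for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: two threads each perform `iters` relaxed increments
// on a shared atomic counter; returns the final counter value.
int count_with_relaxed_atomic(int iters) {
    assert(iters >= 0);
    std::atomic<int> counter{0};
    auto work = [&counter, iters] {
        for (int i = 0; i < iters; ++i) {
            // Atomic read-modify-write: no increment can be lost,
            // even though relaxed imposes no ordering on other memory.
            counter.fetch_add(1, std::memory_order_relaxed);
        }
    };
    std::thread a(work);
    std::thread b(work);
    a.join();
    b.join();
    // join() synchronizes with the end of each thread, so this sees all increments.
    return counter.load(std::memory_order_relaxed);
}
```

Calling count_with_relaxed_atomic(40000) returns 80000 on every run, whereas two threads running a plain add dword [num], 1 can lose updates.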
"Atomic", from the Greek word meaning indivisible, means that no observer can see the operation as separate steps. Happening physically / electrically instantaneously for all bits simultaneously is just one way to achieve this for a load or store, but that's not even possible for an ALU operation. I went into a lot more detail about pure loads and pure stores in my answer to Atomicity on x86, while this answer focuses on read-modify-write.
The lock prefix can be applied to many read-modify-write (memory destination) instructions to make the entire operation atomic with respect to all possible observers in the system (other cores and DMA devices, not an oscilloscope hooked up to the CPU pins). That is why it exists. (See also this Q&A).
So lock add dword [num], 1 is atomic. A CPU core running that instruction would keep the cache line pinned in Modified state in its private L1 cache from when the load reads data from cache until the store commits its result back into cache. This prevents any other cache in the system from having a copy of the cache line at any point from load to store, according to the rules of the MESI cache coherency protocol (or the MOESI/MESIF versions of it used by multi-core AMD/Intel CPUs, respectively). Thus, operations by other cores appear to happen either before or after, not during.
Without the lock prefix, another core could take ownership of the cache line and modify it after our load but before our store, so that other store would become globally visible in between our load and store. Several other answers get this wrong, and claim that without lock you'd get conflicting copies of the same cache line. This can never happen in a system with coherent caches.
(If a locked instruction operates on memory that spans two cache lines, it takes a lot more work to make sure the changes to both parts of the object stay atomic as they propagate to all observers, so no observer can see tearing. The CPU might have to lock the whole memory bus until the data hits memory. Don't misalign your atomic variables!)
Note that the lock prefix also turns an instruction into a full memory barrier (like MFENCE), stopping all run-time reordering and thus giving sequential consistency. (See Jeff Preshing's excellent blog post. His other posts are all excellent, too, and clearly explain a lot of good stuff about lock-free programming, from x86 and other hardware details to C++ rules.)
On a uniprocessor machine, or in a single-threaded process, a single RMW instruction actually is atomic without a lock prefix. The only way for other code to access the shared variable is for the CPU to do a context switch, which can't happen in the middle of an instruction. So a plain dec dword [num] can synchronize between a single-threaded program and its signal handlers, or in a multi-threaded program running on a single-core machine. See the second half of my answer on another question, and the comments under it, where I explain this in more detail.
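In C++ there is no way to ask for the unlocked single-instruction form, so the closest portable route for the signal-handler case is a lock-free std::atomic with relaxed ordering. The following is only a sketch under that assumption; g_signal_count, count_signal and raise_and_count are hypothetical names, and on x86 the increment will still carry a lock prefix.

```cpp
#include <atomic>
#include <cassert>
#include <csignal>

// Hypothetical counter shared between main-line code and a signal handler.
static std::atomic<int> g_signal_count{0};

extern "C" void count_signal(int) {
    // Lock-free atomic operations are permitted inside a signal handler.
    g_signal_count.fetch_add(1, std::memory_order_relaxed);
}

// Raises SIGINT `times` times in the calling thread and returns
// how many deliveries the handler counted.
int raise_and_count(int times) {
    assert(g_signal_count.is_lock_free());
    g_signal_count.store(0, std::memory_order_relaxed);
    for (int i = 0; i < times; ++i) {
        // Re-install each time: some platforms reset the handler after delivery.
        auto previous = std::signal(SIGINT, count_signal);
        std::raise(SIGINT);            // the handler runs before raise() returns
        std::signal(SIGINT, previous); // restore whatever was installed before
    }
    return g_signal_count.load(std::memory_order_relaxed);
}
```

Because the handler and the interrupted code run on the same thread here, only atomicity with respect to the interruption matters, which is exactly the case where the hardware would not need the lock prefix.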
Back to C++:
It's totally bogus to use num++ without telling the compiler that you need it to compile to a single read-modify-write implementation:
;; Valid compiler output for num++
mov eax, [num]
inc eax
mov [num], eax
This is very likely if you use the value of num later: the compiler will keep it live in a register after the increment. So even if you check how num++ compiles on its own, changing the surrounding code can affect it.
(If the value isn't needed later, inc dword [num] is preferred; modern x86 CPUs will run a memory-destination RMW instruction at least as efficiently as using three separate instructions. Fun fact: gcc -O3 -m32 -mtune=i586 will actually emit this, because (Pentium) P5's superscalar pipeline didn't decode complex instructions to multiple simple micro-operations the way P6 and later microarchitectures do. See Agner Fog's instruction tables / microarchitecture guide for more info, and the x86 tag wiki for many useful links (including Intel's x86 ISA manuals, which are freely available as PDF)).
Don't confuse the target memory model (x86) with the C++ memory model
Compile-time reordering is allowed. The other part of what you get with std::atomic is control over compile-time reordering, to make sure your num++ becomes globally visible only after some other operation.
Classic example: Storing some data into a buffer for another thread to look at, then setting a flag. Even though x86 does acquire loads/release stores for free, you still have to tell the compiler not to reorder by using flag.store(1, std::memory_order_release);.
You might be expecting that this code will synchronize with other threads:
// int flag; is just a plain global, not std::atomic<int>.
flag--; // Pretend this is supposed to be some kind of locking attempt
modify_a_data_structure(&foo); // doesn't look at flag, and the compiler knows this. (Assume it can see the function def). Otherwise the usual don't-break-single-threaded-code rules come into play!
flag++;
But it won't. The compiler is free to move the flag++ across the function call (if it inlines the function or knows that it doesn't look at flag). Then it can optimize away the modification entirely, because flag isn't even volatile.
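For contrast, here is a hedged sketch of the buffer-then-flag pattern described above, written so that the compiler is told about the ordering. The names buffer, flag and handoff are illustrative only, and the example assumes a hosted C++11 toolchain with threads.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int buffer = 0;              // plain data, published through the flag
std::atomic<int> flag{0};    // hypothetical "data is ready" flag

// The producer thread writes `value` into the buffer and then sets the flag;
// the calling thread waits for the flag and returns what it reads from the buffer.
int handoff(int value) {
    buffer = 0;
    flag.store(0, std::memory_order_relaxed);
    std::thread producer([value] {
        buffer = value;                             // 1. write the data
        flag.store(1, std::memory_order_release);   // 2. publish: may not move above step 1
    });
    while (flag.load(std::memory_order_acquire) == 0) {
        std::this_thread::yield();                  // spin until the flag is published
    }
    int seen = buffer;   // safe: the acquire load synchronized with the release store
    producer.join();
    assert(seen == value);
    return seen;
}
```

On x86 both the release store and the acquire load compile to plain mov instructions; what the memory orders buy here is that the compiler may not reorder the buffer access across the flag access.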
(And no, C++ volatile is not a useful substitute for std::atomic. std::atomic does make the compiler assume that values in memory can be modified asynchronously similar to volatile, but there's much more to it than that. (In practice there are similarities between volatile int and std::atomic with mo_relaxed for pure-load and pure-store operations, but not for RMWs). Also, volatile std::atomic<int> foo is not necessarily the same as std::atomic<int> foo, although current compilers don't optimize atomics (e.g. 2 back-to-back stores of the same value) so volatile atomic wouldn't change the code-gen.)
Defining data races on non-atomic variables as Undefined Behaviour is what lets the compiler still hoist loads and sink stores out of loops, and many other optimizations for memory that multiple threads might have a reference to. (See this LLVM blog for more about how UB enables compiler optimizations.)
As I mentioned, the x86 lock prefix is a full memory barrier, so using num.fetch_add(1, std::memory_order_relaxed); generates the same code on x86 as num++ (the default is sequential consistency), but it can be much more efficient on other architectures (like ARM). Even on x86, relaxed allows more compile-time reordering.
This is what GCC actually does on x86, for a few functions that operate on a std::atomic global variable.
See the source + assembly language code formatted nicely on the Godbolt compiler explorer. You can select other target architectures, including ARM, MIPS, and PowerPC, to see what kind of assembly language code you get from atomics for those targets.
#include <atomic>
std::atomic<int> num;
void inc_relaxed() {
num.fetch_add(1, std::memory_order_relaxed);
}
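void inc_seq_cst() { num++; } // operator++ uses the default seq_cst ordering
int load_num() { return num; } // Even seq_cst loads are free on x86
void store_num(int val){ num = val; }
void store_num_release(int val){
num.store(val, std::memory_order_release);
}
void store_num_relaxed(int val){
num.store(val, std::memory_order_relaxed);
}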
// Can the compiler collapse multiple atomic operations into one? No, it can't.
# g++ 6.2 -O3, targeting x86-64 System V calling convention. (First argument in edi/rdi)
inc_relaxed():
lock add DWORD PTR num[rip], 1 #### Even relaxed RMWs need a lock. There's no way to request just a single-instruction RMW with no lock, for synchronizing between a program and signal handler for example. :/ There is atomic_signal_fence for ordering, but nothing for RMW.
ret
inc_seq_cst():
lock add DWORD PTR num[rip], 1
ret
load_num():
mov eax, DWORD PTR num[rip]
ret
store_num(int):
mov DWORD PTR num[rip], edi
mfence ##### seq_cst stores need an mfence
ret
store_num_release(int):
mov DWORD PTR num[rip], edi
ret ##### Release and weaker doesn't.
store_num_relaxed(int):
mov DWORD PTR num[rip], edi
ret
Notice how MFENCE (a full barrier) is needed after a sequential-consistency store. x86 is strongly ordered in general, but StoreLoad reordering is allowed. Having a store buffer is essential for good performance on a pipelined out-of-order CPU. Jeff Preshing's Memory Reordering Caught in the Act shows the consequences of not using MFENCE, with real code to show reordering happening on real hardware.
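The StoreLoad case can be sketched as a small litmus test. This is an illustration only, with made-up names (count_both_zero, check_store_load): each thread stores to its own variable and then loads the other one, all with the default seq_cst ordering. Under seq_cst the outcome where both threads read 0 is forbidden; if the operations were weakened to release/acquire or relaxed, that outcome would be allowed, because each store could still be sitting in a store buffer when the other thread's load executes.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Returns how many of `rounds` runs ended with both threads reading 0.
int count_both_zero(int rounds) {
    int both_zero = 0;
    for (int i = 0; i < rounds; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;
        std::thread t1([&] {
            x.store(1);        // seq_cst store: needs a full barrier on x86
            r1 = y.load();     // seq_cst load: a plain mov on x86
        });
        std::thread t2([&] {
            y.store(1);
            r2 = x.load();
        });
        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}

// Usage: with seq_cst operations the forbidden outcome must never appear.
void check_store_load(int rounds) {
    assert(count_both_zero(rounds) == 0);
}
```

Starting a fresh pair of threads per round makes the window for reordering small, so this harness is a poor detector for the weakened version; a tighter loop with thread reuse would be needed to observe the reordering in practice.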
Re: discussion in comments on @Richard Hodges' answer about compilers merging std::atomic num++; num-=2; operations into one num--; instruction:
A separate Q&A on this same subject: Why don't compilers merge redundant std::atomic writes?, where my answer restates a lot of what I wrote below.
Current compilers don't actually do this (yet), but not because they aren't allowed to. C++ WG21/P0062R1: When should compilers optimize atomics? discusses the expectation that many programmers have that compilers won't make "surprising" optimizations, and what the standard can do to give programmers control. N4455 discusses many examples of things that can be optimized, including this one. It points out that inlining and constant-propagation can introduce things like fetch_or(0) which may be able to turn into just a load() (but still has acquire and release semantics), even when the original source didn't have any obviously redundant atomic ops.
The real reasons compilers don't do it (yet) are: (1) nobody's written the complicated code that would allow the compiler to do that safely (without ever getting it wrong), and (2) it potentially violates the principle of least surprise. Lock-free code is hard enough to write correctly in the first place. So don't be casual in your use of atomic weapons: they aren't cheap and don't optimize much. It's not always easy to avoid redundant atomic operations with std::shared_ptr<T>, though, since there's no non-atomic version of it (although one of the answers here gives an easy way to define a shared_ptr_unsynchronized<T> for gcc).
Getting back to num++; num-=2; compiling as if it were num--:
Compilers are allowed to do this, unless num is volatile std::atomic<int>. If a reordering is possible, the as-if rule allows the compiler to decide at compile time that it always happens that way. Nothing guarantees that an observer could see the intermediate values (the num++ result).
I.e. if the ordering where nothing becomes globally visible between these operations is compatible with the ordering requirements of the source
(according to the C++ rules for the abstract machine, not the target architecture), the compiler can emit a single lock dec dword [num] instead of lock inc dword [num] / lock sub dword [num], 2.
num++; num-- can't disappear, because it still has a Synchronizes With relationship with other threads that look at num, and it's both an acquire-load and a release-store which disallows reordering of other operations in this thread. For x86, this might be able to compile to an MFENCE, instead of a lock add dword [num], 0 (i.e. num += 0).
As discussed in P0062R1, more aggressive merging of non-adjacent atomic ops at compile time can be bad (e.g. a progress counter only gets updated once at the end instead of every iteration), but it can also help performance without downsides (e.g. skipping the atomic inc / dec of ref counts when a copy of a shared_ptr is created and destroyed, if the compiler can prove that another shared_ptr object exists for the entire lifespan of the temporary.)
Even num++; num-- merging could hurt fairness of a lock implementation when one thread unlocks and re-locks right away. If it's never actually released in the asm, even hardware arbitration mechanisms won't give another thread a chance to grab the lock at that point.
With current gcc6.2 and clang3.9, you still get separate locked operations even with memory_order_relaxed in the most obviously optimizable case. (Godbolt compiler explorer so you can see if the latest versions are different.)
void multiple_ops_relaxed(std::atomic<unsigned int>& num) {
num.fetch_add( 1, std::memory_order_relaxed);
num.fetch_add(-1, std::memory_order_relaxed);
num.fetch_add( 6, std::memory_order_relaxed);
num.fetch_add(-5, std::memory_order_relaxed);
//num.fetch_add(-1, std::memory_order_relaxed);
}
multiple_ops_relaxed(std::atomic<unsigned int>&):
lock add DWORD PTR [rdi], 1
lock sub DWORD PTR [rdi], 1
lock add DWORD PTR [rdi], 6
lock sub DWORD PTR [rdi], 5
ret
Without many complications an instruction like add DWORD PTR [rbp-4], 1 is very CISC-style.
It performs three operations: it loads the operand from memory, increments it, and stores the result back to memory.
During these operations the CPU acquires and releases the bus twice; in between, any other agent can acquire it too, and this violates atomicity.
AGENT 1          AGENT 2
load X
inc C
                 load X
                 inc C
                 store X
store X
X is incremented only once.
...and now let's enable optimisations:
f():
rep ret
OK, let's give it a chance:
void f(int& num)
{
num = 0;
num++;
--num;
num += 6;
num -=5;
--num;
}
result:
f(int&):
mov DWORD PTR [rdi], 0
ret
another observing thread (even ignoring cache synchronisation delays) has no opportunity to observe the individual changes.
compare to:
#include <atomic>
void f(std::atomic<int>& num)
{
num = 0;
num++;
--num;
num += 6;
num -=5;
--num;
}
where the result is:
f(std::atomic<int>&):
mov DWORD PTR [rdi], 0
mfence
lock add DWORD PTR [rdi], 1
lock sub DWORD PTR [rdi], 1
lock add DWORD PTR [rdi], 6
lock sub DWORD PTR [rdi], 5
lock sub DWORD PTR [rdi], 1
ret
Now, each modification is:-
observable in another thread, and
respectful of similar modifications happening in other threads.
atomicity is not just at the instruction level, it involves the whole pipeline from processor, through the caches, to memory and back.
Further info
Regarding the effect of optimisations of updates of std::atomics.
The c++ standard has the 'as if' rule, by which it is permissible for the compiler to reorder code, and even rewrite code provided that the outcome has the exact same observable effects (including side-effects) as if it had simply executed your code.
The as-if rule is conservative, particularly involving atomics.
consider:
void incdec(int& num) {
++num;
--num;
}
Because there are no mutex locks, atomics or any other constructs that influence inter-thread sequencing, I would argue that the compiler is free to rewrite this function as a NOP, eg:
void incdec(int&) {
// nada
}
This is because in the c++ memory model, there is no possibility of another thread observing the result of the increment. It would of course be different if num was volatile (might influence hardware behaviour). But in this case, this function will be the only function modifying this memory (otherwise the program is ill-formed).
However, this is a different ball game:
void incdec(std::atomic<int>& num) {
++num;
--num;
}
num is an atomic. Changes to it must be observable to other threads that are watching. Changes those threads themselves make (such as setting the value to 100 in between the increment and decrement) will have very far-reaching effects on the eventual value of num.
Here is a demo:
#include <iostream>
#include <thread>
#include <atomic>
int main()
{
for (int iter = 0 ; iter < 20 ; ++iter)
{
std::atomic<int> num = { 0 };
std::thread t1([&] {
for (int i = 0 ; i < 10000000 ; ++i)
{
++num;
--num;
}
});
std::thread t2([&] {
for (int i = 0 ; i < 10000000 ; ++i)
{
num = 100;
}
});
t2.join();
t1.join();
std::cout << num << std::endl;
}
}
sample output:
99
99
99
99
99
100
99
99
100
100
100
100
99
99
100
99
99
100
100
99
The add instruction is not atomic. It references memory, and two processor cores may have different local cache of that memory.
IIRC the atomic variant of the add instruction is called lock xadd
Since line 5, which corresponds to num++ is one instruction, can we conclude that num++ is atomic in this case?
It is dangerous to draw conclusions based on "reverse engineering" generated assembly. For example, you seem to have compiled your code with optimization disabled, otherwise the compiler would have thrown away that variable or loaded 1 directly to it without invoking operator++. Because the generated assembly may change significantly, based on optimization flags, target CPU, etc., your conclusion is based on sand.
Also, your idea that one assembly instruction means an operation is atomic is wrong as well. This add will not be atomic on multi-CPU systems, even on the x86 architecture.
Even if your compiler always emitted this as an atomic operation, accessing num from any other thread concurrently would constitute a data race according to the C++11 and C++14 standards and the program would have undefined behavior.
But it is worse than that. First, as has been mentioned, the instruction generated by the compiler when incrementing a variable may depend on the optimization level. Secondly, the compiler may reorder other memory accesses around ++num if num is not atomic, e.g.
int main()
{
std::unique_ptr<std::vector<int>> vec;
int ready = 0;
std::thread t{[&]
{
while (!ready);
// use "vec" here
}};
vec.reset(new std::vector<int>());
++ready;
t.join();
}
Even if we assume optimistically that ++ready is "atomic", and that the compiler generates the checking loop as needed (as I said, it's UB and therefore the compiler is free to remove it, replace it with an infinite loop, etc.), the compiler might still move the pointer assignment, or even worse the initialization of the vector to a point after the increment operation, causing chaos in the new thread. In practice, I would not be surprised at all if an optimizing compiler removed the ready variable and the checking loop completely, as this does not affect observable behavior under language rules (as opposed to your private hopes).
In fact, at last year's Meeting C++ conference, I heard from two compiler developers that they very gladly implement optimizations that make naively written multi-threaded programs misbehave, as long as language rules allow it, if even a minor performance improvement is seen in correctly written programs.
Lastly, even if you didn't care about portability, and your compiler was magically nice, the CPU you are using is very likely of a superscalar CISC type and will break down instructions into micro-ops, reorder and/or speculatively execute them, to an extent only limited by synchronizing primitives such as (on Intel) the LOCK prefix or memory fences, in order to maximize operations per second.
To make a long story short, the natural responsibilities of thread-safe programming are:
Your duty is to write code that has well-defined behavior under language rules (and in particular the language standard memory model).
Your compiler's duty is to generate machine code which has the same well-defined (observable) behavior under the target architecture's memory model.
Your CPU's duty is to execute this code so that the observed behavior is compatible with its own architecture's memory model.
If you want to do it your own way, it might just work in some cases, but understand that the warranty is void, and you will be solely responsible for any unwanted outcomes. :-)
PS: Correctly written example:
int main()
{
std::unique_ptr<std::vector<int>> vec;
std::atomic<int> ready{0}; // NOTE the use of the std::atomic template
std::thread t{[&]
{
while (!ready);
// use "vec" here
}};
vec.reset(new std::vector<int>());
++ready;
t.join();
}
This is safe because:
The checks of ready cannot be optimized away according to language rules.
The ++ready happens-before the check that sees ready as not zero, and other operations cannot be reordered around these operations. This is because ++ready and the check are sequentially consistent, which is another term described in the C++ memory model and that forbids this specific reordering. Therefore the compiler must not reorder the instructions, and must also tell the CPU that it must not e.g. postpone the write to vec to after the increment of ready. Sequentially consistent is the strongest guarantee regarding atomics in the language standard. Lesser (and theoretically cheaper) guarantees are available e.g. via other methods of std::atomic<T>, but these are definitely for experts only, and may not be optimized much by the compiler developers, because they are rarely used.
On a single-core x86 machine, an add instruction will generally be atomic with respect to other code on the CPU (see Footnote 1). An interrupt can't split a single instruction down the middle.
Out-of-order execution is required to preserve the illusion of instructions executing one at a time in order within a single core, so any instruction running on the same CPU will either happen completely before or completely after the add.
Modern x86 systems are multi-core, so the uniprocessor special case doesn't apply.
If one is targeting a small embedded PC and has no plans to move the code to anything else, the atomic nature of the "add" instruction could be exploited. On the other hand, platforms where operations are inherently atomic are becoming more and more scarce.
(This doesn't help you if you're writing in C++, though. Compilers don't have an option to require num++ to compile to a memory-destination add or xadd without a lock prefix. They could choose to load num into a register and store the increment result with a separate instruction, and will likely do that if you use the result.)
Footnote 1: The lock prefix existed even on original 8086 because I/O devices operate concurrently with the CPU; drivers on a single-core system need lock add to atomically increment a value in device memory if the device can also modify it, or with respect to DMA access.
Back in the day when x86 computers had one CPU, the use of a single instruction ensured that interrupts would not split the read/modify/write, and as long as the memory was not also used as a DMA buffer, it was in fact atomic (and C++ did not mention threads in the standard, so this wasn't addressed).
When it was rare to have a dual processor (e.g. dual-socket Pentium Pro) on a customer desktop, I effectively used this to avoid the LOCK prefix on a single-core machine and improve performance.
Today, it would only help against multiple threads that were all set to the same CPU affinity, so the threads you are worried about would only come into play via time slice expiring and running the other thread on the same CPU (core). That is not realistic.
With modern x86/x64 processors, the single instruction is broken up into several micro ops and furthermore the memory reading and writing is buffered. So different threads running on different CPUs will not only see this as non-atomic but may see inconsistent results concerning what it reads from memory and what it assumes other threads have read to that point in time: you need to add memory fences to restore sane behavior.
No.
https://www.youtube.com/watch?v=31g0YE61PLQ
(That's just a link to the "No" scene from "The Office")
Do you agree that this would be a possible output for the program:
sample output:
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
If so, then the compiler is free to make that the only possible output for the program, in whichever way the compiler wants. ie a main() that just puts out 100s.
This is the "as-if" rule.
And regardless of output, you can think of thread synchronization the same way - if thread A does num++; num--; and thread B reads num repeatedly, then a possible valid interleaving is that thread B never reads between num++ and num--. Since that interleaving is valid, the compiler is free to make that the only possible interleaving. And just remove the incr/decr entirely.
There are some interesting implications here:
while (working())
progress++; // atomic, global
(ie imagine some other thread updates a progress bar UI based on progress)
Can the compiler turn this into:
int local = 0;
while (working())
local++;
progress += local;
probably that is valid. But probably not what the programmer was hoping for :-(
The committee is still working on this stuff. Currently it "works" because compilers don't optimize atomics much. But that is changing.
And even if progress was also volatile, this would still be valid:
int local = 0;
while (working())
local++;
while (local--)
progress++;
:-/
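For reference, a sketch of what the programmer in the progress-counter example presumably intends, assuming a hosted C++11 toolchain: the worker bumps the counter once per step with a relaxed RMW, and an observer thread polls it. The names progress and run_with_observer are illustrative. Whether a future compiler would be allowed to batch the increments is exactly the open question discussed above; this code only shows the intent.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> progress{0};   // hypothetical global progress counter

// The calling thread performs `steps` increments while an observer polls.
// Returns the final value the observer read after the work was finished.
int run_with_observer(int steps) {
    progress.store(0, std::memory_order_relaxed);
    std::atomic<bool> done{false};
    int last_seen = 0;
    bool monotonic = true;
    std::thread observer([&] {
        while (!done.load(std::memory_order_acquire)) {
            int now = progress.load(std::memory_order_relaxed);
            if (now < last_seen) monotonic = false;   // must never go backwards
            last_seen = now;                          // e.g. redraw a progress bar here
        }
        // The acquire load of `done` makes all earlier increments visible.
        last_seen = progress.load(std::memory_order_relaxed);
    });
    for (int i = 0; i < steps; ++i) {
        progress.fetch_add(1, std::memory_order_relaxed);  // one update per step
    }
    done.store(true, std::memory_order_release);
    observer.join();
    assert(monotonic);
    return last_seen;
}
```

Relaxed ordering is enough for the counter itself because the observer only needs some reasonably recent value; the release/acquire pair on done is what guarantees the final read sees every increment.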
The fact that a single compiler's output, on a specific CPU architecture, with optimizations disabled (gcc doesn't even compile ++ to add when optimizing in a quick-and-dirty example), seems to imply that incrementing this way is atomic does not make it standard-compliant: accessing num concurrently from another thread would be undefined behavior. The inference is wrong anyway, because add is not atomic on x86.
Note that atomics (using the lock instruction prefix) are relatively heavy on x86 (see this relevant answer), but still remarkably less than a mutex, which isn't very appropriate in this use-case.
Following results are taken from clang++ 3.8 when compiling with -Os.
Incrementing an int by reference, the "regular" way :
void inc(int& x)
{
++x;
}
This compiles into :
inc(int&):
incl (%rdi)
retq
Incrementing an int passed by reference, the atomic way :
#include <atomic>
void inc(std::atomic<int>& x)
{
++x;
}
This example, which is not much more complex than the regular way, just gets the lock prefix added to the incl instruction - but caution, as previously stated this is not cheap. Just because assembly looks short doesn't mean it's fast.
inc(std::atomic<int>&):
lock incl (%rdi)
retq
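To put rough numbers on "not cheap, but much less than a mutex", one could time both on the machine in question. The following is a single-threaded, uncontended sketch with hypothetical names (IncrementCosts, time_loop, compare_increment_costs); it says nothing about contended behaviour, and the absolute figures depend entirely on the CPU and compiler flags.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>

// Hypothetical result type: totals for sanity checking, plus nanoseconds per increment.
struct IncrementCosts {
    long long atomic_total;
    long long mutex_total;
    double atomic_ns_per_op;
    double mutex_ns_per_op;
};

// Runs `step` `iters` times and returns the elapsed time in nanoseconds.
template <class Step>
double time_loop(long long iters, Step step) {
    auto start = std::chrono::steady_clock::now();
    for (long long i = 0; i < iters; ++i) step();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count();
}

IncrementCosts compare_increment_costs(long long iters) {
    assert(iters > 0);
    std::atomic<long long> relaxed{0};
    long long guarded = 0;
    std::mutex m;

    double atomic_ns = time_loop(iters, [&] {
        relaxed.fetch_add(1, std::memory_order_relaxed);  // atomic RMW per increment
    });
    double mutex_ns = time_loop(iters, [&] {
        std::lock_guard<std::mutex> hold(m);              // lock + unlock per increment
        ++guarded;
    });

    return {relaxed.load(), guarded, atomic_ns / iters, mutex_ns / iters};
}
```

A caller would print atomic_ns_per_op and mutex_ns_per_op and compare them; the totals are returned only so the loops cannot be silently skipped.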
Yes, but...
Atomic is not what you meant to say. You're probably asking the wrong thing.
The increment is certainly atomic. Unless the storage is misaligned (and since you left alignment to the compiler, it is not), it is necessarily aligned within a single cache line. Short of special non-caching streaming instructions, each and every write goes through the cache. Complete cache lines are being atomically read and written, never anything different.
Smaller-than-cacheline data is, of course, also written atomically (since the surrounding cache line is).
Is it thread-safe?
This is a different question, and there are at least two good reasons to answer with a definite "No!".
First, there is the possibility that another core might have a copy of that cache line in L1 (L2 and upwards is usually shared, but L1 is normally per-core!), and concurrently modifies that value. Of course that happens atomically, too, but now you have two "correct" (correctly, atomically, modified) values -- which one is the truly correct one now?
The CPU will sort it out somehow, of course. But the result may not be what you expect.
Second, there is memory ordering, or worded differently happens-before guarantees. The most important thing about atomic instructions is not so much that they are atomic. It's ordering.
You have the possibility of enforcing a guarantee that everything that happens memory-wise is realized in some guaranteed, well-defined order where you have a "happened before" guarantee. This ordering may be as "relaxed" (read as: none at all) or as strict as you need.
For example, you can set a pointer to some block of data (say, the results of some calculation) and then atomically release the "data is ready" flag. Now, whoever acquires this flag will be led into thinking that the pointer is valid. And indeed, it will always be a valid pointer, never anything different. That's because the write to the pointer happened-before the atomic operation.
When your compiler uses only a single instruction for the increment and your machine is single-threaded, your code is safe. ^^
Try compiling the same code on a non-x86 machine, and you'll quickly see very different assembly results.
The reason num++ appears to be atomic is because on x86 machines, incrementing a 32-bit integer is, in fact, atomic (assuming no memory retrieval takes place). But this is neither guaranteed by the c++ standard, nor is it likely to be the case on a machine that doesn't use the x86 instruction set. So this code is not cross-platform safe from race conditions.
You also don't have a strong guarantee that this code is safe from Race Conditions even on an x86 architecture, because x86 doesn't set up loads and stores to memory unless specifically instructed to do so. So if multiple threads tried to update this variable simultaneously, they may end up incrementing cached (outdated) values
The reason, then, that we have std::atomic<int> and so on is so that when you're working with an architecture where the atomicity of basic computations is not guaranteed, you have a mechanism that will force the compiler to generate atomic code.

Memory ordering behavior of std::atomic::load

Am I wrong to assume that the atomic::load should also act as a memory barrier ensuring that all previous non-atomic writes will become visible by other threads?
To illustrate:
volatile bool arm1 = false;
std::atomic_bool arm2 = false;
bool triggered = false;
Thread1:
arm1 = true;
//std::atomic_thread_fence(std::memory_order_seq_cst); // this would do the trick
if (arm2.load())
triggered = true;
Thread2:
arm2.store(true);
if (arm1)
triggered = true;
I expected that after executing both 'triggered' would be true. Please don't suggest to make arm1 atomic, the point is to explore the behavior of atomic::load.
While I have to admit I don't fully understand the formal definitions of the different relaxed semantics of memory order I thought that the sequentially consistent ordering was pretty straightforward in that it guarantees that "a single total order exists in which all threads observe all modifications in the same order." To me this implies that the std::atomic::load with the default memory order of std::memory_order_seq_cst will also act as a memory fence. This is further corroborated by the following statement under "Sequentially-consistent ordering":
Total sequential ordering requires a full memory fence CPU instruction on all multi-core systems.
Yet, my simple example below demonstrates this is not the case with MSVC 2013, gcc 4.9 (x86) and clang 3.5.1 (x86), where the atomic load simply translates to a load instruction.
#include <atomic>
std::atomic_long al;
#ifdef _WIN32
__declspec(noinline)
#else
__attribute__((noinline))
#endif
long load() {
return al.load(std::memory_order_seq_cst);
}
int main(int argc, char* argv[]) {
long r = load();
}
With gcc this looks like:
load():
mov rax, QWORD PTR al[rip] ; <--- plain load here, no fence or xchg
ret
main:
call load()
xor eax, eax
ret
I'll omit the msvc and clang which are essentially identical. Now on gcc for ARM we get what I expected:
load():
dmb sy ; <---- data memory barrier here
movw r3, #:lower16:.LANCHOR0
movt r3, #:upper16:.LANCHOR0
ldr r0, [r3]
dmb sy ; <----- and here
bx lr
main:
push {r3, lr}
bl load()
movs r0, #0
pop {r3, pc}
This is not an academic question, it results in a subtle race condition in our code which called into question my understanding of the behavior of std::atomic.
Sigh, this was too long for a comment:
Isn't the meaning of atomic "to appear to occur instantaneously to the rest of the system"?
I'd say yes and no to that one, depending on how you think of it. For writes with SEQ_CST, yes. But as far as how atomic loads are handled, check out 29.3 of the C++11 standard. Specifically, 29.3.3 is really good reading, and 29.3.4 might be specifically what you're looking for:
For an atomic operation B that reads the value of an atomic object M, if there is a memory_order_seq_cst fence X sequenced before B, then B observes either the last memory_order_seq_cst modification of M preceding X in the total order S or a later modification of M in its modification order.
Basically, SEQ_CST forces a global order just like the standard says, but reads can return an old value without violating the 'atomic' constraint.
To accomplish 'getting the absolute latest value' you'll need to perform an operation that forces the hardware coherency protocol to lock (the lock prefix on x86_64). This is what the atomic compare-and-exchange operations do, if you look at the assembly output.
Am I wrong to assume that the atomic::load should also act as a memory barrier ensuring that all previous non-atomic writes will become visible by other threads?
Yes. atomic::load(SEQ_CST) just enforces that the read cannot load an 'invalid' value, and neither writes nor loads may be reordered by the compiler or the cpu around that statement. It does not mean you'll always get the most up to date value.
I would expect your code to have a data race because again, barriers do not ensure the most up to date value is seen at a given time, they just prevent reordering.
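To make the "ordering, not freshness" point concrete, here is a minimal sketch of the usual publish/consume pattern. The names payload, ready and message_passing_demo are hypothetical and not from the question's code. The acquire load does not make the flag arrive any sooner; the loop is what waits. What the release/acquire pair guarantees is that once the flag is seen as true, the earlier non-atomic write is visible too.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names, for illustration only.
int payload = 0;                  // plain, non-atomic data
std::atomic<bool> ready{false};   // flag that publishes the payload

int message_passing_demo() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;

    std::thread producer([] {
        payload = 42;                                  // non-atomic write
        ready.store(true, std::memory_order_release);  // earlier writes cannot be reordered after this store
    });
    std::thread consumer([&seen] {
        // The acquire load only orders; the loop is what waits for the value.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;  // reads 42: the acquire load observed the release store
    });

    producer.join();
    consumer.join();
    return seen;
}
```

Calling message_passing_demo() from main should return 42 on any conforming implementation. A single load without the loop could legitimately return false and skip the payload entirely.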
It's perfectly valid for Thread1 to not see the write by Thread2 and therefore not set triggered, and for Thread2 to not see the write by Thread1 (again, not setting triggered), because you only write 'atomically' from one thread.
With two threads writing and reading shared values, you'll need a barrier in each thread to maintain consistency. It looks like you knew this already based on your code comments, so I'll just leave it at "the C++ standard is somewhat misleading when it comes to accurately describing the meaning of atomic / multithreaded operations".
Even though you're writing C++, it's still best, in my opinion, to think about what you're doing on the underlying architecture.
Not sure I explained that well, but I'd be happy to go into more detail if you'd like.

Why is sequential consistency on x86/x86_64 implemented as MOV [addr], reg + MFENCE instead of + SFENCE?

Intel x86/x86_64 systems have three types of memory barriers: lfence, sfence and mfence. The question is about how they are used.
For sequential consistency (SC) it is sufficient to use MOV [addr], reg + MFENCE for every store to a memory location that requires SC semantics. However, the mapping can also be done the other way round: MFENCE + MOV reg, [addr] on every load. Apparently the reasoning was that stores to memory are usually less frequent than loads from it, so putting the barrier on the store costs less in total. On the same basis, since sequentially consistent stores are needed anyway, a further optimization was made: [LOCK] XCHG, which is probably cheaper because the "MFENCE inside XCHG" applies only to the cache line used by the XCHG (there is a video in which, at 0:28:20, MFENCE is said to be more expensive than XCHG).
http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
C/C++11 Operation: x86 implementation
Load Seq_Cst: MOV (from memory)
Store Seq_Cst: (LOCK) XCHG // alternative: MOV (into memory), MFENCE
Note: there is an alternative mapping of C/C++11 to x86, which instead of locking (or fencing) the Seq Cst store locks/fences the Seq Cst load:
Load Seq_Cst: LOCK XADD(0) // alternative: MFENCE,MOV (from memory)
Store Seq Cst: MOV (into memory)
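As a small sketch of how the first mapping shows up in source code, the three functions below can be compiled with optimization and inspected in a disassembler. The variable cell and the function names are hypothetical. The comments describe what compilers typically emit on x86-64 according to the table above; the exact instructions depend on the compiler and its version.

```cpp
#include <atomic>
#include <cassert>

std::atomic<long> cell{0};  // hypothetical shared variable

// Typically a plain MOV from memory on x86-64.
long load_seq_cst() {
    return cell.load(std::memory_order_seq_cst);
}

// Typically XCHG, or MOV followed by MFENCE, on x86-64.
void store_seq_cst(long v) {
    cell.store(v, std::memory_order_seq_cst);
}

// Typically a plain MOV into memory on x86-64: no full barrier is needed
// for release semantics on this architecture.
void store_release(long v) {
    cell.store(v, std::memory_order_release);
}
```

Comparing the output of store_seq_cst and store_release is usually the quickest way to see where the compiler has placed the full barrier.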
The difference is that ARM and Power memory barriers interact exclusively with the LLC (Last Level Cache), whereas x86 barriers also interact with the lower-level caches L1/L2.
In x86/x86_64:
lfence on Core1: (CoreX-L1) -> (CoreX-L2) -> L3-> (Core1-L2) -> (Core1-L1)
sfence on Core1: (Core1-L1) -> (Core1-L2) -> L3-> (CoreX-L2) -> (CoreX-L1)
In ARM:
ldr; dmb;: L3-> (Core1-L2) -> (Core1-L1)
dmb; str; dmb;: (Core1-L1) -> (Core1-L2) -> L3
C++11 code compiled by GCC 4.8.2 - GDB in x86_64:
std::atomic<int> a;
int temp = 0;
a.store(temp, std::memory_order_seq_cst);
0x4613e8 <+0x0058> mov 0x38(%rsp),%eax
0x4613ec <+0x005c> mov %eax,0x20(%rsp)
0x4613f0 <+0x0060> mfence
But why is sequential consistency (SC) on x86/x86_64 implemented as MOV [addr], reg + MFENCE and not MOV [addr], reg + SFENCE? Why do we need the full fence MFENCE there instead of SFENCE?
sfence doesn't block StoreLoad reordering. Unless there are any NT stores in flight, it's architecturally a no-op. Stores already wait for older stores to commit before they themselves commit to L1d and become globally visible, because x86 doesn't allow StoreStore reordering. (Except for NT stores / stores to WC memory)
For seq_cst you need a full barrier to flush the store buffer / make sure all older stores are globally visible before any later loads. See https://preshing.com/20120515/memory-reordering-caught-in-the-act/ for an example where failing to use mfence in practice leads to non-sequentially-consistent behaviour, i.e. memory reordering.
As you found, it is possible to map seq_cst to x86 asm with full barriers on every seq_cst load instead of on every seq_cst store / RMW. In that case you wouldn't need any barrier instructions on stores (so they'd have release semantics), but you'd need mfence before every atomic::load(seq_cst).
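The StoreLoad case mentioned above is the classic store-buffer litmus test. The following is a minimal sketch of it with hypothetical names (x, y, r1, r2, both_saw_zero_once, count_forbidden_outcomes). With seq_cst on every operation, the outcome r1 == 0 && r2 == 0 is forbidden by the C++ memory model. If the stores are weakened to release and the loads to acquire, that outcome becomes allowed and can be observed on x86 because of the store buffer.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs one trial of the store-buffer litmus test.
// Returns true if both threads read 0, which seq_cst forbids.
bool both_saw_zero_once() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    std::thread t1([&] {
        x.store(1, std::memory_order_seq_cst);   // store, then ...
        r1 = y.load(std::memory_order_seq_cst);  // ... load of the other variable
    });
    std::thread t2([&] {
        y.store(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_seq_cst);
    });

    t1.join();
    t2.join();
    return r1 == 0 && r2 == 0;
}

// Counts how often the forbidden outcome appears over several trials.
int count_forbidden_outcomes(int trials) {
    int forbidden = 0;
    for (int i = 0; i < trials; ++i) {
        if (both_saw_zero_once()) ++forbidden;
    }
    return forbidden;
}
```

With seq_cst as written, count_forbidden_outcomes must return 0 for any number of trials. Spawning fresh threads per trial keeps the sketch short but makes the race window small, so a weakened variant may need many trials before the reordering shows up.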
You don't need an mfence; sfence does indeed suffice. In fact, you never need lfence in x86 unless you are dealing with a device. But Intel (and I think AMD) has (or at least had) a single implementation shared with mfence and sfence (namely, flushing the store buffer), so there was no performance advantage to using the weaker sfence.
BTW, note that you don't have to flush after every write to a shared variable; you only have to flush between a write and a subsequent read of a different shared variable.

Do atomic CAS-operations on x86_64 and ARM always use std::memory_order_seq_cst?

As Anthony Williams said:
some_atomic.load(std::memory_order_acquire) does just drop through to a simple load instruction, and
some_atomic.store(std::memory_order_release) drops through to a simple store instruction.
It is known that on x86 the load() and store() operations need no extra processor instructions for the memory orders memory_order_consume, memory_order_acquire, memory_order_release and memory_order_acq_rel.
But on ARMv8 we know that there are memory barriers for both load() and store():
http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2
http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-2-of-2
About different architectures of CPUs: http://g.oswego.edu/dl/jmm/cookbook.html
However, for the CAS operation on x86, these two lines with different memory orders produce identical disassembly (MSVS2012 x86_64):
a.compare_exchange_weak(temp, 4, std::memory_order_seq_cst, std::memory_order_seq_cst);
000000013FE71A2D mov ebx,dword ptr [temp]
000000013FE71A31 mov eax,ebx
000000013FE71A33 mov ecx,4
000000013FE71A38 lock cmpxchg dword ptr [temp],ecx
a.compare_exchange_weak(temp, 5, std::memory_order_relaxed, std::memory_order_relaxed);
000000013FE71A4D mov ecx,5
000000013FE71A52 mov eax,ebx
000000013FE71A54 lock cmpxchg dword ptr [temp],ecx
Disassembly code compiled by GCC 4.8.1 x86_64 - GDB:
a.compare_exchange_weak(temp, 4, std::memory_order_seq_cst, std::memory_order_seq_cst);
a.compare_exchange_weak(temp, 5, std::memory_order_relaxed, std::memory_order_relaxed);
0x4613b7 <+0x0027> mov 0x2c(%rsp),%eax
0x4613bb <+0x002b> mov $0x4,%edx
0x4613c0 <+0x0030> lock cmpxchg %edx,0x20(%rsp)
0x4613c6 <+0x0036> mov %eax,0x2c(%rsp)
0x4613ca <+0x003a> lock cmpxchg %edx,0x20(%rsp)
On x86/x86_64 platforms, does every atomic CAS operation, for example atomic_val.compare_exchange_weak(temp, 1, std::memory_order_relaxed, std::memory_order_relaxed);, always get std::memory_order_seq_cst ordering?
And if every CAS operation on x86 always runs with sequential consistency (std::memory_order_seq_cst) regardless of the requested memory order, is the same true on ARMv8?
QUESTION: Should a CAS with std::memory_order_relaxed lock the memory bus on x86 or ARM?
ANSWER: On x86, compare_exchange_weak() with any std::memory_order (even std::memory_order_relaxed) always translates to LOCK CMPXCHG, a locked instruction, so that it is really atomic, and it is as expensive as XCHG - "the cmpxchg is just as expensive as the xchg instruction".
(An addition: XCHG with a memory operand is equivalent to LOCK XCHG, but CMPXCHG is not equivalent to LOCK CMPXCHG; only the latter is really atomic.)
On ARM and PowerPC, compare_exchange_weak() with different std::memory_orders maps to different processor instruction sequences, built on LL/SC.
Processor memory-barriers-instructions for x86(except CAS), ARM and PowerPC: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
You shouldn't worry about what instructions the compiler maps a given C11 construct to, as this doesn't capture everything. Instead you need to develop code with respect to the guarantees of the C11 memory model. As the above comment notes, your compiler or future compilers are free to reorder relaxed memory operations as long as this doesn't violate the C11 memory model. It is also worthwhile running your code through a tool like CDSChecker to see what behaviors are allowed under the memory model.
x86 guarantees that loads following loads are ordered, and stores following stores are ordered. Given that CAS requires both loading and storing, all operations have to be ordered around it.
However, it is worth noting that, in the presence of multiple atomics with memory_order_relaxed, the compiler is allowed to reorder them. It cannot do so with memory_order_seq_cst.
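What relaxed ordering gives up is ordering relative to other memory operations, not the atomicity of the operation itself. A minimal sketch, with the hypothetical function name count_with_relaxed: several threads increment one counter with memory_order_relaxed, and the total is still exact, because each fetch_add remains an indivisible read-modify-write.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Increments a shared counter from several threads using relaxed ordering
// and returns the final value.
long count_with_relaxed(int threads, int per_thread) {
    std::atomic<long> counter{0};
    std::vector<std::thread> pool;

    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                // Atomic increment; no ordering is imposed on surrounding accesses.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : pool) {
        th.join();  // join() synchronizes, so the final load sees every increment
    }
    return counter.load(std::memory_order_relaxed);
}
```

This is only safe because nothing else is published through the counter. As soon as another variable's value is supposed to be visible "because the counter reached N", acquire/release (or stronger) is needed.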
I think the compiler emits lock cmpxchg even for memory_order_relaxed because that's the only way to make sure the compare+exchange itself is actually atomic. Like artless_noise said in comments, other architectures can use a Load Linked / Store Conditional to implement compare_exchange_weak(...).
memory_order_relaxed should still let the compiler hoist stores of other variables out of loops, and otherwise reorder memory access at compile time.
If there was a way to do it on x86 that wasn't also a full memory barrier, a good compiler would use it for memory_order_relaxed.
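Because compare_exchange_weak may fail spuriously on LL/SC architectures, portable code calls it in a retry loop regardless of the memory order. The sketch below shows such a loop with relaxed ordering. The helper name fetch_max_relaxed is hypothetical and not a standard library function. On x86 the loop body would still be expected to compile to lock cmpxchg, as discussed above.

```cpp
#include <atomic>
#include <cassert>

// Atomically raises `target` to at least `candidate` and returns the value
// that was observed before any update. Hypothetical helper for illustration.
int fetch_max_relaxed(std::atomic<int>& target, int candidate) {
    int observed = target.load(std::memory_order_relaxed);
    while (observed < candidate &&
           !target.compare_exchange_weak(observed, candidate,
                                         std::memory_order_relaxed,
                                         std::memory_order_relaxed)) {
        // On failure (real or spurious) `observed` now holds the current
        // value of `target`; the loop condition decides whether to retry.
    }
    return observed;
}
```

The relaxed orders here are a deliberate choice for a standalone value such as a high-water mark. If the maximum guards other data, the success order would need to be at least acq_rel.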