Is x86 CMPXCHG atomic, if so why does it need LOCK?

The Intel documentation says
This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically.
My question is
Can CMPXCHG operate on a memory address? From the document it seems not, but can anyone confirm that it only works with actual VALUES in registers, not memory addresses?
If CMPXCHG isn't atomic and a high-level-language CAS has to be implemented through LOCK CMPXCHG (with the LOCK prefix), what's the purpose of introducing such an instruction at all?
(I am asking from a high-level language perspective. I.e., if a lock-free algorithm has to be translated into LOCK CMPXCHG on the x86 platform, then it's still prefixed with LOCK. That would mean lock-free algorithms are no better than ones using a carefully written synchronized lock / mutex (on x86 at least). It also seems to make the naked CMPXCHG instruction pointless, as I guess the major point of introducing it was to support such lock-free operations.)

It seems like part of what you're really asking is:
Why isn't the lock prefix implicit for cmpxchg with a memory operand, like it is for xchg (since 386)?
The simple answer (that others have given) is simply that Intel designed it this way. But this leads to the question:
Why did Intel do that? Is there a use-case for cmpxchg without lock?
On a single-CPU system, cmpxchg is atomic with respect to other threads, or any other code running on the same CPU core. (But not to "system" observers like a memory-mapped I/O device, or a device doing DMA reads of normal memory, so lock cmpxchg was relevant even on uniprocessor CPU designs).
Context switches can only happen on interrupts, and interrupts happen before or after an instruction, not in the middle. Any code running on the same CPU will see the cmpxchg as either fully executed or not at all.
For example, the Linux kernel is normally compiled with SMP support, so it uses lock cmpxchg for atomic CAS. But when booted on a single-processor system, it will patch the lock prefix to a nop everywhere that code was inlined, since nop cmpxchg runs much faster than lock cmpxchg. For more info, see this LWN article about Linux's "SMP alternatives" system. It can even patch back to lock prefixes before hot-plugging a second CPU.
Read more about atomicity of single instructions on uniprocessor systems in this answer, and in @supercat's answer + comments on Can num++ be atomic for 'int num'?. See my answer there for lots of details about how atomicity really works / is implemented for read-modify-write instructions like lock cmpxchg.
(This same reasoning also applies to cmpxchg8b / cmpxchg16b, and xadd, which are usually only used for synchronization / atomic ops, not to make single-threaded code run faster. Of course memory-destination instructions like add [mem], reg are useful outside of the lock add [mem], reg case.)
Related:
Interrupting instruction in the middle of execution: only a few instructions like rep movsb and vpgatherdd are interruptible part way through, and they don't support lock. They also have a well-defined way to update architectural state to record their partial progress, not like a few ISAs where microarchitectural progress can get saved in hidden locations and resumed after an interrupt.
Interrupting an assembly instruction while it is operating: quotes Intel's manuals about that guarantee
When an interrupt occurs, what happens to instructions in the pipeline?

You are mixing up high-level locks with the low-level CPU feature that happened to be named LOCK.
The high-level locks that lock-free algorithms try to avoid can guard arbitrary code fragments whose execution may take arbitrary time; thus, these locks have to put threads into a wait state until the lock is available, which is a costly operation, e.g. it implies maintaining a queue of waiting threads.
This is an entirely different thing than the CPU's LOCK prefix feature, which guards a single instruction only and thus might hold off other threads for the duration of that single instruction only. Since this is implemented by the CPU itself, it doesn't require additional software effort.
Therefore the challenge of developing lock-free algorithms is not removing synchronization entirely; it boils down to reducing the critical section of the code to a single atomic operation, which is provided by the CPU itself.
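As a concrete sketch (my own illustration, not from the answer above): a lock-free increment whose entire critical section is one CAS, which an x86 compiler turns into a single lock cmpxchg. (A real counter would just use fetch_add; the CAS-retry shape is shown because it generalizes to arbitrary read-modify-write updates.)

#include <atomic>

std::atomic<int> counter{0};

void increment_lock_free() {
    int old = counter.load(std::memory_order_relaxed);
    // The whole "critical section" is this one CAS; no thread ever sleeps,
    // and no thread holds other threads off for longer than one instruction.
    while (!counter.compare_exchange_weak(old, old + 1,
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed)) {
        // On failure, `old` was reloaded with the current value; just retry.
    }
}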

The LOCK prefix locks memory access for the current instruction, so that other memory accesses in flight (from instructions executing at the same time, on this or another core) cannot touch that memory in the middle of the operation. With the LOCK prefix, the instruction's read-modify-write of memory cannot be interleaved with memory accesses from other instructions executing concurrently.
The INTEL manual says:
The LOCK prefix can be prepended only to the following instructions and only to those forms of the instructions where the destination operand is a memory operand: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCH8B, CMPXCHG16B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG. If the LOCK prefix is used with one of these instructions and the source operand is a memory operand, an undefined opcode exception (#UD) may be generated.

Related

Does the memory fence involve the kernel

After asking this question, I've understood that atomic instructions, such as test-and-set, don't involve the kernel. Only if a process needs to be put to sleep (to wait to acquire the lock) or woken (because it couldn't acquire the lock earlier but now can) does the kernel have to get involved, to perform the scheduling operations.
If so, does it mean that the memory fence, such as std::atomic_thread_fence in c++11, won't also involve the kernel?
std::atomic doesn't involve the kernel¹
On almost all normal CPUs (the kind we program for in real life), memory barrier instructions are unprivileged and get used directly by the compiler. The same way compilers know how to emit instructions like x86 lock add [rdi], eax for fetch_add (or lock xadd if you use the return value). Or on other ISAs, literally the same barrier instructions they use before/after loads, stores, and RMWs to give the required ordering. https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/
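For instance (a minimal sketch of my own; the function names are illustrative), both of the following compile to ordinary unprivileged instructions on x86-64 with GCC/Clang, with no system call anywhere:

#include <atomic>

std::atomic<int> counter{0};

int bump() {
    // Typically a single lock xadd here (lock add if the return value
    // were unused); still just user-space code, no kernel involved.
    return counter.fetch_add(1, std::memory_order_seq_cst);
}

void full_fence() {
    // Typically mfence on x86 (or a dummy locked RMW); again one
    // unprivileged instruction.
    std::atomic_thread_fence(std::memory_order_seq_cst);
}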
On some arbitrary hypothetical hardware and/or compiler, anything is of course possible, even if it would be catastrophically bad for performance.
In asm, a barrier just makes this core wait until some previous (program-order) operations are visible to other cores. It's a purely local operation. (At least, this is how real-world CPUs are designed, so that sequential consistency is recoverable with only local barriers to control local ordering of load and/or store operations. All cores share a coherent view of cache, maintained via a protocol like MESI. Non-coherent shared-memory systems exist, but implementations don't run C++ std::thread across them, and they typically don't run a single-system-image kernel.)
Footnote 1: (Even non-lock-free atomics usually use light-weight locking).
Also, ARM before ARMv7 apparently didn't have proper memory barrier instructions. On ARMv6, GCC uses mcr p15, 0, r0, c7, c10, 5 as a barrier.
Before that (g++ -march=armv5 and earlier), GCC doesn't know what to do and calls __sync_synchronize (a libatomic GCC helper function) which hopefully is implemented somehow for whatever machine the code is actually running on. This may involve a system call on a hypothetical ARMv5 multi-core system, but more likely the binary will be running on an ARMv7 or v8 system where the library function can run a dmb ish. Or if it's a single-core system then it could be a no-op, I think. (C++ memory ordering cares about other C++ threads, not about memory order as seen by possible hardware devices / DMA. Normally implementations assume a multi-core system, but this library function might be a case where a single-core-only implementation could be used.)
On x86 for example, std::atomic_thread_fence(std::memory_order_seq_cst) compiles to mfence. Weaker barriers like std::atomic_thread_fence(std::memory_order_release) only have to block compile-time reordering; x86's runtime hardware memory model is already acq/rel (seq-cst + a store buffer). So there aren't any asm instructions corresponding to the barrier. (One possible implementation for a C++ library would be GNU C asm("" ::: "memory");, but GCC/clang do have barrier builtins.)
std::atomic_signal_fence only ever has to block compile-time reordering, even on weakly-ordered ISAs, because all real-world ISAs guarantee that execution within a single thread sees its own operations as happening in program order. (Hardware implements this by having loads snoop the store buffer of the current core.) Even VLIW, IA-64 EPIC, or other explicit-parallelism ISA mechanisms (like Mill with its delayed-visibility loads) still make it possible for the compiler to generate code that respects the C++ ordering guarantees around the barrier if an async signal (or an interrupt, for kernel code) arrives after any instruction.
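A case where atomic_signal_fence is the right tool (a sketch of my own, not from the original answer; names are illustrative): publishing data to a signal handler that will run in the same thread only needs compile-time ordering.

#include <atomic>
#include <csignal>

static int payload;                              // plain, non-atomic data
static volatile std::sig_atomic_t ready = 0;     // flag the handler checks

void prepare_for_handler(int v) {
    payload = v;
    // Only a compiler barrier is needed: the handler runs on this same
    // thread, which always sees its own operations in program order.
    std::atomic_signal_fence(std::memory_order_release);
    ready = 1;
}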
You can look at code-gen yourself on the Godbolt compiler explorer:
#include <atomic>
void barrier_sc(void) {
    std::atomic_thread_fence(std::memory_order_seq_cst);
}
x86: mfence.
POWER: sync.
AArch64: dmb ish (full barrier on "inner shareable" coherence domain).
ARM with gcc -mcpu=cortex-a15 (or -march=armv7): dmb ish
RISC-V: fence iorw,iorw
void barrier_acq_rel(void) {
    std::atomic_thread_fence(std::memory_order_acq_rel);
}
x86: nothing
POWER: lwsync (light-weight sync).
AArch64: still dmb ish
ARM: still dmb ish
RISC-V: still fence iorw,iorw
void barrier_acq(void) {
    std::atomic_thread_fence(std::memory_order_acquire);
}
x86: nothing
POWER: lwsync (light-weight sync).
AArch64: dmb ishld (load barrier, doesn't have to drain the store buffer)
ARM: still dmb ish, even with -mcpu=cortex-a53 (an ARMv8) :/
RISC-V: still fence iorw,iorw
In both this question and the referenced one you are mixing:
synchronization primitives at the assembly level, like cmpxchg and fences
process/thread synchronization, like futexes
What does it mean to "involve the kernel"? I guess you mean (p)thread synchronization: the thread is put to sleep and will be woken as soon as the given condition is met by another process/thread.
However, test-and-set primitives like cmpxchg and memory fences are functionality provided by the processor's instruction set. The kernel's synchronization primitives are ultimately built on top of them to provide system- and process-level synchronization, using shared state in kernel space hidden behind system calls.
You can look at the futex source to get evidence of it.
But no, memory fences don't involve the kernel: they are translated into simple assembly instructions, just like cmpxchg.
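To make that layering concrete, here is a heavily simplified, Linux-specific sketch of my own (a real futex-based mutex tracks waiters to skip the wake syscall, handles spurious wakeups more carefully, and so on): the fast path is a pure user-space CAS, and the kernel is entered only to sleep or wake.

#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Thin wrapper over the raw futex system call (Linux only).
static long futex(void* addr, int op, int val) {
    return syscall(SYS_futex, addr, op, val, nullptr, nullptr, 0);
}

struct SimpleFutexLock {
    std::atomic<int> state{0};               // 0 = unlocked, 1 = locked

    void lock() {
        int expected = 0;
        // Fast path: a user-space CAS only (lock cmpxchg on x86), no kernel.
        while (!state.compare_exchange_weak(expected, 1,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed)) {
            // Slow path: sleep in the kernel, but only if the lock is still
            // held (value == 1) by the time the kernel checks.
            futex(&state, FUTEX_WAIT, 1);
            expected = 0;
        }
    }

    void unlock() {
        state.store(0, std::memory_order_release);
        // Always waking is wasteful; real implementations track waiters so
        // the uncontended path skips this syscall entirely.
        futex(&state, FUTEX_WAKE, 1);
    }
};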

How "lock add" is implemented on x86 processors

I recently benchmarked std::atomic::fetch_add vs std::atomic::compare_exchange_strong on a 32 core Skylake Intel processor. Unsurprisingly (from the myths I've heard about fetch_add), fetch_add is almost an order of magnitude more scalable than compare_exchange_strong. Looking at the disassembly of the program std::atomic::fetch_add is implemented with a lock add and std::atomic::compare_exchange_strong is implemented with lock cmpxchg (https://godbolt.org/z/qfo4an).
What makes lock add so much faster on an Intel multi-core processor? From my understanding, the slowness in both instructions comes from contention on the cache line, and to execute both instructions with sequential consistency, the executing CPU has to pull the line into its own core in exclusive or modified mode (from MESI). How then does the processor optimize fetch_add internally?
This was a simplified version of the benchmarking code: there was no load+CAS loop for the compare_exchange_strong benchmark, just a compare_exchange_strong on the atomic with an input variable that kept getting varied by thread and iteration. So it was just a comparison of instruction throughput under contention from multiple CPUs.
lock add and lock cmpxchg both work essentially the same way, by holding onto that cache line in Modified state for the duration of the microcoded instruction. (Can num++ be atomic for 'int num'?). According to Agner Fog's instruction tables, lock cmpxchg and lock add are very similar numbers of uops from microcode. (Although lock add is slightly simpler). Agner's throughput numbers are for the uncontended case, where the var stays hot in L1d cache of one core. And cache misses can cause uop replays, but I don't see any reason to expect a significant difference.
You claim you aren't doing load+CAS or using a retry loop. But is it possible you're only counting successful CAS or something? On x86, every CAS (including failures) has almost identical cost to lock add. (With all your threads hammering on the same atomic variable, you'll get lots of CAS failures from using a stale value for expected. This is not the usual use-case for CAS retry loops).
Or does your CAS version actually do a pure load from the atomic variable to get an expected value? That might be leading to memory-order mis-speculation.
You don't have complete code in the question, so I have to guess, and couldn't try it on my desktop. You don't even have any perf-counter results or anything like that; there are lots of perf events for off-core memory access, and events like mem_inst_retired.lock_loads that could record the number of locked instructions executed.
With lock add, every time a core gets ownership of the cache line, it succeeds at doing an increment. Cores are only waiting for HW arbitration of access to the line, never for another core to get the line and then fail to increment because it had a stale value.
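To make the contrast concrete, here is a sketch of the two shapes being compared (my reconstruction based on the description above, not the asker's actual benchmark; names are illustrative):

#include <atomic>

std::atomic<long> shared_counter{0};

// Every time this core wins ownership of the line, the increment succeeds.
void add_side(long x) {
    shared_counter.fetch_add(x, std::memory_order_seq_cst);   // lock xadd / lock add
}

// A raw CAS with a stale `expected` still costs a locked RMW (lock cmpxchg)
// and a round trip for the cache line, but stores nothing when it fails.
bool cas_side(long expected, long desired) {
    return shared_counter.compare_exchange_strong(expected, desired,
                                                  std::memory_order_seq_cst);
}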
It's plausible that HW arbitration could treat lock add and lock cmpxchg differently, e.g. perhaps letting a core hang onto the line for long enough to do a couple lock add instructions.
Is that what you mean?
Or maybe you have some major failure in microbenchmark methodology, like maybe not doing a warm-up loop to get CPU frequency up from idle before starting your timing? Or maybe some threads happen to finish early and let the other threads run with less contention?
to execute both instructions with sequential consistency, the executing CPU has to pull the line into its own core in exclusive or modified mode (from MESI).
No, to execute either instruction with any consistent, defined semantic that guarantees that concurrent executions on multiple CPU do not lose increments, you would need that. Even if you were willing to drop "sequential consistency" (on these instructions) or even drop the usual acquire and release guarantees of reads and writes.
Any locked instruction effectively enforces mutual exclusion on the part of memory sufficient to guarantee atomicity. (Like a regular mutex but at the memory level.) Because no other core can access that memory range for the duration of the operation, the atomicity is trivially guaranteed.
What makes lock add so much faster on an Intel multi-core processor?
I would expect any tiny difference in timing to be critical in these cases, and doing the load plus compare (or compare-load plus compare-load ...) might change the timing enough to lose the chance, much like how code using mutexes can have widely different efficiency when there is heavy contention and a small change in access pattern changes the way the mutex gets granted.

Why does a std::atomic store with sequential consistency use XCHG?

Why is std::atomic's store:
std::atomic<int> my_atomic;
my_atomic.store(1, std::memory_order_seq_cst);
doing an xchg when a store with sequential consistency is requested?
Shouldn't, technically, a normal store with a read/write memory barrier be enough? Equivalent to:
_ReadWriteBarrier(); // Or `asm volatile("" ::: "memory");` for gcc/clang
my_atomic.store(1, std::memory_order_acquire);
I'm explicitly talking about x86 & x86_64. Where a store has an implicit acquire fence.
mov-store + mfence and xchg are both valid ways to implement a sequential-consistency store on x86. The implicit lock prefix on an xchg with memory makes it a full memory barrier, like all atomic RMW operations on x86.
(x86's memory-ordering rules essentially make that full-barrier effect the only option for any atomic RMW: it's both a load and a store at the same time, stuck together in the global order. Atomicity requires that the load and store aren't separated by just queuing the store into the store buffer so it has to be drained, and load-load ordering of the load side requires that it not reorder.)
Plain mov is not sufficient; it only has release semantics, not sequential-release. (Unlike AArch64's stlr instruction, which does do a sequential-release store that can't reorder with later ldar sequential-acquire loads. This choice is obviously motivated by C++11 having seq_cst as the default memory ordering. But AArch64's normal store is much weaker; relaxed not release.)
See Jeff Preshing's article on acquire / release semantics, and note that regular release stores (like mov or any non-locked x86 memory-destination instruction other than xchg) allows reordering with later operations, including acquire loads (like mov or any x86 memory-source operand). e.g. If the release-store is releasing a lock, it's ok for later stuff to appear to happen inside the critical section.
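As a sketch of that difference (my own example; the exact instruction choice is the compiler's, as discussed below):

#include <atomic>

std::atomic<int> flag{0};

void store_seq_cst() {
    // Typically xchg (implicitly locked, a full barrier), or mov + mfence.
    flag.store(1, std::memory_order_seq_cst);
}

void store_release() {
    // Plain mov: release is free on x86, but this store can still reorder
    // with *later* loads (StoreLoad), so it is not seq_cst by itself.
    flag.store(1, std::memory_order_release);
}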
There are performance differences between mfence and xchg on different CPUs, and maybe in the hot vs. cold cache and contended vs. uncontended cases. And/or for throughput of many operations back to back in the same thread vs. for one on its own, and for allowing surrounding code to overlap execution with the atomic operation.
See https://shipilev.net/blog/2014/on-the-fence-with-dependencies for actual benchmarks of mfence vs. lock addl $0, -8(%rsp) vs. (%rsp) as a full barrier (when you don't already have a store to do).
On Intel Skylake hardware, mfence blocks out-of-order execution of independent ALU instructions, but xchg doesn't. (See my test asm + results in the bottom of this SO answer). Intel's manuals don't require it to be that strong; only lfence is documented to do that. But as an implementation detail, it's very expensive for out-of-order execution of surrounding code on Skylake.
I haven't tested other CPUs, and this may be a result of a microcode fix for erratum SKL079, SKL079 MOVNTDQA From WC Memory May Pass Earlier MFENCE Instructions. The existence of the erratum basically proves that SKL used to be able to execute instructions after MFENCE. I wouldn't be surprised if they fixed it by making MFENCE stronger in microcode, kind of a blunt instrument approach that significantly increases the impact on surrounding code.
I've only tested the single-threaded case where the cache line is hot in L1d cache. (Not when it's cold in memory, or when it's in Modified state on another core.) xchg has to load the previous value, creating a "false" dependency on the old value that was in memory. But mfence forces the CPU to wait until previous stores commit to L1d, which also requires the cache line to arrive (and be in M state). So they're probably about equal in that respect, but Intel's mfence forces everything to wait, not just loads.
AMD's optimization manual recommends xchg for atomic seq-cst stores. I thought Intel recommended mov + mfence, which older gcc uses, but Intel's compiler also uses xchg here.
When I tested, I got better throughput on Skylake for xchg than for mov+mfence in a single-threaded loop on the same location repeatedly. See Agner Fog's microarch guide and instruction tables for some details, but he doesn't spend much time on locked operations.
See gcc/clang/ICC/MSVC output on the Godbolt compiler explorer for a C++11 seq-cst my_atomic = 4; gcc uses mov + mfence when SSE2 is available. (use -m32 -mno-sse2 to get gcc to use xchg too). The other 3 compilers all prefer xchg with default tuning, or for znver1 (Ryzen) or skylake.
The Linux kernel uses xchg for __smp_store_mb().
Update: recent GCC (like GCC10) changed to using xchg for seq-cst stores like other compilers do, even when SSE2 for mfence is available.
Another interesting question is how to compile atomic_thread_fence(mo_seq_cst);. The obvious option is mfence, but lock or dword [rsp], 0 is another valid option (and used by gcc -m32 when MFENCE isn't available). The bottom of the stack is usually already hot in cache in M state. The downside is introducing latency if a local was stored there. (If it's just a return address, return-address prediction is usually very good so delaying ret's ability to read it is not much of a problem.) So lock or dword [rsp-4], 0 could be worth considering in some cases. (gcc did consider it, but reverted it because it makes valgrind unhappy. This was before it was known that it might be better than mfence even when mfence was available.)
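For reference, the dummy-locked-RMW alternative can be written as a GNU C inline-asm sketch like this (my own illustration, x86-64 only; not how any particular compiler spells it):

// Full memory barrier via a dummy locked RMW on the stack instead of mfence.
// OR with 0 never modifies the data, so it's harmless whatever happens to be
// stored at that address; the cache line is almost always hot in M state.
static inline void full_barrier_lock_or() {
    asm volatile("lock orl $0, -4(%%rsp)" ::: "memory", "cc");
}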
All compilers currently use mfence for a stand-alone barrier when it's available. Those are rare in C++11 code, but more research is needed on what's actually most efficient for real multi-threaded code that has real work going on inside the threads that are communicating locklessly.
But multiple sources recommend using a lock add to the stack as a barrier instead of mfence, so the Linux kernel recently switched to using it for the smp_mb() implementation on x86, even when SSE2 is available.
See https://groups.google.com/d/msg/fa.linux.kernel/hNOoIZc6I9E/pVO3hB5ABAAJ for some discussion, including a mention of some errata for HSW/BDW about movntdqa loads from WC memory passing earlier locked instructions. (Opposite of Skylake, where it was mfence instead of locked instructions that were a problem. But unlike SKL, there's no fix in microcode. This may be why Linux still uses mfence for its mb() for drivers, in case anything ever uses NT loads to copy back from video RAM or something but can't let the reads happen until after an earlier store is visible.)
In Linux 4.14, smp_mb() uses mb(). That uses mfence if available, otherwise lock addl $0, 0(%esp).
__smp_store_mb (store + memory barrier) uses xchg (and that doesn't change in later kernels).
In Linux 4.15, smp_mb() uses lock; addl $0,-4(%esp) or %rsp, instead of using mb(). (The kernel doesn't use a red-zone even in 64-bit, so the -4 may help avoid extra latency for local vars.)
mb() is used by drivers to order access to MMIO regions, but smp_mb() turns into a no-op when compiled for a uniprocessor system. Changing mb() is riskier because it's harder to test (affects drivers), and CPUs have errata related to lock vs. mfence. But anyway, mb() uses mfence if available, else lock addl $0, -4(%esp). The only change is the -4.
In Linux 4.16, no change except removing the #if defined(CONFIG_X86_PPRO_FENCE) which defined stuff for a more weakly-ordered memory model than the x86-TSO model that modern hardware implements.
x86 & x86_64. Where a store has an implicit acquire fence
You mean release, I hope. my_atomic.store(1, std::memory_order_acquire); won't compile, because write-only atomic operations can't be acquire operations. See also Jeff Preshing's article on acquire/release semantics.
Or asm volatile("" ::: "memory");
No, that's a compiler barrier only; it prevents all compile-time reordering across it, but doesn't prevent runtime StoreLoad reordering, i.e. the store being buffered until later, and not appearing in the global order until after a later load. (StoreLoad is the only kind of runtime reordering x86 allows.)
Anyway, another way to express what you want here is:
my_atomic.store(1, std::memory_order_release); // mov
// with no operations in between, there's nothing for the release-store to be delayed past
std::atomic_thread_fence(std::memory_order_seq_cst); // mfence
Using a release fence would not be strong enough (it and the release-store could both be delayed past a later load, which is the same thing as saying that release fences don't keep later loads from happening early). A release-acquire fence would do the trick, though, keeping later loads from happening early and not itself being able to reorder with the release store.
Related: Jeff Preshing's article on fences being different from release operations.
But note that seq-cst is special according to C++11 rules: only seq-cst operations are guaranteed to have a single global / total order which all threads agree on seeing. So emulating them with weaker order + fences might not be exactly equivalent in general on the C++ abstract machine, even if it is on x86. (On x86, all stores have a single total order which all cores agree on. See also Globally Invisible load instructions: loads can take their data from the store buffer, so we can't really say that there's a total order for loads + stores.)

When should I use _mm_sfence, _mm_lfence, and _mm_mfence

I read the "Intel Optimization guide Guide For Intel Architecture".
However, I still have no idea about when should I use
_mm_sfence()
_mm_lfence()
_mm_mfence()
Could anyone explain when these should be used when writing multi-threaded code?
If you're using NT stores, you might want _mm_sfence or maybe even _mm_mfence. The use-cases for _mm_lfence are much more obscure.
If not, just use C++11 std::atomic and let the compiler worry about the asm details of controlling memory ordering.
x86 has a strongly-ordered memory model, but C++ has a very weak memory model (same for C). For acquire/release semantics, you only need to prevent compile-time reordering. See Jeff Preshing's Memory Ordering At Compile Time article.
_mm_lfence and _mm_sfence do have the necessary compiler-barrier effect, but they will also cause the compiler to emit a useless lfence or sfence asm instruction that makes your code run slower.
There are better options for controlling compile-time reordering when you aren't doing any of the obscure stuff that would make you want sfence.
For example, GNU C/C++ asm("" ::: "memory") is a compiler barrier (all values have to be in memory matching the abstract machine because of the "memory" clobber), but no asm instructions are emitted.
If you're using C++11 std::atomic, you can simply do shared_var.store(tmp, std::memory_order_release). That's guaranteed to become globally visible after any earlier C assignments, even to non-atomic variables.
_mm_mfence is potentially useful if you're rolling your own version of C11 / C++11 std::atomic, because an actual mfence instruction is one way to get sequential consistency, i.e. to stop later loads from reading a value until after preceding stores become globally visible. See Jeff Preshing's Memory Reordering Caught in the Act.
But note that mfence seems to be slower on current hardware than using a locked atomic-RMW operation. e.g. xchg [mem], eax is also a full barrier, but runs faster, and does a store. On Skylake, the way mfence is implemented prevents out-of-order execution of even non-memory instructions following it. See the bottom of this answer.
In C++ without inline asm, though, your options for memory barriers are more limited (How many memory barriers instructions does an x86 CPU have?). mfence isn't terrible, and it is what gcc and clang currently use to do sequential-consistency stores.
Seriously, just use C++11 std::atomic or C11 stdatomic if possible; it's easier to use and you get quite good code-gen for a lot of things. Or in the Linux kernel, there are already wrapper functions around inline asm for the necessary barriers. Sometimes that's just a compiler barrier, sometimes it's also an asm instruction to get stronger run-time ordering than the default (e.g. for a full barrier).
No barriers will make your stores appear to other threads any faster. All they can do is delay later operations in the current thread until earlier things happen. The CPU already tries to commit pending non-speculative stores to L1d cache as quickly as possible.
_mm_sfence is by far the most likely barrier to actually use manually in C++
The main use-case for _mm_sfence() is after some _mm_stream stores, before setting a flag that other threads will check.
See Enhanced REP MOVSB for memcpy for more about NT stores vs. regular stores, and x86 memory bandwidth. For writing very large buffers (larger than L3 cache size) that definitely won't be re-read any time soon, it can be a good idea to use NT stores.
NT stores are weakly-ordered, unlike normal stores, so you need sfence if you care about publishing the data to another thread. If not (you'll eventually read them from this thread), then you don't. Or if you make a system call before telling another thread the data is ready, that's also serializing.
sfence (or some other barrier) is necessary to give you release/acquire synchronization when using NT stores. C++11 std::atomic implementations leave it up to you to fence your NT stores, so that atomic release-stores can be efficient.
#include <atomic>
#include <immintrin.h>
struct bigbuf {
    int buf[100000];
    std::atomic<unsigned> buf_ready;
};

void producer(bigbuf *p) {
    __m128i *buf = (__m128i*) (p->buf);
    for(...) {
        ...
        _mm_stream_si128(buf, vec1);
        _mm_stream_si128(buf+1, vec2);
        _mm_stream_si128(buf+2, vec3);
        ...
    }
    _mm_sfence();    // All weakly-ordered memory shenanigans stay above this line
    // So we can safely use normal std::atomic release/acquire sync for buf
    p->buf_ready.store(1, std::memory_order_release);
}
Then a consumer can safely do if(p->buf_ready.load(std::memory_order_acquire)) { foo = p->buf[0]; ... } without any data-race Undefined Behaviour. The reader side does not need _mm_lfence; the weakly-ordered nature of NT stores is confined entirely to the core doing the writing. Once it becomes globally visible, it's fully coherent and ordered according to the normal rules.
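Spelled out, that consumer side might look like the following sketch (it assumes the bigbuf definition from the producer example above; the sentinel return value is just illustrative):

// Assumes the bigbuf definition from the producer example above.
int consume(bigbuf *p) {
    // The acquire load pairs with the release store in producer(); the
    // sfence before that store already ordered the NT stores, so reading
    // the buffer here is safe, with no _mm_lfence needed.
    if (p->buf_ready.load(std::memory_order_acquire)) {
        return p->buf[0];
    }
    return -1;    // not ready yet
}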
Other use-cases include ordering clflushopt to control the order of data being stored to memory-mapped non-volatile storage (e.g. an NVDIMM using Optane memory, or a DIMM with battery-backed DRAM).
_mm_lfence is almost never useful as an actual load fence. Loads can only be weakly ordered when loading from WC (Write-Combining) memory regions, like video ram. Even movntdqa (_mm_stream_load_si128) is still strongly ordered on normal (WB = write-back) memory, and doesn't do anything to reduce cache pollution. (prefetchnta might, but it's hard to tune and can make things worse.)
TL:DR: if you aren't writing graphics drivers or something else that maps video RAM directly, you don't need _mm_lfence to order your loads.
lfence does have the interesting microarchitectural effect of preventing execution of later instructions until it retires. e.g. to stop _rdtsc() from reading the cycle-counter while earlier work is still pending in a microbenchmark. (Applies always on Intel CPUs, but on AMD only with an MSR setting: Is LFENCE serializing on AMD processors?. Otherwise lfence runs 4 per clock on Bulldozer family, so clearly not serializing.)
Since you're using intrinsics from C/C++, the compiler is generating code for you. You don't have direct control over the asm, but you might possibly use _mm_lfence for things like Spectre mitigation if you can get the compiler to put it in the right place in the asm output: right after a conditional branch, before a double array access (like foo[bar[i]]). If you're using kernel patches for Spectre, I think the kernel will defend your process from other processes, so you'd only have to worry about this in a program that uses a JIT sandbox and is worried about being attacked from within its own sandbox.
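A sketch of that placement (illustrative only; whether the compiler actually keeps the fence exactly where you want it is not guaranteed):

#include <immintrin.h>
#include <cstddef>

int load_checked(const int *foo, const int *bar, std::size_t i, std::size_t n) {
    if (i < n) {
        _mm_lfence();          // block speculation past the bounds check
        return foo[bar[i]];    // the double array access mentioned above
    }
    return 0;
}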
Here is my understanding, hopefully accurate and simple enough to make sense:
The IA-64 (Itanium) architecture allows memory reads and writes to be executed in any order, so the order of memory changes from the point of view of another processor is not predictable unless you use fences to enforce that writes complete in a reasonable order.
From here on, I am talking about x86; x86 is strongly ordered.
On x86, Intel does not guarantee that a store done on another processor will always be immediately visible on this processor. It is possible that this processor speculatively executed the load (read) just early enough to miss the other processor's store (write). It only guarantees the order that writes become visible to other processors is in program order. It does not guarantee that other processors will immediately see any update, no matter what you do.
Locked read/modify/write instructions are fully sequentially consistent. Because of this, in general you already handle missing the other processor's memory operations because a locked xchg or cmpxchg will sync it all up, you will acquire the relevant cache line for ownership immediately and will update it atomically. If another CPU is racing with your locked operation, either you will win the race and the other CPU will miss the cache and get it back after your locked operation, or they will win the race, and you will miss the cache and get the updated value from them.
lfence stalls instruction issue until all instructions before the lfence are completed. mfence specifically waits for all preceding memory reads to be brought fully into the destination register, and waits for all preceding writes to become globally visible, but does not stall all further instructions as lfence would. sfence does the same for only stores, flushes write combiner, and ensures that all stores preceding the sfence are globally visible before allowing any stores following the sfence to begin execution.
Fences of any kind are rarely needed on x86, they are not necessary unless you are using write-combining memory or non-temporal instructions, something you rarely do if you are not a kernel mode (driver) developer. Normally, x86 guarantees that all stores are visible in program order, but it does not make that guarantee for WC (write combining) memory or for "non-temporal" instructions that do explicit weakly ordered stores, such as movnti.
So, to summarize, stores are always visible in program order unless you have used special weakly ordered stores or are accessing WC memory type. Algorithms using locked instructions like xchg, or xadd, or cmpxchg, etc, will work without fences because locked instructions are sequentially consistent.
The intrinsic calls you mention all simply insert an sfence, lfence or mfence instruction when they are called. So the question then becomes "What are the purposes of those fence instructions"?
The short answer is that lfence is completely useless and sfence almost completely useless for memory-ordering purposes in user-mode x86 programs. On the other hand, mfence serves as a full memory barrier, so you might use it in places where you need a barrier if there isn't already some nearby lock-prefixed instruction providing what you need.
The longer-but-still short answer is...
lfence
lfence is documented to order loads prior to the lfence with respect to loads after, but this guarantee is already provided for normal loads without any fence at all: that is, Intel already guarantees that "loads aren't reordered with other loads". As a practical matter, this leaves the purpose of lfence in user-mode code as an out-of-order execution barrier, useful perhaps for carefully timing certain operations.
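That timing use looks something like this sketch (using the usual GCC/Clang intrinsics; the function name is illustrative):

#include <x86intrin.h>     // __rdtsc, _mm_lfence (GCC/Clang)

unsigned long long timestamp_after_work() {
    _mm_lfence();                      // wait until earlier instructions complete
    unsigned long long t = __rdtsc();  // then read the time-stamp counter
    _mm_lfence();                      // keep later work from starting early
    return t;
}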
sfence
sfence is documented to order stores before and after in the same way that lfence does for loads, but just like loads the store order is already guaranteed in most cases by Intel. The primary interesting case where it doesn't is the so-called non-temporal stores such as movntdq, movnti, maskmovq and a few other instructions. These instructions don't play by the normal memory ordering rules, so you can put an sfence between these stores and any other stores where you want to enforce the relative order. mfence works for this purpose too, but sfence is faster.
mfence
Unlike the other two, mfence actually does something: it serves as a full memory barrier, ensuring that all of the previous loads and stores will have completed¹ before any of the subsequent loads or stores begin execution. This answer is too short to explain the concept of a memory barrier fully, but an example would be Dekker's algorithm, where each thread wanting to enter a critical section stores to a location and then checks to see if the other thread has stored something to its location. For example, on thread 1:
mov DWORD [thread_1_wants_to_enter], 1 # store our flag
mov eax, [thread_2_wants_to_enter] # check the other thread's flag
test eax, eax
jnz retry
; critical section
Here, on x86, you need a memory barrier in between the store (the first mov), and the load (the second mov), otherwise each thread could see zero when they read the other's flag because the x86 memory model allows loads to be re-ordered with earlier stores. So you could insert an mfence barrier as follows to restore sequential consistency and the correct behavior of the algorithm:
mov DWORD [thread_1_wants_to_enter], 1 # store our flag
mfence
mov eax, [thread_2_wants_to_enter] # check the other thread's flag
test eax, eax
jnz retry
; critical section
In practice, you don't see mfence as much as you might expect, because x86 lock-prefixed instructions have the same full-barrier effect, and these are often/always (?) cheaper than an mfence.
¹ E.g., loads will have been satisfied and stores will have become globally visible (although it could be implemented differently as long as the visible effect wrt ordering is "as if" that occurred).
Caveat: I'm no expert in this. I'm still trying to learn this myself. But since no one has replied in the past two days, it seems experts on memory fence instructions are not plentiful. So here's my understanding ...
Intel is a weakly-ordered memory system. That means your program may execute
array[idx+1] = something
idx++
but the change to idx may be globally visible (e.g. to threads/processes running on other processors) before the change to array. Placing sfence between the two statements will ensure the order the writes are sent to the FSB.
Meanwhile, another processor runs
newestthing = array[idx]
It may have cached the memory for array and so have a stale copy, but it gets the updated idx due to a cache miss.
The solution is to use lfence just beforehand to ensure the loads are synchronized.
This article or this article may give better info

Compare and swap C++0x

From the C++0x proposal on C++ Atomic Types and Operations:
29.1 Order and Consistency [atomics.order]
Add a new sub-clause with the following paragraphs.
The enumeration memory_order specifies the detailed regular (non-atomic) memory synchronization order as defined in [the new section added by N2334 or its adopted successor] and may provide for operation ordering. Its enumerated values and their meanings are as follows.
memory_order_relaxed
The operation does not order memory.
memory_order_release
Performs a release operation on the affected memory locations, thus making regular memory writes visible to other threads through the atomic variable to which it is applied.
memory_order_acquire
Performs an acquire operation on the affected memory locations, thus making regular memory writes in other threads released through the atomic variable to which it is applied, visible to the current thread.
memory_order_acq_rel
The operation has both acquire and release semantics.
memory_order_seq_cst
The operation has both acquire and release semantics, and in addition, has sequentially-consistent operation ordering.
Lower in the proposal:
bool A::compare_swap( C& expected, C desired,
memory_order success, memory_order failure ) volatile
where one can specify memory order for the CAS.
My understanding is that “memory_order_acq_rel” will only necessarily synchronize those memory locations which are needed for the operation, while other memory locations may remain unsynchronized (it will not behave as a memory fence).
Now, my question is - if I choose “memory_order_acq_rel” and apply compare_swap to integral types, for instance, integers, how is this typically translated into machine code on modern consumer processors such as a multicore Intel i7? What about the other commonly used architectures (x64, SPARC, ppc, arm)?
In particular (assuming a concrete compiler, say gcc):
How to compare-and-swap an integer location with the above operation?
What instruction sequence will such a code produce?
Is the operation lock-free on i7?
Will such an operation run a full cache coherence protocol, synchronizing caches of different processor cores as if it were a memory fence on i7? Or will it just synchronize the memory locations needed by this operation?
Related to previous question - is there any performance advantage to using acq_rel semantics on i7? What about the other architectures?
Thanks for all the answers.
The answer here is not trivial. Exactly what happens and what is meant depends on many things. For a basic understanding of cache coherence and memory, perhaps my recent blog entries might be helpful:
CPU Reordering – What is actually being reordered?
CPU Memory – Why do I need a mutex?
But that aside, let me try to answer a few questions. First off, the function below is very optimistic about what is supported: very fine-grained control over exactly how strong a memory-ordering guarantee you get. That's reasonable for compile-time reordering, but often not for runtime barriers.
compare_swap( C& expected, C desired,
memory_order success, memory_order failure )
Architectures won't all be able to implement this exactly as you requested; many will have to strengthen it to something strong enough that they can implement. When you specify a memory_order, you are specifying how reordering may work. To use Intel's terms, you are specifying what type of fence you want; there are three of them: the full fence, the load fence, and the store fence. (But on x86, the load fence and store fence are only useful with weakly-ordered instructions like NT stores; atomics don't use them. Regular loads/stores give you everything except that stores can appear after later loads.) Just because you want a particular fence on that operation doesn't mean it is supported, in which case I'd hope it always falls back to a full fence. (See Preshing's article on memory barriers.)
An x86 (including x64) compiler will likely use the LOCK CMPXCHG instruction to implement the CAS, regardless of memory ordering. This implies a full barrier; x86 doesn't have a way to make a read-modify-write operation atomic without a lock prefix, which is also a full barrier. Pure-store and pure-load can be atomic "on their own", with many ISAs needing barriers for anything above mo_relaxed, but x86 does acq_rel "for free" in asm.
This instruction is lock-free, although all cores trying to CAS the same location will contend for access to it so you could argue it's not really wait-free. (Algorithms that use it might not be lock-free, but the operation itself is wait-free, see wikipedia's non-blocking algorithm article). On non-x86 with LL/SC instead of locked instructions, C++11 compare_exchange_weak is normally wait-free but compare_exchange_strong requires a retry loop in case of spurious failure.
Now that C++11 has existed for years, you can look at the asm output for various architectures on the Godbolt compiler explorer.
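For the concrete "how do I CAS an integer" part of the question, the C++11 spelling that the proposal's compare_swap evolved into looks like this sketch (my own example; on x86 it compiles to lock cmpxchg regardless of the ordering arguments):

#include <atomic>

bool cas_int(std::atomic<int>& loc, int& expected, int desired) {
    // x86: a single lock cmpxchg; the lock prefix is already a full barrier,
    // so asking for acq_rel here costs nothing extra on this architecture.
    return loc.compare_exchange_strong(expected, desired,
                                       std::memory_order_acq_rel,
                                       std::memory_order_acquire);
}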
In terms of memory sync, you need to understand how cache coherence works (my blog may help a bit). New CPUs use a ccNUMA architecture (previously SMP). Essentially the "view" of memory never gets out of sync. The fences used in the code don't actually force any flushing of the cache per se; they only force the store buffer to commit in-flight stores to cache before later loads.
If two cores both have the same memory location cached in a cache line, a store by one core will get exclusive ownership of the cache line (invalidating all other copies) and mark its own as dirty. A very simple explanation for a very complex process.
To answer your last question you should always use the memory semantics that you logically need to be correct. Most architectures won't support all the combinations you use in your program. However, in many cases you'll get great optimizations, especially in cases where the order you requested is guaranteed without a fence (which is quite common).
-- Answers to some comments:
You have to distinguish between what it means to execute a write instruction and to write to a memory location. This is what I attempt to explain in my blog post. By the time the "0" is committed to 0x100, all cores see that zero. Writing an integer is also atomic; that is, even without a lock, when you write to a location, all cores will immediately have that value if they wish to use it.
The trouble is that to use the value you have likely loaded it into a register first, any changes to the location after that obviously won't touch the register. This is why one needs mutexes or atomic<T> despite a cache coherent memory: the compiler is allowed to keep plain variable values in private registers. (In C++11, that's because a data-race on non-atomic variables is Undefined Behaviour.)
As to contradictory claims, generally you'll see all sorts of claims. Whether they are contradictory comes right down to exactly what "see", "load", and "execute" mean in the context. If you write "1" to 0x100, does that mean you executed the write instruction, or did the CPU actually commit that value? The difference created by the store buffer is one major cause of reordering (the only one x86 allows). The CPU can delay writing the "1", but you can be sure that the moment it does finally commit that "1", all cores see it. The fences control this ordering by making the thread wait until a store commits before doing later operations.
Your whole worldview seems off base: your question insinuates that cache consistency is controlled by memory orders at the C++ level and fences or atomic operations at the CPU level.
But cache consistency is one of the most important invariants of the physical architecture, and it's provided at all times by the memory system that consists of the interconnection of all CPUs and the RAM. You can never beat it from code running on a CPU, or even see its details of operation. Of course, by observing RAM directly and running code elsewhere you might see stale data at some level of memory: by definition the RAM doesn't have the newest value of all memory locations.
But code running on a CPU can't access DRAM directly, only through the memory hierarchy which includes caches that communicate with each other to maintain coherency of this shared view of memory. (Typically with MESI). Even on a single core, a write-back cache lets DRAM values be stale, which can be an issue for non-cache-coherent DMA but not for reading/writing memory from a CPU.
So the issue exists only for external devices, and only ones that do non-coherent DMA. (DMA is cache-coherent on modern x86 CPUs; the memory controller being built-in to the CPU makes this possible).
Will such an operation run a full cache coherence protocol, synchronizing caches of different processor cores as if it were a memory fence on i7?
They are already synchronized. See Does a memory barrier ensure that the cache coherence has been completed? - memory barriers only do local things inside the core running the barrier, like flush the store buffer.
Or will it just synchronize the memory locations needed by this operation?
An atomic operation applies to exactly one memory location. What other locations do you have in mind?
On a weakly-ordered CPU, a memory_order_relaxed atomic increment could avoid making earlier loads/stores visible before that increment. But x86's strongly-ordered memory model doesn't allow that.