C++ How is release-and-acquire achieved on x86 only using MOV?

C++ How is release-and-acquire achieved on x86 only using MOV? - c++

This question is a follow-up/clarification to this:
Does the MOV x86 instruction implement a C++11 memory_order_release atomic store?
This states the MOV assembly instruction is sufficient to perform acquire-release semantics on x86. We do not need LOCK, fences or xchg etc. However, I am struggling to understand how this works.
Intel doc Vol 3A Chapter 8 states:
https://software.intel.com/sites/default/files/managed/7c/f1/253668-sdm-vol-3a.pdf
In a single-processor (core) system....
Reads are not reordered with other reads.
Writes are not reordered with older reads.
Writes to memory are not reordered with other writes, with the following exceptions:
but this is for a single core. The multi-core section does not seem to mention how loads are enforced:
In a multiple-processor system, the following ordering principles apply:
Individual processors use the same ordering principles as in a single-processor system.
Writes by a single processor are observed in the same order by all processors.
Writes from an individual processor are NOT ordered with respect to the writes from other processors.
Memory ordering obeys causality (memory ordering respects transitive visibility).
Any two stores are seen in a consistent order by processors other than those performing the stores
Locked instructions have a total order.
So how can MOV alone can facilitate acquire-release?

but this is for a single core. The multi-core section does not seem to mention how loads are enforced:
The first bullet point in that section is key: Individual processors use the same ordering principles as in a single-processor system. The implicit part of that statement is ... when loading/storing from cache-coherent shared memory. i.e. multi-processor systems don't introduce new ways for reordering, they just mean the possible observers now include code on other cores instead of just DMA / IO devices.
The model for reordering of access to shared memory is the single-core model, i.e. program order + a store buffer = basically acq_rel. Actually slightly stronger than acq_rel, which is fine.
The only reordering that happens is local, within each CPU core. Once a store becomes globally visible, it becomes visible to all other cores at the same time, and didn't become visible to any cores before that. (Except to the core doing the store, via store forwarding.) That's why only local barriers are sufficient to recover sequential consistency on top of a SC + store-buffer model. (For x86, just mo_seq_cst just needs mfence after SC stores, to drain the store buffer before any further loads can execute.
mfence and locked instructions (which are also full barriers) don't have to bother other cores, just make this one wait).
One key point to understand is that there is a coherent shared view of memory (through coherent caches) that all processors share. The very top of chapter 8 of Intel's SDM defines some of this background:
These multiprocessing mechanisms have the following characteristics:
To maintain system memory coherency — When two or more processors are attempting simultaneously to
access the same address in system memory, some communication mechanism or memory access protocol
must be available to promote data coherency and, in some instances, to allow one processor to temporarily lock
a memory location.
To maintain cache consistency — When one processor accesses data cached on another processor, it must not
receive incorrect data. If it modifies data, all other processors that access that data must receive the modified
data.
To allow predictable ordering of writes to memory — In some circumstances, it is important that memory writes
be observed externally in precisely the same order as programmed.
[...]
The caching mechanism and cache consistency of Intel 64 and IA-32 processors are discussed in Chapter 11.
(CPUs use some variant of MESI; Intel in practice uses MESIF, AMD in practice uses MOESI.)
The same chapter also includes some litmus tests that help illustrate / define the memory model. The parts you quoted aren't really a strictly formal definition of the memory model. But the section 8.2.3.2 Neither Loads Nor Stores Are Reordered with Like Operations shows that loads aren't reordered with loads. Another section also shows that LoadStore reordering is forbidden. Acq_rel is basically blocking all reordering except StoreLoad, and that's what x86 does. (https://preshing.com/20120913/acquire-and-release-semantics/ and https://preshing.com/20120930/weak-vs-strong-memory-models/)
Related:
how are barriers/fences and acquire, release semantics implemented microarchitecturally?
x86 mfence and C++ memory barrier - asking why no barriers are needed for acq_rel, but coming at it from a different angle (wondering about how data ever becomes visible to other cores).
How do memory_order_seq_cst and memory_order_acq_rel differ? (seq_cst requires flushing the store buffer).
C11 Atomic Acquire/Release and x86_64 lack of load/store coherence?
Globally Invisible load instructions program-order + store buffer isn't exactly the same as acq_rel, especially once you consider a load that only partially overlaps a recent store.
x86-TSO: A Rigorous and Usable Programmer’s Model for x86 Multiprocessors - a formal memory model for x86.
Other ISAs
In general, most weaker memory HW models also only allow local reordering so barriers are still only local within a CPU core, just making (some part of) that core wait until some condition. (e.g. x86 mfence blocks later loads and stores from executing until the store buffer drains. Other ISAs also benefit from light-weight barriers for efficiency for stuff that x86 enforces between every memory operation, e.g. blocking LoadLoad and LoadStore reordering. https://preshing.com/20120930/weak-vs-strong-memory-models/)
A few ISAs (only PowerPC these days) allow stores to become visible to some other cores before becoming visible to all, allowing IRIW reordering. Note that mo_acq_rel in C++ allows IRIW reordering; only seq_cst forbids it. Most HW memory models are slightly stronger than ISO C++ and make it impossible, so all cores agree on the global order of stores.

Refreshing the semantics of acquire and release (quoting cppreference rather than the standard, because it's what I have on hand - the standard is more...verbose, here):
memory_order_acquire: A load operation with this memory order performs the acquire operation on the affected memory location: no reads or writes in the current thread can be reordered before this load. All writes in other threads that release the same atomic variable are visible in the current thread
memory_order_release: A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store. All writes in the current thread are visible in other threads that acquire the same atomic variable
This gives us four things to guarantee:
acquire ordering: "no reads or writes in the current thread can be reordered before this load"
release ordering: "no reads or writes in the current thread can be reordered after this store"
acquire-release synchronization:
"all writes in other threads that release the same atomic variable are visible in the current thread"
"all writes in the current thread are visible in other threads that acquire the same atomic variable"
Reviewing the guarantees:
Reads are not reordered with other reads.
Writes are not reordered with older reads.
Writes to memory are not reordered with other writes [..]
Individual processors use the same ordering principles as in a single-processor system.
This is sufficient to satisfy the ordering guarantees.
For acquire ordering, consider a read of the atomic has occurred: for that thread, clearly any later read or write migrating before would violate the first or second bullet points, respectively.
For release ordering, consider a write of the atomic has occurred: for that thread, clearly any prior reads or write migrating after would violate the second or third bullet points, respectively.
The only thing left is to ensure that if a thread reads a released store, it will see all the other loads the writer thread had produced up to that point. This is where the other multi-processor guarantee is needed.
Writes by a single processor are observed in the same order by all processors.
This is sufficient to satisfy acquire-release synchronization.
We've already established that when the release write occurs, all other writes prior to it will have also occurred. This bullet point then ensures that if another thread reads the released write, it will read all the writes the writer produced up to that point. (If it does not, then it would be observing that single processor's writes in a different order than the single processor, violating the bullet point.)

Related

C11 Standalone memory barriers LoadLoad StoreStore LoadStore StoreLoad

I want to use standalone memory barriers between atomic and non-atomic operations (I think it shouldn't matter at all anyway). I think I understand what a store barrier and a load barrier mean and also the 4 types of possible memory reorderings; LoadLoad, StoreStore, LoadStore, StoreLoad.
However, I always find the acquire/release concepts confusing. Because when reading the documentation, acquire doesn't only speak about loads, but also stores, and release doesn't only speak about stores but also loads. On the other hand plain load barriers only provide you with guarantees on loads and plain store barriers only provide you with guarantees on stores.
My question is the following. In C11/C++11 is it safe to consider a standalone atomic_thread_fence(memory_order_acquire) as a load barrier (preventing LoadLoad reorderings) and an atomic_thread_fence(memory_order_release) as a store barrier (preventing StoreStore reorderings)?
And if the above is correct what can I use to prevent LoadStore and StoreLoad reorderings?
Of course I am interested in portability and I don't care what the above produce on a specific platform.

No, an acquire barrier after a relaxed load can make into an acquire load (inefficiently on some ISAs compare to just using an acquire load), so it has to block LoadStore as well as LoadLoad.
See https://preshing.com/20120913/acquire-and-release-semantics/ for a couple very helpful diagrams of the orderings showing that and that release stores need to make sure all previous loads and stores are "visible", and thus need to block StoreStore and LoadStore. (Reorderings where the Store part is 2nd). Especially this diagram:
Also https://preshing.com/20130922/acquire-and-release-fences/
https://preshing.com/20131125/acquire-and-release-fences-dont-work-the-way-youd-expect/ explains the 2-way nature of acq and rel fences vs. the 1-way nature of an acq or rel operation like a load or store. Apparently some people had misconceptions about what atomic_thread_fence() guaranteed, thinking it was too weak.
And just for completeness, remember that these ordering rules have to be enforced by the compiler against compile-time reordering, not just runtime.
It may mostly work to think of barriers acting on C++ loads / stores in the C++ abstract machine, regardless of how that's implemented in asm. But there are corner cases like PowerPC where that mental model doesn't cover everything (IRIW reordering, see below).
I do recommend trying to think in terms of acquire and release operations ensuring visibility of other operations to each other, and definitely don't write code that just uses relaxed ops and separate barriers. That can be safe but is often less efficient.
Everything about ISO C/C++ memory / inter-thread ordering is officially defined in terms of an acquire load seeing the value from a release store, and thus creating a "synchronizes with" relationship, not about fences to control local reordering.
std::atomic does not explicitly guarantee the existence of a coherent shared-memory state where all threads see a change at the same time. In the mental model you're using, with local reordering when reading/writing to a single shared state, IRIW reordering can happen when one thread makes its stores visible to some other threads before they become globally visible to all other threads. (Like can happen in practice on some SMT PowerPC CPUs.).
In practice all C/C++ implementations run threads across cores that do have a cache-coherent view of shared memory so the mental model in terms of read/write to coherent shared memory with barriers to control local reordering works. But keep in mind that C++ docs won't talk about re-ordering, just about whether any order is guaranteed in the first place.
For another in-depth look at the divide between how C++ describes memory models, vs. how asm memory models for real architectures are described, see also How to achieve a StoreLoad barrier in C++11? (including my answer there). Also Does atomic_thread_fence(memory_order_seq_cst) have the semantics of a full memory barrier? is related.
fence(seq_cst) includes StoreLoad (if that concept even applies to a given C++ implementation). I think reasoning in terms of local barriers and then transforming that to C++ mostly works, but remember that it doesn't model the possibility of IRIW reordering which C++ allows, and which happens in real life on some POWER hardware.
Also keep in mind that var.load(acquire) can be much more efficient than var.load(relaxed); fence(acquire); on some ISAs, notably ARMv8.
e.g. this example on Godbolt, compiled for ARMv8 by GCC8.2 -O2 -mcpu=cortex-a53
#include <atomic>
int bad_acquire_load(std::atomic<int> &var){
int ret = var.load(std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_acquire);
return ret;
}
bad_acquire_load(std::atomic<int>&):
ldr r0, [r0] // plain load
dmb ish // FULL BARRIER
bx lr
int normal_acquire_load(std::atomic<int> &var){
int ret = var.load(std::memory_order_acquire);
return ret;
}
normal_acquire_load(std::atomic<int>&):
lda r0, [r0] // acquire load
bx lr

MESI Protocol & std::atomic - Does it ensure all writes are immediately visible to other threads?

In regards to std::atomic, the C++11 standard states that stores to an atomic variable will become visible to loads of that variable in a "reasonable amount of time".
From 29.3p13:
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
However I was curious to know what actually happens when dealing with specific CPU architectures which are based on the MESI cache coherency protocol (x86, x86-64, ARM, etc..).
If my understanding of the MESI protocol is correct, a core will always read the value previously written/being written by another core immediately, possibly by snooping it. (because writing a value means issuing a RFO request which in turn invalidates other cache lines)
Does it mean that when a thread A stores a value into an std::atomic, another thread B which does a load on that atomic successively will in fact always observe the new value written by A on MESI architectures? (Assuming no other threads are doing operations on that atomic)
By “successively” I mean after thread A has issued the atomic store. (Modification order has been updated)

I'll answer for what happens on real implementations on real CPUs, because an answer based only on the standard can barely say anything useful about time or "immediacy".
MESI is just an implementation detail that ISO C++ doesn't have anything to say about. The guarantees provided by ISO C++ only involve order, not actual time. ISO C++ is intentionally non-specific to avoid assuming that it will execute on a "normal" CPU. An implementation on a non-coherent machine that required explicit flushes for store visibility might be theoretically possible (although probably horrible for performance of release / acquire and seq-cst operations)
C++ is non-specific enough about timing to even allow an implementation on a single-core cooperative multi-tasking system (no pre-emption), with the compiler inserting voluntary yields occasionally. (Infinite loops without any volatile accesses or I/O are UB). C++ on a system where only one thread can actually be executing at once is totally fine and possible, assuming you consider a scheduler timeslice to still be a "reasonable" amount of time. (Or less if you yield or otherwise block.)
Even the model of formalism ISO C++ uses to give the guarantees it does about ordering is very different from the way hardware ISAs define their memory models. C++ formal guarantees are purely in terms of happens-before and synchronizes-with, not "re"-ordering litmus tests or any kind of stuff like that. e.g. How to achieve a StoreLoad barrier in C++11? is impossible to answer for pure ISO C++ formalism. The "option C" in that Q&A serves to show just how weak the C++ guarantees are; that case with store then load of two different SC variables is not sufficient to imply happens-before based on it, according to the C++ formalism, even though there has to be a total order of all SC operations. But it is sufficient in real life on systems with coherent cache and only local (within each CPU core) memory reordering, even AArch64 where the SC load right after the SC store does still essentially give us a StoreLoad barrier.
when a thread A stores a value into an std::atomic
It depends what you mean by "doing" a store.
If you mean committing from the store buffer into L1d cache, then yes, that's the moment when a store becomes globally visible, on a normal machine that uses MESI to give all CPU cores a coherent view of memory.
Although note that on some ISAs, some other threads are allowed to see stores before they become globally visible via cache. (i.e. the hardware memory model may not be "multi-copy atomic", and allow IRIW reordering. POWER is the only example I know of that does this in real life. See Will two atomic writes to different locations in different threads always be seen in the same order by other threads? for details on the HW mechanism: Store forwarding for retired aka graduated stores between SMT threads.)
If you mean executing locally so later loads in this thread can see it, then no. std::atomic can use a memory_order weaker than seq_cst.
All mainstream ISAs have memory-ordering rules weak enough to allow for a store buffer to decouple instruction execution from commit to cache. This also allows speculative out-of-order execution by giving stores somewhere private to live after execution, before we're sure that they were on the correct path of execution. (Stores can't commit to L1d until after the store instruction retires from the out-of-order part of the back end, and thus is known to be non-speculative.)
If you want to wait for your store to be visible to other threads before doing any later loads, use atomic_thread_fence(memory_order_seq_cst);. (Which on "normal" ISAs with standard choice of C++ -> asm mappings will compile to a full barrier).
On most ISAs, a seq_cst store (the default) will also stall all later loads (and stores) in this thread until the store is globally visible. But on AArch64, STLR is a sequential-release store and execution of later loads/stores doesn't have to stall unless / until a LDAR (acquire load) is about to execute while the STLR is still in the store buffer. This implements SC semantics as weakly as possible, assuming AArch64 hardware actually works that way instead of just treating it as a store + full barrier.
But note that only blocking later loads/stores is necessary; out-of-order exec of ALU instructions on registers can still continue. But if you were expecting some kind of timing effect due to dependency chains of FP operations, for example, that's not something you can depend on in C++.
Even if you do use seq_cst so nothing happens in this thread before the store is visible to others, that's still not instant. Inter-core latency on real hardware can be on the order of maybe 40ns on mainstream modern Intel x86, for example. (This thread doesn't have to stall that long on a memory barrier instruction; some of that time is the cache miss on the other thread trying to read the line that was invalidated by this core's RFO to get exclusive ownership.) Or of course much cheaper for logical cores that share the L1d cache of a physical core: What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?

From 29.3p13:
Implementations should make atomic stores visible to atomic loads
within a reasonable amount of time.
The C and C++ standards are all over the place on threads, hence not usable as formal specifications. They use the concept of time, and somewhat imply that everything runs step by step, sequentially (if not, you wouldn't have a sound program semantic) and then say that some constructs can see effects out of order, without ever telling which is which.
When effects are seen out of order, thread time is ill defined as you don't have a chronometer that would also be out of order: you wouldn't do sport with out of order execution of actions!
Even "out of order" suggests that some things are purely sequential and some other operations can be "out of order" with respect to the firsts. That is not how std::atomic is defined.
What the standards try to say is that there is a notion of progress for each thread, with a CPU time or cost index, and as it increases as more stuff is done, and stuff can only be slightly reordered by the implementation: now reordering is well defined, not in term of other sequential instructions, but in term of cost/cycles/CPU time.
So if two instructions are close to each other in the sequential intra-thread execution, they will be close in CPU time too. A reasonable compiler shouldn't move a volatile operation, a file output, or an atomic operation past a very costly "pure" computation (one that has no externally visible side effect).
A basic idea that many committee members sadly couldn't even spell out!

Thread-safety of sub-word-size flags

Consider the following scenario:
There exists a globally accessible variable F.
Thread A repeatedly assigns a random value to F (without any regard to the previous value of F).
Thread B does the same thing as thread A (independent of A).
Thread C repeatedly reads the value of F (and say, prints it).
This is in C++ (Visual C++) on Windows on x64 architecture, multi-processor. F is of type bool and marked volatile, and none of the accesses are protected by any locks.
Question: Is there anything thread-unsafe about this scenario?
Assuming that the logical behavior of the code is valid, is there anything unsafe about the fact that multiple threads are reading and writing the values to the same location at the same time?
What guarantees can be made (across architectures, OSes, compilers) about the atomicity of reading from and writing to variables that are <= word-size on the platform? (I am assuming word-size is important...)
On a related note, what is an acceptible way of communicating between threads the state of completion of some operation (none of the threads are waiting for the operation to complete, they may just be interested in querying the state from time to time)?

It depends on your thread-safety requirements. You'll always get a consistent value (that is, it is impossible that you get half the value of thread A's write and the other half from thread B), but there is no guarantee that the value you'll read is, in fact, the latest one that was logically written.
The problem here is the CPU cache that may or may not get flushed. When a thread writes to the memory, the value first goes to the cache, and eventually it gets written to memory. In the while, if other cores attempt to read the object from memory, they'll get the old value.

Under x86, any read or write to a type correctly aligned for its size is considered to be atomic (so for a bool it need only be a 1 byte alignment boundery), but its recommended to use explict atomic ops for portability and the memory barriers they provide. An excerpt from the Intel System Programming Guide, Vol 3A, Section 8. May 2011 (there is another one as well, can't find it atm).
The Intel486 processor (and newer processors since) guarantees that the following
basic memory operations will always be carried out atomically:
• Reading or writing a byte
• Reading or writing a word aligned on a 16-bit boundary
• Reading or writing a doubleword aligned on a 32-bit boundary
The Pentium processor (and newer processors since) guarantees that the following
additional memory operations will always be carried out atomically:
• Reading or writing a quadword aligned on a 64-bit boundary
• 16-bit accesses to uncached memory locations that fit within a 32-bit data bus
The P6 family processors (and newer processors since) guarantee that the following
additional memory operation will always be carried out atomically:
• Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache
line
Microsoft also has a few examples of using volatitle bool's for signaling thread exits, however, if you want to signal threads that are waiting, its best to use kernel constructs, on windows this would be a event (see CreateEventA/W), this will prevent the wait from burning cpu cycles when the variable hasn't been set yet.
Update:
For threads that will have almost zero wait time, its a good idea to implement a user level lock, with optional backoff if its a high contention enviroment, intel has a good article on that here, alternatively you can use WinAPI's CriticalSections (these are semi-kernel level).

When should I use _mm_sfence _mm_lfence and _mm_mfence

I read the "Intel Optimization guide Guide For Intel Architecture".
However, I still have no idea about when should I use
_mm_sfence()
_mm_lfence()
_mm_mfence()
Could anyone explain when these should be used when writing multi-threaded code?

If you're using NT stores, you might want _mm_sfence or maybe even _mm_mfence. The use-cases for _mm_lfence are much more obscure.
If not, just use C++11 std::atomic and let the compiler worry about the asm details of controlling memory ordering.
x86 has a strongly-ordered memory model, but C++ has a very weak memory model (same for C). For acquire/release semantics, you only need to prevent compile-time reordering. See Jeff Preshing's Memory Ordering At Compile Time article.
_mm_lfence and _mm_sfence do have the necessary compiler-barrier effect, but they will also cause the compiler to emit a useless lfence or sfence asm instruction that makes your code run slower.
There are better options for controlling compile-time reordering when you aren't doing any of the obscure stuff that would make you want sfence.
For example, GNU C/C++ asm("" ::: "memory") is a compiler barrier (all values have to be in memory matching the abstract machine because of the "memory" clobber), but no asm instructions are emitted.
If you're using C++11 std::atomic, you can simply do shared_var.store(tmp, std::memory_order_release). That's guaranteed to become globally visible after any earlier C assignments, even to non-atomic variables.
_mm_mfence is potentially useful if you're rolling your own version of C11 / C++11 std::atomic, because an actual mfence instruction is one way to get sequential consistency, i.e. to stop later loads from reading a value until after preceding stores become globally visible. See Jeff Preshing's Memory Reordering Caught in the Act.
But note that mfence seems to be slower on current hardware than using a locked atomic-RMW operation. e.g. xchg [mem], eax is also a full barrier, but runs faster, and does a store. On Skylake, the way mfence is implemented prevents out-of-order execution of even non-memory instruction following it. See the bottom of this answer.
In C++ without inline asm, though, your options for memory barriers are more limited (How many memory barriers instructions does an x86 CPU have?). mfence isn't terrible, and it is what gcc and clang currently use to do sequential-consistency stores.
Seriously just use C++11 std::atomic or C11 stdatomic if possible, though; It's easier to use and you get quite good code-gen for a lot of things. Or in the Linux kernel, there are already wrapper functions for inline asm for the necessary barriers. Sometimes that's just a compiler barrier, sometimes it's also an asm instruction to get stronger run-time ordering than the default. (e.g. for a full barrier).
No barriers will make your stores appear to other threads any faster. All they can do is delay later operations in the current thread until earlier things happen. The CPU already tries to commit pending non-speculative stores to L1d cache as quickly as possible.
_mm_sfence is by far the most likely barrier to actually use manually in C++
The main use-case for _mm_sfence() is after some _mm_stream stores, before setting a flag that other threads will check.
See Enhanced REP MOVSB for memcpy for more about NT stores vs. regular stores, and x86 memory bandwidth. For writing very large buffers (larger than L3 cache size) that definitely won't be re-read any time soon, it can be a good idea to use NT stores.
NT stores are weakly-ordered, unlike normal stores, so you need sfence if you care about publishing the data to another thread. If not (you'll eventually read them from this thread), then you don't. Or if you make a system call before telling another thread the data is ready, that's also serializing.
sfence (or some other barrier) is necessary to give you release/acquire synchronization when using NT stores. C++11 std::atomic implementations leave it up to you to fence your NT stores, so that atomic release-stores can be efficient.
#include <atomic>
#include <immintrin.h>
struct bigbuf {
int buf[100000];
std::atomic<unsigned> buf_ready;
};
void producer(bigbuf *p) {
__m128i *buf = (__m128i*) (p->buf);
for(...) {
...
_mm_stream_si128(buf, vec1);
_mm_stream_si128(buf+1, vec2);
_mm_stream_si128(buf+2, vec3);
...
}
_mm_sfence(); // All weakly-ordered memory shenanigans stay above this line
// So we can safely use normal std::atomic release/acquire sync for buf
p->buf_ready.store(1, std::memory_order_release);
}
Then a consumer can safely do if(p->buf_ready.load(std::memory_order_acquire)) { foo = p->buf[0]; ... } without any data-race Undefined Behaviour. The reader side does not need _mm_lfence; the weakly-ordered nature of NT stores is confined entirely to the core doing the writing. Once it becomes globally visible, it's fully coherent and ordered according to the normal rules.
Other use-cases include ordering clflushopt to control the order of data being stored to memory-mapped non-volatile storage. (e.g. an NVDIMM using Optane memory, or DIMMs with battery-backed DRAM exist now.)
_mm_lfence is almost never useful as an actual load fence. Loads can only be weakly ordered when loading from WC (Write-Combining) memory regions, like video ram. Even movntdqa (_mm_stream_load_si128) is still strongly ordered on normal (WB = write-back) memory, and doesn't do anything to reduce cache pollution. (prefetchnta might, but it's hard to tune and can make things worse.)
TL:DR: if you aren't writing graphics drivers or something else that maps video RAM directly, you don't need _mm_lfence to order your loads.
lfence does have the interesting microarchitectural effect of preventing execution of later instructions until it retires. e.g. to stop _rdtsc() from reading the cycle-counter while earlier work is still pending in a microbenchmark. (Applies always on Intel CPUs, but on AMD only with an MSR setting: Is LFENCE serializing on AMD processors?. Otherwise lfence runs 4 per clock on Bulldozer family, so clearly not serializing.)
Since you're using intrinsics from C/C++, the compiler is generating code for you. You don't have direct control over the asm, but you might possibly use _mm_lfence for things like Spectre mitigation if you can get the compiler to put it in the right place in the asm output: right after a conditional branch, before a double array access. (like foo[bar[i]]). If you're using kernel patches for Spectre, I think the kernel will defend your process from other processes, so you'd only have to worry about this in a program that uses a JIT sandbox and is worried about being attacked from within its own sandbox.

Here is my understanding, hopefully accurate and simple enough to make sense:
(Itanium) IA64 architecture allows memory reads and writes to be executed in any order, so the order of memory changes from the point of view of another processor is not predictable unless you use fences to enforce that writes complete in a reasonable order.
From here on, I am talking about x86, x86 is strongly ordered.
On x86, Intel does not guarantee that a store done on another processor will always be immediately visible on this processor. It is possible that this processor speculatively executed the load (read) just early enough to miss the other processor's store (write). It only guarantees the order that writes become visible to other processors is in program order. It does not guarantee that other processors will immediately see any update, no matter what you do.
Locked read/modify/write instructions are fully sequentially consistent. Because of this, in general you already handle missing the other processor's memory operations because a locked xchg or cmpxchg will sync it all up, you will acquire the relevant cache line for ownership immediately and will update it atomically. If another CPU is racing with your locked operation, either you will win the race and the other CPU will miss the cache and get it back after your locked operation, or they will win the race, and you will miss the cache and get the updated value from them.
lfence stalls instruction issue until all instructions before the lfence are completed. mfence specifically waits for all preceding memory reads to be brought fully into the destination register, and waits for all preceding writes to become globally visible, but does not stall all further instructions as lfence would. sfence does the same for only stores, flushes write combiner, and ensures that all stores preceding the sfence are globally visible before allowing any stores following the sfence to begin execution.
Fences of any kind are rarely needed on x86, they are not necessary unless you are using write-combining memory or non-temporal instructions, something you rarely do if you are not a kernel mode (driver) developer. Normally, x86 guarantees that all stores are visible in program order, but it does not make that guarantee for WC (write combining) memory or for "non-temporal" instructions that do explicit weakly ordered stores, such as movnti.
So, to summarize, stores are always visible in program order unless you have used special weakly ordered stores or are accessing WC memory type. Algorithms using locked instructions like xchg, or xadd, or cmpxchg, etc, will work without fences because locked instructions are sequentially consistent.

The intrinsic calls you mention all simply insert an sfence, lfence or mfence instruction when they are called. So the question then becomes "What are the purposes of those fence instructions"?
The short answer is that lfence is completely useless* and sfence almost completely useless for memory ordering purposes for user-mode programs in x86. On the other hand, mfence serves as a full memory barrier, so you might use it in places where you need a barrier if there isn't already some nearby lock-prefixed instruction providing what you need.
The longer-but-still short answer is...
lfence
lfence is documented to order loads prior to the lfence with respect to loads after, but this guarantee is already provided for normal loads without any fence at all: that is, Intel already guarantees that "loads aren't reordered with other loads". As a practical matter, this leaves the purpose of lfence in user-mode code as an out-of-order execution barrier, useful perhaps for carefully timing certain operations.
sfence
sfence is documented to order stores before and after in the same way that lfence does for loads, but just like loads the store order is already guaranteed in most cases by Intel. The primary interesting case where it doesn't is the so-called non-temporal stores such as movntdq, movnti, maskmovq and a few other instructions. These instructions don't play by the normal memory ordering rules, so you can put an sfence between these stores and any other stores where you want to enforce the relative order. mfence works for this purpose too, but sfence is faster.
mfence
Unlike the other two, mfence actually does something: it serves as a full memory barrier, ensuring that all of the previous loads and stores will have completed1 before any of the subsequent loads or stores begin execution. This answer is too short to explain the concept of a memory barrier fully, but an example would be Dekker's algorithm, where each thread wanting to enter a critical section stores to a location and then checks to see if the other thread has stored something to its location. For example, on thread 1:
mov DWORD [thread_1_wants_to_enter], 1 # store our flag
mov eax, [thread_2_wants_to_enter] # check the other thread's flag
test eax, eax
jnz retry
; critical section
Here, on x86, you need a memory barrier in between the store (the first mov), and the load (the second mov), otherwise each thread could see zero when they read the other's flag because the x86 memory model allows loads to be re-ordered with earlier stores. So you could insert an mfence barrier as follows to restore sequential consistency and the correct behavior of the algorithm:
mov DWORD [thread_1_wants_to_enter], 1 # store our flag
mfence
mov eax, [thread_2_wants_to_enter] # check the other thread's flag
test eax, eax
jnz retry
; critical section
In practice, you don't see mfence as much as you might expect, because x86 lock-prefixed instructions have the same full-barrier effect, and these are often/always (?) cheaper than an mfence.
1 E.g., loads will have been satisfied and stores will have become globally visible (although it would be implemented differently as long as the visible effect wrt ordering is "as if" that occurred).

Caveat: I'm no expert in this. I'm still trying to learn this myself. But since no one has replied in the past two days, it seems experts on memory fence instructions are not plentiful. So here's my understanding ...
Intel is a weakly-ordered memory system. That means your program may execute
array[idx+1] = something
idx++
but the change to idx may be globally visible (e.g. to threads/processes running on other processors) before the change to array. Placing sfence between the two statements will ensure the order the writes are sent to the FSB.
Meanwhile, another processor runs
newestthing = array[idx]
may have cached the memory for array and has a stale copy, but gets the updated idx due to a cache miss.
The solution is to use lfence just beforehand to ensure the loads are synchronized.
This article or this article may give better info

Compare and swap C++0x

From the C++0x proposal on C++ Atomic Types and Operations:
29.1 Order and Consistency [atomics.order]
Add a new sub-clause with the following paragraphs.
The enumeration memory_order specifies the detailed regular (non-atomic) memory synchronization order as defined in [the new section added by N2334 or its adopted successor] and may provide for operation ordering. Its enumerated values and their meanings are as follows.
memory_order_relaxed
The operation does not order memory.
memory_order_release
Performs a release operation on the affected memory locations, thus making regular memory writes visible to other threads through the atomic variable to which it is applied.
memory_order_acquire
Performs an acquire operation on the affected memory locations, thus making regular memory writes in other threads released through the atomic variable to which it is applied, visible to the current thread.
memory_order_acq_rel
The operation has both acquire and release semantics.
memory_order_seq_cst
The operation has both acquire and release semantics, and in addition, has sequentially-consistent operation ordering.
Lower in the proposal:
bool A::compare_swap( C& expected, C desired,
memory_order success, memory_order failure ) volatile
where one can specify memory order for the CAS.
My understanding is that “memory_order_acq_rel” will only necessarily synchronize those memory locations which are needed for the operation, while other memory locations may remain unsynchronized (it will not behave as a memory fence).
Now, my question is - if I choose “memory_order_acq_rel” and apply compare_swap to integral types, for instance, integers, how is this typically translated into machine code on modern consumer processors such as a multicore Intel i7? What about the other commonly used architectures (x64, SPARC, ppc, arm)?
In particular (assuming a concrete compiler, say gcc):
How to compare-and-swap an integer location with the above operation?
What instruction sequence will such a code produce?
Is the operation lock-free on i7?
Will such an operation run a full cache coherence protocol, synchronizing caches of different processor cores as if it were a memory fence on i7? Or will it just synchronize the memory locations needed by this operation?
Related to previous question - is there any performance advantage to using acq_rel semantics on i7? What about the other architectures?
Thanks for all the answers.

The answer here is not trivial. Exactly what happens and what is meant is dependent on many things. For basic understanding of cache coherence/memory perhaps my recent blog entries might be helpful:
CPU Reordering – What is actually being reordered?
CPU Memory – Why do I need a mutex?
But that aside, let me try to answer a few questions. First off the below function is being very hopeful as to what is supported: very fine-grained control over exactly how strong a memory-order guarantee you get. That's reasonable for compile-time reordering but often not for runtime barriers.
compare_swap( C& expected, C desired,
memory_order success, memory_order failure )
Architectures won't all be able to implement this exactly as you requested; many will have to strengthen it to something strong enough that they can implement. When you specify memory_order you are specifying how reordering may work. To use Intel's terms you will be specifying what type of fence you want, there are three of them, the full fence, load fence, and store fence. (But on x86, load fence and store fence are only useful with weakly-ordered instructions like NT stores; atomics don't use them. Regular load/store give you everything except that stores can appear after later loads.) Just because you want a particular fence on that operation won't mean it is supported, in which I'd hope it always falls back to a full fence. (See Preshing's article on memory barriers)
An x86 (including x64) compiler will likely use the LOCK CMPXCHG instruction to implement the CAS, regardless of memory ordering. This implies a full barrier; x86 doesn't have a way to make a read-modify-write operation atomic without a lock prefix, which is also a full barrier. Pure-store and pure-load can be atomic "on their own", with many ISAs needing barriers for anything above mo_relaxed, but x86 does acq_rel "for free" in asm.
This instruction is lock-free, although all cores trying to CAS the same location will contend for access to it so you could argue it's not really wait-free. (Algorithms that use it might not be lock-free, but the operation itself is wait-free, see wikipedia's non-blocking algorithm article). On non-x86 with LL/SC instead of locked instructions, C++11 compare_exchange_weak is normally wait-free but compare_exchange_strong requires a retry loop in case of spurious failure.
Now that C++11 has existed for years, you can look at the asm output for various architectures on the Godbolt compiler explorer.
In terms of memory sync you need to understand how cache-coherence works (my blog may help a bit). New CPUs use a ccNUMA architecture (previously SMP). Essentially the "view" on the memory never gets out-of-sync. The fences used in the code don't actually force any flushing of cache to happen per-se, only of the store buffer committing in flight stores to cache before later loads.
If two cores both have the same memory location cached in a cache-line, a store by one core will get exclusive ownership of the cache line (invalidating all other copies) and marking its own as dirty. A very simple explanation for a very complex process
To answer your last question you should always use the memory semantics that you logically need to be correct. Most architectures won't support all the combinations you use in your program. However, in many cases you'll get great optimizations, especially in cases where the order you requested is guaranteed without a fence (which is quite common).
-- Answers to some comments:
You have to distinguish between what it means to execute a write instruction and write to a memory location. This is what I attempt to explain in my blog post. By the time the "0" is committed to 0x100, all cores see that zero. Writing integers is also atomic, that is even without a lock, when you write to a location all cores will immediately have that value if they wish to use it.
The trouble is that to use the value you have likely loaded it into a register first, any changes to the location after that obviously won't touch the register. This is why one needs mutexes or atomic<T> despite a cache coherent memory: the compiler is allowed to keep plain variable values in private registers. (In C++11, that's because a data-race on non-atomic variables is Undefined Behaviour.)
As to contradictory claims, generally you'll see all sorts of claims. Whether they are contradictory comes right down to exactly what "see" "load" "execute" mean in the context. If you write "1" to 0x100, does that mean you executed the write instruction or did the CPU actually commit that value. The difference created by the store buffer is one major cause of reordering (the only one x86 allows). The CPU can delay writing the "1", but you can be sure that the moment it does finally commit that "1" all cores see it. The fences control this ordering by making the thread wait until a store commits before doing later operations.

Your whole worldview seems off base: your question insinuates that cache consistency is controlled by memory orders at the C++ level and fences or atomic operations at the CPU level.
But cache consistency is one of the most important invariants for the physical architecture, and it's provided at all time by the memory system that consists of the interconnection of all CPUs and the RAM. You can never beat it from code running on a CPU, or even see its detail of operation. Of course, by observing RAM directly and running code elsewhere you might see stale data at some level of memory: by definition the RAM doesn't have the newest value of all memory locations.
But code running on a CPU can't access DRAM directly, only through the memory hierarchy which includes caches that communicate with each other to maintain coherency of this shared view of memory. (Typically with MESI). Even on a single core, a write-back cache lets DRAM values be stale, which can be an issue for non-cache-coherent DMA but not for reading/writing memory from a CPU.
So the issue exists only for external devices, and only ones that do non-coherent DMA. (DMA is cache-coherent on modern x86 CPUs; the memory controller being built-in to the CPU makes this possible).
Will such an operation run a full cache coherence protocol,
synchronizing caches of different processor cores as if it were a
memory fence on i7?
They are already synchronized. See Does a memory barrier ensure that the cache coherence has been completed? - memory barriers only do local things inside the core running the barrier, like flush the store buffer.
Or will it just synchronize the memory locations
needed by this operation?
An atomic operation applies to exactly one memory location. What others locations do you have in mind?
On a weakly-ordered CPU, a memory_order_relaxed atomic increment could avoid making earlier loads/stores visible before that increment. But x86's strongly-ordered memory model doesn't allow that.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js