Overhead of a Memory Barrier / Fence - c++

I'm currently writing C++ code and use a lot of memory barriers / fences in my code. I know that a memory barrier tells the compiler and the hardware not to reorder reads/writes around it, but I don't know how expensive this operation is for the processor at runtime.
My question is: what is the runtime overhead of such a barrier? I couldn't find any useful answer with Google...
Is the overhead negligible? Or does heavy use of memory barriers lead to serious performance problems?
Best regards.

Compared to arithmetic and "normal" instructions I understand these to be very costly, but I don't have numbers to back up that statement. I like jalf's answer describing the effects of the instructions, and would like to add a bit.
There are in general a few different types of barriers, so understanding the differences can be helpful. A barrier like the one jalf mentioned is required, for example, in a mutex implementation before clearing the lock word (lwsync on ppc, or st4.rel on ia64). All reads and writes must complete, and only instructions later in the pipeline that have no memory access and no dependencies on in-progress memory operations can be executed.
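For concreteness, here is what that release-before-unlock pattern looks like in C++11 terms. This is only a sketch with illustrative names (lock_word, spin_unlock); a real mutex does more than this, but the barrier placement is the point:
#include <atomic>

std::atomic<int> lock_word{0};   // 0 = free, 1 = held (illustrative lock word)

void spin_unlock()
{
    // Release store: all reads and writes inside the critical section must
    // complete before other threads can observe the lock word becoming 0.
    // On ppc this maps to lwsync + store, on ia64 to st4.rel.
    lock_word.store(0, std::memory_order_release);
}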
Another type of barrier is the sort you'd use in a mutex implementation when acquiring a lock (examples: isync on ppc, or instr.acq on ia64). This has an effect on future instructions, so if a non-dependent load has been prefetched it must be discarded. Example:
if ( pSharedMem->atomic.bit_is_set() ) // use a bit to flag that somethingElse is "ready"
{
    foo( pSharedMem->somethingElse ) ;
}
Without an acquire barrier (borrowing ia64 lingo), your program may have unexpected results if somethingElse made it into a register before the check of the flagging bit was complete.
There is a third type of barrier, generally less used, that is required to enforce store-load ordering. Examples of instructions enforcing such an ordering are sync on ppc (heavyweight sync), mf on ia64, and membar #storeload on sparc (required even for TSO).
Using ia64 like pseudocode to illustrate, suppose one had
st4.rel
ld4.acq
without an mf in between, one has no guarantee that the load follows the store. You know that loads and stores preceding the st4.rel are done before that store or the "subsequent" load, but that load or other future loads (and perhaps stores, if non-dependent) could sneak in and complete earlier, since nothing prevents that otherwise.
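The same store-load reordering can be written down portably with C++11 atomics; only a seq_cst fence (or making the operations themselves seq_cst) forbids the outcome where both threads read 0. A sketch with hypothetical names, meant to be run with thread1 and thread2 on two different threads:
#include <atomic>

std::atomic<int> x{0}, y{0};

int thread1()
{
    x.store(1, std::memory_order_release);
    // Full (store-load) barrier: without it, the load below may complete
    // before the store above is globally visible.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return y.load(std::memory_order_acquire);
}

int thread2()
{
    y.store(1, std::memory_order_release);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return x.load(std::memory_order_acquire);
}

// With the fences, at least one thread must return 1. Without them, both
// returning 0 is allowed, and is observable even on x86.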
Because mutex implementations very likely only use acquire and release barriers in their implementations, I'd expect that an observable effect of this is that memory access following lock release may actually sometimes occur while "still in the critical section".

Try thinking about what the instruction does. It doesn't make the CPU do anything complicated in terms of logic, but it forces it to wait until all reads and writes have been committed to main memory. So the cost really depends on the cost of accessing main memory (and the number of outstanding reads/writes).
Accessing main memory is generally pretty expensive (10-200 clock cycles), but in a sense that work would have to be done without the barrier as well; it could just be hidden by executing some other instructions simultaneously, so you didn't feel the cost so much.
It also limits the CPU's (and compiler's) ability to reschedule instructions, so there may be an indirect cost as well, in that nearby instructions can't be interleaved which might otherwise yield a more efficient execution schedule.
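If you want a rough number for your own machine, it's easy to sketch a microbenchmark; the absolute figures depend heavily on the CPU, how many stores are in flight, and whether the line stays in cache, so treat the output as illustrative only:
#include <atomic>
#include <chrono>
#include <cstdio>

int main()
{
    std::atomic<int> a{0};
    constexpr int N = 10000000;

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i)
        a.store(i, std::memory_order_relaxed);   // plain store, no barrier
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i)
        a.store(i, std::memory_order_seq_cst);   // full barrier per store on x86 (xchg / mfence)
    auto t2 = std::chrono::steady_clock::now();

    auto ns = [](auto d) {
        return std::chrono::duration_cast<std::chrono::nanoseconds>(d).count();
    };
    std::printf("relaxed: %lld ns, seq_cst: %lld ns\n",
                (long long)ns(t1 - t0), (long long)ns(t2 - t1));
}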


Why set the stop flag using `memory_order_seq_cst`, if you check it with `memory_order_relaxed`?

Herb Sutter, in his "atomic<> weapons" talk, shows several example uses of atomics, and one of them boils down to the following: (video link, timestamped)
A main thread launches several worker threads.
Workers check the stop flag:
while (!stop.load(std::memory_order_relaxed))
{
    // Do stuff.
}
The main thread eventually does stop = true; (note, using order=seq_cst), then joins the workers.
Sutter explains that checking the flag with order=relaxed is ok, because who cares if the thread stops with a slightly bigger delay.
But why does stop = true; in the main thread use seq_cst? The slide says that it's purposefully not relaxed, but doesn't explain why.
It looks like a relaxed store would work too, possibly with a larger stopping delay.
Is it a compromise between performance and how fast other threads see the flag? I.e. since the main thread only sets the flag once, we might as well use the strongest ordering, to get the message across as quickly as possible?
mo_relaxed is fine for both load and store of a stop flag
There's also no meaningful latency benefit to stronger memory orders, even if latency of seeing a change to a keep_running or exit_now flag was important.
IDK why Herb thinks stop.store shouldn't be relaxed; in his talk, his slides have a comment that says // not relaxed on the assignment, but he doesn't say anything about the store side before moving on to "is it worth it".
Of course, the load runs inside the worker loop, but the store runs only once, and Herb really likes to recommend sticking with SC unless you have a performance reason that truly justifies using something else. I hope that wasn't his only reason; I find that unhelpful when trying to understand what memory order would actually be necessary and why. But anyway, I think it was either that or a mistake on his part.
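For reference, the fully relaxed version of the pattern is still correct, because thread creation and join already synchronize everything except the flag itself. A sketch (the worker count and the loop body are placeholders):
#include <atomic>
#include <thread>
#include <vector>

std::atomic<bool> stop{false};

void worker()
{
    while (!stop.load(std::memory_order_relaxed))
    {
        // Do stuff.
    }
}

int main()
{
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)
        workers.emplace_back(worker);

    stop.store(true, std::memory_order_relaxed);  // relaxed is sufficient

    for (auto& t : workers)
        t.join();  // join() is what synchronizes the workers' results with main
}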
The ISO C++ standard doesn't say anything about how soon stores become visible or what might influence that, beyond the two passages quoted below. They apply to all atomic operations, including relaxed ones, and they're only recommendations: "should", not "must".
ISO C++ section 6.9.2.3 Forward progress
18. An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.
And 33.5.4 Order and consistency [atomics.order] covering only atomics, not mutexes etc.:
11. Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
Inter-thread latency is primarily a quality-of-implementation thing, with the standard leaving things wide open. Normal C++ implementations that work by compiling to asm for some architecture effectively just expose the hardware's cache-coherence properties, so typically tens of nanoseconds best case, sub-microsecond near-worst case if both threads are currently running on different cores. (Otherwise scheduler timeslice...)
Another thread can loop arbitrarily many times before its load actually sees this store value, even if they're both seq_cst, assuming there's no other synchronization of any kind between them. Low inter-thread latency is a performance issue, not correctness / formal guarantee.
And non-infinite inter-thread latency is apparently only a "should" QOI (quality of implementation) issue. :P Nothing in the standard suggests that seq_cst would help on a hypothetical implementation where store visibility could be delayed indefinitely, although one might guess that could be the case, e.g. on a hypothetical implementation with explicit cache flushes instead of cache coherency. (Although such an implementation is probably not practically usable in terms of performance with CPUs anything like what we have now; every release and/or acquire operation would have to flush the whole cache.)
On real hardware (which uses some form of MESI cache coherency), different memory orders for store or load don't make stores visible sooner in real time, they just control whether later operations can become globally visible while still waiting for the store to commit from the store buffer to L1d cache. (After invalidating any other copies of the line.)
Stronger orders, and barriers, don't make things happen sooner in an absolute sense, they just delay other things until they're allowed to happen relative to the store or load. (This is the case on all real-world CPUs AFAIK; they always try to make stores visible to other cores ASAP anyway, so the store buffer doesn't fill up.)
See also (my similar answers on):
Does hardware memory barrier make visibility of atomic operations faster in addition to providing necessary guarantees?
If I don't use fences, how long could it take a core to see another core's writes?
memory_order_relaxed and visibility
Thread synchronization: How to guarantee visibility of writes (it's a non-issue on current real hardware)
The second Q&A is about x86 where commit from the store buffer to L1d cache is in program order. That limits how far past a cache-miss store execution can get, and also any possible benefit of putting a release or seq_cst fence after the store to prevent later stores (and loads) from maybe competing for resources. (x86 microarchitectures will do RFO (read for ownership) before stores reach the head of the store buffer, and plain loads normally compete for resources to track RFOs we're waiting for a response to.) But these effects are extremely minor in terms of something like exiting another thread; only very small scale reordering.
because who cares if the thread stops with a slightly bigger delay.
More like, who cares if the thread gets more work done by not making loads/stores after the load wait for the check to complete. (Of course, this work will get discarded if it's in the shadow of a mis-speculated branch on the load result when we eventually load true.) The cost of rolling back to a consistent state after a branch mispredict is more or less independent of how much already-executed work had happened beyond the mispredicted branch. And it's a stop flag which presumably doesn't get set very often, so the total amount of wasted work costing cache/memory bandwidth for other CPUs is pretty minimal.
That phrasing makes it sound like an acquire load or release store would actually get the store seen sooner in absolute real time, rather than just relative to other code in this thread. (Which is not the case.)
The benefit is more instruction-level and memory-level parallelism across loop iterations when the load produces false. And simply avoiding running extra instructions on ISAs where an acquire or especially an SC load needs extra instructions, especially expensive 2-way barrier instructions (like PowerPC isync/sync, or especially the ARMv7 dmb ish full barrier even for acquire), unlike ARMv8 ldapr or x86 mov acquire-load instructions. (Godbolt)
BTW, Herb is right that the dirty flag can also be relaxed, but only because of the thread.join sync between the reader and any possible writer. Otherwise yeah, release / acquire.
But in this case, dirty only needs to be atomic<> at all because of possible simultaneous writers all storing the same value, which ISO C++ still deems data-race UB, e.g. because of the theoretical possibility of hardware race detection that traps on conflicting non-atomic accesses. (Or software implementations like clang -fsanitize=thread.)
Fun fact: C++20 introduced std::stop_token for use as a stop or keep_running flag.
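A sketch of that C++20 facility; std::jthread hands a stop_token to its callable and requests stop and joins in its destructor, so no hand-rolled flag is needed:
#include <stop_token>
#include <thread>

int main()
{
    std::jthread worker([](std::stop_token st)
    {
        while (!st.stop_requested())
        {
            // Do stuff.
        }
    });

    worker.request_stop();  // also done implicitly by ~jthread()
}                           // ~jthread() joins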
First of all, stop.store(true, mo_relaxed) would be enough in this context.
launch_workers();
stop = true; // not relaxed
join_workers();
why does stop = true; in the main thread use seq_cst?
Herb does not mention the reason why he uses mo_seq_cst, but let's look at a few possibilities.
Based on the "not relaxed" comment, he is worried that stop.store(true, mo_relaxed) can be re-ordered with launch_workers() or join_workers().
Since launch_workers() is a release operation and join_workers() is an acquire operation, the ordering constraints for both will not prevent the store from moving in either direction.
However, it is important to notice that for this scenario, it does not really matter whether the store to stop uses mo_relaxed or mo_seq_cst.
Even with the strongest ordering, mo_seq_cst (which, in the absence of other SC operations, is no stronger than mo_release), the ordering rules still allow the re-ordering with join_workers().
Of course this reordering isn't going to happen, but my point is that stronger ordering constraints on the store aren't going to make a difference.
He could make the argument that a sequentially consistent (SC) store is an advantage since the thread performing the relaxed load will pick up the new value sooner (an SC store flushes the store buffer).
But this seems hardly relevant because the store is in between creating and joining threads, which is not in a tight loop, or as Herb puts it:
"..is it in a performance-critical region of code where this overhead matters?.."
He also says about the load: "..you don't care when it arrives.."
We don't know the real reason, but it is possibly based on the programming convention that you don't use explicit ordering parameters (which means you get mo_seq_cst) unless it makes a difference, and in this case, as Herb explains, only the relaxed load makes a difference.
For example, on the weakly ordered PowerPC platform, a load(mo_seq_cst) uses both the (expensive) sync and the (less expensive) isync instructions, a load(mo_acquire) still uses isync, and a load(mo_relaxed) uses none of them. In a tight loop, that is a good optimization.
Also worth mentioning is that on the mainstream x86 platform, there is no real difference in performance between load(mo_seq_cst) and load(mo_relaxed).
Personally I favor this programming style where ordering parameters are omitted when they don't matter and used when they make a difference.
stop.store(true); // ordering irrelevant, but uses SC
stop.store(true, memory_order_seq_cst); // store requires SC ordering (which is rare)
It's only a matter of style; for both stores, the compiler will generate the same assembly.

How does a mutex ensure a variable's value is consistent across cores?

If I have a single int which I want to write to from one thread and read from on another, I need to use std::atomic, to ensure that its value is consistent across cores, regardless of whether or not the instructions that read from and write to it are conceptually atomic. If I don't, it may be that the reading core has an old value in its cache, and will not see the new value. This makes sense to me.
If I have some complex data type that cannot be read/written to atomically, I need to guard access to it using some synchronisation primitive, such as std::mutex. This will prevent the object getting into (or being read from) an inconsistent state. This makes sense to me.
What doesn't make sense to me is how mutexes help with the caching problem that atomics solve. They seem to exist solely to prevent concurrent access to some resource, but not to propagate any values contained within that resource to other cores' caches. Is there some part of their semantics I've missed which deals with this?
The right answer to this is magic pixies - i.e. It Just Works. The implementation of std::atomic for each platform must do the right thing.
The right thing is a combination of 3 parts.
Firstly, the compiler needs to know that it can't move instructions across boundaries [in fact it can in some cases, but assume that it doesn't].
Secondly, the cache/memory subsystem needs to know - this is generally done using memory barriers, although x86/x64 generally have such strong memory guarantees that this isn't necessary in the vast majority of cases (which is a big shame, as it's nice for wrong code to actually go wrong).
Finally the CPU needs to know it cannot reorder instructions. Modern CPUs are massively aggressive at reordering operations and making sure in the single threaded case that this is unnoticeable. They may need more hints that this cannot happen in certain places.
For most CPUs part 2 and 3 come down to the same thing - a memory barrier implies both. Part 1 is totally inside the compiler, and is down to the compiler writers to get right.
See Herb Sutter's talk 'Atomic Weapons' for a lot more interesting info.
The consistency across cores is ensured by memory barriers (which also prevent instruction reordering). When you use std::atomic, not only do you access the data atomically, but the compiler (and library) also inserts the relevant memory barriers.
Mutexes work the same way: the mutex implementations (eg. pthreads or WinAPI or what not) internally also insert memory barriers.
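As a minimal illustration of "the mutex implementation inserts the barriers for you", here is a toy spinlock; real mutexes (pthreads, WinAPI) use the equivalent acquire/release ordering, plus blocking instead of spinning. Sketch only, not production code:
#include <atomic>

class SpinLock
{
    std::atomic_flag flag = ATOMIC_FLAG_INIT;

public:
    void lock()
    {
        // Acquire: memory accesses inside the critical section can't be
        // reordered before we own the lock.
        while (flag.test_and_set(std::memory_order_acquire))
            ;  // spin
    }

    void unlock()
    {
        // Release: memory accesses inside the critical section can't be
        // reordered after the lock is seen as free.
        flag.clear(std::memory_order_release);
    }
};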
Most modern multicore processors (including x86 and x64) are cache coherent. If two cores hold the same memory location in cache and one of them updates the value, the change is automatically propagated to other cores' caches. It's inefficient (writing to the same cache line at the same time from two cores is really slow) but without cache coherence it would be very difficult to write multithreaded software.
And as syam said, memory barriers are also required. They prevent the compiler or processor from reordering memory accesses, and also force the write out to memory (or at least to cache) when, for example, a variable is held in a register because of compiler optimizations.

Is memory reordering visible to other threads on a uniprocessor?

It's common for modern CPU architectures to employ performance optimizations that can result in out-of-order execution. In single-threaded applications memory reordering may also occur, but it's invisible to the programmer, as if memory were accessed in program order. And for SMP, memory barriers come to the rescue; they are used to enforce some sort of memory ordering.
What I'm not sure about is multi-threading on a uniprocessor. Consider the following example: when thread 1 runs, the store to f could take place before the store to x. Say a context switch happens after f is written and right before x is written. Now thread 2 starts to run, ends its loop, and prints 0, which is undesirable of course.
// Both x, f are initialized w/ 0.
// Thread 1
x = 42;
f = 1;
// Thread 2
while (f == 0)
;
print x;
Is the scenario described above possible? Or is there a guarantee that physical memory is committed during thread context switch?
According to this wiki,
When a program runs on a single-CPU machine, the hardware performs the necessary bookkeeping to ensure that the program execute as if all memory operations were performed in the order specified by the programmer (program order), so memory barriers are not necessary.
Although it didn't explicitly mention uniprocessor multi-threaded applications, it includes this case.
I'm not sure whether that's correct/complete or not. Note that this may depend highly on the hardware (weak/strong memory model), so you may want to state which hardware you have in mind in your answers. Thanks.
PS. Device I/O, etc. is not my concern here. And it's a single-core uniprocessor.
Edit: Thanks Nitsan for the reminder; we assume no compiler reordering here (just hardware reordering), and the loop in thread 2 is not optimized away. Again, the devil is in the details.
As a C++ question the answer must be that the program contains a data race, so the behavior is undefined. In reality that means that it could print something other than 42.
That is independent of the underlying hardware. As has been pointed out, the loop can be optimized away and the compiler can reorder the assignments in thread 1, so that result can occur even on uniprocessor machines.
[I'll assume that with "uniprocessor" machine, you mean processors with a single core and hardware thread.]
You now say that you want to assume compiler reordering or loop elimination does not happen. With this, we have left the realm of C++ and are really asking about the corresponding machine instructions. If you want to eliminate compiler reordering, we can probably also rule out any form of SIMD instructions and consider only instructions operating on a single memory location at a time.
So essentially thread1 has two store instructions in the order store-to-x store-to-f, while thread2 has test-f-and-loop-if-not-zero (this may be multiple instructions, but involves a load-from-f) and then a load-from-x.
On any hardware architecture I am aware of or can reasonably imagine, thread 2 will print 42.
One reason is that, if instructions processed by a single processors are not sequentially consistent among themselves, you could hardly assert anything about the effects of a program.
The only event that could interfere here is an interrupt (as is used to trigger a preemptive context switch). A hypothetical machine that stores the entire state of its current execution pipeline upon an interrupt and restores it upon return from the interrupt could produce a different result, but such a machine is impractical and afaik does not exist. These operations would create quite a bit of additional complexity and/or require additional redundant buffers or registers, all for no good reason - except to break your program. Real processors either flush or roll back the current pipeline upon interrupt, which is enough to guarantee sequential consistency for all instructions on a single hardware thread.
And there is no memory model issue to worry about. The weaker memory models originate from the separate buffers and caches that separate the individual hardware processors from the main memory or nth-level cache they actually share. A single processor has no similarly partitioned resources and no good reason to have them for multiple (purely software) threads. Again there is no reason to complicate the architecture and waste resources to make the processor and/or memory subsystem aware of something like separate thread contexts, if there aren't separate processing resources (processors/hardware threads) to keep those resources busy.
Strong memory ordering executes memory access instructions in the exact same order as defined in the program; this is often referred to as "program ordering".
Weaker memory ordering may be employed to allow the processor to reorder memory accesses for better performance; this is often referred to as "processor ordering".
AFAIK, the scenario described above is NOT possible on the Intel IA-32 architecture, whose processor ordering outlaws such cases. The relevant rule is (Intel IA-32 Software Developer's Manual, Vol. 3A, §8.2 Memory Ordering):
writes are not reordered with other writes, with the exception of streaming stores, CLFLUSH and string operations.
To illustrate the rule, it gives an example similar to this:
memory location x, y, initialized to 0;
thread 1:
mov [x] 1
mov [y] 1
thread 2:
mov r1 [y]
mov r2 [x]
r1 == 1 and r2 == 0 is not allowed
In your example, thread 1 cannot store f before storing x.
@Eric, in response to your comments:
The fast string store instruction stosd may store data out of order within its own operation. In a multiprocessor environment, when a processor stores a string "str", another processor may observe str[1] being written before str[0], while the logical order presumes str[0] is written before str[1].
But these instructions are not reordered with any other stores, and must have precise exception handling. When an exception occurs in the middle of a stosd, the implementation may choose to delay it so that all out-of-order sub-stores (not necessarily the whole stosd instruction) commit before the context switch.
Edited to address the claims made as if this were a C++ question:
Even if this is considered in the context of C++, as I understand it, a standard-conforming compiler should NOT reorder the assignments of x and f in thread 1.
§1.9/14:
Every value computation and side effect associated with a full-expression is sequenced before every value computation and side effect associated with the next full-expression to be evaluated.
This isn't really a C or C++ question, since you've explicitly assumed no load/store re-ordering, which compilers for both languages are perfectly allowed to do.
Allowing that assumption for the sake of argument, note that the loop may anyway never exit, unless you either:
give the compiler some reason to believe f may change (e.g. by passing its address to some non-inlineable function which could modify it),
mark it volatile, or
make it an explicitly atomic type and request acquire semantics (see the sketch after this list).
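A sketch of that third option, reusing the question's x and f (the function names are just for illustration):
#include <atomic>

int x = 0;
std::atomic<int> f{0};

void thread1()
{
    x = 42;
    f.store(1, std::memory_order_release);  // publish x
}

void thread2()
{
    while (f.load(std::memory_order_acquire) == 0)
        ;  // the load can't be hoisted out of the loop
    // Here x is guaranteed to be 42.
}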
On the hardware side, your worry about physical memory being "committed" during a context switch isn't an issue. Both software threads share the same memory hardware and cache, so there's no risk of inconsistency there whatever consistency/coherence protocol pertains between cores.
Say both stores were issued, and the memory hardware decides to re-order them. What does this really mean? Perhaps f's address is already in cache, so it can be written immediately, but x's store is deferred until that cache line is fetched. Well, a read from x is dependent on the same address, so either:
the load can't happen until the fetch happens, in which case a sane implementation must issue the queued store before the queued load
or the load can peek into the queue and fetch x's value without waiting for the write
Consider anyway that the kernel pre-emption required to switch threads will itself issue whatever load/store barriers are required for consistency of the kernel scheduler state, and it should be obvious that hardware re-ordering can't be a problem in this situation.
The real issue (which you're trying to avoid) is your assumption that there is no compiler re-ordering: this is simply wrong.
You would only need a compiler fence. From the Linux kernel docs on Memory Barriers (link):
SMP memory barriers are reduced to compiler barriers on uniprocessor compiled systems because it is assumed that a CPU will appear to be self-consistent, and will order overlapping accesses correctly with respect to itself.
To expand on that, the reason why synchronization is not required at the hardware level is that:
All threads on a uniprocessor system share the same memory and thus there are no cache-coherency issues (such as propagation latency) that can occur on SMP systems, and
Any out-of-order load/store instructions in the CPU's execution pipeline would either be committed or rolled back in full if the pipeline is flushed due to a preemptive context switch.
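In C++ terms, the compiler-only barrier mentioned above can be spelled std::atomic_signal_fence (or the GCC/Clang idiom asm volatile("" ::: "memory")); in practice it constrains compiler reordering but emits no barrier instruction. A sketch, setting aside the data-race UB that the other answers point out for plain int:
#include <atomic>

int x = 0;
int f = 0;

void thread1()
{
    x = 42;
    // Compiler barrier only: prevents the compiler from reordering the two
    // assignments, but emits no CPU fence instruction.
    std::atomic_signal_fence(std::memory_order_seq_cst);
    // GCC/Clang equivalent:  asm volatile("" ::: "memory");
    f = 1;
}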
This code may well never finish (in thread 2), as the compiler can decide to hoist the whole expression out of the loop (this is similar to using an isRunning flag which is not volatile).
That said, you need to worry about 2 types of re-ordering here: compiler and CPU; both are free to move the stores around. See here: http://preshing.com/20120515/memory-reordering-caught-in-the-act for an example. At this point the code you describe above is at the mercy of the compiler, compiler flags, and the particular architecture. The wiki quoted is misleading, as it may suggest internal re-ordering is not at the mercy of the CPU/compiler, which is not the case.
As far as the x86 is concerned, the out-of-order-stores are made consistent from the viewpoint of the executing code with regards to program flow. In this case, "program flow" is just the flow of instructions that a processor executes, not something constrained to a "program running in a thread". All the instructions necessary for context switching, etc. are considered part of this flow so the consistency is maintained across threads.
A context switch has to store the complete machine state so that it can be restored before the suspended thread resumes execution. The machine state includes the processor registers but not the processor pipeline.
If you assume no compiler reordering, this means that all hardware instructions that are "in flight" have to be completed before a context switch (i.e. an interrupt) takes place, otherwise they get lost and are not stored by the context-switch mechanism. This is independent of hardware reordering.
In your example, even if the processor swaps the two hardware instructions "x=42" and "f=1", the instruction pointer is already past the second one, and therefore both instructions must be completed before the context switch begins. If it were not so, then since the contents of the pipeline and of the cache are not part of the "context", they would be lost.
In other words, if the interrupt that causes the ctx switch happens when the IP register points at the instruction following "f=1", then all instructions before that point must have completed all their effects.
From my point of view, the processor fetches instructions one by one.
In your case, if "f = 1" was speculatively executed before "x = 42", that means both of these instructions are already in the processor's pipeline. The only way the current thread can be scheduled out is an interrupt, but the processor (at least on x86) will flush the pipeline's instructions before serving the interrupt.
So no need to worry about the reordering in a uniprocessor.

Compare and swap C++0x

From the C++0x proposal on C++ Atomic Types and Operations:
29.1 Order and Consistency [atomics.order]
Add a new sub-clause with the following paragraphs.
The enumeration memory_order specifies the detailed regular (non-atomic) memory synchronization order as defined in [the new section added by N2334 or its adopted successor] and may provide for operation ordering. Its enumerated values and their meanings are as follows.
memory_order_relaxed
The operation does not order memory.
memory_order_release
Performs a release operation on the affected memory locations, thus making regular memory writes visible to other threads through the atomic variable to which it is applied.
memory_order_acquire
Performs an acquire operation on the affected memory locations, thus making regular memory writes in other threads released through the atomic variable to which it is applied, visible to the current thread.
memory_order_acq_rel
The operation has both acquire and release semantics.
memory_order_seq_cst
The operation has both acquire and release semantics, and in addition, has sequentially-consistent operation ordering.
Lower in the proposal:
bool A::compare_swap( C& expected, C desired,
memory_order success, memory_order failure ) volatile
where one can specify memory order for the CAS.
My understanding is that “memory_order_acq_rel” will only necessarily synchronize those memory locations which are needed for the operation, while other memory locations may remain unsynchronized (it will not behave as a memory fence).
Now, my question is - if I choose “memory_order_acq_rel” and apply compare_swap to integral types, for instance, integers, how is this typically translated into machine code on modern consumer processors such as a multicore Intel i7? What about the other commonly used architectures (x64, SPARC, ppc, arm)?
In particular (assuming a concrete compiler, say gcc):
How to compare-and-swap an integer location with the above operation?
What instruction sequence will such a code produce?
Is the operation lock-free on i7?
Will such an operation run a full cache coherence protocol, synchronizing caches of different processor cores as if it were a memory fence on i7? Or will it just synchronize the memory locations needed by this operation?
Related to previous question - is there any performance advantage to using acq_rel semantics on i7? What about the other architectures?
Thanks for all the answers.
The answer here is not trivial. Exactly what happens, and what is meant, depends on many things. For a basic understanding of cache coherence and memory, perhaps my recent blog entries might be helpful:
CPU Reordering – What is actually being reordered?
CPU Memory – Why do I need a mutex?
But that aside, let me try to answer a few questions. First off, the function below is being very hopeful about what is supported: very fine-grained control over exactly how strong a memory-order guarantee you get. That's reasonable for compile-time reordering, but often not for runtime barriers.
compare_swap( C& expected, C desired,
memory_order success, memory_order failure )
Architectures won't all be able to implement this exactly as you requested; many will have to strengthen it to something strong enough that they can implement. When you specify memory_order you are specifying how reordering may work. To use Intel's terms, you are specifying what type of fence you want; there are three of them: the full fence, the load fence, and the store fence. (But on x86, load fence and store fence are only useful with weakly-ordered instructions like NT stores; atomics don't use them. Regular load/store give you everything except that stores can appear after later loads.) Just because you want a particular fence on that operation doesn't mean it is supported, in which case I'd hope it always falls back to a full fence. (See Preshing's article on memory barriers.)
An x86 (including x64) compiler will likely use the LOCK CMPXCHG instruction to implement the CAS, regardless of memory ordering. This implies a full barrier; x86 doesn't have a way to make a read-modify-write operation atomic without a lock prefix, which is also a full barrier. Pure-store and pure-load can be atomic "on their own", with many ISAs needing barriers for anything above mo_relaxed, but x86 does acq_rel "for free" in asm.
This instruction is lock-free, although all cores trying to CAS the same location will contend for access to it so you could argue it's not really wait-free. (Algorithms that use it might not be lock-free, but the operation itself is wait-free, see wikipedia's non-blocking algorithm article). On non-x86 with LL/SC instead of locked instructions, C++11 compare_exchange_weak is normally wait-free but compare_exchange_strong requires a retry loop in case of spurious failure.
Now that C++11 has existed for years, you can look at the asm output for various architectures on the Godbolt compiler explorer.
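As a concrete example of the interface in its final C++11 spelling, here is a CAS retry loop with explicit success/failure orders; on x86 each attempt typically compiles to a lock cmpxchg, on LL/SC machines to a load-linked/store-conditional pair. The fetch_max name is just illustrative:
#include <atomic>

// Atomically raise *a to at least v; returns the previously observed value.
int fetch_max(std::atomic<int>& a, int v)
{
    int expected = a.load(std::memory_order_relaxed);
    while (expected < v &&
           !a.compare_exchange_weak(expected, v,
                                    std::memory_order_acq_rel,   // on success
                                    std::memory_order_relaxed))  // on failure
    {
        // On failure, compare_exchange_weak reloads 'expected' with the
        // current value; keep trying until we win or the value is >= v.
    }
    return expected;
}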
In terms of memory sync you need to understand how cache coherence works (my blog may help a bit). New CPUs use a ccNUMA architecture (previously SMP). Essentially the "view" of memory never gets out of sync. The fences used in the code don't actually force any flushing of caches per se, only of the store buffer, committing in-flight stores to cache before later loads.
If two cores both have the same memory location cached in a cache line, a store by one core will get exclusive ownership of the cache line (invalidating all other copies) and mark its own copy as dirty. A very simple explanation for a very complex process.
To answer your last question you should always use the memory semantics that you logically need to be correct. Most architectures won't support all the combinations you use in your program. However, in many cases you'll get great optimizations, especially in cases where the order you requested is guaranteed without a fence (which is quite common).
-- Answers to some comments:
You have to distinguish between what it means to execute a write instruction and to write to a memory location. This is what I attempt to explain in my blog post. By the time the "0" is committed to 0x100, all cores see that zero. Writing integers is also atomic; that is, even without a lock, when you write to a location all cores will immediately have that value if they wish to use it.
The trouble is that to use the value you have likely loaded it into a register first, and any changes to the location after that obviously won't touch the register. This is why one needs mutexes or atomic<T> despite cache-coherent memory: the compiler is allowed to keep plain variable values in private registers. (In C++11, that's because a data race on non-atomic variables is Undefined Behaviour.)
As to contradictory claims, generally you'll see all sorts of claims. Whether they are contradictory comes right down to exactly what "see", "load", and "execute" mean in context. If you write "1" to 0x100, does that mean you executed the write instruction or did the CPU actually commit that value? The difference created by the store buffer is one major cause of reordering (the only one x86 allows). The CPU can delay writing the "1", but you can be sure that the moment it does finally commit that "1" all cores see it. The fences control this ordering by making the thread wait until a store commits before doing later operations.
Your whole worldview seems off base: your question insinuates that cache consistency is controlled by memory orders at the C++ level and fences or atomic operations at the CPU level.
But cache consistency is one of the most important invariants of the physical architecture, and it's provided at all times by the memory system that consists of the interconnection of all CPUs and the RAM. You can never beat it from code running on a CPU, or even see its detail of operation. Of course, by observing RAM directly and running code elsewhere you might see stale data at some level of memory: by definition the RAM doesn't have the newest value of all memory locations.
But code running on a CPU can't access DRAM directly, only through the memory hierarchy which includes caches that communicate with each other to maintain coherency of this shared view of memory. (Typically with MESI). Even on a single core, a write-back cache lets DRAM values be stale, which can be an issue for non-cache-coherent DMA but not for reading/writing memory from a CPU.
So the issue exists only for external devices, and only ones that do non-coherent DMA. (DMA is cache-coherent on modern x86 CPUs; the memory controller being built-in to the CPU makes this possible).
Will such an operation run a full cache coherence protocol, synchronizing caches of different processor cores as if it were a memory fence on i7?
They are already synchronized. See Does a memory barrier ensure that the cache coherence has been completed? - memory barriers only do local things inside the core running the barrier, like flush the store buffer.
Or will it just synchronize the memory locations needed by this operation?
An atomic operation applies to exactly one memory location. What other locations do you have in mind?
On a weakly-ordered CPU, a memory_order_relaxed atomic increment could avoid making earlier loads/stores visible before that increment. But x86's strongly-ordered memory model doesn't allow that.
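The canonical use of that freedom is a plain statistics counter, where nothing else is published alongside the increment (the names here are hypothetical):
#include <atomic>

std::atomic<long> event_count{0};

void count_event()
{
    // Atomic and never torn, but imposes no ordering on surrounding code.
    // On x86 the lock prefix still acts as a full barrier anyway; on weakly
    // ordered ISAs this is genuinely cheaper than seq_cst.
    event_count.fetch_add(1, std::memory_order_relaxed);
}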

Thread Synchronisation 101

Previously I've written some very simple multithreaded code, and I've always been aware that at any time there could be a context switch right in the middle of what I'm doing, so I've always guarded access to the shared variables through a CCriticalSection class that enters the critical section on construction and leaves it on destruction. I know this is fairly aggressive, and I enter and leave critical sections quite frequently and sometimes egregiously (e.g. at the start of a function when I could put the CCriticalSection inside a tighter code block), but my code doesn't crash and it runs fast enough.
At work my multithreaded code needs to be tighter, only locking/synchronising at the lowest level needed.
At work I was trying to debug some multithreaded code, and I came across this:
EnterCriticalSection(&m_Crit4);
m_bSomeVariable = true;
LeaveCriticalSection(&m_Crit4);
Now, m_bSomeVariable is a Win32 BOOL (not volatile), which as far as I know is defined to be an int, and on x86 reading and writing these values is a single instruction, and since context switches occur on instruction boundaries there's no need to synchronise this operation with a critical section.
I did some more research online to see whether this operation really needs no synchronisation, and I came up with two scenarios where it does:
The CPU implements out of order execution or the second thread is running on a different core and the updated value is not written into RAM for the other core to see; and
The int is not 4-byte aligned.
I believe number 1 can be solved using the "volatile" keyword. In VS2005 and later the C++ compiler surrounds access to this variable with memory barriers, ensuring that the variable is always completely written/read to main system memory before being used.
Number 2 I cannot verify; I don't know why the byte alignment would make a difference. I don't know the x86 instruction set, but does mov need to be given a 4-byte-aligned address? If not, do you need to use a combination of instructions? That would introduce the problem.
So...
QUESTION 1: Does using the "volatile" keyword (implicitly using memory barriers and hinting to the compiler not to optimise this code) absolve a programmer from the need to synchronise a 4-byte/8-byte variable on x86/x64 between read/write operations?
QUESTION 2: Is there the explicit requirement that the variable be 4-byte/8-byte aligned?
I did some more digging into our code and the variables defined in the class:
class CExample
{
private:
    CRITICAL_SECTION m_Crit1; // Protects variable a
    CRITICAL_SECTION m_Crit2; // Protects variable b
    CRITICAL_SECTION m_Crit3; // Protects variable c
    CRITICAL_SECTION m_Crit4; // Protects variable d
    // ...
};
Now, to me this seems excessive. I thought critical sections synchronised threads within a process, so if you've got one you can enter it and no other thread in that process can execute. There is no need for a critical section for each variable you want to protect; if you're in a critical section then nothing else can interrupt you.
I think the only thing that can change the variables from outside a critical section is if the process shares a memory page with another process (can you do that?) and the other process starts to change the values. Mutexes would also help here - are named mutexes shared across all processes, or only between processes of the same name?
QUESTION 3: Is my analysis of critical sections correct, and should this code be rewritten to use mutexes? I have had a look at other synchronisation objects (semaphores and spinlocks), are they better suited here?
QUESTION 4: Where are critical sections/mutexes/semaphores/spinlocks best suited? That is, which synchronisation problem should each be applied to? Is there a vast performance penalty for choosing one over the other?
And while we're on it, I read that spinlocks should not be used in a single-core multithreaded environment, only a multi-core multithreaded environment. So, QUESTION 5: Is this wrong, or if not, why is it right?
Thanks in advance for any responses :)
1) No. volatile just says to re-load the value from memory each time; it is STILL possible for it to be half updated.
Edit:
2) Windows provides some atomic functions. Look up the "Interlocked" functions.
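A sketch of the Interlocked family in use (Win32; the variable and function names are illustrative):
#include <windows.h>

volatile LONG g_flag = 0;

void set_flag()
{
    // Atomic read-modify-write; also acts as a full memory barrier.
    InterlockedExchange(&g_flag, 1);
}

bool try_take_ownership(volatile LONG* owner, LONG me)
{
    // Atomically: if (*owner == 0) *owner = me; returns the previous value.
    return InterlockedCompareExchange(owner, me, 0) == 0;
}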
The comments led me to do a bit more reading up. If you read through the Intel System Programming Guide you can see that aligned reads and writes ARE atomic.
8.1.1 Guaranteed Atomic Operations
The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:
• Reading or writing a byte
• Reading or writing a word aligned on a 16-bit boundary
• Reading or writing a doubleword aligned on a 32-bit boundary
The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:
• Reading or writing a quadword aligned on a 64-bit boundary
• 16-bit accesses to uncached memory locations that fit within a 32-bit data bus
The P6 family processors (and newer processors since) guarantee that the following additional memory operation will always be carried out atomically:
• Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line
Accesses to cacheable memory that are split across bus widths, cache lines, and page boundaries are not guaranteed to be atomic by the Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors. The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, and P6 family processors provide bus control signals that permit external memory subsystems to make split accesses atomic; however, nonaligned data accesses will seriously impact the performance of the processor and should be avoided.
An x87 instruction or an SSE instruction that accesses data larger than a quadword may be implemented using multiple memory accesses. If such an instruction stores to memory, some of the accesses may complete (writing to memory) while another causes the operation to fault for architectural reasons (e.g. due to a page-table entry that is marked "not present"). In this case, the effects of the completed accesses may be visible to software even though the overall instruction caused a fault. If TLB invalidation has been delayed (see Section 4.10.3.4), such page faults may occur even if all accesses are to the same page.
So basically yes: if you do an 8-bit read/write from any address, a 16-bit read/write from a 16-bit-aligned address, etc., you ARE getting atomic operations. It's also interesting to note that you can do unaligned memory reads/writes within a cache line on a modern machine. The rules seem quite complex though, so I wouldn't rely on them if I were you. Cheers to the commenters - that's a good learning experience for me :)
3) A critical section will spin trying to acquire its lock a few times and then falls back to blocking on a mutex. Spin-locking can burn CPU power doing nothing, and a mutex can take a while to do its stuff. Critical sections are a good choice if you can't use the interlocked functions.
4) There are performance penalties for choosing one over another. It's a pretty big ask to go through the benefits of everything here. The MSDN help has lots of good information on each of these. I suggest reading them:
Semaphores
Critical Sections & Spin locks
Events
Mutexes
5) You can use a spin lock in a single-core environment; it's not usually necessary though, as thread management means that you can't have 2 processors accessing the same data simultaneously. It just isn't possible.
1: Volatile in itself is practically useless for multithreading. It guarantees that the read/write will be executed, rather than storing the value in a register, and it guarantees that the read/write won't be reordered with respect to other volatile reads/writes. But it may still be reordered with respect to non-volatile ones, which is basically 99.9% of your code. Microsoft have redefined volatile to also wrap all accesses in memory barriers, but that is not guaranteed to be the case in general. It will just silently break on any compiler which defines volatile as the standard does. (The code will compile and run, it just won't be thread-safe any longer)
Apart from that, reads/writes to integer-sized objects are atomic on x86 as long as the object is well aligned. (You have no guarantee of when the write will occur though. The compiler and CPU may reorder it, so it's atomic, but not thread-safe)
2: Yes, the object has to be aligned for the read/write to be atomic.
3: Not really. Only one thread can execute code inside a given critical section at a time. Other threads can still execute other code. So you can have four variables each protected by a different critical section. If they all shared the same critical section, I'd be unable to manipulate object 1 while you're manipulating object 2, which is inefficient and constrains parallelism more than necessary. If they are protected by different critical sections, we just can't both manipulate the same object simultaneously.
4: Spinlocks are rarely a good idea. They are useful if you expect a thread to have to wait only a very short time before being able to acquire the lock, and you absolutely need minimal latency. A spinlock avoids the OS context switch, which is a relatively slow operation. Instead, the thread just sits in a loop constantly polling a variable. Hence higher CPU usage (the core isn't freed up to run another thread while waiting for the spinlock), but the thread will be able to continue as soon as the lock is released.
As for the others, the performance characteristics are pretty much the same: just use whichever has the semantics best suited for your needs. Typically critical sections are most convenient for protecting shared variables, and mutexes can be easily used to set a "flag" allowing other threads to proceed.
As for not using spinlocks in a single-core environment: remember that the spinlock doesn't actually yield. Thread A waiting on a spinlock isn't actually put on hold, allowing the OS to schedule thread B to run. But since A is waiting on this spinlock, some other thread is going to have to release that lock. If you only have a single core, then that other thread will only be able to run when A is switched out. With a sane OS, that's going to happen sooner or later anyway as part of the regular context switching. But since we know that A won't be able to get the lock until B has had time to execute and release the lock, we'd be better off if A just yielded immediately, was put in a wait queue by the OS, and was restarted when B has released the lock. And that's what all other lock types do.
A spinlock will still work in a single core environment (assuming an OS with preemptive multitasking), it'll just be very very inefficient.
Q1: Using the "volatile" keyword
In VS2005 and later the C++ compiler surrounds access to this variable using memory barriers, ensuring that the variable is always completely written/read to the main system memory before using it.
Exactly. If you are not creating portable code, Visual Studio implements it exactly this way. If you want to be portable, your options are currently "limited". Until C++0x there is no portable way to specify atomic operations with guaranteed read/write ordering, and you need to implement per-platform solutions. That said, Boost has already done the dirty job for you, and you can use its atomic primitives.
Q2: Variable needs to be 4-byte/8-byte aligned?
If you do keep them aligned, you are safe. If you do not, rules are complicated (cache lines, ...), therefore the safest way is to keep them aligned, as this is easy to achieve.
Q3: Should this code be rewritten to use mutexes?
A critical section is a lightweight mutex. Unless you need to synchronize between processes, use critical sections.
Q4: Where are critical sections/mutexes/semaphores/spinlocks best suited?
Critical sections can even do spin waits for you.
Q5: Spinlocks should not be used in a single-core
A spin lock uses the fact that while the waiting CPU is spinning, another CPU may release the lock. This cannot happen with only one CPU, therefore it is just a waste of time there. On multi-CPU systems spin locks can be a good idea, but it depends on how often the spin wait will be successful. The idea is that waiting for a short while is a lot faster than doing a context switch there and back again; therefore, if the wait is likely to be short, it is better to wait.
Don't use volatile. It has virtually nothing to do with thread-safety. See here for the low-down.
The assignment to BOOL doesn't need any synchronisation primitives. It'll work fine without any special effort on your part.
If you want to set the variable and then make sure that another thread sees the new value, you need to establish some kind of communication between the two threads. Just locking immediately before assigning achieves nothing because the other thread might have come and gone before you acquired the lock.
One final word of caution: threading is extremely hard to get right. The most experienced programmers tend to be the least comfortable with using threads, which should set alarm bells ringing for anyone who is inexperienced with their use. I strongly suggest you use some higher-level primitives to implement concurrency in your app. Passing immutable data structures via synchronised queues is one approach that substantially reduces the danger.
Volatile does not imply memory barriers.
It only means that it will be part of the perceived state of the memory model. The implication of this is that the compiler cannot optimize the variable away, nor can it perform operations on the variable only in CPU registers (it will actually load and store to memory).
As there are no memory barriers implied, the compiler can reorder instructions at will. The only guarantee is that the order in which different volatile variables are read/written will be the same as in the code:
void test()
{
    volatile int a;
    volatile int b;
    int c;
    c = 1;
    a = 5;
    b = 3;
}
With the code above (assuming that c is not optimized away) the update to c can happen before or after the updates to a and b, giving 3 possible outcomes. The a and b updates are guaranteed to be performed in order. c can be optimized away easily by any compiler. With enough information, the compiler can even optimize away a and b (if it can be proven that no other threads read the variables and that they are not bound to a hardware address), in which case they can in fact be removed. Note that the standard does not require a specific behavior, only a perceivable state, via the as-if rule.
Question 3: CRITICAL_SECTIONs and Mutexes work, pretty much, the same way. A Win32 mutex is a kernel object, so it can be shared between processes and waited on with WaitForMultipleObjects, which you can't do with a CRITICAL_SECTION. On the other hand, a CRITICAL_SECTION is lighter-weight and therefore faster. But the logic of the code should be unaffected by which you use.
You also commented that "there is no need for a critical section for each variable you want to protect, if you're in a critical section then nothing else can interrupt you." This is true, but the tradeoff is that accesses to any of the variables would need you to hold that lock. If the variables can meaningfully be updated independently, you are losing an opportunity for parallelising those operations. (Since these are members of the same object, though, I would think hard before concluding that they can really be accessed independently of each other.)
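For comparison, the same per-variable locking pattern in portable modern C++ would look roughly like this (a sketch with hypothetical members); whether a and b really deserve separate locks is exactly the design question discussed above:
#include <mutex>

class CExample
{
    std::mutex m_MutexA;  // protects a
    std::mutex m_MutexB;  // protects b
    int a = 0;
    int b = 0;

public:
    void SetA(int value)
    {
        std::lock_guard<std::mutex> lock(m_MutexA);  // unlocks automatically
        a = value;
    }

    void SetB(int value)
    {
        std::lock_guard<std::mutex> lock(m_MutexB);
        b = value;
    }
};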