Why do memory barriers depend upon a variable? - c++

After doing some research regarding the overall purpose of memory barriers/fences, (at least I think that) I have a basic understanding of what they are for. During my research, I focused on the abstraction C++ makes as barriers seem to be hardware-specific and C++ is a low-level, yet universally usable language. However, there's one detail in the semantics of the standard library's abstraction of memory barriers that makes me question my understanding of them.
Regarding the C++ memory barrier memory_order_acq_rel, the documentation states (similar behaviour applies to the other barriers as well):
All writes in other threads that release the same atomic variable are visible before the modification and the modification is visible in other threads that acquire the same atomic variable.
On the processor level (as this restriction wouldn't exist without corresponding hardware restrictions): why does the specification of a particular variable matter if all previous changes are affected? For instance, the cache has to be flushed either way, doesn't it? What are the key advantages of this approach?
Thank you in advance.

Using atomic variables as a means to control memory barriers is just one way C++ gives you this control. (Probably the most commonly used way, I might add.)
You don't need to use them though.
You can call functions such as std::atomic_thread_fence and std::atomic_signal_fence which implement the memory barrier semantics without any associated variable.
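For illustration, a minimal sketch of the fence-based approach (the payload/ready names are mine, not from any particular codebase): a release fence before a relaxed store, paired with an acquire fence after a relaxed load, publishes the non-atomic payload without putting the ordering on the atomic operations themselves.

#include <atomic>

int payload = 0;                      // plain, non-atomic data
std::atomic<bool> ready{false};       // hypothetical flag used only to signal

void writer() {
    payload = 42;
    std::atomic_thread_fence(std::memory_order_release);  // fence, not tied to a variable
    ready.store(true, std::memory_order_relaxed);
}

void reader() {
    while (!ready.load(std::memory_order_relaxed)) { }    // spin until signalled
    std::atomic_thread_fence(std::memory_order_acquire);  // pairs with the release fence
    // payload is guaranteed to be 42 here
}

Note that even with standalone fences, some atomic object (here ready) still has to carry the signal between the two threads; the fences only move the ordering constraint off the individual load and store.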

Generally, you should adhere to the C++20 memory model. Nvidia developers found a bug in the former model (they composed fully legal C++ code that follows the standard's rules yet results in UB, a data race, due to issues in the memory model), and I heard there were some other issues as well. Furthermore, C++ strives to be a general language usable across a wide spectrum of devices, so some rules might be meaningless for certain devices and extremely important for others.
I am unsure about the implementation details and what the processor actually needs to do. However, besides dictating processor actions on the atomic variable, it also informs the compiler about which optimizations are allowed and which are forbidden. E.g., local variables that are logically inaccessible from other threads never need to be reloaded, regardless of the actions performed on atomic variables.
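As a hedged illustration of that last point (the names here are hypothetical, not from the question): the acquire/release pair on one particular atomic is what obliges the compiler and hardware to make the shared payload visible, while values that are provably thread-local may stay in registers the whole time.

#include <atomic>

int shared_payload = 0;               // shared, non-atomic
std::atomic<bool> published{false};   // the "particular variable"

void producer() {
    shared_payload = 123;
    published.store(true, std::memory_order_release);   // publishes prior writes
}

int consumer() {
    int local_sum = 0;                // provably thread-local: free to live in a register
    for (int i = 0; i < 10; ++i)
        local_sum += i;
    if (published.load(std::memory_order_acquire))       // acquire on the same variable
        local_sum += shared_payload;                      // guaranteed to observe 123
    return local_sum;
}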

Related

Should I expect that a C++ compiler will compile multi-threaded code with a data race "as coded", or might it compile it into something else?

Let's say I have hardware on which all memory accesses to values no larger than the size of a bool are thread-safe, and consistency issues with regard to caching are avoided by the hardware or by the code.
Should I expect that non-atomic accesses from multiple threads to the same objects will be compiled just “as coded”, so that I get a thread-safe program for that platform?
Before C++11, the language standard simply did not address multi-threading at all, and it was not possible to create portable (standard-conforming) multi-threaded C++ programs. One had to use third-party libraries, and thread-safety of the program at the code level could be provided only by the internals of those libraries, which in turn used the corresponding platform features, while compilers compiled the code just as if it were single-threaded.
Since C++11, according to the standard:
two expression evaluations conflict if one of them modifies a memory location and the other one reads or modifies the same memory location.
two actions are potentially concurrent if
-- they are performed by different threads, or
-- they are unsequenced, at least one is performed by a signal handler, and they are not both performed by the same signal handler invocation;
the execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, except for the special case for signal handlers described in the standard ([intro.races] section 22 point for C++20: https://timsong-cpp.github.io/cppwp/n4868/intro.races#22).
any such data race results in undefined behavior.
An atomic operation is indivisible with regards to any other atomic operation that involves the same object.
An operation happening before another one means that the memory writes of the first operation take effect for the reads of the second.
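A minimal sketch of such a data race (my own example, with hypothetical names): two potentially concurrent, conflicting, non-atomic accesses with no happens-before relation between them.

#include <thread>

int counter = 0;   // non-atomic, shared between threads

int main() {
    std::thread t1([] { ++counter; });   // modifies the memory location
    std::thread t2([] { ++counter; });   // conflicting, potentially concurrent modification
    t1.join();
    t2.join();
    // Undefined behaviour: the program contains a data race, so nothing about
    // the final value of counter (or anything else) is guaranteed.
}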
According to the standard of the language, undefined behaviour is just that for which the standard imposes no requirements.
Some people wrongly consider undefined behaviour to be only something that occurs at run-time and is unrelated to compilation, but the standard uses undefined behaviour to regulate compilation as well, so that nothing specific can be expected from either compilation or, accordingly, execution in cases of undefined behaviour.
The language standard does not forbid compilers from diagnosing undefined behaviour.
The standard explicitly states that in the case of undefined behaviour, besides ignoring it with unpredictable results, an implementation is permitted to behave in a manner documented for the environment (including the compiler's documentation), literally anything as long as it is documented, both during translation and during execution, and also to terminate translation or execution (https://timsong-cpp.github.io/cppwp/n4868/intro.defs#defns.undefined).
So, a compiler is even permitted to generate senseless code for the cases of undefined behaviour.
A data race is not the state in which conflicting accesses to an object actually occur at the same time; it is the state in which code with even potential (environment-dependent) conflicting accesses to an object is being executed. Considering the opposite at the level of the language is impossible, because a hardware write to memory caused by an operation may be delayed for an unspecified time within the concurrent code (and, besides that, operations may be subject to various restrictions dispersed over the concurrent code by both the compiler and the hardware).
As for code which causes undefined behaviour only for some inputs (so that it may or may not happen in a given execution):
on the one hand, the as-if rule (https://en.cppreference.com/w/cpp/language/as_if) permits compilers to generate code that works correctly only for the inputs which do not cause undefined behaviour (for instance, code that issues a diagnostic message when an input causing undefined behaviour occurs; issuing diagnostic messages is explicitly noted as part of permissible undefined behaviour in the standard);
on the other hand, in practice a compiler often generates code as if such input could never happen; see examples of such behaviour at https://en.cppreference.com/w/cpp/language/ub
Note that, in contrast to potential data races (I use the word potential here because of the note marked with * below), the cases in the examples from the link are quite easy to detect during compilation.
If it were possible for a compiler to easily detect a data race, a reasonable compiler would just terminate compilation rather than compile anything, but:
On the one hand, [*] it is practically impossible to conclude that a data race is guaranteed to happen at run-time, simply because at run-time all but one of the concurrent code instances may fail to start for environmental reasons, which makes any multi-threaded code a priori potentially single-threaded and thus potentially free of data races at all (though in many cases this would break the semantics of the program, that is not the compiler's concern).
On the other hand, a compiler is permitted to inject code so that a data race is handled at run-time (note: not only for something sensible such as issuing a diagnostic message, but in any, though documented, even harmful, manner), but besides the fact that such injections would be a disputable overhead (even when used for something reasonable):
some potential data races cannot be detected at all because of separate compilation of translation units;
some potential data races may or may not exist in a specific execution depending on run-time input data, which would make correct injections monstrous;
it may be too complex and too expensive to detect data races even when detection is possible, because of complicated code constructs and program logic.
So, at present, it is normal for compilers to not even try to detect data races.
Besides data races themselves, code where data races are possible and which is compiled as if it were single-threaded has the following problems:
under the as-if rule (https://en.cppreference.com/w/cpp/language/as_if) a variable may be eliminated if it looks to the compiler as if that makes no difference, and compilers do not take multi-threading into account unless the specific multi-threading means of the language and its standard library are used;
operations may be reordered from how they “were coded”, both by the compiler under the as-if rule and by the hardware during execution, if it looks as if that makes no difference, unless the specific multi-threading means of the language and its standard library are used; note that hardware may implement various approaches to restricting such reordering, including requiring explicit corresponding instructions in the code;
The question specifies that the following point is not the case, but to complete the set of possible problems, the following is theoretically possible on some hardware:
though some people wrongly assume that a multi-core coherence mechanism always keeps data fully coherent (i.e. when an object is updated by one core, the other cores get the updated value when they read it), it is possible that a multi-core coherence mechanism does not perform some, or even all, of the coherence work by itself but only when triggered by corresponding instructions in the code, so that without those instructions the value written to an object gets stuck in the core's cache and reaches the other cores either never or later than appropriate.
Please note that appropriate use of a reasonably implemented (see the note marked with ** below for details) volatile modifier on variables, where the volatile modifier is possible for the type, solves the elimination problem and the reordering-by-the-compiler problem, but not the reordering-by-hardware or the “getting stuck” in cache problems.
[**] Regrettably, the language standard actually says “The semantics of an access through a volatile glvalue are implementation-defined” (https://timsong-cpp.github.io/cppwp/n4868/dcl.type.cv#5).
Though the standard notes that “volatile is a hint to the implementation to avoid aggressive optimization involving the object because the value of the object might be changed by means undetectable by an implementation.” (https://timsong-cpp.github.io/cppwp/n4868/dcl.type.cv#note-5), which would help to avoid elimination and reordering by the compiler if volatile were implemented as intended, i.e. correctly for values potentially accessed by the code's environment (for instance hardware, the operating system, or other applications), formally compilers are not obliged to implement volatile in correspondence with what it was intended for.
But, at the same time, modern versions of the standard note that “Furthermore, for some implementations, volatile might indicate that special hardware instructions are required to access the object.” (https://timsong-cpp.github.io/cppwp/n4868/dcl.type.cv#note-5), which means that some implementations might also prevent reordering by the hardware and prevent values from “getting stuck” in cache, though that is not what volatile was intended for.
With a guarantee (as far as the implementation conforms to the standard), all three problems, as well as the data race issue, can be solved only by using the specific multi-threading means, including the multi-threading part of the C++ standard library since C++11.
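As a sketch of what "using specific multi-threading means" looks like for a racy counter like the one sketched earlier (again my own example): making the counter a std::atomic removes the data race and, with it, the elimination, reordering and visibility concerns for these accesses.

#include <atomic>
#include <thread>

std::atomic<int> counter{0};   // atomic: concurrent accesses no longer constitute a data race

int main() {
    std::thread t1([] { counter.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([] { counter.fetch_add(1, std::memory_order_relaxed); });
    t1.join();
    t2.join();
    // counter is guaranteed to be 2: each increment is indivisible, the variable
    // cannot be optimized away, and its final value is visible here after the joins.
}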
So a portable, standard-conforming C++ program must protect its execution from any data races.
If a compiler compiles the code as if it were single-threaded (i.e. ignores the data race), a reasonably implemented (as noted in the note marked with ** above) volatile modifier is used appropriately, and there are no caching or reordering-by-hardware issues, then one gets thread-safe machine code without using data race protection (from environment-dependent C++ code that does not conform to the standard as of C++11).
As for examples of the potential safety of using a non-atomic bool flag from multiple threads in a specific environment, at https://en.cppreference.com/w/cpp/language/storage_duration#Static_local_variables you can read that implementations of the initialization of static local variables (since C++11) usually use variants of the double-checked locking pattern, which reduces the runtime overhead for already-initialized local statics to a single non-atomic boolean comparison.
But note that these solutions are environment-dependent, and since they are parts of the compiler implementations themselves, not of a program using the compilers, conformance to the standard is not a concern there.
To make your program conform to the language standard and be protected (as far as the compiler conforms to the standard) against the liberty of compiler implementation details, you must protect the flag of a double-checked lock from data races, and the most reasonable way to do so is to use std::atomic or std::atomic_bool.
See details regarding the implementation of the double-checked locking pattern in C++ (including using a non-atomic flag with a data race) in my answer https://stackoverflow.com/a/68974430/1790694 to the question Is there any potential problem with double-check lock for C++? (keep in mind that the code there contains multi-threading operations in the threads, which influence all the access operations in those threads, triggering memory coherence and preventing reordering, so that the code as a whole is a priori not compiled as if it were single-threaded).
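For reference, a hedged sketch (mine, not the code from the linked answer) of a double-checked lock whose published pointer is a std::atomic, as recommended above:

#include <atomic>
#include <mutex>

struct Widget { /* assume this is expensive to construct */ };

std::atomic<Widget*> instance{nullptr};
std::mutex init_mutex;

Widget* get_instance() {
    Widget* p = instance.load(std::memory_order_acquire);    // first check, without the lock
    if (p == nullptr) {
        std::lock_guard<std::mutex> lock(init_mutex);
        p = instance.load(std::memory_order_relaxed);         // second check, under the lock
        if (p == nullptr) {
            p = new Widget();
            instance.store(p, std::memory_order_release);     // publish the fully constructed object
        }
    }
    return p;
}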
If you have such hardware, then the answer is "yes". The question is, what is that hardware?
Suppose you had a single-core CPU - say, an 80486. Where, in such an architecture, might the value be? The answers are register, cache or RAM, depending on whether or not the value is about to be operated on.
The problem is, if you have a preemptive multi-threading operating system, you can't guarantee that, when a context switch happens, that the value has been flushed from registers to memory (cache / RAM). The value might be in a register as a result of an operation that has just produced the value as a result, and the preemption can happen before the next op code that would move it from the op's "result" register to memory. The preemptive switch to another thread would result in the new thread accessing the value from memory, which is stale.
So, that hardware is not any hardware that's been made in the past 40 years.
Conceivably it would be possible to have a CPU that has no registers, i.e. it's using RAM as its register set. However, no one has made one of those, because it would be very slow.
So in practice, there is no such hardware, so the answer is "no" it won't be thread safe.
You'd have to have something like a cooperative multitasking OS that ensured that the results of operations in registers got MOVed back to RAM before running a new thread.
It has for decades been common, and not astonishing, for compilers, even those intended to be suitable for multi-threaded or interrupt-based programming, to consolidate non-qualified accesses to objects when there are no intervening volatile-qualified accesses. The C Standard recognizes the possibility of an implementation treating all accesses as though volatile-qualified, but doesn't particularly recommend such treatment. As to whether volatile should be sufficient, that seems to be controversial.
Even before the publication of the first C++ Standard, the C Standard specified that the semantics of volatile are implementation-defined, thus allowing implementations designed to be suitable for multi-tasking or interrupt-based systems to provide semantics appropriate to that purpose without requiring special syntax, while allowing those that weren't intended to support such tasks to generate code that would be slightly more efficient when weaker semantics would suffice, but behave in broken fashion when stronger semantics were required.
While some people claim it was impossible to write portable multi-threaded code prior to the addition of atomics to the language standard, that ignores the fact that many people could and did write multi-threaded code which would be portable among all implementations for the intended target platform, whose designers made the semantics of volatile strong enough to support such code without requiring special syntax. The Standard didn't specify what implementations would need to do in order to be suitable for that purpose, because (1) it didn't require implementations to be suitable for such purpose, and (2) compiler writers were expected to know their customers' needs better than the Committee ever could.
Unfortunately, some compiler writers who were sheltered from normal market pressures have interpreted the Standard's failure to require that all implementations process volatile in a manner suitable for multi-threaded or interrupt-based programs without requiring special syntax as a judgment that no implementations should be expected to do so. Thus, there is a lot of code which would be reliable if processed by commercial implementations, but would not be processed reliably by compilers like clang or gcc which are designed to require special syntax when performing such tasks.

Are memory fence and memory barrier the same?

Here I am confused by the term memory fence (the fence function in Rust). I can clearly understand what a memory barrier is in terms of atomics, but I was unable to figure out what a memory fence is.
Are memory fence and memory barrier the same? If not, what is the difference, and when should a memory fence be used over a memory barrier?
A "fence" in this context is a kind of memory barrier. This distinction is important. For the purposes of this discussion I'll distinguish informally between three kinds of beasts:
Atomic fence: controls the order in which observers can see the effects of atomic memory operations. (This is what you asked about.)
More general memory barrier: controls the order of actual operations against memory or memory-mapped I/O. This is often a bigger hammer that can achieve similar results to an atomic fence, but at higher cost. (Depends on the architecture.)
Compiler fence: controls the order of instructions the processor receives. This is not what you asked about, but people often accidentally use this in place of a real barrier, which makes them sad later.
What fence is
Rust's std::sync::atomic::fence provides an atomic fence operation, which establishes synchronization with other atomic fences and atomic memory operations. The terms folks use for describing the various atomic conditions can be a little daunting at first, but they are pretty well defined in the docs, though at the time of this writing there are some omissions. Here are the docs I suggest reading if you want to learn more.
First, Rust's docs for the Ordering type. This is a pretty good description of how operations with different Ordering interact, with less jargon than a lot of references in this area (atomic memory orderings). However, at the time of this writing, it's misleading for your specific question, because it says things like
This ordering is only applicable for operations that can perform a store.
which ignores the existence of fence.
The docs for fence go a little ways to repair that. IMO the docs in this area could use some love.
However, if you want all the interactions precisely laid out, I'm afraid you must look to a different source: the equivalent C++ docs. I know, we're not writing C++, but Rust inherits a lot of this behavior from LLVM, and LLVM tries to follow the C++ standard here. The C++ docs are much higher in jargon, but if you read slowly it's not actually more complex than the Rust docs -- just jargony. The nice thing about the C++ docs is that they discuss each interaction case between load/store/fence and load/store/fence.
What fence is not
The most common place that I employ memory barriers is to reason about completion of writes to memory-mapped I/O in low level code, such as drivers. (This is because I tend to work low in the stack, so this may not apply to your case.) In this case, you are likely performing volatile memory accesses, and you want barriers that are stronger than what fence offers.
In particular, fence helps you reason about which atomic memory operations are visible to which other atomic memory operations -- it does not help you reason about whether a particular stored value has made it all the way through the memory hierarchy and onto a particular level of the bus, for instance. For cases like that, you need a different sort of memory barrier.
These are the sorts of barriers described in considerable detail in the Linux Kernel's documentation on memory barriers.
In response to another answer on this question that flat stated that fence and barrier are equivalent, I raised this case on the Rust Unsafe Code Guidelines issue tracker and got some clarifications.
In particular, you might notice that the docs for Ordering and fence make no mention of how they interact with volatile memory accesses, and that's because they do not. Or at least, they aren't guaranteed to -- on certain architectures the instructions that need to be generated are the same (ARM), and in other cases, they are not (PowerPC).
Rust currently provides a portable atomic fence (which you found), but does not provide portable versions of any other sort of memory barrier, like those provided in the Linux kernel. If you need to reason about the completion of (for example) volatile memory accesses, you will need either non-portable asm! or a function/macro that winds up producing it.
Aside: compiler fences
When I make statements like what I said above, someone inevitably hops in with (GCC syntax)
asm("" :::: memory);
This is neither an atomic fence nor a memory barrier: it is roughly equivalent to Rust's compiler_fence, in that it discourages the compiler from reordering memory accesses across that point in the generated code. It has no effect on the order that the instructions are started or finished by the machine.
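In C++ terms (a hedged sketch with made-up names, since the surrounding discussion is about Rust), the same distinction is roughly std::atomic_signal_fence versus std::atomic_thread_fence:

#include <atomic>

int data = 0;
std::atomic<bool> ready{false};

void publish() {
    data = 1;

    // Compiler-only fence: like the asm statement above or Rust's compiler_fence,
    // it stops the compiler from reordering memory accesses across this point,
    // but it emits no hardware barrier.
    std::atomic_signal_fence(std::memory_order_release);

    // Thread fence: like Rust's fence, it additionally orders the operations with
    // respect to other threads (it would pair with an acquire fence, or an acquire
    // load of ready, on the reading side).
    std::atomic_thread_fence(std::memory_order_release);

    ready.store(true, std::memory_order_relaxed);
}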
There is no difference.
"Fence" and "barrier" mean the same thing in this context.

Does c++11 atomic automatically solve multi-core races on variable read-write?

I know that atomic will apply a lock on a variable of type "T" when multiple threads are reading and writing it, making sure only one of them is doing the R/W at a time.
But in a multi-core computer, threads can run on different cores, and different cores have their own L1 and L2 caches while sharing the L3 cache. We know the C++ compiler will sometimes optimize a variable so that it is stored in a register, so if a variable is not stored in memory, there is no memory synchronization between the different cores' caches for that variable.
So my worry/question is: if an atomic variable is optimized into a register variable by the compiler, then it is not stored in memory, and when one core writes its value, another core could read out a stale value, right? Is there any guarantee of data consistency here?
Thanks.
Atomic doesn't "solve" things the way you vaguely describe. It provides certain very specific guarantees involving consistency of memory based on order.
Various compilers implement these guarantees in different ways on different platforms.
On x86/64 no locks are used for atomic integers and pointers up to a reasonable size. And the hardware provides stronger guarantees than the standard requires, making some of the more esoteric options equivalent to full consistency.
I won't be able to fully answer your question but I can point you in the right direction; the topic you need to learn about is "the C++ memory model".
That being said, atomics exist in order to avoid the exact problem you describe. If you ask for full memory order consistency, and thread A modifies X then Y, no other thread can see Y modified but not X. How that guarantee is provided is not specified by the C++ standard; cache line invalidation, using special instructions for access, barring certain register-based optimizations by the compiler, etc are all the kind of thing that compilers do.
Note that the C++ memory model was refined, bugfixed and polished for C++17 in order to describe the behaviour of the new parallel algorithms and permit their efficient implementation on GPU hardware (among other spots) with the right flags, and in turn it influenced the guarantees that new GPU hardware provides. So people talking about memory models may be excited and talk about more modern issues than your mainly C++11 concerns.
This is a big complex topic. It is really easy to write code you think is portable, yet only works on a specific platform, or only usually works on the platform you tested it on. But that is just because threading is hard.
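A hedged sketch of the "X then Y" guarantee mentioned above (the variable names are mine): with the default sequentially consistent ordering, a reader can never observe the second write without the first.

#include <atomic>

std::atomic<int> x{0}, y{0};   // plain store()/load() default to memory_order_seq_cst

void thread_a() {
    x.store(1);   // X is modified first...
    y.store(1);   // ...then Y
}

void thread_b() {
    if (y.load() == 1) {
        // Seeing the new Y guarantees the new X is visible as well;
        // observing y == 1 while x == 0 is impossible here.
        int observed_x = x.load();   // guaranteed to be 1
        (void)observed_x;
    }
}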
You may be looking for this:
[intro.progress]/18 An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.

Will other threads see a write to a `volatile` word-sized variable in reasonable time?

While asking about a more specific problem, I discovered that this is the core issue on which people are not exactly sure.
The following assumptions can be made:
The CPU uses a cache coherency protocol such as MESI(F) (examples: x86/x86_64 and ARMv7mp)
The variable is assumed to be of a size that the processor writes/reads atomically (aligned and of native word size)
The variable is declared volatile
The questions are:
If I write to the variable in one thread, will other threads see the change?
What is the order of magnitude of the timeframe in which the other threads will see the change?
Do you know of architectures where cache coherency is not enough to ensure cross-CPU / cross-core visibility?
The question is NOT:
Is it safe to use such a variable?
about reordering issues
about C++11 atomics
This might be considered a duplicate of In C/C++, are volatile variables guaranteed to have eventually consistent semantics betwen threads? and other similar questions, but I think none of these have those clear requirements regarding the target architecture which leads to a lot of confusion about differing assumptions.
Do you know of architectures where cache coherency is not enough to insure cross-cpu / cross-core visibility?
I"m not aware of any single processor with multiple cores that has cache coherency issues. It might be possible for someone to use the wrong type of processor in a multi-processor board, for example an Intel processor that has what Intel calls external QPI disabled, but this would cause all sorts of issues.
Wiki article about Intel's QPI and which processors have it enabled or disabled:
http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect
If I write to the variable in one thread, will other threads see the change?
There is no guarantee. If you think there is, show me where you found it.
What is the order of magnitude of the timeframe in which the other threads will see the change?
It can be never. There is no guarantee.
Do you know of architectures where cache coherency is not enough to insure cross-cpu / cross-core visibility?
This is an incoherent question because you are talking about operations in C++ code that has to be compiled into assembly code. Even if you have hardware guarantees that apply to assembly code, there's no guarantee those guarantees "pass through" to C++ code.
But to the extent the question can be answered, the answer is yes. Posted writes, read prefetching, and other kinds of caching (such as what compilers do with registers) exist in real platforms.
I'd say no, there is no guarantee. There are implementations using multiple, independent computers where shared data has to be transmitted over a (usually very fast) connection between computers. In that situation, you'd try to transmit data only when it is needed. This might be triggered by mutexes, for example, and by the standard atomic functions, but hopefully not by stores into arbitrary local memory, and maybe not by stores into volatile memory.
I may be wrong, but you'd have to prove me wrong.
Assuming modern x86/64:
If I write to the variable in one thread, will other threads see the change?
Yes. Assuming you use a modern and not very old / buggy compiler.
What is the order of magnitude of the timeframe in which the other threads will see the change?
It really depends how you measure.
Basically, this would be the memory latency time = 200 cycles on the same NUMA node. About double that on another node, on a 2-node box. It might differ on bigger boxes.
If your write gets reordered relative to the point of time measurement, you can get +/-50 cycles.
I measured this a few years back and got 60-70ns on 3GHz boxes and double that on the other node.
Do you know of architectures where cache coherency is not enough to insure cross-cpu / cross-core visibility?
I think the meaning of cache coherency is visibility. Having said that, I'm not sure Sun RISC machines have the same cache coherency and relaxed memory model as x86, so I'd test very carefully on them. Specifically, you might need to add memory release barriers to force flushing of memory writes.
Given the assumptions you have described, there is no guarantee that a write of a volatile variable in one thread will be "seen" in another.
Given that, your second question (about the timeframe) is not applicable.
With (multi-processor) PowerPC architectures, cache coherency is not sufficient to ensure cross-core visibility of a volatile variable. There are explicit instructions that need to be executed to ensure state is flushed (and to make it visible across multiple processors and their caches).
In practice, on architectures that require such instructions to be executed, the implementation of data synchronisation primitives (mutexes, semaphores, critical sections, etc) does - among other things - use those instructions.
More broadly, the volatile keyword in C++ has nothing to do with multithreading at all, let alone anything to do with cross-cache coherency. volatile, within a given thread of execution, translates to a need for things like fetches and writes of the variable not being eliminated or reordered by the compiler (which affects optimisation). It does not translate into any requirement about ordering or synchronisation of the completion of fetches or writes between threads of execution - and such requirements are necessary for cache coherency.
Notionally, a compiler might be implemented to provide such guarantees. I've yet to see any information about one that does so - which is not surprising, as providing such a guarantee would seriously affect performance of multithreaded code by forcing synchronisation between threads - even if the programmer has not used synchronisation (mutexes, etc) in their code.
Similarly, the host platform could also notionally provide such guarantees with volatile variables - even if the instructions being executed don't specifically require them. Again, that would tend to reduce performance of multithreaded programs - including modern operating systems - on those platforms. It would also affect (or negate) the benefits of various features that contribute to performance of modern processors, such as pipelining, by forcing processors to wait on each other.
If, as a C++ developer (as distinct from someone writing code that exploits specific features offered by your particular compiler or host platform) you want a variable written in one thread able to be coherently read by another thread, then don't bother with volatile. Perform synchronisation between threads - when they need to access the same variable concurrently - using provided techniques - such as mutexes. And follow the usual guidelines on using those techniques (e.g. use mutexes sparingly and minimise the time which they are held, do as much as possible in your threads without accessing variables that are shared between threads at all).

How does a mutex ensure a variable's value is consistent across cores?

If I have a single int which I want to write to from one thread and read from on another, I need to use std::atomic, to ensure that its value is consistent across cores, regardless of whether or not the instructions that read from and write to it are conceptually atomic. If I don't, it may be that the reading core has an old value in its cache, and will not see the new value. This makes sense to me.
If I have some complex data type that cannot be read/written to atomically, I need to guard access to it using some synchronisation primitive, such as std::mutex. This will prevent the object getting into (or being read from) an inconsistent state. This makes sense to me.
What doesn't make sense to me is how mutexes help with the caching problem that atomics solve. They seem to exist solely to prevent concurrent access to some resource, but not to propagate any values contained within that resource to other cores' caches. Is there some part of their semantics I've missed which deals with this?
The right answer to this is magic pixies - e.g. It Just Works. The implementation of std::atomic for each platform must do the right thing.
The right thing is a combination of 3 parts.
Firstly, the compiler needs to know that it can't move instructions across boundaries [in fact it can in some cases, but assume that it doesn't].
Secondly, the cache/memory subsystem needs to know - this is generally done using memory barriers, although x86/x64 generally have such strong memory guarantees that this isn't necessary in the vast majority of cases (which is a big shame as it's nice for wrong code to actually go wrong).
Finally the CPU needs to know it cannot reorder instructions. Modern CPUs are massively aggressive at reordering operations and making sure in the single threaded case that this is unnoticeable. They may need more hints that this cannot happen in certain places.
For most CPUs parts 2 and 3 come down to the same thing - a memory barrier implies both. Part 1 is totally inside the compiler, and is down to the compiler writers to get right.
See Herb Sutter's talk 'Atomic Weapons' for a lot more interesting info.
Consistency across cores is ensured by memory barriers (which also prevent instruction reordering). When you use std::atomic, not only do you access the data atomically, but the compiler (and library) also inserts the relevant memory barriers.
Mutexes work the same way: the mutex implementations (e.g. pthreads or WinAPI or what not) internally also insert memory barriers.
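A hedged sketch of that point (the names are mine): the lock/unlock pair is what makes the plain int's new value visible to the next thread that takes the lock, so no std::atomic is needed for data accessed only under the mutex.

#include <mutex>

int shared_value = 0;   // plain, non-atomic; only ever touched while holding m
std::mutex m;

void writer() {
    std::lock_guard<std::mutex> lock(m);
    shared_value = 42;  // the unlock at end of scope acts as a release
}

void reader() {
    std::lock_guard<std::mutex> lock(m);
    // The lock acts as an acquire: if writer() ran earlier, 42 is visible here
    // regardless of which core's cache last held the old value.
    int v = shared_value;
    (void)v;
}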
Most modern multicore processors (including x86 and x64) are cache coherent. If two cores hold the same memory location in cache and one of them updates the value, the change is automatically propagated to other cores' caches. It's inefficient (writing to the same cache line at the same time from two cores is really slow) but without cache coherence it would be very difficult to write multithreaded software.
And as syam said, memory barriers are also required. They prevent the compiler or processor from reordering memory accesses, and they also force the write into memory (or at least into cache) when, for example, a variable is held in a register because of compiler optimizations.