Semantics of atomic stores in MESI cachelines - c++

In a concurrently read and written to line (reads and stores only). What happens when a line is owned by a core in modified-or-read mode, and some other core issues store operations on this line (assuming these reads and writes are std::atomic::load and std::atomic::store with C++ compiilers)? Does the line get pulled into the other core that is issuing the writes? Or do the writes find their way over to the reading core directly as needed? The difference between the two is that the latter only causes the overhead of one roundtrip for reading the value of the line. And can possibly get paralellized as well (if the store and read happen at staggered points in time)
This question arose when considering the consequences of NUMA in a concurrent application. But the question stands when the two cores involved are in the same NUMA node.
There are a large number of architectures in the mix. But for now, curious about what happens on Intel Skylake or Broadwell.

First of all, there's nothing special about atomic loads/stores vs. regular stores by the time they're compiled to asm. (Although the default seq_cst memory order can compile to xchg, but mov+mfence is also a valid (often slower) option which is indistinguishable in asm from a plain release store followed by a full barrier.) xchg is an atomic RMW + a full barrier. Compilers use it for the full-barrier effect; the load part of the exchange is just an unwanted side-effect.
The rest of this answer applies fully to any x86 asm store, or the store part of a memory-destination RMW instruction (whether it's atomic or not).
Initially the core that had previously been doing writes will have the line in MESI Modified state in its L1d, assuming it hasn't been evicted to L2 or L3 already.
The line changes MESI state (to shared) in response to a read request, or for stores the core doing the write will send an RFO (request for ownership) and eventually get the line in Modified or Exclusive state.
Getting data between physical cores on modern Intel CPUs always involves write-back to shared L3 (not necessarily to DRAM). I think this is true even on multi-socket systems where the two cores are on separate sockets so they don't really share a common L3, using snooping (and snoop filtering).
Intel uses MESIF. AMD uses MOESI which allows sharing dirty data directly between cores directly without write-back to/from a shared outer level cache first.
For more details, see Which cache mapping technique is used in intel core i7 processor?
There's no way for store-data to reach another core except through cache/memory.
Your idea about the writes "happening" on another core is not how anything works. I don't see how it could even be implemented while respecting x86 memory ordering rules: stores from one core become globally visible in program order. I don't see how you could send stores (to different cache lines) to different cores and make sure one core waited for the other to commit those stores to the cache lines they each owned.
It's also not really plausible even for a weakly-ordered ISA. Often when you read or write a cache line, you're going to do more reads+writes. Sending each read or write request separately over a mesh interconnect between cores would require many many tiny messages. High throughput is much easier to achieve than low latency: wider buses can do that. Low latency for loads is essential for high performance. If threads ever migrate between cores, all of a sudden they'll be read/writing cache lines that are all hot in L1d on some other core, which would be horrible until the CPU somehow decided that it should migrate the cache line to the core accessing it.
L1d caches are small, fast, and relatively simple. The logic for ordering a core's reads+writes relative to each other, and for doing speculative loads, is all internal to a single core. (Store buffer, or on Intel actually a Memory Order Buffer to track speculative loads as well as stores.)
This is why you should avoid even touching a shared variable if you can prove you don't have to. (Or use exponential backoff for cases where that's appropriate). And why a CAS loop should spin read-only waiting to see the value its looking for before even attempting a CAS, instead of hammering on the cache line with writes from failing lock cmpxchg attempts.

Related

Does hardware memory barrier make visibility of atomic operations faster in addition to providing necessary guarantees?

TL;DR: In a producer-consumer queue does it ever make sense to put an unnecessary (from C++ memory model viewpoint) memory fence, or unnecessarily strong memory order to have better latency at the expense of possibly worse throughput?
C++ memory model is executed on the hardware by having some sort of memory fences for stronger memory orders and not having them on weaker memory orders.
In particular, if producer does store(memory_order_release), and consumer observes the stored value with load(memory_order_acquire), there are no fences between load and store. On x86 there are no fences at all, on ARM fences are put operation before store and after load.
The value stored without a fence will eventually be observed by load without a fence (possibly after few unsuccessful attempts)
I'm wondering if putting a fence on either of sides of the queue can make the value to be observed faster?
What is the latency with and without fence, if so?
I expect that just having a loop with load(memory_order_acquire) and pause / yield limited to thousands of iterations is the best option, as it is used everywhere, but want to understand why.
Since this question is about hardware behavior, I expect there's no generic answer. If so, I'm wondering mostly about x86 (x64 flavor), and secondarily about ARM.
Example:
T queue[MAX_SIZE]
std::atomic<std::size_t> shared_producer_index;
void producer()
{
std::size_t private_producer_index = 0;
for(;;)
{
private_producer_index++; // Handling rollover and queue full omitted
/* fill data */;
shared_producer_index.store(
private_producer_index, std::memory_order_release);
// Maybe barrier here or stronger order above?
}
}
void consumer()
{
std::size_t private_consumer_index = 0;
for(;;)
{
std::size_t observed_producer_index = shared_producer_index.load(
std::memory_order_acquire);
while (private_consumer_index == observed_producer_index)
{
// Maybe barrier here or stronger order below?
_mm_pause();
observed_producer_index= shared_producer_index.load(
std::memory_order_acquire);
// Switching from busy wait to kernel wait after some iterations omitted
}
/* consume as much data as index difference specifies */;
private_consumer_index = observed_producer_index;
}
}
Basically no significant effect on inter-core latency, and definitely never worth using "blindly" without careful profiling, if you suspect there might be any contention from later loads missing in cache.
It's a common misconception that asm barriers are needed to make the store buffer commit to cache. In fact barriers just make this core wait for something that was already going to happen on its own, before doing later loads and/or stores. For a full barrier, blocking later loads and stores until the store buffer is drained.
Size of store buffers on Intel hardware? What exactly is a store buffer?
In the bad old days before std::atomic, compiler barriers were one way to stop the compiler from keeping values in registers (private to a CPU core / thread, not coherent), but that's a compilation issue not asm. CPUs with non-coherent caches are possible in theory (where std::atomic would need to do explicit flushing to make a store visible), but in practice no implementation runs std::thread across cores with non-coherent caches.
If I don't use fences, how long could it take a core to see another core's writes? is highly related, I've written basically this answer at least a few times before. (But this looks like a good place for an answer specifically about this, without getting into the weeds of which barriers do what.)
There might be some very minor secondary effects of blocking later loads that could maybe compete with RFOs (for this core to get exclusive access to a cache line to commit a store). The CPU always tries to drain the store buffer as fast as possible (by committing to L1d cache). As soon as a store commits to L1d cache, it becomes globally visible to all other cores. (Because they're coherent; they'd still have to make a share request...)
Getting the current core to write-back some store data to L3 cache (especially in shared state) could reduce the miss penalty if the load on another core happens somewhat after this store commits. But there are no good ways to do that. Creating a conflict miss in L1d and L2 maybe, if producer performance is unimportant other than creating low latency for the next read.
On x86, Intel Tremont (low power Silvermont series) will introduce cldemote (_mm_cldemote) that writes back a line as far as an outer cache, but not all the way to DRAM. (clwb could possibly help, but does force the store to go all the way to DRAM. Also, the Skylake implementation is just a placeholder and works like clflushopt.)
Is there any way to write for Intel CPU direct core-to-core communication code?
How to force cpu core to flush store buffer in c?
x86 MESI invalidate cache line latency issue
Force a migration of a cache line to another core (not possible)
Fun fact: non-seq_cst stores/loads on PowerPC can store-forward between logical cores on the same physical core, making stores visible to some other cores before they become globally visible to all other cores. This is AFAIK the only real hardware mechanism for threads to not agree on a global order of stores to all objects. Will two atomic writes to different locations in different threads always be seen in the same order by other threads?. On other ISAs, including ARMv8 and x86, it's guaranteed that stores become visible to all other cores at the same time (via commit to L1d cache).
For loads, CPUs already prioritize demand loads over any other memory accesses (because of course execution has to wait for them.) A barrier before a load could only delay it.
That might happen to be optimal by coincidence of timing, if that makes it see the store it was waiting for instead of going "too soon" and seeing the old cached boring value. But there's generally no reason to assume or ever predict that a pause or barrier could be a good idea ahead of a load.
A barrier after a load shouldn't help either. Later loads or stores might be able to start, but out-of-order CPUs generally do stuff in oldest-first priority so later loads probably can't fill up all the outstanding load buffers before this load gets a chance to get its load request sent off-core (assuming a cache miss because another core stored recently.)
I guess I could imagine a benefit to a later barrier if this load address wasn't ready for a while (pointer-chasing situation) and the max number of off-core requests were already in-flight when the address did become known.
Any possible benefit is almost certainly not worth it; if there was that much useful work independent of this load that it could fill up all the off-core request buffers (LFBs on Intel) then it might well not be on the critical path and it's probably a good thing to have those loads in flight.

MESI Protocol & std::atomic - Does it ensure all writes are immediately visible to other threads?

In regards to std::atomic, the C++11 standard states that stores to an atomic variable will become visible to loads of that variable in a "reasonable amount of time".
From 29.3p13:
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
However I was curious to know what actually happens when dealing with specific CPU architectures which are based on the MESI cache coherency protocol (x86, x86-64, ARM, etc..).
If my understanding of the MESI protocol is correct, a core will always read the value previously written/being written by another core immediately, possibly by snooping it. (because writing a value means issuing a RFO request which in turn invalidates other cache lines)
Does it mean that when a thread A stores a value into an std::atomic, another thread B which does a load on that atomic successively will in fact always observe the new value written by A on MESI architectures? (Assuming no other threads are doing operations on that atomic)
By “successively” I mean after thread A has issued the atomic store. (Modification order has been updated)
I'll answer for what happens on real implementations on real CPUs, because an answer based only on the standard can barely say anything useful about time or "immediacy".
MESI is just an implementation detail that ISO C++ doesn't have anything to say about. The guarantees provided by ISO C++ only involve order, not actual time. ISO C++ is intentionally non-specific to avoid assuming that it will execute on a "normal" CPU. An implementation on a non-coherent machine that required explicit flushes for store visibility might be theoretically possible (although probably horrible for performance of release / acquire and seq-cst operations)
C++ is non-specific enough about timing to even allow an implementation on a single-core cooperative multi-tasking system (no pre-emption), with the compiler inserting voluntary yields occasionally. (Infinite loops without any volatile accesses or I/O are UB). C++ on a system where only one thread can actually be executing at once is totally fine and possible, assuming you consider a scheduler timeslice to still be a "reasonable" amount of time. (Or less if you yield or otherwise block.)
Even the model of formalism ISO C++ uses to give the guarantees it does about ordering is very different from the way hardware ISAs define their memory models. C++ formal guarantees are purely in terms of happens-before and synchronizes-with, not "re"-ordering litmus tests or any kind of stuff like that. e.g. How to achieve a StoreLoad barrier in C++11? is impossible to answer for pure ISO C++ formalism. The "option C" in that Q&A serves to show just how weak the C++ guarantees are; that case with store then load of two different SC variables is not sufficient to imply happens-before based on it, according to the C++ formalism, even though there has to be a total order of all SC operations. But it is sufficient in real life on systems with coherent cache and only local (within each CPU core) memory reordering, even AArch64 where the SC load right after the SC store does still essentially give us a StoreLoad barrier.
when a thread A stores a value into an std::atomic
It depends what you mean by "doing" a store.
If you mean committing from the store buffer into L1d cache, then yes, that's the moment when a store becomes globally visible, on a normal machine that uses MESI to give all CPU cores a coherent view of memory.
Although note that on some ISAs, some other threads are allowed to see stores before they become globally visible via cache. (i.e. the hardware memory model may not be "multi-copy atomic", and allow IRIW reordering. POWER is the only example I know of that does this in real life. See Will two atomic writes to different locations in different threads always be seen in the same order by other threads? for details on the HW mechanism: Store forwarding for retired aka graduated stores between SMT threads.)
If you mean executing locally so later loads in this thread can see it, then no. std::atomic can use a memory_order weaker than seq_cst.
All mainstream ISAs have memory-ordering rules weak enough to allow for a store buffer to decouple instruction execution from commit to cache. This also allows speculative out-of-order execution by giving stores somewhere private to live after execution, before we're sure that they were on the correct path of execution. (Stores can't commit to L1d until after the store instruction retires from the out-of-order part of the back end, and thus is known to be non-speculative.)
If you want to wait for your store to be visible to other threads before doing any later loads, use atomic_thread_fence(memory_order_seq_cst);. (Which on "normal" ISAs with standard choice of C++ -> asm mappings will compile to a full barrier).
On most ISAs, a seq_cst store (the default) will also stall all later loads (and stores) in this thread until the store is globally visible. But on AArch64, STLR is a sequential-release store and execution of later loads/stores doesn't have to stall unless / until a LDAR (acquire load) is about to execute while the STLR is still in the store buffer. This implements SC semantics as weakly as possible, assuming AArch64 hardware actually works that way instead of just treating it as a store + full barrier.
But note that only blocking later loads/stores is necessary; out-of-order exec of ALU instructions on registers can still continue. But if you were expecting some kind of timing effect due to dependency chains of FP operations, for example, that's not something you can depend on in C++.
Even if you do use seq_cst so nothing happens in this thread before the store is visible to others, that's still not instant. Inter-core latency on real hardware can be on the order of maybe 40ns on mainstream modern Intel x86, for example. (This thread doesn't have to stall that long on a memory barrier instruction; some of that time is the cache miss on the other thread trying to read the line that was invalidated by this core's RFO to get exclusive ownership.) Or of course much cheaper for logical cores that share the L1d cache of a physical core: What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?
From 29.3p13:
Implementations should make atomic stores visible to atomic loads
within a reasonable amount of time.
The C and C++ standards are all over the place on threads, hence not usable as formal specifications. They use the concept of time, and somewhat imply that everything runs step by step, sequentially (if not, you wouldn't have a sound program semantic) and then say that some constructs can see effects out of order, without ever telling which is which.
When effects are seen out of order, thread time is ill defined as you don't have a chronometer that would also be out of order: you wouldn't do sport with out of order execution of actions!
Even "out of order" suggests that some things are purely sequential and some other operations can be "out of order" with respect to the firsts. That is not how std::atomic is defined.
What the standards try to say is that there is a notion of progress for each thread, with a CPU time or cost index, and as it increases as more stuff is done, and stuff can only be slightly reordered by the implementation: now reordering is well defined, not in term of other sequential instructions, but in term of cost/cycles/CPU time.
So if two instructions are close to each other in the sequential intra-thread execution, they will be close in CPU time too. A reasonable compiler shouldn't move a volatile operation, a file output, or an atomic operation past a very costly "pure" computation (one that has no externally visible side effect).
A basic idea that many committee members sadly couldn't even spell out!

Can memory store be reordered really, in an OoOE processor?

We know that two instructions can be reordered by an OoOE processor. For example, there are two global variables shared among different threads.
int data;
bool ready;
A writer thread produce data and turn on a flag ready to allow readers to consume that data.
data = 6;
ready = true;
Now, on an OoOE processor, these two instructions can be reordered (instruction fetch, execution). But what about the final commit/write-back of the results? i.e., will the store be in-order?
From what I learned, this totally depends on a processor's memory model. E.g., x86/64 has a strong memory model, and reorder of stores is disallowed. On the contrary, ARM typically has a weak model where store reordering can happen (along with several other reorderings).
Also, the gut feeling tells me that I am right because otherwise we won't need a store barrier between those two instructions as used in typical multi-threaded programs.
But, here is what our wikipedia says:
.. In the outline above, the OoOE processor avoids the stall that
occurs in step (2) of the in-order processor when the instruction is
not completely ready to be processed due to missing data.
OoOE processors fill these "slots" in time with other instructions
that are ready, then re-order the results at the end to make it appear
that the instructions were processed as normal.
I'm confused. Is it saying that the results have to be written back in-order? Really, in an OoOE processor, can store to data and ready be reordered?
The simple answer is YES on some processor types.
Before the CPU, your code faces an earlier problem, compiler reordering.
data = 6;
ready = true;
The compiler is free to rearrange these statements since, as far as it knows, they do not affect each other (it is not thread-aware).
Now down to the processor level:
1) An out-of-order processor can process these instructions in different order, including reversing the order of the stores.
2) Even if the CPU performs them in order, they memory controller may not perform them in order because it may need to flush or bring in new cache lines or do an address translation before it can write them.
3) Even if this doesn't happen, another CPU in the system may not see them in the same order. In order to observe them, it may need to bring in the modified cache lines from the core that wrote them. It may not be able to bring one cache line in earlier than another if it is held be another core or if there is contention for that line by multiple cores, and its own out of order execution will read one before the other.
4) Finally, speculative execution on other cores may read the value of data before ready was set by the writing core, and by the time it gets around to reading ready, it was already set but data was also modified.
These problems are all solved by memory barriers. Platforms with weakly-ordered memory must make use of memory barriers to ensure memory coherence for thread synchronization.
The consistency model (or memory model) for the architecture determines what memory operations can be reordered. The idea is always to achieve the best performance from the code, while preserving the semantics expected by the programmer. That is the point from wikipedia, the memory operations appear in order to the programmer, even though they may have been reordered. Reordering is generally safe when the code is single-threaded, as the processor can easily detect potential violations.
On x86, the common model is that writes are not reordered with other writes. Yet, the processor is using out of order execution (OoOE), so instructions are being reordered constantly. Generally, the processor has several additional hardware structures to support OoOE, like a reorder buffer and load-store queue. The reorder buffer ensures that all instructions appear to execute in order, such that interrupts and exceptions break a specific point in the program. The load-store queue functions similarly, in that it can restore the order of memory operations according to the memory model. The load-store queue also disambiguates addresses, so that the processor can identify when the operations are made to the same or different addresses.
Back to OoOE, the processor is executing 10s to 100s of instructions in every cycle. Loads and stores are computing their addresses, etc. The processor may prefetch the cache lines for the accesses (which may include cache coherence), but it cannot actually access the line either to read or write until it is safe (according to the memory model) to do so.
Inserting store barriers, memory fences, etc tell both the compiler and processor about further restrictions to reordering the memory operations. The compiler is part of implementing the memory model, as some languages like java have specific memory model, while others like C obey the "memory accesses should appear as if they were executed in order".
In conclusion, yes, data and ready can be reordered in an OoOE. But it depends on the memory model as to whether they actually are. So if you need a specific order, provide the appropriate indication using barriers, etc such that the compiler, processor, etc will not choose a different order for higher performance.
On modern processor, the storing action itself is async (think of it like submit a change to the L1 cache and continue execution, the cache system further propagate in async manner). So the changes on two object lies on different cache block may be realised OoO from other CPU's perspective.
Furthermore, even the instruction to store those data, can be executed OoO. For example when two object is stored "at the same time", but the bus line of one object is retained/locked by other CPU or bus mastering, thus other other object may be committed earlier.
Therefore, to properly share data across threads, you need some kind of memory barrier or make use of transactional memory feature found in latest CPU like TSX.
I think you're misinterpreting "appear that the instructions were processed as normal." What that means is that if I have:
add r1 + 7 -> r2
move r3 -> r1
and the order of those is effectively reversed by out-of-order execution, the value that participates in the add operation will still be the value of r1 that was present prior to the move. Etc. The CPU will cache register values and/or delay register stores to assure that the "meaning" of a sequential instruction stream is not changed.
This says nothing about the order of stores as visible from another processor.

Concurrent stores seen in a consistent order

The Intel Architectures Software Developer's Manual, Aug. 2012, vol. 3A, sect. 8.2.2:
Any two stores are seen in a consistent order by processors other than
those performing the stores.
But can this be so?
The reason I ask is this: Consider a dual-core Intel i7 processor with HyperThreading. According to the Manual's vol. 1, Fig. 2-8, the i7's logical processors 0 and 1 share an L1/L2 cache, but its logical processors 2 and 3 share a different L1/L2 cache -- whereas all the logical processors share a single L3 cache. Suppose that logical processors 0 and 2 -- which do not share an L1/L2 cache -- write to the same memory location at about the same time, and that the writes go no deeper than L2 for the moment. Could not logical processors 1 and 3 (which are "processors other than those performing the stores") then see the "two stores in an inconsistent order"?
To achieve consistency, must not logical processors 0 and 2 issue SFENCE instructions, and logical processors 1 and 3 issue LFENCE instructions? Notwithstanding, the Manual seems to think otherwise, and its opinion in the matter does not have the look of a mere misprint. It looks deliberate. I'm confused.
UPDATE
In light of #Benoit's answer, a following question: The only purpose of L1 and L2 therefore is to speed loads. It is L3 that speeds stores. Is that right?
Intel CPUs (like all normal SMP systems) use (a variant of) MESI to ensure cache coherency for cached loads/stores. i.e. that all cores see the same view of memory through their caches.
A core can only write to a cache line after doing a Read For Ownership (RFO), getting the line in Exclusive state (no other caches have a valid copy of the line that could satisfy loads). Related: atomic RMW operations prevent other cores from doing anything to the target cache-line by locking it in Modified state for the duration of the operation.
To test for this kind of reordering, you need two other threads which both read both stores (in opposite order). Your proposed scenario has one core (reader2) reading an old value from memory (or L3, or its own private L2/L1) after another core (reader1) has read the new value of the same line stored by writer1. This is impossible: for reader1 to see writer1's store, writer1 must have already completed a RFO that invalidates all other copies of the cache line anywhere. And reading directly from DRAM without (effectively) snooping any write-back caches is not allowed. (Wikipedia's MESI article has diagrams.)
When a store commits (from the store buffer inside a core) to L1d cache, it becomes globally visible to all other cores at the same time. Before that, only the local core could "see" it (via store->load forwarding from the store buffer).
On a system where the only way for data to propagate from one core to another is through the global cache-coherency domain, MESI cache coherency alone guarantees that a single global store order exists, that all threads can agree on. x86's strong memory ordering rules make this global store order be some interleaving of program order, and we call this a Total Store Order memory model.
x86's strong memory model disallows LoadLoad reordering, so loads take their data from cache in program order without any barrier instructions in the reader threads.1
Loads actually snoop the local store buffer before taking data from the coherent cache. This is the reason the consistent order rule you quoted excludes the case where either store was done by the same core that's doing the loads. See Globally Invisible load instructions for more about where load data really comes from. But when the load addresses don't overlap with any recent stores, what I said above applies: load order is the order of sampling from the shared globally coherent cache domain.
The consistent order rule is a pretty weak requirement. Many non-x86 ISAs don't guarantee it on paper, but very few actual (non-x86) CPU designs have a mechanism by which one core can see store data from another core before it becomes globally visible to all cores. IBM POWER with SMT is one such example: Will two atomic writes to different locations in different threads always be seen in the same order by other threads? explains how forwarding between logical cores within one physical core can cause it. (This is a like what you proposed, but within the store buffer rather than L2).
x86 microarchitectures with HyperThreading (or AMD's SMT in Ryzen) obey that requirement by statically partitioning the store buffer between the logical cores on one physical core. What will be used for data exchange between threads are executing on one Core with HT? So even within one physical core, a store has to commit to L1d (and become globally visible) before the other logical core can load the new data.
It's probably simpler to not have forwarding from retired-but-not-committed stores in one logical core to the other logical cores on the same physical core.
(The other requirements of x86's TSO memory model, like loads and stores appearing in program order, are harder. Modern x86 CPUs execute out of order, but use a Memory Order Buffer to maintain the illusion and have stores commit to L1d in program order. Loads can speculatively take values earlier than they're "supposed" to, and then check later. This is why Intel CPUs have "memory-order mis-speculation" pipeline nukes: What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?.)
As #BeeOnRope points out, there is an interaction between HT and maintaining the illusion of no LoadLoad reordering: normally a CPU can detect when another core touched a cache line after a load actual read it but before it was architecturally allowed to have read it: the load port can track invalidations to that cache line. But with HT, load ports also have to snoop the stores that the other hyperthread commits to L1d cache, because they won't invalidate the line. (Other mechanisms are possible, but it is a problem that CPU designers have to solve if they want high performance for "normal" loads.)
Footnote 1: On a weakly-ordered ISA, you'd use load-ordering barriers to control the order in which the 2 loads in each reader take their data from the globally coherent cache domain.
The writer threads are only doing a single store each so a fence is meaningless.
Because all cores share a single coherent cache domain, fences only need to control local reordering within a core. The store buffer in each core already tries to make stores globally visible as quickly as possible (while respecting the ordering rules of the ISA), so a barrier just makes the CPU wait before doing later operations.
x86 lfence has basically no memory-ordering use cases, and sfence is only useful with NT stores. Only mfence is useful for "normal" stuff, when one thread is writing something and then reading another location. http://preshing.com/20120515/memory-reordering-caught-in-the-act/. So it blocks StoreLoad reordering and store-forwarding across the barrier.
In light of #Benoit's answer, a following question: The only purpose of L1 and L2 therefore is to speed loads. It is L3 that speeds stores. Is that right?
No, L1d and L2 are write-back caches: Which cache mapping technique is used in intel core i7 processor?. Repeated stores to the same line can be absorbed by L1d.
But Intel uses inclusive L3 caches, so how can L1d in one core have the only copy? L3 is actually tag-inclusive, which is all that's needed for L3 tags work as a snoop filter (instead of broadcasting RFO requests to every core). The actual data in dirty lines is private to the per-core inner caches, but L3 knows which core has the current data for a line (and thus where to send a request when another core wants to read a line that another core has in Modified state). Clean cache lines (in Shared state) are data-inclusive of L3, but writing to a cache line doesn't write-through to L3.
I believe what the Intel documentation is saying is that the mechanics of the x86 chip will ensure that the other processors always see the writes in a consistent order.
So the other processors will only ever see one of the following results when reading that memory location:
value before either write (I.e. the read preceeded both writes)
value after processor 0's write (I.e. as if processor 2 wrote first, and then processor 0 overwrote)
value after processor 2's write (I.e. as if processor 0 wrote first and then processor 2 overwrote)
It won't be possible for processor 1 to see the value after processor 0's write, but at the same time have processor 3 see the value after processor 2's write (or vice versa).
Keep in mind that since intra-processor re-ordering is allowed (see section 8.2.3.5) processor's 0 and 2 may see things differently.
Ouch, this is a tough question! But I'll try...
the writes go no deeper than L2
Basically this is impossible since Intel uses inclusive caches. Any data written to L1, will also takes place in L2 and L3, unless you prevent from caching by disabling them through CR0/MTRR.
That being said, I guess there are arbitration mechanisms: processors issue a request to write data and an arbiter selects which request is granted from among the pending requests from each of the request queues. The selected requests are broadcasted to the snoopers, and to caches then. I suppose it would prevent from race, enforcing the consistent order seen by processors other than the one performing the request.

Multiple threads and memory

I read in the Visual C++ documentation that it is safe for multiple threads to read from the same object.
My question is: how does a X86-64 CPU with multiple cores handle this?
Say you have a 1 MB block of memory. Are different threads literally able to read the exact same data at the same time or do cores read one word at a time with only one core allowed to read a particular word at a time?
If there are really no writes in your 1MB block then yeah, each core can read from its own cache line without any problem as no writes are being committed and therefore no cache coherency problems arise.
In a multicore architecture, basically there is a cache for each core and a "Cache Coherence Protocol" which invalidates the cache on some cores which do not have the most up to date information. I think most processors implement the MOESI protocol for cache coherency.
Cache coherency is a complex topic that has been largely discussed (I specially like some articles by Joe Duffy here and here). The discussion nonetheless revolves around the possible performance penalties of code that, while being apparently lock free, can slow down due to the cache coherency protocol kicking in to maintain coherency across the processors caches, but, as long as there are no writes there's simply no coherency to maintain and thus no lost on performance.
Just to clarify, as said in the comment, RAM can't be accessed simultaneously since x86 and x64 architectures implement a single bus which is shared between cores with SMP guaranteeing the fairness accessing main memory. Nonetheless this situation is hidden by each core cache which allows each core to have its own copy of the data. For 1MB of data it would be possible to incur on some contention while the core update its cache but that would be negligible.
Some useful links:
Cache Coherence Protocols
Cache Coherence
Not only are different cores allowed to read from the same block of memory, they're allowed to write at the same time too. If it's "safe" or not, that's an entirely different story. You need to implement some sort of a guard in your code (usually done with semaphores or derivates of them) to guard against multiple cores fighting over the same block of memory in a way you don't specifically allow.
About the size of the memory a core reads at a time, that's usually a register's worth, 32 bits on a 32bit cpu, 64 bits for a 64bit cpu and so on. Even streaming is done dword by dword (look at memcpy for example).
About how concurrent multiple cores really are, every core uses a single bus to read and write to the memory, so accessing any resources (ram, external devices, the floating point processing unit) is one request at a time, one core at a time. The actual processing inside the core is completely concurrent however. DMA transfers also don't block the bus, concurrent transfers get queued and processed one at a time (I believe, not 100% sure on this).
edit: just to clarify, unlike the other reply here, I'm talking only about a no-cache scenario. Of course if the memory gets cached read-only access is completely concurrent.