How do atomic variables based on shared memory work in inter-process contexts? - c++

Let's say a process creates a piece of shared memory the size of two integers (64 bits / 8 bytes in total).
The shared memory will be available to not only threads of the process, but other processes on the system that have access to that piece of shared memory.
Presumably the shared memory in the first process will be addressed via its virtual address space, so when an atomic operation (compare-exchange) is performed on the first integer, the virtual address in the context of the first process is used.
If another process at the same time is performing some kind of atomic operation on the first integer, it would also be using its own virtual address space.
So what system actually performs the translation into the actual physical address, and from a very general POV, how does the CPU provide atomicity guarantees in this situation?

The translation from virtual to physical addresses is done by the MMU, using the per-process page tables (with the TLB caching recent translations). Modern CPU caches operate on physical addresses (usually L1 caches are virtually indexed, physically tagged). Basically this means that two virtual addresses in two different processes that translate to the same physical address will be cached just once per CPU.
Modern CPU caches are coherent: the caches are kept synchronized among all CPUs in the system, so all CPUs see a consistent view of the data. Intel CPUs usually use the MESI protocol (or a variant of it).
Modern CPUs have write buffers, so a memory store takes some time to get to the cache.
So, from a very general point of view, an atomic operation on a modern CPU basically reads and locks a cache line for the exclusive use of that CPU until the atomic operation is done, and propagates the change directly to the cache, avoiding buffering within the CPU.
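A minimal sketch of the scenario in the question, assuming POSIX shared memory and a lock-free std::atomic<int> (which it is on x86-64); the object name "/demo_shm" is arbitrary. Each process maps the same physical pages at its own virtual address, and the atomic RMW is atomic across both processes because the hardware operates on the (translated) physical address:

#include <atomic>
#include <cstdio>
#include <fcntl.h>      // shm_open, O_* flags
#include <sys/mman.h>   // mmap
#include <unistd.h>     // ftruncate

int main() {
    const size_t kSize = 2 * sizeof(std::atomic<int>);  // room for two integers

    // Both processes open the same named shared-memory object.
    int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, kSize) != 0) { perror("ftruncate"); return 1; }

    // Each process gets its own virtual address for the same physical pages.
    void* p = mmap(nullptr, kSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    // First integer of the pair; the creating process should placement-new
    // it exactly once:  new (p) std::atomic<int>(0);
    auto* counter = static_cast<std::atomic<int>*>(p);

    // Atomic compare-exchange: the CPU locks the cache line holding the
    // physical address, so the RMW is atomic across both processes.
    int expected = counter->load();
    while (!counter->compare_exchange_weak(expected, expected + 1)) {
        // expected now holds the current value; retry.
    }
    std::printf("counter is now %d\n", counter->load());
}

(On older glibc you may need to link with -lrt for shm_open.)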

Related

Semantics of atomic stores in MESI cachelines

Consider a line that is concurrently read and written (loads and stores only). What happens when a line is owned by a core in Modified or Exclusive state, and some other core issues store operations on this line (assuming these reads and writes are std::atomic<>::load and std::atomic<>::store with C++ compilers)? Does the line get pulled into the other core that is issuing the writes? Or do the writes find their way over to the reading core directly as needed? The difference between the two is that the latter only causes the overhead of one round trip for reading the value of the line, and can possibly be parallelized as well (if the store and read happen at staggered points in time).
This question arose when considering the consequences of NUMA in a concurrent application. But the question stands when the two cores involved are in the same NUMA node.
There are a large number of architectures in the mix. But for now, I'm curious about what happens on Intel Skylake or Broadwell.
First of all, there's nothing special about atomic loads/stores vs. regular loads/stores by the time they're compiled to asm. (The default seq_cst memory order can compile to xchg, but mov + mfence is also a valid (often slower) option which is indistinguishable in asm from a plain release store followed by a full barrier.) xchg is an atomic RMW plus a full barrier. Compilers use it for the full-barrier effect; the load part of the exchange is just an unwanted side-effect.
The rest of this answer applies fully to any x86 asm store, or the store part of a memory-destination RMW instruction (whether it's atomic or not).
Initially the core that had previously been doing writes will have the line in MESI Modified state in its L1d, assuming it hasn't been evicted to L2 or L3 already.
The line changes MESI state (to Shared) in response to a read request; for stores, the core doing the write will send an RFO (Request For Ownership) and eventually get the line in Modified or Exclusive state.
Getting data between physical cores on modern Intel CPUs always involves write-back to shared L3 (not necessarily to DRAM). I think this is true even on multi-socket systems where the two cores are on separate sockets so they don't really share a common L3, using snooping (and snoop filtering).
Intel uses MESIF. AMD uses MOESI, which allows sharing dirty data between cores directly, without a write-back to/from a shared outer-level cache first.
For more details, see Which cache mapping technique is used in intel core i7 processor?
There's no way for store-data to reach another core except through cache/memory.
Your idea about the writes "happening" on another core is not how anything works. I don't see how it could even be implemented while respecting x86 memory ordering rules: stores from one core become globally visible in program order. I don't see how you could send stores (to different cache lines) to different cores and make sure one core waited for the other to commit those stores to the cache lines they each owned.
It's also not really plausible even for a weakly-ordered ISA. Often when you read or write a cache line, you're going to do more reads+writes. Sending each read or write request separately over a mesh interconnect between cores would require many many tiny messages. High throughput is much easier to achieve than low latency: wider buses can do that. Low latency for loads is essential for high performance. If threads ever migrate between cores, all of a sudden they'll be read/writing cache lines that are all hot in L1d on some other core, which would be horrible until the CPU somehow decided that it should migrate the cache line to the core accessing it.
L1d caches are small, fast, and relatively simple. The logic for ordering a core's reads+writes relative to each other, and for doing speculative loads, is all internal to a single core. (Store buffer, or on Intel actually a Memory Order Buffer to track speculative loads as well as stores.)
This is why you should avoid even touching a shared variable if you can prove you don't have to. (Or use exponential backoff for cases where that's appropriate.) And it's why a CAS loop should spin read-only, waiting to see the value it's looking for, before even attempting a CAS, instead of hammering on the cache line with writes from failing lock cmpxchg attempts.
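A sketch of that read-only-spin pattern (a test-and-test-and-set spinlock; the variable name lock and the helper functions are hypothetical): spinning on a plain load lets the line stay in Shared state across cores, and the RFO-generating atomic RMW is only attempted once the load has seen the lock free.

#include <atomic>

std::atomic<int> lock{0};  // 0 = free, 1 = held

void spin_acquire() {
    for (;;) {
        // Read-only spin: other cores can keep the line in Shared state;
        // we generate no RFO traffic while someone else holds the lock.
        while (lock.load(std::memory_order_relaxed) != 0) {
            // optionally _mm_pause() / exponential backoff here
        }
        // The line looks free; now try the atomic RMW (this takes the
        // line into Modified state on this core).
        int expected = 0;
        if (lock.compare_exchange_weak(expected, 1, std::memory_order_acquire))
            return;  // acquired
        // CAS failed: another core grabbed it first; back to read-only spin.
    }
}

void spin_release() {
    lock.store(0, std::memory_order_release);
}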

Do memory fences slow down all CPU cores?

Somewhere, one time, I read about memory fences (barriers). It was said that a memory fence causes cache synchronisation between several CPU cores.
So my questions are:
1. How does the OS (or the CPU itself) know which cores need to be synchronised?
2. Does it synchronise the caches of all CPU cores?
3. If the answer to (2) is 'yes', and assuming that sync operations are not cheap, does using memory fences slow down cores that are not used by my application? If, for example, I have a single-threaded app running on my 8-core CPU, will it slow down all the other 7 cores, because some cache lines must be synced with all those cores?
4. Are the questions above totally ignorant, and do fences work completely differently?
The OS does not need to know, and each CPU core does what it's told: each core with a memory fence has to do certain operations before or after, and that's all. A core isn't synchronizing "with" other cores, it's synchronizing memory accesses relative to itself.
A fence in one core does not mean other cores are synchronized with it, so typically you would have two (or more) fences: one in the writer and one in the reader. A fence executed on one core does not need to impact any other cores. Of course there is no guarantee about this in general, just a hope that sane architectures will not unduly serialize multi-core execution.
Generally, memory fences are used for ordering local operations. Take for instance this pseudo-assembler code:
load A
load B
Many CPUs do not guarantee that B is indeed loaded after A; B may be in a cache line that was loaded into the cache earlier due to some other memory load. If you introduce a fence,
load A
readFence
load B
you have the guarantee that B is loaded from memory after A is. If B were in cache but older than A, it would be reloaded.
The situation with stores is the same the other way around. With
store A
store B
some CPUs may decide to write B to memory before they write A. Again, a fence between the two instructions may be needed to enforce ordering of the operations. Whether a memory fence is required always depends on the architecture.
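In C++ these two fragments map onto standalone fences; here is a rough sketch (A and B are hypothetical shared atomics), with atomic_thread_fence(acquire) playing the role of readFence and atomic_thread_fence(release) ordering the two stores:

#include <atomic>

std::atomic<int> A{0}, B{0};

void reader(int& a, int& b) {
    a = A.load(std::memory_order_relaxed);                // load A
    std::atomic_thread_fence(std::memory_order_acquire);  // readFence
    b = B.load(std::memory_order_relaxed);                // load B
}

void writer() {
    A.store(1, std::memory_order_relaxed);                // store A
    std::atomic_thread_fence(std::memory_order_release);  // writeFence
    B.store(1, std::memory_order_relaxed);                // store B
}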
Generally, you use memory fences in pairs:
If one thread wants to publish an object, it first constructs the object, then it performs a write fence before it writes the pointer to the object into a publicly known location.
The thread that wants to receive the object reads the pointer from the publicly known memory location, then it executes a read fence to ensure that all further reads based on that pointer actually give the values the publishing thread intended.
If either fence is missing, the reader may read the value of one or more data members of the object before it was initialized. Madness ensues.
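A minimal sketch of that publish/consume pairing in C++ (Widget and g_widget are hypothetical names); the release fence before the pointer store pairs with the acquire fence after the pointer load:

#include <atomic>

struct Widget { int a, b; };

std::atomic<Widget*> g_widget{nullptr};

void publisher() {
    Widget* w = new Widget{1, 2};                         // construct first
    std::atomic_thread_fence(std::memory_order_release);  // "write fence"
    g_widget.store(w, std::memory_order_relaxed);         // then publish
}

void consumer() {
    Widget* w = g_widget.load(std::memory_order_relaxed);
    if (w) {
        std::atomic_thread_fence(std::memory_order_acquire);  // "read fence"
        // All reads through w now see the fully constructed object.
        int sum = w->a + w->b;
        (void)sum;
    }
}

(In practice you would more often use a release store and an acquire load directly; the standalone fences are shown here to match the description above.)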
If you have say eight cores, and each core is doing different things, then these cores wouldn't be accessing the same memory, and wouldn't have the same memory in a cache line.
If core #1 uses a memory fence, but no other core accesses the memory that core #1 accesses, then the other cores won't be slowed down at all. However, if core #1 writes to location X, uses a memory fence, then core #2 tries to read the same location X, the memory fence will make sure that core #2 throws away the value of location X if it was in a cache, and reads the data back from RAM, getting the same data that core #1 has written. That takes time of course, but that's what the memory fence was there for.
(Instead of reading from RAM: if the cores share some cache, the data will be read from the cache.)

Concurrent stores seen in a consistent order

The Intel Architectures Software Developer's Manual, Aug. 2012, vol. 3A, sect. 8.2.2:
Any two stores are seen in a consistent order by processors other than
those performing the stores.
But can this be so?
The reason I ask is this: Consider a dual-core Intel i7 processor with HyperThreading. According to the Manual's vol. 1, Fig. 2-8, the i7's logical processors 0 and 1 share an L1/L2 cache, but its logical processors 2 and 3 share a different L1/L2 cache -- whereas all the logical processors share a single L3 cache. Suppose that logical processors 0 and 2 -- which do not share an L1/L2 cache -- write to the same memory location at about the same time, and that the writes go no deeper than L2 for the moment. Could not logical processors 1 and 3 (which are "processors other than those performing the stores") then see the "two stores in an inconsistent order"?
To achieve consistency, must not logical processors 0 and 2 issue SFENCE instructions, and logical processors 1 and 3 issue LFENCE instructions? Notwithstanding, the Manual seems to think otherwise, and its opinion in the matter does not have the look of a mere misprint. It looks deliberate. I'm confused.
UPDATE
In light of @Benoit's answer, a following question: The only purpose of L1 and L2 therefore is to speed up loads. It is L3 that speeds up stores. Is that right?
Intel CPUs (like all normal SMP systems) use (a variant of) MESI to ensure cache coherency for cached loads/stores. i.e. that all cores see the same view of memory through their caches.
A core can only write to a cache line after doing a Read For Ownership (RFO), getting the line in Exclusive state (no other caches have a valid copy of the line that could satisfy loads). Related: atomic RMW operations prevent other cores from doing anything to the target cache-line by locking it in Modified state for the duration of the operation.
To test for this kind of reordering, you need two other threads which both read both stores (in opposite order). Your proposed scenario has one core (reader2) reading an old value from memory (or L3, or its own private L2/L1) after another core (reader1) has read the new value of the same line stored by writer1. This is impossible: for reader1 to see writer1's store, writer1 must have already completed a RFO that invalidates all other copies of the cache line anywhere. And reading directly from DRAM without (effectively) snooping any write-back caches is not allowed. (Wikipedia's MESI article has diagrams.)
When a store commits (from the store buffer inside a core) to L1d cache, it becomes globally visible to all other cores at the same time. Before that, only the local core could "see" it (via store->load forwarding from the store buffer).
On a system where the only way for data to propagate from one core to another is through the global cache-coherency domain, MESI cache coherency alone guarantees that a single global store order exists, that all threads can agree on. x86's strong memory ordering rules make this global store order be some interleaving of program order, and we call this a Total Store Order memory model.
x86's strong memory model disallows LoadLoad reordering, so loads take their data from cache in program order without any barrier instructions in the reader threads (see footnote 1).
Loads actually snoop the local store buffer before taking data from the coherent cache. This is the reason the consistent order rule you quoted excludes the case where either store was done by the same core that's doing the loads. See Globally Invisible load instructions for more about where load data really comes from. But when the load addresses don't overlap with any recent stores, what I said above applies: load order is the order of sampling from the shared globally coherent cache domain.
The consistent order rule is a pretty weak requirement. Many non-x86 ISAs don't guarantee it on paper, but very few actual (non-x86) CPU designs have a mechanism by which one core can see store data from another core before it becomes globally visible to all cores. IBM POWER with SMT is one such example: Will two atomic writes to different locations in different threads always be seen in the same order by other threads? explains how forwarding between logical cores within one physical core can cause it. (This is like what you proposed, but within the store buffer rather than L2.)
x86 microarchitectures with HyperThreading (or AMD's SMT in Ryzen) obey that requirement by statically partitioning the store buffer between the logical cores on one physical core. (See What will be used for data exchange between threads are executing on one Core with HT?) So even within one physical core, a store has to commit to L1d (and become globally visible) before the other logical core can load the new data.
It's probably simpler to not have forwarding from retired-but-not-committed stores in one logical core to the other logical cores on the same physical core.
(The other requirements of x86's TSO memory model, like loads and stores appearing in program order, are harder. Modern x86 CPUs execute out of order, but use a Memory Order Buffer to maintain the illusion and have stores commit to L1d in program order. Loads can speculatively take values earlier than they're "supposed" to, and then check later. This is why Intel CPUs have "memory-order mis-speculation" pipeline nukes: What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?.)
As @BeeOnRope points out, there is an interaction between HT and maintaining the illusion of no LoadLoad reordering: normally a CPU can detect when another core touched a cache line after a load actually read it but before it was architecturally allowed to have read it: the load port can track invalidations of that cache line. But with HT, load ports also have to snoop the stores that the other hyperthread commits to L1d cache, because those won't invalidate the line. (Other mechanisms are possible, but it is a problem that CPU designers have to solve if they want high performance for "normal" loads.)
Footnote 1: On a weakly-ordered ISA, you'd use load-ordering barriers to control the order in which the 2 loads in each reader take their data from the globally coherent cache domain.
The writer threads are only doing a single store each so a fence is meaningless.
Because all cores share a single coherent cache domain, fences only need to control local reordering within a core. The store buffer in each core already tries to make stores globally visible as quickly as possible (while respecting the ordering rules of the ISA), so a barrier just makes the CPU wait before doing later operations.
x86 lfence has basically no memory-ordering use cases, and sfence is only useful with NT stores. Only mfence is useful for "normal" stuff, when one thread is writing something and then reading another location (http://preshing.com/20120515/memory-reordering-caught-in-the-act/): it blocks StoreLoad reordering and store-forwarding across the barrier.
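A sketch of the store-then-load pattern where that full barrier matters, in the spirit of the Preshing article above (thread functions and variable names are illustrative):

#include <atomic>

std::atomic<int> X{0}, Y{0};
int r1, r2;

void thread1() {
    X.store(1, std::memory_order_seq_cst);   // compiles to xchg (or mov + mfence)
    r1 = Y.load(std::memory_order_seq_cst);  // cannot be reordered above the store
}

void thread2() {
    Y.store(1, std::memory_order_seq_cst);
    r2 = X.load(std::memory_order_seq_cst);
}

// With the full barrier, at least one of r1, r2 is 1 after both threads run.
// With plain release stores and acquire loads (simple mov on x86), StoreLoad
// reordering via the store buffer makes r1 == r2 == 0 possible, and that
// outcome is observable on real hardware.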
In light of @Benoit's answer, a following question: The only purpose of L1 and L2 therefore is to speed up loads. It is L3 that speeds up stores. Is that right?
No, L1d and L2 are write-back caches: Which cache mapping technique is used in intel core i7 processor?. Repeated stores to the same line can be absorbed by L1d.
But Intel uses inclusive L3 caches, so how can L1d in one core have the only copy? L3 is actually tag-inclusive, which is all that's needed for the L3 tags to work as a snoop filter (instead of broadcasting RFO requests to every core). The actual data in dirty lines is private to the per-core inner caches, but L3 knows which core has the current data for a line (and thus where to send a request when another core wants to read a line that some core has in Modified state). Clean cache lines (in Shared state) are data-inclusive of L3, but writing to a cache line doesn't write through to L3.
I believe what the Intel documentation is saying is that the mechanics of the x86 chip will ensure that the other processors always see the writes in a consistent order.
So the other processors will only ever see one of the following results when reading that memory location:
value before either write (i.e. the read preceded both writes)
value after processor 0's write (i.e. as if processor 2 wrote first, and then processor 0 overwrote)
value after processor 2's write (i.e. as if processor 0 wrote first, and then processor 2 overwrote)
It won't be possible for processor 1 to see the value after processor 0's write, but at the same time have processor 3 see the value after processor 2's write (or vice versa).
Keep in mind that since intra-processor re-ordering is allowed (see section 8.2.3.5), processors 0 and 2 may see things differently.
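The guarantee being discussed here is exactly the four-thread IRIW (Independent Reads of Independent Writes) litmus test; a sketch in C++ (names are illustrative):

#include <atomic>

std::atomic<int> x{0}, y{0};
int r1, r2, r3, r4;

void writer0() { x.store(1, std::memory_order_seq_cst); }
void writer2() { y.store(1, std::memory_order_seq_cst); }

void reader1() {
    r1 = x.load(std::memory_order_seq_cst);
    r2 = y.load(std::memory_order_seq_cst);
}

void reader3() {
    r3 = y.load(std::memory_order_seq_cst);
    r4 = x.load(std::memory_order_seq_cst);
}

// The forbidden outcome is r1 == 1, r2 == 0, r3 == 1, r4 == 0: reader1 would
// have observed x's store before y's, while reader3 observed y's store before
// x's, i.e. the two stores seen in inconsistent orders by other processors.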
Ouch, this is a tough question! But I'll try...
the writes go no deeper than L2
Basically this is impossible, since Intel uses inclusive caches. Any data written to L1 will also be present in L2 and L3, unless you prevent caching by disabling the caches through CR0/MTRR.
That being said, I guess there are arbitration mechanisms: processors issue a request to write data, and an arbiter selects which request is granted from among the pending requests in each of the request queues. The selected requests are broadcast to the snoopers, and then to the caches. I suppose this would prevent races, enforcing the consistent order seen by processors other than the one performing the request.

memory access vs. memory copy

I am writing an application in C++ that needs to read (read-only) from the same memory many times, from many threads.
My question is from a performance point of view will it be better to copy the memory for each thread or give all threads the same pointer and have all of them access the same memory.
Thanks
There is no definitive answer from the little information you have given about your target system and so on, but on a normal PC, most likely the fastest will be to not copy.
One reason copying could be slow, is that it might result in cache misses if the data area is large. A normal PC would cache read-only access to the same data area very efficiently between threads, even if those threads happen to run on different cores.
One of the benefits explicitly listed by Intel for their approach to caching is "Allows more data-sharing opportunities for threads running on separate cores that are sharing cache". I.e. they encourage a practice where you don't have to program the threads to explicitly cache data, the CPU will do it for you.
Since you specifically mention many threads, I assume you have at least a multi-socket system. Typically, memory banks are associated with processor sockets. That is, one processor is "nearest" to its own memory banks and needs to communicate with the other processors' memory controllers to access data on other banks. (Processor here means the physical thing in the socket.)
When you allocate data, typically a first-touch (first-write) policy is used to determine on which memory banks your data will be allocated, which means the processor that first writes it can later access it faster than the other processors can.
So, at least for multiple processors (not just multiple cores), there should be a performance improvement from allocating a copy for every processor. Be sure to allocate/copy the data from every processor/thread, and not from a master thread (to exploit the first-write policy). You also need to make sure that threads will not migrate between processors, because then you are likely to lose the close connection to your memory.
I am not sure how copying data for every thread on a single processor would affect performance, but I guess not copying could improve the ability to share the contents of the higher-level caches that are shared between cores.
In any case, benchmark and decide based on actual measurements.
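As a rough benchmarking skeleton for that decision (a sketch, with arbitrary buffer sizes and thread counts): every thread sums the same read-only buffer, either through the shared object or through its own private copy, which also exploits first-touch placement on NUMA systems.

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

long sum_range(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0L);
}

int main() {
    const std::vector<int> shared(1 << 22, 1);  // ~16 MB of read-only data
    const unsigned n = std::max(1u, std::thread::hardware_concurrency());

    auto run = [&](bool copy) {
        auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> threads;
        for (unsigned i = 0; i < n; ++i)
            threads.emplace_back([&, copy] {
                if (copy) {
                    // Per-thread copy, allocated and first-written by this
                    // thread (first-touch places it on the local NUMA node).
                    std::vector<int> mine = shared;
                    volatile long s = sum_range(mine);
                    (void)s;
                } else {
                    // All threads read the same memory through one pointer.
                    volatile long s = sum_range(shared);
                    (void)s;
                }
            });
        for (auto& t : threads) t.join();
        return std::chrono::duration<double>(
                   std::chrono::steady_clock::now() - t0).count();
    };

    std::printf("shared: %.3f s   copy: %.3f s\n", run(false), run(true));
}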

Multiple threads and memory

I read in the Visual C++ documentation that it is safe for multiple threads to read from the same object.
My question is: how does an x86-64 CPU with multiple cores handle this?
Say you have a 1 MB block of memory. Are different threads literally able to read the exact same data at the same time or do cores read one word at a time with only one core allowed to read a particular word at a time?
If there are really no writes in your 1 MB block then yeah, each core can read from its own cached copy of the lines without any problem, as no writes are being committed and therefore no cache-coherency problems arise.
In a multicore architecture there is basically a cache for each core, and a "cache coherence protocol" which invalidates the cached copies on cores that do not have the most up-to-date information. I think most processors implement the MOESI protocol for cache coherency.
Cache coherency is a complex topic that has been largely discussed (I especially like some articles by Joe Duffy here and here). The discussion nonetheless revolves around the possible performance penalties of code that, while apparently lock-free, can slow down due to the cache coherence protocol kicking in to maintain coherency across the processors' caches; but as long as there are no writes, there is simply no coherency to maintain and thus no performance lost.
Just to clarify, as said in the comment, RAM can't be accessed simultaneously, since x86 and x64 architectures implement a single bus which is shared between cores, with SMP guaranteeing fair access to main memory. Nonetheless this situation is hidden by each core's cache, which allows each core to have its own copy of the data. For 1 MB of data it would be possible to incur some contention while a core updates its cache, but that would be negligible.
Some useful links:
Cache Coherence Protocols
Cache Coherence
Not only are different cores allowed to read from the same block of memory, they're allowed to write at the same time too. Whether it's "safe" or not is an entirely different story. You need to implement some sort of guard in your code (usually done with semaphores or derivatives of them) to protect against multiple cores fighting over the same block of memory in a way you don't specifically allow.
About the size of the memory a core reads at a time: that's usually a register's worth, 32 bits on a 32-bit CPU, 64 bits on a 64-bit CPU, and so on. Even streaming is done dword by dword (look at memcpy, for example).
About how concurrent multiple cores really are: every core uses a single bus to read and write memory, so accessing any resource (RAM, external devices, the floating-point unit) is one request at a time, one core at a time. The actual processing inside the cores is completely concurrent, however. DMA transfers also don't block the bus; concurrent transfers get queued and processed one at a time (I believe, not 100% sure on this).
edit: just to clarify, unlike the other reply here, I'm talking only about a no-cache scenario. Of course, if the memory gets cached, read-only access is completely concurrent.