Why does an atomic operation need exclusive cache access? - C++

In my understanding, an atomic operation (a C++ std::atomic operation, for example) first locks the cache line and then performs the operation. I have two questions: 1. If, say, atomic compare-and-swap is itself an atomic operation in hardware, why do we need to lock the cache line at all? 2. While the cache line is locked, how does another CPU wait for it? Does it use spin-lock-style waiting?
thanks

First of all: It depends!
1.) Whether a system locks a cache line has nothing to do with C++. It is a question of how the cache is organized and, in particular, of how the assembler instructions interact with the cache. That is a question of CPU architecture!
2.) How a compiler implements an atomic operation is implementation-dependent. Which assembler instructions are generated for an atomic operation can vary from compiler to compiler, and even between versions of the same compiler.
3.) As far as I know, fully locking a cache line is only the fallback solution when no "more clever" notification/synchronization of the other cores accessing the same cache line can be performed. But typically more than a single cache is involved. Think of a multi-level cache architecture: some caches are visible to only a single core! So the memory system has to do more than just lock a line; data may also have to be moved between cache levels when multiple cores are involved.
4.) From the C++ perspective, an atomic operation is not just a single operation. What really happens depends on the memory-ordering option chosen for the atomic operation. Because atomic operations are often used for inter-thread synchronization, a lot more work may be needed for a single atomic RMW operation! (A small sketch after this answer shows how the ordering option appears in code.) To get an idea of everything that has to be done, give https://www.cplusplusconcurrencyinaction.com/ a chance. It goes into the details of memory barriers and memory ordering.
5.) Locking a cache line (if that really happens) should not result in spin locks or anything similar on the other cores, because holding the cache line only takes a few clock cycles. Depending on the architecture, it simply "holds" the other core off for some cycles. The waiting core may even be able to do other work in parallel in a different pipe. But hey, that is very hardware-specific.
As already mentioned in a comment: take a look at https://fgiesen.wordpress.com/2014/08/18/atomics-and-contention/; it gives some hints about what can happen with cache coherency and locking.
There is much more than locking going on under the hood. I believe your question only scratches the surface!
For practical usage: don't worry about it! Compiler vendors and CPU architects have done a very good job. As a programmer you should measure your code's performance. From my perspective there is no need to think about what happens when cache lines are locked. Write good algorithms, think about good memory organization of your program's data, and keep the interrelationships between threads to a minimum.
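A minimal sketch of what point 4 means in code (assuming C++11 or later and a typical x86-64 or AArch64 target): the same fetch_add compiles to the same locked instruction on x86 regardless of the ordering argument, but on weakly ordered ISAs the default seq_cst version may need extra barrier work compared to the relaxed one.

#include <atomic>

std::atomic<int> counter{0};

// Plain counting; no ordering of surrounding memory operations is implied.
int bump_relaxed() {
    return counter.fetch_add(1, std::memory_order_relaxed);
}

// Default ordering (seq_cst): the RMW also participates in the single total
// order of seq_cst operations, which may cost extra barriers on weak ISAs.
int bump_seq_cst() {
    return counter.fetch_add(1);
}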

Related

Multithread share 2 variable problem with nonlock

I have a question about sharing variables between threads.
The two variables look like this:
struct Shared {
    void* a;
    uint64_t b;
};
Only one thread modifies the two variables; other threads frequently read them.
I want to change a and b at the same time, so that other threads see the change together (the new value of a and the new value of b).
Because many threads frequently read these two variables, I don't want to add a lock. Is there a way to combine the updates of a and b so that they behave like one atomic operation? For example, would a memory fence work? Thank you!
You're looking for a SeqLock.
It's ideal for this use-case, especially with infrequently-changed data. (e.g. like a time variable updated by a timer interrupt, read all over the place.)
See also:
Implementing 64 bit atomic counter with 32 bit atomics
Optimal way to pass a few variables between 2 threads pinning different CPUs
SeqLock advantages include perfect read-side scaling (readers don't need to get exclusive ownership of any cache lines, they're truly read-only not just lock-free), so any number of readers can read as often as they like with zero contention with each other. The downside is occasional retry, if a reader happens to try to read at just the wrong time. That's rare, and doesn't happen when the writer hasn't just written something.
So readers aren't quite wait-free, and in fact if the writer sleeps at just the wrong time, the readers are stuck retrying until it wakes up again! So overall the algorithm isn't even lock-free or obstruction-free. But the very common fast-path is just two extra reads from the same cache line as the data, and whatever is necessary for LoadLoad ordering in the reader. If there's been no write since the last read, the loads can all be L1d cache hits.
The only thing better is if you have efficient 16-byte atomic stores and loads, like Intel (but not yet AMD) CPUs with AVX, if your compiler / libatomic uses them for 16-byte loads of std::atomic<struct_16bytes> instead of x86-64 lock cmpxchg16b. (In practice most AMD CPUs are thought to have atomic 16-byte load/store as well, but only Intel has officially put it in their manuals that the AVX feature bit implies atomicity for aligned 128-bit load/store such as movaps, so compilers can safely start using it.)
Or AArch64, which guarantees 16-byte atomicity for plain stp / ldp as of ARMv8.4, I think.
But without those hardware features, and compiler+options to take advantage of them, 16-byte loads often get implemented as an atomic RMW, meaning each reader takes exclusive ownership of the cache line. That means reads contend with other reads, instead of the cache line staying in shared state, hot in the cache of every core that's reading it.
For example, would a memory fence work?
No, memory fences can't create atomicity (glue multiple operations into a larger transaction), only create ordering between operations.
Although you could say that the idea behind a SeqLock is to carefully order the writes and reads (with respect to the sequence variable) in order to detect torn reads and retry when that happens. So yes, barriers are important for that.
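To make that concrete, here is a minimal single-writer SeqLock sketch (illustrative only; the type and member names are made up, and the payload members are relaxed atomics so that an overlapping read and write is not a data race in C++ terms):

#include <atomic>
#include <cstdint>

struct SeqLockPair {
    std::atomic<uint32_t> seq{0};       // even = stable, odd = write in progress
    std::atomic<void*>    a{nullptr};
    std::atomic<uint64_t> b{0};

    // Single writer only.
    void write(void* new_a, uint64_t new_b) {
        uint32_t s = seq.load(std::memory_order_relaxed);
        seq.store(s + 1, std::memory_order_relaxed);          // odd: write in progress
        std::atomic_thread_fence(std::memory_order_release);  // payload stores can't move above this
        a.store(new_a, std::memory_order_relaxed);
        b.store(new_b, std::memory_order_relaxed);
        seq.store(s + 2, std::memory_order_release);          // even again: stable
    }

    // Any number of readers; retry if a write overlapped.
    void read(void*& out_a, uint64_t& out_b) {
        while (true) {
            uint32_t s1 = seq.load(std::memory_order_acquire);
            out_a = a.load(std::memory_order_relaxed);
            out_b = b.load(std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_acquire);  // order payload loads before re-check
            uint32_t s2 = seq.load(std::memory_order_relaxed);
            if (s1 == s2 && (s1 & 1) == 0)
                return;                                           // no write overlapped
        }
    }
};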

How "lock add" is implemented on x86 processors

I recently benchmarked std::atomic::fetch_add vs std::atomic::compare_exchange_strong on a 32 core Skylake Intel processor. Unsurprisingly (from the myths I've heard about fetch_add), fetch_add is almost an order of magnitude more scalable than compare_exchange_strong. Looking at the disassembly of the program, std::atomic::fetch_add is implemented with lock add and std::atomic::compare_exchange_strong is implemented with lock cmpxchg (https://godbolt.org/z/qfo4an).
What makes lock add so much faster on an Intel multi-core processor? From my understanding, the slowness of both instructions comes from contention on the cache line, and to execute both instructions with sequential consistency, the executing CPU has to pull the line into its own core in exclusive or modified mode (from MESI). How then does the processor optimize fetch_add internally?
This is a simplified version of the benchmarking code. There was no load+CAS loop for the compare_exchange_strong benchmark, just a compare_exchange_strong on the atomic with an input variable that kept getting varied by thread and iteration. So it was just a comparison of instruction throughput under contention from multiple CPUs.
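The full benchmark source isn't shown here; a hypothetical sketch of the kind of loops being described (names, thread count, and iteration count are made up) might look like:

#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

std::atomic<uint64_t> counter{0};

// fetch_add variant: each iteration is one locked RMW instruction.
void hammer_fetch_add(int iters) {
    for (int i = 0; i < iters; ++i)
        counter.fetch_add(1);
}

// CAS variant as described: one compare_exchange_strong per iteration, no
// load+retry loop, with an expected value that varies by thread and iteration
// (so most attempts fail against a stale value).
void hammer_cas(int iters, uint64_t tid) {
    for (int i = 0; i < iters; ++i) {
        uint64_t expected = tid + static_cast<uint64_t>(i);
        counter.compare_exchange_strong(expected, expected + 1);
    }
}

int main() {
    std::vector<std::thread> threads;
    for (uint64_t t = 0; t < 8; ++t)
        threads.emplace_back(hammer_fetch_add, 1000000);   // or: hammer_cas, 1000000, t
    for (auto& th : threads)
        th.join();
}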
lock add and lock cmpxchg both work essentially the same way, by holding onto that cache line in Modified state for the duration of the microcoded instruction. (Can num++ be atomic for 'int num'?). According to Agner Fog's instruction tables, lock cmpxchg and lock add decode to very similar numbers of uops from microcode. (Although lock add is slightly simpler.) Agner's throughput numbers are for the uncontended case, where the variable stays hot in the L1d cache of one core. And cache misses can cause uop replays, but I don't see any reason to expect a significant difference.
You claim you aren't doing load+CAS or using a retry loop. But is it possible you're only counting successful CAS or something? On x86, every CAS (including failures) has almost identical cost to lock add. (With all your threads hammering on the same atomic variable, you'll get lots of CAS failures from using a stale value for expected. This is not the usual use-case for CAS retry loops).
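For reference, the usual pattern when CAS is used in place of fetch_add is a load plus retry loop that refreshes expected on every failure (a generic sketch, not code from the question):

#include <atomic>
#include <cstdint>

// Equivalent of counter.fetch_add(1) built from compare_exchange_weak:
// on failure, compare_exchange_weak reloads `expected` with the current
// value, so the next attempt uses fresh data instead of a stale guess.
uint64_t fetch_add_via_cas(std::atomic<uint64_t>& counter) {
    uint64_t expected = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(expected, expected + 1)) {
        // retry with the updated expected value
    }
    return expected;  // the value before the increment, like fetch_add returns
}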
Or does your CAS version actually do a pure load from the atomic variable to get an expected value? That might be leading to memory-order mis-speculation.
You don't have complete code in the question so I have to guess, and couldn't try it on my desktop. You don't even have any perf-counter results or anything like that; there are lots of perf events for off-core memory access, and events like mem_inst_retired.lock_loads that could record number of locked instructions executed.
With lock add, every time a core gets ownership of the cache line, it succeeds at doing an increment. Cores are only waiting for HW arbitration of access to the line, never for another core to get the line and then fail to increment because it had a stale value.
It's plausible that HW arbitration could treat lock add and lock cmpxchg differently, e.g. perhaps letting a core hang onto the line for long enough to do a couple lock add instructions.
Is that what you mean?
Or maybe you have some major failure in microbenchmark methodology, like maybe not doing a warm-up loop to get CPU frequency up from idle before starting your timing? Or maybe some threads happen to finish early and let the other threads run with less contention?
to execute both instructions with sequential consistency, the executing CPU has to pull the line into its own core in exclusive or modified mode (from MESI).
No, to execute either instruction with any consistent, defined semantics that guarantee that concurrent executions on multiple CPUs do not lose increments, you would need that. That holds even if you were willing to drop "sequential consistency" (on these instructions), or even the usual acquire and release guarantees of reads and writes.
Any locked instruction effectively enforces mutual exclusion on the part of memory sufficient to guarantee atomicity. (Like a regular mutex but at the memory level.) Because no other core can access that memory range for the duration of the operation, the atomicity is trivially guaranteed.
What makes lock add so much faster on an Intel multi-core processor?
I would expect any tiny difference in timing to be critical in these cases, and doing the load plus compare (or compare-load plus compare-load ...) might change the timing enough to lose the chance, much like code using mutexes can have wildly different efficiency when there is heavy contention and a small change in access pattern changes which thread the mutex is granted to.

How fast is access to atomic variables in C++

My question is: how fast is access to atomic variables in C++ when using the C++0x atomic<> class? What happens at the cache level? Say one thread is just reading it; would it need to go all the way down to RAM, or can it just read from the cache of the core on which it is executing? Assume the architecture is x86.
I am especially interested in knowing whether, if one thread is just reading it while no other thread is writing at that time, the penalty is the same as for reading a normal variable. How are atomic variables accessed? Does each read implicitly involve a write as well, as in compare-and-swap? Are atomic variables implemented using compare-and-swap?
If you want raw numbers, Agner Fog's data listings from his optimization manuals should be of use. Also, Intel's manuals have a few sections detailing the latencies of memory reads/writes on multicore systems, which should include details on the slow-downs caused by the bus locking needed for atomic writes.
The answer is not as simple as you perhaps expect. It depends on the exact CPU model, and it depends on the circumstances as well. The worst case is when you need to perform a read-modify-write operation on a variable and there is a conflict (what exactly counts as a conflict is again CPU-model dependent, but most often it is another CPU accessing the same cache line).
See also .NET or Windows Synchronization Primitives Performance Specifications
Atomics use special architecture support to get atomicity without forcing all reads/writes to go all the way to main memory. Basically, each core is allowed to probe the caches of other cores, so they find out about the results of other threads' operations that way.
The exact performance depends on the architecture. On x86, MANY operations were already atomic to start with, so they are free. I've seen numbers anywhere from 10 to 100 cycles, depending on the architecture and operation. For perspective, a read that has to go all the way to main memory costs on the order of a few hundred cycles, so atomics are all MUCH faster than going straight to memory on nearly all platforms.
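As a rough illustration of the "free on x86" part (assuming x86-64 and a typical compiler; check your own compiler's output, e.g. on the compiler explorer): uncontended loads and relaxed stores of a std::atomic<int> compile to plain mov instructions that can hit L1d cache, while a seq_cst store or a read-modify-write needs a locked instruction.

#include <atomic>

std::atomic<int> x{0};

int read_it() {                      // typically just: mov eax, [x]  (an ordinary cached load)
    return x.load(std::memory_order_acquire);
}

void write_relaxed(int v) {          // typically just: mov [x], v
    x.store(v, std::memory_order_relaxed);
}

void write_seq_cst(int v) {          // typically: xchg [x], v  (implicitly locked, drains the store buffer)
    x.store(v, std::memory_order_seq_cst);
}

int increment() {                    // typically: lock xadd [x], 1
    return x.fetch_add(1);
}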

Compare and swap C++0x

From the C++0x proposal on C++ Atomic Types and Operations:
29.1 Order and Consistency [atomics.order]
Add a new sub-clause with the following paragraphs.
The enumeration memory_order specifies the detailed regular (non-atomic) memory synchronization order as defined in [the new section added by N2334 or its adopted successor] and may provide for operation ordering. Its enumerated values and their meanings are as follows.
memory_order_relaxed
The operation does not order memory.
memory_order_release
Performs a release operation on the affected memory locations, thus making regular memory writes visible to other threads through the atomic variable to which it is applied.
memory_order_acquire
Performs an acquire operation on the affected memory locations, thus making regular memory writes in other threads released through the atomic variable to which it is applied, visible to the current thread.
memory_order_acq_rel
The operation has both acquire and release semantics.
memory_order_seq_cst
The operation has both acquire and release semantics, and in addition, has sequentially-consistent operation ordering.
Lower in the proposal:
bool A::compare_swap( C& expected, C desired,
                      memory_order success, memory_order failure ) volatile
where one can specify memory order for the CAS.
My understanding is that “memory_order_acq_rel” will only necessarily synchronize those memory locations which are needed for the operation, while other memory locations may remain unsynchronized (it will not behave as a memory fence).
Now, my question is - if I choose “memory_order_acq_rel” and apply compare_swap to integral types, for instance, integers, how is this typically translated into machine code on modern consumer processors such as a multicore Intel i7? What about the other commonly used architectures (x64, SPARC, ppc, arm)?
In particular (assuming a concrete compiler, say gcc):
How to compare-and-swap an integer location with the above operation?
What instruction sequence will such a code produce?
Is the operation lock-free on i7?
Will such an operation run a full cache coherence protocol, synchronizing caches of different processor cores as if it were a memory fence on i7? Or will it just synchronize the memory locations needed by this operation?
Related to previous question - is there any performance advantage to using acq_rel semantics on i7? What about the other architectures?
Thanks for all the answers.
The answer here is not trivial. Exactly what happens, and what is meant, depends on many things. For a basic understanding of cache coherence and memory, perhaps my recent blog entries might be helpful:
CPU Reordering – What is actually being reordered?
CPU Memory – Why do I need a mutex?
But that aside, let me try to answer a few questions. First off, the function below is being very hopeful about what is supported: very fine-grained control over exactly how strong a memory-order guarantee you get. That's reasonable for compile-time reordering, but often not for runtime barriers.
compare_swap( C& expected, C desired,
              memory_order success, memory_order failure )
Architectures won't all be able to implement this exactly as you requested; many will have to strengthen it to something stronger that they can implement. When you specify a memory_order you are specifying how reordering may work. To use Intel's terms, you are specifying what type of fence you want; there are three of them: the full fence, the load fence, and the store fence. (But on x86, load fence and store fence are only useful with weakly-ordered instructions like NT stores; atomics don't use them. Regular load/store give you everything except that stores can appear after later loads.) Just because you want a particular fence on that operation doesn't mean it is supported, in which case I'd hope it always falls back to a full fence. (See Preshing's article on memory barriers)
An x86 (including x64) compiler will likely use the LOCK CMPXCHG instruction to implement the CAS, regardless of memory ordering. This implies a full barrier; x86 doesn't have a way to make a read-modify-write operation atomic without a lock prefix, which is also a full barrier. Pure-store and pure-load can be atomic "on their own", with many ISAs needing barriers for anything above mo_relaxed, but x86 does acq_rel "for free" in asm.
This instruction is lock-free, although all cores trying to CAS the same location will contend for access to it so you could argue it's not really wait-free. (Algorithms that use it might not be lock-free, but the operation itself is wait-free, see wikipedia's non-blocking algorithm article). On non-x86 with LL/SC instead of locked instructions, C++11 compare_exchange_weak is normally wait-free but compare_exchange_strong requires a retry loop in case of spurious failure.
Now that C++11 has existed for years, you can look at the asm output for various architectures on the Godbolt compiler explorer.
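For example, a minimal function you could paste into the compiler explorer (the name is just illustrative) to see the instruction sequence for an acq_rel CAS on an int:

#include <atomic>

// On x86-64 this typically compiles to lock cmpxchg regardless of whether
// acq_rel or seq_cst was requested; on AArch64 you'd see a ldaxr/stlxr retry
// loop, or a single casal instruction with ARMv8.1 atomics.
bool cas_int(std::atomic<int>& obj, int& expected, int desired) {
    return obj.compare_exchange_strong(expected, desired,
                                       std::memory_order_acq_rel,
                                       std::memory_order_acquire);
}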
In terms of memory sync you need to understand how cache coherence works (my blog may help a bit). New CPUs use a ccNUMA architecture (previously SMP). Essentially the "view" of memory never gets out of sync. The fences used in the code don't actually force any flushing of caches per se; they only force the store buffer to commit in-flight stores to cache before later loads.
If two cores both have the same memory location cached in a cache line, a store by one core will get exclusive ownership of the cache line (invalidating all other copies) and mark its own copy as dirty. A very simple explanation of a very complex process.
To answer your last question you should always use the memory semantics that you logically need to be correct. Most architectures won't support all the combinations you use in your program. However, in many cases you'll get great optimizations, especially in cases where the order you requested is guaranteed without a fence (which is quite common).
-- Answers to some comments:
You have to distinguish between what it means to execute a write instruction and to write to a memory location. This is what I attempt to explain in my blog post. By the time the "0" is committed to 0x100, all cores see that zero. Writing an aligned integer is also atomic; even without a lock, when you write to a location, all cores will immediately see that value if they go to use it.
The trouble is that to use the value you have likely loaded it into a register first, and any changes to the location after that obviously won't touch the register. This is why one needs mutexes or atomic<T> despite cache-coherent memory: the compiler is allowed to keep plain variable values in private registers. (In C++11, that's because a data race on non-atomic variables is Undefined Behaviour.)
As to contradictory claims, generally you'll see all sorts of claims. Whether they are contradictory comes right down to exactly what "see", "load", and "execute" mean in the context. If you write "1" to 0x100, does that mean you executed the write instruction, or did the CPU actually commit that value? The difference created by the store buffer is one major cause of reordering (the only one x86 allows). The CPU can delay writing the "1", but you can be sure that the moment it does finally commit that "1", all cores see it. The fences control this ordering by making the thread wait until a store commits before doing later operations.
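To illustrate the register point with a sketch (hypothetical flag names, not from the original discussion):

#include <atomic>

bool plain_ready = false;               // unsynchronized flag: racy
std::atomic<bool> atomic_ready{false};  // forces a real load on every check

void spin_plain() {
    // The compiler may load plain_ready into a register once and turn this
    // into an infinite loop; the data race makes the behaviour undefined.
    while (!plain_ready) { }
}

void spin_atomic() {
    // Every iteration performs an actual load; once the writer's store
    // commits, cache coherency makes the new value visible and the loop exits.
    while (!atomic_ready.load(std::memory_order_acquire)) { }
}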
Your whole worldview seems off base: your question insinuates that cache consistency is controlled by memory orders at the C++ level and fences or atomic operations at the CPU level.
But cache consistency is one of the most important invariants of the physical architecture, and it's provided at all times by the memory system consisting of the interconnection of all CPUs and the RAM. You can never beat it from code running on a CPU, or even observe the details of its operation. Of course, by observing RAM directly and running code elsewhere you might see stale data at some level of memory: by definition the RAM doesn't have the newest value of all memory locations.
But code running on a CPU can't access DRAM directly, only through the memory hierarchy which includes caches that communicate with each other to maintain coherency of this shared view of memory. (Typically with MESI). Even on a single core, a write-back cache lets DRAM values be stale, which can be an issue for non-cache-coherent DMA but not for reading/writing memory from a CPU.
So the issue exists only for external devices, and only ones that do non-coherent DMA. (DMA is cache-coherent on modern x86 CPUs; the memory controller being built-in to the CPU makes this possible).
Will such an operation run a full cache coherence protocol, synchronizing caches of different processor cores as if it were a memory fence on i7?
They are already synchronized. See Does a memory barrier ensure that the cache coherence has been completed? - memory barriers only do local things inside the core running the barrier, like flush the store buffer.
Or will it just synchronize the memory locations needed by this operation?
An atomic operation applies to exactly one memory location. What other locations do you have in mind?
On a weakly-ordered CPU, a memory_order_relaxed atomic increment could avoid making earlier loads/stores visible before that increment. But x86's strongly-ordered memory model doesn't allow that.

Overhead of a Memory Barrier / Fence

I'm currently writing C++ code and use a lot of memory barriers / fences in my code. I know that a memory barrier tells the compiler and the hardware not to reorder writes/reads across it. But I don't know how costly this operation is for the processor at runtime.
My question is: what is the runtime overhead of such a barrier? I didn't find any useful answer with Google...
Is the overhead negligible? Or does heavy usage of memory barriers lead to serious performance problems?
Best regards.
Compared to arithmetic and "normal" instructions I understand these to be very costly, but I don't have numbers to back up that statement. I like jalf's answer, which describes the effects of the instructions, and would like to add a bit.
There are in general a few different types of barriers, so understanding the differences could be helpful. A barrier like the one jalf mentioned is required, for example, in a mutex implementation before clearing the lock word (lwsync on ppc, or st4.rel on ia64, for example). All reads and writes must be complete, and only instructions later in the pipeline that have no memory access and no dependencies on in-progress memory operations can be executed.
Another type of barrier is the sort you'd use in a mutex implementation when acquiring a lock (examples: isync on ppc, or instr.acq on ia64). This has an effect on future instructions, so if a non-dependent load has been prefetched it must be discarded. Example:
if ( pSharedMem->atomic.bit_is_set() ) // use a bit to flag that somethingElse is "ready"
{
    foo( pSharedMem->somethingElse ) ;
}
Without an acquire barrier (borrowing ia64 lingo), your program may have unexpected results if somethingElse made it into a register before the check of the flag bit is complete.
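In C++11 terms, a rough equivalent of this pattern (a sketch with made-up member names, assuming the writer publishes somethingElse before setting the flag) would be:

#include <atomic>

struct SharedMem {
    std::atomic<bool> ready{false};
    int somethingElse = 0;
};

void foo(int);

void writer(SharedMem* pSharedMem) {
    pSharedMem->somethingElse = 42;                            // publish the data first
    pSharedMem->ready.store(true, std::memory_order_release);  // then set the flag
}

void reader(SharedMem* pSharedMem) {
    if (pSharedMem->ready.load(std::memory_order_acquire))     // acquire pairs with the release store
        foo(pSharedMem->somethingElse);                        // guaranteed to see 42
}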
There is a third type of barrier, generally less used, which is required to enforce store-load ordering. Examples of instructions that enforce such ordering are sync on ppc (heavyweight sync), mf on ia64, and membar #StoreLoad on sparc (required even for TSO).
Using ia64-like pseudocode to illustrate, suppose one had:
st4.rel
ld4.acq
without an mf in between, one has no guarantee that the load follows the store. You know that loads and stores preceding the st4.rel are done before that store or the "subsequent" load, but that load or other future loads (and perhaps stores, if non-dependent?) could sneak in, completing earlier, since nothing otherwise prevents that.
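The same store-load reordering hazard is what the classic Dekker-style pattern runs into; here is a C++ sketch (made-up flag names) of where a full StoreLoad barrier is needed:

#include <atomic>

std::atomic<bool> flag0{false}, flag1{false};

// Each thread raises its own flag, then checks the other's. Without StoreLoad
// ordering, the load can complete before the store is globally visible and
// BOTH threads can observe false, i.e. both "enter" at once.
bool thread0_wants_in() {
    flag0.store(true, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // the StoreLoad barrier
    return !flag1.load(std::memory_order_relaxed);
}

bool thread1_wants_in() {
    flag1.store(true, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return !flag0.load(std::memory_order_relaxed);
}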
Because mutex implementations very likely use only acquire and release barriers in their implementations, I'd expect that an observable effect of this is that memory access following a lock release may actually sometimes occur while "still in the critical section".
Try thinking about what the instruction does. It doesn't make the CPU do anything complicated in terms of logic, but it forces it to wait until all reads and writes have been committed to main memory. So the cost really depends on the cost of accessing main memory (and the number of outstanding reads/writes).
Accessing main memory is generally pretty expensive (10-200 clock cycles), but in a sense, that work would have to be done without the barrier as well, it could just be hidden by executing some other instructions simultaneously so you didn't feel the cost so much.
It also limits the CPU's (and compiler's) ability to reschedule instructions, so there may be an indirect cost as well, in that nearby instructions can't be interleaved in a way that might otherwise yield a more efficient execution schedule.